# 1. Introduction

Preparing the text data is an essential first step in this sentiment analysis project. In this section, we first load the modules needed for the analysis and introduce the NLTK movie reviews corpus. We then store all the data in a Python list. Finally, we briefly discuss how to remove punctuation, contractions, and similar noise.

# 2. Modules Preparation & Movie Reviews Corpus

The Python modules we are going to use are listed below:

In [4]:
import nltk
import pickle
import random
import re
import gensim
import tensorflow as tf
import numpy as np
import string
import pandas as pd
import time

# Import the data set: movie reviews
from nltk.corpus import movie_reviews
from nltk.tag import pos_tag
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from nltk.classify import ClassifierI

# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV is imported from sklearn.model_selection below
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (f1_score, precision_score, recall_score, accuracy_score,
                             classification_report, confusion_matrix, roc_curve, auc)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
from sklearn.preprocessing import scale

from statistics import mode
from scipy.stats import uniform as sp_rand
from gensim.models import Word2Vec
import matplotlib.pyplot as plt

The NLTK movie reviews corpus contains 2000 movie reviews, each stored in its own text file. To look at the raw data directly on your PC, open the **nltk_data** folder (on Windows it is typically found under **appdata**), go to the corpora folder, and open movie_reviews to see the raw text files.

Half of the reviews in this corpus are positive and the other half are negative. You can inspect the corpus by running the following code.

In [5]:
movie_reviews.categories()

['neg', 'pos']

You can also get the text file names with the fileids method:

In [7]:
movie_reviews.fileids('pos')[:3]

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt']
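The same two methods give a quick way to check the class balance described earlier. The sketch below is illustrative rather than part of the original notebook: `category_counts` and `TinyCorpus` are hypothetical names, and `TinyCorpus` is only a small stand-in that mimics the `categories()` and `fileids()` methods so the example runs without the corpus installed.

```python
def category_counts(corpus):
    """Count files per category for any reader with categories() and fileids()."""
    return {cat: len(corpus.fileids(cat)) for cat in corpus.categories()}


class TinyCorpus:
    """Hypothetical stand-in exposing the two reader methods used above."""

    _files = {
        "neg": ["neg/a.txt", "neg/b.txt"],
        "pos": ["pos/c.txt", "pos/d.txt"],
    }

    def categories(self):
        return sorted(self._files)

    def fileids(self, category):
        return list(self._files[category])


counts = category_counts(TinyCorpus())
print(counts)  # {'neg': 2, 'pos': 2}
```

With the real data installed, calling `category_counts(movie_reviews)` should report the same number of files for 'neg' and 'pos', given that the 2000 reviews are split evenly.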

For instance, to access all the words in a text file, pass its file name to the words method:

In [15]:
movie_reviews.words('pos/cv000_29590.txt')

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

# 3. Load the Data into Python

One thing to remember is that the documents are collected one category at a time, so we should shuffle them randomly to remove this ordering bias.

In [16]:
# Each document is a (words, category) pair
documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((movie_reviews.words(fileid), category))

# Shuffle in place so the two categories are mixed
random.shuffle(documents)
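Because random.shuffle reorders the list in place using the global random state, the document order differs on every run. If reproducible results are wanted, one option is to shuffle with a seeded generator. The following is a minimal sketch under that assumption; `shuffled_copy`, the toy documents, and the seed value 42 are illustrative and do not come from the original notebook.

```python
import random

# Toy stand-in for the (words, category) pairs built above; illustrative only.
toy_documents = [(["bad", "plot"], "neg")] * 5 + [(["great", "cast"], "pos")] * 5


def shuffled_copy(docs, seed=42):
    """Return a shuffled copy of docs; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    result = list(docs)        # copy, so the caller's list keeps its order
    rng.shuffle(result)
    return result


first = shuffled_copy(toy_documents)
second = shuffled_copy(toy_documents)
print(first == second)  # True: identical seed, identical order
```

Returning a copy instead of shuffling in place keeps the original category-ordered list available, which can be handy for debugging.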