# 1. Introduction

Text data preparation is a crucial step in this sentiment analysis project. In this section, we first load the modules needed for the analysis and introduce the NLTK movie reviews corpus. We then store the data in Python lists. Finally, we briefly discuss how to remove stopwords and punctuation, expand contractions, and convert text to lowercase.

# 2. Modules Preparation & Movie Reviews Corpora

The Python modules we are going to use are listed below:

In [18]:
import nltk
import pickle
import random
import re
import gensim
import tensorflow as tf
import numpy as np
import string
import pandas as pd
import time

# Import the data set: movie reviews
from nltk.corpus import movie_reviews
from nltk.tag import pos_tag
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from nltk.classify import ClassifierI

# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV is imported from sklearn.model_selection below
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, classification_report,confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
from sklearn.preprocessing import scale
import sklearn.gaussian_process as gp

from statistics import mode
from scipy.stats import uniform as sp_rand
from gensim.models import Word2Vec
import matplotlib.pyplot as plt

The movie reviews corpus in NLTK contains 2000 movie reviews, each stored in its own text file. If you want to inspect the raw data directly on your PC, type **appdata** in the file explorer's path bar and open the **nltk_data** folder. Then open the corpora folder and, inside it, the movie_reviews folder to see the raw text files.

Half of the reviews in this corpus are positive and the other half are negative. You can also get details of the corpus by running the following code.

In [2]:
movie_reviews.categories()

['neg', 'pos']

You can also get the text file names by using the **fileids** method:

In [3]:
movie_reviews.fileids('pos')[:3]

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt']

Then, for instance, to access all the words in a text file by its file name, use the code below:

In [4]:
movie_reviews.words('pos/cv000_29590.txt')

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

# 3. Input the Data to Python

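With these methods in hand, we can collect all the reviews into a list of documents, each paired with its category. One thing to remember is to shuffle the documents randomly: the loop below appends them category by category, so an unshuffled split would be biased towards one class.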

In [5]:
documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((movie_reviews.words(fileid), category))
        
random.shuffle(documents)

Now that we have all the words, we need to create features for the analysis that follows. The features used here are the most frequent words in the movie reviews (we will try other features later to see how the accuracy of the classifiers changes). **FreqDist** counts how often each word occurs, so we can pick, for instance, the 5000 most frequent words and use them as features.

In [6]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())  # convert all the words to lowercase
    
all_words = nltk.FreqDist(all_words)

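# Use the 5000 most frequent words as features.
# most_common() sorts by count; in NLTK 3, keys() is not ordered by frequency.
word_features = [w for (w, count) in all_words.most_common(5000)]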
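In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the difference between taking the first keys and taking the most frequent words can be illustrated with a plain `Counter`. The sketch below uses a small made-up token list instead of the corpus, so it runs without downloading any data.

```python
from collections import Counter

# Toy token list standing in for movie_reviews.words() (hypothetical data).
tokens = ["plot", "good", "bad", "good", "movie", "good", "bad"]

counts = Counter(tokens)

# keys() follows first-seen order, not frequency.
first_seen = list(counts.keys())[:2]

# most_common(n) returns the n highest-count (word, count) pairs.
top_two = [word for word, _ in counts.most_common(2)]

print(first_seen)  # ['plot', 'good']
print(top_two)     # ['good', 'bad']
```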

So far, we have created the documents, which consist of all the movie review files, and the word features. The next step is to check which feature words a specific file contains. One option is to write a function that records, for each word in the feature list, whether it also appears in a movie review text file.

In [7]:
def find_features(text):
    words = set(text)
    featuresets = {}
    for w in word_features:
        featuresets[w] = (w in words)
        
    return featuresets
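To make the shape of the result concrete, here is a minimal sketch of the same idea on a toy input. The three-word feature list and the sample review are made up for illustration; the notebook builds `word_features` from the corpus instead.

```python
# Hypothetical three-word feature list, for illustration only.
word_features = ["good", "bad", "boring"]

def find_features(text):
    words = set(text)  # a set makes each membership test cheap
    return {w: (w in words) for w in word_features}

review = ["a", "good", "film", "but", "a", "bad", "ending"]
features = find_features(review)
print(features)  # {'good': True, 'bad': True, 'boring': False}
```

Each review therefore becomes a dictionary with one boolean per feature word, which is the input format NLTK classifiers expect.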

We can then apply this function to each text file in the documents and finally create the training set, the validation set (a held-out data set used to compare the performance of different models) and the testing set.

In [8]:
featuresets = [(find_features(rev),category) for (rev, category) in documents]

training_set = featuresets[:1800]
validation_set = featuresets[1800:1900]
testing_set = featuresets[1900:]
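Because the documents were shuffled with the unseeded global `random.shuffle`, the three sets differ from run to run. If a repeatable split is wanted, one option is a seeded `random.Random` instance. The sketch below uses placeholder feature sets, and the seed value 42 is an arbitrary choice.

```python
import random

# Placeholder (features, label) pairs standing in for the 2000 real feature sets.
featuresets = [({"id": i}, "pos" if i % 2 else "neg") for i in range(2000)]

rng = random.Random(42)   # fixed seed (arbitrary) so the split is repeatable
rng.shuffle(featuresets)

training_set = featuresets[:1800]
validation_set = featuresets[1800:1900]
testing_set = featuresets[1900:]

print(len(training_set), len(validation_set), len(testing_set))  # 1800 100 100
```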

So far, we have used the 5000 most frequent words as word features. However, this may give misleading results, because many nouns such as **movie** do not indicate a viewer's attitude towards a movie. A more reliable approach is to use adjectives and adverbs as features, since these words are closer to a viewer's opinion. To do this, we use the **pos_tag** function to tag each word's part of speech and then select the adjectives and adverbs.

In [9]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())  # convert all the words to lowercase

pos = pos_tag(all_words)

adj_adv = []

for w in pos:
    if w[1][0] == 'J' or w[1][0] == 'R':
        adj_adv.append(w[0])
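The filter above keeps every tag that starts with `J` or `R`. In the Penn Treebank tag set, `R` covers the adverb tags (`RB`, `RBR`, `RBS`) but also the particle tag `RP`. The sketch below shows the difference on a few hand-written (word, tag) pairs that stand in for real `pos_tag` output.

```python
# Hand-written (word, tag) pairs in Penn Treebank style (illustrative, not tagger output).
tagged = [("the", "DT"), ("film", "NN"), ("is", "VBZ"),
          ("really", "RB"), ("good", "JJ"), ("better", "JJR"), ("up", "RP")]

# Same rule as the notebook: first letter of the tag is J or R.
adj_adv = [word for word, tag in tagged if tag.startswith(("J", "R"))]
print(adj_adv)  # ['really', 'good', 'better', 'up']

# Stricter variant: only adjective (JJ*) and adverb (RB*) tags, so particles are excluded.
adj_adv_strict = [word for word, tag in tagged if tag.startswith(("JJ", "RB"))]
print(adj_adv_strict)  # ['really', 'good', 'better']
```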

Then we choose the 5000 most frequent adjectives and adverbs as word features and continue with the classification task.

In [10]:
adj_adv = nltk.FreqDist(adj_adv)

# Use the 5000 most frequent adjectives and adverbs as features
word_features = [w for (w, count) in adj_adv.most_common(5000)]

With the new word_features, we construct the training, validation and testing sets again:

In [11]:
featuresets = [(find_features(rev),category) for (rev, category) in documents]

training_set = featuresets[:1800]
validation_set = featuresets[1800:1900]
testing_set = featuresets[1900:]

# 4. Clean the Text Data

If we want our model to achieve better classification results, we should clean the text data first. In general, we can improve the quality of our text by doing the following:

1. eliminate stopwords
2. eliminate punctuation
3. expand contractions (He's -> He is)
4. convert to lowercase


## 4.1 Eliminate the stopwords

Stopwords like 'to' carry little meaning for most natural language processing tasks. Hence, we can delete them first to get a much cleaner dataset.

In [12]:
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
words=["Don't", 'hesitate','to','ask','questions']
[word for word in words if word not in stops]

["Don't", 'hesitate', 'ask', 'questions']
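Note that the membership test above is case-sensitive, while the NLTK English stopword list is stored in lowercase, so a capitalized token such as 'To' would not be filtered. A minimal sketch of a case-insensitive filter follows. It uses a small hand-made stop set so that it runs without NLTK data; the real list is much longer.

```python
# Small hand-made stop list for illustration only.
stops = {"to", "the", "a", "is"}

words = ["To", "be", "is", "To", "do"]

# Case-sensitive comparison misses the capitalized stopword.
case_sensitive = [w for w in words if w not in stops]

# Lowercasing each token before the lookup catches it.
case_insensitive = [w for w in words if w.lower() not in stops]

print(case_sensitive)    # ['To', 'be', 'To', 'do']
print(case_insensitive)  # ['be', 'do']
```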

NLTK is convenient because it ships stopword lists for many languages, which can be used directly in the analysis that follows.

In [13]:
stopwords.fileids()

['danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'kazakh',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish',
 'turkish']

We can then calculate the proportion of stopwords in a given text.

In [14]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stops = stopwords.words('english')

text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty.","I am very happy"]
words = []
for sentence in text:
    for word in word_tokenize(sentence):
        words.append(word)

lower_words = [word.lower() for word in words]
        
text_f = [word for word in lower_words if word in stops]

proportion = len(text_f)/len(lower_words)
print(proportion)

0.4583333333333333


## 4.2 Eliminate punctuation

Punctuation is also of little use in our analysis. To delete all punctuation, we use the **re** module and regular expressions.

In [15]:
import re
import string
from nltk.tokenize import word_tokenize

text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
tokenized_docs = [word_tokenize(doc) for doc in text]
# [...] is a character class: it matches any single character from the given set
x = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []

for review in tokenized_docs:
    new_review = []
    for token in review:
        # r"..." is a raw string; u"..." is a Unicode string (the default str type in Python 3)
        new_token = x.sub(u'', token)
        if new_token != u'':  # skip tokens that consisted only of punctuation
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)

print(tokenized_docs_no_punctuation)

[['It', 'is', 'a', 'pleasant', 'evening'], ['Guests', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was', 'tasty']]
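As an alternative to a regular expression, the standard library's `str.translate` can delete the same characters. The sketch below is one possible variant, not the approach used in the rest of this notebook; the token list is written by hand so that no tokenizer is needed.

```python
import string

# Translation table that deletes every character in string.punctuation.
table = str.maketrans("", "", string.punctuation)

tokens = ["Guests", ",", "who", "came", "from", "US", ".", "well-known"]
cleaned = [t.translate(table) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were pure punctuation

print(cleaned)  # ['Guests', 'who', 'came', 'from', 'US', 'wellknown']
```

Both approaches also strip punctuation inside a token, so a hyphenated word such as 'well-known' is joined into a single string.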


## 4.3 Contraction

English has many contractions, such as she's and we're. We can also use regular expressions to expand them into their full forms.

In [16]:
import re

replacement_patterns = [
    # Irregular contractions come first so that the generic patterns below do not split them incorrectly
    (r"won\'t", "will not"),
    (r"can\'t", "cannot"),
    (r"I\'m", "I am"),
    (r"ain\'t", "is not"),
    # \g<1> is a back-reference to the first captured group, here (\w+),
    # i.e. the word before the apostrophe. The replacements are raw strings
    # so that Python does not treat \g as an invalid string escape.
    (r"(\w+)\'ll", r"\g<1> will"),
    (r"(\w+)n\'t", r"\g<1> not"),
    (r"(\w+)\'ve", r"\g<1> have"),
    (r"(\w+)\'s", r"\g<1> is"),
    (r"(\w+)\'re", r"\g<1> are"),
    (r"(\w+)\'d", r"\g<1> would")
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile the patterns passed in by the caller
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            # subn returns the new string and the number of replacements made
            (s, count) = pattern.subn(repl, s)
        return s

replacer = RegexpReplacer()
replacer.replace("I'm Jack. He's Jim. He'll participate in our team! He've played football for ten years~")

'I am Jack. He is Jim. He will participate in our team! He have played football for ten years~'
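The order of the patterns matters: irregular contractions such as won't have to be handled before the generic n't rule, otherwise the generic rule splits them first. The sketch below demonstrates this with two patterns; the helper name `expand` is made up for this example.

```python
import re

specific_first = [(r"won't", "will not"), (r"(\w+)n't", r"\g<1> not")]
generic_first = list(reversed(specific_first))

def expand(text, patterns):
    # Apply each (regex, replacement) pair in the order given.
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

print(expand("I won't go", specific_first))  # I will not go
print(expand("I won't go", generic_first))   # I wo not go
```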

## 4.4 Conversion to Lowercase

In [17]:
text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty.","I am very happy"]
words = []
for sentence in text:
    for word in word_tokenize(sentence):
        words.append(word)

lower_words = [word.lower() for word in words]
print(lower_words)

['it', 'is', 'a', 'pleasant', 'evening', '.', 'guests', ',', 'who', 'came', 'from', 'us', 'arrived', 'at', 'the', 'venue', 'food', 'was', 'tasty', '.', 'i', 'am', 'very', 'happy']
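The four cleaning steps of this section can be chained into one function. The following is only a sketch under simplifying assumptions: it splits on whitespace instead of using word_tokenize, and it uses a tiny made-up stop list and two contraction patterns so that it runs with the standard library alone. The names `clean`, `STOPS`, `CONTRACTIONS` and `PUNCT` are introduced here for illustration.

```python
import re
import string

STOPS = {"it", "is", "a", "the", "i", "am"}  # tiny hypothetical stop list
CONTRACTIONS = [(r"(\w+)'s", r"\g<1> is"), (r"(\w+)n't", r"\g<1> not")]
PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

def clean(sentence):
    text = sentence.lower()
    # Expand contractions before removing punctuation,
    # otherwise the apostrophes they rely on are already gone.
    for regex, repl in CONTRACTIONS:
        text = re.sub(regex, repl, text)
    text = PUNCT.sub(" ", text)
    return [w for w in text.split() if w not in STOPS]

result = clean("It's a pleasant evening, isn't it?")
print(result)  # ['pleasant', 'evening', 'not']
```

The word 'not' is deliberately kept out of the stop list here, because negation usually matters for sentiment.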
