# 1. Introduction

Preparing the text data is a key step in this sentiment analysis project. In this section, we first load the modules needed for the analysis and introduce the NLTK movie reviews corpus. We then store the data in Python lists and build word features from it. Finally, an appendix briefly covers how to clean text: removing punctuation, expanding contractions, and so on.

# 2. Modules Preparation & Movie Reviews Corpus

The Python modules used in this sentiment analysis task are listed below:

In [1]:
import nltk
import pickle
import random
import re
import gensim
import tensorflow as tf
import numpy as np
import string
import pandas as pd
import time

from nltk.corpus import movie_reviews, stopwords
from nltk.tag import pos_tag
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from nltk.classify import ClassifierI
from nltk.stem.lancaster import LancasterStemmer

from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
from sklearn.preprocessing import scale

from statistics import mode

from scipy.stats import uniform as sp_rand
from scipy.stats import expon as sp_expon
from scipy.optimize import minimize
from scipy.stats import norm

from bs4 import BeautifulSoup

from gensim.models import Word2Vec
import matplotlib.pyplot as plt



The NLTK movie reviews corpus contains 2000 movie reviews, each stored in its own text file. If you want to see the raw data directly on your PC, type **appdata** in the path bar and open the **nltk_data** folder. Then choose **corpora** and open the **movie_reviews** folder to see the raw text files.

In this corpus, half of the reviews are positive and the other half are negative. You can inspect the corpus by running the following code.

In [4]:
movie_reviews.categories()

['neg', 'pos']

You can also get the text file names with the fileids method:

In [5]:
movie_reviews.fileids('pos')[:3]

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt']

Then, for instance, to get all the words in a text file, pass its file name to the words method:

In [6]:
movie_reviews.words('pos/cv000_29590.txt')

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

# 3. Input the Data to Python

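With these methods in hand, we can collect the files into a list of documents, each paired with its category. One thing to remember is to shuffle the documents randomly: they are read one category at a time, so a split by position on the unshuffled list would leave one class almost absent from the validation and testing sets.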

In [34]:
documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((movie_reviews.words(fileid), category))
        
random.shuffle(documents)
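
Because `random.shuffle` produces a different order on every run, any split by position made afterwards also changes from run to run. The following is a minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for the experiment. The toy `docs` list and the seed value 42 are illustrative and not part of the original notebook.

```python
import random

# Toy stand-in for the (words, category) pairs built above: grouped by label
docs = [(["bad"], "neg")] * 5 + [(["good"], "pos")] * 5

# A dedicated, seeded generator gives the same order on every run
rng = random.Random(42)  # 42 is an arbitrary illustrative seed
shuffled = docs[:]       # copy so the original order is kept for comparison
rng.shuffle(shuffled)

# Without shuffling, the first half of the list contains a single class
print({label for _, label in docs[:5]})
print([label for _, label in shuffled])
```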

Now that we have all the words, we need to create features for the analysis. The features used here are the most frequent words in the movie reviews (we will try other features later to see whether the classification accuracy changes). **FreqDist** counts how often each word occurs, so we can pick, for instance, the 5000 most frequent words and use them as features.

In [35]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())  # convert all the words to lowercase
    
all_words = nltk.FreqDist(all_words)

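# Use the 5000 most frequent words as the keys.
# most_common() sorts by count; keys() alone is not ordered by frequency.
word_features1 = [word for word, count in all_words.most_common(5000)]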
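
`FreqDist` is a subclass of `collections.Counter`, so the way it orders words can be checked with the standard library alone. The sketch below uses a made-up token list to show that `keys()` follows first-seen order, while `most_common(n)` returns the entries sorted by count.

```python
from collections import Counter

tokens = ["the", "plot", "is", "great", "the", "cast", "great", "great"]
counts = Counter(tokens)

# keys() follows first-seen order, not frequency
first_two_keys = list(counts.keys())[:2]
# most_common(n) returns the n highest counts, from most to least frequent
top_two = [word for word, count in counts.most_common(2)]

print(first_two_keys)  # ['the', 'plot']
print(top_two)         # ['great', 'the']
```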

So far, we have created the documents, which cover all the movie review files, and the word features. The next step is to determine which feature words each file contains. One option is a function that checks, for every word in the feature list, whether it also appears in a given movie review.

In [36]:
def find_features(text, word_features):
    words = set(text)
    featuresets = {}
    for w in word_features:
        featuresets[w] = (w in words)
        
    return featuresets
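
As a quick illustration of what this function returns, the sketch below applies an equivalent version (written as a dict comprehension) to a made-up review and a three-word feature list. Each feature word maps to `True` or `False` depending on whether it occurs in the review.

```python
def find_features(text, word_features):
    # A set makes each membership test fast, whatever the review length
    words = set(text)
    return {w: (w in words) for w in word_features}

review = ["a", "dull", "plot", "but", "a", "great", "cast"]
features = find_features(review, ["great", "boring", "plot"])
print(features)  # {'great': True, 'boring': False, 'plot': True}
```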

We can then apply this function to every text file in the documents and create the training set, the validation set (a held-out data set used to compare the performance of different models), and the testing set.

In [37]:
featuresets1 = [(find_features(text = rev, word_features = word_features1),category) for (rev, category) in documents]

training_set1 = featuresets1[:1800]
validation_set1 = featuresets1[1800:1900]
testing_set1 = featuresets1[1900:]
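
The same 1800/100/100 split can be wrapped in a small helper, which also makes it easy to check the class balance of each part. This is only a sketch: `split_by_position` is a hypothetical helper, and the `toy` list stands in for the real feature sets.

```python
from collections import Counter

def split_by_position(items, n_train, n_val):
    """Slice a pre-shuffled list into train, validation, and test parts."""
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Toy stand-in for featuresets1: 2000 (features, label) pairs
toy = [({}, "pos" if i % 2 else "neg") for i in range(2000)]
train, val, test = split_by_position(toy, 1800, 100)

print(len(train), len(val), len(test))  # 1800 100 100
print(Counter(label for _, label in val))
```

Printing the label counts of the validation part is a cheap way to confirm that shuffling left both classes represented.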

So far, we have used the 5000 most frequent words as word features. However, this may give poor results, because many nouns such as **movie** say nothing about a viewer's attitude towards a film. A more reliable approach is to consider only adjectives and adverbs as features, because these words are closer to a viewer's opinion. To do this, we use the **pos_tag** function to tag each word's part of speech and keep the adjectives and adverbs.

In [38]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())  # convert all the words to lowercase

pos = pos_tag(all_words)

adj_adv = []

for w in pos:
    if w[1][0] == 'J' or w[1][0] == 'R': # The tags of adjectives and adverbs begin with 'J' and 'R'
        adj_adv.append(w[0])
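
The filtering step can be tried without running the tagger. The sketch below uses hand-written (word, tag) pairs in the Penn Treebank style that `pos_tag` returns; the pairs are made up for illustration. Note that a test on the first letter 'R' also keeps the particle tag RP, not only the adverb tags.

```python
# Hand-written (word, tag) pairs in the style returned by pos_tag
tagged = [("a", "DT"), ("truly", "RB"), ("great", "JJ"), ("movie", "NN"),
          ("better", "JJR"), ("ran", "VBD")]

# str.startswith accepts a tuple, so one test covers JJ/JJR/JJS and RB/RBR/RBS
adj_adv_words = [word for word, tag in tagged if tag.startswith(("J", "R"))]
print(adj_adv_words)  # ['truly', 'great', 'better']
```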

Then we choose the 5000 most frequent adjectives and adverbs as word features and continue with the classification task.

In [39]:
adj_adv = nltk.FreqDist(adj_adv)

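# Use the 5000 most frequent adjectives and adverbs as the keys.
# most_common() sorts by count; keys() alone is not ordered by frequency.
word_features2 = [word for word, count in adj_adv.most_common(5000)]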

With the new word features, we construct the training, validation, and testing sets again:

In [40]:
featuresets2 = [(find_features(text = rev, word_features = word_features2),category) for (rev, category) in documents]

training_set2 = featuresets2[:1800]
validation_set2 = featuresets2[1800:1900]
testing_set2 = featuresets2[1900:]

# 4. Compare the Results with Different Features

To compare classification performance, we use the same classifier, logistic regression, for both feature sets. Let's start with the 5000 most frequent words.

In [43]:
logistic_classifier = SklearnClassifier(LogisticRegression())
logistic_classifier.train(training_set1)                           
print('logistic_classifier accuracy rate with top 5000 most frequent words: ', 
      (nltk.classify.accuracy(logistic_classifier, validation_set1))*100)

logistic_classifier accuracy rate with top 5000 most frequent words:  72.0


However, if we use the feature set built from the 5000 most frequent adjectives and adverbs, the accuracy rate is:

In [44]:
logistic_classifier = SklearnClassifier(LogisticRegression())
logistic_classifier.train(training_set2)                           
print('logistic_classifier accuracy rate with top 5000 most frequent adj&adv words: ', 
      (nltk.classify.accuracy(logistic_classifier, validation_set2))*100)

logistic_classifier accuracy rate with top 5000 most frequent adj&adv words:  82.0


We see a marked improvement in classification accuracy: from 72% to 82% on the validation set.
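
The validation set holds only 100 reviews, so each accuracy figure carries some sampling uncertainty. A rough way to gauge it is the binomial standard error, sketched below under the assumption that the validation reviews are independent draws. `accuracy_standard_error` is a hypothetical helper, not part of NLTK or scikit-learn.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

n_val = 100  # size of the validation slice [1800:1900]
for acc in (0.72, 0.82):
    se = accuracy_standard_error(acc, n_val)
    print(f"accuracy {acc:.2f}: standard error about {se:.3f}")
```

Each figure has a standard error of roughly four percentage points, so the ten-point gap is encouraging, but a larger validation set or cross-validation would give a firmer comparison.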

# 5. Appendix: Clean the Text Data

If we want our model to classify better, one basic approach is to improve the quality of the raw text data. Cleaning should be the first step whenever we receive raw text. In general, we can improve the quality of a text dataset by doing the following:

1. eliminate stopwords
2. eliminate punctuation
3. expand contractions (He's -> He is)
4. convert to lowercase
5. stem the words
6. delete the HTML tags

**Note**: Not every cleaning step is needed every time. Choose the ones that suit the task at hand.

## 5.1 Eliminate the stopwords

Stopwords like 'to' carry little meaning for many natural language processing tasks. Hence, we can delete them first to get a much cleaner dataset.

NLTK provides stopword lists for many languages. Here we use the English stopwords and eliminate the unnecessary words in our text.

In [45]:
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
words=["Don't", 'hesitate','to','ask','questions']
[word for word in words if word not in stops]

["Don't", 'hesitate', 'ask', 'questions']
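
In the output above, "Don't" is kept because the membership test is case sensitive and the token starts with a capital letter. The sketch below shows the effect of lowercasing each token before the test. It uses a small hand-picked set, `toy_stops`, instead of the NLTK list, so the exact words in it are an assumption made for illustration.

```python
# Small hand-picked stopword set, used here instead of the NLTK list
toy_stops = {"to", "don't", "the", "a"}
words = ["Don't", "hesitate", "to", "ask", "questions"]

case_sensitive = [w for w in words if w not in toy_stops]
case_insensitive = [w for w in words if w.lower() not in toy_stops]

print(case_sensitive)    # ["Don't", 'hesitate', 'ask', 'questions']
print(case_insensitive)  # ['hesitate', 'ask', 'questions']
```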

NLTK ships stopword lists for many languages, and they can be used directly in later analysis.

In [46]:
print(stopwords.fileids(), end = ' ')

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'kazakh', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish'] 

We can then calculate the proportion of stopwords in a given text.

In [54]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stops = stopwords.words('english')

text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty.","I am very happy"]
words = []
for sentence in text:
    for word in word_tokenize(sentence):
        words.append(word)

lower_words = [word.lower() for word in words]
        
text_f = [word for word in lower_words if word in stops]

proportion = len(text_f)/len(lower_words)
print('The proportion of stopwords in this text is about :', np.ceil(proportion*100), '%')

The proportion of stopwords in this text is about : 46.0 %


## 5.2 Eliminate punctuation

Punctuation is also of little use in our analysis. To delete all the punctuation marks, we use the **re** module and regular expressions.

In [55]:
import re
import string
from nltk.tokenize import word_tokenize

text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
tokenized_docs = [word_tokenize(doc) for doc in text]
# [...] is a character class: it matches any single character from the set inside the brackets
x = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []

for review in tokenized_docs:
    new_review = []
    for token in review:
        new_token = x.sub('', token)  # delete every punctuation character from the token
        if new_token != '':
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)

print(tokenized_docs_no_punctuation)

[['It', 'is', 'a', 'pleasant', 'evening'], ['Guests', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was', 'tasty']]
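
For plain character deletion, the standard library also offers `str.translate`, which avoids building a regular expression. The following is a minimal sketch on a made-up token list; it is an alternative to the `re` approach above, not a replacement for it.

```python
import string

# Translation table that deletes every character in string.punctuation
remove_punct = str.maketrans("", "", string.punctuation)

tokens = ["Guests", ",", "who", "came", "from", "US", "."]
cleaned = [t.translate(remove_punct) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that became empty
print(cleaned)  # ['Guests', 'who', 'came', 'from', 'US']
```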


## 5.3 Contraction

English has many contractions, such as she's and we're. We can also use regular expressions to expand them into their full form.

In [56]:
import re

replacement_patterns = [
    # Specific contractions come first so the generic rules below do not mangle them
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"I'm", "I am"),
    (r"ain't", "is not"),
    # \g<1> is a back-reference to the first captured group, here (\w+).
    # The replacements are raw strings so that \g is not read as a string escape.
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile the patterns passed in, not the module-level default
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s

replacer = RegexpReplacer()
replacer.replace("I'm Jack. He's Jim. He'll participate in our team! He've played football for ten years~")

'I am Jack. He is Jim. He will participate in our team! He have played football for ten years~'
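
The order of the patterns matters: the specific rules for won't and can't have to run before the generic n't rule. The sketch below shows what happens when the order is reversed. `expand` is a hypothetical helper written only for this comparison.

```python
import re

specific_first = [(r"can't", "cannot"), (r"(\w+)n't", r"\g<1> not")]
generic_first = list(reversed(specific_first))

def expand(text, patterns):
    # Apply each (regex, replacement) pair in the order given
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

print(expand("I can't go", specific_first))  # I cannot go
print(expand("I can't go", generic_first))   # I ca not go
```

Note also that the 's rule cannot tell a contraction (He's) from a possessive (Jim's), so it should be applied with care.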

## 5.4 Conversion to Lowercase

In [57]:
text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty.","I am very happy"]
words = []
for sentence in text:
    for word in word_tokenize(sentence):
        words.append(word)

lower_words = [word.lower() for word in words]
print(lower_words)

['it', 'is', 'a', 'pleasant', 'evening', '.', 'guests', ',', 'who', 'came', 'from', 'us', 'arrived', 'at', 'the', 'venue', 'food', 'was', 'tasty', '.', 'i', 'am', 'very', 'happy']


## 5.5 Stemming

Stemming in NLP means treating the different variants of a word as the same word. For instance, playing, played, and play are all reduced to the same stem: play. NLTK has many useful stemmers. The best-known ones are the Lancaster stemmer and the Porter stemmer. Here we use the Lancaster stemmer on a few sample sentences.

In [58]:
text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty.","I am very happy"]

st = LancasterStemmer()
# The stemmer works on single words, so tokenize each sentence first;
# calling stem() on a whole sentence would only lowercase it
text_stemmed = [[st.stem(word) for word in word_tokenize(sentence)] for sentence in text]
print(text_stemmed)
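
To make the idea of mapping variants to one stem concrete, here is a deliberately crude suffix stripper. It is not the Lancaster or Porter algorithm, only a toy stand-in; `toy_stem` and its suffix list are assumptions made for illustration.

```python
def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["playing", "played", "plays", "play"]])
# ['play', 'play', 'play', 'play']
```

Real stemmers apply many more rules, which is why their output for irregular words can differ a lot from this sketch.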

## 5.6 Remove the HTML Markup

Some text data, such as reviews collected from web pages and stored in a TSV (tab-delimited) file, still contains HTML markup. These tags carry no information for our analysis, so we also need to remove them. To accomplish this task, we use the BeautifulSoup module.

Here we are going to use the labeled datasets in [Kaggle Movie Review Analysis](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) as an example.

In [3]:
labeled_train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [6]:
example1 = BeautifulSoup(labeled_train["review"][0], "html.parser")  # name the parser explicitly to avoid a warning





In [7]:
print(labeled_train['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [8]:
print(example1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

From the results shown above, we can see that the HTML tags have been removed.
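
One detail in the output above: where the `<br />` tags were removed, the neighbouring sentences are joined without a space ("m'kay.Visually"). The sketch below shows one way to keep a space at line breaks, using the standard library `html.parser` module rather than BeautifulSoup. `TextExtractor` and `strip_tags` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content and turn <br> tags into spaces."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br />
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags
    return " ".join("".join(parser.parts).split())

print(strip_tags("drugs are bad m'kay.<br /><br />Visually impressive"))
# drugs are bad m'kay. Visually impressive
```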

Finally, all the preprocessing steps can be combined into a single function.

In [3]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML (name the parser explicitly to avoid a BeautifulSoup warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Stemming
    st = LancasterStemmer()
    text_stemmed = [st.stem(word) for word in meaningful_words]
    #
    # 7. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(text_stemmed)