### Sentiment analysis on reviews data
Kei Sato

ML310B - Advanced Machine Learning

March 25, 2019


We will be using reviews data to develop a sentiment analyzer, such that given a document, the model can predict if the review is positive (sentiment = 1) or negative (sentiment = 0)


In [1]:
# Load the data...
import pandas as pd
from nltk.tokenize import word_tokenize

data = pd.read_csv('data/Reviews.csv')

print("Number of positive and negative review", '\n', data["sentiment"].value_counts())
data.head()

Number of positive and negative review 
 1    25000
0    25000
Name: sentiment, dtype: int64


Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0


#### Initial Text Processing
The reviews corpus has 50,000 reviews and is evenly split between positive and negative reviews, so that it contains 25,000 positive and 25,000 negative reviews.  Before doing any more data exploration, we process the text using standard techniques.  Much of this code was taken from the Lesson 8 HW assignment.

The first step is apply some basic text processing.  This function will transform all the letters to lowercase and replace any punctuation or symbols with spaces.  At this step we will also remove English stop words.  Because this corpus contains some <br \> HTML elements, we will strip those out from the text as well.  This function will return the words in a tokenized format such each word is an element in an array.  After cleaning the text, lemmatization is applied. 

I did apply stemming to the dataset, but that produced too many non words and so it has been omitted from the text processing steps.

In [2]:
# Taken Lesson 8 HW assignment
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

replace_re_by_space = re.compile('[/(){}\[\]\|@,;]')
delete_re_symbols = re.compile('[^0-9a-z #+_]')
stop_words = set(stopwords.words('english'))

def combine_tokened_words(tokened_words):
    length_of_string=len(tokened_words)
    text_new=""
    for w in tokened_words:
        if w!=tokened_words[length_of_string-1]:
             text_new=text_new+w+" " # when w is not the last word so separate by whitespace
        else:
            text_new=text_new+w
    return text_new

# converts to lowercase and removes <br />, punctuation, stop words, and numbers
def text_processing(text):
    text = text.lower()
    text = text.replace("<br />", '')
    text = re.sub(replace_re_by_space.pattern, ' ', text)
    text = re.sub(delete_re_symbols.pattern, '', text)
    token_word = word_tokenize(text)
    
    # filtered_sentence contain all words that are not in stopwords dictionary    
    filtered_sentence = [w for w in token_word if not w in stop_words]
    return filtered_sentence

# Lemmatizes words
def text_lemmatization(text):
    wordnet_lemmatizer = WordNetLemmatizer()
    text = list(map(lambda word: wordnet_lemmatizer.lemmatize(word), text))
    return text

# test_data = data[:500].copy(deep=True)
test_data = data.copy(deep=True)
test_data["review"] = test_data["review"].apply(lambda text:
                                                combine_tokened_words(
                                                    text_lemmatization(
                                                        text_processing(text)
                                                    )
                                                )
                                               )
print("done processing data")

done processing data


#### Data exploration
Below is some initial data exploration.  We can see that the average length of positive and negative reviews is roughtly the same.  The ten most frequently occuring words are also very similar across between the sets of positive and negative reviews.  I also outputted the ten least commonly occuring words, in part for my own curiosity and to verify that the ten least commonly occuring words were still complete words.  It seems like the ten least commonly occurring words in positive reviews are names, which may indicate that the positive reviews were praising an individual's work in the movie

In [3]:
import numpy as np
from collections import Counter 
from functools import reduce
from operator import itemgetter
import heapq

# Get average length of reviews
def get_avg_length_review(data, sentiment):
    relevant_reviews = data.loc[data["sentiment"] == sentiment]["review"]
    avg_review_length = list(map(lambda review: len(review.split()), relevant_reviews))
    return int(np.mean(avg_review_length))
print("Average word count of negative reviews:", get_avg_length_review(test_data, 0))
print("Average word count of positive reviews:", get_avg_length_review(test_data, 1))

# Get 10 most and least frequently occuring words, verify that real words are coming through
def get_most_least_common_words(data, sentiment):
    relevant_reviews = data.loc[data["sentiment"] == sentiment]["review"]
    all_relevant_reviews = reduce(lambda accum, curr: accum + curr, relevant_reviews)
    counted_words = Counter(all_relevant_reviews.split())
    most_common = counted_words.most_common(10)
    least_common = heapq.nsmallest(10, counted_words.items(), key=itemgetter(1))
    return most_common, least_common
negative_reviews = get_most_least_common_words(test_data, 0)
positive_reviews = get_most_least_common_words(test_data, 1)

print('\n')
print("Top 10 most common words in negative reviews", negative_reviews[0])
print("Bottom 10 least common words in negative reviews", negative_reviews[1])
print('\n')
print("Top 10 most common words in positive reviews", positive_reviews[0])
print("Bottom 10 least common words in positive reviews", positive_reviews[1])


Average word count of negative reviews: 118
Average word count of positive reviews: 120


KeyboardInterrupt: 

#### Running the model and cross validation

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer


from sklearn.cluster import KMeans

def graph_roc(y_true, y_pred):
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred)
    print("FPR:", fpr)
    print("TPR:", tpr)
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1, alpha=0.3, label='test')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()
    
def get_sorted_predictions(data, y_true, y_pred):
    predicted_pos = 0
    predicted_neg = 0
    correct_predictions = []
    incorrect_predictions = []
    for i in range(0, len(y_true)):
        if y_true[i] == y_pred[i]:
            correct_predictions.append((y_true[i], data[i]))
        else:
            incorrect_predictions.append((y_true[i], data[i]))
            if y_pred[i] == 1:
                predicted_pos+=1
            else:
                predicted_neg+=1
    print("Predicted POSITIVE, actually NEGATIVE", float(predicted_pos)/float(len(y_true)))
    print("Predicted NEGATIVE, actually POSITIVE", float(predicted_neg)/float(len(y_true)))


def run_model_cv(data):   
    x_train, x_test, y_train, y_test = train_test_split(
        data["review"],
        data["sentiment"],
        test_size=0.3,
        random_state=42
    )
    
    ngrams = [
        (1, 1),
        (1, 2),
        (1, 3),
        (1, 4),
        (1, 5)
    ]
    
    for ngram_param in ngrams:
        print("ngram range", ngram_param)
#         tfid_vectorizer = TfidfVectorizer(min_df=10, max_df=0.8, use_idf=True, ngram_range=ngram_param).fit(x_train)
#         _x_train = tfid_vectorizer.transform(x_train)
#         _x_test = tfid_vectorizer.transform(x_test)

        count_vectorizer = CountVectorizer(max_df=0.8, min_df=0.1, ngram_range=ngram_param).fit(x_train)
        _x_train = count_vectorizer.transform(x_train)
        _x_test = count_vectorizer.transform(x_test)
        print("done vectorizing")

#         kmeans = KMeans(n_clusters=2, random_state=0).fit(_x_train)
#         y_pred = kmeans.predict(_x_test)
#         cv_clf = GridSearchCV(
#                     SVC(),
#                     [
#                         {
#                             "kernel": ["linear", "poly", "rbf"],
#                             "degree": [1, 2],
#                             "gamma": ["auto", "scale"]
#                         }
#                     ],
#                     cv=5,
#                     refit=True
#                 )
        cv_clf = SVC(kernel="linear", degree=1, gamma="auto", cache_size=0.8)
        print("fitting the model on dataset with these dimensions", _x_train.shape)
        cv_clf.fit(_x_train, y_train)
#         print("Best params", cv_clf.best_params_)
        
        y_pred = cv_clf.predict(_x_test)
        print("accuracy", metrics.accuracy_score(y_test, y_pred))
#         graph_roc(y_test, y_pred)
        get_sorted_predictions(list(x_test), list(y_test), y_pred)

#     (35000, 20274)
print("running the program????")    
run_model_cv(test_data)

x_train, x_test, y_train, y_test = train_test_split(data["review"], data["sentiment"],
                                                    test_size=0.3, random_state=42)
print("x train data shape", x_train.shape)
temp = CountVectorizer().fit_transform(x_train)
print("data shape", temp.shape)
temp = CountVectorizer().fit_transform(x_train)
print("data shape", temp.shape)

running the program????
ngram range (1, 1)
done vectorizing
fitting the model on dataset with these dimensions (35000, 124)


#### Notes
- Cannot use TFIDF because it was producing data sets that had more features than data points