### Sentiment analysis on reviews data
Kei Sato

ML310B - Advanced Machine Learning

March 25, 2019

#### Project overview
For this assignment, we want to use sentiment analysis to predict the polarity of a given film review.  To build the model, we are given a corpus of 50K reviews, each associated with a score of 0 or 1, which respectively indicate that the review is negative or positive.  

#### Metrics used
We use accuracy as the main metric for judging whether the model is successful.  Throughout model training and cross-validation, we also monitor the proportion of false positives for each class.
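
As a minimal sketch of these two quantities, the snippet below computes accuracy and, for each class, the share of all predictions that were wrongly assigned to that class.  The function name `accuracy_and_false_positives` and the toy labels are illustrative assumptions, not part of the assignment code.

```python
def accuracy_and_false_positives(y_true, y_pred):
    """Return accuracy and, per class, the fraction of all predictions
    that were wrongly assigned to that class."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    # false positive for class 1: predicted positive, actually negative
    wrong_pos = sum(1 for t, p in pairs if p == 1 and t == 0)
    # false positive for class 0: predicted negative, actually positive
    wrong_neg = sum(1 for t, p in pairs if p == 0 and t == 1)
    return {
        "accuracy": correct / n,
        "false_positive_class_1": wrong_pos / n,
        "false_positive_class_0": wrong_neg / n,
    }

# toy labels, purely illustrative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(accuracy_and_false_positives(y_true, y_pred))
```

Dividing by the total number of predictions (rather than by the size of each class) is one possible convention; on a balanced corpus like this one, either choice makes the two error rates directly comparable.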


In [1]:
# Load the data...
import pandas as pd
from nltk.tokenize import word_tokenize

data = pd.read_csv('resources/Reviews.csv')
print("Number of positive and negative reviews", '\n', data["sentiment"].value_counts())
data.head()

Number of positive and negative reviews 
 1    25000
0    25000
Name: sentiment, dtype: int64


Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0


#### Remove HTML tags
Original sentence:
```
<br /><br />The trailer of ""Nasaan ka man"" caught my attention, my daughter in law's and daughter's so we took time out to watch it this afternoon.
```

Removed HTML tags:
```
The trailer of ""Nasaan ka man"" caught my attention, my daughter in law's and daughter's so we took time out to watch it this afternoon.
```

In [2]:
from bs4 import BeautifulSoup

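# takes a string, returns the string with HTML tags removed
def strip_html(text):
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()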

#### Convert all text to lowercase
Original sentence:
```
The trailer of ""Nasaan ka man"" caught my attention, my daughter in law's and daughter's so we took time out to watch it this afternoon.
```
Lowercase:
```
the trailer of ""nasaan ka man"" caught my attention, my daughter in law's and daughter's so we took time out to watch it this afternoon.
```

In [3]:
# takes string, returns string
def lowercase(text):
    return text.lower()

#### Expand contractions
Original text:
```
The SF premise isn't unique (although it pretty much was back then)
```

Without contractions:
```
The SF premise is not unique (although it pretty much was back then)
```

In [4]:
import json

with open('resources/contractions.json', 'r') as f:
    contractions = json.load(f)

# takes a string, returns a string with known contractions expanded
def expand_contractions(text):
    # look up each whitespace-separated token; unknown tokens pass through unchanged
    return ' '.join(contractions.get(word, word) for word in text.split())
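
The lookup is an exact match on whitespace-separated tokens, so a contraction that is capitalized or has punctuation attached is only expanded if the JSON file has an entry for that exact form.  The sketch below illustrates this with a small hypothetical dictionary standing in for `resources/contractions.json`, whose real contents are not shown here.

```python
# Hypothetical stand-in for resources/contractions.json
contractions = {"isn't": "is not", "don't": "do not"}

def expand_contractions(text):
    # exact-match lookup on whitespace-separated tokens
    return ' '.join(contractions.get(word, word) for word in text.split())

print(expand_contractions("The SF premise isn't unique"))
# → The SF premise is not unique

# "Isn't" (capitalized) and "isn't." (trailing period) are not keys,
# so both pass through unchanged
print(expand_contractions("Isn't it great? No, it isn't."))
# → Isn't it great? No, it isn't.
```

If the real file only holds lowercase keys, lowercasing the text before this step would catch the capitalized forms; whether that matters for accuracy has not been tested here.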


#### Remove symbols and punctuation
Original text:
```
though it makes the most sophisticated use of the ""cut-out"" method of animation (a la ""south park""), the real talent behind
```
Removed symbols and punctuation:
```
though it makes the most sophisticated use of the cutout method of animation a la south park the real talent behind

```

In [5]:
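import re

# characters that separate words are replaced by a space ...
replace_re_by_space = re.compile(r'[/(){}\[\]|@,;]')
# ... and anything else that is not a digit, lowercase letter, space, #, + or _ is deleted
delete_re_symbols = re.compile(r'[^0-9a-z #+_]')

# takes a lowercased string, returns a string
def remove_symbols_punctuation(text):
    # replace separators first; deleting first would also remove them
    # and leave nothing for the space substitution to match
    text = replace_re_by_space.sub(' ', text)
    text = delete_re_symbols.sub('', text)
    # collapse the runs of spaces left behind by the substitutions
    return ' '.join(text.split())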

#### Remove stop words
Original text:
```
I went to this movie tonight with a few friends not knowing more than the Actors that were in it, and that it was supposed to be a horror movie.
```
Removed stop words:
```
I went movie tonight friends knowing Actors , supposed horror movie

```

In [6]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# takes a string, returns a list of tokens with stop words removed
def remove_stop_words(text):
    return [w for w in text.split() if w not in stop_words]


#### Text Lemmatization
Original text:
```
I went to this movie tonight with a few friends not knowing more than the Actors that were in it, and that it was supposed to be a horror movie.
```
Lemmatization applied:
```
I went movie tonight friends knowing Actors , supposed horror movie

```

In [7]:
from nltk.stem import WordNetLemmatizer

# create the lemmatizer once rather than on every call
wordnet_lemmatizer = WordNetLemmatizer()

# takes a list of tokens, returns a list of lemmatized tokens
def text_lemmatization(text):
    return [wordnet_lemmatizer.lemmatize(word) for word in text]

#### Initial Text Processing
The reviews corpus has 50,000 reviews, evenly split between 25,000 positive and 25,000 negative reviews.  Before doing any more data exploration, we process the text using standard techniques.  Much of this code was taken from the Lesson 8 HW assignment.

The first step is to apply some basic text processing, in the following order:
1.  Remove proper nouns.  This was done by using the NLTK part-of-speech tagging functionality to identify proper nouns.
2.  Expand contractions.
3.  Convert all text to lowercase.
4.  Remove `<br />` tags.  The `<br />` HTML tag was present in many reviews, so this part of cleaning the text was specific to this corpus.
5.  Remove symbols and punctuation.
6.  Remove stop words.  For this application, I also removed the words "movie" and "film" because they occurred very often throughout both positive and negative reviews.

After cleaning the text, lemmatization is applied.  I did try applying stemming to the dataset, but it produced too many non-words, so it has been omitted from the text processing steps.

In [17]:
# Taken from the Lesson 8 HW assignment
from nltk.corpus import stopwords
from nltk import tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import json
import nltk

# these are only to be run once
# nltk.download("stopwords")
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# nltk.download('wordnet')

# strips HTML, expands contractions, lowercases, removes punctuation/symbols and stop words, then lemmatizes
def text_processing(text):
    # remove HTML
    text = strip_html(text)
    
    # expand contractions
    text = expand_contractions(text)
    
    # lower case letters
    text = lowercase(text)
    
    # remove punctuation/symbols
    text = remove_symbols_punctuation(text)

    # remove stop words
    text = remove_stop_words(text)
    
    # text lemmatization
    text = text_lemmatization(text)
    
    return ' '.join(text)

test_data = data.copy(deep=True)
test_data["review"] = test_data["review"].apply(text_processing)
print("done cleaning data")

done cleaning data


#### Data exploration
Below is some initial data exploration.  The average length of positive and negative reviews is roughly the same.  The ten most frequently occurring words are also very similar between the sets of positive and negative reviews.  I also output the ten least commonly occurring words, partly out of curiosity and partly to verify that even the rarest words were still complete words.

In [None]:
import numpy as np
from collections import Counter
from operator import itemgetter
import heapq

# Get average length (in words) of the reviews with the given sentiment
def get_avg_length_review(data, sentiment):
    relevant_reviews = data.loc[data["sentiment"] == sentiment, "review"]
    review_lengths = [len(review.split()) for review in relevant_reviews]
    return int(np.mean(review_lengths))
print("Average word count of negative reviews:", get_avg_length_review(test_data, 0))
print("Average word count of positive reviews:", get_avg_length_review(test_data, 1))

# Get 10 most and least frequently occurring words, verify that real words are coming through
def get_most_least_common_words(data, sentiment):
    relevant_reviews = data.loc[data["sentiment"] == sentiment, "review"]
    # join with a space so the last word of one review is not fused with the first word of the next
    all_relevant_reviews = ' '.join(relevant_reviews)
    counted_words = Counter(all_relevant_reviews.split())
    most_common = counted_words.most_common(10)
    least_common = heapq.nsmallest(10, counted_words.items(), key=itemgetter(1))
    return most_common, least_common

negative_reviews = get_most_least_common_words(test_data, 0)
positive_reviews = get_most_least_common_words(test_data, 1)

print('\n')
print("10 most common words in negative reviews:", negative_reviews[0])
print("10 least common words in negative reviews:", negative_reviews[1])
print('\n')
print("10 most common words in positive reviews:", positive_reviews[0])
print("10 least common words in positive reviews:", positive_reviews[1])


Average word count of negative reviews: 116
Average word count of positive reviews: 119


#### Methodology
The data is transformed using TF-IDF and then fed into a logistic regression classifier.  I use a train/test split of 70%/30%.  Most settings were left at their default values.
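
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting.  It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and scaling of each document vector to unit length, which to my understanding matches the scikit-learn `TfidfVectorizer` defaults.  The function name `tfidf` and the toy documents are illustrative, and the sketch ignores n-grams and the `min_df` / `max_df` filters.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # scale to unit Euclidean length (fall back to 1.0 for empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# toy documents, purely illustrative
docs = ["great plot great cast", "terrible plot", "great fun"]
for vector in tfidf(docs):
    print(vector)
```

In the second toy document, "terrible" receives a higher weight than "plot" because it appears in fewer documents, which is the effect that lets rare, opinion-bearing words stand out from words common to most reviews.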

In [None]:
import numpy as np
import pandas as pd

# data: review texts, y_true / y_pred: true and predicted labels (0 or 1).
# Prints each error type as a fraction of all predictions and returns the
# misclassified reviews together with their predicted labels.
def get_incorrect_predictions(data, y_true, y_pred):
    # convert to arrays so that indexing is positional; after train_test_split
    # a pandas Series keeps its shuffled index labels
    data, y_true, y_pred = np.asarray(data), np.asarray(y_true), np.asarray(y_pred)
    incorrect = y_true != y_pred
    predicted_pos = int(np.sum(incorrect & (y_pred == 1)))
    predicted_neg = int(np.sum(incorrect & (y_pred != 1)))
    print("Predicted POSITIVE, actually NEGATIVE", round(predicted_pos / len(y_true), 3))
    print("Predicted NEGATIVE, actually POSITIVE", round(predicted_neg / len(y_true), 3))
    print('\n')
    return pd.DataFrame({'review': data[incorrect], 'sentiment': y_pred[incorrect]})

In [None]:
# model training
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

# test_data["review"].apply(lambda text: text_processing(text))

# split data
x_train, x_test, y_train, y_test = train_test_split(
    test_data["review"],
    test_data["sentiment"],
    test_size=0.3,
    random_state=42
)

print('done splitting data')

# transform data
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.0005, ngram_range=(1,2)).fit(x_train)

print('done fitting vectorizer')

x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)

print('done transforming text')

# train model
model = LogisticRegression(solver='lbfgs')
model.fit(x_train, y_train)

print('done fitting model, will start scoring')

score = model.score(x_test, y_test)

print('model score', score)


|n-gram range|min document frequency (min_df)|accuracy|
|------|------|-----|
|(1, 1)| 0.0001|0.884|
|(1, 1)| 0.00025|0.884|
|(1, 1)| 0.0005|0.885|
|(1, 1)| 0.001|0.883|
|(1, 1)| 0.0025|0.88|
|(1, 2)| 0.0001|n/a|
|(1, 2)| 0.00025|n/a|
|(1, 2)| 0.0005|0.889|
|(1, 2)| 0.001|0.887|
|(1, 2)| 0.0025|0.883|
|(2, 2)| 0.0001|n/a|
|(2, 2)| 0.00025|n/a|
|(2, 2)| 0.0005|0.816|
|(2, 2)| 0.001|0.793|
|(2, 2)| 0.0025|0.743|
|(2, 3)| 0.0001|n/a|
|(2, 3)| 0.00025|n/a|
|(2, 3)| 0.0005|0.816|
|(2, 3)| 0.001|0.792|
|(2, 3)| 0.0025|0.743|
|(2, 4)| 0.0001|n/a|
|(2, 4)| 0.00025|n/a|
|(2, 4)| 0.0005|0.816|
|(2, 4)| 0.001|0.792|
|(2, 4)| 0.0025|0.743|
|(3, 3)| 0.0001|0.675|
|(3, 3)| 0.00025|0.633|
|(3, 3)| 0.0005|0.585|
|(3, 3)| 0.001|0.554|
|(3, 3)| 0.0025|0.552|
|(3, 4)| 0.0001|0.675|
|(3, 4)| 0.00025|0.633|
|(3, 4)| 0.0005|0.585|
|(3, 4)| 0.001|0.554|
|(3, 4)| 0.0025|0.522|
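
To make the minimum document frequency column concrete, the sketch below converts each proportion into a number of reviews.  It assumes a training set of 35,000 reviews (70% of the 50,000) and that `TfidfVectorizer` reads a float `min_df` or `max_df` as a proportion of the training documents, as the scikit-learn documentation describes.  The helper `min_document_count` is a hypothetical name introduced for illustration.

```python
import math

def min_document_count(min_df, n_docs):
    """Smallest number of documents a term must appear in to be kept."""
    # round first to guard against floating point noise before taking the ceiling
    return math.ceil(round(min_df * n_docs, 6))

n_train_docs = 35000  # assumption: 70% of the 50,000 reviews
max_df = 0.9          # value used in the training cell

for min_df in [0.0001, 0.00025, 0.0005, 0.001, 0.0025]:
    needed = min_document_count(min_df, n_train_docs)
    print(f"min_df={min_df}: term must appear in at least {needed} reviews")

# terms appearing in more than this many reviews are dropped
max_docs = math.floor(round(max_df * n_train_docs, 6))
print(f"max_df={max_df}: term may appear in at most {max_docs} reviews")
```

Under these assumptions, the best-scoring setting of 0.0005 keeps only terms that occur in at least 18 training reviews.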

#### Further analysis
While this model achieved roughly 89% overall accuracy (0.889 at best, using unigrams and bigrams with a minimum document frequency of 0.0005), there are many more ways to improve on this, either by doing more feature engineering or by using a more robust model.

Different models:  It would be worthwhile to try different models on the dataset.  An SVM would be appropriate because this is a binary classification problem.  However, I am also interested in the effects of measuring document similarity and using that for a clustering model.
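
As a starting point for the document similarity idea, the sketch below computes cosine similarity between two reviews from their raw word counts.  Cosine similarity is only one possible measure and is my assumption here; the function name and the toy reviews are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the word-count vectors of two documents."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    # only words present in both documents contribute to the dot product
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# toy reviews, purely illustrative
print(cosine_similarity("great plot great cast", "great cast weak plot"))
print(cosine_similarity("great plot", "terrible acting"))
```

A clustering model could then group reviews whose pairwise similarity is high; whether such clusters line up with sentiment would need to be checked on the real corpus.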

Word attributes:  There are other qualities of the individual words that could be explored further.  For example, we could test whether the average word length of a review is correlated with its sentiment.  Other aspects could include how obscure the words are and whether they are misspelled.
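
A minimal sketch of the average word length idea is shown below.  The helper names and the toy reviews are hypothetical, and the toy numbers say nothing about the real corpus; the point is only to show how such a feature could be computed and compared across the two classes.

```python
def average_word_length(review):
    """Mean number of characters per whitespace-separated word."""
    words = review.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

def mean_feature_by_sentiment(reviews, sentiments, feature):
    """Average a per-review feature separately for each sentiment label."""
    totals, counts = {}, {}
    for review, label in zip(reviews, sentiments):
        totals[label] = totals.get(label, 0.0) + feature(review)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

# toy data, purely illustrative
reviews = ["a dull tired mess", "wonderfully inventive storytelling", "bad", "superb"]
sentiments = [0, 1, 0, 1]
print(mean_feature_by_sentiment(reviews, sentiments, average_word_length))
```

The same `mean_feature_by_sentiment` helper would work for other per-review features, such as a misspelling rate, once a function computing them is available.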

Stemming:  I chose not to include stemming because the NLTK libraries were producing too many non-words, but it would definitely be worthwhile to invest more time into applying stemming correctly.

Sentence structure:  The sentence structure used could be indicative of the document's sentiment.  One hypothesis is that negative reviews have more sentences written in the first person, such as those beginning with "I think ..."  We can also explore whether positive or negative reviews are correlated with correctly or incorrectly structured sentences.
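
One rough way to test the first-person hypothesis is to measure the share of sentences in a review that start with a first-person word.  The sketch below is an assumption-laden starting point: the word list, the naive sentence split on `.`, `!` and `?`, and the function name are all illustrative choices, and it would have to run on the raw review text, before punctuation is removed.

```python
import re

# sentence starts with a first-person word (illustrative word list)
FIRST_PERSON = re.compile(r"^(i|we|my|our)\b", re.IGNORECASE)

def first_person_share(review):
    """Fraction of sentences that start with a first-person word."""
    # naive sentence split on ., ! and ?
    sentences = [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if FIRST_PERSON.match(s))
    return hits / len(sentences)

# two of the three sentences start in the first person
print(first_person_share("I think the plot was thin. The cast tried. We left early!"))
```

Comparing the average of this value between positive and negative reviews would give a first indication of whether the hypothesis is worth pursuing.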

#### References:
NLTK book http://www.nltk.org/book/

Blog post on sentiment analysis https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41