# Sentiment Analysis for Yelp Customer Reviews 
### Monash University
### Author: Arunava Munshi
### Date: June 6, 2019
This project is about sentiment analysis of customer reviews of Yelp. In order to do this the following datasets have been provided -
1. **train data.csv:** trn id and review text. Contains 650,000 product reviews which acts as the training data.
2. **train label.csv:** trn id and sentiment labels. The label set (1,2,3,4,5) refer to positive polarity levels (strong negative, weak negative, neutral, weak positive, strong and positive) respectively.
3. **test data.csv:** test id and review text. Contains 50,000 product reviews which acts as the testing data.

## Importing required libraries
The below libraries have been imported as part of the text pre-processing, feature selection and model building.

In [1]:
# Importing libraries
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.util import ngrams
import itertools
from itertools import chain
import multiprocessing as mp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
from mpl_toolkits.mplot3d import Axes3D
import string
import re
import pickle
%matplotlib inline
from __future__ import division
import numpy as np
mpl.style.use("ggplot")

## Reading the Datasets
The below code reads the required datasets.
### Reading train_label.csv
The below code is to load the training labels into dataframe.

In [2]:
# Reading train_label.csv
train_label_df = pd.read_csv("train_label.csv")

### Reading train_data.csv
The below code is to load the training reviews into dataframes.

In [3]:
# Reading train_data.csv
train_data = pd.read_csv("train_data.csv")

### Merging the Reviews and Labels
Combining the training labels and reviews and removing the duplicate reviews.

In [4]:
# Combine train_data and train_label
train_data_set = pd.concat([train_data, train_label_df], axis=1)
# Remove duplicate trn_id
train_data_set = train_data_set.loc[:,~train_data_set.columns.duplicated()]

### Reading test_data.csv
The below code is to load the test data reviews into dataframes and renames the column of the dataframe.

In [5]:
test_data_set = pd.read_csv("test_data.csv")
test_data_set = test_data_set.rename(columns={'test_id': 'trn_id'})

## Execution
Now we do three executions of selected model with the given data. We chose support vector machine as our final model for this project (For the reason, please look into the report). Each execution has the below steps.
1. **Train Data Sampling:** Because this is a large dataset of more than 600000 reviews, we can't take all the data to build the model. This is not only memory centric, but also consumes more time. Rather we create samples of the input file.
2. **Data Pre-processing:** After data sampling, the data preprocessing is done that includes various ways to clean the data into managable format.
3. **Feature Selection:** The next step is feature selection. In this step, adequate features are selected for the sampled dataset.
4. **Train/Test Split:** Because we do not have any labelled test data to check our accuracy, so we need to do train test split to understand the model accuracy.
5. **Model Building:** This step builds the model on the training data from the train/test split and predict on test data from the same split.
6. **Model Accuracy and Checking Model Accuracy:** Last, but not the least, this step checks the model accuracy for each execution after predicting

This is the execution cycle one with one set of data. 
### Train Data Sampling
We aim to build a balanced dataset. We saw from our data exploration that the given review set is a balanced set and hence we need to generate balanced samples. So we are getting the data for each sentiment level first.

In [6]:
# Getting dataframe for each sentiment level
train_data_set_1 = train_data_set[train_data_set['label'] == 1]
train_data_set_2 = train_data_set[train_data_set['label'] == 2]
train_data_set_3 = train_data_set[train_data_set['label'] == 3]
train_data_set_4 = train_data_set[train_data_set['label'] == 4]
train_data_set_5 = train_data_set[train_data_set['label'] == 5]

After that we are taking 5000 random samples from each sentiment level.

In [7]:
# Random samples of data for each sentiment type
train_data_set_1 = train_data_set_1.sample(n = 5000)
train_data_set_2 = train_data_set_2.sample(n = 5000)
train_data_set_3 = train_data_set_3.sample(n = 5000)
train_data_set_4 = train_data_set_4.sample(n = 5000)
train_data_set_5 = train_data_set_5.sample(n = 5000)

After this we merge all these individual samples to create the main train data.

In [8]:
# merge these dataframes
train_data_set = pd.concat([train_data_set_1, train_data_set_2, train_data_set_3, train_data_set_4, train_data_set_5], axis=0)
train_data_set = train_data_set.sample(frac=1)
train_data_label = train_data_set['label']
train_data_set = pd.concat([train_data_set[['trn_id', 'text']], test_data_set], axis=0)

### Data Preprocessing
We do the very basic data pre-processing in our first execusion. These steps include -
1. **Emoticon handling:** Finding emoticons into the reviews and replacing them with correct vocabulary.
2. **Word contraction:** Expanding the contracted words such as "won't" or "can't" into "will not" or "cannot".
3. **Case normalization:** Lowercasing the respective texts.
4. **Word tokenization:** Seperating the words into word tokens.
5. **Digit removals:** Removing the digits.
6. **Removing lemmatized words with lemmatization with POS tagging:** Removing lemmatized words after doing the Parts-OfpSpeech tagging to them.
7. **Emphasis words:** Replacing the words with repeated characters.
9. **Punctuation removal:** Removing the punctuations.

In [9]:
# Function for handling Emoticons
def handlingEmoticons(each_review):
    each_review = re.sub("X-\(", ' angry ', each_review)
    each_review = re.sub("</3", ' broken heart ', each_review)
    each_review = re.sub("O.o", ' confused ', each_review)
    each_review = re.sub("B-\)", ' angry ', each_review)
    each_review = re.sub(":_\(|:'\(", ' crying ', each_review)
    each_review = re.sub("\\\:D/", ' dancing ', each_review)
    each_review = re.sub("\*-\*", ' dazed ', each_review)
    each_review = re.sub("=P|:-P|:P", ' tongue out ', each_review)
    each_review = re.sub("=\)|:-\)|:\)|\(-:", ' happy ', each_review)
    each_review = re.sub("<3", ' heart ', each_review)
    each_review = re.sub("{}", ' hug ', each_review)
    each_review = re.sub(":-\|", ' indifferent ', each_review)
    each_review = re.sub("X-p", ' joking ', each_review)
    each_review = re.sub("XD|=D", ' laughing ', each_review)
    each_review = re.sub("\)-:|:-\(|:\(|=\(", ' sad ', each_review)
    each_review = re.sub("=/", ' mad ', each_review)
    each_review = re.sub(":-B", ' nerd ', each_review)
    each_review = re.sub("\^_\^", ' overjoyed ', each_review)
    each_review = re.sub(":-/", ' perplexed ', each_review)
    each_review = re.sub(":S", ' sarcastic ', each_review)
    each_review = re.sub(":o|:O|:0|=O|:-o", ' surprised ', each_review)
    each_review = re.sub(":-J", ' tongue in cheek ', each_review)
    each_review = re.sub(":-\\\\", ' undecided ', each_review)
    each_review = re.sub("=D|:D|:-D", ' very happy ', each_review)
    each_review = re.sub(";-\)|;\)", ' winking ', each_review)
    each_review = re.sub("\|-O", ' yawn ', each_review)
    return(each_review)

In [10]:
# Function for Contraction word dictioary
contractions = {
"ain't": "am not / are not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is",
"i'd": "I had / I would",
"i'd've": "I would have",
"i'll": "I shall / I will",
"i'll've": "I shall have / I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [11]:
# Getting english vocabulary for POS tagging
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
# Function to identify Parts-of-speech
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [12]:
lemmatizer = nltk.WordNetLemmatizer()
def tokenizeWordTrain(text):
    # Emoticon handling
    text = handlingEmoticons(text)
    # Case normalization and Punctuation Removal
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    # Word contraction
    for word in text.split():
        if word.lower() in contractions:
            text = text.replace(word, contractions[word.lower()])
    # Word tokenization
    tokens = re.split('\W+', text)
    #Removing digits
    tokens= [word for word in tokens if word.isalpha()]
    # Removing lemmatized words with lemmatization with POS tagging
    pos_tagged_word_tokens = pos_tag(tokens)
    text = " ".join([lemmatizer.lemmatize(each_item[0]) if get_wordnet_pos(each_item[1]) is None else lemmatizer.lemmatize(each_item[0], get_wordnet_pos(each_item[1])) for each_item in pos_tagged_word_tokens])
    # Handling emphasis words
    emhasis_regex = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
    text = emhasis_regex.sub(" emphasis ", text).strip()
    return(text)

Doing multiprocessing, using a pool of 10 process for faster execution.

In [13]:
# Creating a pool of 30 processes in multiprocessing
pool = mp.Pool(processes=30)  
train_data_set['text'] = pool.map(tokenizeWordTrain, train_data_set['text'])

Now lets do some analysis of the sample review corpus. We can see that the vocabulary size 47334, while the total number of tokens is 3042634 and more importantly the lexical dicersity is 64%, which indicates that our sampling is pretty good.

In [14]:
# To convert the dataframe into dictionary
train_data_set_trn_id = train_data_set['trn_id'].tolist()
train_data_set_text = train_data_set['text'].tolist()
train_data_set_dict = dict(zip(train_data_set_trn_id, train_data_set_text))
for k, v in train_data_set_dict.items():
    train_data_set_dict[k] = re.split('\W+', train_data_set_dict[k])

# Calculating Lexical Diversity
words = list(chain.from_iterable(train_data_set_dict.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab),"\nTotal number of tokens: ", len(words), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  86742 
Total number of tokens:  9148752 
Lexical diversity:  105.47084457356299


### Feature Selection
Now we do the feature selection from this preprocessed text. We use the tf-idf vectorizer for this feature selection.  We saw that, with this technique we could create 1998 features. In Feature selection we do the following -
1. Unigram and Bigram Features
2. We discarded any feature which occurs within less than 0.01% documents.

In [15]:
tfidf_vectorizer = tfidf_vectorizer = TfidfVectorizer(min_df = 0.01, ngram_range=(1,2))
tfidf = tfidf_vectorizer.fit_transform(train_data_set['text'])
X_features = pd.DataFrame(tfidf.toarray())
X_features.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.067455,0.067771,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.067374,0.0,0.0,0.0,0.0,0.043287,0.060176,0.08005,0.0
2,0.0,0.0,0.034965,0.0,0.072645,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Train/Test Split
Now the next step is train test split. We split the train data and test data depending on the labels.

In [16]:
X_train = X_features.head(25000)
X_test = X_features.tail(50000)
y_train = train_data_label

### Model Building
The below code builds classification model for this train test split. We are using LinearSVM as our final model (Explanation for it). This time we use simple LinearSVM.

In [17]:
SVM_model = LinearSVC(random_state=0, tol=1e-5, multi_class = "ovr").fit(X_train, y_train)

### Model Accuracy and Checking Model Accuracy
After the model building is done we do the prediction on test split of the train data and check the model accuracy.

In [18]:
SVM_model_predictions = SVM_model.predict(X_test)

The model accuracy is calculated in keggle which is 0.55686.

In [19]:
test_predictions = pd.concat([test_data_set['trn_id'], pd.DataFrame(SVM_model_predictions)], axis=1)
test_predictions = test_predictions.rename(columns={'trn_id': 'test_id', 0: 'label'})
test_predictions.to_csv('predict_label.csv')

## References
1. https://stackoverflow.com/questions/43018030/replace-apostrophe-short-words-in-python
2. https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b