# Portland Data Science Group
## Sentiment analysis (five week series)
### Week 2
In this week's session, I decided to start digging into the natural language processing (NLP) part of this project. Because this is my first NLP project, I am keeping the goal realistic and building a very simple Multinomial Naive Bayes model.

Multinomial Naive Bayes models (https://en.wikipedia.org/wiki/Naive_Bayes_classifier) are frequently used as a first attempt at multiclass classification problems. Classification means the output y values are discrete labels (e.g., the quality grade of a product) rather than continuous values (e.g., the ratings in this dataset). Therefore, the data must be preprocessed before it is fed to the model.

The analysis in this notebook follows a blog post on sentiment analysis (Sentiment analysis for Yelp review classification https://medium.com/tensorist/classifying-yelp-reviews-using-nltk-and-scikit-learn-c58e71e962d9), and some of the code is borrowed directly from that post.

In [1]:
# import data, tryout on the sample data
import pandas as pd
reviews = pd.read_csv('Data/boardgame-comments-english.csv', low_memory=False)
reviews.rename(columns = {"Compiled from boardgamegeek.com by Matt Borthwick":'userID'}, inplace=True)
reviews.head()

Unnamed: 0,userID,gameID,rating,comment
0,172640,24068,7.0,Good: Unique take on the hidden role games. T...
1,86674,24068,7.0,A neat social deduction game with multiple tea...
2,10643,24068,7.0,Good hidden roles werewolf style game that can...
3,31171,24068,7.0,"Overall I hate Mafia/Werewolf, but this versio..."
4,165608,24068,7.0,Fun social deduction exercise that gets merrie...


In [2]:
# Some useful libraries
import numpy as np
import nltk
from nltk.corpus import stopwords

Now let's start processing the text content. For each comment, we are going to break it into individual words and use these words later as the model input (i.e., the X matrix). There are various ways of doing so, including the "Gap" module introduced by Andrew Ferlitsch (https://github.com/andrewferlitsch/Gap). For now, I'm going to use the conventional manual way to keep things short and simple (not necessarily in terms of the amount of code, but in terms of using fewer libraries and less word tagging).

#### Approach 1: "manual" data preprocessing


In [3]:
# Remove all punctuation (raw-string pattern; regex=True is required on pandas >= 2.0)
reviews['comment'] = reviews['comment'].str.replace(r'[^\w\s]', '', regex=True)
# Lowercase all words
reviews['comment'] = reviews['comment'].str.lower()
# # Remove all stopwords
# reviews['comment'] = reviews['comment'].apply(lambda x: [item for item in x.split() if item not in stopwords.words('english')])
# The line above takes a long time to run (>36 min before I terminated it), so I remove stopwords in the CountVectorizer step instead.
# Likely cause: stopwords.words('english') is re-evaluated for every single token, and `not in` scans the returned list linearly.
reviews.head()

Unnamed: 0,userID,gameID,rating,comment
0,172640,24068,7.0,good unique take on the hidden role games the...
1,86674,24068,7.0,a neat social deduction game with multiple tea...
2,10643,24068,7.0,good hidden roles werewolf style game that can...
3,31171,24068,7.0,overall i hate mafiawerewolf but this version ...
4,165608,24068,7.0,fun social deduction exercise that gets merrie...
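
The commented-out stopword removal in the cell above can probably be made much faster by building the stopword collection once and storing it as a set. The following is a minimal sketch under stated assumptions: `STOPWORDS` is a small hand-written stand-in so that the snippet runs without downloading NLTK data (in the notebook it would be `set(stopwords.words('english'))`), and `remove_stopwords` is a hypothetical helper name, not part of the original analysis.

```python
# Stand-in stopword set (assumption); in the notebook this would be
# set(stopwords.words('english')), built once outside any loop.
STOPWORDS = {"a", "the", "on", "with", "and", "that", "this", "but", "i"}

def remove_stopwords(comment, stop_set=STOPWORDS):
    """Return the tokens of `comment` that are not in `stop_set`."""
    # Set membership is O(1) on average, versus a linear scan of a list.
    return [token for token in comment.split() if token not in stop_set]

print(remove_stopwords("a neat social deduction game with multiple teams"))
```

On the full dataset this could be applied with `reviews['comment'].apply(remove_stopwords)`; I have not timed it on the 841,645 comments.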


#### Vectorization
I use sklearn's CountVectorizer (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert the comments to a matrix of token counts. This essentially creates a sparse matrix that serves as the X input to our Naive Bayes model later. I use the nltk stopwords list as the reference for stopword removal. In addition, any word that appears in more than 90% or fewer than 0.1% of all comments is removed from consideration.
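
To make the document-frequency filtering concrete, here is a rough pure-Python sketch of the idea. It is a simplification, not scikit-learn's implementation: it splits on whitespace only, and `build_vocabulary`, `to_counts`, and the toy comments are hypothetical names and data introduced for illustration.

```python
from collections import Counter

def build_vocabulary(comments, stop_words=(), min_df=0.001, max_df=0.9):
    """Keep tokens whose document frequency lies within [min_df, max_df]."""
    stop_set = set(stop_words)
    n_docs = len(comments)
    doc_freq = Counter()
    for comment in comments:
        # Count each token once per comment (document frequency, not term frequency).
        doc_freq.update(set(comment.split()) - stop_set)
    kept = sorted(t for t, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)
    # Map each kept token to a column index (sorted here for reproducibility).
    return {token: column for column, token in enumerate(kept)}

def to_counts(comment, vocabulary):
    """Sparse row as {column index: count} for tokens found in the vocabulary."""
    counts = Counter(t for t in comment.split() if t in vocabulary)
    return {vocabulary[t]: c for t, c in counts.items()}

toy_comments = ["fun game", "fun fun theme", "boring game", "fun art"]
vocab = build_vocabulary(toy_comments, min_df=0.5, max_df=0.9)
print(vocab)                                   # tokens seen in 50%-90% of comments
print(to_counts("fun fun game night", vocab))  # unseen tokens are ignored
```

Each `{column: count}` pair plays the same role as one `(row, column)  count` entry in the sparse matrix output printed by the notebook.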

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer(stop_words=stopwords.words('english'), max_df=0.9, min_df=0.001).fit(reviews['comment'])

In [7]:
# Check the length of the vocabulary list (i.e., all the tokens considered)
len(bow_transformer.vocabulary_ )
# Actually not many tokens to consider

2212

In [9]:
# Let's check a single comment to understand what's going on under the hood
review_17 = reviews['comment'][16]
print('The original comment is:\n', review_17)

bow_17 = bow_transformer.transform([review_17])
print('After transformation, our input becomes:\n', bow_17)

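# Let's see some words represented by these columns
# (on scikit-learn >= 1.0, use get_feature_names_out() instead; get_feature_names() was removed in 1.2)
feature_names = bow_transformer.get_feature_names()
print(feature_names[237])
print(feature_names[375])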

The original comment is:
 played twice over the weekend still working out some of the rules  each time had a real blast hit the sweet spot with everyone  coop game with traitor element  very thematic and you dont have to be a fan of the series to enjoy it  31082014 play this for the experience   starstarstarstarnostar easy to learn starstarstarstarnostar fun factor starstarhalfstarnostarnostar   replayability  why this game stays in my collection the mechanics with the voting on decisions can feel repetitive and some turns you can find yourself with very little to do but the whole traitor role mechanic with other players possibly being a cylon or sympathiser can make this one of the most socially orientated and entertaining games 
After transformation, our input becomes:
   (0, 237)	1
  (0, 375)	1
  (0, 439)	1
  (0, 505)	1
  (0, 567)	1
  (0, 604)	1
  (0, 624)	1
  (0, 644)	1
  (0, 651)	1
  (0, 677)	1
  (0, 699)	1
  (0, 716)	1
  (0, 727)	1
  (0, 747)	1
  (0, 772)	1
  (0, 827)	1
  (0, 832

So now we can use this vectorizer to transform all of our comments.

In [10]:
X = bow_transformer.transform(reviews['comment'])
# Check the shape of X
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)
# Percentage of non-zero values
density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print('Density: {}'.format((density)))

Shape of Sparse Matrix:  (841645, 2212)
Amount of Non-Zero occurrences:  13327632
Density: 0.7158778452216686


#### Processing on y values
Multinomial Naive Bayes only works for categorical labels, not for continuous scores like the ratings in our data. To make our simple model work, we need to round the ratings so that we have a limited number of labels (it does not make much sense to have a label such as '7.1111'). In our dataset, a considerable number of ratings fall on the .5 steps. I feel it may be quite biased to round such numbers to the nearest integer, as is normally done, so I decided to create our "rating labels" in .5 increments with a function. This function is borrowed directly from https://stackoverflow.com/questions/24838629/round-off-float-to-nearest-0-5-in-python.

In [11]:
def round_of_rating(number):
    """
    Round a number to the closest half integer.
    >>> round_of_rating(1.3)
    1.5
    >>> round_of_rating(2.6)
    2.5
    >>> round_of_rating(3.0)
    3.0
    >>> round_of_rating(4.1)
    4.0
    """

    return round(number * 2) / 2

In [12]:
y = reviews['rating'].apply(round_of_rating)
print('Shape of y values: ', y.shape)

Shape of y values:  (841645,)
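
One detail of `round_of_rating` worth knowing: Python 3's built-in `round` resolves exact ties toward an even integer, so a rating that sits exactly between two half-steps does not always round up. The sketch below compares that behaviour with a hypothetical round-half-up variant (`round_half_up_to_half` is an illustrative name, not something used in this notebook).

```python
import math

def round_to_half(number):
    """Same logic as round_of_rating: ties in number * 2 go to the even integer."""
    return round(number * 2) / 2

def round_half_up_to_half(number):
    """Hypothetical alternative: ties always round up to the next 0.5 step."""
    return math.floor(number * 2 + 0.5) / 2

for rating in (7.25, 7.75, 7.1111):
    print(rating, round_to_half(rating), round_half_up_to_half(rating))
```

For most ratings the two functions agree; they differ only on exact quarter values such as 7.25, so the effect on the label distribution is probably small.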


However, after some experimentation, I found that the Naive Bayes model only accepts integer labels, meaning it raises an error when a label is, for example, 7.5. So we have to convert the y values to integers. One way to work around this issue is to multiply the y values by two and then divide the predictions by two afterwards. The division is only needed for interpretation. For checking model accuracy, we don't even need to divide the ratings afterwards, because ultimately we are only checking whether the labels are predicted correctly. It is not critical what the labels are.

In [13]:
y *= 2
y = y.astype(int)
y

0         14
1         14
2         14
3         14
4         14
5         20
6         20
7         20
8         20
9         20
10        17
11        17
12        17
13        17
14        17
15        17
16        17
17        17
18        17
19        17
20        17
21        17
22        17
23        17
24        17
25        17
26        17
27        17
28        17
29        17
          ..
841615    14
841616    14
841617    14
841618    14
841619    14
841620    14
841621    14
841622    14
841623    14
841624    14
841625    14
841626    14
841627    14
841628    14
841629    14
841630    14
841631    15
841632    15
841633    15
841634    15
841635    15
841636    15
841637    15
841638    15
841639    15
841640    15
841641    15
841642    15
841643    15
841644    15
Name: rating, Length: 841645, dtype: int64
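
The doubling trick is easy to reverse when a prediction needs to be read as a rating again. A minimal sketch, assuming hypothetical helper names `encode_rating` and `decode_label` (the notebook itself only performs the encoding step):

```python
def encode_rating(rating):
    """Half-step rating (e.g. 7.5) -> integer class label (15)."""
    return int(round(rating * 2))

def decode_label(label):
    """Integer class label (15) -> half-step rating (7.5)."""
    return label / 2

labels = [encode_rating(r) for r in (7.0, 10.0, 8.5, 7.5)]
print(labels)
print([decode_label(label) for label in labels])
```

With a fitted model, the same decoding would be `preds / 2` on the NumPy array returned by `predict`.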

#### Multinomial Naive Bayes model
After data preprocessing, the model construction, training, and testing process is pretty standard and straightforward. Again, we split our entire dataset into train and dev sets (70% / 30%).

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=17)

In [15]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
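
For intuition about what `fit` estimates with `alpha=1.0` and `fit_prior=True`, here is a toy from-scratch version of multinomial Naive Bayes with Laplace smoothing. It is only a sketch under simplifying assumptions (token lists instead of a sparse matrix, made-up toy data, hypothetical function names) and is not scikit-learn's code.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = sorted({t for doc in docs for t in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    # Empirical class priors: share of training documents per class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Add alpha to every token count so unseen tokens keep a non-zero probability.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    """Pick the class with the highest log posterior; unknown tokens are ignored."""
    scores = {c: log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
              for c in log_prior}
    return max(scores, key=scores.get)

toy_docs = [["fun", "great", "game"], ["great", "theme"],
            ["boring", "game"], ["boring", "slow", "rules"]]
toy_labels = [16, 16, 8, 8]  # doubled ratings, as in this notebook
model = fit_multinomial_nb(toy_docs, toy_labels)
print(predict(["great", "fun"], *model))
print(predict(["slow", "boring"], *model))
```

The "naive" part is the sum over tokens: each word contributes to the score independently of the others, which is why word order and context are lost.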

In [16]:
preds = nb.predict(X_dev)

In [17]:
from sklearn.metrics import confusion_matrix, classification_report
# print(confusion_matrix(y_dev, preds))
# print('\n')
print(classification_report(y_dev, preds))

             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          2       0.14      0.20      0.17      1184
          3       0.01      0.01      0.01        82
          4       0.12      0.11      0.12      2489
          5       0.00      0.00      0.00       202
          6       0.13      0.10      0.11      4812
          7       0.01      0.00      0.00       399
          8       0.14      0.07      0.10      8611
          9       0.04      0.01      0.02       746
         10       0.19      0.10      0.13     15305
         11       0.03      0.00      0.01      2262
         12       0.24      0.21      0.22     31501
         13       0.25      0.01      0.02      7845
         14       0.28      0.44      0.34     52125
         15       0.23      0.01      0.03     14452
         16       0.28      0.46      0.35     52120
         17       0.18      0.01      0.03      9547
         18       0.27      0.16      0.20   

  'recall', 'true', average, warn_for)
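
To help read the report columns, the per-label scores can be reproduced by hand from true and predicted labels. This is a small sketch on made-up toy labels; `precision_recall_f1` is a hypothetical helper, and it returns 0.0 whenever a denominator is zero, which matches the zeros shown for label 0 (support 0) in the report.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-label precision, recall, F1 and support from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correct hits
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false alarms
    fn = sum(1 for t, p in pairs if t == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true samples with this label
    return precision, recall, f1, support

toy_true = [14, 14, 14, 16, 16, 12]
toy_pred = [14, 16, 14, 14, 16, 14]
print(precision_recall_f1(toy_true, toy_pred, 14))
```

Precision answers "of the comments predicted as this label, how many were right?", while recall answers "of the comments that truly have this label, how many were found?".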


Again, this model does not predict very well (for what the scores in the report mean, see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html ). But again, this is a very rough model. More modifications to the input data (e.g., bigrams, tagging, stemming) could make the X matrix "cleaner", and better modeling options can be examined. I'll explore more next week.
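
As a preview of the bigram idea, the sketch below shows how adjacent word pairs could be added to the unigram tokens. The `ngrams` helper is a hypothetical illustration; in scikit-learn the equivalent switch would be `CountVectorizer(ngram_range=(1, 2))`, which I have not yet tried on this dataset.

```python
def ngrams(tokens, n=2):
    """Return the n-grams of a token list, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not a fun game".split()
# Unigrams plus bigrams: a pair like "not a" keeps some of the local context.
print(tokens + ngrams(tokens, 2))
```

Note that bigrams interact with stopword removal: if "not" and "a" are dropped first, the negation disappears before any pair can capture it.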

#### Approach 2: Gap
A tryout of the Gap module (for more details, refer to https://github.com/andrewferlitsch/Gap). The goal here is to split the text data into words.

In [18]:
# import Document and Page from the splitter module
from splitter import Document, Page
# import the Words class
from syntax import Words

[nltk_data] Downloading package wordnet to /Users/jishe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jishe/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [19]:
# Tokenize the first comment with bare=True (the output shows that stopwords are kept)
w = Words(reviews['comment'].loc[0], bare=True)
print(w.words)

[{'word': 'good', 'tag': 0}, {'word': 'unique', 'tag': 0}, {'word': 'take', 'tag': 0}, {'word': 'on', 'tag': 0}, {'word': 'the', 'tag': 0}, {'word': 'hidden', 'tag': 0}, {'word': 'role', 'tag': 0}, {'word': 'games', 'tag': 0}, {'word': 'the', 'tag': 0}, {'word': 'good', 'tag': 0}, {'word': 'and', 'tag': 0}, {'word': 'evil', 'tag': 0}, {'word': 'team', 'tag': 0}, {'word': 'win', 'tag': 0}, {'word': 'if', 'tag': 0}, {'word': 'they', 'tag': 0}, {'word': 'eliminate', 'tag': 0}, {'word': 'each', 'tag': 0}, {'word': 'other', 'tag': 0}, {'word': 'where', 'tag': 0}, {'word': 'the', 'tag': 0}, {'word': 'neutral', 'tag': 0}, {'word': 'team', 'tag': 0}, {'word': 'has', 'tag': 0}, {'word': 'unique', 'tag': 0}, {'word': 'objectives', 'tag': 0}, {'word': 'depending', 'tag': 0}, {'word': 'on', 'tag': 0}, {'word': 'what', 'tag': 0}, {'word': 'was', 'tag': 0}, {'word': 'dealt', 'tag': 0}, {'word': 'bad', 'tag': 0}, {'word': 'component', 'tag': 0}, {'word': 'quality', 'tag': 0}, {'word': 'is', 'tag': 0}

In [20]:
# Compare three settings on a single comment: bare=True, stem='porter', and stem='gap'
num = 1
w = Words(reviews['comment'].loc[num], bare=True)
print(w.words)

w = Words(reviews['comment'].loc[num], stem='porter')
print(w.words)

w = Words(reviews['comment'].loc[num], stem='gap')
print(w.words)

[{'word': 'a', 'tag': 0}, {'word': 'neat', 'tag': 0}, {'word': 'social', 'tag': 0}, {'word': 'deduction', 'tag': 0}, {'word': 'game', 'tag': 0}, {'word': 'with', 'tag': 0}, {'word': 'multiple', 'tag': 0}, {'word': 'teams', 'tag': 0}, {'word': 'and', 'tag': 0}, {'word': 'winning', 'tag': 0}, {'word': 'conditions', 'tag': 0}, {'word': 'happening', 'tag': 0}, {'word': 'at', 'tag': 0}, {'word': 'the', 'tag': 0}, {'word': 'same', 'tag': 0}, {'word': 'time', 'tag': 0}]
[{'word': 'neat', 'tag': 0}, {'word': 'social', 'tag': 0}, {'word': 'deduct', 'tag': 0}, {'word': 'game', 'tag': 0}, {'word': 'multipl', 'tag': 0}, {'word': 'team', 'tag': 0}, {'word': 'win', 'tag': 0}, {'word': 'condit', 'tag': 0}, {'word': 'happen', 'tag': 0}, {'word': 'time', 'tag': 0}]
[{'word': 'neat', 'tag': 0}, {'word': 'social', 'tag': 0}, {'word': 'deduction', 'tag': 0}, {'word': 'game', 'tag': 0}, {'word': 'multiple', 'tag': 0}, {'word': 'team', 'tag': 0}, {'word': 'win', 'tag': 0}, {'word': 'condition', 'tag': 0}, {

In [21]:
# Helper function (from Andrew) to make it easier to display preprocessed text w/o tagging
def towords(words):
    for word in words:
        print(word['word'], ' ')

In [22]:
towords(w.words)

neat  
social  
deduction  
game  
multiple  
team  
win  
condition  
happen  
time  


This is just some simple manipulation based on Andrew's code. A lot of pruning work is still needed. More details will come later.