# Sentiment Analysis over Yelp reviews

This notebook shows how to build a text classifier that predicts whether a review is negative or positive. We use the Yelp dataset, made public through the Yelp Dataset Challenge. I am only doing this for fun and practice, so please don't hesitate to correct me if you see a mistake or have a question. The notebook will be improved over time as I implement newer algorithms and corrections.

The Yelp reviews consist of a list of JSON documents that contain the following fields:
    
- business_id
- date
- review_id
- stars
- text
- type
- user_id
- votes
    
To classify our reviews, we are going to use a Naive Bayes approach at first, then try a fancier model and see how much it improves. Classification is a supervised learning task, which means we deal with labelled data. Here, every review has a rating from 1 to 5 stars. Our classifier will use two classes, `pos` and `neg`, for positive and negative reviews. As an arbitrary choice, we will say that a positive review is one rated above 3 stars, so 4 or 5 stars count as `pos` and 1 to 3 stars count as `neg`.

I am programming in Python, along with libraries like `scikit-learn` or `gensim`. 

In [28]:
import numpy as np
import json

### Building a Corpus Class

To make our data easier to access, we are going to build a class called `JsonCorpus` and its parent class, `Corpus`.

`Corpus` will have two basic methods:

- `__init__(self, path)`, which stores the dataset path as an attribute
- `__iter__(self)`, so we can read the full dataset review by review. Since the dataset is pretty big, the whole thing won't fit in RAM. With this approach, we can train our model on the whole corpus one review at a time.

`JsonCorpus` will add a few methods to our `Corpus` class:

- `__parse_json(self, line)`, a private method that parses a JSON string into a Python object
- `__stars_to_sentiment(self, stars, pos_threshold)`, a private method that converts a star rating to `pos` or `neg` depending on `pos_threshold`
- `head(self, n, return_type)`, which returns the first `n` reviews in a specified format

In [95]:
class Corpus(object):
    def __init__(self, path):
        self.path = path
        
        
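    def __iter__(self):
        # stream the file line by line so the whole dataset never sits in RAM;
        # the with-block makes sure the file handle gets closed
        with open(self.path) as f:
            for line in f:
                yield line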


class JsonCorpus(Corpus):
    def __init__(self, path):
        super(JsonCorpus, self).__init__(path)
            
            
    def __parse_json(self, line):
        return json.loads(line)
    
    
    def __stars_to_sentiment(self, stars, pos_threshold=3):
        # strictly above the threshold counts as positive
        return 'pos' if stars > pos_threshold else 'neg'
            
        
    def head(self, n=1, return_type='json'):
        with open(self.path) as f:
            # named `docs` rather than `json` to avoid shadowing the json module
            docs = [self.__parse_json(next(f).strip()) for _ in range(n)]
            
            # strings are compared with ==, since `is` tests object identity
            if return_type == 'json':
                # returns a list of plain json documents
                return docs
            elif return_type == 'text_rating':
                # returns a list of [text, stars] documents
                return [[d['text'], d['stars']] for d in docs]
            elif return_type == 'text_sentiment':
                # returns a list of [text, sentiment] documents
                return [[d['text'], self.__stars_to_sentiment(d['stars'])] for d in docs]
            else:
                raise ValueError('invalid return_type')

In [96]:
corpus = JsonCorpus('../dataset/yelp_academic_dataset_review.json')

In [97]:
corpus.head()

[{u'business_id': u'5UmKMjUEUNdYWqANhGckJw',
  u'date': u'2012-08-01',
  u'review_id': u'Ya85v4eqdd6k9Od8HbQjyA',
  u'stars': 4,
  u'text': u'Mr Hoagie is an institution. Walking in, it does seem like a throwback to 30 years ago, old fashioned menu board, booths out of the 70s, and a large selection of food. Their speciality is the Italian Hoagie, and it is voted the best in the area year after year. I usually order the burger, while the patties are obviously cooked from frozen, all of the other ingredients are very fresh. Overall, its a good alternative to Subway, which is down the road.',
  u'type': u'review',
  u'user_id': u'PUFPaY9KxDAcGqfsorJp3Q',
  u'votes': {u'cool': 0, u'funny': 0, u'useful': 0}}]

In [98]:
corpus.head(1, 'text_sentiment')

[[u'Mr Hoagie is an institution. Walking in, it does seem like a throwback to 30 years ago, old fashioned menu board, booths out of the 70s, and a large selection of food. Their speciality is the Italian Hoagie, and it is voted the best in the area year after year. I usually order the burger, while the patties are obviously cooked from frozen, all of the other ingredients are very fresh. Overall, its a good alternative to Subway, which is down the road.',
  'pos']]
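
The `__iter__` method of `Corpus` exists so the file can be consumed as a stream instead of being loaded at once. As a rough illustration of that idea, here is a minimal sketch that groups a stream of JSON lines into small batches. It is written for Python 3, `iter_batches` is a hypothetical helper that is not part of the notebook, and the in-memory `fake_file` with three made-up reviews only stands in for the real Yelp file.

```python
import io
import json
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of parsed reviews, at most batch_size per list."""
    lines = iter(lines)
    while True:
        # islice pulls at most batch_size lines without reading any further
        batch = [json.loads(line) for line in islice(lines, batch_size)]
        if not batch:
            return
        yield batch


# Stand-in for the real file: three made-up reviews, one JSON document per line.
fake_file = io.StringIO(
    '{"text": "great tacos", "stars": 5}\n'
    '{"text": "cold fries", "stars": 2}\n'
    '{"text": "decent coffee", "stars": 4}\n'
)

batch_sizes = [len(batch) for batch in iter_batches(fake_file, batch_size=2)]
print(batch_sizes)  # [2, 1]
```

Only one batch is held in memory at a time, which is the property that matters when the dataset is larger than RAM.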

## NLP with scikit-learn

Now that our corpus is ready, let's dive into Natural Language Processing! We are first going to use the `scikit-learn` library, as it provides a lot of algorithms and is pretty easy to use.

### Introduction to classification, a playground example

With this playground example, we are going to build a classifier prototype to understand how `scikit-learn` works, and what classification is about.

The first idea is to convert raw words into vectors, so that the algorithm can work with them and classify sentences. We use what is called a bag-of-words approach: we gather all the words occurring in our corpus into a vocabulary, giving us an `N`-dimensional space, where `N` is the number of unique words in the vocabulary. We can then map our sentences to vectors in this `N`-dimensional space according to the words they contain. Things get clearer with a simple example:

Let's say our corpus consists of two sentences, "I feel good" and "I feel bad". Our vocabulary is `["I", "feel", "good", "bad"]`. This is a 4-dimensional space in which our two sentences can be expressed as vectors with components `[1, 1, 1, 0]` and `[1, 1, 0, 1]`. In other words, we have "vectorized" our textual data by expressing our sentences as vectors in a high-dimensional space. High-dimensional, because the vocabulary of a real corpus gets very big, and so does the vector space. Stacked as rows, our two vectors form a term-document matrix, because every document (review) is mapped to the terms in our vocabulary.
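
The two-sentence example can be reproduced in a few lines of plain Python. This is only a sketch of the idea: `build_vocabulary` and `vectorize` are hypothetical helpers written for this illustration, and they split on whitespace without any normalization.

```python
def build_vocabulary(sentences):
    """Collect unique words in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def vectorize(sentence, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(term) for term in vocabulary]


sentences = ["I feel good", "I feel bad"]
vocabulary = build_vocabulary(sentences)
vectors = [vectorize(s, vocabulary) for s in sentences]

print(vocabulary)  # ['I', 'feel', 'good', 'bad']
print(vectors)     # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Note that `CountVectorizer` behaves a little differently with its default settings: it lowercases the text and ignores single-character tokens such as "I", so its vocabulary for these two sentences would not be identical to the one above.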

Then, our algorithm is going to learn how to classify based on the training data we feed it. Recall that our training data consists of texts and their corresponding labels. We use a Naive Bayes classifier because it is one of the simplest classifiers to begin with. It works as follows: if we feed our algorithm `["I hate junk food", "neg"]`, it will estimate, for each of the words "I", "hate", "junk" and "food", the probability of seeing that word in a negative review. Once training is complete, the algorithm can generalize to reviews it hasn't seen yet: it looks at the words in each review and combines their probabilities to compute how likely the review is to be positive or negative.
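
To make the word-probability idea concrete, here is a tiny Naive Bayes classifier written from scratch. It is a sketch for Python 3 and not what `scikit-learn` does internally: the `train` and `predict` functions and the four training sentences are made up for this illustration, and add-one smoothing is assumed so that unseen words do not zero out a class.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts


def predict(text, word_counts, class_counts):
    """Return the class with the highest log-probability for the text."""
    vocabulary = set(w for counts in word_counts.values() for w in counts)
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        # log prior: fraction of training documents with this label
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            # add-one smoothing: every word gets a count of at least 1
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training sentences, two per class.
training_data = [
    ("I hate junk food", "neg"),
    ("the service was bad", "neg"),
    ("I love this place", "pos"),
    ("the food was good", "pos"),
]
word_counts, class_counts = train(training_data)

print(predict("bad food", word_counts, class_counts))    # neg
print(predict("good place", word_counts, class_counts))  # pos
```

Working in log-probabilities avoids multiplying many small numbers together, which would quickly underflow on long reviews.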

We start easy by only loading 100 reviews and the corresponding labels.

In [99]:
# read the file once, then separate texts from labels
sample = corpus.head(100, 'text_sentiment')
reviews = [c[0] for c in sample]
labels = [c[1] for c in sample]

We then initialize a `CountVectorizer` to build the term-document matrix `X_train_counts`. As explained earlier, it has one row per review and one column per vocabulary word. So here, there are 2176 unique words within our first 100 reviews.

In [144]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(reviews)

print "term-document matrix shape is:", X_train_counts.shape

term-document matrix shape is: (100, 2176)


Now we initialize our classifier, then train it with the term-document matrix and the target labels.

In [142]:
from sklearn.naive_bayes import MultinomialNB

multinomial_naive_bayes = MultinomialNB()
%time clf = multinomial_naive_bayes.fit(X_train_counts, labels)

CPU times: user 2.61 ms, sys: 2.37 ms, total: 4.98 ms
Wall time: 2.73 ms


Good, now let's try it by making up reviews and seeing if it can guess the sentiment:

In [127]:
print '"this is good food" classified as:',\
    clf.predict(count_vectorizer.transform(['this is good food']))
print '"that place was bad" classified as:',\
    clf.predict(count_vectorizer.transform(['that place was bad']))

"this is good food" classified as: ['pos']
"that place was bad" classified as: ['neg']


Hooray, it worked! But wait, this is just the beginning: we have no idea whether our model really works or if we just got lucky with two fairly easy examples. Read on to discover how we can improve this classifier!

### Evaluating the model

As a first try, let's keep the same model and train it on more data. We will also evaluate more precisely how our model performs, and to do so we are going to split the dataset into training and test data. A common choice is 80% for training and 20% for testing. Our goal is to see how well the trained model generalizes to reviews it hasn't seen during training.

Let's load 20000 reviews and their labels, then split this dataset into training and test sets with `scikit-learn`'s `train_test_split` function.

In [136]:
from sklearn.model_selection import train_test_split

# read the file once, then separate texts (X) from labels (y)
sample = corpus.head(20000, 'text_sentiment')
X, y = [c[0] for c in sample], [c[1] for c in sample]
    
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=42)

Now we will train the model. We now use a `Pipeline`, a class that makes it easy to chain steps, like our vectorizer followed by the classifier. Once the model is trained, the `metrics` module reports precision, recall and F1-score on the test set.

In [145]:
from sklearn import metrics
from sklearn.pipeline import Pipeline

text_clf_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB())
])

%time text_clf_1.fit(X_train, y_train)

print metrics.classification_report(y_test, text_clf_1.predict(X_test))

CPU times: user 2.41 s, sys: 166 ms, total: 2.58 s
Wall time: 2.51 s
             precision    recall  f1-score   support

        neg       0.77      0.72      0.75      1433
        pos       0.85      0.88      0.87      2567

avg / total       0.82      0.82      0.82      4000



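The weighted averages of precision, recall and F1-score all come out at 0.82. Since support-weighted recall is the same thing as accuracy, about 82% of our test reviews have been correctly classified. Is it good? Not really: 2567 of the 4000 test reviews are positive, so always answering `pos` would already be right about 64% of the time. We sure can do better!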
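
As a sanity check on the report above, the overall accuracy can be recovered from the per-class recall and support columns, and compared with a trivial baseline. The numbers below are copied from the report. Since the recall values are rounded to two decimals, the result is an approximation.

```python
# Per-class recall and support, copied from the classification report.
report = {
    "neg": {"recall": 0.72, "support": 1433},
    "pos": {"recall": 0.88, "support": 2567},
}

total = sum(row["support"] for row in report.values())

# Recall is the fraction of a class's reviews that were labelled correctly,
# so recall * support approximates the number of correct predictions per class.
correct = sum(row["recall"] * row["support"] for row in report.values())
accuracy = correct / total

# Accuracy of a trivial model that always predicts the most frequent class.
baseline = max(row["support"] for row in report.values()) / total

print("estimated accuracy: %.2f" % accuracy)  # 0.82
print("majority baseline:  %.2f" % baseline)  # 0.64
```

The gap between the two figures, rather than the raw 0.82, is what tells us how much the model has actually learned.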

### Trying another model

In [141]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

text_clf_2 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', 
                          penalty='l2', 
                          alpha=1e-3, 
                          n_iter=5, 
                          random_state=42)),
    ])

%time text_clf_2.fit(X_train, y_train)
print '\ntext_clf_2\n %s \n' %metrics.classification_report(y_test, 
                                                            text_clf_2.predict(X_test))

CPU times: user 2.54 s, sys: 84.4 ms, total: 2.62 s
Wall time: 2.65 s

text_clf_2
              precision    recall  f1-score   support

        neg       0.90      0.61      0.73      1433
        pos       0.82      0.96      0.88      2567

avg / total       0.85      0.84      0.83      4000
 



In [140]:
from sklearn.model_selection import GridSearchCV
import pandas as pd

In [94]:
print text_clf_2.predict(['oh I love this place it is so good the food is nice'])
print text_clf_2.predict(['the food was really bad'])

['pos']
['neg']


In [21]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3) 
}

In [22]:
gs_clf = GridSearchCV(text_clf_2, parameters, n_jobs=-1)

In [23]:
%time gs_clf.fit(X_train, y_train)

CPU times: user 11.6 s, sys: 1.38 s, total: 13 s
Wall time: 3min 28s


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...     penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (1, 2), (1, 3)], 'tfidf__use_idf': (True, False), 'clf__alpha': (0.01, 0.001)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [24]:
pd.DataFrame(gs_clf.cv_results_ ).sort_values(by='mean_test_score', ascending=False)

| index | mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_clf__alpha | param_tfidf__use_idf | param_vect__ngram_range | params | rank_test_score | split0_test_score | split0_train_score | split1_test_score | split1_train_score | split2_test_score | split2_train_score | std_fit_time | std_score_time | std_test_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2.464347 | 1.183293 | 0.815149 | 0.840485 | 0.001 | True | (1, 1) | `{u'vect__ngram_range': (1, 1), u'tfidf__use_id...` | 1 | 0.80752 | 0.841245 | 0.814375 | 0.842064 | 0.823556 | 0.838146 | 0.069137 | 0.020419 | 0.00657 | 0.001687 |
| 9 | 2.527857 | 1.168161 | 0.793806 | 0.807724 | 0.001 | False | (1, 1) | `{u'vect__ngram_range': (1, 1), u'tfidf__use_id...` | 2 | 0.790734 | 0.806538 | 0.788401 | 0.809828 | 0.802284 | 0.806805 | 0.11291 | 0.036887 | 0.006069 | 0.001492 |
| 10 | 10.088305 | 3.299886 | 0.790896 | 0.810709 | 0.001 | False | (1, 2) | `{u'vect__ngram_range': (1, 2), u'tfidf__use_id...` | 3 | 0.789167 | 0.810233 | 0.784147 | 0.808708 | 0.799373 | 0.813186 | 0.127348 | 0.033627 | 0.006335 | 0.001859 |
| 11 | 21.759013 | 3.852263 | 0.779925 | 0.799515 | 0.001 | False | (1, 3) | `{u'vect__ngram_range': (1, 3), u'tfidf__use_id...` | 4 | 0.779096 | 0.80206 | 0.772056 | 0.797515 | 0.788625 | 0.79897 | 0.180668 | 0.072844 | 0.006789 | 0.001895 |
| 7 | 11.589886 | 3.412248 | 0.722239 | 0.725859 | 0.001 | True | (1, 2) | `{u'vect__ngram_range': (1, 2), u'tfidf__use_id...` | 5 | 0.727171 | 0.731079 | 0.721451 | 0.723416 | 0.718092 | 0.72308 | 0.499936 | 0.09524 | 0.003748 | 0.003694 |
| 8 | 24.081614 | 5.39006 | 0.646791 | 0.6375 | 0.001 | True | (1, 3) | `{u'vect__ngram_range': (1, 3), u'tfidf__use_id...` | 6 | 0.646822 | 0.639275 | 0.644872 | 0.637788 | 0.648679 | 0.635438 | 0.165116 | 0.091102 | 0.001554 | 0.00158 |
| 3 | 2.691615 | 1.147724 | 0.628731 | 0.628657 | 0.01 | False | (1, 1) | `{u'vect__ngram_range': (1, 1), u'tfidf__use_id...` | 7 | 0.628469 | 0.628751 | 0.629198 | 0.628498 | 0.628527 | 0.628722 | 0.340326 | 0.046332 | 0.000331 | 0.000113 |
| 0 | 2.713394 | 1.354113 | 0.628507 | 0.628507 | 0.01 | True | (1, 1) | `{u'vect__ngram_range': (1, 1), u'tfidf__use_id...` | 8 | 0.628469 | 0.628527 | 0.628527 | 0.628498 | 0.628527 | 0.628498 | 0.080793 | 0.047543 | 2.7e-05 | 1.4e-05 |
| 1 | 13.7965 | 4.137727 | 0.628507 | 0.628507 | 0.01 | True | (1, 2) | `{u'vect__ngram_range': (1, 2), u'tfidf__use_id...` | 8 | 0.628469 | 0.628527 | 0.628527 | 0.628498 | 0.628527 | 0.628498 | 1.050651 | 0.703418 | 2.7e-05 | 1.4e-05 |
| 2 | 24.997063 | 5.352116 | 0.628507 | 0.628507 | 0.01 | True | (1, 3) | `{u'vect__ngram_range': (1, 3), u'tfidf__use_id...` | 8 | 0.628469 | 0.628527 | 0.628527 | 0.628498 | 0.628527 | 0.628498 | 1.41984 | 0.046664 | 2.7e-05 | 1.4e-05 |
