Overview

Unstructured data, such as free text, makes up the vast majority of data. This is a basic introduction to handling it. Our objective is to extract the sentiment (positive or negative) from review text, using Yelp review data.

Your model will be assessed on the root mean squared error of the number of stars you predict. There is a reference solution, which should not be too hard to beat, and it has a score of 1.
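
As a quick illustration of the metric, here is a minimal sketch of root mean squared error in plain Python. The ratings are made up for the example, and how the grader rescales the error so that the reference solution scores 1 is not stated here, so this only shows the unscaled quantity.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of star ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, for illustration only
actual = [5, 3, 1, 4]
predicted = [4, 3, 2, 4]
print(rmse(actual, predicted))
```

Lower is better: a model that predicts every rating exactly has an RMSE of 0.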

Download the data here: http://thedataincubator.s3.amazonaws.com/coursedata/mldata/yelp_train_academic_dataset_review.json.gz

Download and parse the data

In [1]:
import re
import gzip
import nltk
import simplejson
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pylab as plt
import nltk.tokenize as tokenize
from sklearn import linear_model
from sklearn.externals import joblib
from sklearn import cross_validation, grid_search
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Read the gzipped file line by line; each non-empty line is one JSON-encoded review
with gzip.open('yelp_train_academic_dataset_review.json.gz') as data:
    json_data = [simplejson.loads(line) for line in data if line.strip()]
df = pd.DataFrame(json_data)

In [2]:
del(json_data)

I will first build a bag-of-words model. My strategy is to fit a linear model to the counts of the words in each document (review).

### Tokenization

I first need to tokenize the Yelp reviews to break them into individual words. For this process I used the `nltk` package and the `HashingVectorizer` from sklearn. `HashingVectorizer` hashes each token directly to a column index, so it never has to store the large vocabulary dictionary built from the Yelp reviews. For testing (and scoring against a reference) I am deploying all of my models to Heroku, so I wanted to keep things as small as possible. The Heroku box has only 500 MB of RAM and is easily overwhelmed. Without this limitation a `CountVectorizer` could be employed.

An additional limitation of the `HashingVectorizer` is that it does not record the mapping from words to column indices: if you want to go from an index back to a word, that information is simply lost. But here I am concerned with a bag-of-words model that predicts star ratings. I don't care which words predict 5 stars, so `HashingVectorizer` should be perfectly good (and small!).
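
To make the idea concrete, here is a toy sketch of the hashing trick. It is not scikit-learn's implementation: `zlib.crc32` and the 16-column vector are stand-ins I picked for a small, deterministic example, whereas `HashingVectorizer` uses MurmurHash3 and 2**20 columns by default.

```python
import zlib

def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in tokens:
        # crc32 is a stand-in hash chosen for determinism; scikit-learn uses a different hash
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector

doc = "great food great service".split()
vec = hashed_counts(doc)
print(vec)
```

Nothing in `vec` records which word landed in which column, which is exactly why the mapping cannot be inverted. Two different words can also collide in the same column, and a larger `n_features` makes that less likely.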

Common English words were filtered out by passing NLTK's English stop-word list to the `stop_words` parameter of `HashingVectorizer`. This removes words like "the", "and", etc., which shouldn't have any particular relevance to good or bad reviews.
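
The effect of stop-word filtering can be sketched in a few lines. The stop-word set below is a tiny hand-written one for illustration, not NLTK's list, and splitting on whitespace is a simplification of real tokenization.

```python
# A tiny hand-written stop-word set, used only for illustration;
# the notebook uses NLTK's much longer English list.
STOP_WORDS = {"the", "and", "a", "was", "is", "of"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The food was great and the service was slow"))
```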

### Model Fitting

I used a `RidgeCV` linear regression to fit my bag-of-words model to the number of review stars on Yelp. `RidgeCV` chooses its regularization strength by cross-validation, and I held out 20% of the reviews as a test set to check for overfitting.
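
For intuition, here is a rough NumPy sketch of ridge regression with its regularization strength picked by cross-validation. It is a simplification rather than what `RidgeCV` does internally: by default `RidgeCV` uses an efficient leave-one-out scheme, while this sketch uses plain k-fold splits, omits the intercept, and runs on synthetic data.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha * I)^-1 X^T y (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_mse(X, y, alpha, n_folds=5):
    """Mean squared error of ridge at a given alpha, averaged over k held-out folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        w = fit_ridge(X[train], y[train], alpha)
        errors.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return float(np.mean(errors))

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(scale=0.5, size=200)

alphas = [0.1, 1.0, 10.0]
scores = {alpha: cv_mse(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```

The candidate with the lowest held-out error is kept, so the penalty is tuned on data the model was not fit to.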

The `Pipeline` class of sklearn was used to quickly and easily link the preprocessing and estimator steps of my machine learning model.

In [None]:
hv = HashingVectorizer(norm='l2',stop_words=nltk.corpus.stopwords.words('english'))
hvcounts = hv.fit_transform(df['text'])

In [None]:
hvcounts_train, hvcounts_test, stars_train,stars_test = cross_validation.train_test_split(hvcounts,df['stars'],test_size=0.2,random_state=23)
ridge = linear_model.RidgeCV()
ridge.fit(hvcounts_train,stars_train)
score = ridge.score(hvcounts_test,stars_test)  # R^2 on the held-out reviews (not RMSE)
print(score)

In [None]:
q1pipe = Pipeline([
  ('bagofwords', hv),
  ('ridge', ridge)
])
joblib.dump(q1pipe,'/home/vagrant/miniprojects/questions/nlp-q1pipe.pkl')

The pickle dump from joblib is then a model ready for deployment on Heroku (where it beat the benchmark for a bag of words model).

Let's see what our model would predict for the first review!

In [None]:
q1pipe.predict([df['text'].iloc[0]])[0]

This first quick pass could be improved in a number of ways. I used the built-in "l2" normalization of HashingVectorizer, but I could instead try a TF-IDF (term frequency - inverse document frequency) weighting scheme to down-weight common words.
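
As a sketch of what TF-IDF weighting does, the snippet below uses the textbook form tf * log(N / df) on three made-up reviews. scikit-learn's `TfidfTransformer` uses a smoothed variant of the idf term, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Made-up reviews, for illustration only
docs = ["the food was great", "the service was slow", "the pizza was great"]
for row in tf_idf(docs):
    print(row)
```

Words that occur in every document ("the", "was") get a weight of zero, while words confined to one review get the largest weights.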

I can use cross validation with GridSearchCV to optimize the hyperparameters of the RidgeCV linear regressor or just test out the performance of other linear regression estimators. Alternatively, I could test out non-linear regression algorithms, but remember we want something small for deployment on Heroku!

Let's try stochastic gradient descent:

In [None]:
sgd = linear_model.SGDRegressor(alpha=0.0001, l1_ratio=0.15, eta0=0.01, power_t=0.25)
sgd.fit(hvcounts_train,stars_train)
score = sgd.score(hvcounts_test,stars_test)
print(score)
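
For intuition about what the estimator above is doing, here is a bare-bones sketch of stochastic gradient descent for linear regression with an L2 penalty. It is a simplification under my own assumptions: it uses a constant learning rate, whereas `SGDRegressor` by default decays the rate with an inverse-scaling schedule controlled by `eta0` and `power_t`, and the data is synthetic.

```python
import random

def sgd_linear_regression(X, y, learning_rate=0.01, alpha=0.0001, epochs=500, seed=0):
    """Fit y ~ w.x + b by stochastic gradient descent on squared error with an L2 penalty."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order each epoch
        for i in order:
            prediction = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            error = prediction - y[i]
            # Gradient step on 0.5 * error**2 + 0.5 * alpha * ||w||^2 for this one sample
            w = [wj - learning_rate * (error * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= learning_rate * error
    return w, b

# Synthetic data following y = 2*x + 1, for illustration only
X = [[x / 10.0] for x in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
w, b = sgd_linear_regression(X, y)
print(w, b)
```

Each update looks at a single sample, which keeps memory use low on a large review matrix but makes the result sensitive to the learning rate and the penalty `alpha`.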

In [None]:
cv = cross_validation.KFold(len(df['stars']), n_folds=10, shuffle=True)
params = {'alpha':np.logspace(-6,-3,10)}
grid = grid_search.GridSearchCV(linear_model.SGDRegressor(),cv=cv,param_grid=params)
grid.fit(hvcounts,df['stars'])
grid.best_score_

As a first pass, a quickly optimized SGDRegressor is outperformed by RidgeCV without any manual hyperparameter tuning!

Bigrams

My first bag-of-words model only considered single words (monograms), but HashingVectorizer can easily accommodate word pairs (bigrams). Do pairs of words have predictive power for the number of stars in a Yelp review?
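
As a small sketch of what a bigram is, the helper below (a hypothetical function, not part of sklearn) slides a window of `n` tokens over a sentence. Setting `ngram_range=(2, 2)` on a vectorizer keeps only the pairs, while `(1, 2)` would keep single words and pairs together.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "huevos rancheros were amazing".split()
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
```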

In [None]:
hvl2 = HashingVectorizer(norm='l2',ngram_range=(2, 2),stop_words=nltk.corpus.stopwords.words('english'))
hvcounts = hvl2.fit_transform(df['text'])

In [None]:
hvcounts_sp_train, hvcounts_sp_test, stars_train,stars_test = cross_validation.train_test_split(hvcounts,df['stars'],test_size=0.2)
ridge_sp = linear_model.Ridge()
ridge_sp.fit(hvcounts_sp_train,stars_train)
score = ridge_sp.score(hvcounts_sp_test,stars_test)
print(score)

In [None]:
ridge_sp = linear_model.Ridge(alpha=4.)
ridge_sp.fit(hvcounts_sp_train,stars_train)
score = ridge_sp.score(hvcounts_sp_test,stars_test)
#joblib.dump(ridge,'/home/vagrant/miniprojects/questions/nlp-q3ridge.pkl')
print(score)

In [None]:
ridge_sp = linear_model.Ridge(alpha=2.)
ridge_sp.fit(hvcounts_sp_train,stars_train)
score = ridge_sp.score(hvcounts_sp_test,stars_test)
#joblib.dump(ridge,'/home/vagrant/miniprojects/questions/nlp-q3ridge.pkl')
print(score)

In [None]:
q2pipe = Pipeline([
    ('hv',hvl2),
    ('ridge',ridge_sp)
])

It works, but...

Not as well as monograms. I next combined the monogram and bigram predictions to get a better overall prediction of Yelp review stars.

In [None]:
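text_train, text_test, stars_train, stars_test = cross_validation.train_test_split(df['text'],df['stars'],test_size=0.2)
# Stack the two models: their predictions become the features of a second-level regression
pred_q1_train = q1pipe.predict(text_train)
pred_q2_train = q2pipe.predict(text_train)
pred_train = [[p1,p2] for p1,p2 in zip(pred_q1_train,pred_q2_train)]
pred_q1_test = q1pipe.predict(text_test)
pred_q2_test = q2pipe.predict(text_test)
pred_test = [[p1,p2] for p1,p2 in zip(pred_q1_test,pred_q2_test)]
lm = linear_model.LinearRegression()
lm.fit(pred_train,stars_train)
# Note: this split differs from the ones used to fit the base models, so the score is optimistic
score = lm.score(pred_test,stars_test)
print(score)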

In [None]:
joblib.dump(lm,'/home/vagrant/miniprojects/questions/nlp-q3final.pkl')

A simple linear regression combining a model trained on monograms and a model trained on bigrams has stronger predictive power than either alone!
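
Why combining helps can be sketched with synthetic numbers. In the toy example below, two made-up "base model" predictions err independently around the true rating, and a least-squares fit learns how to weight them. The data, the noise levels, and the half-and-half split are all assumptions for illustration, not results from the Yelp models.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-ins: two base-model predictions that err independently around the truth
stars = rng.uniform(1, 5, size=1000)
pred_a = stars + rng.normal(scale=1.0, size=1000)
pred_b = stars + rng.normal(scale=1.0, size=1000)

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Second-level features: the two predictions plus a constant column for the intercept
features = np.column_stack([pred_a, pred_b, np.ones_like(stars)])
train, test = slice(0, 500), slice(500, 1000)

# Fit the combining weights on the first half only, then score on the second half
weights = np.linalg.lstsq(features[train], stars[train], rcond=None)[0]
stacked = features[test] @ weights

rmse_a = rmse(stars[test], pred_a[test])
rmse_b = rmse(stars[test], pred_b[test])
rmse_stacked = rmse(stars[test], stacked)
print(rmse_a, rmse_b, rmse_stacked)
```

The gain comes from the two sets of errors being partly independent. If the monogram and bigram models made identical mistakes, stacking them would add nothing.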

This combined model beat the benchmark for a bigram model on the Heroku app.

Top restaurant bigrams

Looking only at reviews of restaurants, I next wanted to identify word pairs that appear together more often than their individual words would suggest. These might be strongly indicative of "foodie" phrases that you might expect to find in a Yelp review, such as "huevos rancheros".

We can find word pairs that occur consecutively more often than would be expected from the underlying probabilities of their individual words.

Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

$$\frac{p(w_1 w_2)}{p(w_1)\,p(w_2)}$$

is high.

This metric is, however, problematic when $p(w_1)$ and/or $p(w_2)$ are small, because a pair of rare words that happens to occur together even once gets a very high score. This can be fixed with Bayesian (additive) smoothing, which essentially adds a constant to every count before the probabilities are computed. This constant sets the scale for the number of times a word must appear in the overall corpus before it is considered relevant.
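
The effect of the smoothing constant can be seen on a toy corpus. The sketch below works with raw counts instead of probabilities (the two differ only by a constant factor, so the ranking is the same) and adds `alpha` to the single-word counts only. The corpus and the alpha values are made up for illustration.

```python
from collections import Counter

def bigram_scores(documents, alpha=1.0):
    """Score each bigram by count(w1 w2) / ((count(w1) + alpha) * (count(w2) + alpha))."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        pair: count / ((unigram_counts[pair[0]] + alpha) * (unigram_counts[pair[1]] + alpha))
        for pair, count in bigram_counts.items()
    }

# Toy corpus, for illustration only; the last document is a one-off pair of misspellings
docs = [
    "huevos rancheros good",
    "huevos rancheros good",
    "huevos rancheros good",
    "good food good price",
    "delicous fodo",
]

for alpha in (0.0, 2.0):
    scores = bigram_scores(docs, alpha)
    best = max(scores, key=scores.get)
    print(alpha, best)
```

Without smoothing, the pair of misspellings that occurs once ranks first. With `alpha=2.0`, the repeated phrase "huevos rancheros" overtakes it, which is the behaviour the smoothing is meant to produce.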

First I need to load a second data set that identifies which businesses are restaurants and do an SQL-style join on my two pandas dataframes. This will allow me to select only the reviews that correspond to restaurants and, by extension, bigrams that are specific to food.

In [None]:
# Load the business records, one JSON object per line
with gzip.open('/home/vagrant/miniprojects/questions/yelp_train_academic_dataset_business.json.gz') as data:
    json_data = [simplejson.loads(line) for line in data if line.strip()]
dfbiz = pd.DataFrame(json_data)

# Flag each business with 1 if any of its categories starts with 'Restaurants', else 0
dfbiz['restaurant'] = [int(any(re.match('Restaurants',cat) for cat in cats)) for cats in dfbiz['categories']]
# Inner join of reviews and businesses on business_id, then keep restaurant reviews only
df_big = pd.merge(df,dfbiz,on='business_id')
df_rest = df_big[df_big['restaurant']==1]

I also found it necessary at this stage to consider word lemmatization. Lemmatization is an NLP strategy to shrink the vocabulary by mapping words that share a root to a single form. For example, lemmatization should catch the plural form of a word and remove the trailing "s".

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

CountVectorizer

Previously I used HashingVectorizer to take advantage of its smaller pickle sizes for Heroku deployment. As discussed before, HashingVectorizer is not reversible: you lose which word corresponds to which index. Since I now want to know what the words are, I switched over to CountVectorizer.

In [None]:
cvbi = CountVectorizer(tokenizer=LemmaTokenizer(),ngram_range=(2,2),stop_words=nltk.corpus.stopwords.words('english'))
bi = cvbi.fit_transform(df_rest['text'])
cvmono = CountVectorizer(tokenizer=LemmaTokenizer(),ngram_range=(1,1),stop_words=nltk.corpus.stopwords.words('english'))
mono = cvmono.fit_transform(df_rest['text'])
bi_keys = list(cvbi.vocabulary_.keys()) #[key for key in cvbi.vocabulary_.keys() if not re.match('.*[0-9_-].*',key)]
mono_keys = list(cvmono.vocabulary_.keys()) #[key for key in cvmono.vocabulary_.keys() if not re.match('.*[0-9_-].*',key)]
bi_keys_split = [re.split(r'\s',key) for key in bi_keys]  # each bigram key as its two words

Bayesian smoothing function

I tried several different approaches for setting the alpha factor (as can be seen in the commented out lines). In general, an alpha set around the mean count of all words appeared to be roughly appropriate.

In [None]:
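def bayesian_smooth(vocab,keys,data,alpha_factor=1):
    """Return additively smoothed counts {term: count + alpha_factor} (counts, not normalized probabilities)."""
    N = data.sum()#float(sum([data[:,vocab[key]].sum() for key in keys]))
    d = float(len(keys))
    #alpha = float(alpha)
    count = np.array(data.sum(0))[0]  # total occurrences of each vocabulary term
    print(np.mean(count))  # mean count, a rough guide for choosing alpha
    #alpha = np.mean(count)*float(alpha_factor)
    bayes = {}
    for key in keys:
        bayes[key] = float(count[vocab[key]]+alpha_factor)
    return bayes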

In [None]:
monoalpha = 61
bialpha = 0
mono_vocab_smooth = bayesian_smooth(cvmono.vocabulary_,cvmono.vocabulary_.keys(),mono,monoalpha)
bi_vocab_smooth = bayesian_smooth(cvbi.vocabulary_,cvbi.vocabulary_.keys(),bi,bialpha)

First results

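I calculated p_w, the ratio of each smoothed bigram count to the product of its two smoothed word counts, and built the results into a dataframe for analysis and selection of the top 100. Because these are counts rather than normalized probabilities, p_w matches the statistic above only up to a constant factor, which does not change the ranking.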

In [None]:
p_w = [bi_vocab_smooth[b]/(mono_vocab_smooth[s[0]]*mono_vocab_smooth[s[1]]) for b,s in zip(bi_keys,bi_keys_split)]
dfq4 = pd.DataFrame({'prob w':p_w,
        'bi keys':bi_keys,
        'bi keys split':bi_keys_split})
dfq4 = dfq4.sort_values('prob w',ascending=False)  # sort_values replaces the older DataFrame.sort
dfq4 = dfq4[dfq4['prob w'] != np.inf]
print(dfq4['prob w'].describe())
dfq4.head()

Slightly more in depth

Where does a phrase such as "huevos rancheros" appear in our list? I picked out the top phrase, huevos rancheros, and the 100th phrase to put them side by side.

In [None]:
x = dfq4.set_index('bi keys')
top100 = list(dfq4['bi keys'][:100])
#print type(top100)
print(x.xs(top100[0]))
print(x.xs('huevos rancheros'))
print(x.xs(top100[-1]))
top100

It works! (mostly)

I definitely see key word pairs (mostly for various ethnic foods, which is not surprising), including the alluring "spam musubi", which apparently is some horrible Spam-based 7-Eleven food that Hawaiians love. Who knew?