# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data. This is a basic introduction to handling it. Our objective is to extract the sentiment (positive or negative) of review text and gain insight from it. We will work with Yelp review data.

## Metrics and scoring

The first two questions ask you to build models of increasing complexity that predict the rating of a review from its text. The grader evaluates your model on a test set and compares its $R^2$ score to that of our reference solution on the same test set. It **is** possible to receive a score greater than one, indicating that you've beaten our reference model. See how high you can go!
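As a reminder of what $R^2$ measures, here is a minimal pure-Python sketch. The helper name `r2_score_manual` and the ratings are made up for illustration, and the sketch covers only the metric itself, not how the grader combines your score with the reference score.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


true_stars = [1, 4, 4, 3, 5]  # made-up ratings

perfect = r2_score_manual(true_stars, true_stars)   # exact predictions
baseline = r2_score_manual(true_stars, [3.4] * 5)   # always predict the mean (3.4)
```

A perfect model scores 1, and a model that always predicts the mean rating scores 0, so any useful model should land between the two.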

The final two questions ask only for the result of a calculation, and your results will be compared directly to those of a reference solution.

## Download and parse the data


The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in `json` package has a `loads` function that converts a JSON string into a Python dictionary. We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` package, but is *substantially* faster (at the cost of non-robust handling of malformed JSON). We will use that inside a list comprehension to get a list of dictionaries:

In [4]:
import gzip
import ujson as json

with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]
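To see the line-by-line pattern in isolation, the following sketch writes two made-up records to a temporary gzipped file and reads them back. It uses the built-in `json` package rather than `ujson`, and the file name is hypothetical. It opens the file in text mode (`"rt"`); `gzip.open` defaults to binary mode, in which each line is a `bytes` object, which the built-in `json.loads` also accepts.

```python
import gzip
import json
import os
import tempfile

# Two made-up records standing in for Yelp reviews
records = [
    {"stars": 5, "text": "Great tacos"},
    {"stars": 1, "text": "Cold fries"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_reviews.json.gz")  # hypothetical file name

    # Write one JSON object per line into a gzip-compressed file
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read it back: each line is parsed into its own dictionary
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]
```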

The scikit-learn API requires that we keep labels (in this case, the star ratings) and features in separate data structures.

In [5]:
stars = [row['stars'] for row in data]

# Questions


## Question 1: bag_of_words_model

Build a linear model predicting the star rating based on the text reviews. Apply the bag-of-words model using the [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to produce a feature matrix giving the counts of each word in each review.
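To make the bag-of-words representation concrete, here is a minimal pure-Python sketch on two made-up reviews. It only illustrates the idea: `CountVectorizer` additionally lowercases the text, uses a token pattern that by default drops single-character tokens, and returns a sparse matrix.

```python
from collections import Counter

docs = ["good food good service", "bad food"]  # made-up reviews

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, holding the count of each vocabulary word
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])
```

Here `vocabulary` is `['bad', 'food', 'good', 'service']` and `count_matrix` is `[[0, 1, 2, 1], [1, 1, 0, 0]]`. Word order within a review is discarded, which is why the model is called a "bag" of words.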

**Hints**:
1. You will need to extract the review text from the raw input data, a list of dictionaries. You can take an approach similar to the one you took in the `ml` miniproject by first converting the data into a pandas data frame and then using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer), or you can build a custom transformer to extract the text. Either way, remember that `CountVectorizer` accepts a 1D array of text as input to its `transform` method.

1. Try choosing different values for `min_df` (minimum document frequency cutoff) and `max_df` in `CountVectorizer`. Setting `min_df` to zero admits rare words which might only appear once in the entire corpus.  This is both prone to overfitting and makes your data unmanageably large.  Don't forget to use cross-validation to select the right value.

1. Try using [`LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) or [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge). There is also [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridge#sklearn.linear_model.RidgeCV), which has built-in leave-one-out cross-validation. If the memory footprint is too big, try switching to [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor). Don't forget to search for the optimal value of the regularization parameter. How do the regularization parameter `alpha` and the values of `min_df` and `max_df` from `CountVectorizer` change the answer?

1. You will likely pick up several hyperparameters between the vectorization step and the regularization of the predictor. While it is more strictly correct to do a grid search over all of them at once, this can take a long time. Quite often, doing a grid search over a single hyperparameter at a time can produce similar results.  Alternatively, the grid search may be done over a smaller subset of the data, as long as it is representative of the whole.

1. Finally, assemble a pipeline that will transform the data from a list of dictionaries all the way to predictions.  This will allow you to submit the model's `predict` method to the grader for scoring, as the test set used by the grader is a list of dictionaries.
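The hints above can be sketched end to end on toy data. This is an illustration under stated assumptions, not the reference solution: the `TextExtractor` helper, the step names, the eight made-up reviews, and the grid values are all hypothetical, and a real search would use the full training set with a wider grid.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pull one field out of a list of dictionaries."""

    def __init__(self, key="text"):
        self.key = key

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [row[self.key] for row in X]


# Made-up reviews; every review shares "food" and "service" so that
# min_df=2 never empties the vocabulary inside a cross-validation fold.
toy_data = [
    {"text": "great food great service", "stars": 5},
    {"text": "terrible food rude service", "stars": 1},
    {"text": "good food friendly service", "stars": 4},
    {"text": "bad food slow service", "stars": 2},
    {"text": "great food friendly service", "stars": 5},
    {"text": "terrible food slow service", "stars": 1},
    {"text": "good food great service", "stars": 4},
    {"text": "bad food rude service", "stars": 2},
]
toy_stars = [row["stars"] for row in toy_data]

pipeline = Pipeline([
    ("extractor", TextExtractor("text")),
    ("vectorizer", CountVectorizer()),
    ("regressor", Ridge()),
])

# Search over the vectorizer and regressor hyperparameters together
search = GridSearchCV(
    pipeline,
    param_grid={
        "vectorizer__min_df": [1, 2],
        "regressor__alpha": [0.1, 1.0, 10.0],
    },
    scoring="neg_mean_squared_error",
    cv=2,
)
search.fit(toy_data, toy_stars)

# The fitted search accepts the same list-of-dictionaries input
predictions = search.predict([
    {"text": "great food friendly service"},
    {"text": "terrible food slow service"},
])
```

Because the extractor sits inside the pipeline, `search.predict` takes raw dictionaries directly, which is the input format the grader supplies.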

In [6]:
text = [row['text'] for row in data]

In [7]:
import pandas as pd

df = pd.DataFrame(data)
df.head()

| | votes | user_id | review_id | stars | date | text | type | business_id |
|---|---|---|---|---|---|---|---|---|
| 0 | {'funny': 0, 'useful': 0, 'cool': 0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before... | review | vcNAWiLM4dR7D2nwwJ7nCA |
| 1 | {'funny': 6, 'useful': 0, 'cool': 0} | ZYaumz29bl9qHpu-KVtMGA | ow1c4Lcl3ObWxDC2yurwjQ | 4 | 2009-05-04 | If you like lot lizards, you'll love the Pine ... | review | JwUE5GmEO-sH1FuwJgKBlQ |
| 2 | {'funny': 0, 'useful': 0, 'cool': 0} | EEYwj6_t1OT5WQGypqEPNg | 4iPPOQIo5Mr1NAUPUgCUrQ | 4 | 2011-03-31 | Only went here once about a year and a half ag... | review | JwUE5GmEO-sH1FuwJgKBlQ |
| 3 | {'funny': 0, 'useful': 1, 'cool': 0} | MnXcXwr0keJpkIiwuPsOKg | _utPYHIdXeq8CqQ4iYD1bw | 3 | 2012-01-08 | Ate a Saturday morning breakfast at the Pine C... | review | JwUE5GmEO-sH1FuwJgKBlQ |
| 4 | {'funny': 0, 'useful': 1, 'cool': 0} | wC8r-m6KHifL6R2i8ok8yg | gksnzyc9jQ9hNXESjvTrQw | 3 | 2012-08-26 | This is definitely not your usual truck stop. ... | review | JwUE5GmEO-sH1FuwJgKBlQ |


In [8]:
print(stars[:3])

[1, 4, 4]


In [9]:
print(text[:3])

["I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you call the office, they'll put you through to a voice mail, that NO ONE ever answers or returns your call. Both my adult children and husband have decided to leave this practice after experiencing such frustration. The entire office has an attitude like they are doing you a favor. Give me a break! Stay away from this doc and the practice. You deserve better and they will not be there when you really ne

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [11]:
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

In [12]:
model = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [100, 500]},
    scoring='neg_mean_squared_error',
    cv=5
    )

In [13]:
# Import the stop-word set explicitly instead of relying on attribute access via `spacy`
from spacy.lang.en.stop_words import STOP_WORDS

def tokenize_lemma(text):
    return [w.lemma_.lower() for w in nlp(text)]

stop_words_lemma = set(tokenize_lemma(' '.join(STOP_WORDS)))

In [14]:
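bag_of_words_vectorizer = HashingVectorizer(
    n_features=1000,
    token_pattern=None,  # unused because a custom tokenizer is supplied
    # Passed as a list: recent scikit-learn releases reject a set here
    stop_words=list(stop_words_lemma),
    tokenizer=tokenize_lemma,
)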

In [15]:
from sklearn import base

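class ColumnTransformer(base.BaseEstimator, base.TransformerMixin):
    """Extract review text from a list of dictionaries (not sklearn.compose.ColumnTransformer)."""

    def __init__(self, col_names):
        # Store the argument unchanged so get_params and clone behave as scikit-learn expects
        self.col_names = col_names

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, X):
        # The vectorizer needs a 1D list of strings, so only the first column name is used
        return [row[self.col_names[0]] for row in X]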

In [16]:
from sklearn.pipeline import Pipeline

bag_of_words_model = Pipeline([
    ('Transformer', ColumnTransformer(['text'])),
    ('Vectorizer', bag_of_words_vectorizer),
    ('Classifier', Ridge(alpha=430)),
])

In [20]:
grader.score('nlp__bag_of_words_model', bag_of_words_model.predict)

Your score: 0.9340


## Question 2: bigram_model

...

## Question 3: word_polarity

Let's consider a different approach and try to derive some insight from our analysis.  

We want to determine the most "polarizing words" in the corpus of reviews.  In other words, we want to identify words that strongly signal a review is either positive or negative.  For example, we understand that a word like "terrible" will most likely appear in negative rather than positive reviews.  

During training, the [naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#) calculates probabilities such as $Pr(\textrm{terrible}\ |\ \textrm{negative}),$ the probability that the word "terrible" appears in the review text, given that the review is negative.  Using these probabilities, we can define a **polarity score** for each word $w$,

$$\textrm{polarity}(w) = \log\left(\frac{Pr(w\ |\ \textrm{positive})}{Pr(w\ |\ \textrm{negative})}\right).$$
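As a worked example with made-up probabilities (not taken from the Yelp data), suppose $Pr(\textrm{terrible}\ |\ \textrm{positive}) = 0.001$ and $Pr(\textrm{terrible}\ |\ \textrm{negative}) = 0.02$. Taking the natural logarithm,

$$\textrm{polarity}(\textrm{terrible}) = \log\left(\frac{0.001}{0.02}\right) = \log(0.05) \approx -3.0.$$

A word that is equally likely in both classes has polarity $\log(1) = 0$. Under this definition, positive scores signal positive reviews and negative scores signal negative reviews.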

Polarity analysis is an example where a simpler model (naive Bayes) offers more interpretability than more complicated models.  Aside from this, naive Bayes models are easy to train, the training process is parallelizable, and these models lend themselves well to online learning.  Given enough training data, naive Bayes models have performed well in NLP applications such as spam filtering.  

For this problem, you are asked to determine the top 25 most positive polar words and the 25 most negative polar words.  For this analysis, you should:

1.  **Filter** the collection of reviews you were using above to **only keep** the one-star and five-star reviews. Since these are the "most polar" reviews, it should give us the most polarizing words.   
1.  Use the naive Bayes model, `MultinomialNB`.  
1.  Use TF-IDF weighting.
1.  Remove stop words.
1.  As mentioned, generate a (Python) list with the most positive (25 words) and most negative (25 words) polar words.  

A naive Bayes model (after training) stores the log of the probabilities in an attribute of the model.  It is a `numpy` array of shape (number of classes, number of features).  You will need the mapping from feature indices to words to find the most polarizing words.  
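The following sketch shows one way to go from the fitted model to polar words, using four made-up reviews in place of the Yelp data. The rows of `feature_log_prob_` follow the order of `classes_`, which scikit-learn sorts (alphabetically for string labels), so the sketch looks up each row by label instead of assuming an index. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews standing in for the one-star and five-star subsets
toy_texts = [
    "great food great service",
    "amazing great place",
    "terrible food rude staff",
    "rude terrible service",
]
toy_labels = ["positive", "positive", "negative", "negative"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(toy_texts)
nb_model = MultinomialNB().fit(features, toy_labels)

# Rows of feature_log_prob_ are ordered like classes_: ['negative', 'positive']
classes = list(nb_model.classes_)
pos_row = classes.index("positive")
neg_row = classes.index("negative")

# log(a / b) = log(a) - log(b), matching the polarity definition above
polarity = nb_model.feature_log_prob_[pos_row] - nb_model.feature_log_prob_[neg_row]

# vocabulary_ maps word -> column index; sorting by index recovers column order
words = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

order = np.argsort(polarity)                 # ascending polarity
most_negative = words[order[:2]].tolist()    # lowest scores
most_positive = words[order[-1:]].tolist()   # highest score
```

On this toy corpus the most positive word is "great" and the two most negative words are "rude" and "terrible". Looking up rows by label guards against accidentally computing the ratio upside down.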

In [32]:
pos_stars = [row['stars'] for row in data if row['stars']==5]
neg_stars = [row['stars'] for row in data if row['stars']==1]

In [33]:
pos_stars[:3]

[5, 5, 5]

In [34]:
neg_stars[:3]

[1, 1, 1]

In [35]:
pos_text = [row['text'] for row in data if row['stars']==5]
neg_text = [row['text'] for row in data if row['stars']==1]

In [36]:
pos_text[:3]

["OMG!  The bakery items at Pinecone are AMAZING!   Cinnamon rolls as big as your head and absolutely scrumptious!   The food in the restaurant is great for truck stop food.  We've only come for breakfast but stop every single time we head up to Rhinelander,  WI.  Do NOT miss this place if even only for their baked goods!",
 'Very nice and clean place to have breakfast or lunch',
 'I eat here regularily because it is consistently good.  Very large menu which makes it tough to choose but never disappointed!']

In [37]:
neg_text[:3]

["I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you call the office, they'll put you through to a voice mail, that NO ONE ever answers or returns your call. Both my adult children and husband have decided to leave this practice after experiencing such frustration. The entire office has an attitude like they are doing you a favor. Give me a break! Stay away from this doc and the practice. You deserve better and they will not be there when you really ne

In [38]:
q3_stars = [row['stars'] for row in data if row['stars']==1 or row['stars']==5]

In [41]:
# We're only keeping the one and five star reviews
grader.check(len(polar_data) == 116576)

True

In [42]:
label  = ['positive']*len(pos_text) + ['negative']*len(neg_text)

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

q3_stem_tfidf = TfidfVectorizer(
    # Passed as a list: recent scikit-learn releases reject a set here
    stop_words=list(STOP_WORDS)
    )

In [44]:
fitted_vec = q3_stem_tfidf.fit(pos_text + neg_text)



In [45]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
names = fitted_vec.get_feature_names_out()



In [46]:
transf_vec = fitted_vec.transform(pos_text + neg_text)

In [47]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(transf_vec, label)

MultinomialNB()

In [48]:
prob = clf.feature_log_prob_
prob

array([[ -7.62000205,  -9.43918861, -12.38338493, ..., -12.21192181,
        -12.21192181, -12.21192181],
       [ -8.45216194,  -9.88847859, -13.20583639, ..., -13.27227032,
        -13.27227032, -13.27227032]])

In [49]:
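# A difference of logs is the log of a ratio. clf.classes_ is sorted alphabetically
# (['negative', 'positive']), so this is log Pr(w | negative) - log Pr(w | positive):
# the sign is flipped relative to the polarity defined above, and the most
# positive words get the most negative values.
pol = prob[0] - prob[1]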
pol

array([0.83215989, 0.44928998, 0.82245146, ..., 1.06034851, 1.06034851,
       1.06034851])

In [50]:
merged_list = list(zip(names, pol))  # pair each word with its score
merged_list

[('00', 0.8321598905816474),
 ('000', 0.4492899809542106),
 ('0000', 0.8224514609533173),
 ('000000', 1.184149556152196),
 ('000000001', 1.3664092055207337),
 ('000s', 0.7590072422127498),
 ('000x', 1.1594000147873338),
 ('001871e3ce6c', 1.1129340643996812),
 ('003', 0.7168244349430566),
 ('0049', 1.0688727285345507),
 ('007', 0.1436739074329001),
 ('007851', 0.9464152098024758),
 ('00a', 0.4725633996404266),
 ('00am', 0.10646749123797328),
 ('00ish', 0.4749285320551344),
 ('00p', 0.8167044131755397),
 ('00person', 0.7437144397668991),
 ('00pm', 0.48005147763764455),
 ('00s', 1.150944420409255),
 ('00sf', 0.7491631012096214),
 ('01', 0.77946524428876),
 ('0101', 0.7453720031170246),
 ('013', 0.7848516811762938),
 ('017', 0.7304440294843602),
 ('01am', 0.763395853778503),
 ('01pm', 1.545895182600443),
 ('02', 0.8154027390808523),
 ('0200', 1.0671706370928664),
 ('0202', 1.046646921847941),
 ('0241', 0.609705850783028),
 ('025', 0.6532751157034831),
 ('027', 0.7806687159980257),
 ('028',

In [51]:
sorted_list = sorted(merged_list, key=lambda x: x[1])
sorted_list

[('perfection', -3.5924380554037807),
 ('delicious', -3.224325219305898),
 ('fantastic', -3.2109407647884005),
 ('gem', -3.195662656594118),
 ('yummy', -3.1205299300772413),
 ('delish', -3.112369701609479),
 ('yum', -3.0897242078414475),
 ('amazing', -3.0625297940501373),
 ('impeccable', -3.047217805330714),
 ('refreshing', -3.02613688435798),
 ('excellent', -2.993246038261649),
 ('perfect', -2.954378016031912),
 ('superb', -2.947275046630674),
 ('outstanding', -2.944834265056249),
 ('notch', -2.9382804732292405),
 ('terrific', -2.9159537388106624),
 ('awesome', -2.913457334366239),
 ('divine', -2.8837276551369833),
 ('incredible', -2.8060864130085523),
 ('perfectly', -2.7783253350752863),
 ('favorites', -2.7636905044239075),
 ('favorite', -2.7139330470932297),
 ('wonderful', -2.7102368858739876),
 ('loved', -2.701463175999283),
 ('die', -2.6965378225862837),
 ('hooked', -2.6664504079069307),
 ('phenomenal', -2.655320547836771),
 ('heaven', -2.648783284458327),
 ('deliciousness', -2.64

In [52]:
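features = [word for (word, score) in sorted_list]
# sorted_list is ascending in log Pr(w | negative) - log Pr(w | positive), so the
# first 25 entries are the most positive words and the last 25 the most negative
polar_words = features[:25] + features[-25:]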
polar_words

['perfection',
 'delicious',
 'fantastic',
 'gem',
 'yummy',
 'delish',
 'yum',
 'amazing',
 'impeccable',
 'refreshing',
 'excellent',
 'perfect',
 'superb',
 'outstanding',
 'notch',
 'terrific',
 'awesome',
 'divine',
 'incredible',
 'perfectly',
 'favorites',
 'favorite',
 'wonderful',
 'loved',
 'die',
 'insult',
 'disgusted',
 'lukewarm',
 'terrible',
 'aweful',
 'apology',
 'rancid',
 'crooks',
 'awful',
 'inedible',
 'disgusting',
 'refund',
 'disrespectful',
 'rudest',
 'rude',
 'blamed',
 'horrible',
 'tasteless',
 'poisoning',
 'rudely',
 'unhelpful',
 'incompetent',
 'worst',
 'unprofessional',
 'unacceptable']

In [53]:
# polar_words = ['perfection'] * 50

In [54]:
grader.score('nlp__word_polarity', polar_words)

Your score: 1.0000


## Question 4: food_bigrams

...

*Copyright &copy; 2022 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material in whole is strictly prohibited.*