In [1]:
import seaborn as sns
sns.set()

# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data.  This notebook is a basic introduction to handling it.  Our objective is to extract sentiment (positive or negative) and gain insight from review text, using Yelp review data.

## Metrics and scoring

The first two questions ask you to build models, of increasing complexity, to predict the rating of a review from its text. The grader evaluates your model on a test set and compares its $R^2$ score to the score of our reference solution on the same test set. It **is** possible to receive a score greater than one, indicating that you've beaten our reference model. See how high you can go!

The final two questions ask only for the result of a calculation, and your results will be compared directly to those of a reference solution.

## Download and parse the data


The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in `json` package has a `loads` function that converts a JSON string into a Python dictionary. We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` package, but is *substantially* faster (at the cost of non-robust handling of malformed JSON). We will use that inside a list comprehension to get a list of dictionaries:

In [77]:
import gzip
import ujson as json
## https://mailuc-my.sharepoint.com/:u:/r/personal/rezaal_mail_uc_edu/Documents/yelp_train_academic_dataset_review_reduced.json.gz?csf=1&web=1&e=e9qC5e
with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]

The scikit-learn API requires that we keep labels (in this case, the star ratings) and features in separate data structures.

In [5]:
# data[:5]

In [6]:
# texts = [row['text'] for row in data]
# texts[:5]

In [7]:
stars = [row['stars'] for row in data]
len(stars)

253272

# Questions


## Question 1: bag_of_words_model

Build a linear model predicting the star rating based on the text reviews. Apply the bag-of-words model using the [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to produce a feature matrix giving the counts of each word in each review.

**Hints**:
1. You will need to extract the review text from the raw input data, a list of dictionaries. You can take an approach similar to the one you took in the `ml` miniproject by first converting the data into a pandas data frame and then using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer), or you can build a custom transformer to extract the text. Either way, remember that the `CountVectorizer` accepts as input to its `transform` method a 1D array of text.

1. Try choosing different values for `min_df` (minimum document frequency cutoff) and `max_df` in `CountVectorizer`. Setting `min_df` to zero admits rare words which might only appear once in the entire corpus.  This is both prone to overfitting and makes your data unmanageably large.  Don't forget to use cross-validation to select the right value.

1. Try using [`LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) or [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge). There is also [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridge#sklearn.linear_model.RidgeCV), which has built-in leave-one-out cross-validation. If the memory footprint is too big, try switching to [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor). Don't forget to search for the optimal value of the regularization parameter. How do the regularization parameter `alpha` and the values of `min_df` and `max_df` from `CountVectorizer` change the answer?

1. You will likely pick up several hyperparameters between the vectorization step and the regularization of the predictor. While it is more strictly correct to do a grid search over all of them at once, this can take a long time. Quite often, doing a grid search over a single hyperparameter at a time can produce similar results.  Alternatively, the grid search may be done over a smaller subset of the data, as long as it is representative of the whole.

1. Finally, assemble a pipeline that will transform the data from a list of dictionaries all the way to predictions.  This will allow you to submit the model's `predict` method to the grader for scoring, as the test set used by the grader is a list of dictionaries.
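
The hints above can be sketched end to end on a tiny made-up data set. This is only an illustration under stated assumptions: the `TextExtractor` class, the toy reviews, and the grid values are hypothetical and are not the settings used later in this notebook. The point is that one pipeline can take a list of dictionaries as input and that `GridSearchCV` can tune vectorizer and regressor hyperparameters together through the `step__parameter` naming convention.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pull the 'text' field out of each review dict."""

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X):
        return [row['text'] for row in X]


# Toy stand-in for the review data (invented for illustration).
toy_reviews = [
    {'text': 'great food and great service', 'stars': 5},
    {'text': 'terrible food and rude staff', 'stars': 1},
    {'text': 'good food, friendly staff', 'stars': 4},
    {'text': 'awful service, terrible wait', 'stars': 1},
    {'text': 'great place, friendly service', 'stars': 5},
    {'text': 'rude staff and awful food', 'stars': 2},
]
toy_stars = [row['stars'] for row in toy_reviews]

pipe = Pipeline([
    ('extract', TextExtractor()),
    ('vectorize', CountVectorizer()),
    ('regress', Ridge()),
])

# Search over one vectorizer setting and one regularization setting at once.
param_grid = {
    'vectorize__min_df': [1, 2],
    'regress__alpha': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(toy_reviews, toy_stars)

# The fitted search object predicts straight from the list of dictionaries.
toy_predictions = search.predict(toy_reviews)
print(search.best_params_)
```

On the real data a joint grid like this grows quickly, which is why the hints suggest searching one hyperparameter at a time or on a representative subset.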

In [8]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class extract_text(BaseEstimator, TransformerMixin):

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        return pd.Series([j['text'] for j in X])


# class ToDataFrame(BaseEstimator, TransformerMixin):
#     def fit(self, X, y=None):
#         # This transformer doesn't need to learn anything about the data,
#         # so it can just return self without any further processing
#         return self

#     def transform(self, X):
#         # Return a pandas data frame from X
#         return pd.DataFrame(X)


# to_data_frame = ToDataFrame()

In [9]:
gg = extract_text().fit_transform(data)

In [10]:
gg

0         I don't know what Dr. Goldberg was like before...
1         If you like lot lizards, you'll love the Pine ...
2         Only went here once about a year and a half ag...
3         Ate a Saturday morning breakfast at the Pine C...
4         This is definitely not your usual truck stop. ...
                                ...                        
253267    What a horrible time yesterday... shYEAH - and...
253268    My new favorite restaurant.  They have 22 diff...
253269    GreAt food awesome service . The best fish in ...
253270    I love this place! I think the staff struggle ...
253271    I visit here once or twice a month. Just to ge...
Length: 253272, dtype: object

In [11]:
# from sklearn.compose import ColumnTransformer
# text_selector = ColumnTransformer([
#     ('text_extract','passthrough' , ['text'])
# ])



# # ColumnTransformer([('something', SomeTransformer(), ['column'])])


# text_selector.fit_transform(gg)

In [12]:
import re
import nltk
import pandas as pd
from nltk.stem import PorterStemmer
# init stemmer
porter_stemmer = PorterStemmer()


def text_preprocessor(text):

    text = text.lower()
    text = re.sub("\\W", " ", text)  # remove special chars
#     text=re.sub("\\s+(in|the|all|for|and|on)\\s+"," _connector_ ",text) # normalize certain words

    # stem words
    words = re.split("\\s+", text)
    stemmed_words = [porter_stemmer.stem(word=word) for word in words]
    return ' '.join(stemmed_words)

In [59]:
from sklearn.feature_extraction.text import CountVectorizer
from spacy.lang.en.stop_words import STOP_WORDS


STOP_WORDS = STOP_WORDS.difference({'he', 'his', 'her', 'hers'})
STOP_WORDS = STOP_WORDS.union({'ll', 've', 'abov', 'afterward',
                               'alon', 'alreadi', 'alway', 'ani',
                               'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'doe', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'formerli', 'forti', 'ha', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'latterli', 'mani', 'meanwhil', 'moreov', 'mostli', 'nobodi',
                              'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'quit', 'realli', 'regard', 'seriou', 'sever', 'sinc', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'thi', 'thu', 'togeth', 'twelv', 'twenti', 'use', 'variou', 'veri', 'wa', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv', 'anywh', 'becau', 'el', 'elsewh', 'everywh', 'ind', 'otherwi', 'plea', 'somewh'})

count_vect = CountVectorizer(
    preprocessor=text_preprocessor,
    stop_words=STOP_WORDS,
    min_df=0.0001,
    max_df=0.9)

# parameters = {'max_df': np.linspace(0.0, 1.0, 11),
#               'min_df': np.linspace(0.0, 1.0, 11)}

# count_vect_grid = GridSearchCV(count_vect,
#                                parameters,
#                                cv=KFold(n_splits=5,
#                                         shuffle=True,
#                                         random_state=1169),
#                                verbose=3)


kk = count_vect.fit_transform(gg)

kk.shape

(253272, 12398)

In [47]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge , RidgeCV
n_alphas = 10
alphas = np.logspace(-10, 1, n_alphas)

parameters = {'alpha': alphas}
ridge_ = Ridge()
ridge_grid = GridSearchCV(ridge_, parameters, verbose=3)

# ridge_grid.fit(kk,stars)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END .........................alpha=1e-10;, score=nan total time=   0.0s
[CV 2/5] END .........................alpha=1e-10;, score=nan total time=   0.0s
[CV 3/5] END .........................alpha=1e-10;, score=nan total time=   0.0s
[CV 4/5] END .........................alpha=1e-10;, score=nan total time=   0.0s
[CV 5/5] END .........................alpha=1e-10;, score=nan total time=   0.0s
[CV 1/5] END ........alpha=1.6681005372000556e-09;, score=nan total time=   0.0s
[CV 2/5] END ........alpha=1.6681005372000556e-09;, score=nan total time=   0.0s
[CV 3/5] END ........alpha=1.6681005372000556e-09;, score=nan total time=   0.0s
[CV 4/5] END ........alpha=1.6681005372000556e-09;, score=nan total time=   0.0s
[CV 5/5] END ........alpha=1.6681005372000556e-09;, score=nan total time=   0.0s
[CV 1/5] END .........alpha=2.782559402207126e-08;, score=nan total time=   0.0s
[CV 2/5] END .........alpha=2.782559402207126e-0



GridSearchCV(estimator=Ridge(),
             param_grid={'alpha': array([1.00000000e-10, 1.66810054e-09, 2.78255940e-08, 4.64158883e-07,
       7.74263683e-06, 1.29154967e-04, 2.15443469e-03, 3.59381366e-02,
       5.99484250e-01, 1.00000000e+01])},
             verbose=3)

In [48]:
stars[:5]

[1, 4, 4, 3, 3]

In [49]:
ridge_grid.predict(kk)

array([1., 4., 4., 3., 3.])

In [50]:
from sklearn.pipeline import Pipeline

bag_of_words_vectorizer = Pipeline([
#     ('to_df', to_data_frame),
#     ('text_selector', text_selector),
    ('extract_text',extract_text()),
    ('count_vect', count_vect),
    ('ridge_grid', ridge_grid)
], verbose=True)

In [60]:
bag_of_words_model = bag_of_words_vectorizer

bag_of_words_model.fit(data, stars)

[Pipeline] ...... (step 1 of 3) Processing extract_text, total=   0.1s
[Pipeline] ........ (step 2 of 3) Processing count_vect, total= 5.9min
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END .......................alpha=1e-10;, score=0.449 total time=  12.2s
[CV 2/5] END .......................alpha=1e-10;, score=0.414 total time=  11.5s
[CV 3/5] END .......................alpha=1e-10;, score=0.434 total time=  11.6s
[CV 4/5] END .......................alpha=1e-10;, score=0.456 total time=  12.0s
[CV 5/5] END .......................alpha=1e-10;, score=0.466 total time=  12.2s
[CV 1/5] END ......alpha=1.6681005372000556e-09;, score=0.449 total time=  12.6s
[CV 2/5] END ......alpha=1.6681005372000556e-09;, score=0.414 total time=  11.5s
[CV 3/5] END ......alpha=1.6681005372000556e-09;, score=0.434 total time=  12.1s
[CV 4/5] END ......alpha=1.6681005372000556e-09;, score=0.456 total time=  12.0s
[CV 5/5] END ......alpha=1.6681005372000556e-09;, score=0.466 total 

Pipeline(steps=[('extract_text', extract_text()),
                ('count_vect',
                 CountVectorizer(max_df=0.9, min_df=0.0001,
                                 preprocessor=<function text_preprocessor at 0x7f2429a88b80>,
                                 stop_words={"'d", "'ll", "'m", "'re", "'s",
                                             "'ve", 'a', 'about', 'above',
                                             'across', 'after', 'afterwards',
                                             'again', 'against', 'all',
                                             'almost', 'alone', 'along',
                                             'already', 'also', 'although',
                                             'always', 'am', 'among', 'amongst',
                                             'amount', 'an', 'and', 'another',
                                             'any', ...})),
                ('ridge_grid',
                 GridSearchCV(estimator=Ridge(),
              

## Question 2: bigram_model

In a bigram model, we'll consider both single words and pairs of consecutive words that appear. This is going to be a much higher-dimensional problem so you should be careful about overfitting. You should also use a vectorizer that applies some sort of normalization, e.g., the `TfidfVectorizer` or a word count vectorizer combined with `TfidfTransformer`.

Sometimes, reducing the dimension can be useful. If you're using the `TfidfVectorizer`, you can change the `max_features` hyperparameter to reduce the size of the resulting vocabulary. For `HashingVectorizer`, you can adjust the size of the feature matrix through `n_features`.

**A side note on multi-stage model evaluation:** When your model consists of a pipeline with several stages, it can be worthwhile to evaluate which parts of the pipeline have the greatest impact on the overall accuracy (or other metric) of the model. This allows you to focus your efforts on improving the important algorithms, and leaving the rest "good enough".

One way to accomplish this is through ceiling analysis, which can be useful when you have a training set with ground truth values at each stage. Let's say you're training a model to extract image captions from websites and return a list of names that were in the caption. Your overall accuracy at some point reaches 70%. You can try manually giving the model what you know are the correct image captions from the training set, and see how the accuracy improves (maybe up to 75%). Alternatively, giving the model the perfect name parsing for each caption increases accuracy to 90%. This indicates that the name parsing is a much more promising target for further work, and the caption extraction is a relatively smaller factor in the overall performance.

If you don't know the right answers at different stages of the pipeline, you can still evaluate how important different parts of the model are to its performance by changing or removing certain steps while keeping everything else constant. You might try this kind of analysis to determine how important stop-word removal and stemming actually are to your NLP model, and how that importance changes with parameters like the number of features.
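
One possible way to run such a comparison is to cross-validate two pipelines that differ in a single step. The sketch below toggles stop-word removal only; the texts, ratings, and `alpha` value are made up, so the resulting numbers carry no meaning beyond showing the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented reviews and ratings, just large enough for 4-fold CV.
texts = [
    'the food was great and the staff were friendly',
    'the food was awful and the staff were rude',
    'we had a great time and the pizza was good',
    'it was a terrible wait and the pizza was cold',
    'this is the best place and the service is great',
    'this is the worst place and the service is rude',
    'the staff were friendly and the food was good',
    'the wait was awful and the food was cold',
]
stars = [5, 1, 4, 2, 5, 1, 4, 2]

# The two variants differ only in the stop-word setting.
variants = {
    'with stop words': TfidfVectorizer(),
    'stop words removed': TfidfVectorizer(stop_words='english'),
}

ablation_scores = {}
for name, vectorizer in variants.items():
    model = make_pipeline(vectorizer, Ridge(alpha=1.0))
    scores = cross_val_score(model, texts, stars, cv=4)  # R^2 per fold
    ablation_scores[name] = scores.mean()

print(ablation_scores)
```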

In [51]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class extract_text(BaseEstimator, TransformerMixin):

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        return pd.Series([j['text'] for j in X])

In [52]:
gg = extract_text().fit_transform(data)

In [53]:
import spacy
nlp = spacy.load("en_core_web_sm") 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS

def tokenize_lemma(text):
    return [w.lemma_.lower() for w in nlp(text)]


stop_words_lemma = set(tokenize_lemma(' '.join(sorted(STOP_WORDS))))


ng_stem_tfidf = TfidfVectorizer(max_features=5000, 
                                stop_words=stop_words_lemma,
#                                 tokenizer=tokenize_lemma,
#                                 token_pattern=None, # Is ignored, since tokenizer is specified
                                ngram_range = (1,2)
                               )

In [54]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge , RidgeCV
n_alphas = 10
alphas = np.logspace(-10, 1, n_alphas)

parameters = {'alpha': alphas}
ridge_ = Ridge()
ridge_grid = GridSearchCV(ridge_, parameters, verbose=3)

In [55]:
from sklearn.pipeline import Pipeline

bigram_model = Pipeline([
#     ('to_df', to_data_frame),
#     ('text_selector', text_selector),
    ('extract_text',extract_text()),
    ('ng_stem_tfidf', ng_stem_tfidf),
    ('ridge_grid', ridge_grid)
], verbose=True)

In [56]:
# bigram_model = ...

bigram_model.fit(data, stars)

[Pipeline] ...... (step 1 of 3) Processing extract_text, total=   0.1s
[Pipeline] ..... (step 2 of 3) Processing ng_stem_tfidf, total=  49.5s
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END .......................alpha=1e-10;, score=0.592 total time=   1.3s
[CV 2/5] END .......................alpha=1e-10;, score=0.571 total time=   1.2s
[CV 3/5] END .......................alpha=1e-10;, score=0.583 total time=   1.2s
[CV 4/5] END .......................alpha=1e-10;, score=0.606 total time=   1.2s
[CV 5/5] END .......................alpha=1e-10;, score=0.628 total time=   1.2s
[CV 1/5] END ......alpha=1.6681005372000556e-09;, score=0.592 total time=   1.3s
[CV 2/5] END ......alpha=1.6681005372000556e-09;, score=0.571 total time=   1.3s
[CV 3/5] END ......alpha=1.6681005372000556e-09;, score=0.583 total time=   1.2s
[CV 4/5] END ......alpha=1.6681005372000556e-09;, score=0.606 total time=   1.2s
[CV 5/5] END ......alpha=1.6681005372000556e-09;, score=0.628 total 

Pipeline(steps=[('extract_text', extract_text()),
                ('ng_stem_tfidf',
                 TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                                 stop_words={"'", "'ll", 'a', 'about', 'above',
                                             'across', 'after', 'afterwards',
                                             'again', 'against', 'all',
                                             'almost', 'alone', 'along',
                                             'already', 'also', 'although',
                                             'always', 'among', 'amongst',
                                             'amount', 'an', 'and', 'another',
                                             'any', 'anyhow', 'anyone',
                                             'anything', 'anyway', 'anywhere', ...})),
                ('ridge_grid',
                 GridSearchCV(estimator=Ridge(),
                              param_grid={'alpha': array([1.00000000e-10

## Question 3: word_polarity

Let's consider a different approach and try to derive some insight from our analysis.  

We want to determine the most "polarizing words" in the corpus of reviews.  In other words, we want to identify words that strongly signal a review is either positive or negative.  For example, we understand that a word like "terrible" will most likely appear in negative rather than positive reviews.  

During training, the [naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#) calculates probabilities such as $Pr(\textrm{terrible}\ |\ \textrm{negative}),$ the probability that the word "terrible" appears in the review text, given that the review is negative.  Using these probabilities, we can define a **polarity score** for each word $w$,

$$\textrm{polarity}(w) = \log\left(\frac{Pr(w\ |\ \textrm{positive})}{Pr(w\ |\ \textrm{negative})}\right).$$

Polarity analysis is an example where a simpler model (naive Bayes) is easier to interpret than more complicated models.  In addition, naive Bayes models are easy to train, the training process is parallelizable, and these models lend themselves well to online learning.  Given enough training data, naive Bayes models have performed well in NLP applications such as spam filtering.  

For this problem, you are asked to determine the top 25 most positive polar words and the 25 most negative polar words.  For this analysis, you should:

1.  **Filter** the collection of reviews you were using above to **only keep** the one-star and five-star reviews. Since these are the "most polar" reviews, it should give us the most polarizing words.   
1.  Use the naive Bayes model, `MultinomialNB`.  
1.  Use TF-IDF weighting.
1.  Remove stop words.
1.  As mentioned, generate a (Python) list with most positive (25 words) and most negative (25 words) polar words.  

A naive Bayes model (after training) stores the log of the probabilities in an attribute of the model.  It is a `numpy` array of shape (number of classes, number of features).  You will need the mapping from feature indices to words to find the most polarizing words.  

In [8]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class extract_text(BaseEstimator, TransformerMixin):

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        filtered_data = [{'text':j['text'], 'stars':j['stars']} for j in X if j['stars']==1 or j['stars']==5] 
        filtered_texts = pd.Series([i['text'] for i in filtered_data])
        filtered_stars = pd.Series([i['stars'] for i in filtered_data])
        return filtered_data, filtered_texts , filtered_stars

In [9]:
polar_data,polar_texts, polar_stars = extract_text().fit_transform(data)

In [10]:
sum(['great' in i for i in polar_texts])/len(polar_texts)

0.2893820340378809

In [11]:
# We're only keeping the one and five star reviews
grader.check(len(polar_data) == 116576)

True

In [221]:
import spacy
nlp = spacy.load("en_core_web_sm") 

# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# from spacy.lang.en.stop_words import STOP_WORDS

# def tokenize_lemma(text):
#     return [w.lemma_.lower() for w in nlp(text)]


# # stop_words_lemma = set(tokenize_lemma(' '.join(sorted(STOP_WORDS))))
# STOP_WORDS = STOP_WORDS.union({'ll', 've'})

tfidf = TfidfVectorizer(#max_features=5000, 
                                stop_words='english',
#                                 tokenizer=tokenize_lemma,
#                                 token_pattern=None, # Is ignored, since tokenizer is specified
                                #ngram_range = (1,2),
#                                 max_df = 0.13,
#                                 smooth_idf= False
                               )

In [222]:
kk = tfidf.fit_transform(polar_texts)

In [223]:
len(tfidf.get_feature_names_out())

88242

In [224]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

n_alphas = 50
alphas = np.logspace(-10, 1, n_alphas)

parameters = {'alpha': alphas}
multinomial_nb_ = MultinomialNB()
m_nb_grid = GridSearchCV(multinomial_nb_, parameters, verbose=3)

m_nb_grid.fit(kk, polar_stars)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END .......................alpha=1e-10;, score=0.892 total time=   0.1s
[CV 2/5] END .......................alpha=1e-10;, score=0.891 total time=   0.1s
[CV 3/5] END .......................alpha=1e-10;, score=0.896 total time=   0.1s
[CV 4/5] END .......................alpha=1e-10;, score=0.897 total time=   0.1s
[CV 5/5] END .......................alpha=1e-10;, score=0.898 total time=   0.1s
[CV 1/5] END ........alpha=1.67683293681101e-10;, score=0.893 total time=   0.1s
[CV 2/5] END ........alpha=1.67683293681101e-10;, score=0.891 total time=   0.1s
[CV 3/5] END ........alpha=1.67683293681101e-10;, score=0.897 total time=   0.1s
[CV 4/5] END ........alpha=1.67683293681101e-10;, score=0.897 total time=   0.1s
[CV 5/5] END ........alpha=1.67683293681101e-10;, score=0.898 total time=   0.1s
[CV 1/5] END .......alpha=2.811768697974225e-10;, score=0.893 total time=   0.1s
[CV 2/5] END .......alpha=2.811768697974225e-10

[CV 2/5] END ......alpha=3.0888435964774785e-06;, score=0.903 total time=   0.1s
[CV 3/5] END ......alpha=3.0888435964774785e-06;, score=0.908 total time=   0.1s
[CV 4/5] END ......alpha=3.0888435964774785e-06;, score=0.908 total time=   0.1s
[CV 5/5] END ......alpha=3.0888435964774785e-06;, score=0.908 total time=   0.1s
[CV 1/5] END .......alpha=5.179474679231212e-06;, score=0.904 total time=   0.1s
[CV 2/5] END .......alpha=5.179474679231212e-06;, score=0.904 total time=   0.1s
[CV 3/5] END .......alpha=5.179474679231212e-06;, score=0.908 total time=   0.1s
[CV 4/5] END .......alpha=5.179474679231212e-06;, score=0.909 total time=   0.1s
[CV 5/5] END .......alpha=5.179474679231212e-06;, score=0.909 total time=   0.1s
[CV 1/5] END ........alpha=8.68511373751352e-06;, score=0.906 total time=   0.1s
[CV 2/5] END ........alpha=8.68511373751352e-06;, score=0.905 total time=   0.1s
[CV 3/5] END ........alpha=8.68511373751352e-06;, score=0.910 total time=   0.1s
[CV 4/5] END ........alpha=8

[CV 4/5] END .........alpha=0.09540954763499924;, score=0.929 total time=   0.1s
[CV 5/5] END .........alpha=0.09540954763499924;, score=0.929 total time=   0.1s
[CV 1/5] END .........alpha=0.15998587196060574;, score=0.926 total time=   0.1s
[CV 2/5] END .........alpha=0.15998587196060574;, score=0.922 total time=   0.1s
[CV 3/5] END .........alpha=0.15998587196060574;, score=0.929 total time=   0.1s
[CV 4/5] END .........alpha=0.15998587196060574;, score=0.927 total time=   0.1s
[CV 5/5] END .........alpha=0.15998587196060574;, score=0.927 total time=   0.1s
[CV 1/5] END .........alpha=0.26826957952797276;, score=0.923 total time=   0.1s
[CV 2/5] END .........alpha=0.26826957952797276;, score=0.916 total time=   0.1s
[CV 3/5] END .........alpha=0.26826957952797276;, score=0.923 total time=   0.1s
[CV 4/5] END .........alpha=0.26826957952797276;, score=0.922 total time=   0.1s
[CV 5/5] END .........alpha=0.26826957952797276;, score=0.921 total time=   0.1s
[CV 1/5] END ..........alpha

GridSearchCV(estimator=MultinomialNB(),
             param_grid={'alpha': array([1.00000000e-10, 1.67683294e-10, 2.81176870e-10, 4.71486636e-10,
       7.90604321e-10, 1.32571137e-09, 2.22299648e-09, 3.72759372e-09,
       6.25055193e-09, 1.04811313e-08, 1.75751062e-08, 2.94705170e-08,
       4.94171336e-08, 8.28642773e-08, 1.38949549e-07, 2.32995181e-07,
       3.90693994e-07, 6.55128557e-07, 1.09854114e-06...
       2.44205309e-05, 4.09491506e-05, 6.86648845e-05, 1.15139540e-04,
       1.93069773e-04, 3.23745754e-04, 5.42867544e-04, 9.10298178e-04,
       1.52641797e-03, 2.55954792e-03, 4.29193426e-03, 7.19685673e-03,
       1.20679264e-02, 2.02358965e-02, 3.39322177e-02, 5.68986603e-02,
       9.54095476e-02, 1.59985872e-01, 2.68269580e-01, 4.49843267e-01,
       7.54312006e-01, 1.26485522e+00, 2.12095089e+00, 3.55648031e+00,
       5.96362332e+00, 1.00000000e+01])},
             verbose=3)

In [225]:
m_nb_grid.best_params_
# m_nb_grid.best_estimator_

{'alpha': 0.05689866029018305}

In [226]:
m_nb = m_nb_grid.best_estimator_

In [227]:
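# Caution: the polarity defined above is a log-ratio, i.e. the difference
# feature_log_prob_[1] - feature_log_prob_[0]. This line divides the two
# log-probabilities instead, so smaller values mark more positive words
# and the ranking can differ from the defined polarity.
polarity_score = m_nb.feature_log_prob_[1]/m_nb.feature_log_prob_[0]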

In [228]:
polar_words = []

for i in m_nb.feature_log_prob_:
    words = list(pd.DataFrame({'word':tfidf.get_feature_names_out(), 'prob': i }).sort_values('prob', ascending=False)['word'])
    polar_words.extend(words[:25])

In [229]:
word_polarity_data = pd.DataFrame({'word':tfidf.get_feature_names_out(), 'prob': polarity_score }).sort_values('prob', ascending=False)

In [230]:
polar_words =list(word_polarity_data[-25:]['word']) + list(word_polarity_data[:25]['word'])

In [231]:
polar_words

['favorite',
 'yum',
 'delish',
 'yummm',
 'love',
 'divine',
 'montagu',
 'impeccable',
 'deliciousness',
 'great',
 'mouthwatering',
 'gem',
 'perfect',
 'yummy',
 'mazing',
 'unpretentious',
 'troy',
 'textures',
 'excellent',
 'awesome',
 'fantastic',
 'awsome',
 'amazing',
 'perfection',
 'delicious',
 'worst',
 'unacceptable',
 'crooks',
 'aweful',
 'horrible',
 'fraud',
 'unprofessional',
 'rude',
 'blamed',
 'disgrace',
 'unhelpful',
 'incompetent',
 'terrible',
 'awful',
 'livid',
 'rudest',
 'rudely',
 'poisoning',
 'disgusting',
 'unethical',
 'tasteless',
 'disrespectful',
 'harassing',
 'refund',
 'uncalled']

In [232]:
d = {}
for word in polar_words:
    d[word] = sum([word in i for i in polar_texts ])/len(polar_texts)

In [233]:
pd.DataFrame({'word':d.keys(), 'freq':d.values()}).sort_values('freq')

         word      freq
6     montagu  0.000172
47  harassing  0.000180
49   uncalled  0.000197
28     aweful  0.000197
39      livid  0.000283
27     crooks  0.000360
44  unethical  0.000360
34   disgrace  0.000395
21     awsome  0.000480
30      fraud  0.000583


## Question 4: food_bigrams

Look over all reviews of restaurants.  You can determine which businesses are restaurants by looking in the `yelp_train_academic_dataset_business.json.gz` file.

In [115]:
with gzip.open('yelp_train_academic_dataset_business.json.gz') as f:
    business_data = [json.loads(line) for line in f]

Each row of this file corresponds to a single business.  The `categories` key gives a list of categories for each; take all where "Restaurants" appears.

In [136]:
'Restaurants' in business_data[1]['categories']

True

In [137]:
restaurants = [i for i in business_data if 'Restaurants' in i['categories']]

In [144]:
data[0]

{'votes': {'funny': 0, 'useful': 0, 'cool': 0},
 'user_id': 'Qrs3EICADUKNFoUq2iHStA',
 'review_id': '_ePLBPrkrf4bhyiKWEn4Qg',
 'stars': 1,
 'date': '2013-04-19',
 'text': "I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you call the office, they'll put you through to a voice mail, that NO ONE ever answers or returns your call. Both my adult children and husband have decided to leave this practice after experiencing such frustration. The entire office ha

In [150]:
restaurant_ids = pd.DataFrame(restaurants)[['business_id']]

In [151]:
# Look at the categories to check for spelling and capitalization
grader.check(len(restaurant_ids) == 12876)

True

The "business_id" here is the same as in the review data.  Use this to extract the review text for all reviews of restaurants.

In [158]:
restaurant_reviews = restaurant_ids.merge(pd.DataFrame(data), on='business_id', how='left')[['business_id','text']].dropna()

In [159]:
# Just reviews of restaurants
# restaurant_ids is helpful here
grader.check(len(restaurant_reviews) == 143361)

True

We want to find collocations, that is, bigrams that are "special" and appear more often than you'd expect from chance. We can think of the corpus as defining an empirical distribution over all *n*-grams.  We can then look for word pairs that occur consecutively more often than the underlying probabilities of their individual words would suggest. Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1) p(w_2)} $$

is high.  Return the 100 bigrams (mostly food-related) that score highest on this statistic, computed with the 'right' prior factor (see below).

Estimating the probabilities is simply a matter of counting, and there are a number of approaches that will work.  One is to use one of the tokenizers to count how many times each word and each bigram appears in each review, and then sum those counts over all reviews.  You might want to know that the `CountVectorizer` has a `.get_feature_names_out()` method, which gives the string associated with each column.  (Question for thought: Why doesn't the `HashingVectorizer` have a similar method?)
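
The counting approach can be sketched on a toy corpus using only the standard library. The three tokenized reviews below are made up for illustration, and `ratio` is a hypothetical helper name; this is a minimal, unsmoothed sketch, not the reference solution.

```python
from collections import Counter

# Toy corpus, invented for illustration; each review is already tokenized.
reviews = [
    ["pad", "thai", "was", "good"],
    ["good", "pad", "thai"],
    ["service", "was", "good"],
]

# Count words and consecutive word pairs per review, summed over all reviews.
unigram_counts = Counter(w for review in reviews for w in review)
bigram_counts = Counter(
    pair for review in reviews for pair in zip(review, review[1:])
)

n_unigrams = sum(unigram_counts.values())
n_bigrams = sum(bigram_counts.values())


def ratio(w1, w2):
    """Unsmoothed p(w1 w2) / (p(w1) p(w2)), estimated from raw counts."""
    p_bigram = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    return p_bigram / (p_w1 * p_w2)


scores = {pair: ratio(*pair) for pair in bigram_counts}
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

On this toy data, `("pad", "thai")` scores higher than `("was", "good")` because "good" is common on its own, which lowers the ratio for any pair that contains it.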

*Questions:* This statistic is a ratio and problematic when the denominator is small.  We can fix this by applying Bayesian smoothing to $p(w)$ (i.e. mixing the empirical distribution with the uniform distribution over the vocabulary).

1. How does changing this smoothing parameter affect the word pairs you get qualitatively?

2. We can interpret the smoothing parameter as adding a constant number of occurrences of each word to our distribution.  Does this help you determine a reasonable value for this 'prior factor'?

3. For fun: also check out [Amazon's Statistically Improbable Phrases](http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases).
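
To connect the two descriptions of smoothing above ("mixing with the uniform distribution" and "adding a constant number of occurrences"), here is one way to write it down, assuming plain additive smoothing with the same constant for every word. The symbols are notation introduced here for illustration: $c(w)$ is the count of word $w$, $N$ is the total number of word occurrences, $V$ is the vocabulary size, and $\alpha$ is the prior factor.

  $$ \hat{p}(w) = \frac{c(w) + \alpha}{N + \alpha V} = (1 - \lambda)\,\frac{c(w)}{N} + \lambda\,\frac{1}{V}, \qquad \lambda = \frac{\alpha V}{N + \alpha V} $$

Under this assumption, adding $\alpha$ occurrences per word is the same as mixing the empirical distribution $c(w)/N$ with the uniform distribution $1/V$ using weight $\lambda$.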

*Implementation note:*
As you adjust the size of the Bayesian smoothing parameter, you will notice first nonsense phrases being removed and then legitimate bigrams being removed, leaving you with only generic bigrams.  The goal is to find a value of the smoothing parameter between these two transitions.
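
The two transitions can be seen on a small set of hypothetical counts. All numbers and words below are invented for illustration (`zzyzx qwrty` stands in for a nonsense phrase), and `smoothed_scores` and `ranking` are hypothetical helper names; this is a sketch assuming additive smoothing of $p(w)$ only.

```python
from collections import Counter

# Hypothetical counts, invented for illustration.
unigram_counts = Counter(
    {"pad": 200, "thai": 220, "zzyzx": 2, "qwrty": 2, "good": 5000, "food": 6000}
)
bigram_counts = Counter(
    {("pad", "thai"): 180, ("zzyzx", "qwrty"): 2, ("good", "food"): 400}
)


def smoothed_scores(prior):
    """Score each bigram after adding `prior` pseudo-counts to every word."""
    total = sum(unigram_counts.values()) + prior * len(unigram_counts)
    p = {w: (c + prior) / total for w, c in unigram_counts.items()}
    n_bigrams = sum(bigram_counts.values())
    return {
        (w1, w2): (c / n_bigrams) / (p[w1] * p[w2])
        for (w1, w2), c in bigram_counts.items()
    }


def ranking(prior):
    """Bigrams sorted from highest to lowest score for a given prior."""
    scores = smoothed_scores(prior)
    return sorted(scores, key=scores.get, reverse=True)


for prior in (0, 30, 100_000):
    print(prior, ranking(prior))
```

With no smoothing the nonsense pair ranks first; with a moderate prior the legitimate collocation `pad thai` takes over; with a very large prior the word probabilities become nearly uniform, so the ranking follows raw bigram frequency and the generic `good food` wins.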

The reference solution is not an aggressive filterer: it errs in favor of leaving apparently nonsensical words. On further consideration, many of these are actually somewhat meaningful. The smoothing parameter chosen in the reference solution is equivalent to giving each word 30 appearances before considering this data.  This was chosen by generating a list of bigrams for a range of smoothing parameters and seeing how many of the bigrams were shared between neighboring values.  When the shared fraction reached 95%, we judged the solution to have converged.
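
A convergence check of this kind might look like the sketch below. The function names, the toy bigram lists, and the choice to report the earlier of the two neighboring values are all assumptions made for illustration; the reference solution's exact procedure is not shown here.

```python
def shared_fraction(bigrams_a, bigrams_b):
    """Fraction of the bigrams in list a that also appear in list b."""
    a, b = set(bigrams_a), set(bigrams_b)
    return len(a & b) / len(a) if a else 0.0


def first_converged_prior(top_bigrams_by_prior, threshold=0.95):
    """Return the first prior whose top list overlaps the next prior's list
    by at least `threshold`, or None if no neighboring pair does."""
    priors = sorted(top_bigrams_by_prior)
    for current, following in zip(priors, priors[1:]):
        overlap = shared_fraction(
            top_bigrams_by_prior[current], top_bigrams_by_prior[following]
        )
        if overlap >= threshold:
            return current
    return None


# Toy top-4 lists for four hypothetical prior values.
top = {
    10: ["a b", "c d", "e f", "g h"],
    20: ["a b", "c d", "x y", "z w"],
    30: ["a b", "c d", "x y", "q r"],
    40: ["a b", "c d", "x y", "q r"],
}
print(first_converged_prior(top))
```

In practice each list would hold the top 100 bigrams produced by the scoring code for that prior value.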

There are a few reviews that include the same nonsense strings multiple times.  To keep these from showing up in our results, we set `min_df=10`, to ensure that a bigram occurs in at least 10 reviews before we consider it.
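
The reason `min_df` helps is that it thresholds on document frequency (the number of reviews containing a bigram) rather than on total occurrences. A standard-library sketch of that idea, using made-up reviews and a threshold of 2 instead of 10:

```python
from collections import Counter

# Made-up tokenized reviews; the first repeats a nonsense string many times.
reviews = [
    ["asdf", "asdf", "asdf", "asdf"],
    ["pad", "thai"],
    ["pad", "thai", "please"],
]

total_counts = Counter()
doc_freq = Counter()
for tokens in reviews:
    bigrams = list(zip(tokens, tokens[1:]))
    total_counts.update(bigrams)   # every occurrence counts
    doc_freq.update(set(bigrams))  # each review counts at most once

min_df = 2  # illustrative threshold; the text above uses 10
kept = {bigram for bigram, df in doc_freq.items() if df >= min_df}
print(kept)
```

Here `("asdf", "asdf")` occurs three times in total but in only one review, so it is dropped, while `("pad", "thai")` appears in two reviews and is kept.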

In [236]:
restaurant_reviews['text']

0         If you like lot lizards, you'll love the Pine ...
1         Only went here once about a year and a half ag...
2         Ate a Saturday morning breakfast at the Pine C...
3         This is definitely not your usual truck stop. ...
4         I like this location better than the one near ...
                                ...                        
144941    Barely open less than a week and I've been her...
144942    Healthy Food that Keeps this Realtor on the Go...
144943    So happy to have this healthy eatery option ri...
144945    My new favorite restaurant.  They have 22 diff...
144946    GreAt food awesome service . The best fish in ...
Name: text, Length: 143361, dtype: object

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter


## count single words (English stop words removed; keep words in >= 10 reviews)
cv = CountVectorizer(min_df=10, stop_words='english')
single = cv.fit_transform(restaurant_reviews['text'])

unigrams = list(cv.get_feature_names_out())
counts = single.sum(axis=0).A1  # total occurrences of each word over all reviews

freq_distribution = Counter(dict(zip(unigrams, counts)))

## count double words (bigrams), with the same filters
cv = CountVectorizer(ngram_range=(2, 2), min_df=10, stop_words='english')
double = cv.fit_transform(restaurant_reviews['text'])

bigrams = list(cv.get_feature_names_out())
counts_bi = double.sum(axis=0).A1  # total occurrences of each bigram

freq_distribution_bi = Counter(dict(zip(bigrams, counts_bi)))

In [None]:
## calculate smoothed p(w): give every word 30 prior appearances
## (the total includes the added counts so that p(w) sums to 1)
prior = 30
value = sum(freq_distribution.values()) + prior * len(freq_distribution)
for item in freq_distribution:
    freq_distribution[item] = (freq_distribution[item] + prior) / value

## calculate p(w1w2)
value = sum(freq_distribution_bi.values())
for item in freq_distribution_bi:
    freq_distribution_bi[item] /= value

## calculate ratio p(w1w2) / (p(w1) p(w2))
for item in freq_distribution_bi:
    w1, w2 = item.split()
    freq_distribution_bi[item] /= freq_distribution[w1] * freq_distribution[w2]

In [None]:

top100=[item[0] for item in freq_distribution_bi.most_common(100)]