# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data.  This is a basic intro to handling unstructured data.  Our objective is to be able to extract the sentiment (positive or negative) from review text.  We will do this from Yelp review data.

The first three questions task you to build models, of increasing complexity, to predict the rating of a review from its text.  These models will be assessed based on the root mean squared error of the number of stars predicted.  There is a reference solution (which should not be too hard to beat) that defines the score of 1.

The final question asks only for the result of a calculation, and your results will be compared directly to those of a reference solution.


## A note on scoring

It **is** possible to score >1 on these questions. This indicates that you've beaten our reference model - we compare our model's score on a test set to your score on a test set. See how high you can go!


## Download and parse the data


To start, let's download the data set from Amazon S3:

In [5]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_review_reduced.json.gz'

The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in json package has a `loads()` function that converts a JSON string into a Python dictionary.  We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` library, but is *substantially* faster (at the cost of non-robust handling of malformed json).  We will use that inside a list comprehension to get a list of dictionaries:

In [4]:
import gzip
import ujson as json

with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]

# Questions


Each of the "model" questions asks you to create a function that models the number of stars given in a review from the review text.  It will be passed a list of dictionaries.  Each of these will have the same format as the JSON objects you've just read in.  This function should return a list of numbers of the same length, giving the predicted star ratings.

This function is passed to the `score()` function, which will receive input from the grader, run your function with that input, report the results back to the grader, and print out the score the grader returned.  Depending on how you constructed your estimator, you may be able to pass the predict method directly to the `score()` function.  If not, you will need to write a small wrapper function to mediate the data types.


## bag_of_words_model

Build a linear model predicting the star rating based on the count of the words in each document (bag-of-words model).  Use a [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to produce a feature matrix giving the counts of each word in each review.  Feed the feature matrix into a linear model, such as `Ridge` or `SGDRegressor`, to predict the number of stars from each review.

**Hints**:
1. Don't forget to use tokenization!  This is important for good performance but it is also the most expensive step.  Try vectorizing as a first initial step and then running grid-search and cross-validation only on of this pre-processed data.  `CountVectorizer` has to memorize the mapping between words and the index to which it is assigned.  This is linear in the size of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = [row['text'] for row in data]
X = CountVectorizer().fit_transform(text)

# Now, this can be run with many different parameters
# without needing to retrain the vectorizer:
model.fit(X, stars, hyperparameter=something)
```

2. Try choosing different values for `min_df` (minimum document frequency cutoff) and `max_df` in `CountVectorizer`.  Setting `min_df` to zero admits rare words which might only appear once in the entire corpus.  This is both prone to overfitting and makes your data unmanageably large.  Don't forget to use cross-validation to select the right value.  
3. Try using [`LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) or [`RidgeCV`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV).  If the memory footprint is too big, try switching to [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) You might find that even ordinary linear regression fails due to the data size.  Don't forget to use [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) to determine the regularization parameter!  How do the regularization parameter `alpha` and the values of `min_df` and `max_df` from `CountVectorizer` change the answer?

4. You will likely pick up several hyperparameters between the tokenization step and the regularization of the estimator.  While it is more strictly correct to do a grid search over all of them at once, this can take a long time. Quite often, doing a grid search over a single hyperparameter at a time can produce similar results.  Alternatively, the grid search may be done over a smaller subset of the data, as long as it is representative of the whole.

5. Finally, assemble a pipeline that will transform the data from records all the way to predictions.  This will allow you to submit its predict method to the grader for scoring.

In [48]:
text = [row['text'] for row in data]
text

["I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you call the office, they'll put you through to a voice mail, that NO ONE ever answers or returns your call. Both my adult children and husband have decided to leave this practice after experiencing such frustration. The entire office has an attitude like they are doing you a favor. Give me a break! Stay away from this doc and the practice. You deserve better and they will not be there when you really ne

In [60]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeCV
text = [row['text'] for row in data]
vectorizer = CountVectorizer( #parameters)
parameters = {'alphas':[1e-3, 1e-2, 1e-1, 1]}
cst1 = RidgeCV()
X = vectorizer.fit_transform(text)
clf = GridSearchCV(cst1, parameters)
clf.fit(X,stars)
sorted(clf.cv_results_.keys())

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object

TypeError: len() of unsized object



RuntimeError: Cannot clone object RidgeCV(alphas=0.001, cv=None, fit_intercept=True, gcv_mode=None,
        normalize=False, scoring=None, store_cv_values=False), as the constructor either does not set or modifies parameter alphas

In [79]:
from sklearn.feature_extraction.text import CountVectorizer
cst = ColumnSelectTransformer(['text'])
X = cst.fit_transform(data[:2])
#cst1 = CountVectorizer( ngram_range=(2, 2), max_df= 0.9 ,min_df=0.1)
#v = cst1.fit_transform(X)
X

["I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you call the office, they'll put you through to a voice mail, that NO ONE ever answers or returns your call. Both my adult children and husband have decided to leave this practice after experiencing such frustration. The entire office has an attitude like they are doing you a favor. Give me a break! Stay away from this doc and the practice. You deserve better and they will not be there when you really ne

In [6]:
from sklearn import base
class ColumnSelectTransformer(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col_names):
        self.col_names = col_names # We will need these in transform()
    
    def fit(self, X,y):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
    def transform(self, X):
        ####
        return result

In [7]:
class predict_residual(base.BaseEstimator, base.RegressorMixin):
    
    def __init__(self, est1,est2):
        self.est1 = est1  # We will need these in transform()
        self.est2 = est2
        #self.X = X
        #self.y = y
    
    def fit(self, X, y):
        self.X = X
        self.y = y
        ######
        return self
    
    def predict(self, X):
        y1 = self.est1.predict(X)
        y2 = self.est2.predict(X)
        return ####

In [8]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [31]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Ridge
import xgboost as xgb
bag_of_words_est = Pipeline([
    ('transform',ColumnSelectTransformer(['text'])),                               # Column selector (remember the ML project?)
    ('CountVectorizer',CountVectorizer(#parameters),              # Vectorizer
    #('HashingVectorizer',HashingVectorizer()),
    #('RidgeCV',RidgeCV(alphas=[1e-4,1e-3, 1e-2, 1e-1, 1,10,100,1000]))          # Regressor
    #('SGD', SGDRegressor())
    ('Res', predict_residual(Ridge(alpha=0.01),xgb.XGBRegressor()))
])

In [32]:
bag_of_words_est.fit(data,stars)

Pipeline(memory=None,
         steps=[('transform', ColumnSelectTransformer(col_names=['text'])),
                ('CountVectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=500000,
                                 max_features=50000, min_df=1000,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words=None, strip_...
                                                    interaction_constraints='',
                                                    learning_rate=0.300000012,
                                                    max_delta_step=0,
                                                    max_depth=6,
                                                    min_child_weight=1,
                     

In [33]:
y = bag_of_words_est.predict(data)

In [35]:
grader.score('nlp__bag_of_words_model', bag_of_words_est.predict)

Your score:  1.5682791997739698


## normalized_model

# Normalization is key for good linear regression. Previously, we used the count as the normalization scheme.  Add in a normalization transformer to your pipeline to improve the score.  Try some of these:

1. You can use the "does this word present in this document" as a normalization scheme, which means the values are always 1 or 0.  So we give no additional weight to the presence of the word multiple times.

1. By default, the [`HashingVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) normalizes the counts using an L-2 norm. If you go this route, keep in mind that `HashingVectorizer` doesn't support `min_df`  and `max_df`. However, it's not hard to roll your own transformer that solves for these.

1. Try using the log of the number of counts (or more precisely, $log(x+1)$). This is often used because we want the repeated presence of a word to count for more but not have that effect tapper off.

1. [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common normalization scheme used in text processing.  Use the [`TfidfTransformer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer). There are options for using `idf` and taking the logarithm of `tf`.  Do these significantly affect the result?

Finally, if you can't decide which one is better, don't forget that you can combine models with a linear regression.

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Ridge
import xgboost as xgb
normalized_est = Pipeline([
    ('transform',ColumnSelectTransformer(['text'])),
    ('TFIDF', TfidfVectorizer(#parameters),
    ('Res', predict_residual(Ridge(alpha=0.01),xgb.XGBRegressor()))
])

In [27]:
normalized_est.fit(data,stars)

Pipeline(memory=None,
         steps=[('transform', ColumnSelectTransformer(col_names=['text'])),
                ('TFIDF',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=500000,
                                 max_features=50000, min_df=1000,
                                 ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop...
                                                    interaction_constraints='',
                                                    learning_rate=0.300000012,
                                                    max_delta_step=0,
                                                    max_depth=6,
                          

In [28]:
y = normalized_est.predict(data)

In [30]:
grader.score('nlp__normalized_model', normalized_est.predict)

Your score:  0.9752520180409083


## bigram_model

In a bigram model, we'll consider both single words and pairs of consecutive words that appear.  This is going to be a much higher dimensional problem (large $p$) so you should be careful about overfitting.

Sometimes, reducing the dimension can be useful.  Because we are dealing with a sparse matrix, we have to use [`TruncatedSVD`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD).  If we reduce the dimensions, we can use a more sophisticated models than linear ones.

As before, memory problems can crop up due to the engineering constraints. Playing with the number of features, using the `HashingVectorizer`, incorporating `min_df` and `max_df` limits, and handling stop-words in some way are all methods of addressing this issue. If you are using `CountVectorizer`, it is possible to run it with a fixed vocabulary (based on a training run, for instance). Check the documentation.

**A side note on multi-stage model evaluation:** When your model consists of a pipeline with several stages, it can be worthwhile to evaluate which parts of the pipeline have the greatest impact on the overall accuracy (or other metric) of the model. This allows you to focus your efforts on improving the important algorithms, and leaving the rest "good enough".

One way to accomplish this is through ceiling analysis, which can be useful when you have a training set with ground truth values at each stage. Let's say you're training a model to extract image captions from websites and return a list of names that were in the caption. Your overall accuracy at some point reaches 70%. You can try manually giving the model what you know are the correct image captions from the training set, and see how the accuracy improves (maybe up to 75%). Alternatively, giving the model the perfect name parsing for each caption increases accuracy to 90%. This indicates that the name parsing is a much more promising target for further work, and the caption extraction is a relatively smaller factor in the overall performance.

If you don't know the right answers at different stages of the pipeline, you can still evaluate how important different parts of the model are to its performance by changing or removing certain steps while keeping everything else constant. You might try this kind of analysis to determine how important adding stopwords and stemming to your NLP model actually is, and how that importance changes with parameters like the number of features.

In [9]:
from spacy.lang.en.stop_words import STOP_WORDS
import spacy
nlp = spacy.load("en")

def tokenize_lemma(text):
    return [w.lemma_.lower() for w in nlp(text)]
stop_words_lemma = set(tokenize_lemma(' '.join(STOP_WORDS)))

In [34]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Ridge
import xgboost as xgb

bigram_est = Pipeline([
    ('transform',ColumnSelectTransformer(['text'])),                               # Column selector (remember the ML project?)
    #              # Vectorizer
    ('TFIDF', TfidfVectorizer(#parameters),            
    ('SGD',SGDRegressor())
])

In [35]:
bigram_est.fit(data,stars)

Pipeline(memory=None,
         steps=[('transform', ColumnSelectTransformer(col_names=['text'])),
                ('TFIDF',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.95,
                                 max_features=20000, min_df=15,
                                 ngram_range=(2, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('SGD',
                 Ridge(alpha=0.01, 

In [36]:
y = bigram_est.predict(data)

In [13]:
grader.score('nlp__bigram_model', bigram_est.predict)
#0.8739

Your score:  0.8739528254881328


## food_bigrams

Look over all reviews of restaurants.  You can determine which businesses are restaurants by looking in the `yelp_train_academic_dataset_business.json.gz` file from the ml project or downloaded below.

In [58]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_business.json.gz'

download: s3://dataincubator-course/mldata/yelp_train_academic_dataset_business.json.gz to ./yelp_train_academic_dataset_business.json.gz


In [71]:
Res_data = []
for i in range(len(business_data)):
    if 'Restaurants' in business_data[i]['categories']:
        Res_data.append(business_data[i])
Res_data

[{'business_id': 'JwUE5GmEO-sH1FuwJgKBlQ',
  'full_address': '6162 US Highway 51\nDe Forest, WI 53532',
  'hours': {},
  'open': True,
  'categories': ['Restaurants'],
  'city': 'De Forest',
  'review_count': 26,
  'name': 'Pine Cone Restaurant',
  'neighborhoods': [],
  'longitude': -89.335844,
  'state': 'WI',
  'stars': 4.0,
  'latitude': 43.238893,
  'attributes': {'Take-out': True,
   'Good For': {'dessert': False,
    'latenight': False,
    'lunch': True,
    'dinner': False,
    'breakfast': False,
    'brunch': False},
   'Caters': False,
   'Noise Level': 'average',
   'Takes Reservations': False,
   'Delivery': False,
   'Ambience': {'romantic': False,
    'intimate': False,
    'touristy': False,
    'hipster': False,
    'divey': False,
    'classy': False,
    'trendy': False,
    'upscale': False,
    'casual': False},
   'Parking': {'garage': False,
    'street': False,
    'validated': False,
    'lot': True,
    'valet': False},
   'Has TV': True,
   'Outdoor Seating'

We want to find collocations --- that is, bigrams that are "special" and appear more often than you'd expect from chance. We can think of the corpus as defining an empirical distribution over all *n*-grams.  We can find word pairs that are unlikely to occur consecutively based on the underlying probability of their words. Mathematically, if $p(w)$ be the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1) p(w_2)} $$

is high.  Return the top 100 (mostly food) bigrams with this statistic with the 'right' prior factor (see below).

Estimating the probabilities is simply a matter of counting, and there are number of approaches that will work.  One is to use one of the tokenizers to count up how many times each word and each bigram appears in each review, and then sum those up over all reviews.  You might want to know that the `CountVectorizer` has a `.get_feature_names()` method which gives the string associated with each column.  (Question for thought: Why doesn't the `HashingVectorizer` have a similar method?)

*Questions:* This statistic is a ratio and problematic when the denominator is small.  We can fix this by applying Bayesian smoothing to $p(w)$ (i.e. mixing the empirical distribution with the uniform distribution over the vocabulary).

1. How does changing this smoothing parameter affect the word pairs you get qualitatively?

2. We can interpret the smoothing parameter as adding a constant number of occurrences of each word to our distribution.  Does this help you determine set a reasonable value for this 'prior factor'?

3. For fun: also check out [Amazon's Statistically Improbable Phrases](http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases).

*Implementation note:*
As you adjust the size of the Bayesian smoothing parameter, you will notice first nonsense phrases being removed and then legitimate bigrams being removed, leaving you with only generic bigrams.  The goal is to find a value of the smoothing parameter between these two transitions.

The reference solution is not an aggressive filterer: it errors in favor of leaving apparently nonsensical words. On further consideration, many of these are actually somewhat meaningful. The smoothing parameter chosen in the reference solution is equivalent to giving each word 30 previous appearances prior to considering this data.  This was chosen by generating a list of bigrams for a range of smoothing parameters and seeing how many of the bigrams were shared between neighboring values.  When the shared fraction reached 95%, we judged the solution to have converged.  Note that `min_df` should not be set too high, where it could exclude these borderline words.

In [6]:
len(word_counts)

87880

In [7]:
len(bigram_counts)

143361

In [8]:
Bigram_counter = []
for i in range(len(bigram_counts)):
    Words = bigram_counts[i][0].split()
    c = [bigram_counts[i][0]]
    for w in Words:
        ###Magic
    Bigram_counter.append(c)
Bigram_counter

[['00 00', 11, 11],
 ['00 all', 11, 139],
 ['00 am', 11, 40],
 ['00 and', 11, 95],
 ['00 as', 11, 28],
 ['00 at', 11, 19],
 ['00 before', 11, 240],
 ['00 bill', 11, 143],
 ['00 bucks', 11, 64],
 ['00 but', 11, 41],
 ['00 can', 11, 60],
 ['00 dollars', 11, 12],
 ['00 each', 11, 863],
 ['00 extra', 11, 73],
 ['00 for', 11, 35],
 ['00 glass', 11, 56],
 ['00 if', 11, 103],
 ['00 in', 11, 115],
 ['00 including', 11, 39],
 ['00 is', 11, 353],
 ['00 it', 11, 257],
 ['00 later', 11, 165],
 ['00 lunch', 11, 146],
 ['00 meal', 11, 26],
 ['00 more', 11, 132],
 ['00 my', 11, 78],
 ['00 no', 11, 527],
 ['00 not', 11, 88],
 ['00 off', 11, 114],
 ['00 on', 11, 66],
 ['00 or', 11, 13],
 ['00 per', 11, 39],
 ['00 person', 11, 15],
 ['00 plus', 11, 157],
 ['00 pm', 11, 128],
 ['00 reservation', 11, 88],
 ['00 so', 11, 142],
 ['00 that', 11, 216],
 ['00 the', 11, 361],
 ['00 there', 11, 41],
 ['00 they', 11, 75],
 ['00 this', 11, 86],
 ['00 tip', 11, 159],
 ['00 to', 11, 310],
 ['00 total', 11, 215],
 ['

In [28]:
Food_grams = []
alpha = 30
d = len(word_counts)
N1 = 0
for i in range(len(bigram_counts)):
    N1 += bigram_counts[i][1]
N2 = 0
for i in range(len(word_counts)):
    N2 += word_counts[i][1]

    
for i in range(len(Bigram_counter)):
    f = [Bigram_counter[i][0]]
    if len(Bigram_counter[i]) == 3:
        ####Magic
    else:
        f.insert(1,0)
    Food_grams.append(f)

Food_grams = sorted(Food_grams,key=lambda x:x[1], reverse=True)
nlp__food_bigrams = []
for i in range(Food_grams):
    nlp__food_bigrams.append(Food_grams[i][0])
nlp__food_bigrams

['rula bula',
 'dac biet',
 'knick knacks',
 'ropa vieja',
 'gulab jamun',
 'itty bitty',
 'patatas bravas',
 'puerto rican',
 'wal mart',
 'lomo saltado',
 'bradley ogden',
 'valle luna',
 'har gow',
 'pina colada',
 'vice versa',
 'kao tod',
 'artery clogging',
 'sous vide',
 'pin kaow',
 'ping pang',
 'bells whistles',
 'casey moore',
 'harry potter',
 'cochinita pibil',
 'kilt lifter',
 'moscow mule',
 'lactose intolerant',
 'aguas frescas',
 'hustle bustle',
 'thit nuong',
 'scantily clad',
 'tres leches',
 'demi glace',
 'kee mao',
 'kool aid',
 'arnold palmer',
 'osso bucco',
 'woody allen',
 'coca cola',
 'cabo wabo',
 'bok choy',
 'stainless steel',
 'rick moonen',
 'mt everest',
 'insult injury',
 'hush puppies',
 'panna cotta',
 'jean philippe',
 'hong kong',
 'huli huli',
 'toby keith',
 'van buren',
 'tilted kilt',
 'parmigiano reggiano',
 'quench thirst',
 'identity crisis',
 'peter piper',
 'pet peeve',
 'sierra bonita',
 'osso buco',
 'petit fours',
 'croque madame',
 '

In [27]:
grader.score('nlp__food_bigrams', nlp__food_bigrams[:100])#top100

Your score:  0.9400000000000006
