<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how to turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset/challenge). As part of the job selection process, some of my friends have been asked to create an analysis of this dataset, so I want to give you a head start.  

The real dataset is massive (almost 8 GB uncompressed). I've sampled the data down to something more manageable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
*Successfully complete all of these objectives to earn a 2. There are more details on each objective further down in the notebook.*
* <a href="#p1">Part 1</a>: Write a function to tokenize the Yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on Yelp rating
* <a href="#p4">Part 4</a>: Estimate & interpret a topic model of the Yelp reviews

In [89]:
# Import statements
import os
import re
import gensim
import spacy
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pyLDAvis.gensim

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel

from spacy.tokenizer import Tokenizer
from collections import Counter

from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

# Data file
yelp = pd.read_json('./data/review_sample.json', lines=True)

# NLP library 
nlp = spacy.load("en_core_web_lg")

# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

In [90]:
def clean_str(string):
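    """
    String cleaning for the review text before tokenization.
    Removes digits, newlines, carriage returns, and quote characters,
    then strips surrounding whitespace and lowercases the result.
    """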
    string = re.sub(r"[0-9]", "", string)
    string = re.sub(r"\n", "", string)    
    string = re.sub(r"\r", "", string)
    string = re.sub(r"\'", "", string)    
    string = re.sub(r"\"", "", string)    
    return string.strip().lower()

In [91]:
yelp.head(10)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"BEWARE!!! FAKE, FAKE, FAKE....We also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,Came here for lunch Togo. Service was quick. S...,0,5CgjjDAic2-FAvCtiHpytA
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,I've been to Vegas dozens of times and had nev...,2,BdV-cf3LScmb8kZ7iiBcMA
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,We went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,"3.5 to 4 stars\n\nNot bad for the price, $12.9...",5,n9QO4ClYAS7h9fpQwa5bhA
5,Db3CfZWrtG33UZSs8Tdlsg,1,2016-10-23 22:43:56,1,nXYV_0joQEMXYAfNyOPsRw,4,"Tasty, fast casual Latin street food. The men...",1,Gjz2PCbLZ6midk1n_0LaUg
6,gJhMeq2nVH27tz8LqbD3eQ,0,2013-05-20 19:09:43,0,ZA7SRi6fTRWwpo-B9O72qQ,5,This show is absolutely amazing!! What an incr...,0,BeKPVuqX-2at4izqVwUFEg
7,Yt5gK4E9NqVa14WNiQdBlQ,0,2018-07-12 01:19:53,0,4_GnHPkyTirzK6onIKO4jw,4,Came for the Pho and really enjoyed it! We go...,0,PuXpIJzTBQejeBZh9hwynQ
8,c7WsC8SbUcLyZkREzx9dGA,1,2017-09-27 22:10:26,0,XGGHc7pYgOm5s6SWr8NMXA,5,Absolutely the most Unique experience in a nai...,0,NVVknS1I51z8wY5NNrJ6vQ
9,NSifXpsCRvnsBRqrHF9CJA,0,2015-01-25 08:43:15,0,--e66tyhwCE6eoRmcK2w8g,1,Wow. I walked in and sat at the bar for 10 min...,2,J7MsJKJDSA5OGo2-Hn7MbA


In [92]:
yelp['text'] = [clean_str(item) for item in yelp['text']]
yelp.head(10)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"beware!!! fake, fake, fake....we also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,came here for lunch togo. service was quick. s...,0,5CgjjDAic2-FAvCtiHpytA
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,ive been to vegas dozens of times and had neve...,2,BdV-cf3LScmb8kZ7iiBcMA
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,we went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,". to starsnot bad for the price, $. for lunch...",5,n9QO4ClYAS7h9fpQwa5bhA
5,Db3CfZWrtG33UZSs8Tdlsg,1,2016-10-23 22:43:56,1,nXYV_0joQEMXYAfNyOPsRw,4,"tasty, fast casual latin street food. the men...",1,Gjz2PCbLZ6midk1n_0LaUg
6,gJhMeq2nVH27tz8LqbD3eQ,0,2013-05-20 19:09:43,0,ZA7SRi6fTRWwpo-B9O72qQ,5,this show is absolutely amazing!! what an incr...,0,BeKPVuqX-2at4izqVwUFEg
7,Yt5gK4E9NqVa14WNiQdBlQ,0,2018-07-12 01:19:53,0,4_GnHPkyTirzK6onIKO4jw,4,came for the pho and really enjoyed it! we go...,0,PuXpIJzTBQejeBZh9hwynQ
8,c7WsC8SbUcLyZkREzx9dGA,1,2017-09-27 22:10:26,0,XGGHc7pYgOm5s6SWr8NMXA,5,absolutely the most unique experience in a nai...,0,NVVknS1I51z8wY5NNrJ6vQ
9,NSifXpsCRvnsBRqrHF9CJA,0,2015-01-25 08:43:15,0,--e66tyhwCE6eoRmcK2w8g,1,wow. i walked in and sat at the bar for minut...,2,J7MsJKJDSA5OGo2-Hn7MbA


## Part 1: Tokenize Function
<a id="p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.
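
One way to meet the "one document in, list of tokens out" contract is sketched below. It is a minimal illustration that relies only on a regular expression and a tiny hand-written stop word set (`STOP_WORDS` is a hypothetical placeholder, not the spaCy or gensim list), so its output will differ from what a spaCy tokenizer produces.

```python
import re

# Hypothetical, deliberately tiny stop word set; swap in a real list in practice.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "were", "it",
              "i", "we", "to", "of", "for", "in", "on", "at"}


def tokenize(doc):
    """Return a list of lowercase word tokens for a single document."""
    # [a-z]+ keeps runs of letters only, so punctuation, digits, and
    # whitespace can never end up as tokens.
    words = re.findall(r"[a-z]+", doc.lower())
    return [word for word in words if word not in STOP_WORDS]


print(tokenize("BEWARE!!! FAKE, FAKE....We also own a small business."))
```

Because the function takes a single string, it could be applied to a whole column with something like `yelp['text'].apply(tokenize)`. Note that contractions are split at the apostrophe under this pattern, which may or may not be what you want.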

In [93]:
yelp.shape

(10000, 9)

In [94]:
def tokenize(df_col):
    """Tokenize every document in df_col; return one list of tokens per document."""
    tokens = []

    for doc in tokenizer.pipe(df_col, batch_size=1000):
        # Keep lowercased tokens that are neither stop words nor punctuation
        doc_tokens = [token.text.lower() for token in doc
                      if not token.is_stop and not token.is_punct]
        tokens.append(doc_tokens)
    return tokens
    
tokens = tokenize(yelp['text'])
yelp['tokens'] = tokens
yelp.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,tokens
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"beware!!! fake, fake, fake....we also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w,"[beware!!!, fake,, fake,, fake....we, small, b..."
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,came here for lunch togo. service was quick. s...,0,5CgjjDAic2-FAvCtiHpytA,"[came, lunch, togo., service, quick., staff, f..."
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,ive been to vegas dozens of times and had neve...,2,BdV-cf3LScmb8kZ7iiBcMA,"[ive, vegas, dozens, times, stepped, foot, cir..."
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,we went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ,"[went, night, closed, street, party..., best, ..."
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,". to starsnot bad for the price, $. for lunch...",5,n9QO4ClYAS7h9fpQwa5bhA,"[ , starsnot, bad, price,, $., lunch,, seniors..."


## Part 2: Vector Representation
<a id="p2"></a>
1. Create a vector representation of the reviews
2. Write a fake review and query for the 10 most similar reviews, then print the text of the reviews. Do you notice any patterns?
    - Given the size of the dataset, it will probably be best to use a `NearestNeighbors` model for this. 
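
The query step described above might look roughly like the following sketch. It uses a four-review toy list as a stand-in for `yelp['text']` (the reviews and the query are made up for illustration) and assumes a brute-force cosine search, which lets `NearestNeighbors` work directly on the sparse TF-IDF matrix without converting it to a dense array.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for yelp['text']; the real column would be passed the same way.
reviews = [
    "the fish tacos were delicious and cheap",
    "terrible service and the food was cold",
    "great haircut, friendly stylist",
    "awful food, i hate this place",
]

tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(reviews)  # stays a scipy sparse matrix

# Brute-force search with the cosine metric accepts sparse input.
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(dtm)

# Vectorize the query with the already fitted vocabulary, then look it up.
query = tfidf.transform(["the food was awful and i hate fish"])
distances, indices = nn.kneighbors(query)

for dist, idx in zip(distances[0], indices[0]):
    print(f"review #{idx} (cosine distance {dist:.2f}): {reviews[idx]}")
```

Cosine distance ignores document length, which tends to suit TF-IDF vectors; whether it beats the default Euclidean metric on the real data is something to check rather than assume.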

In [95]:
# Drop all non-ASCII characters, including the Chinese text (since I have no clue what it says!)
yelp['text'] = yelp['text'].apply(lambda row: row.encode('ascii',errors='ignore').decode())

In [96]:
def vector_representation(df_col):
    """ Creates a vector representation of specified dataframe column"""
    # text
    data = [item for item in df_col]

    # create the transformer
    vect = CountVectorizer()

    # build vocab
    vect.fit(data)

    # transform text into document term matrix
    dtm = vect.transform(data)
    dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())
    return dtm

In [97]:
vector_representation(yelp['text'])

Unnamed: 0,___,______,_____________,____________this,_er,_is_,_jjbwhwlfagjwfwgljzq,aa,aaa,aaaahhhs,...,zumanity,zumba,zuni,zupas,zuzana,zuzu,zyr,zyrtec,zzaplon,zzzzzzzzzmy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Nearest Neighbors for similarity
 - also using TFIDF

In [98]:
data = [item for item in yelp['text']]

# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words=nlp.Defaults.stop_words)

# Learn the vocabulary and return the TF-IDF document-term matrix in one step
dtm = tfidf.fit_transform(data)

# Get feature names to use as dataframe column headers
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())

# View Feature Matrix as DataFrame
dtm.head()


Unnamed: 0,___,______,_____________,____________this,_er,_is_,_jjbwhwlfagjwfwgljzq,aa,aaa,aaaahhhs,...,zumanity,zumba,zuni,zupas,zuzana,zuzu,zyr,zyrtec,zzaplon,zzzzzzzzzmy
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [99]:
nn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=10, p=2, radius=1.0)

In [100]:
nn.kneighbors([dtm.iloc[100]])

(array([[0.        , 1.        , 1.        , 1.27492432, 1.29100986,
         1.30155906, 1.30602022, 1.30830529, 1.33021768, 1.33743104]]),
 array([[ 100, 6204, 6311, 4238,  154,  683, 7749, 4984, 5192, 7101]],
       dtype=int64))

In [101]:
fake_review = [
    """
    I can't believe how awful the food was. I ordered the fish and they gave me Phish. I hate that band!
    """
]

In [102]:
similar = tfidf.transform(fake_review)
nn.kneighbors(similar.todense())

(array([[1.        , 1.        , 1.2194245 , 1.26316368, 1.26568219,
         1.26888111, 1.27180754, 1.27778069, 1.28456331, 1.28603231]]),
 array([[6311, 6204, 4769,  507, 3757, 3886,  281, 9220, 5414, 6419]],
       dtype=int64))

In [103]:
print([item for item in nn.kneighbors(similar.todense())[1]])

[array([6311, 6204, 4769,  507, 3757, 3886,  281, 9220, 5414, 6419],
      dtype=int64)]


In [104]:
similar_arr = nn.kneighbors(similar.todense())[1]
most_similar = [i for i in similar_arr[0]]
for i in most_similar:
    print(f'review #{i}: {data[i]} \n')

review #6311:  

review #6204:   

review #4769: i hate chains.  i want to hate this place so bad. however.... every time i come i have a great experience.  the food is good.  the service is excellent, bordering on perfect.  the prices are very fair. 

review #507: this place has excellent fried fish tecos. although i wish it had seasoned grilled fish options, the  fried fish was absolutely delicious. the portions were generous and the fish tacos were very reasonably priced. you have to doctor them up with all the special sauces and toppings they have at their salsa bar which is part of the fun.  you get them exactly how you want them. 

review #3757: i ordered the aloo tikki and found this rubber band inside one of them. its unacceptable and it took like thirty minutes to get the food! 

review #3886: love the food here but i hate restaurants that randomly decide to shut down their to go orders.  i have ordered to go for years and apparently they have now decided not to do it when the

## Part 3: Classification
<a id="p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVectorizer` and any sklearn classifier. Use that pipeline to estimate a model to predict `stars`. Use the Pipeline to predict a star rating for your fake review from Part 2. 
2. Tune the entire pipeline with a GridSearch
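
The two steps above can be sketched end to end as follows. This is a minimal illustration on made-up toy data (the `texts` and `stars` lists are hypothetical stand-ins for `yelp['text']` and `yelp['stars']`), and it assumes the grid search is fitted on the training split only, so that the score on the test split reflects reviews the model has not seen.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); replace with the real reviews and ratings.
texts = ["great food and friendly staff", "awful service, never again"] * 20
stars = [5, 1] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts, stars, test_size=0.25, random_state=0, stratify=stars
)

pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Step names prefix the parameter names, so one grid tunes both stages.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [10, 20],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)  # the test split is never seen during tuning

print("best params:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
print("fake review prediction:", search.predict(["awful service"])[0])
```

With `refit=True` (the default), the fitted search object behaves like the best pipeline, so raw strings can be passed straight to `predict`.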

In [105]:
yelp.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,tokens
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"beware!!! fake, fake, fake....we also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w,"[beware!!!, fake,, fake,, fake....we, small, b..."
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,came here for lunch togo. service was quick. s...,0,5CgjjDAic2-FAvCtiHpytA,"[came, lunch, togo., service, quick., staff, f..."
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,ive been to vegas dozens of times and had neve...,2,BdV-cf3LScmb8kZ7iiBcMA,"[ive, vegas, dozens, times, stepped, foot, cir..."
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,we went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ,"[went, night, closed, street, party..., best, ..."
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,". to starsnot bad for the price, $. for lunch...",5,n9QO4ClYAS7h9fpQwa5bhA,"[ , starsnot, bad, price,, $., lunch,, seniors..."


In [106]:
df = yelp[['text', 'tokens', 'cool', 'funny', 'useful', 'stars']]
df.head()

Unnamed: 0,text,tokens,cool,funny,useful,stars
0,"beware!!! fake, fake, fake....we also own a sm...","[beware!!!, fake,, fake,, fake....we, small, b...",1,0,10,1
1,came here for lunch togo. service was quick. s...,"[came, lunch, togo., service, quick., staff, f...",0,0,0,4
2,ive been to vegas dozens of times and had neve...,"[ive, vegas, dozens, times, stepped, foot, cir...",1,1,2,3
3,we went here on a night where they closed off ...,"[went, night, closed, street, party..., best, ...",3,4,5,1
4,". to starsnot bad for the price, $. for lunch...","[ , starsnot, bad, price,, $., lunch,, seniors...",1,0,5,4


In [107]:
X = df['text'].tolist()
y = np.array(df['stars'])

In [108]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [109]:
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
clf = RandomForestClassifier()

pipe = Pipeline([('vect', vect), ('clf', clf)])

In [110]:
parameters = {
    'vect__max_df': (0.85, 1.0),
    'vect__min_df': (0.02, 0.05),
    'clf__max_depth':(5,10,15,20),
    'clf__n_estimators':(10, 20,),
}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X, y)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   11.1s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  1.1min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vect__max_df': (0.85, 1.0), 'vect__min_df': (0.02, 0.05), 'clf__max_depth': (5, 10, 15, 20), 'clf__n_estimators': (10, 20)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [111]:
grid_search.best_score_

0.5485

In [112]:
predictions = grid_search.predict(X_test)

In [131]:
# Print the first 15 predictions (loop over range(len(X_test)) to print them all)
for i in range(15):
    print(f'Review: {X_test[i][:50]}...,  Predicted score: {predictions[i]}')

Review: this place has amazing food!  its really close to ...,  Predicted score: 5
Review: had a good lunch here with a large party.  the spa...,  Predicted score: 4
Review: wow...what an absolutely awesome experience a+...f...,  Predicted score: 5
Review: great service. very knowledgeable about color and ...,  Predicted score: 5
Review: perfect lunch! jennifer was our server and she was...,  Predicted score: 5
Review: my daughter cracked the screen on her samsung gala...,  Predicted score: 5
Review: this location is always very clean and well kept. ...,  Predicted score: 5
Review: tasty, super fast and friendly staff. made for a g...,  Predicted score: 5
Review: we were on our way out of the southwest all we wer...,  Predicted score: 5
Review: i enjoyed my visit at this location. it was a huge...,  Predicted score: 5
Review: the predecessor in this location was a hotpot plac...,  Predicted score: 1
Review: this office is a joke!!! the only reason i decided...,  Predicted score: 1
Revi

## Part 4: Topic Modeling

Let's find out what those yelp reviews are saying! :D

1. Estimate an LDA topic model of the review text
    - Keep the `iterations` parameter at or below 5 to reduce run time
    - The `workers` parameter should match the number of physical cores on your machine.
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

In [132]:
from gensim.models import LdaMulticore
from gensim.corpora import Dictionary

Learn the vocabulary of the yelp data:

In [146]:
def doc_stream(df_col):
    """Yield one tokenized document at a time from df_col."""
    for doc in df_col:
        yield doc

In [147]:
streaming_data = doc_stream(df['tokens'])

In [None]:
next(streaming_data)

In [154]:
id2word = corpora.Dictionary(doc_stream(df['tokens']))

In [155]:
id2word.filter_extremes(no_below=5, no_above=0.95)

Create a bag of words representation of the entire corpus

In [157]:
corpus = [id2word.doc2bow(text) for text in doc_stream(df['tokens'])]
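
For intuition about what a bag-of-words corpus contains, here is a small pure-Python sketch. It is not gensim's implementation, and the helper names `build_vocab` and `to_bow` are made up for this illustration; gensim may assign token ids in a different order. The idea is the same, though: each document becomes a list of `(token_id, count)` pairs, and tokens missing from the vocabulary are dropped.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [["food", "good", "food"], ["service", "good"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation; only how often each vocabulary entry occurs in a document survives.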

Your LDA model should be ready for estimation: 

In [158]:
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   iterations=5,
                   workers=4,
                   num_topics = 10 # You can change this parameter
                  )

Create 1-2 visualizations of the results

In [160]:
lda.print_topics()

[(0,
  '0.040*" " + 0.008*"like" + 0.008*"good" + 0.007*"food" + 0.007*"place" + 0.005*"great" + 0.005*"time" + 0.005*"nice" + 0.004*"dont" + 0.004*"didnt"'),
 (1,
  '0.038*" " + 0.011*"place" + 0.008*"great" + 0.006*"good" + 0.006*"food" + 0.006*"got" + 0.005*"service" + 0.005*"like" + 0.005*"time" + 0.004*"dont"'),
 (2,
  '0.055*" " + 0.009*"great" + 0.008*"food" + 0.008*"like" + 0.007*"place" + 0.006*"good" + 0.006*"time" + 0.005*"service" + 0.005*"im" + 0.004*"love"'),
 (3,
  '0.033*" " + 0.007*"like" + 0.007*"good" + 0.006*"place" + 0.006*"service" + 0.006*"food" + 0.005*"great" + 0.004*"didnt" + 0.004*"love" + 0.004*"ordered"'),
 (4,
  '0.044*" " + 0.011*"food" + 0.009*"place" + 0.007*"great" + 0.007*"got" + 0.007*"like" + 0.006*"service" + 0.006*"time" + 0.005*"good" + 0.004*"come"'),
 (5,
  '0.051*" " + 0.009*"food" + 0.008*"like" + 0.007*"good" + 0.007*"great" + 0.006*"place" + 0.005*"time" + 0.005*"service" + 0.005*"dont" + 0.005*"best"'),
 (6,
  '0.050*" " + 0.010*"place" + 

In [163]:
words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]
topics = [' '.join(t[0:5]) for t in words]
for topic_id, t in enumerate(topics): 
    print(f"------ Topic {topic_id} ------")
    print(t, end="\n\n")

------ Topic 0 ------
  like good food place

------ Topic 1 ------
  place great good food

------ Topic 2 ------
  great food like place

------ Topic 3 ------
  like good place service

------ Topic 4 ------
  food place great got

------ Topic 5 ------
  food like good great

------ Topic 6 ------
  place food service good

------ Topic 7 ------
  great place like good

------ Topic 8 ------
  time food place great

------ Topic 9 ------
  great good food place



#### Topic Coherence Score

In [167]:
import pyLDAvis.gensim
from gensim.models.coherencemodel import CoherenceModel

In [168]:
def compute_coherence_values(dictionary, corpus, limit, start=2, step=3, passes=5):
    """
    Compute u_mass coherence for a range of topic counts

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between topic counts
    passes : Number of times the full sweep over topic counts is repeated

    Returns:
    -------
    coherence_values : List of dicts holding the pass, number of topics, and coherence score for each fitted LDA model
    """
    
    coherence_values = []
    
    for iter_ in range(passes):
        for num_topics in range(start, limit, step):
            model = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=4)
            coherencemodel = CoherenceModel(model=model,dictionary=dictionary,corpus=corpus, coherence='u_mass')
            coherence_values.append({'pass': iter_, 
                                     'num_topics': num_topics, 
                                     'coherence_score': coherencemodel.get_coherence()
                                    })

    return coherence_values

In [None]:
# NOTE: I have not run this yet as I didn't have time to wait for it to finish.
# It should, however, work just fine. 

coherence_values = compute_coherence_values(dictionary=id2word,
                                            corpus=corpus,
                                            start=2,
                                            limit=40,
                                            step=6,
                                            passes=40)

In [None]:
topic_coherence = pd.DataFrame.from_records(coherence_values)
ax = sns.lineplot(x="num_topics", y="coherence_score", data=topic_coherence)

## Stretch Goals

Complete one or more of these to push your score towards a three: 
* Incorporate named entity recognition into your analysis
* Compare vectorization methods in the classification section
* Analyze more (or all) of the Yelp dataset - this one is very hard. 
* Use a generator object on the reviews file - this would help you analyze the whole dataset.
* Incorporate any of the other yelp dataset entities in your analysis (business, users, etc.)
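
As a starting point for the generator stretch goal, the sketch below reads a JSON-lines file one record at a time, so the whole file never has to fit in memory. The function name `stream_reviews` and its `field` parameter are made up for this illustration, and the demo writes a tiny temporary file rather than touching the real dataset.

```python
import json
import os
import tempfile


def stream_reviews(path, field="text"):
    """Yield one field at a time from a file with one JSON object per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)[field]


# Demo on a throwaway two-line file (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"stars": 5, "text": "great tacos"}\n')
    tmp.write('{"stars": 1, "text": "cold food"}\n')
    demo_path = tmp.name

demo_texts = list(stream_reviews(demo_path))
os.remove(demo_path)
print(demo_texts)
```

A generator can only be consumed once, so any step that needs several passes over the reviews would have to call `stream_reviews` again for each pass.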