<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset/challenge). As part of the job selection process, some of my friends have been asked to create analysis of this dataset, so I want to empower you to have a head start.  

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data for you to something more managable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
*Successfully complete these all these objectives to earn a 2. There are more details on each objective further down in the notebook.*
* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

In [56]:
#Imports


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import re

import spacy
nlp = spacy.load('en_core_web_lg')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel
import pyLDAvis.gensim

In [17]:
#read CSV

yelp = pd.read_json('./data/review_sample.json', lines=True)

In [18]:
yelp.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"BEWARE!!! FAKE, FAKE, FAKE....We also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,Came here for lunch Togo. Service was quick. S...,0,5CgjjDAic2-FAvCtiHpytA
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,I've been to Vegas dozens of times and had nev...,2,BdV-cf3LScmb8kZ7iiBcMA
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,We went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,"3.5 to 4 stars\n\nNot bad for the price, $12.9...",5,n9QO4ClYAS7h9fpQwa5bhA


In [19]:
yelp.shape

(10000, 9)

## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.

In [21]:
STOPWORDS = set(STOPWORDS)

def tokenize(text):
    return [token for token in simple_preprocess(text, deacc=True, min_len=4, max_len=20) if token not in STOPWORDS]

In [22]:
yelp['tokens'] = yelp['text'].apply(tokenize)

In [26]:
yelp['tokens'][5]

['tasty',
 'fast',
 'casual',
 'latin',
 'street',
 'food',
 'menu',
 'overwhelming',
 'tried',
 'good',
 'recommend',
 'trying',
 'arepa',
 'nachos',
 'extremely',
 'good',
 'bang',
 'buck',
 'people',
 'space',
 'pretty',
 'small',
 'problematic',
 'friday',
 'lunch',
 'taco',
 'tuesday']

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews
2. Write a fake review and query for the 10 most similiar reviews, print the text of the reviews. Do you notice any patterns?
    - Given the size of the dataset, it will probably be best to use a `NearestNeighbors` model for this. 

In [28]:
# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words = 'english',tokenizer=tokenize)

# Create a vocabulary and get word counts per document
dtm = tfidf.fit_transform(yelp['text'])

# Get feature names to use as dataframe column headers
dtm = pd.DataFrame(dtm.todense(),columns = tfidf.get_feature_names())

# View Feature Matrix as DataFrame
dtm.head()

Unnamed: 0,aaaahhhs,aaasssk,aabs,aamco,aand,aaron,aback,abandoned,abby,abdc,abdominal,abend,aber,aberration,abgeht,abhorrent,abiance,abide,abiding,abigail,abilities,ability,abit,ablation,able,abmormal,abnormal,abnormally,aboard,abord,abordable,abordables,abound,abraham,abrasive,abreast,abricot,abroad,abrupt,abruptly,abruzzo,abscess,absence,absent,absinthe,absolument,absolute,absolutely,absolutley,absolutly,absolving,absorb,absorbed,absorbs,abstains,abstecher,absurd,absurdly,abundance,abundant,aburi,aburiya,aburri,abuse,abusive,abut,abutment,abutments,abyss,abyssinian,academic,academy,acadia,acai,acapella,acapulco,acceleration,accent,accented,accentuate,accept,acceptable,accepted,accepting,accepts,acces,access,accessable,accessed,accessibility,accessible,accessibles,accessories,accessory,acchs,accident,accidental,accidentally,accidently,accidents,...,を体験しました,を出て左折して,んしゃゆっくり飯ても食うへかと,アルティメットハッケーシ,インストラクターは皆とっても親切てす,ウラカンに乗車することかてきました,カヤルトを体験,カートて支払いをしますか,キッスメニューかちょうといい量てす,コース上の自分の位置,サーキット体験はてきます,サーヒスかあまりよくない,シートに座ったり,ストラッフ型のusbを渡されて,ストレートてのトッフスヒートと,スヒートとサウントを楽しんてくたさい,スヒートヘカス,スーハーカーに乗りたい気持ち,トルてトリフト,ヒテオ撮影のオフション,フィーニックス,フェニックス,フェラーリ,フルスロットルハッケーシ,フロの運転はまさにシェットコースター,フート店へ,ホテル街からならi,マッスルハッケーシのみ,ラッフより,ランホルキーニ,ランホルキーニlp,中華の割りにホリューム少ないなとの不評と,乗ってみたいたけなら,乗ってみたいな,乗り継きを急く場合は別のファースト,予定になかったまつ毛エクステもお願いし,予算や時間に余裕かあれは,今天我點了一個韓國冷面湯,以後會常常來,佐料衛生,來一個熱辣辣牛肉粉,写真を撮ったりして楽しんてくたさい,冰沙系列不會太甜膩,分乗り換えは全く間に合わす,到着便と乗り換え便のターミナルか違うと,前々日は,包含擺盤精緻,受付か終わったら,受付てはipadを利用して,受付て見学のみと伝えると,台て満足しましたか,台湾鸡排,吃了太多tim,同しターミナル乗り換えて,同行者かいれは,周ても充分,周というものて,周回数は,問題ありませんてした,地元の発音ては,夏日想開胃,天氣很熱吃不下東西,安心てす,実際に体験したのはわたしてはなく夫てすか,少し周回数を多めに取るのかお薦めてす,当然車か好きなのて,彼はますランホルキーニlp,彼らは皆,待ち時間か長い,待ち時間長い,手振りと単語て理解てきると思います,探したお店てした,日本人の方も日本語か話せる方も居て,日本語を話せるスタッフはおりませんか,時間半の乗り換え時間,時間後の便,服務人員也很敬業,次にラスヘカスを訪れたときもまた行きたい,特に女性には,現在はまた仮設の建物ての営業てすか,番ターミナルにあります,紙てのレシートはもらえす,終わったら,結構待ち時間はありますから,美味的味道,自分か乗りたい車のものを選ひます,英語か得意てなくても,見学たけすることも可能てす,視覚的に分かりやすく進めることかてきます,覺得店家很用心製作,言えはレシートももらえるかもしれません,誓約書にサイン,誰も乗車しなくても,質問にも丁寧に答えてくれましたし,車好きさんには,這是一個不錯的選擇,運転しない,運転中も英語て指導かあります,食へ物はうまい,餐後點了甜點
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# Fit on TF-IDF Vectors
nn  = NearestNeighbors(n_neighbors = 5, algorithm = 'kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)

In [31]:
fake_review = ["Always wonderful. Crowded on a Saturday night, but the dinner pacing was well handled. The food prep is excellent, but I think that's one of the reasons we make excuses to come here.Fish was amazing and fresh. Our evening was a definite success."]

In [32]:
#Find most similar review to fake review

new = tfidf.transform(fake_review)

nn.kneighbors(new.todense())

(array([[1.27509132, 1.28756209, 1.29519023, 1.30448106, 1.30530487]]),
 array([[1470,  123, 2041, 8958, 9220]]))

In [33]:
yelp['text'][1470]

'Excellent food & service. French cuisine in our own backyard. Table-side prep for many entrees & desserts is great fun!'

In [34]:
yelp['text'][123]

'Amazing food! Amazing service! Tons of fun! Come here if you want to have a great night!'

In [35]:
yelp['text'][2041]

"I've been to a couple Scaddabush's before, and they are usually okay. But this location is new and awesome.\n\nThe ambiance outside is relaxed when we were initially outside on the patio. Warm with a cool breeze at times. We has a few drink specials which change regularly. BUT the fresh mozzarella was AMAZING. Drool...... I was sharing with my friend, but I was about to kick him an take them all for myself. Its a definite try, I don;t think you will be disappointed.\n\nIt began to rain and we went inside. Inside was relaxed and more for groups or dates. Not really a bar-ish environment. But really nice especially for a relaxed evening out. I had the Spaghetti and Meatballs for dinner, and it was pretty good. Had a few whiskeys on the rocks as well. I was filled by the end of the night. \n\nThe total cost for a long night and lots of food for 2 grown (yes I consider myself grown) men was about $140, split down $70 each. Was a bit costly, but we were there from about 7pm-11pm and ate a 

In [36]:
yelp['text'][8958]

'Second time here...excellent food...wings were amazing!\n\nJames did an excellent job as our server....\n\nDefinitely will come back...'

In [37]:
yelp['text'][9220]

'The food was delicious. The fish was very fresh and every piece tasted great. The service was also very good.'

## Part 3: Classification
<a id="#p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a piepline object with a sklearn `CountVectorizer` or `TfidfVector` and any sklearn classifier. Use that pipeline to estimate a model to predict `stars`. Use the Pipeline to predict a star rating for your fake review from Part 2. 
2. Tune the entire pipeline with a GridSearch

In [57]:
#vectorizer
vect = TfidfVectorizer(stop_words='english')

#Classifier
rfc = RandomForestClassifier()

# Define the Pipeline
pipe = Pipeline([('vect', vect), ('clf', rfc)])

In [58]:
parameters = {
     #'clf__alpha': (1.0000000000000001e-05, 9.9999999999999995e-07),
     #'clf__max_iter': (10, 50),
     'clf__penalty': ('l2', 'elasticnet'),
     #'tfidf__use_idf': (True, False),
     'vect__max_n': (1, 2),
     'vect__max_df': (0.5, 0.75, 1.0),
     'vect__max_features': (None, 5000, 10000, 50000),
     'clf__max_depth':(5,10,15,20,25)
}

In [59]:
grid_search = GridSearchCV(pipe,parameters, cv=5, n_jobs=4, verbose=1)

In [60]:
clf.fit(yelp['text'], yelp['stars'])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  8.6min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))]),
          fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
          param_distributions={'svd__n_iter': (5, 10, 15), 'svd__n_components': (100, 300, 400, 500, 600, 700, 800, 900, 1000), 'sgdc__class_weight': ('balanced',), 'sgdc__loss': ('hinge', 'log'), 'sgdc__alpha': (0.0007,), 'sgdc__average': (True, False)},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)

In [63]:
print(clf.best_score_)

0.6119


In [64]:
print(clf.best_params_)

{'svd__n_iter': 15, 'svd__n_components': 900, 'sgdc__loss': 'hinge', 'sgdc__class_weight': 'balanced', 'sgdc__average': True, 'sgdc__alpha': 0.0007}


## Part 4: Topic Modeling

Let's find out what those yelp reviews are saying! :D

1. Estimate a LDA topic model of the review text
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

In [65]:
#Learn the vocubalary of the yelp data:
id2word = corpora.Dictionary(yelp['tokens'])

In [66]:
#Create a bag of words representation 
corpus = [id2word.doc2bow(review) for review in yelp['tokens']]

In [67]:
#LDA model for estimation
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   iterations=5,
                   workers=4,
                   num_topics = 15 # You can change this parameter
                  )

In [70]:
# Function to print out Topics in a nice format
def lda_topics():
    words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]
    topics = [' '.join(t[0:5]) for t in words]
    for id,t in enumerate(topics): 
        print(f'-------- topic {id} ---------')
        print(t)
        print("\n")
    
                    


In [71]:
lda_topics()

-------- topic 0 ---------
food place time good like


-------- topic 1 ---------
good food place great time


-------- topic 2 ---------
food like place time great


-------- topic 3 ---------
food place good great service


-------- topic 4 ---------
good like place service food


-------- topic 5 ---------
good place like time great


-------- topic 6 ---------
food service time place great


-------- topic 7 ---------
great good place food service


-------- topic 8 ---------
place great good like food


-------- topic 9 ---------
place food good time great


-------- topic 10 ---------
time service like place food


-------- topic 11 ---------
food good great service place


-------- topic 12 ---------
good place great like food


-------- topic 13 ---------
great food place good service


-------- topic 14 ---------
good food service place great




In [73]:
#Interactive Vizualization of topics. 

import pyLDAvis.gensim

pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(lda, corpus, id2word)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


**analysis on the results**

Most Yelp Reviewers tend to give Positive reviews, this can be seen from term freq of 'great','amazing','like' terms being used. 
Overall the topics are mostly well seperated, but there is 3 overlaps (Topics (7,8),(5,9),(11,12,13)
removing terma like 'food','place' might yield better results since its obvious that yelp reviews would be about food places. 
Making use of features like stars, useful and cool might help further analyse the topics better. 

## Stretch Goals

Complete one of more of these to push your score towards a three: 
* Incorporate named entity recognition into your analysis
* Compare vectorization methods in the classification section
* Analyze more (or all) of the yelp dataset - this one is v. hard. 
* Use a generator object on the reviews file - this would help you with the analyzing the whole dataset.
* Incorporate any of the other yelp dataset entities in your analysis (business, users, etc.)