### Yelp Project Part II: Feature Engineering - Review Analysis - LDA

In [3]:
import pandas as pd
df = pd.read_csv('restaurant_reviews.csv', encoding ='utf-8')

In [2]:
df.head()

Unnamed: 0,business_id,text,useful,stars
0,Bh5VbI_9msk3GaD0kiKkmg,"Gluten free, dairy free and vegan ""ice cream""....",4.0,3.0
1,Bh5VbI_9msk3GaD0kiKkmg,"I struggled with the rating on this one, but I...",3.0,2.0
2,Bh5VbI_9msk3GaD0kiKkmg,I love how the other reviewers only ever had o...,7.0,1.0
3,Bh5VbI_9msk3GaD0kiKkmg,I love to see new vegan places pop up in and a...,3.0,4.0
4,Bh5VbI_9msk3GaD0kiKkmg,A bit pricy for frozen banana mash? \n\nI woul...,8.0,3.0


In [4]:
# getting the training or testing ids is to use the LDA fitting the training sets and predict
# the topic categories of the testing set

train_id = pd.read_csv('train_set_id.csv', encoding ='utf-8')
train_id.columns = ['business_id']

test_id = pd.read_csv('test_set_id.csv', encoding ='utf-8')
test_id.columns = ['business_id']

In [5]:
df_train = train_id.merge(df, how = 'left', left_on='business_id', right_on='business_id')
df_train.dropna(how='any', inplace = True)

df_test = test_id.merge(df, how = 'left', left_on='business_id', right_on='business_id')
df_test.dropna(how='any', inplace = True)

In [6]:
df_train.shape

(334838, 4)

In [7]:
df_test.shape

(145724, 4)

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=0.1,
                        max_features=10000)
X_train = count.fit_transform(df_train['text'].values)
X_test = count.transform(df_test['text'].values)

In [10]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 10, 
                                random_state = 1, 
                                learning_method = 'online',
                                max_iter  = 15,
                                verbose=1,
                                n_jobs = -1)

X_topics_train = lda.fit_transform(X_train)
X_topics_test = lda.transform(X_test)

iteration: 1 of max_iter: 15
iteration: 2 of max_iter: 15
iteration: 3 of max_iter: 15
iteration: 4 of max_iter: 15
iteration: 5 of max_iter: 15
iteration: 6 of max_iter: 15
iteration: 7 of max_iter: 15
iteration: 8 of max_iter: 15
iteration: 9 of max_iter: 15
iteration: 10 of max_iter: 15
iteration: 11 of max_iter: 15
iteration: 12 of max_iter: 15
iteration: 13 of max_iter: 15
iteration: 14 of max_iter: 15
iteration: 15 of max_iter: 15


In [22]:
n_top_words = 30

feature_names = count.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):
    print('Topic %d:' % (topic_idx))
    print(" ".join([feature_names[i] 
                    for i in topic.argsort()  
                     [:-n_top_words - 1: -1]]))

Topic 0:
vegas hotel strip las gets lamb floor casino stay pool club pita rooms smell hummus market greek naan check dance bathroom line staying walk resort gyro play bars smoke wrap
Topic 1:
told waitress manager customer left ask finally walked waiting waited guy waiter away customers let water check owner later tell seated gave called brought friend looked worst arrived rude sat
Topic 2:
pork fried spicy soup dish beef shrimp dishes noodles bbq chinese thai bowl tea green egg pho curry broth crispy fish served noodle corn cooked korean tender tofu red rolls
Topic 3:
breakfast coffee cream chocolate ice cake dessert brunch toast bacon french butter morning fruit desserts sugar waffles apple glasses hash potatoes waffle cup bread cafe pie strawberry flavors pudding bakery
Topic 4:
tacos buffet quality chips stars person worth location mexican line probably maybe ok taco salsa decent places prices reviews star half extra actually pay burrito items isn tasted average fast
Topic 5:
pizza

In [12]:
# identify the column index of the max values in the rows, which is the class of each row
import numpy as np
idx = np.argmax(X_topics_train, axis=1)

In [13]:
df_train['label'] = (df_train['stars'] >= 4)*1

In [14]:
df_train['Topic'] = idx

In [15]:
df_train.head()
df_train.to_csv('review_train.csv', index = False)

In [16]:
df_test['label'] = (df_test['stars'] >= 4)*1

In [17]:
# identify the column index of the max values in the rows, which is the class of each row
import numpy as np
idx = np.argmax(X_topics_test, axis=1)

In [18]:
df_test['Topic'] = idx

In [19]:
df_test.head()
df_test.to_csv('review_test.csv', index = False)

In [62]:
import pandas as pd
import numpy as np

In [63]:
df_train = pd.read_csv('review_train.csv')
df_test = pd.read_csv('review_test.csv')

In [64]:
df_train['score'] = df_train['label'].replace(0, -1)

In [65]:
df_test['score'] = df_test['label'].replace(0, -1)

In [68]:
len(df_train['business_id'].unique())

29598

In [36]:
topic_train = df_train.groupby(['business_id', 'Topic']).mean()['score'].unstack().fillna(0).reset_index()
topic_train.index.name = None
topic_train.columns = ['business_id', 'Topic0', 'Topic1', 'Topic2', 'Topic3', 'Topic4', 
                       'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9']
topic_train.head()

Unnamed: 0,business_id,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9
0,--6MefnULPED_I942VcFNA,0.0,-1.0,0.6,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,--DaPTJW3-tB1vP-PfdTEg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,--FBCX-N37CMYDfs790Bnw,0.0,-0.5,0.0,0.0,-1.0,0.6,0.0,0.0,0.0,0.333333
3,--S62v0QgkqQaVUhFnNHrw,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,--SrzpvFLwP_YFwB_Cetow,0.0,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [28]:
topic_train.to_csv('train_topic_score.csv', index = False)

In [37]:
topic_test = df_test.groupby(['business_id', 'Topic']).mean()['score'].unstack().fillna(0).reset_index()
topic_test.index.name = None
topic_test.columns = ['business_id', 'Topic0', 'Topic1', 'Topic2', 'Topic3', 'Topic4', 
                       'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9']
topic_test.head()

Unnamed: 0,business_id,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9
0,--9e1ONYQuAa-CB_Rrw7Tw,0.0,-0.333333,1.0,1.0,0.333333,0.428571,0.631206,0.0,0.0,1.0
1,--I7YYLada0tSLkORTHb5Q,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,--KCl2FvVQpvjzmZSPyviA,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.0,0.0,0.0
3,-092wE7j5HZOogMLAh40zA,0.0,-0.5,-0.333333,0.0,-1.0,0.0,0.0,0.0,0.0,1.0
4,-0BxAGlIk5DJAGVkpqBXxg,0.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0


In [29]:
topic_test.to_csv('test_topic_score.csv', index = False)

In [38]:
print(topic_train.shape)
print(topic_test.shape)

(29598, 11)
(12672, 11)


In [44]:
topic = pd.concat([topic_train, topic_test])

In [49]:
topic.to_csv('topic_score.csv', index = False)

In [18]:
horror = X_topics[:, 0].argsort()
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\n Horror moive #%d:' % (iter_idx+1))
    print(df['text'][movie_idx][:300], '...')


 Horror moive #1:
I. Don't Get. It. That about sums up my two experiences at Happy Dog. It's a gimmick. The place basically serves two things, tater tots and hot dogs, with the rest of the menu a mishmash of pantry items and sauces that you can then throw on top of said tater tots and hot dogs. It's that simple. Oh,  ...

 Horror moive #2:
Never have I been more pissed off at a restaurant before.
I will later describe how horrible their service is but first let's get the other stuff out of the way.

Ordering system:
We went for lunch and we got to use the ipad. The price is no longer the same as the menu picture on Yelp. Lunch is now  ...

 Horror moive #3:
This place is obviously a well-kept secret - two months since my visit to The Theatre Centre and still no FTR. It's been a while since my last epic review, and my trusty keyboard and I have been dying for a good reason to let the horses out on the open road. Buckle your seatbelts, kids - this Queen  ...


In [None]:
#### Now is the example in the slide

In [19]:
# E.g. take restaurant 'cInZkUSckKwxCqAR7s2ETw' as an example: First Watch

eg_res = df[df['business_id'] == 'cInZkUSckKwxCqAR7s2ETw']

In [17]:
eg = pd.read_csv('topic_score.csv')

In [18]:
eg[eg['business_id'] == 'cInZkUSckKwxCqAR7s2ETw']

Unnamed: 0,business_id,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9
18652,cInZkUSckKwxCqAR7s2ETw,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [42]:
eg_res

Unnamed: 0,business_id,text,useful,stars
711,cInZkUSckKwxCqAR7s2ETw,I've been here twice to this same location. Th...,3.0,4.0
712,cInZkUSckKwxCqAR7s2ETw,First Watch is my first encounter with a chain...,3.0,3.0
713,cInZkUSckKwxCqAR7s2ETw,First Watch was so delicious and much needed a...,4.0,4.0
714,cInZkUSckKwxCqAR7s2ETw,Super healthy breakfast foods. Super tasty to...,5.0,4.0
715,cInZkUSckKwxCqAR7s2ETw,This place rules!\nI have never had even a sli...,3.0,5.0


In [41]:
eg_res.loc[715, :]['text']

'This place rules!\nI have never had even a slightly negative experience here.\n\nThe place is clean, staff is friendly and the food...\n\nThey have a regular menu that\'s packed full of breakfast variety as well as a seasonal menu that changes semi-regularly.  \nRight now they have a pumpkin pancake special that donates to the www.nokidhungry.com foundation.  I have not tried it, due to my lack of obsession over pumpkin spiced things.\nWhat I did try, last time, was the mushroom frittata and the pork chop and egg special.  So flipping good!\n\nThis round we started with the butternut squash soup, with sour cream on the side.  They put quite a bit in the bowl and I feel like it diluted the flavor.  On the side is perfect because hints every now and then were perfect.\nI ordered the Benedict this time around and got the salmon and vine grown tomato as the topping.  The farm fresh cage free eggs really made the dish shine with the bright orange yolk.  The greens they serve as a side were