# Homework 4 (Due 5:30pm PST April 30th, 2019): Word Embeddings

### Submit one notebook per project group via Slack/email.

1. Pick your dataset for approval by me by Friday 11:59pm PST. Not submitting for approval will result in no credit for this HW.


2. Find the **most similar sentences or documents in your dataset using word count, TF-IDF, and word-embeddings** as your vectorization techniques. If the computation is slow, **you may subsample** for only a few thousand rows. (2 pts)


In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

[nltk_data] Downloading package wordnet to /Users/tina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tina/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/tina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Preparing and processing corpus

In [2]:
tea_review = pd.read_csv("../data/yelp_bubble_tea_reviews.csv")
tea_review.shape

(40277, 17)

In [3]:
tea_review.rename(columns = {'reivew':'review'}, inplace = True)
tea_review.head()

Unnamed: 0,review_date,review_rating,review,cool,funny,useful,business_id,restaurant_name,categories,city,state,restaurant_rating,restaurant_review_count,business_parking,ambience,is_open,attributes
0,2015-04-15 06:25:19.000000,4,I have to say I really enjoy the Boba Tea sele...,0,0,0,SVUxmYs6_TvX5kWv0ok-MA,No. 1 Boba Tea,"""Coffee & Tea, Food, Juice Bars & Smoothies, B...",Las Vegas,NV,4.0,295,"{""lot"": ""TRUE"", ""valet"": ""FALSE"", ""garage"": ""F...",,1,"{""BusinessAcceptsCreditCards"": ""True"", ""BikePa..."
1,2017-10-02 02:46:08.000000,3,I appreciate that it opens earlier than all th...,0,0,0,SVUxmYs6_TvX5kWv0ok-MA,No. 1 Boba Tea,"""Coffee & Tea, Food, Juice Bars & Smoothies, B...",Las Vegas,NV,4.0,295,"{""lot"": ""TRUE"", ""valet"": ""FALSE"", ""garage"": ""F...",,1,"{""BusinessAcceptsCreditCards"": ""True"", ""BikePa..."
2,2017-07-28 19:09:36.000000,5,My favorite Boba tea place in Vegas. So many c...,0,0,0,SVUxmYs6_TvX5kWv0ok-MA,No. 1 Boba Tea,"""Coffee & Tea, Food, Juice Bars & Smoothies, B...",Las Vegas,NV,4.0,295,"{""lot"": ""TRUE"", ""valet"": ""FALSE"", ""garage"": ""F...",,1,"{""BusinessAcceptsCreditCards"": ""True"", ""BikePa..."
3,2014-06-28 01:52:37.000000,5,"Love, love, love their boba and the variety of...",1,0,1,SVUxmYs6_TvX5kWv0ok-MA,No. 1 Boba Tea,"""Coffee & Tea, Food, Juice Bars & Smoothies, B...",Las Vegas,NV,4.0,295,"{""lot"": ""TRUE"", ""valet"": ""FALSE"", ""garage"": ""F...",,1,"{""BusinessAcceptsCreditCards"": ""True"", ""BikePa..."
4,2015-01-04 02:12:56.000000,2,what the heck?? came here because QQ Boba was ...,0,0,1,SVUxmYs6_TvX5kWv0ok-MA,No. 1 Boba Tea,"""Coffee & Tea, Food, Juice Bars & Smoothies, B...",Las Vegas,NV,4.0,295,"{""lot"": ""TRUE"", ""valet"": ""FALSE"", ""garage"": ""F...",,1,"{""BusinessAcceptsCreditCards"": ""True"", ""BikePa..."


In [4]:
## Drop Vietnamese restaurants, which sell bubble tea but are not our focus
# 2712 of the 40277 reviews are for Vietnamese-style restaurants
tea_review2 = tea_review[~tea_review['categories'].str.contains('Vietnamese')]
print(tea_review2.shape)

(37565, 17)


In [5]:
## Randomly select 3000 reviews for our analysis
num_reviews = 3000
tea_review_final = tea_review2.sample(n=num_reviews, random_state=3).reset_index(drop=True)
tea_review_final.head()

Unnamed: 0,review_date,review_rating,review,cool,funny,useful,business_id,restaurant_name,categories,city,state,restaurant_rating,restaurant_review_count,business_parking,ambience,is_open,attributes
0,2018-06-17 03:25:13.000000,3,Cute lil cafe. I think for the price you pay f...,0,0,0,YCm7wypibp04buWh-jRRpg,Cafe Summer,"""Food, Desserts, Asian Fusion, Cafes, Delicate...",Las Vegas,NV,4.0,321,"{""lot"": ""TRUE"", ""valet"": ""FALSE"", ""garage"": ""F...","{""divey"": ""FALSE"", ""casual"": ""TRUE"", ""classy"":...",1,"{""BusinessParking"": ""{'garage': False, 'street..."
1,2010-11-08 15:00:43.000000,4,Good service and atmosphere. The lychee milk t...,0,0,0,2OJrznHaA4Gz_KYbQnAuzQ,Volcano Tea House,"""Bubble Tea, Food, Coffee & Tea, Restaurants, ...",Las Vegas,NV,3.5,381,"{""lot"": ""FALSE"", ""valet"": ""FALSE"", ""garage"": ""...",,1,"{""WiFi"": ""'free'"", ""BikeParking"": ""True"", ""Bus..."
2,2015-11-20 17:09:47.000000,5,East Coast bubble tea has arrived in Cleveland...,2,0,0,jmTirQw-n4V_Z4g1-RE9iw,Kung Fu Tea,"""Bubble Tea, Japanese, Restaurants, Coffee & T...",Cleveland,OH,4.0,46,"{""lot"": ""FALSE"", ""valet"": ""FALSE"", ""garage"": ""...","{""divey"": ""FALSE"", ""casual"": ""FALSE"", ""classy""...",1,"{""Caters"": ""False"", ""OutdoorSeating"": ""False"",..."
3,2017-12-21 05:28:10.000000,4,A little expensive but overall quite good. The...,0,0,0,g-JpN7DDCV6Mvth1Yodf5w,Mango Mania,"""Sandwiches, Asian Fusion, Restaurants, Desser...",Calgary,AB,4.0,48,"{""lot"": ""FALSE"", ""valet"": ""FALSE"", ""garage"": ""...","{""divey"": ""FALSE"", ""casual"": ""FALSE"", ""classy""...",1,"{""WiFi"": ""u'free'"", ""Alcohol"": ""'none'"", ""Whee..."
4,2017-01-19 23:22:12.000000,5,Absolutely delicious! Must try! \nI had the Ho...,0,0,0,V-aCFCkkRyakP6SeIfG9-A,Taste Tea,"""Bubble Tea, Food, Tea Rooms, Coffee & Tea""",Las Vegas,NV,4.5,229,"{""lot"": ""FALSE"", ""valet"": ""FALSE"", ""garage"": ""...",,1,"{""RestaurantsTakeOut"": ""True"", ""BusinessParkin..."


In [6]:
print(tea_review_final.iloc[1]['review'])

Good service and atmosphere. The lychee milk tea was delicious and the boba cooked to perfection. We went back the next day and had the coconut slush, which was also very good! Nice little hang out spot for locals and tourists alike! They also serve appetizers.


In [7]:
## Create a function that tokenizes, removes stopwords, lowercases, and lemmatizes a document
# source: https://www.programcreek.com/python/example/107282/nltk.stem.WordNetLemmatizer
def preprocessing(text):
    # nltk, stopwords and WordNetLemmatizer are already imported in the first cell

    # tokenize into sentences, then into words
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]

    # remove stopwords
    # NOTE: this runs before lowercasing and NLTK's list is lowercase, so capitalized
    # stopwords such as "I", "The" and "We" are kept (they show up in the output below)
    stop = stopwords.words('english') #+ ['morestopwords','morestopwords2']
    tokens = [token for token in tokens if token not in stop]

    # optionally remove words shorter than 3 letters
    #tokens = [word for word in tokens if len(word) >= 3]

    # lowercase
    tokens = [word.lower() for word in tokens]

    # optional Porter stemming (disabled in favor of lemmatization)
    #stemmer = PorterStemmer()
    #tokens = [stemmer.stem(word) for word in tokens]

    # lemmatize
    lmtzr = WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens)

    return preprocessed_text
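
The processed reviews printed further down still contain tokens such as "i", "the" and "we". A minimal sketch of why the order of the steps matters is given here; the stopword set and the token list are hypothetical, chosen only to mimic NLTK's all-lowercase English list.

```python
# Tiny hypothetical stopword list; NLTK's English list is likewise all lowercase
stop = {"i", "the", "was", "and"}
tokens = ["I", "think", "the", "tea", "was", "good", "and", "The", "boba", "chewy"]

# Order used in preprocessing(): filter first, lowercase afterwards
filter_then_lower = [t.lower() for t in tokens if t not in stop]

# Alternative order: lowercase first, then filter
lower_then_filter = [t for t in (tok.lower() for tok in tokens) if t not in stop]

print(filter_then_lower)  # ['i', 'think', 'tea', 'good', 'the', 'boba', 'chewy']
print(lower_then_filter)  # ['think', 'tea', 'good', 'boba', 'chewy']
```

For the count and TF-IDF vectorizers this is mostly harmless, because they lowercase and apply the same stopword list again. For the averaged word vectors it may matter, since the leftover stopwords are averaged in as well.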

In [8]:
## Load the documents from the DataFrame and apply text processing
corpus = list(tea_review_final['review'].values)

## Process the corpus
# corpus must be a list of strings (one per document) before processing
corpus_processed = [preprocessing(doc) for doc in corpus]
corpus_processed[0:5]

["cute lil cafe . i think price pay boba tea , 're better going brew quality much better opinion . i like variety dessert 's average .",
 'good service atmosphere . the lychee milk tea delicious boba cooked perfection . we went back next day coconut slush , also good ! nice little hang spot local tourist alike ! they also serve appetizer .',
 "east coast bubble tea arrived cleveland ! like others mentioned , cleveland super slow uptake bubble tea . my husband happy kft opened joke quality life cleveland improved 38 % . we 've tried kft boston location , location comparable . ( i peek inside san mateo location horrified small dirty ! how allow part franchise ? ) the interior super modern spacious large led screen menu . plenty seating inside outside patio . it 's part kenko , local fast japanese casual place franchised kft . i n't care much kenko - i skip right boba . kft free super sugary high fructose corn syrup . they also bubble right ! you also customize like . i usually le ice , h

## Finding most similar documents using word count

In [9]:
## Prepare CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

stop=stopwords.words('english') #+ ['morestopwords','morestopwords2'] 

# regex cleaning via token_pattern is disabled to avoid losing signal such as "15min"
vectorizer= CountVectorizer(#token_pattern=r'\b[a-zA-Z]{3,}\b',  
                            ngram_range=(2,3), 
                            max_df=0.5, 
                            stop_words=stop,
                            max_features=500)
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=500, min_df=1,
        ngram_range=(2, 3), preprocessor=None,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [10]:
## Count-vectorize the corpus
X = vectorizer.fit_transform(corpus_processed)
# NOTE: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
vectorized_df.head()

Unnamed: 0,10 10,10 minute,15 min,15 minute,20 minute,30 minute,absolutely love,almond milk,almond milk tea,also good,...,would come back,would definitely,would definitely come,would get,would give,would go,would like,would make,would recommend,would say
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
#vectorizer.get_feature_names()

In [12]:
##  identify the two reviews that are the most "similar" based on cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity() returns an array, but it's hard to manipulate data in an array 
# -> convert to a pd DataFrame and use unstack() to ease value sorting
cos_df =  pd.DataFrame(cosine_similarity(vectorized_df))
cos_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.338062,0.0,0.282843,0.0,0.182574,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.13484,0.0,0.105409
2,0.0,0.0,1.0,0.0,0.0,0.444444,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.100504,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.57735,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.348155,0.0,0.0
4,0.0,0.338062,0.0,0.0,1.0,0.0,0.239046,0.0,0.154303,0.0,...,0.0,0.0,0.0,0.169031,0.0,0.169031,0.0,0.113961,0.0,0.089087


In [13]:
## Identify the most similar documents

# unstack() flattens the matrix into a Series indexed by pairs of review indexes
sorted_cos = cos_df.unstack().sort_values(ascending = False, kind="quicksort") # still a pd Series

# Every review has cosine similarity 1 with itself, which is not of interest.
# To filter without index-based conditions, convert the Series to a DataFrame,
# then drop those rows and look at the 10 rows with the highest similarity.
sorted_cos_df = sorted_cos.reset_index()
sorted_cos_df.rename(columns = {'level_0': "review1", 'level_1': "review2", 0: 'similarity'},
                     inplace = True)

# Drop rows that pair a review with itself or with a review whose vector is
# parallel to it (for example, an identical review)
sorted_cos_df = sorted_cos_df[sorted_cos_df['similarity'] < 0.9999999].reset_index(drop=True)
sorted_cos_df.head(10)

# Alternative: filter on the indexes instead of the similarity value
#sorted_cos_df.loc[sorted_cos_df['review1'] != sorted_cos_df['review2']][0:10]

Unnamed: 0,review1,review2,similarity
0,2587,952,0.973329
1,2587,2257,0.973329
2,2587,1014,0.973329
3,575,2587,0.973329
4,2172,2587,0.973329
5,714,2587,0.973329
6,2587,410,0.973329
7,2587,575,0.973329
8,2587,1640,0.973329
9,572,2587,0.973329


In [14]:
## This cell lets us check whether the reviews printed below are correct
#print(tea_review_final['review'][2587])
#print('\n')
#print(tea_review_final['review'][1640])

print(corpus_processed[2587])
print('\n')
print(corpus_processed[2257])
print('\n')
print(vectorized_df.iloc[2587,:].index[vectorized_df.iloc[2587,:] != 0])
print(vectorized_df.iloc[2257,:].index[vectorized_df.iloc[2257,:] != 0])

this one worst ice cream i eaten . their juice okay , ice cream really terrible . i interested taste vape come , process using liquid nitrogen create ice cream ruin everything make ice cream good . i tried finishing pina colada flavoured ice cream , bad . i recommend ice cream , overpriced quality waste money . it good instagram pic nothing else . overall , overpriced food taste cute decor , worth . real disappointment .


the macaroon ice cream horrible , macaroon old hard . i waiting write good review 's nothing write good even worth drive .


Index(['cream good', 'ice cream', 'ice cream good'], dtype='object')
Index(['ice cream'], dtype='object')


In [15]:
## Print the top 10 similar review pairs using word count
for i in range(10):
    print(f"review1:\n{tea_review_final['review'].iloc[sorted_cos_df['review1'][i]]}\n")
    print(f"review2:\n{tea_review_final['review'].iloc[sorted_cos_df['review2'][i]]}")
    print("---" * 60)
#    tea_review_final['review'].iloc[sorted_cos_df['review1'][i]]

review1:
This is one of the worst ice creams I have eaten. Their juice is okay, but the ice cream is really terrible. I was interested to taste it because of the vape that comes out of it, but the process of using liquid nitrogen to create the ice cream ruins everything that makes an ice cream good. I tried finishing my pina colada flavoured ice cream, but it was too bad. I do not recommend this ice cream, it is overpriced for the quality of it and a waste of money. It is only good for an instagram pic and nothing else. Overall, overpriced food for the taste and cute decor, but not worth it. Real disappointment.

review2:
Okay, I had to edit my review because this place was a lifesaver as soon as I developed dietary restrictions. They offer a few different vegan options for ice cream, and while they are not as fun or delicious as their regular ice cream offerings, they are such a nice treat when you cannot have dairy (or in my case dairy and soy)! We have also recommended this place to

#### The problem with the word-count vectorizer is that the top similar reviews above do not actually look similar. Some have very different contexts or even opposite sentiments.
Possible reasons:
* The parameters of the word-count vectorizer are not tuned: ngram_range=(2,3) and max_features=500 may be too restrictive to capture content, so they may need tuning.
* The word-count vectorizer does not capture the meaning of a document well: it only records how often the top n-gram phrases appear, which may say little about what a review means. Because cosine similarity is computed on these counts, "similar" reviews can share a frequent phrase without having similar meaning. For example, reviews 2587 and 2257 above share only "ice cream".
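
To make the second point concrete, the following sketch recomputes the similarity of reviews 2587 and 2257 by hand. The counts are an assumption read off the printed output: the processed text of review 2587 appears to contain "ice cream" six times plus one "cream good" and one "ice cream good", while review 2257 has "ice cream" as its only nonzero feature.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Columns: 'cream good', 'ice cream', 'ice cream good'
# (counts assumed from the processed text printed above)
review_2587 = [1, 6, 1]
review_2257 = [0, 1, 0]

similarity = cosine(review_2587, review_2257)  # equals 6 / sqrt(38)
print(round(similarity, 6))  # 0.973329
```

Under these assumed counts the result agrees with the 0.973329 in the table. Because cosine similarity ignores vector length, any review whose only retained feature is "ice cream" gets this same score against review 2587, which would explain the repeated value in the top 10.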

## Finding most similar documents using TFIDF

In [16]:
## Prepare TfidfVectorizer
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

stop=stopwords.words('english') #+ ['morestopwords','morestopwords2'] 

# regex cleaning via token_pattern is disabled to avoid losing signal such as "15min"
vectorizer= TfidfVectorizer(#token_pattern=r'\b[a-zA-Z]{3,}\b',  
                            ngram_range=(2,3), 
                            max_df=0.5, 
                            stop_words=stop,
                            max_features=500)
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=500, min_df=1,
        ngram_range=(2, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [17]:
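## TF-IDF vectorize the corpus
X = vectorizer.fit_transform(corpus_processed)
# NOTE: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())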
vectorized_df.head()

Unnamed: 0,10 10,10 minute,15 min,15 minute,20 minute,30 minute,absolutely love,almond milk,almond milk tea,also good,...,would come back,would definitely,would definitely come,would get,would give,would go,would like,would make,would recommend,would say
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.48864,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
#vectorizer.get_feature_names()

In [19]:
##  identify the two reviews that are the most "similar" based on cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity() returns an array, but it's hard to manipulate data in an array 
# -> convert to a pd DataFrame and use unstack() to ease value sorting
cos_df =  pd.DataFrame(cosine_similarity(vectorized_df))
cos_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.230778,0.0,0.07582,0.0,0.041382,0.0,...,0.0,0.0,0.0,0.041043,0.0,0.0,0.0,0.030489,0.0,0.025784
2,0.0,0.0,1.0,0.0,0.0,0.215637,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.143178,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.568934,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.180973,0.0,0.0
4,0.0,0.230778,0.0,0.0,1.0,0.0,0.066637,0.0,0.03637,0.0,...,0.0,0.0,0.0,0.036072,0.0,0.200029,0.0,0.026796,0.0,0.022661


In [20]:
## Identify the most similar documents

# unstack() flattens the matrix into a Series indexed by pairs of review indexes
sorted_cos = cos_df.unstack().sort_values(ascending = False, kind="quicksort") # still a pd Series

# Every review has cosine similarity 1 with itself, which is not of interest.
# To filter without index-based conditions, convert the Series to a DataFrame,
# then drop those rows and look at the 10 rows with the highest similarity.
sorted_cos_df = sorted_cos.reset_index()
sorted_cos_df.rename(columns = {'level_0': "review1", 'level_1': "review2", 0: 'similarity'},
                     inplace = True)

# Drop rows that pair a review with itself or with a review whose vector is
# parallel to it (for example, an identical review)
sorted_cos_df = sorted_cos_df[sorted_cos_df['similarity'] < 0.9999999].reset_index(drop=True)
sorted_cos_df.head(10)

# Alternative: filter on the indexes instead of the similarity value
#sorted_cos_df.loc[sorted_cos_df['review1'] != sorted_cos_df['review2']][0:10]

Unnamed: 0,review1,review2,similarity
0,681,1631,0.968454
1,1631,681,0.968454
2,283,687,0.964284
3,687,283,0.964284
4,554,1211,0.963251
5,1211,554,0.963251
6,2656,12,0.962347
7,12,2656,0.962347
8,1649,1185,0.959248
9,1185,1649,0.959248
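
Each pair in the table appears twice, once as (a, b) and once as (b, a), because the similarity matrix is symmetric. One way to list every pair only once is sketched here in plain Python on a small, hypothetical matrix; it is not the notebook's own code.

```python
from itertools import combinations

# Small symmetric similarity matrix (hypothetical values)
sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]

# combinations() yields each unordered pair (i, j) with i < j exactly once,
# so the diagonal and the mirrored (j, i) entries never appear.
pairs = [(i, j, sim[i][j]) for i, j in combinations(range(len(sim)), 2)]
pairs.sort(key=lambda p: p[2], reverse=True)

for i, j, score in pairs:
    print(i, j, score)
```

With the notebook's sorted_cos_df, a comparable filter would be to keep only the rows where review1 < review2. That filter alone would not drop duplicated review texts, which the similarity threshold in the cell above currently removes.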


In [21]:
## This cell helps us check why two reviews are ranked as most similar by the TF-IDF vectorizer
print(corpus_processed[681])
print('\n')
print(corpus_processed[1631])
print('\n')
print(vectorized_df.iloc[681,:].index[vectorized_df.iloc[681,:] != 0])
print(vectorized_df.iloc[1631,:].index[vectorized_df.iloc[1631,:] != 0])

i 'm regular customer coco eglinton location . i stumbled upon location 's really close work , however , i ca n't believe 're brand , location 's roasted pearl milk tea taste bland watery . i give benefit doubt ordered three separate occasion , well let 's say milk tea least consistent - bland watery .


the tapioca sweeter , milk tea taste better


Index(['milk tea', 'milk tea taste', 'tea taste'], dtype='object')
Index(['milk tea', 'milk tea taste', 'tea taste'], dtype='object')


In [22]:
## Print the top 10 similar review pairs using TF-IDF
for i in range(10):
    print(f"review1:\n{tea_review_final['review'].iloc[sorted_cos_df['review1'][i]]}\n")
    print(f"review2:\n{tea_review_final['review'].iloc[sorted_cos_df['review2'][i]]}")
    print("---" * 60)
#    tea_review_final['review'].iloc[sorted_cos_df['review1'][i]]

review1:
I'm a regular customer at the CoCo Eglinton location. I stumbled upon this location as it's really close to work, however, I can't believe that they're the same brand, as this location's roasted pearl milk tea taste so bland and watery. I give it the the benefit of the doubt and ordered here on three separate occasions, well let's just say their milk tea is at least very consistent - bland and watery.

review2:
The tapioca here are sweeter, 
milk tea taste better than most
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
review1:
The tapioca here are sweeter, 
milk tea taste better than most

review2:
I'm a regular customer at the CoCo Eglinton location. I stumbled upon this location as it's really close to work, however, I can't believe that they're the same brand, as this location's roasted pearl milk tea taste so bland and watery. I give it the

#### The TF-IDF vectorizer performs better than the word-count vectorizer at finding similar reviews

A possible reason is that TF-IDF up-weights rare but significant n-gram phrases, so it captures somewhat more of a review's meaning than raw counts do.

Still, TF-IDF only captures meaning through the appearance of the top n-gram phrases, and some of those phrases do little to help us comprehend a document.
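
The up-weighting of rare phrases can be sketched by hand. The vectorizer printed above uses smooth_idf=True, sublinear_tf=False and norm='l2', which in scikit-learn corresponds to idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The document frequencies and the phrase "roasted pearl" as a feature are hypothetical, chosen only for illustration.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_l2(counts, doc_freqs, n_docs):
    """Weight raw counts by smoothed IDF, then L2-normalize the row."""
    weighted = [c * smooth_idf(n_docs, df) for c, df in zip(counts, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm for w in weighted] if norm else weighted

# Hypothetical document frequencies in a 3000-review sample
n_docs = 3000
doc_freqs = [900, 30]   # 'milk tea' is common, 'roasted pearl' is rare
counts = [1, 1]         # both phrases occur once in this review

weights = tfidf_l2(counts, doc_freqs, n_docs)
print(weights)  # the rare phrase receives the larger weight
```

With raw counts both phrases would contribute equally to the cosine similarity. After IDF weighting, a match on the rare phrase counts for more than a match on the common one.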

## Finding most similar documents using Word Embedding

In [23]:
import spacy
import en_core_web_md
# load the medium English model, which includes word vectors
nlp = en_core_web_md.load()

In [24]:
# get vectors for each review
review_vectors = []
for review in corpus_processed:
    processed_review = nlp(review)
    #print(len(processed_review.vector))
    #print(processed_review.vector[:10]) # review vector; by default spaCy averages the token vectors
    review_vectors.append(processed_review.vector)
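
The averaging approach mentioned in the cell above can be illustrated without spaCy. This is a minimal sketch with hypothetical 3-dimensional word vectors (the real model uses 300 dimensions, as the output below shows). Skipping unknown tokens is a simplification made here and is not a claim about spaCy's behavior.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, not from en_core_web_md)
word_vectors = {
    "milk": [0.9, 0.1, 0.0],
    "tea": [0.8, 0.2, 0.1],
    "boba": [0.7, 0.3, 0.0],
    "parking": [0.0, 0.1, 0.9],
}

def average_vector(tokens):
    """Mean of the vectors of the known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    dims = len(next(iter(word_vectors.values())))
    if not known:
        return [0.0] * dims
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_a = average_vector(["milk", "tea"])
doc_b = average_vector(["boba", "tea"])
doc_c = average_vector(["parking"])

print(cosine(doc_a, doc_b))  # close to 1: related drinks
print(cosine(doc_a, doc_c))  # much lower: unrelated topic
```

Unlike the n-gram vectorizers, two documents can score as similar here without sharing a single phrase, as long as their words lie close together in the embedding space.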

In [25]:
vector_df = pd.DataFrame(review_vectors)
vector_df['text'] = tea_review_final['review']
vector_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,291,292,293,294,295,296,297,298,299,text
0,-0.232489,0.206861,-0.072061,-0.220383,0.038611,0.058881,0.079486,-0.203828,0.009794,1.779169,...,0.056953,-0.118106,-0.02339,0.061727,0.00019,-0.039053,-0.150056,0.043689,0.208696,Cute lil cafe. I think for the price you pay f...
1,-0.033927,0.126635,0.035919,-0.112907,0.075135,0.085961,0.012315,-0.12116,0.0117,1.93065,...,0.066859,-0.05206,-0.081594,-0.042503,-0.007328,0.091801,-0.207242,-0.000402,0.108564,Good service and atmosphere. The lychee milk t...
2,0.001318,0.146115,-0.083746,-0.144675,0.13021,0.03666,0.008962,-0.128621,0.042276,1.665743,...,0.048544,-0.056313,-0.023993,0.03545,-0.005575,-0.028998,-0.091358,-0.00929,0.12626,East Coast bubble tea has arrived in Cleveland...
3,-0.129251,0.224553,-0.033462,-0.071737,0.031776,0.068935,-0.031398,-0.116254,0.031774,1.895368,...,0.065351,-0.058263,-0.144662,-0.06817,0.049422,-0.043754,-0.251772,0.080915,0.130501,A little expensive but overall quite good. The...
4,-0.085851,0.233693,-0.069304,-0.137752,-0.009839,0.087096,0.072243,-0.171197,-0.02614,1.877826,...,0.080577,-0.098863,-0.061137,0.019624,-0.077023,0.06869,-0.142036,0.030205,0.147325,Absolutely delicious! Must try! \nI had the Ho...


In [26]:
vector_df.set_index('text',inplace = True)
vector_df.head()

text,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
"Cute lil cafe. I think for the price you pay for a boba tea, you're better off going to Brew where the quality is so much better in my opinion. I do like that they have a variety of desserts but it's just average.",-0.232489,0.206861,-0.072061,-0.220383,0.038611,0.058881,0.079486,-0.203828,0.009794,1.779169,...,-0.090701,0.056953,-0.118106,-0.02339,0.061727,0.00019,-0.039053,-0.150056,0.043689,0.208696
"Good service and atmosphere. The lychee milk tea was delicious and the boba cooked to perfection. We went back the next day and had the coconut slush, which was also very good! Nice little hang out spot for locals and tourists alike! They also serve appetizers.",-0.033927,0.126635,0.035919,-0.112907,0.075135,0.085961,0.012315,-0.12116,0.0117,1.93065,...,-0.072576,0.066859,-0.05206,-0.081594,-0.042503,-0.007328,0.091801,-0.207242,-0.000402,0.108564
"East Coast bubble tea has arrived in Cleveland! \n\nLike others have mentioned, Cleveland has been super slow on the uptake for bubble tea. My husband was so happy that KFT opened that we joke his quality of life in Cleveland improved by 38%. We've tried KFT in Boston and other locations, and this location is comparable. (I did peek inside the San Mateo location once and was horrified at how small and dirty it was! How can you allow it to be part of your franchise?) The interior is super modern and spacious with large LED screens for the menu. Plenty of seating inside and outside on their patio. \n\nIt's a part of Kenko, which is a local fast Japanese casual place that franchised KFT. I don't care much for Kenko - I just skip right to the boba. \n\nKFT is free of the super sugary high fructose corn syrup. They also do their bubbles right! You can also customize how you like. I usually do less ice, half sugar, and less bubbles.\n\nMy favorite flavors are the oolong tea. I also tried the yakult and it's so good!",0.001318,0.146115,-0.083746,-0.144675,0.13021,0.03666,0.008962,-0.128621,0.042276,1.665743,...,-0.056216,0.048544,-0.056313,-0.023993,0.03545,-0.005575,-0.028998,-0.091358,-0.00929,0.12626
"A little expensive but overall quite good. Their milk green tea is with real 2% milk. It tastes really different from other stores. Their coconut ice shake is also amazing! If you enjoy mangoes, this is 100% the place for you! I somewhat dislike their hot mango drink since it tastes awkward. They also open at 11:00 AM! Will be back!",-0.129251,0.224553,-0.033462,-0.071737,0.031776,0.068935,-0.031398,-0.116254,0.031774,1.895368,...,-0.122762,0.065351,-0.058263,-0.144662,-0.06817,0.049422,-0.043754,-0.251772,0.080915,0.130501
"Absolutely delicious! Must try! \nI had the Hokkaido milk tea, it was delicious and perfect. Creamy, not too sweet. Just perfect. And the service is great, I asked questions about different flavors and she took her time explaining them to me.",-0.085851,0.233693,-0.069304,-0.137752,-0.009839,0.087096,0.072243,-0.171197,-0.02614,1.877826,...,-0.012977,0.080577,-0.098863,-0.061137,0.019624,-0.077023,0.06869,-0.142036,0.030205,0.147325


In [27]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = pd.DataFrame(cosine_similarity(vector_df.values),\
            columns = tea_review_final['review'],\
            index = tea_review_final['review'])
similarities

review,"Cute lil cafe. I think for the price you pay for a boba tea, you're better off going to Brew where the quality is so much better in my opinion. I do like that they have a variety of desserts but it's just average.","Good service and atmosphere. The lychee milk tea was delicious and the boba cooked to perfection. We went back the next day and had the coconut slush, which was also very good! Nice little hang out spot for locals and tourists alike! They also serve appetizers.","East Coast bubble tea has arrived in Cleveland! Like others have mentioned, Cleveland has been super slow on the uptake for bubble tea. My husband was so happy that KFT opened that we joke his quality of life in Cleveland improved by 38%. We've tried KFT in Boston and other locations, and this location is comparable. (I did peek inside the San Mateo location once and was horrified at how small and dirty it was! How can you allow it to be part of your franchise?) The interior is super modern and spacious with large LED screens for the menu. Plenty of seating inside and outside on their patio. It's a part of Kenko, which is a local fast Japanese casual place that franchised KFT. I don't care much for Kenko - I just skip right to the boba. KFT is free of the super sugary high fructose corn syrup. They also do their bubbles right! You can also customize how you like. I usually do less ice, half sugar, and less bubbles. My favorite flavors are the oolong tea. I also tried the yakult and it's so good!","A little expensive but overall quite good. Their milk green tea is with real 2% milk. It tastes really different from other stores. Their coconut ice shake is also amazing! If you enjoy mangoes, this is 100% the place for you! I somewhat dislike their hot mango drink since it tastes awkward. They also open at 11:00 AM! Will be back!","Absolutely delicious! Must try! I had the Hokkaido milk tea, it was delicious and perfect. Creamy, not too sweet. Just perfect. 
And the service is great, I asked questions about different flavors and she took her time explaining them to me.","I love bubble tea so honestly I hardly ever give a bad review for bubble tea places. They had a good amount of options, and I love the places that let you choose the ice and sugar level. Nice seating area too.","Our first time in Vegas and we were craving for milk tea. Checked on yelp and found this place with good reviews, i don't think milk tea is really popular in Vegas. This place looks new and the interior is pretty cool, the area were clean! The crew were super friendly. The wait for three people order was pretty quick. Love this place!","We were excited to see the Alley closer to home and decided to give this location a try. It's nice to see more sitting space compared to most Alley locations which will probably serve best for the students nearby. We ordered the Jasmine Milk Green Tea with tapioca and the Lime honey aloe drink. Firstly, perhaps because it is still relatively new, the service was quite slow and there wasn't a lot of people but they had a steady flow of customers. Secondly, my friend actually wanted a slew of other drinks before she settled with the lime honey aloe one and she settled for it because she wasn't able to alter the sweetness of the other ones she wanted. We didn't like the inconsistency of sugar options in their drinks and can't quite understand why some can be altered but some cannot (perhaps someone can enlighten me? ) Thirdly, now this I find in all Alley locations, I just find that the amount of tapioca they give compared to a lot of other boba places lacks noticeably. I feel like I get the same amount in a regular cup as a large and that doesn't really make sense (perhaps someone can enlighten me with this as well). Other than that though, it's pretty much the same as any other Alley.","Oh my gosh, so good! Went in right before closing and got a Thai milk tea boba (my fav) and it totally hit the spot. 
Noticed this spot while driving by, so glad we stopped! Yum!!!","I like getting their buns here but the women who work here are extremely unfriendly and unhelpful. They always squish all my purchases in a tiny plastic bag and refuse to give me a larger bag even if I pay the 5 cents for it. Whenever I let them know that I'd like a larger bag as I don't want my buns to end up all squished when I return home, they are extremely rude and condescending.",...,"Love coming by here trying new things and the comforting staff. Tory and Tammy are awesome. I recommend to at least pop in and try it for yourself. P.s. Insider knowledge: The owner makes pastries fresh/home made, Delicious! Great place for my girlfriend and I to get coffee and start our day.","If I can rate this place a ten star, I will. But Yelp only allows up to five stars. My husband and I originally 'just' wanted to get their cheese rolls. We've been craving for the Porto's cheese rolls from California for a long while. Anyway, we ended up getting more than that. We had pancit malabon, longsilog, fried vegetable egg roll, 1 bag of pandisal, a dozen of cheese rolls, three guava tools and blueberry rolls. Cheese rolls and pandisal we took them home for our kids. We were hungry so we ate the others in the cafe. And let me tell you. They were all delicious! Chef Zen was very welcoming. She doesn't treat you like a customer. She treats you like family! Before we left she gave us chicken empanada, on the house! We will definitely becoming back Tita Chef Zen! Oh and also, she had us sample her cheese rolls too. We highly recommend this place!","This is the best Boba I ever had in Vegas. The pearls are not too soft and not too hard, it is made to perfection! My favorite here is the avocado smoothie with boba, its not like other places that make it too sweet. For good and cheap boba go to volcano tea!","Rose milk tea? YESSSSSS. Let's be real, this place can be super annoying. 
The parking is atrocious (usually going next door to the main Chinatown plaza and walking over is the thing to do if you're not going right when they open) and the credit card minimum obnoxious ($10), but on top of that, the service is also nonexistent (will someone please come and take my money?). However, if you know what you're in for, the drinks and snacks are great.","Overall: Craving for something sweet in Mesa? Stop by here! Food: Their yummy! Their light and not too sweet! Price: Decent. I would say each item would average about $4 weather is a drink or the dessert. Food: I ordered their mango pancake which I was so confused why it's called a pancake haha. But it tasted light and sweet but not too sweet. I'm not a huge fan of sweet, but this is definitely do-able . I also got the yogi dessert. I think that's what it's called. It was refreshing and very light and sweet. This place is a great place to try something new. A new kind of dessert. Other: I love their interior design too. They got cute chairs that matches their dessert theme haha. Love it.","I'm sure you've already read what Snoh is all about, so let me preface this review with my favorite flavors for you to try if you haven't already. Taro Snoh is my ultimate favorite. For those who have never tried it before, don't let its purple hue deceive you--it is delicious! My second favorite is the honeydew. Honeydew Snoh is a very refreshing flavor to try during this summer! Do it! The AZ heat don't got nothing on these Snohs! Might I add, that you must try the mochi balls with both the taro and the honeydew Snoh--the coconut mochi balls are my favorite topping with both Snoh flavors thus far. Let me tell you, never in my life have I ordered a small of something and then ordered a regular/large size of it immediately after just because I could not get enough of it... I just love how its pillowy texture and delectable flavors never leave me feeling bloated and guilty. 
If you're also wondering about their drinks, they're good too! I like their Thai Tea with boba. So if you're looking for a new chill treat to try, have a go with Snoh--it's worth it. They did receive a Best of Phoenix... ""Snoh"" will always be year-round in Arizona for me! Bonus: You can enjoy your delicious dessert with visuals of the artwork that's for sale; watching what's on their TVs; and playing all sorts of fun games with the people you came with!","This is a rebranding of Tea Time, so if you're coming back and expecting the same drinks, pout about it elsewhere. As said in a previous post, the focus has swayed towards coffee and it shows. I was extremely happy with the Vietnamese Cold Brew with condensed milk. The espresso seems to be the same brand, but possibly a different roast. It's a clean, bold flavor that has a medium mouthfeel and lingers slightly. The staff is welcoming and quick, so I can definitely appreciate that. This store has a lot of potential. I love the industrial feel and spaciousness; overall, I look forward to seeing the growth and getting a taste of the new house-made items.","Loving my almond milk tea with coconut jelly and boba would be good with vanilla and popping pearls. I also tried green tea snow fluff with sweet condense milk, lycee jelly, blueberry popping pearls. Guava green tea and kiwi popping pearls. This is ok and I don't think I would get it again.","Ice cream is so good and they have so many flavors: flavors I've never even had before, like: Horchata, Creme Brûlée, Cinnabon, and so many more. It was delish.",Heard there was a new boba place so I had to try it out. I got the taro milk tea and it was perfectly creamy with just enough boba. Maybe it's because a lot of the boba places in AZ don't cook the boba right but they got it spot on here. They can also adjust the sweetness level of it for you and I recommend getting it a little less sweet. 
I'm happy that ASU finally gets a good boba place near campus but sad that it's right after I've graduated and am moving away.
"Cute lil cafe. I think for the price you pay for a boba tea, you're better off going to Brew where the quality is so much better in my opinion. I do like that they have a variety of desserts but it's just average.",1.000000,0.910503,0.946049,0.935189,0.927949,0.952906,0.933475,0.940540,0.884498,0.923256,...,0.937498,0.925956,0.921072,0.910221,0.949124,0.946318,0.937027,0.849119,0.889971,0.944277
"Good service and atmosphere. The lychee milk tea was delicious and the boba cooked to perfection. We went back the next day and had the coconut slush, which was also very good! Nice little hang out spot for locals and tourists alike! They also serve appetizers.",0.910503,1.000000,0.938184,0.957581,0.958731,0.925170,0.950300,0.933236,0.890188,0.873220,...,0.933735,0.947394,0.922303,0.899601,0.939236,0.944735,0.936739,0.879277,0.917190,0.907857
"East Coast bubble tea has arrived in Cleveland! \n\nLike others have mentioned, Cleveland has been super slow on the uptake for bubble tea. My husband was so happy that KFT opened that we joke his quality of life in Cleveland improved by 38%. We've tried KFT in Boston and other locations, and this location is comparable. (I did peek inside the San Mateo location once and was horrified at how small and dirty it was! How can you allow it to be part of your franchise?) The interior is super modern and spacious with large LED screens for the menu. Plenty of seating inside and outside on their patio. \n\nIt's a part of Kenko, which is a local fast Japanese casual place that franchised KFT. I don't care much for Kenko - I just skip right to the boba. \n\nKFT is free of the super sugary high fructose corn syrup. They also do their bubbles right! You can also customize how you like. I usually do less ice, half sugar, and less bubbles.\n\nMy favorite flavors are the oolong tea. I also tried the yakult and it's so good!",0.946049,0.938184,1.000000,0.952961,0.938421,0.964230,0.962997,0.955535,0.895157,0.920781,...,0.946254,0.955581,0.924356,0.939412,0.951460,0.954279,0.948942,0.859996,0.899330,0.944908
"A little expensive but overall quite good. Their milk green tea is with real 2% milk. It tastes really different from other stores. Their coconut ice shake is also amazing! If you enjoy mangoes, this is 100% the place for you! I somewhat dislike their hot mango drink since it tastes awkward. They also open at 11:00 AM! Will be back!",0.935189,0.957581,0.952961,1.000000,0.953595,0.948925,0.958779,0.955052,0.895958,0.911857,...,0.939017,0.946438,0.912985,0.922869,0.957759,0.961411,0.955499,0.880381,0.918226,0.918086
"Absolutely delicious! Must try! \nI had the Hokkaido milk tea, it was delicious and perfect. Creamy, not too sweet. Just perfect. And the service is great, I asked questions about different flavors and she took her time explaining them to me.",0.927949,0.958731,0.938421,0.953595,1.000000,0.937305,0.951840,0.932046,0.910393,0.883548,...,0.953143,0.948999,0.926162,0.902567,0.960239,0.958111,0.945064,0.874629,0.934744,0.912140
"I love bubble tea so honestly I hardly ever give a bad review for bubble tea places. They had a good amount of options, and I love the places that let you choose the ice and sugar level. Nice seating area too.",0.952906,0.925170,0.964230,0.948925,0.937305,1.000000,0.952264,0.960182,0.880346,0.937290,...,0.944177,0.937021,0.911087,0.919972,0.948486,0.955999,0.944659,0.868764,0.893378,0.942180
"Our first time in Vegas and we were craving for milk tea. Checked on yelp and found this place with good reviews, i don't think milk tea is really popular in Vegas. This place looks new and the interior is pretty cool, the area were clean! The crew were super friendly. The wait for three people order was pretty quick. Love this place!",0.933475,0.950300,0.962997,0.958779,0.951840,0.952264,1.000000,0.953235,0.900564,0.913125,...,0.959623,0.944950,0.920760,0.927971,0.955504,0.958263,0.957993,0.832168,0.884378,0.928514
"We were excited to see the Alley closer to home and decided to give this location a try. It's nice to see more sitting space compared to most Alley locations which will probably serve best for the students nearby. \nWe ordered the Jasmine Milk Green Tea with tapioca and the Lime honey aloe drink. \nFirstly, perhaps because it is still relatively new, the service was quite slow and there wasn't a lot of people but they had a steady flow of customers. \nSecondly, my friend actually wanted a slew of other drinks before she settled with the lime honey aloe one and she settled for it because she wasn't able to alter the sweetness of the other ones she wanted. We didn't like the inconsistency of sugar options in their drinks and can't quite understand why some can be altered but some cannot (perhaps someone can enlighten me? ) \nThirdly, now this I find in all Alley locations, I just find that the amount of tapioca they give compared to a lot of other boba places lacks noticeably. I feel like I get the same amount in a regular cup as a large and that doesn't really make sense (perhaps someone can enlighten me with this as well). \nOther than that though, it's pretty much the same as any other Alley.",0.940540,0.933236,0.955535,0.955052,0.932046,0.960182,0.953235,1.000000,0.868560,0.943975,...,0.936234,0.943081,0.888678,0.944563,0.938450,0.960492,0.966315,0.845772,0.893501,0.934690
"Oh my gosh, so good! Went in right before closing and got a Thai milk tea boba (my fav) and it totally hit the spot. Noticed this spot while driving by, so glad we stopped! Yum!!!",0.884498,0.890188,0.895157,0.895958,0.910393,0.880346,0.900564,0.868560,1.000000,0.846129,...,0.905943,0.915845,0.889026,0.886133,0.920375,0.913541,0.861609,0.816164,0.863046,0.899435
"I like getting their buns here but the women who work here are extremely unfriendly and unhelpful. They always squish all my purchases in a tiny plastic bag and refuse to give me a larger bag even if I pay the 5 cents for it. Whenever I let them know that I'd like a larger bag as I don't want my buns to end up all squished when I return home, they are extremely rude and condescending.",0.923256,0.873220,0.920781,0.911857,0.883548,0.937290,0.913125,0.943975,0.846129,1.000000,...,0.914705,0.923663,0.870243,0.902158,0.913696,0.931401,0.915928,0.821243,0.833533,0.928001


In [28]:
# Set the index name to an empty string; otherwise the array can't be unstacked because the index and column names conflict
similarities.index.name = ''

In [29]:
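# Flatten the similarity matrix into one row per (review1, review2) pair
top_similarities = similarities.unstack().reset_index()
top_similarities.columns = ['review1','review2','similarity']
# Drop self-comparisons and exact duplicates (similarity of 1.0), then sort from most to least similar
top_similarities = top_similarities[top_similarities['similarity'] < 0.9999999999]
top_similarities = top_similarities.sort_values(by = 'similarity', ascending = False)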
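
Because the similarity matrix is symmetric, unstacking it lists every pair twice, once as (A, B) and once as (B, A). The following is a minimal sketch of one way to keep each pair only once by reading just the upper triangle of the matrix. The `labels`, `sim`, and `pairs` names and the toy values are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy symmetric similarity matrix; labels and values are invented.
labels = ["a", "b", "c"]
sim = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=labels, columns=labels,
)

# Row/column positions strictly above the diagonal (k=1), so each pair
# appears once and self-comparisons are skipped.
i, j = np.triu_indices(len(labels), k=1)

pairs = pd.DataFrame({
    "review1": np.array(labels)[i],
    "review2": np.array(labels)[j],
    "similarity": sim.to_numpy()[i, j],
})

# Most similar pairs first.
pairs = pairs.sort_values("similarity", ascending=False).reset_index(drop=True)
print(pairs)
```

With this layout the filter on `similarity < 0.9999999999` is only needed to drop exact duplicate reviews, since the diagonal is never read.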

In [30]:
top_similarities.head(10)

Unnamed: 0,review1,review2,similarity
7156919,"After coming here a few times, there's still a...",Having moved from the super-boba-shop-saturate...,0.991299
5759385,Having moved from the super-boba-shop-saturate...,"After coming here a few times, there's still a...",0.991299
7155122,"After coming here a few times, there's still a...","After two visits here, I think I can say for s...",0.991266
368385,"After two visits here, I think I can say for s...","After coming here a few times, there's still a...",0.991266
7156769,"After coming here a few times, there's still a...",Sharetea was suggested by a friend who really ...,0.990944
5309385,Sharetea was suggested by a friend who really ...,"After coming here a few times, there's still a...",0.990944
1491576,Who doesn't love boba?!?! The service here is ...,Boba was 0/5 extremely sweet\nCaramel tea 0/5 ...,0.990819
1728497,Boba was 0/5 extremely sweet\nCaramel tea 0/5 ...,Who doesn't love boba?!?! The service here is ...,0.990819
5308496,Sharetea was suggested by a friend who really ...,I'm not sure if they really sweeten their drin...,0.990687
4489769,I'm not sure if they really sweeten their drin...,Sharetea was suggested by a friend who really ...,0.990687


In [31]:
# Print the full text of the 10 most similar review pairs
for idx, row in top_similarities.head(10).iterrows():
    print(f"review1:\n{row['review1']}\n")
    print(f"review2:\n{row['review2']}")
    print("----" * 60)

# For long reviews, spaCy's default document vector is just the average of the token vectors, so the matched reviews are not always very similar

review1:
After coming here a few times, there's still a bit of inconsistencies with their drinks but my go tos are always the sea salt jasmine and the jasmine milk tea. I've noticed that they now offer ice cream macaroons for a little over $5. 

3.5/5 popcorn chicken was pretty good, well seasoned and had a nice kick to it. The dipping sauce they provided complimented it very well. I did feel that it could be a little crispier and the portion could be a tad bigger. I'm not sure if it's because of the to go box they put it in, but when I opened the bag, it left me wondering... where did the rest of my chicken go? I think I'm just very used to other boba spots putting their popcorn chicken in a white paper bag. It's a good size for one person, but if you're expecting to share then you should probably order another one because it's pretty darn tasty. 

 The honey boba, would be the most disappointing thing I've had so far here. It definitely needs to be a bit sweeter or soaked in honey lo
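
The comment in cell 31 points out that the default document vector is an average of token vectors. A minimal sketch of why that matters for long reviews is below: when most tokens are generic, the few distinctive words are diluted and the averages converge. The three-dimensional `toy_vectors` are invented for illustration; they are not real spaCy embeddings, and `doc_vector` is a hypothetical helper, not a spaCy function.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented toy embeddings: one generic word and two opposite-sentiment words.
toy_vectors = {
    "tea": np.array([1.0, 0.0, 0.0]),
    "rude": np.array([0.0, 1.0, 0.0]),
    "friendly": np.array([0.0, 0.0, 1.0]),
}

def doc_vector(tokens):
    """Average the token vectors, mimicking a mean-pooled document vector."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

# Short documents: the distinctive word is half of the text.
short_similarity = cosine(doc_vector(["tea", "rude"]),
                          doc_vector(["tea", "friendly"]))

# Long documents: the same distinctive word is one token out of ten.
long_similarity = cosine(doc_vector(["tea"] * 9 + ["rude"]),
                         doc_vector(["tea"] * 9 + ["friendly"]))

print(short_similarity, long_similarity)
```

In this toy setup the long pair scores far higher than the short pair even though the two documents disagree in the same way, which is one plausible reason nearly all of the scores in the matrix above sit in a narrow, high range.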

#### Overall, the word-embedding approach did a good job of capturing the reviews' meaning, especially compared with the results of the word-count vectorizer and the TF-IDF approach.
For example, consider the two reviews in the 9th row of the top_similarities table. Both describe being unable to redeem a discount that the store or a clerk had advertised. That similarity is buried in context and takes human-like comprehension to catch. We therefore think the word-embedding approach did a good job of finding documents that are similar in meaning.
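
To illustrate the mechanism behind this conclusion, here is a small sketch comparing a bag-of-words similarity with a mean-of-embeddings similarity for two texts that share no words. The two token lists and the two-dimensional `toy_vectors` are hypothetical and hand-picked so that near-synonyms point in similar directions; a real model's vectors would not be this clean.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

doc1 = ["coupon", "refused"]
doc2 = ["discount", "denied"]

# Word-count vectors over the joint vocabulary: the texts share no words.
vocab = sorted(set(doc1) | set(doc2))
counts1 = np.array([doc1.count(w) for w in vocab], dtype=float)
counts2 = np.array([doc2.count(w) for w in vocab], dtype=float)

# Invented embeddings in which near-synonyms lie close together.
toy_vectors = {
    "coupon": np.array([1.0, 0.1]),
    "discount": np.array([0.9, 0.2]),
    "refused": np.array([0.1, 1.0]),
    "denied": np.array([0.2, 0.9]),
}
emb1 = np.mean([toy_vectors[w] for w in doc1], axis=0)
emb2 = np.mean([toy_vectors[w] for w in doc2], axis=0)

count_similarity = cosine(counts1, counts2)
embedding_similarity = cosine(emb1, emb2)
print(count_similarity, embedding_similarity)
```

The count-based score is zero because there is no vocabulary overlap, while the embedding-based score is high, which matches the kind of match described above where both reviews tell the same story in different words.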