# Restaurant Reviews Clustering
    In this project, we will use Natural Language Processing techniques and unsupervised learning 
    models to cluster restaurant reviews.

## 0. Data Collection
    Data source: https://www.kaggle.com/d4rklucif3r/restaurant-reviews

In [1]:
import numpy as np
import pandas as pd
import nltk

import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dante\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dante\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Load data
df = pd.read_csv("C:\\Users\\dante\\Desktop\\DS Project\\Restaurant Review\\Restaurant_Reviews.tsv", sep = '\t')

In [3]:
# Take a peek
df.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


    The dataset is small (1,000 reviews) and has no missing values. Because we are clustering 
    rather than predicting, the 'Liked' column is set aside here and only used later to inspect 
    the clusters.

In [5]:
data = df.loc[:, 'Review']

## 1. Data Preprocessing
    Tokenization and stemming

In [6]:
stopwords = nltk.corpus.stopwords.words('english') 
stopwords.append("'m")
stopwords.append("'s")
stopwords.remove('not')
stopwords.remove('no')

print('We use ' + str(len(stopwords)) + ' stop-words from the nltk library.')
print(stopwords)

We use 179 stop-words from the nltk library.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'nor', 'only', 'o

In [7]:
from nltk.stem.snowball import SnowballStemmer
# from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")

# Define a function to do Tokenization and stemming
def tokenization_and_stemming(text):
    
    # exclude stopwords and tokenize the document, generate a list of string
    tokens = []
    for word in nltk.word_tokenize(text):
        if word.lower() not in stopwords:
            tokens.append(word.lower())    

    # filter out any tokens not containing letters (eg. numeric tokens, raw punctuations)
    filtered_tokens = []
    for token in tokens:
        if token.isalpha():
            filtered_tokens.append(token)

    # stemming
    stems = [stemmer.stem(t) for t in filtered_tokens]

    return stems

In [8]:
tokenization_and_stemming(data[13])

['tri', 'cape', 'cod', 'ravoli', 'chicken', 'cranberri', 'mmmm']

In [9]:
data[13]

'I tried the Cape Cod ravoli, chicken, with cranberry...mmmm!'

    Seems to be working!
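
    The stop-word list above keeps 'not' and 'no' because dropping them would turn "not good" 
    into "good". Below is a minimal sketch of that effect. It uses a hypothetical five-word stop 
    list and plain whitespace splitting instead of nltk, both assumptions made so the snippet 
    runs without any downloads.

```python
# Hypothetical mini stop-word list standing in for nltk's English list.
base_stopwords = {"the", "is", "was", "not", "no"}

# Same idea as stopwords.remove('not') and stopwords.remove('no') above.
custom_stopwords = base_stopwords - {"not", "no"}


def filter_tokens(text, stop_set):
    """Lower-case, split on whitespace, drop stop words and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in stop_set]


review = "The crust is not good"
with_default = filter_tokens(review, base_stopwords)   # negation is lost
with_custom = filter_tokens(review, custom_stopwords)  # negation is kept

print(with_default)
print(with_custom)
```

    With the unmodified list the review collapses to the same tokens as a positive one, which 
    is the reason for removing the two negations from the list.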

## 2. TF-IDF
    Term Frequency Inverse Document Frequency
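
    To make the weighting concrete, here is a small worked example in plain Python. It assumes 
    the smoothed idf, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that 
    scikit-learn's TfidfVectorizer uses by default. The three tokenized "reviews" are made up 
    for illustration.

```python
import math

# Hypothetical, already tokenized and stemmed reviews.
docs = [["good", "food"], ["good", "servic"], ["slow", "servic"]]
n_docs = len(docs)

vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many reviews each term appears.
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF weights of one review, ordered like vocab."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


vec = tfidf_vector(docs[0])
for term, weight in zip(vocab, vec):
    print(term, round(weight, 3))
```

    In the first review, 'food' gets a higher weight than 'good' because it appears in fewer 
    reviews, which is the behaviour the clustering below relies on.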

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
tfidf_model = TfidfVectorizer(max_df = 0.997, max_features = 1000, min_df = 0.003,\
    stop_words = 'english', use_idf = True, tokenizer = tokenization_and_stemming,\
    ngram_range = (1,2))

tfidf_matrix = tfidf_model.fit_transform(data) # fit the vectorizer to the reviews

print('In total, there are ' + str(tfidf_matrix.shape[0]) + ' reviews and ' + \
    str(tfidf_matrix.shape[1]) + ' terms.')

In total, there are 1000 reviews and 478 terms.
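
    One interaction to be aware of: as far as I understand the vectorizer, stop_words = 'english' 
    is applied to the tokens returned by the custom tokenizer, and scikit-learn's built-in 
    English list contains 'not'. So the negations kept during preprocessing are likely removed 
    again at this step. A minimal sketch to check this, where the two toy reviews are 
    hypothetical and str.split stands in for tokenization_and_stemming:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

toy_reviews = ["crust not good", "food good"]  # hypothetical reviews


def vocabulary(stop_words):
    """Fit a vectorizer on the toy reviews and return its sorted vocabulary."""
    model = TfidfVectorizer(
        tokenizer=str.split,   # stand-in for tokenization_and_stemming
        token_pattern=None,    # unused when a tokenizer is given
        stop_words=stop_words,
    )
    model.fit(toy_reviews)
    return sorted(model.vocabulary_)


print("not" in ENGLISH_STOP_WORDS)
with_builtin = vocabulary("english")
without_builtin = vocabulary(None)
print(with_builtin)
print(without_builtin)
```

    If the negations are meant to survive, one option would be stop_words = None, since the 
    tokenizer already filters stop words. That is an assumption about intent, not what this 
    notebook does.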




In [12]:
# Key terms (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tf_selected_words = tfidf_model.get_feature_names_out().tolist()
tf_selected_words[:10]

['absolut',
 'actual',
 'ago',
 'alway',
 'amaz',
 'ambianc',
 'ambienc',
 'anoth',
 'anoth minut',
 'anytim']

## 3. Clustering Modeling
    K-Means

In [13]:
from sklearn.cluster import KMeans

# number of clusters
num_clusters = 3

km = KMeans(n_clusters = num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
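
    The choice of 3 clusters above is not justified by the data. One common check, sketched 
    here on synthetic points rather than on the reviews, is to compare silhouette scores over 
    several values of k. The blob centers, the range of k, and random_state = 0 are all 
    assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs (hypothetical, not the reviews).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

    On the notebook's data, X would be replaced by tfidf_matrix. Passing random_state to KMeans 
    also makes the cluster labels repeatable between runs.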

### Clustering Results:

In [14]:
product = {'Review': df.Review, 'Cluster': clusters}
frame = pd.DataFrame(product, columns = ['Review', 'Cluster'])

In [15]:
frame.head(10)

Unnamed: 0,Review,Cluster
0,Wow... Loved this place.,0
1,Crust is not good.,2
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,0
4,The selection on the menu was great and so wer...,1
5,Now I am getting angry and I want my damn pho.,0
6,Honeslty it didn't taste THAT fresh.),0
7,The potatoes were like rubber and you could te...,0
8,The fries were great too.,1
9,A great touch.,1


### Count in each subgroup:

In [16]:
result = pd.merge(frame, df, left_index= True, right_index=True)
result = result.loc[:, ['Cluster', 'Liked']]
result.groupby(['Cluster','Liked']).size()

Cluster  Liked
0        0        445
         1        349
1        0         36
         1         84
2        0         19
         1         67
dtype: int64

    Most reviews in Cluster 2 are also 'Liked' (67 of 86, about 78%), and Cluster 1 leans the 
    same way (84 of 120, 70%). Cluster 0 holds most of the data and is mixed (349 of 794, 
    about 44% liked).
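
    The share of 'Liked' reviews per cluster can be computed directly from the counts. A small 
    sketch in plain Python, with the numbers copied from the groupby output above:

```python
# Counts copied from the groupby output: {cluster: (not liked, liked)}
counts = {0: (445, 349), 1: (36, 84), 2: (19, 67)}

# Fraction of reviews in each cluster that were marked as liked.
liked_share = {}
for cluster, (not_liked, liked) in counts.items():
    liked_share[cluster] = liked / (not_liked + liked)

for cluster, share in liked_share.items():
    print(f"Cluster {cluster}: {share:.0%} liked")
```

    Since 'Liked' is coded 0/1, result.groupby('Cluster')['Liked'].mean() should give the same 
    shares directly in pandas.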

### Keywords in each cluster:

In [17]:
print("<Document clustering result by K-means>")

# km.cluster_centers_ holds the weight of each term in each centroid;
# sort each row in decreasing order to get the top k terms.

order_centroids = km.cluster_centers_.argsort()[:, ::-1]

Cluster_keywords_summary = {}
for i in range(num_clusters):
    print('Cluster ' + str(i) + ' keywords: ', end = '')
    Cluster_keywords_summary[i] = []
    for ind in order_centroids[i, :10]: # top 10 items
        Cluster_keywords_summary[i].append(tf_selected_words[ind])
        print(tf_selected_words[ind] + ',', end = '')
    print()    

<Document clustering result by K-means>
Cluster 0 keywords: place,food,like,time,disappoint,love,delici,amaz,wo,eat,
Cluster 1 keywords: servic,great,food,slow,friend,great food,great place,terribl,place,great servic,
Cluster 2 keywords: good,food,food good,good food,realli,realli good,pizza,servic,select,place,


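    The clustering looks plausible but noisy. Clusters 0 and 1 mix positive and negative terms 
    ('love' next to 'disappoint' in Cluster 0, 'great' next to 'slow' and 'terribl' in 
    Cluster 1), while Cluster 2 is built almost entirely around 'good'. This is broadly 
    consistent with the counts in each subgroup above.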
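
    The keyword lookup relies on argsort()[:, ::-1], which returns, for every centroid, the 
    column indices ordered from largest to smallest weight. A toy illustration with made-up 
    centroid weights and three terms:

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (2 clusters x 3 terms).
terms = ["food", "good", "servic"]
centers = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.05, 0.45]])

# argsort sorts ascending; reversing each row gives descending order.
order = centers.argsort()[:, ::-1]

# Keep the two highest-weighted terms per cluster.
top_terms = [[terms[j] for j in row[:2]] for row in order]

print(order)
print(top_terms)
```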

## 4. Topic Modeling
    Latent Dirichlet Allocation (LDA)

In [18]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 3)

In [19]:
# Document-topic matrix: fit LDA on tfidf_matrix
lda_output = lda.fit_transform(tfidf_matrix)
print(lda_output.shape)
print(lda_output)

(1000, 3)
[[0.11527425 0.13915242 0.74557332]
 [0.16767929 0.16961207 0.66270864]
 [0.56105358 0.31582987 0.12311655]
 ...
 [0.13860161 0.43352276 0.42787563]
 [0.45975244 0.14040976 0.3998378 ]
 [0.09811508 0.80372664 0.09815827]]
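
    Two things are worth noting about this step. LDA is usually described as a model of word 
    counts, so fitting it on raw counts is a common alternative to TF-IDF weights. And without 
    a random_state the topics can change between runs. The sketch below shows both on a 
    hypothetical six-review corpus. It is a swapped-in variant, not what this notebook does.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus, for illustration only.
toy_reviews = [
    "great food great service",
    "slow service terrible food",
    "good pizza good place",
    "terrible pizza slow staff",
    "great place friendly staff",
    "good food friendly service",
]

# Raw term counts instead of TF-IDF weights.
counts = CountVectorizer().fit_transform(toy_reviews)

# random_state fixes the initialization so repeated runs give the same topics.
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda_toy.fit_transform(counts)

# Refit with the same seed to confirm the result is repeatable.
doc_topic_again = LatentDirichletAllocation(
    n_components=2, random_state=0
).fit_transform(counts)

print(doc_topic.shape)
print(doc_topic.sum(axis=1))  # each row is a topic distribution
print(np.allclose(doc_topic, doc_topic_again))
```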


In [20]:
# Topic and word matrix
topic_word = lda.components_
print(topic_word.shape)
print(topic_word)

(3, 478)
[[0.34551523 0.37313292 0.33493329 ... 0.33838505 0.33516836 0.33529299]
 [0.34412442 1.86540414 0.34924409 ... 0.35064214 0.33484736 0.33493246]
 [5.14011999 0.35114393 1.61580988 ... 2.48582499 2.38981853 2.03093862]]


### Results:

In [21]:
# Column names
topic_names = ['Topic' + str(i) for i in range(lda.n_components)]

# Index names
doc_names = ['Doc' + str(i) for i in range(len(data))]

# Creating document topic matrix
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topic_names, \
    index = doc_names)

# Get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis = 1)
df_document_topic['Topic'] = topic

# Document topic matrix
df_document_topic.head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic
Doc0,0.12,0.14,0.75,2
Doc1,0.17,0.17,0.66,2
Doc2,0.56,0.32,0.12,0
Doc3,0.12,0.15,0.72,2
Doc4,0.32,0.28,0.39,2
Doc5,0.67,0.19,0.14,0
Doc6,0.72,0.14,0.14,0
Doc7,0.77,0.12,0.11,0
Doc8,0.65,0.14,0.21,0
Doc9,0.14,0.14,0.71,2
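
    The dominant topic is just the column with the largest weight, so it can hide how close the 
    runner-up is. A short sketch using four rows copied from the table above (Doc0, Doc1, Doc2, 
    Doc4). The 0.1 cut-off for a "weak" assignment is an arbitrary assumption.

```python
import numpy as np

# Rows copied from the document-topic table: Doc0, Doc1, Doc2, Doc4.
doc_topic = np.array([[0.12, 0.14, 0.75],
                      [0.17, 0.17, 0.66],
                      [0.56, 0.32, 0.12],
                      [0.32, 0.28, 0.39]])

# Dominant topic: index of the largest weight in each row.
dominant = doc_topic.argmax(axis=1)

# Margin between the largest and second-largest weight in each row.
sorted_weights = np.sort(doc_topic, axis=1)
margin = sorted_weights[:, -1] - sorted_weights[:, -2]

# Hypothetical threshold: flag assignments with a margin below 0.1.
weak = margin < 0.1

print(dominant)
print(margin.round(2))
print(weak)
```

    Doc4 is assigned to Topic2 by a margin of only 0.07, so its label says much less than the 
    label of Doc0.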


### Counts:

In [22]:
df_newidx = df.copy()
df_newidx.index = doc_names

result_LDA = pd.merge(df_document_topic, df_newidx, left_index = True, right_index = True)

result_LDA[['Topic','Liked']].groupby(['Topic','Liked']).size()

Topic  Liked
0      0        186
       1        130
1      0        168
       1        149
2      0        146
       1        221
dtype: int64

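    LDA splits the reviews fairly evenly (316, 317, and 367 per topic). The proportion of 
    'Liked' reviews differs only modestly between topics: about 41% in Topic 0 (130 of 316), 
    47% in Topic 1 (149 of 317), and 60% in Topic 2 (221 of 367).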

In [23]:
# Topic word matrix
print(lda.components_)

# Create DF
df_topic_words = pd.DataFrame(lda.components_)

# Column and index name
df_topic_words.columns = tfidf_model.get_feature_names_out()
df_topic_words.index = topic_names

df_topic_words.head()

[[0.34551523 0.37313292 0.33493329 ... 0.33838505 0.33516836 0.33529299]
 [0.34412442 1.86540414 0.34924409 ... 0.35064214 0.33484736 0.33493246]
 [5.14011999 0.35114393 1.61580988 ... 2.48582499 2.38981853 2.03093862]]


Unnamed: 0,absolut,actual,ago,alway,amaz,ambianc,ambienc,anoth,anoth minut,anytim,...,worst,worth,wow,wrap,wrong,year,year ago,yummi,zero,zero star
Topic0,0.345515,0.373133,0.334933,0.36112,5.131986,0.382933,0.346009,0.338053,0.334502,0.339801,...,9.192484,3.096285,0.336556,0.551994,0.337099,0.726343,0.334933,0.338385,0.335168,0.335293
Topic1,0.344124,1.865404,0.349244,0.349561,0.385349,0.335882,2.114312,4.105665,2.09804,0.334766,...,0.341849,0.353203,1.617888,0.338248,2.844924,0.346953,0.349244,0.350642,0.334847,0.334932
Topic2,5.14012,0.351144,1.61581,7.020998,8.153946,4.400497,0.335103,0.354371,0.33767,2.86341,...,0.339355,2.547201,0.894719,1.50772,0.691146,2.09647,1.61581,2.485825,2.389819,2.030939


### Keywords:

In [24]:
def print_topic_words(tfidf_model, lda_model, n_words):
    words = np.array(tfidf_model.get_feature_names_out())
    topic_words = []
    # for each topic, we have words weight
    for topic_words_weights in lda_model.components_:
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

topic_keywords = print_topic_words(tfidf_model=tfidf_model, lda_model=lda, n_words=15)        

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word '+str(i+1) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic '+str(i+1) for i in range(df_topic_words.shape[0])]
df_topic_words

Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14,Word 15
Topic 1,like,disappoint,eat,wo,best,friend,worst,bad,place,staff,fri,food,burger,probabl,fresh
Topic 2,delici,time,come,definit,servic,price,food,vega,came,steak,pretti,got,perfect,minut,terribl
Topic 3,good,place,servic,food,great,love,restaur,realli,wait,pizza,fantast,amaz,way,awesom,alway


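    Although LDA splits the reviews more evenly, the topics seem more ambiguous. Topic 1 has 
    'best' right next to 'worst', 'bad', and 'disappoint', and Topic 2 mixes 'delici' and 
    'perfect' with 'terribl'. Only Topic 3 is consistently positive. (Topic 1 to Topic 3 in 
    this table correspond to Topic0 to Topic2 in the counts above.) This result again agrees 
    with the count in each group. However, I'm not sure how to use this information. 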

    Maybe in the future I will try supervised machine learning on this data, since the 'Liked' 
    column is available, and see what a prediction model can do.