# TripAdvisor restaurants info for 31 European cities

### Sentiment analysis: In this dataset, based on the customer reviews the ratings of the restaurants are predicted. Each class of rating is associated to a sentiment. Using Natural Language Processing, the raw text data is preprocessed and a vector of stemmed words is assigned as the input to the model for training. The classifier model predicts the rating or the sentiment. 

In [1]:
# Importing packages for training 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

In [3]:
# Loading and understanding the dataset
data = pd.read_csv("C:/Users/attawut/Documents/Web_scraping_NLP/1_NLP/data/TA_restaurants_curated.csv")

In [5]:
data['Reviews']

0         [['Just like home', 'A Warm Welcome to Wintry ...
1         [['Great food and staff', 'just perfect'], ['0...
2         [['Satisfaction', 'Delicious old school restau...
3         [['True five star dinner', 'A superb evening o...
4         [['Best meal.... EVER', 'super food experience...
                                ...                        
125522                                                  NaN
125523                                                  NaN
125524                                                  NaN
125525                                                  NaN
125526                                                  NaN
Name: Reviews, Length: 125527, dtype: object

In [6]:
data.describe()

Unnamed: 0.1,Unnamed: 0,Ranking,Rating,Number of Reviews
count,125527.0,115876.0,115897.0,108183.0
mean,3974.686131,3657.463979,3.987441,125.184983
std,4057.687698,3706.255301,0.678814,310.833311
min,0.0,1.0,-1.0,2.0
25%,1042.0,965.0,3.5,9.0
50%,2445.0,2256.0,4.0,32.0
75%,5626.0,5237.0,4.5,114.0
max,18211.0,16444.0,5.0,16478.0


In [8]:
restaurant_data = data[['Rating', 'Reviews']]
restaurant_data

Unnamed: 0,Rating,Reviews
0,5.0,"[['Just like home', 'A Warm Welcome to Wintry ..."
1,4.5,"[['Great food and staff', 'just perfect'], ['0..."
2,4.5,"[['Satisfaction', 'Delicious old school restau..."
3,5.0,"[['True five star dinner', 'A superb evening o..."
4,4.5,"[['Best meal.... EVER', 'super food experience..."
...,...,...
125522,,
125523,,
125524,,
125525,,


### Data Preprocessing

In [9]:
# Number of missing values in the dataset
restaurant_data.isna().sum()

Rating     9630
Reviews    9616
dtype: int64

In [10]:
# Missing values in Reviews
restaurant_data['Reviews'] = restaurant_data['Reviews'].fillna('["No Review"]', axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [11]:
restaurant_data['Reviews'].tail()

125522    ["No Review"]
125523    ["No Review"]
125524    ["No Review"]
125525    ["No Review"]
125526    ["No Review"]
Name: Reviews, dtype: object

In [12]:
restaurant_data['Reviews'][3233]

'[[], []]'

In [13]:
restaurant_data['Reviews'] = restaurant_data['Reviews'].replace(['[[], []]'], 'No Review')    # Replacing empty Review values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [14]:
restaurant_data['Reviews'][3233]

'No Review'

In [15]:
# Missing Rating values
restaurant_data.isna().sum()

Rating     9630
Reviews       0
dtype: int64

In [16]:
restaurant_data.Rating.value_counts()

 4.0    39843
 4.5    31326
 3.5    19745
 5.0    11257
 3.0     8524
 2.5     2720
 2.0     1437
 1.0      620
 1.5      384
-1.0       41
Name: Rating, dtype: int64

In [17]:
restaurant_data.Rating.unique()

array([ 5. ,  4.5,  4. ,  3.5,  3. ,  2.5,  2. ,  1.5,  1. , -1. ,  nan])

In [18]:
restaurant_data["Rating"].fillna(restaurant_data['Rating'].mean(), inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)


In [19]:
# Discarding the decimal places and considering only the integer
for col in ['Rating']:
    restaurant_data[col] = restaurant_data[col].astype(int)   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [20]:
restaurant_data.Rating.value_counts()

 4    71169
 3    37899
 5    11257
 2     4157
 1     1004
-1       41
Name: Rating, dtype: int64

In [21]:
sentiment_data = restaurant_data
sentiment_data.tail()

Unnamed: 0,Rating,Reviews
125522,3,"[""No Review""]"
125523,3,"[""No Review""]"
125524,3,"[""No Review""]"
125525,3,"[""No Review""]"
125526,3,"[""No Review""]"


### Cleaning the text data in Reviews

In [22]:
sdata = sentiment_data.values    # Gives a numpy array
sdata 

array([[5,
        "[['Just like home', 'A Warm Welcome to Wintry Amsterdam'], ['01/03/2018', '01/01/2018']]"],
       [4,
        "[['Great food and staff', 'just perfect'], ['01/06/2018', '01/04/2018']]"],
       [4,
        "[['Satisfaction', 'Delicious old school restaurant'], ['01/04/2018', '01/04/2018']]"],
       ...,
       [3, '["No Review"]'],
       [3, '["No Review"]'],
       [3, '["No Review"]']], dtype=object)

In [23]:
len(sdata)

125527

In [24]:
count = len(sdata)
all_data = []

for i in range(count):
    rating = sdata[i, 0]
    reviews = sdata[i, 1].split('], [')[0]     # Splitting the reviews and date strings from a single list and considering only the reviews
    reviews = reviews.replace("[[", "")
    reviews = reviews.replace("'", "")
    reviews = reviews.replace('"', '')
    reviews = reviews.split(',')
    print(reviews)
    for review in reviews:
        all_data.append([review, rating])

['Just like home', ' A Warm Welcome to Wintry Amsterdam']
['Great food and staff', ' just perfect']
['Satisfaction', ' Delicious old school restaurant']
['True five star dinner', ' A superb evening of fine dining', ' hospitali...']
['Best meal.... EVER', ' super food experience']
['A treat!', ' Wow just Wow']
['40th Birthday with my Family', ' One of the best meals ever!']
['Great Experience', ' A true delight']
['Great Food & Service!', ' Superior food and exciting setting around...']
['Excellent Herring', ' Lovely', ' rustic fish shop in the smack of A...']
['Simply AMAZING!', ' Delicious Burgers']
['A hidden gem', ' Fantastic!']
['Love it!', ' As pure as Paradise: Adam!']
['Awesome little pub', ' An amazing little place with a vast choice...']
['Best meal of our trip', ' It was like falling in love..']
['So. Much. Food', ' Hidden Gem']
['Brunch', ' Worth the wait!']
['Wonderful Christmas dinner', ' Fantastic restaurant with impeccable servi...']
['Very good tibetan and indian food',

In [None]:
sent_data = pd.DataFrame(all_data, columns = ['Review', 'Rating'])

In [None]:
sent_data.head()

In [None]:
# Removing the puntuation marks
sent_data['Review'] = sent_data['Review'].str.replace('[^\w\s]','')
sent_data.head(3)

In [None]:
# Tokenization
# Splitting sentences into list of individual words  
tokenized_data = sent_data['Review'].apply(lambda x : x.lower().split())    
tokenized_data.head(5)

In [None]:
# Importing natural language toolkit
import nltk      
from nltk.stem.porter import *

In [None]:
# Stemming
# Reducing a word to its stem word
stemmer = PorterStemmer()
stem_data = tokenized_data.apply(lambda x: [stemmer.stem(i) for i in x])  

In [None]:
stem_data

In [None]:
# Joining the stemmed words to reframe sentences
stemmed_data = []
for i in range(len(stem_data)):
    stemmed_data.append(' '.join(stem_data[i]))    

stemmed_data

In [None]:
np.array(stemmed_data).reshape(-1,1)   # Converting into a numpy array

In [None]:
sent_data['Cleaned_Review'] = np.array(stemmed_data).reshape(-1,1)  # Reshaping

### Visualizing customer reviews using WordCloud module

In [None]:
# ALl the stemmed words in reviews 
all_words = ' '.join([text for text in sent_data['Cleaned_Review']])    #stemmed words

from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Words used by the customers in reviews corresponding to rating 5
# Rating the restaurants as excellent
excellent_words = ' '.join([text for text in sent_data['Cleaned_Review'][sent_data['Rating'] == 5]])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(excellent_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Words used by the customers in reviews corresponding to rating 4
# Rating the restaurants as good
good_words = ' '.join([text for text in sent_data['Cleaned_Review'][sent_data['Rating'] == 4]])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(good_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Words used by the customers in reviews corresponding to rating 3
# Rating the restaurants as good
good_words = ' '.join([text for text in sent_data['Cleaned_Review'][sent_data['Rating'] == 3]])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(good_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Words used by the customers in reviews corresponding to rating 2
# Rating the restaurants as good
good_words = ' '.join([text for text in sent_data['Cleaned_Review'][sent_data['Rating'] == 2]])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(good_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Words used by the customers in reviews corresponding to rating 1
# Rating the restaurants as good
good_words = ' '.join([text for text in sent_data['Cleaned_Review'][sent_data['Rating'] == 1]])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(good_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Words used by the customers in reviews corresponding to rating -1
# Rating the restaurants as terrible 
terrible_words = ' '.join([text for text in sent_data['Cleaned_Review'][sent_data['Rating'] == -1]])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(terrible_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

### Sentiment Analysis using Natural Language Processing

In [None]:
sent_data.head()

In [None]:
# Splitting train/test data
x = sent_data['Cleaned_Review']
Y = sent_data['Rating']

In [None]:
Y = np.array(Y)

In [None]:
for i in range(len(Y)):   # Changing the rating class -1 into 0
    Y[Y < 0] = 0

In [None]:
Y

#### Sentiments associated with ratings:
#### Excellent = [5]
#### Very Good = [4]
#### Good = [3]
#### Average = [2]
#### Bad = [1]
#### Terrible = [0]

In [None]:
xtrain, xtest, Y_train, Y_test = train_test_split(x, Y,   # data we want to split 
                                            train_size = 0.7,
                                            random_state = 500,       # shuffle rows
                                            stratify = Y)             # ensure classes the same in train/test

In [None]:
# Change text data into numerical for the classifer to understand.
# We will use bag of words model. To break up our Reviews sentences into words (tokens) with Count Vectorizer.

In [None]:
# We call the sklearn count Vectorizer (transformer) to transform
# Text into a vector
from sklearn.feature_extraction.text import CountVectorizer 
bow_vectorizer = CountVectorizer(max_df = 0.90, min_df = 2, max_features = 10000, stop_words = 'english') 

In [None]:
# Use vetorizer var to fit on text
# bag-of-words feature matrix
X_train = bow_vectorizer.fit_transform(xtrain.values)
X_test = bow_vectorizer.fit_transform(xtest.values)

In [None]:
X_train.toarray

In [None]:
print(X_train.shape)
print(X_train.toarray())

In [None]:
print(X_test.shape)
print(X_test.toarray())

In [None]:
# Use vocabulary_ to see words in the vocabulary
vocabulary = bow_vectorizer.vocabulary_
vocabulary

In [None]:
# Dimensionality reduction using Truncated Singular Value Decomposition (SVD)

from sklearn.decomposition import TruncatedSVD

tsvd = TruncatedSVD(n_components = 700)  
x_train = tsvd.fit_transform(X_train)
x_test = tsvd.fit_transform(X_test)

In [None]:
x_train.shape

In [None]:
x_test.shape

### Model training using keras

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense

In [None]:
x_train.shape

In [None]:
# Model architecture
model = Sequential([   
    Dense(570, activation = 'relu', input_shape = (x_train[1].shape)),
    Dense(390, activation = 'relu'),
    Dense(135, activation = 'relu'),      #hidden layers
    Dense(6, activation = 'softmax')
])

In [None]:
model.summary()

In [None]:
adam = keras.optimizers.adam(lr = 0.01)

model.compile(loss = 'sparse_categorical_crossentropy',
              optimizer = adam,
              metrics = ['accuracy'])

In [None]:
# Creating batches to increase computational speed and for better accuracy
history = model.fit(x_train, Y_train, epochs = 3, validation_split = 0.2, batch_size = 512, verbose = 1)    

In [None]:
plt.plot(history.epoch, history.history['val_loss'], 'k',
        history.epoch, history.history['loss'],'m')

In [None]:
plt.plot(history.epoch, history.history['val_acc'],'k',
         history.epoch, history.history['acc'],'m')

In [None]:
model.evaluate(x_test, Y_test)

### Model training using Support Vector Machine (SVM)

In [None]:
# Train classifier model
from sklearn.svm import LinearSVC

In [None]:
#Support Vector Classification
svcModel = LinearSVC()   # Instantiate it (create an instance of an object) 

In [None]:
svcModel.fit(x_train, Y_train)   # Train/Fit model on training data

In [None]:
pred = svcModel.predict(x_test)  # Predict Ratings or Sentiment
pred

In [None]:
svcModel.score(x_train, Y_train)  

In [None]:
svcModel.score(x_test, Y_test)  # Evaluate model on test dataset. Use the score to see how correct classifier was on test.
# We get 61.84%. Later we use TF-IDF and Logistic Regression.

In [None]:
# Import confusion matrix and classification report 
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
confusion_matrix(Y_test, pred)  

### Pipeline

In [None]:
# Using text processing
# Import tfidf transformer
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
# Import pipeline
from sklearn.pipeline import Pipeline

In [None]:
steps = ([
         ('tfidf', TfidfTransformer()),
          ('classifierSVC', LinearSVC())
   ])
pipeline = Pipeline(steps)

In [None]:
pipeline.fit(x_train, Y_train)

In [None]:
# Use pipeline to predict x_test, 
predict2 = pipeline.predict(x_test)
predict2

In [None]:
pipeline.score(x_test, Y_test) 

In [None]:
print(classification_report(predict2, Y_test))

### Model training using Logistic Regression 

In [None]:
from sklearn.linear_model import LogisticRegression   

In [None]:
steps2 = ([
         ('tfidf', TfidfTransformer()),
          ('logreg clf', LogisticRegression()),
   ])
pipeline2 = Pipeline(steps2)

In [None]:
pipeline2.fit(x_train, Y_train)

In [None]:
pred3 = pipeline2.predict(x_test)
pred3

In [None]:
pipeline2.score(x_train, Y_train)

In [None]:
pipeline2.score(x_test, Y_test)   # 61% using Logistic Regression

In [None]:
print(classification_report(pred3, Y_test)) 

#### The dataset is trained on different algorithms like NN using Keras, SVM and Logistic Regression. The accuarcy of all the algorithms are in the range of 60 - 65%. 
#### When it came to predicting class 4 (rating) our model performed better than the rest, as there are more samples to train on. The accuracy can be tried to increase, by splitting the ratings data into positive (4 - 5) and negative (-1 - 3), rather than having 6 different classes. 