This notebook extracts features from restaurant reviews using nltk, spaCy and scikit-learn. The dataset is small: it has only 1000 reviews, plus a second column of 0s and 1s indicating whether the customer liked the restaurant.

In [35]:
# Importing the libraries
import re
from string import punctuation
import spacy
import nltk
import pandas as pd
from nltk.corpus import stopwords

In [36]:
data = pd.read_csv('Data/Restaurant_Reviews.tsv', delimiter='\t')

In [37]:
data.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


First, the reviews are filtered with a regular expression so that only letters are kept; every other character is replaced by a space. Next, the text is normalized by converting all words to lower case.

In [38]:
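# regex=True is required for the pattern to be treated as a regular expression in recent pandas
data['Review'] = data['Review'].str.replace('[^a-zA-Z]', ' ', regex=True)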

In [39]:
data['Review'] = data['Review'].str.lower()

In [40]:
data.head()

Unnamed: 0,Review,Liked
0,wow loved this place,1
1,crust is not good,0
2,not tasty and the texture was just nasty,0
3,stopped by during the late may bank holiday of...,1
4,the selection on the menu was great and so wer...,1
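The cleaning step can be reproduced on a single string with the standard library's re module. This is a minimal sketch, and `clean_review` is a hypothetical helper name, not part of the notebook. It shows that each removed character is replaced by a space, so the cleaned review keeps its original length (24 characters for the first review, the value that appears later in 'char_count').

```python
import re

def clean_review(text):
    # Replace every character that is not an ASCII letter with a space,
    # then convert the result to lower case.
    return re.sub(r'[^a-zA-Z]', ' ', text).lower()

cleaned = clean_review("Wow... Loved this place.")
print(repr(cleaned))      # the repr makes the extra spaces visible
print(len(cleaned))       # same length as the raw review
print(cleaned.split())    # splitting on whitespace recovers the words
```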


Next, the number of characters and the number of words in each review are taken as features and stored in two new columns, 'char_count' and 'word_count'. The average word length is then added as a third feature, 'avg_word_len', calculated by dividing the number of characters by the number of words in the review. Because 'char_count' also counts the spaces, this ratio is somewhat larger than the true mean word length.

In [41]:
# Number of characters in the reviews

data['char_count'] = data['Review'].str.len()

In [42]:
data.head()

Unnamed: 0,Review,Liked,char_count
0,wow loved this place,1,24
1,crust is not good,0,18
2,not tasty and the texture was just nasty,0,41
3,stopped by during the late may bank holiday of...,1,87
4,the selection on the menu was great and so wer...,1,59


In [43]:
# Number of words in the reviews

data['word_count'] = data['Review'].str.split().str.len()

In [44]:
data.head()

Unnamed: 0,Review,Liked,char_count,word_count
0,wow loved this place,1,24,4
1,crust is not good,0,18,4
2,not tasty and the texture was just nasty,0,41,8
3,stopped by during the late may bank holiday of...,1,87,15
4,the selection on the menu was great and so wer...,1,59,12


In [45]:
# Average word length of the reviews

data['avg_word_len'] = data['char_count'] / data['word_count']

In [46]:
data.head()

Unnamed: 0,Review,Liked,char_count,word_count,avg_word_len
0,wow loved this place,1,24,4,6.0
1,crust is not good,0,18,4,4.5
2,not tasty and the texture was just nasty,0,41,8,5.125
3,stopped by during the late may bank holiday of...,1,87,15,5.8
4,the selection on the menu was great and so wer...,1,59,12,4.916667
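The arithmetic behind these three columns can be checked by hand on the second review. The sketch below is plain Python rather than pandas and assumes the cleaned text ends in a space where the full stop used to be. It also computes a mean word length from the words alone, as a possible alternative feature that is not used in this notebook.

```python
# Cleaned form of "Crust is not good." (the full stop became a trailing space)
review = "crust is not good "

char_count = len(review)          # counts letters and spaces
words = review.split()
word_count = len(words)

# Feature as defined in the notebook: characters divided by words
avg_word_len = char_count / word_count

# Alternative (assumption, not in the notebook): mean length of the words only
mean_word_len = sum(len(w) for w in words) / word_count

print(char_count, word_count, avg_word_len, mean_word_len)
```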


In [47]:
len(data)

1000

In [48]:
# Loading spacy's language model

nlp = spacy.load('en_core_web_sm')

In [49]:
# Creating the corpus from the reviews:
# split each review, remove the stopwords and punctuation,
# then lemmatize the words to get their root form

# Build the stop word set once instead of once per word
stop_words = set(stopwords.words('english') + list(punctuation))

corpus = []
for text in data['Review']:
    words = [word for word in text.split() if word not in stop_words]
    lemmas = [token.lemma_ for token in nlp(' '.join(words))]
    corpus.append(' '.join(lemmas))

In [50]:
# Showing the first 5 entries in the corpus list

corpus[:5]

['wow love place',
 'crust good',
 'tasty texture nasty',
 'stop late may bank holiday rick steve recommendation love',
 'selection menu great price']
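In the output above, 'crust is not good' becomes 'crust good': the stop word list contains 'not', so the negation is lost, which may matter when the features are later used to predict 'Liked'. The sketch below illustrates one possible workaround under stated assumptions. `STOP_WORDS` is a tiny hypothetical list (not NLTK's), and `NEGATIONS` and `remove_stop_words` are illustrative names that do not appear in the notebook.

```python
# Hypothetical, tiny stop word list for illustration only (not NLTK's list)
STOP_WORDS = {"is", "the", "and", "was", "just", "this", "not"}

# Words we may want to keep even though they are stop words
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(text, keep=frozenset()):
    # Drop stop words unless they are explicitly listed in `keep`
    return " ".join(w for w in text.split()
                    if w not in STOP_WORDS or w in keep)

without_negation = remove_stop_words("crust is not good")
with_negation = remove_stop_words("crust is not good", keep=NEGATIONS)

print(without_negation)
print(with_negation)
```

With NLTK's list, the same effect could be obtained by subtracting the negation words from the stop word set before filtering.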

Next, bag-of-words counts and term frequency-inverse document frequency (TF-IDF) values are calculated from the text, using scikit-learn's CountVectorizer and TfidfVectorizer.

In [51]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [52]:
# Specifying the number of features as 100

cv = CountVectorizer(max_features=100)

In [53]:
count_vec = cv.fit_transform(corpus)

In [54]:
cv_df = pd.DataFrame(count_vec.toarray(), columns=cv.get_feature_names_out()).add_prefix('Count_')

In [55]:
cv_df.head()

Unnamed: 0,Count_also,Count_always,Count_amazing,Count_another,Count_ask,Count_atmosphere,Count_awesome,Count_back,Count_bad,Count_bland,...,Count_thing,Count_think,Count_time,Count_try,Count_vegas,Count_wait,Count_want,Count_way,Count_well,Count_would
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
cv_sum = cv_df.sum()
cv_sum = cv_sum.sort_values(ascending=False)
cv_sum.head()

Count_good       135
Count_food       127
Count_place      112
Count_service     87
Count_go          78
dtype: int64
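What the bag-of-words matrix contains can be shown without scikit-learn. The sketch below is a simplified stand-in for CountVectorizer (it splits on whitespace and applies no 'max_features' limit) run on the first three corpus entries. Each row is a review, each column a vocabulary term, and each cell the number of times the term occurs in the review.

```python
from collections import Counter

corpus_sample = ['wow love place', 'crust good', 'tasty texture nasty']

# Vocabulary: all distinct terms, sorted alphabetically
vocab = sorted({term for doc in corpus_sample for term in doc.split()})

# One row of term counts per document
rows = []
for doc in corpus_sample:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocab])

# Column totals, analogous to cv_df.sum()
totals = {term: sum(row[i] for row in rows) for i, term in enumerate(vocab)}

print(vocab)
for row in rows:
    print(row)
print(totals)
```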

In [57]:
tv = TfidfVectorizer(max_features=100)

In [58]:
tv_vec = tv.fit_transform(corpus)

In [59]:
tv_df = pd.DataFrame(tv_vec.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')

In [60]:
tv_df.head()

Unnamed: 0,TFIDF_also,TFIDF_always,TFIDF_amazing,TFIDF_another,TFIDF_ask,TFIDF_atmosphere,TFIDF_awesome,TFIDF_back,TFIDF_bad,TFIDF_bland,...,TFIDF_thing,TFIDF_think,TFIDF_time,TFIDF_try,TFIDF_vegas,TFIDF_wait,TFIDF_want,TFIDF_way,TFIDF_well,TFIDF_would
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Because the dataset is small, only word-level (unigram) TF-IDF values are calculated here. Bi-grams and tri-grams can also be specified and taken as features, so that the context of the words is captured to some extent.
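To make the idea of bi-grams and tri-grams concrete, the sketch below builds them by hand for one corpus entry. `ngrams` is a hypothetical helper written for illustration. In scikit-learn, the vectorizers accept an 'ngram_range' parameter for this purpose, for example 'ngram_range=(1, 2)' for unigrams plus bi-grams.

```python
def ngrams(text, n):
    # Slide a window of n tokens over the whitespace-separated text
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("selection menu great price", 2)
trigrams = ngrams("selection menu great price", 3)

print(bigrams)
print(trigrams)
```

Each n-gram becomes its own column, so the vocabulary grows quickly; on a dataset of this size, a cap such as 'max_features' is likely still needed.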

In [61]:
tv_sum = tv_df.sum()
tv_sum = tv_sum.sort_values(ascending=False)
tv_sum.head()

TFIDF_good       69.006789
TFIDF_food       59.057241
TFIDF_place      54.801282
TFIDF_service    48.664254
TFIDF_go         39.277020
dtype: float64
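The TF-IDF totals are much smaller than the raw counts because each review's vector is rescaled. The sketch below is a hand-written approximation of what TfidfVectorizer computes, assuming its default settings (smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The three-document corpus and the `tfidf` function are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

def tfidf(corpus):
    docs = [doc.split() for doc in corpus]
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        # L2 normalization so that every non-empty row has unit length
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

toy_corpus = ['good food', 'good place', 'bad service']   # hypothetical data
vocab, rows = tfidf(toy_corpus)

print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy review, 'food' receives a higher weight than 'good' because 'good' also appears in another review, which is the effect TF-IDF is meant to capture.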

In [62]:
# Combining the bag-of-words dataframe with the original data to get the final dataset.

combined_cv_df = pd.concat([data, cv_df], axis=1, sort=False)

In [63]:
combined_cv_df.head()

Unnamed: 0,Review,Liked,char_count,word_count,avg_word_len,Count_also,Count_always,Count_amazing,Count_another,Count_ask,...,Count_thing,Count_think,Count_time,Count_try,Count_vegas,Count_wait,Count_want,Count_way,Count_well,Count_would
0,wow loved this place,1,24,4,6.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,crust is not good,0,18,4,4.5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,not tasty and the texture was just nasty,0,41,8,5.125,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,stopped by during the late may bank holiday of...,1,87,15,5.8,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,the selection on the menu was great and so wer...,1,59,12,4.916667,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [64]:
combined_tv_df = pd.concat([data, tv_df], axis=1, sort=False)

In [65]:
combined_tv_df.head()

Unnamed: 0,Review,Liked,char_count,word_count,avg_word_len,TFIDF_also,TFIDF_always,TFIDF_amazing,TFIDF_another,TFIDF_ask,...,TFIDF_thing,TFIDF_think,TFIDF_time,TFIDF_try,TFIDF_vegas,TFIDF_wait,TFIDF_want,TFIDF_way,TFIDF_well,TFIDF_would
0,wow loved this place,1,24,4,6.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,crust is not good,0,18,4,4.5,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,not tasty and the texture was just nasty,0,41,8,5.125,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,stopped by during the late may bank holiday of...,1,87,15,5.8,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,the selection on the menu was great and so wer...,1,59,12,4.916667,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
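# Saving the final data in an Excel file; the context manager writes and closes it on exit.
# The .xlsx format is used because recent pandas versions no longer write legacy .xls files.

with pd.ExcelWriter("Data/data-count-vector.xlsx") as writer:
    combined_cv_df.to_excel(writer, sheet_name="Sheet1")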

In [67]:
with pd.ExcelWriter("Data/data-tfidf-vector.xlsx") as writer:
    combined_tv_df.to_excel(writer, sheet_name="Sheet1")