## CZ4045 Natural Language Processing 
### Assignment 

Dataset: Yelp, a dataset containing 15,300 reviews (https://www.yelp.com/dataset/download)

### 3.2 Dataset Analysis [60m]

#### Tokenization and Stemming

In [1]:
import json
import random

import pandas as pd
import numpy as np

#NLTK packages
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import ngrams
from nltk.stem import PorterStemmer
from nltk import FreqDist
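#ENGLISH_STOP_WORDS lives in feature_extraction.text; the old stop_words module was removed from scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS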

#Utils
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt
from random import randint
from random import choice



In [26]:
with open("reviewSelected100.json", 'r') as read_file:
    data = [json.loads(line) for line in read_file]

In [37]:
#---------------- Check Business Data ----------------------
'''
Total number of reviews = 15300
Total number of business types = 153
Each business has 100 reviews
'''
no_of_reviews = len(data)
#Business ID of every review, in file order
biz_type = [review['business_id'] for review in data]

#Store unique business IDs in a list (first-seen order is preserved)
uniq_biz_type = list(dict.fromkeys(biz_type))

no_biz_type = len(uniq_biz_type)
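
The figures quoted in the comment above (15,300 reviews, 153 businesses, 100 reviews each) can be checked directly with a `Counter`. The following is a minimal sketch on a hypothetical toy dataset (`toy_data`, with made-up IDs and texts); with the real file the same three lines would be applied to `data`.

```python
from collections import Counter

# Hypothetical miniature dataset: 3 businesses with 2 reviews each.
toy_data = [
    {"business_id": "A", "text": "great tacos"},
    {"business_id": "B", "text": "good burger"},
    {"business_id": "A", "text": "nice salsa"},
    {"business_id": "C", "text": "friendly staff"},
    {"business_id": "B", "text": "cold fries"},
    {"business_id": "C", "text": "slow service"},
]

# Number of reviews per business ID.
reviews_per_biz = Counter(review["business_id"] for review in toy_data)

n_reviews = len(toy_data)
n_businesses = len(reviews_per_biz)
# True when every business has the same number of reviews.
all_equal = len(set(reviews_per_biz.values())) == 1

print(n_reviews, n_businesses, all_equal)
```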

In [40]:
def random_biz_select(no_of_reviews, no_biz_type, data):
    #1. Randomly select a business b1 from the dataset
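    #randint is inclusive at both ends, so the upper bound must be no_biz_type - 1 to avoid an IndexError
    b1 = uniq_biz_type[randint(0, no_biz_type - 1)] #every time the code runs a new business_id is chosen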

    #2. Extract all reviews for b1 and create a small dataset B1
    B1 = {"small_dataset": []}
    count = 0
    for i in range(no_of_reviews):
        if data[i]['business_id'] == b1:
            count +=1 #must be 100 on every run
            B1['small_dataset'].append(data[i])

    #Consolidated 100 reviews of a randomly chosen business ID
    return B1

In [43]:
#Stopwords 
yelp_stop_words = set(stopwords.words('english')+ list(ENGLISH_STOP_WORDS))

In [51]:
#Word frequency before stemming
from nltk import FreqDist

def word_freq(biz_data):
    #Join reviews with a space so the last word of one review does not fuse with the first word of the next
    all_reviews = ' '.join(review['text'] for review in biz_data['small_dataset'])

    lowercase_review = all_reviews.lower()
    word_tokens = word_tokenize(lowercase_review)

    tokens = list()
    for word in word_tokens:
        if word.isalpha() and word not in yelp_stop_words:
            tokens.append(word)

    token_dist = FreqDist(tokens)
    dist = pd.DataFrame(token_dist.most_common(10),columns=['Word', 'Frequency'])

    return tokens, dist

In [52]:
#Word frequency after stemming
from nltk.stem import PorterStemmer

def stem_porter(tokens):
    porter = PorterStemmer()
    stem_word =[porter.stem(word) for word in tokens]
    stem_word_dist = FreqDist(stem_word)
    stem_dist = pd.DataFrame(stem_word_dist.most_common(10),columns=['Word', 'Frequency'])

    return stem_dist

In [56]:
# Function to display two tables side by side
from IPython.display import display_html
from itertools import chain,cycle
def display_side_by_side(*args,titles=cycle([''])):
    html_str=''
    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):
        html_str+='<th style="text-align:center"><td style="vertical-align:top">'
        html_str+=f'<h2>{title}</h2>'
        html_str+=df.to_html().replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)

In [57]:
#First randomly extracted business review 
biz_data = random_biz_select(no_of_reviews, no_biz_type, data)
#Before stemming 
tokens, dist = word_freq(biz_data)
#After stemming using porter
stem_dist = stem_porter(tokens)

In [59]:
display_side_by_side(dist, stem_dist, titles=['Before Stemming','After Stemming'])

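| | Before Stemming: Word | Frequency | After Stemming: Word | Frequency |
|---|---|---|---|---|
| 0 | place | 76 | taco | 114 |
| 1 | good | 73 | place | 86 |
| 2 | food | 68 | salsa | 77 |
| 3 | tacos | 67 | good | 73 |
| 4 | salsa | 66 | food | 69 |
| 5 | chicken | 48 | burrito | 55 |
| 6 | taco | 47 | chicken | 48 |
| 7 | burrito | 43 | chico | 41 |
| 8 | like | 37 | like | 41 |
| 9 | mexican | 36 | order | 41 |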


### Observations:

Using the tables above as an example, 'taco' becomes the most frequent word in this business's reviews once stemming is applied. Before stemming, the word is split across two entries, 'tacos' (index 3, 67 occurrences) and 'taco' (index 6, 47 occurrences), because of the plural suffix 's'. Stemming maps 'tacos' to 'taco', so the two counts merge into 67 + 47 = 114 and the stem moves to the top of the list.
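
The merge described above can be reproduced from the reported counts alone. This is a minimal sketch, not the notebook's own pipeline: it takes the top-10 before-stemming frequencies from the results above as a hard-coded dictionary (`before`, an illustrative name) and sums them per Porter stem.

```python
from collections import Counter
from nltk.stem import PorterStemmer

# Top-10 counts before stemming, copied from the results above.
before = {"place": 76, "good": 73, "food": 68, "tacos": 67, "salsa": 66,
          "chicken": 48, "taco": 47, "burrito": 43, "like": 37, "mexican": 36}

porter = PorterStemmer()
merged = Counter()
for word, count in before.items():
    # Words that share a stem accumulate into the same entry.
    merged[porter.stem(word)] += count

print(merged["taco"])  # 67 + 47 = 114
```

Only the ten listed words are included here, so the other totals come out lower than in the after-stemming results, which also count word forms that fall outside the top 10 (for example, forms that stem to 'place').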

In [61]:
#Repeat the process for a new business type. 
biz_data = random_biz_select(no_of_reviews, no_biz_type, data)
#Before stemming 
tokens, dist = word_freq(biz_data)
#After stemming using porter
stem_dist = stem_porter(tokens)

In [62]:
display_side_by_side(dist, stem_dist, titles=['Before Stemming','After Stemming'])

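| | Before Stemming: Word | Frequency | After Stemming: Word | Frequency |
|---|---|---|---|---|
| 0 | fries | 82 | fri | 117 |
| 1 | good | 62 | burger | 107 |
| 2 | burger | 60 | good | 62 |
| 3 | custard | 55 | custard | 55 |
| 4 | food | 53 | food | 54 |
| 5 | burgers | 47 | like | 51 |
| 6 | like | 41 | place | 44 |
| 7 | place | 41 | sauc | 35 |
| 8 | sauce | 35 | freddi | 34 |
| 9 | really | 33 | time | 34 |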


### Observations:

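As in the first example, plural forms are merged by stemming: 'burgers' (47) is folded into 'burger' (60), giving 107. The stem 'fri', however, is not an English word. Stemming turns 'fries' into 'fri', and the count rises from 82 to 117, which indicates that other word forms outside the top 10 were reduced to the same stem. The same effect appears in 'sauce' → 'sauc'. Such stems make the frequency table harder to interpret, because the listed token no longer carries the meaning of the original word on its own. This is a drawback of the Porter stemmer: it strips suffixes by rule instead of looking words up in a dictionary, so its output is not guaranteed to be a valid English word.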
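
The behaviour described above can be seen by stemming a few words in isolation. This is a small sketch using NLTK's `PorterStemmer` with its default settings; the word list is illustrative, and whether 'fried' actually occurs in these reviews is an assumption.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative words; only some of them appear in the tables above.
words = ["fries", "fried", "sauce", "freddy", "burgers"]
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Both 'fries' and 'fried' collapse to the non-word 'fri', while 'burgers' is reduced to the valid word 'burger', which shows that the stemmer applies the same suffix rules regardless of whether the result is in the dictionary.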