In this project, Natural Language Processing (NLP) techniques are used to analyze Yelp review data.

The number of 'stars' is the business rating given by a customer, ranging from 1 to 5.

'Cool', 'Useful', and 'Funny' are votes given to reviews by other Yelp users.

## STEP 0: IMPORT LIBRARIES

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## STEP 1: IMPORT DATASET

In [None]:
yelp_df = pd.read_csv('yelp.csv')

In [None]:
yelp_df.head(7)

In [None]:
yelp_df.describe()

In [None]:
yelp_df.info()

In [None]:
yelp_df['text'][0]

In [None]:
yelp_df['text'][1]

In [None]:
yelp_df['text'][9998]

In [None]:
yelp_df['text'][9999]

## STEP 2: VISUALIZE THE DATASET

In [None]:
yelp_df['length'] = yelp_df['text'].apply(len)

In [None]:
yelp_df.head(1)

In [None]:
yelp_df['length'].plot(bins = 100, kind = 'hist')

In [None]:
yelp_df.length.describe()

In [None]:
yelp_df[ yelp_df['length'] == 4997 ]['text'].iloc[0]

In [None]:
yelp_df[ yelp_df['length'] == 1 ]['text'].iloc[0]

In [None]:
sns.countplot(y = 'stars', data = yelp_df)

In [None]:
g = sns.FacetGrid(data=yelp_df, col='stars', col_wrap=3)

In [None]:
g = sns.FacetGrid(data=yelp_df, col='stars', col_wrap=3)
g.map(plt.hist, 'length', bins = 20, color = 'r')

In [None]:
yelp_df_1 = yelp_df[ yelp_df['stars'] == 1 ]

In [None]:
yelp_df_1

In [None]:
yelp_df_5 = yelp_df[ yelp_df['stars'] == 5 ]

In [None]:
yelp_df_5

In [None]:
yelp_df_1_5 = pd.concat([yelp_df_1, yelp_df_5])

In [None]:
yelp_df_1_5

In [None]:
yelp_df_1_5.info()

In [None]:
print('1-Star Review Percentage = ', (len(yelp_df_1) / len(yelp_df_1_5))*100, '%')

In [None]:
print('5-Star Review Percentage = ', (len(yelp_df_5) / len(yelp_df_1_5))*100, '%')

In [None]:
# Pass the column by name; recent seaborn versions expect keyword arguments here
sns.countplot(x = 'stars', data = yelp_df_1_5)

## STEP 3: CREATE TESTING AND TRAINING DATASET/DATA CLEANING

## STEP 3.1 EXERCISE: REMOVE PUNCTUATION

In [None]:
import string
string.punctuation

In [None]:
Test = 'Hello Mr. Future, I am so happy to be learning AI now!!'

In [None]:
Test_punc_removed = [char   for char in Test if char not in string.punctuation  ]

In [None]:
Test_punc_removed

In [None]:
Test_punc_removed_join = ''.join(Test_punc_removed)

In [None]:
Test_punc_removed_join

## STEP 3.2 EXERCISE: REMOVE STOPWORDS

In [None]:
import nltk
nltk.download('stopwords')  # one-time download of the stopword corpus
from nltk.corpus import stopwords
stopwords.words('english')

In [None]:
Test_punc_removed_join

In [None]:
Test_punc_removed_join.split()

In [None]:
Test_punc_removed_join_clean = [ word  for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')  ]

In [None]:
Test_punc_removed_join_clean

In [None]:
mini_challenge = 'Here is a mini challenge, that will teach you how to remove stopwords and punctuations!'

In [None]:
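# Step 1: drop punctuation characters, then rebuild the string
challenge = [char for char in mini_challenge if char not in string.punctuation]
challenge = ''.join(challenge)
# Step 2: drop stopwords (case-insensitive comparison)
challenge = [word for word in challenge.split() if word.lower() not in stopwords.words('english')]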


In [None]:
challenge

## STEP 3.3 EXERCISE: COUNT VECTORIZER EXAMPLE

In [None]:
sample_data = ['This is the first document.', 'This document is the second document','And this is the third one.','Is this the first document?']
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())

In [None]:
X

In [None]:
print(X.toarray())
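To make the count matrix less of a black box, here is a minimal pure-Python sketch of the same bag-of-words idea. It assumes CountVectorizer's default behavior (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the names `tokenize` and `bag_of_words` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, mirroring the default token pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # A Counter returns 0 for missing keys, so absent terms get a zero count.
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix

docs = ['This is the first document.',
        'This document is the second document',
        'And this is the third one.',
        'Is this the first document?']

vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so the second row holds a 2 in the 'document' column.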

## APPLY TO OUR DATA

In [None]:
# Let's define a pipeline to clean up all the messages 
# The pipeline performs the following: (1) remove punctuation, (2) remove stopwords

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean
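The function above rebuilds the stopword list for every word, which can be slow on thousands of reviews. A possible faster variant is sketched below: it strips punctuation with `str.translate` and checks membership in a precomputed set. The name `message_cleaning_fast` and the tiny `EXAMPLE_STOPWORDS` set are assumptions made only to keep the sketch self-contained; in the notebook one would pass `set(stopwords.words('english'))` instead.

```python
import string

# Hypothetical, tiny stopword set used only for illustration.
EXAMPLE_STOPWORDS = {'i', 'am', 'so', 'to', 'be', 'now', 'the', 'a', 'is'}

# Translation table that deletes every punctuation character in one pass.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def message_cleaning_fast(message, stop_words=EXAMPLE_STOPWORDS):
    """Remove punctuation, then drop stopwords (case-insensitive)."""
    no_punctuation = message.translate(PUNCTUATION_TABLE)
    return [word for word in no_punctuation.split()
            if word.lower() not in stop_words]

cleaned = message_cleaning_fast('Hello Mr. Future, I am so happy to be learning AI now!!')
print(cleaned)
```

Set lookups take constant time on average, so the cost per word no longer grows with the size of the stopword list.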

In [None]:
# Let's test the newly added function
yelp_df_clean = yelp_df_1_5['text'].apply(message_cleaning)

In [None]:
print(yelp_df_clean[0]) # cleaned up review

In [None]:
print(yelp_df_1_5['text'][0]) # Original review

In [None]:
yelp_df_1_5[ yelp_df_1_5['length'] == 662  ]['text'].iloc[0]

In [None]:
yelp_df_1_5[ yelp_df_1_5['length'] == 662  ]['text']

In [None]:
print(yelp_df_clean[3571]) # cleaned up review

## APPLY COUNT VECTORIZER TO OUR YELP REVIEWS

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = message_cleaning)
yelp_countvectorizer = vectorizer.fit_transform(yelp_df_1_5['text'])

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())

In [None]:
print(yelp_countvectorizer.toarray())

In [None]:
yelp_countvectorizer.shape

## STEP 4: TRAINING THE MODEL WITH THE ENTIRE DATASET

In [None]:
from sklearn.naive_bayes import MultinomialNB

NB_classifier = MultinomialNB()
label = yelp_df_1_5['stars'].values

In [None]:
yelp_df_1_5['stars'].values

In [None]:
NB_classifier.fit(yelp_countvectorizer, label)

In [None]:
testing_sample = ['amazing food! highly recommend']

testing_sample_countvectorizer = vectorizer.transform(testing_sample)
test_predict = NB_classifier.predict(testing_sample_countvectorizer)

test_predict

## DIVIDE THE DATA INTO TRAINING AND TESTING PRIOR TO TRAINING

In [None]:
X = yelp_countvectorizer


In [None]:
y = label

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)

## STEP 5: EVALUATING THE MODEL

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
y_predict_train = NB_classifier.predict(X_train)
y_predict_train

In [None]:
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True)

In [None]:
y_predict_test = NB_classifier.predict(X_test)
y_predict_test

cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True)

In [None]:
print(classification_report(y_test, y_predict_test))
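The numbers in the classification report can be derived directly from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes. The counts in `example_cm` are made up for illustration and are not results from this dataset, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(cm, positive_index=1):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix.

    cm[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    negative_index = 1 - positive_index
    tp = cm[positive_index][positive_index]   # correctly predicted positives
    fp = cm[negative_index][positive_index]   # negatives predicted as positive
    fn = cm[positive_index][negative_index]   # positives predicted as negative
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: row 0 = true 1-star reviews, row 1 = true 5-star reviews.
example_cm = [[100, 50],
              [20, 630]]

precision, recall, f1 = binary_metrics(example_cm)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With imbalanced classes such as 1-star versus 5-star reviews, checking precision and recall per class is usually more informative than overall accuracy.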

## STEP 6: ADD ADDITIONAL FEATURE TF-IDF 

TF-IDF stands for "Term Frequency–Inverse Document Frequency". It is a numerical statistic that reflects how important a word is to a document in a collection (corpus) of documents.

TF-IDF is used as a weighting factor in text search and text mining.

The intuition behind TF-IDF is as follows: if a word appears several times in a given document, it is likely more meaningful than words that appear fewer times in the same document. However, if that word also appears many times in other documents, it is probably just a common word such as 'I' or 'am' and carries little meaning.

TF: Term Frequency is used to measure the frequency of term occurrence in a document:

TF(word) = Number of times the 'word' appears in a document / Total number of terms in the document

IDF: Inverse Document Frequency is used to measure how important a term is:

IDF(word) = log_e(Total number of documents / Number of documents with the term 'word' in it).
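The two formulas above translate almost line for line into Python. This is a minimal sketch on a made-up three-review corpus; the function names are illustrative, and it assumes the term occurs in at least one document (otherwise the IDF division would fail).

```python
import math

def term_frequency(term, document_tokens):
    """Occurrences of the term divided by the total number of terms."""
    return document_tokens.count(term) / len(document_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

def tf_idf(term, document_tokens, corpus_tokens):
    return (term_frequency(term, document_tokens)
            * inverse_document_frequency(term, corpus_tokens))

# Made-up corpus of tokenized reviews, for illustration only.
corpus = [
    'the food was amazing'.split(),
    'the service was slow'.split(),
    'the food was cold and the staff was rude'.split(),
]

# 'the' occurs in every document, so its IDF (and TF-IDF) is zero.
print(tf_idf('the', corpus[0], corpus))
# 'amazing' occurs in only one document, so it gets a positive weight.
print(tf_idf('amazing', corpus[0], corpus))
```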
   
Example: Assume a document contains 1,000 words and the term 'john' appears 20 times. The term frequency for 'john' is:

TF|john = 20/1000 = 0.02

Now calculate the IDF (inverse document frequency) of 'john', assuming it appears in 50,000 of the 1,000,000 documents in the corpus:

IDF|john = log_e(1,000,000/50,000) = log_e(20) ≈ 3.0

Therefore, the overall weight of the word 'john' is:

TF-IDF|john = 0.02 * 3.0 = 0.06
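The IDF value in this example depends on the base of the logarithm: for a ratio of 20, base 10 gives about 1.30 while the natural logarithm gives about 3.00. The short check below reproduces the arithmetic for both bases so the numbers can be verified; the variable names are illustrative.

```python
import math

tf_john = 20 / 1000                  # 0.02
ratio = 1_000_000 / 50_000           # 20 documents in total per document containing 'john'

idf_log10 = math.log10(ratio)        # about 1.301
idf_ln = math.log(ratio)             # about 2.996

print('IDF (base 10):', round(idf_log10, 3), '-> TF-IDF:', round(tf_john * idf_log10, 4))
print('IDF (natural log):', round(idf_ln, 3), '-> TF-IDF:', round(tf_john * idf_ln, 4))
```

The choice of base only rescales every weight by the same constant, so it does not change how terms rank against each other.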

In [None]:
yelp_countvectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
yelp_tfidf = TfidfTransformer().fit_transform(yelp_countvectorizer)
print(yelp_tfidf.shape)
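One caveat worth noting: with its default settings, scikit-learn's TfidfTransformer does not apply the textbook formula exactly. It takes the raw term count as TF, uses a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The pure-Python sketch below mimics that behavior on a made-up count matrix; `tfidf_smooth_l2` is a hypothetical helper written for illustration, not a library function.

```python
import math

def tfidf_smooth_l2(counts):
    """Smoothed-IDF weighting followed by L2 row normalization.

    counts is a list of rows (documents), each a list of raw term counts.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in counts:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(value * value for value in weighted))
        # Leave all-zero rows as zeros to avoid dividing by zero.
        result.append([value / norm if norm else 0.0 for value in weighted])
    return result

# Made-up counts: 2 documents, 3 terms.
toy_counts = [[1, 1, 0],
              [1, 0, 2]]

toy_tfidf = tfidf_smooth_l2(toy_counts)
for row in toy_tfidf:
    print([round(value, 3) for value in row])
```

Because of the "+ 1" term, a word that appears in every document still receives a nonzero weight, unlike in the plain formula.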

In [None]:
yelp_tfidf

In [None]:
print(yelp_tfidf[:,:])

In [None]:
X = yelp_tfidf
y = label

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)

y_predict_test = NB_classifier.predict(X_test)
y_predict_test

cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True)
