# Yelp Review Classification - Natural Language Processing Project

In this NLP project I will attempt to classify Yelp reviews as 1-star or 5-star based on the text of each review.

I will use the [Yelp Review Data Set from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).

Each observation in this dataset is a review of a particular business by a particular user.

The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business; more stars is better. In other words, it is the rating of the business by the person who wrote the review.

The "cool" column is the number of "cool" votes this review received from other Yelp users. 

All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.

The "useful" and "funny" columns are similar to the "cool" column.


## Imports


In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## The Data

**Reading the yelp.csv file and setting it as a dataframe called yelp.**

In [None]:
yelp=pd.read_csv('yelp.csv')

**Checking the head, info, and describe methods on yelp.**

In [None]:
yelp.head()

In [None]:
yelp.info()

In [None]:
yelp.describe()

In [None]:
review_text=[line.rstrip() for line in yelp['text']]

In [None]:
len(review_text)

In [None]:
yelp['text length']=yelp['text'].apply(len)
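The `text length` column above counts characters with `apply(len)`. A minimal sketch on a hypothetical two-row frame (`toy` is made up for illustration, not part of the Yelp data) shows the same computation with the `.str.len()` accessor. One practical difference: `.str.len()` returns NaN for a missing value, whereas `len` raises a `TypeError` on it.

```python
import pandas as pd

# Hypothetical toy data standing in for yelp['text']
toy = pd.DataFrame({'text': ['Great', 'Not so good']})

# Character count per review; matches .apply(len) when no values are missing
toy['text length'] = toy['text'].str.len()
print(toy)
```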

# EDA


## Imports

**Importing the data visualization libraries.**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

**Using FacetGrid from the seaborn library to create a grid of 5 histograms of text length based off of the star ratings.**

In [None]:
g=sns.FacetGrid(data=yelp,col='stars')
g.map(plt.hist,'text length')

**Creating a boxplot of text length for each star category.**

In [None]:
sns.boxplot(x='stars',y='text length',data=yelp)

**Creating a countplot of the number of occurrences for each type of star rating.**

In [None]:
sns.countplot(x='stars',data=yelp)

**Using groupby to get the mean values of the numerical columns.**

In [None]:
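# numeric_only=True skips the string columns, which newer pandas versions refuse to average
mean_yelp=yelp.groupby('stars').mean(numeric_only=True)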
mean_yelp

In [None]:
mean_yelp.corr()

In [None]:
sns.heatmap(mean_yelp.corr(),annot=True,cmap='coolwarm')
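To make the grouping step concrete, here is a small sketch on a hypothetical frame (`toy` and its values are invented; only the column names mirror the Yelp data). It shows that the result has one row per star value. In the notebook that means the correlation is computed across just five rows, so the coefficients describe how the group averages move together rather than how individual reviews behave.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the Yelp data
toy = pd.DataFrame({
    'stars': [1, 1, 5, 5],
    'cool': [0, 2, 1, 3],
    'text': ['bad', 'awful', 'great', 'superb'],
})

# One row per star value; numeric_only=True leaves the string column out
toy_means = toy.groupby('stars').mean(numeric_only=True)
print(toy_means)
```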

## NLP Classification Task



**Creating a dataframe called yelp_class that contains the columns of yelp dataframe but for only the 1 or 5 star reviews.**

In [None]:
yelp_class=yelp[(yelp.stars==1)|(yelp.stars==5)]

In [None]:
yelp_class
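The boolean filter above keeps rows whose rating is 1 or 5. As a sketch of an equivalent spelling, `Series.isin` expresses the same condition with a list of allowed values, which can be easier to extend. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with a mix of ratings
toy = pd.DataFrame({'stars': [1, 3, 5, 4, 5], 'text': ['a', 'b', 'c', 'd', 'e']})

# Keep only the 1-star and 5-star rows
toy_class = toy[toy['stars'].isin([1, 5])]
print(toy_class)
```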

In [None]:
X = yelp_class['text']
y = yelp_class['stars']

**Importing CountVectorizer and creating a CountVectorizer object.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()

**Using the fit_transform method on the CountVectorizer object, passing in X (the 'text' column).**

In [None]:
X=cv.fit_transform(X)
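To illustrate what `fit_transform` produces, the sketch below runs a separate `CountVectorizer` on two hypothetical reviews. Each column of the result corresponds to one word in the learned vocabulary, and each row holds the word counts for one review. The names `toy_reviews`, `toy_cv` and `toy_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reviews, not taken from the Yelp data
toy_reviews = ["great food great service", "terrible food"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews)  # sparse matrix: reviews x words

# Words ordered by their column index in the count matrix
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```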

## Train Test Split

Let's split our data into training and testing data.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
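Note that `cv` was fit on all reviews before this split, so its vocabulary includes words that occur only in the test reviews. One possible alternative, sketched here on hypothetical toy data, is to split the raw text first, fit the vectorizer on the training part only, and reuse it on the test part.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews and ratings, not taken from the Yelp data
toy_text = [
    'great food', 'terrible service', 'loved the place',
    'awful experience', 'friendly staff', 'never again',
]
toy_stars = [5, 1, 5, 1, 5, 1]

# Split the raw text first, before any vectorizing
text_train, text_test, stars_train, stars_test = train_test_split(
    toy_text, toy_stars, test_size=0.33, random_state=42)

# Learn the vocabulary from the training reviews only
toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(text_train)

# Reuse that vocabulary on the test reviews; unseen words are ignored
test_counts = toy_cv.transform(text_test)

print(train_counts.shape, test_counts.shape)
```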

## Training a Model


**Importing MultinomialNB and creating an instance of the estimator called nb.**

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()

In [None]:
nb.fit(X_train,y_train)

## Predictions and Evaluations



In [None]:
predictions=nb.predict(X_test)

**Creating a confusion matrix and classification report using these predictions and y_test.**

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))
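As a reading aid for the confusion matrix, here is a small sketch with hypothetical labels. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, both in sorted label order (here 1, then 5).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted ratings
toy_true = [1, 1, 5, 5, 5]
toy_pred = [1, 5, 5, 5, 1]

# Row 0: true 1-star reviews, row 1: true 5-star reviews
# Column 0: predicted 1-star, column 1: predicted 5-star
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)
```

In this toy case, one 1-star review is misread as 5-star (top right) and one 5-star review is misread as 1-star (bottom left).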

# Using Text Processing

**Importing TfidfTransformer from sklearn.**

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
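`TfidfTransformer` reweights raw counts so that words appearing in many documents count for less. The sketch below, on three hypothetical reviews, shows the effect with the default settings: "great" occurs in every review, so within the first review it ends up with a lower weight than "food", even though both occur once.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical reviews that all share the word "great"
toy_reviews = ["great food", "great service", "great prices"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews)

# Convert counts to TF-IDF weights (rows are L2-normalised by default)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

food_col = toy_cv.vocabulary_['food']
great_col = toy_cv.vocabulary_['great']
print(toy_tfidf[0][food_col], toy_tfidf[0][great_col])
```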

**Importing Pipeline from sklearn.**

In [None]:
from sklearn.pipeline import Pipeline

**Creating a pipeline with the following steps: CountVectorizer(), TfidfTransformer(), MultinomialNB().**

In [None]:
pipeline=Pipeline([
    ('bow',CountVectorizer()),        # raw text -> token counts
    ('tfidf',TfidfTransformer()),     # token counts -> TF-IDF weights
    ('classifier',MultinomialNB()),   # Naive Bayes on the TF-IDF features
])

## Using the Pipeline



### Train Test Split



In [None]:
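# Start again from the raw text: the pipeline does its own vectorizing,
# and X was overwritten earlier with the count matrix
X = yelp_class['text']
y = yelp_class['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)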

**Fitting the pipeline to the training data.**

In [None]:
pipeline.fit(X_train,y_train)
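Because the pipeline bundles the vectorizer with the classifier, it accepts raw strings both when fitting and when predicting. A minimal sketch with a separate pipeline trained on four hypothetical reviews (`toy_pipeline` and the data are illustrative, far too small to be a meaningful model):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data, not taken from the Yelp data
toy_text = [
    'great food and friendly staff',
    'loved it great service',
    'terrible food and rude staff',
    'awful service never again',
]
toy_stars = [5, 5, 1, 1]

toy_pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

# Fit on raw strings; each step transforms the data for the next one
toy_pipeline.fit(toy_text, toy_stars)

# Predict directly on a new raw string
toy_prediction = toy_pipeline.predict(['great service'])
print(toy_prediction)
```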

### Predictions and Evaluation



In [None]:
predictions=pipeline.predict(X_test)

In [None]:
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))