## Imports
**Import the usual suspects. :)**

In [1]:
import numpy as np
import pandas as pd

## The Data

**Read the yelp.csv file and set it as a dataframe called yelp.**

In [3]:
yelp = pd.read_csv('yelp.csv')

**Check the head, info, and describe methods on yelp.**

In [None]:
yelp.head()

In [None]:
yelp.info()

In [None]:
yelp.describe()

**Create a new column called "text length" which is the number of characters in the text column.**

In [11]:
yelp['text length'] = yelp['text'].apply(len)
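
Note that applying `len` to a string counts characters, not words. A minimal sketch of the difference, using a made-up two-review Series (`toy_reviews` and the other names here are illustrative assumptions, not part of yelp.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for the 'text' column
toy_reviews = pd.Series(["Great food", "Bad"])

# len() on a string returns its character count (spaces included)
char_counts = toy_reviews.apply(len)

# Splitting on whitespace first gives a word count instead
word_counts = toy_reviews.str.split().str.len()

print(char_counts.tolist())  # [10, 3]
print(word_counts.tolist())  # [2, 1]
```

Either measure would probably work for the plots that follow; the point is only to know which one the column holds.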

# EDA


## Imports

In [15]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

**Use FacetGrid from the seaborn library to create a grid of 5 histograms of text length based off of the star ratings.**

In [None]:
g = sns.FacetGrid(yelp,col='stars')
g.map(plt.hist,'text length')

**Create a boxplot of text length for each star category.**

In [None]:
sns.boxplot(x='stars',y='text length',data=yelp,palette='rainbow')

**Create a countplot of the number of occurrences for each type of star rating.**

In [None]:
sns.countplot(x='stars',data=yelp,palette='rainbow')

**Use groupby to get the mean values of the numerical columns. You should be able to create this dataframe with the operation:**

In [None]:
stars = yelp.groupby('stars').mean(numeric_only=True)
stars

**Use the corr() method on that groupby dataframe to produce this dataframe:**

In [None]:
stars.corr()

**Then use seaborn to create a heatmap based off that .corr() dataframe:**

In [None]:
sns.heatmap(stars.corr(),cmap='coolwarm',annot=True)

## NLP Classification Task

**Create a dataframe called yelp_class that contains only the reviews that were either 1 star or 5 stars.**

In [39]:
yelp_class = yelp[(yelp.stars==1) | (yelp.stars==5)]
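
The same filter can also be written with `isin`, which may read more easily if the list of ratings grows. A small sketch on a made-up frame (`toy` and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the yelp dataframe
toy = pd.DataFrame({"stars": [1, 3, 5, 4, 5],
                    "text": ["a", "b", "c", "d", "e"]})

# Boolean mask combined with | (the form used above)
mask_or = toy[(toy.stars == 1) | (toy.stars == 5)]

# Equivalent membership test with isin
mask_isin = toy[toy["stars"].isin([1, 5])]

print(mask_isin["stars"].tolist())  # [1, 5, 5]
```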

In [None]:
yelp_class.head()  # inspect the filtered 1-star and 5-star reviews

**Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class. (Your features and target/labels)**

In [49]:

X = yelp_class['text']
y = yelp_class['stars']


**Import CountVectorizer and create a CountVectorizer object.**

In [45]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

**Use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column). Save this result by overwriting X.**

In [51]:
X = cv.fit_transform(X)

## Train Test Split

Let's split our data into training and testing data.

**Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101.**

In [53]:
from sklearn.model_selection import train_test_split

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)
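
As a quick illustration of what the two arguments do, the sketch below splits ten made-up samples twice with the same `random_state` (the `toy_X`, `toy_y` arrays and the `a_`/`b_` names are assumptions for this example only). `test_size=0.3` puts 3 of the 10 rows in the test set, and a fixed `random_state` makes the shuffle repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy features and labels
toy_X = np.arange(10).reshape(10, 1)
toy_y = np.array([1, 5] * 5)

a_train, a_test, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=101)
b_train, b_test, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=101)

print(a_train.shape, a_test.shape)     # (7, 1) (3, 1)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```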

## Training a Model

**Import MultinomialNB and create an instance of the estimator called nb.**

In [57]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

**Now fit nb using the training data.**

In [None]:
nb.fit(X_train,y_train)

## Predictions and Evaluations


In [61]:
predictions = nb.predict(X_test)

In [63]:
from sklearn.metrics import confusion_matrix,classification_report

In [None]:
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))
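
When reading the output, it helps to remember that scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, with labels in sorted order. A small sketch with made-up labels (`toy_true` and `toy_pred` are hypothetical, not model output) that derives precision and recall for the 5-star class by hand:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted star ratings
toy_true = [1, 1, 5, 5, 5]
toy_pred = [1, 5, 5, 5, 1]

toy_cm = confusion_matrix(toy_true, toy_pred)  # rows: true, cols: predicted
print(toy_cm)  # [[1 1]
               #  [1 2]]

# For class 5 (index 1): correct 5-star predictions / all 5-star predictions
precision_5 = toy_cm[1, 1] / toy_cm[:, 1].sum()
# For class 5: correct 5-star predictions / all true 5-star reviews
recall_5 = toy_cm[1, 1] / toy_cm[1, :].sum()
```

These are the same quantities that `classification_report` lists per class.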

**Great! Let's see what happens if we add TF-IDF to this process using a pipeline.**

# Using Text Processing

**Import TfidfTransformer from sklearn.**

In [67]:
from sklearn.feature_extraction.text import TfidfTransformer

**Import Pipeline from sklearn.**

In [69]:
from sklearn.pipeline import Pipeline

**Now create a pipeline with the following steps: CountVectorizer(), TfidfTransformer(), MultinomialNB()**

In [71]:
pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
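
Because the first step is a vectorizer, a pipeline like this takes raw strings directly in both `fit` and `predict`. A minimal sketch on four made-up reviews (`toy_pipe`, `toy_texts` and `toy_stars` are hypothetical and far too small for a meaningful model):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training reviews and their star ratings
toy_texts = ["great food great service", "terrible food rude staff",
             "loved it great", "awful terrible experience"]
toy_stars = [5, 1, 5, 1]

toy_pipe = Pipeline([
    ('bow', CountVectorizer()),       # raw text -> token counts
    ('tfidf', TfidfTransformer()),    # counts -> TF-IDF weights
    ('classifier', MultinomialNB()),  # fit Naive Bayes on the weights
])

toy_pipe.fit(toy_texts, toy_stars)    # every step is fitted in order
toy_preds = toy_pipe.predict(["great service", "terrible staff"])
print(list(toy_preds))  # [5, 1]
```

The vocabulary and IDF weights are learned only from the data passed to `fit`, which is why the pipeline should be fitted on the training split alone.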

## Using the Pipeline

### Train Test Split

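**Redo the train test split on the yelp_class object. The pipeline does its own vectorizing, so it needs the raw text, and X was overwritten earlier with the CountVectorizer output.**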

In [73]:
X = yelp_class['text']
y = yelp_class['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)

In [None]:
# May take some time
pipeline.fit(X_train,y_train)

### Predictions and Evaluation


In [77]:
predictions = pipeline.predict(X_test)

In [None]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))