<a href="https://colab.research.google.com/github/christabs27/Linear-Regression-for-Heights/blob/main/11_12_3_Activity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity 11.12.3

How many of us decide where to eat or what product to buy based on consumer reviews?  This is a very important source of information to both businesses and consumers.  However, there are way too many reviews being generated for humans to read and classify all of them.  Enter machine learning and natural language processing.

In this lesson, you will build on your base Naive Bayes model by incorporating stop words, stemming, lemmatization, and n-grams to see how they can improve model accuracy.

##Step 1: Install the necessary packages
Run the following code block to import the necessary packages and set up the stemmer and lemmatizer.

In [1]:
#Step 1

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

import nltk
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('wordnet')

from textblob import TextBlob
import re
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
from nltk import word_tokenize         
from nltk.stem import WordNetLemmatizer 

def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


##Step 2: Download and save the `yelp_labeled.txt` data set from the class resources  

Make a note of where you saved the file on your computer.

##Step 3: Upload the `yelp_labeled.txt` running the following code block 

When prompted, navigate to and select the `yelp_labeled.txt` data set where you saved it on your computer.

In [2]:
#Step 3

from google.colab import files
yelp_labeled = files.upload()

Saving yelp_labeled.txt to yelp_labeled.txt


## Step 4: Create a Pandas Dataframe from the `.txt` file
* Run the following code block to create a Pandas DataFrame from the `.txt` file and name it `Yelp`.
* The sentiment of the reviews is coded as `0` (negative) and `1` (positive).  

In [3]:
#Step 4

Yelp = pd.read_csv('yelp_labeled.txt',delimiter='\t',header=None)
Yelp.rename(columns={0:'Review',1:'Sentiment'}, inplace=True)
print(Yelp.head())

                                              Review  Sentiment
0                            Wow...Loved this place.          1
1                                 Crust is not good.          0
2          Not tasty and the texture was just nasty.          0
3  Stopped by during the late May bank holiday of...          1
4  The selection on the menu was great and so wer...          1


##Step 5: Instantiate and fit `CountVectorizer` that removes stop words from the English language
* Transform the review data and save the results as `X`.
* Create a target vector called `y` from `Yelp['Sentiment']`.
* This can be done by running the following code:

```
Vectorizer = CountVectorizer(stop_words={'english'})

X = Vectorizer.fit_transform(Yelp['Review'])

y = Yelp['Sentiment']
```


In [4]:
#Step 5
Vectorizer = CountVectorizer(stop_words={'english'})

X = Vectorizer.fit_transform(Yelp['Review'])

y = Yelp['Sentiment']



##Step 6: Model the data and calculate accuracy
* Run the following code block to split the data into training and test sets, model the data that has been pre-processed to use English stop words, and calculate the 10-fold cross-validation accuracy.
* What is the accuracy of the model now? How does it compare to the naive model, which had 79% accuracy?

In [5]:
#Step 6

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

yelp_review = MultinomialNB()
yelp_review.fit(X_train,y_train)

scores = cross_val_score(yelp_review, X_train, y_train, cv=10)

print(scores.mean())

0.7906666666666666


###Answer:

##Step 7: Instantiate and fit `CountVectorizer` to stem the words
* Transform the review data and save the results as `X_2`.
* This can be done by running the following code:

```
Vectorizer2 = CountVectorizer(tokenizer=stemming_tokenizer)

X_2 = Vectorizer2.fit_transform(Yelp['Review'])

```


In [6]:
#Step 7:
Vectorizer2 = CountVectorizer(tokenizer=stemming_tokenizer)

X_2 = Vectorizer2.fit_transform(Yelp['Review'])



##Step 8: Model the data and calculate accuracy
* Run the following code block to split the data into training and test sets, model the data that has been pre-processed to remove stop words and stem the words, and calculate the 10-fold cross-validation accuracy.
* What is the accuracy of the model now? How does it compare to the naive model, which had 79% accuracy?

In [7]:
#Step 8

X_train, X_test, y_train, y_test = train_test_split(X_2, y, test_size=0.25, random_state=42)

yelp_review = MultinomialNB()
yelp_review.fit(X_train,y_train)

scores = cross_val_score(yelp_review, X_train, y_train, cv=10)

print(scores.mean())

0.7959999999999999


###Answer:



##Step 9: Instantiate and fit `CountVectorizer` to lemmatize the words
* Transform the review data and save the results as `X_3`.
* This can be done by running the following code:

```
Vectorizer3 = CountVectorizer(tokenizer=LemmaTokenizer())

X_3 = Vectorizer3.fit_transform(Yelp['Review'])

```


In [8]:
#Step 9
Vectorizer3 = CountVectorizer(tokenizer=LemmaTokenizer())

X_3 = Vectorizer3.fit_transform(Yelp['Review'])



##Step 10: Model the data and calculate accuracy
* Run the following code block to split the data into training and test sets, model the data that has been pre-processed to remove stop words, lemmatize the data, and calculate the 10-fold cross-validation accuracy.
* What is the accuracy of the model now? How does it compare to the naive model, which had 79% accuracy?  How does the accuracy compare to the model where we stemmed the data?

In [9]:
#Step 10

X_train, X_test, y_train, y_test = train_test_split(X_3, y, test_size=0.25, random_state=42)

yelp_review = MultinomialNB()
yelp_review.fit(X_train,y_train)

scores = cross_val_score(yelp_review, X_train, y_train, cv=10)

print(scores.mean())

0.7853333333333333


###Answer:



##Step 11: Instantiate and fit `CountVectorizer` to create n-grams up to size 3
* Transform the review data and save the results as `X_4`.
* This can be done by running the following code:

```
Vectorizer4 = CountVectorizer(ngram_range=(1,3))

X_4 = Vectorizer4.fit_transform(Yelp['Review'])

```


In [10]:
#Step 11
Vectorizer4 = CountVectorizer(ngram_range=(1,3))

X_4 = Vectorizer4.fit_transform(Yelp['Review'])


##Step 12: Model the data and calculate accuracy
* Run the following code block to split the data into training and test sets, model the data that has been pre-processed into n-grams, and calculate the 10-fold cross-validation accuracy.
* What is the accuracy of the model now? How does it compare to the naive model, which had 79% accuracy?  How does the accuracy compare to the model where we stemmed the data?

In [11]:
#Step 12:


X_train, X_test, y_train, y_test = train_test_split(X_4, y, test_size=0.25, random_state=42)

yelp_review = MultinomialNB()
yelp_review.fit(X_train,y_train)

scores = cross_val_score(yelp_review, X_train, y_train, cv=10)

print(scores.mean())

0.784


###Answer:



##Step 13: Try both stemming and using n-grams
* Run the following code block to train and evaluate a model that uses both stemming and n-grams up to tri-grams.
* What is the accuracy of the model now? How does it compare to the naive model, which had 79% accuracy? How does the accuracy compare to the model where we stemmed the data?



In [12]:
#Step 13

Vectorizer5 = CountVectorizer(ngram_range=(1,3), tokenizer=stemming_tokenizer)

X_5 = Vectorizer5.fit_transform(Yelp['Review'])

X_train, X_test, y_train, y_test = train_test_split(X_5, y, test_size=0.25, random_state=42)

yelp_review = MultinomialNB()
yelp_review.fit(X_train,y_train)

scores = cross_val_score(yelp_review, X_train, y_train, cv=10)

print(scores.mean())

0.7920000000000001


###Answer:



##Step 14: Evaluate the model with stemming and n-grams on the test data
* Run the following code block to evaluate the model using the test data.
* What is the testing accuracy?



In [13]:
#Step 14

yelp_review.score(X_test,y_test)

0.808

###Answer:

