## DS 862 Machine Learning for Business Analysts Fall 2020

### Natural Language Processing

#### Submitted by:
* Di Wang

We will now repeat the same procedure, but this time we will use Yelp reviews instead. The same [dataset](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences) includes a file that contains Yelp reviews and their labeled sentiments.

In [1]:
import numpy as np
import pandas as pd

# Load the data
yelp_data = pd.read_csv('yelp_labelled.txt', sep = "\t", names = ['text','sentiment'])

yelp_data.head()

| | text | sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1.0 |
| 1 | I learned that if an electric slicer is used t... | NaN |
| 2 | But they don't clean the chiles? | NaN |
| 3 | Crust is not good. | 0.0 |
| 4 | Not tasty and the texture was just nasty. | 0.0 |


#### Your first task: drop the missing data

In [2]:
# Drop missing values
yelp_data.dropna(inplace = True)
yelp_data.head()

| | text | sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1.0 |
| 3 | Crust is not good. | 0.0 |
| 4 | Not tasty and the texture was just nasty. | 0.0 |
| 10 | Stopped by during the late May bank holiday of... | 1.0 |
| 11 | The selection on the menu was great and so wer... | 1.0 |


In the lecture demonstration, we saw how to pair CountVectorizer or TF-IDF with a classifier to perform sentiment analysis. Here we will do the same, but instead of using Random Forest, I want you to use Naive Bayes. Naive Bayes is a popular algorithm for text classification because it is a simple model that can still achieve good scores.
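To make the two feature extraction methods concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and variable names are illustrative assumptions and are not taken from the Yelp file. It shows that CountVectorizer stores raw word counts, while TfidfVectorizer down-weights words that appear in many documents and, by default, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (not from the Yelp file)
docs = ["good food good service", "bad food", "good place"]

# Bag-of-Words: one column per word, each entry is a raw count
cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)  # words in column order

# TF-IDF: counts are reweighted so that words found in many documents
# contribute less; rows are L2-normalized by default
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

print(vocab)
print(counts)
print(np.round(weights, 2))
```

In the first sentence, "food" and "service" each occur once, but "service" receives the larger TF-IDF weight because it appears in fewer documents.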

First, separate the data into training and test sets. For each feature extraction method, perform hyperparameter tuning to find the model that gives the best classification result, then evaluate it on the test set.

#### Split data

In [3]:
# Identify X and y
X = yelp_data['text']
y = yelp_data['sentiment']

# Split data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

#### Using Bag-of-Words for feature extraction, build a classifier using Naive Bayes. Evaluate the performance on the test set.

In [4]:
# In scikit-learn, Bag-of-Words is implemented by CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

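# Set up the pipeline: Bag-of-Words features followed by Multinomial Naive Bayes.
# The token pattern keeps only tokens made of letters and underscores (no digits).
pipeline = Pipeline([('CV', CountVectorizer(stop_words = 'english', token_pattern = r'\b[^\d\W]+\b')),
                     ('MB', MultinomialNB())])
# Hyperparameter grid; keys follow the '<step name>__<parameter>' convention
parameters = {
    'CV__ngram_range':[(1, 1), (1, 2), (1, 3), (1, 4)],
    'CV__max_df' : [5,10,100,200,500,1000],   # integers are absolute document counts
    'CV__min_df' : [1,2,3],
    'CV__lowercase': (True, False),
    'MB__alpha':[0,0.2,0.4,0.6,0.8,1]}        # additive smoothing; 0 means no smoothing
# 3-fold cross-validated grid search; n_jobs = -2 uses all CPU cores but one
Model1 = GridSearchCV(pipeline, parameters, cv = 3, n_jobs = -2)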
Model1.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('CV',
                                        CountVectorizer(stop_words='english',
                                                        token_pattern='\\b[^\\d\\W]+\\b')),
                                       ('MB', MultinomialNB())]),
             n_jobs=-2,
             param_grid={'CV__lowercase': (True, False),
                         'CV__max_df': [5, 10, 100, 200, 500, 1000],
                         'CV__min_df': [1, 2, 3],
                         'CV__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
                         'MB__alpha': [0, 0.2, 0.4, 0.6, 0.8, 1]})

In [6]:
Model1.best_params_

{'CV__lowercase': True,
 'CV__max_df': 100,
 'CV__min_df': 1,
 'CV__ngram_range': (1, 1),
 'MB__alpha': 0.6}

In [7]:
np.mean(Model1.predict(X_test) == y_test)

0.79

**Observation:**

- The best model converts all characters to lowercase before tokenizing, uses a maximum document frequency of 100 and a minimum document frequency of 1, uses only unigrams, and has a smoothing parameter of 0.6. With these settings, the model reaches an accuracy of 79% on the test set.
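As a rough illustration of what the document frequency settings do, the sketch below fits CountVectorizer on a made-up four-sentence corpus and compares the resulting vocabularies. The corpus and the helper `vocab` are hypothetical names introduced only for this example. When `min_df` and `max_df` are integers, they are absolute document counts: a word is kept only if it appears in at least `min_df` and at most `max_df` documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "good" is in 3 documents, "food" in 2, the rest in 1
docs = ["good food", "good service", "good place", "bad food"]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given settings and return its sorted vocabulary."""
    cv = CountVectorizer(**kwargs).fit(docs)
    return sorted(cv.vocabulary_)

full_vocab = vocab()               # no filtering
frequent = vocab(min_df=2)         # keep words found in at least 2 documents
not_too_common = vocab(max_df=2)   # keep words found in at most 2 documents

print(full_vocab)
print(frequent)
print(not_too_common)
```

Raising `min_df` removes rare words, while lowering `max_df` removes very common ones, which is why both settings are worth tuning together.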

#### Using TF-IDF for feature extraction, build a classifier using Naive Bayes. Evaluate the performance on the test set.
Note that TF-IDF returns continuous values, so you should use Gaussian Naive Bayes for the model. However, TF-IDF returns a sparse matrix, whereas GaussianNB only accepts a dense matrix as input. There are two ways to work around this:
1. If you are not doing any tuning, you can call .toarray() on the TF-IDF output after fitting the vectorizer (see [here](https://codepunk.io/naive-bayesian-classification-of-text-categories-in-scikit-learn/)).
2. If you are using a pipeline, you can put a function transformer in between to do the conversion (see [here](https://stackoverflow.com/questions/28384680/scikit-learns-pipeline-a-sparse-matrix-was-passed-but-dense-data-is-required/28384887#28384887)).

I will let you decide which one to use. Note that this is not a problem for Multinomial NB, because Multinomial NB accepts a sparse matrix as input.
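The first option (no tuning, no pipeline) could look roughly like the sketch below. The training and test sentences, labels, and variable names are made up for illustration and are not the Yelp data. The key step is calling `.toarray()` on the sparse TF-IDF output before passing it to GaussianNB.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data (1 = positive, 0 = negative)
train_text = ["loved this place", "great food", "not tasty", "crust is not good"]
train_y = [1, 1, 0, 0]
test_text = ["great place"]

# Fit the vectorizer on the training text only, then reuse it for the test text
tfidf = TfidfVectorizer()
X_tr = tfidf.fit_transform(train_text)   # SciPy sparse matrix
X_te = tfidf.transform(test_text)

# GaussianNB needs dense input, so convert with .toarray()
clf = GaussianNB()
clf.fit(X_tr.toarray(), train_y)
pred = clf.predict(X_te.toarray())
print(pred)
```

This manual conversion is fine for a single fit, but inside a grid search the conversion has to be a pipeline step so that it is repeated for every fold.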

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import FunctionTransformer

In [9]:
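# Set up the pipeline: TF-IDF features, sparse-to-dense conversion, then Gaussian Naive Bayes.
# toarray() returns a NumPy ndarray; todense() returns np.matrix, which newer scikit-learn versions reject.
pipeline = Pipeline([('TF', TfidfVectorizer()),
                     ('FT', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
                     ('GB', GaussianNB())])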
parameters = {
    'TF__ngram_range':[(1, 1), (1, 2), (1, 3), (1, 4)], 
    'TF__max_df' : [5,10,100,200,500,1000],
    'TF__min_df' : [1,2,3], 
    'TF__lowercase': (True, False),
    'GB__var_smoothing': np.logspace(0,-9, 10)}
Model2 = GridSearchCV(pipeline, parameters, cv = 3, n_jobs = -2)
Model2.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('TF', TfidfVectorizer()),
                                       ('FT',
                                        FunctionTransformer(accept_sparse=True,
                                                            func=<function <lambda> at 0x000001FC64FA9438>)),
                                       ('GB', GaussianNB())]),
             n_jobs=-2,
             param_grid={'GB__var_smoothing': array([1.e+00, 1.e-01, 1.e-02, 1.e-03, 1.e-04, 1.e-05, 1.e-06, 1.e-07,
       1.e-08, 1.e-09]),
                         'TF__lowercase': (True, False),
                         'TF__max_df': [5, 10, 100, 200, 500, 1000],
                         'TF__min_df': [1, 2, 3],
                         'TF__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)]})

In [10]:
Model2.best_params_

{'GB__var_smoothing': 0.1,
 'TF__lowercase': True,
 'TF__max_df': 100,
 'TF__min_df': 2,
 'TF__ngram_range': (1, 2)}

In [11]:
np.mean(Model2.predict(X_test) == y_test)

0.755

**Observation:**

- The best model converts all characters to lowercase before tokenizing, uses a maximum document frequency of 100 and a minimum document frequency of 2, uses unigrams and bigrams, and has a variance smoothing parameter of 0.1. With these settings, the model reaches an accuracy of 75.5% on the test set.

- For this data set, Bag-of-Words with Multinomial Naive Bayes (79%) performs better than TF-IDF with Gaussian Naive Bayes (75.5%).

### Thank you