# Naive Bayes with scikit-learn

The goal of this notebook is to build a Naive Bayes model with the scikit-learn library to predict sentiment from product reviews. You will do the following:

 * Load product reviews.
 * Implement a Naive Bayes model using scikit-learn.

In [None]:
import pandas
import numpy as np
from sklearn.model_selection import train_test_split
import json

In [None]:
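def remove_punctuation(text):
    import string
    # str.translate expects a translation table; this one deletes every punctuation character
    return text.translate(str.maketrans('', '', string.punctuation))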

def get_numpy_data(dataframe, features, label):
    dataframe.loc[:, 'intercept'] = 1
    features = ['intercept'] + features
    feature_matrix = dataframe.loc[:, features].values
    label_array = dataframe.loc[:, label].values
    return (feature_matrix, label_array)

def get_product_reviews_data():
    products_df = pandas.read_csv('/content/drive/MyDrive/FUNIX Progress/MLP303x_1.1-A_EN/data/amazon_baby_subset.csv')

    with open('/content/drive/MyDrive/FUNIX Progress/MLP303x_1.1-A_EN/data/important_words.json', 'r') as f:
        important_words = json.loads(f.read())

    products_df = products_df.fillna({'review':''})  # fill in N/A's in the review column
    products_df.loc[:, 'review_clean'] = products_df['review'].apply(remove_punctuation)

    for word in important_words:
        products_df.loc[:, word] = products_df['review_clean'].apply(lambda s : s.split().count(word))

    sentiment_train_data = products_df.sample(frac=0.8, random_state=100)
    sentiment_validation_data = products_df.drop(sentiment_train_data.index)

    sentiment_X_train, sentiment_y_train = get_numpy_data(sentiment_train_data, important_words, 'sentiment')
    sentiment_X_valid, sentiment_y_valid = get_numpy_data(sentiment_validation_data, important_words, 'sentiment')

    print('*****Sentiment data shape*****')
    print('sentiment_X_train.shape: ', sentiment_X_train.shape)
    print('sentiment_y_train.shape: ', sentiment_y_train.shape)
    print('sentiment_X_valid.shape: ', sentiment_X_valid.shape)
    print('sentiment_y_valid.shape: ', sentiment_y_valid.shape)

    return (sentiment_X_train, sentiment_y_train), (sentiment_X_valid, sentiment_y_valid)
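
The word-count features built by the loader above can be illustrated on a tiny example. This is only a sketch: `strip_punctuation`, `toy_reviews`, and `toy_words` are hypothetical names standing in for the real helper, the review table, and `important_words`.

```python
import string
import pandas as pd

def strip_punctuation(text):
    # Hypothetical helper: delete every ASCII punctuation character.
    return text.translate(str.maketrans('', '', string.punctuation))

# Two made-up reviews and a made-up vocabulary (assumptions, not the real data).
toy_reviews = pd.DataFrame({'review': ['great product, great price!',
                                       'it broke after one day.']})
toy_words = ['great', 'broke']

toy_reviews['review_clean'] = toy_reviews['review'].apply(strip_punctuation)

# One column per word, holding how often that word occurs in each review.
for word in toy_words:
    toy_reviews[word] = toy_reviews['review_clean'].apply(
        lambda s, w=word: s.split().count(w))

print(toy_reviews[toy_words])
```

Note that `split().count(word)` is case-sensitive, so `Great` and `great` would be counted as different words.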

## Load product reviews dataset
As in the previous module, we load and preprocess the data, then convert and split it into training and validation sets. That is not the focus of this notebook, so you can simply run the following cells. You can check out the data-loading code inside the **utils** folder.

In [None]:
train_set, val_set = get_product_reviews_data()

sentiment_X_train, sentiment_y_train = train_set
sentiment_X_valid, sentiment_y_valid = val_set

*****Sentiment data shape*****
sentiment_X_train.shape:  (42458, 194)
sentiment_y_train.shape:  (42458,)
sentiment_X_valid.shape:  (10614, 194)
sentiment_y_valid.shape:  (10614,)


# Build Naive Bayes classifiers using scikit-learn
Now, let's use scikit-learn's built-in [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) learners. We will check the type of the data and use the appropriate Naive Bayes variant for it. Here we use **Multinomial Naive Bayes**, since the features of the product reviews dataset are word counts.
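
One simple way to check the type of the data is to test whether a feature matrix contains only non-negative whole numbers, which is what count features look like. The helper below is a minimal sketch; `looks_like_counts` and the two small arrays are hypothetical and not part of the notebook's utilities.

```python
import numpy as np

def looks_like_counts(X):
    # True when every entry is a non-negative whole number.
    X = np.asarray(X)
    return bool(np.all(X >= 0) and np.all(np.mod(X, 1) == 0))

count_features = np.array([[2, 0, 1], [0, 3, 0]])      # word-count style data
real_features = np.array([[0.5, -1.2], [1.7, 0.3]])    # continuous data

print(looks_like_counts(count_features))
print(looks_like_counts(real_features))
```

Count-like data is the usual case for `MultinomialNB`, while continuous features are the usual case for `GaussianNB`.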

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB

mnb_sentiment = MultinomialNB()

mnb_sentiment.fit(sentiment_X_train, sentiment_y_train)

print("***Sentiment result***")
print("Train accuracy: {}".format(mnb_sentiment.score(sentiment_X_train, sentiment_y_train)))
print("Validation accuracy: {}".format(mnb_sentiment.score(sentiment_X_valid, sentiment_y_valid)))

# P(x|y)
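
A fitted `MultinomialNB` stores its estimate of log P(x|y) in the `feature_log_prob_` attribute, with one row per class and one column per feature. The sketch below inspects it on a tiny made-up dataset; `toy_words`, `X_toy`, and `y_toy` are assumptions for illustration, not the review data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix; each column is one of the hypothetical words below.
toy_words = ['great', 'love', 'broke']
X_toy = np.array([[2, 1, 0],
                  [1, 2, 0],
                  [0, 0, 3],
                  [0, 1, 2]])
y_toy = np.array([1, 1, -1, -1])  # toy labels: +1 positive, -1 negative

toy_model = MultinomialNB().fit(X_toy, y_toy)

# feature_log_prob_ has shape (n_classes, n_features) and holds log P(word | class).
word_probs = np.exp(toy_model.feature_log_prob_)
for cls, row in zip(toy_model.classes_, word_probs):
    print(cls, dict(zip(toy_words, row.round(3))))
```

Each row sums to 1, because the multinomial model treats the words of a class as a single probability distribution.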

As you can see, the Naive Bayes classifier runs very fast and gives reasonable accuracy.
We also get slightly better results.
<br>
**Quiz**: What is the validation accuracy?
<br>
**Your answer**: 0.7629545882796307

# Gaussian Naive Bayes

In [None]:
gauss_sentiment = GaussianNB()

gauss_sentiment.fit(sentiment_X_train, sentiment_y_train)

print("***Sentiment result***")
print("Train accuracy: {}".format(gauss_sentiment.score(sentiment_X_train, sentiment_y_train)))
print("Validation accuracy: {}".format(gauss_sentiment.score(sentiment_X_valid, sentiment_y_valid)))

***Sentiment result***
Train accuracy: 0.699609025389797
Validation accuracy: 0.6940832862257396
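
`GaussianNB` models each feature as normally distributed within a class, whereas word counts are non-negative integers, which may be part of why it fits this dataset less well. To experiment with the two variants side by side, here is a minimal sketch on synthetic count data. Everything in it (the Poisson rates, the sizes, the `_syn` names) is an assumption, and it will not necessarily reproduce the accuracies reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n_samples, n_words = 2000, 20

# Synthetic labels and class-dependent word rates (made up for illustration).
y_syn = rng.integers(0, 2, size=n_samples)
rates = rng.uniform(0.05, 0.5, size=(2, n_words))
X_syn = rng.poisson(rates[y_syn])  # one row of counts per sample

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

# Fit both variants on the same split and record held-out accuracy.
scores = {}
for name, model in [('MultinomialNB', MultinomialNB()),
                    ('GaussianNB', GaussianNB())]:
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(name, scores[name])
```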
