# Explore here


In this case, we have only 3 variables: 2 predictors and a dichotomous label. Of the two predictors, we are only interested in the review text, since whether a comment is positive or negative depends on its content, not on the application it refers to. Therefore, the package_name variable should be removed.

When we work with text, as in this case, a conventional EDA adds little, because the only variable we are interested in is the one that contains the text. In other cases, where the text is part of a larger dataset with numeric predictors and the prediction objective is different, it does make sense to apply an EDA.

However, we cannot work with plain text; it must first be processed. This process consists of several steps:

Removing leading and trailing whitespace and converting the text to lowercase:

- df["column"] = df["column"].str.strip().str.lower()

Splitting the dataset into train and test sets: X_train, X_test, y_train, y_test.

Transforming the text into a word count matrix. This is a way to obtain numerical features from the text. For this, we fit the transformer on the training set only and then apply it to the test set:

- vec_model = CountVectorizer(stop_words = "english")
- X_train = vec_model.fit_transform(X_train).toarray()
- X_test = vec_model.transform(X_test).toarray()

Once these steps are done, the predictors are ready to train the model.
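
As a rough illustration of the vectorization step, the sketch below fits `CountVectorizer` on a tiny corpus and then transforms an unseen sentence. The sentences are invented for this example and are not taken from the dataset. Words that never appeared during fitting get no column and are ignored at transform time, which is why the transformer is fitted on the training set only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, invented for illustration only
train_docs = ["great app love it", "the app is crashing", "love the update"]
test_docs = ["terrible update the app is crashing"]

vec = CountVectorizer(stop_words="english")
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary
test_counts = vec.transform(test_docs)        # reuses that vocabulary

print(sorted(vec.vocabulary_))   # words kept as columns
print(train_counts.toarray())    # one row per training review
print(test_counts.toarray())     # "terrible" has no column, so it is dropped
```

Both matrices share the same columns, so a model trained on one can score the other.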

- **DATA FROM**: https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv

In [438]:

import os
import pandas as pd

# Load the reviews directly from the course repository
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv", sep = ",")

# Keep a raw copy of the data on disk
os.makedirs("../data/raw", exist_ok = True)
file_path = os.path.join("../data/raw", "playstore_reviews.csv")

df.to_csv(file_path, index=False)

# The application name does not determine the sentiment, so drop it
df.drop("package_name", axis = 1, inplace = True)

df.head()

Unnamed: 0,review,polarity
0,privacy at least put some option appear offli...,0
1,"messenger issues ever since the last update, ...",0
2,profile any time my wife or anybody has more ...,0
3,the new features suck for those of us who don...,0
4,forced reload on uploading pic on replying co...,0


- **package_name**: Name of the mobile application (categorical) (**REMOVED**)
- **review**: Comment about the mobile application (free text)
- **polarity**: Class variable (numeric): 0 for a negative comment, 1 for a positive one

In [439]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    891 non-null    object
 1   polarity  891 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.0+ KB


In [440]:
df.duplicated().sum()

np.int64(0)

In [441]:
df.nunique()

review      891
polarity      2
dtype: int64

In [442]:
df["review"] = df["review"].str.strip().str.lower()
df.head()

Unnamed: 0,review,polarity
0,privacy at least put some option appear offlin...,0
1,"messenger issues ever since the last update, i...",0
2,profile any time my wife or anybody has more t...,0
3,the new features suck for those of us who don'...,0
4,forced reload on uploading pic on replying com...,0


In [443]:
from sklearn.model_selection import train_test_split

X = df["review"]
y = df["polarity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train)

331    just did the latest update on viber and yet ag...
733    keeps crashing it only works well in extreme d...
382    the fail boat has arrived the 6.0 version is t...
704    superfast, just as i remember it ! opera mini ...
813    installed and immediately deleted this crap i ...
                             ...                        
106    why can't i share my achievements? recently di...
270    beta is the best version of the chrome browser...
860    great little game. this is a great little game...
435    keeps crashing ever since i started using it m...
102    even though i am loving the new update, but th...
Name: review, Length: 712, dtype: object
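
The split above is purely random. Since the classes are imbalanced (the test set ends up with 126 negative and 53 positive reviews), one optional variation is to pass `stratify=y`, so that both subsets keep roughly the same class ratio. The sketch below uses a small synthetic label vector as a stand-in for `df["polarity"]`; the names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labels: 70 negatives and 30 positives
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positive labels in each subset
print(y_tr.mean(), y_te.mean())
```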


In [444]:
from sklearn.feature_extraction.text import CountVectorizer

vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(712, 3310))
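
The dense array above is almost entirely zeros. Calling `.toarray()` is needed for `GaussianNB`, which generally requires dense input, but `MultinomialNB` also accepts the sparse matrix that `CountVectorizer` returns. A minimal sketch on invented toy reviews, suggesting that both representations lead to the same predictions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, invented for illustration only
docs = ["love this app", "great update love it",
        "app keeps crashing", "crashing after update bad"]
labels = [1, 1, 0, 0]

counts_sparse = CountVectorizer().fit_transform(docs)  # scipy sparse matrix
counts_dense = counts_sparse.toarray()                 # dense NumPy array

pred_sparse = MultinomialNB().fit(counts_sparse, labels).predict(counts_sparse)
pred_dense = MultinomialNB().fit(counts_dense, labels).predict(counts_dense)

print(np.array_equal(pred_sparse, pred_dense))
# Fraction of zero entries in the dense matrix
print((counts_dense == 0).mean())
```

Keeping the matrix sparse mainly saves memory, which matters more as the vocabulary grows.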

- **GAUSSIAN MODEL**

In [445]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

gaussian_model = GaussianNB()
gaussian_model.fit(X_train, y_train)

In [446]:
gaussian_y_pred = gaussian_model.predict(X_test)
gaussian_y_pred

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0])

In [447]:
gaussian_accuracy = accuracy_score(y_test, gaussian_y_pred)
print(f"Gaussian accuracy score: {gaussian_accuracy}")
print("Gaussian classification Report:\n", classification_report(y_test, gaussian_y_pred))

Gaussian accuracy score: 0.8044692737430168
Gaussian classification Report:
               precision    recall  f1-score   support

           0       0.85      0.88      0.86       126
           1       0.69      0.62      0.65        53

    accuracy                           0.80       179
   macro avg       0.77      0.75      0.76       179
weighted avg       0.80      0.80      0.80       179



- **MULTINOMIAL MODEL**

In [448]:
from sklearn.naive_bayes import MultinomialNB

multinomial_model = MultinomialNB()
multinomial_model.fit(X_train, y_train)

In [449]:
multinominal_y_pred = multinomial_model.predict(X_test)
multinominal_y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0])

In [450]:
multinominal_accuracy = accuracy_score(y_test, multinominal_y_pred)
print(f"Multinomial accuracy: {multinominal_accuracy}")
print("Multinomial classification Report:\n", classification_report(y_test, multinominal_y_pred))

Multinomial accuracy: 0.8156424581005587
Multinomial classification Report:
               precision    recall  f1-score   support

           0       0.84      0.90      0.87       126
           1       0.73      0.60      0.66        53

    accuracy                           0.82       179
   macro avg       0.79      0.75      0.77       179
weighted avg       0.81      0.82      0.81       179



- **BERNOULLI MODEL**

In [451]:
from sklearn.naive_bayes import BernoulliNB

bernoulli_model = BernoulliNB()
bernoulli_model.fit(X_train, y_train)

In [452]:
bernoulli_y_pred = bernoulli_model.predict(X_test)
bernoulli_y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0])

In [453]:
bernoulli_accuracy = accuracy_score(y_test, bernoulli_y_pred)
print(f"Bernoulli accuracy: {bernoulli_accuracy}")
print("Bernoulli classification Report:\n", classification_report(y_test, bernoulli_y_pred))

Bernoulli accuracy: 0.770949720670391
Bernoulli classification Report:
               precision    recall  f1-score   support

           0       0.79      0.93      0.85       126
           1       0.70      0.40      0.51        53

    accuracy                           0.77       179
   macro avg       0.74      0.66      0.68       179
weighted avg       0.76      0.77      0.75       179
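
One likely reason `BernoulliNB` behaves differently here is that it models only whether a word is present, not how often it appears. With its default `binarize=0.0`, any count above zero is treated as 1. The sketch below, on a made-up count matrix, illustrates that fitting on raw counts or on an explicitly binarized copy gives the same predictions.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up word-count matrix (rows: documents, columns: words)
counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 1, 2]])
labels = np.array([1, 1, 0, 0])

presence = (counts > 0).astype(int)  # explicit presence/absence encoding

pred_counts = BernoulliNB().fit(counts, labels).predict(counts)
pred_presence = BernoulliNB().fit(presence, labels).predict(presence)

print(np.array_equal(pred_counts, pred_presence))
```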



In [454]:
print(f"Gaussian accuracy score: {gaussian_accuracy}")
print(f"Gaussian classification Report:\n {classification_report(y_test, gaussian_y_pred)}")

print(f"Multinomial accuracy score: {multinominal_accuracy}")
print(f"Multinomial classification Report:\n {classification_report(y_test, multinominal_y_pred)}")

print(f"Bernoulli accuracy score: {bernoulli_accuracy}")
print(f"Bernoulli classification Report:\n {classification_report(y_test, bernoulli_y_pred)}")

Gaussian accuracy score: 0.8044692737430168
Gaussian classification Report:
               precision    recall  f1-score   support

           0       0.85      0.88      0.86       126
           1       0.69      0.62      0.65        53

    accuracy                           0.80       179
   macro avg       0.77      0.75      0.76       179
weighted avg       0.80      0.80      0.80       179

Multinomial accuracy score: 0.8156424581005587
Multinomial classification Report:
               precision    recall  f1-score   support

           0       0.84      0.90      0.87       126
           1       0.73      0.60      0.66        53

    accuracy                           0.82       179
   macro avg       0.79      0.75      0.77       179
weighted avg       0.81      0.82      0.81       179

Bernoulli accuracy score: 0.770949720670391
Bernoulli classification Report:
               precision    recall  f1-score   support

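           0       0.79      0.93      0.85       126
           1       0.70      0.40      0.51        53

    accuracy                           0.77       179
   macro avg       0.74      0.66      0.68       179
weighted avg       0.76      0.77      0.75       179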

**MultinomialNB** achieved the best accuracy (0.816), ahead of GaussianNB (0.804) and BernoulliNB (0.771).
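
A single train/test split can favor one model by chance, so the ranking is worth double-checking with cross-validation. The following is a hedged sketch on an invented toy corpus (in the notebook, the raw `X` and `y` would take its place). Each classifier is wrapped with the vectorizer in a pipeline, so the vocabulary is learned only from each training fold. The `to_dense` helper is an assumption added here because `GaussianNB` needs dense input.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Invented toy corpus; replace with the real reviews and polarity labels
docs = ["love this app", "great update", "works great love it", "really good app",
        "keeps crashing", "bad update", "crashing all the time bad", "terrible app"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

# Hypothetical helper: converts the sparse count matrix to a dense array
to_dense = FunctionTransformer(lambda m: m.toarray(), accept_sparse=True)

models = {
    "GaussianNB": make_pipeline(CountVectorizer(), to_dense, GaussianNB()),
    "MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "BernoulliNB": make_pipeline(CountVectorizer(), BernoulliNB()),
}

cv_scores = {}
for name, model in models.items():
    # 3-fold cross-validated accuracy for each model
    cv_scores[name] = cross_val_score(model, docs, labels, cv=3, scoring="accuracy")
    print(name, cv_scores[name].mean())
```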

In [455]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

random_forest_model = RandomForestClassifier(n_estimators=300, random_state=42)
random_forest_model.fit(X_train, y_train)

selector = SelectFromModel(random_forest_model, threshold="median", prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

optimized_multinomial_model = MultinomialNB()
optimized_multinomial_model.fit(X_train_selected, y_train)

In [456]:
optimized_multinomial_pred = optimized_multinomial_model.predict(X_test_selected)
optimized_multinomial_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0])

In [457]:
optimized_multinomial_accuracy = accuracy_score(y_test, optimized_multinomial_pred)
print(f'Optimized multinomial accuracy: {optimized_multinomial_accuracy}')
print(f"Optimized multinomial classification Report:\n {classification_report(y_test, optimized_multinomial_pred)}")

print(f"Multinomial accuracy score: {multinominal_accuracy}")
print(f"Multinomial classification Report:\n {classification_report(y_test, multinominal_y_pred)}")

Optimized multinomial accuracy: 0.8156424581005587
Optimized multinomial classification Report:
               precision    recall  f1-score   support

           0       0.85      0.90      0.87       126
           1       0.72      0.62      0.67        53

    accuracy                           0.82       179
   macro avg       0.78      0.76      0.77       179
weighted avg       0.81      0.82      0.81       179

Multinomial accuracy score: 0.8156424581005587
Multinomial classification Report:
               precision    recall  f1-score   support

           0       0.84      0.90      0.87       126
           1       0.73      0.60      0.66        53

    accuracy                           0.82       179
   macro avg       0.79      0.75      0.77       179
weighted avg       0.81      0.82      0.81       179



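The optimized model reaches exactly the same accuracy as the original one (0.816). The only differences are marginal and limited to class-level metrics: for the positive class, recall rises from 0.60 to 0.62 and the F1-score from 0.66 to 0.67, while precision drops from 0.73 to 0.72.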
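
To understand why the effect is so small, it helps to check how many features the selector actually keeps. With `threshold="median"`, `SelectFromModel` retains the features whose importance is at least the median importance. The sketch below uses synthetic data from `make_classification` (an assumption for illustration; the notebook would use `X_train` and `y_train`) and counts the retained columns with `get_support()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data, not the review matrix
X_syn, y_syn = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_syn, y_syn)
toy_selector = SelectFromModel(forest, threshold="median", prefit=True)

mask = toy_selector.get_support()            # boolean mask over the original columns
X_syn_selected = toy_selector.transform(X_syn)

print(f"kept {mask.sum()} of {mask.size} features")
```

In a sparse word-count matrix many importances can be zero. If the median itself turns out to be zero, every feature passes the threshold and the selection changes nothing, so a stricter threshold may be worth trying.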

In [458]:
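from pickle import dump

# Make sure the target folder exists and close the file after writing
os.makedirs("../models", exist_ok = True)
file_path = os.path.join("../models", "naive_bayes_optimized_model.sav")
with open(file_path, "wb") as model_file:
    dump(optimized_multinomial_model, model_file)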
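
Loading the model back mirrors the save step. Note that the saved classifier expects inputs that went through the same fitted `CountVectorizer` and feature selector, so in practice those objects would need to be persisted as well. The sketch below is a self-contained round trip with a toy model and a temporary file; both are stand-ins, not the notebook's objects.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy model standing in for optimized_multinomial_model
toy_X = np.array([[2, 0], [1, 0], [0, 3], [0, 1]])
toy_y = np.array([1, 1, 0, 0])
toy_model = MultinomialNB().fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.sav")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(toy_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.array_equal(toy_model.predict(toy_X), restored_model.predict(toy_X))
print(same)
```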