<h1 style="font-family: Comic Sans MS; text-align: center; color: #FFC300">Naive Bayes Project</h1>
<h3 id="step1" style="font-family: Comic Sans MS; color: #68FF33">1. Problem statement and data collection</h3>
<p style="color: pink">In this project we use a dataset of Google Play Store reviews to build a classifier that labels each review as positive or negative:</p>
<ul style="color: pink">
    <li>The dataset has only 3 variables: 2 predictors and a binary label:
        <ul>
            <li><em><b>package_name</b></em>: Name of the mobile application (categorical)</li>
            <li><em><b>review</b></em>: Comment about the mobile application (free text)</li>
            <li><em><b>polarity</b></em>: Class variable (numeric): 0 for a negative comment, 1 for a positive one.</li>
        </ul>
    </li>
    <li>Study the variables and their content.</li>
    <li>Build a Naive Bayes model and optimize it.</li>
</ul>

In [1]:
import pandas as pd 

# Read and load the data from URL
total_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv')

total_data.head()

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0


<p style="color: pink">Of the two predictors, only the review text is of interest: whether a comment is positive or negative depends on its content, not on the application it was written about. Therefore, the <em><b>package_name</b></em> variable should be removed.</p>
<h3 id="step2" style="font-family: Comic Sans MS; color: #68FF33">2. Exploration and data cleaning</h3>
<p style="color: pink">When working with text, as in this case, a conventional EDA adds little: the only variable of interest is the one that contains the text, so the process is different. In other cases, where the text is one of several predictors alongside numeric variables and the prediction objective is different, an EDA does make sense.
</p>

In [2]:
# Drop the column 'package_name'
total_data = total_data.drop(['package_name'], axis=1)
total_data.head()

Unnamed: 0,review,polarity
0,privacy at least put some option appear offli...,0
1,"messenger issues ever since the last update, ...",0
2,profile any time my wife or anybody has more ...,0
3,the new features suck for those of us who don...,0
4,forced reload on uploading pic on replying co...,0


In [3]:
total_data.shape

(891, 2)

<p style="color: pink">We cannot work with raw text directly; it must be processed first. This process consists of several steps, starting with the one below:</p>

In [4]:
# Strip leading and trailing whitespace and convert the text in column 'review' to lowercase
total_data['review'] = total_data['review'].str.strip().str.lower()

total_data.head()

Unnamed: 0,review,polarity
0,privacy at least put some option appear offlin...,0
1,"messenger issues ever since the last update, i...",0
2,profile any time my wife or anybody has more t...,0
3,the new features suck for those of us who don'...,0
4,forced reload on uploading pic on replying com...,0


<h3 id="step3" style="font-family: Comic Sans MS; color: #68FF33">3. Model training</h3>

In [5]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(total_data['review'], total_data['polarity'], test_size=0.2, random_state=42)

X_train.head()

331    just did the latest update on viber and yet ag...
733    keeps crashing it only works well in extreme d...
382    the fail boat has arrived the 6.0 version is t...
704    superfast, just as i remember it ! opera mini ...
813    installed and immediately deleted this crap i ...
Name: review, dtype: object

<div class="alert alert-block alert-warning">
<b>MultinomialNB:</b> We choose the Multinomial Naive Bayes model because it is designed for discrete count features, such as the word counts we extract from each review, which makes it a common choice for text classification. The target here is binary (0 or 1).
</div>
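
<p style="color: pink">To make the role of the counts more concrete, here is a minimal sketch of how Multinomial Naive Bayes estimates the probability of each word given a class. The count matrix, the labels and the variable names are invented for illustration and are not part of the project data:</p>

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents x 3 vocabulary words
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([0, 0, 1, 1])

alpha = 1.0  # additive (Laplace) smoothing, the scikit-learn default
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Manual estimate of P(word | class):
# (word count in class + alpha) / (total count in class + alpha * vocabulary size)
manual = []
for label in (0, 1):
    counts = X[y == label].sum(axis=0)
    manual.append((counts + alpha) / (counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

# scikit-learn stores the same quantities as logarithms
print(np.exp(clf.feature_log_prob_))
print(manual)

# A document containing only the third word is assigned to class 1
prediction = clf.predict(np.array([[0, 0, 1]]))
print(prediction)
```

<p style="color: pink">The smoothing term <b>alpha</b> keeps a word that never appears in a class from receiving a probability of exactly zero.</p>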

<p style="color: pink">Now we have to transform the text into a word count matrix, which is a way to obtain numerical features from text. The transformer is fitted on the training set only and then applied to the test set:</p>

In [6]:
# Create a pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
model.fit(X_train, y_train)

<div class="alert alert-block alert-info">
<b>Explanation:</b> 
    <ul>
        <li>We use <b>CountVectorizer</b> to convert the text data into numerical features. It transforms a collection of text documents into a matrix of token counts.</li>
        <li>The <b>MultinomialNB</b> is then used to train the Naive Bayes classifier on the vectorized text features.</li>
        <li>The <b>Pipeline</b> combines both the vectorization and classification steps into a single object for ease of use.</li>
    </ul>
</div>
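
<p style="color: pink">As a small illustration of the vectorization step, the sketch below applies <b>CountVectorizer</b> to a few made-up reviews (they are not taken from the dataset). It assumes the default settings, under which tokens shorter than two characters are dropped:</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews, invented for illustration
train_reviews = ["great app love it", "app keeps crashing", "love the new update"]
test_reviews = ["great update but crashing bug"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training text only
train_counts = vectorizer.fit_transform(train_reviews)
# transform reuses that vocabulary; unseen words ("but", "bug") are ignored
test_counts = vectorizer.transform(test_reviews)

# Columns follow the alphabetical order of the learned vocabulary
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_counts.toarray())
print(test_counts.toarray())
```

<p style="color: pink">Inside the pipeline the same thing happens automatically: <b>fit</b> learns the vocabulary from the training reviews and <b>predict</b> only transforms the new text.</p>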

In [7]:
# Predict the test set
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0], dtype=int64)

In [8]:
# Evaluate the model
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f" % (accuracy*100), "%")

Accuracy: 84.36 %


In [9]:
# Save the model (the context manager closes each file after writing)
from pickle import dump
with open('C:/Users/Jorge Payà/Desktop/4Geeks/DSML Bootcamp/Naive-Bayes-project/models/NBMultinomial_model_default.pkl', 'wb') as f:
    dump(model, f)
with open('C:/Users/Jorge Payà/Desktop/4Geeks/DSML Bootcamp/Naive-Bayes-project/models/NBMultinomial_model_default.sav', 'wb') as f:
    dump(model, f)
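
<p style="color: pink">A saved pipeline can later be read back with <b>pickle.load</b>. The sketch below shows the round trip on a toy pipeline; the training sentences, the file name <b>toy_model.pkl</b> and the temporary directory are assumptions made for the example, not the project files:</p>

```python
import os
import tempfile
from pickle import dump, load

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy pipeline trained on made-up reviews (a stand-in for the real model)
toy_model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
toy_model.fit(["love this app", "terrible app keeps crashing"], [1, 0])

# Hypothetical file name inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.pkl')
    with open(model_path, 'wb') as f:
        dump(toy_model, f)
    with open(model_path, 'rb') as f:
        restored_model = load(f)

# The restored pipeline predicts directly on raw text, like the original
sample = ["love it"]
original_pred = toy_model.predict(sample)
restored_pred = restored_model.predict(sample)
print(original_pred, restored_pred)
```

<p style="color: pink">Because the vectorizer is stored inside the pipeline, the restored object needs no separate preprocessing step. Pickle files should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.</p>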

<h3 id="step4" style="font-family: Comic Sans MS; color: #68FF33">4. Model optimization</h3>

In [10]:
# Optimize the previous model using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

hyperparameters = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vectorizer__max_df': np.linspace(0.5, 1.0, 6),
    'classifier__alpha': np.linspace(0.5, 1.5, 6),
    'vectorizer__stop_words': ['english', None]
}

random_search = RandomizedSearchCV(model, hyperparameters, n_iter=20, scoring="accuracy", cv=5, random_state=42)
random_search.fit(X_train, y_train)

<div class="alert alert-block alert-info">
<b>Explanation:</b> 
    <ul>
        <li>When using RandomizedSearchCV (or GridSearchCV) with a pipeline as the estimator, hyperparameters are named differently than when tuning a single estimator by hand: each one must be prefixed with the name of the pipeline step it belongs to.</li>
        <li>The format is the step name (vectorizer or classifier), followed by a double underscore (__) and then the parameter name, e.g., <b>'vectorizer__ngram_range'</b>. This tells the search which step of the pipeline each hyperparameter belongs to.</li>
    </ul>
</div>
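
<p style="color: pink">The naming rule can be checked directly on a pipeline. The following sketch rebuilds a pipeline with the same step names as above and lists the parameter names it accepts; the variable names <b>pipeline</b> and <b>param_names</b> are chosen only for this example:</p>

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
param_names = [name for name in pipeline.get_params() if '__' in name]
print(param_names)

# set_params uses the same naming, so one combination can be tried by hand
pipeline.set_params(vectorizer__ngram_range=(1, 2), classifier__alpha=0.7)
print(pipeline.named_steps['vectorizer'].ngram_range)
print(pipeline.named_steps['classifier'].alpha)
```

<p style="color: pink">Any key used in the hyperparameter dictionary of the search has to appear in this list; otherwise scikit-learn raises an error about an invalid parameter.</p>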

In [11]:
# Get the best hyperparameters
print(f"Best hyperparameters: {random_search.best_params_}")

Best hyperparameters: {'vectorizer__stop_words': None, 'vectorizer__ngram_range': (1, 1), 'vectorizer__max_df': 1.0, 'classifier__alpha': 0.7}


In [12]:
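# Retrieve the best pipeline found by the search
model = random_search.best_estimator_
print(model)
# best_estimator_ is already refit on X_train (refit=True is the default), so this call is redundant but harmless
model.fit(X_train, y_train)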

Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('classifier', MultinomialNB(alpha=0.7))])


In [13]:
# Predict the test set
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0], dtype=int64)

In [14]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f" % (accuracy*100), "%")

Accuracy: 83.24 %


In [15]:
with open('C:/Users/Jorge Payà/Desktop/4Geeks/DSML Bootcamp/Naive-Bayes-project/models/NBMultinomial_model_optimized.pkl', 'wb') as f:
    dump(model, f)
with open('C:/Users/Jorge Payà/Desktop/4Geeks/DSML Bootcamp/Naive-Bayes-project/models/NBMultinomial_model_optimized.sav', 'wb') as f:
    dump(model, f)

<div class="alert alert-block alert-success">
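<b>Conclusions:</b> The randomized search selected a configuration that differs from the defaults only in <b>alpha=0.7</b>, and it scored 83.24% accuracy on the test set, slightly below the 84.36% of the default model. The default settings therefore gave the best result here, although the gap amounts to only two reviews out of the 179 in the test set.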
</div>