# Naive Bayes

The Naive Bayes model is a simple probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent of each other given the class, hence the "naive" in its name. The model calculates the probability of a given input belonging to each possible class and chooses the class with the highest probability as the predicted output. It is commonly used for text classification and spam filtering. Although it often classifies well, it is considered a bad estimator: its predicted class probabilities tend to be poorly calibrated and should not be taken at face value.
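Written out, with $y$ denoting a class and $x_1, \dots, x_n$ the feature values of one input (symbols introduced here for illustration, not taken from the notebook), the prediction rule described above is roughly:

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The product over features is where the independence assumption enters: each feature contributes its own likelihood term $P(x_i \mid y)$. The evidence $P(x_1, \dots, x_n)$ is dropped because it is the same for every class and does not change which class scores highest.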

In [58]:
# import pandas and read the data
import pandas as pd
df = pd.read_csv('output/spam_email.csv')

In [59]:
# splitting the data into train and test samples
from sklearn.model_selection import train_test_split

df_predictors = df.drop('spam', axis = 1)
df_predicted = df['spam']

X_train, X_test, y_train, y_test = train_test_split(df_predictors,
                                                    df_predicted)

# Gaussian Naive Bayes

The Gaussian Naive Bayes model is a variant of the Naive Bayes algorithm that assumes that, within each class, every feature follows a Gaussian (normal) distribution. It estimates a mean and a variance for each feature in each class and uses the resulting Gaussian density as the likelihood of a feature value given the class. The model then applies Bayes' theorem to compute the posterior probability of each class and chooses the class with the highest probability as the predicted output.
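As a minimal sketch of this calculation, the block below computes the class scores by hand on a small synthetic dataset and compares the resulting predictions with scikit-learn's `GaussianNB`. The toy data and the names `X_toy`, `y_toy`, and `gaussian_nb_log_posterior` are hypothetical and are not part of the spam dataset or the original notebook. The hand-written version also omits the small variance smoothing that `GaussianNB` applies (`var_smoothing`), so agreement is expected to be very close but is not guaranteed to be exact.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (hypothetical, not the spam dataset): two features, two classes.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(2.0, 1.5, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)


def gaussian_nb_log_posterior(X_fit, y_fit, X_new):
    """Unnormalized log posterior of each class, computed by hand."""
    classes = np.unique(y_fit)
    log_post = np.empty((X_new.shape[0], classes.size))
    for k, c in enumerate(classes):
        Xc = X_fit[y_fit == c]
        # Per-feature mean and (maximum-likelihood) variance for this class.
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        # Class prior estimated from the class frequency.
        log_prior = np.log(Xc.shape[0] / X_fit.shape[0])
        # Sum of per-feature Gaussian log densities: the "naive" independence step.
        log_lik = -0.5 * (np.log(2 * np.pi * var)
                          + (X_new - mean) ** 2 / var).sum(axis=1)
        log_post[:, k] = log_prior + log_lik
    return classes, log_post


classes, log_post = gaussian_nb_log_posterior(X_toy, y_toy, X_toy)
manual_pred = classes[log_post.argmax(axis=1)]

# Reference predictions from scikit-learn on the same toy data.
sk_pred = GaussianNB().fit(X_toy, y_toy).predict(X_toy)
print("Agreement with GaussianNB:", (manual_pred == sk_pred).mean())
```

Working in log space avoids numerical underflow when many small per-feature likelihoods are multiplied together.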

In [60]:
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# predictions and accuracy on the test set (gnb is already fitted above)
gy_pred = gnb.predict(X_test)
gscore = gnb.score(X_test, y_test)

n_mislabeled = (y_test != gy_pred).sum()
print(f"Number of mislabeled points out of a total {X_test.shape[0]} points : {n_mislabeled} \n"
      f"Model score: {gscore:.2%}")

Number of mislabeled points out of a total 1151 points : 209 
Model score: 81.84%


### K-Fold Cross Validation

In [61]:
from sklearn.model_selection import cross_val_score

# we specify 20 folds, i.e., 20 train-test splits and 20 fitted models
scores = cross_val_score(gnb, X_train, y_train, 
                         cv = 20)

def display_scores(scores):
    print("Scores:", scores)
    print("\nMean:", scores.mean(), f"({scores.mean():.2%})")
    print("\nStandard deviation:", scores.std(), f"({scores.std():.2%})")

display_scores(scores)

Scores: [0.84393064 0.83236994 0.8150289  0.84971098 0.73988439 0.83815029
 0.80924855 0.83815029 0.82080925 0.79768786 0.8372093  0.83139535
 0.85465116 0.74418605 0.84302326 0.81976744 0.79651163 0.86046512
 0.8255814  0.75581395]

Mean: 0.8176787874714343 (81.77%)

Standard deviation: 0.03425023203159151 (3.43%)


### Repeated K-Fold Cross Validation

As an alternative to the approach used in the Logistic Regression section, a simpler way to apply repeated k-fold cross-validation is to pass a `RepeatedKFold` object as the `cv` argument of `cross_val_score`.
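To illustrate what such a splitter does, here is a small sketch on a toy array. The names `X_demo` and `rkf` are hypothetical and only used for counting the splits; the parameters mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy array of 40 samples (hypothetical), used only to inspect the splits.
X_demo = np.arange(40).reshape(-1, 1)

rkf = RepeatedKFold(n_splits=20, n_repeats=5, random_state=2)

# Total number of train-test splits: n_splits * n_repeats.
n_fits = rkf.get_n_splits(X_demo)
print(n_fits)

# Collect the distinct test-fold sizes over all splits.
test_sizes = {len(test_idx) for _, test_idx in rkf.split(X_demo)}
print(test_sizes)
```

With 20 splits and 5 repeats, `cross_val_score` fits the model 100 times and returns 100 scores, each repeat using a different random shuffling of the data.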

In [62]:
from sklearn.model_selection import cross_val_score, RepeatedKFold
cv = RepeatedKFold(n_splits=20, n_repeats=5, random_state=2)

scores = cross_val_score(gnb, X_train, y_train, cv=cv)

display_scores(scores)

Scores: [0.8150289  0.80346821 0.77456647 0.82080925 0.86127168 0.82080925
 0.78034682 0.79190751 0.78034682 0.79768786 0.80813953 0.80232558
 0.81976744 0.86046512 0.76744186 0.85465116 0.83139535 0.87209302
 0.81976744 0.8372093  0.78034682 0.8150289  0.8150289  0.83815029
 0.79768786 0.79190751 0.8150289  0.85549133 0.86127168 0.85549133
 0.83139535 0.79651163 0.8255814  0.81395349 0.83139535 0.77325581
 0.83139535 0.79069767 0.80813953 0.80813953 0.80924855 0.82080925
 0.80924855 0.85549133 0.78034682 0.82080925 0.8265896  0.80346821
 0.79190751 0.79768786 0.89534884 0.79069767 0.80232558 0.8255814
 0.8372093  0.81976744 0.79651163 0.83139535 0.83139535 0.81395349
 0.80924855 0.8150289  0.78034682 0.80346821 0.78612717 0.84393064
 0.76878613 0.79768786 0.84971098 0.78034682 0.81976744 0.83139535
 0.84883721 0.87209302 0.87209302 0.80813953 0.81395349 0.8255814
 0.80813953 0.81976744 0.82080925 0.80346821 0.79768786 0.79190751
 0.88439306 0.78034682 0.84393064 0.79768786 0.86705202 

# Multinomial Naive Bayes

In [63]:
from sklearn.naive_bayes import MultinomialNB

# Multinomial Naive Bayes model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# predictions and accuracy on the test set (mnb is already fitted above)
my_pred = mnb.predict(X_test)
mscore = mnb.score(X_test, y_test)

n_mislabeled = (y_test != my_pred).sum()
print(f"Number of mislabeled points out of a total {X_test.shape[0]} points : {n_mislabeled} \n"
      f"Model score: {mscore:.2%}")

Number of mislabeled points out of a total 1151 points : 244 
Model score: 78.80%


### K-Fold Cross Validation

In [64]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(mnb, X_train, y_train, 
                         cv = 20)

# display_scores is defined in the Gaussian Naive Bayes section above

display_scores(scores)

Scores: [0.82080925 0.83236994 0.76300578 0.77456647 0.80924855 0.84971098
 0.83236994 0.80346821 0.76300578 0.77456647 0.78488372 0.80232558
 0.77906977 0.79069767 0.80232558 0.78488372 0.76744186 0.79651163
 0.79651163 0.72093023]

Mean: 0.7924351391316037 (79.24%)

Standard deviation: 0.028559196621270523 (2.86%)


### Repeated K-Fold Cross Validation

In [65]:
from sklearn.model_selection import cross_val_score, RepeatedKFold
cv = RepeatedKFold(n_splits=20, n_repeats=5, random_state=2)

scores = cross_val_score(mnb, X_train, y_train, cv=cv)

display_scores(scores)

Scores: [0.76878613 0.80346821 0.82080925 0.8265896  0.78034682 0.78612717
 0.75144509 0.75722543 0.80924855 0.78034682 0.80232558 0.78488372
 0.8255814  0.79651163 0.76744186 0.78488372 0.79069767 0.76162791
 0.81976744 0.81976744 0.76878613 0.76878613 0.79768786 0.8265896
 0.80924855 0.80924855 0.73988439 0.82080925 0.87283237 0.84393064
 0.78488372 0.77906977 0.79651163 0.80813953 0.77325581 0.77325581
 0.75581395 0.77325581 0.83139535 0.72674419 0.76300578 0.82080925
 0.84393064 0.8150289  0.76878613 0.75722543 0.75722543 0.8265896
 0.79768786 0.80346821 0.80813953 0.76162791 0.72674419 0.80813953
 0.8372093  0.75581395 0.77906977 0.78488372 0.79069767 0.8255814
 0.83236994 0.78612717 0.75144509 0.8150289  0.78034682 0.79190751
 0.75722543 0.80346821 0.78034682 0.7283237  0.80232558 0.86046512
 0.75581395 0.81976744 0.84302326 0.78488372 0.79651163 0.79069767
 0.79069767 0.79651163 0.7283237  0.7283237  0.79190751 0.77456647
 0.83236994 0.78034682 0.8150289  0.8150289  0.79768786 0

# Bernoulli Naive Bayes

In [66]:
from sklearn.naive_bayes import BernoulliNB

# Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

# predictions and accuracy on the test set (bnb is already fitted above)
by_pred = bnb.predict(X_test)
bscore = bnb.score(X_test, y_test)

n_mislabeled = (y_test != by_pred).sum()
print(f"Number of mislabeled points out of a total {X_test.shape[0]} points : {n_mislabeled} \n"
      f"Model score: {bscore:.2%}")

Number of mislabeled points out of a total 1151 points : 128 
Model score: 88.88%


### K-Fold Cross Validation

In [67]:
from sklearn.model_selection import cross_val_score

# we specify 20 folds, i.e., 20 train-test splits and 20 fitted models
scores = cross_val_score(bnb, X_train, y_train, 
                         cv = 20)

# display_scores is defined in the Gaussian Naive Bayes section above

display_scores(scores)

Scores: [0.86127168 0.88439306 0.89595376 0.92485549 0.86127168 0.88439306
 0.86705202 0.89595376 0.86705202 0.86705202 0.91860465 0.89534884
 0.90116279 0.88953488 0.87790698 0.90697674 0.90116279 0.86046512
 0.85465116 0.86627907]

Mean: 0.8840670789084554 (88.41%)

Standard deviation: 0.020112679738551305 (2.01%)


### Repeated K-Fold Cross Validation

In [68]:
from sklearn.model_selection import cross_val_score, RepeatedKFold
cv = RepeatedKFold(n_splits=20, n_repeats=5, random_state=2)

scores = cross_val_score(bnb, X_train, y_train, cv=cv)

display_scores(scores)

Scores: [0.83815029 0.87283237 0.82080925 0.92485549 0.89595376 0.9017341
 0.9017341  0.87861272 0.9132948  0.84971098 0.87209302 0.84302326
 0.88372093 0.91860465 0.88953488 0.86046512 0.84302326 0.9244186
 0.93023256 0.90116279 0.91907514 0.86705202 0.9132948  0.86705202
 0.86705202 0.86705202 0.86705202 0.87861272 0.89017341 0.89595376
 0.88953488 0.88372093 0.89534884 0.81976744 0.90697674 0.87209302
 0.90116279 0.91860465 0.88372093 0.87790698 0.89595376 0.89017341
 0.9017341  0.9132948  0.89017341 0.89017341 0.82080925 0.86127168
 0.9132948  0.91907514 0.88372093 0.86627907 0.87790698 0.86046512
 0.9127907  0.84883721 0.93023256 0.88372093 0.84883721 0.87790698
 0.9017341  0.90751445 0.89017341 0.86127168 0.87283237 0.89595376
 0.88439306 0.85549133 0.88439306 0.87283237 0.90116279 0.88372093
 0.87790698 0.86046512 0.88953488 0.90116279 0.88953488 0.86627907
 0.87790698 0.9127907  0.89017341 0.9017341  0.88439306 0.87861272
 0.90751445 0.86705202 0.93641618 0.89017341 0.89595376 