# Frequentist Naive Bayes Classifier


In this notebook we study a frequentist approach to the Naive Bayes classifier. To do so, we define a class that, given an arbitrary dataset and a probability distribution, fits the Naive Bayes classifier for those inputs.
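For reference, the quantity that the `posteriors` method of the class below computes can be written as follows. The symbols are notation assumed here rather than taken from the notebook: $c$ is a class, $x_k$ the $k$-th of $K$ features, $\theta_{k,c}$ the per-class parameters returned by `param_func`, and $N_c / N$ the fraction of training samples in class $c$.

```latex
\log p(c \mid x) = \log \frac{N_c}{N} + \sum_{k=1}^{K} \log p(x_k \mid \theta_{k,c}) + \text{const},
\qquad
\hat{c} = \arg\max_{c} \left[ \log \frac{N_c}{N} + \sum_{k=1}^{K} \log p(x_k \mid \theta_{k,c}) \right]
```

The constant does not depend on $c$, so it can be ignored when picking the most probable class.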

In [3]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

## Class for Naive Bayes Classifier:

In [5]:
class NaiveBayes:
  # all probabilities stored inside the class are log-probabilities

  def __init__(self, *, param_func, prob_func, param_priors=None):
    # user-provided function to compute the distribution parameters
    # the function should return a list of parameter arrays (...)
    self.param_func = param_func

    # user-provided function to compute the probabilities from the parameters
    # the input are the parameter arrays and the X
    self.prob_func = prob_func

    # hyperparameters of the parameter priors used by the prob_func in case a
    # Bayesian approach is taken
    self.param_priors = [param_priors] if param_priors is not None else []

  def fit(self, X, y):
    # retrieve the number of samples and features from X
    self.n_samples, self.n_features = X.shape

    # collect the individual classes, and record their counts to compute priors
    self.classes, counts = np.unique(y, return_counts=True)

    # compute the parameters for every class - we transpose the array so that
    # the final dimensions are (parameters, classes, features)
    self.params = np.array(
      [self.param_func(X[c==y], *self.param_priors) for c in self.classes]
    ).transpose(1, 0, 2)

    # compute the priors from the counts
    self.class_probs = np.log(counts/self.n_samples)

    return self

  def posteriors(self, X):
    # reshape X to fit the array dimensions (samples, classes, features)
    X = np.reshape(X, (-1, 1, self.n_features))

    # compute the probabilities of the samples (...)
    probs = np.log(self.prob_func(X, *self.params[:,np.newaxis]))
    return probs.sum(axis=2) + self.class_probs[np.newaxis]

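  def predict_proba(self, X):
    # predict the probabilities for each class, normalising per sample;
    # the row maximum is subtracted before exponentiating to limit underflow
    log_post = self.posteriors(X)
    exp_post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    return exp_post / exp_post.sum(axis=1, keepdims=True)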

  def predict(self, X):
    # predict the class with the maximum probability
    return self.classes[np.argmax(self.posteriors(X), axis=1)]
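
To make the array shapes used in `posteriors` concrete, the following sketch reproduces the broadcasting on toy data. The sizes and the random inputs are made up for illustration; only the reshaping and the Bernoulli expression follow the notebook.

```python
import numpy as np

n_samples, n_classes, n_features = 4, 2, 3
rng = np.random.default_rng(0)

# toy inputs: X as passed to posteriors, params as stored by fit
X = rng.random((n_samples, n_features))
params = rng.random((1, n_classes, n_features))  # (parameters, classes, features)

# same reshaping as in posteriors: X -> (samples, 1, features)
X3 = np.reshape(X, (-1, 1, n_features))

# params[:, np.newaxis] has shape (parameters, 1, classes, features);
# unpacking it yields one (1, classes, features) array per parameter
(phi,) = params[:, np.newaxis]

# (samples, 1, features) broadcast against (1, classes, features)
per_feature = phi * X3 + (1 - phi) * (1 - X3)

# summing the logs over the feature axis leaves (samples, classes)
log_lik = np.log(per_feature).sum(axis=2)

print(X3.shape, phi.shape, per_feature.shape, log_lik.shape)
```

Note that the intermediate array holds one value per (sample, class, feature) triple, which is why memory use grows quickly with the vocabulary size.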

# Checking mail spam:

Given an e-mail message, determine whether it is spam or ham. This is a text classification task:

In [174]:
data = pd.read_csv("data/spam.csv", encoding= 'latin-1')
data = data[["class", "message"]]
data

Unnamed: 0,class,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [175]:
x = np.array(data["message"])
y = np.array(data["class"])

cv = CountVectorizer()  # count the number of occurrences of each word
x = cv.fit_transform(x)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42, shuffle = True)

Note that the word counts take values up to 18 occurrences. We can therefore work with Bernoulli, Multinomial and Gaussian Naive Bayes, although the Multinomial model is most likely the best suited to count data:

In [176]:
no_dense = xtrain.toarray()  # dense copy, used to list the distinct count values
np.unique(no_dense)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 12, 15, 18],
      dtype=int64)

### Using Bernoulli Naive Bayes:

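**The Bernoulli pmf below assumes inputs in {0, 1}, but the cell passes raw word counts (values up to 18). For counts above 1 the expression `phi*x + (1-phi)*(1-x)` is not a valid probability, so the data should first be converted to 0/1 inputs:**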

In [177]:
# initialize the class with the proper functions and fit it to the data
nb = NaiveBayes(
  param_func=lambda X: [X.mean(axis=0)],         # return probability of p(x_{k,c} = 1)
  prob_func=lambda x, phi: phi*x + (1-phi)*(1-x) # Bernoulli pmf
).fit(xtrain.toarray(), ytrain)


In [180]:
y_pred_test = nb.predict(xtest.toarray())
y_pred_train = nb.predict(xtrain.toarray())
print(f"The training accuracy is: {(y_pred_train == ytrain).mean()}")
print(f"The testing accuracy is: {(y_pred_test == ytest).mean()}")

The training accuracy is: 0.915189589409917
The testing accuracy is: 0.8860986547085202
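
A minimal sketch of Bernoulli Naive Bayes on binarized counts follows, since the Bernoulli pmf is only defined for 0/1 inputs. The toy count matrix, the helper name `bernoulli_params` and the smoothing constant `alpha` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

# toy count matrix (rows: documents, columns: words) and labels
counts = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [1, 0, 0],
                   [0, 1, 4]])
labels = np.array(["spam", "ham", "spam", "ham"])

# binarize: 1 if the word occurs at least once, 0 otherwise
X_bin = (counts > 0).astype(float)

def bernoulli_params(X, alpha=1.0):
    # smoothed estimate of p(x_k = 1 | c); alpha is an assumed smoothing
    # constant that keeps every probability strictly between 0 and 1
    return (X.sum(axis=0) + alpha) / (X.shape[0] + 2 * alpha)

classes = np.unique(labels)
phi = np.array([bernoulli_params(X_bin[labels == c]) for c in classes])
log_prior = np.log(np.array([(labels == c).mean() for c in classes]))

# Bernoulli log-likelihood summed over features, plus the log prior
log_post = X_bin @ np.log(phi).T + (1 - X_bin) @ np.log(1 - phi).T + log_prior
pred = classes[np.argmax(log_post, axis=1)]
print(pred)
```

With the notebook's class, the analogous change would be to pass `(xtrain.toarray() > 0).astype(int)` instead of the raw counts. This has not been run here, so the figures reported above are not updated.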


### Using Multinomial Naive Bayes:

**To do: define the multinomial probability distribution and run this model.**
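
One possible way to fill this gap is sketched below, under stated assumptions. The function names `multinomial_params` and `multinomial_prob` and the Laplace smoothing constant `alpha` are hypothetical choices, not the author's. The signatures follow the `param_func` / `prob_func` convention of the class (a list of parameter arrays, and a per-feature factor whose log is summed over features).

```python
import numpy as np

def multinomial_params(X, alpha=1.0):
    # smoothed word probabilities theta_k = (N_k + alpha) / (N + alpha * K),
    # where N_k is the total count of word k in this class
    word_totals = X.sum(axis=0)
    theta = (word_totals + alpha) / (word_totals.sum() + alpha * X.shape[1])
    return [theta]

def multinomial_prob(x, theta):
    # per-feature factor theta_k ** x_k; the multinomial coefficient is
    # dropped because it does not depend on the class
    return theta ** x

# toy count matrix (rows: documents, columns: words) and labels
counts = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [1, 0, 0],
                   [0, 1, 4]])
labels = np.array(["spam", "ham", "spam", "ham"])

classes = np.unique(labels)
theta = np.array([multinomial_params(counts[labels == c])[0] for c in classes])
log_prior = np.log(np.array([(labels == c).mean() for c in classes]))

# log-posterior up to a constant: sum_k x_k * log(theta_kc) + log prior
factors = multinomial_prob(counts[:, None, :], theta[None])
log_post = np.log(factors).sum(axis=2) + log_prior
pred = classes[np.argmax(log_post, axis=1)]
print(pred)
```

If this convention is right, the two functions could be passed to the class as `NaiveBayes(param_func=multinomial_params, prob_func=multinomial_prob)`. Without smoothing, a word never seen in a class would get probability zero and a log of minus infinity.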

### Using Gaussian Naive Bayes

In [182]:
def mean_var(X):
  return [X.mean(axis=0), X.var(axis=0)]
def gauss(x, mean, var):
  return np.exp(-(x-mean)**2/(2*var)) / np.sqrt(var*2*np.pi)

In [183]:
from sklearn.preprocessing import StandardScaler

# standardize features to zero mean and unit variance
# (note: the scaler is fitted on the full dataset here, not only on xtrain)
scaler = StandardScaler().fit(x.toarray())
xtrain = scaler.transform(xtrain.toarray())
xtest = scaler.transform(xtest.toarray())

Let's now fit the model on the scaled features

In [184]:
nb = NaiveBayes(param_func=mean_var, prob_func=gauss).fit(xtrain, ytrain)

And finally compute the accuracy on the training and test sets

In [186]:
y_pred_test = nb.predict(xtest)
y_pred_train = nb.predict(xtrain)
print(f"The training accuracy is: {(y_pred_train == ytrain).mean()}")
print(f"The testing accuracy is: {(y_pred_test == ytest).mean()}")

The training accuracy is: 0.8660533991474085
The testing accuracy is: 0.8654708520179372
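
A feature that is constant within a class has zero variance, in which case `gauss` divides by zero. The sketch below shows one common workaround: a small variance floor combined with computing the log-density directly. The names `mean_var_floor` and `gauss_logpdf` and the value of `eps` are assumptions for illustration.

```python
import numpy as np

def mean_var_floor(X, eps=1e-9):
    # per-feature mean and variance; eps is an assumed small floor that keeps
    # the variance strictly positive for features constant within a class
    return [X.mean(axis=0), X.var(axis=0) + eps]

def gauss_logpdf(x, mean, var):
    # log of the Gaussian density, computed directly instead of exp then log
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# toy class sample where the second feature is constant (variance 0)
X_c = np.array([[1.0, 0.0],
                [3.0, 0.0]])
mean, var = mean_var_floor(X_c)

# evaluate the log-density at the class mean
log_density = gauss_logpdf(np.array([2.0, 0.0]), mean, var)
print(np.isfinite(log_density).all())
```

Because `gauss_logpdf` already returns a log-density, it cannot be passed as `prob_func` unchanged: the class applies `np.log` to the output of `prob_func`, so it would need a small adaptation first.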


# Raisin dataset


The Raisin dataset comes from a study of machine vision systems for distinguishing between two varieties of raisins, Kecimen and Besni. A total of 900 raisin pieces were collected, and several measurements of each were recorded in the dataset.

In [188]:
df = pd.read_csv('data/Raisin_Dataset.csv')
x = np.array(df.drop("Class", axis=1))
y = np.array(df["Class"])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42, shuffle = True)

In [189]:
df

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
0,87524,442.246011,253.291155,0.819738,90546,0.758651,1184.040,Kecimen
1,75166,406.690687,243.032436,0.801805,78789,0.684130,1121.786,Kecimen
2,90856,442.267048,266.328318,0.798354,93717,0.637613,1208.575,Kecimen
3,45928,286.540559,208.760042,0.684989,47336,0.699599,844.162,Kecimen
4,79408,352.190770,290.827533,0.564011,81463,0.792772,1073.251,Kecimen
...,...,...,...,...,...,...,...,...
895,83248,430.077308,247.838695,0.817263,85839,0.668793,1129.072,Besni
896,87350,440.735698,259.293149,0.808629,90899,0.636476,1214.252,Besni
897,99657,431.706981,298.837323,0.721684,106264,0.741099,1292.828,Besni
898,93523,476.344094,254.176054,0.845739,97653,0.658798,1258.548,Besni


Since the data are continuous, of the models studied only Gaussian Naive Bayes applies. Let's look at the results:

In [190]:
def mean_var(X):
  return [X.mean(axis=0), X.var(axis=0)]
def gauss(x, mean, var):
  return np.exp(-(x-mean)**2/(2*var)) / np.sqrt(var*2*np.pi)

In [191]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(xtrain)  # standardize features to zero mean and unit variance
xtrain = scaler.transform(xtrain)
xtest = scaler.transform(xtest)

Let's now fit the model on the scaled features

In [192]:
nb = NaiveBayes(param_func=mean_var, prob_func=gauss).fit(xtrain, ytrain)

And finally compute the accuracy on the training and test sets

In [193]:
y_pred_test = nb.predict(xtest)
y_pred_train = nb.predict(xtrain)
print(f"The training accuracy is: {(y_pred_train == ytrain).mean()}")
print(f"The testing accuracy is: {(y_pred_test == ytest).mean()}")

The training accuracy is: 0.8375
The testing accuracy is: 0.8388888888888889


# IMDB Sentiment Analysis:

**A dataset this large cannot be handled without sparse matrices, so unless the class is adapted to work with them, this dataset can be left out.**


**The procedure is given below in case it turns out to be useful.**

Given a film review from IMDB, analyze its sentiment and classify the review as positive or negative:

In [164]:
df = pd.read_csv('data/IMDB_Dataset.csv')
#https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [165]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [166]:
x = np.array(df["review"])
y = np.array(df["sentiment"])
cv = CountVectorizer()  # count the number of occurrences of each word
x = cv.fit_transform(x)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42, shuffle = True)

KeyboardInterrupt: 

Note that, as in the Spam or Ham dataset, the features are word counts. We expect Multinomial Naive Bayes to perform best, but all three models presented can be tested.

### Using Bernoulli Naive Bayes:

**The Bernoulli pmf below assumes inputs in {0, 1}, but the cell passes raw word counts, so the data should first be converted to 0/1 inputs:**

In [None]:
# initialize the class with the proper functions and fit it to the data
nb = NaiveBayes(
  param_func=lambda X: [X.mean(axis=0)],         # return probability of p(x_{k,c} = 1)
  prob_func=lambda x, phi: phi*x + (1-phi)*(1-x) # Bernoulli pmf
).fit(xtrain.toarray(), ytrain)
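
The `toarray()` call above is where a dataset of this size runs out of memory. As a rough sketch of how the computation could stay sparse, the multinomial model only needs per-class column sums and one sparse-dense product. The function names and the smoothing constant `alpha` below are hypothetical, and the toy matrix stands in for the output of `CountVectorizer`.

```python
import numpy as np
from scipy import sparse

def fit_multinomial_sparse(X, y, alpha=1.0):
    # X: scipy sparse count matrix (samples, features); alpha: assumed smoothing
    classes, counts = np.unique(y, return_counts=True)
    log_prior = np.log(counts / X.shape[0])
    # per-class word totals, computed without densifying X
    totals = np.vstack(
        [np.asarray(X[y == c].sum(axis=0)).ravel() for c in classes]
    )
    denom = totals.sum(axis=1, keepdims=True) + alpha * X.shape[1]
    log_theta = np.log(totals + alpha) - np.log(denom)
    return classes, log_prior, log_theta

def predict_multinomial_sparse(X, classes, log_prior, log_theta):
    # the sparse-dense product gives sum_k x_k * log(theta_kc) per sample and class
    log_post = np.asarray(X @ log_theta.T) + log_prior
    return classes[np.argmax(log_post, axis=1)]

# toy sparse count matrix and labels
X = sparse.csr_matrix(np.array([[2, 0, 1],
                                [0, 3, 0],
                                [1, 0, 0],
                                [0, 1, 4]]))
y = np.array(["neg", "pos", "neg", "pos"])

model = fit_multinomial_sparse(X, y)
pred = predict_multinomial_sparse(X, *model)
print(pred)
```

Only the `(classes, features)` parameter array is dense here, so memory use no longer scales with the number of samples times the vocabulary size.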


# Iris Dataset

In [196]:
df = pd.read_csv('data/Iris.csv')

This dataset is similar to the Raisin dataset. It contains 50 samples of each of 3 Iris species: Iris setosa, Iris virginica and Iris versicolor. Four features were recorded for each plant, and we want to tell the species apart using only these four features.

In [197]:
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


As in the Raisin dataset, the data are continuous, so the natural choice is Gaussian Naive Bayes.

In [198]:
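# note: only Species is dropped, so the row identifier Id is kept as a feature;
# since the rows are ordered by species, dropping Id as well would be safer
x = np.array(df.drop("Species", axis=1))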
y = np.array(df["Species"])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42, shuffle = True)

In [199]:
def mean_var(X):
  return [X.mean(axis=0), X.var(axis=0)]
def gauss(x, mean, var):
  return np.exp(-(x-mean)**2/(2*var)) / np.sqrt(var*2*np.pi)

In [200]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(xtrain)  # standardize features to zero mean and unit variance
xtrain = scaler.transform(xtrain)
xtest = scaler.transform(xtest)

Let's now fit the model on the scaled features

In [201]:
nb = NaiveBayes(param_func=mean_var, prob_func=gauss).fit(xtrain, ytrain)

And finally compute the accuracy on the training and test sets

In [202]:
y_pred_test = nb.predict(xtest)
y_pred_train = nb.predict(xtrain)
print(f"The training accuracy is: {(y_pred_train == ytrain).mean()}")
print(f"The testing accuracy is: {(y_pred_test == ytest).mean()}")

The training accuracy is: 0.9916666666666667
The testing accuracy is: 1.0
