# 2CS-SIL2/SIQ2 Lab04. Naïve Bayes

<p style='text-align: right;font-style: italic;'>Designed by: Mr. Abdelkrime Aries</p>

In this lab, we will learn about Naive Bayes by testing 2 implementations:
- Multinomial Naïve Bayes
- Gaussian Naïve Bayes

**Team:**
- **Member 01**: AMOURA YOUSRA
- **Member 02**: OUADI AMINA TINHINENE
- **Group**: SIQ2

In [3]:
import sys, timeit
from typing          import Tuple, List, Type
from collections.abc import Callable

sys.version

'3.11.12 (main, Apr  9 2025, 08:55:54) [GCC 11.4.0]'

In [4]:
import numpy             as np
import pandas            as pd
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

np.__version__, pd.__version__, matplotlib.__version__

('2.0.2', '2.2.2', '3.10.0')

In [5]:
import sklearn

from sklearn.naive_bayes   import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics       import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection         import train_test_split
from sklearn.naive_bayes             import MultinomialNB, GaussianNB
from sklearn.linear_model            import LogisticRegression
from sklearn.tree                    import DecisionTreeClassifier
from sklearn.metrics                 import precision_score, recall_score


sklearn.__version__

'1.6.1'

## I. Algorithms implementation

In this section, we will try to implement multinomial Naive Bayes.


**>> Try to use "numpy" which will save a lot of time and effort**

In [6]:
# Dataset play

# outlook & temperature & humidity & windy
Xplay = np.array([
    ['sunny'   , 'hot' , 'high'  , 'no'],
    ['sunny'   , 'hot' , 'high'  , 'yes'],
    ['overcast', 'hot' , 'high'  , 'no'],
    ['rainy'   , 'mild', 'high'  , 'no'],
    ['rainy'   , 'cool', 'normal', 'no'],
    ['rainy'   , 'cool', 'normal', 'yes'],
    ['overcast', 'cool', 'normal', 'yes'],
    ['sunny'   , 'mild', 'high'  , 'no'],
    ['sunny'   , 'cool', 'normal', 'no'],
    ['rainy'   , 'mild', 'normal', 'no'],
    ['sunny'   , 'mild', 'normal', 'yes'],
    ['overcast', 'mild', 'high'  , 'yes'],
    ['overcast', 'hot' , 'normal', 'no'],
    ['rainy'   , 'mild', 'high'  , 'yes']
])

Yplay = np.array([
    'no',
    'no',
    'yes',
    'yes',
    'yes',
    'no',
    'yes',
    'no',
    'yes',
    'yes',
    'yes',
    'yes',
    'yes',
    'no'
])

len(Xplay), len(Yplay)

(14, 14)

In [7]:
# height & weight & footsize & person
Xperson = np.array([
    [182., 81.6, 30.],
    [180., 86.2, 28.],
    [170., 77.1, 30.],
    [180., 74.8, 25.],
    [152., 45.4, 15.],
    [168., 68.0, 20.],
    [165., 59.0, 18.],
    [175., 68.0, 23.]
])

Yperson = np.array([
    'male', 'male', 'male', 'male',
    'female', 'female', 'female', 'female'
])

len(Xperson), len(Yperson)

(8, 8)

### I.1. Prior statistics

Given an output list $Y[M]$, the probability of each class $c$ is estimated as:
$$p(c) = \frac{\#(Y = c)}{|Y|}$$

In here, we want to store the frequencies of different classes.
Our function must return two lists:
- One containing the names of unique classes.
- Another containing their frequencies.

In [8]:
# TODO: Prior statistics
def fit_prior(Y: np.ndarray[str]) -> Tuple[np.ndarray[str], np.ndarray[int]]:
    c, f = np.unique(Y, return_counts=True)
    return c, f
#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# ((array(['no', 'yes'], dtype='<U3'), array([5, 9])),
#  (array(['female', 'male'], dtype='<U6'), array([4, 4])))
#---------------------------------------------------------------------

fit_prior(Yplay), fit_prior(Yperson)

((array(['no', 'yes'], dtype='<U3'), array([5, 9])),
 (array(['female', 'male'], dtype='<U6'), array([4, 4])))
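The counts returned by `fit_prior` become prior probabilities once divided by $|Y|$. The following is a minimal sketch of that step, assuming the counts of the play dataset shown in the unit test above (they are hard-coded here so the snippet runs on its own):

```python
import numpy as np

# Class names and counts as returned by fit_prior(Yplay) in the unit test above
classes = np.array(['no', 'yes'])
counts  = np.array([5, 9])

# p(c) = #(Y = c) / |Y|
priors     = counts / counts.sum()
log_priors = np.log(priors)

for c, p, lp in zip(classes, priors, log_priors):
    print(f"p({c}) = {p:.4f}, log p({c}) = {lp:.4f}")
```

Storing the counts rather than the probabilities keeps the statistics easy to update; the division can be done at prediction time.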

### I.2. Multinomial Law

In this section, we will implement multinomial naive Bayes from scratch using Numpy.

#### I.2.1. Multinomial Likelihood statistics

Given:
- $A$: a categorical feature
- $Y$: the output
- $C$: the classes

The function takes as arguments $A, Y, C$ previously described.
It must return:
- $V$: unique values of this feature (feature's categories)
- $S[|C|, |V|]$: a matrix containing count $\#(Y = c \wedge A = v),\, \forall c \in C, \forall v \in V$

In [9]:
# TODO: Multinomial Likelihood statistics
def fit_multinomial_likelihood(A: np.ndarray[str],
                               Y: np.ndarray[str],
                               C: np.ndarray[str]
                               ) -> Tuple[np.ndarray[str], np.ndarray[int]]:
    V = np.unique(A)
    S = np.zeros((len(C), len(V)), dtype=int)

    # Populate the count matrix S
    for i, c in enumerate(C):
        for j, v in enumerate(V):
            S[i, j] = np.sum((Y == c) & (A == v))

    return V, S

#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# ((array(['overcast', 'rainy', 'sunny'], dtype='<U8'),
#   array([[0, 2, 3],
#          [4, 3, 2]])),
#  (array(['cool', 'hot', 'mild'], dtype='<U8'),
#   array([[1, 2, 2],
#          [3, 2, 4]])))
#---------------------------------------------------------------------
C_t = np.array(['no', 'yes'])
fit_multinomial_likelihood(Xplay[:, 0], Yplay, C_t), fit_multinomial_likelihood(Xplay[:, 1], Yplay, C_t)

((array(['overcast', 'rainy', 'sunny'], dtype='<U8'),
  array([[0, 2, 3],
         [4, 3, 2]])),
 (array(['cool', 'hot', 'mild'], dtype='<U8'),
  array([[1, 2, 2],
         [3, 2, 4]])))
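The double loop above can also be written as a single matrix product of two one-hot encodings. This is only a sketch of an alternative, not the expected solution; `count_matrix` is a hypothetical helper name, and the outlook column and labels are copied from the play dataset so the snippet is self-contained:

```python
import numpy as np

def count_matrix(A, Y, C):
    """Vectorized sketch of the count matrix S[|C|, |V|] (hypothetical helper)."""
    V = np.unique(A)
    class_onehot = (Y[None, :] == C[:, None]).astype(int)   # shape (|C|, M)
    value_onehot = (A[:, None] == V[None, :]).astype(int)   # shape (M, |V|)
    return V, class_onehot @ value_onehot                   # shape (|C|, |V|)

# Outlook column and labels of the play dataset
outlook = np.array(['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                    'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy'])
play    = np.array(['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
                    'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no'])

V, S = count_matrix(outlook, play, np.array(['no', 'yes']))
print(V)
print(S)
```

Each entry of the product counts the samples that are simultaneously of class $c$ and of value $v$, which is exactly $\#(Y = c \wedge A = v)$.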

#### I.2.2. Multinomial Likelihood training

**Nothing to code here, although you have to know how it functions for next use**

This function aims to generate parameters $\theta$.
In our case, parameters are different from those of *logistic regression*.
They are a dictionary (map) with two entries:
- "prior": a dictionary having "vocab" a list of values and "freq" a list of their respective frequencies.
- "likelihood": a list of dictionaries representing statistics of each feature (the same order of $X$ features)

In [10]:
def fit_multinomial_NB(X: 'np.ndarray[M, N](str)',
                       Y: 'np.ndarray[M](str)'
                       ) -> object:

    Theta   = {'prior': {}, 'likelihood': []}

    Theta['prior']['vocab'], Theta['prior']['freq'] = fit_prior(Y)

    for j in range(X.shape[1]):
        likelihood = {}
        likelihood['vocab'], likelihood['freq'] = fit_multinomial_likelihood(X[:, j], Y, Theta['prior']['vocab'])
        Theta['likelihood'].append(likelihood)

    return Theta


#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# {'prior': {'vocab': array(['no', 'yes'], dtype='<U3'), 'freq': array([5, 9])},
#  'likelihood': [{'vocab': array(['overcast', 'rainy', 'sunny'], dtype='<U8'),
#    'freq': array([[0, 2, 3],
#           [4, 3, 2]])},
#   {'vocab': array(['cool', 'hot', 'mild'], dtype='<U8'),
#    'freq': array([[1, 2, 2],
#           [3, 2, 4]])},
#   {'vocab': array(['high', 'normal'], dtype='<U8'),
#    'freq': array([[4, 1],
#           [3, 6]])},
#   {'vocab': array(['no', 'yes'], dtype='<U8'),
#    'freq': array([[2, 3],
#           [6, 3]])}]}
#---------------------------------------------------------------------
Theta_play = fit_multinomial_NB(Xplay, Yplay)

Theta_play

{'prior': {'vocab': array(['no', 'yes'], dtype='<U3'), 'freq': array([5, 9])},
 'likelihood': [{'vocab': array(['overcast', 'rainy', 'sunny'], dtype='<U8'),
   'freq': array([[0, 2, 3],
          [4, 3, 2]])},
  {'vocab': array(['cool', 'hot', 'mild'], dtype='<U8'),
   'freq': array([[1, 2, 2],
          [3, 2, 4]])},
  {'vocab': array(['high', 'normal'], dtype='<U8'),
   'freq': array([[4, 1],
          [3, 6]])},
  {'vocab': array(['no', 'yes'], dtype='<U8'),
   'freq': array([[2, 3],
          [6, 3]])}]}

#### I.2.3. Multinomial Likelihood prediction

Given:
- $A$: a categorical feature
- $V$: unique values of this feature (feature's categories)
- $Y$: the output
- $C$: the classes
- $\alpha$: smoothing factor

Log likelihood is calculated as:
$$ \log p(A=v|Y=c) = \log(\#(Y = c \wedge A = v) + \alpha) - \log(\#(Y = c) + \alpha \, |V|)$$
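As a worked instance of the formula above, the sketch below evaluates it for the "outlook" feature, using the counts from the unit tests of section I.2.1. The helper name `log_likelihood` is introduced only for this illustration:

```python
import numpy as np

# Counts for the "outlook" feature, taken from the unit tests above
# rows: classes ['no', 'yes']; columns: values ['overcast', 'rainy', 'sunny']
S = np.array([[0, 2, 3],
              [4, 3, 2]])
class_counts = np.array([5, 9])   # #(Y = c)
n_values = S.shape[1]             # |V|

def log_likelihood(value_counts, alpha):
    # log(#(Y = c, A = v) + alpha) - log(#(Y = c) + alpha * |V|)
    return np.log(value_counts + alpha) - np.log(class_counts + alpha * n_values)

seen   = log_likelihood(S[:, 1], alpha=0.0)       # 'rainy', no smoothing
unseen = log_likelihood(np.zeros(2), alpha=1.0)   # a value never seen in training
print(seen, unseen)
```

With $\alpha = 1$, an unseen value gets the count $0 + 1$ in every class, so its log likelihood stays finite.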


In [11]:
# You can use this function in the next implementation
# It takes a list of unique values V and a given value v
# It returns the position of v in V
# If v does not exist in V, it returns -1
def find_idx(V: np.ndarray, v: str) -> int:
    k = np.argwhere(V == v).flatten()
    if len(k):
        return k[0]
    return -1

V_t = np.array(['One', 'Two', 'Three'])
find_idx(V_t, 'Two'), find_idx(V_t, 'Four')

(np.int64(1), -1)

In [12]:
# TODO: Multinomial Likelihood prediction
def predict_multinomial_NB1(v: str,
                            j: int,
                            Theta: object,
                            alpha: float = 0.
                            ) -> np.ndarray[float]:
    feature_likelihood = Theta['likelihood'][j]
    V = feature_likelihood['vocab']
    S = feature_likelihood['freq']

    v_idx = find_idx(V, v)

    P = Theta['prior']['freq']

    Result = np.zeros(S.shape[0])

    for i in range(S.shape[0]):
        count = S[i, v_idx] if v_idx != -1 else 0
        count += alpha

        total = P[i] + alpha * len(V)

        Result[i] = np.log(count) - np.log(total)

    return Result

#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# (array([-0.91629073, -1.09861229]), array([-2.07944154, -2.48490665]))
#---------------------------------------------------------------------

X_t = np.array([
    ['rainy', 'cool', 'normal', 'yes'],
    ['snowy', 'cool', 'normal', 'yes'],
    ['sunny', 'hot' , 'normal', 'no']
])

predict_multinomial_NB1('rainy', 0, Theta_play, alpha=0.), \
    predict_multinomial_NB1('snowy', 0, Theta_play, alpha=1.)

(array([-0.91629073, -1.09861229]), array([-2.07944154, -2.48490665]))

### I.3. Normal (Gaussian) Law

In this section, we will implement gaussian naive Bayes from scratch using Numpy.

#### I.3.1. Gaussian Likelihood statistics

Given:
- $X[M, N]$: a matrix of numerical features
- $Y$: the output
- $C$: the classes

The function takes as arguments $X, Y, C$ previously described.
It must return $S[|C|, 2, N]$; a tensor having these dimensions:
- first dimension: each element represents one class's statistics
- second dimension: 1st element represents means; 2nd element represents variances
- third dimension: each element represents mean/variance of the respective feature
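The expected variances in this section's unit test (for example 92.67 for the height of the "female" class) correspond to the sample variance, which divides by $n - 1$. The short sketch below contrasts it with NumPy's default, which divides by $n$, using the four "female" heights from `Xperson`:

```python
import numpy as np

# Heights of the four 'female' samples in Xperson
heights = np.array([152., 168., 165., 175.])

sample_var     = np.var(heights, ddof=1)   # divides by n - 1
population_var = np.var(heights, ddof=0)   # divides by n (NumPy default)

print(heights.mean(), sample_var, population_var)
```

On such a small dataset the two estimates differ noticeably, so the choice of `ddof` changes the resulting log likelihoods.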

In [13]:
# TODO: Gaussian Likelihood statistics
def fit_gaussian_likelihood(X: np.ndarray[float],
                            Y: np.ndarray[str],
                            C: np.ndarray[str]
                            ) -> 'np.ndarray[C, 2, N](float)':
    Nb_C = len(C)
    Nb_F = X.shape[1]


    S = np.zeros((Nb_C, 2, Nb_F))

    for i, yi in enumerate(C):

        x = X[Y == yi]

        means = np.mean(x, axis=0)

        variances = np.var(x, axis=0, ddof=1)

        S[i, 0, :] = means
        S[i, 1, :] = variances

    return S
#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# array([[[165.        ,  60.1       ,  19.        ],
#         [ 92.66666667, 114.04      ,  11.33333333]],

#        [[178.        ,  79.925     ,  28.25      ],
#         [ 29.33333333,  25.47583333,   5.58333333]]])
#---------------------------------------------------------------------
C_t = np.array(['female', 'male'])
fit_gaussian_likelihood(Xperson, Yperson, C_t)

array([[[165.        ,  60.1       ,  19.        ],
        [ 92.66666667, 114.04      ,  11.33333333]],

       [[178.        ,  79.925     ,  28.25      ],
        [ 29.33333333,  25.47583333,   5.58333333]]])

#### I.3.2. Gaussian Likelihood training

**Nothing to code here, although you have to know how it functions for next use**

This function aims to generate parameters $\theta$.
In our case, parameters are different from those of *logistic regression*.
They are a dictionary (map) with two entries:
- "prior": a dictionary having "vocab" a list of values and "freq" a list of their respective frequencies.
- "likelihood": a tensor of shape $[|C|, 2, N]$ containing likelihood statistics

In [14]:
def fit_gaussian_NB(X: np.ndarray[str, str],
                    Y: np.ndarray[str]
                    ) -> object:

    Theta   = {'prior': {}, 'likelihood': []}

    Theta['prior']['vocab'], Theta['prior']['freq'] = fit_prior(Y)
    Theta['likelihood'] = fit_gaussian_likelihood(X, Y, Theta['prior']['vocab'])

    return Theta



#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# {'prior': {'vocab': array(['female', 'male'], dtype='<U6'),
#   'freq': array([4, 4])},
#  'likelihood': array([[[165.        ,  60.1       ,  19.        ],
#          [ 92.66666667, 114.04      ,  11.33333333]],

#         [[178.        ,  79.925     ,  28.25      ],
#          [ 29.33333333,  25.47583333,   5.58333333]]])}
#---------------------------------------------------------------------
Theta_person = fit_gaussian_NB(Xperson, Yperson)

Theta_person

{'prior': {'vocab': array(['female', 'male'], dtype='<U6'),
  'freq': array([4, 4])},
 'likelihood': array([[[165.        ,  60.1       ,  19.        ],
         [ 92.66666667, 114.04      ,  11.33333333]],
 
        [[178.        ,  79.925     ,  28.25      ],
         [ 29.33333333,  25.47583333,   5.58333333]]])}

#### I.3.3. Gaussian Likelihood prediction

Given:
- $A$: a numerical feature
- $\mu_{Ac}$: mean of values of feature $A$ having $c$ as class
- $\sigma_{Ac}^2$: variance of values of feature $A$ having $c$ as class
- $Y$: the output
- $C$: the classes

Log likelihood is calculated as:
$$ \log p(A=v|Y=c) = \frac{-(v-\mu_{Ac})^2}{2 \sigma_{Ac}^2} - \log(\sqrt{2\pi \sigma_{Ac}^2})$$
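The formula above can be evaluated for all classes at once with NumPy broadcasting. This is a minimal sketch, assuming the height statistics of the person dataset (mean and sample variance per class); `gaussian_log_pdf` is a name chosen for this illustration:

```python
import numpy as np

def gaussian_log_pdf(v, mean, var):
    # log N(v; mean, var) = -(v - mean)^2 / (2 var) - 0.5 * log(2 pi var)
    return -((v - mean) ** 2) / (2.0 * var) - 0.5 * np.log(2.0 * np.pi * var)

# Height statistics per class ['female', 'male'] of the person dataset
means     = np.array([165.0, 178.0])
variances = np.array([278.0 / 3.0, 88.0 / 3.0])   # about 92.67 and 29.33

log_p = gaussian_log_pdf(183.0, means, variances)
print(log_p, np.exp(log_p))
```

Note that $\log\sqrt{2\pi\sigma^2}$ is written here as $0.5 \log(2\pi\sigma^2)$, which is the same quantity without the square root.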

In [15]:
# TODO: Gaussian Likelihood prediction
def predict_gaussian_NB1(v: float,
                         j: int,
                         Theta: object,
                         alpha: float = 0. # this is just added for compatibility
                         ) -> np.ndarray[float]:

    C = Theta['prior']['vocab']

    M_V = Theta['likelihood']

    Log_likelihood = []

    for i, yi in enumerate(C):

        mean = M_V[i][0][j]
        variance = M_V[i][1][j]

        Log_p = - ((v - mean) ** 2) / (2 * variance) - np.log(np.sqrt(2 * np.pi * variance))

        Log_likelihood.append(Log_p)

    return np.array(Log_likelihood)

#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# (array([-4.93164438, -3.03443716]), array([0.00721463, 0.04810173]))
#---------------------------------------------------------------------

pp = predict_gaussian_NB1(183, 0, Theta_person)

pp, np.exp(pp)

(array([-4.93164438, -3.03443716]), array([0.00721463, 0.04810173]))

### I.4. Final prediction

Our goal is to calculate approximate log probabilities of all classes given a sample:
$$\log P(y=c_k | \overrightarrow{x} = \overrightarrow{f})  \approx \log P(y=c_k) + \sum\limits_{f_j \in \overrightarrow{f}} \log P(f_j = x_j|y=c_k)$$

This function takes:
- $X^{(i)}$ one sample with $N$ features
- $\theta$ parameters (either those of multinomial or gaussian)
- $pred_{fct}$ a function to predict one feature (either multinomial or gaussian)
- add_prior: if True, add the log prior probability
- $\alpha$ smoothing factor (passing it to gaussian function will do nothing)

It must return a vector of (unnormalized) log probabilities, one per class
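The returned values are log scores that are only proportional to the posterior. If actual probabilities are needed, they can be normalized in a numerically stable way. The sketch below is one possible approach (`normalize_log_scores` is a hypothetical helper, not part of the lab), applied to the sample `('sunny', 'hot', 'high', 'no')` with the prior and $\alpha = 0$, with the probabilities rebuilt from the play dataset counts:

```python
import numpy as np

def normalize_log_scores(log_scores):
    """Turn unnormalized log scores into probabilities that sum to 1."""
    shifted = log_scores - np.max(log_scores)   # subtract the max to avoid underflow in exp
    probs = np.exp(shifted)
    return probs / probs.sum()

# log prior + sum of log likelihoods for classes ['no', 'yes'],
# for the sample ('sunny', 'hot', 'high', windy='no') with alpha = 0
log_scores = np.array([
    np.log(5 / 14) + np.log([3 / 5, 2 / 5, 4 / 5, 2 / 5]).sum(),
    np.log(9 / 14) + np.log([2 / 9, 2 / 9, 3 / 9, 6 / 9]).sum(),
])

posterior = normalize_log_scores(log_scores)
print(log_scores, posterior)
```

The predicted class is unchanged by this normalization, since it preserves the ordering of the scores.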

In [16]:
# TODO: Final prediction
def predict_NB1(Xi    : 'np.ndarray[N]',
                Theta: object,
                pred_fct: Callable,
                add_prior: bool  = True,
                alpha: float = 1.0
                ) -> np.ndarray[float]:
    Nb_C = len(Theta['prior']['vocab'])

    log_p = np.zeros(Nb_C)

    # Log prior: log(#(Y = c) / |Y|) for every class at once
    if add_prior:
        freq = Theta['prior']['freq']
        log_p += np.log(freq / np.sum(freq))

    # pred_fct returns one log likelihood per class, so one call per feature is enough
    for j in range(len(Xi)):
        log_p += pred_fct(Xi[j], j, Theta, alpha)

    return log_p

#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# (array([-2.20940778, -3.86937505]),
#  array([-2.56655064, -4.51223219]),
#  array([-2.85774653, -4.23617476]),
#  array([-10.401093  , -22.03977023]))
#---------------------------------------------------------------------

X_t1 = np.array(['sunny', 'hot' , 'high', 'no'])
X_t2 = np.array([183., 59., 20.])

predict_NB1(X_t1, Theta_play, predict_multinomial_NB1, add_prior=True, alpha=0.0), \
predict_NB1(X_t1, Theta_play, predict_multinomial_NB1, add_prior=False, alpha=0.0), \
predict_NB1(X_t1, Theta_play, predict_multinomial_NB1, add_prior=False, alpha=1.0), \
predict_NB1(X_t2, Theta_person, predict_gaussian_NB1, add_prior=False),

(array([-3.59617006, -4.95406494]),
 array([-2.56655064, -4.51223219]),
 array([-2.85774653, -4.23617476]),
 array([-10.401093  , -22.03977023]))

### I.5. Final product

**>> Nothing to code here**


In [17]:
class NaiveBayes(object):

    def __init__(self, multinomial=True):
        if multinomial:
            self.train = fit_multinomial_NB
            self.pred = predict_multinomial_NB1
        else:
            self.train = fit_gaussian_NB
            self.pred = predict_gaussian_NB1

    def fit(self, X, Y):
        self.Theta = self.train(X, Y)

    def predict(self, X, add_prior=True, prob=False, alpha=0.):
        Y_pred = []
        for i in range(len(X)):
            Y_pred.append(predict_NB1(
                X[i,:], self.Theta, self.pred, add_prior=add_prior, alpha=alpha
                ))

        Y_pred = np.array(Y_pred)

        if prob:
            return Y_pred

        return np.choose(np.argmax(Y_pred, axis=1), self.Theta['prior']['vocab'])

#=====================================================================
# UNIT TEST
#=====================================================================
# Result:
# (array(['yes', 'yes', 'no'], dtype='<U3'),
#  array([[-3.82235951, -3.01795347],
#         [-4.9209718 , -4.40424783],
#         [-2.50060367, -3.59331761]]),
#  array(['female', 'male'], dtype='<U6'),
#  array([[ -9.901093  , -21.53977023],
#         [-14.08654248, -11.22449947]]))
#---------------------------------------------------------------------

multinomial_nb = NaiveBayes()
multinomial_nb.fit(Xplay, Yplay)

gaussian_nb = NaiveBayes(multinomial=False)
gaussian_nb.fit(Xperson, Yperson)

X_t1 = np.array([
    ['rainy', 'cool', 'normal', 'yes'],
    ['snowy', 'cool', 'normal', 'yes'],
    ['sunny', 'hot' , 'high', 'no']
])

X_t2 = np.array([
    [183., 59., 20.],
    [175., 65., 30.]
])


multinomial_nb.predict(X_t1, alpha=1.), \
    multinomial_nb.predict(X_t1, alpha=1., prob=True), \
    gaussian_nb.predict(X_t2), \
    gaussian_nb.predict(X_t2, prob=True)

(array(['yes', 'yes', 'no'], dtype='<U3'),
 array([[-5.20912179, -4.10264337],
        [-6.30773408, -5.48893773],
        [-3.88736595, -4.67800751]]),
 array(['female', 'male'], dtype='<U6'),
 array([[-11.09424018, -22.73291741],
        [-15.27968966, -12.41764665]]))

## II. Application and Analysis

In this section, we will test different concepts by running an experiment, formulating a hypothesis and trying to justify it.

### II.1. Prior probability

We want to test the effect of prior probability.
To do this, we trained two models:
1. With prior probability
1. Without prior probability (It considers a uniform distribution of classes)

To test whether the models have adapted well to the training dataset, we will test them on the same dataset and calculate the classification ratio.


In [18]:
nb_withPrior     = CategoricalNB(alpha=1.0, fit_prior=True )
nb_noPrior       = CategoricalNB(alpha=1.0, fit_prior=False)

enc         = OrdinalEncoder()
Xplay_tf    = enc.fit_transform(Xplay)
nb_withPrior.fit(Xplay_tf, Yplay)
nb_noPrior.fit(Xplay_tf, Yplay)

Ypred_withPrior = nb_withPrior.predict(Xplay_tf)
Ypred_noPrior = nb_noPrior.predict(Xplay_tf)


print( 'Considering prior probability'  )
print(classification_report(Yplay, Ypred_withPrior))

print( 'No prior probability'  )
print(classification_report(Yplay, Ypred_noPrior))

Considering prior probability
              precision    recall  f1-score   support

          no       1.00      0.80      0.89         5
         yes       0.90      1.00      0.95         9

    accuracy                           0.93        14
   macro avg       0.95      0.90      0.92        14
weighted avg       0.94      0.93      0.93        14

No prior probability
              precision    recall  f1-score   support

          no       0.67      0.80      0.73         5
         yes       0.88      0.78      0.82         9

    accuracy                           0.79        14
   macro avg       0.77      0.79      0.78        14
weighted avg       0.80      0.79      0.79        14



**TODO: Analyze the results**

1. What do you notice, indicating if prior probability is useful in this case?
1. How does this probability affect the outcome?
1. When are we sure that using this probability is unnecessary?

**Answer**

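1. In this case, considering prior probability clearly improves the model’s performance:
- Accuracy increases significantly (93% vs 79%)
- Precision for "no" rises from 0.67 to 1.00 and recall for "yes" from 0.78 to 1.00, while recall for "no" stays at 0.80 and precision for "yes" changes only slightly (0.88 to 0.90)
- F1-score increases for both classes (0.73 to 0.89 for "no", 0.82 to 0.95 for "yes"), reflecting a better balance between precision and recall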

This indicates that the model, when using the prior, correctly takes into account the imbalance between the "yes" and "no" classes, leading to more reliable and balanced predictions.

2. The incorporation of prior probability influences the outcome by assigning greater importance to classes that are more common in the dataset. Without considering prior probability, the model incorrectly assumes that all classes occur equally often. By including prior information, the model better matches the real class distribution, naturally favoring more frequent classes, and thus can influence the classification of new examples toward those more likely classes.
3. Using prior probability is unnecessary when:
- The dataset is perfectly balanced (each class has roughly the same number of examples)
- Or when the features are so strong and discriminative that they alone are enough to make accurate predictions, regardless of any class imbalance.

In such cases, adding prior information would not significantly impact the model’s performance.
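To make the effect concrete, the sketch below recomputes by hand the scores of one training sample, `('rainy', 'mild', 'high', windy='no')`, whose true label is "yes". It applies the smoothed log likelihood formula of section I.2.3 with $\alpha = 1$ to the counts of section I.2.2; it is a hand computation, not a call into scikit-learn:

```python
import numpy as np

classes      = np.array(['no', 'yes'])
class_counts = np.array([5, 9])

# Counts #(Y = c, A = v) for the sample ('rainy', 'mild', 'high', windy='no'),
# one column per feature, read from the statistics of section I.2.2
value_counts = np.array([[2, 2, 4, 2],    # class 'no'
                         [3, 4, 3, 6]])   # class 'yes'
n_values = np.array([3, 3, 2, 2])         # number of categories per feature
alpha = 1.0

log_lik = np.log(value_counts + alpha) - np.log(class_counts[:, None] + alpha * n_values)
scores_no_prior   = log_lik.sum(axis=1)
scores_with_prior = scores_no_prior + np.log(class_counts / class_counts.sum())

print('without prior:', classes[np.argmax(scores_no_prior)])
print('with prior   :', classes[np.argmax(scores_with_prior)])
```

For this sample the likelihoods alone slightly favor "no"; adding the log prior tips the decision toward the more frequent class "yes", which is the correct label.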

### II.2. Smoothing

We want to test the Lidstone smoothing's effect.
To do this, we trained three models:
1. alpha = 1 (Laplace smoothing)
1. alpha = 0.5
1. alpha = 0 (without smoothing)

In [19]:
NBC_10 = CategoricalNB(alpha = 1.0 )
NBC_05 = CategoricalNB(alpha = 0.5 )
NBC_00 = CategoricalNB(alpha = 0.0 )

NBC_10.fit( Xplay_tf,   Yplay )
NBC_05.fit( Xplay_tf,   Yplay )
NBC_00.fit( Xplay_tf,   Yplay )

Y_10   = NBC_10.predict(Xplay_tf)
Y_05   = NBC_05.predict(Xplay_tf)
Y_00   = NBC_00.predict(Xplay_tf)


print(                'Alpha = 1.0'                        )
print(classification_report(Yplay, Y_10, zero_division=0.0))

print(                'Alpha = 0.5'                        )
print(classification_report(Yplay, Y_05, zero_division=0.0))

print(                'Alpha = 0.0'                        )
print(classification_report(Yplay, Y_00, zero_division=0.0))


Alpha = 1.0
              precision    recall  f1-score   support

          no       1.00      0.80      0.89         5
         yes       0.90      1.00      0.95         9

    accuracy                           0.93        14
   macro avg       0.95      0.90      0.92        14
weighted avg       0.94      0.93      0.93        14

Alpha = 0.5
              precision    recall  f1-score   support

          no       1.00      0.80      0.89         5
         yes       0.90      1.00      0.95         9

    accuracy                           0.93        14
   macro avg       0.95      0.90      0.92        14
weighted avg       0.94      0.93      0.93        14

Alpha = 0.0
              precision    recall  f1-score   support

          no       1.00      0.80      0.89         5
         yes       0.90      1.00      0.95         9

    accuracy                           0.93        14
   macro avg       0.95      0.90      0.92        14
weighted avg       0.94      0.93     

  np.log(smoothed_cat_count) - np.log(smoothed_class_count.reshape(-1, 1))


**TODO: Analyze the results**

1. What do you notice, indicating if smoothing affects performance in this case?
1. Based on the previous answer, why?
1. Why do we get a "RuntimeWarning: divide by zero" error?
1. What is the benefit of smoothing (generally; not just for this case)?

**Answer**

1. In this case, smoothing (whether alpha = 1, 0.5, or 0) does not affect the model’s performance: the precision, recall, f1-score, and accuracy remain exactly the same (93% accuracy, identical scores for each class). This suggests that smoothing has no visible impact here because the models are evaluated on the same data they were trained on, so no unseen feature value occurs.
1. Smoothing becomes important when we risk encountering feature values that were never seen during training, or values never seen with a given class. Here every evaluated example only uses feature values that already appeared in the training set. The only zero count is ("overcast", "no"): with alpha = 0 it drives the score of "no" to minus infinity for overcast samples, but all four overcast samples are labeled "yes", so the prediction is still correct. Therefore, smoothing neither helps nor hurts in this experiment.
1. The "RuntimeWarning: divide by zero" happens when we use alpha = 0 (no smoothing) and try to take the logarithm of a zero probability. If a feature value was not seen for a particular class during training, the model estimates P(feature | class) = 0, and log(0) is undefined mathematically (it tends toward minus infinity), which causes the warning.
1. Smoothing ensures that the model never assigns zero probability to any event, even unseen ones. This makes the classifier more robust, especially when dealing with small datasets, rare feature values, or new situations in testing. Even if the training data misses some possibilities, smoothing allows the model to still handle them reasonably without collapsing predictions to zero probability.
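The zero-count case can be reproduced in isolation. The sketch below evaluates $\log p(\text{overcast} \mid \text{no})$ for the three values of alpha, using the counts of the play dataset; the warning is silenced on purpose so that the `-inf` value is visible:

```python
import numpy as np

count_overcast_no = 0   # 'overcast' never occurs with class 'no' in the play dataset
count_no = 5            # number of 'no' samples
n_values = 3            # |V| for the outlook feature

results = {}
for alpha in (0.0, 0.5, 1.0):
    # np.log(0) would normally emit "RuntimeWarning: divide by zero"
    with np.errstate(divide='ignore'):
        results[alpha] = np.log(count_overcast_no + alpha) - np.log(count_no + alpha * n_values)
    print(f"alpha = {alpha}: log p(overcast | no) = {results[alpha]}")
```

With alpha = 0 the log likelihood is minus infinity, which rules out the class "no" regardless of the other features; any positive alpha keeps it finite.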

### II.3. Naive Bayes performance



Naive Bayes is known to generate powerful models when it comes to classifying textual documents.
We want to test this proposition using spam detection over the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset).

Each message is represented using term frequency (TF), where a word is considered as a feature.
In this case, a message is represented by a vector of frequencies (how many times each word appeared in the message).
We want to compare these models:
1. Multinomial Naive Bayes (MNB)
1. Gaussian Naive Bayes (GNB)
1. Logistic Regression (LR)
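The term-frequency representation described above can be illustrated on a toy corpus. The sketch below is a simplified stand-in for what a count vectorizer does, assuming lowercase whitespace tokenization (real tokenizers apply more rules); the two messages are made up and do not come from the SMS dataset:

```python
from collections import Counter
import numpy as np

# Two made-up messages (not taken from the SMS dataset)
docs = ["free entry free prize", "ok see you later ok ok"]

# Simplified tokenization: lowercase + whitespace split
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})

# One row per message, one column per vocabulary word
tf = np.array([[Counter(toks)[w] for w in vocab] for toks in tokenized])

print(vocab)
print(tf)
```

Each row is the feature vector of one message; most entries are zero, which is typical of text data.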

In [20]:
# reading the dataset
messages = pd.read_csv('data/spam.csv', encoding='latin-1')
# renaming features: text and class
messages = messages.rename(columns={'v1': 'class', 'v2': 'text'})
# keeping only these two features
messages = messages.filter(['text', 'class'])

messages.head()

| | text | class |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |


In [22]:
models = [
    MultinomialNB(),
    GaussianNB(),
    LogisticRegression(solver='lbfgs'),
    #solver=sag is slower; so I chose the fastest
]

algos = [
    'Multinomial Naive Bayes (MNB)',
    'Gaussian Naive Bayes  (GNB)',
    'Logistic Regression (LR)',
]

perf = {
    'train_time': [],
    'test_time' : [],
    'recall'    : [],
    'precision' : []
}


msg_train, msg_test, Y_train, Y_test = train_test_split(messages['text'] ,
                                                        messages['class'],
                                                        test_size    = 0.2,
                                                        random_state = 0  )

count_vectorizer = CountVectorizer()
X_train          = count_vectorizer.fit_transform(msg_train).toarray()
X_test           = count_vectorizer.transform    (msg_test ).toarray()


for model in models:
    # ==================================
    # TRAIN
    # ==================================
    start_time = timeit.default_timer()
    model.fit(X_train, Y_train)
    perf['train_time'].append(timeit.default_timer() - start_time)

    # ==================================
    # TEST
    # ==================================
    start_time = timeit.default_timer()
    Y_pred     = model.predict(X_test)
    perf['test_time'].append(timeit.default_timer() - start_time)

    # ==================================
    # PERFORMANCE
    # ==================================
    # Here, we are interested in the "spam" class, which is our positive class
    perf['precision'].append(precision_score(Y_test, Y_pred, pos_label='spam'))
    perf['recall'   ].append(recall_score   (Y_test, Y_pred, pos_label='spam'))


pd.DataFrame({
    'Algorithm' : algos,
    'Train time': perf['train_time'],
    'Test time' : perf['test_time'],
    'Precision' : perf['precision'],
    'Recall'    : perf['recall']
})

| | Algorithm | Train time (s) | Test time (s) | Precision | Recall |
|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes (MNB) | 0.934189 | 0.041797 | 0.987179 | 0.927711 |
| 1 | Gaussian Naive Bayes (GNB) | 0.675234 | 0.142542 | 0.616667 | 0.891566 |
| 2 | Logistic Regression (LR) | 1.151713 | 0.033497 | 0.986111 | 0.855422 |


**TODO: Analyze the results**

1. What do you notice about training time? (order the algorithms)
1. Why did we get these results based on the algorithms? (discuss each algorithm with respect to training time)
1. What do you notice about the testing time? (order the algorithms)
1. Why did we get these results based on the algorithms? (discuss each algorithm with respect to testing time)
1. Why is the Gaussian model less efficient than the multinomial based on the nature of the two algorithms?
1. Why is the Gaussian model less efficient than the multinomial based on the nature of the problem/data?
1. How does Multinomial NB's implementation affect the training/test time? (store statistics vs. store probabilities)
1. Which one is more adequate for updating the model with new data? explain.

**Answer**

1. The order from fastest to slowest training time is Gaussian Naive Bayes (GNB), then Multinomial Naive Bayes (MNB), and finally Logistic Regression (LR). GNB trains the fastest (~ 0.67s), followed by MNB (~ 0.93s), and LR is the slowest (~1.15s).
1. Gaussian Naive Bayes is faster because it only needs to compute simple statistics (mean and variance) for each feature and class. Multinomial Naive Bayes, while simple, has to count and smooth word occurrences across classes, which takes a bit longer. Logistic Regression requires iterative optimization (here with the L-BFGS solver), which makes several passes over the data and is more computationally heavy, explaining why it trains the slowest.
1. The order from fastest to slowest test time is Logistic Regression (LR), then Multinomial Naive Bayes (MNB), and finally Gaussian Naive Bayes (GNB). LR is extremely fast to predict (~ 0.033s), followed by MNB (~ 0.041s), and GNB is the slowest (~0.142s).
1. Logistic Regression is fast at prediction because it simply computes a weighted sum and applies a sigmoid threshold. MNB is also fast: for each class it sums precomputed log-probabilities weighted by the word counts, which takes a bit more time here. GNB is the slowest because it needs to evaluate a Gaussian log density (involving squared differences and divisions) for each feature and class at prediction time, which is heavier.
1. Gaussian Naive Bayes must handle continuous data, requiring more complex mathematical operations like exponentials and divisions for each prediction, while Multinomial Naive Bayes only performs simple integer counting and multiplication, making it fundamentally lighter and faster for discrete/counted features.
1. The dataset consists of text data transformed into count vectors (integer frequencies), which naturally fits the multinomial model. Gaussian Naive Bayes assumes real-valued, continuous, normally distributed features — which is not true for text — so it struggles to model the data efficiently, making it less accurate and computationally heavier in this case.
1. Multinomial Naive Bayes stores simple count statistics during training (how many times each word appears per class) and converts them into log-probabilities. Because these are precomputed and stored efficiently, predictions only involve summing log-probabilities, making both training and testing extremely fast compared to methods that recompute complex functions during prediction.
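1. Multinomial Naive Bayes is well suited to updating with new data, especially when it stores count statistics: it only needs to increment word counts for each class and recompute probabilities, without retraining the whole model. If only probabilities were stored, the counts would have to be recovered or recomputed first. Gaussian Naive Bayes can also be updated incrementally, since means and variances can be maintained from running statistics (scikit-learn exposes `partial_fit` for both Naive Bayes models). Logistic Regression trained with a batch solver such as L-BFGS would require a new optimization over the data, making updates much slower.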
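The "store statistics" approach can be sketched as follows. All numbers are made up for illustration (a toy 3-word vocabulary with classes `['ham', 'spam']`), and `log_probs` is a hypothetical helper that recomputes smoothed log-probabilities from the stored counts:

```python
import numpy as np

# Stored statistics: word counts per class (rows: ['ham', 'spam']) over a toy
# 3-word vocabulary; all numbers here are made up for illustration
word_counts = np.array([[10, 2, 1],
                        [ 1, 6, 8]], dtype=float)

def log_probs(counts, alpha=1.0):
    # Smoothed log p(word | class), recomputed from the stored counts
    smoothed = counts + alpha
    return np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))

before = log_probs(word_counts)

# A new 'spam' message arrives as a term-frequency vector: just add its counts
new_message = np.array([0, 1, 2])
word_counts[1] += new_message

after = log_probs(word_counts)
print(before[1], after[1])
```

Only the row of the affected class changes, and no pass over the earlier training messages is needed.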

