<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/naive_bayes/main/figs/bayes_theorem.png" width="400"/>
</div>
</center>

<br>

# <center>Naive Bayes Classifiers <a class="tocSkip"></center>
## <center>Davi Moreira <a class="tocSkip"></center>
### <center>Quantitative Theory and Methods <a class="tocSkip"></center>
### <center>Emory University <a class="tocSkip"></center>
    
<!---Notebook Navigation: [Another Cell](#another_cell)
Slide Navigation: <a href="#/2/3">Link to xxx</a>--->
    



<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/naive_bayes/main/figs/naivebayesshort.png" width="400"/>
</div>
</center>

<br>

## <center> [http://tiny.cc/naive-bayes](http://tiny.cc/naive-bayes) <a class="tocSkip"></center>

<br>

# <center> Access the link or the QR code to follow the material <a class="tocSkip"></center>


# Today's Summary and Learning Objectives

1. **Statistical Foundations** of Naive Bayes Classifiers.
2. The beauty of Bayes' Theorem in action.
3. The Naive Bayes assumptions.
4. Naive Bayes Classifiers **Implementation with `Python`**.
5. Three versions of Naive Bayes Classifiers:
    - Gaussian Naive Bayes.
    - Multinomial Naive Bayes.
    - Bernoulli Naive Bayes.



# <center>What is a classification problem?</center>

## <center>What is a classification problem? <a class="tocSkip"></center>

 <br>
    
<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/naive_bayes/main/figs/boxes.gif" width="600"/>
</div>
</center>
    
<br>
    
<center>Classification involves categorizing data into predefined classes or groups based on their features.</center> 

# <center>Why do we use Machine Learning in Classification?</center>


## Advantages of Machine Learning for Classification

<br>

- **Efficiency at Scale**: Machine learning algorithms can quickly classify large volumes of data with high accuracy.
- **Pattern Recognition**: ML models excel at recognizing complex patterns in data that are not easily discernible by humans.
- **Adaptability**: ML classifiers can adapt to new, unseen data, making them ideal for dynamic environments.
- **Automation**: Automates the decision-making process in real-time applications (e.g. spam detection).
- **Continuous Improvement**: ML models can learn from new data over time, improving their accuracy and robustness.

> Machine learning has transformed the landscape of classification, providing tools that offer precision, speed, and flexibility, which are unparalleled by traditional human and statistical methods.

# <center>Where does the Naive Bayes' classifier come from?</center>

## [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem)

- Developed by the English statistician Thomas Bayes (1701–1761). It describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

<br>

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$

- $P(A|B)$ is the **posterior probability**: Probability of event A occurring given that B is true - updated probability after the evidence is considered.
- $P(A)$ is the **prior probability**: Initial probability of event A - the probability before the evidence is considered.
- $P(B|A)$ is the **likelihood**: Probability of observing event B given that A is true.
- $P(B)$ is the **marginal probability**: Total probability of the evidence, event B.
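
As a minimal numeric sketch of the formula, suppose $A$ is "has the disease" and $B$ is "tests positive". The function name `posterior` and all the probabilities below are made up for illustration; they do not come from any dataset used in this lecture.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) via Bayes' theorem for a binary event A."""
    # Marginal P(B) by the law of total probability
    marginal_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / marginal_b

# Hypothetical values: P(A) = 0.01, P(B|A) = 0.95, P(B|not A) = 0.05
p = posterior(0.01, 0.95, 0.05)
print(round(p, 3))
```

Even with a fairly accurate test, the posterior stays well below the likelihood because the prior $P(A)$ is small.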


## Deriving Bayes' Theorem

## From Conditional Probability to Bayes' Theorem <a class="tocSkip">

Given that the definition of Conditional Probability is:

$$ P(A|B) = \frac{P(A \cap B)}{P(B)} $$

And knowing:

$$ P(A \cap B) = P(A|B) \cdot P(B) $$
$$ P(B \cap A) = P(B|A) \cdot P(A) $$

Joint probability is symmetric, meaning:

$$ P(A \cap B) = P(B \cap A) $$

Thus, we can also express it as:

$$ P(A \cap B) = P(B|A) \cdot P(A) $$

We get:

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$

### <center>**The Bayes' Theorem!** <a class="tocSkip"> </center>
    


# <center>How do we use the Bayes' Theorem for Classification?</center>

<a id='another_cell'></a>


## Bayes' Theorem for Classification

<br>

Given a set of features $X = (x_1, x_2, ..., x_n)$, we want to predict the class $C_k$ out of $m$ possible classes.

The goal is to find:

$$ P(C_k|X) = \frac{P(X|C_k) \cdot P(C_k)}{P(X)} $$

- **$P(C_k|X)$** is the **posterior probability**: Probability of class $C_k$ given features $X$.
- **$P(C_k)$** is the **prior probability**: Probability of class $C_k$.
- **$P(X|C_k)$** is the **likelihood**: Likelihood of features $X$ given class $C_k$.
- **$P(X)$** is the **marginal probability**: Evidence, the total probability of observing features $X$.

### <font color='red'>Warning!</font> How do we deal with $X$ being multidimensional? <a class="tocSkip">

## The Complexity of High Dimensionality


<br>

- **Direct calculation** of $P(X|C_k)$ involves understanding complex relationships among all features.

<br>

- With **multidimensional feature vectors** $X = (x_1, x_2, ..., x_n)$, calculating the likelihood, $P(X|C_k)$, directly becomes impractical due to the **curse of dimensionality**.

<br>

- High-dimensional spaces increase the **data requirement exponentially**. With 10 binary features, there are $2^{10} = 1024$ possible combinations, for example.





# <center>The Naive Assumption</center>


## The Naive Assumption

The Naive Bayes assumption simplifies the problem by assuming **each feature $x_i$ is independent of every other feature, given the class $C_k$**.

By treating each feature as independent, **we only need to calculate the probability of each feature individually given the class**, rather than all possible combinations of features.

This leads to:

$$ P(X|C_k) = P(x_1, x_2, ..., x_n|C_k) = \prod_{i=1}^{n} P(x_i|C_k) $$

Thus, we can rewrite the Bayes' Theorem and our classifier becomes:

$$ P(C_k|X) = \frac{P(C_k) \prod_{i=1}^{n} P(x_i|C_k)}{P(X)} $$

- **Independence**: Each feature $x_i$ is independent of every other feature given the class $C_k$.
- This assumption significantly reduces computational complexity.
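
To get a feel for the reduction, the sketch below counts, for binary features, how many probabilities have to be estimated per class with and without the independence assumption. The two function names are hypothetical helpers introduced only for this illustration.

```python
def full_joint_parameters(n_features):
    """Free probabilities per class for the full joint of n binary features."""
    # 2^n combinations, minus one because the probabilities sum to 1
    return 2 ** n_features - 1

def naive_parameters(n_features):
    """Free probabilities per class when each feature is modelled separately."""
    # one P(x_i = 1 | C_k) per feature
    return n_features

for n in (10, 20, 30):
    print(n, full_joint_parameters(n), naive_parameters(n))
```

The first count grows exponentially with the number of features, while the second grows linearly.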

# <center>The Naive Bayes Classifier</center> 


## The Naive Bayes Classifier


**Classification Decision**: Since $P(X)=P(x_1, \dots, x_n)$ is constant across all classes, we focus on the classification rule that maximizes the numerator:

$$ 
P(C_k|X) = \frac{P(C_k) \prod_{i=1}^{n} P(x_i|C_k)}{P(X)} 
$$

<br>

$$
P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
$$
$$
\Downarrow
$$

$$ \hat{C} = \arg \max_{C_k} P(C_k) \prod_{i=1}^{n} P(x_i|C_k) $$

$ P(C_k) $ is then estimated by the relative frequency of class $ C_k $ in the training set.
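
The decision rule can be sketched in a few lines. This is a toy example under stated assumptions: the classes, the two binary features, and every probability are hypothetical, and the product is computed as a sum of logarithms, a common trick to avoid numerical underflow that does not change the arg max.

```python
import math

# Hypothetical priors P(C_k) and per-feature likelihoods P(x_i = 1 | C_k)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.2, "meeting": 0.7},
}

def predict(observed):
    """Return the class maximizing log P(C_k) + sum_i log P(x_i | C_k)."""
    scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for feature, present in observed.items():
            p = likelihoods[cls][feature]
            # use P(x_i = 1 | C_k) if the feature is present, else its complement
            score += math.log(p if present else 1 - p)
        scores[cls] = score
    return max(scores, key=scores.get), scores

label, scores = predict({"free": 1, "meeting": 0})
print(label)
```

Note that $P(X)$ never appears: it is the same for every class, so it cannot change which class wins.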




## The Naive Bayes Classifier in Statistical Learning

### Naive Bayes within Supervised Learning

Naive Bayes classifiers belong to the family of supervised learning models for classification, which sets them apart from unsupervised techniques that focus on discovering hidden patterns in unlabeled data.

### Naive Bayes as a Generative Model

- **Naive Bayes** is a fundamental example of a generative model.

- By learning the distribution of each class, Naive Bayes models the generation process of the data.

- This allows not only for classification but also for generating new data samples based on the learned distributions.
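
To make the generative view concrete, here is a minimal sketch that learns a prior and a Gaussian per class from a toy one-feature dataset and then draws new samples: first a class, then a feature value given that class. The data values and the names `training`, `params`, and `sample` are assumptions made for illustration.

```python
import random
import statistics

# Toy one-feature training set (hypothetical values): class -> observations
training = {0: [4.9, 5.1, 5.0, 4.8], 1: [6.9, 7.2, 7.0, 7.1]}

n_total = sum(len(values) for values in training.values())
params = {
    cls: {
        "prior": len(values) / n_total,        # P(C_k)
        "mean": statistics.fmean(values),      # class-conditional mean
        "std": statistics.pstdev(values),      # class-conditional std. deviation
    }
    for cls, values in training.items()
}

def sample(rng):
    """Draw (class, feature): sample the class first, then the feature."""
    classes = list(params)
    weights = [params[c]["prior"] for c in classes]
    cls = rng.choices(classes, weights=weights)[0]
    x = rng.gauss(params[cls]["mean"], params[cls]["std"])
    return cls, x

rng = random.Random(0)
print([sample(rng) for _ in range(3)])
```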



# <center>Naive Bayes Classifiers</center>

## Choosing the Right Naive Bayes Classifier

The best classifier aligns with the statistical properties of your data and performs best empirically.
    
## Selection Strategy <a class="tocSkip">

1. **Features Distribution**: How do we assume the features $x_i$ are distributed?
2. **Domain Knowledge**: Let insights from the domain guide your choice.
3. **Analyze Features**: Understand the distribution of your data (plot your data!).
4. **Preprocess**: Tailor preprocessing to fit the model's assumptions (e.g. log transformations).
5. **Model Comparison**: Apply different models and evaluate their performance.
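
As a small illustration of the preprocessing step, the sketch below compares the skewness of a right-skewed variable before and after a `log1p` transformation. The data and the helper `skewness` are hypothetical; the point is only that the transformed values are closer to symmetric, which suits a Gaussian assumption better.

```python
import math

def skewness(values):
    """Skewness as the mean cubed z-score (population moments)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

raw = [0, 1, 1, 2, 2, 3, 5, 8, 13, 40]   # hypothetical right-skewed counts
logged = [math.log1p(v) for v in raw]    # log(1 + x), defined at x = 0

print(round(skewness(raw), 2), round(skewness(logged), 2))
```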






## Gaussian Naive Bayes

- **Assumption**: The features $x_i$ are assumed to be normally distributed (Gaussian) for each class $C_k$.
- **Applicability**: Ideal for datasets where features are continuous.
- **`Python`**: `GaussianNB` in the `scikit-learn` package implements the Gaussian Naive Bayes algorithm for classification.


$$ P(x_i | C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^{2}}} \exp\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^{2}}\right) $$

- Where $\mu_{ik}$ and $\sigma_{ik}^{2}$ are the mean and variance of feature $x_i$ for class $C_k$. They are estimated using [maximum likelihood](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation#:~:text=In%20statistics%2C%20maximum%20likelihood%20estimation,observed%20data%20is%20most%20probable.).
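
The likelihood above can be computed by hand in a few lines. This is a sketch, not the `scikit-learn` implementation: the function names and the glucose readings are hypothetical, and the variance is the maximum-likelihood estimate (dividing by $n$).

```python
import math

def mle_mean_var(values):
    """Maximum-likelihood mean and variance (divides by n, not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_likelihood(x, mean, var):
    """Density of a normal distribution at x, used as P(x_i | C_k)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical glucose readings observed for one class
mean, var = mle_mean_var([100.0, 110.0, 120.0, 130.0])
print(mean, var, gaussian_likelihood(115.0, mean, var))
```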


## Multinomial Naive Bayes

- **Assumption**: The features $x_i$ represent the frequencies with which certain events have been generated by a multinomial distribution.
- **Applicability**: Suited for count data, such as the frequency of words in text documents.
- **`Python`**: `MultinomialNB` in the `scikit-learn` package implements the algorithm. The distribution is parametrized by vectors $ \theta_{C_k} = (\theta_{C_k1},\ldots,\theta_{C_kn}) $
for each class $ C_k $, where $ n $ is the number of features and $ \theta_{C_ki} $ is the likelihood, $ P(x_i \mid C_k) $, of feature $ i $ appearing in a sample belonging to class $ C_k $.

The parameters $ \theta_{C_k} $ are estimated by smoothed relative frequency counting:

$$
\hat{\theta}_{C_ki} = \frac{ N_{C_ki} + \alpha}{N_{C_k} + \alpha n}
$$

where $ N_{C_ki} = \sum_{x \in T} x_i $ is the number of times feature $ i $ appears across the samples of class $ C_k $ in the training set $ T $, and $ N_{C_k} = \sum_{i=1}^{n} N_{C_ki} $ is the total count of all features for class $ C_k $.

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero, because the probability estimate is directly proportional to the number of occurrences of a feature's value. 



The smoothing prior $ \alpha > 0 $ accounts for features not present in the learning samples (*pseudocount*) and prevents zero probabilities in further computations. Setting $ \alpha = 1 $ is called [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), while $ \alpha < 1 $ is called [Lidstone smoothing](https://en.wikipedia.org/wiki/Additive_smoothing).
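
A minimal sketch of the smoothed estimate, assuming hypothetical word counts for a single class (the helper name `smoothed_theta` is also an assumption):

```python
def smoothed_theta(counts, alpha=1.0):
    """Return (N_i + alpha) / (N + alpha * n) for each feature count N_i."""
    n_features = len(counts)
    total = sum(counts)
    return [(c + alpha) / (total + alpha * n_features) for c in counts]

# Hypothetical word counts for one class; the third word never occurs
counts = [3, 1, 0]
print(smoothed_theta(counts, alpha=0.0))  # unsmoothed: third estimate is zero
print(smoothed_theta(counts, alpha=1.0))  # Laplace smoothing: no zero estimate
```

With $\alpha = 0$ the unseen word gets probability zero, which would wipe out the whole product of likelihoods; with $\alpha = 1$ every estimate is positive and the estimates still sum to one.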

<br>

Slide Navigation: <a href="#/118/1">Link to Laplace Smoothing Example </a>


## Bernoulli Naive Bayes

- **Assumption**: The features $x_i$ are binary (Boolean) variables, each indicating the presence or absence of a feature.

- **Applicability**: Effective for datasets where features are binary, such as text classification where a word's presence or absence is a feature.

- **`Python`**: `BernoulliNB` in the `scikit-learn` package implements the algorithm.

The likelihood for Bernoulli Naive Bayes is

$$
P(x_i \mid C_k) = P(x_i = 1 \mid C_k) x_i + (1 - P(x_i = 1 \mid C_k)) (1 - x_i)
$$

It explicitly penalizes the non-occurrence of a feature $ i $.
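
The penalty for non-occurrence can be seen in a short sketch. The probabilities and the document below are hypothetical; the function simply evaluates the likelihood formula above for one feature.

```python
def bernoulli_likelihood(x, p):
    """P(x_i | C_k) for binary x_i, where p = P(x_i = 1 | C_k)."""
    return p * x + (1 - p) * (1 - x)

# Hypothetical P(x_i = 1 | C_k) for three words in one class
p_word = [0.9, 0.5, 0.1]
document = [0, 1, 0]  # the first word, typical for this class, is absent

likelihood = 1.0
for x, p in zip(document, p_word):
    likelihood *= bernoulli_likelihood(x, p)
print(likelihood)
```

The absent first word contributes a factor of $1 - 0.9$, so its absence lowers the likelihood of this class considerably.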

<br>

Slide Navigation: <a href="#/120/1">Link to BNB Non-occurrence Example </a>

# Measuring Performance

<br>

**Confusion Matrix**: it is a powerful tool as it provides insights beyond overall accuracy, allowing for a detailed analysis of the model's effectiveness.

<br>

|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |
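
The four cells of the table can be counted directly from the true and predicted labels. The sketch below uses hypothetical labels and a hypothetical helper name; in practice `confusion_matrix` from `scikit-learn` does the same job.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Hypothetical labels
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```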

<br>

Slide Navigation: <a href="#/119/1">Link to ROC/AUC </a>

## Accuracy


|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

<br>

$$\dfrac{\text{correct predictions}}{\text{total observations}} \ = \ \dfrac{tp + tn}{tp + tn + fp + fn}$$

- Overall effectiveness of the model.
- In the context of weather forecasting, accuracy would reflect how well a model predicts weather events correctly, such as correctly forecasting a day as rainy (true positive) or sunny (true negative), over the total number of forecasts made. 
- **High accuracy**: lots of correct predictions!

## Precision


|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

<br>
    
$$\dfrac{\text{true positives}}{\text{total predicted positive}} = \dfrac{tp}{tp + fp}$$

- Accuracy of positive predictions.
- In email spam detection, it would indicate the percentage of emails correctly identified as spam (true positives) out of all emails flagged as spam, aiming to reduce the number of legitimate emails incorrectly marked as spam (false positives).
- **High precision**: few false positives among the predicted positives.


## Recall (Sensitivity)


|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

<br>

$$\dfrac{\text{true positives}}{\text{total actual positive}} \ = \ \dfrac{tp}{tp + fn}$$


- Fraction of actual positives that are correctly identified.
- In criminal justice, it would assess how well a predictive policing model identifies all potential criminal activities (true positives) without missing any (thus minimizing false negatives).
- **High recall**: low false-negative rates.


## Specificity


|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

<br>

$$\frac{\text{true negatives}}{\text{total actual negative}} = \frac{tn}{tn + fp}$$

- It is the true negative rate: it measures a model's ability to correctly identify actual negatives.
- Crucial in fields where incorrectly identifying a negative case as positive could have serious implications (e.g., criminal justice).
- **High specificity**: the model is very effective at identifying true negatives.


## F1-Score


|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

<br>


$$ \text{F1} \ = \ 2 \times \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$

- Harmonic mean of Precision and Recall.
- In a medical diagnosis scenario, it would help in evaluating a test's effectiveness in correctly identifying patients with a disease (true positives) while limiting both missed cases (false negatives) and healthy individuals misclassified as diseased (false positives).
- **High F1 score**: a better balance between precision and recall.
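
The metrics from the previous slides can be computed together from the four confusion-matrix counts. This is a minimal sketch with hypothetical counts; it assumes no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, specificity, and F1 from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```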


# <center>Let's Practice!</center>

<br>
<br>
<br>

- Slide Navigation: <a href="#/30/0">Link to Gaussian Naive Bayes for Disease Prediction</a>

- Slide Navigation: <a href="#/66/0">Link to Multinomial Naive Bayes for Document Classification</a>

- Slide Navigation: <a href="#/88/1">Link to Bernoulli Naive Bayes for Image Recognition</a>

# Practice: Gaussian Naive Bayes

## Pima Indians Diabetes Database  <a class="tocSkip">

### Objective <a class="tocSkip">   
    
- The aim is to predict whether a patient has diabetes based on diagnostic measurements. 
    
### Context <a class="tocSkip">

- This dataset ([Kaggle Dataset Link](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data)) originates from the **National Institute of Diabetes and Digestive and Kidney Diseases**. 
- All patients are women at least 21 years old of **Pima Indian heritage**, selected under specific constraints for a study.

### Assumption <a class="tocSkip">
    
- The features $x_i$ can be approximated by a Gaussian distribution.

<br>
    
Slide Navigation: <a href="#/29/1">Link to Practice Menu </a>
    

Slide Navigation: <a href="#/113/1">Link to Audience Questions </a>



## Step 1: Import libraries and Dataset

In [None]:
# Importing libraries
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, cross_val_score


import warnings
warnings.filterwarnings('ignore')

# Importing dataset
df = pd.read_csv('https://raw.githubusercontent.com/davi-moreira/naive_bayes/main/data/diabetes.csv')



## Step 2: Descriptive Statistics

In [None]:
# Preview data
df.head()

In [None]:
# Dataset dimensions - (rows, columns)
df.shape

In [None]:
# Features data-type
df.info()

In [None]:
# Statistical summary
df.describe().T

In [None]:
# Count of null values
df.isnull().sum()

#### Observations: <a class="tocSkip">

1. There are a total of 768 records and 9 columns in the dataset (8 features plus the Outcome label).
2. Each feature is of either integer or float datatype.
3. Some features, such as Glucose, BloodPressure, Insulin, and BMI, contain zero values, which represent missing data.
4. There are zero NaN values in the dataset.
5. In the outcome column, 1 represents diabetes positive and 0 represents diabetes negative.

In [None]:
# Outcome countplot
sns.countplot(x = 'Outcome',data = df)

In [None]:
# Histogram of each feature

col = df.columns[:8]
plt.subplots(figsize = (20, 15))
length = len(col)

for i, j in itertools.zip_longest(col, range(length)):
    plt.subplot((length//2), 3, j + 1)
    plt.subplots_adjust(wspace = 0.1,hspace = 0.5)
    df[i].hist(bins = 20)
    plt.title(i)
plt.show()

In [None]:
# Scatter plot matrix 
scatter_matrix(df, figsize = (20, 20));

In [None]:
# Pairplot 
sns.pairplot(data = df, hue = 'Outcome')
plt.show()

In [None]:
# Heatmap
sns.heatmap(df.corr(), annot = True)
plt.show()

#### Observations: <a class="tocSkip">

1. The countplot tells us that the dataset is imbalanced, as the number of patients who don't have diabetes is larger than the number who do.

2. From the correlation heatmap, we can see that there is a high correlation between Outcome and Glucose, BMI, Age, and Insulin.

## Step 3: Data Preprocessing

In [None]:
df_new = df.copy()  # work on a copy so the raw data in df stays unchanged
# list(df_new)
# Preview data
df_new.head()

In [None]:
# checking zero values

# np.where(df_new['Glucose'] == 0)[0].shape
# np.where(df_new['BloodPressure'] == 0)[0].shape
np.where(df_new['SkinThickness'] == 0)[0].shape
# np.where(df_new['Insulin'] == 0)[0].shape
# np.where(df_new['BMI'] == 0)[0].shape
# np.where(df_new['DiabetesPedigreeFunction'] == 0)[0].shape
# np.where(df_new['Age'] == 0)[0].shape


In [None]:
# Replacing zero values with NaN (zeros in these columns encode missing data)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                   'BMI', 'DiabetesPedigreeFunction', 'Age']
# np.nan is the spelling supported by all NumPy versions (np.NaN was removed in NumPy 2.0)
df_new[zero_as_missing] = df_new[zero_as_missing].replace(0, np.nan)

In [None]:
# Count of NaN
df_new.isnull().sum()

In [None]:
# Removing features with too many NaNs
df_new = df_new.drop(['SkinThickness', 'Insulin'], axis = 1)

# Removing observations with NaNs in the remaining features
df_new = df_new.dropna(subset=['Glucose', 'BloodPressure', 'BMI'])



In [None]:
# Statistical summary
df_new.describe().T

In [None]:
# log transformation
df_new['LogPregnancies'] = np.log1p(df_new['Pregnancies'])
df_new['LogDiabetesPedigreeFunction'] = np.log1p(df_new['DiabetesPedigreeFunction'])
df_new['LogAge'] = np.log1p(df_new['Age'])

# Statistical summary
df_new.describe().T


In [None]:
# Histogram of each feature

col = df_new.columns[:10]
plt.subplots(figsize = (20, 15))
length = len(col)

for i, j in itertools.zip_longest(col, range(length)):
    plt.subplot((length//2), 3, j + 1)
    plt.subplots_adjust(wspace = 0.1,hspace = 0.5)
    df_new[i].hist(bins = 20)
    plt.title(i)
plt.show()

In [None]:
# Selecting features 
features = [
    'LogPregnancies',
    'Glucose',
    'BloodPressure',
    'BMI',
    'LogDiabetesPedigreeFunction',
    'LogAge'
]

# Splitting X and Y
df_train, df_test = train_test_split(df_new, test_size = 0.20, random_state = 42, stratify = df_new['Outcome'] )

X_train = df_train[features]
Y_train = df_train['Outcome']
X_test = df_test[features]
Y_test = df_test['Outcome']


In [None]:
# Checking dimensions
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

## Step 4: Data Modelling

In [None]:
# Naive Bayes Algorithm
nb = GaussianNB()
nb.fit(X_train, Y_train)

In [None]:
# Making predictions on test dataset

Y_pred_nb = nb.predict(X_test)


## Step 5: Model Evaluation

In [None]:
# Evaluating using accuracy_score metric

accuracy_nb = accuracy_score(Y_test, Y_pred_nb)

# Accuracy on test set
print("Naive Bayes ACC: " + str(accuracy_nb * 100))

In [None]:
# Confusion matrix
cm = confusion_matrix(Y_test, Y_pred_nb)
cm

In [None]:
# Heatmap of Confusion matrix
sns.heatmap(pd.DataFrame(cm), annot = True, fmt = 'd')  # annotate each cell with its count

In [None]:
# Classification report

print(classification_report(Y_test, Y_pred_nb))

In [None]:
# Cross Validation
X = df_new[features]
y = df_new['Outcome']
result = cross_val_score(nb, X, y, scoring = 'accuracy')
# Note: if failing to identify a sick patient (a false negative) is considered
# more costly than incorrectly diagnosing a healthy patient as sick
# (a false positive), recall is the more relevant metric: use scoring = 'recall'.

result.mean(), result.std()


# Practice: Multinomial Naive Bayes

### BBC Full Text Document Classification <a class="tocSkip">

### Objective <a class="tocSkip">
    
- The aim is to predict which topic a news article belongs to based on its content.
    
### Context <a class="tocSkip">

- The original dataset ([Kaggle Dataset](https://www.kaggle.com/datasets/dheemanthbhat/bbc-full-text-preprocessed)) 
consists of 2225 documents (as text files) from the BBC news website corresponding to news articles in five topical areas: 
    - business
    - entertainment 
    - politics
    - sport
    - tech

### Assumption <a class="tocSkip">
    
- The features $x_i$ represent the frequencies with which certain events have been generated by a multinomial distribution.    
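
To see what these frequency features look like, here is a minimal bag-of-words sketch. The vocabulary and the sentence are hypothetical, and the whitespace tokenizer is a simplification of what `CountVectorizer` does later in this practice.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in a document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["market", "goal", "film"]  # hypothetical vocabulary
x = bag_of_words("Goal after goal as the market watched", vocabulary)
print(x)  # [1, 2, 0]
```

Each document becomes a vector of counts $x = (x_1, \ldots, x_n)$, one entry per vocabulary word.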

<br>
    
Slide Navigation: <a href="#/29/1">Link to Practice Menu </a>

Slide Navigation: <a href="#/113/1">Link to Audience Questions </a>


## Step 1: Import Libraries and Dataset

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import make_pipeline
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
import string

import warnings
warnings.filterwarnings('ignore')

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize in recent NLTK releases
nltk.download('stopwords')



# load data
df = pd.read_csv('https://raw.githubusercontent.com/davi-moreira/naive_bayes/main/data/bbc_text_cls.csv')



In [None]:
# Preview data
df.head(10)

# len(df)

## Step 2: Descriptive Statistics

In [None]:
# let's check labels frequency

df['labels'].hist(figsize=(10, 5));

## Step 3: Data Preprocessing

In [None]:
# Build the stopword set and the stemmer once, outside the function
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Define the preprocessing function
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Keep alphabetic tokens only (drops punctuation and numbers) and lowercase them
    tokens = [token.lower() for token in tokens if token.isalpha()]
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    # Stem the words
    tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(tokens)

In [None]:
# Apply the preprocessing to each row

df['processed_text'] = df['text'].apply(preprocess_text)


In [None]:
df.head(10)

In [None]:
# Split the data into features and labels
features = df['processed_text']
labels = df['labels']

In [None]:
# Split the data into train and test sets

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, 
                                                                            test_size=0.2, 
                                                                            random_state=123)


In [None]:
# Create a text processing and classification pipeline

pipeline = make_pipeline(
    # Bag of Words
    # Convert the processed text into a matrix of token counts, 
    # which is then used as input to the MultinomialNB classifier
    CountVectorizer(),
    MultinomialNB()
)

## Step 4: Data Modeling

In [None]:
# Train the model

pipeline.fit(features_train, labels_train)

## Access the CountVectorizer step and get the feature names, which correspond to the number of features
#num_features = len(pipeline.named_steps['countvectorizer'].get_feature_names_out())

#print(f"The number of features is: {num_features}")

In [None]:
# Predict on the test set
labels_pred = pipeline.predict(features_test)

## Step 5: Model Evaluation

In [None]:
# Evaluate the model using train-test split
print("Train-test split evaluation:")
print(classification_report(labels_test, labels_pred))
print(f"Accuracy: {accuracy_score(labels_test, labels_pred)}")

In [None]:
# Evaluate the model using cross-validation

cross_val_accuracy = cross_val_score(pipeline, features_train, labels_train, cv=5, scoring='accuracy')

print("\nCross-validation evaluation:")
print(f"Cross-validated accuracy: {np.mean(cross_val_accuracy)}")


In [None]:
# Confusion Matrix Display
ConfusionMatrixDisplay.from_predictions(labels_test, labels_pred)


In [None]:
# Let's Check Some Misclassified Examples

# identifying misclassified examples
misclassified_idx = np.where(labels_pred != labels_test)[0]

# random select a misclassified example
i = np.random.choice(misclassified_idx)

print("True class:", labels_test.iloc[i])
print("Predicted class:", labels_pred[i])

# The specific element from 'features_test'
specific_element = features_test.iloc[i]

# Find the indices where the 'text' column in the DataFrame matches the specific element
matching_indices = df.index[df['processed_text'] == specific_element].tolist()

# Print text
list(df.iloc[matching_indices,0])


In [None]:

# Get the feature names and the log probability of features given a class
feature_names = pipeline.named_steps['countvectorizer'].get_feature_names_out()
feature_log_prob = pipeline.named_steps['multinomialnb'].feature_log_prob_

# Create a DataFrame to hold the top words for each category
top_words_per_category = pd.DataFrame()

for i, category in enumerate(pipeline.named_steps['multinomialnb'].classes_):
    # Get the indices of the top 10 features for this class
    top_indices = np.argsort(feature_log_prob[i])[-10:]
    # Get the associated words and probabilities
    top_features = feature_names[top_indices]
    top_probabilities = np.exp(feature_log_prob[i][top_indices])
    # Add to the DataFrame
    top_words_per_category[category] = top_features[::-1]  # Reverse to show the top word at first

# Transpose so that categories are rows and ranks ('Top 1', 'Top 2', ...) are columns
top_words_per_category = top_words_per_category.T
top_words_per_category.columns = [f'Top {i+1}' for i in range(top_words_per_category.shape[1])]

# Print the table with categories as columns and ranks as rows
print(top_words_per_category.transpose())


# Practice: Bernoulli Naive Bayes

### Objective  <a class="tocSkip">

- The aim is to predict the correct digit based on the hand-written image.

### Context <a class="tocSkip">

- The original [Kaggle Dataset](https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) contains the 60,000 training examples and labels in addition to 10,000 test examples and labels. Each row consists of 785 values: the first value is the label (a number from 0 to 9) and the remaining 784 values are the pixel values (a number from 0 to 255).
    
- For our purposes in the lecture, 33,000 random training examples were removed from the original dataset.
        
### Assumption <a class="tocSkip">
    
- The features $x_i$ are binary (Boolean) variables indicating the presence or absence of a feature.
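
For images, this assumption means turning grey-level pixels into 0/1 indicators. A minimal sketch follows; the threshold of 127 and the pixel values are assumptions chosen for illustration, not the values used later in this practice.

```python
def binarize(pixels, threshold=127):
    """Map grey-level pixels (0-255) to 0/1: 1 if strictly above threshold."""
    return [1 if value > threshold else 0 for value in pixels]

row = [0, 12, 130, 255, 90]  # hypothetical grey-level values
print(binarize(row))  # [0, 0, 1, 1, 0]
```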
    
    
<br>
    
Slide Navigation: <a href="#/29/1">Link to Practice Menu </a>
    

Slide Navigation: <a href="#/113/1">Link to Audience Questions </a>
    


## Step 1: Import libraries and Dataset

In [None]:
import os
import cv2
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from matplotlib import pyplot as plt
from typing import List


In [None]:
train_df: pd.DataFrame = pd.read_csv('https://raw.githubusercontent.com/davi-moreira/naive_bayes/main/data/mnist_train_subset.csv')
test_df: pd.DataFrame = pd.read_csv('https://raw.githubusercontent.com/davi-moreira/naive_bayes/main/data/mnist_test.csv')


## Step 2: Descriptive Statistics

In [None]:
train_df.head(5)


In [None]:
train_df.shape

In [None]:
test_df.head(5)

In [None]:
test_df.shape

In [None]:
# Sort the label counts by the label value, assuming they are categorical but not numerical
label_counts = train_df['label'].value_counts().sort_index()

# Create a bar plot for the label counts
plt.figure(figsize=(12, 6))
label_counts.plot(kind='bar')
plt.title('Distribution of Labels in the Dataset')
plt.xlabel('Labels')
plt.ylabel('Count')
plt.xticks(rotation=0)  # Keep x labels horizontal for readability
plt.show()


## Step 3: Data Preprocessing

In [None]:
# Create variables for the pixels and the labels we want to predict
X_train: np.ndarray = train_df.drop('label', axis=1).to_numpy()
y_train: np.ndarray = train_df['label'].to_numpy()
X_test: np.ndarray = test_df.drop('label', axis=1).to_numpy()
y_test: np.ndarray = test_df['label'].to_numpy()

    

In [None]:
#print(X_train.ndim)
#print(X_train.shape)
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
# Plot a sample of each digit as the original image
# create sub plot for each digit
fig, axes = plt.subplots(2,5, figsize=(7,4))
# loop over each subplot to add its digit
for i, ax in enumerate(axes.flatten()):
    # find index of the first image with the corresponding digit
    img_idx: int = int(np.flatnonzero(y_train == i)[0])
    # get the image and reshape to 28X28
    img: np.ndarray = np.reshape(X_train[img_idx], (28, 28))
    # plot digit image
    ax.imshow(img, cmap="gray_r")
    # add digit label
    ax.set_title(f"Label: {i}")
    # remove gridlines
    ax.grid(False)
# add title to the plot
fig.suptitle("MNIST Images Sample And Their Labels")
# adjust the padding between and around subplots
fig.tight_layout()
# show plot
plt.show()


In [None]:
# Binarize each image: after thresholding, every pixel is either 0 (background) or 255 (ink)
# create empty list for our binary vector
X_train_binary: List = []
# loop over each vector
for i in X_train:
    # reshape digit vector to image
    img: np.ndarray = np.reshape(i, (28, 28)).astype(np.uint8)
    # binarize; with THRESH_OTSU the threshold is chosen automatically
    # and the value 120 passed below is ignored
    im_gray: np.ndarray = cv2.threshold(
        img, 120, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    # append to binary vector list
    X_train_binary.append(np.reshape(im_gray, (784,)))
# convert to numpy array
X_train_binary: np.ndarray = np.asarray(X_train_binary)

## Step 4: Data Modeling

In [None]:
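# BernoulliNB binarizes its inputs internally (binarize=0.0 by default),
# so every pixel value above 0 is treated as 1 when fitting on X_train
model = BernoulliNB()
model.fit(X_train, y_train)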

## Step 5: Model Evaluation

In [None]:
# Model Evaluation
print("train acc:", model.score(X_train, y_train))
print("test acc:", model.score(X_test, y_test))

In [None]:
# Predict the labels for the test set
y_test_pred = model.predict(X_test)

# Print the classification report for the test set
print("Classification report for the test set:")
print(classification_report(y_test, y_test_pred))

In [None]:
# Generate the confusion matrix for the test set
cm = confusion_matrix(y_test, y_test_pred)

# Display the confusion matrix using ConfusionMatrixDisplay
cmd = ConfusionMatrixDisplay(cm, display_labels=model.classes_)
cmd.plot(values_format='d')
plt.title('Confusion Matrix for the Test Set')
plt.show()

In [None]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation with 5 folds
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

# Print the cross-validation scores for each fold
print("Cross-validation scores for each fold:", cv_scores)

In [None]:
# Calculate the mean and standard deviation of the cross-validation scores
mean_cv_score = cv_scores.mean()
std_cv_score = cv_scores.std()

# Print the mean and standard deviation
print(f"Mean cross-validation accuracy: {mean_cv_score:.3f}")
print(f"Standard deviation of cross-validation accuracy: {std_cv_score:.3f}")

In [None]:
# Show some misclassified examples

#np.random.seed(1)

misclassified_idx = np.where(y_test_pred != y_test)[0]
i = np.random.choice(misclassified_idx)
plt.imshow(X_test[i].reshape(28, 28), cmap='gray')
plt.title(f"True label: {y_test[i]} Predicted: {y_test_pred[i]}");


# <center>Questions?</center> 

<br>
    
Slide Navigation: <a href="#/29/1">Link to Practice Menu </a>

# Homework Assignment: Enhancing Naive Bayes Classifier Models

## It is your turn! 

In this assignment, you are tasked with enhancing the predictive performance of one of the Naive Bayes classifier models we developed in class. This is an opportunity for you to implement and experiment with the machine learning concepts and techniques discussed during the lecture.

### Objectives <a class="tocSkip">

- **Model Optimization**: Optimize the existing Naive Bayes classifier or select an alternative model that you believe could yield better results.
- **Data Processing**: Apply different data preprocessing techniques to improve model accuracy.
- **Research**: Conduct research to explore various strategies that could enhance model performance. Utilize reputable resources to support your choices.

### Deliverables <a class="tocSkip">

1. **Enhanced Model Implementation**: A Python script or Jupyter Notebook containing the code for your improved model.
2. **Performance Comparison**: A report comparing the original model's performance with your enhanced model. Include metrics such as accuracy, precision, recall, and F1-score.
3. **Justification**: A detailed explanation of the changes you made, including:
   - Why you chose to adjust or change the model.
   - The data processing techniques you applied.
   - Any resources or references you utilized to inform your decisions.

### Evaluation Criteria <a class="tocSkip">

- **Innovation**: Creative and effective approaches to model enhancement.
- **Accuracy**: The predictive performance of your final model.
- **Justification**: The rationale behind your methodological choices.
- **Presentation**: Clarity and structure of your comparative analysis and justifications.

Your assignment will not only be evaluated on the improved accuracy of the model but also on your analytical approach and the ability to articulate your decision-making process.

## <center>Have Fun!</center> <a class="tocSkip">


# <center>Thank you!</center>

# References

<br>

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning: With Applications in Python (1st ed. 2023 edition). Springer.

- H. Zhang (2004). [The optimality of Naive Bayes.](https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf)
  Proc. FLAIRS.  

- C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to
  Information Retrieval. Cambridge University Press, pp. 234-265.  

- A. McCallum and K. Nigam (1998).
  [A comparison of event models for Naive Bayes text classification.](https://citeseerx.ist.psu.edu/doc_view/pid/04ce064505b1635583fa0d9cc07cac7e9ea993cc)
  Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.  

- V. Metsis, I. Androutsopoulos and G. Paliouras (2006).
  [Spam filtering with Naive Bayes – Which Naive Bayes?](https://citeseerx.ist.psu.edu/doc_view/pid/8bd0934b366b539ec95e683ae39f8abb29ccc757)
  3rd Conf. on Email and Anti-Spam (CEAS).  

- Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003).
  [Tackling the poor assumptions of naive bayes text classifiers.](https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf)
  In ICML (Vol. 3, pp. 616-623).  

- Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). *Using the ADAP learning algorithm to forecast the onset of diabetes mellitus*. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press.

- D. Greene and P. Cunningham (2006). Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. Proc. ICML.
  



# <center>Annex</center>

## Laplace smoothing Example <a id='laplace'></a>

We have a dataset with 10 features (words) and 2 classes. We'll calculate the conditional probability $P(x_i|C_k)$  of a feature $x_i$ given a class $C_k$.

Assume our dataset is represented as counts of each feature in documents belonging to each class. Here's a simplified representation:

- Total documents in Class 1: 100
- Total documents in Class 2: 100
- Total number of features (words): 10

Let's say we want to calculate $P(x_i|C_k)$ for a specific feature $x_3$ (the third word) for Class 1. Assume in our training data, $x_3$ appears 0 times in Class 1 documents.

First, calculate the total count of all features in each class (let's assume these totals after counting all words in all documents):

- Total feature counts in Class 1 documents: 500

The probability of $x_3$ given Class 1, where $x_3$ does not appear at all, would be calculated simply as:

$$\hat{\theta}_{C_13} = P(x_3|C_1) = \frac{N_{x_3,C_1}}{N_{C_1}} = \frac{0}{500} = 0$$

Because the naive assumption multiplies the per-feature probabilities together, this single zero leads to a zero probability of Class 1 for any document containing $x_3$, no matter what the other features say.

### To solve this issue we use Laplace Smoothing <a class="tocSkip">

    
### With Laplace Smoothing <a class="tocSkip">

For Laplace smoothing, where $\alpha = 1$ and $n = 10$, even if $x_3$ does not appear in Class 1:

$$\hat{\theta}_{C_13} = P(x_3|C_1) = \frac{0 + 1}{500 + 10} = \frac{1}{510} \approx 0.00196$$

Because the estimate is now small but non-zero, the product of per-feature probabilities no longer collapses to zero. A document containing a feature unseen in a class can still be assigned to that class if its other features support it.
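
The calculation above can be reproduced in a few lines. This is a minimal sketch, not part of the original notebook: the helper name `smoothed_probability` is an assumption made for illustration, and the counts are the ones from the example (0 occurrences of $x_3$, 500 total feature counts in Class 1, 10 features).

```python
def smoothed_probability(feature_count, total_count, n_features, alpha=1.0):
    """Estimate P(x_i | C_k) with additive smoothing.

    alpha = 0 gives the unsmoothed relative frequency,
    alpha = 1 gives Laplace smoothing.
    """
    return (feature_count + alpha) / (total_count + alpha * n_features)


# Numbers from the example: x_3 never appears in Class 1 documents
unsmoothed = smoothed_probability(0, 500, 10, alpha=0.0)
smoothed = smoothed_probability(0, 500, 10, alpha=1.0)

print(f"Without smoothing:      {unsmoothed:.5f}")
print(f"With Laplace smoothing: {smoothed:.5f}")
```

Setting `alpha` between 0 and 1 gives a weaker correction, which is the role of the `alpha` parameter in scikit-learn's `MultinomialNB`.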

Slide Navigation: <a href="#/29/1">Link to Practice Menu </a>
    
Slide Navigation: <a href="#/21/1">Link to Multinomial Naive Bayes </a>

## Evaluating Model Performance with ROC Curve and AUC

## ROC Curve <a class="tocSkip">

- Stands for Receiver Operating Characteristic Curve.
- A graphical plot that illustrates the diagnostic ability of a binary classifier.
- Plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

### True Positive Rate (TPR) <a class="tocSkip">

- Also known as sensitivity or recall.
- Calculated as: $TPR = \frac{TP}{TP + FN}$
- Represents the proportion of actual positives that are correctly identified.

### False Positive Rate (FPR) <a class="tocSkip">

- Calculated as: $FPR = \frac{FP}{FP + TN}$
- Represents the proportion of actual negatives that are incorrectly classified as positives.
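
As a quick check of the two formulas, the sketch below counts TP, FN, FP and TN by hand for a small set of made-up labels and predictions. The function name `tpr_fpr` and the toy data are assumptions for illustration only.

```python
def tpr_fpr(actual, predicted):
    """Return (TPR, FPR) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Toy data (hypothetical): 4 actual positives, 6 actual negatives
actual_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr_demo, fpr_demo = tpr_fpr(actual_demo, predicted_demo)
print(f"TPR = {tpr_demo:.3f}, FPR = {fpr_demo:.3f}")
```

Here 3 of the 4 positives are found (TPR = 3/4) and 1 of the 6 negatives is wrongly flagged (FPR = 1/6). Each threshold on the classifier's scores yields one such (FPR, TPR) pair, and the ROC curve joins them.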

## AUC - Area Under the ROC Curve

- A single scalar value summarizing the performance across all classification thresholds.
- Ranges from 0 to 1, with **1 indicating a perfect model and 0.5 representing a random guess**.

### Benefits of AUC <a class="tocSkip">

- AUC provides an aggregate measure of performance across all classification thresholds.
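- It's often used when the classes are imbalanced, because the ROC curve does not depend on the class proportions. A high AUC can still coexist with low precision on a rare class, so it is worth checking precision and recall as well.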
- AUC is scale-invariant and classification-threshold-invariant.

## AUC Interpretation <a class="tocSkip">
    
- A higher AUC value means a better-performing model. Model A with AUC of 0.85 is considered superior to Model B with an AUC of 0.75.
- An AUC of 0.5 suggests no discriminative power, akin to random guessing.
- AUC is particularly useful when you need to compare different models.
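
One way to build intuition for these values: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes AUC directly from that definition. It is an illustration, not how scikit-learn computes it (scikit-learn integrates the ROC curve), and the function name `pairwise_auc` and the toy scores are assumptions.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


labels_demo = [0, 0, 1, 1]

auc_mixed = pairwise_auc(labels_demo, [0.1, 0.4, 0.35, 0.8])    # one pair out of four misordered
auc_perfect = pairwise_auc(labels_demo, [0.1, 0.2, 0.8, 0.9])   # every positive above every negative
auc_constant = pairwise_auc(labels_demo, [0.5, 0.5, 0.5, 0.5])  # scores carry no information

print(auc_mixed, auc_perfect, auc_constant)
```

The three cases correspond to the interpretation above: a useful but imperfect model, a perfect ranking with AUC of 1, and an uninformative model with AUC of 0.5.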

## Usage <a class="tocSkip">

- Tools like Scikit-learn's `roc_curve` and `auc` functions can be used to compute ROC and AUC.

```python
    
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Assuming y_true and y_pred have been defined:
# y_true: actual class labels
# y_pred: predicted probabilities or decision function values

fpr, tpr, thresholds = roc_curve(y_true, y_pred)
roc_auc = auc(fpr, tpr)

# Plotting the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

```

Slide Navigation: <a href="#/29/1">Link to Practice Menu </a>

Slide Navigation: <a href="#/23/1">Link to Measuring Performance </a>
    

## BNB Non-occurrence Example 

Let's consider a simple example involving email classification. We want to classify emails into two categories: "Spam" and "Not Spam" using a Bernoulli Naive Bayes classifier. We will use two binary features for this classifier:

1. The word "discount" appears in the email ($x_1$)
2. The word "lottery" appears in the email ($x_2$)

From our training data, we have calculated the following probabilities:

- $P(\text{"discount"} = 1 | \text{Spam}) = 0.8 $
- $P(\text{"lottery"} = 1 | \text{Spam}) = 0.9 $

For the "Not Spam" class, the probabilities are:

- $P(\text{"discount"} = 1 | \text{Not Spam}) = 0.1 $
- $P(\text{"lottery"} = 1 | \text{Not Spam}) = 0.01 $

Now, we receive a new email to classify that contains the word "discount" but not the word "lottery". The feature vector for this email is $ X = (x_1 = 1, x_2 = 0) $.

For the "Spam" class:

- The likelihood for $ x_1 $ (discount is present) contributes $ P(x_1 = 1 | \text{Spam}) = 0.8 $.
- The likelihood for $ x_2 $ (lottery is absent) contributes $ 1 - P(x_2 = 1 | \text{Spam}) = 1 - 0.9 = 0.1 $.

So, the total likelihood for the email being spam is $ 0.8 \times 0.1 = 0.08 $.

For the "Not Spam" class:

- The likelihood for $ x_1 $ (discount is present) contributes $ P(x_1 = 1 | \text{Not Spam}) = 0.1 $.
- The likelihood for $ x_2 $ (lottery is absent) contributes $ 1 - P(x_2 = 1 | \text{Not Spam}) = 1 - 0.01 = 0.99 $.

So, the total likelihood for the email not being spam is $ 0.1 \times 0.99 = 0.099 $.

The absence of the highly indicative feature "lottery" significantly lowers the likelihood of the email being spam to 0.08. The likelihood of the email being "Not Spam" is actually higher than the likelihood of it being "Spam" (0.099 > 0.08).

**Result**:

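Assuming equal prior probabilities for the two classes, the email would be classified as "Not Spam", because the likelihood of "Not Spam" (0.099) is higher than that of "Spam" (0.08). This happens because Bernoulli Naive Bayes explicitly penalizes the absence of the highly indicative feature "lottery". With unequal priors, each likelihood would first be multiplied by its class prior before comparing.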
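
The arithmetic of this example can be checked with a short script. This is a minimal sketch under the example's numbers: the function name `bernoulli_likelihood` is hypothetical, and the equal class priors used for the normalized posteriors are an assumption, since the example does not state priors.

```python
def bernoulli_likelihood(x, p_present):
    """P(x | class) under Bernoulli Naive Bayes.

    A present feature (x_i = 1) contributes p_i,
    an absent feature (x_i = 0) contributes 1 - p_i.
    """
    likelihood = 1.0
    for x_i, p_i in zip(x, p_present):
        likelihood *= p_i if x_i == 1 else (1.0 - p_i)
    return likelihood


# P(word = 1 | class) for ("discount", "lottery"), taken from the example
p_spam = [0.8, 0.9]
p_not_spam = [0.1, 0.01]

email = [1, 0]  # contains "discount", does not contain "lottery"

like_spam = bernoulli_likelihood(email, p_spam)
like_not_spam = bernoulli_likelihood(email, p_not_spam)

# Assumption: equal priors, so posteriors are the normalized likelihoods
total = like_spam + like_not_spam
post_spam = like_spam / total
post_not_spam = like_not_spam / total

print(f"Likelihoods: spam={like_spam:.3f}, not spam={like_not_spam:.3f}")
print(f"Posteriors:  spam={post_spam:.3f}, not spam={post_not_spam:.3f}")
```

Changing `email` to `[1, 1]` shows the opposite effect: once "lottery" is present, the spam likelihood rises to $0.8 \times 0.9 = 0.72$ while the "Not Spam" likelihood drops to $0.1 \times 0.01 = 0.001$.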

Slide Navigation: <a href="#/29/1">Link to Practice Menu </a>

Slide Navigation: <a href="#/22/1">Link to Bernoulli Naive Bayes </a>

