# Naive Bayes (Bayes)

**Author**: Germain Forestier (germain.forestier@uha.fr)

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are particularly known for their efficiency in text classification tasks.
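For reference, the decision rule these classifiers share can be written as below. The notation is introduced here and is not part of the original notebook: $y$ is a class label, $x_1, \dots, x_n$ are the feature values, and $\hat{y}$ is the predicted class.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The product over features is where the "naive" independence assumption enters: each feature contributes its own factor $P(x_i \mid y)$, estimated separately from the training data.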


## **Exercise** 1: Naive Bayes using Scikit-learn

This exercise will guide you through implementing a Naive Bayes classifier using Python and Scikit-learn. You'll learn to handle data loading, preprocessing, training a model, and evaluating its performance.

### Part 1: Data Loading

In [2]:
# Import the necessary library
import pandas as pd

# TODO: Load the dataset from the provided URL into a DataFrame named 'data'
# Hint: Use pd.read_csv() with the URL https://germain-forestier.info/dataset/weather.csv
data = pd.read_csv("https://germain-forestier.info/dataset/weather.csv")

# TODO: Display the first few rows of the dataset
data[0:4]

|   | Outlook | Temperature | Humidity | Wind | Play |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | High | Weak | No |
| 1 | Sunny | Hot | High | Strong | No |
| 2 | Overcast | Hot | High | Weak | Yes |
| 3 | Rain | Mild | High | Weak | Yes |


### Part 2: Data Preprocessing

In [7]:
# Import LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder

# Initialize a LabelEncoder
encoder = LabelEncoder()

# TODO: Apply the encoder to each categorical column and replace the column with encoded data
# Example: data['Outlook'] = encoder.fit_transform(data['Outlook'])

# Every column in this dataset is categorical, so loop over all of them.
# Note: the same encoder is refitted on each column, so afterwards it only
# remembers the mapping of the last column it saw.
for column in data.columns:
    data[column] = encoder.fit_transform(data[column])

# TODO: Display the transformed data
data[0:4]

|   | Outlook | Temperature | Humidity | Wind | Play |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 1 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 1 |
| 3 | 1 | 2 | 0 | 1 | 1 |
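If you later need to translate the integer codes back into labels, one option is to keep a separate fitted encoder per column. The following is a minimal sketch under that assumption; the `sample` DataFrame and the `encoders` dictionary are hypothetical names, and the rows are made up so the example runs without downloading anything.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small inline sample with some of the weather columns
# (hypothetical rows, used only so the sketch runs offline).
sample = pd.DataFrame({
    "Outlook": ["Sunny", "Overcast", "Rain", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Strong"],
    "Play": ["No", "Yes", "Yes", "No"],
})

# One fitted encoder per column, so every mapping stays available.
encoders = {}
encoded = sample.copy()
for column in sample.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(sample[column])

# LabelEncoder assigns codes in sorted label order.
outlook_classes = encoders["Outlook"].classes_.tolist()
print(outlook_classes)

# Decode the integers back to the original strings.
decoded_play = encoders["Play"].inverse_transform(encoded["Play"]).tolist()
print(decoded_play)
```

Because codes follow sorted label order, `Overcast`, `Rain`, and `Sunny` map to 0, 1, and 2, which matches the encoded table above.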


### Part 3: Splitting the Data

In [8]:
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split

# Split data into features (X) and target (y)
X = data.drop('Play', axis=1)  # Features
y = data['Play']               # Target

# TODO: Split the data into training and testing sets with a test size of 20% and random_state=42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Part 4: Training the Naive Bayes Model

In [9]:
# Import CategoricalNB from sklearn
from sklearn.naive_bayes import CategoricalNB

# Initialize the Categorical Naive Bayes classifier
model = CategoricalNB()

# TODO: Train the model using the training data
model.fit(X_train, y_train)

CategoricalNB parameters:

| Parameter | Value |
|---|---|
| alpha | 1.0 |
| force_alpha | True |
| fit_prior | True |
| class_prior | None |
| min_categories | None |
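To see what `alpha=1.0` does, the sketch below fits `CategoricalNB` on a tiny made-up dataset (one feature with three categories, not the weather data) and compares the learned conditional probabilities with the additive-smoothing formula computed by hand. The names `toy_X`, `toy_y`, and `toy_model` are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: a single categorical feature with values 0, 1, 2.
toy_X = np.array([[0], [0], [1], [2], [2], [2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_model = CategoricalNB(alpha=1.0).fit(toy_X, toy_y)

# Category counts per class: rows are classes, columns are categories.
counts = np.array([[2, 1, 0],
                   [0, 0, 3]])
alpha = 1.0
n_categories = counts.shape[1]

# Additive smoothing: (count + alpha) / (class total + alpha * n_categories)
manual = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_categories)

# feature_log_prob_ holds one (n_classes, n_categories) array per feature.
learned = np.exp(toy_model.feature_log_prob_[0])
print(np.allclose(manual, learned))

# Posterior for category 1: class 0 gets 2/3, class 1 gets 1/3.
posterior = toy_model.predict_proba([[1]])[0]
print(posterior)
```

Smoothing keeps every probability strictly positive, so a category never observed with a class (category 0 with class 1 here) does not force the posterior to zero.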


### Part 5: Model Evaluation

In [10]:
# Import accuracy_score and confusion_matrix from sklearn
from sklearn.metrics import accuracy_score, confusion_matrix

# TODO: Make predictions on the test set and store in a variable 'predictions'
predictions = model.predict(X_test)

# TODO: Calculate and print the accuracy of the model
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")

# TODO: Generate and display the confusion matrix
confusion_matrix(y_test, predictions)

Accuracy: 0.33
array([[0, 1],
       [1, 1]])
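The accuracy can be read directly off this confusion matrix: correct predictions sit on the diagonal, and the sum of all cells is the number of test samples. A short sketch using the matrix shown above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes.
cm = np.array([[0, 1],
               [1, 1]])

correct = int(np.trace(cm))   # diagonal = correctly classified samples
total = int(cm.sum())         # every test sample appears in exactly one cell
accuracy_from_cm = correct / total

print(f"{correct} correct out of {total}: accuracy = {accuracy_from_cm:.2f}")
```

With only three test samples, a single prediction moves the accuracy by a third, so this score says little about how well the model generalizes.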

## **Exercise** 2: Text Classification using Multinomial Naive Bayes

This exercise introduces text classification with the Multinomial Naive Bayes classifier, a popular algorithm for text data in which word counts serve as features. You will work with the 20 Newsgroups dataset, focusing on a subset of categories.

### Part 1: Data Overview

Before diving into text classification, it's crucial to understand the dataset you'll be working with. The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. In this part, you'll load the dataset and explore the different categories available, which will help you become familiar with the type of text data you'll classify.


In [None]:
# Import necessary library to fetch data
from sklearn.datasets import fetch_20newsgroups

# TODO: Load data using fetch_20newsgroups and display the target names (categories)

### Part 2: Selecting Specific Categories

Often, you may not need all the data available in a dataset for your specific task. For this exercise, we'll focus on a subset of categories to simplify the learning process and reduce computation time. You will learn how to fetch data for selected categories only.


In [None]:
# Define the categories to focus on
categories = ['talk.religion.misc', 'soc.religion.christian', 'sci.space', 'comp.graphics']

# TODO: Fetch the training and testing data for the selected categories


### Part 3: Text Vectorization and Model Training

Text data must be converted into a numerical format that machine learning models can understand. Using TF-IDF vectorization, you will convert text documents into a matrix of TF-IDF features. Then, you will use these features to train a Multinomial Naive Bayes model, which is especially suited for text classification tasks where features represent word counts or frequencies.


In [None]:
# Import necessary libraries for text vectorization and Naive Bayes classifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# TODO: Create a pipeline that includes TfidfVectorizer and MultinomialNB

# TODO: Train the model with the training data
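Before working on the newsgroup data, it can help to see the same pipeline shape on a corpus small enough to read. The sketch below is not the exercise solution: `toy_texts`, `toy_labels`, and `toy_pipeline` are hypothetical names and the four sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical four-document corpus with two classes.
toy_texts = [
    "the rocket reached orbit after launch",
    "nasa plans a new satellite launch",
    "the church service starts on sunday",
    "the priest read from the bible",
]
toy_labels = ["space", "space", "religion", "religion"]

# make_pipeline chains the vectorizer and the classifier, so raw strings
# can be passed to fit() and predict() directly.
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

# Words never seen during training (here "delayed") are simply ignored.
toy_prediction = toy_pipeline.predict(["satellite launch delayed"])
print(toy_prediction[0])
```

`make_pipeline` names each step after its lowercased class name (`tfidfvectorizer`, `multinomialnb`), which matters later when addressing parameters in a grid search.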


### Part 4: Evaluating the Model

After training a model, you need to evaluate it on unseen data to see how well it generalizes. In this part, you will use the test data to make predictions and then visualize the results using a confusion matrix. This will allow you to see not only the overall accuracy but also where the model makes mistakes.


In [None]:
# Import necessary library for evaluation
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# TODO: Predict labels for the test data

# TODO: Generate and visualize the confusion matrix using seaborn's heatmap
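When classes have different sizes, a row-normalized confusion matrix is often easier to read than raw counts, because each row then shows how the documents of one true class were distributed over the predictions. The sketch below uses made-up labels (`y_true_toy`, `y_pred_toy`) rather than the newsgroup predictions; the resulting array could be passed to a heatmap in the same way as the raw matrix.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for nine documents.
class_names = ["graphics", "space", "religion"]
y_true_toy = ["graphics", "graphics", "space", "space", "space",
              "religion", "religion", "religion", "religion"]
y_pred_toy = ["graphics", "space", "space", "space", "space",
              "religion", "religion", "space", "graphics"]

# normalize="true" divides each row by the number of true samples in that
# class, so the diagonal holds the per-class recall.
cm_norm = confusion_matrix(y_true_toy, y_pred_toy,
                           labels=class_names, normalize="true")

for name, row in zip(class_names, cm_norm):
    print(f"{name:>8}: {row.round(2).tolist()}")
```

In this toy case the `religion` row shows that half of the religion documents were misclassified, split evenly between the other two classes.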


### Part 5: Using the Model for Prediction

Now that the model is trained and evaluated, we can use it to predict the category of any new piece of text. This functionality is extremely useful in many applications, such as sorting emails, organizing documents, and moderating content. Below is a utility function that predicts the category of a given text string using our trained Multinomial Naive Bayes pipeline.


In [None]:
# Utility function to predict the category of a text
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

# Test the function with some example strings
print(predict_category('Meeting to discuss the launch of the new space satellite.'))
print(predict_category('Join us this Sunday for the church anniversary service.'))
print(predict_category('How to determine the screen resolution!'))



sci.space
soc.religion.christian
comp.graphics


## **Exercise** 3: Advanced Model Optimization and Comparison

This exercise goes deeper into Naive Bayes classification: you will engineer richer features, compare Naive Bayes variants, and tune a hyperparameter.

### Part 1: Feature Engineering with N-Grams

Introducing n-grams into our feature set allows the model to consider the context provided by adjacent word sequences, potentially increasing its predictive accuracy. You will modify the TfidfVectorizer, through its `ngram_range` parameter, to include not only single words (uni-grams) but also sequences of two and three words (bi-grams and tri-grams).


In [None]:
# Import necessary libraries for the classifier and vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# TODO: Create a TfidfVectorizer with ngram_range set to include uni-grams, bi-grams, and tri-grams
# TODO: Integrate this vectorizer into a pipeline with the MultinomialNB classifier

# Train the model with the training data
# model_ngram.fit(train.data, train.target)

# Evaluate the model on the test set
# predictions = model_ngram.predict(test.data)
# print(classification_report(test.target, predictions, target_names=train.target_names))


### Part 2: Comparing Different Naive Bayes Models

Different types of Naive Bayes classifiers are suitable for different types of data. You will now explore how Bernoulli Naive Bayes, which is designed for binary/boolean features, performs on our text data compared to Multinomial Naive Bayes.


In [None]:
# Import necessary libraries for the classifier and vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# TODO: Create a pipeline with TfidfVectorizer set to binary mode and BernoulliNB
# TODO: Train and evaluate the BernoulliNB model
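The sketch below sets the two variants side by side on a small made-up corpus. It is meant as an illustration of the pipeline shapes rather than a benchmark. All names (`compare_texts`, `compare_labels`, `candidates`, `toy_predictions`) are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical four-document corpus with two classes.
compare_texts = [
    "the rocket reached orbit after launch",
    "nasa plans a new satellite launch",
    "the church service starts on sunday",
    "the priest read from the bible",
]
compare_labels = ["space", "space", "religion", "religion"]

candidates = {
    # BernoulliNB thresholds its input at 0 by default (binarize=0.0), so
    # only the presence or absence of each word matters to the classifier.
    "BernoulliNB": make_pipeline(TfidfVectorizer(binary=True), BernoulliNB()),
    # MultinomialNB uses the TF-IDF weights themselves.
    "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

toy_predictions = {}
for name, classifier in candidates.items():
    classifier.fit(compare_texts, compare_labels)
    toy_predictions[name] = classifier.predict(["bible study on sunday"])[0]

print(toy_predictions)
```

One difference to keep in mind when interpreting your results: BernoulliNB also scores the words that are absent from a document, whereas MultinomialNB only uses the words that occur.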


### Part 3: Hyperparameter Tuning

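Tuning model parameters can significantly improve performance. You will use grid search to experiment with different values of the `alpha` parameter in MultinomialNB, which controls the strength of additive (Laplace/Lidstone) smoothing: larger values pull the estimated word probabilities closer to uniform.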


In [None]:
# Import necessary libraries for grid search
from sklearn.model_selection import GridSearchCV

# Setup the pipeline again with the standard TfidfVectorizer and MultinomialNB
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# TODO: Define the parameter grid for 'alpha' in MultinomialNB
# TODO: Initialize and fit a GridSearchCV object to find the best 'alpha'
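Inside a pipeline built with `make_pipeline`, parameters are addressed as `<step name>__<parameter>`, where the step name is the lowercased class name. The sketch below shows that naming on a tiny made-up corpus. The grid values, the `cv=2` setting, and the names `grid_texts`, `grid_labels`, and `search` are assumptions chosen so the example runs quickly, not recommendations for the newsgroup data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical corpus: four documents per class so that 2-fold
# stratified cross-validation has samples of both classes in each fold.
grid_texts = [
    "rocket launch to orbit",
    "satellite in orbit around earth",
    "nasa rocket engine test",
    "astronauts aboard the space station",
    "sunday church service",
    "the priest read the bible",
    "prayer at the church",
    "bible study group on sunday",
]
grid_labels = ["space"] * 4 + ["religion"] * 4

toy_grid_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# "multinomialnb" is the step name make_pipeline derives from the class.
param_grid = {"multinomialnb__alpha": [0.01, 0.1, 1.0]}

search = GridSearchCV(toy_grid_pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(grid_texts, grid_labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

After fitting, `search.best_estimator_` is the pipeline refitted on all the training data with the selected `alpha`, and it can be used for prediction like any other pipeline.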



### Conclusion and Discussion

In this exercise, you explored feature engineering with n-grams, compared Naive Bayes variants, and tuned the smoothing hyperparameter with grid search. When you compare the resulting models, look beyond overall accuracy: the per-class scores in the classification report show which categories each change actually helps.


## **Exercise** 4: IMDB Dataset of 50K Movie Reviews

In this exercise, your goal is to use a Naive Bayes classifier to perform sentiment analysis on the IMDB Dataset of 50K Movie Reviews: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
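One possible starting point is sketched below. It assumes the Kaggle CSV exposes a `review` text column and a `sentiment` column with the values `positive` and `negative`, and that reviews may contain `<br />` HTML line breaks; check both against the file you download. A small inline DataFrame with made-up reviews stands in for the real data so that the sketch runs offline.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in for the downloaded CSV (assumed columns: 'review', 'sentiment').
# With the real file, this would be a pd.read_csv(...) call instead.
reviews = pd.DataFrame({
    "review": [
        "A wonderful film with a brilliant cast.<br /><br />Loved it.",
        "Brilliant story and wonderful acting.",
        "I loved this wonderful movie.",
        "A boring film with a terrible script.",
        "Terrible acting and a boring story.",
        "I hated this boring movie.",
    ],
    "sentiment": ["positive"] * 3 + ["negative"] * 3,
})

# Replace HTML line breaks so that 'br' does not become a token.
clean_reviews = reviews["review"].str.replace("<br />", " ", regex=False)

sentiment_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
sentiment_model.fit(clean_reviews, reviews["sentiment"])

sentiment_prediction = sentiment_model.predict(["wonderful and brilliant"])
print(sentiment_prediction[0])
```

On the full dataset, hold out a test set with `train_test_split` before fitting, then reuse the evaluation and tuning steps from Exercises 2 and 3.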