# Implementation of Naive Bayes using Python
Let us understand how Naive Bayes works with the help of an example:
Classifying New Emails:
Now, when a new email arrives, the model calculates the probability of each
feature in the email occurring in both the spam and not-spam classes. It then
uses Bayes' theorem to determine the overall probability of the email
belonging to each class.
The email is assigned to the class with the highest probability.
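In symbols, the rule described above can be sketched as follows. The notation is introduced here for illustration and is not from the original text: $C$ is a class (spam or not spam) and $w_1, \dots, w_n$ are the features found in the email.

$$
P(C \mid w_1, \dots, w_n) = \frac{P(C) \prod_{i=1}^{n} P(w_i \mid C)}{\sum_{C'} P(C') \prod_{i=1}^{n} P(w_i \mid C')}
$$

The "naive" part is the assumption that the features are conditionally independent given the class, which is what allows the likelihood to be written as a product of per-feature probabilities. The predicted class is the one that maximizes this expression.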
Example:
Imagine a new email contains the words "free" and "offer" and the phrase
"limited time". The model might calculate:
* P(word "free" | spam) = 0.8
* P(word "offer" | spam) = 0.6
* P(phrase "limited time" | spam) = 0.7
* P(word "free" | not spam) = 0.1
* P(word "offer" | not spam) = 0.05
* P(phrase "limited time" | not spam) = 0.02
Based on these probabilities and Bayes' theorem, the model might conclude:
* P(spam | email) = 0.98
* P(not spam | email) = 0.02
Therefore, the email is classified as spam with high confidence.
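The calculation can be reproduced in a few lines of plain Python. This is a minimal sketch rather than how scikit-learn computes it internally. The example does not state the class priors, so equal priors of 0.5 are assumed here, and the function name `posterior` is hypothetical. Under that assumption the posterior for spam comes out above 0.99 rather than the illustrative 0.98; the exact figure depends on the priors.

```python
import math

# Per-feature likelihoods copied from the example above.
likelihoods = {
    "spam": {"free": 0.8, "offer": 0.6, "limited time": 0.7},
    "not spam": {"free": 0.1, "offer": 0.05, "limited time": 0.02},
}

# ASSUMPTION: equal class priors; the example does not state them.
priors = {"spam": 0.5, "not spam": 0.5}


def posterior(likelihoods, priors):
    """Return P(class | email) for each class via Bayes' theorem."""
    # Unnormalized score: prior times the product of the feature likelihoods.
    scores = {
        label: priors[label] * math.prod(feature_probs.values())
        for label, feature_probs in likelihoods.items()
    }
    # Dividing by the total makes the scores sum to 1.
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


result = posterior(likelihoods, priors)
predicted = max(result, key=result.get)
print(predicted, result)
```

Changing `priors` (for example, to reflect that spam is rare in a given mailbox) shifts the posterior without changing the per-feature likelihoods.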

# Step 1: Import necessary libraries

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 2: Load the 20 newsgroups dataset
We will use the well-known "20 newsgroups" dataset, available through
scikit-learn, which consists of close to 19,000 newsgroup documents (18,846 in
the version loaded below) spread across 20 different newsgroups. Each document
belongs to exactly one newsgroup, making the dataset suitable for text
classification tasks.

In [4]:
# remove=... strips headers, footers, and quoted replies so that only the message body is kept
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

In [5]:
# Extract features (text) and labels
X = newsgroups.data
y = newsgroups.target

## Save the data to `.pkl` since it takes time to load (optional)
Pickling the loaded data means you avoid repeating the same operations and can quickly reload the data for analysis or modeling.

In [23]:
import pickle
# Save features and labels (the ../pickles directory must already exist)
with open('../pickles/newsgroups_data.pkl', 'wb') as file:
    pickle.dump({'X': X, 'y': y}, file)

print("Data saved successfully!")


Data saved successfully!


In [24]:
# Load features and labels
with open('../pickles/newsgroups_data.pkl', 'rb') as file:
    data = pickle.load(file)

X = data['X']
y = data['y']

# Verify the data
print(f"Number of samples: {len(X)}")
print(f"Number of labels: {len(y)}")

Number of samples: 18846
Number of labels: 18846


# Step 3: Split the dataset into training and testing sets

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
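To see what the two arguments do, here is a small sketch on hypothetical toy data (the `docs` and `labels` lists are made up for illustration). `test_size=0.2` holds out 20% of the samples for testing, and a fixed `random_state` makes the shuffle reproducible.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the newsgroup texts and labels.
docs = [f"document {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

split_a = train_test_split(docs, labels, test_size=0.2, random_state=42)
split_b = train_test_split(docs, labels, test_size=0.2, random_state=42)

docs_train, docs_test, labels_train, labels_test = split_a

# 10 samples with test_size=0.2 gives 8 for training and 2 for testing.
print(len(docs_train), len(docs_test))

# The same random_state produces the same split on every call.
print(split_a[1] == split_b[1])
```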

# Step 4: Vectorize the text data
* Use CountVectorizer to convert the text data into a bag-of-words
representation.
* Fit and transform the training data (X_train) and transform the testing
data (X_test) using the same vectorizer.
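The fit/transform distinction is easier to see on a tiny corpus. The sketch below uses made-up sentences, not the newsgroup data, and names such as `toy_vectorizer` are hypothetical. With default settings, `CountVectorizer` lowercases the text and keeps tokens of two or more characters. Words that appear only in the test text are ignored because they are not in the vocabulary learned from the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the 20 newsgroups data.
toy_train = ["free offer free", "meeting at noon"]
toy_test = ["free lunch at noon"]

toy_vectorizer = CountVectorizer()
toy_train_counts = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_counts = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

# Vocabulary terms listed in column order.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)
print(toy_train_counts.toarray())  # one row per document, one column per term
print(toy_test_counts.toarray())   # "lunch" is dropped: it was never seen in training
```

Calling `fit_transform` on the test data instead would build a different vocabulary, so the columns would no longer line up with what the classifier was trained on.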


In [26]:
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Step 5: Train the Naive Bayes classifier
* Create a Multinomial Naive Bayes classifier (MultinomialNB).
* Train the classifier using the vectorized training data
(X_train_vectorized) and corresponding labels (y_train).
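As a quick sanity check of the same two steps, the sketch below trains on a hypothetical four-document spam/ham corpus, made up for illustration in the spirit of the email example at the top. `alpha=1.0` is the default additive (Laplace) smoothing, which keeps a word unseen in one class from forcing that class's probability to zero.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, for illustration only.
toy_docs = [
    "free offer now",
    "limited time free offer",
    "meeting schedule for monday",
    "project report attached",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# alpha=1.0 is the default additive (Laplace) smoothing.
toy_model = MultinomialNB(alpha=1.0)
toy_model.fit(toy_counts, toy_labels)

new_docs = ["free offer", "project meeting"]
toy_predictions = toy_model.predict(toy_vectorizer.transform(new_docs))
print(list(toy_predictions))
```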

In [27]:
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_vectorized, y_train)

# Step 6: Make predictions on the test set

In [28]:
y_pred = naive_bayes_classifier.predict(X_test_vectorized)

# Step 7: Evaluate the model

In [29]:
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
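These three metrics are easier to read on a small example. The sketch below uses hypothetical labels for a three-class problem (the values and the class names "a", "b", "c" are made up). In the confusion matrix, row i and column j hold the number of samples whose true class is i and whose predicted class is j, so the diagonal counts the correct predictions and per-class recall is the diagonal divided by the row sums.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for a three-class problem, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_hat)   # fraction of predictions that match
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

# Per-class recall: correct predictions divided by the number of true samples.
recall = cm.diagonal() / cm.sum(axis=1)

print(f"Accuracy: {acc:.2f}")
print(cm)
print(recall)

# target_names replaces the integer class indices with readable labels.
report = classification_report(y_true, y_hat, target_names=["a", "b", "c"])
print(report)
```

For the newsgroup model, passing `target_names=newsgroups.target_names` to `classification_report` should label each row with the newsgroup name instead of an index.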

### Display the results

In [30]:
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 61.75%


In [31]:
print("\nClassification Report:")
print(classification_rep)


Classification Report:
              precision    recall  f1-score   support

           0       0.61      0.25      0.36       151
           1       0.48      0.75      0.58       202
           2       0.73      0.04      0.08       195
           3       0.53      0.73      0.62       183
           4       0.86      0.58      0.69       205
           5       0.68      0.80      0.74       215
           6       0.88      0.53      0.66       193
           7       0.87      0.63      0.73       196
           8       0.49      0.58      0.53       168
           9       0.99      0.67      0.80       211
          10       0.92      0.80      0.86       198
          11       0.59      0.77      0.67       201
          12       0.84      0.49      0.62       202
          13       0.82      0.75      0.79       194
          14       0.76      0.69      0.72       189
          15       0.36      0.95      0.52       202
          16       0.80      0.55      0.65       188
   

In [13]:
print("\nConfusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[ 38   1   0   0   0   0   0   1   5   0   0   1   0   1   3  75   2  13
    8   3]
 [  1 151   0   9   1   6   1   1   6   0   1   8   0   2   5   6   1   2
    1   0]
 [  1  50   8  48   5  51   0   0   7   0   0  13   2   0   1   5   1   0
    3   0]
 [  0  16   0 134   6  10   2   0   1   0   0   5   2   1   0   4   0   1
    1   0]
 [  2  15   1  16 119   2   2   1  12   0   0  13   3   5   2   7   0   2
    3   0]
 [  0  29   0   5   0 171   0   0   2   0   0   0   0   1   1   4   0   1
    1   0]
 [  0  12   0  26   5   3 103   4   2   0   1  11   6   3   4   5   1   3
    4   0]
 [  1   3   0   0   0   1   2 124  10   0   1   8   3   0   4   8   1  12
   18   0]
 [  0   2   0   0   0   1   4   8  98   1   1   5   0   1   3  11   3  17
   13   0]
 [  0   2   0   0   0   0   0   0  10 142   6   2   0   1   0  23   0  15
   10   0]
 [  1   1   0   0   0   0   0   0   5   0 159   2   0   2   0  13   0   5
   10   0]
 [  0   5   1   0   1   1   0   0   1   0   1 