# **Purpose**<br>
This notebook demonstrates a complete workflow for building and evaluating a text classification model using the Naive Bayes algorithm. It covers data loading, preprocessing, model training, evaluation, and applying the trained model to new messages. This project highlights my understanding of key concepts in machine learning, including data preprocessing, model selection, and performance evaluation.

# **Installing Required Libraries** <br>
**scikit-learn**: A powerful library for machine learning in Python, providing tools for data preprocessing, model training, and evaluation.

**pandas**: A library for data manipulation and analysis, particularly useful for handling structured data.

**numpy**: A library for numerical computations, providing support for arrays and matrices.

In [1]:
!pip install scikit-learn pandas numpy



# **Importing Required Libraries** <br>

**pandas**: Used for loading and manipulating the dataset.

**train_test_split**: A function from scikit-learn used to split the dataset into training and testing sets.

**CountVectorizer**: A tool for converting text data into numerical features (bag-of-words representation).

**MultinomialNB**: A Naive Bayes classifier suitable for classification with discrete features (e.g., word counts).

**accuracy_score, classification_report, confusion_matrix**: Functions used to evaluate the performance of the model.




In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# **Loading the Dataset** <br>
We load the SMS Spam Collection dataset from a URL. The dataset is in TSV (Tab-Separated Values) format, has no header row, and contains two columns:

**label**: Indicates whether the message is "spam" or "ham" (non-spam).

**message**: The actual text content of the SMS.

*We use pd.read_csv() with sep='\t' to load the data into a pandas DataFrame and display the first 5 rows using data.head() to get a quick overview of the dataset.*

In [3]:
# Load the dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Display the first 5 rows
print(data.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
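The loading step can be tried offline with a small sketch. The two rows below are hypothetical stand-ins for the real file; they only mirror its layout (label, tab, message, no header row), so the same read_csv() arguments apply.

```python
import io

import pandas as pd

# Two hypothetical rows in the same layout as sms.tsv: label<TAB>message, no header row
sample_tsv = "ham\tSee you at lunch\nspam\tWin a free prize now\n"

# header=None stops pandas from treating the first message as column names;
# names=[...] supplies the column names instead
sample = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", header=None, names=["label", "message"]
)

print(sample.shape)
print(sample["label"].value_counts())
```

Running value_counts() on the real label column is a quick way to see how many ham and spam messages the dataset contains before training.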


# **Preprocessing the Data** <br>
We preprocess the data to make it suitable for machine learning:

**Label Encoding**: We convert the categorical labels ("spam" and "ham") into binary values (1 for spam, 0 for ham) using the map() function.

**Feature and Target Separation**: We separate the dataset into features (X) and labels (y). Here, X contains the SMS messages, and y contains the corresponding binary labels.


To evaluate the performance of the model, we split the dataset into training and testing sets:

**Training Set (80%)**: Used to train the model.

**Testing Set (20%)**: Used to evaluate the model's performance on unseen data.

*The train_test_split() function is used for this purpose, with random_state=42 fixing the shuffle so that the same split is produced on every run.*


In [4]:
# Convert labels to binary values: spam = 1, ham = 0
data['label'] = data['label'].map({'spam': 1, 'ham': 0})

# Split the data into features (X) and labels (y)
X = data['message']
y = data['label']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
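The two preprocessing claims above can be checked on a toy example. This is a minimal sketch on a hypothetical ten-row DataFrame (the name toy and its contents are made up for illustration): map() leaves NaN for any label missing from the dictionary, and a fixed random_state gives the same split twice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same two columns as the SMS dataset
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"],
    "message": [f"message {i}" for i in range(10)],
})

# map() returns NaN for any label that is not a key in the dictionary,
# so a quick check catches typos or unexpected categories
encoded = toy["label"].map({"spam": 1, "ham": 0})
if encoded.isna().any():
    raise ValueError("Found labels other than 'spam' and 'ham'")

# Splitting twice with the same random_state yields identical partitions
split_a = train_test_split(toy["message"], encoded, test_size=0.2, random_state=42)
split_b = train_test_split(toy["message"], encoded, test_size=0.2, random_state=42)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
print(len(X_train_toy), len(X_test_toy))
print(X_test_toy.index.equals(split_b[1].index))
```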

# **Text Vectorization** <br>
We convert the text data into numerical features using CountVectorizer:

**fit_transform()**: This method learns the vocabulary from the training data and transforms the text into a matrix of token counts.

**transform()**: This method applies the same transformation to the testing data using the vocabulary learned from the training data.

*The result is a sparse matrix where each row represents an SMS message, and each column represents a word in the vocabulary.*

In [5]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_vec = vectorizer.transform(X_test)

# **Training the Naive Bayes Model** <br>
We initialize a Multinomial Naive Bayes classifier, which is well-suited for text classification tasks. The model is then trained on the vectorized training data (X_train_vec) and corresponding labels (y_train).

*Naive Bayes is a probabilistic classifier that assumes the features are conditionally independent given the class. This assumption rarely holds exactly for words in a message, but it keeps training fast and often works well for text data.*



In [6]:
# Initialize the Naive Bayes model
model = MultinomialNB()

# Train the model on the training data
model.fit(X_train_vec, y_train)
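As a rough illustration of what fit() estimates, the sketch below recomputes the class priors and the smoothed per-class word probabilities by hand on a tiny hypothetical count matrix, then compares them with the fitted attributes of MultinomialNB. The data and variable names are made up; the smoothing follows the additive (Laplace) scheme controlled by the alpha parameter, whose default is 1.0.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: rows are messages, columns are word counts
X_toy = np.array([[2, 1, 0],
                  [0, 1, 3],
                  [1, 0, 0],
                  [0, 2, 2]])
y_toy = np.array([1, 0, 1, 0])

alpha = 1.0
toy_model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Class prior: fraction of training messages in each class
classes = np.unique(y_toy)
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))

# Word likelihood: smoothed share of each word among all words of the class
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes])
smoothed = counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score of a new message: log prior plus count-weighted sum of word log-likelihoods
x_new = np.array([[1, 0, 2]])
joint = x_new @ log_likelihood.T + log_prior
manual_pred = classes[np.argmax(joint, axis=1)]

print(np.allclose(log_likelihood, toy_model.feature_log_prob_))
print(manual_pred, toy_model.predict(x_new))
```

Summing log-probabilities word by word is exactly where the independence assumption enters: each word contributes to the score without regard to the other words in the message.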

# **Making Predictions** <br>
After training the model, we use it to make predictions on the test set (X_test_vec). The predict() method returns the predicted labels for the test data, which we store in y_pred.

In [7]:
# Make predictions on the test set
y_pred = model.predict(X_test_vec)

# **Evaluating the Model** <br>
To assess the model's performance, we calculate the following metrics:

**Accuracy:** The proportion of correctly classified instances out of the total instances.

**Classification Report:** Provides precision, recall, F1-score, and support for each class.

**Confusion Matrix:** A matrix of actual vs. predicted classifications, where each row is an actual class and each column is a predicted class. It shows which kinds of errors the model makes.

In [8]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Print confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 99.19%

Classification Report:
              precision    recall  f1-score   support

         Ham       0.99      1.00      1.00       966
        Spam       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115


Confusion Matrix:
[[966   0]
 [  9 140]]
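The reported metrics can be recovered from the confusion matrix alone. The sketch below copies the matrix printed above and recomputes accuracy and the spam-class precision, recall, and F1-score by hand; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted,
# in label order [ham (0), spam (1)]
cm = np.array([[966, 0],
               [9, 140]])

# For binary labels, ravel() gives: true negatives, false positives,
# false negatives, true positives (treating spam as the positive class)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"Accuracy:       {accuracy:.4f}")
print(f"Spam precision: {spam_precision:.2f}")
print(f"Spam recall:    {spam_recall:.2f}")
print(f"Spam F1-score:  {spam_f1:.2f}")
```

Read this way, the matrix shows that no ham message was flagged as spam, while 9 of the 149 spam messages in the test set were missed. The ham class is much larger than the spam class, so the spam recall is a more telling number than overall accuracy.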


# **Classifying New SMS Messages** <br>
We demonstrate the model's ability to classify new, unseen SMS messages:

**New SMS Messages**: We provide a list of new SMS messages to classify.

**Vectorization**: The new messages are transformed using the same CountVectorizer to ensure consistency with the training data.

**Prediction:** The trained model predicts whether each message is spam or ham.

**Output:** The predictions are displayed alongside the original messages, showcasing the model's real-world applicability.

In [9]:
# New sms to classify
new_sms = [
    "Congratulations! You've won a $1000 Walmart gift card. Click here to claim now.",
    "Hey, are we still meeting for lunch today?",
    "URGENT: Your bank account has been compromised. Click here to secure it."
]

# Convert new sms to numerical features
new_sms_vec = vectorizer.transform(new_sms)

# Make predictions
predictions = model.predict(new_sms_vec)

# Display predictions
for sms, prediction in zip(new_sms, predictions):
    print(f"sms: {sms}\nPrediction: {'Spam' if prediction == 1 else 'Ham'}\n")

sms: Congratulations! You've won a $1000 Walmart gift card. Click here to claim now.
Prediction: Spam

sms: Hey, are we still meeting for lunch today?
Prediction: Ham

sms: URGENT: Your bank account has been compromised. Click here to secure it.
Prediction: Spam
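One optional way to guarantee that new messages always pass through the same vectorizer as the training data is to bundle both steps into a scikit-learn Pipeline. This is a sketch of an alternative arrangement, not what the notebook does above; the four training messages and the name spam_pipeline are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set: 1 = spam, 0 = ham
toy_messages = [
    "win free prize now",
    "free cash prize claim",
    "lunch at noon today",
    "see you at the meeting",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# then applies both in order whenever predict() is called on raw text
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_messages, toy_labels)

new_messages = ["claim your free prize", "meeting at lunch"]
toy_predictions = spam_pipeline.predict(new_messages)

for text, label in zip(new_messages, toy_predictions):
    print(f"sms: {text}\nPrediction: {'Spam' if label == 1 else 'Ham'}\n")
```

With a pipeline there is a single object to save and reuse, which removes the risk of pairing the model with a vectorizer fitted on different data.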

