### Spam Detection

*** 

### Welcome to my personal project, in which I build a machine learning model for spam detection using Naive Bayes and a dataset sourced from a Udemy course.

### This project involves the following tasks:

- Importing data
- Preprocessing text data
- Noise reduction
- Standardization
- Feature Extraction
- Building a Naive Bayes model
- Model evaluation

### The goal of this project is to demonstrate how a machine learning model performs on a real-life task, spam filtering, and to show how such a model can support a practical business need.


***

### Importing libraries and data

In [1]:
# Importing libraries
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.naive_bayes import MultinomialNB
import nltk # (Natural Language Toolkit) is used for text preprocessing
from nltk.corpus import stopwords # provides a list of common stopwords
import string # is used to handle string operations, specifically for punctuation removal
from sklearn.feature_extraction.text import TfidfVectorizer # is used to convert text data into TF-IDF feature vectors

import warnings
warnings.filterwarnings("ignore")
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d00845\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Importing data
df = pd.read_csv('spam.csv')
df.head()

| Unnamed: 0 | type | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Working with text data:

- Preprocessing data is essential in text classification tasks for several reasons. By reducing the noise and standardizing the text, the model can focus on the meaningful parts of the text, which can improve the model's performance and generalization ability.

- Noise reduction: Punctuation marks often do not contribute to the meaning of the text in the context of text classification and can be considered noise. Stopwords (like "and", "the", "is") are common words that usually do not carry significant meaning and can be removed to reduce dimensionality and noise in the data.

- Standardization: Converts all text to a uniform case (usually lowercase), which helps in treating words like "Apple" and "apple" as the same word.

- To preprocess the data, the function below converts text to lowercase and then removes punctuation and stopwords.

In [3]:
# Preprocess the data
# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

df['message'] = df['message'].apply(preprocess_text)
df.head()

| Unnamed: 0 | type | message |
|---|---|---|
| 0 | ham | go jurong point crazy available bugis n great ... |
| 1 | ham | ok lar joking wif u oni |
| 2 | spam | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | ham | u dun say early hor u c already say |
| 4 | ham | nah dont think goes usf lives around though |
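
One side effect of the order of operations is visible in the last row above, where "don't" survives as "dont": once punctuation is stripped, a contraction may no longer match its entry in the stopword list. The sketch below illustrates the effect. It assumes a tiny hand-written stopword set (`TOY_STOP_WORDS`, a hypothetical stand-in for NLTK's list) so that it runs without any downloads, and the helper names are illustrative only.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list (assumption: the real
# list is much longer; a few entries are enough to show the effect).
TOY_STOP_WORDS = {"i", "he", "to", "don't"}

def strip_punctuation(text):
    """Remove every character found in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

def punctuation_first(text):
    """Same order as preprocess_text: lowercase, strip punctuation, then filter."""
    words = strip_punctuation(text.lower()).split()
    return " ".join(w for w in words if w not in TOY_STOP_WORDS)

def stopwords_first(text):
    """Alternative order: lowercase, filter stopwords, then strip punctuation."""
    words = [w for w in text.lower().split() if w not in TOY_STOP_WORDS]
    return strip_punctuation(" ".join(words))

sample = "Nah I don't think he goes to usf"
print(punctuation_first(sample))  # "dont" is kept
print(stopwords_first(sample))    # "don't" is filtered out before stripping
```

Neither order is perfect: filtering first misses stopwords that have punctuation attached (for example "to,"), so a proper tokenizer may be a better choice if this matters for the model.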


### Feature extraction: 

- Feature extraction is the process of converting text data into numerical features that can be used by machine learning algorithms. TF-IDF (Term Frequency-Inverse Document Frequency) is a popular method for this purpose.

- Term Frequency (TF): Measures how frequently a term appears in a document. A higher frequency suggests the term matters more to that document.

- Inverse Document Frequency (IDF): Measures how rare a term is across the whole collection of documents. Words that occur in many documents have a lower IDF, while rare words have a higher IDF.

- The TF-IDF score for each word is calculated by multiplying its TF and IDF scores. This helps balance the term's frequency in a document with how unique it is across all documents.
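
The calculation described above can be sketched by hand on a toy corpus. This is a minimal illustration of the textbook formula (relative term frequency times the log of inverse document frequency), with made-up documents and a hypothetical helper `tf_idf`. It is not what `TfidfVectorizer` computes exactly: with default settings, scikit-learn uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each document vector.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from spam.csv).
docs = [
    "free entry win prize",
    "ok see you later",
    "win free tickets now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# Both terms appear once in the first document, but "prize" is rarer overall.
for term in ("free", "prize"):
    print(term, round(tf_idf(term, tokenized[0]), 4))
```

In the first document, "prize" scores higher than "free" even though both occur once, because "prize" appears in only one of the three documents.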

In [4]:
# Feature extraction using TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['type']

### Naive Bayes: 

- Naive Bayes is a family of probabilistic algorithms based on Bayes' Theorem, primarily used for classification tasks. These algorithms operate under the "naive" assumption that the features of a dataset are conditionally independent given the class label. 

- Despite this simplification, Naive Bayes classifiers perform remarkably well in various real-world scenarios, such as spam detection, sentiment analysis, and medical diagnosis. The key advantage of Naive Bayes is its efficiency, both in terms of computational speed and the ability to handle large datasets with ease. 

- Furthermore, it requires a relatively small amount of training data to estimate the necessary parameters, making it a robust and scalable solution for many applications.
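
To make the independence assumption concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The training sentences, the helper names `log_posterior` and `predict`, and the choice to skip unseen words are all assumptions made for illustration. The sketch works on raw word counts, whereas this notebook passes TF-IDF weights to scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Toy training set (made up for illustration, not taken from spam.csv).
train = [
    ("free prize win now", "spam"),
    ("win free entry", "spam"),
    ("see you at lunch", "ham"),
    ("ok see you later", "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)  # label -> word -> count
for text, label in train:
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) with add-alpha smoothing."""
    log_prob = math.log(class_counts[label] / len(train))  # log prior
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # assumption: ignore words never seen in training
        count = word_counts[label][word]
        # Independence assumption: each word contributes its own factor.
        log_prob += math.log((count + alpha) / (total + alpha * len(vocab)))
    return log_prob

def predict(text):
    """Return the class with the highest log posterior."""
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(predict("win a free prize"))
print(predict("see you later"))
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow on longer messages.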

In [5]:
# Initialize K-Fold Cross Validator
kf = KFold(n_splits=3, shuffle=True)

# Initialize the Naive Bayes model
nb_model = MultinomialNB()

# Perform cross-validation and evaluate the model
accuracy_scores = cross_val_score(nb_model, X, y, cv=kf, scoring='accuracy')

print(f'Accuracy scores for each fold: {accuracy_scores}')
print(f'Average accuracy: {accuracy_scores.mean()}')

Accuracy scores for each fold: [0.96932185 0.95476575 0.95153473]
Average accuracy: 0.958540778701947
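
In the cell above, the TF-IDF vectorizer was fit on the full dataset before the folds were split, and the shuffle has no fixed seed, so the scores change from run to run. One possible variant, sketched here on made-up toy data, wraps the vectorizer and the model in a scikit-learn pipeline so the vectorizer is refit on the training part of each fold. The use of `StratifiedKFold` and the seed value `42` are assumptions of this sketch, not part of the original notebook, and it has not been run on `spam.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (made up for illustration); in the notebook these would be
# df['message'] and df['type'].
messages = [
    "free prize win now", "win free entry today", "claim free cash prize",
    "see you at lunch", "ok see you later", "call me when you are home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline refits the vectorizer on the training part of every fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Stratified folds keep the class ratio similar in each fold;
# random_state makes the shuffle repeatable (42 is an arbitrary choice).
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

scores = cross_val_score(pipeline, messages, labels, cv=cv, scoring="accuracy")
print(scores, scores.mean())
```

With only six toy messages the printed scores are not meaningful; the point is the structure, which carries over directly to the real columns.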


*** 

### Conclusion:

- This notebook demonstrates a complete workflow for building and evaluating a Naive Bayes classifier that labels messages as either "ham" or "spam".

- It begins by importing a dataset (spam.csv) and preprocessing the text: converting it to lowercase, removing punctuation, and eliminating stopwords. Next, it uses TF-IDF vectorization to transform the preprocessed text into numerical features suitable for machine learning.

- The code then uses K-Fold cross-validation with 3 folds to train and evaluate the Multinomial Naive Bayes model, reporting the accuracy for each fold and the average across all folds. Averaging over folds gives a more stable estimate than a single train/test split. Two caveats apply: the TF-IDF vectorizer is fit on the full dataset before the folds are split, so held-out messages influence the IDF weights, and the shuffle has no fixed seed, so the scores vary slightly between runs.

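- Overall, Naive Bayes detected spam messages with an average accuracy of about 0.96 (0.9585 across the three folds), which makes it a strong baseline for spam detection. Accuracy alone can be misleading when one class is much more common than the other, so precision and recall for the spam class would give a fuller picture.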