# 📧 Spam Classification using Machine Learning


## 1. Introduction

### Why is Spam Classification Important?

Spam emails are unsolicited messages sent in bulk, often for advertising, phishing, or malicious purposes. These emails can be annoying at best and dangerous at worst, as they may contain fraudulent links, scams, or malware.

With the increasing volume of emails received daily, manually identifying and filtering spam emails is impractical. This is where **Machine Learning (ML)** comes in. By leveraging **Natural Language Processing (NLP)** techniques, we can automate spam detection with high accuracy.

### Goal of This Project

The primary objective of this project is to classify emails as **spam** or **ham (not spam)** using machine learning techniques. We will:

1. **Process and clean** the email dataset.
2. **Convert textual data** into numerical features using **TF-IDF Vectorization**.
3. **Train a Logistic Regression model** to classify emails.
4. **Evaluate model performance** using metrics like **accuracy score**.
5. **Suggest improvements** for future work.

This project is a **Level 2 Data Science Project**, which means we go beyond basic dataset exploration and implement more advanced **feature engineering**, **machine learning algorithms**, and **evaluation techniques**.

---


## 2. Data Preprocessing

### 2.1 Importing Essential Libraries

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix



### Why Are These Libraries Important?

1. **NumPy (`numpy`)**: Provides support for large, multi-dimensional arrays and matrices. It is particularly useful for handling numerical operations efficiently.

2. **Pandas (`pandas`)**: A powerful data analysis library used to load, manipulate, and preprocess tabular datasets like CSV files.

3. **Matplotlib (`matplotlib.pyplot`)**: A popular data visualization library that helps us create graphs and charts to analyze dataset characteristics.

4. **Seaborn (`seaborn`)**: Built on top of Matplotlib, it provides additional plotting capabilities for statistical visualizations.

5. **Scikit-Learn (`sklearn`)**: A machine learning library with utilities for data preprocessing, model training, and evaluation.

   - `train_test_split`: Splits data into training and testing sets.
   - `LogisticRegression`: A statistical model for binary classification.
   - `TfidfVectorizer`: Converts textual data into numerical features using **TF-IDF (Term Frequency-Inverse Document Frequency)**.
   - `accuracy_score`: Computes the accuracy of model predictions.
   - `classification_report`: Provides a detailed breakdown of precision, recall, and F1-score.
   - `confusion_matrix`: Tabulates the counts of true positives, false positives, true negatives, and false negatives.

With these libraries, we can effectively process and classify emails.


### 2.2 Loading and Exploring the Dataset

In [None]:

df = pd.read_csv("mail_data.csv")
print("Dataset Overview:")
print(df.head())  # Display first few rows



### Understanding the Dataset

Our dataset consists of two main columns:

1. **Category**: Labels indicating whether the email is **"spam"** or **"ham"** (not spam).
2. **Message**: The actual email text.

Our goal is to use the **Message** column to predict the **Category**.

### Sample Data:

| Index | Category | Message |
|--------|----------|------------|
| 0 | ham | Hello, how are you? |
| 1 | spam | Congratulations! You won a free iPhone! Click here to claim. |
| 2 | ham | Are we still meeting tomorrow? |
| 3 | spam | Urgent! Your account has been compromised. Enter your password now. |

From the example above, we can observe that spam messages often contain words like **"Congratulations," "Urgent," "Click here,"** or **"Free."**

To proceed, we will clean and preprocess this text data to improve our model's accuracy.
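Before cleaning, it can help to check how the two classes are distributed, since spam datasets are often imbalanced and that affects how accuracy should be read. The sketch below is illustrative only: `sample` is a small hypothetical stand-in for `mail_data.csv` that reuses the `Category` and `Message` column names, and the `length` column is an assumption added for the example.

```python
import pandas as pd

# Hypothetical stand-in for mail_data.csv, using the same column names
sample = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "spam", "ham"],
    "Message": [
        "Hello, how are you?",
        "Congratulations! You won a free iPhone! Click here to claim.",
        "Are we still meeting tomorrow?",
        "Urgent! Your account has been compromised. Enter your password now.",
        "See you at lunch.",
    ],
})

# Number of messages per class and each class's share of the dataset
class_counts = sample["Category"].value_counts()
class_share = sample["Category"].value_counts(normalize=True)

# Message length in characters, averaged per class
sample["length"] = sample["Message"].str.len()
length_by_class = sample.groupby("Category")["length"].mean()

print(class_counts)
print(class_share.round(2))
print(length_by_class.round(1))
```

On the real dataset, the same calls would run on `df` instead of `sample`.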


### 2.3 Handling Missing and Duplicate Values

In [None]:

# Count missing values and duplicate rows before cleaning
missing_values = df.isnull().sum()
duplicates = df.duplicated().sum()
rows_before = len(df)

# Remove rows with missing values, then exact duplicate rows
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

print(f"Missing values found: {missing_values.sum()}")
print(f"Duplicate rows found: {duplicates}")
print(f"Rows removed in total: {rows_before - len(df)}")



### Why Handle Missing and Duplicate Values?

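- **Missing values** break later steps: a missing message cannot be vectorized, and a missing category cannot be mapped to a label.
- **Duplicate rows** give repeated messages extra weight, and the same message can land in both the training and the testing set, which inflates the measured accuracy.
- Removing both gives a **cleaner dataset** for training. It does not balance the classes, so the spam/ham ratio is still worth checking.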
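To make the effect of the cleaning step concrete, here is a minimal sketch on hypothetical toy data with one missing message and one exact duplicate. The names `toy` and `cleaned` are illustrative, and the non-inplace chaining is simply one possible style, not the notebook's own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing message and one exact duplicate row
toy = pd.DataFrame({
    "Category": ["ham", "spam", "spam", "ham"],
    "Message": ["See you soon", "Win a prize now", "Win a prize now", np.nan],
})

rows_before = len(toy)
missing_per_column = toy.isnull().sum()

# Drop rows with missing values, then drop exact duplicates
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

rows_removed = rows_before - len(cleaned)
print(missing_per_column)
print(f"Rows removed: {rows_removed}")
```

Only two rows survive here: the row with the missing message and the second copy of the duplicated spam message are both dropped.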


## 3. Feature Engineering

### 3.1 Converting Labels into Numerical Form

In [None]:

df["label"] = df["Category"].map({"spam": 1, "ham": 0})  # Mapping spam as 1 and ham as 0



### Why Convert Labels?

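Many machine learning libraries expect **numerical targets** rather than **text labels**. scikit-learn classifiers also accept string labels, but an explicit 0/1 encoding makes spam the positive class, which keeps metrics such as precision and recall unambiguous.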

- **Spam (1)**: Indicates an email is unwanted or fraudulent.
- **Ham (0)**: Indicates a normal, non-spam email.

This transformation allows the model to learn patterns associated with each category.
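One thing to watch with `map` is that any category missing from the dictionary silently becomes `NaN`. The sketch below shows one possible guard, assuming the labels might contain stray whitespace or mixed case; the `raw` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical labels, including stray whitespace and mixed case
raw = pd.DataFrame({"Category": ["ham", "spam", " Spam ", "HAM"]})

# Normalize the text before mapping so variants are not silently lost
normalized = raw["Category"].str.strip().str.lower()
raw["label"] = normalized.map({"spam": 1, "ham": 0})

# map() yields NaN for any value that is missing from the dictionary
unmapped = int(raw["label"].isna().sum())
assert unmapped == 0, "Found categories outside {'spam', 'ham'}"

# Safe to cast to int once no NaN values remain
raw["label"] = raw["label"].astype(int)
print(raw["label"].tolist())  # [0, 1, 1, 0]
```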


### 3.2 Transforming Text Data using TF-IDF

In [None]:

vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df["Message"])



### Why Use TF-IDF?

**TF-IDF (Term Frequency - Inverse Document Frequency)** is a statistical method that converts text into numerical values by analyzing the importance of words in a document.

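- **High TF-IDF Score**: A word that appears often in one message but in few others (e.g., "lottery" in a spam email).
- **Low TF-IDF Score**: A word that appears in many messages and therefore carries little discriminating information.

Very common function words such as "the," "is," and "and" are dropped entirely here, because the vectorizer is created with `stop_words='english'`. TF-IDF itself does not see the labels, but it down-weights ubiquitous words so the model can focus on terms that are more likely to separate spam from ham.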
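A small worked example may make the weighting more concrete. With scikit-learn's default settings (`smooth_idf=True`), the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term; each row is then L2-normalized. The three-document corpus below is hypothetical and only meant as a sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "meeting" appears in two documents, "lottery" in one
docs = [
    "lottery winner claim prize",
    "meeting tomorrow morning",
    "meeting agenda attached",
]

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(docs)

# idf_ holds one inverse-document-frequency weight per vocabulary term
vocab = vec.vocabulary_
idf_lottery = vec.idf_[vocab["lottery"]]
idf_meeting = vec.idf_[vocab["meeting"]]

print(matrix.shape)
print(f"idf(lottery)={idf_lottery:.3f}, idf(meeting)={idf_meeting:.3f}")
```

The rarer word receives the larger IDF weight: "lottery" occurs in one of three documents, "meeting" in two.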


## 4. Model Training

### 4.1 Splitting the Dataset

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, df["label"], test_size=0.2, random_state=42)



### Why Split Data?

- **Training Set (80%)**: Used to teach the model.
- **Testing Set (20%)**: Used to evaluate performance.

Splitting does not by itself make the model generalize. Holding out a test set lets us estimate how well the model performs on emails it has never seen.
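The cells above fit `TfidfVectorizer` on every message before splitting, so the vocabulary and IDF weights are computed partly from emails that later land in the test set. One possible alternative, sketched here on hypothetical toy data, splits the raw text first, passes `stratify` so both parts keep a similar spam/ham ratio, and fits the vectorizer on the training text only. Names such as `data`, `text_train`, and `tfidf` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned DataFrame (1 = spam, 0 = ham)
data = pd.DataFrame({
    "Message": [
        "Win a free prize now", "Claim your cash reward", "Urgent account alert",
        "Free lottery entry today", "Lunch at noon?", "See you tomorrow",
        "Meeting moved to Friday", "Thanks for the notes", "Call me later",
        "Happy birthday!",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Split the raw text first; stratify keeps the class ratio similar in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    data["Message"], data["label"],
    test_size=0.2, random_state=42, stratify=data["label"],
)

# Learn the vocabulary and IDF weights from the training text only
tfidf = TfidfVectorizer(stop_words="english")
X_tr = tfidf.fit_transform(text_train)
X_te = tfidf.transform(text_test)  # reuse the fitted vocabulary

print(X_tr.shape, X_te.shape)
```

Words that occur only in the test messages are ignored by `transform`, which mirrors what happens when the model later meets genuinely new emails.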


### 4.2 Training the Model

In [None]:

model = LogisticRegression()
model.fit(X_train, y_train)
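Once a model is trained, new messages must pass through the same vectorizer before prediction. One way to keep the two steps together is a scikit-learn pipeline; the sketch below assumes a handful of hypothetical training messages and is not the notebook's own setup. `max_iter=1000` is an illustrative choice, not a tuned value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training messages and labels (1 = spam, 0 = ham)
messages = [
    "Win a free prize now",
    "Claim your free cash reward",
    "Urgent: verify your account password",
    "Free lottery tickets, click here",
    "Are we still meeting tomorrow?",
    "Lunch at noon today?",
    "Thanks for sending the notes",
    "Happy birthday, see you tonight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline vectorizes the text and then fits the classifier
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
clf.fit(messages, labels)

# New raw text goes straight in; the fitted vectorizer is applied automatically
new_messages = ["Free prize, claim now", "Are we meeting tomorrow?"]
predictions = clf.predict(new_messages)
spam_probability = clf.predict_proba(new_messages)[:, 1]

print(predictions)
print(spam_probability.round(2))
```

With so few training examples the probabilities are not meaningful in themselves; the point is only that the vectorizer and classifier are fitted and applied as one unit.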
