# Spam Email Detection

This project implements a machine learning pipeline to classify emails as spam or not spam using natural language processing and various classification algorithms. The dataset is preprocessed to clean text data, vectorized, and then used to train multiple models to determine the most effective classifier.

## Features
- Data cleaning (punctuation removal, stop-word removal, stemming).
- Splitting the data into training and test sets.
- Testing multiple machine learning classifiers:
  - Logistic Regression
  - Decision Tree
  - Random Forest
  - Gradient Boosting (XGBoost)
  - Support Vector Classifier
  - K-Nearest Neighbors
  - Naive Bayes
- Performance evaluation using metrics like accuracy, confusion matrix, and classification report.

## Dataset
The project uses the [Spam Mails Dataset], loaded from `spam_ham_dataset.csv`. The email body is in the `text` column, and the `label_num` column marks each message as `0` (not spam) or `1` (spam).

## Installation

1. Clone the repository or download the script file:
   ```bash
   git clone <repository_url>
   ```
2. Install the required libraries:
   ```bash
   pip install numpy pandas matplotlib seaborn scikit-learn xgboost nltk
   ```

3. Download the dataset and save it as `spam_ham_dataset.csv` in the directory you run the script from; the script loads it with `pd.read_csv("spam_ham_dataset.csv")`.

## Usage

1. Run the script to preprocess the data and train the models:
   ```bash
   python spam_email_detection.py
   ```
2. Review the outputs to determine the performance of various classifiers. The best-performing model can be selected for further use or deployment.

## Results

- Random Forest achieved the highest accuracy, **97.64%**, on the test set. The script does not fix the forest's random seed, so the exact figure may vary slightly between runs.
- Confusion matrices and classification reports for all models are displayed in the output.

## Future Improvements

- Fine-tuning hyperparameters to improve model accuracy.
- Testing additional text vectorization techniques like TF-IDF.
- Integrating the model into a real-time spam detection application.
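
TF-IDF, listed above as a possible alternative to raw word counts, could be tried with scikit-learn's `TfidfVectorizer`. The snippet below is only a minimal sketch on three made-up messages (`sample_docs` is illustrative and not taken from the dataset); it is not part of the current script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages for illustration only; not taken from the project dataset.
sample_docs = [
    "win free prize now",
    "meeting schedule for monday",
    "free prize claim now",
]

# TfidfVectorizer tokenizes like CountVectorizer, then down-weights words
# that appear in many documents.
tfidf = TfidfVectorizer()
tfidf_features = tfidf.fit_transform(sample_docs)  # sparse matrix, one row per message

print(tfidf_features.shape)
print(sorted(tfidf.vocabulary_))
```

In the script, this would amount to swapping `CountVectorizer()` for `TfidfVectorizer()`; whether it improves accuracy on this dataset would need to be measured.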

## Dependencies

- Python 3.7+
- Libraries:
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - scikit-learn
  - xgboost
  - nltk


# Importing Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Importing Data

In [None]:
dataset= pd.read_csv("spam_ham_dataset.csv")

In [None]:
dataset.head()

# Data Preprocessing

In [None]:
dataset.isna().sum()

In [None]:
dataset.duplicated().sum()

* The dataset has no null values and no duplicate rows.
* We only need the `text` and `label_num` columns for our purpose.
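
If a different version of the dataset did contain missing values or duplicates, they could be handled along the lines sketched below. The `demo` frame is a made-up stand-in for `dataset`. Resetting the index matters here because the cleaning loop further down accesses rows by position.

```python
import pandas as pd

# Made-up stand-in for the real dataset, with one missing text and one duplicate row.
demo = pd.DataFrame({
    "text": ["hello", "offer", "hello", None],
    "label_num": [0, 1, 0, 1],
})

missing_per_column = demo.isna().sum()         # count of NaN values per column
duplicate_rows = int(demo.duplicated().sum())  # rows identical to an earlier row

# Drop incomplete and repeated rows, then renumber the index from 0.
cleaned = demo.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(duplicate_rows)
print(cleaned)
```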

In [None]:
df= dataset[["text","label_num"]]
df.head()

In [None]:
length=len(df["text"])

# Cleaning text

* We remove punctuation and other non-letter characters from the text.
* We then drop stop words and stem each remaining word to its root form.

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

# build the stemmer and the stop-word set once, outside the loop
ps = PorterStemmer()
all_stopwords = set(stopwords.words("english"))

# cleaned emails, one string per message, used later for training
corpus = []

for raw_text in df["text"]:
    # replace every non-letter character (punctuation, digits) with a space
    text = re.sub('[^a-zA-Z]', ' ', raw_text)

    # convert to lower case and split into words
    text = text.lower().split()

    # drop stop words and stem the remaining words
    text = [ps.stem(word) for word in text if word not in all_stopwords]

    corpus.append(' '.join(text))

In [None]:
corpus[0:5]
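
To see what the regex and stop-word steps do to a single message, here is a small standalone sketch. It uses only the standard library: `SAMPLE_STOPWORDS` is a tiny illustrative subset rather than NLTK's full English list, `basic_clean` is a hypothetical helper name, and stemming is left out on purpose.

```python
import re

# Illustrative subset only; the notebook uses NLTK's full English stop-word list.
SAMPLE_STOPWORDS = {"the", "is", "at", "on", "for", "a", "to"}

def basic_clean(text, stop_words=SAMPLE_STOPWORDS):
    """Keep letters only, lower-case, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = basic_clean("Subject: Meeting at 10:30 on Friday!!! Reply to confirm.")
print(cleaned)  # subject meeting friday reply confirm
```

Note that the time `10:30` disappears entirely, because the pattern removes digits as well as punctuation.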


In [None]:
df=df.copy()

In [None]:
df["clean_text"]=corpus

In [None]:
df["clean_text"]=df["clean_text"].str.replace("subject","")
df

# Splitting Data into Training and Test Sets

In [None]:
x=df.loc[:,"clean_text"].values
y=df.loc[:,"label_num"].values

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
x=cv.fit_transform(x).toarray()

In [None]:
x

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
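
In the cells above, the vectorizer is fitted on all messages before the split, so the vocabulary also reflects the test emails. One possible alternative, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so the vocabulary is learned from the training portion only. The `toy_*` names and messages are illustrative, and this is not what the notebook currently does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned messages for illustration; 1 = spam, 0 = not spam.
toy_texts = [
    "win free prize now", "claim free cash reward",
    "cheap pill order now", "free offer click link",
    "meet schedul monday morn", "pleas review attach report",
    "lunch tomorrow offic", "project updat deadlin friday",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits CountVectorizer on the training text only, then the classifier.
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
toy_pipeline.fit(toy_train, toy_y_train)

toy_accuracy = toy_pipeline.score(toy_test, toy_y_test)
print(toy_accuracy)
```

A pipeline also keeps the features sparse instead of calling `.toarray()`, which can save memory on large vocabularies; note that `GaussianNB` requires dense input, so it would not fit this pattern unchanged.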

# Classifying Using Various Classifiers

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xg
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report

In [None]:
def train_test_model(model):
    model.fit(x_train,y_train)
    y_pred=model.predict(x_test)
    score=accuracy_score(y_test,y_pred)
    print(f'accuracy score of {model} is :{score}')
    print(f'{classification_report(y_test,y_pred)}')
    mx=confusion_matrix(y_test,y_pred)
    sns.heatmap(mx,annot=True, fmt='d', cmap='Blues', xticklabels=['Not spam', 'spam'], yticklabels=['Not spam', 'spam'])
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()


In [None]:
train_test_model(LogisticRegression())

In [None]:
train_test_model(DecisionTreeClassifier(max_depth=20))

In [None]:
train_test_model(RandomForestClassifier())

In [None]:
train_test_model(xg.XGBClassifier(learning_rate= 0.2, max_depth=7, n_estimators= 100))

In [None]:
train_test_model(SVC())

In [None]:
train_test_model(KNeighborsClassifier())

In [None]:
train_test_model(GaussianNB())
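
The separate calls above print one report per model. To rank the models side by side, the accuracies could be collected into a single table, as in the sketch below. `compare_models` is a hypothetical helper, and the synthetic data from `make_classification` stands in for the vectorized emails so the example runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_models(models, x_tr, x_te, y_tr, y_te):
    """Fit each model and return a table of test accuracies, best first."""
    rows = []
    for name, model in models.items():
        model.fit(x_tr, y_tr)
        rows.append({"model": name, "accuracy": accuracy_score(y_te, model.predict(x_te))})
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False).reset_index(drop=True)

# Synthetic stand-in data; in the notebook these would be x_train, x_test, y_train, y_test.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
demo_split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

summary = compare_models(
    {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(max_depth=20, random_state=42),
    },
    *demo_split,
)
print(summary)
```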

# Results and Inference

In [None]:
# share of each class in the dataset
df['label_num'].value_counts(normalize=True)

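The class proportions above show the accuracy that a trivial classifier would reach by labeling every mail as not spam. Our model scores higher than this baseline, so using the model is better than always predicting "not spam".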
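
The comparison against "always predict not spam" can be made explicit with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels (7 not spam, 3 spam) rather than the real dataset, so the resulting number is illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration: 7 "not spam" (0) and 3 "spam" (1).
demo_labels = np.array([0] * 7 + [1] * 3)
demo_features = np.zeros((len(demo_labels), 1))  # features are ignored by this baseline

# "most_frequent" always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(demo_features, demo_labels)

baseline_accuracy = baseline.score(demo_features, demo_labels)
print(baseline_accuracy)  # 0.7, the share of the majority class
```

A trained classifier is only useful if its test accuracy clearly exceeds this majority-class figure.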

Among the classifiers tested, `random forest` has the best accuracy, `97.64`%.
* Further improvement may be possible by tuning the hyperparameters.
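
Hyperparameter tuning, as suggested above, could be sketched with `GridSearchCV`. The parameter grid and the synthetic data below are assumptions chosen to keep the example small; they are not tuned values for this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook this would be x_train and y_train.
X_tune, y_tune = make_classification(n_samples=200, n_features=20, random_state=42)

# Small illustrative grid; real searches would usually cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # fixed seed for repeatable results
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_tune, y_tune)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
print(best_params, best_score)
```

The selected setting should still be checked once on the held-out test set, since `best_score_` is computed on the training folds only.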