# Spam Email Detection

## Introduction

This project focuses on detecting spam emails using a dataset containing information from 5,172 randomly selected email files. The goal is to build a classification model that can accurately distinguish between spam and not-spam emails based on the content of the emails.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

## About the Dataset

The dataset is provided in a CSV file with the following characteristics:

- **Rows**: 5,172 rows, each representing an individual email.
- **Columns**: 3,002 columns in total.
  - **First Column**: Indicates the email name. The names have been anonymized with numbers to protect privacy.
  - **Last Column**: Contains the labels for classification:
    - `1` for spam emails.
    - `0` for not-spam emails.
  - **Remaining 3,000 Columns**: These columns represent the 3,000 most common words across all emails, excluding non-alphabetical characters. Each cell in these columns contains the count of the respective word in the corresponding email.

This compact representation allows for efficient processing and analysis of email data without needing to work with separate text files.
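
To make the layout concrete, the following is a minimal sketch that builds a tiny table of the same shape from three made-up emails. The email texts, the `tokenize` helper, the `Email name` column label, and the vocabulary size of 3 are all assumptions chosen for illustration; they are not taken from the dataset or from the way it was originally built.

```python
import re
from collections import Counter

import pandas as pd

# Made-up emails and labels, purely for illustration (not from the dataset)
emails = {
    "Email 1": "Win a free prize now, free entry!",
    "Email 2": "Meeting moved to 3pm, see you there",
    "Email 3": "Free free free: claim your prize",
}
labels = {"Email 1": 1, "Email 2": 0, "Email 3": 1}


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())


# Count words per email, then rank words by their total count across all emails
counts = {name: Counter(tokenize(text)) for name, text in emails.items()}
total = Counter()
for per_email in counts.values():
    total.update(per_email)
vocab = [word for word, _ in total.most_common(3)]  # the real dataset keeps 3,000

# One row per email: name first, word counts in the middle, label last
toy = pd.DataFrame(
    [[counts[name][word] for word in vocab] for name in emails],
    columns=vocab,
)
toy.insert(0, "Email name", list(emails))
toy["Prediction"] = [labels[name] for name in emails]
print(toy)
```

Each row holds one email, each middle column holds the count of one frequent word, and the last column holds the spam label, which is the same arrangement described above at a much smaller scale.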

## Problem Statement

1. **Dimensionality Reduction**: The data is high-dimensional (3,000 word-count features), so apply Principal Component Analysis (PCA) and keep only the principal components needed to preserve 95% of the variance.

### Load Libraries

In [1]:
import os
import warnings

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

### Settings

In [2]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
csv_path = os.path.join(data_path, "emails_or.csv")

### Load Data

In [3]:
df = pd.read_csv(csv_path)

In [4]:
# Check data
df.head()

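| | the | to | ect | and | for | of | a | you | hou | in | ... | connevey | jay | valued | lay | infrastructure | military | allowing | ff | dry | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | 13 | 24 | 6 | 6 | 2 | 102 | 1 | 27 | 18 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 5 | 22 | 0 | 5 | 1 | 51 | 2 | 10 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 7 | 6 | 17 | 1 | 5 | 2 | 57 | 0 | 9 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |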


### Preprocessing

In [5]:
# Separate the input features (every column except the last one, `Prediction`)
X = df.iloc[:, :-1]

In [7]:
# Standardize the features before principal component analysis (PCA)
scaler = StandardScaler()
X_s = scaler.fit_transform(X)
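
As a small illustration of what this step does, the sketch below checks on synthetic data that `StandardScaler` rescales each column to zero mean and unit variance, which keeps very frequent words from dominating the principal components simply because their counts are larger. The Poisson-distributed `X_demo` array is a made-up stand-in for word counts, not the email data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for word counts: three columns with very different scales
X_demo = rng.poisson(lam=[20.0, 2.0, 0.5], size=(200, 3)).astype(float)

X_demo_s = StandardScaler().fit_transform(X_demo)

# Equivalent manual computation: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_demo_s, manual))
print(X_demo_s.mean(axis=0).round(6), X_demo_s.std(axis=0).round(6))
```

Columns with zero variance (for example a word that never occurs) are left at a constant value by `StandardScaler` rather than divided by zero, so they contribute nothing to the components.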

In [8]:
# Use PCA to reduce the dimensionality of the data while preserving 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_s)
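
Passing a float between 0 and 1 as `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces that selection by hand on synthetic data; the `latent`, `mixing`, and `X_demo` arrays are assumptions made up for the example and have nothing to do with the email features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic correlated data: 10 observed features driven by 3 latent factors
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 10))
X_demo_s = StandardScaler().fit_transform(X_demo)

# Let scikit-learn choose the number of components for 95% of the variance
pca_demo = PCA(n_components=0.95).fit(X_demo_s)

# Manual check: fit all components, then find the first component count
# whose cumulative explained variance ratio exceeds 0.95
full = PCA(svd_solver="full").fit(X_demo_s)
cumulative = np.cumsum(full.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

print(pca_demo.n_components_, k)
print(pca_demo.explained_variance_ratio_.sum())
```

On the fitted `pca` object from the cell above, the same attributes (`n_components_` and `explained_variance_ratio_`) report how many components were kept and how much variance they retain.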

In [18]:
# Generate column names for the PCA features
column_names = [f"PC {i + 1}" for i in range(X_pca.shape[1])]
# Convert the PCA result array to a DataFrame
df_pca = pd.DataFrame(X_pca, columns=column_names)

In [19]:
# Add the target to this dataframe
df_pca["Prediction"] = df["Prediction"]

In [20]:
# Sanity check
df_pca.head()

| | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 | PC 8 | PC 9 | PC 10 | ... | PC 1217 | PC 1218 | PC 1219 | PC 1220 | PC 1221 | PC 1222 | PC 1223 | PC 1224 | PC 1225 | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -9.660095 | 0.987285 | -0.189626 | -0.254604 | 0.08237 | -0.134656 | 0.455017 | 0.025014 | 0.097693 | -0.175794 | ... | 0.103395 | -0.036508 | -0.162385 | -0.350383 | -0.243013 | -0.288009 | 0.146185 | 0.164837 | -0.151256 | 0 |
| 1 | 22.357003 | -9.544622 | 3.533676 | -2.025507 | 4.462641 | -5.34502 | -1.193437 | 1.00805 | -4.411091 | 5.298597 | ... | -0.313507 | 0.126721 | -0.429281 | -0.313418 | -0.490992 | 0.082647 | 0.221526 | 0.524422 | 0.140697 | 0 |
| 2 | -9.110445 | 0.489013 | 0.599765 | 0.095861 | 0.060251 | -0.272427 | 0.45594 | 0.211488 | -0.030948 | 0.126848 | ... | 0.549715 | 0.268953 | 0.144322 | -0.200733 | -0.345378 | 0.581285 | 0.135372 | -0.486959 | 0.250638 | 0 |
| 3 | 5.476925 | -8.115297 | 3.500444 | -7.06546 | -0.743282 | 6.695316 | -2.791357 | -0.140759 | 3.461559 | -2.794691 | ... | -0.939405 | -0.629226 | -0.505552 | 0.999678 | 0.342748 | -0.464474 | -0.604258 | 0.875298 | -0.694246 | 0 |
| 4 | 4.552939 | -6.495357 | 3.364397 | -6.044998 | -1.263732 | 5.399054 | -0.500196 | 1.495957 | 1.070027 | -0.591191 | ... | -0.034365 | 0.735361 | 0.038051 | 1.130274 | -0.403381 | -0.825087 | -0.615582 | 1.0907 | -0.551134 | 0 |


In [23]:
# Save the PCA-transformed data (without the DataFrame index)
pca_path = os.path.join(data_path, "emails_pca.csv")
df_pca.to_csv(pca_path, index=False)
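
A quick way to confirm that a save like this round-trips is to read the file back and compare it with the in-memory frame. The sketch below does so on a two-row toy frame written to a temporary directory; the `demo` frame and the `demo_pca.csv` file name are placeholders for illustration, not the project's actual output.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with the same column style as the PCA output (placeholder values)
demo = pd.DataFrame(
    {
        "PC 1": [-9.660095, 22.357003],
        "PC 2": [0.987285, -9.544622],
        "Prediction": [0, 0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_pca.csv")
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back
    demo.to_csv(demo_path, index=False)
    reloaded = pd.read_csv(demo_path)

same_columns = list(reloaded.columns) == list(demo.columns)
# Compare with a tolerance, since floats pass through a text representation
same_values = np.allclose(reloaded.to_numpy(), demo.to_numpy())
print(same_columns, same_values)
```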