**2. Classify emails using binary classification. Email spam detection has two states: a) Normal State – Not Spam, b) Abnormal State – Spam.**
Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze their performance.
Dataset link: the emails.csv dataset on Kaggle
https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

**Step 1: Load and Preprocess the Dataset**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Load the dataset
data = pd.read_csv('emails.csv')

In [3]:
# Display the first few rows
print(data.head())

# Check the column names
print(data.columns)

  Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  \
0   Email 1    0   0    1    0    0   0    2    0    0  ...       0.0  0.0   
1   Email 2    8  13   24    6    6   2  102    1   27  ...       0.0  0.0   
2   Email 3    0   0    1    0    0   0    8    0    0  ...       0.0  0.0   
3   Email 4    0   5   22    0    5   1   51    2   10  ...       0.0  0.0   
4   Email 5    7   6   17    1    5   2   57    0    9  ...       0.0  0.0   

   valued  lay  infrastructure  military  allowing   ff  dry  Prediction  
0     0.0  0.0             0.0       0.0       0.0  0.0  0.0         0.0  
1     0.0  0.0             0.0       0.0       0.0  1.0  0.0         0.0  
2     0.0  0.0             0.0       0.0       0.0  0.0  0.0         0.0  
3     0.0  0.0             0.0       0.0       0.0  0.0  0.0         0.0  
4     0.0  0.0             0.0       0.0       0.0  1.0  0.0         0.0  

[5 rows x 3002 columns]
Index(['Email No.', 'the', 'to', 'ect', 'and', 'for', 'o

In [4]:
# Check for missing values
print(data.isnull().sum())

Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      1
allowing      1
ff            1
dry           1
Prediction    1
Length: 3002, dtype: int64


In [5]:
# Drop every row that contains at least one missing value
data.dropna(inplace=True)
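
The missing-value counts above show a 1 for several trailing columns, which would be consistent with a single incomplete row, although the output alone does not confirm that. Below is a minimal sketch of counting the affected rows before dropping them. The frame `toy` and its values are hypothetical stand-ins, not taken from emails.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the email data; the last row is incomplete
toy = pd.DataFrame({
    "the": [0, 8, 3],
    "dry": [0.0, 1.0, np.nan],
    "Prediction": [0.0, 1.0, np.nan],
})

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# dropna() without inplace returns a new frame and leaves the original untouched
cleaned = toy.dropna()

print("Rows with missing values:", rows_with_missing)
print("Rows before/after:", len(toy), len(cleaned))
```

Comparing the row counts before and after the drop is a cheap way to confirm that only the expected rows were removed.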

**Step 2: Prepare Features and Target Variable**

In [7]:
# Prepare features and target variable
X = data.drop(columns=['Email No.', 'Prediction'])  # Drop non-feature columns
y = data['Prediction']  # Use 'Prediction' as the target variable

**Step 3: Split the Data into Training and Testing Sets**

In [8]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
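
The split above is purely random. Since the classes are imbalanced (the test set in the reports further down holds 102 non-spam and 37 spam emails), one optional variation is to pass `stratify=y` so both splits keep the same class ratio. The sketch below illustrates this on synthetic data; `X_demo` and `y_demo` are hypothetical names, and using this variation on the real data would change the scores reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 samples, 25% positive class
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 10, size=(200, 5))
y_demo = np.array([1] * 50 + [0] * 150)

# stratify=y_demo preserves the class ratio in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share in train:", yd_train.mean())
print("Positive share in test:", yd_test.mean())
```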

**Step 4: Train K-Nearest Neighbors (KNN) Model**

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

In [10]:
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train the KNN model
knn.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn.predict(X_test)

# Evaluate the KNN model
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))

KNN Classification Report:
              precision    recall  f1-score   support

         0.0       0.92      0.92      0.92       102
         1.0       0.78      0.78      0.78        37

    accuracy                           0.88       139
   macro avg       0.85      0.85      0.85       139
weighted avg       0.88      0.88      0.88       139

KNN Accuracy: 0.8848920863309353
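
The value `n_neighbors=5` above is the scikit-learn default, not a tuned choice. A minimal sketch of comparing several values of k with 5-fold cross-validation follows. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), and the candidate list is arbitrary; on the real data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split (hypothetical data, not emails.csv)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Mean 5-fold cross-validated accuracy for each candidate k
candidate_ks = [1, 3, 5, 7, 9]
cv_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Pick the k with the highest mean accuracy
best_k = max(cv_scores, key=cv_scores.get)
print("CV accuracy by k:", cv_scores)
print("Best k:", best_k)
```

Tuning on cross-validation folds of the training data keeps the test set untouched for the final comparison with the SVM.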


**Step 5: Train Support Vector Machine (SVM) Model**

In [11]:
from sklearn.svm import SVC

In [12]:
# Initialize the SVM classifier
svm = SVC(kernel='linear')

# Train the SVM model
svm.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm.predict(X_test)

# Evaluate the SVM model
print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))

SVM Classification Report:
              precision    recall  f1-score   support

         0.0       0.98      0.95      0.97       102
         1.0       0.88      0.95      0.91        37

    accuracy                           0.95       139
   macro avg       0.93      0.95      0.94       139
weighted avg       0.95      0.95      0.95       139

SVM Accuracy: 0.9496402877697842
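
To analyze the two models side by side, it helps to look at confusion counts. The notebook does not print these; the counts below are inferred from the rounded precision, recall, support, and accuracy values in the two reports, so treat them as an assumption that happens to reproduce those figures. The helper `summarize` is a hypothetical name introduced for this sketch.

```python
# Confusion counts inferred from the reports above (an assumption, not notebook output).
# tn = non-spam kept, fp = non-spam flagged as spam, fn = spam missed, tp = spam caught
inferred = {
    "KNN": {"tn": 94, "fp": 8, "fn": 8, "tp": 29},
    "SVM": {"tn": 97, "fp": 5, "fn": 2, "tp": 35},
}

def summarize(c):
    """Return accuracy plus spam-class precision and recall from confusion counts."""
    total = c["tn"] + c["fp"] + c["fn"] + c["tp"]
    return {
        "accuracy": (c["tn"] + c["tp"]) / total,
        "spam_precision": c["tp"] / (c["tp"] + c["fp"]),
        "spam_recall": c["tp"] / (c["tp"] + c["fn"]),
    }

summary = {name: summarize(c) for name, c in inferred.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Under these inferred counts, the linear SVM misses 2 of the 37 spam emails where KNN misses 8, and it flags 5 legitimate emails as spam where KNN flags 8. On this split the SVM is therefore better on both kinds of error, which matches its higher accuracy (about 0.95 versus 0.88). The test set has only 139 emails, so the gap should be read as indicative and not as a precise estimate.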
