This notebook classifies emails as spam or not spam with two models, K-Nearest Neighbors (KNN) and Support Vector Machine (SVM). Here is a quick breakdown of the approach:

Theory Summary
K-Nearest Neighbors (KNN):

KNN is a simple, instance-based algorithm that classifies data by finding the “k” closest training points to a given input and assigning the majority label.
It relies on the idea that similar instances are likely to belong to the same class.
KNN is intuitive but can be computationally expensive at prediction time on large datasets, because each new input has to be compared against the stored training points.
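
To make the majority-vote idea concrete, here is a minimal from-scratch sketch of KNN. It is not the scikit-learn implementation used later in this notebook; the function name `knn_predict` and the toy word-count vectors are hypothetical and exist only for illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical word-count vectors: [count of "free", count of "meeting"]
toy_points = [[5, 0], [4, 1], [6, 0], [0, 3], [1, 4], [0, 5]]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

print(knn_predict(toy_points, toy_labels, [5, 1], k=3))
print(knn_predict(toy_points, toy_labels, [0, 4], k=3))
```

Every prediction scans all training points, which is the cost mentioned above.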

Support Vector Machine (SVM):

SVM is a classification model that finds the hyperplane (decision boundary) separating the classes (spam vs. not spam) with the largest possible margin.
For data that is not linearly separable, SVM can use the kernel trick (e.g., RBF or polynomial kernels) to implicitly map the data into a higher-dimensional space where it may become linearly separable.
SVM is effective in high-dimensional spaces, which makes it suitable for text classification tasks with many word-count features.
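
The effect of the kernel trick can be sketched on synthetic data. The example below assumes scikit-learn's `make_circles` toy dataset (not the email data): one class forms a ring around the other, so no straight line separates them. All variable names are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 2-D data: an inner cluster surrounded by an outer ring
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_circ_tr, X_circ_te, y_circ_tr, y_circ_te = train_test_split(
    X_circ, y_circ, test_size=0.3, random_state=0
)

# Same data, two kernels: a straight-line boundary vs. an RBF boundary
linear_acc = SVC(kernel="linear").fit(X_circ_tr, y_circ_tr).score(X_circ_te, y_circ_te)
rbf_acc = SVC(kernel="rbf").fit(X_circ_tr, y_circ_tr).score(X_circ_te, y_circ_te)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

On data like this the RBF kernel should score clearly higher. For sparse, high-dimensional word counts, a linear kernel is often a reasonable starting point.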

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [2]:

df = pd.read_csv(r'F:\7. Seventh Seemester Degeree\ML Assignments\email_spam\emails.csv')  # Use the correct path to the CSV file

# Display the first few rows of the dataset
print(df.head())


  Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  \
0   Email 1    0   0    1    0    0   0    2    0    0  ...         0    0   
1   Email 2    8  13   24    6    6   2  102    1   27  ...         0    0   
2   Email 3    0   0    1    0    0   0    8    0    0  ...         0    0   
3   Email 4    0   5   22    0    5   1   51    2   10  ...         0    0   
4   Email 5    7   6   17    1    5   2   57    0    9  ...         0    0   

   valued  lay  infrastructure  military  allowing  ff  dry  Prediction  
0       0    0               0         0         0   0    0           0  
1       0    0               0         0         0   1    0           0  
2       0    0               0         0         0   0    0           0  
3       0    0               0         0         0   0    0           0  
4       0    0               0         0         0   1    0           0  

[5 rows x 3002 columns]


In [8]:
# Define features and target
X = df.drop(['Email No.', 'Prediction'], axis=1)  # Drop the 'Email No.' identifier (not a predictive feature) and the target column
y = df['Prediction']  # Target variable for Spam/Not Spam


In [9]:
df.head()

  Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  valued  lay  infrastructure  military  allowing  ff  dry  Prediction
0   Email 1    0   0    1    0    0   0    2    0    0  ...         0    0       0    0               0         0         0   0    0           0
1   Email 2    8  13   24    6    6   2  102    1   27  ...         0    0       0    0               0         0         0   1    0           0
2   Email 3    0   0    1    0    0   0    8    0    0  ...         0    0       0    0               0         0         0   0    0           0
3   Email 4    0   5   22    0    5   1   51    2   10  ...         0    0       0    0               0         0         0   0    0           0
4   Email 5    7   6   17    1    5   2   57    0    9  ...         0    0       0    0               0         0         0   1    0           0


In [10]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
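
Spam datasets are often imbalanced, so it can help to keep the class ratio the same in both splits. The sketch below is an optional variation, not what this notebook does: it uses synthetic labels with an assumed 30% spam share and the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 samples, 30% labelled spam
rng = np.random.default_rng(0)
X_bal = rng.integers(0, 10, size=(1000, 5))
y_bal = np.array([1] * 300 + [0] * 700)

# Plain split: the spam share in the test set can drift from the overall share
_, _, _, y_te_plain = train_test_split(X_bal, y_bal, test_size=0.3, random_state=42)

# Stratified split: the spam share is preserved in both train and test sets
_, _, _, y_te_strat = train_test_split(
    X_bal, y_bal, test_size=0.3, random_state=42, stratify=y_bal
)

print("Overall spam share:   ", y_bal.mean())
print("Plain test share:     ", y_te_plain.mean())
print("Stratified test share:", y_te_strat.mean())
```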


In [13]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
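
The scaler above is fitted on the training data only, which avoids leaking test-set statistics into the model. One way to keep that guarantee during cross-validation is to bundle the scaler and the classifier in a pipeline. This is a sketch on synthetic data from `make_classification`, not on the email dataset, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the word-count features (assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)

# The pipeline refits the scaler inside each fold, using that fold's training part only
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_scores = cross_val_score(scaled_knn, X_syn, y_syn, cv=5)

print("Fold accuracies:", cv_scores)
print("Mean accuracy:  ", cv_scores.mean())
```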


In [14]:
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train the KNN model
knn.fit(X_train, y_train)

# Predict using the KNN model
y_pred_knn = knn.predict(X_test)

# Evaluate the KNN model
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))


KNN Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.77      0.86      1097
           1       0.63      0.96      0.76       455

    accuracy                           0.83      1552
   macro avg       0.81      0.86      0.81      1552
weighted avg       0.88      0.83      0.83      1552

KNN Accuracy: 0.8253865979381443
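
The model above uses a fixed `n_neighbors=5`. A possible next step is to search over `k` and the neighbour weighting with cross-validation. This is a sketch on synthetic data with hypothetical parameter ranges, so the values it selects say nothing about the email dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in data (assumption)
X_tune, y_tune = make_classification(
    n_samples=300, n_features=20, weights=[0.7, 0.3], random_state=0
)

knn_pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Hypothetical search space: a few odd values of k and both weighting schemes
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 11],
    "knn__weights": ["uniform", "distance"],
}

# Score by F1 on the positive class rather than plain accuracy
search = GridSearchCV(knn_pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tune, y_tune)

print("Best parameters:", search.best_params_)
print("Best mean F1:   ", search.best_score_)
```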


In [15]:
# Initialize the SVM classifier
svm = SVC(kernel='linear')

# Train the SVM model
svm.fit(X_train, y_train)

# Predict using the SVM model
y_pred_svm = svm.predict(X_test)

# Evaluate the SVM model
print("\nSVM Classification Report:")
print(classification_report(y_test, y_pred_svm))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))



SVM Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.95      0.96      1097
           1       0.89      0.91      0.90       455

    accuracy                           0.94      1552
   macro avg       0.92      0.93      0.93      1552
weighted avg       0.94      0.94      0.94      1552

SVM Accuracy: 0.9400773195876289
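
The notebook imports `confusion_matrix` but never calls it. Here is a small sketch of how it breaks predictions into the four outcome counts, using made-up labels rather than the models' actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = spam, 0 = not spam)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, rows are true classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

print("True negatives: ", tn)
print("False positives:", fp)  # legitimate emails flagged as spam
print("False negatives:", fn)  # spam that slipped through
print("True positives: ", tp)
```

In the notebook, the same call could be made as `confusion_matrix(y_test, y_pred_knn)` or `confusion_matrix(y_test, y_pred_svm)`.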
