# TASK-4
### EMAIL SPAM DETECTION WITH MACHINE LEARNING

#### Problem Statement 

We’ve all received spam before. Spam mail, or junk mail, is email sent to a massive number of users at once, frequently containing cryptic messages, scams, or, most dangerously, phishing content. In this project, we use Python to build a spam detector, then use machine learning to train it to classify messages as spam or non-spam (ham). The dataset linked below is the SMS Spam Collection, so the examples are short text messages rather than full emails. Let’s get started!

### Dataset:
https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

### Github:
https://github.com/charankumar-77/OIBSIP.git 

### Name: Charan Mandula
### Email: charanmandula07@gmail.com

#### Import Library


In [1]:
#import libraries 
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix


#### Load Dataset

In [2]:
#Load the dataset 
df=pd.read_csv("spam.csv",encoding='latin-1')
labels=df['v1']
emails=df['v2']

In [3]:
# Displaying first 5 records 
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
# Displaying last 5 records 
df.tail()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


In [5]:
#Information about the dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [6]:
#size of the dataset (Rows & columns)
df.shape


(5572, 5)

In [7]:
# Print Column Names Only
df.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [8]:
# Summary statistics
# Every column has dtype object, so describe() reports count, unique, top and freq
df.describe()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2
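
The `describe()` output above already hints at the class balance: `ham` is the most frequent label, with 4825 of the 5572 rows. Here is a minimal sketch of what that implies, using only those two numbers. On the real DataFrame, `df['v1'].value_counts(normalize=True)` would give the same shares directly.

```python
# Counts taken from the describe() output above (before duplicates are removed).
total_rows = 5572
ham_rows = 4825
spam_rows = total_rows - ham_rows

ham_share = ham_rows / total_rows
spam_share = spam_rows / total_rows

print(f"ham:  {ham_rows} rows ({ham_share:.1%})")
print(f"spam: {spam_rows} rows ({spam_share:.1%})")
```

Because roughly 87% of the raw messages are ham, a model that always answers `ham` would already reach about that accuracy, so accuracy alone is not enough to judge a spam detector.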


#### Drop unimportant columns

In [9]:
# Remove the mostly empty columns 'Unnamed: 2', 'Unnamed: 3' and 'Unnamed: 4'
df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1, inplace=True)
df

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
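
The names `v1` and `v2` come straight from the CSV file and say little about their content. One optional step, not part of the original notebook, is to give them descriptive names and add a numeric target. The sketch below works on a two-row stand-in. The names `label`, `message` and `is_spam` are illustrative choices, and the later cells keep using `v1` and `v2`.

```python
import pandas as pd

# Two-row stand-in with the same layout as the cleaned DataFrame.
sample = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok lar... Joking wif u oni...",
           "Free entry in 2 a wkly comp to win FA Cup fina..."],
})

# rename() returns a new DataFrame; `sample` itself is left unchanged.
readable = sample.rename(columns={"v1": "label", "v2": "message"})

# Hypothetical numeric target: 1 for spam, 0 for ham.
readable["is_spam"] = (readable["label"] == "spam").astype(int)
print(readable)
```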


#### Check for Null Values and Manipulate Data, if Null values found

In [10]:
#Finding the null values 
df.isnull().sum()

v1    0
v2    0
dtype: int64

In [11]:
# Display the rows that contain a null value in any column
rows_with_null = df[df.isnull().any(axis=1)]
rows_with_null

Unnamed: 0,v1,v2


####  Check for Duplicates and Drop those Rows, if Duplicates found

In [12]:
# This will display all duplicate rows (excluding the first occurrence)
duplicate_rows = df[df.duplicated()]
duplicate_rows

Unnamed: 0,v1,v2
102,ham,As per your request 'Melle Melle (Oru Minnamin...
153,ham,As per your request 'Melle Melle (Oru Minnamin...
206,ham,"As I entered my cabin my PA said, '' Happy B'd..."
222,ham,"Sorry, I'll call later"
325,ham,No calls..messages..missed calls
...,...,...
5524,spam,You are awarded a SiPix Digital Camera! call 0...
5535,ham,"I know you are thinkin malaria. But relax, chi..."
5539,ham,Just sleeping..and surfing
5553,ham,Hahaha..use your brain dear


In [13]:
# This will display all duplicate rows (including the first occurrence)
all_duplicate_rows = df[df.duplicated(keep=False)]
all_duplicate_rows

Unnamed: 0,v1,v2
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...
11,spam,"SIX chances to win CASH! From 100 to 20,000 po..."
...,...,...
5524,spam,You are awarded a SiPix Digital Camera! call 0...
5535,ham,"I know you are thinkin malaria. But relax, chi..."
5539,ham,Just sleeping..and surfing
5553,ham,Hahaha..use your brain dear


In [14]:
# Remove duplicate rows, if any
# Keep the first occurrence and drop its later duplicates
df = df.drop_duplicates(keep='first')

In [15]:
# This will display all duplicate rows (excluding the first occurrence)
duplicate_rows = df[df.duplicated()]
duplicate_rows

Unnamed: 0,v1,v2
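
The notebook imports `re` and NLTK's `stopwords` but feeds the raw messages straight to the vectorizer. As an optional extra, a text-cleaning step could look roughly like the sketch below. The helper `clean_text` and the tiny stop-word set are assumptions made for illustration; `nltk.corpus.stopwords.words("english")` would be a fuller list once the NLTK corpus has been downloaded.

```python
import re

# Tiny illustrative stop-word set (an assumption, not the NLTK list).
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "of", "and"}

def clean_text(text):
    """Lowercase, keep letters only, and drop stop words (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("Free entry in 2 a wkly comp to win FA Cup!"))
```

Whether such cleaning helps is an empirical question; digits and punctuation can themselves be useful spam signals.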


In [16]:
# Split messages (v2) and labels (v1) into 80% training and 20% test data
x_train, x_test, y_train, y_test = train_test_split(df['v2'], df['v1'], test_size=0.2, random_state=42)
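
The split above is purely random. Because spam is the minority class, one variation is to pass `stratify` so that both parts keep the same spam share. Here is a small sketch on synthetic labels; the 32/8 class mix is made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Synthetic data: 32 ham and 8 spam messages (20% spam).
messages = [f"message {i}" for i in range(40)]
labels = ["ham"] * 32 + ["spam"] * 8

x_tr, x_te, y_tr, y_te = train_test_split(
    messages, labels,
    test_size=0.25,
    random_state=42,
    stratify=labels,  # keep the ham/spam ratio identical in both parts
)

print("spam in train:", y_tr.count("spam"), "of", len(y_tr))
print("spam in test: ", y_te.count("spam"), "of", len(y_te))
```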

In [17]:
# Learn the vocabulary from the training messages only, then reuse it for the test messages
vectorizer = CountVectorizer()
x_train_vectorized = vectorizer.fit_transform(x_train)
x_test_vectorized = vectorizer.transform(x_test)
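
To make the bag-of-words step concrete, here is a toy sketch of what `CountVectorizer` produces. The short messages are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free entry free prize", "call me later"]
toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(train_counts.toarray())  # one row per message, one column per word

# Words never seen during fit ("cash") are ignored by transform.
new_counts = toy_vectorizer.transform(["free cash"])
print(new_counts.toarray())
```

This is why `fit_transform` is applied to the training messages and only `transform` to the test messages: the vocabulary is learned from the training data alone.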

In [18]:
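# Train a multinomial Naive Bayes classifier on the word counts
classifier = MultinomialNB()
classifier.fit(x_train_vectorized, y_train)

In [19]:
# Predict labels for the held-out test messages
y_pred = classifier.predict(x_test_vectorized)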
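
Any new message has to go through the same two steps: vectorize, then predict. One way to keep them together is a scikit-learn pipeline. The sketch below trains on four invented messages only, so its predictions illustrate the mechanics rather than the behaviour of the notebook's model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Four invented training messages, two per class.
toy_messages = [
    "win free prize now",
    "free cash prize call now",
    "are we meeting for lunch",
    "call me when you get home",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and feeds the counts to Naive Bayes.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_messages, toy_labels)

predictions = spam_pipeline.predict(["free prize", "meeting for lunch"])
print(list(predictions))
```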

In [20]:
# Compare the predictions with the true test labels
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

In [21]:
print(f"Accuracy : {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

Accuracy : 0.9854932301740812
Confusion Matrix:
[[887   2]
 [ 13 132]]
Classification Report:
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       889
        spam       0.99      0.91      0.95       145

    accuracy                           0.99      1034
   macro avg       0.99      0.95      0.97      1034
weighted avg       0.99      0.99      0.99      1034
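
The numbers in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the spam-class metrics from the four counts printed above. It assumes scikit-learn's default layout, where rows are true labels and columns are predicted labels in sorted order (`ham`, then `spam`).

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, in the order [ham, spam].
conf = [[887, 2],
        [13, 132]]

tn, fp = conf[0]  # true ham:  predicted ham / predicted spam
fn, tp = conf[1]  # true spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)  # share of spam predictions that are correct
spam_recall = tp / (tp + fn)     # share of actual spam that is caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy:       {accuracy:.4f}")
print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall:    {spam_recall:.2f}")
print(f"spam f1-score:  {spam_f1:.2f}")
```

The 13 false negatives (spam messages labelled as ham) are the reason spam recall, at 0.91, is the weakest number in the report.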

