<a href="https://www.kaggle.com/code/anjusukumaran4/spam-detection-nlp?scriptVersionId=140301881" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This project implements a spam detection system using machine learning. The model is trained on the SMS Spam Collection dataset.

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
import matplotlib.pyplot as plt

## Load Data

In [2]:
df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv",encoding='latin-1')

In [3]:
df.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


## Data Exploration

In [4]:
print('Shape of the dataset : ',df.shape)

Shape of the dataset :  (5572, 5)


In [5]:
# Drop the three unnamed columns, which are empty in the preview above
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)

In [6]:
df.head()

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [8]:
df.describe().transpose()

,count,unique,top,freq
v1,5572,2,ham,4825
v2,5572,5169,"Sorry, I'll call later",30


In [9]:
# Rename the columns to descriptive names
df = df.rename(columns={'v1': 'label', 'v2': 'text'})

In [10]:
df.head(1)

,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."


In [11]:
df.groupby('label').describe().transpose()

Unnamed: 0,label,ham,spam
text,count,4825,747
text,unique,4516,653
text,top,"Sorry, I'll call later",Please call our customer service representativ...
text,freq,30,4


In [12]:
#checking for null values
df.isnull().sum()

label    0
text     0
dtype: int64

## Data Preparation

In [13]:
# Encode the label column numerically: spam -> 0, ham -> 1
df.loc[df['label'] == 'spam', 'label'] = 0
df.loc[df['label'] == 'ham', 'label'] = 1

In [14]:
df.head()

,label,text
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."


In [15]:
X=df['text']
y=df['label']

In [16]:
# Hold out 20% of the messages as a test set (fixed seed for reproducibility)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=0)

## Vectorization

Vectorization is the process of converting text into numbers. In NLP, words, phrases, or whole documents are mapped to vectors of real numbers, which can then be used for tasks such as prediction or measuring similarity. TF-IDF (term frequency-inverse document frequency) is a popular vectorization technique: it weights each term by how often it appears in a message, scaled down by how common that term is across all messages.
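To make the TF-IDF idea concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what `TfidfVectorizer` uses with its default settings as far as I know. The `tfidf` helper and the toy messages are hypothetical and are not part of the notebook. The sketch also skips stop-word removal and uses plain whitespace tokenisation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        # Raw term count multiplied by the smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        # Scale each vector to unit length; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical toy messages, not taken from the dataset.
messages = ["free prize call now", "call me later", "see you later"]
vectors = tfidf(messages)

for message, vector in zip(messages, vectors):
    print(message, "->", {t: round(w, 3) for t, w in vector.items()})
```

In the first message, "free" receives a higher weight than "call" because "call" also appears in another message and is therefore less distinctive.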

In [17]:
#importing libraries for nlp
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from collections import Counter

In [18]:
vect=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)

In [19]:
X_train_vect=vect.fit_transform(X_train)
X_test_vect=vect.transform(X_test)

In [20]:
# The label column still has object dtype, so convert it to integer
y_train = y_train.astype('int')
y_test = y_test.astype('int')

## Model Building

In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
model=LogisticRegression()
model.fit(X_train_vect,y_train)
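As a rough illustration of what a fitted binary logistic regression does at prediction time, the sketch below computes a weighted sum of the TF-IDF features plus a bias and passes it through the sigmoid function. The weights, bias, and feature values are made up for illustration and are not taken from the trained model. Following the encoding used above, the output is the probability of class 1 (ham).

```python
import math


def predict_proba_ham(features, weights, bias):
    """Probability of class 1 (ham) for a sparse {term: tfidf_value} vector."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid


# Hypothetical learned parameters: negative weights push towards spam (class 0).
weights = {"free": -2.0, "prize": -1.5, "later": 1.0}
bias = 0.5

# Hypothetical TF-IDF vector for a single message.
message_features = {"free": 0.6, "prize": 0.6, "call": 0.5}

p_ham = predict_proba_ham(message_features, weights, bias)
label = 1 if p_ham >= 0.5 else 0  # 1 = ham, 0 = spam
print(f"P(ham) = {p_ham:.3f}, predicted label = {label}")
```

Terms without a learned weight (here "call") contribute nothing to the score.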

### Prediction and Accuracy

In [23]:
from sklearn.metrics import accuracy_score

In [24]:
pred=model.predict(X_test_vect)
acc=accuracy_score(y_test,pred)
# Built-in round() works whether accuracy_score returns a Python or a NumPy float
print('Accuracy: ', round(acc*100, 2), '%')

Accuracy:  95.61 %
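Because the classes are imbalanced (4825 ham versus 747 spam in the full dataset), it helps to compare accuracy against a majority-class baseline and to look at precision and recall for the spam class. The sketch below is a pure-Python illustration. The baseline uses the class counts reported earlier, while `precision_recall` and the toy label lists are hypothetical and are not the notebook's actual predictions. The class ratio in the test split may also differ slightly from the full dataset.

```python
# Class counts reported by df.groupby('label').describe() earlier in the notebook.
n_ham, n_spam = 4825, 747

# Accuracy of a model that always predicts the majority class (ham).
baseline = max(n_ham, n_spam) / (n_ham + n_spam)
print(f"Majority-class baseline accuracy: {baseline:.2%}")


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels using the notebook's encoding (0 = spam, 1 = ham).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

precision, recall = precision_recall(y_true, y_pred, positive=0)
print(f"Spam precision: {precision:.2f}, spam recall: {recall:.2f}")
```

In the toy example, 6 of 8 predictions are correct, yet only one of the three spam messages is caught, which is the kind of gap that accuracy alone can hide.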
