<a href="https://colab.research.google.com/github/aviraljoshi23/Email_spam_detection/blob/main/EmailSpamClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Spam Classifier based on Naive Bayes
The method is called naive because we're assuming there are no relationships between the words themselves. We look at each word in a message in isolation and combine the probabilities of each word's contribution to the message being spam or not. A better spam classifier would also take the relationships between the words into account.
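
To make the independence assumption concrete, here is a minimal sketch that scores a short message by hand. The per-word probabilities and class priors are made-up illustrative numbers, not values estimated from this dataset, and `score` is a hypothetical helper.

```python
import math

# Hypothetical per-word likelihoods P(word | class); illustrative numbers only.
p_word_given_spam = {"free": 0.30, "win": 0.20, "hello": 0.05}
p_word_given_ham = {"free": 0.02, "win": 0.01, "hello": 0.20}

# Hypothetical class priors P(class).
prior_spam, prior_ham = 0.13, 0.87

def score(words, likelihoods, prior):
    """Log of prior times the product of per-word likelihoods (the 'naive' step)."""
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in words)

message = ["free", "win"]
spam_score = score(message, p_word_given_spam, prior_spam)
ham_score = score(message, p_word_given_ham, prior_ham)

# Normalise the two scores into a posterior probability of spam.
p_spam = 1 / (1 + math.exp(ham_score - spam_score))
label = "spam" if spam_score > ham_score else "ham"
print(label, round(p_spam, 3))
```

Each word contributes its own factor regardless of which other words appear next to it, which is exactly the simplification the name refers to.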

#Importing the dataset

In [1]:
from google.colab import files
uploaded=files.upload()

Saving spam.csv to spam.csv



In [82]:
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

#import metrics libraries
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

In [92]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [32]:
import pandas as pd
mails=pd.read_csv('spam.csv',encoding='latin-1')

In [33]:
mails.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [34]:
mails.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"],inplace=True)
mails= mails.rename(columns={"v1":"label", "v2":"sms"})

In [35]:
mails.head()

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#Checking the maximum length of SMS
Number of observations for each label (spam and ham):

In [36]:
mails.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64
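
The counts above show that the classes are imbalanced, which matters when reading accuracy figures later. As a rough sketch (plain arithmetic on the counts printed above, with hypothetical variable names), a classifier that always predicts ham would already be right most of the time:

```python
# Class counts copied from the value_counts() output above.
ham, spam = 4825, 747

# Accuracy of a trivial model that labels every message as ham.
baseline_accuracy = ham / (ham + spam)

# Fraction of messages that are spam.
spam_share = spam / (ham + spam)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"spam share: {spam_share:.3f}")
```

Any model is worth keeping only if it clearly beats this baseline, which is why precision and recall on the spam class are more informative than accuracy alone.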

In [37]:
#Add a column with the length (in characters) of each message
mails['length']=mails['sms'].apply(len)
mails.head(20)

Unnamed: 0,label,sms,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
5,spam,FreeMsg Hey there darling it's been 3 week's n...,148
6,ham,Even my brother is not like to speak with me. ...,77
7,ham,As per your request 'Melle Melle (Oru Minnamin...,160
8,spam,WINNER!! As a valued network customer you have...,158
9,spam,Had your mobile 11 months or more? U R entitle...,154


In [38]:
mails['length'].describe()

count    5572.000000
mean       80.118808
std        59.690841
min         2.000000
25%        36.000000
50%        61.000000
75%       121.000000
max       910.000000
Name: length, dtype: float64

The describe() output shows that the longest message is 910 characters long. Let's locate that message in the dataset.

In [39]:
mails[mails['length']==910]['sms'].iloc[0]

"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."

#Data Preprocessing
Convert the values in the 'label' column to numerical values using the map method with the dictionary {'ham':0, 'spam':1}, which maps 'ham' to 0 and 'spam' to 1.

In [40]:
mails['label'] = mails.label.map({'ham':0, 'spam':1})#Assign the column directly so it gets an integer dtype

In [41]:
mails.head()

Unnamed: 0,label,sms,length
0,0,"Go until jurong point, crazy.. Available only ...",111
1,0,Ok lar... Joking wif u oni...,29
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,0,U dun say so early hor... U c already then say...,49
4,0,"Nah I don't think he goes to usf, he lives aro...",61


In [42]:
mails.shape

(5572, 3)

In [109]:
ps=PorterStemmer()
lemma=WordNetLemmatizer()

#Bag of words
What we have here in our data set is a large collection of text data (5,572 rows of data). Most ML algorithms rely on numerical data to be fed into them as input, and email/sms messages are usually text heavy.
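
As a rough illustration of the bag-of-words idea, the sketch below turns two toy messages (invented for this example) into count vectors using only the standard library. It is a simplified stand-in for what `CountVectorizer` does later in the notebook, not its actual implementation.

```python
from collections import Counter

# Two toy messages, invented for illustration.
docs = ["free entry win free prize", "ok see you later ok"]

# Tokenise on whitespace and build a sorted vocabulary over all documents.
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word, values are counts.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded: each message becomes a fixed-length row of counts, which is the numerical input most ML algorithms expect.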

In [110]:
ps=PorterStemmer()#PorterStemmer object
lemma=WordNetLemmatizer()#Lemmatizer object (defined here but not used below; stemming is applied instead)
stop_words=set(stopwords.words('english'))#Build the stopword set once instead of on every word
corpus=[]
for i in range(0,len(mails)):
  review=re.sub('[^a-zA-Z]',' ',mails['sms'][i])#Replace every non-letter character with a space
  review=review.lower()#Convert the message to lower case
  review=review.split()#Split the message into a list of words
  review=[ps.stem(word) for word in review if word not in stop_words]#Drop stopwords and stem the rest
  review=' '.join(review)#Join the stemmed words back into one string
  corpus.append(review)#Append the processed string to the list corpus

#Data preprocessing with CountVectorizer()

In [127]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

#Train Test Splits
Now that we have understood how to deal with the Bag of Words problem, we can get back to our dataset and proceed with our analysis. The first step is to split the dataset into a training set and a testing set so we can later evaluate the model on messages it has not seen.

In [132]:
xSet = mails['sms'].values
ySet = mails['label'].values

In [133]:
X_train, X_test, y_train, y_test = train_test_split(xSet,ySet,test_size=0.20,random_state=1)

In [134]:
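# Fit the vectorizer on the training data only and return the count matrix.
# Note: this refits cv on the raw messages, replacing the vocabulary learned from corpus above.
training_data = cv.fit_transform(X_train)

# Transform the testing data with the vocabulary learned from the training data.
testing_data = cv.transform(X_test)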
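
The difference between `fit_transform` and `transform` can be seen on a toy example. This is a minimal sketch with invented messages and a hypothetical `toy_cv` vectorizer, separate from the `cv` object used in the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration.
train_docs = ["free prize now", "see you later"]
test_docs = ["free lunch later"]

toy_cv = CountVectorizer()
train_matrix = toy_cv.fit_transform(train_docs)  # learns the vocabulary
test_matrix = toy_cv.transform(test_docs)        # reuses it, learns nothing new

# Columns follow the sorted vocabulary learned from the training documents.
vocab = sorted(toy_cv.vocabulary_)
print(vocab)
print(test_matrix.toarray())
```

The word "lunch" never appeared in the training documents, so it has no column and is ignored at test time. Fitting on the training set only keeps information from the test set out of the model.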

#Naive Bayes Model
With messages represented as vectors, we can finally train our spam/ham classifier. Almost any classification algorithm could be used at this point. Naive Bayes is a good choice here because it trains quickly and works well with sparse word-count features.

In [137]:
#create and fit NB model
naive_bayes=MultinomialNB()
naive_bayes.fit(training_data,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
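
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. The sketch below shows its effect on toy word counts that are invented for illustration; it mirrors the smoothed estimate (count + alpha) / (total + alpha * vocabulary size) rather than calling scikit-learn itself.

```python
# Toy word counts observed in spam training messages (invented numbers).
spam_counts = {"free": 3, "win": 1, "hello": 0}
alpha = 1.0  # same value as the MultinomialNB default shown above

total = sum(spam_counts.values())
vocab_size = len(spam_counts)

# Smoothed estimate of P(word | spam) for every word in the vocabulary.
smoothed = {
    word: (count + alpha) / (total + alpha * vocab_size)
    for word, count in spam_counts.items()
}
print(smoothed)
```

Without smoothing, a word that never occurred in spam would get probability zero and wipe out the whole product for any message containing it. With alpha set to 1, every word keeps a small non-zero probability.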

In [138]:
predictions = naive_bayes.predict(testing_data)


Now that predictions have been made on our test set, we need to check the accuracy of our predictions.

In [140]:
accuracyScore =accuracy_score(y_test,predictions)
print(accuracyScore)

0.9847533632286996


#Evaluating our SMS Spam Detection Model

In [152]:
#Precision 
print('Precision score: {}'.format(precision_score(y_test, predictions)))
#Recall
print('Recall score: {}'.format(recall_score(y_test, predictions)))
#F1 score
print('F1 score: {}'.format(f1_score(y_test, predictions)))

Precision score: 0.9295774647887324
Recall score: 0.9496402877697842
F1 score: 0.9395017793594306


#Confusion Matrix

In [154]:
print('Confusion Matrix: {}'.format(confusion_matrix(y_test, predictions)))

Confusion Matrix: [[966  10]
 [  7 132]]
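
The scores reported earlier can be recomputed directly from this confusion matrix. The sketch below assumes scikit-learn's layout for binary labels 0 and 1 (rows are true labels, columns are predicted labels) and uses hypothetical variable names:

```python
# Values copied from the confusion matrix above:
# first row is true ham, second row is true spam.
tn, fp = 966, 10
fn, tp = 7, 132

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Read this way, the model wrongly flagged 10 ham messages as spam and let 7 spam messages through out of 1,115 test messages.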
