<a href="https://colab.research.google.com/github/aakashrvx/Data-Science-Resources-for-Beginners/blob/main/Naive_Bayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Detection Using Naive Bayes Classification

## Use of Bayes' Theorem in Spam Classification

$$ P(spam|email) = \cfrac {P(spam) \times P(email|spam)} {P(email)} $$
$$ = \cfrac {P(spam) \times P(word_1, word_2, word_3, \ldots, word_n|spam)} {P(email)} $$

Assuming feature independence,

$$ = \cfrac {P(spam) \times P(word_1|spam) \times P(word_2|spam) \times P(word_3|spam) \times \cdots \times P(word_n|spam)} {P(email)} $$
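The factorized formula can be checked by hand on a tiny example. The sketch below uses made-up probabilities (they are assumptions for illustration, not values estimated from the SMS dataset) and sums logarithms instead of multiplying, since products of many small probabilities can underflow. `P(email)` is obtained as the sum of the spam and non-spam numerators.

```python
import math

# Hypothetical numbers chosen only for illustration; they are not
# estimated from the SMS dataset used in this notebook.
p_spam = 0.2
p_ham = 1 - p_spam

# P(word | class) for each word in a made-up vocabulary.
p_word_given_spam = {"free": 0.30, "win": 0.20, "meeting": 0.01}
p_word_given_ham = {"free": 0.02, "win": 0.01, "meeting": 0.10}


def log_joint(words, prior, likelihoods):
    """Return log(P(class) * product of P(word_i | class))."""
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in words)


email = ["free", "win"]
joint_spam = math.exp(log_joint(email, p_spam, p_word_given_spam))
joint_ham = math.exp(log_joint(email, p_ham, p_word_given_ham))

# P(email) is the sum of the two numerators, so the posterior is the
# spam numerator divided by that sum.
posterior_spam = joint_spam / (joint_spam + joint_ham)
print(f"P(spam | email) = {posterior_spam:.4f}")
```

Because `P(email)` is the same for both classes, a classifier only needs to compare the numerators; the division matters only when an actual probability is wanted.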


## Imports


In [None]:
# import required libraries
import matplotlib.pyplot as plt # plotting
import nltk # for natural language processing tasks
import numpy as np # numerical computing library
import pandas as pd # for data loading, preprocessing and wrangling
import seaborn as sns # for graphing and visualization

%matplotlib inline

In [None]:
# import the sklearn functions and classes needed for our purpose
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.multiclass import unique_labels

## Load Data

In [None]:
# load the full dataset (it is split into training and test sets later)
# data is obtained from Kaggle
# Kaggle dataset link: https://www.kaggle.com/uciml/sms-spam-collection-dataset
data = pd.read_csv('https://raw.githubusercontent.com/AiDevNepal/ai-saturdays-workshop-8/master/data/spam.csv')

# encode labels as integers: spam -> 1, non-spam -> 0
data['target'] = np.where(data['target'] == 'spam', 1, 0)
print('No of rows:', len(data))
data.head(10)

No of rows: 5572


Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


## Peek into Non-spam and Spam Messages

In [None]:
data[data['target'] == 0].sample(10)

Unnamed: 0,text,target
4184,I'm good. Have you registered to vote?,0
5414,East coast,0
1084,For me the love should start with attraction.i...,0
1447,DonÛ÷t give a flying monkeys wot they think a...,0
61,Ha ha ha good joke. Girls are situation seekers.,0
1292,Da my birthdate in certificate is in april but...,0
4138,Ever green quote ever told by Jerry in cartoon...,0
5246,Haven't eaten all day. I'm sitting here starin...,0
829,Thanks for yesterday sir. You have been wonder...,0
2726,No i am not having not any movies in my laptop,0


In [None]:
data[data['target'] == 1].sample(10)

Unnamed: 0,text,target
5058,Free video camera phones with Half Price line ...,1
4110,URGENT! Your Mobile number has been awarded a ...,1
2662,Hello darling how are you today? I would love ...,1
1379,No. 1 Nokia Tone 4 ur mob every week! Just txt...,1
3860,Free Msg: Ringtone!From: http://tms. widelive....,1
116,You are a winner U have been specially selecte...,1
4650,A å£400 XMAS REWARD IS WAITING FOR YOU! Our co...,1
2846,Free-message: Jamster!Get the crazy frog sound...,1
4241,Show ur colours! Euro 2004 2-4-1 Offer! Get an...,1
4832,"New Mobiles from 2004, MUST GO! Txt: NOKIA to ...",1


## Split Data into Training and Test Set

In [None]:
# splitting the dataset into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(data['text'], 
                                                    data['target'], 
                                                    random_state=0)
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

Training data shape: (4179,)
Testing data shape: (1393,)


## Feature Extraction

In [None]:
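# learn a vocabulary of unigrams and bigrams from the training messages,
# then turn each message into a row of n-gram counts
vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
X_train_vectorized.shape  # the sparse matrix exposes .shape; no dense copy needed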

(4179, 40704)
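To make the `ngram_range=(1, 2)` setting concrete, here is a minimal sketch on a single made-up message (not taken from the dataset). It relies on `CountVectorizer`'s default tokenization, which lowercases the text and keeps tokens of two or more characters; the names `toy_messages` and `toy_vectorizer` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up message, used only to illustrate the extracted features.
toy_messages = ["Free entry to win"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(toy_messages)

# vocabulary_ maps each learned unigram or bigram to a column index.
features = sorted(toy_vectorizer.vocabulary_)
print(features)

# Each message becomes one row of counts, one column per feature.
counts = toy_vectorizer.transform(toy_messages)
print(counts.shape)
```

Four words yield four unigrams plus three bigrams, which is why the real training set, with thousands of messages, ends up with tens of thousands of columns.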

In [None]:
type(X_train)

pandas.core.series.Series

## Model Creation

In [None]:
# create Naive Bayes model
model = MultinomialNB(alpha=0.1)
model.fit(X_train_vectorized, Y_train)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
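The `alpha=0.1` argument is the additive smoothing parameter of `MultinomialNB`. Without smoothing, a single n-gram that never appeared in spam during training would force `P(word|spam)` to zero and wipe out the whole product. The sketch below illustrates the smoothed estimate `(count + alpha) / (total + alpha * vocab_size)` with hypothetical counts; the helper name `smoothed_likelihood` is an assumption for illustration, not part of scikit-learn.

```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha):
    """Estimate P(word | class) with additive smoothing."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Hypothetical counts: a feature never seen in spam training messages,
# 1000 total feature occurrences in spam, 500 distinct features.
unsmoothed = smoothed_likelihood(0, 1000, 500, alpha=0.0)
smoothed = smoothed_likelihood(0, 1000, 500, alpha=0.1)

print("alpha=0.0:", unsmoothed)  # zero probability for the unseen feature
print("alpha=0.1:", smoothed)    # small but non-zero probability
```

A smaller `alpha` trusts the observed counts more; a larger one pulls all estimates toward a uniform distribution.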

## Predictions on Test Dataset

In [None]:
# predict y values for test dataset using the model we created
predictions = model.predict(vectorizer.transform(X_test))

## Evaluation

### Accuracy

In [None]:
# see accuracy in the testing set
print("Accuracy:", 100 * sum(predictions == Y_test) / len(predictions), '%')

Accuracy: 98.99497487437186 %
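Accuracy can look good even when a model handles the rarer class poorly, so it helps to also check precision, recall and the confusion matrix, which the imports above already bring in. The sketch below runs them on made-up labels so that it is self-contained; in the notebook, `Y_test` and `predictions` would take the place of `y_true` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for ten messages (1 = spam, 0 = non-spam); these are
# not the notebook's actual test results.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision: of the messages flagged as spam, how many really are spam.
precision = precision_score(y_true, y_pred)
# Recall: of the real spam messages, how many were flagged.
recall = recall_score(y_true, y_pred)
print("Precision:", precision)
print("Recall:", recall)
```

For a spam filter, low precision means legitimate messages get flagged, while low recall means spam slips through.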


## Performance on Real-Life Examples

In [None]:
# recent official emails
model.predict(vectorizer.transform(
    [
        "Dear ABC, Thank you very much for sharing these files with us. I will request your help if need be. Regards, XYZ",
        "Hi ABC, Thanks for putting together the note on admin code. It’s great first draft! It covers many important aspects I wanted to have a good understanding about!",
        "Dear ABC, Thanks for the kind reply. The paper seems very interesting, will discuss more when we meet. Referring to our previous conversation, it would be great to know about your work on mapping the administrative units. Furthermore, we would appreciate knowing about the depth of administrative level mapping and the methodology. I have copied ABC DEF (GHI) in our team who is working on a similar task. ABC can reflect on the technical perspectives as the conversation progress. We are looking forward to hearing from you and mutually benefit from the data if your convenience permits. Best regards, XYZ",
        "Hi ABC and DEF, A gentle reminder that we're very interested in hearing more about your comparisons between the HRSL dataset, WorldPop and other similar datasets. We're planning on doing a small desk review specific to Nepal when we have time and would appreciate the opportunity to start from where you all left off. We are of course more than happy to keep any unpublished research findings you share internal to the World Bank, and share back the results of our review.",
    ])
)

array([0, 0, 0, 0])

In [None]:
# recent personal emails
model.predict(vectorizer.transform(
    [
        "Thank you, ABC. Can you also share your updated GitHub and LinkedIn profile? It helps to have personal/college projects in GitHub with proper documentation. As you are a fresher, employers would be willing to see your personal/college projects. Also, share a competitive programming profile if any.",
        "Hi ABC, I wish I was in Kathmandu so that we could have in-person discussion. However, will you be available for hangout call sometime next week? Let me know of your availability. We can talk more about your interest and future plans and discuss the options. -XYZ",
        "Hi y’all, Making quick introductions between python + QGIS Atlas lovers in Kathmandu. ABC, XYZ is looking at your code now and seems pretty comfortable with it. I told him he can write you with any questions — hope that’s OK. I’ll buy you some momos by way of thanks. Best, DEF",
        "Heyyy hiiiiii... Long time... Remember me? 😊😂 How are you? How's it going there...? What are you upto? :)",
    ])
)

array([0, 0, 0, 0])

In [None]:
# spam-like messages
model.predict(vectorizer.transform(
    [
        "get free discount in plane tickets",
        "free recharge card offer",
        "girls are waiting to chat with you",
        "1-month unlimited calls offer Activate now",
        "congratulation, you became today's lucky winner",
        "Jelie wants your phone number",
        
    ])
)

array([1, 1, 1, 1, 1, 1])

In [None]:
# contrasts
model.predict(vectorizer.transform(
    [
        "Jelie wants your email",
        "can you please share your phone number?"
    ])
)

array([0, 0])
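The contrasting pairs above return only hard 0/1 labels, which hide how close a message is to the decision boundary. A minimal sketch of inspecting class probabilities with `predict_proba` follows. It trains on a tiny made-up dataset so that it runs on its own; `toy_texts`, `toy_labels` and `toy_pipeline` are illustrative names, and wrapping the vectorizer and model in a pipeline is a convenience here, not something the notebook itself does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = spam, 0 = non-spam), for illustration only.
toy_texts = [
    "free prize win now",
    "win free cash offer",
    "are we meeting today",
    "see you at lunch",
]
toy_labels = [1, 1, 0, 0]

# Same settings as the notebook's vectorizer and model, chained together.
toy_pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB(alpha=0.1),
)
toy_pipeline.fit(toy_texts, toy_labels)

new_messages = ["free cash prize", "lunch today"]
labels = toy_pipeline.predict(new_messages)
# Columns follow toy_pipeline.classes_, i.e. [P(non-spam), P(spam)].
probabilities = toy_pipeline.predict_proba(new_messages)

for message, label, proba in zip(new_messages, labels, probabilities):
    print(f"{message!r}: label={label}, P(spam)={proba[1]:.3f}")
```

With the notebook's objects, the equivalent call would be `model.predict_proba(vectorizer.transform([...]))`.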