# Naive Bayes For Email Classification

In [61]:
import pandas as pd

In [62]:
# read csv file
df = pd.read_csv("data_spam.csv")
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [63]:
print(df.shape)

(5572, 2)


In [64]:
# Let's look at the number of unique values in Category

df["Category"].nunique()

2

In [65]:
# spam -> 1 and ham -> 0
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

|   | Category | Message | spam |
|---|----------|---------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [66]:
# Separate features and target

X = df['Message']
y = df['spam']

In [67]:
# split data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [68]:
X_train.shape

(4457,)

In [69]:
X_test.shape

(1115,)
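
Because `train_test_split` shuffles the rows randomly, the exact split (and every score computed from it) changes from run to run. The sketch below shows one way to make the split repeatable and to keep the spam/ham ratio the same in both parts. The toy messages, the labels, and the seed value 42 are assumptions made up for illustration; they are not part of the dataset above.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 messages, 10 of them labelled spam (20%).
toy_messages = [f"message {i}" for i in range(50)]
toy_labels = [1] * 10 + [0] * 40

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    toy_messages, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Running the same call again with the same seed yields the same split.
_, X_toy_test_again, _, _ = train_test_split(
    toy_messages, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(len(X_toy_train), len(X_toy_test))   # sizes of the two parts
print(sum(y_toy_train), sum(y_toy_test))   # spam count in each part
```

Whether to stratify is a judgment call; it tends to matter most when one class is much rarer than the other.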

```
corpus = [
    'This document is the first document.',   # doc1
    'This document is the second document.',  # doc2
    'And this one is the third one.',         # doc3
    'Is this the first document?',            # doc4
]

Vocabulary learned from the corpus (sorted alphabetically):
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Vector representation of the above documents (one count per vocabulary word):
[0,     2,          1,       1,    0,     0,        1,     0,       1]    --> doc1
[0,     2,          0,       1,    0,     1,        1,     0,       1]    --> doc2
[1,     0,          0,       1,    2,     0,        1,     1,       1]    --> doc3
[0,     1,          1,       1,    0,     0,        1,     0,       1]    --> doc4
```


In [70]:
# Convert X_train into a matrix of word counts

from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_count = v.fit_transform(X_train.values)

# Let's look at the first 2 rows
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)


## Naive Bayes classifiers are categorized based on the type of data they handle.
     
- Bernoulli Naive Bayes: Designed for binary (Boolean) features, where each feature is yes/no, true/false, or 0/1. For text, that means recording only whether a word is present in a message, not how often it occurs. It is frequently used in spam detection and sentiment analysis.

- Multinomial Naive Bayes: Designed for discrete count features, such as word frequencies in documents. It is commonly used for text classification and document categorization, and it matches the word counts produced by CountVectorizer above.

- Gaussian Naive Bayes: Designed for continuous features. It assumes that, within each class, every feature follows a Gaussian (normal) distribution, which makes it useful for numerical data such as measurements or sensor readings.
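
To make the distinction concrete, here is a small sketch that fits each of the three variants on the kind of input it expects. All arrays are made-up toy data, and the column meanings in the comments are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

toy_y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

# Count features: how often "free", "win", "lunch" occur in each message.
toy_counts = np.array([[2, 1, 0], [1, 2, 0], [0, 0, 2], [0, 0, 1]])
# Binary features: only whether each word occurs at all.
toy_binary = (toy_counts > 0).astype(int)
# Continuous features: two hypothetical numeric measurements per sample.
toy_continuous = np.array([[5.1, 0.2], [4.8, 0.3], [1.0, 3.9], [1.2, 4.1]])

multinomial = MultinomialNB().fit(toy_counts, toy_y)
bernoulli = BernoulliNB().fit(toy_binary, toy_y)
gaussian = GaussianNB().fit(toy_continuous, toy_y)

pred_multinomial = multinomial.predict([[3, 0, 0]])   # "free" three times
pred_bernoulli = bernoulli.predict([[1, 0, 0]])       # "free" present
pred_gaussian = gaussian.predict([[5.0, 0.25], [1.1, 4.0]])
print(pred_multinomial, pred_bernoulli, pred_gaussian)
```

Since `CountVectorizer` produces word counts, `MultinomialNB` is the natural fit for the spam data in this notebook.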

In [71]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

model.fit(X_train_count,y_train)

In [72]:
# Test on 2 emails whose labels are known to us: 1 (spam) and 0 (ham)
emails = [ 
    ('Had your mobile 11 months or more? U R entitled to Update to the latest '
     'colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030'),
    "Nah I don't think he goes to usf, he lives around here though"
]

emails_count = v.transform(emails)
model.predict(emails_count) # 1 -> spam

array([1, 0], dtype=int64)

In [73]:
# Let's check the accuracy on the test data

X_test_count = v.transform(X_test)

model.score(X_test_count, y_test)

0.9838565022421525
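
`model.score` reports plain accuracy. If spam is much rarer than ham, which is common for this kind of data, accuracy alone can hide missed spam, so it may be worth looking at the confusion matrix, precision, and recall as well. The sketch below uses made-up labels and predictions (not results from the model above) to show how the numbers can diverge.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions: 8 ham, 2 spam, one spam missed.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)
# Rows are true classes (ham, spam); columns are predicted classes.
toy_matrix = confusion_matrix(toy_true, toy_pred)
toy_precision = precision_score(toy_true, toy_pred)  # of predicted spam, share correct
toy_recall = recall_score(toy_true, toy_pred)        # of real spam, share caught

print(toy_accuracy, toy_precision, toy_recall)
print(toy_matrix)
```

On the real data the same functions could be called with `y_test` and `model.predict(X_test_count)`.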


# OPTIONAL: Pipeline

In [10]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [11]:
clf.fit(X_train, y_train)

In [12]:
clf.score(X_test,y_test)

0.9856424982053122

In [13]:
clf.predict(emails)

array([1, 0], dtype=int64)
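
The point of the pipeline is that it accepts raw text: vectorizing and classifying happen in one `fit` or `predict` call, so there is no separate `transform` step to forget. Below is a self-contained sketch on four made-up training messages (hypothetical data, far too small for a real model) that also uses `predict_proba` to show how confident the classifier is.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set: 1 = spam, 0 = ham.
toy_train_messages = [
    "Win a free prize now",
    "Free entry call now",
    "See you at lunch",
    "Are you coming home",
]
toy_train_labels = [1, 1, 0, 0]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_clf.fit(toy_train_messages, toy_train_labels)

toy_new_messages = ["free prize call now", "see you at home"]
toy_predictions = toy_clf.predict(toy_new_messages)
# One row per message; columns follow toy_clf.classes_ (here ham, then spam).
toy_probabilities = toy_clf.predict_proba(toy_new_messages)

print(toy_predictions)
print(toy_probabilities.round(3))
```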