# Supervised Learning - Naive Bayes

## Formula:
$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$

## Example: Spam filters
## $P(Spam | email) = P(Spam | \vec{w}) = \frac{P(spam) \cdot P(\vec{w} | Spam)}{P(spam)\:\cdot \: P(\vec{w}|Spam)\:\cdot \: P(not \:Spam)\: \cdot \: P(\vec{w} | not \:spam)}$

### Terminolgy
Priors --> P(spam)  
Likelihood --> P(spam | $\vec{w}$) and other | terms 
Evidence --> The lower term of the formula $P(spam)\:\cdot \: P(\vec{w}|Spam)\:\cdot \: P(not \:Spam)\: \cdot \: P(\vec{w} | not \:spam)$

## Preprocessing (aka Featurizing)
<ol>
    <li>Tokenization: splitting the words </li>
    <li>Stop words removal</li>
    <li>Remove non-alphabetical charachters.</li>
    <li>Stemming: keeping the root of the word but stripping things like ing, ed etc. More for large data</li>
    <li>Lemmatization: Alternative to stemming by assinging to the same root. more taxing of resources than stemming</li>
    <li>Lowercasing: could be bad in cases where the name turns into a verb like "Mark" and "mark"</li>

</ol>

## There are 2 phases in the Classifer workflow:
<ol>
    <li>Learning phase: splitting the data into training and testing data</li>
    <li>Evaluation phase: testing the classifier performance using key metrics:
        <ol>
            <li>Accuracy: </li>
            <li>Precision: </li>
            <li>Recall: </li>
        </ol>
    </li>
</ol>

## Building Naive Bayes Classifer

In [2]:
# library imports
import numpy as np 
import pandas as pd 
from sklearn.naive_bayes import GaussianNB



  return f(*args, **kwds)


## Exploratory Data Analysis

In [44]:
# reading data
df = pd.read_csv("data/spam.csv",encoding = "ISO-8859-1")
df = df[["v1", "v2"]]

In [45]:
df

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [47]:
# checking unique values on v1
print("Number of Unique Values: \n", df['v1'].value_counts(),'\n\n')

# value percentage
print("Value Percentage: \n",df['v1'].value_counts() * 100 / len(df['v1']))

Number of Unique Values: 
 ham     4825
spam     747
Name: v1, dtype: int64 


Value Percentage: 
 ham     86.593683
spam    13.406317
Name: v1, dtype: float64


## Preprocessing

<ol>
    <li>Tokenization: splitting the words </li>
    <li>Stop words removal</li>
    <li>Remove non-alphabetical charachters.</li>
    <li>Stemming: keeping the root of the word but stripping things like ing, ed etc. More for large data</li>
    <li>Lemmatization: Alternative to stemming by assinging to the same root. more taxing of resources than stemming</li>
    <li>Lowercasing: could be bad in cases where the name turns into a verb like "Mark" and "mark"</li>

</ol>

In [48]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [49]:
ps = PorterStemmer()

In [50]:
stop_words = stopwords.words("english")
print(len(stop_words))
print(stop_words[:10])

153
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']


In [51]:
# # 1. Tokenization
# df['tokens'] = df.v2.str.split(" ")
# df.head()

In [55]:
def remove_stopwords(text):
    """Function to replace stopords with an empty space and removes double spaces
    removing all chars other then alphabet and stemming words
    
    returns: corpus of stemmed words"""

    text = re.sub('^a-zA-Z',' ',text)
    text = text.split()
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = ' '.join(text).replace('  ', ' ')
    return text


In [56]:
df["removed_stopwords"]= df.v2.apply(remove_stopwords)

In [57]:
df

Unnamed: 0,v1,v2,removed_stopwords
0,ham,"Go until jurong point, crazy.. Available only ...","Go jurong point, crazy.. avail bugi n great wo..."
1,ham,Ok lar... Joking wif u oni...,Ok lar... joke wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entri 2 wkli comp win FA cup final tkt 21...
3,ham,U dun say so early hor... U c already then say...,U dun say earli hor... U c alreadi say...
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah I don't think goe usf, live around though"
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,thi 2nd time tri 2 contact u. U å£750 pound pr...
5568,ham,Will Ì_ b going to esplanade fr home?,will Ì_ b go esplanad fr home?
5569,ham,"Pity, * was in mood for that. So...any other s...","pity, * mood that. so...ani suggestions?"
5570,ham,The guy did some bitching but I acted like i'd...,the guy bitch I act like i'd interest buy some...


In [None]:
# building pipeline