<a href="https://colab.research.google.com/github/ajrianop/ML/blob/main/02_BayesianMethod.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Naive Bayes, Bayesian Method, or Bayes' Theorem**

Bayes' theorem states that, for events $A$ and $B$ with $P(B) > 0$, $P(A\mid B) = \dfrac{P(A)P(B\mid A)}{P(B)}.$


Naive Bayes can be applied to many classification problems, for example deciding whether an email is spam given the words it contains. Given a large database of emails labeled as spam or not spam, every probability on the right-hand side can be estimated from how often each word appears in each class. For the word "free":
$P(\text{spam}\mid \text{free})=\dfrac{P(\text{spam})P(\text{free}\mid \text{spam})}{P(\text{free})}.$
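
As a small worked example of the formula above, the following sketch computes $P(\text{spam}\mid \text{free})$ from raw counts. All counts are hypothetical and invented only for illustration; they do not come from the dataset used later.

```python
# Hypothetical counts, invented purely for illustration.
n_spam = 400        # emails labeled spam
n_ham = 600         # emails labeled ham
n_spam_free = 120   # spam emails that contain the word "free"
n_ham_free = 30     # ham emails that contain the word "free"

n_total = n_spam + n_ham

p_spam = n_spam / n_total                      # P(spam)
p_free_given_spam = n_spam_free / n_spam       # P(free | spam)
p_free = (n_spam_free + n_ham_free) / n_total  # P(free)

# Bayes' theorem: P(spam | free) = P(spam) * P(free | spam) / P(free)
p_spam_given_free = p_spam * p_free_given_spam / p_free
print(round(p_spam_given_free, 3))  # 0.8
```

With these made-up numbers, 120 of the 150 emails containing "free" are spam, which matches the result of $0.8$.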

To compute this, we can use scikit-learn, whose `MultinomialNB` class does the hard work of fitting a Naive Bayes model.
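
For reference, the "naive" part of the method is the assumption that the words of a message are conditionally independent given the class. The formula below is the standard textbook statement of that assumption, not something taken from this notebook's source. For a message made of words $w_1,\dots,w_n$ and a class $c$ (spam or ham):

$$P(c\mid w_1,\dots,w_n) \propto P(c)\prod_{i=1}^{n} P(w_i\mid c)$$

The predicted class is the one with the larger right-hand side.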

## **Example**

Before starting, we need to load the data from Google Drive. The following files come from the instructor Frank Kane.

In [1]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


Consider two directories, one with emails classified as spam and the other with emails classified as ham (that is, not spam). We can train our model on this data.

In [2]:
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) pairs.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filePath = os.path.join(root, filename)

            inBody = False
            lines = []
            # The email header ends at the first blank line; keep only the body.
            with io.open(filePath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            message = '\n'.join(lines)
            yield filePath, message


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

data = DataFrame({'message': [], 'class': []})

data = pd.concat([data, dataFrameFromDirectory("/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam", "spam")]);
data = pd.concat([data, dataFrameFromDirectory("/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/ham", "ham")])

# For pandas versions before 2.0, DataFrame.append can be used instead:
#data = data.append(dataFrameFromDirectory('emails/spam', 'spam'))
#data = data.append(dataFrameFromDirectory('emails/ham', 'ham'))


In [3]:
data.reset_index()[['message','class']]

Unnamed: 0,message,class
0,"<html>\n\n\n\n<head>\n\n<meta http-equiv=3D""Co...",spam
1,IS YOUR BUSINESS MAKING MONEY!\n\nSet Up To Ac...,spam
2,When America's top companies compete for your ...,spam
3,Lowest rates available for term life insurance...,spam
4,<!-- saved from url=3D(0022)http://internet.e-...,spam
...,...,...
2995,"Jim Whitehead wrote:\n\n\n\n>Great, this is ha...",ham
2996,"On Fri, 6 Sep 2002, Russell Turpin wrote:\n\n\...",ham
2997,This article from NYTimes.com \n\nhas been sen...,ham
2998,"On Thu, 5 Sep 2002 bitbitch@magnesium.net wrot...",ham


In [4]:
# CountVectorizer tokenizes every message and counts how many times each word occurs in each email.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

# MultinomialNB fits a Naive Bayes model to the given data
classifier = MultinomialNB()
targets = data['class'].values
# fit takes two inputs: counts = the training data, targets = the class label of each email
classifier.fit(counts, targets)

MultinomialNB()

In [5]:
print(counts)

  (0, 28844)	3
  (0, 27856)	2
  (0, 36946)	4
  (0, 28855)	4
  (0, 22714)	2
  (0, 3669)	143
  (0, 17466)	6
  (0, 34248)	1
  (0, 22319)	1
  (0, 55923)	1
  (0, 38606)	2
  (0, 26057)	1
  (0, 37135)	1
  (0, 25238)	2
  (0, 44192)	1
  (0, 21789)	1
  (0, 20408)	1
  (0, 54656)	1
  (0, 53112)	2
  (0, 15912)	1
  (0, 3865)	1
  (0, 1193)	1
  (0, 53574)	2
  (0, 39540)	3
  (0, 9442)	1
  :	:
  (2999, 49584)	1
  (2999, 42397)	1
  (2999, 59128)	1
  (2999, 40219)	1
  (2999, 53380)	3
  (2999, 36776)	2
  (2999, 20163)	2
  (2999, 17527)	1
  (2999, 44241)	1
  (2999, 26673)	1
  (2999, 27649)	1
  (2999, 28072)	1
  (2999, 44240)	4
  (2999, 39252)	1
  (2999, 39516)	1
  (2999, 30694)	1
  (2999, 11952)	1
  (2999, 30540)	1
  (2999, 49440)	1
  (2999, 19791)	1
  (2999, 19162)	1
  (2999, 46461)	1
  (2999, 55958)	1
  (2999, 41700)	1
  (2999, 52738)	1


In [6]:
print(targets.size)

3000


In [7]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?",""]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham', 'ham'], dtype='<U4')
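
The empty message is classified as ham because it contains no words: with all counts equal to zero, the prediction depends only on the class priors, and ham is the more frequent class in the training data. A minimal sketch on a hypothetical toy corpus (three ham messages and one spam message, invented for illustration) shows this with `predict_proba`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: ham is the majority class.
toy_messages = [
    "free prize now",
    "lunch at noon",
    "meeting notes attached",
    "see you at golf",
]
toy_labels = ["spam", "ham", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB()
toy_clf.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# An empty message has an all-zero count vector.
empty_counts = toy_vectorizer.transform([""])

print(toy_clf.classes_)                      # order of the probability columns
print(toy_clf.predict(empty_counts))         # predicted class
print(toy_clf.predict_proba(empty_counts))   # posterior for the empty message
print(np.exp(toy_clf.class_log_prior_))      # class priors learned from the labels
```

The last two printed arrays coincide: for a message with no words, the posterior is just the prior.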

## **Using train/test set**

We take $70\%$ of the spam emails and $70\%$ of the ham emails as the training set. The remaining $30\%$ of each class forms the test set.

In [20]:
spamEmails=data[data['class']=='spam']#.sample(frac=0.7,random_state=100)
s=data[data['class']=='spam'].drop(spamEmails.index)
s

Unnamed: 0,message,class


In [23]:
# spam and ham emails
spamEmails = data[data['class']=='spam']
hamEmails = data[data['class']=='ham']
# training data
spamTrain = spamEmails.sample(frac=0.7,random_state=100)
hamTrain = hamEmails.sample(frac=0.7,random_state=100)
dfTrain = pd.concat([spamTrain,hamTrain])
# test data
spamTest = spamEmails.drop(spamTrain.index)
hamTest = hamEmails.drop(hamTrain.index)
dfTest = pd.concat([spamTest, hamTest])
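
One possible next step, sketched here under assumptions, is to fit the model on the training part only and measure accuracy on the held-out part. The example uses a hypothetical toy corpus and a hypothetical helper `split_by_class` so that it runs on its own; it mirrors the per-class `sample` and `drop` split above but is not the notebook's original code. Note that the vectorizer is fitted on the training messages only and then reused to transform the test messages.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the real emails.
spam_msgs = [f"win free cash prize offer {i}" for i in range(10)]
ham_msgs = [f"project meeting schedule notes {i}" for i in range(10)]
toy = pd.DataFrame(
    {"message": spam_msgs + ham_msgs, "class": ["spam"] * 10 + ["ham"] * 10},
    index=[f"msg_{i}" for i in range(20)],  # unique index, needed by drop()
)


def split_by_class(df, frac=0.7, seed=100):
    """Hypothetical helper: sample `frac` of each class for training."""
    train_parts, test_parts = [], []
    for _, group in df.groupby("class"):
        train = group.sample(frac=frac, random_state=seed)
        train_parts.append(train)
        test_parts.append(group.drop(train.index))
    return pd.concat(train_parts), pd.concat(test_parts)


toy_train, toy_test = split_by_class(toy)

# Learn the vocabulary from the training messages only.
split_vectorizer = CountVectorizer()
train_counts = split_vectorizer.fit_transform(toy_train["message"])
test_counts = split_vectorizer.transform(toy_test["message"])

split_clf = MultinomialNB()
split_clf.fit(train_counts, toy_train["class"])

accuracy = accuracy_score(toy_test["class"], split_clf.predict(test_counts))
print(len(toy_train), len(toy_test), accuracy)
```

On the real data, the same pattern would use `dfTrain` and `dfTest` in place of the toy frames.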

In [16]:
data[data['class']=='spam']


Unnamed: 0,message,class
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00030.0c9cdd9d4025bd55dac02719ec8d29dc,"<html>\n\n\n\n<head>\n\n<meta http-equiv=3D""Co...",spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00063.2334fb4e465fc61e8406c75918ff72ed,IS YOUR BUSINESS MAKING MONEY!\n\nSet Up To Ac...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00070.ab34b6c044a55bef3d6c1f64b7521773,When America's top companies compete for your ...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00057.0a2e17bde9485e999ac2259df38528e2,Lowest rates available for term life insurance...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00061.bec763248306fb3228141491856ed216,<!-- saved from url=3D(0022)http://internet.e-...,spam
...,...,...
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00499.988506a852cf86b396771a8bdc8cf839,<html>\n\n<head>\n\n</head>\n\n <body backgro...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00492.73db79fb9ad03aff1e08deb73b83203c,"<html><body><center>\n\n\n\n<table bgcolor=3D""...",spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00493.1c5f59825f7a246187c137614fb1ea82,<HR>\n\n<html>\n\n<head>\n\n <title>Secured I...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00495.e22a609b7dc412c120d09e11544c67fb,\n\n\n\n\n\n\n\n\n\n\n\n\n\nFROM:MR. DESMOND S...,spam


In [11]:
dfTest

Unnamed: 0,message,class
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00030.0c9cdd9d4025bd55dac02719ec8d29dc,"<html>\n\n\n\n<head>\n\n<meta http-equiv=3D""Co...",spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00070.ab34b6c044a55bef3d6c1f64b7521773,When America's top companies compete for your ...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00061.bec763248306fb3228141491856ed216,<!-- saved from url=3D(0022)http://internet.e-...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00058.64bb1902c4e561fb3e521a6dbf8625be,<html>\n\n<head>\n\n</head>\n\n<body>\n\n\n\n<...,spam
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/spam/00009.027bf6e0b0c4ab34db3ce0ea4bf2edab,TIRED OF THE BULL OUT THERE?\n\nWant To Stop L...,spam
...,...,...
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/ham/00427.49db73be9017efca7355ee80f173a26c,"Jim Whitehead wrote:\n\n\n\n>Great, this is ha...",ham
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/ham/00465.772004398b9f98bc63ab9c603b10ce08,"On Fri, 6 Sep 2002, Russell Turpin wrote:\n\n\...",ham
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/ham/00548.120e45c5d33311bc09e844bf236521d2,This article from NYTimes.com \n\nhas been sen...,ham
/content/gdrive/MyDrive/Programming Topics/ML Python/emailsNB/ham/00454.e4cd59db7f7856303052e3e882be313c,"On Thu, 5 Sep 2002 bitbitch@magnesium.net wrot...",ham


In [9]:
print(5000*0.7)
print(574*0.7)

3500.0
401.79999999999995
