## Phase 3.28
# Bayes Classification

## Objectives
- Revisit <a href='#bayes'>Bayes Theorem</a> and conceptualize building a model.
- Understand the data processing method: <a href='#bow'>Bag of Words</a>.
- Implement <a href='#nb-clf'>Naive Bayes Classifier</a> in scikit-learn.
    - Compare Naive Bayes with <a href='#logreg'>another classification model</a>.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_20newsgroups

<a id='bayes'></a>
# Bayes Classification: Introduction

Naive Bayes algorithms extend Bayes' formula to multiple variables by assuming that the features are independent of one another given the class. This assumption is often not met (hence the "naive"), but the method can nonetheless provide strong results, particularly on clean, well-normalized datasets.

This then allows you to estimate an overall probability by multiplying the conditional probabilities for each of the independent features.

## Recap: Bayes Theorem

$$ \Large P(A|B) = \frac{P(B|A)\bullet P(A)}{P(B)}$$
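As a quick numeric check of the formula, here is a minimal sketch. The prior and the two conditional probabilities are made-up values chosen only for illustration; they do not come from this lesson's data.

```python
# Hypothetical numbers chosen only to illustrate Bayes' theorem.
p_a = 0.01              # P(A): prior probability of event A
p_b_given_a = 0.95      # P(B|A): probability of B when A is true
p_b_given_not_a = 0.10  # P(B|not A): probability of B when A is false

# The law of total probability gives the denominator P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))
```

Even with a high P(B|A), the posterior stays small here because the prior P(A) is small.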

Expanding to multiple features, and assuming they are conditionally independent given the class $y$, Bayes' formula becomes:

$$ \Large P(y|x_1, x_2, ..., x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$
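The multi-feature formula can be sketched in plain Python. The class names, priors, and per-word probabilities below are invented for illustration, and `naive_bayes_posterior` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical priors and per-class word probabilities, invented for illustration.
priors = {"sports": 0.5, "politics": 0.5}
likelihoods = {
    "sports":   {"game": 0.30, "vote": 0.02},
    "politics": {"game": 0.05, "vote": 0.25},
}

def naive_bayes_posterior(features, priors, likelihoods):
    """Return P(y | features) for every class y under the naive assumption."""
    scores = {}
    for label, prior in priors.items():
        # Numerator: P(y) times the product of P(x_i | y).
        score = prior
        for x in features:
            score *= likelihoods[label][x]
        scores[label] = score
    # Denominator: P(x_1, ..., x_n), the sum of the numerators over all classes.
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}

posterior = naive_bayes_posterior(["game", "game", "vote"], priors, likelihoods)
print(posterior)
```

Because the denominator is the same for every class, comparing the numerators alone is enough to pick the most likely class.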

---
<a id='bow'></a>
# How can we represent a text as a DataFrame?

 $$ P(\text{Document | Word}) = \dfrac{P(\text{Word | Document})P(\text{Document})}{P(\text{Word})}$$  

## Bag of Words

A **Bag of Words** is a frequency-mapping for a given text or document.

For example:

> **Sentence**: `'My favorite food is pizza. My brother likes pizza too.'`
> 
> **Bag of Words**: `{'My': 2, 'favorite': 1, 'food': 1, 'is': 1, 'pizza': 2, 'brother': 1, 'likes': 1, 'too': 1}`

---
## Transforming Text Data

***Original Data***
> **Sentence 1**: `'My favorite food is pizza. My brother likes pizza too.'`
> 
> **Sentence 2**: `'My mom likes baseball. My dad likes painting.'`
>

***Processing***
> **BoW 1**: `{'My': 2, 'favorite': 1, 'food': 1, 'is': 1, 'pizza': 2, 'brother': 1, 'likes': 1, 'too': 1}`
>
> **BoW 2**: `{'My': 2, 'mom': 1, 'likes': 2, 'baseball': 1, 'dad': 1, 'painting': 1}`

***DataFrame Representation***
> | mom | favorite | brother | pizza | painting | too | my | is | dad | food | baseball | likes |
> | --- | ---      | ---     | ---   | ---      | --- | -- | -- | --- | ---  | ---      | ---   |
> |   0 |   1      |   1     |   2   |   0      |   1 |  2 |  1 |   0 |   1  |   0      |   1   |
> |   1 |   0      |   0     |   0   |   1      |   0 |  2 |  0 |   1 |   0  |   1      |   2   |

### PRACTICE!
#### Create BoW dictionaries for each sentence.

In [2]:
# Coding a Bag of Words
s1 = 'My favorite food is pizza. My brother likes pizza too.'
s2 = 'My mom likes baseball. My dad likes painting.'

In [8]:
# John
s1 = 'My favorite food is pizza. My brother likes pizza too.'

s1_split = s1.replace('.', '').split()
count = {}
for word in s1_split:
    if word in count:
        count[word] += 1
    else:
        count[word] = 1
count

{'My': 2,
 'favorite': 1,
 'food': 1,
 'is': 1,
 'pizza': 2,
 'brother': 1,
 'likes': 1,
 'too': 1}

In [5]:
# Vi
word_counts = {}  # named word_counts because `dict` would shadow the built-in type

for word in s1.split():
    print(word)

My
favorite
food
is
pizza.
My
brother
likes
pizza
too.


In [6]:
s1.split()

['My',
 'favorite',
 'food',
 'is',
 'pizza.',
 'My',
 'brother',
 'likes',
 'pizza',
 'too.']

In [None]:
# Huseyin
def f(s):
    # Strip periods, split on whitespace, then tally each word.
    a = {}
    b = s.replace('.', '').split()
    for i in b:
        a[i] = a.get(i, 0) + 1
    return a

In [10]:
bow1 = {}
# Code me!!
for word in s1.replace('.', '').split():
    bow1[word] = bow1.get(word, 0) + 1
bow1

{'My': 2,
 'favorite': 1,
 'food': 1,
 'is': 1,
 'pizza': 2,
 'brother': 1,
 'likes': 1,
 'too': 1}

In [11]:
bow2 = {}
# Code me!!
for word in s2.replace('.', '').split():
    bow2[word] = bow2.get(word, 0) + 1
bow2

{'My': 2, 'mom': 1, 'likes': 2, 'baseball': 1, 'dad': 1, 'painting': 1}
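The same counting can also be done with `collections.Counter` from the standard library. This is a minimal sketch of an alternative, not the approach used in the rest of the notebook; `bag_of_words`, `bow1_counter`, and `bow2_counter` are names introduced here for illustration.

```python
from collections import Counter

s1 = 'My favorite food is pizza. My brother likes pizza too.'
s2 = 'My mom likes baseball. My dad likes painting.'

def bag_of_words(sentence):
    """Count word frequencies after stripping periods, as in the loops above."""
    return Counter(sentence.replace('.', '').split())

bow1_counter = bag_of_words(s1)
bow2_counter = bag_of_words(s2)
print(dict(bow1_counter))
print(dict(bow2_counter))
```

A `Counter` returns 0 for words it has never seen, so no `.get(word, 0)` is needed when looking up a missing key.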

In [16]:
d = {'a': 1, 'b': 2, 'c': 3}
d.get('x', 'NOTHING HERE!!!!')

'NOTHING HERE!!!!'

In [17]:
d['x']

KeyError: 'x'

In [20]:
pd.DataFrame([bow1, bow2]).fillna(0).astype(int)

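|   | My | favorite | food | is | pizza | brother | likes | too | mom | baseball | dad | painting |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 1 | 1 |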


<a id='nb-clf'></a>
# Classifying News
## Implementing Bayes Classifier in Scikit-Learn

**Three commonly used Naive Bayes classifiers in sklearn:**

1. Gaussian: Assumes that continuous features follow a normal distribution within each class.

2. Bernoulli: Useful if your features are binary (e.g., whether a word appears in a document or not).

3. Multinomial: Useful if your features are discrete counts (e.g., word counts from a Bag of Words).

In [21]:
# Load data.
news_data = fetch_20newsgroups(
    subset='all',
    categories=['rec.sport.baseball', 'talk.politics.misc']
    )
news_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [22]:
# Create a dataframe from the loaded data.
df = pd.DataFrame()
df['data'] = news_data['data']
df['target'] = news_data['target']
df.head()

|   | data | target |
| --- | --- | --- |
| 0 | From: paula@koufax.cv.hp.com (Paul Andresen)\n... | 0 |
| 1 | From: garrett@Ingres.COM \nSubject: Re: Limiti... | 1 |
| 2 | From: djs9683@ritvax.isc.rit.edu\nSubject: Re:... | 0 |
| 3 | From: nickn@eskimo.com (Nick Nussbaum)\nSubjec... | 1 |
| 4 | From: jerry@sheldev.shel.isc-br.com (Gerald La... | 0 |


In [23]:
# Look at a row of the data.
df.iloc[0]['data']

'From: paula@koufax.cv.hp.com (Paul Andresen)\nSubject: Re: Let it be Known\nNntp-Posting-Host: koufax.cv.hp.com\nOrganization: Hewlett-Packard Company, Corvallis, Oregon USA\nLines: 15\n\nIn article <93104.233239ISSBTL@BYUVM.BITNET>, <ISSBTL@BYUVM.BITNET> writes:\n|> I would like to make everyone aware that in winning the NL West the Atlanta\n|> Braves did not lead wire-to-wire.  Through games of 4/14/93 the Houston\n|> Astros are percentage points ahead of the "unbeatable" Braves.\n\nAnd they deserve to be, if for no other reason than salvaging a little of the\nhonor of the NL West. The supposed strongest division in baseball lost 6 of 7\nto the East yesterday, with only the Astros prevailing.\n--------------------------------------------------------------------------------\n           We will stretch no farm animal beyond its natural length\n\n  paula@koufax.cv.hp.com   Paul Andresen  Hewlett-Packard  (503)-750-3511\n\n    home: 3006 NW McKinley    Corvallis, OR 97330       (503)-75

In [24]:
df['target'].value_counts(normalize=True)

0    0.561899
1    0.438101
Name: target, dtype: float64

In [25]:
# Perform train test split. No random_state is passed here, so the split
# changes between runs; add random_state=2021 to make it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df['data'],
    df['target'])

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1326,), (443,), (1326,), (443,))
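To illustrate what a fixed seed buys you, here is a minimal sketch on toy arrays that stand in for `df['data']` and `df['target']`. The `stratify=y` argument is an addition not used in the cell above; it keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for df['data'] / df['target'].
X = np.arange(20)
y = np.array([0] * 12 + [1] * 8)

split_a = train_test_split(X, y, random_state=2021, stratify=y)
split_b = train_test_split(X, y, random_state=2021, stratify=y)

# The same seed produces identical splits on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print(same_split, len(X_tr), len(X_te), y_te.mean())
```

With the default `test_size` of 0.25, a quarter of the rows end up in the test set, which matches the 1326 / 443 shapes shown above.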

In [26]:
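# Lowercase so that 'The' and 'the' count as the same token.
# (CountVectorizer also lowercases by default, so this step is optional.)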
X_train = X_train.apply(lambda x: x.lower())
X_test = X_test.apply(lambda x: x.lower())

In [27]:
X_train[0]

'from: paula@koufax.cv.hp.com (paul andresen)\nsubject: re: let it be known\nnntp-posting-host: koufax.cv.hp.com\norganization: hewlett-packard company, corvallis, oregon usa\nlines: 15\n\nin article <93104.233239issbtl@byuvm.bitnet>, <issbtl@byuvm.bitnet> writes:\n|> i would like to make everyone aware that in winning the nl west the atlanta\n|> braves did not lead wire-to-wire.  through games of 4/14/93 the houston\n|> astros are percentage points ahead of the "unbeatable" braves.\n\nand they deserve to be, if for no other reason than salvaging a little of the\nhonor of the nl west. the supposed strongest division in baseball lost 6 of 7\nto the east yesterday, with only the astros prevailing.\n--------------------------------------------------------------------------------\n           we will stretch no farm animal beyond its natural length\n\n  paula@koufax.cv.hp.com   paul andresen  hewlett-packard  (503)-750-3511\n\n    home: 3006 nw mckinley    corvallis, or 97330       (503)-75

In [28]:
# Use CountVectorizer to process text.
count_vec = CountVectorizer()

X_train_processed = count_vec.fit_transform(X_train)
X_test_processed = count_vec.transform(X_test)

In [29]:
X_train_processed

<1326x23135 sparse matrix of type '<class 'numpy.int64'>'
	with 226455 stored elements in Compressed Sparse Row format>
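The difference between `fit_transform` on the training text and `transform` on the test text is easier to see on a tiny corpus. The sentences and the names `train_docs`, `test_docs`, and `vec` below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only.
train_docs = ["my favorite food is pizza", "my mom likes baseball"]
test_docs = ["my dad likes pizza and painting"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary from the training text
test_counts = vec.transform(test_docs)        # reuses that vocabulary; unseen words are ignored

# vocabulary_ maps each term to its column index; columns are in sorted term order.
vocab = sorted(vec.vocabulary_)
print(vocab)
print(test_counts.toarray())
```

Words such as "dad" and "painting" never appeared in the training text, so they get no column. This keeps the training and test matrices the same width, which the classifier requires.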

In [30]:
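# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions, call get_feature_names_out() instead.
count_vec.get_feature_names()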

['00',
 '000',
 '000007',
 '0000ahc',
 '000k',
 '000th',
 '001',
 '0010',
 '001116',
 '001211',
 '001338',
 '001815',
 '002',
 '002251w',
 '002302',
 '002339',
 '003',
 '003015',
 '003848',
 '004224',
 '005',
 '005314',
 '005634',
 '005756',
 '0059',
 '006',
 '0062',
 '007',
 '0084',
 '0096b0f0',
 '00bjgood',
 '00cgbabbitt',
 '00cmmiller',
 '00ecgillespi',
 '00ecgillespie',
 '00mbstultz',
 '00pm',
 '00pmlemen',
 '00x',
 '01',
 '0100',
 '010423',
 '010657',
 '010745',
 '010908',
 '011042',
 '0112',
 '0114',
 '011653',
 '012139',
 '012344',
 '013',
 '013559',
 '013651',
 '013653',
 '014',
 '015',
 '015442',
 '015908',
 '016',
 '0169',
 '017',
 '01810',
 '019',
 '01apr93',
 '02',
 '020',
 '0205',
 '020832',
 '02086551',
 '021',
 '021021',
 '02115',
 '021154',
 '02139',
 '02141',
 '02172',
 '022',
 '02215',
 '022222',
 '0223',
 '022425',
 '023',
 '023116',
 '023211',
 '024',
 '024222',
 '024646',
 '025',
 '025018',
 '025027',
 '025331',
 '025636',
 '027',
 '029',
 '03',
 '030234',
 '030412

In [31]:
# Show sparse matrix as DataFrame. (.toarray())
# On scikit-learn >= 1.2, use get_feature_names_out() for the column names.
pd.DataFrame(X_train_processed.toarray(), 
             columns=count_vec.get_feature_names())

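|   | 00 | 000 | 000007 | 0000ahc | 000k | 000th | 001 | 0010 | 001116 | 001211 | ... | zoologists | zoology | zorba | zot | zumwalt | zupcic | zz | zzzzzz | zzzzzzt | ñaustin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1321 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1322 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1323 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1324 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |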


In [35]:
X_train_processed.toarray()#.shape

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [36]:
# Build a Gaussian Naive Bayes model.
nb = GaussianNB()

nb.fit(X_train_processed.toarray(), y_train)

GaussianNB()
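Since the features here are word counts, a multinomial model is a natural alternative to try alongside the Gaussian one. The sketch below uses a tiny made-up corpus, not the newsgroups data, and makes no claim about how the two models compare on the real dataset. One practical difference: `MultinomialNB` accepts the sparse matrix directly, so no `.toarray()` call is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus and labels (0 = baseball, 1 = politics), for illustration only.
docs = [
    "the pitcher threw a great game",
    "home run in the ninth inning",
    "the senate passed the tax bill",
    "voters elect a new president",
]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse count matrix

mnb = MultinomialNB()             # default Laplace smoothing (alpha=1.0)
mnb.fit(counts, labels)           # sparse input is accepted directly

new_docs = ["the pitcher had a great inning", "the president signed the bill"]
mnb_pred = mnb.predict(vec.transform(new_docs))
print(mnb_pred)
```

The smoothing parameter `alpha` keeps a word that never appeared in one class from forcing that class's probability to zero.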

In [37]:
# Make predictions and show scores.
y_pred_train = nb.predict(X_train_processed.toarray())
y_pred_test = nb.predict(X_test_processed.toarray())

In [39]:
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       733
           1       1.00      1.00      1.00       593

    accuracy                           1.00      1326
   macro avg       1.00      1.00      1.00      1326
weighted avg       1.00      1.00      1.00      1326



In [40]:
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       1.00      0.97      0.98       261
           1       0.96      0.99      0.98       182

    accuracy                           0.98       443
   macro avg       0.98      0.98      0.98       443
weighted avg       0.98      0.98      0.98       443
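The precision, recall, and f1-score columns in these reports can be reproduced by hand. The sketch below uses toy labels invented for illustration (not the model's real predictions), and `prf_for_class` is a hypothetical helper name.

```python
# Toy labels, made up for illustration (not the model's real predictions).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 0]

def prf_for_class(y_true, y_hat, positive):
    """Compute precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision_1, recall_1, f1_1 = prf_for_class(y_true, y_hat, positive=1)
print(precision_1, recall_1, f1_1)
```

This sketch assumes the class appears at least once in both the true and predicted labels; otherwise the divisions would fail.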



In [41]:
# How can we look at predictions where the classification was incorrect?
# Put actual and predicted labels in a df as 2 cols, then keep the rows where they don't match.
df = pd.DataFrame()
df['Actual'] = y_test
df['Pred'] = y_pred_test
df

|   | Actual | Pred |
| --- | --- | --- |
| 702 | 0 | 0 |
| 1255 | 0 | 0 |
| 1740 | 0 | 0 |
| 400 | 0 | 0 |
| 579 | 0 | 0 |
| ... | ... | ... |
| 1655 | 1 | 1 |
| 682 | 0 | 1 |
| 1107 | 1 | 1 |
| 1427 | 1 | 1 |


In [42]:
index_list_of_wrong_preds = (
    df
    .loc[df['Actual'] != df['Pred']]
    .index
)
index_list_of_wrong_preds

Int64Index([154, 824, 561, 791, 517, 257, 1121, 1033, 682], dtype='int64')

In [44]:
(df.loc[df['Actual'] != df['Pred']])

|   | Actual | Pred |
| --- | --- | --- |
| 154 | 0 | 1 |
| 824 | 1 | 0 |
| 561 | 0 | 1 |
| 791 | 0 | 1 |
| 517 | 0 | 1 |
| 257 | 0 | 1 |
| 1121 | 0 | 1 |
| 1033 | 0 | 1 |
| 682 | 0 | 1 |
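The filtering pattern used here can be sketched end to end on toy data. The series `texts`, `actual`, and the list `pred` are invented stand-ins for `X_test`, `y_test`, and `y_pred_test`; the index values are arbitrary.

```python
import pandas as pd

# Toy stand-ins for X_test, y_test and y_pred_test; all values are invented.
texts = pd.Series(["doc a", "doc b", "doc c", "doc d"], index=[702, 154, 824, 400])
actual = pd.Series([0, 0, 1, 0], index=texts.index)
pred = [0, 1, 0, 0]  # plain array-like, as returned by predict()

# Using the test index keeps predictions aligned with the original row labels.
results = pd.DataFrame({"Actual": actual, "Pred": pred}, index=texts.index)

# Boolean mask: keep only the rows where the prediction is wrong.
wrong = results[results["Actual"] != results["Pred"]]

# The surviving index labels can then be used to pull up the original text.
wrong_texts = texts.loc[wrong.index]
print(wrong)
print(wrong_texts.tolist())
```

Because the index labels survive the train/test split, the same labels identify a row in the text, the true labels, and the results table.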


In [43]:
X_test[index_list_of_wrong_preds].values[:2]

array(['subject: chris webber chokes\nfrom: krueger@argon.gas.uug.arizona.edu (theodore r krueger)\norganization: university of arizona, tucson\nlines: 11\n\nafter the marvelous "time-out" call by chris webber (which resulted in \na technical foul, since his team had no time-outs left) perhaps webber \nwill take the place of bill buckner as the master of choke.  at least \nthis red sox fan hopes so.\n\nted\n\n--\nwhen chelsea clinton\'s secret service agent had to be replaced by an active \nduty soldier she objected on the grounds that her family dislikes the military.\n\t\t----- krueger@gas.uug.arizona.edu -----\n',
       "from: jmann@vineland.pubs.stratus.com (jim mann)\nsubject: re: celebrate liberty!  1993\norganization: stratus computer inc, marlboro ma\nlines: 12\nreply-to: jmann@vineland.pubs.stratus.com\nnntp-posting-host: gondolin.pubs.stratus.com\n\nin article <1993apr5.201051.15818@dsd.es.com>  \nbob.waldrop@f418.n104.z1.fidonet.org (bob waldrop) writes:\n\nwhat did this ha

<a id='logreg'></a>
## Train KNN Model

In [45]:
from sklearn.neighbors import KNeighborsClassifier

In [46]:
# Let's compare with a KNN model!
knn = KNeighborsClassifier()
knn.fit(X_train_processed, y_train)

KNeighborsClassifier()

In [47]:
# These are predicted labels, so they are named y_pred_* rather than X_pred_*.
y_pred_train_knn = knn.predict(X_train_processed)
y_pred_test_knn = knn.predict(X_test_processed)

In [48]:
print(classification_report(y_train, y_pred_train_knn))
print(classification_report(y_test, y_pred_test_knn))

              precision    recall  f1-score   support

           0       0.90      0.98      0.94       733
           1       0.97      0.86      0.91       593

    accuracy                           0.93      1326
   macro avg       0.93      0.92      0.92      1326
weighted avg       0.93      0.93      0.92      1326

              precision    recall  f1-score   support

           0       0.86      0.97      0.91       261
           1       0.94      0.77      0.85       182

    accuracy                           0.88       443
   macro avg       0.90      0.87      0.88       443
weighted avg       0.89      0.88      0.88       443



# Assumptions / Pros-Cons
## Assumptions
- Features are independent.

## Pros
- Performs very well with categorical variables (e.g., one-hot encoded features).
- Very fast training time.
    - Can handle large datasets well.
- Good when there are many features.
- Robust to outliers.

## Cons
- Correlated features negatively impact performance, because the independence assumption counts their shared evidence more than once.
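A small sketch of why correlated features hurt: if the same piece of evidence is fed in several times (perfectly correlated copies of one feature), the naive product treats each copy as new evidence and the posterior becomes overconfident. The probabilities and the helper `posterior_a` below are hypothetical, chosen only to show the effect.

```python
def posterior_a(n_copies, p_x_given_a=0.8, p_x_given_b=0.4, prior_a=0.5):
    """P(A | x repeated n_copies times) when each copy is wrongly treated as independent."""
    score_a = prior_a * p_x_given_a ** n_copies
    score_b = (1 - prior_a) * p_x_given_b ** n_copies
    return score_a / (score_a + score_b)

one_copy = posterior_a(1)      # the evidence counted once
three_copies = posterior_a(3)  # the same evidence counted three times
print(round(one_copy, 3), round(three_copies, 3))
```

The true information content is the same in both cases, yet the second posterior is pushed much closer to 1.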


# Example Use Cases
- Sentiment Analysis
- Document Classification
    - Spam Filtering