## Phase 3.28
# Bayes Classification

## Objectives
- Revisit <a href='#bayes'>Bayes Theorem</a> and conceptualize building a model.
- Understand the data processing method: <a href='#bow'>Bag of Words</a>.
- Implement <a href='#nb-clf'>Naive Bayes Classifier</a> in scikit-learn.
    - Compare Naive Bayes with <a href='#logreg'>another classification model</a>.

<a id='bayes'></a>
# Bayes Classification: Introduction

Naive Bayes algorithms extend Bayes' formula to multiple variables by assuming that the features are independent of one another given the class. This assumption is often not met in practice (hence the name "naive"), but the method can nonetheless provide strong results on clean, well-normalized datasets. Independence allows you to estimate an overall probability by multiplying the conditional probabilities of the individual features.

## Recap: Bayes Theorem

$$ \Large P(A|B) = \frac{P(B|A)\bullet P(A)}{P(B)}$$

Expanding to multiple features under the independence assumption, the Naive Bayes formula is:  

$$ \Large P(y|x_1, x_2, ..., x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$
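
As a worked illustration of the formula above, the sketch below scores a two-word message against two classes. Every prior and conditional probability here is invented for illustration; none of them is estimated from real data.

```python
# Hypothetical priors P(y) and conditionals P(x_i | y); every number is made up.
priors = {'spam': 0.4, 'ham': 0.6}
likelihoods = {
    'spam': {'free': 0.30, 'pizza': 0.05},
    'ham': {'free': 0.02, 'pizza': 0.10},
}
message = ['free', 'pizza']

# Numerator of the formula: P(y) times the product of P(x_i | y).
scores = {}
for label, prior in priors.items():
    score = prior
    for word in message:
        score *= likelihoods[label][word]
    scores[label] = score

# The denominator P(x_1, ..., x_n) is the same for every class,
# so dividing by the sum of the numerators normalizes the scores.
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}
print(posteriors)
```

Because the denominator is shared by all classes, a classifier only needs to compare the numerators to pick the most probable class.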

---
<a id='bow'></a>
# How can we represent a text as a DataFrame?

 $$ P(\text{Document | Word}) = \dfrac{P(\text{Word | Document})P(\text{Document})}{P(\text{Word})}$$  

## Bag of Words

A **Bag of Words** is a frequency-mapping for a given text or document.

For example:

> **Sentence**: `'My favorite food is pizza. My brother likes pizza too.'`
> 
> **Bag of Words**: `{'My': 2, 'favorite': 1, 'food': 1, 'is': 1, 'pizza': 2, 'brother': 1, 'likes': 1, 'too': 1}`

---
## Transforming Text Data

***Original Data***
> **Sentence 1**: `'My favorite food is pizza. My brother likes pizza too.'`
> 
> **Sentence 2**: `'My mom likes baseball. My dad likes painting.'`
>

***Processing***
> **BoW 1**: `{'My': 2, 'favorite': 1, 'food': 1, 'is': 1, 'pizza': 2, 'brother': 1, 'likes': 1, 'too': 1}`
>
> **BoW 2**: `{'My': 2, 'mom': 1, 'likes': 2, 'baseball': 1, 'dad': 1, 'painting': 1}`

***DataFrame Representation***
> | mom | favorite | brother | pizza | painting | too | my | is | dad | food | baseball | likes |
> | --- | ---      | ---     | ---   | ---      | --- | -- | -- | --- | ---  | ---      | ---   |
> |   0 |   1      |   1     |   2   |   0      |   1 |  2 |  1 |   0 |   1  |   0      |   1   |
> |   1 |   0      |   0     |   0   |   1      |   0 |  2 |  0 |   1 |   0  |   1      |   2   |

In [1]:
# Coding a Bag of Words
s1 = 'My favorite food is pizza My brother likes pizza too'
s2 = 'My mom likes baseball My dad likes painting'

# PRACTICE!
# Create BoW dictionaries for each sentence.
bow1 = {}
for word in s1.lower().split():
    # .get(word, 0) returns the current count, or 0 if the word is new,
    # so each key maps to its frequency: {word: freq}.
    bow1[word] = bow1.get(word, 0) + 1

bow2 = {}
for word in s2.lower().split():
    bow2[word] = bow2.get(word, 0) + 1

In [2]:
bow1

{'my': 2,
 'favorite': 1,
 'food': 1,
 'is': 1,
 'pizza': 2,
 'brother': 1,
 'likes': 1,
 'too': 1}

In [3]:
bow2

{'my': 2, 'mom': 1, 'likes': 2, 'baseball': 1, 'dad': 1, 'painting': 1}
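
The two dictionaries can be stacked into the DataFrame representation shown earlier. This is a minimal sketch, assuming pandas is available; `bag_of_words` is a hypothetical helper introduced here only to avoid repeating the loop.

```python
import pandas as pd

s1 = 'My favorite food is pizza My brother likes pizza too'
s2 = 'My mom likes baseball My dad likes painting'


def bag_of_words(sentence):
    """Count lowercase, whitespace-separated tokens in a sentence."""
    counts = {}
    for word in sentence.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


# One row per sentence, one column per vocabulary word.
# A word missing from a sentence becomes NaN, so fill those cells with 0.
bow_df = pd.DataFrame([bag_of_words(s1), bag_of_words(s2)]).fillna(0).astype(int)
print(bow_df)
```

Each row now has the same columns, which is the shape a scikit-learn model expects.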

<a id='nb-clf'></a>
# Classifying News
## Implementing Bayes Classifier in Scikit-Learn

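**Three commonly used Naive Bayes classifiers in sklearn** (the library also provides others, such as `ComplementNB` and `CategoricalNB`):

1. Gaussian: Assumes that continuous features follow a normal distribution within each class.

2. Bernoulli: Useful if your features are binary (for example, whether a word is present or absent).

3. Multinomial: Useful if your features are discrete counts (for example, word frequencies).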
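
To make the distinction concrete, here is a small sketch that pairs each variant with the kind of feature it is designed for. The arrays are toy data invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy data, invented for illustration: 4 samples, 2 features, 2 classes.
y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB.
X_continuous = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 0.4], [3.0, 0.6]])
# Binary presence/absence flags -> BernoulliNB.
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
# Non-negative counts (e.g. word counts) -> MultinomialNB.
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])

gaussian_pred = GaussianNB().fit(X_continuous, y).predict(X_continuous)
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict(X_binary)
multinomial_pred = MultinomialNB().fit(X_counts, y).predict(X_counts)
print(gaussian_pred, bernoulli_pred, multinomial_pred)
```

All three share the same `fit` / `predict` interface, so swapping one for another is usually a one-line change.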

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups

In [5]:
# Load data.
news_data = fetch_20newsgroups(
    subset='all',
    categories=['rec.sport.baseball', 'talk.politics.misc']
    )
news_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
# Create a dataframe from the loaded data.
df = pd.DataFrame()
df['data'] = news_data['data']
df['target'] = news_data['target']
df.head()

|   | data | target |
| --- | --- | --- |
| 0 | From: paula@koufax.cv.hp.com (Paul Andresen)\n... | 0 |
| 1 | From: garrett@Ingres.COM \nSubject: Re: Limiti... | 1 |
| 2 | From: djs9683@ritvax.isc.rit.edu\nSubject: Re:... | 0 |
| 3 | From: nickn@eskimo.com (Nick Nussbaum)\nSubjec... | 1 |
| 4 | From: jerry@sheldev.shel.isc-br.com (Gerald La... | 0 |


In [7]:
# Look at a row of the data.
df.iloc[0]['data']

'From: paula@koufax.cv.hp.com (Paul Andresen)\nSubject: Re: Let it be Known\nNntp-Posting-Host: koufax.cv.hp.com\nOrganization: Hewlett-Packard Company, Corvallis, Oregon USA\nLines: 15\n\nIn article <93104.233239ISSBTL@BYUVM.BITNET>, <ISSBTL@BYUVM.BITNET> writes:\n|> I would like to make everyone aware that in winning the NL West the Atlanta\n|> Braves did not lead wire-to-wire.  Through games of 4/14/93 the Houston\n|> Astros are percentage points ahead of the "unbeatable" Braves.\n\nAnd they deserve to be, if for no other reason than salvaging a little of the\nhonor of the NL West. The supposed strongest division in baseball lost 6 of 7\nto the East yesterday, with only the Astros prevailing.\n--------------------------------------------------------------------------------\n           We will stretch no farm animal beyond its natural length\n\n  paula@koufax.cv.hp.com   Paul Andresen  Hewlett-Packard  (503)-750-3511\n\n    home: 3006 NW McKinley    Corvallis, OR 97330       (503)-75

In [8]:
# Perform train test split (random_state=2021)
X_train, X_test, y_train, y_test = train_test_split(
    df['data'],
    df['target'],
    random_state=2021)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1326,), (443,), (1326,), (443,))

In [9]:
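# As raw strings, 'The' != 'the'.
# CountVectorizer lowercases text by default (lowercase=True), so both map to one token.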

In [10]:
# Use CountVectorizer to process text.
count_vec = CountVectorizer()

X_train_processed = count_vec.fit_transform(X_train)
X_test_processed = count_vec.transform(X_test)
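
The fit-on-train, transform-on-test pattern above can be seen on a tiny scale in the sketch below. The documents and the `toy_vec` name are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration.
train_docs = ['The Braves won', 'the vote passed']
test_docs = ['The Astros won the vote']

toy_vec = CountVectorizer()  # lowercase=True by default
train_counts = toy_vec.fit_transform(train_docs)  # learns the vocabulary
test_counts = toy_vec.transform(test_docs)        # reuses that vocabulary

# 'The' and 'the' share one column; 'astros' was never seen in training,
# so it has no column and is silently dropped from the test matrix.
print(sorted(toy_vec.vocabulary_))
print(test_counts.toarray())
```

Calling `fit_transform` on the test set instead would learn a different vocabulary, and the columns would no longer line up with what the model was trained on.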

In [11]:
X_train_processed

<1326x22604 sparse matrix of type '<class 'numpy.int64'>'
	with 223753 stored elements in Compressed Sparse Row format>
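
The summary above shows why the result is stored as a sparse matrix. A quick back-of-the-envelope check, using only the figures printed in that summary and assuming 8 bytes per `int64` cell:

```python
# Figures copied from the sparse-matrix summary above.
n_rows, n_cols = 1326, 22604
stored_elements = 223753

total_cells = n_rows * n_cols
density = stored_elements / total_cells

# Approximate memory for a dense int64 array (8 bytes per cell).
dense_megabytes = total_cells * 8 / 1e6
print(f'{density:.2%} of cells are non-zero')
print(f'dense int64 array: about {dense_megabytes:.0f} MB')
```

Fewer than 1% of the cells hold a non-zero count, so converting with `.toarray()` is convenient for inspection but costly on larger corpora.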

In [12]:
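# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0).
count_vec.get_feature_names_out().tolist()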

['00',
 '000',
 '000007',
 '0000ahc',
 '000k',
 '000th',
 '001',
 '0010',
 '001116',
 '001338',
 '001815',
 '002',
 '002251w',
 '002302',
 '002339',
 '003',
 '003015',
 '003848',
 '004224',
 '004746',
 '005',
 '005204',
 '005314',
 '005634',
 '005756',
 '0059',
 '006',
 '0062',
 '007',
 '0084',
 '0096a95c',
 '0096b0f0',
 '00bjgood',
 '00cgbabbitt',
 '00cmmiller',
 '00ecgillespi',
 '00ecgillespie',
 '00mbstultz',
 '00pmlemen',
 '00x',
 '01',
 '0100',
 '010657',
 '010745',
 '010908',
 '011042',
 '0114',
 '011455',
 '011653',
 '012139',
 '013',
 '013145',
 '013559',
 '013651',
 '013653',
 '014',
 '015',
 '015442',
 '015908',
 '016',
 '0169',
 '017',
 '01810',
 '019',
 '01apr93',
 '02',
 '020',
 '020347',
 '020426',
 '0205',
 '020513',
 '020832',
 '02086551',
 '021',
 '021021',
 '021154',
 '02138',
 '02139',
 '02141',
 '02143',
 '02172',
 '022',
 '02215',
 '022222',
 '0223',
 '022425',
 '022903',
 '023',
 '023116',
 '023211',
 '024',
 '024222',
 '024646',
 '025',
 '025018',
 '025027',
 '02

In [13]:
# Show sparse matrix as DataFrame. (.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
pd.DataFrame(X_train_processed.toarray(), columns=count_vec.get_feature_names_out())

Unnamed: 0,00,000,000007,0000ahc,000k,000th,001,0010,001116,001338,...,zoologists,zoology,zorba,zot,zumwalt,zupcic,zz,zzzzzz,zzzzzzt,ñaustin
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1321,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1322,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1323,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1324,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
X_train_processed.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [16]:
# Build a Gaussian Naive Bayes model.
nb = GaussianNB()

nb.fit(X_train_processed.toarray(), y_train)

GaussianNB()

In [17]:
# Make predictions and show scores.
y_pred_train = nb.predict(X_train_processed.toarray())
y_pred_test = nb.predict(X_test_processed.toarray())

In [18]:
from sklearn.metrics import classification_report

In [19]:
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       740
           1       1.00      1.00      1.00       586

    accuracy                           1.00      1326
   macro avg       1.00      1.00      1.00      1326
weighted avg       1.00      1.00      1.00      1326



In [20]:
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.99      0.96      0.98       254
           1       0.95      0.98      0.97       189

    accuracy                           0.97       443
   macro avg       0.97      0.97      0.97       443
weighted avg       0.97      0.97      0.97       443
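
The precision and recall columns in these reports come from counts of correct and incorrect predictions per class. The sketch below shows that relationship with `confusion_matrix`; the labels are hypothetical and are not the notebook's results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for illustration (not the notebook's results).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # of everything predicted as class 1, how much was right
recall_1 = tp / (tp + fn)     # of everything truly class 1, how much was found
print(cm)
print(precision_1, recall_1)
```

On the real data, the same call would presumably be `confusion_matrix(y_test, y_pred_test)`.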



In [21]:
# How can we look at predictions where the classification was incorrect?

# put them both in a df as 2 cols, filter out where they don't match
saif_df = pd.DataFrame()
saif_df['y_test'] = y_test
saif_df['y_pred'] = y_pred_test
saif_df

|   | y_test | y_pred |
| --- | --- | --- |
| 334 | 1 | 1 |
| 758 | 1 | 1 |
| 1269 | 0 | 0 |
| 1299 | 0 | 0 |
| 1115 | 1 | 1 |
| ... | ... | ... |
| 204 | 1 | 1 |
| 281 | 1 | 1 |
| 881 | 0 | 0 |
| 1281 | 0 | 0 |


In [22]:
index_list_of_wrong_preds = saif_df.loc[saif_df['y_test'] != saif_df['y_pred']].index

In [23]:
# index_list_of_wrong_preds holds index labels, so .loc is the matching accessor.
df.loc[index_list_of_wrong_preds]['data'].values[1]

'From: rlm7638@tamsun.tamu.edu (Jack McKinney)\nSubject: Official Rules of Baseball ISBN\nOrganization: Mistress Barbara\'s Dungeon Palace\nLines: 12\nDistribution: na\nNNTP-Posting-Host: tamsun.tamu.edu\n\n     I am trying to get a copy of the _official_ rules of baseball.\nSomeone once sent me the ISBN number of it, but I have since lost it.\nCan anyone give me this information, or tell me where I can find the\nbook?  None of my local bookstores have it.\n\n+---------------------------------------------------+------------------------+\n| "I\'m walking home from school, and I\'m watching   | Jack McKinney          |\n|  some men building a new house, and the guy ham-  | jmckinney@tamu.edu     |\n|  mering on the roof calls me a paranoid little    +------------------------+\n|  weirdo....                  in Morse code."      |       This space       |\n|                       -Emo Philips                |        for rent        |\n+---------------------------------------------------+-

<a id='logreg'></a>
## Train KNN Model

In [24]:
# Let's compare with a KNN model!
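
One way the comparison could look is sketched below on a toy corpus, so it runs without downloading the newsgroups data. The documents, labels, and the choice of `n_neighbors=3` are assumptions made for illustration, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Toy corpus, invented for illustration: 0 = baseball, 1 = politics.
train_docs = [
    'the pitcher threw a fastball',
    'home run in the ninth inning',
    'the team won the baseball game',
    'the senate passed the tax bill',
    'voters elected a new president',
    'congress debated the new policy',
]
train_labels = [0, 0, 0, 1, 1, 1]
test_docs = ['the pitcher won the game', 'the president signed the bill']

toy_vec = CountVectorizer()
X_toy_train = toy_vec.fit_transform(train_docs)
X_toy_test = toy_vec.transform(test_docs)

# Both estimators accept the sparse count matrix directly.
models = {
    'MultinomialNB': MultinomialNB(),
    'KNN (k=3)': KNeighborsClassifier(n_neighbors=3),
}
predictions = {}
for name, model in models.items():
    model.fit(X_toy_train, train_labels)
    predictions[name] = model.predict(X_toy_test).tolist()
    print(name, predictions[name])
```

On the notebook's data, the equivalent would be fitting `KNeighborsClassifier()` on `X_train_processed` and `y_train`, then passing its predictions on `X_test_processed` to `classification_report` for a side-by-side comparison with the Naive Bayes scores.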


# Assumptions / Pros-Cons
## Assumptions
- Features are independent.

## Pros
- Performs very well with categorical variables (for example, one-hot encoded features).
- Very fast training time.
    - Can handle large datasets well.
- Good when there are many features.
- Robust to outliers.

## Cons
- Correlated features negatively impact performance.


# Example Use Cases
- Sentiment Analysis
- Document Classification
    - Spam Filtering