## Phase 3.28
# Bayes Classification

## Objectives
- Revisit <a href='#bayes'>Bayes Theorem</a> and conceptualize building a model.
- Understand the data processing method: <a href='#bow'>Bag of Words</a>.
- Implement <a href='#nb-clf'>Naive Bayes Classifier</a> in scikit-learn.
    - Compare Naive Bayes with <a href='#logreg'>another classification model</a>.

<a id='bayes'></a>
# Bayes Classification: Introduction

Naive Bayes algorithms extend Bayes' formula to multiple variables by assuming that the features are independent of one another given the class. This assumption is rarely met in practice (hence the "naive" label), but the approach can nonetheless provide strong results on clean, well-normalized datasets. Under the assumption, you can estimate an overall probability by multiplying the conditional probabilities of the individual features.

## Recap: Bayes' Theorem

$$ \Large P(A|B) = \frac{P(B|A)\bullet P(A)}{P(B)}$$

Expanding to multiple features under the independence assumption, the Naive Bayes formula is:

$$ \Large P(y|x_1, x_2, \ldots, x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, \ldots, x_n)}$$
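As a rough worked example of the formula above, the sketch below scores a two-word message under two classes. The class names, priors, and likelihoods are made-up illustrative numbers, not estimates from any dataset.

```python
import math

# Hypothetical class priors P(y) and per-word likelihoods P(x_i | y).
priors = {'sports': 0.5, 'politics': 0.5}
likelihoods = {
    'sports':   {'pitcher': 0.20, 'vote': 0.01},
    'politics': {'pitcher': 0.01, 'vote': 0.20},
}

def posterior(words):
    """Return P(y | words) for each class under the independence assumption."""
    # Numerator: P(y) times the product of P(x_i | y) over the observed words.
    scores = {
        label: priors[label] * math.prod(likelihoods[label][w] for w in words)
        for label in priors
    }
    # Denominator: P(x_1, ..., x_n), obtained by summing the numerators over classes.
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}

result = posterior(['pitcher', 'pitcher'])
print(result)
```

Because the denominator is the same for every class, it only rescales the scores; the class with the largest numerator is also the class with the largest posterior.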

---
<a id='bow'></a>
# How can we represent a text as a DataFrame?

 $$ P(\text{Document | Word}) = \dfrac{P(\text{Word | Document})P(\text{Document})}{P(\text{Word})}$$  

## Bag of Words

A **Bag of Words** is a frequency-mapping for a given text or document.

For example:

> **Sentence**: `'My favorite food is pizza. My brother likes pizza too.'`
> 
> **Bag of Words**: `{'My': 2, 'favorite': 1, 'food': 1, 'is': 1, 'pizza': 2, 'brother': 1, 'likes': 1, 'too': 1}`

---
## Transforming Text Data

***Original Data***
> **Sentence 1**: `'My favorite food is pizza. My brother likes pizza too.'`
> 
> **Sentence 2**: `'My mom likes baseball. My dad likes painting.'`
>

***Processing***
> **BoW 1**: `{'My': 2, 'favorite': 1, 'food': 1, 'is': 1, 'pizza': 2, 'brother': 1, 'likes': 1, 'too': 1}`
>
> **BoW 2**: `{'My': 2, 'mom': 1, 'likes': 2, 'baseball': 1, 'dad': 1, 'painting': 1}`

***DataFrame Representation***
> | mom | favorite | brother | pizza | painting | too | my | is | dad | food | baseball | likes |
> | --- | ---      | ---     | ---   | ---      | --- | -- | -- | --- | ---  | ---      | ---   |
> |   0 |   1      |   1     |   2   |   0      |   1 |  2 |  1 |   0 |   1  |   0      |   1   |
> |   1 |   0      |   0     |   0   |   1      |   0 |  2 |  0 |   1 |   0  |   1      |   2   |

In [1]:
# Coding a Bag of Words
s1 = 'My favorite food is pizza My brother likes pizza too'
s2 = 'My mom likes baseball My dad likes painting'

# PRACTICE!
# Create BoW dictionaries for each sentence.
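One possible approach to this exercise, sketched with `collections.Counter` from the standard library. The helper name `bag_of_words` is hypothetical, and the sketch simply splits on whitespace, which works here because the sentences contain no punctuation.

```python
from collections import Counter

s1 = 'My favorite food is pizza My brother likes pizza too'
s2 = 'My mom likes baseball My dad likes painting'

def bag_of_words(sentence):
    """Count how often each whitespace-separated token appears."""
    return dict(Counter(sentence.split()))

bow1 = bag_of_words(s1)
bow2 = bag_of_words(s2)
print(bow1)
print(bow2)
```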


<a id='nb-clf'></a>
# Classifying News
## Implementing Bayes Classifier in Scikit-Learn

**Scikit-learn provides several Naive Bayes classifiers. Three commonly used ones are:**

1. Gaussian: Assumes that continuous features follow a normal distribution.

2. Bernoulli: Useful if your features are binary (for example, whether a word is present or absent).

3. Multinomial: Useful if your features are discrete counts (for example, word frequencies).
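To make the pairing between feature type and classifier concrete, here is a small sketch that fits each of the three classifiers on a matching kind of feature matrix. The arrays are tiny invented examples, not real data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical labels for four samples.
y = np.array([0, 0, 1, 1])

# Continuous measurements, word counts, and word presence/absence flags.
X_continuous = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.2], [2.8, 2.5]])
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
X_binary = (X_counts > 0).astype(int)

gaussian = GaussianNB().fit(X_continuous, y)
multinomial = MultinomialNB().fit(X_counts, y)
bernoulli = BernoulliNB().fit(X_binary, y)

print(gaussian.predict([[3.0, 2.0]]))
print(multinomial.predict([[0, 5]]))
print(bernoulli.predict([[0, 1]]))
```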

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups

In [3]:
# Load data.
news_data = fetch_20newsgroups(
    subset='all',
    categories=['rec.sport.baseball', 'talk.politics.misc']
    )
news_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
# Create a dataframe from the loaded data.
df = pd.DataFrame()
df['data'] = news_data['data']
df['target'] = news_data['target']
df.head()

|   | data | target |
| --- | --- | --- |
| 0 | From: paula@koufax.cv.hp.com (Paul Andresen)\n... | 0 |
| 1 | From: garrett@Ingres.COM \nSubject: Re: Limiti... | 1 |
| 2 | From: djs9683@ritvax.isc.rit.edu\nSubject: Re:... | 0 |
| 3 | From: nickn@eskimo.com (Nick Nussbaum)\nSubjec... | 1 |
| 4 | From: jerry@sheldev.shel.isc-br.com (Gerald La... | 0 |


In [5]:
# Look at a row of the data.


In [6]:
# Perform train test split (random_state=2021)


In [7]:
# Use CountVectorizer to process text.


In [8]:
# Show sparse matrix as DataFrame. (.toarray())


In [9]:
# Build a Gaussian Naive Bayes model.


In [10]:
# Make predictions and show scores.
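The split, vectorize, fit, and score steps can be sketched end to end as follows. The eight-sentence corpus and its labels are invented for illustration (0 for baseball, 1 for politics), and `random_state=2021` follows the value suggested in the split cell. With so little data the scores themselves are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy corpus: 0 = baseball, 1 = politics.
texts = [
    'the pitcher threw a fast ball',
    'home run in the ninth inning',
    'the team won the baseball game',
    'the batter hit the ball',
    'the senate passed the bill',
    'voters elect a new president',
    'the president signed the bill',
    'congress will vote on the tax law',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=2021, stratify=labels
)

vectorizer = CountVectorizer()
# GaussianNB does not accept sparse matrices, so convert to dense arrays.
X_train = vectorizer.fit_transform(X_train_txt).toarray()
X_test = vectorizer.transform(X_test_txt).toarray()

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, zero_division=0))
```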


In [11]:
# How can we look at predictions where the classification was incorrect?
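One way to inspect the misclassified rows is a boolean mask over a results table. The texts, labels, and predictions below are hypothetical placeholders; in the notebook they would come from the test split and the fitted model.

```python
import pandas as pd

# Hypothetical test texts, true labels, and model predictions.
results = pd.DataFrame({
    'text': [
        'the pitcher threw a strike',
        'the senate passed the bill',
        'the president threw the first pitch',
    ],
    'actual': [0, 1, 1],
    'predicted': [0, 1, 0],
})

# Keep only the rows where the prediction disagrees with the true label.
wrong = results[results['actual'] != results['predicted']]
print(wrong)
```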


<a id='logreg'></a>
## Train Logistic Regression Model

In [12]:
# Let's compare with a Logistic Regression model!
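A sketch of the comparison, fitting both models on the same count features and reporting test accuracy with `.score()`. The corpus is again a small invented example, and `max_iter=1000` is an assumed setting chosen only to avoid convergence warnings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 0 = baseball, 1 = politics.
train_texts = [
    'the pitcher threw a fast ball',
    'the batter hit the ball',
    'the team won the baseball game',
    'the senate passed the bill',
    'the president signed the bill',
    'congress will vote on the tax law',
]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ['the pitcher hit the ball', 'the senate will vote on the bill']
test_labels = [0, 1]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

models = {
    'GaussianNB': GaussianNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}

# Fit each model on the same features and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    scores[name] = model.score(X_test, test_labels)

print(scores)
```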


# Resources

