In [None]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Introduction

In this problem set we are going to combine our knowledge of probability and pandas to predict whether an article is about sports. 

**In the first part** we will predict whether an article is about sports given that it contains the word "<font color='green'>heat</font>" but does **NOT** contain the term "<font color='orange'>grid iron</font>".

To do so, we will use Naive Bayes, which means we will compute the numerator of

<font color='blue'>$P(is\_sports\_article | heat, grid\_iron) = \frac{P(heat|is\_sports\_article) P(grid\_iron|is\_sports\_article) P(is\_sports\_article)}{P(heat, grid\_iron)}$</font>

**In the second part** we will compare the probabilities produced when we apply the conditional independence assumption versus when we do not. 


# Naive Bayes
We are going to load our dataset and "train" our Naive Bayes model.

Load the `is_sports_article.csv` file. Each row in this file is associated with one news article. The file has three columns:
 - **is_sports_article**: A binary variable indicating whether the article in that row is about sports (1) or food (0)
 - **grid_iron**: A binary variable indicating whether the word `grid_iron` appears in the article
 - **heat**: A binary variable indicating whether the word `heat` appears in the article

In [None]:
### load file here
ps2_dataset = pd.read_csv('./is_sports_article.csv')



In [None]:
ps2_dataset

## Computing the first numerator (2 points)
First we will compute the numerator if the new article is a sports article:

<font color='blue'>$P(heat=1|is\_sports\_article=1) \cdot P(grid\_iron=0|is\_sports\_article=1) \cdot P(is\_sports\_article=1)$ </font>


**Compute:**

-  <font color='purple'>$P(heat=1|is\_sports\_article=1)$ and assign it to `p_heat1_given_SA1`</font>

-  <font color='purple'>$P(grid\_iron=0|is\_sports\_article=1)$ and assign it to `p_gridiron0_given_SA1`</font>

-  <font color='purple'>$P(is\_sports\_article=1)$ and assign it to `p_SA1`</font>

-  <font color='purple'>Multiply these three terms and assign it to `numerator_SA1`</font>
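
As a rough illustration of how conditional probabilities like these can be estimated with pandas, here is a minimal sketch on a made-up eight-row table. The `toy` DataFrame and every `toy_`-prefixed name are assumptions for illustration only; they are not the contents of `is_sports_article.csv`, and this is one possible approach rather than the required one.

```python
import pandas as pd

# Hypothetical toy data, NOT the real is_sports_article.csv.
toy = pd.DataFrame({
    "is_sports_article": [1, 1, 1, 1, 0, 0, 0, 0],
    "heat":              [1, 1, 0, 0, 1, 1, 1, 0],
    "grid_iron":         [1, 0, 0, 0, 0, 0, 0, 0],
})

# Conditioning on is_sports_article=1 means restricting to those rows.
toy_sports = toy[toy["is_sports_article"] == 1]

# The mean of a boolean Series is the fraction of True values.
toy_p_heat1_given_SA1 = (toy_sports["heat"] == 1).mean()
toy_p_gridiron0_given_SA1 = (toy_sports["grid_iron"] == 0).mean()
toy_p_SA1 = (toy["is_sports_article"] == 1).mean()

# Naive Bayes numerator: product of the per-word likelihoods and the prior.
toy_numerator_SA1 = toy_p_heat1_given_SA1 * toy_p_gridiron0_given_SA1 * toy_p_SA1
print(toy_numerator_SA1)
```

Filtering first and then averaging a boolean column gives the empirical conditional probability within the filtered subset.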


## Computing the second numerator (2 points)
Next we will compute the numerator if the new article is a food article:

<font color='blue'>$P(heat=1|is\_sports\_article=0) \cdot P(grid\_iron=0|is\_sports\_article=0) \cdot P(is\_sports\_article=0)$ </font>

**Compute:**

-  <font color='purple'>$P(heat=1|is\_sports\_article=0)$ and assign it to `p_heat1_given_SA0`</font>

-  <font color='purple'>$P(grid\_iron=0|is\_sports\_article=0)$ and assign it to `p_gridiron0_given_SA0`</font>

-  <font color='purple'>$P(is\_sports\_article=0)$ and assign it to `p_SA0`</font>

-  <font color='purple'>Multiply these three terms and assign it to `numerator_SA0`</font>


## Prediction (1 point)
<font color='purple'>Based on `numerator_SA1` and `numerator_SA0`, would the algorithm predict that an article is about sports if the article contains the word `heat` but not the term `grid iron`?</font>
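
The decision rule can be sketched as picking the class with the larger numerator, since both numerators would be divided by the same denominator. The sketch below uses a made-up table; `toy`, `naive_numerator`, `toy_numerators`, and `toy_prediction` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, NOT the real is_sports_article.csv.
toy = pd.DataFrame({
    "is_sports_article": [1, 1, 1, 1, 0, 0, 0, 0],
    "heat":              [1, 1, 0, 0, 1, 1, 1, 0],
    "grid_iron":         [1, 0, 0, 0, 0, 0, 0, 0],
})

def naive_numerator(df, label):
    """Hypothetical helper: P(heat=1|label) * P(grid_iron=0|label) * P(label)."""
    in_class = df["is_sports_article"] == label
    rows = df[in_class]
    p_heat1 = (rows["heat"] == 1).mean()
    p_gridiron0 = (rows["grid_iron"] == 0).mean()
    return p_heat1 * p_gridiron0 * in_class.mean()

# One numerator per class; the prediction is the class with the larger one.
toy_numerators = {label: naive_numerator(toy, label) for label in (1, 0)}
toy_prediction = max(toy_numerators, key=toy_numerators.get)
print(toy_numerators, toy_prediction)
```

Because the denominator is shared, comparing numerators is enough to decide which class the model favors.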


## Compute the denominator (1 point)
We will use the denominator later, so for now:
- <font color='purple'>Compute $P(heat=1, grid\_iron=0)$ and assign it to `p_heat1_gridiron0`</font>
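
A joint probability of two events can be estimated as the fraction of all rows where both hold. Here is a minimal sketch on a made-up table; `toy`, `toy_both`, and `toy_p_heat1_gridiron0` are assumptions for illustration, not the real dataset or required names.

```python
import pandas as pd

# Hypothetical toy data, NOT the real is_sports_article.csv.
toy = pd.DataFrame({
    "is_sports_article": [1, 1, 1, 1, 0, 0, 0, 0],
    "heat":              [1, 1, 0, 0, 1, 1, 1, 0],
    "grid_iron":         [1, 0, 0, 0, 0, 0, 0, 0],
})

# Combine the two conditions element-wise with & (parentheses are required).
toy_both = (toy["heat"] == 1) & (toy["grid_iron"] == 0)

# Fraction of all rows, regardless of class, where both conditions hold.
toy_p_heat1_gridiron0 = toy_both.mean()
print(toy_p_heat1_gridiron0)
```

Note that no filtering by `is_sports_article` happens here, because the denominator is not conditioned on the class.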


# The Conditional Independence Assumption
Now we are going to compare: 

<font color='blue'>$P(is\_sports\_article | heat, grid\_iron) = \frac{P(heat|is\_sports\_article) P(grid\_iron|is\_sports\_article) P(is\_sports\_article)}{P(heat, grid\_iron)}$</font>

with 

<font color='blue'>$P(is\_sports\_article | heat, grid\_iron) = \frac{P(heat, grid\_iron|is\_sports\_article) P(is\_sports\_article)}{P(heat, grid\_iron)}$</font>

Notice that the denominator is the same in both expressions. In fact, we just computed it in the previous question.
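
For reference, the law of total probability expands this shared denominator in terms of the joint class-conditional probabilities (the index $c$ is notation introduced here for the two classes):

<font color='blue'>$P(heat, grid\_iron) = \sum_{c \in \{0, 1\}} P(heat, grid\_iron|is\_sports\_article=c) P(is\_sports\_article=c)$</font>

The terms in this sum are the numerators of the second expression above, not the naive products of the first.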

## Computing the numerator when not assuming conditional independence (2 points)
Compute:

- <font color='purple'>$P(heat=1, grid\_iron=0|is\_sports\_article=1)$ and assign it to `p_heat1_gridiron0_given_SA1`</font>


- <font color='purple'>$P(heat=1, grid\_iron=0|is\_sports\_article=0)$ and assign it to `p_heat1_gridiron0_given_SA0`</font>
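
Without the independence assumption, the likelihood is the fraction of rows within a class where both word conditions hold at once. The sketch below uses a made-up table; `toy`, `joint_given_class`, and the `toy_`-prefixed names are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data, NOT the real is_sports_article.csv.
toy = pd.DataFrame({
    "is_sports_article": [1, 1, 1, 1, 0, 0, 0, 0],
    "heat":              [1, 1, 0, 0, 1, 1, 1, 0],
    "grid_iron":         [1, 0, 0, 0, 0, 0, 0, 0],
})

def joint_given_class(df, label):
    """Hypothetical helper: P(heat=1, grid_iron=0 | is_sports_article=label)."""
    rows = df[df["is_sports_article"] == label]
    # Both conditions are evaluated on the same row, so no independence is assumed.
    return ((rows["heat"] == 1) & (rows["grid_iron"] == 0)).mean()

toy_p_heat1_gridiron0_given_SA1 = joint_given_class(toy, 1)
toy_p_heat1_gridiron0_given_SA0 = joint_given_class(toy, 0)
print(toy_p_heat1_gridiron0_given_SA1, toy_p_heat1_gridiron0_given_SA0)
```

On this toy table the joint value for the sports class (0.25) differs from the product of the two separate conditionals (0.5 × 0.75), which is the gap the independence assumption ignores.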

## Computing the probabilities (1 point)
Now we are going to compute the four probabilities:

1) <font color='purple'>Compute
$P(is\_sports\_article=1|heat=1, grid\_iron=0)$ using the **conditional independence assumption method** and assign it to `p_SA1_yes_CIA`</font>

2) <font color='purple'>Compute
$P(is\_sports\_article=0|heat=1, grid\_iron=0)$ using the **conditional independence assumption method** and assign it to `p_SA0_yes_CIA`</font>

3) <font color='purple'>Compute
$P(is\_sports\_article=1|heat=1, grid\_iron=0)$ **WITHOUT** using the **conditional independence assumption** and assign it to `p_SA1_no_CIA`</font>

4) <font color='purple'>Compute
$P(is\_sports\_article=0|heat=1, grid\_iron=0)$ **WITHOUT** using the **conditional independence assumption** and assign it to `p_SA0_no_CIA`</font>
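
The four quantities above can be sketched side by side by dividing each numerator by the shared denominator. This is a minimal sketch on a made-up table; `toy`, `toy_posteriors`, and the dictionary keys are assumptions for illustration and do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the real is_sports_article.csv.
toy = pd.DataFrame({
    "is_sports_article": [1, 1, 1, 1, 0, 0, 0, 0],
    "heat":              [1, 1, 0, 0, 1, 1, 1, 0],
    "grid_iron":         [1, 0, 0, 0, 0, 0, 0, 0],
})

# Shared denominator: P(heat=1, grid_iron=0) over all rows.
toy_p_evidence = ((toy["heat"] == 1) & (toy["grid_iron"] == 0)).mean()

toy_posteriors = {}
for label in (1, 0):
    in_class = toy["is_sports_article"] == label
    rows = toy[in_class]
    prior = in_class.mean()
    # With the assumption: multiply the two per-word conditionals.
    naive_likelihood = (rows["heat"] == 1).mean() * (rows["grid_iron"] == 0).mean()
    # Without the assumption: use the joint conditional directly.
    joint_likelihood = ((rows["heat"] == 1) & (rows["grid_iron"] == 0)).mean()
    toy_posteriors[label] = {
        "yes_CIA": naive_likelihood * prior / toy_p_evidence,
        "no_CIA": joint_likelihood * prior / toy_p_evidence,
    }

print(toy_posteriors)
```

On this toy table the two `yes_CIA` values sum to 1.125 rather than 1, while the two `no_CIA` values sum to exactly 1.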



## Calibration (1 point)
Using your answers to the previous question, can you say whether the probabilities produced with the conditional independence assumption are calibrated? Why or why not?