# Homework 8 – Naive Bayes

## DSC 40A, Fall 2021

### Due Friday, December 3rd, 2021

Homework 8, the final homework of the quarter, will work a little differently. All of the questions you need to answer will be in this Jupyter Notebook. With that said, you won't have to submit this notebook; as per usual, write your solutions to the problems by either typing them up or handwriting them on a piece of paper. Homeworks are due to Gradescope by 11:59pm on the due date. You can use a slip day to extend the deadline by 24 hours. Make sure to correctly assign pages to Gradescope when submitting.

Homework will be evaluated not only on the correctness of your answers, but on your ability to present your ideas clearly and logically. You should always explain and justify your conclusions, using sound reasoning. Your goal should be to convince the reader of your assertions. If a question does not require explanation, it will be explicitly stated.

Homeworks should be written up and turned in by each student individually. You may talk to other students in the class about the problems and discuss solution strategies, but you should not share any written communication and you should not check answers with classmates. You can tell someone how to do a homework problem, but you cannot show them how to do it.

For each problem you submit, you should cite your sources by including a list of names of other students with whom you discussed the problem. Instructors do not need to be cited.

This homework will be graded out of 13 points. The point value of each problem or sub-problem is indicated by the number of avocados shown.

In [None]:
# Run this cell.
import pandas as pd
import numpy as np

## Question 1 – Goodbye, Billy! 👋

In this final homework question of the quarter, we'll work with Billy the avocado-farmer-turned-waiter-turned-Instagram-influencer one last time.

Billy works as a waiter at Dirty Birds, the on-campus restaurant, on Thursdays, Fridays, Saturdays, and Sundays. He wrote down a few pieces of information for a random sample of customers that he served. That information is stored in the `small_tips` DataFrame, which you can load in below.

In [None]:
small_tips = pd.read_csv('small_tips.csv')
small_tips

Each row of the `small_tips` dataset contains information about a single transaction. For each transaction, Billy kept track of:
- `'sex'`: the sex of the customer paying (in this case, either `'Male'` or `'Female'`)
- `'day'`: the day of the week (either `'Thur'`, `'Fri'`, `'Sat'`, or `'Sun'` – these are the only days that Billy works)
- `'time'`: either `'Lunch'` or `'Dinner'`
- `'above_18'`: `True` if the customer tipped at least 18%, and `False` otherwise

Billy wants to predict whether or not a customer will tip at least 18%, given their sex, the day of the week, and the time of day. He enlists you to help him, and you decide to use the Naive Bayes classifier that you learned about in class.
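Before doing any math by hand, it can help to tabulate how often each feature value appears within each class. Here is a minimal sketch of that tabulation. Note that `toy_tips` is a made-up stand-in, not the real `small_tips` data, so the counts below will not match the ones you need for this homework.

```python
import pandas as pd

# Hypothetical stand-in for small_tips (these rows are made up for illustration)
toy_tips = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'day': ['Thur', 'Sat', 'Sat', 'Sun', 'Thur'],
    'time': ['Lunch', 'Dinner', 'Dinner', 'Dinner', 'Lunch'],
    'above_18': [True, False, False, True, False],
})

# Number of rows in each class (used for P(True) and P(False))
class_counts = toy_tips['above_18'].value_counts()

# Number of rows for each (day, class) pair (used for P(day | class))
day_by_class = pd.crosstab(toy_tips['day'], toy_tips['above_18'])

print(class_counts)
print(day_by_class)
```

The same `pd.crosstab` call works for `'sex'` and `'time'` by swapping out the first argument.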

### Part A [4 Points]

Using the Naive Bayes classifier and **no** smoothing, predict whether a male customer who comes to Dirty Birds on a Thursday for dinner will tip at least 18%.

**Note that this is a math problem, not a coding problem.** You must show **all** of your steps in order to get full credit. Do not convert any probabilities to decimals; write your final results as fractions.

_Some guidance: This involves computing and comparing the numerators of $P(\text{True | Male, Thur, Dinner})$ and $P(\text{False | Male, Thur, Dinner})$, using both Bayes' theorem and an assumption of conditional independence. You will know you did this correctly if one of the numerators you compute is equal to $\frac{1}{16}$_.
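To make the structure of this calculation concrete, here is a minimal sketch of the unsmoothed numerator computation using exact fractions. The rows below are hypothetical (they are **not** the `small_tips` data), and the helper name `numerator` is our own invention, so treat this as an illustration of the procedure rather than a solution.

```python
from fractions import Fraction

# Hypothetical toy data, NOT small_tips: (sex, day, time, above_18)
rows = [
    ('Male', 'Thur', 'Lunch', True),
    ('Male', 'Sat', 'Dinner', True),
    ('Female', 'Sat', 'Dinner', True),
    ('Female', 'Thur', 'Lunch', False),
    ('Male', 'Sat', 'Dinner', False),
    ('Female', 'Sat', 'Lunch', False),
]

def numerator(rows, features, label):
    """P(label) times the product of P(feature_i | label), with no smoothing."""
    in_class = [r for r in rows if r[3] == label]
    result = Fraction(len(in_class), len(rows))  # P(label)
    for i, value in enumerate(features):
        matches = sum(1 for r in in_class if r[i] == value)
        result *= Fraction(matches, len(in_class))  # P(feature_i | label)
    return result

query = ('Male', 'Sat', 'Dinner')
num_true = numerator(rows, query, True)
num_false = numerator(rows, query, False)
print(num_true, num_false)  # 4/27 1/27
```

In this toy example the numerator for `True` is larger, so we'd predict `True`. Multiplying the conditional probabilities together is exactly where the conditional independence assumption is used.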

### Part B [4 Points]

Now, using the Naive Bayes classifier **with** smoothing, predict whether a male customer who comes to Dirty Birds on a Thursday for dinner will tip at least 18%.

**Note that this is a math problem, not a coding problem.** You must show **all** of your steps in order to get full credit. Do not convert any probabilities to decimals; write your final results as fractions.

**Note:** There was a mistake during live lecture, which has now been fixed in the slides. When smoothing, do **not** smooth the non-conditional probabilities. In other words, $P(\text{True})$ and $P(\text{False})$ should be the same as they were in Part A (fractions out of 15). Only the conditional probabilities that you calculate, e.g. $P(\text{Thur | False})$, should change now that we're using smoothing.

_Hint: You'll know if you did this correctly if one of the numerators you compute is equal to $\frac{4}{75}$._
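As a reminder of what smoothing does to a single conditional probability, here is a small sketch. It assumes the add-one rule that `alpha=1` corresponds to in `sklearn`: add 1 to the count in the numerator, and add the number of possible values of that feature to the denominator. The counts are made up, and `smoothed_conditional` is a hypothetical helper name.

```python
from fractions import Fraction

def smoothed_conditional(count, class_count, n_categories):
    """Estimate P(feature value | class) with add-one smoothing (assumed rule)."""
    return Fraction(count + 1, class_count + n_categories)

# Made-up counts of each day among 5 rows belonging to one class
day_counts = {'Thur': 0, 'Fri': 2, 'Sat': 3, 'Sun': 0}
class_count = sum(day_counts.values())

unsmoothed = Fraction(day_counts['Thur'], class_count)
smoothed = smoothed_conditional(day_counts['Thur'], class_count, len(day_counts))
print(unsmoothed, smoothed)  # 0 1/9

# The smoothed probabilities over all four days still sum to 1
total = sum(smoothed_conditional(c, class_count, len(day_counts))
            for c in day_counts.values())
print(total)  # 1
```

Only the conditional probabilities go through this helper; as noted above, $P(\text{True})$ and $P(\text{False})$ are left unsmoothed.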

### Part C [2 Points]

Moving forward, let's assume that we're using our results from Part B, i.e. that we're using smoothing.

When deciding what to predict for a male customer coming for dinner on a Thursday, we only computed the numerators of $P(\text{True | Male, Thur, Dinner})$ and $P(\text{False | Male, Thur, Dinner})$. This was because we weren't interested in the actual values of these two probabilities; rather, we were interested in which probability is larger. Since they both have the same denominator, $P(\text{Male, Thur, Dinner})$, the value of the denominator was irrelevant in making our prediction.

With that said, we have enough information to compute the value of $P(\text{Male, Thur, Dinner})$, which would help us compute the values of $P(\text{True | Male, Thur, Dinner})$ and $P(\text{False | Male, Thur, Dinner})$, not just their numerators. You'll see the benefit of doing this in Part D.

Here's how we'd compute that denominator. Recall that the law of total probability says that

$$P(F) = P(E \cap F) + P(\overline{E} \cap F) = P(E) \cdot P(F | E) + P(\overline{E}) \cdot P(F | \overline{E})$$

Here, treat $F$ as "$\text{Male, Thur, Dinner}$" and $E$ as "$\text{True}$". This means that an estimate for $P(\text{Male, Thur, Dinner})$ is

$$
\begin{align*}
P(\text{Male, Thur, Dinner}) &= P(\text{True}) \cdot P(\text{Male, Thur, Dinner | True}) + P(\text{False}) \cdot P(\text{Male, Thur, Dinner | False})
\end{align*}
$$

We've written $P(\text{Male, Thur, Dinner})$ as the sum of the two numerators you computed in Part B – how convenient!

With the denominator in hand, you now have enough information to estimate $P(\text{True | Male, Thur, Dinner})$ and $P(\text{False | Male, Thur, Dinner})$ in their entirety, not just their numerators.
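Here is a short sketch of that normalization step. The two numerators are hypothetical values chosen only for illustration; they are not the ones from Part B.

```python
from fractions import Fraction

# Hypothetical numerators, chosen only for illustration
num_true = Fraction(1, 10)
num_false = Fraction(3, 10)

# Law of total probability: the denominator is the sum of the two numerators
denominator = num_true + num_false

p_true = num_true / denominator
p_false = num_false / denominator
print(p_true, p_false)  # 1/4 3/4
```

Dividing by the same denominator never changes which of the two values is larger, which is why the prediction could be made from the numerators alone.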

<p style="color:red"><b>Your Job</b></p>

Below, assign `true_numerator` and `false_numerator` to the two numerators you computed in Part B, then use those variables to assign `p_true_given_features` and `p_false_given_features` to be the values of $P(\text{True | Male, Thur, Dinner})$ and $P(\text{False | Male, Thur, Dinner})$, respectively. Submit a screenshot of your code in your PDF writeup.

_Hint: `p_true_given_features` and `p_false_given_features` should sum to 1._

In [None]:
true_numerator = ... # TODO
false_numerator = ... # TODO
p_true_given_features = ... # TODO
p_false_given_features = ... # TODO

print('P(True | Male, Thur, Dinner) = ', p_true_given_features)
print('P(False | Male, Thur, Dinner) = ', p_false_given_features)

### Part D [2 Points]

Let's now confirm that our prediction and computed probabilities are correct. To do so, we will use the implementation of Naive Bayes built into `sklearn`, a popular machine learning package in Python. We've written most of the necessary code and will walk you through it – we'll show you how to write `sklearn` code on your own in DSC 80 😎.

Run the cell below to import the `CategoricalNB` class from `sklearn`. (This stands for "Categorical Naive Bayes"; the variant of Naive Bayes we've discussed in class only works for categorical features, hence the "Categorical".)

In [None]:
from sklearn.naive_bayes import CategoricalNB

A minor inconvenience is that `sklearn` expects all values in our data matrix to be numerical, even though all of our data is categorical. To get our data in the right format, we'll convert each column of `small_tips` individually to be numerical according to the following dictionaries:

In [None]:
sex_map = {'Male': 0, 'Female': 1}
day_map = {'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3}
time_map = {'Lunch': 0, 'Dinner': 1}
above_18_map = {False: 0, True: 1}

For instance, a female (1) customer coming on Saturday (2) for lunch (0) would have a feature vector of $[1, 2, 0]$. **Make sure you're comfortable with why this is the case before moving forward.**
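If it helps, the lookup can be sketched in code as follows. The helper `encode` is a hypothetical name of our own (it is not used elsewhere in this notebook), and the dictionaries are repeated here only so that the snippet runs on its own.

```python
sex_map = {'Male': 0, 'Female': 1}
day_map = {'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3}
time_map = {'Lunch': 0, 'Dinner': 1}

def encode(sex, day, time):
    """Look up each categorical value in its dictionary to build a feature vector."""
    return [sex_map[sex], day_map[day], time_map[time]]

print(encode('Female', 'Sat', 'Lunch'))  # [1, 2, 0]
```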

We'll apply these conversions to all columns in `small_tips`, and store the results in a new DataFrame called `small_tips_converted`.

In [None]:
def convert_tips_data(row):
    return row.replace(sex_map).replace(day_map).replace(time_map).replace(above_18_map)

small_tips_converted = small_tips.apply(convert_tips_data, axis=1)
small_tips_converted

Now that we have a converted data matrix, we can "fit" the `CategoricalNB` object. Run the following cell.

In [None]:
model = CategoricalNB(alpha=1) # alpha=1 sets up smoothing the way we've discussed it in class
model.fit(X=small_tips_converted[['sex', 'day', 'time']], # X contains our features
          y=small_tips_converted['above_18'])             # y contains our "true values"

Now that our model is "fit", we can use it to make predictions. For instance, suppose we're interested in predicting whether a female (1) customer who comes to Dirty Birds on Saturday (2) for lunch (0) will tip at least 18%. We can do so as follows:

In [None]:
model.predict([[1, 2, 0]])

The result is 0, which corresponds to `False`. Thus, we predict that said customer will not tip at least 18%.

It turns out that `sklearn` also lets us peek under the hood and see the conditional probabilities of each class that it calculated. To see these probabilities, we use `predict_proba` instead of `predict`:

In [None]:
model.predict_proba([[1, 2, 0]])

This is saying that $P(\text{False | Female, Sat, Lunch}) = 0.77243173$ and $P(\text{True | Female, Sat, Lunch}) = 0.22756827$. **Note that the probability for `False` comes before the probability for `True`.**
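If you ever forget which column is which, a fitted `sklearn` classifier stores its labels in the `classes_` attribute, in the same order as the columns of `predict_proba`. Here is a minimal sketch on made-up encoded data; `X_toy`, `y_toy`, and `toy_model` are hypothetical names and have nothing to do with `small_tips`.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical encoded data: two binary features, labels 0 (False) and 1 (True)
X_toy = np.array([[0, 0], [1, 1], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1, 1])

toy_model = CategoricalNB(alpha=1)
toy_model.fit(X_toy, y_toy)

# predict_proba returns one row per input; we take the only row
probs = toy_model.predict_proba([[1, 1]])[0]

# Pair each label in classes_ with the probability in the matching column
p_by_label = dict(zip(toy_model.classes_, probs))
print(toy_model.classes_)
print(p_by_label[0], p_by_label[1])
```

Since `False` was mapped to 0 and `True` to 1, the first column corresponds to `False` and the second to `True`.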

<p style="color:red"><b>Your Job</b></p>

Using `model.predict_proba`, determine what `sklearn` thinks $P(\text{True | Male, Thur, Dinner})$ and $P(\text{False | Male, Thur, Dinner})$ are. Assign the former to `p_true_sklearn` and the latter to `p_false_sklearn`. Submit a screenshot of your code for this part in your PDF writeup.

If you did this part and Part C correctly, `p_true_sklearn` should be equal to `p_true_given_features`, and `p_false_sklearn` should be equal to `p_false_given_features` (at least to the first 10 decimal places).

In [None]:
p_true_sklearn = ... # TODO
p_false_sklearn = ... # TODO

print('Through sklearn: P(True | Male, Thur, Dinner) = ', ' '*10,  p_true_sklearn)
print('Through manual calculation: P(True | Male, Thur, Dinner) = ', p_true_given_features, '\n')

print('Through sklearn: P(False | Male, Thur, Dinner) = ', ' '*10, p_false_sklearn)
print('Through manual calculation: P(False | Male, Thur, Dinner) = ', p_false_given_features)

The point of this exercise was to show you that the Naive Bayes classifier we showed you in class is the same as one that is widely used in industry!

### Part E [1 Point]

We've created a widget that will allow you to select a `'sex'`, `'day'`, and `'time'` from a dropdown menu and will dynamically show you the prediction that `sklearn` makes for that combination of features. 

Run the cell below to play around with it!

In [None]:
from ipywidgets import widgets, interact
import matplotlib.pyplot as plt
from IPython.display import display, HTML

def predict_from_features(sex, day, time):
    sex_conv = sex_map[sex]
    day_conv = day_map[day]
    time_conv = time_map[time]
    pred = model.predict([[sex_conv, day_conv, time_conv]])
    statement = 'not ❌' if pred[0] == 0 else '✅'  # pred is a length-1 array
    output = f'''We do <b>{statement}</b> predict that a <b>{sex}</b> customer coming to Dirty Birds
    on <b>{day}</b> for <b>{time}</b> will tip Billy at least 18%.'''
    display(HTML(output))
    
interact(predict_from_features,
         sex=['Male', 'Female'],
         day=['Thur', 'Fri', 'Sat', 'Sun'],
         time=['Lunch', 'Dinner']);

<p style="color:red"><b>Your Job</b></p>

There is only one day of the week that we'd predict a male customer coming for dinner would tip at least 18%. What day of the week is it? Use the above widget to find out, and put the result in your PDF writeup.