In [None]:
# Please don't change this cell, but do make sure to run it.
import otter
grader = otter.Notebook()

# Homework 5 Supplemental Notebook

## DSC 40A, Spring 2024

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from IPython.display import Markdown

pd.options.plotting.backend = "plotly"

# DSC 40A preferred styles.
pio.templates["dsc40a"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+dsc40a"

import warnings
warnings.simplefilter('ignore')

### Reminder

<div class="alert alert-block alert-warning" markdown="1">

This notebook **does not** need to be submitted! Instead, the problem that uses it asks you to take screenshots of the relevant pieces and include them in your Homework 5 PDF submission.

</div>

## Problem 2

Run the cell below to load in Billy's dataset as a `pandas` DataFrame.

In [None]:
tips = pd.read_csv('data/small_tips.csv')
tips

### Problem 2(d)

Let's now confirm that our prediction and computed probabilities are correct. First, start by assigning `p_small_manual`, `p_medium_manual`, and `p_large_manual` to your values for the following probabilities from part (c):

$$P(\text{Small } | \text{ Male, Thur, Dinner})$$ 
$$P(\text{Medium } | \text{ Male, Thur, Dinner})$$
$$P(\text{Large } | \text{ Male, Thur, Dinner})$$

Don't round. You're free to define additional variables to break down the calculations for `p_small_manual`, `p_medium_manual`, and `p_large_manual` – that's what we did!
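As a rough sketch of what such a breakdown might look like, the code below computes one numerator per tip category (a prior times three conditional probabilities) and then divides each by the sum of the numerators. Every number in it is a made-up placeholder, not a value from Billy's dataset, and the variable names are just one hypothetical way to organize the work.

```python
# All values are hypothetical placeholders, NOT taken from Billy's dataset.
# Each numerator is P(class) * P(sex | class) * P(day | class) * P(time | class).
num_small = 0.5 * 0.6 * 0.2 * 0.5
num_medium = 0.3 * 0.5 * 0.4 * 0.5
num_large = 0.2 * 0.5 * 0.5 * 0.6

# The denominator is the same for every class: the sum of the numerators.
denominator = num_small + num_medium + num_large

p_small_example = num_small / denominator
p_medium_example = num_medium / denominator
p_large_example = num_large / denominator

# The three conditional probabilities should sum to 1 (up to rounding error).
print(p_small_example + p_medium_example + p_large_example)
```

Keeping the numerators in their own variables makes it easy to double-check each piece against your work from part (c).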

In [None]:
p_small_manual = ...
p_medium_manual = ...
p_large_manual = ...

To verify that your answers above are correct, we will use the implementation of Naive Bayes built into `sklearn`, a popular machine learning package in Python. We've written most of the necessary code and will walk you through it – we'll show you how to write `sklearn` code on your own in DSC 80 😎.

Run the cell below to import the `CategoricalNB` class from `sklearn`. (This stands for "Categorical Naive Bayes"; the variant of Naive Bayes we've discussed in class only works for categorical features, hence the "Categorical". There are also versions of Naive Bayes that work for numerical features!)

In [None]:
from sklearn.naive_bayes import CategoricalNB

A minor inconvenience is that `sklearn` expects all values in our data matrix to be numerical, even though all of our data is categorical. To get our data in the right format, we'll convert each column of `tips` individually to be numerical according to the following dictionaries:

In [None]:
sex_map = {'Male': 0, 'Female': 1}
day_map = {'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3}
time_map = {'Lunch': 0, 'Dinner': 1}
tip_cat_map = {'Small': 0, 'Medium': 1, 'Large': 2}

For instance, a female (1) customer coming on Saturday (2) for lunch (0) would have a feature vector of $[1, 2, 0]$. **Make sure you're comfortable with why this is the case before moving forward.**
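If it helps, here is a small sketch of that encoding step for a single customer. The dictionaries are repeated so the snippet runs on its own, and `encode_customer` is a hypothetical helper name that isn't used anywhere else in this notebook.

```python
# Same mappings as in the cell above, repeated so this sketch is self-contained.
sex_map = {'Male': 0, 'Female': 1}
day_map = {'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3}
time_map = {'Lunch': 0, 'Dinner': 1}

def encode_customer(sex, day, time):
    """Hypothetical helper: turn one customer's categories into a numerical feature vector."""
    return [sex_map[sex], day_map[day], time_map[time]]

print(encode_customer('Female', 'Sat', 'Lunch'))  # [1, 2, 0]
```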

We'll apply these conversions to all columns in `tips`, and store the results in a new DataFrame called `tips_converted`.

In [None]:
def convert_tips_data(row):
    return row.replace(sex_map).replace(day_map).replace(time_map).replace(tip_cat_map)

tips_converted = tips.apply(convert_tips_data, axis=1)
tips_converted

Now that we have a converted data matrix, we can "fit" the `CategoricalNB` object. Run the following cell.

In [None]:
model = CategoricalNB(alpha=1) # alpha=1 sets up smoothing the way we've discussed it in class.
model.fit(X=tips_converted[['sex', 'day', 'time']], # X contains our features.
          y=tips_converted['tip_cat'])              # y contains our "true values".
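To make the role of `alpha` a bit more concrete, here is a minimal sketch of add-`alpha` smoothing for a single feature. It assumes the smoothed estimate is (count + alpha) / (class size + alpha × number of possible feature values), which is our understanding of both the in-class version and what `CategoricalNB` does. The toy data and the function name `smoothed_conditional` are made up for illustration and have nothing to do with Billy's dataset.

```python
# Hypothetical toy data, NOT Billy's dataset: (day, tip_cat) pairs.
toy = [('Sat', 'Large'), ('Sat', 'Large'), ('Sun', 'Large'),
       ('Thur', 'Small'), ('Sat', 'Small')]
days = ['Thur', 'Fri', 'Sat', 'Sun']  # all possible values of the feature

def smoothed_conditional(value, label, alpha=1):
    """Estimate P(day = value | tip_cat = label) with add-alpha smoothing."""
    in_class = [d for d, t in toy if t == label]
    count = in_class.count(value)
    return (count + alpha) / (len(in_class) + alpha * len(days))

# 'Fri' never appears with 'Large', but smoothing keeps the estimate above 0.
print(smoothed_conditional('Fri', 'Large'))  # (0 + 1) / (3 + 4)
print(smoothed_conditional('Sat', 'Large'))  # (2 + 1) / (3 + 4)
```

Without smoothing, the first estimate would be 0, which would force the whole product for that class to 0 no matter what the other features say.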

Now that our model is "fit", we can use it to make predictions. For instance, suppose we're interested in predicting whether a female (1) customer who comes to Dirty Birds on Saturday (2) for lunch (0) will leave a small, medium, or large-sized tip. We can do so as follows:

In [None]:
model.predict([[1, 2, 0]])

The result is 2, which corresponds to large-sized tips. Thus, we predict that said customer will leave a large-sized tip.

It turns out that `sklearn` also lets us peek under the hood and see the conditional probabilities of each class that it calculated. To see these probabilities, we use `predict_proba` instead of `predict`:

In [None]:
model.predict_proba([[1, 2, 0]])

This is saying that $P(\text{Small | Female, Saturday, Lunch}) = 0.38493307$, $P(\text{Medium | Female, Saturday, Lunch}) = 0.11110313$, $P(\text{Large | Female, Saturday, Lunch}) = 0.5039638$.
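To connect these probabilities back to the prediction from earlier, the sketch below picks the position of the largest probability and translates it back into a tip category. It assumes the columns of the `predict_proba` output are ordered by class code (0, 1, 2), which matches the reading above. `code_to_label` is a hypothetical name for the reversed version of `tip_cat_map`.

```python
tip_cat_map = {'Small': 0, 'Medium': 1, 'Large': 2}

# Reverse the mapping so we can go from a numerical code back to a label.
code_to_label = {code: label for label, code in tip_cat_map.items()}

# The three probabilities reported above for a female customer on Saturday at lunch.
probs = [0.38493307, 0.11110313, 0.5039638]

# The predicted class is the position of the largest probability.
best_code = max(range(len(probs)), key=lambda i: probs[i])
print(best_code, code_to_label[best_code])  # 2 Large
```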

<br><br>

<p style="color:red"><b>Your Job</b></p>

Using `model.predict_proba`, determine what `sklearn` thinks $P(\text{Small | Male, Thur, Dinner})$, $P(\text{Medium | Male, Thur, Dinner})$, and $P(\text{Large | Male, Thur, Dinner})$ are. Assign those probabilities to `p_small_sklearn`, `p_medium_sklearn`, and `p_large_sklearn`.

If you did this part and part (c) correctly, `p_small_sklearn`, `p_medium_sklearn`, and `p_large_sklearn` should be equal to your answers from part (c) – `p_small_manual`, `p_medium_manual`, and `p_large_manual`.

In [None]:
p_small_sklearn = ...
p_medium_sklearn = ...
p_large_sklearn = ...

print('Through sklearn: P(Small | Male, Thur, Dinner) = ', ' ' * 10,  p_small_sklearn)
print('Through manual calculation: P(Small | Male, Thur, Dinner) = ', p_small_manual, '\n')

print('Through sklearn: P(Medium | Male, Thur, Dinner) = ', ' ' * 10,  p_medium_sklearn)
print('Through manual calculation: P(Medium | Male, Thur, Dinner) = ', p_medium_manual, '\n')

print('Through sklearn: P(Large | Male, Thur, Dinner) = ', ' ' * 10,  p_large_sklearn)
print('Through manual calculation: P(Large | Male, Thur, Dinner) = ', p_large_manual, '\n')
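When you compare the printed values, keep in mind that two correct calculations can still differ in the last few decimal places because of floating-point rounding. Here is a small sketch of a tolerance-based comparison using `math.isclose`. The two numbers are hypothetical stand-ins, not actual probabilities from this problem.

```python
import math

# Hypothetical stand-ins for a manual result and a library result.
manual_value = 0.1 + 0.2
library_value = 0.3

# Exact equality can fail even when the two numbers agree for all practical purposes.
print(manual_value == library_value)  # False

# A comparison with a small relative tolerance is more forgiving.
print(math.isclose(manual_value, library_value, rel_tol=1e-9))  # True
```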

**In your PDF, include a screenshot of your code above along with its output.**

The point of this exercise was to show you that the Naive Bayes classifier we showed you in class is the same as one that is widely used in industry!

### For fun!

We've created a widget that will allow you to select a `'sex'`, `'day'`, and `'time'` from a dropdown menu and will dynamically show you the predicted tip category that `sklearn` makes for that combination of features. 

Run the cell below to play around with it!

In [None]:
from ipywidgets import widgets, interact
from IPython.display import display, HTML

def predict_from_features(sex, day, time):
    sex_conv = sex_map[sex]
    day_conv = day_map[day]
    time_conv = time_map[time]
    pred = model.predict([[sex_conv, day_conv, time_conv]])
    statement = {0: 'Small', 1: 'Medium', 2: 'Large'}
    output = f'''We predict that a <b>{sex}</b> customer coming to Dirty Birds
    on <b>{day}</b> for <b>{time}</b> will leave a <b>{statement[pred[0]].lower()}-sized</b> tip.'''
    display(HTML(output))
    
interact(predict_from_features,
         sex=['Male', 'Female'],
         day=['Thur', 'Fri', 'Sat', 'Sun'],
         time=['Lunch', 'Dinner']);

Play around with the widget! There's only one combination of features that predicts that a customer will leave a large-sized tip. See if you can find it!
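If you'd rather search systematically than click through the dropdowns, one possible approach is to list every combination of features with `itertools.product`. The sketch below only builds the list of combinations. In the notebook, you could then encode each one and pass it to `model.predict`, which we leave to you.

```python
from itertools import product

sexes = ['Male', 'Female']
days = ['Thur', 'Fri', 'Sat', 'Sun']
times = ['Lunch', 'Dinner']

# Every possible (sex, day, time) combination the widget can show.
combos = list(product(sexes, days, times))

print(len(combos))  # 2 * 4 * 2 = 16
print(combos[0])    # ('Male', 'Thur', 'Lunch')
```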

Remember, you do not have to submit this notebook! Instead, include a screenshot of your code and its output from Problem 2(d) in your PDF submission.