## Naïve Bayes with Categorical Data: Manual Calculation

**1. Suppose tomorrow will be mild, rainy, and windy, with high humidity. Should we play golf tomorrow? Show how the answer was calculated.**  
*From the likelihood tables on the slides, we obtain P(Rainy | Golf) = 3/9, P(Mild | Golf) = 4/9, P(High | Golf) = 3/9, and P(True | Golf) = 3/9, where True represents windy. The !Golf likelihoods are read from the same tables.*  
- *__P(Golf | Rainy, Mild, High, True)__ ∝ P(Rainy | Golf) × P(Mild | Golf) × P(High | Golf) × P(True | Golf) × P(Golf) = 3/9 × 4/9 × 3/9 × 3/9 × 9/14 = 2/189 ≈ 0.0106.*  
- *__P(!Golf | Rainy, Mild, High, True)__ ∝ P(Rainy | !Golf) × P(Mild | !Golf) × P(High | !Golf) × P(True | !Golf) × P(!Golf) = 2/5 × 2/5 × 4/5 × 3/5 × 5/14 = 24/875 ≈ 0.0274.*  
- *Normalizing gives __P(Golf | tomorrow)__ = 0.0106/(0.0106 + 0.0274) ≈ 0.28.*  
- *Similarly, __P(!Golf | tomorrow)__ = 0.0274/(0.0106 + 0.0274) ≈ 0.72.*

*Comparing the two posterior probabilities, we see that __P(Golf | Rainy, Mild, High, True)__ < __P(!Golf | Rainy, Mild, High, True)__. Therefore, we should not play golf tomorrow.*
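The arithmetic above can be double-checked with exact fractions. The sketch below is only a check of the manual calculation: it uses the likelihoods and priors quoted in the answer, and the dictionary and function names are illustrative choices, not part of the exercise.

```python
from fractions import Fraction as F

# Likelihoods quoted in the answer, in the order Rainy, Mild, High, Windy=True.
likelihoods = {
    "Golf":  [F(3, 9), F(4, 9), F(3, 9), F(3, 9)],
    "!Golf": [F(2, 5), F(2, 5), F(4, 5), F(3, 5)],
}
# Class priors: 9 of 14 days were Golf, 5 of 14 were !Golf.
priors = {"Golf": F(9, 14), "!Golf": F(5, 14)}


def unnormalized_score(label):
    """Prior times the product of the per-feature likelihoods."""
    score = priors[label]
    for p in likelihoods[label]:
        score *= p
    return score


scores = {label: unnormalized_score(label) for label in priors}
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

for label in scores:
    print(label, scores[label], round(float(posteriors[label]), 3))
```

Working in `Fraction` keeps the unnormalized scores exact, so rounding only happens once, when the posteriors are printed.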

**2. Is the assumption that outlook and humidity are independent a good assumption? Explain why or why not.**  
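*It is probably a bad assumption. Rainy weather usually brings higher humidity, so the two variables are correlated rather than independent. Naïve Bayes multiplies the per-feature likelihoods as if the features were conditionally independent given the class; when they are not, the same evidence is effectively counted twice, and the probabilities calculated for each class might not be correct.*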
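One way to probe the assumption empirically is to compare the distribution of Humidity across Outlook values within each class. The sketch below is a rough illustration only: it uses just the five rows displayed by `golf.head()` later in this document (far too few to conclude anything), and `humidity_given_outlook` is a hypothetical helper name.

```python
import pandas as pd

# Only the five rows shown by golf.head(); illustration, not the full data set.
sample = pd.DataFrame({
    "Outlook":  ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny"],
    "Humidity": ["High", "High", "High", "High", "Normal"],
    "PlayGolf": ["No", "No", "Yes", "Yes", "Yes"],
})


def humidity_given_outlook(df):
    """Row-normalized table of P(Humidity | PlayGolf, Outlook)."""
    return pd.crosstab(
        [df["PlayGolf"], df["Outlook"]], df["Humidity"], normalize="index"
    )


conditional = humidity_given_outlook(sample)
print(conditional)
```

If the features were conditionally independent, the rows belonging to the same class would look alike. Applied to the full 14-row data set, rows that differ strongly within a class would suggest that the assumption is violated.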

## Naïve Bayes with Categorical Data: Sklearn

In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import CategoricalNB # naive Bayes

In [2]:
golf = pd.read_csv("../W10/PlayGolf.csv")
golf.head()

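|   | Outlook | Temperature | Humidity | Windy | PlayGolf |
|---|---------|-------------|----------|-------|----------|
| 0 | Rainy | Hot | High | False | No |
| 1 | Rainy | Hot | High | True | No |
| 2 | Overcast | Hot | High | False | Yes |
| 3 | Sunny | Mild | High | False | Yes |
| 4 | Sunny | Cool | Normal | False | Yes |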


### 1. Temperature and Outlook are treated as ordinal variables. Instead of being converted to dummy variables, they need to be recoded as integer codes.

In [3]:
golf["Outlook"].value_counts(), golf["Temperature"].value_counts()

(Rainy       5
 Sunny       5
 Overcast    4
 Name: Outlook, dtype: int64,
 Mild    6
 Hot     4
 Cool    4
 Name: Temperature, dtype: int64)

In [4]:
outlook_mapper = {"Rainy":1, "Sunny":2, "Overcast":3}
temperature_mapper = {"Mild":1, "Hot":2, "Cool":3}
golf["Outlook"] = golf["Outlook"].replace(outlook_mapper)
golf["Temperature"] = golf["Temperature"].replace(temperature_mapper)
golf["Outlook"].value_counts(), golf["Temperature"].value_counts()

(1    5
 2    5
 3    4
 Name: Outlook, dtype: int64,
 1    6
 2    4
 3    4
 Name: Temperature, dtype: int64)
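The scikit-learn documentation states that `CategoricalNB` assumes the categories of each feature are encoded as 0, …, n − 1 and suggests `OrdinalEncoder` for this. The hand-written mappers above start at 1, which still runs but leaves code 0 unused. As an alternative, a minimal sketch with `OrdinalEncoder` is shown below; it uses only the five rows from `golf.head()`, and the category orderings are an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Five rows copied from golf.head(); illustration only.
sample = pd.DataFrame({
    "Outlook":     ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool"],
})

# Explicit category lists fix which integer each label receives (0, 1, 2).
# The orderings are an assumption; CategoricalNB treats the codes as labels
# and does not use their order.
encoder = OrdinalEncoder(
    categories=[["Rainy", "Sunny", "Overcast"], ["Cool", "Mild", "Hot"]],
    dtype=np.int64,
)
codes = encoder.fit_transform(sample[["Outlook", "Temperature"]])
print(codes)
```

A fitted encoder can be reused with `encoder.transform(...)` on new data such as the future days, which keeps the training and prediction encodings consistent.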

### 2. The other variables need to be recoded to binary variables. Because Windy is Boolean, it does not need to be recoded.

In [5]:
humidity_mapper = {"High":0, "Normal":1}
golf_mapper = {"Yes":1, "No":0}
golf["Humidity"] = golf["Humidity"].replace(humidity_mapper)
golf["PlayGolf"] = golf["PlayGolf"].replace(golf_mapper)
golf["Humidity"].value_counts(), golf["PlayGolf"].value_counts()

(0    7
 1    7
 Name: Humidity, dtype: int64,
 1    9
 0    5
 Name: PlayGolf, dtype: int64)

### 3. Fit the data using CategoricalNB.

In [6]:
X = golf[["Outlook", "Temperature", "Humidity", "Windy"]]
y = golf["PlayGolf"]

In [7]:
cnb = CategoricalNB()
cnb.fit(X, y)

CategoricalNB()
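By default, `CategoricalNB` applies additive (Laplace) smoothing with `alpha=1.0`, so the likelihoods inside the fitted model are not exactly the raw ratios used in the manual calculation. The sketch below illustrates the documented smoothing formula using the count behind P(Mild | Golf) = 4/9; the helper name is hypothetical, and `n_categories=3` is an assumption (scikit-learn infers the number of categories from the largest code, so the value it uses may differ when codes start at 1).

```python
from fractions import Fraction


def smoothed_likelihood(category_count, class_count, n_categories, alpha=1):
    """Additive smoothing: (N_tic + alpha) / (N_c + alpha * n_i)."""
    return Fraction(category_count + alpha, class_count + alpha * n_categories)


# Manual answer: 4 of the 9 Golf days are Mild; Temperature has 3 levels.
raw = Fraction(4, 9)
smoothed = smoothed_likelihood(4, 9, n_categories=3)
print(raw, smoothed)
```

Smoothing pulls every likelihood slightly toward a uniform distribution and prevents zero probabilities for categories never seen with a class.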

### 4. Using the data set, PlayGolfNext.csv, use your Naïve Bayes model to predict the next few days. Which days should you play golf?

In [8]:
future = pd.read_csv("../W10/PlayGolfNext.csv")
future.head()

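|   | Day | Outlook | Temperature | Humidity | Windy |
|---|-----|---------|-------------|----------|-------|
| 0 | Day After Tomorrow | Overcast | Cool | High | False |
| 1 | Tomorrow | Rainy | Mild | High | True |
| 2 | Today | Sunny | Hot | Normal | False |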


In [9]:
future["Outlook"] = future["Outlook"].replace(outlook_mapper)
future["Temperature"] = future["Temperature"].replace(temperature_mapper)
future["Humidity"] = future["Humidity"].replace(humidity_mapper)
future.head()

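|   | Day | Outlook | Temperature | Humidity | Windy |
|---|-----|---------|-------------|----------|-------|
| 0 | Day After Tomorrow | 3 | 3 | 0 | False |
| 1 | Tomorrow | 1 | 1 | 0 | True |
| 2 | Today | 2 | 2 | 1 | False |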


In [10]:
cnb.predict(future[X.columns])  # select the training feature columns by name; this drops Day

array([1, 0, 1])

*The prediction is:*
- *__Today__: play*
- *__Tomorrow__: not play*
- *__Day After Tomorrow__: play*
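Beyond the hard 0/1 labels, `predict_proba` returns the posterior probability of each class, which shows how confident each recommendation is. The sketch below is self-contained and therefore fits a toy model on only the five rows from `golf.head()`, recoded with the mappers above; its probabilities will differ from those of the model fit on all 14 rows. The names `toy_model`, `future_days`, and `report` are illustrative.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Toy training set: ONLY the five rows shown by golf.head(), already recoded.
train = pd.DataFrame({
    "Outlook":     [1, 1, 3, 2, 2],
    "Temperature": [2, 2, 2, 1, 3],
    "Humidity":    [0, 0, 0, 0, 1],
    "Windy":       [0, 1, 0, 0, 0],
    "PlayGolf":    [0, 0, 1, 1, 1],
})
features = ["Outlook", "Temperature", "Humidity", "Windy"]
toy_model = CategoricalNB().fit(train[features], train["PlayGolf"])

# The three future days from PlayGolfNext.csv, recoded the same way.
future_days = pd.DataFrame({
    "Day":         ["Day After Tomorrow", "Tomorrow", "Today"],
    "Outlook":     [3, 1, 2],
    "Temperature": [3, 1, 2],
    "Humidity":    [0, 0, 1],
    "Windy":       [0, 1, 0],
})

# One row per day, one column per class (0 = do not play, 1 = play).
proba = toy_model.predict_proba(future_days[features])
report = pd.DataFrame(proba, columns=toy_model.classes_, index=future_days["Day"])
print(report)
```

With the full notebook, the same two lines (`cnb.predict_proba(...)` and the `DataFrame` wrapper) can be applied to the fitted `cnb` and the recoded `future` frame.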

### 5. Does the recommendation (Yes or No to play golf) for today and tomorrow match the class example and your manual prediction above?

*Yes. The model's recommendations match both the class example and the manual calculation: we should play golf today, but not tomorrow.*