In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('07_play_tennis.csv')
data.head()

Unnamed: 0,day,outlook,temp,humidity,wind,play
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes


In [3]:
data = data.drop(columns=['day'])

In [4]:
data.head()

Unnamed: 0,outlook,temp,humidity,wind,play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


In [5]:
data.shape

(14, 5)

In [6]:
data.isnull().sum()

outlook     0
temp        0
humidity    0
wind        0
play        0
dtype: int64

### Problem 1
Outlook = Sunny, Temp = Hot, Humidity = High, Wind = Weak

Predict: Play or No Play?

### Solution

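- P(Yes | Sunny, Hot, High, Weak) ∝ P(Sunny | Yes) * P(Hot | Yes) * P(High | Yes) * P(Weak | Yes) * P(Yes)
- P(No | Sunny, Hot, High, Weak) ∝ P(Sunny | No) * P(Hot | No) * P(High | No) * P(Weak | No) * P(No)

- Both expressions omit the shared denominator P(Sunny, Hot, High, Weak), so they are proportional to the posteriors rather than equal to them.
- Compare the two scores and pick the larger one (the **maximum a posteriori (MAP) rule**).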
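
As a quick check of the arithmetic, the two scores can be worked out with exact fractions. This is only a sketch: the class counts (9 Yes, 5 No) match the `play` column, but the conditional counts assume the CSV holds the classic 14-row play-tennis table, which this notebook does not display in full.

```python
from fractions import Fraction as F

# Priors from the play column: 9 "Yes" days and 5 "No" days out of 14.
prior = {"Yes": F(9, 14), "No": F(5, 14)}

# P(value | class) for Sunny, Hot, High, Weak, in that order.
# ASSUMPTION: counts taken from the classic 14-row play-tennis table.
likelihood = {
    "Yes": [F(2, 9), F(2, 9), F(3, 9), F(6, 9)],
    "No": [F(3, 5), F(2, 5), F(4, 5), F(2, 5)],
}

# Unnormalized posterior = prior * product of the conditional probabilities.
score = {}
for label in prior:
    s = prior[label]
    for p in likelihood[label]:
        s *= p
    score[label] = s

decision = max(score, key=score.get)
for label, s in score.items():
    print(label, s, float(s))
print("MAP decision:", decision)
```

Under that assumption the No score is the larger of the two, so the MAP rule predicts No Play for this day.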

### How Naive Bayes works in both the training and testing phases:

### Training Phase:

1. **Collect and Prepare Data**: Start with a labeled dataset in which every data point carries a class label. This dataset is used to train the Naive Bayes classifier.

2. **Calculate Class Priors**: Calculate the prior probability of each class. This involves counting the occurrences of each class in the dataset and dividing by the total number of instances.

3. **Calculate Conditional Probabilities**: For each feature in the dataset:
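   - Calculate the conditional probability of each feature value given each class. This involves counting the occurrences of each feature-value pair within each class and dividing by the total number of instances in that class. In practice a smoothing term (Laplace smoothing) is usually added so that a value never seen with a class does not get probability zero.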
   - Naive Bayes assumes that the features are conditionally independent given the class, so you can calculate the conditional probability of each feature independently.

4. **Store Parameters**: Store the calculated probabilities (class priors and conditional probabilities) to be used during the testing phase.
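
The training steps above can be sketched with plain pandas. This is a minimal illustration, not the scikit-learn implementation: `fit_naive_bayes` is a hypothetical helper, the sample is just the five rows shown by `data.head()`, and no smoothing is applied, so some probabilities come out as zero.

```python
import pandas as pd

# The five rows shown by data.head(); the full dataset has 14 rows.
sample = pd.DataFrame(
    [
        ["Sunny", "Hot", "High", "Weak", "No"],
        ["Sunny", "Hot", "High", "Strong", "No"],
        ["Overcast", "Hot", "High", "Weak", "Yes"],
        ["Rain", "Mild", "High", "Weak", "Yes"],
        ["Rain", "Cool", "Normal", "Weak", "Yes"],
    ],
    columns=["outlook", "temp", "humidity", "wind", "play"],
)


def fit_naive_bayes(df, target):
    """Return class priors and one conditional probability table per feature."""
    # Step 2: class priors = class counts / number of rows.
    priors = df[target].value_counts(normalize=True).to_dict()
    # Step 3: P(value | class); each row of a table is a class and sums to 1.
    conditionals = {
        feature: pd.crosstab(df[target], df[feature], normalize="index")
        for feature in df.columns.drop(target)
    }
    # Step 4: the returned objects are the stored parameters.
    return priors, conditionals


priors, conditionals = fit_naive_bayes(sample, "play")
print(priors)
print(conditionals["outlook"])
```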

### Testing Phase:

1. **Input Data**: In the testing phase, you have a new, unseen instance (or instances) for which you want to predict the class.

2. **Calculate Class Posteriors**:
   - For each class:
     - Calculate the likelihood of the features given the class. This involves multiplying the conditional probabilities of each feature-value pair given the class.
     - Multiply the likelihood by the prior probability of the class to get the unnormalized posterior probability of the class given the features.

3. **Normalize Posteriors**: 
   - To obtain the normalized posterior probability of each class given the features, divide each unnormalized posterior probability by the sum of all unnormalized posteriors. This ensures that the probabilities sum up to 1.
   
4. **Make Prediction**: 
   - Choose the class with the highest posterior probability as the predicted class for the input instance. Normalization does not change which class is largest, so the same class wins whether you compare normalized or unnormalized posteriors.

5. **Output**: 
   - The predicted class is returned as the output of the Naive Bayes classifier for the given input instance.

6. **Repeat for Each Instance**: 
   - If you have multiple instances to classify, repeat the above steps for each instance.

In short, training estimates the class priors and conditional probabilities from labeled data, and testing combines those stored probabilities to pick the most likely class for each unseen instance.
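
The testing steps can be sketched in a few lines of plain Python. The `predict` helper and the dictionary layout are hypothetical, and the parameters cover only two features (outlook and wind) with values that assume the counts of the classic 14-row play-tennis table.

```python
def predict(priors, conditionals, instance):
    """Return (predicted class, normalized posteriors) for one instance."""
    unnormalized = {}
    for label, prior in priors.items():
        score = prior
        # Likelihood: product of P(value | class) over the instance's features.
        for feature, value in instance.items():
            score *= conditionals[label][feature][value]
        unnormalized[label] = score
    # Normalize so that the posteriors sum to 1.
    total = sum(unnormalized.values())
    posteriors = {label: s / total for label, s in unnormalized.items()}
    return max(posteriors, key=posteriors.get), posteriors


# ASSUMPTION: illustrative parameters based on the classic play-tennis counts.
priors = {"Yes": 9 / 14, "No": 5 / 14}
conditionals = {
    "Yes": {
        "outlook": {"Sunny": 2 / 9, "Overcast": 4 / 9, "Rain": 3 / 9},
        "wind": {"Weak": 6 / 9, "Strong": 3 / 9},
    },
    "No": {
        "outlook": {"Sunny": 3 / 5, "Overcast": 0.0, "Rain": 2 / 5},
        "wind": {"Weak": 2 / 5, "Strong": 3 / 5},
    },
}

label, posteriors = predict(priors, conditionals, {"outlook": "Rain", "wind": "Strong"})
print(label, posteriors)
```

With many features the product of small probabilities can underflow, so real implementations typically sum log-probabilities instead of multiplying.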

In [7]:
print(data['play'].unique())

['No' 'Yes']


In [8]:
from sklearn.naive_bayes import CategoricalNB

In [9]:
data.head()

Unnamed: 0,outlook,temp,humidity,wind,play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


In [10]:
from sklearn.model_selection import train_test_split

In [11]:
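# Features: the first four columns (outlook, temp, humidity, wind)
X = data.iloc[:, :4]
# Target: the last column (play) as a NumPy array
y = data.iloc[:, -1].values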

In [12]:
y

array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes',
       'Yes', 'Yes', 'Yes', 'No'], dtype=object)

In [13]:
from sklearn.preprocessing import LabelEncoder

In [14]:
le = LabelEncoder()

In [15]:
y = le.fit_transform(y)
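
To see which integer stands for which label, the fitted encoder can be inspected. The sketch below uses only the first five values of `y`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "No", "Yes", "Yes", "Yes"])  # first five values of y

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                    # classes in sorted order
print(encoded)                             # position of each label in classes_
print(encoder.inverse_transform(encoded))  # back to the original strings
```

Because the classes are sorted, 'No' maps to 0 and 'Yes' maps to 1.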

In [16]:
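from sklearn.preprocessing import OneHotEncoder

# One-hot encode the four categorical features into 10 binary columns.
# Note: CategoricalNB will treat each binary column as its own two-category feature.
ohe = OneHotEncoder()
X = ohe.fit_transform(X)
X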

<14x10 sparse matrix of type '<class 'numpy.float64'>'
	with 56 stored elements in Compressed Sparse Row format>
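
The scikit-learn documentation describes `CategoricalNB` as expecting each feature to be encoded as integers 0, ..., n-1 (for instance with `OrdinalEncoder`), so an ordinal encoding is arguably a closer fit than one-hot columns. A minimal sketch of that alternative, using only the five rows shown by `data.head()` and illustrative variable names:

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

features = ["outlook", "temp", "humidity", "wind"]
sample = pd.DataFrame(
    [
        ["Sunny", "Hot", "High", "Weak", "No"],
        ["Sunny", "Hot", "High", "Strong", "No"],
        ["Overcast", "Hot", "High", "Weak", "Yes"],
        ["Rain", "Mild", "High", "Weak", "Yes"],
        ["Rain", "Cool", "Normal", "Weak", "Yes"],
    ],
    columns=features + ["play"],
)

# One integer code per category, one column per original feature.
encoder = OrdinalEncoder()
X_ord = encoder.fit_transform(sample[features]).astype(int)

model = CategoricalNB()  # alpha=1.0 (Laplace smoothing) by default
model.fit(X_ord, sample["play"])

# Encode the query with the same fitted encoder before predicting.
query = pd.DataFrame([["Sunny", "Hot", "High", "Weak"]], columns=features)
pred = model.predict(encoder.transform(query).astype(int))
print(X_ord.shape, pred)
```

This keeps four feature columns instead of ten, so each original feature contributes exactly one conditional probability to the product.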

In [17]:
# Hold out 20% of the 14 rows (3 rows) for testing
X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, test_size=0.2, random_state=42)

In [19]:
cnb = CategoricalNB()

In [20]:
cnb.fit(X_train, y_train)

In [21]:
y_pred = cnb.predict(X_test)

In [22]:
X_test

array([[0., 1., 0., 0., 0., 1., 0., 1., 0., 1.],
       [1., 0., 0., 0., 0., 1., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0., 0., 1.]])

In [23]:
X_train

array([[1., 0., 0., 0., 1., 0., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0., 0., 0., 1., 0., 1.],
       [1., 0., 0., 0., 1., 0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 0., 1., 1., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1., 0., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0., 0., 1., 1., 0.]])

In [24]:
data

Unnamed: 0,outlook,temp,humidity,wind,play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


In [25]:
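# Accuracy (%) on all 14 rows. 11 of them were used for training, so this is not
# a held-out estimate; cnb.score(X_test, y_test) would score only the 3 test rows.
cnb.score(X.toarray(), y) * 100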

64.28571428571429