P(Ci|x) = P(x|Ci) * P(Ci) / P(x)

* P(Ci|x) is the posterior probability of class Ci given the predictor x.
* P(x|Ci) is the class-conditional probability (the likelihood) of the predictor x given class Ci.
* P(Ci) is the class prior probability.
* P(x) is the marginal probability of predictor x.
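
As a minimal sketch of the formula above, the snippet below turns per-class likelihoods and priors into posteriors. The function name `bayes_posterior` and the numbers are illustrative assumptions, not part of the datasets used later.

```python
import numpy as np

def bayes_posterior(likelihoods, priors):
    """Return P(Ci|x) for every class from P(x|Ci) and P(Ci)."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors   # numerator: P(x|Ci) * P(Ci)
    evidence = joint.sum()         # denominator: P(x) = sum over classes of the numerator
    return joint / evidence

# Illustrative (made-up) numbers for two classes
posteriors = bayes_posterior(likelihoods=[0.2, 0.1], priors=[0.5, 0.5])
print(posteriors)
```

Because P(x) is the same for every class, it only rescales the numerators so that the posteriors sum to 1.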

# Task 1 - Play Dataset

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
cd /content/gdrive/My Drive/CSE4020_ML

/content/gdrive/My Drive/CSE4020_ML


In [None]:
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv("play.csv")
data.head()

Unnamed: 0,day,outlook,temp,humidity,wind,play
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes


In [None]:
X = data.drop(['day', 'play'], axis=1) # rows - (axis=0) ; columns - (axis=1)
y = data['play']
X

Unnamed: 0,outlook,temp,humidity,wind
0,Sunny,Hot,High,Weak
1,Sunny,Hot,High,Strong
2,Overcast,Hot,High,Weak
3,Rain,Mild,High,Weak
4,Rain,Cool,Normal,Weak
5,Rain,Cool,Normal,Strong
6,Overcast,Cool,Normal,Strong
7,Sunny,Mild,High,Weak
8,Sunny,Cool,Normal,Weak
9,Rain,Mild,Normal,Weak


In [None]:
y

0      No
1      No
2     Yes
3     Yes
4     Yes
5      No
6     Yes
7      No
8     Yes
9     Yes
10    Yes
11    Yes
12    Yes
13     No
Name: play, dtype: object

In [None]:
class_priors = data['play'].value_counts(normalize=True)  # value_counts() counts how often each unique value occurs
# normalize=True: the counts are divided by the total number of elements, giving the relative frequency of each unique value
# normalize=False (the default): returns the raw count of each unique value
print("Class Prior Probabilities:")
class_priors

Class Prior Probabilities:


Yes    0.642857
No     0.357143
Name: play, dtype: float64

Alternatively, since `y` already holds the `play` column:


```
class_priors = y.value_counts(normalize=True)
```
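
To see what `normalize=True` changes, here is a small sketch on a hypothetical four-element label series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy labels: three 'Yes' and one 'No'
labels = pd.Series(['Yes', 'No', 'Yes', 'Yes'])

counts = labels.value_counts()                # raw frequencies per unique value
priors = labels.value_counts(normalize=True)  # frequencies divided by the total

print(counts)
print(priors)
```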



In [None]:
# Convert the dataset into a numpy array
X = X.values

*The conversion makes indexing and slicing easier: `X` is now a plain 2D NumPy array.*

In [None]:
test_sample = ['Rain', 'Cool', 'High']

In [None]:
def calc_likelihood(feature, value, target_class):
    class_indices = np.where(y == target_class)[0]
    feature_values = X[class_indices][:, feature]
    # values of the selected feature column for the target-class rows
    count = np.count_nonzero(feature_values == value)
    return count / len(class_indices)

Suppose we have the following dataset:

```python
X = [['Sunny', 'Hot', 'High', 'Weak'],
     ['Sunny', 'Hot', 'High', 'Strong'],
     ['Overcast', 'Hot', 'High', 'Weak'],
     ['Rain', 'Mild', 'High', 'Weak'],
     ['Rain', 'Cool', 'Normal', 'Weak']]
     
y = ['No', 'No', 'Yes', 'Yes', 'Yes']
```

Let's say we want to calculate the likelihood of the feature `'Outlook'` having the value `'Sunny'` given the target class `'Yes'`.

1. `class_indices = np.where(y == target_class)[0]`
   - This line finds the indices of the instances in the target variable (`y`) that belong to the target class (`'Yes'` in this case). In our example, `class_indices` would be `[2, 3, 4]`.

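2. `feature_values = X[class_indices][:, feature]`
   - `X[class_indices]` keeps only the rows that belong to the target class.
   - The `[:, feature]` syntax then selects all rows (`:`) and the column specified by the feature index.
   - Together, this line extracts the values of the specified feature (`feature`) from the instances that belong to the target class. In our example, the feature values for the instances with indices `[2, 3, 4]` are `['Overcast', 'Rain', 'Rain']`.

3. `count = np.count_nonzero(feature_values == value)`
   - This line counts how many of the extracted values equal `value`. In our example, none of `['Overcast', 'Rain', 'Rain']` is `'Sunny'`, so `count` is `0`.

4. `return count / len(class_indices)`
   - The count is divided by the number of instances in the target class. In our example, the likelihood is `0 / 3 = 0`.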
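
The walk-through above can be run end to end as a sketch. It assumes the five rows are stored as NumPy arrays (rather than plain lists) so that the element-wise comparison and 2D slicing behave as in `calc_likelihood`.

```python
import numpy as np

# The five-row example, stored as NumPy arrays
X = np.array([['Sunny', 'Hot', 'High', 'Weak'],
              ['Sunny', 'Hot', 'High', 'Strong'],
              ['Overcast', 'Hot', 'High', 'Weak'],
              ['Rain', 'Mild', 'High', 'Weak'],
              ['Rain', 'Cool', 'Normal', 'Weak']])
y = np.array(['No', 'No', 'Yes', 'Yes', 'Yes'])

feature, value, target_class = 0, 'Sunny', 'Yes'   # column 0 is outlook

class_indices = np.where(y == target_class)[0]     # rows labelled 'Yes'
feature_values = X[class_indices][:, feature]      # outlook values of those rows
count = np.count_nonzero(feature_values == value)  # how many of them equal 'Sunny'
likelihood = count / len(class_indices)

print(class_indices, feature_values, count, likelihood)
```

No `'Yes'` row is `'Sunny'` in these five rows, so the likelihood is 0. A single zero like this would wipe out the whole product of likelihoods for that class.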

In [None]:
# Calculate likelihood for the test sample
likelihoods = []
for i, feature_value in enumerate(test_sample):
    likelihood_yes = calc_likelihood(i, feature_value, 'Yes')
    likelihood_no = calc_likelihood(i, feature_value, 'No')
    likelihoods.append((likelihood_yes, likelihood_no))

In `for i, feature_value in enumerate(test_sample):`, the loop iterates over each value in `test_sample`. On each iteration it assigns the value's index to `i` and the value itself to `feature_value`.
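
A quick sketch of what `enumerate` yields for the test sample (the name `pairs` is only for illustration):

```python
test_sample = ['Rain', 'Cool', 'High']

# enumerate pairs each value with its position, starting at 0
pairs = list(enumerate(test_sample))
for i, feature_value in pairs:
    print(i, feature_value)
```

The index `i` doubles as the column index into `X`, which is why the order of `test_sample` has to match the column order.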

In [None]:
print("Likelihoods for Test Sample:\n")
for i, feature_value in enumerate(test_sample):
    print(f"Feature: {data.columns[i+1]}")  # i+1 skips the 'day' column, which is in data but was dropped from X
    #print(f"Feature: {X[:, i].tolist()}, Value: {feature_value}")
    print(f"Likelihood (Yes): {likelihoods[i][0]}")
    print(f"Likelihood (No): {likelihoods[i][1]}\n")

Likelihoods for Test Sample:

Feature: outlook
Likelihood (Yes): 0.3333333333333333
Likelihood (No): 0.4

Feature: temp
Likelihood (Yes): 0.3333333333333333
Likelihood (No): 0.2

Feature: humidity
Likelihood (Yes): 0.3333333333333333
Likelihood (No): 0.8



In [None]:
class_conditional_probs = []
for class_idx, target_class in enumerate(['Yes', 'No']):
    # Each entry of likelihoods is a (likelihood_yes, likelihood_no) tuple, so index it by position
    class_conditional_prob = np.prod([likelihood[class_idx] for likelihood in likelihoods])
    class_conditional_probs.append(class_conditional_prob)
    # Equivalent: np.prod([calc_likelihood(i, feature_value, target_class) for i, feature_value in enumerate(test_sample)])

print("Class Conditional Probabilities:")
print(f"P(Sample|Yes): {class_conditional_probs[0]}")
print(f"P(Sample|No): {class_conditional_probs[1]}")

Class Conditional Probabilities:
P(Sample|Yes): 0.037037037037037035
P(Sample|No): 0.06400000000000002
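
The two products can be checked by hand. The sketch below redoes the arithmetic with exact fractions, assuming the counts implied by the likelihood output: 3 of the 9 `Yes` rows match each sample value, and 2, 1 and 4 of the 5 `No` rows match `Rain`, `Cool` and `High`.

```python
from fractions import Fraction
from math import prod

# Per-feature likelihoods for <Rain, Cool, High>, written as exact fractions
likelihoods_yes = [Fraction(3, 9), Fraction(3, 9), Fraction(3, 9)]
likelihoods_no = [Fraction(2, 5), Fraction(1, 5), Fraction(4, 5)]

product_yes = prod(likelihoods_yes)  # 1/27
product_no = prod(likelihoods_no)    # 8/125

print(product_yes, float(product_yes))
print(product_no, float(product_no))
```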


In [None]:
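# Calculate posterior probabilities
# Select the priors in the same ['Yes', 'No'] order as class_conditional_probs
posterior_probs = class_priors[['Yes', 'No']].values * np.array(class_conditional_probs)
posterior_probs /= np.sum(posterior_probs)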

print("Posterior Probabilities:")
for i, target_class in enumerate(['Yes', 'No']):
    print(f"P({target_class}|Sample): {posterior_probs[i]}")

Posterior Probabilities:
P(Yes|Sample): 0.510204081632653
P(No|Sample): 0.489795918367347
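
The normalisation step can be verified with exact fractions. This sketch assumes the priors 9/14 and 5/14 and the products 1/27 and 8/125, which correspond to the decimal values printed earlier.

```python
from fractions import Fraction

prior_yes, prior_no = Fraction(9, 14), Fraction(5, 14)
product_yes, product_no = Fraction(1, 27), Fraction(8, 125)

joint_yes = prior_yes * product_yes   # 1/42
joint_no = prior_no * product_no      # 4/175
evidence = joint_yes + joint_no       # P(Sample), the normalising constant

posterior_yes = joint_yes / evidence  # 25/49
posterior_no = joint_no / evidence    # 24/49
print(float(posterior_yes), float(posterior_no))
```

The two classes are almost tied (25/49 against 24/49), so the prediction for this sample is sensitive to small changes in the counts.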


In [None]:
# Get the class with the maximum posterior probability
max_prob_index = np.argmax(posterior_probs)
predicted_class = ['Yes', 'No'][max_prob_index]

predicted_class

'Yes'

In [None]:
from sklearn.preprocessing import OneHotEncoder
df_encoded = pd.get_dummies(data, columns=['outlook', 'temp', 'humidity', 'wind'])
df_encoded.head()

Unnamed: 0,day,play,outlook_Overcast,outlook_Rain,outlook_Sunny,temp_Cool,temp_Hot,temp_Mild,humidity_High,humidity_Normal,wind_Strong,wind_Weak
0,D1,No,0,0,1,0,1,0,1,0,0,1
1,D2,No,0,0,1,0,1,0,1,0,1,0
2,D3,Yes,1,0,0,0,1,0,1,0,0,1
3,D4,Yes,0,1,0,0,0,1,1,0,0,1
4,D5,Yes,0,1,0,1,0,0,0,1,0,1


In [None]:
X = df_encoded.drop(['day', 'play'], axis=1) # rows - (axis=0) ; columns - (axis=1)
y = df_encoded['play']

In [None]:
from sklearn.naive_bayes import CategoricalNB

clf = CategoricalNB()
clf.fit(X, y)

class_prior_probabilities = clf.class_count_ / np.sum(clf.class_count_)
print("Class Prior Probabilities:", class_prior_probabilities)

# Classify the test sample  <Rain, Cool, High>.
sample = pd.DataFrame({'outlook_Rain': [1], 'outlook_Sunny': [0], 'outlook_Overcast': [0],
                       'temp_Cool': [1], 'temp_Hot': [0], 'temp_Mild': [0],
                       'humidity_High': [1], 'humidity_Normal': [0],
                       'wind_Weak': [0], 'wind_Strong': [0]})

# Reorder the columns to match the data
sample = sample[X.columns]

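# predict_proba returns posterior probabilities; columns follow clf.classes_ (['No', 'Yes'])
print("Posterior Probabilities for Test Sample:", clf.predict_proba(sample))
# feature_log_prob_ holds the log of the class conditional probabilities, one array per feature
print("Log Class Conditional Probabilities:", clf.feature_log_prob_)

Class Prior Probabilities: [0.35714286 0.64285714]
Posterior Probabilities for Test Sample: [[0.61556907 0.38443093]]
Log Class Conditional Probabilities: [array([[-0.15415068, -1.94591015],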
       [-0.6061358 , -0.78845736]]), array([[-0.55961579, -0.84729786],
       [-0.45198512, -1.01160091]]), array([[-0.84729786, -0.55961579],
       [-0.31845373, -1.29928298]]), array([[-0.33647224, -1.25276297],
       [-0.45198512, -1.01160091]]), array([[-0.55961579, -0.84729786],
       [-0.31845373, -1.29928298]]), array([[-0.55961579, -0.84729786],
       [-0.6061358 , -0.78845736]]), array([[-1.25276297, -0.33647224],
       [-0.45198512, -1.01160091]]), array([[-0.33647224, -1.25276297],
       [-1.01160091, -0.45198512]]), array([[-0.84729786, -0.55961579],
       [-0.45198512, -1.01160091]]), array([[-0.55961579, -0.84729786],
       [-1.01160091, -0.45198512]])]




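Alternatively, the test sample can be encoded with a fitted encoder instead of building the one-hot columns by hand. This assumes `enc` is a `OneHotEncoder` already fitted on the outlook, temp and humidity columns; it is not defined in the cells above.

```
# Define the test sample
test_sample = np.array(['Rain', 'Cool', 'High']).reshape(1, -1)

# One-hot encode the test sample (enc must be a fitted OneHotEncoder)
test_sample = enc.transform(test_sample).toarray()
```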
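
`CategoricalNB` is designed for columns that hold integer category codes (0 to n-1 per feature), so another option is to encode each feature ordinally instead of one-hot. The sketch below assumes scikit-learn's `OrdinalEncoder` and uses only the five rows shown by `data.head()`, restricted to three features, so its probabilities are not comparable to the 14-row results above. The names `ordinal_enc`, `X_raw` and `nb` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# First five rows of the play data (outlook, temp, humidity only)
X_raw = np.array([['Sunny', 'Hot', 'High'],
                  ['Sunny', 'Hot', 'High'],
                  ['Overcast', 'Hot', 'High'],
                  ['Rain', 'Mild', 'High'],
                  ['Rain', 'Cool', 'Normal']])
y_raw = np.array(['No', 'No', 'Yes', 'Yes', 'Yes'])

ordinal_enc = OrdinalEncoder()  # maps each category to a code 0..n-1, per column
X_enc = ordinal_enc.fit_transform(X_raw).astype(int)

nb = CategoricalNB()            # default alpha=1.0 applies additive (Laplace) smoothing
nb.fit(X_enc, y_raw)

sample = ordinal_enc.transform(np.array([['Rain', 'Cool', 'High']])).astype(int)
proba = nb.predict_proba(sample)  # columns follow nb.classes_
print(dict(zip(nb.classes_, proba[0])))
```

With this encoding each original feature stays a single categorical variable, rather than being split into several binary columns that the model treats as independent features.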



# Task 2 - Music Dataset

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [None]:
data = pd.read_csv(r"C:\Users\Ashima\Downloads\music.csv")
data.head()

Unnamed: 0,Class,_RMSenergy_Mean,_Lowenergy_Mean,_Fluctuation_Mean,_Tempo_Mean,_MFCC_Mean_1,_MFCC_Mean_2,_MFCC_Mean_3,_MFCC_Mean_4,_MFCC_Mean_5,...,_Chromagram_Mean_9,_Chromagram_Mean_10,_Chromagram_Mean_11,_Chromagram_Mean_12,_HarmonicChangeDetectionFunction_Mean,_HarmonicChangeDetectionFunction_Std,_HarmonicChangeDetectionFunction_Slope,_HarmonicChangeDetectionFunction_PeriodFreq,_HarmonicChangeDetectionFunction_PeriodAmp,_HarmonicChangeDetectionFunction_PeriodEntropy
0,relax,0.052,0.591,9.136,130.043,3.997,0.363,0.887,0.078,0.221,...,0.426,1.0,0.008,0.101,0.316,0.261,0.018,1.035,0.593,0.97
1,relax,0.125,0.439,6.68,142.24,4.058,0.516,0.785,0.397,0.556,...,0.002,1.0,0.0,0.984,0.285,0.211,-0.082,3.364,0.702,0.967
2,relax,0.046,0.639,10.578,188.154,2.775,0.903,0.502,0.329,0.287,...,0.184,0.746,0.016,1.0,0.413,0.299,0.134,1.682,0.692,0.963
3,relax,0.135,0.603,10.442,65.991,2.841,1.552,0.612,0.351,0.011,...,0.038,1.0,0.161,0.757,0.422,0.265,0.042,0.354,0.743,0.968
4,relax,0.066,0.591,9.769,88.89,3.217,0.228,0.814,0.096,0.434,...,0.004,0.404,1.0,0.001,0.345,0.261,0.089,0.748,0.674,0.957


In [None]:
X = data.drop('Class', axis=1)
y = data['Class']

# Normalize the feature values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Encode the target variable
label_encoder = LabelEncoder()  # create an instance of the LabelEncoder class
y_encoded = label_encoder.fit_transform(y)  # learn the class names and map them to integer codes

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2, random_state=42)
X_train

array([[0.29691211, 0.75062344, 0.34491078, ..., 0.03884624, 0.56349206,
        0.76315789],
       [0.152019  , 0.63341646, 0.14420709, ..., 0.18004187, 0.42063492,
        0.71052632],
       [0.17814727, 0.69326683, 0.12837396, ..., 0.04861596, 0.11111111,
        0.71052632],
       ...,
       [0.56057007, 0.75561097, 0.17014325, ..., 0.16515469, 0.65873016,
        0.78947368],
       [0.53444181, 0.50124688, 0.07393818, ..., 0.47825076, 0.86772487,
        0.73684211],
       [0.24940618, 0.6159601 , 0.26021613, ..., 0.40358223, 0.82539683,
        0.71052632]])
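
Fitting the scaler on the full dataset lets the test rows influence the scaling. A common alternative is to split first and fit `MinMaxScaler` on the training portion only. The sketch below shows this on synthetic data; the array shapes and random values are assumptions, not the music dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))     # synthetic stand-in features
y_demo = rng.integers(0, 4, size=100)  # synthetic stand-in labels (4 classes)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

demo_scaler = MinMaxScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse those min/max values

print(X_tr_scaled.min(), X_tr_scaled.max())
```

The scaled test values may fall slightly outside [0, 1], because the test rows were not used to determine the minimum and maximum.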

In [None]:
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred

array([3, 3, 2, 2, 1, 2, 3, 2, 3, 1, 2, 0, 2, 2, 1, 2, 0, 0, 3, 0, 2, 2,
       1, 0, 2, 0, 1, 3, 0, 1, 0, 1, 2, 2, 2, 1, 1, 0, 1, 1, 0, 2, 2, 2,
       2, 0, 1, 1, 2, 1, 2, 1, 1, 2, 0, 2, 3, 1, 2, 1, 0, 3, 1, 2, 1, 2,
       1, 3, 1, 2, 2, 1, 3, 2, 0, 3, 1, 0, 1, 2])
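
The integers in `y_pred` are the codes assigned by `LabelEncoder`. Below is a minimal sketch of mapping codes back to class names. Only `relax` appears in the dataset preview above; the other three names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names; only 'relax' is confirmed by the dataset preview
demo_labels = np.array(['relax', 'happy', 'sad', 'angry', 'relax'])

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(demo_labels)  # classes are sorted before codes are assigned
decoded = demo_encoder.inverse_transform(codes)  # back to the original strings

print(list(demo_encoder.classes_))
print(codes, decoded)
```

In the notebook itself, the equivalent call would be `label_encoder.inverse_transform(y_pred)`.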

In [None]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.7625
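
Accuracy is the fraction of test samples whose predicted label equals the true label. The sketch below checks this on hypothetical label arrays and adds a confusion matrix, which shows which classes get mixed up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true_demo = np.array([0, 1, 2, 2, 3, 1])
y_pred_demo = np.array([0, 1, 2, 1, 3, 3])

acc = accuracy_score(y_true_demo, y_pred_demo)   # fraction of exact matches
manual_acc = np.mean(y_true_demo == y_pred_demo) # same quantity computed by hand
cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class

print(acc, manual_acc)
print(cm)
```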