In [1]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder  # converts categorical labels into integer codes
from sklearn.model_selection import train_test_split
import pandas as pd

In [2]:
df = pd.read_csv('mushrooms.csv')

In [3]:
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [5]:
df.shape

(8124, 23)

In [6]:
le = LabelEncoder()

In [7]:
ds = df.apply(le.fit_transform)

This line applies the LabelEncoder's fit_transform() method to each column of the DataFrame df. For every column, fit_transform() learns that column's unique values and replaces each one with an integer code, assigned in sorted order of the values (in the class column, e becomes 0 and p becomes 1). Because the same le object is refit on every column, it only retains the mapping for the last column it processed.

The result is stored in a new DataFrame called ds, in which each categorical value of df is replaced by its integer code.
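
To make the encoding concrete, here is a minimal sketch that uses only NumPy. It assumes a small made-up column (odor_col is a hypothetical name with illustrative values, not data read from mushrooms.csv) and relies on np.unique returning the sorted unique values, which gives the same sorted-order numbering described above. It is an illustration of the idea, not LabelEncoder's actual implementation.

```python
import numpy as np

# Hypothetical toy column; values are illustrative, not taken from mushrooms.csv.
odor_col = np.array(["p", "a", "l", "p", "n"])

# np.unique returns the sorted unique values and, with return_inverse=True,
# the position of each original element within that sorted array.
classes, codes = np.unique(odor_col, return_inverse=True)

print(classes)  # sorted unique categories
print(codes)    # integer code for each original value

# Decoding: index the sorted categories with the integer codes.
decoded = classes[codes]
print(decoded)
```

Keeping the classes array around for each column is what makes it possible to translate codes back into the original letters later.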

In [8]:
ds

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,3,2,4,0,5,0,0,0,11,...,2,5,5,0,1,1,4,0,1,2
8120,0,5,2,4,0,5,0,0,0,11,...,2,5,5,0,0,1,4,0,4,2
8121,0,2,2,4,0,5,0,0,0,5,...,2,5,5,0,1,1,4,0,1,2
8122,1,3,3,4,0,8,1,0,1,0,...,1,7,7,0,2,1,0,7,4,2


In [9]:
data = ds.values
print(data.shape)
print(type(data))

(8124, 23)
<class 'numpy.ndarray'>


ds.values returns the contents of the ds DataFrame as a NumPy array with the same shape, (8124, 23). The functions defined below operate on this array rather than on the DataFrame.

In [10]:
data[:5, :]  # first five rows, all columns

array([[1, 5, 2, 4, 1, 6, 1, 0, 1, 4, 0, 3, 2, 2, 7, 7, 0, 2, 1, 4, 2, 3,
        5],
       [0, 5, 2, 9, 1, 0, 1, 0, 0, 4, 0, 2, 2, 2, 7, 7, 0, 2, 1, 4, 3, 2,
        1],
       [0, 0, 2, 8, 1, 3, 1, 0, 0, 5, 0, 2, 2, 2, 7, 7, 0, 2, 1, 4, 3, 2,
        3],
       [1, 5, 3, 8, 1, 6, 1, 0, 1, 5, 0, 3, 2, 2, 7, 7, 0, 2, 1, 4, 2, 3,
        5],
       [0, 5, 2, 3, 0, 5, 1, 1, 0, 4, 1, 3, 2, 2, 7, 7, 0, 2, 1, 0, 3, 0,
        1]])

In [11]:
data_X = data[:, 1:]  # features: all rows (:), every column from index 1 to the end (1:)
data_Y = data[:, 0]   # labels: all rows (:), only the first column (0), which holds the class

In [12]:
X_train, X_test, Y_train, Y_test = train_test_split(data_X, data_Y, test_size= 0.2)

In [13]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((6499, 22), (1625, 22), (6499,), (1625,))
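
Note that the split above is random: the shapes are the same on every run, but which rows land in each set changes unless a seed is fixed (train_test_split accepts a random_state argument for this). As a rough sketch of what a seeded 80/20 shuffle split does, here is a NumPy-only version. shuffle_split is a hypothetical helper written for illustration, not scikit-learn's implementation. Rounding the test size up matches the 1625 test rows shown above, since 0.2 × 8124 = 1624.8.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.2, seed=0):
    # Hypothetical helper: shuffle row indices with a fixed seed, then cut.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    # Round the number of test rows up.
    n_test = int(np.ceil(test_size * X.shape[0]))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2

X_tr, X_te, y_tr, y_te = shuffle_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# The same seed yields the same split again.
X_tr_again, _, _, _ = shuffle_split(X_toy, y_toy)
```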

In [14]:
np.unique(Y_train)  # the distinct class labels present in the training labels

# np.unique lists the classes the model will be trained to predict. On its own it does not
# show how many samples each class has; np.unique(Y_train, return_counts=True) also returns the counts.

array([0, 1])

The output array([0, 1]) from np.unique(Y_train) means that there are two unique values, namely 0 and 1, present in the Y_train array.
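
To see the class distribution rather than just the class names, np.unique can also return counts. The sketch below uses a made-up label vector (y_toy is a hypothetical stand-in for Y_train) so that it runs on its own.

```python
import numpy as np

# Hypothetical label vector standing in for Y_train (0 encodes 'e', 1 encodes 'p').
y_toy = np.array([0, 1, 0, 0, 1, 0])

# return_counts=True adds the number of occurrences of each unique value.
labels, counts = np.unique(y_toy, return_counts=True)

for lab, cnt in zip(labels, counts):
    # Print the label, its count, and its share of the samples.
    print(lab, cnt, cnt / y_toy.size)
```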

In [15]:
def prior_prob(y_train, label):
    return np.sum(y_train == label)/float(y_train.shape[0])

y_train is the array of training labels, and label is the specific label for which you want to calculate the prior probability.
___
y_train == label compares each element of the y_train array with the label value and returns an array of Boolean values (True or False) indicating whether each element is equal to the label.
____
The np.sum() function then adds up that array, counting each True as 1, which gives the number of training samples that carry the label.
____
float(y_train.shape[0]) retrieves the number of elements in the y_train array, which is the total number of training samples (6499 here). In Python 3 the / operator already produces a floating-point result, so the float() call is redundant but harmless.
___

The function returns the ratio of the count of instances with the specified label to the total number of instances in the training set. This ratio represents the prior probability of that label in the training data.
___
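
As a quick check of the explanation above, here is prior_prob applied to a tiny made-up label array (y_toy is hypothetical). The sketch also shows that taking the mean of the Boolean mask gives the same ratio.

```python
import numpy as np

def prior_prob(y_train, label):
    # Fraction of training labels equal to `label`.
    return np.sum(y_train == label) / float(y_train.shape[0])

# Hypothetical labels: three samples of class 0, two of class 1.
y_toy = np.array([0, 1, 0, 0, 1])

p0 = prior_prob(y_toy, 0)
p1 = prior_prob(y_toy, 1)
print(p0, p1)

# The mean of a Boolean mask is the same ratio as count / total.
p0_alt = np.mean(y_toy == 0)
```

The priors over all classes sum to 1, which is a useful sanity check on real data as well.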

In [16]:
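def cond_prob(X_train, Y_train, feature_col, feature_val, label):
    # Keep only the training rows that carry the given label.
    X_train = X_train[Y_train == label]
    # Count the rows among them whose feature_col equals feature_val.
    numerator_mushrooms = np.sum(X_train[:, feature_col] == feature_val)
    # Total number of training rows with the given label.
    denominator_mushroom = np.sum(Y_train == label)

    return numerator_mushrooms/float(denominator_mushroom)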

- feature_col is the column index of the feature for which you want to calculate the conditional probability.

- feature_val is the specific value of the feature for which you want to calculate the conditional probability.

- label is the specific label for which you want to calculate the conditional probability.

- X_train = X_train[Y_train == label] selects only the rows from X_train where Y_train matches the specified label. This restricts the dataset to instances with the given label.

- numerator_mushrooms calculates the number of instances in the restricted X_train dataset where the feature at feature_col matches the specified feature_val.

- denominator_mushroom calculates the total number of instances in Y_train that match the specified label.

- The function then returns the ratio of the numerator_mushrooms to the denominator_mushroom. This represents the conditional probability of the specified feature value given the label.
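
One limitation of this ratio: if a feature value never occurs together with a label in the training data, the conditional probability is exactly 0, and a single 0 wipes out the whole product when probabilities are multiplied. A common remedy is additive (Laplace) smoothing. The sketch below is a variation for illustration, not the method used in this notebook; cond_prob_smoothed, alpha, and n_values (the number of distinct values the feature can take) are hypothetical names.

```python
import numpy as np

def cond_prob_smoothed(X_train, Y_train, feature_col, feature_val, label,
                       n_values, alpha=1.0):
    # Rows of X_train whose label matches.
    X_label = X_train[Y_train == label]
    # Number of those rows with the requested feature value.
    count = np.sum(X_label[:, feature_col] == feature_val)
    # Add alpha to the count and alpha * n_values to the total so that
    # unseen values get a small non-zero probability.
    return (count + alpha) / (X_label.shape[0] + alpha * n_values)

# Toy data: one feature with three possible values (0, 1, 2).
X_toy = np.array([[0], [0], [1], [2]])
Y_toy = np.array([0, 0, 0, 1])

# Value 2 never appears with label 0, yet its probability is not zero.
p_unseen = cond_prob_smoothed(X_toy, Y_toy, 0, 2, 0, n_values=3)
p_seen = cond_prob_smoothed(X_toy, Y_toy, 0, 0, 0, n_values=3)
print(p_unseen, p_seen)
```

With alpha=1 the smoothed probabilities for a label still sum to 1 across the feature's possible values.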

In [17]:
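def predict(X_train, Y_train, X_test):
    # X_test is a single sample here: a 1-D array of encoded feature values.
    classes = np.unique(Y_train)

    n_feat = X_train.shape[1] #22
    post_prob = []

    for label in classes:
        likelihood = 1.0

        # Multiply the per-feature conditional probabilities (Naive Bayes assumption).
        for f in range(n_feat):
            conditional_prob = cond_prob(X_train, Y_train, f, X_test[f], label)
            likelihood *= conditional_prob

        prior = prior_prob(Y_train, label)

        posterior = likelihood * prior
        post_prob.append(posterior)

    # Index of the largest posterior; it equals the class label because classes is [0, 1].
    pred = np.argmax(post_prob)

    return pred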

- classes = np.unique(Y_train) retrieves the unique labels in Y_train and assigns them to the classes variable.
- n_feat = X_train.shape[1] determines the number of features in the training set by reading the second dimension (axis 1) of X_train.shape, which is the number of columns.
- post_prob = [] initializes an empty list to store the posterior probabilities for each class.
-----
- The code enters a loop that iterates over each unique label in classes (i.e., 0 and 1).
- For each label, it calculates the likelihood by multiplying the conditional probabilities of each feature given that label. It uses the cond_prob() function defined earlier, passing X_test[f], the value of feature f in the single test instance supplied as X_test.
- It then calculates the prior probability of the label using the prior_prob() function, also defined earlier.
- The posterior probability (up to a normalizing constant) is computed by multiplying the likelihood and the prior probability.
- The posterior probability is added to the post_prob list.
---
- pred = np.argmax(post_prob) finds the index of the maximum value in the post_prob list. Because classes is [0, 1], that index is also the label of the class with the highest posterior probability.

- The predicted class index is returned as the output of the predict function.
---
In summary, the predict function iterates over each class, calculates the likelihood and prior probability, and multiplies them to get an unnormalized posterior for each class under the Naive Bayes assumption that the features are conditionally independent given the class. The class with the highest posterior is chosen as the prediction for the given test instance.
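
Multiplying many probabilities can drive the product toward zero in floating point. With 22 features this is unlikely to be a problem here, but a common variation sums log-probabilities instead. The sketch below is one way to do that, not the notebook's implementation: predict_log is a hypothetical name, the toy arrays are made up, and it returns the class label itself (classes[...]) rather than the index.

```python
import numpy as np

def predict_log(X_train, Y_train, x):
    # x is a single sample: a 1-D array of encoded feature values.
    classes = np.unique(Y_train)
    log_post = []
    for label in classes:
        X_label = X_train[Y_train == label]
        # Per-feature conditional probabilities P(x_f | label),
        # computed for all features at once via broadcasting.
        cond = np.mean(X_label == x, axis=0)
        # log(0) is -inf, so a class with an unseen feature value
        # ends up with a log-posterior of -inf.
        with np.errstate(divide="ignore"):
            log_likelihood = np.sum(np.log(cond))
        log_prior = np.log(np.mean(Y_train == label))
        log_post.append(log_likelihood + log_prior)
    # Map the winning index back to the actual class label.
    return classes[np.argmax(log_post)]

# Hypothetical toy data: 5 samples, 2 features.
X_toy = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [0, 0]])
Y_toy = np.array([0, 0, 1, 1, 1])

pred_a = predict_log(X_toy, Y_toy, np.array([0, 1]))
pred_b = predict_log(X_toy, Y_toy, np.array([1, 0]))
print(pred_a, pred_b)
```

Since the logarithm is monotonic, the class that maximizes the sum of logs is the same class that maximizes the product.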

In [19]:
output = predict(X_train, Y_train, X_test[1])
print(output, Y_test)

1 [0 1 0 ... 0 1 1]


- This cell calls the predict function on a single test instance and prints the result next to the ground truth labels.

- output = predict(X_train, Y_train, X_test[1]) calls the predict function with X_train (training set features), Y_train (training set labels), and X_test[1] (a single test instance, the second row of X_test) as arguments. It assigns the predicted class to the variable output.

- print(output, Y_test) prints the predicted class (output) followed by the entire Y_test array, which NumPy abbreviates with an ellipsis.

- The output of the predict function is a predicted class index. The ground truth label for this particular instance is Y_test[1], the second element of the printed array; here both the prediction and that label are 1.

In [20]:
def score(X_train, X_test, Y_train, Y_test):
    pred = []

    # Predict a label for each test instance in turn.
    for i in range(X_test.shape[0]):
        predict_label = predict(X_train, Y_train, X_test[i])
        pred.append(predict_label)

    pred = np.array(pred)

    # Accuracy: fraction of predictions that match the true labels.
    accuracy = np.sum(pred == Y_test)/ Y_test.shape[0]

    return accuracy

In [21]:
print(score(X_train, X_test, Y_train, Y_test))

0.9975384615384615
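
With 1625 test instances, this accuracy corresponds to 1621 correct predictions and 4 errors (1621 / 1625 = 0.99753...). The accuracy step inside score can be tried in isolation on made-up arrays; pred_toy and y_toy below are hypothetical values, not outputs of this notebook.

```python
import numpy as np

# Hypothetical predictions and ground truth labels.
pred_toy = np.array([0, 1, 1, 0, 1])
y_toy = np.array([0, 1, 0, 0, 1])

# Same formula as in score(): matches divided by the number of samples.
accuracy = np.sum(pred_toy == y_toy) / y_toy.shape[0]

# Equivalent: the mean of the Boolean "is correct" mask.
accuracy_alt = np.mean(pred_toy == y_toy)

# Number of misclassified samples.
n_wrong = np.sum(pred_toy != y_toy)
print(accuracy, accuracy_alt, n_wrong)
```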


In [28]:
X_test[1]

array([2, 2, 8, 1, 2, 1, 0, 0, 3, 1, 1, 2, 2, 7, 7, 0, 2, 1, 4, 1, 3, 1])