In [4]:
import pandas as pd

In [5]:
data = pd.read_csv("mushrooms.csv")

In [6]:
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


Determine the value of:

\begin{equation}
P(\text{cap-shape} \cap \text{cap-surface} \cap \text{cap-color} \mid \text{class})
\end{equation}

Let's answer an easier question first. Suppose you were asked to compute the following probability; how would you go about it?

\begin{equation}
P(\text{cap-shape}=x \cap \text{cap-surface}=y \cap \text{cap-color}=w \mid \text{class}=e)
\end{equation}

That is, the probability of picking a mushroom whose cap-shape is x, cap-surface is y, and cap-color is w, given that (on the condition that) the mushroom belongs to the 'e' class.

                                                        OR

The probability that a mushroom picked from among the 'e' class mushrooms alone has cap-shape x, cap-surface y, and cap-color w.

We can say that event A can be represented as:

\begin{equation}
\text{cap-shape}=x \cap \text{cap-surface}=y \cap \text{cap-color}=w
\end{equation}

that is, picking a mushroom whose cap-shape is x, cap-surface is y, and cap-color is w.

Similarly, we can assume that we have another event B that can be represented as:

\begin{equation}
\text{class} = e
\end{equation}

that is, picking a mushroom whose class is e.

So, now we can easily say that:

\begin{equation}
P(\text{cap-shape}=x \cap \text{cap-surface}=y \cap \text{cap-color}=w \mid \text{class}=e) = P(A|B)
\end{equation}

By the definition of conditional probability, we can say that:

\begin{equation}
P(A|B) = \frac{P(A \cap B)}{P(B)}
\end{equation}

Now, following the above equation, we can write that: 

\begin{equation}
P(\text{cap-shape}=x \cap \text{cap-surface}=y \cap \text{cap-color}=w \mid \text{class}=e) = \frac{P(\text{cap-shape}=x \cap \text{cap-surface}=y \cap \text{cap-color}=w \cap \text{class}=e)}{P(\text{class}=e)}
\end{equation}

\begin{equation}
P(\text{cap-shape}=x \cap \text{cap-surface}=y \cap \text{cap-color}=w \cap \text{class}=e)
\end{equation}

is the joint probability in the numerator. It is equal to:

The relative frequency of mushrooms (rows of the pandas DataFrame) that satisfy all four conditions at the same time: cap-shape is x, cap-surface is y, cap-color is w, and class is e,

which is equal to:

\begin{equation}
\frac{\text{Frequency}(\text{cap-shape}=x \cap \text{cap-surface}=y \cap \text{cap-color}=w \cap \text{class}=e)}{\text{Total number of mushrooms}}
\end{equation}

which will be further equal to:

\begin{equation}
\frac{\text{Number of DataFrame rows with cap-shape}=x \text{, cap-surface}=y \text{, cap-color}=w \text{ and class}=e}{\text{Total number of rows in the DataFrame}}
\end{equation}

Finding the number of rows that satisfy the conditions in the numerator:

In [9]:
frequency = data[(data['cap-shape'] == 'x') & (data['cap-surface'] == 'y') & (data['cap-color'] == 'w') & (data['class'] == 'e')].shape[0]

In [10]:
total_mushrooms = data.shape[0]

In [11]:
numerator_joint_probability = frequency/total_mushrooms

In [12]:
numerator_joint_probability

0.008862629246676515

\begin{equation}
P(\text{class}=e)
\end{equation}

can be computed as:

\begin{equation}
\frac{\text{Frequency}(\text{class}=e)}{\text{Total number of mushrooms}}
\end{equation}

which can be further computed as:

\begin{equation}
\frac{\text{Number of DataFrame rows with class}=e}{\text{Total number of mushrooms}}
\end{equation}
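
Both the joint probability and P(class=e) are divided by the same total number of mushrooms, so that total cancels in the ratio. Writing N for the total number of rows (a symbol introduced here only for brevity), the conditional probability reduces to a ratio of two counts:

\begin{equation}
P(A|B) = \frac{\text{Frequency}(A \cap B) / N}{\text{Frequency}(B) / N} = \frac{\text{Frequency}(A \cap B)}{\text{Frequency}(B)}
\end{equation}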

In [13]:
frequency_e = data[data['class'] == 'e'].shape[0]

p_class_equals_e = frequency_e/total_mushrooms

In [14]:
conditional_probability = numerator_joint_probability/p_class_equals_e

In [15]:
conditional_probability

0.01711026615969582
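
The same number can be reached in one step by conditioning first, that is, by restricting the data to class e and then measuring how often event A occurs inside that subset. The sketch below uses a small made-up DataFrame named `toy` (hypothetical rows, not the real mushrooms.csv) so that it runs on its own; the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy rows standing in for mushrooms.csv (not real data).
toy = pd.DataFrame({
    "class":       ["e", "e", "e", "e", "p", "p"],
    "cap-shape":   ["x", "x", "b", "x", "x", "f"],
    "cap-surface": ["y", "y", "s", "s", "y", "y"],
    "cap-color":   ["w", "w", "w", "n", "w", "n"],
})

# Condition on B first: keep only the class-e rows.
e_rows = toy[toy["class"] == "e"]

# Event A as a boolean mask over those rows.
is_a = (
    (e_rows["cap-shape"] == "x")
    & (e_rows["cap-surface"] == "y")
    & (e_rows["cap-color"] == "w")
)

# The mean of a boolean Series is the fraction of True values,
# i.e. count(A and B) / count(B).
p_a_given_b = is_a.mean()

# The two-step route used in the cells above, for comparison.
is_a_and_b = (
    (toy["class"] == "e")
    & (toy["cap-shape"] == "x")
    & (toy["cap-surface"] == "y")
    & (toy["cap-color"] == "w")
)
joint = is_a_and_b.sum() / len(toy)
p_b = (toy["class"] == "e").sum() / len(toy)

print(p_a_given_b, joint / p_b)
```

Both routes give the same value on the toy rows, because the total row count cancels when the joint probability is divided by P(class=e).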

So, now we have to compute similarly: 

\begin{equation}
P(\text{cap-shape} \cap \text{cap-surface} \cap \text{cap-color} \mid \text{class})
\end{equation}

This time, however, the result is not a single probability. It is a set of probability distributions, one for each class, over every combination of cap-shape, cap-surface, and cap-color.

Let's see how many unique values each of the discrete random variables cap-shape, cap-surface, and cap-color takes across the data.

In [16]:
data['cap-shape'].unique()

array(['x', 'b', 's', 'f', 'k', 'c'], dtype=object)

In [17]:
data['cap-surface'].unique()

array(['s', 'y', 'f', 'g'], dtype=object)

In [18]:
data['cap-color'].unique()

array(['n', 'y', 'w', 'g', 'e', 'p', 'b', 'u', 'c', 'r'], dtype=object)

In [19]:
data['class'].unique()

array(['p', 'e'], dtype=object)
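
From the outputs above, cap-shape takes 6 values, cap-surface takes 4, and cap-color takes 10, so each class has 6 × 4 × 10 = 240 possible combinations, and the two classes together give 480 conditional probabilities. For a fixed class k, the 240 values form one probability distribution and should therefore sum to one. The indices v_1, v_2, v_3 and k are notation introduced here for illustration:

\begin{equation}
\sum_{v_1} \sum_{v_2} \sum_{v_3} P(\text{cap-shape}=v_1 \cap \text{cap-surface}=v_2 \cap \text{cap-color}=v_3 \mid \text{class}=k) = 1
\end{equation}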

Let's have a look at the data for the poisonous class of mushrooms.

In [21]:
p_mushrooms_data = data[data['class'] == 'p']

In [22]:
p_mushrooms_data['cap-shape'].unique()

array(['x', 'f', 'b', 'k', 'c'], dtype=object)

It can be clearly observed that the cap-shape 's' does not occur among the p category mushrooms. In other words, in the whole sample we did not manage to get even a single p category mushroom with cap-shape 's', so its frequency is zero. This means that if we compute the following probability:

\begin{equation}
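P(\text{cap-shape}=s \cap \text{cap-surface}=\text{any} \cap \text{cap-color}=\text{any} \mid \text{class}=p) = \frac{\text{Frequency}(\text{cap-shape}=s \cap \text{cap-surface}=\text{any} \cap \text{cap-color}=\text{any}) \text{ among p category mushrooms}}{\text{Number of p category mushrooms}}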
\end{equation}

then this probability is zero, because none of the p category mushrooms has cap-shape s.

Here we can compute this probability and simply report it as zero, but that will not do in Naive Bayes classification. There, a single zero factor would wipe out the whole product of probabilities, so a hard zero has to be smoothed out using either Laplace smoothing or Lidstone smoothing.
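
As a rough illustration of what such smoothing does, the sketch below adds a small constant `alpha` to every count before normalising. With `alpha = 1` this is Laplace (add-one) smoothing, and with any other positive `alpha` it is Lidstone smoothing. The function name, its parameters, and the counts are all hypothetical and chosen only for the example; they are not taken from the mushroom data.

```python
def lidstone_probability(count, class_total, n_outcomes, alpha=1.0):
    """Smoothed estimate: (count + alpha) / (class_total + alpha * n_outcomes).

    count       -- how often the outcome was seen within the class
    class_total -- number of samples in the class
    n_outcomes  -- number of distinct outcomes the variable can take
    alpha       -- smoothing constant (1.0 gives Laplace smoothing)
    """
    return (count + alpha) / (class_total + alpha * n_outcomes)


# Hypothetical counts: a class with 100 samples and a feature with 6 possible
# values, two of which were never observed in this class.
counts = [40, 30, 20, 10, 0, 0]
class_total = sum(counts)

smoothed = [lidstone_probability(c, class_total, len(counts)) for c in counts]

unseen = smoothed[-1]   # no longer a hard zero
total = sum(smoothed)   # the smoothed values still form a distribution

print(unseen, total)
```

The unseen outcomes receive a small positive probability, while the smoothed values still sum to one because `alpha` is added once per outcome in the denominator.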

In [26]:
conditional_probabilities = {}

for cls in data['class'].unique():
    # Rows belonging to the current class; their count is the denominator.
    class_data = data[data['class'] == cls]

    for cap_shape in data['cap-shape'].unique():
        for cap_surface in data['cap-surface'].unique():
            for cap_color in data['cap-color'].unique():
                # Rows of this class that match all three cap attributes.
                matches = class_data[
                    (class_data['cap-shape'] == cap_shape)
                    & (class_data['cap-surface'] == cap_surface)
                    & (class_data['cap-color'] == cap_color)
                ]
                conditional_prob = matches.shape[0] / class_data.shape[0]

                key_string = "P(cap-shape={} and cap-surface={} and cap-color={} | class ={})".format(
                    cap_shape, cap_surface, cap_color, cls)

                conditional_probabilities[key_string] = conditional_prob
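
The nested loops above filter the DataFrame once per combination. A possible alternative is to let pandas count every observed combination in one pass with `groupby` and then divide by the class sizes. The sketch below is an assumption-laden illustration on a small made-up DataFrame named `toy` (hypothetical rows, not the real mushrooms.csv), not a drop-in replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy rows standing in for mushrooms.csv (not real data).
toy = pd.DataFrame({
    "class":       ["e", "e", "e", "e", "p", "p"],
    "cap-shape":   ["x", "x", "b", "x", "x", "f"],
    "cap-surface": ["y", "y", "s", "s", "s", "y"],
    "cap-color":   ["w", "w", "w", "n", "n", "n"],
})

features = ["cap-shape", "cap-surface", "cap-color"]

# Number of rows for every (class, cap-shape, cap-surface, cap-color)
# combination that actually occurs in the data.
joint_counts = toy.groupby(["class"] + features).size()

# Number of rows per class.
class_counts = toy.groupby("class").size()

# Divide each joint count by the size of its own class.
cond_probs = joint_counts.div(class_counts, level="class")

# Sanity check: within each class the probabilities should sum to 1.
sums = cond_probs.groupby(level="class").sum()

print(cond_probs)
print(sums)
```

Note that this version only lists combinations that occur at least once. Combinations with zero frequency, which the nested loops report as 0.0, would have to be added back separately if they are needed.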

In [29]:
conditional_probabilities

{'P(cap-shape=x and cap-surface=s and cap-color=n | class =p)': 0.0449438202247191,
 'P(cap-shape=x and cap-surface=s and cap-color=y | class =p)': 0.0,
 'P(cap-shape=x and cap-surface=s and cap-color=w | class =p)': 0.028600612870275793,
 'P(cap-shape=x and cap-surface=s and cap-color=g | class =p)': 0.020429009193054137,
 'P(cap-shape=x and cap-surface=s and cap-color=e | class =p)': 0.03677221654749745,
 'P(cap-shape=x and cap-surface=s and cap-color=p | class =p)': 0.008171603677221655,
 'P(cap-shape=x and cap-surface=s and cap-color=b | class =p)': 0.012257405515832482,
 'P(cap-shape=x and cap-surface=s and cap-color=u | class =p)': 0.0,
 'P(cap-shape=x and cap-surface=s and cap-color=c | class =p)': 0.0,
 'P(cap-shape=x and cap-surface=s and cap-color=r | class =p)': 0.0,
 'P(cap-shape=x and cap-surface=y and cap-color=n | class =p)': 0.04647599591419816,
 'P(cap-shape=x and cap-surface=y and cap-color=y | class =p)': 0.04187946884576098,
 'P(cap-shape=x and cap-surface=y and cap