In [1]:
import pandas as pd
import numpy as np
# numpy is a Python library that provides arrays and vectorized numerical functions

In [2]:
# Load the dataset persons.txt and check that it was loaded correctly
# The separator is ;
persons = pd.read_csv('./persons.txt', sep = ";")
persons.head(10)

Unnamed: 0,gender,height,weight,footsize
0,male,183.0,81.5,46
1,male,180.0,86.0,45
2,male,170.0,77.0,46
3,male,180.0,75.0,44
4,female,152.5,45.0,38
5,female,167.5,68.0,40
6,female,165.0,59.0,39
7,female,175.0,68.0,42


In [3]:
# Convert male to 0 and female to 1
# Check the dataset afterwards
persons['gender'] = np.where(persons['gender']=='male', 0, 1)
persons.head(10)

Unnamed: 0,gender,height,weight,footsize
0,0,183.0,81.5,46
1,0,180.0,86.0,45
2,0,170.0,77.0,46
3,0,180.0,75.0,44
4,1,152.5,45.0,38
5,1,167.5,68.0,40
6,1,165.0,59.0,39
7,1,175.0,68.0,42


Before we can compute the probability distribution of the features $x$, 
we must first compute the mean $\mu$ and variance $\sigma^{2}$ of each feature $x_{i}$ for each class $k$.

In [14]:
# Per-class mean and variance of each feature (gender: 0 = male, 1 = female).
# Note: pandas .var() returns the sample variance (ddof=1) by default.
mean_male = persons[persons['gender'] == 0][['height', 'weight', 'footsize']].mean()
mean_female = persons[persons['gender'] == 1][['height', 'weight', 'footsize']].mean()
var_male = persons[persons['gender'] == 0][['height', 'weight', 'footsize']].var()
var_female = persons[persons['gender'] == 1][['height', 'weight', 'footsize']].var()

In [15]:
# We take a look at the computed values.

print('Mean values for male features: ')
print(mean_male)
print()
print('Mean values for female features: ')
print(mean_female)
print()
print('Variance values for male features: ')
print(var_male)
print()
print('Variance values for female features: ')
print(var_female)
print()

Mean values for male features: 
height      178.250
weight       79.875
footsize     45.250
dtype: float64

Mean values for female features: 
height      165.00
weight       60.00
footsize     39.75
dtype: float64

Variance values for male features: 
height      32.250000
weight      24.062500
footsize     0.916667
dtype: float64

Variance values for female features: 
height       87.500000
weight      118.000000
footsize      2.916667
dtype: float64
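
The four filtered selections above can also be written as a single `groupby`. This is a minimal sketch, not the notebook's own method. To stay self-contained it rebuilds `persons` from the eight rows printed earlier, on the assumption that those rows are the whole of persons.txt; the names `features`, `grouped`, `means` and `variances` are introduced here for illustration.

```python
import pandas as pd

# Assumption: these eight rows, copied from the printed output, are the full dataset.
persons = pd.DataFrame({
    'gender':   [0, 0, 0, 0, 1, 1, 1, 1],   # 0 = male, 1 = female
    'height':   [183.0, 180.0, 170.0, 180.0, 152.5, 167.5, 165.0, 175.0],
    'weight':   [81.5, 86.0, 77.0, 75.0, 45.0, 68.0, 59.0, 68.0],
    'footsize': [46, 45, 46, 44, 38, 40, 39, 42],
})

features = ['height', 'weight', 'footsize']
grouped = persons.groupby('gender')[features]

means = grouped.mean()      # one row per class, one column per feature
variances = grouped.var()   # sample variance (ddof=1), same default as Series.var()

print(means)
print(variances)
```

Row `0` of each result holds the male statistics and row `1` the female ones, so `means.loc[0, 'height']` corresponds to `mean_male['height']`.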



Now that we have the $\mu$ and $\sigma^{2}$ values of each feature $x_{i}$ for each class $k$, 
let us write a function that computes the likelihood $p(x_{i}|C_{k})$. 
We plug each feature value into the Gaussian probability density function:
$$ p(x = x_{i} | C_{k}) = \dfrac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \cdot \exp\left(-\dfrac{(x_{i} - \mu_{k})^2}{2 \sigma_{k}^{2}}\right) $$

Hence, we implement the likelihood function as follows:

In [16]:
def likelihood(feature, mean, variance):
    """Gaussian probability density of `feature` for the given mean and variance."""
    return (1 / np.sqrt(2 * np.pi * variance)) * np.exp((-(feature - mean) ** 2) / (2 * variance))
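
A quick sanity check of the density function can catch sign or bracket mistakes. The sketch below repeats `likelihood` so that it runs on its own, uses the male height statistics printed earlier, and checks two known properties of a Gaussian density: its peak value at the mean, and a total area of about 1. The grid size and the range of 10 standard deviations are arbitrary choices made for this illustration.

```python
import numpy as np

def likelihood(feature, mean, variance):
    # Same Gaussian density as defined above, repeated so the sketch is self-contained.
    return (1 / np.sqrt(2 * np.pi * variance)) * np.exp((-(feature - mean) ** 2) / (2 * variance))

mean, variance = 178.25, 32.25   # male height statistics printed earlier

# 1) At the mean the exponential term equals 1, so only the normalising factor remains.
peak = likelihood(mean, mean, variance)
expected_peak = 1 / np.sqrt(2 * np.pi * variance)

# 2) A Riemann sum over mean +/- 10 standard deviations should be close to 1.
std = np.sqrt(variance)
xs = np.linspace(mean - 10 * std, mean + 10 * std, 20001)
dx = xs[1] - xs[0]
area = np.sum(likelihood(xs, mean, variance)) * dx

print(peak, expected_peak, area)
```

Because `likelihood` is built from NumPy operations only, it accepts arrays as well as scalars, which is what allows the grid `xs` to be evaluated in one call.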

Now we want to calculate the probability that a person with height = 175, weight = 59 and footsize = 40 is male or female.
So we have to calculate two things:
- P(male | height = 175, weight = 59, footsize = 40)
    = P(height = 175 | male) * P(weight = 59 | male) * P(footsize = 40 | male) * P(male) / P(height = 175, weight = 59, footsize = 40)
- P(female | height = 175, weight = 59, footsize = 40)
    = P(height = 175 | female) * P(weight = 59 | female) * P(footsize = 40 | female) * P(female) / P(height = 175, weight = 59, footsize = 40)

Multiplying the per-feature likelihoods relies on the naive Bayes assumption that the features are independent given the class.
Because P(male) = P(female) = 0.5 (the dataset contains four males and four females) and the denominator P(height = 175, weight = 59, footsize = 40) is the same in both formulas, these values do not affect the comparison.
So we only need to calculate:
- P(height = 175 | male) * P(weight = 59 | male) * P(footsize = 40 | male)
- P(height = 175 | female) * P(weight = 59 | female) * P(footsize = 40 | female)
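
Written compactly, and assuming the features are conditionally independent given the class, the comparison above amounts to choosing the class with the largest score. The symbol $\hat{C}$ for the predicted class is notation introduced here for illustration:

$$ \hat{C} = \arg\max_{k} \; P(C_{k}) \prod_{i} p(x_{i} | C_{k}) $$

With equal priors $P(C_{k})$, the prior factor is the same for every class and can be dropped, leaving only the product of likelihoods.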

In [17]:
# Likelihood of height = 175 under each class; the arrays are ordered [male, female]
p_height = likelihood(feature=175,
                      mean=np.array([mean_male['height'], mean_female['height']]),
                      variance=np.array([var_male['height'], var_female['height']]))
print(p_height)

[0.0596383  0.02408451]


In [18]:
p_weight = likelihood(feature=59,
                 mean=np.array([mean_male['weight'], mean_female['weight']]),
                 variance=np.array([var_male['weight'], var_female['weight']]))
print(p_weight)

[9.50078654e-06 3.65703260e-02]


In [19]:
p_footsize = likelihood(feature=40,
                 mean=np.array([mean_male['footsize'], mean_female['footsize']]),
                 variance=np.array([var_male['footsize'], var_female['footsize']]))
print(p_footsize)

[1.23191750e-07 2.31107219e-01]


In [20]:
# Multiply the per-feature likelihoods; index 0 = male, index 1 = female
result = p_height * p_weight * p_footsize
print(result)

[6.98017742e-14 2.03554218e-04]
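
The two numbers in `result` are unnormalised scores rather than probabilities. As a sketch, dividing each score by their sum gives the posterior probabilities under the equal-prior assumption stated earlier. The likelihood values below are copied from the printed outputs; `priors`, `scores`, `posterior`, `labels` and `log_scores` are names introduced here for illustration.

```python
import numpy as np

# Per-class likelihoods copied from the printed outputs, ordered [male, female].
p_height = np.array([0.0596383, 0.02408451])
p_weight = np.array([9.50078654e-06, 3.65703260e-02])
p_footsize = np.array([1.23191750e-07, 2.31107219e-01])

priors = np.array([0.5, 0.5])   # equal priors, as stated in the text

# Unnormalised posterior score for each class.
scores = priors * p_height * p_weight * p_footsize

# Normalise so that the two class probabilities sum to 1.
posterior = scores / scores.sum()

labels = np.array(['male', 'female'])
prediction = labels[np.argmax(posterior)]

print(posterior)
print(prediction)

# Summing logs instead of multiplying avoids underflow when there are many features.
log_scores = np.log(priors) + np.log(p_height) + np.log(p_weight) + np.log(p_footsize)
print(log_scores)
```

Since the female score is many orders of magnitude larger than the male score, this person is classified as female. The log-space variant ranks the classes in the same order and is a common choice when many small likelihoods are multiplied.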
