Import functions from the `math` module to compute the density function of the normal distribution.

In [1]:
from math import sqrt, pi, exp

Define the data needed for the classification task.

In [2]:
height = [180, 174, 184, 168, 178, 170, 164, 155, 162, 166]
weight = [75, 71, 83, 63, 70, 59, 53, 46, 52, 55]
hair_length = ["Short", "Short", "Short", "Short", "Long", "Long", "Short", "Long", "Long",  "Long"]
gender = ["Male", "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female"]

incomplete_data = [172, 60, "Long"]

To calculate probabilities for height and weight, we will use the normal distribution. Before we can do this, we need to estimate the parameters $\mu$ and $\sigma^2$ of height and weight for males and females.

Since the list `gender` is already sorted (first five values correspond to males, the other values correspond to females), we can calculate these parameters like this:

In [3]:
def mean(data):
    return sum(data) / len(data)

mean_height_m, mean_height_f, mean_weight_m, mean_weight_f = list(map(mean, [height[0:5],
                                                                             height[5:10],
                                                                             weight[0:5],
                                                                             weight[5:10]]))

def variance(data):
    # sample variance: divide by n - 1 rather than n
    data_mean = sum(data) / len(data)
    return sum((x - data_mean)**2 for x in data) / (len(data) - 1)

var_height_m, var_height_f, var_weight_m, var_weight_f = list(map(variance, [height[0:5],
                                                                             height[5:10],
                                                                             weight[0:5],
                                                                             weight[5:10]]))
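
As a hedged cross-check of the parameters, the sketch below groups the values by their `gender` label instead of relying on the list being sorted, and uses the standard library's `statistics` module. The dictionary name `params` is introduced here for illustration only and is not part of the original notebook.

```python
import statistics

height = [180, 174, 184, 168, 178, 170, 164, 155, 162, 166]
weight = [75, 71, 83, 63, 70, 59, 53, 46, 52, 55]
gender = ["Male", "Male", "Male", "Male", "Male",
          "Female", "Female", "Female", "Female", "Female"]

# params[g][feature] = (mean, sample variance); `params` is a hypothetical name
params = {}
for g in ("Male", "Female"):
    # select values by label, so the order of the lists does not matter
    h = [x for x, label in zip(height, gender) if label == g]
    w = [x for x, label in zip(weight, gender) if label == g]
    params[g] = {
        "height": (statistics.mean(h), statistics.variance(h)),
        "weight": (statistics.mean(w), statistics.variance(w)),
    }
```

`statistics.variance` computes the sample variance (dividing by n - 1), so it should agree with the hand-written `variance` function. For the male heights, this gives a mean of 176.8 and a variance of about 37.2.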

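Next, we can create a function that evaluates the density of the normal distribution. Strictly speaking, the returned value is a density rather than a probability, but it can take the place of the conditional probability in Bayes' theorem: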

In [4]:
def normal_prob(x, mean, variance):
    return (1 / sqrt(2 * pi * variance)) * exp(-(x - mean)**2 / (2 * variance))
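
A small sketch to sanity-check this function, assuming Python 3.8 or later for `statistics.NormalDist`. The parameters 176.8 and 37.2 are the mean and sample variance of the male heights, hard-coded here so the block runs on its own. The names `density`, `reference`, and `narrow_density` are illustrative.

```python
from math import sqrt, pi, exp
from statistics import NormalDist

def normal_prob(x, mean, variance):
    return (1 / sqrt(2 * pi * variance)) * exp(-(x - mean)**2 / (2 * variance))

# mean and sample variance of the male heights (hard-coded for this sketch)
mean_height_m, var_height_m = 176.8, 37.2

density = normal_prob(172, mean_height_m, var_height_m)

# NormalDist expects the standard deviation, not the variance
reference = NormalDist(mu=mean_height_m, sigma=sqrt(var_height_m)).pdf(172)

# with a small variance the density exceeds 1, so it cannot be a probability
narrow_density = normal_prob(0, 0, 0.01)
```

Both `density` and `reference` should agree up to floating-point error. The last line illustrates why the returned value is a density: it is only proportional to the probability of observing a value in a small interval around `x`.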

Now we are ready to calculate all the probabilities we need to answer the given questions.

In [5]:
# probability that someone is 172 cm tall given that the person is male
p_h172_given_m = normal_prob(172, mean_height_m, var_height_m)

# probability that someone weighs 60 kg given that the person is male
p_w60_given_m = normal_prob(60, mean_weight_m, var_weight_m)

# probability that someone has long hair given that the person is male
# (share of males with long hair, so the denominator is the number of males)
p_hlong_given_m = hair_length[0:5].count("Long") / len(hair_length[0:5])

# (assumed) probability that someone is male
p_m = gender.count("Male") / len(gender)

# probability that someone is 172 cm tall given that the person is female
p_h172_given_f = normal_prob(172, mean_height_f, var_height_f)

# probability that someone weighs 60 kg given that the person is female
p_w60_given_f = normal_prob(60, mean_weight_f, var_weight_f)

# probability that someone has long hair given that the person is female
# (share of females with long hair, so the denominator is the number of females)
p_hlong_given_f = hair_length[5:10].count("Long") / len(hair_length[5:10])

# (assumed) probability that someone is female
p_f = gender.count("Female") / len(gender)

First, we consider hair length as a variable when predicting the probability. We will then look at a second example that doesn't consider hair length and compare the results.

To calculate the probability that someone is male given that the person is 172 cm tall, weighs 60 kg and has long hair, we can use the following formula, which generalizes Bayes' theorem to several features by treating them as independent within each gender.

\begin{align}
P(male|height = 172, weight = 60, hair = long) = \frac{R} {R + \neg{R}} 
\end{align}

where:

\begin{align}
R = P(height = 172|male) \cdot P(weight = 60|male) \cdot P(hair = long|male) \cdot P(male)
\end{align}

and:

\begin{align}
\neg{R} =  P(height = 172|female) \cdot P(weight = 60|female) \cdot P(hair = long|female) \cdot P(female)
\end{align}
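
As a brief sketch of where this formula comes from: writing $x$ as shorthand (introduced only for this sketch) for the observation $height = 172, weight = 60, hair = long$, Bayes' theorem combined with the law of total probability gives the first line. The second line holds only under the assumption that the three features are conditionally independent given the gender, which is the naive Bayes assumption.

\begin{align}
P(male|x) &= \frac{P(x|male) \cdot P(male)} {P(x|male) \cdot P(male) + P(x|female) \cdot P(female)} \\
P(x|male) &= P(height = 172|male) \cdot P(weight = 60|male) \cdot P(hair = long|male)
\end{align}

Substituting the second line (and its counterpart for $female$) into the first yields $R$ in the numerator and $R + \neg{R}$ in the denominator.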

Let's plug in the values! First, we consider hair length as an additional "predictor". Note that we could also compute the probability that the person is female by swapping the roles of $R$ and $\neg{R}$ in the formula.

In [6]:
r = p_h172_given_m * p_w60_given_m * p_hlong_given_m * p_m
not_r = p_h172_given_f * p_w60_given_f * p_hlong_given_f * p_f

p_m_given_h172_w60_hlong = r / (r + not_r)

print(
'''Probability that someone is male given that the person is 172 cm tall,
weighs 60 kg, and has long hair: {}'''.format(round(p_m_given_h172_w60_hlong, 4))
)

Probability that someone is male given that the person is 172 cm tall,
weighs 60 kg, and has long hair: 0.2033


Since the probability is below 0.5, we would consider it more likely that the person is female. What happens if we consider only height and weight?

In [7]:
r = p_h172_given_m * p_w60_given_m * p_m
not_r = p_h172_given_f * p_w60_given_f * p_f

p_m_given_h172_w60 = r / (r + not_r)

print(
'''Probability that someone is male given that the person is 172 cm tall
and weighs 60 kg: {}'''.format(round(p_m_given_h172_w60, 4))
)

Probability that someone is male given that the person is 172 cm tall
and weighs 60 kg: 0.5051


By only considering height and weight (not considering hair length), the result is quite different (50.51% as opposed to 20.33%). The result is also more ambiguous: the probability is barely above 0.5, so based on the cutoff at 0.5 alone we would guess that the given person is male. However, the probability that the given person is female is almost as high (1 - 0.5051 = 0.4949).
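
The two calculations above can be folded into one function. The following is a minimal sketch, not the notebook's own implementation. The names `posterior`, `normal_density`, `sample_mean`, and `sample_variance` are hypothetical and introduced here for illustration. The sketch returns the posterior for both genders at once and skips hair length when no hair value is passed.

```python
from math import sqrt, pi, exp

height = [180, 174, 184, 168, 178, 170, 164, 155, 162, 166]
weight = [75, 71, 83, 63, 70, 59, 53, 46, 52, 55]
hair_length = ["Short", "Short", "Short", "Short", "Long",
               "Long", "Short", "Long", "Long", "Long"]
gender = ["Male", "Male", "Male", "Male", "Male",
          "Female", "Female", "Female", "Female", "Female"]

def normal_density(x, mean, variance):
    """Density of the normal distribution at x."""
    return (1 / sqrt(2 * pi * variance)) * exp(-(x - mean)**2 / (2 * variance))

def sample_mean(data):
    return sum(data) / len(data)

def sample_variance(data):
    m = sample_mean(data)
    return sum((x - m)**2 for x in data) / (len(data) - 1)

def posterior(height_obs, weight_obs, hair_obs=None):
    """Return {gender: posterior}; hair length is ignored if hair_obs is None."""
    scores = {}
    for g in ("Male", "Female"):
        idx = [i for i, label in enumerate(gender) if label == g]
        h = [height[i] for i in idx]
        w = [weight[i] for i in idx]
        score = len(idx) / len(gender)  # prior P(gender)
        score *= normal_density(height_obs, sample_mean(h), sample_variance(h))
        score *= normal_density(weight_obs, sample_mean(w), sample_variance(w))
        if hair_obs is not None:
            hair = [hair_length[i] for i in idx]
            score *= hair.count(hair_obs) / len(hair)  # P(hair | gender)
        scores[g] = score
    total = sum(scores.values())  # corresponds to R + not R
    return {g: s / total for g, s in scores.items()}

with_hair = posterior(172, 60, "Long")
without_hair = posterior(172, 60)
```

Because the scores are normalized by their sum, the two posteriors in each result add up to 1, and `with_hair["Male"]` and `without_hair["Male"]` should reproduce the values printed above.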