Here, we'll show how to learn parameter values via maximum likelihood estimation for the cancer problem, where $X$ is a random variable indicating whether a patient has a rare form of cancer, and $Y$ indicates whether a test with known sensitivity and false positive rate comes back positive.  To begin with, in the interest of transparency, here is the code that generates the artificial dataset from which we will learn the parameters.

In [19]:
import numpy as np

m = 1000000 # Number of data points

X = np.zeros(m,dtype=int)
Y = np.zeros(m,dtype=int)

P_cancer = 1e-4               # Incidence rate of cancer in the population at large
P_sensitivity = 0.9           # Probability of detection when cancer is true
P_false_positive = 0.01       # False positive probability

# Loop over all patients
for i in range(m):
    # Roll the dice for whether the patient has cancer
    if np.random.rand()<P_cancer: 
        # Unlucky branch
        X[i] = 1
        # Roll the dice for whether the cancer is detected by the test
        if np.random.rand()<P_sensitivity:
            Y[i] = 1
    else:
        # Lucky branch
        # Test for whether the patient gets a false positive
        if np.random.rand()<P_false_positive:
            Y[i] = 1
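
The loop above is easy to read but slow for a million patients. As a rough sketch, the same generative process can be written with vectorized NumPy operations. The function name `generate_dataset`, the `seed` argument, and the output names `X_vec` and `Y_vec` are illustrative assumptions, not part of the original notebook, and because the random numbers are drawn in a different order this will not reproduce the exact dataset used below.

```python
import numpy as np

def generate_dataset(m, p_cancer=1e-4, p_sensitivity=0.9,
                     p_false_positive=0.01, seed=0):
    """Hypothetical vectorized equivalent of the patient loop."""
    rng = np.random.default_rng(seed)
    # X[i] = 1 if patient i has cancer
    X = (rng.random(m) < p_cancer).astype(int)
    # Probability of a positive test depends on the patient's true state
    p_positive = np.where(X == 1, p_sensitivity, p_false_positive)
    # Y[i] = 1 if the test for patient i comes back positive
    Y = (rng.random(m) < p_positive).astype(int)
    return X, Y

X_vec, Y_vec = generate_dataset(1_000_000)
```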

Now that we have a dataset (X and Y), we can use it to estimate the parameters.  First, the maximum likelihood estimator for the prior $P(X=1)=\theta_{prior}$ is simply the fraction of patients who have cancer, $\theta_{prior} = \frac{1}{m}\sum_{i=1}^m X_i$.
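
For reference, here is a brief sketch of where this estimator comes from, assuming the $X_i$ are independent Bernoulli draws with a common parameter $\theta_{prior}$ (which is how the dataset above was generated). Writing the log-likelihood and setting its derivative to zero gives:

$$
\begin{aligned}
\log L(\theta_{prior}) &= \sum_{i=1}^m \left[ X_i \log \theta_{prior} + (1 - X_i) \log (1 - \theta_{prior}) \right] \\
\frac{d \log L}{d \theta_{prior}} &= \frac{\sum_{i=1}^m X_i}{\theta_{prior}} - \frac{m - \sum_{i=1}^m X_i}{1 - \theta_{prior}} = 0 \\
\Rightarrow \quad \theta_{prior} &= \frac{1}{m}\sum_{i=1}^m X_i
\end{aligned}
$$

The same argument applied only to the patients with $X_i=1$ (or only to those with $X_i=0$) yields the conditional estimators used further down.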

In [20]:
theta_prior = np.sum(X)/m
print(theta_prior)

0.00011


That's pretty close to what we know to be the true value of $10^{-4}$.  Now let's estimate the true positive rate (the test's sensitivity), $P(Y=1|X=1) = \theta_{positive} = \frac{\sum_{i=1}^m Y_i [X_i=1]}{\sum_{i=1}^m [X_i=1]}$, where $[X_i=1]$ equals 1 when $X_i=1$ and 0 otherwise.

In [23]:
theta_positive = np.sum(Y[X==1])/np.sum(X==1)
print(theta_positive)

0.9363636363636364


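That one's not quite as good (the true value is 0.9), primarily because we only have a limited number of cancer cases (around 100) to learn this parameter from.  Finally, we can do the same for the false positive rate, $P(Y=1|X=0) = \theta_{negative} = \frac{\sum_{i=1}^m Y_i [X_i=0]}{\sum_{i=1}^m [X_i=0]}$.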

In [24]:
theta_negative = np.sum(Y[X==0])/np.sum(X==0)
print(theta_negative)

0.01009010991209033


This is very close to the true value of 0.01 because we have nearly a million cancer-free examples to learn from.
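
The difference in accuracy between the three estimates can be made more concrete with a rough standard error for each one. The sketch below uses the plug-in formula $\sqrt{\hat{\theta}(1-\hat{\theta})/n}$ for a Bernoulli proportion, which is an added assumption rather than something computed in the original notebook. The helper name `bernoulli_standard_error` is hypothetical, and the estimates are copied from the printed outputs above.

```python
import numpy as np

def bernoulli_standard_error(p_hat, n):
    """Plug-in standard error sqrt(p_hat * (1 - p_hat) / n) of a Bernoulli MLE."""
    return np.sqrt(p_hat * (1.0 - p_hat) / n)

# Values copied from the printed outputs above
m = 1_000_000
theta_prior = 0.00011
theta_positive = 0.9363636363636364
theta_negative = 0.01009010991209033

# Number of examples available for each conditional estimate
n_cancer = round(theta_prior * m)   # patients with X == 1
n_healthy = m - n_cancer            # patients with X == 0

se_prior = bernoulli_standard_error(theta_prior, m)
se_positive = bernoulli_standard_error(theta_positive, n_cancer)
se_negative = bernoulli_standard_error(theta_negative, n_healthy)

print(se_prior, se_positive, se_negative)
```

Under this approximation, the standard error of `theta_positive` is on the order of $10^{-2}$, while that of `theta_negative` is on the order of $10^{-4}$, which is consistent with the sensitivity estimate landing noticeably further from its true value.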