In [1]:
'''
In this assignment, we calculate the probability that a model ensemble using simple majority voting 
makes an incorrect prediction in a few scenarios that differ in the number of models and their error rates. 
The ensemble is wrong whenever a majority of its models predict incorrectly. Because each model is either 
right or wrong, the number of incorrect models follows a binomial distribution, which is determined by the 
number of independent models and their shared error rate. The probability of a majority error is the sum of 
the binomial probabilities for every count of incorrect models from the majority threshold up to all models.

To use the binomial distribution to calculate a probability, the following conditions must hold:

1. There must be a fixed number of trials. In our case, the models are the "trials" and each ensemble has a
fixed number of them, so this is satisfied.

2. Each trial must be independent, i.e., its outcome must not affect the outcome of any other trial. The scenarios
state that the models are independent, meaning their errors are uncorrelated, so this is satisfied. 

3. Each trial must have one of two possible outcomes, with the same fixed probability of success in every trial. In our
case, each model outputs a correct or incorrect prediction with the same error rate within a scenario, so this is satisfied.

To calculate a binomial probability, we will define a function that implements the binomial probability equation.
'''

'''
Defining the function to calculate the binomial probability using the equation prob = nCr * (p**x) * (q**(n-x))

n is the number of trials 
x is the number of successes among the n trials
nCr (with r = x) is the number of ways to choose which x of the n trials are successes
p is the probability of success for each trial
q = 1 - p is the probability of failure for each trial

The result is the probability of exactly x successes. In our scenarios, a "success" is a model predicting
incorrectly, so p is the error rate.
'''

from math import comb

def binomial_probability(n, p, x):
    #Number of ways to choose which x of the n trials are successes (nCr with r = x).
    #comb returns an exact integer, avoiding the float division of large factorials.
    nCr = comb(n, x)
    
    #Applying the equation to calculate the probability of exactly x successes
    q = 1 - p
    a = (p**x)
    b = (q**(n-x))
    prob = nCr * a * b
    
    return prob
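
A majority-vote error needs at least a majority of the models to be wrong, so the quantity of interest is the upper tail of the binomial distribution rather than a single term. The following is a minimal sketch of that sum, assuming Python 3.8+ for `math.comb`; the helper name `majority_error_probability` is hypothetical and not part of the original assignment code.

```python
from math import comb

def majority_error_probability(n, p):
    """Probability that a strict majority of n independent models err.

    Hypothetical helper: sums the binomial probabilities of exactly k
    incorrect models for k from the majority threshold up to n.
    """
    majority = n // 2 + 1  # smallest strict majority, e.g. 6 of 11
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(majority, n + 1))

# Example: 11 models, each with an error rate of 0.2
print(majority_error_probability(11, 0.2))
```

With an odd number of models and an error rate of exactly 0.5, this sum is 0.5 by symmetry, which makes a convenient sanity check.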

In [2]:
#Ensemble Scenario A: 11 independent models, each with an error rate of 0.2

#A majority of 11 models is at least 6, so the ensemble is wrong when 6 or more models predict incorrectly.
#There are 11 models, so n=11, and each has an error rate of 0.2, so p=0.2 (in our case, a "success" is a model
#making an incorrect prediction). Setting x=6 gives the probability that exactly 6 models are incorrect, which
#is the largest term of the at-least-6 probability.

n1 = 11
p1 = 0.2
x1 = 6

print('Probability: {} percent'.format(round(binomial_probability(n1, p1, x1), 4)*100))

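#The probability that exactly 6 of the 11 models predict incorrectly is only 0.97%, which makes sense 
#as the error rate for each model is only 0.2 or 20%. The probability of 6 or more incorrect models adds
#the x=7 through x=11 terms and is only slightly higher, at roughly 1.2%.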

Probability: 0.97 percent


In [3]:
#Ensemble Scenario B: 11 independent models, each with an error rate of 0.49

#The error rate is now 0.49, so p is changed to 0.49; n and x stay the same

n2 = 11
p2 = 0.49
x2 = 6

print('Probability: {} percent'.format(round(binomial_probability(n2, p2, x2), 4)*100))

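#Now that the error rate has been raised to 49%, the probability that exactly 6 of the 11 models 
#predict incorrectly rises to 22.06%. Summing the x=6 through x=11 terms gives roughly 47%, so with 
#an error rate this close to 0.5 the ensemble is wrong almost half the time.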

Probability: 22.06 percent


In [4]:
#Ensemble Scenario C: 21 independent models, each with an error rate of 0.49

#Because the number of models increased to 21, the number of incorrect models needed 
#for a majority has also increased to 11. This means that n=21 and x=11 now.

n3 = 21
p3 = 0.49
x3 = 11

print('Probability: {} percent'.format(round(binomial_probability(n3, p3, x3), 4)*100))

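#The probability of exactly 11 incorrect models (16.42%) is lower than the exactly-6 probability 
#in Scenario B (22.06%), partly because with 21 models the probability is spread over more possible 
#counts. The probability of 11 or more incorrect models is roughly 46%, only slightly below the 
#roughly 47% of Scenario B: adding models helps when the error rate is below 0.5, but the gain is 
#small when the error rate is as close to 0.5 as 0.49.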

Probability: 16.42 percent
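
To compare the three scenarios side by side, the sketch below computes both the exactly-majority term that the cells above print and the at-least-majority sum that corresponds to a majority-vote error. It is a self-contained illustration, assuming Python 3.8+ for `math.comb`; the names `pmf`, `tail`, `scenarios`, and `results` are hypothetical and not part of the original notebook.

```python
from math import comb

def pmf(n, p, k):
    # Probability of exactly k incorrect models out of n
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail(n, p, k_min):
    # Probability of k_min or more incorrect models out of n
    return sum(pmf(n, p, k) for k in range(k_min, n + 1))

# Scenario name -> (number of models, error rate), as given in the cells above
scenarios = {"A": (11, 0.2), "B": (11, 0.49), "C": (21, 0.49)}

results = {}
for name, (n, p) in scenarios.items():
    majority = n // 2 + 1  # 6 of 11, 11 of 21
    results[name] = (pmf(n, p, majority), tail(n, p, majority))
    exact, at_least = results[name]
    print("Scenario {}: exactly {} wrong = {:.2%}, {} or more wrong = {:.2%}".format(
        name, majority, exact, majority, at_least))
```

The at-least-majority column is the ensemble's error probability under the independence assumption; it is always at least as large as the single-term value because it includes that term plus all higher counts.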
