# Ensemble Paradoxes

In machine learning and deep learning, we do a lot of tasks involving classification. Is this mushroom safe to eat? Is this person a good or bad credit risk? Is this a cat or a dog?

Different machine learning algorithms have different strengths and weaknesses.

It's natural to think that combining multiple models will produce better (or at least more robust) results than any individual model. This is called an ensemble.

Imagine a binary classification problem. We have a bunch of photos, labeled 0 = Dog and 1 = Cat. We build 3 classification systems, each with an accuracy of around 60%. We will create these fake classifiers by taking the true results and flipping 40% of the bits in them.

Then we'll create an ensemble of the three by taking the majority pick (at least 2 of 3 agree).

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import mode
from sklearn.metrics import f1_score, classification_report

rng = np.random.default_rng(2718)

def create_ground_truth(num_bits):
    return [rng.random() > .5 for x in range(num_bits)]

In [15]:
NUM_BITS = 100000

ground_truth = create_ground_truth(NUM_BITS)

ground_truth[:10]

[False, True, False, False, True, True, True, True, True, False]

In [3]:


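def create_systems(ground_truth, num_systems, flip_ratio=.4):
    """
    Copy the ground truth once per system, then flip each bit of each copy
    independently with probability `flip_ratio`.
    """
    # NOTE: range(3) ignores `num_systems`, so every call returns exactly
    # three systems, and the outputs recorded in this notebook reflect that.
    # Use range(num_systems) to honor the argument.
    systems = [ground_truth.copy() for x in range(3)]

    for system in systems:
        for count, bit in enumerate(system):
            if rng.random() < flip_ratio:
                system[count] = not bit
    return systems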


Each of the systems agrees with the ground truth about 60% of the time, as would be expected from randomly flipping 40% of the bits.

In [16]:
systems = create_systems(ground_truth, 3)

print(f1_score(ground_truth, systems[0]))
print(f1_score(ground_truth, systems[1]))
print(f1_score(ground_truth, systems[2]))

0.5994074785811514
0.6017316017316018
0.5993384121892542


In [17]:
sum(pd.Series(systems[0]) == pd.Series(systems[1]))

51884
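
That count is in line with what independent flips predict. Two systems agree on a bit when both kept it or both flipped it, so with a 40% flip rate the expected agreement is 0.6 × 0.6 + 0.4 × 0.4 = 0.52. A minimal check of the arithmetic (the variable names are illustrative, not from the notebook):

```python
flip_ratio = 0.4  # same flip rate as the default in create_systems

# Two independently corrupted copies agree when both kept the bit
# or both flipped it.
p_agree = (1 - flip_ratio) ** 2 + flip_ratio ** 2

observed = 51884 / 100000  # agreement count from the cell above
print(round(p_agree, 2), round(observed, 2))
```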

In [43]:
def do_consensus(systems):
    """
    Do a consensus pick: cast each system's pick to an int, then take the most popular answer.

    This function is really, really slow: `mode` is called once per bit in a Python loop,
    and each boolean is cast to an int first because `mode` doesn't accept booleans.
    """
    consensus = []
    for x in range(len(systems[0])):
        system_picks = [int(s[x]) for s in systems]
        ### there are only two possible choices (0 or 1) so there will always be a majority
        ### winner with an odd number of systems.

        chosen_bit = mode(system_picks).mode
        consensus.append(bool(chosen_bit))
    return consensus
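
Since the docstring flags the speed problem, here is a sketch of a vectorized alternative. It assumes the systems are equal-length sequences of booleans and that there is an odd number of them; `fast_consensus` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def fast_consensus(systems):
    """Majority vote per bit, computed with one array operation."""
    votes = np.asarray(systems, dtype=int)  # shape: (num_systems, num_bits)
    ones = votes.sum(axis=0)                # how many systems picked 1 per bit
    # With an odd number of systems, a strict majority always exists.
    return (ones > votes.shape[0] / 2).tolist()

example = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
print(fast_consensus(example))
```

Summing the 0/1 picks and comparing against half the number of systems gives the same answer as taking the mode of two possible values, without a per-bit Python loop.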


In [18]:
consensus = do_consensus(systems)


In [19]:
print(classification_report(ground_truth, consensus, target_names=['0', '1']))

              precision    recall  f1-score   support

           0       0.65      0.65      0.65     49965
           1       0.65      0.64      0.65     50035

    accuracy                           0.65    100000
   macro avg       0.65      0.65      0.65    100000
weighted avg       0.65      0.65      0.65    100000



Hey! The ensemble works! The consensus F1 score (0.65) is higher than the individual ones (about 0.60 each).
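
The 0.65 figure matches what a majority vote over independent voters should give. If each of n voters is right with probability p, the vote is right whenever more than half of them are, which is a binomial tail sum. For p = 0.6 and n = 3 that is 0.6³ + 3 × 0.6² × 0.4 = 0.648. A small sketch, assuming independent errors and an odd n (`majority_accuracy` is an illustrative helper, not from the notebook):

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that more than half of n independent voters are right."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

print(round(majority_accuracy(0.6, 3), 3))
```

The same function can be evaluated at other odd values of n to see what a vote over that many independent systems would be expected to score.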

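Asking for more systems doesn't increase the accuracy, though: a request for 15 scores the same 0.65. Keep in mind that `create_systems` builds its list with `range(3)`, so this ensemble still holds three systems, and the result says nothing yet about a true 15-way vote.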

In [45]:
systems2 = create_systems(ground_truth, 15)
consensus2 = do_consensus(systems2)
print(classification_report(ground_truth, consensus2, target_names=['0', '1']))

              precision    recall  f1-score   support

           0       0.65      0.65      0.65     49965
           1       0.65      0.65      0.65     50035

    accuracy                           0.65    100000
   macro avg       0.65      0.65      0.65    100000
weighted avg       0.65      0.65      0.65    100000



Requesting a huge number of systems gives the same 0.65 again. The call below asks for 151, but `create_systems` still returns three (its list is built with `range(3)`).

In [44]:
systems3 = create_systems(ground_truth, 151)
consensus3 = do_consensus(systems3)
print(classification_report(ground_truth, consensus3, target_names=['0', '1']))

              precision    recall  f1-score   support

           0       0.64      0.65      0.65     49965
           1       0.65      0.64      0.65     50035

    accuracy                           0.65    100000
   macro avg       0.65      0.65      0.65    100000
weighted avg       0.65      0.65      0.65    100000



Say we take a different approach: imagine each system outputs a float that rounds to the right value 60% of the time. Instead of rounding each one and going with the majority opinion, we average the floats and then round the result.

To model this, I will take the ground truth, flip 40% of the bits, then add some noise to each one, but not enough to change what it will round to.

In [34]:
def fuzzy_ground(ground_truth):
    """
    this takes the boolean ground truth and adds some random noise to the values
    so they are a float between 0 and 1 rather than a boolean. These fuzzy values 
    should round to the original booleans.
    """
    out = []
    for bit in ground_truth:
        if bit:
            # pick a number between .5 and 1
            fuzzy = (rng.random() + 1) / 2
        else:
            # pick a number between 0 and .5
            fuzzy = rng.random() / 2
        out.append(fuzzy)
    return out

def create_fuzzy_systems(ground_truth, num_systems=5, flip_ratio=.4):
    """
    generate systems, as before, then add some noise to each bit.
    """
    regular_systems = create_systems(ground_truth, num_systems, flip_ratio)

    fuzzy_systems = []
    for system in regular_systems:
        fuzzed = fuzzy_ground(system)
        fuzzy_systems.append(pd.Series(fuzzed))

    return fuzzy_systems

def fuzzy_ensemble(systems):
    rounded = np.round(np.mean(systems, axis=0))
    return rounded

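To hammer the point home, I'm going to ask for an ensemble of 999 of these fuzzy systems. (Because `create_fuzzy_systems` calls `create_systems`, which builds its list with `range(3)`, the list that comes back holds three.) Looking at one of them, we can see that it still gives results that are correct 60% of the time. So, no funny business.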

In [40]:
fuzzy_sys = create_fuzzy_systems(ground_truth, 999, .4)

sample_fuzz = round(fuzzy_sys[0])

print(classification_report(ground_truth, sample_fuzz, target_names=['0', '1']))

              precision    recall  f1-score   support

           0       0.60      0.60      0.60     49965
           1       0.60      0.60      0.60     50035

    accuracy                           0.60    100000
   macro avg       0.60      0.60      0.60    100000
weighted avg       0.60      0.60      0.60    100000



The ensemble does help, but only a little. Even with 999 systems requested, the F1 score only goes from 0.60 to 0.62, short of the 0.65 from majority voting.

In [41]:
fuzz_ensemble_results = fuzzy_ensemble(fuzzy_sys)

print(classification_report(ground_truth, fuzz_ensemble_results, target_names=['0', '1']))

              precision    recall  f1-score   support

           0       0.62      0.63      0.63     49965
           1       0.63      0.62      0.62     50035

    accuracy                           0.62    100000
   macro avg       0.62      0.62      0.62    100000
weighted avg       0.62      0.62      0.62    100000
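
As a cross-check on the two ensemble scores recorded in this notebook, the whole experiment can be condensed into a few vectorized lines. This is a sketch under the same assumptions as the cells above (independent flips, uniform noise that never changes what a value rounds to). The function `simulate`, its parameters, and the seed are illustrative choices, and accuracy stands in for F1 since the classes are balanced.

```python
import numpy as np

def simulate(num_systems, num_bits=20_000, flip_ratio=0.4, seed=0):
    """Return (majority-vote accuracy, averaged-float accuracy)."""
    rng = np.random.default_rng(seed)
    truth = rng.random(num_bits) > 0.5

    # Each row is one system: the truth with bits flipped independently.
    flips = rng.random((num_systems, num_bits)) < flip_ratio
    hard = truth ^ flips

    # Fuzzy version: uniform in [0.5, 1) for a 1 pick, [0, 0.5) for a 0 pick.
    fuzzy = (rng.random((num_systems, num_bits)) + hard) / 2

    vote = hard.sum(axis=0) > num_systems / 2  # majority of the rounded picks
    soft = fuzzy.mean(axis=0) > 0.5            # round the averaged floats
    return (vote == truth).mean(), (soft == truth).mean()

vote_acc, soft_acc = simulate(num_systems=3)
print(f"vote: {vote_acc:.2f}  averaged: {soft_acc:.2f}")
```

With `num_systems=3` both numbers should land near the recorded 0.65 and 0.62. Passing a larger odd value shows what an ensemble of that many independent systems would do under this noise model.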

