# 2 Entropy-based discretization

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
%matplotlib inline

Entropy-based discretization is a supervised binning approach that aims at finding boundaries for discretization that keep the class labels of the resulting bins as pure as possible.
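As a reference for the computations that follow, the purity criterion can be written out explicitly. The symbols $S$, $S_1$, $S_2$, $p_c$, $H$ and $I$ are notation assumed here for illustration, not taken from the exercise text: $S$ is the full set of measurements, $T$ a candidate boundary, $S_1 = \{a_i \le T\}$ and $S_2 = \{a_i > T\}$ the two resulting bins, and $p_c$ the fraction of samples in a bin that carry class label $c$.

$$
H(S) = -\sum_{c} p_c \log_2 p_c,
\qquad
I(S, T) = \frac{|S_1|}{|S|}\, H(S_1) + \frac{|S_2|}{|S|}\, H(S_2)
$$

A bin that contains a single class has $H = 0$, so a smaller $I(S, T)$ means purer bins.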

Consider the following set of sensor measurements $a_i$ with class labels $c_i \in \{\text{OK } (0), \text{FAIL } (1)\}$:

In [3]:
measurements = pd.DataFrame({'a': [0.1, 0.2, 0.8, 0.9, 1.0, 4.0, 10.0, 50.0],
                             'c': [1, 1, 0, 0, 1, 0, 0, 0]})
measurements

|   | a | c |
|---|------|---|
| 0 | 0.1 | 1 |
| 1 | 0.2 | 1 |
| 2 | 0.8 | 0 |
| 3 | 0.9 | 0 |
| 4 | 1.0 | 1 |
| 5 | 4.0 | 0 |
| 6 | 10.0 | 0 |
| 7 | 50.0 | 0 |


## a)

Compute the entropy for each of the candidate boundaries $T = 0.5$, $T = 0.95$, and $T = 2.5$. Which boundary
gives the best discretization? Use that boundary to discretize the data.

In [4]:
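def entropy(data):
    """Shannon entropy (in bits) of the class labels in data['c']."""
    n = len(data)
    if n == 0:
        return 0.0  # an empty bin contributes nothing
    counts = {}
    for label in data['c']:
        counts[label] = counts.get(label, 0) + 1
    return -1 * sum(v / n * math.log(v / n, 2) for v in counts.values())


def net_entropy(data, t_value):
    """Size-weighted entropy of the two bins a <= t_value and a > t_value."""
    # A boolean mask does not require the rows to be sorted by 'a' and cannot
    # run past the last row when t_value >= max(a), unlike a positional scan.
    below = data['a'] <= t_value
    intervals = (data[below], data[~below])
    return sum(len(i) / len(data) * entropy(i) for i in intervals)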

In [5]:
t_values = [0.5, 0.95, 2.5]

for t_value in t_values:
    print('t: {0}, I: {1}'.format(t_value, net_entropy(measurements, t_value)))

t: 0.5, I: 0.4875168162362656
t: 0.95, I: 0.9056390622295665
t: 2.5, I: 0.6068441215341679
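
The boundary with the lowest weighted entropy gives the purest bins, which in the output above is $T = 0.5$. The following is a minimal, self-contained sketch of how the selection and the discretization itself could be done. The helper names `label_entropy` and `split_entropy` and the column name `bin` are hypothetical choices made for this illustration, not part of the original notebook.

```python
import math
import pandas as pd

measurements = pd.DataFrame({'a': [0.1, 0.2, 0.8, 0.9, 1.0, 4.0, 10.0, 50.0],
                             'c': [1, 1, 0, 0, 1, 0, 0, 0]})


def label_entropy(labels):
    """Shannon entropy (in bits) of a Series of class labels (hypothetical helper)."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = labels.value_counts() / n
    return -sum(p * math.log2(p) for p in probs)


def split_entropy(data, t):
    """Size-weighted entropy of the bins a <= t and a > t (hypothetical helper)."""
    below = data['a'] <= t
    parts = (data[below], data[~below])
    return sum(len(p) / len(data) * label_entropy(p['c']) for p in parts)


candidates = [0.5, 0.95, 2.5]

# Lower weighted entropy means purer bins, so take the minimizer.
best_t = min(candidates, key=lambda t: split_entropy(measurements, t))

# Bin 0 holds a <= best_t, bin 1 holds a > best_t ('bin' is an assumed column name).
discretized = measurements.assign(bin=(measurements['a'] > best_t).astype(int))
print(best_t)
print(discretized)
```

With this split, the first bin contains only the two FAIL measurements 0.1 and 0.2, while the second bin contains the remaining six measurements, one of which (1.0) is a FAIL.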


---

## b)

Describe a method to decide which candidate boundaries to test.
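
One common approach, given here as a sketch and not as the exercise's official solution, is to sort the measurements by $a$ and to test only the midpoints between neighbouring values whose class labels differ. A cut placed between two neighbours of the same class would only split a run of identically labelled samples, so such positions are usually skipped. The function name `candidate_boundaries` is hypothetical, and the sketch assumes that all values of $a$ are distinct.

```python
import pandas as pd

measurements = pd.DataFrame({'a': [0.1, 0.2, 0.8, 0.9, 1.0, 4.0, 10.0, 50.0],
                             'c': [1, 1, 0, 0, 1, 0, 0, 0]})


def candidate_boundaries(data):
    """Midpoints between neighbouring 'a' values whose labels differ.

    Hypothetical helper; assumes the values in 'a' are distinct.
    """
    ordered = data.sort_values('a').reset_index(drop=True)
    a = ordered['a'].tolist()
    c = ordered['c'].tolist()
    return [(a[i] + a[i + 1]) / 2
            for i in range(len(a) - 1)
            if c[i] != c[i + 1]]


boundaries = candidate_boundaries(measurements)
print(boundaries)
```

For the measurements above, the class label changes three times along the sorted values (between 0.2 and 0.8, between 0.9 and 1.0, and between 1.0 and 4.0), so the sketch yields three candidates at approximately 0.5, 0.95 and 2.5, the same boundaries that were tested in a).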