### Gini Impurity

To compute the Gini impurity for a set of items with $J$ classes, let $p_{i}$ be the fraction of items in the set labeled with class $i$, for $i \in \{1,2,...,J\}$.

$$I_{G}(p) = \sum_{i=1}^J p_{i}(1-p_{i}) = \sum_{i=1}^J (p_{i}-p_{i}^{2}) = \sum_{i=1}^J p_{i} - \sum_{i=1}^J p_{i}^{2}= 1 - \sum_{i=1}^J p_{i}^{2} = \sum_{i\ne k}p_{i}p_{k}$$
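As a numerical sanity check, the short sketch below evaluates three of the equivalent forms in the identity above for one set of class fractions. The fractions are illustrative values chosen for this example, and the function name `gini_impurity` is hypothetical.

```python
def gini_impurity(fractions):
    """Gini impurity 1 - sum(p_i^2) for class fractions that sum to 1."""
    return 1.0 - sum(p * p for p in fractions)

# Illustrative class fractions (an assumption for this example); they sum to 1.
fractions = [0.6, 0.1, 0.2, 0.1]

# First form: sum of p_i * (1 - p_i)
form_a = sum(p * (1 - p) for p in fractions)
# Closed form: 1 - sum of p_i^2
form_b = gini_impurity(fractions)
# Last form: sum of p_i * p_k over all ordered pairs with i != k
form_c = sum(
    p_i * p_k
    for i, p_i in enumerate(fractions)
    for k, p_k in enumerate(fractions)
    if i != k
)

print(form_a, form_b, form_c)  # the three values agree up to floating-point rounding
```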


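Gini impurity measures how mixed the classes in a set are: it is the probability that a randomly chosen item would be mislabeled if it were labeled at random according to the class distribution of the set. The lower the Gini impurity, the better.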

A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of 0, whereas the worst-case split, which leaves a 50/50 class mix in each group, results in a Gini score of 0.5 (for a two-class problem).
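The two values quoted above can be checked directly from $I_{G} = 1 - \sum_{i} p_{i}^{2}$ for a two-class problem. The split score is taken here to be the size-weighted average of the group impurities, with $w_{1} + w_{2} = 1$ denoting the relative group sizes (notation introduced only for this example).

$$\text{pure group: } p = (1, 0) \;\Rightarrow\; I_{G} = 1 - (1^{2} + 0^{2}) = 0$$

$$\text{50/50 group: } p = (0.5, 0.5) \;\Rightarrow\; I_{G} = 1 - (0.5^{2} + 0.5^{2}) = 0.5$$

$$\text{split score: } w_{1} I_{G}^{(1)} + w_{2} I_{G}^{(2)} = \begin{cases} w_{1}\cdot 0 + w_{2}\cdot 0 = 0 & \text{both groups pure} \\ w_{1}\cdot 0.5 + w_{2}\cdot 0.5 = 0.5 & \text{both groups 50/50} \end{cases}$$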

In [1]:
import pandas as pd 
# TODO: add seaborn to improve the visualizations
from IPython.display import display

headers = ['index','refractive index', 'sodium', 'magnesium', 'aluminum', 'silicon', 'potassium', 'calcium', 'barium', 'iron', 'type of glass']
df = pd.read_csv('glass_data.csv', names = headers)

In [5]:
# draw 10 random rows and render them as a table
data_sample = df.sample(10)
display(data_sample)


| | index | refractive index | sodium | magnesium | aluminum | silicon | potassium | calcium | barium | iron | type of glass |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 124 | 125 | 1.52177 | 13.2 | 3.68 | 1.15 | 72.75 | 0.54 | 8.52 | 0.0 | 0.0 | 2 |
| 178 | 179 | 1.51829 | 14.46 | 2.24 | 1.62 | 72.38 | 0.0 | 9.26 | 0.0 | 0.0 | 6 |
| 72 | 73 | 1.51593 | 13.09 | 3.59 | 1.52 | 73.1 | 0.67 | 7.83 | 0.0 | 0.0 | 2 |
| 88 | 89 | 1.51618 | 13.01 | 3.5 | 1.48 | 72.89 | 0.6 | 8.12 | 0.0 | 0.0 | 2 |
| 94 | 95 | 1.51629 | 12.71 | 3.33 | 1.49 | 73.28 | 0.67 | 8.24 | 0.0 | 0.0 | 2 |
| 42 | 43 | 1.51779 | 13.21 | 3.39 | 1.33 | 72.76 | 0.59 | 8.59 | 0.0 | 0.0 | 1 |
| 30 | 31 | 1.51768 | 12.65 | 3.56 | 1.3 | 73.08 | 0.61 | 8.69 | 0.0 | 0.14 | 1 |
| 96 | 97 | 1.51841 | 13.02 | 3.62 | 1.06 | 72.34 | 0.64 | 9.13 | 0.0 | 0.15 | 2 |
| 164 | 165 | 1.51915 | 12.73 | 1.85 | 1.86 | 72.69 | 0.6 | 10.09 | 0.0 | 0.0 | 5 |
| 133 | 134 | 1.518 | 13.71 | 3.93 | 1.54 | 71.81 | 0.54 | 8.21 | 0.0 | 0.15 | 2 |


In [None]:
#show an example of the form of the data 



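# Calculate the size-weighted Gini index for a split dataset.
# groups: list of groups, each a list of rows with the class label in the last position.
# classes: list of the distinct class labels.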
def gini_index(groups, classes):
    # count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # class labels are stored in the last position of each row
        labels = [row[-1] for row in group]
        # accumulate the squared proportion of each class in the group
        for class_val in classes:
            p = labels.count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
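
To show how such a score might be used to evaluate a candidate split, here is a minimal sketch that partitions rows on a single feature threshold and scores the two resulting groups. The helper names `split_rows` and `weighted_gini` are hypothetical, `weighted_gini` is a compact stand-in that computes the same size-weighted score as `gini_index`, and the threshold of 3.0 on magnesium is an arbitrary choice for illustration. The rows reuse the magnesium and type-of-glass values from the ten sampled rows shown earlier.

```python
from collections import Counter

def split_rows(rows, feature_index, threshold):
    """Partition rows into (left, right) by comparing one feature to a threshold."""
    left = [row for row in rows if row[feature_index] < threshold]
    right = [row for row in rows if row[feature_index] >= threshold]
    return left, right

def weighted_gini(groups):
    """Size-weighted Gini impurity; the class label is the last item of each row."""
    n_total = sum(len(group) for group in groups)
    total = 0.0
    for group in groups:
        if not group:  # skip empty groups to avoid dividing by zero
            continue
        counts = Counter(row[-1] for row in group)
        impurity = 1.0 - sum((c / len(group)) ** 2 for c in counts.values())
        total += impurity * len(group) / n_total
    return total

# [magnesium, type of glass] pairs taken from the sampled rows
rows = [
    [3.68, 2], [2.24, 6], [3.59, 2], [3.50, 2], [3.33, 2],
    [3.39, 1], [3.56, 1], [3.62, 2], [1.85, 5], [3.93, 2],
]

# Arbitrary illustrative threshold: magnesium < 3.0 goes left
left, right = split_rows(rows, feature_index=0, threshold=3.0)
score = weighted_gini([left, right])
print(len(left), len(right), round(score, 3))
```

With this threshold the left group holds the two low-magnesium rows (types 5 and 6) and the right group holds the other eight (types 1 and 2), so neither group is pure and the score stays well above 0. A tree-building routine would typically evaluate many such thresholds and keep the one with the lowest score.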