# Purpose 

This notebook explains Gini Impurity.  

Gini Impurity measures the probability of misclassifying an observation when it is labeled at random according to the class distribution of a node.  For a binary classification, Gini Impurity ranges from 0 (every observation belongs to the same class) to 0.5 (the two classes are equally likely, which is equivalent to classifying the observation purely by chance).  

Gini Impurity is used to measure how well a node in a decision tree is split into two leaves.  When the Gini Impurity of a leaf is equal to 0, all of the observations in that leaf fall into a single group. We can assume that the observations have been correctly classified: if they were classified incorrectly 100% of the time, we could simply invert the logic to classify them properly.  When the Gini Impurity is equal to 0.5, the leaf holds an equal mix of both classes and no information is gained.

The calculation for Gini Impurity is 

$$\text{Gini Impurity} = \sum_{i} p(i)\,(1 - p(i))$$

where $p(i)$ is the probability that an observation is assigned to category $i$. 

For a binary classification, $p(1) + p(2) = 1$, so

$$
\begin{aligned}
\text{Gini Impurity} &= p(1)(1-p(1)) + p(2)(1-p(2)) \\
&= p(1) - p(1)^2 + p(2) - p(2)^2 \\
&= p(1) + p(2) - p(1)^2 - p(2)^2 \\
&= 1 - p(1)^2 - p(2)^2
\end{aligned}
$$
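
As a quick numerical check of the formula, here is a minimal sketch. The function names `gini_impurity` and `gini_impurity_short` are illustrative and are not used elsewhere in this notebook. It evaluates both forms of the expression and the two extreme cases for a binary classification.

```python
import numpy as np

def gini_impurity(probs):
    """Gini impurity as sum(p_i * (1 - p_i)); probs are class probabilities summing to 1."""
    p = np.asarray(probs, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def gini_impurity_short(probs):
    """Equivalent form: 1 - sum(p_i^2)."""
    p = np.asarray(probs, dtype=float)
    return float(1.0 - np.sum(p ** 2))

pure = gini_impurity([1.0, 0.0])    # all observations in one class
even = gini_impurity([0.5, 0.5])    # classes equally likely
mixed = gini_impurity([0.7, 0.3])   # somewhere in between

print(pure, even, mixed, gini_impurity_short([0.7, 0.3]))
```

The pure node gives 0, the even node gives 0.5, and both forms agree on the mixed node up to floating-point rounding.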

# References

https://www.youtube.com/watch?v=7VeUPuFGJHk

https://bambielli.com/til/2017-10-29-gini-impurity/

https://towardsdatascience.com/gini-impurity-measure-dbd3878ead33

# Initialization

In [2]:
%matplotlib inline

import os
from pathlib import Path
import numpy as np
import datetime

import pandas as pd
pd.set_option("display.max_rows",10)

# IPython

from IPython.display import display, Markdown
from IPython.display import Image

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# http://stackoverflow.com/questions/21971449/how-do-i-increase-the-cell-width-of-the-jupyter-ipython-notebook-in-my-browser
from IPython.display import HTML
display(HTML("<style>.container { width:80% !important; }</style>"))


# Autoload Python Code
%load_ext autoreload
%autoreload 2

# Definitions

# Gini Impurity for Chest Pain

**Create data to match StatQuest Example**

In [275]:
node = 'Chest Pains'

data_chest_pains = pd.concat( [pd.DataFrame(data={'Heart Disease': [1]*105 + [0]*39 , node:[1]*(105+39)}), 
                   pd.DataFrame(data={'Heart Disease': [1]*34 +  [0]*125 , node:[0]*(34+125)})])

data_chest_pains.groupby(node)['Heart Disease'].value_counts().unstack()

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 |
| --- | --- | --- |
| 0 | 125 | 34 |
| 1 | 39 | 105 |


**Calculate Cross Tabulation table.**

In [277]:
crosstab = (pd.crosstab(data_chest_pains[node], data_chest_pains['Heart Disease'], margins=True, margins_name='Totals')
             .drop('Totals', axis=0)
            )
crosstab

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 | Totals |
| --- | --- | --- | --- |
| 0 | 125 | 34 | 159 |
| 1 | 39 | 105 | 144 |


**Calculate the probability for each row.**

In [278]:
gi_cp = crosstab.copy()
gi_cp[[0, 1]] = gi_cp[[0, 1]].divide(gi_cp['Totals'], axis=0)

gi_cp

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 | Totals |
| --- | --- | --- | --- |
| 0 | 0.786164 | 0.213836 | 159 |
| 1 | 0.270833 | 0.729167 | 144 |
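
As an alternative to dividing by the `Totals` column, `pd.crosstab` can return the row proportions directly through its `normalize='index'` argument. The sketch below rebuilds the same Chest Pains data. The names `df` and `row_probs` are illustrative only.

```python
import pandas as pd

node = 'Chest Pains'

# Same synthetic data as the Chest Pains example
df = pd.concat([
    pd.DataFrame({'Heart Disease': [1]*105 + [0]*39, node: [1]*(105+39)}),
    pd.DataFrame({'Heart Disease': [1]*34 + [0]*125, node: [0]*(34+125)}),
])

# normalize='index' divides every row by its own row total
row_probs = pd.crosstab(df[node], df['Heart Disease'], normalize='index')
print(row_probs)
```

This form drops the `Totals` column, so the weights for the weighted Gini Impurity would still need the raw counts.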


**Calculate the Gini Impurity for each Row**

In [279]:
gi_cp['Gini_Impurity'] = gi_cp[0]*(1-gi_cp[0]) + gi_cp[1]*(1-gi_cp[1])
gi_cp

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity |
| --- | --- | --- | --- | --- |
| 0 | 0.786164 | 0.213836 | 159 | 0.336221 |
| 1 | 0.270833 | 0.729167 | 144 | 0.394965 |


**Calculate the Weighted Gini Impurity for the Chest Pain Node**

In [280]:
gi_cp['Weight'] = gi_cp['Totals'] / gi_cp['Totals'].sum()
gi_cp

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.786164 | 0.213836 | 159 | 0.336221 | 0.524752 |
| 1 | 0.270833 | 0.729167 | 144 | 0.394965 | 0.475248 |


In [281]:
gi_cp['Weighted_Gini_Impurity'] = gi_cp['Gini_Impurity'] * gi_cp['Weight']
gi_cp

gi_cp['Weighted_Gini_Impurity'].sum(axis=0).round(3)

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.786164 | 0.213836 | 159 | 0.336221 | 0.524752 | 0.176433 |
| 1 | 0.270833 | 0.729167 | 144 | 0.394965 | 0.475248 | 0.187706 |


0.364
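
The steps above (row probabilities, per-row Gini Impurity, weights, weighted sum) can be folded into one helper. This is a minimal sketch: `weighted_gini` and `chest_pain_counts` are hypothetical names, and the function assumes a table of raw counts with one row per branch and one column per class.

```python
import pandas as pd

def weighted_gini(counts):
    """Weighted Gini impurity of a split.

    counts: DataFrame of raw counts, one row per branch, one column per class.
    """
    totals = counts.sum(axis=1)                # observations per branch
    probs = counts.div(totals, axis=0)         # class probabilities per branch
    gini = (probs * (1 - probs)).sum(axis=1)   # Gini impurity per branch
    weights = totals / totals.sum()            # share of observations per branch
    return float((gini * weights).sum())

# Counts from the Chest Pains cross tabulation (columns: Heart Disease 0 / 1)
chest_pain_counts = pd.DataFrame({0: [125, 39], 1: [34, 105]}, index=[0, 1])

result = round(weighted_gini(chest_pain_counts), 3)
print(result)
```

For the Chest Pains counts this reproduces the value 0.364 obtained step by step above.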

# Entropy for Chest Pains

**Calculate the Entropy for each Row**

In [325]:
gi_cp['Entropy'] = -gi_cp[0]*np.log(gi_cp[0]) + -gi_cp[1]*np.log(gi_cp[1])
gi_cp

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity | Entropy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.786164 | 0.213836 | 159 | 0.336221 | 0.524752 | 0.176433 | 0.518996 |
| 1 | 0.270833 | 0.729167 | 144 | 0.394965 | 0.475248 | 0.187706 | 0.584086 |


In [326]:
gi_cp['Weighted_Entropy'] =  gi_cp['Weight']*gi_cp['Entropy']
gi_cp

gi_cp['Weighted_Entropy'].sum(axis=0).round(3)

| Chest Pains | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity | Entropy | Weighted_Entropy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.786164 | 0.213836 | 159 | 0.336221 | 0.524752 | 0.176433 | 0.518996 | 0.272344 |
| 1 | 0.270833 | 0.729167 | 144 | 0.394965 | 0.475248 | 0.187706 | 0.584086 | 0.277585 |


0.55
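
The Entropy cells use `np.log`, the natural logarithm, so the values are in nats. Decision tree references often quote Entropy in bits (log base 2) instead. The sketch below uses a hypothetical helper `entropy` with an assumed `base` parameter to show both, and it skips zero probabilities so that a pure node does not produce `nan`.

```python
import numpy as np

def entropy(probs, base=np.e):
    """Shannon entropy of a probability vector; base=np.e gives nats, base=2 gives bits."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # by convention 0 * log(0) contributes 0
    # p * log(1/p) is the same as -p * log(p)
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

# Chest Pains = 0 row: 125 without and 34 with heart disease out of 159
nats = entropy([125/159, 34/159])
bits = entropy([125/159, 34/159], base=2)
pure = entropy([1.0, 0.0])

print(round(nats, 6), round(bits, 6), pure)
```

Changing the base rescales every Entropy value by the same constant, so the ranking of the candidate decisions is unaffected.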

# Gini Impurity for Good Blood Circulation

In [285]:
node = 'Good Blood Circulation'

data = pd.concat( [pd.DataFrame(data={'Heart Disease': [1]*37 + [0]*127 , node:[1]*(37+127)}), 
                   pd.DataFrame(data={'Heart Disease': [1]*100 +  [0]*33 , node:[0]*(100+33)})])

data.groupby(node)['Heart Disease'].value_counts().unstack()

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 |
| --- | --- | --- |
| 0 | 33 | 100 |
| 1 | 127 | 37 |


**Calculate Cross Tabulation table.**

In [286]:
crosstab = (pd.crosstab(data[node], data['Heart Disease'], margins=True, margins_name='Totals')
             .drop('Totals', axis=0)
            )
crosstab

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 | Totals |
| --- | --- | --- | --- |
| 0 | 33 | 100 | 133 |
| 1 | 127 | 37 | 164 |


**Calculate the probability for each row.**

In [291]:
gi_gbc = crosstab.copy()
gi_gbc[[0, 1]] = gi_gbc[[0, 1]].divide(gi_gbc['Totals'], axis=0)

gi_gbc

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 | Totals |
| --- | --- | --- | --- |
| 0 | 0.24812 | 0.75188 | 133 |
| 1 | 0.77439 | 0.22561 | 164 |


**Calculate the Gini Impurity for each Row**

In [292]:
gi_gbc['Gini_Impurity'] = gi_gbc[0]*(1-gi_gbc[0]) + gi_gbc[1]*(1-gi_gbc[1]) 
gi_gbc

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity |
| --- | --- | --- | --- | --- |
| 0 | 0.24812 | 0.75188 | 133 | 0.373113 |
| 1 | 0.77439 | 0.22561 | 164 | 0.34942 |


**Calculate the Weighted Gini Impurity for the Good Blood Circulation Node**

In [293]:
gi_gbc['Weight'] = gi_gbc['Totals'] / gi_gbc['Totals'].sum()
gi_gbc

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.24812 | 0.75188 | 133 | 0.373113 | 0.447811 |
| 1 | 0.77439 | 0.22561 | 164 | 0.34942 | 0.552189 |


In [294]:
gi_gbc['Weighted_Gini_Impurity'] = gi_gbc['Gini_Impurity'] * gi_gbc['Weight']
gi_gbc

gi_gbc['Weighted_Gini_Impurity'].sum(axis=0).round(3)

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.24812 | 0.75188 | 133 | 0.373113 | 0.447811 | 0.167084 |
| 1 | 0.77439 | 0.22561 | 164 | 0.34942 | 0.552189 | 0.192946 |


0.36

# Entropy for Good Blood Circulation

**Calculate the Entropy for each Row**

In [323]:
gi_gbc['Entropy'] = -gi_gbc[0]*np.log(gi_gbc[0]) + -gi_gbc[1]*np.log(gi_gbc[1])
gi_gbc

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity | Entropy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.24812 | 0.75188 | 133 | 0.373113 | 0.447811 | 0.167084 | 0.560261 |
| 1 | 0.77439 | 0.22561 | 164 | 0.34942 | 0.552189 | 0.192946 | 0.533917 |


In [324]:
gi_gbc['Weighted_Entropy'] =  gi_gbc['Weight']*gi_gbc['Entropy']
gi_gbc

gi_gbc['Weighted_Entropy'].sum(axis=0).round(3)

| Good Blood Circulation | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity | Entropy | Weighted_Entropy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.24812 | 0.75188 | 133 | 0.373113 | 0.447811 | 0.167084 | 0.560261 | 0.250891 |
| 1 | 0.77439 | 0.22561 | 164 | 0.34942 | 0.552189 | 0.192946 | 0.533917 | 0.294823 |


0.546

# Gini Impurity for Blocked Arteries

**Create data to match StatQuest Example**

In [303]:
node = 'Blocked Arteries'

data_blocked_arteries = pd.concat( [pd.DataFrame(data={'Heart Disease': [1]*92 + [0]*31 , node:[1]*(92+31)}), 
                        pd.DataFrame(data={'Heart Disease': [1]*45 +  [0]*129 , node:[0]*(45+129)})])

data_blocked_arteries.groupby(node)['Heart Disease'].value_counts().unstack()

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 |
| --- | --- | --- |
| 0 | 129 | 45 |
| 1 | 31 | 92 |


**Calculate Cross Tabulation table.**

In [304]:
crosstab = (pd.crosstab(data_blocked_arteries[node], data_blocked_arteries['Heart Disease'], margins=True, margins_name='Totals')
             .drop('Totals', axis=0)
            )
crosstab

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 | Totals |
| --- | --- | --- | --- |
| 0 | 129 | 45 | 174 |
| 1 | 31 | 92 | 123 |


**Calculate the probability for each row.**

In [305]:
gi_ba = crosstab.copy()
gi_ba[[0, 1]] = gi_ba[[0, 1]].divide(gi_ba['Totals'], axis=0)

gi_ba

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 | Totals |
| --- | --- | --- | --- |
| 0 | 0.741379 | 0.258621 | 174 |
| 1 | 0.252033 | 0.747967 | 123 |


**Calculate the Gini Impurity for each Row**

In [306]:
gi_ba['Gini_Impurity'] = gi_ba[0]*(1-gi_ba[0]) + gi_ba[1]*(1-gi_ba[1])
gi_ba

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity |
| --- | --- | --- | --- | --- |
| 0 | 0.741379 | 0.258621 | 174 | 0.383472 |
| 1 | 0.252033 | 0.747967 | 123 | 0.377024 |


**Calculate the Weighted Gini Impurity for the Blocked Arteries Node**

In [307]:
gi_ba['Weight'] = gi_ba['Totals'] / gi_ba['Totals'].sum()
gi_ba

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.741379 | 0.258621 | 174 | 0.383472 | 0.585859 |
| 1 | 0.252033 | 0.747967 | 123 | 0.377024 | 0.414141 |


In [308]:
gi_ba['Weighted_Gini_Impurity'] = gi_ba['Gini_Impurity'] * gi_ba['Weight']
gi_ba

gi_ba['Weighted_Gini_Impurity'].sum(axis=0).round(3)

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.741379 | 0.258621 | 174 | 0.383472 | 0.585859 | 0.22466 |
| 1 | 0.252033 | 0.747967 | 123 | 0.377024 | 0.414141 | 0.156141 |


0.381

# Entropy for Blocked Arteries

**Calculate the Entropy for each Row**

In [317]:
gi_ba['Entropy'] = -(gi_ba[0]*np.log(gi_ba[0])) + (-gi_ba[1]*np.log(gi_ba[1]))
gi_ba

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity | Entropy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.741379 | 0.258621 | 174 | 0.383472 | 0.585859 | 0.22466 | 0.571609 |
| 1 | 0.252033 | 0.747967 | 123 | 0.377024 | 0.414141 | 0.156141 | 0.564557 |


In [318]:
gi_ba['Weighted_Entropy'] =  gi_ba['Weight']*gi_ba['Entropy']
gi_ba

gi_ba['Weighted_Entropy'].sum(axis=0).round(3)

| Blocked Arteries | Heart Disease = 0 | Heart Disease = 1 | Totals | Gini_Impurity | Weight | Weighted_Gini_Impurity | Entropy | Weighted_Entropy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.741379 | 0.258621 | 174 | 0.383472 | 0.585859 | 0.22466 | 0.571609 | 0.334882 |
| 1 | 0.252033 | 0.747967 | 123 | 0.377024 | 0.414141 | 0.156141 | 0.564557 | 0.233806 |


0.569

# Comparison

Ideally, we want to choose the decision that minimizes the Gini Impurity (and Entropy).  In the above example, Good Blood Circulation has the lowest weighted Gini Impurity, 0.36. 

In [327]:
gi_cp['Weighted_Gini_Impurity'].sum(axis=0).round(3)
gi_gbc['Weighted_Gini_Impurity'].sum(axis=0).round(3)
gi_ba['Weighted_Gini_Impurity'].sum(axis=0).round(3)

0.364

0.36

0.381
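
The same comparison can be sketched without the intermediate DataFrames, working directly from the counts in the three cross tabulations. The dictionary `splits` and the helper `weighted_gini` are illustrative names. Each branch is written as `[count without heart disease, count with heart disease]`.

```python
import numpy as np

# Counts per branch, copied from the cross tabulation tables above
splits = {
    'Chest Pains': [[125, 34], [39, 105]],
    'Good Blood Circulation': [[33, 100], [127, 37]],
    'Blocked Arteries': [[129, 45], [31, 92]],
}

def weighted_gini(branches):
    """Weighted Gini impurity of a split given raw class counts per branch."""
    counts = np.asarray(branches, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)   # observations per branch
    probs = counts / totals                      # class probabilities per branch
    gini = 1.0 - (probs ** 2).sum(axis=1)        # 1 - sum(p_i^2) per branch
    weights = totals.ravel() / totals.sum()      # share of observations per branch
    return float((gini * weights).sum())

scores = {name: round(weighted_gini(b), 3) for name, b in splits.items()}
best = min(scores, key=scores.get)  # decision with the lowest weighted impurity

print(scores)
print(best)
```

With these counts the lowest score belongs to Good Blood Circulation, matching the values printed by the cell above.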

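The cell below compares the weighted Entropy for each decision. We are interested in picking the decision that maximizes the information gain.  For a given parent node the Entropy before the decision is fixed, so this is the decision with the lowest weighted Entropy after the split; here that is Good Blood Circulation (0.546).  The information gain is equal to 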

Information Gain(T,X) = Entropy(before decision) - Entropy(after decision)

In [329]:
gi_cp['Weighted_Entropy'].sum(axis=0).round(3)
gi_gbc['Weighted_Entropy'].sum(axis=0).round(3)
gi_ba['Weighted_Entropy'].sum(axis=0).round(3)

0.55

0.546

0.569
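
The notebook stops at the weighted Entropy after each decision. As a rough sketch of the information gain formula, the code below also computes the Entropy before the decision from the column totals of each table. It assumes the parent node is simply the sum of the two branches. Note that the Chest Pains table covers 303 patients while the other two cover 297, so the parent Entropy differs slightly between them. The names `splits`, `entropy`, and `information_gain` are illustrative, and Entropy is in nats to match the cells above.

```python
import numpy as np

# Counts per branch as [no heart disease, heart disease], from the tables above
splits = {
    'Chest Pains': [[125, 34], [39, 105]],
    'Good Blood Circulation': [[33, 100], [127, 37]],
    'Blocked Arteries': [[129, 45], [31, 92]],
}

def entropy(counts):
    """Entropy in nats of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log(0) contributes 0
    return float(np.sum(p * np.log(1.0 / p)))

def information_gain(branches):
    """Entropy(before decision) - weighted Entropy(after decision)."""
    branches = np.asarray(branches, dtype=float)
    parent = branches.sum(axis=0)                    # class counts before the split
    weights = branches.sum(axis=1) / branches.sum()  # share of observations per branch
    after = sum(w * entropy(b) for w, b in zip(weights, branches))
    return float(entropy(parent) - after)

gains = {name: information_gain(b) for name, b in splits.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(name, round(gain, 3))
print(best)
```

Under these assumptions Good Blood Circulation has the largest information gain, which agrees with it having the lowest weighted Entropy.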