### Gini Index

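Calculate the Gini score of each sub-node as the sum of the squared probabilities of success and failure, $p^2+q^2$. With this formula a higher score means a purer node: a node containing a single class scores 1 and a 50/50 node scores 0.5. (The Gini impurity is $1-(p^2+q^2)$, for which lower is better.)

Calculate the Gini score of a split as the weighted average of the Gini scores of its sub-nodes, each weighted by its share of the rows.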
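The two steps above can be sketched as a pair of small helpers. This is only an illustration: the names `gini_score` and `weighted_gini` and the example counts are assumptions, not part of the notebook.

```python
def gini_score(n_yes, n_no):
    """Sum of squared class proportions (p^2 + q^2) for one node."""
    total = float(n_yes + n_no)
    p = n_yes / total
    q = n_no / total
    return p * p + q * q


def weighted_gini(nodes):
    """Weighted Gini score of a split; `nodes` is a list of (n_yes, n_no) pairs."""
    grand_total = float(sum(n_yes + n_no for n_yes, n_no in nodes))
    return sum(
        (n_yes + n_no) / grand_total * gini_score(n_yes, n_no)
        for n_yes, n_no in nodes
    )


# A pure node scores 1.0; a 50/50 node scores 0.5 (the minimum for two classes).
print(gini_score(10, 0))   # 1.0
print(gini_score(5, 5))    # 0.5
# Hypothetical split: one pure node and one 50/50 node of equal size.
print(weighted_gini([(10, 0), (5, 5)]))  # 0.75
```

Passing counts rather than probabilities keeps the node weights and the node scores tied to the same numbers, which avoids mixing up which probability belongs to which node.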

We want to know who has watched the movie. Let's use `employment_status` & `gender` as the candidate splits.

__Split on `employment_status`__

In [1]:
import pandas as pd
import numpy as np
sample = 50
films = pd.read_csv('../data/films.csv')

In [3]:
films

Unnamed: 0.1,Unnamed: 0,gender,age,employment_status,watching
0,0,M,21,student,yes
1,1,M,30,working,yes
2,2,F,26,working,yes
3,3,F,21,student,no
4,4,M,33,working,yes
5,5,M,30,working,yes
6,6,F,28,working,yes
7,7,M,19,student,no
8,8,F,31,working,no
9,9,M,27,working,no


In [4]:
print("Viewers who watched the movie:{}".format(len(films[films['watching'] == 'yes'])))
print("Viewers who did not watch the movie:{}".format(len(films[films['watching'] == 'no'])))

Viewers who watched the movie:26
Viewers who did not watch the movie:24


We'll start with the split on employment status.

In [5]:
print("Viewers who work:{}".format(len(films[films['employment_status'] == 'working'])))
print("Viewers who are student:{}".format(len(films[films['employment_status'] == 'student'])))

Viewers who work:41
Viewers who are student:9


In [6]:
working_watched = films[films['employment_status'] == 'working']
print("Working viewers who watched the movie:{}".format(len(working_watched[working_watched['watching'] == 'yes'])))
print("Working viewers who did not watch the movie:{}".format(len(working_watched[working_watched['watching'] == 'no'])))

Working viewers who watched the movie:22
Working viewers who did not watch the movie:19


In [7]:
student_watched = films[films['employment_status'] == 'student']
print("Students who watched the movie:{}".format(len(student_watched[student_watched['watching'] == 'yes'])))
print("Students who did not watch the movie:{}".format(len(student_watched[student_watched['watching'] == 'no'])))

Students who watched the movie:4
Students who did not watch the movie:5
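The per-node counts above can also be read off a single contingency table. The sketch below rebuilds a stand-in frame from the counts just printed (the name `demo` is an assumption, it is not the notebook's `films` frame) and uses `pd.crosstab`.

```python
import pandas as pd

# Rebuild a stand-in for the `employment_status` and `watching` columns
# from the counts printed above (22/19 for working, 4/5 for students).
rows = (
    [("working", "yes")] * 22 + [("working", "no")] * 19
    + [("student", "yes")] * 4 + [("student", "no")] * 5
)
demo = pd.DataFrame(rows, columns=["employment_status", "watching"])

# One table with the yes/no counts of every node of the split.
counts = pd.crosstab(demo["employment_status"], demo["watching"])
print(counts)

# Row-wise proportions: the probability of "yes" / "no" inside each node.
probs = counts.div(counts.sum(axis=1), axis=0)
print(probs.round(3))
```

On the real data the same two calls would take `films["employment_status"]` and `films["watching"]` directly.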


- Below are the calculations for probabilities:

$$student\;watched\;yes = \frac{students\;who\;watched\;movie}{total\;students}$$

$$working\;watched\;yes = \frac{working\;professionals\;who\;watched\;movie}{total\;working}$$

In [8]:
student_watched_yes = (4/float(9))
working_watched_yes = (22/float(41))
print("Probability of students that watched:{:.3f}".format(student_watched_yes))
print("Probability of working people that watched:{:.3f}".format(working_watched_yes))

Probability of students that watched:0.444
Probability of working people that watched:0.537


- Below are the calculations for gini scores:

$$gini(student)\;=\;(student\;watched\;yes)^{2}\;+\;(1-student\;watched\;yes)^{2}$$

$$gini(working)\;=\;(working\;watched\;yes)^{2}\;+\;(1-working\;watched\;yes)^{2}$$

In [9]:
subnode_student = (student_watched_yes)*(student_watched_yes) + (1-student_watched_yes)*(1-student_watched_yes)
print("Gini(student):{:.3f}".format(subnode_student))

Gini(student):0.506


In [10]:
subnode_working = (working_watched_yes)*(working_watched_yes) + (1-working_watched_yes)*(1-working_watched_yes)
print("Gini(working):{:.3f}".format(subnode_working))

Gini(working):0.503


- Finding the `weighted gini` for employment split:

$$weighted\;gini(employment)\;=\;\frac{working}{total}\times gini(working)\;+\;\frac{student}{total}\times gini(student)$$

In [11]:
calculated_wt_emp = (41/float(50))*subnode_working + (9/float(50))*subnode_student
print("Weighted Gini(employment):{:.3f}".format(calculated_wt_emp))

Weighted Gini(employment):0.503


__Split on gender__

In [12]:
print("Total males:{}".format(len(films[films['gender'] == 'M'])))
print("Total females:{}".format(len(films[films['gender'] == 'F'])))

Total males:28
Total females:22


In [13]:
watched_yes = films[films['watching'] == 'yes']
print("Males who have watched Dunkirk:{}".format(len(watched_yes[watched_yes['gender'] == 'M'])))
print("Females who have watched Dunkirk:{}".format(len(watched_yes[watched_yes['gender'] == 'F'])))

Males who have watched Dunkirk:12
Females who have watched Dunkirk:14


- Calculating probabilities:

$$males\;watched\;yes = \frac{males\;who\;watched\;movie}{total\;males}$$

$$females\;watched\;yes = \frac{females\;who\;watched\;movie}{total\;females}$$

In [14]:
male_watched_yes = (12/float(28))
female_watched_yes = (14/float(22))

In [15]:
print("Probability of males that watched Dunkirk:{:.3f}".format(male_watched_yes))
print("Probability of females that watched Dunkirk:{:.3f}".format(female_watched_yes))

Probability of males that watched Dunkirk:0.429
Probability of females that watched Dunkirk:0.636


- Calculating the `gini` values for `males` & `females`:

$$gini(males)\;=\;(males\;watched\;yes)^{2}\;+\;(1-males\;watched\;yes)^{2}$$

$$gini(females)\;=\;(females\;watched\;yes)^{2}\;+\;(1-females\;watched\;yes)^{2}$$

In [16]:
subnode_male = (male_watched_yes)*(male_watched_yes) + (1-male_watched_yes)*(1-male_watched_yes)
subnode_female = (female_watched_yes)*(female_watched_yes) + (1-female_watched_yes)*(1-female_watched_yes)

In [17]:
print("Gini(female):{:.3f}".format(subnode_female))
print("Gini(male):{:.3f}".format(subnode_male))

Gini(female):0.537
Gini(male):0.510


- Calculating the `weighted gini` for `gender` split:

$$weighted\;gini(gender)\;=\;\frac{males}{total}\times gini(males)\;+\;\frac{females}{total}\times gini(females)$$

In [18]:
calculated_wt_gender = (28/float(50))*subnode_male + (22/float(50))*subnode_female
print("Weighted Gini(gender):{:.3f}".format(calculated_wt_gender))

Weighted Gini(gender):0.522


__Since `weighted gini(gender)` (0.522) `> weighted gini(employment)` (0.503), and a higher $p^2+q^2$ score means purer sub-nodes, the node split will take place on `gender`.__

### Task: Calculate the `weighted gini index` for the `age` split. Is it higher than for the other two splits?
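Because `age` is numeric, it first has to be turned into groups before a weighted Gini score can be computed. The sketch below is one possible starting point, not a solution: it uses only the ten rows displayed at the top of the notebook, and the cut-off of 25 as well as the names `demo`, `age_group` and `gini_score` are assumptions made for illustration.

```python
import pandas as pd

# The ten rows displayed at the top of the notebook (age and watching only).
demo = pd.DataFrame({
    "age": [21, 30, 26, 21, 33, 30, 28, 19, 31, 27],
    "watching": ["yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "no", "no"],
})

threshold = 25  # hypothetical cut-off, chosen only for illustration
demo["age_group"] = (demo["age"] >= threshold).map(
    {True: "at_or_above", False: "below"}
)


def gini_score(labels):
    """p^2 + q^2 for one node, where p is the share of 'yes' labels."""
    p = (labels == "yes").mean()
    return p ** 2 + (1 - p) ** 2


# Score of each node, weight of each node, and their weighted sum.
node_scores = demo.groupby("age_group")["watching"].apply(gini_score)
node_weights = demo["age_group"].value_counts(normalize=True)
weighted_gini_age = (node_scores * node_weights).sum()

print(node_scores.round(3))
print("Weighted Gini(age, 10-row sample):{:.3f}".format(weighted_gini_age))
```

For the task itself, the same steps would be run on all rows of `films`, ideally trying several thresholds and keeping the one with the highest weighted score.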

### Chi-squared

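Calculate the chi-square value of each node from the deviations between the actual and expected counts of both success and failure.

Calculate the chi-square of a split as the sum of the chi-square values of success and failure over all nodes of the split. The higher the value, the more the sub-nodes differ from what was expected.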
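The two steps can be sketched as follows, assuming the per-outcome formula this section uses, $\sqrt{(actual - expected)^2 / expected}$, and a 50/50 expectation inside each node. The helper names `node_chi` and `split_chi` and the example counts are hypothetical.

```python
import math


def node_chi(actual, expected):
    """Chi value of one outcome in one node: sqrt((actual - expected)^2 / expected)."""
    return math.sqrt((actual - expected) ** 2 / float(expected))


def split_chi(nodes):
    """Sum the chi values of 'yes' and 'no' over every node of a split.

    `nodes` is a list of (n_yes, n_no) pairs; the expected count of each
    outcome is assumed to be half of the node size.
    """
    total = 0.0
    for n_yes, n_no in nodes:
        expected = (n_yes + n_no) / 2.0
        total += node_chi(n_yes, expected) + node_chi(n_no, expected)
    return total


# Hypothetical nodes: a 50/50 node adds nothing, a lopsided node adds a lot.
print(split_chi([(10, 10)]))  # 0.0
print(split_chi([(18, 2)]))
```

Note that the denominator is the expected count, not the actual one, and that the expected count is computed with float division so that odd node sizes are not rounded down.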

__Split on Gender__

- __Female Node__

In [19]:
women_total = films[films['gender'] == 'F']
men_total = films[films['gender'] == 'M']
print("Total number of females:{}".format(len(women_total)))
print("Total number of males:{}".format(len(men_total)))

Total number of females:22
Total number of males:28


In [20]:
watch_movie_female = women_total[women_total['watching'] == 'yes']
not_watch_movie_female = women_total[women_total['watching'] == 'no']
print("Women that have watched Dunkirk:{}".format(len(watch_movie_female)))
print("Women that have not watched Dunkirk:{}".format(len(not_watch_movie_female)))

Women that have watched Dunkirk:14
Women that have not watched Dunkirk:8


In [21]:
expected_watch = len(women_total)/2
not_expected_watch = len(women_total)/2
print("Total number of women expected to watch the movie (50% probability, half of the women):{}".format(expected_watch))
print("Total number of women expected not to watch the movie (same as above):{}".format(not_expected_watch))

Total number of women expected to watch the movie (50% probability, half of the women):11
Total number of women expected not to watch the movie (same as above):11


In [22]:
deviation_female = (len(watch_movie_female) - expected_watch)
deviation_female_not = (len(not_watch_movie_female) - not_expected_watch)
print("Deviation(Actual-Expected) of the women watching movie:{}".format(deviation_female))
print("Deviation(Actual-Expected) of women not watching the movie:{}".format(deviation_female_not))

Deviation(Actual-Expected) of the women watching movie:3
Deviation(Actual-Expected) of women not watching the movie:-3


- Formula to calculate the `chi-square` value of one outcome in one node:

$$chi\text{-}square\;=\;\sqrt{\frac{(actual\;-\;expected)^2}{expected}}$$

In [23]:
chi_female = np.sqrt((deviation_female)*(deviation_female)/float(expected_watch))
print("Chi-squared value(female watching):{:.3f}".format(chi_female))
chi_female_not = np.sqrt((deviation_female_not)*(deviation_female_not)/float(not_expected_watch))
print("Chi-squared value(female not watching):{:.3f}".format(chi_female_not))

Chi-squared value(female watching):0.905
Chi-squared value(female not watching):0.905


- Male Node

    * The same steps are followed for the `male node`.

In [24]:
men_total = films[films['gender'] == 'M']
print("Total number of males:{}".format(len(men_total)))

Total number of males:28


In [25]:
watch_movie_male = men_total[men_total['watching'] == 'yes']
not_watch_movie_male = men_total[men_total['watching'] == 'no']

print("Men that have watched Dunkirk:{}".format(len(watch_movie_male)))
print("Men that have not watched Dunkirk:{}".format(len(not_watch_movie_male)))

Men that have watched Dunkirk:12
Men that have not watched Dunkirk:16


In [26]:
expected_watch = len(men_total)/2
not_expected_watch = len(men_total)/2
print("Total number of men expected to watch the movie (50% probability, half of the men):{}".format(expected_watch))
print("Total number of men expected not to watch the movie (same as above):{}".format(not_expected_watch))

Total number of men expected to watch the movie (50% probability, half of the men):14
Total number of men expected not to watch the movie (same as above):14


In [27]:
deviation_male = (len(watch_movie_male) - expected_watch)
deviation_male_not = (len(not_watch_movie_male) - not_expected_watch)
print("Deviation(Actual-Expected) of the men watching movie:{}".format(deviation_male))
print("Deviation(Actual-Expected) of the men not watching movie:{}".format(deviation_male_not))

Deviation(Actual-Expected) of the men watching movie:-2
Deviation(Actual-Expected) of the men not watching movie:2


In [28]:
chi_male = np.sqrt((deviation_male)*(deviation_male)/float(expected_watch))
chi_male_not = np.sqrt((deviation_male_not)*(deviation_male_not)/float(not_expected_watch))
print("Chi-squared value(male watching):{:.3f}".format(chi_male))
print("Chi-squared value(male not watching):{:.3f}".format(chi_male_not))

Chi-squared value(male watching):0.535
Chi-squared value(male not watching):0.535


In [29]:
total_chi_gender = chi_female + chi_female_not + chi_male + chi_male_not
print("Total chi for gender:{:.3f}".format(total_chi_gender))

Total chi for gender:2.878


__Split on employment status__

- Working node

    * Continuing the same calculations for the `working` & `student` nodes.
    * Get the total chi-squared value & compare.

In [30]:
working_total = films[films['employment_status'] == 'working']
print("Total number of working people:{}".format(len(working_total)))

Total number of working people:41


In [31]:
watch_movie_working = working_total[working_total['watching'] == 'yes']
not_watch_movie_working = working_total[working_total['watching'] == 'no']
print("Total number of working people that watched the movie:{}".format(len(watch_movie_working)))
print("Total number of working people that did not watch the movie:{}".format(len(not_watch_movie_working)))

Total number of working people that watched the movie:22
Total number of working people that did not watch the movie:19


In [32]:
# Divide by 2.0 so that an odd node size is not rounded down
expected_watch = len(working_total)/2.0
not_expected_watch = len(working_total)/2.0
print("Total number of working people expected to watch the movie (50% probability):{}".format(expected_watch))
print("Total number of working people expected not to watch the movie (same as above):{}".format(not_expected_watch))

Total number of working people expected to watch the movie (50% probability):20.5
Total number of working people expected not to watch the movie (same as above):20.5


In [33]:
deviation_working = (len(watch_movie_working) - expected_watch)
deviation_working_not = (len(not_watch_movie_working) - not_expected_watch)
print("Deviation(Actual-Expected) of the working member watching movie:{}".format(deviation_working))
print("Deviation(Actual-Expected) of the working member not watching movie:{}".format(deviation_working_not))

Deviation(Actual-Expected) of the working member watching movie:1.5
Deviation(Actual-Expected) of the working member not watching movie:-1.5


In [34]:
chi_working = np.sqrt((deviation_working)*(deviation_working)/float(expected_watch))
chi_working_not = np.sqrt((deviation_working_not)*(deviation_working_not)/float(not_expected_watch))
print("Chi-squared value(working people watching):{:.3f}".format(chi_working))
print("Chi-squared value(working people not watching):{:.3f}".format(chi_working_not))

Chi-squared value(working people watching):0.331
Chi-squared value(working people not watching):0.331


- Student node

In [35]:
student_total = films[films['employment_status'] == 'student']
print("Total number of students:{}".format(len(student_total)))

Total number of students:9


In [36]:
watch_movie_student = student_total[student_total['watching'] == 'yes']
not_watch_movie_student = student_total[student_total['watching'] == 'no']
print("Total number of students that watched the movie:{}".format(len(watch_movie_student)))
print("Total number of students that did not watch the movie:{}".format(len(not_watch_movie_student)))

Total number of students that watched the movie:4
Total number of students that did not watch the movie:5


In [37]:
expected_watch = len(student_total)/2.0
not_expected_watch = len(student_total)/2.0
print("Total number of students expected to watch the movie (50% probability):{}".format(expected_watch))
print("Total number of students expected not to watch the movie (same as above):{}".format(not_expected_watch))

Total number of students expected to watch the movie (50% probability):4.5
Total number of students expected not to watch the movie (same as above):4.5


In [38]:
deviation_student = (len(watch_movie_student) - expected_watch)
deviation_student_not = (len(not_watch_movie_student) - not_expected_watch)

In [39]:
chi_student = np.sqrt((deviation_student)*(deviation_student)/float(expected_watch))
chi_student_not = np.sqrt((deviation_student_not)*(deviation_student_not)/float(not_expected_watch))

In [40]:
total_chi_emp = chi_student + chi_student_not + chi_working + chi_working_not

In [41]:
print("Total chi for employment status:{:.3f}".format(total_chi_emp))

Total chi for employment status:1.134

__The total chi value of the `gender` split is higher than that of the `employment status` split, so `gender` is the more significant split.__

### Task: Calculate the `chi-squared` value for `age node`. Find out whether it is a good split or not.

### Entropy / Information Gain

$$entropy\;=\;-p\,log_{2}\,p\;-\;q\,log_{2}\,q$$

Here $p$ and $q$ are the probabilities of success and failure, respectively, in that node. Entropy is also used with a categorical target variable. The split with the lowest weighted entropy, compared to the parent node and the other splits, is chosen: the lower the entropy, the better.

Calculate the entropy of the parent node.

Calculate the entropy of each individual node of the split, then the weighted average of all sub-nodes in the split.
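A minimal sketch of these steps, working from yes/no counts, is given below. The helper names `entropy` and `information_gain` and the example counts are assumptions; information gain is taken here as the parent entropy minus the size-weighted entropy of the sub-nodes.

```python
import math


def entropy(n_yes, n_no):
    """Binary entropy (in bits) of a node with the given yes/no counts."""
    total = float(n_yes + n_no)
    result = 0.0
    for count in (n_yes, n_no):
        if count > 0:  # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes.

    `parent` is a (n_yes, n_no) pair; `children` is a list of such pairs.
    """
    n_parent = float(sum(parent))
    weighted = sum((y + n) / n_parent * entropy(y, n) for y, n in children)
    return entropy(*parent) - weighted


# Hypothetical example: a 50/50 parent split into two pure children.
print(entropy(5, 5))                               # 1.0
print(information_gain((5, 5), [(5, 0), (0, 5)]))  # 1.0
```

Because every weight is a node size divided by the parent size, the weights always sum to 1 and the weighted entropy can never exceed 1 for a two-class target.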

In [42]:
# Calculating the parent entropy
parent_entropy = -(26/float(50))*np.log2(26/float(50)) - (24/float(50))*np.log2(24/float(50))

In [43]:
parent_entropy

0.99884553599520176

The value is close to 1, so the parent node is almost maximally impure.

In [44]:
#Female node entropy
female_entropy = -(14/float(22))*np.log2(14/float(22)) - (8/float(22))*np.log2(8/float(22))

In [45]:
female_entropy

0.9456603046006401

In [46]:
#Male node entropy
male_entropy = -(12/float(28))*np.log2(12/float(28)) - (16/float(28))*np.log2(16/float(28))

In [47]:
male_entropy

0.98522813603425152

In [48]:
#Weighted entropy for gender
weighted_gender = (28/float(50))*male_entropy + (22/float(50))*female_entropy

In [49]:
round(weighted_gender, 4)

0.9678

In [50]:
#Employment status == working (entropy)
working_entropy = -(22/float(41))*np.log2(22/float(41)) - (19/float(41))*np.log2(19/float(41))

In [51]:
working_entropy

0.99613448350957956

In [52]:
#Employment status == student (entropy)
student_entropy = -(4/float(9))*np.log2(4/float(9)) - (5/float(9))*np.log2(5/float(9))

In [53]:
student_entropy

0.99107605983822222

In [54]:
#Weighted entropy for employment status
weighted_emp = (41/float(50))*working_entropy + (9/float(50))*student_entropy

In [55]:
round(weighted_emp, 4)

0.9952

__`gender` has the lower weighted entropy (0.9678 vs 0.9952), so it is the better split. `Information gain` is the parent entropy minus the weighted entropy of the split: about $0.9988\;-\;0.9678\;=\;0.031$ for `gender` and $0.9988\;-\;0.9952\;=\;0.004$ for `employment status`.__

### Task: Calculate the `entropy` for `age node`. What will be its `entropy`?

### Variance

Reduction in variance is a splitting criterion used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split: the split with the lower weighted variance is selected to split the population.

$$variance\;=\;\frac{\sum(x\;-\;\overline{x})^{2}}{n}$$

- Variance for the root node (26 watched, 24 did not)
- Assigning `watching film` as `1` and `not watching film` as `0`.
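With the target coded as `1`/`0`, the variance formula can be applied directly to the counts. The sketch below uses hypothetical helper names (`node_variance`, `weighted_variance`) and also illustrates that, for a 0/1 target, the variance reduces to $p(1-p)$.

```python
import numpy as np


def node_variance(n_yes, n_no):
    """Population variance of a node whose target is coded 1 (yes) / 0 (no)."""
    values = np.array([1.0] * n_yes + [0.0] * n_no)
    return np.mean((values - values.mean()) ** 2)


def weighted_variance(nodes):
    """Size-weighted variance of a split; `nodes` is a list of (n_yes, n_no) pairs."""
    total = float(sum(y + n for y, n in nodes))
    return sum((y + n) / total * node_variance(y, n) for y, n in nodes)


# Root node counts stated above: 26 watched, 24 did not.
root = node_variance(26, 24)
p = 26 / 50.0
print(root, p * (1 - p))  # both are approximately 0.2496
```

Computing the mean inside the helper, instead of typing a rounded value such as `0.52` by hand, avoids small rounding differences in the later cells.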

In [56]:
mean_root = (26*1 + 24*0)/float(50)

In [57]:
mean_root

0.52

In [58]:
var_root = (26*pow((1-0.52),2)+24*pow((0-0.52),2))/50

In [59]:
var_root

0.24960000000000002

- Similarly calculating variance for `gender` split

In [60]:
#Women watched == 14
#Women not watched == 8
#Total women == 22
mean_women = (14*1 + 8*0)/float(22)

In [61]:
mean_women

0.6363636363636364

In [62]:
var_women = ((14*pow((1-0.636),2))+(8*pow((0-0.636),2)))/22

In [63]:
var_women

0.23140509090909092

In [64]:
#Men watching == 12
#Men not watching == 16
#Total men == 28
mean_men = (12*1 + 16*0)/float(28)

In [65]:
mean_men

0.42857142857142855

In [66]:
var_men = (12*pow((1-0.428),2)+16*pow((0-0.428),2))/28

In [67]:
var_men

0.24489828571428576

In [68]:
#Weighted variance for gender split
weighted_gender = (28/float(50))*var_men + (22/float(50))*var_women

In [69]:
weighted_gender

0.23896128000000005

- Calculating variance for `employment` split

In [70]:
#Total working watched == 22
#Total working not watched == 19
#Total working == 41
mean_working = (22*1 + 19*0)/float(41)

In [71]:
mean_working

0.5365853658536586

In [72]:
var_working = (22*pow((1-0.536), 2) + 19*pow((0-0.536),2))/float(41)

In [73]:
var_working

0.24866185365853663

In [74]:
#Total students watched == 4
#Total students not watched == 5
#Total students == 9
mean_students = (4*1 + 5*0)/float(9)

In [75]:
mean_students

0.4444444444444444

In [76]:
var_students = (4*pow((1-0.444),2) + 5*pow((0-0.444),2))/float(9)

In [77]:
var_students

0.2469137777777778

In [78]:
#Weighted variance for employment split
weighted_emp = (41/float(50))*var_working + (9/float(50))*var_students

In [79]:
weighted_emp

0.24834720000000005

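__Here, the weighted variance of the `employment` split (`weighted_emp`, 0.2483) is slightly higher than that of the `gender` split (`weighted_gender`, 0.2390). Since lower variance is better, `gender` is the preferred split.__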

### Task: Calculate the `variance` for `age` split & compare it against other variances.