# Attribute Selection Method with Gini

1. In decision tree terminology, a "feature" or "column" is called an "attribute".

2. There are mainly two criteria for choosing the splitting condition in a decision tree.
    a. Information gain (Entropy).
    b. Gini index.
    
    
### Entropy

Splitting the root node produces subsets that are either homogeneous or mixed.
Entropy measures how homogeneous the samples in a subset are.

1. A subset whose samples are similar is homogeneous.
2. If all samples in a subset are of the same type (target), then the entropy of that subset is 0.

$$E(S) = \sum_{i=1}^{c} -P(i) \log_2 P(i)$$


1. As the entropy decreases, our confidence in the predicted class increases.
    a. Split in the direction where the entropy decreases.
    
2. The difference in the entropy before and after the split is called 'Information gain'.


### Example

a. 9 'yes', 5 'no'
b. Entropy = -p(y) log2(p(y)) - p(n) log2(p(n))
c. p(y) = 9/14, p(n) = 5/14
d. After solving, Entropy = 0.940.
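
The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `entropy` and the choice to pass class counts as a list are assumptions made for illustration, not part of the notes.

```python
import math

def entropy(counts):
    """Base-2 entropy of a subset, given the count of samples in each class."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count == 0:
            continue  # by convention, 0 * log2(0) is treated as 0
        p = count / total
        result -= p * math.log2(p)
    return result

# Target column from the example: 9 'yes' and 5 'no'
print(round(entropy([9, 5]), 3))

# A homogeneous subset (all samples in one class) has entropy 0
print(entropy([4, 0]))
```

Skipping zero counts avoids calling `math.log2(0)`, which would raise an error.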

For each attribute (column), we group the samples by the attribute's values, count the target values in each group, and calculate the entropy of the column.

$$E(T, X) = \sum_{c \in X} P(c)\, E(c)$$

The entropy of column X with target column T is the sum, over every value c of X, of the probability of c multiplied by the entropy of the subset where X = c.

Sunny (3 yes) (2 no)
E(sunny) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971

Overcast (4 yes) (0 no), homogeneous
E(overcast) = -1 log2(1) - 0 log2(0) = 0 (taking 0 log2(0) as 0)

Rainy (2 yes) (3 no)
E(rainy) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971

E(outlook) = P(sunny)*E(sunny) + P(overcast)*E(overcast) + P(rainy)*E(rainy)

P(sunny) = 5/14, P(overcast) = 4/14, P(rainy) = 5/14.

E(outlook) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693

Gain(outlook) = 0.940 - 0.693 = 0.247

Gain(T, X) = Entropy(T) - Entropy(T, X)
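
The gain for 'outlook' can be reproduced with a short sketch. The helper names `entropy` and `information_gain` are assumptions for illustration; the counts are the (yes, no) pairs used in the example above.

```python
import math

def entropy(counts):
    """Base-2 entropy from class counts; zero counts contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """Entropy(T) minus the weighted entropy of the subsets, E(T, X)."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

# (yes, no) counts for each value of 'outlook'
outlook = {"sunny": [3, 2], "overcast": [4, 0], "rainy": [2, 3]}

gain_outlook = information_gain([9, 5], list(outlook.values()))
print(round(gain_outlook, 3))
```

Each subset is weighted by its share of the samples (5/14, 4/14, 5/14), which is the P(c) factor in E(T, X).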

Similarly:

Gain(temp) = 0.029
Gain(humidity) = 0.152
Gain(windy) = 0.048

Outlook has the highest information gain, so we split on it first.

Entropy(overcast) = 0, so the overcast subset becomes a leaf node.


# Gini Index

A measure of impurity.

$$Gini = 1 - \sum_{i=1}^{c} P(i)^2$$

The lower the impurity after splitting on a column, the better that column is to split on.

If sunny (3 yes) and (2 no)
gini_impurity(sunny) = 1 - (3/5)^2 - (2/5)^2 = 12/25

If overcast (4 yes)
gini_impurity(overcast) = 1 - (4/4)^2 - (0/4)^2 = 0 (pure leaf node)

If rainy (2 yes) and (3 no)
gini_impurity(rainy) = 1 - (2/5)^2 - (3/5)^2 = 12/25

Gini_impurity(outlook) = P(sunny)*gini_impurity(sunny) + P(overcast)*gini_impurity(overcast) + P(rainy)*gini_impurity(rainy)

Gini_impurity(outlook) = 5/14 * 12/25 + 4/14 * 0 + 5/14 * 12/25 = 0.3429
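
The weighted Gini impurity for 'outlook' can be computed the same way. This is a sketch only; `gini` and `weighted_gini` are hypothetical helper names, and the subsets are given as (yes, no) counts.

```python
def gini(counts):
    """Gini impurity of a subset, given the count of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(subsets):
    """Impurity of a column: each subset's Gini weighted by its share of samples."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * gini(s) for s in subsets)

# (yes, no) counts for sunny, overcast, and rainy
outlook_subsets = [[3, 2], [4, 0], [2, 3]]

gini_outlook = weighted_gini(outlook_subsets)
print(round(gini_outlook, 4))
```

Unlike entropy, the Gini formula needs no special case for empty classes, because a zero count simply adds 0 to the sum.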

Gini_impurity(temp) = 0.4403, Gini_impurity(humidity) = 0.3673, Gini_impurity(windy) = 0.4285

The impurity after splitting on 'outlook' is lower than for the other columns, so we apply our initial split on 'outlook'.

This gives 3 subsets: sunny, overcast, and rainy.
Overcast has an impurity of 0, so it becomes a leaf node.
Any sample whose outlook is overcast is predicted as play = yes.
We apply the split based on Gini_impurity recursively on each node until it becomes a leaf node.
We can stop splitting a node when the split does not reduce the impurity, that is, when the impurity before the split is no greater than the weighted impurity after it.
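
The recursive procedure can be sketched as follows. This is an illustrative sketch rather than a reference implementation: `build_tree`, `split_rows`, and `weighted_gini` are hypothetical names, and the six rows are a made-up toy dataset, not the 14-sample table used in the notes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of target labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_rows(rows, attribute):
    """Group rows by the value they take for the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups

def weighted_gini(rows, attribute, target):
    """Weighted Gini impurity after splitting rows on the attribute."""
    groups = split_rows(rows, attribute)
    return sum(
        len(group) / len(rows) * gini([r[target] for r in group])
        for group in groups.values()
    )

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    node_impurity = gini(labels)
    # Leaf: the node is pure, or there is nothing left to split on
    if node_impurity == 0.0 or not attributes:
        return majority
    best = min(attributes, key=lambda a: weighted_gini(rows, a, target))
    # Stop when the best split does not reduce the impurity
    if weighted_gini(rows, best, target) >= node_impurity:
        return majority
    remaining = [a for a in attributes if a != best]
    return {
        best: {
            value: build_tree(group, remaining, target)
            for value, group in split_rows(rows, best).items()
        }
    }

# Hypothetical toy rows, for illustration only
rows = [
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True, "play": "no"},
]

tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

A leaf is represented by its majority label and an internal node by a nested dictionary keyed on the attribute's values. In this sketch an attribute is not reused further down the same branch, which is one common choice for categorical attributes.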
