# The building blocks of a Decision Tree classifier

### How to calculate [entropy](https://bricaud.github.io/personal-blog/entropy-in-decision-trees/)

To measure how much disorder there is in the data, we calculate its entropy.

First, consider an attribute that divides the data into two labels, $n$ and $m$. Their probabilities satisfy:


$$ p(n) = 1 - p(m) $$

since the probabilities of $n$ and $m$ sum to $ 1 $. Writing $ q = p(n) = \frac{|n|}{|n|+|m|} $ and $ r = p(m) = \frac{|m|}{|n|+|m|} $, we define their entropy as:

$$ H(m,n) = - q \log(q) - r \log(r) $$

This generalizes to an attribute that divides the data into $ K $ labels:

$$ H = - \sum_{i=1}^K p_i \log(p_i) $$
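The general formula can be sketched as a small Julia function. This is a minimal sketch, not a library routine: the name `entropy` is my own, and it assumes base-2 logarithms (so the result is in bits) and the usual convention that a label with probability zero contributes nothing to the sum.

```julia
# Shannon entropy (in bits) of a discrete distribution given as probabilities.
# Terms with p == 0 are skipped, following the convention 0 * log(0) = 0.
function entropy(probs)
    h = 0.0
    for p in probs
        if p > 0
            h -= p * log2(p)
        end
    end
    return h
end

entropy([0.5, 0.5])                  # 1.0: an even two-way split is maximally disordered
entropy([1.0, 0.0])                  # 0.0: a pure set has no disorder
entropy([0.25, 0.25, 0.25, 0.25])    # 2.0: four equally likely labels
```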

### How to pick nodes
 - A chosen attribute $A$, with $K$ distinct values divides the training set $E$ into subsets $E_1,...E_K$.
 - The __Expected Entropy__ (__EH__) remaining after trying attribute $A$ (with branches $i=1,2,...,K$) is:
 
 $$ EH(A) = \sum_{i=1}^{K} \frac{p_i + n_i}{p + n} \cdot H \left( \frac{p_i}{p_i + n_i},\frac{n_i}{p_i+n_i} \right) $$
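 - Where $ p_i $ is the number of positive examples and $ n_i $ the number of negative examples in subset $ E_i $ (the datums that descend along branch $ i $), while $ p $ and $ n $ are the total numbers of positive and negative examples in $ E $.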
 
 - The __Information Gain__ (__I__), or reduction in entropy, for this attribute is:
 
 $$ I(A) = H \left( \frac{p}{p+n},\frac{n}{p+n} \right) - EH(A) $$
 
 - Choose the attribute with the highest I.
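
The steps above can be sketched in Julia as follows. The function names and the representation of a split as a list of `(p_i, n_i)` count pairs are assumptions made for illustration, and the two example splits at the end are made-up counts rather than values from any dataset.

```julia
# Entropy in bits of a two-class set with p positive and n negative examples.
# A zero count contributes nothing (convention 0 * log(0) = 0).
function binary_entropy(p, n)
    h = 0.0
    for c in (p, n)
        if c > 0
            q = c / (p + n)
            h -= q * log2(q)
        end
    end
    return h
end

# Expected entropy EH(A): `branches` holds one (p_i, n_i) pair per value of A.
function expected_entropy(branches)
    total = 0
    for (p_i, n_i) in branches
        total += p_i + n_i
    end
    eh = 0.0
    for (p_i, n_i) in branches
        eh += (p_i + n_i) / total * binary_entropy(p_i, n_i)
    end
    return eh
end

# Information gain I(A) = H(p/(p+n), n/(p+n)) - EH(A).
function information_gain(branches)
    p = 0
    n = 0
    for (p_i, n_i) in branches
        p += p_i
        n += n_i
    end
    return binary_entropy(p, n) - expected_entropy(branches)
end

information_gain([(4, 0), (0, 4)])   # 1.0: every branch is pure
information_gain([(2, 2), (2, 2)])   # 0.0: the split tells us nothing
```

To pick a node, one would evaluate `information_gain` for every candidate attribute and keep the one with the largest value.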
 
 <mark>TODO: Understand what the p and q (without subscripts) represent above. Also, watch the Google video on creating a classifier now!</mark>

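```julia
# #Pkg.add("CSV")

# Read the CSV into a matrix; the first row holds the column names.
# readcsv is the Julia 0.6 API; on Julia 1.x use
# `using DelimitedFiles; readdlm("training_data.csv", ',')` instead.
training_data = readcsv("training_data.csv")
show(training_data)
```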

```
13×12 Array{Any,2}:
 "Datum"  "Alt"  "Bar"  "Fri"  "Hun"  …  "Type"     "Est"    "WillWait"
 "X1"     "T"    "F"    "F"    "T"       "French"   "0-10"   "T"
 "X2"     "T"    "F"    "F"    "T"       "Thai"     "30-60"  "F"
 "X3"     "F"    "T"    "F"    "F"       "Burger"   "0-10"   "T"
 "X4"     "T"    "F"    "T"    "T"       "Thai"     "10-30"  "T"
 "X5"     "T"    "F"    "T"    "F"    …  "French"   ">60"    "F"
 "X6"     "F"    "T"    "F"    "T"       "Italian"  "0-10"   "T"
 "X7"     "F"    "T"    "F"    "F"       "Burger"   "0-10"   "F"
 "X8"     "F"    "F"    "F"    "T"       "Thai"     "0-10"   "T"
 "X9"     "F"    "T"    "T"    "F"       "Burger"   ">60"    "F"
 "X10"    "T"    "T"    "T"    "T"    …  "Italian"  "10-30"  "F"
 "X11"    "F"    "F"    "F"    "F"       "Thai"     "0-10"   "F"
 "X12"    "T"    "T"    "T"    "T"       "Burger"   "30-60"  "T"
```

For the above data I seek to find the attribute that best splits the data into "WillWait" or not. 

For a training set like this with positive and negative examples we can use the following:

$$ H \left( \frac{p}{p+n}, \frac{n}{p+n} \right) = - \frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n} $$

For this training set, let's consider how _good_ the attributes "Patron" and "Type" are at splitting the "WillWait" values:

For the training set, $ p = n = 6 $ (there are six positive and six negative "WillWait" values), so $ H \left( \frac{6}{12}, \frac{6}{12} \right) = 1 $ bit.
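
As a quick check of that starting entropy, here is a minimal sketch that counts the labels and evaluates the two-class formula. The `will_wait` vector is typed in by hand from the last column of the table above; with the loaded matrix one could presumably take `training_data[2:end, end]` instead, but that is an assumption about how the file was read.

```julia
# "WillWait" column copied by hand from the table above (rows X1 to X12).
will_wait = ["T", "F", "T", "T", "F", "T", "F", "T", "F", "F", "F", "T"]

p = count(x -> x == "T", will_wait)   # positive examples
n = count(x -> x == "F", will_wait)   # negative examples

q = p / (p + n)
# Two-class entropy in bits; this direct form is valid here because 0 < q < 1.
H = -q * log2(q) - (1 - q) * log2(1 - q)

println("p = $p, n = $n, H = $H bits")   # prints: p = 6, n = 6, H = 1.0 bits
```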

$$ IG(Patron) = H \left( \frac{6}{12}, \frac{6}{12} \right) - EH(Patron) = 1 - EH(Patron) $$

$$ IG(Patron) = 1 - \left[ \frac{2}{12}H \left( \frac{0}{2}, \frac{2}{2} \right) + \frac{4}{12}H \left( \frac{4}{4}, \frac{0}{4} \right) + \frac{6}{12}H\left( \frac{2}{6}, \frac{4}{6} \right) \right] \approx 0.541 \text{ bits} $$
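
The arithmetic can be checked with a short Julia sketch. The helper names are my own. The Patron counts are read off the fractions in the formula above (None, Some, Full), and the Type counts are tallied by hand from the "Type" and "WillWait" columns of the table, so treat them as my reading of the data rather than output from the notebook.

```julia
# Two-class entropy in bits; a zero count contributes nothing (0 * log(0) = 0).
function binary_entropy(p, n)
    h = 0.0
    for c in (p, n)
        if c > 0
            h -= c / (p + n) * log2(c / (p + n))
        end
    end
    return h
end

# Expected entropy left after a split, given one (p_i, n_i) pair per branch.
function remaining(branches, total)
    return sum([(a + b) / total * binary_entropy(a, b) for (a, b) in branches])
end

patron = [(0, 2), (4, 0), (2, 4)]                # None, Some, Full
type_counts = [(1, 1), (2, 2), (2, 2), (1, 1)]   # French, Thai, Burger, Italian

# The starting entropy of the 12 examples is 1 bit (p = n = 6).
gain_patron = 1.0 - remaining(patron, 12)        # about 0.541 bits
gain_type = 1.0 - remaining(type_counts, 12)     # about 0 bits

println("IG(Patron) = ", gain_patron)
println("IG(Type)   = ", gain_type)
```

Under these counts every "Type" branch is an even split, so it removes no uncertainty, while "Patron" leaves only the "Full" branch impure.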