## Understanding Trees

### Classification

![title](Images\Trees_1.PNG)

* Start at the top (root) node. If the node's condition is true, go left; otherwise go right.

* Repeat until you reach a leaf node, which gives the final prediction. Ex: if petal length <= 2.45, you go left and predict setosa, since that node is a leaf.

* "samples" tells how many training instances the node applies to.

* "value" tells how many training instances of each class fall in the node.

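* gini : impurity at that node
    * It is the chance of mislabeling a training instance of the node if its label were drawn at random from the node's class distribution (related to, but not the same as, the misclassification rate)
    
    * A lower value implies a purer node (fewer misclassifications); gini = 0 means every instance in the node belongs to one class
    * gini = $ 1 - \sum_k p_{i, k} ^2 $

    * $p_{i, k}$ is the ratio of class k instances among the training data in the ith node
    
    * Ex: Gini at the depth-2 left node: $ 1 - (0/54)^2- (49/54)^2- (5/54)^2 = 0.168 $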

* Probabilities: a tree can estimate class probabilities from the class ratios of the training data in the leaf node
    * Ex: Assume test data : a flower with petals 5 cm long and 1.5 cm wide

    * This will go to the middle leaf (the depth-2 left node) and output the probabilities:
        * Setosa : 0/54 = 0%
        * Versicolor : 49/54 = 91%
        * Virginica : 5/54 = 9%

    * Same as the output of `clf.predict_proba([[5, 1.5]])`
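
The reading of the tree above can be checked with a minimal sketch. It assumes the figure comes from scikit-learn's `DecisionTreeClassifier` fitted on the iris petal length and width with `max_depth=2`; the dataset loader, `random_state=42`, and the variable names are assumptions for illustration, not taken from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the tree in the figure uses only petal length and petal width (cm).
iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X, y)

sample = [[5.0, 1.5]]  # petal length 5 cm, petal width 1.5 cm

# Class probabilities are the class ratios of the training data in the leaf.
proba = clf.predict_proba(sample)[0]

# Index of the leaf the sample lands in, its size, and its stored impurity.
leaf = clf.apply(sample)[0]
n_leaf = clf.tree_.n_node_samples[leaf]
gini_leaf = clf.tree_.impurity[leaf]

# Gini recomputed by hand from the class ratios: 1 - sum_k p_k^2.
gini_from_proba = 1.0 - sum(p ** 2 for p in proba)

predicted = iris.target_names[clf.predict(sample)[0]]
print(predicted, proba, n_leaf, round(gini_leaf, 3))
```

Under these assumptions, `proba` should reproduce the 0 / 91% / 9% ratios worked out by hand, and the impurity stored for the leaf should agree with the gini formula.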

## CART Algorithm:

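* At each node, CART searches for the feature $k$ and threshold $t_k$ that produce the purest pair of subsets, with each subset's impurity weighted by its size. It picks the split that minimizes:

* $ J = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right} $  

* $m_{left}$, $m_{right}$ : number of training instances in the left, right subset ($m$ is the number in the node being split)

* $G_{left}$, $G_{right}$ : impurity of the left, right subset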
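
The split search can be sketched in plain Python. This is an illustrative brute-force version, not scikit-learn's implementation: the function names (`gini`, `split_cost`, `best_split`), the use of midpoints between sorted values as candidate thresholds, and the toy data are all assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    m = len(labels)
    if m == 0:
        return 0.0
    return 1.0 - sum((count / m) ** 2 for count in Counter(labels).values())


def split_cost(left, right):
    """Size-weighted impurity J of a candidate split."""
    m = len(left) + len(right)
    return len(left) / m * gini(left) + len(right) / m * gini(right)


def best_split(X, y):
    """Try every feature k and midpoint threshold t; return (J, k, t) with lowest J."""
    best = None
    for k in range(len(X[0])):
        values = sorted({row[k] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[k] <= t]
            right = [label for row, label in zip(X, y) if row[k] > t]
            cost = split_cost(left, right)
            if best is None or cost < best[0]:
                best = (cost, k, t)
    return best


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 6.0], [6.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
cost, k, t = best_split(X, y)
print(cost, k, t)
```

On this toy data the search should pick feature 0 with a threshold between 3 and 4, since that split leaves both subsets pure.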

## Gini Vs Entropy:

* Gini, as seen above, measures the impurity of a node from the class ratios of the training data in it.

* Entropy: 
    * In thermodynamics, entropy is a measure of disorder. It approaches zero when molecules are still and well ordered

    * In ML, the entropy of a node is zero if the node contains only one class.

    * $ H_i = - \sum_k p_{i,k} \log_2 (p_{i,k}) $, summing only over classes with $p_{i,k} \neq 0$
    * For the depth-2 left node:
    
        * $ -(49/54) \log_2(49/54) - (5/54) \log_2(5/54) = 0.445 $

        * Note the base-2 logarithm (log2), not the natural log

* Which one to use?
    * Both usually lead to similar trees. Gini is slightly faster to compute

    * Gini tends to isolate the most frequent class in its own branch, while entropy tends to create slightly more balanced trees.
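
As a quick check of the arithmetic, here is a small sketch that computes both measures from the class counts of the depth-2 left node. The helper names `entropy` and `gini` are assumptions for illustration; classes with zero count are skipped in the entropy sum.

```python
from math import log2


def entropy(counts):
    """Entropy (base 2) of a node given its per-class counts; empty classes are skipped."""
    m = sum(counts)
    return -sum((c / m) * log2(c / m) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    m = sum(counts)
    return 1.0 - sum((c / m) ** 2 for c in counts)


node_counts = [0, 49, 5]  # setosa, versicolor, virginica at the depth-2 left node
h = entropy(node_counts)
g = gini(node_counts)
print(round(h, 3), round(g, 3))
```

Both values rise and fall together for this node, which is consistent with the two criteria usually producing similar trees.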
    


## Hyperparameters:

* max_depth : maximum depth of the tree
* min_samples_split : minimum number of samples a node must have before it can be split
* min_samples_leaf : minimum number of samples a leaf node must have
* min_weight_fraction_leaf : same as min_samples_leaf but expressed as a fraction of the total number of weighted instances
* max_leaf_nodes : the maximum number of leaf nodes
* max_features : the maximum number of features that are evaluated for splitting at each node
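
A minimal sketch of how these constraints might be passed to scikit-learn's `DecisionTreeClassifier`. The specific values, the iris dataset, and `random_state=42` are arbitrary assumptions chosen only to show the effect, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument restricts how far the tree may grow (values are illustrative).
clf = DecisionTreeClassifier(
    max_depth=3,           # at most 3 levels of splits
    min_samples_split=10,  # a node needs >= 10 samples to be split
    min_samples_leaf=5,    # every leaf keeps >= 5 samples
    max_leaf_nodes=6,      # at most 6 leaves in total
    max_features=2,        # evaluate at most 2 features per split
    random_state=42,
)
clf.fit(X, y)

# Leaves are the nodes without a left child (marked -1 in the fitted tree).
is_leaf = clf.tree_.children_left == -1
leaf_sizes = clf.tree_.n_node_samples[is_leaf]
print(clf.get_depth(), clf.get_n_leaves(), leaf_sizes.min())
```

Raising the `min_*` values or lowering the `max_*` values constrains the tree more, which generally reduces overfitting at the cost of a coarser fit.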


### Regression

![title](Images\Trees_2.PNG)

* The approach is similar, but the prediction is the average target value of the training instances in that leaf node. 

* "mse" at a node is the mean squared error of that prediction over the training instances in the node

#### CART Algorithm for Regression:

![title](Images\Trees_3.PNG)


* Instead of minimizing impurity as in classification, the algorithm picks the split that minimizes the size-weighted MSE in regression

* $ J = \frac{m_{left}}{m} MSE_{left} + \frac{m_{right}}{m} MSE_{right} $  

    * where $ MSE_{node} = \frac{1}{m_{node}} \sum_{i \in node} (\hat y_{node} - y_i)^2 $
    
    * $ \hat y_{node} = \frac{1}{m_{node}} \sum_{i \in node} y_i $, with $m_{node}$ the number of training instances in the node
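
The regression cost can be worked through on a tiny example. This sketch assumes the MSE of a node is the mean of the squared errors around the node's average target; the function names and the target values are made up for illustration.

```python
def node_mse(targets):
    """Mean squared error of a node around its own mean prediction."""
    y_hat = sum(targets) / len(targets)  # prediction = average target in the node
    return sum((y_hat - t) ** 2 for t in targets) / len(targets)


def regression_split_cost(left, right):
    """Size-weighted MSE J of a candidate split."""
    m = len(left) + len(right)
    return len(left) / m * node_mse(left) + len(right) / m * node_mse(right)


left_targets = [1.0, 2.0, 3.0]  # mean 2.0, MSE 2/3
right_targets = [10.0, 12.0]    # mean 11.0, MSE 1.0
cost = regression_split_cost(left_targets, right_targets)
print(cost)
```

Here J = (3/5)(2/3) + (2/5)(1.0) = 0.8; CART would compare this value against the cost of every other candidate split and keep the lowest.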


## Tree Problems:

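* Orthogonal decision boundaries:
    * Splits are always perpendicular to a feature axis, so if data that is linearly separable along the x direction is rotated by 45 deg, the decision boundary becomes an unnecessarily complicated staircase. 
    * Applying PCA before the tree can often solve this, since it reorients the data so that a single split may suffice

* Sensitive to small variations in data:
    * Small changes in the training data can alter the boundaries a lot
    * Hence, random forest is preferred: it averages the predictions of many trees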
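
The rotation problem can be illustrated with a small experiment. This is a sketch on synthetic data: the random points, the seed, and the use of tree depth as a rough measure of boundary complexity are all assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: two classes separated by the vertical line x0 = 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] > 0).astype(int)

# Rotate every point by 45 degrees; the classes stay linearly separable,
# but the separating line is no longer parallel to a feature axis.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
X_rot = X @ R.T

# Unrestricted trees grow until every leaf is pure.
tree_axis = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_rot = DecisionTreeClassifier(random_state=42).fit(X_rot, y)

depth_axis = tree_axis.get_depth()
depth_rot = tree_rot.get_depth()
print(depth_axis, depth_rot)
```

The axis-aligned data needs a single split, while the rotated copy needs a deeper tree to approximate the diagonal boundary with a staircase of axis-parallel splits.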
    


## Points to note : 

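* Gini impurity generally decreases from the root towards the leaves. This is because the cost function is set up so that each split lowers the size-weighted impurity of the children; an individual child can still be more impure than its parent.

* Both gini and entropy create similar trees. Entropy tends to produce slightly more balanced trees

* Trees have orthogonal (axis-aligned) decision boundaries and do not need feature scaling; they are also fairly insensitive to outliers in the features.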