In this notebook, I go through the basic concepts of decision trees and show how to implement them with sklearn.

# TREES
1. CART (algorithm): classification and regression trees
    - decision trees for classification
        - to measure impurity of nodes: Gini index and cross-entropy
    - regression trees
        - to measure impurity of nodes: mean squared error and mean absolute error
        
2. Conditional Inference Trees
    - a little more stable than CART
    - select the best split while correcting for multiple-hypothesis testing
    - fairer to categorical variables
    - only available in R so far
3. Different splitting methods
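
The impurity measures named in item 1 can be written out in a few lines of NumPy. This is only a sketch of the textbook formulas; the function names are made up for illustration and are not sklearn's internal implementation.

```python
import numpy as np

def class_proportions(labels):
    """Fraction of samples in each class present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_index(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def cross_entropy(labels):
    """Cross-entropy in bits: -sum_k p_k * log2(p_k)."""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def mse_impurity(targets):
    """Mean squared error around the node mean (regression trees)."""
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - targets.mean()) ** 2)

def mae_impurity(targets):
    """Mean absolute error around the node median (regression trees)."""
    targets = np.asarray(targets, dtype=float)
    return np.mean(np.abs(targets - np.median(targets)))

pure_node = [1, 1, 1, 1]    # only one class: impurity is zero
mixed_node = [0, 0, 1, 1]   # 50/50 split: impurity is maximal for two classes
print(gini_index(pure_node), gini_index(mixed_node))
print(cross_entropy(mixed_node))
print(mse_impurity([1.0, 2.0, 3.0]), mae_impurity([1.0, 2.0, 3.0]))
```

A split is good when the weighted impurity of the child nodes is much lower than the impurity of the parent.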

## CART algorithm
### 1. Visualizing trees

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# from sklearn.tree import plot_tree

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
                                                    stratify=cancer.target, random_state=0)

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, Y_train)
tree.feature_importances_   # how much each feature contributes to decreasing the impurity

# plot = plot_tree(tree, feature_names = cancer.feature_names)

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.87337408, 0.        , 0.        ,
       0.        , 0.        , 0.12662592, 0.        , 0.        ])
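
The raw importance array is hard to read without feature names. As a sketch, the nonzero entries can be paired with `cancer.feature_names`, and the tree can be printed as text with `export_text`. Setting `random_state=0` on the tree is an assumption added here for reproducibility, so the numbers may differ slightly from the output above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# random_state=0 on the tree is an assumption made for reproducibility
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)

# Pair each nonzero importance with its feature name, largest first
importances = tree.feature_importances_
used = np.flatnonzero(importances)
ranked = sorted(zip(cancer.feature_names[used], importances[used]),
                key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")

# Text rendering of the fitted tree, an alternative to plot_tree
print(export_text(tree, feature_names=list(cancer.feature_names)))
```

A tree with `max_depth=2` has at most three internal nodes, so at most three features can have a nonzero importance.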

### 2. Hyperparameter tuning

#### 2.1 Pre-pruning: limit the size of the tree while building it by tuning hyperparameters such as max_depth, max_leaf_nodes, min_samples_split, min_impurity_decrease, ... (in sklearn)
- most useful hyperparameter: max_leaf_nodes, because it directly states how finely we want to split the feature space

In [2]:

from sklearn.model_selection import GridSearchCV
hyperparam = {'max_depth':range(1,7)}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),param_grid = hyperparam, cv=10 )
grid.fit(X_train, Y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': range(1, 7)}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)
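
Once the grid search is fitted, the selected depth and its scores can be inspected. The sketch below repeats the setup from the cells above so that it runs on its own; it only uses the standard `best_params_`, `best_score_` and `score` members of `GridSearchCV`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': range(1, 7)}, cv=10)
grid.fit(X_train, Y_train)

print(grid.best_params_)                  # depth chosen by 10-fold cross-validation
print(grid.best_score_)                   # mean cross-validated accuracy at that depth
test_accuracy = grid.score(X_test, Y_test)  # accuracy of the refitted best tree
print(test_accuracy)
```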

#### 2.2 Post-pruning: first build the full tree, then shrink it back
- most popular method: cost-complexity pruning (similar to regularization)

In [None]:

    
from sklearn.model_selection import GridSearchCV
import numpy as np

# hyperparam = {'ccp_alpha':np.linspace(0,0.03, 20)}
# grid = GridSearchCV(DecisionTreeClassifier(random_state=0),param_grid = hyperparam, cv=10 )
# grid.fit(X_train, Y_train)


# more efficient pruning: compute the sequence of effective alphas from the full tree
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, Y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
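
One possible way to use the pruning path is to cross-validate only the alphas it returns, instead of an arbitrary grid. The sketch below assumes a sklearn version that supports `ccp_alpha`; dropping the last alpha and clipping at zero are choices made here, not part of the original cell.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, Y_train)

# The largest alpha prunes the tree down to a single node, so it is dropped;
# clipping guards against tiny negative values from floating-point error
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# Cross-validate one tree per candidate alpha
mean_scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                    X_train, Y_train, cv=10).mean()
    for alpha in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(mean_scores))]

full = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, Y_train)
print(best_alpha, full.get_n_leaves(), pruned.get_n_leaves())
print(full.score(X_test, Y_test), pruned.score(X_test, Y_test))
```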


### 3. Issues and considerations in CART
- 1. Extrapolation
        - consider x = year, y = price; if we have a fitted linear model y = a x + b, we can 
          extend the prediction line to predict the price for upcoming years. A regression tree cannot 
          do this: it always predicts the mean of the training targets in a leaf, so for upcoming years 
          it just returns the value of the outermost leaf and does not extrapolate to new values
        - in practice this may not be that bad, because extrapolation is really hard, but it is something 
          we need to keep in mind
- 2. Relation to nearest neighbors
    - both predict the average of neighbors, defined either by k, by an epsilon ball, or by leaf (neighbors 
      in a tree: the samples in the same leaf)
    - trees are much faster
    - both can't extrapolate (because we are computing means)
    
- 3. Instability: the tree structure is very noisy and depends on how the training dataset is sampled
     - changing the random_state in train_test_split (from 0 to 1), which changes how the data is split into 
       training and test sets, gives different trees even for the iris dataset (which is very simple); even 
       the root nodes can use completely different features
    - so building a tree and deciding what matters most in our dataset based on the feature 
      in the root node is not a good idea: trees are not stable enough to rely on. Split the data 
      slightly differently, and the root node and the whole structure of the tree 
      can change
      
- 4. Feature importance (tree.feature_importances_): a summary of the tree structure that says how much each  
     feature contributes to decreasing the impurity. However, it measures the impurity decrease on the 
     training set, and it is quite unstable: it changes depending on how we sample the dataset. There are more 
     robust ways to determine feature importance in trees.
      
- 5. Trees that handle categorical features natively are not implemented in sklearn; we can one-hot encode the categorical features and then use sklearn

- 6. We can use trees to predict probabilities (= fraction of each class in a leaf). Without pruning, all 
     probabilities are either 0 or 1 (overfitting). Even with pruning, the probabilities might be too extreme and 
     the tree will still be overconfident. Calibration is one way to get good probabilities out of trees.
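
Two of the points above, extrapolation (item 1) and 0/1 probabilities (item 6), are easy to check numerically. The following is a minimal sketch: the year/price data is made up for illustration, and the second part reuses the breast cancer split from earlier in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Item 1: extrapolation on made-up data where price grows linearly with year
years = np.arange(2000, 2020).reshape(-1, 1)
prices = 100.0 + 5.0 * (years.ravel() - 2000)
future = np.array([[2025], [2030]])

linear_future = LinearRegression().fit(years, prices).predict(future)
tree_future = DecisionTreeRegressor(random_state=0).fit(years, prices).predict(future)
print(linear_future)  # keeps following the trend
print(tree_future)    # stuck at the value of the last training leaf

# Item 6: predicted probabilities of an unpruned versus a depth-limited tree
cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)

unpruned_proba = unpruned.predict_proba(X_test)
shallow_proba = shallow.predict_proba(X_test)
print(np.unique(unpruned_proba))           # distinct probability values, unpruned
print(np.unique(shallow_proba.round(3)))   # distinct probability values, depth 2
```

The tree predicts the same price for 2025 and 2030 because both years fall into the leaf that contains the last training year.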

## Ensemble methods: bagging and boosting
Trees are very powerful, but they overfit easily and are very unstable
- Ensemble models use the wisdom of the crowd by averaging many models
- Hard or soft voting: majority vote over predicted classes, or average of predicted probabilities
     - soft voting works if all models are of the same type
     - if, for example, one is a neural network and the other a tree, soft voting is not a good choice
       because their probabilities are not calibrated
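
The difference between hard and soft voting can be seen on a tiny hand-made example. The probabilities below are invented for three hypothetical models; nothing here comes from a fitted classifier.

```python
import numpy as np

# Made-up predicted probabilities of class 1 from three hypothetical models
# (rows = models, columns = samples)
proba_class1 = np.array([
    [0.90, 0.80],
    [0.40, 0.70],
    [0.40, 0.60],
])

# Hard voting: each model votes for its predicted class, the majority wins
votes = (proba_class1 > 0.5).astype(int)
hard = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then threshold
soft = (proba_class1.mean(axis=0) > 0.5).astype(int)

print(hard)
print(soft)
```

For the first sample, two of three models vote for class 0, but the one confident model pulls the average probability above 0.5, so hard and soft voting disagree.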

### 1. Voting classifier 

In [3]:
# toy example, not something to use in the real world; just for illustration
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
voting = VotingClassifier([('logreg', LogisticRegression(C=100)), ('tree', DecisionTreeClassifier(max_depth=3, random_state = 0))],
                         voting = 'soft')
voting.fit(X_train, Y_train)
lr, tree = voting.estimators_
voting.score(X_test, Y_test), lr.score(X_test, Y_test), tree.score(X_test, Y_test)




(0.9370629370629371, 0.9370629370629371, 0.916083916083916)

### 2. Bagging (or Bootstrap Aggregation)
- Reduces variance (helps avoid overfitting)
- Bagging classifier and bagging regressor (sklearn: BaggingClassifier and BaggingRegressor)
- Algorithm:
    1. Create bootstrap samples of the training data (draw with replacement)
    2. Train a model on each bootstrap sample
    3. Average the results
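
The three steps above can be sketched by hand with NumPy and plain decision trees. This is an illustrative version only; in practice `BaggingClassifier` does the same job. The number of models (50) and the seeds are arbitrary choices made here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

rng = np.random.default_rng(0)
n_models = 50                 # arbitrary choice for this sketch
n_train = X_train.shape[0]

probas = []
for _ in range(n_models):
    # Step 1: bootstrap sample, drawn with replacement, same size as the training set
    idx = rng.integers(0, n_train, size=n_train)
    # Step 2: train one tree on this sample
    model = DecisionTreeClassifier(random_state=0).fit(X_train[idx], Y_train[idx])
    probas.append(model.predict_proba(X_test))

# Step 3: average the predicted probabilities and pick the most likely class
# (column index equals class label here because the classes are 0 and 1)
avg_proba = np.mean(probas, axis=0)
bagged_pred = avg_proba.argmax(axis=1)
bagged_accuracy = np.mean(bagged_pred == Y_test)

single_accuracy = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train).score(X_test, Y_test)
print(single_accuracy, bagged_accuracy)
```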

### 3. Variance and Bias
The accuracy of a model depends on how the training dataset is sampled from the true data distribution.
- Variance: depending on how we sample the training dataset, we get very different predictions
    - trees are an example of high-variance models
- Bias: the predictions are systematically off
    - example of a model with high bias: logistic regression on nonlinear data; logistic 
      regression will be systematically off, but it will also be very consistent and give nearly the same 
      results every time (because it is a very stable model)
      
- high-variance models: correct on average, but the predictions depend a lot on how we sample the dataset
- high-bias models are systematically wrong
- high bias and high variance: for example, linear regression on a nonlinear problem in a very 
  high-dimensional space (in high dimensions, linear regression without regularization is very noisy, and if
  the data is not linear, the model will also be systematically off)


- we want models with low variance and low bias (no matter how we sample the dataset, they give the 
  same predictions, and those predictions are accurate on average)
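
Bias and variance can be estimated by simulation: draw many training sets from a known distribution, refit the model each time, and look at how the predictions at fixed points behave. The sketch below uses a made-up one-dimensional problem (a sine curve with Gaussian noise); the sample sizes and noise level are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_function(x):
    """Made-up nonlinear ground truth."""
    return np.sin(2 * np.pi * x)

x_eval = np.linspace(0.05, 0.95, 50).reshape(-1, 1)   # fixed evaluation points
n_repeats, n_train, noise_sd = 200, 50, 0.3

def bias_and_variance(make_model):
    """Estimate squared bias and variance of a model over resampled training sets."""
    preds = np.empty((n_repeats, len(x_eval)))
    for i in range(n_repeats):
        # Draw a fresh training set from the true distribution
        x = rng.uniform(0, 1, size=(n_train, 1))
        y = true_function(x.ravel()) + rng.normal(0, noise_sd, size=n_train)
        preds[i] = make_model().fit(x, y).predict(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_function(x_eval.ravel())) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

tree_bias_sq, tree_var = bias_and_variance(DecisionTreeRegressor)
linear_bias_sq, linear_var = bias_and_variance(LinearRegression)
print("tree:   bias^2 =", tree_bias_sq, " variance =", tree_var)
print("linear: bias^2 =", linear_bias_sq, " variance =", linear_var)
```

In this setup the unpruned tree should show the higher variance and the linear model the higher bias, which matches the description above.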

