# Classification and Regression Trees

* Decision trees are supervised learning models used for problems involving classification and regression. 
* Tree models offer high flexibility, which comes at a price: on one hand, trees are able to capture complex non-linear relationships; on the other hand, they are prone to memorizing the noise present in a dataset. 
* By aggregating the predictions of trees that are trained differently, ensemble methods take advantage of the flexibility of trees while reducing their tendency to memorize noise. 
* Ensemble methods are used across a variety of fields and have a proven track record of winning many machine learning competitions. 


* **Classification and Regression Trees (CART)** are a set of supervised learning models used for problems involving classification and regression.
* Given a labeled dataset, a classification tree learns a sequence of if-else questions about individual features in order to infer the labels.
    * **Objective: infer class labels**
* Able to capture non-linear relationships between features and labels
* Don't require feature scaling (e.g., standardization)
* When a classification tree is trained, the tree learns a sequence of if-else questions, with each question involving one feature and one split point.
* The maximum number of branches separating the root from the farthest leaf is known as the `maximum depth`. For example, `max_depth=2` means at most 2 levels of branches: 2 levels of if-else questions followed by a level of leaves (3 levels of nodes in total).

```
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X (feature matrix) and y (class labels) are assumed to be already loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# Fit a depth-2 classification tree and evaluate it on the held-out test set
dt = DecisionTreeClassifier(max_depth=2, random_state=1)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
accuracy_score(y_test, y_pred)
```
* `stratify=y` makes the train and test sets keep the same proportion of class labels as the unsplit dataset.
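
The effect of `stratify=y` can be checked on a small made-up dataset. The sketch below assumes hypothetical imbalanced labels (80 samples of class 0, 20 of class 1) and a placeholder feature column; none of these values come from the notes above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature, not meaningful

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Fraction of class 1 in the full, training, and test label sets
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(full_ratio, train_ratio, test_ratio)
```

Without `stratify`, the class proportions in the two splits would generally drift away from the proportion in the full dataset.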


* **Decision Regions:** A classification model divides the feature space into regions where all instances in one region are assigned to only one class label. These regions are known as **decision regions**.
* **Decision Boundary:** surface (line, plane, hyperplane) separating different decision regions. 
* A classification tree produces rectangular decision regions in the feature space.
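
The rectangular shape of the regions follows from each question testing a single feature against a single threshold. As a minimal sketch on synthetic data (the two features `x0` and `x1` and the labeling rule are assumptions for illustration), the fitted tree can be inspected directly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic 2-D data: the label depends only on whether x0 exceeds 0.5
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] > 0.5).astype(int)

dt = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

# Each internal node stores one feature index and one threshold
root_feature = dt.tree_.feature[0]
root_threshold = dt.tree_.threshold[0]
depth = dt.get_depth()

# Text rendering of the learned if-else questions
print(export_text(dt, feature_names=['x0', 'x1']))
```

Every boundary in the printed rules has the form `feature <= threshold`, so it is perpendicular to one axis of the feature space.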

#### Classification-Tree Learning
* **Decision Tree:** data structure consisting of a hierarchy of nodes.
* **Node:** question or prediction.
* **Root:** *no* parent node, question giving rise to *two* children nodes.
* **Internal Node:** *one* parent node, question giving rise to *two* children nodes.
* **Leaf:** *one* parent node, *no* children nodes $\Rightarrow$ *prediction*
* **Information Gain:** The nodes of a classification tree are grown recursively; in other words, whether a node becomes an internal node or a leaf depends on the state of its predecessors. To produce the purest leaves possible, at each node the tree asks a question involving one feature *f* and a split point *sp*. It picks the feature and split point that maximize the information gain (*IG*), i.e., the reduction in impurity obtained by the split.

* When a classification tree is trained on a labeled dataset, it learns patterns from the features so as to produce the purest leaves.
* When an **unconstrained tree** is trained, the nodes are grown recursively. In other words, a node is grown based on the state of its predecessors.
* At each non-leaf node, the data is split on feature *f* and split point *sp* so as to maximize *IG*.
* If the *IG* obtained by splitting a node is zero, the node is declared a leaf.
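
The split search described above can be sketched by hand for a single feature. This is an illustrative sketch, not scikit-learn's implementation: the helper names `gini` and `information_gain` are hypothetical, the toy data is made up, and Gini impurity is used as the impurity measure.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(feature, labels, split_point):
    """Impurity decrease obtained by splitting on feature <= split_point."""
    left = labels[feature <= split_point]
    right = labels[feature > split_point]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # nothing was separated, so nothing is gained
    n = len(labels)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_child

# Toy data: one feature, two classes
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Candidate split points: midpoints between consecutive feature values
candidates = (feature[:-1] + feature[1:]) / 2
gains = [information_gain(feature, labels, sp) for sp in candidates]
best_sp = candidates[int(np.argmax(gains))]
print(best_sp, max(gains))
```

Here the split at 3.5 separates the two classes completely, so both children are pure and the gain equals the parent's impurity of 0.5.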

`dt = DecisionTreeClassifier(criterion= 'gini', random_state=1)`

#### Decision Tree for Regression
* In regression, the target variable is continuous (a real value)

```
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# X (feature matrix) and y (continuous target) are assumed to be already loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# min_samples_leaf=0.1: each leaf must hold at least 10% of the training samples
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

# Test-set RMSE is the square root of the test-set MSE
mse_dt = MSE(y_test, y_pred)
rmse_dt = mse_dt**(1/2)
print(rmse_dt)
```

* When a regression tree is trained on a dataset, the impurity of a node is measured using the MSE of the targets in that node.
* This means that the Regression Tree tries to find the splits that produce the leaves where, in each leaf, the target values are, on average, the closest possible to the mean value of the labels in that particular leaf.
* As a new instance traverses the tree and reaches a leaf, its target variable, *y*, is computed as the average of the target variables contained in that leaf.
* Regression trees are more flexible than linear regression models (though they are also more prone to overfitting).
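
The claim that a leaf predicts the average of its training targets can be verified on a fitted tree. The sketch below assumes synthetic data (a noisy sine curve, chosen only for illustration) and uses the regressor's `apply` method to find which leaf each sample lands in.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, an assumption for illustration: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

dt = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

leaf_ids = dt.apply(X)   # index of the leaf reached by each training sample
preds = dt.predict(X)

# Largest difference, over all leaves, between the leaf's prediction
# and the mean of the training targets that fall into that leaf
max_gap = max(
    abs(preds[leaf_ids == leaf][0] - y[leaf_ids == leaf].mean())
    for leaf in np.unique(leaf_ids)
)
print(max_gap)
```

With the default squared-error criterion, the gap is zero up to floating-point error.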

* Common impurity measures for classification trees: Gini index and entropy.

#### Generalization Error
* In supervised learning, you make the assumption that there is a mapping, $f$, between features and labels:
\begin{equation}
y = f(x)
\end{equation}
where *f* is unknown.
* In reality, data generation is always accompanied by randomness, or noise.
* The goal of supervised learning is to find $\hat f$ that best approximates $f$.
* $\hat f$ can be Logistic Regression, Decision Tree, Neural Network...
* When training $\hat f$, you want to make sure that noise is discarded as much as possible.
* **End goal:** $\hat f$ should achieve a low predictive error on unseen datasets.

#### Difficulties in approximating $f$
* Two main difficulties:
    * **Overfitting:** $\hat f(x)$ fits the training set *noise*.
    * **Underfitting:** $\hat f$ is not flexible enough to approximate $f$; here, the training set error will be roughly equivalent to the test set error and both errors are relatively high. 
    
* **Generalization Error:** The generalization error of a model tells you how well it generalizes to unseen data
    * Can be decomposed into three terms: bias, variance, and irreducible error
    * Generalization error of $\hat f$ = $bias^2$ + $variance$ + irreducible error
    * irreducible error = error contribution of noise
    
* **Bias:** by how much are $\hat f$ and $f$ different?
    * High bias models lead to underfitting
* **Variance:** tells you how much $\hat f$ is inconsistent over different datasets. 
    * High variance models lead to overfitting
* **Model Complexity:** sets the flexibility of $\hat f$
    * Example: maximum tree depth, minimum samples per leaf
    * The goal is to find the model complexity that achieves the lowest generalization error 
    * Since this (generalization) error is the sum of three terms, with the irreducible error being constant, you need to find a balance between bias and variance (**"Bias-Variance Tradeoff"**) because as one increases, the other decreases.
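
One way to see the tradeoff is to vary model complexity and compare training and test error. The sketch below assumes synthetic data (a noisy sine curve) and an arbitrary list of depths; it illustrates the idea and is not a tuning recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data, an assumption for illustration: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit trees of increasing depth and record both errors
train_mse, test_mse = {}, {}
for depth in [1, 2, 4, 8, 16]:
    dt = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_mse[depth] = mean_squared_error(y_train, dt.predict(X_train))
    test_mse[depth] = mean_squared_error(y_test, dt.predict(X_test))
    print('max_depth={:2d}  train MSE={:.3f}  test MSE={:.3f}'.format(
        depth, train_mse[depth], test_mse[depth]))
```

Typically the training error keeps falling as depth grows, while the test error stops improving at some point and may rise again; the exact numbers depend on the data.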

#### Diagnosing Bias and Variance Problems
* How do you estimate $\hat f$'s generalization error?
* This cannot be done directly because:
    * $f$ is unknown
    * usually you only have one dataset
    * noise is unpredictable
* Solution:
    * split the data into training and test sets
    * train on training set, evaluate error of $\hat f$ on test set
    * **The test set should only be used to evaluate $\hat f$'s *final* performance.**
    * To obtain a reliable estimate of $\hat f$'s performance, use **Cross-Validation (CV).**
        * K-Fold CV
        * Hold-Out CV
* **K-Fold CV:**
    * If $\hat f$'s cross-validation error is greater than $\hat f$'s training error, then $\hat f$ suffers from **high variance** (overfitting).
        * Try decreasing model complexity
        * For example: reduce maximum tree depth or increase minimum samples per leaf (for Decision Trees)
        * Also try: gathering more data (if possible)
    * If $\hat f$'s CV error is roughly equal to the training set error, but much greater than desired error, then $\hat f$ suffers from **high bias** (underfitting).
        * Try increasing model complexity 
        * For example: increase max depth, decrease min samples per leaf (for Decision Trees)
        * Also try: gathering *more relevant* features
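
The two rules above can be written down as a small helper. This is only a sketch: the function `diagnose` and its `rel_tol` threshold for "roughly equal" are hypothetical choices, not part of scikit-learn or of the notes above.

```python
def diagnose(train_error, cv_error, desired_error, rel_tol=0.1):
    """Rough diagnosis from training error, CV error, and a target error.

    rel_tol is a hypothetical threshold: the CV error counts as clearly
    greater than the training error when the gap exceeds rel_tol * cv_error.
    """
    gap = cv_error - train_error
    if gap > rel_tol * cv_error:
        return 'high variance'   # CV error well above training error
    if cv_error > desired_error:
        return 'high bias'       # errors similar, but both too high
    return 'ok'

print(diagnose(train_error=1.0, cv_error=5.0, desired_error=2.0))
print(diagnose(train_error=4.9, cv_error=5.0, desired_error=2.0))
print(diagnose(train_error=1.9, cv_error=2.0, desired_error=2.5))
```

In practice the threshold and the desired error are judgment calls that depend on the problem.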

**K-Fold CV:**
Set `n_jobs=-1` to use all available CPU cores.

```
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

SEED = 123
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.14, random_state=SEED)

# 10-fold CV on the training set; sklearn returns negative MSE, so flip the sign
MSE_CV = -cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)

dt.fit(X_train, y_train)
y_predict_train = dt.predict(X_train)
y_predict_test = dt.predict(X_test)

print('CV MSE: {:.2f}'.format(MSE_CV.mean()))
print('Train MSE: {:.2f}'.format(MSE(y_train, y_predict_train)))
print('Test MSE: {:.2f}'.format(MSE(y_test, y_predict_test)))
```

```
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
print('CV RMSE: {:.2f}'.format(RMSE_CV))
```

```
from sklearn.metrics import mean_squared_error as MSE
dt.fit(X_train, y_train)
y_pred_train = dt.predict(X_train)
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)
print('Train RMSE: {:.2f}'.format(RMSE_train))
```

### Ensemble Learning
* **Advantages of CARTs:**
    * Simple to understand
    * Simple to interpret
    * Easy to use
    * Flexibility: ability to describe non-linear dependencies
    * Preprocessing: no need to standardize or normalize features, ...
* **Limitations of CARTs:**
    * Classification: Can only produce orthogonal decision boundaries
    * Sensitive to small variations in the training set
    * High variance: unconstrained CARTs may overfit the training set.
    * **Solution: ensemble learning**
    
* **Ensemble Learning:** 
    * Train different models on the same dataset
    * Let each model make its predictions
    * Meta-model: aggregates predictions of individual models 
    * Final prediction: more robust and less prone to errors than each individual model
    * Best results: models are skillful in different ways 
    * One example ensemble learner in practice: Voting Classifier (different classifiers hard vote on the final prediction)
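
Hard voting itself is simple to sketch by hand. The example below assumes made-up predictions from three classifiers on five samples; the helper `hard_vote` is hypothetical and breaks ties in favor of the smallest class label.

```python
import numpy as np

# Hypothetical predictions of three classifiers on five samples (rows = models)
predictions = np.array([
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [1, 1, 1, 0, 0],
])

def hard_vote(preds):
    """Majority class per sample (column); ties go to the smallest label."""
    n_classes = preds.max() + 1
    # counts[c, j] = number of models that voted class c for sample j
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

final = hard_vote(predictions)
print(final)
```

Each final label is the class chosen by at least two of the three models, which is why a single model's mistake on a sample can be outvoted.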
    
**Voting Classifier in sklearn:**

```
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# SEED and the X_train/X_test/y_train/y_test split are assumed to be defined
# as in the earlier examples
lr = LogisticRegression(random_state=SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbors', knn),
               ('Classification Tree', dt)]

# Fit each classifier and report its test-set accuracy
for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))
```

```
vc = VotingClassifier(estimators = classifiers)
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, y_pred)))
```

```
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred) 
   
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))
```

```
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))
```

#### Bagging
* **Bootstrap aggregation = Bagging**
* A Voting Classifier is an ensemble of models that are fit to the same training set using different algorithms. The final predictions are obtained by majority voting.
* In **bagging**, the ensemble is formed by models that use the same training algorithm. However, these models are not trained on the entire training set. Instead, each model is trained on a different bootstrap sample of the data, i.e., a subset drawn with replacement.
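
A minimal sketch of both ideas follows: drawing one bootstrap sample by hand, then fitting a bagged ensemble of trees. It assumes synthetic data from `make_classification` and uses scikit-learn's `BaggingClassifier`, which these notes have not introduced; the number of estimators and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A bootstrap sample: n row indices drawn with replacement from n rows,
# so typically some rows repeat and others are left out
rng = np.random.default_rng(1)
n = 10
bootstrap_idx = rng.choice(n, size=n, replace=True)
print(sorted(bootstrap_idx))

# Synthetic data, an assumption for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# 50 trees, each fit on its own bootstrap sample of the training set
bc = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                       n_estimators=50, random_state=1)
bc.fit(X_train, y_train)
acc = accuracy_score(y_test, bc.predict(X_test))
print('Bagging accuracy: {:.3f}'.format(acc))
```

The base tree is passed positionally because the name of that keyword argument has changed between scikit-learn versions.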