import pandas as pd
import numpy as np

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) <br>Intro to Classification and Regression Trees (CARTs)
Week 6 | Day 2

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Describe what a decision tree is
- Explain how a classification tree works
- Explain how a regression tree works

## Decision tree example

<img src="http://i.imgur.com/IGebRXH.png" width=900 align="center">

## Recap

So far in this course, we've talked about regression and classification models.

In both cases, we used a **linear**, **parametric** model (with one exception).

What does that mean? And what was the exception?

## Linear

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$


Each feature $x_j$ contributes to $y$ in proportion to its coefficient $\beta_j$: increasing $x_j$ by one unit changes $y$ by $\beta_j$. The model is **linear in its parameters**.

## Parametric vs. Non-parametric Models


**Parametric models** have some fixed number of parameters that are solved for. There is an underlying assumption about the distribution from which the data is drawn.

> Check: what is the assumption about the distribution for linear regression as it is implemented in sklearn?



**Non-parametric models** make no assumptions about the distribution of the underlying data. Rather than being thought of as having no parameters, they can be thought of as having an unbounded number of parameters: the number of parameters can grow with the size of the data.

## Intro to Decision Trees


### What a Decision Tree is
_Decision trees_ are a _non-parametric, hierarchical_ technique for both regression and classification.

**_Non-parametric_** means that the model is not described by a fixed set of parameters, as Logistic Regression is, for example. It also means that there is no underlying assumption about the distribution of the data or of the error.

**_Hierarchical_** means that the model consists of a sequence of questions which yield a class label or prediction when applied to any data observation. In other words, once trained, the model behaves like a recipe, which will tell us a result if followed exactly.

---

[Classification CART documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

[Regression CART documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)

[Decision tree user guide](http://scikit-learn.org/stable/modules/tree.html)

## How it works

At each decision point, or node, in our tree, we need to choose one of the answers and move forward from there. This begins at our _root node_ and continues on through each _internal node_ until we reach a _leaf node_. At this leaf node, we have the answer to our problem.
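The traversal described above can be sketched with a small hand-built tree. This is only an illustration: the nested-dictionary layout, the feature names (`credit_history`, `income`), the thresholds, and the `predict` helper are all hypothetical and are not taken from the figure or from any library.

```python
# A hypothetical loan-approval tree. Feature names, thresholds, and labels
# are illustrative assumptions only.
tree = {
    "feature": "credit_history",            # root node
    "test": lambda v: v == "good",
    "yes": {
        "feature": "income",                # internal node
        "test": lambda v: v >= 30000,
        "yes": {"leaf": "approve"},         # leaf nodes hold the answer
        "no": {"leaf": "deny"},
    },
    "no": {
        "feature": "income",                # internal node
        "test": lambda v: v >= 80000,
        "yes": {"leaf": "approve"},
        "no": {"leaf": "deny"},
    },
}


def predict(node, record):
    """Walk from the given node down to a leaf and return the leaf's answer."""
    while "leaf" not in node:
        answer = node["test"](record[node["feature"]])
        node = node["yes"] if answer else node["no"]
    return node["leaf"]


applicant = {"credit_history": "good", "income": 45000}
print(predict(tree, applicant))
```

Each pass through the loop answers one question and moves one level down, so the number of questions asked is at most the depth of the tree.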

## Back to our example

<img src="http://i.imgur.com/IGebRXH.png" width=900 align="center">

**Definitions:**
- Nodes
- Leaf nodes
- Splits

Some questions to ponder:
- Why is credit history the first question?
- Why those specific income thresholds?
- Does the order in which we ask these questions matter?

## How to build a decision tree

In order to build a productive and efficient decision tree we need an algorithm that is capable of determining optimal choices for each node. One such algorithm is _Hunt's algorithm_: a _greedy recursive_ algorithm that leads to a _local optimum_.

- _greedy_: makes the best local decision in the hope of ultimately making the best global decision
- _recursive_: splits task into subtasks, solves each the same way
- _local optimum_: solution for a given neighborhood of points



## How to build a decision tree cont...


The algorithm works by recursively partitioning records into smaller & smaller subsets. The partitioning decision (which feature to use and how to split) is made at each node according to a metric called _purity_. A node is said to be 100% pure when all of its records belong to a single class (or have the same value).

## Example: binary classification with classes A, B

Given a set of records `r` at node `n`:

1. If all records in `r` belong exclusively to class A or exclusively to class B, then `n` is a _leaf node_ (also called a terminal node) and no further splitting takes place.
<br>
2. If however `r` contains records from both A and B, do the following:
    - create test condition to partition the records further
    - partition records in `r` to children according to test
    - `n` would then be internal node, with outgoing edges to child nodes

These steps are then recursively applied to each child node until they end in a leaf node (purely A or B), or a stopping criterion is met.
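The two steps above can be sketched in a few lines of plain Python. This is a minimal sketch, not a reference implementation of Hunt's algorithm: it assumes numeric features, binary threshold tests of the form `x <= threshold`, and Gini impurity for scoring candidate tests. The helper names (`gini`, `best_split`, `build`) and the toy data are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature_index, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a test that sends every record one way is not a split
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best, best_score = (j, threshold), score
    return best


def build(rows, labels):
    """Recursively partition the records, following steps 1 and 2."""
    # Step 1: all records share one class, so this node is a leaf.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Step 2: create a test condition and partition the records.
    split = best_split(rows, labels)
    if split is None:
        # Stopping criterion: no test improves purity, so predict the majority class.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    j, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": build([r for r, _ in left], [y for _, y in left]),
        "right": build([r for r, _ in right], [y for _, y in right]),
    }


rows = [[1.0], [2.0], [3.0], [4.0]]   # toy data: one numeric feature
labels = ["A", "A", "B", "B"]
tree = build(rows, labels)
print(tree)
```

The choice made in `best_split` is greedy: it only looks at the immediate children, never at what later splits might achieve.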

## Types of splits

Splits can be two-way (binary) or multi-way.

Features can be categorical or continuous.

<img src="http://i.imgur.com/JOR2H8J.png" width=600>
<img src="http://i.imgur.com/KI8aBJX.png" width=600>

## Optimization functions

Recall from the algorithm above that we repeatedly create test conditions to split the data. How do we determine the best split amongst all possible splits? No split is necessary (at a given node) when all records belong to the same class. Therefore we want each step to create the partition with the **highest possible purity**.


In order to do this, we will need an objective function to optimize. We want our objective function to measure the gain in purity from a particular split. Therefore we want it to depend on the class distribution over the nodes (before and after the split). 

For example, let $p(i|t)$ be the probability of class $i$ at node $t$ (e.g., the fraction of records labeled $i$ at node $t$). For a binary (0/1) classification problem, what would the values be for maximum impurity?


The maximum impurity partition is given by the distribution:
$$
p(0|t) = p(1|t) = 0.5
$$

where both classes are present in equal proportion.



On the other hand, the minimum impurity partition is obtained when only one class is present, e.g.:
$$
p(0|t) = 1, \qquad p(1|t) = 0
$$

(or the reverse), since in the binary case $p(0|t) = 1 - p(1|t)$.

Therefore in the case of a binary classification we need to define an _impurity_ function that will smoothly vary between the two extreme cases of minimum impurity (i.e. purity, where we have only one class or the other) and the maximum impurity case of equal mix.

## How we measure "purity"

We can define several functions that satisfy these properties. Here are three common ones:


$$
\text{Entropy}(t) = - \sum_{i=0}^{c-1} p(i|t)\log_2 p(i|t)
$$

$$
\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} [p(i|t)]^2
$$

$$
\text{Classification error}(t) = 1 - \max_i[p(i|t)]
$$



<img src="http://i.imgur.com/UGPGzRq.png" width=600>

<center>Note that each measure achieves its max at 0.5, min at 0 & 1</center>
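The three measures can be computed directly from a class distribution. The sketch below assumes the usual convention that $0 \cdot \log_2 0 = 0$; the function names are illustrative.

```python
import math


def entropy(probs):
    """Entropy in bits, treating 0 * log2(0) as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def gini(probs):
    """Gini impurity: one minus the sum of squared class probabilities."""
    return 1.0 - sum(p ** 2 for p in probs)


def classification_error(probs):
    """One minus the probability of the most common class."""
    return 1.0 - max(probs)


# Binary case: sweep p(1|t) from a pure node, through an even mix, to a pure node.
for p1 in (0.0, 0.1, 0.5, 0.9, 1.0):
    probs = [1.0 - p1, p1]
    print(p1,
          round(entropy(probs), 3),
          round(gini(probs), 3),
          round(classification_error(probs), 3))
```

For the even mix `[0.5, 0.5]` the entropy is 1.0 and both Gini and classification error are 0.5; for a pure node all three are 0.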

## One more step

Impurity measures put us on the right track, but on their own they are not enough to tell us how our split will do. We still need to look at impurity before & after the split to understand how good the split is. We can make this comparison using the gain:
$$
\Delta = I(\text{parent}) - \sum_{\text{children}}\frac{N_j}{N}I(\text{child}_j)
$$

where $I$ is the impurity measure, $N_j$ denotes the number of records at child node $j$, and $N$ denotes the number of records at the parent node. When $I$ is the entropy, this quantity is called the _information gain_.
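As a worked illustration of the gain formula, the sketch below compares two candidate splits of the same parent node using entropy as the impurity measure. The label lists are made-up toy data and the helper names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["A"] * 5 + ["B"] * 5
clean_split = [["A"] * 5, ["B"] * 5]                          # children are pure
weak_split = [["A"] * 3 + ["B"] * 2, ["A"] * 2 + ["B"] * 3]   # children stay mixed

print(gain(parent, clean_split))
print(gain(parent, weak_split))
```

The clean split recovers the full 1 bit of parent entropy, while the weak split gains only a few hundredths of a bit, so a greedy builder would prefer the first.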

Generally speaking, a test condition with a high number of outcomes can lead to overfitting (ex: a split with one outcome per record). One way of dealing with this is to restrict the algorithm to binary splits only (CART). Another way is to use a splitting criterion which explicitly penalizes the number of outcomes (C4.5).
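The effect of many outcomes can be seen on toy data. The sketch below compares a record-ID feature (one outcome per record) with a two-valued feature. It uses the gain ratio, which is information gain divided by the entropy of the partition itself, as the penalized criterion associated with C4.5. The data and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Entropy (in bits) of a list of labels or feature values."""
    n = len(items)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(items).values())


def partition(values, labels):
    """Multi-way split: group the labels by feature value, one child per value."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def info_gain(values, labels):
    n = len(labels)
    children = partition(values, labels)
    return entropy(labels) - sum(len(c) / n * entropy(c) for c in children)


def gain_ratio(values, labels):
    split_info = entropy(values)  # grows with the number of outcomes
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0


labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
record_id = list(range(8))         # one outcome per record
group = ["x"] * 4 + ["y"] * 4      # hypothetical two-valued feature

print(info_gain(record_id, labels), info_gain(group, labels))
print(gain_ratio(record_id, labels), gain_ratio(group, labels))
```

Both features reach the same information gain, but the record ID would be useless on new data. Dividing by the split information ranks the two-valued feature higher.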

## Decision tree regression

In the case of regression, the outcome variable is not a category but a continuous value. We therefore cannot use the same measures of purity we used for classification. Instead, we look to minimize the sum of squared errors (SSE) at each split, where each child node predicts the mean outcome of its records.
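A minimal sketch of a single regression split on one feature follows. It assumes each side of the split predicts the mean of its own records and tries every observed value as a threshold; the data and the names `sse` and `best_threshold` are illustrative.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(x, y):
    """Return (threshold, total_sse) for the best binary split x <= threshold."""
    best = (None, sse(y))  # baseline: no split at all
    for threshold in sorted(set(x))[:-1]:  # the largest value would leave one side empty
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, total = best_threshold(x, y)
print(threshold, total)
```

On this toy data the best split is `x <= 3`: the left leaf predicts 6.0, the right leaf predicts 21.0, and the remaining SSE is 4.0.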

## Overfitting

Check: What is overfitting?

## How to prevent overfitting with decision trees

In addition to determining splits, we also need a stopping criterion to tell us when we’re done. For example, we can stop when all records in a given node belong to the same class, or when all records have the same attributes. This is correct in principle, but would likely lead to overfitting. 

**Pruning** - this builds the full tree and then prunes as a post-processing step. To prune a tree, we examine the nodes from the bottom-up and simplify pieces of the tree (according to some criteria). Complicated subtrees can be replaced either with a single node, or with a simpler (child) subtree.

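Older versions of scikit-learn did not offer post-pruning; since version 0.22, minimal cost-complexity pruning is available through the `ccp_alpha` parameter. A simpler and common alternative is to limit the tree while it grows, for example by setting a maximum depth with `max_depth`.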

**Minimum observations to make a split:**

An alternative to maximum depth, which can also be combined with it, is to specify the minimum number of data points required to make a split at a node (`min_samples_split` in scikit-learn).
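Both controls can be tried out in scikit-learn. The sketch below uses the bundled iris dataset as a stand-in; the particular values `max_depth=2` and `min_samples_split=10` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: keeps splitting until the leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Restricted tree: at most 2 levels of splits, and a node needs
# at least 10 records before it may be split.
limited = DecisionTreeClassifier(
    max_depth=2, min_samples_split=10, random_state=0
).fit(X_train, y_train)

for name, model in [("unrestricted", full), ("restricted", limited)]:
    print(name,
          "depth:", model.get_depth(),
          "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```

Comparing training and test accuracy for the two models is one way to see whether the extra depth of the unrestricted tree is fitting signal or noise.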

[Decision Tree Example](Decision Tree Sklearn Example.ipynb)

<a id='advantages'></a>
## CART advantages
---

- Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
    - Useful to work with non technical departments (marketing/sales).
- Requires little data preparation. 
    - Other techniques often require data to be normalized, dummy variables to be created, and blank values to be removed.
- Able to handle both numerical and categorical data. 
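    - Other techniques are usually specialized in analyzing datasets that have only one type of variable.
    - Note that scikit-learn's tree implementation expects numeric input, so categorical features must be encoded before fitting.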
- Uses a **white box** model.
    - If a given outcome is observed in the model, the conditions that led to it can be explained with boolean logic.
    - By contrast, in a **black box** model, the explanation for the results is typically difficult to understand, for example in neural networks.
- Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
- Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
- Performs well with large datasets. Large amounts of data can be analyzed using standard computing resources in reasonable time.
- Once trained, the model can be implemented in hardware and executes extremely fast.
    - Real-time applications like trading, for example.
- Performs feature selection for you

<a id='disadvantages'></a>
## CART disadvantages
---

- Locally-optimal.
    - Practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm where locally-optimal decisions are made at each node. 
    - Such algorithms cannot guarantee to return the globally-optimal decision tree.
- Overfitting.
    - Decision-tree learners can create over-complex trees that do not generalize well from the training data.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large.
    - Neural networks, for example, are superior for these problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
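- For continuous outcome variables, predictions are coarse: every record that lands in the same leaf receives the same predicted value, so the tree can output at most as many distinct values as it has leaves.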
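The XOR difficulty mentioned in the list above can be seen directly. On the four-point XOR truth table, splitting on either input alone yields no impurity gain, so a greedy builder has no signal about which feature to pick first. This sketch uses Gini impurity and illustrative helper names.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(rows, labels, feature):
    """Impurity gain from splitting a binary feature into its 0 and 1 branches."""
    left = [y for r, y in zip(rows, labels) if r[feature] == 0]
    right = [y for r, y in zip(rows, labels) if r[feature] == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]   # XOR of the two inputs

print(gini_gain(rows, labels, 0), gini_gain(rows, labels, 1))
```

Both gains are zero: each branch still holds an even mix of the two classes. Only after splitting on both features does the tree separate the classes.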

<a name="ind-practice"></a>
## Independent Practice

1. Walk through the visual tutorial on R2D3 [Decision Trees](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
2. Find a decision tree model at [BigML](https://bigml.com/gallery/models) that you find interesting. Be prepared to talk about what the model is trying to predict and how it works.


## Conclusion
In this class we learned about Classification and Regression trees. We learned how Hunt's algorithm helps us recursively partition our data and how impurity gain is useful for determining optimal splits.
We've also reviewed pros/cons of decision trees and industry applications.
In the lab we will learn to use them in `Scikit-learn`.

### ADDITIONAL RESOURCES
- [Scikit-Learn Documentation](http://scikit-learn.org/stable/modules/tree.html)
- [Classification trees video](https://www.youtube.com/watch?v=p17C9q2M00Q)
- [Regression trees video](https://www.youtube.com/watch?v=zvUOpbgtW3c)
- [Decision trees on wikipedia](https://en.wikipedia.org/wiki/Decision_tree_learning)
- [Decision tree regression explained](http://www.saedsayad.com/decision_tree_reg.htm)