## Decision Trees
source: Wikipedia

In statistics, decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
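
To make the branch and leaf vocabulary concrete, here is a minimal sketch of a classification tree written by hand as nested conditions. The function name `classify_day` and the rules inside it are hypothetical and chosen only for illustration; they are not learned from any dataset.

```python
# Hypothetical hand-written classification tree; the rules are illustrative
# assumptions, not the output of a trained model.
def classify_day(outlook: str, humidity: str, windy: bool) -> str:
    """Follow one path from the root to a leaf and return the leaf's class label."""
    if outlook == "overcast":
        return "play"  # leaf reached after a single test
    if outlook == "sunny":
        # branch: outlook is sunny AND humidity is normal (a conjunction of features)
        return "play" if humidity == "normal" else "don't play"
    # remaining branch: outlook is rainy, so the wind decides
    return "don't play" if windy else "play"


print(classify_day("sunny", "high", True))
```

Each `return` is a leaf holding a class label, and the chain of conditions that leads to it is the conjunction of feature tests described above.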

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making).


https://en.wikipedia.org/wiki/Decision_tree_learning

#### Classification And Regression Tree (CART) 
Decision trees used in data mining are of two main types:

 - Classification tree analysis is when the predicted outcome is the class (discrete) to which the data belongs.  
 - Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).  

The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al. in 1984. Trees used for regression and trees used for classification have some similarities - but also some differences, such as the procedure used to determine where to split.  
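
As a rough illustration of how the split procedure can differ, the sketch below picks a regression split by minimizing the summed squared error of the two child nodes, which is one common criterion for regression trees (classification trees typically use an impurity measure such as Gini instead). The helper names and the house data are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_regression_split(xs, ys):
    """Return (threshold, error) for the split that minimizes the children's summed SSE."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        error = sse(left) + sse(right)
        if error < best[1]:
            best = (threshold, error)
    return best


# Hypothetical data: house size in square metres and price in thousands
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 105, 300, 310, 290]
threshold, error = best_regression_split(sizes, prices)
print(threshold, error)  # 95.0 250.0
```

The chosen threshold separates the small, cheap houses from the large, expensive ones, so each child node can predict its own mean price with little error.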

----

#### Classification
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

https://en.wikipedia.org/wiki/Statistical_classification

#### Regression
Regression analysis is primarily used for two conceptually distinct purposes. First, regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Second, in some situations regression analysis can be used to infer causal relationships between the independent and dependent variables. Importantly, regressions by themselves only reveal relationships between a dependent variable and a collection of independent variables in a fixed dataset. To use regressions for prediction or to infer causal relationships, respectively, a researcher must carefully justify why existing relationships have predictive power for a new context or why a relationship between two variables has a causal interpretation. The latter is especially important when researchers hope to estimate causal relationships using observational data.  

https://en.wikipedia.org/wiki/Regression_analysis

Uses for decision trees: 
 - Should a loan applicant be accepted, based on credit history and other measures?
 - Which drug is best for a particular patient?
 - Is a tumor malignant or benign?
 - Is an email spam?


In [None]:
# This is normally where you'd type the code to install the packages required.

# !pip install scikit-learn
# !pip install numpy
# !pip install pandas
# !pip install graphviz
# !pip install pydotplus

#### Package information for curious minds:
https://numpy.org/devdocs/user/absolute_beginners.html  
https://pandas.pydata.org/docs/getting_started/10min.html  
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html 
https://www.graphviz.org/  
https://pydotplus.readthedocs.io/reference.html#module-pydotplus.graphviz  


Next we will import the libraries we need, starting with numpy and pandas.

In [None]:
# type the code to import numpy and pandas. hint: check how you imported in previous tutorials


We will use several modules from scikit-learn.

>`from sklearn.model_selection import train_test_split`  
>`from sklearn.tree import DecisionTreeClassifier`  
>`from sklearn.metrics import accuracy_score`  
>`from sklearn import tree`  
>`from sklearn import preprocessing`  

In [None]:
# type the code to import the sklearn modules


#### Our sample data
For this example, we are using a small golf dataset to predict whether or not someone will play golf based on different conditions.

Our data contains columns of data representing the outlook, temperature (in degrees Fahrenheit and categorized), humidity (relative humidity and categorized), windy conditions and whether or not golf was played. Each row represents one day of data.

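To start, import the dataset. A forward slash in the path works on every operating system and avoids Python treating the backslash as an escape character.
>`golf_data = pd.read_csv('datasets/golf_dataset.csv')`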

In [None]:
# type the code to import the data

Once the data is loaded, we should do some exploration to understand the data.
Start with determining the shape of the data.

>`print('Dataset length:', len(golf_data))`  
>`print('Data shape:', golf_data.shape)`

In [None]:
# type the code to explore the shape


We can view the first 5 rows of data using `.head()`

>`golf_data.head()`

In [None]:
# type the code to display the first 5 rows of data


We will begin by using `LabelEncoder` to transform the text values to numerical values. While we performed this transformation over several steps in the Naïve Bayes tutorial, we can condense the code into a single statement that assigns the transformed values to the `le_golf_data` table.  

Note that the columns to transform are listed explicitly. This prevents `LabelEncoder` from transforming the quantitative (numerical) information in the "temp" and "hum" columns, which are left out of `le_golf_data`.

>`le = preprocessing.LabelEncoder()`  
>`le_golf_data = golf_data[["outlook", "temperature", "humidity", "windy", "play"]].apply(le.fit_transform)`  

>`le_golf_data`
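
To see what the encoder does to one column, here is a minimal pure-Python sketch that mimics its behaviour: the distinct values are sorted and each one receives its position as an integer code. The helper `fit_label_codes` and the sample values are hypothetical, and the real `LabelEncoder` should be used in the notebook.

```python
def fit_label_codes(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# Hypothetical sample of an "outlook" column
outlook = ["sunny", "overcast", "rainy", "sunny", "overcast"]

codes = fit_label_codes(outlook)
encoded = [codes[value] for value in outlook]

print(codes)    # {'overcast': 0, 'rainy': 1, 'sunny': 2}
print(encoded)  # [2, 0, 1, 2, 0]
```

Because the codes follow alphabetical order rather than any natural ordering, "sunny" becomes 2 simply because it sorts last.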

In [None]:
# type the code to use Label Encoder to create the le_golf_data table


We need to specify the features and the target for the model. The model uses the feature values to predict the target. This is supervised learning because the training data includes the outcome we want to predict: we know the conditions on each day and whether golf was played, so the model can learn from those past examples.  
>`features = ["outlook", "temperature", "humidity", "windy"]`  
>`target = ["play"]`

In [None]:
# type the code to specify features and target


We will assign these to variables to make it easier to pass through our code.  
>`x = le_golf_data[features]`  
>`y = le_golf_data[target]`

In [None]:
# assign features to x and target to y


Display x and y to view the table.
>`x`  
>`y`

In [None]:
# type the code to display x


In [None]:
# type the code to display y


The classification model is trained on the encoded numbers, so its predictions are also numbers. To make the results readable, we fit a separate label encoder on the original "play" column and use it at the end to translate each prediction back to text.

>`le_play = preprocessing.LabelEncoder()`  
>`le_play.fit(golf_data["play"])`  
>`list(le_play.classes_)`  

In [None]:
# type the code to create the le_play variable 


We will use Gini impurity to create our first model.

Used by the CART (classification and regression tree) algorithm for classification trees, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.  

The Gini impurity can be computed by summing the probability $p_{i}$ of an item with label $ i$ being chosen times the probability $ \sum _{k\neq i}p_{k}=1-p_{i}$ of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.

To compute Gini impurity for a set of items with $ J$ classes, suppose $ i\in \{1,2,...,J\}$, and let $ p_{i}$ be the fraction of items labeled with class $ i$ in the set.

$ \operatorname {I} _{G}(p)=\sum _{i=1}^{J}p_{i}\sum _{k\neq i}p_{k}=\sum _{i=1}^{J}p_{i}(1-p_{i})=\sum _{i=1}^{J}(p_{i}-{p_{i}}^{2})=\sum _{i=1}^{J}p_{i}-\sum _{i=1}^{J}{p_{i}}^{2}=1-\sum _{i=1}^{J}{p_{i}}^{2}$
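
The final form of the formula, $1-\sum p_{i}^{2}$, is easy to check numerically. The sketch below is a minimal pure-Python version. The function names and label lists are hypothetical, and `weighted_gini` assumes a split is scored by weighting each child's impurity by its share of the samples.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i ** 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(groups):
    """Impurity of a split: each child's Gini weighted by its share of the samples."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_impurity(group) for group in groups)


pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
print(weighted_gini([["yes", "yes"], ["no", "no"]]))  # 0.0
```

A node with a single class scores zero, an evenly mixed two-class node scores 0.5, and a split that separates the classes perfectly brings the weighted impurity back to zero.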

We will create the model using _clf_gini_

>`clf_gini = DecisionTreeClassifier(criterion='gini', random_state=100,`  
> $\qquad$  $\qquad$ $\qquad$ $\qquad$ $\qquad$ $\qquad$ &nbsp;&nbsp; `max_depth=5, min_samples_leaf=1)`   
>`clf_gini.fit(x, y)`

_Keep the call split across two lines as shown._   
The second line continues the same call. Inside the parentheses, Python ignores the extra indentation, and the notebook editor aligns the new line automatically when you press Enter after the comma. 


In [None]:
# type the code to create the model


We will test the model based on: outlook=sunny, temperature=cool, humidity=high, windy=True, which translates to `[2, 0, 0, 1]`.

>`clf_gini.predict([[2, 0, 0, 1]])`
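
If you would rather not translate test cases by hand, the lookup can be sketched as below. The category lists are assumptions about the values in the dataset (they match the codes quoted in this tutorial), each list is kept in sorted order so that a value's position equals its code, and `encode_observation` is a hypothetical helper.

```python
# Assumed category values per feature, in sorted order so index == encoded value
CATEGORIES = {
    "outlook": ["overcast", "rainy", "sunny"],
    "temperature": ["cool", "hot", "mild"],
    "humidity": ["high", "normal"],
    "windy": [False, True],
}
FEATURES = ["outlook", "temperature", "humidity", "windy"]


def encode_observation(observation):
    """Translate a readable observation into the list of codes the model expects."""
    return [CATEGORIES[name].index(observation[name]) for name in FEATURES]


row = encode_observation(
    {"outlook": "sunny", "temperature": "cool", "humidity": "high", "windy": True}
)
print(row)  # [2, 0, 0, 1]
```

The resulting list can be wrapped in another list and passed to `predict`, which expects one row per observation.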

In [None]:
# type the code to test outlook=sunny, temperature=cool, humidity=high, windy=True 


Of course, the result is the encoded value, since that is what we provided to the model. To translate it back to text, wrap the call in the `inverse_transform` method of `le_play`.
>`print(le_play.inverse_transform(clf_gini.predict([[2, 0, 0, 1]])))`

In [None]:
# type the code to translate the result


We will do a second test using: outlook=sunny, temperature=hot, humidity=normal, windy=False, which translates to [2, 1, 1, 0].
    
>`clf_gini.predict([[2, 1, 1, 0]])`


In [None]:
# type the code to test outlook=sunny, temperature=hot, humidity=normal, windy=False


Transform the result back to text.

>`print(le_play.inverse_transform(clf_gini.predict([[2, 1, 1, 0]])))`


In [None]:
# type the code to translate the result


Next, we will create a model using Entropy.  

Entropy measures the disorder, or uncertainty, in the data: a node whose samples all share one label has zero entropy, while an evenly mixed node has the highest entropy.
Information gain is the reduction in entropy achieved by splitting on an attribute. At each node, the tree compares the candidate splits and chooses the one with the highest information gain, so the attributes that remove the most uncertainty decide where the data is split.
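
A minimal pure-Python sketch of both quantities, assuming Shannon entropy measured in bits (base-2 logarithm). The function names and label lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    # p * log2(1 / p) is the same as -p * log2(p)
    return sum(p * math.log2(1 / p) for p in proportions)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    remainder = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - remainder


parent = ["yes", "yes", "no", "no"]

print(entropy(parent))                                             # 1.0
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))    # 1.0
print(information_gain(parent, [["yes", "no"], ["yes", "no"]]))    # 0.0
```

The first split separates the classes completely and gains the full bit of information. The second leaves both children as mixed as the parent and gains nothing, so a tree would prefer the first.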

>`clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100,`  
> $\qquad$  $\qquad$ $\qquad$ $\qquad$ $\qquad$ $\qquad$ $\qquad$  `max_depth=5, min_samples_leaf=1)`  
>`clf_entropy.fit(x, y)`

_Keep the call split across two lines as shown._   
The second line continues the same call. Inside the parentheses, Python ignores the extra indentation, and the notebook editor aligns the new line automatically when you press Enter after the comma. 

In [None]:
# type the code to create the entropy model


We will test the model based on: outlook=sunny, temperature=cool, humidity=high, windy=True, which translates to `[2, 0, 0, 1]`.

>`clf_entropy.predict([[2, 0, 0, 1]])`

In [None]:
# type the code to test outlook=sunny, temperature=cool, humidity=high, windy=True 


Transform the result back to text.

>`print(le_play.inverse_transform(clf_entropy.predict([[2, 0, 0, 1]])))`


In [None]:
# type the code to translate the result

We will do a second test using: outlook=sunny, temperature=hot, humidity=normal, windy=False, which translates to [2, 1, 1, 0].
    
>`clf_entropy.predict([[2, 1, 1, 0]])`


In [None]:
# type the code to test outlook=sunny, temperature=hot, humidity=normal, windy=False


Transform the result back to text.

>`print(le_play.inverse_transform(clf_entropy.predict([[2, 1, 1, 0]])))`


In [None]:
# type the code to translate the result


Finally, we will use graphviz and pydotplus to visualize the entropy model. The cell below writes a PNG file, `golf_tree_01.png`, which will be saved with your notebook files.

In [None]:
# run this cell to visualize the entropy tree
import graphviz
import pydotplus
import collections

# export the fitted tree as Graphviz DOT text
dot_data = tree.export_graphviz(clf_entropy,
                                feature_names=features,
                                out_file=None,
                                filled=True,
                                rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('turquoise', 'orange')
edges = collections.defaultdict(list)

# collect the child node ids of every parent node
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))

# recolor the two children of each parent: lower node id turquoise, higher orange
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])

# save the image
graph.write_png('golf_tree_01.png')