# Decision Tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.

The majority of recursive partitioning algorithms are special cases of a simple two-stage algorithm: First partition the observations by univariate splits in a recursive way and second fit a constant model in each cell of the resulting partition.


### Models Pros & Cons
----------------------------------------------------------------------------------------------------------------------
#### Classification Tree (CART)
- Big O Notation (Cost Function):

Pros:
- Simple to understand and to interpret; trees can be visualised.
- Requires little data preparation. Other techniques often require data normalisation, dummy variables, and removal of blank values.
- Handles missing values in the original CART formulation (through surrogate splits), although support varies by implementation.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- Able to handle both numerical and categorical data, without transforming categorical features. Other techniques are usually specialised in analysing datasets that have only one type of variable.
- Able to handle multi-output problems.
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in boolean logic. By contrast, in a black box model (e.g., an artificial neural network), results may be more difficult to interpret.
- Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model.
- Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
- Non-linear; a good fit when the relationship between features and response is non-linear.
- Grows a full tree and then prunes it, using separate training and test sets to decide how far to prune.
- Each step allows only binary splits.
- Geared towards prediction (rather than inference): better suited to building a model with high accuracy on new cases.


Cons:
- Poor prediction performance due to high variance; easy to overfit.
- Decision-tree learners can create over-complex trees that do not generalise well from the training data (overfitting). Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
- Decision trees can be unstable: small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts, so a global search is expensive. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
- Some concepts are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting the decision tree.


#### Regression Tree (CART)
- Big O Notation (Cost Function):

Pros: [Same as above]

Cons: [Same as above]

#### C4.5 - C5.0 Tree
- Big O Notation (Cost Function):

- C4.5 Tree

Pros: Handles both continuous and discrete attributes - to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it; Handles training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing, and missing attribute values are simply not used in gain and entropy calculations; Handles attributes with differing costs; Prunes trees after creation - C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes (grow the full tree, then prune); Built using a single dataset; Each step allows multiple splits; Geared towards prediction - better suited to creating a model that has high prediction accuracy on new cases; 

Cons:

- C5.0 Tree

Pros: [Improvements over C4.5] - Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude); Memory usage - C5.0 is more memory efficient than C4.5; Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees; Support for boosting - boosting improves the trees and gives them more accuracy; Weighting - C5.0 allows you to weight different cases and misclassification types; Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful; Addresses high variance (overfitting) with a bottom-up technique (pruning); Accepts continuous and discrete features; Handles missing data; Grows the full tree, then prunes; Built using a single dataset; Each step allows multiple splits; Geared towards prediction - better suited to creating a model that has high prediction accuracy on new cases; 

Cons: Overfitting happens when the model picks up data with uncommon characteristics, especially when the data is noisy; 


#### Chi-squared Automatic Interaction Detection (CHAID)
- Big O Notation (Cost Function):

Pros: Good for inference and analysis; 

Cons: Not best for prediction;



#### Decision Stump
- Big O Notation (Cost Function):

Pros: Low variance; Used as a base learner in ensemble models;

Cons: Too weak to predict well on its own (weak learner);


#### Cubist Model Tree (Extension to M5)
- Big O Notation (Cost Function):

Pros: Efficient on high-dimensionality; Tree is smaller than regression tree so more accurate; Can extrapolate the prediction; 

Cons: 


#### GUIDE Tree
- Big O Notation (Cost Function):

Pros: Uses a statistical stopping rule for tree growth; Built using a single dataset; Each step allows multiple splits; Geared towards inference - when the goal is to describe or understand the relationship between a response variable and a set of explanatory variables; Not biased in split-variable selection, unlike CART, which is biased towards selecting split variables that allow more splits and those that have more missing values;

Cons: 


#### MOB Tree
- Big O Notation (Cost Function):

Pros: 

Cons: 



#### QUEST Tree
- Big O Notation (Cost Function):

Pros: Easily handles categorical predictor variables with many categories; Not biased in split-variable selection, unlike CART, which is biased towards selecting split variables that allow more splits and those that have more missing values;

Cons: 


#### Conditional Inference Tree
- Big O Notation (Cost Function):

Pros: Avoids variable selection bias, i.e. the tendency to select variables that have many possible splits or many missing values; 

Cons: 




#### Logistic Model Tree (LMT)
- Big O Notation (Cost Function):

Pros: 

Cons: 



#### Random Forest
- Big O Notation (Cost Function):

Pros: Further reduces the “high variance” problem of a single decision tree; Good for high-dimensional / high-variance data; 

Cons:

#### Bagging Trees
- Big O Notation (Cost Function):

Pros: Reduces the “high variance” problem of a single decision tree (usually not as well as random forest);

Cons:

#### Boosting Trees
- Big O Notation (Cost Function):

Pros: Reduces the “high variance” problem of a single decision tree by boosting; 

Cons:


----------------------------------------------------------------------------------------------------------------------

## --------------------- Classification Tree (CART)

#### Wiki Definition: 
- Predict categorical

https://homes.cs.washington.edu/~tqchen/data/pdf/BoostedTree.pdf

In sum, the CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data, whereas C4.5 includes the intermediate step of constructing *rule set*s.

Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. Classification tree analysis is one of the main techniques used in Data Mining.
The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the available techniques have much in common with the techniques used in the more traditional methods of Discriminant Analysis, Cluster Analysis, Nonparametric Statistics, and Nonlinear Estimation. The flexibility of classification trees makes them a very attractive analysis option, but this is not to say that their use is recommended to the exclusion of more traditional methods. Indeed, when the typically more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable. But as an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees are, in the opinion of many researchers, unsurpassed.

The CART algorithm decides on a split based on the amount of homogeneity within class that is achieved by the split. And later on, the split is reconsidered based on considerations of over-fitting.
#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 
Minimal observations in each region ~ 5

Pruning parameter (trade-off over fitting - complexity) ~ 0 - infinite
#### Cost Function: 
- Classification error rate
- Gini Index
- Cross-Entropy

*Each metric measures the purity of the classes in a region after a split (if one class dominates the region, the value is very small)
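
The three purity metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names are hypothetical, and base-2 logarithms are an assumption (another base only rescales the cross-entropy).

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Proportion of each class among the observations in one region."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def classification_error(labels):
    # Share of observations that do not belong to the majority class.
    return 1.0 - max(class_proportions(labels))

def gini_index(labels):
    # Sum over classes of p * (1 - p); zero for a pure region.
    return sum(p * (1.0 - p) for p in class_proportions(labels))

def cross_entropy(labels):
    # Sum over classes of p * log2(1 / p); zero for a pure region.
    return sum(p * log2(1.0 / p) for p in class_proportions(labels))

pure = ["a"] * 10                 # one class dominates completely
mixed = ["a"] * 5 + ["b"] * 5     # two classes evenly mixed
for name, region in [("pure", pure), ("mixed", mixed)]:
    print(name, classification_error(region), gini_index(region), cross_entropy(region))
```

All three metrics are zero for the pure region and reach their largest value for the evenly mixed one, which is why a split that lowers them is preferred.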

#### Process Flow: 
Split on one feature at a time, choosing at each step the split that achieves the smallest [purity metric]; Ypred ~ the majority class in the region that Y(i) falls into; Stopping criterion: stop splitting a region once it contains fewer than the minimal number of observations, e.g. 5; 

Pruning Tree: a tuning parameter trades off fit to the training data (over-fitting) against tree complexity
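
The role of the pruning parameter can be illustrated with a toy cost-complexity calculation: each candidate subtree is scored as training error plus alpha times its number of leaves, and the subtree with the lowest score is kept. The candidate list below is made up purely for illustration, and `best_subtree` is a hypothetical helper name.

```python
# Hypothetical candidate subtrees: (number of leaves, training misclassifications).
candidates = [(1, 40), (2, 25), (4, 15), (7, 11), (10, 10)]

def best_subtree(candidates, alpha):
    """Pick the subtree minimising: training errors + alpha * number of leaves."""
    return min(candidates, key=lambda t: t[1] + alpha * t[0])

for alpha in (0.0, 1.0, 3.0, 20.0):
    leaves, errors = best_subtree(candidates, alpha)
    print(f"alpha={alpha}: keep {leaves} leaves ({errors} training errors)")
```

With alpha = 0 the full tree wins; as alpha grows, smaller trees are preferred, until only the root remains.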

#### Evaluation Methods: 

#### Tips: 



In [None]:
# ----------------------- R

############# tree package ##############
# https://www.r-bloggers.com/classification-trees/
library(tree)

xtabs( ~ class, data = ecoli.df)
# class
#  cp  im imL imS imU  om omL  pp 
# 143  77   2   2  35  20   5  52

ecoli.tree1 = tree(class ~ mcv + gvh + lip + chg + aac + alm1 + alm2,
  data = ecoli.df)
summary(ecoli.tree1)

# Classification tree:
# tree(formula = class ~ mcv + gvh + lip + chg + aac + alm1 + alm2, 
#     data = ecoli.df)
# Variables actually used in tree construction:
# [1] "alm1" "mcv"  "gvh"  "aac"  "alm2"
# Number of terminal nodes:  10 
# Residual mean deviance:  0.7547 = 246 / 326 
# Misclassification error rate: 0.122 = 41 / 336

# Plotting the tree
plot(ecoli.tree1)
text(ecoli.tree1, all = T)

# To prune the tree we use cross-validation to identify the point to prune.
cv.tree(ecoli.tree1)
# $size
#  [1] 10  9  8  7  6  5  4  3  2  1
#
# $dev
#  [1]  463.6820  457.4463  447.9824  441.8617  455.8318  478.9234  533.5856  586.2820  713.2992 1040.3878
#
# $k
#  [1]      -Inf  12.16500  15.60004  19.21572  34.29868  41.10627  50.57044  64.05494 180.78800 355.67747
#
# $method
# [1] "deviance"
#
# attr(,"class")
# [1] "prune"         "tree.sequence"
# This suggests a tree size of 6 and we can re-fit the tree:
ecoli.tree2 = prune.misclass(ecoli.tree1, best = 6)
summary(ecoli.tree2)

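# Decide whether to prune: cross-validate on the misclassification rate
set.seed(3)
cv.table = cv.tree(ecoli.tree1, FUN = prune.misclass)
cv.table
# Show graph
par(mfrow = c(1, 2))
plot(cv.table$size, cv.table$dev, type = "b")
# plot(……………$k, ………………………………………..)   # second panel; arguments elided in the original notes
# Prune the tree; [size] = number of terminal nodes, here the one with the lowest CV error
best.size = cv.table$size[which.min(cv.table$dev)]
prune.table = prune.misclass(ecoli.tree1, best = best.size)
plot(prune.table)
text(prune.table, pretty = 0)

# test.data is a placeholder for a held-out data frame with the same columns as ecoli.df
tree.pred <- predict(ecoli.tree1, test.data, type = "class")
table(tree.pred, test.data$class)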



########## rpart packages ##############
# https://cran.r-project.org/web/packages/rpart/rpart.pdf
library(rpart)
library(rpart.plot)
mytree <- rpart(Y~., data=data, method="class", # classifier
               control=rpart.control(minsplit=1000, # minimal node size to attempt a split
                                    minbucket=300, # minimal size of each leaf
                                    cp=0.008, # complexity parameter
                                    maxdepth=13, # maximal depth of the tree
                                    xval=5)) # number of cross-validation folds
summary(mytree)
rpart.plot(mytree)
printcp(mytree)
mytree$variable.importance

# Node numbers of the leaves, then the path of splits leading to the first one
leaf.nodes = as.numeric(rownames(mytree$frame)[mytree$frame$var == "<leaf>"])
path.rpart(mytree, nodes=leaf.nodes[1], print.it=FALSE)



In [None]:
# ----------------------- Python
# http://scikit-learn.org/stable/modules/tree.html

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# After being fitted, the model can then be used to predict the class of samples:
clf.predict([[2., 2.]])

# Alternatively, the probability of each class can be predicted, which is the fraction of training samples 
# of the same class in a leaf:
clf.predict_proba([[2., 2.]])

# Case 2 - use iris dataset
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# fitting the model ------------------ Sample code
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Hold out part of the iris data so that X_train / y_train exist (the split itself is illustrative)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)
# Named tree_clf so that it does not shadow the sklearn `tree` module used below
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0) # impurity measure // depth of tree
tree_clf.fit(X_train, y_train)

# Once trained, we can export the tree in Graphviz format using the export_graphviz exporter. 
# Below is an example export of a tree trained on the entire iris dataset:
with open("iris.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f)

# Then we can use Graphviz’s dot tool to create a PDF file (or any other supported file type): 
# dot -Tpdf iris.dot -o iris.pdf.
import os
os.unlink('iris.dot')
    
# Plot
import pydotplus  # third-party package; also requires Graphviz to be installed
from IPython.display import Image  
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())  

## --------------------- Regression Tree (CART)

#### Wiki Definition: 
- Predict Continuous

https://homes.cs.washington.edu/~tqchen/data/pdf/BoostedTree.pdf

In sum, the CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data, whereas C4.5 includes the intermediate step of constructing *rule set*s.

A regression tree is built through a process known as binary recursive partitioning, which is an iterative process that splits the data into partitions or branches, and then continues splitting each partition into smaller groups as the method moves up each branch.
#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 
Minimal observations in each region ~ 5

Pruning parameter (trade-off over fitting - complexity) ~ 0 - infinite

#### Cost Function: 
Minimize the RSS given each split

#### Process Flow: 
Split on one feature at a time, choosing at each step the split that achieves the smallest RSS; RSS = sum over i of (Y(i) - Ypred)^2, where Ypred ~ the mean of the region that Y(i) falls into; Stopping criterion: stop splitting a region once it contains fewer than the minimal number of observations, e.g. 5; 

Pruning Tree – tuning parameter to trade-off between over fitting and complexity
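
One step of this process, finding the best binary split of a single numeric feature by RSS, can be sketched as follows. It is a simplified illustration with hypothetical function names and toy data; a full regression tree would repeat it over every feature and recurse into each resulting region.

```python
def rss(values):
    """Residual sum of squares of a region around its own mean."""
    if not values:
        return 0.0
    region_mean = sum(values) / len(values)
    return sum((v - region_mean) ** 2 for v in values)

def best_split(x, y, min_obs=1):
    """Return (threshold, total RSS) of the best binary split on one feature."""
    best = None
    distinct = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        if len(left) < min_obs or len(right) < min_obs:
            continue  # respect the minimal number of observations per region
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Toy data: two clearly separated groups of responses.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
print(best_split(x, y))
```

The prediction for a new observation is then the mean response of the region it falls into.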

#### Evaluation Methods: 

#### Tips: 



In [None]:
# ----------------------- R
############## tree package ################
# http://www.di.fc.ul.pt/~jpn/r/tree/tree.html
library(tree)

real.estate <- read.table("cadata.dat", header=TRUE)
tree.model <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data=real.estate)
plot(tree.model)
text(tree.model, cex=.75)

# We can compare the predictions with the dataset 
# (darker is more expensive) which seem to capture the global price trend:
price.deciles <- quantile(real.estate$MedianHouseValue, 0:10/10)
cut.prices <- cut(real.estate$MedianHouseValue, price.deciles, include.lowest=TRUE)
plot(real.estate$Longitude, real.estate$Latitude, col=grey(10:2/11)[cut.prices], pch=20, xlab="Longitude",ylab="Latitude")
partition.tree(tree.model, ordvars=c("Longitude","Latitude"), add=TRUE)
# Summary
summary(tree.model)

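# Decide whether to prune: cross-validate on the deviance
set.seed(3)
cv.table = cv.tree(tree.model)
cv.table
# Show graph
par(mfrow = c(1, 2))
plot(cv.table$size, cv.table$dev, type = "b")
# Prune the tree; [size] = number of terminal nodes, here the one with the lowest CV deviance
best.size = cv.table$size[which.min(cv.table$dev)]
prune.table = prune.tree(tree.model, best = best.size)
plot(prune.table)
text(prune.table, pretty = 0)

# test.data is a placeholder for a held-out data frame with the same columns as real.estate
tree.pred <- predict(tree.model, test.data)
y.test <- log(test.data$MedianHouseValue)
plot(tree.pred, y.test)
abline(0, 1)
mean((tree.pred - y.test)^2) # MSE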



########## rpart packages ##############
# https://cran.r-project.org/web/packages/rpart/rpart.pdf
library(rpart)
library(rpart.plot)
mytree <- rpart(Y~., data=data, method="anova", # regressor
               control=rpart.control(minsplit=1000, # minimal node size to attempt a split
                                    minbucket=300, # minimal size of each leaf
                                    cp=0.008, # complexity parameter
                                    maxdepth=13, # maximal depth of the tree
                                    xval=5)) # number of cross-validation folds
summary(mytree)
rpart.plot(mytree)
printcp(mytree)
mytree$variable.importance

# Node numbers of the leaves, then the path of splits leading to the first one
leaf.nodes = as.numeric(rownames(mytree$frame)[mytree$frame$var == "<leaf>"])
path.rpart(mytree, nodes=leaf.nodes[1], print.it=FALSE)




In [None]:
# ----------------------- Python
# http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
# http://scikit-learn.org/stable/modules/tree.html
# Mostly the same as the classification tree, but use DecisionTreeRegressor() instead
from sklearn import tree
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, y)
clf.predict([[1, 1]])
# array([0.5])


## --------------------- C4.5 - C5.0 Tree

#### Wiki Definition: 
- Predict Categorical (Classifier)

https://en.wikipedia.org/wiki/C4.5_algorithm

- C4.5 Tree

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan.[1] C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S={s_{1},s_{2},...} of already classified samples. Each sample s_{i} consists of a p-dimensional vector (x_{{1,i}},x_{{2,i}},...,x_{{p,i}}), where the x_{j} represent attribute values or features of the sample, as well as the class in which s_{i} falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sublists. 

C4.5 is Quinlan's next iteration. The new features (versus ID3) are: (i) accepts both continuous and discrete features; (ii) handles incomplete data points; (iii) addresses the over-fitting problem with a bottom-up technique usually known as "pruning"; and (iv) different weights can be applied to the features that comprise the training data. Of these, the first three are very important, and any DT implementation you choose should arguably have all three. The fourth (differential weighting) is much less important.

- C5.0 Tree

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:[5][6]

Perhaps the most significant claimed improvement of C5.0 versus C4.5 is support for boosted trees. Ensemble support for DTs--boosted trees and Random Forests--has been included in the DT implementation in Orange; here, ensemble support was added to a C4.5 algorithm.

- Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude)
- Memory usage - C5.0 is more memory efficient than C4.5
- Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
- Support for boosting - Boosting improves the trees and gives them more accuracy.
- Weighting - C5.0 allows you to weight different cases and misclassification types.
- Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.

#### Input Data: 
X(Numeric) / X(Categorical)

#### Initial Parameters: 

#### Cost Function: 
Shannon Entropy - to pick the features with the largest information gains
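
A small worked example of this criterion, in plain Python: entropy of the class labels, the information gain of splitting on a discrete attribute, and the normalised gain (gain ratio) mentioned in the definition above. The function names and the toy weather-style data are illustrative assumptions, not part of C4.5 or C5.0 themselves.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    """Information gain normalised by the entropy of the split itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Toy data: `outlook` separates the classes perfectly, `windy` not at all.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = ["yes", "no", "yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

for name, feature in [("outlook", outlook), ("windy", windy)]:
    print(name, information_gain(feature, play), gain_ratio(feature, play))
```

The attribute with the highest score (here `outlook`) would be chosen for the split at this node.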

#### Process Flow: 
- C4.5 Tree
It builds a decision tree for the given data in a top-down fashion, starting from a set of objects and a specification of their properties. At each node of the tree, one property is tested, chosen by maximizing information gain (i.e. minimizing entropy), and the results are used to split the object set. This process is repeated recursively until the set in a given sub-tree is homogeneous (i.e. it contains objects belonging to the same category). Like ID3, the algorithm uses a greedy search: it selects a test using the information gain criterion, and then never explores the possibility of alternate choices.


- C5.0 Tree
[same as above]


#### Evaluation Methods: 

#### Tips: 



In [None]:
# ----------------------- R
######## C4.5 Tree #######

# C5.0 improves on C4.5 in every respect listed above, so only C5.0 is shown

######## C5.0 Tree #######
# http://www.patricklamle.com/Tutorials/Decision%20tree%20R/Decision%20trees%20in%20R%20using%20C50.html
# https://cran.r-project.org/web/packages/C50/C50.pdf

# stringsAsFactors = TRUE keeps text columns as factors (no longer the default from R 4.0 on)
credit <- read.csv("credit.csv", stringsAsFactors = TRUE)
str(credit)
# 'data.frame':	1000 obs. of  17 variables:
#  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
#  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
#  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
#  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
#  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
#  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
#  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
#  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
#  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
#  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
#  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
#  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
#  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
#  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
#  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
#  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

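# Train Model
library(C50)
# credit_train / credit_test are not created in these notes; a random 90/10 split is assumed here
set.seed(123)
train_sample <- sample(nrow(credit), 0.9 * nrow(credit))
credit_train <- credit[train_sample, ]
credit_test  <- credit[-train_sample, ]
credit_model <- C5.0(credit_train[-17], credit_train$default)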

summary(credit_model)

# Prediction
credit_pred <- predict(credit_model, credit_test)

# Confusion table
library(gmodels)
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
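           dnn = c('actual default', 'predicted default'))

# 5. Improving the model with (adaptive) boosting: `trials` sets the number of boosting iterations
credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)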

In [None]:
# ----------------------- Python
######## C4.5 Tree #######

# C5.0 improves on C4.5 in every respect listed above, so only C5.0 is considered

######## C5.0 Tree #######




## --------------------- Chi-squared Automatic Interaction Detection (CHAID)

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 


In [1]:
# ------------------- R


In [2]:
# -------------------- Python


## --------------------- Decision Stump

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 



In [3]:
# --------------- R


In [4]:
# --------------- Python


## ---------------------- Cubist Model Tree (Extension to M5)

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 



In [5]:
# ---------------- R


In [6]:
# ---------------- Python


## ---------------------- GUIDE Tree

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 



In [7]:
# ------------------ R


In [8]:
# ------------------ Python


## ----------------------- MOB Tree

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 



## ----------------------- QUEST Tree

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 



In [9]:
# ------------------- R


In [10]:
# ------------------- Python

## ---------------------- Conditional Inference Tree

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 



In [11]:
# ------------------ R


In [12]:
# ------------------ Python


## ----------------------- Logistic Model Tree (LMT)

#### Wiki Definition: 

#### Input Data: 
X(Numeric) / X(Categorical)
#### Initial Parameters: 

#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 



In [None]:
# -------------------- R


In [None]:
# -------------------- Python


## --------------------- Random Forest

#### Wiki Definition: 
Random forests or random decision forests[1][2] are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate. In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[3]:587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.
#### Input Data: 
X(Numeric) / X(Categorical)

#### Initial Parameters: 
Number of features sampled as split candidates when training each tree: m (m < P)

Number of trees to train before making a prediction: B

#### Cost Function: 
- Regression ~ (RSS) 

- Classification ~ (Classification error, Gini index, Cross-entropy)

#### Process Flow: 
Many trees are averaged to reduce variance and increase prediction accuracy -> draw a bootstrap sample of the raw data with the same number of observations -> train one tree on it, considering only a random subset of m features (m < P) at each split (no pruning, full tree) -> repeat B times to form the “forest” -> for regression, average all predictions; for classification, choose the majority class

- Variable importance ~ for each feature: {total decrease in RSS / number of trees} | {total decrease in Gini index / number of trees}
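
The flow above can be sketched in plain Python. This is a deliberately reduced illustration under stated assumptions: each "tree" is a single split (a stump) rather than a full unpruned tree, splits are scored by misclassification count, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """Best single split over the candidate features, by misclassification count."""
    best = None
    for j in feature_ids:
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive distinct values
            left = [l for row, l in zip(rows, labels) if row[j] <= t]
            right = [l for row, l in zip(rows, labels) if row[j] > t]
            left_cls, right_cls = majority(left), majority(right)
            errors = (sum(l != left_cls for l in left)
                      + sum(l != right_cls for l in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_cls, right_cls)
    if best is None:  # no split possible in this bootstrap sample
        cls = majority(labels)
        return (feature_ids[0], float("inf"), cls, cls)
    return best[1:]

def predict_stump(stump, row):
    j, t, left_cls, right_cls = stump
    return left_cls if row[j] <= t else right_cls

def fit_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample, same size
        feats = rng.sample(range(p), m)              # m < p candidate features
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    # Classification: majority vote over the trees.
    return majority([predict_stump(stump, row) for stump in forest])

rows = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [1.5, 1.5]), predict_forest(forest, [8.5, 8.5]))
```

For regression the vote in `predict_forest` would be replaced by the mean of the individual predictions.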

#### Evaluation Methods: 

#### Tips: 



In [None]:
# ----------------------- R
library(randomForest)
# mtry = number of features tried at each split, ntree = number of trees to build
Tree.random <- randomForest(Y ~. , data = train, mtry = 6, ntree = 500, importance = TRUE)
Predict.random <- predict(Tree.random, newdata = test)
# Importance
importance(Tree.random) # the larger the increase (e.g. %IncMSE), the more important the feature


In [None]:
# ----------------------- Python
# Classifier
# -- Random Forests ---------------- Sample code 1
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(criterion='entropy', # impurity measure
                                n_estimators=10, # number of trees (learners)
                                random_state=1,
                                n_jobs=2) # cores
forest.fit(X_train, y_train)


# ---------------------------------- Sample code 2
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# pd.Factor no longer exists in pandas; a Categorical built from the integer codes replaces it
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

train, test = df[df['is_train']==True], df[df['is_train']==False]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
y = train['species'].cat.codes  # integer codes, in the same order as iris.target_names
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

# Regressor
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
# http://scikit-learn.org/stable/modules/ensemble.html#forest

## --------------------- Bagging Trees

#### Wiki Definition: 
It is almost the same as random forest, except that it uses all P features to train each tree instead of only a subset m of the features.

#### Input Data: 
X(Numeric) / X(categorical)

#### Initial Parameters: 
Number of trees to train before making a prediction: B

#### Cost Function: 
- Regression ~ (RSS) 

- Classification ~ (Classification error, Gini index, Cross-entropy)

#### Process Flow: 
Many trees are averaged to reduce variance and increase prediction accuracy -> draw a bootstrap sample of the raw data with the same number of observations and all P features to train one tree (no pruning, full tree) -> repeat B times to form the ensemble -> for regression, average all predictions; for classification, choose the majority class

- Variable importance ~ for each feature: {total decrease in RSS / number of trees} | {total decrease in Gini index / number of trees}
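
A minimal sketch of bagging for regression, in plain Python. As a simplifying assumption, the base learner is a one-split regression tree on a single feature rather than a full unpruned tree; the function names and toy data are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """One-split regression tree on a single feature, chosen to minimise RSS."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_mean, right_mean = mean(left), mean(right)
        cost = (sum((y - left_mean) ** 2 for y in left)
                + sum((y - right_mean) ** 2 for y in right))
        if best is None or cost < best[0]:
            best = (cost, t, left_mean, right_mean)
    if best is None:  # bootstrap sample contained a single distinct x value
        overall = mean(ys)
        return lambda x: overall
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def fit_bagging(xs, ys, n_models=50, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: sample n observations with replacement, keep every feature.
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagging(models, x):
    # Regression: average the B individual predictions.
    return mean(model(x) for model in models)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
models = fit_bagging(xs, ys)
print(predict_bagging(models, 1.0), predict_bagging(models, 6.0))
```

Each stump alone gives a two-level step function; averaging many of them, each fitted to a different bootstrap sample, smooths the prediction.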

#### Evaluation Methods: 

#### Tips: 




In [None]:
# ----------------------- R
library(randomForest)
# mtry = number of features tried at each split, ntree = number of trees to build; choosing mtry = P turns the forest into bagging
Tree.random <- randomForest(Y ~. , data = train, mtry = 13, ntree = 500, importance = TRUE)
Predict.random <- predict(Tree.random, newdata = test)
# Importance
importance(Tree.random) # the larger the increase (e.g. %IncMSE), the more important the feature


In [None]:
# ----------------------- Python

# Classifier
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5)
# Regressor
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor  # [regressor]: any scikit-learn regressor can be used here
bagging = BaggingRegressor(KNeighborsRegressor(),
                           max_samples=0.5, max_features=0.5)


## --------------------- Generalized Boosted Regression tree

#### Wiki Definition: 
https://en.wikipedia.org/wiki/Gradient_boosting

Each new tree is built on top of the previous trees and focuses on their weaknesses (the observations they predict poorly) to boost performance.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

#### Input Data: 
X(Numeric) / X(categorical)

#### Initial Parameters: 
Number of trees (if too large, boosting can overfit the data; this differs from bagging)

Shrinkage parameter (a small positive number): controls the learning rate of boosting, typically 0.01 or 0.001. A very small shrinkage requires a large number of trees.

Number of splits in each tree (controls the complexity, i.e. how closely boosting fits the data) ~ 1 (a weak learner often works better)
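
The interaction between shrinkage and the number of trees can be checked with scikit-learn's `GradientBoostingRegressor`. A small sketch, assuming the synthetic `make_friedman1` data and a fixed budget of 100 single-split trees (`max_depth=1`); the two learning rates are chosen only to make the contrast visible.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

results = {}
for learning_rate in (0.1, 0.001):
    # Same number of trees and one split per tree; only the shrinkage differs.
    est = GradientBoostingRegressor(n_estimators=100, learning_rate=learning_rate,
                                    max_depth=1, random_state=0)
    est.fit(X_train, y_train)
    results[learning_rate] = mean_squared_error(y_test, est.predict(X_test))
    print(f"learning_rate={learning_rate}: test MSE = {results[learning_rate]:.2f}")
```

With only 100 trees, the very small shrinkage has barely moved away from the initial constant prediction, which is why it needs far more trees to reach a comparable error.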

#### Cost Function: 
RSS
#### Process Flow: 
Build a weak learner on the data -> compute its prediction errors -> build a new model that focuses on those errors (by reweighting the observations or fitting the residuals) and add it to the previous model -> repeat this for many new models to address the remaining errors -> boosted performance 
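
For the RSS cost, the flow above amounts to repeatedly fitting a weak learner to the current residuals. The following is a minimal sketch of that idea, assuming a noisy sine curve as toy data, single-split trees as weak learners, and illustrative values `shrinkage = 0.1` and `n_trees = 200`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

shrinkage, n_trees = 0.1, 200
prediction = np.full_like(y, y.mean())  # start from a constant model
stumps, train_mse = [], []

for _ in range(n_trees):
    residual = y - prediction                   # errors of the current model
    stump = DecisionTreeRegressor(max_depth=1)  # weak learner: a single split
    stump.fit(X, residual)                      # focus on what is still wrong
    prediction += shrinkage * stump.predict(X)  # add a shrunken correction
    stumps.append(stump)
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 1 tree: {train_mse[0]:.3f}, "
      f"after {n_trees} trees: {train_mse[-1]:.3f}")
```

To predict on new data, the same sum would be applied: the training mean plus `shrinkage` times each stump's prediction.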

#### Evaluation Methods: 

#### Tips: 




In [None]:
# ----------------------- R
# https://cran.r-project.org/web/packages/gbm/gbm.pdf
library(gbm)
# shrinkage = learning rate (0.001 here), n.trees = number of trees, interaction.depth = maximum depth of each tree (1 = a single split), distribution = "bernoulli" for binary classification (Y coded 0/1), "gaussian" for numeric regression
Table.boost <- gbm(Y ~ ., data = train, shrinkage = 0.001, distribution = "gaussian", n.trees = 5000, interaction.depth = 1)
summary(Table.boost) # relative influence of each feature
# n.trees used for prediction cannot exceed the number of trees that were fitted
Predict.boost <- predict(Table.boost, newdata = test, n.trees = 5000)
mean((Predict.boost - test$Y)^2) # test MSE



In [None]:
# ----------------------- Python
# http://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]
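# loss='squared_error' is the least-squares loss (spelled 'ls' in older scikit-learn releases)
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
    max_depth=1, random_state=0, loss='squared_error').fit(X_train, y_train)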
mean_squared_error(y_test, est.predict(X_test))



## --------------------- Generalized Boosted Classification tree

#### Wiki Definition: 
https://en.wikipedia.org/wiki/Gradient_boosting

Each new tree is built on top of the previous trees and focuses on their weaknesses (the observations they classify poorly) to boost performance.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

#### Input Data: 
X(Numeric) / X(categorical)

#### Initial Parameters: 
Number of trees (if too large, boosting can overfit the data; this differs from bagging)

Shrinkage parameter (a small positive number): controls the learning rate of boosting, typically 0.01 or 0.001. A very small shrinkage requires a large number of trees.

Number of splits in each tree (controls the complexity, i.e. how closely boosting fits the data) ~ 1 (a weak learner often works better)

#### Cost Function: 
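Cross-entropy (Bernoulli deviance / log loss), which is differentiable as gradient boosting requires; classification error and the Gini index are better suited to evaluating the result or to splitting inside a single tree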

#### Process Flow: 
Build a weak learner on the data -> compute its prediction errors -> build a new model that focuses on those errors (by reweighting the observations or fitting the gradient of the loss) and add it to the previous model -> repeat this for many new models to address the remaining errors -> boosted performance 
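
For binary classification, one way to read the flow above is gradient boosting on the log loss: each weak learner is fitted to the gap between the labels and the current predicted probabilities. The sketch below is a simplified version (full implementations typically also re-estimate the leaf values); the toy data, the helper names `sigmoid` and `log_loss`, and the values `shrinkage = 0.5` and `n_trees = 100` are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # binary labels coded 0/1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

shrinkage, n_trees = 0.5, 100
base_rate = y.mean()
score = np.full_like(y, np.log(base_rate / (1 - base_rate)))  # initial log-odds
initial_loss = log_loss(y, sigmoid(score))

for _ in range(n_trees):
    gradient = y - sigmoid(score)               # negative gradient of the log loss
    stump = DecisionTreeRegressor(max_depth=1)  # weak learner: a single split
    stump.fit(X, gradient)
    score += shrinkage * stump.predict(X)       # shrunken step in log-odds space

final_loss = log_loss(y, sigmoid(score))
accuracy = np.mean((sigmoid(score) > 0.5) == (y == 1))
print(f"log loss: {initial_loss:.3f} -> {final_loss:.3f}, "
      f"training accuracy: {accuracy:.3f}")
```

The trees are regression trees even though the task is classification, because they model the gradient in log-odds space rather than the class labels directly.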

#### Evaluation Methods: 

#### Tips: 





In [None]:
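# ----------------------- R
# https://cran.r-project.org/web/packages/gbm/gbm.pdf
library(gbm)
# shrinkage = learning rate, n.trees = number of trees, interaction.depth = maximum depth of each tree (1 = a single split), distribution = "bernoulli" for binary classification (Y coded 0/1), "gaussian" for numeric regression
# The parameter values below are illustrative only; the original left them as # placeholders to be tuned
Table.boost <- gbm(Y ~ ., data = train, shrinkage = 0.001, distribution = "bernoulli", n.trees = 5000, interaction.depth = 1)
summary(Table.boost) # relative influence of each feature
# Returns log-odds by default; add type = "response" to get probabilities
Predict.boost <- predict(Table.boost, newdata = test, n.trees = 5000)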



In [None]:
# ----------------------- Python
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)    



----------------------------------------------------------------------------------------------------------------------

# Evaluation Methods