# Decision Tree

- Supervised learning algorithm; a logical, non-parametric model.
- Classification and Regression Trees (CART) refers to the decision tree algorithm that can be learned for classification or regression problems. For classification it uses the Gini index to choose split points.
- Decision trees are high-variance algorithms, meaning that different samples of the training data can lead to very different trees.

The main objective of a decision tree is to split the data so that the elements in each group belong, as far as possible, to the same category. Splits are chosen using a measure of how well a candidate partition separates the classes. The most popular measures are:

- Gini index
- Information gain
- Chi-square

__Gini Index__

- The Gini index is the cost function used to evaluate splits in the dataset.

- It measures how often a randomly chosen element would be misclassified if it were labelled at random according to the class distribution in the node. An attribute whose split gives a lower Gini index should be preferred.

- The Gini index is 0 when all elements in a node belong to a single class. Its maximum is 1 - 1/k for k equally likely classes (0.5 for two classes), so it approaches 1 only when the elements are spread evenly across many classes. When the Gini index equals 0, the node is pure and is not split further.

__Information Gain__

Information gain is derived from entropy. Entropy measures the amount of impurity in a given set of data.

Information gain is used to determine which feature gives us the most information about a class.

High entropy means that the set contains a mix of different classes; low entropy means that it contains predominantly one class. We therefore want to split a node in a way that decreases the entropy.

Information gain is the decrease in entropy: the difference between the entropy before the split and the weighted average entropy after the split on a given attribute. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain.

IG = entropy before splitting (parent) - weighted average entropy after splitting (children)

Information gain is biased towards attributes with many outcomes; that is, it prefers attributes with a large number of distinct values. For instance, a unique identifier such as customer_ID puts every record in its own pure partition, so the entropy after the split is zero. This maximizes the information gain but produces a partitioning that is useless for prediction.
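The bias towards many-valued attributes can be checked numerically. The following is a minimal sketch in plain Python; the six-customer dataset and the names `churned`, `segment`, and `customer_id` are made up for illustration and do not come from these notes.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Parent entropy minus the weighted average entropy of the children."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - children

# Hypothetical toy data: six customers and whether they churned.
churned = ["yes", "no", "yes", "no", "yes", "no"]
segment = ["a", "a", "a", "b", "b", "b"]
customer_id = [1, 2, 3, 4, 5, 6]  # unique per record

ig_segment = information_gain(segment, churned)
ig_customer_id = information_gain(customer_id, churned)
print(f"IG(segment)     = {ig_segment:.3f}")
print(f"IG(customer_id) = {ig_customer_id:.3f}")
```

Here the identifier reaches the maximum possible gain of 1 bit, while `segment` gains only about 0.08 bits, even though the identifier cannot generalize to any new customer.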


__Assumptions__

- In the beginning, the whole training set is considered at the root.
- The root node (the first decision node) partitions the data using the feature that provides the most information gain.
- Feature values are preferred to be categorical. In ID3-style algorithms, continuous values are discretized before building the model; CART instead splits continuous features on a threshold.
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed as the root or internal nodes of the tree is determined by a statistical measure such as information gain or the Gini index.

__How it works__
- It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
- Pruning: removing the sub-nodes of a decision node is called pruning. It is the opposite of splitting.

__How to choose an optimal max_depth for the tree?__

- The goal is a pruned tree that is less complex, explainable, and easy to understand.
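One common way to pick `max_depth` is to compare cross-validated accuracy across candidate depths. The sketch below assumes scikit-learn is available and uses its bundled iris dataset as a stand-in; the depth range 1 to 7 and the 5-fold setting are arbitrary choices, not recommendations from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset, not from these notes

# Mean 5-fold cross-validated accuracy for each candidate depth.
cv_accuracy = {}
for depth in range(1, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_accuracy[depth] = cross_val_score(tree, X, y, cv=5).mean()

# Prefer the shallowest depth that reaches the best score, since a
# smaller tree is easier to explain.
best_score = max(cv_accuracy.values())
best_depth = min(d for d, s in cv_accuracy.items() if s == best_score)
print(best_depth, round(best_score, 3))
```

Choosing the shallowest depth among the best-scoring ones is a design choice that favours explainability; other tie-breaking rules are equally valid.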



- The core algorithm for building decision trees is called ID3. ID3 uses entropy and information gain to construct a decision tree; entropy is used to calculate the homogeneity of a sample.

If the sample is completely homogeneous, the entropy is zero; if a two-class sample is equally divided, the entropy is one.

Not ideal for continuous variables: when working with continuous numerical variables, a decision tree loses information when it bins them into categories.

It requires less data preprocessing from the user; for example, there is no need to normalize columns.

A small variation in the data can result in a very different decision tree. This variance can be reduced by ensemble methods such as bagging and boosting.

Decision trees are biased by imbalanced datasets, so it is recommended to balance the dataset before building the tree.

criterion: measures the quality of a split in a classification tree. By default it is 'gini'; it also supports 'entropy'.

max_depth: sets the maximum depth to which the tree may grow.

min_samples_leaf: sets the minimum number of samples required at a leaf node.

Visualizing a decision tree: using graphviz we can visualize the tree.
The export_graphviz function converts a decision tree classifier into a dot file, and pydotplus converts this dot file to PNG or another form displayable in Jupyter.

```python
import pandas as pd

data = pd.DataFrame({"toothed":["True","True","True","False","True","True","True","True","True","False"],
                     "hair":["True","True","False","True","True","True","False","False","True","False"],
                     "breathes":["True","True","True","True","True","True","False","True","True","True"],
                     "legs":["True","True","False","True","True","True","False","False","True","True"],
                     "species":["Mammal","Mammal","Reptile","Mammal","Mammal","Mammal","Reptile","Reptile","Mammal","Reptile"]},
                    columns=["toothed","hair","breathes","legs","species"])

features = data[["toothed","hair","breathes","legs"]]
target = data["species"]
```

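Displaying `data` gives:

|   | toothed | hair | breathes | legs | species |
|---|---------|------|----------|------|---------|
| 0 | True | True | True | True | Mammal |
| 1 | True | True | True | True | Mammal |
| 2 | True | False | True | False | Reptile |
| 3 | False | True | True | True | Mammal |
| 4 | True | True | True | True | Mammal |
| 5 | True | True | True | True | Mammal |
| 6 | True | False | False | False | Reptile |
| 7 | True | False | True | False | Reptile |
| 8 | True | True | True | True | Mammal |
| 9 | False | False | True | True | Reptile |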
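The `criterion`, `max_depth`, and `min_samples_leaf` parameters and the `export_graphviz` function described earlier can be tried on these ten animals. This is a sketch that assumes scikit-learn (the notes name its parameters but never the library itself). It re-enters the data as plain Python booleans encoded as 0/1 rather than the original strings, and `criterion="entropy"` and `max_depth=3` are arbitrary illustrative settings.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# The ten animals from the table: toothed, hair, breathes, legs, species.
rows = [
    (True, True, True, True, "Mammal"),
    (True, True, True, True, "Mammal"),
    (True, False, True, False, "Reptile"),
    (False, True, True, True, "Mammal"),
    (True, True, True, True, "Mammal"),
    (True, True, True, True, "Mammal"),
    (True, False, False, False, "Reptile"),
    (True, False, True, False, "Reptile"),
    (True, True, True, True, "Mammal"),
    (False, False, True, True, "Reptile"),
]
feature_names = ["toothed", "hair", "breathes", "legs"]

X = [[int(value) for value in row[:4]] for row in rows]  # True/False -> 1/0
y = [row[4] for row in rows]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                             min_samples_leaf=1, random_state=0)
clf.fit(X, y)

# Plain-text view of the learned rules.
print(export_text(clf, feature_names=feature_names))

# DOT source for graphviz; with out_file=None it is returned as a string.
dot_source = export_graphviz(clf, out_file=None,
                             feature_names=feature_names,
                             class_names=[str(c) for c in clf.classes_],
                             filled=True)
```

In this dataset `hair` separates mammals from reptiles perfectly, so the fitted tree needs only a single split. The `dot_source` string can then be handed to graphviz or pydotplus for rendering.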


Random forests are an example of an ensemble learner built on decision trees.

__Q When would you use Decision tree over Random Forest?__

- When the dataset is small, does not follow a known parametric form, and we are not worried about accuracy on future datasets.
- When we need a model that is easy to compute and that makes it easy to explain why a particular variable has higher importance.
- If the goal is exploratory analysis, we should prefer a single DT, in order to understand the data relationships in a tree hierarchy.
- The tree can be visualized, so it is easier to explain the model to non-technical users.
- Caveat: DTs are prone to overfitting. Setting a max depth reduces overfitting but introduces bias.

Random forest should be preferred if:
- The goal is better predictions: RF reduces variance, so it suits cases where accuracy is prioritised over explainability.
- A single tree has high variance (it overfits): bagging and sampling techniques, applied correctly, reduce overfitting.


A single DT tends to overfit, especially on a large dataset; it is like relying on one person's point of view.
However, if we have a voting mechanism and ask different individuals (trees) to interpret the data, we can cover the patterns much more thoroughly. This is the case with RF.
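The two ingredients of that voting mechanism, bootstrap sampling and majority voting, can be sketched in a few lines of plain Python. The helper names `bootstrap_sample` and `majority_vote`, the ten stand-in records, and the three example votes are all hypothetical; a real random forest also samples features and trains an actual tree on each sample.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))  # stand-in for ten training records
samples = [bootstrap_sample(rows, rng) for _ in range(3)]  # one per tree

# Hypothetical votes from three trees for one new animal.
votes = ["Mammal", "Reptile", "Mammal"]
print(majority_vote(votes))  # Mammal
```

Because each tree sees a slightly different sample, their individual errors differ, and the vote averages much of that variance away.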

Decision trees work well when the data is less complex and the splits are easily determinable. Random forests are helpful when the dataset is comparatively large and the number of features is huge. Random forests also handle noisy and missing data better than a single decision tree.


__Q What is difference between Gini Impurity and Entropy in Decision Tree?__

There are three commonly used impurity measures in binary decision trees:

- Entropy (a way to measure impurity)
- Gini index (a criterion that minimizes the probability of misclassification)
- Classification error

How Gini and entropy compare:

- Both Gini impurity and entropy are criteria for splitting a node in a decision tree.
- A commonly repeated rule of thumb says that Gini is intended for continuous attributes and entropy for attributes that occur in classes; in practice both are computed from the class proportions in a node and can be used with either kind of attribute.
- Gini tends to isolate the largest class, while entropy tends to find groups of classes that make up ~50% of the data.
- Gini is aimed at minimizing misclassification; entropy is often used for exploratory analysis.
- Because the ensemble model (RF) is quite robust and resistant to noise from the individual decision trees, we typically don't need to prune a random forest, and the main parameter we care about is the number of trees, k.


Most of the time, the performance of a model won't change whether you use Gini or entropy. In terms of computation, entropy takes more time because it involves a logarithm.

In classification trees, the Gini index is used to compute the impurity of a data partition. Assume a data partition D consisting of 4 classes, each with equal probability. Then the Gini index (Gini impurity) is:

Gini(D) = 1 - (0.25^2 + 0.25^2 + 0.25^2 + 0.25^2) = 1 - 0.25 = 0.75
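The same calculation, alongside the other two impurity measures, can be written as a short sketch in plain Python. The function names are illustrative, and each function takes the class probabilities of a node as input.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def classification_error(probs):
    """1 minus the probability of the majority class."""
    return 1.0 - max(probs)

uniform4 = [0.25] * 4  # the partition D from the example above
print(gini(uniform4), entropy(uniform4), classification_error(uniform4))
```

For four equally likely classes this gives a Gini impurity of 0.75, an entropy of 2 bits, and a classification error of 0.75; all three measures are zero for a pure node.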

__NOTE__ 

- The entropy is 0 if all samples of a node belong to the same class, and it is maximal for a uniform class distribution. In other words, the entropy of a node consisting of a single class is zero because the probability is 1 and log(1) = 0; entropy reaches its maximum when all classes in the node have equal probability.

- Gini index is an intermediate measure between entropy and the classification error.

- Underfitting corresponds to high bias; overfitting corresponds to high variance.

- In order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, pruning, or Bayesian priors).

- Regularization is a way of finding a good bias-variance tradeoff by tuning the complexity of the model. It is a very useful method to handle collinearity (high correlation among features), filter out noise from data, and eventually prevent overfitting. The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights.

In SVMs, our optimization objective is to maximize the margin. The primary reason for having decision boundaries with large margins is that they tend to have a lower generalization error, whereas models with small margins are more prone to overfitting.

The reason for introducing the slack variable ξ is that the linear constraints need to be relaxed for data that is not linearly separable, so that the optimization can converge in the presence of misclassifications, under an appropriate cost penalty.

With the parameter C, we can penalize misclassification. Large values of C correspond to large error penalties, while smaller values of C make us less strict about misclassification errors. We can therefore use C to control the width of the margin and so tune the bias-variance trade-off.
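For reference, the soft-margin objective that ties the margin, the slack variables ξ, and the penalty C together is usually written as below. This is the standard textbook form; the symbols w (weight vector), b (bias), and n (number of training samples) are not defined in these notes and are introduced here as assumptions.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
\frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1 - \xi_i,
\quad \xi_i \ge 0, \quad i = 1, \dots, n
```

Minimizing the first term widens the margin, while the second term charges C for every unit of slack, which is why a large C yields a narrower margin with fewer training errors.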

The basic idea of kernel methods is to deal with linearly inseparable data.

The principle behind Maximum Entropy [4] is that the correct distribution is the one that maximizes the Entropy / uncertainty and still meets the constraints which are set by the ‘evidence’.

__Q When would you prefer decision tree?__

The advantages of a decision tree are that it does not require much data preprocessing and makes no assumptions about the distribution of the data. The algorithm is very useful for identifying hidden patterns in a dataset.
When the data is simple and has few features, a decision tree might be useful; otherwise a random forest will usually give better predictions.

Random forests can use a random subset of features and/or samples for each of their trees, whereas a single decision tree does not.
We prefer a decision tree to a random forest when interpretability is more important than accuracy.
We can easily visualize a decision tree and follow its decision sequence for a prediction, which helps when describing the model to business users. With a random forest we can visualize one, two, or all of the trees, but we cannot read off a single summary decision sequence for the whole forest.

- You want visuals => DT
- You want POWER => RF

Therefore a decision tree is the better choice if a simple and fast model already meets our current EDA and prediction demands. However, it is a model that is very prone to overfitting. If we need to analyse a bigger dataset with many features and a higher demand for accuracy, a random forest is a better choice.
By bootstrapping over the dataset, a random forest yields a more accurate and more general model. However, it is important to note that even a random forest can have trouble with outliers and rare events. Depending on the problem, another model such as a neural network may be more appropriate.

Decision trees are preferred over random forests in a few cases:
1. When you have little training data. A random forest is an ensemble of decision trees; with very little data it can still overfit, because the individual trees try to match every input in the dataset.
2. When computational power is limited.
3. When we want the model to be simple and explainable, even to business users.

Decision tree advantages:
1. Easy to understand and easy to implement.
2. Runs fast.
3. Scales well with large datasets.
4. Works on numerical and categorical data.
5. The workings of the algorithm can be observed, so the work can be reproduced.

Decision tree disadvantages:
1. Prone to overfitting. Setting a max depth reduces overfitting but introduces bias.
2. Running decision tree algorithms in a reasonable time requires greedy algorithms, which produce a local optimum instead of a global optimum.

Decision tree is better than random forest if speed is critical and accuracy can be traded off.


__Q Scaling Vs Normalization - What is the difference?__

Scaling is needed mainly for algorithms that internally use a distance measure (say, Euclidean distance), whereas normalization is needed when comparing populations or phenomena of different sizes but with the same origin.
Scaling just changes the range of your data: you transform the data so that it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like SVM or KNN. With these algorithms, a change of "1" in any numeric feature is given the same importance.

For example, you might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions of the world. With currency, you can convert between currencies. But what if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one meter).
By scaling your variables, you can compare different variables on an equal footing.
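A minimal sketch of min-max scaling, one common way to rescale a feature to the range 0-1, shows the effect on the currency example. The function name `min_max_scale` and the four product prices are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical prices of the same four products in two currencies.
prices_usd = [1.0, 2.5, 5.0, 10.0]
prices_yen = [100.0, 250.0, 500.0, 1000.0]

scaled_usd = min_max_scale(prices_usd)
scaled_yen = min_max_scale(prices_yen)
print(scaled_usd)
print(scaled_yen)
```

After scaling, both columns describe the products on the same 0-1 range, so a distance-based method no longer treats the Yen column as a hundred times more important.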

In general, you'll only want to normalize your data if you're going to use a machine learning or statistics technique that assumes your data is normally distributed. Examples include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)
One method used to normalize data is the Box-Cox transformation.
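The Box-Cox transformation itself is a simple formula: (x^λ - 1) / λ for λ ≠ 0, and log(x) for λ = 0. The sketch below applies it in plain Python for a fixed λ chosen by hand. Libraries such as SciPy can instead estimate λ from the data, which this sketch does not attempt, and the sample values are made up.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

skewed = [1.0, 2.0, 4.0, 8.0, 16.0, 64.0]  # made-up right-skewed sample
print(box_cox(skewed, 0))    # log transform
print(box_cox(skewed, 0.5))  # square-root-like transform
```

Both settings compress the long right tail, which is what moves a skewed sample closer to a normal shape.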




__Q What Is The Difference Between PCA and PLS (Principal Component Analysis VS. Partial Least Squares)?__

__PCA__

PCA tries to explain the variance-covariance structure of a data set. It looks for directions that capture as much of the variance of the features as possible, so that the loss of information is kept small. PCA is a dimensionality reduction algorithm, and its principal components are pairwise orthogonal.

__PLS__

Partial Least Squares uses the annotated labels: its components are chosen to maximize the covariance between the features and the labels, rather than the variance of the features alone. Both PLS and PCA are used for dimension reduction.
The main difference is that PCA is an unsupervised method and PLS is a supervised method.


__Q You are given a data set for classification, which model would you use and why: Logistic Regression, SVM or Neural Networks?__

In general, it depends on the kind of data and on the number of samples and features.

For text classification/categorization: I would recommend naive Bayes or a linear SVM.

For datasets with numerical attributes: I would suggest a linear SVM, neural networks or logistic regression if the number of features is much greater than the number of samples.

On the other hand, I would recommend neural networks or an SVM with an RBF or polynomial kernel if the number of samples is not too large and is greater than the number of features.

Otherwise, if the number of samples is huge, I would suggest neural networks or a linear SVM. ANNs require a large dataset for training.

-----------------------------------

SVMs are really good when you have a high-dimensional dataset and not a lot of data, as they can handle a large number of features. An SVM is better suited to situations where the dataset is not too large. An SVM without a kernel is also a linear classifier.

Logistic regression is best when you have linear or binary classification problems. LR is a very good all-purpose algorithm; if you need probabilities or you have a lot of data, LR is usually good.

- reduced number of features => LR
- a lot of features but not a lot of data => SVM
- a lot of features and a lot of data => NN

--------------------------------------------------------------------
- Linear SVM (with a linear kernel) and LR are classification methods with linear decision boundaries.
- LR produces probabilistic values, while SVM produces a hard class label (1 or 0).
- SVMs are great for relatively small data sets with few outliers.
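The difference in outputs can be seen by applying both decision rules to the same linear score. This is a sketch in plain Python; the weights, bias, and input point are hypothetical stand-ins for parameters that a real model would learn, and in practice LR and SVM would learn different parameters from the same data.

```python
import math

def linear_score(weights, bias, x):
    """Signed score w.x + b shared by both linear models."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def lr_probability(weights, bias, x):
    """Logistic regression passes the score through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-linear_score(weights, bias, x)))

def svm_label(weights, bias, x):
    """A linear SVM only reports which side of the boundary x falls on."""
    return 1 if linear_score(weights, bias, x) >= 0 else 0

weights, bias = [2.0, -1.0], 0.5  # hypothetical, already-fitted parameters
point = [1.0, 0.5]

print(lr_probability(weights, bias, point))  # a probability between 0 and 1
print(svm_label(weights, bias, point))       # a hard label: 1
```

Both models place the point on the same side of the boundary, but only logistic regression says how confident it is.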




