# Decision Tree

Decision Trees are widely used supervised machine learning algorithms that can be used for both classification and regression tasks. They are intuitive to understand and interpret, making them popular among data scientists. A decision tree is a flowchart-like structure where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents the outcome.

The concept behind decision trees is to split the dataset on different attributes/features in such a way that the resulting subsets are as pure as possible. The purity of a subset refers to the homogeneity of the target variable within that subset. The goal at each split is to reduce impurity, which means maximizing information gain or minimizing the Gini index, depending on the splitting criterion used.

The following assumptions are commonly made when implementing the Decision Tree algorithm:

- Initially the entire training set is considered as the root node.
- Feature values are preferably categorical. In the basic formulation, continuous values are discretized before the tree is built.
- Records are distributed recursively based on attribute values.
- The selection of attributes as root or internal nodes is determined using a statistical approach.

The Decision Tree algorithm is often referred to as CART, which stands for Classification and Regression Trees. Leo Breiman introduced this term to describe Decision Tree algorithms that can be used for both classification and regression tasks. CART serves as the foundation for other significant algorithms such as bagged decision trees, random forests, and boosted decision trees. Here, we focus on the classification problem and therefore on the Decision Tree classification algorithm.

## Geometrical Intuition

A decision tree can be viewed as a nested `if-else` classifier, because decisions are made one after another, each nested inside the previous one. The points where we make a decision or store a result are called nodes, while the splits of the data that follow from each decision are called branches. There are three types of nodes: the root node, internal nodes, and leaf nodes. Decisions are made at the root and internal nodes, while leaf nodes hold the result. Each decision corresponds to a hyperplane ($\pi_1$ and $\pi_2$ in the figure). All of these hyperplanes are axis-parallel, and together they split the data region into hypercubes and hypercuboids.

![Decision Tree: Geometrical Intuition](./../../assets/decision-tree.jpg)
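To make the nested `if-else` view concrete, here is a minimal sketch of a tree over two features. The feature names `x1` and `x2`, the thresholds, and the class labels are made up for illustration and are not taken from the figure. Each comparison tests a single feature, which is why every decision boundary is axis-parallel.

```python
def classify(x1: float, x2: float) -> str:
    """Toy two-feature tree; thresholds 2.0 and 5.0 are illustrative assumptions."""
    if x1 < 2.0:            # root node: boundary x1 = 2.0 (axis-parallel)
        return "A"          # leaf node
    else:
        if x2 < 5.0:        # internal node: boundary x2 = 5.0 (axis-parallel)
            return "B"      # leaf node
        else:
            return "C"      # leaf node


print(classify(1.0, 9.0), classify(3.0, 1.0), classify(3.0, 7.0))
```

The two boundaries cut the plane into three rectangular regions, one per leaf.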

## Attribute Selection Measures

The challenge in Decision Tree implementation is selecting attributes for the root node and each level. This process is called attribute selection, and two popular measures are used:

- **Information gain:** This measure is commonly associated with categorical attributes. It quantifies the reduction in entropy achieved by splitting the dataset on a particular attribute. A higher information gain indicates that the attribute is more suitable for splitting the data.

- **Gini index:** This measure is commonly associated with continuous attributes. It measures the probability of incorrectly classifying a randomly chosen element of the dataset if that element were labeled at random according to the class distribution. A lower Gini index suggests that an attribute is better for splitting the data.

### Building Decision Tree using Information Gain

Using information gain as a criterion, we try to estimate the information contained by each attribute. To understand the concept of Information Gain, we need to know another concept called Entropy.

**Entropy:**

Entropy measures the impurity in the given dataset. In simple terms, it refers to the randomness or uncertainty of a set of examples. Information gain is the decrease in entropy. It calculates the difference between the entropy before and after splitting the dataset based on attribute values.

Entropy can be represented by a formula that considers the number of classes and the probability associated with each class.

\begin{equation}
\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)
\end{equation}

where,

- $C$ is the number of classes
- $p_i$ is the probability associated with the $i^{\text{th}}$ class
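As a minimal sketch, the entropy formula can be computed directly from a list of class labels. The function name `entropy` and the sample labels are illustrative assumptions. Each $p_i$ is estimated as the fraction of examples that belong to class $i$.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, with p_i estimated from class counts."""
    n = len(labels)
    counts = Counter(labels)
    # Only classes that actually occur are summed, so log2(0) never arises.
    return sum(-(c / n) * log2(c / n) for c in counts.values())


print(entropy(["yes", "yes", "no", "no"]))    # evenly mixed two-class set
print(entropy(["yes", "yes", "yes", "yes"]))  # pure set
```

An evenly mixed two-class set has the maximum two-class entropy of 1 bit, while a pure set has an entropy of 0.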

The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses entropy to calculate information gain. For each attribute, we compute the decrease in entropy obtained by splitting on it; this decrease is the attribute's information gain. The attribute with the highest information gain is chosen as the splitting attribute at the node.
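The attribute selection step can be sketched as follows. This is not a full ID3 implementation: it only scores attributes at a single node. The helper names, the dictionary-based row format, and the toy records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)   # one subset per attribute value
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
print(best_attribute(rows, labels, ["outlook", "windy"]))
```

On this toy data, splitting on `outlook` yields pure subsets, so it removes all of the entropy, while splitting on `windy` removes none.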

### Building Decision Tree using Gini Index

Moving on to another attribute selection measure, CART (Classification and Regression Trees) uses the Gini index to create split points.

The Gini index measures the impurity of a dataset. It is represented by a formula that considers the number of classes and their probabilities.

\begin{equation}
\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2
\end{equation}

where,

- $C$ is the number of classes
- $p_i$ is the probability associated with the $i^{\text{th}}$ class

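The Gini index is based on the following idea: if we randomly select two items from a population (with replacement), the probability that they belong to the same class is $\sum_{i=1}^{C} p_i^2$. This probability is 1 if the population is pure. The Gini index is one minus this probability, so a pure population has a Gini index of 0.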
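As a minimal sketch, the Gini formula above can be computed from a list of class labels. The function name `gini` and the sample labels are illustrative assumptions, and returning 0 for an empty list is a convention chosen here, not something the formula defines.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention for an empty subset
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


print(gini(["Success", "Success", "Failure", "Failure"]))  # evenly mixed
print(gini(["Success", "Success", "Success"]))             # pure
```

For two classes the value ranges from 0 for a pure set to 0.5 for an evenly mixed set.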

CART (Classification and Regression Trees) uses the Gini method to create binary splits, for example on a categorical target variable with the values "Success" and "Failure". With the formula above, the lower the value of the Gini index, the higher the homogeneity of the dataset.

To calculate Gini for a split,

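- Calculate the Gini index of each sub-node using the formula above, that is, one minus the sum of the squared probabilities of success and failure.
- Calculate the Gini index of the split as the weighted average of the sub-node Gini scores, where each weight is the fraction of records that falls into that sub-node.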
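The two steps above can be sketched for a binary split as follows. The function names and the toy sub-nodes are assumptions made for illustration. Each sub-node is weighted by the fraction of records it receives.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention for an empty sub-node
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_of_split(left, right):
    """Weighted Gini index of a binary split into two sub-nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical split of 8 records (4 "S", 4 "F") into two sub-nodes.
left = ["S", "S", "S", "F"]
right = ["F", "F", "F", "S"]
print(gini(left + right))          # Gini index before the split
print(gini_of_split(left, right))  # weighted Gini index after the split
```

Here the parent node has a Gini index of 0.5 and the split has a weighted Gini index of 0.375, so the split reduces impurity.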

For discrete-valued attributes, the subset of values that gives the minimum Gini index for that attribute is selected as its splitting subset. For continuous-valued attributes, we consider each pair of adjacent sorted values as a possible split point and choose the point with the smallest Gini index.
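For the continuous case, a minimal sketch of the split-point search is shown below. It assumes the candidate threshold is the midpoint between two adjacent sorted values, which is a common convention rather than something the text prescribes. The function names and the toy data are illustrative.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) for the best binary split of one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                      # identical values give no split point
        t = (v1 + v2) / 2                 # assumed: midpoint of adjacent values
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


print(best_threshold([1.0, 2.0, 3.0, 4.0], ["F", "F", "S", "S"]))
```

This version rescans the data for every candidate, which keeps the logic easy to follow; a practical implementation would update class counts incrementally while sweeping the sorted values.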

## Overfitting in Decision Tree

Overfitting is a common problem when building a Decision Tree model. It occurs when the algorithm keeps adding branches to reduce training-set error but ends up increasing test-set error, resulting in lower prediction accuracy. This usually happens because the tree starts fitting outliers and irregularities in the training data.

To avoid overfitting, two approaches can be used: pre-pruning and post-pruning.

- **Pre-Pruning:**

In pre-pruning, we stop the construction of the tree before it becomes too complex. We decide not to split a node if its goodness measure falls below a certain threshold value. However, determining the appropriate stopping point can be challenging.
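A pre-pruning rule can be sketched as a simple check performed before each split. The function name, the parameters `max_depth`, `min_samples`, and `min_gain`, and their default values are all hypothetical; they are loosely similar in spirit to options such as `max_depth` and `min_impurity_decrease` in scikit-learn's `DecisionTreeClassifier`.

```python
def should_split(depth, n_samples, gain, max_depth=3, min_samples=5, min_gain=0.01):
    """Return True only if none of the (assumed) stopping conditions applies."""
    if depth >= max_depth:        # the tree is already deep enough
        return False
    if n_samples < min_samples:   # too few records to justify a split
        return False
    if gain < min_gain:           # goodness measure falls below the threshold
        return False
    return True


print(should_split(depth=0, n_samples=100, gain=0.20))
print(should_split(depth=1, n_samples=100, gain=0.001))
```

Choosing these thresholds is the hard part: values that are too strict stop the tree before it captures real structure, while values that are too loose leave overfitting in place.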

- **Post-Pruning:**

In post-pruning, we first grow the complete tree and then address overfitting by pruning it back. We use cross-validation data to assess the impact of pruning: for each node, we test whether keeping it expanded improves accuracy. If expanding the node improves accuracy, we keep the expansion. If expanding the node reduces accuracy, we convert that node into a leaf node.
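One simple way to realize this idea is reduced-error pruning, sketched below under several assumptions: the tree is a nested dictionary, every internal node stores the majority class of its training records under a hypothetical `"majority"` key, a single held-out validation set stands in for cross-validation data, and ties are resolved in favor of the smaller tree. The tree and validation records are made up for illustration.

```python
def predict(node, row):
    """Walk from the given node down to a leaf and return its label."""
    while "label" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


def n_correct(node, rows, labels):
    return sum(predict(node, r) == y for r, y in zip(rows, labels))


def prune(node, rows, labels):
    """Bottom-up pruning using the validation records that reach this node."""
    if "label" in node:
        return node
    f, t = node["feature"], node["threshold"]
    go_left = [r[f] <= t for r in rows]
    l_rows = [r for r, g in zip(rows, go_left) if g]
    l_labels = [y for y, g in zip(labels, go_left) if g]
    r_rows = [r for r, g in zip(rows, go_left) if not g]
    r_labels = [y for y, g in zip(labels, go_left) if not g]
    pruned = dict(node,
                  left=prune(node["left"], l_rows, l_labels),
                  right=prune(node["right"], r_rows, r_labels))
    leaf = {"label": node["majority"]}
    # Replace the subtree by a leaf unless expanding it is strictly better.
    if n_correct(leaf, rows, labels) >= n_correct(pruned, rows, labels):
        return leaf
    return pruned


# Hypothetical tree: the inner split at x <= 8 was fitted to a noisy record.
tree = {
    "feature": "x", "threshold": 5, "majority": "B",
    "left": {"label": "A"},
    "right": {
        "feature": "x", "threshold": 8, "majority": "B",
        "left": {"label": "B"},
        "right": {"label": "A"},
    },
}
val_rows = [{"x": 2}, {"x": 6}, {"x": 9}]
val_labels = ["A", "B", "B"]
pruned_tree = prune(tree, val_rows, val_labels)
print(pruned_tree)
```

On this toy data the inner split hurts validation accuracy and is collapsed into a leaf, while the root split is kept because it still helps. The original `tree` is left unmodified.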

These approaches, pre-pruning and post-pruning, are used to tackle the problem of overfitting in Decision-Tree models. They help in finding the right balance between complexity and generalization to improve the model's performance.

## References

- [Machine Learning - Decision Tree](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/)
- [An Introduction to Decision Trees](https://blog.paperspace.com/decision-trees/)