Link: https://medium.com/analytics-vidhya/https-medium-com-shashi-kiran-ai-decision-tree-intuition-92f708f13f33

## Decision Tree 

![DT](https://miro.medium.com/v2/resize:fit:828/format:webp/0*12CT72krFYLtGFGx.png)

A decision tree is a popular and intuitive machine learning model used for both classification and regression tasks. It is a flowchart-like structure in which each internal node represents a "test" on an attribute (feature), each branch represents an outcome of that test, and each leaf node holds a class label (in classification) or a numerical value (in regression).

Here's a concise definition:

- **Classification**: In classification tasks, decision trees partition the input space into regions, each corresponding to a specific class label. A new data point is routed from the root node down to a leaf node according to the outcome of each feature test, and it receives the most frequent class label among the training samples in that leaf.

- **Regression**: In regression tasks, decision trees predict a continuous value for the target variable. Similar to classification, the tree is built based on the features of the data, but the prediction at each leaf node is typically the average (or another summary statistic) of the target values of the training samples falling into that leaf.
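
The two leaf rules described above can be sketched in a few lines of Python. This is only an illustration: the function names `classification_leaf` and `regression_leaf` are hypothetical, and the mean is assumed as the summary statistic for regression.

```python
from collections import Counter
from statistics import mean

def classification_leaf(labels):
    """Return the most frequent class label among the training samples in a leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(targets):
    """Return the mean target value of the training samples in a leaf."""
    return mean(targets)

print(classification_leaf(["cat", "dog", "cat"]))  # cat
print(regression_leaf([10.0, 12.0, 14.0]))         # 12.0
```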

Decision trees are appealing because they are easy to understand and interpret, and they can handle both numerical and categorical data. They can also capture non-linear relationships between features and the target variable. However, they can be prone to overfitting, especially when the trees are allowed to grow too deep without constraints. Techniques like pruning and setting maximum depth help mitigate overfitting.

#### Example

Imagine you're trying to decide whether to go outside to play or stay indoors and read. You can think of a decision tree as a series of questions you ask yourself to make that decision.

1. **Start with a Question**: At the top of the decision tree, you might ask, "Is it raining?" This question helps split your decision into two possibilities: yes or no.

2. **Follow the Branches**: Depending on the answer, you follow different branches of the tree. If it's raining (yes), you might ask yourself, "Do I have rain gear?" If you do, you might go outside. If not, you might decide to stay indoors. If it's not raining (no), you might consider other factors, like the temperature. 

3. **Keep Asking**: You continue asking questions and following branches until you reach a conclusion. Each question helps narrow down your options until you're confident about what to do.

4. **Make a Decision**: Eventually, you reach a point where you've asked enough questions to make a decision. For example, if you have rain gear and it's raining, you might decide to go outside. If it's not raining and the temperature is nice, you might decide to go outside as well. Otherwise, you might stay indoors.

So, a decision tree is like a roadmap of decisions. It helps you navigate through different possibilities based on the answers to a series of questions. In data science, decision trees work similarly but are used to classify or predict outcomes based on features of the data.
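
The questions in this example map directly onto nested conditions. The sketch below hand-codes that tree; the function name `decide` and its boolean parameters are assumptions made for illustration, and a learned tree would derive such tests from data rather than have them written by hand.

```python
def decide(raining: bool, has_rain_gear: bool, nice_temperature: bool) -> str:
    """Walk the example tree from the root question down to a leaf decision."""
    if raining:
        # Left branch: the only thing that matters now is rain gear.
        return "go outside" if has_rain_gear else "stay indoors"
    # Right branch: no rain, so the temperature decides.
    return "go outside" if nice_temperature else "stay indoors"

print(decide(raining=True, has_rain_gear=False, nice_temperature=True))   # stay indoors
print(decide(raining=False, has_rain_gear=False, nice_temperature=True))  # go outside
```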

### What is Entropy?

Entropy, in the context of information theory and machine learning, is a measure of uncertainty or disorder in a system. It quantifies the amount of unpredictability or randomness in a dataset.

In the context of decision trees and classification tasks, entropy is commonly used as a criterion to determine the best split at each node. The goal is to reduce entropy and increase the purity of the subsets created by the split.

Mathematically, entropy is calculated using the following formula:

![](https://www.saedsayad.com/images/Entropy_3.png)

Where:
- \( p_i \) is the probability of occurrence of the \( i \)th class in the dataset.
- \( n \) is the total number of classes.

In this formula, entropy is measured in bits. When the dataset is perfectly homogeneous (i.e., all samples belong to the same class), the entropy is 0. When the dataset is evenly distributed across all classes (i.e., maximum uncertainty), the entropy is at its maximum value.
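
As a minimal sketch of the formula, the function below estimates each \( p_i \) from label counts and sums \( p_i \log_2 (1/p_i) \), which is the same quantity as \( -p_i \log_2 p_i \). The name `entropy` and the list-of-labels input are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    # c / n is the estimated probability p_i of a class; log2(n / c) equals -log2(p_i).
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

print(entropy(["red", "red", "red", "red"]))    # 0.0 (perfectly homogeneous)
print(entropy(["red", "blue", "red", "blue"]))  # 1.0 (two classes, evenly mixed)
print(round(entropy(["a", "b", "c"]), 3))       # 1.585 (three classes, evenly mixed)
```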

In decision trees, entropy is used to determine which feature and threshold to split on at each node. The split that results in the greatest reduction in entropy (or equivalently, the greatest increase in purity) is chosen. This process continues recursively until a stopping criterion is met, such as reaching a maximum tree depth or having all leaf nodes contain samples of the same class.

#### Example 

Imagine you have a bag of balls, some red and some blue, but you can't see inside. You want to know how mixed up they are. Entropy is a measure of how mixed up or uncertain the contents of the bag are.

1. **High Entropy**: If the bag has an equal number of red and blue balls, it's highly mixed up, and entropy is high. You're uncertain about which color you'll pick next.

2. **Low Entropy**: If the bag has only red balls, or only blue balls, it's not mixed up at all, and entropy is low. You're certain about which color you'll pick next.

Now, think of a decision tree trying to sort these balls. At each step, it looks to minimize the uncertainty (entropy) by making splits that organize the balls into more homogeneous groups. The best split is the one that reduces entropy the most, meaning it makes the groups more pure or less mixed up.

So, in a decision tree, entropy helps decide which questions to ask (which features to split on) to separate the data into more distinct categories. It's all about reducing uncertainty and making clearer decisions as the tree grows.

- For a two-class problem (yes or no), the minimum entropy is 0 and the maximum entropy is 1 bit.
- For a problem with \( n > 2 \) classes (yes, no, or maybe), the minimum entropy is still 0 and the maximum entropy is \( \log_2 n \), which exceeds 1 (about 1.585 for three classes).
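
As a short worked check of these bounds, assume all \( n \) classes are equally likely, so \( p_i = 1/n \). Writing the maximum entropy as \( H_{\max} \) (a symbol introduced here only for convenience), the entropy formula gives:

\[
H_{\max} = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n
\]

For \( n = 2 \) this is \( \log_2 2 = 1 \) bit, and for \( n = 3 \) it is \( \log_2 3 \approx 1.585 \) bits.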

### Information Gain

Information gain is a concept used in decision tree algorithms such as ID3 (Iterative Dichotomiser 3) and, in a normalized form called gain ratio, C4.5, to measure the effectiveness of a particular attribute in classifying the data. It represents the amount of information gained about the target variable (class label) when a dataset is split based on a particular attribute.

Here's a simple explanation of information gain:

1. **Entropy**: Entropy, as mentioned earlier, measures the impurity or uncertainty in a dataset. A dataset with high entropy means it's mixed up or uncertain, while low entropy indicates it's well-structured or certain.

2. **Information Gain**: Information gain is the reduction in entropy (or uncertainty) that results from splitting the dataset on a particular attribute. In other words, it quantifies how much more certain we are about the target variable after splitting the data based on a certain attribute.

Mathematically, information gain is calculated as follows:

```Information Gain = Entropy before split - Weighted average of entropies after split```

The "Entropy before split" represents the entropy of the original dataset before splitting, while the "Weighted average of entropies after split" represents the average entropy of the subsets created by the split, weighted by their relative sizes.

3. **Selecting the Best Split**: In decision tree algorithms, the attribute with the highest information gain is chosen as the best attribute to split the data on at each node. This attribute maximally reduces the uncertainty about the target variable and helps improve the predictive accuracy of the decision tree.

By selecting attributes with high information gain, decision trees can efficiently partition the dataset into subsets that are more homogeneous with respect to the target variable, ultimately leading to better classification performance.
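
The calculation above can be sketched as follows. The helper names `entropy` and `information_gain`, and the toy "yes"/"no" labels, are assumptions for illustration; the code simply subtracts the size-weighted entropy of the subsets from the entropy of the parent.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy before the split minus the weighted average entropy after it."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # each subset is pure
useless_split = [["yes", "no"], ["yes", "no"]]  # subsets are as mixed as the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree builder would evaluate this quantity for every candidate attribute at a node and keep the one with the largest value.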

### Gini impurity

Gini impurity is another measure of impurity or uncertainty commonly used in decision tree algorithms, particularly in CART (Classification and Regression Trees) algorithms. It quantifies the likelihood of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the subset.

Here's a breakdown of Gini impurity:

1. **Impurity Measure**: Gini impurity measures how often a randomly chosen element from the dataset would be incorrectly classified if it were randomly labeled according to the distribution of class labels in the subset.

2. **Calculation**: Mathematically, Gini impurity is one minus the sum of the squared class probabilities:

   ![](https://miro.medium.com/v2/resize:fit:610/1*Ag2dXbJt-1M9_S_bi8UBSw.png)

   Where \( n \) is the number of classes, and \( p_i \) is the probability of choosing a class \( i \).

3. **Interpretation**: Gini impurity ranges from 0 to \( 1 - 1/n \) for \( n \) classes, so it is always below 1:
   - A Gini impurity of 0 indicates that the subset is completely pure (all elements belong to the same class).
   - The maximum, \( 1 - 1/n \), is reached when the elements are evenly distributed across all classes. For two classes this is 0.5, and the value approaches 1 only as the number of classes grows.

4. **Selecting the Best Split**: Similar to information gain, decision tree algorithms aim to minimize Gini impurity when selecting the best attribute to split the data on at each node. The split with the lowest Gini impurity after the split, averaged over the resulting subsets and weighted by their sizes, is chosen as the best one.

Both Gini impurity and information gain are commonly used criteria for selecting the best split in decision tree algorithms. Gini impurity is slightly cheaper to compute because it needs no logarithm, while information gain tends to favor attributes with many distinct levels or categories. Both measures serve the same purpose: they reward splits that produce subsets that are more homogeneous with respect to the target variable.

#### Example

Let's use a simple example to explain Gini impurity in the context of a decision tree.

Imagine you have a dataset of animals classified as either "Dog" or "Cat", and you want to build a decision tree to classify them based on their characteristics. The dataset contains the following animals:

1. Dog
2. Dog
3. Cat
4. Cat
5. Cat

Now, let's calculate the Gini impurity for this dataset:

1. **Calculate Class Probabilities**:
- There are 3 cats and 2 dogs in the dataset.
- Probability of selecting a cat p_Cat = 3/5 = 0.6
- Probability of selecting a dog p_Dog = 2/5 = 0.4

2. **Calculate Gini Impurity**:
- Gini Impurity = 1 - (p_Cat^2 + p_Dog^2)
- Gini Impurity = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 1 - 0.52 = 0.48

So, the Gini impurity for this dataset is 0.48.

Interpretation:
- This Gini impurity value suggests that if you randomly select an animal from this dataset and randomly label it as either "Dog" or "Cat" based on the distribution of classes, there's a 48% chance of misclassifying it.

In a decision tree, we aim to reduce this impurity by splitting the dataset based on certain features. The split whose subsets have the lowest size-weighted Gini impurity is chosen as the best split. This process creates more homogeneous subsets that are easier to classify accurately.
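
The arithmetic in this example can be reproduced with a short sketch. The function names `gini` and `weighted_gini` are hypothetical, and the split shown is an assumed, idealized one (some feature that separates dogs from cats perfectly), since the example dataset lists no features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(subsets):
    """Average Gini impurity of the subsets, weighted by subset size."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

animals = ["Dog", "Dog", "Cat", "Cat", "Cat"]
print(round(gini(animals), 2))  # 0.48

# Hypothetical split on a feature that separates the two classes perfectly.
split = [["Dog", "Dog"], ["Cat", "Cat", "Cat"]]
print(weighted_gini(split))     # 0.0
```

A candidate split is better the lower its weighted value falls below the parent's 0.48.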

### Handling Numerical Data

Link: https://youtu.be/IZnno-dKgVQ?si=VcAKBKScyxQbQXzO&t=3030