A decision tree in machine learning is a versatile, interpretable algorithm used for predictive modelling.

It structures decisions based on input data, making it suitable for both classification and regression tasks.

<img src="decision-tree-structure.png" width="400">

A decision tree models decisions and their possible consequences as a tree-like structure. The sections below explain how it works, its components, and its advantages and disadvantages.

    Components of a Decision Tree

**Root Node:** Represents the entire dataset, which is then split into two or more sets.

**Decision Nodes (Internal Nodes):** Nodes where the data is split based on a condition on one feature.

**Leaf Nodes (Terminal Nodes):** Represent the final output: a class label for classification or a numeric value for regression.

**Branches:** Connect the nodes, showing the flow from question to outcome.

    How Decision Trees Work
**Splitting:** The dataset is split into subsets on the feature that gives the largest information gain, or equivalently the lowest weighted impurity (e.g. Gini impurity) in the resulting subsets.

    Criteria for Splitting:
**Gini Impurity:** Measures how often a randomly chosen element of the dataset would be misclassified if it were labeled at random according to the class distribution.

**Entropy:** A measure from information theory used to calculate the information gain from each split.

**Information Gain:** The reduction in entropy or impurity after a dataset is split on an attribute.

    Stopping Criteria: The splitting process continues until a stopping criterion is met, such as:
Maximum depth of the tree.

Minimum number of samples per node.

Minimum impurity decrease.
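The three stopping criteria above can be sketched as a single check. This is a minimal illustration, not a library API: the function name `should_stop` and the default thresholds are assumptions chosen for the example.

```python
def should_stop(depth, n_samples, impurity_decrease,
                max_depth=3, min_samples=5, min_impurity_decrease=0.01):
    """Return True if any stopping criterion is met (thresholds are illustrative)."""
    if depth >= max_depth:                         # maximum depth of the tree
        return True
    if n_samples < min_samples:                    # minimum number of samples per node
        return True
    if impurity_decrease < min_impurity_decrease:  # minimum impurity decrease
        return True
    return False


print(should_stop(depth=3, n_samples=100, impurity_decrease=0.2))  # True (depth limit reached)
print(should_stop(depth=1, n_samples=100, impurity_decrease=0.2))  # False (keep splitting)
```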

**Example**

Suppose we have a dataset to decide if someone should play tennis based on weather conditions. Attributes might include "Outlook", "Temperature", "Humidity", and "Wind".

Root Node: Split on the feature with the highest information gain, e.g., "Outlook".

Decision Nodes: If "Outlook" is "Sunny", check "Humidity"; if "Overcast", directly decide "Yes"; if "Rainy", check "Wind".

Leaf Nodes: The final decision based on the conditions, e.g., "Yes" or "No".
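The tree described in this example can be written as plain nested conditions. This is only a sketch: the text does not state which humidity or wind values lead to "Yes", so the leaf labels under "Sunny" and "Rainy" (and the value names "Normal", "High", "Weak", "Strong") are assumptions for illustration.

```python
def play_tennis(outlook, humidity, wind):
    """Walk the example tree from the root ("Outlook") down to a leaf."""
    if outlook == "Overcast":
        return "Yes"                                   # leaf reached directly
    if outlook == "Sunny":
        # decision node on "Humidity" (leaf labels assumed)
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rainy":
        # decision node on "Wind" (leaf labels assumed)
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook!r}")


print(play_tennis("Overcast", "High", "Strong"))  # Yes
print(play_tennis("Sunny", "High", "Weak"))       # No
```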

    Advantages of Decision Trees

Easy to Understand and Interpret: The tree structure is intuitive and can be visualized.

Requires Little Data Preparation: No need for normalization or scaling of data.

Handles Both Numerical and Categorical Data: Flexible in the type of data it can process.

Non-parametric: Makes no assumptions about the distribution of data.

    Disadvantages of Decision Trees

Overfitting: Trees can grow very complex and overfit the data.

Unstable: Small variations in the data can result in a completely different tree.

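Bias: Can be biased toward the dominant classes when the dataset is imbalanced.

Computational Complexity: Finding the globally optimal tree is computationally expensive, so trees are built with a greedy, split-by-split heuristic instead.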

    Pruning

To counter overfitting, pruning is used. This involves removing nodes or branches that contribute little to predicting the target variable, reducing the tree’s size and complexity.

<img src="decision-tree-ex1.png" width="350">

    Splitting in Decision Trees

* Which condition should be used for each split?
* Which feature should be used for splitting? Example: should I use age or salary to split the dataset?
* When should splitting stop?

    Impurity Measures: help decide how to split the data

* Classification Error
* Gini
* Entropy

**Classification Error**

<img src="classification-error.png">

Assume a node with 10 points: Red: 8, Blue: 2.

Classification Error (CE) = 1 - max_i(p_i), where p_i is the proportion of class i. This is the error made when every point is assigned to the majority class.

P(red) = 8/10 = 0.8
P(blue) = 2/10 = 0.2

CE = 1 - max(0.8, 0.2)
   = 1 - 0.8
   = 0.2

In other words, CE answers the question: if I assign everything to the majority class, what is the error rate? It equals the share of minority-class points, all of which are misclassified. A lower CE means lower impurity.
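The classification error calculation can be checked with a few lines of Python. This is a minimal sketch; the helper name `classification_error` is made up for this example.

```python
def classification_error(counts):
    """1 - (proportion of the majority class), given per-class counts."""
    total = sum(counts)
    return 1 - max(counts) / total


# Red: 8, Blue: 2, as in the example above
print(round(classification_error([8, 2]), 2))  # 0.2
```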

**Gini Index**

<img src="gini-formula.webp">

Where p(i) is the probability of a specific class and the summation is done for all classes present in the dataset.

Both of these formulae are equivalent, and the result you get from these formulae would be the same. The first formula is more computationally efficient, therefore it is more commonly used.

For example, consider a toy dataset with two classes, “Yes” and “No”, and the following class probabilities:

p(Yes) = 0.3 and p(No) = 0.7

Using the first formula for Gini impurity:

Gini impurity = 1 - (0.3)² - (0.7)² = 1 - 0.09 - 0.49 = 0.42

Using the second formula for Gini impurity:

Gini impurity = 0.3 * (1 - 0.3) + 0.7 * (1 - 0.7) = 0.21 + 0.21 = 0.42

As you can see, both formulae give the same result of 0.42.

In this example, the Gini impurity of 0.42 means there is a 42% chance of misclassifying a randomly chosen sample if we label it at random according to the class distribution of the dataset. The dataset is therefore not completely pure; there is some degree of disorder in it.
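As a quick numerical check, both forms of the Gini formula can be evaluated for p(Yes) = 0.3 and p(No) = 0.7. The function names below are illustrative only; each form gives 1 - 0.09 - 0.49 = 0.21 + 0.21 = 0.42.

```python
def gini_squares(probs):
    """First form: 1 minus the sum of squared class probabilities."""
    return 1 - sum(p * p for p in probs)


def gini_products(probs):
    """Second form: sum of p * (1 - p) over all classes."""
    return sum(p * (1 - p) for p in probs)


probs = [0.3, 0.7]
print(round(gini_squares(probs), 2))   # 0.42
print(round(gini_products(probs), 2))  # 0.42
```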

    Explanation:

For example, if we have a dataset with two classes “Yes” and “No”, and the class probabilities are:

p(Yes) = 0.3 and p(No) = 0.7

***If we randomly pick a sample from the dataset, there is a 30% chance that it belongs to the “Yes” class and a 70% chance that it belongs to the “No” class. If we then assign it a label at random with the same probabilities, there is a 42% chance (0.3 × 0.7 + 0.7 × 0.3) that we assign the wrong label.***

<img src="gini curve.png" width="450">

***For two classes, Gini impurity starts at 0 (when the proportion of one class is p = 0), rises to its maximum of 0.5 at p = 0.5, and falls back to 0 at p = 1.***

    Entropy

<img src="entropy-formula.png" width="450">

**Example :** So if we had a total of 100 data points in our dataset with 30 belonging to the positive class and 70 belonging to the negative class then ‘P+’ would be 3/10 and ‘P-’ would be 7/10. Pretty straightforward.

If I were to calculate the entropy of the classes in this example using the formula above, here is what I would get:

<img src="entropy-calculation.webp" width="450">

The entropy here is approximately 0.88. This is considered high entropy, i.e. a high level of disorder (a low level of purity). For two classes, entropy lies between 0 and 1. With more classes it can exceed 1 (the maximum is log2 of the number of classes), but the meaning is the same: a very high level of disorder. For simplicity, the examples here keep entropy between 0 and 1.
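The same entropy value can be reproduced in code. This is a small sketch using base-2 logarithms; the helper name `entropy` and its count-based signature are choices made for this example.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a node, given per-class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                  # treat 0 * log2(0) as 0
            p = c / total
            h -= p * math.log2(p)
    return h


print(round(entropy([30, 70]), 2))  # 0.88
print(entropy([50, 50]))            # 1.0 (maximum for two classes)
print(entropy([25, 25, 25, 25]))    # 2.0 (more than two classes can exceed 1)
```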

---------------

* *Range for Entropy is from 0 to 1 (two classes, log base 2)*
* *Range for Gini is from 0 to 0.5 (two classes)*
* *Range for Misclassification is from 0 to 0.5 (two classes)*

<img src="entropy-gini-missclassification.png" width="700">
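The ranges listed above can be verified by evaluating all three measures over the proportion p of one class in a two-class node. This is a sketch; the function names are illustrative, and the binary Gini is written as 2p(1 - p), which is the same as 1 - p² - (1 - p)².

```python
import math


def misclassification(p):
    """Classification error for a two-class node with class proportion p."""
    return 1 - max(p, 1 - p)


def gini(p):
    """Binary Gini impurity: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    return 2 * p * (1 - p)


def entropy(p):
    """Binary entropy in bits, treating 0 * log2(0) as 0."""
    h = 0.0
    for q in (p, 1 - p):
        if q > 0:
            h -= q * math.log2(q)
    return h


print("p     error  gini   entropy")
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"{p:.1f}   {misclassification(p):.3f}  {gini(p):.3f}  {entropy(p):.3f}")
```

All three measures are 0 at p = 0 and p = 1 and peak at p = 0.5, where they reach 0.5, 0.5, and 1 respectively.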

    Which impurity measure to use: Gini or Entropy?

The choice between Gini impurity and entropy as an impurity measure in decision trees depends on several factors, including the specific characteristics of your dataset and the computational efficiency you are aiming for. Here is a detailed comparison to help you decide:

### Gini Impurity
- **Formula**:

<img src="gini-formula1.png">

Where \( p_i \) is the probability of class \( i \).
- **Interpretation**: Measures the probability of incorrectly classifying a randomly chosen element if it was labeled according to the distribution of labels in the subset.
- **Computational Efficiency**: Slightly faster to compute because it does not involve logarithms.
- **Tendency**: Tends to create more balanced splits and can lead to a more efficient tree in terms of depth.
- **Usage**: Often preferred in practice due to its simplicity and speed.

### Entropy (Information Gain)
- **Formula**: 


<img src="entropy-formula.png" width="450">

Where \( p_i \) is the probability of class \( i \).
- **Interpretation**: Measures the amount of uncertainty or disorder in the subset. Higher entropy means more disorder.
- **Computational Efficiency**: Slightly more computationally intensive due to the logarithmic calculations.
- **Tendency**: Can produce more complex trees as it is more sensitive to small changes in the dataset and might create more branches to reduce uncertainty.
- **Usage**: Useful when you need a more detailed measure of impurity and are less concerned with computational speed.

### Practical Considerations
- **Performance**: In most cases, both measures will produce similar results. The differences in the resulting trees are often minor, and the choice between them may not significantly affect the accuracy of the model.
- **Dataset Characteristics**: For larger datasets or when computational efficiency is a concern, Gini impurity is typically preferred. For smaller datasets or when a more precise measure of impurity is required, entropy might be more appropriate.
- **Empirical Testing**: It is often beneficial to empirically test both measures on your specific dataset to see which one yields better results in terms of model performance.

### Summary
- **Use Gini Impurity** if you prioritize computational efficiency and generally balanced splits.
- **Use Entropy** if you need a more detailed impurity measure and are less concerned about the slight increase in computation time.

Ultimately, the best approach is to experiment with both measures and evaluate their performance on your specific problem.

    Which feature should be used for splitting? Example: should I use age or salary to split the dataset?

**Information Gain**

Information gain is a measure used to determine which feature to split on at each step in the construction of a decision tree. It is based on the concept of entropy from information theory.

We compare the impurity before and after the split; a useful split reduces it, i.e.
impurity (post-split) < impurity (pre-split)

To quantify the reduction in impurity:

Change in Impurity  =  impurity (pre-split) - impurity (post-split)

Here the post-split impurity is the weighted average over the child nodes. When randomness/impurity decreases, purity increases, and this reduction is called information gain.


The dataset is split on the feature that reduces the impurity the most.

Information gain measures the expected reduction in entropy. Entropy measures impurity in the data, and information gain measures the reduction in that impurity. The feature with the highest information gain (that is, the lowest weighted impurity after the split) is chosen as the root node.


Information gain is used to decide which feature to split on at each step in building the tree. Creating sub-nodes increases homogeneity, that is, it decreases the entropy of these nodes. The more homogeneous the child nodes, the larger the reduction in impurity from the split. For regression trees the analogous quantity is variance reduction: how much the variance of the target decreases after each split.

The information gain of a split is the entropy of the parent node minus the weighted average entropy of the child nodes: \( IG = E(S) - \sum_v \frac{|S_v|}{|S|} E(S_v) \), where \( S_v \) is the subset of \( S \) that falls into child node \( v \).

For example, suppose the dataset has 10 observations belonging to two classes, YES and NO: 6 observations belong to class YES and 4 observations belong to class NO.

<img src="info-gain-dataset.webp" width="450">

Color is the feature in this case

<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*jKjpQrsVoCEd_9tLyqBFWA.png">

Pyes is the probability of choosing Yes and Pno is the probability of choosing a No. Here Pyes is 6/10 and Pno is 4/10.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*14Ds9IW7u4f_m95h95a6Vw.png">

*E(S), is approximately equal to 0.971*


If all 10 observations belong to one class, then the entropy is equal to zero, which implies the node is a pure node.

<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*tqwSSSfFaZ_6mqWeEmJKDQ.png">


If both classes YES and NO have an equal number of observations, then entropy will be equal to 1.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*ueGvwyJ6S13k9dGhV-4kDA.png">

Back to the problem statement:

<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*-rCWFe9Yh15TPbkS7ikClQ.png">

<img src="https://miro.medium.com/v2/resize:fit:786/format:webp/1*SDQdSI7MmFyMDwtsZARohA.png">

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*kqZvldmU9eH8g6a6De2BVg.png">

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*yrCuVeZqZobn442z-T8LYw.png">

***For a dataset having many features, the information gain of each feature is calculated. The feature having maximum information gain will be the most important feature which will be the root node for the decision tree.***
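Information gain for a single candidate split can be sketched as follows. The parent node uses the 6 YES / 4 NO counts from the example; the child counts are a hypothetical split chosen for illustration and are not taken from the figures. The function names are also illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a node, given per-class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


parent = [6, 4]              # 6 YES, 4 NO
children = [[4, 0], [2, 4]]  # hypothetical split into two child nodes
print(round(entropy(parent), 3))                     # 0.971
print(round(information_gain(parent, children), 3))  # 0.42
```

Repeating this for every feature and picking the largest value gives the feature to split on.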

This process continues until all feature/data splits are done, with information gain recalculated at each stage.
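The repeated "compute information gain, split, recurse" process can be sketched for categorical features as below. This is a simplified, ID3-style illustration rather than a production implementation: the tiny dataset, the function names, and the dictionary-based tree representation are all assumptions made for the example, and no pruning or depth limit is applied.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the node minus the weighted entropy after splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def build_tree(rows, labels, features):
    """Greedily pick the best feature at each node and recurse on each value."""
    # Leaf: the node is pure or there are no features left -> majority label
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    node = {"feature": best, "children": {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["children"][value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining
        )
    return node


# Hypothetical toy data, invented for this sketch
rows = [
    {"Outlook": "Sunny", "Wind": "Weak"},
    {"Outlook": "Sunny", "Wind": "Strong"},
    {"Outlook": "Overcast", "Wind": "Weak"},
    {"Outlook": "Overcast", "Wind": "Strong"},
    {"Outlook": "Rainy", "Wind": "Weak"},
    {"Outlook": "Rainy", "Wind": "Strong"},
]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

tree = build_tree(rows, labels, ["Outlook", "Wind"])
print(tree["feature"])  # Outlook
print(tree["children"])
```

On this toy data "Outlook" has the larger information gain, so it becomes the root; only the "Rainy" branch is still mixed and is split again on "Wind".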

-----------

    What if the feature is continuous data ?

    put the notes from class note

-------------------

    Decision Trees are a greedy algorithm

Because the algorithm greedily keeps splitting nodes until the leaves are as pure as possible, it can overfit the training data. The following methods are applied to counter this:

* Pre Pruning
* Post Pruning
* Ensemble