# **Decision Trees**

A **Decision Tree** is a popular supervised learning algorithm used for both classification and regression tasks. It splits the data into subsets based on feature values, creating a tree-like model of decisions. The goal is to predict the target (a class label or a numeric value) by learning simple decision rules inferred from the features.

---

## **Basic Concepts**

- **Node**: A decision (internal) node tests a feature or attribute and splits the data based on a condition, such as a threshold.
- **Leaf Node**: A leaf node represents the final decision or outcome (e.g., class label for classification or a continuous value for regression).
- **Root Node**: The root node is the topmost node of the tree, representing the feature that provides the best split at the initial level.
- **Branches**: Branches connect nodes, representing the outcome of a decision made at a parent node.

---

## **Working of Decision Trees**

### **1. Splitting the Data**

The decision tree algorithm recursively splits the dataset into smaller subsets. At each step it picks the feature and split condition that best separate the data, that is, the split that most reduces the **impurity** of the resulting subsets. Several metrics can define the "best" split; they are described under Splitting Criteria.

### **2. Stopping Criterion**

The algorithm continues splitting until:
- The tree reaches a specified depth.
- The data points in a node are pure (i.e., all belong to the same class).
- There is no further information gain from splitting.

---

## **Splitting Criteria**

To determine the best feature to split on, the decision tree algorithm evaluates the following measures:

### **1. Gini Impurity (for Classification)**

The **Gini Impurity** measures how often a randomly selected element would be incorrectly classified if it were labeled at random according to the class distribution in the dataset. The formula is:

$$
\text{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2
$$

Where:
- \( p_i \) is the proportion of class \( i \) in the dataset.
- \( C \) is the number of classes.

The goal is to choose the split whose child nodes have the lowest size-weighted average Gini impurity.
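
A minimal sketch of comparing two candidate splits by the size-weighted Gini impurity of their child nodes. The helper names (`gini`, `weighted_gini`) and the toy labels are illustrative assumptions, not part of any library.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted average Gini impurity of two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical toy labels: two candidate splits of the same 8 samples.
split_a = (["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"])
split_b = (["yes", "yes", "yes", "yes"], ["no", "no", "no", "no"])

print(weighted_gini(*split_a))  # 0.375 (both children are still mixed)
print(weighted_gini(*split_b))  # 0.0   (both children are pure)
```

Under this criterion the second split would be preferred, because it leaves no class mixing in either child.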

### **2. Entropy and Information Gain (for Classification)**

**Entropy** measures the uncertainty or disorder in a dataset. It is calculated as:

$$
\text{Entropy}(D) = - \sum_{i=1}^{C} p_i \log_2(p_i)
$$

Where:
- \( p_i \) is the proportion of class \( i \) in the dataset.

**Information Gain** is the reduction in entropy when a feature is used to split the dataset. It is calculated as:

$$
\text{Information Gain} = \text{Entropy}(D) - \sum_{v \in V} \frac{|D_v|}{|D|} \text{Entropy}(D_v)
$$

Where:
- \( V \) is the set of values the chosen feature can take.
- \( D_v \) is the subset of \( D \) in which the feature takes the value \( v \).

The feature that maximizes information gain is selected for the split.
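
The two formulas above can be sketched in plain Python as follows. The function names and the tiny dataset are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy of the parent minus the size-weighted entropy of the subsets D_v."""
    n = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical data: one feature separates the classes perfectly, the other not at all.
labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = ["true", "false", "true", "false"]

print(information_gain(labels, outlook))  # 1.0
print(information_gain(labels, windy))    # 0.0
```

Here `outlook` would be chosen for the split, since it removes all of the parent's entropy.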

### **3. Mean Squared Error (MSE) (for Regression)**

For regression tasks, the algorithm uses the **Mean Squared Error (MSE)** as the impurity measure:

$$
\text{MSE}(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} (y_i - \hat{y})^2
$$

Where:
- \( y_i \) is the actual value of the target variable.
- \( \hat{y} \) is the predicted value (mean of the target values in the node).

The algorithm chooses the split that minimizes the size-weighted MSE of the resulting child nodes.
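
As a rough sketch of how a regression split could be chosen on a single numeric feature, the code below tries the midpoint between each pair of adjacent distinct values and keeps the threshold with the lowest size-weighted MSE. The helper names and data are hypothetical, and real implementations are considerably more optimized.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_threshold(xs, ys):
    """Return (threshold, weighted_mse) for the best split of the form x <= threshold."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * mse(left) + len(right) / n * mse(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data with an obvious jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]

print(best_threshold(xs, ys))  # (6.5, 0.0)
```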

---

## **Building a Decision Tree**

1. **Choose the Best Feature**: At each node, evaluate each feature and calculate the Gini Impurity, Entropy, or MSE for all possible splits.
2. **Split the Data**: Based on the chosen feature, divide the dataset into subsets.
3. **Repeat**: Recursively apply the process to each subset, building the tree from the top (root) down to the leaf nodes.
4. **Stop Criteria**: The recursion ends when the data in a node is pure or when the tree reaches a predefined maximum depth.
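
The four steps above can be sketched as a small recursive builder. This is a simplified illustration under stated assumptions (numeric features, Gini impurity, thresholds taken from observed values, leaves stored as plain labels); the function names and the `[age, income]` rows are hypothetical, and production libraries differ in many details.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Step 1: return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:  # only accept splits that reduce impurity
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Steps 2-4: split, recurse, and stop on purity, depth, or no improvement."""
    split = None
    if depth < max_depth and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical rows of [age, income]; the label depends only on income.
rows = [[25, 30], [30, 60], [45, 40], [50, 80]]
labels = ["no", "yes", "no", "yes"]

tree = build_tree(rows, labels)
print(tree)                     # a single split on feature 1 (income) at threshold 40
print(predict(tree, [35, 70]))  # yes
```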

---

## **Pruning**

**Pruning** is the process of removing unnecessary branches from the decision tree to avoid overfitting. There are two types of pruning:

### **1. Pre-Pruning (Early Stopping)**

Pre-pruning involves stopping the tree-building process early before it becomes too complex. This can be done by limiting:
- The maximum depth of the tree.
- The minimum number of samples required to split a node.
- The minimum improvement in the splitting criterion.
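
In scikit-learn, these three limits roughly correspond to the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one on synthetic data; the dataset and the specific values are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to have something to fit.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(
    max_depth=3,                 # limit the depth of the tree
    min_samples_split=20,        # do not split nodes with fewer than 20 samples
    min_impurity_decrease=0.01,  # require a minimum improvement per split
    random_state=0,
).fit(X, y)

print("full:  ", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("pruned:", pruned.get_depth(), "levels,", pruned.get_n_leaves(), "leaves")
```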

### **2. Post-Pruning (Cost Complexity Pruning)**

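Post-pruning first grows the full tree and then trims branches that contribute little. In cost complexity pruning, each subtree is scored by its error plus a penalty \( \alpha \) times its number of leaves, so increasing \( \alpha \) removes the weakest branches first. The value of \( \alpha \) is typically chosen by evaluating the pruned trees on a validation set or with cross-validation.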
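
One possible way to do this with scikit-learn is to compute the candidate penalty values with `cost_complexity_pruning_path` and then refit with each `ccp_alpha`, keeping the one that scores best on held-out data. The synthetic dataset and the simple selection loop are assumptions for illustration; cross-validation would be a more robust choice in practice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data split into training and validation parts.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate penalty values at which the fully grown tree would lose a branch.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

print("chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```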

---

## **Advantages of Decision Trees**

- **Easy to understand and interpret**: Decision trees are easy to visualize and understand, even for non-experts.
- **No need for feature scaling**: Decision trees do not require normalization or standardization of data.
- **Handles both numerical and categorical data**: In principle, decision trees can split on both numerical and categorical features. Some implementations, including scikit-learn's, expect categorical features to be numerically encoded first.
- **Non-linear relationships**: Decision trees can capture complex, non-linear relationships between the features and the target.

---

## **Disadvantages of Decision Trees**

- **Overfitting**: Decision trees can easily overfit, especially with deep trees that fit noise in the data.
- **Instability**: Small changes in the data can lead to completely different tree structures.
- **Greedy Algorithm**: The decision tree algorithm is greedy, making locally optimal splits without considering the global best solution.
- **Bias towards dominant classes**: In imbalanced datasets, decision trees may favor the majority class.

---

## **Ensemble Methods**

To overcome some of the limitations of decision trees, ensemble methods like **Random Forests** and **Gradient Boosting** are often used. These methods combine multiple decision trees to improve accuracy and reduce overfitting.

- **Random Forest**: Trains many decision trees on bootstrap samples of the data, considering a random subset of features at each split, and averages or votes over their predictions to improve robustness and accuracy.
- **Gradient Boosting**: Builds decision trees sequentially, where each new tree is fit to correct the errors of the ensemble built so far.
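
A minimal sketch of comparing a single tree with both ensemble methods using scikit-learn and 5-fold cross-validation. The synthetic dataset and the hyperparameter values are arbitrary assumptions, and the relative scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```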

---

## **Applications**

Decision trees are widely used in a variety of fields, including:
- **Classification**: Spam detection, fraud detection, medical diagnoses.
- **Regression**: Predicting house prices, sales forecasting.
- **Feature Selection**: Feature importances from a fitted tree can be used to rank variables and drop uninformative ones, reducing the number of variables in predictive models.

---

## **Example**

Given a dataset with features like **age** and **income**, the decision tree algorithm would:

- Start at the root node and ask a question like: "Is age greater than 40?"
- Based on the answer, it would split the data into two branches (e.g., age <= 40 and age > 40).
- Each branch would continue to split based on another feature (e.g., "Is income > 50k?").
- Eventually, the tree would end at leaf nodes where a class label or regression value is assigned.
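
The walkthrough above could be reproduced with scikit-learn roughly as follows. The six `[age, income]` rows and their labels are invented for illustration (income in thousands), so the learned questions will differ from the ones quoted above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows of [age, income]; class 1 whenever income is high.
X = [[25, 30], [30, 60], [45, 40], [50, 80], [35, 20], [60, 90]]
y = [0, 1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned rules as indented text, one question per level.
print(export_text(clf, feature_names=["age", "income"]))

# Classify a new person by following the questions from the root to a leaf.
print(clf.predict([[40, 70]]))  # [1]
```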

---

## **Summary**

Decision trees are intuitive models used for classification and regression tasks. They split data into subsets based on feature values, aiming to create a simple decision rule for each subset. While they are easy to interpret, they are prone to overfitting and require careful tuning and pruning. Ensemble methods like Random Forests and Gradient Boosting can be used to improve performance and robustness.


![dec](images/dec.png)

### Gini Impurity

Gini Impurity is a measure used in decision tree algorithms to evaluate how "impure" or "mixed" a node is. It quantifies the likelihood of a random element being incorrectly classified if it is randomly labeled based on the distribution of labels in the dataset. Lower Gini Impurity means that the node is more "pure," i.e., most of the data points belong to a single class.

#### Formula for Gini Impurity:

$$
\text{Gini Impurity} = 1 - \sum_{i=1}^{k} p_i^2
$$

Where:
- \( p_i \) is the probability of an element belonging to class \( i \),
- \( k \) is the number of unique classes.

#### Intuition:
- **Low Gini Impurity**: A value close to 0 means that most of the items in the node belong to the same class (i.e., the node is "pure").
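- **High Gini Impurity**: The maximum value is \( 1 - 1/k \), reached when the items are evenly distributed across all \( k \) classes (i.e., the node is "impure"). For two classes the maximum is 0.5; the value approaches 1 only as the number of classes grows.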

#### Example:

Suppose we have a dataset of 100 items with two classes: **Class A** and **Class B**. 80 items belong to Class A, and 20 items belong to Class B. 

The Gini Impurity for this dataset is calculated as:

$$
\text{Gini Impurity} = 1 - \left( \left( \frac{80}{100} \right)^2 + \left( \frac{20}{100} \right)^2 \right)
$$
$$
\text{Gini Impurity} = 1 - \left( 0.64 + 0.04 \right) = 1 - 0.68 = 0.32
$$

This means the impurity is 0.32, indicating that while Class A is dominant, there is still some mix of Class B in the dataset.
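
The same arithmetic can be checked with a few lines of Python. The helper name `gini_from_counts` is an illustrative assumption.

```python
def gini_from_counts(counts):
    """Gini impurity from per-class counts: 1 - sum((count / total)^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


print(round(gini_from_counts([80, 20]), 2))  # 0.32 (the example above)
print(gini_from_counts([50, 50]))            # 0.5  (evenly mixed, two classes)
print(gini_from_counts([100, 0]))            # 0.0  (pure node)
```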

#### Use in Decision Trees:

In decision tree algorithms (e.g., CART), the goal is to **minimize the Gini Impurity** at each decision node. The algorithm evaluates different feature splits and selects the one that results in the lowest size-weighted Gini Impurity across the child nodes, aiming to create nodes that are as pure as possible (i.e., contain mostly data points from a single class).


#### Classification Using Gini Index

The Gini Index (or Gini Impurity) is a key metric in decision tree algorithms (e.g., CART, Classification and Regression Trees) for building classification models. It is used to evaluate the quality of splits at each node of the tree and helps the algorithm decide which feature and threshold should be used to split the data.

1. Calculate the Gini index of the label column for the whole dataset.

![gini](images/gini1.png)

![gini](images/gini3.png)
![gini](images/gini4.png)
![gini](images/gini5.png)
![gini](images/gini6.png)

## Decision Tree Parameters Explained

### 1. **`criterion`** (default=`'gini'`)
This parameter specifies the function used to measure the **quality of a split** at each node.

- **`'gini'`**: Measures **Gini impurity**. A lower Gini value indicates that the node contains mostly one class.
- **`'entropy'`**: Measures **entropy** (information gain). This is based on information theory and maximizes the reduction in entropy.

### 2. **`max_depth`** (default=`None`)
The **maximum depth** of the tree. If set to `None`, the tree will grow until all nodes are pure or other stopping criteria are met.

- Limiting the depth helps prevent **overfitting** by controlling the complexity of the tree.

### 3. **`min_samples_split`** (default=`2`)
This parameter specifies the **minimum number of samples** required to split an internal node.

- **Default value is 2**, meaning any node with 2 or more samples can be split.
- A higher value helps prevent **overfitting** by avoiding splits with fewer samples.

### 4. **`min_samples_leaf`** (default=`1`)
The **minimum number of samples** required to be at a leaf node.

- **Default value is 1**, meaning a leaf can contain just one sample.
- Increasing this value forces the tree to have more samples per leaf, reducing overfitting.

### 5. **`max_features`** (default=`None`)
Limits the number of features to consider when splitting a node.

- **If set to `None`**, all features are used at each node.
- **If set to an integer**, only that number of features are considered.
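- **If set to a float**, it uses that fraction of the total features (e.g., `max_features=0.5` uses 50% of the features).
- **If set to `'sqrt'` or `'log2'`**, it uses the square root or the base-2 logarithm of the total number of features.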

### 6. **`max_leaf_nodes`** (default=`None`)
Limits the number of **leaf nodes** in the tree.

- If set to a positive integer, the tree will stop growing once it reaches that number of leaf nodes.
- **If set to `None`**, the tree grows until other stopping criteria are met.

### 7. **`min_impurity_decrease`** (default=`0.0`)
The **minimum decrease in impurity** required for a node to split.

- If the weighted decrease in impurity from the best available split is smaller than this value, the node is not split and becomes a leaf. This is a form of **pre-pruning**.
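
For reference, the weighted impurity decrease that scikit-learn's documentation describes for this parameter can be written as follows. The symbols here are chosen for this note and are not used elsewhere in the document.

$$
\Delta = \frac{N_t}{N} \left( I_t - \frac{N_{t,R}}{N_t} I_R - \frac{N_{t,L}}{N_t} I_L \right)
$$

Where:
- \( N \) is the total number of samples and \( N_t \) is the number of samples at the current node.
- \( N_{t,L} \) and \( N_{t,R} \) are the numbers of samples in the left and right child.
- \( I_t \), \( I_L \), and \( I_R \) are the impurities of the current node and of its left and right child.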

### 8. **`class_weight`** (default=`None`)
Assigns different **weights to each class** in the dataset.

- **`None`** means all classes are weighted equally.
- **`"balanced"`** automatically adjusts weights inversely proportional to class frequencies.
- You can manually set weights as a dictionary, e.g., `{0: 1, 1: 2}`.
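
To see what `"balanced"` produces, the weights can be computed directly with scikit-learn's `compute_class_weight` helper, which uses `n_samples / (n_classes * count_per_class)`. The 80/20 label vector is a hypothetical example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(dict(zip([0, 1], weights.tolist())))  # {0: 0.625, 1: 2.5}
```

The minority class receives the larger weight, so errors on it count more when splits are evaluated.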

### 9. **`random_state`** (default=`None`)
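Seeds the **random number generator** used while building the tree.

- Features are randomly permuted at each split, so ties between equally good splits (and the feature subsets drawn when `max_features` is smaller than the number of features) can differ between runs. Fixing the seed makes the results reproducible.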

### 10. **`splitter`** (default=`'best'`)
The strategy used to split a node.

- **`'best'`**: Selects the best split based on the chosen criterion.
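- **`'random'`**: Draws a random candidate threshold for each considered feature and keeps the best of those candidates. This is faster but noisier, and it is the strategy used by extremely randomized trees (`ExtraTreeClassifier`); Random Forests use `'best'` together with `max_features`.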

### 11. **`presort`** (default=`False`)
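This parameter was deprecated in scikit-learn 0.22 and removed in 0.24, so it should not be passed to current versions of `sklearn`.

- It was previously used to **pre-sort** the data to speed up the search for the best splits; it has no replacement and no longer needs to be set.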

---

### Example Putting It All Together

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion='entropy',    # Use 'entropy' to calculate splits based on information gain
    max_depth=5,            # Limit the tree depth to avoid overfitting
    min_samples_split=4,    # A node must have at least 4 samples to be split
    min_samples_leaf=2,     # A leaf must have at least 2 samples
    max_features='sqrt',    # Use only a random subset of features (e.g., sqrt of total features)
    random_state=42         # Ensure the results are reproducible
)
