## Overfitting

Decision trees overfit easily because they can grow very complex and end up memorizing the training data instead of learning general patterns. 

### Why decision trees overfit

- Decision trees are non‑parametric models: instead of a fixed set of weights (like linear or logistic regression), they can keep adding splits and nodes until almost every training point gets its own path.
- When a tree grows unchecked:
  - Training accuracy becomes very high (often 100%), but
  - Test/validation accuracy drops because the tree is fitting noise and specific quirks of the training set.
- This is more pronounced than in many other models because trees can create extremely detailed “if–then” rules, like carving out tiny regions for a handful of samples.

Regularization for linear models usually means adding penalties on weights (L1, L2), but trees do not have a simple weight vector. Instead, trees are regularized by controlling structure: depth, number of leaves, and which splits are allowed.
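The gap between training and test accuracy described above is easy to reproduce. The following is a minimal sketch, assuming a synthetic dataset from make_classification with some label noise; the dataset and its sizes are illustrative choices, not part of the original material.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration). flip_y reassigns about
# 10% of the labels at random, so there is noise for the tree to memorize.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No structural limits: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = full_tree.score(X_train, y_train)
test_acc = full_tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"depth={full_tree.get_depth()}, leaves={full_tree.get_n_leaves()}")
```

A large number of leaves relative to the number of training samples is a typical sign that the tree has carved out regions for individual points.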

### Pre‑pruning: stop the tree from growing too much

Pre‑pruning applies constraints during tree building so the tree never reaches its full possible complexity. A few intuitive rules, and how they map to scikit‑learn hyperparameters:

#### 1. Minimum samples to split a node – min_samples_split

Idea:
- “Don’t split a node if it contains fewer than 1% of the total training samples.”
- That means small nodes (few samples) are not further divided.

Effect:
- Prevents the tree from making very specific rules for tiny groups of points that are likely noise.

In scikit‑learn, this is controlled by min_samples_split:
- An integer value is an absolute sample count; a float is a fraction of the training set, so 0.01 expresses the 1% rule above.
- If you set it higher than the default (2), nodes must have more data before they can split.
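To see the effect of min_samples_split, one can compare the default against fractional thresholds. This is a sketch on an assumed synthetic dataset; the specific values (1% and 10%) are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

leaves = {}
# An int is an absolute count; a float is a fraction of the training samples.
for mss in (2, 0.01, 0.10):
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    tree.fit(X_train, y_train)
    leaves[mss] = tree.get_n_leaves()
    print(
        f"min_samples_split={mss}: leaves={leaves[mss]}, "
        f"train={tree.score(X_train, y_train):.3f}, "
        f"test={tree.score(X_test, y_test):.3f}"
    )
```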

#### 2. Maximum depth of the tree – max_depth

Idea:
- “Don’t allow more than, say, 7 questions from root to leaf.”
- Each level adds another decision; restrict how many you can have.

Effect:
- Directly limits model complexity: shallower trees are simpler, have higher bias but lower variance, and usually overfit less.
- max_depth in scikit‑learn enforces this limit; None means no explicit cap (tree can grow until leaves are pure or restricted by other parameters).
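A quick way to check the depth limit is to fit the same data with different max_depth values and read back the depth actually reached. The sketch below assumes a synthetic dataset; the depths 3 and 7 are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

depths = {}
for max_depth in (3, 7, None):  # None = no explicit cap
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    depths[max_depth] = tree.get_depth()  # depth the fitted tree actually has
    print(
        f"max_depth={max_depth}: actual depth={depths[max_depth]}, "
        f"train={tree.score(X_train, y_train):.3f}, "
        f"test={tree.score(X_test, y_test):.3f}"
    )
```

Shallower trees usually show lower training accuracy and a smaller train/test gap, which is the bias/variance trade described above.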

#### 3. Minimum impurity decrease / ΔWS threshold – min_impurity_decrease

The split quality can be measured by weighted entropy (WS), and the drop in WS (ΔWS) from parent to children represents how much uncertainty is reduced.  

If a split barely reduces WS, it’s not very informative.
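As a small worked example, ΔWS can be computed by hand. The sketch below assumes ΔWS means the parent node's entropy minus the size-weighted entropy of its children; the helper names and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_entropy(children):
    """Size-weighted average entropy of the child nodes (WS of a split)."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * entropy(child) for child in children)


# Parent node: 8 samples of class "a" and 8 of class "b" (entropy = 1 bit).
parent = ["a"] * 8 + ["b"] * 8

# An informative split separates the classes almost perfectly.
informative = (["a"] * 7 + ["b"] * 1, ["a"] * 1 + ["b"] * 7)
# A weak split leaves both children as mixed as the parent.
weak = (["a"] * 4 + ["b"] * 4, ["a"] * 4 + ["b"] * 4)

delta_informative = entropy(parent) - weighted_entropy(informative)
delta_weak = entropy(parent) - weighted_entropy(weak)
print(f"informative split: delta WS = {delta_informative:.3f}")
print(f"weak split:        delta WS = {delta_weak:.3f}")
```

The weak split changes nothing about the class mix, so its ΔWS is zero and any positive threshold would reject it.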

Idea:

- “Don’t create a split if its ΔWS is less than 0.01.”

- Only splits that produce a meaningful reduction in entropy are allowed.

Effect:

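- Filters out weak splits that reduce impurity only slightly and are often fitting noise.

In scikit‑learn, the corresponding parameter is min_impurity_decrease: a node is split only if the split's impurity decrease is at least this value. Two details matter when choosing a threshold: the decrease is weighted by the fraction of training samples that reach the node, so the same raw ΔWS counts for less deep in the tree, and impurity is measured with the configured criterion (Gini by default, entropy only if you set criterion to "entropy").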
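The threshold can be tried out directly. This sketch assumes a synthetic dataset and sets criterion="entropy" so that the impurity matches the entropy-based ΔWS discussed here; the threshold values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

leaves = {}
for threshold in (0.0, 0.01, 0.05):
    tree = DecisionTreeClassifier(
        criterion="entropy",              # measure impurity as entropy
        min_impurity_decrease=threshold,  # reject splits below this gain
        random_state=0,
    ).fit(X_train, y_train)
    leaves[threshold] = tree.get_n_leaves()
    print(
        f"min_impurity_decrease={threshold}: leaves={leaves[threshold]}, "
        f"train={tree.score(X_train, y_train):.3f}, "
        f"test={tree.score(X_test, y_test):.3f}"
    )
```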

Because there are many such natural controls (max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes, min_impurity_decrease, ccp_alpha, etc.), decision tree implementations provide a large set of hyperparameters. You rarely tune all of them at once; defaults are provided, and more careful tuning is done with tools like grid search or randomized search.
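A grid search over a few of these hyperparameters might look like the following sketch. The grid values and the synthetic dataset are assumptions chosen to keep the example small; a real search would use ranges suited to the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative grid: 4 * 3 * 3 = 36 combinations, each scored with 5-fold CV.
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 0.01, 0.05],
    "min_impurity_decrease": [0.0, 0.001, 0.01],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```

RandomizedSearchCV takes a similar setup and samples a fixed number of combinations instead of trying all of them, which helps when the grid grows large.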

#### Why understanding ΔWS and hyperparameters matters

- If you know how the tree is built (entropy, weighted entropy, ΔWS), then parameters like min_impurity_decrease make sense: they are thresholds on “how much information” a split must add.

- Without that background, hyperparameters can feel like arbitrary numbers; with it, you can pick values more intelligently.

There are also other, more advanced options such as ccp_alpha (the cost‑complexity pruning parameter in scikit‑learn), which prunes the grown tree by explicitly trading off tree size against training error via a penalty term.
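One possible way to choose ccp_alpha is to list the candidate values with cost_complexity_pruning_path and keep the one that scores best on a validation set. The sketch below assumes a synthetic dataset and a simple train/validation/test split; the selection rule (prefer the larger alpha on ties) is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0
)

# Candidate alphas at which the fully grown tree loses another subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point round-off.
alphas = np.clip(path.ccp_alphas, 0.0, None)

best_alpha, best_val_acc = 0.0, -1.0
for alpha in alphas:  # alphas are in increasing order
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    if val_acc >= best_val_acc:  # on ties, keep the larger alpha (smaller tree)
        best_alpha, best_val_acc = alpha, val_acc

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(
    X_train, y_train
)
print(f"chosen ccp_alpha: {best_alpha:.5f}")
print(f"leaves: full={full.get_n_leaves()}, pruned={pruned.get_n_leaves()}")
print(f"test accuracy: full={full.score(X_test, y_test):.3f}, "
      f"pruned={pruned.score(X_test, y_test):.3f}")
```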

### Post‑pruning: grow first, then prune

Post‑pruning takes the opposite approach:
1. Let the tree grow fully (or nearly fully), often to very low training error.
2. Then remove branches that don’t actually help on unseen data.

Example:
- Consider a deep branch with many rules but relatively few samples (e.g., 8 of one class and 32 of another).
- This branch may be overly specific to the training data.   

A simple validation‑based pruning strategy:
1. Hold out a validation set (separate from training).
2. Evaluate the model on the validation set with the branch intact.
3. Then “collapse” that branch:
    - Replace the entire subtree with a single leaf that predicts the majority class at the root of that branch (e.g., always predict “virginica”).
4. Evaluate again on the validation set without the branch.
5. If the validation error stays the same or improves (or worsens only negligibly), the branch is not useful and can be pruned away.

Intuition:
- If a complex branch doesn’t improve performance on unseen data, it is likely modeling noise rather than signal, so removing it should improve generalization or at least simplify the model without harm.
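The validation check above can be sketched without modifying the fitted tree. For one branch, the code counts validation errors made by the subtree and compares them with the errors a single majority-class leaf would make on the same samples. This is an illustration under assumptions: the dataset is synthetic, errors_with_and_without is a hypothetical helper, and each branch is judged independently, whereas a full pruning procedure would work bottom-up and re-evaluate after each collapse.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
t = tree.tree_

val_pred = tree.predict(X_val)             # predictions with every branch intact
paths = tree.decision_path(X_val).tocsc()  # [i, n] is nonzero if sample i passes node n


def errors_with_and_without(node):
    """Validation errors among samples reaching `node`:
    (with the subtree intact, with the subtree collapsed to one leaf)."""
    reaches = paths[:, node].toarray().ravel().astype(bool)
    # Majority training class at the node = what the collapsed leaf predicts.
    majority = tree.classes_[np.argmax(t.value[node][0])]
    err_subtree = int(np.sum(val_pred[reaches] != y_val[reaches]))
    err_leaf = int(np.sum(y_val[reaches] != majority))
    return err_subtree, err_leaf


# Internal (split) nodes have two distinct children; leaves do not.
split_nodes = np.flatnonzero(t.children_left != t.children_right)

prunable = []
for node in split_nodes:
    err_subtree, err_leaf = errors_with_and_without(node)
    # Branches that no validation sample reaches also count as prunable here.
    if err_leaf <= err_subtree:
        prunable.append(int(node))

print(f"{len(prunable)} of {len(split_nodes)} branches could be collapsed "
      "without adding validation errors")
```

Samples that never reach a branch are predicted the same way either way, so comparing errors only on the samples that pass through the node is equivalent to comparing on the whole validation set.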

Helpful methods in scikit‑learn:
- predict returns, for each sample, the most common class in the leaf it lands in (majority vote).
- predict_proba returns class probabilities, that is, the fraction of training samples of each class in that leaf, which is useful for understanding the model’s confidence.
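The relationship between the two methods can be checked on a small tree. This sketch assumes the iris dataset bundled with scikit-learn and a depth limit of 2 so that leaves stay impure; the sample index is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# A shallow tree, so leaves contain a mix of classes.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

sample = iris.data[[100]]              # one flower, kept 2-D for scikit-learn
leaf_id = tree.apply(sample)[0]        # index of the leaf the sample lands in
proba = tree.predict_proba(sample)[0]  # class fractions in that leaf
pred = tree.predict(sample)[0]         # class with the largest fraction

fractions = {
    name: round(float(p), 3) for name, p in zip(iris.target_names, proba)
}
print("leaf index:", leaf_id)
print("class fractions in leaf:", fractions)
print("predicted class:", iris.target_names[pred])
```

Every sample that lands in the same leaf gets the same probabilities, because they describe the leaf's training samples rather than the individual point.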


Academic treatments describe pruning similarly: replace a whole subtree with a leaf if the expected error of the subtree is higher than that of the single leaf according to a validation or error‑estimate rule.

### Many ways to regularize trees (you don’t need all of them)

Research and practice have proposed many different methods to control decision tree complexity:


Pre‑pruning:
- Limit depth, minimum samples per split/leaf, maximum number of leaves, minimum impurity decrease.


Post‑pruning:
- Validation‑based subtree replacement.
- Cost‑complexity pruning (ccp_alpha).
- Other heuristics that evaluate subtrees and decide whether to keep them.

You do not need to memorize every pruning method. Knowing a few core ideas, namely the pre‑pruning hyperparameters and validation‑based post‑pruning, is enough to start building better trees and to recognize when a model is likely overfitting.