Training error, test error, and overfitting in decision trees all revolve around one idea: as a tree grows deeper, it fits the training data better, but past some point it starts memorizing noise instead of learning patterns, and test performance gets worse.

### Training vs test error

Training error: Percentage of misclassified examples on the training set (data used to fit the tree).

Test error: Percentage of misclassified examples on unseen test data held out from training.
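As a minimal sketch of the two definitions above, the same error-rate calculation can be applied to training predictions and to test predictions. The helper name `error_rate` and the toy labels are made up for illustration and are not part of any library.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute an error rate on an empty set")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Hypothetical labels: the model is perfect on the data it was fit on ...
y_train_true = ["a", "a", "b", "b"]
y_train_pred = ["a", "a", "b", "b"]
# ... but misclassifies 2 of 5 held-out examples.
y_test_true = ["a", "b", "b", "a", "b"]
y_test_pred = ["a", "b", "a", "a", "a"]

train_error = error_rate(y_train_true, y_train_pred)
test_error = error_rate(y_test_true, y_test_pred)
print(f"training error: {train_error:.0%}, test error: {test_error:.0%}")
```

Multiplying the returned fraction by 100 gives the percentage used in the definitions.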

### Key relationships for a decision tree

As tree depth (or number of nodes) increases:

- Training error never increases and often reaches 0% (perfect training accuracy), because the model keeps adding splits that separate training points.

- Test error typically decreases at first (the model fits true patterns), then increases (the model fits noise), giving a roughly U-shaped curve when plotted against tree size.

This pattern, low training error but high test error at large tree sizes, is the hallmark of overfitting.
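One way to see these curves is to sweep `max_depth` and record both errors. The sketch below is illustrative only: it assumes a synthetic dataset from `make_classification` with deliberately noisy labels (not the Iris data), and the depth grid and split sizes are arbitrary choices. The exact numbers, and how pronounced the rise in test error is, will depend on the data and the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; flip_y assigns a random class to ~20% of samples (label noise).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

depths = [1, 2, 3, 5, 8, 12, None]  # None: grow until leaves are pure
train_errors, test_errors = [], []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    # score() returns accuracy, so error = 1 - accuracy
    train_errors.append(1.0 - clf.score(X_train, y_train))
    test_errors.append(1.0 - clf.score(X_test, y_test))

for depth, tr, te in zip(depths, train_errors, test_errors):
    print(f"max_depth={depth!s:>4}  train error={tr:.3f}  test error={te:.3f}")
```

Plotting `train_errors` and `test_errors` against depth is a common way to choose a tree size; in practice the choice is usually made with a validation set or cross-validation rather than the final test set.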

### Overfitting in decision trees

Overfitting means the model captures random fluctuations (noise) in the training data rather than the underlying signal.

For decision trees:

- Early splits capture real structure (e.g., separating species by petal length in Iris).
- Later splits can become extremely specific, carving out tiny regions that only contain a few training points.
- When the tree keeps growing until leaves are pure (only one class), training accuracy typically reaches 100%, but generalization to new data tends to degrade.

Typical symptoms:
- Very deep tree.
- Almost zero training error.
- Noticeably higher test error than simpler (shallower) trees.

Mitigation strategies (conceptual):
- Limit maximum depth (max_depth).
- Require a minimum number of samples per leaf (min_samples_leaf).
- Prune the tree (e.g., cost-complexity pruning via ccp_alpha in scikit-learn) or use ensembles (Random Forests, Gradient Boosted Trees).
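The constraints listed above map onto `DecisionTreeClassifier` parameters in scikit-learn. The following is a rough sketch on an assumed synthetic, noisy dataset; the specific values (`max_depth=3`, `min_samples_leaf=5`, `ccp_alpha=0.01`) are placeholders for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data with noisy labels, used only for illustration.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

candidates = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "min_samples_leaf=5": DecisionTreeClassifier(min_samples_leaf=5, random_state=0),
    # ccp_alpha > 0 enables minimal cost-complexity pruning after the tree is grown.
    "ccp_alpha=0.01": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
}

for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(
        f"{name:<20} depth={clf.get_depth():>2}  leaves={clf.get_n_leaves():>3}  "
        f"train acc={clf.score(X_train, y_train):.3f}  "
        f"test acc={clf.score(X_test, y_test):.3f}"
    )
```

Each constraint yields a smaller tree than the unrestricted one. Whether it also improves test accuracy depends on the data, so the values are normally tuned with cross-validation.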

### Why the Iris tree didn’t reach 100% training accuracy

In the Iris example, a scikit-learn decision tree fit on only petal_length and petal_width reaches about 99.3% training accuracy (149 of 150 samples), not 100%.

Reason:
- There is a leaf (terminal node) that contains 3 samples that do not all belong to the same class, i.e., it is impure.

- All three samples in that leaf:
    - Have exactly the same petal_length and petal_width (4.8 and 1.8 in the example).
    - But belong to different species (e.g., two from one class, one from another).

Given only these two features (petal_length, petal_width), no further rule can separate these three points:
- Any split on petal_length or petal_width would put all three points on the same side, because their feature values are identical.
- scikit-learn therefore stops splitting at this node, leaving it impure.

Resulting behavior:
- scikit-learn assigns the majority class in that leaf as the prediction (e.g., if 2 are class A and 1 is class B, prediction is A).
- The one minority-class sample becomes a training error, which prevents 100% accuracy.
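The behavior described above can be checked with a short script. This sketch assumes the tree is fit on all 150 samples of scikit-learn's bundled Iris data, using only the two petal columns and default settings; the variable names are illustrative.

```python
from collections import Counter, defaultdict

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:4]  # columns 2 and 3: petal length (cm), petal width (cm)
y = iris.target

# Default settings: no depth limit, so the tree grows until it cannot split further.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)
n_errors = int((clf.predict(X) != y).sum())
print(f"training accuracy: {train_accuracy:.3f} ({n_errors} misclassified)")

# Group samples by their exact feature vector and keep groups that mix classes.
labels_by_point = defaultdict(Counter)
for (length, width), label in zip(X, y):
    labels_by_point[(float(length), float(width))][str(iris.target_names[label])] += 1

conflicts = {pt: dict(c) for pt, c in labels_by_point.items() if len(c) > 1}
print("identical feature vectors with mixed classes:", conflicts)
```

Any feature vector reported in `conflicts` is one the tree cannot split, so the minority-class samples there necessarily count as training errors.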

This leads to an important generalization:

*A scikit-learn decision tree (with default settings and unlimited depth) will achieve perfect training accuracy, except when there exist training samples from different classes with exactly the same feature vector.*

If one of those three flowers had slightly different features (e.g., petal length 4.8001 instead of 4.8), the tree could separate it with another split and achieve 100% training accuracy.
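The thought experiment above can be tried directly. In this sketch the perturbed sample is assumed to be the versicolor flower at (4.8, 1.8); that choice and the variable names are illustrative, and the data are again scikit-learn's bundled Iris set restricted to the two petal features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:4].copy()  # petal length (cm), petal width (cm)
y = iris.target

# Locate the versicolor sample (class index 1) with petal length 4.8 and width 1.8.
mask = (X[:, 0] == 4.8) & (X[:, 1] == 1.8) & (y == 1)
idx = np.flatnonzero(mask)[0]

# Nudge only that sample's petal length so its feature vector becomes unique.
X_perturbed = X.copy()
X_perturbed[idx, 0] = 4.8001

acc_original = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
acc_perturbed = (
    DecisionTreeClassifier(random_state=0)
    .fit(X_perturbed, y)
    .score(X_perturbed, y)
)
print(f"original:  {acc_original:.4f}")
print(f"perturbed: {acc_perturbed:.4f}")
```

A change of 0.0001 cm is far below any meaningful measurement precision, which is part of why a perfect training score says little about how the tree will behave on new flowers.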

This strong tendency toward perfect training accuracy is a warning sign: such trees are highly prone to overfitting.