# Summary.

- Reference: scikit-learn user guide, "Decision Trees": https://scikit-learn.org/stable/modules/tree.html#tree

- **Splitting criterion (impurity functions)**  
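  - Gini: $Gini = 1 - \sum_i p_i^2$, where $p_i$ is the fraction of samples in the node that belong to class $i$.  
  - Entropy: $Entropy = -\sum_i p_i \log_2 p_i$  
  - Split choice: maximize the impurity decrease, i.e. the parent impurity minus the sample-weighted impurity of the children (called information gain when the impurity is entropy).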

- **Stopping criteria**  
  - Node pure ($Gini=0$ or $Entropy=0$).  
  - `max_depth`, `min_samples_split`, `min_samples_leaf`, `min_impurity_decrease`, `max_leaf_nodes`.

- **Pruning**  
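  - Pre-pruning: stop growth early via the hyperparameters listed under stopping criteria.  
  - Post-pruning (CART): cost-complexity pruning minimizes $R_\alpha(T) = R(T) + \alpha |T|$, where $R(T)$ is the total impurity (or misclassification rate) of the leaves, $|T|$ is the number of leaves, and $\alpha \ge 0$ is set via `ccp_alpha`.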

- **Bias–variance**  
  - Shallow tree → high bias, low variance.  
  - Deep tree → low bias, high variance.

- **Regression trees**  
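  - Use variance/MSE: $MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y}$ is the mean target of the $n$ samples in the node.  
  - Leaf predicts the mean target (under MSE) or the median (under absolute error).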
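
The impurity formulas and the split rule above can be checked by hand on a tiny example. The sketch below is a minimal pure-Python illustration, not scikit-learn's implementation; the helper names (`class_proportions`, `gini`, `entropy`, `impurity_decrease`) and the toy labels are assumptions made up for this example.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples in each class present in `labels`."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the sample-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    )
    return impurity(parent) - weighted_children


# Hypothetical toy node: 4 samples of class 0 and 4 of class 1.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A mediocre candidate split: each child is still 3:1 mixed.
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(gini(parent))                                    # 0.5
print(entropy(parent))                                 # 1.0
print(impurity_decrease(parent, left, right, gini))    # 0.125
print(impurity_decrease(parent, left, right, entropy)) # information gain in bits
```

A split that separates the classes perfectly (`[0, 0, 0, 0]` and `[1, 1, 1, 1]`) would recover the full parent impurity as its decrease, which is why it would be preferred over the candidate shown here.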


# Signature.

In [None]:
DecisionTreeClassifier(criterion                    = 'gini',
                       splitter                     = 'best',
                       max_depth                    = None, 
                       min_samples_split            = 2, 
                       min_samples_leaf             = 1, 
                       min_weight_fraction_leaf     = 0.0, 
                       max_features                 = None, 
                       random_state                 = None, 
                       max_leaf_nodes               = None, 
                       min_impurity_decrease        = 0.0, 
                       class_weight                 = None, 
                       ccp_alpha                    = 0.0, 
                       monotonic_cst                = None)
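
To get a feel for how one of these parameters changes the fitted tree, the sketch below sweeps `max_depth` and records tree size together with training and validation accuracy, which also illustrates the bias-variance trade-off from the summary. The breast cancer dataset from `sklearn.datasets` and the particular depth values are assumptions chosen only so the snippet runs on its own; they are not part of the original notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a built-in dataset stands in for real data so the sketch is standalone.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in [1, 3, 5, None]:  # None = grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results[depth] = {
        "depth": clf.get_depth(),
        "leaves": clf.get_n_leaves(),
        "train_acc": clf.score(X_train, y_train),
        "val_acc": clf.score(X_val, y_val),
    }
    print(depth, results[depth])
```

Typically the unconstrained tree fits the training split almost perfectly while its validation accuracy stops improving, whereas very shallow trees underfit both splits.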


# Example Usage.

In [None]:
# Imports.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Basic.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# 1. Load data.
train = pd.read_csv("data/titanic/train.csv")
test  = pd.read_csv("data/titanic/test.csv")

# 2. Select features (a subset of the numeric columns).
features    = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
X           = train[features]
y           = train["Survived"]

# Impute missing values with 0 (simple placeholder strategy).
X = X.fillna(0)

# 3. Train/validation split.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# 5. Evaluate.
y_pred = clf.predict(X_val)
acc = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", acc)


Validation Accuracy: 0.6703910614525139
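
An unconstrained tree like the one above tends to overfit, so a natural follow-up is the cost-complexity post-pruning described in the summary. The sketch below is one possible way to pick `ccp_alpha` using `cost_complexity_pruning_path`. It assumes the built-in breast cancer dataset instead of the Titanic CSV files so that it runs standalone, and it selects the value on a single validation split, which is a simplification (cross-validation would usually be more reliable).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a built-in dataset replaces the Titanic files for a standalone sketch.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Effective alphas at which successive subtrees of the full tree get pruned away.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point error.
alphas = [max(float(a), 0.0) for a in path.ccp_alphas]

# Fit one tree per alpha and record (alpha, number of leaves, validation accuracy).
candidates = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    candidates.append((alpha, clf.get_n_leaves(), clf.score(X_val, y_val)))

# Keep the alpha with the highest validation accuracy (first one on ties).
best_alpha, best_leaves, best_val_acc = max(candidates, key=lambda c: c[2])
print("Best ccp_alpha:", best_alpha)
print("Leaves:", best_leaves, "Validation accuracy:", best_val_acc)
```

Larger values of `ccp_alpha` penalize leaves more heavily, so the candidate trees shrink as alpha grows; the first candidate (alpha 0) is the unpruned tree.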
