### Mathematics of the Random Forest Algorithm
I will explore the **mathematical foundation of the Random Forest algorithm**, which is an **ensemble of decision trees** trained using **bagging (bootstrap aggregation)** and **random feature selection**. Let’s go step by step through its **mathematical formulation** for both **regression** and **classification**, and link it to the ideas of **variance reduction** and **ensemble averaging**.

**Given** a dataset
$$
D = \{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^m, \quad y_i \in \mathcal{Y}
$$
where
$$
\mathcal{Y} =
\begin{cases}
\mathbb{R}, & \text{for regression} \\[4pt]
\{1, 2, \dots, K\}, & \text{for classification}
\end{cases}
$$

We aim to learn a predictor $h(x)$ that generalizes well by **aggregating multiple decision trees** $h_j(x)$.


### **Bootstrap Sampling**

We draw $q$ bootstrap samples, each of size $n$, **with replacement**:

$$
D_j = \{(x_i^{(j)}, y_i^{(j)})\}_{i=1}^{n}, \quad j = 1, 2, \dots, q
$$

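Each $D_j$ is drawn uniformly from the original dataset $D$. Because sampling is with replacement, a given sample is missed by all $n$ draws with probability $(1 - 1/n)^n \to e^{-1} \approx 0.368$, so each bootstrap sample contains on average about **63.2% of the distinct original samples**; the remaining ~36.8% are that tree's “out-of-bag” samples.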
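The 63.2% figure is easy to check numerically. The following is a minimal sketch using NumPy; the helper name `bootstrap_indices`, the seed, and the choice $n = 10{,}000$ are illustrative assumptions, not part of the algorithm's definition.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Draw n indices from {0, ..., n-1} uniformly, with replacement."""
    return rng.integers(0, n, size=n)

rng = np.random.default_rng(0)   # seed chosen arbitrarily, for reproducibility
n = 10_000                       # illustrative dataset size
idx = bootstrap_indices(n, rng)  # indices defining one bootstrap sample D_j

# Fraction of distinct original samples that made it into the bag.
in_bag_fraction = np.unique(idx).size / n

# Samples that were never drawn are this tree's out-of-bag samples.
oob_mask = np.ones(n, dtype=bool)
oob_mask[idx] = False

print(f"in-bag fraction: {in_bag_fraction:.3f}")
print(f"OOB fraction:    {oob_mask.mean():.3f}")
print(f"1 - 1/e:         {1 - np.exp(-1):.3f}")
```

For large $n$ the first and third printed values should agree to about two decimal places; the exact digits depend on the seed.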

### **Random Feature Subsampling**

At **each split** within the decision tree $h_j(x)$:

* Randomly select a subset of features
  $$
  \mathcal{F}_j \subseteq \{1, 2, \dots, m\}, \quad |\mathcal{F}_j| = u \le m
  $$
* Choose the **best feature $f^* \in \mathcal{F}_j$** and **split point** based on a **split criterion** (e.g., Gini, entropy, or MSE).

This introduces **decorrelation** between trees, reducing the overall **variance** of the ensemble.
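As a rough illustration of a single split with feature subsampling, the sketch below searches only a random subset of $u$ features and scores candidate thresholds by the weighted MSE of the two children. The function name `best_split`, the midpoint thresholds, and the toy data are assumptions made for this example; production implementations use faster, sorted scans.

```python
import numpy as np

def best_split(X, y, u, rng):
    """Search a random subset of u features for the split with the lowest
    weighted child MSE. Returns ((feature, threshold, score), subset)."""
    n, m = X.shape
    features = rng.choice(m, size=u, replace=False)  # the random subset F_j
    best = (None, None, np.inf)
    for f in features:
        values = np.unique(X[:, f])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, f] <= t
            right = ~left
            # Sum of squared errors of both children, divided by n.
            score = (y[left].var() * left.sum() + y[right].var() * right.sum()) / n
            if score < best[2]:
                best = (int(f), float(t), float(score))
    return best, features

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 2] > 0).astype(float)  # toy target that depends only on feature 2

# u = m: every feature is a candidate, so the informative one can be found.
(feat, thr, score), tried = best_split(X, y, u=5, rng=rng)

# u < m: the informative feature is only found if it happens to be sampled.
(feat_sub, thr_sub, score_sub), tried_sub = best_split(X, y, u=2, rng=rng)
print(feat, tried_sub, feat_sub)
```

With $u < m$, different splits (and different trees) are forced to consider different features, which is the source of the decorrelation described above.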


### **Individual Tree Learning**

Each tree $h_j(x)$ is trained independently on $D_j$ using the selected feature subsets.

Each $h_j(x)$ approximates the **true function** $f(x)$ as:
$$
h_j(x) = f(x) + \epsilon_j(x)
$$
where $\epsilon_j(x)$ is the tree-specific error, which is random because of the bootstrap sampling and the random feature selection.

### **Ensemble Aggregation**

After training $q$ trees, Random Forest aggregates them into a single predictor $h(x)$.

### **For Regression: Averaging**

$$
\boxed{
h(x) = \frac{1}{q} \sum_{j=1}^q h_j(x)
}
$$

* The final prediction is the **mean** of all individual tree predictions.
* Averaging reduces the variance. If each tree has variance $\sigma^2$ and the average pairwise correlation between trees is $\rho$, then $\operatorname{Var}[h(x)] = \rho \sigma^2 + \frac{1 - \rho}{q}\sigma^2$.

Hence, by reducing correlation (via random feature selection), the ensemble **reduces variance**.
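The averaging rule and the variance expression can be checked with a small simulation. This sketch does not train real trees: it assumes tree errors $\epsilon_j(x)$ that are Gaussian with variance $\sigma^2$ and pairwise correlation $\rho$, built from one shared component plus one private component per tree. The values of `q`, `rho`, `sigma`, `f_x`, and `trials` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
q, rho, sigma = 25, 0.3, 2.0
f_x = 1.5          # hypothetical true value f(x) at a single query point
trials = 100_000   # number of independently simulated forests

# eps_j = sigma * (sqrt(rho) * shared + sqrt(1 - rho) * own_j)
# gives Var[eps_j] = sigma^2 and Corr[eps_j, eps_l] = rho for j != l.
shared = rng.standard_normal((trials, 1))
own = rng.standard_normal((trials, q))
eps = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)

tree_preds = f_x + eps                 # h_j(x) = f(x) + eps_j(x)
forest_pred = tree_preds.mean(axis=1)  # h(x) = (1/q) * sum_j h_j(x)

empirical = forest_pred.var()
theoretical = rho * sigma**2 + (1 - rho) / q * sigma**2

print(f"single-tree variance: {tree_preds[:, 0].var():.3f}")
print(f"forest variance:      {empirical:.3f}")
print(f"formula:              {theoretical:.3f}")
```

The forest variance should land close to the formula's value and well below the single-tree variance $\sigma^2$, but never below the floor $\rho\sigma^2$.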


### **For Classification: Majority Voting**

Each tree outputs a class prediction:
$$
h_j(x) \in \{1, 2, \dots, K\}
$$
Then the final classifier is:
$$
\boxed{
h(x) = \arg\max_{k \in \mathcal{Y}} \sum_{j=1}^q \mathbf{1}\big(h_j(x) = k\big)
}
$$
where $\mathbf{1}(\cdot)$ is the indicator function (equals 1 if true, 0 otherwise).

This is a **majority voting** rule (strictly, a plurality vote when $K > 2$): the class predicted by the largest number of trees is selected.
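The voting rule translates almost directly into array operations. In this minimal sketch the labels are assumed to be encoded as $0, \dots, K-1$ rather than $1, \dots, K$, and ties are broken in favour of the smallest label because that is how `argmax` behaves; the formula above does not specify a tie-breaking rule. The function name `majority_vote` and the toy predictions are illustrative.

```python
import numpy as np

def majority_vote(tree_preds, n_classes):
    """tree_preds: int array of shape (q, n_points), labels in {0, ..., K-1}.
    Returns the winning label per point and the (K, n_points) vote counts."""
    # counts[k, i] = sum_j 1(h_j(x_i) = k)
    counts = np.stack([(tree_preds == k).sum(axis=0) for k in range(n_classes)])
    return counts.argmax(axis=0), counts

# Three hypothetical trees voting on four query points.
tree_preds = np.array([
    [0, 1, 2, 1],
    [0, 2, 2, 0],
    [1, 1, 0, 2],
])
labels, counts = majority_vote(tree_preds, n_classes=3)
print(labels)  # the last point is a three-way tie, resolved to label 0
```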


### **Bias–Variance Decomposition**

If the $h_j(x)$ are unbiased estimators of $f(x)$, each with variance $\sigma^2$ and pairwise correlation $\rho$, then averaging leaves the bias unchanged and the ensemble variance is:

$$
\operatorname{Var}[h(x)] = \rho \sigma^2 + \frac{1 - \rho}{q} \sigma^2
$$

This means:

* As $q \to \infty$, $\operatorname{Var}[h(x)] \to \rho \sigma^2$.
* Lower correlation $\rho$ $\Rightarrow$ stronger variance reduction.

Hence, **decorrelation (via feature subsampling)** is key to Random Forest’s success.
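For completeness, the variance expression follows in a few lines under the idealized assumption that every tree has the same variance $\sigma^2$ and every distinct pair of trees has the same correlation $\rho$, so that $\operatorname{Cov}[h_j(x), h_l(x)] = \rho\sigma^2$ for $j \ne l$:

$$
\begin{aligned}
\operatorname{Var}[h(x)]
&= \operatorname{Var}\left[\frac{1}{q}\sum_{j=1}^q h_j(x)\right]
 = \frac{1}{q^2}\sum_{j=1}^q \sum_{l=1}^q \operatorname{Cov}\big[h_j(x), h_l(x)\big] \\
&= \frac{1}{q^2}\Big(q\,\sigma^2 + q(q-1)\,\rho\sigma^2\Big)
 = \frac{\sigma^2}{q} + \frac{q-1}{q}\,\rho\sigma^2 \\
&= \rho\sigma^2 + \frac{1-\rho}{q}\,\sigma^2
\end{aligned}
$$

The first term does not depend on $q$, which is why adding more trees cannot push the variance below $\rho\sigma^2$.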

### **Out-of-Bag (OOB) Error Estimate**

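Each sample $(x_i, y_i)$ is left out of the bootstrap sample of roughly $e^{-1} \approx 36.8\%$ of the trees (about one third).
For classification, we can therefore estimate the model’s generalization error without a separate validation set as: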

$$
\text{OOB Error} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\big(y_i \ne \hat{y}_i^{OOB}\big)
$$
where $\hat{y}_i^{OOB}$ is the majority vote among trees that did *not* include $(x_i, y_i)$ during training.
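The OOB computation can be sketched without committing to a particular tree implementation. The function below takes per-tree predictions and bootstrap indices as plain arrays; the name `oob_error`, the $0, \dots, K-1$ label encoding, the tie-breaking by `argmax`, and the tiny hand-made example are all assumptions for illustration. Samples that happen to be in-bag for every tree have no OOB vote and are skipped.

```python
import numpy as np

def oob_error(y, tree_preds, bootstrap_idx, n_classes):
    """Misclassification rate of the out-of-bag majority vote.

    y:             (n,)   true labels in {0, ..., K-1}
    tree_preds:    (q, n) label each tree predicts for every training sample
    bootstrap_idx: (q, n) indices drawn for each tree's bootstrap sample
    """
    q, n = tree_preds.shape
    votes = np.zeros((n, n_classes), dtype=int)
    for j in range(q):
        oob = np.ones(n, dtype=bool)
        oob[bootstrap_idx[j]] = False          # samples tree j never saw
        rows = np.flatnonzero(oob)
        votes[rows, tree_preds[j, rows]] += 1  # tree j votes only on its OOB samples
    covered = votes.sum(axis=1) > 0            # samples with at least one OOB vote
    y_oob = votes.argmax(axis=1)
    return np.mean(y_oob[covered] != y[covered])

# Tiny hand-made example: n = 4 samples, q = 3 trees, K = 2 classes.
y = np.array([0, 1, 1, 0])
bootstrap_idx = np.array([
    [0, 0, 1, 1],   # tree 0: samples 2 and 3 are OOB
    [1, 2, 2, 3],   # tree 1: sample 0 is OOB
    [0, 3, 3, 3],   # tree 2: samples 1 and 2 are OOB
])
tree_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
])
err = oob_error(y, tree_preds, bootstrap_idx, n_classes=2)
print(err)  # only sample 1 is misclassified by its OOB vote, so 1/4
```

For regression, the same bookkeeping applies with the OOB vote replaced by the mean of the OOB predictions and the indicator replaced by a squared error.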


### **Summary Table**

| Step | Mathematical Operation                                                                    | Purpose                    |
| ---- | ----------------------------------------------------------------------------------------- | -------------------------- |
| 1    | $D_j \sim D$ (bootstrap with replacement), $j = 1, \dots, q$                              | Data randomness            |
| 2    | Randomly select $u \le m$ features per split                                              | Feature randomness         |
| 3    | Train each tree $h_j(x)$ independently on $D_j$                                           | Base learner               |
| 4a   | $h(x) = \frac{1}{q}\sum_{j=1}^q h_j(x)$                                                   | Regression aggregation     |
| 4b   | $h(x) = \arg\max_k \sum_{j=1}^q \mathbf{1}(h_j(x)=k)$                                     | Classification aggregation |
| 5    | $\text{OOB Error} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(y_i \ne \hat{y}_i^{OOB})$          | Model evaluation           |
