# Scaling Laws in Machine Learning

Ever found yourself asking questions like:

* "If I collect twice as much data, how much better will my model *actually* get?"
* "What performance boost do I get from doubling my model's number of parameters?"
* "How can I predict the performance of a model without having to run every single experiment?"

If so, you're in the right place! These are super common (and important!) questions in the world of machine learning. The exciting part is that the way a model's performance improves often follows simple mathematical rules, and these rules are what we call **scaling laws**. They describe how a model's error or performance changes as we scale up certain resources, like the amount of training data or the size of the model itself.

**What's Inside This Notebook?**

In this notebook, we'll take a look at the mathematical intuition that underpins these scaling laws, in three parts:

1.  **Scaling with Dataset Size:** We'll first investigate how a model's error typically decreases as we feed it more training data.
2.  **Scaling with Model Parameters:** Next, we'll look at how increasing the complexity of a model, or its number of parameters, affects its error.
3.  **Practical experiment:** Finally, we will test these two scaling laws in code with a regression model (a random forest) and compare the measurements with the theory.

Ready to dive in? Let's get started!

---

## 1. Scaling law with respect to dataset size

### 1.1 The Basic Idea: Learning from Limited Data

When we train a machine learning model, we're trying to find a model that works well not just on the training data we have, but also on new, unseen data. The error on unseen data is the **true error**, let's call it $R(h)$ for a model $h$. The error on our training data (of size $N$) is the **empirical error**, $\hat{R}_N(h)$.

We hope that if $\hat{R}_N(h)$ is small, then $R(h)$ will also be small. However, because our training data is just a sample, $\hat{R}_N(h)$ is only an estimate of $R(h)$. The key question is: how good is this estimate?

### 1.2 Deriving the scaling law

Imagine, for a moment, that we have just one fixed model $h$ (we didn't even train it, we just picked it). We want to know its true error $R(h)$. We can estimate this by calculating its error $\hat{R}_N(h)$ on $N$ data points.

Each data point gives us a piece of information about whether $h$ is correct or not. Let $Z_i = 1$ if $h$ is wrong on the $i$-th data point, and $Z_i = 0$ if it's right.
The true error is the average of $Z_i$ over all possible data:
$$ R(h) = \mathbb{E}[Z_i] $$
The empirical error is the average over our sample:
$$ \hat{R}_N(h) = \frac{1}{N} \sum_{i=1}^{N} Z_i $$
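
As a small illustration of these two definitions, here is a minimal sketch in plain Python. The classifier `fixed_model`, the labelling rule `true_label`, and the uniform data distribution are all hypothetical choices made for this example; they are picked so that the true error is known to be exactly 0.1.

```python
import random

def fixed_model(x):
    """A fixed, untrained classifier (hypothetical): predicts 1 when x > 0.5."""
    return 1 if x > 0.5 else 0

def true_label(x):
    """Hypothetical ground truth: the label is 1 when x > 0.6."""
    return 1 if x > 0.6 else 0

def empirical_error(n, seed=0):
    """Average of the indicators Z_i over n points drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n):
        x = rng.random()
        z = int(fixed_model(x) != true_label(x))  # Z_i = 1 when h is wrong
        mistakes += z
    return mistakes / n

# The model is wrong exactly when 0.5 < x <= 0.6, so the true error R(h) is 0.1
for n in (100, 10_000, 100_000):
    print(f"N={n:6d}  empirical error={empirical_error(n):.4f}  true error=0.1000")
```

The empirical error fluctuates around 0.1, and the fluctuations get smaller as the sample grows.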

Statistical theory (specifically, Hoeffding's inequality, which is a concentration inequality) tells us how close the sample average $\hat{R}_N(h)$ is likely to be to the true average $R(h)$. It states that the probability of the difference being large is small:
$$ P\left( |R(h) - \hat{R}_N(h)| \ge \epsilon \right) \le 2 e^{-2N\epsilon^2} $$
This formula means:
- The probability that the true error $R(h)$ and the empirical error $\hat{R}_N(h)$ differ by more than some amount $\epsilon$ decreases very quickly as $N$ (the dataset size) increases.
- It also decreases as $\epsilon$ (the allowed difference) increases.
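
To get a feel for the inequality, the sketch below compares the bound with a simulated deviation frequency. The true error of 0.3, the tolerance of 0.1, and the number of trials are arbitrary values chosen for illustration, and the function names are made up for this example.

```python
import math
import random

def hoeffding_bound(n, eps):
    """Upper bound 2 * exp(-2 * n * eps^2) on P(|R - R_hat| >= eps)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

def deviation_frequency(true_error, n, eps, trials=2000, seed=0):
    """Fraction of simulated samples whose empirical error misses true_error by >= eps."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Each Z_i equals 1 with probability true_error (the fixed model is wrong)
        wrong = sum(rng.random() < true_error for _ in range(n))
        if abs(wrong / n - true_error) >= eps:
            misses += 1
    return misses / trials

for n in (50, 200, 800):
    freq = deviation_frequency(true_error=0.3, n=n, eps=0.1)
    print(f"N={n:4d}  observed={freq:.4f}  bound={hoeffding_bound(n, 0.1):.4f}")
```

The observed frequency should sit below the bound for every $N$; Hoeffding's inequality is a worst-case guarantee, so it is usually loose.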


Now, let's flip this around. Suppose we want to be reasonably confident (say, with probability $1-\delta$, where $\delta$ is small, like 0.05) about how much $R(h)$ can differ from $\hat{R}_N(h)$. We can say that, with high probability:
$$ |R(h) - \hat{R}_N(h)| \le \sqrt{\frac{\ln(2/\delta)}{2N}} $$
Dropping the constants, the gap between the empirical error and the true error shrinks as:
$$ |R(h) - \hat{R}_N(h)| = O\left(\frac{1}{\sqrt{N}}\right) $$
So, for a single, fixed model, the "uncertainty" in our estimate of the true error decreases proportionally to $1/\sqrt{N}$.
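
The width of this confidence interval is easy to evaluate numerically. The helper below is a small sketch (the name `confidence_radius` is ours) that computes $\sqrt{\ln(2/\delta)/(2N)}$ for a few dataset sizes with $\delta = 0.05$.

```python
import math

def confidence_radius(n, delta=0.05):
    """Radius sqrt(ln(2/delta) / (2n)) of the Hoeffding interval at confidence 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# Each step multiplies N by 4, which should halve the radius
for n in (100, 400, 1600, 6400):
    print(f"N={n:5d}  radius={confidence_radius(n):.4f}")
```

Because of the square root, quadrupling the dataset only halves the uncertainty.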


### 1.3 Conclusion

In practice, machine learning models are never perfect and there is always some bias. We model it with a term $E_{\text{bias}}$, the error that even an infinitely large dataset couldn't eliminate, and we fold the constants of the bound above into a constant $C$. Treating the estimation gap as the data-dependent part of the model's error gives the heuristic:
$$ MSE(h) = \frac{C}{\sqrt{N}} + E_{\text{bias}}$$
which is of the form:
$$ MSE(h) = A \cdot N^{-\alpha} + B \quad \Rightarrow \quad \boxed{\alpha = 1/2}$$

This shows a common way the $1/\sqrt{N}$ scaling arises: it's often linked to how quickly our statistical estimation error (the uncertainty from using a finite sample) decreases as we get more data. While more complex scaling laws (like $N^{-\alpha}$ with different $\alpha$) exist, this $1/\sqrt{N}$ behavior is a fundamental reference point from statistical learning theory.
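
One simple way to read an exponent off such a curve is a straight-line fit in log-log space once the floor $B$ has been subtracted. The sketch below does this on synthetic, noise-free values of $C/\sqrt{N} + E_{\text{bias}}$; the constants are illustrative, and the floor is assumed to be known, which is rarely the case with real measurements.

```python
import math

def loglog_slope(sizes, errors, floor):
    """Least-squares slope of log(error - floor) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e - floor) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Synthetic errors following C / sqrt(N) + E_bias (illustrative constants)
C, E_bias = 2.0, 0.25
sizes = [100, 1_000, 10_000, 100_000]
errors = [C / math.sqrt(n) + E_bias for n in sizes]

# The slope in log-log space is -alpha
alpha = -loglog_slope(sizes, errors, floor=E_bias)
print(f"estimated alpha = {alpha:.3f}")
```

With noisy data and an unknown floor, a nonlinear fit of all three constants at once is the more robust option.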

---

## 2. Scaling law with respect to the number of parameters

### 2.1 Problem Formulation

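We train a random forest with $N$ independent regression trees. In this section, $N$ denotes the number of trees (our proxy for the number of parameters), not the dataset size. Each tree is an estimator of the target function. We assume: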

- Each tree has a prediction variance of $\sigma^2$
- Trees are independent (or weakly correlated)
- The final prediction is the average of the individual tree predictions


### 2.2 Variance of the Averaged Model

Let $Y_i$ be the prediction of the $i$-th tree. The ensemble prediction is:

$$
\hat{Y} = \frac{1}{N} \sum_{i=1}^N Y_i
$$

If the trees are independent with variance $\operatorname{Var}(Y_i) = \sigma^2$, then the variance of the average is:

$$
\operatorname{Var}(\hat{Y}) = \frac{\sigma^2}{N}
$$

This is a classical property of the mean of independent estimators.
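
This property can be checked with a quick simulation. In the sketch below, each "tree" is replaced by an independent Gaussian draw with standard deviation $\sigma$, which is a strong simplification of a real regression tree; the function name and the number of trials are arbitrary.

```python
import random
import statistics

def variance_of_average(n_trees, sigma=1.0, trials=4000, seed=0):
    """Empirical variance of the mean of n_trees independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    averages = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n_trees))
        for _ in range(trials)
    ]
    return statistics.pvariance(averages)

for n_trees in (1, 4, 16, 64):
    observed = variance_of_average(n_trees)
    print(f"N={n_trees:2d}  observed={observed:.4f}  sigma^2/N={1.0 / n_trees:.4f}")
```

The observed variance should track $\sigma^2/N$ up to sampling noise.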

### 2.3 Mean Squared Error (MSE)

The total error can be written using the bias-variance decomposition:

$$
\text{MSE} = \underbrace{\text{Bias}^2}_{\text{model bias}} + \underbrace{\text{Variance}}_{\frac{\sigma^2}{N}}
$$
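
The decomposition can be verified numerically. The sketch below builds estimates that share a fixed offset (the bias) plus averaged Gaussian noise (the variance), then checks that the two terms add up to the MSE. The offset of 0.3, the target of 0, and the function names are illustrative assumptions.

```python
import random
import statistics

TARGET = 0.0  # value the estimators try to predict (illustrative)

def decompose_mse(estimates, target):
    """Return (MSE, squared bias, variance) of a list of estimates of `target`."""
    mse = statistics.fmean((e - target) ** 2 for e in estimates)
    bias_sq = (statistics.fmean(estimates) - target) ** 2
    variance = statistics.pvariance(estimates)
    return mse, bias_sq, variance

def simulate_estimates(n_trees, bias=0.3, sigma=1.0, trials=2000, seed=0):
    """Each estimate averages n_trees noisy predictions sharing a fixed offset `bias`."""
    rng = random.Random(seed)
    return [
        TARGET + bias + statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n_trees))
        for _ in range(trials)
    ]

for n_trees in (1, 10, 100):
    mse, bias_sq, variance = decompose_mse(simulate_estimates(n_trees), TARGET)
    print(f"N={n_trees:3d}  MSE={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Averaging more predictions shrinks the variance term, while the squared bias stays where it is.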

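In a random forest, the trees are typically deep and hence low-bias (bias $\approx 0$), so the variance term carries the dependence on $N$.

Writing $A = \sigma^2$ and collecting everything that does not depend on $N$ (residual bias and irreducible noise) into a constant $B$, we can approximate: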

$$
\text{MSE}(N) \approx \frac{A}{N} + B
$$

Which is of the form:

$$
\text{MSE}(N) = A \cdot N^{-\alpha} + B \quad \Rightarrow \quad \boxed{\alpha = 1}
$$
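
The derivation above relies on the trees being independent. As a side note, if every pair of trees instead shares a correlation $\rho$ (an assumption we add here, not part of the derivation), the standard result for the variance of the average is $\rho\sigma^2 + (1-\rho)\sigma^2/N$. This keeps the $A/N + B$ shape but adds a floor that more trees cannot remove. The sketch below simulates this with a shared noise component; all names and values are illustrative.

```python
import math
import random
import statistics

def predicted_variance(n_trees, rho, sigma=1.0):
    """Variance of the average of n_trees equicorrelated estimators."""
    return rho * sigma ** 2 + (1.0 - rho) * sigma ** 2 / n_trees

def correlated_average_variance(n_trees, rho, sigma=1.0, trials=4000, seed=0):
    """Simulate trees as sqrt(rho) * shared noise + sqrt(1 - rho) * individual noise."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # component common to every tree
        own = statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_trees))
        averages.append(sigma * (math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own))
    return statistics.pvariance(averages)

for n_trees in (1, 10, 100):
    observed = correlated_average_variance(n_trees, rho=0.2)
    print(f"N={n_trees:3d}  observed={observed:.4f}  predicted={predicted_variance(n_trees, 0.2):.4f}")
```

With $\rho = 0.2$, the variance levels off near $0.2\,\sigma^2$ instead of going to zero.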

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.optimize import curve_fit
from tqdm import tqdm
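import ssl

# Workaround for SSL certificate errors when downloading the dataset.
# It disables certificate verification, so use it only in a trusted environment.
ssl._create_default_https_context = ssl._create_unverified_context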


### Load dataset ###

data = fetch_california_housing()
X, y = data.data, data.target

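# Train - test split (done first so the scaler never sees the test data)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data normalization, fitted on the training split only
scaler_X = StandardScaler()
X_train_full = scaler_X.fit_transform(X_train_full)
X_test = scaler_X.transform(X_test)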




### Training models ###

n_points = 10

# Dataset sizes to test (full dataset fractions)
fractions = np.linspace(1e-2, 1, n_points) # 1% to 100%
train_sizes = [int(f * len(X_train_full)) for f in fractions]

# Number of trees to test (our proxy for the number of parameters)
estimators_nb = np.linspace(1, 200, n_points, dtype=int)


errors_samples = []
errors_params = []

for params, size in tqdm(zip(estimators_nb, train_sizes), total=len(train_sizes)):
    
    # Number of trees changes (full training set); fixed seed for reproducibility
    model = RandomForestRegressor(n_estimators=params, random_state=42)
    model.fit(X_train_full, y_train_full)
    y_pred = model.predict(X_test)
    mse_params = mean_squared_error(y_test, y_pred)
    errors_params.append(mse_params)
    
    # Dataset size changes
    X_train = X_train_full[:size]
    y_train = y_train_full[:size]
    model = RandomForestRegressor(n_estimators=100, random_state=42) # 100 trees is the default
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse_samples = mean_squared_error(y_test, y_pred)
    errors_samples.append(mse_samples)
    
    
    



### Compute the regression for the scaling laws ###

def scaling_law(N, A, alpha, B):
    return A * N**(-alpha) + B

def regression(x, y):
    """ Compute the regression of the points (x, y) on the scaling law function. """
    
    # Fit of A, alpha, B (alpha may exceed 1 so that the predicted value
    # alpha = 1 is not pinned to the upper bound)
    params, covariance = curve_fit(scaling_law, x, y, bounds=([0, 0, 0], [np.inf, 2, np.inf]))
    A, alpha, B = params

    print(f"A = {A:.3f}, alpha = {alpha:.3f}, B = {B:.3f}")
    print("Covariances :\n", covariance, "\n")

    # Lists of points
    N_fit = np.linspace(min(x), max(x), 100)
    Error_fit = scaling_law(N_fit, *params)
    
    return N_fit, Error_fit, alpha


x_samples, y_samples, alpha_samples = regression(train_sizes, errors_samples)
x_params, y_params, alpha_params = regression(estimators_nb, errors_params)





### Plots the results ###

plt.figure(figsize=(10, 5))

# Error vs Size
plt.subplot(1, 2, 1)
plt.scatter(train_sizes, errors_samples, marker='o', label='Data')
plt.plot(x_samples, y_samples, color='red', label="Fit: " + r"$A \cdot N^{-\alpha} + B$" + f", α={alpha_samples:.2f}")
plt.xlabel("Training dataset size")
plt.ylabel("Error (MSE)")
plt.title("Scaling Law: Error vs Dataset size")
plt.legend()

# Error vs Parameters
plt.subplot(1, 2, 2)
plt.scatter(estimators_nb, errors_params, marker='o', label='Data')
plt.plot(x_params, y_params, color='red', label="Fit: " + r"$A \cdot N^{-\alpha} + B$" + f", α={alpha_params:.2f}")
plt.xlabel("Number of trees")
plt.ylabel("Error (MSE)")
plt.title("Scaling Law: Error vs Number of trees")
plt.legend()

plt.tight_layout()
plt.show()



## Conclusion

We recovered the predicted exponent for the scaling law with respect to the number of parameters, but our model seems to have a smaller alpha coefficient than predicted for the dataset-size scaling law.
In large models (deep learning), we observe in practice a slower decay than $1/N$, often of the form $N^{-\alpha}$ with $\alpha \in [0.1, 0.3]$. This slowdown can be explained by:
* Underoptimization of the model (it does not fully converge)
* Redundancy in the data
* Increasing task complexity
* The fact that the model does not have fixed capacity (the model grows with the dataset)

The last point is key: **Kaplan et al.** showed that to maintain optimal training, the number of parameters $P$, the dataset size $N$, and the compute budget $C$ must grow together, following coordinated power laws. Specifically, given a 100-fold increase in computational resources ($C$), Kaplan et al. recommended scaling model size by approximately 28.8 times ($P_{opt} \propto C^{0.73}$), while increasing dataset size by only about 3.5 times ($N_{opt} \propto C^{0.27}$).
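
The two multipliers follow directly from the exponents. The short sketch below redoes the arithmetic; `scale_factor` is a helper name made up for this example.

```python
def scale_factor(compute_multiplier, exponent):
    """Growth of a quantity that scales as C**exponent when C is multiplied."""
    return compute_multiplier ** exponent

model_growth = scale_factor(100, 0.73)  # P_opt grows as C^0.73
data_growth = scale_factor(100, 0.27)   # N_opt grows as C^0.27

print(f"model size x{model_growth:.1f}")   # x28.8
print(f"dataset size x{data_growth:.1f}")  # x3.5
```

The two exponents sum to 1, so the product of the two factors gives back the 100-fold compute increase.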
