# Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are both statistical tools used for **model selection**. They help you choose the best model from a set of candidate models by balancing two competing goals: **goodness of fit** and **model complexity**.

Both criteria are particularly useful when comparing models that may have different numbers of parameters, as they penalize models for being more complex to prevent **overfitting**.

## Akaike Information Criterion (AIC)

AIC is an estimator of prediction error and is rooted in information theory. It estimates the relative amount of information a model loses when representing the process that generated the data. The model with the lowest AIC value is considered the best among the candidate models. 

The formula is:

$$AIC = -2 \ln(\hat{L}) + 2k$$

* **$-2 \ln(\hat{L})$**: This is the **goodness-of-fit term**. It's derived from the maximum likelihood ($\hat{L}$) of the model, which measures how well the model fits the data. A higher likelihood (and thus a smaller negative log-likelihood) indicates a better fit.
* **$2k$**: This is the **penalty term**. It's a penalty for model complexity, where $k$ is the total number of parameters estimated by the model. A more complex model (one with more parameters) gets a higher penalty.

Here $k$ counts every parameter that is estimated from the data. If the model's function $f(x_i, \beta)$ has $p$ structural parameters (e.g., the $\beta$ coefficients in a regression model) and the error variance $\sigma^2$ is known, then $k = p$. If the error variance $\sigma^2$ is unknown, it must also be estimated from the data, so it counts as an additional parameter and $k = p + 1$. We will review this in detail below.

The AIC's goal is to find the model that best approximates the unknown data-generating process, even if that process isn't one of the candidate models. It's focused on **predictive accuracy**.
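As a minimal sketch, the formula translates directly into a small Python function. The function name `aic` and the two log-likelihood values are hypothetical, chosen only to illustrate the arithmetic.

```python
def aic(log_likelihood, k):
    """Return AIC = -2 ln(L_hat) + 2k for a fitted model."""
    return -2.0 * log_likelihood + 2.0 * k


# Hypothetical maximized log-likelihoods for two candidate models.
aic_simple = aic(log_likelihood=-120.0, k=3)   # 240 + 6 = 246
aic_complex = aic(log_likelihood=-118.5, k=5)  # 237 + 10 = 247

# The lower AIC is preferred: here the slightly better fit of the complex
# model does not make up for its two extra parameters.
preferred = "simple" if aic_simple < aic_complex else "complex"
```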

## Bayesian Information Criterion (BIC)

BIC, also known as the Schwarz Information Criterion (SIC), is a criterion for model selection derived from a Bayesian perspective. Like AIC, it balances goodness of fit and complexity, but it does so more aggressively. The model with the lowest BIC value is the one preferred. 

The formula is:

$$BIC = -2 \ln(\hat{L}) + k \ln(n)$$

* **$-2 \ln(\hat{L})$**: This is the same goodness-of-fit term as in AIC.
* **$k \ln(n)$**: This is the **penalty term**. Each parameter costs $\ln(n)$ instead of 2, where $n$ is the number of data points, so the penalty for complexity is stronger than AIC's whenever $\ln(n) > 2$ (that is, for $n \ge 8$).

Because of the $\ln(n)$ factor, BIC applies a much heavier penalty for additional parameters, especially as the sample size grows. This means BIC tends to favor **simpler models** more strongly than AIC. It assumes that one of the candidate models is the "true" model, and its goal is to find that true model.
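The effect of the $\ln(n)$ factor can be checked numerically. The sketch below uses hypothetical helper names (`bic`, `penalty_per_parameter`) and arbitrary sample sizes; it only compares the cost of one extra parameter under each criterion.

```python
import math


def bic(log_likelihood, k, n):
    """Return BIC = -2 ln(L_hat) + k ln(n) for a fitted model."""
    return -2.0 * log_likelihood + k * math.log(n)


def penalty_per_parameter(n):
    """Cost of one extra parameter under each criterion."""
    return {"aic": 2.0, "bic": math.log(n)}


for n in (7, 8, 100, 10_000):
    p = penalty_per_parameter(n)
    print(f"n={n:>6}: AIC penalty={p['aic']:.3f}, BIC penalty={p['bic']:.3f}")
```

For $n = 7$ the BIC cost per parameter ($\ln 7 \approx 1.95$) is still slightly below AIC's; from $n = 8$ onward ($\ln 8 \approx 2.08$) it is larger and keeps growing with the sample size.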

### Key Differences and When to Use Which

| Feature | Akaike Information Criterion (AIC) | Bayesian Information Criterion (BIC) |
| :--- | :--- | :--- |
| **Penalty for Complexity** | Penalizes with $2k$. | Penalizes with $k \ln(n)$. |
| **Focus** | Finds the best **approximating** model for predictive accuracy. | Finds the "true" model. |
| **Sample Size** | The penalty is constant with respect to sample size. | The penalty increases with sample size, favoring simpler models. |
| **Behavior** | Tends to select more complex models than BIC. | Tends to select simpler models than AIC. |

NOTE: The terms "approximating model" and "true model" refer to the assumptions each criterion makes about the reality you are trying to model.

**AIC assumes that the "true" data-generating process is infinitely complex and therefore unknowable.** We can only ever hope to approximate it with our models.

* **Goal**: AIC's goal is to find the model that provides the best trade-off between bias (how far your model is from the true process) and variance (how much your model would change with different data). In simpler terms, it's about building the model that will give you the most accurate predictions on a new, unseen dataset.
* **Best for**: This makes AIC ideal for tasks where **prediction** is the primary goal, such as forecasting future stock prices, predicting customer behavior, or machine learning applications. You're not trying to discover the fundamental laws of the universe; you're just trying to make the most useful prediction you can.
* **Mathematical Justification**: AIC's penalty term ($2k$) is derived from the Kullback-Leibler (KL) divergence, which is a measure of the information lost when approximating reality with a given model. By minimizing AIC, you are minimizing this information loss.

**BIC assumes that there is a "true" model that generated the data, and this model exists within your set of candidate models.** This is often a good assumption for scientific fields where we believe there are underlying, fixed physical laws governing a system.

* **Goal**: BIC's goal is to select the model that is most likely to be this "true" model. Because of its heavier penalty for more parameters ($k\ln(n)$), BIC is more likely to choose simpler, more parsimonious models. It essentially bets that the simplest model that explains the data well is the correct one.
* **Best for**: This makes BIC more suitable for tasks of **explanation and discovery**, where you want to find the most fundamental, elegant, and concise model that describes a phenomenon. This is common in fields like physics, biology, and some social sciences.
* **Mathematical Justification**: BIC is derived from Bayesian inference. For large $n$, $-\frac{1}{2}BIC$ approximates the logarithm of the marginal likelihood (the model evidence). If all candidate models are given equal prior probability, minimizing BIC therefore approximately maximizes the posterior probability of the selected model, given the data.

In essence:

* **AIC** asks: "Which model will give me the most accurate predictions for the future?"
* **BIC** asks: "Which model is most likely to be the true explanation for the data I have?"

This is why AIC often chooses a slightly more complex model than BIC. AIC is willing to accept a little extra complexity if it improves predictive accuracy, while BIC is more conservative, preferring a simpler model unless the evidence for a more complex one is overwhelming.

In practice, if your goal is to find the model with the highest predictive power, AIC is often preferred. If your goal is to select the most efficient or fundamental model that is likely to be the "true" one, BIC might be a better choice.

When you compare two models, you compute the AIC difference $\Delta \text{AIC}$ and the BIC difference $\Delta \text{BIC}$ between them. Here are some general guidelines for interpreting these differences, based on conventions in the scientific community:

* **AIC Difference ($\Delta \text{AIC}$)**: A difference of more than 2 between two models is often considered significant. If the difference is between 0 and 2, the models are considered to have a similar level of support from the data. A model with a $\Delta$AIC of 10 or more is considered to have very little support compared to the best model.
* **BIC Difference ($\Delta \text{BIC}$)**: Because BIC has a stronger penalty for complexity, its differences are interpreted more stringently. A difference of 0-2 is considered as weak evidence for the model with the lower BIC, a difference of 2-6 is considered positive evidence, a difference of 6-10 is strong evidence, and a difference of more than 10 is considered very strong evidence.
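A small helper can turn a set of criterion values into differences from the best model and attach the evidence labels listed above. This is only a sketch: the function names and the BIC values are hypothetical, and assigning the boundary values 2, 6, and 10 to the lower category is an assumption, since the quoted ranges share their endpoints.

```python
def deltas(scores):
    """Difference between each model's score and the best (lowest) score."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}


def bic_evidence(delta_bic):
    """Label a BIC difference using the thresholds quoted above."""
    if delta_bic <= 2:
        return "weak"
    if delta_bic <= 6:
        return "positive"
    if delta_bic <= 10:
        return "strong"
    return "very strong"


# Hypothetical BIC values for three candidate models.
bic_values = {"linear": 246.0, "quadratic": 249.5, "cubic": 258.5}
delta_bic = deltas(bic_values)

# Evidence in favor of the best model over each of the other candidates.
labels = {name: bic_evidence(d) for name, d in delta_bic.items() if d > 0}
```

Here the linear model has the lowest BIC, the quadratic model trails by 3.5 (positive evidence for the linear model), and the cubic model trails by 12.5 (very strong evidence).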

## Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a powerful and flexible method for estimating the parameters of a statistical model. It's used to find the parameter values that make the observed data most probable. The **likelihood function** quantifies how probable the observed data is for a given set of parameters. The **log-likelihood function** is simply the natural logarithm of the likelihood function.

The core idea of MLE is to find the parameter values that maximize the likelihood function, $L(\theta|x)$, where $\theta$ represents the parameters and $x$ is the data. The likelihood function is often a product of probability density functions (or probability mass functions) for each data point. For a set of independent and identically distributed (i.i.d.) observations, the likelihood is:

$$L(\theta|x) = \prod_{i=1}^{n} f(x_i|\theta)$$

Working with this product can be difficult, especially with many data points, as multiplying many small numbers can lead to computational underflow. This is where the **log-likelihood** comes in. By taking the natural logarithm, the product is transformed into a sum, which is much more computationally stable and easier to differentiate.

$$\ln L(\theta|x) = \ln \left( \prod_{i=1}^{n} f(x_i|\theta) \right) = \sum_{i=1}^{n} \ln f(x_i|\theta)$$

Since the logarithm is a monotonic function, maximizing the log-likelihood function yields the **exact same parameter estimates** as maximizing the likelihood function.
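The underflow problem is easy to reproduce. The following sketch, which assumes 2000 standard-normal observations and an arbitrary random seed, multiplies the individual densities directly and then sums their logarithms instead.

```python
import math
import random


def normal_log_pdf(x, mu, sigma):
    """Natural log of the normal density N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


random.seed(0)  # arbitrary seed, only for reproducibility
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Direct product of densities: every factor is below 0.4, so the running
# product drops below the smallest representable float and becomes 0.0.
likelihood = 1.0
for x in data:
    likelihood *= math.exp(normal_log_pdf(x, 0.0, 1.0))

# Sum of log-densities: an ordinary, finite negative number.
log_likelihood = sum(normal_log_pdf(x, 0.0, 1.0) for x in data)

print(likelihood)      # 0.0 (underflow)
print(log_likelihood)  # finite
```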

The connection between Maximum Likelihood and regression modeling becomes clear when we make assumptions about the distribution of the errors. While Ordinary Least Squares (OLS) regression minimizes the sum of squared residuals, MLE provides a more general framework that can lead to the same result under specific conditions.

**For OLS:**
The OLS method finds the coefficients that minimize the **sum of squared residuals** (or errors). That is, it minimizes the following:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2$$

**Relation to MLE:**
The OLS solution is the Maximum Likelihood Estimate for a linear regression model **if we assume that the error terms ($\epsilon_i$) are independently and identically distributed (i.i.d.) according to a normal distribution with a mean of zero and a constant variance ($\sigma^2$).**

When we make this assumption, the probability of observing a particular data point $y_i$ for a given $x_i$ is a normal distribution centered at the predicted value, $\beta_0 + \beta_1x_i$. The log-likelihood function for this model can be written as:

$$\ln L(\beta_0, \beta_1, \sigma^2|x,y) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2$$

NOTE: We will derive a similar formula below but in a bit more general form.

Notice the last term in the equation. We want to maximize the log-likelihood with respect to the $\beta$ coefficients. The first term does not depend on $\beta$, and the second term is a non-negative sum of squares multiplied by a negative factor. The log-likelihood is therefore largest when that sum is smallest. This is equivalent to minimizing the **sum of squared residuals**, which is exactly what OLS does.

**Why is this important?** This connection shows that OLS is not just an arbitrary method for fitting a line; it has a probabilistic foundation under the assumption of normally distributed errors. For other types of regression, such as logistic regression, there is no OLS equivalent. Instead, the parameters are always estimated using Maximum Likelihood, where the log-likelihood function is based on the appropriate distribution for the outcome variable (e.g., a Bernoulli distribution for binary outcomes).
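The equivalence between OLS and MLE under normal errors can be illustrated numerically. The sketch below assumes synthetic straight-line data, a variance treated as known, and hypothetical helper names; it fits the line with the closed-form OLS solution and checks that shifting either coefficient lowers the Gaussian log-likelihood.

```python
import math
import random


def ols_line(x, y):
    """Closed-form OLS estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return mean_y - b1 * mean_x, b1


def gaussian_log_likelihood(x, y, b0, b1, sigma2):
    """Log-likelihood of a straight line with i.i.d. N(0, sigma2) errors."""
    n = len(x)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


random.seed(1)  # arbitrary seed, only for reproducibility
x = [0.5 * i for i in range(20)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]

b0_hat, b1_hat = ols_line(x, y)
sigma2 = 0.25  # treated as known in this illustration
ll_at_ols = gaussian_log_likelihood(x, y, b0_hat, b1_hat, sigma2)

# Moving either coefficient away from the OLS solution lowers the log-likelihood.
ll_shifted = [
    gaussian_log_likelihood(x, y, b0_hat + db0, b1_hat + db1, sigma2)
    for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
]
```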

## Derivation of the Log-Likelihood for Least Squares

NOTE: $\ln(L)$ denotes the log-likelihood function. $\ln(\hat{L})$ denotes its value after the model has been fitted to the data, that is, the maximized log-likelihood.

Here is the derivation (note that we derive all the formulas in this section under the assumption that we know the exact values of $\sigma_i$ or $\sigma$):

1.  **Start with the probability of one data point.** We assume that the probability of observing a single data point $(x_i, y_i)$ given the model's prediction $f(x_i, \beta)$ is described by a normal distribution with a mean of $f(x_i, \beta)$ and a standard deviation of $\sigma_i$. This is equivalent to the assumption that the error terms $\epsilon_i$ are independently distributed according to a normal distribution with a mean of zero and a variance $\sigma_i^2$, so $\epsilon_i \sim N(0, \sigma_i^2)$ (remember that $y_i = f(x_i, \beta) + \epsilon_i$). The probability density function (PDF) for a single point is:

    $$p(y_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2}}$$

2.  **Form the likelihood function.** Assuming each data point is an independent measurement, the total likelihood ($L$) of observing all $n$ data points is the product of their individual probabilities:

    $$L = \prod_{i=1}^{n} p(y_i) = \prod_{i=1}^{n} \left( \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2}} \right)$$

3.  **Take the natural logarithm.** To simplify the product and make the calculations more manageable, we take the natural logarithm of the likelihood function. A product becomes a sum, and the exponential term simplifies.

    $$\ln(L) = \ln \left( \prod_{i=1}^{n} p(y_i) \right) = \sum_{i=1}^{n} \ln(p(y_i))$$   
    
    $$\ln(L) = \sum_{i=1}^{n} \ln \left( \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2}} \right)$$

4.  **Simplify the expression.** Using logarithm rules ($\ln(a \cdot b) = \ln(a) + \ln(b)$ and $\ln(e^x) = x$), we can expand the sum:

    $$\ln(L) = \sum_{i=1}^{n} \left( \ln\left(\frac{1}{\sqrt{2\pi\sigma_i^2}}\right) - \frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2} \right)$$   
    
    $$\ln(L) = - \sum_{i=1}^{n} \ln(\sqrt{2\pi\sigma_i^2}) - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - f(x_i, \beta))^2}{\sigma_i^2}$$

5. The second term in this equation is directly related to our Chi-Squared ($\chi^2$) statistic. Since $\chi^2 = \sum_{i=1}^{n} \frac{(y_i - f(x_i, \beta))^2}{\sigma_i^2}$, we can rewrite the equation as:

    $$\ln(L) = - \frac{1}{2}\chi^2 - \sum_{i=1}^{n} \ln(\sqrt{2\pi\sigma_i^2})$$

6. When comparing different models on the same dataset, the second term, which depends only on the known measurement uncertainties ($\sigma_i$), is a constant and doesn't affect which model is selected. Therefore, for our purposes, we can write a simplified and more practical relationship (here and below, $\propto$ is used loosely to mean "equal up to an additive constant that is the same for every model"):

    $$\ln(L) \propto - \frac{1}{2}\chi^2$$

    This means that maximizing the log-likelihood is equivalent to **minimizing the chi-squared statistic**.

7.  If $\sigma_i = \sigma$ for all $i$ (assumption of homoscedasticity), we can simplify this expression even more:

    $$\ln(L) = - \sum_{i=1}^{n} \ln(\sqrt{2\pi\sigma^2}) - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - f(x_i, \beta))^2}{\sigma^2}$$

    $$\ln(L) = - n \ln(\sqrt{2\pi\sigma^2}) - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - f(x_i, \beta))^2}{\sigma^2}$$

    $$\ln(L) = - \frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - f(x_i, \beta))^2}{\sigma^2}$$

    $$\ln(L) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - f(x_i, \beta))^2}{\sigma^2}$$

    $$\ln(L) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2} \chi^2$$
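The result of steps 4 and 5 can be sketched in a few lines of Python. The function names and the four data points below are hypothetical; the uncertainties $\sigma_i$ are assumed to be known, as in the derivation.

```python
import math


def chi_squared(y, y_model, sigma):
    """Chi-squared: sum of squared residuals scaled by the known sigma_i."""
    return sum(((yi - fi) / si) ** 2 for yi, fi, si in zip(y, y_model, sigma))


def log_likelihood_known_sigma(y, y_model, sigma):
    """ln(L) = -chi^2 / 2 - sum_i ln(sqrt(2 pi sigma_i^2)), as in step 5."""
    normalization = sum(math.log(math.sqrt(2 * math.pi * si**2)) for si in sigma)
    return -0.5 * chi_squared(y, y_model, sigma) - normalization


# Hypothetical measurements, model predictions f(x_i, beta), and uncertainties.
y = [1.1, 1.9, 3.2, 3.9]
y_model = [1.0, 2.0, 3.0, 4.0]
sigma = [0.1, 0.1, 0.2, 0.2]

chi2 = chi_squared(y, y_model, sigma)  # 1 + 1 + 1 + 0.25, up to rounding
log_l = log_likelihood_known_sigma(y, y_model, sigma)
```

Because the normalization sum depends only on the $\sigma_i$, two models fitted to the same data differ in $\ln(L)$ only through their $\chi^2$ values.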

### How to Use This for AIC and BIC

Since the AIC and BIC formulas depend on the maximized log-likelihood through the term $-2\ln(\hat{L})$, we can use the results of our derivation. When a fitting procedure like `curve_fit` is given the known uncertainties $\sigma_i$, it finds the optimal parameters $\hat{\beta}$ by minimizing the $\chi^2$ statistic. Let's call this minimized, best-fit value $\chi^2_{min}$.

Because maximizing the log-likelihood is equivalent to minimizing the chi-squared statistic, the maximized log-likelihood, $\ln(\hat{L})$, is given by:

$$\ln(\hat{L}) \propto - \frac{1}{2}\chi^2_{min}$$

Now we can calculate the information criteria using this **minimized** $\chi^2$ value from the fit:

* **For AIC:**

    $$-2\ln(\hat{L}) = -2 \left( - \frac{1}{2}\chi^2_{min} + \text{constant} \right) = \chi^2_{min} + \text{constant}'$$

    Since the constants don't affect model ranking, we can simply use the proportional formula:

    $$AIC \propto \chi^2_{min} + 2k$$

    NOTE: This last formula is valid for model comparison only (not valid to calculate the exact value of AIC). Also, this last formula is valid only if $\sigma_i$ values are known.

* **For BIC:**

    $$BIC \propto \chi^2_{min} + k\ln(n)$$

    NOTE: This last formula is valid for model comparison only (not valid to calculate the exact value of BIC). Also, this last formula is valid only if $\sigma_i$ values are known.

Remember: $\chi^2_{min}$ is the final chi-squared value from the weighted least-squares fit, $k$ is the number of estimated parameters, and $n$ is the number of data points. By calculating these values for different models fitted to the same data, you can choose the model with the lowest AIC or BIC as the one that provides the best balance of fit and parsimony.

If the standard deviations $\sigma_i$ are the same for all the data points and are equal to $\sigma$ (homoscedasticity), then the maximized log-likelihood to be used in the AIC and BIC formulas is:

$$\ln(\hat{L}) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2} \chi^2_{min}$$
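To see that the dropped constants do not change the ranking, the sketch below compares the simplified AIC with the exact AIC for a common, known $\sigma$. All names and numbers are hypothetical; since the variance is known here, $k$ counts only the model coefficients.

```python
import math


def aic_from_chi2(chi2_min, k):
    """Simplified AIC (valid for ranking only): chi2_min + 2k."""
    return chi2_min + 2 * k


def bic_from_chi2(chi2_min, k, n):
    """Simplified BIC (valid for ranking only): chi2_min + k ln(n)."""
    return chi2_min + k * math.log(n)


def full_aic_known_sigma(chi2_min, k, n, sigma):
    """Exact AIC for a common, known sigma, using the last formula above."""
    log_l = (-0.5 * n * math.log(2 * math.pi)
             - 0.5 * n * math.log(sigma**2)
             - 0.5 * chi2_min)
    return -2.0 * log_l + 2 * k


# Hypothetical fits of two models to the same n = 50 points with sigma = 0.3.
n, sigma = 50, 0.3
model_a = {"chi2_min": 48.0, "k": 2}
model_b = {"chi2_min": 46.5, "k": 3}

simple_diff = aic_from_chi2(**model_a) - aic_from_chi2(**model_b)
full_diff = (full_aic_known_sigma(n=n, sigma=sigma, **model_a)
             - full_aic_known_sigma(n=n, sigma=sigma, **model_b))
```

Both differences come out as $-0.5$ (up to floating-point rounding), so either form selects model A, even though the absolute values differ by the constant $n\ln(2\pi) + n\ln(\sigma^2)$.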

## The Case of Unknown Variance: Ordinary Least Squares (OLS)

In the previous chapter, we operated under the assumption that the standard deviations of the errors, $\sigma_i$, were known. This is common in the physical sciences where measurement uncertainties can be well-defined. However, in many other fields, these uncertainties are not known beforehand.

The most common scenario is to assume that while the variance is unknown, it is constant for all data points. This is the assumption of **homoscedasticity** ($\sigma_i = \sigma$ for all *i*), which is fundamental to **Ordinary Least Squares (OLS)**. When $\sigma$ is unknown, it becomes another parameter that we must estimate from the data.

### Estimating the Unknown Variance $\sigma^2$

Our goal is to maximize the log-likelihood function with respect to all model parameters, which now include the $\beta$ parameters *and* the unknown variance $\sigma^2$.

We start with the log-likelihood function for the homoscedastic case, as derived in Step 7 of the previous chapter:

$$\ln(L(\beta, \sigma^2)) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i, \beta))^2$$

This expression is often written in terms of the **Residual Sum of Squares (RSS)**, where $RSS = \sum_{i=1}^{n} (y_i - f(x_i, \beta))^2$.

$$\ln(L(\beta, \sigma^2)) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{RSS}{2\sigma^2}$$

To find the values of $\beta$ and $\sigma^2$ that maximize this function, we can use a two-step process:

1.  For any given value of $\sigma^2$, maximizing the log-likelihood is equivalent to minimizing the `RSS` term. This is exactly what the standard least-squares fitting procedure does to find the optimal parameters $\beta$.
2.  Once the optimal $\beta$ parameters are found and the `RSS` is minimized, we can find the **Maximum Likelihood Estimate (MLE)** for the variance, denoted as $\hat{\sigma}^2$, by taking the partial derivative of the log-likelihood function with respect to $\sigma^2$ and setting it to zero.

Let's perform the second step. Treating `RSS` as a constant (since $\beta$ is now fixed), we differentiate with respect to $\sigma^2$:

$$\frac{\partial \ln(L)}{\partial \sigma^2} = - \frac{n}{2\sigma^2} + \frac{RSS}{2(\sigma^2)^2}$$

Setting the derivative to zero to find the maximum:

$$\frac{n}{2\hat{\sigma}^2} = \frac{RSS}{2(\hat{\sigma}^2)^2}$$

Solving for $\hat{\sigma}^2$ gives us the MLE for the variance:

$$\hat{\sigma}^2 = \frac{RSS}{n}$$

> This is a fundamentally important result: the maximum likelihood estimate for the unknown variance is the residual sum of squares divided by the number of data points.

There are several important notes regarding this result:

1. The Maximum Likelihood Estimate (MLE) for the variance, $\hat{\sigma}^2 = RSS/n$, is a **biased estimator**.
2. The unbiased estimator, $\hat{\sigma}^2 = RSS/(n-p)$, is what is typically used for inference and is reported in most statistical software as the "Mean Squared Error" (MSE). Here, $p$ is the number of model parameters.
3. **For calculating the log-likelihood and the information criteria (AIC, BIC) derived from it, we MUST use the biased Maximum Likelihood Estimate ($RSS/n$).**

Let's review the last statement in more detail.

Our primary goal was to find the parameter values that maximize the log-likelihood function. We found that the value of $\sigma^2$ that maximizes this function (after `RSS` has been minimized by finding the best $\hat{\beta}$ parameters) is:

$$\hat{\sigma}^2_{MLE} = \frac{RSS}{n} = \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i, \hat{\beta}))^2$$

This is the **Maximum Likelihood Estimate (MLE)** for the variance. While it is the correct value for maximizing the likelihood, it is a **biased estimator** of the true population variance $\sigma^2$. On average, it will slightly underestimate the true variance.

In applied statistics, the goal is often to get the most accurate possible estimate of the true population variance. For this, we use the **unbiased estimator**, which corrects for the bias mentioned above by adjusting the denominator. If we have estimated $p$ model parameters (i.e., the number of coefficients in $\beta$), the unbiased estimate is:

$$\hat{\sigma}^2_{unbiased} = \frac{RSS}{n-p}$$

This is the **Mean Squared Error (MSE)**, and its square root is the **Residual Standard Error (RSE)** reported in the output of virtually all OLS regression software. The denominator, $n-p$, represents the residual **degrees of freedom**. By dividing by a smaller number, we increase the value of the estimate, correcting for the downward bias of the MLE.

**Which estimator should we use for the log-likelihood?**

The answer lies in the fundamental definition of AIC and BIC: they are based on the **maximized value of the log-likelihood function**.

The derivation of AIC and BIC starts with $\ln(\hat{L})$, which is the value of $\ln(L)$ when *all* parameters have been set to their Maximum Likelihood Estimates. Therefore, to be mathematically consistent, we **must** use the MLE for the variance, $\hat{\sigma}^2_{MLE} = RSS/n$, when calculating the log-likelihood for AIC and BIC.

Plugging the unbiased estimate $RSS/(n-p)$ into the log-likelihood formula would result in a value that is *not* the true maximum likelihood, and the theoretical foundation of AIC and BIC would no longer hold.

**In summary:**
*   For **calculating AIC and BIC**, use the **biased** MLE variance: $\hat{\sigma}^2 = RSS/n$.
*   For **reporting the model's error variance or standard error for inference**, use the **unbiased** variance: $\hat{\sigma}^2 = RSS/(n-p)$.
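The difference between the two estimators can be checked with a few lines of Python. The values of $RSS$, $n$, and $p$ below are hypothetical, and the helper name is an assumption; the point is only that the log-likelihood is higher at $RSS/n$ than at $RSS/(n-p)$.

```python
import math


def log_likelihood_given_variance(sigma2, rss, n):
    """Homoscedastic Gaussian log-likelihood for a fixed RSS and variance."""
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - rss / (2 * sigma2))


# Hypothetical fit: n = 30 points, p = 2 coefficients, RSS = 12.0.
rss, n, p = 12.0, 30, 2

sigma2_mle = rss / n             # biased MLE, used for AIC and BIC
sigma2_unbiased = rss / (n - p)  # MSE, used for inference

ll_mle = log_likelihood_given_variance(sigma2_mle, rss, n)
ll_unbiased = log_likelihood_given_variance(sigma2_unbiased, rss, n)
```

With these numbers the MLE is $0.4$ and the unbiased estimate is about $0.429$; the log-likelihood evaluated at the MLE is the larger of the two, which is why only the MLE gives the maximized value $\ln(\hat{L})$.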

### The Maximized Log-Likelihood

Now we can substitute this estimate $\hat{\sigma}^2$ back into the log-likelihood equation to get the *maximized* log-likelihood, $\ln(\hat{L})$. This value represents the highest possible likelihood given the data and the model form.

$$\ln(\hat{L}) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln\left(\frac{RSS}{n}\right) - \frac{RSS}{2\left(\frac{RSS}{n}\right)}$$

Simplifying the last term:

$$\ln(\hat{L}) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln\left(\frac{RSS}{n}\right) - \frac{n}{2}$$

Using the logarithm rule $\ln(a/b) = \ln(a) - \ln(b)$:

$$\ln(\hat{L}) = - \frac{n}{2} \left( \ln(2\pi) + \ln(RSS) - \ln(n) + 1 \right)$$

This is the final expression for the maximized log-likelihood when $\sigma^2$ is unknown.

### Calculating AIC and BIC with Unknown Variance

With the maximized log-likelihood, $\ln(\hat{L})$, we can now calculate the information criteria. A crucial point is that we have estimated an additional parameter: the variance $\sigma^2$. Therefore, the total number of estimated parameters, $k$, becomes:

* **Number of Parameters $k$:** If your model has $p$ parameters for the function $f(x,\beta)$, then the total number of estimated parameters is $k = p + 1$.
* **Example:** For instance, in a simple linear regression model $y = \beta_0 + \beta_1 x$, there are two $\beta$ parameters ($p=2$). When we also estimate the variance, the total number of parameters for the AIC/BIC calculation becomes $k = p + 1 = 3$.

The general formulas for AIC and BIC are:

$$AIC = -2\ln(\hat{L}) + 2k$$
$$BIC = -2\ln(\hat{L}) + k\ln(n)$$

We will use our expression for the maximized log-likelihood:

$$\ln(\hat{L}) = - \frac{n}{2} \left( \ln(2\pi) + \ln(RSS) - \ln(n) + 1 \right)$$

**For AIC:**

Substituting the expression for $\ln(\hat{L})$ into the AIC formula gives:

$$AIC = -2 \left[ - \frac{n}{2} \left( \ln(2\pi) + \ln(RSS) - \ln(n) + 1 \right) \right] + 2k$$
$$AIC = n \left( \ln(2\pi) + \ln(RSS) - \ln(n) + 1 \right) + 2k$$

For the purpose of comparing different models on the same dataset, any terms that do not depend on the model can be dropped. The terms $n\ln(2\pi)$, $-n\ln(n)$, and $n$ depend only on the sample size, so they are the same for every model fitted to the same data. This leaves us with a simplified, proportional formula:

$$AIC \propto n \ln(RSS) + 2k$$

Here, $k = p + 1$ is the total number of estimated parameters (including the variance).

**For BIC:**

We follow the exact same logic for BIC. Substituting the expression for $\ln(\hat{L})$ into the BIC formula gives:

$$BIC = -2 \left[ - \frac{n}{2} \left( \ln(2\pi) + \ln(RSS) - \ln(n) + 1 \right) \right] + k\ln(n)$$
$$BIC = n \left( \ln(2\pi) + \ln(RSS) - \ln(n) + 1 \right) + k\ln(n)$$

Again, for model comparison, we can drop the same terms that are constant for a given dataset, resulting in the simplified, proportional formula for BIC:

$$BIC \propto n \ln(RSS) + k\ln(n)$$

Here, $k = p + 1$ is the total number of estimated parameters (including the variance).

It is important to remember that these simplified formulas produce values that are not the true AIC or BIC, but rather values that maintain the same differences between models as the full formulas would. Since model selection depends only on these differences (e.g., finding the model with the minimum AIC), these simplified versions are sufficient and computationally convenient for that purpose. However, if reporting the absolute AIC or BIC value is required, the full formula should be used.
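As a worked sketch of the whole procedure, the code below compares a constant model with a straight-line model on synthetic data, using both the full and the simplified formulas. The data, the random seed, and all function names are assumptions made for illustration.

```python
import math
import random


def max_log_likelihood(rss, n):
    """Maximized log-likelihood with sigma^2 replaced by its MLE, RSS / n."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss) - math.log(n) + 1)


def information_criteria(rss, n, p):
    """Full and simplified AIC/BIC for a least-squares fit, with k = p + 1."""
    k = p + 1  # p coefficients plus the estimated variance
    log_l = max_log_likelihood(rss, n)
    return {
        "aic": -2 * log_l + 2 * k,
        "bic": -2 * log_l + k * math.log(n),
        "aic_simple": n * math.log(rss) + 2 * k,
        "bic_simple": n * math.log(rss) + k * math.log(n),
    }


random.seed(2)  # arbitrary seed, only for reproducibility
x = [0.5 * i for i in range(30)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Model 1: constant y = b0 (p = 1); the least-squares estimate is the mean.
rss_const = sum((yi - mean_y) ** 2 for yi in y)

# Model 2: straight line y = b0 + b1 * x (p = 2), closed-form OLS.
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x
rss_line = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

const_ic = information_criteria(rss_const, n, p=1)
line_ic = information_criteria(rss_line, n, p=2)
```

The straight-line model has the lower AIC and BIC, as expected for data generated from a line, and the difference between the two models is the same whether the full or the simplified formula is used.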

## Modification for Small Sample Size: The Corrected AIC (AICc)

The Akaike Information Criterion (AIC) is derived in the limit of a large sample size $n$. When the sample size is not large relative to the number of estimated parameters $k$, AIC can perform poorly. Specifically, it has a tendency to favor models with too many parameters, a phenomenon known as overfitting.

To address this, the **Corrected Akaike Information Criterion (AICc)** was developed. AICc includes a second-order correction term that increases the penalty for extra parameters, with the size of this correction being larger for smaller sample sizes.

The AICc is defined as the AIC with an added penalty term:

$$AICc = AIC + \frac{2k(k+1)}{n - k - 1}$$

Let's derive the more common form of this equation. We start with the full definition of AIC:

$$AIC = -2\ln(\hat{L}) + 2k$$

Now, we add the correction term:

$$AICc = \left( -2\ln(\hat{L}) + 2k \right) + \frac{2k(k+1)}{n - k - 1}$$

We can combine the two terms that include $k$ by factoring out $2k$:

$$AICc = -2\ln(\hat{L}) + 2k \left( 1 + \frac{k+1}{n - k - 1} \right)$$

To simplify the expression in the parenthesis, we find a common denominator:

$$1 + \frac{k+1}{n - k - 1} = \frac{(n - k - 1) + (k+1)}{n - k - 1} = \frac{n}{n - k - 1}$$

By substituting this simplified fraction back into the equation, we arrive at the most common and elegant formula for AICc:

$$AICc = -2\ln(\hat{L}) + 2k \left( \frac{n}{n - k - 1} \right)$$

This form clearly shows that the standard AIC penalty $2k$ is being multiplied by a correction factor $n / (n - k - 1)$.

Let's examine the correction factor $n / (n - k - 1)$:

*   **When $n$ is large:** As $n$ becomes much larger than $k$, the fraction $n / (n - k - 1)$ approaches $n/n$, which is 1. In this case, AICc converges to AIC. This is exactly what we want; for large samples, the correction is negligible.
*   **When $n$ is small:** When $n$ is close to $k$, the denominator $n - k - 1$ becomes very small, making the correction factor large. This imposes a much heavier penalty on more complex models (those with a larger $k$), helping to prevent overfitting.

Given its properties, it is often recommended to **use AICc by default**, rather than AIC, especially if you are not in a "big data" context. A common rule of thumb is to use AICc when the ratio $n/k$ is less than 40.

*   **Number of Parameters $k$:** As before, $k$ must include all estimated parameters. In the OLS case with $p$ model coefficients, $k = p + 1$ to account for the estimated variance.

We can write the proportional formula for AICc in the OLS case (unknown variance) by starting with the proportional formula for AIC and adding the correction term:

$$AICc \propto n \ln(RSS) + 2k + \frac{2k(k+1)}{n - k - 1}$$

Or, using the multiplicative correction factor:

$$AICc \propto n \ln(RSS) + 2k \left( \frac{n}{n - k - 1} \right)$$

NOTE: There is an obvious restriction: $n$ must be greater than $k + 1$. Otherwise the denominator $n - k - 1$ is zero or negative and AICc cannot be calculated.
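Both forms of the correction, together with the restriction on $n$, can be sketched as follows. The function names and the example values are hypothetical.

```python
def aicc(aic, k, n):
    """Corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    if n <= k + 1:
        raise ValueError("AICc requires n > k + 1")
    return aic + 2.0 * k * (k + 1) / (n - k - 1)


def aicc_from_log_likelihood(log_likelihood, k, n):
    """Equivalent form: -2 ln(L_hat) + 2k * n / (n - k - 1)."""
    if n <= k + 1:
        raise ValueError("AICc requires n > k + 1")
    return -2.0 * log_likelihood + 2.0 * k * n / (n - k - 1)


# Hypothetical fit: ln(L_hat) = -47.0 and k = 3, so AIC = 94 + 6 = 100.
small_sample = aicc(aic=100.0, k=3, n=20)         # 100 + 24/16 = 101.5
large_sample = aicc(aic=100.0, k=3, n=1_000_000)  # correction is about 2.4e-5
```

With $n = 20$ the correction adds 1.5 to the AIC, while with a million data points it is negligible, in line with the limiting behavior described above.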

### Is There a Small-Sample Correction for BIC?

While AIC has a widely accepted and commonly used correction, **there is no standard, universally adopted small-sample correction for BIC.**

The reasons for this are rooted in their different theoretical origins:
1.  **Different Foundations:** AIC is derived from principles of information theory and aims to find the model that best approximates the true data-generating process with minimal information loss. Its derivation allows for a more straightforward second-order correction.
2.  **Asymptotic Nature of BIC:** BIC is derived from a Bayesian probability framework. It is an asymptotic approximation to the full Bayesian model evidence, which is used to calculate the Bayes factor. The $k \ln(n)$ penalty term arises naturally from this approximation.
3.  **Stronger Inherent Penalty:** The penalty term in BIC, $k \ln(n)$, is already stronger than AIC's $2k$ penalty for any $n \ge 8$ (where $\ln(n) > 2$), and the gap widens as $n$ grows. This strong penalty, which scales with sample size, already makes BIC less prone to the kind of overfitting in small samples that necessitated the creation of AICc.

## Additional Materials

* https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
* https://www.quantstart.com/articles/Maximum-Likelihood-Estimation-for-Linear-Regression/
* https://en.wikipedia.org/wiki/Akaike_information_criterion