# Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are both statistical tools used for **model selection**. They help you choose the best model from a set of candidate models by balancing two competing goals: **goodness of fit** and **model complexity**.

Both criteria are particularly useful when comparing models that may have different numbers of parameters, as they penalize models for being more complex to prevent **overfitting**.

## Akaike Information Criterion (AIC)

AIC is an estimator of prediction error and is rooted in information theory. It estimates the relative amount of information a model loses when representing the process that generated the data. The model with the lowest AIC value is considered the best among the candidate models. 

The formula is:

$$AIC = -2 \ln(L) + 2k$$

* **$-2 \ln(L)$**: This is the **goodness-of-fit term**. It's derived from the maximum likelihood ($L$) of the model, which measures how well the model fits the data. A higher likelihood (and thus a smaller negative log-likelihood) indicates a better fit.
* **$2k$**: This is the **penalty term**. It's a penalty for model complexity, where $k$ is the number of parameters in the model. A more complex model (one with more parameters) gets a higher penalty.

The AIC's goal is to find the model that best approximates the unknown data-generating process, even if that process isn't one of the candidate models. It's focused on **predictive accuracy**.
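As a minimal sketch of the formula above, the helper below computes AIC from a maximized log-likelihood and a parameter count. The function name `aic` and the two log-likelihood values are hypothetical, chosen only to illustrate the arithmetic.

```python
def aic(log_likelihood: float, k: int) -> float:
    """Return AIC = -2 ln(L) + 2k for a maximized log-likelihood ln(L)."""
    return -2.0 * log_likelihood + 2 * k


# Hypothetical maximized log-likelihoods for two candidate models
aic_simple = aic(log_likelihood=-120.0, k=2)   # 240 + 4  = 244.0
aic_complex = aic(log_likelihood=-118.5, k=5)  # 237 + 10 = 247.0

# The lower AIC is preferred among the candidates
best = "simple" if aic_simple < aic_complex else "complex"
print(aic_simple, aic_complex, best)
```

In this made-up case the complex model fits slightly better (higher log-likelihood), but not by enough to pay for its three extra parameters.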

## Bayesian Information Criterion (BIC)

BIC, also known as the Schwarz Information Criterion (SIC), is a criterion for model selection derived from a Bayesian perspective. Like AIC, it balances goodness of fit and complexity, but its complexity penalty grows with the sample size. The model with the lowest BIC value is the one preferred.

The formula is:

$$BIC = -2 \ln(L) + k \ln(n)$$

* **$-2 \ln(L)$**: This is the same goodness-of-fit term as in AIC.
* **$k \ln(n)$**: This is the **penalty term**. Each parameter costs $\ln(n)$, where $n$ is the number of data points, instead of AIC's fixed cost of 2. Since $\ln(n) > 2$ once $n \geq 8$, the penalty is stronger than AIC's for all but the smallest datasets.

Because of the $\ln(n)$ factor, BIC applies a much heavier penalty for additional parameters, especially as the sample size grows. This means BIC tends to favor **simpler models** more strongly than AIC. It assumes that one of the candidate models is the "true" model, and its goal is to find that true model.
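The sketch below computes BIC and compares the per-parameter penalty of the two criteria at a few sample sizes. The function names and the sample sizes are illustrative assumptions, not part of any library.

```python
import math


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Return BIC = -2 ln(L) + k ln(n) for a maximized log-likelihood ln(L)."""
    return -2.0 * log_likelihood + k * math.log(n)


def bic_penalty_is_heavier(n: int) -> bool:
    """True when BIC's per-parameter cost ln(n) exceeds AIC's fixed cost of 2."""
    return math.log(n) > 2.0


# Per-parameter penalty at a few illustrative sample sizes
for n in (5, 7, 8, 100, 10_000):
    print(n, round(math.log(n), 3), bic_penalty_is_heavier(n))

bic_example = bic(log_likelihood=-120.0, k=2, n=100)
```

The crossover sits between $n = 7$ and $n = 8$, because $e^2 \approx 7.39$.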

### Key Differences and When to Use Which

| Feature | Akaike Information Criterion (AIC) | Bayesian Information Criterion (BIC) |
| :--- | :--- | :--- |
| **Penalty for Complexity** | Penalizes with $2k$. | Penalizes with $k \ln(n)$. |
| **Focus** | Finds the best **approximating** model for predictive accuracy. | Finds the "true" model. |
| **Sample Size** | The penalty is constant with respect to sample size. | The penalty increases with sample size, favoring simpler models. |
| **Behavior** | Tends to select more complex models than BIC. | Tends to select simpler models than AIC. |

NOTE: The terms "approximating model" and "true model" refer to their underlying assumptions about the reality you are trying to model.

**AIC assumes that the "true" data-generating process is infinitely complex and therefore unknowable.** We can only ever hope to approximate it with our models.

* **Goal**: AIC's goal is to find the model that provides the best trade-off between bias (how far your model is from the true process) and variance (how much your model would change with different data). In simpler terms, it's about building the model that will give you the most accurate predictions on a new, unseen dataset.
* **Best for**: This makes AIC ideal for tasks where **prediction** is the primary goal, such as forecasting future stock prices, predicting customer behavior, or machine learning applications. You're not trying to discover the fundamental laws of the universe, you're just trying to make the most useful prediction you can.
* **Mathematical Justification**: AIC's penalty term ($2k$) is derived from the Kullback-Leibler (KL) divergence, which is a measure of the information lost when approximating reality with a given model. By minimizing AIC, you are minimizing this information loss.

**BIC assumes that there is a "true" model that generated the data, and this model exists within your set of candidate models.** This is often a good assumption for scientific fields where we believe there are underlying, fixed physical laws governing a system.

* **Goal**: BIC's goal is to select the model that is most likely to be this "true" model. Because of its heavier penalty for more parameters ($k\ln(n)$), BIC is more likely to choose simpler, more parsimonious models. It essentially bets that the simplest model that explains the data well is the correct one.
* **Best for**: This makes BIC more suitable for tasks of **explanation and discovery**, where you want to find the most fundamental, elegant, and concise model that describes a phenomenon. This is common in fields like physics, biology, and some social sciences.
* **Mathematical Justification**: BIC is derived from Bayesian inference. Under a large-sample approximation, $-\text{BIC}/2$ approximates the logarithm of a model's marginal likelihood. If all candidate models are given equal prior probability, minimizing BIC therefore approximately maximizes the posterior probability of the model given the data.

In essence:

* **AIC** asks: "Which model will give me the most accurate predictions for the future?"
* **BIC** asks: "Which model is most likely to be the true explanation for the data I have?"

This is why AIC often chooses a slightly more complex model than BIC. AIC is willing to accept a little extra complexity if it improves predictive accuracy, while BIC is more conservative, preferring a simpler model unless the evidence for a more complex one is overwhelming.

In practice, if your goal is to find the model with the highest predictive power, AIC is often preferred. If your goal is to select the most efficient or fundamental model that is likely to be the "true" one, BIC might be a better choice.

When you compare two models, you compute the differences in their scores, $\Delta \text{AIC}$ and $\Delta \text{BIC}$, usually measured relative to the best-scoring model. Here are some general guidelines for interpreting these differences, based on conventions in the scientific community:

* **AIC Difference ($\Delta \text{AIC}$)**: A difference of more than 2 between two models is often treated as meaningful (this is a rule of thumb, not a formal significance test). If the difference is between 0 and 2, the models are considered to have a similar level of support from the data. A model with a $\Delta$AIC of 10 or more is considered to have very little support compared to the best model.
* **BIC Difference ($\Delta \text{BIC}$)**: BIC differences are usually read on a graded scale. A difference of 0-2 is considered weak evidence for the model with the lower BIC, a difference of 2-6 is considered positive evidence, a difference of 6-10 is strong evidence, and a difference of more than 10 is considered very strong evidence.
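To make the guidelines concrete, the following sketch converts a set of scores into differences relative to the best model. The model names and AIC values are invented for illustration, and `deltas` is a hypothetical helper.

```python
def deltas(scores: dict) -> dict:
    """Return each model's score minus the best (lowest) score."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}


# Hypothetical AIC values for three candidate models
aic_scores = {"linear": 244.0, "quadratic": 245.1, "cubic": 256.3}
delta_aic = deltas(aic_scores)

for name, d in delta_aic.items():
    print(f"{name}: delta AIC = {d:.1f}")
```

Read against the rules of thumb above, the quadratic model (difference of about 1.1) has support similar to the linear one, while the cubic model (difference of about 12.3) has very little.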

## Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a powerful and flexible method for estimating the parameters of a statistical model. It's used to find the parameter values that make the observed data most probable. The **likelihood function** quantifies how probable the observed data is for a given set of parameters. The **log-likelihood function** is simply the natural logarithm of the likelihood function.

The core idea of MLE is to find the parameter values that maximize the likelihood function, $L(\theta|x)$, where $\theta$ represents the parameters and $x$ is the data. The likelihood function is often a product of probability density functions (or probability mass functions) for each data point. For a set of independent and identically distributed (i.i.d.) observations, the likelihood is:

$$L(\theta|x) = \prod_{i=1}^{n} f(x_i|\theta)$$

Working with this product can be difficult, especially with many data points, as multiplying many small numbers can lead to computational underflow. This is where the **log-likelihood** comes in. By taking the natural logarithm, the product is transformed into a sum, which is much more computationally stable and easier to differentiate.

$$\ln L(\theta|x) = \ln \left( \prod_{i=1}^{n} f(x_i|\theta) \right) = \sum_{i=1}^{n} \ln f(x_i|\theta)$$

Since the logarithm is a monotonic function, maximizing the log-likelihood function yields the **exact same parameter estimates** as maximizing the likelihood function.
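A small numerical sketch of both points, assuming synthetic normally distributed data (the sample size, seed, and true parameters are arbitrary choices): the raw product of densities underflows to zero, while the sum of log-densities stays finite and is largest at the maximum likelihood estimates.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=2000)  # synthetic i.i.d. sample

# Direct product of 2000 densities, each below 0.2, underflows to exactly 0.0
likelihood = np.prod(norm.pdf(x, loc=3.0, scale=2.0))

# The log-likelihood is a sum of moderate numbers and stays finite
ll_true = np.sum(norm.logpdf(x, loc=3.0, scale=2.0))

# For a normal model the MLEs are the sample mean and the
# (biased, ddof=0) sample standard deviation
mu_hat, sigma_hat = x.mean(), x.std()
ll_mle = np.sum(norm.logpdf(x, loc=mu_hat, scale=sigma_hat))

print(likelihood, ll_true, ll_mle)
```

The log-likelihood at the estimates is at least as large as at the parameters that generated the data, which is what "maximum likelihood" means for this particular sample.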

The connection between Maximum Likelihood and regression modeling becomes clear when we make assumptions about the distribution of the errors. While Ordinary Least Squares (OLS) regression minimizes the sum of squared residuals, MLE provides a more general framework that can lead to the same result under specific conditions.

**For OLS:**
The OLS method finds the coefficients that minimize the **sum of squared residuals** (or errors). That is, it minimizes the following:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2$$

**Relation to MLE:**
The OLS solution is the Maximum Likelihood Estimate for a linear regression model **if we assume that the error terms ($\epsilon_i$) are independently and identically distributed (i.i.d.) according to a normal distribution with a mean of zero and a constant variance ($\sigma^2$).**

When we make this assumption, the probability of observing a particular data point $y_i$ for a given $x_i$ is a normal distribution centered at the predicted value, $\beta_0 + \beta_1x_i$. The log-likelihood function for this model can be written as:

$$\ln L(\beta_0, \beta_1, \sigma^2|x,y) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2$$

NOTE: We will derive a similar formula below but in a bit more general form.

Notice the last term in the equation. The first term does not depend on the $\beta$ coefficients, and the second term is a negative multiple of the sum of squared residuals. For any fixed $\sigma^2$, the log-likelihood is therefore largest when that sum is smallest. Maximizing the likelihood over $\beta_0$ and $\beta_1$ is thus equivalent to minimizing the **sum of squared residuals**, which is exactly what OLS does.

**Why is this important?** This connection shows that OLS is not just an arbitrary method for fitting a line; it has a probabilistic foundation under the assumption of normally distributed errors. For other types of regression, such as logistic regression, minimizing squared residuals no longer corresponds to maximizing the likelihood. Instead, the parameters are typically estimated using Maximum Likelihood directly, where the log-likelihood function is based on the appropriate distribution for the outcome variable (e.g., a Bernoulli distribution for binary outcomes).
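The equivalence can be checked numerically. The sketch below, which assumes synthetic data with made-up true coefficients, fits a line once with ordinary least squares (`np.polyfit`) and once by minimizing the negative normal log-likelihood with `scipy.optimize.minimize`. Parameterizing the noise through $\ln \sigma$ is a convenience chosen here to keep $\sigma$ positive.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=x.size)  # synthetic data

# OLS fit: polyfit returns the highest-degree coefficient first
b1_ols, b0_ols = np.polyfit(x, y, deg=1)


def neg_log_likelihood(params):
    """Negative normal log-likelihood of the line b0 + b1*x with noise sigma."""
    b0, b1, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - (b0 + b1 * x)
    return 0.5 * x.size * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid**2) / sigma2


# Start from a rough guess: flat line at the mean of y, sigma = spread of y
x0 = [y.mean(), 0.0, np.log(y.std())]
result = minimize(neg_log_likelihood, x0=x0, method="BFGS")
b0_mle, b1_mle, log_sigma_mle = result.x

ssr = np.sum((y - (b0_ols + b1_ols * x)) ** 2)
print(b0_ols, b1_ols, b0_mle, b1_mle)
print(np.exp(2.0 * log_sigma_mle), ssr / x.size)
```

The two sets of coefficients agree up to optimizer tolerance. The maximum likelihood estimate of the variance comes out as $SSR/n$, without any degrees-of-freedom correction.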

## Derivation of the Log-Likelihood for Least Squares

There is a very useful formula for the log-likelihood ($\ln(L)$) that we can use for least squares fits. It's derived by making one key assumption: that the errors (estimated by the residuals) follow a normal distribution. This is the same assumption under which the least squares solution coincides with the maximum likelihood estimate.

Here is the derivation:

1.  **Start with the probability of one data point.** The probability of observing a single data point $(x_i, y_i)$ given the model's prediction $f(x_i, \beta)$ is described by a normal distribution with a mean of $f(x_i, \beta)$ and a standard deviation of $\sigma_i$. The probability density function (PDF) for a single point is:

    $$p(y_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2}}$$

2.  **Form the likelihood function.** Assuming each data point is an independent measurement, the total likelihood ($L$) of observing all $N$ data points is the product of their individual probabilities:

    $$L = \prod_{i=1}^{N} p(y_i) = \prod_{i=1}^{N} \left( \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2}} \right)$$

3.  **Take the natural logarithm.** To simplify the product and make the calculations more manageable, we take the natural logarithm of the likelihood function. A product becomes a sum, and the exponential term simplifies.

    $$\ln(L) = \ln \left( \prod_{i=1}^{N} p(y_i) \right) = \sum_{i=1}^{N} \ln(p(y_i))$$   
    
    $$\ln(L) = \sum_{i=1}^{N} \ln \left( \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2}} \right)$$

4.  **Simplify the expression.** Using logarithm rules ($\ln(a \cdot b) = \ln(a) + \ln(b)$ and $\ln(e^x) = x$), we can expand the sum:

    $$\ln(L) = \sum_{i=1}^{N} \left( \ln\left(\frac{1}{\sqrt{2\pi\sigma_i^2}}\right) - \frac{(y_i - f(x_i, \beta))^2}{2\sigma_i^2} \right)$$   
    
    $$\ln(L) = - \sum_{i=1}^{N} \ln(\sqrt{2\pi\sigma_i^2}) - \frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - f(x_i, \beta))^2}{\sigma_i^2}$$




### The Final Formula

The second term in this equation is directly related to our Chi-Squared ($\chi^2$) statistic. Since $\chi^2 = \sum_{i=1}^{N} \frac{(y_i - f(x_i, \beta))^2}{\sigma_i^2}$, we can rewrite the equation as:

$$\ln(L) = - \frac{1}{2}\chi^2 - \sum_{i=1}^{N} \ln(\sqrt{2\pi\sigma_i^2})$$

When comparing different models on the same dataset, the second term depends only on the measurement uncertainties ($\sigma_i$), which are fixed properties of the data. It is therefore the same constant for every model and doesn't affect which model is selected. For our purposes, we can write a simplified and more practical relationship:

$$\ln(L) = - \frac{1}{2}\chi^2 + \text{constant}$$

This means that maximizing the log-likelihood is equivalent to **minimizing the chi-squared statistic**.
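The relationship can be verified numerically. In this sketch the data, the per-point uncertainties, and the two candidate curves are all synthetic assumptions. The log-likelihood computed directly from `scipy.stats.norm.logpdf` matches $-\frac{1}{2}\chi^2$ plus the model-independent constant, so the difference between two models depends on $\chi^2$ alone.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 30)
sigma = rng.uniform(0.2, 0.6, size=x.size)      # known per-point uncertainties
y = 2.0 * x + 1.0 + rng.normal(0, sigma)        # synthetic measurements


def log_likelihood(y_model):
    """Sum of normal log-densities of the data around the model values."""
    return np.sum(norm.logpdf(y, loc=y_model, scale=sigma))


def chi2(y_model):
    """Chi-squared statistic: squared residuals weighted by 1/sigma_i^2."""
    return np.sum(((y - y_model) / sigma) ** 2)


# The model-independent constant from the formula above
const = -np.sum(np.log(np.sqrt(2 * np.pi * sigma**2)))

model_a = 2.0 * x + 1.0   # two arbitrary candidate curves
model_b = 1.8 * x + 1.5

ll_a, ll_b = log_likelihood(model_a), log_likelihood(model_b)
print(ll_a, -0.5 * chi2(model_a) + const)
print(ll_a - ll_b, -0.5 * (chi2(model_a) - chi2(model_b)))
```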

### How to Use This for AIC and BIC

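Since the AIC and BIC formulas depend on $-2\ln(L)$, we can substitute our derived expression. When SciPy's `curve_fit` is given the measurement uncertainties through its `sigma` argument, the optimal parameters it finds are the ones that minimize $\chi^2$. The value of $\chi^2$ at those parameters is the best-fit $\chi^2$.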

So, for a model fitted with least squares, we can calculate the information criteria using the final $\chi^2$ from the fit:

* **For AIC:**
    $$-2\ln(L) = -2 \left( - \frac{1}{2}\chi^2 + \text{constant} \right) = \chi^2 + \text{constant'}$$
    Since the constants don't affect the model ranking, we can simply use:
    $$AIC = \chi^2 + 2k$$

* **For BIC:**
    $$BIC = \chi^2 + k\ln(N)$$

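Where $\chi^2$ is the final chi-squared value calculated from the weighted least squares (WLS) fit, $k$ is the number of fitted parameters, and $N$ is the number of data points. This shortcut relies on the $\sigma_i$ being known and identical across all models being compared; only then is the dropped constant the same for every model. By calculating these values for different models fitted to the same data, you can choose the model with the lowest AIC or BIC as the one that provides the best balance of fit and parsimony.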
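Putting the pieces together, here is a sketch of the workflow with `scipy.optimize.curve_fit`. The two model functions, the synthetic data, and the helper `information_criteria` are assumptions made for illustration. The uncertainties are treated as known and are passed with `absolute_sigma=True`.

```python
import numpy as np
from scipy.optimize import curve_fit


def linear(x, a, b):
    return a * x + b


def quadratic(x, a, b, c):
    return a * x**2 + b * x + c


rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
sigma = np.full(x.size, 0.5)                      # known measurement uncertainties
y = linear(x, 2.0, 1.0) + rng.normal(0, sigma)    # synthetic data from a line


def information_criteria(model, x, y, sigma):
    """Fit `model` by weighted least squares and return chi2, AIC, and BIC."""
    popt, _ = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True)
    chi2 = np.sum(((y - model(x, *popt)) / sigma) ** 2)
    k, n = len(popt), len(x)
    return {"k": k, "chi2": chi2, "aic": chi2 + 2 * k, "bic": chi2 + k * np.log(n)}


results = {m.__name__: information_criteria(m, x, y, sigma) for m in (linear, quadratic)}
for name, r in results.items():
    print(name, r)
```

The quadratic model can never have a larger best-fit $\chi^2$ than the linear one, because it contains the line as a special case. Whether that reduction is worth the extra parameter is what the penalty terms decide.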

The derivation above rests on a normality assumption, and it is worth being precise about what that assumption says. Three readings are common: (1) all the data points should lie on a normal distribution curve; (2) the errors for each point follow a normal distribution; (3) each observed $y_i$ is normally distributed around the model's prediction. The correct interpretation is a combination of the second and third readings, which are essentially two ways of stating the same core assumption.

### The Correct Assumption: Normality of Errors

The following two statements are correct:

*   **We expect that errors (the residuals) for each point follow a normal distribution.**
*   **The probability of observing a single data point $y_i$ given the model's prediction $f(x_i, \beta)$ is described by a normal distribution with a mean of $f(x_i, \beta)$.**

These two statements describe the same concept from slightly different angles. In a standard linear regression model, we model the relationship as:

$$
y_i = f(x_i, \beta) + \epsilon_i
$$

Where:
- $y_i$ is the observed outcome for the i-th data point.
- $f(x_i, \beta)$ is the model's prediction for $y_i$ based on its input $x_i$ and parameters $\beta$. This is the **mean** of the distribution for $y_i$.
- $\epsilon_i$ is the **error term** for that data point. The residual $y_i - \hat{y}_i$ is its observable estimate once the model has been fitted.

The fundamental assumption is that this error term, $\epsilon_i$, is a random variable drawn from a normal distribution with a mean of 0 and a constant variance $\sigma^2$ (the derivation above allows a separate known $\sigma_i$ for each point, which generalizes this). We write this as:

$$
\epsilon_i \sim N(0, \sigma^2)
$$

This directly implies the second of the two statements above: the observed value $y_i$ for a given $x_i$ is expected to follow a normal distribution centered around the model's prediction. We write this as:

$$
y_i | x_i \sim N(f(x_i, \beta), \sigma^2)
$$

This means that if we could collect many $y$ values for the *exact same* $x$ value, those $y$ values would form a normal (bell-shaped) distribution around the line of best fit.

### Why the First Option is Incorrect

The first reading asks: **"Do we expect that all the data points should be on the normal distribution curve?"**

This is a common misconception. Here's why it's incorrect:
1.  **Data points ($x_i, y_i$) are pairs:** They represent points in a 2D (or higher dimensional) space. A single normal distribution describes a single variable, not a relationship between variables.
2.  **The dependent variable $y$ itself doesn't need to be normally distributed:** The assumption applies to the errors, which the *residuals* ($y_i - \hat{y}_i$) estimate, not to the raw $y$ values. The $y$ values could have any shape (e.g., be skewed or bimodal), but as long as the errors around the regression line are normally distributed, the assumption holds.

In summary, when we use the log-likelihood formula for a normal distribution in a regression context, we are not assuming the raw data is normal. Instead, we are making a precise assumption that the **errors of our model's predictions are normally distributed.** This assumption is what allows us to calculate the likelihood of observing our data given the model, which is a necessary step for calculating AIC and BIC.
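A short simulation illustrates the distinction. The data are synthetic: the $x$ values are deliberately drawn from two well-separated clusters so that $y$ is strongly bimodal, while the noise added to the line is normal. The Shapiro-Wilk test from `scipy.stats` is used here only as one convenient normality check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# x comes from two separated clusters, so y = 3x + 2 + noise is bimodal
x = np.concatenate([rng.normal(0, 0.5, 500), rng.normal(10, 0.5, 500)])
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.size)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Shapiro-Wilk test: a tiny p-value rejects normality
_, p_y = stats.shapiro(y)
_, p_resid = stats.shapiro(residuals)

print("raw y p-value:     ", p_y)
print("residuals p-value: ", p_resid)
print("residual std:      ", residuals.std())
```

The raw $y$ values fail the normality test decisively, yet the regression assumption is satisfied, because it concerns only the scatter around the fitted line.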

A note on notation: writing $\sigma^2 = \frac{SSR}{n-k-1}$ is not quite right, because the right-hand side is a sample statistic that estimates the error variance rather than the true $\sigma^2$ itself. The estimate is more accurately written as $s^2$, which is commonly called the **Mean Squared Error (MSE)**.

---

### What is the Mean Squared Error of Residuals ($MSE$)?

The **Mean Squared Error of Residuals ($MSE$)** is the sum of squared differences between the observed values and the values predicted by the regression model, divided by the residual degrees of freedom. Under the standard linear regression assumptions, it is an unbiased estimator of the true error variance, $\sigma^2$.

The formula is:

$$MSE = s^2 = \frac{SSR}{n - k - 1}$$

Here's what each term means:

* **$s^2$** or **$MSE$**: The estimated error variance.
* **$SSR$ (Sum of Squared Residuals)**: The sum of the squared differences between the observed values ($y_i$) and the predicted values ($\hat{y}_i$).
    $$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
* **$n$**: The number of observations or data points.
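* **$k$**: The number of independent variables (predictors) in the model, not counting the intercept. Note that this differs from the $k$ in the AIC and BIC formulas, which counts every estimated parameter.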
* **$n - k - 1$**: The **degrees of freedom** for the residuals. This is the denominator because we lose one degree of freedom for the intercept and one for each predictor we estimate.

---

### Derivation and Intuition

The divisor $n - k - 1$ comes from requiring the estimator to be **unbiased**: under the model assumptions, the expected value of $SSR$ is $(n - k - 1)\sigma^2$.

The goal is to estimate the true variance of the errors, $\sigma^2$, which we don't know. The sum of squared residuals, $SSR$, is a good starting point, but simply dividing by $n$ would give a biased estimate that is too small on average. The reason is that the regression coefficients (the slope and intercept) are chosen to make the residuals as small as possible for this particular sample, which imposes $k+1$ linear constraints on them (for example, with an intercept the residuals must sum to zero). The residuals are therefore not free to vary independently, and they understate the spread of the true errors.

By dividing by the **degrees of freedom**, $n - k - 1$, instead of just $n$, we correct for this bias. The degrees of freedom represent the number of values in the final calculation of a statistic that are free to vary. In linear regression, once you've estimated the intercept and the slopes (a total of $k+1$ parameters), the values of the residuals are no longer entirely free to vary. Therefore, dividing by $n - k - 1$ provides an **unbiased estimator** of the true population variance, $\sigma^2$.
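The bias is easy to see in a simulation. The sketch below repeatedly generates data from a straight line with a known error variance and averages both estimators over many trials. All of the settings (sample size, true coefficients, number of trials) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, true_sigma2 = 10, 1, 4.0          # 10 points, 1 predictor, variance 4
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])    # design matrix: intercept + slope

n_trials = 20_000
eps = rng.normal(0, np.sqrt(true_sigma2), size=(n_trials, n))
Y = 1.0 + 2.0 * x + eps                 # one simulated dataset per row

# Fit all datasets at once: lstsq accepts one column per right-hand side
beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
resid = Y.T - X @ beta
ssr = np.sum(resid**2, axis=0)          # SSR for each simulated dataset

mean_biased = np.mean(ssr / n)              # divides by n
mean_unbiased = np.mean(ssr / (n - k - 1))  # divides by degrees of freedom

print("SSR / n       :", mean_biased)
print("SSR / (n-k-1) :", mean_unbiased)
```

Dividing by $n$ averages to roughly $\sigma^2 (n-k-1)/n = 3.2$, while dividing by $n - k - 1$ averages to roughly the true value of 4.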

## Additional Materials

* https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
* https://www.quantstart.com/articles/Maximum-Likelihood-Estimation-for-Linear-Regression/