# Normality of Errors

## Explanation:

Definition: The residuals of the model should follow an approximately normal distribution. This assumption is particularly important for the validity of statistical tests (like t-tests for coefficients) and for constructing accurate confidence intervals, especially in small samples.

Why It Matters: Non-normal residuals can signal that the model is missing key variables or that the relationship between the independent and dependent variables is not properly captured by the model. Non-normality by itself does not bias the OLS coefficient estimates, but in small samples it makes p-values and confidence intervals unreliable, which can lead to incorrect inferences.


## How to Check:
 
- Q-Q Plot (Quantile-Quantile Plot): A Q-Q plot compares the quantiles of the residuals to the quantiles of a normal distribution. If the points roughly follow a straight diagonal line, the residuals are approximately normally distributed.

- Shapiro-Wilk Test: This is a formal statistical test whose null hypothesis is that the data are normally distributed. A small p-value (for example, below 0.05) suggests that the residuals are not normally distributed.

- Histogram of Residuals: You can also simply plot a histogram of the residuals to visually inspect whether they follow a bell-shaped curve.

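- Breusch-Pagan Test: Note that this test does not assess normality. It formally tests for heteroscedasticity (non-constant variance of the residuals), a separate assumption that is often checked alongside normality.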
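
The normality checks listed above can be sketched numerically as follows. This is a minimal illustration, not a definitive procedure: the data are simulated, the names `x`, `y`, and `residuals` are hypothetical, and the fit uses `np.polyfit` rather than any particular regression library. `scipy.stats.probplot` returns the quantile pairs behind a Q-Q plot, so no plotting backend is needed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: a linear relationship with normally distributed errors.
x = rng.normal(150, 50, 200)
y = 200 + 3 * x + rng.normal(0, 20, 200)

# Fit a straight line by least squares and compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
w_stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p-value = {p_value:.3f}")

# Q-Q plot ingredients: theoretical quantiles vs. ordered residuals, plus the
# correlation r of the reference line (values close to 1 suggest normality).
quantiles, line_fit = stats.probplot(residuals, dist="norm")
theoretical_q, ordered_resid = quantiles
qq_slope, qq_intercept, qq_r = line_fit
print(f"Q-Q correlation r = {qq_r:.3f}")

# Histogram counts: a roughly bell-shaped profile is expected.
counts, bin_edges = np.histogram(residuals, bins=10)
print("Histogram counts:", counts)
```

To draw the figures rather than just compute them, `stats.probplot` accepts a `plot=` argument (for example a `matplotlib.pyplot` module), and the histogram can be drawn with any plotting library.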

## Example:

In a model predicting student test scores, if the residuals are heavily skewed or show excess kurtosis (tails much heavier or lighter than those of a normal distribution), it may indicate that the model is not capturing some important aspect of the data, like a non-linear relationship or the presence of outliers.
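
Skewness and excess kurtosis can also be computed directly. The sketch below compares two simulated residual samples, one normal and one right-skewed; the samples and the names `normal_resid` and `skewed_resid` are made up for illustration and do not come from a real test-score model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical residuals: one normal sample, one centered exponential sample
# (mean zero, but with a long right tail).
normal_resid = rng.normal(0, 10, 5000)
skewed_resid = rng.exponential(10, 5000) - 10

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    skewness = stats.skew(resid)
    # Fisher definition: excess kurtosis is 0 for a normal distribution.
    excess_kurtosis = stats.kurtosis(resid)
    print(f"{name}: skewness = {skewness:.2f}, "
          f"excess kurtosis = {excess_kurtosis:.2f}")
```

Values near zero for both statistics are consistent with normality, while large positive skewness or excess kurtosis points to a long tail or outliers.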


In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data
np.random.seed(42)
tamanho = np.random.normal(150, 50, 100)  # House size
erro = np.random.normal(0, 20, 100)  # Random error
preco = 200 + 3 * tamanho + erro  # House price

# OLS regression
X = sm.add_constant(tamanho)  # Adds the intercept
modelo = sm.OLS(preco, X).fit()

# Model results
print(modelo.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.980
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     4901.
Date:                Tue, 27 Aug 2024   Prob (F-statistic):           1.71e-85
Time:                        08:19:05   Log-Likelihood:                -435.28
No. Observations:                 100   AIC:                             874.6
Df Residuals:                      98   BIC:                             879.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        208.7440      6.376     32.738      0.0

- F-test for the model: checks whether at least one of the betas is statistically significant. (I want a small p-value.)
- t-test for the parameters: I want P>|t| to be small for each coefficient.
- Adjusted R^2: used to compare different models, possibly with different sample sizes and numbers of parameters.