# Understanding Components of OLS Summary

**The OLS summary report is a detailed output that provides various metrics and statistics to help evaluate the model's performance and interpret its results.**

```
OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.669
Model:                            OLS   Adj. R-squared:                  0.667
Method:                 Least Squares   F-statistic:                     299.2
Date:                Mon, 01 Mar 2021   Prob (F-statistic):           2.33e-37
Time:                        16:19:34   Log-Likelihood:                -88.686
No. Observations:                 150   AIC:                             181.4
Df Residuals:                     148   BIC:                             187.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.2002      0.257    -12.458      0.000      -3.708      -2.693
x1             0.7529      0.044     17.296      0.000       0.667       0.839
==============================================================================
Omnibus:                        3.538   Durbin-Watson:                   1.279
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                3.589
Skew:                           0.357   Prob(JB):                        0.166
Kurtosis:                       2.744   Cond. No.                         43.4
==============================================================================
```

## The Header Section: Dependent Variable and Model Information

- **Dependent Variable and Model:** The dependent variable (also known as the explained variable) is the variable that we aim to predict or explain using the independent variables. The model section indicates that the method used is *Ordinary Least Squares (OLS)*, which minimizes the sum of the squared errors between the observed and predicted values.
- **Number of observations:** The number of observations is the size of our sample, i.e. N = 150.
- **Degrees of freedom (df):** Degrees of freedom is the number of independent observations on the basis of which the sum of squares is calculated.

Degrees of freedom:
$$
Df = N - K
$$
*Where **N** = sample size (no. of observations) and **K** = number of variables + 1 (including the intercept).*
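
As a quick check against the summary above, here is a worked instance of the formula. It takes K = 2 (the intercept plus x1) and reads `Df Model` as the number of independent variables excluding the intercept, i.e. K − 1:
$$
Df_{Residuals} = N - K = 150 - 2 = 148, \qquad Df_{Model} = K - 1 = 2 - 1 = 1
$$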

## Coefficient Interpretation: Standard Error, T-Statistics, and P-Value Insights

```
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.2002      0.257    -12.458      0.000      -3.708      -2.693
x1             0.7529      0.044     17.296      0.000       0.667       0.839
==============================================================================
```

- **Constant term:** The constant term is the intercept of the regression line, i.e. the predicted value of Y when every independent variable equals zero. Because a regression leaves out variables that have little impact on the dependent variable, the intercept also absorbs the average effect of those omitted variables and of the noise in the model. For example, the summary above gives the regression equation $$Y = -3.2002 + 0.7529X$$, so when X=0 the predicted value of Y is -3.2002, the baseline level of the fitted line.
- **Coefficient term:** The coefficient term tells the change in the predicted Y for a unit change in X. For example, if X rises by 1 unit then the predicted Y rises by 0.7529.
- **Standard Error of Parameters:** The standard error is the estimated standard deviation of a coefficient estimate (not of the data itself). It is a measure of how much the coefficient estimates would vary if the same model were estimated with different samples from the same population. Larger standard errors indicate less precise estimates. For the slope in a regression with a single independent variable, the standard error is calculated as:
$$
\text{Standard Error} = \sqrt{\frac{\text{Residual Sum of Squares}}{N-K}} \cdot \sqrt{\frac{1}{\displaystyle\sum(X_i-\overline{X})^2}}
$$
*where:*
- ***Residual Sum of Squares** is the sum of the squared differences between the observed values and the predicted values.*
- ***N** is the number of observations.*
- ***K** is the number of estimated parameters, i.e. the independent variables in the model plus the intercept.*
- ***$X_i$** represents each independent variable value, and **$\overline{X}$** is the mean of those values.*
- **T-Statistics and P-Values:**
    - The t-statistics are calculated by dividing the coefficient by its standard error. These values are used to test the null hypothesis that the coefficient is zero (i.e., the independent variable has no effect on the dependent variable).
    - The p-values associated with these t-statistics indicate the probability of observing a t-statistic at least as extreme as the estimated one if the null hypothesis were true. A p-value below a chosen significance level (usually 0.05) suggests that the coefficient is statistically significant, meaning the data provide evidence that the independent variable is related to the dependent variable. A displayed value of 0.000 means the p-value rounds to zero at three decimal places, not that it is exactly zero.
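- **Confidence Intervals:** The confidence intervals give a range within which the true coefficient likely falls, with a certain level of confidence (usually 95%). The `[0.025` and `0.975]` columns are the lower and upper bounds of that 95% interval.
    - If a confidence interval includes zero, the coefficient is not statistically distinguishable from zero at that confidence level, so the variable might not actually impact the outcome.
    - If zero isn’t in the interval, it’s more likely that the variable genuinely affects the outcome. Here the interval for x1 is [0.667, 0.839], which excludes zero.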
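
The quantities in the coefficient table can be sketched in plain Python. The snippet below is a minimal illustration, not the statsmodels implementation: `simple_ols` is a hypothetical helper for a single independent variable, the toy `x`/`y` data are made up, and the critical value `t_crit` is hard-coded as an approximation for 148 degrees of freedom. The last part recomputes the t-statistic and 95% interval for x1 from the rounded `coef` and `std err` printed in the summary, so the results agree with the table only up to rounding.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares (hypothetical helper)."""
    n = len(x)
    k = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares: squared gaps between observed and predicted y
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope: sqrt(RSS / (N - K)) * sqrt(1 / Sxx)
    se_slope = math.sqrt(rss / (n - k)) * math.sqrt(1.0 / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,  # t = coefficient / standard error
        "df_resid": n - k,
    }


# Made-up toy data, only to exercise the helper
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(x, y)
print(fit)

# Recompute the x1 row from the rounded values shown in the summary
coef, se = 0.7529, 0.044
t_stat = coef / se
t_crit = 1.976  # assumed: approximate 97.5th percentile of Student's t, 148 df
ci = (coef - t_crit * se, coef + t_crit * se)
print(t_stat, ci)
```

Because `std err` is printed with only three decimals, the recomputed t-statistic lands near, but not exactly on, the reported 17.296.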

## Evaluating Model Performance: Goodness of Fit Metrics

```
==============================================================================
Dep. Variable:                      y   R-squared:                       0.669
Model:                            OLS   Adj. R-squared:                  0.667
Method:                 Least Squares   F-statistic:                     299.2
Date:                Mon, 01 Mar 2021   Prob (F-statistic):           2.33e-37
Time:                        16:19:34   Log-Likelihood:                -88.686
No. Observations:                 150   AIC:                             181.4
Df Residuals:                     148   BIC:                             187.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
```

- **R-Squared (R²):** R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, where 1 indicates that the model explains all the variance.
    - For example, an R-squared value of 0.669 means that about 66.9% of the variance in the dependent variable is explained by the model.
- **Adjusted R-Squared:** Adjusted R-squared is a modified version of R-squared that penalizes the number of independent variables in the model, providing a fairer measure of the model's explanatory power when comparing models with different numbers of variables.
    - If the adjusted R-squared decreases when adding more variables, it suggests that the additional variables do not contribute significantly to the model and may be omitted.
- **F-Statistic and Prob(F-Statistic):** The F-statistic is used to test the overall significance of the model. The null hypothesis is that all coefficients (except the intercept) are zero, meaning the model does not explain any variance in the dependent variable. The p-value associated with the F-statistic indicates the probability of observing the F-statistic (or a more extreme value) if the null hypothesis were true. A small p-value (typically less than 0.05) indicates that the model is statistically significant, meaning at least one of the independent variables has a significant effect on the dependent variable.
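
Both the adjusted R-squared and the F-statistic can be recovered from R-squared, the sample size, and the number of parameters. The sketch below uses the standard textbook formulas with the rounded values from the summary; the function names `adjusted_r2` and `f_statistic` are illustrative, not part of any library.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared; k counts estimated parameters including the intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


def f_statistic(r2, n, k):
    """Overall F-statistic: explained variance per model df over
    unexplained variance per residual df."""
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))


r2, n, k = 0.669, 150, 2  # values read from the summary above
adj = adjusted_r2(r2, n, k)
f = f_statistic(r2, n, k)
print(round(adj, 3))  # 0.667, matching Adj. R-squared
print(round(f, 1))    # close to the reported 299.2; R-squared is rounded
```

With a single independent variable the F-statistic is also the square of that variable's t-statistic, which is why 17.296 squared is roughly 299.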

## Testing Model Assumptions with Diagnostics

```
==============================================================================
Omnibus:                        3.538   Durbin-Watson:                   1.279
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                3.589
Skew:                           0.357   Prob(JB):                        0.166
Kurtosis:                       2.744   Cond. No.                         43.4
==============================================================================
```

The remaining terms are consulted less often than the coefficient table, but the OLS summary provides several diagnostic checks that help assess specific assumptions about the residuals. Skew and Kurtosis describe the shape of the residual distribution. Below are the key diagnostics included in the OLS summary:
- **Omnibus:** The Omnibus test evaluates the joint normality of the residuals. A higher value suggests a deviation from normality.
- **Prob(Omnibus):** This p-value indicates the probability of observing the test statistic under the null hypothesis of normality. A value above 0.05 suggests that we do not reject the null hypothesis, implying that the residuals may be normally distributed.
- **Jarque-Bera (JB):** The Jarque-Bera test is another test for normality that assesses whether the sample skewness and kurtosis match those of a normal distribution.
- **Prob(JB):** Similar to the Prob(Omnibus), this p-value assesses the null hypothesis of normality. A value greater than 0.05 indicates that we do not reject the null hypothesis.
- **Skew:** Skewness measures the asymmetry of the distribution of residuals. A skewness value close to zero indicates a symmetrical distribution, while positive or negative values indicate right or left skewness, respectively.
- **Kurtosis:** Kurtosis measures the "tailedness" of the distribution. A kurtosis value of 3 indicates a normal distribution, while values above or below suggest heavier or lighter tails, respectively.
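- **Durbin-Watson:** This statistic tests for first-order autocorrelation in the residuals from a regression analysis. It ranges from 0 to 4: values close to 2 suggest no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. A common rule of thumb treats values below 1 or above 3 as cause for concern. The value of 1.279 here hints at some positive autocorrelation.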
- **Cond. No.:** The condition number assesses multicollinearity; as a rule of thumb, values above 30 suggest potential multicollinearity issues among the independent variables.

***Skewness and kurtosis for the normal distribution are 0 and 3 respectively.***

These diagnostic tests are essential for validating the reliability of a linear regression model, helping ensure that the model's assumptions are satisfied.
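
Two of these diagnostics are simple enough to sketch by hand. The snippet below is an illustration using the standard textbook formulas, not the statsmodels source: `jarque_bera` and `durbin_watson` are hypothetical helper names, the Jarque-Bera p-value uses the closed form of the chi-square tail with 2 degrees of freedom, and the residual lists are made-up toy sequences.

```python
import math


def jarque_bera(n, skew, kurtosis):
    """Jarque-Bera statistic and p-value from sample skewness and kurtosis.

    kurtosis is the plain (non-excess) value, so a normal distribution has 3.
    """
    jb = n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality JB is approximately chi-square with 2 df,
    # whose upper-tail probability is exp(-jb / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Values read from the summary above (already rounded)
jb, p = jarque_bera(150, 0.357, 2.744)
print(jb, p)  # near the reported 3.589 and 0.166, up to rounding

# Made-up residual sequences to show the direction of the statistic
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # negative autocorrelation
dw_clustered = durbin_watson([1.0, 1.0, -1.0, -1.0])    # positive autocorrelation
print(dw_alternating, dw_clustered)  # 3.0 1.0
```

The alternating residuals push the statistic above 2 and the clustered residuals pull it below 2, which is the pattern the Durbin-Watson entry describes.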