In [1]:
import pandas as pd 
import numpy as np 
import statsmodels.api as sm

model_data_df = pd.read_csv('../data/boston.csv')

In [4]:
independent_features_df = model_data_df[['CRIM','RM','LSTAT']]
dependent_feature_df = model_data_df['MEDV']

In [5]:
# Add a constant term to the independent variable matrix for the intercept
X_with_intercept = sm.add_constant(independent_features_df)

# Fit the linear regression model
model = sm.OLS(dependent_feature_df, X_with_intercept)
results = model.fit()

# Display the summary table
print(results.summary())


                            OLS Regression Results                            
Dep. Variable:                   MEDV   R-squared:                       0.646
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     305.2
Date:                Sat, 28 Oct 2023   Prob (F-statistic):          1.01e-112
Time:                        15:50:38   Log-Likelihood:                -1577.6
No. Observations:                 506   AIC:                             3163.
Df Residuals:                     502   BIC:                             3180.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.5623      3.166     -0.809      0.4

The summary table for the linear regression on the Boston Housing dataset reports how well the model fits and how each independent variable relates to the dependent variable. The key components are explained below.

### 1. Dependent Variable (MEDV):

Meaning: This is the variable we are trying to predict: the median value of owner-occupied homes, in $1000s.

### 2. Model:

Meaning: OLS (Ordinary Least Squares) is the method used for estimating the coefficients of the linear regression model.

### 3. Coefficients:

Intercept (const): Represents the estimated value of the dependent variable when all independent variables are zero.

CRIM, RM, LSTAT: Each coefficient represents the expected change in the dependent variable for a one-unit increase in that independent variable, holding the other independent variables constant.
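
To make the "holding the others constant" reading concrete, here is a minimal sketch on synthetic data (not the Boston dataset; the variable names and true coefficients are made up for illustration). It fits a least-squares model with NumPy and checks that raising one predictor by one unit changes the prediction by exactly that predictor's coefficient.

```python
import numpy as np

# Synthetic data (hypothetical, not the Boston dataset): y depends on two predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict at a baseline point, then raise x1 by one unit with x2 held fixed.
baseline = np.array([1.0, 0.3, -0.7])
shifted = baseline + np.array([0.0, 1.0, 0.0])
change = shifted @ beta - baseline @ beta

print("fitted coefficients:", beta)
print("change in prediction:", change)  # same value as beta[1]
```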
    
### 4. Coefficient Statistics:

coef (Coefficient Estimate): The estimated value of the coefficient.

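std err (Standard Error): Indicates the precision of the estimate. Smaller values mean a more precise estimate.

t (t-statistic): The coefficient divided by its standard error. It tests the null hypothesis that the coefficient is equal to zero, and larger absolute values are stronger evidence against that hypothesis. For the const row above, -2.5623 / 3.166 gives -0.809.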

P>|t| (p-value): The p-value associated with the t-statistic. A small p-value (< 0.05) suggests that the variable is statistically significant.

[0.025 0.975] (Confidence Interval): The 95% confidence interval for the coefficient. If it contains zero, the coefficient is not statistically significant at the 5% level.
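
The columns described above can be reproduced by hand. The sketch below uses synthetic data (an assumption for illustration, not the Boston data) and the classical OLS formulas: standard errors from the diagonal of the estimated covariance matrix, t as coefficient over standard error, a two-sided p-value from the t distribution, and a 95% interval from the matching critical value.

```python
import numpy as np
from scipy import stats

# Synthetic data (hypothetical): one predictor plus an intercept.
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 4.0 + 1.2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical (nonrobust) covariance of the estimates: sigma^2 * (X'X)^-1.
resid = y - X @ beta
df_resid = n - X.shape[1]
sigma2 = resid @ resid / df_resid
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_stat = beta / std_err                                # "t" column
p_value = 2 * stats.t.sf(np.abs(t_stat), df_resid)     # "P>|t|" column
t_crit = stats.t.ppf(0.975, df_resid)
ci_low = beta - t_crit * std_err                       # "[0.025" column
ci_high = beta + t_crit * std_err                      # "0.975]" column

for name, row in zip(["const", "x"], zip(beta, std_err, t_stat, p_value, ci_low, ci_high)):
    print(name, [round(float(v), 4) for v in row])
```

A coefficient has a p-value below 0.05 exactly when its 95% interval excludes zero, which is why the last two columns always agree with the p-value column.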

### 5. Model Fit:

R-squared: Represents the proportion of the variance in the dependent variable that is explained by the independent variable(s). A higher R-squared indicates a closer fit to the data. Here, 0.646 means the three predictors explain about 64.6% of the variance in MEDV.

Adj. R-squared: Similar to R-squared but adjusted for the number of predictors. It penalizes excessive use of variables.
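
As a quick check of the adjustment, the snippet below recomputes Adj. R-squared from the values printed in the summary (R-squared 0.646, 506 observations, 3 predictors) using the standard formula. Only the variable names are my own.

```python
# Values read off the summary table above.
n_obs = 506
n_predictors = 3
r_squared = 0.646

# Adjusted R-squared scales the unexplained share by (n - 1) / (n - k - 1).
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(round(adj_r_squared, 3))  # 0.644
```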

### 6. F-statistic:

Meaning: A test of whether the model as a whole explains the data better than an intercept-only model. It tests the null hypothesis that all slope coefficients (every coefficient except the intercept) are equal to zero.

Prob (F-statistic): The p-value associated with the F-statistic. A small p-value indicates that at least one independent variable is significant.
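
For an OLS model with an intercept, the F-statistic can be written in terms of R-squared and the two degrees-of-freedom values. The sketch below applies that identity to the numbers in the summary. Because R-squared is printed to only three decimals, the result should land near the reported 305.2 rather than match it exactly.

```python
# Values read off the summary table above.
r_squared = 0.646
n_obs = 506
df_model = 3
df_resid = n_obs - df_model - 1

# Explained variance per model degree of freedom over
# unexplained variance per residual degree of freedom.
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
print(round(f_stat, 1))
```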

### 7. Degrees of Freedom:

Df Residuals: Degrees of freedom of the residuals. It's the difference between the number of observations and the number of parameters estimated, including the constant (506 - 4 = 502 here).

Df Model: Degrees of freedom of the model. It's the number of estimated parameters excluding the constant term (3 here).
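
A small sketch of where these two numbers come from, assuming a design matrix with the same shape as the one in this notebook (506 rows, a constant, and 3 predictors). The feature values are random placeholders, since only the shape matters here.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_predictors = 506, 3
features = rng.normal(size=(n_obs, n_predictors))   # placeholder values
X = np.column_stack([np.ones(n_obs), features])     # constant + predictors

# The rank counts the independent columns, i.e. the parameters actually estimated.
rank = int(np.linalg.matrix_rank(X))
df_model = rank - 1        # exclude the constant
df_resid = n_obs - rank    # observations minus estimated parameters
print(df_model, df_resid)  # 3 502
```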

### 8. Covariance Type:
Meaning: "nonrobust" names the estimator used for the covariance matrix of the coefficient estimates. The standard errors in the table are the square roots of that matrix's diagonal.

Types:

    Nonrobust Covariance: The classical OLS formula. It assumes the errors are homoscedastic (constant variance) and uncorrelated with each other. Normality of the errors is not required by the formula itself, but it is what makes the t and F tests exact in small samples.

    Robust Covariance: Remains valid when the constant-variance assumption fails. Heteroscedasticity-consistent estimators, such as the Huber-White (White) estimator, adjust the standard errors for non-constant error variance. They do not protect the coefficient estimates themselves against outliers.

Implications of "nonrobust":

    Assumption: The standard errors of the coefficients are calculated under the assumption of homoscedastic, uncorrelated errors.

    Sensitivity: If the error variance is not constant, the coefficient estimates remain unbiased, but the nonrobust standard errors can be wrong, and the t-statistics, p-values, and confidence intervals built on them can be misleading.

    Considerations: Nonrobust covariance is a reasonable choice when residual diagnostics support constant variance. When they do not, a robust covariance type gives more reliable standard error estimates.

Practical Considerations:

    Robust vs. Nonrobust: The choice between robust and nonrobust covariance depends on the characteristics of the data. If there are concerns about heteroscedasticity, robust covariance is usually more appropriate.

    Diagnostic Tools: Diagnostic tests for residuals, such as residual plots and tests for homoscedasticity, can guide the choice between robust and nonrobust covariance types.

    Reporting: In research papers or reports, it's essential to specify the covariance type used in the analysis to ensure transparency and reproducibility.

In summary, "Covariance Type: nonrobust" indicates that the standard errors of the coefficients rely on the classical assumption of homoscedastic, uncorrelated errors, so they should be read alongside the residual diagnostics.

### 9. Omnibus Test:

Meaning: In the statsmodels summary, Omnibus is a test of whether the residuals are normally distributed. It combines the skewness and kurtosis of the residuals into a single statistic (the D'Agostino-Pearson test). It is not a test of overall model fit; that is the role of the F-statistic.

Interpretation:
    A small Prob(Omnibus) (typically < 0.05) suggests that the residuals are not normally distributed.
    A statistic near zero, with a large p-value, is consistent with normally distributed residuals.
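
As an illustration of how the Omnibus row behaves in the statsmodels summary, where it is a normality check on the residuals, the sketch below runs SciPy's `normaltest` (the D'Agostino-Pearson test) on two synthetic samples standing in for residuals: one drawn from a normal distribution and one from a strongly skewed distribution. The samples are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_resid = rng.normal(size=500)               # stand-in for well-behaved residuals
skewed_resid = rng.exponential(size=500) - 1.0    # stand-in for right-skewed residuals

stat_normal, p_normal = stats.normaltest(normal_resid)
stat_skewed, p_skewed = stats.normaltest(skewed_resid)

print("normal sample:", stat_normal, p_normal)
print("skewed sample:", stat_skewed, p_skewed)    # very small p-value: not normal
```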

### 10. Skewness:

Meaning: Skewness measures the asymmetry of the distribution of residuals. It indicates whether the residuals are symmetrically distributed around the mean.

Interpretation:
    Skewness close to zero suggests a roughly symmetric distribution of residuals.
    Positive skewness indicates that the distribution has a longer right tail.
    Negative skewness indicates that the distribution has a longer left tail.

### 11. Kurtosis:

Meaning: Kurtosis measures the "tailedness" of the distribution of residuals. It indicates the heaviness of the tails relative to a normal distribution.

Interpretation:
    Kurtosis close to 3 suggests tails similar to the normal distribution. The statsmodels summary reports kurtosis itself, not excess kurtosis, so the normal benchmark is 3 rather than 0.
    Kurtosis above 3 indicates heavier tails (more extreme values) than the normal distribution.
    Kurtosis below 3 indicates lighter tails than the normal distribution.

### Why Are Skewness and Kurtosis Important?

    Assumption Checking: Normality of residuals is a common assumption behind the t and F tests, and skewness and kurtosis describe how the residual distribution departs from it.

    Omnibus Test: The Omnibus test turns skewness and kurtosis into a single formal test of normality. A significant result calls for a closer look at the residuals, for example with a histogram or Q-Q plot.
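
Both statistics are ratios of central moments, which the sketch below computes directly. The helper name `skew_and_kurtosis` and the tiny input arrays are hypothetical. Kurtosis here is on the scale where a normal distribution scores 3, the convention used by the statsmodels summary.

```python
import numpy as np

def skew_and_kurtosis(values):
    """Return (skewness, kurtosis) from central moments; normal kurtosis is 3."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A symmetric sample has zero skewness.
sym_skew, sym_kurt = skew_and_kurtosis([-2, -1, 0, 1, 2])
# One large positive value creates a long right tail (positive skewness).
right_skew, right_kurt = skew_and_kurtosis([0, 0, 0, 0, 10])

print(sym_skew, sym_kurt)
print(right_skew, right_kurt)
```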


### 12. Durbin-Watson Statistic:

Meaning: The Durbin-Watson statistic is a test for autocorrelation in the residuals. It assesses whether there is a pattern or correlation between successive residuals.

Interpretation:
    The Durbin-Watson statistic ranges from 0 to 4.
    A value around 2 suggests no autocorrelation.
    Values significantly below 2 suggest positive autocorrelation (residuals are correlated positively with adjacent residuals).
    Values significantly above 2 suggest negative autocorrelation (residuals are correlated negatively with adjacent residuals).

Importance: Autocorrelation in residuals violates the assumption of independence, which is crucial for the validity of regression analysis. Durbin-Watson helps detect this issue.
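
The statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch, with a hypothetical helper name and two hand-made residual sequences chosen to show both directions:

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences over sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

persistent = [1, 1, 1, -1, -1, -1]     # neighbours tend to share a sign
alternating = [1, -1, 1, -1, 1, -1]    # neighbours always flip sign

dw_persistent = durbin_watson_stat(persistent)
dw_alternating = durbin_watson_stat(alternating)
print(dw_persistent)    # below 2: positive autocorrelation
print(dw_alternating)   # above 2: negative autocorrelation
```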

### 13. Jarque-Bera Test:

Meaning: The Jarque-Bera test is a goodness-of-fit test that assesses whether the residuals of a regression model have the skewness and kurtosis matching a normal distribution.

Interpretation:
    A small p-value (typically < 0.05) indicates that the residuals do not follow a normal distribution.
    High values of the test statistic suggest non-normality.

Importance: The exact t and F tests assume normally distributed errors. Deviation from normality can affect the accuracy of hypothesis tests and confidence intervals, most noticeably in small samples.
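
The test statistic combines skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4) and is compared against a chi-square distribution with 2 degrees of freedom. The sketch below implements that formula. The helper name and the synthetic sample are assumptions for illustration.

```python
import math
import numpy as np

def jarque_bera_stat(resid):
    """Return (JB statistic, p-value) for a 1-D array of residuals."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centered = resid - resid.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # For a chi-square distribution with 2 df, P(X > x) = exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value

rng = np.random.default_rng(11)
skewed_resid = rng.exponential(size=500) - 1.0   # clearly non-normal sample
jb_stat, jb_pvalue = jarque_bera_stat(skewed_resid)
print(jb_stat, jb_pvalue)   # large statistic, tiny p-value
```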

### Why Are Durbin-Watson Statistic and Jarque-Bera Test Important?

Durbin-Watson: Helps identify and address autocorrelation issues in the residuals, ensuring the validity of statistical inferences.

Jarque-Bera Test: Assesses the normality assumption of residuals. Deviations from normality may indicate that the model is not capturing certain patterns in the data.

 

### Conclusion:

The summary table lets us assess the significance of each variable, judge the overall fit of the model, and check the residual assumptions (normality, independence, constant variance) that the reported standard errors and tests rely on. Together these point to where the linear regression on the Boston Housing dataset could be improved.