# Python-Practice 7.1.1: Multiple regression models.

Consider the body fat dataset and a model in which the response variable `Y` is percent body fat and the predictor variables are `X_1` = triceps skinfold thickness (mm) and `X_2` = midarm circumference (cm). The model is fit using the code below.

In [1]:
import pandas as pd
from statsmodels.formula.api import ols

fat = pd.read_csv('https://static-resources.zybooks.com/static/fat.csv')

Y = fat['body_fat_percent']

m12 = ols('Y ~ triceps_skinfold_thickness_mm + midarm_circumference_cm', data = fat).fit()
print(m12.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.786
Model:                            OLS   Adj. R-squared:                  0.761
Method:                 Least Squares   F-statistic:                     31.25
Date:                Mon, 21 Oct 2024   Prob (F-statistic):           2.02e-06
Time:                        12:17:49   Log-Likelihood:                -45.050
No. Observations:                  20   AIC:                             96.10
Df Residuals:                      17   BIC:                             99.09
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

## Summary

- R-squared measures the proportion of total variation in `Y` that is accounted for by the multiple regression model; here it is 0.786. Adj. R-squared adjusts R-squared for the number of predictors, which allows alternative models for the same response variable to be compared. F-statistic and Prob (F-statistic) test the null hypothesis that no linear regression relationship exists between `Y` and the set `{X_1, X_2}`; the small p-value (2.02e-06) is strong evidence against that hypothesis.

- The coef column in the table above contains the estimates of the regression parameters.
    - `b_0 = 6.7916`
    - `b_1 = 1.0006`
    - `b_2 = -0.4314`
- The equation for the model is `\hat{Y} = 6.7916 + 1.0006(X_1) - 0.4314(X_2)`. 
- The std err column contains standard errors of the regression parameter estimators, which measure the precision of the estimators. 
- The t column contains individual t-statistics for the regression parameter estimators, equal to each estimate divided by its standard error. 
- The P>|t| column contains individual two-sided p-values for the regression parameter estimators, equal to the sum of the two tail areas beyond the observed t-statistic in absolute value.
- The last two columns, [0.025 and 0.975], give the lower and upper bounds of the 95% confidence interval for each parameter.
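
As a small illustration of how the fitted equation is used, the sketch below plugs the three estimates from the coef column into `\hat{Y} = b_0 + b_1 X_1 + b_2 X_2`. The function name `predict_body_fat` and the input values (25 mm, 27 cm) are hypothetical and chosen only for demonstration; they are not taken from the dataset.

```python
# Coefficient estimates copied from the coef column of the summary
B0, B1, B2 = 6.7916, 1.0006, -0.4314


def predict_body_fat(triceps_mm: float, midarm_cm: float) -> float:
    """Evaluate the fitted equation Y-hat = b0 + b1*X1 + b2*X2."""
    return B0 + B1 * triceps_mm + B2 * midarm_cm


# Hypothetical subject: 25 mm triceps skinfold, 27 cm midarm circumference
y_hat = predict_body_fat(25.0, 27.0)
print(round(y_hat, 4))

# Holding midarm circumference fixed, one extra mm of triceps skinfold
# changes the prediction by b1
print(round(predict_body_fat(26.0, 27.0) - y_hat, 4))
```

The second print shows the usual interpretation of a slope in multiple regression: `b_1` is the change in predicted body fat per unit change in `X_1` with `X_2` held constant.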

# Analysis of Variance Table

The mean squared error and the residual standard error can be obtained from an analysis of variance table. Consider the body fat dataset and a model in which the response variable `Y` is percent body fat and the two predictor variables are `X_1` = triceps skinfold thickness (mm) and `X_2` = midarm circumference (cm).

In the output, the `sum_sq` column gives the regression sum of squares (SSR) and the error sum of squares (SSE), which are 389.455334 and 105.934166, respectively. SSR = 389.455334 is the variation in percent body fat explained by triceps skinfold thickness and midarm circumference. SSE = 105.934166 is the variation in percent body fat left unexplained by these two predictors.

The mean square residual (MSE) is obtained by dividing SSE = 105.934166 by N - p = 20 - 3 = 17, where N is the number of observations and p is the number of regression parameters (the intercept and two slopes). This divisor is the residual degrees of freedom, dfE, shown in the `df` column.

```math
MSE = \dfrac{SSE}{N - p} = \dfrac{105.934166}{20 - 3} = 6.231
```

Thus, the residual standard error is:

```math
RMSE = \sqrt{MSE} = \sqrt{6.231421} = 2.496
```
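
The same two sums of squares also reproduce the R-squared, Adj. R-squared, and F-statistic values reported in the regression summary. The sketch below uses only the values quoted above (SSR, SSE, N = 20, p = 3) and the standard definitions; the variable names are illustrative.

```python
ssr = 389.455334   # regression (explained) sum of squares
sse = 105.934166   # error (residual) sum of squares
n, p = 20, 3       # observations; parameters (intercept + 2 slopes)

ssto = ssr + sse                                         # total sum of squares
r_squared = ssr / ssto                                   # proportion of variation explained
adj_r_squared = 1 - (sse / (n - p)) / (ssto / (n - 1))   # penalizes extra parameters
f_stat = (ssr / (p - 1)) / (sse / (n - p))               # MSR / MSE

print(round(r_squared, 3), round(adj_r_squared, 3), round(f_stat, 2))
# prints 0.786 0.761 31.25
```

These agree with the summary output, which is a quick check that the ANOVA table and the summary describe the same fit.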


In [28]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

fat = pd.read_csv('fat.csv')

Regression = fat[['triceps_skinfold_thickness_mm', 'midarm_circumference_cm']]

Y = fat['body_fat_percent']

m12 = ols('Y ~ Regression', fat).fit()

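print(sm.stats.anova_lm(m12, typ=2))

# p-value of the overall F-test
print("p-value for the F-statistic:", m12.f_pvalue)

# Error (residual) sum of squares, SSE; statsmodels names this attribute `ssr`
sse = m12.ssr
print("sse:", sse)

# Regression (explained) sum of squares, SSR; statsmodels names this attribute `ess`
ssr = m12.ess
print(ssr)

# Mean squared error: SSE divided by the residual degrees of freedom
mse = m12.mse_resid
print(mse)

# Root mean squared error (residual standard error)
rmse = mse ** 0.5
print(rmse)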




                sum_sq    df          F    PR(>F)
Regression  389.455334   2.0  31.249317  0.000002
Residual    105.934166  17.0        NaN       NaN
p-value for the F-statistic: 2.021895311507621e-06
sse: 105.93416604723377
389.4553339527661
6.231421532190222
2.496281541050653


In [34]:
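# In Python, the RMSE can also be computed by hand from the sums of squares:
from math import sqrt

ssr = 369.984612
sse = 98.404888
n = 20  # Sample size

p = 3  # Number of predictors (the intercept is accounted for separately below)

# Residual degrees of freedom: n minus the p slopes minus 1 for the intercept
mse = sse / (n - p - 1)

rmse = sqrt(mse)

print("MSE: ", mse)
print("RMSE: ", rmse)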



MSE:  6.1503055
RMSE:  2.4799809475074603
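
The divisor is written as N - p in the text (with p counting all parameters, including the intercept) and as n - p - 1 in the cell above (with p counting predictors only). The sketch below makes the convention explicit in one helper; the function name `residual_std_error` and its argument names are hypothetical, and a model with an intercept is assumed.

```python
from math import sqrt


def residual_std_error(sse: float, n: int, n_predictors: int) -> float:
    """Return sqrt(SSE / df) for a model with an intercept and n_predictors slopes."""
    n_parameters = n_predictors + 1   # slopes plus the intercept
    df_resid = n - n_parameters       # equivalently n - n_predictors - 1
    return sqrt(sse / df_resid)


# Two-predictor model from the ANOVA table: 3 parameters, 17 residual df
print(residual_std_error(105.934166, n=20, n_predictors=2))

# Values from the cell above: 3 predictors, 16 residual df
print(residual_std_error(98.404888, n=20, n_predictors=3))
```

Both conventions give the same residual degrees of freedom once the intercept is counted consistently.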
