# Chapter 2
Chapter 2 introduces between-person analysis. The data used in this example are a subset of the []() dataset. They come from 550 older adults, for whom the following measurements were collected:
- age
- cognition (assessed with )
- grip strength
- sex
- dementia diagnosis

A review of general linear models (GLM) and Analysis of Variance (ANOVA) methods is provided.

## Import packages

In [16]:
import os
from sas7bdat import SAS7BDAT
import statsmodels.formula.api
import statsmodels.stats.anova

## Constants

In [29]:
# File paths
FILE_PATH = os.path.join("Data", "SAS_Chapter2.sas7bdat")

# File columns
AGE_COL = "age"
COGNITION_COL = "cognition"
GRIP_COL = "grip"

## Read data

In [7]:
with SAS7BDAT(FILE_PATH, skip_header=False) as reader:
    df = reader.to_data_frame()

# Between-Person Analysis

## Empty Model
Let's now start our review of between-person analysis methods with an empty model, which uses no predictors for cognition:

$
y_i = \beta_0 + e_i
$

The only parameter that our model needs to determine is the intercept $\beta_0$.

In [11]:
empty_model = statsmodels.formula.api.ols(formula="cognition ~ 1", data=df).fit()
empty_model.summary()

Dep. Variable:,cognition,R-squared:,-0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,
Date:,"Mon, 15 Jul 2024",Prob (F-statistic):,
Time:,08:35:48,Log-Likelihood:,-2098.2
No. Observations:,550,AIC:,4198.0
Df Residuals:,549,BIC:,4203.0
Df Model:,0,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,24.8218,0.469,52.973,0.000,23.901,25.742

Omnibus:,23.202,Durbin-Watson:,1.985
Prob(Omnibus):,0.0,Jarque-Bera (JB):,14.625
Skew:,-0.26,Prob(JB):,0.000667
Kurtosis:,2.393,Cond. No.,1.0


As the results above show, the intercept (24.82) equals the grand mean of cognition across the whole dataset:

df["cognition"].mean()
24.8218

The variance of the residuals can be determined using "empty_model.mse_resid", and for the empty model it corresponds to 120.76.

In [13]:
print(f"Grand mean for cognition: {df[COGNITION_COL].mean():.2f}")
print(f"Residual variance for empty model: {empty_model.mse_resid:.2f}")

Grand mean for cognition: 24.82
Residual variance for empty model: 120.76
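To make the link between the empty model and simple descriptive statistics concrete, here is a minimal sketch in plain Python. The scores are made up for illustration and are not taken from the chapter's dataset. The sketch assumes that the least-squares intercept of an intercept-only model is the sample mean, and that the residual variance divides the sum of squared residuals by n - 1, since one degree of freedom is spent on the intercept.

```python
import statistics

# Hypothetical cognition scores (illustrative only, not from the chapter's data)
scores = [20.0, 25.0, 30.0, 22.0, 28.0]

# In the empty model y_i = b0 + e_i, least squares sets b0 to the sample mean
b0 = statistics.fmean(scores)

# Residuals are the deviations of each score from the intercept
residuals = [y - b0 for y in scores]

# Residual variance: sum of squared residuals over the residual degrees of freedom (n - 1)
n = len(scores)
residual_variance = sum(e ** 2 for e in residuals) / (n - 1)

print(f"Intercept (grand mean): {b0:.2f}")
print(f"Residual variance: {residual_variance:.2f}")
print(f"Sample variance: {statistics.variance(scores):.2f}")
```

For an empty model, the residual variance and the ordinary sample variance of the outcome are the same quantity.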


## Age Model
We expect age to be a good predictor of cognition, with cognition decreasing as age increases. We can therefore build a model that uses age to predict cognition. Since everyone in our dataset is older than 80, an age of 0 is far outside the observed range, so we need to center the age predictor to give its 0 value a meaningful interpretation: we will use 85 as the 0 value.
We subtract this value to create a new variable named "centered_age", and we will use it in our models.

In [14]:
df["centered_age"] = df[AGE_COL] - 85

Let's now fit the model using centered_age as a predictor.

In [15]:
age_model = statsmodels.formula.api.ols(formula="cognition ~ centered_age", data=df).fit()
age_model.summary()

Dep. Variable:,cognition,R-squared:,0.029
Model:,OLS,Adj. R-squared:,0.027
Method:,Least Squares,F-statistic:,16.4
Date:,"Mon, 15 Jul 2024",Prob (F-statistic):,5.87e-05
Time:,08:41:25,Log-Likelihood:,-2090.1
No. Observations:,550,AIC:,4184.0
Df Residuals:,548,BIC:,4193.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,24.7818,0.462,53.612,0.000,23.874,25.690
centered_age,-0.5461,0.135,-4.049,0.000,-0.811,-0.281

Omnibus:,19.7,Durbin-Watson:,1.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,12.788
Skew:,-0.238,Prob(JB):,0.00167
Kurtosis:,2.425,Cond. No.,3.43


The coefficient for centered_age is -0.5461. As expected it is negative, meaning that cognition declines as age increases: each one-unit increase in age is associated with an expected decrease in cognition of 0.5461.

The intercept (now equal to 24.7818) corresponds to the expected cognition for an 85-year-old person.
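The effect of centering can be checked with a small sketch in plain Python. The ages and scores below are hypothetical, and `fit_line` is an illustrative helper, not part of statsmodels. The point is that subtracting a constant from the predictor leaves the slope unchanged and moves the intercept to the expected outcome at the centering value.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical ages and cognition scores (illustrative only)
ages = [82.0, 84.0, 85.0, 88.0, 91.0]
cognition = [27.0, 26.0, 25.0, 23.0, 20.0]

# Fit once on raw age and once on age centered at 85
raw_intercept, raw_slope = fit_line(ages, cognition)
centered_intercept, centered_slope = fit_line([a - 85 for a in ages], cognition)

print(f"raw age:      intercept={raw_intercept:.2f}, slope={raw_slope:.2f}")
print(f"centered age: intercept={centered_intercept:.2f}, slope={centered_slope:.2f}")
```

The raw intercept describes a person of age 0, which is an extrapolation far outside the data, while the centered intercept is the raw model's prediction at age 85.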

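The $R^2$ of the model that uses age as its only predictor is 0.03. Another way to arrive at this figure is to compute the proportional reduction in residual variance relative to the empty model. Strictly speaking this ratio is the adjusted $R^2$ (0.027 in the summary above), which rounds to the same value:

$
\frac{\texttt{empty\_model.mse\_resid} - \texttt{age\_model.mse\_resid}}{\texttt{empty\_model.mse\_resid}}
$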


In [28]:
print(f"{(empty_model.mse_resid-age_model.mse_resid)/empty_model.mse_resid:.2f}")

0.03
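Because `mse_resid` divides the residual sum of squares by the residual degrees of freedom, the proportional reduction in residual variance relative to the empty model is the adjusted $R^2$ rather than the plain $R^2$. The following sketch recomputes the adjusted value from the rounded $R^2$ values and sample size reported in the summaries; `adjusted_r_squared` is an illustrative helper, not a statsmodels function.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 550  # number of observations reported in the summaries

# Age-only model: R^2 = 0.029 with one predictor
adj_age = adjusted_r_squared(0.029, N_OBS, 1)

# Age and grip model: R^2 = 0.075 with two predictors
adj_age_grip = adjusted_r_squared(0.075, N_OBS, 2)

print(f"{adj_age:.3f}")       # 0.027, the Adj. R-squared of the age model
print(f"{adj_age_grip:.3f}")  # 0.072, the Adj. R-squared of the age and grip model
```

With 550 observations and only one or two predictors, the adjustment is small, which is why both versions round to the same two-decimal figure.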


In order to determine if the 

## Age and Grip

In [30]:
df["centered_grip"] = df[GRIP_COL] - 9

In [31]:
age_grip_model = statsmodels.formula.api.ols(formula="cognition ~ centered_age + centered_grip", data=df).fit()
age_grip_model.summary()

Dep. Variable:,cognition,R-squared:,0.075
Model:,OLS,Adj. R-squared:,0.072
Method:,Least Squares,F-statistic:,22.14
Date:,"Mon, 15 Jul 2024",Prob (F-statistic):,5.66e-10
Time:,08:53:46,Log-Likelihood:,-2076.8
No. Observations:,550,AIC:,4160.0
Df Residuals:,547,BIC:,4173.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,24.7008,0.452,54.662,0.000,23.813,25.588
centered_age,-0.4176,0.134,-3.115,0.002,-0.681,-0.154
centered_grip,0.8025,0.154,5.206,0.000,0.500,1.105

Omnibus:,17.496,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,12.268
Skew:,-0.248,Prob(JB):,0.00217
Kurtosis:,2.462,Cond. No.,3.57


In [34]:
print(f"{(empty_model.mse_resid-age_grip_model.mse_resid)/empty_model.mse_resid:.2f}")

0.07
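One way to read the coefficients of the age and grip model is to plug them into the fitted equation by hand. The sketch below hard-codes the estimates from the summary above. `predict_cognition` is a hypothetical helper written for illustration, and with a fitted statsmodels result one would normally call its `predict` method instead.

```python
# Coefficients copied from the age and grip model summary
INTERCEPT = 24.7008
AGE_SLOPE = -0.4176
GRIP_SLOPE = 0.8025

# Centering constants used in the notebook
AGE_CENTER = 85
GRIP_CENTER = 9


def predict_cognition(age, grip):
    """Expected cognition for a given age and grip strength under the fitted model."""
    return (
        INTERCEPT
        + AGE_SLOPE * (age - AGE_CENTER)
        + GRIP_SLOPE * (grip - GRIP_CENTER)
    )


# At the centering values, the prediction is the intercept itself
print(f"{predict_cognition(85, 9):.4f}")

# Five years older with the same grip strength: lower by 5 * 0.4176
print(f"{predict_cognition(90, 9):.4f}")

# Same age with grip strength two units higher: higher by 2 * 0.8025
print(f"{predict_cognition(85, 11):.4f}")
```

Each slope is the expected difference in cognition for a one-unit difference in that predictor while the other predictor is held constant.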
