# Model selection 

In [1]:
import pandas as pd
import statsmodels.formula.api as smf

In the previous sprint, we covered several aspects of `ols()` and illustrated them using the reaction times dataset offered by Gries:

In [2]:
df = pd.read_csv("../../datasets/gries/05-2_reactiontimes.csv", sep="\t")
df["LENGTH"] = df.CASE.str.len()
# Declare the level order of our categoricals so statsmodels can see it. Unless we use them in
# something like difference encoding it won't change the model fit much, but it can make the
# output a lot easier to interpret.
df["IMAGEABILITY"] = df.IMAGEABILITY.astype(
    pd.api.types.CategoricalDtype(categories=["lo", "hi"], ordered=True)
)
df["FAMILIARITY"] = df.FAMILIARITY.astype(
    pd.api.types.CategoricalDtype(categories=["lo", "med", "hi"], ordered=True)
)
df

Unnamed: 0,CASE,RT,FREQUENCY,FAMILIARITY,IMAGEABILITY,MEANINGFULNESS,LENGTH
0,almond,650.9947,0.693147,,,,6
1,ant,589.4347,1.945910,med,hi,415.0,3
2,apple,523.0493,2.302585,hi,hi,451.0,5
3,apricot,642.3342,0.693147,lo,lo,,7
4,asparagus,696.2092,0.693147,med,lo,442.0,9
...,...,...,...,...,...,...,...
72,tortoise,733.0323,1.386294,lo,lo,403.0,8
73,walnut,663.5908,2.484907,med,lo,468.0,6
74,wasp,725.7056,1.098612,,,,4
75,whale,609.9745,0.000000,med,hi,474.0,5


In [3]:
preds = df.columns[2:]
preds

Index(['FREQUENCY', 'FAMILIARITY', 'IMAGEABILITY', 'MEANINGFULNESS', 'LENGTH'], dtype='object')

We discussed how independent variables (numeric, binary or categorical) can be used to predict the reaction time, our main dependent variable. We also saw how to combine several predictors (using a `+` in the formula notation), since they all seem to have some worth in modelling RT. We could even combine them all:

In [4]:
model = smf.ols(f"RT ~ {' + '.join(preds)}", data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                     RT   R-squared:                       0.381
Model:                            OLS   Adj. R-squared:                  0.290
Method:                 Least Squares   F-statistic:                     4.207
Date:                Sun, 24 Nov 2024   Prob (F-statistic):            0.00219
Time:                        10:12:18   Log-Likelihood:                -238.94
No. Observations:                  48   AIC:                             491.9
Df Residuals:                      41   BIC:                             505.0
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            628.5299     69

Interestingly, the model as a whole is clearly significant (the $p$-value of the F-statistic is well below the conventional 0.05 threshold), but most of the individual parameter estimates now appear to be non-significant (apart from the simple LENGTH of a word). This is confusing and raises the question: which predictors should we include?

## Parameter Selection

If we have the opportunity to include multiple variables in a linear model, we'll need a way to determine whether it's a good idea to add them. In other words, we'll need strategies for **model selection**, i.e. determining the "best" model for our data. There are basically two options:

1. Forward selection: you start with a small model and gradually add more predictors, if these prove "useful".
2. Backward selection: you start out with a huge model that covers many predictors, and you gradually remove predictors that don't appear to be "useful".

Several views exist on this matter. Many people consider it bad practice to test for predictors that weren't theoretically motivated in the first place, but if you heed that advice, it can become difficult to discover new things. Both strategies ultimately have the same goal: "Arrive at a **minimally adequate model**", i.e. the smallest model that gives you an optimal fit. As a rule of thumb, smaller models are *always* to be preferred (cf. Occam's Razor). How do you determine whether a predictor is "useful"? There are two "metrics" that can be used:

1. Look at $p$-values (if they are significant you keep them in). The problem is that you still end up with many variables that are all somewhat significant, but in a very large model.
2. Work with a **criterion-based approach**, using for instance the **Akaike Information Criterion** (AIC) to compare models with and without a predictor. The nice property of AIC is that it penalizes models on the basis of their size (the number of estimated parameters): the AIC for a model with *many* parameters is automatically adjusted upwards, and since a lower AIC is better, an extra predictor has to improve the fit enough to pay for itself.
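As a rough illustration of how forward selection and a criterion fit together, here is a minimal sketch of a greedy loop. It is not the procedure used later in this notebook: the function name `forward_select`, the `score` callback and the toy criterion `toy_score` (with its made-up `GAIN` table) are all hypothetical and only serve to make the control flow visible. The only assumption about `score` is that lower values are better, as with AIC.

```python
def forward_select(candidates, score):
    """Greedy forward selection.

    `score(subset)` must return a number where lower is better (e.g. an AIC).
    """
    selected = []
    remaining = list(candidates)
    best = score(tuple(selected))  # score of the empty (intercept-only) model
    while remaining:
        # Try adding each remaining predictor and keep the best trial.
        trials = [(score(tuple(selected + [c])), c) for c in remaining]
        trial_score, trial_pred = min(trials)
        if trial_score >= best:
            break  # no single addition improves the criterion, so stop
        selected.append(trial_pred)
        remaining.remove(trial_pred)
        best = trial_score
    return selected, best


# Hypothetical toy criterion: each predictor improves the "fit" by a fixed,
# made-up amount, and every included predictor costs a penalty of 2.
GAIN = {"a": 10.0, "b": 3.0, "c": 1.0}


def toy_score(subset):
    return 100.0 - sum(GAIN[p] for p in subset) + 2 * len(subset)


print(forward_select(["a", "b", "c"], toy_score))  # (['a', 'b'], 91.0)
```

Predictor `c` is left out because its gain (1) is smaller than its penalty (2). With real data, `score` could be something like a function that fits `smf.ols("RT ~ " + (" + ".join(subset) or "1"), data=df)` and returns the `.aic` of the result. Backward selection would be the mirror image: start from the full set and drop the predictor whose removal lowers the score the most.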

<blockquote>
SIDEBAR AIC and BIC

You can read more than you ever wanted to about the AIC [on Wikipedia](https://en.wikipedia.org/wiki/Akaike_information_criterion), but the main things we want you to remember are:
- lower AIC (or BIC) values are better
- AIC is only useful to compare **models for the same thing**. It makes no sense as an absolute scale (i.e. don't compare the AIC for different datasets or different dependent variables)
- AIC (and even more so BIC) already penalize model parameters, so *as a rule of thumb*, if adding a parameter increases $R^2$ and decreases AIC, it is probably worthwhile
</blockquote>
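To make the penalty concrete, the sketch below recomputes AIC and BIC by hand from the log-likelihood reported in the full-model summary above (`Log-Likelihood: -238.94`, 48 observations, `Df Model: 6`). It assumes the usual definitions, $\mathrm{AIC} = 2k - 2\ln L$ and $\mathrm{BIC} = k \ln n - 2\ln L$, with $k$ counted the way statsmodels does for OLS (the regression coefficients including the intercept, so 6 + 1 = 7 here). The helper names `aic` and `bic` are just for illustration.

```python
import math


def aic(log_likelihood, k):
    # Each parameter adds 2 to the score; a better fit (higher log-likelihood) lowers it.
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    # Each parameter adds ln(n), which exceeds 2 once n is 8 or more.
    return k * math.log(n) - 2 * log_likelihood


llf, k, n = -238.94, 7, 48  # values read off the full-model summary
print(round(aic(llf, k), 1))     # 491.9
print(round(bic(llf, k, n), 1))  # 505.0
```

Both values match the `AIC` and `BIC` fields of that summary. Because $\ln 48 \approx 3.87$, BIC charges almost twice as much per parameter as AIC on this dataset, which is why it tends to favour smaller models.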

## Buckaroo Datascience

**For the sake of argument** let's look at a kind of hacky way to compare models. Since our prediction formula can be specified as a string, we can modify the specification string, and refit the model with every combination of parameters. The code below is a little challenging, so just think about the intuition.

We take *all subsets* of our five predictors (from one feature to four, and *all combinations at each size*), build a formula string from each subset, and then record a few model criteria. Remember that both AIC and BIC are good ways to compare model performance, and a **lower score is better**. The (adjusted) $R^2$ tells us how much variability is explained, and a **higher score is better**. So, we'll just combine them to create two *ad hoc* measures of Model Goodness, which we call `R2AIC` and `R2BIC`.
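Before running the full loop, it may help to look at the subset enumeration on its own. The sketch below uses only the standard library and a plain list of the five predictor names (typed out by hand here, rather than taken from `df.columns`) to show how many formulas the loop will fit and what they look like.

```python
from itertools import chain, combinations

preds = ["FREQUENCY", "FAMILIARITY", "IMAGEABILITY", "MEANINGFULNESS", "LENGTH"]

# combinations(preds, r) yields every subset of size r; chaining r = 1..4
# gives all non-empty subsets except the full five-predictor set.
subsets = list(chain.from_iterable(combinations(preds, r) for r in range(1, len(preds))))

print(len(subsets))  # 30  (5 + 10 + 10 + 5)

# The formula strings built from the first and the last subset
print(f"RT ~ {' + '.join(subsets[0])}")   # RT ~ FREQUENCY
print(f"RT ~ {' + '.join(subsets[-1])}")  # RT ~ FAMILIARITY + IMAGEABILITY + MEANINGFULNESS + LENGTH
```

Using `range(1, len(preds) + 1)` instead would also include the full model, for 31 subsets in total.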

In [5]:
from itertools import chain, combinations

res = []
for subset in chain.from_iterable(combinations(preds, r) for r in range(1, len(preds))):
    model = smf.ols(f"RT ~ {' + '.join(subset)}", data=df).fit()
    res.append(
        {
            "model": " + ".join(subset),
            "BIC": model.bic,
            "AIC": model.aic,
            "ADJR2": model.rsquared_adj,
            "R2AIC": 1 / model.aic * model.rsquared_adj,
            "R2BIC": 1 / model.bic * model.rsquared_adj,
        }
    )

This gives us a `list` of `dict` objects, with identical keys. This is one of the (many) ways to build a pandas `DataFrame`.

In [6]:
res[:3]

[{'model': 'FREQUENCY',
  'BIC': 838.1186784652348,
  'AIC': 833.4310676215274,
  'ADJR2': 0.20851919991526646,
  'R2AIC': 0.0002501936968948677,
  'R2BIC': 0.00024879435964499373},
 {'model': 'FAMILIARITY',
  'BIC': 585.8007482134077,
  'AIC': 579.7787486577103,
  'ADJR2': 0.2054617772138626,
  'R2AIC': 0.00035437962790037193,
  'R2BIC': 0.0003507366247661615},
 {'model': 'IMAGEABILITY',
  'BIC': 572.0371794822198,
  'AIC': 568.0965956551156,
  'ADJR2': 0.044895344959894556,
  'R2AIC': 7.902766061838885e-05,
  'R2BIC': 7.848326397338655e-05}]

Now we can rank our models!

In [7]:
pd.DataFrame(res).sort_values("R2AIC", ascending=False)

Unnamed: 0,model,BIC,AIC,ADJR2,R2AIC,R2BIC
23,FAMILIARITY + MEANINGFULNESS + LENGTH,497.421993,488.065988,0.320929,0.000658,0.000645
14,MEANINGFULNESS + LENGTH,492.394047,486.780444,0.313357,0.000644,0.000636
27,FREQUENCY + FAMILIARITY + MEANINGFULNESS + LENGTH,501.137868,489.910662,0.307007,0.000627,0.000613
29,FAMILIARITY + IMAGEABILITY + MEANINGFULNESS + ...,501.287542,490.060336,0.304843,0.000622,0.000608
20,FREQUENCY + MEANINGFULNESS + LENGTH,496.091368,488.606564,0.300291,0.000615,0.000605
24,IMAGEABILITY + MEANINGFULNESS + LENGTH,496.236317,488.751513,0.298175,0.00061,0.000601
26,FREQUENCY + FAMILIARITY + IMAGEABILITY + LENGTH,564.365891,552.544139,0.335448,0.000607,0.000594
28,FREQUENCY + IMAGEABILITY + MEANINGFULNESS + LE...,499.909941,490.553936,0.284803,0.000581,0.00057
22,FAMILIARITY + IMAGEABILITY + LENGTH,563.08507,553.23361,0.315421,0.00057,0.00056
17,FREQUENCY + FAMILIARITY + LENGTH,582.953582,572.916917,0.321765,0.000562,0.000552


In [8]:
for x in pd.DataFrame(res).sort_values("R2AIC", ascending=False).model.head(10):
    print(f'"{x}",')

"FAMILIARITY + MEANINGFULNESS + LENGTH",
"MEANINGFULNESS + LENGTH",
"FREQUENCY + FAMILIARITY + MEANINGFULNESS + LENGTH",
"FAMILIARITY + IMAGEABILITY + MEANINGFULNESS + LENGTH",
"FREQUENCY + MEANINGFULNESS + LENGTH",
"IMAGEABILITY + MEANINGFULNESS + LENGTH",
"FREQUENCY + FAMILIARITY + IMAGEABILITY + LENGTH",
"FREQUENCY + IMAGEABILITY + MEANINGFULNESS + LENGTH",
"FAMILIARITY + IMAGEABILITY + LENGTH",
"FREQUENCY + FAMILIARITY + LENGTH",


... so the best model **by this ranking** has only three features. Thinking about some sanity checks:
- There are many models with an AIC/BIC in the high 400s, so nothing exceptional is happening (many are much worse)
- The Adjusted $R^2$ for the top model doesn't clearly stand out (it's not even the best)
- This is the best combination, and also one of the most parsimonious 

So it seems fair. The second-ranked model is also worth a very close look -- we would probably need to go on and think hard about our predictors to make a choice. Some things to think about:
- Is there a good theoretical *reason* to include `FAMILIARITY` -- it scored the best of the single predictor models?
    - This can only be answered according to the scientific hypothesis motivating the study!
- Is there 'overlap' between `FAMILIARITY` and `MEANINGFULNESS`? (the two-predictor model `FAMILIARITY + LENGTH` didn't do so well...)

Some people would choose the leaner model on principle; some wouldn't. There is no right or wrong answer: it is a choice that you need to make using your scientific expertise, *and not just by looking at numbers*. Sorry.
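One caveat that may be worth checking before trusting any AIC/BIC ranking: by default, the statsmodels formula interface drops rows with missing values in the variables a formula uses. The data frame above shows 76 rows while the model summaries report 48 observations, which suggests that models with different predictors may have been fitted to different numbers of rows. Since the log-likelihood grows with the number of observations, AIC and BIC are only comparable between models fitted to the same rows. The sketch below illustrates the effect on synthetic toy data (not the reaction-time dataset; the names `toy`, `x1`, `x2` and `y` are made up for the example).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic toy data: 60 rows, of which the first 20 are missing x2.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
toy["y"] = 2.0 * toy["x1"] + rng.normal(size=60)
toy.loc[:19, "x2"] = np.nan

# Each formula silently drops the rows that are incomplete *for that formula*.
m_small = smf.ols("y ~ x1", data=toy).fit()
m_big = smf.ols("y ~ x1 + x2", data=toy).fit()
print(m_small.nobs, m_big.nobs)  # 60.0 40.0

# Restricting to complete cases first puts both models on the same rows,
# so their AIC/BIC values become comparable.
complete = toy.dropna(subset=["y", "x1", "x2"])
m_small_cc = smf.ols("y ~ x1", data=complete).fit()
m_big_cc = smf.ols("y ~ x1 + x2", data=complete).fit()
print(m_small_cc.nobs, m_big_cc.nobs)  # 40.0 40.0
```

For the ranking in this notebook, the analogous step would be to fit every subset on something like `df.dropna(subset=["RT", *preds])`. The price is that rows are discarded even for models that could have used them, so this is again a judgment call.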

Anyway, here are your two leading candidates. Choose your fighter.

In [9]:
model = smf.ols("RT ~ C(FAMILIARITY,Diff) + MEANINGFULNESS + LENGTH", data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                     RT   R-squared:                       0.379
Model:                            OLS   Adj. R-squared:                  0.321
Method:                 Least Squares   F-statistic:                     6.553
Date:                Sun, 24 Nov 2024   Prob (F-statistic):           0.000329
Time:                        10:12:18   Log-Likelihood:                -239.03
No. Observations:                  48   AIC:                             488.1
Df Residuals:                      43   BIC:                             497.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

In [10]:
model = smf.ols("RT ~ MEANINGFULNESS + LENGTH", data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                     RT   R-squared:                       0.343
Model:                            OLS   Adj. R-squared:                  0.313
Method:                 Least Squares   F-statistic:                     11.72
Date:                Sun, 24 Nov 2024   Prob (F-statistic):           7.97e-05
Time:                        10:12:18   Log-Likelihood:                -240.39
No. Observations:                  48   AIC:                             486.8
Df Residuals:                      45   BIC:                             492.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept        664.7122     59.808     11.

```
Version History

Current: v1.0.1

7/10/24: 1.0.0: first draft, BN
09/10/24: 1.0.1: proofread, MK
```