# Stepwise Procedures

* Backward Elimination: involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose removal gives the least statistically significant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.
* Forward Selection: involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.
* Mixed Selection: a combination of the above, testing at each step for variables to be included or excluded.

In [None]:
# get data https://www.javahabit.com/2019/02/10/part-5-ml-mltr-backward-elimination/
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/gitmystuff/Datasets/refs/heads/main/Startups.csv')
df.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [None]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['Profit'], axis=1),
    df['Profit'],
    test_size=0.25,
    random_state=42)

In [None]:
# use sklearn one hot encoder
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 uses sparse_output; the older `sparse` argument was removed in 1.4
ohe = OneHotEncoder(categories='auto', drop='first', sparse_output=False, handle_unknown='ignore')

cat_features = ['State']
ohe_train = ohe.fit_transform(X_train[cat_features])
ohe_train = pd.DataFrame(ohe_train, columns=ohe.get_feature_names_out(cat_features))
ohe_train.index = X_train.index
X_train = X_train.join(ohe_train)
X_train.drop(cat_features, axis=1, inplace=True)

ohe_test = ohe.transform(X_test[cat_features])
ohe_test = pd.DataFrame(ohe_test, columns=ohe.get_feature_names_out(cat_features))
ohe_test.index = X_test.index
X_test = X_test.join(ohe_test)
X_test.drop(cat_features, axis=1, inplace=True)

print(X_train.shape)
print(X_test.shape)
print(X_train.info())

(37, 5)
(13, 5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 8 to 38
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        37 non-null     float64
 1   Administration   37 non-null     float64
 2   Marketing Spend  37 non-null     float64
 3   State_Florida    37 non-null     float64
 4   State_New York   37 non-null     float64
dtypes: float64(5)
memory usage: 2.8 KB
None


In [None]:
import statsmodels.api as sm

# statsmodels OLS does not add an intercept on its own, so add a constant column
# alternative: X_train = sm.add_constant(X_train)
X_train.insert(0, 'const', 1)
model = sm.OLS(y_train, X_train).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.954
Model:,OLS,Adj. R-squared:,0.947
Method:,Least Squares,F-statistic:,129.9
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,7.82e-20
Time:,17:36:31,Log-Likelihood:,-389.14
No. Observations:,37,AIC:,790.3
Df Residuals:,31,BIC:,799.9
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.65e+04,9705.241,5.821,0.000,3.67e+04,7.63e+04
R&D Spend,0.8108,0.055,14.860,0.000,0.700,0.922
Administration,-0.0899,0.069,-1.297,0.204,-0.231,0.051
Marketing Spend,0.0299,0.023,1.326,0.195,-0.016,0.076
State_Florida,275.3521,4144.948,0.066,0.947,-8178.325,8729.029
State_New York,-337.2775,3997.891,-0.084,0.933,-8491.031,7816.476

0,1,2,3
Omnibus:,16.694,Durbin-Watson:,1.727
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.904
Skew:,-1.249,Prob(JB):,1.06e-05
Kurtosis:,5.935,Cond. No.,1800000.0


## Backward Elimination Process

* Note Adj. R-squared (higher is better)
* Note AIC (lower is better)
* Note BIC (lower is better)
* Note any P>|t| greater than 0.05

Let's drop the feature with the highest P>|t| (here State_Florida, at 0.947) and refit to see whether Adj. R-squared, AIC, and BIC improve.

**AIC**: The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

https://en.wikipedia.org/wiki/Akaike_information_criterion

**BIC**: In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

https://en.wikipedia.org/wiki/Bayesian_information_criterion
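
As a quick check on how these criteria are computed, the sketch below reproduces the AIC and BIC of the full model using the standard formulas AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). The helper names `aic` and `bic` are hypothetical. The inputs are read from the full-model summary above: log-likelihood -389.14, n = 37 observations, and k = 6 estimated parameters (five features plus the constant).

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood


# Values taken from the full-model summary (Log-Likelihood, No. Observations, Df Model + 1).
llf, n_obs, k_params = -389.14, 37, 6

aic_full = aic(llf, k_params)
bic_full = bic(llf, k_params, n_obs)
print(round(aic_full, 1))  # 790.3
print(round(bic_full, 1))  # 799.9
```

Both criteria penalize extra parameters. BIC's penalty of ln(37), about 3.6 per parameter, is larger than AIC's penalty of 2, so BIC tends to favor smaller models.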

In [None]:
# step 1: drop State_Florida, the feature with the highest P>|t| (0.947)
model = sm.OLS(y_train, X_train.drop(['State_Florida'], axis=1)).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.954
Model:,OLS,Adj. R-squared:,0.949
Method:,Least Squares,F-statistic:,167.5
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,5.64e-21
Time:,17:36:31,Log-Likelihood:,-389.14
No. Observations:,37,AIC:,788.3
Df Residuals:,32,BIC:,796.3
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.658e+04,9481.835,5.967,0.000,3.73e+04,7.59e+04
R&D Spend,0.8103,0.053,15.219,0.000,0.702,0.919
Administration,-0.0899,0.068,-1.317,0.197,-0.229,0.049
Marketing Spend,0.0303,0.022,1.408,0.169,-0.014,0.074
State_New York,-476.0515,3355.251,-0.142,0.888,-7310.474,6358.371

0,1,2,3
Omnibus:,16.898,Durbin-Watson:,1.724
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23.524
Skew:,-1.255,Prob(JB):,7.79e-06
Kurtosis:,5.993,Cond. No.,1780000.0


In [None]:
# step 2: also drop State_New York (P>|t| = 0.888)
model = sm.OLS(y_train, X_train.drop(['State_Florida', 'State_New York'], axis=1)).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.954
Model:,OLS,Adj. R-squared:,0.95
Method:,Least Squares,F-statistic:,230.2
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,3.43e-22
Time:,17:36:31,Log-Likelihood:,-389.15
No. Observations:,37,AIC:,786.3
Df Residuals:,33,BIC:,792.8
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.636e+04,9219.913,6.113,0.000,3.76e+04,7.51e+04
R&D Spend,0.8093,0.052,15.571,0.000,0.704,0.915
Administration,-0.0891,0.067,-1.330,0.193,-0.225,0.047
Marketing Spend,0.0305,0.021,1.439,0.160,-0.013,0.074

0,1,2,3
Omnibus:,16.634,Durbin-Watson:,1.751
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.805
Skew:,-1.244,Prob(JB):,1.12e-05
Kurtosis:,5.932,Cond. No.,1750000.0


In [None]:
# step 3: also drop Administration (P>|t| = 0.193)
model = sm.OLS(y_train, X_train.drop(['State_Florida', 'State_New York', 'Administration'], axis=1)).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.952
Model:,OLS,Adj. R-squared:,0.949
Method:,Least Squares,F-statistic:,336.8
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,3.88e-23
Time:,17:36:31,Log-Likelihood:,-390.12
No. Observations:,37,AIC:,786.2
Df Residuals:,34,BIC:,791.1
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.502e+04,3549.675,12.684,0.000,3.78e+04,5.22e+04
R&D Spend,0.7838,0.049,16.042,0.000,0.685,0.883
Marketing Spend,0.0402,0.020,1.999,0.054,-0.001,0.081

0,1,2,3
Omnibus:,13.268,Durbin-Watson:,1.745
Prob(Omnibus):,0.001,Jarque-Bera (JB):,15.266
Skew:,-1.085,Prob(JB):,0.000484
Kurtosis:,5.278,Cond. No.,617000.0


In [None]:
# step 4: also drop Marketing Spend (P>|t| = 0.054, just above 0.05)
model = sm.OLS(y_train, X_train.drop(['State_Florida', 'State_New York', 'Administration', 'Marketing Spend'], axis=1)).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.946
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,616.8
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,8.17e-24
Time:,17:36:31,Log-Likelihood:,-392.18
No. Observations:,37,AIC:,788.4
Df Residuals:,35,BIC:,791.6
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.889e+04,3102.222,15.758,0.000,4.26e+04,5.52e+04
R&D Spend,0.8557,0.034,24.836,0.000,0.786,0.926

0,1,2,3
Omnibus:,14.01,Durbin-Watson:,1.88
Prob(Omnibus):,0.001,Jarque-Bera (JB):,17.119
Skew:,-1.102,Prob(JB):,0.000192
Kurtosis:,5.499,Cond. No.,170000.0


## Forward Selection Process

* Note Adj. R-squared (higher is better)
* Note AIC (lower is better)
* Note BIC (lower is better)

In [None]:
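# forward selection, step 1: R&D Spend only
# 'const' is not among the selected columns, so this fit has no intercept (uncentered R-squared)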
model = sm.OLS(y_train, X_train[['R&D Spend']]).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared (uncentered):,0.949
Model:,OLS,Adj. R-squared (uncentered):,0.947
Method:,Least Squares,F-statistic:,663.5
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,8.7e-25
Time:,17:36:31,Log-Likelihood:,-430.86
No. Observations:,37,AIC:,863.7
Df Residuals:,36,BIC:,865.3
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,1.3166,0.051,25.758,0.000,1.213,1.420

0,1,2,3
Omnibus:,5.218,Durbin-Watson:,1.634
Prob(Omnibus):,0.074,Jarque-Bera (JB):,1.902
Skew:,-0.043,Prob(JB):,0.386
Kurtosis:,1.893,Cond. No.,1.0


In [None]:
# forward selection, step 2: add Marketing Spend (no 'const', so still no intercept)
model = sm.OLS(y_train, X_train[['R&D Spend', 'Marketing Spend']]).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared (uncentered):,0.967
Model:,OLS,Adj. R-squared (uncentered):,0.966
Method:,Least Squares,F-statistic:,519.1
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,9.63e-27
Time:,17:36:31,Log-Likelihood:,-422.42
No. Observations:,37,AIC:,848.8
Df Residuals:,35,BIC:,852.1
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.8340,0.115,7.258,0.000,0.601,1.067
Marketing Spend,0.1790,0.040,4.499,0.000,0.098,0.260

0,1,2,3
Omnibus:,3.294,Durbin-Watson:,2.156
Prob(Omnibus):,0.193,Jarque-Bera (JB):,1.536
Skew:,0.061,Prob(JB):,0.464
Kurtosis:,2.009,Cond. No.,8.9


In [None]:
# forward selection, step 3: add Administration (no 'const', so still no intercept)
model = sm.OLS(y_train, X_train[['R&D Spend', 'Marketing Spend', 'Administration']]).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared (uncentered):,0.988
Model:,OLS,Adj. R-squared (uncentered):,0.987
Method:,Least Squares,F-statistic:,972.9
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,5.2e-33
Time:,17:36:31,Log-Likelihood:,-403.16
No. Observations:,37,AIC:,812.3
Df Residuals:,34,BIC:,817.2
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.7102,0.071,9.997,0.000,0.566,0.855
Marketing Spend,0.0969,0.026,3.708,0.001,0.044,0.150
Administration,0.2897,0.037,7.892,0.000,0.215,0.364

0,1,2,3
Omnibus:,0.653,Durbin-Watson:,1.884
Prob(Omnibus):,0.722,Jarque-Bera (JB):,0.684
Skew:,-0.011,Prob(JB):,0.71
Kurtosis:,2.334,Cond. No.,9.79


In [None]:
# forward selection, step 4: add State_New York (no 'const', so still no intercept)
model = sm.OLS(y_train, X_train[['R&D Spend', 'Marketing Spend', 'Administration', 'State_New York']]).fit()
model.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared (uncentered):,0.989
Model:,OLS,Adj. R-squared (uncentered):,0.987
Method:,Least Squares,F-statistic:,715.4
Date:,"Sat, 25 Jun 2022",Prob (F-statistic):,1.51e-31
Time:,17:36:31,Log-Likelihood:,-402.98
No. Observations:,37,AIC:,814.0
Df Residuals:,33,BIC:,820.4
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
R&D Spend,0.7065,0.072,9.809,0.000,0.560,0.853
Marketing Spend,0.0964,0.026,3.646,0.001,0.043,0.150
Administration,0.2856,0.038,7.564,0.000,0.209,0.362
State_New York,2724.0444,4740.577,0.575,0.569,-6920.731,1.24e+04

0,1,2,3
Omnibus:,1.238,Durbin-Watson:,1.947
Prob(Omnibus):,0.539,Jarque-Bera (JB):,0.954
Skew:,-0.079,Prob(JB):,0.621
Kurtosis:,2.229,Cond. No.,620000.0


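AIC and BIC both fall as R&D Spend, Marketing Spend, and Administration are added (AIC 863.7, 848.8, 812.3; BIC 865.3, 852.1, 817.2), then rise when State_New York is added (AIC 814.0, BIC 820.4, with P>|t| = 0.569), so forward selection stops at the three-feature model. Note that these forward-selection fits leave out the `const` column, so they are regressions through the origin: the reported R-squared is uncentered and cannot be compared with the R-squared of the backward-elimination models, and by AIC the intercept models (786.2 at best) fit better than any of these (812.3 at best).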