Stepwise regression is a method used in statistics and machine learning to select a subset of features for building a linear regression model. Stepwise regression aims to minimize the model’s complexity while maintaining a high accuracy level.

This method is particularly useful in cases where the number of features is large, and it’s unclear which features are important for the prediction.

What if, instead of using all of the independent variables, we find the ones that are relevant and use only those in our regression analysis?

In [1]:
import seaborn as sns
# Load iris dataset from seaborn
iris = sns.load_dataset("iris")
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [2]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [4]:
# Define the dependent and independent variables
x = iris.drop(["petal_length",'species'], axis=1)
y = iris['petal_length']

In [5]:
# Create a linear regression estimator
estimator = LinearRegression()

# Create the RFE object and specify the number of features to select
selector = RFE(estimator, n_features_to_select=1)

# Fit the RFE object to the data
selector = selector.fit(x, y)

# Print the selected features
print(x.columns[selector.support_])

Index(['petal_width'], dtype='object')
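Note that RFE is not driven by p-values: with a linear estimator it repeatedly drops the feature with the smallest coefficient magnitude, so the result depends on feature scale. The sketch below shows how the full elimination order can be inspected through `ranking_`. It assumes scikit-learn's bundled iris data (with its own column names) as a stand-in for the seaborn copy, so it runs without a download.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Assumption: scikit-learn's bundled iris data stands in for the seaborn copy
iris_sk = load_iris(as_frame=True).frame.drop(columns="target")
target_col = "petal length (cm)"
features = iris_sk.drop(columns=target_col)
target = iris_sk[target_col]

rfe = RFE(LinearRegression(), n_features_to_select=1).fit(features, target)

# ranking_ is 1 for the selected feature; larger numbers were eliminated earlier
ranking = dict(zip(features.columns, rfe.ranking_))
print(ranking)
```

A feature ranked 3 here was the first one removed, and the feature ranked 1 is the one reported by `support_`.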


In [7]:
!pip install stepwise-regression

Collecting stepwise-regression
  Downloading stepwise_regression-1.0.3-py3-none-any.whl (3.3 kB)
Installing collected packages: stepwise-regression
Successfully installed stepwise-regression-1.0.3


In [8]:
import pandas as pd
import statsmodels.api as sm
from stepwise_regression import step_reg

In [10]:
df = pd.read_csv('C:/Users/Admin/Machine Learning chapter 5/stepwisedataset.csv',index_col=False)
df.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [13]:
df.isnull().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
BMI                                 34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
thinness  1-19 years                34
thinness 5-9 years                  34
Income composition of resources    167
Schooling                          163
dtype: int64

In [14]:
df.isnull().sum().sum()

2563

In [15]:
df_impute = df.interpolate(method='linear')
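Linear interpolation fills each gap from the neighbouring rows in index order, so it knows nothing about grouping columns such as Country, and it cannot fill a gap that has no earlier value. The toy example below (made-up numbers, hypothetical column name) shows both the filled value and a leading gap that survives, which is why re-checking `isnull()` after imputation is a reasonable habit.

```python
import numpy as np
import pandas as pd

# Toy column with a leading gap and an interior gap (values are made up)
toy = pd.DataFrame({"gdp": [np.nan, 100.0, np.nan, 300.0]})

filled = toy.interpolate(method="linear")
print(filled)

# Count what is still missing after interpolation
remaining = int(filled.isnull().sum().sum())
print(remaining)
```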

(4) Stepwise regression using the packages
In this example, we will create a model to predict Life expectancy. We first split the data into independent variables and the dependent variable. We also drop the string/categorical variables (Country and Status) from the independent variables, because the aim of this post is to show how to use the packages for stepwise regression. Methods for encoding them and including them in the model are covered in a separate post.

In [16]:
X = df_impute.drop(['Country','Status','Life expectancy'],axis=1)
y = df_impute['Life expectancy']

In [20]:
# Before using the packages, we create a linear regression model that includes all the selected variables.
# add a constant 
X = sm.add_constant(X)

# define the model and fit it
model = sm.OLS(y, X)
results = model.fit()

In [21]:
results.summary()

Dep. Variable:,Life expectancy,R-squared:,0.815
Model:,OLS,Adj. R-squared:,0.813
Method:,Least Squares,F-statistic:,674.9
Date:,"Wed, 07 Jun 2023",Prob (F-statistic):,0.0
Time:,04:48:35,Log-Likelihood:,-8310.2
No. Observations:,2938,AIC:,16660.0
Df Residuals:,2918,BIC:,16780.0
Df Model:,19,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,55.4834,35.209,1.576,0.115,-13.554,124.521
Year,-0.0004,0.018,-0.023,0.981,-0.035,0.034
Adult Mortality,-0.0203,0.001,-25.185,0.000,-0.022,-0.019
infant deaths,0.1000,0.009,11.721,0.000,0.083,0.117
Alcohol,0.1254,0.024,5.152,0.000,0.078,0.173
percentage expenditure,0.0002,7.97e-05,2.866,0.004,7.21e-05,0.000
Hepatitis B,-0.0086,0.004,-2.339,0.019,-0.016,-0.001
Measles,-3.107e-05,7.78e-06,-3.993,0.000,-4.63e-05,-1.58e-05
BMI,0.0475,0.005,9.648,0.000,0.038,0.057

Omnibus:,148.605,Durbin-Watson:,0.722
Prob(Omnibus):,0.0,Jarque-Bera (JB):,406.715
Skew:,-0.245,Prob(JB):,4.82e-89
Kurtosis:,4.756,Cond. No.,26000000000.0


From the p-values, we can see that Year, Population, thinness 1–19 years and thinness 5–9 years are statistically insignificant at the 0.05 significance level.

Next, let’s see if we can use the package to help us select the feature variables.

(II) Stepwise-regression model
The package provides two functions, backward_regression and forward_regression. Each takes four parameters:

X: the independent variables

y: the dependent variable

threshold_in: the threshold value set for p-value, normally 0.05

verbose: the default is False

In [22]:
backselect = step_reg.backward_regression(X, y, 0.05,verbose=False)
backselect

['const',
 'Adult Mortality',
 'infant deaths',
 'Alcohol',
 'percentage expenditure',
 'Hepatitis B',
 'Measles',
 'BMI',
 'under-five deaths',
 'Polio',
 'Total expenditure',
 'Diphtheria',
 'HIV/AIDS',
 'GDP',
 'thinness  1-19 years',
 'Income composition of resources',
 'Schooling']

In [23]:
len(backselect)  # 17 columns selected, including the constant

17

Next, let’s fit a linear regression to check whether all the predictors are significant at the 0.05 level.

In [28]:
X_backselect = X[['Adult Mortality',
 'infant deaths',
 'Alcohol',
 'percentage expenditure',
 'Hepatitis B',
 'Measles',
 'BMI',
 'under-five deaths',
 'Polio',
 'Total expenditure',
 'Diphtheria',
 'HIV/AIDS',
 'GDP',
 'thinness  1-19 years',
 'Income composition of resources',
 'Schooling']]
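Since the selection step returns a plain list of column names, the list itself can be used to index the design matrix instead of retyping every name. A small sketch with a toy frame (column names and values are illustrative only):

```python
import pandas as pd

# Toy design matrix; the column names are illustrative only
toy_X = pd.DataFrame(
    {"const": [1.0, 1.0], "a": [0.1, 0.2], "b": [3.0, 4.0], "c": [5.0, 6.0]}
)

# Same shape as the list returned by the selection step, constant included
toy_selected = ["const", "a", "c"]

# Indexing with the list keeps exactly those columns, in that order
toy_subset = toy_X[toy_selected]
print(toy_subset.columns.tolist())
```

In the notebook, `X[backselect]` should give the same columns as the hand-typed list plus the constant, because `backselect` already contains 'const'.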

In [30]:
# add a constant 
X_backselect = sm.add_constant(X_backselect)

In [31]:
# define the model and fit it
backmodel = sm.OLS(y, X_backselect)

backres = backmodel.fit()

backres.summary()

Dep. Variable:,Life expectancy,R-squared:,0.815
Model:,OLS,Adj. R-squared:,0.813
Method:,Least Squares,F-statistic:,801.7
Date:,"Wed, 07 Jun 2023",Prob (F-statistic):,0.0
Time:,05:01:18,Log-Likelihood:,-8311.0
No. Observations:,2938,AIC:,16660.0
Df Residuals:,2921,BIC:,16760.0
Df Model:,16,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,54.6782,0.559,97.782,0.000,53.582,55.775
Adult Mortality,-0.0203,0.001,-25.318,0.000,-0.022,-0.019
infant deaths,0.1019,0.008,12.157,0.000,0.085,0.118
Alcohol,0.1260,0.024,5.275,0.000,0.079,0.173
percentage expenditure,0.0002,7.95e-05,2.864,0.004,7.18e-05,0.000
Hepatitis B,-0.0087,0.004,-2.375,0.018,-0.016,-0.002
Measles,-3.15e-05,7.75e-06,-4.063,0.000,-4.67e-05,-1.63e-05
BMI,0.0475,0.005,9.696,0.000,0.038,0.057
under-five deaths,-0.0762,0.006,-12.294,0.000,-0.088,-0.064

Omnibus:,150.552,Durbin-Watson:,0.721
Prob(Omnibus):,0.0,Jarque-Bera (JB):,410.692
Skew:,-0.251,Prob(JB):,6.6e-90
Kurtosis:,4.761,Cond. No.,133000.0


From the above results, we can see that all p-values are less than or equal to 0.05, which indicates that all the selected predictors are statistically significant at the 0.05 level.

Forward selection method

In [32]:
forwardselect = step_reg.forward_regression(X, y, 0.05,verbose=False)
forwardselect

['Income composition of resources',
 'const',
 'Adult Mortality',
 'HIV/AIDS',
 'Schooling',
 'Diphtheria',
 'BMI',
 'percentage expenditure',
 'Measles',
 'Polio',
 'Alcohol',
 'thinness  1-19 years',
 'Hepatitis B',
 'Total expenditure']

In [34]:
len(forwardselect)

14

In [33]:
X_forwardselect = X[['Income composition of resources',
 'const',
 'Adult Mortality',
 'HIV/AIDS',
 'Schooling',
 'Diphtheria',
 'BMI',
 'percentage expenditure',
 'Measles',
 'Polio',
 'Alcohol',
 'thinness  1-19 years',
 'Hepatitis B',
 'Total expenditure']]

Compared with the backward selection method, the forward method selects only 14 columns (13 predictors plus the constant). Let’s create a model based on these 14 columns and examine its results.

In [35]:
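# 'const' is already among the selected columns, so add_constant leaves the data unchanged (has_constant='skip' is its default)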
X_forwardselect = sm.add_constant(X_forwardselect)

In [36]:
# define the model and fit it
forwardmodel = sm.OLS(y, X_forwardselect)

forward = forwardmodel.fit()

forward.summary()

Dep. Variable:,Life expectancy,R-squared:,0.805
Model:,OLS,Adj. R-squared:,0.804
Method:,Least Squares,F-statistic:,926.5
Date:,"Wed, 07 Jun 2023",Prob (F-statistic):,0.0
Time:,05:11:55,Log-Likelihood:,-8387.1
No. Observations:,2938,AIC:,16800.0
Df Residuals:,2924,BIC:,16890.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Income composition of resources,7.0200,0.644,10.895,0.000,5.757,8.283
const,53.4215,0.564,94.731,0.000,52.316,54.527
Adult Mortality,-0.0208,0.001,-25.370,0.000,-0.022,-0.019
HIV/AIDS,-0.4833,0.018,-26.616,0.000,-0.519,-0.448
Schooling,0.6287,0.042,14.812,0.000,0.545,0.712
Diphtheria,0.0473,0.005,9.746,0.000,0.038,0.057
BMI,0.0486,0.005,9.703,0.000,0.039,0.058
percentage expenditure,0.0004,4.38e-05,8.232,0.000,0.000,0.000
Measles,-4.41e-05,7.06e-06,-6.244,0.000,-5.79e-05,-3.03e-05

Omnibus:,158.311,Durbin-Watson:,0.723
Prob(Omnibus):,0.0,Jarque-Bera (JB):,447.219
Skew:,-0.257,Prob(JB):,7.72e-98
Kurtosis:,4.841,Cond. No.,98800.0


From the above results, we can see that all p-values are less than or equal to 0.05, which indicates that all the selected predictors are statistically significant at the 0.05 level.

Comparing the results of backward and forward stepwise regression, we find that the goodness of fit (R²) of the backward method (0.815) is higher than that of the forward method (0.805) in our example.

Forward and backward stepwise regression often arrive at the same result, but this is not guaranteed: here, starting from the same candidate features, the two methods selected different subsets.