https://statisticsbyjim.com/regression/confounding-variables-bias/

If the error term is correlated with a feature, it is a sign that a relevant variable collinear with that feature is missing from the model. This tells us which independent variable correlates with the confounding variable.

You saw one method of detecting omitted variable bias in this post. If you include different combinations of independent variables in the model, and you see the coefficients changing, you’re watching omitted variable bias in action!
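The effect described above can be reproduced with a minimal sketch on synthetic data. Everything here is an assumption made for illustration (the variable names `x1` and `x2`, the true coefficients 2 and 3, and the helper `ols_coefs`); it is not the rock-property dataset used later in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic data (assumed for illustration): x2 is correlated with x1,
# and both truly affect y with coefficients 2 and 3.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)


def ols_coefs(columns, y):
    """Return OLS coefficients (intercept first) for the given feature columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


coef_short = ols_coefs([x1], y)      # x2 omitted from the model
coef_long = ols_coefs([x1, x2], y)   # x2 included in the model

print('x1 coefficient, x2 omitted  : %.2f' % coef_short[1])
print('x1 coefficient, x2 included : %.2f' % coef_long[1])
```

When `x2` is left out, the `x1` coefficient absorbs part of its effect and lands near 2 + 3 × 0.8 = 4.4 instead of the true value of 2. Adding `x2` back moves the coefficient close to 2, which is the kind of coefficient shift the quote refers to.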

**Multicollinearity**

It’s important to note a tradeoff that might occur between precision and bias. As you include the formerly omitted variables, you lessen the bias, but the multicollinearity can potentially reduce the precision of the estimates.
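A small simulation can make this tradeoff concrete. The sketch below is only an illustration under assumed settings (a correlation of 0.95 between two synthetic features, 100 samples per trial, 500 trials); none of these numbers come from the quoted article.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_trials, rho = 100, 500, 0.95   # assumed simulation settings
true_b1 = 2.0                       # assumed true coefficient of x1

short_estimates, long_estimates = [], []
for _ in range(n_trials):
    # x2 is strongly correlated with x1 (correlation ~ rho)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = true_b1 * x1 + 3.0 * x2 + rng.normal(size=n)

    X_short = np.column_stack([np.ones(n), x1])        # x2 omitted
    X_long = np.column_stack([np.ones(n), x1, x2])     # x2 included
    short_estimates.append(np.linalg.lstsq(X_short, y, rcond=None)[0][1])
    long_estimates.append(np.linalg.lstsq(X_long, y, rcond=None)[0][1])

short_estimates = np.array(short_estimates)
long_estimates = np.array(long_estimates)

# Bias: how far the average estimate is from the true coefficient.
bias_short = short_estimates.mean() - true_b1
bias_long = long_estimates.mean() - true_b1

# Precision: how much the estimate varies from sample to sample.
spread_short = short_estimates.std()
spread_long = long_estimates.std()

print('x2 omitted  : bias = %.2f, std = %.2f' % (bias_short, spread_short))
print('x2 included : bias = %.2f, std = %.2f' % (bias_long, spread_long))
```

Under these assumptions, the model without `x2` gives a tightly clustered but biased estimate, while the model with `x2` is nearly unbiased but varies more from sample to sample.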


**Constant term**

A portion of the estimation process for the y-intercept is based on the exclusion of relevant variables from the regression model. When you leave relevant variables out, this can produce bias in the model. Bias exists if the residuals have an overall positive or negative mean. In other words, the model tends to make predictions that are systematically too high or too low. The constant term prevents this overall bias by forcing the residual mean to equal zero.
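The role of the constant term can be checked numerically. This is a minimal sketch on assumed synthetic data (a true intercept of 5 and slope of 2); it fits the same data with and without a constant column and compares the residual means.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

# Synthetic data (assumed for illustration): true intercept 5, true slope 2.
x = rng.uniform(0, 10, size=n)
y = 5.0 + 2.0 * x + rng.normal(size=n)

# Fit WITH a constant term: a column of ones is added to the design matrix.
X_with = np.column_stack([np.ones(n), x])
coef_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
resid_with = y - X_with @ coef_with

# Fit WITHOUT a constant term: the line is forced through the origin.
X_without = x.reshape(-1, 1)
coef_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)
resid_without = y - X_without @ coef_without

print('Residual mean with constant    : %.6f' % resid_with.mean())
print('Residual mean without constant : %.6f' % resid_without.mean())
```

With the constant term, the residual mean is zero up to floating-point error. Without it, the fitted line is forced through the origin and the residuals keep a clearly non-zero mean, meaning the predictions are systematically off in one direction.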

### Semi-partial correlation

https://ourcodingclub.github.io/2017/03/15/mixed-models.html#second

But we are not interested in quantifying test scores for each specific mountain range: we just want to know whether body length affects test scores and we want to simply control for the variation coming from mountain ranges.

<div id="Model instability with feature selection"></div>

### 3.3. Model instability with feature selection

In this section, I show that the values of the regression coefficients fluctuate with different choices of features. We will use real-life data: rock properties and natural gas production. Please refer to <i>Section 0: Sample data description</i> of my <a href="https://aegis4048.github.io/mutiple_linear_regression_and_visualization_in_python#0.-Sample-data-description" target="_blank">previous post</a> for more information about this dataset.

<a href="#fig-7">Figure (7)</a> shows the importance of each feature for predicting the response variable, computed with permutation feature ranking. We see that <i>Por</i> and <i>Brittle</i> are the most important features.

<div id="fig-7" class="row full_screen_margin_md mobile_responsive_plot_full_width" style="margin-top: 15px;">
    <div class="col"><img src="jupyter_images/multiple_linear_permutation_feature_importance.png"></div>
    <div class="col-12"><p class="image-description">Figure 7: Permutation feature ranking</p></div>
</div>

<div class="solution_panel closed">
    <div class="solution_title">
        <p class="solution_title_string">Source Code For Figure (7)</p>
        <ul class="nav navbar-right panel_toolbox">
            <li><a class="collapse-link"><i class="fa fa-chevron-down"></i></a></li>
        </ul>
    <div class="clearfix"></div>
    </div>
    <div class="solution_content">
        <pre>
            <code class="language-python">
                import rfpimp
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################
                
                # data source: https://github.com/GeostatsGuy/GeoDataSets/blob/master/unconv_MV_v5.csv
                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()
            </code>
        </pre>
    </div>
</div>

Based on the result of feature ranking, we train a linear model with two features: <code>features = ['Por', 'Brittle']</code>.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model

# data source: https://github.com/GeostatsGuy/GeoDataSets/blob/master/unconv_MV_v5.csv
file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
df = pd.read_csv(file)

In [2]:
features = ['Por', 'Brittle']
target = 'Prod'

X = df[features].values.reshape(-1, len(features))
y = df[target]

ols = linear_model.LinearRegression()
model = ols.fit(X, y)

print('Features                :  %s' % features)
print('Regression Coefficients : ', [round(item, 2) for item in model.coef_])
print('R-squared               :  %.2f' % model.score(X, y))
print('Y-intercept             :  %.2f' % model.intercept_)

Features                :  ['Por', 'Brittle']
Regression Coefficients :  [320.39, 31.38]
R-squared               :  0.93
Y-intercept             :  -2003.01


<div style="margin-top: -20px"></div>

The trained model explains 93% of the variability (<code>R-squared</code>) in the data with two features. Let's say that we are not satisfied with the R-squared value, and that we want a more powerful predictive model. We proceed to train a new linear model with six features: <code>features = ['Por', 'Brittle', 'Perm', 'TOC', 'AI', 'VR']</code>.

In [3]:
features = ['Por', 'Brittle', 'Perm', 'TOC', 'AI', 'VR']
target = 'Prod'

X = df[features].values.reshape(-1, len(features))
y = df[target]

ols = linear_model.LinearRegression()
model = ols.fit(X, y)

print('Features                :  %s' % features)
print('Regression Coefficients : ', [round(item, 2) for item in model.coef_])
print('R-squared               :  %.2f' % model.score(X, y))
print('Y-intercept             :  %.2f' % model.intercept_)

Features                :  ['Por', 'Brittle', 'Perm', 'TOC', 'AI', 'VR']
Regression Coefficients :  [230.3, 25.0, 116.23, -77.44, -363.74, 783.19]
R-squared               :  0.96
Y-intercept             :  -1230.26


<div style="margin-top: -20px"></div>

Indeed, the addition of more features increased the <code>R-squared</code> value from 93% to 96%. However, we observe significant changes in the values of the regression coefficients. For <code>Por</code>, the coefficient dropped from 320.4 $\longrightarrow$ 230.3. For <code>Brittle</code>, the coefficient dropped from 31.4 $\longrightarrow$ 25.0. I fitted several more combinations of features and show the results in <a href="#fig-8">Figure (8)</a>. Even though the <code>R-squared</code> value stays at 93% or higher, the regression coefficient for <code>Por</code> keeps fluctuating.

<div id="fig-8" class="row full_screen_margin_md mobile_responsive_plot_full_width" style="margin-top: 15px;">
    <div class="col"><img src="jupyter_images/multiple_linear_model_instability.png"></div>
    <div class="col-12"><p class="image-description">Figure 8: Unstable regression coefficients due to multicollinearity</p></div>
</div>

<div class="solution_panel closed">
    <div class="solution_title">
        <p class="solution_title_string">Source Code For Figure (8)</p>
        <ul class="nav navbar-right panel_toolbox">
            <li><a class="collapse-link"><i class="fa fa-chevron-down"></i></a></li>
        </ul>
    <div class="clearfix"></div>
    </div>
    <div class="solution_content">
        <pre>
            <code class="language-python">
                import pandas as pd
                import numpy as np
                from sklearn import linear_model
                import matplotlib.pyplot as plt
                
                # data source: https://github.com/GeostatsGuy/GeoDataSets/blob/master/unconv_MV_v5.csv
                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                ########################################################################################

                target = 'Prod'
                y = df[target].values

                # feature combinations to compare, from the full model down to two features
                feature_sets = [
                    ['Por', 'Brittle', 'Perm', 'TOC', 'AI', 'VR'],
                    ['Por', 'Brittle', 'Perm', 'TOC', 'VR'],
                    ['Por', 'Brittle', 'Perm', 'TOC', 'AI'],
                    ['Por', 'Brittle', 'Perm', 'TOC'],
                    ['Por', 'Brittle', 'Perm', 'AI'],
                    ['Por', 'Brittle', 'Perm', 'VR'],
                    ['Por', 'Brittle', 'TOC', 'VR'],
                    ['Por', 'Brittle', 'TOC'],
                    ['Por', 'Brittle', 'VR'],
                    ['Por', 'Brittle', 'AI'],
                    ['Por', 'Brittle'],
                ]

                ########################################################################################

                # fit one linear model per feature combination and report its coefficients
                for i, features in enumerate(feature_sets):
                    if i > 0:
                        print('')

                    X = df[features].values

                    ols = linear_model.LinearRegression()
                    model = ols.fit(X, y)

                    print('Features                :  %s' % features)
                    print('Regression Coefficients : ', [round(item, 2) for item in model.coef_])
                    print('R-squared               :  %.2f' % model.score(X, y))
                    print('Y-intercept             :  %.2f' % model.intercept_)
            </code>
        </pre>
    </div>
</div>

We observe instability in regression coefficients with feature selection, because this particular data set shows high multicollinearity, as shown by the variance inflation factors (VIF) of the features. I discuss VIF in more detail in <i>Section 4.1: VIF</i> <a href="#VIF">below</a>.
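As a quick preview, here is a minimal sketch of how VIF can be computed from its definition: regress each feature on all the others and take 1 / (1 - R²). The helper `vif` and the synthetic columns `a`, `b`, and `c` are assumptions for illustration only; they are not the rock-property features, and the numbers do not describe the dataset used in this post.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X, i.e. 1 / (1 - R^2) where R^2
    comes from regressing that column on all remaining columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # remaining columns plus a constant term
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coefs
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(3)
n = 2_000

# Synthetic features (assumed): a and b are strongly correlated, c is independent.
a = rng.normal(size=n)
b = 0.9 * a + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
c = rng.normal(size=n)

vifs = vif(np.column_stack([a, b, c]))
for name, value in zip(['a', 'b', 'c'], vifs):
    print('VIF of %s : %.2f' % (name, value))
```

In this toy setup, the two correlated columns get noticeably inflated VIF values, while the independent column stays close to 1.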