# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Tasklist 13 and 14: Simple Linear Regression. Multiple Linear Regression.

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com).

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### Intro 

The goal of this Tasklist is to consolidate the theoretical and practical insights provided in Sessions 13 and 14 on (Multiple) Linear Regression. So far we have gone through simple and multiple linear regression, the parametric bootstrap, and part and partial correlation. Looking back at what we have learned in the previous sessions, we are now beginning to close the full circle of necessary steps in a basic data science project.

In this tasklist we are going to use the [Car Price Prediction](https://www.kaggle.com/datasets/hellbuoy/car-price-prediction) data set from Kaggle. Read about the problem statement and the goal of the study from the link provided. **The only thing that needs to happen with the Car Price Prediction dataset here is for you to study and understand the Linear Regression Modeling of this data set as it is provided in this notebook.**

**YOUR REAL TASK is to analyze something else:** the **Market Mix data** provided as `market_mix.csv` in the `_data` directory of this session (the source file is found in the [veer064/Linear-Regression](https://github.com/veer064/Linear-Regression) GitHub repo). **The only thing that you need to do with the Market Mix data set is to use Multiple Linear Regression to predict the `NewVolSales` variable and inspect the multicollinearity present in the model.** So, first study the analysis of the Car Price Prediction data as it is provided here, and then simply load `market_mix.csv`, clean all cell contents, and repeat everything that is done with Multiple Linear Regression in the Car Price Prediction study on the Market Mix data set, where your outcome variable will be `NewVolSales`.
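
The load, clean, and fit steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny in-memory CSV stands in for `market_mix.csv`, the predictor names `Price` and `Promo` are hypothetical, and the actual file may need different cleaning.

```python
import io

import pandas as pd
import statsmodels.formula.api as smf

# Stand-in for market_mix.csv; only NewVolSales is a real column name,
# Price and Promo are hypothetical predictors with messy whitespace.
raw_csv = io.StringIO(
    "NewVolSales , Price ,Promo\n"
    " 120 , 2.5 , 1\n"
    "135, 2.4,0 \n"
    "150 ,2.2, 1\n"
    "110, 2.7 ,0\n"
    " 160, 2.1, 1\n"
    "128 , 2.5, 0\n"
    "142, 2.3 , 1\n"
    "118, 2.6, 0 \n"
)

# Read everything as text first so that stray whitespace can be removed
mm = pd.read_csv(raw_csv, dtype=str)
mm.columns = mm.columns.str.strip()
mm = mm.apply(lambda col: pd.to_numeric(col.str.strip(), errors='coerce'))
mm = mm.dropna()

# Build the formula from every column except the outcome
predictors = [c for c in mm.columns if c != 'NewVolSales']
formula = 'NewVolSales ~ ' + ' + '.join(predictors)

mm_model = smf.ols(formula=formula, data=mm).fit()
print(formula)
print(mm_model.params)
```

Reading with `dtype=str` and converting afterwards is one design choice among several; it makes cells that fail to convert show up as `NaN` instead of silently turning a whole column into text.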

The data is provided in two files in the `_data` folder, namely:
- `car_prices.csv` - the file with the data we are going to work with,
- `data_dictionary.xlsx` - the file that contains descriptions for each column in the CSV file. **Warning:** the `carCompany` column from the dictionary is named `CarName` in the CSV file.

We will use the words column, feature, and predictor interchangeably throughout the tasklist, so do not let that confuse you.

**We strongly suggest studying Sessions 13 and 14 thoroughly and the two texts provided in the References section of this TaskList before proceeding.**

Also, the idea is to use the Python libraries we have introduced in Sessions 13 and 14.


Let's start by importing the necessary libraries.

In [None]:
import warnings

warnings.filterwarnings("ignore")

In [None]:
import os 

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy import stats 

from statsmodels.regression.linear_model import RegressionResultsWrapper

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

As mentioned, the data set is in the `_data` folder. Let's load it.

In [None]:
import os
work_dir = os.getcwd()
data_dir = os.path.join(work_dir, "_data")
os.listdir(data_dir)

In [None]:
df = pd.read_csv(os.path.join(data_dir, 'car_prices.csv'))

Before we go any further, we will define helper functions that make it easier to inspect model performance. It is not necessary to understand the code of the following two functions in order to proceed with the tasks.

In [None]:
# define a method that will return metrics for plotting influential points
def get_influential_points_summary(model: RegressionResultsWrapper, k: int = 1):
    model_inf = model.get_influence()
    # inf_frame = model_inf.summary_frame()
    # w_cookD = np.argwhere(model_inf.cooks_distance[0] > 1)
    n = len(model.resid)
    # w_leverage = np.argwhere(model_inf.hat_matrix_diag > 2*(k+1)/n)

    inf_plot_frame = pd.DataFrame(columns=['Residuals', 'Leverage', 'Cook Dist.'])

    inf_plot_frame['Residuals'] = model.resid
    inf_plot_frame['Leverage'] = model_inf.hat_matrix_diag
    inf_plot_frame['Cook Dist.'] = model_inf.cooks_distance[0]

    return inf_plot_frame

In [None]:
# define a function to plot the model performance analysis
def plot_model_analysis(model: RegressionResultsWrapper, target) -> None:
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 8))
    fig.subplots_adjust(hspace=0.4, wspace=.4)
    
    axes = axes.flatten()

    sns.histplot(model.resid, element='step', linewidth=.0, ax=axes[0]);
    sns.despine()
    # axes[0].grid(alpha=.3)
    axes[0].set_title('Residuals histogram', fontsize=10)
    

    sm.qqplot(model.resid, line='q', ax=axes[1])
    sns.despine()
    axes[1].set_title('Q-Q Plot of model residuals', fontsize=10)
    
    ip_summary = get_influential_points_summary(model)
    
    sns.scatterplot(data=ip_summary, x='Leverage', y='Residuals', size='Cook Dist.', hue=(ip_summary['Cook Dist.'] > 1), legend=None, ax=axes[2])
    sns.despine()
    axes[2].set_title("Influence Plot\n Size of the blob corresponds to Cook's distance", fontsize=10);

    # use the model passed as an argument, not the global linear_model
    sns.scatterplot(x=target, y=model.resid, s=10, ax=axes[3])
    sns.despine()
    axes[3].axhline(y=0, color='red')
    axes[3].grid(alpha=.4)
    axes[3].set_title('Price vs Residual', fontsize=10)
    axes[3].set_xlabel('price')
    axes[3].set_ylabel('residuals')

    # use the model passed as an argument, not the global linear_model
    sns.scatterplot(x=target, y=model.predict(), s=10, ax=axes[4])
    sns.despine()
    axes[4].axline((0, 0), (1, 1), color='red')
    axes[4].grid(alpha=.4)
    axes[4].set_title('Price vs Predicted', fontsize=10)
    axes[4].set_xlabel('price')
    axes[4].set_ylabel('predicted')

    # use the model passed as an argument, not the global linear_model
    sns.scatterplot(x=target, y=model.resid/target, s=10, ax=axes[5])
    sns.despine()
    axes[5].axhline(y=0, color='red')
    axes[5].grid(alpha=.4)
    axes[5].set_title('Price vs residual/price', fontsize=10)
    axes[5].set_xlabel('price')
    axes[5].set_ylabel('residual/price')

    fig.suptitle('Model analysis', fontsize=15)

---

**01.** Before we go into modeling, we want to make sure we understand the data at hand. Remember EDA? Great! Let's do it here.

a) Give a short preview of the data, the number of missing values per column, information about the data types, and descriptive statistics for the data set.

In [None]:
df.head(20)

In [None]:
df.columns

In [None]:
df.isna().sum(axis=0)

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df[df.select_dtypes(include='object').columns].describe()

In [None]:
# Let's see how many observations we have per category for each categorical variable
for column in df.select_dtypes(include='object').columns[1:]:
    print(df[column].value_counts())

**Conclusions:**
- 25 predictor variables, and 1 outcome variable (price).
- 15 predictor variables are of numerical type. One of these is an ordinal variable (*symboling*).
- 10 predictor variables are of categorical type.
- There are no missing values in the data set.
- The *CarName* categorical variable has the highest cardinality among all variables of the same type.
- Some categories have very few observations, e.g. *enginelocation* has only 3 observations for the *rear* value.

---

b) `price` is obviously the variable that we will be predicting from the other predictor variables. Create a chart for each numerical variable depicting how it relates to the price.

There are multiple ways, but try to come up with an elegant one.

In [None]:
numerical_variables = df.select_dtypes(exclude='object').columns.to_list()

In [None]:
df[numerical_variables]

In [None]:
_df = df[numerical_variables].melt(id_vars='price', value_vars=numerical_variables[1:-1])

In [None]:
g = sns.FacetGrid(data=_df, col='variable', col_wrap=5, sharex=False, height=2, aspect=1.3)
g.map(sns.scatterplot, 'value', 'price', s=10);

----

c) Do the same thing for the categorical variables. You can exclude `CarName` from the overview. Also, keep in mind that you have more than one way at your disposal.

In [None]:
categorical_variables = df.select_dtypes(include='object').columns.to_list() 

In [None]:
_df = df[categorical_variables + ['price']].melt(id_vars='price', value_vars=categorical_variables[1:])

In [None]:
g = sns.FacetGrid(data=_df, col='variable', col_wrap=5, sharey=False, height=2, aspect=1.3)
g.map(sns.boxplot, 'price', 'value' );

---

d) Figure out a way to get Pearson's correlation coefficient for each pair of numerical variables. Remember, it has both a statistic value and a p-value.

In [None]:
df[numerical_variables].corr(method='pearson')
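
`DataFrame.corr` returns only the coefficients. To see a coefficient together with its p-value for a single pair, `scipy.stats.pearsonr` can be used. A small sketch on synthetic data (the arrays `a` and `b` are hypothetical stand-ins for two numerical columns):

```python
import numpy as np
from scipy import stats

# Two synthetic, positively related variables (hypothetical stand-ins)
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = 0.8 * a + rng.normal(scale=0.6, size=200)

# pearsonr returns the correlation coefficient and its two-sided p-value
r, p_value = stats.pearsonr(a, b)
print(f"r = {r:.3f}, p-value = {p_value:.3g}")
```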

---

e) Plot these coefficients using `heatmap` chart from `seaborn` library.

In [None]:
plt.figure(figsize=(10, 7))
sns.heatmap(df[numerical_variables].corr(method='pearson'), annot=True, fmt='.2f');

---

f) Let's set our significance level $\alpha$ to 0.05. Use a `heatmap` to plot which feature pairs have a significant Pearson's correlation.

In [None]:
data = []

for i1 in numerical_variables[1:]:
    for i2 in numerical_variables[1:]:
        data.append({'name_a': i1, 'name_b': i2, 'pvalue': stats.pearsonr(df[i1], df[i2]).pvalue})

In [None]:
_df = pd.DataFrame(data).pivot(index='name_a', columns='name_b', values='pvalue') < 0.05
_df = _df[numerical_variables[1:]]
_df = _df.T[numerical_variables[1:]]

In [None]:
plt.figure(figsize=(4, 3))
sns.heatmap(_df);
plt.xlabel('')
plt.ylabel('');

---

**02.** Since there are many numerical and categorical variables, let's start with simple linear regression.

a) Regress `price` on categorical variable `fueltype` and describe OLS regression results. Then analyse residuals and influential observations.

In [None]:
# - fitting the linear model to the data
linear_model = smf.ols(formula='price ~ fueltype', data=df).fit()
print(linear_model.summary())

In [None]:
plot_model_analysis(linear_model, df['price'])

**Conclusion**

Short version: Our model is not statistically significant. 

Long version: 
- The null hypothesis (H0) says β1 = β2 = ... = 0 (the regression model does not exist). 
- The alternative hypothesis (H1) says there is at least one regression coefficient that is different from 0.
- By looking at the F-statistic and its probability (p-value) we either have evidence for rejecting H0 or we do not. In our case we do not have enough evidence to reject H0, so we say this regression model is not useful, i.e. the data show no evidence of a relationship between the predictor variable and the target variable.
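
To make the F-test less of a black box, the F-statistic reported in the summary can be reproduced from R2 as F = (R2 / k) / ((1 - R2) / (n - k - 1)), with k predictors and n observations. A minimal sketch on synthetic data (the frame `toy` and its columns are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic data (hypothetical): y depends on x1 but not on x2
rng = np.random.default_rng(2)
toy = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
toy['y'] = 1.0 + 2.0 * toy['x1'] + rng.normal(size=100)

fit = smf.ols('y ~ x1 + x2', data=toy).fit()

n = int(fit.nobs)      # number of observations
k = int(fit.df_model)  # number of predictors, excluding the intercept
r2 = fit.rsquared

# F-statistic and its p-value computed by hand
f_manual = (r2 / k) / ((1 - r2) / (n - k - 1))
p_manual = stats.f.sf(f_manual, k, n - k - 1)

print(f_manual, fit.fvalue)
print(p_manual, fit.f_pvalue)
```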

---

b) Regress `price` on the categorical variable `carbody` and describe the results. Use the category *hatchback* as the reference.

In [None]:
# - fitting the linear model to the data
linear_model = smf.ols(formula='price ~ C(carbody, Treatment(reference="hatchback"))', data=df).fit()
print(linear_model.summary())

In [None]:
plot_model_analysis(linear_model, df['price'])

**Conclusion**

This model's F-statistic p-value is far below the 0.05 alpha level. Based on this we have the evidence to reject H0 and say that at least one of the model's coefficients has its value different from 0. Therefore, we continue with interpreting the rest of the regression report.

The model has a very low R2 value, so it is not good at describing the variance in price based on the `carbody` variable.

Based on the p-values and standard errors of the coefficients, we can see that most of the coefficients are statistically significant (except for the coefficient of the *wagon* category).
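
With treatment coding, the intercept is the mean of the reference category and every other coefficient is the difference between that category's mean and the reference mean. A small sketch on synthetic data illustrates this (the frame `toy`, the column `body`, and the group means are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic prices for three body types (hypothetical values)
rng = np.random.default_rng(3)
toy = pd.DataFrame({'body': np.repeat(['hatchback', 'sedan', 'wagon'], 30)})
group_means = {'hatchback': 10.0, 'sedan': 14.0, 'wagon': 11.0}
toy['price'] = toy['body'].map(group_means) + rng.normal(scale=1.0, size=len(toy))

fit = smf.ols('price ~ C(body, Treatment(reference="hatchback"))', data=toy).fit()
means = toy.groupby('body')['price'].mean()

print(fit.params)                   # intercept, then sedan and wagon contrasts
print(means - means['hatchback'])   # the same differences computed directly
```

Each coefficient's p-value therefore tests whether that category differs from the reference category, not whether the category matters in general.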

---

c) Let's try multiple linear regression model now. First use only numerical features when modeling and describe OLS regression results and results of visual inspection.

In [None]:
numerical_variables

In [None]:
fomula_features = ' + '.join(numerical_variables[1:-1])
fomula_features

In [None]:
# - fitting the linear model to the data
linear_model = smf.ols(formula='price ~ ' + fomula_features, data=df).fit()
linear_model.summary()

In [None]:
plot_model_analysis(linear_model, target=df['price'])

**Conclusion**

This model's F-statistic p-value is far below the 0.05 alpha level. Based on this we can reject the null hypothesis and say that at least one of the model's coefficients is different from 0.

Based on the R2 value of 0.852 we can say this model is pretty good at describing the variance in price using the numerical predictors.

Most of the model's coefficients have a statistically significant effect on price.

The residuals' histogram and Q-Q plot show that the residuals are distributed very similarly to the normal distribution.

The Price vs Residual plot shows something that looks like heteroskedasticity. This can be seen on the Price vs Predicted plot as well.

However, the last plot shows the ratio residual/price against price, which gives insight into the percentage error of each prediction. Here we see that most of the prediction errors are in the ±25% range.
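
The "most errors fall within a given percentage" reading of the last plot can also be turned into a number. A minimal sketch, assuming synthetic prices and predictions whose error spread grows with price (all values here are hypothetical, not taken from the car data):

```python
import numpy as np

# Synthetic prices and predictions (hypothetical); the error spread grows
# with price, which mimics the heteroskedastic pattern discussed above
rng = np.random.default_rng(4)
price = rng.uniform(5000, 40000, size=300)
predicted = price + rng.normal(scale=0.1 * price)

residual = price - predicted          # observed minus fitted
relative_error = residual / price     # the quantity on the last plot

# Share of predictions whose relative error is within +-25%
share_within_25 = np.mean(np.abs(relative_error) <= 0.25)
print(f"Share within ±25%: {share_within_25:.1%}")
```

For a fitted statsmodels model, the same calculation would use `model.resid` and the observed target in place of the synthetic arrays.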

---

d) Get the VIF for numerical features and explain the findings.

In [None]:
### --- Variance Inflation Factors (VIFs)

# - appending the columns of ones to the predictors' data
model_frame_predictors = sm.add_constant(df[numerical_variables[1:-1]])
model_frame_predictors

In [None]:
# The lower bound of VIF is 1; 
# - there is no upper bound;
# - VIF > 2 indicates high variance inflation
vifs = [variance_inflation_factor(model_frame_predictors.values, i) for i in range(1, len(numerical_variables[1:-1])+1)]
vifs = np.array(vifs).reshape(1, -1)

df_vifs = pd.DataFrame(vifs, columns=numerical_variables[1:-1]).T
df_vifs.rename(columns={0: 'vif'}).sort_values(by='vif', ascending=False)

Conclusion: If we take VIF > 2 as a sign of variance inflation, most of our predictors have the variance of their coefficient estimates inflated due to collinearity with other predictors. One interesting thing here is the top 2 predictors (`citympg` and `highwaympg`), which have high VIFs. This matches the findings from the heatmap of correlations: we saw previously that Pearson's correlation coefficient between these two is 0.97!

---

e) Let's try a multiple linear regression model again, this time using only the categorical features. Describe the OLS regression results and the results of visual inspection.

In [None]:
categorical_variables

In [None]:
fomula_features = ' + '.join(categorical_variables[1:])
fomula_features

In [None]:
# - fitting the linear model to the data
linear_model = smf.ols(formula='price ~ ' + fomula_features, data=df).fit()
linear_model.summary()

In [None]:
plot_model_analysis(linear_model, target=df['price'])

**Conclusion**

This model's F-statistic p-value is far below the 0.05 alpha level. Based on this we can reject the null hypothesis and say that at least one of the model's coefficients is different from 0.

Based on the R2 value of 0.852 we can say this model is pretty good at describing the variance in price using the categorical predictors.

Most of the model's coefficients have a statistically significant effect on price.

The residuals' histogram and Q-Q plot show that the residuals are distributed very similarly to the normal distribution.

The Price vs Residual plot shows something that looks like heteroskedasticity. This can be seen on the Price vs Predicted plot as well.

However, the last plot shows the ratio residual/price against price, which gives insight into the percentage error of each prediction. Here we see that most of the prediction errors are in the ±40% range.

---

f) Do the regression analysis for the model that encompasses all of the predictors. Describe the results and make conclusions.

In [None]:
predictors_list = numerical_variables[1:-1] + categorical_variables[1:]

In [None]:
fomula_features = ' + '.join(predictors_list)
fomula_features

In [None]:
# - fitting the linear model to the data
linear_model = smf.ols(formula='price ~ ' + fomula_features, data=df).fit()
linear_model.summary()

In [None]:
plot_model_analysis(linear_model, target=df['price'])

**Conclusion**

This model's F-statistic p-value is far below the 0.05 alpha level. Based on this we can reject the null hypothesis and say that at least one of the model's coefficients is different from 0.

Based on the R2 value of 0.94 we can say this model is even better than the previous models at describing the variance in price using the predictors.

Many of the model's coefficients have a statistically significant effect on price.

The residuals' histogram and Q-Q plot show that the residuals are distributed very similarly to the normal distribution.

The Price vs Residual plot shows something that looks like heteroskedasticity. This can be seen on the Price vs Predicted plot as well.

However, the last plot shows the ratio residual/price against price, which gives insight into the percentage error of each prediction. Here we see that most of the prediction errors are in the ±20% range.

---

**03.** Use `sklearn` to fit the linear regression model, get the predictions, plot the Price vs. Predicted chart and calculate the R2 metric for the prediction results.

In [None]:
df[numerical_variables[1:]]

In [None]:
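# drop_first=True drops one category per variable, which then acts as its baseline
X = pd.concat([df[numerical_variables[1:-1]], pd.get_dummies(df[categorical_variables[1:]], drop_first=True)], axis=1)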
y = df['price']

In [None]:
lr = LinearRegression()
lr.fit(X, y)

In [None]:
y_preds = lr.predict(X)

In [None]:
r2_score(y, y_preds)
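
As a reminder of what `r2_score` computes, R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares. A tiny sketch with made-up numbers (the arrays `y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values (hypothetical)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```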

In [None]:
sns.scatterplot(x=y, y=y_preds)
sns.despine()
plt.grid(alpha=.3)
plt.axline((0, 0), (1, 1), color='red')
plt.title('Price vs Predicted plot', fontsize=15)
plt.xlabel('price')
plt.ylabel('predicted');

In [None]:
df_importances = pd.DataFrame(list(zip(lr.coef_, lr.feature_names_in_)))
df_importances = df_importances.rename(columns={0: 'coefficient', 1: 'feature'})
df_importances = df_importances.sort_values(by='coefficient', ascending=False)

In [None]:
plt.figure(figsize=(13, 10))
df_importances.set_index('feature').plot(kind='barh', ax=plt.gca())
sns.despine()
plt.grid(alpha=.3);
plt.title('Feature importances', fontsize=15);

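For each dummy-encoded categorical variable, one of its categories serves as the baseline. Make sure you know which category is the baseline so that you can interpret these coefficients correctly. Note also that raw coefficients are expressed on the scale of each predictor, so their sizes are not directly comparable across features.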
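
One way to check which category ends up as the baseline is to compare the dummy columns with and without `drop_first=True`; by default `pd.get_dummies` keeps a column for every category. A small sketch on a made-up column (the frame `toy` is hypothetical):

```python
import pandas as pd

# A made-up categorical column (hypothetical)
toy = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas', 'gas', 'diesel']})

full = pd.get_dummies(toy['fueltype'])                      # one column per category
reduced = pd.get_dummies(toy['fueltype'], drop_first=True)  # first category dropped

# The dropped category is the baseline the remaining coefficients refer to
baseline = sorted(set(full.columns) - set(reduced.columns))

print(list(full.columns))
print(list(reduced.columns))
print(baseline)
```

With `drop_first=True` the category that sorts first is the one dropped, so here the coefficient of `gas` would be read relative to `diesel`.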

### References
- [Interpreting Linear Regression Through statsmodels.summary()](https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a)
- [Mastering f-statistics in Linear Regression: Formula, Examples](https://vitalflux.com/interpreting-f-statistics-in-linear-regression-formula-examples/amp/)

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>