# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Tasklist 15: Session 15: Regularization in MLR. The Maximum Likelihood Estimation (MLE).


Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com).

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### Intro 

The goal of this task list is to consolidate the theoretical and practical insights on regularization from Session 15. So far, we have covered two types of regularization: Ridge and Lasso. We also introduced another approach to parameter estimation, Maximum Likelihood Estimation (MLE). Under the assumption that the errors are normally distributed, MLE can be used to estimate the parameters of a linear regression model instead of good old OLS.
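
The relationship between MLE and OLS mentioned above can be checked numerically. The following is a minimal sketch on synthetic data (not the Boston data set), assuming normally distributed errors; the names `neg_log_likelihood`, `beta_mle`, and `beta_ols` are illustrative and do not come from the session notebook. It minimizes the negative Gaussian log-likelihood with `scipy.optimize.minimize` and compares the result with the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.7, size=n)

def neg_log_likelihood(params):
    """Negative log-likelihood of y under y ~ N(X @ beta, sigma^2)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)  # optimize log(sigma) so that sigma stays positive
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

# MLE: numerical minimization of the negative log-likelihood
start = np.zeros(X.shape[1] + 1)
mle = minimize(neg_log_likelihood, start, method="BFGS")
beta_mle = mle.x[:-1]
sigma_mle = np.exp(mle.x[-1])

# OLS: closed-form least-squares solution
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("MLE coefficients:", np.round(beta_mle, 3))
print("OLS coefficients:", np.round(beta_ols, 3))
```

The two coefficient vectors should agree up to the tolerance of the optimizer. The MLE of the error standard deviation divides the residual sum of squares by `n`, not by the residual degrees of freedom, so it is slightly smaller than the usual unbiased estimate.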

Today you are going to practice with the data set provided. This is not the first time you will see this data set: we used it in Session 03. It is the Boston Housing Data Set (available from GitHub [here](https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv)).

As a refresher, here is the list of variables and their descriptions:
- **crim**: per capita crime rate by town.

- **zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

- **indus**: proportion of non-retail business acres per town.

- **chas**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

- **nox**: nitrogen oxides concentration (parts per 10 million).

- **rm**: average number of rooms per dwelling.

- **age**: proportion of owner-occupied units built prior to 1940.

- **dis**: weighted mean of distances to five Boston employment centres.

- **rad**: index of accessibility to radial highways.

- **tax**: full-value property-tax rate per \$10,000.

- **ptratio**: pupil-teacher ratio by town.

- **b**: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

- **lstat**: lower status of the population (percent).

- **medv**: median value of owner-occupied homes in \$1000s.

The variable **medv** is the one we are going to predict using Multiple Linear Regression, with and without regularization.

Let's start by importing the necessary libraries.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import os 

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy import stats 

from statsmodels.regression.linear_model import RegressionResultsWrapper

from sklearn import linear_model
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score

As mentioned, the data set is in the `_data` folder.

In [3]:
import os
work_dir = os.getcwd()
data_dir = os.path.join(work_dir, "_data")
os.listdir(data_dir)

['BostonHousingData.csv', 'iris.csv', 'kc_house_data.csv']

## Regularization

**01.** Load the data set and do the initial scan of the data set.

In [4]:
### your code here ###

**02.** Are there any missing values?

In [5]:
### your code here ###
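
As a hint for this kind of check, here is a minimal sketch on a small hypothetical data frame (the values are made up and only the column names echo the data set), assuming the data is held in a `pandas` DataFrame. `DataFrame.isna()` marks missing entries, and summing the result counts them per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries (not the Boston data)
toy = pd.DataFrame({
    "rm": [6.5, np.nan, 5.9],
    "medv": [24.0, 21.6, np.nan],
    "chas": [0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# True if at least one value is missing anywhere in the data frame
any_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("Any missing values:", any_missing)
```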

**03.** Perform the EDA on the data set provided.

In [6]:
### your code here ###

**04.** We talked about certain assumptions that have to be fulfilled before using linear models (MLR). What can you conclude so far?

In [7]:
### your answer here ###

**05.** Use `statsmodels` to fit the MLR model. 

In [8]:
### your code here ###

**06.** What are your conclusions based on the `statsmodels` model report?

In [9]:
### your answer here ###

**07.** What is the *VIF* of each predictor variable? What are your conclusions?

In [10]:
### your answer here ###

**08.** Use `sklearn` **Ridge** regularized linear model and search for the best `alpha` value, like we did in session 15. 

In [11]:
### your code here ###
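
One possible shape for such a search is sketched below on synthetic data. The grid of `alpha` values, the hold-out split, and the variable names are assumptions made for illustration and may differ from what was done in Session 15. For every `alpha`, the sketch records the L2 norm of the coefficients together with the hold-out MSE and R2.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.5]) + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

alphas = np.logspace(-2, 4, 25)  # hypothetical grid of penalty strengths
l2_norms, mses, r2s = [], [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    y_hat = model.predict(X_test)
    l2_norms.append(np.linalg.norm(model.coef_))  # L2 norm of the coefficients
    mses.append(mean_squared_error(y_test, y_hat))
    r2s.append(r2_score(y_test, y_hat))

# Pick the alpha with the lowest hold-out MSE
best_alpha = alphas[int(np.argmin(mses))]
print("Best alpha on the hold-out set:", best_alpha)
```

The lists `l2_norms`, `mses`, and `r2s` can then be plotted against `alphas` on a logarithmic x-axis. The L2 norm of the coefficients shrinks as `alpha` grows, which is the effect the Ridge penalty is designed to produce.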

**09.** Plot how different *alpha* values affect: model parameters, L2 norm, MSE and R2 scores. What are your conclusions based on the charts?

In [12]:
### your code here ###

**10.** Use `sklearn` **Lasso** regularized linear model and search for the best `alpha` value, like we did in session 15. 

In [13]:
### your code here ###
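
Session 15 covered Lasso alongside Ridge. To contrast the two, here is a minimal sketch on synthetic data showing how the L1 penalty drives coefficients to exactly zero as `alpha` grows. The `alpha` values and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): three predictors are pure noise
rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p))
true_coef = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(size=n)

alphas = [0.01, 0.1, 0.5, 1.0, 2.0]  # hypothetical penalty strengths
l1_norms, n_zeros = [], []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X, y)
    l1_norms.append(np.sum(np.abs(model.coef_)))    # L1 norm of the coefficients
    n_zeros.append(int(np.sum(model.coef_ == 0.0)))  # coefficients set exactly to zero
    print(f"alpha={alpha:<5} L1 norm={l1_norms[-1]:.3f} zero coefficients={n_zeros[-1]}")
```

Unlike Ridge, which only shrinks coefficients towards zero, Lasso removes predictors from the model altogether once the penalty is strong enough.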

**11.** Plot how different *alpha* values affect: model parameters, L2 norm, MSE and R2 scores. What are your conclusions based on the charts?

In [14]:
### your code here ###

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>