# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Tasklist 17/18: Regularized Binomial and Multinomial Logistic Regression. 

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com).

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### Intro 

In this Tasklist you will work with the Regularized Binomial and Multinomial Logistic Regression models in `scikit-learn`. For the Multinomial Regression you will essentially only need to adapt the existing Python code from Session 18 and attempt to solve the Wine Quality problem in a different way. We begin with the Regularized Binomial Logistic Regression:

In [1]:
# suppressing those annoying warnings
import warnings

warnings.filterwarnings("ignore")

In [2]:
# importing necessary libraries
import os 

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression

In [3]:
# setting the working directory

work_dir = os.getcwd()
data_dir = os.path.join(work_dir, "_data")
os.listdir(data_dir)

['winequality-red.csv', 'Iris.csv', 'framingham.csv', 'kc_house_data.csv']

## Regularized Binomial Logistic Regression

Consider the `framingham.csv` data set, provided in your `_data` directory for Session 18. The source of this data set is Kaggle: [Logistic regression To predict heart disease](https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression) where a thorough description of all variables is provided; please read through the data set and task description on Kaggle carefully before proceeding.

In [4]:
data_set = pd.read_csv(os.path.join(data_dir, "framingham.csv"))
data_set.head(10)

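|  | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |
| 5 | 0 | 43 | 2.0 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 228.0 | 180.0 | 110.0 | 30.3 | 77.0 | 99.0 | 0 |
| 6 | 0 | 63 | 1.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 205.0 | 138.0 | 71.0 | 33.11 | 60.0 | 85.0 | 1 |
| 7 | 0 | 45 | 2.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 313.0 | 100.0 | 71.0 | 21.68 | 79.0 | 78.0 | 0 |
| 8 | 1 | 52 | 1.0 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 260.0 | 141.5 | 89.0 | 26.36 | 76.0 | 79.0 | 0 |
| 9 | 1 | 43 | 1.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 162.0 | 107.0 | 23.61 | 93.0 | 88.0 | 0 |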


The task is to predict the `TenYearCHD` binary variable from all available (numeric and categorical) predictors.

**1.** Provide an elementary EDA of the data: 

   - **1.1** Produce the correlation matrix of all `float64` predictors and speculate whether multicollinearity might be present; use `seaborn` to visualize the correlation matrix as a heatmap. 
   - **1.2** Provide the categorized box-plots for the following variables vs. the outcome: `cigsPerDay`, `totChol`, `sysBP`, `diaBP`, `BMI`.
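
As a rough orientation for both EDA steps, here is a minimal sketch on a small synthetic data frame. The values are made up and only the column names mimic the real data, so treat it as an illustration of the `pandas` and `seaborn` calls rather than as the expected solution.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for the real data (hypothetical values, not from framingham.csv)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sysBP": rng.normal(130, 20, 200),
    "BMI": rng.normal(26, 4, 200),
    "age": rng.integers(35, 70, 200),
    "TenYearCHD": rng.integers(0, 2, 200),
})
# Make diaBP depend on sysBP so the heatmap has a strong correlation to show
toy["diaBP"] = 0.6 * toy["sysBP"] + rng.normal(0, 5, 200)

# 1.1: correlation matrix of the float64 columns only
float_cols = toy.select_dtypes(include="float64")
corr = float_cols.corr()

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, ax=axes[0])

# 1.2: box-plot of one predictor, split by the binary outcome
sns.boxplot(data=toy, x="TenYearCHD", y="sysBP", ax=axes[1])
fig.tight_layout()
```

Pairs of predictors with a large absolute correlation are the first candidates to inspect when thinking about multicollinearity.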

**1.1** Correlation Matrix

In [5]:
data_set.dtypes

male                 int64
age                  int64
education          float64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
TenYearCHD           int64
dtype: object

In [2]:
### --- YOUR CODE HERE --- ###

**1.2** Box-plots

In [3]:
### --- YOUR CODE HERE --- ###

**2.** Perform an ElasticNet regularization of the Binomial Logistic Regression model, as we did in Session 17, to predict `TenYearCHD`.

The categorical predictors in this data set, such as `male` and `currentSmoker`, are already binary coded, so there is no need to separate categorical from numerical predictors and perform Dummy Coding with `sklearn.preprocessing.OneHotEncoder` here. However, make sure that there are no missing values in the data, because Linear Models like the Binomial Logistic Regression do not tolerate missing values! Keep only the rows with complete observations for this exercise. There should be `3656` observations left following the clean-up.
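
One common way to keep only complete observations is `pandas.DataFrame.dropna`. The sketch below uses a tiny made-up frame (hypothetical values) to show how to count missing values per column and then drop the incomplete rows.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values, not the real data set)
toy = pd.DataFrame({
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "BMI": [26.97, 28.73, np.nan, 28.58],
    "TenYearCHD": [0, 0, 0, 1],
})

# Number of missing values in each column
print(toy.isna().sum())

# Keep only rows without any missing value and renumber the index
complete = toy.dropna().reset_index(drop=True)
print(complete.shape)
```

On the real data, comparing the row count before and after the clean-up is a quick sanity check.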

In [4]:
### --- YOUR CODE HERE --- ###

Do we have a class imbalance problem in this data set? Examine:

In [5]:
### --- YOUR CODE HERE --- ###
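
A simple way to examine class balance is to tabulate the outcome with `value_counts`. This is a minimal sketch on a made-up outcome vector (hypothetical values), not the real `TenYearCHD` column.

```python
import pandas as pd

# Toy outcome vector (hypothetical values)
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="TenYearCHD")

counts = y.value_counts()                 # absolute class frequencies
shares = y.value_counts(normalize=True)   # relative class frequencies
print(pd.DataFrame({"count": counts, "share": shares}))
```

If one class is much rarer than the other, unweighted accuracy becomes misleading, which is why class weighting and the weighted F1 score matter for this task.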

Now go for the ElasticNet as we did in Session 17, and make sure to use the `class_weight` argument properly; use the weighted F1 score to assess the model performance.

Do not forget to use `sklearn.preprocessing.StandardScaler` to standardize your feature matrix for regularization.

You can use the following values for the inverse penalty `C` and the `l1_ratio`:

`C_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]`

`l1_range = [0, .1, .25, .5, .75, .9, 1]`
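
The search over `C` and `l1_ratio` can be organized in several ways. The sketch below is one option, assuming a cross-validated grid search with `GridSearchCV` (which may differ from the procedure used in the sessions), synthetic data from `make_classification`, and deliberately shortened grids so that it runs quickly. In `scikit-learn`, the ElasticNet penalty for `LogisticRegression` requires the `saga` solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real feature matrix (hypothetical data)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)

# Standardize inside the pipeline so scaling is refit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                               class_weight="balanced", max_iter=5000)),
])

# Shortened grids for speed; the tasklist lists the full ranges to use
param_grid = {"clf__C": [0.01, 1, 100], "clf__l1_ratio": [0, 0.5, 1]}

# Select the combination with the best cross-validated weighted F1 score
search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Here `class_weight="balanced"` reweights observations inversely to class frequency; an explicit dictionary of weights is another option.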

In [6]:
### --- YOUR CODE HERE --- ###

Refit the model using the optimal `C` and `l1_ratio` values, print out the model coefficients, and discuss the effect of each predictor.
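
For the discussion of coefficients, it often helps to look at `exp(coefficient)`: with standardized features, this is the multiplicative change in the odds of the positive class for a one standard deviation increase in the predictor. The sketch below uses synthetic data, hypothetical feature names, and placeholder values of `C` and `l1_ratio` (not tuned ones).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
names = ["x1", "x2", "x3", "x4"]
X_std = StandardScaler().fit_transform(X)

# C and l1_ratio are placeholders here; use your selected values instead
model = LogisticRegression(penalty="elasticnet", solver="saga", C=1.0,
                           l1_ratio=0.5, class_weight="balanced",
                           max_iter=5000).fit(X_std, y)

# One row per predictor: raw coefficient and the corresponding odds ratio
coefs = pd.DataFrame({"coef": model.coef_[0],
                      "odds_ratio": np.exp(model.coef_[0])}, index=names)
print(coefs.sort_values("coef"))
```

Coefficients shrunk to exactly zero by the L1 part of the penalty have an odds ratio of 1, meaning the predictor was effectively dropped from the model.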

In [7]:
### --- YOUR CODE HERE --- ###

## Regularized Multinomial Regression

**3.** Perform an ElasticNet regularization of the Multinomial Regression model to predict `quality`, first grouping the outcome variable into three larger classes (instead of using the original six classes).

Load the `winequality-red.csv` data set from the Session 18 `_data` directory:

In [8]:
### --- YOUR CODE HERE --- ###

Overview of the `quality` variable:

In [9]:
### --- YOUR CODE HERE --- ###

Recode the `quality` variable to group the existing classes in the following way:

- quality levels 3, 4, and 5 go into new class 1;
- quality level 6 goes into a new class 2;
- quality levels 7 and 8 go into a new class 3.

**Hint.** Use [pandas.DataFrame.replace](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html).
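
A minimal sketch of such a recoding, assuming a dictionary passed to `replace` and a made-up `quality` column (hypothetical values, not the real wine data):

```python
import pandas as pd

# Toy quality scores (hypothetical values)
toy = pd.DataFrame({"quality": [3, 4, 5, 5, 6, 6, 7, 8]})

# Old level -> new class, following the grouping described above
mapping = {3: 1, 4: 1, 5: 1, 6: 2, 7: 3, 8: 3}
toy["quality"] = toy["quality"].replace(mapping)

# Distribution of instances across the new classes
print(toy["quality"].value_counts().sort_index())
```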

Show us the distribution of instances across the classes again:

In [10]:
# Recode the 'quality' variable

Prepare `X` and `y`, and use the following values of `C` and `l1_ratio` in ElasticNet for the Multinomial Regression model to predict `quality` (again, use the weighted F1 score to select the best model):

`C_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]`

`l1_range = [0, .1, .25, .5, .75, .9, 1]`

In [11]:
### --- YOUR CODE HERE --- ###

Print out the confusion matrix from the best model.
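
For reference, `sklearn.metrics.confusion_matrix` places the true classes in rows and the predicted classes in columns. The sketch below uses made-up label vectors (hypothetical values) for the three recoded classes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score

# Toy true and predicted labels (hypothetical values)
y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 2, 3]

labels = [1, 2, 3]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Label rows (true class) and columns (predicted class) for readability
cm_df = pd.DataFrame(cm,
                     index=[f"true_{c}" for c in labels],
                     columns=[f"pred_{c}" for c in labels])
print(cm_df)
print(f1_score(y_true, y_pred, average="weighted"))
```

Off-diagonal cells show which classes get confused with each other, which is useful when comparing the grouped outcome against the original one.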

In [12]:
### --- YOUR CODE HERE --- ###

Are the results any better than those we obtained with the original quality classes in Session 18? Please comment.

`### --- YOUR ANSWER HERE --- ###`

***

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>