# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Tasklist 16: Binomial Logistic Regression

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com).

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### Intro 

In this Tasklist you'll get to implement Binomial Logistic Regression. This notebook is intended to be used alongside the notebooks from the live sessions and previous tasklists. We'll try to keep things simple here: for the EDA of the given dataset, it will suffice to give some basic information and plot some simple graphs. As for the ML model, it doesn't need to be highly accurate; the goal here is just to fit it with `statsmodels` and interpret the results. So, let's dig in!

In [1]:
# suppressing those annoying warnings
import warnings

warnings.filterwarnings("ignore")

In [2]:
# importing necessary libraries
import os 

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy import stats 

from statsmodels.regression.linear_model import RegressionResultsWrapper

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [3]:
# setting the working directory

work_dir = os.getcwd()
data_dir = os.path.join(work_dir, "_data")
os.listdir(data_dir)

['binary.csv', 'Fish.csv', 'winequality-white.csv', 'winequality.names']

## Binomial Logistic Regression

**00.** You'll implement BLR on the [*UCLA Admission*](https://stats.idre.ucla.edu/stat/data/binary.csv) dataset, which lists candidates' scores from their previous education stage and whether each candidate was admitted to UCLA or not. The name of this dataset file is `binary.csv`.

In [4]:
# importing the dataset
df = pd.read_csv(os.path.join(data_dir, 'binary.csv'))
df

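|     | admit | gre | gpa  | rank |
| --- | ----- | --- | ---- | ---- |
| 0   | 0     | 380 | 3.61 | 3    |
| 1   | 1     | 660 | 3.67 | 3    |
| 2   | 1     | 800 | 4.00 | 1    |
| 3   | 1     | 640 | 3.19 | 4    |
| 4   | 0     | 520 | 2.93 | 4    |
| ... | ...   | ... | ...  | ...  |
| 395 | 0     | 620 | 4.00 | 2    |
| 396 | 0     | 560 | 3.04 | 3    |
| 397 | 0     | 460 | 2.63 | 2    |
| 398 | 0     | 700 | 3.65 | 2    |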


The fields of this dataset are:

- `admit`: binary variable; whether the candidate was admitted to UCLA `[1]` or not `[0]`. This is going to be our target variable;


- `gre`: *Graduate Record Examinations* - score on the GRE, a standardized exam required for admission to graduate programs (an entrance exam);


- `gpa`: *Grade Point Average* - score based on the candidate's grades in the previous education stage, on a 1.0 - 4.0 scale;


- `rank`: categorical variable; the candidate's rank, based on their success in the previous education stage compared to other students in their class, on a discrete 1 - 4 scale. A candidate with a low `gpa` score and a high rank likely comes from a weaker class; a candidate with a high `gpa` score and a low rank comes from a tough class.

**01.** Let's inspect our data using the `.info()` method:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   admit   400 non-null    int64  
 1   gre     400 non-null    int64  
 2   gpa     400 non-null    float64
 3   rank    400 non-null    int64  
dtypes: float64(1), int64(3)
memory usage: 12.6 KB


As we see, there are no `NaN` values, which is good. However, `rank`, which is a categorical variable, is read as a numerical variable.

Now, using `pd.CategoricalDtype` with argument `ordered=True`, cast the `rank` column to categorical type.

In [6]:
### your code here ###
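
One possible way to do the cast is sketched below. It is a minimal illustration rather than the reference solution: the `toy` frame is a hypothetical stand-in built from the first five rows of the preview above, and the category order `[1, 2, 3, 4]` is an assumption about how the ranks should be ordered.

```python
import pandas as pd

# Hypothetical stand-in: the first five rows shown in the preview of binary.csv
toy = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "rank": [3, 3, 1, 4, 4],
})

# Assumed category order: 1 < 2 < 3 < 4
rank_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True)

# Use toy["rank"], not toy.rank: the attribute form resolves to the DataFrame.rank method
toy["rank"] = toy["rank"].astype(rank_dtype)
print(toy["rank"].dtype)
```

The same two lines should apply to the full `df` once the categories present in the data have been checked.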

**02.** Perform some elementary EDA to depict how the values of `gre`, `gpa` and `rank` are distributed across the two values of the `admit` target variable.

In [7]:
### your code here ###
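
A numerical starting point for this EDA might look like the sketch below. The `toy` frame is again a hypothetical five-row stand-in taken from the preview above, so the numbers it prints say nothing about the full dataset; only the pattern of calls is the point.

```python
import pandas as pd

# Hypothetical stand-in: the first five rows shown in the preview of binary.csv
toy = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "rank": [3, 3, 1, 4, 4],
})

# Mean of each numerical predictor within each admit group
group_means = toy.groupby("admit")[["gre", "gpa"]].mean()
print(group_means)

# How many candidates of each rank fall into each admit group
rank_counts = pd.crosstab(toy["rank"], toy["admit"])
print(rank_counts)
```

On the full `df`, plots such as `sns.boxplot(data=df, x="admit", y="gre")` or `sns.countplot(data=df, x="rank", hue="admit")` would show the same comparisons graphically.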

**03.** Compute the correlation between the numerical variables in this data. Are these results statistically significant?

In [8]:
### your code here ###
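
`df.corr()` returns the correlation matrix but no p-values, so a significance test needs something like `scipy.stats.pearsonr`. The sketch below assumes Pearson's correlation is the measure of interest and uses only the nine rows visible in the preview above as a hypothetical subsample.

```python
from scipy import stats

# Hypothetical subsample: the nine rows visible in the preview of binary.csv
gre = [380, 660, 800, 640, 520, 620, 560, 460, 700]
gpa = [3.61, 3.67, 4.00, 3.19, 2.93, 4.00, 3.04, 2.63, 3.65]

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(gre, gpa)
print(f"r = {r:.3f}, p = {p_value:.3f}")
```

With the full data, the same call on `df["gre"]` and `df["gpa"]` gives the coefficient and its p-value for all 400 candidates.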

**04.** Use `.logit()` from `statsmodels` to fit a Binomial Logistic Regression model to the given data; this model should predict probabilities for `admit` binary labels from all the other variables.

In [9]:
### your code here ###
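
The fitting step could look roughly like the sketch below. To keep it self-contained it runs on simulated data: the `sim` frame, the random seed, and the "true" coefficients used to generate `admit` are all assumptions made for illustration, not properties of `binary.csv`. The formula wraps `rank` in `C()` so that it is dummy-coded even when stored as integers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for binary.csv (assumption: same column names, similar ranges)
rng = np.random.default_rng(0)
n = 400
sim = pd.DataFrame({
    "gre": rng.integers(220, 801, size=n),
    "gpa": rng.uniform(2.3, 4.0, size=n).round(2),
    "rank": rng.integers(1, 5, size=n),
})

# Hypothetical data-generating log-odds, chosen only so that the fit is well behaved
log_odds = -4.0 + 0.002 * sim["gre"] + 0.8 * sim["gpa"] - 0.5 * (sim["rank"] - 1)
sim["admit"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Fit the binomial logistic regression; disp=0 silences the optimizer log
model = smf.logit("admit ~ gre + gpa + C(rank)", data=sim).fit(disp=0)
print(model.params)
```

For the task itself, the same formula with `data=df` should be all that changes; `model.summary()` then prints the full results table.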

**05.** Looking at the model summary, answer: Is the model statistically significant? What about the predictors of the model?

In [10]:
### your code here ###
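
Model-level significance in logistic regression is usually judged with a likelihood-ratio test against the intercept-only model. In `statsmodels`, a fitted Logit result exposes the pieces as `llf`, `llnull`, `llr`, `llr_pvalue` and `df_model`, and the per-predictor tests as `pvalues`. The sketch below reproduces the test by hand; the two log-likelihood values and the degrees of freedom are hypothetical numbers, not results from this dataset.

```python
from scipy import stats

# Hypothetical values, for illustration only
llf = -229.0      # log-likelihood of the fitted model
llnull = -250.0   # log-likelihood of the intercept-only model
df_model = 5      # number of estimated slopes

# Likelihood-ratio statistic and its chi-square p-value
llr = 2 * (llf - llnull)
p_value = stats.chi2.sf(llr, df_model)
print(f"LLR = {llr:.1f}, p = {p_value:.2e}")
```

A p-value below the chosen threshold (0.05 is a common convention) suggests the predictors jointly improve on the null model.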

**06.** Interpret the values of model coefficients. 

*Hint:* Try taking their exponentials.

In [11]:
### your code here ###
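
Following the hint: a logistic regression coefficient is a change in log-odds, so its exponential is an odds ratio. The sketch below uses two hypothetical coefficients (they are not estimates from this dataset) to show the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on the log-odds scale
coefs = pd.Series({"gre": 0.002, "gpa": 0.8})

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = np.exp(coefs)
print(odds_ratios.round(4))
```

Under these assumed values, a one-point increase in `gpa` would multiply the odds of admission by about 2.23, holding the other predictors fixed. With a fitted model, `np.exp(model.params)` performs the same conversion.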

**07.** Use the model on the training data to predict the probabilities of a candidate being admitted to UCLA. We predict that a candidate will be admitted if the predicted probability is higher than 0.5.

In [12]:
### your code here ###
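
Calling `.predict()` on a fitted `statsmodels` result with no arguments returns the in-sample predicted probabilities. Turning them into labels is a simple comparison, sketched here on a hypothetical array of probabilities.

```python
import numpy as np

# Hypothetical predicted probabilities, standing in for model.predict()
probs = np.array([0.12, 0.50, 0.73, 0.49, 0.91])

# Admit only when the probability is strictly higher than 0.5
predicted = (probs > 0.5).astype(int)
print(predicted)
```

Note that a probability of exactly 0.5 is labeled 0 here, since the rule asks for values higher than 0.5.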

**08.** Evaluate model performance on the training data by calculating its accuracy.

In [13]:
### your code here ###
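
Accuracy is the share of candidates whose predicted label matches the observed one. A minimal sketch with hypothetical label arrays:

```python
import numpy as np

# Hypothetical observed and predicted labels
observed = np.array([0, 1, 1, 0, 1])
predicted = np.array([0, 0, 1, 0, 1])

# The mean of the element-wise matches is the proportion of correct predictions
accuracy = (observed == predicted).mean()
print(accuracy)
```

Here four of the five predictions match, so the accuracy is 0.8.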

**09.** What is the log-likelihood of the model?

In [14]:
### your code here ###
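
A fitted `statsmodels` result stores the log-likelihood in its `llf` attribute. To see what that number means, the sketch below computes the Bernoulli log-likelihood by hand from hypothetical labels and predicted probabilities.

```python
import numpy as np

# Hypothetical observed labels and predicted probabilities
observed = np.array([0, 1, 1, 0, 1])
probs = np.array([0.12, 0.50, 0.73, 0.49, 0.91])

# Sum of log(p) for admitted candidates and log(1 - p) for rejected ones
log_likelihood = np.sum(
    observed * np.log(probs) + (1 - observed) * np.log(1 - probs)
)
print(log_likelihood)
```

The value is always non-positive; the closer it is to zero, the more probability the model assigns to the outcomes that were actually observed.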

***

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>