# Logistic Regression with Categorical Predictor in Python

## Mohammad Abdul Wahed

## Contents


*   Objective
*   Description of sud.csv Dataset
*   Importing Libraries
*   Loading Data
*   Data preparation(Creating Dummy variables)
*   Splitting the data into train and test set using Twinning technique
*   Fitting a model using Logistic Regression
*   Using the model to predict `admit` using test dataset
*   Model evaluation and accuracy

## Objective

The objective is to develop a logistic regression model that predicts whether a student will get admitted based on gpa, gre score and prestige of institution.

## Description of sud.csv Dataset

This dataset has a binary response (outcome, dependent) variable called admit. There are three predictor variables: gre, gpa and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

## Importing Libraries

In [24]:
# Let's import the required packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

## Loading Data

In [23]:
# Load the data using pandas.read_csv()
df = pd.read_csv("binary.csv")

In [2]:
# Let's inspect the df.head()
print(df.head())

   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4


The dataset contains several columns which we use as predictor variables:
- `gpa`  
- `gre` score  
- `rank` or prestige of an applicant's undergraduate alma matter  

The column `admit` is our binary target variable

The column named `rank` could present a problem since `rank` is also the name of a method belonging to pandas `DataFrame`; Specifically, `rank` calculates the ordered rank (1 through n) of a `DataFrame/Series`. We want to rename our `rank` column to 'prestige'.

In [3]:
df.columns = ["admit", "gre", "gpa", "prestige"]
print(df.columns)

Index(['admit', 'gre', 'gpa', 'prestige'], dtype='object')


In [4]:
df.describe()

Unnamed: 0,admit,gre,gpa,prestige
count,400.0,400.0,400.0,400.0
mean,0.3175,587.7,3.3899,2.485
std,0.466087,115.516536,0.380567,0.94446
min,0.0,220.0,2.26,1.0
25%,0.0,520.0,3.13,2.0
50%,0.0,580.0,3.395,2.0
75%,1.0,660.0,3.67,3.0
max,1.0,800.0,4.0,4.0


We see that the mean of outcome is 0.34 which means that the dataset is imbalanced(outcome '0' and outcome '1' are not in proportion). The F1 score metric becomes especially valuable when working on classification models in which our data set is imbalanced. We will implement it later in this notebook.

## Data preparation(Creating Dummy variables)


`pandas` gives us a great deal of control over how categorical variables are represented. Here we'll **dummify** the "prestige" column using `get_dummies`.

`get_dummies` creates a new `DataFrame` with binary indicator variables for each category / option in the column specified. In this case, `prestige` has four levels: 1 being most prestigious and 4 least. 

When we call `get_dummies` we get a dataframe with 4 columns of binary values (0 or 1) indicating which level the initial data point belongs to. 


In [5]:
dummy_ranks = pd.get_dummies(df['prestige'], prefix = 'prestige')
dummy_ranks.head()

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1


Create a clean data frame for our logistic regression model later:

In [6]:
cols_to_keep = ['admit', 'gre', 'gpa']

# use .join to combine the columns 
# df[[ col2, col4 ]] allows us to subset columns 2 and 4
data = df[cols_to_keep].join(dummy_ranks[['prestige_2', 'prestige_3', 'prestige_4']])
data.head()

Unnamed: 0,admit,gre,gpa,prestige_2,prestige_3,prestige_4
0,0,380,3.61,0,1,0
1,1,660,3.67,0,1,0
2,1,800,4.0,0,0,0
3,1,640,3.19,0,0,1
4,0,520,2.93,0,0,1


Notice how we did not include `prestige_1`, that is because the lack of any `1` between prestige 2 to 4 would indicate a level of `prestige_1`. When we treat `prestige_1` as our baseline and exclude it from our fit we also prevent multicollinearity, or the dummy variable trap which is a result of including a dummy variable for every single category.

##Splitting the data into training and test set using Twinning technique

Twinning technique partitions datasets into statistically similar disjoint sets, termed as twins.

Let's install the twinning package

In [None]:
pip install git+https://github.com/avkl/twinning.git

In [8]:
from twinning import twin

The following code generates an 80-20 partition of the dataset. twin() accepts a numpy ndarray as the dataset, and an integer parameter r representing the inverse of the partitioning ratio, i.e., for an 80-20 split, r = 1 / 0.2 = 5. The function returns indices of the smaller twin.

In [25]:
twin_idx = twin(data.to_numpy(), r=5)

In [28]:
twin_idx

array([116, 327, 290, 201, 190, 390, 268, 391,  89, 370, 241,  79,  26,
       364, 148, 316, 247, 232,  97, 298, 387, 138, 158, 326,  13, 160,
       308,  93, 163, 240,  65,  73, 292,   9, 185, 149, 113,  29, 332,
       366, 122, 396,  20, 271, 153, 169, 249, 399, 221, 114, 331,  54,
       196, 245, 165, 103, 170,  99, 108, 304,  83,  47, 283, 230, 314,
       111, 342, 282, 210,  16, 141,  27,  59, 289, 242, 301, 263,  12,
       234, 237], dtype=uint64)

Creating a dataframe by dropping indices in twin_idx to create bigger twin which will be used to train the model

In [29]:
data_train=data.drop(data.index[[116, 327, 290, 201, 190, 390, 268, 391,  89, 370, 241,  79,  26,
       364, 148, 316, 247, 232,  97, 298, 387, 138, 158, 326,  13, 160,
       308,  93, 163, 240,  65,  73, 292,   9, 185, 149, 113,  29, 332,
       366, 122, 396,  20, 271, 153, 169, 249, 399, 221, 114, 331,  54,
       196, 245, 165, 103, 170,  99, 108, 304,  83,  47, 283, 230, 314,
       111, 342, 282, 210,  16, 141,  27,  59, 289, 242, 301, 263,  12,
       234, 237]])

Splitting the data into train and test set

In [30]:
X_train = data_train.iloc[:, :-1].values
Y_train = data_train.iloc[:, -1].values
X_test = data.iloc[[116, 327, 290, 201, 190, 390, 268, 391,  89, 370, 241,  79,  26,
       364, 148, 316, 247, 232,  97, 298, 387, 138, 158, 326,  13, 160,
       308,  93, 163, 240,  65,  73, 292,   9, 185, 149, 113,  29, 332,
       366, 122, 396,  20, 271, 153, 169, 249, 399, 221, 114, 331,  54,
       196, 245, 165, 103, 170,  99, 108, 304,  83,  47, 283, 230, 314,
       111, 342, 282, 210,  16, 141,  27,  59, 289, 242, 301, 263,  12,
       234, 237], :-1].values
Y_test = data.iloc[[116, 327, 290, 201, 190, 390, 268, 391,  89, 370, 241,  79,  26,
       364, 148, 316, 247, 232,  97, 298, 387, 138, 158, 326,  13, 160,
       308,  93, 163, 240,  65,  73, 292,   9, 185, 149, 113,  29, 332,
       366, 122, 396,  20, 271, 153, 169, 249, 399, 221, 114, 331,  54,
       196, 245, 165, 103, 170,  99, 108, 304,  83,  47, 283, 230, 314,
       111, 342, 282, 210,  16, 141,  27,  59, 289, 242, 301, 263,  12,
       234, 237], -1].values       

## Fitting a model using Logistic Regression

Recall that we are predicting the `admit` column using `gre`, `gpa` and the prestige dummy variables 2 through 4. 

In [None]:
model = LogisticRegression()
model.fit(X_train, Y_train)

## Using the model to predict `admit` using test dataset




In [32]:
Y_pred = model.predict(X_test)

## Model evaluation and accuracy

Since our dataset is imbalanced, we use F1 score as our performance metric

The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0

In [33]:
confusion_matrix(Y_test, Y_pred)

array([[63,  3],
       [ 5,  9]])

In [34]:
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94        66
           1       0.75      0.64      0.69        14

    accuracy                           0.90        80
   macro avg       0.84      0.80      0.82        80
weighted avg       0.90      0.90      0.90        80



Computing the F1 score

 We use  `average = 'weighted'`.This accounts for label imbalance

In [36]:
 f1_score(Y_test, Y_pred, average='weighted')

0.8969001148105626

Our overall accuracy is 89%