# Logistic Regression Basics

Author: Dr Jessica Caterson

[GitHub Profile](https://github.com/jjcato9)

This Python tutorial reviews logistic regression, applied to an example dataset for assessing high versus low risk of prolonged hospital length of stay.

For this session, we will use the [Healthcare Analytics II Dataset](https://www.kaggle.com/datasets/nehaprabhavalkar/av-healthcare-analytics-ii?resource=download) available on Kaggle.

Download the folder and save it in the same folder as this notebook (the code below expects it to be named `healthcare`).

## What is Logistic Regression?

Before we dive into the data, let's consider briefly what logistic regression is, and its utility in health data science...

**Logistic regression** is a form of *generalised linear modelling*. It uses a number of predictors ($X_i$) to model a discrete outcome, $Y$, e.g. disease vs no disease, side effect vs no side effect, treatment vs no treatment.

The simplest logistic model relates one covariate (X) to one outcome (Y) which is either present or absent. Logistic regression then calculates the probability of the outcome Y given a certain value X. This can be written simply as:

$Y = \beta_0 + \beta_1 X$

- Where $\beta_0$ and $\beta_1$ are coefficients which influence the outcome Y

## The Link Function

The **link function** is the function used to link the probability with the above model. For a logistic regression, the link function is also called the **logit function**, which is as follows:

$\text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)$

From this link function, we can therefore transform $logit(\pi)$ to find $\pi$, the probability, with the following equation:

If: $ logit(\pi) = L $

$ \pi = exp(L)/(1 + exp(L)) $

We will return to how the link function is applied to Logistic Regression later in this notebook.
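As a quick numerical check of the two equations above, the sketch below implements the logit and its inverse with NumPy. The function names `logit` and `inverse_logit` are our own, chosen for illustration, and the probabilities are arbitrary example values.

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to log-odds."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

def inverse_logit(L):
    """Map log-odds back to a probability."""
    L = np.asarray(L, dtype=float)
    return np.exp(L) / (1 + np.exp(L))

# Arbitrary example probabilities
probabilities = np.array([0.1, 0.5, 0.9])

log_odds = logit(probabilities)
print(log_odds)

# Applying the inverse should recover the original probabilities
print(inverse_logit(log_odds))
```

Note that a probability of 0.5 corresponds to a log-odds of 0, with negative log-odds for probabilities below 0.5 and positive log-odds above it.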

## Predicting Length of Stay: Data Preparation

We are going to use the provided dataset to model a patient's length of stay as a binary outcome (long vs. short) based on a number of covariates.

Let's take a look at the data to see what data we have available:

In [1]:
#get required packages
import pandas as pd
import os
import numpy as np
import math

#get local working directory
cwd = os.getcwd()

#open training data set
training_data = pd.read_csv(cwd+'/healthcare/train_data.csv')

#view top of the dataset
training_data.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50


The data contains a wide range of information on patients, including where they were admitted, to which department, the type of admission, illness severity, age, and our outcome: length of stay.

For this example, we are going to see how illness severity impacts length of stay.

For our outcome, we are going to split the data into two outcomes: long length of stay, and short length of stay. This is because, for logistic regression, the maths gets a lot more complex for multiple outcome possibilities. Whilst it is possible to do this, in order to understand the maths behind it, we are going to keep this example simple.

We will define long length of stay as >100 days.

In [2]:
#filter out relevant parameters
data = training_data[['case_id','Severity of Illness','Stay']]

#rename columns
data = data.rename(columns = {'Severity of Illness':'illness_severity','Stay':'los'})

#get data information
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318438 entries, 0 to 318437
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   case_id           318438 non-null  int64 
 1   illness_severity  318438 non-null  object
 2   los               318438 non-null  object
dtypes: int64(1), object(2)
memory usage: 7.3+ MB
None


In [3]:
#get unique value counts for illness severity
print(data['illness_severity'].value_counts())

Moderate    175843
Minor        85872
Extreme      56723
Name: illness_severity, dtype: int64


In [4]:
#get unique values for length of stay
print(data['los'].value_counts())

21-30                 87491
11-20                 78139
31-40                 55159
51-60                 35018
0-10                  23604
41-50                 11743
71-80                 10254
More than 100 Days     6683
81-90                  4838
91-100                 2765
61-70                  2744
Name: los, dtype: int64


In [5]:
#define our predictor category
predictors = pd.DataFrame(pd.Categorical(data['illness_severity']))
predictors = predictors.rename(columns = {0:'illness_severity'})

In [6]:
#change los from categorical to binary outcome and create a list
#Let us set any LOS >100 days as a long length of stay

long_los = np.where(data['los'] == ('More than 100 Days'),1,0)
print(long_los)

[0 0 0 ... 0 0 0]


In [7]:
#combine predictor and outcome into finalised data set
model_data = predictors
model_data['long_los'] = long_los
model_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318438 entries, 0 to 318437
Data columns (total 2 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   illness_severity  318438 non-null  category
 1   long_los          318438 non-null  int64   
dtypes: category(1), int64(1)
memory usage: 2.7 MB


## Logistic Regression in Python

Now our data is set up appropriately, we can perform Logistic Regression. For this example, we are going to use `statsmodels`. However, multiple other packages offer logistic regression, such as `sklearn` which is particularly useful when using Logistic Regression for machine learning.

In [8]:
import statsmodels.formula.api as sm

With `statsmodels`, we set up the equation $Y = \beta_0 + \beta_1 X$ as follows:

In [9]:
model = sm.logit('long_los ~ illness_severity',model_data).fit()

print(model.summary())

Optimization terminated successfully.
         Current function value: 0.100436
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:               long_los   No. Observations:               318438
Model:                          Logit   Df Residuals:                   318435
Method:                           MLE   Df Model:                            2
Date:                Sun, 26 Feb 2023   Pseudo R-squ.:                 0.01394
Time:                        21:42:03   Log-Likelihood:                -31983.
converged:                       True   LL-Null:                       -32435.
Covariance Type:            nonrobust   LLR p-value:                4.890e-197
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                       -3.3354      0.023   -144.732      0.000

## Logistic Regression Interpretation

The model output is detailed... so let's break it down to understand what it all means. 

Firstly, if you look at the bottom table, you'll notice that each illness severity category has been separated. Here, the logistic regression model has turned this categorical variable into **dummy variables**.

**Dummy variables** are binary variables which represent each sub-category within a category. For example, for illness severity, we would create two dummy variables: $w_{mod}$ and $w_{min}$. For an individual with severe illness (labelled `Extreme` in the dataset, and used here as the reference group), these variables would both be 0; for moderate illness, $w_{mod}$ is 1 and $w_{min}$ is 0; and for minor illness, $w_{min}$ is 1 and $w_{mod}$ is 0.

Our simple equation would now look like this:

$Y = \beta_0 + \beta_1 w_{mod} + \beta_2 w_{min}$, where $\beta_0$ represents the severe illness (reference) group

If we want to turn data into dummy variables (which is useful when it comes to machine learning concepts), pandas has a `get_dummies` function which does all of this for us!
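A minimal sketch of dummy coding with `pd.get_dummies` is shown below. It uses a small toy DataFrame (not the real data) containing the three severity labels from this dataset. The choice of `drop_first=True` is our assumption here: it drops the first category alphabetically (`Extreme`), leaving it as the reference group, which mirrors what the model summary shows.

```python
import pandas as pd

# Toy data using the three severity labels found in the dataset
toy = pd.DataFrame(
    {'illness_severity': ['Extreme', 'Moderate', 'Minor', 'Moderate']}
)

# drop_first=True removes the first category ('Extreme'),
# so a row of all zeros represents the reference group
dummies = pd.get_dummies(toy['illness_severity'], drop_first=True, dtype=int)
print(dummies)
```

Each row now has a `Minor` and a `Moderate` column, and the `Extreme` row is identified by both being 0.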

In [21]:
model.params

Intercept                      -3.335368
illness_severity[T.Minor]      -1.149178
illness_severity[T.Moderate]   -0.484121
dtype: float64

The coefficients shown are the **log-odds** for the intercept (severe illness) and the **log-odds ratios** for minor and moderate illness. Each log-odds ratio compares that group with the intercept (reference) group.

Odds and Odds Ratios are generated by logistic regression: 

- **Odds** describe the probability of an outcome occurring divided by the probability of it not occurring.

- **Odds Ratios** describe the change in odds between two groups e.g. exposed and non-exposed.

For example, let's say patients who smoke have an odds of 2 for developing lung cancer in the next 10 years...
*This means that, for every smoker who doesn't develop lung cancer in the next ten years, 2 will.*

Separately, let's say the odds ratio of developing lung cancer between smokers and non-smokers is 3. This means that smokers have three times the odds of developing lung cancer that non-smokers have.

### So how do we calculate odds and odds ratios from logistic regression?
Let's return to our simple logistic regression model:
$Y = \beta_0 + \beta_1 X$

Here, we can say that:

$\text{logit}(\pi) = \beta_0 + \beta_1 X$

Consequently,

- $\beta_0$ = log-odds of non-exposed group (i.e. X = 0)
- $\beta_0 + \beta_1$ = log-odds of exposed group (i.e. X = 1)

If we take the exponential of these values, we get the odds:

- $\exp(\beta_0)$ = odds of non-exposed group
- $\exp(\beta_0 + \beta_1)$ = odds of exposed group (i.e. X = 1)

Remembering our indices rules... we know that $x^a \times x^b = x^{a+b}$

Therefore, $\exp(\beta_0 + \beta_1) = \exp(\beta_0) \times \exp(\beta_1)$

This means that $\exp(\beta_1)$ is the factor by which the odds are multiplied when moving from the unexposed to the exposed group, i.e. the odds ratio for the exposed group relative to the unexposed group.
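To make this relationship concrete, here is a small worked sketch using a hypothetical 2x2 table. The counts are invented purely for illustration and do not come from the dataset. With a single binary predictor, the coefficients correspond directly to the log-odds and log-odds ratio calculated by hand from the table.

```python
import math

# Hypothetical counts, invented purely for illustration
exposed_events, exposed_non_events = 20, 80
unexposed_events, unexposed_non_events = 10, 90

# Odds = number with the outcome / number without the outcome
odds_exposed = exposed_events / exposed_non_events
odds_unexposed = unexposed_events / unexposed_non_events
odds_ratio = odds_exposed / odds_unexposed

# Coefficients on the log scale for a model with one binary predictor
beta0 = math.log(odds_unexposed)   # log-odds of the unexposed group
beta1 = math.log(odds_ratio)       # log-odds ratio, exposed vs unexposed

print('exp(beta0) =', math.exp(beta0), '(odds, unexposed)')
print('exp(beta1) =', math.exp(beta1), '(odds ratio)')
print('exp(beta0 + beta1) =', math.exp(beta0 + beta1), '(odds, exposed)')
```

Here the exposed group has odds of 0.25 and the odds ratio is 2.25, so multiplying the unexposed odds by the odds ratio recovers the exposed odds.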

Let's do this for our model output - recall...

In [10]:
print(model.summary())

                           Logit Regression Results                           
Dep. Variable:               long_los   No. Observations:               318438
Model:                          Logit   Df Residuals:                   318435
Method:                           MLE   Df Model:                            2
Date:                Sun, 26 Feb 2023   Pseudo R-squ.:                 0.01394
Time:                        21:42:03   Log-Likelihood:                -31983.
converged:                       True   LL-Null:                       -32435.
Covariance Type:            nonrobust   LLR p-value:                4.890e-197
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                       -3.3354      0.023   -144.732      0.000      -3.381      -3.290
illness_severity[T.Minor]       -1.1492      0.040    -28.850      0.000     

In [11]:
#odds of long length of stay for severe illness (reference group)
#coefficients are selected by label rather than by position
log_odds_severe = model.params['Intercept']
odds_severe = math.exp(log_odds_severe)

#odds ratios for minor and moderate illness
log_or_minor = model.params['illness_severity[T.Minor]']
or_minor = math.exp(log_or_minor)

log_or_mod = model.params['illness_severity[T.Moderate]']
or_mod = math.exp(log_or_mod)

print('Odds Severe:' + str(odds_severe))
print('Odds Ratio Minor:' + str(or_minor))
print('Odds Ratio Moderate:' + str(or_mod))

#odds for minor and moderate groups: exp(beta0 + beta1) and exp(beta0 + beta2)
beta0_add_beta1 = log_odds_severe + log_or_minor
beta0_add_beta2 = log_odds_severe + log_or_mod

odds_minor = math.exp(beta0_add_beta1)
odds_mod = math.exp(beta0_add_beta2)

print('Odds Minor:' + str(odds_minor))
print('Odds Moderate:' + str(odds_mod))

Odds Severe:0.03560148248224058
Odds Ratio Minor:0.31689699925659315
Odds Ratio Moderate:0.6162384072368543
Odds Minor:0.011282002967708205
Odds Moderate:0.0219390008601267


### Interpretation
So what can we say from this model?

The odds of having a length of stay >100 days for severe illness are 0.036. This compares with 0.011 for minor illness and 0.022 for moderate illness. The odds ratios for minor and moderate illness, each relative to severe illness, are 0.32 and 0.62 respectively, i.e. less severe illness is associated with lower odds of a stay >100 days.
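This is also where the link function comes back in: applying the inverse logit to each group's log-odds gives the predicted probability of a long stay. The sketch below copies the coefficients from the `model.params` output above rather than refitting the model; the helper name `inverse_logit` is our own.

```python
import math

# Coefficients copied from the model.params output above
intercept = -3.335368
coef_minor = -1.149178
coef_moderate = -0.484121

def inverse_logit(log_odds):
    """Convert log-odds to a probability."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# Log-odds for each group: the intercept plus that group's coefficient
log_odds_by_group = {
    'Extreme': intercept,
    'Minor': intercept + coef_minor,
    'Moderate': intercept + coef_moderate,
}

probabilities = {
    group: inverse_logit(log_odds)
    for group, log_odds in log_odds_by_group.items()
}

for group, probability in probabilities.items():
    print(group, round(probability, 4))
```

Because a stay >100 days is a rare outcome, the probabilities (roughly 3.4%, 1.1% and 2.1%) are close to the corresponding odds.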

## Estimation of the Odds & Odds Ratios

When we have a dataset that we wish to apply logistic regression to, we are using sample data to *infer* information about a true population. We cannot quantify this true population, and therefore, the outputs of our model produce **estimates** of the *true parameters*. Because they are estimates, they inevitably have **error**. It is important to calculate this error along with our estimates when presenting results.

#### Estimation of the Parameters: Odds & Odds Ratios
In logistic regression, we use Maximum Likelihood Estimation to estimate the odds and odds ratios. A discussion of this is beyond the scope of this workbook.

#### Confidence Intervals
In frequentist statistics (which we are discussing here), a 95% confidence interval is a range of values constructed so that, if we repeated our sampling of the true population an infinite number of times and calculated an interval each time, 95% of those intervals would contain the true population parameter.

The choice of 95%, whilst widely used in healthcare, is arbitrary, and can be replaced with any level: for an acceptable false positive rate $\alpha$, the confidence level is $100(1-\alpha)$% (e.g. $\alpha = 0.05$ gives 95%).

There are two methods to calculate confidence intervals for logistic regression: Wald-type and profile likelihood.

- **Wald-Type** Confidence Intervals are similar to what you may have encountered previously, where 95% CIs are given by:

$$ \text{Estimate} \pm 1.96 \times \text{Standard Error of the Estimate} $$


- **Profile Likelihood** Confidence Intervals are more complex, and depend on the log-likelihood ratio of the parameters of interest
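As a hand check of the Wald-type formula, the sketch below uses the intercept estimate and standard error as printed (and therefore rounded) in the model summary above, so the result only approximately matches the reported interval.

```python
import math

# Values copied from the Intercept row of the model summary (rounded as printed)
estimate = -3.3354
std_err = 0.023

z_crit = 1.96  # critical value for a 95% interval

# Wald-type interval on the log-odds scale
lower = estimate - z_crit * std_err
upper = estimate + z_crit * std_err
print('95% CI for log-odds:', round(lower, 3), round(upper, 3))

# Exponentiate the limits to move to the odds scale
print('95% CI for odds:', round(math.exp(lower), 4), round(math.exp(upper), 4))
```

The interval is symmetric around the estimate on the log-odds scale, but not after exponentiating to the odds scale.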

Confidence Intervals for the log-odds are also shown in the model output. We can exponentiate these values to get 95% CIs for the odds and odds ratios.

In [12]:
odds_ratios = pd.DataFrame({"OR": model.params,
        "Lower CI": model.conf_int()[0],
        "Upper CI": model.conf_int()[1],})

odds_ratios = np.exp(odds_ratios)
print(odds_ratios)

                                    OR  Lower CI  Upper CI
Intercept                     0.035601  0.034029  0.037246
illness_severity[T.Minor]     0.316897  0.293097  0.342629
illness_severity[T.Moderate]  0.616238  0.582970  0.651405


### Statistical Significance

When creating regression models, we are often looking for *associations* between certain predictors and outcomes. In this case, we want to know whether illness severity is associated with length of stay.

A common way to represent a *significant* relationship is the p-value. The p-value is the probability of observing data at least as extreme as ours if the null hypothesis, that there is no association between $X$ and $Y$, were true. We typically say that, if this probability is less than 0.05, we have sufficient evidence to reject the null hypothesis.

It is important to note that the p-value should not be regarded as a binary value. Instead, the lower the p-value, the greater reassurance we can have that the outcome is unlikely given the null hypothesis.

If we return to our model summary, we can see that the P>|z| values are displayed as 0.000 for each coefficient, meaning they are smaller than 0.0005 rather than exactly zero. Given the null hypothesis is that there is no difference in odds of LOS >100 days by severity of illness, these p-values give us strong evidence to reject it. Therefore, there appears to be an association between illness severity and LOS >100 days.
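To see where a P>|z| value comes from, the sketch below converts a z statistic into a two-sided p-value using the standard normal distribution. The z value is copied from the `illness_severity[T.Minor]` row of the summary; the helper name `two_sided_p` is our own, and this is a hand calculation rather than a description of how `statsmodels` computes it internally.

```python
import math

def two_sided_p(z):
    """Two-sided p-value P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

# z statistic copied from the illness_severity[T.Minor] row of the summary
z_minor = -28.850
p_minor = two_sided_p(z_minor)
print(p_minor)

# Sanity check: |z| = 1.96 corresponds to a p-value of about 0.05
print(two_sided_p(1.96))
```

The resulting p-value is far too small to display to three decimal places, which is why the summary prints 0.000.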

## Next Steps
This notebook has covered some of the basics of logistic regression. You can use the same approach to see how other factors affect hospital length of stay. You may also wish to change the definition of long length of stay, and see how this affects the model outputs. By including multiple factors in the model, you can adjust for them: for example, age may affect the influence of disease severity on length of stay, and including age in the model adjusts for it.
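As one possible starting point for changing the outcome definition, the sketch below recodes the stay categories so that a long stay means more than 50 days. The 50-day threshold is an arbitrary assumption for illustration, and the toy Series stands in for the `los` column used earlier.

```python
import numpy as np
import pandas as pd

# Toy series using stay labels that appear in the dataset
los = pd.Series(['0-10', '21-30', '51-60', '91-100', 'More than 100 Days'])

# Assumed alternative definition: a long stay is more than 50 days
long_labels = ['51-60', '61-70', '71-80', '81-90', '91-100',
               'More than 100 Days']

# 1 if the stay category is in the long-stay list, otherwise 0
long_los_50 = np.where(los.isin(long_labels), 1, 0)
print(long_los_50)
```

The resulting array can replace `long_los` when building `model_data`, after which the model can be refitted in the same way.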

Logistic Regression can also be used to predict outcomes for unseen data. This is the basis of some machine learning algorithms. This will be discussed elsewhere. 