# MQB7046 MODELLING PUBLIC HEALTH DATA - Logistic Regression



Logistic regression models are statistical models used to study effects of independent variables (IVs) on binary categorical outcomes, such as presence or absence of disease (e.g., diabetes or no diabetes).  The model is also called a binary logistic model. 
When there are multiple IVs (e.g., risk factors and treatments) the model is referred to as a multiple or multivariable logistic regression model. 

In the Statsmodels library, logistic regression can be performed using the following approach:
1) statsmodels.formula.api
2) statsmodels.api
3) statsmodels.discrete.discrete_model
4) statsmodels.genmod.generalized_linear_modelicients.



##### The General Steps Required to Run Multiple Logistic Regression:

1. **Prepare Your Data**:
   - Clean and preprocess the data, handle any missing values, outliers, and formatting issues.

2. **Fit the Logistic Regression Model**:
   - Use the `logit()` function from `statsmodels.formula.api` (or any above) to specify and fit the logistic regression model to the data. Specify the model using a formula string that includes the dependent variable and independent variables.

3. **Evaluate Performance**:
   - **Likelihood Ratio Test**: Perform a likelihood ratio test to compare the fit of the fitted model to a reduced model (null model). This test assesses whether the full model significantly improves the fit over the null model.
   - **Wald Test**: Perform a Wald test for each coefficient in the model to test the significance of individual predictors. This test assesses whether each coefficient significantly differs from zero.
   - **AIC and BIC**: Calculate the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to compare the goodness-of-fit of the model to other models. Lower values indicate better fit.

4. **Assess Model Assumptions**:
   - Check the assumptions of logistic regression, such as linearity of the log odds, absence of multicollinearity, and absence of influential outliers.

5. **Interpret the Results**:
   - Interpret the coefficients of the logistic regression model to understand the relationship between the independent variables and the probability of the outcome. Report the estimated coefficients, odds ratios, confidence intervals, and statistical significance of the variables involved.


#### Practical 3

A medical student wanted to study the association between body mass index (BMI) and hypercholesterolemia (HCL). He selected 40 patients, all of similar age, and obtained their BMI values and hypercholesterolemia status from their medical records. Based on the BMI values, the patients were classified as overweight or normal. The data are provided as hcl.csv. 
- sex: 0: male, 1: female
- hcl: 0: no HCL, 1: HCL
- overweight: 0: Normal, 1: overweight

Run a multiple logistic regression to estimate the adjusted odds ratios to quantify the effect of sex and BMI on hypocholesterolemia. 


In [None]:
# Install statsmodels

# pip install statsmodels

In [33]:
# Import Pandas and the statsmodels Formula API

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import plotly.express as px
from scipy.stats import chi2_contingency

In [10]:
# Load data into a DataFrame 

hcl = pd.read_csv(r"hcl.csv")

# or hcl = pd.read_csv(r"C:\Users\USER\MPHD notebooks\hcl.csv")
hcl.head()

Unnamed: 0,sex,overweight,hcl
0,1,1,1
1,0,0,0
2,0,0,0
3,1,1,1
4,0,1,1


In [7]:
# Display basic information about the dataset
print(hcl.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   sex         40 non-null     int64
 1   overweight  40 non-null     int64
 2   hcl         40 non-null     int64
dtypes: int64(3)
memory usage: 1.1 KB
None
--------------------
sex           0
overweight    0
hcl           0
dtype: int64


In [8]:
# Display descriptive statistics
hcl.describe().round(2)

Unnamed: 0,sex,overweight,hcl
count,40.0,40.0,40.0
mean,0.5,0.57,0.5
std,0.51,0.5,0.51
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.5,1.0,0.5
75%,1.0,1.0,1.0
max,1.0,1.0,1.0


In [19]:
# Generate frequencies tables, calculate percentage
print(hcl.pivot_table(index = "sex", columns = "hcl", values = "overweight", aggfunc = len))
print("--------------------")
print(hcl.pivot_table(index = "overweight", columns = "hcl", values = "overweight", aggfunc = len))
print("--------------------")
print(hcl.pivot_table(index = "sex", columns = "overweight", values = "hcl", aggfunc = len))

hcl   0   1
sex        
0    10  10
1    10  10
--------------------
hcl          0   1
overweight        
0           13   4
1            7  16
--------------------
overweight  0   1
sex              
0           9  11
1           8  12


In [20]:
# Check for missing values
print(hcl.isnull().sum())

sex           0
overweight    0
hcl           0
dtype: int64


In [24]:
# Explore data distribution
px.bar(hcl, x = "overweight", y = "hcl", facet_row = "sex")





In [29]:
# Conduct bivariate analysis
for column in ["sex", "overweight"]:
    chi2, p_value, _, _ = chi2_contingency(hcl.pivot_table(index = column, columns = "hcl", aggfunc = len))
    print(f"""Chi-square between {column} and HCL = {chi2:.2f} with p_value = {p_value:.2f}""")


Chi-square between sex and HCL = 0.00 with p_value = 1.00
Chi-square between overweight and HCL = 6.55 with p_value = 0.01


In [31]:
# Fitting a logistic regression
# smf.logit is used to create a logistic regression model using a formula interface (hcl given (~) sex and overweight)
# Define the formula for logistic regression
formula = 'hcl ~ sex + overweight'

# Fit the logistic regression model
log_reg_model = smf.ols(formula=formula, data=hcl).fit()

# Examining fit results
print(log_reg_model.summary())

                            OLS Regression Results                            
Dep. Variable:                    hcl   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.165
Method:                 Least Squares   F-statistic:                     4.850
Date:                Fri, 05 Apr 2024   Prob (F-statistic):             0.0135
Time:                        15:07:52   Log-Likelihood:                -24.376
No. Observations:                  40   AIC:                             54.75
Df Residuals:                      37   BIC:                             59.82
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.2462      0.132      1.869      0.0

In [35]:
# Calculate odds ratios
pd.Series(np.exp(log_reg_model.params), index = log_reg_model.params.index)

Intercept     1.279096
sex           0.977187
overweight    1.586513
dtype: float64

In [40]:
# Assess model fit
fit_assessment = {"Standard Error":}
print(f"Standard Error = \n{log_reg_model.bse}")


Standard Error = 
Intercept     0.131722
sex           0.146522
overweight    0.148198
dtype: float64


In [None]:
# Assess model assumptions

In [None]:
# Check for multicollinearity

