# MQB7046 MODELLING PUBLIC HEALTH DATA - Multinomial logistic regression

###### Prepared by Claire Choo (4/4/2024)

Multinomial logistic regression is a statistical method used to model the relationship between a single categorical outcome variable with more than two unordered categories and one or more independent variables.

* Data Preparation:
  - Load the dataset and ensure it is properly formatted.
  - Convert any categorical variables into dummy variables or use one-hot encoding.


* Model Specification:
  - Import the statsmodels.api module
  - Define independent variables and dependent variable
  - Add a constant term to the independent variables using sm.add_constant().
  - Specify the multinomial logistic regression model using sm.MNLogit().


* Model Fitting:
  - Fit the model with the data using the .fit() method.


* Interpretation and Evaluation: 
  - Interpret the coefficients, p-values, and confidence intervals to assess the significance of the independent variables (or predictors).
  - Assess the goodness-of-fit of the model using appropriate metrics like pseudo-R squared, likelihood ratio tests, or others.
  - Consider the odds ratios associated with each predictor variable to assess the magnitude of the effect.


#### Practical 4

The researchers are interested in examining factors associated with the employment status of a group of individuals.

Variable / Definition:
1) id  : identification number
2) Sex : Gender of participants (Male, Female)
3) Age : Age group of participants (25-44, 45-54, 55-64)
4) Marital: Marital status (Married/Cohabiting, Single, Widowed/Divorced) 
5) Employment : Employment status (Employed, Nonemployed, Unemployed)


In [1]:
import pandas as pd
import numpy as np
from mphd import *

# Load the dataset into a DataFrame
df = pd.read_excel(r"open.xlsx")

# Display the first few rows of the DataFrame to verify that the data is loaded correctly
df.head()

|   | id | Sex | Age | Marital | Employment |
|---|----|-----|-----|---------|------------|
| 0 | 1 | Male | 55-64 | Married/Cohabiting | Employed |
| 1 | 2 | Male | 25-44 | Married/Cohabiting | Nonemployed |
| 2 | 3 | Male | 25-44 | Married/Cohabiting | Employed |
| 3 | 4 | Male | 55-64 | Married/Cohabiting | Employed |
| 4 | 5 | Male | 55-64 | Married/Cohabiting | Employed |


In [2]:
# Check the structure of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 741 entries, 0 to 740
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          741 non-null    int64 
 1   Sex         741 non-null    object
 2   Age         741 non-null    object
 3   Marital     741 non-null    object
 4   Employment  741 non-null    object
dtypes: int64(1), object(4)
memory usage: 29.1+ KB


In [3]:
# Descriptive Analysis on the data
categorical_descriptive_analysis(df, independent_variables=df.columns[1:4], dependent_variables="Employment", margins=True)

Descriptive Analysis for independent variables:
Descriptive Analysis on Sex
+--------+-------+------------+
|  Sex   | count | percentage |
+--------+-------+------------+
| Female | 396.0 |   53.44    |
|  Male  | 345.0 |   46.56    |
|  All   | 741.0 |   100.0    |
+--------+-------+------------+
Chi2 test between Sex and Employment have chi2 statistics value = 68.35 and p_value of 0.00
-----------------------------------------------------
Descriptive Analysis on Age
+-------+-------+------------+
|  Age  | count | percentage |
+-------+-------+------------+
| 25-44 | 388.0 |   52.36    |
| 45-54 | 169.0 |   22.81    |
| 55-64 | 184.0 |   24.83    |
|  All  | 741.0 |   100.0    |
+-------+-------+------------+
Chi2 test between Age and Employment have chi2 statistics value = 124.44 and p_value of 0.00
-----------------------------------------------------
Descriptive Analysis on Marital
+--------------------+-------+------------+
|      Marital       | count | percentage |
+----------
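The chi-square tests reported above come from the course helper `categorical_descriptive_analysis`, whose internals are not shown. A comparable test of independence can be sketched with `scipy.stats.chi2_contingency`; the counts below are illustrative only and are not taken from the practical's data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative counts only (NOT the practical's data): rows are levels of a
# predictor, columns are levels of the outcome.
table = pd.DataFrame(
    [[30, 50, 20],
     [60, 25, 15]],
    index=["Female", "Male"],
    columns=["Employed", "Nonemployed", "Unemployed"],
)

# Returns the test statistic, p-value, degrees of freedom and expected counts.
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.4f}")
```

On the real data, the contingency table could be built with `pd.crosstab(df["Sex"], df["Employment"])` before passing it to the test.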

In [4]:
# Convert categorical variables into numerical labels
data = label_encode(df = df.copy(deep = True), columns= df.columns[1:], prefix = "encoded", convert_numeric = True)

# encoded columns = 
encoded_columns = ["id",] + [x for x in data.columns if "encoded" in x]

# Print the dataframe head
data.head()

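|   | id | Sex | Age | Marital | Employment | encoded_Sex | encoded_Age | encoded_Marital | encoded_Employment |
|---|----|-----|-----|---------|------------|-------------|-------------|-----------------|--------------------|
| 0 | 1 | Male | 55-64 | Married/Cohabiting | Employed | 1 | 2 | 0 | 0 |
| 1 | 2 | Male | 25-44 | Married/Cohabiting | Nonemployed | 1 | 0 | 0 | 1 |
| 2 | 3 | Male | 25-44 | Married/Cohabiting | Employed | 1 | 0 | 0 | 0 |
| 3 | 4 | Male | 55-64 | Married/Cohabiting | Employed | 1 | 2 | 0 | 0 |
| 4 | 5 | Male | 55-64 | Married/Cohabiting | Employed | 1 | 2 | 0 | 0 |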
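The codes shown above are consistent with numbering each variable's levels in alphabetical order (for example Male = 1, 55-64 = 2, Employed = 0). `label_encode` is a course helper and its implementation is not shown, so the following pandas-only version is an assumption about what it does, run on a small stand-in frame.

```python
import pandas as pd

# Small stand-in frame with the same column names as the practical's data.
df_demo = pd.DataFrame({
    "Sex": ["Male", "Female", "Male"],
    "Age": ["55-64", "25-44", "45-54"],
    "Employment": ["Employed", "Nonemployed", "Unemployed"],
})

encoded = df_demo.copy()
for col in ["Sex", "Age", "Employment"]:
    # pd.Categorical sorts the observed levels, so codes 0, 1, 2, ...
    # follow alphabetical order of the labels.
    encoded[f"encoded_{col}"] = pd.Categorical(df_demo[col]).codes

print(encoded)
```

Integer codes like these are convenient for the outcome, but for predictors with more than two levels they imply an ordering and equal spacing that a nominal variable does not have.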


In [5]:
# Convert X to dummy variables
dummy_df = pd.get_dummies(df, columns = df.columns[1:], dtype = int)
dummy_df.head()

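|   | id | Sex_Female | Sex_Male | Age_25-44 | Age_45-54 | Age_55-64 | Marital_Married/Cohabiting | Marital_Single | Marital_Widowed/ Divorced | Employment_Employed | Employment_Nonemployed | Employment_Unemployed |
|---|----|------------|----------|-----------|-----------|-----------|----------------------------|----------------|---------------------------|---------------------|------------------------|-----------------------|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 5 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |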


# Run multinomial logistic regression

In [30]:
# Test 1: Single Independent Variable (Sex) with Single Target Variable
#add constant
sex_only_model = multinominal_logistic_regression(data, mode ="sm.MNLogit", x=("encoded_Sex",), y = "encoded_Employment")
sex_only_model_assessment = assess_model_fitness(sex_only_model, "Employment ~ Sex")
sex_only_model_assessment.round(4)

Optimization terminated successfully.
         Current function value: 0.695885
         Iterations 7
                               Employment ~ Sex                               
Dep. Variable:     encoded_Employment   No. Observations:                  741
Model:                        MNLogit   Df Residuals:                      737
Method:                           MLE   Df Model:                            2
Date:                Mon, 15 Apr 2024   Pseudo R-squ.:                 0.02219
Time:                        08:58:31   Log-Likelihood:                -515.65
converged:                       True   LL-Null:                       -527.35
Covariance Type:            nonrobust   LLR p-value:                 8.291e-06
encoded_Employment=1       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.9200      0.114     -8.085      0.000      -1.143      -0.697

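|   | model_name | likelihood_ratio | llr_p_value | pseudo-r-squared | aic_akaike_information_criterion | bic_bayesin_information_criterion |
|---|------------|------------------|-------------|------------------|----------------------------------|-----------------------------------|
| 0 | Employment ~ Sex | 23.4007 | 0.0 | 0.0222 | 1039.302 | 1057.734 |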
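The fit statistics above can be reproduced by hand from the two log-likelihoods in the summary. The sketch below uses the rounded values printed there, so results agree with the table only up to rounding. The parameter count of 4 is inferred from the summary (741 observations minus 737 residual degrees of freedom), and the p-value uses the closed form of the chi-square survival function, which holds only for 2 degrees of freedom.

```python
import math

# Values copied from the summary above.
ll_model = -515.65   # Log-Likelihood
ll_null = -527.35    # LL-Null
n_obs = 741          # No. Observations
k_params = 4         # 2 non-reference outcomes x (intercept + 1 predictor)

lr_stat = 2 * (ll_model - ll_null)     # likelihood ratio statistic
pseudo_r2 = 1 - ll_model / ll_null     # McFadden's pseudo R-squared

# With 2 degrees of freedom (Df Model), P(chi2 > x) = exp(-x / 2).
lr_p_value = math.exp(-lr_stat / 2)

aic = -2 * ll_model + 2 * k_params
bic = -2 * ll_model + k_params * math.log(n_obs)

print(f"LR = {lr_stat:.2f}, p = {lr_p_value:.3e}, pseudo R2 = {pseudo_r2:.4f}")
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

The `llr_p_value` of 0.0 in the assessment table is the same quantity rounded to four decimals; the summary reports it as 8.291e-06.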


In [44]:
### Test 2: Adding more IVs in the model
independent_variable_list = iterate_independent_variables(independent_variables=encoded_columns[1:4], dependent_variables=encoded_columns[4])
columns_for_independent_variable_list = independent_variable_list[0]
formula_list = independent_variable_list[1]
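# Note: mode="smf.ols" fits ordinary least squares, so the encoded outcome is treated as numeric here
model_list = [multinominal_logistic_regression(data, mode = "smf.ols", formula = formula) for formula in formula_list]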

assessment_df = pd.DataFrame()

for formula, model in zip(formula_list, model_list):
    # assessment_df = pd.concat([assessment_df, assess_model_fitness(model, title = formula)])
    print(model.summary(title = formula))
    # Exponentiated coefficients (odds ratios only when the model is fitted on the log-odds scale)
    print(f"Odds ratio for {formula}")
    print(np.exp(model.params).round(4).to_markdown(tablefmt = "pretty"))
    #                             "p_values":[model.pvalues],
    # Collect the fit statistics for this model in a one-row frame
    temp_df = pd.DataFrame({"model_name":[formula],
                            "Log-Likelihood":[model.llf],
                            "R-Squared":[model.rsquared],
                            "F-statistic":[model.fvalue],
                            "F-p_value":[model.f_pvalue],
                            "aic_akaike_information_criterion":[model.aic],
                            "bic_bayesin_information_criterion":[model.bic]})
    assessment_df = pd.concat([assessment_df, temp_df], ignore_index = True)
    print("-------------------------------------------------------------------------------------------------------")

                       encoded_Employment ~ encoded_Sex                       
Dep. Variable:     encoded_Employment   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     4.185
Date:                Mon, 15 Apr 2024   Prob (F-statistic):             0.0411
Time:                        09:36:00   Log-Likelihood:                -636.01
No. Observations:                 741   AIC:                             1276.
Df Residuals:                     739   BIC:                             1285.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.3586      0.029     12.483      

In [46]:
assessment_df.round(4)

|   | model_name | Log-Likelihood | R-Squared | F-statistic | F-p_value | aic_akaike_information_criterion | bic_bayesin_information_criterion |
|---|------------|----------------|-----------|-------------|-----------|----------------------------------|-----------------------------------|
| 0 | encoded_Employment ~ encoded_Sex | -636.0071 | 0.0056 | 4.1851 | 0.0411 | 1276.0143 | 1285.2303 |
| 1 | encoded_Employment ~ encoded_Age | -635.2513 | 0.0077 | 5.7027 | 0.0172 | 1274.5026 | 1283.7186 |
| 2 | encoded_Employment ~ encoded_Marital | -627.6901 | 0.0277 | 21.0569 | 0.0 | 1259.3802 | 1268.5962 |
| 3 | encoded_Employment ~ encoded_Sex + encoded_Age | -632.6108 | 0.0147 | 5.5071 | 0.0042 | 1271.2217 | 1285.0457 |
| 4 | encoded_Employment ~ encoded_Sex + encoded_Mar... | -626.507 | 0.0308 | 11.728 | 0.0 | 1259.0141 | 1272.8381 |
| 5 | encoded_Employment ~ encoded_Age + encoded_Mar... | -624.8469 | 0.0351 | 13.4378 | 0.0 | 1255.6938 | 1269.5178 |
| 6 | encoded_Employment ~ encoded_Sex + encoded_Age... | -623.2453 | 0.0393 | 10.0494 | 0.0 | 1254.4906 | 1272.9226 |
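The seven formulas in the table are every non-empty combination of the three predictors. `iterate_independent_variables` is a course helper whose implementation is not shown; the sketch below is one assumed way to generate the same list of formulas in the same order with `itertools.combinations`.

```python
from itertools import combinations

predictors = ["encoded_Sex", "encoded_Age", "encoded_Marital"]
outcome = "encoded_Employment"

# Build one formula per subset: first the single predictors, then pairs,
# then the full set of three.
formulas = [
    f"{outcome} ~ {' + '.join(subset)}"
    for size in range(1, len(predictors) + 1)
    for subset in combinations(predictors, size)
]

for formula in formulas:
    print(formula)
```

With p predictors this produces 2^p - 1 formulas, so exhaustive enumeration is only practical for a handful of candidate variables.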


In [47]:
model_list[6].summary()

Dep. Variable:     encoded_Employment   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     10.05
Date:                Mon, 15 Apr 2024   Prob (F-statistic):            1.7e-06
Time:                        10:45:10   Log-Likelihood:                -623.25
No. Observations:                 741   AIC:                            1254.0
Df Residuals:                     737   BIC:                            1273.0
Df Model:                           3
Covariance Type:            nonrobust

                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           0.2455      0.036      6.749      0.000       0.174       0.317
encoded_Sex        -0.0749      0.042     -1.787      0.074      -0.157       0.007
encoded_Age         0.0636      0.025      2.553      0.011       0.015       0.112
encoded_Marital     0.1205      0.028      4.344      0.000       0.066       0.175

Omnibus:                      200.046   Durbin-Watson:                   1.992
Prob(Omnibus):                    0.0   Jarque-Bera (JB):              394.158
Skew:                           1.572   Prob(JB):                     2.57e-86
Kurtosis:                       4.698   Cond. No.                          3.7
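The summary above is an OLS fit, which treats the 0/1/2 employment codes as a numeric scale. For a nominal outcome, the same kind of formula can be passed to the multinomial logit in statsmodels' formula API. The sketch below runs on synthetic data with assumed column names (`sex`, `age`, `employment`); it is not a re-analysis of the practical's dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the practical's dataset).
rng = np.random.default_rng(1)
n = 900
demo = pd.DataFrame({
    "sex": rng.integers(0, 2, size=n),
    "age": rng.integers(0, 3, size=n),
    "employment": rng.integers(0, 3, size=n),   # nominal outcome coded 0, 1, 2
})

# C(...) makes the formula parser treat an integer-coded predictor as categorical.
result = smf.mnlogit("employment ~ C(sex) + C(age)", data=demo).fit(disp=False)

# Exponentiated coefficients: odds of each non-reference outcome versus
# outcome 0, relative to the reference level of the predictor.
odds_ratios = np.exp(result.params)
print(odds_ratios.round(3))
```

Each column of `odds_ratios` corresponds to one non-reference outcome, so every predictor level has a separate ratio for each outcome contrast.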


In [None]:
# VIF?

In [None]:
# Comparing model 2 and model 1 using AIC, BIC
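One way to carry out such a comparison, sketched on the AIC and BIC values reported in the assessment table earlier. Model names are copied exactly as displayed there, including the `...` truncation. Which two models the note refers to is not stated, so the sketch ranks all seven.

```python
# (model_name, AIC, BIC) copied from the assessment table of the OLS fits.
models = [
    ("encoded_Employment ~ encoded_Sex", 1276.0143, 1285.2303),
    ("encoded_Employment ~ encoded_Age", 1274.5026, 1283.7186),
    ("encoded_Employment ~ encoded_Marital", 1259.3802, 1268.5962),
    ("encoded_Employment ~ encoded_Sex + encoded_Age", 1271.2217, 1285.0457),
    ("encoded_Employment ~ encoded_Sex + encoded_Mar...", 1259.0141, 1272.8381),
    ("encoded_Employment ~ encoded_Age + encoded_Mar...", 1255.6938, 1269.5178),
    ("encoded_Employment ~ encoded_Sex + encoded_Age...", 1254.4906, 1272.9226),
]

# Lower is better for both criteria; BIC penalises extra parameters more heavily.
best_aic = min(models, key=lambda m: m[1])
best_bic = min(models, key=lambda m: m[2])

print("Lowest AIC:", best_aic[0], best_aic[1])
print("Lowest BIC:", best_bic[0], best_bic[2])
```

The two criteria can disagree because of their different penalties. AIC and BIC are only comparable between models fitted to the same outcome with the same type of likelihood, so the MNLogit values from Test 1 should not be set against these OLS values.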

