# Assignment 2 for MQB7046: Modelling Public Health Data
Name: Chin Wei Hong </br>
Matric Number: 22110451 </br>

__Question:__

Data were collected for 1,200 men and women in a study on mental health. In this cross-sectional study, all participants were asked a series of 20 questions about depressive symptoms, and a sum score of depressive symptoms was constructed. All other variables were also collected during interviews in the participants’ homes. The dataset consists of the following information:

|Variable|Coding|
|:------:|:----:|
|age|age in years|
|sex|sex: 1:"male", 2:"female"|
|educ|education: 1:"primary", 2:"vocational", 3:"secondary", 4:"university"|
|married|marital status: 0:"married", 1:"unmarried"|
|deprivation|material deprivation: 1:"yes", 0:"no"|
|bmi|body mass index in kg/m2|
|sbp|systolic blood pressure in mmHg|
|alcohol|problem drinking: 1:"yes", 0:"no"|
|smoke|current smoking: 1:"yes", 0:"no"|
|depscore|depression score: score range from 0 - 60|

The depressive symptom score (CES-D score) was constructed from 20 questions. Each answer is scored on a scale of 0–3, so the overall score ranges from 0 to 60. A score of 0 means no depressive symptoms, and a score of 60 means the highest level of depressive symptoms. Based on evidence from previous studies in various populations, respondents can be categorised into the following groups: </br>
a) not depressed (0–9 points), </br>
b) mildly depressed (10–15 points), </br>
c) moderately depressed (16–24 points), or </br>
d) severely depressed (25 points or more).

In other studies, a score below 16 indicates no depression, while a score of 16 or more indicates clinically relevant depression.
The investigators are interested in the relationship between individuals' material deprivation and depression. They hypothesise that people who are more deprived are more likely to develop depression, or to have more depressive symptoms, than people in better material circumstances.
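
The cut points above can be sketched with `pd.cut`. This is a minimal illustration, not part of the assignment data: the `scores` values are made up, and the sketch assumes that a score of exactly 25 belongs to the severe category and that intervals are left-closed (`right=False`), so that 10, 16 and 25 each open a new category.

```python
import numpy as np
import pandas as pd

# Hypothetical scores chosen to sit on each category boundary
scores = pd.Series([0, 9, 10, 15, 16, 24, 25, 60])

# Left-closed bins: [0, 10), [10, 16), [16, 25), [25, inf)
four_level = pd.cut(
    scores,
    bins=[0, 10, 16, 25, np.inf],
    labels=["not_depressed", "mild", "moderate", "severe"],
    right=False,
)

# Left-closed bins: [0, 16), [16, inf)
two_level = pd.cut(
    scores,
    bins=[0, 16, np.inf],
    labels=["no_depression", "clinically_relevant_depression"],
    right=False,
)

print(pd.DataFrame({"score": scores, "four_level": four_level, "two_level": two_level}))
```

With the default `right=True`, a score of 0 would fall outside the first bin and a score of 16 would be grouped with the non-depressed, which is why the sketch sets `right=False`.
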
***
#### Instruction
Perform any relevant analyses and write a report that describes your analysis steps and summarises your findings.
1. State the related research question, objective and hypothesis based on the information provided above.
2. Summarise all variables available in the dataset.
3. Based on the information provided in (1), perform any required analyses to facilitate drawing relevant conclusions. Undertake any necessary data cleaning and preprocessing as needed. Use appropriate statistical methods or techniques and specify any assumptions necessary for the analysis. If warranted, provide justification for the chosen approach.
    1. Decide how you would like to analyse your dependent variable (continuous or categorical).
    2. Evaluate the association between possible risk factors and depression using non-regression and regression analysis.
    3. Also decide which variables need to be considered as possible confounders or effect modifiers, and justify your decision either way.
    4. Perform any additional analysis required to test these assumptions.
4. Summarise your results in appropriate tabular format and comment on your findings. State your conclusion.
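
For the non-regression step in 3.2, one possible choice for two binary variables is a crude odds ratio together with a Pearson chi-square statistic from the 2×2 table. The sketch below computes both by hand with NumPy. The counts are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical 2x2 counts (NOT from the study data):
# rows = deprivation (yes, no), columns = depression (yes, no)
observed = np.array([[30, 10],
                     [20, 40]], dtype=float)

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic (no continuity correction), 1 degree of freedom
chi2 = ((observed - expected) ** 2 / expected).sum()

# Crude odds ratio: (a * d) / (b * c)
odds_ratio = (observed[0, 0] * observed[1, 1]) / (observed[0, 1] * observed[1, 0])

print(f"chi-square = {chi2:.2f}, crude odds ratio = {odds_ratio:.2f}")
```

A regression model would then adjust this crude estimate for covariates such as age and sex.
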

In [1]:
import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))

import pandas as pd
import numpy as np
import wh0102 as mphd

# Prepare the value information for each categorical data
data_dictionary = {
    "sex": {1:"male", 2:"female"},
    "educ":{1:"primary", 2:"vocational", 3:"secondary", 4:"university"},
    "married":{0:"married", 1:"unmarried"},
    "deprivation":{1:"yes", 0:"no"},
    "alcohol":{1:"yes", 0:"no"},
    "smoke":{1:"yes", 0:"no"},
}

# Prepare the depression score categorical information with {type:[[bins], [labels]]}
depscore_category = {2:[[0, 16, np.inf], ["no_depression", "clinically_relevant_depression"]],
                     4:[[0, 10, 16, 25, np.inf], ["no_depression", "mild_depress", "moderate_depress", "severe_depress"]]}

# Load the dataset
df = pd.read_csv("connass2.csv")

print("Display the head of dataframe to ensure the data properly loaded")
print(df.head().to_markdown(tablefmt = "pretty", index = False))

print("Display the information of the dataframe")
# df.info() prints its summary itself and returns None, so it is not wrapped in print()
df.info()

Display the head of dataframe to ensure the data properly loaded
+-------+-----+----------+------+-------------+-------+--------+---------+-------+---------+
|  age  | sex | depscore | educ | deprivation |  bmi  |  sbp   | alcohol | smoke | married |
+-------+-----+----------+------+-------------+-------+--------+---------+-------+---------+
| 51.55 | 2.0 |   15.0   | 4.0  |     0.0     | 34.13 | 177.33 |   0.0   |  0.0  |   0.0   |
| 51.93 | 2.0 |   9.0    | 3.0  |     0.0     | 30.61 | 176.0  |   0.0   |  0.0  |   0.0   |
| 51.19 | 2.0 |   11.0   | 4.0  |     0.0     | 30.64 | 188.67 |   0.0   |  0.0  |   1.0   |
| 52.3  | 2.0 |   9.0    | 3.0  |     0.0     | 23.9  | 150.67 |   0.0   |  0.0  |   0.0   |
| 59.33 | 1.0 |   13.0   | 3.0  |     0.0     | 33.34 | 138.67 |   0.0   |  0.0  |   0.0   |
+-------+-----+----------+------+-------------+-------+--------+---------+-------+---------+
Display the information of the dataframe
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1203 en

In [2]:
# Convert the dataframe for easier descriptive analysis
data = mphd.categorical_data.reverse_encode(df, json_dict=data_dictionary)

The column of sex is having 1 (0.08%) rows more 1 unique value than what {1: 'male', 2: 'female'}:
+-----+------+-----+----------+------+-------------+-------+--------+---------+-------+---------+
|     | age  | sex | depscore | educ | deprivation |  bmi  |  sbp   | alcohol | smoke | married |
+-----+------+-----+----------+------+-------------+-------+--------+---------+-------+---------+
| 531 | 52.7 |  5  |    3     |  2   |      1      | 22.14 | 111.33 |    1    |   0   |    0    |
+-----+------+-----+----------+------+-------------+-------+--------+---------+-------+---------+


__Interpretation__:
Only 1 row (0.08%) is affected, and the value sex = 5 is most likely a data-entry error (wrongly keyed in) rather than a real category. It is therefore safe to discard the row where sex == 5.
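
The kind of check behind this message can be sketched in plain pandas. This is an illustration only, not the implementation of `mphd.categorical_data.reverse_encode`; the `toy` frame, the `codebook` dictionary and the `find_invalid_codes` helper are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one out-of-range sex code (5)
toy = pd.DataFrame({"sex": [1, 2, 5, 2], "smoke": [0, 1, 0, 1]})
codebook = {"sex": {1: "male", 2: "female"}, "smoke": {1: "yes", 0: "no"}}

def find_invalid_codes(frame, codebook):
    """Return, per column, the rows whose non-missing value is not in the codebook."""
    invalid = {}
    for column, mapping in codebook.items():
        mask = frame[column].notna() & ~frame[column].isin(list(mapping))
        if mask.any():
            invalid[column] = frame.loc[mask]
    return invalid

invalid_rows = find_invalid_codes(toy, codebook)
for column, rows in invalid_rows.items():
    print(f"{column}: {len(rows)} row(s) with codes outside the codebook")

# Keep only rows whose sex code appears in the codebook
clean = toy[toy["sex"].isin(list(codebook["sex"]))]
```
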

In [3]:
# Drop the row that contain sex == 5
df = df.query("sex != 5")
data = data.query("sex != 5")

In [6]:
# Descriptive Analysis on categorical data
chi2_positive_df, chi2_negative_df = mphd.categorical_data.categorical_descriptive_analysis(data, 
                                                                                            independent_variables = list(data_dictionary.keys()), 
                                                                                            dependent_variables = None)

Descriptive Analysis for independent variables:
Descriptive Analysis on sex
+--------+--------+------------+
|  sex   | count  | percentage |
+--------+--------+------------+
| female | 571.0  |    47.5    |
|  male  | 631.0  |    52.5    |
|  All   | 1202.0 |   100.0    |
+--------+--------+------------+
----------------------------------------------------------------
Descriptive Analysis on educ
+------------+--------+------------+
|    educ    | count  | percentage |
+------------+--------+------------+
|  primary   | 162.0  |   13.48    |
| secondary  | 470.0  |    39.1    |
| university | 374.0  |   31.11    |
| vocational | 196.0  |   16.31    |
|    All     | 1202.0 |   100.0    |
+------------+--------+------------+
----------------------------------------------------------------
Descriptive Analysis on married
+-----------+--------+------------+
|  married  | count  | percentage |
+-----------+--------+------------+
|  married  | 919.0  |   76.46    |
| unmarried | 283.0  |   

In [None]:
# Deal with missing data
print("Columns with missing data:")
missing_counts = data.isnull().sum()
missing_data_summary = missing_counts[missing_counts > 0].to_frame(name = "count")
print(missing_data_summary.to_markdown(tablefmt = "pretty"))

# Show the rows that have at least one missing value
missing_data_columns = missing_data_summary.index
print(data.loc[data.loc[:, missing_data_columns].isnull().any(axis = 1)].to_markdown(tablefmt = "pretty"))

# Pairwise correlation between the continuous variables
print(data.loc[:, ["age", "depscore", "bmi", "sbp"]].corr().round(4).to_markdown(tablefmt = "pretty"))

In [None]:
# Continuous data descriptive analysis (rows with missing bmi or sbp are excluded)
continous_descriptive_df = mphd.continous_data.descriptive_analysis(df = data.loc[~(data.loc[:,"bmi"].isnull() | data.loc[:,"sbp"].isnull())])
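
The boolean mask above keeps only the rows where both `bmi` and `sbp` are present, that is, a complete-case subset. A minimal sketch with a made-up frame suggests that `DataFrame.dropna(subset=...)` selects the same rows; the `toy` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values with gaps in bmi and sbp
toy = pd.DataFrame({
    "age": [51.5, 52.3, 59.3, 47.0],
    "bmi": [34.1, np.nan, 33.3, 25.0],
    "sbp": [177.3, 150.7, np.nan, 120.0],
})

# Mask-based filter, as in the cell above
by_mask = toy.loc[~(toy["bmi"].isnull() | toy["sbp"].isnull())]

# Same filter expressed with dropna on the two columns
by_dropna = toy.dropna(subset=["bmi", "sbp"])

print(by_dropna)
print("rows dropped:", len(toy) - len(by_dropna))
```
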

In [None]:
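# right = False makes the bins left-closed, so a score of 0 is included and a score of 16 counts as clinically relevant
df.loc[:,"dep_2"] = pd.cut(df.loc[:,"depscore"], bins = depscore_category[2][0], labels=depscore_category[2][1], right = False)
df.loc[:,"dep_4"] = pd.cut(df.loc[:,"depscore"], bins = depscore_category[4][0], labels=depscore_category[4][1], right = False)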
df

In [None]:
df.query("bmi.isnull() | sbp.isnull()")

In [None]:
mphd.categorical_data.categorical_descriptive_analysis(df, independent_variables = list(data_dictionary.keys()), dependent_variables = None)