# Transforming Healthcare with Data Analytics and AI

---
## Module 7 Descriptives

This week we learned about different variable types and how to summarise these. We also learned how to visualise simple relationships between different types of variables, and run simple analyses to test for differences across groups. 

Now we're going to apply the techniques learned in this module to data from the MIMIC II clinical database, continuing the real-world failed extubation example we began with the digital phenotypes last week. 

## Extubation status question
As you will have just discussed, the goal of these few weeks is to see if we can predict extubation status. In this Descriptives class, we will do some preliminary analyses to see if any of our variables appear to relate to extubation status, and we will get to know our sample. 

### Learning objectives
By the end of this class, you will be able to summarise and visualise variables by variable type; that is:
1. Review and summarise categorical variables to describe your sample
2. Summarise numeric variables according to distribution
3. Compare numeric variables by group/categorical variable
4. Interpret scatterplots and know when they are useful

### Pre-class activities
This class is written with the assumption that **you have already**: 
- **Completed the reading and exercises** on variable types (categorical, Boolean, continuous etc). 
- **Familiarised yourself with the MIMIC II data** and the data dictionary for the table we will use today (ext_data).
- **Familiarised yourself with Jupyter notebooks**

### Activities
There are three activities and one demo in this notebook, which you will complete during the classes:
* Activity 1 Visualise categorical variables
* Activity 2 Summarise categorical variables
* Activity 3 Summarise numeric variables
* Demo Introducing Scatterplots

### How to use this notebook

Go to the 'Run' menu, then click 'Run all cells'. This will run every cell in the notebook. 

If you want to re-run any cell, click in the cell, and then press `Ctrl + Enter` to run the cell.

You will occasionally be asked to modify some small bits of text within the code, and select variables from some drop-down widgets to automatically re-run some code. You do not need to know how to code in order to do this successfully.

You are not expected to understand the Python code, or to be able to write Python code. It is made available for students who are interested. 


# Introducing the MIMIC II database and the extracted data set

We are using a smaller data set (called **ext_data.csv**), extracted from MIMIC II. 

**Each row** in the data set **comprises**: 

- **One patient**, starting with their unique `subject_id`
  
- A unique `icustay_id` (if a patient attended the ICU more than once, they will have more than one row in the data - one row per unique `icustay_id`)
  
- Demographic data and `admission_type_descr` (elective, emergency or urgent admission)
- Comorbidities, clinical observations and some intubation details
  
You should be familiar with these variables from the data dictionary: `Data dictionary ext_data`
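
Because the data have one row per ICU stay rather than one row per patient, it can be worth checking how many patients appear more than once. The following is a minimal sketch on a made-up table (the values are invented for illustration; only the column names `subject_id` and `icustay_id` come from the data set):

```python
import pandas as pd

# Toy data, invented for illustration: patient 2 has two ICU stays
toy = pd.DataFrame({
    "subject_id": [1, 2, 2, 3],
    "icustay_id": [101, 102, 103, 104],
})

# Number of distinct ICU stays per patient
stays_per_patient = toy.groupby("subject_id")["icustay_id"].nunique()

n_rows = len(toy)
n_patients = toy["subject_id"].nunique()
n_repeat_patients = int((stays_per_patient > 1).sum())

print(f"{n_rows} ICU stays from {n_patients} patients; "
      f"{n_repeat_patients} patient(s) with more than one stay")
```

If the number of rows is larger than the number of unique patients, some patients contribute more than one row to any summary you produce.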

# Configure notebook and import Python packages
This section installs and imports the relevant Python packages and configures the notebook.

Now that these libraries are installed, import them into this session by running the cell below.

In [None]:
# Import your utilities and all necessary modules
import pandas as pd    # explicit imports for the pd and sns names used below
import seaborn as sns
from d2k_utils_mod8 import *
from ipywidgets import interact, widgets
from IPython.display import display
from scipy.stats import fisher_exact  

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
sns.set()
%matplotlib inline

# Show the variables for the first few patients

In [None]:
# Import the data
df = load_data("ext_data.csv")

# Print the first five rows
df.head()

# Activity 1: Visualise categorical variables
Using the table above, which variables would you need to be able to describe the patients in your sample?

## 1.1 Visualise categorical variables using bar charts

Other than age, the variables to describe your sample are categorical. 

To visualise categorical variables, produce bar charts.
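
The height of each bar is simply the number of rows in that category. As a rough sketch of the table behind a bar chart (this is not the course helper `create_bar_chart`, whose internals are not shown here; the toy values are invented):

```python
import pandas as pd

# Toy categorical variable, invented for illustration
sex = pd.Series(["M", "F", "M", "M", "F", None], name="sex")

# Counts per category; dropna=False keeps missing values visible
counts = sex.value_counts(dropna=False)

# The same information expressed as percentages of all rows
percentages = (counts / counts.sum() * 100).round(1)

summary = pd.DataFrame({"n": counts, "%": percentages})
print(summary)
```

In a notebook, `counts.plot.bar()` would draw these counts as a bar chart.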

In [None]:
# Categorical variables
categorical_variables = ['sex', 'marital_status_descr', 'ethnicity_descr', 'admission_type_descr', 'congestive_heart_failure', 'chronic_pulmonary']

# Create bar charts
for var in categorical_variables:
    create_bar_chart(var)

## 1.2 Combine small categories, and recreate bar charts

You will see many categories are very small. These are usually combined into 'Other'.

In your group, update the code below to decide how to combine the categories in each variable (**rename the category on the right**). 

eg:
```
df['marital_status_descr'] = df['marital_status_descr'].replace({
    'MARRIED': 'MARRIED',
    'SEPARATED': 'SEPARATED/DIVORCED',
    'DIVORCED': 'SEPARATED/DIVORCED',
    'SINGLE' : 'SINGLE',
    'WIDOWED' : 'WIDOWED',
    'UNKNOWN (DEFAULT)': None 
})
```

eg:
```
df['marital_status_descr'] = df['marital_status_descr'].replace({
    'MARRIED': 'MARRIED',
    'SEPARATED': 'NOT MARRIED/WIDOWED',
    'DIVORCED': 'NOT MARRIED/WIDOWED',
    'SINGLE' : 'NOT MARRIED/WIDOWED',
    'WIDOWED' : 'NOT MARRIED/WIDOWED',
    'UNKNOWN (DEFAULT)': None 
})
```

For any **categories** you would like to **drop** altogether, type `None` (capital N, no quotes) instead of a category name on the right-hand side of the colon, **eg `'NOT SPECIFIED': None`**

**Re-run the cell below (press Ctrl + Enter).** This will create a new data set called **ext_categ.csv**, which is used in the subsequent code examining demographics.
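
As an aside, if you prefer a rule over hand-picking, small categories can also be lumped together by a count threshold. A minimal sketch, assuming a hypothetical helper `lump_small_categories` and an arbitrary cut-off (neither is part of the course utilities; the toy data are invented):

```python
import pandas as pd

def lump_small_categories(series, min_count, other_label="OTHER"):
    """Replace categories with fewer than min_count rows by other_label."""
    counts = series.value_counts()
    small = counts[counts < min_count].index
    # where() keeps the original value wherever the condition is True
    return series.where(~series.isin(small), other_label)

# Toy data, invented for illustration
marital = pd.Series(
    ["MARRIED"] * 5 + ["SINGLE"] * 4 + ["SEPARATED"] + ["DIVORCED"]
)

lumped = lump_small_categories(marital, min_count=2)
print(lumped.value_counts())
```

A threshold is quick, but choosing the groupings by hand, as in the activity, lets you combine categories that belong together clinically.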

In [None]:
# Modify admission_type_descr
df['admission_type_descr'] = df['admission_type_descr'].replace({    
    'ELECTIVE': 'ELECTIVE',
    'EMERGENCY': 'EMERGENCY',
    'URGENT': 'URGENT'
})

# Modify marital_status_descr
df['marital_status_descr'] = df['marital_status_descr'].replace({
    'MARRIED' : 'MARRIED',
    'SEPARATED': 'SEPARATED',
    'DIVORCED': 'DIVORCED',
    'SINGLE' : 'SINGLE',
    'WIDOWED' : 'WIDOWED',
    'UNKNOWN (DEFAULT)' : 'UNKNOWN (DEFAULT)'
})

# Modify ethnicity_descr
df['ethnicity_descr'] = df['ethnicity_descr'].replace({
    'WHITE' : 'WHITE',
    'ASIAN': 'ASIAN',
    'ASIAN - CHINESE': 'ASIAN - CHINESE',
    'BLACK/CAPE VERDEAN': 'BLACK/CAPE VERDEAN',
    'BLACK/AFRICAN AMERICAN': 'BLACK/AFRICAN AMERICAN',
    'AMERICAN INDIAN/ALASKA NATIVE': 'AMERICAN INDIAN/ALASKA NATIVE',
    'MULTI RACE ETHNICITY': 'MULTI RACE ETHNICITY',
    'HISPANIC OR LATINO': 'HISPANIC OR LATINO',
    'OTHER' : 'OTHER',
    'PATIENT DECLINED TO ANSWER': 'PATIENT DECLINED TO ANSWER',
    'UNKNOWN/NOT SPECIFIED': 'UNKNOWN/NOT SPECIFIED',
    'UNABLE TO OBTAIN': 'UNABLE TO OBTAIN'
})

# Save the cleaned data (it is reloaded below)
df.to_csv('ext_categ.csv', index=False)
print("Cleaned dataset saved as 'ext_categ.csv'")

# Load the cleaned data
df = load_data("ext_categ.csv")

# Categorical variables
categorical_variables = ['sex', 'marital_status_descr', 'ethnicity_descr', 'admission_type_descr', 'congestive_heart_failure', 'chronic_pulmonary']

# Create bar charts for each categorical variable
for var in categorical_variables:
    create_bar_chart(var)  # Using global df, so we don't need to pass df

# Activity 2: Summarise categorical variables by extubation status

## 2.1 Are failed and successful extubation groups the same? Examine differences in demographic characteristics by extubation status
**Re-run the cell below (press `Ctrl + Enter`)** to produce tables summarising each key variable by extubation status, and for the whole sample.

It also provides a p-value to indicate whether the distribution of the variable differs sufficiently by extubation status to conclude that there is a difference in the population.
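
For a variable with two categories, one common way to obtain such a p-value is Fisher's exact test on the 2x2 table of counts. The sketch below is an assumption about the general approach, not the actual code inside `create_crosstab`; the column name `ext_status` and all values are invented for illustration:

```python
import pandas as pd
from scipy.stats import fisher_exact

# Toy data, invented for illustration
toy = pd.DataFrame({
    "sex": ["M"] * 10 + ["F"] * 6,
    "ext_status": ["success"] * 8 + ["failed"] * 2      # 10 males
                + ["success"] * 1 + ["failed"] * 5,     # 6 females
})

# Cross-tabulate the variable against extubation status
ct = pd.crosstab(toy["sex"], toy["ext_status"])
print(ct)

# Fisher's exact test needs a 2x2 table of counts
odds_ratio, p_value = fisher_exact(ct.to_numpy())
print(f"Odds ratio: {odds_ratio:.4f}")
print(f"P-value: {p_value:.4f}")
```

A small p-value suggests the split between categories differs by extubation status more than chance alone would readily explain; it does not say how large or clinically important the difference is.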

## Go to your group worksheet
Review the results below, and then go to your group worksheet to: 
- Complete as much as you can of Table 1a and the associated paragraph describing the sample
- Add an * next to each significant demographic variable in the table in the worksheet, and add a footnote underneath the table explaining this asterisk (i.e. '*p < .05').

In [None]:
# List of variables to analyze
variables = ['sex', 'marital_status_descr', 'ethnicity_descr', 'admission_type_descr', 
            'congestive_heart_failure', 'chronic_pulmonary']

# Display values
for var in variables:
    print(f"\nAnalysis of {var}:")
    
    # Create the crosstab and get statistics
    ct, odds_ratio, p_value = create_crosstab(var)
    
    # Display the formatted table
    display(ct)
    
    # Print statistics with proper formatting
    print("\nStatistics:")
    if odds_ratio is not None:
        print(f"Odds Ratio: {odds_ratio:.4f}")
    print(f"P-value: {p_value:.4f}")
    print("\n" + "="*80 + "\n")  # Add a separator between variables

# Activity 3: Preliminary analyses of numeric variables

Always start by visualising the variables.

## 3.1 Histograms and box plots
To decide whether to use the t-test or the Mann-Whitney U-test, let's examine the distributions. It is always good to know the distributions of your variables. 

The following cell will produce both box-plots and histograms.

Once you run the code, a drop-down menu containing the significant variables will appear underneath the cell. You can select each of these in turn, which will re-run the code to produce a new box plot and histogram for the selected variable.

Step through at least three of these variables, to see which distributions are normal and which are not.
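
Alongside the plots, a few numbers can help you judge shape: in a roughly symmetric distribution the mean and median are close and the skewness is near zero, whereas a long right tail pulls the mean above the median. A minimal sketch on invented values (the variable names are illustrative only):

```python
import pandas as pd

# Toy values, invented for illustration
symmetric = pd.Series([60, 70, 75, 80, 85, 90, 100])    # e.g. a heart rate
right_skewed = pd.Series([1, 1, 2, 2, 3, 4, 30])        # e.g. a length of stay

for name, values in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(f"{name}: mean={values.mean():.1f}, "
          f"median={values.median():.1f}, skew={values.skew():.2f}")
```

These numbers support, rather than replace, looking at the histogram and box plot.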

## Go to your worksheet
Complete Table 1b:
- For those that are normal, report the mean, SD and independent-samples t-test in your table on your worksheet.
- For those that are not normal, report the median, interquartile range and Mann-Whitney U-test in your worksheet.
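
As a hedged illustration of the two reporting routes, the sketch below computes mean (SD) with an independent-samples t-test, and median (IQR) with a Mann-Whitney U-test, on invented values. The group labels and numbers are made up; SciPy's `ttest_ind` and `mannwhitneyu` are used here as one reasonable choice, not necessarily what the course utilities use:

```python
import pandas as pd
from scipy.stats import mannwhitneyu, ttest_ind

# Toy values for two groups, invented for illustration
success = pd.Series([72, 75, 78, 80, 82, 85, 88])
failed = pd.Series([85, 88, 90, 93, 95, 98, 101])

# Roughly normal variable: mean, SD and independent-samples t-test
t_stat, t_p = ttest_ind(success, failed)
print(f"success: mean={success.mean():.1f}, SD={success.std():.1f}")
print(f"failed:  mean={failed.mean():.1f}, SD={failed.std():.1f}")
print(f"t-test p-value: {t_p:.4f}")

# Non-normal variable: median, interquartile range and Mann-Whitney U-test
def median_iqr(values):
    """Return the median and the 25th and 75th percentiles."""
    return values.median(), values.quantile(0.25), values.quantile(0.75)

u_stat, u_p = mannwhitneyu(success, failed, alternative="two-sided")
for name, values in [("success", success), ("failed", failed)]:
    med, q1, q3 = median_iqr(values)
    print(f"{name}: median={med:.1f}, IQR={q1:.1f} to {q3:.1f}")
print(f"Mann-Whitney p-value: {u_p:.4f}")
```

In practice you would report only one of the two routes for each variable, chosen from the shape of its distribution.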

In [None]:
# List of variables to analyze
variables = ['mean_Heart Rate', 'mean_Respiratory Rate', 'Respiratory SOFA Score', 'mean_FiO2 Set', 'Albumin (>3.2)',
             'Tidal Volume (Obser)', 'PaO2', 'PaCO2', 'hgb', 'icustay_los']

# Create the interactive widget (interact displays it automatically;
# the trailing semicolon hides the function it returns)
interact(create_histo, 
         variable_to_analyze=widgets.Dropdown(
             options=variables,
             value='mean_Heart Rate',
             description='Variable:',
             disabled=False,
         ));

# Demo: Visualise relationship between two (numeric) variables
## Scatterplots

Next week we will conduct a regression. Our overall clinical problem is to predict extubation status. 

The code below creates a scatterplot, with drop-down widgets so you can generate scatterplots for different pairs of variables.

You can see that the scatterplots with extubation status (`ext_status_bin_num`) look rather odd, as the dots can only scatter along the 0 and 1 lines. This is because `ext_status_bin_num` is a binary variable.

While correlations and regressions can be conducted with binary variables, scatterplots are more useful with continuous numeric variables.

For this reason, instead of examining the relationship of each variable with `ext_status_bin_num`, we will examine its relationship with `icustay_los` (ICU length of stay). 
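
A scatterplot is often paired with a correlation coefficient, which condenses the pattern into one number between -1 and 1. A minimal sketch using Pearson's r on invented values (only the column names come from the data set):

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "icustay_los": [2, 4, 6, 8, 10],
    "mean_Heart Rate": [70, 76, 79, 88, 92],
    "ext_status_bin_num": [0, 0, 1, 0, 1],
})

# Pearson correlation between two numeric variables
r_numeric = toy["icustay_los"].corr(toy["mean_Heart Rate"])

# With a 0/1 variable the same formula still runs, but the points
# can only sit on the 0 and 1 lines, so the plot is less informative
r_binary = toy["ext_status_bin_num"].corr(toy["mean_Heart Rate"])

print(f"icustay_los vs mean_Heart Rate: r = {r_numeric:.2f}")
print(f"ext_status_bin_num vs mean_Heart Rate: r = {r_binary:.2f}")
```

Values near 1 or -1 indicate points lying close to a straight line; values near 0 indicate little linear pattern.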

In your groups, work through pairs of variables and produce the scatterplots. While doing so, go to your worksheet and answer the questions about this activity. For the last question, you will need to run the widget in the next section, 'Scatterplots by extubation status'. 

We are just looking briefly at scatterplots this week - we will learn more about these when learning about regression.

In [None]:
# List of variables to choose from
variables = ['ext_status_bin_num', 'mean_Heart Rate', 'mean_Respiratory Rate', 
             'mean_FiO2 Set', 'Albumin (>3.2)','Tidal Volume (Obser)', 'PaO2', 'PaCO2', 
             'hgb', 'Respiratory SOFA Score','icustay_los']

# Create the interactive widget
interact(create_scatter, 
         x_var=widgets.Dropdown(options=variables, description='X Variable:', value='ext_status_bin_num'),
         y_var=widgets.Dropdown(options=variables, description='Y Variable:', value='mean_Heart Rate'))

## Scatterplots by extubation status
You can also examine scatterplots split by a third variable. In this case, we examine each relationship separately by extubation status.

Again, you can select the variables from the widget that will appear beneath the cell once you run the code. 
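
As a numerical companion to the grouped plot, the relationship can also be computed within each group. A small sketch on invented values (not the course helper `create_grp_scatter`) that loops over the two extubation groups and reports a separate correlation for each:

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "ext_status_bin_num": [0, 0, 0, 0, 1, 1, 1, 1],
    "icustay_los": [1, 2, 3, 4, 5, 6, 7, 8],
    "mean_Heart Rate": [70, 74, 78, 82, 100, 96, 92, 88],
})

# One correlation per extubation group
group_correlations = {}
for status, group in toy.groupby("ext_status_bin_num"):
    r = group["icustay_los"].corr(group["mean_Heart Rate"])
    group_correlations[status] = r
    print(f"ext_status_bin_num = {status}: r = {r:.2f} (n = {len(group)})")
```

In this toy example the two groups trend in opposite directions, which is the kind of pattern a single combined scatterplot can hide.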

In [None]:
from ipywidgets import interact, widgets
from IPython.display import display

# List of variables to choose from
variables = ['ext_status_bin_num', 'mean_Heart Rate', 'mean_Respiratory Rate', 
             'mean_FiO2 Set', 'Albumin (>3.2)','Tidal Volume (Obser)', 'PaO2', 'PaCO2', 
             'hgb', 'Respiratory SOFA Score','icustay_los']

# Create the interactive widget
interact(create_grp_scatter, 
         x_var=widgets.Dropdown(options=variables, description='X Variable:', value='icustay_los'),
         y_var=widgets.Dropdown(options=variables, description='Y Variable:', value='mean_Heart Rate'))
