# KL2 Data Processing & Analysis

In [151]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1 Relabeling

### 1.1 Features Renaming & Preliminary Relabeling

The "Data Preprocessing" folder defines rules for renaming features and for a preliminary relabeling of feature values.

This section loads these rules and applies them to the raw data as a first processing step.

In [152]:
# libraries, utility functions

import sys
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from glob import glob

def load_lookup_table(folder_path):
    '''
    The "Data Preprocessing" folder contains several .xlsx files, each with one or more sheets.
    Each sheet is expected to have only 2 columns.
    The first row maps the original column name in the collected data to a cleaner name.
    The remaining rows map raw field values on the questionnaire to an interpretable label, usually a string.
    If a row has no value in the second column, that feature value is not of interest.
    '''
    column_map = {}
    lookup_table = {'columns': column_map} # `column_map`: maps original column names to new column names

    excel_paths = glob(os.path.join(folder_path, "*.xlsx"))
    for path in excel_paths:    # read each Excel file
        xls = pd.ExcelFile(path)
        for sheet_name in xls.sheet_names:
            df = pd.read_excel(path, sheet_name=sheet_name, header=None)    # one df per sheet
            if df.shape[1] >= 2:  # Ensure at least two columns exist
                old_col_name, new_col_name = df.iloc[0, :2].str.strip()
                if pd.isna(old_col_name):
                    continue
                if pd.isna(new_col_name):
                    new_col_name = old_col_name
                column_map[old_col_name] = new_col_name
                relabel_map = dict(df.iloc[1:, :2].itertuples(index=False))
                lookup_table[new_col_name] = relabel_map
    # more: add two new feature names: 'diagnosis_b' and 'diagnosis_c'
    lookup_table['columns']['diagnosisb'] = 'diagnosis_b'
    lookup_table['columns']['diagnosisc'] = 'diagnosis_c'
    # set the mapping to the same as 'diagnosis'
    lookup_table['diagnosis_b'] = lookup_table['diagnosis']
    lookup_table['diagnosis_c'] = lookup_table['diagnosis']
        
    return lookup_table

def format_data(raw_data_path, lookup_table, keep_columns=True):
    df = pd.read_csv(raw_data_path, delimiter='|', low_memory=False)[:-1] # drop the last row, which only summarizes the number of rows in the table
    df.iloc[:, 0] = df.iloc[:, 0].astype(float) # cast the first column to float; the special last row kept it from being parsed as numeric
    df.rename(columns=lookup_table['columns'], inplace=True)
    df.replace(lookup_table, inplace=True)
    if not keep_columns:
        columns = df.columns.intersection(lookup_table['columns'].values())
        df = df[columns]

    return df

# # Usage
# folder_path = "../data/Data Preprocessing/"
# raw_data_path = '../data/Original files/Cogan_eRD_RIC1_request.txt'  # Update this to your raw data file path
# output_path = '/path/to/output/renamed_data.xlsx'  # Update this to your desired output file path

# lookup_table = load_lookup_table(folder_path)
# relabeled_data = format_data(raw_data_path, lookup_table, keep_columns=True)  # Set keep_columns to False to drop columns that are not in the lookup table
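
To make the lookup-table layout concrete, here is a minimal sketch of the rename-then-relabel step on a made-up frame. The raw column names `gender` and `maritalstatus` and their numeric codes are hypothetical and are not taken from the real "Data Preprocessing" files. `format_data` passes the whole `lookup_table` to `df.replace`; the sketch drops the `'columns'` entry first only to make the nested mapping explicit.

```python
import pandas as pd

# Hypothetical lookup table in the shape that load_lookup_table returns:
# 'columns' maps raw column names to new names, and every other key maps
# a (new) column name to its {raw value: label} dictionary.
toy_lookup = {
    "columns": {"gender": "sex", "maritalstatus": "marital_status"},
    "sex": {1.0: "male", 2.0: "female"},
    "marital_status": {1.0: "Never married", 2.0: "Married"},
}

toy = pd.DataFrame({"gender": [1.0, 2.0, 2.0], "maritalstatus": [2.0, 1.0, 2.0]})

# Step 1: rename the columns.
toy = toy.rename(columns=toy_lookup["columns"])

# Step 2: relabel values column by column with a nested {column: {old: new}} dict.
value_maps = {name: mapping for name, mapping in toy_lookup.items() if name != "columns"}
toy = toy.replace(value_maps)

print(toy)
```

Because the value mappings are keyed by the new column names, the rename has to happen before the replace, which is the order `format_data` uses.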

In [154]:
# load folder "Data Preprocessing" as a lookup table

folder_path = "./data/Data Preprocessing/"
raw_data_path = './data/Original files/Cogan_eRD_RIC1_request.txt'
lookup_table = load_lookup_table(folder_path)

lookup_table['columns']

{'diagnosis': 'diagnosis',
 'priorselfcare': 'selfcare_prior',
 'priorindoormobility': 'mobility_prior',
 'priorstairs': 'stairs_prior',
 'priorfunctionalcognition': 'func_cog_prior',
 'pdumanualwheelchair': 'wc_manual_prior',
 'pdumotorizedwheelchair': 'wc_motor_prior',
 'pdumechanicallift': 'mechlift_prior',
 'pduwalker': 'walker_prior',
 'pduorthoticsprosthetics': 'orth_pros_prior',
 'pdunone': 'no_device_prior',
 'eatingadm': 'eating_adm',
 'eatinggoal': 'eating_goal',
 'oralhygieneadm': 'oral_adm',
 'oralhygienegoal': 'oral_goal',
 'toiletinghygieneadm': 'toileting_adm',
 'showerbatheadm': 'bathe_adm',
 'showerbathegoal': 'bathe_goal',
 'dressupperbodyadm': 'dress_upper_adm',
 'dressupperbodygoal': 'dress_upper_goal',
 'dresslowerbodyadm': 'dress_lower_adm',
 'dresslowerbodygoal': 'dress_lower_goal',
 'dondofffootwearadm': 'footwear_adm',
 'dondofffootweargoal': 'footwear_goal',
 'rollleftrightadm': 'roll_lr_adm',
 'rollleftrightgoal': 'roll_lr_goal',
 'sittolyingadm': 'sit_lying_

In [155]:
# preprocess data with lookup table

df = format_data(raw_data_path, lookup_table, keep_columns=True)

df

| | sex | marital_status | admityear | admitclass | admit_from | prehospital_living | prehospital_living_with | payor_primary | payor_secondary | impgroupadmit | ... | los | ric | cmg | tier | shortstayexpired | shortstaycmg | transferpatient | incompletestay | unplanned_discharge | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | Married | 2023.0 | 1.0 | Short-term general hospital | Home | Family | Medicare FFS | Not listed | 1.2 | ... | 21.0 | 1.0 | 104.0 | 3.0 | | | f | f | | 1.0 |
| 1 | male | Married | 2023.0 | 1.0 | Short-term general hospital | Home | Family | Medicare Advantage | Not listed | 1.2 | ... | 14.0 | 1.0 | 102.0 | | | | f | f | | 2.0 |
| 2 | male | Married | 2023.0 | 1.0 | Short-term general hospital | Home | Family | Medicare Advantage | Not listed | 1.2 | ... | 5.0 | 1.0 | 103.0 | | | | f | f | | 3.0 |
| 3 | female | Widowed | 2022.0 | 1.0 | Short-term general hospital | Home | Family | Medicare FFS | Not listed | 1.1 | ... | 9.0 | 1.0 | 102.0 | | | | f | f | | 4.0 |
| 4 | female | | 2023.0 | 1.0 | Short-term general hospital | Home | Alone | Not listed | Not listed | 1.3 | ... | 22.0 | 1.0 | 106.0 | 3.0 | f | | t | f | | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43740 | female | Widowed | 2022.0 | 1.0 | Short-term general hospital | Home | Family | Medicare FFS | Not listed | 1.3 | ... | 17.0 | 1.0 | 104.0 | 3.0 | | | f | f | | 43741.0 |
| 43741 | female | Never married | 2022.0 | | Long-term care hospital | Home | Alone | Medicare FFS | Not listed | 1.1 | ... | 7.0 | 1.0 | 106.0 | | f | | t | t | | 43742.0 |
| 43742 | female | Married | 2022.0 | 1.0 | Short-term general hospital | Home | Attendant | Medicare FFS | Not listed | 1.2 | ... | 29.0 | 1.0 | 106.0 | | | | f | f | | 43743.0 |
| 43743 | male | Married | 2022.0 | 1.0 | Short-term general hospital | Home | Family | Medicare Advantage | Not listed | 1.1 | ... | 16.0 | 1.0 | 106.0 | | f | | t | f | | 43744.0 |


In [156]:
# have a look at unique diagnosis
print("# of unique diagnosis:", df['diagnosis'].unique().size)
df['diagnosis'].unique()

# of unique diagnosis: 340


array(['Cerebral infarction due to thrombosis of left middle cerebral artery',
       'Other cerebral infarction due to occlusion or stenosis of small artery',
       'Cerebral infarction due to unspecified occlusion or stenosis of right anterior cerebral artery',
       'Cerebral infarction due to unspecified occlusion or stenosis of left middle cerebral artery',
       'Cerebral infarction, unspecified',
       'Cerebral infarction due to thrombosis of right middle cerebral artery',
       'Nontraumatic subarachnoid hemorrhage, unspecified',
       'Cerebral infarction due to unspecified occlusion or stenosis of right cerebellar artery',
       'Hemiplegia and hemiparesis following cerebral infarction affecting left non-dominant side',
       'Cerebral infarction due to unspecified occlusion or stenosis of right middle cerebral artery',
       'Cerebral infarction due to unspecified occlusion or stenosis of right posterior cerebral artery',
       'Cerebral infarction due to unspecif

In [157]:
# save preprocessed data

df.to_csv('./data/Cleaned files/Cogan_1_1.csv', index=False)

#### Examine some features of interest here

In [158]:
lookup_table['language']

{nan: nan}

### 1.2 Add Aggregated Outcomes 

In [213]:
# load formatted dataframe and see all feature names

import numpy as np  # used below for np.nan and np.isnan
import pandas as pd
from tabulate import tabulate

# always start from loading the CSV file from previous section
df = pd.read_csv('./data/Cleaned files/Cogan_1_1.csv', low_memory=False)

feature_names = list(df.columns)
print("total # of features:", len(feature_names))

print(tabulate(enumerate(feature_names), headers=["Index", "Name"], tablefmt="grid"))

total # of features: 449
+---------+--------------------------------+
|   Index | Name                           |
|       0 | sex                            |
+---------+--------------------------------+
|       1 | marital_status                 |
+---------+--------------------------------+
|       2 | admityear                      |
+---------+--------------------------------+
|       3 | admitclass                     |
+---------+--------------------------------+
|       4 | admit_from                     |
+---------+--------------------------------+
|       5 | prehospital_living             |
+---------+--------------------------------+
|       6 | prehospital_living_with        |
+---------+--------------------------------+
|       7 | payor_primary                  |
+---------+--------------------------------+
|       8 | payor_secondary                |
+---------+--------------------------------+
|       9 | impgroupadmit                  |
+---------+-------------------

In [214]:
# define outcome features
selfcare_items = [
    "eating",
    "oral",
    "toileting",
    "bathe",
    "dress_upper",
    "dress_lower",
    "footwear",
]
mobility_items = [
    "roll_lr",
    "sit_lying",
    "lying_sit",
    "sit_stand",
    "bed_chair",
    "toilet_trans",
    "car_trans",
    "walk10ft",
    "walk50ft",
    "walk150ft",
    "walk10ft_uneven",
    "walk1step",
    "walk4step",
    "walk12step",
    "pickup"
]
print("# of selfcare items:", len(selfcare_items))
print("# of mobility items:", len(mobility_items))

# of selfcare items: 7
# of mobility items: 15


In [215]:
# define the mapping from ordinal sum score to measure
selfcare_table = {
    7: 19.65,
    8: 24.7,
    9: 27.64,
    10: 29.69,
    11: 31.44,
    12: 33.02,
    13: 34.46,
    14: 35.78,
    15: 36.97,
    16: 38.04,
    17: 38.98,
    18: 39.84,
    19: 40.62,
    20: 41.35,
    21: 42.05,
    22: 42.73,
    23: 43.39,
    24: 44.07,
    25: 44.77,
    26: 45.52,
    27: 46.33,
    28: 47.25,
    29: 48.33,
    30: 49.64,
    31: 51.28,
    32: 53.37,
    33: 56.1,
    34: 59.86,
    35: 65.47,
    36: 71.86,
    37: 76.96,
    38: 81.08,
    39: 84.76,
    40: 88.57,
    41: 93.54,
    42: 100.64,
}
mobility_table = {
    15: 4.86,
    16: 11.07,
    17: 14.3,
    18: 16.1,
    19: 17.35,
    20: 18.34,
    21: 19.16,
    22: 19.89,
    23: 20.54,
    24: 21.14,
    25: 21.71,
    26: 22.26,
    27: 22.79,
    28: 23.3,
    29: 23.81,
    30: 24.31,
    31: 24.81,
    32: 25.31,
    33: 25.82,
    34: 26.32,
    35: 26.82,
    36: 27.32,
    37: 27.83,
    38: 28.33,
    39: 28.82,
    40: 29.32,
    41: 29.81,
    42: 30.29,
    43: 30.77,
    44: 31.25,
    45: 31.72,
    46: 32.18,
    47: 32.65,
    48: 33.11,
    49: 33.57,
    50: 34.02,
    51: 34.48,
    52: 34.93,
    53: 35.39,
    54: 35.85,
    55: 36.32,
    56: 36.88,
    57: 37.28,
    58: 37.77,
    59: 38.28,
    60: 38.81,
    61: 39.36,
    62: 39.94,
    63: 40.56,
    64: 41.21,
    65: 41.92,
    66: 42.68,
    67: 43.52,
    68: 44.46,
    69: 45.53,
    70: 46.75,
    71: 48.17,
    72: 49.83,
    73: 51.76,
    74: 53.98,
    75: 56.42,
    76: 58.93,
    77: 61.36,
    78: 63.64,
    79: 65.73,
    80: 67.83,
    81: 69.82,
    82: 71.78,
    83: 73.76,
    84: 75.79,
    85: 77.91,
    86: 80.21,
    87: 82.62,
    88: 86.36,
    89: 90.90,
    90: 98.38
}


In [216]:
# safe buffer
last_df = df

In [229]:
# for outcome features, replace invalid responses with NaN and count the missing values per column
df = last_df.copy()
selfcare_columns = [item + '_adm' for item in selfcare_items] + [item + '_dc' for item in selfcare_items]
mobility_columns = [item + '_adm' for item in mobility_items] + [item + '_dc' for item in mobility_items]

subset_columns = [item + '_adm' for item in selfcare_items+mobility_items]
subset_columns += [item + '_dc' for item in selfcare_items+mobility_items]
df.loc[:, subset_columns] = df.loc[:, subset_columns].replace({
    'Refused': np.nan,
    'Not_applicable': np.nan,
    'Not_attempted': np.nan,
    'Safety': np.nan
})
print(df[subset_columns].isna().sum(axis=0))
# df = df.dropna(subset=selfcare_columns)
# df = df.dropna(subset=subset_columns)
# print("# of valid patients changed from", len(last_df), "to", len(df))

eating_adm              2474
oral_adm                1387
toileting_adm            984
bathe_adm               4420
dress_upper_adm         1750
dress_lower_adm         1343
footwear_adm             958
roll_lr_adm             1229
sit_lying_adm           1430
lying_sit_adm           1433
sit_stand_adm           3711
bed_chair_adm           2403
toilet_trans_adm        5618
car_trans_adm          24704
walk10ft_adm           15365
walk50ft_adm           24103
walk150ft_adm          32291
walk10ft_uneven_adm    31558
walk1step_adm          25778
walk4step_adm          29097
walk12step_adm         37282
pickup_adm             29537
eating_dc               3732
oral_dc                 3152
toileting_dc            3077
bathe_dc                3336
dress_upper_dc          3124
dress_lower_dc          3099
footwear_dc             3135
roll_lr_dc              3199
sit_lying_dc            3145
lying_sit_dc            3142
sit_stand_dc            3553
bed_chair_dc            3125
toilet_trans_d

In [249]:
# calculate the 4 measures: selfcare and mobility, each at admission (adm) and discharge (dc)
inverse_lookup = {v:k for k,v in lookup_table[selfcare_items[0]+'_adm'].items()}
print(inverse_lookup)

selfcare_sum_adm = df[[item + '_adm' for item in selfcare_items]].replace(inverse_lookup).sum(axis=1, skipna=False)
# print(selfcare_sum_adm[:10])
selfcare_sum_dc = df[[item + '_dc' for item in selfcare_items]].replace(inverse_lookup).sum(axis=1, skipna=False)
print("selfcare sum adm nan count:", np.sum(np.isnan(selfcare_sum_adm)))
print("selfcare sum dc nan count:", np.sum(np.isnan(selfcare_sum_dc)))

selfcare_measure_adm = selfcare_sum_adm.map(selfcare_table)
# print(selfcare_measure_adm[:10])
selfcare_measure_dc = selfcare_sum_dc.map(selfcare_table)
print("selfcare measure adm nan count:", np.sum(pd.isna(selfcare_measure_adm).to_numpy()))
print("selfcare measure dc nan count:", np.sum(pd.isna(selfcare_measure_dc).to_numpy()))

mobility_sum_adm = df[[item + '_adm' for item in mobility_items]].replace(inverse_lookup).sum(axis=1, skipna=False)
mobility_sum_dc = df[[item + '_dc' for item in mobility_items]].replace(inverse_lookup).sum(axis=1, skipna=False)
print("mobility sum adm nan count:", np.sum(np.isnan(mobility_sum_adm)))
print("mobility sum dc nan count:", np.sum(np.isnan(mobility_sum_dc)))
mobility_measure_adm = mobility_sum_adm.map(mobility_table)
mobility_measure_dc = mobility_sum_dc.map(mobility_table)
print("mobility measure adm nan count:", np.sum(pd.isna(mobility_measure_adm).to_numpy()))
print("mobility measure dc nan count:", np.sum(pd.isna(mobility_measure_dc).to_numpy()))

{'Independent': 6, 'Setup': 5, 'Supervised': 4, 'Mod_assist': 3, 'Max_assist': 2, 'Dependent': 1, 'Refused': 7, 'Not_applicable': 9, 'Not_attempted': 10, 'Safety': 88}
selfcare sum adm nan count: 8035
selfcare sum dc nan count: 4230
selfcare measure adm nan count: 8035
selfcare measure dc nan count: 4230
mobility sum adm nan count: 39792
mobility sum dc nan count: 24421
mobility measure adm nan count: 39792
mobility measure dc nan count: 24421
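
The sum-then-map logic above can be sketched without pandas. This is a simplified illustration, not the notebook's implementation: the helper name `score_to_measure` is hypothetical, and `selfcare_excerpt` holds only three entries copied from `selfcare_table`. A missing item (`None`) makes the whole score undefined, which mirrors `sum(axis=1, skipna=False)` followed by `Series.map`.

```python
import math

# Three entries copied from selfcare_table (ordinal sum -> measure).
selfcare_excerpt = {7: 19.65, 8: 24.7, 42: 100.64}

def score_to_measure(item_scores, table):
    """Sum the item scores and look the sum up in `table`.

    Returns NaN if any item is missing (None) or if the sum has no entry
    in the table.
    """
    if any(score is None for score in item_scores):
        return math.nan
    return table.get(sum(item_scores), math.nan)

all_dependent = [1] * 7                      # seven selfcare items, all scored 1
one_missing = [1, 1, None, 1, 1, 1, 1]       # one invalid response

print(score_to_measure(all_dependent, selfcare_excerpt))            # 19.65
print(math.isnan(score_to_measure(one_missing, selfcare_excerpt)))  # True
```

This also explains why the NaN counts of the sums and of the measures match in the output above: a patient loses the measure as soon as a single item is invalid.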


In [251]:
# add outcome measure columns to the df
df['selfcare_measure_adm'] = selfcare_measure_adm
df['selfcare_measure_dc'] = selfcare_measure_dc
df['mobility_measure_adm'] = mobility_measure_adm
df['mobility_measure_dc'] = mobility_measure_dc

# add two delta measure columns; a delta is NaN unless the measures at both admission and discharge are valid, so it allows quick filtering of valid samples
df['selfcare_measure_delta'] = selfcare_measure_dc - selfcare_measure_adm
df['mobility_measure_delta'] = mobility_measure_dc - mobility_measure_adm
print("# of invalid samples for selfcare:", np.isnan(df['selfcare_measure_delta']).sum(), " out of", len(df))
print("# of invalid samples for mobility:", np.isnan(df['mobility_measure_delta']).sum(), " out of", len(df))

# of invalid samples for selfcare: 10490  out of 43745
# of invalid samples for mobility: 40102  out of 43745


In [252]:
# save processed data with 6 extra columns

df.to_csv('./data/Cleaned files/Cogan_1_2.csv', index=False)

### 1.3 Relabel & Process Features


In [265]:
# total # of features and a list of all feature names

import pandas as pd
from tabulate import tabulate

# always start from loading the CSV file from previous section
df = pd.read_csv('./data/Cleaned files/Cogan_1_2.csv', low_memory=False)

feature_names = list(df.columns)
feature_dtypes = df.dtypes
print("total # of features:", len(feature_names))

print(tabulate([(index, name, feature_dtypes[name]) for index, name in enumerate(feature_names)], headers=["Index", "Name", "Dtype"], tablefmt="grid"))
categorical_feature_set = set()
numeric_feature_set = set()

total # of features: 455
+---------+--------------------------------+---------+
|   Index | Name                           | Dtype   |
|       0 | sex                            | object  |
+---------+--------------------------------+---------+
|       1 | marital_status                 | object  |
+---------+--------------------------------+---------+
|       2 | admityear                      | float64 |
+---------+--------------------------------+---------+
|       3 | admitclass                     | float64 |
+---------+--------------------------------+---------+
|       4 | admit_from                     | object  |
+---------+--------------------------------+---------+
|       5 | prehospital_living             | object  |
+---------+--------------------------------+---------+
|       6 | prehospital_living_with        | object  |
+---------+--------------------------------+---------+
|       7 | payor_primary                  | object  |
+---------+-----------------------------

In [69]:
# look at a specific feature
col = df[feature_names[13]] # diagnosis_c in the output shown below
print(col)
print(col.notna().sum()) # number of non-missing values
print(col.dropna().to_numpy(dtype=str)) # the non-missing values themselves

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
43740    NaN
43741    NaN
43742    NaN
43743    NaN
43744    NaN
Name: diagnosis_c, Length: 43745, dtype: object
353
['S72.041A'
 'Cerebral infarction due to unspecified occlusion or stenosis of left anterior cerebral artery'
 'Cerebral infarction due to unspecified occlusion or stenosis of bilateral carotid arteries'
 'Nontraumatic subdural hemorrhage, unspecified'
 'Nontraumatic intracerebral hemorrhage, intraventricular'
 'Dysarthria following cerebral infarction' 'E78.5' 'N39.0' 'S15.101A'
 'Other nontraumatic intracerebral hemorrhage'
 'Dysarthria following cerebral infarction'
 'Nontraumatic intracerebral hemorrhage in hemisphere, cortical' 'G93.6'
 'Essential (primary) hypertension' 'Compression of brain'
 'Nontraumatic extradural hemorrhage' 'G91.1'
 'Other nontraumatic subarachnoid hemorrhage' 'R29.717'
 'Cerebral infarction due to unspecified occlusion or stenosis of left posterior cerebral artery'


#### Column: "language"

In [266]:
# safe buffer
last_df = df

In [268]:
# process column "language": English or Non-English
df = last_df.copy() # always work on a fresh copy of last step's output

categorical_feature_set.add('language')

print(df['language'].unique())
df['language'] = df['language'].str.contains('eng|enlish', case=False, na=False)
df['language'] = df['language'].map({True: 'English', False: 'Non-English'})
print(df['language'].unique())

['English' 'ENGLISH' 'Spanish' 'Other' 'eng' 'english' 'spanish' 'SPANISH'
 'Vietnamese' 'Unknown' 'Polish' 'English and Spanish' 'Creole' 'Eng'
 'MANDARIN' 'Cambodia' 'Chinese' 'Korean' 'Englissh' 'HINDI' 'Enlglish'
 'POLISH' 'Sindhi/Urdu then English' 'French' 'Farsi' 'Panjabi' 'Mandarin'
 'Japanese' 'Malayalam' 'Ghana' 'Cambodian' 'ENLGISH' 'Indonesian'
 'tagalog' 'Swahili' 'Estonian' 'Portuguese' 'Cantonese'
 'Chinese, Mandarin' 'gujarati' 'German' 'other' 'Hmong' 'English/Spanish'
 'Russian' 'Englsih' 'Karen' 'Arabic' 'CREOLE' 'Undetermined' 'Gujarati'
 'Urdu' 'BANGLADESH' 'Hindi' 'Tagalog' 'TOISAN' 'Spanish and English'
 'Chinese - Canto' 'korean' 'creole' 'ASL' 'WHITE' 'Italian'
 'ENGLISH/TURKISH' 'Haitian' 'TAGALOG' 'en' 'CANTONESE' 'ENglish'
 'CHINESE' 'Serbian' 'Ukrainian' 'Persian' 'Swedish' 'English, Spanish'
 'Enbglish' 'Bosnian' 'Albanian' 'LAO' 'chinese' 'Chinese - Manda'
 'Slovak' 'French Creole' 'English, Other' 'ukranian' 'ENGLSIH' 'Enlish'
 'Chuukese' 'VIETNAMESE' 'A
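
The pattern `'eng|enlish'` misses a few spellings visible in the output above, such as `'Enlglish'`, `'ENLGISH'`, `'Enbglish'` and `'en'`. A possible alternative is fuzzy matching with the standard library's `difflib`. The sketch below is only an illustration: the function name `is_english`, the 0.8 threshold, and the decision to treat `'en'` and `'eng'` as English are assumptions, not part of the notebook.

```python
import re
from difflib import SequenceMatcher

ENGLISH_ABBREVIATIONS = {"en", "eng"}   # assumption: these abbreviations mean English

def is_english(raw, threshold=0.8):
    """Return True if any word in `raw` is 'english' or a close misspelling of it."""
    if not isinstance(raw, str):
        return False                      # NaN / None count as Non-English, like na=False
    for token in re.findall(r"[a-z]+", raw.lower()):
        if token in ENGLISH_ABBREVIATIONS:
            return True
        # ratio() is 1.0 for identical strings and drops as the strings diverge
        if SequenceMatcher(None, token, "english").ratio() >= threshold:
            return True
    return False

samples = ["ENLGISH", "Enbglish", "Enlglish", "en",
           "Sindhi/Urdu then English", "Spanish", "POLISH", "Unknown"]
for s in samples:
    print(f"{s!r:30} -> {'English' if is_english(s) else 'Non-English'}")
```

As in the original cell, a value that mentions English alongside another language is labeled English. In the notebook this could be applied with `df['language'].map(is_english)` before mapping to the two labels.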

#### Column: all categorical columns with an integer label

In [272]:
# safe buffer
last_df = df

In [302]:
# replace nan with "Unknown", convert int label to literal string
df = last_df.copy() # always work on a fresh copy of last step's output

integer_labeled_features = [
    "admitclass",
]
for name in integer_labeled_features:
    print(f"----------working on: {name}----------")
    print("uniques before:", df[name].unique())
    print("dtype before:", df[name].dtype)
    print("nan count before:", pd.isna(df[name]).sum())
    df[name] = df[name].apply(lambda x: str(int(x)) if pd.notna(x) else "Unknown").astype(str)
    print("uniques after:", df[name].unique())
    print("dtype after:", df[name].dtype)
    print("nan count after:", pd.isna(df[name]).sum())

----------working on: admitclass----------
uniques before: [ 1.  4.  3. nan  5.  2.]
dtype before: float64
nan count before: 3861
uniques after: ['1' '4' '3' 'Unknown' '5' '2']
dtype after: object
nan count after: 0


#### Column: all other categorical columns that need NaNs to be replaced

In [303]:
# safe buffer
last_df = df

In [304]:
# replace nan with "Unknown", replace spaces with underscores
df = last_df.copy() # always work on a fresh copy of last step's output

normal_categorical_features = [
    "sex",
    "marital_status",
    "prehospital_living",
    "payor_primary",
]

for name in normal_categorical_features:
    print(f"----------working on: {name}----------")
    print("uniques before:", df[name].unique())
    print("dtype before:", df[name].dtype)
    print("nan count before:", pd.isna(df[name]).sum())
    df[name] = df[name].apply(lambda x: str(x).replace(' ', '_') if pd.notna(x) else "Unknown").astype(str)
    print("uniques after:", df[name].unique())
    print("dtype after:", df[name].dtype)
    print("nan count after:", pd.isna(df[name]).sum())

----------working on: sex----------
uniques before: ['female' 'male']
dtype before: object
nan count before: 0
uniques after: ['female' 'male']
dtype after: object
nan count after: 0
----------working on: marital_status----------
uniques before: ['Married' 'Widowed' 'Unknown' 'Divorced' 'Separated' 'Never_married']
dtype before: object
nan count before: 0
uniques after: ['Married' 'Widowed' 'Unknown' 'Divorced' 'Separated' 'Never_married']
dtype after: object
nan count after: 0
----------working on: prehospital_living----------
uniques before: ['Home' 'Inpatient psychiatric facility' 'Home health service'
 'Intermediate care' 'Skilled nursing facility'
 'Medicaid nursing facility' 'Short-term general hospital' 'Not listed'
 'Long-term care hospital' 'Another IRF']
dtype before: object
nan count before: 0
uniques after: ['Home' 'Inpatient_psychiatric_facility' 'Home_health_service'
 'Intermediate_care' 'Skilled_nursing_facility'
 'Medicaid_nursing_facility' 'Short-term_general_hospital'

### 1.4 Select Predictors of Interest


In [None]:
interested_predictors = [
    # >>> base info >>>
    "sex", # male / female
    "marital_status", # ['Married' 'Widowed' 'Unknown' 'Divorced' 'Separated' 'Never_married']
    "admitclass",
    "admit_from",
    "prehospital_living",
    "payor_primary",
    # "impgroupadmit", # what does this do?
    # "diagnosis", # need a pretrained LM for embedding
    "arthritis", 
    "heightinches",
    "weightpounds",
    # <<< base info <<<

    # >>> treatments >>>
    "ptindweek1",
    "ptconweek1",
    "ptgrpweek1",
    "ptcoweek1",
    "otindweek1",
    "otconweek1",
    "otgrpweek1",
    "otcoweek1",
    "slpindweek1",
    "slpconweek1",
    "slpgrpweek1",
    "slpcoweek1",
    "ptindweek2",
    "ptconweek2",
    "ptgrpweek2",
    "ptcoweek2",
    "otindweek2",
    "otconweek2",
    "otgrpweek2",
    "otcoweek2",
    "slpindweek2",
    "slpconweek2",
    "slpgrpweek2",
    "slpcoweek2",
    # <<< treatments <<<

    # >>> section A admission >>>
    "race", # integrated race categorical field
    "language", # integrated language categorical field: English or Non-English
    "transport_lack", # integrated "transport_lack" and "transport_lack_unable", "transport_lack_decline"
    # <<< section A admission <<<
    
    # >>> section B C admission >>>
    "hearing_adm",
    "vision_adm",
    "health_lit_adm",
    "expression_adm",
    "understand_verbal_adm",
    "conduct_bims", # bims: Brief Interview for Mental Status
    "bims_3words",
    "bims_year",
    "bims_month",
    "bims_day",
    "bims_recall_sock",
    "bims_recall_blue",
    "bims_recall_bed",
    "bims_total",
    "conduct_sams", # only if "conduct_bims" is false
    "sams_season",
    "sams_room",
    "sams_names",
    "sams_hosp",
    "sams_none_above",
    "acute_mental_change",
    "inattention_adm",
    "disorganized_adm",
    "altered_adm",
    # <<< section B C admission <<<
    
    # >>> section D admission >>>
    "low_interest_adm",
    "low_interest_freq_adm",
    "depressed_adm",
    "depressed_freq_adm",
    "sleep_trouble_adm",
    "sleep_trouble_freq_adm",
    "tired_adm",
    "tired_freq_adm",
    "appetite_adm",
    "appetite_freq_adm",
    "feel_bad_adm",
    "feel_bad_freq_adm",
    "concentrate_adm",
    "concentrate_freq_adm",
    "slowfast_adm",
    "slowfast_freq_adm",
    "selfharm_adm",
    "selfharm_freq_adm",
    "mood_total_adm",
    "socisolation_adm",
    # <<< section D admission <<<

    # >>> section GG admission >>>
    "selfcare_prior",
    "mobility_prior",
    "stairs_prior",
    "func_cog_prior",
    "wc_manual_prior",
    "wc_motor_prior",
    "mechlift_prior",
    "walker_prior",
    "orth_pros_prior",
    "no_device_prior",
    "eating_adm",
    "eating_goal",
    "oral_adm",
    "oral_goal",
    "toileting_adm",
    "toiletinghygienegoal",
    "bathe_adm",
    "bathe_goal",
    "dress_upper_adm",
    "dress_upper_goal",
    "dress_lower_adm",
    "dress_lower_goal",
    "footwear_adm",
    "footwear_goal",
    "roll_lr_adm",
    "roll_lr_goal",
    "sit_lying_adm",
    "sit_lying_goal",
    "lying_sit_adm",
    "lying_sit_goal",
    "sit_stand_adm",
    "sit_stand_goal",
    "bed_chair_adm",
    "bed_chair_goal",
    "toilet_trans_adm",
    "toilet_trans_goal",
    "car_trans_adm",
    "car_trans_goal",
    "walk10ft_adm",
    "walk10ft_goal",
    "walk50ft_adm",
    "walk50ft_goal",
    "walk150ft_adm",
    "walk150ft_goal",
    "walk10ft_uneven_adm",
    "walk10ft_uneven_goal",
    "walk1step_adm",
    "walk1step_goal",
    "walk4step_adm",
    "walk4step_goal",
    "walk12step_adm",
    "walk12step_goal",
    "pickup_adm",
    "pickup_goal",
    "wc_user",
    "wheel50ft_adm",
    "wheel50ft_goal",
    "wc50_type",
    "wheel150ft_adm",
    "wheel150ft_goal",
    "wc150_type",
    # <<< section GG admission <<<

    # >>> section H I J K M N O admission >>>
    "bladder_incontinence",
    "bowel_incontinence",
    "pvd_comorbid",
    "diabetes_comorbid",
    "no_comorbidities",
    "pain_sleep_adm",
    "pain_therapy_adm",
    "pain_activities_adm",
    "falls_hx",
    "prior_surgery",
    "nutrition_parenteral_adm",
    "nutrition_tube_adm",
    "nutrition_mech_diet_adm",
    "nutrition_ther_diet_adm",
    "nutrition_none_adm",
    "pressure_ulcer_adm",
    "stage1_pu_adm",
    "stage2_pu_adm",
    "stage3_pu_adm",
    "stage4_pu_adm",
    "unstageable_dressing_pu_adm",
    "unstageable_slough_pu_adm",
    "unstageable_deep_pu_adm",
    "antipsychotic_taking_adm",
    "antipsychotic_ind_adm",
    "anticoagulant_taking_adm",
    "anticoagulant_ind_adm",
    "antibiotic_taking_adm",
    "antibiotic_ind_adm",
    "opioid_taking_adm",
    "opioid_ind_adm",
    "antiplatelet_taking_adm",
    "antiplatelet_ind_adm",
    "hypoglycemic_taking_adm",
    "hypoglycemic_ind_adm",
    "med_highrisk_none",
    "drug_regimen_review",
    "med_follow_up",
    "chemo_adm",
    "chemo_iv_adm",
    "chemo_oral_adm",
    "chemo_other_adm",
    "radiation_adm",
    "oxygen_adm",
    "oxygen_cont_adm",
    "oxygen_int_adm",
    "oxygen_high_adm",
    "suctioning_adm",
    "suctioning_sched_adm",
    "suctioning_asneeded_adm",
    "trach_adm",
    "vent_invasive_adm",
    "vent_noninvasive_adm",
    "vent_bipap_adm",
    "vent_cpap_adm",
    "meds_iv_adm",
    "meds_iv_vasoactive_adm",
    "meds_iv_antibiotic_adm",
    "meds_iv_anticoagulant_adm",
    "meds_iv_other_adm",
    "transfusions_adm",
    "dialysis_adm",
    "hemodialysis_adm",
    "peritoneal_dialysis_adm",
    "iv_access_adm",
    "iv_access_periph_adm",
    "iv_access_mid_adm",
    "iv_access_cent_adm",
    "tx_none_adm",
    # <<< section H I J K M N O admission <<<

    # >>> the rest >>>
    "age_at_admit",
    "los",
    "ric",
    "cmg",
    "tier",
    "shortstayexpired",
    "shortstaycmg",
    "transferpatient",
    "incompletestay",
    # <<< the rest <<<

]
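
With a hand-written list this long, it can help to check the names against the dataframe before selecting columns. The helper below is a hypothetical sketch (`split_by_availability` is not defined anywhere in the notebook) and is shown on toy inputs.

```python
def split_by_availability(selected, available):
    """Return (present, missing) lists, preserving the order of `selected`."""
    available = set(available)
    present = [name for name in selected if name in available]
    missing = [name for name in selected if name not in available]
    return present, missing

# Toy inputs; in the notebook one would pass interested_predictors and df.columns.
present, missing = split_by_availability(
    ["sex", "toileting_goal", "los"],
    ["sex", "los", "toiletinghygienegoal"],
)
print(present)  # ['sex', 'los']
print(missing)  # ['toileting_goal']
```

Any name reported as missing is either a typo or a column that was never renamed by the lookup table.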

## Feature Importance

The metric that scores how much each feature contributes to a model's predictions is generally known as feature importance. There are several ways to calculate it, and the appropriate method depends on the type of model. Common techniques include:

1. Gini Importance (Mean Decrease in Impurity): Used in Decision Trees, Random Forests, and Gradient Boosting models. It measures the total reduction of the splitting criterion (Gini impurity or entropy in classification, variance in regression) brought by a feature across all trees in the model.
2. Permutation Feature Importance: A model-agnostic approach. It randomly shuffles the values of one feature at a time and observes how the model's performance changes. A feature is considered important if permuting its values significantly degrades performance.
3. SHAP (SHapley Additive exPlanations): A game-theory-based method for explaining the output of any machine learning model. It gives each feature a score per instance that shows how much the feature contributes, positively or negatively, to that prediction.
4. LIME (Local Interpretable Model-agnostic Explanations): Another model-agnostic technique. It explains an individual prediction by approximating the model locally with an interpretable one, such as a linear model, and reading feature importance for that instance from the approximation.
5. Coefficient Values in Linear Models: For models such as Linear Regression or Logistic Regression, feature importance can be read from the magnitude of the coefficients. Larger absolute values indicate higher importance, provided the features are on comparable scales (for example, after standardization).
6. Feature Gain (XGBoost): In XGBoost, feature importance can be measured by gain, the average improvement contributed by a feature over the splits in which it is used.
7. Partial Dependence Plots (PDP): Although not a feature importance score per se, a PDP shows the effect of a feature on the predicted outcome by averaging over all other features.

Each of these methods scores features by their contribution to the model's output, which helps interpret the importance of individual features in both classification and regression tasks.
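
As an illustration of the permutation approach, here is a minimal standard-library sketch. Everything in it is a toy assumption: the data are synthetic, `toy_predict` stands in for a trained model, and mean squared error is used as the performance metric. For real models, scikit-learn offers `sklearn.inspection.permutation_importance`.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy "model" that uses only feature 0 and ignores feature 1.
def toy_predict(row):
    return 3.0 * row[0]

rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [3.0 * x0 + rng.gauss(0, 0.1) for x0, _ in X]

print(permutation_importance(toy_predict, X, y))
```

Shuffling feature 0 raises the error, while shuffling feature 1 leaves the predictions unchanged, so its importance is exactly zero.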