# Identify Clinically Relevant Variables
Generate JSON files listing the variables I want to include in the analysis. These JSON files are used as input to my extraction queries.

I identify clinically relevant variables through a few methods:
  * High-frequency events with a low missing % (the acceptable missing % varies by event type)
  * Diagnosis risks and relevant tests from the literature
  * Variables included in past survival analysis studies

In [1]:
import pandas_gbq
from google.oauth2 import service_account

In [2]:
# apply credentials
credentials = service_account.Credentials.from_service_account_file('../Patient-Similarity-credentials.json')
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = "patient-similarity"

I save my variables to a JSON file that is loaded when I start extraction.

In [3]:
variable_file = "clinical_variables.json"
clinical_variables = {}

## Lab Events
I have replaced the names I originally wrote with the labels as they appear in the dataset. I still need to confirm that no label maps to more than one item ID, given the change of data systems. The grouping below is manual and based on perceived similarities (I am unsure whether this is sensible; it needs testing).

<u>Enzymes</u>
  * Alanine Aminotransferase (ALT)
  * Asparate Aminotransferase (AST)
  * Alkaline Phosphatase
  * Lactate Dehydrogenase (LD)
    * Lactate Dehydrogenase, Ascites
    * Lactate
 
 
<u>Blood coagulation levels -- general name</u>
  * Fibrinogen, Functional
  * Platelet Count
  * INR(PT)
  * PT
  * PTT
  * Iron
    * Iron Binding Capacity, Total
    * Transferrin
    * Ferritin
  
  
<u>Breakdown products</u>
  * Albumin
    * Albumin, Ascites
  * Bilirubin, Total
    * Bilirubin, Total, Ascites
    * Bilirubin
    * Bilirubin, Direct
    * Bilirubin, Indirect
  * Creatinine
    * Creatine Kinase (CK)
    * Creatine Kinase, MB Isoenzyme
    * Creatinine, Urine


<u>Proteins</u>
 * Protein
    * Protein, Total
    * Total Protein, Urine
    * Total Protein, Ascites
 * Alpha-Fetoprotein


<u>Hepatitis levels</u>
  * Hepatitis B Surface Antibody
  * Hepatitis B Surface Antigen
  * Hepatitis C Virus Antibody
  * Hepatitis B Virus Core Antibody
  
  
<u>Antibodies</u>
  * Anti-Mitochondrial Antibody
  * Anti-Nuclear Antibody, Titer
    * Anti-Nuclear Antibody
  * Anti-Smooth Muscle Antibody
 
 
<u>Other</u>
  * Phosphate
  * Gamma Glutamyltransferase
  * Ammonia  
  * Acetaminophen
  * Sodium
  * Potassium

In [4]:
# putting all lab tests into a list
# do I need to group some of these? for example is creatine kinase the same as creatine kinase MB isoenzyme
lab_tests = ["Alanine Aminotransferase (ALT)", "Asparate Aminotransferase (AST)", "Fibrinogen, Functional",
             "Platelet Count", "INR(PT)", "PT", "PTT", "Albumin", "Albumin, Ascites", "Bilirubin, Total", 
             "Bilirubin, Total, Ascites", "Bilirubin", "Bilirubin, Direct", "Bilirubin, Indirect", "Alkaline Phosphatase",
             "Phosphate", "Lactate Dehydrogenase (LD)", "Lactate Dehydrogenase, Ascites", "Lactate", "Gamma Glutamyltransferase",
             "Protein", "Protein, Total", "Total Protein, Urine", "Total Protein, Ascites", "Ammonia", 
             "Hepatitis B Surface Antibody", "Hepatitis B Surface Antigen", "Hepatitis C Virus Antibody", 
             "Hepatitis B Virus Core Antibody", "Alpha-Fetoprotein", "Iron", "Iron Binding Capacity, Total", 
             "Transferrin", "Ferritin", "Anti-Mitochondrial Antibody", "Anti-Nuclear Antibody, Titer", 
             "Anti-Nuclear Antibody", "Anti-Smooth Muscle Antibody", "Acetaminophen", "Creatinine", "Creatine Kinase (CK)",
             "Creatine Kinase, MB Isoenzyme", "Creatinine, Urine", "Sodium", "Potassium"]

In [18]:
# split the lab tests into the relevant groups
enzymes = ["Alanine Aminotransferase (ALT)", "Asparate Aminotransferase (AST)", "Alkaline Phosphatase",
           "Lactate Dehydrogenase (LD)", "Lactate Dehydrogenase, Ascites", "Lactate"]

blood = ["Fibrinogen, Functional","Platelet Count", "INR(PT)", "PT", "PTT", "Iron", "Iron Binding Capacity, Total", 
             "Transferrin", "Ferritin"]

breakdown = ["Albumin", "Albumin, Ascites", "Bilirubin, Total", "Bilirubin, Total, Ascites", 
             "Bilirubin", "Bilirubin, Direct", "Bilirubin, Indirect","Creatinine", "Creatine Kinase (CK)",
             "Creatine Kinase, MB Isoenzyme", "Creatinine, Urine"]

proteins = ["Protein", "Protein, Total", "Total Protein, Urine", "Total Protein, Ascites", "Alpha-Fetoprotein"]

hepatitis = ["Hepatitis B Surface Antibody", "Hepatitis B Surface Antigen", "Hepatitis C Virus Antibody", 
             "Hepatitis B Virus Core Antibody"]

antibodies = [ "Anti-Mitochondrial Antibody", "Anti-Nuclear Antibody, Titer", 
             "Anti-Nuclear Antibody", "Anti-Smooth Muscle Antibody"]

other = ["Phosphate","Gamma Glutamyltransferase","Ammonia", "Acetaminophen","Sodium", "Potassium"]
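
Because the groups are typed by hand, it is easy for a test to be left out of every group or to land in two of them. A minimal sketch of a consistency check follows; `check_grouping` and the toy data are hypothetical names for illustration, not part of the original notebook.

```python
from collections import Counter

def check_grouping(all_tests, groups):
    """Compare a master list against named groups.

    Returns (ungrouped, unknown, duplicated):
      ungrouped  - tests in all_tests that appear in no group
      unknown    - grouped tests that are missing from all_tests
      duplicated - tests listed more than once across the groups
    """
    counts = Counter(test for members in groups.values() for test in members)
    ungrouped = sorted(set(all_tests) - set(counts))
    unknown = sorted(set(counts) - set(all_tests))
    duplicated = sorted(test for test, n in counts.items() if n > 1)
    return ungrouped, unknown, duplicated

# Toy data for illustration only; in the notebook the arguments would be
# lab_tests and a dict built from the group lists defined above.
toy_tests = ["PT", "PTT", "Albumin", "Sodium"]
toy_groups = {"blood": ["PT", "PTT"], "breakdown": ["Albumin", "PT"]}
print(check_grouping(toy_tests, toy_groups))
```

Three empty lists would mean the groups split the master list cleanly.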

In [5]:
# Now I identify the most common lab tests across patients that are not in this list
# most common by percentage of patients that have received the lab test
# maybe something like 75% of the patients receive the test 
# maybe for now I ignore or I weight these less - these seem to be very generic?
q = """SELECT A.ITEMID, A.LABEL , (B.num_occurences/2884) as perc_patients FROM `patient-similarity.mimic.d_labitems` as A
inner join (select itemid, count(distinct subject_id) as num_occurences from `patient-similarity.mimic.labevents`
            where subject_id in (select subject_id from `patient-similarity.mimic.liver_pts`) 
            group by itemid) as B
on a.itemid=b.itemid
order by perc_patients desc"""
high_perc_lab = pandas_gbq.read_gbq(q)
addt_lab_tests = list(high_perc_lab[(high_perc_lab['perc_patients'] >= .75) & (~high_perc_lab.LABEL.isin(lab_tests))].LABEL)
addt_lab_tests

['Urea Nitrogen',
 'Hematocrit',
 'Hemoglobin',
 'MCH',
 'MCHC',
 'MCV',
 'Red Blood Cells',
 'White Blood Cells',
 'RDW',
 'Anion Gap',
 'Bicarbonate',
 'Chloride',
 'Glucose',
 'Magnesium',
 'Calcium, Total',
 'Eosinophils',
 'Lymphocytes',
 'Monocytes',
 'Neutrophils',
 'Basophils',
 'pH',
 'Specific Gravity',
 'pH',
 'Yeast',
 'WBC',
 'Urobilinogen',
 'RBC',
 'Epithelial Cells',
 'Base Excess',
 'Calculated Total CO2',
 'pCO2',
 'pO2',
 'Ketone',
 'Urine Color',
 'Lipase',
 'Glucose',
 'Urine Appearance',
 'Blood',
 'Nitrite']
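
The output above lists 'pH' and 'Glucose' twice, which suggests that one label can belong to more than one ITEMID (for example the same test on different fluids, though that is an assumption here). A small sketch for spotting such repeats, using a hypothetical helper `repeated_labels`:

```python
from collections import Counter

def repeated_labels(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label in counts if counts[label] > 1]

# A few entries copied from the output above.
sample = ["pH", "Specific Gravity", "pH", "Glucose", "Lipase", "Glucose", "Nitrite"]
print(repeated_labels(sample))
```

If repeats matter downstream, keeping ITEMID alongside LABEL when building the list may be safer than keeping the label alone.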

In [6]:
# are any of the pre_identified tests not available in at least 50% of patients?
high_perc_lab[(high_perc_lab['perc_patients'] < .5) & (high_perc_lab.LABEL.isin(lab_tests))]
# a decent amount - will need to think about this

| | ITEMID | LABEL | perc_patients |
|---|---|---|---|
| 84 | 50884 | Bilirubin, Indirect | 0.491331 |
| 88 | 50952 | Iron | 0.475035 |
| 91 | 50924 | Ferritin | 0.470527 |
| 93 | 50998 | Transferrin | 0.464286 |
| 94 | 50953 | Iron Binding Capacity, Total | 0.463245 |
| 112 | 50864 | Alpha-Fetoprotein | 0.325936 |
| 114 | 50866 | Ammonia | 0.31276 |
| 124 | 50976 | Protein, Total | 0.298544 |
| 138 | 50856 | Acetaminophen | 0.26595 |
| 139 | 50940 | Hepatitis B Surface Antibody | 0.262829 |


In [7]:
# Let's identify which values are time series and which are taken just once (if any)
# subject, hospital id, item id, item name, count of number in hospital, average(time between tests)
# where the item name is in lab tests
q = f"""select ITEMID, LABEL, avg(num_times_in_vist) as avg_times_by_visit
from (
  SELECT A.SUBJECT_ID , A.HADM_ID , A.ITEMID , B.LABEL, count(A.HADM_ID) as num_times_in_vist
  FROM `patient-similarity.mimic.labevents` as A
  left join `patient-similarity.mimic.d_labitems` as B
  on A.ITEMID = B.ITEMID
  where SUBJECT_ID in (select subject_id from `patient-similarity.mimic.liver_pts`)
  and LABEL in {tuple(lab_tests)}
  and A.HADM_ID is not null
  group by SUBJECT_ID, HADM_ID, ITEMID, LABEL
  having num_times_in_vist >=1
  order by SUBJECT_ID ,  HADM_ID, num_times_in_vist desc
) 
group by ITEMID, LABEL
order by avg_times_by_visit desc
"""
avg_labtest_by_visit = pandas_gbq.read_gbq(q)
time_series_labtests = list(avg_labtest_by_visit[avg_labtest_by_visit['avg_times_by_visit'] >= 5].LABEL)
time_series_labtests

['Potassium',
 'Sodium',
 'Platelet Count',
 'Creatinine',
 'Phosphate',
 'PTT',
 'INR(PT)',
 'PT',
 'Bilirubin, Total',
 'Asparate Aminotransferase (AST)',
 'Alanine Aminotransferase (ALT)',
 'Alkaline Phosphatase',
 'Lactate',
 'Fibrinogen, Functional',
 'Albumin']
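
The query above builds its `IN` clause by interpolating `tuple(lab_tests)` into the SQL string. That works for the lists used here, but Python renders a one-element tuple as `('PT',)`, and the trailing comma is likely to be rejected as SQL. The sketch below is one way to build the list explicitly. `sql_in_list` is a hypothetical helper, and the backslash escaping assumes BigQuery-style string literals.

```python
def sql_in_list(values):
    """Render strings as a parenthesised SQL list, e.g. ('PT', 'PTT')."""
    values = list(values)
    if not values:
        raise ValueError("an empty IN list is not valid SQL")
    quoted = []
    for value in values:
        # Assumption: backslash escaping, as in BigQuery string literals.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        quoted.append(f"'{escaped}'")
    return "(" + ", ".join(quoted) + ")"

print(sql_in_list(["PT"]))          # no trailing comma, unlike str(tuple(["PT"]))
print(sql_in_list(["PT", "PTT"]))
```

In the query this would replace `{tuple(lab_tests)}` with `{sql_in_list(lab_tests)}`. Query parameters are generally the more robust option where the client supports them.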

In [8]:
# now let's see what these actually look like over time
q = f"""SELECT A.SUBJECT_ID , A.HADM_ID , A.ITEMID , B.LABEL, A.CHARTTIME, A.VALUE
FROM `patient-similarity.mimic.labevents` as A
left join `patient-similarity.mimic.d_labitems` as B
on A.ITEMID = B.ITEMID
where SUBJECT_ID in (select subject_id from `patient-similarity.mimic.liver_pts`)
and LABEL in {tuple(time_series_labtests)}
and A.HADM_ID is not null 
order by subject_id , A.HADM_ID,  B.LABEL ,  A.CHARTTIME 
limit 1000"""
pandas_gbq.read_gbq(q).head(50)

| | SUBJECT_ID | HADM_ID | ITEMID | LABEL | CHARTTIME | VALUE |
|---|---|---|---|---|---|---|
| 0 | 4 | 185777 | 50861 | Alanine Aminotransferase (ALT) | 2191-03-15 14:12:00 | 28.0 |
| 1 | 4 | 185777 | 50861 | Alanine Aminotransferase (ALT) | 2191-03-16 05:42:00 | 24.0 |
| 2 | 4 | 185777 | 50862 | Albumin | 2191-03-16 05:42:00 | 2.8 |
| 3 | 4 | 185777 | 50863 | Alkaline Phosphatase | 2191-03-15 14:12:00 | 994.0 |
| 4 | 4 | 185777 | 50863 | Alkaline Phosphatase | 2191-03-16 05:42:00 | 837.0 |
| 5 | 4 | 185777 | 50878 | Asparate Aminotransferase (AST) | 2191-03-15 14:12:00 | 69.0 |
| 6 | 4 | 185777 | 50878 | Asparate Aminotransferase (AST) | 2191-03-16 05:42:00 | 64.0 |
| 7 | 4 | 185777 | 50885 | Bilirubin, Total | 2191-03-15 14:12:00 | 2.2 |
| 8 | 4 | 185777 | 50885 | Bilirubin, Total | 2191-03-16 05:42:00 | 1.9 |
| 9 | 4 | 185777 | 50912 | Creatinine | 2191-03-15 14:12:00 | 0.5 |


In [19]:
# Now we save our variables to our json for future use
clinical_variables["lab tests"] = {
    "relevant_vars": lab_tests,
    "potential_vars" : addt_lab_tests,
    "time_series_vars": time_series_labtests,
    "enzymes" : enzymes,
    "blood info": blood,
    "breakdown products": breakdown,
    "proteins": proteins,
    "hepatitis": hepatitis,
    "antibodies (other)": antibodies,
    "other": other

}
clinical_variables['lab tests']['time_series_vars']


['Potassium',
 'Sodium',
 'Platelet Count',
 'Creatinine',
 'Phosphate',
 'PTT',
 'INR(PT)',
 'PT',
 'Bilirubin, Total',
 'Asparate Aminotransferase (AST)',
 'Alanine Aminotransferase (ALT)',
 'Alkaline Phosphatase',
 'Lactate',
 'Fibrinogen, Functional',
 'Albumin']

## Chart Events
Identify our clinically relevant chart events. This is by far the messiest and most difficult part of the dataset, and identifying the variables takes a lot of manual work.

There is a lot of overlap with lab events, which I ignore. I focus instead on other information such as patient history and lifestyle.

In [10]:
# identify frequent chart events
q = """select LABEL, count(distinct subject_id)/2884 as perc_pts from (
  SELECT A.ITEMID, B.LABEL,  A.SUBJECT_ID  
  FROM `patient-similarity.mimic.chartevents` as A
  left join  `patient-similarity.mimic.d_items` as B
  on a.itemid=b.itemid
  where subject_id in (select subject_id from `patient-similarity.mimic.liver_pts`)
)
where LABEL not in (select LABEL from `patient-similarity.mimic.d_labitems` )
group by LABEL
order by perc_pts desc
"""
chart_events = pandas_gbq.read_gbq(q)
list(chart_events[chart_events['perc_pts']>=.4].LABEL)

['Religion',
 'Respiratory Rate',
 'Heart Rate',
 'Heart Rhythm',
 'RUL Lung Sounds',
 'LLL Lung Sounds',
 'LUL Lung Sounds',
 'RLL Lung Sounds',
 'Abdominal Assessment',
 'Activity',
 'Respiratory Pattern',
 'Turn',
 'Skin Integrity',
 'Bowel Sounds',
 'Oral Cavity',
 'Urine Source',
 'Activity Tolerance',
 'Braden Mobility',
 'Braden Activity',
 'Braden Moisture',
 'Assistance Device',
 'Braden Nutrition',
 'Pain Present',
 'Position',
 'Pain Location',
 'Marital Status',
 'Diet Type',
 'Support Systems',
 'Orientation',
 'Side Rails',
 'BUN',
 'Pain Type',
 'Parameters Checked',
 'Daily Wake Up',
 'Phosphorous',
 'Therapeutic Bed',
 'Service',
 'Oral Care',
 'INR',
 'Family Communication',
 'Code Status',
 'Education Learner',
 'Education Readiness',
 'Education Barrier',
 'Education Method',
 'ALT',
 'AST',
 'Education Response',
 'Pain Level',
 'Back Care',
 'Skin Care',
 'Pain Cause',
 'Eye Care',
 'Restraint Location',
 'Daily Weight',
 'Skin Color',
 'Skin Condition',
 'LDH',
 

In [11]:
# I still need to do diet, pain, activity, and phys_assess
# for everything, I will have to select distinct 
# possibly, I will have to select the most commonly occurring one too; there are a lot of repeats

# physical assessments occur every 4-6 hours
# just take the mode of each category
phys_assess = ['Abdominal Assessment','Skin Color', 'Skin Condition','Speech','Gag Reflex', 
               'Cough Reflex', 'Oral Cavity', 'Bowel Sounds','Braden Moisture']

# activity measurements occur every 4-6 hours
# just take the mode of each category
activity = ['Activity', 'Braden Mobility', 'Braden Activity', 'Activity Tolerance']


# pain level appears to be associated with a certain spot. 
# how do I capture this? 
# maybe just average pain across the body and one hot encoding body locations
# I think this requires more thought
pain = ['Pain Present', 'Pain Location', 'Pain Type','Pain Level','Pain Cause']

# avg the two previous weights and then average daily weight and admit weight
# do I incorporate a time series here?
diet = ['Braden Nutrition', "Diet Type", 'Daily Weight','Previous WeightF',"Admit Wt", "Previous Weight", "Appetite",
       "Weight Change", "Special diet"]

demographics = ['Marital Status', 'Family Communication', 'Gender', 'Race']

# hypoxia is common in cirrhosis
# heart and respiratory rate will probably be fine with min, max, average?
# can probably one hot the rest?
# liver disease accompanied by Heart Rate Variability decrease - how to capture
heart = ["Heart Rate", 'Heart Rhythm']
lung = ['Respiratory Rate',  'RUL Lung Sounds', 'LLL Lung Sounds', 'RLL Lung Sounds','LUL Lung Sounds', 
        'Respiratory Pattern','Respiratory Effort']

# medical histories will need to be expanded - maybe one hot encoding?
med_history = ['Past medical history', "CV - past medical history", 'Mental status', "Recreational drug use"]
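
The comments above suggest taking the mode of each categorical assessment and the min, max and average of heart and respiratory rate. A minimal sketch of those two summaries, using only the standard library; the function names and the readings are made up for illustration.

```python
from statistics import mean, mode

def summarize_categorical(values):
    """Most frequent value; ties go to the value seen first (Python 3.8+)."""
    return mode(values)

def summarize_numeric(values):
    """Min, max and mean of a numeric series."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}

# Made-up readings for one hypothetical admission.
bowel_sounds = ["Present", "Hypoactive", "Present", "Present"]
heart_rate = [88, 92, 101, 95]

print(summarize_categorical(bowel_sounds))
print(summarize_numeric(heart_rate))
```

In practice these would be applied per patient and admission after grouping the chart events, which is not shown here.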

The next step is to develop the variable extraction functions in tandem with variable selection. I suspect I will have to repeat work if I do not write the data extraction code as I go. I think I have a good enough base now.

In [20]:
# add to our json file
clinical_variables['chart events'] = {
    "physical assessment" : phys_assess,
    "activity" : activity,
    "pain" : pain,
    "diet" : diet,
    "demographics" : demographics,
    "heart" : heart,
    "lung" : lung,
    "medical history" : med_history
}
clinical_variables['chart events']['medical history']

['Past medical history',
 'CV - past medical history',
 'Mental status',
 'Recreational drug use']

In [21]:
import json
with open('../data/clinical variables.txt', 'w') as f:
    json.dump(clinical_variables, f)
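
For reference, reading the file back at the start of extraction could look like the sketch below. It writes to a temporary directory with a hypothetical file name so that it does not touch real data, and the `variables` dict is a cut-down stand-in for `clinical_variables`.

```python
import json
import os
import tempfile

# Cut-down stand-in for clinical_variables.
variables = {"lab tests": {"time_series_vars": ["Potassium", "Sodium"]}}

# Write to a throwaway directory so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clinical_variables.json")  # hypothetical name
    with open(path, "w") as f:
        json.dump(variables, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

print(loaded == variables)
```

Lists and strings survive the round trip unchanged, so the extraction code can index the loaded dict the same way as `clinical_variables`.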