# Phenotype exploratory analysis

### Exploring the list of patients from the phenotype classification task of Gehrmann et al. (2018)

Downloaded as 'annotations.csv' from https://github.com/sebastianGehrmann/phenotyping/blob/master/data/annotations.csv

In [1]:
import pandas as pd
import numpy as np
import os
import psycopg2
from sqlalchemy import create_engine 
import string
import spacy
import re
from datetime import date, datetime, timedelta
import random

In [54]:
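annotations = pd.read_csv('../data/annotations.csv')
# Rename the first two columns to the MIMIC column names used in the later joins.
# rename() is used because assigning into columns.values in place is not reliable.
annotations = annotations.rename(columns={annotations.columns[0]: 'hadm_id',
                                          annotations.columns[1]: 'subject_id'})
annotations.head()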

Unnamed: 0,hadm_id,subject_id,chart.time,cohort,Obesity,Non.Adherence,Developmental.Delay.Retardation,Advanced.Heart.Disease,Advanced.Lung.Disease,Schizophrenia.and.other.Psychiatric.Disorders,Alcohol.Abuse,Other.Substance.Abuse,Chronic.Pain.Fibromyalgia,Chronic.Neurological.Dystrophies,Advanced.Cancer,Depression,Dementia,Unsure
0,118003,3644,118003,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0
1,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,185673,27694,999999,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,131938,16275,131938,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,198999,4059,198999,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0


**Cohort 1** refers to 'frequent flyers'; **Cohort 0** refers to randomly selected summaries from patients who are not frequent flyers.

In [43]:
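# The column was renamed to 'subject_id' above, so use that name here
subjects = list(annotations['subject_id'].drop_duplicates())
# Build a "(id1, id2, ...)" string for use in a SQL IN clause
subject_str = '(' + ", ".join(str(x) for x in subjects) + ')'
len(subjects)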

1045

The list of unique subject ids is converted to a string so that it can be used in the SQL `IN` clause below. The 1610 rows in the annotations table cover only 1045 unique patients.
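
The gap between the row count and the patient count means that some patients contribute more than one annotated row. A minimal sketch of how that could be checked, using a small made-up frame (`toy` and its values are illustrative and are not taken from annotations.csv):

```python
import pandas as pd

# Toy stand-in for the annotations frame; all values are made up for illustration.
toy = pd.DataFrame({
    'hadm_id':    [101, 102, 103, 104, 105],
    'subject_id': [1, 1, 2, 3, 3],
})

n_rows = len(toy)                         # total annotated rows
n_patients = toy['subject_id'].nunique()  # distinct patients

# Number of annotated rows per patient
rows_per_patient = toy['subject_id'].value_counts()

print(n_rows, n_patients)
# Patients who appear more than once
print(rows_per_patient[rows_per_patient > 1])
```

Running the same three expressions on `annotations` would show which patients account for the difference.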

In [11]:
# connect to the mimic database and set the search path to the 'mimiciii' schema

dbschema='mimiciii'
cnx = create_engine('postgresql+psycopg2://aa5118:mimic@localhost:5432/mimic',
                    connect_args={'options': '-csearch_path={}'.format(dbschema)})

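Confirming that only adults are part of the phenotype classification task. The query computes each patient's age at every note event and sorts in ascending order, so the youngest ages appear first.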

In [44]:
df = pd.read_sql_query('''
    SELECT p.subject_id, p.dob, p.gender,
           ROUND((cast(n.chartdate as date) - cast(p.dob as date)) / 365.242, 0) AS age_at_noteevent
    FROM patients p 
    INNER JOIN noteevents n 
    ON p.subject_id = n.subject_id 
    WHERE p.subject_id IN ''' + subject_str + '''
    ORDER BY age_at_noteevent
''', cnx)
df.head()

Unnamed: 0,subject_id,dob,gender,age_at_noteevent
0,10510,2183-07-09,M,18.0
1,10510,2183-07-09,M,18.0
2,10510,2183-07-09,M,18.0
3,10510,2183-07-09,M,18.0
4,10510,2183-07-09,M,18.0
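
The first rows only show the youngest ages, so an explicit check on the whole column makes the "adults only" claim easier to verify. The following is a sketch on made-up data: `toy_ages` and the `ADULT_AGE` threshold of 18 are assumptions for illustration. It also counts ages above 89, because MIMIC-III shifts the dates of birth of patients older than 89, which makes their computed ages come out at around 300.

```python
import pandas as pd

# Toy stand-in for the query result; ages are made up for illustration.
toy_ages = pd.DataFrame({
    'subject_id':       [10, 10, 11, 12],
    'age_at_noteevent': [18.0, 19.0, 54.0, 300.0],
})

ADULT_AGE = 18  # assumed threshold for "adult"

# Youngest age at any note event
min_age = toy_ages['age_at_noteevent'].min()

# True only if every note event belongs to an adult
all_adults = bool((toy_ages['age_at_noteevent'] >= ADULT_AGE).all())

# Rows whose age exceeds 89, i.e. the date-shifted patients
n_shifted = int((toy_ages['age_at_noteevent'] > 89).sum())

print(min_age, all_adults, n_shifted)
```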


### Comparing the manually annotated phenotypes with the diagnoses from the MIMIC diagnoses tables

In [58]:
df_diag = pd.read_sql_query('''
    SELECT * 
    FROM diagnoses_icd 
    JOIN d_icd_diagnoses 
    USING (icd9_code)
    ORDER BY subject_id, hadm_id
    ''', cnx)
df_diag.head()

Unnamed: 0,icd9_code,row_id,subject_id,hadm_id,seq_num,row_id.1,short_title,long_title
0,V3001,1,2,163353,1,13695,Single lb in-hosp w cs,"Single liveborn, born in hospital, delivered b..."
1,V053,2,2,163353,2,12202,Need prphyl vc vrl hepat,Need for prophylactic vaccination and inoculat...
2,V290,3,2,163353,3,13688,NB obsrv suspct infect,Observation for suspected infectious condition
3,0389,4,3,145834,1,660,Septicemia NOS,Unspecified septicemia
4,78559,5,3,145834,2,12992,Shock w/o trauma NEC,Other shock without mention of trauma


In [56]:
annotations_merged = pd.merge(annotations, df_diag, on=['subject_id', 'hadm_id'])
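
`pd.merge` performs an inner join by default, so any annotated admission without a matching row in the diagnoses table is dropped silently. A minimal sketch of how such admissions could be detected with a left join and the `indicator` flag (the two toy frames and their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the annotations and diagnoses frames; values are made up.
toy_annotations = pd.DataFrame({
    'hadm_id':         [1001, 1002, 1003],
    'subject_id':      [1, 2, 3],
    'Advanced.Cancer': [0, 1, 0],
})
toy_diag = pd.DataFrame({
    'hadm_id':    [1001, 1001, 1002],
    'subject_id': [1, 1, 2],
    'icd9_code':  ['0389', '78559', '19889'],
})

# A left join keeps every annotated admission; the '_merge' column
# records whether a diagnosis row was found for it.
checked = pd.merge(toy_annotations, toy_diag,
                   on=['subject_id', 'hadm_id'],
                   how='left', indicator=True)

# Admissions that have annotations but no diagnosis rows
unmatched = checked.loc[checked['_merge'] == 'left_only', 'hadm_id'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join above loses no annotated admissions.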

In [62]:
pd.options.display.max_columns = None
annotations_merged[annotations_merged['subject_id'] == 97736]

Unnamed: 0,hadm_id,subject_id,chart.time,cohort,Obesity,Non.Adherence,Developmental.Delay.Retardation,Advanced.Heart.Disease,Advanced.Lung.Disease,Schizophrenia.and.other.Psychiatric.Disorders,Alcohol.Abuse,Other.Substance.Abuse,Chronic.Pain.Fibromyalgia,Chronic.Neurological.Dystrophies,Advanced.Cancer,Depression,Dementia,Unsure,icd9_code,row_id,seq_num,row_id.1,short_title,long_title
9,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,00845,640909,1,80,Int inf clstrdium dfcile,Intestinal infection due to Clostridium difficile
10,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,19889,640910,2,2080,Secondary malig neo NEC,Secondary malignant neoplasm of other specifie...
11,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,53085,640911,3,6017,Barrett's esophagus,Barrett's esophagus
12,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,28860,640912,4,3185,Leukocytosis NOS,"Leukocytosis, unspecified"
13,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,7850,640913,5,12414,Tachycardia NOS,"Tachycardia, unspecified"
14,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,27652,640914,6,2156,Hypovolemia,Hypovolemia
15,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,V1089,640915,7,13920,Hx of malignancy NEC,Personal history of malignant neoplasm of othe...
16,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,V8741,640916,8,13967,Hx antineoplastic chemo,Personal history of antineoplastic chemotherapy
17,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,V1251,640917,9,13945,Hx-ven thrombosis/embols,Personal history of venous thrombosis and embo...
18,177830,97736,999999,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,V5861,640918,10,12792,Long-term use anticoagul,Long-term (current) use of anticoagulants


According to the manual annotation, this patient has advanced cancer. This corresponds to the second-ranked diagnosis in the MIMIC diagnoses table (seq_num 2), 'secondary malignant neoplasm'. Note that seq_num reflects the priority order in which the diagnoses were coded, which is used here as a rough proxy for severity.

The top-ranked diagnosis (seq_num 1) is 'Intestinal infection due to Clostridium difficile', which has no counterpart among the manually annotated phenotypes. The reverse also holds for other patients: some annotated phenotypes have no matching diagnosis in the MIMIC table. This suggests that it may be worth adding the patients' diagnoses to the input of the text generation model.
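
One possible way to prepare the diagnoses as model input is to collapse the rows of each admission into a single string, ordered by `seq_num`. This is only a sketch of the idea: `toy_merged` holds a few rows copied from the outputs shown earlier, and the `'; '` separator and the use of `short_title` are assumptions rather than a fixed design.

```python
import pandas as pd

# A few rows copied from the outputs above, deliberately out of order.
toy_merged = pd.DataFrame({
    'hadm_id': [177830, 177830, 177830, 145834, 145834],
    'seq_num': [2, 1, 3, 1, 2],
    'short_title': [
        'Secondary malig neo NEC',
        'Int inf clstrdium dfcile',
        "Barrett's esophagus",
        'Septicemia NOS',
        'Shock w/o trauma NEC',
    ],
})

# Sort by coding priority, then join the titles of each admission
# into one string that could be appended to the model input.
diag_text = (
    toy_merged.sort_values(['hadm_id', 'seq_num'])
    .groupby('hadm_id')['short_title']
    .agg(lambda titles: '; '.join(titles))
)

print(diag_text[177830])
```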