<div style="text-align: center; font-weight: bold;">
    <h1>Generating Research Ready EHR Datasets</h1>
    <h2>Part 3: Cohort Creation, Data Aggregation and NLP</h2>
    <h4>Author: Vidul Ayakulangara Panickan</h4>
</div>




In [4]:
import os
import pandas as pd
from ehrt import Text2Cui

base_directory = os.path.dirname(os.getcwd())




Unnamed: 0,STR,CUI
0,asthmatics,C0004096
1,asthma nos,C0004096
2,asthma disorders,C0004096
3,unspecified asthma,C0004096
4,bronchitic asthma,C0004096
...,...,...
384,methacholine containing product,C0600370
385,methacholinum,C0600370
386,beta methylacetylcholine,C0600370
387,methacholine substance,C0600370


Dictionary loaded from /n/data1/hsph/biostat/celehs/lab/va67/EHR_TUTORIAL_WORKSPACE/scripts/meta_files/asthma_dict.csv


In [6]:
text2cui.traverse('''The patient does has asthma and methacholinum and has methacholine 
containing product with salbutamol product beta methylacetylcholine asthma disorders''')

'C0004096,C0600370,C0600370,C0001927,C0600370,C0004096'

## Step 7: Cohort Creation

EHR-based studies are typically conducted on a group of patients who meet specific criteria. Some examples are:

- **Disease-specific studies**: Patients diagnosed with a particular disease or condition.
- **Treatment-based studies**: Patients who have undergone a particular procedure or who were prescribed a particular medication.
- **Device-based studies**: Patients implanted with specific devices.

In all the cases above, the first task is to identify the patient cohort.


### Creating an Asthma Cohort

For example, let's consider a cohort of patients diagnosed with asthma. To identify the list of asthma patients, a common strategy is to identify the ICD codes corresponding to asthma. However, the ICD codes are so granular that different studies on asthma may use different sets of ICD codes.

Alternatively, we could use the PheCodes corresponding to asthma and filter patients based on those. The [PheWAS Catalog](https://phewascatalog.org/) provides the mapping between ICD codes and PheCodes; search it for the PheCodes corresponding to asthma.

We see that the PheCode string "Asthma" corresponds to PheCode 495. Once we have the PheCode, we can identify all ICD codes that fall under it and then select the patients who have at least one of those ICD codes. We'll go ahead and implement that.
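
Before running this on the full MIMIC files, the two steps can be sketched on toy data. The tables below are hand-built for illustration: the three asthma codes and the `5723` row reuse values that appear in this notebook's outputs, while the patient IDs are made up. This is a minimal sketch of the idea, not the exact cell used in the tutorial.

```python
import pandas as pd

# Toy ICD-to-PheCode mapping (a handful of rows, not the full catalog table).
icd_to_phecode = pd.DataFrame({
    "code": ["49390", "49382", "J45909", "5723"],
    "coding_system": ["ICD9", "ICD9", "ICD10", "ICD9"],
    "PheCode": ["495", "495", "495", "571.81"],
})

# Toy diagnoses table; subject IDs are hypothetical.
diagnoses = pd.DataFrame({
    "subject_id": ["1", "1", "2", "3"],
    "code": ["49390", "5723", "J45909", "5723"],
    "coding_system": ["ICD9", "ICD9", "ICD10", "ICD9"],
})

# Step 1: every ICD code that rolls up to the asthma PheCode.
asthma_codes = icd_to_phecode[icd_to_phecode["PheCode"] == "495"]

# Step 2: patients with at least one of those codes.
# Matching on both columns avoids confusing ICD-9 and ICD-10 codes that look alike.
asthma_rows = diagnoses.merge(asthma_codes, on=["coding_system", "code"], how="inner")
asthma_patients = sorted(asthma_rows["subject_id"].unique())
print(asthma_patients)
```

Patient 3 only has a non-asthma code and is therefore excluded from the cohort.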

In [5]:
# Following is the raw diagnoses file from MIMIC
diagnoses_icd_file = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","diagnoses_icd.csv")
diagnoses_icd = pd.read_csv(diagnoses_icd_file,dtype=str)
diagnoses_icd = diagnoses_icd.rename(columns={'icd_code':'code'})
diagnoses_icd['coding_system']="ICD"+diagnoses_icd['icd_version']
display(diagnoses_icd.head())

# ICD to PheCode mapping from PheWAS Catalog
icd_to_phecode_file = os.path.join(base_directory, 'scripts', 'rollup_mappings',"ICD_to_PheCode.csv")
icd_to_phecode = pd.read_csv(icd_to_phecode_file, dtype=str)
display(icd_to_phecode.head())

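# Comprehensive diagnoses: attach a PheCode to each diagnosis row.
# pd.merge defaults to an inner join, so ICD codes without a mapping are dropped.
comprehensive_diagnoses = pd.merge(diagnoses_icd, icd_to_phecode, on=['coding_system','code'])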
display(comprehensive_diagnoses.head())

# Cohort of interest
comprehensive_asthma_cohort = comprehensive_diagnoses[comprehensive_diagnoses['PheCode']=='495']
display(comprehensive_asthma_cohort)
print(comprehensive_asthma_cohort.describe())

asthma_cohort = comprehensive_asthma_cohort[['subject_id']].drop_duplicates()
display(asthma_cohort)

Unnamed: 0,subject_id,hadm_id,seq_num,code,icd_version,coding_system
0,10000032,22595853,1,5723,9,ICD9
1,10000032,22595853,2,78959,9,ICD9
2,10000032,22595853,3,5715,9,ICD9
3,10000032,22595853,4,7070,9,ICD9
4,10000032,22595853,5,496,9,ICD9


Unnamed: 0,code,PheCode,coding_system
0,1,8,ICD9
1,10,8,ICD9
2,11,8,ICD9
3,19,8,ICD9
4,2,8,ICD9


Unnamed: 0,subject_id,hadm_id,seq_num,code,icd_version,coding_system,PheCode
0,10000032,22595853,1,5723,9,ICD9,571.81
1,10000826,20032235,4,5723,9,ICD9,571.81
2,10000826,28289260,1,5723,9,ICD9,571.81
3,10005866,26158160,4,5723,9,ICD9,571.81
4,10008924,23676183,7,5723,9,ICD9,571.81


Unnamed: 0,subject_id,hadm_id,seq_num,code,icd_version,coding_system,PheCode
2478863,10001725,25563031,4,49390,9,ICD9,495
2478864,10001884,26679629,7,49390,9,ICD9,495
2478865,10003019,20030125,5,49390,9,ICD9,495
2478866,10003019,20277210,10,49390,9,ICD9,495
2478867,10003019,20962108,15,49390,9,ICD9,495
...,...,...,...,...,...,...,...
5908161,17892612,24109018,1,49382,9,ICD9,495
5908162,17997063,25519468,11,49382,9,ICD9,495
5908163,18269165,28966193,6,49382,9,ICD9,495
5908164,18958101,23643092,8,49382,9,ICD9,495


       subject_id   hadm_id seq_num    code icd_version coding_system PheCode
count       42057     42057   42057   42057       42057         42057   42057
unique      20316     42035      39      11           2             2       1
top      18676703  24773199       5  J45909          10         ICD10     495
freq           60         3    4105   20679       21954         21954   42057


Unnamed: 0,subject_id
2478863,10001725
2478864,10001884
2478865,10003019
2478872,10004457
2478875,10004749
...,...
5908156,16550589
5908158,17562616
5908161,17892612
5908162,17997063
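
Because the merge in the cell above is an inner join, diagnosis rows whose ICD code has no PheCode mapping disappear without warning. One possible way to quantify this is a left merge with pandas' `indicator=True` option. The sketch below uses toy tables, and `ZZZ` is a made-up code that deliberately has no mapping.

```python
import pandas as pd

diagnoses = pd.DataFrame({
    "subject_id": ["1", "2", "3"],
    "code": ["49390", "J45909", "ZZZ"],  # "ZZZ" is a hypothetical unmapped code
    "coding_system": ["ICD9", "ICD10", "ICD10"],
})
mapping = pd.DataFrame({
    "code": ["49390", "J45909"],
    "coding_system": ["ICD9", "ICD10"],
    "PheCode": ["495", "495"],
})

# A left merge keeps every diagnosis row; the _merge column records whether a mapping was found.
checked = diagnoses.merge(mapping, on=["coding_system", "code"], how="left", indicator=True)
unmapped = checked[checked["_merge"] == "left_only"]
print(f"{len(unmapped)} of {len(diagnoses)} diagnosis rows have no PheCode")
```

If the unmapped share is large on real data, it may be worth checking how the codes are formatted in each file before trusting the cohort size.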


### Extract Cohort Data of Interest
For each set of rolled-up data, we can now extract the records that belong to the cohort and aggregate them at the patient level.
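
The extraction step amounts to filtering each batch down to the cohort's patients and stacking the results. Here is a minimal sketch on toy data, where two small in-memory frames stand in for the batch files on disk and all IDs and dates are made up. It filters with `isin`, which is one alternative to an inner merge on `subject_id`.

```python
import pandas as pd

cohort = pd.DataFrame({"subject_id": ["1", "3"]})

# Two small frames stand in for the batch files.
batches = [
    pd.DataFrame({"subject_id": ["1", "2"],
                  "date": ["2140-01-01", "2140-01-02"],
                  "PheCode": ["495", "401.1"]}),
    pd.DataFrame({"subject_id": ["3", "3"],
                  "date": ["2141-05-01", "2141-06-01"],
                  "PheCode": ["495", "272.1"]}),
]

kept = []
for batch in batches:
    # Keep only the rows that belong to cohort patients.
    kept.append(batch[batch["subject_id"].isin(cohort["subject_id"])])

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
extracted = pd.concat(kept, ignore_index=True)
print(extracted)
```

Unlike a merge, `isin` cannot duplicate rows if the cohort table accidentally contains repeated IDs.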



In [13]:
import pandas as pd
import os

base_directory = os.path.dirname(os.getcwd())
print(base_directory)

# We will extract the data of interest, aggregate it, and save it here
cohort_aggregateddata_directory = os.path.join(base_directory, 'processed_data', 'step6_cohort_aggregateddata')
os.makedirs(cohort_aggregateddata_directory, exist_ok=True)

# We will keep the codified aggregated data separate from the NLP aggregated data
cohort_aggregateddata_codified_directory = os.path.join(cohort_aggregateddata_directory, 'codified')
os.makedirs(cohort_aggregateddata_codified_directory, exist_ok=True)

/n/data1/hsph/biostat/celehs/lab/va67/EHR_TUTORIAL_WORKSPACE


In [14]:
# Process the rolled-up diagnoses in batches and keep only the cohort's records

rolledup_diagnoses_directory = os.path.join(base_directory, 'processed_data', 'step4_rolledup_finaldata', 'Diagnoses')
rolledup_diagnoses_batch_files = os.listdir(rolledup_diagnoses_directory)
sample_rolled_diagnoses = pd.read_csv(os.path.join(rolledup_diagnoses_directory, rolledup_diagnoses_batch_files[0]), dtype=str)

extracted_diagnoses_dfs = []

for diagnoses_batch_file in rolledup_diagnoses_batch_files:
    diagnoses_batch = pd.read_csv(os.path.join(rolledup_diagnoses_directory, diagnoses_batch_file), dtype=str)
    # Inner merge keeps only the rows whose subject_id is in the asthma cohort
    diagnoses_batch_extracted = pd.merge(diagnoses_batch, asthma_cohort, on=['subject_id'], how='inner')
    display(diagnoses_batch_extracted)
    extracted_diagnoses_dfs.append(diagnoses_batch_extracted)

extracted_diagnoses = pd.concat(extracted_diagnoses_dfs)
display(extracted_diagnoses)

Unnamed: 0,subject_id,date,PheCode
0,10004457,2140-09-17,411.4
1,10004457,2140-09-17,272.1
2,10004457,2140-09-17,401.1
3,10004457,2140-09-17,495
4,10004457,2140-09-17,185
...,...,...,...
122569,19990563,2180-11-30,783
122570,19990563,2180-11-30,250.2
122571,19990563,2180-11-30,401.1
122572,19990563,2180-11-30,457.3


Unnamed: 0,subject_id,date,PheCode
0,10017393,2179-07-20,694.3
1,10017393,2179-07-20,960.2
2,10017393,2179-07-20,528.7
3,10017393,2179-07-20,361
4,10017393,2179-07-20,495
...,...,...,...
118535,19990581,2141-07-23,272.11
118536,19990581,2141-07-23,250.24
118537,19990581,2141-07-23,536.3
118538,19990581,2141-07-23,495


Unnamed: 0,subject_id,date,PheCode
0,10011912,2176-10-21,800.3
1,10011912,2176-10-21,070.3
2,10011912,2176-10-21,174.11
3,10011912,2176-10-21,495
4,10011912,2176-10-21,317.11
...,...,...,...
115431,19999442,2148-11-19,856
115432,19999442,2148-11-19,348
115433,19999442,2148-11-19,342
115434,19999442,2148-11-19,495


Unnamed: 0,subject_id,date,PheCode
0,10001884,2130-10-05,496.21
1,10001884,2130-10-05,1013
2,10001884,2130-10-05,276.14
3,10001884,2130-10-05,401.1
4,10001884,2130-10-05,272.11
...,...,...,...
115775,19997760,2187-07-09,286.2
115776,19997760,2187-07-09,594
115777,19997760,2187-07-09,972.6
115778,19997760,2187-07-09,972.1


Unnamed: 0,subject_id,date,PheCode
0,10003019,2174-12-25,743.21
1,10003019,2174-12-25,334
2,10003019,2174-12-25,697
3,10003019,2174-12-25,510
4,10003019,2174-12-25,495
...,...,...,...
124334,19996016,2159-12-10,415
124335,19996016,2159-12-10,256.4
124336,19996016,2159-12-10,1010.6
124337,19996016,2159-12-10,652


Unnamed: 0,subject_id,date,PheCode
0,10002800,2164-07-12,649
1,10002800,2164-07-12,521.1
2,10002800,2164-07-12,1010.6
3,10002800,2164-07-12,495
4,10002800,2164-07-12,646
...,...,...,...
117695,19996832,2179-02-21,296.22
117696,19996832,2179-02-21,297.2
117697,19996832,2179-02-21,495
117698,19996832,2179-02-21,318


Unnamed: 0,subject_id,date,PheCode
0,10011607,2184-04-26,495.2
1,10011607,2184-04-26,509.1
2,10011607,2184-04-26,411.3
3,10011607,2184-04-26,401.1
4,10011607,2184-04-26,290.1
...,...,...,...
115303,19998350,2128-02-21,327.32
115304,19998350,2128-02-21,495
115305,19998350,2128-02-21,278.1
115306,19998350,2128-02-21,300.1


Unnamed: 0,subject_id,date,PheCode
0,10001725,2110-04-11,599.2
1,10001725,2110-04-11,946
2,10001725,2110-04-11,618.5
3,10001725,2110-04-11,495
4,10001725,2110-04-11,530.11
...,...,...,...
113325,19997887,2117-04-07,318
113326,19997887,2117-04-07,288.2
113327,19997887,2117-04-07,338.1
113328,19997887,2117-04-07,789


Unnamed: 0,subject_id,date,PheCode
0,10004457,2140-09-17,411.4
1,10004457,2140-09-17,272.1
2,10004457,2140-09-17,401.1
3,10004457,2140-09-17,495
4,10004457,2140-09-17,185
...,...,...,...
113325,19997887,2117-04-07,318
113326,19997887,2117-04-07,288.2
113327,19997887,2117-04-07,338.1
113328,19997887,2117-04-07,789


### Aggregate Data at Patient Level
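
Aggregation here means turning the long table of (patient, date, PheCode) records into one row per patient with one count column per PheCode. The toy example below sketches that reshaping. The records are made up, and `aggfunc="sum"` is passed explicitly as an assumption to make the intent clear, since `pivot_table` averages by default.

```python
import pandas as pd

# Hypothetical long-format records: one row per coded diagnosis event.
records = pd.DataFrame({
    "subject_id": ["1", "1", "1", "2"],
    "date": ["2140-01-01", "2140-02-01", "2140-02-01", "2141-03-01"],
    "PheCode": ["495", "495", "401.1", "495"],
})

# Count how many times each PheCode appears for each patient.
counts = records.groupby(["subject_id", "PheCode"]).size().reset_index(name="counts")

# Reshape to a patient-by-PheCode matrix; codes a patient never had become 0.
matrix = counts.pivot_table(index="subject_id", columns="PheCode",
                            values="counts", aggfunc="sum", fill_value=0)
print(matrix)
```

Each row of `matrix` is a patient and each column a PheCode, which is the layout most downstream models expect.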

In [15]:
phecode_counts_per_patient = extracted_diagnoses.groupby(['subject_id', 'PheCode']).size().reset_index(name='counts')
display(phecode_counts_per_patient)

phecode_counts_per_patient_matrixformat = phecode_counts_per_patient.pivot_table(index='subject_id', columns='PheCode', values='counts', fill_value=0)
display(phecode_counts_per_patient_matrixformat)

# Keep the index when saving: subject_id is the index of the pivoted matrix
phecode_counts_per_patient_matrixformat.to_csv(os.path.join(cohort_aggregateddata_codified_directory, "Diagnoses.csv"))

Unnamed: 0,subject_id,PheCode,counts
0,10001725,180.1,1
1,10001725,296.2,1
2,10001725,300.1,1
3,10001725,313.1,1
4,10001725,318,1
...,...,...,...
480814,19999442,433.21,1
480815,19999442,495,2
480816,19999442,591,1
480817,19999442,594,1


subject_id,008,008.5,008.51,008.52,008.6,008.7,010,031,038,038.1,...,983,985,986,987,988,989,990,994.1,994.2,994.21
10001725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10001884,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10002800,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10003019,0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,1
10004296,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19997760,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19997887,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19998350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19999112,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Natural Language Processing
Converting Unstructured Data to Structured Data
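
To give an intuition for what dictionary-based concept extraction does, here is a small pure-Python sketch of greedy longest-match lookup. It is not the `ehrt` implementation of `Text2Cui`, whose internals are not described in this notebook. The helper names `tokenize`, `build_lookup`, and `match_cuis` are hypothetical, and the dictionary rows are copied from the asthma dictionary output shown in this notebook.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_lookup(rows):
    """Map each term, as a tuple of words, to its CUI."""
    return {tuple(tokenize(term)): cui for term, cui in rows}

def match_cuis(text, lookup):
    """Scan left to right, preferring the longest dictionary term at each position."""
    words = tokenize(text)
    longest = max(len(key) for key in lookup)
    found, i = [], 0
    while i < len(words):
        for n in range(min(longest, len(words) - i), 0, -1):
            cui = lookup.get(tuple(words[i:i + n]))
            if cui is not None:
                found.append(cui)
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no term starts here; move on one word
    return found

rows = [
    ("asthma nos", "C0004096"),
    ("asthma disorders", "C0004096"),
    ("methacholinum", "C0600370"),
    ("methacholine containing product", "C0600370"),
    ("beta methylacetylcholine", "C0600370"),
]
lookup = build_lookup(rows)
note = "History of asthma disorders, given methacholinum then beta methylacetylcholine"
cuis = match_cuis(note, lookup)
print(",".join(cuis))
```

The result is a sequence of concept identifiers in order of appearance, which can then be counted per patient in the same way as the codified data.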

In [2]:
import os
import ehrt
import pandas as pd
from ehrt import Text2Cui

base_directory = os.path.dirname(os.getcwd())



In [3]:
asthma_dictionary_file = os.path.join(base_directory, 'scripts', 'meta_files','asthma_dict.csv')
asthma_dictionary = pd.read_csv(asthma_dictionary_file, dtype=str)
display(asthma_dictionary)


text2cui = Text2Cui(asthma_dictionary_file)
