### Select patients for CFR model: Split patients in train, val and test sets ###

In [1]:
import os
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
cfr_data_root = os.path.normpath('/mnt/obi0/andreas/data/cfr')
pet_data_dir = os.path.normpath('/mnt/obi0/phi/pet/pet_cfr')
meta_date = '200617'
meta_dir = os.path.join(cfr_data_root, 'metadata_'+meta_date)
print(meta_dir)

/mnt/obi0/andreas/data/cfr/metadata_200617


In [3]:
match_view_filename = 'pet_match365_diff_files_'+meta_date+'.parquet'
files_cfr = pd.read_parquet(os.path.join(meta_dir, match_view_filename))

print('Total number of patients      {}'.format(len(files_cfr.mrn.unique())))
print('Total number of echo studies  {}'.format(len(files_cfr.study.unique())))
print('Total number of PET studies   {}'.format(len(files_cfr.petmrn_identifier.unique())))
print('Total number of echos         {}'.format(len(files_cfr.filename.unique())))

files_cfr.head(2)

Total number of patients      3268
Total number of echo studies  6460
Total number of PET studies   3603
Total number of echos         307566


Unnamed: 0,mrn,study,pet_date,echo_date,petmrn_identifier,days_post_pet,pet_measurement,difference(days),filename,dir,datetime,file_base,identifier,frame_time,number_of_frames,heart_rate,deltaX,deltaY,a2c,a2c_laocc,a2c_lvocc_s,a3c,a3c_laocc,a3c_lvocc_s,a4c,a4c_far,a4c_laocc,a4c_lvocc_s,a4c_rv,a4c_rv_laocc,a5c,apex,other,plax_far,plax_lac,plax_laz,plax_laz_ao,plax_plax,psax_avz,psax_az,psax_mv,psax_pap,rvinf,subcostal,suprasternal,year_month,study_full_time,institution,model,manufacturer,max_view,sum_views
0,35156678,48b09010a2219aad_4903a582edf3bd118ffb3386065b,2018-10-15,2017-12-06,35156678_2018-10-15,-313,1.0,313.0,48b09010a2219aad_4903a582edf3bd118ffb3386065b_...,/mnt/obi0/phi/echo/npyFiles/BWH/48b0/48b09010a...,2017-12-06 13:11:41,48b09010a2219aad_4903a582edf3bd118ffb3386065b_...,48b09010a2219aad_4903a582edf3bd118ffb3386065b_...,47.769231,66.0,60.0,0.028951,0.028951,3.446593e-08,6.452001e-09,2.939033e-08,0.003095016,9.225302e-08,3.967397e-08,7.041133e-10,3.144316e-09,1.198068e-13,4.564848e-10,1.454091e-09,1.773068e-08,7.812481e-08,3.755003e-08,0.9968956,7.881907e-10,5.391607e-10,2.151894e-10,1.15716e-08,4.940982e-08,7.734493e-12,1.395856e-06,3.054052e-09,7.720808e-06,5.067335e-09,2.187016e-13,2.230961e-08,2017.0,20171206131141,BWH,Vivid E95,GE Vingmed Ultrasound,other,1.0
1,35156678,48b09010a2219aad_4903a582edf3bd118ffb3386065b,2018-10-15,2017-12-06,35156678_2018-10-15,-313,1.0,313.0,48b09010a2219aad_4903a582edf3bd118ffb3386065b_...,/mnt/obi0/phi/echo/npyFiles/BWH/48b0/48b09010a...,2017-12-06 13:11:41,48b09010a2219aad_4903a582edf3bd118ffb3386065b_...,48b09010a2219aad_4903a582edf3bd118ffb3386065b_...,20.087146,154.0,60.0,0.020448,0.020448,4.212547e-13,1.0,5.081462e-15,5.16278e-16,3.739126e-14,4.737708e-15,3.574842e-17,1.105575e-13,7.188538e-16,3.057809e-15,4.562448e-17,4.610617e-16,1.488447e-12,2.472197e-12,2.470155e-16,1.77298e-16,7.891783e-16,6.224149e-17,6.330659e-14,5.080497e-13,2.415044e-15,4.977857e-15,6.916603e-13,5.833864e-15,2.562479e-15,6.839081e-22,8.41705e-17,2017.0,20171206131141,BWH,Vivid E95,GE Vingmed Ultrasound,a2c_laocc,1.0


### Filter data sets: GLOBAL and NON-DEFECT variables ###

### Global variables ###

Notebook 3/17/2020: global_pet_cfr
File used: pet_cfr_petdata_02_26_2020_withperfandseg7.xlsx

As described above, 2871 after excluding CABG, transplant, and those with missing perfusion data
File used: post_2018_pets_with_clinical_cfr_all.csv

After excluding CABG, transplant, and missing CFR values, 167 PETs with any perfusion
Merge:

After combining, 3038 PETs
File saved as pets_with_echos_global_all.parquet
Use notes:

NOTE- there are petmrn_identifiers that have two rows- the one with post_2018==0 is the one to use, and the one with post_2018==1 should be excluded
Variables to use: rest_global_mbf, stress_global_mbf, global_cfr_calc
Other variables- myocardial_perfusion, segmental data, perfusion data, TID, gated SPECT results, calcium score, height, weight
For "cleaner" data, would exclude those with post-2018==1

#### Update 6/14/2020 ####

Created a revised version of the above that includes CABG cases (from pre-2018), and excludes post-2018 cases that are already in the pre-2018 dataset (n=7). Total 3718 rows. File at /mnt/obi0/phi/pet/pet_cfr/pets_with_echos_global_all_withcabg.parquet
Update 6/17/2020:

Created a revised version that has a column tracer_obi for the tracer used. Also, the 7 post-2018 duplicate cases are excluded so there are are 3031 studies.
File used- aiCohort_withPerfusion_addRadiopharm.xlsx
The ammonia cases before 7/25/2011 have the value 'listed as ammonia', and the rubidium cases after 7/25/2011 have the value 'listed as rubidium. For the remaining discrepant values (i.e. missing values, FDG, sestamibi), the tracer was assumed to be the tracer in use at the time.
Rubidium 1,740, ammonia 1,276, listed as ammonia 4, listed as rubidium 2
File saved as /mnt/obi0/phi/pet/pet_cfr/pets_with_echos_global_all_withtracer.parquet

In [4]:
global_pet_file = 'pets_with_echos_global_all_withtracer.parquet'
global_pet = pd.read_parquet(os.path.join(pet_data_dir, global_pet_file))
global_pet = global_pet.astype({'pet_date': 'datetime64[ns]'})
print(f'PET studies: {len(global_pet.petmrn_identifier.unique())}')
print(os.path.join(pet_data_dir, global_pet_file))
global_pet.head()

PET studies: 3031
/mnt/obi0/phi/pet/pet_cfr/pets_with_echos_global_all_withtracer.parquet


Unnamed: 0,mrn,pet_date,petmrn_identifier,post-2018,myocardial_perfusion,global_cfr_calc,rest_global_mbf,stress_global_mbf,rest_seg1_mbf,rest_seg2_mbf,rest_seg3_mbf,rest_seg4_mbf,rest_seg5_mbf,rest_seg6_mbf,rest_seg7_mbf,rest_seg8_mbf,rest_seg9_mbf,rest_seg10_mbf,rest_seg11_mbf,rest_seg12_mbf,rest_seg13_mbf,rest_seg14_mbf,rest_seg15_mbf,rest_seg16_mbf,rest_seg17_mbf,stress_seg1_mbf,stress_seg2_mbf,stress_seg3_mbf,stress_seg4_mbf,stress_seg5_mbf,stress_seg6_mbf,stress_seg7_mbf,stress_seg8_mbf,stress_seg9_mbf,stress_seg10_mbf,stress_seg11_mbf,stress_seg12_mbf,stress_seg13_mbf,stress_seg14_mbf,stress_seg15_mbf,stress_seg16_mbf,stress_seg17_mbf,summed_stress_score,summed_rest_score,summed_difference_score,TID_ratio,gated_spect_results,agatston_coronary_calcium_score,height_in,weight_lb,reportID,subjectID,radiopharmaceutical,radiopharmaceutical2,tracer_obi
0,7924277,2008-08-15,7924277_2008-08-15,0,abnormal,1.368012,0.644,0.881,0.448,0.38,0.435,0.83,0.402,0.334,0.73,0.89,0.537,0.388,0.607,0.823,0.964,0.766,0.445,1.34,1.504,0.691,0.707,0.529,0.348,0.558,0.562,1.325,1.446,0.636,0.502,1.136,1.186,1.415,0.979,0.67,1.174,1.676,18.0,18.0,0.0,1.09,normal,,67.0,133.0,69718.0,2540.0,Rubidium-82,,rubidium
1,12853099,2006-01-25,12853099_2006-01-25,0,normal,2.109661,1.532,3.232,1.082,0.94,1.135,1.224,1.332,1.945,1.482,1.977,1.475,1.656,1.509,1.444,1.801,2.18,1.677,1.359,1.977,4.291,3.144,2.069,2.657,2.626,3.712,3.981,3.359,2.215,4.01,3.89,3.883,3.249,2.609,3.483,3.443,3.037,0.0,0.0,0.0,,not assessed,0.0,64.0,160.0,69969.0,3379.0,Rubidium-82,,rubidium
2,20710471,2006-01-23,20710471_2006-01-23,0,abnormal,1.229572,1.028,1.264,0.872,0.892,1.126,1.884,1.061,1.057,1.286,1.657,0.735,0.995,1.153,1.252,1.38,0.966,0.86,0.943,1.101,1.213,0.826,1.864,0.91,1.32,1.936,1.283,1.617,1.33,1.266,1.366,1.726,1.533,1.477,1.484,1.037,1.537,20.0,18.0,2.0,0.97,normal,6602.0,71.0,200.0,69993.0,3542.0,Rubidium-82,,rubidium
3,12627030,2006-02-10,12627030_2006-02-10,0,abnormal,1.588915,1.732,2.752,1.984,1.499,1.556,1.333,1.134,1.292,2.499,1.883,1.703,1.794,1.873,1.933,1.767,2.056,1.662,1.804,1.754,2.94,0.664,0.64,2.197,1.413,3.395,4.533,1.301,2.149,2.391,3.3,4.714,3.184,2.567,4.154,4.662,3.595,16.0,0.0,16.0,0.79,normal,331.0,68.0,217.0,70008.0,2155.0,Rubidium-82,,rubidium
4,15324312,2006-02-23,15324312_2006-02-23,0,normal,2.656655,1.465,3.892,0.945,0.813,0.892,0.97,1.401,1.217,1.614,1.58,1.303,1.957,1.879,1.794,1.421,1.773,1.677,1.891,1.846,2.559,2.759,2.654,2.434,3.313,3.502,4.081,4.34,3.978,4.17,4.426,4.944,4.343,4.143,4.748,4.544,4.328,0.0,0.0,0.0,0.94,normal,0.0,64.0,146.0,70012.0,3417.0,Rubidium-82,,rubidium


In [5]:
global_pet[global_pet.petmrn_identifier=='1414556_2018-10-30']

Unnamed: 0,mrn,pet_date,petmrn_identifier,post-2018,myocardial_perfusion,global_cfr_calc,rest_global_mbf,stress_global_mbf,rest_seg1_mbf,rest_seg2_mbf,rest_seg3_mbf,rest_seg4_mbf,rest_seg5_mbf,rest_seg6_mbf,rest_seg7_mbf,rest_seg8_mbf,rest_seg9_mbf,rest_seg10_mbf,rest_seg11_mbf,rest_seg12_mbf,rest_seg13_mbf,rest_seg14_mbf,rest_seg15_mbf,rest_seg16_mbf,rest_seg17_mbf,stress_seg1_mbf,stress_seg2_mbf,stress_seg3_mbf,stress_seg4_mbf,stress_seg5_mbf,stress_seg6_mbf,stress_seg7_mbf,stress_seg8_mbf,stress_seg9_mbf,stress_seg10_mbf,stress_seg11_mbf,stress_seg12_mbf,stress_seg13_mbf,stress_seg14_mbf,stress_seg15_mbf,stress_seg16_mbf,stress_seg17_mbf,summed_stress_score,summed_rest_score,summed_difference_score,TID_ratio,gated_spect_results,agatston_coronary_calcium_score,height_in,weight_lb,reportID,subjectID,radiopharmaceutical,radiopharmaceutical2,tracer_obi
2870,1414556,2018-10-30,1414556_2018-10-30,0,normal,1.809645,0.788,1.426,0.839083,0.799958,0.810667,0.721833,0.783875,0.781667,0.803083,0.918875,0.917708,0.872042,0.849458,0.767792,0.693972,0.796028,0.812972,0.736222,0.642786,1.4265,1.442625,1.5275,1.390125,1.45075,1.4535,1.446542,1.720542,1.705208,1.582792,1.645125,1.386583,1.128667,1.501056,1.447167,1.197389,1.21775,0.0,0.0,0.0,,normal,145.0,63.0,189.0,166220.0,7521.0,N-13 Ammonia,,ammonia


In [6]:
# Exclude files without frame_time
file_cfr_meta = files_cfr.loc[~files_cfr.frame_time.isnull()]

# Add echo data to the pet studies (inner join, to keep only keys in both dataframes)
global_pet_echo = global_pet.merge(file_cfr_meta, on = ['mrn', 'pet_date', 'petmrn_identifier'], how='inner')
print(f'PET studies:  {len(global_pet_echo.petmrn_identifier.unique())}')
print(f'ECHO studies: {len(global_pet_echo.study.unique())}')

PET studies:  2734
ECHO studies: 4525


In [7]:
print(f'All data:    patients:     {len(file_cfr_meta.mrn.unique())}')
print(f'All data:    PET studies:  {len(file_cfr_meta.petmrn_identifier.unique())}')
print(f'All data:    ECHO studies: {len(file_cfr_meta.study.unique())}')
print(f'All data:    videos:       {len(file_cfr_meta.filename.unique())}')
print()
print(f'Global data: patients:     {len(global_pet_echo.mrn.unique())}')
print(f'Global data: PET studies:  {len(global_pet_echo.petmrn_identifier.unique())}')
print(f'Global data: ECHO studies: {len(global_pet_echo.study.unique())}')
print(f'Global data: videos:       {len(global_pet_echo.filename.unique())}')

excluded_pet_studies = len(file_cfr_meta.petmrn_identifier.unique()) - len(global_pet_echo.petmrn_identifier.unique())
print(f'Excluded PET studies: {excluded_pet_studies}')

All data:    patients:     3257
All data:    PET studies:  3580
All data:    ECHO studies: 6288
All data:    videos:       307292

Global data: patients:     2591
Global data: PET studies:  2734
Global data: ECHO studies: 4525
Global data: videos:       217192
Excluded PET studies: 846


In [8]:
# Find out which echo studies do not have a4c views
# Get all studies WITH a4c views
a4c_study_list = list(global_pet_echo[global_pet_echo.max_view=='a4c'].study.unique())
print(len(a4c_study_list))

# Remove all studies from global list that have a4c views
global_pet_echo_no_a4c = global_pet_echo[~global_pet_echo.study.isin(a4c_study_list)]
print(f'Number of ECHO studies without a4c view: {len(global_pet_echo_no_a4c.study.unique())}')
print(f'For this number of patients:             {len(global_pet_echo_no_a4c.mrn.unique())}')
print(f'With this number of videos:              {len(global_pet_echo_no_a4c.filename.unique())}')

3926
Number of ECHO studies without a4c view: 599
For this number of patients:             507
With this number of videos:              18799


In [9]:
# Exclusions and filters
# A4C VIEW
global_pet_echo_a4c = global_pet_echo.loc[global_pet_echo.max_view=='a4c']
global_pet_echo_a4c = global_pet_echo_a4c.loc[~global_pet_echo_a4c.frame_time.isnull()]
global_a4c_pre18 = global_pet_echo_a4c.loc[global_pet_echo_a4c['post-2018']==0]

print(f'Global data: patients:     {len(global_pet_echo.mrn.unique())}')
print(f'Global data: PET studies:  {len(global_pet_echo.petmrn_identifier.unique())}')
print(f'Global data: ECHO studies: {len(global_pet_echo.study.unique())}')
print(f'Global data: videos:       {len(global_pet_echo.filename.unique())}')
print()
print('After a4c and post-2018 filters:')
print(f'Global data: patients:     {len(global_a4c_pre18.mrn.unique())}')
print(f'Global data: PET studies:  {len(global_a4c_pre18.petmrn_identifier.unique())}')
print(f'Global data: ECHO studies: {len(global_a4c_pre18.study.unique())}')
print(f'Global data: videos:       {len(global_a4c_pre18.filename.unique())}')
print()
print(f'Lost {len(global_pet_echo_a4c.filename.unique())-len(global_a4c_pre18.filename.unique())} due to post-2018 filtering.')

Global data: patients:     2591
Global data: PET studies:  2734
Global data: ECHO studies: 4525
Global data: videos:       217192

After a4c and post-2018 filters:
Global data: patients:     2287
Global data: PET studies:  2408
Global data: ECHO studies: 3681
Global data: videos:       11077

Lost 708 due to post-2018 filtering.


In [10]:
# Global_pet_echo table with the variables to use (drop rows with na in any of those variables)
global_pet_echo.head(2)
global_pet_variables_target = ['rest_global_mbf', 
                               'stress_global_mbf', 
                               'global_cfr_calc', 
                               'post-2018', 
                               'tracer_obi']

global_pet_variables = global_pet_variables_target.copy()
global_pet_variables.extend(list(files_cfr.columns))

global_pet_echo_variables = global_a4c_pre18[global_pet_variables].dropna(subset=global_pet_variables_target,
                                                                          axis=0)

print(f'Global data: patients:     {len(global_pet_echo_variables.mrn.unique())}')
print(f'Global data: PET studies:  {len(global_pet_echo_variables.petmrn_identifier.unique())}')
print(f'Global data: ECHO studies: {len(global_pet_echo_variables.study.unique())}')
print(f'Global data: videos:       {len(global_pet_echo_variables.filename.unique())}')
print(f'Tracer values:             {len(global_pet_echo_variables.tracer_obi.unique())}')

Global data: patients:     2287
Global data: PET studies:  2408
Global data: ECHO studies: 3681
Global data: videos:       11077
Tracer values:             4


In [11]:
# Complete list of unique petmrn_identifier
petmrn_identifier_list = list(global_pet.petmrn_identifier.unique())
petmrn_identifier_set = list(set(petmrn_identifier_list))

In [12]:
# Let's filter Rahuls list of missing echos
missing_echo = pd.read_parquet(os.path.join(meta_dir, 'mrn_pet_missing_echo_file.parquet'))
print(f'petmrn_identifier in original list: {len(missing_echo.petmrn_identifier.unique())}')
missing_echo.tail()
missing_echo_filtered = missing_echo[missing_echo.petmrn_identifier.isin(petmrn_identifier_set)].\
                        drop(columns=['pet_measurement'], axis=1).reset_index(drop=True)
print(f'petmrn_identifier in filtered list:   {len(missing_echo_filtered.petmrn_identifier.unique())}')
missing_no_echo_date = missing_echo_filtered.loc[missing_echo_filtered.echo_date.isnull()]
print(f'petmrn_identifier without echo dates: {len(missing_no_echo_date.petmrn_identifier.unique())}')

missing_echo_filtered_file = 'mrn_pet_missing_echo_file_filtered.parquet'
missing_echo_filtered.to_parquet(os.path.join(meta_dir, missing_echo_filtered_file))

petmrn_identifier in original list: 229
petmrn_identifier in filtered list:   72
petmrn_identifier without echo dates: 2


### Split the patients in train, validate and test sets ###
Although each view might have a little different patient pupulation distribution, because not all views are in each study. However, we want the same MRNS in each data set and for all views so that we can directly compare the performance of the algorithm for the same patients. We can expand the data frame above to add the splits.

In [13]:
def patientsplit(patient_list):

    train_test_split = 0.85
    train_eval_split = 0.90

    # Take a test set from all patients
    patient_list_train = np.random.choice(patient_list,
                                          size = int(np.floor(train_test_split*len(patient_list))),
                                          replace = False)
    patient_list_test = list(set(patient_list).difference(patient_list_train))
    train_test_intersection = set(patient_list_train).intersection(set(patient_list_test)) # This should be empty
    print('Intersection of patient_list_train and patient_list_test:', train_test_intersection)

    # Further separate some patients for evaluation from the training list
    patient_list_eval = np.random.choice(patient_list_train,
                                         size = int(np.ceil((1-train_eval_split)*len(patient_list_train))),
                                         replace = False)

    patient_list_train = set(patient_list_train).difference(patient_list_eval)
    train_eval_intersection = set(patient_list_train).intersection(set(patient_list_eval))
    print('Intersection of patient_list_train and patient_list_eval:', train_eval_intersection)

    # Show the numbers
    print('total patients:', len(patient_list))
    print()
    print('patients in set:', np.sum([len(patient_list_train),
                                     len(patient_list_eval),
                                     len(patient_list_test)]))
    print()
    print('patients in train:', len(patient_list_train))
    print('patients in eval:', len(patient_list_eval))
    print('patients in test:', len(patient_list_test))

    return patient_list_train, patient_list_eval, patient_list_test

In [14]:
dataset = global_pet_echo_variables
dataset_filename = 'global_pet_echo_dataset_'+meta_date+'.parquet'
global_pet_echo_variables.head()

# Get a patient list
patient_list = list(dataset.sample(frac=1).mrn.unique())
patient_list_train, patient_list_eval, patient_list_test = patientsplit(patient_list)

patient_split = {'train': patient_list_train,
                 'eval': patient_list_eval,
                 'test': patient_list_test}

print('Patient IDs in train:', len(patient_split['train']))
print('Patient IDs in eval:', len(patient_split['eval']))
print('Patient IDs in test:', len(patient_split['test']))

print()

print('Intersection train-test:', set(patient_split['train']).intersection(set(patient_split['test'])))
print('Intersection train-eval:', set(patient_split['train']).intersection(set(patient_split['eval'])))
print('Intersection eval-test:', set(patient_split['eval']).intersection(set(patient_split['test'])))

Intersection of patient_list_train and patient_list_test: set()
Intersection of patient_list_train and patient_list_eval: set()
total patients: 2287

patients in set: 2287

patients in train: 1748
patients in eval: 195
patients in test: 344
Patient IDs in train: 1748
Patient IDs in eval: 195
Patient IDs in test: 344

Intersection train-test: set()
Intersection train-eval: set()
Intersection eval-test: set()


In [15]:
# Add mode column to files_cfr
split_list = []
for dset in patient_split.keys():
    dset_mrn_list = list(patient_split[dset])
    split_list.append(pd.DataFrame({'mrn': dset_mrn_list,
                                    'dset_mode': [dset]*len(dset_mrn_list)}))

split_df = pd.concat(split_list, ignore_index = True)

dataset_split = dataset.merge(right = split_df, on = 'mrn', how = 'left').\
                sample(frac = 1).\
                reset_index(drop = True)

In [16]:
print(f'TOTAL patients:     {len(dataset_split.mrn.unique())}')
print(f'TOTAL PET studies:  {len(dataset_split.petmrn_identifier.unique())}')
print(f'TOTAL ECHO studies: {len(dataset_split.study.unique())}')
print(f'TOTAL videos:       {len(dataset_split.filename.unique())}')

dset_list = ['train', 'eval', 'test']
for dset in dset_list:
    df_dset = dataset_split[dataset_split.dset_mode==dset]
    print()
    print(f'patients in {dset}:     {len(df_dset.mrn.unique())}')
    print(f'PET studies in {dset}:  {len(df_dset.petmrn_identifier.unique())}')
    print(f'ECHO studies in {dset}: {len(df_dset.study.unique())}')
    print(f'videos in {dset}:       {len(df_dset.filename.unique())}')

TOTAL patients:     2287
TOTAL PET studies:  2408
TOTAL ECHO studies: 3681
TOTAL videos:       11077

patients in train:     1748
PET studies in train:  1844
ECHO studies in train: 2811
videos in train:       8420

patients in eval:     195
PET studies in eval:  204
ECHO studies in eval: 300
videos in eval:       944

patients in test:     344
PET studies in test:  360
ECHO studies in test: 570
videos in test:       1713


In [17]:
# Let's look at the video tracer numbers
tracer_list = dataset_split.tracer_obi.unique()
print(tracer_list)
for tracer in tracer_list:
    df_tracer = dataset_split[dataset_split.tracer_obi==tracer]
    for dset in dset_list:
        df_dset = df_tracer[df_tracer.dset_mode==dset]
        print()
        print(f'{dset}: patients for {tracer}:     {len(df_dset.mrn.unique())}')
        print(f'{dset}: PET studies for {tracer}:  {len(df_dset.petmrn_identifier.unique())}')
        print(f'{dset}: ECHO studies for {tracer}: {len(df_dset.study.unique())}')
        print(f'{dset}: videos for {tracer}:       {len(df_dset.filename.unique())}')

['ammonia' 'rubidium' 'listed as rubidium' 'listed as ammonia']

train: patients for ammonia:     719
train: PET studies for ammonia:  745
train: ECHO studies for ammonia: 1245
train: videos for ammonia:       4106

eval: patients for ammonia:     75
eval: PET studies for ammonia:  77
eval: ECHO studies for ammonia: 127
eval: videos for ammonia:       439

test: patients for ammonia:     150
test: PET studies for ammonia:  155
test: ECHO studies for ammonia: 261
test: videos for ammonia:       867

train: patients for rubidium:     1061
train: PET studies for rubidium:  1094
train: ECHO studies for rubidium: 1559
train: videos for rubidium:       4296

eval: patients for rubidium:     121
eval: PET studies for rubidium:  127
eval: ECHO studies for rubidium: 173
eval: videos for rubidium:       505

test: patients for rubidium:     199
test: PET studies for rubidium:  204
test: ECHO studies for rubidium: 308
test: videos for rubidium:       844

train: patients for listed as rubidium:  

In [18]:
dataset_split.columns

Index(['rest_global_mbf', 'stress_global_mbf', 'global_cfr_calc', 'post-2018', 'tracer_obi', 'mrn', 'study', 'pet_date', 'echo_date', 'petmrn_identifier', 'days_post_pet', 'pet_measurement', 'difference(days)', 'filename', 'dir', 'datetime', 'file_base', 'identifier', 'frame_time', 'number_of_frames', 'heart_rate', 'deltaX', 'deltaY', 'a2c', 'a2c_laocc', 'a2c_lvocc_s', 'a3c', 'a3c_laocc', 'a3c_lvocc_s', 'a4c', 'a4c_far', 'a4c_laocc', 'a4c_lvocc_s', 'a4c_rv', 'a4c_rv_laocc', 'a5c', 'apex', 'other', 'plax_far', 'plax_lac', 'plax_laz', 'plax_laz_ao', 'plax_plax', 'psax_avz', 'psax_az', 'psax_mv', 'psax_pap', 'rvinf', 'subcostal', 'suprasternal', 'year_month', 'study_full_time', 'institution', 'model', 'manufacturer', 'max_view', 'sum_views', 'dset_mode'], dtype='object')

In [19]:
# Prepare the final set that we will use for writing TFR files. We don't want any rows with NAs in some columns.
col_set = ['rest_global_mbf', 'rest_global_mbf', 'global_cfr_calc', 'tracer_obi', 
           'pet_measurement', 'sum_views', 'dset_mode']
dataset_split_tfr = dataset_split.dropna(subset = col_set, axis=0)

print('Dropped {} rows.'.format(dataset_split.shape[0]-dataset_split_tfr.shape[0]))

# Add some other information that we need and shuffle the whole thing
dataset_split_tfr = dataset_split_tfr.assign(rate = np.round(1/dataset_split_tfr.frame_time*1e3, decimals = 1))
dataset_split_tfr = dataset_split_tfr.assign(dur = dataset_split_tfr.frame_time*1e-3*dataset_split_tfr.number_of_frames)

dataset_split_tfr = dataset_split_tfr.sample(frac = 1)

dataset_split.loc[~dataset_split.index.isin(dataset_split_tfr.index)]

Dropped 0 rows.


Unnamed: 0,rest_global_mbf,stress_global_mbf,global_cfr_calc,post-2018,tracer_obi,mrn,study,pet_date,echo_date,petmrn_identifier,days_post_pet,pet_measurement,difference(days),filename,dir,datetime,file_base,identifier,frame_time,number_of_frames,heart_rate,deltaX,deltaY,a2c,a2c_laocc,a2c_lvocc_s,a3c,a3c_laocc,a3c_lvocc_s,a4c,a4c_far,a4c_laocc,a4c_lvocc_s,a4c_rv,a4c_rv_laocc,a5c,apex,other,plax_far,plax_lac,plax_laz,plax_laz_ao,plax_plax,psax_avz,psax_az,psax_mv,psax_pap,rvinf,subcostal,suprasternal,year_month,study_full_time,institution,model,manufacturer,max_view,sum_views,dset_mode


In [20]:
print(list(dataset_split_tfr.tracer_obi.unique()))

['rubidium', 'ammonia', 'listed as ammonia', 'listed as rubidium']


In [21]:
dataset_split_tfr.to_parquet(os.path.join(meta_dir, dataset_filename))
print('Saved to file:', dataset_filename)
print(dataset_split_tfr.shape)
dataset_split_tfr.head()

Saved to file: global_pet_echo_dataset_200617.parquet
(11077, 60)


Unnamed: 0,rest_global_mbf,stress_global_mbf,global_cfr_calc,post-2018,tracer_obi,mrn,study,pet_date,echo_date,petmrn_identifier,days_post_pet,pet_measurement,difference(days),filename,dir,datetime,file_base,identifier,frame_time,number_of_frames,heart_rate,deltaX,deltaY,a2c,a2c_laocc,a2c_lvocc_s,a3c,a3c_laocc,a3c_lvocc_s,a4c,a4c_far,a4c_laocc,a4c_lvocc_s,a4c_rv,a4c_rv_laocc,a5c,apex,other,plax_far,plax_lac,plax_laz,plax_laz_ao,plax_plax,psax_avz,psax_az,psax_mv,psax_pap,rvinf,subcostal,suprasternal,year_month,study_full_time,institution,model,manufacturer,max_view,sum_views,dset_mode,rate,dur
2236,0.74,2.613,3.531081,0,rubidium,22302947,490176ecfed54843_4903a444c23a29accc9891d29685,2008-04-17,2007-10-31,22302947_2008-04-17,-169,1.0,169.0,490176ecfed54843_4903a444c23a29accc9891d29685_...,/mnt/obi0/phi/echo/npyFiles/BWH/4901/490176ecf...,2007-10-31 13:12:32,490176ecfed54843_4903a444c23a29accc9891d29685_...,490176ecfed54843_4903a444c23a29accc9891d29685_...,40.322581,63.0,75.0,0.048593,0.048593,1.526044e-07,6.288687e-09,9.698587e-08,1.122459e-07,2.441809e-07,1.198979e-07,0.71647,1.192437e-07,9.70748e-10,2.404599e-10,0.003660227,7.116121e-11,0.2797258,6.446593e-07,3.062847e-08,1.440944e-09,8.717898e-05,2.040459e-08,5.293421e-10,2.148711e-11,7.354546e-06,1.033843e-08,5.604565e-08,1.32573e-05,1.419972e-05,3.58138e-10,2.09198e-05,2007.0,20071031131232,BWH,Vivid7,GE Vingmed Ultrasound,a4c,1.0,eval,24.8,2.540323
274,0.955,2.308,2.416754,0,rubidium,19317163,4a1fe5efc43dec66_4903a44b32e8ddd345993135cebc,2008-08-01,2008-04-14,19317163_2008-08-01,-109,1.0,109.0,4a1fe5efc43dec66_4903a44b32e8ddd345993135cebc_...,/mnt/obi0/phi/echo/npyFiles/BWH/4a1f/4a1fe5efc...,2008-04-14 13:46:36,4a1fe5efc43dec66_4903a44b32e8ddd345993135cebc_...,4a1fe5efc43dec66_4903a44b32e8ddd345993135cebc_...,33.333,91.0,51.0,0.065784,0.065784,4.855964e-11,1.207787e-10,1.780659e-09,9.496988e-11,8.827991e-10,3.3681e-09,0.979295,0.02044863,2.538073e-09,8.161536e-10,2.676549e-06,2.534934e-08,8.560859e-13,1.130816e-09,2.296065e-09,1.084421e-09,1.520502e-11,3.056828e-08,1.652438e-09,3.760336e-08,7.733466e-07,4.376413e-06,3.777619e-11,0.0002479796,4.573167e-12,3.731746e-13,1.490915e-09,2008.0,20080414134636,BWH,iE33,Philips Medical Systems,a4c,1.0,train,30.0,3.033303
8830,0.609,1.35,2.216749,0,ammonia,27151703,4904ba6c53618c7f_4903a58793bfceb2ce3868c8151b,2013-03-21,2012-08-06,27151703_2013-03-21,-227,1.0,227.0,4904ba6c53618c7f_4903a58793bfceb2ce3868c8151b_...,/mnt/obi0/phi/echo/npyFiles/BWH/4904/4904ba6c5...,2012-08-06 08:37:10,4904ba6c53618c7f_4903a58793bfceb2ce3868c8151b_...,4904ba6c53618c7f_4903a58793bfceb2ce3868c8151b_...,40.119048,85.0,55.0,0.053153,0.053153,1.040321e-13,7.938899e-13,6.542746e-15,4.358271e-17,9.511394e-16,3.07085e-14,1.0,2.724828e-11,5.122407e-16,2.229038e-13,1.316427e-12,5.678736e-15,2.227005e-14,2.773552e-14,1.712471e-11,2.891193e-13,5.739325e-14,1.812508e-15,2.653622e-15,1.278631e-12,1.612062e-12,7.903387e-15,2.619207e-15,1.164287e-12,7.852771e-13,8.872656e-18,1.058232e-13,2012.0,20120806083710,BWH,Vivid E9,GE Vingmed Ultrasound,a4c,1.0,eval,24.9,3.410119
10233,0.902,1.653,1.832594,0,ammonia,4183273,4b7f0d5bca3a6ac1_4903a58dd547fecf9f5f4492bb1e,2017-11-07,2018-04-27,4183273_2017-11-07,171,1.0,171.0,4b7f0d5bca3a6ac1_4903a58dd547fecf9f5f4492bb1e_...,/mnt/obi0/phi/echo/npyFiles/BWH/4b7f/4b7f0d5bc...,2018-04-27 11:14:51,4b7f0d5bca3a6ac1_4903a58dd547fecf9f5f4492bb1e_...,4b7f0d5bca3a6ac1_4903a58dd547fecf9f5f4492bb1e_...,39.317,61.0,75.0,0.038882,0.038882,2.332095e-10,3.806092e-09,7.480858e-10,2.43249e-06,5.350698e-09,1.141113e-09,0.999665,3.392791e-09,5.445255e-08,2.732179e-10,0.0003302598,1.617755e-08,5.155525e-07,1.301765e-06,1.411343e-07,3.785148e-09,7.247441e-08,4.78543e-10,7.27334e-11,2.186505e-09,4.516802e-10,7.925628e-11,2.74294e-07,1.584705e-08,3.680352e-10,7.610151e-14,7.395368e-10,2018.0,20180427111451,BWH,Affiniti 70C,Philips Medical Systems,a4c,1.0,test,25.4,2.398337
7480,0.632,1.245,1.969937,0,rubidium,23855372,49004692a66dceb9_4903a44ab12e6bfea1c3eeda8650,2009-05-06,2009-05-06,23855372_2009-05-06,0,1.0,0.0,49004692a66dceb9_4903a44ab12e6bfea1c3eeda8650_...,/mnt/obi0/phi/echo/npyFiles/BWH/4900/49004692a...,2009-05-06 07:55:21,49004692a66dceb9_4903a44ab12e6bfea1c3eeda8650_...,49004692a66dceb9_4903a44ab12e6bfea1c3eeda8650_...,33.333,31.0,65.0,0.039485,0.039485,0.000844653,3.157553e-06,2.194643e-08,1.116456e-06,1.410417e-10,6.501203e-08,0.998814,5.295115e-05,4.470071e-09,3.981303e-09,1.266734e-05,3.402177e-08,5.156852e-08,9.454626e-09,3.872496e-09,7.032413e-08,2.078618e-09,5.728752e-08,6.096406e-09,4.154108e-08,4.196633e-07,9.960001e-06,1.258057e-10,0.0002595027,1.58763e-06,4.449165e-12,6.150728e-09,2009.0,20090506075521,BWH,iE33,Philips Medical Systems,a4c,1.0,train,30.0,1.033323


In [22]:
minframes = 40
max_frame_time = 33.34
print('Minimum frames:   {}'.format(minframes))
print('Maximum frame_time: {}'.format(max_frame_time))
minduration = max_frame_time*minframes*1e-3
print('Minimum duration: {}'.format(minduration))

#minframes = int(np.ceil(minrate*minduration))

maxrows = dataset_split_tfr.shape[0]
dataset_disqualified = dataset_split_tfr[(dataset_split_tfr.frame_time > max_frame_time) | (dataset_split_tfr.dur < minduration)]
dataset_qualified = dataset_split_tfr[(dataset_split_tfr.frame_time <= max_frame_time) & (dataset_split_tfr.dur >= minduration)]

n_videos = len(dataset_split_tfr.filename.unique())
n_disqualified = len(dataset_disqualified.filename.unique())
n_qualified = len(dataset_qualified.filename.unique())

print()
print('Total videos: {}'.format(n_videos))
print('Disqualified videos {}, fraction:{:.1f}'.format(n_disqualified, 
                                                     np.round(n_disqualified/n_videos*100), decimals=1))
print('Qualified videos {}, fraction:{:.1f}'.format(n_qualified, 
                                                     np.round(n_qualified/n_videos*100), decimals=1))

Minimum frames:   40
Maximum frame_time: 33.34
Minimum duration: 1.3336000000000001

Total videos: 11077
Disqualified videos 2936, fraction:27.0
Qualified videos 8141, fraction:73.0


In [23]:
# Report train val test numbers that qualify....
tracer_list = dataset_split.tracer_obi.unique()
print(tracer_list)
for dset in dset_list:
    df_dset = dataset_qualified[dataset_qualified.dset_mode==dset]
    print()
    print(f'{dset}: patients for TOTAL:     {len(df_dset.mrn.unique())}')
    print(f'{dset}: PET studies for TOTAL:  {len(df_dset.petmrn_identifier.unique())}')
    print(f'{dset}: ECHO studies for TOTAL: {len(df_dset.study.unique())}')
    print(f'{dset}: videos for TOTAL:       {len(df_dset.filename.unique())}')
    
    for tracer in tracer_list:
        df_tracer = df_dset[df_dset.tracer_obi==tracer]
        print()
        print(f'{dset}: patients for {tracer}:     {len(df_dset.mrn.unique())}')
        print(f'{dset}: PET studies for {tracer}:  {len(df_dset.petmrn_identifier.unique())}')
        print(f'{dset}: ECHO studies for {tracer}: {len(df_dset.study.unique())}')
        print(f'{dset}: videos for {tracer}:       {len(df_dset.filename.unique())}')

['ammonia' 'rubidium' 'listed as rubidium' 'listed as ammonia']

train: patients for TOTAL:     1391
train: PET studies for TOTAL:  1455
train: ECHO studies for TOTAL: 2050
train: videos for TOTAL:       6225

train: patients for ammonia:     1391
train: PET studies for ammonia:  1455
train: ECHO studies for ammonia: 2050
train: videos for ammonia:       6225

train: patients for rubidium:     1391
train: PET studies for rubidium:  1455
train: ECHO studies for rubidium: 2050
train: videos for rubidium:       6225

train: patients for listed as rubidium:     1391
train: PET studies for listed as rubidium:  1455
train: ECHO studies for listed as rubidium: 2050
train: videos for listed as rubidium:       6225

train: patients for listed as ammonia:     1391
train: PET studies for listed as ammonia:  1455
train: ECHO studies for listed as ammonia: 2050
train: videos for listed as ammonia:       6225

eval: patients for TOTAL:     157
eval: PET studies for TOTAL:  163
eval: ECHO studies for