# Retrieve sample metadata and labs from HISE

The Python version of the HISE SDK (hisepy) can retrieve sample metadata whether or not a sample has been through pipeline processing. We'll use this SDK to pull the data, then use R for cleanup.

## Load packages

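datetime: Used to add today's date to our output file names  
hisepy: the HISE SDK  
os: Operating system utilities (used to make an output folder)  
pandas: DataFrames for Python  
periodictable: Element names (imported inside `element_id()` to build a random ID for the upload destination)  
re: Regular expressions (used to extract the year from draw dates)  
session_info: Displays the versions of Python and all of the packages we used  
warnings: Used to suppress some warnings that don't impact data retrieval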

In [1]:
from datetime import date
import hisepy
import os
import pandas
import re
import session_info

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Create the output folder if it doesn't already exist
os.makedirs('output', exist_ok = True)

In [3]:
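def element_id(n = 3):
    # Build a random, human-readable ID by joining n element names with '-'
    import periodictable
    from random import randrange
    rand_el = []
    for i in range(n):
        # Start at 1: index 0 in periodictable.elements is the neutron, not an element
        el = randrange(1,118)
        rand_el.append(periodictable.elements[el].name)
    rand_str = '-'.join(rand_el)
    return rand_str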

## Retrieve sample metadata from HISE

In [4]:
sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'
res = hisepy.reader.read_files([sample_meta_file_uuid])
sample_meta = res['values']

In [5]:
sample_meta.shape

(108, 32)

## Retrieve CMV status from HISE

In [6]:
cmv_uuid = 'be338216-0c90-47df-923c-7f4d7c35bac4'
res = hisepy.reader.read_files([cmv_uuid])
cmv = res['values']

In [7]:
cmv = cmv[['subject.subjectGuid', 'subject.cmv']].drop_duplicates()

Add CMV status to the sample metadata:

In [8]:
sample_meta = sample_meta.merge(cmv, on = 'subject.subjectGuid', how = 'left')
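
A left merge like this keeps one row per sample only if each subject appears once in the CMV table; a duplicated subject would silently multiply sample rows. As a minimal sketch on invented toy data (the subject and kit IDs below are illustrative, not from HISE), the `validate` argument of `DataFrame.merge` can turn that situation into an error:

```python
import pandas as pd

# Toy stand-ins for sample_meta and cmv; all values are invented for illustration.
samples = pd.DataFrame({
    'subject.subjectGuid': ['S1', 'S1', 'S2'],
    'sample.sampleKitGuid': ['K1', 'K2', 'K3'],
})
cmv_status = pd.DataFrame({
    'subject.subjectGuid': ['S1', 'S2'],
    'subject.cmv': ['Negative', 'Positive'],
})

# validate='many_to_one' raises pandas.errors.MergeError if any subject
# appears more than once in cmv_status.
merged = samples.merge(
    cmv_status, on='subject.subjectGuid', how='left', validate='many_to_one'
)
```

Comparing the row count before and after the merge is a cheaper check that catches the same problem.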

## Compute age at draw

In [9]:
sample_meta['sample.drawDate'].head()

0    2019-10-01T00:00:00Z
1    2019-10-01T00:00:00Z
2    2019-10-01T00:00:00Z
3    2019-10-01T00:00:00Z
4    2019-10-01T00:00:00Z
Name: sample.drawDate, dtype: object

In [10]:
def drawDate_to_year(drawDate):
    # Remove everything from the first '-' onward, e.g. '2019-10-01T00:00:00Z' -> 2019
    year = re.sub('-.+', '', drawDate)
    year = int(year)
    return year

In [11]:
sample_meta['sample.drawYear'] = [drawDate_to_year(d) for d in sample_meta['sample.drawDate']]

In [12]:
sample_meta['sample.subjectAgeYears'] = sample_meta['sample.drawYear'] - sample_meta['subject.birthYear']
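
Because only the birth year is available, this age is approximate and can differ from the true age at draw by up to one year. As an alternative to the regular-expression approach, the sketch below parses the timestamp with the standard library, which fails loudly on a malformed date. The function names are hypothetical, and the format string assumes every draw date looks like the values shown above.

```python
from datetime import datetime

def draw_date_to_year(draw_date: str) -> int:
    """Parse a timestamp such as '2019-10-01T00:00:00Z' and return its year.

    Raises ValueError if the string does not match the assumed format.
    """
    return datetime.strptime(draw_date, '%Y-%m-%dT%H:%M:%SZ').year

def approx_age_years(draw_date: str, birth_year: int) -> int:
    """Approximate age at draw as draw year minus birth year."""
    return draw_date_to_year(draw_date) - birth_year

# Example values only; the birth year here is invented.
example_age = approx_age_years('2019-10-01T00:00:00Z', 1987)
```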

In [13]:
sample_meta.head()

Unnamed: 0,lastUpdated,sample.id,sample.bridgingControl,sample.sampleKitGuid,sample.visitName,sample.visitDetails,sample.drawDate,sample.daysSinceFirstVisit,file.id,file.name,...,file.userTags.group,file.userTags.name,file.userTags.origin,file.userTags.other,file.userTags.version,pbmc_sample_id,filename,subject.cmv,sample.drawYear,sample.subjectAgeYears
0,2024-02-16T22:17:47.127Z,f499ff83-e513-4d24-a10f-151348269fff,True,KT00001,Flu Year 1 Day 0,N/A - Flu-Series Timepoint Only,2019-10-01T00:00:00Z,0,fec489f9-9a74-4635-aa91-d2bf09d1faec,automated/merged/B001/labeled/B001-P1_PB00001-...,...,,,,,,PB00001-01,ref_h5_meta_data_2024-02-18.csv,Negative,2019,32
1,2024-02-16T22:17:47.127Z,750e90a9-a296-4f0f-969f-60225c2bca17,True,KT00002,Flu Year 1 Day 0,N/A - Flu-Series Timepoint Only,2019-10-01T00:00:00Z,0,7c0c7979-eebd-4aba-b5b2-6e76b4643623,automated/merged/B001/labeled/B001-P1_PB00002-...,...,,,,,,PB00002-01,ref_h5_meta_data_2024-02-18.csv,Negative,2019,28
2,2024-02-16T22:17:47.127Z,2db6fb3f-e3f4-454b-891b-9b068541b51d,True,KT00003,Flu Year 1 Day 0,N/A - Flu-Series Timepoint Only,2019-10-01T00:00:00Z,0,40efd03a-cb2f-4677-af42-a056cbfe5a17,automated/merged/B001/labeled/B001-P1_PB00003-...,...,,,,,,PB00003-01,ref_h5_meta_data_2024-02-18.csv,Negative,2019,30
3,2024-02-16T22:17:47.127Z,f04693c5-563c-4b5b-ae58-877d0d9ae2fe,True,KT00004,Flu Year 1 Day 0,N/A - Flu-Series Timepoint Only,2019-10-01T00:00:00Z,0,68fbcd34-1d63-461d-8195-df5b8dc61b31,automated/merged/2023-11-17T21:36:51.794326181...,...,,,,,,PB00004-01,ref_h5_meta_data_2024-02-18.csv,Negative,2019,30
4,2024-02-16T22:17:47.127Z,eb5b3a3d-002e-40a6-aa19-aa0e6a7fff8f,True,KT00006,Flu Year 1 Day 0,N/A - Flu-Series Timepoint Only,2019-10-01T00:00:00Z,0,ea8d98e9-e99e-4dc6-9e78-9866e0deac68,automated/merged/2023-11-17T21:36:51.794326181...,...,,,,,,PB00006-01,ref_h5_meta_data_2024-02-18.csv,Negative,2019,27


## Select columns to retain for visualization

In [14]:
keep_columns = [
    'cohort.cohortGuid', 
    'subject.subjectGuid', 'subject.biologicalSex', 'subject.birthYear',
    'subject.race', 'subject.ethnicity', 'subject.cmv', 'sample.visitName', 
    'sample.subjectAgeYears', 'sample.sampleKitGuid'
]

In [15]:
sample_meta = sample_meta[keep_columns]

## Retrieve lab groupings and headers from HISE

In [16]:
grouping_uuid = '0180d34e-bcb4-4b81-9214-818641ec54e3'
res = hisepy.reader.read_files([grouping_uuid])
lab_groups = res['values'].drop('filename', axis = 1)

In [17]:
lab_groups.head()

Unnamed: 0,category_name,lab,column_name
0,Anthropometric measures,Body Mass Index (BMI),am.bmi
1,Anthropometric measures,Height,am.height
2,Anthropometric measures,Weight,am.weight
3,Blood Chemistry,Alanine Transaminase (ALT),chem.alt
4,Blood Chemistry,Albumin,chem.albumin


## Query HISE to get data

First, we make a dictionary (with curly braces) that defines what we want to get. In this case, we want the samples from our study that were previously specified in our sample metadata.

Each value in the dictionary has to be a list (square brackets), even if it holds a single item.

In [18]:
query_dict = {
    'sampleKitGuid': sample_meta['sample.sampleKitGuid'].tolist()
}
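
To illustrate the list requirement, here is a small sketch of a helper that always returns a list, wrapping a single string and removing repeated IDs while preserving order. The helper name `as_query_list` is hypothetical and not part of hisepy; whether HISE handles repeated IDs differently is not assumed here.

```python
def as_query_list(values):
    """Return values as a list with duplicates removed, preserving order.

    A single string is wrapped in a one-element list.
    """
    if isinstance(values, str):
        return [values]
    return list(dict.fromkeys(values))

# Kit IDs follow the format shown in the sample metadata above.
example_query = {'sampleKitGuid': as_query_list(['KT00001', 'KT00002', 'KT00001'])}
single_query = {'sampleKitGuid': as_query_list('KT00001')}
```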

Now, we send this dictionary to HISE via hisepy:

In [19]:
sample_data = hisepy.reader.read_samples(
    query_dict = query_dict
)

What we get back is a dictionary containing multiple kinds of information.

We can see what these are called with the `.keys()` method:

In [20]:
sample_data.keys()

dict_keys(['metadata', 'specimens', 'survey', 'labResults'])

And we can access each of these using square brackets:

In [21]:
sample_data['metadata'].columns

Index([                                    0,
                                        'id',
                               'lastUpdated',
                         'latestBatchUpdate',
                           'labLastModified',
                        'surveyLastModified',
                               'projectGuid',
                                'subject.id',
                       'subject.subjectGuid',
                            'subject.cohort',
                     'subject.biologicalSex',
                              'subject.race',
                             'subject.races',
                         'subject.ethnicity',
                         'subject.birthYear',
                   'subject.ageAtEnrollment',
                          'sample.visitName',
                       'sample.visitDetails',
                         'sample.sampleGuid',
                      'sample.sampleKitGuid',
                           'sample.drawDate',
                'sample.daysSinceF

In [22]:
sample_data['metadata'].shape

(495, 27)

In [23]:
sample_data['labResults'].columns

Index(['id', 'revisionHistory', 'revisionNumber', '% Basophils',
       '% Eosinophils', '% Lymphocytes', '% Monocytes', '% Neutrophils',
       'Absolute Basophil Count', 'Absolute Eosinophil Count (AEC)',
       ...
       'CMV IgG Serology', 'Converted to IA During Course of Study',
       'Covid-19 Vaccine Company Name', 'Covid-19 Vaccine Ever Received',
       'Num. of Days to/From IA Conversion',
       'Number of Days To/From Initial Covid-19 Vaccine',
       'Number of Days To/From Second Covid-19 Vaccine',
       'Number of Days To/From Third Covid-19 Vaccine',
       'CMV Ab Screen Index Value', 'CMV Ab Screen Result'],
      dtype='object', length=109)

In [24]:
sample_data['labResults'].shape

(495, 109)

In [25]:
sample_data['labResults'].head()

Unnamed: 0,id,revisionHistory,revisionNumber,% Basophils,% Eosinophils,% Lymphocytes,% Monocytes,% Neutrophils,Absolute Basophil Count,Absolute Eosinophil Count (AEC),...,CMV IgG Serology,Converted to IA During Course of Study,Covid-19 Vaccine Company Name,Covid-19 Vaccine Ever Received,Num. of Days to/From IA Conversion,Number of Days To/From Initial Covid-19 Vaccine,Number of Days To/From Second Covid-19 Vaccine,Number of Days To/From Third Covid-19 Vaccine,CMV Ab Screen Index Value,CMV Ab Screen Result
0,075fc922-035f-401e-b90c-f88ea50d0a95,"[{'historicalRevisionNumber': 1, 'modifiedDate...",6,0.5,1.6,31.1,7.7,59.1,22.0,69.0,...,,,,,,,,,,
1,4708ad58-3759-4a83-8fe5-9d7117b6d34f,"[{'historicalRevisionNumber': 1, 'modifiedDate...",3,0.9,2.4,39.5,8.9,48.3,50.0,132.0,...,,,,,,,,,,
2,5f6e9c4c-1dc2-4d84-beee-a4752a8280c3,"[{'historicalRevisionNumber': 1, 'modifiedDate...",4,0.7,0.7,45.4,12.1,41.1,29.0,29.0,...,,,,,,,,,,
3,8f2760aa-5a8f-4857-810d-d73a7d6f01da,"[{'historicalRevisionNumber': 1, 'modifiedDate...",4,1.1,2.5,49.8,10.9,35.7,32.0,73.0,...,,,,,,,,,,
4,fec103ba-a651-470b-adf9-4f2577fa9308,"[{'historicalRevisionNumber': 1, 'modifiedDate...",3,0.7,0.1,19.0,11.3,68.9,52.0,7.0,...,,,,,,,,,,


## Filter and join results

In [26]:
meta = sample_data['metadata']
labs = sample_data['labResults']

In [27]:
meta = meta[['sample.sampleKitGuid']]

In [28]:
# Keep only the lab columns that are listed in the lab groupings
lab_names = set(lab_groups['lab'])
keep_labs = [lab for lab in labs.columns if lab in lab_names]

In [29]:
labs = labs[keep_labs]

In [30]:
data = pandas.concat([meta, labs], axis = 1)

In [31]:
data.shape

(495, 59)

In [32]:
data = data.drop_duplicates()

In [33]:
data.shape

(216, 59)

Some rows are missing all labs. It looks like this happens once for each sample:

In [34]:
list(data[data['sample.sampleKitGuid'] == 'KT00001'].loc[1:].to_numpy())

[array(['KT00001', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan], dtype=object)]

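Count the number of missing values per row and remove rows without any lab results. The table has 59 columns, one of which is `sample.sampleKitGuid`, so a row with 58 missing values has no lab results at all: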

In [35]:
n_missing = data.isna().sum(axis=1)

In [36]:
data = data[n_missing < 58]

In [37]:
data.shape

(108, 59)
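
The same filter can be written without a hard-coded count. The sketch below, on invented toy data with two lab columns, uses `DataFrame.dropna` with `how='all'` restricted to the lab columns, so it keeps working if the number of labs changes:

```python
import numpy as np
import pandas as pd

# Toy data: each kit has one row with labs and one row without (values invented).
toy = pd.DataFrame({
    'sample.sampleKitGuid': ['K1', 'K1', 'K2', 'K2'],
    'Albumin': [4.1, np.nan, np.nan, 4.4],
    'Calcium': [9.2, np.nan, np.nan, np.nan],
})

# Every column except the kit ID is treated as a lab column.
lab_columns = [c for c in toy.columns if c != 'sample.sampleKitGuid']

# Drop rows only when all lab columns are missing; partially filled rows stay.
with_labs = toy.dropna(subset=lab_columns, how='all')
```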

Join to the sample metadata to make a complete dataset:

In [38]:
data = sample_meta.merge(data, on = 'sample.sampleKitGuid', how = 'left')

In [39]:
data.shape

(108, 68)

Drop the CMV lab results so they aren't confused with `subject.cmv`:

In [40]:
drop_cmv = data.columns[data.columns.str.contains('CMV')]

In [41]:
data = data.drop(drop_cmv, axis = 1)
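
Note that `str.contains('CMV')` is case-sensitive, which is why the lowercase `subject.cmv` column survives this step. A small sketch on an empty toy frame shows the behavior; the two CMV lab names are taken from the `labResults` columns listed earlier, and `regex=False` is added here only to make the literal match explicit.

```python
import pandas as pd

# Empty toy frame; only the column names matter for this illustration.
toy = pd.DataFrame(columns=[
    'subject.cmv', 'CMV IgG Serology', 'CMV Ab Screen Result', 'Albumin'
])

# Case-sensitive literal match: 'subject.cmv' does not contain 'CMV'.
drop_cols = toy.columns[toy.columns.str.contains('CMV', regex=False)]
kept = toy.drop(columns=drop_cols)
```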

In [42]:
data.columns

Index(['cohort.cohortGuid', 'subject.subjectGuid', 'subject.biologicalSex',
       'subject.birthYear', 'subject.race', 'subject.ethnicity', 'subject.cmv',
       'sample.visitName', 'sample.subjectAgeYears', 'sample.sampleKitGuid',
       '% Basophils', '% Eosinophils', '% Lymphocytes', '% Monocytes',
       '% Neutrophils', 'Absolute Basophil Count',
       'Absolute Eosinophil Count (AEC)', 'Absolute Lymphocyte Count (ALC)',
       'Absolute Monocyte Count (AMC)', 'Absolute Neutrophil Count (ANC)',
       'Alanine Transaminase (ALT)', 'Albumin', 'Alkaline Phosphatase',
       'Anti-CCP3', 'Anti-CCP31', 'Aspartate Aminotransferase (AST)',
       'Bilirubin, Total (T-Bili)', 'Blood Urea Nitrogen (BUN)',
       'C-Reactive Protein, High-Sensitivity (HS-CRP)', 'Calcium',
       'Carbon Dioxide (CO2)', 'Chloride (Cl)', 'Cholesterol, HDL',
       'Cholesterol, LDL', 'Cholesterol, Non-HDL', 'Cholesterol, Total',
       'Cholesterol/HDL Ratio', 'Creatinine',
       'Estimated Glomerular Fil

## Save results

In [43]:
out_file = 'output/pbmc_reference_labs_{d}.csv'.format(d = date.today())
data.to_csv(out_file)
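
By default, `to_csv()` also writes the row index as an unnamed first column, which `pandas.read_csv()` later reports as `Unnamed: 0`. Whether to keep it depends on what the downstream cleanup expects. The sketch below, using an in-memory buffer and invented values, shows the difference that `index=False` makes; the helper name `round_trip` is hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'sample.sampleKitGuid': ['K1', 'K2'],  # invented kit IDs
    'Albumin': [4.1, 4.4],                 # invented lab values
})

def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back with default settings."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(toy)                    # index saved as the first column
without_index = round_trip(toy, index=False)    # only the data columns
```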

## Upload results to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [49]:
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'PBMC Reference Assembled Labs {d}'.format(d = date.today())

In [50]:
search_id = element_id()
search_id

'praseodymium-bohrium-moscovium'

In [51]:
in_files = [sample_meta_file_uuid, cmv_uuid, grouping_uuid]

In [52]:
out_files = [out_file]

In [53]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files,
    destination = search_id
)

you are trying to upload file_ids... ['output/pbmc_reference_labs_2024-04-18.csv']. Do you truly want to proceed?


(y/n) y


{'trace_id': '92eedc0c-b488-4cf9-9443-32fc2b2a8751',
 'files': ['output/pbmc_reference_labs_2024-04-18.csv']}

In [54]:
session_info.show()