# Retrieve sample metadata and labs from HISE

The Python version of the HISE SDK (`hisepy`) can retrieve sample metadata whether or not a sample has been processed by a pipeline. We'll use this SDK to pull the data, then use R to clean it up.

## Load packages

datetime: Used to add today's date to our output files  
hisepy: the HISE SDK  
os: operating system interface (used to make an output folder)  
pandas: DataFrames for Python  
session_info: displays the versions of Python and all of the packages we used  
warnings: Used to suppress some annoying warnings that don't impact data retrieval

In [1]:
from datetime import date
import hisepy
import os
import pandas
import session_info

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
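
To see what this filter does and does not hide, here is a small sketch that uses only the standard library. The function name `visible_warning_categories` is ours, for illustration only. It records warnings inside a temporary context, so the notebook's global filters are left untouched.

```python
import warnings


def visible_warning_categories():
    """Return the names of warning categories that get past the filter."""
    with warnings.catch_warnings(record=True) as caught:
        # Start from a permissive baseline so nothing is hidden by default.
        warnings.simplefilter("always")
        # Same filter as in the cell above: hide FutureWarning only.
        warnings.simplefilter(action="ignore", category=FutureWarning)
        warnings.warn("this API will change", FutureWarning)
        warnings.warn("something worth reading", UserWarning)
    return [w.category.__name__ for w in caught]


print(visible_warning_categories())
```

Only the `UserWarning` is reported, so other kinds of warnings raised during retrieval would still be visible.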

## Retrieve sample metadata from HISE

In [2]:
sample_meta_file_uuid = 'd82c5c42-ae5f-4e67-956e-cd3b7bf88105'
res = hisepy.reader.read_files([sample_meta_file_uuid])
sample_meta = res['values']

## Query HISE to get data

First, we make a dictionary (with curly braces) that defines what we want to get. In this case, we want to get samples from our study that were previously specified in our sample metadata.

Each value in the dictionary has to be a list (square brackets), even if it holds a single item.

In [3]:
query_dict = {
    'sampleKitGuid': sample_meta['sample.sampleKitGuid'].tolist()
}
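
The list requirement is easy to forget when querying for a single kit. A minimal sketch of one way to guard against that is below. `as_query_dict` is a hypothetical helper of our own, not part of `hisepy`, and `sampleKitGuid` is the only field name taken from this notebook.

```python
def as_query_dict(**filters):
    """Wrap every filter value in a list, leaving existing lists as lists."""
    query = {}
    for field, value in filters.items():
        if isinstance(value, (list, tuple, set)):
            query[field] = list(value)
        else:
            # A lone string or number becomes a one-element list.
            query[field] = [value]
    return query


single = as_query_dict(sampleKitGuid="KT00001")
several = as_query_dict(sampleKitGuid=("KT00001", "KT00002"))
print(single)
print(several)
```

Both results have the same shape as `query_dict` above: a dictionary whose values are all lists.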

Now, we send this dictionary to HISE via `hisepy`:

In [5]:
sample_data = hisepy.reader.read_samples(
    query_dict = query_dict
)

What we get back is a dictionary containing multiple kinds of information.

We can see what these are called with the `.keys()` method:

In [6]:
sample_data.keys()

dict_keys(['metadata', 'specimens', 'survey', 'labResults'])

And we can access each of these using square brackets:

In [7]:
sample_data['metadata'].columns

Index([                                    0,
                                        'id',
                               'lastUpdated',
                         'latestBatchUpdate',
                           'labLastModified',
                        'surveyLastModified',
                               'projectGuid',
                                'subject.id',
                       'subject.subjectGuid',
                            'subject.cohort',
                     'subject.biologicalSex',
                              'subject.race',
                             'subject.races',
                         'subject.ethnicity',
                         'subject.birthYear',
                   'subject.ageAtEnrollment',
                          'sample.visitName',
                       'sample.visitDetails',
                         'sample.sampleGuid',
                      'sample.sampleKitGuid',
                           'sample.drawDate',
                'sample.daysSinceF

In [8]:
sample_data['metadata'].head()

Unnamed: 0,0,id,lastUpdated,latestBatchUpdate,labLastModified,surveyLastModified,projectGuid,subject.id,subject.subjectGuid,subject.cohort,...,sample.visitDetails,sample.sampleGuid,sample.sampleKitGuid,sample.drawDate,sample.daysSinceFirstVisit,sample.diseaseStatesRecordedAtVisit,experiments,experimentComments,batchIds,panelIds
0,,f499ff83-e513-4d24-a10f-151348269fff,2024-04-08T17:45:36.871Z,2023-03-31T13:38:16.091Z,2022-10-07T18:21:02.71Z,0001-01-01T00:00:00Z,e206cf7a-5b13-478f-b842-a305fe4954d8,9e77e518-3801-47ff-8324-7d1dca8a8fd1,BR1001,BR1,...,N/A - Flu-Series Timepoint Only,599,KT00001,2019-10-01T00:00:00Z,0,[],"{'ATAC-seq': 'Completed', 'Flow cytometry': 'C...","{""ATAC-seq"":[{""OrderStatus"":""Completed"",""Order...","{'FlowCytometry': {'id': 'B013', 'status': 'av...","{'': {'batchId': '', 'status': 'available', 't..."
1,,750e90a9-a296-4f0f-969f-60225c2bca17,2024-04-08T17:45:36.972Z,2023-03-31T13:38:16.091Z,2021-09-13T20:09:05.913Z,2022-11-19T06:45:11.967Z,e206cf7a-5b13-478f-b842-a305fe4954d8,f9482e2e-d29f-4adc-9cd9-6803c778ffd6,BR1002,BR1,...,N/A - Flu-Series Timepoint Only,641,KT00002,2019-10-01T00:00:00Z,0,[],"{'ATAC-seq': 'Completed', 'Flow cytometry': 'C...","{""ATAC-seq"":[{""OrderStatus"":""Completed"",""Order...","{'FlowCytometry': {'id': 'B013', 'status': 'av...","{'': {'batchId': 'EXP-00447', 'status': 'avail..."
2,,2db6fb3f-e3f4-454b-891b-9b068541b51d,2024-04-08T17:45:37.259Z,2023-03-31T13:38:16.091Z,2021-09-13T20:09:05.964Z,2022-11-19T06:45:18.906Z,e206cf7a-5b13-478f-b842-a305fe4954d8,79ae1d71-8347-4115-a846-2d077cb5077c,BR1003,BR1,...,N/A - Flu-Series Timepoint Only,683,KT00003,2019-10-01T00:00:00Z,0,[],"{'ATAC-seq': 'Completed', 'Flow cytometry': 'C...","{""ATAC-seq"":[{""OrderStatus"":""Completed"",""Order...","{'FlowCytometry': {'id': 'B013', 'status': 'av...","{'': {'batchId': 'EXP-00447', 'status': 'avail..."
3,,f04693c5-563c-4b5b-ae58-877d0d9ae2fe,2024-04-08T17:45:37.26Z,2023-11-19T10:33:50.333Z,2021-09-13T20:09:06.018Z,2022-11-19T06:44:06.495Z,e206cf7a-5b13-478f-b842-a305fe4954d8,c77f3365-3e39-470b-b14c-049ffb49ea32,BR1004,BR1,...,N/A - Flu-Series Timepoint Only,725,KT00004,2019-10-01T00:00:00Z,0,[],"{'ATAC-seq': 'Completed', 'Flow cytometry': 'C...","{""ATAC-seq"":[{""OrderStatus"":""Completed"",""Order...","{'FlowCytometry': {'id': 'B022', 'status': 'av...","{'': {'batchId': 'B002', 'status': 'available'..."
4,,eb5b3a3d-002e-40a6-aa19-aa0e6a7fff8f,2024-04-08T17:45:37.261Z,2023-11-19T05:06:39.737Z,2021-09-13T20:09:06.069Z,2022-11-19T06:45:10.35Z,e206cf7a-5b13-478f-b842-a305fe4954d8,e2698854-1524-482d-80dc-d0ecac105beb,BR1005,BR1,...,N/A - Flu-Series Timepoint Only,727,KT00006,2019-10-01T00:00:00Z,0,[],"{'ATAC-seq': 'Completed', 'Flow cytometry': 'C...","{""ATAC-seq"":[{""OrderStatus"":""Completed"",""Order...","{'FlowCytometry': {'id': 'B013', 'status': 'av...","{'': {'batchId': 'B002', 'status': 'available'..."


In [9]:
sample_data['specimens'].columns

Index(['sampleId', 'specimenGuid', 'specimenType', 'specimenStatus',
       'externalContainerID', 'timeToProcessingOnset', 'totalViableCellCount',
       'subjectGuid', 'sampleKitGuid', 'projectGuid'],
      dtype='object')

In [10]:
sample_data['specimens'].head()

Unnamed: 0,sampleId,specimenGuid,specimenType,specimenStatus,externalContainerID,timeToProcessingOnset,totalViableCellCount,subjectGuid,sampleKitGuid,projectGuid
0,f499ff83-e513-4d24-a10f-151348269fff,CT00001-01,CyTOF,Removed,,0.0,,BR1001,KT00001,e206cf7a-5b13-478f-b842-a305fe4954d8
1,f499ff83-e513-4d24-a10f-151348269fff,PL00001-16,Plasma,Stored,,4800.0,,BR1001,KT00001,e206cf7a-5b13-478f-b842-a305fe4954d8
2,f499ff83-e513-4d24-a10f-151348269fff,PL00001-19,Plasma,Stored,,4800.0,,BR1001,KT00001,e206cf7a-5b13-478f-b842-a305fe4954d8
3,f499ff83-e513-4d24-a10f-151348269fff,PL00001-23,Plasma,Stored,,4800.0,,BR1001,KT00001,e206cf7a-5b13-478f-b842-a305fe4954d8
4,f499ff83-e513-4d24-a10f-151348269fff,PB00001-02,PBMC,Removed,,3900.0,,BR1001,KT00001,e206cf7a-5b13-478f-b842-a305fe4954d8


In [11]:
sample_data['survey'].columns

Index(['subjectGuid', 'sampleKitGuid', 'projectGuid', 'id', 'surveyDesignGuid',
       'revisionNumber', 'revisionHistory', 'auditInfo',
       'answers.could_not_sleep', 'answers.depressed',
       ...
       'answers.health-history_chicken_pox_imm_lifetime',
       'answers.health-history_covid19_diag_month',
       'answers.health-history_covid19_diag_year',
       'answers.health-history_has_covid19',
       'answers.health-history_if_covid_diagnosis_month',
       'answers.health-history_if_covid_diagnosis_year',
       'answers.health-history_if_covid_hospitalized',
       'answers.health-history_if_covid_pneumonia',
       'answers.health-history_if_covid_testing_yn',
       'answers.health-history_pneumococcal_imm_year'],
      dtype='object', length=495)

The survey table has several hundred columns (495 here), so pandas abbreviates the list and we don't show all of them.

In [12]:
sample_data['survey'].head()

Unnamed: 0,subjectGuid,sampleKitGuid,projectGuid,id,surveyDesignGuid,revisionNumber,revisionHistory,auditInfo,answers.could_not_sleep,answers.depressed,...,answers.health-history_chicken_pox_imm_lifetime,answers.health-history_covid19_diag_month,answers.health-history_covid19_diag_year,answers.health-history_has_covid19,answers.health-history_if_covid_diagnosis_month,answers.health-history_if_covid_diagnosis_year,answers.health-history_if_covid_hospitalized,answers.health-history_if_covid_pneumonia,answers.health-history_if_covid_testing_yn,answers.health-history_pneumococcal_imm_year
0,BR1002,KT00002,e206cf7a-5b13-478f-b842-a305fe4954d8,6aa938a8-8457-427a-ab17-2f010ef76144,fa47addf-be1b-482b-863b-5b3a4e1b1346,1.0,"[{'historicalRevisionNumber': 1, 'modifiedDate...","{'added': '2022-10-17T19:19:27.111Z', 'addedUs...",2.0,1.0,...,,,,,,,,,,
1,BR1002,KT00002,e206cf7a-5b13-478f-b842-a305fe4954d8,d0e669d0-b970-4d38-9d62-937b23021245,1ace95db-949a-4caa-9af7-2a0087beac29,1.0,"[{'historicalRevisionNumber': 1, 'modifiedDate...","{'added': '2022-11-14T22:09:52.205Z', 'addedUs...",,,...,,,,,,,,,,
2,BR1002,KT00002,e206cf7a-5b13-478f-b842-a305fe4954d8,00cb3a41-2a6c-48e9-b910-61a843b64f67,a3e8a6a6-9795-409c-90f6-78c6c3e10cb0,9.0,"[{'historicalRevisionNumber': 1, 'modifiedDate...","{'added': '2022-11-16T17:26:39.675Z', 'addedUs...",2.0,1.0,...,,,,,,,,,,
3,BR1003,KT00003,e206cf7a-5b13-478f-b842-a305fe4954d8,3f9be31a-52d8-4ffa-990b-80af8f26b4df,fa47addf-be1b-482b-863b-5b3a4e1b1346,1.0,"[{'historicalRevisionNumber': 1, 'modifiedDate...","{'added': '2022-10-17T19:19:27.363Z', 'addedUs...",3.0,2.0,...,,,,,,,,,,
4,BR1003,KT00003,e206cf7a-5b13-478f-b842-a305fe4954d8,4354c842-8ca9-4774-9bde-d45d9a115793,1ace95db-949a-4caa-9af7-2a0087beac29,0.0,[],"{'added': '2022-11-14T22:09:52.274Z', 'addedUs...",,,...,,,,,,,,,,


In [13]:
sample_data['labResults'].columns

Index(['id', 'revisionHistory', 'revisionNumber', '% Basophils',
       '% Eosinophils', '% Lymphocytes', '% Monocytes', '% Neutrophils',
       'Absolute Basophil Count', 'Absolute Eosinophil Count (AEC)',
       ...
       'CMV IgG Serology', 'Converted to IA During Course of Study',
       'Covid-19 Vaccine Company Name', 'Covid-19 Vaccine Ever Received',
       'Num. of Days to/From IA Conversion',
       'Number of Days To/From Initial Covid-19 Vaccine',
       'Number of Days To/From Second Covid-19 Vaccine',
       'Number of Days To/From Third Covid-19 Vaccine',
       'CMV Ab Screen Index Value', 'CMV Ab Screen Result'],
      dtype='object', length=108)

In [14]:
sample_data['labResults'].head()

Unnamed: 0,id,revisionHistory,revisionNumber,% Basophils,% Eosinophils,% Lymphocytes,% Monocytes,% Neutrophils,Absolute Basophil Count,Absolute Eosinophil Count (AEC),...,CMV IgG Serology,Converted to IA During Course of Study,Covid-19 Vaccine Company Name,Covid-19 Vaccine Ever Received,Num. of Days to/From IA Conversion,Number of Days To/From Initial Covid-19 Vaccine,Number of Days To/From Second Covid-19 Vaccine,Number of Days To/From Third Covid-19 Vaccine,CMV Ab Screen Index Value,CMV Ab Screen Result
0,075fc922-035f-401e-b90c-f88ea50d0a95,"[{'historicalRevisionNumber': 1, 'modifiedDate...",6,0.5,1.6,31.1,7.7,59.1,22.0,69.0,...,,,,,,,,,,
1,4708ad58-3759-4a83-8fe5-9d7117b6d34f,"[{'historicalRevisionNumber': 1, 'modifiedDate...",3,0.9,2.4,39.5,8.9,48.3,50.0,132.0,...,,,,,,,,,,
2,5f6e9c4c-1dc2-4d84-beee-a4752a8280c3,"[{'historicalRevisionNumber': 1, 'modifiedDate...",4,0.7,0.7,45.4,12.1,41.1,29.0,29.0,...,,,,,,,,,,
3,8f2760aa-5a8f-4857-810d-d73a7d6f01da,"[{'historicalRevisionNumber': 1, 'modifiedDate...",4,1.1,2.5,49.8,10.9,35.7,32.0,73.0,...,,,,,,,,,,
4,fec103ba-a651-470b-adf9-4f2577fa9308,"[{'historicalRevisionNumber': 1, 'modifiedDate...",3,0.7,0.1,19.0,11.3,68.9,52.0,7.0,...,,,,,,,,,,
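
Rather than inspecting each table by hand, the same checks can be run in a loop. The sketch below assumes only that the returned object is a dictionary of pandas DataFrames, as the outputs above suggest. `toy_sample_data` and `summarize_tables` are illustrative names, and the values are made-up stand-ins rather than real HISE records.

```python
import pandas as pd

# Stand-in for the dictionary returned by hisepy.reader.read_samples();
# the values here are toy data, not real HISE records.
toy_sample_data = {
    "metadata": pd.DataFrame({"sample.sampleKitGuid": ["KT00001", "KT00002"]}),
    "specimens": pd.DataFrame(
        {
            "sampleKitGuid": ["KT00001", "KT00001", "KT00002"],
            "specimenType": ["Plasma", "PBMC", "Plasma"],
        }
    ),
}


def summarize_tables(tables):
    """Map each table name to its (row count, column count)."""
    return {name: df.shape for name, df in tables.items()}


summary = summarize_tables(toy_sample_data)
for name, (n_rows, n_cols) in summary.items():
    print(f"{name}: {n_rows} rows x {n_cols} columns")
```

Passing the real `sample_data` to `summarize_tables()` would give a quick overview of all four tables at once.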


## Save results for formatting

For our purposes, we'll output the sample metadata and clinical labs, using the `pandas` method `to_csv()` to write the results to files.

We'll clean these up in R for later use.
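
One detail matters for the R step: by default, `to_csv()` also writes the DataFrame's row index as an unnamed first column. The sketch below shows the difference on a toy table written to in-memory buffers. `toy_labs` is made-up data. The `to_csv()` calls in this notebook use the default, so the cleanup code should expect that extra leading column.

```python
import io

import pandas as pd

toy_labs = pd.DataFrame({"id": ["a1", "b2"], "% Basophils": [0.5, 0.9]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy_labs.to_csv(with_index)

# index=False drops that column from the file.
without_index = io.StringIO()
toy_labs.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```

The first header starts with an empty field (a leading comma), while the second starts directly with `id`.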

In [18]:
out_dir = 'output'
# Create the folder if needed; exist_ok avoids an error when it already exists
os.makedirs(out_dir, exist_ok=True)

In [19]:
meta_file = '{folder}/br2_hise_metadata_{date}.csv'.format(folder = out_dir, date = date.today())

sample_data['metadata'].to_csv(meta_file)

In [20]:
labs_file = '{folder}/br2_hise_labs_{date}.csv'.format(folder = out_dir, date = date.today())

sample_data['labResults'].to_csv(labs_file)
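
Both file names follow the same `<folder>/<name>_<date>.csv` pattern, so the pattern can be factored into a small function. This is a sketch under our own naming: `dated_csv_path`, `run_day`, `example_meta`, and `example_labs` are illustrative and do not appear in the notebook. A fixed date is used so the result is reproducible.

```python
import os
from datetime import date


def dated_csv_path(folder, stem, day):
    """Build '<folder>/<stem>_<YYYY-MM-DD>.csv' for the given date."""
    return os.path.join(folder, f"{stem}_{day.isoformat()}.csv")


# Capture the date once so every file from this run carries the same stamp,
# even if the cells happen to run on either side of midnight.
run_day = date(2024, 4, 9)  # in the notebook this would be date.today()
example_meta = dated_csv_path("output", "br2_hise_metadata", run_day)
example_labs = dated_csv_path("output", "br2_hise_labs", run_day)
print(example_meta)
print(example_labs)
```

Formatting a `date` with `str.format()`, as the cells above do, yields the same ISO `YYYY-MM-DD` text as `isoformat()`.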

## Upload results to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [21]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA raw clinical labs {d}'.format(d = date.today())

In [22]:
in_files = [sample_meta_file_uuid]

In [23]:
out_files = [meta_file, labs_file]

In [24]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

Cannot determine the current notebook.
1) /home/jupyter/IH-A-Aging-Analysis-Notebooks/clinical-labs/00-Python_retrieve_sample_data.ipynb
2) /home/jupyter/IH-A-Aging-Analysis-Notebooks/clinical-labs/01-R_format_labs.ipynb
3) /home/jupyter/fh1-mm-labs/archive/00-R_retrieve_and_format_labs.ipynb
Please select (1-3) 


 1


you are trying to upload file_ids... ['output/br2_hise_metadata_2024-04-09.csv', 'output/br2_hise_labs_2024-04-09.csv']. Do you truly want to proceed?


(y/n) y


{'trace_id': 'ab4e0176-3448-4af2-9f1e-c3e3ba6e04b5',
 'files': ['output/br2_hise_metadata_2024-04-09.csv',
  'output/br2_hise_labs_2024-04-09.csv']}
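
Since the upload asks for confirmation and then sends whatever paths it was given, it can be worth checking the output list beforehand. The sketch below shows one possible pre-flight check. `missing_files` is a hypothetical helper of our own, not a `hisepy` function, and the demonstration uses throwaway files in a temporary folder.

```python
import os
import tempfile


def missing_files(paths):
    """Return the paths that do not exist or are empty files."""
    return [
        p for p in paths
        if not os.path.isfile(p) or os.path.getsize(p) == 0
    ]


# Demonstration with throwaway files rather than real outputs.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    empty = os.path.join(tmp, "empty.csv")
    absent = os.path.join(tmp, "absent.csv")
    with open(good, "w") as handle:
        handle.write("id,value\n1,2\n")
    open(empty, "w").close()
    problems = missing_files([good, empty, absent])
    problem_names = [os.path.basename(p) for p in problems]

print(problem_names)
```

In the notebook, calling `missing_files(out_files)` before the upload cell and confirming that it returns an empty list would catch a failed or skipped `to_csv()` step.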

In [25]:
session_info.show()