# Assemble all sample metadata

Here, we'll assemble all sample metadata, including CMV, BMI, and COVID vaccine data, for use in visualization tools.

The resulting .csv file includes these columns, ordered hierarchically based on origin:  
Cohort -> Subject -> Sample -> Specimen -> Data
  
`cohort.cohortGuid`: Cohort (BR1 or BR2)  
`subject.subjectGuid`: Subject GUID (e.g. BR1017)  
`subject.biologicalSex`: Subject Biological Sex (Female or Male)  
`subject.cmv`: Subject CMV Status (Negative or Positive)  
`subject.bmi`: Subject BMI (rounded to the nearest whole number; e.g. 23.0)  
`subject.race`: Subject Self-Reported Race  
`subject.ethnicity`: Subject Self-Reported Ethnicity   
`subject.birthYear`: Subject Birth Year (e.g. 1988)  
`subject.ageAtFirstDraw`: Subject Age at First Sample Draw (e.g. 32)  
`subject.covidVaxDose1.daysSinceFirstVisit`: Days from First Sample Draw to the first COVID-19 Vaccine dose (e.g. 440)  
`subject.covidVaxDose2.daysSinceFirstVisit`: Days from First Sample Draw to the second COVID-19 Vaccine dose (e.g. 461)  
`sample.sampleKitGuid`: Sample Kit GUID (e.g. KT00140)  
`sample.visitName`: Sample Visit Name (e.g. Flu Year 2 Day 7)  
`sample.drawDate`: Sample Draw Date (Year-Month; e.g. 2021-09)  
`sample.subjectAgeAtDraw`: Age of Subject at the time of Sample Draw (e.g. 33)  
`sample.daysSinceFirstVisit`: Days from First Sample Draw to this Sample Draw (e.g. 87)  
`specimen.specimenGuid`: Specimen GUID (a.k.a. pbmc_sample_id; e.g. PB00140-01)  
`pipeline.fileGuid`: File GUID for the originating Pipeline .h5 file  
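
The hierarchical ordering described above can be checked from the column prefixes alone. The following is a minimal sketch, not part of the original workflow: `level_order` and `is_hierarchically_ordered` are hypothetical names introduced here, and the prefix order is assumed from the column names listed above.

```python
# Prefix order assumed from the column list above (hypothetical helper, for illustration)
level_order = ['cohort', 'subject', 'sample', 'specimen', 'pipeline']

def is_hierarchically_ordered(columns, levels=level_order):
    # Map each column to the rank of its prefix (the text before the first '.')
    ranks = [levels.index(c.split('.')[0]) for c in columns]
    # Columns are ordered if the ranks never decrease
    return ranks == sorted(ranks)

example_columns = [
    'cohort.cohortGuid',
    'subject.subjectGuid', 'subject.cmv',
    'sample.sampleKitGuid',
    'specimen.specimenGuid', 'pipeline.fileGuid'
]

ordered = is_hierarchically_ordered(example_columns)
unordered = is_hierarchically_ordered(['sample.visitName', 'subject.bmi'])
print(ordered, unordered)
```

A check like this could be run on the final table's columns just before writing the .csv file.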

In [1]:
from datetime import date
import hisepy
import os
import pandas as pd
import re

In [2]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

## Helper functions

In [3]:
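def cache_uuid_path(uuid):
    # Cached files are expected in a directory named for their UUID
    cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
    # Only download the file if it has not been cached already
    if not os.path.isdir(cache_path):
        hise_res = hisepy.reader.cache_files([uuid])
    # Assumes the cache directory holds a single file
    filename = os.listdir(cache_path)[0]
    cache_file = '{p}/{f}'.format(p = cache_path, f = filename)
    return cache_file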

In [4]:
def read_csv_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = pd.read_csv(cache_file)
    return res

In [5]:
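def element_id(n = 3):
    # Build a random, human-readable ID from n element names
    import periodictable
    from random import randrange
    rand_el = []
    for i in range(n):
        # Atomic numbers 1-118; index 0 in periodictable is the neutron
        el = randrange(1,119)
        rand_el.append(periodictable.elements[el].name)
    rand_str = '-'.join(rand_el)
    return rand_str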

## Prepare sample metadata

This metadata is used to group samples for output and to ensure that the final files contain all of the metadata we require for analysis.

In [6]:
sample_meta_uuid = 'd82c5c42-ae5f-4e67-956e-cd3b7bf88105'
sample_meta = read_csv_uuid(sample_meta_uuid)

### Rename specimen and file-specific columns

In [7]:
sample_meta = sample_meta.rename({'pbmc_sample_id': 'specimen.specimenGuid'}, axis = 1)
sample_meta = sample_meta.rename({'file.id': 'pipeline.fileGuid'}, axis = 1)

### Add age at sample draw and age at enrollment

In [8]:
def drawDate_to_drawYear(drawDate):
    # Keep only the year: drop everything from the first '-' onward
    drawYear = re.sub('-.+', '', drawDate)
    drawYear = int(drawYear)
    return drawYear

In [9]:
sample_meta['sample.drawYear'] = [drawDate_to_drawYear(d) for d in sample_meta['sample.drawDate']]
sample_meta['sample.subjectAgeAtDraw'] = sample_meta['sample.drawYear'] - sample_meta['subject.birthYear']

In [10]:
first_draw_age = (
    sample_meta
        .groupby('subject.subjectGuid', as_index = False)['sample.subjectAgeAtDraw']
        .min()
        .rename({'sample.subjectAgeAtDraw': 'subject.ageAtFirstDraw'}, axis = 1)
)

In [11]:
sample_meta = sample_meta.merge(first_draw_age, on = 'subject.subjectGuid', how = 'left')
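
To make the age logic above concrete, here is a minimal sketch on a made-up three-row table. The subject IDs, birth years, and full draw dates are invented for illustration. It uses `groupby(...).transform('min')` as an alternative to the groupby-then-merge steps above. Note that subtracting years only approximates age, since it ignores month and day.

```python
import pandas as pd

# Invented example data; not taken from the real metadata
toy = pd.DataFrame({
    'subject.subjectGuid': ['S1', 'S1', 'S2'],
    'subject.birthYear': [1990, 1990, 1985],
    'sample.drawDate': ['2019-10-01', '2021-09-15', '2020-01-20'],
})

# The draw year is the text before the first '-'
toy['sample.drawYear'] = toy['sample.drawDate'].str.split('-').str[0].astype(int)
toy['sample.subjectAgeAtDraw'] = toy['sample.drawYear'] - toy['subject.birthYear']

# transform('min') broadcasts each subject's minimum age back to all of their rows
toy['subject.ageAtFirstDraw'] = (
    toy.groupby('subject.subjectGuid')['sample.subjectAgeAtDraw'].transform('min')
)

print(toy[['subject.subjectGuid', 'sample.subjectAgeAtDraw', 'subject.ageAtFirstDraw']])
```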

### Simplify drawDate

In [12]:
sample_meta['sample.drawDate'] = [re.sub('([0-9]{4}-[0-9]{2})-.+', '\\1', d) for d in sample_meta['sample.drawDate']]
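
As a quick illustration of the pattern used above, the sketch below applies the same regular expression to a few invented date strings. `to_year_month` is a hypothetical wrapper name, and the input formats are assumptions, since the exact format of the stored draw dates isn't shown here.

```python
import re

def to_year_month(draw_date):
    # Keep the leading YYYY-MM and drop the '-' and everything after it
    return re.sub(r'([0-9]{4}-[0-9]{2})-.+', r'\1', draw_date)

date_only = to_year_month('2021-09-15')
with_time = to_year_month('2021-09-15T00:00:00Z')
already_short = to_year_month('2021-09')  # no trailing '-...', so nothing changes

print(date_only, with_time, already_short)
```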

### Add CMV and BMI from clinical labs

In [13]:
cmv_meta_uuid = '9469f67c-b09a-454d-9fb9-f50ff3494d69'
cmv_meta = read_csv_uuid(cmv_meta_uuid)

In [14]:
cmv_meta = cmv_meta[['subject.subjectGuid', 'subject.cmv']].drop_duplicates()
cmv_meta.shape

(96, 2)

In [15]:
cmv_meta.head()

|  | subject.subjectGuid | subject.cmv |
|---|---|---|
| 0 | BR1001 | Negative |
| 1 | BR1002 | Negative |
| 2 | BR1003 | Negative |
| 3 | BR1004 | Negative |
| 4 | BR1005 | Negative |


In [16]:
bmi_meta_uuid = 'e507258c-d175-4d8e-a455-5229870dc991'
bmi_meta = read_csv_uuid(bmi_meta_uuid)

In [17]:
# Copy the column subset so the assignment below modifies a new DataFrame, not a view
bmi_meta = bmi_meta[['sample.sampleKitGuid', 'subject.bmi']].copy()
bmi_meta['subject.bmi'] = bmi_meta['subject.bmi'].round(0)
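
Note that `round(0)` returns floats (e.g. 23.0), and exact ties round to the nearest even value. The sketch below, on invented values, shows this behavior and one optional way to store the result as whole numbers using pandas' nullable `Int64` dtype. The conversion is an assumption for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Invented BMI values, including two exact ties and a missing value
bmi = pd.Series([22.5, 23.5, 21.4, np.nan])

# Rounding keeps the float dtype; ties go to the nearest even number
rounded = bmi.round(0)
print(rounded.tolist())

# Optional: the nullable Int64 dtype stores whole numbers and keeps missing values
as_int = rounded.astype('Int64')
print(as_int.dtype)
```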

In [18]:
bmi_meta.head()

|  | sample.sampleKitGuid | subject.bmi |
|---|---|---|
| 0 | KT00001 | 23.0 |
| 1 | KT00037 | 23.0 |
| 2 | KT00008 | 23.0 |
| 3 | KT00274 | 22.0 |
| 4 | KT01560 | 22.0 |


### Add COVID Vaccination data

In [19]:
covid_meta_uuid = 'ff8878a6-bd90-4183-8554-4bb24032df44'
covid_meta = read_csv_uuid(covid_meta_uuid)

In [20]:
covid_meta.head()

|  | subject.subjectGuid | subject.covidVaxDose1.daysSinceFirstVisit | subject.covidVaxDose2.daysSinceFirstVisit |
|---|---|---|---|
| 0 | BR1001 | NaN | NaN |
| 1 | BR1002 | 440.0 | 461.0 |
| 2 | BR1003 | 440.0 | 461.0 |
| 3 | BR1004 | 543.0 | 563.0 |
| 4 | BR1005 | 451.0 | 492.0 |


### Combine sample-level metadata

In [21]:
combined_sample_meta = sample_meta.merge(cmv_meta, on = 'subject.subjectGuid', how = 'left')
combined_sample_meta = combined_sample_meta.merge(bmi_meta, on = 'sample.sampleKitGuid', how = 'left')
combined_sample_meta = combined_sample_meta.merge(covid_meta, on = 'subject.subjectGuid', how = 'left')
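
Left merges like these keep one row per sample only if the join key is unique in the right-hand table. A minimal sketch of how this could be guarded, using the `validate` argument of `DataFrame.merge` on invented toy tables (the IDs and values below are made up):

```python
import pandas as pd

# Invented sample-level and subject-level tables
samples = pd.DataFrame({
    'subject.subjectGuid': ['S1', 'S1', 'S2'],
    'sample.sampleKitGuid': ['K1', 'K2', 'K3'],
})
subject_cmv = pd.DataFrame({
    'subject.subjectGuid': ['S1', 'S2'],
    'subject.cmv': ['Negative', 'Positive'],
})

# validate='many_to_one' raises if a subject appears more than once on the right
merged = samples.merge(
    subject_cmv, on = 'subject.subjectGuid', how = 'left', validate = 'many_to_one'
)

# A duplicated subject row on the right side is caught as a MergeError
dup_cmv = pd.concat([subject_cmv, subject_cmv.iloc[[0]]], ignore_index = True)
try:
    samples.merge(dup_cmv, on = 'subject.subjectGuid', how = 'left', validate = 'many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```

Without such a check, a duplicated key would silently add rows to the combined table.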

We only need to keep some of the metadata columns that pertain to cohort, subject, and sample. We'll also keep the originating File GUID to help us keep track of provenance. Let's select just these columns:

In [22]:
keep_meta = [
    'cohort.cohortGuid',
    'subject.subjectGuid', 'subject.biologicalSex', 'subject.cmv', 'subject.bmi',
    'subject.race', 'subject.ethnicity', 'subject.birthYear', 'subject.ageAtFirstDraw',
    'subject.covidVaxDose1.daysSinceFirstVisit', 'subject.covidVaxDose2.daysSinceFirstVisit',
    'sample.sampleKitGuid', 'sample.visitName', 'sample.drawDate', 
    'sample.subjectAgeAtDraw', 'sample.daysSinceFirstVisit',
    'specimen.specimenGuid', 'pipeline.fileGuid'
]

In [23]:
combined_sample_meta = combined_sample_meta[keep_meta]

In [24]:
combined_sample_meta.shape

(868, 18)

In [25]:
combined_sample_meta.columns

Index(['cohort.cohortGuid', 'subject.subjectGuid', 'subject.biologicalSex',
       'subject.cmv', 'subject.bmi', 'subject.race', 'subject.ethnicity',
       'subject.birthYear', 'subject.ageAtFirstDraw',
       'subject.covidVaxDose1.daysSinceFirstVisit',
       'subject.covidVaxDose2.daysSinceFirstVisit', 'sample.sampleKitGuid',
       'sample.visitName', 'sample.drawDate', 'sample.subjectAgeAtDraw',
       'sample.daysSinceFirstVisit', 'specimen.specimenGuid',
       'pipeline.fileGuid'],
      dtype='object')

In [26]:
combined_sample_meta.head()

|  | cohort.cohortGuid | subject.subjectGuid | subject.biologicalSex | subject.cmv | subject.bmi | subject.race | subject.ethnicity | subject.birthYear | subject.ageAtFirstDraw | subject.covidVaxDose1.daysSinceFirstVisit | subject.covidVaxDose2.daysSinceFirstVisit | sample.sampleKitGuid | sample.visitName | sample.drawDate | sample.subjectAgeAtDraw | sample.daysSinceFirstVisit | specimen.specimenGuid | pipeline.fileGuid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BR1 | BR1001 | Female | Negative | 23.0 | Caucasian | Non-Hispanic origin | 1987 | 32 | NaN | NaN | KT00001 | Flu Year 1 Day 0 | 2019-10 | 32 | 0 | PB00001-01 | fec489f9-9a74-4635-aa91-d2bf09d1faec |
| 1 | BR1 | BR1002 | Male | Negative | 22.0 | Caucasian | Non-Hispanic origin | 1991 | 28 | 440.0 | 461.0 | KT00002 | Flu Year 1 Day 0 | 2019-10 | 28 | 0 | PB00002-01 | 7c0c7979-eebd-4aba-b5b2-6e76b4643623 |
| 2 | BR1 | BR1003 | Female | Negative | 21.0 | Caucasian | Non-Hispanic origin | 1989 | 30 | 440.0 | 461.0 | KT00003 | Flu Year 1 Day 0 | 2019-10 | 30 | 0 | PB00003-01 | 40efd03a-cb2f-4677-af42-a056cbfe5a17 |
| 3 | BR1 | BR1004 | Male | Negative | 22.0 | Caucasian | Non-Hispanic origin | 1989 | 30 | 543.0 | 563.0 | KT00004 | Flu Year 1 Day 0 | 2019-10 | 30 | 0 | PB00004-01 | 68fbcd34-1d63-461d-8195-df5b8dc61b31 |
| 4 | BR1 | BR1005 | Female | Negative | 20.0 | Caucasian | Non-Hispanic origin | 1992 | 27 | 451.0 | 492.0 | KT00006 | Flu Year 1 Day 0 | 2019-10 | 27 | 0 | PB00006-01 | ea8d98e9-e99e-4dc6-9e78-9866e0deac68 |


In [27]:
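# Build the output path from out_dir so it stays in sync with the directory created above
out_csv = os.path.join(out_dir, 'diha_all_metadata_{d}.csv'.format(d = date.today()))
combined_sample_meta.to_csv(out_csv)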
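
One detail worth knowing about `to_csv()`: by default it also writes the DataFrame's row index as an extra, unnamed first column. The sketch below shows the difference on a one-row toy table, written to in-memory buffers rather than to disk. Whether downstream tools expect the index column is not stated here, so treat `index = False` as an option, not a recommendation.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'cohort.cohortGuid': ['BR1'],
    'subject.subjectGuid': ['BR1001'],
})

# Default behavior: the row index is written as an unnamed leading column
with_index = io.StringIO()
toy.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]

# index = False writes only the named columns
without_index = io.StringIO()
toy.to_csv(without_index, index = False)
header_without_index = without_index.getvalue().splitlines()[0]

print(repr(header_with_index))
print(repr(header_without_index))
```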


## Upload Sample data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [28]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA Complete Sample Metadata {d}'.format(d = date.today())

In [29]:
search_id = element_id()
search_id

'helium-selenium-neptunium'

In [33]:
in_files = [
    sample_meta_uuid,
    cmv_meta_uuid,
    bmi_meta_uuid,
    covid_meta_uuid
]

In [34]:
out_files = [out_csv]

In [36]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files,
    destination = search_id
)

Cannot determine the current notebook.
1) /home/jupyter/IH-A-Aging-Analysis-Notebooks/scrna-seq_analysis/03-data-assembly/18-Python_assemble_complete_metadata.ipynb
2) /home/jupyter/IH-A-Aging-Analysis-Notebooks/scrna-seq_analysis/03-data-assembly/17-Python_full_dataset_umap.ipynb
3) /home/jupyter/data-apps-vis/datasets/dynamics_imm_health/01-R_assemble_pseudobulk.ipynb
Please select (1-3) 


 1


you are trying to upload file_ids... ['output/diha_all_metadata_2024-05-17.csv']. Do you truly want to proceed?


(y/n) y


{'trace_id': 'cdd0065b-2794-49c2-a978-a37810a5f75e',
 'files': ['output/diha_all_metadata_2024-05-17.csv']}

In [37]:
import session_info
session_info.show()