
# ASAP CRN Unique ID generation - wave 1


Postmortem-derived Brain Sequencing Collection


25 OCT 2023
Andy Henrie


### Dataset ID
- "ASAP_PMBDS" to identify that it is part of the Postmortem-derived Brain Sequencing Collection
- `ASAP_dataset_id`
    - we also need to generate a `team_dataset_id` (add to the CDE / data dictionary): team code + a one- to two-word descriptor

### Team ID
- hardcoded definitions
- `ASAP_team_id`

### Subject ID
- unique for ASAP
- could exist across several Teams / Datasets
- `ASAP_subject_id`

### Sample ID
- unique for each sample
- multiple samples could derive from the same `ASAP_subject_id`:
    - multiple brain regions from a single team
    - multiple teams from same biobank
    - "other" repeated samples??
- `ASAP_sample_id`
- formed as the unique `ASAP_subject_id` + a "sample repeat number"

## Study ID: Postmortem-derived Brain Sequencing Collection (PMBDS)
- Every `ASAP_dataset_id` and `ASAP_subject_id` here will start with "ASAP_PMBDS"
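A quick format check can catch IDs that drift from this convention. The sketch below assumes six zero-padded digits after the study prefix, which matches the subject IDs printed later in this notebook; `SUBJECT_ID_PATTERN` and `is_valid_subject_id` are hypothetical names, not part of `asap_ids`.

```python
import re

# Assumed layout: study prefix followed by six zero-padded digits.
SUBJECT_ID_PATTERN = re.compile(r"ASAP_PMBDS_\d{6}")


def is_valid_subject_id(value: str) -> bool:
    """Return True if value looks like an ASAP subject ID (hypothetical helper)."""
    return SUBJECT_ID_PATTERN.fullmatch(value) is not None


print(is_valid_subject_id("ASAP_PMBDS_000001"))
print(is_valid_subject_id("ASAP_000001"))
```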


### Issues

- storing "master" IDs for lookup: JSON is used for now because it loads directly into a `dict` mapper, but .csv tables would also work
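As an illustration of the JSON option, here is a minimal sketch of a load/write round trip for a `dict` mapper. The helper names `load_mapper_sketch` and `write_mapper_sketch` are hypothetical and are not the `load_id_mapper` / `write_id_mapper` functions imported from `asap_ids` below; returning an empty `dict` for a missing file is an assumption about the desired behavior.

```python
import json
import tempfile
from pathlib import Path


def load_mapper_sketch(path: Path) -> dict:
    """Return the stored mapping, or an empty dict if the file does not exist yet."""
    if path.exists():
        with path.open() as f:
            return json.load(f)
    return {}


def write_mapper_sketch(mapper: dict, path: Path) -> None:
    """Write the mapping as indented JSON so it stays easy to inspect by hand."""
    with path.open("w") as f:
        json.dump(mapper, f, indent=2)


# Round trip in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_map.json"
    first_load = load_mapper_sketch(demo_path)  # file is missing at this point
    write_mapper_sketch({"HC_1225": "ASAP_PMBDS_000001"}, demo_path)
    round_trip = load_mapper_sketch(demo_path)

print(first_load, round_trip)
```

A .csv table would need a two-column read and a conversion step to get the same `dict`, which is the main convenience JSON offers here.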


In [36]:
# conda create -n lw10 python=3.10 notebook ipykernel pip pandas ijson -y && conda activate lw10

In [37]:
import pandas as pd
from pathlib import Path


from asap_ids import (read_meta_table, get_dtypes_dict, STUDY_PREFIX, DATASET_ID,
                      load_id_mapper, write_id_mapper, generate_asap_sample_ids,
                      generate_asap_subject_ids, process_meta_files)

%load_ext autoreload
%autoreload 2




Load the CDE so that the team tables are read with the correct data types.

In [38]:
CDE_path = Path.cwd() / "ASAP_CDE.csv"
CDE = pd.read_csv(CDE_path)
# Initialize the data types dictionary
dtypes_dict = get_dtypes_dict(CDE)


## `ASAP_team_id`

On meta-data ingest, add this to:
- STUDY, PROTOCOL

In [39]:
team_names = ["lee", "hafler", "hardy", "jakobsson", "sherzer","sulzer", "voet","wood"]
[x.upper() for x in team_names]



['LEE', 'HAFLER', 'HARDY', 'JAKOBSSON', 'SHERZER', 'SULZER', 'VOET', 'WOOD']

In [40]:
team_codes = ["LEE", "HAF", "HAR", "JAK", "SHE", "SUL", "VOE", "WOO"]




In [41]:
ASAP_team_id = ["TEAM_" + team_name.upper() for team_name in team_names]
ASAP_team_id 

['TEAM_LEE',
 'TEAM_HAFLER',
 'TEAM_HARDY',
 'TEAM_JAKOBSSON',
 'TEAM_SHERZER',
 'TEAM_SULZER',
 'TEAM_VOET',
 'TEAM_WOOD']
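The hardcoded lists above can be combined into lookup tables, which also gives a place to build the `team_dataset_id` mentioned earlier (team code + a short descriptor). This is only a sketch: the underscore separator, the upper-casing, and the helper name `make_team_dataset_id` are assumptions, and the descriptor in the example is made up for illustration.

```python
team_names = ["lee", "hafler", "hardy", "jakobsson", "sherzer", "sulzer", "voet", "wood"]
team_codes = ["LEE", "HAF", "HAR", "JAK", "SHE", "SUL", "VOE", "WOO"]

# Lookup tables keyed by the lower-case team name.
team_id_by_name = {name: f"TEAM_{name.upper()}" for name in team_names}
team_code_by_name = dict(zip(team_names, team_codes))


def make_team_dataset_id(team_name: str, descriptor: str) -> str:
    """Join the team code and a one- to two-word descriptor (assumed format)."""
    words = descriptor.upper().split()
    if not 1 <= len(words) <= 2:
        raise ValueError("descriptor should be one or two words")
    return "_".join([team_code_by_name[team_name], *words])


print(team_id_by_name["lee"])
print(make_team_dataset_id("hafler", "demo set"))  # hypothetical descriptor
```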

## `ASAP_dataset_id`

This is analogous to the GP2 "study code".

This is done by hand for now. On meta-data ingest, add this (?) to:
- STUDY, PROTOCOL, SAMPLE



Currently we have:
- Team Lee 
- Team Hardy
- Team Hafler



In [42]:
ASAP_dataset_id = DATASET_ID
ASAP_dataset_id


'ASAP_PMBDS'

## `ASAP_subject_id`


### Subject ID
- unique for ASAP
- could exist across several Teams / Datasets
- `ASAP_subject_id`


On meta-data ingest, add this to:
- SUBJECT

"ASAP_XXXXXXX"

Team Lee:  

Team Hardy:

Team Hafler:



We need to define a function that creates the _master_archive_ (if it doesn't exist) and assigns an `ASAP_subject_id` to each subject.
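One way such a function could behave is sketched below: existing entries are kept untouched, and unseen source IDs receive the next free number. This is an illustration under assumptions (six-digit zero padding, numbering continued from the highest issued value), not the implementation of `generate_asap_subject_ids`; the name `assign_subject_ids` is hypothetical.

```python
def assign_subject_ids(mapper, source_ids, prefix="ASAP_PMBDS_", width=6):
    """Return a copy of mapper extended with IDs for unseen source_ids."""
    updated = dict(mapper)  # never mutate the caller's archive
    # Continue numbering from the highest number already issued (0 if empty).
    last = max((int(v[len(prefix):]) for v in updated.values()), default=0)
    for source_id in source_ids:
        if source_id not in updated:
            last += 1
            updated[source_id] = f"{prefix}{last:0{width}d}"
    return updated


first_pass = assign_subject_ids({}, ["HC_1225", "HC_0602"])
# A subject seen again (HC_0602) keeps its ID; only PD_0009 is new.
second_pass = assign_subject_ids(first_pass, ["HC_0602", "PD_0009"])
print(second_pass)
```

Keeping the function pure (copy in, copy out) makes it easy to compare the archive before and after an ingest.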



## `ASAP_sample_id`

- unique for each sample
- multiple samples could derive from the same `ASAP_subject_id`
- `ASAP_sample_id`
- formed as the unique `ASAP_subject_id` + a "sample repeat number"


On meta-data ingest, add this to:
- SAMPLE
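The "subject ID + sample repeat number" rule can be sketched as a per-subject counter. The `_s<n>` suffix and the helper name `assign_sample_ids` are assumptions made for illustration; the exact layout produced by `generate_asap_sample_ids` may differ.

```python
from collections import defaultdict


def assign_sample_ids(subject_mapper, samples):
    """Map each source sample ID to '<ASAP_subject_id>_s<repeat>' (assumed layout).

    samples is an iterable of (source_sample_id, source_subject_id) pairs.
    """
    repeat_counts = defaultdict(int)  # samples seen so far for each subject
    sample_mapper = {}
    for sample_id, subject_id in samples:
        repeat_counts[subject_id] += 1
        asap_subject = subject_mapper[subject_id]
        sample_mapper[sample_id] = f"{asap_subject}_s{repeat_counts[subject_id]}"
    return sample_mapper


subject_mapper = {"HC_1225": "ASAP_PMBDS_000001", "HC_0602": "ASAP_PMBDS_000002"}
samples = [("MFG_HC_1225", "HC_1225"), ("HIP_HC_1225", "HC_1225"), ("MFG_HC_0602", "HC_0602")]
sample_ids = assign_sample_ids(subject_mapper, samples)
print(sample_ids)
```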

In [43]:
MASTER_SUBJECT_IDs = pd.DataFrame()



In [44]:

data_path = Path.cwd() / "clean/team-Lee"
# make sure cleaned files are correct


SUBJECT = read_meta_table(f"{data_path}/SUBJECT.csv", dtypes_dict)
CLINPATH = read_meta_table(f"{data_path}/CLINPATH.csv", dtypes_dict)
STUDY = read_meta_table(f"{data_path}/STUDY.csv", dtypes_dict)
PROTOCOL = read_meta_table(f"{data_path}/PROTOCOL.csv", dtypes_dict)
SAMPLE = read_meta_table(f"{data_path}/SAMPLE.csv", dtypes_dict)


Example of how to generate the `subj_id_mapper` and `samp_id_mapper` `dict`s:

In [45]:



## test with team Lee
subject_mapper_path = Path.cwd() / "ASAP_subj_map2.json"
sample_mapper_path = Path.cwd() / "ASAP_samp_map2.json"

subj_id_mapper = load_id_mapper(subject_mapper_path)
samp_id_mapper = load_id_mapper(sample_mapper_path)

ud_subj_id_mapper, ud_subject_df, n = generate_asap_subject_ids(subj_id_mapper, SUBJECT)
ud_samp_id_mapper, sample_df = generate_asap_sample_ids(ud_subj_id_mapper, SAMPLE, n, samp_id_mapper)




id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_map2.json
id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_map2.json


In [46]:
ud_subj_id_mapper, ud_samp_id_mapper



({'HC_1225': 'ASAP_PMBDS_000001',
  'HC_0602': 'ASAP_PMBDS_000002',
  'PD_0009': 'ASAP_PMBDS_000003',
  'PD_1921': 'ASAP_PMBDS_000004',
  'PD_2058': 'ASAP_PMBDS_000005',
  'PD_1441': 'ASAP_PMBDS_000006',
  'PD_1344': 'ASAP_PMBDS_000007',
  'HC_1939': 'ASAP_PMBDS_000008',
  'HC_1308': 'ASAP_PMBDS_000009',
  'HC_1862': 'ASAP_PMBDS_000010',
  'HC_1864': 'ASAP_PMBDS_000011',
  'HC_2057': 'ASAP_PMBDS_000012',
  'HC_2061': 'ASAP_PMBDS_000013',
  'HC_2062': 'ASAP_PMBDS_000014',
  'HC_2067': 'ASAP_PMBDS_000015',
  'PD_0348': 'ASAP_PMBDS_000016',
  'PD_0413': 'ASAP_PMBDS_000017',
  'PD_1312': 'ASAP_PMBDS_000018',
  'PD_1317': 'ASAP_PMBDS_000019',
  'PD_1504': 'ASAP_PMBDS_000020',
  'PD_1858': 'ASAP_PMBDS_000021',
  'PD_1902': 'ASAP_PMBDS_000022',
  'PD_1973': 'ASAP_PMBDS_000023',
  'PD_2005': 'ASAP_PMBDS_000024',
  'PD_2038': 'ASAP_PMBDS_000025'},
 {'MFG_HC_1225': 'ASAP_PMBDS_000001_000001_s1',
  'HIP_HC_1225': 'ASAP_PMBDS_000001_000001_s2',
  'SN_HC_1225': 'ASAP_PMBDS_000001_000001_s3',
  'MFG

Use the `process_meta_files` function to generate the mappers and update the meta tables.

In [47]:
subject_mapper_path = Path.cwd() / "ASAP_subj_map3.json"
sample_mapper_path = Path.cwd() / "ASAP_samp_map3.json"



export_root = Path.cwd() / "ASAP_tables" 

## add team Lee
table_root = Path.cwd() / "clean/team-Lee"
process_meta_files(table_root,
                   CDE_path,
                   subject_mapper_path,
                   sample_mapper_path,
                   export_path=export_root)



id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_map3.json
id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_map3.json
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/Lee


1

Using the same mapper paths, we can continue to generate ASAP IDs for the remaining teams.

In [48]:

## add team Hafler
table_root = Path.cwd() / "clean/team-Hafler"
process_meta_files(table_root, 
                       CDE_path, 
                       subject_mapper_path, 
                       sample_mapper_path, 
                       export_path=export_root)

## add team Hardy
table_root = Path.cwd() / "clean/team-Hardy"
process_meta_files(table_root, 
                       CDE_path, 
                       subject_mapper_path, 
                       sample_mapper_path, 
                       export_path=export_root)


id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_map3.json
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_map3.json
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/Hafler
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_map3.json
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_map3.json
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/Hardy


1
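After several teams have been processed against the same mapper files, it may be worth confirming that no ASAP ID was issued twice. The check below is a sketch that works on any mapper `dict` (for example one loaded from the JSON files above); `find_duplicate_ids` is a hypothetical helper, not part of `asap_ids`.

```python
from collections import Counter


def find_duplicate_ids(mapper: dict) -> list:
    """Return the ASAP IDs that appear as the value of more than one source ID."""
    counts = Counter(mapper.values())
    return sorted(asap_id for asap_id, n in counts.items() if n > 1)


good = {"HC_1225": "ASAP_PMBDS_000001", "HC_0602": "ASAP_PMBDS_000002"}
bad = {**good, "PD_0009": "ASAP_PMBDS_000002"}  # deliberately broken example

print(find_duplicate_ids(good))
print(find_duplicate_ids(bad))
```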