ASAP CRN Unique ID generation - wave 1

# ASAP CRN Unique ID generation - wave 1


Postmortem-derived Brain Sequencing Collection


25 OCT 2023
Andy Henrie

## keys to assign centrally 

### `ASAP_dataset_id`
- **ASAP_PBMSC** to identify that it is part of the Postmortem-derived Brain Sequencing Collection
- `ASAP_dataset_id`
    - also need to generate a "team_dataset_id" (Add to CDE/DataDictionary). TeamCODE+"one to two word descriptor"
- add to **STUDY**, **PROTOCOL**, and **SAMPLE**

### `ASAP_team_id`
- hardcoded definitions
- add to **STUDY**

### `ASAP_subject_id`
- unique for ASAP 
- could exist across several Teams / Datasets
- add to **SUBJECT** and **CLINPATH**

### `ASAP_sample_id`
- unique for each sample
- format:  `ASAP_DATASET_ID_XXXXXX_sXXX``
- multiple could derive from same `ASAP_subject_id`.  
    - multiple brain regions from a single team
    - multiple teams from same biobank
    - "other" repeated samples??
- Unique ASAP_subject_id + "sample repeat number"
- add to **SAMPLE**

## Study ID: Postmortem-derived Brain Sequencing Collection (PMBDS) 
- All ASAP_dataset_id, and ASAP_subject_id here will start with "ASAP_PMBDS_"


###  Issues

- storing "master" IDs for lookup: choosing json to make an easy `dict` mapper, but could make .csv tables 


In [1]:
# conda create -n lw10 python=3.10 notebook ipykernel pip pandas - y && conda activate lw10

In [2]:
import pandas as pd
from pathlib import Path


from asap_ids import (read_meta_table, get_dtypes_dict, DATASET_ID, 
                      load_id_mapper, write_id_mapper, generate_asap_sample_ids,
                      generate_asap_subject_ids, process_meta_files,
                      get_id,get_sampr)


                       

%load_ext autoreload
%autoreload 2


Load CDE for properly reading the team tables.

In [3]:
CDE_path = Path.cwd() / "ASAP_CDE_v2.csv" 
CDE = pd.read_csv(CDE_path )
# Initialize the data types dictionary
dtypes_dict = get_dtypes_dict(CDE)


## `ASAP_team_id`

Just requires enforcing teh ENUM and replacing hyphen ("-") with underscore ("_")
On meta-data ingest, add this to:
- STUDY

In [4]:
team_names = ["lee", "hafler", "hardy", "jakobsson", "scherzer","sulzer", "voet","wood"]
[x.upper() for x in team_names]



['LEE', 'HAFLER', 'HARDY', 'JAKOBSSON', 'SCHERZER', 'SULZER', 'VOET', 'WOOD']

In [5]:
team_codes = ["LEE", "HAF", "HAR", "JAK", "SCH", "SUL", "VOE", "WOO"]
team_codes = [x.upper()[:3] for x in team_names]

team_codes

['LEE', 'HAF', 'HAR', 'JAK', 'SCH', 'SUL', 'VOE', 'WOO']

In [6]:
ASAP_team_id = ["TEAM_" + team_name.upper() for team_name in team_names]
ASAP_team_id 

['TEAM_LEE',
 'TEAM_HAFLER',
 'TEAM_HARDY',
 'TEAM_JAKOBSSON',
 'TEAM_SCHERZER',
 'TEAM_SULZER',
 'TEAM_VOET',
 'TEAM_WOOD']

## `ASAP_dataset_id`

This compares with the GP2 "study code".

This is done by hand for now. On meta-data ingest, add this to:
- STUDY, PROTOCOL, SAMPLE



Currently we have:
- Team Lee 
- Team Hardy
- Team Hafler
- Team Jakobsson


- (Team Test)



In [33]:
ASAP_dataset_id = DATASET_ID
ASAP_dataset_id


'ASAP_PMBDS'

## `ASAP_subject_id`


### Subject ID
- unique for ASAP
- could exist across several Teams / Datasets
- `ASAP_subject_id`


On meta-data ingest, add this to:
- SUBJECT

"ASAP_XXXXXXX"

Team Lee:  

Team Hardy:

Team Hafler:



We need to define a function that creates the _master_archive_ (if it doesn't exist), and assigns  



## `ASAP_sample_id`

- unique for each sample
- multiple could derive from same `ASAP_subject_id`
- `ASAP_sample_id`
- Unique ASAP_subject_id + "sample repeat number"


On meta-data ingest, add this to:
- SAMPLE

## Example usage of `asap_id.py` functions
Examples of how to generate the subj_id_mapper  and samp_id_mapper `dict`s

In [34]:


## test with team Lee
subject_mapper_path = Path.cwd() / "ASAP_subj_test1.json"
source_mapper_path = Path.cwd() / "ASAP_source_test1.json"
gp2_mapper_path = Path.cwd() / "ASAP_gp2_test1.json"
sample_mapper_path = Path.cwd() / "ASAP_samp_test1.json"

asapid_mapper = load_id_mapper(sample_mapper_path) # subject id
sourceid_mapper = load_id_mapper(source_mapper_path) # sample id from source
gp2id_mapper = load_id_mapper(subject_mapper_path) # gp2 id
sampleid_mapper = load_id_mapper(sample_mapper_path) # sample id 


# ud_subj_id_mapper, ud_subject_df, n = generate_asap_subject_ids(subj_id_mapper, SUBJECT)
# ud_samp_id_mapper, sample_df = generate_asap_sample_ids(ud_subj_id_mapper, SAMPLE, n, samp_id_mapper)

id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test1.json
id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_source_test1.json
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test1.json
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test1.json


In [35]:

# data_path = Path.cwd() / "clean/team-Test/v2_20231112"
data_path = Path.cwd() / "clean/team-Lee/v2_20231130"
# data_path = Path.cwd() / "clean/team-Sulzer"

# make sure cleaned files are correct


SUBJECT = read_meta_table(f"{data_path}/SUBJECT.csv", dtypes_dict)
SAMPLE = read_meta_table(f"{data_path}/SAMPLE.csv", dtypes_dict)



`generate_asap_subject_ids` example

In [36]:
subject_df = SUBJECT.copy()
output = generate_asap_subject_ids(asapid_mapper,
                                    gp2id_mapper,
                                    sourceid_mapper, 
                                    subject_df)
asapid_mapper, gp2id_mapper,sourceid_mapper= output

added 25 new asap_subject_ids
added 0 new gp2_ids
added 25 new source_ids


In [37]:
subject_df.head()

Unnamed: 0,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,HC_1225,12-25,,,Banner Sun Health Research Institute,Human,Male,80,White,Not Reported,...,No,,,Unknown,,33,Normal,0,No PD nor other neurological disorder,
1,HC_0602,06-02,,,Banner Sun Health Research Institute,Human,Male,84,White,Not Reported,...,Yes,,,Unknown,,34,Normal,0,Other neurological disorder,Mild Cognitive Impairment
2,PD_0009,00-09,,,Banner Sun Health Research Institute,Human,Male,64,White,Not Reported,...,Yes,,,Unknown,,33,MCI,0,Idiopathic PD,
3,PD_1921,19-21,,,Banner Sun Health Research Institute,Human,Male,82,White,Not Reported,...,Yes,,,Unknown,,33,MCI,0,Idiopathic PD,
4,PD_2058,20-58,,,Banner Sun Health Research Institute,Human,Male,87,White,Not Reported,...,Yes,,,Unknown,,34,Normal,0,Idiopathic PD,


`generate_asap_sample_ids` example

In [39]:
sample_df = SAMPLE.copy()

ud_sampleid_mapper = generate_asap_sample_ids(asapid_mapper, sample_df, sampleid_mapper)
ud2_sampleid_mapper = generate_asap_sample_ids(asapid_mapper, sample_df, ud_sampleid_mapper)
ud3_sampleid_mapper = generate_asap_sample_ids(asapid_mapper, sample_df, ud2_sampleid_mapper)


found 75 sample_id's that have already been mapped!! BEWARE a sample_id naming collision!! If you are just reprocessing tables, it shoud be okay.
Nothing to see here... move along... move along .... 
No new sample_ids to map
found 75 sample_id's that have already been mapped!! BEWARE a sample_id naming collision!! If you are just reprocessing tables, it shoud be okay.
Nothing to see here... move along... move along .... 
No new sample_ids to map
found 75 sample_id's that have already been mapped!! BEWARE a sample_id naming collision!! If you are just reprocessing tables, it shoud be okay.
Nothing to see here... move along... move along .... 
No new sample_ids to map


In [19]:
!python asap_ids.py --tables "clean/team-Lee/v2_20231130"\
                    --cde "."\
                    --map "."\
                    --suf "testcl"\
                    --outdir "./ASAP_tables"

id_mapper not found at ASAP_subj_testcl.json
id_mapper not found at ASAP_samp_testcl.json
id_mapper not found at ASAP_gp2_testcl.json
id_mapper not found at ASAP_source_testcl.json
added 25 new asap_subject_ids
added 0 new gp2_ids
added 25 new source_ids
exporting to ASAP_tables/TEAM-LEE
overwriting updated id_mapper to ASAP_subj_testcl.json,ASAP_samp_testcl.json, etc.


`process_meta_files` example

In [53]:
export_root = Path.cwd() / "ASAP_tables" 
table_root = data_path

process_meta_files(table_root, 
                       CDE_path, 
                       subject_mapper_path, 
                       sample_mapper_path, 
                       export_path=export_root)


id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test1.json
id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test1.json
id_mapper not found at ASAP_gp2_map.json
id_mapper not found at ASAP_source_map.json
added 41 new asap_subject_ids
added 8 new gp2_ids
added 41 new source_ids
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/TEAM-TEST
overwriting updated id_mapper to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test1.json,/Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test1.json, etc.


1

In [29]:

subject_df = SUBJECT.copy()
asapid_mapper, gp2id_mapper,sourceid_mapper = generate_asap_subject_ids(asapid_mapper,
                                                                gp2id_mapper,
                                                                sourceid_mapper, 
                                                                subject_df)



added 0 new asap_subject_ids
added 0 new gp2_ids
added 0 new source_ids


In [28]:
subject_df.head()

Unnamed: 0,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,HC_1225,12-25,,,Banner Sun Health Research Institute,Human,Male,80,White,Not Reported,...,No,,,Unknown,,33,Normal,0,No PD nor other neurological disorder,
1,HC_0602,06-02,,,Banner Sun Health Research Institute,Human,Male,84,White,Not Reported,...,Yes,,,Unknown,,34,Normal,0,Other neurological disorder,Mild Cognitive Impairment
2,PD_0009,00-09,,,Banner Sun Health Research Institute,Human,Male,64,White,Not Reported,...,Yes,,,Unknown,,33,MCI,0,Idiopathic PD,
3,PD_1921,19-21,,,Banner Sun Health Research Institute,Human,Male,82,White,Not Reported,...,Yes,,,Unknown,,33,MCI,0,Idiopathic PD,
4,PD_2058,20-58,,,Banner Sun Health Research Institute,Human,Male,87,White,Not Reported,...,Yes,,,Unknown,,34,Normal,0,Idiopathic PD,


In [33]:

ASAP_subject_id = subject_df['subject_id'].map(asapid_mapper)
subject_df.insert(0, 'ASAP_subject_id', ASAP_subject_id)

subject_df.head()

Unnamed: 0,ASAP_subject_id,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,ASAP_PMBDS_000001,HC_1225,12-25,,,Banner Sun Health Research Institute,Human,Male,80,White,...,No,,,Unknown,,33,Normal,0,No PD nor other neurological disorder,
1,ASAP_PMBDS_000002,HC_0602,06-02,,,Banner Sun Health Research Institute,Human,Male,84,White,...,Yes,,,Unknown,,34,Normal,0,Other neurological disorder,Mild Cognitive Impairment
2,ASAP_PMBDS_000003,PD_0009,00-09,,,Banner Sun Health Research Institute,Human,Male,64,White,...,Yes,,,Unknown,,33,MCI,0,Idiopathic PD,
3,ASAP_PMBDS_000004,PD_1921,19-21,,,Banner Sun Health Research Institute,Human,Male,82,White,...,Yes,,,Unknown,,33,MCI,0,Idiopathic PD,
4,ASAP_PMBDS_000005,PD_2058,20-58,,,Banner Sun Health Research Institute,Human,Male,87,White,...,Yes,,,Unknown,,34,Normal,0,Idiopathic PD,


In [88]:
ud_sampleid_mapper = sampleid_mapper.copy() #sampleid_mapper.copy()
# 
uniq_subj = sample_df.subject_id.unique()
# check for subject_id collisions in the sampleid_mapper
if intersec := set(uniq_subj) & set(ud_sampleid_mapper.values()): 
    print(f"found {len(intersec)} subject_id collisions in the sampleid_mapper")
intersec

set()

First check to see if we've already got ASAP_ids for these samples

In [89]:
uniq_samp = sample_df.sample_id.unique()
if intersec := set(uniq_samp) & set(ud_sampleid_mapper.keys()): 
    print(f"found {len(intersec)} subject_id collisions in the sampleid_mapper")
intersec

set()

In [76]:
already_mapped = sample_df[sample_df['sample_id'].apply(lambda x: x in intersec)]

In [80]:
to_map = sample_df[~sample_df['sample_id'].apply(lambda x: x in intersec)]

uniq_subj = to_map.subject_id.unique()


In [85]:
bool(to_map.shape[0]+1)

True

In [94]:
def generate_asap_sample_ids2(asapid_mapper:dict, 
                             sample_df:pd.DataFrame, 
                             sampleid_mapper:dict) -> tuple[dict, pd.DataFrame]:
    """
    generate new unique_ids for new sample_ids in sample_df table, 
    update the id_mapper with the new ids from the data table

    return the updated id_mapper and updated sample_df
    """

    ud_sampleid_mapper = sampleid_mapper.copy()
    

    uniq_samp = sample_df.sample_id.unique()
    if samp_intersec := set(uniq_samp) & set(ud_sampleid_mapper.keys()): 
        print(f"found {len(samp_intersec)} sample_id's that have already been mapped!! BEWARE a sample_id naming collision!! If you are just reprocessing tables, it shoud be okay.")


    to_map = sample_df[~sample_df['sample_id'].apply(lambda x: x in samp_intersec)].copy()

    if not bool(to_map.shape[0]): 
        print("Nothing to see here... move along... move along .... \nNo new sample_ids to map")
        return ud_sampleid_mapper

    uniq_subj = to_map.subject_id.unique()
    # check for subject_id collisions in the sampleid_mapper
    if subj_intersec := set(uniq_subj) & set(ud_sampleid_mapper.values()): 
        print(f"found {len(subj_intersec)} subject_id collisions in the sampleid_mapper")
        
    df_chunks = []
    for subj_id in uniq_subj:

        df_subset = to_map[to_map.subject_id==subj_id].copy()
        asap_id = asapid_mapper[subj_id]

        dups = df_subset[df_subset.duplicated(keep=False, subset=['sample_id'])].sort_values('sample_id').reset_index(drop = True).copy()
        nodups = df_subset[~df_subset.duplicated(keep=False, subset=['sample_id'])].sort_values('sample_id').reset_index(drop = True).copy()
   
        asap_id = asapid_mapper[subj_id]
        if bool(ud_sampleid_mapper):
            # see if there are any samples already with this asap_id
            sns = [get_sampr(v) for v in ud_sampleid_mapper.values() if get_id(v)==asap_id]
            if len(sns) > 0: 
                rep_n = max(sns) + 1            
            else: 
                rep_n = 1   # start incrimenting from 1        
        else: # empty dicitonary. starting from scratch
            rep_n = 1


        if nodups.shape[0]>0:
            # ASSIGN IDS
            asap_nodups = [f'{asap_id}_s{rep_n+i:03}' for i in range(nodups.shape[0])]
            # nodups['ASAP_sample_id'] = asap_nodups
            nodups.loc[:, 'ASAP_sample_id'] = asap_nodups
            rep_n = rep_n + nodups.shape[0]
            samples_nodups = nodups['sample_id'].unique()

            nodup_mapper = dict(zip(nodups['sample_id'],asap_nodups))

            df_chunks.append(nodups)
        else:
            samples_nodups = []

        if dups.shape[0]>0:
            for dup_id in dups['sample_id'].unique():
                # first peel of any sample_ids that were already named in nodups, 

                if dup_id in samples_nodups:
                    asap_dup = nodup_mapper[dup_id]                    
                else:
                    # then assign ids to the rest.
                    asap_dup = f'{asap_id}_s{rep_n:03}'
                    dups.loc[dups.sample_id==dup_id, 'ASAP_sample_id'] = asap_dup
                    rep_n += 1
            df_chunks.append(dups)


    df_wids = pd.concat(df_chunks)
    id_mapper = dict(zip(df_wids['sample_id'],
                        df_wids['ASAP_sample_id']))

    ud_sampleid_mapper.update(id_mapper)


    # print(ud_sampleid_mapper)
    return ud_sampleid_mapper



In [32]:

sample_df = SAMPLE.copy()

sample_df.head()

Unnamed: 0,sample_id,subject_id,source_sample_id,replicate,replicate_count,repeated_sample,batch,tissue,brain_region,hemisphere,...,sex_ontology_term_id,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV200,pm_PH,donor_id
0,OF001,T-362,,Rep1,1,0,cing1,Brain,Cingulate Cortex,,...,PATO:0000383 (female),Unknown,,UBERON:0003027,,,Nucleus,,,
1,OF009,T-308,,Rep1,1,0,cing1,Brain,Cingulate Cortex,,...,PATO:0000384 (male),Unknown,,UBERON:0003027,,,Nucleus,,,
2,OF016,T3860,,Rep1,1,0,cing1,Brain,Cingulate Cortex,,...,PATO:0000384 (male),Unknown,,UBERON:0003027,,,Nucleus,,,
3,OF005,T290,,Rep1,1,0,cing1,Brain,Cingulate Cortex,,...,PATO:0000384 (male),Unknown,,UBERON:0003027,,,Nucleus,,,
4,OF013,T-233,,Rep1,1,0,cing1,Brain,Cingulate Cortex,,...,PATO:0000384 (male),Unknown,,UBERON:0003027,,,Nucleus,,,


In [31]:


ud_sampleid_mapper = generate_asap_sample_ids(asapid_mapper, sample_df, sampleid_mapper)



KeyError: 'T-362'

In [None]:

ud_sampleid_mapper2 = generate_asap_sample_ids2(asapid_mapper, sample_df, ud_sampleid_mapper)
ud_sampleid_mapper3 = generate_asap_sample_ids2(asapid_mapper, sample_df, ud_sampleid_mapper2)

ud_sampleid_mapper3

In [46]:
ud_sample_df = sample_df.copy()
ASAP_sample_id = ud_sample_df['sample_id'].map(ud_sampleid_mapper)
ud_sample_df.insert(0, 'ASAP_sample_id', ASAP_sample_id)

In [48]:
ud_sample_df.head(10)

Unnamed: 0,ASAP_sample_id,sample_id,subject_id,source_sample_id,replicate,replicate_count,repeated_sample,batch,tissue,brain_region,...,sex_ontology_term_id,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV200,pm_PH,donor_id
0,ASAP_PMBDS_000001_s002,MFG_HC_1225,HC_1225,12-25,rep1,1,0,BATCH_4,Brain,Middle_Frontal_Gyrus,...,PATO:0000384 (male),Unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
1,ASAP_PMBDS_000001_s001,HIP_HC_1225,HC_1225,12-25,rep1,1,0,BATCH_9,Brain,Hippocampus,...,PATO:0000384 (male),Unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
2,ASAP_PMBDS_000001_s003,SN_HC_1225,HC_1225,12-25,rep1,1,0,BATCH_7,Brain,Substantia_Nigra,...,PATO:0000384 (male),Unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
3,ASAP_PMBDS_000002_s002,MFG_HC_0602,HC_0602,06-02,rep1,1,0,BATCH_4,Brain,Middle_Frontal_Gyrus,...,PATO:0000384 (male),Unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
4,ASAP_PMBDS_000002_s001,HIP_HC_0602,HC_0602,06-02,rep1,1,0,BATCH_9,Brain,Hippocampus,...,PATO:0000384 (male),Unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
5,ASAP_PMBDS_000002_s003,SN_HC_0602,HC_0602,06-02,rep1,1,0,BATCH_5,Brain,Substantia_Nigra,...,PATO:0000384 (male),Unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
6,ASAP_PMBDS_000003_s002,MFG_PD_0009,PD_0009,00-09,rep1,1,0,BATCH_4,Brain,Middle_Frontal_Gyrus,...,PATO:0000384 (male),Unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
7,ASAP_PMBDS_000003_s001,HIP_PD_0009,PD_0009,00-09,rep1,1,0,BATCH_9,Brain,Hippocampus,...,PATO:0000384 (male),Unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
8,ASAP_PMBDS_000003_s003,SN_PD_0009,PD_0009,00-09,rep1,1,0,BATCH_6,Brain,Substantia_Nigra,...,PATO:0000384 (male),Unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
9,ASAP_PMBDS_000004_s002,MFG_PD_1921,PD_1921,19-21,rep1,1,0,BATCH_4,Brain,Middle_Frontal_Gyrus,...,PATO:0000384 (male),Unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,


In [100]:
from asap_ids import read_CDE, get_dtypes_dict, STUDY_PREFIX, read_meta_table, DATASET_ID, load_id_mapper, write_id_mapper, generate_asap_sample_ids2, generate_asap_subject_ids2

def process_meta_files_(table_path, 
                        CDE_path, 
                        subject_mapper_path = "ASAP_subj_map.json",
                        sample_mapper_path = "ASAP_samp_map.json",
                        gp2_mapper_path = "ASAP_gp2_map.json",
                        source_mapper_path = "ASAP_source_map.json",
                        export_path = None):
    """
    read in the meta data table, generate new ids, update the id_mapper, write the updated id_mapper to file
    """

    try:
        asapid_mapper = load_id_mapper(subject_mapper_path)
    except FileNotFoundError:
        asapid_mapper = {}
        print(f"{subject_mapper_path} not found... starting from scratch")

    try:
        sampleid_mapper = load_id_mapper(sample_mapper_path)
    except FileNotFoundError:
        sampleid_mapper = {}
        print(f"{sample_mapper_path} not found... starting from scratch")

    try:
        gp2id_mapper = load_id_mapper(gp2_mapper_path)
    except FileNotFoundError:
        gp2id_mapper = {}
        print(f"{gp2_mapper_path} not found... starting from scratch")

    try:
        sourceid_mapper = load_id_mapper(source_mapper_path)
    except FileNotFoundError:
        sourceid_mapper = {}
        print(f"{source_mapper_path} not found... starting from scratch")

    
    CDE, dtypes_dict = read_CDE(CDE_path)
    if CDE is None:
        return 0
    
    # add ASAP_team_id to the STUDY and PROTOCOL tables
    study_path = table_path / "STUDY.csv"
    if study_path.exists():
        study_df = read_meta_table(study_path, dtypes_dict)
        team_id = study_df['ASAP_team_name'].str.upper().replace('-', '_')
        study_df['ASAP_team_id'] = team_id
        # add ASAP_dataset_id = DATASET_ID to the STUDY tables
        study_df['ASAP_dataset_id'] = DATASET_ID
    else:
        study_df = None
        print(f"{study_path} not found... aborting")
        return 0

    protocol_path = table_path / "PROTOCOL.csv"
    if protocol_path.exists():
        protocol_df = read_meta_table(protocol_path, dtypes_dict)
        protocol_df['ASAP_dataset_id'] = DATASET_ID
    else:
        protocol_df = None
        print(f"{protocol_path} not found... aborting")
        return 0
    
    # add ASAP_subject_id to the SUBJECT tables
    subject_path = table_path / "SUBJECT.csv"
    if subject_path.exists():
        subject_df = read_meta_table(subject_path, dtypes_dict)
        output = generate_asap_subject_ids2(asapid_mapper,
                                            gp2id_mapper,
                                            sourceid_mapper, 
                                            subject_df)
        asapid_mapper, gp2id_mapper,sourceid_mapper = output

        ASAP_subject_id = subject_df['subject_id'].map(asapid_mapper)
        subject_df.insert(0, 'ASAP_subject_id', ASAP_subject_id)

        # # add ASAP_dataset_id = DATASET_ID to the SUBJECT tables
        # subject_df['ASAP_dataset_id'] = DATASET_ID
    else:
        subject_df = None
        print(f"{subject_path} not found... aborting")
        return 0
    
    # add ASAP_sample_id and ASAP_dataset_id to the SAMPLE tables
    sample_path = table_path / "SAMPLE.csv"
    if sample_path.exists():
        sample_df = read_meta_table(sample_path, dtypes_dict)
        sampleid_mapper = generate_asap_sample_ids2(asapid_mapper, sample_df, sampleid_mapper)
        sample_df['ASAP_dataset_id'] = DATASET_ID

        ASAP_sample_id = sample_df['sample_id'].map(sampleid_mapper)
        sample_df.insert(0, 'ASAP_sample_id', ASAP_sample_id)

    else:
        sample_df = None
        print(f"{sample_path} not found... aborting")
        return 0

    # add ASAP_sample_id to the CLINPATH tables
    clinpath_path = table_path / "CLINPATH.csv"
    if clinpath_path.exists():
        clinpath_df = read_meta_table(clinpath_path, dtypes_dict)
        clinpath_df['ASAP_subject_id'] = clinpath_df['subject_id'].map(asapid_mapper)

    # add ASAP_sample_id to the DATA tables
    data_path = table_path / "DATA.csv"
    if data_path.exists():
        data_df = read_meta_table(data_path, dtypes_dict)
        data_df['ASAP_sample_id'] = data_df['sample_id'].map(sampleid_mapper)


    # export updated tables
    if export_path is not None:

        #HACK: do we want to specify the full export path, or separate by team ID?
        asap_tables_path = export_path / study_df.ASAP_team_id[0]
        print(f"exporting to {asap_tables_path}")
        if  not asap_tables_path.exists():
            asap_tables_path.mkdir()

        if study_path.exists():
            study_df.to_csv(asap_tables_path / study_path.name)
        if protocol_path.exists():
            protocol_df.to_csv(asap_tables_path / protocol_path.name)
        if subject_path.exists():
            subject_df.to_csv(asap_tables_path / subject_path.name)
        if sample_path.exists():
            sample_df.to_csv(asap_tables_path / sample_path.name)
        if clinpath_path.exists():
            clinpath_df.to_csv(asap_tables_path / clinpath_path.name)
        if data_path.exists():
            data_df.to_csv(asap_tables_path / data_path.name)
    else:
        print("no ASAP_tables with ASAP_ID's exported")

    # write the updated id_mapper to file
    print(f"overwriting updated id_mapper to {subject_mapper_path},{sample_mapper_path}, etc.")
    write_id_mapper(asapid_mapper, subject_mapper_path)
    write_id_mapper(sourceid_mapper, source_mapper_path)
    write_id_mapper(gp2id_mapper, gp2_mapper_path)
    write_id_mapper(sampleid_mapper, sample_mapper_path)

    return 1


ImportError: cannot import name 'generate_asap_sample_ids2' from 'asap_ids' (/Users/ergonyc/Projects/ASAP/meta-clean/asap_ids.py)

In [103]:


export_root = Path.cwd() / "ASAP_tables" 
table_root = data_path

process_meta_files_(table_root, 
                       CDE_path, 
                       subject_mapper_path, 
                       sample_mapper_path, 
                       export_path=export_root)


id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test1.json
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test1.json
id_mapper loaded from ASAP_gp2_map.json
id_mapper loaded from ASAP_source_map.json
added 0 new asap_subject_ids
added 0 new gp2_ids
added 0 new source_ids
found 75 sample_id's that have already been mapped!! BEWARE a sample_id naming collision!! If you are just reprocessing tables, it shoud be okay.
Nothing to see here... move along... move along .... 
No new sample_ids to map
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/TEAM-LEE
overwriting updated id_mapper to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test1.json,/Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test1.json, etc.


1

## `asap_ids.py` CLI examples


Get some help:
```bash

python asap_ids.py --help
```

Returns:

```bash 
usage: asap_ids.py [-h] [--tables TABLES] [--cde CDE] [--map MAP] [--suf SUF] [--outdir OUTDIR]

A command-line tool to update tables from ASAP_CDEv1 to ASAP_CDEv2.

options:
  -h, --help       show this help message and exit
  --tables TABLES  Path to the directory containing meta TABLES. Defaults to the current working directory.
  --cde CDE        Path to the directory containing CSD.csv. Defaults to the current working directory.
  --map MAP        Path to the directory containing path to mapper.json files. Defaults to the current working directory.
  --suf SUF        suffix to mapper.json. Defaults to 'map' i.e. ASAP_{samp,subj}_map.json
  --outdir OUTDIR  Path to the directory containing CSD.csv. Defaults to the current working directory.
  ```

Generate IDs and update .json mapper. 

```bash

python asap_ids.py --tables "clean/team-Test/v2_20231109"\
                    --cde "."\
                    --map "."\
                    --suf "test1"\
                    --outdir "./ASAP_tables"


## TESTING 

First,kick things off with team Lee.   This seems to work. 

In [3]:
!python asap_ids.py --tables "clean/team-Lee/v2"\
                    --cde "."\
                    --map "."\
                    --suf "test"\
                    --outdir "./ASAP_tables"

id_mapper not found at ASAP_subj_test.json
id_mapper not found at ASAP_samp_test.json
id_mapper not found at ASAP_gp2_test.json
id_mapper not found at ASAP_source_test.json
try local file
added 25 new asap_subject_ids
added 0 new gp2_ids
added 25 new source_ids
exporting to ASAP_tables/TEAM-LEE
overwriting updated id_mapper to ASAP_subj_test.json,ASAP_samp_test.json, etc.


## probe with team Scherzer

Loaded Lee, Hafler, Hardy, Jakobsson into the "test" id mappers

Now lets simulate the if `__name__ == __main__` and process Team Hafler


In [8]:
from asap_ids import read_CDE, read_meta_table, generate_asap_subject_ids, generate_asap_sample_ids

# simulates asap_ids.main
DATASET_ID = "ASAP_PMBDS"
STUDY_PREFIX = f"{DATASET_ID}_"
ASAP_CDE = "ASAP_CDE_v2.csv"

# args
tables = "clean/team-Scherzer"
cde = "."
map = "."
suf = "test"
outdir = "./ASAP_tables/"

CDE_path = Path(cde) / ASAP_CDE

subject_mapper_path = Path(map) / f"ASAP_subj_{suf}.json"
sample_mapper_path = Path(map) / f"ASAP_samp_{suf}.json"
gp2_mapper_path = Path(map) / f"ASAP_gp2_{suf}.json"
source_mapper_path = Path(map) / f"ASAP_source_{suf}.json"

table_root = Path(tables) 
export_root= Path(outdir)

## Simulate process_meta_files

```python

    process_meta_files(table_root, 
                        CDE_path, 
                        subject_mapper_path=subject_mapper_path, 
                        sample_mapper_path=sample_mapper_path, 
                        gp2_mapper_path=gp2_mapper_path,
                        source_mapper_path = source_mapper_path,
                        export_path = export_root)
    
```

The first part should all work as is.

In [9]:
table_path = table_root
export_path = export_root
try:
    asapid_mapper = load_id_mapper(subject_mapper_path)
except FileNotFoundError:
    asapid_mapper = {}
    print(f"{subject_mapper_path} not found... starting from scratch")

try:
    sampleid_mapper = load_id_mapper(sample_mapper_path)
except FileNotFoundError:
    sampleid_mapper = {}
    print(f"{sample_mapper_path} not found... starting from scratch")

try:
    gp2id_mapper = load_id_mapper(gp2_mapper_path)
except FileNotFoundError:
    gp2id_mapper = {}
    print(f"{gp2_mapper_path} not found... starting from scratch")

try:
    sourceid_mapper = load_id_mapper(source_mapper_path)
except FileNotFoundError:
    sourceid_mapper = {}
    print(f"{source_mapper_path} not found... starting from scratch")

CDE, dtypes_dict = read_CDE(CDE_path)

# add ASAP_team_id to the STUDY and PROTOCOL tables
study_path = table_path / "STUDY.csv"
if study_path.exists():
    study_df = read_meta_table(study_path, dtypes_dict)
    team_id = study_df['ASAP_team_name'].str.upper().replace('-', '_')
    study_df['ASAP_team_id'] = team_id
    # add ASAP_dataset_id = DATASET_ID to the STUDY tables
    study_df['ASAP_dataset_id'] = DATASET_ID
else:
    study_df = None
    print(f"{study_path} not found... aborting")

protocol_path = table_path / "PROTOCOL.csv"
if protocol_path.exists():
    protocol_df = read_meta_table(protocol_path, dtypes_dict)
    protocol_df['ASAP_dataset_id'] = DATASET_ID
else:
    protocol_df = None
    print(f"{protocol_path} not found... aborting")

asapid_mapper

id_mapper loaded from ASAP_subj_test.json
id_mapper loaded from ASAP_samp_test.json
id_mapper loaded from ASAP_gp2_test.json
id_mapper loaded from ASAP_source_test.json


{'HC_1225': 'ASAP_PMBDS_000001',
 'HC_0602': 'ASAP_PMBDS_000002',
 'PD_0009': 'ASAP_PMBDS_000003',
 'PD_1921': 'ASAP_PMBDS_000004',
 'PD_2058': 'ASAP_PMBDS_000005',
 'PD_1441': 'ASAP_PMBDS_000006',
 'PD_1344': 'ASAP_PMBDS_000007',
 'HC_1939': 'ASAP_PMBDS_000008',
 'HC_1308': 'ASAP_PMBDS_000009',
 'HC_1862': 'ASAP_PMBDS_000010',
 'HC_1864': 'ASAP_PMBDS_000011',
 'HC_2057': 'ASAP_PMBDS_000012',
 'HC_2061': 'ASAP_PMBDS_000013',
 'HC_2062': 'ASAP_PMBDS_000014',
 'HC_2067': 'ASAP_PMBDS_000015',
 'PD_0348': 'ASAP_PMBDS_000016',
 'PD_0413': 'ASAP_PMBDS_000017',
 'PD_1312': 'ASAP_PMBDS_000018',
 'PD_1317': 'ASAP_PMBDS_000019',
 'PD_1504': 'ASAP_PMBDS_000020',
 'PD_1858': 'ASAP_PMBDS_000021',
 'PD_1902': 'ASAP_PMBDS_000022',
 'PD_1973': 'ASAP_PMBDS_000023',
 'PD_2005': 'ASAP_PMBDS_000024',
 'PD_2038': 'ASAP_PMBDS_000025',
 'HC01': 'ASAP_PMBDS_000026',
 'HC02': 'ASAP_PMBDS_000027',
 'HC03': 'ASAP_PMBDS_000028',
 'HC04': 'ASAP_PMBDS_000029',
 'HC05': 'ASAP_PMBDS_000030',
 'HC06': 'ASAP_PMBDS_0000

simulate 
```python 

# add ASAP_subject_id to the SUBJECT tables
subject_path = table_path / "SUBJECT.csv"
if subject_path.exists():
    subject_df = read_meta_table(subject_path, dtypes_dict)
    output = generate_asap_subject_ids(asapid_mapper,
                                        gp2id_mapper,
                                        sourceid_mapper, 
                                        subject_df)
    asapid_mapper, gp2id_mapper,sourceid_mapper, subject_df, n = output
    # # add ASAP_dataset_id = DATASET_ID to the SUBJECT tables
    # subject_df['ASAP_dataset_id'] = DATASET_ID
else:
    subject_df = None
    print(f"{subject_path} not found... aborting")

```

In [10]:
subject_path = table_path / "SUBJECT.csv"
subject_df = read_meta_table(subject_path, dtypes_dict)


In [12]:
output = generate_asap_subject_ids(asapid_mapper,
                                    gp2id_mapper,
                                    sourceid_mapper, 
                                    subject_df)
asapid_mapper, gp2id_mapper,sourceid_mapper, subject_df, n = output

added 94 new asap_subject_ids
added 0 new gp2_ids
added 85 new source_ids


In [15]:
subject_df

Unnamed: 0,ASAP_subject_id,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,ASAP_PMBDS_000003,BN0009,00-09,,,BSHRI,Human,Male,64,White,...,Yes,,,Former smoker,30,33,Dementia,Unknown,Idiopathic PD,
1,ASAP_PMBDS_000128,BN0329,03-29,,,BSHRI,Human,Male,79,White,...,Yes,,14,Unknown,Unknown,23,Dementia,Unknown,Idiopathic PD,
2,ASAP_PMBDS_000129,BN0339,03-39,,,BSHRI,Human,Male,86,White,...,No,,12,Former smoker,50,33,Normal,Unknown,Healthy Control,
3,ASAP_PMBDS_000130,BN0341,03-41,,,BSHRI,Human,Male,89,White,...,No,,16,Never,0,34,Normal,Unknown,Healthy Control,
4,ASAP_PMBDS_000131,BN0347,03-47,,,BSHRI,Human,Female,95,White,...,No,,12,Former smoker,Unknown,33,MCI,Unknown,Healthy Control,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,ASAP_PMBDS_000216,BN2003,20-03,,,BSHRI,Human,Male,72,White,...,Yes,,16,Former smoker,Unknown,33,Dementia,Unknown,Idiopathic PD,
90,ASAP_PMBDS_000217,BN2015,20-15,,,BSHRI,Human,Male,90,White,...,No,,20,Never,Unknown,34,Normal,Unknown,Prodromal motor PD,
91,ASAP_PMBDS_000218,BN9944,99-44,,,BSHRI,Human,Male,69,White,...,No,,15,Former smoker,Unknown,34,Normal,Unknown,Healthy Control,
92,ASAP_PMBDS_000219,BN9947,99-47,,,BSHRI,Human,Male,84,White,...,No,,20,Former smoker,Unknown,33,Normal,Unknown,Prodromal motor PD,


In [16]:
# add ASAP_sample_id and ASAP_dataset_id to the SAMPLE tables
sample_path = table_path / "SAMPLE.csv"

sample_df = read_meta_table(sample_path, dtypes_dict)


In [17]:

sampleid_mapper, sample_df = generate_asap_sample_ids(asapid_mapper, sample_df, sampleid_mapper)
sample_df['ASAP_dataset_id'] = DATASET_ID



In [19]:
sampleid_mapper

{'HIP_HC_1225': 'ASAP_PMBDS_000001_s001',
 'MFG_HC_1225': 'ASAP_PMBDS_000001_s002',
 'SN_HC_1225': 'ASAP_PMBDS_000001_s003',
 'HIP_HC_0602': 'ASAP_PMBDS_000002_s001',
 'MFG_HC_0602': 'ASAP_PMBDS_000002_s002',
 'SN_HC_0602': 'ASAP_PMBDS_000002_s003',
 'HIP_PD_0009': 'ASAP_PMBDS_000003_s001',
 'MFG_PD_0009': 'ASAP_PMBDS_000003_s002',
 'SN_PD_0009': 'ASAP_PMBDS_000003_s003',
 'HIP_PD_1921': 'ASAP_PMBDS_000004_s001',
 'MFG_PD_1921': 'ASAP_PMBDS_000004_s002',
 'SN_PD_1921': 'ASAP_PMBDS_000004_s003',
 'HIP_PD_2058': 'ASAP_PMBDS_000005_s001',
 'MFG_PD_2058': 'ASAP_PMBDS_000005_s002',
 'SN_PD_2058': 'ASAP_PMBDS_000005_s003',
 'HIP_PD_1441': 'ASAP_PMBDS_000006_s001',
 'MFG_PD_1441': 'ASAP_PMBDS_000006_s002',
 'SN_PD_1441': 'ASAP_PMBDS_000006_s003',
 'HIP_PD_1344': 'ASAP_PMBDS_000007_s001',
 'MFG_PD_1344': 'ASAP_PMBDS_000007_s002',
 'SN_PD_1344': 'ASAP_PMBDS_000007_s003',
 'HIP_HC_1939': 'ASAP_PMBDS_000008_s001',
 'MFG_HC_1939': 'ASAP_PMBDS_000008_s002',
 'SN_HC_1939': 'ASAP_PMBDS_000008_s003',


Now step through `generate_asap_subject_ids`

```python

output = generate_asap_subject_ids(asapid_mapper,
                                        gp2id_mapper,
                                        sourceid_mapper, 
                                        subject_df)
asapid_mapper, gp2id_mapper,sourceid_mapper, subject_df, n = output

def generate_asap_subject_ids(asapid_mapper:dict,
                             gp2id_mapper:dict,
                             sourceid_mapper:dict, 
                             subject_df:pd.DataFrame) -> tuple[dict,dict,dict,pd.DataFrame,int]:
    """
    generate new unique_ids for new subject_ids in subject_df table, 
    update the id_mapper with the new ids from the data table

    return t
    """

```

In [117]:
n = max([int(v.split("_")[2]) for v in asapid_mapper.values() if v]) + 1
n

127

In [118]:

nstart = n

# ids_df = subject_df[['subject_id','source_subject_id', 'AMPPD_id', 'GP2_id']].copy()
ids_df = subject_df.copy()

# might want to use 'source_subject_id' instead of 'subject_id' since we want to find matches across teams
# shouldn't actually matter but logically cleaner
uniq_subj = ids_df['subject_id'].unique()
dupids_mapper = dict(zip(uniq_subj,
                    [num + nstart for num in range(len(uniq_subj))] ))

n_asap_id_add = 0
n_gp2_id_add = 0
n_source_id_add = 0


start with the first subj_id

In [119]:

df_dup_chunks = []
id_source = []
subj_id = uniq_subj[0]
samp_n = dupids_mapper[subj_id]

df_dups_subset = ids_df[ids_df.subject_id==subj_id].copy()

df_dups_subset

Unnamed: 0,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,BN0009,00-09,,,BSHRI,Human,Male,64,White,Unknown,...,Yes,,,Former smoker,30,33,Dementia,Unknown,Idiopathic PD,


check gp2 IDs

In [120]:

gp2_id = None
add_gp2_id = False
# force skipping of null GP2_ids
if df_dups_subset['GP2_id'].nunique() > 1:
    print(f"subj_id: {subj_id} has multiple gp2_ids: {df_dups_subset['GP2_id'].to_list()}... something is wrong")
    #TODO: log this
elif not df_dups_subset['GP2_id'].dropna().empty: # we have a valide GP2_id
    gp2_id = df_dups_subset['GP2_id'].values[0] # values because index was not reset

if gp2_id in set(gp2id_mapper.keys()):
    asap_subj_id_gp2 = gp2id_mapper[gp2_id]
else:
    add_gp2_id = True
    asap_subj_id_gp2 = None

gp2_id, asap_subj_id_gp2

(None, None)

check source IDs

In [121]:
# check if source_subject_id is known
source_id = None
add_source_id = False
if df_dups_subset['source_subject_id'].nunique() > 1:
    print(f"subj_id: {subj_id} has multiple source ids: {df_dups_subset['source_subject_id'].to_list()}... something is wrong")
    #TODO: log this
elif df_dups_subset['source_subject_id'].isnull().any():
    print(f"subj_id: {subj_id} has no source_id... something is wrong")
    #TODO: log this
else: # we have a valide source_id
    #TODO: check for `source_subject_id` naming collisions with other teams
    #      e.g. check the `biobank_name`
    source_id = df_dups_subset['source_subject_id'].values[0]

if source_id in set(sourceid_mapper.keys()):
    asap_subj_id_source = sourceid_mapper[source_id]
else:
    add_source_id = True
    asap_subj_id_source = None

add_source_id, asap_subj_id_source

(False, 'ASAP_PMBDS_000003')

In [122]:
# check if subj_id is known
add_subj_id = False
# check if subj_id (subject_id) is known
if subj_id in set(asapid_mapper.keys()): # duplicate!!
    # TODO: log this
    # TODO: check for `subject_id` naming collisions with other teams
    asap_subj_id = asapid_mapper[subj_id]
else:
    add_subj_id = True
    asap_subj_id = None

add_subj_id, asap_subj_id

(True, None)

In [123]:
testset = set((asap_subj_id, asap_subj_id_gp2, asap_subj_id_source))
if None in testset:
    testset.remove(None)

In [124]:
# check that asap_subj_id is not disparate between the maps
if len(testset) > 1:
    print(f"collission between our ids: {(asap_subj_id, asap_subj_id_gp2, asap_subj_id_source)=}")
    print(f"this is BAAAAD. could be a naming collision with another team on `subject_id` ")

if len(testset) == 0:  # generate a new asap_subj_id
    # print(samp_n)
    asap_subject_id = f"{STUDY_PREFIX}{samp_n:06}"
    # df_dups_subset.insert(0, 'ASAP_subject_id', asap_subject_id, inplace=True)
else: # testset should have the asap_subj_id
    asap_subject_id = testset.pop() # but where did it come from?
    # print(f"found {subj_id }:{asap_subject_id} in the maps")

asap_subject_id

'ASAP_PMBDS_000003'

In [110]:
src = []
if add_subj_id:
    asapid_mapper[subj_id] = asap_subject_id
    n_asap_id_add += 1
    src.append('asap')

src, asapid_mapper

(['asap'],
 {'HC_1225': 'ASAP_PMBDS_000001',
  'HC_0602': 'ASAP_PMBDS_000002',
  'PD_0009': 'ASAP_PMBDS_000003',
  'PD_1921': 'ASAP_PMBDS_000004',
  'PD_2058': 'ASAP_PMBDS_000005',
  'PD_1441': 'ASAP_PMBDS_000006',
  'PD_1344': 'ASAP_PMBDS_000007',
  'HC_1939': 'ASAP_PMBDS_000008',
  'HC_1308': 'ASAP_PMBDS_000009',
  'HC_1862': 'ASAP_PMBDS_000010',
  'HC_1864': 'ASAP_PMBDS_000011',
  'HC_2057': 'ASAP_PMBDS_000012',
  'HC_2061': 'ASAP_PMBDS_000013',
  'HC_2062': 'ASAP_PMBDS_000014',
  'HC_2067': 'ASAP_PMBDS_000015',
  'PD_0348': 'ASAP_PMBDS_000016',
  'PD_0413': 'ASAP_PMBDS_000017',
  'PD_1312': 'ASAP_PMBDS_000018',
  'PD_1317': 'ASAP_PMBDS_000019',
  'PD_1504': 'ASAP_PMBDS_000020',
  'PD_1858': 'ASAP_PMBDS_000021',
  'PD_1902': 'ASAP_PMBDS_000022',
  'PD_1973': 'ASAP_PMBDS_000023',
  'PD_2005': 'ASAP_PMBDS_000024',
  'PD_2038': 'ASAP_PMBDS_000025',
  'HC01': 'ASAP_PMBDS_000026',
  'HC02': 'ASAP_PMBDS_000027',
  'HC03': 'ASAP_PMBDS_000028',
  'HC04': 'ASAP_PMBDS_000029',
  'HC05': 'ASAP

In [111]:

if add_gp2_id and gp2_id is not None:
    gp2id_mapper[gp2_id] = asap_subject_id
    n_gp2_id_add += 1
    src.append('gp2')

src, gp2id_mapper

(['asap'],
 {'MDGAP-QSBB_000096_s1': 'ASAP_PMBDS_000038',
  'MDGAP-QSBB_000046_s1': 'ASAP_PMBDS_000040',
  'MDGAP-QSBB_000070_s1': 'ASAP_PMBDS_000041',
  'MDGAP-QSBB_000592_s1': 'ASAP_PMBDS_000042',
  'MDGAP-QSBB_000547_s1': 'ASAP_PMBDS_000043',
  'MDGAP-QSBB_000482_s1': 'ASAP_PMBDS_000044',
  'MDGAP-QSBB_000239_s1': 'ASAP_PMBDS_000045',
  'MDGAP-QSBB_000486_s1': 'ASAP_PMBDS_000046',
  'MDGAP-QSBB_000081_s1': 'ASAP_PMBDS_000047',
  'MDGAP-QSBB_000272_s1': 'ASAP_PMBDS_000048',
  'MDGAP-QSBB_000362_s1': 'ASAP_PMBDS_000049',
  'MDGAP-QSBB_000496_s1': 'ASAP_PMBDS_000050',
  'MDGAP-QSBB_000363_s1': 'ASAP_PMBDS_000051',
  'MDGAP-QSBB_000608_s1': 'ASAP_PMBDS_000052',
  'MDGAP-QSBB_000597_s1': 'ASAP_PMBDS_000053',
  'MDGAP-QSBB_000468_s1': 'ASAP_PMBDS_000054',
  'MDGAP-QSBB_000406_s1': 'ASAP_PMBDS_000055',
  'MDGAP-QSBB_000514_s1': 'ASAP_PMBDS_000056',
  'MDGAP-QSBB_000412_s1': 'ASAP_PMBDS_000057',
  'MDGAP-QSBB_000307_s1': 'ASAP_PMBDS_000058',
  'MDGAP-QSBB_000593_s1': 'ASAP_PMBDS_000059',
  

In [112]:

if add_source_id and source_id is not None:   
    sourceid_mapper[source_id] = asap_subject_id
    n_source_id_add += 1
    src.append('source')

src, sourceid_mapper

(['asap'],
 {'12-25': 'ASAP_PMBDS_000001',
  '06-02': 'ASAP_PMBDS_000002',
  '00-09': 'ASAP_PMBDS_000003',
  '19-21': 'ASAP_PMBDS_000004',
  '20-58': 'ASAP_PMBDS_000005',
  '14-41': 'ASAP_PMBDS_000006',
  '13-44': 'ASAP_PMBDS_000007',
  '19-39': 'ASAP_PMBDS_000008',
  '13-08': 'ASAP_PMBDS_000009',
  '18-62': 'ASAP_PMBDS_000010',
  '18-64': 'ASAP_PMBDS_000011',
  '20-57': 'ASAP_PMBDS_000012',
  '20-61': 'ASAP_PMBDS_000013',
  '20-62': 'ASAP_PMBDS_000014',
  '20-67': 'ASAP_PMBDS_000015',
  '03-48': 'ASAP_PMBDS_000016',
  '04-13': 'ASAP_PMBDS_000017',
  '13-12': 'ASAP_PMBDS_000018',
  '13-17': 'ASAP_PMBDS_000019',
  '15-04': 'ASAP_PMBDS_000020',
  '18-58': 'ASAP_PMBDS_000021',
  '19-02': 'ASAP_PMBDS_000022',
  '19-73': 'ASAP_PMBDS_000023',
  '20-05': 'ASAP_PMBDS_000024',
  '20-38': 'ASAP_PMBDS_000025',
  'HSDG07': 'ASAP_PMBDS_000026',
  'HSDG13': 'ASAP_PMBDS_000027',
  'HSDG101': 'ASAP_PMBDS_000028',
  'HSDG10': 'ASAP_PMBDS_000029',
  'HSDG30': 'ASAP_PMBDS_000030',
  'HSDG99': 'ASAP_PMBDS

strange... everything looks okay... where does it find NP16-161, etc...

In [58]:
df_dup_chunks = []
id_source = []
for subj_id, samp_n in dupids_mapper.items():
    df_dups_subset = ids_df[ids_df.subject_id==subj_id].copy()

    # check if gp2_id is known
    # NOTE:  the gp2_id _might_ not be the GP2ID, but instead the GP2sampleID
    #        we might want to check for a trailing _s\d+ and remove it
    #        need to check w/ GP2 team about this.  The RepNo might be sample timepoint... 
    #        and hence be a "subject" in our context
    #    # df['GP2ID'] = df['GP2sampleID'].apply(lambda x: ("_").join(x.split("_")[:-1]))
    #    # df['SampleRepNo'] = df['GP2sampleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))

    gp2_id = None
    add_gp2_id = False
    # force skipping of null GP2_ids
    if df_dups_subset['GP2_id'].nunique() > 1:
        print(f"subj_id: {subj_id} has multiple gp2_ids: {df_dups_subset['GP2_id'].to_list()}... something is wrong")
        #TODO: log this
    elif not df_dups_subset['GP2_id'].dropna().empty: # we have a valide GP2_id
        gp2_id = df_dups_subset['GP2_id'].values[0] # values because index was not reset

    if gp2_id in set(gp2id_mapper.keys()):
        asap_subj_id_gp2 = gp2id_mapper[gp2_id]
    else:
        add_gp2_id = True
        asap_subj_id_gp2 = None

    # check if source_id is known
    source_id = None
    add_source_id = False
    if df_dups_subset['source_subject_id'].nunique() > 1:
        print(f"subj_id: {subj_id} has multiple source ids: {df_dups_subset['source_subject_id'].to_list()}... something is wrong")
        #TODO: log this
    elif df_dups_subset['source_subject_id'].isnull().any():
        print(f"subj_id: {subj_id} has no source_id... something is wrong")
        print(samp_n)
        break
        #TODO: log this
    else: # we have a valide source_id
        #TODO: check for `source_subject_id` naming collisions with other teams
        #      e.g. check the `biobank_name`
        source_id = df_dups_subset['source_subject_id'].values[0]

subj_id: NP16-161 has no source_id... something is wrong
104


In [59]:
df_dups_subset

Unnamed: 0,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
2,NP16-161,,,,,Human,Male,72,,,...,,,,,,,,,No PD nor other neurological disorder,


In [125]:

df_dup_chunks = []
id_source = []
for subj_id, samp_n in dupids_mapper.items():
    df_dups_subset = ids_df[ids_df.subject_id==subj_id].copy()

    # check if gp2_id is known
    # NOTE:  the gp2_id _might_ not be the GP2ID, but instead the GP2sampleID
    #        we might want to check for a trailing _s\d+ and remove it
    #        need to check w/ GP2 team about this.  The RepNo might be sample timepoint... 
    #        and hence be a "subject" in our context
    #    # df['GP2ID'] = df['GP2sampleID'].apply(lambda x: ("_").join(x.split("_")[:-1]))
    #    # df['SampleRepNo'] = df['GP2sampleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))

    gp2_id = None
    add_gp2_id = False
    # force skipping of null GP2_ids
    if df_dups_subset['GP2_id'].nunique() > 1:
        print(f"subj_id: {subj_id} has multiple gp2_ids: {df_dups_subset['GP2_id'].to_list()}... something is wrong")
        #TODO: log this
    elif not df_dups_subset['GP2_id'].dropna().empty: # we have a valide GP2_id
        gp2_id = df_dups_subset['GP2_id'].values[0] # values because index was not reset

    if gp2_id in set(gp2id_mapper.keys()):
        asap_subj_id_gp2 = gp2id_mapper[gp2_id]
    else:
        add_gp2_id = True
        asap_subj_id_gp2 = None

    # check if source_id is known
    source_id = None
    add_source_id = False
    if df_dups_subset['source_subject_id'].nunique() > 1:
        print(f"subj_id: {subj_id} has multiple source ids: {df_dups_subset['source_subject_id'].to_list()}... something is wrong")
        #TODO: log this
    elif df_dups_subset['source_subject_id'].isnull().any():
        print(f"subj_id: {subj_id} has no source_id... something is wrong")
        print(samp_n)
        #TODO: log this
    else: # we have a valide source_id
        #TODO: check for `source_subject_id` naming collisions with other teams
        #      e.g. check the `biobank_name`
        source_id = df_dups_subset['source_subject_id'].values[0]

    if source_id in set(sourceid_mapper.keys()):
        asap_subj_id_source = sourceid_mapper[source_id]
    else:
        add_source_id = True
        asap_subj_id_source = None

    # TODO: add AMPPD_id test/mapper 

    # check if subj_id is known
    add_subj_id = False
    # check if subj_id (subject_id) is known
    if subj_id in set(asapid_mapper.keys()): # duplicate!!
        # TODO: log this
        # TODO: check for `subject_id` naming collisions with other teams
        asap_subj_id = asapid_mapper[subj_id]
    else:
        add_subj_id = True
        asap_subj_id = None

    # TODO:  improve the logic here so gp2 is the default if it exists.?
    #        we need to check the team_id to make sure it's not a naming collision on subject_id
    #        we need to check the biobank_name to make sure it's not a naming collision on source_subject_id

    testset = set((asap_subj_id, asap_subj_id_gp2, asap_subj_id_source))
    if None in testset:
        testset.remove(None)

    # check that asap_subj_id is not disparate between the maps
    if len(testset) > 1:
        print(f"collission between our ids: {(asap_subj_id, asap_subj_id_gp2, asap_subj_id_source)=}")
        print(f"this is BAAAAD. could be a naming collision with another team on `subject_id` ")

    if len(testset) == 0:  # generate a new asap_subj_id
        # print(samp_n)
        asap_subject_id = f"{STUDY_PREFIX}{samp_n:06}"
        # df_dups_subset.insert(0, 'ASAP_subject_id', asap_subject_id, inplace=True)
    else: # testset should have the asap_subj_id
        asap_subject_id = testset.pop() # but where did it come from?
        print(f"found {subj_id }:{asap_subject_id} in the maps, src{src}")
    
    src = []
    if add_subj_id:
        asapid_mapper[subj_id] = asap_subject_id
        n_asap_id_add += 1
        src.append('asap')

    if add_gp2_id and gp2_id is not None:
        gp2id_mapper[gp2_id] = asap_subject_id
        n_gp2_id_add += 1
        src.append('gp2')

    if add_source_id and source_id is not None:   
        sourceid_mapper[source_id] = asap_subject_id
        n_source_id_add += 1
        src.append('source')

    
    df_dup_chunks.append(df_dups_subset)
    id_source.append(src)


found BN0009:ASAP_PMBDS_000003 in the maps, src['asap']
found BN0348:ASAP_PMBDS_000016 in the maps, src['asap', 'source']
found BN0602:ASAP_PMBDS_000002 in the maps, src['asap', 'source']
found BN1308:ASAP_PMBDS_000009 in the maps, src['asap', 'source']
found BN1317:ASAP_PMBDS_000019 in the maps, src['asap']
found BN1504:ASAP_PMBDS_000020 in the maps, src['asap', 'source']
found BN1862:ASAP_PMBDS_000010 in the maps, src['asap', 'source']
found BN1902:ASAP_PMBDS_000022 in the maps, src['asap', 'source']
found BN1939:ASAP_PMBDS_000008 in the maps, src['asap', 'source']


In [126]:
asapid_mapper

{'HC_1225': 'ASAP_PMBDS_000001',
 'HC_0602': 'ASAP_PMBDS_000002',
 'PD_0009': 'ASAP_PMBDS_000003',
 'PD_1921': 'ASAP_PMBDS_000004',
 'PD_2058': 'ASAP_PMBDS_000005',
 'PD_1441': 'ASAP_PMBDS_000006',
 'PD_1344': 'ASAP_PMBDS_000007',
 'HC_1939': 'ASAP_PMBDS_000008',
 'HC_1308': 'ASAP_PMBDS_000009',
 'HC_1862': 'ASAP_PMBDS_000010',
 'HC_1864': 'ASAP_PMBDS_000011',
 'HC_2057': 'ASAP_PMBDS_000012',
 'HC_2061': 'ASAP_PMBDS_000013',
 'HC_2062': 'ASAP_PMBDS_000014',
 'HC_2067': 'ASAP_PMBDS_000015',
 'PD_0348': 'ASAP_PMBDS_000016',
 'PD_0413': 'ASAP_PMBDS_000017',
 'PD_1312': 'ASAP_PMBDS_000018',
 'PD_1317': 'ASAP_PMBDS_000019',
 'PD_1504': 'ASAP_PMBDS_000020',
 'PD_1858': 'ASAP_PMBDS_000021',
 'PD_1902': 'ASAP_PMBDS_000022',
 'PD_1973': 'ASAP_PMBDS_000023',
 'PD_2005': 'ASAP_PMBDS_000024',
 'PD_2038': 'ASAP_PMBDS_000025',
 'HC01': 'ASAP_PMBDS_000026',
 'HC02': 'ASAP_PMBDS_000027',
 'HC03': 'ASAP_PMBDS_000028',
 'HC04': 'ASAP_PMBDS_000029',
 'HC05': 'ASAP_PMBDS_000030',
 'HC06': 'ASAP_PMBDS_0000

In [76]:


df_dups_wids = pd.concat(df_dup_chunks)

df_dups_wids

Unnamed: 0,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,NP16-140,BB16.0002,,,,Human,Male,71,,,...,,,,,,,,,,
1,NP16-160,BB16.0012,,,,Human,Male,88,,,...,,,,,,,,,,
2,NP16-161,,,,,Human,Male,72,,,...,,,,,,,,,No PD nor other neurological disorder,
3,NP16-162,BB16.0014,,,,Human,Male,74,,,...,,,,,,,,,,
4,NP16-164,,,,,Human,Male,75,,,...,,,,,,,,,No PD nor other neurological disorder,
5,NP16-21,,,,,Human,Male,83,,,...,,,,,,,,,No PD nor other neurological disorder,
6,NP16-25,BB16.0068,,,,Human,Male,85,,,...,,,,,,,,,,
7,NP16-269,BB16.0037,,,,Human,Male,73,,,...,,,,,,,,,,
8,NP17-191,BB17.0045,,,,Human,Male,73,,,...,,,,,,,,,,
9,NP17-256,BB17.0066,,,,Human,Female,72,,,...,,,,,,,,,No PD nor other neurological disorder,


In [64]:
asapid_mapper

{'HC_1225': 'ASAP_PMBDS_000001',
 'HC_0602': 'ASAP_PMBDS_000002',
 'PD_0009': 'ASAP_PMBDS_000003',
 'PD_1921': 'ASAP_PMBDS_000004',
 'PD_2058': 'ASAP_PMBDS_000005',
 'PD_1441': 'ASAP_PMBDS_000006',
 'PD_1344': 'ASAP_PMBDS_000007',
 'HC_1939': 'ASAP_PMBDS_000008',
 'HC_1308': 'ASAP_PMBDS_000009',
 'HC_1862': 'ASAP_PMBDS_000010',
 'HC_1864': 'ASAP_PMBDS_000011',
 'HC_2057': 'ASAP_PMBDS_000012',
 'HC_2061': 'ASAP_PMBDS_000013',
 'HC_2062': 'ASAP_PMBDS_000014',
 'HC_2067': 'ASAP_PMBDS_000015',
 'PD_0348': 'ASAP_PMBDS_000016',
 'PD_0413': 'ASAP_PMBDS_000017',
 'PD_1312': 'ASAP_PMBDS_000018',
 'PD_1317': 'ASAP_PMBDS_000019',
 'PD_1504': 'ASAP_PMBDS_000020',
 'PD_1858': 'ASAP_PMBDS_000021',
 'PD_1902': 'ASAP_PMBDS_000022',
 'PD_1973': 'ASAP_PMBDS_000023',
 'PD_2005': 'ASAP_PMBDS_000024',
 'PD_2038': 'ASAP_PMBDS_000025',
 'HC01': 'ASAP_PMBDS_000026',
 'HC02': 'ASAP_PMBDS_000027',
 'HC03': 'ASAP_PMBDS_000028',
 'HC04': 'ASAP_PMBDS_000029',
 'HC05': 'ASAP_PMBDS_000030',
 'HC06': 'ASAP_PMBDS_0000

In [78]:
df_dups_wids['subject_id'].map(asapid_mapper)

0     ASAP_PMBDS_000102
1     ASAP_PMBDS_000103
2     ASAP_PMBDS_000104
3     ASAP_PMBDS_000105
4     ASAP_PMBDS_000106
5     ASAP_PMBDS_000107
6     ASAP_PMBDS_000108
7     ASAP_PMBDS_000109
8     ASAP_PMBDS_000110
9     ASAP_PMBDS_000111
10    ASAP_PMBDS_000112
11    ASAP_PMBDS_000113
12    ASAP_PMBDS_000114
13    ASAP_PMBDS_000113
14    ASAP_PMBDS_000116
15    ASAP_PMBDS_000117
16    ASAP_PMBDS_000118
17    ASAP_PMBDS_000119
18    ASAP_PMBDS_000120
19    ASAP_PMBDS_000121
20    ASAP_PMBDS_000122
21    ASAP_PMBDS_000123
22    ASAP_PMBDS_000124
23    ASAP_PMBDS_000125
24    ASAP_PMBDS_000126
Name: subject_id, dtype: object

In [79]:

ASAP_subject_id = df_dups_wids['subject_id'].map(asapid_mapper)
df_dups_wids.insert(0, 'ASAP_subject_id', ASAP_subject_id)

df_dups_wids

Unnamed: 0,ASAP_subject_id,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,ASAP_PMBDS_000102,NP16-140,BB16.0002,,,,Human,Male,71,,...,,,,,,,,,,
1,ASAP_PMBDS_000103,NP16-160,BB16.0012,,,,Human,Male,88,,...,,,,,,,,,,
2,ASAP_PMBDS_000104,NP16-161,,,,,Human,Male,72,,...,,,,,,,,,No PD nor other neurological disorder,
3,ASAP_PMBDS_000105,NP16-162,BB16.0014,,,,Human,Male,74,,...,,,,,,,,,,
4,ASAP_PMBDS_000106,NP16-164,,,,,Human,Male,75,,...,,,,,,,,,No PD nor other neurological disorder,
5,ASAP_PMBDS_000107,NP16-21,,,,,Human,Male,83,,...,,,,,,,,,No PD nor other neurological disorder,
6,ASAP_PMBDS_000108,NP16-25,BB16.0068,,,,Human,Male,85,,...,,,,,,,,,,
7,ASAP_PMBDS_000109,NP16-269,BB16.0037,,,,Human,Male,73,,...,,,,,,,,,,
8,ASAP_PMBDS_000110,NP17-191,BB17.0045,,,,Human,Male,73,,...,,,,,,,,,,
9,ASAP_PMBDS_000111,NP17-256,BB17.0066,,,,Human,Female,72,,...,,,,,,,,,No PD nor other neurological disorder,


In [80]:

print(f"added {n_asap_id_add} new asap_subject_ids")
print(f"added {n_gp2_id_add} new gp2_ids")
print(f"added {n_source_id_add} new source_ids")

# print(id_source)

    # return asapid_mapper, gp2id_mapper, sourceid_mapper, df_dups_wids, nstart

added 25 new asap_subject_ids
added 0 new gp2_ids
added 18 new source_ids


In [45]:

# add ASAP_subject_id to the SUBJECT tables
subject_path = table_path / "SUBJECT.csv"
if subject_path.exists():
    subject_df = read_meta_table(subject_path, dtypes_dict)
    output = generate_asap_subject_ids(asapid_mapper,
                                        gp2id_mapper,
                                        sourceid_mapper, 
                                        subject_df)
    asapid_mapper, gp2id_mapper,sourceid_mapper, subject_df, n = output
    # # add ASAP_dataset_id = DATASET_ID to the SUBJECT tables
    # subject_df['ASAP_dataset_id'] = DATASET_ID
else:
    subject_df = None
    print(f"{subject_path} not found... aborting")


added 64 new asap_subject_ids
added 62 new gp2_ids
added 64 new source_ids


In [81]:
asapid_mapper

{'HC_1225': 'ASAP_PMBDS_000001',
 'HC_0602': 'ASAP_PMBDS_000002',
 'PD_0009': 'ASAP_PMBDS_000003',
 'PD_1921': 'ASAP_PMBDS_000004',
 'PD_2058': 'ASAP_PMBDS_000005',
 'PD_1441': 'ASAP_PMBDS_000006',
 'PD_1344': 'ASAP_PMBDS_000007',
 'HC_1939': 'ASAP_PMBDS_000008',
 'HC_1308': 'ASAP_PMBDS_000009',
 'HC_1862': 'ASAP_PMBDS_000010',
 'HC_1864': 'ASAP_PMBDS_000011',
 'HC_2057': 'ASAP_PMBDS_000012',
 'HC_2061': 'ASAP_PMBDS_000013',
 'HC_2062': 'ASAP_PMBDS_000014',
 'HC_2067': 'ASAP_PMBDS_000015',
 'PD_0348': 'ASAP_PMBDS_000016',
 'PD_0413': 'ASAP_PMBDS_000017',
 'PD_1312': 'ASAP_PMBDS_000018',
 'PD_1317': 'ASAP_PMBDS_000019',
 'PD_1504': 'ASAP_PMBDS_000020',
 'PD_1858': 'ASAP_PMBDS_000021',
 'PD_1902': 'ASAP_PMBDS_000022',
 'PD_1973': 'ASAP_PMBDS_000023',
 'PD_2005': 'ASAP_PMBDS_000024',
 'PD_2038': 'ASAP_PMBDS_000025',
 'HC01': 'ASAP_PMBDS_000026',
 'HC02': 'ASAP_PMBDS_000027',
 'HC03': 'ASAP_PMBDS_000028',
 'HC04': 'ASAP_PMBDS_000029',
 'HC05': 'ASAP_PMBDS_000030',
 'HC06': 'ASAP_PMBDS_0000

Subject_ids look sensible. .. Now we need to look at the sample IDs

In [82]:
sampleid_mapper

{'HIP_HC_1225': 'ASAP_PMBDS_000001_s001',
 'MFG_HC_1225': 'ASAP_PMBDS_000001_s002',
 'SN_HC_1225': 'ASAP_PMBDS_000001_s003',
 'HIP_HC_0602': 'ASAP_PMBDS_000002_s001',
 'MFG_HC_0602': 'ASAP_PMBDS_000002_s002',
 'SN_HC_0602': 'ASAP_PMBDS_000002_s003',
 'HIP_PD_0009': 'ASAP_PMBDS_000003_s001',
 'MFG_PD_0009': 'ASAP_PMBDS_000003_s002',
 'SN_PD_0009': 'ASAP_PMBDS_000003_s003',
 'HIP_PD_1921': 'ASAP_PMBDS_000004_s001',
 'MFG_PD_1921': 'ASAP_PMBDS_000004_s002',
 'SN_PD_1921': 'ASAP_PMBDS_000004_s003',
 'HIP_PD_2058': 'ASAP_PMBDS_000005_s001',
 'MFG_PD_2058': 'ASAP_PMBDS_000005_s002',
 'SN_PD_2058': 'ASAP_PMBDS_000005_s003',
 'HIP_PD_1441': 'ASAP_PMBDS_000006_s001',
 'MFG_PD_1441': 'ASAP_PMBDS_000006_s002',
 'SN_PD_1441': 'ASAP_PMBDS_000006_s003',
 'HIP_PD_1344': 'ASAP_PMBDS_000007_s001',
 'MFG_PD_1344': 'ASAP_PMBDS_000007_s002',
 'SN_PD_1344': 'ASAP_PMBDS_000007_s003',
 'HIP_HC_1939': 'ASAP_PMBDS_000008_s001',
 'MFG_HC_1939': 'ASAP_PMBDS_000008_s002',
 'SN_HC_1939': 'ASAP_PMBDS_000008_s003',


simulate 
```python 

   # add ASAP_sample_id and ASAP_dataset_id to the SAMPLE tables
    sample_path = table_path / "SAMPLE.csv"
    if sample_path.exists():
        sample_df = read_meta_table(sample_path, dtypes_dict)
        sampleid_mapper, sample_df = generate_asap_sample_ids(asapid_mapper, sample_df, sampleid_mapper)
        sample_df['ASAP_dataset_id'] = DATASET_ID
    else:
        sample_df = None
        print(f"{sample_path} not found... aborting")
        return 0


```

In [83]:
sample_path = table_path / "SAMPLE.csv"
sample_df = read_meta_table(sample_path, dtypes_dict)

Now step through `generate_asap_sample_ids`

```python

sampleid_mapper, sample_df = generate_asap_sample_ids(asapid_mapper, sample_df, sampleid_mapper)


def generate_asap_sample_ids(asapid_mapper:dict, 
                             sample_df:pd.DataFrame, 
                             sampleid_mapper:dict) -> tuple[dict, pd.DataFrame]:
    """
    generate new unique_ids for new sample_ids in sample_df table, 
    update the id_mapper with the new ids from the data table

    return the updated id_mapper and updated sample_df
    """

```

In [84]:

ud_sampleid_mapper = sampleid_mapper.copy()
# 
uniq_subj = sample_df.subject_id.unique()
# check for subject_id collisions in the sampleid_mapper
if intersec := set(uniq_subj) & set(ud_sampleid_mapper.values()): 
    print(f"found {len(intersec)} subject_id collisions in the sampleid_mapper")
        
uniq_subj

array(['NP16-162', 'NP16-25', 'P73', 'P74', 'NP16-161', 'NP16-164',
       'NP16-21', 'PT231', 'NP16-140', 'NP16-160', 'NP16-269', 'NP17-94',
       'NP17-191', 'NP18-117', 'NP18-287', 'NP18-304', 'NP19-16',
       'NP19-23', 'NP19-108', 'NP19-255', 'NP17-256', 'NP18-159',
       'NP19-36', 'NP19-37', 'NP19-45'], dtype=object)

don't forget to define our helper functions fro reading out the asap_subj_id parts

In [85]:
def get_sampr(v):
    return int(v.split("_")[3].replace("s","")) 

def get_id(v):
    return v[:17] 


grab the first subj_id and see what our dup/nodup looks like

In [86]:
subj_id = uniq_subj[0]
df_subset = sample_df[sample_df.subject_id==subj_id].copy()
asap_id = asapid_mapper[subj_id]

dups = df_subset[df_subset.duplicated(keep=False, subset=['sample_id'])].sort_values('sample_id').reset_index(drop = True).copy()
nodups = df_subset[~df_subset.duplicated(keep=False, subset=['sample_id'])].sort_values('sample_id').reset_index(drop = True).copy()

asap_id = asapid_mapper[subj_id]

asap_id

'ASAP_PMBDS_000105'

In [87]:
nodups

Unnamed: 0,sample_id,subject_id,source_sample_id,replicate,replicate_count,repeated_sample,batch,tissue,brain_region,hemisphere,...,sex_ontology_term_id,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV200,pm_PH,donor_id
0,ASAP1_PD_NP16-162_SN,NP16-162,NP16-162 SN,1,1,0,1,Brain,Substantia nigra,,...,PATO:0000384 (male),,,UBERON:0002038,,EFO:0008443,Nucleus,,,
1,ASAP64_PD_NP16-162_PUT,NP16-162,NP16-162 PUT,1,1,0,8,Brain,Putamen,,...,PATO:0000384 (male),,,UBERON:0001874,,EFO:0008443,Nucleus,,,
2,ASAP9_PD_NP16-162_PFCTX,NP16-162,NP16-162 PFC,1,1,0,2,Brain,Prefrontal cortex,,...,PATO:0000384 (male),,,UBERON:0000451,,EFO:0008443,Nucleus,,,


Now we need to figure out where to start incrimenting our sample IDs 

In [88]:
sns = [get_sampr(v) for v in ud_sampleid_mapper.values() if get_id(v)==asap_id]
if len(sns) > 0: 
    rep_n = max(sns) + 1            
else: 
    rep_n = 1 

rep_n

1

In [90]:
ud_sampleid_mapper

{'HIP_HC_1225': 'ASAP_PMBDS_000001_s001',
 'MFG_HC_1225': 'ASAP_PMBDS_000001_s002',
 'SN_HC_1225': 'ASAP_PMBDS_000001_s003',
 'HIP_HC_0602': 'ASAP_PMBDS_000002_s001',
 'MFG_HC_0602': 'ASAP_PMBDS_000002_s002',
 'SN_HC_0602': 'ASAP_PMBDS_000002_s003',
 'HIP_PD_0009': 'ASAP_PMBDS_000003_s001',
 'MFG_PD_0009': 'ASAP_PMBDS_000003_s002',
 'SN_PD_0009': 'ASAP_PMBDS_000003_s003',
 'HIP_PD_1921': 'ASAP_PMBDS_000004_s001',
 'MFG_PD_1921': 'ASAP_PMBDS_000004_s002',
 'SN_PD_1921': 'ASAP_PMBDS_000004_s003',
 'HIP_PD_2058': 'ASAP_PMBDS_000005_s001',
 'MFG_PD_2058': 'ASAP_PMBDS_000005_s002',
 'SN_PD_2058': 'ASAP_PMBDS_000005_s003',
 'HIP_PD_1441': 'ASAP_PMBDS_000006_s001',
 'MFG_PD_1441': 'ASAP_PMBDS_000006_s002',
 'SN_PD_1441': 'ASAP_PMBDS_000006_s003',
 'HIP_PD_1344': 'ASAP_PMBDS_000007_s001',
 'MFG_PD_1344': 'ASAP_PMBDS_000007_s002',
 'SN_PD_1344': 'ASAP_PMBDS_000007_s003',
 'HIP_HC_1939': 'ASAP_PMBDS_000008_s001',
 'MFG_HC_1939': 'ASAP_PMBDS_000008_s002',
 'SN_HC_1939': 'ASAP_PMBDS_000008_s003',


In [91]:
dups

Unnamed: 0,sample_id,subject_id,source_sample_id,replicate,replicate_count,repeated_sample,batch,tissue,brain_region,hemisphere,...,sex_ontology_term_id,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV200,pm_PH,donor_id


In [92]:
df_chunks = []
for subj_id in uniq_subj:

    df_subset = sample_df[sample_df.subject_id==subj_id].copy()
    asap_id = asapid_mapper[subj_id]

    dups = df_subset[df_subset.duplicated(keep=False, subset=['sample_id'])].sort_values('sample_id').reset_index(drop = True).copy()
    nodups = df_subset[~df_subset.duplicated(keep=False, subset=['sample_id'])].sort_values('sample_id').reset_index(drop = True).copy()

    asap_id = asapid_mapper[subj_id]

    ## Here this should always be true. so n != 1
    if bool(ud_sampleid_mapper):
        # see if there are any samples already with this asap_id
        sns = [get_sampr(v) for v in ud_sampleid_mapper.values() if get_id(v)==asap_id]
        if len(sns) > 0: 
            rep_n = max(sns) + 1            
        else: 
            rep_n = 1   # start incrimenting from 1        
    else: # empty dicitonary. starting from scratch
        rep_n = 1


    if nodups.shape[0]>0:
        # ASSIGN IDS
        asap_nodups = [f'{asap_id}_s{rep_n+i:03}' for i in range(nodups.shape[0])]
        # nodups['ASAP_sample_id'] = asap_nodups
        nodups.loc[:, 'ASAP_sample_id'] = asap_nodups
        rep_n = rep_n + nodups.shape[0]
        samples_nodups = nodups['sample_id'].unique()

        nodup_mapper = dict(zip(nodups['sample_id'],asap_nodups))


        df_chunks.append(nodups)
    else:
        samples_nodups = []

    if dups.shape[0]>0:
        for dup_id in dups['sample_id'].unique():
            # first peel of any sample_ids that were already named in nodups, 

            if dup_id in samples_nodups:
                asap_dup = nodup_mapper[dup_id]                    
            else:
                # then assign ids to the rest.
                asap_dup = f'{asap_id}_s{rep_n:03}'
                dups.loc[dups.sample_id==dup_id, 'ASAP_sample_id'] = asap_dup
                rep_n += 1
        df_chunks.append(dups)


df_wids = pd.concat(df_chunks)
id_mapper = dict(zip(df_wids['sample_id'],
                    df_wids['ASAP_sample_id']))

ud_sampleid_mapper.update(id_mapper)


out_df = sample_df.copy()
ASAP_sample_id = out_df['sample_id'].map(ud_sampleid_mapper)
out_df.insert(0, 'ASAP_sample_id', ASAP_sample_id)


In [93]:
out_df

Unnamed: 0,ASAP_sample_id,sample_id,subject_id,source_sample_id,replicate,replicate_count,repeated_sample,batch,tissue,brain_region,...,sex_ontology_term_id,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV200,pm_PH,donor_id
0,ASAP_PMBDS_000105_s001,ASAP1_PD_NP16-162_SN,NP16-162,NP16-162 SN,1,1,0,1,Brain,Substantia nigra,...,PATO:0000384 (male),,,UBERON:0002038,,EFO:0008443,Nucleus,,,
1,ASAP_PMBDS_000108_s003,ASAP2_PD_NP16-25_SN,NP16-25,NP16-25 SN,1,1,0,1,Brain,Substantia nigra,...,PATO:0000384 (male),,,UBERON:0002038,,EFO:0008443,Nucleus,,,
2,ASAP_PMBDS_000124_s002,ASAP3_PD_P73_SN,P73,P73 SN,1,1,0,1,Brain,Substantia nigra,...,PATO:0000383 (female),,,UBERON:0002038,,EFO:0008443,Nucleus,,,
3,ASAP_PMBDS_000125_s002,ASAP4_PD_P74_SN,P74,P74 SN,1,1,0,1,Brain,Substantia nigra,...,PATO:0000384 (male),,,UBERON:0002038,,EFO:0008443,Nucleus,,,
4,ASAP_PMBDS_000104_s002,ASAP5_Ctl_NP16-161_SN,NP16-161,NP16-161 SN,1,1,0,1,Brain,Substantia nigra,...,PATO:0000384 (male),,,UBERON:0002038,,EFO:0008443,Nucleus,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66,ASAP_PMBDS_000117_s004,ASAP73_PD_NP19-108_PUT,NP19-108,NP19-108 PUT,1,1,0,9,Brain,Putamen,...,PATO:0000383 (female),,,UBERON:0001874,,EFO:0008443,Nucleus,,,
67,ASAP_PMBDS_000120_s003,ASAP74_PD_NP19-255_PUT,NP19-255,NP19-255 PUT,1,1,0,8,Brain,Putamen,...,PATO:0000384 (male),,,UBERON:0001874,,EFO:0008443,Nucleus,,,
68,ASAP_PMBDS_000114_s003,ASAP76_Ctl_NP18-159_PUT,NP18-159,NP18-159 PUT,1,1,0,9,Brain,Putamen,...,PATO:0000383 (female),,,UBERON:0001874,,EFO:0008443,Nucleus,,,
69,ASAP_PMBDS_000121_s003,ASAP77_Ctl_NP19-36_PUT,NP19-36,NP19-36 PUT,1,1,0,9,Brain,Putamen,...,PATO:0000384 (male),,,UBERON:0001874,,EFO:0008443,Nucleus,,,


In [8]:
tmp_dir = Path.cwd() / "tmp"
if not tmp_dir.exists():
    tmp_dir.mkdir(parents=True, exist_ok=True)


In [10]:
!gcloud auth activate-service-account --key-file=/Users/ergonyc/Projects/ASAP/wf-credentials.json
!gsutil  cp -r "gs://asap-workflow-dev/metadata/v2" "./tmp/"

Activated service account credentials for: [admin-workflow-dev@dnastack-asap-parkinsons.iam.gserviceaccount.com]
Copying gs://asap-workflow-dev/metadata/v2/hafler/CLINPATH.csv...
Copying gs://asap-workflow-dev/metadata/v2/hafler/DATA.csv...                   
Copying gs://asap-workflow-dev/metadata/v2/hafler/PROTOCOL.csv...               
Copying gs://asap-workflow-dev/metadata/v2/hafler/SAMPLE.csv...                 
- [4 files][ 11.3 KiB/ 11.3 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://asap-workflow-dev/metadata/v2/hafler/STUDY.csv...
Copying gs://asap-workflow-dev/metadata/v2/hafler/SUBJECT.csv...                
Copying gs://asap-workflow-dev/metadata/v2/hafler/v2_20231128/CLINPATH.csv...   
Copying gs://asap-wo

Make a pandas DataFrame from the mapper

In [84]:
subject_level_df = pd.DataFrame.from_dict(asapid_mapper, orient='index', columns = ['ASAP_subject_id']).rename_axis('subject_id').reset_index().astype(str)

subject_level_df

Unnamed: 0,subject_id,ASAP_subject_id
0,HC_1225,ASAP_PMBDS_000001
1,HC_0602,ASAP_PMBDS_000002
2,PD_0009,ASAP_PMBDS_000003
3,PD_1921,ASAP_PMBDS_000004
4,PD_2058,ASAP_PMBDS_000005
...,...,...
62,dup_2nafam,ASAP_PMBDS_000050
63,dup_2posaj,ASAP_PMBDS_000051
64,dup_2lutit,ASAP_PMBDS_000052
65,dup_2rusiz,ASAP_PMBDS_000053


In [86]:
gp2id_mapper

{'MDGAP-QSBB_000538_s1': 'ASAP_PMBDS_000042',
 'MDGAP-QSBB_000030_s1': 'ASAP_PMBDS_000043',
 'MDGAP-QSBB_000583_s1': 'ASAP_PMBDS_000044',
 'MDGAP-QSBB_000246_s1': 'ASAP_PMBDS_000045',
 'MDGAP-QSBB_000422_s1': 'ASAP_PMBDS_000046',
 'MDGAP-QSBB_000046_s1': 'ASAP_PMBDS_000047',
 'MDGAP-QSBB_000593_s1': 'ASAP_PMBDS_000048',
 'MDGAP-QSBB_000101_s1': 'ASAP_PMBDS_000049',
 'MDGAP-DEMENTIA_000249_s1': 'ASAP_PMBDS_000050',
 'MDGAP-QSBB_000427_s1': 'ASAP_PMBDS_000051',
 'MDGAP-QSBB_000426_s1': 'ASAP_PMBDS_000052',
 'MDGAP-QSBB_000234_s1': 'ASAP_PMBDS_000053',
 'MDGAP-QSBB_000325_s1': 'ASAP_PMBDS_000054'}

In [87]:
sourceid_mapper

{'P67/16': 'ASAP_PMBDS_000042',
 'P56/08': 'ASAP_PMBDS_000043',
 'P8/18': 'ASAP_PMBDS_000044',
 'P13/17': 'ASAP_PMBDS_000045',
 'P46/10': 'ASAP_PMBDS_000046',
 'P95/10': 'ASAP_PMBDS_000047',
 'P82/12': 'ASAP_PMBDS_000048',
 'P30/09': 'ASAP_PMBDS_000049',
 'P17/16': 'ASAP_PMBDS_000050',
 'P47/11': 'ASAP_PMBDS_000051',
 'P47/05': 'ASAP_PMBDS_000052',
 'P11/14': 'ASAP_PMBDS_000053',
 'P26/11': 'ASAP_PMBDS_000054'}

In [77]:
master_df = pd.DataFrame.from_dict(sampleid_mapper, orient='index', columns = ['ASAP_sample_id']).rename_axis('sample_id').reset_index().astype(str)

master_df


                                #              #orient='index',
                                #             columns = ['sample_id','ASAP_sample_id']).astype(str)
                                # #     .rename_axis('master_subject_id').reset_index()\


Unnamed: 0,sample_id,ASAP_sample_id
0,HIP_HC_1225,ASAP_PMBDS_000001_s001
1,MFG_HC_1225,ASAP_PMBDS_000001_s002
2,SN_HC_1225,ASAP_PMBDS_000001_s003
3,HIP_HC_0602,ASAP_PMBDS_000002_s001
4,MFG_HC_0602,ASAP_PMBDS_000002_s002
...,...,...
81,SN_PD_2038,ASAP_PMBDS_000025_s003
82,dup_HIP_PD_2038,ASAP_PMBDS_000025_s004
83,dup_HIP_PD_1344,ASAP_PMBDS_000033_s001
84,dup_HIP_PD_1921,ASAP_PMBDS_000031_s001


## END HERE
Random testing code below...   Mostly CRUD

In [90]:

# new entries
newids_df = ids_df[ids_df['ASAP_subject_id'].isnull()].reset_index(drop = True).copy()

# entries with existing ids
haveids_df = ids_df[~ids_df['ASAP_subject_id'].isnull()].reset_index(drop = True).copy()

In [91]:
data_duplicated = pd.merge(subject_df, master_df, on=['subject_id'], how='inner')

assert data_duplicated.empty, "There are duplicated subject_ids in the data table"

In [92]:
newids_df.empty

False

In [93]:
# newids_df is not Empty # i.e. we have new ids to add

df_dups = ids_df[ids_df.duplicated(keep=False, subset=['subject_id'])].sort_values('subject_id').reset_index(drop = True).copy()
df_nodups = ids_df[~ids_df.duplicated(keep=False, subset=['subject_id'])].sort_values('subject_id').reset_index(drop = True).copy()


# TODO: CODE THE UNPACKER OF THE ASAP_sample_id to deal with duplicates after creating a duplicate
# # unpack the old entries
# df_wids = df_subset[~df_subset['GP2sampleID'].isnull()].reset_index(drop = True).copy()
# df_wids['GP2ID'] = df_wids['GP2sampleID'].apply(lambda x: ("_").join(x.split("_")[:-1]))
# df_wids['SampleRepNo'] = df_wids['GP2sampleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))

df_dups.shape, df_nodups.shape

((26, 27), (13, 27))

In [None]:


dups_df = ids_df[ids_df.duplicated(keep=False, subset=['subject_id'])].sort_values('subject_id').reset_index(drop = True).copy()
if dups_df.shape[0]>0:
    dupids_mapper = dict(zip(dups_df.subject_id.unique(),
                        [num+n for num in range(len(dups_df.subject_id.unique()))]))
    
    dup_df_chunks = []
    for clin_id, id in dupids_mapper.items():
        dups_subset = df_dups[df_dups.clinical_id==clin_id].copy()
        dups_subset['GP2ID'] = [f'{study_code}_{id:06}' for i in range(dups_subset.shape[0])]
        dups_subset['SampleRepNo'] = ['s'+str(i+1) for i in range(dups_subset.shape[0])]
        dups_subset['GP2sampleID'] = dups_subset['GP2ID'] + '_' + dups_subset['SampleRepNo']
        dup_df_chunks.append(df_dups_subset)
    df_dups_wids = pd.concat(df_dup_chunks)

df_nodups = dfproc[~dfproc.duplicated(keep=False, subset=['clinical_id'])].sort_values('clinical_id').reset_index(drop = True).copy()

if df_dups.shape[0]>0:
    n =  len(list(dupids_mapper.values())) + n
else:
    n = n

uids = [str(id) for id in df_nodups['sample_id'].unique()]
mapid = {}
for uid in uids:
    mapid[uid]= n
    n += 1
df_nodups_wids = df_nodups.copy()
df_nodups_wids['uid_idx'] = df_nodups_wids['sample_id'].map(mapid)
df_nodups_wids['GP2ID'] = [f'{study_code}_{i:06}' for i in df_nodups_wids.uid_idx]
df_nodups_wids['uid_idx_cumcount'] = df_nodups_wids.groupby('GP2ID').cumcount() + 1
df_nodups_wids['GP2sampleID'] = df_nodups_wids.GP2ID + '_s' + df_nodups_wids.uid_idx_cumcount.astype('str')
df_nodups_wids['SampleRepNo'] = 's' + df_nodups_wids.uid_idx_cumcount.astype('str')
df_nodups_wids.drop(['uid_idx','uid_idx_cumcount'], axis = 1, inplace = True)

if df_dups.shape[0]>0:
    df_newids = pd.concat([df_dups_wids, df_nodups_wids])
else:
    df_newids = df_nodups_wids


In [None]:

# make new ids for new entries
n=int(max(study_tracker_df['master_GP2sampleID'].to_list()).split("_")[1])+1
df_newids = generategp2ids.getgp2idsv2(df_newids, n, study)
df_subset = pd.concat([df_newids, df_wids], axis = 0)
study_subsets.append(df_subset)
log_new.append(df_newids[['study','clinical_id','sample_id','GP2sampleID']])



In [None]:
    
# newids_df is  Empty # i.e. we ONLY new ids to add
# TO CONSIDER THE CASE IN WHICH WE ONLY HAD DUPLICATE IDS MAPPED ON THE MASTER FILE

# just unpack the entries
df_subset['GP2ID'] = df_subset['GP2sampleID'].apply(lambda x: ("_").join(x.split("_")[:-1]))
df_subset['SampleRepNo'] = df_subset['GP2sampleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))
study_subsets.append(df_subset)



In [None]:

                # Brand new data - NO STUDY TRACKER FOR THIS COHORT
                else:
                    study = study
                    new_clinicaldups = False # Duplicates from master key json are treated differently to brand new data
                    n = 1
                    df_newids = generategp2ids.getgp2idsv2(df_subset, n, study)
                    study_subsets.append(df_newids)












df_dups = dfproc[dfproc.duplicated(keep=False, subset=['clinical_id'])].sort_values('clinical_id').reset_index(drop = True).copy()
if df_dups.shape[0]>0:
    dupids_mapper = dict(zip(df_dups.clinical_id.unique(),
                        [num+n for num in range(len(df_dups.clinical_id.unique()))]))
    
    df_dup_chunks = []
    for clin_id, gp2id in dupids_mapper.items():
        df_dups_subset = df_dups[df_dups.clinical_id==clin_id].copy()
        df_dups_subset['GP2ID'] = [f'{study_code}_{gp2id:06}' for i in range(df_dups_subset.shape[0])]
        df_dups_subset['SampleRepNo'] = ['s'+str(i+1) for i in range(df_dups_subset.shape[0])]
        df_dups_subset['GP2sampleID'] = df_dups_subset['GP2ID'] + '_' + df_dups_subset['SampleRepNo']
        df_dup_chunks.append(df_dups_subset)
    df_dups_wids = pd.concat(df_dup_chunks)

df_nodups = dfproc[~dfproc.duplicated(keep=False, subset=['clinical_id'])].sort_values('clinical_id').reset_index(drop = True).copy()

if df_dups.shape[0]>0:
    n =  len(list(dupids_mapper.values())) + n
else:
    n = n

uids = [str(id) for id in df_nodups['sample_id'].unique()]
mapid = {}
for uid in uids:
    mapid[uid]= n
    n += 1
df_nodups_wids = df_nodups.copy()
df_nodups_wids['uid_idx'] = df_nodups_wids['sample_id'].map(mapid)
df_nodups_wids['GP2ID'] = [f'{study_code}_{i:06}' for i in df_nodups_wids.uid_idx]
df_nodups_wids['uid_idx_cumcount'] = df_nodups_wids.groupby('GP2ID').cumcount() + 1
df_nodups_wids['GP2sampleID'] = df_nodups_wids.GP2ID + '_s' + df_nodups_wids.uid_idx_cumcount.astype('str')
df_nodups_wids['SampleRepNo'] = 's' + df_nodups_wids.uid_idx_cumcount.astype('str')
df_nodups_wids.drop(['uid_idx','uid_idx_cumcount'], axis = 1, inplace = True)

if df_dups.shape[0]>0:
    df_newids = pd.concat([df_dups_wids, df_nodups_wids])
else:
    df_newids = df_nodups_wids








# might want to use 'source_subject_id' instead of 'subject_id' since we want to find matches across teams
# shouldn't actually matter but logically cleaner
uids = [str(id) for id in df_nodups_wids['subject_id'].unique()]
mapid = {}
for uid in uids:
    mapid[uid]= n
    n += 1

    df_nodups_wids['uid_idx'] = df_nodups_wids['subject_id'].map(mapid)
    # make a new column with the ASAP_subject_id
    # and insert it at the beginning of the dataframe
    ASAP_subject_id = [f'{STUDY_PREFIX}{i:06}' for i in df_nodups_wids.uid_idx]
    df_nodups_wids.insert(0, 'ASAP_subject_id', ASAP_subject_id)

    df_nodups_wids['uid_idx_cumcount'] = df_nodups_wids.groupby('ASAP_subject_id').cumcount() + 1
    asap_id_mapper = dict(zip(df_nodups_wids['subject_id'], df_nodups_wids['ASAP_subject_id']))

    # update the subj_id_mapper
    subj_id_mapper.update(asap_id_mapper)

In [69]:
# get the last incrimented id from master_df
if newids_df.empty:# TO CONSIDER THE CASE IN WHICH WE ONLY HAD DUPLICATE IDS MAPPED ON THE MASTER FILE
    # just unpack the entries





else:   # Get new  IDs
    if master_df.empty:
        n = 1
    else:
        n = int(max(master_df['ASAP_subject_id'].to_list()).split("_")[2])+1
    nstart = n
    



In [None]:
# GET GP2 IDs METADATA for new CLINICAL-SAMPLE ID pairs
df_newids = df_subset[df_subset['GP2sampleID'].isnull()].reset_index(drop = True).copy()
if not df_newids.empty: # Get new GP2 IDs
    # unpack the entries
    df_wids = df_subset[~df_subset['GP2sampleID'].isnull()].reset_index(drop = True).copy()
    df_wids['GP2ID'] = df_wids['GP2sampleID'].apply(lambda x: ("_").join(x.split("_")[:-1]))
    df_wids['SampleRepNo'] = df_wids['GP2sampleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))

    # make new ids for new entries
    n=int(max(study_tracker_df['master_GP2sampleID'].to_list()).split("_")[1])+1
    df_newids = generategp2ids.getgp2idsv2(df_newids, n, study)
    df_subset = pd.concat([df_newids, df_wids], axis = 0)
    study_subsets.append(df_subset)
    log_new.append(df_newids[['study','clinical_id','sample_id','GP2sampleID']])
    
else: # TO CONSIDER THE CASE IN WHICH WE ONLY HAD DUPLICATE IDS MAPPED ON THE MASTER FILE
    # just unpack the entries
    df_subset['GP2ID'] = df_subset['GP2sampleID'].apply(lambda x: ("_").join(x.split("_")[:-1]))
    df_subset['SampleRepNo'] = df_subset['GP2sampleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))
    study_subsets.append(df_subset)

In [None]:


df_nodups_wids = subject_df.copy()
# might want to use 'source_subject_id' instead of 'subject_id' since we want to find matches across teams
# shouldn't actually matter but logically cleaner
uids = [str(id) for id in df_nodups_wids['subject_id'].unique()]
mapid = {}
for uid in uids:
    mapid[uid]= n
    n += 1

df_nodups_wids['uid_idx'] = df_nodups_wids['subject_id'].map(mapid)
# make a new column with the ASAP_subject_id
# and insert it at the beginning of the dataframe
ASAP_subject_id = [f'{STUDY_PREFIX}{i:06}' for i in df_nodups_wids.uid_idx]
df_nodups_wids.insert(0, 'ASAP_subject_id', ASAP_subject_id)

df_nodups_wids['uid_idx_cumcount'] = df_nodups_wids.groupby('ASAP_subject_id').cumcount() + 1
asap_id_mapper = dict(zip(df_nodups_wids['subject_id'], df_nodups_wids['ASAP_subject_id']))

# update the subj_id_mapper
subj_id_mapper.update(asap_id_mapper)

In [None]:
if not df_newids.empty: # Get new GP2 IDs
    df_wids = df_subset[~df_subset['ASAP_subject_id'].isnull()].reset_index(drop = True).copy()
    df_wids['asap_id'] = df_wids['ASAP_subject_id'].apply(lambda x: ("_").join(x.split("_")[:-1]))
    df_wids['samp_rep_no'] = df_wids['GP2samasap_idpleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))

    n=int(max(study_tracker_df['master_GP2sampleID'].to_list()).split("_")[1])+1
    df_newids = generategp2ids.getgp2idsv2(df_newids, n, study)
    df_subset = pd.concat([df_newids, df_wids], axis = 0)
    study_subsets.append(df_subset)
    log_new.append(df_newids[['study','clinical_id','sample_id','GP2sampleID']])
    
else: # TO CONSIDER THE CASE IN WHICH WE ONLY HAD DUPLICATE IDS MAPPED ON THE MASTER FILE

    
    df_subset['GP2ID'] = df_subset['GP2sampleID'].apply(lambda x: ("_").join(x.split("_")[:-1]))
    df_subset['SampleRepNo'] = df_subset['GP2sampleID'].apply(lambda x: x.split("_")[-1])#.replace("s",""))
    study_subsets.append(df_subset)

In [None]:

# check for duplicates in the subject_df
df_dups = master_df[master_df.duplicated(keep=False, subset=['subject_id'])].sort_values('subject_id').reset_index(drop = True).copy()



# if there are duplicates, then generate new ids for the duplicates
if not df_dups.empty:
    # generate new ids for the duplicates
    new_ids = generate_new_ids(df_dups, 'subject_id')
    # update the master_df with the new ids
    master_df = pd.merge(master_df, new_ids, on='subject_id', how='left')
    # update the subj_id_mapper with the new ids from the data table
    subj_id_mapper.update(new_ids.set_index('subject_id')['ASAP_sample_id'].to_dict())

# extract the new ids from the subject_df as an updated subj_id_mapper
subj_id_mapper = master_df[['subject_id','ASAP_sample_id']].set_index('subject_id')['ASAP_sample_id'].to_dict()

# update the subj_id_mapper with the new ids from the data table
master_df = pd.merge(master_df, subject_df,
                    left_on=['subject_id'], right_on=['sample_id'], how='inner')

# return the updated id_mapper, updated subject_df, and the starting index for the new ids
return subj_id_mapper, master_df, master_df.shape[0]


# generate new ids for all the subjects in the data table
updated_subj_id_mapper, master_df, n = generate_asap_subject_ids(subj_id_mapper, subject_df)


df_dups = subject_df[subject_df.duplicated(keep=False, subset=['subject_id'])].sort_values('subject_id').reset_index(drop = True).copy()

study_tracker_df = pd.DataFrame()
# study_tracker_df = pd.DataFrame.from_dict(subj_id_mapper,
#                                                         orient='index',
#                                                         columns = ['ASAP_sample_id','subject_id'])\
#                                                     .rename_axis('master_sample_id').reset_index()\
#                                                     .astype(str)
# Check if any sample ID exists in df_subset.
sample_id_unique = pd.merge(study_tracker_df, subject_df,
                            left_on=['subject_id'], right_on=['sample_id'], how='inner')

In [22]:
# extract the max value of the mapper's third (last) section ([2] or [-1]) to get our n
if bool(subj_id_mapper):
    n = max([int(v.split("_")[2]) for v in subj_id_mapper.values() if v]) + 1
else:
    n = 1
nstart = n

df_nodups_wids = subject_df.copy()
# might want to use 'source_subject_id' instead of 'subject_id' since we want to find matches across teams
# shouldn't actually matter but logically cleaner
uids = [str(id) for id in df_nodups_wids['subject_id'].unique()]
mapid = {}
for uid in uids:
    mapid[uid]= n
    n += 1

df_nodups_wids['uid_idx'] = df_nodups_wids['subject_id'].map(mapid)
# make a new column with the ASAP_subject_id
# and insert it at the beginning of the dataframe
ASAP_subject_id = [f'{STUDY_PREFIX}{i:06}' for i in df_nodups_wids.uid_idx]
df_nodups_wids.insert(0, 'ASAP_subject_id', ASAP_subject_id)

df_nodups_wids['uid_idx_cumcount'] = df_nodups_wids.groupby('ASAP_subject_id').cumcount() + 1
asap_id_mapper = dict(zip(df_nodups_wids['subject_id'], df_nodups_wids['ASAP_subject_id']))

# update the subj_id_mapper
subj_id_mapper.update(asap_id_mapper)

In [20]:
if df_dups.shape[0]>0:
    dupids_mapper = dict(zip(df_dups.subject_id.unique(),
                        [num+n for num in range(len(df_dups.subject_id.unique()))]))
    
    df_dup_chunks = []
    for subj_id, samp_n in dupids_mapper.items():
        df_dups_subset = df_dups[df_dups.subject_id==subj_id].copy()
        
        asap_id = dupids_mapper[subj_id]
        df_dups_subset['asap_sample'] = [f'{asap_id}_{samp_n:06}' for i in range(df_dups_subset.shape[0])]
        df_dups_subset['samp_rep_no'] = ['s'+str(i+1) for i in range(df_dups_subset.shape[0])]
        # make a new column with the asap_sample_id
        # and insert it at the beginning of the dataframe
        df_dups_subset['ASAP_sample_id'] = df_dups_subset['asap_sample'] + '_' + df_dups_subset['samp_rep_no']

        df_dup_chunks.append(df_dups_subset)
    df_dups_wids = pd.concat(df_dup_chunks)

df_nodups = subject_df[~subject_df.duplicated(keep=False, subset=['subject_id'])].sort_values('subject_id').reset_index(drop = True).copy()


In [21]:
df_dups_wids

Unnamed: 0,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,...,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text,asap_sample,samp_rep_no,ASAP_sample_id
0,bovon,P95/10,,MDGAP-QSBB_000046_s1,QSBB_UK,Human,Male,81,,,...,,,,,0,Idiopathic PD,clinpath info: PD | NA,2_000002,s1,2_000002_s1
1,bovon,dup_1P95/10,,MDGAP-QSBB_000046_s1,QSBB_UK,Human,Male,81,,,...,,,,,0,Idiopathic PD,clinpath info: PD | NA,2_000002,s2,2_000002_s2
2,himuv,P82/12,,MDGAP-QSBB_000593_s1,QSBB_UK,Human,Male,69,,,...,,,,,0,Idiopathic PD,clinpath info: PDD | Atypical Parkinson's and ...,3_000003,s1,3_000003_s1
3,himuv,dup_1P82/12,,MDGAP-QSBB_000593_s1,QSBB_UK,Human,Male,69,,,...,,,,,0,Idiopathic PD,clinpath info: PDD | Atypical Parkinson's and ...,3_000003,s2,3_000003_s2
4,jopin,dup_1P67/16,,MDGAP-QSBB_000538_s1,QSBB_UK,Human,Female,68,,,...,,,,,0,Idiopathic PD,clinpath info: PDD | PD with dementia,4_000004,s1,4_000004_s1
5,jopin,P67/16,,MDGAP-QSBB_000538_s1,QSBB_UK,Human,Female,68,,,...,,,,,0,Idiopathic PD,clinpath info: PDD | PD with dementia,4_000004,s2,4_000004_s2
6,juzot,dup_1P13/17,,MDGAP-QSBB_000246_s1,QSBB_UK,Human,Male,83,,,...,,,,,0,Idiopathic PD,clinpath info: PDD | PD (with dementia),5_000005,s1,5_000005_s1
7,juzot,P13/17,,MDGAP-QSBB_000246_s1,QSBB_UK,Human,Male,83,,,...,,,,,0,Idiopathic PD,clinpath info: PDD | PD (with dementia),5_000005,s2,5_000005_s2
8,litan,dup_1P26/11,,MDGAP-QSBB_000325_s1,QSBB_UK,Human,Female,73,,,...,,,,,0,No PD nor other neurological disorder,clinpath info: Pathological ageing | Control,6_000006,s1,6_000006_s1
9,litan,P26/11,,MDGAP-QSBB_000325_s1,QSBB_UK,Human,Female,73,,,...,,,,,0,No PD nor other neurological disorder,clinpath info: Pathological ageing | Control,6_000006,s2,6_000006_s2


In [None]:


# extract the max value of the mapper's third (last) section ([2] or [-1]) to get our n
if bool(subj_id_mapper):
    n = max([int(v.split("_")[2]) for v in subj_id_mapper.values() if v]) + 1
else:
    n = 1
nstart = n
df_nodups_wids = subject_df.copy()



In [13]:
df_nodups_wids

Unnamed: 0,subject_id,source_subject_id,AMPPD_id,GP2_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,...,hx_dementia_mci,hx_melanoma,education_level,smoking_status,smoking_years,APOE_e4_status,cognitive_status,time_from_baseline,primary_diagnosis,primary_diagnosis_text
0,jopin,dup_1P67/16,,MDGAP-QSBB_000538_s1,QSBB_UK,Human,Female,68,,,...,Yes,,,,,,,0,Idiopathic PD,clinpath info: PDD | PD with dementia
1,mivog,dup_1P56/08,,MDGAP-QSBB_000030_s1,QSBB_UK,Human,Male,65,,,...,Yes,,,,,,,0,Idiopathic PD,clinpath info: PDD | PD
2,rijof,dup_1P8/18,,MDGAP-QSBB_000583_s1,QSBB_UK,Human,Male,84,,,...,No,,,,,,,0,Idiopathic PD,clinpath info: PD | PD
3,juzot,dup_1P13/17,,MDGAP-QSBB_000246_s1,QSBB_UK,Human,Male,83,,,...,Yes,,,,,,,0,Idiopathic PD,clinpath info: PDD | PD (with dementia)
4,nobin,dup_1P46/10,,MDGAP-QSBB_000422_s1,QSBB_UK,Human,Female,83,,,...,Yes,,,,,,,0,Idiopathic PD,clinpath info: PDD | PD (with dementia)
5,bovon,dup_1P95/10,,MDGAP-QSBB_000046_s1,QSBB_UK,Human,Male,81,,,...,No,,,,,,,0,Idiopathic PD,clinpath info: PD | NA
6,himuv,dup_1P82/12,,MDGAP-QSBB_000593_s1,QSBB_UK,Human,Male,69,,,...,Yes,,,,,,,0,Idiopathic PD,clinpath info: PDD | Atypical Parkinson's and ...
7,zidip,dup_1P30/09,,MDGAP-QSBB_000101_s1,QSBB_UK,Human,Female,77,,,...,No,,,,,,,0,Idiopathic PD,clinpath info: PD | PD
8,nafam,dup_1P17/16,,MDGAP-DEMENTIA_000249_s1,QSBB_UK,Human,Female,84,,,...,No,,,,,,,0,No PD nor other neurological disorder,clinpath info: Control | Control
9,posaj,dup_1P47/11,,MDGAP-QSBB_000427_s1,QSBB_UK,Human,Female,79,,,...,No,,,,,,,0,No PD nor other neurological disorder,clinpath info: Control | Control


In [14]:

# might want to use 'source_subject_id' instead of 'subject_id' since we want to find matches across teams
# shouldn't actually matter but logically cleaner
uids = [str(id) for id in df_nodups_wids['subject_id'].unique()]
mapid = {}
for uid in uids:
    mapid[uid]= n
n += 1


In [15]:
mapid

{'jopin': 1,
 'mivog': 1,
 'rijof': 1,
 'juzot': 1,
 'nobin': 1,
 'bovon': 1,
 'himuv': 1,
 'zidip': 1,
 'nafam': 1,
 'posaj': 1,
 'lutit': 1,
 'rusiz': 1,
 'litan': 1,
 'dup_2jopin': 1,
 'dup_2mivog': 1,
 'dup_2rijof': 1,
 'dup_2juzot': 1,
 'dup_2nobin': 1,
 'dup_2bovon': 1,
 'dup_2himuv': 1,
 'dup_2zidip': 1,
 'dup_2nafam': 1,
 'dup_2posaj': 1,
 'dup_2lutit': 1,
 'dup_2rusiz': 1,
 'dup_2litan': 1}

In [None]:

df_nodups_wids['uid_idx'] = df_nodups_wids['subject_id'].map(mapid)
# make a new column with the ASAP_subject_id
# and insert it at the beginning of the dataframe
ASAP_subject_id = [f'{STUDY_PREFIX}{i:06}' for i in df_nodups_wids.uid_idx]
df_nodups_wids.insert(0, 'ASAP_subject_id', ASAP_subject_id)

df_nodups_wids['uid_idx_cumcount'] = df_nodups_wids.groupby('ASAP_subject_id').cumcount() + 1
asap_id_mapper = dict(zip(df_nodups_wids['subject_id'], df_nodups_wids['ASAP_subject_id']))

# update the subj_id_mapper
subj_id_mapper.update(asap_id_mapper)

# return subj_id_mapper, df_nodups_wids, nstart


In [38]:
subj_id_mapper

{'HC_1225': 'ASAP_PMBDS_000001',
 'HC_0602': 'ASAP_PMBDS_000002',
 'PD_0009': 'ASAP_PMBDS_000003',
 'PD_1921': 'ASAP_PMBDS_000004',
 'PD_2058': 'ASAP_PMBDS_000005',
 'PD_1441': 'ASAP_PMBDS_000006',
 'PD_1344': 'ASAP_PMBDS_000007',
 'HC_1939': 'ASAP_PMBDS_000008',
 'HC_1308': 'ASAP_PMBDS_000009',
 'HC_1862': 'ASAP_PMBDS_000010',
 'HC_1864': 'ASAP_PMBDS_000011',
 'HC_2057': 'ASAP_PMBDS_000012',
 'HC_2061': 'ASAP_PMBDS_000013',
 'HC_2062': 'ASAP_PMBDS_000014',
 'HC_2067': 'ASAP_PMBDS_000015',
 'PD_0348': 'ASAP_PMBDS_000016',
 'PD_0413': 'ASAP_PMBDS_000017',
 'PD_1312': 'ASAP_PMBDS_000018',
 'PD_1317': 'ASAP_PMBDS_000019',
 'PD_1504': 'ASAP_PMBDS_000020',
 'PD_1858': 'ASAP_PMBDS_000021',
 'PD_1902': 'ASAP_PMBDS_000022',
 'PD_1973': 'ASAP_PMBDS_000023',
 'PD_2005': 'ASAP_PMBDS_000024',
 'PD_2038': 'ASAP_PMBDS_000025'}

In [39]:
# since the current SAMPLE tables can have multipl sample_ids lets drop duplciates, with the caveat of replciates
df_nodups = sample_df.drop_duplicates(subset=['sample_id'])

# 
uniq_subj = df_nodups.subject_id.unique()

dupids_mapper = dict(zip(uniq_subj,
                    [num+nstart for num in range(len(uniq_subj))] ))
dupids_mapper

{'HC_1225': 1,
 'HC_0602': 2,
 'PD_0009': 3,
 'PD_1921': 4,
 'PD_2058': 5,
 'PD_1441': 6,
 'PD_1344': 7,
 'HC_1939': 8,
 'HC_1308': 9,
 'HC_1862': 10,
 'HC_1864': 11,
 'HC_2057': 12,
 'HC_2061': 13,
 'HC_2062': 14,
 'HC_2067': 15,
 'PD_0348': 16,
 'PD_0413': 17,
 'PD_1312': 18,
 'PD_1317': 19,
 'PD_1504': 20,
 'PD_1858': 21,
 'PD_1902': 22,
 'PD_1973': 23,
 'PD_2005': 24,
 'PD_2038': 25}

In [47]:

dupids_mapper = dict(zip(uniq_subj,
                    [num+nstart for num in range(len(uniq_subj))] ))

df_dup_chunks = []
for subj_id, samp_n in dupids_mapper.items():
    df_dups_subset = df_nodups[df_nodups.subject_id==subj_id].copy()
    asap_id = subj_id_mapper[subj_id]
    df_dups_subset['asap_sample'] = [f'{asap_id}_{i:06}' for i in range(df_dups_subset.shape[0])]
    df_dups_subset['samp_rep_no'] = ['s'+str(i+1) for i in range(df_dups_subset.shape[0])]
    # make a new column with the asap_sample_id
    # and insert it at the beginning of the dataframe
    df_dups_subset['ASAP_sample_id'] = df_dups_subset['asap_sample'] + '_' + df_dups_subset['samp_rep_no']

    df_dup_chunks.append(df_dups_subset)



In [48]:
df_dups_subset.shape[0]

3

In [49]:
df_dups_wids = pd.concat(df_dup_chunks)

id_mapper = dict(zip(df_dups_wids.sample_id,
                    df_dups_wids.ASAP_sample_id))
out_df = sample_df.copy()
ASAP_sample_id = out_df['sample_id'].map(id_mapper)
out_df.insert(0, 'ASAP_sample_id', ASAP_sample_id)

samp_id_mapper.update(id_mapper)

samp_id_mapper


{'MFG_HC_1225': 'ASAP_PMBDS_000001_s1',
 'HIP_HC_1225': 'ASAP_PMBDS_000001_s2',
 'SN_HC_1225': 'ASAP_PMBDS_000001_s3',
 'MFG_HC_0602': 'ASAP_PMBDS_000002_s1',
 'HIP_HC_0602': 'ASAP_PMBDS_000002_s2',
 'SN_HC_0602': 'ASAP_PMBDS_000002_s3',
 'MFG_PD_0009': 'ASAP_PMBDS_000003_s1',
 'HIP_PD_0009': 'ASAP_PMBDS_000003_s2',
 'SN_PD_0009': 'ASAP_PMBDS_000003_s3',
 'MFG_PD_1921': 'ASAP_PMBDS_000004_s1',
 'HIP_PD_1921': 'ASAP_PMBDS_000004_s2',
 'SN_PD_1921': 'ASAP_PMBDS_000004_s3',
 'MFG_PD_2058': 'ASAP_PMBDS_000005_s1',
 'HIP_PD_2058': 'ASAP_PMBDS_000005_s2',
 'SN_PD_2058': 'ASAP_PMBDS_000005_s3',
 'MFG_PD_1441': 'ASAP_PMBDS_000006_s1',
 'HIP_PD_1441': 'ASAP_PMBDS_000006_s2',
 'SN_PD_1441': 'ASAP_PMBDS_000006_s3',
 'MFG_PD_1344': 'ASAP_PMBDS_000007_s1',
 'HIP_PD_1344': 'ASAP_PMBDS_000007_s2',
 'SN_PD_1344': 'ASAP_PMBDS_000007_s3',
 'MFG_HC_1939': 'ASAP_PMBDS_000008_s1',
 'HIP_HC_1939': 'ASAP_PMBDS_000008_s2',
 'SN_HC_1939': 'ASAP_PMBDS_000008_s3',
 'MFG_HC_1308': 'ASAP_PMBDS_000009_s1',
 'HIP_HC

In [31]:
SAMPLE.columns

Index(['sample_id', 'subject_id', 'source_sample_id', 'replicate',
       'replicate_count', 'repeated_sample', 'batch', 'tissue', 'brain_region',
       'hemisphere', 'region_level_1', 'region_level_2', 'region_level_3',
       'RIN', 'source_RIN', 'molecular_source', 'input_cell_count', 'assay',
       'sequencing_end', 'sequencing_length', 'sequencing_instrument',
       'organism_ontology_term_id', 'development_stage_ontology_term_id',
       'sex_ontology_term_id', 'self_reported_ethnicity_ontology_term_id',
       'disease_ontology_term_id', 'tissue_ontology_term_id',
       'cell_type_ontology_term_id', 'assay_ontology_term_id',
       'suspension_type', 'DV200', 'pm_PH', 'donor_id'],
      dtype='object')

In [19]:
sample_df = SAMPLE.copy()
if True:

# def generate_asap_sample_ids(subj_id_mapper:dict, 
#                              sample_df:pd.DataFrame, 
#                              nstart:int, 
#                              samp_id_mapper:dict) -> tuple[dict, pd.DataFrame]:
    """
    generate new unique_ids for new sample_ids in sample_df table, 
    update the id_mapper with the new ids from the data table


    return the updated id_mapper and updated sample_df
    """
    # could pass subj_id_mapper as a parameter instead of n.  e.g.
    # if bool(subj_id_mapper):
    #     n = max([int(v.split("_")[2]) for v in subj_id_mapper.values() if v]) + 1
    # else:
    #     n = 1
    
    # since the current SAMPLE tables can have multipl sample_ids lets drop duplciates, with the caveat of replciates
    df_nodups = sample_df.drop_duplicates(subset=['sample_id'])
    
    # 
    uniq_subj = df_nodups.subject_id.unique()

    dupids_mapper = dict(zip(uniq_subj,
                        [num+nstart for num in range(len(uniq_subj))] ))

    df_dup_chunks = []
    for subj_id, samp_n in dupids_mapper.items():
        df_dups_subset = df_nodups[df_nodups.subject_id==subj_id].copy()
        asap_id = subj_id_mapper[subj_id]
        df_dups_subset['asap_sample'] = [f'{asap_id}_{samp_n:06}' for i in range(df_dups_subset.shape[0])]
        df_dups_subset['samp_rep_no'] = ['s'+str(i+1) for i in range(df_dups_subset.shape[0])]
        # make a new column with the asap_sample_id
        # and insert it at the beginning of the dataframe
        df_dups_subset['ASAP_sample_id'] = df_dups_subset['asap_sample'] + '_' + df_dups_subset['samp_rep_no']

        df_dup_chunks.append(df_dups_subset)
    df_dups_wids = pd.concat(df_dup_chunks)

    id_mapper = dict(zip(df_dups_wids.sample_id,
                        df_dups_wids.ASAP_sample_id))
    out_df = sample_df.copy()
    ASAP_sample_id = out_df['sample_id'].map(id_mapper)
    out_df.insert(0, 'ASAP_sample_id', ASAP_sample_id)

    samp_id_mapper.update(id_mapper)

    # return samp_id_mapper, out_df



In [20]:
subj_id_mapper, samp_id_mapper



({'HC_1225': 'ASAP_PMBDS_000001',
  'HC_0602': 'ASAP_PMBDS_000002',
  'PD_0009': 'ASAP_PMBDS_000003',
  'PD_1921': 'ASAP_PMBDS_000004',
  'PD_2058': 'ASAP_PMBDS_000005',
  'PD_1441': 'ASAP_PMBDS_000006',
  'PD_1344': 'ASAP_PMBDS_000007',
  'HC_1939': 'ASAP_PMBDS_000008',
  'HC_1308': 'ASAP_PMBDS_000009',
  'HC_1862': 'ASAP_PMBDS_000010',
  'HC_1864': 'ASAP_PMBDS_000011',
  'HC_2057': 'ASAP_PMBDS_000012',
  'HC_2061': 'ASAP_PMBDS_000013',
  'HC_2062': 'ASAP_PMBDS_000014',
  'HC_2067': 'ASAP_PMBDS_000015',
  'PD_0348': 'ASAP_PMBDS_000016',
  'PD_0413': 'ASAP_PMBDS_000017',
  'PD_1312': 'ASAP_PMBDS_000018',
  'PD_1317': 'ASAP_PMBDS_000019',
  'PD_1504': 'ASAP_PMBDS_000020',
  'PD_1858': 'ASAP_PMBDS_000021',
  'PD_1902': 'ASAP_PMBDS_000022',
  'PD_1973': 'ASAP_PMBDS_000023',
  'PD_2005': 'ASAP_PMBDS_000024',
  'PD_2038': 'ASAP_PMBDS_000025'},
 {'MFG_HC_1225': 'ASAP_PMBDS_000001_000001_s1',
  'HIP_HC_1225': 'ASAP_PMBDS_000001_000001_s2',
  'SN_HC_1225': 'ASAP_PMBDS_000001_000001_s3',
  'MFG

Use the `process_meta_files` function to generate the mappers and update the meta tables.

In [27]:
subject_mapper_path = Path.cwd() / "ASAP_subj_test2.json"
sample_mapper_path = Path.cwd() / "ASAP_samp_test2.json"



export_root = Path.cwd() / "ASAP_tables" 

table_root = Path.cwd() / "clean/team-Lee/v2_20231109/"
## add team Lee
process_meta_files(table_root, 
                       CDE_path, 
                       subject_mapper_path, 
                       sample_mapper_path, 
                       export_path=export_root)



id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test2.json
id_mapper not found at /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test2.json
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/TEAM-LEE


1

Using the same mapper_paths we can continue to generate ASAP IDs

In [28]:

## add team Hafler
table_root = Path.cwd() / "clean/team-Hafler"
table_root = Path.cwd() / "clean/team-Hafler/v2_20231109/"

process_meta_files(table_root, 
                       CDE_path, 
                       subject_mapper_path, 
                       sample_mapper_path, 
                       export_path=export_root)

## add team Hardy
table_root = Path.cwd() / "clean/team-Hardy/v2_20231109/"
process_meta_files(table_root, 
                       CDE_path, 
                       subject_mapper_path, 
                       sample_mapper_path, 
                       export_path=export_root)


id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test2.json
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test2.json
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/TEAM-HAFLER
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_subj_test2.json
id_mapper loaded from /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_samp_test2.json
exporting to /Users/ergonyc/Projects/ASAP/meta-clean/ASAP_tables/TEAM-HARDY


1

In [30]:
Path(".")/ "clean/team-Hardy/v2_20231109/"

PosixPath('clean/team-Hardy/v2_20231109')