
# ASAP CRN Metadata validation - wave 1

28 October 2023
Andy Henrie



## STEPS

### imports
- pandas
- pathlib

### Load CDE for validation
- check all columns

### Team Lee
- load .tsv, csv tables
- fix format
- load additional metadata

- add batch columns
- add missing columns



In [31]:
import pandas as pd
from pathlib import Path


# local helpers
from utils.qcutils import validate_table, force_enum_string, reorder_table_to_CDE
from utils.io import ReportCollector, get_dtypes_dict, read_meta_table


## Load CDE

In [32]:
CDE_path = Path.cwd() / "ASAP_CDE.csv" 
CDE = pd.read_csv(CDE_path )

CDE.head()



Unnamed: 0,Table,Field,Description,DataType,Required,Validation,Unnamed: 6,ClinPath field,team_Hafler type,ClinPath description,Unnamed: 10
0,STUDY,project_name,Project Name: A Title of the overall project...,String,Required,,,,,,bard
1,STUDY,project_dataset,Dataset Name: A unique name is required for ...,String,Required,,,,,,
2,STUDY,project_description,Project Description: Brief description of th...,String,Required,,,,,,
3,STUDY,ASAP_team_name,ASAP Team Name: Name of the ASAP CRN Team. i...,Enum,Required,"[""TEAM-LEE"",""TEAM-HAFLER"",""TEAM-HARDY"", ""TEAM-...",,,,,
4,STUDY,ASAP_lab_name,Lab Name. : Lab name that is submitting data...,String,Required,,,,,,
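
The CDE lists, for every table, each field's data type, whether it is required, and (for Enum fields) the allowed values. The sketch below shows one way those columns could be queried. It uses a small hand-made stand-in for `ASAP_CDE.csv`; the helper names `required_fields` and `enum_values` are hypothetical and are not part of `utils.qcutils`, the "Float" data type is a toy value, and the `Validation` string is assumed to parse as a JSON list.

```python
import json

import pandas as pd

# Toy stand-in for the CDE; the real file has many more rows and columns.
toy_cde = pd.DataFrame(
    {
        "Table": ["STUDY", "STUDY", "SAMPLE"],
        "Field": ["project_name", "ASAP_team_name", "pm_PH"],
        "DataType": ["String", "Enum", "Float"],
        "Required": ["Required", "Required", "Optional"],
        "Validation": [None, '["TEAM-LEE", "TEAM-HAFLER"]', None],
    }
)


def required_fields(cde: pd.DataFrame, table_name: str) -> list:
    """Return the fields marked Required for one table (hypothetical helper)."""
    mask = (cde["Table"] == table_name) & (cde["Required"] == "Required")
    return cde.loc[mask, "Field"].tolist()


def enum_values(cde: pd.DataFrame, table_name: str) -> dict:
    """Map each Enum field of a table to its allowed values (hypothetical helper)."""
    mask = (cde["Table"] == table_name) & (cde["DataType"] == "Enum")
    return {
        row.Field: json.loads(row.Validation)
        for row in cde.loc[mask].itertuples(index=False)
    }


print(required_fields(toy_cde, "STUDY"))
print(enum_values(toy_cde, "STUDY"))
```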




## Clean Team Lee tables

### Load Tables from csv

The metadata path below has copies of the raw meta-tables


In [57]:

# Initialize the data types dictionary
dtypes_dict = get_dtypes_dict(CDE)
    
# locate the Team Lee data and the raw metadata tables
data_path = Path.home() / "Projects/ASAP/team-lee"
# NOTE: ogmetadata is the original metadata folder
metadata_path = data_path / "metadata/ogmetadata"

SUBJECT = pd.read_csv(f"{metadata_path}/SUBJECT.tsv", delimiter="\t", dtype=dtypes_dict)
SAMPLE = pd.read_csv(f"{metadata_path}/SAMPLE.tsv",delimiter="\t", dtype=dtypes_dict)

CLINPATH = pd.read_csv(f"{metadata_path}/CLINPATH.csv",delimiter=",", dtype=dtypes_dict)
PROTOCOL = pd.read_csv(f"{metadata_path}/PROTOCOL.tsv",delimiter="\t", dtype=dtypes_dict)

STUDY = pd.read_csv(f"{metadata_path}/STUDY.tsv",delimiter="\t")


### STUDY

In [58]:
# STUDY = pd.read_csv(metadata_path / "STUDY.tsv",delimiter="\t")
STUDY.to_csv(data_path / "STUDY_.csv")
STUDY = pd.read_csv(data_path / "STUDY_.csv")


STUDY.head()


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Team-Lee-Bras-Lab-Info,Field,Description,Data type,Validation,Note,Required/Optional
0,Is senescence a component of human PD and does...,project_name,Project Name/Title,String,,Unique and clear title.,Required,,,
1,Human snRNA-seq PD Senesence Jose Bras Team Lee,project_dataset,Dataset name,String,,A Dataset name is required for each submission...,Required,,,
2,Characterize the neuropathological progression...,project_description,Brief description of the goals and objectives ...,String,,,Required,,,
3,TEAM-LEE,ASAP_team_name,"ASAP Team e.g. ""Scherzer""",Enum,"[""TEAM-LEE"",""TEAM-HAFLER"",""TEAM-HARDY"",....]",,Required,,,
4,Bras,ASAP_lab_name,"ASAP Lab under the above team e.g. ""Dong""",String,,,Required,,,


In [59]:

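# fix STUDY formatting: the raw sheet has one row per field, with the field
# names in "Unnamed: 1" and the submitted values in "Unnamed: 0".
# Transpose so that each field becomes a column and the values form one row.
tmp = STUDY[["Unnamed: 1","Unnamed: 0"]].transpose().reset_index().drop(columns=["index"])
tmp.columns = tmp.iloc[0]  # the first row holds the field names
STUDY = tmp.drop([0])      # drop that header row, keeping only the values
STUDY.head()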

Unnamed: 0,project_name,project_dataset,project_description,ASAP_team_name,ASAP_lab_name,PI_full_name,PI_email,submitter_id,submitter_name,submittor_email,...,other_funding_source,publication_DOI,publication_PMID,number_of_brain_samples,brain_regions,types_of_samples,PI_ORCHID,PI_google_scholar_id,DUA_version,metadata_version_date
1,Is senescence a component of human PD and does...,Human snRNA-seq PD Senesence Jose Bras Team Lee,Characterize the neuropathological progression...,TEAM-LEE,Bras,"Jose, Bras",jose.bras@vai.org,"Lee, L, Marshall ; Kimberly, E, Paquette ; Kai...",Kaitlyn E Westra,kaitlyn.westra@vai.org,...,,,,75,hippocampus; middle frontal gyrus; substantia ...,human PD and control postmortem brains,,,unsure,
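
The reshaping step turns a long sheet (one row per field) into a single wide row (one column per field). A minimal sketch of the same idea on toy data follows; the values are invented, and `set_index` followed by a transpose is used here as an alternative to the `iloc`-based header assignment in the cell above.

```python
import pandas as pd

# Toy long-format sheet: field names in "Unnamed: 1", values in "Unnamed: 0".
long_study = pd.DataFrame(
    {
        "Unnamed: 0": ["My project", "TEAM-LEE", "Bras"],
        "Unnamed: 1": ["project_name", "ASAP_team_name", "ASAP_lab_name"],
    }
)

# Index by field name, keep only the value column, and transpose to one row.
wide_study = (
    long_study.set_index("Unnamed: 1")[["Unnamed: 0"]]
    .T.reset_index(drop=True)
    .rename_axis(columns=None)
)

print(wide_study)
```

Unlike the cell above, this variant numbers the resulting row 0 rather than 1.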


In [60]:

# Need to rename submitter_id to contributor_names
STUDY = STUDY.rename(columns={"submitter_id": "contributor_names"})
STUDY = reorder_table_to_CDE(STUDY, "STUDY", CDE)



In [61]:

study_report = ReportCollector(destination="print")
validate_table(STUDY, "STUDY", CDE, study_report)
print(study_report.get_log())

All required fields are present in *STUDY* table.
No empty entries (Nan) found in _Required_ fields.
No empty entries (Nan) found in _Optional_ fields.
## Enum fields have valid values in STUDY. 🥳



In [62]:
STUDY

Unnamed: 0,project_name,project_dataset,project_description,ASAP_team_name,ASAP_lab_name,PI_full_name,PI_email,contributor_names,submitter_name,submittor_email,...,other_funding_source,publication_DOI,publication_PMID,number_of_brain_samples,brain_regions,types_of_samples,PI_ORCHID,PI_google_scholar_id,DUA_version,metadata_version_date
1,Is senescence a component of human PD and does...,Human snRNA-seq PD Senesence Jose Bras Team Lee,Characterize the neuropathological progression...,TEAM-LEE,Bras,"Jose, Bras",jose.bras@vai.org,"Lee, L, Marshall ; Kimberly, E, Paquette ; Kai...",Kaitlyn E Westra,kaitlyn.westra@vai.org,...,,,,75,hippocampus; middle frontal gyrus; substantia ...,human PD and control postmortem brains,,,unsure,


### SAMPLE

`batch` is not in the SAMPLE table and must be collected from the additional metadata in covar.csv.

In [63]:
SAMPLE = pd.read_csv(f"{metadata_path}/SAMPLE.tsv",delimiter="\t", dtype=dtypes_dict)

metadata_path = Path.home() / ("Projects/ASAP/team-lee/metadata")
HIP_covar = pd.read_csv(f"{metadata_path}/HIP/covar.csv")
HIP_cases = pd.read_csv(f"{metadata_path}/HIP/PD_ASAP_Sample_batch_information_banner_cases.csv").dropna(axis=0,how='all')
HIP_control = pd.read_csv(f"{metadata_path}/HIP/PD_ASAP_Sample_batch_information_banner_controls.csv")

MFG_covar = pd.read_csv(f"{metadata_path}/MFG/covar.csv") # includes 'PMI' ?
MFG_cases = pd.read_csv(f"{metadata_path}/MFG/PD_ASAP_Sample_batch_information_banner_cases.csv").dropna(axis=0,how='all')
MFG_control = pd.read_csv(f"{metadata_path}/MFG/PD_ASAP_Sample_batch_information_banner_controls.csv")


SN_covar = pd.read_csv(f"{metadata_path}/SN/covar.csv")
SN_cases = pd.read_csv(f"{metadata_path}/SN/PD_ASAP_Sample_batch_information_banner_cases.csv").dropna(axis=0,how='all')
SN_control = pd.read_csv(f"{metadata_path}/SN/PD_ASAP_Sample_batch_information_banner_controls.csv")

In [64]:
# Hippocampus samples
# HIP_cases["GROUPcv"]="PD"
# HIP_control["GROUPcv"]="HC"

HIP_meta = pd.concat([HIP_cases, HIP_control], axis=0, ignore_index=True)
HIP_meta["GROUPcv"]= HIP_meta["PD"].apply(lambda x: "PD" if (x=="yes") else "HC")


In [65]:


HIP_meta['MERGE_ID'] = "HIP_" + HIP_meta['GROUPcv'] +"_" + HIP_meta['CaseID'].str.replace('-','')
HIP_covar['MERGE_ID'] = HIP_covar['COUNT_ID']
# the fastqs follow the COUNT_ID naming convention instead of SEQ_ID
HIP_covar['SEQ_ID'] = HIP_covar['COUNT_ID']



In [66]:
# there's a bug in the meta table... skip for now
HIP_TABLE = pd.merge(HIP_covar, HIP_meta, on='MERGE_ID', how='outer')

# HIP_TABLE = HIP_covar
HIP_TABLE['subdir']="HIP"
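
When IDs are formatted inconsistently, an outer merge can silently keep rows that matched on only one side. A short sketch, on invented toy IDs, of how the `indicator` option of `pd.merge` can list those rows:

```python
import pandas as pd

toy_covar = pd.DataFrame(
    {"MERGE_ID": ["HIP_HC_1225", "HIP_PD_0009"], "BATCH": ["BATCH_1", "BATCH_2"]}
)
toy_meta = pd.DataFrame(
    {"MERGE_ID": ["HIP_HC_1225", "HIP_PD_9999"], "PD": ["no", "yes"]}
)

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(toy_covar, toy_meta, on="MERGE_ID", how="outer", indicator=True)

# Rows found on only one side point at IDs that need fixing.
unmatched = merged.loc[merged["_merge"] != "both", ["MERGE_ID", "_merge"]]
print(unmatched)
```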


In [67]:
test = HIP_TABLE[["MERGE_ID","SEQ_ID","GROUPcv","subdir",'PD']]

In [68]:
### Middle frontal gyrus samples
MFG_meta = pd.concat([MFG_cases, MFG_control], axis=0, ignore_index=True)
MFG_meta["GROUPcv"]= MFG_meta["PD"].apply(lambda x: "PD" if (x=="yes") else "HC")

# make a MERGE_ID column because the formatting is inconsistent
MFG_meta['MERGE_ID'] = "MFG_" + MFG_meta['GROUPcv'] +"_" + MFG_meta['CaseID'].str.replace('-','')
MFG_covar['MERGE_ID'] = MFG_covar['SAMPLE']
# the fastqs are in SEQ_ID 

# there's a bug in the meta table... skip for now
MFG_TABLE = pd.merge(MFG_covar, MFG_meta, on='MERGE_ID', how='inner')
MFG_TABLE['subdir']="MFG"



# Substantia Nigra
SN_meta = pd.concat([SN_cases, SN_control], axis=0, ignore_index=True)
SN_meta["GROUPcv"] = SN_meta["PD"].apply(lambda x: "PD" if (x=="yes") else "HC")

SN_meta['MERGE_ID'] = "SN_" + SN_meta['GROUPcv'] +"_" + SN_meta['CaseID'].str.replace('-','')
SN_covar['MERGE_ID'] = SN_covar['SAMPLE']

# there's a bug in the meta table... skip for now
SN_TABLE = pd.merge(SN_covar, SN_meta, on='MERGE_ID', how='outer')
SN_TABLE['subdir']="SN"


### concatenate SN, MFG, and HIP tables into one 'all_samples' table
all_samples = pd.concat([HIP_TABLE, MFG_TABLE, SN_TABLE], axis=0, ignore_index=True)
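
The three region blocks repeat the same steps, so a small helper can keep the column references consistent across regions. This is only a sketch: `build_region_meta` is a hypothetical name, and the toy case and control sheets contain just the `CaseID` and `PD` columns used above.

```python
import pandas as pd


def build_region_meta(
    cases: pd.DataFrame, controls: pd.DataFrame, region: str
) -> pd.DataFrame:
    """Combine case and control sheets for one region; add GROUPcv and MERGE_ID."""
    meta = pd.concat([cases, controls], axis=0, ignore_index=True)
    # Anything other than "yes" is treated as a healthy control, as above.
    meta["GROUPcv"] = meta["PD"].apply(lambda x: "PD" if x == "yes" else "HC")
    case_digits = meta["CaseID"].str.replace("-", "", regex=False)
    # Every column comes from `meta`, so one region cannot borrow another's IDs.
    meta["MERGE_ID"] = region + "_" + meta["GROUPcv"] + "_" + case_digits
    return meta


toy_cases = pd.DataFrame({"CaseID": ["00-09"], "PD": ["yes"]})
toy_controls = pd.DataFrame({"CaseID": ["12-25"], "PD": ["no"]})

region_meta = {
    region: build_region_meta(toy_cases, toy_controls, region)
    for region in ["HIP", "MFG", "SN"]
}
print(region_meta["SN"]["MERGE_ID"].tolist())
```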


In [69]:

SAMPLE_ALL = SAMPLE.merge(all_samples, left_on='sample_id', right_on='MERGE_ID', how='left')
SAMPLE_ALL.to_csv("alternate_metadata.csv")

In [70]:
SAMPLE_og = SAMPLE.copy()
SAMPLE['batch'] = SAMPLE_ALL['BATCH']
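
The assignment above relies on `SAMPLE` and `SAMPLE_ALL` sharing the same row order and index. That holds for a left merge only when each `sample_id` matches at most one row of `all_samples`. A sketch of a key-based alternative, on invented toy rows, that does not depend on row order:

```python
import pandas as pd

toy_sample = pd.DataFrame(
    {"sample_id": ["MFG_HC_1225", "MFG_PD_0009", "SN_PD_0009"]}
)
toy_all = pd.DataFrame(
    {"MERGE_ID": ["MFG_PD_0009", "MFG_HC_1225"], "BATCH": ["BATCH_2", "BATCH_4"]}
)

# Build a MERGE_ID -> BATCH lookup; drop_duplicates guards against repeated IDs.
batch_lookup = toy_all.drop_duplicates("MERGE_ID").set_index("MERGE_ID")["BATCH"]

# Map by key; samples without a match get NaN instead of a shifted value.
toy_sample["batch"] = toy_sample["sample_id"].map(batch_lookup)

print(toy_sample)
```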

In [71]:

# SAMPLE = force_enum_string(SAMPLE, "SAMPLE", CDE)



In [72]:
sample_report = ReportCollector(destination="print")
validate_table(SAMPLE, "SAMPLE", CDE, sample_report)
print(sample_report.get_log())

🚨⚠️❗ **Missing Required Fields in SAMPLE: file_MD5**
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- source_RIN: 75/75 empty rows

	- RIN: 75/75 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- pm_PH: 75/75 empty rows
## Enum fields have valid values in SAMPLE. 🥳
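
The report above covers two separate checks: required fields that are absent from the table, and required fields that are present but empty. The internals of `validate_table` are not shown in this notebook, so the following is only an illustrative sketch of those two checks on toy data, with an assumed subset of required fields.

```python
import pandas as pd

toy_sample = pd.DataFrame(
    {
        "sample_id": ["MFG_HC_1225", "MFG_HC_0602"],
        "RIN": [None, None],
    }
)
required = ["sample_id", "RIN", "file_MD5"]  # assumed subset of required fields

# Required fields that are absent from the table altogether.
missing = [f for f in required if f not in toy_sample.columns]

# Per-field count of empty (NaN) rows among the required fields that are present.
present = [f for f in required if f in toy_sample.columns]
empty_counts = toy_sample[present].isna().sum()
empty_counts = empty_counts[empty_counts > 0]

print("missing:", missing)
for field, n in empty_counts.items():
    print(f"- {field}: {n}/{len(toy_sample)} empty rows")
```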



In [73]:
# make the column order of SAMPLE match the CDE.Field
# SAMPLE = SAMPLE[CDE.Field.tolist()]
SAMPLE.head()

Unnamed: 0,sample_id,source_sample_id,subject_id,replicate,replicate_count,repeated_sample,tissue,brain_region,source_RIN,RIN,...,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV2000,pm_PH,donor_id,batch
0,MFG_HC_1225,12-25,12-25,rep1,1,0,brain,Middle_Frontal_Gyrus,,,...,unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,,BATCH_4
1,MFG_HC_0602,06-02,06-02,rep1,1,0,brain,Middle_Frontal_Gyrus,,,...,unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,,BATCH_4
2,MFG_PD_0009,00-09,00-09,rep1,1,0,brain,Middle_Frontal_Gyrus,,,...,unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,,BATCH_4
3,MFG_PD_1921,19-21,19-21,rep1,1,0,brain,Middle_Frontal_Gyrus,,,...,unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,,BATCH_4
4,MFG_PD_2058,20-58,20-58,rep1,1,0,brain,Middle_Frontal_Gyrus,,,...,unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,,BATCH_4


In [74]:
# fix file_name and file_MD5 which need to be exploded (do this last for simplicity. i.e. to keep one sample per row rather than one file per row)

# Step 1: Split the values in the columns based on commas
SAMPLE['file_name'] = SAMPLE['file_name'].str.split(',')
SAMPLE['file_MD5(R1,R2)'] = SAMPLE['file_MD5(R1,R2)'].str.split(',')

# Step 2: Explode both 'file_name' and 'file_MD5(R1,R2)' columns together
SAMPLE = SAMPLE.explode(['file_name', 'file_MD5(R1,R2)'])

# Step 3: Rename the "file_MD5(R1,R2)" column to "file_MD5"
SAMPLE = SAMPLE.rename(columns={"file_MD5(R1,R2)": "file_MD5"})

SAMPLE.columns


Index(['sample_id', 'source_sample_id', 'subject_id', 'replicate',
       'replicate_count', 'repeated_sample', 'tissue', 'brain_region',
       'source_RIN', 'RIN', 'molecular_source', 'input_cell_count', 'assay',
       'sequencing_end', 'sequencing_length', 'sequencing_instrument',
       'file_type', 'file_name', 'file_description', 'file_MD5', 'technology',
       'omic', 'adjustment', 'content', 'time', 'header', 'annotation',
       'preprocessing_references', 'configuration_file',
       'organism_ontology_term_id', 'development_stage_ontology_term_id',
       'sex_ontology_term_id', 'self_reported_ethnicity_ontology_term_id',
       'disease_ontology_term_id', 'tissue_ontology_term_id',
       'cell_type_ontology_term_id', 'assay_ontology_term_id',
       'suspension_type', 'DV2000', 'pm_PH', 'donor_id', 'batch'],
      dtype='object')
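
A toy illustration of the split-and-explode step, with invented file names and checksums. Exploding several columns at once requires pandas 1.3 or later, and the lists in each row must have the same length.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "sample_id": ["MFG_HC_1225"],
        "file_name": ["a_R1.fastq.gz,a_R2.fastq.gz"],
        "file_MD5(R1,R2)": ["md5_r1,md5_r2"],
    }
)

# Turn the comma-separated strings into lists.
for col in ["file_name", "file_MD5(R1,R2)"]:
    toy[col] = toy[col].str.split(",")

# Explode both list columns together so file names stay paired with checksums.
toy = toy.explode(["file_name", "file_MD5(R1,R2)"]).rename(
    columns={"file_MD5(R1,R2)": "file_MD5"}
)

print(toy)
```

Each exploded row keeps the index of the sample it came from, so index values repeat once per file. Calling `reset_index(drop=True)` afterwards would give unique row labels if they are needed.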

In [75]:

# Reorder the SAMPLE columns to match the field order of the "SAMPLE" table in the CDE
SAMPLE = reorder_table_to_CDE(SAMPLE, "SAMPLE", CDE)

SAMPLE.columns


Index(['sample_id', 'source_sample_id', 'subject_id', 'replicate',
       'replicate_count', 'repeated_sample', 'batch', 'tissue', 'brain_region',
       'source_RIN', 'RIN', 'molecular_source', 'input_cell_count', 'assay',
       'sequencing_end', 'sequencing_length', 'sequencing_instrument',
       'file_type', 'file_name', 'file_description', 'file_MD5', 'technology',
       'omic', 'adjustment', 'content', 'time', 'header', 'annotation',
       'preprocessing_references', 'configuration_file',
       'organism_ontology_term_id', 'development_stage_ontology_term_id',
       'sex_ontology_term_id', 'self_reported_ethnicity_ontology_term_id',
       'disease_ontology_term_id', 'tissue_ontology_term_id',
       'cell_type_ontology_term_id', 'assay_ontology_term_id',
       'suspension_type', 'DV200', 'pm_PH', 'donor_id'],
      dtype='object')

In [76]:
SAMPLE.head()

Unnamed: 0,sample_id,source_sample_id,subject_id,replicate,replicate_count,repeated_sample,batch,tissue,brain_region,source_RIN,...,sex_ontology_term_id,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV200,pm_PH,donor_id
0,MFG_HC_1225,12-25,12-25,rep1,1,0,BATCH_4,brain,Middle_Frontal_Gyrus,,...,PATO:0000384 (male),unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
0,MFG_HC_1225,12-25,12-25,rep1,1,0,BATCH_4,brain,Middle_Frontal_Gyrus,,...,PATO:0000384 (male),unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
1,MFG_HC_0602,06-02,06-02,rep1,1,0,BATCH_4,brain,Middle_Frontal_Gyrus,,...,PATO:0000384 (male),unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
1,MFG_HC_0602,06-02,06-02,rep1,1,0,BATCH_4,brain,Middle_Frontal_Gyrus,,...,PATO:0000384 (male),unknown,PATO:0000461,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,
2,MFG_PD_0009,00-09,00-09,rep1,1,0,BATCH_4,brain,Middle_Frontal_Gyrus,,...,PATO:0000384 (male),unknown,MONDO:0005180,UBERON:0002702,NA(multiple),EFO:0030004,nucleus,,,


### SUBJECT

In [101]:
SUBJECT = pd.read_csv(f"{metadata_path}/SUBJECT.tsv", delimiter="\t", dtype=dtypes_dict)

# SUBJECT = reorder_table_to_CDE(SUBJECT, "SUBJECT", CDE)     

subject_report = ReportCollector(destination="print")
validate_table(SUBJECT, "SUBJECT", CDE, subject_report)
print(subject_report.get_log())

All required fields are present in *SUBJECT* table.
No empty entries (Nan) found in _Required_ fields.
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- primary_diagnosis_text: 23/25 empty rows
## Enum fields have valid values in SUBJECT. 🥳



In [102]:

SUBJECT.head()

Unnamed: 0,subject_id,source_subject_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,duration_pmi,primary_diagnosis,primary_diagnosis_text
0,HC_1225,12-25,Banner Sun Health Research Institute,Human,Male,80,White,Not Reported,3.5,No PD nor other neurological disorder,
1,HC_0602,06-02,Banner Sun Health Research Institute,Human,Male,84,White,Not Reported,2.66,Other neurological disorder,Mild Cognitive Impairment
2,PD_0009,00-09,Banner Sun Health Research Institute,Human,Male,64,White,Not Reported,4.0,Idiopathic PD,
3,PD_1921,19-21,Banner Sun Health Research Institute,Human,Male,82,White,Not Reported,3.93,Idiopathic PD,
4,PD_2058,20-58,Banner Sun Health Research Institute,Human,Male,87,White,Not Reported,3.17,Idiopathic PD,


In [103]:
metadata_path

PosixPath('/Users/ergonyc/Projects/ASAP/team-lee/metadata')

### CLINPATH



In [104]:
CLINPATH = pd.read_csv(f"{metadata_path}/ogmetadata/CLINPATH.csv",delimiter=",", dtype=dtypes_dict)

clinpath_report = ReportCollector(destination="print")
validate_table(CLINPATH, "CLINPATH", CDE, clinpath_report)

print(clinpath_report.get_log())


All required fields are present in *CLINPATH* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 75/75 empty rows

	- age_at_diagnosis: 75/75 empty rows

	- first_motor_symptom: 75/75 empty rows

	- path_year_death: 75/75 empty rows

	- brain_weight: 75/75 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 75/75 empty rows
## Enums
🚨⚠️❗ **Invalid entries**
	- region_level_2:Hippocampus
	>	 change to: Superior frontal gyrus, Middle frontal gyrus, Inferior frontal gyrus, Superior temporal gyrus, Middle temporal gyrus, Inferior temporal gyrus, Fusiform gyrus, Transentorhinal region, Entorinal region, Subiculum, CA1-CA4, Amygdala, Periamygdala cortex, Anterior cingulate gyrus, Posterior cingulate gyrus, Superior parietal lobule, Inferior parietal lobule, Parastriate cortex, Peristriate cortex, Striate cortex, Insular cortex, Caudate nucleus, Putamen, Globus pallidus, Thalamus, Subthalamic nucleus, Substantia nigra, Pontine tegmentum, Pon
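
An Enum check of this kind amounts to comparing the distinct values of a column against the allowed list. A minimal sketch on toy data, where `allowed` is an assumed subset of the real `region_level_2` values:

```python
import pandas as pd

toy_clinpath = pd.DataFrame(
    {"region_level_2": ["Hippocampus", "Middle frontal gyrus", None]}
)
allowed = ["Middle frontal gyrus", "CA1-CA4", "Substantia nigra"]  # assumed subset

# Ignore empty cells, then keep the distinct values that are not allowed.
values = toy_clinpath["region_level_2"].dropna()
invalid = sorted(values[~values.isin(allowed)].unique())

print(invalid)
```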

In [105]:

# change "Hippocampus" to "CA1-CA4"
CLINPATH['region_level_2'] = CLINPATH['region_level_2'].replace('Hippocampus', 'CA1-CA4')

# skip hx_melanoma and education level for now as there is not a "Unknown" or "Not Reported" option in the CDE

# leave cognitive status as is, since there is no "Unknown" or "Not Reported" option in the CDE


0         l. Olfactory Bulb-Only
1     lla. Brainstem Predominant
2        llb. Limbic Predominant
3                lV. Neocortical
4                lV. Neocortical
                 ...            
70         lll. Brainstem/Limbic
71               lV. Neocortical
72               lV. Neocortical
73         lll. Brainstem/Limbic
74         lll. Brainstem/Limbic
Name: path_mckeith, Length: 75, dtype: object

In [106]:

# potential "path_braak_asyn" coding. Note that the "I" characters are actually lowercase "l"
braak_map = {'l. Olfactory Bulb-Only':"1/2", 'lla. Brainstem Predominant':"3",
       'llb. Limbic Predominant':"3/4", 'lV. Neocortical':"5",
       'lll. Brainstem/Limbic':"3/4", '0. No Lewy bodies':"0"}
# leave empty for now since this is actually path_mckeith coding

CLINPATH['path_braak_asyn'] = ""



In [97]:

mckeith_map = {'l. Olfactory Bulb-Only':"Olfactory bulb only", 'lla. Brainstem Predominant':"Brainstem",
       'llb. Limbic Predominant':"Limbic (transitional)", 'lV. Neocortical':"Neocortical",
       'lll. Brainstem/Limbic':"Amygdala Predominant", '0. No Lewy bodies':"Absent"}


CLINPATH['path_mckeith'] = CLINPATH['path_mckeith'].replace(mckeith_map)

# leave path_nia_ri like this for now. not sure how to map "criteria not met" and "Not AD"

# leave amyloid_angiopathy_severity_scale like this for now. not sure how to map 'Cerebral amyloid angiopathy, temporal and occipital lobe' and 'Cerebral amyloid angiopathy, frontal lobe'
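
Values that are missing from a recoding dictionary pass through `.replace()` unchanged, so it can help to list them before recoding. A small sketch with an invented unexpected value and a shortened toy mapping:

```python
import pandas as pd

toy_mckeith = pd.Series(
    ["l. Olfactory Bulb-Only", "lV. Neocortical", "V. Unexpected"]
)
toy_map = {
    "l. Olfactory Bulb-Only": "Olfactory bulb only",
    "lV. Neocortical": "Neocortical",
}

# Values present in the data but absent from the mapping keys.
unmapped = sorted(set(toy_mckeith.dropna()) - set(toy_map))
print(unmapped)

# Recode; unmapped values are left as they are.
recoded = toy_mckeith.replace(toy_map)
print(recoded.tolist())
```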


In [109]:
CLINPATH[['path_ad_level', 'path_mckeith', 'path_nia_ri']]

Unnamed: 0,path_ad_level,path_mckeith,path_nia_ri
0,No evidence of Alzheimer's disease neuropathol...,Olfactory bulb only,Criteria not met
1,No evidence of Alzheimer's disease neuropathol...,Brainstem,Criteria not met
2,No evidence of Alzheimer's disease neuropathol...,Limbic (transitional),Not AD
3,No evidence of Alzheimer's disease neuropathol...,Neocortical,Criteria not met
4,"Microscopic changes of Alzheimer's disease, in...",Neocortical,Low
...,...,...,...
70,"Microscopic changes of Alzheimer's disease, in...",Amygdala Predominant,Criteria not met
71,No evidence of Alzheimer's disease neuropathol...,Neocortical,Criteria not met
72,No evidence of Alzheimer's disease neuropathol...,Neocortical,Criteria not met
73,No evidence of Alzheimer's disease neuropathol...,Amygdala Predominant,Criteria not met


In [110]:

clinpath_report = ReportCollector(destination="print")
validate_table(CLINPATH, "CLINPATH", CDE, clinpath_report)

print(clinpath_report.get_log())


All required fields are present in *CLINPATH* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 75/75 empty rows

	- age_at_diagnosis: 75/75 empty rows

	- first_motor_symptom: 75/75 empty rows

	- path_year_death: 75/75 empty rows

	- brain_weight: 75/75 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 75/75 empty rows
## Enums
🚨⚠️❗ **Invalid entries**
	- path_nia_ri:Criteria not met, Not AD
	>	 change to: Low, Intermediate, High, None
	- TDP43:Na
	>	 change to: None in medial temporal lobe, Present in amygdala, only, Present in hippocampus, only, Present in amygdala and hippocampus, only, Present in medial temporal lobe and middle frontal gyrus (not FTLD pattern), Unknown
	- amyloid_angiopathy_severity_scale:Cerebral amyloid angiopathy, temporal and occipital lobe, Cerebral amyloid angiopathy, frontal lobe
	>	 change to: None, Mild, Moderate, Severe, Not assessed, Unknown
	- path_ad_level:Microscopic changes of Alzheimer's disea

In [111]:
CLINPATH.head()

Unnamed: 0,sample_id,source_sample_id,time_from_baseline,GP2_id,hemisphere,region_level_1,region_level_2,region_level_3,AMPPD_id,family_history,...,sn_neuronal_loss,path_infarcs,path_nia_ri,path_nia_aa_a,path_nia_aa_b,path_nia_aa_c,TDP43,arteriolosclerosis_severity_scale,amyloid_angiopathy_severity_scale,path_ad_level
0,MFG_HC_1225,12-25,0,,Unknown,Frontal lobe,Middle frontal gyrus,unknown,,Not Reported,...,,,Criteria not met,,,,,,,No evidence of Alzheimer's disease neuropathol...
1,MFG_HC_0602,06-02,0,,Unknown,Frontal lobe,Middle frontal gyrus,unknown,,Not Reported,...,,,Criteria not met,,,,,,,No evidence of Alzheimer's disease neuropathol...
2,MFG_PD_0009,00-09,0,,Unknown,Frontal lobe,Middle frontal gyrus,unknown,,Not Reported,...,,,Not AD,,,,,,,No evidence of Alzheimer's disease neuropathol...
3,MFG_PD_1921,19-21,0,,Unknown,Frontal lobe,Middle frontal gyrus,unknown,,Not Reported,...,,,Criteria not met,,,,None in medial temporal lobe,,,No evidence of Alzheimer's disease neuropathol...
4,MFG_PD_2058,20-58,0,,Unknown,Frontal lobe,Middle frontal gyrus,unknown,,Not Reported,...,,,Low,,,,None in medial temporal lobe,,,"Microscopic changes of Alzheimer's disease, in..."


In [112]:

SAMPLE_ALL_CP = SAMPLE_ALL.merge(CLINPATH, on='sample_id', how='outer')


In [113]:
SAMPLE_ALL_CP.to_csv("./clean/team-Lee/auxiluary_metadata.csv")

In [115]:
data_path

PosixPath('/Users/ergonyc/Projects/ASAP/team-lee')

In [114]:
# fix the column order
STUDY = reorder_table_to_CDE(STUDY, "STUDY", CDE)
SAMPLE = reorder_table_to_CDE(SAMPLE, "SAMPLE", CDE)
PROTOCOL = reorder_table_to_CDE(PROTOCOL, "PROTOCOL", CDE)
SUBJECT = reorder_table_to_CDE(SUBJECT, "SUBJECT", CDE)     
CLINPATH = reorder_table_to_CDE(CLINPATH, "CLINPATH", CDE)

# write the clean metadata
STUDY.to_csv(data_path / "metadata/STUDY.csv")
PROTOCOL.to_csv(data_path / "metadata/PROTOCOL.csv")
CLINPATH.to_csv(data_path / "metadata/CLINPATH.csv")
SAMPLE.to_csv(data_path / "metadata/SAMPLE.csv")
SUBJECT.to_csv(data_path / "metadata/SUBJECT.csv")

# also write them to the local "clean" folder

export_root = Path.cwd() / "clean/team-Lee"
export_root.mkdir(parents=True, exist_ok=True)


STUDY.to_csv( export_root / "STUDY.csv")
PROTOCOL.to_csv(export_root / "PROTOCOL.csv")
SAMPLE.to_csv(export_root / "SAMPLE.csv")
SUBJECT.to_csv(export_root / "SUBJECT.csv")
CLINPATH.to_csv(export_root / "CLINPATH.csv")
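
The repeated `to_csv` calls could also be written as a loop over a name-to-table dictionary. A sketch under that assumption, using toy tables and a temporary directory so nothing is written to the project folders:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy tables standing in for the cleaned metadata tables.
tables = {
    "STUDY": pd.DataFrame({"project_name": ["demo"]}),
    "SUBJECT": pd.DataFrame({"subject_id": ["HC_1225"]}),
}

export_dir = Path(tempfile.mkdtemp()) / "clean" / "team-Lee"
export_dir.mkdir(parents=True, exist_ok=True)

# One CSV per table, named after the table.
for name, table in tables.items():
    table.to_csv(export_dir / f"{name}.csv")

written = sorted(p.name for p in export_dir.glob("*.csv"))
print(written)
```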


In [116]:
# make sure cleaned files are correct

SUBJECT = read_meta_table(f"{export_root}/SUBJECT.csv", dtypes_dict)
CLINPATH = read_meta_table(f"{export_root}/CLINPATH.csv", dtypes_dict)
STUDY = read_meta_table(f"{export_root}/STUDY.csv", dtypes_dict)
PROTOCOL = read_meta_table(f"{export_root}/PROTOCOL.csv", dtypes_dict)
SAMPLE = read_meta_table(f"{export_root}/SAMPLE.csv", dtypes_dict)
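
Reloading the exported files is a round-trip check. The sketch below shows one way to confirm that a table survives writing and reading unchanged. It uses an in-memory buffer and plain `pd.read_csv` with `index_col=0` rather than `read_meta_table`, whose implementation is not shown here, and the explicit `dtype` mapping stands in for `dtypes_dict`.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "subject_id": ["HC_1225", "PD_0009"],
        "source_subject_id": ["12-25", "00-09"],
    }
)

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column.
reloaded = pd.read_csv(
    buffer,
    header=0,
    index_col=0,
    dtype={"subject_id": str, "source_subject_id": str},
)

# Raises an AssertionError if values, columns, or dtypes differ.
pd.testing.assert_frame_equal(original, reloaded)
```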


# SUBJECT = pd.read_csv(f"{export_root}/SUBJECT.csv",header=0,index_col=0, dtype=dtypes_dict)
# CLINPATH = pd.read_csv(f"{export_root}/CLINPATH.csv",header=0,index_col=0, dtype=dtypes_dict)
# STUDY = pd.read_csv(f"{export_root}/STUDY.csv",header=0,index_col=0, dtype=dtypes_dict)
# PROTOCOL = pd.read_csv(f"{export_root}/PROTOCOL.csv",header=0,index_col=0, dtype=dtypes_dict)
# SAMPLE = pd.read_csv(f"{export_root}/SAMPLE.csv",header=0,index_col=0, dtype=dtypes_dict)


In [117]:
table, table_name = SUBJECT, "SUBJECT"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *SUBJECT* table.
No empty entries (Nan) found in _Required_ fields.
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- primary_diagnosis_text: 23/25 empty rows
## Enum fields have valid values in SUBJECT. 🥳



In [118]:
table, table_name = SAMPLE, "SAMPLE"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *SAMPLE* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- source_RIN: 150/150 empty rows

	- RIN: 150/150 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- DV200: 150/150 empty rows

	- pm_PH: 150/150 empty rows
## Enum fields have valid values in SAMPLE. 🥳



In [119]:
table, table_name = CLINPATH, "CLINPATH"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *CLINPATH* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 75/75 empty rows

	- age_at_diagnosis: 75/75 empty rows

	- first_motor_symptom: 75/75 empty rows

	- path_year_death: 75/75 empty rows

	- brain_weight: 75/75 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 75/75 empty rows
## Enums
🚨⚠️❗ **Invalid entries**
	- path_nia_ri:Criteria not met, Not AD
	>	 change to: Low, Intermediate, High, None
	- TDP43:Na
	>	 change to: None in medial temporal lobe, Present in amygdala, only, Present in hippocampus, only, Present in amygdala and hippocampus, only, Present in medial temporal lobe and middle frontal gyrus (not FTLD pattern), Unknown
	- amyloid_angiopathy_severity_scale:Cerebral amyloid angiopathy, temporal and occipital lobe, Cerebral amyloid angiopathy, frontal lobe
	>	 change to: None, Mild, Moderate, Severe, Not assessed, Unknown
	- path_ad_level:Microscopic changes of Alzheimer's disea