# ASAP CRN Metadata validation - wave 1

28 October 2023
Andy Henrie



## STEPS

### imports
- pandas
- pathlib

### Load CDE for validation
- check all columns


### Team Hardy
- load .csv files with tables


In [1]:
import pandas as pd
from pathlib import Path


# local helpers
from utils.qcutils import validate_table, force_enum_string, reorder_table_to_CDE
from utils.io import ReportCollector, get_dtypes_dict, read_meta_table




Streamlit NOT successfully


## Load CDE

In [2]:
CDE_path = Path.cwd() / "ASAP_CDE.csv" 
CDE = pd.read_csv(CDE_path )

CDE.head()



Unnamed: 0,Table,Field,Description,DataType,Required,Validation,Unnamed: 6,ClinPath field,team_Hafler type,ClinPath description,Unnamed: 10
0,STUDY,project_name,Project Name: A Title of the overall project...,String,Required,,,,,,bard
1,STUDY,project_dataset,Dataset Name: A unique name is required for ...,String,Required,,,,,,
2,STUDY,project_description,Project Description: Brief description of th...,String,Required,,,,,,
3,STUDY,ASAP_team_name,ASAP Team Name: Name of the ASAP CRN Team. i...,Enum,Required,"[""TEAM-LEE"",""TEAM-HAFLER"",""TEAM-HARDY"", ""TEAM-...",,,,,
4,STUDY,ASAP_lab_name,Lab Name. : Lab name that is submitting data...,String,Required,,,,,,


## Clean Team Hardy tables

### Load Tables from csv

The metadata path below has copies of the raw meta-tables

In [12]:
# # AS UPLOADED FROM Team Hardy.  This is the raw meta-data
# Samples with proper batch: "Projects/ASAP/team-hardy/metadata/23102023_SAMPLE.csv"
# All other tables transferred directly to the raw bucket "Projects/ASAP/team-hardy/hardy-metadata-20232009"

# Initialize the data types dictionary
dtypes_dict = get_dtypes_dict(CDE)
    


# paths to the local copies of the raw meta-tables
data_path = Path.home() / "Projects/ASAP/team-hardy"
metadata_path = data_path / "metadata"

# the unmodified Hardy meta-tables don't have index columns
SUBJECT = pd.read_csv(f"{metadata_path}/SUBJECT.csv", dtype=dtypes_dict)
CLINPATH = pd.read_csv(f"{metadata_path}/CLINPATH.csv", dtype=dtypes_dict)
STUDY = pd.read_csv(f"{metadata_path}/STUDY.csv", dtype=dtypes_dict)
PROTOCOL = pd.read_csv(f"{metadata_path}/PROTOCOL.csv", dtype=dtypes_dict)
SAMPLE = pd.read_csv(f"{metadata_path}/SAMPLE.csv", dtype=dtypes_dict)
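`get_dtypes_dict(CDE)` is a local helper and its body is not shown in this notebook. As a rough sketch of the idea only, a dtype map for `pd.read_csv` could be derived from the `Field` and `DataType` columns of the CDE. The name `sketch_dtypes_dict` and the `Integer`/`Float` labels are assumptions for illustration, not the helper's actual behavior.

```python
import pandas as pd


def sketch_dtypes_dict(cde: pd.DataFrame) -> dict:
    """Map each CDE field to a dtype (hypothetical stand-in for get_dtypes_dict)."""
    # Assumed DataType labels; anything else (String, Enum, ...) is read as text
    type_map = {"Integer": "Int64", "Float": "float64"}
    return {
        field: type_map.get(data_type, str)
        for field, data_type in zip(cde["Field"], cde["DataType"])
    }


# Toy CDE with the same column names as ASAP_CDE.csv
toy_cde = pd.DataFrame(
    {
        "Table": ["SUBJECT", "SUBJECT", "SUBJECT"],
        "Field": ["subject_id", "sex", "duration_pmi"],
        "DataType": ["String", "Enum", "Float"],
    }
)
toy_dtypes = sketch_dtypes_dict(toy_cde)
```

Reading Enum and String fields as text keeps codes such as `"0"` or `"1"` from being coerced to numbers before the enum checks run.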



In [14]:
SUBJECT.head()

# SUBJECT = pd.read_csv(f"{metadata_path}/SUBJECT.csv", dtype=dtypes_dict)
# CLINPATH = pd.read_csv(f"{metadata_path}/CLINPATH.csv", dtype=dtypes_dict)
# STUDY = pd.read_csv(f"{metadata_path}/STUDY.csv", dtype=dtypes_dict)
# PROTOCOL = pd.read_csv(f"{metadata_path}/PROTOCOL.csv", dtype=dtypes_dict)
# SAMPLE = pd.read_csv(f"{metadata_path}/SAMPLE.csv", dtype=dtypes_dict)





Unnamed: 0,subject_id,source_subject_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,duration_pmi,primary_diagnosis,primary_diagnosis_text
0,babom,P2/14,QSBB_UK,Human,Female,78,,,46.0,Idiopathic PD,clinpath info: PDD | PD (with dementia)
1,borah,P4/11,QSBB_UK,Human,Male,63,,,37.0,Idiopathic PD,clinpath info: PD | PD
2,bovon,P95/10,QSBB_UK,Human,Male,81,,,59.5,Idiopathic PD,clinpath info: PD | NA
3,davof,P80/11,QSBB_UK,Human,Male,80,,,100.0,Idiopathic PD,clinpath info: PD | PD
4,dudug,P82/10,QSBB_UK,Human,Female,87,,,84.0,No PD nor other neurological disorder,clinpath info: Control | Control


### SUBJECT

In [15]:

SUBJECT['sex'] = SUBJECT['sex'].replace({'F':"Female", 'M':"Male"})
SUBJECT['race'] = SUBJECT['race'].replace({'W':"White", 'B':"Black or African American"})

SUBJECT['primary_diagnosis'] = SUBJECT['primary_diagnosis'].replace({'Normal control':"Healthy Control", "Idiopathic Parkinson's disease":"Idiopathic PD"})


In [16]:
SUBJECT = reorder_table_to_CDE(SUBJECT, "SUBJECT", CDE)     
subject_report = ReportCollector(destination="print")

validate_table(SUBJECT, "SUBJECT", CDE, subject_report)
print(subject_report.get_log())


All required fields are present in *SUBJECT* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- ethnicity: 64/64 empty rows

	- duration_pmi: 1/64 empty rows
No empty entries (Nan) found in _Optional_ fields.
## Enum fields have valid values in SUBJECT. 🥳
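The report above lists the number of empty rows per required field. A minimal sketch of how such a count could be computed from the CDE follows. `count_empty_required` is a hypothetical name, the data is a toy example, and this is not the implementation of `validate_table`.

```python
import pandas as pd


def count_empty_required(table: pd.DataFrame, table_name: str, cde: pd.DataFrame) -> dict:
    """Return {field: n_empty} for required fields of `table_name` that contain NaN."""
    spec = cde[(cde["Table"] == table_name) & (cde["Required"] == "Required")]
    counts = {}
    for field in spec["Field"]:
        if field not in table.columns:
            continue  # a missing column is a different problem than empty rows
        n_empty = int(table[field].isna().sum())
        if n_empty > 0:
            counts[field] = n_empty
    return counts


toy_cde = pd.DataFrame(
    {
        "Table": ["SUBJECT", "SUBJECT", "SUBJECT"],
        "Field": ["subject_id", "ethnicity", "duration_pmi"],
        "Required": ["Required", "Required", "Required"],
    }
)
toy_subject = pd.DataFrame(
    {
        "subject_id": ["babom", "borah", "bovon"],
        "ethnicity": [None, None, None],
        "duration_pmi": [46.0, None, 59.5],
    }
)
empty_counts = count_empty_required(toy_subject, "SUBJECT", toy_cde)
```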



In [17]:
SUBJECT.head()

Unnamed: 0,subject_id,source_subject_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,duration_pmi,primary_diagnosis,primary_diagnosis_text
0,babom,P2/14,QSBB_UK,Human,Female,78,,,46.0,Idiopathic PD,clinpath info: PDD | PD (with dementia)
1,borah,P4/11,QSBB_UK,Human,Male,63,,,37.0,Idiopathic PD,clinpath info: PD | PD
2,bovon,P95/10,QSBB_UK,Human,Male,81,,,59.5,Idiopathic PD,clinpath info: PD | NA
3,davof,P80/11,QSBB_UK,Human,Male,80,,,100.0,Idiopathic PD,clinpath info: PD | PD
4,dudug,P82/10,QSBB_UK,Human,Female,87,,,84.0,No PD nor other neurological disorder,clinpath info: Control | Control


### SAMPLE

The `source_subject_id` values are in the CLINPATH table.

In [27]:
SAMPLE = pd.read_csv(f"{metadata_path}/SAMPLE.csv", dtype=dtypes_dict)
SAMPLE_OG = SAMPLE.copy()
# SAMPLE: source_subject_id -> source_sample_id

SAMPLE = reorder_table_to_CDE(SAMPLE, "SAMPLE", CDE)

#SAMPLE[['sample_id','source_sample_id','subject_id', 'batch']]

In [28]:
# force the right organism_ontology_term_id
SAMPLE["organism_ontology_term_id"] = "NCBITaxon:9606"

# fix batch to be BATCH_1, BATCH_2, etc
SAMPLE['batch'] = "BATCH_" + SAMPLE['batch']
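Note that prefixing `batch` by plain concatenation is not idempotent: re-running the cell would give `BATCH_BATCH_1`. A sketch of a re-runnable variant on toy data, assuming `batch` holds strings such as `"1"` or `"2"` (`add_batch_prefix` is a hypothetical helper name):

```python
import pandas as pd


def add_batch_prefix(batch: pd.Series, prefix: str = "BATCH_") -> pd.Series:
    """Prefix batch labels once; already-prefixed or missing values are left alone."""

    def fix(value):
        if pd.isna(value):
            return value
        text = str(value)
        return text if text.startswith(prefix) else prefix + text

    return batch.map(fix)


toy_batch = pd.Series(["1", "2", "BATCH_2", None])
once = add_batch_prefix(toy_batch)
twice = add_batch_prefix(once)  # applying it again changes nothing
```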

In [29]:
# SAMPLE = reorder_table_to_CDE(SAMPLE, "SAMPLE", CDE)     

sample_report = ReportCollector(destination="print")
validate_table(SAMPLE, "SAMPLE", CDE, sample_report)
print(sample_report.get_log())

All required fields are present in *SAMPLE* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- source_RIN: 3616/3616 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- pm_PH: 3616/3616 empty rows
## Enums
🚨⚠️❗ **Invalid entries**
	- sequencing_length:190
	>	 change to: 25, 50, 100, 150



In [30]:
SAMPLE.source_sample_id.unique()

array([''], dtype=object)

### CLINPATH

Some `sample_id` values are duplicated, and the duplicated rows also have NULL entries in some of the Enum fields. The cause is unknown.

In [31]:
CLINPATH_og = CLINPATH.copy()

CLINPATH.drop_duplicates(subset=['sample_id'], inplace=True)

# # CLINPATH.rename(columns={"subject_id":"SUBJECT_ID"}, inplace=True)
# CLINPATH['source_sample_id']
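`drop_duplicates` keeps the first row of each duplicated `sample_id`, which may or may not be the row with the fewest NULL entries. Before dropping, it can help to look at how the duplicated rows differ. A small sketch on made-up data, which also shows one possible way to keep the most complete row per `sample_id`:

```python
import pandas as pd

toy_clinpath = pd.DataFrame(
    {
        "sample_id": ["libat_IPL", "libat_IPL", "rijof_IPL"],
        "path_braak_nft": ["2", None, "1"],
    }
)

# keep=False flags every member of a duplicated group, not just the repeats
dupes = toy_clinpath[toy_clinpath.duplicated(subset=["sample_id"], keep=False)]

# Prefer the most complete row per sample_id: fewest missing values sorts first
n_missing = toy_clinpath.isna().sum(axis=1)
deduped = (
    toy_clinpath.assign(_n_missing=n_missing)
    .sort_values("_n_missing", kind="stable")
    .drop_duplicates(subset=["sample_id"], keep="first")
    .drop(columns="_n_missing")
    .sort_index()
)
```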


In [32]:

clinpath_report = ReportCollector(destination="print")
validate_table(CLINPATH, "CLINPATH", CDE, clinpath_report)

print(clinpath_report.get_log())


🚨⚠️❗ **Missing Required Fields in CLINPATH: source_sample_id**
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 128/128 empty rows

	- first_motor_symptom: 128/128 empty rows

	- path_year_death: 128/128 empty rows

	- brain_weight: 128/128 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 128/128 empty rows
## Enums
🚨⚠️❗ **Invalid entries**
	- path_autopsy_dx_main:Control brain, Pathological ageing, Control brain / Path ageing, Argyrophilic grain disease, Control brain, Cerebrovascular disease (small vessel), Cerebrovascular disease (small vessel), Control brain, Alzheimer`s disease (intermediate level AD pathological change), Control brain / Path ageing, CAA
	>	 change to: Lewy body disease nos, Parkinson's disease, Parkinson's disease with dementia, Dementia with Lewy bodies, Multiple system atrophy (SND>OPCA), Multiple system atrophy (OPCA<SND), Multiple system atrophy (SND=OPCA), Progressive supranuclear palsy, Corticobasal degenera

In [33]:

clinpath_report2 = ReportCollector(destination="print")
validate_table(CLINPATH_og, "CLINPATH", CDE, clinpath_report2)

print(clinpath_report2.get_log())


🚨⚠️❗ **Missing Required Fields in CLINPATH: source_sample_id**
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 138/138 empty rows

	- age_at_diagnosis: 10/138 empty rows

	- first_motor_symptom: 138/138 empty rows

	- path_year_death: 138/138 empty rows

	- brain_weight: 138/138 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 138/138 empty rows
## Enums
🚨⚠️❗ **Invalid entries**
	- path_autopsy_dx_main:Control brain, Pathological ageing, Control brain / Path ageing, Argyrophilic grain disease, Control brain, Cerebrovascular disease (small vessel), Cerebrovascular disease (small vessel), Control brain, Alzheimer`s disease (intermediate level AD pathological change), Control brain / Path ageing, CAA
	>	 change to: Lewy body disease nos, Parkinson's disease, Parkinson's disease with dementia, Dementia with Lewy bodies, Multiple system atrophy (SND>OPCA), Multiple system atrophy (OPCA<SND), Multiple system atrophy (SND=OPCA), Progressive s

In [34]:
# replace 'path_braak_asyn' with the string form of the integer stage.
# TODO: decide how to encode nan (empty string "" or something else)
# CLINPATH['path_braak_asyn'] = CLINPATH['path_braak_asyn'].apply(lambda val: str(int(float(val))))
CLINPATH['path_braak_asyn'].apply(lambda val: str(int(float(val)))).unique()

array(['6', '0', '5'], dtype=object)
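The open question in the cell above is what to do with missing values, since `str(int(float(val)))` raises on NaN or None. One possible approach, sketched on toy values, is to pass missing entries through unchanged. `stage_to_str` is a hypothetical helper name, and whether missing values should stay missing or become `""` is still a decision for the CDE.

```python
import pandas as pd


def stage_to_str(value):
    """Turn '6', 6 or 6.0 into '6'; pass missing values through unchanged."""
    if pd.isna(value):
        return value
    return str(int(float(value)))


toy_stage = pd.Series(["6.0", "0", 5.0, None], dtype=object)
clean_stage = toy_stage.map(stage_to_str)
```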

In [35]:
CLINPATH['path_braak_nft'].unique()

array(['2', '1', '3', '0', '4', '6'], dtype=object)

In [36]:


# replace the numeric 'path_braak_nft' stages with Roman numerals ("0" stays "0")
CLINPATH['path_braak_nft'] = CLINPATH['path_braak_nft'].replace({"0":"0", 
                                                                "1":"I", 
                                                                "2": "II", 
                                                                "3":"III", 
                                                                "4":"IV", 
                                                                "5":"V", 
                                                                "6":"VI"})


In [37]:

# code family_history as "Not Reported" (currently empty)
CLINPATH['family_history'] = "Not Reported"

# check APOE_e4_status ? currently empty
# `path_autopsy_dx_main` values actually look reasonable; the parser might be wrong

# code "at least 4" as "4/5" 
CLINPATH['path_thal'] = CLINPATH['path_thal'].replace({'At least 4':"4/5"})


CLINPATH['path_mckeith'] = CLINPATH['path_mckeith'].replace({'Diffuse neocortical': "Diffuse, neocortical (brainstem, limbic and neocortical involvement)", 
                                                        'Limbic transitional': "Limbic (transitional)" ,
                                                        'Diffuse Neocortical':"Diffuse, neocortical (brainstem, limbic and neocortical involvement)"})

# prefix the 'path_nia_aa_a' scores with "A" (e.g. "1" -> "A1")
CLINPATH['path_nia_aa_a'] = CLINPATH['path_nia_aa_a'].replace({"0":"A0", 
                                                                    "1":"A1", 
                                                                    "2": "A2", 
                                                                    "3":"A3"})


In [38]:
CLINPATH['path_nia_aa_b'].unique()

array(['1', '2', '0', '3'], dtype=object)

In [39]:

# prefix the 'path_nia_aa_b' scores with "B" (e.g. "1" -> "B1")
CLINPATH['path_nia_aa_b'] = CLINPATH['path_nia_aa_b'].replace({"0":"B0", 
                                                                "1":"B1", 
                                                                "2": "B2", 
                                                                "3":"B3"})



In [40]:

# prefix the 'path_nia_aa_c' scores with "C" (e.g. "1" -> "C1")
CLINPATH['path_nia_aa_c'] = CLINPATH['path_nia_aa_c'].replace({"0":"C0", 
                                                                "1":"C1", 
                                                                "2": "C2", 
                                                                "3":"C3"})


In [41]:
CLINPATH['path_ad_level'] = CLINPATH['path_ad_level'].replace({"No evidence": "No evidence of Alzheimer\'s disease neuropathological change"})



In [42]:
CLINPATH[['source_sample_id','path_autopsy_dx_main','path_ad_level', "cause_death"]].drop_duplicates().head(50)


KeyError: "['source_sample_id'] not in index"

In [43]:
CLINPATH['path_autopsy_dx_main'].unique()



array(["Parkinson's disease with dementia", "Parkinson's disease",
       'Control brain', 'Pathological ageing',
       'Control brain / Path ageing', 'Argyrophilic grain disease',
       'Control brain, Cerebrovascular disease (small vessel)',
       'Cerebrovascular disease (small vessel)',
       'Control brain, Alzheimer`s disease (intermediate level AD pathological change)',
       'Control brain / Path ageing, CAA'], dtype=object)

In [44]:
path_autopsy_map = { "Parkinson's disease with dementia": "Parkinson's disease with dementia", 
       "Parkinson's disease": "Parkinson's disease",
       'Control brain':"Control, no misfolded protein or significant vascular pathology", 
       'Pathological ageing': 'Control, no misfolded protein or significant vascular pathology',
       'Control brain / Path ageing': 'Control, no misfolded protein or significant vascular pathology',
       'Argyrophilic grain disease': "Control, Argyrophilic grain disease",
       'Control brain, Cerebrovascular disease (small vessel)':"Control, Cerebrovascular disease (atherosclerosis)",
       'Cerebrovascular disease (small vessel)':"Control, Cerebrovascular disease (atherosclerosis)",
       "Control brain, Alzheimer`s disease (intermediate level AD pathological change)":"Alzheimer's disease (intermediate level neuropathological change)",
       'Control brain / Path ageing, CAA':"Control, Cerebrovascular disease (cerebral amyloid angiopathy)"}


In [45]:
CLINPATH['path_autopsy_dx_main'] = CLINPATH['path_autopsy_dx_main'].replace(path_autopsy_map)
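`Series.replace` leaves any value that is not a key in the mapping unchanged, so a diagnosis missing from the map would slip through silently. A quick coverage check, sketched with a shortened map and one made-up value (`"CAA only"`) that is deliberately not covered:

```python
import pandas as pd

toy_dx = pd.Series(
    ["Control brain", "Pathological ageing", "Parkinson's disease", "CAA only"]
)
toy_map = {
    "Control brain": "Control, no misfolded protein or significant vascular pathology",
    "Pathological ageing": "Control, no misfolded protein or significant vascular pathology",
    "Parkinson's disease": "Parkinson's disease",
}

# Observed values with no entry in the mapping would pass through replace() unchanged
unmapped = sorted(set(toy_dx.dropna().unique()) - set(toy_map))
```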

In [46]:
CLINPATH['path_autopsy_dx_main'].unique()


array(["Parkinson's disease with dementia", "Parkinson's disease",
       'Control, no misfolded protein or significant vascular pathology',
       'Control, Argyrophilic grain disease',
       'Control, Cerebrovascular disease (atherosclerosis)',
       "Alzheimer's disease (intermediate level neuropathological change)",
       'Control, Cerebrovascular disease (cerebral amyloid angiopathy)'],
      dtype=object)

In [47]:
clinpath_report = ReportCollector(destination="print")
validate_table(CLINPATH, "CLINPATH", CDE, clinpath_report)

print(clinpath_report.get_log())

🚨⚠️❗ **Missing Required Fields in CLINPATH: source_sample_id**
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 128/128 empty rows

	- first_motor_symptom: 128/128 empty rows

	- path_year_death: 128/128 empty rows

	- brain_weight: 128/128 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 128/128 empty rows
## Enum fields have valid values in CLINPATH. 🥳
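The enum check compares each column against the allowed values in the CDE `Validation` column. A minimal sketch of that comparison, assuming the `Validation` entry is a JSON-style list of strings (as the CDE preview suggests); `invalid_enum_values` is a hypothetical name and the allowed list below is illustrative only.

```python
import json

import pandas as pd


def invalid_enum_values(column: pd.Series, validation: str) -> list:
    """List observed values that are not in the JSON-style list of allowed values."""
    allowed = set(json.loads(validation))
    observed = column.dropna().astype(str).unique()
    return sorted(value for value in observed if value not in allowed)


toy_scores = pd.Series(["A1", "1", "A0", "1", None])
bad_scores = invalid_enum_values(toy_scores, '["A0", "A1", "A2", "A3"]')
```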



### STUDY

In [71]:
# fix STUDY formatting: the raw table is long (name, value); the CDE expects one wide row
tmp = STUDY[["name","value"]].transpose().reset_index().drop(columns=["index"])
tmp.columns = tmp.iloc[0]
STUDY = tmp.drop([0]).reset_index(drop=True)
# # STUDY[["Unnamed: 1"]].transpose().reset_index().drop(columns=["index"]), tmp
# STUDY.head()
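The reshaping above turns a long `name`/`value` table into a single-row wide table. An equivalent, more compact formulation is sketched below on toy data; it assumes each `name` appears exactly once.

```python
import pandas as pd

toy_long = pd.DataFrame(
    {
        "name": ["project_name", "ASAP_team_name", "ASAP_lab_name"],
        "value": ["Toy project", "TEAM-HARDY", "Ryten Lab"],
    }
)

# Index by field name, keep only the values, then transpose to one row
toy_wide = toy_long.set_index("name")[["value"]].T.reset_index(drop=True)
toy_wide.columns.name = None  # drop the leftover "name" label on the column axis
```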

In [72]:
study_report = ReportCollector(destination="print")
validate_table(STUDY, "STUDY", CDE, study_report)
print(study_report.get_log())

🚨⚠️❗ **Missing Required Fields in STUDY: submittor_email**
No empty entries (Nan) found in _Required_ fields.
No empty entries (Nan) found in _Optional_ fields.
## Enum fields have valid values in STUDY. 🥳



In [73]:
STUDY

Unnamed: 0,project_name,project_dataset,project_description,ASAP_team_name,ASAP_lab_name,PI_full_name,PI_email,contributor_names,submitter_name,submitter_email,...,other_funding_source,publication_DOI,publication_PMID,number_of_brain_samples,brain_regions,types_of_samples,PI_ORCHID,PI_google_scholar_id,DUA_version,metadata_version_date
0,Understanding mechanisms of Parkinson's diseas...,Hardy snRNA-seq,Genetic analysis has identified many risk gene...,TEAM-HARDY,Ryten Lab,Mina Ryten,mina.ryten@ucl.ac.uk,"Aine Fairbrother-Browne, Jonathan Brenton, Mel...",Aine Fairbrother-Browne,aine.fairbrother-browne.18@ucl.ac.uk,...,,,,128,"Inferior Parietal Lobule (IPL), Anterior Cingu...",Late stage (Braak 5-6) PD and control post-mor...,0000-0001-9520-6957,https://scholar.google.co.uk/citations?user=lt...,,"Version 1, 09/2023"


In [39]:
STUDY.head()

Unnamed: 0,project_name,project_dataset,project_description,ASAP_team_name,ASAP_lab_name,PI_full_name,PI_email,contributor_names,submitter_name,submittor_email,...,other_funding_source,publication_DOI,publication_PMID,number_of_brain_samples,brain_regions,types_of_samples,PI_ORCHID,PI_google_scholar_id,DUA_version,metadata_version_date


### PROTOCOL

In [76]:
# fix PROTOCOL formatting (same long-to-wide reshaping as STUDY)
tmp = PROTOCOL[["name","value"]].transpose().reset_index().drop(columns=["index"])
tmp.columns = tmp.iloc[0]
PROTOCOL = tmp.drop([0]).reset_index(drop=True)
PROTOCOL

Unnamed: 0,sample_collection_summary,cell_extraction_summary,lib_prep_summary,data_processing_summary,github_url,protocols_io_DOI,other_reference
0,"This dataset contains cortical regions only, p...",From protocols.io: This protocol is used to is...,'Nuclei were extracted from homogenised post-m...,Cell ranger was used to convert raw sequencing...,Raw to fastq to mapped: https://github.com/RHR...,Nuclear extraction protocol: 10.17504/protocol...,


In [77]:
protocol_report = ReportCollector(destination="print")
validate_table(PROTOCOL, "PROTOCOL", CDE, protocol_report)
print(protocol_report.get_log())

All required fields are present in *PROTOCOL* table.
No empty entries (Nan) found in _Required_ fields.
No empty entries (Nan) found in _Optional_ fields.
## Enum fields have valid values in PROTOCOL. 🥳






### export clean tables

In [78]:
# reset data_path to the team root
data_path = Path.home() / "Projects/ASAP/team-hardy"

In [79]:
data_path.name.split("-")[1]

'hardy'

In [80]:

# # write the clean metadata
# STUDY.to_csv(data_path / "metadata/STUDY.csv")
# PROTOCOL.to_csv(data_path / "metadata/PROTOCOL.csv")
# CLINPATH.to_csv(data_path / "metadata/CLINPATH.csv")
# SAMPLE.to_csv(data_path / "metadata/SAMPLE.csv")
# SUBJECT.to_csv(data_path / "metadata/SUBJECT.csv")

# also write them to the clean directory

export_root = Path.cwd() / "clean/team-Hardy"
export_root.mkdir(parents=True, exist_ok=True)


STUDY.to_csv( export_root / "STUDY.csv")
PROTOCOL.to_csv(export_root / "PROTOCOL.csv")
SAMPLE.to_csv(export_root / "SAMPLE.csv")
SUBJECT.to_csv(export_root / "SUBJECT.csv")
CLINPATH.to_csv(export_root / "CLINPATH.csv")


In [93]:
# make sure cleaned files are correct

SUBJECT = read_meta_table(f"{export_root}/SUBJECT.csv", dtypes_dict)
CLINPATH = read_meta_table(f"{export_root}/CLINPATH.csv", dtypes_dict)
STUDY = read_meta_table(f"{export_root}/STUDY.csv", dtypes_dict)
PROTOCOL = read_meta_table(f"{export_root}/PROTOCOL.csv", dtypes_dict)
SAMPLE = read_meta_table(f"{export_root}/SAMPLE.csv", dtypes_dict)


# SUBJECT = pd.read_csv(f"{export_root}/SUBJECT.csv",header=0,index_col=0, dtype=dtypes_dict)
# CLINPATH = pd.read_csv(f"{export_root}/CLINPATH.csv",header=0,index_col=0, dtype=dtypes_dict)
# STUDY = pd.read_csv(f"{export_root}/STUDY.csv",header=0,index_col=0, dtype=dtypes_dict)
# PROTOCOL = pd.read_csv(f"{export_root}/PROTOCOL.csv",header=0,index_col=0, dtype=dtypes_dict)
# SAMPLE = pd.read_csv(f"{export_root}/SAMPLE.csv",header=0,index_col=0, dtype=dtypes_dict)


In [94]:
SUBJ = pd.read_csv(f"{export_root}/SUBJECT.csv", dtype=dtypes_dict)
SUBJ.columns[0]

'Unnamed: 0'
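The `Unnamed: 0` column appears because `to_csv` writes the index by default and a plain `read_csv` reads it back as an unnamed data column. A sketch of a round trip that avoids this, using in-memory buffers and toy data instead of the exported files:

```python
import io

import pandas as pd

toy_subject = pd.DataFrame(
    {"subject_id": ["babom", "borah"], "sex": ["Female", "Male"]}
)

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy_subject.to_csv(with_index)
with_index.seek(0)
reread_default = pd.read_csv(with_index)

# index=False limits the file to the table's own columns
without_index = io.StringIO()
toy_subject.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)
```

Reading with `index_col=0`, as in the commented-out lines above, is the other way to handle files that were written with the index.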

In [95]:
SUBJECT.head()

Unnamed: 0,subject_id,source_subject_id,biobank_name,organism,sex,age_at_collection,race,ethnicity,duration_pmi,primary_diagnosis,primary_diagnosis_text
0,babom,P2/14,QSBB_UK,Human,Female,78,,,46.0,Idiopathic PD,clinpath info: PDD | PD (with dementia)
1,borah,P4/11,QSBB_UK,Human,Male,63,,,37.0,Idiopathic PD,clinpath info: PD | PD
2,bovon,P95/10,QSBB_UK,Human,Male,81,,,59.5,Idiopathic PD,clinpath info: PD | NA
3,davof,P80/11,QSBB_UK,Human,Male,80,,,100.0,Idiopathic PD,clinpath info: PD | PD
4,dudug,P82/10,QSBB_UK,Human,Female,87,,,84.0,No PD nor other neurological disorder,clinpath info: Control | Control


In [82]:
table, table_name = SUBJECT, "SUBJECT"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *SUBJECT* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- ethnicity: 64/64 empty rows

	- duration_pmi: 1/64 empty rows
No empty entries (Nan) found in _Optional_ fields.
## Enum fields have valid values in SUBJECT. 🥳



In [83]:
table, table_name = SAMPLE, "SAMPLE"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *SAMPLE* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- source_RIN: 3616/3616 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- DV200: 3616/3616 empty rows

	- pm_PH: 3616/3616 empty rows
## Enums
🚨⚠️❗ **Invalid entries**
	- sequencing_length:190
	>	 change to: 25, 50, 100, 150



In [84]:
table, table_name = CLINPATH, "CLINPATH"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

🚨⚠️❗ **Missing Required Fields in CLINPATH: source_sample_id**
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 128/128 empty rows

	- first_motor_symptom: 128/128 empty rows

	- path_year_death: 128/128 empty rows

	- brain_weight: 128/128 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 128/128 empty rows
## Enum fields have valid values in CLINPATH. 🥳



In [85]:
CLINPATH['path_braak_asyn'].unique()

array(['6', '0', '5'], dtype=object)

In [86]:
CLINPATH.head()

Unnamed: 0,sample_id,source_subject_id,time_from_baseline,GP2_id,hemisphere,region_level_1,region_level_2,region_level_3,AMPPD_id,family_history,...,path_nia_ri,path_nia_aa_a,path_nia_aa_b,path_nia_aa_c,TDP43,arteriolosclerosis_severity_scale,amyloid_angiopathy_severity_scale,path_ad_level,dig_slide_avail,quant_path_avail
0,libat_IPL,P27/11,0,MDGAP-QSBB_000088_s1,Left,Parietal lobe,Inferior parietal lobule,Grey matter,,Not Reported,...,,A1,B1,C0,,,,Low level Alzheimer's disease neuropathologica...,Yes,Yes
1,libat_ACG,P27/11,0,MDGAP-QSBB_000088_s1,Left,Cingulate gyrus,Anterior cingulate gyrus,Grey matter,,Not Reported,...,,A1,B1,C0,,,,Low level Alzheimer's disease neuropathologica...,Yes,Yes
2,rijof_IPL,P8/18,0,MDGAP-QSBB_000583_s1,Right,Parietal lobe,Inferior parietal lobule,Grey matter,,Not Reported,...,,A1,B1,C0,,,,Low level Alzheimer's disease neuropathologica...,Yes,Yes
3,rijof_ACG,P8/18,0,MDGAP-QSBB_000583_s1,Right,Cingulate gyrus,Anterior cingulate gyrus,Grey matter,,Not Reported,...,,A1,B1,C0,,,,Low level Alzheimer's disease neuropathologica...,Yes,Yes
4,gotar_IPL,P41/09,0,MDGAP-QSBB_000406_s1,Right,Parietal lobe,Inferior parietal lobule,Grey matter,,Not Reported,...,,A1,B1,C0,,,,Low level Alzheimer's disease neuropathologica...,Yes,Yes


In [50]:
SAMPLE.head()

Unnamed: 0,sample_id,source_sample_id,subject_id,replicate,replicate_count,repeated_sample,batch,tissue,brain_region,source_RIN,...,sex_ontology_term_id,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,cell_type_ontology_term_id,assay_ontology_term_id,suspension_type,DV200,pm_PH,donor_id
0,babom_ACG,,babom,,1,0,BATCH_2,brain,ACG,,...,PATO:0000383 (female),,MONDO:0005180,UBERON:0009835,,EFO:0008913,nucleus,,,
1,babom_ACG,,babom,,1,0,BATCH_2,brain,ACG,,...,PATO:0000383 (female),,MONDO:0005180,UBERON:0009835,,EFO:0008913,nucleus,,,
2,babom_ACG,,babom,,1,0,BATCH_2,brain,ACG,,...,PATO:0000383 (female),,MONDO:0005180,UBERON:0009835,,EFO:0008913,nucleus,,,
3,babom_ACG,,babom,,1,0,BATCH_2,brain,ACG,,...,PATO:0000383 (female),,MONDO:0005180,UBERON:0009835,,EFO:0008913,nucleus,,,
4,babom_ACG,,babom,,1,0,BATCH_2,brain,ACG,,...,PATO:0000383 (female),,MONDO:0005180,UBERON:0009835,,EFO:0008913,nucleus,,,
