# ASAP CRN Metadata validation - wave 1

28 October 2023
Andy Henrie



## STEPS

### imports
- pandas
- pathlib

### Load CDE for validation
- check all columns


### Team Hafler
- load excel file with tables
- add batch info
- add missing columns


In [1]:
import pandas as pd
from pathlib import Path


# local helpers
from utils.qcutils import validate_table, force_enum_string, reorder_table_to_CDE
from utils.io import ReportCollector, get_dtypes_dict, read_meta_table




Streamlit NOT successfully


## Load CDE

In [2]:
CDE_path = Path.cwd() / "ASAP_CDE.csv" 
CDE = pd.read_csv(CDE_path )

CDE.head()



Unnamed: 0,Table,Field,Description,DataType,Required,Validation,Unnamed: 6,ClinPath field,team_Hafler type,ClinPath description,Unnamed: 10
0,STUDY,project_name,Project Name: A Title of the overall project...,String,Required,,,,,,bard
1,STUDY,project_dataset,Dataset Name: A unique name is required for ...,String,Required,,,,,,
2,STUDY,project_description,Project Description: Brief description of th...,String,Required,,,,,,
3,STUDY,ASAP_team_name,ASAP Team Name: Name of the ASAP CRN Team. i...,Enum,Required,"[""TEAM-LEE"",""TEAM-HAFLER"",""TEAM-HARDY"", ""TEAM-...",,,,,
4,STUDY,ASAP_lab_name,Lab Name. : Lab name that is submitting data...,String,Required,,,,,,
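
The `Validation` column appears to hold the allowed values of `Enum` fields as a JSON-style list string (see the `ASAP_team_name` row above). Assuming that format holds for every `Enum` row, a minimal sketch of extracting the allowed values could look like this. `toy_cde` and `get_allowed_values` are hypothetical names for illustration and are not part of `utils`.

```python
import json

import pandas as pd

# Toy stand-in for the CDE; the real table is loaded from ASAP_CDE.csv.
toy_cde = pd.DataFrame(
    {
        "Table": ["STUDY", "STUDY"],
        "Field": ["project_name", "ASAP_team_name"],
        "DataType": ["String", "Enum"],
        "Validation": [None, '["TEAM-LEE", "TEAM-HAFLER", "TEAM-HARDY"]'],
    }
)


def get_allowed_values(cde, table_name):
    """Return {field: [allowed values]} for the Enum fields of one table.

    Assumes the Validation entry of every Enum row is a JSON list string.
    """
    rows = cde[(cde["Table"] == table_name) & (cde["DataType"] == "Enum")]
    return {row.Field: json.loads(row.Validation) for row in rows.itertuples()}


allowed = get_allowed_values(toy_cde, "STUDY")
print(allowed)
```

If some `Validation` entries turn out not to be valid JSON, `json.loads` raises `json.JSONDecodeError`, which is a reasonable signal that the CDE itself needs cleaning.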


## Clean Team Hafler tables

### Load Tables from excel

In [3]:
# load the Team Hafler metadata tables from the Excel workbook
data_path = Path.home() / "Projects/ASAP"
metadata_path = data_path / "team-hafler/metadata"

sheets = ["SAMPLE","SUBJECT","CLINPATH","STUDY","PROTOCOL"]
excel_path = data_path / "ASAP_CDE_ALL_Team_Hafler_v1.xlsx"
STUDY = pd.read_excel(excel_path,sheet_name="STUDY",header=1).drop(columns="Field")
CLINPATH = pd.read_excel(excel_path,sheet_name="CLINPATH",header=1).drop(columns="Field")
SUBJECT = pd.read_excel(excel_path,sheet_name="SUBJECT",header=1).drop(columns="Field")
SAMPLE = pd.read_excel(excel_path,sheet_name="SAMPLE",header=1).drop(columns="Field")
PROTOCOL = pd.read_excel(excel_path,sheet_name="PROTOCOL",header=1).drop(columns="Field")


# fix the column order
STUDY = reorder_table_to_CDE(STUDY, "STUDY", CDE)
SAMPLE = reorder_table_to_CDE(SAMPLE, "SAMPLE", CDE)
PROTOCOL = reorder_table_to_CDE(PROTOCOL, "PROTOCOL", CDE)
SUBJECT = reorder_table_to_CDE(SUBJECT, "SUBJECT", CDE)     
CLINPATH = reorder_table_to_CDE(CLINPATH, "CLINPATH", CDE)
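
The body of `reorder_table_to_CDE` is not shown in this notebook. As a rough sketch of the idea (put the columns in CDE order and add any missing ones as empty), one could use `DataFrame.reindex`. The function name `reorder_to_cde_sketch` and the toy tables are hypothetical, and the real helper may behave differently, for example in how it treats columns that are not in the CDE.

```python
import pandas as pd


def reorder_to_cde_sketch(table, table_name, cde):
    """Return `table` with columns in CDE order; missing fields become NaN columns.

    Note: reindex also drops columns that are not listed in the CDE.
    """
    fields = cde.loc[cde["Table"] == table_name, "Field"].tolist()
    return table.reindex(columns=fields)


# Toy CDE and a table with one missing field and shuffled column order.
toy_cde = pd.DataFrame(
    {"Table": ["SUBJECT"] * 3, "Field": ["subject_id", "sex", "race"]}
)
toy_subject = pd.DataFrame({"sex": ["F", "M"], "subject_id": ["s1", "s2"]})

reordered = reorder_to_cde_sketch(toy_subject, "SUBJECT", toy_cde)
print(reordered.columns.tolist())
```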



### SUBJECT

In [4]:
# SUBJECT = force_enum_string(SUBJECT, "SUBJECT", CDE)

SUBJECT['sex'] = SUBJECT['sex'].replace({'F':"Female", 'M':"Male"})
SUBJECT['race'] = SUBJECT['race'].replace({'W':"White", 'B':"Black or African American"})

SUBJECT['primary_diagnosis'] = SUBJECT['primary_diagnosis'].replace({'normal control':"Healthy Control", "idiopathic Parkinson's disease":"Idiopathic PD"})
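
`Series.replace` with a dict leaves any value that is not a key unchanged, so an unexpected code would pass through silently. A small sketch of flagging such leftovers before validation is shown below. The toy data and the `allowed_sex` list are assumptions for illustration; the real allowed values come from the CDE.

```python
import pandas as pd

# Toy column with one value ("female") that the mapping does not cover.
toy = pd.DataFrame({"sex": ["F", "M", "female"]})
toy["sex"] = toy["sex"].replace({"F": "Female", "M": "Male"})

# Assumed allowed values, for illustration only.
allowed_sex = ["Female", "Male"]

# Collect the values that are still outside the allowed set after recoding.
unmapped = toy.loc[~toy["sex"].isin(allowed_sex), "sex"].unique().tolist()
print(unmapped)
```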


In [5]:
subject_report = ReportCollector(destination="print")

validate_table(SUBJECT, "SUBJECT", CDE, subject_report)

0

In [6]:
print(subject_report.get_log())

All required fields are present in *SUBJECT* table.
No empty entries (Nan) found in _Required_ fields.
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- primary_diagnosis_text: 12/12 empty rows
## Enum fields have valid values in SUBJECT. 🥳
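
The report above separates empty values in required fields from those in optional fields. The internals of `validate_table` are not shown here, so the following is only a minimal sketch of how such counts could be computed from the CDE. `count_empty_fields` and the toy tables are hypothetical, and the `"Optional"` label is assumed (only `"Required"` is visible in the CDE preview).

```python
import pandas as pd


def count_empty_fields(table, table_name, cde):
    """Return ({required_field: n_empty}, {optional_field: n_empty}).

    Only fields that exist in `table` and have at least one NaN are listed.
    """
    spec = cde[cde["Table"] == table_name]
    required, optional = {}, {}
    for row in spec.itertuples():
        if row.Field not in table.columns:
            continue
        n_empty = int(table[row.Field].isna().sum())
        if n_empty:
            target = required if row.Required == "Required" else optional
            target[row.Field] = n_empty
    return required, optional


toy_cde = pd.DataFrame(
    {
        "Table": ["SAMPLE"] * 3,
        "Field": ["sample_id", "RIN", "pm_PH"],
        "Required": ["Required", "Required", "Optional"],
    }
)
toy_sample = pd.DataFrame(
    {"sample_id": ["a", "b"], "RIN": [None, None], "pm_PH": [7.0, None]}
)

required_empty, optional_empty = count_empty_fields(toy_sample, "SAMPLE", toy_cde)
print(required_empty, optional_empty)
```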



### SAMPLE

In [7]:

def add_hafler_batch(sample_df):

    # First batch: HSDG07HC HSDG10HC HSDG148PD HSDG199PD
    # batch[batch.sample_id in ['hSDG07HC', 'hSDG10HC', 'hSDG148PD', 'hSDG199PD']]=1
    Batch_1 = ['hSDG07', 'hSDG10', 'hSDG148', 'hSDG199'] 
    # Second batch: hsDG101HC hsDG13HC hsDG151PD hsDG197PD hsDG30HC hsDG99HC
    Batch_2 = ['hSDG101', 'hSDG13', 'hSDG151', 'hSDG197', 'hSDG30', 'hSDG99']
    # Third batch: hsDG142PD hsDG208PD
    Batch_3 = ['hSDG142', 'hSDG208'] 


    batch_col = []
    for row in sample_df.sample_id:
        if row in Batch_1:
            batch_col.append("Batch_1")
        elif row in Batch_2:
            batch_col.append("Batch_2")
        elif row in Batch_3:
            batch_col.append("Batch_3")
        else:
            print(f"ERROR >>>>>>>> no batch info for sample_id {row}")
            batch_col.append("")


    sample_df['batch'] = batch_col
    return sample_df

SAMPLE = add_hafler_batch(SAMPLE)
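
The same batch assignment can also be written as a lookup table plus `Series.map`, which avoids the explicit loop and makes unassigned samples easy to list. This is a sketch on toy data, using the batch membership from the cell above; `sample_to_batch` and `toy_sample` are illustrative names.

```python
import pandas as pd

batches = {
    "Batch_1": ["hSDG07", "hSDG10", "hSDG148", "hSDG199"],
    "Batch_2": ["hSDG101", "hSDG13", "hSDG151", "hSDG197", "hSDG30", "hSDG99"],
    "Batch_3": ["hSDG142", "hSDG208"],
}
# Invert to {sample_id: batch label}.
sample_to_batch = {sid: batch for batch, ids in batches.items() for sid in ids}

# Toy table; "unknown" stands in for a sample with no batch information.
toy_sample = pd.DataFrame({"sample_id": ["hSDG07", "hSDG13", "hSDG208", "unknown"]})
toy_sample["batch"] = toy_sample["sample_id"].map(sample_to_batch)

# Samples without a batch come back as NaN; report them, then fill with "".
missing = toy_sample.loc[toy_sample["batch"].isna(), "sample_id"].tolist()
if missing:
    print(f"ERROR >>>>>>>> no batch info for: {missing}")
toy_sample["batch"] = toy_sample["batch"].fillna("")
```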


In [8]:
# fix replicate & replicate_count
SAMPLE['replicate'] = "Rep1"
SAMPLE['replicate_count'] = 1


# force the right organism_ontology_term_id (human)
SAMPLE["organism_ontology_term_id"] = "NCBITaxon:9606"

# set time == 0 for all samples
SAMPLE['time'] = 0

SAMPLE['file_type'] = SAMPLE['file_type'].replace({"Fastq":"fastq"})


In [9]:

# need to join with subject to get "sex" and convert to ontology term
SAMPLE_SUBJECT = SAMPLE.merge(SUBJECT, on='subject_id',  how='left')
SAMPLE_og = SAMPLE.copy()
SAMPLE['sex_ontology_term_id'] = SAMPLE_SUBJECT['sex'].replace({"Male":"PATO:0000384 (male)", "Female":"PATO:0000383 (female)" })

# ignore development_stage_ontology_term_id, self_reported_ethnicity_ontology_term_id, assay_ontology_term_id, etc. for now. (Check with Le)
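
Assigning a column from the merged frame back to `SAMPLE` relies on index alignment: a column-based `merge` returns a fresh default integer index, so the assignment above lines up only if `SAMPLE` also has a default index. A sketch of an alternative that looks up sex by `subject_id` directly is shown below. It assumes one row per `subject_id` in the subject table; the toy tables are illustrative.

```python
import pandas as pd

toy_subject = pd.DataFrame({"subject_id": ["s1", "s2"], "sex": ["Female", "Male"]})
# Non-default index on purpose, to show the lookup does not depend on it.
toy_sample = pd.DataFrame(
    {"sample_id": ["a", "b", "c"], "subject_id": ["s2", "s1", "s2"]},
    index=[10, 11, 12],
)

sex_terms = {"Male": "PATO:0000384 (male)", "Female": "PATO:0000383 (female)"}

# subject_id -> sex lookup (requires unique subject_id values).
sex_by_subject = toy_subject.set_index("subject_id")["sex"]
toy_sample["sex_ontology_term_id"] = (
    toy_sample["subject_id"].map(sex_by_subject).map(sex_terms)
)
print(toy_sample)
```

Samples whose `subject_id` is absent from the subject table end up with NaN in `sex_ontology_term_id`, which the validation step would then report.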

In [10]:
# fix assay
SAMPLE['assay'] = SAMPLE['assay'].replace({'v3.1 - Single Index, 10x Genomics ':"v3.1 - Single Index"})
# fix sequencing_length
SAMPLE['sequencing_length'] = SAMPLE['sequencing_length'].replace({'150bp x2':"150"})


In [11]:
sample_report = ReportCollector(destination="print")

validate_table(SAMPLE, "SAMPLE", CDE, sample_report)

0

In [12]:
print(sample_report.get_log())

All required fields are present in *SAMPLE* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- source_RIN: 36/36 empty rows

	- RIN: 36/36 empty rows

	- suspension_type: 36/36 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- pm_PH: 36/36 empty rows
## Enum fields have valid values in SAMPLE. 🥳



### CLINPATH

In [13]:
# redact "Prefrontal Cortex" from region_level_2 for now
CLINPATH['region_level_2'] = CLINPATH['region_level_2'].replace({'Prefrontal Cortex':""})
CLINPATH['region_level_2'] = CLINPATH['region_level_2'].replace({'Prefrontal cortex':""})

# leave the APOE_e4_status as is for now. Multiple are coded as "2,3"
# but remove commas
CLINPATH["APOE_e4_status"] = CLINPATH["APOE_e4_status"].str.replace(",","")

# need to fix the path_autopsy_dx_main
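
One caveat with `.str.replace` on an object column: entries that are not strings (for example a genotype read from Excel as the integer `33`) come back as NaN. Whether `APOE_e4_status` contains such values is not shown here, so the following is only a precautionary sketch on toy data.

```python
import pandas as pd

# Toy column mixing a string, an integer, and a missing value.
toy_apoe = pd.Series(["2,3", 33, None], dtype="object")

# Plain .str.replace turns the non-string entry into NaN.
naive = toy_apoe.str.replace(",", "", regex=False)

# Cast non-missing entries to str first, keeping missing values missing.
as_text = toy_apoe.where(toy_apoe.isna(), toy_apoe.astype(str))
safe = as_text.str.replace(",", "", regex=False)
print(safe.tolist())
```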

In [14]:
clinpath_report = ReportCollector(destination="print")
validate_table(CLINPATH, "CLINPATH", CDE, clinpath_report)


0

In [15]:
print(clinpath_report.get_log())

All required fields are present in *CLINPATH* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 12/12 empty rows

	- age_at_diagnosis: 12/12 empty rows

	- first_motor_symptom: 12/12 empty rows

	- path_year_death: 12/12 empty rows

	- brain_weight: 12/12 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 12/12 empty rows
## Enum fields have valid values in CLINPATH. 🥳



### STUDY

In [16]:
study_report = ReportCollector(destination="print")
validate_table(STUDY, "STUDY", CDE, study_report)


1

In [17]:
print(study_report.get_log())

All required fields are present in *STUDY* table.
No empty entries (Nan) found in _Required_ fields.
No empty entries (Nan) found in _Optional_ fields.
## Enum fields have valid values in STUDY. 🥳



### PROTOCOL

In [18]:
protocol_report = ReportCollector(destination="print")
validate_table(PROTOCOL, "PROTOCOL", CDE, protocol_report)


1

In [19]:

print(protocol_report.get_log())

All required fields are present in *PROTOCOL* table.
No empty entries (Nan) found in _Required_ fields.
No empty entries (Nan) found in _Optional_ fields.
## Enum fields have valid values in PROTOCOL. 🥳



### export clean tables

In [20]:
data_path = Path.home() / "Projects/ASAP/team-hafler"

In [21]:

# # write the clean metadata
# STUDY.to_csv(data_path / "metadata/STUDY.csv")
# PROTOCOL.to_csv(data_path / "metadata/PROTOCOL.csv")
# CLINPATH.to_csv(data_path / "metadata/CLINPATH.csv")
# SAMPLE.to_csv(data_path / "metadata/SAMPLE.csv")
# SUBJECT.to_csv(data_path / "metadata/SUBJECT.csv")

# also write them to clean/team-Hafler

export_root = Path.cwd() / "clean/team-Hafler"
# exist_ok=True already makes this a no-op when the directory exists
export_root.mkdir(parents=True, exist_ok=True)


STUDY.to_csv( export_root / "STUDY.csv")
PROTOCOL.to_csv(export_root / "PROTOCOL.csv")
SAMPLE.to_csv(export_root / "SAMPLE.csv")
SUBJECT.to_csv(export_root / "SUBJECT.csv")
CLINPATH.to_csv(export_root / "CLINPATH.csv")
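
`to_csv` writes the DataFrame index as an unnamed first column by default, which a plain `read_csv` then reads back as `Unnamed: 0` (the same column name visible in the CDE preview earlier). How `read_meta_table` handles this is not shown, so the sketch below only illustrates the pandas behaviour on toy data, using an in-memory buffer instead of files.

```python
import io

import pandas as pd

toy = pd.DataFrame({"subject_id": ["s1", "s2"], "sex": ["Female", "Male"]})

# Default: the index is written and comes back as an extra column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```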


In [23]:

# Initialize the data types dictionary
dtypes_dict = get_dtypes_dict(CDE)
    


# make sure cleaned files are correct

SUBJECT = read_meta_table(f"{export_root}/SUBJECT.csv", dtypes_dict)
CLINPATH = read_meta_table(f"{export_root}/CLINPATH.csv", dtypes_dict)
STUDY = read_meta_table(f"{export_root}/STUDY.csv", dtypes_dict)
PROTOCOL = read_meta_table(f"{export_root}/PROTOCOL.csv", dtypes_dict)
SAMPLE = read_meta_table(f"{export_root}/SAMPLE.csv", dtypes_dict)
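
Reading with an explicit dtype mapping keeps identifiers as text and lets integer columns hold missing values. The real `get_dtypes_dict` and `read_meta_table` are not shown in this notebook; the sketch below is a guess at the general idea. `TYPE_MAP`, `dtypes_from_cde`, the `"Integer"` and `"Float"` labels, and the `age_at_collection` field are all assumptions for illustration.

```python
import io

import pandas as pd

# Assumed mapping from CDE DataType labels to pandas dtypes.
TYPE_MAP = {"String": "string", "Enum": "string", "Integer": "Int64", "Float": "float64"}


def dtypes_from_cde(cde):
    """Build {field: pandas dtype}; unknown DataType labels fall back to string."""
    return {f: TYPE_MAP.get(t, "string") for f, t in zip(cde["Field"], cde["DataType"])}


toy_cde = pd.DataFrame(
    {"Field": ["subject_id", "age_at_collection"], "DataType": ["String", "Integer"]}
)
dtypes = dtypes_from_cde(toy_cde)

# In-memory CSV with a zero-padded ID and one missing integer.
csv_text = "subject_id,age_at_collection\n007,71\n010,\n"
table = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(table.dtypes)
```

Without the mapping, `007` would be parsed as the integer `7` and the age column would become float because of the missing value.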


In [25]:
table, table_name = SUBJECT, "SUBJECT"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *SUBJECT* table.
No empty entries (Nan) found in _Required_ fields.
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- primary_diagnosis_text: 12/12 empty rows
## Enum fields have valid values in SUBJECT. 🥳



In [26]:
table, table_name = SAMPLE, "SAMPLE"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *SAMPLE* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- source_RIN: 36/36 empty rows

	- RIN: 36/36 empty rows

	- suspension_type: 36/36 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- DV200: 36/36 empty rows

	- pm_PH: 36/36 empty rows
## Enum fields have valid values in SAMPLE. 🥳



In [27]:
table, table_name = CLINPATH, "CLINPATH"

report = ReportCollector(destination="print")
validate_table(table, table_name, CDE, report)
print(report.get_log())

All required fields are present in *CLINPATH* table.
🚨⚠️❗ **Required Fields with Empty (nan) values:**

	- age_at_onset: 12/12 empty rows

	- age_at_diagnosis: 12/12 empty rows

	- first_motor_symptom: 12/12 empty rows

	- path_year_death: 12/12 empty rows

	- brain_weight: 12/12 empty rows
🚨⚠️❗ **Optional Fields with Empty (nan) values:**

	- smoking_years: 12/12 empty rows
## Enum fields have valid values in CLINPATH. 🥳

