## Finding Potentially Useful Tables
First, I want to go through the AACT Database Schema table definitions in `aact_tables.xlsx` to identify which tables might contain useful features. After identifying a table, I'll look at its columns and select candidates for features to train the ML models on.

In [1]:
import numpy as np
import pandas as pd
import pathlib

raw_dir = pathlib.Path('../data/raw')

### `calculated_values`
An AACT-provided table of values calculated from the information received from ClinicalTrials.gov. For example, `number_of_facilities` and `actual_duration` are provided in this table.

In [2]:
calc_vals = pd.read_csv(raw_dir/'calculated_values.txt', sep='|')

In [3]:
calc_vals.head(10)

Unnamed: 0,id,nct_id,number_of_facilities,number_of_nsae_subjects,number_of_sae_subjects,registered_in_calendar_year,nlm_download_date,actual_duration,were_results_reported,months_to_report_results,has_us_facility,has_single_facility,minimum_age_num,maximum_age_num,minimum_age_unit,maximum_age_unit,number_of_primary_outcomes_to_measure,number_of_secondary_outcomes_to_measure,number_of_other_outcomes_to_measure
0,170308915,NCT06272461,1,,,2024,,,f,,f,t,18.0,90.0,year,year,1.0,5.0,
1,170308916,NCT02497274,0,,,2015,,37.0,f,,,,18.0,55.0,year,year,1.0,1.0,2.0
2,170308917,NCT05412550,1,,,2022,,,f,,t,t,40.0,89.0,year,year,1.0,7.0,1.0
3,170308918,NCT05292352,1,,,2022,,,f,,t,t,6.0,9.0,year,year,2.0,12.0,
4,170308919,NCT05866458,12,,,2023,,,f,,f,f,50.0,,year,,1.0,4.0,
5,170532907,NCT03632941,1,17.0,4.0,2018,,55.0,t,11.0,t,t,18.0,,year,,1.0,1.0,1.0
6,170532908,NCT04219826,22,,,2020,,38.0,f,,t,f,18.0,85.0,year,year,1.0,7.0,
7,170532909,NCT04527887,1,,,2020,,38.0,f,,t,t,18.0,,year,,1.0,7.0,
8,170532910,NCT06652464,1,,,2024,,3.0,f,,f,t,18.0,80.0,year,year,1.0,1.0,
9,170532911,NCT05578898,1,,,2022,,23.0,f,,t,t,18.0,,year,,5.0,3.0,


In [None]:
# check if only one entry per study
calc_vals.shape[0] == calc_vals['nct_id'].unique().size

True

In [6]:
calc_vals['maximum_age_unit'].value_counts()

maximum_age_unit
year      287149
month       4946
day         2213
week        1793
hour         610
minute       135
Name: count, dtype: int64

In [9]:
calc_vals[calc_vals['maximum_age_unit'] == 'month'].head()

Unnamed: 0,id,nct_id,number_of_facilities,number_of_nsae_subjects,number_of_sae_subjects,registered_in_calendar_year,nlm_download_date,actual_duration,were_results_reported,months_to_report_results,has_us_facility,has_single_facility,minimum_age_num,maximum_age_num,minimum_age_unit,maximum_age_unit,number_of_primary_outcomes_to_measure,number_of_secondary_outcomes_to_measure,number_of_other_outcomes_to_measure
62,170308944,NCT05994742,5,,,2023,,,f,,f,f,6.0,59.0,month,month,1.0,5.0,8.0
226,170309044,NCT00369759,38,,,2006,,22.0,f,,t,f,1.0,12.0,day,month,1.0,4.0,
370,170309097,NCT02173951,1,,,2014,,,f,,f,t,6.0,36.0,month,month,1.0,1.0,
555,170533243,NCT05973812,3,,,2023,,19.0,f,,f,f,3.0,3.0,month,month,3.0,2.0,
599,170533287,NCT03615495,12,,,2018,,61.0,f,,t,f,,12.0,,month,1.0,,


- **nsae/sae:** non-serious / serious adverse events.
- Working with age units of minutes and hours seems messy, so I might keep only year and month. To avoid needing an age_unit column, I could convert everything to months by multiplying the age by 12 when the unit is year.
- I also need to decide what to do when minimum_age_num is missing but maximum_age_num isn't. Imputing with 0 might make more sense than imputing with the mean.
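
The age handling described in the notes above could look roughly like this. It is a minimal sketch, assuming only `year` and `month` units are kept and that a missing minimum age means there is no lower limit. The helper `age_in_months`, the `*_months` column names, and the small example frame are hypothetical and only mirror the `calculated_values` columns.

```python
import numpy as np
import pandas as pd

# Months per unit for the units we keep; any other unit
# (week, day, hour, minute) is not in the dict and maps to NaN.
MONTHS_PER_UNIT = {'year': 12, 'month': 1}

def age_in_months(num: pd.Series, unit: pd.Series) -> pd.Series:
    """Convert an age column to months using its matching unit column."""
    return num * unit.map(MONTHS_PER_UNIT)

# Hypothetical rows mirroring the calculated_values columns
demo = pd.DataFrame({
    'minimum_age_num': [18.0, 6.0, np.nan, 1.0],
    'minimum_age_unit': ['year', 'month', np.nan, 'day'],
    'maximum_age_num': [90.0, 59.0, 12.0, 12.0],
    'maximum_age_unit': ['year', 'month', 'month', 'month'],
})

demo['minimum_age_months'] = age_in_months(demo['minimum_age_num'], demo['minimum_age_unit'])
demo['maximum_age_months'] = age_in_months(demo['maximum_age_num'], demo['maximum_age_unit'])

# Assumption: a missing minimum age means "no lower limit",
# so impute 0 instead of the column mean.
min_missing = demo['minimum_age_num'].isna()
demo.loc[min_missing, 'minimum_age_months'] = 0.0
```

Rows whose unit is dropped (such as the `day` row here) end up as NaN and would still need a decision: drop them, or extend `MONTHS_PER_UNIT` with approximate conversions.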

In [21]:
calc_vals_cols = [
    'nct_id', 'number_of_facilities', 'has_us_facility', 'number_of_nsae_subjects', 'number_of_sae_subjects',
    'minimum_age_num', 'maximum_age_num', 'minimum_age_unit', 'maximum_age_unit',
    'number_of_primary_outcomes_to_measure', 'number_of_secondary_outcomes_to_measure'
    ]

### `central_contacts`
Contact info for people (primary & backup) who can answer questions concerning enrollment at any location of the study.

In [13]:
contacts = pd.read_csv(raw_dir/'central_contacts.txt', sep='|')

In [14]:
contacts.head(10)

Unnamed: 0,id,nct_id,contact_type,name,phone,email,phone_extension,role
0,56422417,NCT05460416,primary,Julie Collée,+32498973386,julie.collee@uliege.be,,CONTACT
1,56422418,NCT05460416,backup,Marie Timmermans,,marie.timmermans@chuliege.be;,,CONTACT
2,56422419,NCT06791369,primary,"Heinz Jungbluth, MD PhD MRCP MRCPCH",+44 20 71883998,heinz.jungbluth@gstt.nhs.uk,,CONTACT
3,56422420,NCT06791369,backup,"Arti M Mistry, PhD MSci",,arti.mistry@gstt.nhs.uk,,CONTACT
4,56422421,NCT05642156,primary,Alexander H Kirsch,+43316385,alexander.kirsch@medunigraz.at,16023.0,CONTACT
5,56422422,NCT06793631,primary,"Nicola White, PhD",+44 (0) 2076799057,n.g.white@ucl.ac.uk,,CONTACT
6,56422423,NCT06793631,backup,"Alessandro Bosco, PhD",,alessandro.bosco@ucl.ac.uk,,CONTACT
7,56422424,NCT06796816,primary,"Cristian Rapicetta, MD",0522296858,Cristian.rapicetta@ausl.re.it,,CONTACT
8,56422425,NCT06489301,primary,"Manager, Clinical Research Operations",937-245-7500,pturesearch@wrightstatephysicians.org,,CONTACT
9,56422426,NCT06489301,backup,Regulatory Specialist,937-245-7500,pturesearch@wrightstatephysicians.org,,CONTACT


In [22]:
# more than one entry for each study
contacts.shape[0] == contacts['nct_id'].unique().size

False

In [20]:
# only the last expression in a cell is displayed, so the role counts are not shown below
contacts['role'].value_counts()
contacts['contact_type'].value_counts()

contact_type
primary    143948
backup      67393
Name: count, dtype: int64

In [18]:
# wondering whether not having up-to-date contact information can affect termination risk
print('contact info | num missing')
for col in contacts.columns:
    print(f"{col}: {contacts[col].isna().sum()}")

contact info | num missing
id: 0
nct_id: 0
contact_type: 0
name: 1
phone: 12965
email: 4816
phone_extension: 185659
role: 0


- Might be able to use 4 features for each nct_id: primary_email_missing, backup_email_missing, primary_phone_missing, backup_phone_missing.
- Will drop the study with the missing name.
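
One possible way to build the four flags listed above, with one row per `nct_id`. This is a sketch on a hypothetical example frame that mirrors the `central_contacts` columns. It assumes that a study with no backup contact at all should count as missing backup info, which may or may not be the right call for the model.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mirroring the central_contacts columns
demo = pd.DataFrame({
    'nct_id': ['NCT001', 'NCT001', 'NCT002'],
    'contact_type': ['primary', 'backup', 'primary'],
    'phone': ['555-0100', np.nan, np.nan],
    'email': ['a@example.org', 'b@example.org', 'c@example.org'],
})

contact_flags = (
    demo.assign(phone_missing=demo['phone'].isna(),
                email_missing=demo['email'].isna())
        .groupby(['nct_id', 'contact_type'])[['phone_missing', 'email_missing']]
        .all()  # if a study lists several contacts of one type, all must be missing
        # Assumption: no contact of a given type at all is treated as missing info
        .unstack('contact_type', fill_value=True)
)

# Flatten the (field, contact_type) column index into names like primary_phone_missing
contact_flags.columns = [f"{ctype}_{field}" for field, ctype in contact_flags.columns]
```

The result is indexed by `nct_id`, so it can be joined onto the other per-study features.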

In [23]:
contacts_cols = [
    'id', 'nct_id', 'contact_type', 'phone', 'email'
]

### `conditions` and `browse_conditions`
Conditions reported for each study (`conditions`) and the MeSH terms associated with them (`browse_conditions`).

In [36]:
conditions = pd.read_csv(raw_dir/'conditions.txt', sep='|')
browse_conditions = pd.read_csv(raw_dir/'browse_conditions.txt', sep='|')

In [26]:
conditions.head(10)

Unnamed: 0,id,nct_id,name,downcase_name
0,258969120,NCT02413840,COPD,copd
1,258969121,NCT02413840,Anxiety,anxiety
2,258969122,NCT02413840,Depression,depression
3,258969123,NCT04661215,Gastroparesis,gastroparesis
4,258969124,NCT04661215,Idiopathic Gastric Motility Disorder,idiopathic gastric motility disorder
5,258969125,NCT04661215,Diabetic Gastroparesis,diabetic gastroparesis
6,258969126,NCT06181136,Mucopolysaccharidosis Type IIIA,mucopolysaccharidosis type iiia
7,258969127,NCT04585750,Advanced Solid Tumor,advanced solid tumor
8,258969128,NCT04585750,Advanced Malignant Neoplasm,advanced malignant neoplasm
9,258969129,NCT04585750,Metastatic Cancer,metastatic cancer


In [38]:
browse_conditions.head(10)

Unnamed: 0,id,nct_id,mesh_term,downcase_mesh_term,mesh_type
0,1014094542,NCT02521727,Intestinal Neoplasms,intestinal neoplasms,mesh-ancestor
1,1014094543,NCT02521727,Gastrointestinal Neoplasms,gastrointestinal neoplasms,mesh-ancestor
2,1014094544,NCT02521727,Digestive System Neoplasms,digestive system neoplasms,mesh-ancestor
3,1014094545,NCT02521727,Neoplasms by Site,neoplasms by site,mesh-ancestor
4,1014094546,NCT02521727,Neoplasms,neoplasms,mesh-ancestor
5,1014094547,NCT02521727,Digestive System Diseases,digestive system diseases,mesh-ancestor
6,1014094548,NCT02521727,Gastrointestinal Diseases,gastrointestinal diseases,mesh-ancestor
7,1014094549,NCT02521727,Colonic Diseases,colonic diseases,mesh-ancestor
8,1014094550,NCT02521727,Intestinal Diseases,intestinal diseases,mesh-ancestor
9,1014094551,NCT02521727,Rectal Diseases,rectal diseases,mesh-ancestor


In [35]:
# far too many categories; will need to find a way to group them
conditions['downcase_name'].unique().size, conditions['downcase_name'].shape[0]

(122341, 990427)

In [None]:
# MeSH terms give far fewer distinct conditions
browse_conditions['downcase_mesh_term'].unique().size, browse_conditions['downcase_mesh_term'].shape[0]

(5996, 4086571)

In [41]:
browse_conditions['downcase_mesh_term'].unique()

array(['intestinal neoplasms', 'gastrointestinal neoplasms',
       'digestive system neoplasms', ...,
       'phosphoglycerate kinase 1 deficiency',
       'glycogen storage disease type ix',
       'lactate dehydrogenase deficiency'], shape=(5996,), dtype=object)

In [33]:
# count condition entries whose name contains 'anxiety'
len([cond for cond in conditions['downcase_name'] if 'anxiety' in cond])

6077
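
One way the grouping idea from the comments above might be approached: keep only the most frequent MeSH terms, lump the rest into an `other` bucket, and multi-hot encode per study. This is a rough sketch on a hypothetical example frame that mirrors the `browse_conditions` columns. The `TOP_N` cutoff and the `other` bucket are assumptions for illustration, not something decided in this notebook.

```python
import pandas as pd

# Hypothetical rows mirroring the browse_conditions columns
demo = pd.DataFrame({
    'nct_id': ['NCT001', 'NCT001', 'NCT002', 'NCT002', 'NCT003', 'NCT003'],
    'downcase_mesh_term': ['neoplasms', 'intestinal neoplasms',
                           'neoplasms', 'anxiety disorders',
                           'neoplasms', 'anxiety disorders'],
})

TOP_N = 2  # hypothetical cutoff; would need tuning on the real data

# Count each term at most once per study, then keep the TOP_N most common terms
per_study = demo.drop_duplicates(['nct_id', 'downcase_mesh_term'])
top_terms = per_study['downcase_mesh_term'].value_counts().head(TOP_N).index

# Collapse every term outside the top terms into a single 'other' bucket
grouped = per_study['downcase_mesh_term'].where(
    per_study['downcase_mesh_term'].isin(top_terms), 'other')

# Multi-hot encode: one row per study, one 0/1 column per retained term
condition_features = pd.crosstab(per_study['nct_id'], grouped).clip(upper=1)
```

Because `browse_conditions` already includes `mesh-ancestor` rows, restricting to broad ancestor terms could be another way to reduce the number of categories.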