# Preprocess clinical trial data from clinicaltrials.gov
Raw data were exported as .csv, and some fields contain lists (e.g. the trial sites). Here we create a long-form dataset to simplify further analysis. Long-form means that a list-valued field gets one row per list item: if, for example, a trial has sites in more than one country, there will be a separate row for each country.
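
As a minimal sketch of what long-form means here, the toy example below splits a pipe-separated field and writes one row per item. The trial IDs, the `countries` column, and its values are made up for illustration and do not come from the export.

```python
import pandas as pd

# Toy data; IDs and countries are invented for illustration only.
wide = pd.DataFrame(
    {
        "nct_number": ["NCT0001", "NCT0002"],
        "countries": ["Germany|France", "Spain"],
    }
)

# Split the pipe-separated string into a list, then write one row per list item.
long_form = wide.assign(country=wide["countries"].str.split("|")).explode("country")
print(long_form)
```

The trial with two countries now occupies two rows, and the original list column is kept alongside the exploded one.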

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
%load_ext blackcellmagic

In [4]:
import pandas as pd

In [5]:
raw_trial_data = pd.read_csv("data/source/clinicaltrials/clinicaltrials_allMS_trials.csv")

In [6]:
raw_trial_data.head()

Unnamed: 0,NCT Number,Study Title,Study URL,Study Status,Conditions,Interventions,Sponsor,Collaborators,Sex,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,Locations
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,OTHER: No Interventions,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",,1500.0,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,2020-06-29,2021-01-30,2021-07-30,"Advanced Neurosciences Institute, Franklin, Te..."
1,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,DRUG: Natalizumab,"University Hospital, Toulouse",,ALL,ADULT,PHASE4,300.0,OTHER,INTERVENTIONAL,Allocation: NA|Intervention Model: SINGLE_GROU...,2009-06,2011-02,2011-03,"service de neurologie, hôpital Purpan, Toulous..."
2,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",BEHAVIORAL: HIIT|BEHAVIORAL: MCT,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",,30.0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2022-10-01,2023-04-01,2023-04-01,"Klinik Valens, Valens rehabilitation clinic, V..."
3,NCT05090033,Characterizing the Use of Ofatumumab in a Real...,https://beta.clinicaltrials.gov/study/NCT05090033,RECRUITING,Relapsing Multiple Sclerosis,OTHER: ofatumumab,Novartis Pharmaceuticals,,ALL,"ADULT, OLDER_ADULT",,3500.0,INDUSTRY,OBSERVATIONAL,Observational Model: |Time Perspective: p,2022-12-08,2025-06-30,2025-06-30,"Novartis Investigative Site, Concord, New Sout..."
4,NCT00883337,A Study Comparing the Effectiveness and Safety...,https://beta.clinicaltrials.gov/study/NCT00883337,COMPLETED,Multiple Sclerosis,DRUG: Interferon β-1a|DRUG: Teriflunomide,Sanofi,,ALL,"ADULT, OLDER_ADULT",PHASE3,324.0,INDUSTRY,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2009-04,2011-09,2015-05,"Investigational Site Number 056003, Bruxelles,..."


## Create a base dataframe with trial ID, title, URL, and status

In [7]:
raw_trial_data = raw_trial_data.rename(columns={"NCT Number": "nct_number"})

Is there only one row per trial?

In [8]:
len(raw_trial_data) == len(raw_trial_data["nct_number"].drop_duplicates())

True
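
If the check ever returned `False`, it would help to see which IDs are duplicated. The sketch below shows one possible way on a toy frame with a deliberately duplicated, made-up ID; `toy` is a hypothetical stand-in for `raw_trial_data`.

```python
import pandas as pd

# Toy frame with a deliberately duplicated ID (invented for illustration).
toy = pd.DataFrame({"nct_number": ["NCT0001", "NCT0002", "NCT0002"]})

# True only if every ID occurs exactly once.
ids_unique = toy["nct_number"].is_unique

# List the offending rows; keep=False marks every copy of a duplicated ID.
duplicates = toy[toy["nct_number"].duplicated(keep=False)]
print(ids_unique)
print(duplicates)
```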

In [10]:
trial_data = (
    raw_trial_data[["nct_number", "Study Title", "Study URL", "Study Status"]]
    .rename(
        columns={
            "Study Title": "study_title",
            "Study URL": "study_url",
            "Study Status": "study_status",
        }
    )
    .copy()
)

In [11]:
trial_data.head()

Unnamed: 0,nct_number,study_title,study_url,study_status
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN
1,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED
2,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING
3,NCT05090033,Characterizing the Use of Ofatumumab in a Real...,https://beta.clinicaltrials.gov/study/NCT05090033,RECRUITING
4,NCT00883337,A Study Comparing the Effectiveness and Safety...,https://beta.clinicaltrials.gov/study/NCT00883337,COMPLETED


## Explode the listed conditions, export, filter manually, import filtered list, and filter data
The search function on clinicaltrials.gov might yield false positives when searching for common abbreviations related to MS. We thus list all mentioned conditions, then filter them manually. The filtered list is then imported again, and only trials where at least one of these conditions is mentioned are retained in the data set.

### Explode and export
If a trial includes multiple conditions, they are provided as a pipe-separated list. We split this list, and write a row for each list item. Then we remove duplicates and export the conditions for manual filtering.

In [13]:
conditions = raw_trial_data[["nct_number", "Conditions"]].copy()

In [14]:
conditions["condition"] = conditions["Conditions"].str.split("|")
conditions = conditions.explode("condition")

In [15]:
conditions.head()

Unnamed: 0,nct_number,Conditions,condition
0,NCT04447937,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,Multiple Sclerosis
0,NCT04447937,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,Hypogammaglobulinemia
0,NCT04447937,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,Immunodeficiency
0,NCT04447937,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,"Infection, Bacterial"
1,NCT00942214,Multiple Sclerosis,Multiple Sclerosis


In [16]:
conditions_for_export = (
    conditions[["condition"]].drop_duplicates().sort_values("condition")
)

In [None]:
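# safety switch: deliberately raise here so that "Run All" stops before the export cell below
raise RuntimeError("Safety switch: run the export cell below manually if required.")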

In [None]:
with pd.ExcelWriter("data/manual/clinicaltrials/all_conditions.xlsx") as writer:
    conditions_for_export.to_excel(writer, index=False)
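
An alternative to halting the notebook is to guard the export with an explicit flag. The following is only a sketch: `EXPORT_CONDITIONS` and `export_conditions` are hypothetical names, and it writes CSV to a temporary directory so that the example has no Excel dependency.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPORT_CONDITIONS = False  # hypothetical flag; set to True to write the file


def export_conditions(frame, path, enabled):
    """Write `frame` to `path` only when `enabled`; return whether a file was written."""
    if not enabled:
        return False
    frame.to_csv(path, index=False)
    return True


toy = pd.DataFrame({"condition": ["Multiple Sclerosis"]})
target = Path(tempfile.mkdtemp()) / "all_conditions.csv"
written = export_conditions(toy, target, EXPORT_CONDITIONS)
print(written)
```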

### Import and filter
Import the filtered list of conditions, then retain only those trials with at least one of those conditions.

In [17]:
ms_conditions = pd.read_excel("data/manual/clinicaltrials/conditions_list_filtered.xlsx")

In [18]:
ms_conditions.head()

Unnamed: 0,condition,tag
0,Active Secondary Progressive Multiple Sclerosis,SPMS
1,Acute Disseminated Encephalomyelitis,MS
2,Acute Exacerbation of Remitting Relapsing Mult...,RRMS
3,Advanced Multiple Sclerosis,MS
4,Advancing Multiple Sclerosis,MS


For filtering, we use an inner join, since we only consider MS trials.

In [20]:
ms_trials = pd.merge(
    left=conditions, right=ms_conditions, on="condition", how="inner"
)
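
To review which conditions an inner join silently drops, one option is a left join with `indicator=True`. The sketch below uses toy stand-ins for `conditions` and `ms_conditions`; all IDs and the non-MS condition names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins; values are invented for illustration.
conditions_toy = pd.DataFrame(
    {
        "nct_number": ["NCT0001", "NCT0001", "NCT0002"],
        "condition": ["Multiple Sclerosis", "Hypogammaglobulinemia", "Mitral Stenosis"],
    }
)
ms_conditions_toy = pd.DataFrame({"condition": ["Multiple Sclerosis"], "tag": ["MS"]})

# indicator=True adds a `_merge` column that records where each row was found.
checked = pd.merge(
    conditions_toy, ms_conditions_toy, on="condition", how="left", indicator=True
)
kept = checked[checked["_merge"] == "both"]  # rows an inner join would retain
dropped = checked[checked["_merge"] == "left_only"]  # rows an inner join would discard
print(dropped[["nct_number", "condition"]])
```

Rows marked `both` are exactly those the inner join keeps, so this does not change the result; it only makes the discarded rows visible.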

### Drop the single-condition column, just keep the condition categories
This is to reduce the size of the dataset; we don't need resolution beyond the categories.

In [21]:
ms_trials = (
    ms_trials[["nct_number", "Conditions", "tag"]]
    .drop_duplicates()
    .rename(columns={"Conditions": "conditions", "tag": "condition_category"})
    .copy()
)

In [22]:
ms_trials

Unnamed: 0,nct_number,conditions,condition_category
0,NCT04447937,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS
1,NCT00942214,Multiple Sclerosis,MS
2,NCT00883337,Multiple Sclerosis,MS
3,NCT01018537,Multiple Sclerosis,MS
4,NCT04132037,Multiple Sclerosis,MS
...,...,...,...
2995,NCT03636789,Radiologically Isolated Syndrome (RIS)|Multipl...,RIS
2996,NCT05437276,Gait Impairment Due to Mild/Moderate Multiple ...,MS
2997,NCT01865357,Clinically Isolated Demyelinating Syndromes|Mu...,CIS
2998,NCT04369898,Multiple Sclerosis),MS


### Merge with base dataframe

In [23]:
ms_trials = pd.merge(left=trial_data, right=ms_trials, on="nct_number", how="inner")

In [24]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS
1,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS
2,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS
3,NCT05090033,Characterizing the Use of Ofatumumab in a Real...,https://beta.clinicaltrials.gov/study/NCT05090033,RECRUITING,Relapsing Multiple Sclerosis,RMS
4,NCT00883337,A Study Comparing the Effectiveness and Safety...,https://beta.clinicaltrials.gov/study/NCT00883337,COMPLETED,Multiple Sclerosis,MS


In [25]:
len(ms_trials["nct_number"].drop_duplicates())

2711

We have identified 2711 MS trials.

## Interventions
Convert the intervention lists to long-form. If a trial includes more than one intervention, they are provided in a pipe-separated list. We will create one row per intervention.

In [26]:
interventions = raw_trial_data[["nct_number", "Interventions"]].copy()

In [27]:
interventions["intervention"] = interventions["Interventions"].str.split("|")
interventions = interventions.explode("intervention")

In [28]:
interventions.head()

Unnamed: 0,nct_number,Interventions,intervention
0,NCT04447937,OTHER: No Interventions,OTHER: No Interventions
1,NCT00942214,DRUG: Natalizumab,DRUG: Natalizumab
2,NCT05562414,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL: HIIT
2,NCT05562414,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL: MCT
3,NCT05090033,OTHER: ofatumumab,OTHER: ofatumumab


Are there trials where intervention is not specified?

In [29]:
len(interventions[interventions["Interventions"].isna()]) > 0

True

Drop trials where the intervention is not specified. They will still be kept in the final dataset, because we use a left join when adding the intervention data to the base dataframe (see below).

In [30]:
interventions = interventions[~interventions["Interventions"].isna()].copy()

### Keep the intervention type only
Again, intervention type level information is sufficient for our analysis.

In [31]:
interventions["intervention_type"] = (
    interventions["intervention"].str.split(":").str[0]
)
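
A small sketch of the type extraction on toy strings follows. The first two values appear in the data above; the third is made up to show an intervention name that itself contains a colon. Passing `n=1` is an assumption about what is wanted if the name is ever needed: it splits at the first colon only.

```python
import pandas as pd

toy = pd.Series(["DRUG: Natalizumab", "BEHAVIORAL: HIIT", "OTHER: Ratio 1:1 placebo"])

# Split at the first colon only, so colons inside the name are left alone.
parts = toy.str.split(":", n=1)
intervention_type = parts.str[0]
intervention_name = parts.str[1].str.strip()
print(intervention_type.tolist())
```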

In [32]:
interventions.head()

Unnamed: 0,nct_number,Interventions,intervention,intervention_type
0,NCT04447937,OTHER: No Interventions,OTHER: No Interventions,OTHER
1,NCT00942214,DRUG: Natalizumab,DRUG: Natalizumab,DRUG
2,NCT05562414,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL: HIIT,BEHAVIORAL
2,NCT05562414,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL: MCT,BEHAVIORAL
3,NCT05090033,OTHER: ofatumumab,OTHER: ofatumumab,OTHER


In [33]:
interventions = (
    interventions[["nct_number", "Interventions", "intervention_type"]]
    .rename(columns={"Interventions": "interventions"})
    .drop_duplicates()
    .copy()
)

In [34]:
interventions.head()

Unnamed: 0,nct_number,interventions,intervention_type
0,NCT04447937,OTHER: No Interventions,OTHER
1,NCT00942214,DRUG: Natalizumab,DRUG
2,NCT05562414,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL
3,NCT05090033,OTHER: ofatumumab,OTHER
4,NCT00883337,DRUG: Interferon β-1a|DRUG: Teriflunomide,DRUG


### Merge with base dataframe
We use a left join here, i.e. we keep trials where intervention is not specified in the overall dataset.

In [35]:
ms_trials = pd.merge(left=ms_trials, right=interventions, on="nct_number", how="left")
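
The effect of the join type can be illustrated on toy data. In this sketch, `base` and `interventions_toy` are hypothetical stand-ins with made-up IDs; the second trial has no intervention row.

```python
import pandas as pd

base = pd.DataFrame({"nct_number": ["NCT0001", "NCT0002"]})
interventions_toy = pd.DataFrame(
    {"nct_number": ["NCT0001"], "intervention_type": ["DRUG"]}
)

# A left join keeps every trial from the base frame; missing values become NaN.
left = pd.merge(base, interventions_toy, on="nct_number", how="left")

# An inner join would drop the trial without an intervention.
inner = pd.merge(base, interventions_toy, on="nct_number", how="inner")
print(left)
```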

In [36]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER
1,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG
2,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL
3,NCT05090033,Characterizing the Use of Ofatumumab in a Real...,https://beta.clinicaltrials.gov/study/NCT05090033,RECRUITING,Relapsing Multiple Sclerosis,RMS,OTHER: ofatumumab,OTHER
4,NCT00883337,A Study Comparing the Effectiveness and Safety...,https://beta.clinicaltrials.gov/study/NCT00883337,COMPLETED,Multiple Sclerosis,MS,DRUG: Interferon β-1a|DRUG: Teriflunomide,DRUG


## Sponsors and collaborators

### Sponsors

Does the 'Sponsor' column contain a list?

In [37]:
raw_trial_data[raw_trial_data["Sponsor"].str.contains(r"|", regex=False)][["Sponsor"]].drop_duplicates()

Unnamed: 0,Sponsor


-> Each trial has exactly one sponsor listed. No trial is missing a sponsor either: a missing value would put NaN into the boolean mask, and the indexing above would fail.
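
If missing sponsors had to be tolerated, one possible approach is to pass `na=False` to `str.contains` and count missing values separately. This is a sketch on toy data; apart from "Sanofi", the sponsor values are made up.

```python
import pandas as pd

toy = pd.DataFrame({"Sponsor": ["Sanofi", None, "Sponsor A|Sponsor B"]})

# na=False treats a missing sponsor as "no match", so the mask is purely boolean.
has_list = toy["Sponsor"].str.contains("|", regex=False, na=False)
multi_sponsor = toy[has_list]

# Count missing sponsors explicitly instead of relying on an error.
n_missing = int(toy["Sponsor"].isna().sum())
print(multi_sponsor)
print(n_missing)
```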

### Collaborators

The 'Collaborators' column contains a list; however, this is beyond the resolution we need, so we keep this column as-is for now.

In [39]:
sponsors_collaborators = (
    raw_trial_data[["nct_number", "Sponsor", "Collaborators"]]
    .rename(columns={"Sponsor": "sponsor", "Collaborators": "collaborators"})
    .drop_duplicates()
    .copy()
)

In [40]:
sponsors_collaborators

Unnamed: 0,nct_number,sponsor,collaborators
0,NCT04447937,Advanced Neurosciences Institute,Novel Pharmaceutics Institute
1,NCT00942214,"University Hospital, Toulouse",
2,NCT05562414,Klinik Valens,
3,NCT05090033,Novartis Pharmaceuticals,
4,NCT00883337,Sanofi,
...,...,...,...
2959,NCT02451696,NYU Langone Health,
2960,NCT01191996,Innate Immunotherapeutics,Primorus Clinical Trials|National Multiple Scl...
2961,NCT03826095,Hacettepe University,KARMUTLU|MATUNCER|EÇKÜTÜKÇÜ
2962,NCT05341895,Firat University,


### Merge with base dataframe

In [41]:
ms_trials = pd.merge(
    left=ms_trials, right=sponsors_collaborators, on="nct_number", how="left"
)

In [42]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute
1,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",
2,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,
3,NCT05090033,Characterizing the Use of Ofatumumab in a Real...,https://beta.clinicaltrials.gov/study/NCT05090033,RECRUITING,Relapsing Multiple Sclerosis,RMS,OTHER: ofatumumab,OTHER,Novartis Pharmaceuticals,
4,NCT00883337,A Study Comparing the Effectiveness and Safety...,https://beta.clinicaltrials.gov/study/NCT00883337,COMPLETED,Multiple Sclerosis,MS,DRUG: Interferon β-1a|DRUG: Teriflunomide,DRUG,Sanofi,


## Age and sex

### Sex

In [43]:
raw_trial_data[["Sex"]].drop_duplicates()

Unnamed: 0,Sex
0,ALL
10,FEMALE
31,MALE
50,


Leave as-is.

### Age

In [44]:
raw_trial_data[["Age"]].drop_duplicates()

Unnamed: 0,Age
0,"ADULT, OLDER_ADULT"
1,ADULT
10,"CHILD, ADULT, OLDER_ADULT"
16,"CHILD, ADULT"
69,CHILD
1605,OLDER_ADULT


Split to long-form.

In [45]:
raw_trial_data[raw_trial_data["Age"].isna()]

Unnamed: 0,nct_number,Study Title,Study URL,Study Status,Conditions,Interventions,Sponsor,Collaborators,Sex,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,Locations


In [50]:
age_sex = raw_trial_data[["nct_number", "Sex", "Age"]].drop_duplicates().copy()

In [51]:
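age_sex["age_category"] = age_sex["Age"].str.split(",")
age_sex = age_sex.explode("age_category")
# the list items are separated by ", ", so strip the leading space left by the split
age_sex["age_category"] = age_sex["age_category"].str.strip()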
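
Because the age categories are separated by a comma followed by a space, splitting on a bare comma leaves a leading space on every item after the first. The toy check below shows the difference that stripping makes; it uses one value taken from the `Age` column above.

```python
import pandas as pd

toy = pd.Series(["ADULT, OLDER_ADULT"])

# Splitting on "," alone keeps the space that followed the comma.
bare = toy.str.split(",").explode().tolist()

# Stripping whitespace afterwards yields clean category labels.
stripped = toy.str.split(",").explode().str.strip().tolist()
print(bare)
print(stripped)
```

Without the strip, "OLDER_ADULT" and " OLDER_ADULT" would count as two different categories in any later grouping.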

In [52]:
age_sex = age_sex.rename(columns={"Sex": "sex", "Age": "age"})

In [53]:
age_sex

Unnamed: 0,nct_number,sex,age,age_category
0,NCT04447937,ALL,"ADULT, OLDER_ADULT",ADULT
0,NCT04447937,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT
1,NCT00942214,ALL,ADULT,ADULT
2,NCT05562414,ALL,"ADULT, OLDER_ADULT",ADULT
2,NCT05562414,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT
...,...,...,...,...
2962,NCT05341895,ALL,"ADULT, OLDER_ADULT",ADULT
2962,NCT05341895,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT
2963,NCT03458169,ALL,"CHILD, ADULT, OLDER_ADULT",CHILD
2963,NCT03458169,ALL,"CHILD, ADULT, OLDER_ADULT",ADULT


### Merge with base dataframe

In [54]:
ms_trials = pd.merge(left=ms_trials, right=age_sex, on="nct_number", how="left")

In [55]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators,sex,age,age_category
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",ADULT
1,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT
2,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",,ALL,ADULT,ADULT
3,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",ADULT
4,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT


## Phases

In [56]:
raw_trial_data[["Phases"]].drop_duplicates()

Unnamed: 0,Phases
0,
1,PHASE4
4,PHASE3
6,PHASE2
12,PHASE1
14,PHASE2|PHASE3
40,PHASE1|PHASE2
46,EARLY_PHASE1


Split the phase list; trials with more than one phase will be counted for each phase separately.

In [57]:
phases = raw_trial_data[["nct_number", "Phases"]].drop_duplicates().copy()
phases["phase"] = phases["Phases"].str.split("|")
phases = phases.explode("phase")
phases = phases.rename(columns={"Phases": "phases"})
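
Once several long-form fields are merged, each trial occupies many rows, so counting rows per phase overstates the number of trials. A minimal sketch of counting distinct trials instead, on made-up toy data:

```python
import pandas as pd

# Toy long-form frame: NCT0001 has two age categories and two phases (invented data).
toy = pd.DataFrame(
    {
        "nct_number": ["NCT0001"] * 4 + ["NCT0002"],
        "age_category": ["ADULT", "OLDER_ADULT", "ADULT", "OLDER_ADULT", "ADULT"],
        "phase": ["PHASE1", "PHASE1", "PHASE2", "PHASE2", "PHASE2"],
    }
)

# Row counts are inflated by the other exploded fields.
rows_per_phase = toy.groupby("phase").size()

# Counting distinct trial IDs gives one count per trial and phase.
trials_per_phase = toy.groupby("phase")["nct_number"].nunique()
print(trials_per_phase)
```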

In [58]:
phases

Unnamed: 0,nct_number,phases,phase
0,NCT04447937,,
1,NCT00942214,PHASE4,PHASE4
2,NCT05562414,,
3,NCT05090033,,
4,NCT00883337,PHASE3,PHASE3
...,...,...,...
2960,NCT01191996,PHASE1|PHASE2,PHASE1
2960,NCT01191996,PHASE1|PHASE2,PHASE2
2961,NCT03826095,,
2962,NCT05341895,,


### Merge with base dataframe

In [59]:
ms_trials = pd.merge(left=ms_trials, right=phases, on="nct_number", how="left")

In [60]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators,sex,age,age_category,phases,phase
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",ADULT,,
1,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,
2,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",,ALL,ADULT,ADULT,PHASE4,PHASE4
3,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",ADULT,,
4,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,


## Enrollment

In [61]:
len(raw_trial_data[raw_trial_data["Enrollment"].isna()].drop_duplicates())

13

A few trials don't specify enrollment; what about the rest?

In [62]:
enrollment_notnan = raw_trial_data[~raw_trial_data["Enrollment"].isna()].copy()
enrollment_notnan["enrollment_int"] = enrollment_notnan["Enrollment"].astype(int)

The rest can be converted to integer if required; we leave this for now, as enrollment is trial-level anyway (no country-level or site-level resolution) and we thus won't analyze it further.
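
Should integer enrollment be needed later, pandas' nullable integer dtype `Int64` is one option that avoids filtering out the missing values first. A sketch on toy values (two of them taken from the table above):

```python
import pandas as pd

toy = pd.Series([1500.0, None, 30.0], name="Enrollment")

# "Int64" (capital I) is the nullable integer dtype; missing values become <NA>.
enrollment_int = toy.astype("Int64")
print(enrollment_int)
```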

### Merge with base dataframe

In [63]:
enrollment = (
    raw_trial_data[["nct_number", "Enrollment"]]
    .rename(columns={"Enrollment": "enrollment"})
    .drop_duplicates()
    .copy()
)

In [64]:
ms_trials = pd.merge(left=ms_trials, right=enrollment, on="nct_number", how="left")

In [65]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators,sex,age,age_category,phases,phase,enrollment
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",ADULT,,,1500.0
1,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,,1500.0
2,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",,ALL,ADULT,ADULT,PHASE4,PHASE4,300.0
3,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",ADULT,,,30.0
4,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,,30.0


## Funder type and study type

### Funder type

In [66]:
raw_trial_data[["Funder Type"]].drop_duplicates()

Unnamed: 0,Funder Type
0,OTHER
3,INDUSTRY
26,NIH
29,OTHER_GOV
176,NETWORK
372,INDIV
424,FED


In [67]:
raw_trial_data[raw_trial_data["Funder Type"].isna()]

Unnamed: 0,nct_number,Study Title,Study URL,Study Status,Conditions,Interventions,Sponsor,Collaborators,Sex,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,Locations


This is fine, no processing required.

### Study type

In [68]:
raw_trial_data[["Study Type"]].drop_duplicates()

Unnamed: 0,Study Type
0,OBSERVATIONAL
1,INTERVENTIONAL
1268,EXPANDED_ACCESS


In [70]:
raw_trial_data[raw_trial_data["Study Type"].isna()]

Unnamed: 0,nct_number,Study Title,Study URL,Study Status,Conditions,Interventions,Sponsor,Collaborators,Sex,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,Locations


This is fine, no processing required.

### Merge with base dataframe

In [71]:
funder_study_type = (
    raw_trial_data[["nct_number", "Funder Type", "Study Type"]]
    .rename(columns={"Funder Type": "funder_type", "Study Type": "study_type"})
    .drop_duplicates()
    .copy()
)

In [72]:
ms_trials = pd.merge(left=ms_trials, right=funder_study_type, on="nct_number", how="left")

In [73]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators,sex,age,age_category,phases,phase,enrollment,funder_type,study_type
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",ADULT,,,1500.0,OTHER,OBSERVATIONAL
1,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,,1500.0,OTHER,OBSERVATIONAL
2,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",,ALL,ADULT,ADULT,PHASE4,PHASE4,300.0,OTHER,INTERVENTIONAL
3,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",ADULT,,,30.0,OTHER,INTERVENTIONAL
4,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,,30.0,OTHER,INTERVENTIONAL


## Study design

In [74]:
design = (
    raw_trial_data[["nct_number", "Study Design"]]
    .rename(columns={"Study Design": "study_design"})
    .drop_duplicates()
    .copy()
)

In [75]:
design

Unnamed: 0,nct_number,study_design
0,NCT04447937,Observational Model: |Time Perspective: p
1,NCT00942214,Allocation: NA|Intervention Model: SINGLE_GROU...
2,NCT05562414,Allocation: RANDOMIZED|Intervention Model: PAR...
3,NCT05090033,Observational Model: |Time Perspective: p
4,NCT00883337,Allocation: RANDOMIZED|Intervention Model: PAR...
...,...,...
2959,NCT02451696,Allocation: NON_RANDOMIZED|Intervention Model:...
2960,NCT01191996,Allocation: NA|Intervention Model: SINGLE_GROU...
2961,NCT03826095,Observational Model: |Time Perspective: p
2962,NCT05341895,Allocation: NON_RANDOMIZED|Intervention Model:...


### Extract the primary purpose

In [76]:
primary_purpose = design.copy()
primary_purpose["primary_purpose"] = primary_purpose["study_design"].str.split(
    "|"
)
primary_purpose = primary_purpose.explode("primary_purpose")

Drop NaNs.

In [77]:
primary_purpose = primary_purpose[
    ~primary_purpose["primary_purpose"].isna()
].copy()

Keep only 'Primary Purpose' and drop the rest.

In [78]:
primary_purpose = primary_purpose[
    primary_purpose["primary_purpose"].str.startswith("Primary Purpose")
].copy()

In [79]:
primary_purpose

Unnamed: 0,nct_number,study_design,primary_purpose
1,NCT00942214,Allocation: NA|Intervention Model: SINGLE_GROU...,Primary Purpose: DIAGNOSTIC
2,NCT05562414,Allocation: RANDOMIZED|Intervention Model: PAR...,Primary Purpose: TREATMENT
4,NCT00883337,Allocation: RANDOMIZED|Intervention Model: PAR...,Primary Purpose: TREATMENT
6,NCT03249714,Allocation: RANDOMIZED|Intervention Model: PAR...,Primary Purpose: TREATMENT
9,NCT05809414,Allocation: RANDOMIZED|Intervention Model: CRO...,Primary Purpose: TREATMENT
...,...,...,...
2955,NCT01845896,Allocation: RANDOMIZED|Intervention Model: PAR...,Primary Purpose: TREATMENT
2959,NCT02451696,Allocation: NON_RANDOMIZED|Intervention Model:...,Primary Purpose: TREATMENT
2960,NCT01191996,Allocation: NA|Intervention Model: SINGLE_GROU...,Primary Purpose: TREATMENT
2962,NCT05341895,Allocation: NON_RANDOMIZED|Intervention Model:...,Primary Purpose: OTHER


Clean up.

In [80]:
primary_purpose["primary_purpose"] = primary_purpose[
    "primary_purpose"
].str.replace("Primary Purpose: ", "")
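
The split, filter, and clean-up steps could also be condensed into a single regular expression. This is an alternative sketch, not the approach used in this notebook. The first toy string is made up to mimic the field format (the real values are truncated in the output above); the second is taken from the table above.

```python
import pandas as pd

toy = pd.Series(
    [
        # Invented string in the style of the "Study Design" field.
        "Allocation: RANDOMIZED|Intervention Model: PARALLEL|Primary Purpose: TREATMENT",
        "Observational Model: |Time Perspective: p",
    ]
)

# Capture the upper-case token after "Primary Purpose: "; no match yields NaN.
primary_purpose_toy = toy.str.extract(r"Primary Purpose: ([A-Z_]+)", expand=False)
print(primary_purpose_toy)
```

Unlike the filter-based approach, this keeps one row per trial, with a missing value where no primary purpose is given.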

In [81]:
primary_purpose.head()

Unnamed: 0,nct_number,study_design,primary_purpose
1,NCT00942214,Allocation: NA|Intervention Model: SINGLE_GROU...,DIAGNOSTIC
2,NCT05562414,Allocation: RANDOMIZED|Intervention Model: PAR...,TREATMENT
4,NCT00883337,Allocation: RANDOMIZED|Intervention Model: PAR...,TREATMENT
6,NCT03249714,Allocation: RANDOMIZED|Intervention Model: PAR...,TREATMENT
9,NCT05809414,Allocation: RANDOMIZED|Intervention Model: CRO...,TREATMENT


Sanity check: only one primary purpose per trial?

In [82]:
primary_purpose.groupby(
    ["nct_number", "study_design"]
).count().reset_index().sort_values("primary_purpose", ascending=False).head()

Unnamed: 0,nct_number,study_design,primary_purpose
0,NCT00000146,Allocation: RANDOMIZED|Intervention Model: |Ma...,1
1411,NCT03737851,Allocation: RANDOMIZED|Intervention Model: PAR...,1
1425,NCT03768648,Allocation: NON_RANDOMIZED|Intervention Model:...,1
1424,NCT03759522,Allocation: NON_RANDOMIZED|Intervention Model:...,1
1423,NCT03759249,Allocation: RANDOMIZED|Intervention Model: PAR...,1


Yes; now join back to base dataframe.

### Merge with base dataframe
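Note that we have to use two separate left joins: `design` contains every trial, whereas `primary_purpose` only contains trials whose study design lists a primary purpose, because the other rows were dropped along the way after exploding the list field.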

In [83]:
ms_trials = pd.merge(left=ms_trials, right=design, on="nct_number", how="left")

In [84]:
ms_trials = pd.merge(
    left=ms_trials,
    right=primary_purpose[["nct_number", "primary_purpose"]],
    on="nct_number",
    how="left",
)
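
The "one primary purpose per trial" assumption can also be enforced at merge time with the `validate` argument of `pd.merge`. The sketch below uses made-up toy frames; `base`, `purpose`, and `bad` are hypothetical names.

```python
import pandas as pd

base = pd.DataFrame(
    {
        "nct_number": ["NCT0001", "NCT0001", "NCT0002"],
        "age_category": ["ADULT", "OLDER_ADULT", "ADULT"],
    }
)
purpose = pd.DataFrame({"nct_number": ["NCT0001"], "primary_purpose": ["TREATMENT"]})

# many_to_one: the right-hand keys must be unique, otherwise pandas raises MergeError.
merged = pd.merge(base, purpose, on="nct_number", how="left", validate="many_to_one")

# A right-hand frame with two purposes for one trial violates the assumption.
bad = pd.DataFrame(
    {"nct_number": ["NCT0001", "NCT0001"], "primary_purpose": ["TREATMENT", "OTHER"]}
)
try:
    pd.merge(base, bad, on="nct_number", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(merged)
print(raised)
```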

In [85]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators,sex,age,age_category,phases,phase,enrollment,funder_type,study_type,study_design,primary_purpose
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",ADULT,,,1500.0,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,
1,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,,1500.0,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,
2,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",,ALL,ADULT,ADULT,PHASE4,PHASE4,300.0,OTHER,INTERVENTIONAL,Allocation: NA|Intervention Model: SINGLE_GROU...,DIAGNOSTIC
3,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",ADULT,,,30.0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,TREATMENT
4,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,ALL,"ADULT, OLDER_ADULT",OLDER_ADULT,,,30.0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,TREATMENT


## Dates

In [86]:
dates = raw_trial_data[
    ["nct_number", "Start Date", "Primary Completion Date", "Completion Date"]
].copy()

In [87]:
dates = dates.rename(
    columns={colname: colname.lower().replace(" ", "_") for colname in dates.columns}
)

In [88]:
dates

Unnamed: 0,nct_number,start_date,primary_completion_date,completion_date
0,NCT04447937,2020-06-29,2021-01-30,2021-07-30
1,NCT00942214,2009-06,2011-02,2011-03
2,NCT05562414,2022-10-01,2023-04-01,2023-04-01
3,NCT05090033,2022-12-08,2025-06-30,2025-06-30
4,NCT00883337,2009-04,2011-09,2015-05
...,...,...,...,...
2959,NCT02451696,2014-01,2017-12-08,2017-12-28
2960,NCT01191996,2010-08,2012-06,2012-11
2961,NCT03826095,2019-02-04,2019-07-26,2019-08-02
2962,NCT05341895,2021-11-22,2022-04-22,2022-05-06


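Dates are recorded at different resolutions (`YYYY-MM-DD` or `YYYY-MM`), so we extract the year, which is present in both formats. The year is kept as a string.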

In [89]:
dates["start_year"] = dates["start_date"].str[:4]
dates["primary_completion_year"] = dates["primary_completion_date"].str[:4]
dates["completion_year"] = dates["completion_date"].str[:4]
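
As a minimal sketch of the slicing step, the toy frame below (hypothetical data, not taken from the export) mixes both date resolutions and a missing value. The `start_year_int` column is an optional extra that is not part of the notebook; it assumes a nullable integer year is wanted for later numeric comparisons.

```python
import pandas as pd

# Toy data (assumption): both date resolutions plus one missing value.
toy_dates = pd.DataFrame({"start_date": ["2020-06-29", "2009-06", None]})

# The first four characters are the year in both formats.
toy_dates["start_year"] = toy_dates["start_date"].str[:4]

# Optional: nullable integer dtype, so missing years stay missing
# instead of forcing the whole column to float.
toy_dates["start_year_int"] = pd.to_numeric(toy_dates["start_year"]).astype("Int64")

print(toy_dates)
```

Slicing the string avoids having to parse two different date formats, at the cost of discarding month and day.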

### Merge with base dataframe

In [90]:
ms_trials = pd.merge(left=ms_trials, right=dates, on="nct_number", how="left")

In [91]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators,...,funder_type,study_type,study_design,primary_purpose,start_date,primary_completion_date,completion_date,start_year,primary_completion_year,completion_year
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,...,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,,2020-06-29,2021-01-30,2021-07-30,2020,2021,2021
1,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,...,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,,2020-06-29,2021-01-30,2021-07-30,2020,2021,2021
2,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",,...,OTHER,INTERVENTIONAL,Allocation: NA|Intervention Model: SINGLE_GROU...,DIAGNOSTIC,2009-06,2011-02,2011-03,2009,2011,2011
3,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,...,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,TREATMENT,2022-10-01,2023-04-01,2023-04-01,2022,2023,2023
4,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,...,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,TREATMENT,2022-10-01,2023-04-01,2023-04-01,2022,2023,2023


## Locations
Site locations are provided as a pipe-separated list of addresses. We split the list, write each address to a separate row, and then extract the country from the address.

In [92]:
locations = (
    raw_trial_data[["nct_number", "Locations"]]
    .rename(columns={"Locations": "locations"})
    .copy()
)

In [93]:
locations["location"] = locations["locations"].str.split("|")
locations = locations.explode("location")
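
The split-and-explode step can be illustrated on a small toy frame. The trial IDs and addresses below are placeholders chosen for illustration, not rows from the export.

```python
import pandas as pd

# Toy data (assumption): placeholder trial IDs and addresses.
toy = pd.DataFrame(
    {
        "nct_number": ["NCT_A", "NCT_B"],
        "locations": [
            "Site 1, Ankara, Turkey|Site 2, Elazığ, Turkey",
            "Site 3, Paris, France",
        ],
    }
)

# Split each pipe-separated string into a list of addresses ...
toy["location"] = toy["locations"].str.split("|")

# ... and give every list element its own row.
toy_long = toy.explode("location")

print(toy_long[["nct_number", "location"]])
```

Note that `explode` repeats the original index for every row derived from the same trial, which is why index values appear more than once in the output. If a unique index is needed, `reset_index(drop=True)` can be applied afterwards.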

In [94]:
locations

Unnamed: 0,nct_number,locations,location
0,NCT04447937,"Advanced Neurosciences Institute, Franklin, Te...","Advanced Neurosciences Institute, Franklin, Te..."
1,NCT00942214,"service de neurologie, hôpital Purpan, Toulous...","service de neurologie, hôpital Purpan, Toulous..."
2,NCT05562414,"Klinik Valens, Valens rehabilitation clinic, V...","Klinik Valens, Valens rehabilitation clinic, V..."
3,NCT05090033,"Novartis Investigative Site, Concord, New Sout...","Novartis Investigative Site, Concord, New Sout..."
3,NCT05090033,"Novartis Investigative Site, Concord, New Sout...","Novartis Investigative Site, Southport, Queens..."
...,...,...,...
2960,NCT01191996,"Primorus Clinical Trials, 40 Stewart Street, C...","Primorus Clinical Trials, 40 Stewart Street, C..."
2961,NCT03826095,"Hacettepe University, Ankara, Turkey","Hacettepe University, Ankara, Turkey"
2962,NCT05341895,"Furkan Bilek, Elazığ, 23100, Turkey|Fırat univ...","Furkan Bilek, Elazığ, 23100, Turkey"
2962,NCT05341895,"Furkan Bilek, Elazığ, 23100, Turkey|Fırat univ...","Fırat university, Elazığ, 23100, Turkey"


### Get the country from the site address
The address format is standardized as a comma-separated list with the country name as the last item. In some cases, the country name itself contains a comma (e.g. 'Iran, Islamic Republic of'). Thus, if an address ends with 'epublic of' (we omit the leading 'R' so that the match does not depend on capitalization), we take the last two entries and swap them, which yields e.g. 'Islamic Republic of Iran'.

In [96]:
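def country_from_location(raw_location):
    """Return the country of a comma-separated site address."""
    raw_location = str(raw_location)  # missing values (NaN) become the string "nan"
    if raw_location.endswith("epublic of"):
        # e.g. "..., Iran, Islamic Republic of" -> "Islamic Republic of Iran"
        country = reversed(raw_location.split(",")[-2:])
        # each piece keeps its leading space, which separates the joined words
        country = "".join(country).strip()
    else:
        country = raw_location.split(",")[-1].strip()
    return country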

In [97]:
locations["country"] = locations.apply(
    lambda row: country_from_location(raw_location=row["location"]), axis=1
)
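
To make the swapping rule concrete, here is a small self-contained sketch that follows the same logic. The function name `extract_country` and the example addresses are hypothetical and only serve as illustration.

```python
def extract_country(address: str) -> str:
    """Sketch of the country rule described above (hypothetical helper)."""
    parts = [part.strip() for part in address.split(",")]
    # Country names such as "Iran, Islamic Republic of" span two items:
    # put the last item first and append the second-to-last one.
    if len(parts) >= 2 and parts[-1].endswith("epublic of"):
        return f"{parts[-1]} {parts[-2]}"
    return parts[-1]


# Example addresses (assumption): made up to cover both branches.
examples = [
    "Hacettepe University, Ankara, Turkey",
    "Example Hospital, Tehran, Iran, Islamic Republic of",
    "Example Clinic, Seoul, Korea, Republic of",
]
for address in examples:
    print(extract_country(address))
```

A country such as 'Czech Republic' does not end with 'epublic of' and therefore takes the default branch.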

In [98]:
locations

Unnamed: 0,nct_number,locations,location,country
0,NCT04447937,"Advanced Neurosciences Institute, Franklin, Te...","Advanced Neurosciences Institute, Franklin, Te...",United States
1,NCT00942214,"service de neurologie, hôpital Purpan, Toulous...","service de neurologie, hôpital Purpan, Toulous...",France
2,NCT05562414,"Klinik Valens, Valens rehabilitation clinic, V...","Klinik Valens, Valens rehabilitation clinic, V...",Switzerland
3,NCT05090033,"Novartis Investigative Site, Concord, New Sout...","Novartis Investigative Site, Concord, New Sout...",Australia
3,NCT05090033,"Novartis Investigative Site, Concord, New Sout...","Novartis Investigative Site, Southport, Queens...",Australia
...,...,...,...,...
2960,NCT01191996,"Primorus Clinical Trials, 40 Stewart Street, C...","Primorus Clinical Trials, 40 Stewart Street, C...",New Zealand
2961,NCT03826095,"Hacettepe University, Ankara, Turkey","Hacettepe University, Ankara, Turkey",Turkey
2962,NCT05341895,"Furkan Bilek, Elazığ, 23100, Turkey|Fırat univ...","Furkan Bilek, Elazığ, 23100, Turkey",Turkey
2962,NCT05341895,"Furkan Bilek, Elazığ, 23100, Turkey|Fırat univ...","Fırat university, Elazığ, 23100, Turkey",Turkey


### Check 'Many Locations'
Some sites are not given explicitly but are reported as 'Many Locations'. Does this label ever appear after a specific address within the same entry?

In [99]:
locations[locations["location"].astype(str).str.startswith("Many Locations")]

Unnamed: 0,nct_number,locations,location,country
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Austria",Austria
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Belgium",Belgium
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Finland",Finland
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Germany",Germany
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Israel",Israel
...,...,...,...,...
2956,NCT00461396,"Many Locations, Alabama, United States|Many Lo...","Many Locations, North Carolina, United States",United States
2956,NCT00461396,"Many Locations, Alabama, United States|Many Lo...","Many Locations, Ohio, United States",United States
2956,NCT00461396,"Many Locations, Alabama, United States|Many Lo...","Many Locations, Pennsylvania, United States",United States
2956,NCT00461396,"Many Locations, Alabama, United States|Many Lo...","Many Locations, Tennessee, United States",United States


In [100]:
locations[
    (~locations["location"].astype(str).str.startswith("Many Locations"))
    & (locations["location"].astype(str).str.contains("Many Locations"))
]

Unnamed: 0,nct_number,locations,location,country


-> If an entry indicates 'Many Locations', it always starts with 'Many Locations' (i.e. the label never follows a specific address).

### Flag entries with 'Many Locations'

In [101]:
locations["many_locations_flag"] = (
    locations["location"].astype(str).str.lower().str.startswith("many locations")
)
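
A minimal sketch of the flagging step on toy data follows. It uses the `na` parameter of `Series.str.startswith` as an alternative to the `astype(str)` conversion used in the notebook; both variants flag missing locations as `False`. The example values are assumptions.

```python
import pandas as pd

# Toy data (assumption): one 'Many Locations' entry, one address, one missing value.
toy_locations = pd.Series(
    ["Many Locations, Austria", "Example Clinic, Vienna, Austria", None],
    name="location",
)

# Lower-case first so the match does not depend on capitalization;
# na=False turns missing locations into False instead of NaN.
many_locations_flag = toy_locations.str.lower().str.startswith(
    "many locations", na=False
)

print(many_locations_flag.tolist())
```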

In [102]:
locations[locations["many_locations_flag"]].head()

Unnamed: 0,nct_number,locations,location,country,many_locations_flag
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Austria",Austria,True
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Belgium",Belgium,True
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Finland",Finland,True
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Germany",Germany,True
69,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...","Many Locations, Israel",Israel,True


In [103]:
len(locations[locations["many_locations_flag"]])

189

### Add country ISO codes and continents
We manually created a mapping from the country names found in the data to a normalized country name, the corresponding 3-letter ISO code, and the continent.

In [104]:
country_mapping = pd.read_excel("data/manual/clinicaltrials//countries.xlsx")

In [105]:
locations = pd.merge(
    left=locations, right=country_mapping, on="country", how="left"
)

In [106]:
locations

Unnamed: 0,nct_number,locations,location,country,many_locations_flag,country_norm,country_ISO,country_continent
0,NCT04447937,"Advanced Neurosciences Institute, Franklin, Te...","Advanced Neurosciences Institute, Franklin, Te...",United States,False,United States,USA,North America
1,NCT00942214,"service de neurologie, hôpital Purpan, Toulous...","service de neurologie, hôpital Purpan, Toulous...",France,False,France,FRA,Europe
2,NCT05562414,"Klinik Valens, Valens rehabilitation clinic, V...","Klinik Valens, Valens rehabilitation clinic, V...",Switzerland,False,Switzerland,CHE,Europe
3,NCT05090033,"Novartis Investigative Site, Concord, New Sout...","Novartis Investigative Site, Concord, New Sout...",Australia,False,Australia,AUS,Oceania
4,NCT05090033,"Novartis Investigative Site, Concord, New Sout...","Novartis Investigative Site, Southport, Queens...",Australia,False,Australia,AUS,Oceania
...,...,...,...,...,...,...,...,...
33794,NCT01191996,"Primorus Clinical Trials, 40 Stewart Street, C...","Primorus Clinical Trials, 40 Stewart Street, C...",New Zealand,False,New Zealand,NZL,Oceania
33795,NCT03826095,"Hacettepe University, Ankara, Turkey","Hacettepe University, Ankara, Turkey",Turkey,False,Turkey,TUR,Asia
33796,NCT05341895,"Furkan Bilek, Elazığ, 23100, Turkey|Fırat univ...","Furkan Bilek, Elazığ, 23100, Turkey",Turkey,False,Turkey,TUR,Asia
33797,NCT05341895,"Furkan Bilek, Elazığ, 23100, Turkey|Fırat univ...","Fırat university, Elazığ, 23100, Turkey",Turkey,False,Turkey,TUR,Asia


Is the country missing for any entry with a valid location?

In [107]:
locations[(locations["country_ISO"].isna()) & (~locations["location"].isna())]

Unnamed: 0,nct_number,locations,location,country,many_locations_flag,country_norm,country_ISO,country_continent


All locations properly matched.
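
The same check can also be expressed with the `indicator` and `validate` options of `pd.merge`. The sketch below is an alternative on toy data, not the method used in this notebook; the mapping rows and the unmatched name 'Atlantis' are made up for illustration.

```python
import pandas as pd

# Toy data (assumption): 'Atlantis' is deliberately missing from the mapping.
toy_locations = pd.DataFrame({"country": ["Turkey", "France", "Atlantis"]})
toy_mapping = pd.DataFrame(
    {
        "country": ["Turkey", "France"],
        "country_norm": ["Turkey", "France"],
        "country_ISO": ["TUR", "FRA"],
        "country_continent": ["Asia", "Europe"],
    }
)

merged = pd.merge(
    left=toy_locations,
    right=toy_mapping,
    on="country",
    how="left",
    validate="many_to_one",  # raises if a country occurs twice in the mapping
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Countries that found no partner in the mapping.
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique().tolist()
print(unmatched)
```

`validate="many_to_one"` additionally guards against duplicate country names in the mapping, which would silently multiply rows in a left merge.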

### Compute the number of sites per country
Instead of listing all sites for the same country, we just keep the number of sites.

In [108]:
sites_per_trial_per_country = (
    locations.drop(columns=["location"])
    .groupby(
        [
            "nct_number",
            "locations",
            "many_locations_flag",
            "country_norm",
            "country_ISO",
            "country_continent",
        ],
        as_index=False,
    )
    .size()
    .rename(columns={"size": "n_sites"})
)
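
The counting step can be sketched on toy data as follows. The trial IDs are placeholders, and only two grouping columns are used to keep the example short. One detail worth knowing: by default, `groupby` drops groups whose key contains a missing value, so a trial without any location does not appear in the result.

```python
import pandas as pd

# Toy data (assumption): NCT_A has three sites in two countries,
# NCT_B has no location information.
toy_sites = pd.DataFrame(
    {
        "nct_number": ["NCT_A", "NCT_A", "NCT_A", "NCT_B"],
        "country_norm": ["United States", "United States", "Canada", None],
    }
)

site_counts = (
    toy_sites.groupby(["nct_number", "country_norm"], as_index=False)
    .size()  # one row per group, with the group size in a column named "size"
    .rename(columns={"size": "n_sites"})
)

print(site_counts)
```

After a left merge back onto the trial table, such trials would carry missing values in the location columns, which would also turn `n_sites` into a float column.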

In [109]:
sites_per_trial_per_country

Unnamed: 0,nct_number,locations,many_locations_flag,country_norm,country_ISO,country_continent,n_sites
0,NCT00000146,"University of Arkansas, Little Rock, Arkansas,...",False,United States,USA,North America,15
1,NCT00000147,"University of Arkansas, Little Rock, Arkansas,...",False,United States,USA,North America,15
2,NCT00001156,"National Institutes of Health Clinical Center,...",False,United States,USA,North America,1
3,NCT00001248,"National Institutes of Health Clinical Center,...",False,United States,USA,North America,1
4,NCT00001465,"National Institutes of Health Clinical Center,...",False,United States,USA,North America,1
...,...,...,...,...,...,...,...
6449,NCT05840653,"SAMSUN, Samsun, 55000, Turkey",False,Turkey,TUR,Asia,1
6450,NCT05844826,"University of Calgary, Calgary, Alberta, T2N 1...",False,Canada,CAN,North America,1
6451,NCT05849467,"National Institutes of Health Clinical Center,...",False,United States,USA,North America,1
6452,NCT05853835,"Triumpharma, Amman, 11941, Jordan",False,Jordan,JOR,Asia,1


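**Note**: For trials reporting 'Many Locations', each country appears as a single entry, so `n_sites` is 1 per country and does not reflect the actual number of sites.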

In [110]:
sites_per_trial_per_country[sites_per_trial_per_country["nct_number"] == "NCT00963833"]

Unnamed: 0,nct_number,locations,many_locations_flag,country_norm,country_ISO,country_continent,n_sites
1464,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...",True,Austria,AUT,Europe,1
1465,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...",True,Belgium,BEL,Europe,1
1466,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...",True,Finland,FIN,Europe,1
1467,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...",True,Germany,DEU,Europe,1
1468,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...",True,Israel,ISR,Asia,1
1469,NCT00963833,"Many Locations, Austria|Many Locations, Belgiu...",True,United Kingdom,GBR,Europe,1


### Merge with base dataframe

In [111]:
ms_trials = pd.merge(
    left=ms_trials, right=sites_per_trial_per_country, on="nct_number", how="left"
)

In [112]:
ms_trials.head()

Unnamed: 0,nct_number,study_title,study_url,study_status,conditions,condition_category,interventions,intervention_type,sponsor,collaborators,...,completion_date,start_year,primary_completion_year,completion_year,locations,many_locations_flag,country_norm,country_ISO,country_continent,n_sites
0,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,...,2021-07-30,2020,2021,2021,"Advanced Neurosciences Institute, Franklin, Te...",False,United States,USA,North America,1.0
1,NCT04447937,Immunodeficiency in MS,https://beta.clinicaltrials.gov/study/NCT04447937,UNKNOWN,Multiple Sclerosis|Hypogammaglobulinemia|Immun...,MS,OTHER: No Interventions,OTHER,Advanced Neurosciences Institute,Novel Pharmaceutics Institute,...,2021-07-30,2020,2021,2021,"Advanced Neurosciences Institute, Franklin, Te...",False,United States,USA,North America,1.0
2,NCT00942214,Biomarkers and Response to Natalizumab for Mul...,https://beta.clinicaltrials.gov/study/NCT00942214,COMPLETED,Multiple Sclerosis,MS,DRUG: Natalizumab,DRUG,"University Hospital, Toulouse",,...,2011-03,2009,2011,2011,"service de neurologie, hôpital Purpan, Toulous...",False,France,FRA,Europe,1.0
3,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,...,2023-04-01,2022,2023,2023,"Klinik Valens, Valens rehabilitation clinic, V...",False,Switzerland,CHE,Europe,1.0
4,NCT05562414,Transient and Immediate Motor Effects of Exerc...,https://beta.clinicaltrials.gov/study/NCT05562414,RECRUITING,"Multiple Sclerosis, Chronic Progressive|High-I...",PMS,BEHAVIORAL: HIIT|BEHAVIORAL: MCT,BEHAVIORAL,Klinik Valens,,...,2023-04-01,2022,2023,2023,"Klinik Valens, Valens rehabilitation clinic, V...",False,Switzerland,CHE,Europe,1.0


## Export

In [113]:
len(ms_trials), len(ms_trials.drop_duplicates()), len(ms_trials["nct_number"].drop_duplicates())

(11705, 11705, 2711)
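
The three numbers are the total number of rows, the number of distinct rows, and the number of distinct trials. Equal first and second values mean that no row is duplicated in full. A minimal sketch of the same check on a toy long-form frame (placeholder values) could look like this:

```python
import pandas as pd

# Toy long-form data (assumption): two trials, one of them in two countries.
toy_long = pd.DataFrame(
    {
        "nct_number": ["NCT_A", "NCT_A", "NCT_B"],
        "country_ISO": ["USA", "CAN", "FRA"],
    }
)

n_rows = len(toy_long)
n_unique_rows = len(toy_long.drop_duplicates())
n_trials = toy_long["nct_number"].nunique()

# Fail early if the long-form table contains fully duplicated rows.
assert n_rows == n_unique_rows, "duplicated rows found"

print(n_rows, n_unique_rows, n_trials)
```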

In [None]:
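# Safety switch: stop "Run All" here so that the export below only runs when triggered manually.
raise RuntimeError("Safety switch: run the export cell manually.")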

In [None]:
with pd.ExcelWriter("data/intermediate/ms_trials_long.xlsx") as writer:
    ms_trials.to_excel(writer, index=False)