# This notebook takes a look at positive and negative label predictions from [CTO](https://chufangao.github.io/CTOD/) [1].

Because these labels are aggregated from many weakly supervised labelling functions, they may not align perfectly with the actual trial outcomes.

[1] Gao, C., Pradeepkumar, J., Das, T., Thati, S., & Sun, J. (2024). Automatically Labeling Clinical Trial Outcomes: A Large-Scale Benchmark for Drug Development. arXiv preprint arXiv:2406.10292.
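
To make "aggregated from many labelling functions" concrete, here is a minimal sketch of the simplest possible aggregator: the fraction of non-abstaining functions that vote positive. This is only an illustration and not necessarily how CTO combines its labelling functions. The name `positive_vote_fraction` and the convention that `-1` means "abstain" are assumptions made for this example.

```python
import numpy as np

ABSTAIN = -1  # assumed convention: the labelling function casts no vote


def positive_vote_fraction(votes):
    """Aggregate an (n_trials, n_functions) matrix of votes in {-1, 0, 1}.

    Returns, per trial, the fraction of non-abstaining functions that voted 1,
    or NaN when every function abstained.
    """
    votes = np.asarray(votes)
    voted = votes != ABSTAIN
    n_voted = voted.sum(axis=1)
    n_positive = (votes == 1).sum(axis=1)
    # np.maximum avoids a division by zero for rows where nobody voted
    return np.where(n_voted > 0, n_positive / np.maximum(n_voted, 1), np.nan)


# Toy votes: three trials, four hypothetical labelling functions
toy_votes = np.array([
    [1, 1, -1, 0],
    [0, -1, -1, 0],
    [-1, -1, -1, -1],
])
scores = positive_vote_fraction(toy_votes)
```

A trial with conflicting votes ends up with an intermediate score, which is one reason an aggregated label can disagree with the true outcome.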

In [1]:
# ================ First, let us get started by cloning everything ================
!git clone https://github.com/chufangao/CTOD.git
!git clone https://github.com/futianfan/clinical-trial-outcome-prediction.git
!wget https://huggingface.co/datasets/chufangao/CTO/resolve/main/CTTI.zip
CTTI_PATH = './CTTI.zip'

fatal: destination path 'CTOD' already exists and is not an empty directory.
fatal: destination path 'clinical-trial-outcome-prediction' already exists and is not an empty directory.
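
The `fatal: destination path ... already exists` messages appear when the cell is run a second time. One way to make the setup safe to re-run is to skip any step whose target already exists. The sketch below reuses the URLs from the cell above. The helper names `pending_steps` and `fetch_missing` are hypothetical, and it assumes `git` and `wget` are available on the path.

```python
import os
import subprocess

# (command, local path that the command creates); URLs are those used above
FETCH_STEPS = [
    (["git", "clone", "https://github.com/chufangao/CTOD.git"], "CTOD"),
    (["git", "clone", "https://github.com/futianfan/clinical-trial-outcome-prediction.git"],
     "clinical-trial-outcome-prediction"),
    (["wget", "https://huggingface.co/datasets/chufangao/CTO/resolve/main/CTTI.zip"],
     "CTTI.zip"),
]


def pending_steps(steps, exists=os.path.exists):
    """Return only the commands whose target path is not present yet."""
    return [cmd for cmd, target in steps if not exists(target)]


def fetch_missing(steps=FETCH_STEPS):
    """Run every pending command; raises CalledProcessError if one fails."""
    for cmd in pending_steps(steps):
        subprocess.run(cmd, check=True)
```

Calling `fetch_missing()` once performs all three downloads; calling it again does nothing if the targets are still in place.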

In [2]:
# # if you want to use the latest version of clinical trials instead, uncomment and run this cell
# !pip install selenium
# !python ./CTOD/download_ctti.py
# CTTI_PATH = './downloads/CTTI_new.zip'

In [3]:
# ================ loading phase 3 CTO label ================
import glob
import os
import pandas as pd
import numpy as np
import zipfile

with zipfile.ZipFile(CTTI_PATH, 'r') as zip_ref:
    names = zip_ref.namelist()
    all_studies = pd.read_csv(zip_ref.open([name for name in names if name.split("/")[-1]=='studies.txt'][0]), sep='|')
    outcome_analyses = pd.read_csv(zip_ref.open([name for name in names if name.split("/")[-1]=='outcome_analyses.txt'][0]), sep='|')
    outcomes = pd.read_csv(zip_ref.open([name for name in names if name.split("/")[-1]=='outcomes.txt'][0]), sep='|')

CTO_phase3_preds = pd.read_csv("https://huggingface.co/datasets/chufangao/CTO/raw/main/phase3_CTO_rf.csv")
all_studies['completion_date'] = pd.to_datetime(all_studies['completion_date'])

# merge completion dates for later filtering
CTO_phase3_preds = pd.merge(CTO_phase3_preds,all_studies[['nct_id','completion_date']], on='nct_id', how='left')

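
The three `pd.read_csv` calls in this cell repeat the same archive lookup. As a sketch, the lookup could be factored into a small helper; `read_ctti_table` is a hypothetical name, and the in-memory archive below only stands in for `CTTI.zip` so that the example runs on its own.

```python
import io
import zipfile

import pandas as pd


def read_ctti_table(zip_ref, filename, **read_csv_kwargs):
    """Read one pipe-delimited table from the archive, wherever it is nested."""
    matches = [name for name in zip_ref.namelist() if name.split("/")[-1] == filename]
    if not matches:
        raise FileNotFoundError(f"{filename} not found in archive")
    with zip_ref.open(matches[0]) as handle:
        return pd.read_csv(handle, sep="|", **read_csv_kwargs)


# Tiny in-memory archive standing in for CTTI.zip (toy content, not real data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("nested/studies.txt", "nct_id|completion_date\nNCT00000001|2023-05-01\n")

with zipfile.ZipFile(buffer, "r") as zip_ref:
    demo_studies = read_ctti_table(zip_ref, "studies.txt", low_memory=False)
```

Extra keyword arguments are passed straight to `pd.read_csv`, so options such as `low_memory=False` or `usecols` can be set per table.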


In [4]:
# obtain the subset of trials that were completed between 2023-01-01 and 2024-01-01
recent_subset = CTO_phase3_preds[(CTO_phase3_preds['completion_date'] >= pd.to_datetime('2023-01-01')) & (CTO_phase3_preds['completion_date'] <= pd.to_datetime('2024-01-01'))]
# reassign rather than drop in place: inplace=True on a filtered slice raises SettingWithCopyWarning
recent_subset = recent_subset.drop_duplicates('nct_id')

# show highest probability predictions
recent_subset.sort_values(['pred_proba', 'completion_date'], ascending=[False, False]).head()


Unnamed: 0,nct_id,hint_train,hint_train2,hint_train3,status,status2,gpt,gpt2,linkage,linkage2,...,sites,serious_ae,patient_drop,num_patients,death_ae,amendments,all_ae,pred,pred_proba,completion_date
5972,NCT00085202,-1.0,-1.0,-1.0,-1,-1,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,-1.0,0.0,1,1.0,2023-12-31
12958,NCT05259917,-1.0,-1.0,-1.0,-1,-1,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,-1.0,0.0,1,1.0,2023-12-31
23880,NCT04109066,-1.0,-1.0,-1.0,-1,-1,0.0,0.0,-1.0,-1.0,...,1.0,0.0,0.0,1.0,0.0,-1.0,0.0,1,1.0,2023-12-27
1643,NCT01566695,-1.0,-1.0,-1.0,-1,-1,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,-1.0,0.0,1,1.0,2023-12-21
29450,NCT03164928,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,-1.0,0.0,1,1.0,2023-12-20
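
The date filter, de-duplication, and sort can be checked on a small toy frame. The sketch below uses `Series.between`, which is inclusive on both ends by default, and reassigns the de-duplicated result, which is one way to avoid pandas' `SettingWithCopyWarning`. All values are made up for illustration.

```python
import pandas as pd

# Toy predictions: a duplicated trial, an out-of-window trial, and a missing date
toy_preds = pd.DataFrame({
    "nct_id": ["NCT1", "NCT1", "NCT2", "NCT3", "NCT4"],
    "pred_proba": [0.9, 0.9, 0.2, 0.7, 0.5],
    "completion_date": pd.to_datetime(
        ["2023-06-01", "2023-06-01", "2022-12-31", "2024-01-01", None]
    ),
})

# Rows with a missing completion date compare as False and are dropped
in_window = toy_preds["completion_date"].between(
    pd.Timestamp("2023-01-01"), pd.Timestamp("2024-01-01")
)
toy_recent = toy_preds.loc[in_window].drop_duplicates("nct_id")

# Highest probability first, ties broken by the most recent completion date
top = toy_recent.sort_values(["pred_proba", "completion_date"], ascending=[False, False])
```

Note that a trial completed exactly on 2024-01-01 is kept, matching the `<=` comparison in the cell above.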


Let us investigate these predicted successful trials one by one.

* **NCT00085202**: [Relevant literature](https://pubmed.ncbi.nlm.nih.gov/33405951/) indicates that this phase III risk-adapted trial successfully established a new risk stratification for future medulloblastoma trials.
* **NCT05259917**: [Relevant literature](https://pubmed.ncbi.nlm.nih.gov/38819658/) shows that oral sebetralstat (the proposed drug intervention) provided faster times to the beginning of symptom relief, to a reduction in attack severity, and to complete resolution of angioedema attacks than placebo.
* **NCT04109066**: This [Nature Medicine paper](https://www.nature.com/articles/s41591-024-03414-8#Abs1) reports that the rate of pathological complete response (pCR), the primary endpoint, was significantly higher in the nivolumab arm (the proposed drug) than in the placebo arm for patients with early estrogen receptor-positive breast cancer.
* **NCT01566695**: This [paper](https://pubmed.ncbi.nlm.nih.gov/33764805/) indicates that CC-486 (the proposed drug) significantly improved the red blood cell transfusion independence (RBC-TI) rate and induced durable bilineage improvements in patients with lower-risk myelodysplastic syndromes.
* **NCT03164928**: No news or publications were found. However, given the non-significant p-values of the [study outcome](https://clinicaltrials.gov/study/NCT03164928?tab=results), this is likely an incorrect prediction.
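
The p-value check for the last trial can also be done from the `outcome_analyses` table loaded earlier instead of the website. The sketch below is an illustration under assumptions: it assumes the table has `nct_id` and `p_value` columns, uses a conventional 0.05 threshold, and runs on made-up rows. `reported_p_values` is a hypothetical helper name.

```python
import pandas as pd


def reported_p_values(outcome_analyses, nct_id, alpha=0.05):
    """Return the numeric p-values reported for one trial, flagged against alpha."""
    rows = outcome_analyses.loc[
        outcome_analyses["nct_id"] == nct_id, ["nct_id", "p_value"]
    ].copy()
    # Non-numeric or missing entries become NaN and are dropped
    rows["p_value"] = pd.to_numeric(rows["p_value"], errors="coerce")
    rows = rows.dropna(subset=["p_value"])
    rows["significant"] = rows["p_value"] < alpha
    return rows.reset_index(drop=True)


# Toy rows (not real trial results)
toy_analyses = pd.DataFrame({
    "nct_id": ["NCT_A", "NCT_A", "NCT_B", "NCT_A"],
    "p_value": [0.31, 0.04, 0.001, None],
})
result = reported_p_values(toy_analyses, "NCT_A")
```

With the real table this would be called as `reported_p_values(outcome_analyses, 'NCT03164928')`. A trial may report several analyses, so the flags still need to be read against the primary outcome.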

In [6]:
# let us look at lowest probabilities
recent_subset.sort_values(['pred_proba', 'completion_date'], ascending=[True, False]).head()

Unnamed: 0,nct_id,hint_train,hint_train2,hint_train3,status,status2,gpt,gpt2,linkage,linkage2,...,sites,serious_ae,patient_drop,num_patients,death_ae,amendments,all_ae,pred,pred_proba,completion_date
7740,NCT03338764,-1.0,-1.0,-1.0,0,0,-1.0,-1.0,0.0,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0,0.0,2023-12-31
17273,NCT04001010,-1.0,-1.0,-1.0,0,0,-1.0,-1.0,0.0,0.0,...,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0,0.0,2023-12-31
18641,NCT04088591,-1.0,-1.0,-1.0,0,0,-1.0,-1.0,1.0,1.0,...,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0,0.0,2023-12-31
19531,NCT03866980,-1.0,-1.0,-1.0,0,0,0.0,0.0,0.0,0.0,...,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0,0.0,2023-12-31
20471,NCT04244877,-1.0,-1.0,-1.0,0,0,1.0,1.0,0.0,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0,0.0,2023-12-31


Let us investigate these predicted unsuccessful trials one by one.

* **NCT03338764**: This [study](https://clinicaltrials.gov/study/NCT03338764) appears to have been withdrawn due to a lack of enrolled participants.
* **NCT04001010**: This [study](https://clinicaltrials.gov/study/NCT04001010) is suspended and has been postponed.
* **NCT04088591**: This [study](https://clinicaltrials.gov/study/NCT04088591) was also withdrawn, because recent studies suggest an increase in mortality due to high-dose Vitamin C, which would directly affect this study's investigation of high-dose intravenous Vitamin C as an adjunctive treatment for sepsis.
* **NCT03866980**: This [study](https://clinicaltrials.gov/study/NCT03866980) was terminated due to a "corporate strategy adjustment".
* **NCT04244877**: Like NCT03338764, this [study](https://clinicaltrials.gov/study/NCT04244877) appears to have been withdrawn due to a lack of enrolled participants.
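
The statuses above were read from ClinicalTrials.gov by hand. As a rough sketch, the same information could be pulled from the `all_studies` table, assuming it carries `overall_status` and `why_stopped` columns. The helper name `stop_reasons` is hypothetical, and the rows below are toy values rather than real records.

```python
import pandas as pd


def stop_reasons(studies, nct_ids):
    """Return status and stop reason for the given trials, in the order requested."""
    cols = ["nct_id", "overall_status", "why_stopped"]
    subset = studies.loc[studies["nct_id"].isin(nct_ids), cols]
    # reindex keeps the requested order and leaves NaN for IDs that are absent
    return subset.drop_duplicates("nct_id").set_index("nct_id").reindex(nct_ids)


# Toy studies table (made-up values)
toy_studies = pd.DataFrame({
    "nct_id": ["NCT_X", "NCT_Y"],
    "overall_status": ["Withdrawn", "Completed"],
    "why_stopped": ["No participants enrolled", None],
})
status_table = stop_reasons(toy_studies, ["NCT_Y", "NCT_X", "NCT_Z"])
```

With the real data this would be called as `stop_reasons(all_studies, recent_subset.sort_values('pred_proba')['nct_id'].head().tolist())`.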

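In summary, the predictions are not perfect, but they agree with the available evidence in most of the cases examined here: four of the five highest-probability trials have supporting publications, and all five lowest-probability trials were withdrawn, suspended, or terminated.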