# Predicting Clinical Trial Terminations

**Author: Clement Chan**

---
Notes on this notebook: (Add cool description here!)

- The `ctg.csv` dataset was acquired from clinicaltrials.gov on Feb 16th, 2024.

## Table of Contents
---
1. [Data Wrangling](#wrangle)
2. [Exploratory Data Analysis (EDA)](#EDA)
3.


### Data Dictionary (Rough outline)

| Column | Description                                  |Data Type|
|-------|--------------------------------------------|-------|
| NCT Number | Unique ID                            | object |
| Study Title | Title of the Clinical Trial           | object |
| Study URL | URL link to the study on clinicaltrials.gov  | object |
| Acronym | An abbreviation used to identify the clinical study | object|
| Study Status | Categorical column displaying the current position of the study | object |
| Brief Summary | Short description of the clinical study (Includes study hypothesis) | object |
| Study Results | Whether results have been posted for the study (YES or NO) | object|
| Conditions | Primary Disease or Condition being studied     | object |
| Interventions | Interventions being studied, each prefixed by its type (e.g. BEHAVIORAL, DRUG, OTHER) | object |
| Primary Outcome Measures | Description of specific primary outcome | object |
| Secondary Outcome Measures | Description of specific secondary outcome | object |
| Other Outcome Measures | Any other measures used to evaluate the interventions | object |
| Sponsor | The corporation or agency that initiates the study | object |
| Collaborators | Other organizations that provide support | object |




**Importing Libraries**

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

<a id='wrangle'></a>
## Data Wrangling

---

In [10]:
# Reading the clinical trials dataset
df = pd.read_csv('ctg.csv')

# First 5 rows of dataset
df.head()

Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT03630471,Effectiveness of a Problem-solving Interventio...,https://clinicaltrials.gov/study/NCT03630471,PRIDE,COMPLETED,We will conduct a two-arm individually randomi...,NO,"Mental Health Issue (E.G., Depression, Psychos...",BEHAVIORAL: PRIDE 'Step 1' problem-solving int...,"Mental health symptoms, The Strengths and Diff...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,SANPRIDE_002,2018-08-20,2019-01-20,2019-02-28,2018-08-14,,2019-05-21,"Sangath, New Delhi, Delhi, 110016, India","Statistical Analysis Plan, https://storage.goo..."
1,NCT05992571,Oral Ketone Monoester Supplementation and Rest...,https://clinicaltrials.gov/study/NCT05992571,,RECRUITING,People who report subjective memory complaints...,NO,Cerebrovascular Function|Cognition,OTHER: Placebo|DIETARY_SUPPLEMENT: β-OHB,"Brain network connectivity, Functional connect...",...,Allocation: RANDOMIZED|Intervention Model: CRO...,rs-KME,2023-10-25,2024-08,2024-08,2023-08-15,,2023-12-01,"McMaster University, Hamilton, Ontario, L8S 4K...",
2,NCT01854671,Investigating the Effect of a Prenatal Family ...,https://clinicaltrials.gov/study/NCT01854671,,COMPLETED,The purpose of this study is to measure the di...,NO,Focus: Contraceptive Counseling|Focus: Postpar...,OTHER: family planning counseling let by commu...,"Self-reported contraceptive use, 6 months post...",...,Allocation: NON_RANDOMIZED|Intervention Model:...,SFPRF13-10,2013-08,2014-12,2014-12,2013-05-15,,2015-08-17,Palestinian Ministry of Health Maternal Child ...,
3,NCT03869671,Pre-exposure Prophylaxis (PrEP) for People Who...,https://clinicaltrials.gov/study/NCT03869671,,WITHDRAWN,People who inject drugs (PWID) experience high...,NO,Intravenous Drug Abuse,BEHAVIORAL: PrEP uptake/adherence intervention...,"PrEP uptake by self-report, measured using 1 i...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,H-34960|K01DA043412-01A1,2021-03,2022-03,2022-03,2019-03-11,,2021-03-10,,
4,NCT02945371,Tailored Inhibitory Control Training to Revers...,https://clinicaltrials.gov/study/NCT02945371,REV,COMPLETED,Insufficient inhibitory control is one pathway...,NO,Smoking|Alcohol Drinking|Prescription Drug Abu...,BEHAVIORAL: Person-centered inhibitory control...,"Inhibitory control performance, Task 1, Perfor...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,EPCS20613,2014-09,2016-04,2016-05,2016-10-26,,2016-10-26,"University of Oregon, Social and Affective Neu...",


In [11]:
# Let's look at the shape of the dataset
f'There are {df.shape[0]} rows and {df.shape[1]} columns.'

'There are 483238 rows and 30 columns.'

In [12]:
# Checking for duplicated data in rows
f'There are {df.duplicated().sum()} duplicated rows.'

'There are 0 duplicated rows.'

In [13]:
# Reading the dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 483238 entries, 0 to 483237
Data columns (total 30 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   NCT Number                  483238 non-null  object 
 1   Study Title                 483238 non-null  object 
 2   Study URL                   483238 non-null  object 
 3   Acronym                     133174 non-null  object 
 4   Study Status                483238 non-null  object 
 5   Brief Summary               482350 non-null  object 
 6   Study Results               483238 non-null  object 
 7   Conditions                  482317 non-null  object 
 8   Interventions               435308 non-null  object 
 9   Primary Outcome Measures    465764 non-null  object 
 10  Secondary Outcome Measures  350835 non-null  object 
 11  Other Outcome Measures      38422 non-null   object 
 12  Sponsor                     483238 non-null  object 
 13  Collaborators 

**Important Notes to Consider:**
There are 483,238 total rows, but several columns have large amounts of missing data. Let's look into each column that has missing data, in order of column index.
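The column-by-column checks can also be summarized in one table. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper name, and the toy frame only stands in for `df` with made-up values.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, in column order
    return summary[summary["n_missing"] > 0]


# Toy frame standing in for `df` (values are made up for illustration)
toy = pd.DataFrame({
    "NCT Number": ["NCT0001", "NCT0002", "NCT0003", "NCT0004"],
    "Acronym": ["PRIDE", None, None, "REV"],
    "Conditions": ["Healthy", "Smoking", None, "Breast Cancer"],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

On the real data, `missing_summary(df)` should list the same counts that the individual cells report one at a time.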

Starting with `Acronym`:

In [25]:
# Finding the number of NaN (missing values)

f"There are {df['Acronym'].isna().sum()} missing values in the `Acronym` column."

'There are 350064 missing values in the `Acronym` column.'

In [27]:
# Let's see what we can do with these Acronyms...
df['Acronym'].value_counts()

Acronym
IMPACT       129
COVID-19     122
SMART        111
RCT           88
STAR          78
            ... 
boron_gel      1
CUMACA-M       1
PR11           1
NutriCim       1
AFOCUFF        1
Name: count, Length: 104724, dtype: int64

Most clinical trials did not name their study: 350,064 of 483,238 rows (about 72%) have no acronym. The column could still be useful for grouping related studies and comparing how often each group fails, but the 133,174 non-missing entries are spread over 104,724 distinct acronyms, so most groups would contain a single trial. Since most of the data is missing, this is a potential column to drop.
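Whether grouping by acronym is practical depends on how often acronyms repeat. The sketch below is one way to check this; `cardinality_report` is a hypothetical helper, and the toy series uses made-up rows rather than the real column.

```python
import pandas as pd


def cardinality_report(column: pd.Series) -> dict:
    """Summarize how concentrated the non-missing values of a column are."""
    counts = column.value_counts()  # NaN is excluded by default
    return {
        "n_present": int(counts.sum()),
        "n_unique": int(counts.size),
        # Share of distinct values that appear exactly once
        "singleton_share": float((counts == 1).mean()) if counts.size else 0.0,
    }


# Toy stand-in for df['Acronym'] (made-up rows)
toy_acronyms = pd.Series(["IMPACT", "IMPACT", "SMART", "PR11", None, None])
report = cardinality_report(toy_acronyms)
print(report)
```

A high `singleton_share` on the real column would suggest that acronym groups are too small to learn from.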

Next, let's look at the `Brief Summary` column.

In [31]:
# Finding the number of NaN (missing values)
f"There are {df['Brief Summary'].isna().sum()} missing values in the `Brief Summary` column."

'There are 888 missing values in the `Brief Summary` column.'

In [33]:
# diving into the values of `Brief Summary`
df['Brief Summary'].value_counts()

Brief Summary
To evaluate the Sun Protection Factor efficacy on human skin.

There are only **888** missing values in this column, so it is nearly complete. It will be useful to apply NLP keyword matching to the summaries, since flagging specific words may improve the predictive power of our model.
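Keyword matching could start as simple boolean flags per summary. This is only a sketch under assumptions: the keyword list and the `add_keyword_flags` helper are hypothetical, since the notebook has not yet chosen which words to match.

```python
import re

import pandas as pd

# Hypothetical keyword list; the notebook has not chosen one yet
KEYWORDS = ["randomized", "placebo", "pilot"]


def add_keyword_flags(frame: pd.DataFrame, text_col: str, keywords: list[str]) -> pd.DataFrame:
    """Add one boolean column per keyword, True where the text contains it."""
    out = frame.copy()
    text = out[text_col].fillna("")  # treat missing summaries as empty text
    for word in keywords:
        pattern = rf"\b{re.escape(word)}\b"  # whole-word match
        out[f"kw_{word}"] = text.str.contains(pattern, case=False, regex=True)
    return out


toy = pd.DataFrame({
    "Brief Summary": [
        "We will conduct a two-arm individually randomized trial.",
        "A pilot study comparing the supplement with placebo.",
        None,
    ]
})
flagged = add_keyword_flags(toy, "Brief Summary", KEYWORDS)
print(flagged.filter(like="kw_"))
```

Filling missing summaries with an empty string keeps the flags strictly boolean, which avoids a separate imputation step for the 888 missing rows.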

Next, let's dive into the `Conditions` column.

In [35]:
# Finding the number of NaN (missing values)
f"There are {df['Conditions'].isna().sum()} missing values in the `Conditions` column."

'There are 921 missing values in the `Conditions` column.'

In [36]:
# Looking into the values of `Conditions`
df['Conditions'].value_counts()

Conditions
Healthy          7996
Breast Cancer

There are **921** missing values in the `Conditions` column. A single entry can list several conditions separated by `|`. This column will be very useful for grouping trials by the primary condition or disease being studied.
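Because one entry can hold several conditions, counting individual conditions requires splitting first. A minimal sketch, assuming `|` only ever acts as a separator and that case differences are not meaningful; `count_individual_conditions` is a hypothetical helper.

```python
import pandas as pd


def count_individual_conditions(conditions: pd.Series) -> pd.Series:
    """Split pipe-delimited entries and count each condition separately."""
    return (
        conditions.dropna()
        .str.split("|", regex=False)
        .explode()
        .str.strip()
        .str.lower()  # so 'Healthy' and 'healthy' are counted together
        .value_counts()
    )


# Toy stand-in for df['Conditions'] (made-up rows)
toy_conditions = pd.Series([
    "Cerebrovascular Function|Cognition",
    "Healthy",
    "healthy|Cognition",
    None,
])
condition_counts = count_individual_conditions(toy_conditions)
print(condition_counts)
```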

Next, we will look at the `Interventions` column.

In [37]:
# Finding the number of NaN (missing values)
f"There are {df['Interventions'].isna().sum()} missing values in the `Interventions` column."

'There are 47930 missing values in the `Interventions` column.'

In [39]:
# Looking into the values of `Interventions`
df['Interventions'].value_counts()

Interventions
OTHER: No intervention                                                                          1083
OTHER: no intervention                                                                           410
OTHER: Questionnaire                                                                             318
OTHER: Exercise                                                                                  273
OTHER: No Intervention                                                                           249
                                                                                                ... 
DEVICE: Patient specific instrumentation (MRI)|DEVICE: Patient specific instrumentation (CT)       1
DRUG: Bone Morphogenetic Protein 2|PROCEDURE: Autologous bone graft                                1
OTHER: HaRTS-TRENDS|OTHER: Standard Care (SC)                                                      1
DIAGNOSTIC_TEST: Stroke simulation and machine learning                      

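There are **47,930** missing values in the `Interventions` column. Each entry lists one or more interventions separated by `|`, and each intervention is prefixed by its type (for example `OTHER`, `DRUG`, `DEVICE`). The same intervention also appears under different capitalizations (`No intervention`, `no intervention`, `No Intervention`), so the values will need normalizing before they can be grouped. How to use this column is still to be decided.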
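One possible way to make this column usable is to split each entry into a type and a normalized name. The sketch below assumes every item follows the `TYPE: name` pattern seen in the output above; `parse_interventions` is a hypothetical helper, and the toy rows are copied from that output.

```python
import pandas as pd


def parse_interventions(interventions: pd.Series) -> pd.DataFrame:
    """Return one row per intervention with its type and a normalized name."""
    items = interventions.dropna().str.split("|", regex=False).explode()
    # Partition on the first colon only; names may contain further colons
    parts = items.str.partition(":")
    return pd.DataFrame({
        "type": parts[0].str.strip(),
        "name": parts[2].str.strip().str.lower(),
    })


# Toy stand-in for df['Interventions']
toy_interventions = pd.Series([
    "OTHER: No intervention",
    "OTHER: no intervention",
    "DRUG: Bone Morphogenetic Protein 2|PROCEDURE: Autologous bone graft",
    None,
])
parsed = parse_interventions(toy_interventions)
print(parsed["type"].value_counts())
print(parsed["name"].value_counts())
```

Lowercasing the name merges the differently capitalized variants of "no intervention" into one value. The exploded rows keep the original row index, so they can be joined back to `df` if needed.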

Next is the `Primary Outcome Measures` column:

In [41]:
# Finding the number of NaN (missing values)
f"There are {df['Primary Outcome Measures'].isna().sum()} missing values in the `Primary Outcome Measures` column."

'There are 17474 missing values in the `Primary Outcome Measures` column.'

In [44]:
# Looking into the column
df['Primary Outcome Measures'].value_counts()

Primary Outcome Measures
Bioequivalence, within 30 days                                                                                                                                                                                                                   118
Bioequivalence                                                                                                                                                                                                                                    68
Overall survival                                                                                                                                                                                                                                  49
Minimal Erythema Dose (MED), Up to 15 minutes|Minimal Persistent Pigment Darkening Dose (MPPD), Up to 15 minutes                                                                                                                                

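There are **17,474** missing values in the `Primary Outcome Measures` column. Entries are free text, typically a measure name followed by a time frame, with multiple measures separated by `|`. How to use this column is still to be decided.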

Next is the `Secondary Outcome Measures` column:

In [43]:
# Finding the number of NaN (missing values)
f"There are {df['Secondary Outcome Measures'].isna().sum()} missing values in the `Secondary Outcome Measures` column."

'There are 132403 missing values in the `Secondary Outcome Measures` column.'

In [45]:
# Looking into the column
df['Secondary Outcome Measures'].value_counts()

There are **132,403** missing values in the `Secondary Outcome Measures` column, about 27% of all rows. How to use this column is still to be decided.
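Until the free text itself is analyzed, one simple feature from the outcome columns is the number of measures each trial declares. This is a sketch under the assumption that `|` never appears inside a measure description; `count_measures` is a hypothetical helper, and the toy rows reuse values from the `Primary Outcome Measures` output.

```python
import pandas as pd


def count_measures(measures: pd.Series) -> pd.Series:
    """Count pipe-delimited outcome measures per trial (0 when missing)."""
    counts = measures.str.count(r"\|") + 1  # n separators + 1 = n measures
    return counts.fillna(0).astype(int)


# Toy stand-in for an outcome measures column
toy_measures = pd.Series([
    "Bioequivalence, within 30 days",
    "Minimal Erythema Dose (MED), Up to 15 minutes|Minimal Persistent Pigment Darkening Dose (MPPD), Up to 15 minutes",
    None,
])
n_measures = count_measures(toy_measures)
print(n_measures)
```

The same function could be applied to the primary, secondary, and other outcome columns, and it turns missing entries into an explicit count of zero.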

Next is the `Other Outcome Measures` column:

In [48]:
# Finding the number of NaN (missing values)
f"There are {df['Other Outcome Measures'].isna().sum()} missing values in the `Other Outcome Measures` column."

'There are 444816 missing values in the `Other Outcome Measures` column.'

In [49]:
# Looking into the column
df['Other Outcome Measures'].value_counts()

Other Outcome Measures
The dose of CSC vaccine, up to 3 months

There are **444,816** missing values in the `Other Outcome Measures` column, about 92% of all rows. Because most of the data in this column is missing, it is a potential column to drop.
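The drop decision could be made consistently across columns with a missing-fraction cutoff. The following is a sketch only: the helper name and the 0.7 cutoff are illustrative assumptions, not values chosen in this notebook.

```python
import pandas as pd


def columns_above_missing_threshold(frame: pd.DataFrame, threshold: float) -> list[str]:
    """List columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = frame.isna().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


# Toy frame standing in for `df` (made-up rows)
toy = pd.DataFrame({
    "Study Status": ["COMPLETED", "RECRUITING", "WITHDRAWN", "COMPLETED"],
    "Acronym": ["PRIDE", None, None, None],
    "Other Outcome Measures": [None, None, None, None],
})
# 0.7 is an illustrative cutoff, not a value chosen in this notebook
to_drop = columns_above_missing_threshold(toy, threshold=0.7)
trimmed = toy.drop(columns=to_drop)
print(to_drop)
```

Listing the candidate columns before dropping them leaves room to keep one deliberately, for example by converting it to a "has value" indicator first.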

Next is the `Collaborators` column:

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 483238 entries, 0 to 483237
Data columns (total 30 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   NCT Number                  483238 non-null  object 
 1   Study Title                 483238 non-null  object 
 2   Study URL                   483238 non-null  object 
 3   Acronym                     133174 non-null  object 
 4   Study Status                483238 non-null  object 
 5   Brief Summary               482350 non-null  object 
 6   Study Results               483238 non-null  object 
 7   Conditions                  482317 non-null  object 
 8   Interventions               435308 non-null  object 
 9   Primary Outcome Measures    465764 non-null  object 
 10  Secondary Outcome Measures  350835 non-null  object 
 11  Other Outcome Measures      38422 non-null   object 
 12  Sponsor                     483238 non-null  object 
 13  Collaborators 