# Predicting Clinical Trial Terminations
### Notebook 2: Preprocessing & EDA

**Author: Clement Chan**

---
Notes on the notebook:
- The previous notebook covered the raw dataset and the initial cleaning steps. It also included preliminary EDA: we examined the distributions of key columns and considered which features to include in the model.
- In this notebook, we complete a second, more in-depth round of cleaning and exploratory data analysis. The goal is to build a baseline model for statistical analysis and to prepare for feature engineering and modelling.

### Data Dictionary (based on clinicaltrials.gov)

---

| Column | Description | Data Type |
|--------|-------------|-----------|
| Study Title | Title of the clinical trial | object |
| Study Status | Current status of the study (categorical) | object (could classify into dummies) |
| Brief Summary | Short description of the clinical study (includes the study hypothesis) | object |
| Study Results | Whether results are posted (yes or no) | object (turn to binary?) |
| Conditions | Primary disease or condition being studied | object |
| Interventions | Methods used in the trial to reach the outcome | object |
| Primary Outcome Measures | Description of the specific primary outcome | object |
| Secondary Outcome Measures | Description of the specific secondary outcome | object |
| Sponsor | Corporation or agency that initiates the study | object |
| Collaborators | Other organizations that provide support | object |
| Sex | All: no eligibility limit based on sex; Male: only male participants; Female: only female participants | object (could change into binary or dummies) |
| Age | Age group of the participants (value names may need clarifying or renaming) | object (could classify into dummies) |
| Phases | Clinical trial phase of the study (drug trials); needs further classification | object (dummies?) |
| Enrollment | Total number of participants in the study (estimated or actual; to be confirmed) | float -> int |
| Funder Type | Category of the funding source (to be confirmed) | object (could classify into dummies) |
| Study Type | Interventional, Observational, or Expanded Access | object (definitely dummies) |
| Study Design | Study design based on study type | object |
| Start Date | Estimated/actual date the first participant was enrolled | object -> date |
| Primary Completion Date | Date the final participant was examined for the primary outcome | object -> date |
| Completion Date | Date the final participant was examined | object -> date |
| First Posted | Date the clinical trial was first posted to the public | object -> date |
| Last Update Posted | Date the most recent update to the trial record was posted | object -> date |
| Locations | Where the clinical study will be held | object |
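
The data-type notes in the table above (binary flag, `float -> int`, `object -> date`, dummies) can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual preprocessing: the `sample` frame and its values are assumptions, and only the column names come from the data dictionary. Dates are parsed value by value because entries in this dataset appear both with and without a day component.

```python
import pandas as pd

# Toy rows mimicking a few columns from the data dictionary (values are illustrative).
sample = pd.DataFrame({
    "Study Results": ["NO", "YES", "NO"],
    "Enrollment": [250.0, 30.0, 0.0],
    "Start Date": ["2018-08-20", "2013-08", "unknown"],
    "Study Type": ["INTERVENTIONAL", "OBSERVATIONAL", "INTERVENTIONAL"],
})

# object -> binary: 1 when results are posted, 0 otherwise.
sample["Study Results"] = (sample["Study Results"].str.upper() == "YES").astype(int)

# float -> int: the nullable Int64 dtype also tolerates missing values.
sample["Enrollment"] = sample["Enrollment"].astype("Int64")

# object -> date: parse each value on its own so 'YYYY-MM' and 'YYYY-MM-DD'
# can coexist; anything unparseable becomes NaT.
sample["Start Date"] = sample["Start Date"].apply(
    lambda value: pd.to_datetime(value, errors="coerce")
)

# object -> dummies: one indicator column per category.
sample = pd.get_dummies(sample, columns=["Study Type"], dtype=int)

print(sample)
```

Parsing each value separately is slower than a vectorized call on a frame of this size, so it is worth treating as a starting point rather than the final approach.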


**Importing Libraries**

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

<a id = 'toc'></a>
## Table of Contents

---
1. [Data Preprocessing](#prep)


**Load the dataset**

In [9]:
# Drop the leftover unnamed index column when loading
df = pd.read_csv('updated_ctg.csv').drop(columns=['Unnamed: 0'])

# Preview the first 5 rows of the dataset
df.head()

Unnamed: 0,Study Title,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,Secondary Outcome Measures,Sponsor,Collaborators,...,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,First Posted,Last Update Posted,Locations
0,Effectiveness of a Problem-solving Interventio...,COMPLETED,We will conduct a two-arm individually randomi...,NO,"Mental Health Issue (E.G., Depression, Psychos...",behavioral: pride 'step 1' problem-solving int...,"Mental health symptoms, The Strengths and Diff...","Mental health symptoms, The adolescent-reporte...",Sangath,Harvard Medical School (HMS and HSDM)|London S...,...,250,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2018-08-20,2019-01-20,2019-02-28,2018-08-14,2019-05-21,"Sangath, New Delhi, Delhi, 110016, India"
1,Oral Ketone Monoester Supplementation and Rest...,RECRUITING,People who report subjective memory complaints...,NO,Cerebrovascular Function|Cognition,other: placebo|dietary_supplement: β-ohb,"Brain network connectivity, Functional connect...","Working memory, Computer battery to assess wor...",McMaster University,"Alzheimer's Society of Brant, Haldimand Norfol...",...,30,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: CRO...,2023-10-25,2024-08,2024-08,2023-08-15,2023-12-01,"McMaster University, Hamilton, Ontario, L8S 4K..."
2,Investigating the Effect of a Prenatal Family ...,COMPLETED,The purpose of this study is to measure the di...,NO,Focus: Contraceptive Counseling|Focus: Postpar...,other: family planning counseling let by commu...,"Self-reported contraceptive use, 6 months post...","Intent to use contraception in the future, 6 m...",Planned Parenthood League of Massachusetts,Society for Family Planning Research Fund,...,176,OTHER,INTERVENTIONAL,Allocation: NON_RANDOMIZED|Intervention Model:...,2013-08,2014-12,2014-12,2013-05-15,2015-08-17,Palestinian Ministry of Health Maternal Child ...
3,Pre-exposure Prophylaxis (PrEP) for People Who...,WITHDRAWN,People who inject drugs (PWID) experience high...,NO,Intravenous Drug Abuse,behavioral: prep uptake/adherence intervention...,"PrEP uptake by self-report, measured using 1 i...",Participant satisfaction with intervention con...,Boston University,National Institute on Drug Abuse (NIDA),...,0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2021-03,2022-03,2022-03,2019-03-11,2021-03-10,unknown
4,Tailored Inhibitory Control Training to Revers...,COMPLETED,Insufficient inhibitory control is one pathway...,NO,Smoking|Alcohol Drinking|Prescription Drug Abu...,behavioral: person-centered inhibitory control...,"Inhibitory control performance, Task 1, Perfor...",Far transfer to a task related to inhibitory c...,University of Oregon,none,...,103,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2014-09,2016-04,2016-05,2016-10-26,2016-10-26,"University of Oregon, Social and Affective Neu..."
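
The `Unnamed: 0` column dropped in the load step typically appears when a DataFrame is written to CSV together with its index. As a hedged sketch, here are three equivalent ways to handle it; the `csv_text` contents are hypothetical stand-ins, not rows from `updated_ctg.csv`.

```python
import io

import pandas as pd

# Stand-in for a CSV that was saved with its index (hypothetical contents).
csv_text = ",Study Title,Study Status\n0,Trial A,COMPLETED\n1,Trial B,WITHDRAWN\n"

# Option 1 (as in the load cell): read everything, then drop the stray column.
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

# Option 2: tell pandas that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 3: avoid the issue at the source by not writing the index.
buffer = io.StringIO()
indexed.to_csv(buffer, index=False)
clean = pd.read_csv(io.StringIO(buffer.getvalue()))

print(list(dropped.columns), list(indexed.columns), list(clean.columns))
```

All three leave the same data columns; option 3 only applies if the earlier notebook that writes the file can be changed.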


Let's find the total number of rows and columns in the data.

In [12]:
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns.')

The dataset has 482350 rows and 23 columns.
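
The preview above contains placeholder strings such as `unknown` (Locations) and `none` (Collaborators), so a column can be fully populated while still lacking real information. A minimal sketch of counting such placeholders per column, assuming these two tokens are the only fill values (the `sample` frame and the `placeholders` set are illustrative, not taken from the notebook):

```python
import pandas as pd

# Toy frame with the placeholder strings seen in the preview (illustrative values).
sample = pd.DataFrame({
    "Collaborators": ["Harvard Medical School", "none", "none"],
    "Locations": ["New Delhi, India", "unknown", "Hamilton, Canada"],
})

# Assumed placeholder tokens; extend this set if other fill values were used.
placeholders = {"unknown", "none"}

# Count, per column, how many entries are placeholders rather than real values.
placeholder_counts = sample.apply(
    lambda column: column.str.strip().str.lower().isin(placeholders).sum()
)

print(placeholder_counts)
```

On the full dataset, the same check would be applied to the text columns only.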


Check the column data types and non-null counts of the dataset.

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482350 entries, 0 to 482349
Data columns (total 23 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   Study Title                 482350 non-null  object
 1   Study Status                482350 non-null  object
 2   Brief Summary               482350 non-null  object
 3   Study Results               482350 non-null  object
 4   Conditions                  482350 non-null  object
 5   Interventions               482350 non-null  object
 6   Primary Outcome Measures    482350 non-null  object
 7   Secondary Outcome Measures  482350 non-null  object
 8   Sponsor                     482350 non-null  object
 9   Collaborators               482350 non-null  object
 10  Sex                         482350 non-null  object
 11  Age                         482350 non-null  object
 12  Phases                      482350 non-null  object
 13  Enrollment                  4