## Goals of Notebook

* Download XML files from clincial trials url
* Parse XML files and grab relevant information
* Create CSV File from parsed information


### Downloading the Data

The Clinical Trial data can be found in XML format at https://clinicaltrials.gov/ct2/resources/download

In [1]:
from bs4 import BeautifulSoup
import requests
import os
import lxml
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
# Instantiate soup object
soup = BeautifulSoup("XML.parser")

### Column List

Columns list is used to create an empty DataFrame to be later populated with parsed data <br>
* List should contain targeted tags within the downloaded clinical trial xml data

In [9]:
columns = [
        'agency',
        'brief_title',
        'brief_summary',
        'city',
        'clinical_study',
        'collaborator',
        'completion_date',
        'condition',
        'condition_browse',
        'country',
        'criteria',
        'detailed_description',
        'eligibility',
        'enrollment',
        'facility',
        'gender',
        'has_expanded_access',
        'intervention',
        'intervention_name',
        'intervention_type',
        'keyword',
        'last_update_posted',
        'last_update_submitted',
        'last_update_submitted_qc',
        'lead_sponsor',
        'location',
        'mesh_term',
        'official_title',
        'overall_status',
        'phase',
        'primary_completion_date',
        'primary_purpose',
        'role',
        'source',
        'sponsors',
        'start_date',
        'state',
        'textblock',
        'time_frame',
        'url'
]

In [10]:
def soup_dict(soup):
    """Returns dictionary containing targeted tags as keys and scraped text as values"""
    
    list_of_tags = [tag.name for tag in soup.find_all()]
    
    data = {
     'agency': soup.agency.text if (type(soup.agency))!=None and 'agency' in list_of_tags else 'N/A',
     'brief_title': soup.brief_title.text if (type(soup.brief_title))!=None and 'brief_title' in list_of_tags else 'N/A',
     'brief_summary': soup.brief_summary.text if (type(soup.brief_summary))!=None and 'brief_summary' in list_of_tags else 'N/A',
     'city': soup.city.text if (type(soup.city))!=None and 'city' in list_of_tags else 'N/A',
     'clinical_study': soup.clinical_study.text if (type(soup.clinical_study))!=None and 'clinical_study' in list_of_tags else 'N/A',
     'collaborator': soup.collaborator.text if (type(soup.collaborator))!=None and 'collaborator' in list_of_tags else 'N/A',
     'completion_date': soup.completion_date.text if (type(soup.completion_date))!=None and 'completion_date' in list_of_tags else 'N/A',
     'condition': soup.condition.text if (type(soup.condition))!=None and 'condition' in list_of_tags else 'N/A',
     'condition_browse': soup.condition_browse.text if (type(soup.condition_browse))!=None and 'condition_browse' in list_of_tags else 'N/A',
     'country': soup.country.text if (type(soup.country))!=None and 'country' in list_of_tags else 'N/A',
     'criteria': soup.criteria.text if (type(soup.criteria))!=None and 'criteria' in list_of_tags else 'N/A',
     'detailed_description': soup.detailed_description.text if (type(soup.detailed_description))!=None and 'detailed_description' in list_of_tags else 'N/A',
     'eligibility': soup.eligibility.text if (type(soup.eligibility))!=None and 'eligibility' in list_of_tags else 'N/A', 
     'enrollment': soup.enrollment.text if (type(soup.enrollment))!=None and 'enrollment' in list_of_tags else 'N/A',
     'facility': soup.facility.text if (type(soup.facility))!=None and 'facility' in list_of_tags else 'N/A',
     'gender': soup.gender.text if (type(soup.gender))!=None and 'gender' in list_of_tags else 'N/A',
     'has_expanded_access': soup.has_expanded_access.text if (type(soup.has_expanded_access))!=None and 'has_expanded_access' in list_of_tags else 'N/A',
     'intervention': soup.intervention.text if (type(soup.intervention))!=None and 'intervention' in list_of_tags else 'N/A',
     'intervention_name': soup.intervention_name.text if (type(soup.intervention_name))!=None and 'intervention_name' in list_of_tags else 'N/A',
     'intervention_type': soup.intervention_type.text if (type(soup.intervention_type))!=None and 'intervention_name' in list_of_tags else 'N/A',
     'keyword': soup.keyword.text if (type(soup.keyword))!=None and 'keyword' in list_of_tags else 'N/A',
     'last_update_posted': soup.last_update_posted.text if (type(soup.last_update_posted))!=None and 'last_update_posted' in list_of_tags else 'N/A',
     'last_update_submitted': soup.last_update_submitted.text if (type(soup.last_update_submitted))!=None and 'last_update_submitted' in list_of_tags else 'N/A',
     'last_update_submitted_qc': soup.last_update_submitted_qc.text if (type(soup.last_update_submitted_qc))!=None and 'last_update_submitted_qc' in list_of_tags else 'N/A',
     'lead_sponsor': soup.lead_sponsor.text if (type(soup.lead_sponsor))!=None and 'lead_sponsor' in list_of_tags else 'N/A',
     'location': soup.location.text if (type(soup.location))!=None and 'location' in list_of_tags else 'N/A',
     'mesh_term': soup.mesh_term.text if (type(soup.mesh_term))!=None and 'mesh_term' in list_of_tags else 'N/A',
     'official_title': soup.official_title.text if (type(soup.official_title))!=None and 'official_title' in list_of_tags else 'N/A',
     'overall_status': soup.overall_status.text if (type(soup.overall_status))!=None and 'overall_status' in list_of_tags else 'N/A',
     'phase': soup.phase.text if (type(soup.phase))!=None and 'phase' in list_of_tags else 'N/A',
     'primary_completion_date': soup.primary_completion_date.text if (type(soup.primary_completion_date))!=None and 'primary_completion_date' in list_of_tags else 'N/A',
     'primary_purpose': soup.primary_purpose.text if (type(soup.primary_purpose))!=None and 'primary_purpose' in list_of_tags else 'N/A',
     'role': soup.role.text if (type(soup.role))!=None and 'role' in list_of_tags  else 'N/A',
     'source': soup.source.text if (type(soup.source))!=None and 'source' in list_of_tags  else 'N/A',  
     'sponsors': soup.sponsors.text if (type(soup.sponsors))!=None and 'sponsors' in list_of_tags else 'N/A',
     'start_date': soup.start_date.text if (type(soup.start_date))!=None and 'start_date' in list_of_tags else 'N/A',  
     'state': soup.state.text if (type(soup.state))!=None and 'state' in list_of_tags else 'N/A',
     'textblock': soup.textblock.text if (type(soup.textblock))!=None and 'textblock' in list_of_tags else 'N/A',
     'time_frame': soup.time_frame.text if (type(soup.time_frame))!=None and 'time_frame' in list_of_tags else 'N/A',
     'url': soup.url.text if (type(soup.url))!=None and 'url' in list_of_tags else 'N/A',
    }
    
    return data

## Function for generating a pandas dataframe

This function will create a dataframe using the xml files that are located in a single directory

In [27]:
def generate_dataframe(filefolder, columns):
    """Returns DataFrame from Clinical Trial XML Directory"""
    
    # Create empty DataFrame from columns list
    df = pd.DataFrame(columns=columns)
    
    # Head directory containing sub directories
    folder = os.listdir(filefolder)

    
    for file in folder:
        print(file)
        
        for xml in os.listdir(filefolder + '/' + file):

            path = os.path.join(filefolder, file, xml)

            if  path[-3:] == 'xml': 
                with open(path, 'rb') as f2:
                    data = f2.read()
                    soup = BeautifulSoup(data, 'xml')

                    data_dict = soup_dict(soup)
                    temp_df = pd.DataFrame.from_dict([data_dict])
                    df = pd.concat([df, temp_df])
            
                
    return df

In [28]:
# Run generate dataframe function and instantiate df
df = generate_dataframe('filefolder', columns)

NCT0197xxxx


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




NCT0383xxxx
NCT0396xxxx
NCT0005xxxx


In [29]:
# Check shape of DataFrame
df.shape

(2984, 43)

In [30]:
# Check and verify DataFrame
df.head()

Unnamed: 0,address,agency,brief_summary,brief_title,city,clinical_study,collaborator,completion_date,condition,condition_browse,...,primary_purpose,role,source,sponsors,start_date,state,textblock,time_frame,time_perspective,url
0,,University of Utah,\n\n The purpose of this prospective tria...,Cancer Symptom Monitoring Telephone System Wit...,Nashville,\n\n\nClinicalTrials.gov processed this data o...,\nNational Cancer Institute (NCI)\nNIH\n,April 2012,Cancer,,...,Supportive Care,Principal Investigator,University of Utah,\n\nUniversity of Utah\nOther\n\n\nNational Ca...,September 2007,Tennessee,\n The purpose of this prospective trial ...,Daily patient reports of symptom levels were m...,,https://clinicaltrials.gov/show/NCT01973946
0,,Neurogen Brain and Spine Institute,\n\n The purpose of this study was to stu...,Stem Cell Therapy in Autism Spectrum Disorders,Mumbai,\n\n\nClinicalTrials.gov processed this data o...,,May 2016,Autism Spectrum Disorders,\n\nDisease\nAutistic Disorder\nAutism Spectru...,...,Treatment,,Neurogen Brain and Spine Institute,\n\nNeurogen Brain and Spine Institute\nOther\n\n,August 2009,Maharashtra,\n The purpose of this study was to study...,Six months,,https://clinicaltrials.gov/show/NCT01974973
0,,Nektar Therapeutics,\n\n The purpose of this research study i...,A Study in Cancer Patients to Evaluate the Eff...,Los Angeles,\n\n\nClinicalTrials.gov processed this data o...,,September 2016,Advanced Cancer,,...,Treatment,Study Director,Nektar Therapeutics,\n\nNektar Therapeutics\nIndustry\n\n,February 2014,California,\n The purpose of this research study is ...,Day -1 through Day 42,,https://clinicaltrials.gov/show/NCT01976143
0,,University of Oxford,\n\n Pleural effusion is an extremely com...,Using Ultrasound to Predict the Results of Dra...,Oxford,\n\n\nClinicalTrials.gov processed this data o...,,"May 16, 2017",Pleural Effusion,\n\nPleural Effusion\n,...,,Study Chair,University of Oxford,\n\nUniversity of Oxford\nOther\n\n,"August 1, 2014",,\n Pleural effusion is an extremely commo...,Up to 3 months,Prospective,https://clinicaltrials.gov/show/NCT01973985
0,,M.D. Anderson Cancer Center,\n\n The goal of this clinical research s...,Widespread vs. Selective Screening for Hepatit...,Houston,\n\n\nClinicalTrials.gov processed this data o...,\nNational Cancer Institute (NCI)\nNIH\n,December 2019,Cancer,\n\nInfection\nHepatitis\nHepatitis A\nHepatit...,...,,Study Chair,M.D. Anderson Cancer Center,\n\nM.D. Anderson Cancer Center\nOther\n\n\nNa...,June 2013,Texas,\n The goal of this clinical research stu...,"Baseline blood tests, approximately 30 minutes...",Prospective,https://clinicaltrials.gov/show/NCT01970254


In [32]:
# Write DataFrame to CSV file
df.to_csv('clinical_trial_csv')