# Accessing ClinicalTrials.gov data via the REST API in Python
This notebook experiments with accessing clinical trial data via the ClinicalTrials.gov REST API.
While the API is limited in response size and somewhat cumbersome, it is probably the best way to access the newest data for particular clinical trials.
Two approaches are explored here:
1. The pytrials package: recommended.
2. Direct REST API queries with manual parsing of the resulting XML: cumbersome and prone to formatting issues, so abandoned for now.
## Resources
All API URLs needed for requests:
https://classic.clinicaltrials.gov/api/gui/ref/api_urls
https://classic.clinicaltrials.gov/api/gui


In [1]:
# Imports
import requests
import pandas as pd
import plotly.express as px
from tqdm import tqdm
from datetime import datetime as dt

import xml.etree.ElementTree as ET
import duckdb
import pytrials


# Using the pytrials package to access the data

https://pypi.org/project/pytrials/ 

## Get the list of available fields
First, let's see which fields are available to search by.
For that purpose, we parse the XML file listed here:

https://classic.clinicaltrials.gov/api/info/study_fields_list

In [2]:
print(f"{dt.now().strftime('%Y-%m-%d %H:%M:%S')} - Start")
all_study_fields_url ="https://classic.clinicaltrials.gov/api/info/study_fields_list"
response = requests.get(all_study_fields_url)
all_study_fields_xml  = response.text

# Parse the XML data
root = ET.fromstring(all_study_fields_xml)

# Find all Field elements within FieldList
field_elements = root.findall(".//FieldList/Field")

# Extract the Field Name attribute and store in a list
all_study_fields = [str(field.get("Name")) for field in field_elements]

# Print the list of field names
print(all_study_fields)
print(f"{len(all_study_fields)} fields in total")
print(f"{dt.now().strftime('%Y-%m-%d %H:%M:%S')} - End")

2023-10-18 16:41:49 - Start
['Acronym', 'AgreementOtherDetails', 'AgreementPISponsorEmployee', 'AgreementRestrictionType', 'AgreementRestrictiveAgreement', 'ArmGroupDescription', 'ArmGroupInterventionName', 'ArmGroupLabel', 'ArmGroupType', 'AvailIPDComment', 'AvailIPDId', 'AvailIPDType', 'AvailIPDURL', 'BaselineCategoryTitle', 'BaselineClassDenomCountGroupId', 'BaselineClassDenomCountValue', 'BaselineClassDenomUnits', 'BaselineClassTitle', 'BaselineDenomCountGroupId', 'BaselineDenomCountValue', 'BaselineDenomUnits', 'BaselineGroupDescription', 'BaselineGroupId', 'BaselineGroupTitle', 'BaselineMeasureCalculatePct', 'BaselineMeasureDenomCountGroupId', 'BaselineMeasureDenomCountValue', 'BaselineMeasureDenomUnits', 'BaselineMeasureDenomUnitsSelected', 'BaselineMeasureDescription', 'BaselineMeasureDispersionType', 'BaselineMeasureParamType', 'BaselineMeasurePopulationDescription', 'BaselineMeasureTitle', 'BaselineMeasureUnitOfMeasure', 'BaselineMeasurementComment', 'BaselineMeasurementGroup
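A comment in the pytrials cell of this notebook notes that the API accepts at most 20 fields per request, so retrieving every field in this list would mean splitting it into batches. A minimal sketch, assuming that 20-field limit holds; `chunk_fields` is a hypothetical helper and not part of pytrials:

```python
from typing import Iterator, List


def chunk_fields(fields: List[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of `fields`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(fields), batch_size):
        yield fields[start:start + batch_size]


# Illustrative field names only; in the notebook this would be all_study_fields.
example_fields = [f"Field{i}" for i in range(45)]
batches = list(chunk_fields(example_fields))
print([len(batch) for batch in batches])  # [20, 20, 5]
```

Each batch could then be passed as the `fields` argument of a separate request. If the results are to be joined afterwards, adding `NCTId` to every batch (and lowering `batch_size` to 19) would give each partial result a shared key.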

## Pytrials usage

In [3]:
print(f"{dt.now().strftime('%Y-%m-%d %H:%M:%S')} - Start")
from pytrials.client import ClinicalTrials
search_term = "Coronavirus+COVID"
ct = ClinicalTrials()
# Get up to 100 full studies related to Coronavirus and COVID in JSON format.
ct.get_full_studies(search_expr=search_term, max_studies=100)

# Get the BriefSummary, BriefTitle, Condition and LeadSponsorName fields from up to 100 studies related to Coronavirus and COVID, in CSV format.
corona_fields = ct.get_study_fields(
    search_expr=search_term,
    fields=["BriefSummary", "BriefTitle", "Condition", "LeadSponsorName"],
    #fields=all_study_fields[0:20], # The API limits the number of fields to 20
    max_studies=100, # API has a limit 100 records
    fmt="csv",
)

# Get the count of studies related to Coronavirus and COVID.
# ClinicalTrials limits API queries to 1000 records
# Count of studies may be useful to build loops when you want to retrieve more than 1000 records

ct.get_study_count(search_expr="Coronavirus+COVID")

# Read the csv data in Pandas
corona_df = pd.DataFrame.from_records(corona_fields[1:], columns=corona_fields[0])
print(corona_df)
print(f"{dt.now().strftime('%Y-%m-%d %H:%M:%S')} - End")

2023-10-18 16:41:55 - Start
   Rank                                       BriefSummary  \
0     1  The objectives of this study are to characteri...   
1     2  Assessment of the clinical effects of infusion...   
2     3  This is a prospective follow-up non-interventi...   
3     4  Using Laser light to detect COVID 19 virus par...   
4     5  Comparison of the effects of CYT107 vs Placebo...   
..  ...                                                ...   
95   96  Study objectives: To evaluate the immunogenici...   
96   97  Background: Direct exposure to public health e...   
97   98  The purpose of the program. To determine the c...   
98   99  The coronavirus (COVID-19) pandemic continues ...   
99  100  200 participants should be included in the stu...   

                                           BriefTitle  \
0   Collection of Coronavirus COVID-19 Outbreak Sa...   
1   Treatment of Coronavirus COVID-19 Pneumonia (P...   
2   Extraordinary Measures for Egyptian Children W...   
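The comment in the cell above about looping when more than 1000 records are needed can be sketched as follows. This assumes the per-request cap of 1000 records mentioned there and 1-based, inclusive ranks. `rank_windows` is a hypothetical helper, and how a window is passed to the API (for example through a minimum-rank argument in pytrials) should be checked against the installed version.

```python
from typing import List, Tuple


def rank_windows(total: int, page_size: int = 1000) -> List[Tuple[int, int]]:
    """Split ranks 1..total into inclusive (min_rank, max_rank) windows."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    windows = []
    for start in range(1, total + 1, page_size):
        windows.append((start, min(start + page_size - 1, total)))
    return windows


# 2500 is a made-up count; in the notebook it would come from ct.get_study_count(...).
print(rank_windows(2500))  # [(1, 1000), (1001, 2000), (2001, 2500)]
```

Each window would correspond to one request, and the per-window results could be concatenated into a single DataFrame.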

## Manual REST API query 
An example of a semi-working direct REST API query.

In [5]:


# Sample URL
#url = "https://api/v2/studies"
url = "https://ClinicalTrials.gov/api/query/study_fields?expr=heart+attack&fields=NCTId,Condition,BriefTitle"
response = requests.get(url)
text_data = response.text
print("This is what the XML data looks like:")
print(text_data)

#json_data = json.loads(text_data) # <-- this does not work, because the data comes in as an XML string, not a JSON string

This is what the XML data looks like:
<StudyFieldsResponse>
  <APIVrs>1.01.05</APIVrs>
  <DataVrs>2023:10:17 23:49:37.236</DataVrs>
  <Expression>heart attack</Expression>
  <NStudiesAvail>469845</NStudiesAvail>
  <NStudiesFound>10139</NStudiesFound>
  <MinRank>1</MinRank>
  <MaxRank>20</MaxRank>
  <NStudiesReturned>20</NStudiesReturned>
  <FieldList>
    <Field>NCTId</Field>
    <Field>Condition</Field>
    <Field>BriefTitle</Field>
  </FieldList>
  <StudyFieldsList>
    <StudyFields Rank="1">
      <FieldValues Field="NCTId">
        <FieldValue>NCT05654389</FieldValue>
      </FieldValues>
      <FieldValues Field="Condition">
        <FieldValue>Telemedicine</FieldValue>
      </FieldValues>
      <FieldValues Field="BriefTitle">
        <FieldValue>Effectiveness of Teleconsultation in Referring a Patient With Early Myocardial Infarction From Peripheral Hospital to Cardiac Centre in Hail, Saudi Arabia</FieldValue>
      </FieldValues>
    </StudyFields>
    <StudyFields Rank="2">
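
The query URL used in this cell can also be assembled from a parameter dictionary, which makes it easier to vary the search expression, the rank window, or the response format. The sketch below only builds the URL and does not send a request. The `min_rnk`, `max_rnk` and `fmt` parameter names are assumptions to verify against the API URL reference linked under Resources, and `build_study_fields_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://ClinicalTrials.gov/api/query/study_fields"


def build_study_fields_url(expr, fields, min_rnk=1, max_rnk=20, fmt="json"):
    """Return a study_fields query URL for the given expression and fields."""
    params = {
        "expr": expr,
        "fields": ",".join(fields),
        "min_rnk": min_rnk,   # assumed parameter name, see lead-in
        "max_rnk": max_rnk,   # assumed parameter name, see lead-in
        "fmt": fmt,           # assumed parameter name, see lead-in
    }
    # urlencode escapes spaces as "+" and commas as "%2C".
    return f"{BASE_URL}?{urlencode(params)}"


example_url = build_study_fields_url(
    "heart attack", ["NCTId", "Condition", "BriefTitle"]
)
print(example_url)
```

If the endpoint honors `fmt=json`, the body could be read with `response.json()` instead of an XML parser, which would sidestep the `json.loads` failure noted in the cell.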
  

### Parsing the XML manually (not recommended)
This is a bit of a pain, but it works to some extent. With a little more effort we could get it to work fully.
Still, this approach faces the same limitations as the pytrials approach, because the API limits on the number of records and fields are the same.

In [8]:
# Full version
if response.status_code == 200:
    try:
        # Parse the XML content
        root = ET.fromstring(response.text)

        # Initialize a dictionary to store data for each field
        field_data = {}

        # Traverse the XML tree and extract data
        for element in root.iter():
            field_name = element.tag
            field_value = element.text

            # Initialize the field_data dictionary if the field doesn't exist
            if field_name not in field_data:
                field_data[field_name] = []

            # Append the field value to the corresponding field
            field_data[field_name].append(field_value)

        # Determine the maximum length of field values
        max_len = max(len(v) for v in field_data.values())

        # Pad field values with None to ensure they have the same length
        for k, v in field_data.items():
            while len(v) < max_len:
                v.append(None)

        # Create a Pandas DataFrame from the padded field_data
        df = pd.DataFrame(field_data)

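        # Create a DuckDB database and load the DataFrame into a table.
        # pandas' to_sql expects a SQLAlchemy or sqlite3 connection, so DuckDB's own API is used instead.
        conn = duckdb.connect(database='clinical_trials.db')
        conn.register('df_view', df)
        conn.execute('CREATE OR REPLACE TABLE clinical_trials AS SELECT * FROM df_view')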

        # Close the connection
        conn.close()

        print("Data inserted into the DuckDB database.")

    except ET.ParseError as e:
        print('Failed to parse XML:', e)
else:
    print(f'Failed to retrieve data. Status code: {response.status_code}')




Data inserted into the DuckDB database.


In [None]:
df

Unnamed: 0,StudyFieldsResponse,APIVrs,DataVrs,Expression,NStudiesAvail,NStudiesFound,MinRank,MaxRank,NStudiesReturned,FieldList,Field,StudyFieldsList,StudyFields,FieldValues,FieldValue
0,\n,1.01.05,2023:10:17 00:10:37.720,heart attack,469667,10132,1,20,20,\n,NCTId,\n,\n,\n,NCT05654389
1,,,,,,,,,,,Condition,,\n,\n,Telemedicine
2,,,,,,,,,,,BriefTitle,,\n,\n,Effectiveness of Teleconsultation in Referring...
3,,,,,,,,,,,,,\n,\n,NCT01874691
4,,,,,,,,,,,,,\n,\n,Acute Myocardial Infarction
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,,,,,,,,,,,,,,\n,Myocardial Infarction
57,,,,,,,,,,,,,,\n,MiSaver® Stem Cell Treatment for Heart Attack ...
58,,,,,,,,,,,,,,\n,NCT01150825
59,,,,,,,,,,,,,,\n,Myocardial Infarction
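
The misaligned columns in this DataFrame come from collecting every XML element into its own list, regardless of which study it belongs to. A sketch of an alternative is to build one dictionary per study. It assumes the response layout shown in the XML sample earlier: one `StudyFields` element per study, with `FieldValues` children keyed by a `Field` attribute. The embedded XML is a shortened sample copied from that output, and `study_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened sample of the response layout; a real response holds many studies.
SAMPLE_XML = """
<StudyFieldsResponse>
  <StudyFieldsList>
    <StudyFields Rank="1">
      <FieldValues Field="NCTId">
        <FieldValue>NCT05654389</FieldValue>
      </FieldValues>
      <FieldValues Field="Condition">
        <FieldValue>Telemedicine</FieldValue>
      </FieldValues>
    </StudyFields>
  </StudyFieldsList>
</StudyFieldsResponse>
"""


def study_rows(xml_text):
    """Return one dict per StudyFields element, joining multiple values with '|'."""
    root = ET.fromstring(xml_text)
    rows = []
    for study in root.iter("StudyFields"):
        row = {"Rank": int(study.get("Rank"))}
        for values in study.findall("FieldValues"):
            # A field may hold several FieldValue entries (e.g. several conditions).
            texts = [v.text or "" for v in values.findall("FieldValue")]
            row[values.get("Field")] = "|".join(texts)
        rows.append(row)
    return rows


rows = study_rows(SAMPLE_XML)
print(rows)  # [{'Rank': 1, 'NCTId': 'NCT05654389', 'Condition': 'Telemedicine'}]
```

Calling `pd.DataFrame(study_rows(response.text))` would then give one row per study and one column per requested field, which is closer to what pytrials returns.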
