# Finding trials registered on ClinicalTrials.gov that do not have reported results

Reporting of clinical trial results became mandatory for many trials in 2008. However [this paper](http://www.bmj.com/content/352/bmj.i637) and [this investigation](https://www.statnews.com/2015/12/13/clinical-trials-investigation/) both find that substantial numbers of clinical trials have not reported results, even for those trials where the FDAAA has made reporting mandatory.

This notebook examines how many trials on ClinicalTrials.gov have had their results publicly reported. Our definition of a trial that should report its results is broader than the FDAAA's. We count a trial as eligible for our analysis if:

- it has overall status of 'Completed'
- it has a study type of 'Interventional' 
- its completion date was on or after 1 Jan 2006, and more than 24 months ago
- it is phase 2 or later (or its phase is N/A, i.e. it's a trial of a device or a behavioural intervention)
- it has no results disposition date (i.e. no application to delay results has been filed).

We then classify it as overdue if it has no summary results attached on ClinicalTrials.gov, and no results on PubMed that are linked by NCT ID (see below). 
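The criteria above can be sketched as a plain function over a single trial record. This is only an illustration, not the code used later in the notebook: the dictionary keys mirror the CSV column names used below, while `months_before`, `is_due_results` and `is_overdue` are hypothetical helper names, and `pubmed_results` is assumed to be a boolean that has already been looked up.

```python
from datetime import date

def months_before(d, months):
    """Return the date `months` calendar months before `d` (day clamped to 28)."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, min(d.day, 28))

def is_due_results(trial, today):
    """True if the trial meets every eligibility criterion in the list above."""
    phase = trial.get('phase')  # None stands for a phase of N/A
    # A combined phase such as '1/2' is treated as its first phase, i.e. phase 1.
    phase_ok = phase is None or int(str(phase).split('/')[0]) >= 2
    completed = trial.get('completion_date')
    return (trial.get('overall_status') == 'Completed'
            and trial.get('study_type') == 'Interventional'
            and completed is not None
            and date(2006, 1, 1) <= completed <= months_before(today, 24)
            and phase_ok
            and trial.get('disposition_date') is None)

def is_overdue(trial, today):
    """True if the trial is due results but none are on CT.gov or linked on PubMed."""
    return (is_due_results(trial, today)
            and trial.get('results_date') is None
            and not trial.get('pubmed_results', False))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible when the analysis is re-run later.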

This is substantially broader than FDAAA, which covers only US-based trials of FDA-approved drugs. However, we think all trials should report their results, not just US-based trials, or FDA-approved drugs. In addition, FDAAA requires results to be reported within 12 months of completion, and we allow 24 months.

ClinicalTrials.gov supplies notes on [how to find studies with results](https://clinicaltrials.gov/ct2/help/how-find/find-study-results) and [results in general](https://clinicaltrials.gov/ct2/about-site/results).

In [2]:
import csv
from datetime import datetime
from dateutil.relativedelta import relativedelta
import glob
from pprint import pprint
import requests
import StringIO
import time
import zipfile

import numpy as np
import pandas as pd
from pyquery import PyQuery as pq
from slugify import slugify
from xml.etree import ElementTree

## Obtain the raw XML data

First, we get the raw XML trial summaries from ClinicalTrials.gov. This is supplied as a [single very large zip file](https://clinicaltrials.gov/search?studyxml=true), containing more than 200,000 XML files. ClinicalTrials.gov supplies [field definitions](https://prsinfo.clinicaltrials.gov/definitions.html).
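As a rough sketch of the unzip step, the downloaded bytes can be opened as an in-memory archive and extracted to a directory. `extract_zip_bytes` is a hypothetical helper name; the sketch uses `io.BytesIO`, which is available on both Python 2.7 and Python 3, and it does not perform the download itself.

```python
import io
import zipfile

def extract_zip_bytes(content, dest_dir):
    """Extract an in-memory zip archive into dest_dir and return the member names."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

For an archive of this size it may be preferable to stream the download to a file on disk first, rather than holding the whole response in memory.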

In [3]:
# Set this to True if you want to download the latest raw data. 
# TODO: make sure this code works, rather than doing it manually!
REDOWNLOAD_XML = False
if REDOWNLOAD_XML:
    r = requests.get('https://clinicaltrials.gov/search?studyxml=true', stream=True)
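    if r.ok:
        # Open the downloaded archive in memory and extract the XML files.
        zip_ref = zipfile.ZipFile(StringIO.StringIO(r.content))
        zip_ref.extractall('./data/search_result')
        zip_ref.close()
    else:
        # Don't try to unzip an error response.
        print 'Problem downloading'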
print 'done'

done


## Create summary results file

Extract the fields of interest from the XML summaries, and save them to a CSV file, which we'll use as our source data for the rest of this exercise. Note that this section is skipped by default, for the purposes of development. 
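To make the extraction step concrete, here is a minimal sketch that pulls a few of the same fields out of one record using only the standard library's `ElementTree` (the notebook itself uses PyQuery). The XML snippet is an invented, heavily trimmed record that assumes the element names used in the cell below; `text_of` and `summarise` are hypothetical helper names.

```python
from xml.etree import ElementTree

# Invented example record, trimmed to a handful of fields.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title> An invented example trial </brief_title>
  <overall_status>Completed</overall_status>
  <phase>Phase 2</phase>
  <study_type>Interventional</study_type>
  <primary_completion_date>May 2012</primary_completion_date>
</clinical_study>"""

def text_of(root, tag):
    """Return the stripped text of the first element named `tag`, or ''."""
    node = root.find('.//' + tag)
    return node.text.strip() if node is not None and node.text else ''

def summarise(xml_text):
    """Extract a few summary fields from one trial record."""
    root = ElementTree.fromstring(xml_text)
    return {
        'nct_id': text_of(root, 'nct_id'),
        'title': text_of(root, 'brief_title'),
        'overall_status': text_of(root, 'overall_status'),
        'phase': text_of(root, 'phase').replace('Phase ', ''),
        'study_type': text_of(root, 'study_type'),
        'completion_date': text_of(root, 'primary_completion_date'),
    }
```

Missing elements come back as empty strings here, so a record without a `phase` element produces `''` rather than raising an error.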

In [4]:
fname = 'trials.csv'
REGENERATE_SUMMARY = False
if REGENERATE_SUMMARY:
    files = glob.glob('./data/search_result/*.xml')
    fieldnames = ['nct_id', 'title', 'overall_status', 
                  'study_type', 'completion_date',
                  'lead_sponsor', 'lead_sponsor_class',
                  'collaborator', 'collaborator_class', 
                  'phase', 'locations', 'has_drug_intervention', 'drugs', 
                  'disposition_date', 'results_date', 'results_pmids', 
                  'enrollment']
    trials = csv.DictWriter(open(fname, 'wb'), fieldnames=fieldnames)
    trials.writeheader()
    for i, f in enumerate(files):
        if i % 50000 == 0:
            print i, f
#         print i, f
        text = open(f, 'r').read()
        d = pq(text, parser='xml')
        data = {}
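        # Default every field to None (using a name that doesn't shadow
        # the file name f from the outer loop).
        for field in fieldnames:
            data[field] = None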
        
        data['nct_id'] = d('nct_id').text()
        data['title'] = d('brief_title').text().strip()
        data['overall_status'] = d('overall_status').text().strip()
        data['phase'] = d('phase').text().replace("Phase ", "")
        
        data['lead_sponsor'] = d('lead_sponsor agency').text()
        data['lead_sponsor_class'] = d('lead_sponsor agency_class').text()
        data['collaborator'] = d('collaborator')('agency').text()
        data['collaborator_class'] = d('collaborator')('agency_class').text()
        
        data['study_type'] = d('study_type').text()
        data['completion_date'] = d('primary_completion_date').text()
        data['results_date'] = d('firstreceived_results_date').text()
        data['results_pmids'] = d('results_reference PMID').text()  # Not used, see below.
        data['enrollment'] = d('enrollment').text()
        
        # Not currently used, but might be useful in future. 
        data['has_drug_intervention'] = False
        data['drugs'] = ''
        for it in d('intervention'):
            e = pq(it)
            if e('intervention_type').text() == 'Drug':
                data['has_drug_intervention'] = True
                data['drugs'] += e('intervention_name').text() + '; '
                
        data['disposition_date'] = d('firstreceived_results_disposition_date').text()
        data['locations'] =  d('location_countries country').text()

        for k in data:
            if data[k] and isinstance(data[k], basestring):
                data[k] = data[k].encode('utf8')
        trials.writerow(data)
        
print 'done'

done


## Load data for analysis

Normalise date fields, and load into Pandas.  
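The cell below leaves date parsing to pandas. As an illustration of what the normalisation amounts to, here is a standard-library sketch. It assumes the dates arrive either as month and year ("May 2014") or as a full date ("May 1, 2014"); that assumption about the input format, and the helper name `parse_ct_date`, are ours rather than something confirmed in this notebook.

```python
from datetime import datetime

def parse_ct_date(text):
    """Parse 'May 1, 2014' or 'May 2014' into a date; return None if unparseable."""
    if not text:
        return None
    for fmt in ('%B %d, %Y', '%B %Y'):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

A date given only as month and year is placed on the first of that month, which matches the first-of-month completion dates visible in the table further down.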

In [5]:
dtype = {'has_drug_intervention': bool, 
         'phase': str } 
datefields = ['completion_date', 'results_date', 'disposition_date']
df = pd.read_csv(fname,
                 parse_dates=datefields, 
                 infer_datetime_format=True,
                 dtype=dtype)
print len(df), 'trials found'

227798 trials found


In [6]:
df.tail()

Unnamed: 0,nct_id,title,overall_status,study_type,completion_date,lead_sponsor,lead_sponsor_class,collaborator,collaborator_class,phase,locations,has_drug_intervention,drugs,disposition_date,results_date,results_pmids,enrollment
227793,NCT02934282,HBOC-201 Expanded Access Protocol for Life-thr...,Temporarily not available,Expanded Access,NaT,University of Miami,Other,,,,,True,HBOC-201;,NaT,NaT,,
227794,NCT02934295,Study of Rubella Immunity. Response to Vaccina...,"Active, not recruiting",Interventional,2014-05-01,Hopital Foch,Other,,,,France,False,,NaT,NaT,,192.0
227795,NCT02934308,Comparison of the Skin Conductance Values and ...,Recruiting,Interventional,2017-09-01,Hopital Foch,Other,,,,France,False,,NaT,NaT,,60.0
227796,NCT02934321,Assessment of Satiety Following Oral Administr...,Not yet recruiting,Interventional,2017-08-01,University of Florida,Other,,,,United States,False,,NaT,NaT,,25.0
227797,NCT02934334,Wellness Monitoring for Major Depressive Disorder,Enrolling by invitation,Observational,2018-12-01,Sidney Kennedy,Other,,,,Canada,False,,NaT,NaT,,100.0


In [7]:
def normalise_phase(x):
    # Set N/A (trials without phases, e.g. device trials) to 5 (i.e. later than 
    # phase 2, which is our cutoff for inclusion). And set phase 1/2 trials to 1.
    if pd.isnull(x):
        x = 5
    return int(str(x).split('/')[0])
assert normalise_phase(None) == 5
assert normalise_phase('3') == 3
assert normalise_phase('1/2') == 1
df['phase_normalised'] = df['phase'].apply(normalise_phase)
df.phase_normalised.value_counts()

5    102483
2     40150
1     33776
3     26721
4     22759
0      1909
Name: phase_normalised, dtype: int64

### Calculate whether trials are completed

The criteria for counting a trial as completed are defined above.  

In [8]:
startdate = datetime.strptime('01 January 2006', '%d %B %Y')
cutoff = datetime.now() - relativedelta(years=2)
# cutoff = datetime.strptime('28 September 2014', '%d %B %Y') 
print 'Cutoff date', cutoff

df['is_completed'] = (df.overall_status == 'Completed') & \
    (df.completion_date >= startdate) & \
    (df.completion_date <= cutoff) & \
    (df.phase_normalised >= 2) & \
    (df.disposition_date.isnull()) & \
    (df.study_type.str.startswith('Interventional'))
df['is_overdue'] = (df.is_completed & \
                    df.results_date.isnull())
                    # & df.results_pmids.isnull())
df_completed = df[df.is_completed] 
df_overdue = df[df.is_completed & df.results_date.isnull()]
print len(df), 'total trials found'
print len(df_completed), 'are completed and due results, by our definition'
print len(df[df.is_completed & ~df.results_date.isnull()]), \
    'trials due results have submitted results on CT.gov'
print len(df_overdue), \
    'trials due results have not submitted results on CT.gov'
# print int(df_completed.enrollment.sum()), 'total patients enrolled in completed trials'
# print int(df_overdue.enrollment.sum()), 'total patients enrolled in overdue trials'

Cutoff date 2014-10-17 11:47:58.705821
227798 total trials found
48966 are completed and due results, by our definition
14304 trials due results have submitted results on CT.gov
34662 trials due results have not submitted results on CT.gov


## Check for results on PubMed

If trials have reported their results on PubMed, and if it's possible to find them on PubMed using the NCT ID, then we count those trials as having submitted results. 

So, for all trials that we regard as completed and due results, we search PubMed by NCT ID and look for anything that was published between the completion date and now, and that doesn't have the words "study protocol" in the title. (There are a small number of protocols that word things slightly differently, but we'll live with being slightly too generous.)
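The search and the title filter can be sketched without touching the network. `build_esearch_url` assembles the same query as the cell below but URL-encodes the search term, and `has_results_article` applies the "study protocol" rule to a list of titles. Both names are hypothetical, and the HTTPS form of the E-utilities address is an assumption (the cell below uses plain HTTP).

```python
from datetime import date

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_esearch_url(nct_id, completion_date):
    """Build a PubMed search URL for articles linked to nct_id by secondary ID
    and published on or after completion_date."""
    term = '%s[si] AND ("%s"[pdat] : "3000"[pdat])' % (
        nct_id, completion_date.strftime('%Y/%m/%d'))
    query = urlencode([('db', 'pubmed'), ('retmode', 'json'), ('term', term)])
    return ESEARCH_URL + '?' + query

def has_results_article(titles):
    """True if at least one title does not look like a study protocol.
    A missing title (None) is given the benefit of the doubt."""
    for title in titles:
        if title is None or 'study protocol' not in title.lower():
            return True
    return False

url = build_esearch_url('NCT02460380', date(2015, 6, 1))
```

Returning as soon as one non-protocol title is found means a trial with both a protocol paper and a results paper is counted as having results, whatever order PubMed returns them in.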

An example of a trial with results on PubMed: NCT02460380.  At the time of writing, about 14,000 of the 48,000 completed trials have results on PubMed. 

Note that we know from the BMJ studies that there are trials that do have results on PubMed, but that aren't linked using the NCT ID. The BMJ authors found these using a manual search. Some examples: `NCT00002762: 19487378`, `NCT00002879: 18470909`, `NCT00003134: 19066728`, `NCT00003596: 18430910`. We regard these as invalid, because you can only find results via an exhaustive manual search. We only count results as published for our purposes if they are either (i) submitted on ClinicalTrials.gov or (ii) retrievable on PubMed using the NCT ID. 

Note also that there are some trials that have results PMIDs directly in ClinicalTrials.gov, in the `results_reference` field of the XML. After discussion with Jess here, and Annice at ClinicalTrials.gov, I decided that these results are too often meaningless to be useful - lots of the time they aren't truly results, but are studies from years ago.

In [20]:
def get_pubmed_title(pmid):
    '''
    Retrieve the title of a PubMed article, from its PMID.
    '''
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
    url += 'db=pubmed&rettype=abstract&id=%s' % pmid
    try:
        resp = requests.get(url)
    except (ValueError, requests.ConnectionError):
        print 'Error!', url
        time.sleep(10)
        return get_pubmed_title(pmid)
    tree = ElementTree.fromstring(resp.content)
    title = tree.find('.//Article/ArticleTitle')
    if title is not None:
        title = title.text.encode('utf8')
    return title

def get_pubmed_linked_articles(nct_id, completion_date):
    '''
    Given an NCT ID, search PubMed for related results articles.
    '''
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
    url += 'esearch.fcgi?db=pubmed&retmode=json&term='
    url += '%s[si] ' % nct_id
    url += 'AND ("%s"[pdat] : ' % completion_date.strftime('%Y/%m/%d')
    url += '"3000"[pdat]) '
    try:
        resp = requests.get(url)
        data = resp.json()
    except (ValueError, requests.ConnectionError):
        print 'Error!', url
        time.sleep(10)
        return get_pubmed_linked_articles(nct_id, completion_date)
    esearchresult = data['esearchresult']
    ids = []
    if 'idlist' in esearchresult:
        ids = esearchresult['idlist']
#         if len(ids) > 0:
#             print row['nct_id'], ids
    return ids

import sqlite3
conn = sqlite3.connect('trials.db')
cur = conn.cursor()  
c = "CREATE TABLE IF NOT EXISTS trials(nct_id TEXT PRIMARY KEY, pubmed_results BOOLEAN)"
cur.execute(c)
conn.commit()

# Only bother looking for PubMed results for overdue trials
# (but set the value on the original dataframe).
df['pubmed_results'] = False
count = 0
for i, row in df_overdue.iterrows():
    if count % 1000 == 0:
        print count, row.nct_id
    count += 1
    pubmed_results = False
    # Check for results stored locally.
    # Use a parameterised query rather than string formatting.
    cur.execute("SELECT * FROM trials WHERE nct_id=?", (row.nct_id,))
    data = cur.fetchone()
    if data: # and data[1]: # change this to re-grab all from scratch
#         print row.nct_id, 'results exist locally!'
        pubmed_results = bool(data[1])
    else:
        # No local results, scrape PubMed.
        results = get_pubmed_linked_articles(row.nct_id, row.completion_date)
        if results:
            for r in results:
                title = get_pubmed_title(r)
                if not (title and 'study protocol' in title.lower()):
                    # There's at least one PubMed article that isn't a
                    # protocol, so count it as results and stop looking.
                    pubmed_results = True
                    break
    # Use a parameterised query rather than string formatting.
    cur.execute("INSERT OR REPLACE INTO trials VALUES(?, ?)",
                (row.nct_id, int(pubmed_results)))
    conn.commit()
    df.at[i, 'pubmed_results'] = pubmed_results
    
cur.close()
conn.close()
print df[df.is_completed & df.results_date.isnull()].pubmed_results.value_counts()
print 'done'

0 NCT00000176
1000 NCT00094432
2000 NCT00156026
3000 NCT00220402
4000 NCT00280475
5000 NCT00345917
6000 NCT00407095
7000 NCT00471965
8000 NCT00535756
9000 NCT00604357
10000 NCT00661310
11000 NCT00719342
12000 NCT00779493
13000 NCT00834951
14000 NCT00900718
15000 NCT00956215
16000 NCT01009112
17000 NCT01065246
18000 NCT01121692
19000 NCT01179074
20000 NCT01237795
21000 NCT01297725
22000 NCT01357265
23000 NCT01421043
24000 NCT01482273
25000 NCT01544868
26000 NCT01609491
27000 NCT01681719
28000 NCT01754077
29000 NCT01826409
30000 NCT01912430
31000 NCT02010320
32000 NCT02128464
33000 NCT02301598
34000 NCT02599181
False    25742
True      8920
Name: pubmed_results, dtype: int64
done


### Calculate final overdue count

Now we have looked for PubMed results, we can calculate the final overdue count.

In [21]:
df['is_overdue'] = (df.is_completed & df.results_date.isnull() & ~df.pubmed_results)
df_overdue = df[df.is_overdue]
print len(df_overdue), 'trials have not published results'
percent_submitted = (1 - (len(df_overdue) / float(len(df_completed)))) * 100
print '%s%% of completed trials have published results' % \
    '{:,.2f}'.format(percent_submitted)
print int(df_overdue.enrollment.sum()), 'total patients are enrolled in overdue trials'

25742 trials have not published results
47.43% of completed trials have published results
9212256 total patients are enrolled in overdue trials
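The figures printed above are consistent with each other, which can be checked with a few lines of arithmetic. The counts are copied from the outputs in this notebook; the variable names are ours.

```python
# Counts copied from the cell outputs above.
due = 48966                    # completed and due results, by our definition
with_ctgov_results = 14304     # have submitted results on CT.gov
without_ctgov_results = 34662  # have not submitted results on CT.gov
with_pubmed_only = 8920        # of those, have results linked on PubMed

# Trials with neither CT.gov results nor linked PubMed results.
overdue = without_ctgov_results - with_pubmed_only

# Share of due trials that have published results somewhere.
percent_published = (1 - overdue / float(due)) * 100
```

Here `overdue` works out to 25742 and `percent_published` rounds to 47.43, matching the output of the cell above.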


## Write to CSV

Output final results to a CSV file, which we will use in the interactive version. We reshape the data so it has a row for each sponsor, and columns by year - two columns for each year, one for the number of completed trials with overdue results, and one for the total completed trials. 
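The target layout can be illustrated with a small standard-library sketch that builds one row per sponsor, with `<year>_overdue` and `<year>_total` columns. The input tuples and sponsor names are invented for illustration, and `reshape_by_sponsor` is a hypothetical helper; the cell below does the equivalent with pandas.

```python
from collections import defaultdict

def reshape_by_sponsor(trials):
    """Turn (sponsor, year_completed, is_overdue) tuples into
    {sponsor: {'<year>_total': n, '<year>_overdue': m}}."""
    rows = defaultdict(lambda: defaultdict(int))
    for sponsor, year, overdue in trials:
        rows[sponsor]['%d_total' % year] += 1
        rows[sponsor]['%d_overdue' % year] += int(overdue)
    return dict((sponsor, dict(cols)) for sponsor, cols in rows.items())

# Invented example data.
example = [
    ('Sponsor A', 2008, True),
    ('Sponsor A', 2008, False),
    ('Sponsor A', 2009, False),
    ('Sponsor B', 2008, True),
]
table = reshape_by_sponsor(example)
```

In this sketch a sponsor only gets columns for years in which it completed at least one trial, so a real export would still need to fill in the missing years to give every row the same columns.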

In [25]:
# We're only interested in the completed trials, and we will divide the data up by year.
# Re-select from df (as a copy) so that we pick up the final is_overdue values
# calculated above, and so that adding a column doesn't modify a slice of df.
df_completed = df[df.is_completed].copy()
df_completed['year_completed'] = df_completed['completion_date'].dt.year.astype(int)

# Drop all sponsors with 40 or fewer completed trials.
NUM_TRIALS = 40
df_final = df_completed[
    df_completed.groupby('lead_sponsor').nct_id.transform(len) > NUM_TRIALS]

# Now reshape the data: a row for each sponsor, columns by year:
# lead_sponsor,2008_overdue,2008_total,2009_overdue,2009_total...
df_temp = df_final.set_index(['lead_sponsor', 'year_completed']) 
gb = df_temp.groupby(level=[0, 1]).is_overdue
df2 = gb.agg({'overdue': 'sum', 'total': 'count'}) \
          .unstack().swaplevel(0, 1, 1).sort_index(1)
df2.columns = df2.columns.to_series().apply(lambda x: '{}_{}'.format(*x))
df3 = df2.reset_index()
df3['lead_sponsor_slug'] = df3.lead_sponsor.apply(slugify)
df3.to_csv('../data/completed.csv', index=None)



### TODO: Compare our results with PLOS/BMJ authors

A [2016 BMJ paper](http://www.bmj.com/content/352/bmj.i637) found that around 65% of trials reported results. They were looking at a subset of the trials we're including. They also used a manual search strategy, which involved searching Scopus and manually comparing results. 

They have a higher match rate because they did manual search on keywords. 

In [None]:
from openpyxl import load_workbook
import sys
bmj_results = load_workbook(filename = './data/chen-bmj.xlsx')

In [None]:
nct_ids = {}
for sheet in bmj_results.worksheets:
    for i, row in enumerate(sheet.rows):
        if i == 0:
            continue
        if row[0].value:
#             print row[0].value, row[6].value, type(row[6].value)
            if isinstance(row[6].value, long):
                nct_ids[row[0].value] = str(row[6].value)
            else:
                nct_ids[row[0].value] = row[6].value
print len(nct_ids.keys()), 'NCT IDs found in the BMJ data'
print sum(1 for x in nct_ids.values() if x), 'of these have PMIDs'

In [None]:
bmj_pmids_missing_in_our_data = {}
# The PMIDs from the CT.gov XML are stored in the results_pmids column.
df_indexed = df_overdue.set_index('nct_id')[['results_pmids']]
# print df_indexed.head(30)
df_indexed_with_pmids = df_indexed[~df_indexed.results_pmids.isnull()]
print len(df_indexed), 'trials missing results'
print len(df_indexed_with_pmids), 'trials of these have PMIDs, according to the CT.gov XML'

count_found = 0
count_not_found = 0
for k in nct_ids:
    try:
        row = df_indexed.loc[k]
#         print k, row
        count_found += 1
    except KeyError:
#         print k, 'not found'
        count_not_found += 1
        
print count_found, 'of the BMJ NCT IDs found in our overdue trials'
print count_not_found, 'of the BMJ NCT IDs not found in our overdue trials'
#     # Check how many of these have PMID in our data.
# # print nct_ids['NCT01001741']
# # print nct_ids['NCT00145795']