ClinicalTrials.gov ACT Tracker
This software processes the subset of trial registry data from ClinicalTrials.gov that is subject to FDAAA 2017, i.e. trials which come with a legal obligation to report results. Such trials are known as ACTs (Applicable Clinical Trials) or pACTs (probable ACTs).
This subset is displayed on a website that makes tracking and reporting easier.
- Python script
  - downloads a zip archive of clinical trials registry data from ClinicalTrials.gov
  - converts the XML to JSON
  - uploads it to BigQuery
  - runs SQL to transform it to a tabular format, including fields to identify ACTs and their lateness
  - downloads the SQL output as a CSV file
- Django management command
  - imports the CSV file into Django models
  - precomputes aggregate statistics and turns these into rankings
  - handles other metadata (in particular, hiding trials that are no longer ACTs)
  - scrapes the ClinicalTrials.gov website directly for metadata not in the zip (specifically, trials which have been submitted but are still going through a QA process)
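The CSV import step of the management command could be sketched like this. This is an illustration only: the column names (`nct_id`, `act_flag`, `days_late`) and coercion rules are assumptions, and the real command loads rows into Django models rather than plain dicts.

```python
import csv
import io

# Hypothetical sample of the intermediate CSV; the real schema is
# defined by the SQL transformation step.
SAMPLE_CSV = """nct_id,act_flag,days_late
NCT00000001,1,30
NCT00000002,0,
"""

def parse_rows(text):
    """Yield one cleaned dict per CSV row, coercing types as we go."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "nct_id": row["nct_id"],
            "is_act": row["act_flag"] == "1",  # "1"/"0" flag -> bool
            "days_late": int(row["days_late"]) if row["days_late"] else None,
        }

rows = list(parse_rows(SAMPLE_CSV))
# In the real management command, each dict would feed something like a
# Django update_or_create() call instead of being collected in a list.
```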
These two commands are run daily via a fab script, and the results are loaded into a staging database / website.
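The "aggregate statistics turned into rankings" step might be computed along these lines. This is a sketch under stated assumptions: the metric (share of due trials reported) and the dense-ranking tie rule are invented here for illustration, not taken from the project's actual logic.

```python
def rank_sponsors(stats):
    """Dense-rank sponsors by share of due trials reported (descending).

    `stats` maps sponsor -> (trials_reported, trials_due). Both the
    metric and the tie-handling are illustrative assumptions.
    """
    scores = {
        sponsor: (reported / due if due else 0.0)
        for sponsor, (reported, due) in stats.items()
    }
    ordered = sorted(set(scores.values()), reverse=True)
    # Dense ranking: sponsors with equal scores share a rank.
    return {sponsor: ordered.index(score) + 1 for sponsor, score in scores.items()}

ranks = rank_sponsors({
    "Sponsor A": (9, 10),   # 90% reported
    "Sponsor B": (9, 10),   # ties with Sponsor A
    "Sponsor C": (1, 10),   # 10% reported
})
```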
A separate command copies new data from staging to production (following moderation).
Much of the complex logic is expressed in SQL, which makes it hard to read and test. This is a legacy of splitting development between academics with the domain expertise (who could use SQL to prototype) and software engineers. Now that the project has been running for a while and development iterations are less frequent, a useful project would be to move as much of this logic as possible to Python.
Similarly, the only reason step (1) exists is to create a CSV which can be imported into the database. That CSV is useful in its own right for QA by our academics, but the XML and JSON artefacts are just intermediate formats that could legitimately be dropped in a refactored solution (and the CSV could be generated directly from the database).
The historic reason for the XML -> JSON route is that BigQuery includes a number of useful JSON functions which people competent in SQL can use to manipulate the data. At the time of writing, there is an open issue with some ideas about refactoring this process.
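For illustration, the XML -> JSON conversion step might look like the following minimal sketch. The element names are invented; real ClinicalTrials.gov records are deeply nested and need considerably more care.

```python
import json
import xml.etree.ElementTree as ET

def record_to_json(xml_text):
    """Flatten one registry record's top-level elements into a JSON object.

    Hypothetical shape: real records have nested structure that a
    one-level dict comprehension would lose.
    """
    root = ET.fromstring(xml_text)
    return json.dumps({child.tag: child.text for child in root})

sample = (
    "<clinical_study>"
    "<nct_id>NCT00000001</nct_id>"
    "<overall_status>Completed</overall_status>"
    "</clinical_study>"
)

as_json = record_to_json(sample)
```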
There is a simple system to allow non-technical users to generate pages using markdown. It is documented here.
Install these Python development packages before you begin. For example, on a Debian-based system:

```sh
apt install python3
```
Using Python 3, create and enter a virtualenv, as described here. For example:

```sh
python3 -m venv venv
. venv/bin/activate
```
Install the required Python packages. Note that `pip-sync` is a command provided by the `pip-tools` package, not a package itself:

```sh
pip install pip-tools
pip-sync
```
Set the required environment variables (edit `environment` and then run
Check out the repository.

```sh
cd ..
git clone firstname.lastname@example.org:ebmdatalab/clinicaltrials-act-tracker.git
cd -
```
Run the application.

```sh
cd clinicaltrials
./manage.py runserver
```
There are a few tests.

```sh
coverage run --source='.' manage.py test
```
Make a coverage report:

```sh
coverage html -d /tmp/coverage_html
```
We use Fabric to deploy over SSH to a pet server. Deploy with
The code and data are updated via git from the master branch of their repositories.
The configuration is in `fabfile.py` and the
When setting up a new server, put environment settings live in
Updating data takes around 2 hours. To do it manually, first run (from your local sandbox):

This downloads and processes the data and puts it on the staging site. It is launched as a background process using `dtach`. If you're happy with this, copy data across to the live database (warning: this overwrites existing data!) with:
The target server requires `dtach` (`apt-get install dtach`) to be installed for any users who might run fabric scripts, e.g. you (the developer) and the user (see below).