v25 The Ben Goldacre edition

@brunoamaral brunoamaral released this 11 Jun 12:55
· 66 commits to main since this release
02d034f

A few months ago I found out about OpenTrials.net, a project by Prof. Ben Goldacre that aimed to "locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally".

This edition is a show of appreciation for Prof. Goldacre's previous work.

GregoryAI is a modest answer to the same problem. This version improves the way we fetch clinical trials from the world's top 3 registries. We now rely more on registry identifiers to ensure the data is sound; the tradeoff is that a trial listed in two or more registries may show up as a few duplicate records.

As a result, subscribers to Brain-Regeneration alerts may receive some duplicate alerts. I am working on this, starting with a chronological record of the raw data that I can analyse in more detail.

Gregory AI v25

Range: v24 (2026-05-30) → main (2026-06-10). 15 merged PRs, ~113 commits.

Highlights

  • Clinical trial identity was rebuilt around registry identifiers. Trials with the same title are no longer merged into one record when their registry IDs say they are different studies, and a trial's link no longer flip-flops between sources on every import.
  • Nothing gets lost anymore: a new links field on trials and articles keeps one URL per source. The main link is set by whichever source arrived first and stays put.
  • Richer trial data: a dedicated parser for the EU CTIS feed, new fields from the WHO/ICTRP export (acronym, secondary sponsor, results information), and a fix for EU dates that were being read month-first — 8 December was becoming 12 August.
  • The trials API now exposes every field, adds lookups by registry identifier (NCT, EudraCT, EUCT, CTIS), and exports to Excel.
  • Categories distinguish manual curation from automatic matching: rebuilds never touch assignments made by a human, and the pipeline only re-categorizes content that changed instead of everything, every time.
  • The test suite now runs on GitHub Actions on every push, in parallel.

⚠️ Before you upgrade

Three things to know before deploying. Details in the Breaking changes and Upgrade sections below.

  • API clients: the trial field retrospective_flag is now called prospective_registration (same values, clearer name). Update anything that reads the old name.
  • Migration 0050 adds and drops indexes on the largest tables — plan a short maintenance window.
  • Migration 0054 refuses to apply if the database holds real duplicate registry IDs. New commands help you find and merge those duplicates first.

What's new

Trial identity and de-duplication

GregoryAI ingests the same real-world trials from ClinicalTrials.gov, the EU registers, and the WHO portal. Until now, deciding whether two incoming records were "the same trial" leaned too much on the title — which merged distinct studies that shared a title, and let two sources fight over a single trial's link, overwriting each other on every import.

  • Registry identifiers now lead. A record with a matching title is no longer treated as the same trial when its registry identifiers point to a different study.
  • The database no longer enforces one globally unique title. Instead, each registry identifier (NCT, EudraCT, EUCT, EUCTR, CTIS) is unique on its own.
  • A trial's main link is set once, by the first source that reported it. The new links field keeps one URL per source — keyed by registry or hostname — so every source's address is preserved and visible in the admin.
  • Three new management commands:
    • audit_trial_merges — flags historical records that were probably merged wrongly, so you can review them.
    • merge_trials — merges confirmed duplicates into one trial, moving all related data before deleting the spares.
    • capture_trial_streams — records the raw inbound trial feeds to a file without touching the database, useful for analysing what the registries actually send.
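
The identifier-first rule and the "first source wins" link behaviour described above can be sketched as follows. This is a minimal illustration, not the project's code: the dict-based record shape, the function names, and the title fallback when two records share no registry are all assumptions.

```python
# Registry identifier names taken from the notes above; using them as plain
# dict keys is an assumption made for this sketch.
REGISTRY_KEYS = ("nct", "eudract", "euct", "euctr", "ctis")


def same_trial(existing: dict, incoming: dict) -> bool:
    """Decide whether two records describe the same trial, identifiers first."""
    shared = [k for k in REGISTRY_KEYS if existing.get(k) and incoming.get(k)]
    if shared:
        # When both records carry an ID from the same registry, that ID
        # decides the question and the title is ignored.
        return any(existing[k].upper() == incoming[k].upper() for k in shared)
    # Assumed fallback: with no registry in common, compare non-empty titles.
    a = existing.get("title", "").strip().lower()
    b = incoming.get("title", "").strip().lower()
    return bool(a) and a == b


def record_link(trial: dict, source: str, url: str) -> None:
    """Keep one URL per source; the main link is set once and never replaced."""
    trial.setdefault("links", {})[source] = url
    if not trial.get("link"):
        trial["link"] = url
```

With this shape, two records titled the same but carrying different NCT numbers stay separate, and a later import from a second source adds to `links` without changing `link`.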

Trial ingestion

  • New parser for the EU CTIS RSS feed extracts far more detail from each entry.
  • New fields from the WHO/ICTRP export: trial acronym, secondary sponsor, whether results are available, and the plan for sharing individual participant data.
  • New results_posted field, with proper parsing of results from ClinicalTrials.gov.
  • Fixed a date bug in the EU feed: dates are day-first (DD/MM/YYYY) but were being parsed month-first, silently shifting dates like 8 December to 12 August.
  • Feeds no longer overwrite existing data with blanks when a source omits a field.
  • The WHO importer records proper change history again.
  • Plain-language labels and help texts throughout the trial admin, with references to where each value comes from.
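
Two of the fixes above are easy to show in isolation: parsing EU dates day-first, and refusing to overwrite stored values with blanks. The sketch below uses only the standard library; the function names and the dict-based records are hypothetical, and the year in the usage note is an arbitrary example.

```python
from datetime import date, datetime


def parse_eu_date(text: str) -> date:
    """Parse a day-first DD/MM/YYYY date string."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


def parse_month_first(text: str) -> date:
    """The mistaken month-first reading, kept only to show the shift."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def merge_without_blanks(existing: dict, incoming: dict) -> dict:
    """Update fields from a feed, skipping values the source left blank."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value is None or value == "":
            continue  # a missing field must not erase stored data
        merged[key] = value
    return merged
```

For example, `parse_eu_date("08/12/2025")` gives 8 December 2025, while the month-first reading of the same string gives 12 August 2025.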

Trials API and exports

  • The API now returns every trial field, including all the new ones.
  • New filters to look up trials by registry identifier: ?identifiers= matches across all registries at once, or scope to one with ?nct=, ?eudract=, ?euct=, ?ctis=. All accept comma-separated lists, match case-insensitively, and are backed by new database indexes.
  • Filter by ?acronym= and ?has_results=true. A trial counts as having results when any of these is present: results posted, a results date, a results link, or "results available: yes". Acronyms are now populated from three sources: the WHO/ICTRP export, the live ClinicalTrials.gov feed (captured from this release onwards), and a one-time backfill_trial_acronyms command that fills historical CTGov rows from the registry API (idempotent and safe to rerun).
  • New export_trials_xlsx command produces an Excel workbook with one sheet per subject, scoped to a team.
  • Trial CSV downloads now stream like article CSVs, so large exports no longer time out.
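
A client might build query URLs for the new filters along these lines. The parameter names come from the notes above; the base URL, the endpoint path, and the helper name are placeholders, so substitute the address of your own instance.

```python
from urllib.parse import urlencode

# Hypothetical base URL and path, used only for illustration.
BASE_URL = "https://api.example.org/trials/"


def trials_url(**filters) -> str:
    """Build a trials query URL; list values become comma-separated."""
    params = {}
    for name, value in filters.items():
        if isinstance(value, bool):
            value = str(value).lower()  # True -> "true"
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)
        params[name] = value
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"
```

Calling `trials_url(nct=["NCT00000001", "NCT00000002"])` produces a single `?nct=` parameter with both identifiers separated by a comma; the identifiers here are made up.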

Categories

  • Categories now have a type: manual (curated by hand) or automatic (matched by terms).
  • Every category assignment records whether a human or the matcher created it. Automatic rebuilds only ever touch the matcher's own assignments — manual curation is never wiped.
  • The pipeline categorizes incrementally by default: only content that changed since the last run is re-checked (30-day window, configurable with --categories-days; use --full-category-rebuild for a full pass).
  • rebuild_categories now syncs by difference — it adds and removes only the assignments that changed, in stable batches, instead of rewriting everything.
  • New automatic categories are matched against existing content as soon as they are created, and editing a category's terms re-matches it immediately.
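
The "sync by difference" idea can be illustrated with plain sets. This is a sketch under assumptions: the data shapes, the "manual"/"automatic" source labels, and the function name are invented for the example, and it leaves out the batching that the real command does.

```python
def sync_automatic(current: dict, matched: set) -> tuple:
    """Compute the changes for one category without touching manual work.

    current maps item id -> "manual" or "automatic" (who made the assignment).
    matched is the set of item ids the term matcher selects right now.
    Returns (ids to add, ids to remove).
    """
    automatic = {i for i, source in current.items() if source == "automatic"}
    manual = set(current) - automatic
    # Add only what is newly matched and not already assigned by anyone.
    to_add = matched - automatic - manual
    # Remove only the matcher's own assignments that no longer match.
    to_remove = automatic - matched
    return to_add, to_remove
```

Manual assignments never appear in either result, so a rebuild cannot wipe them, and unchanged automatic assignments are not rewritten.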

Articles and subjects

  • Articles get the same links treatment as trials: the canonical link is whichever source arrived first, and every other source URL is kept in the new links field.
  • Subjects now keep edit history.

Ops, settings, and CI

  • New GitHub Actions workflow runs the test suite on every push, in parallel, with faster Docker image builds.
  • DEBUG is now driven by the DJANGO_DEBUG environment variable and defaults to off. The container picks the right server automatically: gunicorn in production, Django's dev server when debugging.
  • Database optimization (migration 0050): adds indexes to the hot paths on Articles and Trials and drops redundant ones, with a step-by-step production runbook.
  • The admin bulk action "Disable all emails" now also unsubscribes the person from every list, so the global flag and per-list subscriptions can no longer drift apart.
  • New prepare_v24_upgrade helper for anyone still upgrading from v23, and the v24 release documents moved to docs/releases/v24/.
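
For readers wondering what "driven by DJANGO_DEBUG and defaults to off" usually looks like in a settings file, here is one common pattern. It is an assumption about the approach, not a copy of the project's settings; the helper name and the accepted truthy spellings are illustrative.

```python
import os


def env_flag(name: str, default: bool = False, environ=None) -> bool:
    """Read a boolean flag from the environment, defaulting to off."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    # Accepted truthy spellings are a choice made for this sketch.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# In a settings module this would read the real environment.
DEBUG = env_flag("DJANGO_DEBUG")
```

An unset variable yields `False`, which is why local setups need an explicit `DJANGO_DEBUG=True`.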

⚠️ Breaking changes

  • retrospective_flag renamed to prospective_registration on the Trials model and in the API. It is a pure rename — values are unchanged — but any client reading the old field name must update.
  • Trial title uniqueness replaced by per-registry identifier uniqueness. Migration 0054 checks for real duplicate registry IDs first and fails loudly if it finds any, listing them. Run audit_trial_merges to review and merge_trials to fix, then re-run migrations.
  • DEBUG defaults to off. Local development setups must set DJANGO_DEBUG=True in .env (see example.env).
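
API clients that must work against both old and new servers during a rollout could read the renamed field with a fallback. A small sketch, assuming the trial arrives as a decoded JSON dict; the helper name is hypothetical.

```python
def prospective_registration(trial: dict):
    """Return the renamed field, falling back to the pre-v25 name."""
    if "prospective_registration" in trial:
        return trial["prospective_registration"]
    # Older servers still send the previous name; the values are unchanged.
    return trial.get("retrospective_flag")
```

Once every server you talk to runs v25, the fallback branch can be removed.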

Upgrade

  1. Back up the database and confirm the dump is restorable.
  2. Add DJANGO_DEBUG to your .env (False in production, True for local development).
  3. Apply migrations during a short maintenance window: migration 0050 adds and drops indexes on the largest tables. See apply_0050_prod_runbook.md.
  4. If migration 0054 fails with duplicate registry IDs, review them with audit_trial_merges, merge with merge_trials, and re-run migrations.
  5. Update API clients that read retrospective_flag to use prospective_registration.
  6. Backfill acronyms for historical ClinicalTrials.gov trials: docker exec gregory python manage.py backfill_trial_acronyms. The command is idempotent — rerunning is safe if interrupted.
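
Before step 3, it can help to know whether migration 0054 is likely to find duplicate registry IDs. The sketch below checks a list of exported trial rows for repeated identifiers. It is not the migration's own check: the row shape, the field name, and the case-insensitive comparison are assumptions.

```python
from collections import defaultdict


def duplicate_identifiers(trials: list, field: str) -> dict:
    """Map each repeated identifier value to the ids of the trials using it."""
    seen = defaultdict(list)
    for trial in trials:
        value = trial.get(field)
        if value:  # blank identifiers cannot collide
            seen[value.strip().upper()].append(trial["id"])
    return {value: ids for value, ids in seen.items() if len(ids) > 1}
```

Any identifier that appears in the result is a candidate for review with audit_trial_merges and, once confirmed, merge_trials.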