A few months ago I found out about OpenTrials.net by Prof. Ben Goldacre. It was a project that aimed to "locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally".
This edition is a show of appreciation for Prof. Goldacre's previous work.
GregoryAI is a modest answer to the same problem. In this version we are making a number of improvements to the way we fetch clinical trials from the world's top 3 registries. We now rely more on registry identifiers to keep the data sound. The tradeoff is that we may now have a few duplicates when a trial is listed in two or more registries.
Subscribers to Brain-Regeneration alerts may get some duplicate alerts. It's a problem I am trying to solve by first keeping a chronological record of the raw data to analyse in more detail.
Gregory AI v25
Range: v24 (2026-05-30) → main (2026-06-10). 15 merged PRs, ~113 commits.
Highlights
- Clinical trial identity was rebuilt around registry identifiers. Trials with the same title are no longer merged into one record when their registry IDs say they are different studies, and a trial's link no longer flip-flops between sources on every import.
- Nothing gets lost anymore: a new `links` field on trials and articles keeps one URL per source. The main link is set by whichever source arrived first and stays put.
- Richer trial data: a dedicated parser for the EU CTIS feed, new fields from the WHO/ICTRP export (acronym, secondary sponsor, results information), and a fix for EU dates that were being read month-first, so that 8 December was becoming 12 August.
- The trials API now exposes every field, adds lookups by registry identifier (NCT, EudraCT, EUCT, CTIS), and exports to Excel.
- Categories distinguish manual curation from automatic matching: rebuilds never touch assignments made by a human, and the pipeline only re-categorizes content that changed instead of everything, every time.
- The test suite now runs on GitHub Actions on every push, in parallel.
⚠️ Before you upgrade
Three things to know before deploying. Details in the Breaking changes and Upgrade sections below.
- API clients: the trial field `retrospective_flag` is now called `prospective_registration` (same values, clearer name). Update anything that reads the old name.
- Migration 0050 adds and drops indexes on the largest tables, so plan a short maintenance window.
- Migration 0054 refuses to apply if the database holds real duplicate registry IDs. New commands help you find and merge those duplicates first.
What's new
Trial identity and de-duplication
GregoryAI ingests the same real-world trials from ClinicalTrials.gov, the EU registers, and the WHO portal. Until now, deciding whether two incoming records were "the same trial" leaned too much on the title — which merged distinct studies that shared a title, and let two sources fight over a single trial's link, overwriting each other on every import.
- Registry identifiers now lead. A record with a matching title is no longer treated as the same trial when its registry identifiers point to a different study.
- The database no longer enforces one globally unique title. Instead, each registry identifier (NCT, EudraCT, EUCT, EUCTR, CTIS) is unique on its own.
- A trial's main link is set once, by the first source that reported it. The new `links` field keeps one URL per source, keyed by registry or hostname, so every source's address is preserved and visible in the admin.
- Three new management commands:
  - `audit_trial_merges`: flags historical records that were probably merged wrongly, so you can review them.
  - `merge_trials`: merges confirmed duplicates into one trial, moving all related data before deleting the spares.
  - `capture_trial_streams`: records the raw inbound trial feeds to a file without touching the database, useful for analysing what the registries actually send.
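The two rules above (identifiers outrank titles, and the first source sets the main link) can be sketched roughly as follows. This is only an illustration, not GregoryAI's actual code: the `Trial` dataclass, the helper names, and the placeholder identifiers and URLs are all hypothetical, and comparing identifiers case-insensitively is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical in-memory stand-in for a trial record, not the real Django model.
@dataclass
class Trial:
    title: str
    identifiers: dict = field(default_factory=dict)  # e.g. {"nct": "NCT00000001"}
    link: Optional[str] = None                       # main link, set once
    links: dict = field(default_factory=dict)        # one URL per source


def identifiers_conflict(a: dict, b: dict) -> bool:
    """True when both records carry an ID for the same registry and the IDs differ."""
    shared = set(a) & set(b)
    # Assumption: identifiers are compared case-insensitively.
    return any(a[k].strip().upper() != b[k].strip().upper() for k in shared)


def title_match_allowed(existing: Trial, incoming: Trial) -> bool:
    """A matching title only counts when the registry identifiers do not disagree."""
    same_title = existing.title.strip().lower() == incoming.title.strip().lower()
    return same_title and not identifiers_conflict(
        existing.identifiers, incoming.identifiers
    )


def add_source_link(trial: Trial, source: str, url: str) -> None:
    """Keep one URL per source; the first source to arrive sets the main link."""
    trial.links[source] = url
    if trial.link is None:
        trial.link = url
```

With this shape, two records titled "Study X" but carrying different NCT numbers stay separate, and a second source adds an entry to `links` without changing `link`.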
Trial ingestion
- New parser for the EU CTIS RSS feed extracts far more detail from each entry.
- New fields from the WHO/ICTRP export: trial acronym, secondary sponsor, whether results are available, and the plan for sharing individual participant data.
- New `results_posted` field, with proper parsing of results from ClinicalTrials.gov.
- Fixed a date bug in the EU feed: dates are day-first (DD/MM/YYYY) but were being parsed month-first, silently shifting dates like 8 December to 12 August.
- Feeds no longer overwrite existing data with blanks when a source omits a field.
- The WHO importer records proper change history again.
- Plain-language labels and help texts throughout the trial admin, with references to where each value comes from.
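To make the EU date bug concrete, here is a small sketch of the difference between the two readings, using only Python's standard library. The `parse_eu_date` helper is hypothetical and the year is made up for the example; the real parser may be structured differently.

```python
from datetime import date, datetime


def parse_eu_date(raw: str) -> date:
    """Parse a DD/MM/YYYY string with an explicit day-first format."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y").date()


raw = "08/12/2025"  # illustrative value only

day_first = parse_eu_date(raw)                           # intended reading
month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # the buggy reading

print(day_first.isoformat())    # 2025-12-08
print(month_first.isoformat())  # 2025-08-12
```

Stating the format explicitly also means an unexpected value raises an error instead of being guessed at.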
Trials API and exports
- The API now returns every trial field, including all the new ones.
- New filters to look up trials by registry identifier: `?identifiers=` matches across all registries at once, or scope to one with `?nct=`, `?eudract=`, `?euct=`, `?ctis=`. All accept comma-separated lists, match case-insensitively, and are backed by new database indexes.
- Filter by `?acronym=` and `?has_results=true` (results posted, a results date, a results link, or "results available: yes"). Acronyms are now populated from all three sources: the WHO/ICTRP export, the live ClinicalTrials.gov feed (captured from this release onwards), and a one-time `backfill_trial_acronyms` command that fills historical CTGov rows from the registry API. The command is idempotent and safe to rerun.
- New `export_trials_xlsx` command produces an Excel workbook with one sheet per subject, scoped to a team.
- Trial CSV downloads now stream like article CSVs, so large exports no longer time out.
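As a rough illustration of the new lookup filters, the sketch below builds query URLs with the standard library. The base URL and the `/trials/` path are assumptions, the `trials_url` helper is hypothetical, and the identifiers are placeholders; only the parameter names come from these notes.

```python
from urllib.parse import urlencode

# Hypothetical base URL: replace with the trials endpoint of your own deployment.
BASE_URL = "https://api.example.org/trials/"


def trials_url(**filters) -> str:
    """Build a trials query URL; list values become comma-separated lists."""
    params = {
        key: ",".join(value) if isinstance(value, (list, tuple)) else str(value)
        for key, value in filters.items()
    }
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


print(trials_url(nct=["NCT00000001", "NCT00000002"]))
# https://api.example.org/trials/?nct=NCT00000001,NCT00000002
print(trials_url(identifiers="2020-000001-01", has_results="true"))
# https://api.example.org/trials/?identifiers=2020-000001-01&has_results=true
```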
Categories
- Categories now have a type: manual (curated by hand) or automatic (matched by terms).
- Every category assignment records whether a human or the matcher created it. Automatic rebuilds only ever touch the matcher's own assignments — manual curation is never wiped.
- The pipeline categorizes incrementally by default: only content that changed since the last run is re-checked (30-day window, configurable with `--categories-days`; use `--full-category-rebuild` for a full pass).
- `rebuild_categories` now syncs by difference: it adds and removes only the assignments that changed, in stable batches, instead of rewriting everything.
- New automatic categories are matched against existing content as soon as they are created, and editing a category's terms re-matches it immediately.
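The idea of syncing by difference while leaving manual curation alone can be sketched with plain sets. This is a simplified model under stated assumptions, not the code behind the category rebuild: the function name and the `"manual"` / `"automatic"` labels are hypothetical, and batching is left out.

```python
def sync_automatic_assignments(current: dict, desired: set):
    """
    current: item id -> "manual" or "automatic" (who created the assignment)
    desired: item ids the matcher selects right now
    Returns (to_add, to_remove) so that only the differences are written.
    """
    automatic = {item for item, source in current.items() if source == "automatic"}
    to_add = desired - set(current)   # already-assigned items are left as they are
    to_remove = automatic - desired   # manual assignments are never candidates
    return to_add, to_remove


current = {1: "manual", 2: "automatic", 3: "automatic"}
desired = {3, 4}

to_add, to_remove = sync_automatic_assignments(current, desired)
print(to_add)     # {4}
print(to_remove)  # {2}
```

Item 1 is no longer matched by the terms, but because a human assigned it, it is neither removed nor rewritten.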
Articles and subjects
- Articles get the same `links` treatment as trials: the canonical link is whichever source arrived first, and every other source URL is kept in the new `links` field.
- Subjects now keep edit history.
Ops, settings, and CI
- New GitHub Actions workflow runs the test suite on every push, in parallel, with faster Docker image builds.
- `DEBUG` is now driven by the `DJANGO_DEBUG` environment variable and defaults to off. The container picks the right server automatically: gunicorn in production, Django's dev server when debugging.
- Database optimization (migration 0050): adds indexes to the hot paths on Articles and Trials and drops redundant ones, with a step-by-step production runbook.
- The admin bulk action "Disable all emails" now also unsubscribes the person from every list, so the global flag and per-list subscriptions can no longer drift apart.
- New `prepare_v24_upgrade` helper for anyone still upgrading from v23, and the v24 release documents moved to `docs/releases/v24/`.
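For anyone wondering what "driven by an environment variable, off by default" usually looks like in a Django settings file, here is a minimal sketch. The `env_flag` helper and the list of accepted spellings are assumptions; GregoryAI's own settings may parse the value differently.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean from the environment; unrecognised values count as False."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings mean "on"; everything else means "off".
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Off unless DJANGO_DEBUG is explicitly enabled.
DEBUG = env_flag("DJANGO_DEBUG")
```

Defaulting to off means a missing or misspelled variable fails towards the production-safe behaviour.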
⚠️ Breaking changes
- `retrospective_flag` renamed to `prospective_registration` on the Trials model and in the API. It is a pure rename (values are unchanged), but any client reading the old field name must update.
- Trial title uniqueness replaced by per-registry identifier uniqueness. Migration 0054 checks for real duplicate registry IDs first and fails loudly if it finds any, listing them. Run `audit_trial_merges` to review and `merge_trials` to fix, then re-run migrations.
- `DEBUG` defaults to off. Local development setups must set `DJANGO_DEBUG=True` in `.env` (see `example.env`).
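To show what "real duplicate registry IDs" means in practice, the sketch below groups rows by one identifier and reports any value held by more than one trial. It is only an illustration of the kind of check involved: the `find_duplicate_ids` helper, the row layout, and the case-insensitive comparison are assumptions, and the supported route remains `audit_trial_merges` and `merge_trials`.

```python
from collections import defaultdict


def find_duplicate_ids(trials, field):
    """Return {identifier: [trial ids]} for identifiers held by more than one trial."""
    seen = defaultdict(list)
    for trial in trials:
        value = trial.get(field)
        if value:  # blanks are not identifiers, so they never count as duplicates
            # Assumption: identifiers are compared case-insensitively.
            seen[value.strip().upper()].append(trial["id"])
    return {ident: ids for ident, ids in seen.items() if len(ids) > 1}


# Hypothetical rows with placeholder identifiers.
rows = [
    {"id": 1, "nct": "NCT00000001"},
    {"id": 2, "nct": "nct00000001"},
    {"id": 3, "nct": None},
    {"id": 4, "nct": ""},
]
print(find_duplicate_ids(rows, "nct"))  # {'NCT00000001': [1, 2]}
```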
Upgrade
- Back up the database and confirm the dump is restorable.
- Add `DJANGO_DEBUG` to your `.env` (`False` in production, `True` for local development).
- Apply migrations during a short maintenance window: migration 0050 rebuilds indexes on the largest tables. See `apply_0050_prod_runbook.md`.
- If migration 0054 fails with duplicate registry IDs, review them with `audit_trial_merges`, merge with `merge_trials`, and re-run migrations.
- Update API clients that read `retrospective_flag` to use `prospective_registration`.
- Backfill acronyms for historical ClinicalTrials.gov trials: `docker exec gregory python manage.py backfill_trial_acronyms`. The command is idempotent, so rerunning is safe if interrupted.
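API clients that have to talk to both upgraded and not-yet-upgraded servers during the transition could read the renamed field with a small fallback. This is a sketch of one possible client-side approach: the helper name is hypothetical and the sample values are placeholders, not real field values.

```python
def read_prospective_registration(trial: dict):
    """Read the renamed field, falling back to the pre-v25 name if it is absent."""
    if "prospective_registration" in trial:
        return trial["prospective_registration"]
    # Older servers still send the previous name; the values are unchanged.
    return trial.get("retrospective_flag")


# Placeholder payloads for illustration only.
old_payload = {"retrospective_flag": "placeholder"}
new_payload = {"prospective_registration": "placeholder"}

print(read_prospective_registration(old_payload) == read_prospective_registration(new_payload))  # True
```

Once every server is on v25, the fallback branch can be deleted.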