This repository is for anyone new to working with datasets released by the Clinical Practice Research Datalink (CPRD). Researchers tasked with understanding the database tables, then querying and filtering to create a research cohort, may find our pre-processing pipeline and interactive notebooks a helpful guide to getting started.
Please note:
- You need your own copy of CPRD's synthetic or real data to run the code. This repository does not contain any data files.
- CPRD are moving towards a Trusted Research Environment (TRE) model of data access, in which researchers no longer download data onto their own computers. Read more here.
- This repository is a work in progress. If you would like to suggest or contribute a change, please read our contributor guide.
We aim to streamline the process for researchers using CPRD datasets, with the creation of clear documentation, efficient data management strategies and analytical pipelines. We will start with development of workflows utilising CPRD's medium fidelity synthetic datasets because they resemble "the real world CPRD data with respect to the data types, data values, data formats, data structure and table relationships" ref.
New to Synthetic Data? Read an introduction here.
We will create and share documentation & code, in openly available languages. We will start by loading the data into a relational database and summarising some of its main features.
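As a rough illustration of "loading the data into a relational database and summarising" it, the sketch below reads a small tab-delimited extract into a table and counts patients per practice. It is not the repository's pipeline: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a database server, and the sample rows, the `patient` table name, the column names and the helper functions `load_tsv` and `summarise` are all assumptions made for this example.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited extract (made-up values, not CPRD data).
# Real files have many more columns and rows.
PATIENT_TSV = (
    "patid\tpracid\tgender\tyob\n"
    "1\t10\t1\t1960\n"
    "2\t10\t2\t1985\n"
    "3\t11\t2\t1972\n"
)


def load_tsv(conn, table, text):
    """Create `table` from tab-delimited text and return the number of rows loaded."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # Assumption for this sketch: every column holds integers.
    columns = ", ".join(f'"{name}" INTEGER' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = [tuple(int(value) for value in row) for row in reader]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


def summarise(conn):
    """Return (pracid, patient count) pairs, ordered by practice id."""
    query = "SELECT pracid, COUNT(*) FROM patient GROUP BY pracid ORDER BY pracid"
    return conn.execute(query).fetchall()


# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
n_loaded = load_tsv(conn, "patient", PATIENT_TSV)
summary = summarise(conn)
print(n_loaded, summary)
```

With PostgreSQL, the same idea would typically use a driver and a bulk-load command rather than row-by-row inserts; the choice of `sqlite3` here is only to keep the example self-contained.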
By working with our research collaborators, we aim to test workflows written with synthetic datasets on the real datasets to ensure transferability and utility. An anticipated mismatch will be the size of the data files and possibly the variability in file format. Please reach out to us if you want to test our code on your real CPRD data, or have any feedback on improving transferability and utility.
CPRD's most recently released data specifications can be found here for the real datasets and here for the synthetic datasets.
We include information on CPRD's Code Browser tool and how to request access to it.
The code-for-aurum folder uses Python and PostgreSQL to create a pre-processing workflow for CPRD Aurum data. The workflow converts the data file format for compatibility and then reads the data into tables in a relational database. Workbooks have been created to familiarise a user with the CPRD Aurum tables, including how they link together and how to build a sample cohort. See a preview below:
We acknowledge and thank these groups for making this project possible:
- National Institute for Health and Care Research (NIHR) for funding the AIM-RSF programme of work [NIHR202647] - see below.
- The AI for Multiple Long Term Conditions Research Support Facility (AIM-RSF) programme for facilitating the delivery of this project.
- This repository was created and is maintained by the AIM-RSF, led by Data Wranglers Rachael Stickland & Mahwish Mohammad.
- Clinical Practice Research Datalink (CPRD) for access to synthetic versions of their datasets [synthetic data request no: SD000021].
- The Alan Turing Institute. This project was supported in part through computational resources provided by The Alan Turing Institute under EPSRC grant EP/N510129/1.
The views expressed within any file in this repository are those of the author(s) within the AIM-RSF programme, and not necessarily those of the NIHR, the Department of Health and Social Care, the Medicines and Healthcare products Regulatory Agency (MHRA) or CPRD.
This project follows the all-contributors specification, using the emoji key:
- Rachael Stickland 📆 🚧 💻 📖 🤔
- Mahwish Mohammad 🚧 💻 📖 🤔 👀
- Batool Almarzouq 👀 🤔
- Ann-Marie Mallon 📆 🤔
- Kirstie Whitaker 🤔
Would you like to contribute? Please read our contributor guide.
This project is licensed under the MIT License. See the LICENSE file for more details.
Almarzouq, B., Mallon, A.-M., Mohammad, M., Stickland, R., Whitaker, K., & AIM-RSF team. (2024). Introduction to CPRD using synthetic datasets (cprd-data-wrangle) (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.13693616
You got to the end of the README? You get our 🦭 of approval!