👋 Welcome

👥 Who is this repository for?

This repository is for anyone new to working with datasets released by the Clinical Practice Research Datalink (CPRD). Researchers tasked with understanding the database tables, then querying and filtering to create a research cohort, may find our pre-processing pipeline and interactive notebooks a helpful guide to getting started.

Please note:

- You need your own copy of CPRD's synthetic or real data to run the code. This repository does not contain any data files.
- CPRD are moving towards a Trusted Research Environment (TRE) model of data access, rather than researchers downloading data onto their own computers. Read more here.
- This repository is a work in progress. If you would like to suggest or contribute a change, please read our contributor guide.

🥅 Project Goals

We aim to streamline the process for researchers using CPRD datasets by creating clear documentation, efficient data management strategies and analytical pipelines. We will start by developing workflows that use CPRD's medium fidelity synthetic datasets because they resemble

"the real world CPRD data with respect to the data types, data values, data formats, data structure and table relationships" ref.

New to Synthetic Data? Read an introduction here.

We will create and share documentation & code, in openly available languages. We will start by loading the data into a relational database and summarising some of its main features.
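
As a rough illustration of that first step, here is a minimal sketch of loading a tab-delimited file into a relational table and summarising it. It makes several assumptions that are not taken from this repository: Python's built-in `sqlite3` is used as a stand-in for PostgreSQL so the example runs without a database server, the table name `patient` and the columns `patid`, `gender` and `yob` are illustrative only, and the three sample rows are made up (they are not CPRD data, real or synthetic). The helper names `load_patients` and `summarise_patients` are hypothetical.

```python
import csv
import io
import sqlite3

# Made-up, tab-delimited sample standing in for a data file.
# Column names are illustrative assumptions, not a CPRD specification.
SAMPLE_TSV = "patid\tgender\tyob\n1\t1\t1980\n2\t2\t1975\n3\t2\t1990\n"


def load_patients(conn, text_stream):
    """Create a patient table and load rows from a tab-delimited stream."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patient ("
        "patid INTEGER PRIMARY KEY, gender INTEGER, yob INTEGER)"
    )
    reader = csv.DictReader(text_stream, delimiter="\t")
    rows = [(int(r["patid"]), int(r["gender"]), int(r["yob"])) for r in reader]
    conn.executemany(
        "INSERT INTO patient (patid, gender, yob) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


def summarise_patients(conn):
    """Return the row count plus the earliest and latest year of birth."""
    count, min_yob, max_yob = conn.execute(
        "SELECT COUNT(*), MIN(yob), MAX(yob) FROM patient"
    ).fetchone()
    return {"n_patients": count, "min_yob": min_yob, "max_yob": max_yob}


# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
n_loaded = load_patients(conn, io.StringIO(SAMPLE_TSV))
summary = summarise_patients(conn)
print(n_loaded, summary)
```

With a real file, the `io.StringIO` stream would be replaced by an opened text file, and the `sqlite3` connection by a PostgreSQL connection; the SQL placeholder syntax and column types would need adjusting accordingly.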

By working with our research collaborators, we aim to test workflows written with synthetic datasets on the real datasets to ensure transferability and utility. We anticipate mismatches in the size of the data files and possibly in the file format. Please reach out to us if you want to test our code on your real CPRD data, or if you have any feedback on improving transferability and utility.

CPRD's most recently released data specifications can be found here for the real datasets and here for the synthetic datasets.

💻 Current content

We include information on CPRD's Code Browser tool and how to request access to it.

The code-for-aurum folder uses Python and PostgreSQL to create a pre-processing workflow for CPRD Aurum data. The workflow converts the data file format for compatibility, then reads the data into tables in a relational database. Workbooks have been created to familiarise a user with the CPRD Aurum tables, including how they link together and how to build a sample cohort. See a preview below:

🤝 Contributions and Acknowledgments

We acknowledge and thank these groups for making this project possible:

- National Institute for Health and Care Research (NIHR) for funding the AIM-RSF programme of work [NIHR202647] - see below.
- The AI for Multiple Long Term Conditions Research Support Facility (AIM-RSF) programme for facilitating the delivery of this project.
  - This repository was created and is maintained by the AIM-RSF, led by Data Wranglers Rachael Stickland & Mahwish Mohammad.
- Clinical Practice Research Datalink (CPRD) for access to synthetic versions of their datasets [synthetic data request no: SD000021].
- The Alan Turing Institute. This project was supported in part through computational resources provided by The Alan Turing Institute under EPSRC grant EP/N510129/1.

The views expressed within any file in this repository are those of the author(s) within the AIM-RSF programme, and not necessarily those of the: NIHR, Department of Health and Social Care, Medicines and Healthcare products Regulatory Agency (MHRA) or CPRD.

Thanks to specific contributors

This project follows the all-contributors specification, using the emoji key:

Rachael Stickland
📆 🚧 💻 📖 🤔

Mahwish Mohammad
🚧 💻 📖 🤔 👀

Batool Almarzouq
👀 🤔

Ann-Marie Mallon
📆 🤔

Kirstie Whitaker
🤔

Would you like to contribute? Please read our contributor guide.

♻️ Licences

This project is licensed under the MIT License. See the LICENSE file for more details.

Citation

Almarzouq, B., Mallon, A.-M., Mohammad, M., Stickland, R., Whitaker, K., & AIM-RSF team. (2024). Introduction to CPRD using synthetic datasets (cprd-data-wrangle) (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.13693616

You got to the end of the README? You get our 🦭 of approval!
