In [None]:
# NOTEBOOK SET UP (1) - ask for credentials and db info from user
import getpass
my_username = input('Your username: ')
my_password = getpass.getpass(prompt='Your password: ', stream=None)
this_host = input('Host name: ')
this_db = input('Database name: ')

# NOTEBOOK SET UP (2) - load Jupyter magic functions & connect to db (assumes db & tables already created)
%load_ext sql
%sql postgresql+psycopg2://{my_username}:{my_password}@{this_host}/{this_db}
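
The connection string above interpolates the credentials directly into a URL, so a password containing characters such as `@` or `/` can be misread as part of the URL structure. A minimal sketch of one way around this, assuming percent-encoding with the standard library is acceptable in your setup; `build_db_url` is a hypothetical helper and the example values are placeholders, not real credentials:

```python
from urllib.parse import quote_plus


def build_db_url(username, password, host, database):
    """Return a PostgreSQL connection URL with the credentials percent-encoded."""
    return (
        "postgresql+psycopg2://"
        f"{quote_plus(username)}:{quote_plus(password)}@{host}/{database}"
    )


# Placeholder values for illustration only
example_url = build_db_url("analyst", "p@ss/word", "localhost", "cprd_aurum")
print(example_url)
```

A URL built this way could be passed to `%sql` in place of the inline string. Whether you need it depends on the characters in your password.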

In [None]:
# NOTEBOOK SET UP (3) - ask for necessary paths
GH_path = input("Local path to GH folder 'cprd-data-wrangle': ")
txt_data_path = input("Local path to CPRD Aurum txt files: ")

# Introduction to CPRD Aurum Sample Dataset

The aim of this notebook is to provide familiarity with the tables that make up the CPRD Aurum Sample (Synthetic) Dataset.

This notebook assumes you have already created a SQL database containing the CPRD tables. See Step1A, Step1B and Step1C in `code-for-aurum` for how the raw text files were transformed into tables within a SQL database.

This notebook can also act as a sanity check that you can view and query all the tables in your database. 


## About the dataset

The [data release notes on CPRD's website](https://www.cprd.com/synthetic-data) summarise the purpose of this synthetic dataset, explain how to cite it, and present summary statistics. 

Beyond this, the release notes point to the [main Aurum data specifications](https://cprd.com/primary-care-data-public-health-research) for understanding the synthetic data files. The data specification includes the metadata that applies to both the synthetic and the real data (how tables are linked, what tables contain, field descriptions for each table).

## List the raw files and their size


In [None]:
# List the raw files and their size

import os
import pandas as pd

file_list_df = pd.DataFrame(os.listdir(txt_data_path), columns=['FileName'])

# Size of each file in MB. os.path.join works whether or not txt_data_path ends
# with a path separator, and assigning the whole column at once avoids chained
# assignment (which can silently fail to update the DataFrame).
file_list_df['MB'] = [
    round(os.path.getsize(os.path.join(txt_data_path, file_name)) / 1024 / 1024, 2)
    for file_name in file_list_df['FileName']
]
print(file_list_df)

File_Count = len(file_list_df.index)
MB_Total = round(file_list_df['MB'].sum(),2)
print('\n' + '################################' + '\nTotal of all ' + str(File_Count) + ' files: ' + str(MB_Total) + ' MB' + '\n################################')
print('\n' + "These are flat files stored as plain text (.txt)." 
      + '\n' + "The real data will be bigger than the synthetic data (GB not MB)." 
      + '\n' + "Therefore, the real data may store some text files listed here across multiple files.")

## List all tables in this sql database

In [None]:
%%sql 
SELECT table_name 
FROM information_schema.tables 
WHERE table_schema='public' AND table_type='BASE TABLE'

## Preview the data from one table

The notebook will prompt you for the name of the table and the number of rows you want to preview.

Tip: execute the SQL cell more than once for the same table because 'ORDER BY RANDOM()' will show you different data each time.


In [None]:
table_name = input('Table name: ')
n_rows = input('N rows to view: ')
%sql SELECT * FROM {table_name} ORDER BY RANDOM() LIMIT {n_rows} ;
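
The cell above places whatever is typed at the prompts straight into the SQL statement. A minimal sketch of checking both inputs first, assuming you already have a list of valid table names (for example from the `information_schema` query earlier); `build_preview_query` is a hypothetical helper, not part of any library:

```python
def build_preview_query(table_name, n_rows, known_tables):
    """Return a random-sample preview query after validating both inputs."""
    # Compare case-insensitively: PostgreSQL folds unquoted identifiers to lowercase
    allowed = {name.lower() for name in known_tables}
    if table_name.lower() not in allowed:
        raise ValueError(f"Unknown table: {table_name!r}")
    limit = int(n_rows)  # raises ValueError if the input is not a whole number
    if limit < 1:
        raise ValueError("n_rows must be a positive integer")
    return f"SELECT * FROM {table_name} ORDER BY RANDOM() LIMIT {limit};"


# Illustrative call with a short, hand-written list of table names
query = build_preview_query("Patient", "5", ["patient", "practice", "staff"])
print(query)
```

The returned string could then be run with `%sql {query}`.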

## Preview the data from all tables

Tip: consider whether you want to run this cell, because it takes a few minutes and produces a lot of output.

In [None]:
for index, row in file_list_df.iterrows():
    file_name = row['FileName']
    table_name = file_name.split('.')[0]
    table_preview = %sql SELECT * FROM {table_name} ORDER BY RANDOM() LIMIT 3 ;
    # index is zero-based, so add 1 to count tables from 1 to File_Count
    print('\n' + '## Table ' + str(index + 1) + ' of ' + str(File_Count) + '\n' + '## This table is ' + table_name)
    display(table_preview)

# Detailed exploration - Oct 2021 Release

The code *above* should in theory work for any CPRD data release, as it does not assume anything about the table names or how the tables are linked, and asks for user input.

The code *below* takes a guided and more detailed look at each table. This code will only run for you if your table names match those within the [October 2021 release](https://cprd.com/sites/default/files/2022-02/CPRD%20Aurum%20Sample%20Dataset%20Release%20Notes.pdf) of the CPRD Synthetic Aurum Dataset. The code below assumes information about table linkage that is based on this release date. 



## Size comparison: real versus synthetic 

Taking the real CPRD Aurum data to be the [May 2022 release](https://cprd.com/sites/default/files/2022-05/2022-05%20CPRD%20Aurum%20Release%20Notes.pdf) and the synthetic CPRD data to be the [October 2021 release](https://cprd.com/sites/default/files/2022-02/CPRD%20Aurum%20Sample%20Dataset%20Release%20Notes.pdf):

| Metric | Real | Synthetic | Synthetic % of Real |
| -| - | - | - |
| Total Acceptable Patients | 41,200,722 | 39,388 | 0.1% |
| Total Current Patients | 13,300,067 | 13,858 | 0.1% |
| Total Practices (current & historic) | 1,491 | 14 | 0.9% |

The table shows that the real dataset has ~1,000 times more patients (total or current) and ~100 times more practices. 
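
The percentages and approximate ratios quoted above can be reproduced from the counts in the table. A short sketch, using only the figures already listed; the variable names are illustrative:

```python
# (real, synthetic) counts copied from the table above
counts = {
    "Total Acceptable Patients": (41_200_722, 39_388),
    "Total Current Patients": (13_300_067, 13_858),
    "Total Practices (current & historic)": (1_491, 14),
}

# Synthetic as a percentage of real, and how many times larger real is
percentages = {m: 100 * syn / real for m, (real, syn) in counts.items()}
ratios = {m: real / syn for m, (real, syn) in counts.items()}

for metric in counts:
    print(f"{metric}: synthetic is {percentages[metric]:.1f}% of real, "
          f"real is roughly {ratios[metric]:,.0f} times larger")
```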

Available follow-up time in years since 1st Jan 1995 (mean, sd, percentiles) is similar for the real and synthetic datasets. 


## What do the 27 files contain and how do they link together?

At the time of writing this notebook, [v3.4 of the Aurum data specifications](https://www.cprd.com/sites/default/files/2024-04/CPRD%20Aurum%20Data%20Specification%20v3.4.pdf) describes **8 main data files** and **2 data dictionaries**. The other **17 files are lookup tables** that give values for the fields within the main files. However, the descriptions of the fields within these lookup tables are not included in the data specifications. 

See the figure on page 5 of [v3.4 of the Aurum data specifications](https://www.cprd.com/sites/default/files/2024-04/CPRD%20Aurum%20Data%20Specification%20v3.4.pdf), which shows how the tables are linked to one another and via which ID:

<img src="cprd-aurum-data-structure.png" alt="cprd-aurum-data-structure" width="700" />
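
As a quick cross-check of the 8 + 2 + 17 breakdown, the table names queried later in this notebook can be grouped and counted. This is a sketch based on the names used in this notebook's own SQL cells; the names in your database may differ:

```python
# Table names as they are queried in the SQL cells of this notebook
table_groups = {
    "main data files": [
        "Patient", "Practice", "Staff", "Consultation",
        "Observation", "Referral", "Problem", "DrugIssue",
    ],
    "data dictionaries": ["MedicalDictionary", "ProductDictionary"],
    "lookup tables": [
        "EMISCodeCat", "Gender", "PatientType", "Region", "JobCat",
        "ConsSource", "NumUnit", "ObsType", "RefServiceType", "RefUrgency",
        "OrgType", "RefMode", "ParentProbRel", "ProbStatus", "Sign",
        "Common_dosages", "QuantUnit",
    ],
}

for group, names in table_groups.items():
    print(f"{len(names):>2} {group}")

total_tables = sum(len(names) for names in table_groups.values())
print(f"{total_tables:>2} tables in total")
```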

### Preview the `Medical dictionary` and associated lookup table

"The Medical Dictionary contains information on all medical history observations that have been recorded in EMIS Web®. Observations are coded using a combination of SNOMED, Read and local EMIS® codes. Further information is provided in later sections of this document." *CPRD Aurum Data Specification Version 3.4*

- Links to the `Consultation` and `Observation` data tables on 'medcodeid'
- Links to the `EMISCodeCat` lookup table on 'emiscodecategoryid'
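
The lookup link listed above can be sketched as a join. The example below is self-contained and uses an in-memory SQLite database as a stand-in for PostgreSQL; the rows are made-up toy values, and the `term` and `description` columns are assumed names for illustration (only 'medcodeid' and 'emiscodecategoryid' come from the links above):

```python
import sqlite3

# In-memory stand-in database holding toy rows, not CPRD data
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMISCodeCat (emiscodecategoryid INTEGER, description TEXT);
    CREATE TABLE MedicalDictionary (
        medcodeid TEXT, term TEXT, emiscodecategoryid INTEGER
    );
    INSERT INTO EMISCodeCat VALUES (1, 'toy category A'), (2, 'toy category B');
    INSERT INTO MedicalDictionary VALUES
        ('1001', 'toy term x', 1),
        ('1002', 'toy term y', 2),
        ('1003', 'toy term z', NULL);
""")

# LEFT JOIN keeps dictionary rows even when they have no category
rows = conn.execute("""
    SELECT m.medcodeid, m.term, c.description
    FROM MedicalDictionary AS m
    LEFT JOIN EMISCodeCat AS c
        ON m.emiscodecategoryid = c.emiscodecategoryid
    ORDER BY m.medcodeid
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

The same `LEFT JOIN ... ON` pattern applies to the other lookup tables in this notebook, swapping in the relevant ID column.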


In [None]:
%%sql 
-- MedicalDictionary table
SELECT * FROM MedicalDictionary 
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- EMISCodeCat lookup table
SELECT * FROM EMISCodeCat
ORDER BY RANDOM() 
LIMIT 5;

Tip: execute SQL cells more than once, because 'ORDER BY RANDOM()' will show you different data each time.

### Preview the `Product dictionary`

"The Product Dictionary contains information on drug and appliance prescriptions recorded in EMIS Web®. This information is coded using the Dictionary of Medicines and Devices (DM+D). Further information is provided in later sections of this document." *CPRD Aurum Data Specification Version 3.4*

- Links to the `DrugIssue` data table on 'prodcodeid'


In [None]:
%%sql 
-- ProductDictionary table
SELECT * FROM ProductDictionary 
ORDER BY RANDOM() 
LIMIT 5;

## Preview the data tables and associated lookup tables

### `Patient` table 
The `Patient` table "contains basic patient demographics and patient registration details for the patients." *CPRD Aurum Data Specification Version 3.4*

- Links to the `Practice` data table on 'pracid'
- Links to the `Staff` data table on 'usualgpstaffid'
- Links to the `Consultation`, `Observation` and `DrugIssue` data tables on 'patid'
- Links to the `Gender` lookup table on 'gender'
- Links to the `PatientType` lookup table on 'patienttypeid'

In [None]:
%%sql
-- Patient table 
SELECT * FROM Patient 
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- Gender lookup table
SELECT * FROM Gender;

In [None]:
%%sql
-- PatientType lookup table
SELECT * FROM PatientType
ORDER BY RANDOM() 
LIMIT 5;

### `Practice` table
The `Practice` table "contains details of each practice, including the practice identifier, practice region, and the last collection date." *CPRD Aurum Data Specification Version 3.4*

- Links to the `Patient` data table on 'pracid'
- Links to the `Region` lookup table on 'region'



In [None]:
%%sql
-- Practice table
SELECT * FROM Practice
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- Region lookup table
SELECT * FROM Region;

### `Staff` table
The `Staff` table contains practice staff details for each staff member, including job category. *CPRD Aurum Data Specification Version 3.4*
- Links to the `Patient` data table on 'staffid' (matched to 'usualgpstaffid' in `Patient`)
- Links to the `Practice` data table on 'pracid'
- Links to the `JobCat` lookup table on 'jobcatid'

In [None]:
%%sql
-- Staff table 
SELECT * FROM Staff 
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- JobCat lookup table
SELECT * FROM JobCat
ORDER BY RANDOM() 
LIMIT 5;

### `Consultation` table
The `Consultation` table "contains information relating to the type of consultation as entered by the GP (e.g. telephone, home visit, practice visit). Some consultations are linked to observations that occur during the consultation via the consultation identifier (consid)." *CPRD Aurum Data Specification Version 3.4*
- Links to the `Patient` data table on 'patid'
- Links to the `Practice` data table on 'pracid'
- Links to the `Staff` data table on 'staffid' 
- Links to the `Observation` data table on 'consid'
- Links to the `MedicalDictionary` table on 'consmedcodeid'
- Links to the `ConsSource` lookup table on 'conssourceid'

In [None]:
%%sql
-- Consultation table
SELECT * FROM Consultation 
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- ConsSource lookup table
SELECT * FROM ConsSource
ORDER BY RANDOM() 
LIMIT 5;

### `Observation` table
The `Observation` table "contains the medical history data entered on the GP system including symptoms, clinical measurements, laboratory test results, and diagnoses, as well as demographic information recorded as a clinical code (e.g. patient ethnicity). Observations that occur during a consultation can be linked via the consultation identifier. CPRD Aurum data are structured in a long format (multiple rows per subject), and observations can be linked to a parent observation. For example, measurements of systolic and diastolic blood pressure will be grouped together via a parent observation for blood pressure measurement." *CPRD Aurum Data Specification Version 3.4*
- Links to the `Patient` data table on 'patid'
- Links to the `Practice` data table on 'pracid'
- Links to the `Staff` data table on 'staffid'
- Links to the `Consultation` data table on 'consid'
- Links to the `Problem` and `Referral` data tables on 'obsid'
- Links to the `MedicalDictionary` table on 'medcodeid'
- Links to the `NumUnit` lookup table on 'numunitid'
- Links to the `ObsType` lookup table on 'obstypeid'
- Links to itself on 'parentobsid' and 'probobsid'
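
The self-link on 'parentobsid' can be sketched as a self-join, following the blood pressure example in the quote above. The sketch uses an in-memory SQLite database as a stand-in for PostgreSQL with made-up toy rows; `label` is a hypothetical column standing in for a decoded 'medcodeid':

```python
import sqlite3

# In-memory stand-in database holding toy rows, not CPRD data
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Observation (
        obsid INTEGER, patid TEXT, parentobsid INTEGER, label TEXT
    );
    INSERT INTO Observation VALUES
        (1, 'p1', NULL, 'blood pressure'),
        (2, 'p1', 1,    'systolic'),
        (3, 'p1', 1,    'diastolic'),
        (4, 'p1', NULL, 'unrelated observation');
""")

# Join the table to itself: each child row is paired with its parent row
pairs = conn.execute("""
    SELECT child.obsid, child.label, parent.obsid, parent.label
    FROM Observation AS child
    JOIN Observation AS parent
        ON child.parentobsid = parent.obsid
    ORDER BY child.obsid
""").fetchall()
conn.close()

for pair in pairs:
    print(pair)
```

Rows with no parent (here obsid 1 and 4) drop out of the inner join; a `LEFT JOIN` would keep them with empty parent columns.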


In [None]:
%%sql
-- Observation table 
SELECT * FROM Observation
WHERE value IS NOT NULL AND numunitid IS NOT NULL AND numrangelow IS NOT NULL AND numrangehigh IS NOT NULL AND probobsid != 'None'
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- NumUnit lookup table
SELECT * FROM NumUnit
ORDER BY RANDOM() 
LIMIT 5;


In [None]:
%%sql
-- ObsType lookup table
SELECT * FROM ObsType;

### `Referral` table
The `Referral` table "contains referral details recorded on the GP system. Data in the referral file are linked to the observation file and contain ‘add-on’ data for referral-type observations. These files contain information involving both inbound and outbound patient referrals to or from external care centres (normally to secondary care locations such as hospitals for inpatient or outpatient care). To obtain the full referral record (including reason for the referral and date), referrals should be linked to the Observation file using the observation identifier (obsid)." *CPRD Aurum Data Specification Version 3.4*
- Links to the `Patient` data table on 'patid'
- Links to the `Practice` data table on 'pracid'
- Links to the `Observation` data table on 'obsid'
- Links to the `RefServiceType` lookup table on 'refservicetypeid'
- Links to the `RefUrgency` lookup table on 'refurgencyid'
- Links to the `OrgType` lookup table on 'refsourceorgid'
- Links to the `RefMode` lookup table on 'refmodeid'
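
The quote above says a full referral record is obtained by linking to the Observation file on 'obsid'. A minimal sketch of that link, which also flags referrals with no matching observation as a sanity check. It uses an in-memory SQLite database as a stand-in for PostgreSQL, with made-up toy rows and only a few of the columns named in this notebook:

```python
import sqlite3

# In-memory stand-in database holding toy rows, not CPRD data
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Observation (obsid TEXT, patid TEXT, medcodeid TEXT);
    CREATE TABLE Referral (obsid TEXT, patid TEXT, refurgencyid INTEGER);
    INSERT INTO Observation VALUES
        ('o1', 'p1', '1001'),
        ('o2', 'p1', '1002'),
        ('o3', 'p2', '1003');
    INSERT INTO Referral VALUES ('o2', 'p1', 1), ('o9', 'p2', 2);
""")

# Add the observation's medcodeid to each referral row
full_referrals = conn.execute("""
    SELECT r.obsid, r.patid, r.refurgencyid, o.medcodeid
    FROM Referral AS r
    LEFT JOIN Observation AS o ON r.obsid = o.obsid
    ORDER BY r.obsid
""").fetchall()
conn.close()

# Referrals whose obsid was not found in Observation
unmatched = [row for row in full_referrals if row[3] is None]

print(full_referrals)
print("Referrals without a matching observation:", len(unmatched))
```

The `Problem` table is described as add-on data in the same way, so the same pattern could be tried there.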

In [None]:
%%sql
-- Referral table
SELECT * FROM Referral
WHERE reftargetorgid IS NOT NULL AND refurgencyid IS NOT NULL AND refservicetypeid IS NOT NULL AND refmodeid IS NOT NULL 
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- RefServiceType lookup table
SELECT * FROM RefServiceType;

In [None]:
%%sql
-- RefUrgency lookup table
SELECT * FROM RefUrgency;

In [None]:
%%sql
-- OrgType lookup table
SELECT * FROM OrgType LIMIT 3;

In [None]:
%%sql
-- RefMode lookup table
SELECT * FROM RefMode;

### `Problem` table
The `Problem` table "contains details of the patient’s medical history that have been defined by the GP as a ‘problem’. Data in the problem file are linked to the observation file and contain ‘add-on’ data for problem-type observations. Information on identifying associated problems, the significance of the problem and its expected duration can be found in this table. GPs may use ‘problems’ to manage chronic conditions as it would allow them to group clinical events (including drug prescriptions, measurements, symptom recording) by problem rather than chronologically. To obtain the full problem record (including the clinical code for the problem), problems should be linked to the Observation file using the observation identifier (obsid)." *CPRD Aurum Data Specification Version 3.4*
- Links to the `Patient` data table on 'patid'
- Links to the `Practice` data table on 'pracid'
- Links to the `Staff` data table on 'lastrevstaffid'
- Links to the `Observation` data table on 'obsid' and 'parentprobobsid'
- Links to the `ParentProbRel` lookup table on 'parentprobrelid'
- Links to the `ProbStatus` lookup table on 'probstatusid'
- Links to the `Sign` lookup table on 'signid'

In [None]:
%%sql
-- Problem table
SELECT * FROM Problem
WHERE lastrevdate IS NOT NULL AND lastrevstaffid IS NOT NULL AND parentprobrelid IS NOT NULL AND probstatusid IS NOT NULL AND signid IS NOT NULL
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- ParentProbRel lookup table
SELECT * FROM ParentProbRel;

In [None]:
%%sql
-- ProbStatus lookup table
SELECT * FROM ProbStatus;

In [None]:
%%sql
-- Sign lookup table
SELECT * FROM Sign;

### `DrugIssue` table
The `DrugIssue` table "contains details of all prescriptions on the GP system. This file contains data relating to all prescriptions (for drugs and appliances) issued by the GP. Some prescriptions are linked to problem-type observations via the Observation file, using the observation identifier (obsid)." *CPRD Aurum Data Specification Version 3.4*
- Links to the `Patient` data table on 'patid'
- Links to the `Practice` data table on 'pracid'
- Links to the `Staff` data table on 'staffid'
- Links to the `Observation` and `Problem` data tables on 'probobsid'
- Links to the `ProductDictionary` data dictionary on 'prodcodeid'
- Links to the `Common_dosages` lookup table on 'dosageid'
- Links to the `QuantUnit` lookup table on 'quantunitid'

In [None]:
%%sql
-- DrugIssue table 
SELECT * FROM DrugIssue
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- Common_dosages lookup table
SELECT * FROM Common_dosages
ORDER BY RANDOM() 
LIMIT 5;

In [None]:
%%sql
-- QuantUnit lookup table
SELECT * FROM QuantUnit
ORDER BY RANDOM() 
LIMIT 5;