# Summary statistics for CPRD Aurum Sample (Synthetic) Dataset

The summary statistics created in this notebook follow the structure of those within the ['Release Notes: CPRD Aurum Sample Dataset October 2021'](https://www.cprd.com/sites/default/files/2022-02/CPRD%20Aurum%20Sample%20Dataset%20Release%20Notes.pdf) PDF. 

This notebook aims to replicate the numbers that CPRD provides using SQL, as an introduction to querying this dataset and its tables.

This notebook assumes you have created a SQL database containing the CPRD tables. See Step1A, Step1B and Step1C in `code-for-aurum` for how the raw text files were transformed into tables within a SQL database.

*We have not yet matched all the answers in the data specification - please let us know if you spot why!*


----
Preliminary setup code:

In [1]:
# NOTEBOOK SET UP (1) - ask for credentials and db info from user
import getpass
my_username = input('Your username: ')
my_password = getpass.getpass(prompt='Your password: ', stream=None)
this_host = input('Host name: ')
this_db = input('Database name: ')

# NOTEBOOK SET UP (2) - load Jupyter magic functions & connect to db (assumes db & tables already created)
%load_ext sql
%sql postgresql+psycopg2://{my_username}:{my_password}@{this_host}/{this_db}
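
If the password contains characters such as `@`, `:` or `/`, the connection string above may be parsed incorrectly. A minimal sketch of one way around this, assuming the same `postgresql+psycopg2` URL form, is to percent-encode the credentials before building the URL. `build_db_url` and the example values are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_db_url(username, password, host, database):
    """Return a postgresql+psycopg2 URL with percent-encoded credentials.

    Hypothetical helper: only the username and password are encoded,
    the host and database name are used as given.
    """
    return (
        "postgresql+psycopg2://"
        f"{quote_plus(username)}:{quote_plus(password)}@{host}/{database}"
    )


# Made-up credentials, for illustration only
example_url = build_db_url("alice", "p@ss:word/1", "localhost", "aurum")
print(example_url)
```

The resulting string could then be interpolated into the `%sql` line in place of the hand-built URL.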

### Total number of acceptable patients (including transferred out and deceased patients)
Permanent registrations only. The ‘acceptable’ flag refers to a research quality threshold based on CPRD metrics.

In [None]:
%%sql
-- Count total acceptable patients
SELECT COUNT(*)
FROM patient
WHERE acceptable = 1;

### Current number of acceptable patients (i.e. registered at currently contributing practices, excluding transferred out and deceased patients)

In [None]:
%%sql
SELECT COUNT(*)
FROM patient
WHERE acceptable = 1 
AND cprd_ddate IS NULL  -- the data spec recommends using cprd_ddate rather than emis_ddate
AND regenddate IS NULL;  -- a null regenddate means the registration has not ended

### Percentage of UK population coverage (current patients only)
Based on the latest UK population estimates from the Office for National Statistics.


In [None]:
%%sql
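-- The divisor appears to be a UK population estimate of roughly 66.8 million divided by 100, so the result is already a percentage
SELECT COUNT(*)/667968.00 AS percent_coverage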
FROM patient
WHERE acceptable = 1
AND cprd_ddate IS NULL
AND regenddate IS NULL;
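
The arithmetic behind the query above can be sketched in Python. This assumes the divisor `667968.00` is a UK population estimate of 66,796,800 divided by 100, which is an inference from the number itself rather than something the notebook states. `percent_coverage` and the patient count are hypothetical.

```python
def percent_coverage(current_patients, uk_population):
    """Return current patients as a percentage of the UK population."""
    return 100.0 * current_patients / uk_population


# Assumed population figure: 667968.00 * 100 (not confirmed by the notebook)
ASSUMED_UK_POPULATION = 66_796_800

# Hypothetical patient count, for illustration only
hypothetical_current_patients = 13_359

explicit = percent_coverage(hypothetical_current_patients, ASSUMED_UK_POPULATION)
as_in_query = hypothetical_current_patients / 667968.00
print(explicit, as_in_query)
```

Both expressions give the same value up to floating-point rounding, so the SQL result can be read directly as a percentage.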

### Available follow-up time in years since 1st January 1995 (all patients including transferred out and deceased):
Follow-up time stated here does not incorporate the up-to-standard (UTS) date, and the database includes records pre-dating 1st January 1995.

*In this section, we don't quite match the answers in the release notes!*


In [None]:
%%sql
-- Define follow-up time as the difference between regenddate and regstartdate
SELECT regenddate,
       regstartdate,
       regenddate - regstartdate AS followup_days,
       (regenddate - regstartdate) / 365.0 AS followup_years
FROM Patient
WHERE regenddate IS NOT NULL
LIMIT 2;

In [None]:
%%sql
-- AVERAGE for all patients
-- If a patient has no regenddate, we assume the end date is the date of the CPRD release (2021-10-01).
-- All patients are included, but if regstartdate is before 1995-01-01 we only count from that date.
SELECT AVG(
    (
    CASE WHEN regenddate IS NULL
    THEN '2021-10-01' ELSE regenddate END
    -
    CASE WHEN regstartdate < '1995-01-01'
    THEN '1995-01-01' ELSE regstartdate END
    )/365.0
    )
FROM Patient;
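
The per-patient calculation inside the `AVG` above can be sketched in Python to make the two `CASE` expressions easier to check by hand. `followup_years` is a hypothetical helper, and the two constants simply mirror the literal dates used in the query.

```python
from datetime import date

STUDY_START = date(1995, 1, 1)    # follow-up is only counted from this date
RELEASE_DATE = date(2021, 10, 1)  # assumed end date when regenddate is missing


def followup_years(regstartdate, regenddate):
    """Follow-up in years, mirroring the CASE logic of the SQL query."""
    end = regenddate if regenddate is not None else RELEASE_DATE
    start = max(regstartdate, STUDY_START)
    return (end - start).days / 365.0


# Registration entirely after 1995, with an end date (2000 is a leap year)
print(followup_years(date(2000, 1, 1), date(2001, 1, 1)))
# Registration started before 1995 and is still open
print(followup_years(date(1990, 1, 1), None))
# Registration ended before 1995: the result is negative
print(followup_years(date(1980, 1, 1), date(1990, 1, 1)))
```

Note that with this definition a registration ending before 1st January 1995 yields a negative value. Whether such rows exist in the sample, and how CPRD treats them, is not stated here, so this may be worth checking when comparing against the release notes.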


In [None]:
%%sql
-- Quartiles (25th percentile, median, 75th percentile) for all patients
WITH cte AS (
    SELECT 
    (CASE WHEN regenddate IS NULL 
    THEN '2021-10-01' ELSE regenddate END 
    - 
    CASE WHEN regstartdate < '1995-01-01'
    THEN '1995-01-01' ELSE regstartdate END
    )/365.0 AS followup_years
    FROM Patient
    )
-- SELECT * FROM cte
SELECT percentile_disc(0.25) WITHIN GROUP (ORDER BY followup_years) FROM cte
UNION ALL
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY followup_years) FROM cte
UNION ALL
SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY followup_years) FROM cte;
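
`percentile_disc` returns an actual value from the data rather than interpolating between neighbours. The sketch below is a plain-Python approximation of that documented behaviour (the first sorted value whose position reaches the requested fraction). It is an illustration on made-up numbers, not the database's own implementation.

```python
import math


def percentile_disc(values, fraction):
    """Discrete percentile: first sorted value whose position reaches `fraction`."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Position is 1-based in the definition, so subtract 1 for the list index
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


# Made-up follow-up times in years, for illustration only
sample = [1.0, 2.0, 3.0, 4.0]
quartiles = [percentile_disc(sample, f) for f in (0.25, 0.5, 0.75)]
print(quartiles)
```

For an even number of rows the discrete median is the lower of the two middle values (2.0 here), whereas an interpolated median would be 2.5. If the release notes report an interpolated median, PostgreSQL's `percentile_cont` may be the closer match, though that is an assumption about how CPRD computed its figures.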

In [None]:
%%sql
-- STDDEV for all patients
SELECT STDDEV(
    (
    CASE WHEN regenddate IS NULL 
    THEN '2021-10-01' ELSE regenddate END 
    - 
    CASE WHEN regstartdate < '1995-01-01' 
    THEN '1995-01-01' ELSE regstartdate END
    )/365.0
    )
FROM Patient;

In [None]:
%%sql
-- AVERAGE for current patients
SELECT AVG(
    (
    CASE WHEN regenddate IS NULL 
    THEN '2021-10-01' ELSE regenddate END 
    - 
    CASE WHEN regstartdate < '1995-01-01'
    THEN '1995-01-01' ELSE regstartdate END
    )/365.0
    )
FROM Patient
WHERE regenddate IS NULL
AND cprd_ddate IS NULL;

In [None]:
%%sql
-- Quartiles (25th percentile, median, 75th percentile) for current patients
WITH cte AS (
    SELECT 
    (CASE WHEN regenddate IS NULL 
    THEN '2021-10-01' ELSE regenddate END 
    - 
    CASE WHEN regstartdate < '1995-01-01'
    THEN '1995-01-01' ELSE regstartdate END
    )/365.0 AS followup_years
    FROM Patient
    WHERE regenddate IS NULL
    AND cprd_ddate IS NULL
    )
-- SELECT * FROM cte
SELECT percentile_disc(0.25) WITHIN GROUP (ORDER BY followup_years) FROM cte
UNION ALL
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY followup_years) FROM cte
UNION ALL
SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY followup_years) FROM cte;


In [None]:
%%sql
-- STDDEV for current patients
SELECT STDDEV(
    (
    CASE WHEN regenddate IS NULL 
    THEN '2021-10-01' ELSE regenddate END 
    - 
    CASE WHEN regstartdate < '1995-01-01'
    THEN '1995-01-01' ELSE regstartdate END
    )/365.0
    )
FROM Patient
WHERE regenddate IS NULL
AND cprd_ddate IS NULL;


### Total number of practices (current and historic) included in the database

In [None]:
%%sql
SELECT COUNT(*) FROM practice;

In [None]:
%%sql
-- Total number of distinct practices, counted by practice identifier
SELECT COUNT(DISTINCT pracid) FROM practice;

In [None]:
%%sql
SELECT * FROM practice;

### Currently contributing practices
Currently contributing practices are those contributing data to CPRD within 120 days of the database build being created. Practices that no longer contribute data to CPRD are classed as not currently contributing practices. The definition of currently contributing practices has been altered from 60 to 120 days to allow for the change to a quarterly release schedule planned up to March 2024.

*From this section onwards, we don't quite match the answers in the release notes!*


In [None]:
%%sql
-- Last collection date (lcd) has to be within 120 days of the database build release (Oct 2021 in this case)
-- You won't see any results, as all practices fall outside this range
SELECT *, ('2021-10-01' - lcd) AS "Collection Date & Release date diff"
FROM practice
WHERE ('2021-10-01' - lcd) < 120;
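
The 120-day rule applied by the query above can be sketched in Python to check individual dates by hand. `is_currently_contributing` is a hypothetical helper. The build date is the same assumed release date used in the query, and the strict `<` comparison mirrors the SQL rather than any confirmed CPRD definition.

```python
from datetime import date

BUILD_DATE = date(2021, 10, 1)  # assumed database build release date


def is_currently_contributing(lcd, build_date=BUILD_DATE, window_days=120):
    """True if the last collection date is fewer than `window_days` before the build."""
    return (build_date - lcd).days < window_days


# Made-up last collection dates, for illustration only
print(is_currently_contributing(date(2021, 9, 1)))   # 30 days before the build
print(is_currently_contributing(date(2021, 6, 3)))   # exactly 120 days before the build
print(is_currently_contributing(date(2020, 1, 15)))  # well outside the window
```

A practice whose last collection date is exactly 120 days before the build is excluded by the strict comparison. Switching to `<=` would include it, and which reading CPRD intends by "within 120 days" is not stated here.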

### Percentage coverage of UK general practices (currently contributing practices only)
Expressed as a percentage of all practices currently contributing to CPRD Aurum

> Not applicable, as the query above returns no results.


### Regional distribution of currently contributing practices

In [None]:
%%sql
-- Regional distribution of practices
-- No "currently contributing" filter is applied here, since the 120-day query above returns no rows
SELECT re.description AS Region, COUNT(pr.pracid) AS TotalPractices
FROM practice pr
INNER JOIN region re
ON re.regionid = pr.region
GROUP BY pr.region, re.description;
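
To see how the join and grouping above behave, the same query shape can be run against a small in-memory SQLite database. This is a sketch on made-up rows: the table contents and region names are hypothetical, only the column names follow the query, and SQLite stands in for the PostgreSQL database used elsewhere in the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE practice (pracid INTEGER, region INTEGER);
    CREATE TABLE region (regionid INTEGER, description TEXT);
    -- Hypothetical rows: practice 4 has no region recorded
    INSERT INTO practice VALUES (1, 1), (2, 1), (3, 2), (4, NULL);
    INSERT INTO region VALUES (1, 'North East'), (2, 'London');
    """
)

rows = conn.execute(
    """
    SELECT re.description AS region, COUNT(pr.pracid) AS total_practices
    FROM practice pr
    INNER JOIN region re ON re.regionid = pr.region
    GROUP BY pr.region, re.description
    ORDER BY total_practices DESC
    """
).fetchall()
conn.close()

print(rows)
```

Because the join is an `INNER JOIN`, a practice with a missing or unmatched region (practice 4 here) drops out of the counts. If regional totals do not add up to the total number of practices, a `LEFT JOIN` from `practice` may help to find such rows.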