# SPARRA data examples

Done in Python for speed -- feel free to export the resulting dataframes if you want them.

## PIS

Each row is a prescription payment _event_. Each prescription event is characterised by a BNF section (see e.g. https://openprescribing.net/bnf/), a date, and a number of items. Features might be things like:
1. Number of distinct BNF sections for an ID
2. Count of items that come from one of a set of BNF sections
3. Total number of items across all BNF sections


In [29]:
import pandas as pd
import numpy as np

def random_dates(start, end, n, seed=1, replace=False):
    dates = pd.date_range(start, end).to_series()
    return list(dates.sample(n, replace=replace, random_state=seed))

MAX_ID = 10 # 10 different people
N_ROWS = 100 # lots of rows
START = pd.to_datetime('2015-01-01')
END = pd.to_datetime('2018-01-01')
id_vals = np.random.choice(MAX_ID, N_ROWS)
BNF_SECS = [
    '0101',
    '0102',
    '0103',
    '0104',
    '0105',
    '0106',
    '0107',
    '0108',
    '0109',
    '0201',
    '0202',
    '0203',
    '0204' # etc - got bored typing them in
]
date_vals = random_dates(START, END, n=N_ROWS, replace=True)
bnf_vals = np.random.choice(BNF_SECS, N_ROWS)
num_items_vals = np.random.poisson(2, N_ROWS) + 1 # don't want zeros
pis_data = pd.DataFrame({'id': id_vals, 'paid_date': date_vals, 'bnf_section': bnf_vals, 'num_items': num_items_vals}, index=range(N_ROWS))
print(pis_data.head())

   id  paid_date bnf_section  num_items
0   0 2017-11-27        0104          2
1   8 2015-08-24        0101          5
2   9 2018-01-01        0103          1
3   2 2017-06-24        0106          3
4   2 2016-12-16        0201          3


### Example PIS features

In [65]:
# distinct bnf
n_distinct = pis_data.groupby('id').agg(n_distinct_bnf = pd.NamedAgg(column='bnf_section', aggfunc='nunique')).reset_index()
print(n_distinct.head())
# total prescribed items
total_items = pis_data.groupby('id').agg(n_total_items = pd.NamedAgg(column='num_items', aggfunc='sum')).reset_index()
print(total_items.head())
# particular BNF (total prescribed items after some filtering on section)
pis_diar_and_anti = pis_data.loc[pis_data['bnf_section'].isin(['0203','0204']), :].groupby('id').agg(n_diar_and_anti = pd.NamedAgg(column='num_items', aggfunc='sum')).reset_index()
print(pis_diar_and_anti.head())

   id  n_distinct_bnf
0   0               6
1   1               8
2   2               6
3   3               7
4   4               5
   id  n_total_items
0   0             21
1   1             36
2   2             22
3   3             41
4   4             31
   id  n_diar_and_anti
0   0                4
1   1                7
2   3                4
3   4                6
4   7                4


## SMR04 (psych admissions -- example where there are multiple episodes per stay)

This is a bit harder to generate, so here is a simple simulation.

- Each row is an _episode_. Multiple episodes make up a stay.
- The `cis_marker` field can be used to group episodes from one id into a stay. I.e. all rows for a (`id`, `cis_marker`) tuple correspond to a single stay.
- `episode_within_cis` tells us the order of the episodes within a stay
- I've added a random code to show what the features might be like

In [97]:
from datetime import timedelta
N_STAYS = 100
start_dates = random_dates(START, END, n=N_STAYS, seed=1, replace=False)
data_dict = {'id': [], 'admission_date': [], 'discharge_date': [], 'cis_marker': [], 'episode_within_cis': [], 'some_code': []}
cis_dict = {}
# Generate 100 stays in total
for stay_idx in range(N_STAYS):
    # pick a start date for the stay
    start_date = start_dates[stay_idx]

    # pick an ID for the stay
    id = np.random.choice(MAX_ID, 1)[0]

    # decide how many episodes to generate
    n_episodes = np.random.poisson(1) + 1 # + 1 so we don't get zero

    # Take the next cis_marker (cis = continuous integrated stay)
    # or generate a random starting marker if we haven't hit this ID before
    # Note that the absolute value of the cis_marker is irrelevant.
    # Also note that cis_marker values are not unique across the population,
    # only within a particular ID
    if id in cis_dict:
        cis_marker = cis_dict[id] + 1
        cis_dict[id] += 1
    else:
        cis_marker = np.random.choice(100) + 1
        cis_dict[id] = cis_marker

    # loop over the episodes
    for episode in range(n_episodes):
        episode_within_cis = episode + 1 # first episode in a stay has episode_within_cis = 1
        # First episode has stay start date as its start
        episode_start_date = start_date

        # Pick a duration in days (this can be zero as an episode might not take the whole day)
        episode_duration = np.random.poisson(1)
        episode_end_date = episode_start_date + timedelta(days=episode_duration)

        # Set start_date for the next episode to the end date of this one
        start_date = episode_end_date

        # Add a random code just to demonstrate the kind of features we might want
        code = np.random.choice(['a', 'b', 'c', 'd', 'e'])

        # Add the episode to the data dictionary
        data_dict['id'].append(id)
        data_dict['admission_date'].append(episode_start_date)
        data_dict['discharge_date'].append(episode_end_date)
        data_dict['episode_within_cis'].append(episode_within_cis)
        data_dict['some_code'].append(code)
        data_dict['cis_marker'].append(cis_marker)

# Put everything into a dataframe
smr04 = pd.DataFrame(data_dict, index=range(len(data_dict['id'])))
print(smr04.head())

   id admission_date discharge_date  cis_marker  episode_within_cis some_code
0   3     2017-11-17     2017-11-17          17                   1         e
1   5     2016-03-11     2016-03-12          50                   1         c
2   6     2017-12-10     2017-12-11          83                   1         b
3   6     2017-12-11     2017-12-14          83                   2         e
4   6     2017-12-14     2017-12-14          83                   3         a


### Example SMR04 features

Just some examples to hopefully cover the kind of things SPARRA uses.

One thing I haven't shown here: for episodic data like this you sometimes have to be careful when doing _date filtering_. SPARRA only includes information for _stays_ that start within the relevant time period. Therefore you cannot simply filter on episode dates; you need to either:
1. Keep all episodes where the first episode in the stay is within the date range, or
2. Merge into stays first and then filter.

To transform to one row per stay, you'd need to do things like take the admission date from the first episode and the discharge from the last etc.
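
The stay-level collapse and date filter described above can be sketched roughly as follows. This is only an illustration on a tiny hand-made episode table, not the SPARRA implementation: the `episodes` values, the column names `stay_start` / `stay_end` / `n_episodes`, and the window dates are all assumptions made for the example.

```python
import pandas as pd

# Tiny hand-made episode table with the same columns as smr04 (illustrative values)
episodes = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'cis_marker': [7, 7, 8, 3],
    'episode_within_cis': [1, 2, 1, 1],
    'admission_date': pd.to_datetime(['2015-12-30', '2016-01-02', '2016-06-01', '2016-03-10']),
    'discharge_date': pd.to_datetime(['2016-01-02', '2016-01-05', '2016-06-01', '2016-03-12']),
})

# One row per stay: admission from the first episode, discharge from the last
stays = (
    episodes.sort_values(['id', 'cis_marker', 'episode_within_cis'])
    .groupby(['id', 'cis_marker'])
    .agg(stay_start=pd.NamedAgg(column='admission_date', aggfunc='first'),
         stay_end=pd.NamedAgg(column='discharge_date', aggfunc='last'),
         n_episodes=pd.NamedAgg(column='episode_within_cis', aggfunc='count'))
    .reset_index()
)

# Filter on the stay start date, not on individual episode dates
# (window bounds are hypothetical)
window_start = pd.to_datetime('2016-01-01')
window_end = pd.to_datetime('2016-12-31')
stays_in_window = stays[stays['stay_start'].between(window_start, window_end)]
print(stays_in_window)
```

In this toy data the stay with `cis_marker` 7 is dropped even though its second episode falls inside the window, because the stay itself started before the window opened.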

Note that some SPARRA features classify a stay using codes from its first episode only (e.g. stays are classified as emergency / elective based just on information in their first episode), while others use codes appearing in any episode (e.g. alcohol-related stays are counted as the number of stays where _any_ episode within the stay has an alcohol diagnosis code).

In [86]:
# number of stays per id
# (lots of ways of calculating this) -- this is one example: count the distinct cis_marker
n_stays = smr04.groupby('id').agg(n_stays = pd.NamedAgg(column='cis_marker', aggfunc='nunique')).reset_index()
print(n_stays.head())
# total nights in hospital
smr04['length_of_stay'] = smr04['discharge_date'] - smr04['admission_date']
n_bed_days = smr04.groupby('id').agg(total_bed_days = pd.NamedAgg(column='length_of_stay', aggfunc='sum')).reset_index()
print(n_bed_days.head())
# number of stays where the first episode had code 'b'
n_stays_b = smr04.loc[(smr04['some_code'] == 'b') & (smr04['episode_within_cis'] == 1), :].groupby('id').agg(n_stays_b = pd.NamedAgg(column='cis_marker', aggfunc='nunique')).reset_index()
print(n_stays_b.head())
# number of bed days for stays where any episode has code 'c'
# ran out of time to implement -- but basically need to keep all _stays_ where _any_ episode in the stay has a code c and then compute number of bed days

   id  n_stays
0   0       14
1   1       10
2   2        3
3   3       12
4   4       10
   id total_bed_days
0   0        19 days
1   1        31 days
2   2         4 days
3   3        22 days
4   4        18 days
   id  n_stays_b
0   0          4
1   1          2
2   2          1
3   3          2
4   4          1
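
The last feature in the cell above (bed days for stays where _any_ episode has code 'c') was left unimplemented. One possible way to do it is sketched below on a small hand-made table, so the numbers do not correspond to the simulated `smr04` output. The helper columns `is_c` and `stay_has_c` and the feature name `bed_days_c` are made up for the illustration.

```python
import pandas as pd

# Small hand-made episode table (illustrative values, same column names as smr04)
episodes = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'cis_marker': [7, 7, 8, 3, 4],
    'some_code': ['a', 'c', 'b', 'c', 'a'],
    'admission_date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-06-01', '2016-03-10', '2016-09-01']),
    'discharge_date': pd.to_datetime(['2016-01-02', '2016-01-05', '2016-06-03', '2016-03-12', '2016-09-04']),
})
episodes['length_of_stay'] = episodes['discharge_date'] - episodes['admission_date']

# Flag every episode whose stay contains at least one episode with code 'c'
episodes['is_c'] = episodes['some_code'] == 'c'
episodes['stay_has_c'] = episodes.groupby(['id', 'cis_marker'])['is_c'].transform('any')

# Sum bed days over all episodes of the flagged stays, per id
bed_days_c = (
    episodes.loc[episodes['stay_has_c']]
    .groupby('id')
    .agg(bed_days_c=pd.NamedAgg(column='length_of_stay', aggfunc='sum'))
    .reset_index()
)
print(bed_days_c)
```

The key step is the grouped `transform('any')`, which broadcasts the per-stay flag back onto every episode, so episodes with other codes are still counted when they belong to a qualifying stay.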


## SPARRA LTC

This one is a bit different (but will work). The data has one row per ID and one column per Long Term Condition (LTC). A value is the date the person was first diagnosed if they have that LTC, and NA otherwise. There are 19 conditions; I've just done 3.

In [93]:
conditions = ['ASTHMA', 'DIABETES', 'PARKINSONS']
data_dict = {'id': [], 'ASTHMA': [], 'DIABETES': [], 'PARKINSONS': []}
start_dates = random_dates(START, END, n=MAX_ID*3, seed=1, replace=True)
date_pos = 0
for i in range(MAX_ID):
    data_dict['id'].append(i)
    for c_pos in range(len(conditions)):
        if np.random.rand() < 0.2:
            # they have this condition
            dat = start_dates[date_pos]
            date_pos += 1
            data_dict[conditions[c_pos]].append(dat)
        else:
            # pd.NaT keeps the column datetime-typed (np.NaN was removed in NumPy 2.0)
            data_dict[conditions[c_pos]].append(pd.NaT)
ltc_data = pd.DataFrame(data_dict, index=range(MAX_ID))

In [95]:
ltc_data

   id     ASTHMA   DIABETES PARKINSONS
0   0 2017-11-27        NaT        NaT
1   1        NaT 2015-08-24        NaT
2   2        NaT        NaT 2018-01-01
3   3        NaT 2017-06-24        NaT
4   4        NaT        NaT        NaT
5   5        NaT        NaT        NaT
6   6        NaT 2016-12-16        NaT
7   7        NaT        NaT        NaT
8   8        NaT 2017-04-27        NaT
9   9        NaT        NaT        NaT


Example features:
1. Years since the LTC diagnosis date (null if no date) - one feature per LTC, so there would be three in this case: years_since_asthma, years_since_diabetes etc
2. Number of LTCs (i.e. number of non-null values for each id (row))
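
A rough sketch of these two features, on a small hand-made LTC table. The `reference_date` (the date the "years since" is measured from) is an assumption, as is the use of 365.25 days per year; the real SPARRA definitions may differ.

```python
import pandas as pd

# Small hand-made LTC table in the same shape as ltc_data (illustrative values)
ltc = pd.DataFrame({
    'id': [0, 1, 2],
    'ASTHMA': pd.to_datetime(['2017-11-27', None, None]),
    'DIABETES': pd.to_datetime([None, '2015-08-24', None]),
    'PARKINSONS': pd.to_datetime(['2016-01-01', None, None]),
})
conditions = ['ASTHMA', 'DIABETES', 'PARKINSONS']

# Assumed reference date for "years since diagnosis" (hypothetical)
reference_date = pd.to_datetime('2018-04-01')

features = ltc[['id']].copy()
for c in conditions:
    # NaT diagnosis dates propagate to NaN, i.e. null when there is no diagnosis
    features['years_since_' + c.lower()] = (reference_date - ltc[c]).dt.days / 365.25

# Number of LTCs = number of non-null diagnosis dates in the row
features['n_ltc'] = ltc[conditions].notna().sum(axis=1)
print(features)
```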

## Indicated features
These combine PIS and LTC. E.g. for a particular ID, the feature starts at 0. If they have any prescriptions in a given set of BNF sections, add 1. If they have a date for a particular LTC, add 1. So the value will be 0, 1 or 2. There are five of these -- each with a different BNF and LTC combination.

This is the example where I think we'd want to be able to specify features in the json that combine multiple inputs, and the inputs would be the output of other transformations.

E.g.
1. transform 1: do they have a prescription in some BNF section(s) (1 = yes, 0 = no)
2. transform 2: do they have a date in a particular LTC (1 = yes, 0 = no)
3. transform 3: transform 1 + transform 2 (per id)

Concrete example: BNF = 0201 and LTC = diabetes. So we need a transformation that computes a "has_0201" feature (0 or 1 depending on whether they have any 0201 PIS rows or not) and a has_diabetes feature (0 or 1 depending on whether they have a null or a date in LTC).

We would then need a third transformer that would ADD the has_0201 and has_diabetes feature for each id.
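
In plain pandas, the three transforms for this concrete example might look something like the sketch below. It uses small hand-made `pis` and `ltc` tables and assumes the LTC table has exactly one row per id and covers every id of interest. The output column name `indicated_diabetes` is made up for the illustration.

```python
import pandas as pd

# Small hand-made inputs (illustrative values, same column names as pis_data / ltc_data)
pis = pd.DataFrame({
    'id': [0, 0, 1, 2],
    'bnf_section': ['0201', '0104', '0103', '0201'],
    'num_items': [2, 1, 3, 1],
})
ltc = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'DIABETES': pd.to_datetime(['2016-05-01', '2015-08-24', None, None]),
})

# Assumption: ltc has one row per id, so it can act as the list of all ids
indicated = ltc[['id']].copy()

# transform 1: any prescription in the BNF section(s) of interest (1 = yes, 0 = no)
ids_with_0201 = pis.loc[pis['bnf_section'].isin(['0201']), 'id'].unique()
indicated['has_0201'] = indicated['id'].isin(ids_with_0201).astype(int)

# transform 2: a diagnosis date is present for the LTC (1 = yes, 0 = no)
indicated['has_diabetes'] = ltc['DIABETES'].notna().astype(int)

# transform 3: add the two indicator features per id
indicated['indicated_diabetes'] = indicated['has_0201'] + indicated['has_diabetes']
print(indicated)
```

Note that id 3 has no PIS rows at all and still gets a 0 for `has_0201`, which is why the indicator is built from the full id list rather than from a groupby on the PIS table.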