# Data archiving

As the data used in this paper is confidential, it cannot be published.

In its place, we synthesise two datasets: one for length of stay and one for interarrival times. Each is generated by fitting a Gaussian kernel density estimate (KDE) to the real values within each cluster and sampling from it. Both datasets are archived at: https://zenodo.org/record/3908167

In [1]:
import pandas as pd
from scipy import stats

In [2]:
copd = pd.read_csv(
    "../data/clusters/copd_clustered.csv", parse_dates=["admission_date"]
)

copd = copd.dropna(subset=["cluster"])
copd["cluster"] = copd["cluster"].astype(int)

In [3]:
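def synthesise_column(data, column_to_synthesise, column_name, seed=0):
    """Synthesise a column via per-cluster KDE. Return the artificial dataset."""

    dfs = []
    for cluster, values in data.groupby("cluster")[column_to_synthesise]:

        df = pd.DataFrame()
        # Fit a Gaussian KDE to the real values in this cluster.
        kernel = stats.gaussian_kde(values)

        # With no size given, `resample` draws as many points as the effective
        # number of observations. It returns a 2D array, so take the first row.
        df[column_name] = kernel.resample(seed=seed)[0]
        df["cluster"] = cluster

        dfs.append(df)

    # Stack the per-cluster samples into a single dataset.
    synth = pd.concat(dfs, ignore_index=True)
    return synth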
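
To see what the per-cluster resampling produces, here is a minimal sketch on toy data. The frame `toy`, its `value` column and the cluster sizes are hypothetical, and the sample size is passed explicitly rather than relying on the default used above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: two clusters with different typical values.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "cluster": [0] * 60 + [1] * 40,
        "value": np.concatenate(
            [rng.exponential(2.0, size=60), rng.exponential(10.0, size=40)]
        ),
    }
)

parts = []
for cluster, values in toy.groupby("cluster")["value"]:
    # One Gaussian KDE per cluster, fitted only to that cluster's values.
    kernel = stats.gaussian_kde(values)
    # Draw one synthetic value per real observation in the cluster.
    sample = kernel.resample(size=len(values), seed=0)[0]
    parts.append(pd.DataFrame({"synthetic": sample, "cluster": cluster}))

toy_synth = pd.concat(parts, ignore_index=True)
cluster_sizes = toy_synth["cluster"].value_counts().sort_index()
print(cluster_sizes)
```

The synthetic frame keeps the cluster sizes of the input, which is why the untrimmed synthetic datasets in this notebook have the same number of rows as the real one.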

## Length of stay

In [4]:
synth_los = synthesise_column(copd, "true_los", "los")

As the summary below shows, the synthetic data contains negative lengths of stay.

The real dataset also contains a few slightly negative values, so some are acceptable, but we trim the bottom 6% of the synthetic values from the final dataset.

In [5]:
synth_los["los"].describe()

count    10877.000000
mean         7.620979
std         12.457723
min         -2.523948
25%          1.477098
50%          4.231919
75%          8.781526
max        243.224711
Name: los, dtype: float64
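
Negative values are an expected side effect of a Gaussian KDE: each kernel has unbounded support, so observations close to zero spread some probability mass below zero. The sketch below illustrates this on a hypothetical non-negative sample (an exponential draw standing in for lengths of stay, not the real data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical right-skewed, strictly non-negative sample.
observed = rng.exponential(scale=5.0, size=2000)

kernel = stats.gaussian_kde(observed)
synthetic = kernel.resample(size=2000, seed=1)[0]

# The input has no negative values, but the resampled data does.
print(observed.min() >= 0)
print((synthetic < 0).mean())  # fraction of synthetic values below zero
```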

In [6]:
trimmed_los = synth_los[synth_los["los"] >= synth_los["los"].quantile(0.06)]
pd.concat((copd["true_los"].describe(), trimmed_los["los"].describe()), axis=1)

           true_los           los
count  10877.0       10224.0
mean       7.70212       8.138036
std       11.861053     12.674558
min       -0.020833     -0.029627
25%        1.491667      1.913583
50%        4.195139      4.66067
75%        8.930556      9.250386
max      224.927778    243.224711
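
The row counts above follow from the quantile rule: keeping values at or above the 0.06 quantile retains roughly 94% of the rows (10224 of 10877 here). A minimal sketch of the same rule on a hypothetical column of 1000 values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical synthetic column; the distribution is arbitrary.
toy = pd.DataFrame({"los": rng.normal(loc=5.0, scale=3.0, size=1000)})

# Keep everything at or above the 6% quantile, as in the cell above.
cutoff = toy["los"].quantile(0.06)
trimmed = toy[toy["los"] >= cutoff]

kept_fraction = len(trimmed) / len(toy)
print(len(trimmed), kept_fraction)
```

Because the cutoff is relative rather than fixed at zero, slightly negative values can survive the trim, as the `min` row of the table shows.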


In [7]:
trimmed_los.to_csv("../data/synthetic/los.csv", index=False)

## Interarrival times

In [8]:
sorted_arrivals = copd.set_index("admission_date").sort_index()
sorted_clusters = sorted_arrivals["cluster"]

sorted_diffs = (
    sorted_arrivals.index.to_series()
    .diff()
    .dt.total_seconds()
    .div(24 * 60 * 60)
    .fillna(0)
)

diffs = pd.concat((sorted_diffs, sorted_clusters), axis=1)
diffs.columns = ["true_diff", "cluster"]
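
To make the construction concrete, the sketch below repeats it on three hypothetical admissions. It assumes the same column names as above; the timestamps and clusters are invented for illustration.

```python
import pandas as pd

# Hypothetical admissions, deliberately out of order.
toy = pd.DataFrame(
    {
        "admission_date": pd.to_datetime(
            ["2020-01-02 12:00", "2020-01-01 00:00", "2020-01-01 06:00"]
        ),
        "cluster": [1, 0, 1],
    }
)

ordered = toy.set_index("admission_date").sort_index()
gaps = (
    ordered.index.to_series()
    .diff()                  # time since the previous admission
    .dt.total_seconds()
    .div(24 * 60 * 60)       # seconds to days
    .fillna(0)               # the first admission has no predecessor
)

toy_diffs = pd.concat((gaps, ordered["cluster"]), axis=1)
toy_diffs.columns = ["true_diff", "cluster"]
print(toy_diffs)
```

Note that the gaps are measured between consecutive admissions across all patients, in days, and each gap is labelled with the cluster of the later arrival.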

In [9]:
synth_diffs = synthesise_column(diffs, "true_diff", "diff")

Again, the synthetic data contains some negative values. A negative interarrival time is meaningless, so we remove all of them.

In [10]:
synth_diffs["diff"].describe()

count    10877.000000
mean         0.274842
std          0.530319
min         -0.621404
25%          0.055015
50%          0.161872
75%          0.394376
max         25.452923
Name: diff, dtype: float64

In [11]:
trimmed_diffs = synth_diffs[synth_diffs["diff"] >= 0]
pd.concat((diffs["true_diff"].describe(), trimmed_diffs["diff"].describe()), axis=1)

          true_diff         diff
count  10877.0       9640.0
mean       0.273812     0.318118
std        0.399713     0.547828
min        0.0          5.6e-05
25%        0.053472     0.086876
50%        0.149306     0.195984
75%        0.395833     0.440737
max       25.152778    25.452923


In [12]:
trimmed_diffs.to_csv("../data/synthetic/diffs.csv", index=False)