# PACE Tutorial 3 (supplemental): Create the Synthetic APC database 

## Prerequisites

For the rest of the tutorial, we assume that:
  - Postgres is installed on your system ([download and install Postgres for your platform](https://www.postgresql.org/download/))
  - The Postgres server is running (instructions for starting the server are platform specific: consult the documentation for your platform)
  - A database named `db` has been created, and is owned by the local user
    - This may involve running `createuser <your_username>` as the database administrator, followed by `createdb db`
    
For further details, see the [Postgres documentation](https://www.postgresql.org/docs/).
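
The assumptions above can be checked from Python before continuing. The following is a minimal sketch, not part of the original tutorial: the helper names `check_command` and `prerequisites_ok` are hypothetical, and it assumes that `psql` can connect to `db` as the local user without a password prompt.

```python
import shutil
import subprocess


def check_command(dbname="db"):
    """Build a psql invocation that runs a trivial query against `dbname`."""
    # --no-password makes psql fail instead of waiting for a password prompt
    return ["psql", "--dbname", dbname, "--no-password", "-c", "SELECT 1"]


def prerequisites_ok(dbname="db"):
    """Return True if psql is on PATH and can run a query against `dbname`."""
    if shutil.which("psql") is None:
        return False  # Postgres client tools are not installed or not on PATH
    result = subprocess.run(check_command(dbname), capture_output=True, text=True)
    return result.returncode == 0


print("prerequisites satisfied:", prerequisites_ok())
```

If this prints `False`, revisit the list above: the most likely causes are that the server is not running or that the `db` database has not been created.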

## Create the schema and table

The following SQL script creates the schema and table used in the database example:

In [None]:
%%writefile create-diag-example.sql

DROP TABLE IF EXISTS diag_example.synth_apc CASCADE;

CREATE SCHEMA IF NOT EXISTS diag_example;
CREATE TABLE diag_example.synth_apc (
    "Key" SERIAL PRIMARY KEY,
    "DIAG_01" VARCHAR,
    "DIAG_02" VARCHAR,
    "DIAG_03" VARCHAR,
    "DIAG_04" VARCHAR,
    "DIAG_05" VARCHAR,
    "DIAG_06" VARCHAR,
    "DIAG_07" VARCHAR,
    "DIAG_08" VARCHAR,
    "DIAG_09" VARCHAR,
    "DIAG_10" VARCHAR,
    "ADMIAGE" INTEGER,
    "ADMIMETH" VARCHAR,
    "Mortality" INTEGER,
    "PROCODE3" VARCHAR,
    "SEX" INTEGER
);

\copy diag_example.synth_apc ("DIAG_01", "DIAG_02", "DIAG_03", "DIAG_04", "DIAG_05", "DIAG_06", "DIAG_07", "DIAG_08", "DIAG_09", "DIAG_10", "ADMIAGE", "ADMIMETH", "Mortality", "PROCODE3", "SEX") FROM '../examples/datasets/Synthetic_APC_DIAG_Fields.csv' DELIMITER ',' CSV HEADER

<div class="alert alert-success">If the command below fails, make sure that the database <a href=#Prerequisites>prerequisites</a> are satisfied.</div>

In [None]:
!psql --dbname db -f create-diag-example.sql

If the above command succeeds, `psql` prints a status line for each command that is run, finishing with `COPY 200`.
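
To check the result programmatically rather than by eye, the final `COPY <n>` line can be parsed from the captured output. This is a small sketch with a hypothetical helper name (`copied_rows`); the sample output string is illustrative, and the exact lines printed by your `psql` version may differ.

```python
import re


def copied_rows(psql_output):
    """Return the row count from the last `COPY <n>` line, or None if absent."""
    matches = re.findall(r"^COPY (\d+)$", psql_output, flags=re.MULTILINE)
    return int(matches[-1]) if matches else None


# Illustrative output for a successful run (an assumption, not captured output)
sample = "DROP TABLE\nCREATE SCHEMA\nCREATE TABLE\nCOPY 200\n"
print(copied_rows(sample))
```

A return value of `None` means no `COPY` line was found, which suggests that the load did not complete.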

It should now be possible to continue with [Tutorial 3](./Tutorial%203%20-%20Loading%20data%20from%20Postgres.ipynb).

## (Optional) create an upsampled synthetic dataset

This section creates a dataset of 10 million rows by sampling with replacement from the dataset above, which is useful for testing and performance comparison. The result is loaded into a table named `synth_apc_10_7`.
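
Sampling with replacement means that every draw picks from the full set of source rows, so the result can be much larger than the source and rows repeat. The sketch below illustrates the idea on toy data using only the standard library. The `upsample` helper and the row values are hypothetical; the notebook itself uses pandas for this step.

```python
import random


def upsample(rows, n, seed=None):
    """Draw n rows uniformly at random from `rows`, with replacement."""
    rng = random.Random(seed)
    return rng.choices(rows, k=n)


# Toy stand-in for the source rows (hypothetical values, not real data)
rows = [("A01", 54), ("B02", 71), ("C03", 38)]
large = upsample(rows, n=10, seed=0)

print(len(large))               # 10: more rows than the source contains
print(set(large) <= set(rows))  # True: every sampled row comes from the source
```

Passing a fixed `seed` makes the draw repeatable, which can help when comparing performance runs against the same data.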

In [None]:
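import pandas as pd

# Read the original synthetic dataset
df = pd.read_csv("../examples/datasets/Synthetic_APC_DIAG_Fields.csv")
# Draw 10 million rows with replacement and renumber the index from zero
df_large = df.sample(n=10_000_000, replace=True).reset_index(drop=True)
df_large.tail()

# Write the upsampled rows to a CSV file for loading with \copy
df_large.to_csv("./Synthetic_APC_DIAG_Fields_10_7.csv", index=False)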

In [None]:
%%writefile create-diag-large-example.sql

DROP TABLE IF EXISTS diag_example.synth_apc_10_7 CASCADE;

CREATE TABLE diag_example.synth_apc_10_7 (
    LIKE diag_example.synth_apc INCLUDING ALL
);

\copy diag_example.synth_apc_10_7 ("DIAG_01", "DIAG_02", "DIAG_03", "DIAG_04", "DIAG_05", "DIAG_06", "DIAG_07", "DIAG_08", "DIAG_09", "DIAG_10", "ADMIAGE", "ADMIMETH", "Mortality", "PROCODE3", "SEX") FROM './Synthetic_APC_DIAG_Fields_10_7.csv' DELIMITER ',' CSV HEADER

In [None]:
!psql --dbname db -f create-diag-large-example.sql
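
Once the load has finished, the row count of the new table can be confirmed with a `count(*)` query. The following is a minimal sketch under the same assumptions as the rest of this notebook (a local database named `db`, reachable without a password prompt); the helper names `count_command` and `row_count` are hypothetical.

```python
import shutil
import subprocess


def count_command(table, dbname="db"):
    """Build a psql invocation that prints only the row count of `table`."""
    query = f"SELECT count(*) FROM {table}"
    # --tuples-only and --no-align strip headers and padding from the output
    return ["psql", "--dbname", dbname, "--no-password",
            "--tuples-only", "--no-align", "-c", query]


def row_count(table, dbname="db"):
    """Return the number of rows in `table`, or None if the query cannot be run."""
    if shutil.which("psql") is None:
        return None  # psql is not installed or not on PATH
    result = subprocess.run(count_command(table, dbname), capture_output=True, text=True)
    if result.returncode != 0:
        return None  # connection failed or the table does not exist
    return int(result.stdout.strip())


print(row_count("diag_example.synth_apc_10_7"))
```

The printed count should match the number of rows written to `Synthetic_APC_DIAG_Fields_10_7.csv`. A result of `None` indicates that the query could not be run.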