# Preprocessing

Preprocess data from the [PhysioNet 2012 challenge](https://physionet.org/challenge/2012/). The `PhysioNetDataset` can be used as a normal PyTorch `Dataset`, but because processing takes around 1 second per example, it's more efficient to build the entire tensor once with `preprocess_all()` and save the resulting array (especially because `preprocess_all()` takes advantage of multiprocessing). This notebook does exactly that.

In [1]:
from pathlib import Path
import torch
import numpy as np
import pickle

In [2]:
from data import PhysioNetDataset

In [3]:
dataset = PhysioNetDataset()

In [4]:
input_array, outcomes = dataset.preprocess_all()

Processed 100 records
Processed 200 records
Processed 300 records
Processed 400 records
Processed 500 records
Processed 600 records
Processed 700 records
Processed 800 records
Processed 900 records
Processed 1000 records
Processed 1100 records
Processed 1200 records
Processed 1300 records
Processed 1400 records
Processed 1500 records
Processed 1600 records
Processed 1700 records
Processed 1800 records
Processed 1900 records
Processed 2000 records
Processed 2100 records
Processed 2200 records
Processed 2300 records
Processed 2400 records
Processed 2500 records
Processed 2600 records
Processed 2700 records
Processed 2800 records
Processed 2900 records
Processed 3000 records
Processed 3100 records
Processed 3200 records
Processed 3300 records
Processed 3400 records
Processed 3500 records
Processed 3600 records
Processed 3700 records
Processed 3800 records
Processed 3900 records
Processed 4000 records


In [5]:
input_array.shape, outcomes.shape

(torch.Size([4000, 48, 37]), torch.Size([4000]))

In [6]:
np.save('physio_input.npy', input_array.numpy())
np.save('physio_outcomes.npy', outcomes.numpy())
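Once the arrays are on disk, later runs can skip preprocessing and load them directly. The sketch below shows one way to do this. It is not part of the original notebook: it uses small stand-in tensors and a temporary directory so that it runs on its own, and wrapping the result in a `TensorDataset` is an assumption about how the arrays might be consumed downstream.

```python
import tempfile
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import TensorDataset

# Stand-in tensors: 8 records instead of 4000, with the same
# (time steps, features) layout as the shapes reported above.
input_array = torch.zeros(8, 48, 37)
outcomes = torch.zeros(8)

with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "physio_input.npy"
    outcomes_path = Path(tmp) / "physio_outcomes.npy"

    # Same save calls as in the cell above, written to a temporary directory.
    np.save(input_path, input_array.numpy())
    np.save(outcomes_path, outcomes.numpy())

    # Reload the arrays and convert them back to tensors.
    loaded_input = torch.from_numpy(np.load(input_path))
    loaded_outcomes = torch.from_numpy(np.load(outcomes_path))

# Pair each record with its outcome so a DataLoader can batch them.
tensor_dataset = TensorDataset(loaded_input, loaded_outcomes)
```

With the real files, the two `np.load` calls would point at `physio_input.npy` and `physio_outcomes.npy` instead of the temporary paths.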

In addition, `normalizing_dict` maps each feature to its index along the last axis of the array. The dictionary also stores each feature's mean and standard deviation.

In [7]:
normalizing_dict = dataset.get_normalizing_dict()

In [8]:
with open('physio_normalizing_dict.pkl', 'wb') as f:
    pickle.dump(normalizing_dict, f)
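The saved dictionary can later be used to look up a feature's column and map it back to its original units. The following is only a sketch under stated assumptions: the exact layout of `normalizing_dict` is not shown in this notebook, so the nested `index`/`mean`/`std` keys, the feature names, and the `denormalize` helper are all hypothetical, as is the assumption that the array holds standardised values.

```python
import pickle

import numpy as np

# Hypothetical layout: one entry per feature holding its column index,
# mean, and standard deviation. The real dictionary may be organised differently.
normalizing_dict = {
    "HR": {"index": 0, "mean": 80.0, "std": 10.0},
    "Temp": {"index": 1, "mean": 37.0, "std": 0.5},
}

# Round-trip through pickle, mirroring the save step above without touching disk.
restored = pickle.loads(pickle.dumps(normalizing_dict))

# Toy standardised array: 2 records, 1 time step, 2 features.
normalized = np.array([[[1.0, -2.0]], [[0.0, 0.0]]])


def denormalize(array, feature, stats):
    """Return one feature's values in original units (hypothetical helper)."""
    entry = stats[feature]
    return array[..., entry["index"]] * entry["std"] + entry["mean"]


heart_rate = denormalize(normalized, "HR", restored)
temperature = denormalize(normalized, "Temp", restored)
```

If the real dictionary uses different keys, only the lookups inside `denormalize` need to change.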