# Preprocessing

Preprocess data from the [PhysioNet 2012 challenge](https://physionet.org/challenge/2012/). The `PhysioNetDataset` can be used as a normal PyTorch Dataset, but because processing takes around 1 second per example, it's more efficient to build the entire tensor once with `preprocess_all()` and save the resulting arrays. This is what happens here.

In [11]:
from pathlib import Path
import torch
import numpy as np
import pickle

In [2]:
from data import PhysioNetDataset

In [3]:
dataset = PhysioNetDataset()

In [4]:
input_array, outcomes = dataset.preprocess_all()

100%|██████████| 4000/4000 [1:03:37<00:00,  1.05it/s]


In [6]:
input_array.shape, outcomes.shape

(torch.Size([4000, 48, 37]), torch.Size([4000]))

In [8]:
np.save('physio_input.npy', input_array.numpy())
np.save('physio_outcomes.npy', outcomes.numpy())
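
Saving the arrays means a later session can reload them instead of repeating the hour-long preprocessing. A minimal sketch of that round trip follows. It uses small random stand-in arrays (8 examples rather than 4000, with the same trailing `(48, 37)` shape) and a temporary directory, both of which are assumptions made only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in data (assumption): 8 random examples instead of the real 4000,
# keeping the same trailing (48, 37) shape as the preprocessed tensor.
rng = np.random.default_rng(0)
input_array = rng.normal(size=(8, 48, 37)).astype(np.float32)
outcomes = rng.integers(0, 2, size=8).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "physio_input.npy"
    outcomes_path = Path(tmp) / "physio_outcomes.npy"

    # Same calls as in the cell above, pointed at the temporary directory.
    np.save(input_path, input_array)
    np.save(outcomes_path, outcomes)

    # Reload in a "later session".
    loaded_input = np.load(input_path)
    loaded_outcomes = np.load(outcomes_path)

# The reloaded arrays should match what was saved, element for element.
roundtrip_ok = (
    np.array_equal(loaded_input, input_array)
    and np.array_equal(loaded_outcomes, outcomes)
)
```

With the real files, `torch.from_numpy(np.load('physio_input.npy'))` would turn the loaded array back into a tensor.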

In addition, the dataset provides a `normalizing_dict`, which maps each feature to its index in the array and also stores each feature's mean and standard deviation.

In [10]:
normalizing_dict = dataset.get_normalizing_dict()

In [14]:
with open('physio_normalizing_dict.pkl', 'wb') as f:
    pickle.dump(normalizing_dict, f)
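
The pickled dictionary can be read back later, for example to convert normalized values into their original units. The sketch below is illustrative only. The dictionary layout (feature name mapped to `index`, `mean`, and `std` keys), the numbers in it, and the `denormalize` helper are all hypothetical, and it assumes values were standardized as `(x - mean) / std`. The structure actually returned by `get_normalizing_dict()` may differ.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical layout and made-up statistics (assumption): the real
# normalizing_dict may use different keys and nesting.
normalizing_dict = {
    "HR": {"index": 0, "mean": 80.0, "std": 20.0},
    "Temp": {"index": 1, "mean": 37.0, "std": 0.5},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "physio_normalizing_dict.pkl"

    # Write the dictionary, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(normalizing_dict, f)

    # Read it back, as a later session would.
    with open(path, "rb") as f:
        loaded = pickle.load(f)


def denormalize(value, feature, stats):
    """Undo (x - mean) / std for one feature (hypothetical helper)."""
    entry = stats[feature]
    return value * entry["std"] + entry["mean"]


# A normalized heart rate of 1.5 maps back to 1.5 * 20.0 + 80.0.
hr_original_units = denormalize(1.5, "HR", loaded)
```

Only unpickle files you created yourself or otherwise trust, since `pickle.load` can execute arbitrary code.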