# Experiment with memory efficient featurizers

In [None]:
import os
import sys
import datetime
import pandas as pd
import pickle as pkl
import numpy as np
import scipy.sparse  # explicit submodule import; save_npz is used below

%load_ext autoreload
%autoreload 2

In [None]:
sys.path.append('/src')
from shared import utils
from shared.load_raw_data import load_raw_data_from_files
from shared.featurizers import Featurizer1
from eliot import to_file
to_file(sys.stdout)

In [None]:
data_dict = load_raw_data_from_files('/data/raw')
prediction_times = utils.get_prediction_timestamps(
                        data_dict['stays'],
                        "07:00:00"
                    )
# Save for later use 
with open("/data/processed/prediction_times.pkl", 'wb') as f_out:
    pkl.dump(prediction_times, file=f_out)

In [None]:
%%time

featurizer1 = Featurizer1(data_dict, prediction_times)
feature1_csr, feature1_names = featurizer1.process()

# Save for later use 
scipy.sparse.save_npz('/data/processed/features_csr.npz', feature1_csr)
with open('/data/processed/feature_colnames.pkl', 'wb') as f_out:
    pkl.dump(feature1_names, file=f_out)


### Construct prediction times
The current approach to building prediction times has gaps, so we will build them from the stays, which are derived from the daily census and the transfer logs.
The goal is an explicit model of when predictions are made, so that we can construct the problem carefully.
We will use <code>utils.get_prediction_timestamps</code>.

* First, assume the daily census records are produced by an automated process that runs just before midnight each day. This seems consistent with the transfer logs (it is harder to say what is going on at the other end).
* Second, specify an explicit <code>time_of_day</code> at which predictions are run each day.
* A prediction time is valid when two conditions hold: the patient was in the census from the prior night, and the patient was not transferred out before <code>time_of_day</code>.

This suggests that the easiest way to build prediction times is from our stays: 

* As before, use <code>patient_stays.get_patient_stays</code> to get patient stays.  This just splits continuous runs of daily census records using transfer log events.  
* Throw out the first day of each stay. 
* If the stay ends with a transfer timestamp, check whether it falls before <code>time_of_day</code>. If so, drop the last day as a valid prediction time. Otherwise, include it.

It is better to build prediction times constructively from stays than to filter proposed prediction times built from the census, as we were doing previously.
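
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual body of <code>utils.get_prediction_timestamps</code>: the function name <code>sketch_prediction_times</code> and the column names <code>patient_id</code>, <code>first_day</code>, <code>last_day</code>, and <code>transfer_out_time</code> are assumptions made for the example, as is the reading of "last day" as the final calendar day on which a prediction could be run for the stay.

```python
import pandas as pd


def sketch_prediction_times(stays: pd.DataFrame, time_of_day: str) -> pd.DataFrame:
    """Build one row per (patient, prediction timestamp) from a table of stays.

    Hypothetical schema: patient_id, first_day, last_day, transfer_out_time
    (transfer_out_time is NaT when the stay has no recorded transfer out).
    """
    offset = pd.Timedelta(time_of_day)  # "07:00:00" -> 7 hours after midnight
    rows = []
    for stay in stays.itertuples(index=False):
        first_day = pd.Timestamp(stay.first_day).normalize()
        last_day = pd.Timestamp(stay.last_day).normalize()
        # Throw out the first day of the stay: candidates start one day later.
        candidate_days = pd.date_range(first_day + pd.Timedelta(days=1), last_day, freq="D")
        for day in candidate_days:
            prediction_time = day + offset
            transferred_early = (
                pd.notna(stay.transfer_out_time)
                and stay.transfer_out_time < prediction_time
            )
            # Drop the last day if the patient left before time_of_day.
            if day == last_day and transferred_early:
                continue
            rows.append({"patient_id": stay.patient_id, "prediction_time": prediction_time})
    return pd.DataFrame(rows, columns=["patient_id", "prediction_time"])


# Toy data: patient A is transferred out at 05:30 on the last day, B has no transfer.
example_stays = pd.DataFrame({
    "patient_id": ["A", "B"],
    "first_day": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "last_day": pd.to_datetime(["2020-01-03", "2020-01-03"]),
    "transfer_out_time": pd.to_datetime(["2020-01-03 05:30:00", None]),
})
example_times = sketch_prediction_times(example_stays, "07:00:00")
```

With this toy input, patient A keeps only the prediction on the second day, because the transfer at 05:30 precedes the 07:00 prediction on the last day, while patient B keeps both the second and third days. The explicit loop favors readability over speed; a vectorized version would likely be preferable for a full census.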