# Consolidate data (BATCH SPLIT)

<div style="text-align: right"> Author: Vassil Dimitrov </div>
<div style="text-align: right"> Date: 2023-08-01 </div>

As indicated in **2_Consolidate_data_1**, the data will be split into batches. The batches from all available tables except `encounters` will then be merged into a single consolidated dataframe per batch, with the encounters and their corresponding values arranged by time of occurrence for each patient.

## Prep

### Load libraries

In [8]:
import numpy as np
import pandas as pd
import math

## Functions

### `PATIENT` to index & explode

In [2]:
def patient_index_explode(tab):
    # Number each patient's rows (1, 2, ...) in a new column 'encounter_n'
    # so that repeated rows for a patient can be pivoted into separate columns
    tab['encounter_n'] = tab.groupby('PATIENT').cumcount()+1
    tab = tab.pivot_table(index='PATIENT',
                          columns='encounter_n',
                          aggfunc='first')
    # Flatten the MultiIndex columns
    tab.columns = [f'{col[0]}_{col[1]}' for col in tab.columns]
    # Fill in nulls with 0
    tab.fillna(0, inplace=True)
    # Convert to uint32
    tab=tab.astype('uint32')
    sparse_dtype = pd.SparseDtype(np.uint32, fill_value=0)
    tab = tab.astype(sparse_dtype)
    return tab
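
To make the reshaping concrete, here is a minimal sketch that repeats the same steps as `patient_index_explode` on a made-up three-row table. The `CODE` column and the patient IDs `a` and `b` are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy input (hypothetical): patient 'a' has two rows, patient 'b' has one
toy = pd.DataFrame({'PATIENT': ['a', 'a', 'b'],
                    'CODE': [5, 7, 9]})

# Same steps as patient_index_explode
toy['encounter_n'] = toy.groupby('PATIENT').cumcount() + 1
wide = toy.pivot_table(index='PATIENT', columns='encounter_n', aggfunc='first')
wide.columns = [f'{col[0]}_{col[1]}' for col in wide.columns]
wide = wide.fillna(0).astype('uint32')
wide = wide.astype(pd.SparseDtype(np.uint32, fill_value=0))

# One row per patient, one column per (feature, encounter number);
# the missing second encounter of patient 'b' is stored as 0
dense = wide.sparse.to_dense()
print(dense)
```

Each patient ends up on a single row, and patients with fewer encounters than the maximum are padded with zeros, which is why the sparse dtype pays off on the real tables.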

---

### Read & Tidy

In [3]:
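def read_n_tidy(tab_name_pkl):
    # Load a pickled table, drop duplicate rows and replace nulls with 0
    tab = pd.read_pickle(tab_name_pkl)
    tab.drop_duplicates(inplace=True)
    tab.fillna(0, inplace=True)
    # Densify every column except 'PATIENT' and cast it to uint32
    # (the .sparse accessor requires these columns to have a sparse dtype)
    for col in tab.columns:
        print(col)  # progress log
        if col == 'PATIENT':
            continue
        tab[col] = tab[col].sparse.to_dense().astype(np.uint32)
    return tab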

---
---

## Split `patients` to batches

### Read in `patients` table

In [4]:
patients = pd.read_pickle('patients2.pkl')
# Sanity check:
print('Patients dimensions:', patients.shape)

Patients dimensions: (1192923, 29)


In [5]:
display(patients.head())

PATIENT,MARITAL,GENDER,m_since_birth,RACE_asian,RACE_black,RACE_hispanic,RACE_native,RACE_white,ETHNICITY_african,ETHNICITY_american,...,ETHNICITY_irish,ETHNICITY_italian,ETHNICITY_mexican,ETHNICITY_polish,ETHNICITY_portuguese,ETHNICITY_puerto_rican,ETHNICITY_russian,ETHNICITY_scottish,ETHNICITY_swedish,ETHNICITY_west_indian
d3ae0579-ac2c-48b0-a0c1-a858b63e3b99,0,0,50,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
6304009d-9c80-44f0-ae72-4c42eb5e8f38,0,0,134,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
927cdcb7-adfb-4968-b5d4-745764eb931e,0,1,254,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
47c8c88a-7c03-48d8-85d1-3f37299cac61,0,1,304,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
971c0a8b-e356-48d1-8de5-e0f9688b10cf,1,0,395,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


The table contains 1,192,923 entries (patients) with 29 features associated with each patient.

### Split `patients` in batches

The `patients` table will be split into 10 batches of roughly 120,000 patients each (119,293 per batch, slightly fewer in the last one). Each of batches 1-7 will be used to train and optimize an RNN, so it will be further split into train and validation subsets. A meta-model will then be constructed that takes as input the outputs of the independently trained RNNs. Overall, the process will follow the workflow below:
1. batch1
    - RNN_1 (119,293)
    - train
    - validation (for optimization)
2. batch2
    - RNN_2 (119,293)
    - train
    - validation (for optimization)
3. batch3
    - RNN_3 (119,293)
    - train
    - validation (for optimization)
4. batch4
    - RNN_4 (119,293)
    - train
    - validation (for optimization)
5. batch5
    - RNN_5 (119,293)
    - train
    - validation (for optimization)
6. batch6
    - RNN_6 (119,293)
    - train
    - validation (for optimization)
7. batch7
    - RNN_7 (119,293)
    - train
    - validation (for optimization)
8. batch8
    - Meta model performance assessment (119,293)  
9. batch9 & 10
    - Meta model incorporating all RNN outputs
    - train (batch 1-7)
    - validation (238,579 for optimization)
  
Essentially, the shuffled table is cut into 10 consecutive slices, so that each batch contains approximately 120,000 patients.

>**Important:** before the batches are obtained, the data will be shuffled three times so that no underlying ordering pattern is captured when distributing patients to batches for training. Even if some pattern remained, the ensemble machine learning method adopted here should largely mitigate its effect.

#### Reshuffle data

The data will be randomly reshuffled three times before batching.

In [26]:
# Reshuffle 1
patients = patients.sample(frac=1)
# Reshuffle 2
patients = patients.sample(frac=1)
# Reshuffle 3
patients = patients.sample(frac=1)
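
As a possible variant (not what was run here), the shuffle can be made reproducible by passing `random_state` to `sample`. A single call to `sample(frac=1)` already returns a random permutation of all rows, so the sketch below uses one call only. The toy frame and the seed value are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for `patients` (hypothetical IDs and values)
toy = pd.DataFrame({'m_since_birth': [50, 134, 254, 304, 395]},
                   index=pd.Index(['p1', 'p2', 'p3', 'p4', 'p5'], name='PATIENT'))

SEED = 42  # arbitrary, assumed seed; any fixed integer works

shuffled_a = toy.sample(frac=1, random_state=SEED)
shuffled_b = toy.sample(frac=1, random_state=SEED)

# Same seed gives the same order, and every row is kept exactly once
same_order = shuffled_a.index.equals(shuffled_b.index)
same_rows = sorted(shuffled_a.index) == sorted(toy.index)
print(same_order, same_rows)
```

With a fixed seed, re-running the notebook would assign each patient to the same batch, which makes the saved pickles reproducible.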

#### Split data into batches and save

In [27]:
# Calculate the total number of rows in the DataFrame
total_rows = len(patients)
print('total entries --', total_rows)

# Number of batches = 10
num_batches = 10
print('number of batches --', num_batches)

# Determine the batch size (rows per batch, rounded up)
batch_size = math.ceil(total_rows / num_batches)
print('batch_size --', batch_size)

# Create an empty list to store the batches
batches = []

# Loop through the DataFrame and create batches
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = start_idx + batch_size
    batch = patients.iloc[start_idx:end_idx]
    batches.append(batch)

total entries -- 1192923
number of batches -- 10
batch_size -- 119293
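
The printed batch size can be cross-checked against the figures quoted in the workflow. The short sketch below redoes the arithmetic from the table dimensions alone; the variable names `last_batch` and `meta_validation` are introduced here for illustration.

```python
import math

total_rows = 1_192_923   # rows in `patients`, from the shape printed earlier
num_batches = 10

# Rows per batch, rounded up so that 10 batches cover every row
batch_size = math.ceil(total_rows / num_batches)

# The last batch receives whatever is left after 9 full batches
last_batch = total_rows - (num_batches - 1) * batch_size

# Batches 9 and 10 together form the meta-model validation set
meta_validation = batch_size + last_batch

print(batch_size, last_batch, meta_validation)
# 119293 119286 238579
```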


In [28]:
# Sanity check:
for i in batches:
    print(i.shape)

(119293, 29)
(119293, 29)
(119293, 29)
(119293, 29)
(119293, 29)
(119293, 29)
(119293, 29)
(119293, 29)
(119293, 29)
(119286, 29)


In [29]:
# Sanity check 2: does any patient in batch 1 also appear in batch 2?
# (False means the two batches do not overlap)
batches[0].index.isin(batches[1].index).any()

False
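
The same idea extends to every pair of batches. The sketch below, run on a small made-up table with hypothetical patient IDs, checks that no patient appears in two batches and that the batches together contain every patient.

```python
from itertools import combinations
import pandas as pd

# Toy stand-in for the shuffled `patients` table (hypothetical IDs)
toy = pd.DataFrame({'GENDER': [0, 1, 0, 1, 1, 0, 1]},
                   index=pd.Index([f'p{i}' for i in range(7)], name='PATIENT'))

# Consecutive slices, as in the batching loop: 3 batches of at most 3 rows
toy_batches = [toy.iloc[i:i + 3] for i in range(0, len(toy), 3)]

# No patient may appear in two different batches
pairwise_disjoint = all(a.index.intersection(b.index).empty
                        for a, b in combinations(toy_batches, 2))

# Concatenating the batches must give back the full table index
covers_all = pd.concat(toy_batches).index.equals(toy.index)

print(pairwise_disjoint, covers_all)
```

On the real data the list comprehension would simply be replaced by the existing `batches` list.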

Now extract each dataframe from `batches`, bind it to its own variable, and save it as a pickle file.

In [30]:
for i, df in enumerate(batches):
    df_name = f'patients_batch{i+1}'
    # Bind the batch to a notebook-level variable without resorting to exec
    globals()[df_name] = df
    print(f'Finished extracting {df_name}...')
    df.to_pickle(f'{df_name}.pkl')
    print(f'Finished saving {df_name}.')

Finished extracting patients_batch1...
Finished saving patients_batch1.
Finished extracting patients_batch2...
Finished saving patients_batch2.
Finished extracting patients_batch3...
Finished saving patients_batch3.
Finished extracting patients_batch4...
Finished saving patients_batch4.
Finished extracting patients_batch5...
Finished saving patients_batch5.
Finished extracting patients_batch6...
Finished saving patients_batch6.
Finished extracting patients_batch7...
Finished saving patients_batch7.
Finished extracting patients_batch8...
Finished saving patients_batch8.
Finished extracting patients_batch9...
Finished saving patients_batch9.
Finished extracting patients_batch10...
Finished saving patients_batch10.
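
An alternative to one variable per batch is to keep the batches in a dictionary keyed by name and to confirm that each pickle reloads unchanged. The following is only a sketch under assumptions: it uses two tiny made-up batches and writes to a temporary directory rather than the working directory.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Toy stand-ins for the real batches (hypothetical values)
toy_batches = [pd.DataFrame({'GENDER': [0, 1]}, index=['p1', 'p2']),
               pd.DataFrame({'GENDER': [1, 1]}, index=['p3', 'p4'])]

# Keep the batches in a dict keyed by name instead of separate variables
named = {f'patients_batch{i + 1}': df for i, df in enumerate(toy_batches)}

verified = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, df in named.items():
        path = Path(tmp) / f'{name}.pkl'
        df.to_pickle(path)
        # Reload the file and confirm it holds the same data
        verified[name] = pd.read_pickle(path).equals(df)

print(verified)
```

A dictionary keeps the batch names available for later loops without creating variables dynamically.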


At this point, each batch will be processed separately in order to consolidate all available data into a dataframe amenable to RNN modeling.

---