# Data Preparation
**Author: Tesfagabir Meharizghi *(Adapted from Lin Lee's notebook)*<br>
Last Updated: 01/06/2021**

This notebook prepares and splits the AE data, keeping a single target variable:
- Target: d_00845, C.Diff infection (the ICD-9 code for C.Diff is 008.45)
- C.Diff was selected because its main events/causes are relatively well known, so the feature importances produced by different algorithms and models can be compared against the ground truth
- Actions:
    - Load raw data
    - Split data into train/val/test
    - Save split data

## 0. Install packages - First time only

In [1]:
#pip install nb-black

In [12]:
%load_ext lab_black

%load_ext autoreload

%autoreload 2



In [5]:
# Import packages
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split

## 1. Raw data and output paths

In [30]:
raw_data_path = "/home/ec2-user/SageMaker/CMSAI/modeling/tes/data/anonymize/AE/Data/Anonymized/365NoDeath/ae_patients_365_20110101.csv"

train_path = "/home/ec2-user/SageMaker/CMSAI/modeling/tes/data/anonymize/AE_CDiff_d00845/split/train.csv"
val_path = "/home/ec2-user/SageMaker/CMSAI/modeling/tes/data/anonymize/AE_CDiff_d00845/split/val.csv"
test_path = "/home/ec2-user/SageMaker/CMSAI/modeling/tes/data/anonymize/AE_CDiff_d00845/split/test.csv"

test_size = 0.1  # Fraction used for val and for test each; train/val/test ratios: 0.8/0.1/0.1
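
The 0.8/0.1/0.1 ratios come from holding out `2 * test_size` of the rows and then halving that hold-out. A minimal sketch of the arithmetic follows. `expected_split_sizes` is a hypothetical helper, not part of the original notebook, and it assumes scikit-learn rounds the held-out count up and gives the remainder to the other side.

```python
import math


def expected_split_sizes(n_rows, test_size=0.1):
    """Hypothetical helper: row counts for a two-stage train/val/test split.

    Assumes the held-out count is rounded up at each stage and the
    remainder goes to the other side of the split.
    """
    n_holdout = math.ceil(2 * test_size * n_rows)  # val + test together
    n_train = n_rows - n_holdout
    n_test = math.ceil(0.5 * n_holdout)  # second split: half of the hold-out
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


# 1903423 is the row count reported for the raw data in this notebook
sizes = expected_split_sizes(1903423, test_size=0.1)
print(sizes)
```

Under these assumptions the counts for 1,903,423 rows are 1,522,738 / 190,342 / 190,343 for train/val/test.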

## 2. Split files

In [8]:
raw_df = pd.read_csv(raw_data_path, low_memory=False)

In [9]:
print(raw_df.shape)
raw_df.head()

(1903423, 387)


Unnamed: 0,patient_id,365,364,363,362,361,360,359,358,357,...,d_5789,d_78791,d_6826,d_78659,d_78907,d_7840,d_28860,d_4660,d_6829,d_00845
0,D1TRFPDOL,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
1,G420LOHIQ,"d_s5856, h_90960",,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
2,86SS056GL,,,,,,,,,,...,0,0,0,0,0,0,0,1,0,0
3,5GWKKD3H9,"d_s436, h_E0260, p_D1B","d_s5990, d_s7295, h_99284, h_A0425, h_A0428, h...",,"admission, d_s27651, d_s43820, d_s5849, d_s682...","d_s1101, d_s25070, d_s27651, d_s6826, h_11721,...","d_s4439, d_s6826, h_93922, h_99232","d_s6826, d_sV5881, h_36569, h_76937, h_77001, ...","d_s44020, d_s6826, h_35492, h_35493, h_36248, ...","admission, d_s25000, d_s27651, d_s4439, d_s682...",...,0,0,0,0,0,0,0,0,0,0
4,U5WIVBOP9,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0


In [10]:
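# Select only C.Diff (d_00845), the last column, as the label.
# The trailing 20 columns (including the d_* flags shown above) are dropped
# as a block, then the label is appended back to the remaining columns.
label = raw_df.columns.tolist()[-1]
other_cols = raw_df.columns.tolist()[:-20]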
selected_cols = other_cols + [label]
raw_df = raw_df[selected_cols]
raw_df.head()

Unnamed: 0,patient_id,365,364,363,362,361,360,359,358,357,...,8,7,6,5,4,3,2,1,0,d_00845
0,D1TRFPDOL,,,,,,,,,,...,,,,"d_s71941, h_97140, h_97530",,,,,,0
1,G420LOHIQ,"d_s5856, h_90960",,,,,,,,,...,,,,,,,,,"d_5856, h_90960",0
2,86SS056GL,,,,,,,,,,...,,,,,,,,,,0
3,5GWKKD3H9,"d_s436, h_E0260, p_D1B","d_s5990, d_s7295, h_99284, h_A0425, h_A0428, h...",,"admission, d_s27651, d_s43820, d_s5849, d_s682...","d_s1101, d_s25070, d_s27651, d_s6826, h_11721,...","d_s4439, d_s6826, h_93922, h_99232","d_s6826, d_sV5881, h_36569, h_76937, h_77001, ...","d_s44020, d_s6826, h_35492, h_35493, h_36248, ...","admission, d_s25000, d_s27651, d_s4439, d_s682...",...,h_G0154,,h_G0154,,h_G0154,,h_G0154,,h_G0154,0
4,U5WIVBOP9,,,,,,,,,,...,,,,,,,"d_s2469, d_s25000, d_s2722, d_s4011, d_s40200,...",,,0
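
The column selection above is positional, so it silently depends on the column order of the raw file. A minimal sketch of the same selection with an explicit check on the label name, using a made-up toy frame (the values and the `n_labels = 3` count are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy frame with made-up values: an id, two day columns, three label columns
toy = pd.DataFrame(
    {
        "patient_id": ["A", "B", "C"],
        "1": ["d_s4660", None, None],
        "0": [None, "h_99213", None],
        "d_7840": [0, 1, 0],
        "d_4660": [1, 0, 0],
        "d_00845": [0, 0, 1],
    }
)

target = "d_00845"
n_labels = 3  # assumption for the toy frame; the notebook drops the last 20 columns

label_cols = toy.columns[-n_labels:].tolist()
# Fail early if the expected target is not among the trailing label columns
assert target in label_cols, f"{target} not found in {label_cols}"

feature_cols = [c for c in toy.columns if c not in label_cols]
selected = toy[feature_cols + [target]]
print(selected.columns.tolist())
```

The check costs one line and guards against a reordered or truncated input file.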


In [32]:
x_train, x_val_test, y_train, y_val_test = train_test_split(
    raw_df[selected_cols[:-1]],
    raw_df[selected_cols[-1]],
    test_size=2 * test_size,
    stratify=raw_df[selected_cols[-1]],
)
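# Stratify the second split as well so val and test keep the same label ratio
x_val, x_test, y_val, y_test = train_test_split(
    x_val_test, y_val_test, test_size=0.5, stratify=y_val_test
)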

print(
    x_train.shape, x_test.shape, x_val.shape, y_train.shape, y_test.shape, y_val.shape
)

(1522738, 367) (190343, 367) (190342, 367) (1522738,) (190343,) (190342,)
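
Matching shapes do not show that the label ratio is preserved in each split, which matters when the positive class is rare. A minimal sketch of such a check on synthetic data follows. The 5% positive rate, the `feature` column, and `random_state=0` are illustrative assumptions; both stages are stratified here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 5% positives (illustrative, not the real prevalence)
rng = np.random.default_rng(0)
y = pd.Series((rng.random(10_000) < 0.05).astype(int), name="d_00845")
X = pd.DataFrame({"feature": np.arange(len(y))})

# Stage 1: hold out 20% of the rows, stratified on the label
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Stage 2: split the hold-out in half, stratified again
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

# Positive rate per split; with stratification these should be nearly equal
rates = {
    name: float(part.mean())
    for name, part in [("all", y), ("train", y_train), ("val", y_val), ("test", y_test)]
}
for name, rate in rates.items():
    print(f"{name}: {rate:.4f}")
```

Fixing `random_state` also makes the split reproducible across runs, which the cell above does not guarantee.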


In [33]:
# Reattach the label to each feature frame. assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised when writing in place to the
# frames returned by train_test_split.
x_train = x_train.assign(**{label: y_train.to_numpy()})
x_val = x_val.assign(**{label: y_val.to_numpy()})
x_test = x_test.assign(**{label: y_test.to_numpy()})


In [34]:
x_train.head()

Unnamed: 0,patient_id,365,364,363,362,361,360,359,358,357,...,8,7,6,5,4,3,2,1,0,d_00845
58586,IXD7U0Z74,,,,,,,,,,...,,,,,,,,,,0
1392643,4TJJ3BGPT,,,,,,,,,,...,,,,,,,,,,0
973847,QIW5R08DN,,,,"d_s490, h_99309",,"d_s4019, d_s8208, h_36415, h_80048, h_85027",,,,...,,,,,,,,,,0
1357475,WDPIV5G4M,,,,,,,,,,...,,,,,,,,,,0
1466276,JI24AUR24,,,,,,,,,,...,,,,,"d_s4660, h_99213",,,,,0


In [35]:
# Save data (all three files go to the same directory)
os.makedirs(os.path.dirname(train_path), exist_ok=True)

x_train.to_csv(train_path, index=False)
x_val.to_csv(val_path, index=False)
x_test.to_csv(test_path, index=False)
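
After saving, it can be worth reloading the files to confirm the row counts and that no patient appears in more than one split. A minimal sketch follows. `check_splits` is a hypothetical helper, the demo writes tiny made-up files to a temporary directory rather than the real paths, and it assumes one row per `patient_id`.

```python
import os
import tempfile

import pandas as pd


def check_splits(paths, id_col="patient_id"):
    """Hypothetical helper: per-split ID counts and any IDs shared between splits.

    `paths` maps a split name to a CSV path. Only the ID column is read.
    """
    ids = {
        name: set(pd.read_csv(path, usecols=[id_col])[id_col])
        for name, path in paths.items()
    }
    names = list(ids)
    overlap = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap |= ids[first] & ids[second]
    counts = {name: len(values) for name, values in ids.items()}
    return counts, overlap


# Demo on tiny made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = {
        "train": ["P1", "P2", "P3", "P4"],
        "val": ["P5"],
        "test": ["P6"],
    }
    demo_paths = {}
    for name, patient_ids in demo.items():
        demo_paths[name] = os.path.join(tmp_dir, f"{name}.csv")
        frame = pd.DataFrame({"patient_id": patient_ids, "d_00845": 0})
        frame.to_csv(demo_paths[name], index=False)

    counts, overlap = check_splits(demo_paths)

print(counts, overlap)
```

Reading only the ID column keeps the check cheap even when the split files are large.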