# pre-processing

## contents
* [introduction](#introduction)
* [imports and load](#imports_and_load)
* [dataframe prep](#prep)
* [categorical features](#categorical)
* [numerical features](#numbers)
* [train test split and export](#traintest)

## introduction<a id='introduction'></a>

The purpose of this notebook is to prepare the data for modeling with these steps:

- admissions will be flagged for independence and unnecessary variables removed
- categorical features will be one-hot encoded
- numerical features will be scaled to a range of 0-1
- training and test data sets will be created and exported

## imports and load<a id='imports_and_load'></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings('ignore')

admissions = pd.read_parquet('../data/interim/notebook2_output.parquet')
print(admissions.shape)
admissions.head()

(66224, 26)


| encounter_id | patient_nbr | race | gender | age | admission_type_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | ... | DischargedToCare | outpatientTF | emergencyTF | inpatientTF | one_lab_procedure | dx_cluster | Meglitinide_class | Sulfonylurea_class | Thiazolidinedione_class | Alpha-glucosidase inhibitor_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64410 | 86047875 | AfricanAmerican | Female | [20-30) | emergency | emergency room | 2 | 11 | 5 | 0.046863 | ... | False | True | False | True | False | 2 | No | Steady | No | No |
| 28236 | 89869032 | AfricanAmerican | Female | [40-50) | emergency | emergency room | 9 | 47 | 2 | 0.090064 | ... | False | False | False | False | False | 3 | No | No | No | No |
| 36900 | 77391171 | AfricanAmerican | Male | [60-70) | urgent | Other | 7 | 62 | 0 | 0.017615 | ... | False | False | False | False | False | 2 | No | Up | No | No |
| 62256 | 49726791 | AfricanAmerican | Female | [60-70) | elective | Other | 1 | 49 | 5 | -0.52139 | ... | False | False | False | False | False | 6 | No | No | No | No |
| 73578 | 86328819 | AfricanAmerican | Male | [60-70) | emergency | emergency room | 12 | 75 | 5 | 0.046863 | ... | True | False | False | False | False | 5 | No | No | No | No |


In [2]:
%pip list --format=freeze > requirements-nb3.txt

Note: you may need to restart the kernel to use updated packages.


## dataframe prep<a id='prep'></a>

patient_nbr is not needed as a feature, but some models assume independent observations, and multiple admissions from the same patient are not independent. An additional feature, is_independent, marks exactly one admission per patient as True and the rest as False. When a patient has several admissions, one that resulted in a readmission is preferred over one that did not. readmitBinary is the target variable, so it is copied alongside is_independent into a separate target dataframe and then dropped from the features, together with readmitted and patient_nbr.

In [3]:
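working_df = admissions.copy()

# Order the readmission categories; any value not listed becomes NaN and sorts last.
# (The inplace argument of set_categories was removed in pandas 2.0, so assign the result.)
sort_order = ['<30 days', '>30 days']
working_df['readmitted'] = working_df['readmitted'].cat.set_categories(sort_order, ordered=True)
working_df = working_df.sort_values(['patient_nbr', 'readmitted'], ascending=[True, False])

# Flag the first admission per patient (after sorting) as the independent one.
working_df['is_independent'] = working_df.groupby('patient_nbr').cumcount() == 0

# Keep the independence flag and the binary target together, then drop unneeded columns.
target = working_df[['is_independent', 'readmitBinary']].copy()
working_df = working_df.drop(['patient_nbr', 'readmitted', 'readmitBinary'], axis=1)

working_df.head()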

| encounter_id | race | gender | age | admission_type_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_diagnoses | ... | outpatientTF | emergencyTF | inpatientTF | one_lab_procedure | dx_cluster | Meglitinide_class | Sulfonylurea_class | Thiazolidinedione_class | Alpha-glucosidase inhibitor_class | is_independent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26264286 | Caucasian | Female | [50-60) | emergency | emergency room | 3 | 31 | 1 | 0.059233 | 5 | ... | False | False | True | False | 2 | No | Steady | No | No | True |
| 14824206 | AfricanAmerican | Female | [30-40) | emergency | emergency room | 5 | 49 | 0 | -0.155541 | 3 | ... | False | False | False | False | 6 | No | Steady | No | No | True |
| 8380170 | AfricanAmerican | Female | [50-60) | emergency | emergency room | 6 | 43 | 2 | 0.046863 | 2 | ... | False | False | True | False | 5 | No | No | No | No | True |
| 30180318 | AfricanAmerican | Female | [50-60) | emergency | emergency room | 6 | 45 | 4 | 0.070441 | 6 | ... | False | False | True | False | 8 | No | No | No | No | False |
| 55533660 | AfricanAmerican | Female | [60-70) | emergency | emergency room | 10 | 54 | 2 | 0.106787 | 9 | ... | False | False | True | False | 0 | No | No | No | No | False |


encounter_id is also not needed, but serves as the index, so it will be left in place.
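The independence flag described above can be checked on a small toy frame. This is only a sketch: the encounter ids, patient numbers, and the 'NO' label for a non-readmission are hypothetical values chosen for illustration, not taken from the data set.

```python
import pandas as pd

# Hypothetical toy admissions: patient 1 has three encounters, patient 2 has one.
toy = pd.DataFrame(
    {
        "patient_nbr": [1, 1, 1, 2],
        "readmitted": ["NO", "<30 days", ">30 days", "NO"],  # "NO" is an assumed label
    },
    index=pd.Index([10, 11, 12, 13], name="encounter_id"),
)

# Values outside the listed categories (here "NO") become NaN and sort last.
toy["readmitted"] = pd.Categorical(
    toy["readmitted"], categories=["<30 days", ">30 days"], ordered=True
)
toy = toy.sort_values(["patient_nbr", "readmitted"], ascending=[True, False])

# The first row within each patient group is flagged as independent.
toy["is_independent"] = toy.groupby("patient_nbr").cumcount() == 0
flags = toy["is_independent"].to_dict()
```

Each patient ends up with exactly one flagged encounter, and for patient 1 it is a readmitted encounter rather than the non-readmitted one.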

## categorical features<a id='categorical'></a>

In [4]:
working_df.dtypes

race                                 category
gender                               category
age                                  category
admission_type_id                    category
admission_source_id                  category
time_in_hospital                        int64
num_lab_procedures                      int64
num_procedures                          int64
num_medications                       float64
number_diagnoses                        int64
metformin                            category
insulin                              category
glyburide-metformin                  category
DischargedToCare                         bool
outpatientTF                             bool
emergencyTF                              bool
inpatientTF                              bool
one_lab_procedure                        bool
dx_cluster                              int32
Meglitinide_class                      object
Sulfonylurea_class                     object
Thiazolidinedione_class           

The four *_class features near the bottom of this list are also categorical and should be converted. dx_cluster, which is stored as int32, is converted as well. 'Alpha-glucosidase inhibitor_class' is renamed for convenience.

In [5]:
working_df = working_df.rename(columns={'Alpha-glucosidase inhibitor_class': 'alpha_gluc_inhibitor_class'})

need_to_be_cat = ['alpha_gluc_inhibitor_class','Sulfonylurea_class','Thiazolidinedione_class','Meglitinide_class','dx_cluster']
working_df[need_to_be_cat] = working_df[need_to_be_cat].astype('category')
working_df.dtypes

race                          category
gender                        category
age                           category
admission_type_id             category
admission_source_id           category
time_in_hospital                 int64
num_lab_procedures               int64
num_procedures                   int64
num_medications                float64
number_diagnoses                 int64
metformin                     category
insulin                       category
glyburide-metformin           category
DischargedToCare                  bool
outpatientTF                      bool
emergencyTF                       bool
inpatientTF                       bool
one_lab_procedure                 bool
dx_cluster                    category
Meglitinide_class             category
Sulfonylurea_class            category
Thiazolidinedione_class       category
alpha_gluc_inhibitor_class    category
is_independent                    bool
dtype: object

Finally, the categorical variables are one-hot encoded, dropping the first level of each feature.

In [6]:
working_df = pd.get_dummies(working_df, drop_first=True)
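As a rough illustration of what `drop_first=True` does, the sketch below encodes a hypothetical three-row frame. The column names mirror the notebook, but the values are made up.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame(
    {
        "gender": pd.Categorical(["Female", "Male", "Female"]),
        "insulin": pd.Categorical(["No", "Steady", "Up"]),
        "time_in_hospital": [2, 9, 7],
    }
)

# Categorical columns are expanded into indicator columns; the first level of
# each (Female, No) is dropped, and numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)
```

A feature with k levels becomes k-1 indicator columns, and the dropped level is represented by all of them being zero.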

## numerical features<a id='numbers'></a>

Now the numerical features can be scaled to a range of 0-1. Non-finite values in num_medications are first replaced with the mean of its finite values.

In [7]:
# Replace non-finite values (inf, -inf, NaN) with the mean of the finite values.
finite_mean = working_df.loc[np.isfinite(working_df['num_medications']), 'num_medications'].mean()
working_df['num_medications'] = working_df['num_medications'].replace([np.inf, -np.inf, np.nan], finite_mean)

number_columns = working_df.select_dtypes(include=['int64', 'float64']).columns.to_list()
scaler = MinMaxScaler()
working_df[number_columns] = scaler.fit_transform(working_df[number_columns])
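The two steps in this cell can be traced by hand on a toy column. The sketch below uses hypothetical values and writes out the min-max formula directly instead of calling the scaler. It is the same per-column mapping that MinMaxScaler applies with its default feature_range of (0, 1).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one -inf and one NaN.
toy = pd.DataFrame({"num_medications": [0.0, 2.0, -np.inf, 4.0, np.nan]})

col = toy["num_medications"]
finite_mean = col[np.isfinite(col)].mean()  # mean of the finite values only
toy["num_medications"] = col.replace([np.inf, -np.inf, np.nan], finite_mean)

# Min-max scaling: (x - min) / (max - min), computed per column.
filled = toy["num_medications"]
scaled = (filled - filled.min()) / (filled.max() - filled.min())
scaled_values = scaled.tolist()
```

The smallest finite value maps to 0, the largest to 1, and the imputed rows land wherever the finite mean falls between them.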

The boolean features also need to be converted to numerics.

In [8]:
bools = working_df.select_dtypes(include=['bool']).columns.to_list()
working_df[bools] = working_df[bools].astype(int)

In [9]:
pd.set_option('display.max_rows', None)
print(working_df.dtypes)
pd.reset_option('display.max_rows')

time_in_hospital                      float64
num_lab_procedures                    float64
num_procedures                        float64
num_medications                       float64
number_diagnoses                      float64
DischargedToCare                        int64
outpatientTF                            int64
emergencyTF                             int64
inpatientTF                             int64
one_lab_procedure                       int64
is_independent                          int64
race_Caucasian                          uint8
gender_Male                             uint8
age_[10-20)                             uint8
age_[20-30)                             uint8
age_[30-40)                             uint8
age_[40-50)                             uint8
age_[50-60)                             uint8
age_[60-70)                             uint8
age_[70-80)                             uint8
age_[80-90)                             uint8
age_[90-100)                      

## train test split and export<a id='traintest'></a>

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    working_df,
    target,
    test_size=0.2, 
    random_state=42,
    stratify=target['readmitBinary']
)
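The effect of `stratify` can be seen on a small example. The sketch below uses hypothetical toy data with a 25% positive rate and checks that both splits keep that rate.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 5 positives and 15 negatives.
features = pd.DataFrame({"x": range(20)})
labels = pd.DataFrame(
    {
        "is_independent": [True] * 20,
        "readmitBinary": [1] * 5 + [0] * 15,
    }
)

X_tr, X_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels["readmitBinary"],  # preserve the class ratio in both splits
)

train_rate = y_tr["readmitBinary"].mean()
test_rate = y_te["readmitBinary"].mean()
```

Stratifying on readmitBinary alone does not balance is_independent across the splits. This sketch, like the cell above, only controls the target ratio.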

In [11]:
shapes = {
    'X_train': X_train.shape,
    'X_test': X_test.shape,
    'y_train': y_train.shape,
    'y_test': y_test.shape
}

shapes

{'X_train': (52979, 56),
 'X_test': (13245, 56),
 'y_train': (52979, 2),
 'y_test': (13245, 2)}

In [12]:
X_train.to_parquet(path='../data/interim/X_train.parquet')
X_test.to_parquet(path='../data/interim/X_test.parquet')
y_train.to_parquet(path='../data/interim/y_train.parquet')
y_test.to_parquet(path='../data/interim/y_test.parquet')
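Because the exported target frames carry is_independent next to readmitBinary, a later step that needs independent observations could filter on that column. The sketch below shows one way to do this on hypothetical toy frames. How downstream notebooks actually use the flag is not shown here, so treat this as an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the exported y_train and X_train frames.
y_toy = pd.DataFrame(
    {"is_independent": [True, False, True], "readmitBinary": [0, 1, 1]},
    index=pd.Index([101, 102, 103], name="encounter_id"),
)
X_toy = pd.DataFrame({"time_in_hospital": [0.1, 0.5, 0.9]}, index=y_toy.index)

# Keep only the rows flagged as independent, using the shared encounter_id index.
mask = y_toy["is_independent"]
X_indep = X_toy.loc[mask]
y_indep = y_toy.loc[mask, "readmitBinary"]
```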