<h1> <center> Build dataset for classification task </center> </h1>

In [99]:
import pandas as pd
import numpy as np 
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline 

The purpose of this notebook is to build a common dataset, `DATA_DIR/data_merged.csv`, to be used as the starting point for every machine learning model for the IE (immediate extubation) success/failure classification task.

This notebook contains:

* **Step 1** : Define feature types > Categorical / Numerical / Text / Date
* **Step 2** : Build the target > based on two columns `immediate_extubation` and `secondary_intubation`
* **Step 3** : Select a subset of patients > drop patients for whom the target is undefined
* **Step 4** : Export the new dataset > `DATA_DIR/data_merged.csv`


In [160]:
## path to data directory
DATA_DIR = "../../../data/"
PICKLE_OBJ =  "../../../data/pickle_object/"


In [161]:
mkdir -p ../../../data/pickle_object/


In [162]:
## import dataset 
dim_patient_intraoperatoire = pd.read_csv("{}dim_patient_intraoperatoire.csv".format(DATA_DIR))
dim_donneur = pd.read_csv("{}dim_donneur.csv".format(DATA_DIR))
dim_patient_preoperatoire = pd.read_csv("{}dim_patient_preoperatoire.csv".format(DATA_DIR))
dim_patient_postoperatoire = pd.read_csv("{}dim_patient_postoperatoire.csv".format(DATA_DIR))

## merge preoperative, donor and postoperative datasets on the patient id `numero`
data = pd.merge(dim_patient_preoperatoire, dim_donneur, how = 'left', on = "numero")
## note: the source column name ' immediate_extubation' has a leading space, removed by the rename
data = pd.merge(data, dim_patient_postoperatoire[["numero", " immediate_extubation", "secondary_intubation"]], how = 'left', on = 'numero').rename(columns = {' immediate_extubation':  'immediate_extubation'})

data.columns = [i.lower() for i in data.columns]
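A left merge silently duplicates rows when the key is not unique, and it leaves NaN where no match exists. The sketch below shows one way to guard against both, using the `validate` and `indicator` parameters of `pd.merge`. The toy tables and their values are made up for illustration; only the key name `numero` comes from this notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
preop = pd.DataFrame({"numero": [1, 2, 3], "age": [54, 61, 47]})
donor = pd.DataFrame({"numero": [1, 2], "age_donor": [40, 35]})

# validate="one_to_one" raises a MergeError if `numero` is duplicated on either side.
# indicator=True adds a `_merge` column saying where each row was found.
merged = pd.merge(preop, donor, how="left", on="numero",
                  validate="one_to_one", indicator=True)

# Rows flagged "left_only" are patients without a donor record.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print("patients without donor record:", n_unmatched)
```

If the real tables are expected to hold one row per patient, running the merge this way would surface any violation as an error instead of as extra rows.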


# Define feature types

In [163]:
features_type = { 'id' : ["numero"],
     'date': ["date_transplantation"],

    'categorical': [

        "pathologie", 
        "sex", 
        "other_organ_transplantation", 
        "transplanted_twice_during_study_period", 
        "super_urgence", 
        "retransplant", 
        "preoperative_ICU", 
        "preoperative_vasopressor",
        "preoperative_mechanical_ventilation",
        "PFO", 
        "diabetes", 
        "preoperative_pulmonary_hypertension",
        "Insuffisance_renale", 
        "CMV_receveur", 
        "plasmapherese", 
        "preoperative_ECMO", 
        "thoracic_surgery_history", 
        "CMV_donneur", 
        "EBV_donneur", 
        "Sex_donor",
        "heure_arrivee_bloc",

        ],

    'continuous' : [
        "BMI_donor",
        "age", 
        "Poids", 
        "Taille", 
        "time_on_waiting_liste", 
        "LAS", 
        "body_mass_index",
        "PAPS", 
        "Age_donor", 
        "Poids_donor", 
        "Taille_donor", 
        "Donneur_CPT",
        "Tabagisme_donor",
        "Aspirations_donor", 
        "RX_donor", 
        "PF_donor",
        "oto_score"   
    ],

    'text' : [
        "atcd_medicaux", 
        "atcd_chirugicaux"
    ],

    'target' : [
        'immediate_extubation', 
        'secondary_intubation'
    ]

              }

In [164]:
# to lowercase 

for i in features_type.keys():
    features_type[i] = [f.lower() for f in features_type[i]]
    
## drop columns whose name starts with "unnamed:" (leftover index columns)
data = data.drop(columns = [c for c in data.columns if c.startswith("unnamed:")])
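Because the feature lists are typed by hand, it can help to check them against the actual columns. In Python, two adjacent string literals with no comma between them are concatenated into a single string, so a missing comma produces a feature name that does not exist. The helper below, `missing_features`, is a hypothetical name introduced for this sketch, and the toy lists are made up.

```python
def missing_features(features_type, columns):
    """Return, per feature group, the declared names that are absent from `columns`."""
    available = set(columns)
    missing = {}
    for group, names in features_type.items():
        absent = [name for name in names if name not in available]
        if absent:
            missing[group] = absent
    return missing


# Toy example: the comma is deliberately missing between "diabetes" and "pfo".
toy_types = {"categorical": ["sex", "diabetes" "pfo"], "continuous": ["age"]}
toy_columns = ["sex", "diabetes", "pfo", "age"]

report = missing_features(toy_types, toy_columns)
print(report)
```

On the real data, the call would presumably be `missing_features(features_type, data.columns)`, run after both sides have been lowercased.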

# Build the target

In [165]:
failure = (data["immediate_extubation"] == 1) & (data["secondary_intubation"] == 1)
success = (data["immediate_extubation"] == 1) & (data["secondary_intubation"] == 0)
no_ie = (data["immediate_extubation"] == 0) & (data["secondary_intubation"] == 0)
auto_extub = (data["immediate_extubation"] == 0) & (data["secondary_intubation"] == 1)

In [166]:
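## object dtype so the column can hold string labels; NaN means "no label"
data["target"] = pd.Series(np.nan, index = data.index, dtype = object)

## .loc avoids chained assignment, which may write to a copy
data.loc[failure, "target"] = "unsuccessful IE"
data.loc[success, "target"] = "successful IE"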

data[data.target.notnull()][['target', 'immediate_extubation', "secondary_intubation"]].head()

           target  immediate_extubation  secondary_intubation
1   successful IE                   1.0                   0.0
5   successful IE                   1.0                   0.0
12  successful IE                   1.0                   0.0
13  successful IE                   1.0                   0.0
15  successful IE                   1.0                   0.0
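The labelling rule can be sketched on a toy frame to make the handling of missing flags explicit. A comparison such as `== 1` evaluates to False for NaN, so a row with a missing flag matches neither mask and stays unlabelled. The values below are invented; only the column names and labels come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data covering the four flag combinations plus one missing flag.
toy = pd.DataFrame({
    "immediate_extubation": [1, 1, 0, 0, np.nan],
    "secondary_intubation": [0, 1, 0, 1, 0],
})

extubated = toy["immediate_extubation"] == 1      # False where the flag is NaN
reintubated = toy["secondary_intubation"] == 1
not_reintubated = toy["secondary_intubation"] == 0

# Object column so it can hold string labels; NaN marks "no label".
toy["target"] = pd.Series(np.nan, index=toy.index, dtype=object)
toy.loc[extubated & not_reintubated, "target"] = "successful IE"
toy.loc[extubated & reintubated, "target"] = "unsuccessful IE"

print(toy["target"].tolist())
```

Only the first two rows receive a label; the rows without immediate extubation and the row with a missing flag keep NaN and would be dropped by the `notnull()` filter.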


In [167]:
print("count unsuccessful IE {}".format((failure).sum()))
print("count successful IE {}".format((success).sum()))
print("count NO IE {}".format((no_ie).sum()))
print("count Auto IE {}".format((auto_extub).sum()))
print('\n')

print("Missing values in the target \n \n", data[["immediate_extubation", "secondary_intubation"]].isnull().sum())

count unsuccessful IE 16
count successful IE 119
count NO IE 239
count Auto IE 34


Missing values in the target 
 
 immediate_extubation    1
secondary_intubation    3
dtype: int64


In [168]:
data.drop(['secondary_intubation','immediate_extubation'], inplace = True, axis = 1 )

# Select a subset of patients

In [169]:
data = data[data.target.notnull()]
data.shape

(135, 43)

In [170]:
fct_temp = pd.read_csv('{}fct_temperature.csv'.format(DATA_DIR))
fct_temp['numero'] = fct_temp['id_patient']
fct_temp = fct_temp.drop_duplicates(['numero'])
n = pd.merge(data, fct_temp, how = 'inner', on = 'numero').shape[0]


"N° of matching patients between dim_patient dataset and fct_ tables : {} over {} , {} %".format(n,len(data),  round(n/len(data)*100))

'N° of matching patients between dim_patient dataset and fct_ tables : 74 over 135 , 55 %'
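The same coverage figure can be computed without a merge, by testing key membership with `Series.isin`. This is a sketch on made-up data, not the notebook's method; duplicates in the fact table do not affect the count, so no `drop_duplicates` step is needed.

```python
import pandas as pd

# Toy stand-ins: four patients, and a fact table with repeated and unknown ids.
patients = pd.DataFrame({"numero": [1, 2, 3, 4]})
temperature = pd.DataFrame({
    "id_patient": [1, 1, 3, 9],
    "temp": [36.5, 36.7, 37.0, 36.9],
})

# True for each patient that has at least one temperature record.
has_temperature = patients["numero"].isin(temperature["id_patient"])

n_matched = int(has_temperature.sum())
coverage = 100 * n_matched / len(patients)
print(f"{n_matched} of {len(patients)} patients matched ({coverage:.0f} %)")
```

The boolean mask also makes it easy to list which patients lack records, for example with `patients[~has_temperature]`.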

In [171]:
print("Dataset shape :  {}" .format(data.shape))
data.target.value_counts()/len(data) * 100

Dataset shape :  (135, 43)


successful IE      88.148148
unsuccessful IE    11.851852
Name: target, dtype: float64

In [172]:
## Quick cleaning : replace the "NF" placeholder with NaN
data = data.replace("NF", np.nan)

# Export

In [173]:
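import pickle 

## dataset : written under DATA_DIR rather than a hard-coded path
data.to_csv("{}data_merged.csv".format(DATA_DIR), index = False)

## column types : the context manager closes the file once the dump is done
with open("{}features_type.p".format(PICKLE_OBJ), "wb") as f:
    pickle.dump(features_type, f)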
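A downstream notebook would reload both files. The sketch below performs the round trip in a temporary directory with toy data, then checks that every name declared in the pickled dictionary is a column of the reloaded CSV. The file names mirror the ones used above; the data and the temporary location are assumptions made for the example.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy dataset and feature dictionary (invented for illustration).
toy = pd.DataFrame({"numero": [1, 2],
                    "target": ["successful IE", "unsuccessful IE"]})
toy_types = {"id": ["numero"], "target": ["target"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data_merged.csv")
    pickle_path = os.path.join(tmp_dir, "features_type.p")

    # Export, as in the cell above.
    toy.to_csv(csv_path, index=False)
    with open(pickle_path, "wb") as handle:
        pickle.dump(toy_types, handle)

    # Reload, as a downstream notebook would.
    reloaded = pd.read_csv(csv_path)
    with open(pickle_path, "rb") as handle:
        reloaded_types = pickle.load(handle)

# Every declared feature should be a column of the reloaded dataset.
declared = [name for names in reloaded_types.values() for name in names]
all_present = set(declared) <= set(reloaded.columns)
print(all_present)
```

Note that CSV does not preserve dtypes, so categorical or date columns would need to be converted again after `pd.read_csv`.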