## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing bad rows and bad columns and by imputing missing values.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('coag.csv',sep= ',')

In [4]:
df.head()

Unnamed: 0,NO,Y,Sex 1.Male 2.Female,Age,Before treatment,Severity 1.mild/moderate 2.severe,PT,Platelet,FDP,SIRS_t1,...,SOFA cardiovascular 0-1-2,SOFA CNS,SOFA renal,SOFA renal 0-1-2,Total SOFA (4 items),SIC score (PT+PLT),SIC score（PT+PLT+  Total SOFA）,SIC 1.positive 2.negative,Unnamed: 27,Unnamed: 28
0,1,1,1,72.0,1,2,1.17,41.0,17.0,2,...,2.0,,0,,5,2,4,2,,
1,2,1,2,76.0,1,2,1.95,23.0,18.7,3,...,2.0,3.0,2,2.0,8,4,6,1,,
2,3,2,1,93.0,2,2,1.83,57.0,22.9,3,...,,3.0,0,,2,4,6,1,,
3,4,1,2,76.0,2,2,1.23,66.0,23.4,2,...,2.0,4.0,4,2.0,8,3,5,1,,
4,5,1,1,76.0,2,2,1.08,44.0,3.9,4,...,1.0,2.0,3,2.0,3,2,4,2,,


We drop the column `NO`, as it is unrelated to the dependent variable. The same call also drops the column `PT`.

In [5]:
df = df.drop(['NO','PT'],axis=1)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1498 entries, 0 to 1497
Data columns (total 27 columns):
Y                                         1498 non-null int64
Sex 1.Male 2.Female                       1498 non-null int64
Age                                       1494 non-null float64
Before treatment                          1498 non-null int64
Severity 1.mild/moderate 2.severe         1498 non-null int64
Platelet                                  1498 non-null float64
FDP                                       1303 non-null float64
SIRS_t1                                   1498 non-null int64
SIRS 3 or more 1. positive 2. negative    1498 non-null int64
JAAM DIC score                            1498 non-null int64
JAAM DIC 1.positive 2.negative            1498 non-null int64
SOFA respiratory                          1498 non-null int64
SOFA respiratory 0-1-2                    1262 non-null float64
SOFA coagulation                          1498 non-null int64
SOFA hepatic
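The column names in the listing above appear to contain line breaks, which is why each name spills over several lines. As an optional step that this notebook does not apply, the names could be flattened to single-line labels. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame whose column names contain line breaks,
# mimicking names such as "Sex\n1.Male\n2.Female" in the raw data.
toy = pd.DataFrame({
    'Sex\n1.Male\n2.Female': [1, 2],
    'SOFA\nrenal\n0-1-2': [0, 2],
    'Age': [72.0, 76.0],
})

# Collapse every run of whitespace (including newlines) into one space.
toy.columns = toy.columns.str.replace(r'\s+', ' ', regex=True).str.strip()

print(list(toy.columns))
```

If this were applied to `df`, later cells would have to refer to the flattened names.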

To clean the data, we first replace empty values (cells that contain only whitespace) with `nan`.

In [7]:
# replace whitespace-only cells with NaN
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,Y,Sex 1.Male 2.Female,Age,Before treatment,Severity 1.mild/moderate 2.severe,Platelet,FDP,SIRS_t1,SIRS 3 or more 1. positive 2. negative,JAAM DIC score,...,SOFA cardiovascular 0-1-2,SOFA CNS,SOFA renal,SOFA renal 0-1-2,Total SOFA (4 items),SIC score (PT+PLT),SIC score（PT+PLT+  Total SOFA）,SIC 1.positive 2.negative,Unnamed: 27,Unnamed: 28
0,1,1,72.0,1,2,41.0,17.00,2,2,4,...,2.0,,0,,5,2,4,2,,
1,1,2,76.0,1,2,23.0,18.70,3,1,6,...,2.0,3.0,2,2.0,8,4,6,1,,
2,2,1,93.0,2,2,57.0,22.90,3,1,6,...,,3.0,0,,2,4,6,1,,
3,1,2,76.0,2,2,66.0,23.40,2,2,5,...,2.0,4.0,4,2.0,8,3,5,1,,
4,1,1,76.0,2,2,44.0,3.90,4,1,4,...,1.0,2.0,3,2.0,3,2,4,2,,
5,2,1,69.0,2,2,13.0,5.30,3,1,5,...,,,0,,0,4,4,2,,
6,2,2,63.0,1,2,53.0,54.50,4,1,8,...,2.0,,1,1.0,7,4,6,1,,
7,2,1,84.0,2,2,79.0,5.80,0,2,3,...,,0.0,0,,1,2,3,2,,
8,1,1,69.0,1,2,77.0,13.50,3,1,5,...,,0.0,1,1.0,3,2,4,2,,
9,2,1,63.0,1,2,31.0,77.80,3,1,8,...,2.0,4.0,3,2.0,8,4,6,1,,
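To make the effect of the pattern `^\s+$` concrete, here is a small sketch on hypothetical toy data (the frame `toy` is an assumption, not part of the dataset). It suggests that only cells consisting entirely of whitespace are converted; a truly empty string is not matched because `\s+` requires at least one character, although `pd.read_csv` already parses empty fields as `NaN` by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a blank cell, an empty string and a tab-only cell.
toy = pd.DataFrame({
    'Age': ['72', ' ', '93'],
    'FDP': ['17.0', '', '\t'],
})

# Same replacement as in the notebook: whitespace-only cells become NaN.
cleaned = toy.replace(r'^\s+$', np.nan, regex=True)

print(cleaned.isnull())
```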


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1498 entries, 0 to 1497
Data columns (total 27 columns):
Y                                         1498 non-null int64
Sex 1.Male 2.Female                       1498 non-null int64
Age                                       1494 non-null float64
Before treatment                          1498 non-null int64
Severity 1.mild/moderate 2.severe         1498 non-null int64
Platelet                                  1498 non-null float64
FDP                                       1303 non-null float64
SIRS_t1                                   1498 non-null int64
SIRS 3 or more 1. positive 2. negative    1498 non-null int64
JAAM DIC score                            1498 non-null int64
JAAM DIC 1.positive 2.negative            1498 non-null int64
SOFA respiratory                          1498 non-null int64
SOFA respiratory 0-1-2                    1262 non-null float64
SOFA coagulation                          1498 non-null int64
SOFA hepatic

The data still contain stray characters, such as tabs (`\t`), spaces (` `), and question marks (`?`). We delete the tabs and spaces and convert cells containing `?` to `np.nan`.

In [9]:
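# delete tabs and spaces inside string cells
df = df.replace(r'\t', '', regex=True)
df = df.replace(' ', '', regex=True)
# cells containing '?' become NaN (the value np.nan, not the string 'np.nan')
df = df.replace(r'\?', np.nan, regex=True)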
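The behaviour of these three replacements can be checked on a small hypothetical frame (the data below are made up). The sketch assumes the goal is a real missing value, so the replacement is the object `np.nan` rather than a string; with a non-string replacement, pandas replaces the whole cell when the pattern matches.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing a tab, spaces and question marks.
toy = pd.DataFrame({
    'a': ['1\t', ' 2', '?'],
    'b': ['x y', '?', 'z'],
})

toy = toy.replace(r'\t', '', regex=True)      # delete tabs
toy = toy.replace(' ', '', regex=True)        # delete spaces
toy = toy.replace(r'\?', np.nan, regex=True)  # '?' cells become NaN

print(toy)
print(toy.isnull().sum())
```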

We find bad columns, i.e. columns with too many missing values (more than 200 here), and remove them.

In [10]:
def remove_bad_columns(df, bad_column_threshold):
    # count the missing values in each column
    n_null = df.isnull().sum(axis=0)

    # labels of columns with more missing values than the threshold
    bad_col = n_null[n_null > bad_column_threshold].index
    print('number of bad columns:', len(bad_col))

    # delete bad columns
    return df.drop(bad_col, axis=1)

In [11]:
df = remove_bad_columns(df,200)

number of bad columns: 7
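As a cross-check of the thresholding logic, the sketch below applies it to a hypothetical toy frame (column values and the threshold of 1 are made up) and compares a boolean mask with `DataFrame.dropna`. Keeping columns with at most `threshold` missing values is the same as requiring at least `len(df) - threshold` non-missing values, which is what the `thresh` argument of `dropna` expresses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'SOFA CNS' has 3 missing values out of 4 rows.
toy = pd.DataFrame({
    'Y': [1, 2, 1, 2],
    'FDP': [17.0, np.nan, 22.9, 23.4],
    'SOFA CNS': [np.nan, np.nan, np.nan, 4.0],
})
threshold = 1  # maximum number of missing values tolerated per column

# Variant 1: boolean mask on the per-column missing counts.
n_null = toy.isnull().sum(axis=0)
kept_mask = toy.loc[:, n_null <= threshold]

# Variant 2: dropna with the minimum number of non-missing values.
kept_dropna = toy.dropna(axis=1, thresh=len(toy) - threshold)

print(list(kept_mask.columns))
```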


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1498 entries, 0 to 1497
Data columns (total 20 columns):
Y                                         1498 non-null int64
Sex 1.Male 2.Female                       1498 non-null int64
Age                                       1494 non-null float64
Before treatment                          1498 non-null int64
Severity 1.mild/moderate 2.severe         1498 non-null int64
Platelet                                  1498 non-null float64
FDP                                       1303 non-null float64
SIRS_t1                                   1498 non-null int64
SIRS 3 or more 1. positive 2. negative    1498 non-null int64
JAAM DIC score                            1498 non-null int64
JAAM DIC 1.positive 2.negative            1498 non-null int64
SOFA respiratory                          1498 non-null int64
SOFA coagulation                          1498 non-null int64
SOFA hepatic                              1498 non-null int64
SOFA cardiovascular

Similarly, we find bad rows, i.e. rows with too many missing values (more than 5 here), and remove them.

In [13]:
def remove_bad_rows(df, bad_row_threshold):
    # count the missing values in each row
    n_null = df.isnull().sum(axis=1)

    # index labels (not positions) of rows with more missing values than the threshold
    bad_row = n_null[n_null > bad_row_threshold].index
    print('number of bad rows:', len(bad_row))

    # delete bad rows
    return df.drop(bad_row)

In [14]:
df = remove_bad_rows(df,5)

number of bad rows: 0
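No rows exceed the threshold in this dataset, so the effect of the row filter is not visible above. The following sketch on hypothetical toy data (values, index labels and threshold are made up) illustrates what would happen otherwise. It selects rows by index label, which matters because `DataFrame.drop` interprets its argument as labels rather than positions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index; row 20 is entirely missing.
toy = pd.DataFrame(
    {'a': [1.0, np.nan, 3.0],
     'b': [np.nan, np.nan, 6.0],
     'c': [7.0, np.nan, 9.0]},
    index=[10, 20, 30],
)
threshold = 1  # maximum number of missing values tolerated per row

n_null = toy.isnull().sum(axis=1)          # missing values per row
bad_rows = n_null[n_null > threshold].index  # labels of rows to remove
cleaned = toy.drop(bad_rows)

print(list(bad_rows), list(cleaned.index))
```

After rows are dropped the index has gaps; `cleaned.reset_index(drop=True)` would renumber it if consecutive labels are needed.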


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1498 entries, 0 to 1497
Data columns (total 20 columns):
Y                                         1498 non-null int64
Sex 1.Male 2.Female                       1498 non-null int64
Age                                       1494 non-null float64
Before treatment                          1498 non-null int64
Severity 1.mild/moderate 2.severe         1498 non-null int64
Platelet                                  1498 non-null float64
FDP                                       1303 non-null float64
SIRS_t1                                   1498 non-null int64
SIRS 3 or more 1. positive 2. negative    1498 non-null int64
JAAM DIC score                            1498 non-null int64
JAAM DIC 1.positive 2.negative            1498 non-null int64
SOFA respiratory                          1498 non-null int64
SOFA coagulation                          1498 non-null int64
SOFA hepatic                              1498 non-null int64
SOFA cardiovascular

For convenience, we separate the independent variables `X` and the dependent variable `y` from the data.

In [16]:
X = df.drop('Y',axis=1)
y = df['Y']

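We impute the missing values in each column of `X`: numerical columns are filled with the column mean, and columns of dtype object with their most frequent value. (A median-based variant is left commented out in the code.)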

In [17]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.
    - Columns of dtype object are imputed with the most frequent value in the column.
    - Columns of other types are imputed with the mean of the column.
    """
    def fit(self, X, y=None):
        # object (categorical) --> most frequent value, numerical --> mean
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O') else X[c].mean()
             for c in X],
            index=X.columns)
        # alternative: replace X[c].mean() by X[c].median() for numerical columns
        return self

    def transform(self, X, y=None):
        # fill each column with its own precomputed value
        return X.fillna(self.fill)

In [18]:
X = DataFrameImputer().fit_transform(X)
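The choice between mean and median matters for skewed columns. The sketch below is a plain-pandas illustration on hypothetical toy data; the helper `fill_values` is made up for this example and is not part of the notebook. It computes one fill value per column and compares the two statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one skewed numerical column, one categorical column.
toy = pd.DataFrame({
    'value': [1.0, 2.0, 3.0, 100.0, np.nan],
    'group': ['a', 'b', 'a', None, 'a'],
})

def fill_values(frame, numeric_stat='mean'):
    """Return one fill value per column (hypothetical helper).

    Numerical columns use the given statistic ('mean' or 'median');
    all other columns use their most frequent value.
    """
    fills = {}
    for c in frame:
        if pd.api.types.is_numeric_dtype(frame[c]):
            fills[c] = getattr(frame[c], numeric_stat)()
        else:
            fills[c] = frame[c].value_counts().index[0]
    return fills

filled_mean = toy.fillna(fill_values(toy, 'mean'))
filled_median = toy.fillna(fill_values(toy, 'median'))

print(filled_mean.loc[4, 'value'], filled_median.loc[4, 'value'])
```

Here the single outlier pulls the mean far above the median, so the imputed value differs considerably between the two variants.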

In [19]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1498 entries, 0 to 1497
Data columns (total 19 columns):
Sex 1.Male 2.Female                       1498 non-null int64
Age                                       1498 non-null float64
Before treatment                          1498 non-null int64
Severity 1.mild/moderate 2.severe         1498 non-null int64
Platelet                                  1498 non-null float64
FDP                                       1498 non-null float64
SIRS_t1                                   1498 non-null int64
SIRS 3 or more 1. positive 2. negative    1498 non-null int64
JAAM DIC score                            1498 non-null int64
JAAM DIC 1.positive 2.negative            1498 non-null int64
SOFA respiratory                          1498 non-null int64
SOFA coagulation                          1498 non-null int64
SOFA hepatic                              1498 non-null int64
SOFA cardiovascular                       1498 non-null int64
SOFA renal

The data are now clean: no missing values remain in `X`. We convert the attributes `X` and the target `y` to NumPy arrays and save them together in a single file, with `y` as the last column.

In [20]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('coag_cleaned.dat',Xy,fmt='%s')
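To confirm that a file written this way can be read back, the following sketch performs a round trip on hypothetical toy arrays, using a temporary file instead of `coag_cleaned.dat`. It assumes all values are numeric, so that `np.loadtxt` can parse them, and that the target is stored in the last column as above.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy arrays standing in for the cleaned X and y.
X = np.array([[72.0, 41.0], [76.0, 23.0]])
y = np.array([1, 2])
Xy = np.hstack((X, y[:, np.newaxis]))

# Write to a temporary file with the same format string as in the notebook.
path = os.path.join(tempfile.mkdtemp(), 'toy_cleaned.dat')
np.savetxt(path, Xy, fmt='%s')

# Read the file back and split it into attributes and target.
loaded = np.loadtxt(path)
X_loaded = loaded[:, :-1]
y_loaded = loaded[:, -1].astype(int)

print(X_loaded.shape, y_loaded)
```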