## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing bad rows and unusable attributes and by imputing missing values.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('peptide.csv', sep=',')

In [4]:
df.head()

Unnamed: 0,Patientcode,Age,Sex,Caucasianeth,Smokerever,Smokerpresent,DM,Hypertension,Cerebrovasceventprevious,PAD,...,CKD_EPI,hsCRP,NTProBNP,NTProBNPgreater218,NTproBNP100,MCP1,Gal3,hsTroponinI,TWEAK,SBP
0,497,63,1,1,1,0,0,1,0,0,...,79.744541,1.588,37.3,0,0.373,151.6,9.86,0.0,123.48,150
1,472,51,0,1,1,0,1,1,0,0,...,65.387946,4.155,556.0,1,5.56,164.13,18.45,0.0,238.96,100
2,14,83,1,1,1,0,0,1,0,0,...,69.292452,4.194,1110.0,1,11.1,155.11,3.79,0.023,618.84,145
3,438,64,1,1,1,0,0,1,0,0,...,89.943755,0.738,35.9,0,0.359,106.86,8.66,0.003,130.13,140
4,477,81,1,1,1,0,1,1,0,0,...,56.371204,0.542,224.0,1,2.24,209.73,8.23,0.003,116.26,120


We drop the column `Patientcode`, which is an identifier rather than an attribute.

In [5]:
df = df.drop('Patientcode',axis=1)

There are 60 attributes and 1 class, `Death`.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 61 columns):
Age                               699 non-null int64
Sex                               699 non-null int64
Caucasianeth                      699 non-null object
Smokerever                        699 non-null object
Smokerpresent                     699 non-null int64
DM                                699 non-null int64
Hypertension                      699 non-null int64
Cerebrovasceventprevious          699 non-null int64
PAD                               699 non-null int64
HFprevious                        699 non-null int64
EFlower40                         699 non-null int64
AF                                699 non-null int64
TypelastACS                       699 non-null int64
Numbervessels                     699 non-null object
PCIlastevent                      699 non-null int64
DESatlastevent                    699 non-null int64
CABGatlastevent                   699 non-nu
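Because the `df.info()` listing is long, it can help to pull out only the columns stored as `object`, since those are the ones that need cleaning. The snippet below is a minimal sketch on a toy frame: the column names `Age` and `Smokerever` come from the data above, but the values are made up for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values: one integer column and one
# column that was read as text because of a stray "?".
toy = pd.DataFrame({
    "Age": [63, 51],
    "Smokerever": pd.Series(["1", "?"], dtype=object),
})

# Keep only the names of the columns whose dtype is object.
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(object_cols)
```

Applied to `df`, the same call would list every column that `df.info()` reports as `object`.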

In [10]:
df = df.drop(['HISTOLOGYCANCER', 'CANCERBEFORE100DAYS', 'CauseofDeath','HISTOLOGYCANCERTABULATED','Datestartfollowup'],axis=1)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 0 to 698
Data columns (total 56 columns):
Age                               699 non-null int64
Sex                               699 non-null int64
Caucasianeth                      699 non-null object
Smokerever                        699 non-null object
Smokerpresent                     699 non-null int64
DM                                699 non-null int64
Hypertension                      699 non-null int64
Cerebrovasceventprevious          699 non-null int64
PAD                               699 non-null int64
HFprevious                        699 non-null int64
EFlower40                         699 non-null int64
AF                                699 non-null int64
TypelastACS                       699 non-null int64
Numbervessels                     699 non-null object
PCIlastevent                      699 non-null int64
DESatlastevent                    699 non-null int64
CABGatlastevent                   699 non-nu

To clean the data, we first replace values that consist only of whitespace with `NaN`.

In [12]:
# replace whitespace-only cells with NaN
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,Age,Sex,Caucasianeth,Smokerever,Smokerpresent,DM,Hypertension,Cerebrovasceventprevious,PAD,HFprevious,...,CKD_EPI,hsCRP,NTProBNP,NTProBNPgreater218,NTproBNP100,MCP1,Gal3,hsTroponinI,TWEAK,SBP
0,63,1,1,1,0,0,1,0,0,0,...,79.744541,1.588,37.3,0,0.373,151.60,9.86,0,123.48,150
1,51,0,1,1,0,1,1,0,0,0,...,65.387946,4.155,556,1,5.56,164.13,18.45,0,238.96,100
2,83,1,1,1,0,0,1,0,0,1,...,69.292452,4.194,1110,1,11.1,155.11,3.79,0.023,618.84,145
3,64,1,1,1,0,0,1,0,0,0,...,89.943755,0.738,35.9,0,0.359,106.86,8.66,0.003,130.13,140
4,81,1,1,1,0,1,1,0,0,0,...,56.371204,0.542,224,1,2.24,209.73,8.23,0.003,116.26,120
5,72,1,1,0,0,0,1,0,0,0,...,74.859030,4.650,143,0,1.43,344.80,5.71,0.007,272.19,125
6,75,1,1,0,0,1,1,0,0,0,...,65.320303,12.087,931,1,9.31,57.20,8.62,0.012,207.13,170
7,61,0,1,1,0,0,1,0,0,0,...,79.827842,2.684,301,1,3.01,132.40,10.05,0,251.73,130
8,65,0,1,0,0,0,1,0,0,0,...,91.214450,3.010,271,1,2.71,161.87,7.61,0,196.19,140
9,40,0,0,1,0,0,0,0,0,0,...,108.725755,2.644,85.4,0,0.854,68.80,6.50,0,98.73,99
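The pattern `^\s+$` matches cells that contain one or more whitespace characters and nothing else. It does not match a truly empty string, which is usually harmless here because `pd.read_csv` parses empty fields as `NaN` by default. The sketch below illustrates the difference on a toy frame with made-up values; the variant pattern `^\s*$` is shown only for comparison and is not used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): " " and "\t" are whitespace-only,
# "" is an empty string.
toy = pd.DataFrame({"a": ["1", " ", "3"], "b": ["x", "", "\t"]}, dtype=object)

# Pattern used in the notebook: one or more whitespace characters.
cleaned = toy.replace(r"^\s+$", np.nan, regex=True)
n_missing = cleaned.isnull().sum()

# Comparison pattern: zero or more whitespace characters, so "" matches too.
cleaned_all = toy.replace(r"^\s*$", np.nan, regex=True)
n_missing_all = cleaned_all.isnull().sum()

print(n_missing)
print(n_missing_all)
```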


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 0 to 698
Data columns (total 56 columns):
Age                               699 non-null int64
Sex                               699 non-null int64
Caucasianeth                      699 non-null object
Smokerever                        699 non-null object
Smokerpresent                     699 non-null int64
DM                                699 non-null int64
Hypertension                      699 non-null int64
Cerebrovasceventprevious          699 non-null int64
PAD                               699 non-null int64
HFprevious                        699 non-null int64
EFlower40                         699 non-null int64
AF                                699 non-null int64
TypelastACS                       699 non-null int64
Numbervessels                     699 non-null object
PCIlastevent                      699 non-null int64
DESatlastevent                    699 non-null int64
CABGatlastevent                   699 non-nu

We identify bad rows, meaning rows with more than four missing values, and remove them. In this dataset no row exceeds the threshold, so nothing is removed.

In [14]:
# find bad rows: more than 4 missing values in a row
n_null = df.isnull().sum(axis=1)
bad_row = np.array(df.index[n_null > 4])

print(bad_row)
print(len(bad_row))

# delete bad rows (by index label)
df = df.drop(bad_row)
df.info()

[]
0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 0 to 698
Data columns (total 56 columns):
Age                               699 non-null int64
Sex                               699 non-null int64
Caucasianeth                      699 non-null object
Smokerever                        699 non-null object
Smokerpresent                     699 non-null int64
DM                                699 non-null int64
Hypertension                      699 non-null int64
Cerebrovasceventprevious          699 non-null int64
PAD                               699 non-null int64
HFprevious                        699 non-null int64
EFlower40                         699 non-null int64
AF                                699 non-null int64
TypelastACS                       699 non-null int64
Numbervessels                     699 non-null object
PCIlastevent                      699 non-null int64
DESatlastevent                    699 non-null int64
CABGatlastevent                   699 n
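The row filter above can be checked on a small example. The following is a sketch on a toy frame with made-up values; the threshold `max_missing = 1` is a hypothetical choice that keeps the example small, whereas the notebook uses 4.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 1 is missing everything.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, 9.0],
})
max_missing = 1  # hypothetical threshold for this toy example

# Count missing values per row, then select the labels of rows over the threshold.
n_null = toy.isnull().sum(axis=1)
bad_rows = toy.index[n_null > max_missing]

# Drop those rows by label.
kept = toy.drop(index=bad_rows)
print(list(bad_rows), list(kept.index))
```

Selecting labels from `toy.index` rather than counting positions keeps the drop correct even when the index is not a plain `0..n-1` range.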

The data still contain stray characters: tabs (`\t`), spaces, and question marks (`?`, written `\?` as a regular expression). We delete the tabs and spaces and convert each `?` to `np.nan`.

In [15]:
df = df.replace('\t', '', regex=True)
df = df.replace(' ', '', regex=True)
# use the float np.nan here, not the string 'np.nan'
df = df.replace(r'\?', np.nan, regex=True)
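One detail worth checking in this kind of cleanup is whether the replacement value is the float `np.nan` or the text `'np.nan'`: only the former is counted as missing by `isnull()`. The sketch below shows the difference on a toy series with made-up values. The anchored pattern `^\?$` is an assumption made for the example so that only cells consisting of a single `?` are affected.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) stored as text, with "?" as a placeholder.
toy = pd.Series(["1.5", "?", "2.0"], dtype=object)

# Replacing with the string 'np.nan' just inserts six characters of text.
as_text = toy.replace(r"^\?$", "np.nan", regex=True)

# Replacing with the float np.nan inserts a real missing value.
as_nan = toy.replace(r"^\?$", np.nan, regex=True)

n_text = int(as_text.isnull().sum())
n_nan = int(as_nan.isnull().sum())

# A later numeric conversion with errors='coerce' turns the text into NaN anyway.
n_coerced = int(pd.to_numeric(as_text, errors="coerce").isnull().sum())
print(n_text, n_nan, n_coerced)
```

The text version only becomes missing once a column goes through `pd.to_numeric(..., errors='coerce')`, so columns that skip that step would keep the literal text.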

In [17]:
# columns that were read as object because of stray characters
cols = ['Caucasianeth', 'Smokerever', 'Numbervessels', 'Completerevasc', 'BMI',
        'NTProBNP', 'NTProBNPgreater218', 'NTproBNP100', 'hsTroponinI', 'SBP']

# convert to numeric; entries that cannot be parsed become NaN
for c in cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# check the resulting dtypes
for c in cols:
    print(df[c].dtype)

float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
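Since `errors='coerce'` silently turns anything unparseable into `NaN`, it can be useful to count how many values a conversion discards. The following is a minimal sketch on a toy series with made-up values; the name `n_lost` is introduced here for illustration only.

```python
import pandas as pd

# Toy series (hypothetical values): one entry is already missing,
# one entry ("n/a") cannot be parsed as a number.
raw = pd.Series(["12", "7.5", "n/a", None, "3"], dtype=object)

# Unparseable entries become NaN instead of raising an error.
num = pd.to_numeric(raw, errors="coerce")

# Values that were present before the conversion but are missing after it.
n_lost = int(num.isnull().sum() - raw.isnull().sum())
print(num.dtype, n_lost)
```

A large `n_lost` for a column would suggest that it contains a systematic non-numeric code rather than occasional noise.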


For convenience, we separate the independent variables `X` and the dependent variable `y`.

In [18]:
X = df.drop('Death',axis=1)
y = df['Death']

We impute the missing values in each column of `X` with that column's median.

In [19]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        - Columns of dtype object are imputed with the most frequent value in column.
        - Columns of other types are imputed with median of column.
        """
    def fit(self, X, y=None):
        # categorical (object) --> most frequent value, numerical --> median
        # (replace X[c].median() by X[c].mean() to impute with the mean instead)
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].median() for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [20]:
X = DataFrameImputer().fit_transform(X)

In [21]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 0 to 698
Data columns (total 55 columns):
Age                               699 non-null int64
Sex                               699 non-null int64
Caucasianeth                      699 non-null float64
Smokerever                        699 non-null float64
Smokerpresent                     699 non-null int64
DM                                699 non-null int64
Hypertension                      699 non-null int64
Cerebrovasceventprevious          699 non-null int64
PAD                               699 non-null int64
HFprevious                        699 non-null int64
EFlower40                         699 non-null int64
AF                                699 non-null int64
TypelastACS                       699 non-null int64
Numbervessels                     699 non-null float64
PCIlastevent                      699 non-null int64
DESatlastevent                    699 non-null int64
CABGatlastevent                   699 non
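When every column is numeric, the imputation above reduces to filling each column with its own median. The snippet below is a sketch of that special case on a toy frame with made-up values; it uses plain pandas and is not a replacement for the `DataFrameImputer` class.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 10.0],
    "y": [0.0, 1.0, np.nan, 1.0],
})

# Per-column medians; missing values are skipped by default.
medians = toy.median()

# fillna with a Series fills each column with the value under its own name.
filled = toy.fillna(medians)
print(filled)
```

The median is less sensitive than the mean to extreme values such as the `10.0` in column `x`.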

The data are now clean. We convert the attributes `X` and the target `y` to NumPy arrays, stack them with `y` as the last column, and save the result to a file.

In [22]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('peptide_cleaned.dat',Xy,fmt='%s')
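A saved file of this form can be read back and split into attributes and target again. The sketch below does a round trip on a toy array with made-up values, writing to a temporary file named `toy_cleaned.dat` (a hypothetical name) rather than touching `peptide_cleaned.dat`.

```python
import os
import tempfile

import numpy as np

# Toy attributes and target (hypothetical values).
X = np.array([[63.0, 1.0], [51.0, 0.0]])
y = np.array([0, 1])

# Same layout as above: the target becomes the last column.
Xy = np.hstack((X, y[:, np.newaxis]))

# Write to a temporary location with the same format string.
path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.dat")
np.savetxt(path, Xy, fmt="%s")

# Read back and split: all columns but the last are attributes.
loaded = np.loadtxt(path)
X_back, y_back = loaded[:, :-1], loaded[:, -1]
print(X_back.shape, y_back.shape)
```

Note that `np.hstack` promotes everything to one dtype, so the target comes back as floating point and may need casting to integers before use as class labels.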