## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data: we remove bad rows and bad attributes, impute missing values, and so on.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('tanzania.csv', sep=',')

In [4]:
df.head()

Unnamed: 0,GA,FHR_during_labour,Amniotic_fluid,Delivery_Mode,CS_category,CS_indication,BW,Apgar_1min,Apgar_5min,BMV,Outcome_30min,Outcome_24hours,Outcome_7days,Time_of_death,temp12ad0,puls13ad0,oxy14ad0,HIE_highest,major_cause,outcome
0,42.0,1,2,1,,,3000,8,10,2,2,2,3.0,2.0,36.0,144.0,60.0,,6.0,1
1,40.0,9,1,1,,,2500,9,10,2,2,3,,1.0,35.4,155.0,50.0,,6.0,1
2,38.0,1,1,1,,,2815,7,10,1,1,2,3.0,2.0,36.4,166.0,71.0,,6.0,1
3,36.0,1,1,1,,,2125,8,7,1,2,2,3.0,2.0,34.9,136.0,71.0,,6.0,1
4,32.0,9,1,1,,,2115,9,10,2,2,2,3.0,3.0,36.1,131.0,72.0,,6.0,1


The dependent variable is the patient outcome, `outcome`.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 671 entries, 0 to 670
Data columns (total 20 columns):
GA                   599 non-null float64
FHR_during_labour    671 non-null int64
Amniotic_fluid       671 non-null int64
Delivery_Mode        671 non-null int64
CS_category          262 non-null float64
CS_indication        261 non-null float64
BW                   671 non-null int64
Apgar_1min           671 non-null int64
Apgar_5min           671 non-null int64
BMV                  671 non-null int64
Outcome_30min        671 non-null int64
Outcome_24hours      671 non-null int64
Outcome_7days        410 non-null float64
Time_of_death        124 non-null float64
temp12ad0            377 non-null float64
puls13ad0            363 non-null float64
oxy14ad0             371 non-null float64
HIE_highest          154 non-null float64
major_cause          124 non-null float64
outcome              671 non-null int64
dtypes: float64(10), int64(10)
memory usage: 104.9 KB


To clean the data, we first replace empty (whitespace-only) values with `nan`.

In [6]:
# replace empty by nan
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,GA,FHR_during_labour,Amniotic_fluid,Delivery_Mode,CS_category,CS_indication,BW,Apgar_1min,Apgar_5min,BMV,Outcome_30min,Outcome_24hours,Outcome_7days,Time_of_death,temp12ad0,puls13ad0,oxy14ad0,HIE_highest,major_cause,outcome
0,42.0,1,2,1,,,3000,8,10,2,2,2,3.0,2.0,36.0,144.0,60.0,,6.0,1
1,40.0,9,1,1,,,2500,9,10,2,2,3,,1.0,35.4,155.0,50.0,,6.0,1
2,38.0,1,1,1,,,2815,7,10,1,1,2,3.0,2.0,36.4,166.0,71.0,,6.0,1
3,36.0,1,1,1,,,2125,8,7,1,2,2,3.0,2.0,34.9,136.0,71.0,,6.0,1
4,32.0,9,1,1,,,2115,9,10,2,2,2,3.0,3.0,36.1,131.0,72.0,,6.0,1
5,36.0,1,1,1,,,2160,5,0,1,2,3,,1.0,34.0,132.0,49.0,17.0,6.0,1
6,40.0,1,1,2,1.0,5.0,2085,9,10,2,2,2,3.0,2.0,35.0,136.0,96.0,,6.0,1
7,38.0,1,1,1,,,2985,7,10,2,2,3,,1.0,35.0,100.0,66.0,,6.0,1
8,33.0,1,4,2,1.0,6.0,1660,9,10,2,2,2,3.0,2.0,36.7,158.0,53.0,,6.0,1
9,40.0,9,1,1,,,2700,9,10,2,2,3,,1.0,33.0,128.0,92.0,,6.0,1
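
The pattern `^\s+$` used above matches cells that consist only of whitespace, but it requires at least one whitespace character, so a truly empty string would not match. The following is a minimal sketch of that difference on a hypothetical toy column (the names `toy`, `strict`, and `loose` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: an empty string, a blank string, and a real value
toy = pd.DataFrame({'v': ['', '   ', '7']})

# '^\s+$' needs at least one whitespace character, so '' is left unchanged
strict = toy.replace(r'^\s+$', np.nan, regex=True)

# '^\s*$' also matches the empty string
loose = toy.replace(r'^\s*$', np.nan, regex=True)

n_strict = int(strict['v'].isna().sum())
n_loose = int(loose['v'].isna().sum())
print(n_strict, n_loose)
```

For a file read with `pd.read_csv`, empty fields are normally already parsed as `NaN`, which would explain why `df.info()` is unchanged by this step.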


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 671 entries, 0 to 670
Data columns (total 20 columns):
GA                   599 non-null float64
FHR_during_labour    671 non-null int64
Amniotic_fluid       671 non-null int64
Delivery_Mode        671 non-null int64
CS_category          262 non-null float64
CS_indication        261 non-null float64
BW                   671 non-null int64
Apgar_1min           671 non-null int64
Apgar_5min           671 non-null int64
BMV                  671 non-null int64
Outcome_30min        671 non-null int64
Outcome_24hours      671 non-null int64
Outcome_7days        410 non-null float64
Time_of_death        124 non-null float64
temp12ad0            377 non-null float64
puls13ad0            363 non-null float64
oxy14ad0             371 non-null float64
HIE_highest          154 non-null float64
major_cause          124 non-null float64
outcome              671 non-null int64
dtypes: float64(10), int64(10)
memory usage: 104.9 KB


We look for bad rows, i.e. rows that contain too many missing values (more than 10 here), and remove them.

In [8]:
# find bad rows having too many missing values (more than 10)
n_null = np.array(df.isnull().sum(axis=1))
# integer positions of the bad rows (an empty array if there are none)
bad_row = np.flatnonzero(n_null > 10)

print(bad_row)
print(len(bad_row))

# delete bad rows: map positions to index labels before dropping
df = df.drop(df.index[bad_row])
df.info()

[]
0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 671 entries, 0 to 670
Data columns (total 20 columns):
GA                   599 non-null float64
FHR_during_labour    671 non-null int64
Amniotic_fluid       671 non-null int64
Delivery_Mode        671 non-null int64
CS_category          262 non-null float64
CS_indication        261 non-null float64
BW                   671 non-null int64
Apgar_1min           671 non-null int64
Apgar_5min           671 non-null int64
BMV                  671 non-null int64
Outcome_30min        671 non-null int64
Outcome_24hours      671 non-null int64
Outcome_7days        410 non-null float64
Time_of_death        124 non-null float64
temp12ad0            377 non-null float64
puls13ad0            363 non-null float64
oxy14ad0             371 non-null float64
HIE_highest          154 non-null float64
major_cause          124 non-null float64
outcome              671 non-null int64
dtypes: float64(10), int64(10)
memory usage: 110.1 KB
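
The same row filter can also be written without an explicit loop. The sketch below uses a hypothetical toy frame and a smaller, purely illustrative threshold (`max_missing = 2`) to show a boolean mask and the equivalent `dropna(thresh=...)` call. Note that `thresh` counts non-null values, so it is the number of columns minus the allowed number of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4 columns; row 1 has 3 missing values
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [1.0, np.nan, np.nan],
    'c': [1.0, np.nan, 3.0],
    'd': [1.0, 2.0, 3.0],
})

max_missing = 2  # illustrative threshold; the notebook uses 10 for 20 columns

# Boolean mask: keep rows with at most max_missing null values
kept_mask = toy[toy.isnull().sum(axis=1) <= max_missing]

# Equivalent dropna call: require at least (n_columns - max_missing) non-null values
kept_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(kept_mask.index.tolist())
```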


The data may still contain stray characters, such as `\t`, ` `, and `?`. We delete `\t` and ` ` and convert `?` to `np.nan`.

In [9]:
df = df.replace(r'\t', '', regex=True)
df = df.replace(' ', '', regex=True)
# use the value np.nan, not the string 'np.nan'
df = df.replace(r'\?', np.nan, regex=True)
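
Whether `?` is replaced by the value `np.nan` or by the string `'np.nan'` matters for any later numeric conversion: only the former is treated as a missing value. Here is a minimal sketch on hypothetical string data (the frame `toy` and its contents are assumptions for illustration; the columns of `tanzania.csv` are already numeric according to `df.info()`):

```python
import numpy as np
import pandas as pd

# Hypothetical raw text columns with tabs, spaces, and '?' placeholders
toy = pd.DataFrame({'a': ['1\t', ' 2', '?'], 'b': ['3', '?', '4 ']})

# Strip tabs and spaces inside each cell
toy = toy.replace(r'[\t ]', '', regex=True)

# Cells that are exactly '?' become real missing values
toy = toy.replace(r'^\?$', np.nan, regex=True)

# Convert the cleaned strings to numbers
toy = toy.apply(pd.to_numeric, errors='coerce')

print(toy)
```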

For convenience, we separate the independent variables `X` and the dependent variable `y` from the data.

In [10]:
X = df.drop('outcome',axis=1)
y = df['outcome']

In [11]:
x1 = np.array(X)
x1

array([[42.,  1.,  2., ..., 60., nan,  6.],
       [40.,  9.,  1., ..., 50., nan,  6.],
       [38.,  1.,  1., ..., 71., nan,  6.],
       ...,
       [40.,  1.,  2., ..., nan, nan, nan],
       [40.,  9.,  1., ..., nan, nan, nan],
       [nan,  9.,  1., ..., 92., nan, nan]])

We identify the variables with excessive missing values (more than 5 here) and drop them from the dataset.

In [12]:
i_missing = []
for i in range(x1.shape[1]):
    n_missing = np.sum(np.isnan(x1[:,i]))
    if n_missing > 5:
        print(i,n_missing)
        i_missing.append(i)    
print(i_missing)    

0 72
4 409
5 410
12 261
13 547
14 294
15 308
16 300
17 517
18 547
[0, 4, 5, 12, 13, 14, 15, 16, 17, 18]


In [13]:
x2 = np.delete(x1,i_missing,axis=1)
x2.shape

(671, 9)
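
The column filter above works on a plain numpy array, so the column names are lost. As an alternative sketch, the same selection can be done directly on a DataFrame, which keeps the names. The toy values and the threshold `max_missing = 1` below are hypothetical and chosen only to keep the example small:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing three column names from the dataset
toy = pd.DataFrame({
    'GA': [42.0, np.nan, 38.0, np.nan],
    'BW': [3000, 2500, 2815, 2125],
    'HIE_highest': [np.nan, np.nan, np.nan, 17.0],
})

max_missing = 1  # illustrative threshold; the notebook uses 5

# Number of missing values per column
n_missing = toy.isnull().sum()

# Names of the columns that exceed the threshold
dropped = n_missing[n_missing > max_missing].index.tolist()

reduced = toy.drop(columns=dropped)
print(dropped, reduced.shape)
```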

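We impute the missing values in each column of `X`: columns of dtype object are filled with their most frequent value, and numeric columns are filled with the column mean (a median variant is left commented out in the code). Note that the imputer is applied to the full DataFrame `X` with all 19 columns, not to the reduced array `x2`.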

In [14]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.
    - Columns of dtype object are imputed with the most frequent value in column.
    - Columns of other types are imputed with mean of column.
    """
    def fit(self, X, y=None):
        # categorical (object) --> most frequent value, numerical --> mean
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O') else X[c].mean()
             for c in X],
            index=X.columns)
        # alternative: use X[c].median() instead of X[c].mean() for numerical columns
        return self

    def transform(self, X, y=None):
        # fill each column's missing entries with the value learned in fit
        return X.fillna(self.fill)

In [15]:
X = DataFrameImputer().fit_transform(X)

In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 671 entries, 0 to 670
Data columns (total 19 columns):
GA                   671 non-null float64
FHR_during_labour    671 non-null int64
Amniotic_fluid       671 non-null int64
Delivery_Mode        671 non-null int64
CS_category          671 non-null float64
CS_indication        671 non-null float64
BW                   671 non-null int64
Apgar_1min           671 non-null int64
Apgar_5min           671 non-null int64
BMV                  671 non-null int64
Outcome_30min        671 non-null int64
Outcome_24hours      671 non-null int64
Outcome_7days        671 non-null float64
Time_of_death        671 non-null float64
temp12ad0            671 non-null float64
puls13ad0            671 non-null float64
oxy14ad0             671 non-null float64
HIE_highest          671 non-null float64
major_cause          671 non-null float64
dtypes: float64(10), int64(9)
memory usage: 104.8 KB
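
The choice between the mean and the median as fill value can change the imputed numbers, especially for skewed columns. The following sketch uses only pandas on hypothetical values to compare the two. It is not the `DataFrameImputer` class itself, just the same `fillna` idea applied directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; one missing entry per column
toy = pd.DataFrame({
    'GA': [40.0, np.nan, 38.0, 30.0],
    'BW': [3000.0, 2500.0, np.nan, 2000.0],
})

# Fill each column with its own mean or its own median
mean_filled = toy.fillna(toy.mean())
median_filled = toy.fillna(toy.median())

print(mean_filled.loc[1, 'GA'], median_filled.loc[1, 'GA'])
```

In this toy example the low value 30.0 pulls the mean of `GA` below its median, so the two strategies fill the gap differently.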


The data now contain no missing values. We convert the attributes `X` and the target `y` to numpy arrays and save them to a file, with `y` stored as the last column.

In [17]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('tanzania_cleaned.dat',Xy,fmt='%s')
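
A file written this way can be read back with `np.loadtxt` and split into attributes and target again. The sketch below checks this round trip on a small hypothetical array written to a temporary file (`toy_cleaned.dat` and the `*_toy` names are assumptions for illustration, not files produced by the notebook):

```python
import os
import tempfile

import numpy as np

# Hypothetical toy attributes and target
X_toy = np.array([[42.0, 1.0], [40.0, 9.0], [38.0, 1.0]])
y_toy = np.array([1, 1, 0])

# Same layout as above: target appended as the last column
Xy_toy = np.hstack((X_toy, y_toy[:, np.newaxis]))

path = os.path.join(tempfile.mkdtemp(), 'toy_cleaned.dat')
np.savetxt(path, Xy_toy, fmt='%s')

# Read the file back and split it into attributes and target
loaded = np.loadtxt(path)
X_back = loaded[:, :-1]
y_back = loaded[:, -1].astype(int)

print(X_back.shape, y_back)
```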