## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing bad rows, removing bad attributes, and imputing missing values.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('heartattack.csv', sep=',')

In [4]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,...,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254,Unnamed: 255
0,28,1,2,130,132,0,2,185,0,0.0,...,,,,,,,,,,
1,29,1,2,120,243,0,0,160,0,0.0,...,,,,,,,,,,
2,29,1,2,140,?,0,0,170,0,0.0,...,,,,,,,,,,
3,30,0,1,170,237,0,1,170,0,0.0,...,,,,,,,,,,
4,31,0,2,100,219,0,1,150,0,0.0,...,,,,,,,,,,


The dependent variable is `num`, the resulting classification of the heart attack patients in the study.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Columns: 256 entries, age to Unnamed: 255
dtypes: float64(243), int64(4), object(9)
memory usage: 588.1+ KB


To clean the data, we first replace empty (whitespace-only) values with `nan`.

In [6]:
# replace empty by nan
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,...,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254,Unnamed: 255
0,28,1,2,130,132,0,2,185,0,0.0,...,,,,,,,,,,
1,29,1,2,120,243,0,0,160,0,0.0,...,,,,,,,,,,
2,29,1,2,140,?,0,0,170,0,0.0,...,,,,,,,,,,
3,30,0,1,170,237,0,1,170,0,0.0,...,,,,,,,,,,
4,31,0,2,100,219,0,1,150,0,0.0,...,,,,,,,,,,
5,32,0,2,105,198,0,0,165,0,0.0,...,,,,,,,,,,
6,32,1,2,110,225,0,0,184,0,0.0,...,,,,,,,,,,
7,32,1,2,125,254,0,0,155,0,0.0,...,,,,,,,,,,
8,33,1,3,120,298,0,0,185,0,0.0,...,,,,,,,,,,
9,34,0,2,130,161,0,0,190,0,0.0,...,,,,,,,,,,
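To make the effect of the pattern concrete, here is a minimal sketch on a toy frame. The column values are made up for illustration and are not taken from `heartattack.csv`. It compares `^\s+$`, which matches cells that contain only whitespace, with the slightly broader `^\s*$`, which also matches the empty string.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with blank and whitespace-only cells.
toy = pd.DataFrame({"chol": ["132", "   ", "", "243"],
                    "fbs": ["0", "\t", "1", " "]}, dtype=object)

# r"^\s+$" matches cells made only of whitespace (one or more characters).
only_ws = toy.replace(r"^\s+$", np.nan, regex=True)
# r"^\s*$" additionally matches the empty string "".
ws_or_empty = toy.replace(r"^\s*$", np.nan, regex=True)

# Count how many cells each pattern turned into missing values.
n_only_ws = int(only_ws.isna().sum().sum())
n_ws_or_empty = int(ws_or_empty.isna().sum().sum())
print(n_only_ws, n_ws_or_empty)
```

Which pattern is appropriate depends on whether empty strings can occur after loading; `pd.read_csv` already reads truly empty fields as `NaN` by default.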


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Columns: 256 entries, age to Unnamed: 255
dtypes: float64(243), int64(4), object(9)
memory usage: 588.1+ KB


We find the bad rows, i.e. those with more than 10 missing values, and remove them.

In [8]:
# find bad rows having too many missing values
n_null = np.array(df.isnull().sum(axis=1))
bad_row = np.array([])
for t in range(len(n_null)):
    if n_null[t] > 10:
        #print(t)
        bad_row = np.append(bad_row,t)
        
print(bad_row)
print(len(bad_row))

# delete bad rows
df = df.drop(bad_row)
df.info()

[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
  28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139.
 140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153.
 154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167.
 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181.
 182. 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195.
 196. 
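The same row filter can be written without an explicit loop. The sketch below is one possible vectorized form on a toy frame; the data and the threshold `max_missing = 2` are hypothetical and chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 1 has 3 missing cells, row 2 has 1.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [4.0, np.nan, np.nan],
                    "c": [7.0, np.nan, 9.0]})

max_missing = 2  # hypothetical threshold; the notebook cell above uses 10

# Count missing cells per row, then collect the labels of rows over the threshold.
n_null = toy.isnull().sum(axis=1)
bad_rows = n_null.index[n_null > max_missing]

# Drop those rows by label.
cleaned = toy.drop(index=bad_rows)
print(list(bad_rows), cleaned.shape)
```

Because the labels are taken directly from the index, they keep the index's integer type instead of becoming floats.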

The data still contain errors such as tab characters (`\t`), spaces (` `), and question marks (`?`). We delete the tabs and spaces and convert `?` to `np.nan`.

In [9]:
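# delete tabs and spaces inside cells
df = df.replace(r'\t', '', regex=True)
df = df.replace(' ', '', regex=True)
# convert '?' to a real missing value (np.nan itself, not the string 'np.nan')
df = df.replace(r'\?', np.nan, regex=True)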
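The difference between the float `np.nan` and the text `'np.nan'` matters for this step, because only the former is counted by `isna()`. The following sketch, on a made-up column, shows both outcomes and also `pd.to_numeric(..., errors="coerce")` as an alternative route to the same result.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) that uses "?" as a missing-value placeholder.
toy = pd.DataFrame({"chol": ["132", "?", "237"]}, dtype=object)

# Replacing with the string 'np.nan' leaves text behind, so nothing is missing.
as_text = toy.replace(r"\?", "np.nan", regex=True)
# Replacing with the float np.nan produces a genuine missing value.
as_nan = toy.replace(r"\?", np.nan, regex=True)

n_missing_text = int(as_text["chol"].isna().sum())
n_missing_nan = int(as_nan["chol"].isna().sum())

# Alternative: coerce to numbers; anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(toy["chol"], errors="coerce")
print(n_missing_text, n_missing_nan, int(numeric.isna().sum()))
```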

For convenience, we separate the independent variables `X` and the dependent variable `y`.

In [14]:
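# strip stray whitespace from the column labels so that the target can be
# selected as 'num' (assumes the label differs from 'num' only by whitespace)
df.columns = df.columns.str.strip()
X = df.drop('num', axis=1)
y = df['num']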

In [12]:
x1 = np.array(X)
x1

array([[ 2. , 10.4,  6.3, ...,  0. ,  1. ,  1. ],
       [ 2. , 83.8, 27.1, ...,  0. ,  1. ,  1. ],
       [ 2. , 63.3, 51.4, ...,  0. ,  1. ,  1. ],
       ...,
       [ 2. , 38.4,  nan, ...,  0. ,  2. ,  1. ],
       [ 2. , 68.1, 79.2, ...,  1. ,  1. ,  1. ],
       [ 2. ,  4. ,  1.5, ...,  0. ,  1. ,  1. ]])

We determine and drop the variables with excessive missing values from the dataset.

In [13]:
i_missing = []
for i in range(x1.shape[1]):
    n_missing = np.sum(np.isnan(x1[:,i]))
    if n_missing > 5:
        print(i,n_missing)
        i_missing.append(i)    
print(i_missing)    

2 6
3 10
4 13
5 7
6 17
7 10
8 13
9 13
10 22
15 20
16 20
17 6
24 138
31 96
34 138
35 138
36 138
37 138
[2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 16, 17, 24, 31, 34, 35, 36, 37]


In [14]:
x2 = np.delete(x1,i_missing,axis=1)
x2.shape

(138, 29)
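The column filter can also be expressed directly in pandas, which keeps the column names attached. This is a sketch on a toy frame with a hypothetical threshold of 1 missing value per column, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): x1 has 3 missing cells, x2 has 1, x0 has none.
toy = pd.DataFrame({"x0": [1.0, 2.0, 3.0, 4.0],
                    "x1": [np.nan, np.nan, np.nan, 4.0],
                    "x2": [1.0, np.nan, 3.0, 4.0]})

max_missing = 1  # hypothetical threshold; the notebook cell above uses 5

# Count missing cells per column and keep only columns at or below the threshold.
n_missing = toy.isna().sum()
keep = n_missing <= max_missing
reduced = toy.loc[:, keep]
print(list(reduced.columns))
```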

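We impute the missing values in each column of `X`: columns of dtype object are filled with their most frequent value, and numerical columns with their mean (the median is a drop-in alternative).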

In [15]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    - Columns of dtype object are imputed with the most frequent value in the column.
    - Columns of other types are imputed with the mean of the column.
    """

    def fit(self, X, y=None):
        # categorical (object) --> most frequent value, numerical --> mean
        # (use X[c].median() instead of X[c].mean() to impute by the median)
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O') else X[c].mean()
             for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [16]:
X = DataFrameImputer().fit_transform(X)
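To see what this kind of imputation does, the sketch below applies the same rule (most frequent value for object columns, mean for the rest) to a toy frame using plain pandas. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one object column and one numerical column.
toy = pd.DataFrame({"sex": pd.Series(["m", "f", None, "m"], dtype=object),
                    "age": [30.0, np.nan, 50.0, 40.0]})

# One fill value per column: most frequent value for object dtype, mean otherwise.
fill = pd.Series(
    {c: toy[c].value_counts().index[0] if toy[c].dtype == object else toy[c].mean()
     for c in toy.columns}
)

# fillna with a Series fills each column with the value stored under its label.
imputed = toy.fillna(fill)
print(imputed)
```

Here the missing `sex` entry becomes `"m"` (the most frequent value) and the missing `age` becomes `40.0` (the mean of 30, 50 and 40).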

In [17]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 138 entries, 91 to 240
Data columns (total 47 columns):
Site                     138 non-null float64
SAincl                   138 non-null float64
SA2                      138 non-null float64
SA4                      138 non-null float64
SA6                      138 non-null float64
SA8                      138 non-null float64
SA10                     138 non-null float64
SA12                     138 non-null float64
SA1416                   138 non-null float64
SA2120                   138 non-null float64
SA2728                   138 non-null float64
studyarm                 138 non-null float64
sexe                     138 non-null float64
age                      138 non-null int64
lesionsince              138 non-null float64
BPSYST                   138 non-null float64
BPDIAST                  138 non-null float64
pulserateinclbeatsmin    138 non-null float64
tempinclCelsius          138 non-null float64
bodyweightinclkg      

The data are now clean. We convert the attributes `X` and the target `y` to NumPy arrays and save them together to a single file.

In [18]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('heartattack_cleaned.dat',Xy,fmt='%s')
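A quick way to check a file written like this is to read it back and split off the last column again. The sketch below does so with small hypothetical arrays and a temporary file named `toy_cleaned.dat`, assuming all values are numeric.

```python
import os
import tempfile

import numpy as np

# Hypothetical cleaned features and target (not the notebook's data).
X = np.array([[28.0, 1.0], [29.0, 0.0], [30.0, 1.0]])
y = np.array([0, 1, 0])

# Append the target as the last column, as in the cell above.
Xy = np.hstack((X, y[:, np.newaxis]))

# Write to a temporary location so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.dat")
np.savetxt(path, Xy, fmt="%s")

# Read the file back and separate features from the target.
loaded = np.loadtxt(path)
X_back, y_back = loaded[:, :-1], loaded[:, -1]
print(loaded.shape)
```

Note that `np.hstack` promotes the integer target to float, so `y_back` comes back as a float array.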