## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing bad rows and bad attributes and by imputing missing values.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('heat.csv', sep=',')

In [4]:
df.head()

Unnamed: 0,Sex_1male_2female,Age,Weather_1sunny_2cloudy_3rainy_4missing,Location_1outdoor_indoor,Functionaldependency_1notdisable_2disable,HT,HeartDisease,Pscyco,DM,CerevD,...,Plt,BUN,Cre,AST,ALT,CK,CRP,Admission1,ICU,Deadtodischarge
0,1.0,24.0,3.0,,1.0,0.0,0.0,0.0,0.0,0.0,...,23.2,17.0,1.69,29.0,48.0,506.0,17.4,1.0,,
1,,43.0,,1.0,,0.0,0.0,0.0,0.0,0.0,...,24.3,23.1,3.08,36.0,35.0,883.0,0.46,1.0,,
2,1.0,58.0,,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,31.2,19.4,1.42,33.0,16.0,815.0,0.16,1.0,,
3,1.0,46.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,33.3,32.9,4.2,33.0,54.0,173.0,1.0,1.0,,
4,1.0,57.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,14.1,21.0,1.33,77.0,36.0,386.0,0.103,1.0,1.0,


There are 35 attributes and 1 class, `Deadtodischarge`.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3177 entries, 0 to 3176
Data columns (total 36 columns):
Sex_1male_2female                            3110 non-null float64
Age                                          3135 non-null float64
Weather_1sunny_2cloudy_3rainy_4missing       2567 non-null float64
Location_1outdoor_indoor                     1831 non-null float64
Functionaldependency_1notdisable_2disable    2797 non-null float64
HT                                           3175 non-null float64
HeartDisease                                 3175 non-null float64
Pscyco                                       3175 non-null float64
DM                                           3175 non-null float64
CerevD                                       3175 non-null float64
ParkinD                                      3175 non-null float64
CKD                                          3175 non-null float64
Dementia                                     3175 non-null float64
PreSBP                

We see that `Location_1outdoor_indoor`, `PreSBP`, `PreRR`, `PreBT`, `PreHR`, `Abdminal`, and `Muscular` contain many missing values, so we remove these attributes.

In [6]:
df = df.drop(['Location_1outdoor_indoor', 'PreSBP', 'PreRR','PreBT','PreHR','Abdminal','Muscular'],axis=1)
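One way to back up the choice of attributes to drop is to rank the columns by their share of missing entries. The following is a minimal sketch on a hypothetical toy frame (`toy`, with made-up values); the 50% cutoff is an assumption for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the raw data (values are made up)
toy = pd.DataFrame({
    "Age": [24.0, 43.0, np.nan, 46.0],
    "PreSBP": [np.nan, np.nan, np.nan, 120.0],
    "HT": [0.0, 0.0, 1.0, 0.0],
})

# Fraction of missing entries per column, largest first
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns whose missing fraction exceeds an assumed cutoff of 50%
sparse_columns = missing_fraction[missing_fraction > 0.5].index.tolist()
toy_reduced = toy.drop(columns=sparse_columns)
```

Applied to `df`, the same `isnull().mean()` call would give the per-attribute missing share that `df.info()` only shows indirectly through the non-null counts.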

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3177 entries, 0 to 3176
Data columns (total 29 columns):
Sex_1male_2female                            3110 non-null float64
Age                                          3135 non-null float64
Weather_1sunny_2cloudy_3rainy_4missing       2567 non-null float64
Functionaldependency_1notdisable_2disable    2797 non-null float64
HT                                           3175 non-null float64
HeartDisease                                 3175 non-null float64
Pscyco                                       3175 non-null float64
DM                                           3175 non-null float64
CerevD                                       3175 non-null float64
ParkinD                                      3175 non-null float64
CKD                                          3175 non-null float64
Dementia                                     3175 non-null float64
PreGCSlessthan15                             2011 non-null float64
GCS                   

To clean the data, we first replace whitespace-only entries with `NaN`.

In [8]:
# replace whitespace-only cells with NaN
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,Sex_1male_2female,Age,Weather_1sunny_2cloudy_3rainy_4missing,Functionaldependency_1notdisable_2disable,HT,HeartDisease,Pscyco,DM,CerevD,ParkinD,...,Plt,BUN,Cre,AST,ALT,CK,CRP,Admission1,ICU,Deadtodischarge
0,1.0,24.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,23.2,17.0,1.690,29.00,48.0,506.0,17.4,1.0,,
1,,43.0,,,0.0,0.0,0.0,0.0,0.0,0.0,...,24.3,23.1,3.080,36.00,35.0,883.0,0.46,1.0,,
2,1.0,58.0,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,31.2,19.4,1.420,33.00,16.0,815.0,0.16,1.0,,
3,1.0,46.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,33.3,32.9,4.200,33.00,54.0,173.0,1,1.0,,
4,1.0,57.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,14.1,21.0,1.330,77.00,36.0,386.0,0.103,1.0,1.0,
5,1.0,29.0,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,29.7,13.9,0.850,16.00,18.0,230.0,0.01,0.0,0.0,0.0
6,1.0,49.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,23.4,13.5,0.760,100.00,56.0,2156.0,0.05,0.0,0.0,0.0
7,1.0,59.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,20.7,30.6,1.320,40.00,58.0,136.0,0.01,0.0,0.0,0.0
8,1.0,75.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,22.2,23.8,1.290,63.00,13.0,171.0,0.03,0.0,0.0,0.0
9,1.0,93.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,9.0,20.9,1.060,21.00,27.0,68.0,0.02,1.0,,1.0


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3177 entries, 0 to 3176
Data columns (total 29 columns):
Sex_1male_2female                            3110 non-null float64
Age                                          3135 non-null float64
Weather_1sunny_2cloudy_3rainy_4missing       2567 non-null float64
Functionaldependency_1notdisable_2disable    2797 non-null float64
HT                                           3175 non-null float64
HeartDisease                                 3175 non-null float64
Pscyco                                       3175 non-null float64
DM                                           3175 non-null float64
CerevD                                       3175 non-null float64
ParkinD                                      3175 non-null float64
CKD                                          3175 non-null float64
Dementia                                     3175 non-null float64
PreGCSlessthan15                             2011 non-null float64
GCS                   

The data still contain stray characters: tabs (`\t`), spaces, and question marks (`?`). We delete the tabs and spaces and convert `?` to `np.nan`.

In [10]:
df = df.replace(r'\t', '', regex=True)
df = df.replace(' ', '', regex=True)
# use the np.nan value itself, not the string 'np.nan'
df = df.replace(r'\?', np.nan, regex=True)
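To see what these replacements do to individual cells, here is a small sketch on a hypothetical column (`toy`, with invented strings). It assumes that a cell containing `?` should become missing as a whole, which is how pandas treats a regex match when the replacement value is not a string.

```python
import numpy as np
import pandas as pd

# Hypothetical column with the kinds of stray characters described above
toy = pd.DataFrame({"CRP": ["0.46", "\t1.2", " 0.16", "?", "   "]})

# Whitespace-only cells become NaN
toy = toy.replace(r"^\s+$", np.nan, regex=True)
# Tabs and spaces inside the remaining strings are removed
toy = toy.replace(r"[\t ]", "", regex=True)
# Any cell containing "?" becomes NaN
toy = toy.replace(r"\?", np.nan, regex=True)

print(toy)
```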

The `CRP` column currently has dtype `object`, so we convert it to a numeric column.

In [11]:
# dtype before conversion
print(df['CRP'].dtype)

# entries that cannot be parsed as numbers become NaN
df['CRP'] = pd.to_numeric(df['CRP'], errors='coerce')

object
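Because `errors='coerce'` silently turns unparseable entries into `NaN`, it can be worth counting how many values are lost in the conversion. A minimal sketch with a hypothetical series `raw` (the `"n/a"` entry is invented for illustration):

```python
import pandas as pd

# Hypothetical object-typed column with one entry that is not a number
raw = pd.Series(["0.46", "1", "0.103", "n/a"], dtype=object)

# Unparseable entries are coerced to NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Number of entries that became missing because of the conversion
n_lost = converted.isnull().sum() - raw.isnull().sum()
print(converted.dtype, n_lost)
```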


We identify bad rows, meaning rows with too many missing values, and remove them.

In [12]:
def remove_bad_rows(df, bad_row_threshold):
    # count the missing values in each row
    n_null = df.isnull().sum(axis=1)
    # index labels of rows with more missing values than the threshold
    bad_rows = df.index[n_null > bad_row_threshold]
    print('number of bad rows:', len(bad_rows))

    # delete bad rows (by label, so this also works on a non-default index)
    return df.drop(bad_rows)

In [13]:
df = remove_bad_rows(df,5)

number of bad rows: 376
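The same row filter can also be expressed with pandas' built-in `dropna(thresh=...)`, which keeps rows that have at least `thresh` non-missing values. This is an alternative sketch on a hypothetical toy frame, not the function used above; `max_missing` plays the role of `bad_row_threshold`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 0, 1 and 2 have 0, 1 and 3 missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, np.nan],
})

max_missing = 1  # rows with more missing values than this are dropped

# Keep rows with at least (number of columns - max_missing) non-missing values
kept = toy.dropna(thresh=toy.shape[1] - max_missing)
print(kept)
```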


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2801 entries, 0 to 3174
Data columns (total 29 columns):
Sex_1male_2female                            2775 non-null float64
Age                                          2796 non-null float64
Weather_1sunny_2cloudy_3rainy_4missing       2327 non-null float64
Functionaldependency_1notdisable_2disable    2513 non-null float64
HT                                           2801 non-null float64
HeartDisease                                 2801 non-null float64
Pscyco                                       2801 non-null float64
DM                                           2801 non-null float64
CerevD                                       2801 non-null float64
ParkinD                                      2801 non-null float64
CKD                                          2801 non-null float64
Dementia                                     2801 non-null float64
PreGCSlessthan15                             1902 non-null float64
GCS                   

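We drop specific rows in which the y variable, `Deadtodischarge`, is missing. Note that `drop` returns a new DataFrame, so the call below leaves `df` itself unchanged unless the result is assigned back.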

In [None]:
df.drop(df.index[[2,3,30]])
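Rather than listing row positions by hand, rows with a missing target can be selected by condition. A minimal sketch on a hypothetical toy frame, assuming the goal is to keep only rows where `Deadtodischarge` is present:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a partly missing target column
toy = pd.DataFrame({
    "Age": [24.0, 43.0, 29.0],
    "Deadtodischarge": [np.nan, np.nan, 0.0],
})

# Keep only the rows whose target value is present
labelled = toy.dropna(subset=["Deadtodischarge"])
print(len(toy), "->", len(labelled))
```

As with `drop`, the result has to be assigned (here to `labelled`) for the change to take effect.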

We then identify the variables with excessive missing values and drop them from the dataset.

In [15]:
def remove_bad_columns(df, bad_column_threshold):
    # count the missing values in each column
    n_null = df.isnull().sum(axis=0)
    # names of columns with more missing values than the threshold
    bad_cols = df.columns[n_null > bad_column_threshold]
    print('number of bad columns:', len(bad_cols))

    # delete bad columns
    return df.drop(bad_cols, axis=1)

In [16]:
df = remove_bad_columns(df,400)

number of bad columns: 3
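The function reports only how many columns are removed. To see which ones exceed the threshold, the per-column missing counts can be filtered by name. A small sketch on a hypothetical toy frame (column names and `threshold` are chosen for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "Weather" has three missing values
toy = pd.DataFrame({
    "Age": [24.0, 43.0, 58.0, 46.0],
    "Weather": [3.0, np.nan, np.nan, np.nan],
    "GCS": [15.0, 14.0, np.nan, 15.0],
})

threshold = 2  # columns with more missing values than this are flagged

# Missing count per column, then the names above the threshold
n_missing = toy.isnull().sum()
dropped = n_missing[n_missing > threshold].index.tolist()
print("columns over the threshold:", dropped)
```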


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2801 entries, 0 to 3174
Data columns (total 26 columns):
Sex_1male_2female                            2775 non-null float64
Age                                          2796 non-null float64
Functionaldependency_1notdisable_2disable    2513 non-null float64
HT                                           2801 non-null float64
HeartDisease                                 2801 non-null float64
Pscyco                                       2801 non-null float64
DM                                           2801 non-null float64
CerevD                                       2801 non-null float64
ParkinD                                      2801 non-null float64
CKD                                          2801 non-null float64
Dementia                                     2801 non-null float64
GCS                                          2734 non-null float64
SBP                                          2691 non-null float64
BT                    

For convenience, we separate the independent variables `X` and the dependent variable `y` from the data.

In [18]:
X = df.drop('Deadtodischarge',axis=1)
y = df['Deadtodischarge']

We impute the missing values in each column of `X`: numeric columns are filled with the column mean, and columns of dtype `object` with the most frequent value.

In [19]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.
    - Columns of dtype object are imputed with the most frequent value in column.
    - Columns of other types are imputed with mean of column.
    """
    def fit(self, X, y=None):
        # object columns --> most frequent value, other columns --> mean
        # (to impute with the median instead, use X[c].median() in place of X[c].mean())
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        # fill each column's missing entries with the value learned in fit
        return X.fillna(self.fill)

In [20]:
X = DataFrameImputer().fit_transform(X)
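To make the fill rule concrete, the sketch below applies the same idea (mean for numeric columns, most frequent value otherwise) to a hypothetical two-column frame using pandas alone. It uses `pd.api.types.is_numeric_dtype` for the type check, which is an assumption of this example rather than the check used in the class above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one object column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Sex": pd.Series(["m", "m", None], dtype=object),
})

# Per-column fill value: mean for numeric columns, most frequent value otherwise
fill = {
    c: toy[c].mean() if pd.api.types.is_numeric_dtype(toy[c])
    else toy[c].value_counts().index[0]
    for c in toy.columns
}

# fillna with a dict fills each column with its own value
imputed = toy.fillna(fill)
print(imputed)
```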

In [21]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2801 entries, 0 to 3174
Data columns (total 25 columns):
Sex_1male_2female                            2801 non-null float64
Age                                          2801 non-null float64
Functionaldependency_1notdisable_2disable    2801 non-null float64
HT                                           2801 non-null float64
HeartDisease                                 2801 non-null float64
Pscyco                                       2801 non-null float64
DM                                           2801 non-null float64
CerevD                                       2801 non-null float64
ParkinD                                      2801 non-null float64
CKD                                          2801 non-null float64
Dementia                                     2801 non-null float64
GCS                                          2801 non-null float64
SBP                                          2801 non-null float64
BT                    

After imputation, the columns of `X` shown above have no missing values. We convert the attributes `X` and the target `y` to NumPy arrays, stack them column-wise, and save them to a file.

In [22]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('heat_cleaned.dat',Xy,fmt='%s')
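A file written this way can be read back with `np.loadtxt`, with the last column holding the target. The round trip below is a sketch on hypothetical arrays (`X_demo`, `y_demo`) written to a temporary file, so it does not touch `heat_cleaned.dat`.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-ins for the cleaned attributes and target
X_demo = np.array([[1.0, 24.0], [2.0, 43.0]])
y_demo = np.array([0.0, np.nan])

# Stack the target as the last column, as in the cell above
Xy_demo = np.hstack((X_demo, y_demo[:, np.newaxis]))

# Write to a temporary file with the same format string
path = os.path.join(tempfile.mkdtemp(), "demo_cleaned.dat")
np.savetxt(path, Xy_demo, fmt="%s")

# Read the file back and split it into attributes and target
loaded = np.loadtxt(path)
X_loaded, y_loaded = loaded[:, :-1], loaded[:, -1]
```

Missing target values are written as `nan` and come back as `NaN`, so they still need to be handled before the target is used for modeling.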