## Data cleaning

In this section, we first see the data structure and identify data attributes. We then clean the data, including removing bad rows, bad attributes, imputing missing values, etc.

First of all, we import the necessary packages to the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('hcc.csv',sep= ',')

In [4]:
df.head()

Unnamed: 0,1.Gen,2.Sym,3.Alc,4.HepB,5.HepB,6.HepB,7.HepC,8.Cir,9.End,10.Smo,...,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254,Unnamed: 255
0,1,0,1,0,0,0,0,1,0,1,...,,,,,,,,,,
1,0,?,0,0,0,0,1,1,?,?,...,,,,,,,,,,
2,1,0,1,1,0,1,0,1,0,1,...,,,,,,,,,,
3,1,1,1,0,0,0,0,1,0,1,...,,,,,,,,,,
4,1,1,1,1,0,1,0,1,0,1,...,,,,,,,,,,


The dependent variable is the resultant classification after the study: `class`.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Columns: 256 entries, 1.Gen to Unnamed: 255
dtypes: float64(206), int64(6), object(44)
memory usage: 330.1+ KB


To clearn the data, we first replace the empty value by `nan`.

In [6]:
# replace empty by nan
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,1.Gen,2.Sym,3.Alc,4.HepB,5.HepB,6.HepB,7.HepC,8.Cir,9.End,10.Smo,...,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254,Unnamed: 255
0,1,0,1,0,0,0,0,1,0,1,...,,,,,,,,,,
1,0,?,0,0,0,0,1,1,?,?,...,,,,,,,,,,
2,1,0,1,1,0,1,0,1,0,1,...,,,,,,,,,,
3,1,1,1,0,0,0,0,1,0,1,...,,,,,,,,,,
4,1,1,1,1,0,1,0,1,0,1,...,,,,,,,,,,
5,1,0,1,0,?,0,0,1,0,?,...,,,,,,,,,,
6,1,0,0,0,?,1,1,1,0,0,...,,,,,,,,,,
7,1,1,1,0,?,0,0,1,0,1,...,,,,,,,,,,
8,1,1,1,0,0,0,0,1,0,1,...,,,,,,,,,,
9,1,1,1,0,0,0,0,1,0,0,...,,,,,,,,,,


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Columns: 256 entries, 1.Gen to Unnamed: 255
dtypes: float64(206), int64(6), object(44)
memory usage: 330.1+ KB


The data still contain errors, such as `\t`, ` `, `\?`. We will delete `\t` and ` ` and convert `\?` to `np.nan`.

In [8]:
df = df.replace('\t','',regex=True)
df = df.replace(' ','',regex=True)
df = df.replace('\?','np.nan',regex=True)

We find bad columns which contain too many missing values, then remove them.

In [9]:
def remove_bad_columns(df,bad_column_threshold):
    # find bad columns having too many missing values
    n_null = np.array(df.isnull().sum(axis=0))
    bad_col = np.array([]).astype(int)
    for t in range(len(n_null)):
        if n_null[t] > bad_column_threshold:
            bad_col = np.append(bad_col,t)

    #print(bad_col)
    print('number of bad columns:',len(bad_col))

    # delete bad columns
    df = df.drop(df.columns[bad_col],axis=1)
    #df.info()
    return df  

In [10]:
df = remove_bad_columns(df,20)

number of bad columns: 206


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 50 columns):
1.Gen           165 non-null int64
2.Sym           165 non-null object
3.Alc           165 non-null int64
4.HepB          165 non-null object
5.HepB          165 non-null object
6.HepB          165 non-null object
7.HepC          165 non-null object
8.Cir           165 non-null int64
9.End           165 non-null object
10.Smo          165 non-null object
11.Dia          165 non-null object
12.Obe          165 non-null object
13.Hem          165 non-null object
14.Art          165 non-null object
15.CRen         165 non-null object
16.HIV          165 non-null object
17.Non          165 non-null object
18.EVar         165 non-null object
19.Spl          165 non-null object
20.PHyp         165 non-null object
21.Thr          165 non-null object
22.LMet         165 non-null object
23.Rad          165 non-null object
24.Agedia       165 non-null int64
25.Alcpd        165 non-null object

We find bad rows which contain too many missing values, then remove them.

In [12]:
def remove_bad_rows(df,bad_row_threshold):   
    # find bad rows having too many missing values
    n_null = np.array(df.isnull().sum(axis=1))
    bad_row = np.array([])
    for t in range(len(n_null)):
        if n_null[t] > bad_row_threshold:
            bad_row = np.append(bad_row,t)

    #print(bad_row)
    print('number of bad rows:',len(bad_row))

    # delete bad rows
    df = df.drop(bad_row)
    #df.info()
    return df

In [13]:
df = remove_bad_rows(df,10)

number of bad rows: 0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 0 to 164
Data columns (total 50 columns):
1.Gen           165 non-null int64
2.Sym           165 non-null object
3.Alc           165 non-null int64
4.HepB          165 non-null object
5.HepB          165 non-null object
6.HepB          165 non-null object
7.HepC          165 non-null object
8.Cir           165 non-null int64
9.End           165 non-null object
10.Smo          165 non-null object
11.Dia          165 non-null object
12.Obe          165 non-null object
13.Hem          165 non-null object
14.Art          165 non-null object
15.CRen         165 non-null object
16.HIV          165 non-null object
17.Non          165 non-null object
18.EVar         165 non-null object
19.Spl          165 non-null object
20.PHyp         165 non-null object
21.Thr          165 non-null object
22.LMet         165 non-null object
23.Rad          165 non-null object
24.Agedia       165 non-null int64
25.Alcpd        165 non-null object

For convenience, we separate independents `X` and dependent `y` from the data.

In [15]:
X = df.drop('Class',axis=1)
y = df['Class']

We impute the missing value of X at each column by its median value.

In [16]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        - Columns of dtype object are imputed with the most frequent value in column.
        - Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            # numerical --> mean, categorical --> median
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X], index=X.columns)  
                               
            # numerical, categorical --> median                   
            #if X[c].dtype == np.dtype('O') else X[c].median() for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [17]:
X = DataFrameImputer().fit_transform(X)

In [18]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 0 to 164
Data columns (total 49 columns):
1.Gen           165 non-null int64
2.Sym           165 non-null object
3.Alc           165 non-null int64
4.HepB          165 non-null object
5.HepB          165 non-null object
6.HepB          165 non-null object
7.HepC          165 non-null object
8.Cir           165 non-null int64
9.End           165 non-null object
10.Smo          165 non-null object
11.Dia          165 non-null object
12.Obe          165 non-null object
13.Hem          165 non-null object
14.Art          165 non-null object
15.CRen         165 non-null object
16.HIV          165 non-null object
17.Non          165 non-null object
18.EVar         165 non-null object
19.Spl          165 non-null object
20.PHyp         165 non-null object
21.Thr          165 non-null object
22.LMet         165 non-null object
23.Rad          165 non-null object
24.Agedia       165 non-null int64
25.Alcpd        165 non-null object

Now, the data are completely clean. We convert attributes `X` and target `y` to numpy arrays and save them to files.

In [19]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('hepato_cleaned.dat',Xy,fmt='%s')