## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing bad rows and bad attributes, imputing missing values, and so on.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('smoking.csv',sep= ',')

In [4]:
df.head()

Unnamed: 0,age,gender,A1,A2_completed_education,A3,A4,A5,A6,A7,A8,...,Low,Middle,High,Not_applicable,Smoke_daily,Smoke_less_then_daily,Smoked_in_the_past,Never_smoker,a62_income,a4_smoking_status
0,35,2,1,2,2,1,,2.0,3.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
1,34,1,4,3,3,1,,2.0,3.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
2,37,2,3,3,3,1,,4.0,2.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
3,37,2,3,3,3,1,,4.0,2.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
4,35,2,2,3,3,3,,,,,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,3.0


There are 118 attributes and 1 class: health status, where 1 and 0 represent healthy and not healthy, respectively.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1588 entries, 0 to 1587
Columns: 119 entries, age to a4_smoking_status
dtypes: float64(49), int64(70)
memory usage: 1.4 MB


To clean the data, we first replace empty values, as well as the value `99`, by `nan`.

In [6]:
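# replace whitespace-only cells and the value 99 (string or number) by nan
df = df.replace(r'^\s+$', np.nan, regex=True)
# anchor the pattern so that only the exact string '99' matches (not e.g. '199')
df = df.replace(r'^99$', np.nan, regex=True)
# numeric 99 is matched by value, so no regex is needed
df = df.replace(99, np.nan)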
df

Unnamed: 0,age,gender,A1,A2_completed_education,A3,A4,A5,A6,A7,A8,...,Low,Middle,High,Not_applicable,Smoke_daily,Smoke_less_then_daily,Smoked_in_the_past,Never_smoker,a62_income,a4_smoking_status
0,35,2,1,2,2,1,,2.0,3.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
1,34,1,4,3,3,1,,2.0,3.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
2,37,2,3,3,3,1,,4.0,2.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
3,37,2,3,3,3,1,,4.0,2.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
4,35,2,2,3,3,3,,,,,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,3.0
5,29,2,5,3,2,1,,4.0,2.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
6,48,1,2,3,3,1,,2.0,3.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,1.0
7,41,2,5,3,3,1,,3.0,2.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
8,15,1,5,1,1,1,,1.0,3.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,1.0
9,37,2,1,3,1,1,,3.0,2.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1588 entries, 0 to 1587
Columns: 119 entries, age to a4_smoking_status
dtypes: float64(84), int64(35)
memory usage: 1.4 MB
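
As a small illustration of how regex-based replacement treats such codes, the sketch below uses a hypothetical toy frame (not the smoking data) in which `199` stands for a legitimate value. It suggests that an unanchored pattern `'99'` also wipes out `'199'`, whereas an anchored pattern `'^99$'` only matches the exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: 99 is the code to remove, 199 is a legitimate value.
toy = pd.DataFrame({
    "a": pd.Series(["99", "199", " ", "3"], dtype=object),
    "b": [99, 1, 2, 199],
})

# Unanchored pattern: every string that contains "99" becomes NaN, including "199".
loose = toy.replace("99", np.nan, regex=True)

# Anchored patterns: only whitespace-only strings and the exact string "99" match.
strict = toy.replace(r"^\s+$", np.nan, regex=True)
strict = strict.replace(r"^99$", np.nan, regex=True)
# Numeric 99 is matched by value, without regex.
strict = strict.replace(99, np.nan)

print(loose["a"].isna().tolist())
print(strict["a"].isna().tolist())
print(strict["b"].isna().tolist())
```

Whether the anchored or the unanchored behavior is wanted depends on how the codes appear in the raw file, so it is worth checking a few cells before and after the replacement.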


The data may still contain stray characters, such as `\t`, ` `, and `?`. We delete `\t` and ` ` and convert `?` to `np.nan`.

In [8]:
df = df.replace(r'\t','',regex=True)
df = df.replace(' ','',regex=True)
# use the np.nan object, not the string 'np.nan', so that isnull() detects it
df = df.replace(r'\?',np.nan,regex=True)
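
The following sketch shows this kind of character cleanup on a hypothetical one-column frame (the values are made up for illustration). It also shows why the replacement value should be the `np.nan` object: the literal string `'np.nan'` is not counted as missing by `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column containing a tab, a space, and a "?" placeholder.
raw = pd.DataFrame({"A9": pd.Series(["1\t", " 2", "?", "3"], dtype=object)})

# Remove tabs and spaces inside the cells.
cleaned = raw.replace(r"[\t ]", "", regex=True)
# Turn "?" into a real missing value.
cleaned = cleaned.replace(r"\?", np.nan, regex=True)
# Convert the remaining numeric strings to float64.
cleaned["A9"] = pd.to_numeric(cleaned["A9"])
n_missing = int(cleaned["A9"].isna().sum())

# Pitfall: the string "np.nan" is ordinary text, not a missing value.
wrong = raw.replace(r"\?", "np.nan", regex=True)
n_missing_wrong = int(wrong["A9"].isna().sum())

print(n_missing, n_missing_wrong)
```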

We remove columns with too many missing values (here, more than 200).

In [9]:
def remove_bad_columns(df,bad_column_threshold):
    # find bad columns having too many missing values
    n_null = np.array(df.isnull().sum(axis=0))
    bad_col = np.array([]).astype(int)
    for t in range(len(n_null)):
        if n_null[t] > bad_column_threshold:
            bad_col = np.append(bad_col,t)

    #print(bad_col)
    print('number of bad columns:',len(bad_col))

    # delete bad columns
    df = df.drop(df.columns[bad_col],axis=1)
    #df.info()
    return df  

In [10]:
df = remove_bad_columns(df,200)

number of bad columns: 23
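
The same column filter can be sketched without an explicit loop by using a boolean mask. The helper name `drop_sparse_columns` and the toy frame are assumptions made for this illustration. They are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing):
    """Keep only columns with at most max_missing NaN values (hypothetical helper)."""
    n_missing = df.isnull().sum(axis=0)       # NaN count per column
    return df.loc[:, n_missing <= max_missing]

# Toy frame: "x" has 2 missing values, "y" has 1, "z" has none.
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": [1.0, 2.0, np.nan],
    "z": [1, 2, 3],
})
kept = drop_sparse_columns(toy, 1)
print(list(kept.columns))
```

With a threshold of 1, only the column with two missing values is dropped.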


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1588 entries, 0 to 1587
Data columns (total 96 columns):
age                             1588 non-null int64
gender                          1588 non-null int64
A1                              1588 non-null int64
A2_completed_education          1588 non-null int64
A3                              1588 non-null int64
A4                              1588 non-null int64
A9                              1571 non-null float64
A16                             1587 non-null float64
A17                             1587 non-null float64
A18                             1588 non-null int64
A19                             1588 non-null int64
A20                             1588 non-null int64
A21                             1556 non-null float64
A22                             1585 non-null float64
A23                             1588 non-null int64
A24                             1588 non-null int64
A25                             1588 non-null int64

We then find the rows that contain too many missing values (here, more than 10) and remove them.

In [12]:
def remove_bad_rows(df,bad_row_threshold):   
    # find bad rows having too many missing values
    n_null = np.array(df.isnull().sum(axis=1))
    bad_row = np.array([]).astype(int)
    for t in range(len(n_null)):
        if n_null[t] > bad_row_threshold:
            bad_row = np.append(bad_row,t)

    #print(bad_row)
    print('number of bad rows:',len(bad_row))

    # delete bad rows (bad_row holds positions, so map them to index labels)
    df = df.drop(df.index[bad_row])
    #df.info()
    return df

In [13]:
df = remove_bad_rows(df,10)

number of bad rows: 14
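
A loop-free variant of the row filter can be sketched with a boolean mask over the per-row count of missing values. The helper name `drop_sparse_rows` and the toy frame are hypothetical. The non-default index is chosen to show that a mask selects rows regardless of their labels.

```python
import numpy as np
import pandas as pd

def drop_sparse_rows(df, max_missing):
    """Keep only rows with at most max_missing NaN values (hypothetical helper)."""
    n_missing = df.isnull().sum(axis=1)       # NaN count per row
    return df.loc[n_missing <= max_missing]

# Toy frame with a non-default index: the last row has 2 missing values.
toy = pd.DataFrame(
    {"x": [1.0, np.nan, np.nan], "y": [1.0, 2.0, np.nan]},
    index=[10, 20, 30],
)
kept = drop_sparse_rows(toy, 1)
print(list(kept.index))
```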


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1574 entries, 0 to 1573
Data columns (total 96 columns):
age                             1574 non-null int64
gender                          1574 non-null int64
A1                              1574 non-null int64
A2_completed_education          1574 non-null int64
A3                              1574 non-null int64
A4                              1574 non-null int64
A9                              1557 non-null float64
A16                             1573 non-null float64
A17                             1573 non-null float64
A18                             1574 non-null int64
A19                             1574 non-null int64
A20                             1574 non-null int64
A21                             1542 non-null float64
A22                             1571 non-null float64
A23                             1574 non-null int64
A24                             1574 non-null int64
A25                             1574 non-null int64

For convenience, we separate the independent variables `X` and the dependent variable `y` from the data.

In [15]:
X = df.drop('healthstatus3',axis=1)
y = df['healthstatus3']

We impute the missing values in each column of `X` with the column mean (or with the most frequent value for columns of dtype object).

In [16]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        - Columns of dtype object are imputed with the most frequent value in column.
        - Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        # categorical (dtype object) --> most frequent value, numerical --> mean
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X], index=X.columns)

        # to impute numerical columns with the median instead, use:
        #if X[c].dtype == np.dtype('O') else X[c].median() for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [17]:
X = DataFrameImputer().fit_transform(X)

In [18]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1574 entries, 0 to 1573
Data columns (total 95 columns):
age                             1574 non-null int64
gender                          1574 non-null int64
A1                              1574 non-null int64
A2_completed_education          1574 non-null int64
A3                              1574 non-null int64
A4                              1574 non-null int64
A9                              1574 non-null float64
A16                             1574 non-null float64
A17                             1574 non-null float64
A18                             1574 non-null int64
A19                             1574 non-null int64
A20                             1574 non-null int64
A21                             1574 non-null float64
A22                             1574 non-null float64
A23                             1574 non-null int64
A24                             1574 non-null int64
A25                             1574 non-null int64
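
To make the fill rule concrete, here is a minimal pandas-only sketch on a hypothetical two-column frame. It imputes a numeric column with its mean and an object column with its most frequent value, and also computes the median for comparison. It illustrates the idea only and is not a replacement for the class above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 9.0],
    "cat": pd.Series(["a", "b", "a", None], dtype=object),
})

# One fill value per column: mean for numeric, most frequent for object dtype.
fill = {
    "num": toy["num"].mean(),                    # (1 + 2 + 9) / 3 = 4.0
    "cat": toy["cat"].value_counts().index[0],   # "a" occurs most often
}
filled = toy.fillna(fill)

# The median is less sensitive to the outlier 9.0 than the mean.
median_value = toy["num"].median()
print(filled)
print(median_value)
```

On skewed columns the mean and the median can differ noticeably, as they do here, so the choice between them may affect the imputed values.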

The attributes now contain no missing values. We convert the attributes `X` and the target `y` to NumPy arrays and save them to a single file, with `y` as the last column.

In [19]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('smoking_cleaned.dat',Xy,fmt='%s')
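
As a sketch of how such a file could be read back later, the example below writes a small hypothetical array with the same layout (attributes first, target in the last column) to a temporary file and splits it again after loading. The file name `toy_cleaned.dat` and the values are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy arrays with the same layout: attributes X, then target y.
X = np.array([[35.0, 2.0], [34.0, 1.0]])
y = np.array([1, 0])
Xy = np.hstack((X, y[:, np.newaxis]))

# Write to a temporary location so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.dat")
np.savetxt(path, Xy, fmt="%s")

# Load the file and split the last column off as the target.
loaded = np.loadtxt(path)
X_back, y_back = loaded[:, :-1], loaded[:, -1]
print(X_back.shape, y_back.shape)
```

Note that `np.loadtxt` returns floats by default, so the target comes back as `1.0` and `0.0` and may need to be cast to an integer type before use.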