## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing rows and columns with too many missing values and imputing the remaining missing values.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('ef.csv', sep=',')

In [4]:
df.head()

Unnamed: 0,sex,m32,m33_1,PE,m33_2,m34,m35,m35_1_2,m35_2_2,m35_1_3,...,IPE5,IPE6,IPE7,IPE8,IPE9,IPE10,Age_Categ,Age_recode_2,IIEF5_Categ,IIEF_Y_N
0,1,1,1.0,1.0,1.0,1,2,1.0,11.0,999.0,...,,5.0,5.0,,0.0,0.0,,,,
1,1,999,,,,5,1,,,,...,,2.0,1.0,,0.0,0.0,1.0,1.0,,
2,1,2,1.0,1.0,1.0,1,1,,,,...,,5.0,5.0,,0.0,0.0,,,,
3,1,5,2.0,2.0,3.0,2,1,,,,...,,1.0,1.0,,0.0,0.0,3.0,3.0,,
4,1,60,2.0,2.0,45.0,5,1,,,,...,4.0,3.0,5.0,5.0,4.0,3.0,1.0,1.0,2.0,1.0


The dependent variable is erectile functioning, `IIEF_Y_N`, which is coded as 1 or 2.
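
To make the coding of the target explicit, a quick tally of its values can help. The following is a minimal sketch on a toy frame (the values are made up for illustration and are not taken from `ef.csv`); on the real data the same call would be applied to `df['IIEF_Y_N']`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the target column; values are illustrative only.
toy = pd.DataFrame({'IIEF_Y_N': [1.0, 2.0, np.nan, 1.0, np.nan]})

# Count each code; dropna=False also reports how many entries are missing.
counts = toy['IIEF_Y_N'].value_counts(dropna=False)
n_missing = toy['IIEF_Y_N'].isna().sum()
print(counts)
print('missing:', n_missing)
```

Counting missing entries alongside the codes shows early on how many rows have no usable label.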

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 55 columns):
sex                    1004 non-null int64
m32                    1004 non-null int64
m33_1                  920 non-null float64
PE                     816 non-null float64
m33_2                  816 non-null float64
m34                    1004 non-null int64
m35                    1004 non-null int64
m35_1_2                126 non-null float64
m35_2_2                43 non-null float64
m35_1_3                126 non-null float64
m35_2_3                46 non-null float64
m36_3                  133 non-null float64
m36_4                  133 non-null float64
m42                    1004 non-null int64
m43                    1004 non-null int64
m44                    1004 non-null int64
m45                    1004 non-null int64
m46                    1004 non-null int64
m47                    1004 non-null int64
m48                    1004 non-null int64
m49                    100

To clean the data, we first replace whitespace-only entries with `NaN`.

In [6]:
# replace whitespace-only cells with NaN
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,sex,m32,m33_1,PE,m33_2,m34,m35,m35_1_2,m35_2_2,m35_1_3,...,IPE5,IPE6,IPE7,IPE8,IPE9,IPE10,Age_Categ,Age_recode_2,IIEF5_Categ,IIEF_Y_N
0,1,1,1.0,1.0,1.0,1,2,1.0,11.0,999.0,...,,5.0,5.0,,0.0,0.0,,,,
1,1,999,,,,5,1,,,,...,,2.0,1.0,,0.0,0.0,1.0,1.0,,
2,1,2,1.0,1.0,1.0,1,1,,,,...,,5.0,5.0,,0.0,0.0,,,,
3,1,5,2.0,2.0,3.0,2,1,,,,...,,1.0,1.0,,0.0,0.0,3.0,3.0,,
4,1,60,2.0,2.0,45.0,5,1,,,,...,4.0,3.0,5.0,5.0,4.0,3.0,1.0,1.0,2.0,1.0
5,1,60,2.0,2.0,10.0,5,1,,,,...,,2.0,4.0,,0.0,0.0,2.0,2.0,2.0,1.0
6,1,5,2.0,2.0,2.0,3,1,,,,...,4.0,4.0,4.0,4.0,3.0,5.0,,,2.0,1.0
7,1,10,2.0,2.0,5.0,4,1,,,,...,3.0,2.0,2.0,4.0,4.0,5.0,6.0,6.0,2.0,1.0
8,1,999,999.0,,,999,1,,,,...,,,,,0.0,0.0,1.0,1.0,,
9,1,30,2.0,2.0,999.0,5,1,,,,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,3.0,3.0,1.0
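
As a small illustration of what the pattern `^\s+$` does and does not match, here is a sketch on a toy frame (the column names `a` and `b` and their values are hypothetical). Only cells made up entirely of whitespace are replaced; a cell with text around a space is left alone.

```python
import numpy as np
import pandas as pd

# Toy frame with whitespace-only cells; values are illustrative only.
toy = pd.DataFrame({'a': ['1', ' ', '3'], 'b': ['\t', 'x y', '  ']})

# ^\s+$ matches cells that consist of one or more whitespace characters
# and nothing else, so 'x y' is untouched.
cleaned = toy.replace(r'^\s+$', np.nan, regex=True)
print(cleaned.isnull().sum())
```

Note that `\s+` requires at least one character, so a truly empty string `''` would not match; `^\s*$` would cover that case as well.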


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 55 columns):
sex                    1004 non-null int64
m32                    1004 non-null int64
m33_1                  920 non-null float64
PE                     816 non-null float64
m33_2                  816 non-null float64
m34                    1004 non-null int64
m35                    1004 non-null int64
m35_1_2                126 non-null float64
m35_2_2                43 non-null float64
m35_1_3                126 non-null float64
m35_2_3                46 non-null float64
m36_3                  133 non-null float64
m36_4                  133 non-null float64
m42                    1004 non-null int64
m43                    1004 non-null int64
m44                    1004 non-null int64
m45                    1004 non-null int64
m46                    1004 non-null int64
m47                    1004 non-null int64
m48                    1004 non-null int64
m49                    100

The data still contain stray characters such as tabs (`\t`), spaces (` `), and question marks (`?`). We delete the tabs and spaces and convert question-mark entries to `np.nan`.

In [8]:
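df = df.replace('\t', '', regex=True)       # strip tabs inside string cells
df = df.replace(' ', '', regex=True)        # strip spaces inside string cells
df = df.replace(r'\?', np.nan, regex=True)  # np.nan itself, not the string 'np.nan'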
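
One detail worth checking in this step is that the replacement value is the float `np.nan` and not the string `'np.nan'`, because only the former is counted by `isnull()`. A minimal sketch on a toy column (values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': ['1', '?', '3']})

# Replacing with a string writes the literal text 'np.nan' into the cell.
as_string = toy.replace(r'\?', 'np.nan', regex=True)
# Replacing with np.nan writes a real missing value.
as_missing = toy.replace(r'\?', np.nan, regex=True)

print(as_string['a'].isnull().sum())
print(as_missing['a'].isnull().sum())
```

With the string version, later missing-value counts and imputation would silently skip these cells.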

Now we drop columns with too many missing values (more than 400 of the 1004 rows).

In [9]:
def remove_bad_columns(df, bad_column_threshold):
    # count missing values per column
    n_null = df.isnull().sum(axis=0)
    # columns with more missing values than the threshold are "bad"
    bad_col = n_null[n_null > bad_column_threshold].index
    print('number of bad columns:', len(bad_col))

    # delete bad columns
    return df.drop(columns=bad_col)

In [10]:
df = remove_bad_columns(df,400)

number of bad columns: 7
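
The same column filter can also be expressed with pandas' built-in `dropna`. The sketch below uses a toy frame and a hypothetical threshold `max_missing`; `thresh` is the minimum number of non-missing values a column needs in order to be kept, so "more than `max_missing` missing" translates to `thresh=len(toy) - max_missing`.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    'full':   [1.0, 2.0, 3.0, 4.0],
    'sparse': [np.nan, np.nan, np.nan, 4.0],
    'ok':     [1.0, np.nan, 3.0, 4.0],
})
max_missing = 2  # hypothetical threshold

# Keep columns with at least len(toy) - max_missing non-missing values,
# i.e. drop columns with more than max_missing missing values.
kept = toy.dropna(axis=1, thresh=len(toy) - max_missing)
print(list(kept.columns))
```

Here `sparse` has three missing values and is dropped, while `ok` has one and is kept.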


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 48 columns):
sex                    1004 non-null int64
m32                    1004 non-null int64
m33_1                  920 non-null float64
PE                     816 non-null float64
m33_2                  816 non-null float64
m34                    1004 non-null int64
m35                    1004 non-null int64
m42                    1004 non-null int64
m43                    1004 non-null int64
m44                    1004 non-null int64
m45                    1004 non-null int64
m46                    1004 non-null int64
m47                    1004 non-null int64
m48                    1004 non-null int64
m49                    1004 non-null int64
m50                    1004 non-null int64
m51                    1004 non-null int64
m52                    1004 non-null int64
m53                    1004 non-null int64
m54                    1004 non-null int64
m55                    1004 no

Now we drop rows with too many missing values (more than 5).

In [12]:
def remove_bad_rows(df, bad_row_threshold):
    # count missing values per row
    n_null = df.isnull().sum(axis=1)
    # collect index labels (not positions) of rows with too many missing values,
    # so the drop is also correct when the index is not 0..n-1
    bad_row = n_null[n_null > bad_row_threshold].index
    print('number of bad rows:', len(bad_row))

    # delete bad rows
    return df.drop(index=bad_row)

In [13]:
df = remove_bad_rows(df,5)

number of bad rows: 338
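
Row filtering by missing-value count can also be written as a boolean mask, which works regardless of how the index is labeled. A minimal sketch on a toy frame with a non-default index (labels, values, and the threshold `max_missing` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame whose index labels are not 0..n-1.
toy = pd.DataFrame(
    {'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]},
    index=[10, 20, 30],
)
max_missing = 1  # hypothetical threshold

# Number of missing values in each row, aligned with the index labels.
n_missing = toy.isnull().sum(axis=1)

# Keep rows with at most max_missing missing values.
kept = toy[n_missing <= max_missing]
print(list(kept.index))
```

After such a filter the index keeps its original labels, which is why the cleaned frame in this section no longer starts at 0.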


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 666 entries, 4 to 1002
Data columns (total 48 columns):
sex                    666 non-null int64
m32                    666 non-null int64
m33_1                  666 non-null float64
PE                     658 non-null float64
m33_2                  658 non-null float64
m34                    666 non-null int64
m35                    666 non-null int64
m42                    666 non-null int64
m43                    666 non-null int64
m44                    666 non-null int64
m45                    666 non-null int64
m46                    666 non-null int64
m47                    666 non-null int64
m48                    666 non-null int64
m49                    666 non-null int64
m50                    666 non-null int64
m51                    666 non-null int64
m52                    666 non-null int64
m53                    666 non-null int64
m54                    666 non-null int64
m55                    666 non-null int64
AGE   

For convenience, we separate the independent variables `X` and the dependent variable `y` from the data.

In [15]:
X = df.drop('IIEF_Y_N',axis=1)
y = df['IIEF_Y_N']

We impute the missing values in each column of `X`: numeric columns are filled with the column mean, and object (categorical) columns with the most frequent value.

In [16]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        - Columns of dtype object are imputed with the most frequent value in column.
        - Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        # object (categorical) columns --> most frequent value, other columns --> mean
        self.fill = pd.Series(
            [X[c].value_counts().index[0]
             if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        # to impute numeric columns with the median instead,
        # replace X[c].mean() with X[c].median()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [17]:
X = DataFrameImputer().fit_transform(X)
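
To see what this imputation does, here is a small sketch on a toy frame (column names and values are hypothetical). It applies the same rule with plain pandas, but checks for numeric columns with `pd.api.types.is_numeric_dtype` instead of testing for the object dtype; that substitution is an assumption made for this illustration, not the class above.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    'num': [1.0, np.nan, 3.0],
    'cat': ['a', None, 'a'],
})

# One fill value per column: mean for numeric columns,
# most frequent value otherwise.
fill = pd.Series(
    [toy[c].mean() if pd.api.types.is_numeric_dtype(toy[c])
     else toy[c].value_counts().index[0]
     for c in toy],
    index=toy.columns,
)

# fillna with a Series fills each column with the value under its name.
filled = toy.fillna(fill)
print(filled)
```

The missing entry in `num` becomes the mean of 1.0 and 3.0, and the missing entry in `cat` becomes the most frequent label.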

In [18]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 666 entries, 4 to 1002
Data columns (total 47 columns):
sex                    666 non-null int64
m32                    666 non-null int64
m33_1                  666 non-null float64
PE                     666 non-null float64
m33_2                  666 non-null float64
m34                    666 non-null int64
m35                    666 non-null int64
m42                    666 non-null int64
m43                    666 non-null int64
m44                    666 non-null int64
m45                    666 non-null int64
m46                    666 non-null int64
m47                    666 non-null int64
m48                    666 non-null int64
m49                    666 non-null int64
m50                    666 non-null int64
m51                    666 non-null int64
m52                    666 non-null int64
m53                    666 non-null int64
m54                    666 non-null int64
m55                    666 non-null int64
AGE   

The features in `X` now have no missing values. We convert the attributes `X` and the target `y` to NumPy arrays, stack them with `y` as the last column, and save them to a file.

In [19]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('ef_cleaned.dat',Xy,fmt='%s')
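
A saved file of this form can be read back and split into features and target again. The sketch below checks the round trip on a toy array written to a temporary file (the file name `toy_cleaned.dat` and the values are hypothetical); it assumes all entries are numeric, as `np.loadtxt` expects.

```python
import os
import tempfile

import numpy as np

# Toy features and target, stacked with the target as the last column.
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
y_toy = np.array([1.0, 2.0])
Xy_toy = np.hstack((X_toy, y_toy[:, np.newaxis]))

# Write to a temporary location so no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'toy_cleaned.dat')
np.savetxt(path, Xy_toy, fmt='%s')

# Read back and split: all columns but the last are X, the last is y.
loaded = np.loadtxt(path)
X_back, y_back = loaded[:, :-1], loaded[:, -1]
print(np.array_equal(X_back, X_toy), np.array_equal(y_back, y_toy))
```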