## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data: removing bad rows and bad attributes, imputing missing values, and so on.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('paradox.csv', sep=',')
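As a hedged aside: if a raw file marks unknown entries with a placeholder such as `?`, `pd.read_csv` can map that placeholder to `NaN` at load time through its `na_values` parameter. The sketch below uses a small in-memory CSV with made-up columns `a` and `b`, not `paradox.csv` itself.

```python
import io

import pandas as pd

# Toy CSV standing in for a raw file; the columns 'a' and 'b' are hypothetical.
toy_csv = io.StringIO("a,b\n1,?\n2,3\n")

# na_values lists extra strings that the parser should treat as missing.
toy = pd.read_csv(toy_csv, sep=',', na_values=['?'])

n_missing_b = int(toy['b'].isnull().sum())
print(toy.dtypes)
print('missing in b:', n_missing_b)
```

With the placeholder handled by the parser, column `b` is read as a numeric column rather than as strings.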

In [4]:
df.head()

Unnamed: 0,Site,Code,SAincl,SA2,SA4,SA6,SA8,SA10,SA12,SA1416,...,TempPara5,VitD,VitD3Groups,ThreeUTR,ThreeUTRvar,INT4,INT4var,D543,Callele,Callelevar
0,1.0,401,8.3,1.0,2.8,3.3,1.2,1.1,1.1,1.1,...,,,,,,,,,,
1,1.0,402,11.3,13.3,18.9,20.1,26.2,34.7,29.9,33.0,...,,,,,,,,,,
2,1.0,403,15.4,9.9,9.5,9.5,8.3,8.7,6.4,3.9,...,,,,,,,,,,
3,1.0,404,0.5,0.2,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
4,1.0,405,33.0,24.1,12.3,11.8,8.7,3.3,3.1,3.1,...,,,,,,,,,,


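We drop the column `Code`, as it is unrelated to the dependent variable. The same call also drops `WeekParadox3`, `lesnumbincl`, `PulsePara3`, `PulsePara5`, `TempPara3`, and `TempPara5`.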

In [5]:
df = df.drop(['Code','WeekParadox3','lesnumbincl','PulsePara3','PulsePara5','TempPara3','TempPara5'],axis=1)

The dependent variable is the presence of a paradoxical response, stored in the column `Paradox1`.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 42 columns):
Site                     241 non-null float64
SAincl                   241 non-null float64
SA2                      233 non-null float64
SA4                      227 non-null float64
SA6                      223 non-null float64
SA8                      232 non-null float64
SA10                     215 non-null float64
SA12                     225 non-null float64
SA1416                   223 non-null float64
SA2120                   221 non-null float64
SA2728                   211 non-null float64
studyarm                 241 non-null float64
sexe                     241 non-null float64
age                      241 non-null int64
lesionsince              240 non-null float64
BPSYST                   216 non-null float64
BPDIAST                  216 non-null float64
pulserateinclbeatsmin    232 non-null float64
tempinclCelsius          237 non-null float64
bodyweightinclkg       
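The non-null counts printed by `df.info()` can also be turned into a ranked table of missing values, which may help when choosing thresholds for dropping columns or rows. A minimal sketch on a toy frame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame; 'u', 'v', 'w' are hypothetical column names.
toy = pd.DataFrame({
    'u': [1.0, 2.0, 3.0, 4.0],
    'v': [1.0, np.nan, 3.0, np.nan],
    'w': [np.nan, np.nan, np.nan, 4.0],
})

# Count missing values per column and sort from worst to best.
n_missing = toy.isnull().sum().sort_values(ascending=False)
frac_missing = n_missing / len(toy)

summary = pd.DataFrame({'n_missing': n_missing, 'frac_missing': frac_missing})
print(summary)
```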

To clean the data, we first replace whitespace-only entries with `NaN`.

In [7]:
# replace whitespace-only strings by NaN
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,Site,SAincl,SA2,SA4,SA6,SA8,SA10,SA12,SA1416,SA2120,...,WeekParadox5,VitD,VitD3Groups,ThreeUTR,ThreeUTRvar,INT4,INT4var,D543,Callele,Callelevar
0,1.0,8.3,1.0,2.8,3.3,1.2,1.1,1.1,1.1,0.0,...,6.0,,,,,,,,,
1,1.0,11.3,13.3,18.9,20.1,26.2,34.7,29.9,33.0,33.0,...,0.0,,,,,,,,,
2,1.0,15.4,9.9,9.5,9.5,8.3,8.7,6.4,3.9,3.7,...,0.0,,,,,,,,,
3,1.0,0.5,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,,,,,,,
4,1.0,33.0,24.1,12.3,11.8,8.7,3.3,3.1,3.1,7.1,...,0.0,,,,,,,,,
5,1.0,6.4,21.4,3.9,6.0,7.1,6.7,2.8,6.4,5.0,...,8.0,,,,,,,,,
6,1.0,3.3,1.3,0.2,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,,,,,,,
7,1.0,3.6,3.3,0.6,0.5,0.2,0.0,0.0,0.0,0.0,...,0.0,,,,,,,,,
8,1.0,6.4,7.9,6.0,4.8,2.6,0.3,1.8,1.0,0.0,...,0.0,,,,,,,,,
9,1.0,9.1,7.9,7.9,5.7,4.1,2.0,2.8,4.1,0.0,...,16.0,,,,,,,,,
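To make the behaviour of the pattern `^\s+$` concrete, here is a small sketch on a toy column (the name `s` and its values are invented). The pattern requires at least one whitespace character, so a truly empty string is left alone; the variant `^\s*$` would catch it as well. Note that `pd.read_csv` already parses empty fields as `NaN` by default, so the difference matters mainly for data built some other way.

```python
import numpy as np
import pandas as pd

# Toy column: empty string, a space, a tab, and a regular value.
toy = pd.DataFrame({'s': ['', ' ', '\t', 'x']})

# '^\s+$' needs at least one whitespace character, so '' is left untouched.
plus = toy.replace(r'^\s+$', np.nan, regex=True)
# '^\s*$' also matches the empty string.
star = toy.replace(r'^\s*$', np.nan, regex=True)

n_plus = int(plus['s'].isnull().sum())
n_star = int(star['s'].isnull().sum())
print(n_plus, n_star)
```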


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 42 columns):
Site                     241 non-null float64
SAincl                   241 non-null float64
SA2                      233 non-null float64
SA4                      227 non-null float64
SA6                      223 non-null float64
SA8                      232 non-null float64
SA10                     215 non-null float64
SA12                     225 non-null float64
SA1416                   223 non-null float64
SA2120                   221 non-null float64
SA2728                   211 non-null float64
studyarm                 241 non-null float64
sexe                     241 non-null float64
age                      241 non-null int64
lesionsince              240 non-null float64
BPSYST                   216 non-null float64
BPDIAST                  216 non-null float64
pulserateinclbeatsmin    232 non-null float64
tempinclCelsius          237 non-null float64
bodyweightinclkg       

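The data still contain stray characters: tabs (`\t`), spaces, and question marks (`?`). We delete the tabs and spaces and convert `?` to `np.nan`.

In [9]:
df = df.replace('\t', '', regex=True)
df = df.replace(' ', '', regex=True)
# pass the np.nan object, not the string 'np.nan', so the entry counts as missing
df = df.replace(r'\?', np.nan, regex=True)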
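A minimal sketch of this kind of cleanup on a toy frame (the column names `col_a` and `col_b` and their values are invented). It strips tabs and spaces, turns a lone `?` into a real missing value, and then converts the columns to a numeric dtype. Note that the replacement has to be the `np.nan` object; the string `'np.nan'` would not be counted by `isnull()`.

```python
import numpy as np
import pandas as pd

# Toy frame with stray tabs, spaces, and a '?' placeholder (hypothetical data).
toy = pd.DataFrame({
    'col_a': ['120\t', ' 110', '?'],
    'col_b': ['34', '4 1', '29'],
})

# Remove tabs and spaces anywhere inside the strings.
cleaned = toy.replace(r'[\t ]', '', regex=True)
# Replace cells that consist of a single '?' by a real missing value.
cleaned = cleaned.replace(r'^\?$', np.nan, regex=True)
# Convert the cleaned strings to numbers, column by column.
cleaned = cleaned.apply(pd.to_numeric)

n_missing = int(cleaned.isnull().sum().sum())
print(cleaned)
print('missing values:', n_missing)
```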

We identify the variables with too many missing values (more than 40 here) and drop them from the dataset.

In [10]:
def remove_bad_columns(df, bad_column_threshold):
    # find bad columns having too many missing values
    n_null = df.isnull().sum(axis=0)
    bad_col = n_null.index[n_null > bad_column_threshold]

    print('number of bad columns:', len(bad_col))

    # delete bad columns
    df = df.drop(bad_col, axis=1)
    return df

In [11]:
df = remove_bad_columns(df,40)

number of bad columns: 9
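The same column filter can be sketched with pandas' built-in `DataFrame.dropna`, whose `thresh` parameter is the minimum number of non-missing values required to keep a column. Allowing at most `threshold` missing values therefore corresponds to `thresh = len(df) - threshold`. The toy frame and the value `threshold = 2` below are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],              # 0 missing
    'b': [1.0, np.nan, 3.0, np.nan, 5.0],        # 2 missing
    'c': [np.nan, np.nan, np.nan, np.nan, 5.0],  # 4 missing
})
threshold = 2  # hypothetical: allow at most 2 missing values per column

# thresh = minimum number of non-missing values a column needs to be kept
kept = toy.dropna(axis=1, thresh=len(toy) - threshold)

kept_columns = list(kept.columns)
print(kept_columns)
```

Column `b` sits exactly at the threshold and is kept, which matches the strict `>` comparison used in `remove_bad_columns`.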


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 33 columns):
Site                     241 non-null float64
SAincl                   241 non-null float64
SA2                      233 non-null float64
SA4                      227 non-null float64
SA6                      223 non-null float64
SA8                      232 non-null float64
SA10                     215 non-null float64
SA12                     225 non-null float64
SA1416                   223 non-null float64
SA2120                   221 non-null float64
SA2728                   211 non-null float64
studyarm                 241 non-null float64
sexe                     241 non-null float64
age                      241 non-null int64
lesionsince              240 non-null float64
BPSYST                   216 non-null float64
BPDIAST                  216 non-null float64
pulserateinclbeatsmin    232 non-null float64
tempinclCelsius          237 non-null float64
bodyweightinclkg       

We find the bad rows, those with too many missing values (more than 5 here), and remove them.

In [13]:
def remove_bad_rows(df, bad_row_threshold):
    # find bad rows having too many missing values
    n_null = df.isnull().sum(axis=1)
    # collect index labels (not positions), since df.drop removes rows by label
    bad_row = n_null.index[n_null > bad_row_threshold]

    print('number of bad rows:', len(bad_row))

    # delete bad rows
    df = df.drop(bad_row)
    return df

In [14]:
df = remove_bad_rows(df,5)

number of bad rows: 5
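Row filtering can also be written as a boolean mask, which avoids any confusion between row positions and index labels. The sketch below uses an invented toy frame and a hypothetical threshold of 1. It also shows that the surviving rows keep their original labels, which is why an index can run from 0 to 240 while holding only 236 entries.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, np.nan, 3.0, 4.0],
    'c': [1.0, np.nan, np.nan, 4.0],
})
threshold = 1  # hypothetical: allow at most 1 missing value per row

# Count missing values per row and keep the rows at or below the threshold.
n_null = toy.isnull().sum(axis=1)
kept = toy[n_null <= threshold]

# The surviving rows keep their original labels, so the index now has a gap.
kept_labels = kept.index.tolist()
# reset_index(drop=True) renumbers the rows 0..n-1 if contiguous labels are needed.
renumbered = kept.reset_index(drop=True)
print(kept_labels, renumbered.index.tolist())
```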


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 236 entries, 0 to 240
Data columns (total 33 columns):
Site                     236 non-null float64
SAincl                   236 non-null float64
SA2                      229 non-null float64
SA4                      224 non-null float64
SA6                      221 non-null float64
SA8                      229 non-null float64
SA10                     215 non-null float64
SA12                     224 non-null float64
SA1416                   222 non-null float64
SA2120                   219 non-null float64
SA2728                   210 non-null float64
studyarm                 236 non-null float64
sexe                     236 non-null float64
age                      236 non-null int64
lesionsince              235 non-null float64
BPSYST                   212 non-null float64
BPDIAST                  212 non-null float64
pulserateinclbeatsmin    228 non-null float64
tempinclCelsius          232 non-null float64
bodyweightinclkg       

For convenience, we separate the independent variables `X` and the dependent variable `y`.

In [16]:
X = df.drop('Paradox1',axis=1)
y = df['Paradox1']

We impute the missing values in each column of `X`: numeric columns are filled with the column mean, and columns of dtype object with their most frequent value.

In [17]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        - Columns of dtype object are imputed with the most frequent value in column.
        - Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        # object dtype --> most frequent value, other dtypes --> mean
        # (to impute with the median instead, replace X[c].mean() by X[c].median())
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [18]:
X = DataFrameImputer().fit_transform(X)
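The choice between mean and median imputation matters when a column contains outliers. As an illustration only, the sketch below fills a toy frame (invented columns `x` and `g`) in both ways using plain `fillna` with a dict of per-column fill values, the same per-column idea as in `DataFrameImputer`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 100.0, np.nan],  # numeric, with one outlier
    'g': ['a', 'b', 'a', None, 'a'],      # categorical strings with one gap
})

# Most frequent value of the categorical column.
mode_g = toy['g'].value_counts().index[0]

# Per-column fill values: mean or median for 'x', most frequent value for 'g'.
by_mean = toy.fillna({'x': toy['x'].mean(), 'g': mode_g})
by_median = toy.fillna({'x': toy['x'].median(), 'g': mode_g})

print(by_mean.loc[4, 'x'], by_median.loc[4, 'x'])
```

The single outlier pulls the mean far above the typical values, while the median stays close to them.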

In [19]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 236 entries, 0 to 240
Data columns (total 32 columns):
Site                     236 non-null float64
SAincl                   236 non-null float64
SA2                      236 non-null float64
SA4                      236 non-null float64
SA6                      236 non-null float64
SA8                      236 non-null float64
SA10                     236 non-null float64
SA12                     236 non-null float64
SA1416                   236 non-null float64
SA2120                   236 non-null float64
SA2728                   236 non-null float64
studyarm                 236 non-null float64
sexe                     236 non-null float64
age                      236 non-null int64
lesionsince              236 non-null float64
BPSYST                   236 non-null float64
BPDIAST                  236 non-null float64
pulserateinclbeatsmin    236 non-null float64
tempinclCelsius          236 non-null float64
bodyweightinclkg       

The missing values in `X` are now imputed. We convert the attributes `X` and the target `y` to NumPy arrays and save them together to a single file, with `y` as the last column.

In [20]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('paradox_cleaned.dat',Xy,fmt='%s')
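A file written this way can be read back with `np.loadtxt`, after which the last column is split off again as the target. The round trip below is a sketch on a small toy array written to a temporary directory. The file name `toy_cleaned.dat` is hypothetical, and the sketch assumes every column is numeric.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the cleaned attributes and target.
X_toy = np.array([[1.0, 2.5], [3.0, 4.5], [5.0, 6.5]])
y_toy = np.array([0.0, 1.0, 0.0])
Xy_toy = np.hstack((X_toy, y_toy[:, np.newaxis]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_cleaned.dat')  # hypothetical file name
    np.savetxt(path, Xy_toy, fmt='%s')
    loaded = np.loadtxt(path)

# All columns but the last are attributes; the last column is the target.
X_back, y_back = loaded[:, :-1], loaded[:, -1]
print(X_back.shape, y_back.shape)
```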