## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing bad rows and bad attributes and by imputing missing values.

First, we import the necessary packages into the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('stigma.csv',sep= ',')

In [4]:
df.head()

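| | ID | Y1 | Y2 | Y3 | Y4 | C01 | C02 | C03 | C04 | C05 | ... | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | Unnamed: 38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | |
| 1 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | |
| 2 | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | |
| 3 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | |
| 4 | 6 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | |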


We drop the column `ID`, which is unrelated to the dependent variable, and the column `Unnamed: 38`, which is empty in the rows shown above.

In [5]:
df = df.drop(['ID','Unnamed: 38'],axis=1)

The dependent variable is `Y1`, the impact of HIV-related stigma.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10665 entries, 0 to 10664
Data columns (total 37 columns):
Y1     10665 non-null int64
Y2     10665 non-null int64
Y3     10665 non-null int64
Y4     10665 non-null int64
C01    10665 non-null int64
C02    10665 non-null int64
C03    10665 non-null int64
C04    10665 non-null int64
C05    10665 non-null int64
C06    10665 non-null int64
C07    10665 non-null int64
C08    10665 non-null int64
C09    10665 non-null int64
C10    10665 non-null int64
C11    10665 non-null int64
C12    10665 non-null int64
C13    10665 non-null int64
C14    10665 non-null int64
C15    10665 non-null int64
C16    10665 non-null int64
C17    10665 non-null int64
C18    10665 non-null int64
C19    10665 non-null int64
C20    10665 non-null int64
C21    10665 non-null int64
C22    10665 non-null int64
C23    10665 non-null int64
C24    10665 non-null int64
X1     10665 non-null int64
X2     10665 non-null int64
X3     10665 non-null int64
X4     10665 non-null i

To clean the data, we first replace whitespace-only values with `nan`.

In [7]:
# replace empty by nan
df = df.replace(r'^\s+$', np.nan, regex=True)
df

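| | Y1 | Y2 | Y3 | Y4 | C01 | C02 | C03 | C04 | C05 | C06 | ... | C24 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 4 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| 7 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 9 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |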
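
To see what the regular expression `^\s+$` does, the following minimal sketch applies the same `replace` call to a small toy frame. The frame `toy` and its values are made up for illustration and are not part of the stigma data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two whitespace-only cells
toy = pd.DataFrame({"a": ["1", "   ", "3"], "b": ["x", "y", "\t"]})

# Cells consisting solely of whitespace become NaN; other cells are untouched
cleaned = toy.replace(r"^\s+$", np.nan, regex=True)

# Total number of missing cells after the replacement
n_missing = int(cleaned.isnull().sum().sum())
print(n_missing)
```

Note that `^\s+$` requires at least one whitespace character, so a truly empty string `''` is not matched; the pattern `^\s*$` would cover that case as well.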


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10665 entries, 0 to 10664
Data columns (total 37 columns):
Y1     10665 non-null int64
Y2     10665 non-null int64
Y3     10665 non-null int64
Y4     10665 non-null int64
C01    10665 non-null int64
C02    10665 non-null int64
C03    10665 non-null int64
C04    10665 non-null int64
C05    10665 non-null int64
C06    10665 non-null int64
C07    10665 non-null int64
C08    10665 non-null int64
C09    10665 non-null int64
C10    10665 non-null int64
C11    10665 non-null int64
C12    10665 non-null int64
C13    10665 non-null int64
C14    10665 non-null int64
C15    10665 non-null int64
C16    10665 non-null int64
C17    10665 non-null int64
C18    10665 non-null int64
C19    10665 non-null int64
C20    10665 non-null int64
C21    10665 non-null int64
C22    10665 non-null int64
C23    10665 non-null int64
C24    10665 non-null int64
X1     10665 non-null int64
X2     10665 non-null int64
X3     10665 non-null int64
X4     10665 non-null i

We find bad rows, i.e. rows with more than 10 missing values, and remove them.

In [9]:
# find bad rows, i.e. rows with more than 10 missing values
n_null = np.array(df.isnull().sum(axis=1))
bad_row = np.flatnonzero(n_null > 10)  # integer positions of the bad rows

print(bad_row)
print(len(bad_row))

# delete bad rows (positions are mapped to index labels first)
df = df.drop(df.index[bad_row])
df.info()

[]
0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10665 entries, 0 to 10664
Data columns (total 37 columns):
Y1     10665 non-null int64
Y2     10665 non-null int64
Y3     10665 non-null int64
Y4     10665 non-null int64
C01    10665 non-null int64
C02    10665 non-null int64
C03    10665 non-null int64
C04    10665 non-null int64
C05    10665 non-null int64
C06    10665 non-null int64
C07    10665 non-null int64
C08    10665 non-null int64
C09    10665 non-null int64
C10    10665 non-null int64
C11    10665 non-null int64
C12    10665 non-null int64
C13    10665 non-null int64
C14    10665 non-null int64
C15    10665 non-null int64
C16    10665 non-null int64
C17    10665 non-null int64
C18    10665 non-null int64
C19    10665 non-null int64
C20    10665 non-null int64
C21    10665 non-null int64
C22    10665 non-null int64
C23    10665 non-null int64
C24    10665 non-null int64
X1     10665 non-null int64
X2     10665 non-null int64
X3     10665 non-null int64
X4     10665 non-n
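
The row filter can also be written as a boolean mask on the per-row missing count. The sketch below uses a hypothetical toy frame and an illustrative threshold of 2 (the notebook uses 10); the names `toy`, `max_missing` and `kept` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 1 has three missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, np.nan, np.nan],
    "c": [0.0, np.nan, 1.0],
})

max_missing = 2                          # illustrative threshold
n_null = toy.isnull().sum(axis=1)        # number of missing values per row
kept = toy.loc[n_null <= max_missing]    # keep rows at or below the threshold
print(kept.index.tolist())
```

Because the mask is aligned on the index labels, this form works the same way whether or not the frame has a default `RangeIndex`.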

The data may still contain stray characters, such as `\t`, ` `, and `?`. We delete `\t` and ` ` and convert `?` (written `\?` as a regular expression) to `np.nan`.

In [10]:
df = df.replace(r'\t', '', regex=True)
df = df.replace(' ', '', regex=True)
# use the NaN object itself, not the string 'np.nan'
df = df.replace(r'\?', np.nan, regex=True)
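
One detail worth checking in a step like this is that the replacement value is the `np.nan` object rather than the string `'np.nan'`: only the former is counted as missing by `isnull()`. The following minimal sketch illustrates the difference on a hypothetical toy column; the values and variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

toy = pd.Series(["1", "?", "0\t", " 1"])  # hypothetical raw values

# Strip tabs and spaces inside the strings
stripped = toy.replace(r"[\t ]", "", regex=True)

# Replacing "?" with a string leaves an ordinary (non-missing) value
as_string = stripped.replace(r"^\?$", "np.nan", regex=True)
# Replacing "?" with np.nan yields a true missing value
as_nan = stripped.replace(r"^\?$", np.nan, regex=True)

n_string = int(as_string.isnull().sum())
n_nan = int(as_nan.isnull().sum())

# Convert the cleaned strings to numbers; the missing entry stays NaN
numeric = pd.to_numeric(as_nan)
print(n_string, n_nan)
```

The final `pd.to_numeric` call is an optional extra step, shown here because values cleaned as strings are not numeric until they are converted.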

For convenience, we separate the independent variables `X` and the dependent variable `y`.

In [11]:
X = df.drop('Y1',axis=1)
y = df['Y1']

In [12]:
x1 = np.array(X)
x1

array([[0, 1, 0, ..., 0, 1, 0],
       [0, 1, 1, ..., 0, 1, 0],
       [1, 1, 0, ..., 0, 1, 0],
       ...,
       [1, 1, 1, ..., 0, 1, 1],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 1, 0]])

We identify the variables with more than 5 missing values and drop them from the dataset.

In [13]:
i_missing = []
for i in range(x1.shape[1]):
    n_missing = np.sum(np.isnan(x1[:,i]))
    if n_missing > 5:
        print(i,n_missing)
        i_missing.append(i)    
print(i_missing)    

[]


In [14]:
x2 = np.delete(x1,i_missing,axis=1)
x2.shape

(10665, 36)
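
The per-column count can also be computed on the DataFrame itself, which keeps the column names attached. This is a minimal sketch on a hypothetical toy frame with an illustrative threshold of 1 (the notebook uses 5); `toy`, `max_missing`, `bad_cols` and `reduced` are names assumed for this example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): column "a" has two missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

max_missing = 1                        # illustrative threshold
n_missing = toy.isnull().sum()         # number of missing values per column
bad_cols = n_missing[n_missing > max_missing].index.tolist()
reduced = toy.drop(columns=bad_cols)   # drop the columns above the threshold
print(bad_cols, reduced.shape)
```

Unlike `np.isnan`, `isnull()` also handles object columns, so this form does not fail if some columns still hold strings.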

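We impute the missing values in each column of `X`: columns of dtype object are filled with their most frequent value, and numeric columns with their mean. A median-based variant is noted in the code as an alternative.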

In [15]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.
    - Columns of dtype object are imputed with the most frequent value in column.
    - Columns of other types are imputed with mean of column.
    """
    def fit(self, X, y=None):
        # object (categorical) --> most frequent value, numerical --> mean
        self.fill = pd.Series(
            [X[c].value_counts().index[0]
             if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        # alternative for numerical columns: use X[c].median() instead of X[c].mean()
        return self

    def transform(self, X, y=None):
        # fill each column's missing values with the value stored in fit()
        return X.fillna(self.fill)

In [16]:
X = DataFrameImputer().fit_transform(X)

In [17]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10665 entries, 0 to 10664
Data columns (total 36 columns):
Y2     10665 non-null int64
Y3     10665 non-null int64
Y4     10665 non-null int64
C01    10665 non-null int64
C02    10665 non-null int64
C03    10665 non-null int64
C04    10665 non-null int64
C05    10665 non-null int64
C06    10665 non-null int64
C07    10665 non-null int64
C08    10665 non-null int64
C09    10665 non-null int64
C10    10665 non-null int64
C11    10665 non-null int64
C12    10665 non-null int64
C13    10665 non-null int64
C14    10665 non-null int64
C15    10665 non-null int64
C16    10665 non-null int64
C17    10665 non-null int64
C18    10665 non-null int64
C19    10665 non-null int64
C20    10665 non-null int64
C21    10665 non-null int64
C22    10665 non-null int64
C23    10665 non-null int64
C24    10665 non-null int64
X1     10665 non-null int64
X2     10665 non-null int64
X3     10665 non-null int64
X4     10665 non-null int64
X5     10665 non-null i
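
The imputer above fills numeric columns with the mean, with the median variant left commented out. For comparison, a median-based fill can be sketched directly in pandas. The toy frame and its values are hypothetical and serve only to illustrate the call.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical 0/1 values) with one missing entry per column
toy = pd.DataFrame({
    "a": [0.0, 1.0, np.nan, 1.0],
    "b": [1.0, np.nan, 0.0, 0.0],
})

medians = toy.median()          # per-column median; NaN is skipped by default
imputed = toy.fillna(medians)   # fill each column with its own median
print(imputed)
```

For 0/1 columns such as these, the median is itself 0 or 1 unless the two values are exactly tied, whereas the mean is generally fractional.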

The data are now clean. We convert the attributes `X` and the target `y` to numpy arrays, stack them with `y` as the last column, and save the result to a file.

In [18]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('stigma_cleaned.dat',Xy,fmt='%s')
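
A file written this way can be read back with `np.loadtxt`, taking the last column as the target. The sketch below checks the round trip on small hypothetical arrays and a temporary file, so it does not touch `stigma_cleaned.dat`; all names in it are assumptions made for this example.

```python
import os
import tempfile

import numpy as np

# Toy attributes and target (hypothetical values)
X_toy = np.array([[0, 1], [1, 1], [1, 0]])
y_toy = np.array([1, 0, 1])
Xy_toy = np.hstack((X_toy, y_toy[:, np.newaxis]))  # target as last column

# Write to a temporary file, one row per line, whitespace-separated
path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.dat")
np.savetxt(path, Xy_toy, fmt="%d")

# Read back and split into attributes and target
loaded = np.loadtxt(path)
X_back, y_back = loaded[:, :-1], loaded[:, -1]
print(X_back.shape, y_back.shape)
```

`np.loadtxt` returns floating-point values by default, so the arrays may need a cast (for example `astype(int)`) if integer labels are required downstream.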