## Data cleaning

In this section, we first see the data structure and identify data attributes. We then clean the data, including removing bad rows, bad attributes, imputing missing values, etc.

First of all, we import the necessary packages to the Jupyter notebook:

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
np.random.seed(1)

We load the raw data.

In [3]:
df = pd.read_csv('mental.csv',sep= ',')

In [4]:
df.head()

Unnamed: 0,id,school_year,semester,age,gender,height,weight,phq1,phq2,phq3,...,Unnamed: 245,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254
0,A1,1,1,19.0,m,1.5,75.0,2.0,1.0,1.0,...,,,,,,,,,,
1,B2,1,1,18.0,m,1.68,56.0,2.0,0.0,1.0,...,,,,,,,,,,
2,C3,1,1,19.0,m,1.74,76.0,1.0,1.0,0.0,...,,,,,,,,,,
3,D4,1,1,18.0,f,1.68,67.0,2.0,2.0,3.0,...,,,,,,,,,,
4,E5,1,1,18.0,m,1.8,83.0,0.0,1.0,1.0,...,,,,,,,,,,


Dropping the column `id` as it is unrelated to the dependent variable

In [5]:
df = df.drop(['id','gender','bed_time','wake_up_time','reported_sleep_hours','nap_duration','weekly_study_hours'],axis=1)

Dropping rows in the dependent variable that aren't missing significantly for other variables.

In [6]:
df = df.drop(df.index[[188,192,197,204,381]])
#df = df.drop(df.index[[188]])
#df = df.drop(df.index[[192]])
#df = df.drop(df.index[[197]])
#df = df.drop(df.index[[204]])

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 771 entries, 0 to 775
Columns: 248 entries, school_year to Unnamed: 254
dtypes: float64(246), int64(2)
memory usage: 1.5 MB


To clearn the data, we first replace the empty value by `nan`.

In [8]:
# replace empty by nan
df = df.replace(r'^\s+$', np.nan, regex=True)
df

Unnamed: 0,school_year,semester,age,height,weight,phq1,phq2,phq3,phq4,phq5,...,Unnamed: 245,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254
0,1,1,19.0,1.50,75.0,2.0,1.0,1.0,1.0,2.0,...,,,,,,,,,,
1,1,1,18.0,1.68,56.0,2.0,0.0,1.0,2.0,1.0,...,,,,,,,,,,
2,1,1,19.0,1.74,76.0,1.0,1.0,0.0,3.0,1.0,...,,,,,,,,,,
3,1,1,18.0,1.68,67.0,2.0,2.0,3.0,3.0,3.0,...,,,,,,,,,,
4,1,1,18.0,1.80,83.0,0.0,1.0,1.0,2.0,1.0,...,,,,,,,,,,
5,1,1,18.0,1.74,67.0,0.0,0.0,0.0,1.0,1.0,...,,,,,,,,,,
6,1,1,18.0,1.78,71.0,1.0,0.0,2.0,2.0,0.0,...,,,,,,,,,,
7,1,1,19.0,1.69,58.5,0.0,0.0,1.0,1.0,0.0,...,,,,,,,,,,
8,1,1,20.0,1.55,51.0,1.0,1.0,3.0,1.0,2.0,...,,,,,,,,,,
9,1,1,19.0,1.75,75.0,1.0,1.0,0.0,1.0,2.0,...,,,,,,,,,,


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 771 entries, 0 to 775
Columns: 248 entries, school_year to Unnamed: 254
dtypes: float64(246), int64(2)
memory usage: 1.5 MB


The data still contain errors, such as `\t`, ` `, `\?`. We will delete `\t` and ` ` and convert `\?` to `np.nan`.

In [10]:
df = df.replace('\t','',regex=True)
df = df.replace(' ','',regex=True)
df = df.replace('\?','np.nan',regex=True)

We determine and drop columns with multiple missing variables.

In [11]:
def remove_bad_columns(df,bad_column_threshold):
    # find bad columns having too many missing values
    n_null = np.array(df.isnull().sum(axis=0))
    bad_col = np.array([]).astype(int)
    for t in range(len(n_null)):
        if n_null[t] > bad_column_threshold:
            bad_col = np.append(bad_col,t)

    #print(bad_col)
    print('number of bad columns:',len(bad_col))

    # delete bad columns
    df = df.drop(df.columns[bad_col],axis=1)
    #df.info()
    return df  

In [12]:
df = remove_bad_columns(df,15)

number of bad columns: 213


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 771 entries, 0 to 775
Data columns (total 35 columns):
school_year                      771 non-null int64
semester                         771 non-null int64
age                              771 non-null float64
height                           766 non-null float64
weight                           767 non-null float64
phq1                             770 non-null float64
phq2                             770 non-null float64
phq3                             770 non-null float64
phq4                             770 non-null float64
phq5                             767 non-null float64
phq6                             770 non-null float64
phq7                             770 non-null float64
phq8                             770 non-null float64
phq9                             770 non-null float64
thoughts_of_dropping_out         771 non-null float64
previous_depression_diagnosis    770 non-null float64
previous_depression_treatment    76

We find bad rows with too many missing values and remove them.

In [14]:
def remove_bad_rows(df,bad_row_threshold):   
    # find bad rows having too many missing values
    n_null = np.array(df.isnull().sum(axis=1))
    bad_row = np.array([])
    for t in range(len(n_null)):
        if n_null[t] > bad_row_threshold:
            bad_row = np.append(bad_row,t)

    #print(bad_row)
    print('number of bad rows:',len(bad_row))

    # delete bad rows
    df = df.drop(bad_row)
    #df.info()
    return df

In [15]:
df = remove_bad_rows(df,5)

number of bad rows: 8


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 763 entries, 0 to 775
Data columns (total 35 columns):
school_year                      763 non-null int64
semester                         763 non-null int64
age                              763 non-null float64
height                           758 non-null float64
weight                           759 non-null float64
phq1                             762 non-null float64
phq2                             762 non-null float64
phq3                             762 non-null float64
phq4                             762 non-null float64
phq5                             759 non-null float64
phq6                             762 non-null float64
phq7                             762 non-null float64
phq8                             762 non-null float64
phq9                             762 non-null float64
thoughts_of_dropping_out         763 non-null float64
previous_depression_diagnosis    762 non-null float64
previous_depression_treatment    75

For convenience, we separate independents `X` and dependent `y` from the data.

In [17]:
X = df.drop('thoughts_of_dropping_out',axis=1)
y = df['thoughts_of_dropping_out']

We impute the missing value of X at each column by its median value.

In [18]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        - Columns of dtype object are imputed with the most frequent value in column.
        - Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            # numerical --> mean, categorical --> median
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X], index=X.columns)  
                               
            # numerical, categorical --> median                   
            #if X[c].dtype == np.dtype('O') else X[c].median() for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [19]:
X = DataFrameImputer().fit_transform(X)

In [20]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 763 entries, 0 to 775
Data columns (total 34 columns):
school_year                      763 non-null int64
semester                         763 non-null int64
age                              763 non-null float64
height                           763 non-null float64
weight                           763 non-null float64
phq1                             763 non-null float64
phq2                             763 non-null float64
phq3                             763 non-null float64
phq4                             763 non-null float64
phq5                             763 non-null float64
phq6                             763 non-null float64
phq7                             763 non-null float64
phq8                             763 non-null float64
phq9                             763 non-null float64
previous_depression_diagnosis    763 non-null float64
previous_depression_treatment    763 non-null float64
gad1                             76

Now, the data are completely clean. We convert attributes `X` and target `y` to numpy arrays and save them to files.

In [21]:
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X,y[:,np.newaxis]))

np.savetxt('mental_cleaned.dat',Xy,fmt='%s')