## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data by removing bad rows and bad attributes, imputing missing values, and so on.

First, we import the necessary packages into the Jupyter notebook:

In [11]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [12]:
np.random.seed(1)

We load the raw data.

In [13]:
df = pd.read_csv('language.csv', sep=',')

In [14]:
df.head()

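| | Y | filename | sex | age | age_years | corpus | group | child_TNW | child_TNS | examiner_TNW | ... | word_errors | f_k | n_v | n_aux | n_3s_v | det_n_pl | det_pl_n | pro_aux | pro_3s_v | total_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | fssli009.cha | | 165 | 13.75 | Conti4 | SLI | 287 | 36 | 4 | ... | 8 | 1.210456 | 0 | 2 | 2 | 7 | 0 | 0 | 1 | 12 |
| 1 | 1 | fssli058.cha | | 172 | 14.333333 | Conti4 | SLI | 368 | 42 | 27 | ... | 16 | 1.871708 | 0 | 4 | 0 | 5 | 0 | 0 | 0 | 9 |
| 2 | 1 | fssli062.cha | | 160 | 13.333333 | Conti4 | SLI | 266 | 26 | 2 | ... | 0 | 2.240602 | 0 | 1 | 0 | 5 | 0 | 0 | 0 | 6 |
| 3 | 1 | fssli066.cha | | 184 | 15.333333 | Conti4 | SLI | 405 | 40 | 21 | ... | 4 | 1.877762 | 1 | 0 | 0 | 11 | 0 | 0 | 0 | 12 |
| 4 | 1 | fssli108.cha | | 176 | 14.666667 | Conti4 | SLI | 300 | 35 | 20 | ... | 8 | 0.339524 | 0 | 1 | 1 | 5 | 0 | 0 | 0 | 7 |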


We drop the column `filename`, which is not an attribute. We also drop `group`, which is identical to `Y`, and `corpus`, which is not defined in the codebook.

In [15]:
# Drop the identifier, the undocumented column, and the duplicate of Y
df = df.drop(['filename', 'corpus', 'group'], axis=1)
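
Before dropping a column as a duplicate of the target, it can be worth confirming that the two really carry the same information. The following is a minimal sketch on a toy frame, not the notebook's own check: the values are made up, and the label `TD` for the non-SLI class is an assumption (only `SLI` appears in the rows shown above).

```python
import pandas as pd

# Toy stand-in for the raw data; the values and the "TD" label are hypothetical.
toy = pd.DataFrame({
    "Y": [1, 1, 0, 0],
    "group": ["SLI", "SLI", "TD", "TD"],
})

# Cross-tabulate the two columns: rows are group labels, columns are Y values.
table = pd.crosstab(toy["group"], toy["Y"])

# The columns are equivalent if each group label maps to exactly one Y value
# and each Y value maps to exactly one group label.
nonzero = table > 0
is_identical = bool((nonzero.sum(axis=1) == 1).all() and (nonzero.sum(axis=0) == 1).all())

print(table)
print("group duplicates Y:", is_identical)
```

On the real data the same check would use `df['group']` and `df['Y']` before the drop.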

There are now 61 columns: 60 attributes and the class label `Y`, which indicates the presence or absence of specific language impairment.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1163 entries, 0 to 1162
Data columns (total 61 columns):
Y                        1163 non-null int64
sex                      1044 non-null object
age                      1163 non-null int64
age_years                1163 non-null float64
child_TNW                1163 non-null int64
child_TNS                1163 non-null int64
examiner_TNW             1163 non-null int64
freq_ttr                 1163 non-null float64
r_2_i_verbs              1163 non-null float64
mor_words                1163 non-null int64
num_pos_tags             1163 non-null int64
n_dos                    1163 non-null int64
repetition               1163 non-null int64
retracing                1163 non-null int64
fillers                  1163 non-null int64
s_1g_ppl                 1163 non-null float64
s_2g_ppl                 1163 non-null float64
s_3g_ppl                 1163 non-null float64
d_1g_ppl                 1163 non-null float64
d_2g_ppl               

The output shows that `sex` has only 1044 non-null entries out of 1163, that is, 119 missing values, so we remove this attribute.

In [17]:
df = df.drop(['sex'],axis=1)
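
Reading missing counts off `df.info()` works for a few columns, but the counts can also be computed directly. The sketch below assumes a toy frame with made-up values (the `sex` labels are hypothetical) and a hypothetical cutoff `threshold`; it illustrates one way to flag columns for removal, not the rule used in this notebook.

```python
import pandas as pd

# Toy stand-in for the data; all values are hypothetical.
toy = pd.DataFrame({
    "sex": ["male", None, "female", None],
    "age": [165, 172, 160, 184],
})

# Number and fraction of missing entries per column.
missing_count = toy.isna().sum()
missing_frac = toy.isna().mean()

# Hypothetical cutoff: drop any column with more than 10% missing values.
threshold = 0.1
to_drop = missing_frac[missing_frac > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

print(missing_count)
print("dropped:", to_drop)
```

Dropping is the simplest option; whether imputing would be preferable depends on how informative the column is.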

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1163 entries, 0 to 1162
Data columns (total 60 columns):
Y                        1163 non-null int64
age                      1163 non-null int64
age_years                1163 non-null float64
child_TNW                1163 non-null int64
child_TNS                1163 non-null int64
examiner_TNW             1163 non-null int64
freq_ttr                 1163 non-null float64
r_2_i_verbs              1163 non-null float64
mor_words                1163 non-null int64
num_pos_tags             1163 non-null int64
n_dos                    1163 non-null int64
repetition               1163 non-null int64
retracing                1163 non-null int64
fillers                  1163 non-null int64
s_1g_ppl                 1163 non-null float64
s_2g_ppl                 1163 non-null float64
s_3g_ppl                 1163 non-null float64
d_1g_ppl                 1163 non-null float64
d_2g_ppl                 1163 non-null float64
d_3g_ppl              

For convenience, we separate the independent variables `X` from the dependent variable `y`.

In [19]:
X = df.drop('Y',axis=1)
y = df['Y']

In [20]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1163 entries, 0 to 1162
Data columns (total 59 columns):
age                      1163 non-null int64
age_years                1163 non-null float64
child_TNW                1163 non-null int64
child_TNS                1163 non-null int64
examiner_TNW             1163 non-null int64
freq_ttr                 1163 non-null float64
r_2_i_verbs              1163 non-null float64
mor_words                1163 non-null int64
num_pos_tags             1163 non-null int64
n_dos                    1163 non-null int64
repetition               1163 non-null int64
retracing                1163 non-null int64
fillers                  1163 non-null int64
s_1g_ppl                 1163 non-null float64
s_2g_ppl                 1163 non-null float64
s_3g_ppl                 1163 non-null float64
d_1g_ppl                 1163 non-null float64
d_2g_ppl                 1163 non-null float64
d_3g_ppl                 1163 non-null float64
z_mlu_sli           

The data are now clean. We convert the attributes `X` and the target `y` to NumPy arrays and save them to a file, with `y` as the last column.

In [21]:
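# Convert to NumPy arrays and append y as the last column
X = np.array(X)
y = np.array(y)
Xy = np.hstack((X, y[:, np.newaxis]))

# Write one sample per row, space-separated, with the label last
np.savetxt('language_cleaned.dat', Xy, fmt='%s')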
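
To show how a file written this way can be read back, here is a small round-trip sketch. It uses a toy array and a temporary file named `toy_cleaned.dat`, both hypothetical, and assumes the label sits in the last column as above.

```python
import os
import tempfile

import numpy as np

# Toy attributes and labels; the values are hypothetical.
X = np.array([[165, 13.75], [172, 14.333333]])
y = np.array([1, 1])
Xy = np.hstack((X, y[:, np.newaxis]))

# Save to a temporary file in the same format as the notebook cell.
path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.dat")
np.savetxt(path, Xy, fmt="%s")

# Load the file and split it back into attributes and labels.
loaded = np.loadtxt(path)
X_back = loaded[:, :-1]
y_back = loaded[:, -1].astype(int)

print(X_back.shape, y_back)
```

Because `np.hstack` promotes everything to a common dtype, the labels are stored as floats and need the cast back to `int` after loading.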