### Childhood Autistic Spectrum Disorder Screening using Machine Learning

The early diagnosis of neurodevelopmental disorders can improve treatment and significantly decrease the associated 
healthcare costs. In this project, we use supervised learning to screen for Autistic Spectrum Disorder 
(ASD) based on behavioural features and individual characteristics.

In [1]:
import sys
import pandas as pd
import sklearn
import keras

print('Python: {}'.format(sys.version))
print('Pandas: {}'.format(pd.__version__))
print('Sklearn: {}'.format(sklearn.__version__))
print('Keras: {}'.format(keras.__version__))

Using TensorFlow backend.


Python: 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) 
[Clang 6.0 (clang-600.0.57)]
Pandas: 1.0.1
Sklearn: 0.22.1
Keras: 2.3.1


### 1. Importing the Dataset

In [2]:
url = 'Autism-Child-Data.arff'

column_names = ['a1_score', 'a2_score', 'a3_score', 'a4_score', 'a5_score', 
                'a6_score', 'a7_score', 'a8_score', 'a9_score', 'a10_score',
                'age', 'gender', 'ethnicity', 'jundice', 'austim',
                'country_of_res', 'used_app_before', 'result', 'age_desc', 'relation',
                'class']

data = pd.read_csv(url, names=column_names)
data

Unnamed: 0,a1_score,a2_score,a3_score,a4_score,a5_score,a6_score,a7_score,a8_score,a9_score,a10_score,...,gender,ethnicity,jundice,austim,country_of_res,used_app_before,result,age_desc,relation,class
0,1,1,0,0,1,1,0,1,0,0,...,m,Others,no,no,Jordan,no,5,'4-11 years',Parent,NO
1,1,1,0,0,1,1,0,1,0,0,...,m,'Middle Eastern ',no,no,Jordan,no,5,'4-11 years',Parent,NO
2,1,1,0,0,0,1,1,1,0,0,...,m,?,no,no,Jordan,yes,5,'4-11 years',?,NO
3,0,1,0,0,1,1,0,0,0,1,...,f,?,yes,no,Jordan,no,4,'4-11 years',?,NO
4,1,1,1,1,1,1,1,1,1,1,...,m,Others,yes,no,'United States',no,10,'4-11 years',Parent,YES
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,1,1,1,1,1,1,1,1,1,1,...,f,White-European,yes,yes,'United Kingdom',no,10,'4-11 years',Parent,YES
288,1,0,0,0,1,0,1,0,0,1,...,f,White-European,yes,yes,Australia,no,4,'4-11 years',Parent,NO
289,1,0,1,1,1,1,1,0,0,1,...,m,Latino,no,no,Brazil,no,7,'4-11 years',Parent,YES
290,1,1,1,0,1,1,1,1,1,1,...,m,'South Asian',no,no,India,no,9,'4-11 years',Parent,YES


Let's look into the features and finally the target output:

The attributes A1_Score to A10_Score hold the answer codes for the ten questions of the screening method used. It is unknown which questions were asked, but each question is binary, so every answer is either 0 or 1.

The attribute age is numeric, and gender is either m or f. The attribute ethnicity takes one of the values {Others, 'Middle Eastern', White-European, Black, 'South Asian', Asian, Pasifika, Hispanic, Turkish, Latino}; missing entries appear as ? in the data. The attribute jundice (spelled this way in the dataset) records whether the patient was born with jaundice. The attribute austim (also spelled this way in the dataset) records whether a family member has a pervasive developmental disorder (PDD). The country of residence is one of:

{Jordan, 'United States', Egypt, 'United Kingdom', Bahrain, Austria, Kuwait, 'United Arab Emirates', Europe, Malta, Bulgaria, 'South Africa', India, Afghanistan, Georgia, 'New Zealand', Syria, Iraq, Australia, 'Saudi Arabia', Armenia, Turkey, Pakistan, Canada, Oman, Brazil, 'South Korea', 'Costa Rica', Sweden, Philippines, Malaysia, Argentina, Japan, Bangladesh, Qatar, Ireland, Romania, Netherlands, Lebanon, Germany, Latvia, Russia, Italy, China, Nigeria, 'U.S. Outlying Islands', Nepal, Mexico, 'Isle of Man', Libya, Ghana, Bhutan}

It is also documented whether the patient has used the screening app before. The age description categorizes the age of the person; it is redundant because the numeric age is already available as a feature. The class states whether a person has ASD or not.

The meaning of two attributes is not clear to me: <br>
@attribute result numeric <br> 
@attribute relation {Parent,Self,Relative,'Health care professional',self}.
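
One way to probe the `result` attribute is to compare it with the sum of the ten answer codes. In the rows displayed above the two agree (row 0, for example, has five answers equal to 1 and a `result` of 5), so `result` may simply be the total screening score. The sketch below checks this on a hand-copied subset of the displayed rows. The frame `sample` and the helper `result_matches_score_sum` are illustrative names, not part of the dataset or of this notebook.

```python
import pandas as pd

score_cols = ['a{}_score'.format(i) for i in range(1, 11)]

# Rows 0, 3, 4 and 289 copied by hand from the table printed above:
# ten answer codes, then `result`, then `class`.
rows = [
    [1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 5, 'NO'],
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 4, 'NO'],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10, 'YES'],
    [1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 7, 'YES'],
]
sample = pd.DataFrame(rows, columns=score_cols + ['result', 'class'])

def result_matches_score_sum(frame):
    """Return True if `result` equals the row-wise sum of the ten scores."""
    return bool((frame[score_cols].sum(axis=1) == frame['result']).all())

print(result_matches_score_sum(sample))  # True for these four rows
```

Calling `result_matches_score_sum(data)` on the full frame would show whether the pattern holds for all 292 rows. If it does, the class label may be closely tied to the answer codes, which is worth keeping in mind when interpreting very high accuracies.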

In [3]:
print('Shape of DataFrame: {}'.format(data.shape))
print(data.loc[0])

Shape of DataFrame: (292, 21)
a1_score                      1
a2_score                      1
a3_score                      0
a4_score                      0
a5_score                      1
a6_score                      1
a7_score                      0
a8_score                      1
a9_score                      0
a10_score                     0
age                           6
gender                        m
ethnicity                Others
jundice                      no
austim                       no
country_of_res           Jordan
used_app_before              no
result                        5
age_desc           '4-11 years'
relation                 Parent
class                        NO
Name: 0, dtype: object


We can use the `describe()` function to summarize the numeric columns of the DataFrame. Because we do not know which questions stand behind the a1 to a10 scores, the statistics are hard to interpret.

In [5]:
data.describe()

Unnamed: 0,a1_score,a2_score,a3_score,a4_score,a5_score,a6_score,a7_score,a8_score,a9_score,a10_score,result
count,292.0,292.0,292.0,292.0,292.0,292.0,292.0,292.0,292.0,292.0,292.0
mean,0.633562,0.534247,0.743151,0.55137,0.743151,0.712329,0.606164,0.496575,0.493151,0.726027,6.239726
std,0.482658,0.499682,0.437646,0.498208,0.437646,0.453454,0.489438,0.500847,0.500811,0.446761,2.284882
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
50%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,6.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0


### 2. Data Preprocessing

1. Drop the columns that we don't need: result and age_desc (and possibly the relation column).
2. Convert the categorical features into numeric columns.
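
The printed rows show a few irregularities that one-hot encoding would otherwise turn into separate columns: missing values are written as `?`, some strings carry quotes and trailing spaces (`'Middle Eastern '`), and `relation` contains both `Self` and `self`. In addition, `age` is absent from the `describe()` output, which suggests it was read as text rather than as a number. Below is a minimal cleaning sketch, assuming these are artifacts of the file format rather than meaningful categories. The helper `clean_categories` and the toy frame are hypothetical.

```python
import pandas as pd

def clean_categories(frame):
    """Normalize text columns and parse `age` as a number (illustrative helper)."""
    out = frame.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Strip whitespace and surrounding quotes,
            # e.g. "'Middle Eastern '" becomes "Middle Eastern".
            s = out[col].str.strip().str.strip("'").str.strip()
            # Treat '?' as a missing value (NaN).
            out[col] = s.mask(s == '?')
    # Merge the two spellings 'self' and 'Self'.
    out['relation'] = out['relation'].replace({'self': 'Self'})
    # Values that cannot be parsed as numbers become NaN.
    out['age'] = pd.to_numeric(out['age'], errors='coerce')
    return out

toy = pd.DataFrame({
    'age': ['6', '5', '?'],
    'ethnicity': ["'Middle Eastern '", '?', 'Others'],
    'relation': ['Parent', 'self', 'Self'],
})
cleaned = clean_categories(toy)
print(cleaned)
```

How to handle the resulting missing values (dropping rows, imputing, or keeping an explicit "unknown" category) is a separate modelling decision.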

In [6]:
data = data.drop(['result', 'age_desc'], axis=1)

In [7]:
data

Unnamed: 0,a1_score,a2_score,a3_score,a4_score,a5_score,a6_score,a7_score,a8_score,a9_score,a10_score,age,gender,ethnicity,jundice,austim,country_of_res,used_app_before,relation,class
0,1,1,0,0,1,1,0,1,0,0,6,m,Others,no,no,Jordan,no,Parent,NO
1,1,1,0,0,1,1,0,1,0,0,6,m,'Middle Eastern ',no,no,Jordan,no,Parent,NO
2,1,1,0,0,0,1,1,1,0,0,6,m,?,no,no,Jordan,yes,?,NO
3,0,1,0,0,1,1,0,0,0,1,5,f,?,yes,no,Jordan,no,?,NO
4,1,1,1,1,1,1,1,1,1,1,5,m,Others,yes,no,'United States',no,Parent,YES
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,1,1,1,1,1,1,1,1,1,1,7,f,White-European,yes,yes,'United Kingdom',no,Parent,YES
288,1,0,0,0,1,0,1,0,0,1,7,f,White-European,yes,yes,Australia,no,Parent,NO
289,1,0,1,1,1,1,1,0,0,1,4,m,Latino,no,no,Brazil,no,Parent,YES
290,1,1,1,0,1,1,1,1,1,1,4,m,'South Asian',no,no,India,no,Parent,YES


In [8]:
X = data.drop(['class'], axis=1)  # pass axis by keyword; the positional form is deprecated in newer pandas
y = data['class']

In [9]:
X = pd.get_dummies(X) # turn categorical columns into multiple {1,0} numeric columns
X

Unnamed: 0,a1_score,a2_score,a3_score,a4_score,a5_score,a6_score,a7_score,a8_score,a9_score,a10_score,...,country_of_res_Syria,country_of_res_Turkey,used_app_before_no,used_app_before_yes,relation_'Health care professional',relation_?,relation_Parent,relation_Relative,relation_Self,relation_self
0,1,1,0,0,1,1,0,1,0,0,...,0,0,1,0,0,0,1,0,0,0
1,1,1,0,0,1,1,0,1,0,0,...,0,0,1,0,0,0,1,0,0,0
2,1,1,0,0,0,1,1,1,0,0,...,0,0,0,1,0,1,0,0,0,0
3,0,1,0,0,1,1,0,0,0,1,...,0,0,1,0,0,1,0,0,0,0
4,1,1,1,1,1,1,1,1,1,1,...,0,0,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,1,1,1,1,1,1,1,1,1,1,...,0,0,1,0,0,0,1,0,0,0
288,1,0,0,0,1,0,1,0,0,1,...,0,0,1,0,0,0,1,0,0,0
289,1,0,1,1,1,1,1,0,0,1,...,0,0,1,0,0,0,1,0,0,0
290,1,1,1,0,1,1,1,1,1,1,...,0,0,1,0,0,0,1,0,0,0


In [12]:
print("Now we have {} features".format(len(X.columns.values)))

Now we have 96 features


In [37]:
Y = pd.get_dummies(y)

In [38]:
import numpy as np

Y = Y['YES']
Y = np.expand_dims(Y, axis=-1)

In [40]:
Y.shape

(292, 1)
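
The same 0/1 target can also be built without one-hot encoding first, by comparing the labels to `'YES'` directly. A small sketch on a toy label series (the names `labels` and `Y_demo` are illustrative):

```python
import pandas as pd

labels = pd.Series(['NO', 'YES', 'NO', 'YES'], name='class')

# 1 where the label is 'YES', 0 otherwise, reshaped to (n_samples, 1).
Y_demo = (labels == 'YES').astype(int).to_numpy().reshape(-1, 1)

print(Y_demo.shape)    # (4, 1)
print(Y_demo.ravel())  # [0 1 0 1]
```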

### 3. Split the Dataset into Training and Testing Datasets

In [41]:
from sklearn import model_selection

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.2)

In [42]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(233, 96)
(59, 96)
(233, 1)
(59, 1)
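
The split above is drawn at random on every run, so the shapes stay the same but the selected rows, and with them the class balance of the test set, can change. Here is a hedged sketch of a reproducible, stratified variant using the `random_state` and `stratify` parameters of `train_test_split`. The toy arrays and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 5 features, 30% positive labels.
X_demo = np.arange(500).reshape(100, 5)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # keep the class ratio in both subsets
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With stratification, both subsets keep the 30% share of positive labels from the toy data.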


### 4. Building the Network - Keras

In [47]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

def create_model():
    model = Sequential()
    model.add(Dense(8, input_dim=96, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    adam = Adam(lr=0.001)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

model = create_model()

print(model.summary())

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 8)                 776       
_________________________________________________________________
dense_14 (Dense)             (None, 4)                 36        
_________________________________________________________________
dense_15 (Dense)             (None, 1)                 5         
Total params: 817
Trainable params: 817
Non-trainable params: 0
_________________________________________________________________
None
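
The parameter counts in the summary follow from the layer sizes: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short check of the numbers above (the helper name `dense_params` is made up for this example):

```python
def dense_params(n_in, n_out):
    """Number of weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [96, 8, 4, 1]  # input features, then units per Dense layer
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [776, 36, 5]
print(sum(counts))  # 817
```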


### 5. Training the Network

The training accuracy exceeds 99% after about 25 epochs, while the loss keeps decreasing afterwards. Whether the model is overfitting can only be judged once it is evaluated on the test data.

In [48]:
model.fit(X_train, Y_train, epochs=50, batch_size=10, verbose = 1)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.callbacks.History at 0x139bedd90>
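
One way to watch for overfitting during training, rather than only afterwards, is to hold out part of the training data with the `validation_split` argument of `fit` and compare the training and validation curves stored in the returned `History` object. The sketch below uses synthetic data and a smaller network purely to show the mechanics. The arrays `X_demo` and `y_demo`, the labelling rule, and the epoch count are assumptions, not the notebook's setup.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Synthetic stand-in data: 200 samples with 96 binary features.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(200, 96)).astype('float32')
# Arbitrary rule for the toy label: more than five of the first ten features set.
y_demo = (X_demo[:, :10].sum(axis=1) > 5).astype('float32')

model_demo = Sequential()
model_demo.add(Dense(8, input_dim=96, activation='relu'))
model_demo.add(Dense(1, activation='sigmoid'))
model_demo.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# The last 20% of the samples are held out and evaluated after every epoch.
history = model_demo.fit(X_demo, y_demo, epochs=3, batch_size=10,
                         validation_split=0.2, verbose=0)

print(sorted(history.history.keys()))
print(history.history['val_loss'])
```

A validation loss that rises while the training loss keeps falling would be a typical sign of overfitting.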

### 6. Testing and Performance Metrics

In [58]:
from sklearn.metrics import classification_report, accuracy_score

predictions = model.predict_classes(X_test)

In [57]:
print('Results for Binary Model')
print(accuracy_score(Y_test, predictions))
print(classification_report(Y_test, predictions))

Results for Binary Model
0.9491525423728814
              precision    recall  f1-score   support

           0       0.96      0.93      0.95        28
           1       0.94      0.97      0.95        31

    accuracy                           0.95        59
   macro avg       0.95      0.95      0.95        59
weighted avg       0.95      0.95      0.95        59
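
The report is consistent with exactly three misclassified test samples: an accuracy of 0.949 on 59 samples corresponds to 56 correct predictions. The counts below are inferred from the rounded precision and recall values and are not printed by the notebook, so treat them as a reconstruction. The sketch recomputes the reported metrics from them.

```python
# Inferred counts, not notebook output.
tn, fp = 26, 2   # true class 0 (28 samples): predicted 0 / predicted 1
fn, tp = 1, 30   # true class 1 (31 samples): predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(total, round(accuracy, 4))                  # 59 0.9492
print(round(precision_0, 2), round(recall_0, 2))  # 0.96 0.93
print(round(precision_1, 2), round(recall_1, 2))  # 0.94 0.97
```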



### 7. Categorical Model

In [63]:
Y = pd.get_dummies(y)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.2)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

def create_model():
    model = Sequential()
    model.add(Dense(8, input_dim=96, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    model.add(Dense(2, activation='softmax'))  # softmax yields class probabilities, matching categorical_crossentropy
    
    adam = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

model = create_model()
print(model.summary())

model.fit(X_train, Y_train, epochs=50, batch_size=10, verbose = 1)

predictions = model.predict_classes(X_test)

print('Results for Categorical Model')
print(accuracy_score(Y_test[['YES']], predictions))
print(classification_report(Y_test[['YES']], predictions))

(233, 96)
(59, 96)
(233, 2)
(59, 2)
Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_19 (Dense)             (None, 8)                 776       
_________________________________________________________________
dense_20 (Dense)             (None, 4)                 36        
_________________________________________________________________
dense_21 (Dense)             (None, 2)                 10        
Total params: 822
Trainable params: 822
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 