## Autism Detection

Autism Spectrum Disorder (ASD) refers to a group of neurodevelopmental disorders that includes classical autism, Asperger syndrome, and pervasive developmental disorder not otherwise specified (PDD-NOS), all of which are characterized by impaired social skills, repetitive behaviors, sensory issues, and language delay. ASD affects more than 1% of the population, with a marked sex disparity: males are 4 to 5 times more likely to be affected than females.
The significant heterogeneity of ASD makes it challenging to create diagnostic criteria that quickly identify affected children and guide the choice of treatment. Diagnosis currently entails lengthy processes, including examinations by many specialists and the use of behavioural assessment tools. Machine learning approaches might considerably speed up diagnosis by helping to identify the underlying brain mechanisms.
If the data allow it, it would also be interesting to identify how the disorder manifests differently in females and males. Females have been reported to show a different neuropathology, which could be one reason why they are less affected, or they may simply go undiagnosed under the current male-based criteria.

In [74]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer, StandardScaler, OneHotEncoder
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, KFold
#from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from pprint import pprint
#from keras.utils import to_categorical
from time import time
from IPython.display import display # Allows the use of display() for DataFrames


Road Map:

Step 0: Import Dataset.

Step 1: Clean Datasets (The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers).

Step 2: A quick visualization with Seaborn.

Step 3: Apply several Supervised Machine Learning (SML) techniques to the data for classification.

Step 4: Experiment with different topologies, optimizers, and hyperparameters for the different models.

Step 5: Model tuning.

Step 6: Feature Selection.

Step 7: Build a Multi-Layer Perceptron and train it to classify individuals with ASD based on their features.

Step 8: Conclusion.
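
As a rough illustration of what Steps 3 to 5 could look like, here is a minimal sketch of evaluating one supervised classifier with k-fold cross-validation. The data is synthetic, and the choice of logistic regression, the pipeline layout, and all parameter values are assumptions made for this example, not the models used later in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and binary labels (not ABIDE data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Scaling lives inside the pipeline so it is re-fitted on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", scores.mean())
```

Keeping the scaler inside the pipeline avoids leaking statistics from the held-out fold into training, which matters when comparing several models on a dataset of this size.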

In [130]:
# STEP 0: Import the dataset

df_pheno = pd.read_csv('ABIDEII_Composite_Phenotypic.csv', encoding='unicode_escape')
display(df_pheno.head(n=5))
# Total number of columns
columns_pheno = df_pheno.columns
print("columns", columns_pheno.size)
# DataFrame.size is the number of cells (rows * columns), not the number of rows
print("total", df_pheno.size)

Unnamed: 0,SITE_ID,SUB_ID,NDAR_GUID,DX_GROUP,PDD_DSM_IV_TR,ASD_DSM_5,AGE_AT_SCAN,SEX,HANDEDNESS_CATEGORY,HANDEDNESS_SCORES,...,ADI_R_C4_UNUSUAL_SENSORY_INTERESTS,ADI_R_C4_TOTAL,ADI_R_D_AGE_PARENT_NOTICED,ADI_R_D_AGE_FIRST_SINGLE_WORDS,ADI_R_D_AGE_FIRST_PHRASES,ADI_R_D_AGE_WHEN_ABNORMALITY,ADI_R_D_INTERVIEWER_JUDGMENT,Unnamed: 348,Unnamed: 349,Unnamed: 350
0,ABIDEII-OILH_2,28675,NDARKA309TA8,1,,,21.0,1,1.0,70.0,...,,,,,,,,,,
1,ABIDEII-OILH_2,28676,NDARZA591PDJ,1,,,21.0,1,1.0,60.0,...,,,,,,,,,,
2,ABIDEII-OILH_2,28677,NDARCP486GRR,1,,,24.0,1,1.0,80.0,...,,,,,,,,,,
3,ABIDEII-OILH_2,28678,NDARXG517VHQ,1,,,24.0,2,3.0,40.0,...,,,,,,,,,,
4,ABIDEII-OILH_2,28679,NDARUF669PBM,1,,,20.0,1,1.0,100.0,...,,,,,,,,,,


columns 351
total 426465
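
The "total" of 426465 is the number of cells, i.e. 1215 rows times 351 columns, and the header shows trailing `Unnamed:` columns that look empty in the preview. The sketch below shows one way to report rows and columns separately and to drop columns that contain no data at all. It runs on a small made-up frame; whether the real `Unnamed:` columns are entirely empty is an assumption that should be checked on the actual file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_pheno (values are made up for illustration).
toy = pd.DataFrame({
    "SUB_ID": [1, 2, 3],
    "DX_GROUP": [1, 2, 1],
    "Unnamed: 348": [np.nan, np.nan, np.nan],
    "Unnamed: 349": [np.nan, 0.5, np.nan],
})

n_rows, n_cols = toy.shape   # rows and columns reported separately
n_cells = toy.size           # rows * columns

# Drop only the columns that are entirely empty; partially filled ones are kept.
cleaned = toy.dropna(axis=1, how="all")

print("rows", n_rows, "columns", n_cols, "cells", n_cells)
print("columns after dropping empty ones:", list(cleaned.columns))
```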


In [131]:
# Total number of records
n_records = len(df_pheno.index)

# Number of records for individuals with ASD
n_asd_yes = len(df_pheno[df_pheno['DX_GROUP'] == 1])

# Number of records for individuals without ASD (controls)
n_asd_no = len(df_pheno[df_pheno['DX_GROUP'] == 2])

# Percentage of individuals with ASD
yes_percent = float(n_asd_yes) / n_records * 100



# Print the results
print("Total number of records:", n_records)
print("Individuals diagnosed with ASD: {}".format(n_asd_yes))
print("Individuals not diagnosed with ASD: {}".format(n_asd_no))
print("Percentage of individuals diagnosed with ASD: {:.2f}%".format(yes_percent))

df_pheno.describe()  # not displayed here, since it is not the last expression in the cell

#Total Males
males = df_pheno[df_pheno['SEX'] == 1]
print("males:", len(males))
## Autistic Males
males_autistic = males[males['DX_GROUP'] == 1]
print("males_autistic:", len(males_autistic))
## Control Males
males_control = males[males['DX_GROUP'] == 2]
print("males_control:", len(males_control))

#Total Females
females = df_pheno[df_pheno['SEX'] == 2]
print("females:", len(females))
## Autistic Females
females_autistic = females[females['DX_GROUP'] == 1]
print("females_autistic:", len(females_autistic))
## Control Females
females_controls = females[females['DX_GROUP'] == 2]
print("females_controls:", len(females_controls))


Total number of records: 1215
Individuals diagnosed with ASD: 566
Individuals not diagnosed with ASD: 649
Percentage of individuals diagnosed with ASD: 46.58%
males: 938
males_autistic: 485
males_control: 453
females: 277
females_autistic: 81
females_controls: 196
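
The same breakdown can be obtained in one step with a cross-tabulation. The sketch below uses a small made-up frame with the notebook's coding (SEX: 1=male, 2=female; DX_GROUP: 1=autism, 2=control); on the real data the counts above correspond to roughly 52% of males (485 of 938) and 29% of females (81 of 277) being in the ASD group.

```python
import pandas as pd

# Toy data using the same coding as the phenotypic file (values are made up).
toy = pd.DataFrame({
    "SEX": [1, 1, 1, 2, 2, 1],
    "DX_GROUP": [1, 2, 1, 2, 1, 1],
})

# Counts per sex and diagnostic group, with row and column totals ("All").
counts = pd.crosstab(toy["SEX"], toy["DX_GROUP"], margins=True)
print(counts)

# Share of each sex that falls in the ASD group (DX_GROUP == 1).
asd_rate_by_sex = toy["DX_GROUP"].eq(1).groupby(toy["SEX"]).mean()
print(asd_rate_by_sex)
```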


Featureset Exploration:

The dataset contains 1215 instances with the following attributes, among others:

SUB_ID: Numeric (ABIDE II Unique ID Number)

DX_GROUP: Numeric (Diagnostic Group: 1=Autism; 2=Control)

PDD_DSM_IV_TR: Numeric (DSM-IV-TR PDD Category: 0=none; 1=Autism; 2=Aspergers; 3=PDD-NOS)

ASD_DSM_5: Numeric (DSM-5 ASD Category: 0=none; 1=ASD)

AGE_AT_SCAN: Numeric (Age in years)

SEX: Numeric (Gender: 1=Male; 2=Female)

HANDEDNESS_CATEGORY: Numeric (Handedness Category: 1=right handed; 2=left handed; 3=mixed handed)

HANDEDNESS_SCORES: Numeric (Handedness Scores: right handed: scores ≥ 50; left handed: scores ≤ -50; mixed handed: scores between -50 and 50)

FIQ: Numeric (Full IQ Standard Score: Differential Ability Scale (DAS); Kaufman Brief Intelligence Test (KBIT-2); Wechsler Adult Intelligence Scale (WAIS); Wechsler Abbreviated Scale of Intelligence (WASI); Wechsler Intelligence Scale for Children (WISC); Wechsler Preschool and Primary Scale of Intelligence (WPPSI-III))

VIQ: Numeric (Verbal IQ Standard Score: DAS; WAIS; WASI; WISC; WPPSI-III)

PIQ: Numeric (Performance IQ Standard Score: DAS; Raven's Progressive Matrices (Raven); Snijders-Oomen Nonverbal Intelligence (SON-R); WAIS; WASI; WISC; WPPSI-III)

<!-- FIQ_TEST_TYPE: String (IQ Test Used for Full Scale IQ)

VIQ_TEST_TYPE: String (IQ Test Used for Verbal IQ)
PIQ_TEST_TYPE: String (IQ Test Used for Performance IQ) -->
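
The handedness thresholds listed above can be written as a small function, which is handy for checking that HANDEDNESS_CATEGORY and HANDEDNESS_SCORES agree. This is a minimal sketch: the function name is hypothetical, and returning None for a missing score is an assumption of this example.

```python
import math
from typing import Optional


def handedness_category(score: float) -> Optional[int]:
    """Map a handedness score to 1=right, 2=left, 3=mixed (None if missing)."""
    if score is None or math.isnan(score):
        return None      # missing score: no category can be derived
    if score >= 50:
        return 1         # right handed
    if score <= -50:
        return 2         # left handed
    return 3             # mixed handed: strictly between -50 and 50


for score in (70.0, 40.0, -60.0, -50.0):
    print(score, "->", handedness_category(score))
```

The four scores used here appear in the preview rows of the dataset, where they are paired with categories 1, 3, 2 and 2 respectively, matching the rule.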

## Filtering the Data

In [90]:
# #Filters
# ## Age Filter 
# over_40 = df_pheno[df_pheno['AGE_AT_SCAN '] > 40]
# below_40 = df_pheno[df_pheno['AGE_AT_SCAN '] <= 40]
# print("Record over_40:", len(over_40))
# print("Record below_40:", len(below_40))
# print("female over_40:", len(females[females['AGE_AT_SCAN '] > 40]))
# print("female below_40:", len(females[females['AGE_AT_SCAN '] <= 40]))
# print("male over_40:", len(males[males['AGE_AT_SCAN '] > 40]))
# print("male below_40:", len(males[males['AGE_AT_SCAN '] <= 40]))

# ## IQ Filter
# normal_iq = df_pheno[df_pheno['FIQ'] > 80]
# below_avg_iq = df_pheno[df_pheno['FIQ'] <= 80]
# print("Record normal_iq:", len(normal_iq))
# print("Record below_avg_iq:", len(below_avg_iq))
# print("female normal_iq:", len(normal_iq[normal_iq['SEX'] == 2]))
# print("female below_avg_iq:", len(below_avg_iq[below_avg_iq['SEX'] == 2]))
# print("male normal_iq:", len(normal_iq[normal_iq['SEX'] == 1]))
# print("male below_avg_iq:", len(below_avg_iq[below_avg_iq['SEX'] == 1]))

## Data preprocessing

Before data can be used as input for machine learning algorithms, it must be cleaned, formatted, and possibly restructured; this is typically known as preprocessing. This dataset contains many invalid or missing entries that we must deal with, and some features have properties that must be adjusted. Preprocessing can substantially improve the outcome and predictive power of nearly all learning algorithms.

We use the optional na_values parameter of read_csv to convert missing data (indicated by a ?) into NaN:

In [150]:
filter_data = pd.read_csv('ABIDEII_Composite_Phenotypic.csv', na_values=['?'])
# # The PDD_DSM_IV_TR and ASD_DSM_5 fields contain categorical data; since they have a considerable number of missing values, NaNs are replaced by the string 'n/a' (not available) as a new category
# filter_data.PDD_DSM_IV_TR = filter_data.PDD_DSM_IV_TR.fillna('n/a')
# filter_data.ASD_DSM_5 = filter_data.ASD_DSM_5.fillna('n/a')
# filter_data = filter_data.fillna(filter_data.mean()) # should replace with Standard Deviation?
display(filter_data.head(n=5))

Unnamed: 0,SITE_ID,SUB_ID,NDAR_GUID,DX_GROUP,PDD_DSM_IV_TR,ASD_DSM_5,AGE_AT_SCAN,SEX,HANDEDNESS_CATEGORY,HANDEDNESS_SCORES,...,ADI_R_C4_UNUSUAL_SENSORY_INTERESTS,ADI_R_C4_TOTAL,ADI_R_D_AGE_PARENT_NOTICED,ADI_R_D_AGE_FIRST_SINGLE_WORDS,ADI_R_D_AGE_FIRST_PHRASES,ADI_R_D_AGE_WHEN_ABNORMALITY,ADI_R_D_INTERVIEWER_JUDGMENT,Unnamed: 348,Unnamed: 349,Unnamed: 350
0,ABIDEII-OILH_2,28675,NDARKA309TA8,1,,,21.0,1,1.0,70.0,...,,,,,,,,,,
1,ABIDEII-OILH_2,28676,NDARZA591PDJ,1,,,21.0,1,1.0,60.0,...,,,,,,,,,,
2,ABIDEII-OILH_2,28677,NDARCP486GRR,1,,,24.0,1,1.0,80.0,...,,,,,,,,,,
3,ABIDEII-OILH_2,28678,NDARXG517VHQ,1,,,24.0,2,3.0,40.0,...,,,,,,,,,,
4,ABIDEII-OILH_2,28679,NDARUF669PBM,1,,,20.0,1,1.0,100.0,...,,,,,,,,,,
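
To see what na_values changes, here is a small self-contained sketch on made-up CSV text (the three rows are illustrative, not taken from the dataset). Without na_values, a "?" placeholder forces the whole column to be read as text; with it, the column is parsed as numeric and the placeholder becomes NaN. The last step reports the fraction of missing values per column, which helps decide which columns are worth keeping or imputing.

```python
import io

import pandas as pd

# Made-up CSV: one "?" placeholder and one genuinely empty field.
csv_text = "SUB_ID,DX_GROUP,FIQ\n1,1,?\n2,2,104\n3,1,\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# With the "?" left in place the column is not numeric.
print("raw FIQ numeric:", pd.api.types.is_numeric_dtype(raw["FIQ"]))
print("parsed FIQ numeric:", pd.api.types.is_numeric_dtype(parsed["FIQ"]))

# Fraction of missing values per column, most incomplete first.
missing_fraction = parsed.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```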


In [149]:
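#Filter Data Set
# Keep subjects aged 40 or younger with an FIQ of at least 80. Rows where FIQ is NaN
# fail the >= comparison, so they are already excluded by this mask.
selection = filter_data[(filter_data['AGE_AT_SCAN '] <= 40) & (filter_data['FIQ'] >= 80)]
selection = selection.dropna(subset=['FIQ','DX_GROUP'], axis=0) # drop the row if FIQ or DX_GROUP is null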
# columns_sel = selection.columns
# print("columns", columns_sel.size) # 2 null FIQ (351-349)
# selection = selection[['SEX','DX_GROUP','SUB_ID']]
## Total Females-Males
females = selection[selection['SEX'] == 2]
males = selection[selection['SEX'] == 1]
print("females", len(females))
print("males", len(males))
## Autistic Females-Males
females_autistic = females[females['DX_GROUP'] == 1]
males_autistic = males[males['DX_GROUP'] == 1]
males_control = males[males['DX_GROUP'] == 2]
print("females_autistic", len(females_autistic))
print("males_autistic", len(males_autistic))
print("males_control", len(males_control))
# Balance the sample against the females: the target is roughly len(females_autistic)
# control males and len(females) - len(females_autistic) autistic males.
# Randomly remove 240 autistic males and 330 control males (hard-coded sizes, unseeded).
ma_to_remove = np.random.choice(males_autistic.index, size=240, replace=False)
mc_to_remove = np.random.choice(males_control.index, size=330, replace=False)
selection = selection.drop(ma_to_remove)
selection = selection.drop(mc_to_remove)
labels = selection[['SUB_ID','DX_GROUP']]
# labels_dict = labels.set_index('SUB_ID').T.to_dict('list')
display(selection.head(n=20))

females 239
males 807
females_autistic 65
males_autistic 410
males_control 397


Unnamed: 0,SITE_ID,SUB_ID,NDAR_GUID,DX_GROUP,PDD_DSM_IV_TR,ASD_DSM_5,AGE_AT_SCAN,SEX,HANDEDNESS_CATEGORY,HANDEDNESS_SCORES,...,ADI_R_C4_UNUSUAL_SENSORY_INTERESTS,ADI_R_C4_TOTAL,ADI_R_D_AGE_PARENT_NOTICED,ADI_R_D_AGE_FIRST_SINGLE_WORDS,ADI_R_D_AGE_FIRST_PHRASES,ADI_R_D_AGE_WHEN_ABNORMALITY,ADI_R_D_INTERVIEWER_JUDGMENT,Unnamed: 348,Unnamed: 349,Unnamed: 350
3,ABIDEII-OILH_2,28678,NDARXG517VHQ,1,,,24.0,2,3.0,40.0,...,,,,,,,,,,
5,ABIDEII-OILH_2,28680,NDARCL019RH1,1,,,23.0,2,1.0,100.0,...,,,,,,,,,,
6,ABIDEII-OILH_2,28681,NDARAZ912TJD,1,,,18.0,1,2.0,-60.0,...,,,,,,,,,,
9,ABIDEII-OILH_2,28684,NDARFE109KR6,1,,,19.0,1,1.0,80.0,...,,,,,,,,,,
10,ABIDEII-OILH_2,28685,NDARMY489LHC,1,,,27.0,2,2.0,-50.0,...,,,,,,,,,,
14,ABIDEII-OILH_2,28689,NDARCA181NBN,1,,,18.0,1,1.0,100.0,...,,,,,,,,,,
15,ABIDEII-OILH_2,28690,NDARDZ103BFY,1,,,25.0,1,3.0,-10.0,...,,,,,,,,,,
17,ABIDEII-OILH_2,28692,NDARUU049YGD,1,,,19.0,1,1.0,100.0,...,,,,,,,,,,
18,ABIDEII-OILH_2,28693,NDARBM216HWV,1,,,24.0,2,2.0,-100.0,...,,,,,,,,,,
20,ABIDEII-OILH_2,28695,NDARNA878NR1,1,,,18.0,1,1.0,100.0,...,,,,,,,,,,
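
A note on how the age and IQ mask treats missing values: any comparison against NaN evaluates to False, so rows with a missing FIQ never pass `FIQ >= 80`. The sketch below demonstrates this on a made-up frame. It also keeps the trailing space in the 'AGE_AT_SCAN ' key used by the notebook and shows one way to normalise such headers; renaming the columns is a suggestion, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame; note the trailing space in the age column name, as in the notebook.
toy = pd.DataFrame({
    "AGE_AT_SCAN ": [21.0, 45.0, 30.0, 18.0],
    "FIQ": [95.0, 110.0, np.nan, 70.0],
})

mask = (toy["AGE_AT_SCAN "] <= 40) & (toy["FIQ"] >= 80)
kept = toy[mask]
print(kept)

# Optional: strip stray whitespace from headers so keys are easier to type.
stripped = toy.rename(columns=lambda name: name.strip())
print(list(stripped.columns))
```

Only the first row survives: the second is too old, the third has a missing FIQ, and the fourth has an FIQ below 80.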


In [135]:
females = selection[selection['SEX'] == 2]
males = selection[selection['SEX'] == 1]
females_autistic = females[females['DX_GROUP'] == 1]
females_control = females[females['DX_GROUP'] == 2]
males_autistic = males[males['DX_GROUP'] == 1]
males_control = males[males['DX_GROUP'] == 2]
print("females_autistic", len(females_autistic))
print("females_control", len(females_control))
print("males_autistic", len(males_autistic))
print("males_control", len(males_control))

females_autistic 65
females_control 174
males_autistic 170
males_control 67
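
The subsampling above removes 240 autistic and 330 control males using hard-coded sizes and an unseeded random draw, which leaves 170 and 67 males against 174 control and 65 autistic females. The sketch below derives the removal sizes from the female counts and uses a seeded generator so the selection is reproducible. The function name, the seed, and the exact mirroring rule (autistic males matched to control females and vice versa) are my reading of the comment in the cell and should be treated as assumptions.

```python
import numpy as np
import pandas as pd


def mirror_male_counts(selection: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Subsample males so their group sizes mirror the female group sizes."""
    rng = np.random.default_rng(seed)
    females = selection[selection["SEX"] == 2]
    males = selection[selection["SEX"] == 1]
    n_f_asd = int((females["DX_GROUP"] == 1).sum())
    n_f_ctl = int((females["DX_GROUP"] == 2).sum())

    # Keep as many autistic males as control females, and vice versa.
    targets = {1: n_f_ctl, 2: n_f_asd}
    to_drop = []
    for dx, n_keep in targets.items():
        idx = males.index[males["DX_GROUP"] == dx].to_numpy()
        n_remove = max(len(idx) - n_keep, 0)
        to_drop.extend(rng.choice(idx, size=n_remove, replace=False))
    return selection.drop(index=to_drop)


# Toy data: 2 autistic + 3 control females, 6 autistic + 5 control males.
toy = pd.DataFrame({
    "SEX": [2] * 5 + [1] * 11,
    "DX_GROUP": [1, 1, 2, 2, 2] + [1] * 6 + [2] * 5,
})
balanced = mirror_male_counts(toy, seed=42)
counts = balanced.groupby(["SEX", "DX_GROUP"]).size()
print(counts)
```

With this rule both the sexes and the diagnostic groups end up equally represented overall, at the cost of discarding most of the male records.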



In [151]:
print(selection.corr())

                                SUB_ID  DX_GROUP  PDD_DSM_IV_TR  ASD_DSM_5  \
SUB_ID                        1.000000 -0.004542      -0.069198  -0.164976   
DX_GROUP                     -0.004542  1.000000      -0.864513  -1.000000   
PDD_DSM_IV_TR                -0.069198 -0.864513       1.000000   0.814608   
ASD_DSM_5                    -0.164976 -1.000000       0.814608   1.000000   
AGE_AT_SCAN                  -0.165979 -0.055640      -0.015402   0.106640   
...                                ...       ...            ...        ...   
ADI_R_D_AGE_WHEN_ABNORMALITY -0.150988       NaN      -0.325960        NaN   
ADI_R_D_INTERVIEWER_JUDGMENT -0.016701       NaN      -0.262781        NaN   
Unnamed: 348                 -0.312501       NaN      -0.171499        NaN   
Unnamed: 349                       NaN       NaN            NaN        NaN   
Unnamed: 350                       NaN       NaN            NaN        NaN   

                              AGE_AT_SCAN        SEX  HANDEDNES
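
The correlation of -1.000000 between DX_GROUP and ASD_DSM_5 in the output above suggests that ASD_DSM_5 essentially encodes the diagnosis, so it would leak the label if used as a feature. The sketch below, on a made-up frame, shows one way to rank columns by their correlation with the target so such columns stand out. It passes numeric_only=True so that text columns such as SITE_ID are skipped explicitly (recent pandas versions no longer drop them silently); the toy values are illustrative only.

```python
import pandas as pd

# Toy frame mixing a text column with numeric ones (values are made up).
toy = pd.DataFrame({
    "SITE_ID": ["a", "b", "a", "b", "a", "b"],
    "DX_GROUP": [1, 2, 1, 2, 1, 2],
    "ASD_DSM_5": [1, 0, 1, 0, 1, 0],
    "AGE_AT_SCAN": [21.0, 24.0, 20.0, 19.0, 27.0, 18.0],
})

# Restrict the correlation matrix to numeric columns.
corr = toy.corr(numeric_only=True)

# Rank the remaining columns by absolute correlation with the target.
with_target = (
    corr["DX_GROUP"]
    .drop("DX_GROUP")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(with_target)
```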

In [None]:

# df_pheno.pop('SITE_ID')
# df_pheno.pop('NDAR_GUID')
# # df_pheno.pop('FIQ_TEST_TYPE')
# # df_pheno.pop('VIQ_TEST_TYPE')
# # df_pheno.pop('PIQ_TEST_TYPE')


# display(df_pheno.head(n=5))
# #Total Columns
# columns_pheno = df_pheno.columns
# print("columns", columns_pheno.size)