### Autism Screening Adult Dataset

**Description: <br>**
This dataset covers Autism Spectrum Disorder (ASD) screening of adults. It consists of 704 samples, each with 20 attributes and a corresponding label for ASD detection. The attributes are 10 behavioral screening responses and 10 individual characteristics that are considered effective for ASD detection.

**Original Dataset Attributes:**

*Note: Attributes 1-10 are screening responses to a behavioral questionnaire. The rest are individual characteristics*

**Attribute 1: A1_Score<br>**
Description: "I often notice small sounds when others do not"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 2: A2_Score<br>**
Description: "I usually concentrate more on the whole picture, rather than the small details"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 3: A3_Score<br>**
Description: "I find it easy to do more than one thing at once"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 4: A4_Score<br>**
Description: "If there is an interruption, I can switch back to what I was doing very quickly"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 5: A5_Score<br>**
Description: "I find it easy to read between the lines when someone is talking to me"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 6: A6_Score<br>**
Description: "I know how to tell if someone listening to me is getting bored"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 7: A7_Score<br>**
Description: "When I'm reading a story I find it difficult to work out the character's intentions"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 8: A8_Score<br>**
Description: "I like to collect information about categories of things"<br>
Values: Binary integer {0,1} or "No, Yes"

**Attribute 9: A9_Score<br>**
Description: "I find it easy to work out what someone is thinking or feeling just by looking at their face"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 10: A10_Score<br>**
Description: "I find it difficult to work out people's intentions"<br>
Values: Binary integer {0,1} or "No, Yes" 

**Attribute 11: age<br>**
Description: Age (years)<br>
Values: Numeric integers 

**Attribute 12: gender<br>**
Description: Gender<br>
Values: {'f', 'm'} strings for female, male respectively 

**Attribute 13: ethnicity<br>**
Description: Ethnicity<br>
Values: common ethnicity strings (ex: "Turkish") 

**Attribute 14: jundice<br>**
Description: Born with jaundice<br>
Values: {'no', 'yes'} strings 

**Attribute 15: austim<br>**
Description: Family member has autism<br>
Values: {'no', 'yes'} strings 

**Attribute 16: contry_of_res<br>**
Description: Country of residence<br>
Values: String of country name (ex: 'United States')

**Attribute 17: used_app_before<br>**
Description: Whether user has used screening app<br>
Values: {'no', 'yes'} strings 

**Attribute 18: result<br>**
Description: Screening app score based on algorithm<br>
Values: numeric integer

**Attribute 19: age_desc<br>**
Description: Age category<br>
Values: String (ex: "18 and more")

**Attribute 20: relation<br>**
Description: Relation to person completing test<br>
Values: String (ex: 'Parent', 'Self', 'Caregiver', etc.) 

**Label 21: Class/ASD<br>**
Description: Label for detection of ASD<br>
Values: {'NO', 'YES'} strings 


*This can be confirmed from the loaded 'meta' printed below*

### Preprocessing - Autism Adult Dataset

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from scipy.io.arff import loadarff
from matplotlib import pyplot as plt

# Loading arff file
Autism_Adult, meta = loadarff('Autism-Adult-Data.arff')

# meta contains info about arff file (attrs)
print("Loaded Meta", meta)

Loaded Meta Dataset: adult-weka.filters.unsupervised.attribute.NumericToNominal-Rfirst-10
	A1_Score's type is nominal, range is ('0', '1')
	A2_Score's type is nominal, range is ('0', '1')
	A3_Score's type is nominal, range is ('0', '1')
	A4_Score's type is nominal, range is ('0', '1')
	A5_Score's type is nominal, range is ('0', '1')
	A6_Score's type is nominal, range is ('0', '1')
	A7_Score's type is nominal, range is ('0', '1')
	A8_Score's type is nominal, range is ('0', '1')
	A9_Score's type is nominal, range is ('0', '1')
	A10_Score's type is nominal, range is ('0', '1')
	age's type is numeric
	gender's type is nominal, range is ('f', 'm')
	ethnicity's type is nominal, range is ('White-European', 'Latino', 'Others', 'Black', 'Asian', "'Middle Eastern '", 'Pasifika', "'South Asian'", 'Hispanic', 'Turkish', 'others')
	jundice's type is nominal, range is ('no', 'yes')
	austim's type is nominal, range is ('no', 'yes')
	contry_of_res's type is nominal, range is ("'United States'", 'Brazi

**Creating the data matrix**<br>
`Autism_Adult` holds the records of the arff file, accessible by attribute name. When the Autism_Adult data is added to a matrix, each element has the type numpy.bytes_, so every column is converted to int or str, depending on the attribute type. This lets the data matrix be manipulated without type errors.

In [2]:
# Turn Autism_Adult into matrix of data, starting with column 0 (A1_Score) cast to int
Autism_Adult_data = np.array(Autism_Adult[meta.names()[0]].astype(int, copy = True)).reshape(-1, 1)

# Columns 1-10 (A2_Score..A10_Score and age) hold integer values -> cast to int and append
for i in range(1,11):
    Autism_Adult_data = np.c_[Autism_Adult_data, np.array(Autism_Adult[meta.names()[i]]).astype(int, copy = True)]

# Columns 11-16 (gender..used_app_before) are strings -> append
# Note: np.c_ promotes the whole matrix to a string dtype from this point on
for i in range(11,17):
    Autism_Adult_data = np.c_[Autism_Adult_data, np.array(Autism_Adult[meta.names()[i]]).astype(str, copy = True)]

# Column 17 (result) holds integer values -> cast to int and append
Autism_Adult_data = np.c_[Autism_Adult_data, np.array(Autism_Adult[meta.names()[17]]).astype(int, copy = True)]

# Columns 18-20 (age_desc, relation, Class/ASD) are strings -> append
for i in range(18,len(meta.names())):
    Autism_Adult_data = np.c_[Autism_Adult_data, np.array(Autism_Adult[meta.names()[i]]).astype(str, copy = True)]
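
As an alternative sketch of the bytes-to-str/int conversion described above, the record array returned by `loadarff` can be handed straight to pandas and decoded column by column. The tiny ARFF text, `toy_frame`, and `toy_meta` are hypothetical stand-ins, not the real file, and this is not the approach the notebook itself uses.

```python
from io import StringIO

import pandas as pd
from scipy.io.arff import loadarff

# Tiny stand-in for the real ARFF file (invented rows, same attribute style)
toy_arff = StringIO("""@relation toy
@attribute A1_Score {0,1}
@attribute age numeric
@attribute gender {f,m}
@data
1,26,f
0,24,m
""")

records, toy_meta = loadarff(toy_arff)
toy_frame = pd.DataFrame(records)

# Nominal attributes arrive as bytes; decode them to str column by column
for name, kind in zip(toy_meta.names(), toy_meta.types()):
    if kind == 'nominal':
        toy_frame[name] = toy_frame[name].str.decode('utf-8')

# The binary scores are nominal in the file, so cast them to int explicitly
toy_frame['A1_Score'] = toy_frame['A1_Score'].astype(int)

print(toy_frame.dtypes)
```

Keeping each column in its own dtype avoids the promotion to strings that happens when mixed columns are stacked into one NumPy matrix.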

**Creating pandas DataFrame for easier manipulation**<br><br>
Printing the pre-cleaned up DataFrame below

In [3]:
# Convert to pandas DataFrame for easier manipulation 
Autism_frame = pd.DataFrame(data = Autism_Adult_data, columns = meta.names()[:])

# Replace '?' with NaN to help find columns with missing values
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
Autism_frame.replace('?', np.nan, inplace = True)

# Printing first 10 rows of data frame
Autism_frame.head(10)

# Dimension of pandas DataFrame
#print(Autism_frame.shape) # 704 by 21 -> 704 samples, 20 attributes, last column holds labels

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,gender,ethnicity,jundice,austim,contry_of_res,used_app_before,result,age_desc,relation,Class/ASD
0,1,1,1,1,0,0,1,1,0,0,...,f,White-European,no,no,'United States',no,6,'18 and more',Self,NO
1,1,1,0,1,0,0,0,1,0,1,...,m,Latino,no,yes,Brazil,no,5,'18 and more',Self,NO
2,1,1,0,1,1,0,1,1,1,1,...,m,Latino,yes,yes,Spain,no,8,'18 and more',Parent,YES
3,1,1,0,1,0,0,1,1,0,1,...,f,White-European,no,yes,'United States',no,6,'18 and more',Self,NO
4,1,0,0,0,0,0,0,1,0,0,...,f,,no,no,Egypt,no,2,'18 and more',,NO
5,1,1,1,1,1,0,1,1,1,1,...,m,Others,yes,no,'United States',no,9,'18 and more',Self,YES
6,0,1,0,0,0,0,0,1,0,0,...,f,Black,no,no,'United States',no,2,'18 and more',Self,NO
7,1,1,1,1,0,0,0,0,1,0,...,m,White-European,no,no,'New Zealand',no,5,'18 and more',Parent,NO
8,1,1,0,0,1,0,0,1,1,1,...,m,White-European,no,no,'United States',no,6,'18 and more',Self,NO
9,1,1,1,1,0,1,1,1,1,0,...,m,Asian,yes,yes,Bahamas,no,8,'18 and more','Health care professional',YES
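
A quick observation on the ten rows printed above: in each of them, `result` equals the sum of the ten A scores. The sketch below checks only those ten rows, copied by hand from the printout; whether the pattern holds for all 704 samples is an assumption that would need to be checked on the full frame.

```python
# (A1..A10 scores, result, Class/ASD) copied from the ten rows printed above
rows = [
    ([1, 1, 1, 1, 0, 0, 1, 1, 0, 0], 6, 'NO'),
    ([1, 1, 0, 1, 0, 0, 0, 1, 0, 1], 5, 'NO'),
    ([1, 1, 0, 1, 1, 0, 1, 1, 1, 1], 8, 'YES'),
    ([1, 1, 0, 1, 0, 0, 1, 1, 0, 1], 6, 'NO'),
    ([1, 0, 0, 0, 0, 0, 0, 1, 0, 0], 2, 'NO'),
    ([1, 1, 1, 1, 1, 0, 1, 1, 1, 1], 9, 'YES'),
    ([0, 1, 0, 0, 0, 0, 0, 1, 0, 0], 2, 'NO'),
    ([1, 1, 1, 1, 0, 0, 0, 0, 1, 0], 5, 'NO'),
    ([1, 1, 0, 0, 1, 0, 0, 1, 1, 1], 6, 'NO'),
    ([1, 1, 1, 1, 0, 1, 1, 1, 1, 0], 8, 'YES'),
]

# Does 'result' equal the row sum of the ten scores in every printed row?
sums_match = all(sum(scores) == result for scores, result, _ in rows)

# Compare the score ranges of the two classes within these rows
lowest_yes = min(result for _, result, label in rows if label == 'YES')
highest_no = max(result for _, result, label in rows if label == 'NO')

print(sums_match, lowest_yes, highest_no)  # True 8 6
```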


<br>**Clean up discussion**:<br>
1) We correct the attribute names to reduce the chance of spelling errors (example: the original attribute name for jaundice was 'jundice')<br>
2) Attribute 'ethnicity' had the duplicate values "others" and "Others", so they were merged into one value, "Others"<br>
3) An outlier in 'age' was fixed (likely a typo: '383' -> '38')<br>

<br>**Reduction discussion**:<br>
1) From the meta above, the attribute 'age_desc' has only one level, which is "18 and more". It carries no information, so it is dropped<br>
2) The attributes 'country', 'used_app_before', and 'result' were also removed<br>
3) We decided to remove all samples with missing values. The missing values were predominantly categorical (ex: ethnicity, relation), and it does not make sense to replace missing non-numeric values with a mean or median. <br> <br>
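
The checks behind the points above can be sketched on a miniature frame. The rows in `toy` are invented for illustration, and the age threshold of 120 is an assumption rather than a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the pre-cleaned frame (values invented for illustration)
toy = pd.DataFrame({
    'age': [26, 24, 383, 35],
    'ethnicity': ['White-European', np.nan, 'others', 'Others'],
    'relation': ['Self', np.nan, 'Parent', 'Self'],
    'age_desc': ["'18 and more'"] * 4,
})

# Count missing values per column to see where they concentrate
missing_per_column = toy.isna().sum()

# Columns with a single distinct value carry no information
single_level = [col for col in toy.columns if toy[col].nunique(dropna=True) == 1]

# Flag implausible ages (the threshold is an assumption)
suspicious_age = toy.loc[toy['age'] > 120, 'age']

print(missing_per_column)
print(single_level)
print(suspicious_age.tolist())
```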

**The cleaned up pandas DataFrame can be seen printed below**

In [4]:
# Correcting attribute names
rename_map = {f"A{i}_Score": f"A{i}" for i in range(1, 11)}  # A1_Score -> A1, ..., A10_Score -> A10
rename_map.update({"jundice": "jaundice", "austim": "autism", "contry_of_res": "country", "Class/ASD": "ASD"})
Clean_Autism_frame = Autism_frame.rename(index=str, columns=rename_map)


# Dropping attribute column 'age_desc' -> no significance
Clean_Autism_frame = Clean_Autism_frame.drop(columns = ['age_desc']) # Now 19 attributes instead of 20


# Replacing duplicate 'others' in ethnicity with 'Others'
Clean_Autism_frame = Clean_Autism_frame.replace({'ethnicity': 'others'}, 'Others')


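# Replacing outlier in 'age' (matched as int or str, since stacking the columns
# with np.c_ may have stored the ages as strings)
Clean_Autism_frame = Clean_Autism_frame.replace({'age': {383: 38, '383': '38'}})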


# Remove samples with missing values
Clean_Autism_frame = Clean_Autism_frame.dropna() # This leaves us with 609 samples instead of 704


# Dropping attrs: country (originally contry_of_res), used_app_before, result
Clean_Autism_frame = Clean_Autism_frame.drop(columns = ['country', 'used_app_before', 'result']) # Now 16 attributes instead of 19


# Printing first 10 rows of data frame
Clean_Autism_frame.head(10)


# Dimension of pandas DataFrame
#print(Clean_Autism_frame.shape) # 609 by 17 -> 16 attributes, last column holds labels

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,age,gender,ethnicity,jaundice,autism,relation,ASD
0,1,1,1,1,0,0,1,1,0,0,26,f,White-European,no,no,Self,NO
1,1,1,0,1,0,0,0,1,0,1,24,m,Latino,no,yes,Self,NO
2,1,1,0,1,1,0,1,1,1,1,27,m,Latino,yes,yes,Parent,YES
3,1,1,0,1,0,0,1,1,0,1,35,f,White-European,no,yes,Self,NO
5,1,1,1,1,1,0,1,1,1,1,36,m,Others,yes,no,Self,YES
6,0,1,0,0,0,0,0,1,0,0,17,f,Black,no,no,Self,NO
7,1,1,1,1,0,0,0,0,1,0,64,m,White-European,no,no,Parent,NO
8,1,1,0,0,1,0,0,1,1,1,29,m,White-European,no,no,Self,NO
9,1,1,1,1,0,1,1,1,1,0,17,m,Asian,yes,yes,'Health care professional',YES
10,1,1,1,1,1,1,1,1,1,1,33,m,White-European,no,no,Relative,YES


**Last preprocessing step is encoding categorical attributes**<br>
*(ex in Gender: 'f' = 0, 'm' = 1) - the dimensions of our data matrix and labels can be seen below*

In [5]:
# Encoding categorical attrs
from sklearn.preprocessing import LabelEncoder

attr_names = ['gender', 'ethnicity', 'jaundice', 'autism', 'relation', 'ASD']

labelEncoder_X = LabelEncoder()

# The same encoder is refit for each column, so only the last column's mapping is kept
for name in attr_names:
    Clean_Autism_frame[name] = labelEncoder_X.fit_transform(Clean_Autism_frame[name])

# Data Matrix and Data Labels from clean pandas DataFrame        
Matrix = Clean_Autism_frame.values
Data_Matrix = Matrix[:,:-1]
Data_Matrix = Data_Matrix.astype('int')
Data_Labels = Matrix[:,-1]
Data_Labels = Data_Labels.astype('int')
#Data_Labels = np.reshape(Data_Labels, (609, 1))

print("Dim(Data_Matrix) = ", Data_Matrix.shape)
print("Dim(Data_Labels) = ", Data_Labels.shape)

Dim(Data_Matrix) =  (609, 16)
Dim(Data_Labels) =  (609,)
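
To see which integer each category received, one option is to keep a separate `LabelEncoder` per column. The sketch below uses an invented two-column frame; `encoders` and `mappings` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented miniature frame with two of the categorical columns
toy = pd.DataFrame({
    'gender': ['f', 'm', 'm', 'f'],
    'ASD': ['NO', 'YES', 'NO', 'NO'],
})

# One encoder per column, so every mapping stays available afterwards
encoders = {}
for name in ['gender', 'ASD']:
    encoders[name] = LabelEncoder()
    toy[name] = encoders[name].fit_transform(toy[name])

# LabelEncoder sorts the classes, so the codes follow alphabetical order
mappings = {
    name: {str(cls): code for code, cls in enumerate(enc.classes_)}
    for name, enc in encoders.items()
}

# Codes can be turned back into the original strings
decoded = [str(v) for v in encoders['gender'].inverse_transform([0, 1])]

print(mappings)
print(decoded)
```

With this layout, `mappings['ASD']` documents which integer stands for the positive class when reading the accuracy results.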


### Training & Testing, using k = 7 cross validation

**Partitioning data and labels into folds, 87 samples per fold**

In [6]:
# Partitioning data into 7 folds
X_folds = np.array([Data_Matrix[:87], Data_Matrix[87:174], Data_Matrix[174:261], Data_Matrix[261:348], Data_Matrix[348:435], Data_Matrix[435:522], Data_Matrix[522:]])

# Partitioning labels into 7 folds
label_fold1 = Data_Labels[0:87]
label_fold2 = Data_Labels[87:174]
label_fold3 = Data_Labels[174:261]
label_fold4 = Data_Labels[261:348]
label_fold5 = Data_Labels[348:435]
label_fold6 = Data_Labels[435:522]
label_fold7 = Data_Labels[522:]
Labels_folds = np.array([label_fold1, label_fold2, label_fold3, label_fold4, label_fold5, label_fold6, label_fold7])

# Store the accuracy of each cross validation iteration
SVM_accuracies = []
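
The same contiguous partition can be sketched without hardcoded slice indices by using `np.array_split`. The arrays `X` and `y` below are placeholders with the same shapes as `Data_Matrix` and `Data_Labels`, not the real data.

```python
import numpy as np

n_samples, n_features, k = 609, 16, 7

# Placeholder data with the same shapes as Data_Matrix and Data_Labels
X = np.arange(n_samples * n_features).reshape(n_samples, n_features)
y = np.arange(n_samples) % 2

# Split rows into k contiguous folds; 609 / 7 = 87 samples per fold
X_parts = np.array_split(X, k)
y_parts = np.array_split(y, k)

fold_sizes = [len(part) for part in X_parts]
print(fold_sizes)  # [87, 87, 87, 87, 87, 87, 87]
```

`np.array_split` also copes with sample counts that are not divisible by k, in which case the first folds are one sample larger.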

<br>**SVM Training and Test Method**<br>
The arguments select which six folds to train on and which fold to test on. The method fits sklearn's SVM classifier on the training data and training labels, then predicts the test fold and compares the predictions with the test labels to compute the accuracy.

In [7]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


def TrainAndTestSVM(in1, in2, in3, in4, in5, in6, in_test):
    # Fold numbers are 1-based; the first six are used for training
    train_folds = [in1, in2, in3, in4, in5, in6]

    # Train Data and Labels: stack the selected folds row-wise
    train_data = np.concatenate([X_folds[f - 1] for f in train_folds])
    train_labels = np.concatenate([Labels_folds[f - 1] for f in train_folds])

    # Test Data and Labels
    test_data = X_folds[in_test - 1]
    test_labels = Labels_folds[in_test - 1]

    # SVM Train
    clf = SVC(gamma = 'auto')
    clf.fit(train_data, train_labels)

    # Test SVM: predict the whole test fold in one call
    predictions = clf.predict(test_data)

    accuracy = accuracy_score(test_labels, predictions)

    return accuracy

**Cross validation iterations**

In [8]:
# Iteration 1
# Using folds 1, 2, 3, 4, 5, 6 as training, fold 7 as test
iteration1_accuracy = TrainAndTestSVM(1, 2, 3, 4, 5, 6, 7)
SVM_accuracies.append(iteration1_accuracy)

# Iteration 2
# Using folds 1, 2, 3, 4, 5, 7 as training, fold 6 as test
iteration2_accuracy = TrainAndTestSVM(1, 2, 3, 4, 5, 7, 6)
SVM_accuracies.append(iteration2_accuracy)

# Iteration 3
# Using folds 1, 2, 3, 4, 6, 7 as training, fold 5 as test
iteration3_accuracy = TrainAndTestSVM(1, 2, 3, 4, 6, 7, 5)
SVM_accuracies.append(iteration3_accuracy)

# Iteration 4
# Using folds 1, 2, 3, 5, 6, 7 as training, fold 4 as test
iteration4_accuracy = TrainAndTestSVM(1, 2, 3, 5, 6, 7, 4)
SVM_accuracies.append(iteration4_accuracy)

# Iteration 5
# Using folds 1, 2, 4, 5, 6, 7 as training, fold 3 as test
iteration5_accuracy = TrainAndTestSVM(1, 2, 4, 5, 6, 7, 3)
SVM_accuracies.append(iteration5_accuracy)

# Iteration 6
# Using folds 1, 3, 4, 5, 6, 7 as training, fold 2 as test
iteration6_accuracy = TrainAndTestSVM(1, 3, 4, 5, 6, 7, 2)
SVM_accuracies.append(iteration6_accuracy)

# Iteration 7
# Using folds 2, 3, 4, 5, 6, 7 as training, fold 1 as test
iteration7_accuracy = TrainAndTestSVM(2, 3, 4, 5, 6, 7, 1)
SVM_accuracies.append(iteration7_accuracy)
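
The seven iterations above can also be written as a loop. The sketch below uses scikit-learn's `KFold` without shuffling, which yields the same contiguous folds of 87 samples, and runs on synthetic placeholder data (`X`, `y`, and the labeling rule are invented), so its accuracies say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the cleaned data (not the real data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(609, 16))
y = (X.sum(axis=1) > 8).astype(int)  # arbitrary rule, only to have two classes

accuracies = []
test_sizes = []

# shuffle=False keeps the folds contiguous, like the manual slices
for train_idx, test_idx in KFold(n_splits=7, shuffle=False).split(X):
    clf = SVC(gamma='auto')
    clf.fit(X[train_idx], y[train_idx])
    predictions = clf.predict(X[test_idx])
    accuracies.append(accuracy_score(y[test_idx], predictions))
    test_sizes.append(len(test_idx))

print("Mean accuracy = ", np.mean(accuracies))
```

A loop like this removes the copy-and-paste steps, at the cost of hiding which fold is held out in each iteration unless the indices are printed.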

**Printing mean SVM accuracy across all k = 7 cross validation iterations**<br>

In [10]:
print("Mean accuracy = ", np.mean(SVM_accuracies))

Mean accuracy =  0.9178981937602628


### Best SVM Training Model and Analysis