## Dataset 3 - Autism

This dataset provides 20 attributes to help determine whether an adult may be on the autism spectrum (ASD). The data comes from autism screening of adults, in contrast to most other datasets, which are based on behavioral traits. Of the 20 attributes, 10 are behavioral features and 10 are individual characteristics.
https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult

#### Problem
Our goal is to create a classifier that can diagnose autism based on the answers to certain questions and on individual characteristics. This is a classification problem, and it closely mirrors the hypothetical cancer-diagnosis scenarios we discuss in class, except that our features come mostly from psychological evaluation rather than from the physical traits of a physical ailment.

#### Extending the problem
Of all the datasets, this one has the most social impact, and the creation of such an algorithm is most likely already a subject of study in various academic fields.


### Loading the Dataset
The dataset is stored in an .arff file and needs to be loaded properly before we can use it. Fortunately, SciPy has a function to handle this.

Some features, like the answers to the test questions, are binary. Others, like age, are integers, and others are strings.

Some features are translated into binary features. Gender is treated as a binary feature, with 'False' meaning male and 'True' meaning female.
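One detail is easy to miss when translating these features. The loader returns nominal values as byte strings, and in Python 3 a byte string never compares equal to a `str`. The snippet below is a minimal sketch of the gender translation under that assumption. The helper names `decode_nominal` and `gender_to_binary` are hypothetical, and the raw values `m` and `f` are assumed from the comparison used in the loading code.

```python
def decode_nominal(value):
    """Return a plain str for a value that may arrive as bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


def gender_to_binary(raw_gender):
    """Map the raw gender value to a boolean: False for male, True otherwise."""
    return decode_nominal(raw_gender).strip().lower() != "m"


# A bytes value and a str value are never equal, so comparing the raw
# value against 'm' directly would silently mark every row as True.
bytes_equals_str = (b"m" == "m")

male = gender_to_binary(b"m")
female = gender_to_binary(b"f")
```

Decoding first and comparing afterwards keeps the comparison independent of whether the value arrives as `bytes` or `str`.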

In [15]:
from scipy.io import arff
import numpy as np

file = './Data/Autism/Autism-Adult-Data.arff'
filedata, metadata = arff.loadarff(file)

'''
print (metadata)
print (filedata[0])
'''

# Data is not properly typed and needs to be converted
data = [[None for _ in range(len(filedata[0]))] for _ in range(len(filedata))]
for i in range(len(filedata)):
    for j in range(len(filedata[i])):
        # Binary features, Answer Scores
        if (j < 10):
            if (filedata[i][j] == b'1'):
                data[i][j] = True
            else:
                data[i][j] = False
        # Integer Features, Age feature, Screen Score
        elif (j == 10 or j == 17):
            data[i][j] = filedata[i][j]
        # Gender Feature to binary
        elif (j == 11):
            if (filedata[i][j] == b'm'):
                data[i][j] = False
            else:
                data[i][j] = True
        # String features: Ethnicity, Country of Origin (enclosed in '' within the string), 18 or older, relation
        elif (j == 12 or j == 15 or j == 18 or j == 19):
            data[i][j] = filedata[i][j].decode("utf-8")
        # Yes/no features: Jaundice, Family Member with a PDD, used the screening app before
        elif (j == 13 or j == 14 or j == 16):
            if (filedata[i][j] == b'yes'):
                data[i][j] = True
            else:
                data[i][j] = False
        # Final classification
        elif (j == 20):
            if (filedata[i][j] == b'YES'):
                data[i][j] = True
            else:
                data[i][j] = False

    # Make the row into a numpy array
    data[i] = np.array(data[i])

# Make the whole dataset into a numpy array
data = np.array(data)

# Transpose so that features are along the rows and data points are along the columns
data = data.transpose()

# Extract the label from the data
labels = data[20:,:]
data = data[:20,:]

# data is now the features matrix column wise and the labels are separated into a vector
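To make the transpose and split concrete, here is a small sketch on a made-up array rather than the real dataset. The names `toy`, `toy_features`, and `toy_labels` are illustrative only. It shows that slicing with `[2:, :]` keeps the labels as a 2-D array with a single row, and that `ravel()` is one way to get a flat vector if later code needs one.

```python
import numpy as np

# Toy stand-in: 4 data points, each with 2 features followed by 1 label
toy = np.arange(12).reshape(4, 3)

# After transposing, features run along the rows and data points along the columns
toy_t = toy.transpose()

# Same slicing pattern as above, with the label in the last row
toy_labels = toy_t[2:, :]
toy_features = toy_t[:2, :]

# The slice keeps two dimensions; ravel() flattens it to one
flat_labels = toy_labels.ravel()
```

Here `toy_features` has shape `(2, 4)`, `toy_labels` has shape `(1, 4)`, and `flat_labels` has shape `(4,)`.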

### Cleaning the Dataset
With the data properly loaded, we need to look over our features and potentially clean or adjust the data.

In our dataset, there is an "18 years or older" feature, which has the same value for every data point. It is presumably a holdover from a legal requirement that an adult consent to the collection of the data. Because a constant feature carries no information for classification, we can cut it out entirely from the beginning.
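A constant feature like this can also be found programmatically instead of by inspection. The following is a minimal sketch assuming the layout used above, with one feature per row. The function `constant_feature_rows` is a hypothetical helper, and the toy values are made up for illustration.

```python
import numpy as np


def constant_feature_rows(features):
    """Return indices of rows whose value is identical for every data point."""
    return [i for i, row in enumerate(features) if len(set(row.tolist())) <= 1]


# Toy feature matrix: 3 features (rows) by 3 data points (columns)
toy = np.array([
    ["26", "24", "31"],                             # varies
    ["18 and more", "18 and more", "18 and more"],  # constant
    ["Self", "Parent", "Self"],                     # varies
])

constant = constant_feature_rows(toy)

# Drop the constant rows; axis=0 because features run along the rows
cleaned = np.delete(toy, constant, axis=0)
```

In this toy case `constant` is `[1]` and `cleaned` keeps the two varying rows.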

Additionally, we can elect to exclude other features which we are not interested in including in our analysis such as .

We can cut these out of our data: