## Dataset 3 - Autism

This dataset provides 20 attributes to help determine whether an adult may be on the autism spectrum, that is, may have autism spectrum disorder (ASD). The data is based on autism screening of adults, in contrast to most other datasets, which are based on behavior traits. The dataset contains 10 behavioral features and 10 individual characteristics.

Source: https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult

#### Problem
Our goal is to create a classifier that can diagnose autism based on the answers to certain questions and on individual characteristics. This is a classification problem that closely mirrors the hypothetical cancer-diagnosis scenarios we discuss in class, except that our features come from psychological evaluation rather than from the physical traits of a physical ailment.

#### Extending the problem
Of all the datasets, this one has the greatest potential social impact, and classifiers of this kind are most likely already being studied in academia.


### Dataset Description

### Loading the Dataset
The dataset is laid out in a .arff file and needs to be loaded properly before we can use it. Fortunately, SciPy has a function to handle this.

Some features, like the answers to the test questions, are binary, while others, like age, are integers, and others are strings.

Some features are translated into binary features. Gender is treated as a binary feature, with `False` meaning male and `True` meaning female.
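
The loader returns nominal values as byte strings, so in Python 3 they have to be compared against bytes literals or decoded first; a bytes value never equals a `str`. The following is a minimal sketch of the conversions described above, using made-up raw values rather than the real file. The helper names `to_bool` and `gender_to_bool` are illustrative and not part of the notebook.

```python
def to_bool(raw, true_value):
    """Return True when a raw byte string equals the given bytes literal."""
    return raw == true_value


def gender_to_bool(raw):
    """Map gender to the binary encoding used here: male -> False, female -> True."""
    return raw.decode("utf-8") != "m"


# Hypothetical raw values, shaped like the nominal fields of the ARFF file
print(b"m" == "m")             # bytes and str never compare equal in Python 3
print(to_bool(b"1", b"1"))     # answer score
print(to_bool(b"no", b"yes"))  # yes/no field such as jaundice
print(gender_to_bool(b"m"), gender_to_bool(b"f"))
```

Decoding once and then comparing plain strings, as `gender_to_bool` does, tends to be less error-prone than mixing bytes and `str` literals throughout the code.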

In [1]:
from scipy.io import arff
import numpy as np

file = './Data/Autism/Autism-Adult-Data.arff'
filedata, metadata = arff.loadarff(file)

'''
print (metadata)
print (filedata[0])
'''

# Data is not properly typed and needs to be converted
data = [[None for _ in range(len(filedata[0]))] for _ in range(len(filedata))]
for i in range(len(filedata)):
    for j in range(len(filedata[i])):
        # Binary features, Answer Scores
        if (j < 10):
            if (filedata[i][j] == b'1'):
                data[i][j] = True
            else:
                data[i][j] = False
        # Integer Features, Age feature, Screen Score
        elif (j == 10 or j == 17):
            data[i][j] = filedata[i][j]
        # Gender Feature to binary
        elif (j == 11):
            if (filedata[i][j] == b'm'):
                data[i][j] = False
            else:
                data[i][j] = True
        # String features, Ethnicity, Country of Origin (enclosed in '' within the string), 18 or older, relation
        elif (j == 12 or j == 15 or j == 18 or j == 19):
            data[i][j] = filedata[i][j].decode("utf-8")
        # Yes/no features: Jaundice, Family Member with a PDD, used the screening app before
        elif (j == 13 or j == 14 or j == 16):
            if (filedata[i][j] == b'yes'):
                data[i][j] = True
            else:
                data[i][j] = False
        # Final classification
        elif (j == 20):
            if (filedata[i][j] == b'YES'):
                data[i][j] = True
            else:
                data[i][j] = False

    # Make the row into a numpy array (dtype=object keeps the mixed types instead of casting all to str)
    data[i] = np.array(data[i], dtype=object)

# Make the whole dataset into a numpy array
data = np.array(data, dtype=object)

# Transpose so that features are along the rows and data points are along the columns
data = data.transpose()

# Extract the label from the data
labels = data[20:,:]
data = data[:20,:]

# data is now the features matrix column wise and the labels are separated into a vector

### Cleaning the Dataset
With the data properly loaded, we need to look over our features and potentially clean or adjust the data.

Our dataset includes an "18 years or older" feature that is the same for every data point, presumably a holdover from a legal requirement that participants be adults able to consent to the data collection. We can therefore cut that feature from the start.
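
One way to find such features without inspecting them by hand is to look for rows that hold a single distinct value. This is a minimal sketch on a toy matrix, assuming the layout used above (one feature per row, one data point per column); `drop_constant_features` is a hypothetical helper, not part of the notebook.

```python
def drop_constant_features(features):
    """Return the feature rows that vary, plus the indices of the constant rows."""
    kept, dropped = [], []
    for index, row in enumerate(features):
        if len(set(row)) > 1:
            kept.append(row)
        else:
            dropped.append(index)
    return kept, dropped


# Toy matrix: one feature per row, one data point per column
toy = [
    [True, False, True],    # varies across data points, kept
    ["yes", "yes", "yes"],  # identical for every data point, dropped
    [26, 24, 27],           # varies across data points, kept
]
kept, dropped = drop_constant_features(toy)
print(dropped)
```

With a NumPy array instead of nested lists, the same indices could be passed to `np.delete(data, dropped, axis=0)` to remove the rows.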

Additionally, we can elect to exclude other features which we are not interested in including in our analysis such as .

We can cut these out of our data:

### Analysing our Cleaned Data

This dataset required the most adjustment before it could be used for this project, mostly because the data is stored and distributed as an ARFF file. In addition, not every feature was relevant, as mentioned above. We can now attempt to build a classifier that predicts whether a given set of ASD evaluation answers, together with the other factors, would lead to an ASD diagnosis. The three methods used, along with the group member responsible for each, are:

- Classification Method 1 - Brooks Tawil
- Naive Bayes - Jack Chiu
- Classification Method 3 - Gavin Mckim

# Naive Bayes
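
Naive Bayes assumes the features are independent given the class and predicts the class $y$ that maximizes $P(y)\prod_j P(x_j \mid y)$. The following is a minimal sketch of that rule for categorical features on toy data. Add-one (Laplace) smoothing and log-probabilities are assumptions added here for numerical robustness and are not taken from the notebook's code; `train_nb` and `predict_nb` are illustrative names.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(label, feature_index)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    # Distinct values seen for each feature, needed for smoothing
    feature_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            feature_values[j].add(value)
    return class_counts, value_counts, feature_values


def predict_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for j, value in enumerate(row):
            # Add-one smoothing keeps an unseen value from zeroing the product
            numerator = value_counts[(label, j)][value] + 1
            denominator = count + len(feature_values[j])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: two binary answers per person, label 1 = positive screen
rows = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
labels = [1, 1, 0, 0, 1, 0]
model = train_nb(rows, labels)
print(predict_nb(model, (1, 1)), predict_nb(model, (0, 0)))
```

Summing logarithms instead of multiplying probabilities avoids underflow when many features are combined, which matters little for 20 features but costs nothing.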

In [2]:
import numpy as np
import arff as ARFF  # provided by the liac-arff package

def uNiQuE(vec):
    popCtr = 0
    for p in vec:
        if p in vec[:popCtr]:
            vec = np.delete(vec, popCtr, 0)
            popCtr -= 1
        popCtr += 1
        
    return vec

def preProcess(data):
    i = 0
    for q in range(len(data)):
        if None in data[i] or data[i,10] == 383:
            data = np.delete(data, i, 0)
            i -= 1        
        i += 1

        for j in range(len(data[i-1])):
            if type(data[i-1,j]) == str:
                data[i-1,j] = data[i-1,j].lower()
    
    labels = data[:,-1]
    data = np.delete(data, -1, 1)
    
    ethnicities = uNiQuE(data[:,12])
    countries = uNiQuE(data[:,15])
    completed = uNiQuE(data[:,-1])
    yesNo = np.array(['no', 'yes'])
    ageVec = np.arange(20,80, 10)
    for i in range(len(data)):
        data[i,:10] = data[i,:10].astype(int)
        
        tempInd = np.where(ageVec > int(data[i,10]))
        data[i,10] = int(tempInd[0][0])
    
        if data[i,11] == 'm':
            data[i,11] = 1
        else:
            data[i,11] = 0
        tempInd = np.where(ethnicities == data[i,12])
        data[i,12] = int(tempInd[0])
        
        tempInd = np.where(yesNo == data[i,13])
        data[i,13] = int(tempInd[0])
        tempInd = np.where(yesNo == data[i,14])
        data[i,14] = int(tempInd[0])
        
        tempInd = np.where(countries == data[i,15])
        data[i,15] = int(tempInd[0])
        
        tempInd = np.where(yesNo == data[i,16])
        data[i,16] = int(tempInd[0])
        
        data[i,17] = int(data[i,17])
        
        data[i,18] = 1
        
        tempInd =  np.where(completed==data[i,19])
        data[i,19] = int(tempInd[0])
        
        labels[i] = int(labels[i] == 'yes')
    
    return data, labels

def kFold(data, labels, kFolds, typ = 'naive', k = 7):
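    # Reorder the rows. Note: np.random.choice samples with replacement by default,
    # so rows can repeat (and land in both training and test folds); pass replace=False
    # or use np.random.permutation for a true shuffle.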
    inds = np.random.choice(np.arange(len(data)), len(data))
    data[:] = data[inds]
    labels[:] = labels[inds]

    startInd = 0
    stepSize = int(len(data)/kFolds)
    max10 = max(data[:,10])
    max12 = max(data[:,12])
    max15 = max(data[:,15])
    max17 = max(data[:,17])
    max19 = max(data[:,19])
    Errs = []
    for i in range(kFolds):
        if i != kFolds-1:
            testData = data[startInd:startInd+stepSize]
            testLabels = labels[startInd:startInd+stepSize]
            trainingData = data[:startInd]
            trainingData = np.concatenate((trainingData, data[startInd+stepSize:]))
            trainingLabels = labels[:startInd]
            trainingLabels = np.concatenate((trainingLabels,labels[startInd+stepSize:]))
        else:
            testData = data[startInd:]
            testLabels = labels[startInd:]
            trainingData = data[:startInd]
            trainingLabels = labels[:startInd]
        startInd += stepSize      
        if typ == 'kNN':
            trainingData[:,10] = trainingData[:,10]/max10
            trainingData[:,12] = trainingData[:,12]/max12
            trainingData[:,15] = trainingData[:,15]/max15
            trainingData[:,17] = trainingData[:,17]/max17
            trainingData[:,19] = trainingData[:,19]/max19
            testData[:,10] = testData[:,10]/max10
            testData[:,12] = testData[:,12]/max12
            testData[:,15] = testData[:,15]/max15
            testData[:,17] = testData[:,17]/max17
            testData[:,19] = testData[:,19]/max19
            temp = calcErr(trainingData, trainingLabels, testData, testLabels, k)
        elif typ == 'naive':
            temp = calcErrNaive(trainingData, trainingLabels, testData, testLabels)
        
        Errs.append(temp)
    
    return Errs
        
def trainNaive(data, labels):
    unique, counts = np.unique(labels, return_counts=True)
    prior = np.array([counts[0], counts[1]])
    prior = (prior+0.0)/len(data)
    conditional = np.zeros((2, len(data[0]), 60))
    for i in range(len(data)):
        for j in range(len(data[i])):
            conditional[labels[i], j, data[i,j]] += 1
            
    for i in range(len(conditional)):
        for j in range(len(conditional[0])):
            sumCondition = sum(conditional[0,j]) + sum(conditional[1,j])
            for k in range(len(conditional[0,j])):
                conditional[i,j,k] = conditional[i,j,k]/sumCondition
    
    return prior, conditional, unique

def testNaive(prior, conditional, unique, sample):
    prob = np.array([prior[0],prior[1]])
    for i in range(len(sample)):
        prob[0] = prob[0] * conditional[0,i,sample[i]]
        prob[1] = prob[1] * conditional[1,i,sample[i]]
    
    if prob[0] >= prob[1]:
        return unique[0]
        
    return unique[1]
        
def calcErrNaive(trainingData, trainingLabels, testData, testLabels):
    errs = 0
    prior, conditional, unique = trainNaive(trainingData, trainingLabels)
    for i in range(len(testData)):
        prediction = testNaive(prior, conditional, unique, testData[i])
        errs += int(prediction != testLabels[i])
    
    # Despite the function name, the returned value is the accuracy (1 - error rate)
    return np.round(1-errs/len(testLabels), 6)

#file = 'Autism-Adult-Data.arff'
dataset = ARFF.load(open(file))
DATA = np.array(dataset['data'])    
DATA, LABELS = preProcess(DATA)
inds = np.random.choice(np.arange(len(DATA)), len(DATA))
DATA[:] = DATA[inds]
LABELS[:] = LABELS[inds]
for i in np.arange(3,13,2):
    kErrs = kFold(DATA, LABELS, i, typ = 'naive')
    print('k = ', i, ' with average accuracy = ' , np.average(kErrs).round(6))
    print('Accuracy for each fold: ', kErrs)


k =  3  with average accuracy =  0.98523
Accuracy for each fold:  [0.99505, 0.985149, 0.97549]
k =  5  with average accuracy =  0.986857
Accuracy for each fold:  [1.0, 0.983471, 0.975207, 0.991736, 0.983871]
k =  7  with average accuracy =  0.993355
Accuracy for each fold:  [1.0, 1.0, 1.0, 0.976744, 0.988372, 0.988372, 1.0]
k =  9  with average accuracy =  0.991708
Accuracy for each fold:  [1.0, 1.0, 0.970149, 1.0, 0.970149, 1.0, 1.0, 0.985075, 1.0]
k =  11  with average accuracy =  0.996694
Accuracy for each fold:  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.981818, 1.0, 0.981818, 1.0]


The Naive Bayes classifier was built to predict, from the screening data, whether an individual potentially has autism. Running it with k-fold cross-validation for several numbers of folds (3 to 11) gave a high average accuracy ($\geq 98\%$ in every case), with individual folds ranging from about 97% to 100%. This suggests a strong relationship between the collected features and the ASD label, at least as captured by the Naive Bayes classifier.
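
The k-fold procedure referred to above can be sketched as follows. This is a minimal illustration, assuming the data is shuffled without replacement so that no sample can appear in both the training and the test part of a fold. `k_fold_indices` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a permutation: no index repeats
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs the remainder when n_samples % n_folds != 0
        stop = n_samples if fold == n_folds - 1 else start + fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, test in folds:
    print(len(train), len(test))
```

Averaging the per-fold accuracy over these splits gives an estimate in which each sample contributes to the test score exactly once.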