We use this notebook to randomly select the indexes of the training and test examples, using the data in Data_v7.csv:
- 77 rows: 41 PD + 36 Controls, 
- 130 columns: Patient, Age, Turning Time (avg between exp 1 and exp 3), 42 indicators x 3 experiments, Label 

In the Parkinsonism & Related Disorders article we used every variable except Patient, Age and Label (127 variables). Here we also drop Turning Time and use only the 126 indicator variables.

In [1]:
import pandas as pd
import numpy as np
import random
from sklearn import preprocessing, svm, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
def sampling(original,n):
    """This method randomly selects n elements from original (it works for lists and sets)
    and returns a list of selected elements (selected) and the original list/set without 
    those selected elements (new). The method does not alter the original list/set.
    """
    # list() keeps this working for sets: random.sample rejects sets since Python 3.11
    selected = random.sample(list(original),n)
    new = original.copy()
    for i in selected:
        new.remove(i)
    return selected,new

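def getError(model,train_x, train_y,test_x, test_y):
    """Fit the classifier named by model (default hyperparameters) on the training data
    and return the training error, the test error and the predictions on test_x.
    """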
    if model == "SVM":
        clf = svm.SVC()
    elif model == "LR":
        clf = LogisticRegression()
    elif model == "NN":
        clf = MLPClassifier()
    elif model == "kNN":
        clf = KNeighborsClassifier()
    elif model == "DT":
        clf = tree.DecisionTreeClassifier()
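    elif model == "RF":
        clf = RandomForestClassifier()
    else:
        # without this branch an unknown name fails later with a NameError on clf
        raise ValueError("Unknown model: %s" % model)
    clf = clf.fit(train_x, train_y)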
    train_error = 1 - clf.score(train_x, train_y)
    test_error = 1 - clf.score(test_x, test_y)
    return train_error,test_error,clf.predict(test_x)
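
As a quick check of the `sampling` helper defined at the top of this cell, the sketch below restates it in self-contained form and confirms that the input is left untouched. The `list(original)` conversion lets the same code accept sets, which `random.sample` rejects from Python 3.11 on. The seed and the toy pools are arbitrary choices made only for this illustration.

```python
import random

def sampling(original, n):
    """Return (selected, remaining) without modifying `original`."""
    selected = random.sample(list(original), n)
    remaining = original.copy()
    for item in selected:
        remaining.remove(item)
    return selected, remaining

random.seed(0)  # arbitrary seed, only to make the demo repeatable

# Works on a list...
pool = [10, 11, 12, 13, 14]
picked, rest = sampling(pool, 2)
print(picked, rest, pool)

# ...and on a set (the returned remainder is then a set as well).
pool_set = {"a", "b", "c"}
picked_set, rest_set = sampling(pool_set, 1)
print(picked_set, rest_set, pool_set)
```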

In [3]:
table7 = pd.read_csv("Data_v7.csv", header=0, sep=",")
types = {"EPI":1,"EPG2019S":1,"CONTROLES":0,"AsG2019S-":0} 
# Drop Patient, Age and Turning Time (first three columns) and Label (last column)
data = np.array(table7.iloc[:,3:-1])
labels = np.array(table7.iloc[:,-1])
labels = np.array([types[lab] for lab in labels])
# scaling data
scaler = preprocessing.StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
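
For reference, `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. The following is a minimal pure-Python sketch of that transformation on made-up values. The function name `standardize_columns` is hypothetical, and the sketch assumes that no column is constant.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Column-wise (x - mean) / std, with the population standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # assumes no constant column (std > 0)
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

toy = [[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]]  # made-up values, not from Data_v7.csv
scaled = standardize_columns(toy)
for row in scaled:
    print(row)
```

After scaling, both toy columns hold the same values even though their original ranges differ by two orders of magnitude, which is the point of the step for distance-based models such as kNN and SVM.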

The following methodology was used to sample data for training and testing: at each step, two positive examples and two negative examples were randomly extracted. The first of each pair was added to the training set and the second to the test set. This way each set contains an equal number of positive and negative examples.

In [4]:
pos_labels = [i for i,elem in enumerate(labels) if elem==1]
neg_labels = [i for i,elem in enumerate(labels) if elem==0]
n = min(len(pos_labels),len(neg_labels))
train_index, test_index = [],[]
for _ in range(n // 2):
    x_p,pos_labels = sampling(pos_labels,2) 
    x_n,neg_labels = sampling(neg_labels,2)  
    train_index += [x_p[0],x_n[0]]
    test_index += [x_p[1],x_n[1]]

Since there were more positive than negative examples (41 vs 36), we kept sampling two positive examples while possible, adding the first to the training set and the second to the test set.

In [5]:
while len(pos_labels)>=2:
    x_p,pos_labels = sampling(pos_labels,2) 
    train_index += x_p[:1]
    test_index += x_p[-1:]
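
The bookkeeping of the two sampling stages can be traced with plain arithmetic. This is only a sketch that assumes the class counts stated at the top of the notebook (41 positive, 36 negative); it does not touch the data.

```python
n_pos, n_neg = 41, 36  # class counts stated at the top of the notebook

# Stage 1: each round draws 2 positives and 2 negatives,
# sending one of each to the training set and one of each to the test set.
rounds = min(n_pos, n_neg) // 2
train_size = test_size = 2 * rounds
pos_left = n_pos - 2 * rounds
neg_left = n_neg - 2 * rounds

# Stage 2: keep drawing pairs of positives while at least two remain.
extra_rounds = pos_left // 2
train_size += extra_rounds
test_size += extra_rounds
pos_left -= 2 * extra_rounds

print(train_size, test_size, pos_left, neg_left)  # 38 38 1 0
```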

Only one example was left out. Since all 36 negative examples are used in the first stage, it is a positive one.

The resulting training and test sets are as follows (we save them here for further use):

In [6]:
train_index = [23, 67, 32, 17, 25, 70, 45, 1, 42, 66, 55, 65, 54, 68, 40, 12, 38, 4, 37, 9, 33, 8, 36, 71, 34, 16, 49, 
               14, 48, 2, 43, 3, 39, 75, 56, 64, 47, 28]
test_index = [21, 63, 57, 0, 44, 6, 50, 15, 52, 62, 27, 18, 53, 72, 20, 76, 41, 60, 51, 10, 58, 5, 24, 11, 19, 69, 31, 
              7, 26, 61, 59, 73, 30, 74, 29, 13, 46, 35]
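
A short sanity check on the saved lists (copied verbatim from this cell) confirms that they are disjoint, have the same size, and identifies the single index that appears in neither. The total of 77 examples is taken from the dataset description at the top of the notebook.

```python
train_index = [23, 67, 32, 17, 25, 70, 45, 1, 42, 66, 55, 65, 54, 68, 40, 12, 38, 4, 37, 9, 33, 8, 36, 71, 34, 16, 49,
               14, 48, 2, 43, 3, 39, 75, 56, 64, 47, 28]
test_index = [21, 63, 57, 0, 44, 6, 50, 15, 52, 62, 27, 18, 53, 72, 20, 76, 41, 60, 51, 10, 58, 5, 24, 11, 19, 69, 31,
              7, 26, 61, 59, 73, 30, 74, 29, 13, 46, 35]
n_examples = 77  # number of rows in Data_v7.csv according to the description

overlap = set(train_index) & set(test_index)
left_out = sorted(set(range(n_examples)) - set(train_index) - set(test_index))

print(len(train_index), len(test_index))  # 38 38
print(overlap)                            # set()
print(left_out)                           # [22]
```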

Now we want to reserve 65\% of the data for training (50 examples) and 35\% for testing (27 examples). Again we pick them randomly, but we want to make sure that both sets contain "easy" and "difficult" examples. To decide which is which, we count how many classifiers misclassify each example in a leave-one-out setting.

In [7]:
models = ["SVM","LR","NN","kNN","DT","RF"]
wrong_class = {"SVM":[],"LR":[],"NN":[],"kNN":[],"DT":[],"RF":[]}
user_idx = range(len(labels))
for x in user_idx:
    # leave example x out and train on all the others
    trainLOO = [i for i in user_idx if i != x]
    testLOO = [x]
    train_x,test_x = data[trainLOO],data[testLOO]
    train_y,test_y = labels[trainLOO],labels[testLOO]
    for model in models:
        train_error,test_error,pred_labels = getError(model,train_x, train_y,test_x, test_y)
        # both arrays hold a single element: compare the scalars explicitly
        if test_y[0]!=pred_labels[0]:
            wrong_class[model].append(x)
difficult = {i:[] for i in range(len(models)+1)}
with open("Difficult.csv","w") as f:
    f.write("Index,"+",".join(models)+"\n")
    for i in range(len(labels)):
        # a 1 means that the model misclassifies that example
        aux = [1 if i in wrong_class[model] else 0 for model in models]
        difficult[sum(aux)].append(i)
        f.write("%s,"%i+",".join([str(j) for j in aux])+"\n")
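
To make the bucketing concrete, here is the same grouping logic on a toy input. The three model names and the misclassified indices are made up for illustration; each example ends up in the bucket whose key is the number of models that got it wrong.

```python
models = ["A", "B", "C"]                             # placeholder model names
wrong_class = {"A": [0, 3], "B": [3], "C": [3, 4]}   # made-up misclassified indices
n_examples = 5

# One bucket per possible number of misclassifying models: 0, 1, ..., len(models)
difficult = {k: [] for k in range(len(models) + 1)}
for i in range(n_examples):
    flags = [1 if i in wrong_class[m] else 0 for m in models]
    difficult[sum(flags)].append(i)

print(difficult)  # {0: [1, 2], 1: [0, 4], 2: [], 3: [3]}
```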


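We will select a test set that contains 35% of the examples at each difficulty level, where the level of an example is the number of classifiers (from 0 to 6) that misclassify it.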

In [8]:
test_set = []
for i in range(len(models)+1):  # difficulty levels 0..6
    test_set += random.sample(difficult[i],round(0.35*len(difficult[i]))) 
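
The per-level sampling can be illustrated on toy buckets. The bucket contents and the seed below are assumptions made for this sketch. Note that the number drawn from each bucket is rounded independently, so the overall test fraction is only approximately 35%, and that an empty bucket simply contributes nothing.

```python
import random

random.seed(1)  # arbitrary seed, only to make the demo repeatable

# Toy buckets: difficulty level -> example indices (made-up sizes 20, 6, 0 and 3)
difficult = {
    0: list(range(0, 20)),
    1: list(range(20, 26)),
    2: [],
    3: list(range(26, 29)),
}

test_set = []
per_level = {}
for level, members in difficult.items():
    k = round(0.35 * len(members))  # rounded separately for each level
    per_level[level] = k
    test_set += random.sample(members, k)

print(per_level)      # {0: 7, 1: 2, 2: 0, 3: 1}
print(len(test_set))  # 10
```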

This is the resulting split:

In [9]:
test_set = [3, 55, 24, 31, 29, 53, 30, 9, 25, 75, 49, 56, 44, 41, 36, 6, 60, 23, 54, 57, 63, 47, 20, 26, 2, 38, 61]
train_set = [0, 1, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 27, 28, 32, 33, 34, 35, 37, 39, 
             40, 42, 43, 45, 46, 48, 50, 51, 52, 58, 59, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 76]

There are 51.85\% positive examples in the test set (14 of 27) and 54\% positive examples in the training set (27 of 50).
The proportions are reasonably balanced, so we keep this split.
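
The sizes of the final split can be verified directly from the saved lists (copied verbatim from the last cell). The positive counts below are not read from the labels; they are derived from the percentages quoted above, as a consistency check against the 41 positive examples in the dataset.

```python
test_set = [3, 55, 24, 31, 29, 53, 30, 9, 25, 75, 49, 56, 44, 41, 36, 6, 60, 23, 54, 57, 63, 47, 20, 26, 2, 38, 61]
train_set = [0, 1, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 27, 28, 32, 33, 34, 35, 37, 39,
             40, 42, 43, 45, 46, 48, 50, 51, 52, 58, 59, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 76]

total = len(test_set) + len(train_set)
covers_all = set(test_set) | set(train_set) == set(range(77))
test_share = round(100 * len(test_set) / total, 2)

# Positive counts implied by the quoted percentages (51.85% and 54%)
pos_test = round(0.5185 * len(test_set))
pos_train = round(0.54 * len(train_set))

print(len(test_set), len(train_set), total)  # 27 50 77
print(covers_all)                            # True
print(test_share)                            # 35.06
print(pos_test, pos_train)                   # 14 27
```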