<div style="text-align: left;">
<table style="width:100%; background-color:transparent;">
  <tr style="background-color:transparent;">
    <td style="background-color:transparent;">[<img src="http://project.inria.fr/saclaycds/files/2017/02/logoUPSayPlusCDS_990.png" width="70%">](http://www.datascience-paris-saclay.fr)</td>
    <td style="background-color:transparent;">[<img src="https://paris-saclay-cds.github.io/autism_challenge/images/institut_pasteur_logo.svg" width="30%">](https://research.pasteur.fr/en/team/group-roberto-toro/)</td>
  </tr>
</table> 
</div>

<center><h1>Imaging-psychiatry challenge: predicting autism</h1></center>

<center><h3>A data challenge on Autism Spectrum Disorder detection</h3></center>
<br/>
<center>_Roberto Toro (Institut Pasteur), Nicolas Traut (Institut Pasteur), Anita Beggiato (Institut Pasteur), Katja Heuer (Institut Pasteur),<br /> Gael Varoquaux (Inria, Parietal), Alex Gramfort (Inria, Parietal), Balazs Kegl (LAL),<br /> Guillaume Lemaitre (CDS), Alexandre Boucaud (CDS), and Joris van den Bossche (CDS)_</center>

## 0. Set up your conda environment

Before getting into the nitty-gritty, make sure you have installed all the required packages listed in the `ami_environment.yml` file.


## 1. Load the data

We start by downloading the data from the Internet.

In [1]:
from problem import get_train_data
from problem import get_test_data

data_train, labels_train = get_train_data()
data_test, labels_test = get_test_data()

### *Task 1*

Print the number of males and females in the training and test sets.

In [2]:
print("Training Data Gender Data")
print(data_train["participants_sex"].value_counts())

print("Test Data Gender Data")
print(data_test["participants_sex"].value_counts())


Training Data Gender Data
M    900
F    227
Name: participants_sex, dtype: int64
Test Data Gender Data
M    20
F     3
Name: participants_sex, dtype: int64
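
The counts above can also be expressed as proportions, which makes the two sets easier to compare. The following is a minimal sketch, assuming a small hypothetical DataFrame `toy` that stands in for `data_train`; only the column name `participants_sex` is taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for data_train; only the column name comes from the notebook.
toy = pd.DataFrame({"participants_sex": ["M", "M", "M", "F"]})

# normalize=True returns fractions of the total instead of raw counts.
proportions = toy["participants_sex"].value_counts(normalize=True)
print(proportions)
```

Applied to the counts printed above, this corresponds to roughly 80% males in the training set (900 of 1127) and roughly 87% in the test set (20 of 23).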


## 2. Define the evaluation

In [3]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from problem import get_cv
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedKFold

def evaluation(X, y):
    pipe = make_pipeline(FeatureExtractor(), Classifier())
    cv = get_cv(X, y)
    # cv = StratifiedKFold(n_splits=5, random_state=42)
    results = cross_validate(pipe, X, y, scoring=['roc_auc', 'accuracy'], cv=cv,
                             verbose=1, return_train_score=True,
                             n_jobs=2)
    
    return results

def evaluation_predict(X,y):
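    # Note: the cross-validation passed to cross_validate (from get_cv) is a
    # StratifiedShuffleSplit, which can resample the same participant into several
    # test sets. cross_val_predict needs every sample predicted exactly once, hence
    # the StratifiedKFold partition used here.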
    pipe = make_pipeline(FeatureExtractor(), Classifier())
    cv = StratifiedKFold(n_splits=5, shuffle = True, random_state=42) 
    
    results = cross_val_predict(pipe, X, y, cv=cv,
                             verbose=1, n_jobs=2, method='predict')
    return results

### *Task 2*
Print the proportion of males and females in each fold of the cross-validation.

In [4]:
import warnings
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
warnings.filterwarnings("ignore", category=np.ModuleDeprecationWarning)
warnings.filterwarnings("ignore", message=".*`np.*` is a deprecated alias.*")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) 
data_train_sex = np.array(data_train['participants_sex'])
fold_number = 1
for train_index, test_index in cv.split(data_train,labels_train):
    train = data_train_sex[train_index]
    test = data_train_sex[test_index]

    train_male_count = np.count_nonzero(train == 'M')
    train_female_count = np.count_nonzero(train == 'F')

    test_male_count = np.count_nonzero(test == 'M')
    test_female_count = np.count_nonzero(test == 'F')

    print("Fold ", fold_number, ": Training Gender Proportion ", "Female 1 : Male ", round(train_male_count/train_female_count, 2), " & ", train_female_count, ":", train_male_count, sep = "")
    print("Fold ", fold_number, ": Test Gender Proportion ", "Female 1 : Male ", round(test_male_count/test_female_count, 2), " & ", test_female_count, ":", test_male_count, sep = "")
    fold_number += 1


Fold 1: Training Gender Proportion Female 1 : Male 3.74 & 190:711
Fold 1: Test Gender Proportion Female 1 : Male 5.11 & 37:189
Fold 2: Training Gender Proportion Female 1 : Male 4.24 & 172:729
Fold 2: Test Gender Proportion Female 1 : Male 3.11 & 55:171
Fold 3: Training Gender Proportion Female 1 : Male 3.88 & 185:717
Fold 3: Test Gender Proportion Female 1 : Male 4.36 & 42:183
Fold 4: Training Gender Proportion Female 1 : Male 3.96 & 182:720
Fold 4: Test Gender Proportion Female 1 : Male 4.0 & 45:180
Fold 5: Training Gender Proportion Female 1 : Male 4.04 & 179:723
Fold 5: Test Gender Proportion Female 1 : Male 3.69 & 48:177
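
The male-to-female ratio of the test folds above varies from 3.11 to 5.11 because the split is stratified on the diagnosis label only. One possible way to even this out, shown here as a hedged sketch on synthetic data rather than as part of the original analysis, is to stratify on a key that combines label and sex (`joint_key` is a name introduced for this example).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 80 males and 20 females, with balanced labels within each sex.
sex = np.array(["M"] * 80 + ["F"] * 20)
labels = np.array([0, 1] * 50)
X = np.zeros((len(labels), 1))

# Combine sex and label into one key, e.g. "M_0", so both are stratified.
joint_key = np.array(["{}_{}".format(s, l) for s, l in zip(sex, labels)])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
female_counts = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X, joint_key), start=1):
    n_female = int(np.sum(sex[test_idx] == "F"))
    n_male = int(np.sum(sex[test_idx] == "M"))
    female_counts.append(n_female)
    print("Fold {}: test set has {} females and {} males".format(
        fold, n_female, n_male))
```

In this toy setting each test fold receives the same number of females. On real data the counts would only be approximately equal, and very small groups may not support the requested number of splits.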


## 3. Load the submission

Each submission defines a `FeatureExtractor` and a `Classifier`. It relies on:

* the file `submissions/<submission_name>/feature_extractor.py` corresponding to the feature extractor;
* the file `submissions/<submission_name>/classifier.py` corresponding to the classifier.

In the cells below, you can change `<submission_name>` to load the desired solution and then run it.
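
As an alternative to editing import lines by hand, a submission can be selected by name. This is a minimal sketch that assumes the `submissions/<submission_name>/` layout described above and that the notebook is run from the directory containing `submissions`; the helpers `submission_modules` and `load_submission` are hypothetical names, not part of the challenge code.

```python
import importlib

def submission_modules(submission_name):
    """Return the dotted module paths of a submission's two files."""
    base = "submissions.{}".format(submission_name)
    return base + ".feature_extractor", base + ".classifier"

def load_submission(submission_name):
    """Import and return the (FeatureExtractor, Classifier) classes."""
    fe_path, clf_path = submission_modules(submission_name)
    feature_extractor = importlib.import_module(fe_path).FeatureExtractor
    classifier = importlib.import_module(clf_path).Classifier
    return feature_extractor, classifier

print(submission_modules("starting_kit"))
```

Calling `FeatureExtractor, Classifier = load_submission("starting_kit")` would then bind the two names used by the evaluation functions, provided the package is importable.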

### 3.1 - Feature extractor

In [5]:
# %load submissions/starting_kit/feature_extractor.py

### 3.2 - Classifier

In [6]:
# %load submissions/starting_kit/classifier.py

## 4. Run the evaluation

In [7]:
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
warnings.filterwarnings("ignore", category=np.ModuleDeprecationWarning)
warnings.filterwarnings("ignore", message=".*`np.*` is a deprecated alias.*")

from submissions.nguigui_original.classifier import Classifier
from submissions.nguigui_original.feature_extractor import FeatureExtractor

# Make sure you download the functional data, if it is not already stored on your drive
from download_data import fetch_fmri_time_series
fetch_fmri_time_series(atlas='all')

Downloading the data from https://zenodo.org/record/3625740/files/power_2011.zip ...



#### Structural MRI features

In [None]:
results = evaluation(data_train, labels_train)

print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(
   np.mean(results['train_roc_auc']), np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(
   np.mean(results['test_roc_auc']), np.std(results['test_roc_auc'])))

print("Training score accuracy: {:.3f} +- {:.3f}".format(
   np.mean(results['train_accuracy']), np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(
   np.mean(results['test_accuracy']), np.std(results['test_accuracy'])))

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


Training score ROC-AUC: 1.000 +- 0.000
Validation score ROC-AUC: 0.735 +- 0.022 

Training score accuracy: 1.000 +- 0.000
Validation score accuracy: 0.670 +- 0.027


[Parallel(n_jobs=2)]: Done   8 out of   8 | elapsed:  8.7min finished
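
The four print statements above all follow the same pattern, so the summary can be factored into a small helper. The sketch below assumes a dictionary shaped like the output of `cross_validate` with `return_train_score=True`; the function `summarize_scores` and the scores in `fake_results` are made up for illustration.

```python
import numpy as np

def summarize_scores(results, metrics=("roc_auc", "accuracy")):
    """Return {key: (mean, std)} for the train and test scores of each metric."""
    summary = {}
    for metric in metrics:
        for split in ("train", "test"):
            key = "{}_{}".format(split, metric)
            scores = np.asarray(results[key], dtype=float)
            summary[key] = (scores.mean(), scores.std())
    return summary

# Made-up scores, only to show the shape of the input and output.
fake_results = {
    "train_roc_auc": [1.0, 1.0], "test_roc_auc": [0.7, 0.8],
    "train_accuracy": [1.0, 1.0], "test_accuracy": [0.6, 0.7],
}
for key, (mean, std) in summarize_scores(fake_results).items():
    print("{}: {:.3f} +- {:.3f}".format(key, mean, std))
```

Printing training and validation scores side by side makes gaps such as the one above (1.000 versus 0.735 ROC-AUC) easy to spot.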


In [None]:
warnings.filterwarnings("ignore", message=".*`np.*` is a deprecated alias.*")
predictions = evaluation_predict(data_train,labels_train)


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:  4.2min finished


### *Task 3*

Complete this function to report the model's accuracy for male and female samples separately for each fold.

In [None]:
def accuracy_protectedGroups(predictions, cv_split, data, file_name):
    """Print the accuracy on male and female samples for each fold.

    Each (prediction, label) pair is also written to <file_name>.txt.
    Labels are read from the notebook-level `labels_train`.
    """
    warnings.filterwarnings("ignore", message=".*`np.*` is a deprecated alias.*")

    labels = np.array(labels_train)
    sex = np.array(data['participants_sex'])

    with open(file_name + ".txt", "w") as f:
        # Use the splits passed in as `cv_split` rather than the global `cv`.
        for train_index, test_index in cv_split:
            fold_pred = predictions[test_index]
            fold_labels = labels[test_index]
            is_male = sex[test_index] == 'M'

            # Log each prediction next to its true label.
            for pred, label in zip(fold_pred, fold_labels):
                f.write("{},{}\n".format(pred, label))

            # Count correct predictions separately for each group.
            correct = np.round(fold_pred) == fold_labels
            male_total = int(is_male.sum())
            female_total = int((~is_male).sum())
            male_accuracy = int((correct & is_male).sum())
            female_accuracy = int((correct & ~is_male).sum())

            print("Male: ", male_accuracy, " out of ", male_total, ", ", round(male_accuracy/male_total*100, 2), "%. Female: ", female_accuracy, " out of ", female_total, ", ", round(female_accuracy/female_total*100, 2), "%. Total participants: ", female_total + male_total, sep="")

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle = True, random_state=42) 
cv_split = cv.split(data_train, labels_train)


accuracy_protectedGroups(predictions, cv_split, data_train, "mk_original")

Male: 130 out of 189, 68.78%. Female: 22 out of 37, 59.46%. Total participants: 226
Male: 115 out of 171, 67.25%. Female: 42 out of 55, 76.36%. Total participants: 226
Male: 118 out of 183, 64.48%. Female: 36 out of 42, 85.71%. Total participants: 225
Male: 119 out of 180, 66.11%. Female: 33 out of 45, 73.33%. Total participants: 225
Male: 120 out of 177, 67.8%. Female: 36 out of 48, 75.0%. Total participants: 225
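
The same per-group accuracies can be obtained with a pandas `groupby`, which also makes it easy to pool the folds. This is a minimal sketch on made-up predictions; the DataFrame `df` and its column names are assumptions for illustration.

```python
import pandas as pd

# Toy predictions, labels and fold assignments; all values are made up.
df = pd.DataFrame({
    "fold": [1, 1, 1, 1, 2, 2, 2, 2],
    "sex": ["M", "M", "F", "F", "M", "M", "F", "F"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
    "pred": [1, 0, 0, 0, 1, 1, 1, 0],
})
df["correct"] = df["pred"] == df["label"]

# The mean of a boolean column is the accuracy within each (fold, sex) group.
per_fold = df.groupby(["fold", "sex"])["correct"].mean().unstack("sex")
print(per_fold)

# Pooled accuracy per sex across all folds.
overall = df.groupby("sex")["correct"].mean()
print(overall)
```

Pooling the fold counts printed above in the same way gives 602 correct out of 900 males (about 66.9%) and 169 correct out of 227 females (about 74.4%).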
