## The 1000 Brains Study Dataset

<img src='logo.png'>

### Exploratory Data Analysis

The first 94830 columns are the unique brain connections, i.e. the pairwise correlations between the activations in different brain regions.
There are 436 brain regions, so there are 436*435/2 = 94830 unique connections. The numbers are created by first computing the full correlation matrix (436 x 436) and then converting its lower triangle (excluding the diagonal) into a vector.
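
As a rough illustration of that construction, the sketch below builds a correlation matrix and flattens its strict lower triangle. The random time series, the number of time points and all variable names are assumptions made for the example; they are not part of the dataset or of the original preprocessing.

```python
import numpy as np

n_regions = 436
n_timepoints = 200  # hypothetical length of each regional time series

# Stand-in data: random activations, one column per brain region (assumption).
rng = np.random.default_rng(0)
timeseries = rng.standard_normal((n_timepoints, n_regions))

# Full correlation matrix between regions, shape (436, 436).
corr = np.corrcoef(timeseries, rowvar=False)

# Indices of the strict lower triangle (k=-1 excludes the diagonal).
rows, cols = np.tril_indices(n_regions, k=-1)
connections = corr[rows, cols]

print(corr.shape)
print(connections.shape)
```

The vector length equals `n_regions * (n_regions - 1) // 2`, which is where the 94830 feature columns come from.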
 
The last two columns are age and sex.
 
When the file is loaded with `pandas.read_csv`, the first 94830 columns are labelled only by number (“0” to “94829”), while the last two columns are labelled “age” and “sex_f0_m1”.
 
In the connection vector, the first number is the connection between regions 2 and 1, the second is the connection between regions 3 and 1, the third between regions 3 and 2, … and so on.
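
The ordering described above can be sketched as a lookup from column label to region pair. The helper name `column_to_regions` is hypothetical, and the sketch assumes 0-based column labels (as `read_csv` produces), 1-based region numbers (as in the text) and a row-by-row walk through the lower triangle.

```python
import numpy as np

N_REGIONS = 436


def column_to_regions(column, n_regions=N_REGIONS):
    """Return the 1-based (region_a, region_b) pair for a 0-based column label.

    Hypothetical helper: assumes the lower triangle is read row by row,
    which matches the order (2, 1), (3, 1), (3, 2), ... given in the text.
    """
    rows, cols = np.tril_indices(n_regions, k=-1)
    return int(rows[column]) + 1, int(cols[column]) + 1


for column in range(3):
    print(column, column_to_regions(column))
```

Under these assumptions the last feature column, `94829`, would correspond to the connection between regions 436 and 435.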
 
The rows are simply the different subjects.

In [1]:
import pandas as pd
from IPython.display import Image
df_train = pd.read_csv('train.csv') 

In [6]:
df_train.columns

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '94822', '94823', '94824', '94825', '94826', '94827', '94828', '94829',
       'age', 'sex_f0_m1'],
      dtype='object', length=94832)

In [2]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
import joblib

In [3]:
def comp_cv(x, y, svc, n_splits=10):
    """Return the mean accuracy of `svc` over a K-fold cross-validation."""
    predsum = 0.0
    cv = KFold(n_splits=n_splits)
    for train, test in cv.split(X=x):
        # fit on the training folds, score on the held-out fold
        svc.fit(x[train], y[train])
        pred = svc.predict(x[test])
        predsum += (pred == y[test]).mean()
    # average the per-fold accuracies once, after the loop
    return predsum / n_splits

In [5]:
data_path = ''
save_dir = '.'

X_train = pd.read_csv(f'{data_path}train.csv') 
y_train = X_train[['sex_f0_m1']]

X_train_feat = X_train.drop(columns=['sex_f0_m1','age'])

n_samples_train = X_train_feat.shape[0]
n_features_train = X_train_feat.shape[1]

In [15]:
X_train_feat.shape

(751, 94830)

In [7]:
n_components = min(n_samples_train, n_features_train)
pca = PCA(n_components=n_components)
pca.fit(X_train_feat)
X_train_red = pca.transform(X_train_feat)

y_train = np.ravel(y_train)

C_range = np.logspace(-5, 10, 16)
gamma_range = np.logspace(-10, 5, 16)

param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
grid.fit(X_train_red, y_train)

print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

svc = grid.best_estimator_
joblib.dump(svc, f'{save_dir}/model_1000brains')

# acc = comp_cv(X_train_red, y_train, svc)
pred_within = svc.predict(X_train_red)
acc_within = (pred_within == y_train).sum() / float(len(y_train))
print("Within sample prediction %0.3f" % (acc_within))


The best parameters are {'C': 10.0, 'gamma': 0.0001} with a score of 0.80
Within sample prediction 1.000


In [25]:
X_test = pd.read_csv(f'{data_path}submission_valid.csv') 
X_test_feat = X_test.drop(columns=['sex_f0_m1','age'])
X_test_red = pca.transform(X_test_feat)

y_pred = svc.predict(X_test_red)

### Submission

In [32]:
df=pd.DataFrame(y_pred, columns=['sex_f0_m1'])

In [33]:
df.to_csv("submission.csv", index=False)

#### You can open the `submission.csv` file (File -> Open) and download it! After you download it, you can upload it to the challenge frontend. The file should contain only the predictions, not the features.
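
Before uploading, it may be worth reading the file back to confirm that it holds a single prediction column. The sketch below does this on a small stand-in array written to a temporary directory; `y_pred_example` and the 0/1 label check are assumptions for illustration, not requirements stated by the challenge.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in predictions (assumption); in the notebook this would be y_pred.
y_pred_example = np.array([0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    # Write the file the same way as the submission cell does.
    pd.DataFrame(y_pred_example, columns=["sex_f0_m1"]).to_csv(path, index=False)
    # Read it back to inspect what would actually be uploaded.
    check = pd.read_csv(path)

# Exactly one column, and every label is 0 (female) or 1 (male).
has_only_predictions = list(check.columns) == ["sex_f0_m1"]
has_valid_labels = bool(check["sex_f0_m1"].isin([0, 1]).all())
print(has_only_predictions, has_valid_labels, len(check))
```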