## Patient-wise Train/Test Split

Following [Spanhol *et al.*](http://www.inf.ufpr.br/lesoliveira/download/TBME-00608-2015-R2-preprint.pdf), the data are split patient-wise: all images from a given patient go to either the training set or the test set, never both. This keeps images of test patients from leaking into the training set and gives a more honest estimate of how well the classifier generalizes to unseen patients. I randomly assign 75% of the patients to the training set and the remaining 25% to the test set. The resulting train and test data frames are saved to CSV so they can be used later to build the classifier.

In [1]:
import pandas as pd
import numpy as np

In [2]:
np.random.seed(42)  # fix the seed so the random patient selection is reproducible

In [3]:
df = pd.read_csv('./specc.csv')

In [4]:
patients = list(df['Patient_Id'].unique())

In [5]:
train_frac = 0.75

In [6]:
# int() truncates, so the training set gets floor(train_frac * number of patients) patients
train_size = int(train_frac * len(patients))

In [7]:
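train_patients = list(np.random.choice(patients, size=train_size, replace=False))
# Keep the remaining patients in their original order; a plain set difference
# does not guarantee a stable ordering from run to run.
train_set = set(train_patients)
test_patients = [p for p in patients if p not in train_set]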

In [8]:
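# Select every row belonging to the chosen patients; after reset_index,
# Patient_Id becomes the first column and rows follow the order of the patient lists.
df_train = df.set_index('Patient_Id').loc[train_patients].reset_index()
df_test  = df.set_index('Patient_Id').loc[test_patients].reset_index()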

In [29]:
df_train.to_csv('train_data.csv', index=False)
df_test.to_csv('test_data.csv', index=False)
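
A quick sanity check can confirm that a split like the one above really is patient-wise, that is, that no patient appears in both sets. The following is a minimal sketch, not part of the original notebook: the helper name `check_patient_split` and the toy IDs are hypothetical, and the function accepts any iterable of patient IDs so that it can be tried without the CSV file.

```python
def check_patient_split(train_ids, test_ids):
    """Summarize a patient-wise split; raise if a patient is in both sets.

    Hypothetical helper. Accepts any iterable of patient IDs, with one
    entry per image row (duplicates are expected and are collapsed).
    """
    train_set, test_set = set(train_ids), set(test_ids)
    overlap = train_set & test_set
    if overlap:
        raise ValueError(f"Patients in both sets: {sorted(overlap, key=str)}")
    n_total = len(train_set) + len(test_set)
    return {
        "n_train_patients": len(train_set),
        "n_test_patients": len(test_set),
        # Fraction of *patients* (not images) that ended up in training.
        "train_patient_frac": len(train_set) / n_total if n_total else 0.0,
    }


# Toy IDs (made up for illustration), one entry per image row.
toy_train = ["P1", "P1", "P2", "P3", "P3"]
toy_test = ["P4", "P4"]
summary = check_patient_split(toy_train, toy_test)
print(summary)
```

With the data frames from this notebook, the call would be `check_patient_split(df_train['Patient_Id'], df_test['Patient_Id'])`. Because patients can contribute different numbers of images, the share of images in the training set may differ from the 75% share of patients.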