# Split Data into Train and Test

We randomly select 30% of the participants to keep as a holdout sample. We perform our analysis and model training on the remaining 70% and then use the holdout data to check whether our findings generalize to unseen data. The code below generates this random split. We stratify the sample by both the response column (mindfulness) and the study year (200s versus 300s), so that the training and test samples have approximately the same proportions of treatment/control and 200s/300s participants. The lookup table with the holdout indicator generated here is used to split the data in our other analyses, which ensures we always use the same participants for the holdout set.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


In [10]:
# Read in survey data
df = pd.read_csv('survey data.csv',
                 usecols = ['studyid', 'mindfulness'])
df.head()


Unnamed: 0,studyid,mindfulness
0,201,1
1,202,0
2,203,1
3,204,0
4,205,1


In [11]:
# Add indicator for 200s vs 300s
df['300s'] = np.where(df['studyid']>=300, 1, 0)

In [33]:
# Split data into train and test
# Stratify using mindfulness and 300s indicator
train, test = train_test_split(df, test_size=0.3, random_state=0, 
                               stratify=df[['mindfulness', '300s']])


In [43]:
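# Create indicator for sample being in holdout or not
# assign() returns new DataFrames, which avoids a possible SettingWithCopyWarning
test = test.assign(holdout=1)
train = train.assign(holdout=0)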

In [44]:
# Append train and test data back together
df = pd.concat([train, test], axis = 0)


In [45]:
df

Unnamed: 0,studyid,mindfulness,300s,holdout
77,349,1,1,0
42,314,1,1,0
22,223,1,0,0
6,207,0,0,0
61,333,0,1,0
...,...,...,...,...
63,335,1,1,1
8,209,1,0,1
16,217,0,0,1
24,225,0,0,1
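
One way to confirm that the stratification worked is to compare the share of each mindfulness/300s combination in the training and holdout samples. The sketch below is not part of the original analysis: it uses a small synthetic data set (the `demo` frame and its values are made up for illustration) and the same `train_test_split` call as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (made-up values, illustration only)
demo = pd.DataFrame({
    'studyid': list(range(201, 241)) + list(range(301, 341)),
    'mindfulness': np.tile([0, 1], 40),
})
demo['300s'] = np.where(demo['studyid'] >= 300, 1, 0)

# Same split settings as in the notebook
train_demo, test_demo = train_test_split(
    demo, test_size=0.3, random_state=0,
    stratify=demo[['mindfulness', '300s']])

demo_split = pd.concat(
    [train_demo.assign(holdout=0), test_demo.assign(holdout=1)], axis=0)

# Share of each mindfulness/300s combination within train (0) and holdout (1);
# each column sums to 1, and the two columns should be close to each other
balance = pd.crosstab(
    [demo_split['mindfulness'], demo_split['300s']],
    demo_split['holdout'],
    normalize='columns')
print(balance)
```

If the two columns differ noticeably on the real data, the stratification columns or the split settings are worth rechecking.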


In [46]:
df.to_csv('holdout_samples_lookup.csv', index = False)
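
In other analyses, the lookup table can be joined back onto a data set by `studyid` to recover the same split. The following is a minimal sketch under stated assumptions: the `other` frame and its `score` column are hypothetical, and the lookup is built inline here so the example is self-contained. In practice it would be read from `holdout_samples_lookup.csv`.

```python
import pandas as pd

# Stand-in for pd.read_csv('holdout_samples_lookup.csv'); values are made up
lookup = pd.DataFrame({
    'studyid': [201, 202, 301, 302],
    'holdout': [0, 1, 0, 1],
})

# Hypothetical data set from another analysis, keyed by the same studyid
other = pd.DataFrame({
    'studyid': [201, 202, 301, 302],
    'score': [3.2, 4.1, 2.8, 3.9],
})

# Attach the holdout indicator; validate guards against duplicate ids in the lookup
merged = other.merge(lookup[['studyid', 'holdout']],
                     on='studyid', how='left', validate='many_to_one')

# Every participant should have a holdout assignment
assert merged['holdout'].notna().all()

train_other = merged[merged['holdout'] == 0]
test_other = merged[merged['holdout'] == 1]
```

A left join keeps every row of the analysis data set, so participants missing from the lookup show up as missing `holdout` values instead of silently disappearing.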