# Split Data for Sherlock Processing

If you plan to generate in-silico predictions on Sherlock and have a large dataset to process (thousands of rows or more), it's highly recommended that you split the data into subsets and run the structure prediction scripts in parallel across many nodes using a batch job array. See 2.GeneratePredictions.ipynb for more details.

This notebook lets you choose how many rows go into each subset file, which in turn determines how many subset files are generated.
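As a rough illustration of how the subset size determines the number of files, the sketch below computes the file count with ceiling division and lists the zero-padded names the splitting loop would produce. The function name `plan_subsets` is hypothetical and is not part of this notebook.

```python
import math


def plan_subsets(rows_to_process, subset_size):
    """Return the subset file names expected for a given row count and subset size.

    Hypothetical helper for illustration only; it mirrors the `{i:03}.pkl`
    naming used by the splitting loop in this notebook.
    """
    if subset_size <= 0:
        raise ValueError("subset_size must be a positive integer")
    # The last file may hold fewer rows than the others, so round up.
    n_subsets = math.ceil(rows_to_process / subset_size)
    return [f"{i:03}.pkl" for i in range(n_subsets)]


names = plan_subsets(2500, 1000)
print(len(names), names)  # 3 ['000.pkl', '001.pkl', '002.pkl']
```

Note that the three-digit padding only keeps file names in sorted order for up to 1,000 subsets; beyond that, a wider padding would be needed if the processing script relies on name order.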

In [None]:
import os
import pandas as pd

# If True, read in data+predictions.pkl and include previously generated predictions in the
# newly generated subsets. This is useful if, for example, you've previously run predictions
# but need to extend them with additional metadata.
MERGE = True

job = input(
    "Job Name: whatever you'd like to call this processing run. We'll be storing the subset " +
    "files in SCRATCH under this name, and the processing script needs it to fetch them later: "
)
subsetSize = int(input(
    "Subset Size: how many rows you'd like in each subset file. Total number generated is " +
    "rowsToProcess / subsetSize, rounded up: "
))

# Pick your input file according to whether you are processing the whole dataset
# or just the high-quality sequences
inputFile = '../data/data_rdatOnly.pkl'
# inputFile = '../data/data_highQuality.pkl'
outputDataDir = f'{os.environ["SCRATCH"]}/{job}/data'

# Create the output directory
os.makedirs(outputDataDir, exist_ok=True)

# Grab the data
data = pd.read_pickle(inputFile)
rowsToProcess = data.shape[0]

if MERGE:
    # Attach previously generated prediction and timing columns, aligned on the row index
    processed_data = pd.read_pickle('../data/data+predictions.pkl')
    pred_cols = [col for col in processed_data.columns if col.endswith(('_PRED', '_time'))]
    data = data.join(processed_data[pred_cols])

# Loop over the data in steps of subsetSize and write each slice to a numbered file
for i, start in enumerate(range(0, rowsToProcess, subsetSize)):
    subset = data.iloc[start:start + subsetSize]
    subset.to_pickle(f'{outputDataDir}/{i:03}.pkl')

In [None]:
# OPTIONAL
# Load the first subset to check that the output is what you expected
test = pd.read_pickle(f'{outputDataDir}/000.pkl')
display(test)
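
For a more thorough check than inspecting a single file, the subsets can be read back and compared against the original rows. The following is a minimal sketch, assuming the subsets were written with the `{i:03}.pkl` naming used above and that there are fewer than 1,000 of them (so the names sort in row order); `check_subsets` is a hypothetical helper and is not part of this notebook.

```python
import glob
import os

import pandas as pd


def check_subsets(original, output_dir):
    """Return True if the subset files in output_dir concatenate back to `original`.

    Hypothetical helper for illustration only. Assumes every .pkl file in
    output_dir is a subset and that sorting by file name restores row order.
    """
    paths = sorted(glob.glob(os.path.join(output_dir, "*.pkl")))
    if not paths:
        return False
    # Read each subset back and stack them in file-name order.
    combined = pd.concat([pd.read_pickle(p) for p in paths])
    return combined.equals(original)
```

In this notebook it could be called as `check_subsets(data, outputDataDir)`. A `False` result may also mean that the output directory still holds files from an earlier run with the same job name, so it can be worth clearing the directory before re-splitting.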