# Collate subset files into one dataframe

If you generated the predictions on Sherlock using the batch array approach, you now have a folder full of pickle files with data and predictions. We need to collate those files into one dataframe for OpenKnotScore calculation.

In [None]:
import pandas
import numpy
import os
import re

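# Name of the job folder under $SCRATCH (left blank here; set it before running)
job = ''
dataDir = f'{os.environ["SCRATCH"]}/{job}/data'

# Load every numbered pickle (files named like 12.pkl), skipping anything else
frames = []
for fname in sorted(os.listdir(dataDir)):
    if not re.fullmatch(r'\d+\.pkl', fname):
        continue
    frames.append(pandas.read_pickle(os.path.join(dataDir, fname)))

# Stack the per-file dataframes into a single dataframe
df = pandas.concat(frames)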

Some of the jobs may have failed or timed out, leaving rows with missing data. This cell shows those rows and lists which input files contain them, so you can re-run the corresponding jobs.

In [None]:
# Check for missing data in the full dataframe
display(df[df.isna().any(axis=1)])

# Check for missing data in each input file, so you can rerun failed jobs.
# Each entry is the file's position in the sorted listing used to build `frames`,
# which need not equal the number in its file name.
missing = [i for i, frame in enumerate(frames) if frame.isna().any(axis=1).any()]

print(missing)
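
The indices printed above are positions in `frames`. Because the file listing is sorted as text, `10.pkl` comes before `2.pkl`, so a position does not necessarily match the number in the file name. A minimal sketch that reports file names instead is shown here. The helper `files_with_missing_data` and the demo data are hypothetical, and whether the number in a file name corresponds to an array task ID depends on how the batch jobs were set up.

```python
import os
import re
import tempfile

import numpy
import pandas


def files_with_missing_data(data_dir):
    """Return names of numbered pickle files in data_dir that contain any NaN."""
    names = [f for f in os.listdir(data_dir) if re.fullmatch(r'\d+\.pkl', f)]
    # Sort by the number in the file name, so 2.pkl comes before 10.pkl
    names.sort(key=lambda f: int(f[:-len('.pkl')]))
    bad = []
    for name in names:
        frame = pandas.read_pickle(os.path.join(data_dir, name))
        if frame.isna().any(axis=1).any():
            bad.append(name)
    return bad


# Self-contained demonstration with made-up data (not real predictions)
with tempfile.TemporaryDirectory() as demo_dir:
    pandas.DataFrame({'x': [1.0, 2.0]}).to_pickle(os.path.join(demo_dir, '2.pkl'))
    pandas.DataFrame({'x': [1.0, numpy.nan]}).to_pickle(os.path.join(demo_dir, '10.pkl'))
    pandas.DataFrame({'x': [numpy.nan]}).to_pickle(os.path.join(demo_dir, '3.pkl'))
    demo_missing = files_with_missing_data(demo_dir)

print(demo_missing)  # ['3.pkl', '10.pkl']
```

In the notebook you would call `files_with_missing_data(dataDir)` to get the list of files whose jobs need re-running.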

If you have no missing data, save the dataframe.

In [None]:
df.to_pickle("../data/data+predictions.pkl")
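
To avoid saving an incomplete dataframe by accident, the save can be guarded by the same missing-data check. This is only a sketch: `save_if_complete` is a hypothetical helper, not part of the original workflow, and the demo uses made-up data in a temporary folder.

```python
import os
import tempfile

import numpy
import pandas


def save_if_complete(frame, path):
    """Pickle frame to path only if it has no missing values; return True if saved."""
    if frame.isna().to_numpy().any():
        return False
    frame.to_pickle(path)
    return True


# Self-contained demonstration with made-up data (not real predictions)
with tempfile.TemporaryDirectory() as demo_dir:
    complete = pandas.DataFrame({'x': [1.0, 2.0]})
    incomplete = pandas.DataFrame({'x': [1.0, numpy.nan]})
    saved_complete = save_if_complete(complete, os.path.join(demo_dir, 'ok.pkl'))
    saved_incomplete = save_if_complete(incomplete, os.path.join(demo_dir, 'bad.pkl'))
    wrote_bad = os.path.exists(os.path.join(demo_dir, 'bad.pkl'))

print(saved_complete, saved_incomplete, wrote_bad)  # True False False
```

In the notebook this would be called as `save_if_complete(df, "../data/data+predictions.pkl")`.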