# Prepare Training Data

In this notebook, you will load and prepare the data for training your first GNN. We will load the processed JSON files into `networkx` `DiGraph` objects, convert them into PyTorch Geometric `Data` objects, and save them to disk so they can be loaded for model training.
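
To make the conversion step more concrete, here is a minimal, library-free sketch of one part of it. PyTorch Geometric stores graph connectivity as an `edge_index` of shape `[2, num_edges]` in COO format, so node identifiers have to be remapped to contiguous integers first. The JSON layout (`nodes`, `edges`) and the helper name `to_edge_index` are hypothetical and chosen only for illustration; the real `bin2ml` output and the `bin2mlpy` conversion may look different.

```python
import json

# Hypothetical graph JSON; the real bin2ml schema may differ.
raw = json.loads("""
{
  "nodes": [{"id": "entry"}, {"id": "loop"}, {"id": "exit"}],
  "edges": [["entry", "loop"], ["loop", "loop"], ["loop", "exit"]]
}
""")


def to_edge_index(graph: dict) -> list[list[int]]:
    """Map node ids to 0..N-1 and return [sources, targets] (COO layout)."""
    index_of = {node["id"]: i for i, node in enumerate(graph["nodes"])}
    sources = [index_of[src] for src, _ in graph["edges"]]
    targets = [index_of[dst] for _, dst in graph["edges"]]
    return [sources, targets]


edge_index = to_edge_index(raw)
print(edge_index)
```

With PyTorch Geometric installed, a nested list like this could be wrapped with `torch.tensor(edge_index, dtype=torch.long)` and passed to `Data(edge_index=...)`.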

**Note:** If you are loading your own processed data (i.e., the output from `bin2ml`), you will need to amend the constants defined in the code block below.

In [None]:
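# Directory containing the processed JSON graph files
PATH_TO_PROCESSED_JSON = "../../data/training/graphs"
# Output pickle files for the train and test splits
OUTPUT_TRAIN_PKL_FILENAME = "../train.pkl"
OUTPUT_TEST_PKL_FILENAME = "../test.pkl"
# Passed to joblib as n_jobs; -1 uses all available CPU cores
NUM_CPUS_TO_USE = -1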

In [None]:
from bin2mlpy.data_utils.convert_and_pickle import (
    get_all_filenames, process_single_graph, format_and_clean_up_data_objs,
    save_as_pickled_data, split_train_eval,
)
from joblib import Parallel, delayed
from tqdm import tqdm

## Get a list of filenames

In [None]:
filepaths = get_all_filenames(PATH_TO_PROCESSED_JSON)
print(f"Number of files: {len(filepaths)}")
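
As a rough idea of what collecting the filenames involves, the sketch below lists every `.json` file under a directory using only the standard library. The helper `list_json_files` is a hypothetical stand-in, not the implementation of `get_all_filenames`, which may filter or order files differently.

```python
import tempfile
from pathlib import Path


def list_json_files(root: str) -> list[str]:
    """Return sorted paths of all .json files under root (recursive)."""
    return sorted(str(p) for p in Path(root).rglob("*.json"))


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.json").write_text("{}")
    (Path(tmp) / "sub" / "b.json").write_text("{}")
    (Path(tmp) / "notes.txt").write_text("ignored")
    found = list_json_files(tmp)
    names = [Path(p).name for p in found]
    print(names)
```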

## Split filepaths into `train` and `eval`

In [None]:
train_filepaths, test_filepaths = split_train_eval(filepaths)

In [None]:
print(len(train_filepaths), len(test_filepaths))
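
For intuition, a file-level split can be sketched as a seeded shuffle followed by a slice. The function `split_paths`, the 20% test fraction, and the fixed seed are assumptions made for this example; `split_train_eval` may use a different ratio or strategy.

```python
import random


def split_paths(paths, test_fraction=0.2, seed=0):
    """Shuffle a copy of paths and split it into (train, test) lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


paths = [f"graph_{i}.json" for i in range(10)]
train, test = split_paths(paths)
print(len(train), len(test))
```

Seeding the shuffle keeps the split reproducible across runs, which makes later evaluation results easier to compare.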

## Load and convert to `Data` objects

In [None]:
# Convert each JSON file in parallel; tqdm tracks tasks as they are dispatched to workers
train_data_tensors = Parallel(n_jobs=NUM_CPUS_TO_USE)(delayed(process_single_graph)(filename) for filename in tqdm(train_filepaths))
test_data_tensors = Parallel(n_jobs=NUM_CPUS_TO_USE)(delayed(process_single_graph)(filename) for filename in tqdm(test_filepaths))

## Format and Clean `Data` Objects

In [None]:
train_data_tensors_clean = format_and_clean_up_data_objs(train_data_tensors)
test_data_tensors_clean = format_and_clean_up_data_objs(test_data_tensors)

## Save processed `Data` objects to pickle files

In [None]:
save_as_pickled_data(train_data_tensors_clean, OUTPUT_TRAIN_PKL_FILENAME)
save_as_pickled_data(test_data_tensors_clean, OUTPUT_TEST_PKL_FILENAME)
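
A quick round-trip check can confirm that a saved file loads back as expected. The sketch below assumes the output is a standard `pickle` dump of a Python list, which may not match exactly what `save_as_pickled_data` writes; plain dictionaries stand in for `Data` objects so that it runs without PyTorch Geometric.

```python
import pickle
import tempfile
from pathlib import Path

# Plain dicts stand in for PyTorch Geometric `Data` objects.
records = [{"num_nodes": 3, "label": "f1"}, {"num_nodes": 5, "label": "f2"}]

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "train.pkl"
    # Write the list to disk, then read it back.
    with open(pkl_path, "wb") as fh:
        pickle.dump(records, fh)
    with open(pkl_path, "rb") as fh:
        loaded = pickle.load(fh)

print(len(loaded), loaded == records)
```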