# Creating the training dataset

In this notebook we read the data files generated with the MadGraph + Pythia + Delphes simulation and create a training/testing dataset for the LHCdoctor network.

### Getting all the imports ready

In [None]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import awkward as ak
import uproot

from medica import *

### Reading from the JSON file

Here we assume that all the data are saved in a single file at the path "Data/shuffled_data.json".

In [None]:
Data_Path = "Data/shuffled_data.json"

full_data = read_json_to_awkward(Data_Path)

# Get all top-level branches and their information
branch_info(full_data)
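`read_json_to_awkward` and `branch_info` come from the `medica` module, whose internals are not shown in this notebook. As a rough sketch of what a read-and-summarise step of this kind might involve, the snippet below loads a JSON file with the standard library and lists its top-level keys with their entry counts. The assumed file layout (a dictionary of equal-length lists), the helper name `summarize_branches`, and the toy keys are all hypothetical.

```python
import json
import os
import tempfile


def summarize_branches(path):
    """Load a JSON file and return {top-level key: number of entries}.

    Assumes (hypothetically) that the file holds a dict of lists,
    one list per branch; this is not medica's actual implementation.
    """
    with open(path, "r", encoding="utf-8") as handle:
        data = json.load(handle)
    return {name: len(values) for name, values in data.items()}


# Toy file standing in for "Data/shuffled_data.json" (made-up content).
toy = {"Track": [[0.1, 0.2], [0.3]], "Tower": [[1.0], [2.0, 3.0]]}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.json")
    with open(toy_path, "w", encoding="utf-8") as handle:
        json.dump(toy, handle)
    summary = summarize_branches(toy_path)

print(summary)
```

A summary like this is a quick way to confirm that every branch holds the same number of events before any filtering is applied.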

Now we refine the dataset by keeping only the events that have at least a minimum number of tracks and towers.

In [None]:
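track_to_retain = 10 # minimum number of tracks per event
tower_to_retain = 10 # minimum number of towers per event
# Both thresholds are equal here; the (tracks, towers) argument order is assumed.
Refined_data = refinement(full_data, track_to_retain, tower_to_retain)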

The output above gives the number of events that contain at least 10 tracks and at least 10 towers.
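The selection itself happens inside `refinement`, which is not shown here. A minimal sketch of the same kind of cut, assuming each event is a dictionary with hypothetical `"Track"` and `"Tower"` lists, could look like this:

```python
def keep_well_populated(events, min_tracks=10, min_towers=10):
    """Return only the events with enough tracks and towers.

    `events` is assumed to be a list of dicts with "Track" and "Tower"
    lists; these field names are hypothetical, not taken from medica.
    """
    return [
        event
        for event in events
        if len(event["Track"]) >= min_tracks and len(event["Tower"]) >= min_towers
    ]


# Toy events: only the first has at least 10 tracks and 10 towers.
toy_events = [
    {"Track": list(range(12)), "Tower": list(range(10))},
    {"Track": list(range(9)), "Tower": list(range(15))},
    {"Track": list(range(20)), "Tower": list(range(3))},
]
kept = keep_well_populated(toy_events)
print(len(kept))
```

On an awkward array, the same mask could presumably be built from per-event counts obtained with `ak.num`, which avoids the Python-level loop.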

### Dataset creation


Now we split the refined data into windows of, say, $10$ events each, and randomly choose $30$ track parameters as well as $30$ tower parameters for each of these events.
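As a rough illustration of the windowing and random selection described above (not the actual `Dataset_Creator` code), the sketch below builds non-overlapping windows of event indices and samples a fixed number of rows from one toy event. Dropping an incomplete last window and sampling with replacement when an event has too few objects are both assumptions of this sketch, as are the helper names.

```python
import numpy as np


def split_into_windows(n_events, window):
    """Return index arrays for consecutive, non-overlapping windows.

    Trailing events that do not fill a whole window are dropped
    (an assumption made for this sketch).
    """
    n_windows = n_events // window
    return np.arange(n_windows * window).reshape(n_windows, window)


def sample_rows(objects, n_keep, rng):
    """Randomly pick n_keep rows from a (n_objects, n_features) array.

    Rows are drawn without replacement when enough are available and
    with replacement otherwise (again an assumption of this sketch).
    """
    replace = objects.shape[0] < n_keep
    chosen = rng.choice(objects.shape[0], size=n_keep, replace=replace)
    return objects[chosen]


rng = np.random.default_rng(42)
windows = split_into_windows(n_events=25, window=10)
tracks = rng.normal(size=(45, 3))  # toy event: 45 tracks, 3 features each
sampled = sample_rows(tracks, n_keep=30, rng=rng)
print(windows.shape, sampled.shape)
```

Fixing the number of objects per event gives every window the same array shape, which is what allows the windows to be stacked into regular training tensors.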

In [None]:
window_size = 10 # size of the data-window
event_tower_to_retain = 30 # number of towers to retain per event
event_track_to_retain = 30 # number of tracks to retain per event

X_tracks, X_towers, X_missinget, y, training_dataset = Dataset_Creator(
    Refined_data,
    window=window_size,
    event_tower=event_tower_to_retain,
    event_track=event_track_to_retain,
    seed=42,
)

In [None]:
training_dataset.type.show()
branch_info(training_dataset)

### Time-series dataset creation


We can also mimic a real scenario by using a sliding window, in which new data are added to the window.
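To make the idea concrete, here is a minimal sketch of how overlapping windows of event indices can be generated. It is not the `Sliding_Window_Dataset_Creator` implementation; the helper name and the step of one event between consecutive windows are assumptions.

```python
import numpy as np


def sliding_window_indices(n_events, window, step=1):
    """Return a (n_windows, window) array of overlapping event indices.

    Each row is shifted by `step` events relative to the previous one,
    so new events enter the window as the oldest ones leave.
    """
    if n_events < window:
        return np.empty((0, window), dtype=int)
    starts = np.arange(0, n_events - window + 1, step)
    return starts[:, None] + np.arange(window)[None, :]


idx = sliding_window_indices(n_events=13, window=10)
print(idx.shape)
print(idx[0], idx[-1])
```

With a step of one, consecutive windows share all but one event, so a dataset built this way contains many more (strongly correlated) windows than the non-overlapping split.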

In [None]:
window_size = 10 # size of the sliding window
Save_Path = f"Data/training_data_w_{window_size}.json" # path to save the created dataset
event_tower_to_retain = 30 # number of towers to retain per event
event_track_to_retain = 30 # number of tracks to retain per event

X_tracks, X_towers, X_missinget, y, training_dataset = Sliding_Window_Dataset_Creator(
    Refined_data,
    window=window_size,
    event_tower=event_tower_to_retain,
    event_track=event_track_to_retain,
    seed=42,
    save_json_path=Save_Path,
)

In [None]:
training_dataset.type.show()
branch_info(training_dataset)