# Raw data for top tagging
This notebook reads in the initial data (1.2 million events) and reduces its size so that it fits on a laptop.

The original data can be found [at this link](https://desycloud.desy.de/index.php/s/llbX3zpLhazgPJ6), with a description [here](https://docs.google.com/document/d/1Hcuc6LBxZNX16zjEGeq16DAzspkDC4nDTyjMp1bWHRo/edit).

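Download the data and put it into the `data/top_tagging/raw/` directory. The cells below expect `train.h5`, `val.h5` and `test.h5` there and read them through the relative path `../data/top_tagging/raw/`.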

In [1]:
import pandas as pd
import numpy as np

**Description**

 * 14 TeV, hadronic tops for signal, QCD dijets for background, Delphes ATLAS detector card with Pythia
 * No MPI/pile-up included
 * We cluster particle-flow entries (produced by Delphes E-flow) into anti-kT 0.8 jets in the pT range [550,650]
 * All top jets are matched to a parton-level top within ∆R = 0.8, and to all top decay partons within 0.8
 * We also require |eta|_jet < 2
 * The leading 200 jet constituent four-momenta are stored, with zero-padding for jets with fewer than 200 (most jets have fewer, so this limit could probably be lowered)
 * Constituents are sorted by pT, with the highest pT one first
 * The truth top four-momentum is stored in the columns truthE, truthPX, truthPY and truthPZ
 * A flag (1 for top, 0 for QCD) is kept for each jet. It is called is_signal_new
 * The variable "ttv" (= test/train/validation) is kept for each jet. It indicates to which dataset the jet belongs. It is redundant as the different sets are already distributed as different files.


## Training Data

In [2]:
input_filename = "../data/top_tagging/raw/train.h5"
store = pd.read_hdf(input_filename, key='table')

In [3]:
qcd = store[store['is_signal_new']==0]
tops = store[store['is_signal_new']==1]

In [4]:
qcd.head(5)

| | E_0 | PX_0 | PY_0 | PZ_0 | E_1 | PX_1 | PY_1 | PZ_1 | E_2 | PX_2 | ... | E_199 | PX_199 | PY_199 | PZ_199 | truthE | truthPX | truthPY | truthPZ | ttv | is_signal_new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 375 | 474.071136 | -250.347031 | -223.651962 | -334.738098 | 103.236237 | -48.866222 | -56.790775 | -71.02549 | 105.255569 | -55.415001 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 377 | 150.504532 | 120.062393 | 76.852005 | -48.274265 | 82.257057 | 63.801739 | 42.754807 | -29.454842 | 48.573559 | 36.763199 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 378 | 251.645386 | 10.427651 | -147.573746 | 203.56488 | 104.147797 | 10.718256 | -54.497948 | 88.101395 | 78.043213 | 5.724113 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 379 | 451.566132 | 129.885437 | -99.066292 | -420.9841 | 208.410919 | 59.033958 | -46.17709 | -194.467941 | 190.183304 | 54.069675 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 380 | 399.093903 | -168.432083 | -47.205597 | -358.717438 | 273.691956 | -121.926941 | -30.803854 | -243.088928 | 152.837219 | -44.400204 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |


In [5]:
del(store)

Each jet stores four-vectors (E, px, py, pz) for up to 200 particles, with zero padding. To reduce the size, keep only the first 100,000 jets of each class and the first 400 columns of each jet (the 100 highest-pT constituents), then reshape into a 3D array (number of jets, particle number, four-vector component).

In [6]:
qcd = (qcd.to_numpy()[:100000, :400]).reshape(-1,100,4)
tops = (tops.to_numpy()[:100000, :400]).reshape(-1,100,4)

In [7]:
print(qcd.shape, tops.shape)

(100000, 100, 4) (100000, 100, 4)
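The slice-and-reshape step can be checked on a toy array. This is a minimal sketch, assuming the same flat column layout as the file (E_0, PX_0, PY_0, PZ_0, E_1, ... followed by six non-constituent columns); the sizes and the variable names `n_stored` and `n_keep` are made up for illustration.

```python
import numpy as np

# Toy sizes; the notebook stores 200 constituents per jet and keeps 100.
n_jets, n_stored, n_keep = 3, 5, 2

# Flat layout per jet: 4 values per constituent, then 6 trailing columns
# standing in for truthE, truthPX, truthPY, truthPZ, ttv, is_signal_new.
flat = np.arange(n_jets * (4 * n_stored + 6), dtype=float).reshape(n_jets, -1)

# Keep the first 4 * n_keep columns and fold them into (jets, particles, 4).
reduced = flat[:, :4 * n_keep].reshape(-1, n_keep, 4)
print(reduced.shape)  # (3, 2, 4)

# Element [j, i, k] is component k of constituent i of jet j.
assert reduced[0, 1, 0] == flat[0, 4]   # E_1 of jet 0
assert reduced[1, 0, 3] == flat[1, 3]   # PZ_0 of jet 1
```

The column slice also drops the trailing truth, `ttv` and `is_signal_new` columns, which is why the labels are rebuilt separately afterwards.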


In [8]:
# Percentage of constituent slots in the QCD jets that are zero padding (E == 0)
np.sum(qcd[:,:,0]==0) / qcd[:,:,0].size * 100

56.71727
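The same padding check can be written without a hard-coded denominator, and it also gives the number of real constituents per jet. The sketch below runs on a toy array; the variable names are illustrative, and it assumes that a slot with E == 0 is always padding.

```python
import numpy as np

# Toy events shaped (jets, particles, 4): jet 0 has 3 real constituents, jet 1 has 1.
events = np.zeros((2, 4, 4))
events[0, :3] = 1.0
events[1, :1] = 1.0

# A slot counts as padding when its energy component is exactly zero.
is_padding = events[:, :, 0] == 0

padding_percent = is_padding.mean() * 100      # share of padded slots, in percent
n_constituents = (~is_padding).sum(axis=1)     # real constituents per jet

print(padding_percent, n_constituents)  # 50.0 [3 1]
```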

In [9]:
events = np.vstack([qcd, tops])
print(events.shape)

(200000, 100, 4)


In [10]:
labels = np.hstack([np.zeros(qcd.shape[0]), np.ones(tops.shape[0])]).reshape(-1, 1)
print(labels.shape)

(200000, 1)


In [11]:
indices = np.arange(events.shape[0])
np.random.shuffle(indices)
print(indices[:10])

[117102  98098 169714 147773 134709 117878  89898 117212 173934 180284]


In [12]:
events = events[indices]
labels = labels[indices]
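The shuffle in this notebook is unseeded, so the order differs on every run. If a reproducible order is wanted, one option is a seeded NumPy `Generator`. This is a sketch on toy arrays, not what the notebook does, and the seed value is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen only for illustration

# Toy events and labels; each event is tagged with its label so the pairing can be checked.
events = np.arange(6 * 2 * 4, dtype=float).reshape(6, 2, 4)
labels = np.array([0, 0, 0, 1, 1, 1], dtype=float).reshape(-1, 1)
events[:, 0, 0] = labels[:, 0]

# One permutation applied to both arrays keeps events and labels aligned.
indices = rng.permutation(events.shape[0])
shuffled_events = events[indices]
shuffled_labels = labels[indices]

assert np.array_equal(shuffled_events[:, 0, 0], shuffled_labels[:, 0])
```

Indexing both arrays with the same `indices` is the important part; the seed only makes the result repeatable.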

In [13]:
np.save('../data/top_tagging/smaller_raw/training_events.npy', events)
np.save('../data/top_tagging/smaller_raw/training_labels.npy', labels)
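`np.save` does not create missing directories, so the `smaller_raw` output directory has to exist before the cell above runs. A minimal sketch of creating it and checking the round trip is shown below; it writes to a temporary directory rather than the real data path, and the array is a toy placeholder.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ../data/top_tagging/smaller_raw/
    out_dir = Path(tmp) / "smaller_raw"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    events = np.zeros((4, 3, 4))                # toy array, not the real data
    path = out_dir / "training_events.npy"
    np.save(path, events)

    # Load the file back to confirm that nothing was lost.
    loaded = np.load(path)
    round_trip_ok = np.array_equal(events, loaded)

print(round_trip_ok, loaded.shape)
```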

In [14]:
del(qcd)
del(tops)

In [15]:
del(events)
del(labels)

## Validation

In [16]:
input_filename = "../data/top_tagging/raw/val.h5"
store = pd.read_hdf(input_filename, key='table')

In [17]:
qcd = store[store['is_signal_new']==0]
tops = store[store['is_signal_new']==1]

In [18]:
del(store)

In [19]:
qcd = (qcd.to_numpy()[:10000, :400]).reshape(-1,100,4)
tops = (tops.to_numpy()[:10000, :400]).reshape(-1,100,4)

In [20]:
print(qcd.shape, tops.shape)

(10000, 100, 4) (10000, 100, 4)


In [21]:
events = np.vstack([qcd, tops])
print(events.shape)
labels = np.hstack([np.zeros(qcd.shape[0]), np.ones(tops.shape[0])]).reshape(-1, 1)
print(labels.shape)

(20000, 100, 4)
(20000, 1)


In [22]:
indices = np.arange(events.shape[0])
np.random.shuffle(indices)
print(indices[:10])

[ 5791  1139 10795  7938 17057  1178  1059 11671 18467  7179]


In [23]:
events = events[indices]
labels = labels[indices]

In [24]:
np.save('../data/top_tagging/smaller_raw/val_events.npy', events)
np.save('../data/top_tagging/smaller_raw/val_labels.npy', labels)

## Testing

In [25]:
input_filename = "../data/top_tagging/raw/test.h5"
store = pd.read_hdf(input_filename, key='table')

In [26]:
qcd = store[store['is_signal_new']==0]
tops = store[store['is_signal_new']==1]
del(store)

In [27]:
qcd = (qcd.to_numpy()[:10000, :400]).reshape(-1,100,4)
tops = (tops.to_numpy()[:10000, :400]).reshape(-1,100,4)

In [28]:
events = np.vstack([qcd, tops])
print(events.shape)
labels = np.hstack([np.zeros(qcd.shape[0]), np.ones(tops.shape[0])]).reshape(-1, 1)
print(labels.shape)

(20000, 100, 4)
(20000, 1)


In [29]:
indices = np.arange(events.shape[0])
np.random.shuffle(indices)
print(indices[:10])

[ 6405 16595  9808 19505  9057 15569  6160 12649  2459 12423]


In [30]:
events = events[indices]
labels = labels[indices]

In [31]:
np.save('../data/top_tagging/smaller_raw/test_events.npy', events)
np.save('../data/top_tagging/smaller_raw/test_labels.npy', labels)
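The training, validation and testing sections repeat the same select, truncate, reshape, label and shuffle steps. One way to avoid the repetition is a single function, sketched below. The name `reduce_split` and its parameters are hypothetical, and the example runs on a toy DataFrame rather than the real files.

```python
import numpy as np
import pandas as pd

def reduce_split(frame, n_per_class, n_constituents=100, rng=None):
    """Hypothetical helper mirroring the per-split steps of this notebook."""
    rng = np.random.default_rng() if rng is None else rng
    n_features = 4 * n_constituents
    parts = []
    for flag in (0, 1):  # 0 = QCD, 1 = top
        selected = frame[frame["is_signal_new"] == flag]
        block = selected.to_numpy()[:n_per_class, :n_features]
        parts.append(block.reshape(-1, n_constituents, 4))
    events = np.vstack(parts)
    labels = np.hstack(
        [np.full(part.shape[0], float(flag)) for flag, part in enumerate(parts)]
    ).reshape(-1, 1)
    indices = rng.permutation(events.shape[0])  # same order for events and labels
    return events[indices], labels[indices]

# Toy frame: 3 constituents per jet plus the label column, 6 jets.
columns = [f"{name}_{i}" for i in range(3) for name in ("E", "PX", "PY", "PZ")]
columns.append("is_signal_new")
data = np.arange(6 * 13, dtype=float).reshape(6, 13)
data[:, -1] = [0, 1, 0, 1, 0, 1]
toy = pd.DataFrame(data, columns=columns)

events, labels = reduce_split(toy, n_per_class=2, n_constituents=2,
                              rng=np.random.default_rng(0))
print(events.shape, labels.shape)  # (4, 2, 4) (4, 1)
```

With the real files, a call such as `reduce_split(store, 100000)` for training and `reduce_split(store, 10000)` for validation and testing would correspond to the cells above, assuming the constituent columns come first in the table as they do in the output shown earlier.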