## _Reco. Track Evaluation_

- Evaluate the track reconstruction of the GNN.
- The reconstructed tracks come from _`trkx_from_gnn.py`_ (see its code breakdown in _`trkx_from_gnn.ipynb`_).


This notebook is a code breakdown of _`eval_reco_trkx.py`_, based on the similar script _`gnn4itk/scripts/eval_reco_trkx.py`_.

In [1]:
import glob, os, sys, yaml

In [2]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
import torch
import time

In [4]:
from sklearn.cluster import DBSCAN
from multiprocessing import Pool
from functools import partial

In [5]:
# select a device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [6]:
from LightningModules.Processing import SttTorchDataReader

### _(1) Tracks from GNN_

* from _`trkx_from_gnn.py`_

In [7]:
reco_track_path = "run/trkx_from_gnn"
reco_trkx_reader = SttTorchDataReader(reco_track_path)

In [8]:
# what are the events?
reco_trkx_reader.all_evtids[:10]

['900', '901', '902', '903', '904', '905', '906', '907', '908', '909']

In [9]:
# fetch a single event
reco_trkx_data = reco_trkx_reader(900)

In [10]:
reco_trkx_data.head()

Unnamed: 0,hit_id,track_id
0,97,0
1,19,1
2,48,-1
3,149,-1
4,198,2


In [11]:
# show missed hits (track_id == -1)
reco_trkx_data.query("track_id==-1").head()

Unnamed: 0,hit_id,track_id
2,48,-1
3,149,-1
9,49,-1
10,150,-1
18,50,-1


In [12]:
# unique reco track ids (-1 marks missed hits)
np.unique(reco_trkx_data.track_id.values)

array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9])

In [13]:
# renaming
reconstructed = reco_trkx_data

### _(2) Track Evaluation_

- _Fixing `eval_reco_trkx.py`_

In [14]:
# arguments for script: args = parser.parse_args()
max_evts = 100
force = True
num_workers = 8
outname = "run/trkx_reco_eval"
outdir = os.path.dirname(os.path.abspath(outname))
os.makedirs(outdir, exist_ok=True)

* The original script reads raw CSV files to get the truth information.
* Here I only have the torch-geometric data from the GNN stage, so the truth is taken from there.

In [15]:
# fetch `raw` data
raw_tracks_path="run/gnn_evaluation/test"
raw_trkx_reader = SttTorchDataReader(raw_tracks_path)

In [16]:
n_tot_files = reco_trkx_reader.nevts
all_evtids = reco_trkx_reader.all_evtids
max_evts = max_evts if 0 < max_evts <= n_tot_files else n_tot_files

In [17]:
raw_trkx_reader.all_evtids[:10]

['900', '901', '902', '903', '904', '905', '906', '907', '908', '909']

In [18]:
raw_trkx_data = raw_trkx_reader(900)

In [19]:
# particles: ['particle_id', 'pt', 'eta', 'radius', 'vz'] where radius = sqrt(vx**2 + vy**2) and ['vx', 'vy', 'vz'] is the production vertex

In [20]:
# raw_trkx_data
# raw_trkx_data.hid.numpy()
# raw_trkx_data.pid.int().numpy()

In [21]:
raw_trkx_data

Data(x=[158, 3], pid=[158], layers=[158], event_file='/home/adeak977/current/3_deeptrkx/stttrkx-hsf/train_all/event0000000900', hid=[158], pt=[158], modulewise_true_edges=[2, 148], layerwise_true_edges=[2, 153], edge_index=[2, 946], y_pid=[946], scores=[1892])

In [22]:
# reco:  ['hit_id', 'track_id']
reco_trkx_data.head()

Unnamed: 0,hit_id,track_id
0,97,0
1,19,1
2,48,-1
3,149,-1
4,198,2


In [23]:
# truth:  ['hit_id', 'particle_id']
truth = pd.DataFrame({'hit_id': raw_trkx_data.hid.numpy(), 'particle_id': raw_trkx_data.pid.int().numpy()}, columns=['hit_id', 'particle_id'])
truth.head()

Unnamed: 0,hit_id,particle_id
0,97,6
1,19,9
2,48,8
3,149,4
4,198,2


In [24]:
np.unique(truth.particle_id.values)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

In [25]:
# particles: ['particle_id', 'pt', 'eta', 'radius', 'vz']
particles = pd.DataFrame({'particle_id': raw_trkx_data.pid.int().numpy(), 'pt': raw_trkx_data.pt.numpy()}, columns=['particle_id', 'pt'])

In [26]:
particles.drop_duplicates(subset=['particle_id']).shape

(10, 2)

In [27]:
np.unique(particles.particle_id.values)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

### Current Torch Geometric Data I have

```
Data(x=[158, 3], pid=[158], layers=[158], event_file='/home/adeak977/current/3_deeptrkx/stttrkx-hsf/train_all/event0000000900', hid=[158], pt=[158], modulewise_true_edges=[2, 148], layerwise_true_edges=[2, 153], edge_index=[2, 946], y_pid=[946], scores=[1892])
```

### What do I have in my torch-geometric data after GNNBuilder?

1. x,y coordinates
2. hit_id (hid)
3. particle_id (pid)
4. pt
5. scores, etc

### What don't I have in my torch-geometric data after GNNBuilder?

1. eta
2. radius
3. vz


`eta`, `radius` and `vz` can be recovered by re-processing an event directly from the **CSV** files (similar to **ACTSCSVReader**) and adding these variables to what I already have.
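
A minimal sketch of how the missing variables could be derived once the particle CSV is read again. The column names `px, py, pz, vx, vy, vz` are an assumption about the CSV layout and `add_kinematics` is a hypothetical helper, not part of the existing scripts. `radius` follows the definition used above, `eta` is the usual pseudorapidity computed from the momentum.

```python
import numpy as np
import pandas as pd


def add_kinematics(particles: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `particles` with `eta` and `radius` columns added.

    Assumes (hypothetical) columns px, py, pz for the momentum and
    vx, vy for the production vertex; vz is taken as already present.
    """
    out = particles.copy()
    p = np.sqrt(out.px**2 + out.py**2 + out.pz**2)
    # pseudorapidity: eta = atanh(pz / |p|)
    out["eta"] = np.arctanh(out.pz / p)
    # transverse distance of the production vertex from the beam axis
    out["radius"] = np.hypot(out.vx, out.vy)
    return out


# toy particles, all numbers made up
toy = pd.DataFrame({
    "particle_id": [1, 2],
    "px": [0.3, 0.0], "py": [0.4, 1.0], "pz": [0.0, 1.0],
    "vx": [3.0, 0.0], "vy": [4.0, 0.0], "vz": [0.1, -0.2],
})
kin = add_kinematics(toy)
print(kin[["particle_id", "eta", "radius", "vz"]])
```

The result could then be merged onto the per-event `particles` frame on `particle_id`.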

### `evaluate_reco_tracks(truth_data, reco_data, particles)`

In [28]:
truth.head()

Unnamed: 0,hit_id,particle_id
0,97,6
1,19,9
2,48,8
3,149,4
4,198,2


In [29]:
reconstructed.head()

Unnamed: 0,hit_id,track_id
0,97,0
1,19,1
2,48,-1
3,149,-1
4,198,2


In [30]:
particles.head()

Unnamed: 0,particle_id,pt
0,6,0.461538
1,9,0.073745
2,8,0.510547
3,4,0.7129
4,2,0.650417


In [31]:
min_hits_truth=7
min_hits_reco=5
min_pt=0.
frac_reco_matched=0.5
frac_truth_matched=0.5

In [32]:
# just in case particle_id == 0 is included in truth
if 'particle_id' in truth.columns:
    truth = truth[truth.particle_id > 0]

In [33]:
reconstructed.describe()

Unnamed: 0,hit_id,track_id
count,158.0,158.0
mean,120.962025,3.493671
std,69.570739,2.724615
min,1.0,-1.0
25%,66.25,2.0
50%,121.5,4.0
75%,177.75,5.0
max,243.0,9.0


In [34]:
# get number of spacepoints in each reconstructed track
# (rename_axis + reset_index(name=...) gives the same columns on pandas 1.x and 2.x)
n_reco_hits = reconstructed.track_id.value_counts(sort=False)\
    .rename_axis("track_id").reset_index(name="n_reco_hits")

In [35]:
n_reco_hits.head(11)

Unnamed: 0,track_id,n_reco_hits
0,0,18
1,1,8
2,-1,13
3,2,18
4,3,17
5,4,31
6,5,17
7,6,10
8,7,11
9,8,11


In [36]:
# only tracks with a minimum number of spacepoints are considered
n_reco_hits = n_reco_hits[n_reco_hits.n_reco_hits >= min_hits_reco]
reconstructed = reconstructed[reconstructed.track_id.isin(n_reco_hits.track_id.values)]
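
The minimum-hits cut can be checked on a toy frame. This is only a sketch with made-up hits and a made-up threshold; `min_hits` stands in for `min_hits_reco`.

```python
import pandas as pd

# toy reconstructed hits: track 0 has 4 hits, track 1 has 2, label -1 has 1
reco = pd.DataFrame({
    "hit_id":   [1, 2, 3, 4, 5, 6, 7],
    "track_id": [0, 0, 0, 1, 1, -1, 0],
})
min_hits = 3  # toy stand-in for min_hits_reco

# hits per track id
counts = (reco.track_id.value_counts(sort=False)
          .rename_axis("track_id")
          .reset_index(name="n_reco_hits"))

# keep only the hits whose track id passes the cut
keep = counts.loc[counts.n_reco_hits >= min_hits, "track_id"]
filtered = reco[reco.track_id.isin(keep)]
print(filtered)
```

The cut treats `-1` like any other track id. In event 900 the `-1` label has 13 hits, so it passes `min_hits_reco = 5` and stays in `reconstructed`; whether to drop it explicitly is a separate decision.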

In [37]:
reconstructed.describe()

Unnamed: 0,hit_id,track_id
count,154.0,154.0
mean,123.675325,3.350649
std,68.365883,2.608514
min,1.0,-1.0
25%,69.25,1.25
50%,123.5,4.0
75%,178.75,5.0
max,243.0,8.0


In [38]:
particles.describe()

Unnamed: 0,particle_id,pt
count,158.0,158.0
mean,5.544304,0.620882
std,2.756798,0.342762
min,1.0,0.068565
25%,3.0,0.424822
50%,6.0,0.647816
75%,8.0,0.714695
max,10.0,1.340751


In [39]:
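# get number of spacepoints for each particle
# NOTE: `particles` still has one row per hit (158 rows), so this merge is
# many-to-many: `hits` grows to 2538 rows and every n_true_hits below is the
# square of the real count (e.g. 121 = 11**2). Deduplicating `particles`
# on particle_id before the merge would avoid this.
hits = truth.merge(particles, on='particle_id', how='left')
n_true_hits = hits.particle_id.value_counts(sort=False)\
    .rename_axis("particle_id").reset_index(name="n_true_hits")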

In [40]:
hits.describe()

Unnamed: 0,hit_id,particle_id,pt
count,2538.0,2538.0,2538.0
mean,121.039795,5.546887,0.630281
std,66.902735,2.643659,0.327627
min,1.0,1.0,0.068565
25%,68.0,3.0,0.459712
50%,121.0,6.0,0.648353
75%,176.0,8.0,0.71455
max,243.0,10.0,1.340751


In [41]:
n_true_hits.describe()

Unnamed: 0,particle_id,n_true_hits
count,10.0,10.0
mean,5.5,253.8
std,3.02765,63.432904
min,1.0,121.0
25%,3.25,225.0
50%,5.5,272.5
75%,7.75,289.0
max,10.0,324.0
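
The `n_true_hits` values above (121, 225, 289, 324) are perfect squares, which points at a many-to-many merge. A small sketch on made-up data reproduces the effect and shows one possible fix, assuming `pt` is constant per particle so that dropping duplicates on `particle_id` loses nothing.

```python
import pandas as pd

# toy truth: particle 1 has 3 hits, particle 2 has 2 hits
truth = pd.DataFrame({"hit_id": [1, 2, 3, 4, 5],
                      "particle_id": [1, 1, 1, 2, 2]})

# particles built from per-hit arrays: one row per hit, not per particle
particles_per_hit = pd.DataFrame({"particle_id": [1, 1, 1, 2, 2],
                                  "pt": [0.5, 0.5, 0.5, 0.9, 0.9]})

# many-to-many merge: 3*3 + 2*2 = 13 rows instead of 5
inflated = truth.merge(particles_per_hit, on="particle_id", how="left")

# one row per particle, then a many-to-one merge keeps one row per hit;
# validate= makes pandas raise if the right side still has duplicates
particles = particles_per_hit.drop_duplicates(subset=["particle_id"])
hits = truth.merge(particles, on="particle_id", how="left",
                   validate="many_to_one")

n_true_hits = (hits.groupby("particle_id").size()
               .reset_index(name="n_true_hits"))
print(len(inflated), len(hits))
print(n_true_hits)
```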


In [42]:
# only particles that leave at least min_hits_truth spacepoints
# and have pT >= min_pt are considered.
particles = particles.merge(n_true_hits, on=['particle_id'], how='left')

In [43]:
is_trackable = particles.n_true_hits >= min_hits_truth

In [44]:
# event has 3 columns [track_id, particle_id, hit_id]
event = pd.merge(reconstructed, truth, on=['hit_id'], how='left')

In [45]:
event.head()

Unnamed: 0,hit_id,track_id,particle_id
0,97,0,6
1,19,1,9
2,48,-1,8
3,149,-1,4
4,198,2,2


In [46]:
# n_common_hits and n_shared should be exactly the same 
# for a specific track id and particle id

In [47]:
# Each track_id will be assigned to multiple particles.
# To determine which particle the track candidate is matched to, 
# we use the particle id that yields a maximum value of n_common_hits / n_reco_hits,
# which means the majority of the spacepoints associated with the reconstructed
# track candidate comes from that true track.
# However, the other way may not be true.

In [48]:
reco_matching = event.groupby(['track_id', 'particle_id']).size()\
        .reset_index().rename(columns={0:"n_common_hits"})

In [49]:
reco_matching.head(15)

Unnamed: 0,track_id,particle_id,n_common_hits
0,-1,4,6
1,-1,8,6
2,-1,10,1
3,0,6,18
4,1,9,8
5,2,2,18
6,3,1,11
7,3,9,6
8,4,3,15
9,4,7,16


In [50]:
# Each particle will be assigned to multiple reconstructed tracks
truth_matching = event.groupby(['particle_id', 'track_id']).size()\
    .reset_index().rename(columns={0:"n_shared"})

In [51]:
truth_matching.head(15)

Unnamed: 0,particle_id,track_id,n_shared
0,1,3,11
1,2,2,18
2,3,4,15
3,4,-1,6
4,4,7,11
5,5,5,17
6,6,0,18
7,7,4,16
8,8,-1,6
9,8,8,11
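
A quick check of the statement that `n_common_hits` and `n_shared` are the same number for a given (track id, particle id) pair. The sketch below uses a made-up `event` frame and only verifies that the two groupings differ in ordering, not in counts.

```python
import pandas as pd

# toy event: track 0 mixes particles 7 and 8, track 1 is pure particle 8
event = pd.DataFrame({
    "hit_id":      [1, 2, 3, 4, 5, 6],
    "track_id":    [0, 0, 0, 1, 1, 1],
    "particle_id": [7, 7, 8, 8, 8, 8],
})

reco_matching = (event.groupby(["track_id", "particle_id"]).size()
                 .reset_index(name="n_common_hits"))
truth_matching = (event.groupby(["particle_id", "track_id"]).size()
                  .reset_index(name="n_shared"))

# align the two tables on the pair and compare the counts
both = reco_matching.merge(truth_matching, on=["track_id", "particle_id"])
same = bool((both.n_common_hits == both.n_shared).all())
print(both)
print(same)
```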


In [52]:
# add number of hits to each of the matching dataframes
reco_matching = reco_matching.merge(n_reco_hits, on=['track_id'], how='left')
truth_matching = truth_matching.merge(n_true_hits, on=['particle_id'], how='left')

# calculate matching fraction
reco_matching = reco_matching.assign(
    purity_reco=np.true_divide(reco_matching.n_common_hits, reco_matching.n_reco_hits))
truth_matching = truth_matching.assign(
    purity_true = np.true_divide(truth_matching.n_shared, truth_matching.n_true_hits))
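
To make the two fractions concrete, here is a sketch on made-up numbers. `purity_reco` is the share of a track's hits that come from a given particle, `purity_true` is the share of a particle's hits that ended up in a given track. Picking the best particle per track with `idxmax` is my assumption for illustration; the script itself uses a `transform`-based maximum.

```python
import pandas as pd

# toy overlaps between tracks and particles, all numbers made up
matching = pd.DataFrame({
    "track_id":      [0, 0, 1],
    "particle_id":   [7, 8, 8],
    "n_common_hits": [2, 1, 3],
})
n_reco_hits = pd.DataFrame({"track_id": [0, 1], "n_reco_hits": [3, 3]})
n_true_hits = pd.DataFrame({"particle_id": [7, 8], "n_true_hits": [2, 4]})

# attach the total hit counts of the track and of the particle
m = (matching
     .merge(n_reco_hits, on="track_id", how="left")
     .merge(n_true_hits, on="particle_id", how="left"))

# fraction of the track made of this particle / of the particle in this track
m["purity_reco"] = m.n_common_hits / m.n_reco_hits
m["purity_true"] = m.n_common_hits / m.n_true_hits

# particle with the highest purity_reco for each track
best = m.loc[m.groupby("track_id")["purity_reco"].idxmax(),
             ["track_id", "particle_id", "purity_reco"]]
print(m)
print(best)
```

Track 0 is two-thirds particle 7, so it is matched to particle 7 even though particle 8 also contributes one hit.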

In [53]:
# select the best match
# pass "max" as a string: newer pandas warns when given the builtin max
reco_matching['purity_reco_max'] = reco_matching.groupby(
    "track_id")['purity_reco'].transform("max")
truth_matching['purity_true_max'] = truth_matching.groupby(
    "track_id")['purity_true'].transform("max")