# Pattern Matrix Replacement

In this notebook we explore strategies to replace the ***Hit Correlation*** step of Konrad's pipeline, specifically the *pattern matrix* algorithm previously used.

## Approach
I am considering two approaches. The first uses a simple MLP to determine whether two given points are *causally related*. This requires the following tasks:
- [x] prepare the data such that each row contains the features (x, y, z, time/timeslice) of each pair of points (excluding self-pairs)
    - [ ] as a subtask, use PCA to reduce the dimensions and check whether the network performs the same or worse
- [x] create labels for the training set, i.e. *1* if related and *0* otherwise (this hinges on being able to extract labels from the *mc_info* table)
- [ ] visualize the results

The other approach is to treat this as an unsupervised learning task and use clustering to determine *related* points.

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
from context import km3net
from km3net.data import data, pattern_matrix
from km3net.data import utils

In [9]:
path = km3net.utils.DATADIR + '/processed/slice-615.csv'
df = utils.load(path)
df[['pos_x', 'pos_y', 'pos_z', 'time', 'event_id']]

index,pos_x,pos_y,pos_z,time,event_id
0,48.363,-24.102,83.611,9225002.0,
1,-37.784,30.774,94.341,9225007.0,
2,-75.557,-6.893,56.111,9225008.0,
3,-27.018,-60.655,150.731,9225013.0,
4,76.914,-77.120,150.789,9225014.0,
...,...,...,...,...,...
8478,50.125,12.122,139.831,9239990.0,
8479,-74.918,65.066,74.041,9239993.0,
8480,-94.292,-6.028,103.911,9239996.0,
8481,11.829,13.744,130.489,9239998.0,


In [14]:
sample = df.sample(frac=0.1)
sample

index,pos_x,pos_y,pos_z,time,label,event_id,timeslice
287,-16.064,-76.624,151.011,9225675.0,0,,615
3592,-93.854,29.775,65.459,9231711.0,0,,615
3614,-63.629,-23.757,94.341,9231763.0,0,,615
1877,-73.311,30.290,121.731,9228239.0,0,,615
172,-47.798,12.654,37.731,9225394.0,0,,615
...,...,...,...,...,...,...,...
1072,20.518,102.115,196.331,9226515.0,0,,615
7359,-93.950,29.716,160.131,9237828.0,0,,615
3711,-73.486,30.199,139.889,9231970.0,0,,615
7223,39.623,103.299,55.800,9237695.0,0,,615


# Creating the "Pattern Matrix" Dataset

Since this step will essentially double the width of the dataset and square its height, we will only use timeslice 665 (the timeslice with the most hits, as per the [exploration](notebooks/exploration.ipynb) conducted previously). The algorithm to generate the dataset is as follows:
1. create an empty dataframe to hold the `result`
2. iterate over the original df with the row and index
    1. duplicate original df
    2. set value of `dup` columns to that of the row
    3. concat dup and original dfs (sideways) to create pairs
    4. drop the rows where `id1` is less than `id2` to avoid repeat pairs
    5. append dup to result
    
The algorithm was tested with a small sample of 10 rows before the dataset below was created.
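
The steps above can be sketched in pandas as follows. This is a minimal illustration, not the implementation in `km3net.data.pattern_matrix`: the function name `explode_pairs`, the positional `id` column and the choice to keep rows with `id1 < id2` are assumptions of this sketch, and the input values are made up.

```python
import pandas as pd


def explode_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per unordered pair of distinct rows in `df` (needs >= 2 rows)."""
    # Give every row a positional id so pairs can be compared and de-duplicated.
    base = df.reset_index(drop=True)
    base.insert(0, "id", range(len(base)))

    chunks = []
    for i in range(len(base)):
        # Steps 2.1-2.2: a copy of the frame in which every row equals row i.
        dup = base.iloc[[i] * len(base)].reset_index(drop=True)
        # Step 2.3: put the repeated row next to the original rows.
        pairs = pd.concat([dup.add_suffix("1"), base.add_suffix("2")], axis=1)
        # Step 2.4: keep each unordered pair once; this also removes self-pairs.
        kept = pairs[pairs["id1"] < pairs["id2"]]
        if not kept.empty:
            # Step 2.5: collect the surviving pairs.
            chunks.append(kept)
    return pd.concat(chunks, ignore_index=True)


# Toy input with made-up values, only to show the resulting shape.
hits = pd.DataFrame({"pos_x": [0.0, 1.0, 2.0, 3.0], "time": [10.0, 11.0, 12.0, 13.0]})
pairs = explode_pairs(hits)
print(pairs.shape)
```

For `n` input rows this yields `n * (n - 1) / 2` pairs, which is why the exploded dataset grows roughly quadratically with the number of hits.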

## Generate labels

Once we "explode" the dataset, we need to generate labels. The logic is simple: if `eid1` and `eid2` are the same, the pair gets a label of 1, otherwise a label of 0. Three combinations can occur: (1) hit-hit, (2) hit-noise and (3) noise-noise. Since noise has `nan` for the event id, and since `nan != nan` in Python, all three cases are handled correctly by a simple comparison of the two column values.

Finally, drop the columns that are not required for training and write the result to CSV.
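
A small sketch of the labelling rule, assuming the exploded frame carries the event ids in columns named `eid1` and `eid2` as described above. The event ids are toy values chosen to cover each case; the variable names `pairs` and `train` are illustrative only.

```python
import numpy as np
import pandas as pd

# One row per case: same event, hit-noise, noise-noise, two different events.
pairs = pd.DataFrame({
    "eid1": [2042.0, 2042.0, np.nan, 1163.0],
    "eid2": [2042.0, np.nan, np.nan, 2322.0],
})

# Elementwise == follows IEEE 754: NaN == NaN is False, so noise never matches.
pairs["label"] = (pairs["eid1"] == pairs["eid2"]).astype(int)

# Drop the event ids, which are not needed for training.
train = pairs.drop(columns=["eid1", "eid2"])
print(pairs)
```

Only the first row, where both hits share an event id, is labelled 1. Two hits from different events are labelled 0 by the same comparison.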

In [15]:
sample.groupby('event_id').count()

event_id,pos_x,pos_y,pos_z,time,label,timeslice
1163.0,8,8,8,8,8,8
2042.0,107,107,107,107,107,107
2322.0,46,46,46,46,46,46
2363.0,19,19,19,19,19,19


In [16]:
exploded_sample = pattern_matrix.process(sample)
exploded_sample

index,x1,y1,z1,t1,x2,y2,z2,t2,label
3592,-16.064,-76.624,151.011,9225675.0,-93.854,29.775,65.459,9231711.0,0
3614,-16.064,-76.624,151.011,9225675.0,-63.629,-23.757,94.341,9231763.0,0
1877,-16.064,-76.624,151.011,9225675.0,-73.311,30.290,121.731,9228239.0,0
172,-16.064,-76.624,151.011,9225675.0,-47.798,12.654,37.731,9225394.0,0
7815,-16.064,-76.624,151.011,9225675.0,58.991,31.948,37.841,9238638.0,0
...,...,...,...,...,...,...,...,...,...
7223,-93.950,29.716,160.131,9237828.0,39.623,103.299,55.800,9237695.0,0
2572,-93.950,29.716,160.131,9237828.0,-6.860,-57.851,37.841,9229671.0,0
7223,-73.486,30.199,139.889,9231970.0,39.623,103.299,55.800,9237695.0,0
2572,-73.486,30.199,139.889,9231970.0,-6.860,-57.851,37.841,9229671.0,0


In [17]:
print('Shape hits: {0}'.format(exploded_sample[exploded_sample['label'] == 1].shape))

Shape hits: (6905, 9)


In [18]:
print('Shape noise: {0}'.format(exploded_sample[exploded_sample['label'] != 1].shape))

Shape noise: (352223, 9)
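
The two counts printed above imply a heavily imbalanced label distribution. A quick check, using only those two numbers (the variable names are illustrative):

```python
n_related = 6905      # rows with label == 1, as printed above
n_unrelated = 352223  # rows with label != 1, as printed above

total = n_related + n_unrelated
related_fraction = n_related / total
# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(n_related, n_unrelated) / total

print(f"pairs: {total}")
print(f"related fraction: {related_fraction:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The majority-class baseline is a useful reference point when reading any accuracy figure reported on this dataset.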


In [19]:
exploded_sample.to_csv(km3net.utils.DATADIR + '/train/slice-615-0-1.csv', index=False, header=False)

# MLP for "Pattern Matrix" Replacement

I followed this [tutorial](https://machinelearningmastery.com/pytorch-tutorial-develop-deep-learning-models/) to implement the first iteration of the network.

In [29]:
import km3net.model as model

In [30]:
path = km3net.utils.DATADIR + '/train/slice-615-0-1.csv'
train_dl, test_dl = model.utils.prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))

240616 118512


In [31]:
net = model.mlp.MLP(8)
net

MLP(
  (hidden1): Linear(in_features=8, out_features=10, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=10, out_features=8, bias=True)
  (act2): ReLU()
  (hidden3): Linear(in_features=8, out_features=1, bias=True)
  (act3): Sigmoid()
)
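
For reference, a module with the layer layout printed above could be written in PyTorch roughly as follows. This is a sketch reconstructed from the printed representation only; the actual `km3net.model.mlp.MLP` may differ in its forward pass, weight initialisation or other details, and the random input batch is purely illustrative.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Three linear layers: n_inputs -> 10 -> 8 -> 1, with a sigmoid output."""

    def __init__(self, n_inputs: int):
        super().__init__()
        self.hidden1 = nn.Linear(n_inputs, 10)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(10, 8)
        self.act2 = nn.ReLU()
        self.hidden3 = nn.Linear(8, 1)
        self.act3 = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        # The sigmoid maps the single output to a value between 0 and 1.
        return self.act3(self.hidden3(x))


# Eight inputs per pair: (x1, y1, z1, t1, x2, y2, z2, t2).
net = MLP(8)
batch = torch.randn(4, 8)  # random stand-in for four pairs
probs = net(batch)
print(probs.shape)
```

The single sigmoid output can be read as the predicted probability that a pair is related, which suits a binary cross-entropy loss.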

In [83]:
train_model(train_dl, net)

In [84]:
acc = evaluate_model(test_dl, net)
print("Accuracy: %.3f" % acc)

Accuracy: 0.515
