# Pattern Matrix Replacement

In this notebook we explore strategies to replace the ***Hit Correlation*** step of Konrad's pipeline, specifically the *pattern matrix* algorithm previously used.

## Approach
I am thinking of two approaches. First, use a simple MLP to determine if two given points are *causally related* or not. To do this, the following tasks are required:
- [x] prepare the data such that each row contains the features (x, y, z, time/timeslice) of each pair of points (excluding self-pairs)
    - [ ] a subtask here is to use PCA to reduce the dimensions and see whether the network performs the same or worse
- [x] create labels for the training set, i.e. *1* if related and *0* otherwise (this hinges upon the fact that we can extract labels from the *mc_info* table)
- [ ] visualize the results

The other approach is to treat this as an unsupervised learning task and use clustering to determine *related* points.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from context import km3net
from km3net.data import data, pattern_matrix
from km3net.utils import DATADIR


# Creating the "Pattern Matrix" Dataset

Since this step essentially doubles the width of the dataset and squares its height, we only use timeslice 615 (the timeslice with the most hits, as per the [exploration](notebooks/exploration.ipynb) conducted previously). The algorithm to generate the dataset is as follows:
1. create an empty dataframe to hold the `result`
2. iterate over the original df, taking each row and its index
    1. duplicate original df
    2. set value of `dup` columns to that of the row
    3. concat dup and original dfs (sideways) to create pairs
    4. drop the rows where `id1` is less than `id2` to avoid repeat pairs
    5. append dup to result
    
The algorithm was tested with a small sample of 10 rows before the dataset below was created.
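
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration under assumptions, not the code in `km3net.data.pattern_matrix`: the helper name `explode_pairs` and the toy columns `id`, `x`, `t` are hypothetical, and the sketch keeps only rows with `id1 > id2`, which removes self-pairs as well as mirrored repeats.

```python
import pandas as pd


def explode_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Pair every row with every other row exactly once (hypothetical helper)."""
    left = df.reset_index(drop=True).add_suffix("1")
    chunks = []
    for i in range(len(df)):
        # Duplicate the frame, with every row set to the values of row i.
        dup = df.iloc[[i] * len(df)].reset_index(drop=True).add_suffix("2")
        # Place original and duplicate side by side to form pairs.
        pairs = pd.concat([left, dup], axis=1)
        # Keep id1 > id2 only: no self-pairs, no mirrored repeats.
        chunks.append(pairs[pairs["id1"] > pairs["id2"]])
    return pd.concat(chunks, ignore_index=True)


# Toy input; the column names are assumptions for illustration.
hits = pd.DataFrame({"id": range(4), "x": [0.0, 1.0, 2.0, 3.0], "t": [10, 11, 12, 13]})
pairs = explode_pairs(hits)
print(len(pairs))  # 4 * 3 / 2 = 6 pairs
```

For `n` input rows this yields `n * (n - 1) / 2` pairs, which is why only a single timeslice is used.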

## Generate labels

Once we "explode" the dataset, we need to generate labels. The logic is simple: if `eid1` and `eid2` are the same, the pair gets a label of 1, otherwise 0. Three combinations can occur: (1) hit-hit, (2) hit-noise and (3) noise-noise. Noise has `nan` for the event id, and in Python `nan != nan`, so a simple comparison of the two column values handles all three cases correctly.
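
A small sketch of this comparison, using a made-up four-row frame (the event id values are invented for illustration; only the `eid1`/`eid2` column names come from the text above):

```python
import numpy as np
import pandas as pd

# One row per case: same event, different events, hit-noise, noise-noise.
pairs = pd.DataFrame({
    "eid1": [7.0, 7.0, np.nan, np.nan],
    "eid2": [7.0, 9.0, 7.0, np.nan],
})

# NaN never compares equal to anything, including itself,
# so every pair that involves noise ends up with label 0.
pairs["label"] = (pairs["eid1"] == pairs["eid2"]).astype(int)
print(pairs["label"].tolist())  # [1, 0, 0, 0]
```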

Drop the columns that are not required for training, and write to csv.

# Alternative datasets

- [ ] train using the difference between x,y,z,t
- [ ] train using a larger sample (50%, 75%, 100% of slice-615)
- [ ] train using sample from entire dataset, across timeslices
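
As a rough sketch of the first item, the eight pair features could be collapsed into four differences. The `<feature>1`/`<feature>2` column naming and the toy values are assumptions; signed differences are used here, although absolute differences may be preferable since the order within a pair is arbitrary.

```python
import pandas as pd

# Toy pair table; the column naming is an assumption for illustration.
pairs = pd.DataFrame({
    "x1": [1.0, 5.0], "x2": [4.0, 5.0],
    "y1": [0.0, 2.0], "y2": [0.0, 1.0],
    "z1": [3.0, 3.0], "z2": [1.0, 3.0],
    "t1": [10.0, 20.0], "t2": [12.0, 25.0],
})

# One difference column per feature: dx, dy, dz, dt.
features = pd.DataFrame({
    f"d{axis}": pairs[f"{axis}1"] - pairs[f"{axis}2"] for axis in ["x", "y", "z", "t"]
})
print(features.shape)  # (2, 4): four inputs instead of eight
```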

# MLP for "Pattern Matrix" Replacement

I followed this [tutorial](https://machinelearningmastery.com/pytorch-tutorial-develop-deep-learning-models/) to implement the first iteration of the network.

1. what happens if we vary the learning rate and momentum of the optimizer?

## Experiment 1

### Parameters
- Data: 10% of slice 615 (severe class imbalance)
- Loss: BCELoss
- Optimizer: SGD(lr=0.001, momentum=0.9)
- Layers: (8, 10), (10, 8)
- Activation: hidden -> ReLU, output -> Sigmoid
- Epochs: 10

### Accuracy
98%

### Remarks
***This experiment is flawed, since training was done with an imbalanced training set and testing with the same.***
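
To illustrate why a high accuracy says little under class imbalance, consider a "model" that always predicts the majority class. The 2% positive rate below is made up for the example; the actual class ratio of the 10% sample is not stated here.

```python
import numpy as np

# Hypothetical label distribution: 2% related pairs, 98% unrelated.
labels = np.array([1] * 20 + [0] * 980)
# A "model" that ignores its input and always predicts "unrelated".
predictions = np.zeros_like(labels)

accuracy = (predictions == labels).mean()
recall = predictions[labels == 1].mean()  # fraction of related pairs found
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.98 recall=0.00
```

Accuracy alone cannot distinguish such a degenerate predictor from a useful one, so the test set needs balanced classes or a metric such as recall.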



## Experiment 2
To combat the shortcomings of Experiment 1, we equalize the targets in this experiment whilst keeping the parameters the same.
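
One way to equalize the classes is to undersample the majority class, sketched below. How the `-equal.csv` files were actually produced is not shown in this notebook, so the helper `equalize_classes` and the `label` column name are assumptions.

```python
import pandas as pd


def equalize_classes(df, label_col="label", seed=0):
    """Undersample every class to the size of the rarest one (hypothetical helper)."""
    n_min = df[label_col].value_counts().min()
    balanced = df.groupby(label_col).sample(n=n_min, random_state=seed)
    # Shuffle so that the classes are not grouped together in the output.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 10 related pairs versus 90 unrelated ones.
data = pd.DataFrame({"feature": range(100), "label": [1] * 10 + [0] * 90})
balanced = equalize_classes(data)
print(balanced["label"].value_counts().to_dict())  # 10 rows per class
```

Undersampling discards majority-class rows, which is the price paid for a balanced training and test set.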

### Parameters
- Data: 10% of slice 615 (equalized classes)
- Loss: BCELoss
- Optimizer: SGD(lr=0.001, momentum=0.9)
- Layers: (8, 10), (10, 8)
- Activation: hidden -> ReLU, output -> Sigmoid
- Epochs: 10

### Accuracy
81%

### Remarks
The model already performs reasonably well; training on more data may improve it further.

In [2]:
from context import km3net
from km3net.utils import DATADIR
import km3net.model.utils as model_utils
import km3net.data.utils as data_utils
import km3net.data.pattern_matrix as pm
from km3net.model.mlp import MLP
from torch.nn import BCELoss
from torch.optim import SGD
import torch
import pandas as pd

In [4]:
path = DATADIR+'/train/slice-615-0-1-equal.csv'
train_dl, test_dl = model_utils.prepare_data(path,normalise=True)
print(len(train_dl.dataset), len(test_dl.dataset))

9253 4557


In [5]:
net = MLP(8)
optimizer = SGD(net.parameters(), lr=0.001, momentum=0.9)
criterion = BCELoss()

In [7]:
model_utils.train(train_dl, net, criterion, optimizer, epochs=10)
acc = model_utils.test(test_dl, net)
print("Accuracy: %.3f" % acc.item())

Accuracy: 0.816


## Experiment 3
Next, we increase the size of the training set whilst keeping the parameters the same.

### Parameters
- Data: 25% of slice 615 (equalized classes)
- Loss: BCELoss
- Optimizer: SGD(lr=0.001, momentum=0.9)
- Layers: (8, 10), (10, 8)
- Activation: hidden -> ReLU, output -> Sigmoid
- Epochs: 5

### Accuracy
50%

### Remarks
Increasing the training set size does not improve accuracy; the problem must lie elsewhere.

In [None]:
from context import km3net
from km3net.utils import DATADIR
import km3net.model.utils as model_utils
import km3net.data.utils as data_utils
import km3net.data.pattern_matrix as pm
from km3net.model.mlp import MLP
from torch.nn import BCELoss
from torch.optim import SGD
import torch
import pandas as pd

In [8]:
path = DATADIR+'/train/slice-615-2-5-equal.csv'
train_dl, test_dl = model_utils.prepare_data(path, normalise=True)
print(len(train_dl.dataset), len(test_dl.dataset))

42407 20887


In [10]:
net = MLP(8)
optimizer = SGD(net.parameters(), lr=0.001, momentum=0.9)
criterion = BCELoss()

In [11]:
model_utils.train(train_dl, net, criterion, optimizer, epochs=10)
acc = model_utils.test(test_dl, net)
print("Accuracy: %.3f" % acc.item())

[0,  1999] loss: 0.678
[1,  1999] loss: 0.562
[2,  1999] loss: 0.472
[3,  1999] loss: 0.404
[4,  1999] loss: 0.324
[5,  1999] loss: 0.215
[6,  1999] loss: 0.156
[7,  1999] loss: 0.130
[8,  1999] loss: 0.114
[9,  1999] loss: 0.102
Accuracy: 0.959
