# Pattern Matrix Replacement

In this notebook we explore strategies to replace the ***Hit Correlation*** step of Konrad's pipeline, specifically the *pattern matrix* algorithm previously used.

## Approach
I am thinking of two approaches. First, use a simple MLP to determine whether two given points are *causally related*. To do this, the following tasks are required:
- [ ] prepare the data such that each row contains the features (x, y, z, time/timeslice) of each pair of points (excluding self-pairs)
    - [ ] a subtask here is to use PCA to reduce the dimensions and see whether the network performs the same or worse
- [ ] create labels for the training set, i.e. *1* if related and *0* otherwise (this hinges on being able to extract labels from the *mc_info* table)
- [ ] visualize the results
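
A minimal sketch of the labelling task above, assuming that two hits count as *related* when both carry the same non-null `event_id`. That rule is an assumption; the plan only says the labels should come from the *mc_info* table. `pair_label` is a hypothetical helper, and the `eid1`/`eid2` column names follow the renaming used later in this notebook.

```python
import numpy as np
import pandas as pd

def pair_label(pairs):
    """Return 1 where both hits share a non-null event_id, else 0.

    ASSUMPTION: "related" means "same event"; hits without an event
    are assumed to have a missing (NaN) event_id.
    """
    # NaN never compares equal, so pairs involving an unlabelled hit get 0
    same_event = pairs["eid1"] == pairs["eid2"]
    return same_event.astype(int)

# toy pairs: same event, different events, one unlabelled, both unlabelled
pairs = pd.DataFrame({
    "eid1": [2042.0, 2042.0, np.nan, np.nan],
    "eid2": [2042.0, 2322.0, 2042.0, np.nan],
})
pairs["related"] = pair_label(pairs)
print(pairs)
```

Only the first toy pair is labelled *1*; whether two noise hits should ever count as related is left open here.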

The other approach is to treat this as an unsupervised learning task and use clustering to determine *related* points.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
data = pd.read_csv("../data/data.csv")
data

Unnamed: 0,pos_x,pos_y,pos_z,time,label,event_id,timeslice
0,-17.661,32.245,65.231,0.0,0,,0
1,76.840,-77.173,186.931,0.0,0,,0
2,-73.403,30.509,94.511,0.0,0,,0
3,1.453,33.155,169.111,0.0,0,,0
4,49.456,47.904,140.111,0.0,0,,0
...,...,...,...,...,...,...,...
45820211,-57.230,-5.401,196.389,101502104.0,0,,6766
45820212,0.724,66.341,121.789,101516467.0,0,,6767
45820213,-26.436,86.737,160.131,101545421.0,0,,6769
45820214,-26.931,-21.994,178.511,101581891.0,0,,6772


## Creating the "Pattern Matrix" Dataset

Since this step will essentially double the width of the dataset and square its height, we will only use timeslice 665 (the timeslice with the most hits, as per the [exploration](notebooks/exploration.ipynb) conducted previously). The algorithm to generate the dataset is as follows:
1. create an empty dataframe to hold the `result`
2. iterate over the original df with the row and index
    1. duplicate the original df
    2. set the value of the `dup` columns to that of the row
    3. concat the dup and original dfs (sideways) to create pairs
    4. drop the rows with the same index (we do not want to pair a row with itself)
    5. append dup to result
    
The algorithm was tested with a small sample of 10 rows before the dataset below was created.
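
The same pairing can also be sketched without an explicit loop. This is a hedged alternative, not the method used below: it assumes pandas 1.2 or newer (for `how="cross"`) and a unique `id` column, and `explode_cross` and the `_1`/`_2` suffixes are hypothetical names chosen for illustration.

```python
import pandas as pd

def explode_cross(df):
    """Pair every row of `df` with every other row via a cross join."""
    # ASSUMPTION: `df` has a unique `id` column identifying each hit.
    left = df.add_suffix("_1")
    right = df.add_suffix("_2")
    # the cross join yields n * n rows and twice the columns
    pairs = left.merge(right, how="cross")
    # drop the pairs of a row with itself, leaving n * (n - 1) rows
    return pairs[pairs["id_1"] != pairs["id_2"]].reset_index(drop=True)

toy = pd.DataFrame({
    "pos_x": [0.0, 1.0, 2.0],
    "time": [10.0, 11.0, 12.0],
    "id": [7, 8, 9],
})
pairs = explode_cross(toy)
print(pairs.shape)
```

For the three toy rows this leaves 3 * 2 = 6 pairs with 6 columns, which matches the "double the width, square the height" estimate once self-pairs are removed.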

In [108]:
p1_col_names = {'pos_x': 'x1', 'pos_y': 'y1',
    'pos_z': 'z1', 'time': 't1',
    'label': 'l1', 'event_id': 'eid1',
    'timeslice': 'ts1', 'id':'id1'}
p2_col_names = {'pos_x': 'x2', 'pos_y': 'y2',
    'pos_z': 'z2', 'time': 't2',
    'label': 'l2', 'event_id': 'eid2',
    'timeslice': 'ts2', 'id':'id2'}

def explode(df):
    """
    Expects a dataframe which is then used to create `result`
    containing each row of `df` with all other rows. Returns
    `result` which is a dataframe.
    """
    result = pd.DataFrame()
    for _, row in df.iterrows():
        dup = df.copy()

        # overwrite every row of the copy with the current row's values
        # (may be a better way to do this)
        dup['pos_x'] = row['pos_x']
        dup['pos_y'] = row['pos_y']
        dup['pos_z'] = row['pos_z']
        dup['time'] = row['time']
        dup['label'] = row['label']
        dup['event_id'] = row['event_id']
        dup['timeslice'] = row['timeslice']
        dup['id'] = row['id']

        # the copy becomes point 1 of each pair
        dup = dup.rename(columns=p1_col_names)

        # place the original rows next to it; they become point 2
        dup = pd.concat([dup, df], axis=1)
        dup = dup.rename(columns=p2_col_names)

        # drop the pair of the current row with itself
        dup = dup[dup['id1'] != dup['id2']]

        result = pd.concat([result, dup])
    
    return result

In [137]:
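# TODO does not work! Likely cause: pd.concat([dup]*len(df)) repeats a single
# index label, so the axis=1 concat with df cannot align the rows.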
def explode2(df):
    print("df shape:", df.shape)
    result = pd.DataFrame()
    for id, row in df.iterrows():
        dup = row.to_frame().transpose()
        print("dup shape after frame and transpose:", dup.shape)
        dup = pd.concat([dup]*len(df))
        print("dup shape after concat:", dup.shape)
        dup = dup.rename(columns=p1_col_names)
        dup = pd.concat([dup, df], axis=1)
        print("dup shape after concat with df:", dup.shape)
        dup = dup.rename(columns=p2_col_names)
        dup = dup[dup['id1'] != dup['id2']]
        print("dup shape after drop:", dup.shape)
        result = pd.concat([result, dup])
    
    return result

### Profiling explode variants

Profile the `explode` and `explode2` functions on samples of various sizes to estimate their runtime and performance.

| method | n | time (s) |
|--------|---|------|
|explode |100|1.07  |
|explode2|100||
|explode |200|2.75  |
|explode2|200||
|explode |300|5.95  |
|explode2|300||
|explode |500|19.23 |
|explode2|500||
|explode |850|83.70 |
|explode2|850||

In [123]:
timeslice = 615
df = data[data['timeslice'] == timeslice].sample(100)
df['id'] = df.index
df

Unnamed: 0,pos_x,pos_y,pos_z,time,label,event_id,timeslice,id
4229743,-16.698,-113.131,130.431,9228654.0,0,,615,4229743
4228129,-8.317,86.884,196.611,9226063.0,1,2322.0,615,4228129
4231456,-46.409,-58.954,196.389,9232156.0,0,,615,4231456
4233198,1.165,33.155,103.911,9233983.0,0,,615,4233198
4232452,-56.614,-41.213,74.211,9232733.0,1,2042.0,615,4232452
...,...,...,...,...,...,...,...,...
4228932,70.058,-59.653,121.959,9226900.0,0,,615,4228932
4234714,-73.941,-78.659,186.931,9237427.0,0,,615,4234714
4231936,10.753,-24.788,47.359,9232476.0,1,2042.0,615,4231936
4232369,-56.758,-41.105,55.941,9232681.0,1,2042.0,615,4232369


In [124]:
row = df[:1]
row

Unnamed: 0,pos_x,pos_y,pos_z,time,label,event_id,timeslice,id
4229743,-16.698,-113.131,130.431,9228654.0,0,,615,4229743


In [135]:
dup

Unnamed: 0,x1,y1,z1,t1,l1,eid1,ts1,id1,pos_x,pos_y,pos_z,time,label,event_id,timeslice,id
0,-16.698,-113.131,130.431,9228654.0,0.0,,615.0,4229743.0,,,,,,,,
1,-16.698,-113.131,130.431,9228654.0,0.0,,615.0,4229743.0,,,,,,,,
2,-16.698,-113.131,130.431,9228654.0,0.0,,615.0,4229743.0,,,,,,,,
3,-16.698,-113.131,130.431,9228654.0,0.0,,615.0,4229743.0,,,,,,,,
4,-16.698,-113.131,130.431,9228654.0,0.0,,615.0,4229743.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4235724,,,,,,,,,76.748,-77.014,47.131,9239147.0,0.0,,615.0,4235724.0
4235746,,,,,,,,,59.300,31.935,56.111,9239199.0,0.0,,615.0,4235746.0
4235810,,,,,,,,,88.385,-59.932,187.159,9239338.0,0.0,,615.0,4235810.0
4236033,,,,,,,,,-74.835,101.800,94.459,9239777.0,0.0,,615.0,4236033.0


In [115]:
result = explode2(df)
result

df shape: (100, 8)
dup shape after frame and transpose: (1, 8)
dup shape after concat: (100, 8)


ValueError: Shape of passed values is (10099, 16), indices imply (199, 16)
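
The `ValueError` above is consistent with `pd.concat([dup]*len(df))` repeating one index label `len(df)` times, which the sideways concat cannot align with `df`. Below is a minimal sketch of one possible fix, assuming the repeated row simply takes over the index of `df`. `explode2_fixed` is a hypothetical name, and the rename maps are shortened to two columns for illustration.

```python
import pandas as pd

# shortened stand-ins for p1_col_names / p2_col_names (illustration only)
p1 = {"pos_x": "x1", "id": "id1"}
p2 = {"pos_x": "x2", "id": "id2"}

def explode2_fixed(df):
    parts = []
    for _, row in df.iterrows():
        # repeat the current row once per row of df ...
        dup = pd.DataFrame([row] * len(df))
        # ... and reuse df's index so the axis=1 concat aligns row by row
        dup.index = df.index
        dup = dup.rename(columns=p1)
        dup = pd.concat([dup, df], axis=1).rename(columns=p2)
        # drop the pair of the current row with itself
        parts.append(dup[dup["id1"] != dup["id2"]])
    # concatenate once at the end instead of growing `result` in the loop
    return pd.concat(parts)

toy = pd.DataFrame({"pos_x": [0.0, 1.0, 2.0], "id": [7, 8, 9]}, index=[7, 8, 9])
result = explode2_fixed(toy)
print(result.shape)
```

Collecting the pieces in a list and concatenating once is a design choice in this sketch; whether it is faster than `explode` on the real data has not been measured here.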

In [103]:
result.to_frame().transpose()

Unnamed: 0,pos_x,pos_y,pos_z,time,label,event_id,timeslice,id
4231874,12.078,13.6,65.289,9232445.0,1.0,2042.0,615.0,4231874.0


In [69]:
import cProfile

In [89]:
cProfile.run('explode(df)')

         12096049 function calls (11936613 primitive calls) in 83.703 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4239    0.007    0.000    0.059    0.000 <__array_function__ internals>:2(argsort)
    11023    0.013    0.000    0.082    0.000 <__array_function__ internals>:2(atleast_2d)
     3392    0.004    0.000    0.010    0.000 <__array_function__ internals>:2(can_cast)
    16106    0.028    0.000   29.108    0.002 <__array_function__ internals>:2(concatenate)
     3392    0.004    0.000    0.014    0.000 <__array_function__ internals>:2(copyto)
     3392    0.004    0.000    0.165    0.000 <__array_function__ internals>:2(delete)
     5934    0.010    0.000    0.041    0.000 <__array_function__ internals>:2(min_scalar_type)
     4239    0.007    0.000   20.358    0.005 <__array_function__ internals>:2(vstack)
    92429    0.099    0.000    0.189    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
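
The listing above is ordered by standard name, which makes the expensive calls hard to spot. A small sketch of sorting the same kind of profile by cumulative time with the standard `pstats` module follows; `work` is a hypothetical stand-in for `explode(df)`.

```python
import cProfile
import io
import pstats

def work():
    # hypothetical stand-in for `explode(df)`
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# write the report to a string buffer, most expensive entries first
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("cumulative").print_stats(10)  # limit to ten entries
report = buf.getvalue()
print(report)
```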
   

In [90]:
19.23/5.95

3.2319327731092438

In [91]:
83/19

4.368421052631579
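
The two ratios above compare consecutive timings. A slightly more general sketch, assuming the runtime follows a power law `time ~ n**k` between neighbouring sample sizes, estimates the exponent `k` for each step. The timings are copied from the profiling table and are assumed to be in seconds, since the n=850 entry matches the cProfile total. `local_exponents` is a hypothetical helper.

```python
import math

# (n, seconds) for `explode`, copied from the profiling table
timings = [(100, 1.07), (200, 2.75), (300, 5.95), (500, 19.23), (850, 83.70)]

def local_exponents(samples):
    """Exponent k such that time ~ n**k between consecutive samples."""
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(samples, samples[1:])
    ]

exponents = local_exponents(timings)
print([round(k, 2) for k in exponents])
```

An exponent that keeps rising with `n` would suggest the runtime grows faster than any single fixed power over this range.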