# Clustering
The ultimate objective of the project is to identify timeslices that contain event hits. Clustering is a natural fit for this use case because it operates on a notion of "distance" between samples, and that notion applies to this dataset: hits from related events occur close together in space and time. Since the number of events, and the number of hits each one produces, varies from timeslice to timeslice, we cannot reliably estimate in advance how many clusters the model should learn. This places us in unsupervised territory, with the added constraint that the algorithm should not require the number of clusters as an input.
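To illustrate the point about not knowing the number of clusters, here is a minimal sketch using DBSCAN (the algorithm this notebook applies further down) on hand-made toy hits. The coordinates, `eps=0.5` and `min_samples=3` are assumptions chosen for the toy data only; they are not values taken from the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy hits in (x, y, z, t), arbitrary units: two tight groups and one stray hit
offsets = np.array([[0.0, 0.0, 0.0, 0.0],
                    [0.1, 0.0, 0.0, 0.0],
                    [0.0, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.1, 0.0],
                    [0.0, 0.0, 0.0, 0.1]])
group_a = offsets
group_b = offsets + 10.0
stray = np.array([[50.0, 50.0, 50.0, 50.0]])
X = np.vstack([group_a, group_b, stray])

# eps and min_samples are hand-picked for this toy data (assumption)
toy_labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

# DBSCAN labels noise as -1, so exclude it when counting clusters
n_found = len(set(toy_labels.tolist()) - {-1})
print(toy_labels)
print("clusters found:", n_found)
```

The number of groups is never passed to the model; it follows from the density of the points, and the stray hit is reported as noise.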

In [45]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from context import km3net
from km3net.utils import DATADIR

In [31]:
df = pd.read_csv(DATADIR+'/processed/slice-615.csv')
df

,pos_x,pos_y,pos_z,time,label,event_id,timeslice
0,48.363,-24.102,83.611,9225002.0,0,,615
1,-37.784,30.774,94.341,9225007.0,0,,615
2,-75.557,-6.893,56.111,9225008.0,0,,615
3,-27.018,-60.655,150.731,9225013.0,0,,615
4,76.914,-77.120,150.789,9225014.0,0,,615
...,...,...,...,...,...,...,...
8478,50.125,12.122,139.831,9239990.0,0,,615
8479,-74.918,65.066,74.041,9239993.0,0,,615
8480,-94.292,-6.028,103.911,9239996.0,0,,615
8481,11.829,13.744,130.489,9239998.0,0,,615


In [15]:
df.groupby('event_id').count()

event_id,pos_x,pos_y,pos_z,time,label,timeslice
1163.0,106,106,106,106,106,106
2042.0,938,938,938,938,938,938
2322.0,461,461,461,461,461,461
2363.0,187,187,187,187,187,187


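The slice contains hits from 4 events, so we expect the clustering to find 4 clusters plus one group of noise hits, i.e. 5 groups in total. Note that DBSCAN does not report noise as a cluster of its own; it assigns noise points the label -1.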
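The expected count can be read off the `event_id` column directly. A minimal sketch on a toy frame, assuming (as the blank values in the table above suggest) that noise hits carry a missing `event_id`; the `toy` frame and its rows are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the timeslice frame; NaN event_id marks a noise hit (assumption)
toy = pd.DataFrame({
    "event_id": [np.nan, 1163.0, 1163.0, np.nan, 2042.0, 2322.0, np.nan, 2363.0],
})

n_events = toy["event_id"].nunique()         # nunique() ignores NaN by default
n_noise = int(toy["event_id"].isna().sum())  # hits without an event
expected_groups = n_events + (1 if n_noise > 0 else 0)

print("events:", n_events)
print("noise hits:", n_noise)
print("expected groups:", expected_groups)
```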

In [32]:
# Cluster on hit position and time only; label and event_id are left out
X_train, X_test = train_test_split(
    df[['pos_x', 'pos_y', 'pos_z', 'time']].to_numpy(),
    random_state=42)
print('shape X_train:', X_train.shape)
print('shape X_test:', X_test.shape)

shape X_train: (6362, 4)
shape X_test: (2121, 4)


In [33]:
X_train

array([[-2.664500e+01,  4.834400e+01,  1.507310e+02,  9.232815e+06],
       [-3.715500e+01, -4.216600e+01,  3.795900e+01,  9.237183e+06],
       [ 6.916400e+01,  4.872100e+01,  1.782890e+02,  9.226666e+06],
       ...,
       [ 5.937000e+01, -4.307000e+01,  1.966110e+02,  9.233700e+06],
       [-1.794800e+01,  1.046680e+02,  1.400590e+02,  9.226299e+06],
       [ 4.069800e+01,  6.717900e+01,  1.783410e+02,  9.237752e+06]])

In [34]:
# Standardise features so position (~1e2) and time (~1e7) are on comparable scales
scaler = StandardScaler().fit(X_train)
scaler.mean_

array([-2.54972839e+00,  9.30161742e-01,  1.14471873e+02,  9.23233953e+06])

In [35]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [39]:
X_train

array([[-0.48571926,  0.78240577,  0.73048282,  0.11290825],
       [-0.6975828 , -0.71115706, -1.54144194,  1.15016656],
       [ 1.44562551,  0.78862689,  1.28567118, -1.34727979],
       ...,
       [ 1.2481953 , -0.72607454,  1.65478945,  0.32306704],
       [-0.31040269,  1.71184377,  0.51548281, -1.43443038],
       [ 0.87179987,  1.09321404,  1.28671878,  1.2852856 ]])

In [40]:
# DBSCAN with scikit-learn's default parameters (eps=0.5, min_samples=5)
cluster = DBSCAN().fit(X_train)

In [46]:
np.unique(cluster.labels_)

array([-1,  0])

In [47]:
label = pd.Series(cluster.labels_)
label

0       0
1       0
2       0
3       0
4       0
       ..
6357    0
6358    0
6359    0
6360    0
6361    0
Length: 6362, dtype: int64

In [51]:
label.value_counts()

 0    6347
-1      15
dtype: int64