# The Problem with Crossmatching #
Elizabeth Warrick

**Disclaimer: in future versions we hope to incorporate truth values into the Zooniverse subject data manifest.**

#### Description ####
I am attempting to crossmatch the truth classifications of the Monte Carlo (MC) events with the events that have already been classified by beta testers. Initially, I thought this would be a simple task: make a mask to select the events from the truth value table that were classified by users in the beta test, then save those as a new key-value pair and add it to the beta classifications catalogue.
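
A minimal sketch of that masking idea, run on toy data rather than the real catalogues. The column names `votes` and `truth`, the helper name `crossmatch`, and the toy values are assumptions for illustration only; the sketch also assumes each event ID appears once in the truth table.

```python
import pandas as pd

def crossmatch(beta_df, truth_df, key="Event"):
    """Attach truth values to classified events (illustrative helper, not the project's code)."""
    # Mask: keep only the truth rows whose event ID was classified in the beta test.
    mask = truth_df[key].isin(beta_df[key])
    classified_truth = truth_df.loc[mask]
    # Left merge keeps every beta row; validate= raises if an event ID repeats in the truth rows.
    return beta_df.merge(classified_truth, on=key, how="left", validate="many_to_one")

# Toy stand-ins for the beta classifications and the truth table (values are made up).
beta_toy = pd.DataFrame({"Event": [270, 305, 489], "votes": [12, 9, 15]})
truth_toy = pd.DataFrame({"Event": [270, 305, 378, 489], "truth": [1, 2, 101, 4]})

matched = crossmatch(beta_toy, truth_toy)
print(matched)
```

With `validate="many_to_one"`, pandas raises a `MergeError` when an event ID occurs more than once on the truth side, so a repeated ID cannot silently multiply rows in the merged table.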

The current truth values are not necessarily correct: they were determined with the old version of the MC label script, which was written for a prior version of the MC events and then adapted to this version (albeit still with its problems). Work is under way to implement the new script, which is running into problems similar to those of the older classifier used here. 

* Old MC Labeler: /home/icecube/Desktop/eliz_zooniverse/icecubezooniverseproj_ver3/scripts_ver3/mc_truth.py

* "New" MC Labeler: /home/icecube/Desktop/eliz_zooniverse/icecubezooniverseproj_ver3/scripts_ver3/mc_labeler.py

### Imports ###

In [1]:
import tables
from tables import *
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from icecube import dataclasses, dataio, icetray, MuonGun
from icecube.icetray import I3Units
import icecube.MuonGun
import h5py

To interpret the truth labels, here are the event types they correspond to in the old MC labeler script:

{0: 'Skimming',
          1: 'Starting Cascade',
          2: 'Through-Going Track',
          3: 'Starting Track',
          4: 'Stopping Track',
          5: 'Double-Bang',
          6: 'Stopping Tau',
          7: 'Glashow Hadronic Cascade',
          8: 'Glashow Track',
          9: 'Glashow Tau',
          11: 'Passing Track',
          21: 'Several Muons',
          22: 'Through-Going Bundle',
          23: 'Stopping Bundle',
          100: 'Unclassified',
          101: 'Unclassified'}
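
To read truth IDs by name, the mapping above can be bound to a variable and used as a lookup. This is a small sketch: the names `TRUTH_LABELS` and `label_names`, and the fallback string `'Unknown ID'`, are hypothetical and do not come from the labeler script.

```python
# Mapping copied from the table above; the variable name is an assumption.
TRUTH_LABELS = {
    0: 'Skimming', 1: 'Starting Cascade', 2: 'Through-Going Track',
    3: 'Starting Track', 4: 'Stopping Track', 5: 'Double-Bang',
    6: 'Stopping Tau', 7: 'Glashow Hadronic Cascade', 8: 'Glashow Track',
    9: 'Glashow Tau', 11: 'Passing Track', 21: 'Several Muons',
    22: 'Through-Going Bundle', 23: 'Stopping Bundle',
    100: 'Unclassified', 101: 'Unclassified',
}

def label_names(truth_ids):
    """Translate numeric truth IDs to names; IDs missing from the table get a fallback."""
    # int() lets the lookup accept NumPy integers as well as plain Python ints.
    return [TRUTH_LABELS.get(int(t), 'Unknown ID') for t in truth_ids]

print(label_names([11, 2, 101, 10]))
```

ID 10 is not in the table, so it falls through to the fallback instead of raising a `KeyError`.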

In [2]:
# Run this cell when finished using this notebook to close any opened hdf files. 
#format: variable_name.close()
numu_21971_493_hdf.close()

### Beta Data ###

Problems:
* Repeated events caused by using two different subruns (21971_000 and 21971_493) that had overlapping event IDs -- solved by splitting up user classifications by subject ID. 
    * If the zooniverse ID starts with a 7, then it is from Run 21971_493.
    * If the zooniverse ID starts with an 8, then it is from Run 21971_000.
* Some of the repeated events had different truth values.
    * This is resolved by splitting by subruns since there are sometimes TWO different events that just happen to have the same event ID. 


In [2]:
beta1 = '/home/icecube/Desktop/eliz_zooniverse/icecubezooniverseproj_ver3/plots_ver3/data_agg/classifications_counts_beta1.csv'
beta1_df = pd.read_csv(beta1)
b_new = beta1_df.rename(columns={'Unnamed: 0':'Zooniverse_ID'})

In [3]:
beta_run21971_493 = b_new[b_new['Zooniverse_ID'] < 80000000]
beta_run21971_000 = b_new[b_new['Zooniverse_ID'] >= 80000000] #>= so that an ID of exactly 80000000 is not dropped from both sub-runs.

In [4]:
beta_unique_21971_493, beta_counts_21971_493 = np.unique(beta_run21971_493['Event'][:], return_counts = True)
beta_unique_21971_000, beta_counts_21971_000 = np.unique(beta_run21971_000['Event'][:], return_counts = True)

if (len(beta_run21971_493) - len(beta_unique_21971_493) == 0):
    print("There are no repeat events in Run21971_493.")
else:
    print("There are repeated events in Run21971_493.")
    
if (len(beta_run21971_000) - len(beta_unique_21971_000) == 0):
    print("There are no repeat events in Run21971_000.")
else:
    print("There are repeated events in Run21971_000.")

There are no repeat events in Run21971_493.
There are no repeat events in Run21971_000.


Therefore, by separating the beta data sub-runs by Zooniverse ID, I have made two dataframes of user classifications, one per sub-run, each containing only unique event IDs. 

### Truth Classifications of MC Events ###

I3 File uses the **InIceDSTPulse** pulse series

Problems:
* Initially I looked through the truth classifications of the two sub-runs at the same time and saw repeated event IDs, each with its own truth value. 
    * This should be resolved by keeping the two sub-runs separate and by separating the beta classifications by Zooniverse ID (assigned based on time of subject upload to the Zooniverse website).
* Some of the events have received multiple truth values within each i3 file that has associated truth values. 
    * When I look through steamshovel, I sometimes can't find those events with multiple truths. Where did those come from? Perhaps a coding error?
    * Are there events with multiple truth values that have also been classified in the beta test?
    * If not, can we ignore the repeated events?
    * Are the multiple truth values a result of an inability to handle coincident events?
    * Do the events with multiple truths have the same truths or different ones?
    * Could I be rewriting the truth values in my code?

Current Plan:

I will repeat the filtering to get truth values on one of the i3 files (Run21971_493) and see if it has multiple truths or repeated events. If the i3 file does have repeated events or more than one truth per event, then I need to look back at the old truth labeler script (or else move on and focus on the new one). If the i3 file does not, then it was likely a coding issue on my end.

**Note: I used the same GCD as I did previously in my other tests. I am unsure how to determine if this is a GCD issue.**

**GCD location is '/home/icecube/Desktop/eliz_zooniverse/icecubezooniverseproj_ver3/i3_files/GeoCalibDetectorStatus_2012.56063_V1.i3.gz'**

In [7]:
#Open the i3 file's hdf table.

#Note that this file is a remake of the one I have been using. 

#Path to the i3 file hdf.
numu21971_493 = '/home/icecube/Desktop/eliz_zooniverse/icecubezooniverseproj_ver3/plots_ver3/data_agg/\
Crossmatch_Investigation/truthlabels_oldver_classifier_DST_IC86.2020_NuMu.021971.000493.i3.bz2.hd5'
numu_21971_493_hdf = h5py.File(numu21971_493, "r+") #open hdf

#Store relevant arrays from the i3 hdf as variables for easy access. 
event_ids = numu_21971_493_hdf['I3EventHeader']['Event'][:] #MC event IDs. 
event_truths = numu_21971_493_hdf['classification_truth_id']['value'][:] #MC truth values from mc_truth.py.

In [8]:
#Checking some stats on the i3 file... 
print("Number of Events in the I3 file: ", len(event_ids)) #prints total number of events. 

print("Number of Truth Classifications in the I3 File: ", len(event_truths)) #prints total number of truth values.

print("Number of unique Event IDs: ", len(np.unique(event_ids))) # prints number of unique event IDs. 

print(f'There are {len(event_ids) - len(np.unique(event_ids))}\
 repeated events in the I3 File for Run21971_493') #prints the number of repeated rows (total rows minus unique event IDs). 

Number of Events in the I3 file:  8545
Number of Truth Classifications in the I3 File:  8545
Number of unique Event IDs:  8255
There are 290 repeated events in the I3 File for Run21971_493


There is a problem: we expected the number of unique events to equal the total number of events in the i3 file and the hdf table. To investigate where the repeated events come from, let's get a list of which events are repeated and confirm that the list matches the calculation above. 

In [22]:
#Based off of a code snippet I found on StackOverflow:

'''
I initialize an empty dictionary "count" and the empty lists "triple" (an ID is appended each time it 
occurs beyond a second time), "double" (an ID is appended on its second occurrence) and "duplicates" 
(an ID is appended on every occurrence after the first).
'''

count = {}
triple = []
double = []
duplicates = []

for i in event_ids:
    if i not in count:
        count[i] = 1
    else:
        count[i] += 1
        if count[i] > 2:
            triple.append(i)
        #"<=2 & count[i] > 1" was parsed as a bitwise AND inside a chained comparison; "== 2" states the intent.
        if count[i] == 2:
            double.append(i)
        if count[i] > 1:
            duplicates.append(i)

print("Number of events with more than 2 truth values: ", len(triple), "with event ID: ", triple)
print("The event: ", triple, "has ", count[triple[0]], " total associated truth values.")
print("Number of events with 2 truth values: ", len(double))
print("Total number of events with multiple truth values: ", len(double)+len(triple))

#To look at the whole list of event IDs and how many times they appear in the hdf file, uncomment the following line
#count

Number of events with more than 2 truth values:  1 with event ID:  [23431]
The event:  [23431] has  3  total associated truth values.
Number of events with 2 truth values:  289
Total number of events with multiple truth values:  290
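
As a vectorised cross-check on those list lengths, the multiplicities can also be tallied directly with `np.unique`. This is a sketch on a toy array; the helper name `multiplicity_summary` is hypothetical, and on the real data it would be called with `event_ids`.

```python
import numpy as np

def multiplicity_summary(ids):
    """Return {n: number of distinct IDs that appear exactly n times}."""
    # First pass: how many rows each distinct ID has.
    _, counts = np.unique(ids, return_counts=True)
    # Second pass: how many IDs share each row count.
    sizes, n_ids = np.unique(counts, return_counts=True)
    return {int(s): int(n) for s, n in zip(sizes, n_ids)}

# Toy array: 7 appears three times, 3 and 5 twice, 1 and 9 once.
toy_ids = np.array([1, 3, 3, 5, 5, 7, 7, 7, 9])
summary = multiplicity_summary(toy_ids)
print(summary)  # {1: 2, 2: 2, 3: 1}

# Surplus rows = total rows minus distinct IDs; each ID with n rows contributes n - 1.
surplus_rows = len(toy_ids) - len(np.unique(toy_ids))
```

Unlike the running lists, each ID lands in exactly one bucket here, so an ID with three rows is not also counted among those with two.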


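The output of the counting loop shows that 289 event IDs occur at least twice and that one of them, 23431, occurs three times. Because the loop adds an ID to `double` on its second occurrence, 23431 is counted in both lists: 288 events have exactly 2 truth values and one event has 3, which accounts for the 290 surplus rows (288 + 2). Let's look more closely at the event with 3 truth values and see if we can find why it appears more than once. 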

In [10]:
#Want to find the index where the repeated event occurs. 
w = np.where(event_ids == triple[0])

for i in range(count[triple[0]]):
    print("Event ID: ", event_ids[w[0][i]], "with index: ", w[0][i], "and truth value: ", event_truths[w[0][i]])

Event ID:  23431 with index:  6203 and truth value:  11
Event ID:  23431 with index:  6204 and truth value:  11
Event ID:  23431 with index:  6205 and truth value:  11


We now have the index and truth value of each occurrence of this event. Let's look at the event header for these three rows and see if there is any difference. 

In [13]:
#Start by building a numpy array of the values stored within the event header. 
A = np.hstack((numu_21971_493_hdf['I3EventHeader'][w[0][0]],
              numu_21971_493_hdf['I3EventHeader'][w[0][1]],
              numu_21971_493_hdf['I3EventHeader'][w[0][2]]))

df = pd.DataFrame(data = A, columns = numu_21971_493_hdf['I3EventHeader'].dtype.names[0:9][:])

print("Sub Events have different I3EventHeader values for: ",\
     numu_21971_493_hdf['I3EventHeader'].dtype.names[5][:], "and: ",\
     numu_21971_493_hdf['I3EventHeader'].dtype.names[7][:])

df

Sub Events have different I3EventHeader values for:  time_start_utc_daq and:  time_end_utc_daq


| | Run | Event | SubEvent | SubEventStream | exists | time_start_utc_daq | time_start_mjd | time_end_utc_daq | time_end_mjd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 21971 | 23431 | 0 | 0 | 1 | 130612472957056493 | 59000.171844 | 130612472957159653 | 59000.171844 |
| 1 | 21971 | 23431 | 1 | 0 | 1 | 130612472957189393 | 59000.171844 | 130612472957308803 | 59000.171844 |
| 2 | 21971 | 23431 | 2 | 0 | 1 | 130612472957349183 | 59000.171844 | 130612472957489083 | 59000.171844 |


The mystery of the multiple MC events is solved (for now)! The repeated event IDs come from different sub-events: the three rows share the same Run and Event but differ in SubEvent (0, 1, 2) and in their DAQ start and end times. I am unsure how significant the sub-events are and whether it is a good or bad idea to remove them. 
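
If sub-events explain every repeat, then the pair (Event, SubEvent) should be unique even though Event alone is not. The sketch below checks that on a toy header table; only the 23431 rows mirror the output above, the other rows are made up, and keeping the first sub-event is just one possible choice, not a recommendation.

```python
import pandas as pd

# Toy header rows shaped like the I3EventHeader table; the rows for 270 and 305 are invented.
header_toy = pd.DataFrame({
    "Event":    [270, 270, 23431, 23431, 23431, 305],
    "SubEvent": [0,   1,   0,     1,     2,     0],
})

# Rows that repeat an earlier Event ID.
repeats_by_event = int(header_toy.duplicated(subset=["Event"]).sum())
# Rows that repeat an earlier (Event, SubEvent) pair; zero means the pair is a unique key.
repeats_by_pair = int(header_toy.duplicated(subset=["Event", "SubEvent"]).sum())
print(repeats_by_event, repeats_by_pair)

# One possible way to collapse to one row per event: keep the first sub-event only.
first_subevents = header_toy.drop_duplicates(subset=["Event"], keep="first")
```

On the real file, the same check could presumably be run on a DataFrame built from the `I3EventHeader` table, assuming its `Event` and `SubEvent` columns load as shown in the header output.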

One way I can think of to check is to look at the truth values for the repeated events and see whether they are indeed all the same. If they are, it might imply that the sub-events are similar enough to be treated as identical truth/event pairs. 

In [30]:
"""
Use the list of all repeated events (the list called duplicates) to get their indices and find their truth values. 
The goal here is to produce the list of repeated events and each of their truth values to see if the 
repeated events have the same truths. 
"""

dup_index = [] #initialize empty list to store the indices of duplicate values. 
for k in range(len(duplicates)):
    dup_index.append(np.where(event_ids == duplicates[k]))

for l in range(len(dup_index)):
    for j in range(len(dup_index[l][0])):
        repeat_event_id = event_ids[dup_index[l][0][j]]
        repeat_index = dup_index[l][0][j]
        repeat_truths = event_truths[dup_index[l][0][j]]
        print("Event ID :", repeat_event_id, "with index: ", repeat_index, "and truth value: ", repeat_truths)
        
#Note: I can make new empty lists or new key/value pairs to store the output values in if needed.         
        

Event ID : 270 with index:  61 and truth value:  1
Event ID : 270 with index:  62 and truth value:  1
Event ID : 305 with index:  72 and truth value:  2
Event ID : 305 with index:  73 and truth value:  2
Event ID : 378 with index:  87 and truth value:  101
Event ID : 378 with index:  88 and truth value:  101
Event ID : 489 with index:  111 and truth value:  4
Event ID : 489 with index:  112 and truth value:  4
Event ID : 578 with index:  142 and truth value:  2
Event ID : 578 with index:  143 and truth value:  2
Event ID : 670 with index:  158 and truth value:  11
Event ID : 670 with index:  159 and truth value:  11
Event ID : 995 with index:  247 and truth value:  101
Event ID : 995 with index:  248 and truth value:  101
Event ID : 1024 with index:  256 and truth value:  101
Event ID : 1024 with index:  257 and truth value:  101
Event ID : 1201 with index:  303 and truth value:  2
Event ID : 1201 with index:  304 and truth value:  2
Event ID : 1333 with index:  338 and truth value:  2

Event ID : 24451 with index:  6478 and truth value:  101
Event ID : 24451 with index:  6479 and truth value:  101
Event ID : 24484 with index:  6490 and truth value:  4
Event ID : 24484 with index:  6491 and truth value:  4
Event ID : 24551 with index:  6508 and truth value:  2
Event ID : 24551 with index:  6509 and truth value:  2
Event ID : 24687 with index:  6545 and truth value:  4
Event ID : 24687 with index:  6546 and truth value:  4
Event ID : 24750 with index:  6563 and truth value:  101
Event ID : 24750 with index:  6564 and truth value:  101
Event ID : 24785 with index:  6576 and truth value:  11
Event ID : 24785 with index:  6577 and truth value:  11
Event ID : 24850 with index:  6592 and truth value:  101
Event ID : 24850 with index:  6593 and truth value:  101
Event ID : 25130 with index:  6658 and truth value:  101
Event ID : 25130 with index:  6659 and truth value:  101
Event ID : 25379 with index:  6724 and truth value:  4
Event ID : 25379 with index:  6725 and truth va

From just looking by eye, it seems that each group of repeated events has the same truth value, which makes me think that getting rid of the repeats wouldn't matter. But this might change depending on whether the pulse series (InIceDSTPulse) or the values of another key (like the MC tree and its children) matter. 
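
Rather than relying on the by-eye impression, the agreement can be tested programmatically by counting distinct truth values per event ID. This is a sketch on toy arrays; the helper name `events_with_conflicting_truths` is hypothetical, and event 305 is deliberately given two different truths to show a failure case.

```python
import numpy as np
import pandas as pd

def events_with_conflicting_truths(ids, truths):
    """Return the event IDs whose rows carry more than one distinct truth value."""
    table = pd.DataFrame({"Event": ids, "truth": truths})
    # Number of distinct truth values found among the rows of each event.
    n_distinct = table.groupby("Event")["truth"].nunique()
    return n_distinct[n_distinct > 1].index.tolist()

# Toy data (values made up): 270 and 489 agree with themselves, 305 does not.
toy_ids = np.array([270, 270, 305, 305, 378, 489, 489])
toy_truths = np.array([1, 1, 2, 4, 101, 4, 4])
conflicts = events_with_conflicting_truths(toy_ids, toy_truths)
print(conflicts)
```

Calling the helper with `event_ids` and `event_truths` and getting an empty list back would support the impression that every repeated event carries a single truth value.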