# Validation of the classification based on the WFsim #

2019/09/28

Authors:
 - Clark, Michael <clark632@purdue.edu>
 - Angevaare, Joran <j.angevaare@nikhef.nl>
 
**Updates:**

2019/11/14

## This notebook ##
Bugs in the WFsim that are important to keep in mind:
 -  <s>Double photo-emission is not taken into account</s>
 -  <s>There may be only ~ 500 events before the WFsim crashes</s>
 
 
Possible extensions:
 - Add an afterpulse boolean to the 'truth' info
 - Do the same for the other detector types

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import strax
import straxen

In [2]:
import wfsim

We include ``recarray_tools.py`` here, which provides helpers for adding columns to structured arrays and otherwise manipulating them.
Taken from:

    https://github.com/XENON1T/XeAnalysisScripts/tree/master/PeakFinderTest

In [3]:
from peak_classification.peak_finder import *

ModuleNotFoundError: No module named 'recarray_tools'

In [None]:
from peak_classification.wfsim_utils import *

Initialize the waveform simulator

In [None]:
c = dict(event_rate = 50, chunk_size=1, nchunk=1)
inst = rand_instructions(c)
pd.DataFrame(inst).to_csv('test_uni.csv', index=False)

In [None]:
st = strax.Context(
register=[wfsim.RawRecordsFromFax],
config=dict(fax_file='./test_uni.csv'),
**straxen.contexts.common_opts)

In [None]:
# Just some id from post-SR1, so the corrections work
run_id = '180519_1902'

In [None]:
!rm -r strax_data
peaks = st.make(run_id, 'peak_basics')

In [None]:
truth = st.get_array(run_id, 'truth')
data = st.get_array(run_id, ['peak_basics','peak_classification'])

Because we don't have event numbers, we bin in time and use the bin index to group the peaks.

In [None]:
n = c['nevents'] = c['event_rate'] * c['chunk_size'] * c['nchunk']
c['total_time'] = c['chunk_size'] * c['nchunk']
timing_grid = np.linspace(0, c['total_time'], n+1) * 1e9

In [None]:
### Proxy for event number

truth = append_fields(truth, 'merge_index',np.digitize(truth['t'], timing_grid))
data = append_fields(data, 'merge_index',np.digitize(data['time'], timing_grid))
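
The time binning used as a proxy for the event number can be illustrated on toy numbers. This is only a minimal sketch: `assign_merge_index` is a hypothetical helper (not part of `strax` or `wfsim`), and it assumes that times are in ns and that the events are spread evenly over the simulated time.

```python
import numpy as np


def assign_merge_index(times_ns, event_rate, total_time_s):
    """Hypothetical helper: map times (ns) to a 1-based time-bin index."""
    n_events = int(event_rate * total_time_s)
    # One bin per expected event; bin edges converted from s to ns
    timing_grid = np.linspace(0, total_time_s, n_events + 1) * 1e9
    # np.digitize returns i such that timing_grid[i-1] <= t < timing_grid[i]
    return np.digitize(times_ns, timing_grid)


# Toy times in ns; with 50 events in 1 s each bin is 2e7 ns wide
toy_times = np.array([5e6, 1.9e7, 2.1e7, 9.99e8])
toy_merge_index = assign_merge_index(toy_times, event_rate=50, total_time_s=1)
```

Peaks that share a `merge_index` fall in the same time bin and are treated as belonging to the same simulated event.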

In [None]:
### Proxy for area of truth peak

truth = append_fields(truth, 'area', truth['n_photon'])

**There is a bug: the types are listed here as strings, whereas in strax they are integers**
The code here changes that so that we can compare them directly

In [None]:
###!! 
###!! 
###!!
###!!

truth = append_fields(truth, 'typeint',np.ones(len(truth)), dtypes=int)
# truth['typeint'][truth['type'] == 's2'] = 2
# truth['typeint'][truth['type'] == 's1'] = 1
data = append_fields(data, 'typeint',data['type'], dtypes=int)
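
A string-to-integer conversion of the peak types could look like the sketch below. `type_strings_to_int` is a hypothetical helper, and the integer codes (0 for unknown, 1 for S1, 2 for S2) are an assumption about the convention in use. The sketch also assumes plain Python strings; a byte-string field would need `b's1'` and `b's2'` instead.

```python
import numpy as np


def type_strings_to_int(type_strings):
    """Hypothetical helper: map 's1'/'s2' labels to integer codes."""
    type_strings = np.asarray(type_strings)
    # Assumed convention: anything that is not 's1' or 's2' stays 0 (unknown)
    type_int = np.zeros(len(type_strings), dtype=np.int64)
    type_int[type_strings == 's1'] = 1
    type_int[type_strings == 's2'] = 2
    return type_int


toy_types = type_strings_to_int(['s1', 's2', 'ap', 's2'])
```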


Proxy for ``left`` and ``right`` (as in ``PAX``) sides of peak in truth.

In [None]:
### Proxy for left and right of peak
truth = append_fields(truth, 
                      ('time','endtime'), 
                      (truth['t_first_photon'],
                       truth['t_last_photon']))

### Will need to add a check to see if the last electron is after the last photon, as below
#
#truth['endtime'] = truth['t_last_photon']
#mask = truth['endtime'] < truth['t_last_electron']
#truth['endtime'][mask] = truth['t_last_electron'][mask]
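
The commented-out check amounts to taking the later of the two times for each peak. A minimal sketch on toy arrays, assuming both times use the same units and that `truth_endtime` is a hypothetical helper name:

```python
import numpy as np


def truth_endtime(t_last_photon, t_last_electron):
    """Hypothetical helper: end of a truth peak is the later of the two times."""
    return np.maximum(t_last_photon, t_last_electron)


toy_endtime = truth_endtime(np.array([100, 250, 400]),
                            np.array([90, 300, 400]))
```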

**We think there is a bug that makes all the peak times 500 ns earlier than the truth values**

The code below (currently commented out) shifts all data times by 500 ns to compensate

In [None]:
###!!
###!!
###!! 
# data['time'] = data['time']+500
# data['endtime'] = data['endtime']+500
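
One way to check a suspected constant offset is to compare each reconstructed start time with the nearest truth start time and take the median difference. The sketch below does this on toy numbers. `estimate_time_offset` is a hypothetical helper, not part of `strax`, and it assumes at least two truth peaks that are well separated compared with the offset.

```python
import numpy as np


def estimate_time_offset(truth_times, data_times):
    """Hypothetical helper: median of (nearest truth time - data time)."""
    truth_times = np.sort(np.asarray(truth_times))
    data_times = np.asarray(data_times)
    # Index of the first truth time >= each data time, kept inside the array
    idx = np.clip(np.searchsorted(truth_times, data_times),
                  1, len(truth_times) - 1)
    left = truth_times[idx - 1]
    right = truth_times[idx]
    # Pick whichever neighbouring truth time is closer
    nearest = np.where(data_times - left < right - data_times, left, right)
    return np.median(nearest - data_times)


toy_truth = np.array([1000, 50000, 120000])
toy_data = toy_truth - 500  # reconstructed peaks that start 500 ns early
toy_offset = estimate_time_offset(toy_truth, toy_data)
```

A positive result means the reconstructed peaks start earlier than the truth peaks.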

Here is match_peaks.py, written by Jelle, which compares two sets of peaks

Changes:
  -  Changed 'type' to 'typeint' because types are listed as integers in strax

Call with (truth, data)

In [None]:
truthmatched, datamatched = match_peaks(truth,data)

Below is the output of match_peaks for the truth data.  
  - For each peak, **outcome** shows whether the peak was found, missed, merged, split up, or misidentified in the output of strax for the simulated data
  - **matched_to** shows which peak (peak_id in the other array) it was matched with, or the biggest peak it was matched with 

<img src='toptruthmatches.png'>
  
Below is the corresponding match_index in the simulated data
<img src='topdatamatch.png'>
  
You can see the splitting of the true s2 into an s1 and an s2
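
To make the outcome labels concrete, here is a simplified sketch of an overlap-based classification. It is not the `match_peaks` implementation: `classify_overlaps`, the tuple layout and the label strings are all assumptions chosen for illustration.

```python
def overlaps(a, b):
    """True if the time intervals of peaks a and b overlap."""
    return a[0] < b[1] and a[1] > b[0]


def classify_overlaps(truth, data):
    """Hypothetical helper. truth, data: lists of (start, end, type) tuples."""
    outcomes = []
    for t in truth:
        matches = [d for d in data if overlaps(t, d)]
        if not matches:
            outcomes.append('missed')
        elif len(matches) > 1:
            outcomes.append('split')
        elif sum(overlaps(t2, matches[0]) for t2 in truth) > 1:
            # The single matching data peak also covers another truth peak
            outcomes.append('merged')
        elif matches[0][2] != t[2]:
            outcomes.append('misid')
        else:
            outcomes.append('found')
    return outcomes


toy_truth_peaks = [(0, 100, 1), (1000, 2000, 2), (5000, 5100, 1),
                   (7000, 8000, 2), (9000, 9100, 1), (9150, 9200, 1)]
toy_data_peaks = [(5, 95, 1), (1000, 1400, 1), (1500, 2000, 2),
                  (7000, 8000, 1), (9000, 9200, 1)]
toy_outcomes = classify_overlaps(toy_truth_peaks, toy_data_peaks)
```

In this toy input the second truth peak is an S2 reconstructed as an S1 plus an S2, so it is labelled as split.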

In [None]:
pd.DataFrame.from_records(truthmatched[['merge_index','type','time','area','endtime','matched_to','outcome']])


In [None]:
pd.DataFrame.from_records(datamatched[['merge_index','type','time','area','endtime','matched_to','outcome']]).head(20)
#pd.DataFrame.from_records(truthmatched[['merge_index','type','time','area','endtime','matched_to','outcome']])


In [None]:
pd.DataFrame.from_records(truthmatched[truthmatched['outcome'] == b'found'][['merge_index','type','time','area','endtime','matched_to','outcome']])

## Plotting the results ##
The plots below show, as a function of several of the fields of the ``truth`` or the ``data``, the fraction of ``peaks`` per matching outcome. These fractions show how many of the ``peaks`` were found correctly.
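
The per-bin fractions behind such a plot can be computed with plain histograms. The sketch below is a guess at that calculation rather than the code of `plot_peak_matching_histogram`; `outcome_fractions` is a hypothetical helper.

```python
import numpy as np


def outcome_fractions(values, outcomes, bins):
    """Hypothetical helper: fraction of peaks per outcome in each bin."""
    values = np.asarray(values)
    outcomes = np.asarray(outcomes)
    total, _ = np.histogram(values, bins=bins)
    fractions = {}
    for outcome in np.unique(outcomes):
        counts, _ = np.histogram(values[outcomes == outcome], bins=bins)
        # Empty bins keep a fraction of 0 instead of dividing by zero
        fractions[str(outcome)] = np.divide(
            counts, total, out=np.zeros(len(total)), where=total > 0)
    return fractions


toy_fractions = outcome_fractions(
    values=[1, 1, 1, 2, 2],
    outcomes=['found', 'found', 'missed', 'found', 'split'],
    bins=[0.5, 1.5, 2.5])
```

Within each populated bin the fractions of all outcomes add up to one.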

In [None]:
plot_peak_matching_histogram(truthmatched,'typeint',bins=[0.5,1.5,2.5])
plt.xlabel('Peak Type')
plt.show()

In [None]:
plot_peak_matching_histogram(datamatched,'typeint',bins= [-0.5,0.5,1.5,2.5])
plt.xlabel('Peak Type')

In [None]:
plot_peak_matching_histogram(truthmatched,'z')
plt.xlabel('Depth')

In [None]:
plot_peak_matching_histogram(datamatched,'area_fraction_top')
plt.xlabel('Area Fraction Top')