In [2]:
# Module and data import
import numpy as np
import pandas as pd
import json
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from master_scripts.data_functions import (load_experiment, get_git_root, separation_distance, energy_difference,
                                           relative_energy, event_indices, normalize_image_data)
from master_scripts.analysis_functions import (doubles_classification_stats, singles_classification_stats)
from sklearn.metrics import f1_score
%load_ext autoreload
%autoreload 2
repo_root = get_git_root()

images = np.load(repo_root + "data/simulated/images_full_pixelmod.npy")
positions = np.load(repo_root + "data/simulated/positions_full.npy")
energies = np.load(repo_root + "data/simulated/energies_full.npy")
labels = np.load(repo_root + "data/simulated/labels_full.npy")

# Classification of Electron Events
For this part of the project, the goal is to separate events in the dataset into two categories, 'single' and 'double' electron events. Two main approaches have been taken so far. The first uses well-known network architectures pretrained on the [ImageNet](http://www.image-net.org/) database. The second uses a network architecture developed in an ML project at MSU, which we train from scratch ([MSU ML-Project, LaBollita](https://github.com/harrisonlabollita/MSU-Machine-Learning-Project)). In both cases the models are deep convolutional neural networks.

## TODO

    
 

## Classification Using Pretrained Models
Pretrained networks have previously been found to perform fairly well on data other than what they were trained on,
such as data from the AT-TPC ([Kuchera et al.](https://arxiv.org/abs/1810.10350)). Because these models are trained on complex image data with a large number of features, the idea is to use them as "feature extractors" on our own image data. Thanks to modern Python frameworks such as [TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), and [Keras](https://keras.io/), implementing and testing this approach is fairly straightforward, although not without challenges. The extracted features are then fed through a fully connected
network that we build and train from scratch using data with known labels.

We attempted classification with the following models, all pretrained on ImageNet:
* [DenseNet121, DenseNet169, DenseNet201](https://arxiv.org/abs/1608.06993)
* [InceptionResNetV2](https://arxiv.org/abs/1602.07261)
* [InceptionV3](http://arxiv.org/abs/1512.00567)
* [MobileNet](https://arxiv.org/pdf/1704.04861.pdf)
* [MobileNetV2](https://arxiv.org/abs/1801.04381)
* [NASNetLarge, NASNetMobile](https://arxiv.org/abs/1707.07012)
* [ResNet50](https://arxiv.org/abs/1512.03385)
* [VGG16, VGG19](https://arxiv.org/abs/1409.1556)
* [Xception](https://arxiv.org/abs/1610.02357)

See https://keras.io/applications/ for preliminary info about implementation of these models.

One advantage of pretrained models is that they allow testing the feasibility of applying ML
to classification or regression tasks, without needing expensive infrastructure for training
models (looking at you, RTX2080Ti).

### Challenges

#### Input color
The models expect an RGB image as input, meaning dimensions of (height, width, channels), e.g. VGG16 has a default input of (224, 224, 3). We can 'fake' these RGB channels by concatenating our input with itself along the channel axis, turning our (16, 16, 1) input into (16, 16, 3).
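
A minimal sketch of this channel trick, assuming the images are held as a NumPy array of shape (N, 16, 16, 1). The helper name `to_three_channels` and the random demo batch are illustrative and not part of the project code.

```python
import numpy as np


def to_three_channels(batch):
    """Copy the single channel of a (N, H, W, 1) batch three times -> (N, H, W, 3)."""
    if batch.ndim != 4 or batch.shape[-1] != 1:
        raise ValueError(f"expected shape (N, H, W, 1), got {batch.shape}")
    return np.concatenate([batch, batch, batch], axis=-1)


# Hypothetical stand-in for a small batch of detector images.
demo_batch = np.random.default_rng(0).random((4, 16, 16, 1))
demo_rgb = to_three_channels(demo_batch)
print(demo_rgb.shape)  # (4, 16, 16, 3)
```

All three channels carry identical values, so no information is added; the input only takes the shape the pretrained network expects.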

#### Input size and MaxPooling layers
Our data is essentially a 16x16 pixel image, far smaller than the expected input size of most,
if not all, of these pretrained networks. Without image manipulations such as padding with additional rows and columns, this stops us from using the full depth of some networks. These architectures typically consist of 'convolutional blocks', each followed by a MaxPooling layer, and each MaxPooling layer effectively halves the height and width of its input. So for networks with more than three such layers, our input is reduced to a single pixel, and the model will throw an error.

To combat this, the implementation of the models does two things:
1. Replace the input layer with one that accepts our input, (16, 16, 3). There are no weights in the input layer so this does not affect results.
2. Iterate over the layers in the model, adding them to our new model one by one, until we have added all layers or a specific error is thrown. This specific error lets us detect when the model is too deep for our input, and save the model in the state it was in before the error occurred.
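
The truncation logic in step 2 can be illustrated without any deep learning framework by tracking only the spatial size of the input. This is a simplified sketch, not the actual implementation: the layer names, the `pooled_size` and `truncate_layers` helpers, and the assumption that convolutions keep the spatial size ('same' padding) are all hypothetical. In the real models the error comes from the framework rather than from our own helper.

```python
def pooled_size(size, pool=2, stride=2):
    """Spatial size after an unpadded pooling layer; raises if nothing is left."""
    out = (size - pool) // stride + 1
    if size < pool or out < 1:
        raise ValueError(f"cannot pool an input of size {size}")
    return out


def truncate_layers(layer_names, input_size=16):
    """Keep layers in order until one cannot be applied to the shrinking input."""
    kept, size = [], input_size
    for name in layer_names:
        try:
            if name.startswith("pool"):
                size = pooled_size(size)
        except ValueError:
            break  # the model is too deep for this input: stop before this layer
        kept.append(name)
    return kept, size


# Hypothetical network: five blocks of two convolutions followed by one pooling layer.
toy_layers = [f"{kind}{block}" for block in range(1, 6)
              for kind in ("conv_a", "conv_b", "pool")]
kept_layers, final_size = truncate_layers(toy_layers)
print(len(toy_layers), len(kept_layers), final_size)
```

For this toy network the 16x16 input shrinks as 16, 8, 4, 2, 1 over the first four pooling layers, so the fifth pooling layer is the first one that cannot be added and the truncated model ends just before it.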





#### Are the extracted features viable for classification?
Before training the fully connected networks that take the extracted features from the pretrained models as input, we compared the feature distributions from each network to see whether it is reasonable to expect classification to work. The distribution of each individual feature, across all provided samples, was compared between single and double events using the [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test). If the difference between the distributions is significant, there should be a meaningful difference between single and double events that allows classification to work.
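
As an illustration of the per-feature comparison, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) for every feature column using NumPy only. The function names, the synthetic features, and the label convention (0 = single, 1 = double) are assumptions made for the example. In practice `scipy.stats.ks_2samp` returns the same statistic together with a p-value.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))


def ks_per_feature(features, labels):
    """KS statistic for every feature column, comparing label 0 against label 1."""
    singles, doubles = features[labels == 0], features[labels == 1]
    return np.array([ks_statistic(singles[:, j], doubles[:, j])
                     for j in range(features.shape[1])])


# Synthetic stand-in: feature 0 differs between the classes, feature 1 does not.
rng = np.random.default_rng(0)
demo_labels = np.repeat([0, 1], 500)
demo_features = rng.normal(size=(1000, 2))
demo_features[demo_labels == 1, 0] += 2.0
demo_ks = ks_per_feature(demo_features, demo_labels)
print(demo_ks)
```

A statistic close to 0 means the feature looks the same for both event types, while a value close to 1 means the two distributions barely overlap.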

### Results for pretrained models
The table below contains the results from our experiments with pretrained networks. The networks were trained on 160000 samples and tested on 40000 samples. The dataset was balanced (the same number of each type of event). To increase the robustness of the results, we used k-fold cross-validation with 5 folds for all networks. This gives a clearer picture of the actual performance of a network: with a single split, the validation data might happen to contain only "easy" samples and give artificially good results.
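
A minimal sketch of how 5-fold index splitting produces these sizes, assuming a total of 200000 samples (inferred from 160000 + 40000). The `kfold_indices` helper is illustrative; `sklearn.model_selection.KFold` provides the same kind of split.

```python
import numpy as np


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    shuffled = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for k in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, folds[k]


demo_splits = list(kfold_indices(200_000, n_folds=5))
for fold, (train_idx, val_idx) in enumerate(demo_splits):
    print(fold, train_idx.size, val_idx.size)  # 160000 training, 40000 validation
```

The min, max and mean accuracy in the table are then taken over the five validation folds.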

| Model           | Min accuracy | Max accuracy | Mean accuracy
| :---            |     :---:    |     :---:    |    :---:     
|DenseNet121      | 0.92         | 0.93         | 0.92
|DenseNet169      | 0.92         | 0.93         | 0.92
|DenseNet201      | 0.92         | 0.94         | 0.93
|InceptionResNetV2| 0.88         | 0.89         | 0.88
|InceptionV3      | 0.87         | 0.88         | 0.88
|MobileNet        | 0.50         | 0.86         | 0.71
|MobileNetV2      | 0.50         | 0.50         | 0.50
|NASNetLarge      | 0.91         | 0.92         | 0.92
|NASNetMobile     | 0.91         | 0.92         | 0.92
|ResNet50         | 0.50         | 0.50         | 0.50
|VGG16            | 0.91         | 0.92         | 0.91
|VGG19            | 0.88         | 0.90         | 0.89
|Xception         | 0.92         | 0.93         | 0.92

The network with the best accuracy was DenseNet201, both for max and mean accuracy. It is closely followed by the other DenseNet variants, the NASNet variants, Xception, and VGG16. We chose DenseNet201 for further study. Several metrics besides accuracy together provide deeper insight into the performance of a classifier.


### DenseNet201 Additional Analysis
To gain more insight we produce a few more metrics specifically for the top-performing model.
These metrics are:
* [ROC-Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
* [F1-Score](https://en.wikipedia.org/wiki/F1_score)
* [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

For double events specifically, we look at which types of events are more difficult to classify than others. To find out, we explore the distributions of relative energy and separation distance for the events that are misclassified. We expect low separation distances to be more difficult, as this is currently the case for humans. Double events with sufficiently low separation distance may be indistinguishable from single events.

### Which events are difficult to classify
* Plot relative energy without swapping e1 and e2 to constrain values to [0,1]
    Done.
* Scatterplot correctly classified events as well
* Investigate if there are some images that are always misclassified across multiple training runs

## Classification with custom model
We base our implemented model on the structure of state-of-the-art models like VGG,
and compare the results with the model from previous work ([MSU ML-Project, LaBollita](https://github.com/harrisonlabollita/MSU-Machine-Learning-Project)).

In [4]:
# Load experiment and associated model (must be a saved model instance complete with weights)
c_experiment_id = "ac1722ba32d2"
c_experiment = load_experiment(c_experiment_id)
c_model = tf.keras.models.load_model(repo_root + "models/" + c_experiment_id + ".h5")
# Print experiment metrics
print("==== Experiment metrics")
print(json.dumps(c_experiment['metrics'], indent=2))
print("====")

# Get validation indices used in the experiment and predict on validation set
c_val_idx = np.array(c_experiment['indices']['fold_0']['val_idx'])
# Predict on the validation set
c_prediction = c_model.predict(normalize_image_data(images[c_val_idx]))
c_val_pred = (c_prediction > 0.5).astype(int)

c_s_idx, c_d_idx, c_c_idx = event_indices(positions[c_val_idx])
c_non_close_idx = np.setdiff1d(np.concatenate((c_s_idx, c_d_idx), axis=0), c_c_idx)
c_f1_close = f1_score(labels[c_val_idx][c_c_idx], c_val_pred[c_c_idx])
c_f1_non_close = f1_score(labels[c_val_idx][c_non_close_idx], c_val_pred[c_non_close_idx])
print("F1-score for double events separated by less than 1 pixel:", c_f1_close)
print("F1-score for double events separated by more than 1 pixel:", c_f1_non_close)
print("F1-score for all events:", c_experiment['metrics']['f1_score'])

==== Experiment metrics
{
  "accuracy_score": 0.9843326315789473,
  "confusion_matrix": {
    "TN": 236777,
    "FP": 755,
    "FN": 6687,
    "TP": 230781
  },
  "f1_score": 0.9841323314939745,
  "matthews_corrcoef": 0.9689674479424194,
  "roc_auc_score": 0.9936520410947314
}
====
F1-score for double events separated by less than 1 pixel: 0.6349801959558057
F1-score for double events separated by more than 1 pixel: 0.9877403830618667
F1-score for all events: 0.9841323314939745
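
As a consistency check, the accuracy, F1-score and Matthews correlation coefficient can be recomputed from the confusion matrix counts printed above. The helper below is a small sketch using the standard definitions; its name is illustrative.

```python
import math


def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, F1 and Matthews correlation from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Counts copied from the experiment metrics printed above.
reported = metrics_from_confusion(tn=236777, fp=755, fn=6687, tp=230781)
print(reported)
```

The recomputed values agree with the reported `accuracy_score`, `f1_score` and `matthews_corrcoef`. The counts also show that false negatives (6687) outnumber false positives (755) by roughly a factor of nine.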


# Prediction of Energies and Positions
For this part of the project, the goal is to predict the energy and position of electrons in an event in the dataset. Previous work has made predictions of position for single-electron events with great performance ([MSU ML-Project, LaBollita](https://github.com/harrisonlabollita/MSU-Machine-Learning-Project)). We aim to reproduce and possibly improve the results from previous work on position prediction, predict the energy in single-electron events, and then move on to predict positions and energies in double-electron events.

This is essentially a regression problem, in that we have a continuous output variable that we want to relate to some input variables (our images). As in classification, the convolutional layers work as a sort of feature extractor. The feature representation of our input is then fed as input to a regression layer, which outputs our positions or energies. In fact, you can predict these values using linear regression; it just doesn't work very well. Thus, we enter the realm of "Deep Regression".
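
For reference, a linear-regression baseline on flattened pixels could look like the sketch below, scored with the coefficient of determination R² for a single output. Everything here is illustrative: the helper names are made up, and the synthetic "energy" is defined as the pixel sum, which is exactly linear in the input. The baseline therefore recovers it almost perfectly, which says nothing about how it performs on the real targets.

```python
import numpy as np


def design_matrix(images):
    """Flatten each image to a row vector and append a constant bias column."""
    flat = images.reshape(len(images), -1)
    return np.hstack([flat, np.ones((len(flat), 1))])


def fit_linear_baseline(images, targets):
    """Ordinary least-squares fit from pixels to a single target value."""
    weights, *_ = np.linalg.lstsq(design_matrix(images), targets, rcond=None)
    return weights


def r2_score_np(y_true, y_pred):
    """Coefficient of determination for a single output."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data: the target is the total pixel intensity.
rng = np.random.default_rng(0)
demo_images = rng.random((500, 16, 16, 1))
demo_energy = demo_images.sum(axis=(1, 2, 3))

demo_weights = fit_linear_baseline(demo_images[:400], demo_energy[:400])
demo_pred = design_matrix(demo_images[400:]) @ demo_weights
demo_r2 = r2_score_np(demo_energy[400:], demo_pred)
print(demo_r2)
```

Running the same baseline on the simulated data would give a reference R² to compare against the values reported for the convolutional models.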

In part we intend to follow the work done in [A Comprehensive Analysis of Deep Regression](https://arxiv.org/abs/1803.08450), but using our own models built from scratch. Using pretrained networks is a possible path to try here, but currently we've met with implementation difficulties due to the size of our input.
We might solve this the same way as for classification, by simply not using all the layers. However, the article above also found that the placement of the regression layer was crucial, and performed best when placed after both fully connected layers.


## Single-electron events
Based on the architecture developed in previous work, we have made two separate models for prediction of position and energy. However, the only difference between the models for the single-electron case is the final output layer, which must account for outputting either one value in the energy case, or two values in the position case (x, y coordinates).

### Energy prediction results
500k events, R2 = 0.9764 after 4 epochs (early stopping)


In [6]:
# Load experiment and associated model (must be a saved model instance complete with weights)
e_experiment_id = "e0ea61a8c8a6"
e_experiment = load_experiment(e_experiment_id)
e_model = tf.keras.models.load_model(repo_root + "models/" + e_experiment_id + ".h5")
# Print experiment metrics
print("==== Experiment metrics")
print(json.dumps(e_experiment['metrics'], indent=2))
print("====")

# Get validation indices used in the experiment and predict on validation set
e_val_idx = np.array(e_experiment['indices']['fold_0']['val_idx'])
# Predict on the validation set
e_prediction = e_model.predict(normalize_image_data(images[e_val_idx]))


OSError: SavedModel file does not exist at: /home/ulvik/git/master_analysis/models/e0ea61a8c8a6.h5/{saved_model.pbtxt|saved_model.pb}

### Position prediction results
500k events, R2 = 0.9855 after 5 epochs (early stopping)

In [5]:
"11d3fa39e305"
# Load experiment and associated model (must be a saved model instance complete with weights)
p_experiment_id = "11d3fa39e305"
p_experiment = load_experiment(p_experiment_id)
p_model = tf.keras.models.load_model(repo_root + "models/" + p_experiment_id + ".h5")
# Print experiment metrics
print("==== Experiment metrics")
print(json.dumps(p_experiment['metrics'], indent=2))
print("====")

# Get validation indices used in the experiment and predict on validation set
p_val_idx = np.array(p_experiment['indices']['fold_0']['val_idx'])
# Predict on the validation set
p_prediction = p_model.predict(normalize_image_data(images[p_val_idx]))


'11d3fa39e305'

## Double-electron events
Based on the architecture used in classification, we have made two separate models for prediction of position and energy. This is partly because of the difference in output, and partly because we assume that predicting these quantities may be fundamentally different tasks. It is nevertheless trivial to attempt prediction of positions using the energy model (and vice versa), as long as the final output layer is adjusted.

### Energy prediction results
500k events, R2 = 0.4552 (3 epochs)

### Position prediction results
500k events, R2 = 0.4819 after 3 epochs (early stopping)


### Why the low score compared with single events?
* Loss functions
* Target and input representation
* Convolution blocks - complexity of network
* Fine-Tuning (ref. article, on pretrained networks)
* Regression layer placement
* Is the CNN's spatial invariance affecting its ability to predict positions well in multiple-object cases?

Papers to check out:
* [Numerical Coordinate Regression with CNNs](https://arxiv.org/abs/1801.07372)
* [DeepDistance](https://arxiv.org/abs/1908.11211)
* [A Comprehensive Analysis of Deep Regression](https://arxiv.org/abs/1803.08450)
* [Human pose estimation via Convolutional Part Heatmap Regression](https://arxiv.org/abs/1609.01743)
* [Evaluating and Calibrating Uncertainty Prediction in Regression Tasks](https://arxiv.org/abs/1905.11659)


Group by distance to check if there are some events that are dominating the results.
Also for relative energies. 
* Can we treat the data as heatmaps? Or can we output heatmaps from the CNN?
* Look at object detection

# Interpretability of models


## LIME
LIME (Local Interpretable Model-Agnostic Explanations, [Module](https://github.com/marcotcr/lime), [Paper](https://arxiv.org/abs/1602.04938)) is a project that aims to explain what classifiers are doing.
Have done preliminary testing with classification, but:
* Must make some changes to make it run with our classification model. Seems to be a problem with LIME assuming 3 channels (RGB).
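
One possible workaround, sketched under assumptions: LIME's image explainer takes a `classifier_fn` that receives batches of RGB images and returns class probabilities, so a thin wrapper can drop the duplicated channels before calling our single-channel model. The wrapper name and the dummy predictor are hypothetical, and the sketch assumes the model outputs a single sigmoid probability for the 'double' class.

```python
import numpy as np


def make_rgb_classifier_fn(predict_single_channel):
    """Wrap a (N, H, W, 1) -> (N, 1) predictor so that it accepts (N, H, W, 3) input."""
    def classifier_fn(rgb_batch):
        rgb_batch = np.asarray(rgb_batch)
        gray = rgb_batch[..., :1]  # keep only the first channel
        p_double = np.asarray(predict_single_channel(gray)).reshape(-1, 1)
        return np.hstack([1.0 - p_double, p_double])  # one column per class
    return classifier_fn


def dummy_predict(batch):
    """Stand-in for a trained model's predict method: mean intensity as probability."""
    return batch.mean(axis=(1, 2, 3)).reshape(-1, 1)


demo_fn = make_rgb_classifier_fn(dummy_predict)
demo_probs = demo_fn(np.random.default_rng(0).random((5, 16, 16, 3)))
print(demo_probs.shape)  # (5, 2)
```

With the real model, the `predict` method of the loaded classifier would take the place of `dummy_predict`.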

If we can explain the behaviour of the classifier to a greater extent, perhaps this can aid in the regression.

# Unsupervised Methods
Unsupervised methods do not require labeled data, but rather look for underlying symmetries and structures in the data itself.

Currently looking into:
* Variational Autoencoders

# Outline for thesis
Start with the standard framework, similar to projects.
## Abstract
## Introduction
### Motivate the reader, why is this project relevant
## Theory
### LinReg
### LogReg
### NNs, backprop
### Convolutions
## Implementation
### Testing and training with simulated data
### Examples with current methods