# Using an existing NN on new data
Getting your data ready to be pushed into the network takes only a few steps. I realize that there are a lot of different files on taki and that it is not obvious how things should be done.

So I found it easier to create this notebook to demonstrate how to use everything and which functions you'll need to generate a confusion-matrix-like table.



## Download the required libraries for running all parts

In [None]:
!pip install "tensorflow-gpu>=2.0.0" pandas matplotlib dill multiprocessing_on_dill

## Setting up the neural network
To set up the neural network, the first step is to download it from my GDrive. Ideally, you would upload the network to your own Drive and keep it private if you want to run this notebook.

This will also allow us to load in the scaler object.

I'm using a self-compiled adaptation of https://github.com/gdrive-org/gdrive to upload things from my computer directly to GDrive.

In [None]:
import requests
from google.colab import drive
from tensorflow.keras.models import load_model

drive.mount('/content/gdrive', force_remount=True)
root_path = 'gdrive/My Drive/'


def download_model(u):
  # Fetch a saved Keras model from URL u and load it
  x = requests.get(u, stream=True, allow_redirects=True)
  print("Downloaded a", x.headers.get('content-type'))
  x.raw.decode_content = True # handle spurious Content-Encoding
  y = x.raw.read()
  # Dumb work around because Keras won't load from raw bytes
  with open("tmp.h5", "wb") as outfile:
    outfile.write(y)
  model = load_model("tmp.h5")
  return model


def download_py(filename, u):
  # Fetch a Python source file from URL u and save it as filename
  x = requests.get(u, stream=True, allow_redirects=True)
  print("Downloaded a", x.headers.get('content-type'))
  x.raw.decode_content = True # handle spurious Content-Encoding
  y = x.raw.read()
  with open(filename, "wb") as outfile:
    outfile.write(y)
  # Return the path that was written (there is no model to return here)
  return filename

## Preprocessing Input Data
Now we have to load in the scaler object and use it to put your data into the same feature ranges as the data the network was trained on.

I will assume that your data is RAW and not at all preprocessed.
1. There are no extra labels
2. It is in a gross CSV
3. You know the outputs and want to evaluate accuracy

Each subsection goes over how to put it into the shape we desire! We are assuming that you are NOT attempting to generate false singles, false doubles, swapped rows, or anything like that. That process takes a very long time on taki and it would never run on Colab.

1. If you're just trying to do prediction and do not have encoded data proceed through preprocessing as normal but skip "Getting a confusion matrix". Set `outputs=False`.
2. If you're trying to use pre-encoded data which is non-normalized, just skip to "One hot encoding", load your data there, and proceed as normal.
3. If you're trying to use pre-encoded normalized data, just skip to "One hot encoding" and name your inputs variable `xscaled` instead of `x`.
4. If you want to use non-encoded data:
 1. If it has already been sanitized and labelled, you can edit the load command in "Converting From CSV to DF", then skip "Generating the Maggi Labels" and "Adding the Permutation Labels".
 2. Alternatively, you can edit the load command in "One hot encoding" and load your CSV into `all_data`.



In [None]:
outputs=True

### Required helper functions

Run the below block to initialize all helper functions required.

We'll download the source files from taki to save on notebook space.

In [None]:
from shutil import copy
files = ["convertToOneHot.py", "maggi.py", "makeNewLabels.py", "helpers.py"]
driveLocation = "colabFiles"
for aFile in files:
  copy("{}/{}/{}".format(root_path, driveLocation, aFile), "./{}".format(aFile))

### Converting From CSV to DF
Our first goal is to take in the original CSV and load it into a `pandas.DataFrame` so we can easily add all labels to it.

In [None]:
inputFile = "{}/{}".format(root_path, "PGML_DATA/1GEvents_150MeV_sp_E0_nulls.csv")
import pandas as pd
detM = pd.read_csv(inputFile)

### Generating the Maggi Labels
First we'll use an updated version of Dr. Maggi's preprocessing script to add his original labels, which are then used to build the newer labels.

Note that in a more time-constrained scenario we would jump straight to "One hot encoding" after adding the Maggi labels.
However, for now, we do this a little slower.

In [None]:
from maggi import addMaggiLabels
lowEThresh = 0.05
upperEThresh = 2.7
sings, doubs, trips = addMaggiLabels(detM, lowEThresh, upperEThresh)

### Adding the Permutation Labels
Now we add all the newer labels! In the actual code on taki, for gigantic files, this is done using some parallel magic. Please note that the processors on Colab are terrible! You're better off loading pre-encoded data at an earlier stage and just skipping this step if you can.

I'll optimize the routines later...


In [None]:
from makeNewLabels import computePermuteLabels
print("Computing permutation labels for doubles")
doubs = doubs.apply(lambda row: computePermuteLabels(row, outputs=outputs), axis=1)
print("Computing permutation labels for triples")
trips = trips.apply(lambda row: computePermuteLabels(row, outputs=outputs), axis=1)

### One hot encoding
Now we'll take the large DataFrame and one hot encode it so that it can be used in the network.

1. If you have data which has already been encoded, feel free to comment out all of the code in the next block and just load your unscaled data into `x`.
2. If you have known outputs, put them into `y`.
3. If you do not have known outputs, just set `y` to be an empty array.

In [None]:
from pandas import concat, read_csv
from convertToOneHot import toOneHotArray
from numpy import array
# all_data = concat([trips, doubs])
all_data = read_csv("{}/{}/{}".format(root_path, driveLocation, "all_data_05G.csv"))
x, y = toOneHotArray(all_data, outputs=outputs)
# x = None
# y = array([])
# xscaled = None
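
For readers who have not seen `toOneHotArray`, the idea behind the output encoding can be sketched as follows. This is only an illustration and is not the code in `convertToOneHot.py`: the helper `one_hot_labels` and the three-class `toy_index` mapping are hypothetical, with the mapping modeled on the `index1` dictionary used later in this notebook.

```python
import numpy as np

# Hypothetical, shortened label -> column mapping (the real one has 25 classes).
toy_index = {"123": 0, "132": 1, "213": 2}


def one_hot_labels(labels, index):
    """Return a (len(labels), len(index)) array with a single 1 per row."""
    encoded = np.zeros((len(labels), len(index)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1
    return encoded


y_toy = one_hot_labels(["132", "123", "213", "132"], toy_index)
print(y_toy)
```

Taking `argmax` along each row recovers the class number, which is how the confusion matrix code later reads `y`.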

### Normalization
From here we will normalize your encoded data, which is the last step before it can be used by the network.

In [None]:
import dill
with open("{}/{}/{}".format(root_path, driveLocation, "all_data_40G_balanced.scaler"), "rb") as infile:
  scaler = dill.load(infile)
xscaled = scaler.transform(x)
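
What `scaler.transform` does depends on the scaler that was pickled, which this notebook does not show. As a rough sketch, assuming a min-max style scaler (an assumption; the stored object may use a different scheme), the key point is that the ranges come from the training data and are only applied to new data, never refit on it. The `ToyMinMaxScaler` class and the arrays below are hypothetical.

```python
import numpy as np


class ToyMinMaxScaler:
    """Hypothetical stand-in for the stored scaler (min-max scaling assumed)."""

    def fit(self, data):
        self.lo = data.min(axis=0)
        span = data.max(axis=0) - self.lo
        # Avoid division by zero for constant columns.
        self.span = np.where(span == 0, 1.0, span)
        return self

    def transform(self, data):
        return (data - self.lo) / self.span


train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
new = np.array([[2.5, 15.0], [12.0, 30.0]])

toy_scaler = ToyMinMaxScaler().fit(train)  # done once, at training time
new_scaled = toy_scaler.transform(new)     # reuse the training ranges
print(new_scaled)
```

In this sketch, a value outside the training range (12.0 in the first column) maps outside [0, 1], since the ranges are not recomputed for the new data.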

## Evaluating Network Performance
Here we compute the confusion matrix and also evaluate the NN performance on the data you've provided. If you're using data that does not have a known output then you should expect `y` to be an empty array. As such do not bother running through "Getting a confusion matrix".


### Required functions
This block loads the model and generates the two indexing dictionaries which will be needed for the following two blocks.

In [None]:
index1 = {'123': '0', '124': '1', '132': '2', '134': '3', '142': '4', '143': '5',
'213': '6', '214': '7', '231': '8', '234': '9', '241': '10', '243': '11', 
'312':'12', '314': '13', '321': '14', '324': '15', '341': '16', '342': '17', 
'412': '18', '413': '19', '421': '20', '423': '21', '431': '22', '432': '23', 
'444':'24'}
# Invert index1: class number (int) -> permutation label
index2 = {int(value): key for key, value in index1.items()}

from tensorflow.keras.models import load_model
model = load_model("{}/{}/{}".format(root_path, driveLocation, "15dd8d785eb6f701fd1eda81572657bcb22a4fe76fa7b41672b1a60b5ea70e91-1024.h5"))
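
If you only want predictions (`outputs=False`), the network output still has to be mapped back to a permutation label. A minimal sketch, assuming the model returns one row of class scores per event (25 in the real network): `fake_scores` stands in for `model.predict(xscaled)` and `toy_index2` is a shortened copy of `index2`. Both are hypothetical.

```python
import numpy as np

# Shortened, hypothetical version of index2 (class number -> label).
toy_index2 = {0: "123", 1: "124", 2: "132"}

# Made-up scores standing in for model.predict(xscaled).
fake_scores = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
])

# The predicted class is the column with the highest score in each row.
predicted_classes = fake_scores.argmax(axis=1)
predicted_labels = [toy_index2[int(c)] for c in predicted_classes]
print(predicted_labels)
```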

### Getting a confusion matrix
Here we evaluate the network's performance on your provided data.


In [None]:
from pandas import DataFrame, set_option
from numpy import max, argmax, concatenate, where, zeros, arange, nan_to_num
set_option('display.max_columns', None)
set_option('display.max_rows', None)
def getConfusionMatrix(model, x, y):
  # Get predictions
  y_res = model.predict(x)
  y_max = max(y_res, axis=1).reshape(y_res.shape[0],1)
  # Convert prediction to binary classification
  y_res[y_res < y_max] = 0
  y_res[y_res >= y_max] = 1
  # Compare predicted results to real answers
  t = argmax(y, axis=1).reshape(y.shape[0], 1)
  # 25 classes classified 25 ways
  v = zeros((25,25))
  for i in range(25):
    # Select only the rows whose true class is i
    z1, z2 = where(t == i)
    # Determine how many times (i) was classified into each class
    v[i,:] = y_res[z1].sum(axis=0)
  # Convert the raw values into percents of total
  vs = v.sum(axis=1).reshape(v.shape[0],1)
  # To prevent NaN
  vs[vs == 0] = 1
  percent = (v / vs)
  percent = nan_to_num(percent.flatten()).reshape(percent.shape)
  columns = list(map(lambda x: index2[x], sorted(index2.keys())))
  # columns.insert(0, "correct")
  d = DataFrame(v, index=columns, columns=columns)
  percent = DataFrame(percent, index=columns, columns=columns)
  return d, percent
raw, per = getConfusionMatrix(model, xscaled, y)
%load_ext google.colab.data_table
per
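
To sanity-check how the table is read, here is a small self-contained sketch of the same row-normalized counting on three classes. It does not call the network; `toy_confusion`, `y_true`, and `y_pred` are hypothetical names and the data is made up for illustration. Rows are the true class and columns are the predicted class.

```python
import numpy as np


def toy_confusion(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs and normalize each row by its total."""
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1  # classes with no samples stay all-zero
    return counts, counts / totals


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
counts, fractions = toy_confusion(y_true, y_pred, 3)
print(fractions)
```

The diagonal of `fractions` gives the fraction of each true class that was classified correctly, and every row with at least one sample sums to 1.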
