# Caffe 2 Notebook

This notebook uses Caffe 2 to train a neural network on jet data, mirroring what the Keras notebook does.

## Data Preprocessing 

The first few cells below only put the data into a convenient format. They remove blank (all-zero) data points, add labels to the images, and then combine the signal and background images into a single data set.
I plan to condense this code at another point so that it can be done in one run.

In [1]:
import numpy as np
import pandas as pd 
import h5py

import matplotlib.pyplot as plt
from matplotlib import colors
from matplotlib.colors import Normalize, LogNorm

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

np.random.seed(7)

from caffe2.python import core, utils, workspace
from caffe2.proto import caffe2_pb2

def get_data(filename):
    """
        Load the gTowerEt images from an HDF5 file and reshape them to (n_events, 28, 32).
        The data was converted to HDF5 using kratsg's gML repository on GitHub: https://github.com/kratsg/gML.git
    """
    # read the whole dataset into memory, then close the file
    with h5py.File(filename, 'r') as data:
        # 28 eta rows and 32 phi columns
        gTowerEt = data['gTowerEt'][:].reshape(-1, 28, 32)
    return gTowerEt



In [2]:
gTowerEt_background1 = get_data("JZ0W_hdf5/user._010556.JZ0W.hdf5")
gTowerEt_background2 = get_data("JZ0W_hdf5/user._010590.JZ0W.hdf5")
gTowerEt_background3 = get_data("JZ0W_hdf5/user._010592.JZ0W.hdf5")
gTowerEt_background4 = get_data("JZ0W_hdf5/user._010594.JZ0W1.hdf5")
gTowerEt_background = np.concatenate([gTowerEt_background1
                                  ,gTowerEt_background2
                                  ,gTowerEt_background3
                                  ,gTowerEt_background4])



In [3]:
gTowerEt_signal1 = get_data("ZvvHbb_hdf5/user._000118.ZvvHbb.hdf5")
gTowerEt_signal2 = get_data("ZvvHbb_hdf5/user._000123.ZvvHbb.hdf5")
gTowerEt_signal3 = get_data("ZvvHbb_hdf5/user._000130.ZvvHbb.hdf5")
gTowerEt_signal4 = get_data("ZvvHbb_hdf5/user._000139.ZvvHbb.hdf5")
gTowerEt_signal = np.concatenate([gTowerEt_signal1
                                 ,gTowerEt_signal2
                                 ,gTowerEt_signal3
                                 ,gTowerEt_signal4])

In [4]:
def flatten_data(data, second_dim):
    """
        Flatten data from 3 dimensions to 2 dimensions
        data: 3D input of shape (n_events, 28, 32)
        second_dim: number of columns in each flattened row
    """
    flat_array = data.reshape(-1, second_dim)
    return flat_array

# flatten the signal and background images
gTower_signal_flat = flatten_data(gTowerEt_signal, 28*32)
gTower_background_flat = flatten_data(gTowerEt_background, 28*32)

# check what the data looks like now
print("shapes:\n signal:{}\n background: {}".format(gTower_signal_flat.shape, gTower_background_flat.shape))

# convert to a pandas data frame
df_signal_flat = pd.DataFrame(gTower_signal_flat)
df_background_flat = pd.DataFrame(gTower_background_flat)

# check where the data ends and the zero padding begins
print("Column 832 is the first all-zero column in signal: sum(832) = {}".format(df_signal_flat[832].sum()))
print("Column 831 is the last non-zero column in signal: sum(831) = {}".format(df_signal_flat[831].sum()))

# drop zero columns
df_signal_flat.drop(df_signal_flat.columns[832:], axis=1, inplace=True)

print("Column 832 is the first all-zero column in background: sum(832) = {}".format(df_background_flat[832].sum()))
print("Column 831 is the last non-zero column in background: sum(831) = {}".format(df_background_flat[831].sum()))

# drop zero columns
df_background_flat.drop(df_background_flat.columns[832:], axis=1, inplace=True)

df_signal_flat.head()

shapes:
 signal:(1000, 896)
 background: (1000, 896)
Column 832 is the first all-zero column in signal: sum(832) = 0.0
Column 831 is the last non-zero column in signal: sum(831) = 568013.125
Column 832 is the first all-zero column in background: sum(832) = 0.0
Column 831 is the last non-zero column in background: sum(831) = 499746.78125


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1948.007568 | 1229.803589 | 1362.21167 | 1773.021362 | 1805.363159 | 937.377747 | 770.113647 | 1283.386475 | 0.0 | 463.752502 | ... | 1453.622314 | 1195.59021 | 991.512695 | 595.91272 | 180.354721 | 337.931519 | 828.340332 | 862.571777 | 830.117065 | 797.693115 |
| 1 | 504.729767 | 1053.224731 | 1948.789429 | 472.535736 | 431.983521 | 1618.163696 | 1084.297852 | 522.354248 | 1145.151978 | 884.213806 | ... | 674.066772 | 1018.593689 | 2172.189453 | 1589.266357 | 1166.685791 | 845.343872 | 433.920105 | 595.932251 | 2259.294922 | 1107.355469 |
| 2 | 949.828003 | 716.89502 | 123.32637 | 945.23053 | 718.803589 | 65.158417 | 812.612 | 702.350159 | 560.865112 | 1201.162842 | ... | 394.760132 | 0.0 | 322.434998 | 591.893433 | 846.602661 | -170.080933 | 975.725342 | 134.92453 | 3593.008057 | 1467.026855 |
| 3 | 226.033798 | -61.54623 | 538.230469 | 27.10696 | 0.0 | 177.788284 | 159.796814 | 568.038025 | 154.135437 | 0.0 | ... | 114.888306 | 78.900848 | 1171.467407 | 0.719444 | 492.379333 | 1436.708984 | 138.477692 | 80.839256 | 73.998993 | -30.185978 |
| 4 | -151.946899 | 429.102966 | 270.738861 | 205.938675 | 233.599579 | 532.302063 | 735.905396 | 104.52816 | 657.799927 | -145.565964 | ... | 344.546844 | 78.911766 | 1261.254395 | 620.452209 | 251.628922 | 482.965149 | 1691.376953 | 29.792877 | 784.313354 | 5042.091797 |
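
The cut at column 832 can be read back onto the detector grid. The following is a minimal sketch, assuming the default row-major layout of NumPy's `reshape` and the 28 eta rows by 32 phi columns noted in `get_data`; `column_to_tower` is a hypothetical helper and the events are synthetic stand-ins, not real jet data.

```python
import numpy as np

N_ETA, N_PHI = 28, 32  # grid shape used by get_data

def column_to_tower(col, n_phi=N_PHI):
    """Map a flattened column index back to its (eta_row, phi_col) position."""
    return divmod(col, n_phi)

# Synthetic stand-in for real events: two events, each tower holds its own flat index.
events = np.arange(2 * N_ETA * N_PHI, dtype=np.float32).reshape(-1, N_ETA, N_PHI)
flat = events.reshape(-1, N_ETA * N_PHI)

first_dropped = column_to_tower(832)  # first column removed by the drop
last_kept = column_to_tower(831)      # last column that is kept
print(first_dropped, last_kept)

# Row-major flattening means flat[:, eta * N_PHI + phi] == events[:, eta, phi].
assert np.array_equal(flat[:, 26 * N_PHI + 0], events[:, 26, 0])
```

Under these assumptions, columns 832 through 895 are exactly the last two eta rows (rows 26 and 27), so dropping them removes whole rows of the grid rather than scattered towers.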


In [5]:
df_background_flat.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 102.223244 | 368.700684 | 137.223206 | 3186.571777 | 3.529137 | 0.0 | 1580.153198 | 88.257584 | 241.882187 | 132.519989 | ... | -174.814178 | 1479.688599 | 442.842407 | 763.384705 | 183.14595 | 307.035736 | 380.927002 | 444.83728 | 524.020386 | 197.409546 |
| 1 | 1862.122681 | 725.858398 | 495.343079 | 522.68927 | 1037.605713 | 1459.342163 | 1467.959839 | 951.591431 | 1289.745483 | 0.0 | ... | 1474.365601 | 754.765991 | 804.412048 | 987.89209 | 949.158386 | 317.838837 | 1316.921509 | 711.112122 | 566.910278 | 1134.835327 |
| 2 | 1338.201538 | 63.295807 | 1250.506104 | 1.934196 | 105.690559 | 1010.89801 | 814.23645 | 68.646568 | -44.988132 | 871.175598 | ... | 128.142059 | 99.144165 | 91.923775 | 337.149536 | -27.088104 | 296.295044 | 826.219788 | 259.561218 | 504.078888 | 118.668991 |
| 3 | 508.373199 | 357.868134 | 54.637264 | 761.106384 | 240.089447 | 100.703369 | 477.226807 | 444.954498 | 702.838623 | 109.055504 | ... | 524.925659 | 1218.190186 | 0.0 | 983.98291 | 704.678345 | 472.131653 | 111.450127 | 762.669006 | 20.642662 | 627.526917 |
| 4 | 1759.074219 | 1074.358521 | 51.794151 | 659.456604 | -419.165039 | 619.112854 | 219.199036 | 487.793335 | 622.421021 | 32.543133 | ... | 678.214233 | 395.708679 | 401.175659 | 570.642761 | 652.222168 | 319.069214 | 204.106003 | 1265.324707 | 103.765228 | 440.996307 |


In [6]:
def add_ones_or_zeros(data_frame, one_or_zero):
    """
        data_frame: should be a pandas Data Frame
        
        add ones or zeros to data depending on whether it is signal or background
    """
    if one_or_zero == 1:
        label = np.ones(len(data_frame))
        label_series = pd.Series(label)
        
        
    elif one_or_zero == 0:
        label = np.zeros(len(data_frame))
        label_series = pd.Series(label)
        
    else:
        print("Error: Not a 1 or 0")
        return None
    
    data_frame["832"] = label_series
    return data_frame 
    
df_signal_with_label = add_ones_or_zeros(df_signal_flat, 1)
df_background_with_label = add_ones_or_zeros(df_background_flat, 0)

print(df_signal_with_label["832"].sum(), df_background_with_label["832"].sum())

(1000.0, 0.0)


In [7]:
df = pd.concat([df_background_with_label, df_signal_with_label])

In [8]:
from sklearn.utils import shuffle
df = shuffle(df)
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 | 832 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72 | 617.601318 | 1025.455811 | 0.0 | 1011.240295 | 99.958839 | 333.287323 | 1539.869385 | 1168.764648 | 738.572998 | 372.952454 | ... | 134.856674 | -10.657075 | 26.755604 | 2667.298828 | 499.250366 | 27.744547 | 1043.870361 | 326.671387 | 1146.979004 | 1.0 |
| 48 | 2.102772 | 723.881165 | 296.339325 | 356.064789 | 543.670959 | 457.909302 | 764.226318 | 331.417816 | 760.755249 | -15.101002 | ... | 336.678497 | 29.638662 | 690.665833 | 660.871948 | 482.679749 | 848.962769 | 121.525925 | 280.449341 | 1931.817627 | 1.0 |
| 204 | 528.239075 | 186.798325 | -79.647118 | -66.266968 | 281.06662 | 62.878704 | 95.930397 | 0.0 | 515.882568 | 289.449341 | ... | 245.353912 | 92.703529 | 127.512253 | -160.476349 | 213.109863 | 488.552246 | 555.118774 | 304.462646 | 161.072601 | 0.0 |
| 694 | 2684.563232 | 2531.401123 | 692.038208 | 242.560425 | 1206.665771 | 996.734131 | 335.721771 | 378.905792 | 496.130066 | 796.467285 | ... | 507.88974 | 137.892517 | 827.784912 | 1302.798584 | 329.696167 | 810.409546 | 637.320984 | 674.380676 | 970.458496 | 0.0 |
| 758 | 473.478973 | 251.773315 | 124.23793 | 524.202026 | 28.579481 | 109.355637 | 663.056824 | 176.261261 | 42.052715 | 105.625214 | ... | 427.932251 | 806.014404 | 334.009796 | 560.595093 | 241.998962 | 306.200806 | 454.349487 | 135.056351 | 367.607117 | 1.0 |


In [9]:
labels = df['832']
labels = np.array(labels)
print(labels[:10])

[1. 1. 0. 0. 1. 0. 1. 1. 1. 1.]


In [10]:
df.drop(df.columns[832], axis=1, inplace=True)
images = np.array(df)
print(len(images[0]))

832
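
The preprocessing cells above could plausibly be condensed into a single function. The following is a rough sketch in plain NumPy rather than the code this notebook actually runs: `build_dataset`, its parameters, and the synthetic input arrays are all hypothetical, and the shuffle order will not match the one produced by `sklearn.utils.shuffle`.

```python
import numpy as np

def build_dataset(signal, background, n_keep=832, seed=7):
    """Flatten, trim, label, and shuffle signal/background images in one pass.

    signal, background: arrays of shape (n_events, 28, 32)
    n_keep: number of leading flattened columns to keep
    """
    # Flatten each image to one row and drop the trailing zero columns.
    sig = signal.reshape(len(signal), -1)[:, :n_keep]
    bkg = background.reshape(len(background), -1)[:, :n_keep]
    # Background rows get label 0, signal rows get label 1.
    images = np.concatenate([bkg, sig]).astype(np.float32)
    labels = np.concatenate([np.zeros(len(bkg)), np.ones(len(sig))])
    # Apply the same permutation to both so rows and labels stay aligned.
    order = np.random.RandomState(seed).permutation(len(images))
    return images[order], labels[order]

# Synthetic stand-ins for gTowerEt_signal and gTowerEt_background.
rng = np.random.RandomState(0)
fake_signal = rng.rand(5, 28, 32).astype(np.float32)
fake_background = rng.rand(3, 28, 32).astype(np.float32)

images, labels = build_dataset(fake_signal, fake_background)
print(images.shape, labels.sum())
```

Keeping the labels in a separate array avoids adding a label column to the data frame and dropping it again later.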


## First lines of Caffe 2 

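This is where the real heart of this program starts. First, we have to load our data in a form that Caffe 2 can interpret: each example is packed into a TensorProtos protocol buffer message that holds the feature vector and its label, and that message is then serialized to a compact string.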

In [11]:
feature_and_label = caffe2_pb2.TensorProtos()
feature_and_label.protos.extend([
    utils.NumpyArrayToCaffe2Tensor(images[0]),
    utils.NumpyArrayToCaffe2Tensor(labels[0])])
print('This is what the tensor proto looks like for a feature and its label:')
print(str(feature_and_label))
print('This is the compact string that gets written into the db:')
print(feature_and_label.SerializeToString())

This is what the tensor proto looks like for a feature and its label:
protos {
  dims: 832
  data_type: FLOAT
  float_data: 617.601318359
  float_data: 1025.45581055
  float_data: 0.0
  float_data: 1011.24029541
  float_data: 99.9588394165
  float_data: 333.287322998
  float_data: 1539.86938477
  float_data: 1168.76464844
  float_data: 738.572998047
  float_data: 372.952453613
  float_data: 593.821166992
  float_data: 1474.41308594
  float_data: 486.048217773
  float_data: 149.646026611
  float_data: 920.097412109
  float_data: 287.132598877
  float_data: 971.567626953
  float_data: 858.580322266
  float_data: 16.8235168457
  float_data: 208.776351929
  float_data: 1155.65429688
  float_data: 977.845275879
  float_data: 1117.21203613
  float_data: 206.268203735
  float_data: 791.629821777
  float_data: 169.105651855
  float_data: 1758.08557129
  float_data: 1295.36657715
  float_data: 139.9737854
  float_data: 2443.47216797
  float_data: 1165.4888916
  float_data: 2014.13549805
  float

In [12]:
train_features, test_features, train_labels, test_labels = train_test_split(images, labels, test_size=0.33, random_state=42)

## Database storage 

After splitting our data, we store it in databases so that we can set up a workspace for training: one database for the training set and one for the test set.

In [13]:
def write_db(db_type, db_name, features, labels):
    """
        Write each feature/label pair to the database as a serialized TensorProtos message
    """
    db = core.C.create_db(db_type, db_name, core.C.Mode.write)
    transaction = db.new_transaction()
    for i in range(features.shape[0]):
        feature_and_label = caffe2_pb2.TensorProtos()
        feature_and_label.protos.extend([
            utils.NumpyArrayToCaffe2Tensor(features[i]),
            utils.NumpyArrayToCaffe2Tensor(labels[i])])
        transaction.put(
            'train_{:03d}'.format(i),
            feature_and_label.SerializeToString())
    # Close the transaction, and then close the db.
    del transaction
    del db

write_db("minidb", "jet_data_train.minidb", train_features, train_labels)
write_db("minidb", "jet_data_test.minidb", test_features, test_labels)
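
Each record written by `write_db` is meant to get its own key built from the loop index. The following is a small standalone check of how such keys can be formatted in Python; `record_key` is a hypothetical helper, not part of Caffe 2. It shows that `str.format` only fills `{}` placeholders and leaves printf-style `%03d` untouched.

```python
def record_key(prefix, i, width=3):
    """Build a zero-padded record key such as 'train_007' (hypothetical helper)."""
    return "{}_{:0{}d}".format(prefix, i, width)

# str.format ignores printf-style placeholders, so this key never changes:
unformatted = "train_%03d".format(5)

# Either of these interpolates the index as intended:
with_format = "train_{:03d}".format(5)
with_percent = "train_%03d" % 5

keys = [record_key("train", i) for i in range(3)]
print(unformatted, with_format, with_percent, keys)
```

A helper like this would also make it easy to use a different prefix for the test database.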

In [14]:
net_proto = core.Net("example_reader")
dbreader = net_proto.CreateDB([], "dbreader", db="jet_data_train.minidb", db_type="minidb")
net_proto.TensorProtosDBInput([dbreader], ["X", "Y"], batch_size=16)

print("The net looks like this:")
print(str(net_proto.Proto()))

The net looks like this:
name: "example_reader"
op {
  output: "dbreader"
  name: ""
  type: "CreateDB"
  arg {
    name: "db_type"
    s: "minidb"
  }
  arg {
    name: "db"
    s: "jet_data_train.minidb"
  }
}
op {
  input: "dbreader"
  output: "X"
  output: "Y"
  name: ""
  type: "TensorProtosDBInput"
  arg {
    name: "batch_size"
    i: 16
  }
}
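
With `batch_size=16`, it can help to know how many batches one pass over the training database takes. The following is a back-of-the-envelope sketch: `batches_per_pass` is a hypothetical helper, and the training-set size of 1340 rows is an assumption based on holding out 33% of the 2000 shuffled rows. What the reader does once it reaches the end of the database is not covered here.

```python
def batches_per_pass(n_samples, batch_size):
    """Return (full_batches, leftover_rows) for one pass over n_samples rows."""
    return divmod(n_samples, batch_size)

# Assumed size: 2000 rows with 33% held out leaves about 1340 rows for training.
n_train = 1340
full, leftover = batches_per_pass(n_train, batch_size=16)
print(full, leftover)
```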



## Workspace 

Now we create the net in the workspace. Once it exists, each run of the net reads the next batch of 16 features and labels from the database into the blobs X and Y, which is the form in which training will consume the data.

In [15]:
workspace.CreateNet(net_proto)

True

In [16]:
# Let's run it to get batches of features.
workspace.RunNet(net_proto.Proto().name)
print("The first batch of feature is:")
print(workspace.FetchBlob("X"))
print("The first batch of label is:")
print(workspace.FetchBlob("Y"))

# Let's run again.
workspace.RunNet(net_proto.Proto().name)
print("The second batch of feature is:")
print(workspace.FetchBlob("X"))
print("The second batch of label is:")
print(workspace.FetchBlob("Y"))

The first batch of feature is:
[[ 306.33765    227.3192     409.3287    ...  -39.587357     0.
     3.6428757]
 [ 178.58865    542.9438     768.71155   ...  266.88086    138.71211
   356.798    ]
 [ 217.77202     14.171318   502.3112    ...   98.100914   125.365395
   290.37378  ]
 ...
 [ 702.2345    -121.18159    464.25055   ...  132.1763     605.97986
    62.991947 ]
 [ 818.6415     347.1042      70.546     ...  288.65002      0.
    47.663338 ]
 [1523.0967      43.926144   417.5659    ...  745.6074     743.99066
   523.2435   ]]
The first batch of label is:
[0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]
The second batch of feature is:
[[1208.56    -111.31558  365.5208  ... 1157.3169   281.565    385.49298]
 [ -98.58821  379.78326  704.10913 ...  255.14987  157.18744  163.7235 ]
 [ -89.07542 -109.96844  -70.74918 ...  484.91132  679.3028   284.90063]
 ...
 [ 319.17648  638.1909   395.49457 ...  181.8924   360.07996   81.0872 ]
 [ 593.931    446.83463  206.81105 ...   44.50211  -68