# Neural Networks
## MSDS 7333 - Section 401
## Case Study Week 12
[Data Science @ Southern Methodist University](https://datascience.smu.edu/)

### Due:
06 August 2018

### Table of Contents
* [Team Members](#Team-Members)
* [Abstract](#Abstract)
* [Introduction](#Introduction)
* [Methods](#Methods)
* [Results](#Results)
* [Conclusion](#Conclusion)
* [References](#References)

### <a name="Team-Members"></a>Team Members
* Kevin Cannon
* Austin Hancock

### <a name="Abstract"></a>Abstract

A subset of the Higgs Data Set is used to experiment and explore neural networks with the Keras API. Three different architectures, optimizers, and initializers are explored and scored to produce a final model score.

### <a name="Introduction"></a>Introduction

The goal of particle physics, or high-energy physics, is to understand the most fundamental constituents of matter and how these particles interact. By discovering the most elementary constituents of matter and energy, exploring the basic nature of space and time itself, and probing the interactions between them, scientists work to understand how the universe works at its most elementary level.

The primary tools of experimental high-energy physicists are particle accelerators, which collide protons and/or antiprotons to create particles that only occur at extremely high-energy densities. Advanced particle accelerators, particle detectors, and sophisticated computing techniques are essential for modern particle physics research. The advancement of dedicated technology for particle physics has benefited tremendously from progress in other areas of science. Because data is scarce and difficult to obtain, statistical and machine learning tools are invaluable for discovering new insights into the nature of matter.

In order to isolate highly dimensional particle collision data, the relative likelihood of a new particle is calculated. Since this likelihood is difficult to express analytically, collision data simulated using Monte Carlo methods are used as the basis for approximation. Machine learning classifiers, such as neural networks, provide a method to solve this computational problem.

The Higgs Data Set used in our neural network exploration comes from a paper by Pierre Baldi, Peter Sadowski, and Daniel Whiteson from the University of California, Irvine. Published in July 2014 in Nature Communications, the paper “Searching for exotic particles in high-energy physics with deep learning” shows that deep-learning classification methods can improve results over other approaches. From the abstract of the paper:
> Standard approaches have relied on ‘shallow’ machine-learning models that have a limited capacity to learn complex nonlinear functions of the inputs, and rely on a painstaking search through manually constructed nonlinear features. Progress on this problem has slowed, as a variety of techniques have shown equivalent performance. Recent advances in the field of deep learning make it possible to learn more complex functions and better discriminate between signal and background classes. Here, using benchmark data sets, we show that deep-learning methods need no manually constructed inputs and yet improve the classification metric by as much as 8% over the best current approaches. This demonstrates that deep-learning approaches can improve the power of collider searches for exotic particles.

We will use the data set to distinguish between a signal process which produces Higgs boson particles and a background process that does not produce the particles. The data was produced using Monte Carlo simulations, with 28 features and 11,000,000 rows.

### <a name="Methods"></a>Methods

For this analysis, we will use the Keras API. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. Keras does not itself handle low-level operations such as tensor products and convolutions. Instead, it relies on a specialized, well-optimized tensor manipulation library that serves as the "backend engine" of Keras. TensorFlow, which is the backend used for this lab, is an open-source symbolic tensor manipulation framework developed by Google.

Keras was developed with a focus on enabling fast experimentation. The core data structure of Keras is a model, a way to organize layers. The simplest type of model is the Sequential model, a linear stack of layers.
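To make the idea of a "linear stack of layers" concrete, the following is a minimal pure-Python sketch, not Keras code. The helper names (`dense`, `sigmoid`, `run_sequential`) and the hand-picked toy weights are assumptions made for illustration only; the sketch simply passes an input through each layer in order.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus a bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    """Element-wise logistic function, squashing each value into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def run_sequential(layers, inputs):
    """Apply each layer in order, feeding its output to the next one."""
    out = inputs
    for layer in layers:
        out = layer(out)
    return out

# Toy weights chosen by hand (2 inputs -> 2 hidden units -> 1 output).
hidden_w, hidden_b = [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]
output_w, output_b = [[1.0, -1.0]], [0.0]

toy_model = [
    lambda v: dense(v, hidden_w, hidden_b),
    sigmoid,
    lambda v: dense(v, output_w, output_b),
    sigmoid,
]

prediction = run_sequential(toy_model, [1.0, 1.0])
print(prediction)
```

The Keras models in this notebook follow the same pattern, with `Dense`, `Activation`, and `Dropout` layers added one after another.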

#### Initialize Data

In [1]:
# Original model
# Changed N from 10,500,000 to 1,050,000
# 3 layers. Neurons: 50, 50, 1

import pandas as pd
import numpy as np

N = 1050000  # Number of training rows; change this line to adjust it.
data = pd.read_csv("HIGGS.csv", nrows=N, header=None)
# Hold out the 50,000 rows that follow the training rows as the test set.
test_data = pd.read_csv("HIGGS.csv", nrows=50000, header=None, skiprows=N)

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam
from sklearn.metrics import roc_auc_score

y = np.array(data.loc[:,0])
x = np.array(data.loc[:,1:])
x_test = np.array(test_data.loc[:,1:])
y_test = np.array(test_data.loc[:,0])

#Begin 

model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))
model.summary()

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 50)                1450      
_________________________________________________________________
activation_1 (Activation)    (None, 50)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 50)                2550      
_________________________________________________________________
activation_2 (Activation)    (None, 50)                0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 51        
__________

0.6739729324561039
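The `Param #` column in the model summary can be checked by hand: a dense layer has one weight per (input, unit) pair plus one bias per unit. The sketch below reproduces the counts for the 28-feature input and the 50, 50, 1 layer sizes used above; the function names are illustrative, not part of Keras.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

def stack_param_counts(n_features, layer_sizes):
    """Parameter count of each dense layer in a stack, in order."""
    counts = []
    n_inputs = n_features
    for n_units in layer_sizes:
        counts.append(dense_param_count(n_inputs, n_units))
        n_inputs = n_units  # this layer's output feeds the next layer
    return counts

baseline_counts = stack_param_counts(28, [50, 50, 1])
print(baseline_counts, sum(baseline_counts))
```

Activation and dropout layers add no parameters, which is why they show `0` in the summary.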

#### Work # 1 & 2: Architecture and Activation Function Exploration
In this section, we select three different architectures and train each one to obtain its ROC score. The sigmoid activation function is used for all base models.

Then, for each architecture, we switch from sigmoid to different activation functions.

Ten basic activation functions in the Keras API can be used in the models:
* softmax
* elu
* selu
* softplus
* softsign
* relu
* tanh
* sigmoid
* hard_sigmoid
* linear

##### Architecture #1: Three layer model with sigmoid activation functions
Our starting point is a three layer model with 100, 100, and 1 neurons in the respective layers. From here, we can use the ROC score to see the effect of adding or removing layers.

In [2]:
# 3 layers. Neurons: 100, 100, 1 

#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(100, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))
model.summary()

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 100)               2900      
_________________________________________________________________
activation_4 (Activation)    (None, 100)               0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 100)               10100     
_________________________________________________________________
activation_5 (Activation)    (None, 100)               0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 101       
__________

0.68048853933017528

The three layer model (100, 100, 1 neurons) with sigmoid activation functions produced an ROC score of 0.680.

##### Architecture #1: Three layer model with softmax activation functions

In [58]:
# Change activation to "softmax"
#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('softmax'))
model.add(Dropout(0.10))
model.add(Dense(100, kernel_initializer='uniform'))
model.add(Activation('softmax'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('softmax'))
model.summary()

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_162 (Dense)            (None, 100)               2900      
_________________________________________________________________
activation_161 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_107 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_163 (Dense)            (None, 100)               10100     
_________________________________________________________________
activation_162 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_108 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_164 (Dense)            (None, 1)                 101       
__________

0.5

The three layer model (100, 100, 1 neurons) with softmax activation functions produced an ROC score of 0.500.
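A score of exactly 0.5 is what one would expect here: softmax normalizes across the units of a layer, so on a single-unit output layer it returns 1.0 for every row, and a constant prediction cannot rank signal above background. The pure-Python sketch below illustrates both points; `softmax` and `roc_auc` are small illustrative helpers, not the Keras or scikit-learn implementations.

```python
import math

def softmax(logits):
    """Normalize exponentiated logits so the outputs sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Softmax over a single unit ignores the logit entirely.
single_unit_outputs = [softmax([logit])[0] for logit in (-3.0, 0.2, 5.0)]

# Constant scores tie on every pair, so the AUC is chance level.
constant_auc = roc_auc([1, 0, 1, 0], [1.0, 1.0, 1.0, 1.0])
print(single_unit_outputs, constant_auc)
```

For a binary target with one output neuron, sigmoid is the usual choice for the final activation; softmax would need one output unit per class.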

##### Architecture #2: Four layer model with sigmoid activation functions
Now, we add a fourth layer to see how it impacts performance.

In [28]:
# 4 layers. Neurons: 50, 50, 1, 1


#Begin 

model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))
model.summary()

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_81 (Dense)             (None, 50)                1450      
_________________________________________________________________
activation_80 (Activation)   (None, 50)                0         
_________________________________________________________________
dropout_54 (Dropout)         (None, 50)                0         
_________________________________________________________________
dense_82 (Dense)             (None, 50)                2550      
_________________________________________________________________
activation_81 (Activation)   (None, 50)                0         
_________________________________________________________________
dropout_55 (Dropout)         (None, 50)                0         
_________________________________________________________________
dense_83 (Dense)             (None, 1)                 51        
__________

0.50537715464564459

The four layer model (50, 50, 1, 1 neurons) with sigmoid activation functions produced an ROC score of 0.505.

##### Architecture #2: Four layer model with softplus activation functions

In [59]:
# Change activation to "softplus"
#Begin 

model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('softplus'))
model.add(Dropout(0.10))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('softplus'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('softplus'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('softplus'))
model.summary()

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_165 (Dense)            (None, 50)                1450      
_________________________________________________________________
activation_164 (Activation)  (None, 50)                0         
_________________________________________________________________
dropout_109 (Dropout)        (None, 50)                0         
_________________________________________________________________
dense_166 (Dense)            (None, 50)                2550      
_________________________________________________________________
activation_165 (Activation)  (None, 50)                0         
_________________________________________________________________
dropout_110 (Dropout)        (None, 50)                0         
_________________________________________________________________
dense_167 (Dense)            (None, 1)                 51        
__________

0.51575275231043649

The four layer model (50, 50, 1, 1 neurons) with softplus activation functions produced an ROC score of 0.516.

##### Architecture #3: Two layer model with sigmoid activation functions

In [54]:
# 2 layers. Neurons: 100, 1

#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.summary()


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_152 (Dense)            (None, 100)               2900      
_________________________________________________________________
activation_151 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_101 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_153 (Dense)            (None, 1)                 101       
_________________________________________________________________
activation_152 (Activation)  (None, 1)                 0         
Total params: 3,001
Trainable params: 3,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.72343445943990137

The two layer model (100, 1 neurons) with sigmoid activation functions produced an ROC score of 0.723.

##### Architecture #3: Two layer model with softplus and relu activation functions

In [61]:
# Change activation to "softplus" and "relu"
#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('softplus'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.summary()


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_171 (Dense)            (None, 100)               2900      
_________________________________________________________________
activation_170 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_113 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_172 (Dense)            (None, 1)                 101       
_________________________________________________________________
activation_171 (Activation)  (None, 1)                 0         
Total params: 3,001
Trainable params: 3,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.5

The two layer model (100, 1 neurons) with softplus and relu activation functions produced an ROC score of 0.500.

From this initial analysis, the two layer sigmoid model performed the best with an ROC of 0.723.

As the number of layers increased, the ROC scores of the models decreased. In the first and third architectures, changing the activation function to something other than sigmoid decreased the ROC score of the model. However, the softplus activation function did slightly improve the ROC score for the four layer model.

Not only did the two layer model have the best ROC score, it also trained the fastest. It took about 6-7 seconds per epoch, whereas the lower scoring three- and four-layer models took between 7 and 20 seconds per epoch.

#### Work # 3: Batch Size Variation
In this section, we will vary the batch size of our best model, the two layer model with sigmoid activation functions, and look at the results. For the original model, a batch size of 1,000 was used. Batch sizes of 10 and 100,000 will be tested.

In [67]:
# Arch 3
# Batch sizes
## 1,000 = 0.7234    (original)
## 100,000 = 0.5938
## 10 = 0.7832       (takes a long time to run)

#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.summary()


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=10)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_185 (Dense)            (None, 100)               2900      
_________________________________________________________________
activation_184 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_121 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_186 (Dense)            (None, 1)                 101       
_________________________________________________________________
activation_185 (Activation)  (None, 1)                 0         
Total params: 3,001
Trainable params: 3,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.78324926275445317

The smallest batch size, 10, produced the best ROC score for the model, 0.7832, a noticeable improvement over the original score of 0.7234. However, the smaller batch size drastically increased training time, to around 180 seconds per epoch. Decreasing the batch size further may improve the model, but at the cost of more time and computing resources.
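One way to see why batch size affects both the score and the run time is to count weight updates: the optimizer takes one step per batch, so smaller batches mean many more steps per epoch. The sketch below does this arithmetic for the 1,050,000 training rows used in this notebook; the function name is illustrative.

```python
import math

def updates_per_epoch(n_rows, batch_size):
    """Number of gradient steps in one pass over the data (last batch may be partial)."""
    return math.ceil(n_rows / batch_size)

n_rows = 1050000
step_counts = {size: updates_per_epoch(n_rows, size) for size in (10, 1000, 100000)}
for size, steps in step_counts.items():
    print(f"batch_size={size:>6}: {steps:>6} updates per epoch")
```

With a batch size of 100,000 the model gets only about a dozen updates per epoch, which is consistent with it having learned the least in five epochs.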

#### Work # 4: Different Kernel Initializers
Using our best model so far, the two layer model with sigmoid activation functions, and returning to the more reasonable batch size of 1,000, we will try several kernel initializers and examine the impact on ROC scores. Initializers define how the initial random weights of Keras layers are set.

Sixteen basic initializers exist in the Keras API that can be used in the models:
* Initializer
* Zeros
* Ones
* Constant
* RandomNormal
* RandomUniform
* TruncatedNormal
* VarianceScaling
* Orthogonal
* Identity
* lecun_uniform
* glorot_normal
* glorot_uniform
* he_normal
* lecun_normal
* he_uniform

In [78]:
# Arch 3
# Initializers
## uniform = 0.7234 (original)
## lecun_uniform = 0.7385
## normal = 0.7301
## identity = (can only be used for 2D square matrices)
## orthogonal = 0.7394
## zero = 0.7200
## one = 0.5
## glorot_normal = 0.7412
## glorot_uniform = 0.7342
## he_normal = 0.7405
## he_uniform = 0.7415 (top score)

#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='he_uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='he_uniform'))
model.add(Activation('sigmoid'))
model.summary()


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_206 (Dense)            (None, 100)               2900      
_________________________________________________________________
activation_204 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_131 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_207 (Dense)            (None, 1)                 101       
_________________________________________________________________
activation_205 (Activation)  (None, 1)                 0         
Total params: 3,001
Trainable params: 3,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.74145987549073533

After testing all of the initializers, the ROC scores for the model are as follows:
- uniform = 0.7234 (original)
- lecun_uniform = 0.7385
- normal = 0.7301
- identity = N/A (can only be used for 2D square matrices)
- orthogonal = 0.7394
- zero = 0.7200
- one = 0.500
- glorot_normal = 0.7412
- glorot_uniform = 0.7342
- he_normal = 0.7405
- he_uniform = 0.7415 *best score

From the scores, several of the initializers produced ROC scores that were similar. However, the he_uniform initializer produced a slightly higher score of 0.7415, edging out glorot_normal and he_normal.
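The uniform-family initializers differ mainly in how wide a range they sample from. The sketch below computes that range for the first layer (28 inputs, 100 units) using the formulas given in the Keras documentation as we understand them: `he_uniform` uses sqrt(6 / fan_in), `glorot_uniform` uses sqrt(6 / (fan_in + fan_out)), `lecun_uniform` uses sqrt(3 / fan_in), and the plain `uniform` default is assumed to be [-0.05, 0.05]. The helper names are illustrative and this is not how Keras itself generates the weights.

```python
import math
import random

def uniform_limit(name, fan_in, fan_out):
    """Half-width of the uniform sampling range for a few initializers."""
    if name == "uniform":
        return 0.05  # assumed Keras 2 default for RandomUniform
    if name == "lecun_uniform":
        return math.sqrt(3.0 / fan_in)
    if name == "glorot_uniform":
        return math.sqrt(6.0 / (fan_in + fan_out))
    if name == "he_uniform":
        return math.sqrt(6.0 / fan_in)
    raise ValueError("unknown initializer: " + name)

def sample_weights(name, fan_in, fan_out, seed=0):
    """Draw a fan_in x fan_out weight matrix from the initializer's range."""
    limit = uniform_limit(name, fan_in, fan_out)
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

names = ("uniform", "lecun_uniform", "glorot_uniform", "he_uniform")
limits = {name: uniform_limit(name, 28, 100) for name in names}
first_layer = sample_weights("he_uniform", 28, 100)

for name in names:
    print(f"{name:>15}: weights in [-{limits[name]:.4f}, {limits[name]:.4f}]")
```

Under these assumptions, `he_uniform` starts the first layer with weights drawn from a range roughly nine times wider than the default `uniform`, which may be one reason the scaled initializers scored slightly higher after only five epochs.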

#### Work # 5: Different Optimizers
Similar to the previous section, using our best model so far, the two layer model with sigmoid activation functions and a batch size of 1,000, we will now try several optimizers and examine the impact on ROC scores.

Eight basic optimizers exist in the Keras API that can be used in the models, but only seven can be tested:
* SGD
* RMSprop
* Adagrad
* Adadelta
* Adam
* Adamax
* Nadam
* TFOptimizer (wrapper)

In [109]:
# Arch 3
# Optimizers
## SGD = 0.7234 (original)
## SGD = 0.7217 (change decay=0.0)
## SGD = 0.6663 (Change momentum=0.2)
## SGD = 0.7235 (Change nesterov=False)
### RMSprop: recommends leaving params at their default, except for learning rate
## RMSprop = 0.7038 (lr=0.001, rho=0.9, epsilon=None, decay=0.0)
## RMSprop = 0.7050 (lr=0.1, rho=0.9, epsilon=None, decay=0.0)
## RMSprop = 0.6280 (lr=0.1, rho=0.2, epsilon=None, decay=0.0)
## RMSprop = 0.6721 (lr=0.1, rho=0.9, epsilon=0.5, decay=0.0)
## RMSprop = 0.6389 (lr=0.1, rho=0.9, epsilon=None, decay=1e-6)
### Adagrad: recommends leaving params at their default
## Adagrad = 0.6757 (lr=0.01, epsilon=None, decay=0.0)
## Adagrad = 0.7463 (lr=0.1, epsilon=None, decay=0.0) 
### Adadelta: recommends leaving params at their default
## Adadelta = 0.6883 (lr=1.0, rho=0.95, epsilon=None, decay=0.0)
### Adam
## Adam = 0.7085 (lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
## Adam = 0.7614 (lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False) 
## Adam = 0.7497 (lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
### Adamax
## Adamax = 0.6928 (lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
## Adamax = 0.7671 (lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0) ******
### Nadam: recommends leaving params at their default
## Nadam = 0.7410 (lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
## Nadam = 0.7561 (lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)


#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.summary()

#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#rms = RMSprop(lr=0.1, rho=0.9, epsilon=None, decay=0.0)
#adagrad = Adagrad(lr=0.1, epsilon=None, decay=0.0)
#adadelta = Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0)
#adam = Adam(lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
#adamax = Adamax(lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
nadam = Nadam(lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=nadam)

model.fit(x, y, epochs=5, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_256 (Dense)            (None, 100)               2900      
_________________________________________________________________
activation_254 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_156 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_257 (Dense)            (None, 1)                 101       
_________________________________________________________________
activation_255 (Activation)  (None, 1)                 0         
Total params: 3,001
Trainable params: 3,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.5

After testing all the optimizers, the ROC score ranges are as follows:

* SGD = [0.6663, 0.7235]
* RMSprop = [0.6280, 0.7050]
* Adagrad = [0.6757, 0.7463]
* Adadelta = 0.6883 (single run)
* Adam = [0.7085, 0.7614] 
* Adamax = [0.6928, 0.7671] *best score
* Nadam = [0.7410, 0.7561]

From the scores, several of the optimizers produced ROC scores that were similar. However, the Adamax optimizer produced a slightly higher score of 0.7671, edging out Adam and Nadam.
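For readers unfamiliar with Adamax, the sketch below shows its update rule as described by Kingma and Ba, applied to a one-dimensional toy problem rather than to our network. It keeps a running average of the gradient (`m`) and an exponentially decayed running maximum of the gradient magnitude (`u`), and scales each step by their ratio. The function name, the toy objective, and the small `eps` term are assumptions for illustration; this is not the Keras implementation.

```python
def adamax_minimize(grad_fn, start, lr=0.1, beta_1=0.9, beta_2=0.999,
                    steps=500, eps=1e-8):
    """Minimize a 1-D function with the Adamax update rule, given its gradient."""
    x = start
    m = 0.0  # running average of the gradient
    u = 0.0  # decayed running maximum of |gradient|
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta_1 * m + (1.0 - beta_1) * g
        u = max(beta_2 * u, abs(g))
        # Bias-corrected step; eps only guards against division by zero.
        x -= (lr / (1.0 - beta_1 ** t)) * m / (u + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = adamax_minimize(lambda x: 2.0 * (x - 3.0), start=0.0)
print(minimum)
```

Because each step is divided by the largest recent gradient magnitude, the step size stays bounded by roughly `lr`, which may help explain why Adamax tolerated the relatively large learning rate of 0.1 in our runs.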

#### Work # 6: Produce the best ROC score

Using the information and insight we have gained from the previous parameter experiments, we will attempt to produce a high ROC score. Batch size and number of epochs will be manipulated to produce the result.

In [114]:
# Arch 3
## Adamax = 0.7671 (original)
## Adamax, change epoch 7 = 0.7725 
## Adamax, change batch size 10 = 0.7492
## Adamax, change epoch 10 = 0.7881 
## Adamax, change epoch 20 = 0.7969 ******
## Adam, change epoch 7 = 0.7690

#Begin 

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], kernel_initializer='uniform'))  # x.shape[1] == 28 features
model.add(Activation('sigmoid'))
model.add(Dropout(0.10))
model.add(Dense(1, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.summary()


#adam = Adam(lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
adamax = Adamax(lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=adamax)

model.fit(x, y, epochs=20, batch_size=1000)
roc_auc_score(y_test,model.predict(x_test))
#end

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_266 (Dense)            (None, 100)               2900      
_________________________________________________________________
activation_264 (Activation)  (None, 100)               0         
_________________________________________________________________
dropout_161 (Dropout)        (None, 100)               0         
_________________________________________________________________
dense_267 (Dense)            (None, 1)                 101       
_________________________________________________________________
activation_265 (Activation)  (None, 1)                 0         
Total params: 3,001
Trainable params: 3,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch

0.79691400584015892

Starting from our previous best model, the two layer model using sigmoid activation functions with a batch size of 1,000, we tried both increasing the number of epochs and decreasing the batch size to improve the ROC score. Ultimately, we found that increasing the epochs was more impactful than decreasing the batch size.

With a starting ROC of 0.7671 from the two layer model using the Adamax optimizer, we increased the epochs to 20 and left the batch size at 1,000 to produce our highest ROC score of 0.7969.

### <a name="Results"></a>Results

#### Questions
Based on the exercises performed in the Work sections above, we will answer questions about neural networks based on our results.

##### Q1: What was the effect of adding more layers/neurons?
As the number of layers increased, the ROC scores of our models decreased. Each added layer gives the network more capacity, but in our runs the deeper models scored lower and trained more slowly. Overfitting is one possible cause. Another is that our four-layer design passes everything through a single-neuron layer before the output, and every model in this comparison was trained for only five epochs.

##### Q2: Which parameters gave you the best result and why (in your opinion) did they work?
A two-layer model (100, 1 neurons) with sigmoid activation functions, Adamax optimizer, 20 epochs and a batch size of 1,000 produced the best results for our model. In terms of the biggest ROC score improvement, selecting two layers for the model was the most impactful parameter selection. However, once the model was fine-tuned in terms of a majority of the parameters, increasing the number of epochs had the biggest impact in the late stage improvement of the ROC score for the model.

##### Q3: For work item 6, how did you decide that your model was 'done'?
To decide when our final model was 'done', we looked at two main factors: the loss and accuracy reported at each epoch, and the ROC score. (Run time would be a third factor, but with our two layer architecture and batch size it was not an issue for this model.) From prior discussion, we were aiming for an ROC score close to 0.80. For the loss and accuracy, we wanted to see loss continue to decrease and accuracy continue to increase from epoch to epoch. Both were still moving in the right direction when we stopped at 20 epochs, but the returns were diminishing. Given these narrow gains, and the fact that our ROC score was close to our target (with only marginal increases at further epochs), we concluded that the model had achieved its desired goal.
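The "diminishing returns" judgment described above was made by eye, but it can also be written down as a rule. The sketch below is one possible version: stop once the loss has failed to improve by more than `min_delta` for `patience` consecutive epochs. The function name and the loss values are made up for illustration and are not taken from our training logs; Keras offers a similar idea through its `EarlyStopping` callback.

```python
def first_plateau_epoch(losses, min_delta=0.001, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never plateaus."""
    best = float("inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Hypothetical per-epoch losses: large gains at first, then tiny ones.
illustrative_losses = [0.60, 0.55, 0.52, 0.505, 0.5045, 0.5042, 0.5041]
stop_epoch = first_plateau_epoch(illustrative_losses)
print(stop_epoch)
```

A rule like this makes the stopping decision repeatable, though the thresholds themselves remain a judgment call.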

### <a name="Conclusion"></a>Conclusion

For this lab, we were able to explore the Keras neural network framework and analyze the Higgs data set. By optimizing parameters within our computing resources, we achieved a model with an ROC score of 0.797. While the benchmark for the lab was to produce a model with an ROC score of 0.88, that score was not a realistic threshold for our group. By achieving more than 90% of that ROC score, we consider our model a success.

Future work for this analysis would be to further explore the parameters available in the Keras API. There are so many parameters that can be tuned to create unique models, including different Keras layer types. Additionally, with more computing resources, we could reduce the batch size, increase the epochs, and allow the model to churn longer to see if we could squeeze more performance out of the ROC score.

### <a name="References"></a>References

[1] Baldi, P., P. Sadowski, and D. Whiteson. “Searching for Exotic Particles in High-energy Physics with Deep Learning.” Nature Communications 5 (July 2, 2014).

[2] Chollet, Francois, and others. "Keras." 2015. https://keras.io