# <font color='red'>Project DIABETES - with Keras (part 1)</font>

In this project tutorial you will discover **how to use Keras to develop and evaluate a NN model for a single-class classification problem**. 

Goals:

* How to load a CSV dataset ready for use with Keras
* How to define and compile a multilayer perceptron model for a single-class classification problem in Keras
* How to evaluate a Keras NN model on a validation dataset (NOTE: *using Keras only*)

# A premise: the steps

In this exercise, there is not a lot of code required, so we are going to step through it slowly so that you will know how to create your own models in the future. The steps we are going to cover here are as follows:

1. Load Data
2. Define Model
3. Compile Model
4. Fit Model
5. Evaluate Model
6. Tie It All Together


# Description of the input data

The Pima Indians Onset of Diabetes dataset is a standard ML dataset, widely used worldwide as a benchmark.


### Dataset availability:

Outside Colab, you can download it directly from a repository of mine at this URL:
   * [here](https://drive.google.com/open?id=12pjLYLeuZ__4SVPuz6zL9QQpPiwbVYDKz_rsn4eeHzI)
   
More info:
   * [Genetic Studies of the Etiology of Type 2 Diabetes in Pima Indians](http://diabetes.diabetesjournals.org/content/53/5/1181)


### Dataset description:


* It describes patient medical record data for Pima Indians and whether they had an onset of diabetes within five years
* It is a single-class, binary classification problem (onset of diabetes as 1 or not as 0)
* All attributes are numerical, which makes the data easy to use directly with NNs that expect numerical inputs and numerical output values, so this is ideal for a first NN implementation in Keras (but be careful about any data manipulation that may be needed).
* The input variables that describe each patient are numerical and have varying scales. Each entry has values for 8 different attributes, defined as follows:
   1. <font color='blue'>Number of times pregnant</font>
   2. <font color='blue'>Plasma glucose concentration at 2 hours in an oral glucose tolerance test</font>
   3. <font color='blue'>Diastolic blood pressure (mm Hg)</font>
   4. <font color='blue'>Triceps skin fold thickness (mm)</font>
   5. <font color='blue'>2-Hour serum insulin (mu U/ml)</font>
   6. <font color='blue'>Body mass index (BMI)</font>
   7. <font color='blue'>Diabetes pedigree function</font>
   8. <font color='blue'>Age (years)</font>
* The output variable is binary:    
   * <font color='blue'>the "class", onset of diabetes within five years, yes or no, 1 or 0</font>

### What the dataset looks like:

    6,148,72,35,0,33.6,0.627,50,1
    1,85,66,29,0,26.6,0.351,31,0
    8,183,64,0,0,23.3,0.672,32,1
    1,89,66,23,94,28.1,0.167,21,0
    0,137,40,35,168,43.1,2.288,33,1
    (...)


### Additional input from best practitioners:

The baseline accuracy if all predictions are made as no onset of diabetes is 65.1%. Top results on the dataset are around 77.7% accuracy (note: using 10-fold cross-validation). Use this as a target to aim for when developing your model(s).
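To make the baseline figure concrete, here is a minimal sketch of how a "always predict the majority class" accuracy can be computed. The helper name `majority_baseline_accuracy` is hypothetical, and the class counts (500 records without onset, 268 with onset) are an assumption based on the commonly reported composition of this dataset, not something read from the file here.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    labels = np.asarray(labels).astype(int)
    # counts[0] = number of class-0 records, counts[1] = number of class-1 records
    counts = np.bincount(labels, minlength=2)
    return counts.max() / labels.size

# Assumed class counts for the full dataset: 500 without onset, 268 with onset.
labels = np.array([0] * 500 + [1] * 268)
baseline = majority_baseline_accuracy(labels)
print("baseline accuracy: %.1f%%" % (baseline * 100))
```

Any model worth keeping should score above this number on data it has not seen during training.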

# Start-up

## CHECK Import Classes and Functions

Start by importing all classes and functions you will need:
* all the functionality we require from **Keras**
* data loading functionalities from **numpy**

In [None]:
import numpy
# pandas
# --- not used in this example
# keras
from keras.models import Sequential
from keras.layers import Dense
#from keras.wrappers.scikit_learn import KerasClassifier
#from keras.utils import np_utils
# sklearn
#from sklearn.model_selection import cross_val_score
#from sklearn.model_selection import KFold
#from sklearn.preprocessing import LabelEncoder
#from sklearn.pipeline import Pipeline

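## Initialize the random number generator

Fixing the seed is important to make the results we achieve from this model repeatable, i.e. so that the stochastic process of training a NN model can be reproduced. Note that this seeds numpy only: depending on the backend, you may also need to seed the backend's own generator to get fully identical runs.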

In [None]:
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
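As a quick illustration of what seeding buys us, the sketch below draws random numbers twice from numpy after resetting the same seed. It only demonstrates numpy's own generator; it says nothing about any other source of randomness in the training stack.

```python
import numpy as np

np.random.seed(7)
first_draw = np.random.rand(3)

np.random.seed(7)               # reset to the same seed
second_draw = np.random.rand(3)

# The two draws are identical because the generator restarted from the same state.
print(np.array_equal(first_draw, second_draw))
```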

## Load The Dataset

Download the dataset from the link above and place it in your current working directory, with filename

    pima-indians-diabetes.data.csv

Then, you can quickly inspect it from this same ipynb.

In [None]:
!ls -trl pima-indians-diabetes.data.csv

In [None]:
!head -5 pima-indians-diabetes.data.csv

Now, load the dataset. Easiest in this case is to load the file directly using the **numpy** function *loadtxt()*. There are 8 input attributes and 1 output (last column). While you do so, split the attributes into input variables (the matrix of **X - features**) and output variables (the array **Y - label**).

In [None]:
# load DIABETES dataset
dataset = numpy.loadtxt("pima-indians-diabetes.data.csv", delimiter=",")
X = dataset[:,0:8]   # columns from 1st to 8th into X
Y = dataset[:,8]     # column 9th into Y

Verify what you did.

In [None]:
len(X)

In [None]:
len(Y)

In [None]:
X

In [None]:
Y
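If you do not have the file at hand, the load-and-split step can be tried on the five sample rows shown earlier. This is only a sketch: the rows are fed to *loadtxt()* through an in-memory buffer, and the names `sample_X` and `sample_Y` are hypothetical, chosen so that they do not overwrite the real **X** and **Y**.

```python
import io
import numpy as np

# The five sample rows from the dataset preview, as an in-memory CSV.
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)

sample = np.loadtxt(sample_csv, delimiter=",")
sample_X = sample[:, 0:8]   # first 8 columns: features
sample_Y = sample[:, 8]     # 9th column: label

print(sample_X.shape, sample_Y.shape)
```

Checking the *shape* of both arrays is a slightly more informative sanity check than *len()*, since it also confirms the number of feature columns.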

# Define the model in Keras

Recap of key concepts and consequent implementation steps in this example:
   * Models in Keras are defined as a **sequence of layers**
   * We create a **Sequential** model and **add layers** one at a time until we are happy with the NN topology.
   * The first thing to get right is to ensure the **input layer** has the right number of inputs.
      * here we have 8 features, so pass 8 as nb dimensions in the input layer
   * How many **hidden layers**? Which type?
      * This is admittedly a very hard question. There are heuristics that we can use and often the best network structure is found through a process of trial and error experimentation. Generally, you need a network large enough to capture the structure (a.k.a. complexity) of the problem if that helps at all.
      * In this example: we will use a Fully-Connected (FC) NN structure with 2 hidden layers
   * Implementation: FC layers are defined using the Keras **Dense** class. We can:
      * use it for as many layers as we want (here: 3 Dense layers, i.e. 2 hidden and 1 output; the input is not a separate layer but is declared on the first Dense layer - read below)
      * for each, specify the number of neurons, using the first argument
      * specify the activation function, using the *activation* argument (here: rectifier (**relu**) activation function on the first two layers and the **sigmoid** activation function in the output layer)
         * *NOTE: It used to be the case that sigmoid and tanh activation functions were preferred for all layers. These days, better performance is seen using the rectifier activation function. We use a sigmoid activation function on the output layer to ensure our network output is between 0 and 1 and easy to map to either a probability of class 1 or snap to a hard classification of either class with a default threshold of 0.5*.
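To see what the two activation functions do to a value, here is a small numpy sketch. The functions `relu` and `sigmoid` below are hand-written stand-ins for illustration, not the Keras implementations, and the 0.5 threshold is the default mentioned in the note above.

```python
import numpy as np

def relu(z):
    """Rectifier: negative values become 0, positive values pass through."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))
print(sigmoid(z))

# Snap the sigmoid output to a hard class with a threshold of 0.5.
predicted_class = (sigmoid(z) > 0.5).astype(int)
print(predicted_class)
```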
         
We can piece it all together by defining layers and adding them. We build a NN with:
   * a first hidden layer with 12 neurons that expects 8 input variables (i.e. *input_dim=8*)
   * a second hidden layer with 8 neurons
   * the output layer with 1 neuron to predict the class (onset of diabetes or not)

In [None]:
# define model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Check:

In [None]:
model

The figure below provides a (manually done) depiction of the network structure defined above.

<img src="NNstructure.png" style="width:200px"/>

The model is defined. 
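A useful sanity check on the topology is to count its trainable parameters by hand. The sketch below assumes the 8-12-8-1 structure defined above, with each fully connected layer holding one weight per input-unit pair plus one bias per unit; `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [8, 12, 8, 1]   # inputs, hidden 1, hidden 2, output
per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)
print(per_layer, total)   # [108, 104, 9] 221
```

If you inspect the model with Keras itself, the per-layer parameter counts it reports should match these numbers.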

# Compile the model in Keras

Now that the model is defined, we need to actually create it, "compiling" it to be ready for efficient computation. 

By "compiling" the model we mean that Keras hands it over to efficient numerical libraries under the covers (the so-called "backend"), such as Theano or **TensorFlow** (in our case, the latter). In the "compilation phase", the backend automatically chooses the best way to represent the network for training and making predictions on your hardware. The outcome of this step is a model we can finally train on our dataset.

NOTE to avoid confusion: this is *not* the training phase (yet - that is the next one). 

When compiling, we must specify some additional properties required when training the network. We must specify:
   * the **loss function** to use to evaluate the best set of weights
   * the **optimizer** used to search through different weights for the network
   * any **metrics** we would like to collect and report during training

In this case we will use:
   * loss function -> logarithmic loss (which for a binary classification problem is defined in Keras as binary crossentropy)
   * optimizer -> the adam gradient descent algorithm (an efficient default choice)
      * More in this paper: [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
   * choice of metric -> accuracy (because it is a classification problem, we will collect and report the classification accuracy as the metric)
   
NOTE: This is also the stage at which the backend decides whether the work may happen on your CPU only, or on GPUs too.

In [None]:
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The model is compiled.
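To get a feeling for the loss we just selected, here is a minimal numpy sketch of the logarithmic loss for binary labels. It is a hand-written illustration, not the Keras implementation; the function name and the clipping constant `eps` are assumptions introduced only to avoid taking the logarithm of 0.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean logarithmic loss between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Keep probabilities strictly inside (0, 1) so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

y_true = [1, 0, 1, 0]
confident = binary_crossentropy(y_true, [0.9, 0.1, 0.8, 0.2])
unsure = binary_crossentropy(y_true, [0.6, 0.4, 0.5, 0.5])
print(confident, unsure)
```

Predictions that are both correct and confident give a lower loss than hesitant ones, which is what the optimizer exploits when it searches for better weights.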

# Fit the model in Keras

Now that the model is defined and compiled, we are ready to run it on some data for the first time, i.e. to **fit the model**, a.k.a. to perform the **training** phase on our dataset. Training means finding the best set of weights, so that the model can ultimately be used to make predictions for our problem. 

Concretely, we now train (fit) our model on the loaded dataset by calling the *fit()* function on the compiled model, providing a few arguments: 
   * **nb epochs**, i.e. the training process will run for a fixed number of iterations through the dataset (set by using the *epochs* argument)
   * **batch size**, i.e. the nb instances to be evaluated before a weight update in the network is performed (set by using the *batch_size* argument).
   
In this case, we will run:
   * for a small number of epochs (150)
   * using a relatively small batch size (10)
   
NOTE: again, these can be chosen experimentally by trial and error.

NOTE: This is the time-consuming part, where the computation actually happens on your one or more CPUs or GPUs.
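The two arguments together determine how many weight updates the training performs. The back-of-the-envelope sketch below assumes the dataset has 768 rows (the commonly reported size of this dataset; check it against your own *len(X)*) and that the last, smaller batch of each epoch is also used for an update.

```python
import math

n_samples = 768    # assumed number of rows in the dataset
batch_size = 10
epochs = 150

# One weight update per batch; the last batch of an epoch may be smaller.
updates_per_epoch = math.ceil(n_samples / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)   # 77 11550
```

A smaller batch size means more (and noisier) updates per epoch, a larger one means fewer and smoother updates.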

In [None]:
# fit the model
model.fit(X, Y, epochs=150, batch_size=10)

Running the cell above, you should see a message for each of the epochs, printing the loss (roughly going down) and accuracy (roughly going up) for each epoch. It takes about 15 seconds to execute on my workstation, a MacBook Air running on CPU with a TensorFlow backend.

# Evaluate the model in Keras

Now the model has been defined, compiled and trained (**well - see comments in the next section, though!**)

In this oversimplified example, we evaluate our model on our training dataset using the *evaluate()* function, passing it the same input and output used to train the model. This will generate a prediction for each input and output pair and collect scores, including the average loss and any metrics we have configured, such as accuracy.

In [None]:
# evaluate the model
scores = model.evaluate(X, Y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Running the cell above, you should see the final evaluation of the trained model on the training dataset. It takes about 2 seconds to execute on my workstation, a MacBook Air running on CPU with a TensorFlow backend.
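For intuition, the accuracy metric can be reproduced by hand from predicted probabilities. The sketch below uses made-up labels and probabilities (they are not outputs of the model above), and the helper name `accuracy_from_probabilities` is hypothetical.

```python
import numpy as np

def accuracy_from_probabilities(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true).astype(int)))

# Illustrative values only.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.81, 0.30, 0.45, 0.62, 0.90]

acc = accuracy_from_probabilities(y_true, y_prob)
print("accuracy: %.2f%%" % (acc * 100))   # accuracy: 60.00%
```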

# Is this a good-enough model to deliver towards production?

This is a good model to familiarize yourself with the process in a simplified way, but this is **absolutely not** a good model to put in production. 

<font color='orange'>NOTE: we have a FCNN trained on our *entire* dataset, and we evaluated the performance of the NN on the same dataset. This only gives us an idea of how well we have modeled the dataset (e.g. training accuracy), but **no idea of how well the model might perform on new, previously unseen data**. We have done this for simplicity: ideally, one should separate the entire dataset into train and test datasets for the training and evaluation of the model, or use k-fold cross-validation, etc.</font>

Realistically, the model we built will fail to accurately generalize to new, previously unseen data, as the accuracy we got above is overly optimistic. We need to work more to get towards a "good model" (whatever that means!). **But this is exactly what it looks like in ML practice: writing "one" model is easy, writing a "good model" is hard, improving towards a "better model" is the hardest**.
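As a first step in that direction, here is a minimal numpy-only sketch of the two strategies mentioned above: a hold-out split and k-fold cross-validation, both expressed as row indices. The helper names, the 20% test fraction, the seed and the assumed dataset size of 768 rows are all illustrative choices, not part of the exercise above.

```python
import numpy as np

def train_test_indices(n_samples, test_fraction=0.2, seed=7):
    """Shuffle row indices and split them into train and test parts."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def kfold_indices(n_samples, k=10, seed=7):
    """Yield (train, test) index arrays, each fold serving once as test set."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# Assumed dataset size of 768 rows.
train_idx, test_idx = train_test_indices(768)
print(len(train_idx), len(test_idx))
```

With the arrays loaded earlier, `X[train_idx]` and `Y[train_idx]` would go to *fit()*, while `X[test_idx]` and `Y[test_idx]` would go to *evaluate()*, so that the reported accuracy refers to data the model has never seen.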

# Summary (so far) 

What we learned:

* how to develop and evaluate a NN using Keras only

Specifically:

* How to load data and make it available to Keras
* How to prepare single-class classification data for modeling
* How to design, compile, fit a Keras NN on the entire dataset at our disposal
* How to evaluate the NN model 