# Lab 1 - Intro to BlueCrystal and Training Your First Fully Connected Network

In this first lab session, you will learn the basics of implementing deep learning models using TensorFlow 1.2 and how to use BlueCrystal Phase 4 (BC4) for training them. The aim is to learn the principle of training a fully connected layer. 

### Objectives:

1.- Build your first deep learning model using TensorFlow 1.2 for classifying Iris flowers using 4-dimensional data.

2.- Train your model on BC4 and visualize the training process

3.- Evaluate your model.


## NOTICE:

Please ensure you can successfully run the [GPU stress Test](../../BC4_Stress_Test/RunningTensorFlow.ipynb) before attempting Lab 1. This will ensure you have a working platform before proceeding with this lab sheet.

# 1. IRIS Flower Dataset

In 1936, the British statistician and geneticist Ronald Fisher introduced the IRIS Flower Dataset ([Wiki](https://en.wikipedia.org/wiki/Iris_flower_data_set)).

It contains 150 samples of three types of IRIS flowers. Each sample is described using four dimensions. Follow the Wikipedia link above to look at the IRIS dataset file, and understand the various dimensions.

The aim of this lab is to classify the IRIS flower dataset, using a fully connected `deep` network. We will actually build a shallow one of 3 layers only, but the principles extend to any depth.

# 2. Running Your First BC4 Code

## 2.1 Downloading Relevant Files

* First, visit the [GitHub labsheet repository](https://github.com/COMSM0018-Applied-Deep-Learning/labsheets)
* Clone the repository `git clone "https://github.com/COMSM0018-Applied-Deep-Learning/labsheets.git"`
* Copy `Lab_1_DNNs` into a new folder which we will refer to as `/path_to_files/`
* Using Jupyter notebook, open the file `Lab_1_DNNs/first_dnn.py`
* Note that the file includes code to load the IRIS dataset - the same format as that on the Wiki page, and splits it into two: features and labels. 

## 2.2   Copying files between your machine and BC4

BlueCrystal Phase 4 (BC4) is the latest update to the University's High Performance Computing (HPC) machine. It includes 32 GPU-accelerated nodes, each with two NVIDIA Tesla P100 GPU accelerators, and a visualization node equipped with NVIDIA GRID GPUs. What matters to us are the Tesla P100 GPU accelerators, which you will use for training your Deep Learning algorithms.

Further information on BC4 and the support we have for it are available at: https://www.acrc.bris.ac.uk/acrc/phase4.htm

There are two *modes* for using BC4: *Interactive* and *Batch*. We will use *Interactive* during lab sessions, since it allows the immediate execution of your program and you can see outputs directly in the terminal window (great for debugging). The *Batch* mode queues your job and generates files related to the execution of your program. You will use *Batch* as part of the group project, so we will revisit it later.


**NOW** copy the provided folder `Lab_1_DNNs` (which contains `first_dnn.py`, `tensorboard_params.sh`, `go_interactive.sh`) to your account in BC4. 

For copying individual files from your machine to your home directory on BC4, use the following example with `go_interactive.sh`:

```$ scp  /path_to_files/Lab_1_DNNs/go_interactive.sh <your_UoB_ID>@bc4login.acrc.bris.ac.uk:```

or all files at once by using: 

```$ scp -r /path_to_files/*  <your_UoB_ID>@bc4login.acrc.bris.ac.uk:```

For copying files back from BC4 to your machine, use the ```scp``` command from a terminal on your machine. You can copy individual files as well as directories:

```$ scp  <your_UoB_ID>@bc4login.acrc.bris.ac.uk:/path_on_bc4/foo.foo   /path_in_your_machine/```
 
Alternatively, you may wish to use SSHFS to mount a directory on BC4 to a local directory using:

```$ mkdir -p ~/bc4 && sshfs <your_UoB_ID>@bc4login.acrc.bris.ac.uk:/dir_on_bc4/ ~/bc4```
 

## 2.3 Logging in

The connection to BC4 is done via SSH 

```$ ssh <your_UoB_ID>@bc4login.acrc.bris.ac.uk```

You should see something like this in your home directory:
 
```
Lab_1_DNNs
|----------first_dnn.py
|----------tensorboard_params.sh
|----------go_interactive.sh
```
 
**NOTE: If you cannot see the file structure above, you have not copied the files correctly**

## 2.4 Running Your Code

Now run your current dnn code as follows: 

```$ ./go_interactive.sh ```

Wait for a GPU to be allocated to you, then run:


```$ python first_dnn.py```

Currently the code only loads your IRIS datafile. You can free the reserved GPU using

```$ exit```

# 3. Let's code now!

## 3.1 Prepare a Training/Testing split

First, we need to separate the loaded file of samples **```data```** into training and testing - let's do a 2:1 split (100 samples for training and 50 samples for testing)


**STOP**... before you proceed with this task, you first need to shuffle the file. **WHY?**

**NOW**, set a random seed using NumPy so that your shuffle is reproducible


**NOW**, shuffle your data [hint: ```data = data.sample(frac=1).reset_index(drop=True)```]


**NOW**, divide your data into ```train_x```, ```train_y```, ```test_x```, ```test_y```


You can check you have correctly split the data by printing out the sizes of your variables
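
As a rough illustration of why the shuffle matters and how the sizes of a 2:1 split can be checked, here is a minimal NumPy sketch. It does not use the real file: `features` and `labels` are synthetic stand-ins, the seed `42` is arbitrary, and the sketch assumes the samples are stored grouped by class, as they appear on the Wiki page. It also uses integer class labels for brevity, whereas your own `train_y` may be one-hot encoded.

```python
import numpy as np

# Synthetic stand-in for the Iris file: 150 samples, 4 features, 3 classes,
# stored grouped by class (an assumption about the file layout).
features = np.arange(150 * 4, dtype=np.float32).reshape(150, 4)
labels = np.repeat([0, 1, 2], 50)

# Without shuffling, the last 50 rows all belong to a single class.
unshuffled_test_classes = np.unique(labels[100:])

# Shuffle features and labels with the same permutation, using a fixed seed
# so the split is reproducible.
rng = np.random.RandomState(42)
order = rng.permutation(len(labels))
features, labels = features[order], labels[order]

# 2:1 split: first 100 samples for training, remaining 50 for testing.
train_x, test_x = features[:100], features[100:]
train_y, test_y = labels[:100], labels[100:]

print(unshuffled_test_classes)        # classes present in the unshuffled test split
print(train_x.shape, train_y.shape)   # (100, 4) (100,)
print(test_x.shape, test_y.shape)     # (50, 4) (50,)
```

If the unshuffled test split contains only one class, the test accuracy tells you very little about how the model handles the other two.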

## 3.2 Define a Perceptron!

The first objective is to train a perceptron: $y=Wx+b$ given our training data ```train_x``` and ```train_y```

Let's first define these variables. Both ```x``` and ```y``` should be placeholders; the ```None``` in the shape lets the same placeholder hold a **single** data point or a whole batch:


```x = tf.placeholder(tf.float32, shape=[None, n_x])```

**NOW**, go ahead and define $y$.


**NOW**, define $W$ and $b$ as variables in Tensorflow (```tf.Variable```) with the right dimensions. Initialise to zero (we'll randomise this initialisation later)


If defined correctly you can then define the predictions over y using the multiplication operator ([```tf.matmul```](https://www.tensorflow.org/api_docs/python/tf/matmul)) as well as the softmax operator [```tf.nn.softmax```](https://www.tensorflow.org/api_docs/python/tf/nn/softmax)

**Note:** You have not trained anything yet; you are merely applying the initial (zero) weights to placeholders for x and y

**Debug:** Just debug to check you've done it correctly
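
One possible sanity check, sketched here in plain NumPy rather than TensorFlow: with zero-initialised weights and biases, the softmax output should be the same for every class. The `softmax` helper and the two input rows are illustrative assumptions, and the sketch uses the row-vector convention `x @ W + b`, which matches the order of arguments you would pass to `tf.matmul` for a batch of shape `[None, n_x]`.

```python
import numpy as np

n_x, n_y = 4, 3                      # input features and output classes

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

W = np.zeros((n_x, n_y), dtype=np.float32)   # zero-initialised weights
b = np.zeros(n_y, dtype=np.float32)          # zero-initialised biases

# Two illustrative input rows with four measurements each.
x_batch = np.array([[5.1, 3.5, 1.4, 0.2],
                    [6.7, 3.0, 5.2, 2.3]], dtype=np.float32)

prediction = softmax(x_batch @ W + b)
print(prediction.shape)   # (2, 3)
print(prediction)         # every entry is 1/3 while W and b are zero
```

If your TensorFlow prediction for an untrained, zero-initialised model is not uniform across the three classes, the shapes of `W` and `b` are a good first thing to check.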


## 3.3 Train a Perceptron!

Now it is time to define the loss/cost function - that is the difference between the prediction, and the ground-truth

There are a few ways to do this, let's try this one [we assume ```prediction``` is what you calculated in 3.2]

```python
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(prediction), axis=1))
```

Again, this is only defining the cost function and not training for it. You will need an optimizer to do the job
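
To get a feel for the numbers this cost produces, the same expression can be evaluated in NumPy on a couple of hand-written predictions. This is only a sketch: `cross_entropy` is a hypothetical helper, and the label and prediction values are made up for illustration.

```python
import numpy as np

def cross_entropy(y_true, prediction):
    # Mean over samples of -sum_k y_k * log(p_k), mirroring the TensorFlow cost.
    return np.mean(-np.sum(y_true * np.log(prediction), axis=1))

# One-hot labels for two samples (classes 0 and 2).
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]], dtype=np.float64)

uniform = np.full((2, 3), 1.0 / 3.0)
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.1, 0.8]])

print(cross_entropy(y_true, uniform))     # equals log(3) for uniform predictions
print(cross_entropy(y_true, confident))   # lower: the true classes get high probability
```

Note that the logarithm is undefined for a predicted probability of exactly zero, so a cost written this way can become `nan` or `inf` if the softmax saturates.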

**NOW** Define a gradient descent ```optimiser```: [```tf.train.GradientDescentOptimizer```](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer) 

**HINT** You will need to call the ```minimize``` function of the optimiser with the right parameter (cost)

**HINT** use a learning rate of 0.01 for now, we'll revisit this decision in later labs
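
For intuition about what a gradient descent step does, here is a minimal NumPy sketch of the update rule applied to a softmax perceptron. It is not how TensorFlow is implemented internally; the helper names, the synthetic data, and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def loss_and_grads(W, b, x, y):
    # Forward pass, cross-entropy loss, and its gradients w.r.t. W and b.
    p = softmax(x @ W + b)
    loss = np.mean(-np.sum(y * np.log(p), axis=1))
    d_logits = (p - y) / len(x)
    return loss, x.T @ d_logits, d_logits.sum(axis=0)

# Tiny synthetic problem (illustrative values, not the Iris data).
rng = np.random.RandomState(0)
x = rng.rand(30, 4)
y = np.eye(3)[rng.randint(0, 3, size=30)]   # one-hot labels

W, b = np.zeros((4, 3)), np.zeros(3)
learning_rate = 0.01

losses = []
for step in range(200):
    loss, grad_W, grad_b = loss_and_grads(W, b, x, y)
    losses.append(loss)
    # Gradient descent update: move each parameter against its gradient.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(losses[0], losses[-1])   # the loss should decrease over the steps
```

The learning rate scales every step by the same factor, which is why a value that is too large can overshoot and a value that is too small can make training very slow.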


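To initialise the variables and optimise for a fixed number of iterations, try the following:

```python
sess = tf.Session()
# Run the variable initialisers before the first training step
sess.run(tf.global_variables_initializer())

for epoch in range(10000):
    # `optimizer` is the training op returned by minimize(cost)
    sess.run([optimizer], feed_dict={x: train_x, y: train_y})
```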

## 3.4 Testing your trained perceptron

You can now use the trained weights to test the prediction on, say, the first element in your test set

```python
sess.run(prediction, feed_dict={x: test_x[:1], y: test_y[:1]}).tolist()[0]
```

**NOW**, calculate the predictions for your whole test set and compare them with the ground-truth labels. The fraction of correct predictions is the accuracy of your model

You can also print out the model's accuracy during epochs (maybe print it out every 100 epochs?), to see how the model is being trained
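
As a sketch of the accuracy computation, assuming one-hot labels and a matrix of predicted class probabilities, the logic looks like this in NumPy. The `accuracy` helper and the example values are made up for illustration; in TensorFlow the same steps can be written with `tf.argmax`, `tf.equal` and `tf.reduce_mean`.

```python
import numpy as np

def accuracy(prediction, y_true):
    # A sample is correct when the most probable class matches the one-hot label.
    correct = np.argmax(prediction, axis=1) == np.argmax(y_true, axis=1)
    return correct.mean()

prediction = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6],
                       [0.4, 0.5, 0.1],
                       [0.2, 0.2, 0.6]])
y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

print(accuracy(prediction, y_true))   # 0.75: three of the four samples are correct
```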

Debug and run. A sample of the **test** accuracy over time would be as follows:

```
Accuracy of Perceptron at epoch 0 is 0.23999998
Accuracy of Perceptron at epoch 10 is 0.97999996
Accuracy of Perceptron at epoch 20 is 0.84000003
Accuracy of Perceptron at epoch 30 is 0.97999996
Accuracy of Perceptron at epoch 40 is 0.98000002
Accuracy of Perceptron at epoch 50 is 0.98000002
Accuracy of Perceptron at epoch 60 is 0.98000002
Accuracy of Perceptron at epoch 70 is 0.98000002
Accuracy of Perceptron at epoch 80 is 0.98000002
Accuracy of Perceptron at epoch 90 is 0.98000002
Accuracy of Perceptron at epoch 100 is 0.98000002
```

## 3.5 Define a *DEEP* fully connected network

Now let's change our perceptron's definition to represent a fully connected network.

We are aiming for a fully connected network that looks like this:

```x(4 dimensions) - h1 (10 nodes) - h2 (20 nodes) - h3 (10 nodes) - y (3 classes)```

Where h1 is the first layer of hidden nodes, and similarly for h2 and h3

We will choose ReLU as our activation function

We'll help you define the first layer (```h1```) using this code - **make sure you understand it!**

```python
W_fc1 = tf.Variable(tf.truncated_normal([n_x, h1], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[h1]))
h_fc1 = tf.nn.relu(tf.matmul(x, W_fc1) + b_fc1)
```
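
If it helps to see what this first layer computes, here is a rough NumPy equivalent covering the first layer only. It is a sketch under assumptions: plain normal draws stand in for `tf.truncated_normal`, the batch of 5 random samples is made up, and the sizes `n_x = 4` and `h1 = 10` are taken from the architecture described above.

```python
import numpy as np

n_x, h1 = 4, 10                       # input features and hidden units in h1

rng = np.random.RandomState(0)
# Small random weights; plain normal draws stand in for tf.truncated_normal here.
W_fc1 = rng.normal(0.0, 0.1, size=(n_x, h1))
b_fc1 = np.full(h1, 0.1)              # small positive bias, as in the TensorFlow code

x_batch = rng.rand(5, n_x)            # a batch of 5 illustrative samples

# ReLU keeps positive activations and zeroes out the negative ones.
h_fc1 = np.maximum(0.0, x_batch @ W_fc1 + b_fc1)

print(h_fc1.shape)          # (5, 10)
print((h_fc1 >= 0).all())   # True
```

Each further layer follows the same pattern, with the output of the previous layer taking the place of `x_batch`.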

**NOW** it's your turn to complete building the fully connected network.

*Q: How many variables does your network contain??*

Once done, you can define the loss function using softmax cross entropy as follows:

```python
cost_fcn = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=predictions_fcn, scope="Cost_Function")
```

Instead of the Gradient descent optimiser, let's use the [Adagrad optimiser](https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer)

**HINT** use a learning rate of 0.1 for this one

**HINT** run your network for 3000 steps. *Q why do we need more iterations here?*
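
To give an idea of how Adagrad differs from plain gradient descent, here is a simplified NumPy sketch of its update rule on a toy objective. It is not TensorFlow's implementation: `adagrad_minimise` is a hypothetical helper, and the starting accumulator value, step count and objective are illustrative assumptions.

```python
import numpy as np

def adagrad_minimise(grad_fn, w, learning_rate=0.1, steps=100,
                     initial_accumulator=0.1):
    # Each parameter keeps a running sum of its squared gradients.
    accumulator = np.full_like(w, initial_accumulator)
    for _ in range(steps):
        g = grad_fn(w)
        accumulator += g ** 2
        # Parameters with large past gradients take smaller steps.
        w = w - learning_rate * g / np.sqrt(accumulator)
    return w

# Toy objective f(w) = sum(w^2), whose gradient is 2w and minimum is at 0.
w_start = np.array([1.0, -2.0])
w_end = adagrad_minimise(lambda w: 2.0 * w, w_start)

print(np.sum(w_start ** 2), np.sum(w_end ** 2))   # the objective should shrink
```

Because the accumulated squared gradients only grow, the effective step size of each parameter shrinks over time, which is one reason Adagrad is often started with a larger learning rate than plain gradient descent.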

**NOW**, optimise and calculate the accuracy for your test set as you did previously

Print the accuracy every 100 epochs. Your output should look like:

```
Accuracy of my first dnn at epoch 0 is 0.31999999
Accuracy of my first dnn at epoch 100 is 0.31999999
Accuracy of my first dnn at epoch 200 is 0.67999995
Accuracy of my first dnn at epoch 300 is 0.67999995
Accuracy of my first dnn at epoch 400 is 0.67999995
Accuracy of my first dnn at epoch 500 is 0.67999995
Accuracy of my first dnn at epoch 600 is 0.67999995
Accuracy of my first dnn at epoch 700 is 0.74000001
Accuracy of my first dnn at epoch 800 is 0.95999998
Accuracy of my first dnn at epoch 900 is 0.95999998
Accuracy of my first dnn at epoch 1000 is 0.95999998
Accuracy of my first dnn at epoch 1100 is 0.95999998
Accuracy of my first dnn at epoch 1200 is 0.95999998
Accuracy of my first dnn at epoch 1300 is 0.95999998
Accuracy of my first dnn at epoch 1400 is 0.95999998
Accuracy of my first dnn at epoch 1500 is 0.97999996
Accuracy of my first dnn at epoch 1600 is 0.97999996
Accuracy of my first dnn at epoch 1700 is 0.97999996
Accuracy of my first dnn at epoch 1800 is 0.97999996
Accuracy of my first dnn at epoch 1900 is 0.97999996
Accuracy of my first dnn at epoch 2000 is 0.97999996
Accuracy of my first dnn at epoch 2100 is 0.97999996
Accuracy of my first dnn at epoch 2200 is 0.97999996
Accuracy of my first dnn at epoch 2300 is 0.97999996
Accuracy of my first dnn at epoch 2400 is 0.97999996
Accuracy of my first dnn at epoch 2500 is 0.97999996
Accuracy of my first dnn at epoch 2600 is 0.97999996
Accuracy of my first dnn at epoch 2700 is 0.97999996
Accuracy of my first dnn at epoch 2800 is 0.97999996
Accuracy of my first dnn at epoch 2900 is 0.97999996
Accuracy of my first dnn at epoch 3000 is 0.97999996
```

## 3.6 Summaries and Tensorboard
TensorBoard allows for the visualisation of training and testing statistics. To do this, we can run TensorBoard on BlueCrystal and, via port forwarding, view the results on your local machine.

First, we need to indicate what we want to be saved in the summaries. For now we will save the **loss** and **accuracy** for every batch.

First, you need to specify where the logs and summaries will be stored

```python
logs_path = "./logs/"
```

**NOW**, make sure you define your whole DNN and all its variables within a graph - we will revisit graphs in Lab 2.

```python
g = tf.get_default_graph()
with g.as_default():
    # Define your placeholders, variables and network here
    pass
```

For each value you would like to log, define a name scope and use ```tf.summary.scalar``` to add the value:

```python
with tf.name_scope('loss'):
    # Softmax cross-entropy between the one-hot labels and the network's logits
    cost_fcn = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=predictions_fcn, scope="Cost_Function")
    # Log the loss as a scalar summary
    tf.summary.scalar('loss', cost_fcn)
```

To record all the summaries, you need to merge the summary (within ```g.as_default```). Then create two writers, one for the training data and the other for the test data

```python
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(logs_path + '/train')
test_writer = tf.summary.FileWriter(logs_path + '/test')
```

Write a summary per epoch, for both training and testing. You will need to output the summary from sess.run, then write it for both training and testing

```python
train_writer.add_summary(summary_train, epoch)
test_writer.add_summary(summary_test, epoch)
```

**NOW**, debug and run

## 3.7 Monitoring your training

Follow the next steps for monitoring your training using Tensorboard:

1. Using the BlueCrystal ssh login (2.3), change to the Lab 1 directory:

    ```$ cd Lab_1_DNNs/```
    
2. Switch to interactive mode, and note that your prompt changes from the login node to a reserved GPU node:

    ```$ ./go_interactive.sh ```
    
3. Run the following script. It will print two values: `ipnport=XXXXX` and `ipnip=XX.XXX.X.X`

    ```$ chmod +x tensorboard_params.sh```

    ```$ ./tensorboard_params.sh```
    
    **Write them down since we will use them for using TensorBoard.**

4. Train the model using the command:
    
    ```$ python first_dnn.py & tensorboard --logdir=logs/ --port=<ipnport>```
   
   where `ipnport` comes from the previous step. It might take a minute or two before you start seeing the accuracy on the validation batch at every step

## 3.8 Visualising and Monitoring Your Training


1. Open a **new Terminal window** on your machine and type: 
    
    ```ssh <your_UoB_ID>@bc4login.acrc.bris.ac.uk -L 6006:<ipnip>:<ipnport>```

    where `ipnip` and `ipnport` come from step 3 in **3.7**.

2. Open your web browser (Use Chrome; Firefox currently has issues with tensorboard) and open the port 6006 (http://localhost:6006). This should open TensorBoard, and you can navigate through the summaries that we included.


3. Click on **Accuracy** and **Loss**

4. You should be able to see the train and test losses and accuracies, similar to these examples:

![](./loss_example_DNN.png)
![](./accuracy_example_DNN.png)

# 4. Saving your trained model

You should copy your log files back from BC4, and save them for your first lab portfolio

```bash
scp -r <your_UoB_ID>@bc4login.acrc.bris.ac.uk:Lab_1_DNNs/logs   /path_in_your_machine/
```

Both your directory `logs/` and your `first_dnn.py` file should be submitted as part of your Lab_1 portfolio (see [**section 6**](#6.-Preparing-Lab_1-Portfolio)).

# 5. Closing all sessions

Once the training has finished, **close all sessions** by typing `exit`. You need to do this twice for an **interactive session.** 

**Please make sure you close your session in order to release the GPU node.**

# 6. Preparing Lab_1 Portfolio

You should by now have the following files, which you can zip under the name `Lab_1_<username>.zip` 

 ```
 Lab_1_<username>.zip
 |----------logs/
 |----------first_dnn.py
 ```
 
Store this zip safely. You will be asked to upload all your lab portfolios to **SAFE at Week 10** - check SAFE for deadline details.

Ack: this lab is inspired by ideas from: [steadforce](https://steadforce.com/first-steps-tensorflow/)