# Supervised Learning

### Part of Machine Learning For Chemists

Before running this document you need to install TensorFlow (sorry!).

The following two upgrade commands are probably not needed on a newish install:

 `conda upgrade conda`
 
 `conda upgrade --all`

Create a new environment and install the CPU version of TensorFlow:

 `conda create -n tf tensorflow jupyter nb_conda`

Switch to it:

 `conda activate tf`

Add the required modules:

 `pip install matplotlib pillow`

Start the notebook server:

 `jupyter notebook`

This checks that your install is working well.

In [1]:
import tensorflow as tf
print("TensorFlow version: " + tf.__version__)


**Important** Don't try to understand everything that this code does. This is a bleeding-edge, whistle-stop tour designed to show you what is and isn't possible, not to teach you how to make a neural network!

Import some stuff. TensorFlow is the library for neural networks, and it is what Google runs its machine learning systems on. Keras is the nice front end for TensorFlow.

In [None]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D

import numpy as np

from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.preprocessing.image import img_to_array

import matplotlib.pyplot as plt
%matplotlib inline

import PIL
from IPython.display import display

This gets the labels for the photos. The NN just outputs a vector of numbers, and we need these labels to turn those numbers into class names.

In [None]:
#labels_path = tf.keras.utils.get_file('ImageNetLabels.txt','https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt')
#imagenet_labels = np.array(open(labels_path).read().splitlines())
# read one label per line, closing the file afterwards
with open('NNdata/ImageNetLabels.txt') as labels_file:
    imagenet_labels = np.array(labels_file.read().splitlines())

# Classification

Neural networks can be used as **function approximators** or **pattern matchers**, but they have been very successful in **classification** problems.

Regression models are supervised learning for **continuous** data, where we have data, we fit a **model** to it and we want to predict an output value. 

Neural networks can be used as supervised learning for **discrete** data, where we have a datapoint and we want to classify it, based on what we know about the data and the world in general.

In this notebook you will do some classifications with neural networks. 

Like in multivariate regression, we learn from a dataset X. For image recognition, that dataset is called ImageNet and contains 1.3 million example photos from 1000 categories. This data is chopped up and manipulated (i.e. crops are taken and the images are flipped), which makes many times more data. Each image is labelled with the object it contains. The data are messy and badly labelled, but there is a lot of it (due to all those human volunteers putting their pictures on Flickr!). **Deep neural networks require vast amounts of labelled data: big data.**
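
As a rough illustration of the cropping and flipping mentioned above, here is a minimal numpy sketch. The tiny array standing in for a photo and the crop position are assumptions chosen for the example, not the procedure actually used to prepare ImageNet.

```python
import numpy as np

# A tiny stand-in "image": height 4, width 6, 3 colour channels.
image = np.arange(4 * 6 * 3).reshape(4, 6, 3)

# Horizontal flip: reverse the order of the columns (the width axis).
flipped = image[:, ::-1, :]

# Crop: cut out a 3x3 window whose top-left corner is at row 1, column 2.
top, left, size = 1, 2, 3
crop = image[top:top + size, left:left + size, :]

# One labelled photo has become three training examples with the same label.
for name, arr in [("original", image), ("flipped", flipped), ("crop", crop)]:
    print(name, arr.shape)
```

Each variant keeps the label of the original photo, which is how one labelled image can be turned into several training examples.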

In this section we are going to download the pre-trained models and test them out on some data. 

This loads a few pretrained NN models. Inception_v3 is the famous deep NN from Google that you may have heard of. VGG16 is a nice standard model.

In [None]:
#import keras
#import numpy as np
from tensorflow.keras.applications import vgg19, inception_v3, resnet50, mobilenet
from tensorflow.keras.applications import vgg16
 
#Load the VGG model
vgg_model = vgg16.VGG16(weights='imagenet')
 
#Load the Inception_V3 model
inception_model = inception_v3.InceptionV3(weights='imagenet')
 
#Load the ResNet50 model
#resnet_model = resnet50.ResNet50(weights='imagenet')
 
#Load the MobileNet model
mobilenet_model = mobilenet.MobileNet(weights='imagenet')

What we're going to do is load in some images and see how well the NNs do at classifying them. (This code also reshapes the image so it will fit into the NN; don't worry about this.)


In [None]:
# This loads in an image from the folder NNdata -- make sure you have it!
filename = 'NNdata/pure_images/ILSVRC2012_val_00001218.JPEG'
# load an image in PIL format
original = load_img(filename, target_size=(224, 224))
 
# convert the PIL image to a numpy array
# IN PIL - image is in (width, height, channel)
# In Numpy - image is in (height, width, channel)
numpy_image = img_to_array(original)
print('numpy array size',numpy_image.shape)
 
# Convert the image / images into batch format
# expand_dims will add an extra dimension to the data at a particular axis
# We want the input matrix to the network to be of the form (batchsize, height, width, channels)
# Thus we add the extra dimension to the axis 0.
image_batch = np.expand_dims(numpy_image, axis=0)
print('image batch size', image_batch.shape)
plt.imshow(np.uint8(image_batch[0]))
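
The batch step in the cell above can be seen in isolation with a small numpy sketch. The all-zero array is only a stand-in for a real photo.

```python
import numpy as np

# Stand-in for one 224x224 RGB image (height, width, channels).
single_image = np.zeros((224, 224, 3), dtype=np.float32)

# Add a new axis at position 0, giving a "batch" that holds one image.
batch = np.expand_dims(single_image, axis=0)
print(single_image.shape, "->", batch.shape)

# Several images of the same size can be stacked into one batch.
three_images = np.stack([single_image, single_image, single_image], axis=0)
print(three_images.shape)
```

The network always expects the first axis to count images, even when there is only one.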

This chunk of code pre-processes the image and then asks the `vgg_model` to predict the classes for the image. This is the same as we did for the multivariate regression, except now our data is columns of pixels. The output `predictions` is a 1000-unit long vector telling you the chance that each of the 1000 classes is present in the image. We then take the five highest probabilities (which I've chosen to present as percentages) and see if we agree with the NN.
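
Picking the top five entries out of a probability vector can be sketched with plain numpy. The six class names and probabilities here are made up for illustration; the real output has 1000 entries and is decoded by `decode_predictions`.

```python
import numpy as np

# Hypothetical stand-ins: six made-up class names and one made-up output vector.
labels = np.array(["vulture", "kite", "teapot", "pole", "crane", "flagpole"])
probabilities = np.array([0.05, 0.55, 0.02, 0.20, 0.10, 0.08])

# argsort gives indices from smallest to largest; reverse and keep the first five.
top5 = np.argsort(probabilities)[::-1][:5]

print("Rank\tprobability\tname")
for rank, idx in enumerate(top5, start=1):
    print("{}.\t {:.2f}%\t\t{}".format(rank, probabilities[idx] * 100, labels[idx]))
```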

In [None]:
# prepare the image for the VGG model
processed_image = vgg16.preprocess_input(image_batch.copy())
 
# get the predicted probabilities for each class
predictions = vgg_model.predict(processed_image)

# convert the probabilities to class labels
# We will get the top 5 predictions, which is the default
predicted_top_5 = vgg16.decode_predictions(predictions)[0]

print("Rank\tprobability\tname")
for i in range(5):
    item = predicted_top_5[i]
    print("{}.\t {:.2f}%\t\t{}".format(i+1, item[2]*100, item[1]))

You'll probably agree with the NN that there is both some sort of bird (that could be a kite) in the image and a pole. 

You can change the file name in the above code and put in your own photos if you want to test out the NN.

In [None]:
# list of files in the pure directory
File_list_pure=["ILSVRC2012_val_00001079.JPEG",
"ILSVRC2012_val_00001218.JPEG",
"ILSVRC2012_val_00001671.JPEG",
"ILSVRC2012_val_00003109.JPEG",
"ILSVRC2012_val_00003594.JPEG",
"ILSVRC2012_val_00003632.JPEG",
"ILSVRC2012_val_00003866.JPEG",
"ILSVRC2012_val_00004187.JPEG",
"ILSVRC2012_val_00004255.JPEG", 
"ILSVRC2012_val_00004472.JPEG",
"ILSVRC2012_val_00004613.JPEG",
"ILSVRC2012_val_00004756.JPEG",
"ILSVRC2012_val_00004920.JPEG",
"ILSVRC2012_val_00005287.JPEG",
"ILSVRC2012_val_00005747.JPEG",
"ILSVRC2012_val_00005847.JPEG", 
"ILSVRC2012_val_00006432.JPEG",
"ILSVRC2012_val_00006632.JPEG", 
"ILSVRC2012_val_00006722.JPEG", 
"ILSVRC2012_val_00007058.JPEG",
"ILSVRC2012_val_00007307.JPEG",
"ILSVRC2012_val_00009208.JPEG",
"ILSVRC2012_val_00009396.JPEG",
"ILSVRC2012_val_00009581.JPEG",
"ILSVRC2012_val_00011045.JPEG",
"ILSVRC2012_val_00012630.JPEG",
"ILSVRC2012_val_00013772.JPEG",
"ILSVRC2012_val_00014336.JPEG"]
print("The number of files is", len(File_list_pure))

**Exercise** Run the code below, and count how many of the images you think are correct, in that one of the top-5 labels is truly in the picture (do not score more than 1 item per image, so if there are two labels that match the image, that still counts as 1). Then calculate the **top-5** accuracy on this test set.

Then count the **top-1** accuracy. This time, give the NN a score of 1 if and only if the first class it suggests is in the picture. Record the top-1 accuracy below.

This is how NNs are scored in competitions.
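
The scoring rules above can be written down as a short sketch. The true labels and suggestions are invented for the example; in the exercise you judge the matches by eye.

```python
# Hypothetical example: for each image, the true label and the network's
# five suggestions in rank order. All names here are made up.
results = [
    ("kite",   ["kite", "pole", "crane", "vulture", "flagpole"]),
    ("teapot", ["cup", "teapot", "vase", "pitcher", "bowl"]),
    ("zebra",  ["horse", "ox", "llama", "goat", "sheep"]),
    ("banana", ["banana", "lemon", "orange", "fig", "pear"]),
]

# Top-1: the first suggestion must be right.
top1_hits = sum(1 for truth, guesses in results if guesses[0] == truth)
# Top-5: the true label may appear anywhere in the five suggestions
# (each image scores at most 1).
top5_hits = sum(1 for truth, guesses in results if truth in guesses[:5])

top1_accuracy = top1_hits / len(results)
top5_accuracy = top5_hits / len(results)
print("Top-1 accuracy: {:.0f}%".format(100 * top1_accuracy))
print("Top-5 accuracy: {:.0f}%".format(100 * top5_accuracy))
```

Top-5 accuracy can never be lower than top-1 accuracy, because every top-1 hit is also a top-5 hit.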

In [None]:
c=0
prob_pure=[]
for this_file in File_list_pure:
    c=c+1
    # This loads in an image from the folder NNdata -- make sure you have it!
    filename = 'NNdata/pure_images/' + this_file
    # load an image in PIL format
    original = load_img(filename, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    print('Picture', c)
    img = PIL.Image.fromarray(np.uint8(image_batch[0]))
    display(img)
    # prepare the image for the VGG model
    processed_image = vgg16.preprocess_input(image_batch.copy())
    # get the predicted probabilities for each class
    predictions = vgg_model.predict(processed_image)
    predicted_top_5 = vgg16.decode_predictions(predictions)[0]
    print("Rank\tprobability\tname - picture",c)
    for i in range(5):
        item = predicted_top_5[i]
        print("{}.\t {:.2f}%\t\t{}".format(i+1, item[2]*100, item[1]))
        if i == 0:
            prob_pure.append(item[2])


Top-5 Test Accuracy is :

Top-1 Test Accuracy is:

Now you've calculated the test accuracy, see how sure the neural network was that it was correct.

In [None]:
print('Probability of correctness is {:.2f}% plus or minus {:.2f}%'.format(100*np.mean(prob_pure), 100*np.std(prob_pure)/np.sqrt(len(prob_pure))))
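
The number printed above is the mean of the top-1 probabilities together with its standard error. A minimal sketch with made-up probabilities shows the calculation; note that `np.std` uses the population formula (`ddof=0`) by default.

```python
import numpy as np

# Hypothetical top-1 probabilities for five images (made-up numbers).
top1_probs = np.array([0.90, 0.60, 0.75, 0.40, 0.85])

n = len(top1_probs)
mean_confidence = np.mean(top1_probs)
# Standard error of the mean: spread of the values divided by sqrt(n).
standard_error = np.std(top1_probs) / np.sqrt(n)

print("Mean confidence is {:.2f}% plus or minus {:.2f}%".format(
    100 * mean_confidence, 100 * standard_error))
```

This measures how confident the network was in its first guess, which is not the same thing as how often that guess was right.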


A problem with neural networks is that we do not really know how they work, but we know that they do not work in the same way as human beings.

For example, look at this picture, can you identify it?

In [None]:
# This loads in an image from the folder NNdata -- make sure you have it!
filename = 'NNdata/b.JPEG/b.JPEG'
# load an image in PIL format
original = load_img(filename, target_size=(224, 224))
numpy_image = img_to_array(original)
print('numpy array size',numpy_image.shape)
image_batch = np.expand_dims(numpy_image, axis=0)
print('image batch size', image_batch.shape)
plt.imshow(np.uint8(image_batch[0]))

This code gets the NN to predict the class for this image.

In [None]:
# prepare the image for the VGG model
processed_image = vgg16.preprocess_input(image_batch.copy())
# get the predicted probabilities for each class
predictions = vgg_model.predict(processed_image)
# convert the probabilities to class labels
# We will get the top 5 predictions, which is the default
predicted_top_5 = vgg16.decode_predictions(predictions)[0]
print("Rank\tprobability\tname")
for i in range(5):
    item = predicted_top_5[i]
    print("{}.\t {:.2f}%\t\t{}".format(i+1, item[2]*100, item[1]))

The neural network cannot, even though it got it correct when the colours were included.
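
How the colourless picture was produced is not stated here. As a sketch of one common approach (an assumption, not necessarily what was done for this image), colour can be removed by taking a weighted sum of the red, green and blue channels and copying the result back into all three channels so the image still fits the network.

```python
import numpy as np

# Stand-in for a small RGB image with values 0-255 (height 2, width 2).
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.float32)

# Weighted sum of the red, green and blue channels (ITU-R BT.601 luma weights).
weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
grey = rgb @ weights                     # shape (2, 2)

# The networks expect three channels, so copy the grey values into each.
grey_rgb = np.stack([grey, grey, grey], axis=-1)
print(grey_rgb.shape)
```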

# Adversarial images

Evidence that neural networks do not work like us is shown below, using images designed to fool the neural networks.

Reference for this bit: Nguyen A, Yosinski J, Clune J. "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images". In *Proceedings of the IEEE conference on computer vision and pattern recognition 2015* (pp. 427-436).

## Exercise 2
As before, count up the top-5 and top-1 accuracy and write them down. How do they compare to the previous set of images?

In [None]:
c=0
prob_fool=[]
for this_file in range(16):
    c=c+1
    # This loads in an image from the folder NNdata -- make sure you have it!
    filename = 'NNdata/fool_images/' + str(this_file + 1) + '.png'
    # load an image in PIL format
    original = load_img(filename, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    print('Picture', c)
    img = PIL.Image.fromarray(np.uint8(image_batch[0]))
    display(img)
    #f=plt.figure()
    #plt.imshow(np.uint8(image_batch[0]))
    #f.canvas.draw_idle()
    # prepare the image for the MobileNet model
    processed_image = mobilenet.preprocess_input(image_batch.copy())
    # get the predicted probabilities for each class
    predictions = mobilenet_model.predict(processed_image)
    predicted_top_5 = mobilenet.decode_predictions(predictions)[0]
    print("Rank\tprobability\tname - picture",c)
    for i in range(5):
        item = predicted_top_5[i]
        print("{}.\t {:.2f}%\t\t{}".format(i+1, item[2]*100, item[1]))
        if i == 0:
            prob_fool.append(item[2])

Top-5 Test Accuracy is :

Top-1 Test Accuracy is:

Now you've calculated the test accuracy, see how sure the neural network was that it was correct.

In [None]:
print('Probability of correctness is {:.2f}% plus or minus {:.2f}%'.format(100*np.mean(prob_fool), 100*np.std(prob_fool)/np.sqrt(len(prob_fool))))


## Conclusion

So, neural networks assign very high certainty to images that are not images of objects at all, and are thus easy to fool.

It seems they pay attention to small, local features, colour and texture, rather than the overall shape.

The problem is that, while it is easy to see that the neural network's answers are wrong in the visual realm, it would not be as easy to spot an error like this in a quantum chemistry approximation.

# Training a neural network: the effect of dataset size

This section will train a neural network on a dataset called MNIST, a database of handwritten digits with 60,000 training images and 10,000 test images. (If you ever wondered how letters get to you, this is how automated postcode readers work.)

In [None]:
# an epoch is one full pass through the training data; we will train for 10 epochs.
epochs = 10
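
To get a feel for what 10 epochs means, here is a small sketch that counts weight updates. It assumes a batch size of 128, the value this notebook sets in the following cell, and one update per batch; the helper `weight_updates` is made up for this example.

```python
import math

batch_size = 128   # the value used in the next cell
epochs = 10

def weight_updates(num_images, batch_size, epochs):
    """Batches per epoch (the last one may be smaller) times the number of epochs."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

for num_images in (1000, 60000):
    steps, total = weight_updates(num_images, batch_size, epochs)
    print(num_images, "images:", steps, "batches per epoch,", total, "updates in total")
```

With the same number of epochs, a larger training set gives the network many more updates as well as more examples.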

This gets the MNIST data and sets up the train and test datasets.

In [None]:
batch_size = 128
num_classes = 10

# input image dimensions
img_rows, img_cols = 28, 28

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000,28,28,1)
x_test = x_test.reshape(10000,28,28,1)

print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)
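
The "binary class matrices" made in the last two lines of the cell above are usually called one-hot encodings. A minimal numpy sketch, using a few made-up digit labels, shows what the conversion produces.

```python
import numpy as np

num_classes = 10
# Hypothetical digit labels, as integers 0-9.
digit_labels = np.array([5, 0, 4, 1, 9])

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes, dtype=np.float32)[digit_labels]
print(one_hot.shape)
print(one_hot[0])

# argmax recovers the original integer label.
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

Each row has a single 1 in the position of the correct digit, which matches the ten outputs of the network.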

This code prints out the first 10 digits as an example of the dataset. 

In [None]:
for i in range(10):
    img = PIL.Image.fromarray(x_train[i].reshape((28,28)))
    display(img)

Now, we'll make a training set of only 1000 images, that should be more than enough, right?

In [None]:
x_train_small=x_train[0:1000]
y_train_small=y_train[0:1000]

This section sets up the neural network model. 

In [None]:
# instantiates the model
model = Sequential()
# bottom layer takes in the pictures, and does a convolution on them to look for edges
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(28,28,1)))
# second layer takes in the output of the first and does a convolution on that data
model.add(Conv2D(64, (3, 3), activation='relu'))
# max pooling: we keep the largest value in each 2x2 window of the convolution output
model.add(MaxPooling2D(pool_size=(2, 2)))
# dropout - some neurons are randomly removed during training as this reduces overtraining!
model.add(Dropout(0.25))
# changes the shape of the output - don't worry about this, it's just reshaping the data
model.add(Flatten())
# third layer, we add 128 feed-forward neurons
model.add(Dense(128, activation='relu'))
# they're dropped out randomly as well!
model.add(Dropout(0.5))
# we use the softmax activation on the output, so the 10 outputs sum to 1
model.add(Dense(num_classes, activation='softmax'))
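
To make the convolution and max-pooling steps less mysterious, here is a hand-written numpy sketch on a tiny image. The edge filter is chosen by hand for illustration; a trained network learns its own filter values, and Keras does all of this far more efficiently.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding (the Conv2D default)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Stand-in 6x6 image: dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-made vertical-edge filter.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

features = np.maximum(conv2d_valid(image, kernel), 0.0)  # ReLU activation
pooled = max_pool_2x2(features)
print(features.shape, pooled.shape)
print(features)
```

The filter responds only where dark meets bright. By the same arithmetic, a 28x28 digit becomes 26x26 after the first 3x3 convolution, 24x24 after the second, and 12x12 after pooling.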

Neural networks are a bit more complex to set up than the regularisation models. This sets up the **loss function**, which tells the NN how bad the error is, chooses the optimiser and tells the NN to measure accuracy.

In [None]:
model.compile(loss=tf.keras.losses.categorical_crossentropy,
              optimizer=tf.keras.optimizers.Adadelta(),
              metrics=['accuracy'])
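
The softmax output and the categorical cross-entropy loss used above can be sketched in a few lines of plain Python. The raw scores are invented, and the two helper functions are simplified stand-ins written for this example rather than the Keras implementations.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_truth, probabilities):
    """Minus the log of the probability given to the correct class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_truth, probabilities))

# Hypothetical raw scores for three classes; the true class is the first.
truth = [1.0, 0.0, 0.0]
confident_right = softmax([4.0, 1.0, 0.0])
confident_wrong = softmax([0.0, 4.0, 1.0])

print("loss when right:", round(categorical_crossentropy(truth, confident_right), 3))
print("loss when wrong:", round(categorical_crossentropy(truth, confident_wrong), 3))
```

The loss is small when the network puts most of its probability on the correct class and large when it is confidently wrong, which is what training tries to reduce.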

This is the bit where it does the fit. Unlike the regression models we ran yesterday, a deep NN takes a long time to fit.

In [None]:
# Fits the NN to the input data (train and validation)
model.fit(x_train_small, y_train_small,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
# This scores the model on the test set
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy: {:.2f}%'.format(score[1]*100))

Watch the loss improve with each epoch: `loss` is the loss on the training set and `acc` is the accuracy on the training set, while `val_loss` and `val_acc` are the loss and accuracy on the validation set (newer TensorFlow versions print `accuracy` and `val_accuracy` instead).

That's OK, but not great. Let's see what happens when we use all the data.

In [None]:
# let's try using ALL the data
# (this continues training the model fitted above rather than starting afresh)
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy: {:.2f}%'.format(score[1]*100))

### Conclusion

You've just trained your first neural network. 

More data is better than less when training a neural network, but we really need to make sure that it is data of good quality.

We can get away with less data if it is chosen in an intelligent way. 