# Programming for Chemists: Hands-on machine learning for chemists

**Note: Google Colab is recommended for this session because it provides GPU access.**

Before you start, run the following code box to install several requirements for this notebook:

In [None]:
# install deepchem
!pip install --pre deepchem

# install kora
!pip install kora

# install rdkit in Colab via kora (importing this module triggers the install)
import kora.install.rdkit

## Introduction

Machine learning has been the buzzphrase of the last 5 years, but what is it? Why is it important? Why is there a session dedicated to it? Is it just applied statistics? We are going to:

* Find out what machine learning is and why it is important for scientists.
* Look at the tools needed to do it.
* See how it is used in science.
* Find out that many of you have done machine learning without realising it.
* Predict the solubility of small drug molecules using machine learning.

**Aim of this session:**

Machine learning is a **vast** and **rapidly** growing field of research built on complex mathematics and statistics, which we will **not** cover.

* Instead we are going to explore how you can **use** it to solve real world problems; application over theory. 
* There is a high probability you will end up using some form of machine learning in your future career, so it is more important you understand how you can use it as opposed to spending months or years understanding the mathematics behind the algorithms. 
* [The following image](https://vas3k.com) gives a bird's-eye view of the machine learning world:

<center><img src="https://i.pinimg.com/736x/05/38/08/0538088de4041bda05c6f1febc99e1bb.jpg" width="auto" height="auto" /></center>

* We already benefit from machine learning in our day-to-day lives. A few examples:

    * **Image recognition**: Software trained using ML can scan through hundreds of MRI and CT medical images per second, highlighting abnormalities. On some narrowly defined tasks, such systems have been reported to match or exceed human accuracy.
    * **Medical diagnosis**: Training programs using clinical symptoms to accurately diagnose patients. 
    * **Speech recognition**: ML is behind the rise of assistant software like Alexa and Siri. 
    * **Google**: Google is at the forefront, investing huge amounts of money into all facets of ML. It is used in their translation software, Google Photos, Google Assistant, Google Maps, self-driving cars and many more.
    * **Financial services**: Used extensively in the finance sector to predict market shifts, customer spending patterns and it can even predict account closures before they occur!
    
    * Scientists have even used machine learning to [reconstruct images a person has looked at from monitoring their brain activity!](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006633)

## Artificial Intelligence, Machine Learning, and Deep Learning. What's the Difference?

<center><img src="https://raw.githubusercontent.com/adambaskerville/ProgrammingForChemists/master/images/AI_ML_DL.jpeg" width="auto" height="auto" /></center>


There are no agreed upon, standard definitions for any of these subject fields, but the [Google Machine Learning glossary](https://developers.google.com/machine-learning/glossary#m) provides sensible definitions:

**Artificial Intelligence (AI):** A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence. Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

**Machine Learning (ML):** A program or system that builds (trains) a predictive model from input data. The system uses the learned model to make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model. Machine learning also refers to the field of study concerned with these programs or systems.

**Deep Learning (DL):** A machine learning technique that builds multi-layered artificial neural networks, loosely inspired by the structure and function of the human brain.

### Anatomy of a machine learning problem

**Most machine learning problems follow a similar implementation and solution pattern:**

1. Read in your data. 
2. Split the data into training and testing data.
3. Format the data. 
4. Fit a model to the training data.
5. Test the model on the testing data.
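The five steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data set is a made-up straight line, $y = 2x + 1$, with added noise, and the "model" is an ordinary least-squares line rather than anything from a machine learning library.

```python
import random

# 1. "Read in" the data: here we fabricate a toy data set, y = 2x + 1 plus noise
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

# 2. Split into training (80%) and testing (20%) data after shuffling
pairs = list(zip(xs, ys))
random.shuffle(pairs)
n_train = int(0.8 * len(pairs))
train, test = pairs[:n_train], pairs[n_train:]

# 3. Format the data: separate the inputs from the outputs
x_train, y_train = zip(*train)
x_test, y_test = zip(*test)

# 4. Fit a model (a least-squares straight line) to the training data
x_mean = sum(x_train) / len(x_train)
y_mean = sum(y_train) / len(y_train)
m = (sum((x - x_mean) * (y - y_mean) for x, y in train)
     / sum((x - x_mean) ** 2 for x in x_train))
c = y_mean - m * x_mean

# 5. Test the model on data it has not seen during fitting
mse = sum((m * x + c - y) ** 2 for x, y in test) / len(test)
print(f"m = {m:.3f}, c = {c:.3f}, test MSE = {mse:.4f}")
```

The fitted gradient and intercept should land close to the values used to generate the data, and the error on the held-out test points gives a measure of how well the model generalises.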

## Machine Learning in Python: TensorFlow

Python remains one of the most popular languages for machine learning, boasting a variety of libraries for it, including [**TensorFlow**](https://www.tensorflow.org/), an end-to-end open source machine learning platform developed by Google and available for several programming languages, including Python.

* First we import tensorflow into Python:

In [None]:
import tensorflow as tf

* Before we continue, various parts of the code in this worksheet will flag up warnings when they are run. **These are not a concern** but are merely a product of having to get various libraries to communicate with one another. 
* Certain Python libraries depend on specific versions of other Python libraries, so it is sometimes a balancing act to set up the environment to make them all happy. 
* The warnings usually refer to `deprecation` statements, which simply suggest that you use newer syntax from the latest release version.

## Hardware Limitations

With everyone wanting to use machine learning, its implementation has become quite high level, requiring only small amounts of code.
* The biggest hurdle is the cost of the hardware needed to run the calculations in a sensible time.

* Since the start of this session, **terabytes of data** have been produced worldwide; data is the key ingredient for AI, ML and DL.
    * Such volumes of data are **difficult** for standard computers to handle.
    * To achieve results within a sensible time frame you often need server racks full of Graphics Processing Units (GPUs) to run everything using massively parallel programming. 
    * The popularity of video gaming has been the dominant force in the development of faster and bigger GPU technology which has allowed AI, ML and DL to expand so quickly. 
    * For companies wanting to analyse large volumes of data the bill can be hundreds of thousands of pounds on hardware/software alone plus the power and expertise to run it.

### Difference between a CPU and GPU?

Here is a practical demonstration of the difference between a CPU and a GPU from the MythBusters:

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('-P28LKWTzrI',  width=700, height=500)

### Using GPU in Google Colab

Google Colab allows us to switch between using a CPU and GPU for code execution! 

* To do this go to:

    `Runtime -> Change runtime type`

* and select `GPU`. `None` corresponds to a CPU. 

* We will now run our first program on a GPU, **a matrix multiplication.**

* <font color='red'>**Exercise:** Write the code for the matrix multiplication on the same line as the `%time` command, directly after it, using the `matmul` command from the numpy library:</font>

In [None]:
import tensorflow as tf
import numpy as np

# define our matrix size
mat_size = 1000

# create matrices filled with random numbers using numpy
mat1_np = np.random.rand(mat_size, mat_size)
mat2_np = np.random.rand(mat_size, mat_size)

# convert these numpy arrays to tensors for use in tensorflow
mat1_tf = tf.convert_to_tensor(mat1_np)
mat2_tf = tf.convert_to_tensor(mat2_np)

# multiply the numpy matrices and time
%time 

*  <font color='red'>**Exercise:** Now do the same thing but using the `matmul` command from tensorflow on the converted numpy matrices:</font>

In [None]:
%time

* You should see that the **wall time** reported by `%time` is much shorter when using the GPU.

## Supervised vs. Unsupervised Learning

There are two main branches of classical machine learning, called **supervised** and **unsupervised** learning.

### Supervised Learning

Consider a mathematical function of the form

$$
    y = f(x).
$$

* Conventionally we plug in the input values $x$, into the known function $f$, to calculate $y$. 
* **Supervised learning** is where you have input values, $x$, and an output value $y$, and use an algorithm to calculate $f$, the mapping function from the input to the output. 
    * **The majority of practical machine learning uses supervised learning.** The more input and output data supplied, the more accurately $f$ can be defined **in theory**.

### Unsupervised Learning

* In contrast to supervised learning, **unsupervised learning** is where you only have input data $x$ and no corresponding output variables. 
    * The goal for unsupervised learning is to model the underlying structure or distribution in the data to learn more about it. 
    * It is called **unsupervised learning** because, unlike supervised learning, there is no correct answer: the algorithms are left to their own devices, tasked with discovering and presenting interesting structure in the data.
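To make the idea of "discovering structure" concrete, here is a minimal sketch of one unsupervised algorithm, k-means clustering, written in plain Python. The algorithm choice, the data values and the starting guesses are all assumptions made for illustration; the point is only that no output values $y$ are supplied, yet the two groups in the data are still found.

```python
# Unlabelled 1-D data: two loose groups, but no "correct answers" are given
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]

# start from two guessed cluster centres
centres = [0.0, 6.0]

for _ in range(10):
    # assign each point to its nearest centre
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda k: abs(x - centres[k]))
        clusters[nearest].append(x)
    # move each centre to the mean of the points assigned to it
    centres = [sum(c) / len(c) for c in clusters]

print(centres)
```

The two centres settle on the middle of each group of points, which is the "structure" the algorithm has found without ever being told which point belongs where.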

## Deep learning: Artificial Neural Networks

A neural network, more properly referred to as an 'artificial' neural network (ANN) to distinguish it from its biological counterpart, is inspired by the biological neurons in a brain, as shown in the following image taken from [Machine learning algorithms in boiler plant root cause analysis](https://www.ee.co.za/article/application-of-machine-learning-algorithms-in-boiler-plant-root-cause-analysis.html):

<center><img src="https://raw.githubusercontent.com/adambaskerville/ProgrammingForChemists/master/images/neuron_node_comparison.jpeg" width="auto" height="auto" /></center>

* An artificial neuron takes inputs, does some mathematics with them and then produces a single output:

1. **Each input is first multiplied by a weight:**
    
$$
\begin{aligned}
    x_1 \rightarrow x_1 * w_1 \\
    x_2 \rightarrow x_2 * w_2 \\
    x_3 \rightarrow x_3 * w_3 \\
    x_4 \rightarrow x_4 * w_4 \\
    x_5 \rightarrow x_5 * w_5 \\
    x_6 \rightarrow x_6 * w_6 \\
\end{aligned}
$$

* The inputs $(x_1, x_2, \ldots , x_n)$ and weights $(w_1, w_2, \ldots, w_n)$ are real numbers and can be positive or negative.

2. **All the weighted inputs are added together and a bias is added which we will call b:**
    * We add a bias as this allows our activation function to shift left and right resulting in a better fit, similar to the 'y-intercept' in $y=mx + c$. 

$$
\text{Sum} = (x_1 * w_1) + (x_2 * w_2) + (x_3 * w_3) + (x_4 * w_4) + (x_5 * w_5) + (x_6 * w_6) + b
$$

3. **Pass the sum through an activation function:**

$$
    y = f(\text{Sum})
$$

4. The activation function is used to turn an unbounded input into a predictable one. A commonly used activation function is the **sigmoid function**:

$$
    S(x) = \frac{1}{1 + e^{-x}}
$$

5. The activation function is what decides if the artificial neuron fires. If its calculated value is higher than a threshold value it will fire; if not, it will not.

* <font color='red'>**Exercise:** Plot the sigmoid function below. What can you say the function does?</font>

In [None]:
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-5, 5, 0.05)

* The sigmoid function only outputs numbers in the range $0 \rightarrow 1$.

* Big negative numbers become ~ 0, and big positive numbers become ~ 1.

**Simple example:**

Consider a single neuron with two inputs that uses the sigmoid activation function and has the following inputs, weights and bias:

$$
\begin{aligned}
    x &= [4,5] \\
    w &= [0, 1] \\
    b &= 4
\end{aligned}
$$

* Multiply the inputs by their weights and sum them, not forgetting the bias:

$$
\begin{aligned}
     \text{Sum} &= (x_1 * w_1) + (x_2 * w_2) + b \\
     &= (4 * 0) + (5 * 1) + 4 \\
     &= 9
\end{aligned}
$$
* Now plug into our activation function:

$$
y = f(9) = \frac{1}{1 + e^{-9}} \approx 0.99988
$$

* The neuron outputs approximately 0.99988 given the input $x=[4,5]$.
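The same calculation can be checked with a short sketch in plain Python. The function names `sigmoid` and `neuron` are chosen here for illustration and are not part of any library.

```python
import math

def sigmoid(z):
    """The sigmoid activation function, S(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weight each input, sum them, add the bias, then apply the activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(total)

# the worked example: x = [4, 5], w = [0, 1], b = 4
y = neuron([4, 5], [0, 1], 4)
print(f"{y:.6f}")  # prints 0.999877
```

Changing the weights or the bias and re-running is a quick way to see how strongly each one moves the output between 0 and 1.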

### Hidden layers

* A neural network is a bunch of neurons connected together. 
    * A hidden layer is any layer between the input (first) layer and the output (last) layer.

<center><img src="https://raw.githubusercontent.com/adambaskerville/ProgrammingForChemists/master/images/neural-network-1.png" width="auto" height="auto" /></center>

### How do neural networks learn?

* Think about a person throwing a paper ball into a bin.
    * The first throw provides feedback on the mass of the paper ball, air resistance, distance to the bin, force of throw etc... If they missed the bin on their first throw, the brain will change how it conducts the second throw using the information it learned from the first throw. 
    * Neural networks **learn** in a similar way, typically by a feedback process called **backpropagation** ("backprop" for short). 
    * Backpropagation involves **comparing the output a network produces with the output it was meant to produce**, and using the difference between them to modify the weights of the connections between the layers in the network, working backward from the output layer through the hidden layers to the input layer. 
    * Given time, backpropagation causes the network to learn, reducing the difference between the actual and intended outputs with the aim of making them coincide.

* **Training a neural network = trying to minimize the difference between its current output and the expected output.**
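As a rough sketch of this feedback loop, the code below trains the simplest possible "network", a single neuron with one weight, one bias and no activation function, by gradient descent. The training data (generated from $y = 3x + 2$), the learning rate and the number of epochs are assumptions chosen for illustration. In a real multi-layer network, backpropagation is the procedure that computes the equivalent gradients layer by layer.

```python
# toy training data generated from the rule y = 3x + 2 (assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 2.0 for x in xs]

w, b = 0.0, 0.0          # initial guesses for the weight and bias
learning_rate = 0.05
losses = []

for epoch in range(2000):
    # forward pass: what the neuron currently predicts
    preds = [w * x + b for x in xs]
    # difference between the predicted and expected outputs
    errors = [p - y for p, y in zip(preds, ys)]
    # mean squared error: the quantity training tries to minimise
    losses.append(sum(e * e for e in errors) / len(xs))
    # gradients of the loss with respect to w and b
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    # nudge the parameters in the direction that reduces the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, final loss = {losses[-1]:.2e}")
```

Each pass compares the output with what it should have been and adjusts the weight and bias a little, just like the second throw of the paper ball.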

## Worked Example: Linear Regression, Beer-Lambert law

[Linear Regression](https://mathworld.wolfram.com/LinearRegression.html) is a conventional statistical method borrowed by machine learning as a **supervised** learning algorithm in which the predicted output is continuous and varies with the input at a constant slope. It is used to predict values such as sales or price, rather than to classify items into categories such as car or tractor. 
* Linear regression is one of the simplest machine learning algorithms, and many of you have encountered it before without realising it: **finding the line of best fit.** 

* Is this really machine learning? Let's check:

    * We have the input data, $x$.
    * We have the output data, $y$.
    * We calculate the gradient and point of intercept of a line, forming a function $f$, which attempts to map $x$ onto $y$  $\hspace{0.5cm}\therefore \hspace{0.5cm} y = f(x) $.

* Linear regression provides us the mapping function $f$, which we can then use to predict output values we have not explicitly calculated or measured. 
* **Lots of you have been doing machine learning without even realising it!** 

* We are going to solve a modified version of a problem some of you encountered in Excel workshop 4 of Maths and Data Analysis for Chemists.

**Question:**

UV/vis spectrophotometry is often used to determine concentrations of metal ions in solution. In order to do this, knowledge of the molar extinction coefficient is needed. This is the absorbance per unit concentration at a given wavelength. A student has made a series of standard samples of KMnO$_4$ of known concentrations, so as to determine the molar extinction coefficient using the Beer-Lambert law

$$
    A = \epsilon c l,
$$
where $A$ is the measured absorbance, $\epsilon$ is the molar extinction coefficient, $c$ is the concentration and $l$ is the path length of the spectrophotometer cell (1 cm). The spectrometer is known to suffer from a constant offset, hence the following equation should better model the experimental data:

$$
    A = \epsilon c l + A_0.
$$

Using the data, predict the absorbance when the concentration is 0.7 mM.

**Solution:**

This question is asking for a line of best fit to be fitted to the data set, noting that the equation of a straight line has the same form as the Beer-Lambert law with the offset included:

$$
    \begin{array}{cccccc}
        & \underbrace{A} & = & \underbrace{\epsilon c l} & + & \underbrace{A_0} \\
        & \downarrow && \downarrow && \downarrow\\
        & y & = & mx & + & c 
    \end{array}
$$ 
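Reading the correspondence off term by term gives a sketch of how the fitted parameters map back onto the chemistry. Here $m$ is the fitted gradient and $c_{\text{fit}}$ the fitted intercept (written with a subscript only to avoid confusion with the concentration $c$, which plays the role of $x$):

$$
\begin{aligned}
    m &= \epsilon l \quad \Rightarrow \quad \epsilon = \frac{m}{l}, \\
    c_{\text{fit}} &= A_0, \\
    A(0.7\ \text{mM}) &= m \times 0.7 + c_{\text{fit}}.
\end{aligned}
$$

With a path length of $l = 1$ cm, the molar extinction coefficient is numerically equal to the gradient, in units of mM$^{-1}$ cm$^{-1}$ when the concentration is given in mM.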

The original data set consisted of 10 data points but I have generated 200 data points for the purposes of this example, stored in `UV_vis_data.csv`. We will read this file in using pandas and fit a line of best fit to it using two techniques:

1. **No machine learning:** We use pandas to read in the `.csv` file and [`numpy.polyfit`](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html), which fits polynomials to data sets, to find the line; we specify the order of the polynomial as `1`, a straight line:

* <font color='red'>**Exercise:** Complete the following code block to find the line of best fit and predict the absorbance:</font>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns # seaborn is a plotting library built on top of matplotlib

# set a dark grid style for the plot (looks nice!)
sns.set_style("darkgrid")

# create the figure (fig) and axes (ax) objects
fig, ax = plt.subplots()

# define the column names in the data file
colnames=['concentration', 'absorbance']

# specify the location of the dataset
data_url = "https://raw.githubusercontent.com/adambaskerville/ProgrammingForChemists/master/data/UV_vis_data.csv"

# read the data file into a Pandas DataFrame and assign the column names
uv_dat = pd.read_csv(, names=colnames)

# use numpy to fit a line of best fit using 'polyfit'
m, c = np.polyfit(uv_dat['concentration'], uv_dat[''], 1)

# create a numpy array of x-values in order to plot the line of best fit
x = np.arange(uv_dat['concentration'].min(),uv_dat['concentration'].max(),  0.001)

# define how the y-values are calculated using the equation of a straight line
y = m* + c

# define the axes labels
ax.set_xlabel('Concentration / mM')
ax.set_ylabel('Absorbance / a.u.')

# plot a scatter plot of the data
plt.scatter(uv_dat['concentration'], uv_dat['absorbance'], s=5, color='blue');

# plot the line of best fit

# show the plot
plt.show()

# declare a value of x, the concentration, for the line of best fit to use and produce a corresponding value of the absorbance
conc =

# print the predicted value of absorbance at a concentration of 0.7 mM
print("predicted absorbance = ", m*conc + c)

2. **Machine learning:** Using a neural network:

    * We build the neural network with the `Sequential` class from the **Keras** sub-library.
        * **Keras** is a wrapper built on top of TensorFlow making it more accessible and easier to work with.
    * We use a single layer containing one neuron with one input -> `keras.layers.Dense(units=1, input_shape=[1])`. With no hidden layers and no activation function, this network is just a straight line, $y = wx + b$.
    * We need to **compile** our model before it can be run, specifying the loss measurement as the mean squared error.
    * We specify the optimizer, which is going to change our weights for us. Here we use the **Adam** optimizer (not me; the name is short for adaptive moment estimation).
    * We finally call the `fit` function on our model.
        * An **epoch** is not the same thing as an iteration: an epoch is a **complete pass through the training data**.
    * We plot the loss against epoch number.
    * <font color='red'>**Exercise:** We finally ask the model to predict the absorbance:</font>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd

# call the sequential model (neural network) using keras
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])

# compile the model using a mean-squared loss and adam optimizer
model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))

# We now fit the model to our data, converting them into numpy arrays as we go
history = model.fit(np.asarray(uv_dat['concentration']), np.asarray(uv_dat['absorbance']), epochs=20)

# plot the loss with increasing epoch number
plt.xlabel("Epoch Number")
plt.ylabel("Loss Magnitude")
plt.plot(history.history['loss'])
plt.show()

# declare a value of the concentration
conc = 

# use our model to predict the absorbance
print("predicted absorbance = ", model.predict(np.array([conc])))

* Both the standard linear regression method and the machine learning implementation should produce **nearly identical answers.**

* In our next example we will venture into the world of **drug discovery** and use machine learning to predict the solubility of drug molecules. First we will cover some prerequisites. 

## DeepChem: Open source machine learning for Life Science

[DeepChem](https://deepchem.io/) is a collection of open source tools for drug discovery, materials science, quantum chemistry, and biology; including **machine learning**. 

* Application of machine learning to drug discovery is **very big business** [estimated to be worth \\$591 million in 2018](https://www.fnfresearch.com/ai-for-drug-discovery-market-by-drug-type)  and is expected to reach a value of around \\$12 billion by 2027.

**Why?**

* It is estimated that bringing a new drug to market [costs major pharmaceutical companies at least \\$4 billion](https://www.forbes.com/sites/matthewherper/2012/02/10/the-truly-staggering-cost-of-inventing-new-drugs/#3d91a3bd4a94), and can take 10-15 years with less than 10% making it to market.

* Machine learning offers efficient understanding of vast amounts of chemical data, allowing for selection of the best drug candidates and predicting their possible properties; **all without even setting foot in a lab**. 
    * Drug companies like machine learning as it **can save them huge amounts of money and time.**

### Basics of DeepChem

The first thing we need to learn is how to represent a wide range of complex chemical structures as text, using SMILES: **S**implified **M**olecular-**I**nput **L**ine-**E**ntry **S**ystem. 

* These take the form of a single line notation for describing the structure of chemical species using short strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

[SMILES](https://www.rdkit.org/docs/GettingStartedInPython.html) has the following syntax rules:

1. **Atoms and Bonds:**

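* Atoms are represented by the standard abbreviation of the chemical elements. An atom must be written in square brackets when it carries a charge or explicit hydrogens, or lies outside the common organic elements; the brackets can be omitted for B, C, N, O, P, S, F, Cl, Br and I in their normal valence states. A bond is represented using one of the following symbols: 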
   
| Symbol    | Meaning                 | 
|:---------:|:------------------------|
| -         | Single bond             |   
| =         | Double bond             |
| #         | Triple bond             |
| $         | Quadruple bond          |
| :         | Aromatic bond           |
| .         | Disconnected structures |

* Examples:
        
| SMILES      |  Chemical formula  | Name            | 
|:------------|:-------------------|:----------------|
|  [NH4+]     | NH$_4^+$           | Ammonium        |   
|  [OH-]      | OH$^-$             | Hydroxide anion |
| [Na+].[Cl-] | NaCl               | Sodium chloride |
| [OH3+]      | H$_3$O$^+$         | Hydronium cation|

* Combining atomic symbols and bond symbols allows for simple chain structures to be represented. 
* Structures entered using SMILES are normally written without explicit hydrogens. 
* SMILES software knows how many connections each atom can have, and fills any remaining valences with hydrogens. 

| SMILES  |  Chemical formula  | Name            | 
|:--------|:-------------------|:----------------|
|  CC     | CH$_3$CH$_3$       | Ethane          |   
| C=C     | CH$_2$CH$_2$       | Ethene          |

2. **Branches:**

    * A branch from a chain is specified by placing the SMILES symbol(s) for the branch between parentheses `()`. 
    * The string in parentheses is placed directly after the symbol for the atom to which it is connected. 
    * If it is connected by a double or triple bond, the bond symbol immediately follows the left parenthesis.

| SMILES            |  Name            | 
|:------------------|:-----------------|
|  CC(O)C           | 2-Propanol       |   
| CC(C)CC(=O)       | 3-Methylbutanal  |
| c1c(N(=O)=O)cccc1 | Nitrobenzene     |


3. **Rings:**

    * SMILES allows a user to identify ring structures using numbers to identify the opening and closing ring atom. For example, in `C1CCCCC1`, the first carbon has a number '1' which connects by a single bond with the last carbon which also has a number '1'. 
    * Chemicals that have multiple rings may be identified by using different numbers for each ring. 
    * If a double, single, or aromatic bond is used for the ring closure, the bond symbol is placed before the ring closure number. 

| SMILES         |  Name            | 
|:---------------|:-----------------|
|  C=1CCCCC1     | Cyclohexene      |   
| c1ccccc1       | Benzene          |
| C1OC1CC        | Ethyloxirane     |
| c1cc2ccccc2cc1 | Naphthalene      | 

4. **Charged atoms:**

    * Charges on an atom can be used to override the knowledge regarding valence that is built into SMILES software. 
    * The format for identifying a charged atom consists of the atom symbol followed by the charge, both enclosed in square brackets. 

| SMILES              |  Name                               | 
|:--------------------|:------------------------------------|
|  CCC(=O)[O-1]       | Ionized form of propanoic acid      |   
| c1cccc[n+1]1CC(=O)O | 1-(Carboxymethyl)pyridinium         |
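The branch and ring rules above imply two simple bookkeeping checks: every `(` needs a matching `)`, and every ring-closure number must appear in pairs. The sketch below implements only those two checks in plain Python. The function name `smiles_is_balanced` is hypothetical, it handles single-digit ring closures only, and it is in no way a full SMILES parser; use RDKit for real validation.

```python
def smiles_is_balanced(smiles):
    """Rough sanity check: are branches and ring closures paired up?"""
    depth = 0            # how many branches "(" are currently open
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # True while inside a [...] atom such as [O-1]

    for ch in smiles:
        if ch == '[':
            in_bracket = True
        elif ch == ']':
            in_bracket = False
        elif in_bracket:
            continue     # digits here are charges or H counts, not rings
        elif ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False   # a branch was closed before it was opened
        elif ch.isdigit():
            # first sighting opens a ring, second sighting closes it
            open_rings ^= {ch}

    return depth == 0 and not open_rings and not in_bracket

# strings taken from the tables above, plus two deliberately broken ones
for s in ['CC(O)C', 'c1cc2ccccc2cc1', 'c1cccc[n+1]1CC(=O)O', 'C1CCCC', 'CC(O']:
    print(s, smiles_is_balanced(s))
```

Note how the `1` inside `[n+1]` is skipped: inside square brackets a digit describes the charge, not a ring closure.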

* <font color='red'>**Exercise:** Pick several structures and draw them using the code box below</font>:

In [None]:
import kora.install.rdkit
from rdkit import Chem
from rdkit.Chem import Draw

# put some molecular structures in a list
smiles_list = ['OCCc1ccn2cnccc12','C1CC1Oc1cc2ccncn2c1','CNC(=O)c1nccc2cccn12']

# draw the chemical structures using MolFromSmiles
mol_list = [Chem.MolFromSmiles(x) for x in smiles_list]

# use MolsToGridImage to put the images in a grid
img = Draw.MolsToGridImage(mol_list, molsPerRow=5, subImgSize=(250, 250))

# display the structures on screen
img

### Worked Example: Predicting the solubility of small molecules

* An important property of a new candidate drug is its **solubility**.
    * If it isn't soluble enough then it will be unlikely to enter a patient's bloodstream to have a therapeutic effect. 
* We will now use machine learning to build a model that predicts solubility of small molecules based on nothing but their chemical structure. 
    * We will be using the Delaney dataset from [MoleculeNet](http://moleculenet.ai/datasets-1). 
        * This dataset contains structures and **log-scale water solubility data for 1128 compounds:** 

In [None]:
# load the dataset using deepchem
import deepchem as dc

# load the Delaney dataset
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')

# unpack the pre-split training, validation and test data sets
train_dataset, valid_dataset, test_dataset = datasets

* We will use a particular kind of neural network called a **graph convolutional network**, or "graphconv" for short. 

* A **graph** in machine learning is a data structure comprising nodes (vertices) and edges connected together to represent information with no definite beginning or end. The nodes can be thought of as the atoms, while the edges represent the connections between them, the bonds.

* We specify `n_tasks=1` i.e. there is only one task, one output value (the solubility) for each sample. 
* We also specify that this is a regression model, meaning that the labels are continuous numbers and the model should try to reproduce them as accurately as possible. 
    * This is in contrast to a classification model, which tries to predict which of a fixed set of classes each sample belongs to. 
* To reduce overfitting, we specify `dropout=0.2`, meaning that 20% of the outputs from the hidden layer will randomly be set to 0:

In [None]:
model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

* We now need to train the model on the data set. We simply give it the data set and tell it how many epochs of training to perform (that is, how many complete passes through the data to make).

In [None]:
model.fit(train_dataset, nb_epoch=100)

* We should now have a trained model, but before trusting it we must evaluate it on the **test set**. 
* We do that by calling `evaluate()` on the model. 
    * For this example, we will use the **squared Pearson correlation** (`pearson_r2_score`), which is a number between 0 and 1 that indicates to what extent 2 variables are linearly related. 
    * We can evaluate it on both the training set and test set in order to test for overfitting:

In [None]:
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

print("Training set score:", model.evaluate(train_dataset, [metric], transformers))
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))

* The model is overfitting a little, but a test-set score of around 0.83 is respectable (your exact numbers will differ from run to run), so our very quick model can be said to predict the solubilities of molecules based on their molecular structures. **Not too bad for just a few lines of Python code!**

* I don't have any new molecules to hand to test our model on, so let's just use the first ten molecules from the test set. 
* For each one we print out the chemical structure (represented as a SMILES string) and the predicted solubility. 
* <font color='red'>If you had a molecule you could convert it into a SMILES format and test our model on it.</font>

In [None]:
# predict the solubility of 10 molecules from the test data
solubilities = model.predict_on_batch(test_dataset.X[:10])

# print the solubilities and the corresponding structure
for molecule, solubility in zip(test_dataset.ids, solubilities):
    print(solubility, molecule)

# draw the structures
mol_list = [Chem.MolFromSmiles(smiles) for smiles in test_dataset.ids[:10]]
img = Draw.MolsToGridImage(mol_list, molsPerRow=5)
img
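To see what the evaluation score is measuring, here is a minimal plain-Python sketch of a squared Pearson correlation, which is what the name `pearson_r2_score` suggests the metric computes. The function `pearson_r2` and the solubility numbers below are made up purely for illustration and are not taken from the Delaney dataset or from DeepChem's implementation.

```python
import math

def pearson_r2(y_true, y_pred):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)
    return r ** 2

# made-up log-solubility values, purely for illustration
measured = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.2, -1.9, -3.3, -3.8]
print(pearson_r2(measured, predicted))
```

A score close to 1 means the predictions rise and fall in step with the measurements. A training score noticeably higher than the test score is the usual sign of overfitting.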

**Complete solubility program:**

In [None]:
# import deepchem for the model and RDKit for drawing the structures
import deepchem as dc
from rdkit import Chem
from rdkit.Chem import Draw

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

model.fit(train_dataset, nb_epoch=100)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

print("Training set score:", model.evaluate(train_dataset, [metric], transformers))
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))

# predict the solubility of 10 molecules from the test data
solubilities = model.predict_on_batch(test_dataset.X[:10])

# print the solubilities and the corresponding structure
for molecule, solubility in zip(test_dataset.ids, solubilities):
    print(solubility, molecule)

# draw the structures
mol_list = [Chem.MolFromSmiles(smiles) for smiles in test_dataset.ids[:10]]
img = Draw.MolsToGridImage(mol_list, molsPerRow=5)
img

## Review

In this session we covered:

* The difference between Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL).
* The hardware required for machine learning.
* The difference between a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).
* Doing machine learning using TensorFlow in Python.
* Why linear regression is a machine learning algorithm.
* Using linear regression to model the Beer-Lambert law.
* The basics of the DeepChem library and SMILES notation.
* Writing a program to estimate the solubility of small drug molecules.

## Further Reading:

The DeepChem community has put together a series of tutorials on how to use DeepChem including a large variety of worked examples:
    
[DeepChem Tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials)