<a href="https://colab.research.google.com/github/ds4geo/ds4geo/blob/master/WS%202020%20Course%20Notes/Session%209.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Science for Geoscientists - Winter Semester 2020**
# **Session 9 - Supervised Machine Learning - 2nd December 2020**

This week we will use two supervised machine learning techniques to try to classify rock types based on their bulk chemistry: the traditional random forest method and a simple neural network.

# 9.1 Introduction to Supervised Machine Learning
* Learning classification or regression from data with labels
* Aim to learn general rules which can be applied on other data
* Many algorithms are black boxes - the "rules" are difficult or impossible to know or understand

* examples
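
As a minimal sketch of "learning from data with labels", the toy classifier below learns one average value per rock type and assigns new samples to the closest one. The SiO2 values are invented for illustration and are not taken from the database; the function names are hypothetical.

```python
# Toy labelled data: (SiO2 wt%, label). Values are invented for illustration only.
training = [(48.0, "basalt"), (50.5, "basalt"), (49.2, "basalt"),
            (71.0, "granite"), (73.5, "granite"), (69.8, "granite")]

def fit_centroids(samples):
    """Learn one "rule" per class: the mean feature value of its labelled samples."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, value):
    """Assign the class whose learned mean is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

centroids = fit_centroids(training)
print(predict(centroids, 52.0))  # a sample the model has not seen before
print(predict(centroids, 68.0))
```

Real algorithms learn far more complex rules, but the pattern is the same: fit on labelled data, then apply the result to other data.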


# 9.2 Introduction to rock geochem database
To try out some supervised ML, we use the rock geochemistry database published here:
https://essd.copernicus.org/articles/11/1553/2019/essd-11-1553-2019.pdf

We will try to use ML to predict the rock types based on the bulk chemistry.
I've already prepared and cleaned a subset of the data containing about 300,000 rock samples with rock names (top 14 only) and major element data:

https://github.com/ds4geo/ds4geo/blob/master/data/unordered/gwrgdb_maj_ele.csv?raw=true

## Inspecting the data
Below we:
* Load the data
* Do some simple visualisations
* Standardise the data and perform PCA

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
dat = pd.read_csv("https://github.com/ds4geo/ds4geo/blob/master/data/unordered/gwrgdb_maj_ele.csv?raw=true")

In [None]:
# dat.rock_name.value_counts()
fig, ax = plt.subplots(figsize=(10,7))
sns.countplot(data=dat, y="rock_name")

In [None]:
# Plot Al vs Si as scatter plot
fig, ax = plt.subplots(figsize=(12,10))
sns.scatterplot(data=dat, x="al2o3", y="sio2", hue="rock_name", ax=ax, palette="tab10")

In [None]:
# We'll want to standardise the data
from sklearn.preprocessing import StandardScaler

In [None]:
# Standardize
rdat = dat.iloc[:,1:]
sdat = StandardScaler().fit_transform(X=rdat)
sdat = pd.DataFrame(sdat, index=rdat.index, columns=rdat.columns)
sdat
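
To show what the standardisation step does, here is a small sketch on a made-up table (not the rock data): each column has its mean subtracted and is divided by its standard deviation, so every feature ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up table: 4 samples x 2 features on very different scales
X = np.array([[50.0, 1.0],
              [60.0, 2.0],
              [70.0, 3.0],
              [80.0, 4.0]])

# Manual standardisation: subtract the column mean, divide by the column standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler performs the same operation (population standard deviation, ddof=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```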

In [None]:
# Let's do a PCA to see if there's clear structure
from sklearn.decomposition import PCA

In [None]:
pca = PCA()
pca.fit(sdat)

pdat = pca.transform(sdat)
pdat = pd.DataFrame(pdat, index=sdat.index)
evr = pca.explained_variance_ratio_
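
The explained variance ratio can be easier to read as a cumulative sum. The sketch below uses synthetic random data as a stand-in for the standardised table, so the numbers it prints say nothing about the rock data; the variable names with a _demo suffix are only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 samples, 5 features,
# where the second feature is strongly correlated with the first
X = rng.normal(size=(500, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)

pca_demo = PCA().fit(X)
evr_demo = pca_demo.explained_variance_ratio_

# Cumulative share of variance captured by the first k components
cumulative = np.cumsum(evr_demo)
for k, (single, total) in enumerate(zip(evr_demo, cumulative), start=1):
    print("PC{}: {:.1f}% (cumulative {:.1f}%)".format(k, single * 100, total * 100))
```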

In [None]:
from sklearn import preprocessing
# convert rock names to integer labels - necessary later for classification
le = preprocessing.LabelEncoder()
y = le.fit_transform(dat.rock_name)
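
A short sketch of what the label encoder does, using a hypothetical list of four names rather than the database: names are mapped to integer codes, and the codes can be mapped back again.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical short list of labels, not taken from the database
names_demo = ["granite", "basalt", "granite", "andesite"]

le_demo = LabelEncoder()
codes_demo = le_demo.fit_transform(names_demo)

print(list(le_demo.classes_))   # classes are stored in sorted order
print(list(codes_demo))         # each code is the index of the name in classes_
print(list(le_demo.inverse_transform([0, 2])))  # map codes back to names
```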

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
pc2plot = (1,2)
ax.scatter(pdat.loc[:, pc2plot[0]-1], pdat.loc[:, pc2plot[1]-1], alpha=0.2, c=["C{}".format(j) for j in y])
ax.axis('equal')
ax.set_xlabel("PC{} - explained variance: {}%".format(pc2plot[0], round(evr[pc2plot[0]-1]*100,1)))
ax.set_ylabel("PC{} - explained variance: {}%".format(pc2plot[1], round(evr[pc2plot[1]-1]*100,1)))

for v, nm in zip(pca.components_[np.array(pc2plot)-1, :].T, sdat.columns):
  vec = v * 12.5
  ax.plot([0,vec[0]], [0, vec[1]],"r")
  ax.text(vec[0],vec[1], nm)

# 9.3 Machine Learning Algorithms: Random Forest
Here we try out a common machine learning classification method, random forest, and briefly explain how it works.

## ML models in python
At their simplest, all these algorithms follow the same usage pattern in Python:

In [None]:
# We'll discuss this step later
from sklearn import model_selection
xt, xv, yt, yv = model_selection.train_test_split(sdat,y, test_size=0.2)

In [None]:
from sklearn.ensemble import RandomForestClassifier
# create the model - we'll discuss the parameters later
clf = RandomForestClassifier(max_depth=15, random_state=0, verbose=1, n_jobs=5)

# train the model on input data and corresponding labels
# This step takes some minutes, so we'll let it run while we continue with the explanations
clf.fit(xt, yt)

In [None]:
# Score the model to assess its accuracy
clf.score(xt, yt)

In [None]:
# Use the model for prediction
clf.predict(xv)
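
To illustrate that the create, fit, score, predict pattern is shared between models, the sketch below runs two different classifiers through identical code. It uses synthetic data from make_classification as a stand-in for the geochemistry table, and the _demo names and parameter values are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the geochemistry data (3 classes, 8 features)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
xt_demo, xv_demo, yt_demo, yv_demo = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models_demo = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0),
}

scores_demo = {}
for name, model in models_demo.items():
    model.fit(xt_demo, yt_demo)                         # 1. train on data and labels
    scores_demo[name] = model.score(xv_demo, yv_demo)   # 2. score on held-out data
    preds = model.predict(xv_demo[:5])                  # 3. predict labels for new samples
    print(name, round(scores_demo[name], 3), preds)
```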

## Theory: Decision Trees and Random Forest
* Decision trees for classification
 * A hierarchical set of rules to classify from attributes/data

 * Rock classification example:
![](https://www.vagabondgeology.com/uploads/3/4/1/2/3412046/2367852.jpg?895)

  * Can be learnt from the data itself
  * Start with the feature/attribute which best splits the data

* Generalisation and the problem of overfitting
 * Aim is to learn the general patterns from the data
 * But ML models can end up simply memorising the training data, not learning the general patterns
 * This is overfitting
 * Decision trees are very prone to overfitting
* The solution: Random Forest
 * An ensemble of a large number of deliberately imperfect decision trees.
 * Randomly pick features/attributes at each level of the tree
 * Average results of all the decision trees in the forest

## Train, Validate, Test
Because ML models should learn general patterns from the data, it's important to see how they perform on data which they were not trained on. Standard practice is to split the available data into 2-3 groups:
1. Training data - data the model directly uses while learning.
2. Validation data - data the model uses periodically to test its own performance, but which isn't seen while learning.
3. Test data - the gold standard - data which the model has never even indirectly interacted with which provides a completely unbiased assessment of its performance.

See above: train_test_split()
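
The notebook only makes a two-way split. If all three groups are wanted, one possible approach (an assumption here, not something done in this session) is to call train_test_split() twice, as sketched below on synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 4 features, 2 classes
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 4))
y_all = rng.integers(0, 2, size=1000)

# First hold back 20% as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=0)
# ...then split the remainder into training (75%) and validation (25%)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```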

Below we compare the performance of our model on the training and validation data.

In [None]:
clf.verbose = False # turn off progress messages

In [None]:
print("train score:", round(clf.score(xt, yt)*100,1),"%")
print("test score:", round(clf.score(xv, yv)*100,1),"%")

We can see that the model performs better on the training data than on the held-out data (xv and yv, labelled "test score" above), but that it performs well on both!

This gap is an indication of slight overfitting. Dramatic overfitting shows up as very high train scores (approaching 100%) combined with much lower test scores, in the worst case no better than random guessing.
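
For comparison, the sketch below shows what dramatic overfitting can look like. It is a deliberately extreme, synthetic case: the labels are random, so there is nothing general to learn, and an unrestricted decision tree simply memorises the training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pure noise: the labels are random, so there is no general pattern to learn
X_noise = rng.normal(size=(1000, 5))
y_noise = rng.integers(0, 2, size=1000)
xt_n, xv_n, yt_n, yv_n = train_test_split(X_noise, y_noise, test_size=0.2, random_state=0)

# An unrestricted decision tree can keep splitting until every training sample is isolated
tree = DecisionTreeClassifier(random_state=0).fit(xt_n, yt_n)

train_acc = tree.score(xt_n, yt_n)
test_acc = tree.score(xv_n, yv_n)
print("train score:", round(train_acc * 100, 1), "%")
print("test score:", round(test_acc * 100, 1), "%")
```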

## Model hyperparameters
ML models have hyperparameters which define their properties; which ones are available depends on the type of model.
For random forest, two key hyperparameters are:
1. The number of trees in the forest (n_estimators) - how large the ensemble is
2. The maximum tree depth (max_depth) - how many levels of decision can occur per tree

The above example used 100 trees (the default) and a maximum tree depth of 15.
Below we try some other combinations:

In [None]:
# 10 trees with max depth of 5
clf_small = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0, verbose=1, n_jobs=5)
clf_small.fit(xt, yt)
print("train score:", round(clf_small.score(xt, yt)*100,1),"%")
print("test score:", round(clf_small.score(xv, yv)*100,1),"%")
# result very fast (only a few seconds), but much lower accuracy

In [None]:
# 50 trees with max depth of 5
clf_med1 = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0, verbose=1, n_jobs=5)
clf_med1.fit(xt, yt)
print("train score:", round(clf_med1.score(xt, yt)*100,1),"%")
print("test score:", round(clf_med1.score(xv, yv)*100,1),"%")
# More trees didn't help accuracies, but took much longer

In [None]:
# 10 trees with max depth of 15
clf_med2 = RandomForestClassifier(n_estimators=10, max_depth=15, random_state=0, verbose=1, n_jobs=5)
clf_med2.fit(xt, yt)
print("train score:", round(clf_med2.score(xt, yt)*100,1),"%")
print("test score:", round(clf_med2.score(xv, yv)*100,1),"%")
# Maximum depth seems to help accuracies a lot more, even with fewer trees, making it faster than the original case!

In [None]:
# 10 trees with max depth of 25
clf_med3 = RandomForestClassifier(n_estimators=10, max_depth=25, random_state=0, verbose=1, n_jobs=5)
clf_med3.fit(xt, yt)
print("train score:", round(clf_med3.score(xt, yt)*100,1),"%")
print("test score:", round(clf_med3.score(xv, yv)*100,1),"%")
# Increasing the maximum depth raises the test score even with few estimators, but also increases overfitting (though not badly)!

## Assessing the Model Performance - Confusion Matrix
Assessing ML model performance is itself a huge topic, but one useful approach is the confusion matrix. It compares all possible combinations of true categories and predicted categories. Correct classifications are on the diagonal. It is valuable for seeing which categories the model struggles to classify.
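
To make the layout concrete, here is a minimal hand-built confusion matrix for six invented samples (not model output). Rows are the true classes and columns are the predicted classes.

```python
# Invented true and predicted labels for six samples
true_demo = ["basalt", "basalt", "andesite", "andesite", "granite", "granite"]
pred_demo = ["basalt", "andesite", "andesite", "basalt", "granite", "granite"]

classes_demo = sorted(set(true_demo))
index = {name: i for i, name in enumerate(classes_demo)}

# Rows are true classes, columns are predicted classes
matrix = [[0] * len(classes_demo) for _ in classes_demo]
for t, p in zip(true_demo, pred_demo):
    matrix[index[t]][index[p]] += 1

print(classes_demo)
for name, row in zip(classes_demo, matrix):
    print(name.ljust(10), row)
```

In this toy case granite is always classified correctly, while andesite and basalt are confused with each other, which shows up as off-diagonal counts.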

Fortunately, sklearn provides a very easy all-in-one function to make confusion matrix plots.

Given the list of rock types, which confusions do you expect?

In [None]:
print(le.classes_)

In [None]:
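from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
fig, ax = plt.subplots(figsize=(14,14))
ConfusionMatrixDisplay.from_estimator(clf, xv, yv, normalize="true", display_labels=le.classes_, ax=ax, cmap="BuGn")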


## Hyperparameter search
It is common practice to automatically test different hyperparameter combinations to find the set which produces the most accurate model.
We try a simple search here just on the max_depth parameter.

In [None]:
# import helpful progress bar library
from tqdm.notebook import tqdm

In [None]:
train_score = []
val_score = []
depth = []
for j in tqdm(range(5,30)):
    depth.append(j)
    
    clf = RandomForestClassifier(n_estimators=5, max_depth=j, random_state=0, verbose=0, n_jobs=8)
    clf.fit(xt, yt)
    
    train_score.append(clf.score(xt, yt))
    val_score.append(clf.score(xv, yv))



In [None]:
# Compare the scores vs tree depths
fig, ax = plt.subplots(figsize=(10,7))
ax.plot(depth,train_score, label="training score")
ax.plot(depth,val_score, label="validation score")
ax.legend()

# 9.4 Neural Networks and Deep Learning

![](https://imgs.xkcd.com/comics/machine_learning.png)

Neural networks are ML models designed to work analogously to neurons in the brain. Each neuron is connected to every neuron in the layer before and the layer after and each connection has a weighting.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Neural_network_example.svg/1200px-Neural_network_example.svg.png"  height="600" />


In the simplest possible terms, a neural network learns by randomly changing all of the weights, seeing if the output is closer or further away from the true data (i.e. the labels), and iteratively refining these weights until the learning doesn't improve further.
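
To make the weights and connections concrete, the sketch below pushes one sample through a tiny network with random, untrained weights. Bias terms, the ReLU activation and the softmax output are details not covered in the text above; they are included here as assumptions about a typical setup, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.2, 0.3])      # one sample with 3 input features
W1 = rng.normal(size=(3, 4))        # weights: 3 inputs -> 4 hidden neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 2))        # weights: 4 hidden neurons -> 2 output classes
b2 = np.zeros(2)

hidden = np.maximum(0, x @ W1 + b1)  # weighted sum, then ReLU activation
logits = hidden @ W2 + b2            # weighted sum in the output layer

# Softmax turns the outputs into class probabilities that sum to 1
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
print(probs)
```

Training consists of adjusting W1, b1, W2 and b2 so that these probabilities agree with the labels as well as possible.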

We will discuss this more in class next week, but please watch the following videos to understand the general concepts:

A really light introduction (note, the "infinite classroom" learning approach here isn't a good analogy for the actual way neural networks usually learn, but it is a helpful concept to understand anyway):
https://www.youtube.com/watch?v=R9OHn5ZF4Uo

A footnote to the above video which very lightly introduces the real way neural networks learn:
https://www.youtube.com/watch?v=wvWpdrfoEv0

Taking a serious but excellent step into what is really going on:
https://www.youtube.com/watch?v=aircAruvnKk
See also the follow-up videos in that series.


sklearn makes it very easy to switch between different ML models, so we can try using a deep neural network with very few changes, and leave greater understanding of what is really going on for next week.

First we replicate the original Random Forest example as closely as possible.

Note, during training, if we set verbose=1, we see the "loss" of the model for each iteration of training. The loss is a measure of model performance like accuracy, but where lower is better. Many loss functions are available depending on the data and problem.
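
As a rough illustration of how a loss differs from accuracy, the sketch below computes the cross-entropy (log) loss for a single sample, which is the kind of loss MLPClassifier minimises. The probabilities are invented and the function name is hypothetical.

```python
import math

def cross_entropy(probabilities, true_index):
    """Loss for one sample: minus the log of the probability given to the true class."""
    return -math.log(probabilities[true_index])

# Invented predicted probabilities for three classes; the true class is index 0
confident_right = [0.90, 0.05, 0.05]
unsure = [0.40, 0.30, 0.30]
confident_wrong = [0.05, 0.90, 0.05]

for name, probs in [("confident and right", confident_right),
                    ("unsure", unsure),
                    ("confident and wrong", confident_wrong)]:
    print(name, round(cross_entropy(probs, 0), 3))
```

The first two predictions would both count as correct for accuracy, but the loss also rewards the model for being confident when it is right and penalises it heavily for being confident when it is wrong.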

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(32,), verbose=1, max_iter=50)  # (32,) is a one-element tuple: a single hidden layer of 32 neurons
mlp.fit(xt, yt)

print("train score:", mlp.score(xt, yt))
print("test score:", mlp.score(xv, yv))

## Deep learning
Deep learning is poorly defined, but roughly, it refers to neural networks with at least 2 hidden layers (i.e. excluding the input and output layers).

Let's try a very simple deep neural network.

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), verbose=1, max_iter=50)
mlp.fit(xt, yt)

print("train score:", mlp.score(xt, yt))
print("test score:", mlp.score(xv, yv))
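
To see how hidden_layer_sizes translates into an actual network, the sketch below fits a small model on synthetic data and inspects the shapes of its weight matrices through the coefs_ attribute. The data, the _demo names and the short training run are assumptions made only for this illustration.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 8 input features, 3 classes
X_nn, y_nn = make_classification(n_samples=300, n_features=8, n_informative=5,
                                 n_classes=3, random_state=0)

mlp_demo = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=20, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a short run like this is not expected to converge
    mlp_demo.fit(X_nn, y_nn)

# One weight matrix per pair of consecutive layers:
# input -> hidden 1, hidden 1 -> hidden 2, hidden 2 -> output
shapes = [w.shape for w in mlp_demo.coefs_]
print(shapes)
print("total weights:", sum(w.size for w in mlp_demo.coefs_))
```

Each extra hidden layer adds another weight matrix, which is why deeper networks have more to learn and usually take longer to train.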

Next steps decided by class

# 9.5 Main Project
To be discussed