# Classification of basalt source

## Import scientific python libraries

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

import copy

from sklearn import neighbors

<img src="./images/volcanic-tectonics.png" width = 800 align = 'center'>

In this assignment you will continue your investigation of igneous geochemistry data. Igneous rocks are those that crystallize from cooling magma. Different magmas have different compositions associated with their origin, as we explored a few weeks ago. During class today, we will continue to focus on data from mafic lava flows (these are called basalts and are the relatively low-silica, high-iron end of what we looked at in week 7).

> Igneous rocks form in a wide variety of tectonic settings,
including mid-ocean ridges, ocean islands, and volcanic
arcs. It is a problem of great interest to igneous petrologists
to recover the original tectonic setting of mafic rocks of the
past. When the geological setting alone cannot unambiguously
resolve this question, the chemical composition of
these rocks might contain the answer. The major, minor,
and trace elemental composition of basalts shows large
variations, for example as a function of formation depth
(e.g., Kushiro and Kuno, 1963) --- *Vermeesch (2006)*

For this analysis we are going to use a dataset that was compiled in:

Vermeesch (2006) Tectonic discrimination of basalts with classification trees, *Geochimica et Cosmochimica Acta*  https://doi.org/10.1016/j.gca.2005.12.016

These data were grouped into 3 categories:

- 256 ***Island arc basalts (IAB)*** from the Aeolian, Izu-Bonin, Kermadec, Kurile, Lesser Antilles, Mariana, Scotia, and Tonga arcs.
- 241 ***Mid-ocean ridge (MORB)*** samples from the East Pacific Rise, Mid Atlantic Ridge, Indian Ocean, and Juan de Fuca Ridge.
- 259 ***Ocean-island (OIB)*** samples from St. Helena, the Canary, Cape Verde, Caroline, Crozet, Hawaii-Emperor, Juan Fernandez, Marquesas, Mascarene, Samoan, and Society islands.

**Let's look at the illustration above and determine where each of these settings is within a plate tectonic context.**

## Import data


The data are from the supplemental materials of the Vermeesch (2006) paper. The samples are grouped by affinity: MORB, OIB, and IAB. They are additionally assigned affinity codes and colors from the default matplotlib cycle:

| affinity | affinity code | color |
|----------|---------------|-------|
| MORB | 0 | C0 |
| OIB | 1 | C1 |
| IAB | 2 | C2 |

In [None]:
basalt_data = pd.read_csv('./data/Vermeesch2006.csv')
basalt_data.head()

In [None]:
MORB_data = basalt_data[basalt_data['affinity']=='MORB']
OIB_data = basalt_data[basalt_data['affinity']=='OIB']
IAB_data = basalt_data[basalt_data['affinity']=='IAB']

## Can geochemical data be used to classify the tectonic setting?

These data are labeled. The author already determined what setting these basalts came from. However, is there a way that we could use these labeled data to determine the setting for an unknown basalt?

A paper published in 1982 proposed that the elements titanium and vanadium were particularly good at giving insight into tectonic setting. The details of why are quite complicated and can be summarized as "the depletion of V relative to Ti is a function of the fO2 of the magma and its source, the degree of partial melting, and subsequent fractional crystallization." If you take EPS100B you will learn more about the fundamentals behind this igneous petrology. *For the moment you can consider the working hypothesis behind this classification to be that different magmatic environments have differences in oxidation states that are reflected in Ti vs V ratios.*

Shervais, J.W. (1982) Ti-V plots and the petrogenesis of modern and ophiolitic lavas *Earth and Planetary Science Letters* https://doi.org/10.1016/0012-821X(82)90120-0

### Plot TiO2 (wt%) vs V (ppm)

**Make a scatter plot of TiO2 (wt%) vs V (ppm) with the markers color-coded by affinity. Include axis labels and a legend.**

## Classification by-eye method

In order to classify the basalts by affinity based on their titanium and vanadium concentrations, we can use a classification method. The simplest one is to look at the plot and judge by eye.

The goal here is to be able to make an inference of what environment an unknown basalt formed in based on comparison to these data.

Let's say that we have three points where their affinity is unknown.
- point 1 has TiO2 of 4% and V concentration of 300 ppm
- point 2 has TiO2 of 1% and V concentration of 350 ppm
- point 3 has TiO2 of 1.9% and V concentration of 200 ppm

In [None]:
point_1_TiO2 = 4
point_1_V = 300
point_2_TiO2 = 1
point_2_V = 350
point_3_TiO2 = 1.9
point_3_V = 200

In [None]:
plt.figure(figsize=(6,6))
plt.scatter(MORB_data['TiO2 (wt%)'],MORB_data['V (ppm)'],label='mid-ocean ridge',edgecolors='black')
plt.scatter(OIB_data['TiO2 (wt%)'],OIB_data['V (ppm)'],label='ocean island',edgecolors='black')
plt.scatter(IAB_data['TiO2 (wt%)'],IAB_data['V (ppm)'],label='island arc',edgecolors='black')
plt.scatter(point_1_TiO2,point_1_V,label='unknown point 1',color='cyan',edgecolors='black',marker='d',s=100)
plt.scatter(point_2_TiO2,point_2_V,label='unknown point 2',color='magenta',edgecolors='black',marker='>',s=100)
plt.scatter(point_3_TiO2,point_3_V,label='unknown point 3',color='yellow',edgecolors='black',marker='s',s=100)
plt.xlabel('TiO2 (wt%)')
plt.ylabel('V (ppm)')
plt.legend()
plt.show()

***WRITE HOW YOU THINK THEY SHOULD BE CLASSIFIED HERE***

## Nearest Neighbors Classification

In nearest neighbors classification, the class of a query point is decided by a simple majority vote of its nearest neighbors: the point is assigned the class that has the most representatives among the nearest labeled points. The number of neighbors consulted can vary, and the votes can be weighted, for example by distance.
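To make the majority vote concrete, here is a minimal sketch written directly in NumPy. The function `knn_majority_vote` and the toy points are hypothetical and only for illustration; they are not part of the assignment data or of scikit-learn.

```python
import numpy as np
from collections import Counter

def knn_majority_vote(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labeled 'A' and 'B'
toy_points = np.array([[0.10, 0.10], [0.20, 0.15], [0.15, 0.20],
                       [0.80, 0.90], [0.90, 0.80], [0.85, 0.85]])
toy_labels = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

label_near_A = knn_majority_vote(toy_points, toy_labels, np.array([0.2, 0.2]))
label_near_B = knn_majority_vote(toy_points, toy_labels, np.array([0.7, 0.8]))
print(label_near_A, label_near_B)
```

The scikit-learn classifier used in the rest of the notebook follows the same idea, with faster neighbor searches and optional weighting.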

### Filter the data to ones that have Ti and V data

**Filter out the rows with NaN values in the `TiO2 (wt%)` or `V (ppm)` columns** (i.e. keep only the rows where neither of these is NaN; you may need `~` and `isna()`).
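One possible way to build such a filter is sketched below on a small made-up table. The dataframe `toy` and its values are hypothetical; only the column names mirror the basalt data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with the same column names as the basalt data
toy = pd.DataFrame({
    'affinity': ['MORB', 'OIB', 'IAB', 'MORB'],
    'TiO2 (wt%)': [1.5, np.nan, 0.8, 1.2],
    'V (ppm)': [300.0, 250.0, np.nan, 280.0],
})

# True for rows where either column is missing
missing_either = toy['TiO2 (wt%)'].isna() | toy['V (ppm)'].isna()

# ~ inverts the boolean mask, keeping rows where both values are present
toy_Ti_V = toy[~missing_either]

# dropna with the subset argument gives the same result
toy_Ti_V_alt = toy.dropna(subset=['TiO2 (wt%)', 'V (ppm)'])
print(toy_Ti_V)
```

Both versions keep the original row index, so the surviving rows can still be traced back to the full table.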

### Normalize the data

Nearest neighbors are found by computing distances between points. Because TiO2 and V have very different ranges (in part because of different units), you need to normalize the data so that one variable does not dominate the distance. Divide the 'TiO2 (wt%)' by the maximum 'TiO2 (wt%)' to get a value between 0 and 1. Do the same for V (ppm) as well.

**Add to your filtered dataframe a column called Ti_norm that is normalized TiO2.**

**Make a column called V_norm that is normalized vanadium.**
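To see why the normalization matters, the sketch below compares distances before and after dividing by the column maximum. The three samples are made-up values chosen for illustration, not rows from the Vermeesch data.

```python
import numpy as np
import pandas as pd

# Hypothetical samples: rows 0 and 1 have similar TiO2, row 2 has four times more
toy = pd.DataFrame({
    'TiO2 (wt%)': [1.0, 1.1, 4.0],
    'V (ppm)': [300.0, 340.0, 305.0],
})

# Distances from row 0 using the raw values: V (in ppm) dominates
raw = toy[['TiO2 (wt%)', 'V (ppm)']].values
raw_d01 = np.linalg.norm(raw[0] - raw[1])
raw_d02 = np.linalg.norm(raw[0] - raw[2])

# Divide each column by its maximum so both span a comparable range
toy['Ti_norm'] = toy['TiO2 (wt%)'] / toy['TiO2 (wt%)'].max()
toy['V_norm'] = toy['V (ppm)'] / toy['V (ppm)'].max()

# Distances from row 0 using the normalized values
norm = toy[['Ti_norm', 'V_norm']].values
norm_d01 = np.linalg.norm(norm[0] - norm[1])
norm_d02 = np.linalg.norm(norm[0] - norm[2])

print('raw:', raw_d01, raw_d02)
print('normalized:', norm_d01, norm_d02)
```

With the raw values, row 2 looks like the nearest neighbor of row 0 despite its very different TiO2; after normalization, row 1 is the closer one.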

**Make a scatter plot of Ti_norm vs V_norm that is colored by affinity.** It should look a lot like the previous scatter plots.

### Preparing arrays of the data

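**Make an n x 2 array of the normalized Ti and V values (`Ti_norm` and `V_norm`), where n is the number of data points and each row is one sample, and a length-n array of the classifications (the tectonic affinities).** scikit-learn expects one row per sample and one column per feature.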

### Define your classifier

**Construct a classifier that uses the 5 nearest neighbors (`n_neighbors=5`) and weight points by the inverse of their distance (`weights='distance'`) such that closer neighbors of a query point will have a greater influence than neighbors which are further away.**
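The effect of `weights='distance'` can be seen in a small sketch that compares it with the default `weights='uniform'`. The five training points and their labels 'A' and 'B' are hypothetical, arranged so that the two weighting schemes disagree.

```python
import numpy as np
from sklearn import neighbors

# Hypothetical training set: two 'A' points very close to the query,
# three 'B' points much farther away
toy_points = np.array([[0.50, 0.52], [0.50, 0.48],
                       [0.50, 0.90], [0.90, 0.50], [0.10, 0.50]])
toy_labels = ['A', 'A', 'B', 'B', 'B']
query = [[0.5, 0.5]]

# Uniform weights: each of the 5 neighbors gets one vote
uniform_knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
uniform_knn.fit(toy_points, toy_labels)
uniform_prediction = uniform_knn.predict(query)[0]

# Distance weights: each vote is scaled by 1/distance to the query
distance_knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance')
distance_knn.fit(toy_points, toy_labels)
distance_prediction = distance_knn.predict(query)[0]

print(uniform_prediction, distance_prediction)
```

With uniform weights the three distant 'B' points outvote the two nearby 'A' points; with inverse-distance weights the nearby points carry more influence.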

### Fit/train the classifier

**Feed the array of the data and the array of the classifications to the `.fit` function performed on the classifier object.**
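As a rough sketch of the array shapes involved, the example below fits a classifier on a made-up dataframe. The names `toy`, `features`, `labels`, and `toy_classifier` are hypothetical, and the values are invented rather than taken from the basalt data.

```python
import pandas as pd
from sklearn import neighbors

# Hypothetical normalized data with two well-separated groups
toy = pd.DataFrame({
    'Ti_norm': [0.10, 0.15, 0.20, 0.80, 0.85, 0.90],
    'V_norm':  [0.20, 0.25, 0.15, 0.70, 0.80, 0.75],
    'affinity': ['IAB', 'IAB', 'IAB', 'OIB', 'OIB', 'OIB'],
})

# One row per sample, one column per feature: shape (6, 2)
features = toy[['Ti_norm', 'V_norm']].values
# One label per sample: length 6
labels = toy['affinity'].tolist()

toy_classifier = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance')
toy_classifier.fit(features, labels)

# predict also expects a 2D array, even for a single query point
prediction = toy_classifier.predict([[0.12, 0.22]])
print(features.shape, prediction)
```

Note that `n_neighbors` cannot exceed the number of training samples, which is why this toy example has at least five rows.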

### Normalize the mystery points

In [None]:
# These maxima should match the ones used to make Ti_norm and V_norm
TiO2_max = np.max(basalt_data['TiO2 (wt%)'])
V_max = np.max(basalt_data['V (ppm)'])

point_1_TiO2_norm = point_1_TiO2/TiO2_max
point_1_V_norm = point_1_V/V_max
point_2_TiO2_norm = point_2_TiO2/TiO2_max
point_2_V_norm = point_2_V/V_max
point_3_TiO2_norm = point_3_TiO2/TiO2_max
point_3_V_norm = point_3_V/V_max

### Predict the tectonic affinity of the mystery points using the neighbors classifier

**Use `.predict` to predict the tectonic affinity of these normalized mystery points using the trained neighbors algorithm.**

### Fit/train using the basalt_affinity_code rather than the string names and use it to predict basalt_affinity_code for the mystery points
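A minimal sketch of training on integer codes instead of strings is shown below. The mapping follows the affinity code table given earlier (MORB 0, OIB 1, IAB 2); the toy dataframe, its values, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn import neighbors

# Codes from the affinity table: MORB = 0, OIB = 1, IAB = 2
affinity_to_code = {'MORB': 0, 'OIB': 1, 'IAB': 2}
code_to_affinity = {code: name for name, code in affinity_to_code.items()}

# Hypothetical normalized data, two samples per affinity
toy = pd.DataFrame({
    'Ti_norm': [0.30, 0.35, 0.70, 0.75, 0.15, 0.20],
    'V_norm':  [0.60, 0.65, 0.70, 0.65, 0.50, 0.45],
    'affinity': ['MORB', 'MORB', 'OIB', 'OIB', 'IAB', 'IAB'],
})
toy['affinity code'] = toy['affinity'].map(affinity_to_code)

toy_classifier = neighbors.KNeighborsClassifier(n_neighbors=3, weights='distance')
toy_classifier.fit(toy[['Ti_norm', 'V_norm']].values, toy['affinity code'].tolist())

# The classifier now returns integer codes rather than strings
predicted_code = toy_classifier.predict([[0.32, 0.62]])
predicted_name = code_to_affinity[int(predicted_code[0])]
print(predicted_code, predicted_name)
```

Integer codes are convenient later on because they can be reshaped into a grid and passed to a colormap.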

### Visualizing the decision boundary

**Make a 101 x 101 grid of x and y values between 0 and 1.**

In [None]:
xx, yy = np.meshgrid(np.linspace(0, 1, 101),
                     np.linspace(0, 1, 101))
grid = np.c_[xx.ravel(), yy.ravel()]

### Classify the grid

**Use `.predict` to predict the tectonic affinity of these grid points using the trained neighbors algorithm.**

**Plot the classification boundaries by plotting the grid points color-coded by their classification. Add a scatter plot of the observed (normalized) data points color-coded by their labels on top.**

## Training and testing

How good is your nearest neighbor classifier? To answer this you'll need to find out how frequently your classifications are correct.


### Making a training and testing data set

There are 514 rows with TiO2 and V data. Use a random half of them for training and the other half for testing. To do this, shuffle all the rows, take the first 257 as the training set, and the remaining 257 for testing.

In [None]:
# Make a randomly ordered dataframe from the initial one
randomized_basalt_data = basalt_data_Ti_V.sample(frac=1) 

# Take the first 257 data points to use for "training"
training_data = copy.deepcopy(randomized_basalt_data.iloc[0:257])

# Use the rest as the test set for evaluating the classifier
remaining_data = copy.deepcopy(randomized_basalt_data.iloc[257:])

In [None]:
basalt_Ti_V_training = training_data[['Ti_norm', 'V_norm']].values
basalt_Ti_V_remaining = remaining_data[['Ti_norm', 'V_norm']].values
basalt_affinity_training = training_data['affinity code'].tolist()

In [None]:
classifier_neighbors.fit(basalt_Ti_V_training, basalt_affinity_training)

### Visualize the classification regions fit with half the data

Send the grid to the classifier to see the classification regions and decision boundary that has been fit with half of the data.

In [None]:
grid_classes = classifier_neighbors.predict(grid)
grid_classes = grid_classes.reshape(xx.shape)

In [None]:
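# Colormap matching the affinity codes 0, 1, 2 (MORB, OIB, IAB)
cmap = ListedColormap(['C0', 'C1', 'C2'])
plt.figure(figsize=(6,6))
plt.pcolormesh(xx, yy, grid_classes, cmap=cmap)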
plt.xlabel('Ti_norm')
plt.ylabel('V_norm')
plt.xlim(0,1)
plt.ylim(0,1)
plt.gca().set_aspect('equal', 'box')
plt.show()

### Compare the remaining data (test data) to the classification regions

Place the test data on this graph and you can see at once that while the classifier got many of the points right, there are some mis-classified points.

In [None]:
cmap = ListedColormap(['C0', 'C1', 'C2'])
plt.figure(figsize=(6,6))
plt.pcolormesh(xx, yy, grid_classes, cmap=cmap)

plt.scatter(remaining_data['Ti_norm'],remaining_data['V_norm'],
            color=remaining_data['color'],edgecolors='black')

plt.xlabel('Ti_norm')
plt.ylabel('V_norm')
plt.xlim(0,1)
plt.ylim(0,1)
plt.gca().set_aspect('equal', 'box')
plt.show()

### Estimating the accuracy of the classifier

Since the test set was chosen randomly from the original sample, the classifier should perform with similar accuracy on the overall population. Let's calculate the success rate of the classification.

Input the remaining data (test data) to the classifier and then assign these classified affinities to a new column in pandas.

In [None]:
remaining_classes = classifier_neighbors.predict(basalt_Ti_V_remaining)

In [None]:
remaining_data['predicted_class'] = remaining_classes

In [None]:
remaining_data.head()

Now you have a new column of the classified affinities for the test data. You also have the actual affinities, given that the data were originally labeled with classifications. How often do they agree?

In [None]:
remaining_data['correct_assignment'] = remaining_data['predicted_class'].eq(remaining_data['affinity code'])
remaining_data.head()

In [None]:
remaining_data['correct_assignment'].value_counts(normalize=True) * 100

### Using scikit-learn functions to get an accuracy score of this nearest neighbor approach

Given that this approach of randomly splitting the data into training and test groups is quite common in machine learning classification, there are built-in convenience functions that can be used to more compactly do the same operations that you did above: `train_test_split` and `accuracy_score`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(basalt_Ti_V, basalt_affinity_code,train_size=0.5)

# fit the model on one set of data
classifier_neighbors.fit(X1, y1)

# evaluate the model on the second set of data
y2_model = classifier_neighbors.predict(X2)
accuracy_score(y2, y2_model)
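Because the split is random, the accuracy score changes from run to run. One way to get a feel for that variability is to repeat the split with different seeds, as in the sketch below. It uses synthetic stand-in data generated with NumPy (three made-up clusters labeled 0, 1, and 2), not the basalt data, so the scores it produces say nothing about the Ti-V classifier itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: three noisy clusters in the unit square
rng = np.random.default_rng(0)
centers = np.array([[0.3, 0.6], [0.7, 0.7], [0.2, 0.4]])
X = np.vstack([c + 0.08 * rng.standard_normal((60, 2)) for c in centers])
y = np.repeat([0, 1, 2], 60)

scores = []
for seed in range(10):
    # random_state makes each individual split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.5, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=5, weights='distance')
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

mean_score = np.mean(scores)
score_spread = np.std(scores)
print(mean_score, score_spread)
```

Reporting the mean and spread over several splits gives a steadier picture than a single score.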

## Other classification algorithms

If you go to the scikit-learn homepage you will find many available classifiers: https://scikit-learn.org/stable/index.html. They are nicely illustrated in this code from the scikit-learn documentation.

In [None]:
# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=.4, random_state=42)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
               edgecolors='k')
    # Plot the testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
               edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # Plot the testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   edgecolors='k', alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

plt.tight_layout()
plt.show()

As a word of warning, we shouldn't get too carried away. Clearly, there are complexities related to this approach (our accuracy scores aren't that high). Shervais notes that: 
> "More specific evaluation of the tectonic setting of these and other ophiolites requires
application of detailed geologic and petrologic data as well as geochemistry. The Ti/V discrimination diagram, however,
is a potentially powerful adjunct to these techniques."

Additionally, we would like to be able to assign physical processes to the classification.

## Explore other geochemical parameters of the data and build additional classifiers

**Tasks for you to complete**

- Use the seaborn library and use the sns.pairplot function to make cross-plots of other parameters (https://seaborn.pydata.org/generated/seaborn.pairplot.html)
- *Are there other geochemical parameters that you can use as a classifier that are as good or better than the Ti/V classifier?* Implement another classifier using the algorithm type of your choosing and determine its accuracy using a training set and a test set to address this question. ***scikit-learn will not be happy with missing values so filter out missing values beforehand***. ***Remember that if you are using the nearest neighbor approach that you need to normalize the data.***
- Build a classifier that uses more than 2 dimensions, as we did in class: use 3 or more parameters instead of 2. When you fit the classifier you provide an array that has:

    `[[data_a_point1,data_b_point1,data_c_point1],[data_a_point2,data_b_point2,data_c_point2]]`

    and then an array of type:

    `[point1_type, point2_type]`
    
    While we had Ti and V in the first array you could have these geochemical data and more so that instead of being n x 2, it would be n x 3 or n x 4 (where n is the number of data points and 3 or 4 is the number of geochemical parameters you use).
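A sketch of a three-parameter nearest neighbors classifier is given below. The column names `param_a`, `param_b`, and `param_c` and all of the values are hypothetical placeholders; substitute whichever geochemical columns you choose from the dataset.

```python
import pandas as pd
from sklearn import neighbors

# Hypothetical table with three parameters on very different scales
toy = pd.DataFrame({
    'param_a': [1.0, 1.2, 3.0, 3.2, 0.5, 0.6],
    'param_b': [300.0, 320.0, 250.0, 260.0, 200.0, 210.0],
    'param_c': [10.0, 12.0, 40.0, 42.0, 5.0, 6.0],
    'affinity': ['MORB', 'MORB', 'OIB', 'OIB', 'IAB', 'IAB'],
})
feature_columns = ['param_a', 'param_b', 'param_c']

# Normalize every feature column by its own maximum
column_max = toy[feature_columns].max()
X = (toy[feature_columns] / column_max).values   # shape (6, 3)
y = toy['affinity'].tolist()

clf = neighbors.KNeighborsClassifier(n_neighbors=3, weights='distance')
clf.fit(X, y)

# A new sample must be normalized with the same maxima before predicting
new_sample = pd.DataFrame({'param_a': [3.1], 'param_b': [255.0], 'param_c': [41.0]})
new_X = (new_sample[feature_columns] / column_max).values
prediction = clf.predict(new_X)
print(X.shape, prediction)
```

The only change from the two-parameter case is the number of columns in the feature array; the fit and predict calls stay the same.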

### Turn in the Notebook

**Export as HTML and upload to bCourses.**