# Introduction to machine learning assignment: a bigger classification of basalt source

In the 2006 paper

>Vermeesch, P. (2006). Tectonic discrimination of basalts with classification trees. Geochimica et Cosmochimica Acta, 70, 1839-1848. https://doi.org/10.1016/j.gca.2005.12.016

Vermeesch wrote:

> *"If a much larger database were compiled, the trees would grow and their discriminative power increase, but they would still be easy to interpret"*

In a more recent paper, Doucet et al. (2022) compiled a much larger dataset: rather than 756 basalt data points, they compiled 29,407, of which 22,005 correspond to the categories of Vermeesch (2006).

> Doucet, L. S., Tetley, M. G., Li, Z.-X., Liu, Y., & Gamaleldien, H. (2022). Geochemical fingerprinting of continental and oceanic basalts: A machine learning approach. Earth-Science Reviews, 233. https://doi.org/10.1016/j.earscirev.2022.104192

Your task in this assignment is to use the data of Doucet et al. (2022) to evaluate whether the predictive power of the classification tree approach increases with this increase in data size, as predicted by Vermeesch (2006).

## Import scientific Python libraries

In addition to the standard scientific Python libraries, a number of functions from `sklearn` will be needed as well.

```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
```

## Import data

We will import the data from Doucet et al. (2022), which are provided as their supplemental table 1.

```
Doucet_data = pd.read_csv('./data/Doucet2022.csv',header=11)
```
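The `header=11` argument tells pandas that the column names sit on the row with index 11 (the twelfth line of the file) and that the lines above it should be skipped. A small sketch with an in-memory file shows the behaviour; the preamble lines, column names, and values are made up for illustration and are not taken from the actual supplemental table.

```python
import io

import pandas as pd

# Hypothetical miniature file: two preamble lines, then the real header row.
raw = io.StringIO(
    "Supplemental table (illustrative preamble)\n"
    "Compiled for demonstration only\n"
    "type,SiO2,MgO\n"
    "MORB,50.1,7.9\n"
    "OIB,47.3,9.2\n"
)

# header=2: the row at index 2 holds the column names; rows 0 and 1 are skipped.
demo = pd.read_csv(raw, header=2)
print(demo.columns.tolist())
print(demo.shape)
```

If the column names of the real file look wrong after loading, inspecting `Doucet_data.head()` is a quick way to check that the header row was picked up as intended.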

The Doucet et al. (2022) study includes data from additional basalt types. To test Vermeesch's hypothesis, let's filter the data to samples from:

- ***Island arc basalts (IAB)*** *In the Doucet et al. dataset these are called `ARC-O` standing for oceanic arc.*
- ***Mid-ocean ridge (MORB)***
- ***Ocean-island (OIB)***

The code below filters to these types and creates a new dataframe called `basalt_data`.

```
basalt_data = Doucet_data[(Doucet_data['type']=='MORB') | (Doucet_data['type']=='OIB') | (Doucet_data['type']=='ARC-O')]
```
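After filtering, it is worth confirming that only the three intended categories remain and checking how many samples each one has, since unequal class sizes affect how an accuracy score should be read. The sketch below does this on a made-up stand-in dataframe; the `MgO` values and the `OTHER` label are invented placeholders, not values from the Doucet et al. (2022) compilation.

```python
import pandas as pd

# Toy stand-in for the full dataset; values are invented for illustration only.
toy = pd.DataFrame({
    "type": ["MORB", "OIB", "ARC-O", "OTHER", "MORB", "OTHER"],
    "MgO": [7.9, 9.2, 5.1, 6.3, 8.4, 4.7],
})

# isin() is a compact alternative to chaining == comparisons with |.
keep = ["MORB", "OIB", "ARC-O"]
toy_filtered = toy[toy["type"].isin(keep)]

# Count the samples per category to check the filter and the class balance.
counts = toy_filtered["type"].value_counts()
print(counts)
```

The same `value_counts()` call on `basalt_data['type']` would report the class sizes for the real data.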

## Build a decision tree classifier

Take the same approach that we did in class to build a decision tree classifier between the different `type` values (as they are called in the Doucet et al. (2022) data set). You will want to take these steps:

- Encode the target variable 'type' using LabelEncoder
- Split the data into features (X) and target (y)
    - When you do this split, drop the `['type','location','X1']` columns from X, as we don't want them to be part of the classification. You can drop them with this code:
    > `X = basalt_data.drop(['type','location','X1'], axis=1)`
- Impute missing values using median imputation
- Split the data into training and testing sets
- Train the decision tree classifier
- Make predictions on the test set
- Evaluate the classifier
- Plot the tree
- Get and display the feature importances from the classifier
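The steps above can be sketched end to end as follows. This is a minimal sketch on a synthetic stand-in dataframe rather than the Doucet et al. (2022) data, so the column names `feat_a` and `feat_b`, the class centers, and the `test_size` and `random_state` values are all assumptions made for illustration. It follows the order of the list (impute, then split); fitting the imputer on the training set only would be a stricter variant.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT real geochemistry): three classes, two features.
rng = np.random.default_rng(0)
n = 300
labels = rng.choice(["MORB", "OIB", "ARC-O"], size=n)
centers = {"MORB": 0.0, "OIB": 3.0, "ARC-O": 6.0}
feat_a = np.array([centers[lab] for lab in labels]) + rng.normal(0, 0.5, n)
feat_b = rng.normal(0, 1, n)  # uninformative noise feature
demo = pd.DataFrame({"type": labels, "feat_a": feat_a, "feat_b": feat_b})
demo.loc[demo.index[::10], "feat_a"] = np.nan  # add some missing values

# Encode the target labels as integers.
encoder = LabelEncoder()
y = encoder.fit_transform(demo["type"])

# Features are every column except the target.
X = demo.drop(["type"], axis=1)

# Replace missing values with the median of each column.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# Hold out 30% of the samples for testing (an assumed split size).
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.3, random_state=42
)

# Train the tree, predict on the held-out samples, and score the predictions.
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")

# Feature importances, labelled by column name and sorted from high to low.
importances = pd.Series(
    classifier.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

To draw the fitted tree, `plot_tree(classifier, feature_names=list(X.columns), class_names=list(encoder.classes_), filled=True)` can be called afterwards.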

### Setting the `max_depth`
One consideration when setting up the classifier is the `max_depth` parameter, which constrains the maximum depth of the tree. The default setting is `max_depth=None`, which means the tree keeps growing until every leaf contains a single category (or too few samples to split further). For interpretability, it can be beneficial to set a `max_depth` value like so:

```
classifier = DecisionTreeClassifier(max_depth=12)
```

Once you have your machine learning classifier working, experiment with the tradeoff of predictive accuracy that comes with decreasing the depth of the tree and try to find a balance.
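One way to explore this tradeoff is to loop over several `max_depth` values and record the training accuracy, the test accuracy, and the size of the resulting tree. The sketch below uses a synthetic dataset from `make_classification` as a stand-in, so the depths tried, the dataset parameters, and the split are assumptions for illustration rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the basalt data.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# For each depth limit, store (train accuracy, test accuracy, depth, leaf count).
results = {}
for depth in [2, 4, 8, 12, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    results[depth] = (
        clf.score(X_train, y_train),
        clf.score(X_test, y_test),
        clf.get_depth(),
        clf.get_n_leaves(),
    )
    train_acc, test_acc, actual_depth, n_leaves = results[depth]
    print(
        f"max_depth={str(depth):>4}  train={train_acc:.3f}  "
        f"test={test_acc:.3f}  depth={actual_depth}  leaves={n_leaves}"
    )
```

A large gap between training and test accuracy at high depths suggests the extra branches are fitting noise, while the leaf count gives a rough sense of how readable the plotted tree will be.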

**How does the accuracy of the decision tree based on the larger dataset from Doucet et al. (2022) compare to that using the smaller dataset from Vermeesch (2006)?**

*Write your answer here*

**What `max_depth` value do you think represents a good balance between predictive power and model complexity?**

*Write your answer here*

**What similarities and differences are there in the importance of different data fields (feature importance) between the decision tree built on the Vermeesch (2006) data compilation and the one built on the Doucet et al. (2022) data compilation?**

*Write your answer here*

## Optional extensions

- Implement a decision tree on the Doucet et al. (2022) data with all of the categories included.
- Implement other supervised machine learning algorithms on the `basalt_data`. Recall that some require normalization.
- Import the Vermeesch (2006) dataset and see how well the decision tree based on Doucet et al. (2022) does in terms of its classification. This will entail making sure that the column names and classification names are the same.
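For the algorithms that require normalization, one common pattern is to chain a scaler and the classifier in a scikit-learn pipeline so that the scaling is learned from the training data only. The sketch below assumes a k-nearest-neighbors classifier as the example algorithm and uses synthetic data with one feature deliberately put on a much larger scale; both choices are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; one column is inflated to mimic mixed units.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=1,
)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Distance-based classifier on the raw features.
raw_knn = KNeighborsClassifier(n_neighbors=5)
raw_knn.fit(X_train, y_train)
raw_acc = raw_knn.score(X_test, y_test)

# Same classifier, with features standardized inside a pipeline.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
scaled_acc = scaled_knn.score(X_test, y_test)

print(f"Unscaled accuracy: {raw_acc:.3f}")
print(f"Scaled accuracy:   {scaled_acc:.3f}")
```

Decision trees split on one feature at a time and are insensitive to feature scale, which is why this step was not needed for the main part of the assignment.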