# Introduction to machine learning: classification of basalt source

## Import scientific python libraries

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

import seaborn as sns

## Machine learning
Text from: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

Learning problems fall into a few categories:
- **supervised learning**, in which the data comes with additional attributes that we want to predict. This problem can be either:
    - *classification*: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.
    - *regression*: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.

- **unsupervised learning**, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, which is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

### Training set and testing set

Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
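
As a minimal sketch of this idea, the example below splits ten made-up samples into a training set and a testing set using scikit-learn's `train_test_split`. The arrays and the `_toy` names are illustrative assumptions and are not part of the basalt dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus a made-up label per sample
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold back 30% of the samples for testing; random_state makes the split repeatable
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

print("training samples:", len(X_toy_train))
print("testing samples:", len(X_toy_test))
```

Every sample ends up in exactly one of the two sets, so the testing set contains data the model has never seen during training.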

**Today we will focus on classification through a supervised learning approach**

*Systems doing this type of analysis are all around us. Consider a spam filter for example*

# Classifying volcanic rocks

<img src="./images/volcanic-tectonics.png" width = 600 align = 'center'>

Today we are going to continue dealing with igneous geochemistry data. Igneous rocks are those that crystallize from cooling magma. Different magmas have different compositions associated with their origin, as we explored two weeks ago. During class today, we will continue to focus on data from mafic lava flows (these are called basalts and are the relatively low silica, high iron end of what we looked at last week).

> Igneous rocks form in a wide variety of tectonic settings,
including mid-ocean ridges, ocean islands, and volcanic
arcs. It is a problem of great interest to igneous petrologists
to recover the original tectonic setting of mafic rocks of the
past. When the geological setting alone cannot unambiguously
resolve this question, the chemical composition of
these rocks might contain the answer. The major, minor,
and trace elemental composition of basalts shows large
variations, for example as a function of formation depth
(e.g., Kushiro and Kuno, 1963) --- *Vermeesch (2006)*

For this analysis we are going to use a dataset that was compiled in:

Vermeesch (2006) Tectonic discrimination of basalts with classification trees, *Geochimica et Cosmochimica Acta*  https://doi.org/10.1016/j.gca.2005.12.016

These data were grouped into 3 categories:

- 256 ***Island arc basalts (IAB)*** from the Aeolian, Izu-Bonin, Kermadec, Kurile, Lesser Antilles, Mariana, Scotia, and Tonga arcs.
- 241 ***Mid-ocean ridge (MORB)*** samples from the East Pacific Rise, Mid Atlantic Ridge, Indian Ocean, and Juan de Fuca Ridge.
- 259 ***Ocean-island (OIB)*** samples from St. Helena, the Canary, Cape Verde, Caroline, Crozet, Hawaii-Emperor, Juan Fernandez, Marquesas, Mascarene, Samoan, and Society islands.

**Let's look at the illustration above and determine where each of these settings is within a plate tectonic context**

## Import data


The data are from the supplemental materials of the Vermeesch (2006) paper. The samples are grouped by affinity: MORB, OIB, and IAB.

In [None]:
basalt_data = pd.read_csv('./data/Vermeesch2006.csv')
basalt_data.tail()

In [None]:
print(basalt_data.columns)

## Can geochemical data be used to classify the tectonic setting?

These data are labeled. The author already determined what setting these basalts came from. However, is there a way that we could use these labeled data to determine the setting for an unknown basalt?

A paper published in 1982 proposed that the elements titanium and vanadium were particularly good at giving insight into tectonic setting. The details of why are quite complicated and can be summarized as "the depletion of V relative to Ti is a function of the fO2 of the magma and its source, the degree of partial melting, and subsequent fractional crystallization." If you take EPS100B you will learn more about the fundamentals behind this igneous petrology. *For the moment you can consider the working hypothesis behind this classification to be that different magmatic environments have differences in oxidation states that are reflected in Ti vs V ratios.*

Shervais, J.W. (1982) Ti-V plots and the petrogenesis of modern and ophiolitic lavas *Earth and Planetary Science Letters* https://doi.org/10.1016/0012-821X(82)90120-0

### Plot TiO2 (wt%) vs V (ppm)

In [None]:
# Create a scatter plot colored by 'affinity'
sns.scatterplot(data=basalt_data, x='TiO2_wt_percent', y='V_ppm', hue='affinity', edgecolor='k', s=50)

# Add a legend
plt.legend(title="Affinity")

# Label the axes
plt.xlabel('TiO2 (wt%)')
plt.ylabel('V (ppm)')

# Show the plot
plt.show()

### Use the pandas groupby function to group by affinity and describe the values of one column

In [None]:
basalt_data.groupby('affinity')['TiO2_wt_percent'].describe()

**CODE FOR YOU TO WRITE: Use the groupby command and describe the grouped vanadium concentration for the data.**

*Can we differentiate between the different affinities on titanium or vanadium concentration alone?*

## Eye test classification method

In order to classify the basalts by affinity based on titanium and vanadium concentrations, we can use a classification method.

The goal here is to be able to make an inference of what environment an unknown basalt formed in based on comparison to these data.

Let's say that we have three points whose affinity is unknown.
- point 1 has TiO2 of 4% and V concentration of 300 ppm
- point 2 has TiO2 of 1% and V concentration of 350 ppm
- point 3 has TiO2 of 1.9% and V concentration of 200 ppm

**Let's take votes on how they should be classified**

***WRITE HOW YOU THINK THEY SHOULD BE CLASSIFIED HERE***

In [None]:
point_1_TiO2 = 4
point_1_V = 300
point_2_TiO2 = 1
point_2_V = 350
point_3_TiO2 = 1.9
point_3_V = 200

In [None]:
# Create a scatter plot colored by 'affinity'
sns.scatterplot(data=basalt_data, x='TiO2_wt_percent', y='V_ppm', hue='affinity', edgecolor='k', s=50)

# Plot the unknown points
plt.scatter(point_1_TiO2,point_1_V,label='unknown point 1',color='black',marker='d',s=100)
plt.scatter(point_2_TiO2,point_2_V,label='unknown point 2',color='red',marker='>',s=100)
plt.scatter(point_3_TiO2,point_3_V,label='unknown point 3',color='yellow',edgecolors='black',marker='s',s=100)

# Label the axes
plt.xlabel('TiO2 (wt%)')
plt.ylabel('V (ppm)')

# Add a legend
plt.legend(bbox_to_anchor=(1.05, 1))

# Show the plot
plt.show()

## A linear classification

An approach that has been taken in volcanic geochemistry is to draw lines to use for classification.

We will use the package scikit-learn in order to implement such a classification. Scikit-learn is a widely-used Python library for machine learning and data analysis. It provides a wide range of tools for data preprocessing, model selection, and evaluation. Its user-friendly interface and extensive documentation make it a popular choice for researchers, data analysts, and machine learning practitioners.

![scikit-learn logo](https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

We can use a tool in scikit-learn called SVC (Support Vector Classification) in order to do such a classification.

### Import scikit-learn

In [None]:
from sklearn.svm import SVC

### Define our classifier

We can use `SVC(kernel='linear')` as a classifier. The algorithm finds the best straight line (more generally called a hyperplane) to separate different groups of data.

Once the lines have been found, they can be used to predict the group of new data points based on which side of the line they fall on.

In [None]:
classifier_svc_linear = SVC(kernel='linear')
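
As a minimal sketch of what a linear SVC learns, the example below fits one to two made-up, well-separated groups of points (not the basalt data). The `_demo` names and values are illustrative assumptions. For a linear kernel, the fitted `coef_` and `intercept_` attributes describe the separating line.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up, well-separated groups of points (not the basalt data)
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                   [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

svc_demo = SVC(kernel='linear')
svc_demo.fit(X_demo, y_demo)

# For a linear kernel the boundary is the line w[0]*x + w[1]*y + b = 0
w_demo = svc_demo.coef_[0]
b_demo = svc_demo.intercept_[0]
print("weights:", w_demo, "intercept:", b_demo)

# New points are assigned by which side of the line they fall on
new_points_demo = np.array([[1.2, 1.8], [5.8, 5.2]])
print(svc_demo.predict(new_points_demo))
```

With three classes, as in the basalt data, the classifier fits several such lines and combines them to assign a class.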

### Preparing the data for classification 

We need to do a bit of prep work on the data first, as not all of the samples have both Ti (wt %) and V (ppm) values. Let's define a new dataframe `basalt_data_Ti_V` that has the rows that contain both values. This will result in us using fewer data than are in the total dataset.

In [None]:
basalt_data_Ti_V = basalt_data[(~basalt_data['TiO2_wt_percent'].isna()) & (~basalt_data['V_ppm'].isna())]

In [None]:
print('number of basalt data:')
print(len(basalt_data))
print('number of basalt data with both Ti and V:')
print(len(basalt_data_Ti_V))

In machine learning literature and code conventions, uppercase "X" is often used to represent the matrix of feature variables (predictors), while lowercase "y" is used to represent the target variable (response).

This notation is used to visually distinguish between the two variables and indicate that "X" is a matrix (usually with multiple columns for each feature), while "y" is a vector (usually with a single column representing the target variable). The uppercase "X" signifies a multi-dimensional data structure, and the lowercase "y" signifies a one-dimensional data structure.

In [None]:
# Extract the necessary features and target variable
X = basalt_data_Ti_V[['TiO2_wt_percent', 'V_ppm']]
y = basalt_data_Ti_V['affinity']

The categories that we have here are 'MORB', 'IAB', and 'OIB'. However, most machine learning algorithms require numerical inputs. Label encoding is a technique that transforms categorical labels into numerical ones. We can use the `LabelEncoder` class from `sklearn.preprocessing` to do this task for us.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode the target variable
le = LabelEncoder()
y_encoded = le.fit_transform(y)

The original 'affinity' categories were:

In [None]:
y

And they have now been transformed to an array of numbers:

In [None]:
y_encoded
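
To see which number stands for which category, we can inspect the encoder's `classes_` attribute. The sketch below uses a short made-up list of labels (the `_demo` names are illustrative assumptions). `LabelEncoder` sorts the unique labels and uses each label's position in that sorted list as its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# A short made-up list of labels (not the full dataset)
labels_demo = ['MORB', 'IAB', 'OIB', 'IAB', 'MORB']

le_demo = LabelEncoder()
encoded_demo = le_demo.fit_transform(labels_demo)

# classes_ lists the original labels; a label's position is its integer code
for code, name in enumerate(le_demo.classes_):
    print(code, name)

print(encoded_demo)
```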

### Fit/train the classifier

Now that we have `X` as a DataFrame of `['TiO2_wt_percent', 'V_ppm']` and `y_encoded` as a numerical representation of the categories, we can fit the classifier to the data.

To do this, we feed the DataFrame of the data and the array of the classifications into the `.fit` method of the classifier object.

In [None]:
classifier_svc_linear.fit(X, y_encoded)

### Visualizing the decision boundaries

Let's make a 101 x 101 grid of x and y values between 0 and the maximum values.

In [None]:
# Generate a grid of points over the feature space
xx, yy = np.meshgrid(np.linspace(0, max(basalt_data_Ti_V['TiO2_wt_percent']), 101),
                     np.linspace(0, max(basalt_data_Ti_V['V_ppm']), 101))
grid = np.c_[xx.ravel(), yy.ravel()]

In [None]:
plt.scatter(xx, yy, s=1)
plt.tight_layout()
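
To make the grid construction concrete, here is a minimal sketch with a tiny 3 x 2 grid (the `_demo` names and values are illustrative assumptions). `np.meshgrid` builds 2D arrays of x and y coordinates, and `np.c_` combined with `ravel()` turns them into a list of (x, y) pairs, one row per grid point, which is the shape that `predict` expects.

```python
import numpy as np

# A tiny grid: three x values and two y values
xx_demo, yy_demo = np.meshgrid(np.linspace(0, 2, 3), np.linspace(0, 10, 2))

# ravel() flattens each 2D array; np.c_ pairs them up as columns
grid_demo = np.c_[xx_demo.ravel(), yy_demo.ravel()]

print(xx_demo.shape)    # shape of the 2D coordinate arrays
print(grid_demo.shape)  # one row per grid point, two columns (x, y)
print(grid_demo)
```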

### Classify the grid

We can then predict the class labels for each point in the grid.

In [None]:
grid_classes = classifier_svc_linear.predict(grid)

We can now plot the classified grid along with the actual data.

In [None]:
# Reshape the predicted class labels to match the shape of the input grid
grid_classes = grid_classes.reshape(xx.shape)

cmap = ListedColormap(['C2', 'C0', 'C1'])

# Plot the decision boundary and the original data points with their labels
plt.figure()
plt.contourf(xx, yy, grid_classes, cmap=cmap, alpha=0.6)

sns.scatterplot(data=basalt_data, x='TiO2_wt_percent', y='V_ppm', hue='affinity', edgecolor='k', s=50)
plt.legend(loc='best')
plt.xlabel(X.columns[0])
plt.ylabel(X.columns[1])
plt.show()

We can now plot the unknown points onto this classified grid and see what their assignment would be.

In [None]:
# Reshape the predicted class labels to match the shape of the input grid
grid_classes = grid_classes.reshape(xx.shape)

cmap = ListedColormap(['C2', 'C0', 'C1'])

# Plot the decision boundary and the original data points with their labels
plt.figure()
plt.contourf(xx, yy, grid_classes, cmap=cmap, alpha=0.6)

sns.scatterplot(data=basalt_data, x='TiO2_wt_percent', y='V_ppm', hue='affinity', edgecolor='k', s=50)
# Plot the unknown points
plt.scatter(point_1_TiO2,point_1_V,label='unknown point 1',color='black',marker='d',s=100)
plt.scatter(point_2_TiO2,point_2_V,label='unknown point 2',color='red',marker='>',s=100)
plt.scatter(point_3_TiO2,point_3_V,label='unknown point 3',color='yellow',edgecolors='black',marker='s',s=100)

# Add a legend
plt.legend(bbox_to_anchor=(1.05, 1))

plt.xlabel(X.columns[0])
plt.ylabel(X.columns[1])
plt.show()

While we can visually see where the points fall, we can also ask the classifier to predict the classes of these unknown points using `classifier_svc_linear.predict()`. We can return the actual labels of the data (rather than the encoded numbers) by using `le.inverse_transform()`.

In [None]:
classified_points_encoded = classifier_svc_linear.predict([[point_1_TiO2,point_1_V],
                             [point_2_TiO2,point_2_V],
                             [point_3_TiO2,point_3_V]])
classified_points = le.inverse_transform(classified_points_encoded)
classified_points

### Training and testing

How good is our linear SVC classifier? To answer this we'll need to find out how frequently our classifications are correct.

**Discussion question**

*How should we determine the accuracy of this classification scheme using the data that we already have?*

### Import more `sklearn` tools

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [None]:
len(X_train)

In [None]:
len(X_test)

### Fit the model to the training data

In [None]:
classifier_svc_linear.fit(X_train, y_train)

### Make predictions on the testing data

The test set was held back from training. We can use it to evaluate the model. How often are the categorizations correct? To find out, we can predict the categories using `classifier_svc_linear.predict()` and then compare those to the actual labels using `accuracy_score()`.

In [None]:
# Make predictions on the test set
y_pred = classifier_svc_linear.predict(X_test)

# Evaluate the classifier
print("Accuracy:", accuracy_score(y_test, y_pred))

The above `accuracy_score` computes the proportion of correct predictions out of the total number of predictions. The accuracy score ranges from 0 to 1, where a higher score indicates better classification performance.
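
As a small worked example with made-up labels (the `_demo` arrays are illustrative assumptions), the accuracy can be computed by hand as the fraction of positions where the predicted label equals the true label, and this matches what `accuracy_score` returns.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for eight samples
y_true_demo = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred_demo = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Fraction of predictions that match the true labels
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sklearn_accuracy)
```

Here six of the eight predictions are correct, so both values are 0.75.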

We can make a plot that shows the classification based on the training data and then plot the test data on top of it. The fraction of points that plot within the correct classification region corresponds to the accuracy score.

In [None]:
grid_classes = classifier_svc_linear.predict(grid)

# Reshape the predicted class labels to match the shape of the input grid
grid_classes = grid_classes.reshape(xx.shape)

cmap = ListedColormap(['C2', 'C0', 'C1'])

# Plot the decision boundary and the original data points with their labels
plt.figure()
plt.contourf(xx, yy, grid_classes, cmap=cmap, alpha=0.6)

plt.scatter(X_test['TiO2_wt_percent'],X_test['V_ppm'],c=y_test,cmap=cmap)
plt.xlabel(X.columns[0])
plt.ylabel(X.columns[1])
plt.show()

## Taking a decision tree approach

While there is a nice simplicity to the linear SVC classifier approach comparing the TiO$_2$ vs V data, we have a lot more information from other aspects of the geochemistry. We might as well use that information as well.

Let's use all the data we can.

In [None]:
basalt_data.head(1)

Let's try a **Decision Trees** approach, which is another supervised machine learning algorithm for classification.

**Decision Trees** are a type of flowchart-like structure where internal nodes represent decisions based on the input features, branches represent the outcome of these decisions, and leaf nodes represent the final output or class label. The primary goal of a decision tree is to recursively split the data into subsets based on feature values that maximize the separation between the classes.

*Why use Decision Trees?*

- Easy to understand and interpret: Decision Trees are human-readable and can be visualized, making them easy to understand and interpret even for those with limited machine learning experience.
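- Minimal data preprocessing: Decision Trees do not require extensive data preprocessing, such as scaling or normalization, because each split depends only on whether a feature value falls above or below a threshold. (Note that scikit-learn's implementation still requires categorical features to be encoded as numbers.)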
- Non-linear relationships: Decision Trees can model complex, non-linear relationships between features and target variables.
- Feature importance: Decision Trees can provide insights into feature importance, helping to identify the most relevant features for the problem at hand.
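
To illustrate the flowchart-like structure, the sketch below fits a tree to a tiny made-up dataset with a single feature and prints its rules as text using `export_text` from `sklearn.tree`. The data, the `_demo` names, and the feature name `feature_a` are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: one feature, with class 0 at low values and class 1 at high values
X_tree_demo = np.array([[1.0], [2.0], [2.5], [3.5], [4.0], [5.0]])
y_tree_demo = np.array([0, 0, 0, 1, 1, 1])

tree_demo = DecisionTreeClassifier(random_state=0)
tree_demo.fit(X_tree_demo, y_tree_demo)

# Print the learned decision rules: each indented line is a node in the tree
print(export_text(tree_demo, feature_names=['feature_a']))

# Classify two new values by following the rules
print(tree_demo.predict([[2.0], [4.5]]))
```

Because a single threshold separates the two classes here, the tree needs only one decision node. Real datasets such as the basalt data lead to much deeper trees.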

### Preparing the data for the decision tree

1. Encode the target variable 'affinity' using LabelEncoder: The target variable 'affinity' contains categorical data, which needs to be encoded as numerical values for the decision tree classifier. The LabelEncoder from scikit-learn is used to transform the 'affinity' column into numerical labels.

2. Split the data into features (X) and target (y): The dataset is split into two parts, features (X) and the target variable (y). Features are the input variables that the classifier will use to make predictions, and the target variable is the output we want the classifier to predict.

3. Impute missing values using median imputation: Many classifiers cannot handle missing values in the input data (native support in scikit-learn's decision trees depends on the version). Therefore, missing values in the dataset need to be imputed (filled in) before training the classifier. Let's use median imputation, which replaces the missing values with the median of the non-missing values in the same column. We can import and use the `SimpleImputer` class.
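
Median imputation can be illustrated on a small made-up array before applying it to the basalt data (the `_demo` names and values are illustrative assumptions). Each `NaN` is replaced by the median of the non-missing values in its column, and the fitted medians are stored in the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up array with one missing value in each column
X_missing_demo = np.array([[1.0, 10.0],
                           [2.0, np.nan],
                           [np.nan, 30.0],
                           [4.0, 50.0]])

imputer_demo = SimpleImputer(strategy='median')
X_filled_demo = imputer_demo.fit_transform(X_missing_demo)

# Column medians computed from the non-missing values
print(imputer_demo.statistics_)

# The array with the missing values filled in
print(X_filled_demo)
```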

In [None]:
from sklearn.impute import SimpleImputer

# Encode the target variable 'affinity' using LabelEncoder.
# Store the result in y rather than overwriting the 'affinity' column so that
# the original text labels stay available in basalt_data for later cells.
le = LabelEncoder()
y = le.fit_transform(basalt_data['affinity'])

# Split off the features (X) by dropping the target column
X = basalt_data.drop('affinity', axis=1)

# Impute missing values using median imputation
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

### Implement the `DecisionTreeClassifier`

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.3, random_state=42)

# Train the decision tree classifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluate the classifier
print("Accuracy:", accuracy_score(y_test, y_pred))

### Plot the decision tree

In [None]:
from sklearn.tree import plot_tree

# Set up the figure and axis for the plot
fig, ax = plt.subplots(figsize=(30, 20))

# Get the original class names
class_names = le.inverse_transform(np.unique(y)).astype(str)

# Visualize the decision tree
plot_tree(classifier, filled=True, feature_names=X.columns, class_names=class_names, ax=ax)

# Save the figure
plt.savefig('decision_tree.pdf', dpi=300)

# Show the plot
plt.show()

### What aspects of the data are important for the classification?

We want to be able to readily determine what data fields are the most important for the decision tree. We can do that by determining "feature importance." The code below extracts and displays the importance score, also known as Gini importance or Mean Decrease in Impurity, which is a measure of how much a feature contributes to the decision-making process of the decision tree model.

In [None]:
# Get the feature importances from the classifier
importances = classifier.feature_importances_

# Pair the feature names with their corresponding importances
feature_importances = list(zip(X.columns, importances))

# Create a DataFrame from the feature importances
df_feature_importances = pd.DataFrame(feature_importances, columns=['Feature', 'Importance'])

# Sort the feature importances in descending order
df_feature_importances = df_feature_importances.sort_values(by='Importance', ascending=False)

# Reset the index and drop the old index column
df_feature_importances.reset_index(drop=True, inplace=True)

# Display the sorted feature importances
display(df_feature_importances)


**Discussion question**

*If we were going to build a classifier on just two variables, which should we pick?*

## Visualizing the classification using a "confusion matrix"

A confusion matrix is a table that is used to evaluate the performance of a classification algorithm. It visually displays the accuracy of a classifier by comparing its predicted labels against the true labels. The matrix consists of rows and columns that represent the true and predicted classes, respectively. Each cell in the matrix corresponds to the number of samples for a specific combination of true and predicted class labels.

The main diagonal of the matrix represents the correctly classified instances, while the off-diagonal elements represent the misclassified instances. By analyzing the confusion matrix, you can gain insights into the performance of the classifier and identify where it gets "confused."

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix using ConfusionMatrixDisplay
fig, ax = plt.subplots(figsize=(4, 4))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap=plt.cm.Blues, ax=ax, values_format='d', colorbar=False)

# Add title and labels
ax.set_title('Confusion Matrix')
ax.set_xlabel('Predicted')
ax.set_ylabel('True')

# Show the plot
plt.show()
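
To make the layout of a confusion matrix concrete, here is a small worked example with made-up labels for three classes (the `_demo` names are illustrative assumptions). Rows correspond to the true class and columns to the predicted class, so the diagonal counts the correct predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for eight samples in three classes
y_true_cm_demo = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_cm_demo = [0, 0, 1, 1, 1, 2, 0, 2]

cm_demo = confusion_matrix(y_true_cm_demo, y_pred_cm_demo)
print(cm_demo)

# Accuracy is the sum of the diagonal divided by the total number of samples
accuracy_from_cm = cm_demo.trace() / cm_demo.sum()
print(accuracy_from_cm)
```

In this example one sample of class 0 was predicted as class 1 and one sample of class 2 was predicted as class 0, which appear as the two off-diagonal entries.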


## Exploring other classification algorithms

If you go to the scikit-learn homepage you will find many available classifiers: https://scikit-learn.org/stable/index.html. They are nicely illustrated in this code from the scikit-learn documentation.

In [None]:
# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=.4, random_state=42)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
               edgecolors='k')
    # Plot the testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
               edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # Plot the testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   edgecolors='k', alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

plt.tight_layout()
plt.show()

### Normalize the data

The decision tree suggested that Ti and Sr were the biggest differentiators. Let's have a look at TiO2 (wt%) vs Sr (ppm) and apply different classification algorithms. For many of these algorithms, it is essential to normalize the data. For example, the nearest neighbors classifier relies on the distance between points. In a plot of TiO2 (wt%) vs. Sr (ppm) or V (ppm), the ranges of the x-axis and y-axis are very different (in part because of the different units), so distances would be dominated by the variable with the larger numbers. So we need to normalize the data.

We can use the `StandardScaler` class from `sklearn.preprocessing` to help us here.

Let's make a `basalt_data_Ti_Sr` dataframe and then apply the `StandardScaler` approach.

In [None]:
basalt_data_Ti_Sr = basalt_data[(~basalt_data['TiO2_wt_percent'].isna()) & (~basalt_data['Sr_ppm'].isna())]

In [None]:
from sklearn.preprocessing import StandardScaler

# Extract the necessary features and target variable
X = basalt_data_Ti_Sr[['TiO2_wt_percent', 'Sr_ppm']]
y = basalt_data_Ti_Sr['affinity']

# Encode the target variable 'affinity' using LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the scaler
scaler = StandardScaler()

# Fit the scaler to the training data and transform it
X_train_normalized = scaler.fit_transform(X_train)

# Transform the testing data using the fitted scaler
X_test_normalized = scaler.transform(X_test)
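
To see why scaling matters for distance-based classifiers, the sketch below uses made-up values on very different scales (loosely analogous to wt% vs ppm; the `_demo` names and numbers are illustrative assumptions). Before scaling, the distance between two samples is almost entirely set by the large-valued column. After `StandardScaler`, each column has a mean of 0 and a standard deviation of 1, so both contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales (think wt% vs ppm)
X_scale_demo = np.array([[1.0, 100.0],
                         [2.0, 300.0],
                         [3.0, 500.0]])

scaler_demo = StandardScaler()
X_scaled_demo = scaler_demo.fit_transform(X_scale_demo)

# Distance between the first two samples before and after scaling
print(np.linalg.norm(X_scale_demo[0] - X_scale_demo[1]))
print(np.linalg.norm(X_scaled_demo[0] - X_scaled_demo[1]))

# After scaling, each column has mean 0 and standard deviation 1
print(X_scaled_demo.mean(axis=0), X_scaled_demo.std(axis=0))
```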

### Apply different classifiers

Now that we have normalized the data, we can apply classifiers. I have put in the `KNeighborsClassifier(3)`. **What are the strengths and weaknesses of this classifier?**

*write your answer here*

Play around with switching the classifier to other ones from the example above. **What would be the best classifier to use?**

*write your answer here*

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Play around with changing KNeighborsClassifier(3) to different classifiers
classifier = KNeighborsClassifier(3)
classifier.fit(X_train_normalized, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test_normalized)

# Evaluate the classifier
print("Accuracy:", accuracy_score(y_test, y_pred))

# Create a meshgrid 
h = 0.02  # Mesh grid step size
x_min, x_max = X_test_normalized[:, 0].min() - 1, X_test_normalized[:, 0].max() + 1
y_min, y_max = X_test_normalized[:, 1].min() - 1, X_test_normalized[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Classify the grid points using the classifier
grid = np.c_[xx.ravel(), yy.ravel()]
grid_classes = classifier.predict(grid)

# Reshape the predicted class labels to match the shape of the input grid
grid_classes = grid_classes.reshape(xx.shape)

# Plot the decision boundary and the test data points
cmap = ListedColormap(['C2', 'C0', 'C1'])
plt.figure()
plt.contourf(xx, yy, grid_classes, cmap=cmap, alpha=0.6)
plt.scatter(X_test_normalized[:, 0], X_test_normalized[:, 1], c=y_test, cmap=cmap, edgecolors='k', marker='o', s=50)

# Add a legend
cbar = plt.colorbar(ticks=[0.375, 1., 1.625])
cbar.set_ticklabels(le.inverse_transform([0, 1, 2]))

plt.xlabel('Normalized TiO2_wt_percent')
plt.ylabel('Normalized Sr_ppm')
plt.title('Classifier (Test Data with Decision Boundary)')
plt.show()

## A word of warning

As a word of warning, we shouldn't get too carried away. Clearly, there are complexities related to this approach (our accuracy scores aren't that high). There are other types of contextual data that can give insight. For example, Shervais (1982) notes that: 
> "More specific evaluation of the tectonic setting of these and other ophiolites requires
application of detailed geologic and petrologic data as well as geochemistry. The Ti/V discrimination diagram, however,
is a potentially powerful adjunct to these techniques."

Additionally, we would like to be able to assign physical processes to any classification given that we are seeking insight into how the Earth works.

Vermeesch (2006) similarly cautions that:
> "no classification method based solely on geochemical
data will ever be able to perfectly determine the
tectonic affinity of basaltic rocks (or other rocks for that
matter) simply because there is a lot of actual overlap between
the geochemistry of the different tectonic settings.
Notably IABs have a much wider range of compositions
than either MORBs or OIBs. Therefore, geochemical classification
should never be the only basis for determining
tectonic affinity. This is especially the case for rocks that
have undergone alteration. In such cases, mobile elements
such as Sr, which have great discriminative power, cannot
be used."