<figure>
  <IMG SRC="logoeost.png" WIDTH=100 ALIGN="right">
</figure>

# Classification of Seismic Sources - Random Forest Classifier


Based on, and with the courtesy of, the *"IA in geosciences"* practical by C. Hibert (28 January 2020).

Adapted for the Skience2024 workshop (generalisation to N different classes & translation) by Thomas Lecocq.

---------

In this tutorial you will see how to implement a machine learning algorithm for a discrimination/classification problem using the Python library `scikit-learn`. This library is very comprehensive and one of the most widely used for machine learning.


You will be working on seismological data, with the aim of achieving the best rate of correct identification across any number of source classes: signals generated by volcano-tectonic earthquakes, other types of volcano-generated signals, and noise samples. An algorithm that can make this discrimination on continuous data makes it possible to reconstruct chronicles of events on a volcano, which can in turn provide a better understanding of the volcano's dynamics.

The dataset we computed includes a (very) small number of labelled "events" recorded by a temporary deployment on Mount Merapi, Indonesia.

## Train & Classify!

The signals have already been transformed into a set of 58 attributes using the previous notebook.

The code block below loads the libraries that will be needed in this tutorial and loads the different data files.

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from ComputeAttributesV_MAT import get_attribute_names

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

## Example ##

#classname=('IceQ','EQ')
#plt.figure()
#plot_confusion_matrix(metrics.confusion_matrix(Classes, Y_pred),classes=classname)
#plt.show()

In [None]:
station = "GRW0"
channel = "BHZ"

classname=('VTB','MP', "gugu_short", "gugu_long", "NN", "ND")
arrays = []
for c in classname:
    fn = os.path.join("attributes", "%s.%s"%(station, channel), "%s.npy"%c)
    data = np.load(fn)
    print(c, ":", len(data), "items")
    arrays.append(data)


# A. Data preparation

We'll start by preparing the data:
  - Determine the number of events per class
  - Clean up the data: eliminate events whose attributes contain invalid values (NaN or Inf; see the functions `np.isnan` and `np.isinf`).
  - Create variables that randomly select a number _n_ of events per class for the training dataset (you can use `np.random.randint`, or `np.random.choice` with `replace=False` to avoid drawing the same event twice).
  - From these variables, create the matrix containing the attributes for the training, and the associated matrix containing the corresponding classes (function `np.concatenate`). Each class must be associated with an integer, so we arbitrarily assign a class number to each of the N event types. We'll start with a training set containing 5 events from each class.
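
The steps listed above can be tried out on a toy array first. The sketch below is only an illustration under stated assumptions: the array `toy`, its values and the label `0` are made up, and it uses NumPy's `default_rng` generator rather than the legacy `np.random` functions mentioned in the list.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute table: 6 events x 3 attributes, two rows are invalid
toy = np.array([
    [1.0, 2.0, 3.0],
    [np.nan, 1.0, 0.5],
    [0.2, np.inf, 0.1],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [0.5, 0.4, 0.3],
])

# Keep only the events whose attributes are all finite (no NaN, no Inf)
valid = np.isfinite(toy).all(axis=1)
clean = toy[valid]

# Draw n distinct events for training; the remaining ones form the test set
n = 2
picked = rng.choice(len(clean), size=n, replace=False)
train = clean[picked]
rest = np.delete(clean, picked, axis=0)

# One integer label per training event (class 0 in this toy case)
labels = np.full(len(train), 0)

print(clean.shape, train.shape, rest.shape, labels)
```

Sampling without replacement guarantees that `train` and `rest` never share an event, which matters when the test set is later used to evaluate the classifier.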
  

In [None]:
def process_arrays(arrays):
    processed_arrays = []
    
    for array in arrays:
        array = array[~np.isinf(array).any(axis=1)]
        array = array[~np.isnan(array).any(axis=1)]
        array = array[:, 0:58]
        processed_arrays.append(array)
        
    return processed_arrays

def generate_train_data(NbrofEvent, processed_arrays):
    # Draw NbrofEvent distinct row indices per class. The indices refer to
    # arr[1:] (first row skipped), the same convention used below to build
    # AttributesVal and the test set, so that the events removed from the
    # test set are exactly the ones used for training.
    rand_ids = [np.random.choice(len(arr) - 1, size=NbrofEvent, replace=False)
                for arr in processed_arrays]

    train_attrib = np.concatenate([arr[1:][rand_id, :] for arr, rand_id in zip(processed_arrays, rand_ids)], 0)

    # One integer label per class: 0 for the first array, 1 for the second, ...
    train_class = np.concatenate([np.full((len(rand_id), 1), i) for i, rand_id in enumerate(rand_ids)], 0)
    return train_attrib, train_class, rand_ids

# Clean the N per-class attribute arrays loaded above
processed_arrays = process_arrays(arrays)

# Full dataset: all classes stacked (first row of each array skipped), with integer class labels
AttributesVal = np.concatenate([arr[1:] for arr in processed_arrays], 0)
Classes = np.concatenate([np.full((len(arr)-1, 1), i) for i, arr in enumerate(processed_arrays)], 0)

NbrofEvent = 5
Train_Attrib, Train_Class, ids = generate_train_data(NbrofEvent, processed_arrays)

# Test set: the full dataset minus the rows listed in ids
Test_AttributesValRep = np.concatenate([np.delete(arr[1:], rand, 0) for arr, rand in zip(processed_arrays, ids)], 0)
Test_ClassesRep = np.concatenate([np.delete(np.full((len(arr)-1, 1), i), rand, 0) for i, (arr, rand) in enumerate(zip(processed_arrays, ids))], 0)


# B. Model training

We are now going to create and train the classifier:
  - Create a Random Forest classifier with 500 trees and store this model in a variable we'll call `clf` (see [RandomForestClassifier](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.ensemble.RandomForestClassifier.html)).
  - Train the classifier with the training dataset you created using the `fit` method.
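
As a warm-up, the two steps above can be run on synthetic data. This is a minimal sketch, not the tutorial's dataset: `make_blobs` generates a made-up stand-in for the attribute matrix, and the forest is kept small (50 trees instead of 500) so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the attribute matrix: 90 events, 5 attributes, 3 classes
X, y = make_blobs(n_samples=90, n_features=5, centers=3, random_state=0)

# Labels stored as a column vector, as produced by np.full((n, 1), i),
# are flattened with ravel() because scikit-learn expects a 1-D label array
y_column = y.reshape(-1, 1)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y_column.ravel())

predicted = forest.predict(X)
print(forest.classes_, predicted[:10])
```

Fixing `random_state` makes the run reproducible; without it, each training produces a slightly different forest.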

In [None]:
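# 500 trees; ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects
clf=RandomForestClassifier(n_estimators=500)
clf.fit(Train_Attrib,Train_Class.ravel())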

# C. Model identification and evaluation

Let's move on to using the trained model to identify the events in our dataset:

- Predict the class of all elements in the dataset using the `predict` method. Measure the quality of the classification with the `metrics.precision_score` function. Note that precision is not the same quantity as accuracy; with `average="weighted"`, the per-class precisions are averaged using the number of true events in each class as weights.
- Represent the results as a confusion matrix.
- Predict the class of only those events that were not used for training, determine the precision and represent the results as a confusion matrix (tip: `np.delete`).
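
To make the metrics concrete, here is a small worked example on hand-made labels (the two label vectors are invented for illustration). It shows how the confusion matrix, the accuracy and the weighted precision relate to each other.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for 8 events in 3 classes
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)

# Accuracy: fraction of events on the diagonal, 6/8 here
accuracy = metrics.accuracy_score(y_true, y_pred)

# Per-class precision is 3/4, 2/3 and 1/1; weighting by the true counts
# (4, 2, 2) gives (4*3/4 + 2*2/3 + 2*1) / 8 = 19/24
precision = metrics.precision_score(y_true, y_pred, average="weighted")

print(cm)
print(accuracy, precision)
```

The two scores differ (0.75 against roughly 0.79) because precision is computed column by column and then averaged, whereas accuracy only counts the diagonal.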


In [None]:
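# Class probabilities (one column per class) and hard class predictions
# for the full dataset, training events included
pred=clf.predict_proba(AttributesVal)
Y_pred=clf.predict(AttributesVal)

plt.figure()
plot_confusion_matrix(metrics.confusion_matrix(Classes, Y_pred),classes=classname)
plt.show()

# Precision averaged over classes, weighted by the number of true events per class
print(metrics.precision_score(Classes, Y_pred, average="weighted"))

# Same evaluation, restricted to the test set built in A.
Y_PredNoRep=clf.predict(Test_AttributesValRep)

plt.figure()
plot_confusion_matrix(metrics.confusion_matrix(Test_ClassesRep, Y_PredNoRep),classes=classname)
plt.show()

print(metrics.precision_score(Test_ClassesRep, Y_PredNoRep, average="weighted"))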


# D. Importance of attributes

So far we have used 58 attributes to describe the signals (full list at the end of this tutorial). When processing hundreds of thousands of events, the most time-consuming step is the transformation of the signals into attributes (spectrum calculation, etc.), so computing fewer attributes saves time. The _Random Forest_ algorithm can help choose which ones to keep, because it gives each attribute a score reflecting its importance in the discrimination process:

  - Determine the importance of each attribute in the identification process you carried out in __C.__ (attribute `feature_importances_` of the trained classifier). You can plot these values (e.g. `plt.plot`, `plt.stem`, etc.).
  - Repeat the training and identification process, using only the 10 attributes with the highest score.
  - What do you deduce?
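
The selection step can be sketched with plain NumPy. The importance vector and the matrix `X` below are invented for illustration; with a trained forest, the scores would come from its `feature_importances_` attribute.

```python
import numpy as np

# Hypothetical importance scores for 5 attributes (they sum to 1)
importances = np.array([0.05, 0.30, 0.10, 0.40, 0.15])

# argsort returns indices by increasing score, so the last k are the best k
k = 2
best = np.argsort(importances)[-k:]

# Hypothetical attribute matrix: 3 events x 5 attributes
X = np.arange(15).reshape(3, 5)

# Keep only the selected columns
X_reduced = X[:, best]

# Share of the total importance carried by the selected attributes
covered = importances[best].sum()

print(best, X_reduced.shape, covered)
```

The same index array must be applied, in the same order, to both the training and the test matrices; otherwise the columns seen at prediction time no longer match those seen during training.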

In [None]:
plt.figure(figsize=(8,12))
plt.stem(clf.feature_importances_, orientation="horizontal")
# plt.xticks(range(0,58))
plt.yticks(range(len(clf.feature_importances_)), list(get_attribute_names().values()))
plt.margins(0.01)
plt.show()

# Attribute indices sorted by increasing importance; keep the 10 most important
Indices_Att_trie=np.argsort(clf.feature_importances_)
Indices_10best=Indices_Att_trie[-10:]

Train_Attrib10=Train_Attrib[:,Indices_10best]
Train_Class10=Train_Class

# Retrain on the reduced attribute set
clf10=RandomForestClassifier(n_estimators=500)
clf10.fit(Train_Attrib10,Train_Class10.ravel())

# Evaluate on the test set, using the same 10 columns
AttributesVal10=Test_AttributesValRep[:,Indices_10best]

pred=clf10.predict_proba(AttributesVal10)
Y_pred10=clf10.predict(AttributesVal10)

print(metrics.precision_score(Test_ClassesRep, Y_pred10, average="weighted"))

In [None]:
# save the random forest for reuse
import joblib
# save
joblib.dump(clf, "./random_forest.joblib")
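
A saved model can be read back with `joblib.load`. The sketch below is a self-contained round trip on made-up data (the arrays `X` and `y` and the temporary directory are assumptions for illustration, not part of the tutorial's dataset).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 40 events, 4 attributes, 2 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Save the model to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "random_forest.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

When reusing a saved forest later, it is safest to load it with the same scikit-learn version that produced it, and to feed it attributes computed in the same way and in the same column order as during training.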