# XAS data validation by a simple machine learning model

**Objective**: Train a computer to recognize when a measured spectrum looks like XAS data.

At my beamline [BMM](https://www.bnl.gov/nsls2/beamlines/beamline.php?r=6-BM), we try to run 24 hours per day by relying upon robust instrumentation and well-tested automation. We have ways of lining up tens of hours of data collection and letting the beamline run itself. From time to time, however, something goes wrong.  Maybe a detector has failed, maybe a sample has fallen off the sample holder, maybe the user told the automation to do the wrong thing.  Who knows?  Gremlins happen!

What we want is a simple sort of data evaluation.  My talk today is not about XAS data reduction ... nor analysis ... nor interpretation ....

The problem I want to solve here is to flag a spectrum **as its measurement completes** as being either
1. "these data are probably reasonable" or
2. "someone's attention is probably needed"

The basic observation is that this identification problem is fundamentally the same as the famous Iris Classification Problem &ndash; which is the "Hello World!" of machine learning.

## The Iris Classification Problem

In this classic problem, we work with a data set of observations of the morphology of the flowers of three species of iris:

![irises](./static/irises.png)
[(image source)](https://data-flair.training/blogs/iris-flower-classification/)

**Sepal**: One of the usually green leaflike structures composing the outermost part of a flower. Sepals often enclose and protect the bud and may remain after the fruit forms.

**Petal**: One of the often brightly colored parts of a flower immediately surrounding the reproductive organs; a division of the corolla.

Note that the shapes of the petals and sepals of these three species are different from one another.

In [None]:
import pandas
import sklearn.datasets
iris_set = sklearn.datasets.load_iris()

i = pandas.DataFrame()
i['sepal length'] = iris_set.data[:,0]
i['sepal width']  = iris_set.data[:,1]
i['petal length'] = iris_set.data[:,2]
i['petal width']  = iris_set.data[:,3]
i['target']       = iris_set.target
i['species']      = iris_set.target_names[iris_set.target]
i

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

formatter = plt.FuncFormatter(lambda i, *args: iris_set.target_names[int(i)])

plt.scatter(i['petal length'], i['petal width'], c=i['target'])
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.title("Classification: Petal measurements")
## plotting credit: http://stephanie-w.github.io/brainscribble/classification-algorithms-on-iris-dataset.html

In [None]:
plt.scatter(i['sepal length'], i['sepal width'], c=i['target'])
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.title("Classification: Sepal measurements")

In [None]:
plt.scatter(i['petal width'], i['sepal length'], c=i['target'])
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel('petal width')
plt.ylabel('sepal length')
plt.title("Classification: PW vs. SL")

Thanks to nicely contrasting colors and the human brain's penchant for finding patterns, you can look at these representations of the iris dataset and see that the species cluster according to the dimensions of their sepals and petals.

Remember that these pictures are 2-dimensional samplings of a 4-dimensional space of sepal and petal measurements.  The clustering in this data set is in a 4D manifold.

Now, imagine picking a new iris of unknown species with the intent of identifying it.  You might measure the length and width of its sepal and petal and drop the new measurement onto that 4D manifold.  The idea is to determine its species on the basis of the nearby, tagged data points.

To implement this in numbers (as opposed to human perception), we'll use an algorithm called "K Nearest Neighbors" (KNN).  To explain KNN, I'll simply plagiarize Wikipedia:

![KNN](./static/KnnClassification.svg)
[(image source)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

Quoth Wikipedia: "The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle)."
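
The voting rule in that quote can be sketched in a few lines of NumPy. This is only an illustration of the idea, not scikit-learn's implementation: the function name `knn_vote` and the point coordinates are invented here to mimic the layout of the figure.

```python
from collections import Counter

import numpy as np


def knn_vote(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every tagged point
    distances = np.linalg.norm(np.asarray(train_points) - np.asarray(query), axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up coordinates arranged like the figure: two triangles and one
# square close to the query, two more squares a little further out.
points = [(0.5, 0.0), (0.0, 0.6), (-0.7, 0.0),   # inside the inner circle
          (0.0, -1.5), (1.6, 0.0),               # inside the outer circle
          (3.0, 3.0), (-3.0, 3.0)]               # far away
labels = ['triangle', 'triangle', 'square',
          'square', 'square',
          'triangle', 'square']
query = (0.0, 0.0)

print(knn_vote(points, labels, query, k=3))
print(knn_vote(points, labels, query, k=5))
```

With these invented points, `k=3` votes for the triangles and `k=5` votes for the squares, just as in the Wikipedia figure.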

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(i[['sepal length', 'sepal width', 'petal length', 'petal width']], i['target'], random_state=0)
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)  # n_neighbors must be an integer
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Here is the documentation from scikit-learn on the iris dataset:
https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html and for more information, follow the other links cited above.
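
The score from a single train/test split depends on which flowers happen to land in the test portion. One way to get a steadier number is cross-validation, sketched below with scikit-learn's `cross_val_score`. The particular values of `k` and the choice of 5 folds are arbitrary picks for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Try a few neighborhood sizes; each score is the mean over 5 different
# train/test partitions of the 150 flowers.
for k in (1, 3, 5, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    print(f"k={k:2d}: mean accuracy {scores.mean():.3f}")
```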

So ... how does a fun botany problem relate to XAS?

## Classifying XAS spectra

In [None]:
import pandas
import numpy
import os
notebook_path = os.path.dirname(os.path.abspath("Data Validation.ipynb"))

def fetch_xas_scan(uid):
    fname = uid + '.csv'
    data = pandas.read_csv(os.path.join(notebook_path, "data", "ML_corpus", fname))
    # raw string so that \mu is not treated as an (invalid) escape sequence
    data.plot("dcm_energy", "xmu", xlabel='Energy (eV)', ylabel=r'$\mu$(E)')


Most of the data in this presentation were taken during a weekend in 2021 -- during time-of-covid, at a time when all the experiments at BMM were mail-in, and during a time when I was monitoring the beamline from home.  It being a nice Saturday, I set many hours of data collection running and then walked away.

That weekend, I was working on ceramic samples from colleagues at the University of Sheffield. We were measuring XAS on the iron, cerium, and titanium edges.

Here are some examples of reasonable data:

In [None]:
## a good one (Fe)
fetch_xas_scan('4de69926')

In [None]:
## a good one (Ce)
fetch_xas_scan('4f3c2372')

In [None]:
# a good one (Ti)
fetch_xas_scan('6916040c')

At some point during the weekend, something ... **BAD** ... happened. In truth, I don't quite remember what the problem was -- my lossy memory tells me that something weird happened with the fluorescence detector.  In any case, for something like 10 hours, garbage was measured before I finally noticed.

Here are a couple of examples of **BAD** data:

In [None]:
## a bad one
fetch_xas_scan('88b9e311')

In [None]:
## another bad one
fetch_xas_scan('64887ce3')

### Preparing the training data

This is a "supervised learning" problem.  That means that a human (me!) goes through the training data and tags each spectrum as *good* or *bad*.  

The data can be found in the `data/ML_corpus` folder. Each scan has been slurped from DataBroker, lightly reduced, then exported as a CSV file with columns for energy and $\mu$(E).  In the code block above, we used the `pandas.read_csv()` function to import the $\mu$(E) data for plotting.

I wrote a [simple shell script](./data/ML_corpus/tag.sh) that steps through each spectrum in the training set, displays a plot of each spectrum, and prompts for a score for each spectrum.

**"good data"**: score = 1 &ndash; a spectrum that looked to my human eye like it stepped up then wiggled.

**"bad data"**: score = 0 &ndash; a spectrum that looked to my human eye like it **did not** step up then wiggle.

(Side note: human tagging of a data set is tedious and error prone.  An ideal model would be tolerant of errors in the training set.)

The scoring was saved as [a JSON file](./data/ML_corpus/tags.json). Let's see what seven entries near the top of that JSON file look like:

In [None]:
import json, itertools
with open("./data/ML_corpus/tags.json") as infile:
    tags = json.load(infile)
dict(itertools.islice(tags.items(),  1, 8))

Let's do a spot check on one of the good ones (`04fed2c6.csv`) and on one of the bad ones (`0920716b.csv`):

In [None]:
fetch_xas_scan('04fed2c6') # this one is tagged as "good"

In [None]:
fetch_xas_scan('0920716b') # this one is tagged as "bad"

Great!  Now we can start constructing our training set.

First thing: we need to "rationalize" the data. The trainer expects all the data to be the same size -- for example, each observation of an iris had 4 data points (width and length of sepal and petal).  Similarly, the XAS spectra in our training set need to have the same number of energy points. Because different scans might have different numbers of energy points, we will interpolate all the data onto a 401-point grid which is evenly spaced across the energy range of the original XAS scan.

In [None]:
import numpy
def rationalize_mu(en, mu):
    '''Return energy and mu as a dataframe with data interpolated onto
    a "rationalized" grid of equally spaced points.  GRIDSIZE = 401
    '''
    GRIDSIZE = 401
    e0, e1 = en.iloc[0], en.iloc[-1]   # positional access works for any index
    ee = list(numpy.arange(e0, e1, (e1 - e0) / GRIDSIZE))
    mm = numpy.interp(ee, en, mu)
    # floating-point rounding in arange can yield one extra point; trim it
    if len(ee) > GRIDSIZE:
        ee = ee[:GRIDSIZE]
        mm = mm[:GRIDSIZE]
    df = pandas.DataFrame()
    df['dcm_energy'] = ee
    df['xmu'] = mm
    return df

def plot_rationalized_data(data, rat):
    '''Make a quick-n-dirty plot of the original data and the data interpolated onto a 401-point grid.
    '''
    # raw strings so that \mu is not treated as an (invalid) escape sequence
    data.plot("dcm_energy", "xmu", xlabel='Energy (eV)', ylabel=r'$\mu$(E)', label='original')
    ax = plt.gca()
    rat.plot("dcm_energy", "xmu", xlabel='Energy (eV)', ylabel=r'$\mu$(E)', label='rationalized', ax=ax)
    

data = pandas.read_csv(os.path.join(notebook_path, "data", "ML_corpus", '04fed2c6.csv'))
data_rational = rationalize_mu(data['dcm_energy'], data['xmu'])
plot_rationalized_data(data, data_rational)

And here's a bad one:

In [None]:
data = pandas.read_csv(os.path.join(notebook_path, "data", "ML_corpus", '0920716b.csv'))
data_rational = rationalize_mu(data['dcm_energy'], data['xmu'])
plot_rationalized_data(data, data_rational)
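
The same interpolation idea can also be written with `numpy.linspace`, which fixes the number of grid points directly and includes both ends of the energy range. The sketch below is an alternative, not the function used in this notebook: `rationalize_on_linspace` is a hypothetical name and the "scan" is a synthetic arctangent step. Its grid differs slightly from the one made by `rationalize_mu` (which stops short of the final energy), so a model trained with one should not be fed data prepared with the other.

```python
import numpy as np
import pandas as pd


def rationalize_on_linspace(energy, mu, gridsize=401):
    """Interpolate mu(E) onto `gridsize` evenly spaced energies, endpoints included."""
    energy = np.asarray(energy, dtype=float)
    mu = np.asarray(mu, dtype=float)
    grid = np.linspace(energy[0], energy[-1], gridsize)
    return pd.DataFrame({'dcm_energy': grid, 'xmu': np.interp(grid, energy, mu)})


# Synthetic stand-in for a scan with unevenly spaced energy points:
# coarse steps before the edge, fine steps through it, medium steps after.
energy = np.concatenate([np.arange(6900, 7100, 10.0),
                         np.arange(7100, 7160, 0.5),
                         np.arange(7160, 7800, 4.0)])
mu = np.arctan((energy - 7112) / 2.0) / np.pi + 0.5

rat = rationalize_on_linspace(energy, mu)
print(len(energy), '->', len(rat))
```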

Almost ready!  Now, we need to import the entire tagged learning corpus into a form ready to be consumed by the sklearn classifier.

In [None]:
csv_files = [x for x in os.listdir(os.path.join(notebook_path, "data", "ML_corpus")) if x.endswith('csv')]
corpus = []
scores = []
for f in csv_files:
    data = pandas.read_csv(os.path.join(notebook_path, "data", "ML_corpus", f))
    data_rational = rationalize_mu(data['dcm_energy'], data['xmu'])
    corpus.append(list(data_rational['xmu']))
    scores.append(tags[f])
    
clf=KNeighborsClassifier(n_neighbors=1)
X_train, X_test, y_train, y_test = train_test_split(corpus, scores, random_state=0)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

94% success on the held-out test portion of the corpus!  Not bad for an extremely naive approach to the problem.  Let's see if we can't improve upon this without having to do too much more work.
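
A single score does not say which kind of mistake the model makes. At a beamline, a bad scan that passes as good is a missed failure, while a good scan flagged as bad is a false alarm. A confusion matrix separates the two. The labels below are made up purely to show the bookkeeping; they are not results from the corpus.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only: 1 = good, 0 = bad
y_true = [1, 1, 1, 1, 0, 0, 1, 0]   # how a human tagged each scan
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]   # what a model said

# Rows are the true class, columns the predicted class, in the order [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"bad scans caught:       {tn}")
print(f"bad scans missed:       {fp}")
print(f"good scans flagged bad: {fn}")
print(f"good scans passed:      {tp}")
```

In this notebook the equivalent call would be `confusion_matrix(y_test, clf.predict(X_test))`.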

[SciKit Learn](https://scikit-learn.org/stable/index.html) comes with a rather enormous number of
[supervised learning models](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning). Let's try another one!

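A random forest is an ensemble of decision trees, each trained on a random resampling of the training data and restricted to a random subset of the features at each split. To classify a new spectrum, the forest combines the predictions of all its trees, which makes it less prone to overfitting than any single tree.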

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

98.5%!  Woot!  Now we're gettin' somewhere!

We are already starting to push up against my dilettante's knowledge of machine learning. A more informed choice of classifier can be made (and was, by Phil, in the paper a group of us here at NSLS-II just got published), but let's plow ahead using our random forest.

To get a sense of how this works, let's look at the first item in the test portion of the training set.  Let's see how I tagged it, what it looks like when plotted, and how it evaluates using the model:

In [None]:
y_test[0]

In [None]:
plt.plot(X_test[0])

In [None]:
clf.predict([X_test[0]])

The predict function returns a 1 or a 0 on the basis of its evaluation of the supplied test data.  In this case, the model and I agree about these data. Yay!

Let's try it on a spectrum not in the training set! Here's an Fe edge spectrum measured on a completely different sample from a completely different area of science which was measured over a year later:

In [None]:
unknown = pandas.read_csv(os.path.join(notebook_path, "data", "ML_unknown", "unknown_Fe.csv"))
unknown_rational = rationalize_mu(unknown['dcm_energy'], unknown['xmu'])
plot_rationalized_data(unknown, unknown_rational)

In [None]:
clf.predict([list(unknown_rational['xmu'])])

Splendid!  A visual inspection tells us that the new spectrum looks like XAS data and our model agrees!

## Using our model

Once our model is created, we can follow sklearn's hints about [model persistence](https://scikit-learn.org/stable/modules/model_persistence.html).  The model gets serialized to a [joblib](https://joblib.readthedocs.io/en/latest/persistence.html) file.  The file containing the model serialization is part of the [bluesky profile at BMM](https://github.com/NSLS-II-BMM/profile_collection).  Thus this machine learning model is always available and ready to be integrated into our operations.
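
A minimal sketch of that round trip, using `joblib.dump` and `joblib.load`, is shown here. The iris data stand in for the XAS corpus, and the file name `example_model.joblib` is a placeholder written to a temporary folder, not the name of the file in BMM's profile.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small model (iris stands in for the tagged XAS corpus)
iris = load_iris()
model = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

# Hypothetical file name in a scratch folder
path = os.path.join(tempfile.mkdtemp(), 'example_model.joblib')
joblib.dump(model, path)

# Later, possibly in another session: restore the model and use it
restored = joblib.load(path)
same = (restored.predict(iris.data) == model.predict(iris.data)).all()
print(same)
```

As the scikit-learn documentation cautions, a joblib file should be loaded with the same scikit-learn version that wrote it, and only when it comes from a trusted source.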

In practice, we compare *every* spectrum measured against our model.  The plan we use to measure an XAS spectrum includes a loop over the number of repetitions requested by the user.  As part of that loop, the data that just finished are rationalized as discussed above and scored by the model.
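
The per-repetition step might look something like the sketch below. It is not the code from BMM's profile: `evaluate_scan` and `StepDetector` are hypothetical names, the "model" is a crude stand-in so that the example runs without a trained classifier, the grid is built with `numpy.linspace` for brevity, and both spectra are synthetic.

```python
import numpy as np

GRIDSIZE = 401


class StepDetector:
    """Stand-in for the trained classifier: anything with a predict() method
    that maps rows of mu(E) values to 1 (good) or 0 (bad) would do."""

    def predict(self, rows):
        rows = np.asarray(rows)
        n = rows.shape[1] // 10
        # Difference between the mean of the last and the first tenth of each row
        step = rows[:, -n:].mean(axis=1) - rows[:, :n].mean(axis=1)
        return (step > 0.5).astype(int)


def evaluate_scan(model, energy, mu):
    """Interpolate one finished scan onto the common grid and score it."""
    grid = np.linspace(energy[0], energy[-1], GRIDSIZE)
    row = np.interp(grid, energy, mu)
    verdict = int(model.predict([row])[0])
    return verdict, ('data look fine' if verdict == 1 else 'attention needed')


# Two synthetic "repetitions": one step-like, one that is only noise
energy = np.linspace(7000, 7800, 350)
rng = np.random.default_rng(0)
scans = {
    'step-like': np.arctan((energy - 7112) / 2.0) / np.pi + 0.5,
    'noise only': rng.normal(0.0, 0.05, energy.size),
}
for name, mu in scans.items():
    print(name, evaluate_scan(StepDetector(), energy, mu))
```

In real use, the `model` argument would be the classifier restored from the joblib file, and the message would be passed along to whatever reports to users.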

At BMM, we use Slack to provide feedback to users and staff during the experiment.  In the screenshot below, you can see the result of the data evaluation for each of two repetitions on that sample.  At the end of the two repetitions, the data are merged and lightly processed, then a picture of the data is posted to Slack.

![Slack+ML](./static/slack+ml.png)

In this way, user and staff are given a hint about whether the experiment is progressing generally well or not.

## Improving the model

In practice, the model developed in this tutorial is not strong enough for general use.  Here's an example:

In [None]:
failure = pandas.read_csv(os.path.join(notebook_path, "data", "ML_unknown", "unknown_Zr.csv"))
failure_rational = rationalize_mu(failure['dcm_energy'], failure['xmu'])
plot_rationalized_data(failure, failure_rational)

In [None]:
clf.predict([list(failure_rational['xmu'])])

Whomp! Whomp!

Those Zr edge data are obviously excellent data, but the model in its current state does not recognize that.

Over time, I have tagged more data and added them to the model.  The model in use at BMM is still not perfect.

Reliability in the high 90s still means that every day, a user will ask me "Why did the data evaluation fail? What's wrong with my data?"  Sigh....
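
The arithmetic behind that sigh is simple. The figure of 100 scans per day below is an assumed round number for illustration, not a measured rate at BMM; the first two accuracies are the scores seen earlier in this notebook.

```python
def expected_mistakes(accuracy, scans_per_day):
    """Expected number of misclassified scans per day at a given accuracy."""
    return (1.0 - accuracy) * scans_per_day


# 100 scans/day is a hypothetical workload
for accuracy in (0.94, 0.985, 0.99):
    print(f"{accuracy:.1%} accurate, 100 scans/day -> "
          f"{expected_mistakes(accuracy, 100):.1f} mistakes/day")
```

Even at 99%, a busy day produces about one wrong verdict under this assumption.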

The actual implementation of this machine learning model at BMM is contained in [this file](https://github.com/NSLS-II-BMM/profile_collection/blob/master/startup/BMM/ml.py) from BMM's profile.