# Analysis of Neutron Diffraction Data

This notebook can be found in the GitHub repo:
https://github.com/gcsantucci/SMC_DataChallenge_Diffraction/blob/master/figs/analysis.md

This is part of a data science challenge promoted by Oak Ridge National Lab. See more here:

https://smc-datachallenge.ornl.gov/

This particular challenge is described in more detail here:

https://smc-datachallenge.ornl.gov/2017/challenge-4/

The main idea of this challenge is to detect phase transitions in a material by looking at curves of intensity as a function of distance (the characteristic spacing of the material).
The strategy is to look for intensity peaks at a given temperature, count them, and characterize each one (area, center and width). Then, by repeating the same analysis at the adjacent temperature, we can check whether the peak structure changed due to a phase transition.

In [None]:
%matplotlib inline  
import os
import time
import h5py
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec, pylab
from scipy.optimize import curve_fit
from numpy import asarray as ar, exp  # NumPy functions; importing them through the scipy namespace is deprecated
from scipy.integrate import quad
from visuals import *
from peaks import *

Change file location accordingly:

In [None]:
path = '/Users/santucci/Dropbox/DataScience/SMC_DataChallenge'
infile = 'data/Powder_Diffraction.nxs'
infile = os.path.join(path, infile)

Read the file and inspect it:

In [None]:
f = h5py.File(infile, 'r')

In [None]:
for i in f:
    print(i)

In [None]:
for i in f['entry']:
    print(i)

In [None]:
for i in f['entry']['data']:
    print(i)

Let's define the necessary arrays that will be used in our analysis:
- ds: Since dspacing (Angstrom) is an array containing the bin edges, we define a new array containing the bin centers to plot our data.
- I: The 2D intensity array, containing all the intensity values for a given temperature T: I[T][i]
- T: The corresponding temperature (Kelvin) array.
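
As a minimal sketch of the bin-center step only, assuming the bin edges are available as a 1D NumPy array, the centers are the midpoints of consecutive edges. The `edges` values below are synthetic placeholders, not the challenge data, and the names `edges` and `centers` are hypothetical:

```python
import numpy as np

# Synthetic bin edges standing in for the dspacing array
# (assumption: N + 1 edges describe N bins).
edges = np.linspace(0.5, 4.5, 9)

# Each bin center is the midpoint of two consecutive edges,
# so there is one fewer center than there are edges.
centers = 0.5 * (edges[:-1] + edges[1:])

print(len(edges), len(centers))
print(centers[:3])
```

With centers defined this way, the array has the same length as one row of intensities, so the two can be plotted against each other directly.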

The GetPeaks function is a simplified version of a MATLAB-based peak detection algorithm (reference to be inserted).
It is one of the key parts of our algorithm. It starts by treating every point in the data as a candidate peak, then removes all points that are below a threshold (user input), and finally removes points that are too close to other peaks (minimum distance, user input).

The full algorithm can also look for valleys (by negating the data) and handle more complicated peak structures, but we have simplified the function because the structure of our data is simpler.
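
The three steps described above can be sketched as follows. This is not the actual GetPeaks implementation, which is not shown here. The function name `get_peaks_sketch`, its parameters, and the choice to resolve close peaks by keeping the tallest one are all assumptions made for illustration:

```python
import numpy as np

def get_peaks_sketch(y, threshold, min_dist):
    """Hypothetical stand-in for GetPeaks: return sorted indices of peaks in y."""
    y = np.asarray(y, dtype=float)
    # Step 1: every point starts out as a candidate peak.
    candidates = np.arange(len(y))
    # Step 2: drop candidates whose height is below the threshold.
    candidates = candidates[y[candidates] >= threshold]
    # Step 3: visit candidates from tallest to shortest and keep one only
    # if no already-kept peak lies within min_dist samples of it.
    order = candidates[np.argsort(y[candidates])[::-1]]
    kept = []
    for idx in order:
        if all(abs(idx - k) >= min_dist for k in kept):
            kept.append(int(idx))
    return np.array(sorted(kept), dtype=int)

# Synthetic curve with two Gaussian bumps centered at x = 3 and x = 7.
x = np.linspace(0, 10, 201)
y = np.exp(-(x - 3) ** 2 / (2 * 0.2 ** 2)) + 0.5 * np.exp(-(x - 7) ** 2 / (2 * 0.2 ** 2))

peaks = get_peaks_sketch(y, threshold=0.3, min_dist=20)
print(peaks, x[peaks])
```

Visiting the candidates from tallest to shortest means that each cluster of points above the threshold collapses onto its highest point, which is one simple way to realize the "too close to other peaks" rule.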

### Inspecting the data:

Let's look at a few curves I(d) at random temperatures to get familiar with the data:

In [None]:
ncols = 3
nrows = 3
nfigs = ncols * nrows

fig = plt.figure(1, figsize=(6*ncols, 5*nrows))
for ifig in range(nfigs):
    itemp = np.random.randint(len(T))
    plt.subplot(nrows, ncols, ifig+1)
    plt.title('I(d) for T = {} K'.format(round(T[itemp], 2)))  # round to 2 decimals (the 2 was previously passed to format, not round)
    plt.plot(ds,I[itemp])
    plt.xlabel('d (Angstrom)')
    plt.ylabel('Intensity')
plt.show()

## Characterization of the 3.25 A Peak

Let's study the peak between 3.2 and 3.3 A for all temperatures. Namely, let's look at the center of the peak (d spacing coordinate), the area under the peak and a typical width.
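
The fit model used in this section, `gaus_b`, comes from the `peaks` module and is not shown here. As a hedged sketch of what a Gaussian-plus-constant-background fit can look like, the example below assumes an area-normalized Gaussian with parameters (A, x0, sigma, b). The function `gauss_plus_bkg`, the synthetic data, and the way the initial guesses are built are all illustrative assumptions, not the notebook's own code:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss_plus_bkg(x, A, x0, sigma, b):
    """Gaussian of area A, center x0 and width sigma on a constant background b."""
    return A / (sigma * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * ((x - x0) / sigma) ** 2) + b

# Synthetic peak in the (3.2, 3.3) window with a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(3.2, 3.3, 60)
y = gauss_plus_bkg(x, 5.0, 3.25, 0.01, 20.0) + rng.normal(0.0, 1.0, x.size)

# Rough initial guesses taken from the data itself.
b0 = y.min()                                         # background level
x0_0 = x[np.argmax(y)]                               # position of the highest point
sigma0 = (x.max() - x.min()) / 10.0                  # a fraction of the window
A0 = (y.max() - b0) * sigma0 * np.sqrt(2.0 * np.pi)  # height converted to area

popt, pcov = curve_fit(gauss_plus_bkg, x, y, p0=[A0, x0_0, sigma0, b0])
A_fit, x0_fit, sigma_fit, b_fit = popt
print(A_fit, x0_fit, sigma_fit, b_fit)
```

With an area-normalized model, the first fitted parameter can be read directly as the area under the peak, while the center and width come from `x0` and `sigma`.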

Let's concentrate on the (3.2,3.3) A region:

In [None]:
def CharacterizePeaks(dsmin, dsmax):
    """Fit gaus_b to the peak inside (dsmin, dsmax) at every temperature."""
    # Build the d-spacing window once and reuse the mask for every temperature.
    window = (ds > dsmin) & (ds < dsmax)
    X = ds[window]
    Y = np.array([i[window] for i in I])

    AreaPeak = []
    MeanPeak = []
    WidthPeak = []
    BkgPeak = []
    AvgPeak = []
    LSEPeak = []

    for temp, y in enumerate(Y):
        # Initial guesses for the fit parameters.
        area, mean, sigma = GetSeed(X, y)
        # bkg seeds the background parameter; err is passed on to LSE.
        bkg, err = GetBkg(I[temp])
        popt, pcov = curve_fit(gaus_b, X, y, p0=[area, mean, sigma, bkg])
        A, x0, sigma, b = popt
        AreaPeak.append(A)
        MeanPeak.append(x0)
        WidthPeak.append(sigma)
        BkgPeak.append(b)
        # Mean intensity below 3.24 A (this cut is specific to the 3.25 A peak).
        AvgPeak.append(y[X < 3.24].mean())
        LSEPeak.append(LSE(X, y, A, x0, sigma, b, err))

    AreaPeak = np.array(AreaPeak)
    MeanPeak = np.array(MeanPeak)
    WidthPeak = np.array(WidthPeak)
    BkgPeak = np.array(BkgPeak)
    AvgPeak = np.array(AvgPeak)
    LSEPeak = np.array(LSEPeak)
    return X, Y, AreaPeak, MeanPeak, WidthPeak, BkgPeak, AvgPeak, LSEPeak

In [None]:
X, Y, AreaPeak, MeanPeak, WidthPeak, BkgPeak, AvgPeak, LSEPeak = CharacterizePeaks(3.2, 3.3)

After fitting a Gaussian plus background to all 3.25 A peaks (for all temperatures), we can look at how the area, center and width vary with temperature:

In [None]:
fig = plt.figure(5, figsize=(12,18))

ax1 = plt.subplot(511)
plt.plot(T, AreaPeak, 'b.-', label='Area')
plt.legend()
plt.title('Area under the 3.25 A Peak vs Temperature (K)')
plt.xlabel('Temperature (K)')
plt.ylabel('Area under the 3.25 A Peak')
plt.setp(ax1.get_xticklabels(), fontsize=10)

ax2 = plt.subplot(512, sharex=ax1)
plt.plot(T, MeanPeak, 'r.-', label='Mean')
plt.legend()
plt.title('Mean of the 3.25 A Peak vs Temperature (K)')
plt.xlabel('Temperature (K)')
plt.ylabel('Mean of the 3.25 A Peak')

ax3 = plt.subplot(513, sharex=ax1)
plt.plot(T, WidthPeak, 'k.-', label='Width')
plt.legend()
plt.title('Width of the 3.25 A Peak vs Temperature (K)')
plt.xlabel('Temperature (K)')
plt.ylabel('Width of the 3.25 A Peak')

ax4 = plt.subplot(514, sharex=ax1)
plt.plot(T, BkgPeak, 'g.-', label='Background')
plt.plot(T, AvgPeak, 'y.-', label='Average <3.24')
plt.legend()
plt.title('Background level around the 3.25 A Peak vs Temperature (K)')
plt.xlabel('Temperature (K)')
plt.ylabel('Background level around the 3.25 A Peak')

ax5 = plt.subplot(515, sharex=ax1)
plt.plot(T, LSEPeak, '.-', label='LSE')
plt.legend()
plt.title('Least Squares for the 3.25 A Peak vs Temperature (K)')
plt.xlabel('Temperature (K)')
plt.ylabel('LSE')

plt.tight_layout()
plt.show()

Let's normalize the data for visualization purposes:

In [None]:
AreaNorm = Norm(AreaPeak)
MeanNorm = Norm(MeanPeak)
WidthNorm = Norm(WidthPeak)
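
The `Norm` helper is imported from elsewhere and its definition is not shown. One common choice for putting several quantities on the same plot is min-max scaling to the range [0, 1]. The sketch below assumes that behavior, and the name `norm_minmax` is hypothetical:

```python
import numpy as np

def norm_minmax(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # A constant array has no spread; return zeros instead of dividing by zero.
        return np.zeros_like(values)
    return (values - values.min()) / span

demo = norm_minmax([2.0, 4.0, 6.0])
print(demo)
```

Scaling each quantity independently removes the units, so only the shape of each curve versus temperature can be compared, not the magnitudes.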

In [None]:
fig = plt.figure(3, figsize=(12,15))

ax1 = plt.subplot(311)
plt.plot(T, AreaNorm, 'b.-', label='Area')
plt.plot(T, MeanNorm, 'r.-', label='Mean')
plt.plot(T, WidthNorm, 'k.-', label='Width')
plt.legend()
plt.xlabel('Temperature (K)')
plt.show()

We can clearly see some interesting structure around T=150 K. Let's study this region more carefully:

In [None]:
phase = [i[0] for i in np.argwhere((T>149)&(T<151))]

In [None]:
T[phase]

In [None]:
temp = 0

fig = plt.figure(2, figsize=(8,6))
ax1 = plt.subplot(211)
plt.plot(X, Y[temp], 'b.', label='T = {}'.format(round(T[temp], 1) ))
plt.legend()
plt.title('Intensity Curves')
frame = pylab.gca()
frame.axes.get_xaxis().set_ticks([])
plt.ylabel('Intensity')

temp = 64
ax2 = plt.subplot(212)
plt.plot(X, Y[temp], 'b.', label='T = {}'.format(round(T[temp], 1)))
plt.legend()
plt.xlabel('dspacing (A)')
plt.ylabel('Intensity')

plt.subplots_adjust(hspace=.003)

plt.show()

In [None]:
temp = [0, 30, 64, 100]
for itemp in temp:
    TestFit(X, Y[itemp], round(T[itemp], 1))

We can clearly see that the 3.25 A peak structure at $T = 150$ K does not look Gaussian at all. The fit therefore fails (the chi2 and the residuals indicate a problem), and our fitted values for A, c and w (area, center and width) are not reliable. But since we are interested in finding a phase transition temperature, we can still use the simple tools developed here so far. We cannot trust the fitted area, for example, but the value itself is not important.

The important thing to keep in mind is how to find $T_{transition}$!
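
One simple way to turn this idea into a number, offered only as a sketch and not as the method used in this analysis, is to look for the largest change in a fitted parameter between adjacent temperatures. The function `largest_jump` and the synthetic width curve below are hypothetical, and the approach assumes the temperatures are sorted:

```python
import numpy as np

def largest_jump(T, values):
    """Return the midpoint temperature and size of the largest adjacent change in values."""
    T = np.asarray(T, dtype=float)
    values = np.asarray(values, dtype=float)
    # Absolute change between each pair of neighboring temperatures.
    jumps = np.abs(np.diff(values))
    k = int(np.argmax(jumps))
    # Report the temperature halfway between the two neighbors.
    return 0.5 * (T[k] + T[k + 1]), jumps[k]

# Synthetic example: a width that steps up between 150 K and 151 K.
T_demo = np.linspace(100.0, 200.0, 101)
width_demo = np.where(T_demo < 150.5, 0.010, 0.016)

t_star, jump = largest_jump(T_demo, width_demo)
print(t_star, jump)
```

Because only the change between neighbors matters, this kind of estimate does not require the fitted values themselves to be accurate, which matches the point made above about the unreliable area.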

## Peak Detection

Given a temperature T, let's find all the intensity peaks and characterize them. We find peaks with the detect_peaks function [1] and characterize them with the Gaussian fit, just like in Q2.

[1] http://nbviewer.jupyter.org/github/demotu/BMC/blob/master/notebooks/DetectPeaks.ipynb