<a href="https://colab.research.google.com/github/jemappellelisa/datasciencecoursera/blob/master/rruff_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Our goal is to train a neural network to correctly guess the crystal system of a given powder x-ray diffraction (XRD) spectrum.

We start out by importing some standard libraries that will come in handy later.

In [0]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt

# Playing with the data

Let us first download the data from Mark's GitHub repository and load it into the NumPy arrays 'datvalues' and 'datlabels'.

In [0]:
!git clone https://github.com/fiscioluzzi/ml_for_exp_condmat_workshop
  
datvalues = np.load('ml_for_exp_condmat_workshop/rruff/rruff_values_lownoise.npy')
datlabels = np.load('ml_for_exp_condmat_workshop/rruff/rruff_crsys_lownoise.npy')

Cloning into 'ml_for_exp_condmat_workshop'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 109 (delta 8), reused 4 (delta 1), pack-reused 85
Receiving objects: 100% (109/109), 25.07 MiB | 29.14 MiB/s, done.
Resolving deltas: 100% (40/40), done.


We are training with powder x-ray diffraction (XRD) spectra taken from the RRUFF database (http://rruff.info).

**Task #1:** Plot a few of the spectra in 'datvalues' to make yourself familiar with what the data looks like.

The label pertaining to the n-th spectrum is given by the n-th entry of the array 'datlabels'. The 7 crystal systems are labelled as follows:

1: triclinic;
2: monoclinic;
3: orthorhombic;
4: tetragonal;
5: trigonal;
6: hexagonal;
7: cubic.

The dataset was obtained from real experiments, so the crystal systems do not occur equally often.

**Task #2:**  Determine how many examples we have per crystal system.
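One possible way to approach this count is sketched below with NumPy. The array `example_labels` is a made-up stand-in for 'datlabels' so that the snippet runs on its own; the only assumption is that the labels are integers from 1 to 7. If the real labels turn out to be stored as floats, they would need to be cast with `.astype(int)` first.

```python
import numpy as np

# Hypothetical stand-in for 'datlabels' (integers 1..7); replace with the real array.
example_labels = np.array([1, 2, 2, 3, 7, 7, 7, 5, 2, 4])

# np.bincount counts how often each non-negative integer occurs.
# minlength=8 guarantees slots for 0..7 even if some labels never occur.
counts = np.bincount(example_labels, minlength=8)[1:]  # drop the unused slot 0

crystal_systems = ['triclinic', 'monoclinic', 'orthorhombic', 'tetragonal',
                   'trigonal', 'hexagonal', 'cubic']
for name, count in zip(crystal_systems, counts):
    print(f'{name:>12}: {count}')
```

A bar chart of `counts` (for example with `plt.bar`) makes the imbalance between crystal systems visible at a glance.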

Finally, we have to select which part of the data to train with and which to test with. For simplicity, let us just pick out 100 samples randomly for the test set, while the training set is made up of all the others.

In [0]:
order = np.random.permutation(len(datlabels))

train_data = datvalues[order][:546]
train_labels = datlabels[order][:546] - 1
test_data = datvalues[order][546:]
test_labels = datlabels[order][546:] - 1

Note that we have subtracted 1 from all labels because it will be easier to work with labels 0...6 than with 1...7 later.
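The hard-coded index 546 gives exactly 100 test samples only if the dataset contains 646 spectra. The sketch below derives the split point from the array length instead. The function `split_train_test` and its parameter `n_test` are names introduced here for illustration, and the random arrays merely stand in for the real data.

```python
import numpy as np

def split_train_test(values, labels, n_test=100, rng=None):
    """Shuffle once, hold out n_test samples, and shift labels from 1..7 to 0..6."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(labels))
    n_train = len(labels) - n_test
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (values[train_idx], labels[train_idx] - 1,
            values[test_idx], labels[test_idx] - 1)

# Synthetic stand-in data: 646 "spectra" of 50 points each, labels 1..7.
rng = np.random.default_rng(0)
fake_values = rng.random((646, 50))
fake_labels = rng.integers(1, 8, size=646)

train_data, train_labels, test_data, test_labels = split_train_test(
    fake_values, fake_labels, n_test=100, rng=rng)
print(train_data.shape, test_data.shape)
```

Indexing with `order[:n_train]` also avoids building the full shuffled copy `datvalues[order]` twice.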

# Setting up and training the network

**Task #3:** Set up and train a neural network to learn the mapping from 'train_data' to 'train_labels'. 

That is, given an XRD spectrum from 'train_data', your network should output the corresponding label 0...6 of its crystal system with maximum accuracy.

**Task #4:** Now, apply your network to obtain predictions for the crystal systems of 'test_data', and determine how many of these are equal to the real labels in 'test_labels'.
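Assuming the network ends in a 7-way softmax layer, its output for each spectrum is a vector of 7 class scores, and the predicted label is the index of the largest score. The comparison with the true labels could then look like the sketch below. The array `probs` is invented and stands in for whatever your network returns for 'test_data' (for a Keras model, the result of `model.predict(test_data)`); `true_labels` plays the role of 'test_labels'.

```python
import numpy as np

# Made-up network output for 5 test spectra: one row of 7 class scores each.
probs = np.array([
    [0.7, 0.1, 0.1, 0.0, 0.0, 0.1, 0.0],
    [0.1, 0.6, 0.1, 0.1, 0.0, 0.1, 0.0],
    [0.0, 0.2, 0.1, 0.1, 0.1, 0.1, 0.4],
    [0.1, 0.5, 0.2, 0.1, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.6, 0.1, 0.1, 0.0],
])
true_labels = np.array([0, 1, 6, 2, 3])  # hypothetical labels in 0..6

# The predicted label is the class with the highest score in each row.
predicted = np.argmax(probs, axis=1)

# Count matches and convert to a fraction of the test set.
n_correct = int(np.sum(predicted == true_labels))
test_acc = n_correct / len(true_labels)
print(f'{n_correct} of {len(true_labels)} correct, accuracy {test_acc:.2f}')
```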

**Task #5:** Finally, determine the performance of your network on subsets of 'test_data' corresponding to given crystal systems. Which crystal system is most easily recognized?
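A minimal sketch of such a per-system breakdown, using a boolean mask to select the test samples of one crystal system at a time. The two arrays are hypothetical and stand in for 'test_labels' and the predicted labels of your network.

```python
import numpy as np

# Hypothetical true labels and predictions (0..6) for a small test set.
true_labels = np.array([0, 1, 1, 1, 2, 2, 6, 6, 6, 6])
predicted = np.array([0, 1, 1, 2, 2, 1, 6, 6, 6, 6])

per_class_acc = {}
for system in range(7):
    mask = true_labels == system          # test samples of this crystal system
    if mask.any():                        # skip systems absent from the test set
        per_class_acc[system] = float(np.mean(predicted[mask] == system))

for system, acc in per_class_acc.items():
    n_samples = int(np.sum(true_labels == system))
    print(f'label {system}: accuracy {acc:.2f} on {n_samples} samples')
```

With only 100 test spectra, the rarer crystal systems contribute just a handful of samples each, so their per-system accuracies fluctuate strongly from one random split to the next.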

# Comparing the networks

In order to compare different networks, we have to get rid of the bias we introduced by choosing a single random subset of 100 spectra as our test data. The straightforward way to do this is to loop over the entire procedure, from defining the test and training data to evaluating the network accuracy in Task #4.

**Task #6:** Average the network accuracy (=percentage of correctly classified spectra in the test data) over 50 different runs, where in each run the test and training data are randomly chosen anew.

In [0]:
accs = []

for i in range(50):
    order = np.random.permutation(len(datlabels))

    train_data = datvalues[order][:546]
    train_labels = datlabels[order][:546] - 1
    test_data = datvalues[order][546:]
    test_labels = datlabels[order][546:] - 1

    # Your code goes here: build and train the network, then compute test_acc
    
    accs.append(test_acc)

print('Mean test accuracy:', np.mean(accs))
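To make the structure of this loop concrete, here is a self-contained sketch on synthetic data. The nearest-centroid classifier is only a placeholder for the neural network from Task #3, and `make_fake_data` is an invented helper; neither comes from the workshop material. The sketch also reports the standard deviation across runs, which indicates how strongly the accuracy depends on the random split.

```python
import numpy as np

def make_fake_data(rng, n_samples=300, n_points=40, n_classes=7):
    """Invented stand-in for the spectra: each class gets one peak position."""
    labels = rng.integers(0, n_classes, size=n_samples)
    x = np.arange(n_points)
    centers = 5 * labels + 3  # peak position per class
    values = np.exp(-0.5 * ((x[None, :] - centers[:, None]) / 2.0) ** 2)
    values += 0.05 * rng.standard_normal(values.shape)  # mild noise
    return values, labels

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Placeholder classifier: assign each test sample to the closest class mean."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    predicted = classes[np.argmin(dists, axis=1)]
    return float(np.mean(predicted == test_y))

rng = np.random.default_rng(0)
values, labels = make_fake_data(rng)
n_test = 100

accs = []
for run in range(50):
    # Draw a fresh random split in every run.
    order = rng.permutation(len(labels))
    train_idx, test_idx = order[:-n_test], order[-n_test:]
    accs.append(nearest_centroid_accuracy(values[train_idx], labels[train_idx],
                                          values[test_idx], labels[test_idx]))

print(f'Mean test accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}')
```

When the placeholder is swapped for a real network, the model should be rebuilt from scratch inside the loop so that no run starts from weights that have already seen the new test samples.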