# Training of the classifier
This example shows how to train a scikit-learn classifier on artificial crystalline data.

- First, we use the **CrystalAnalyzer** class to manage the data and feed it to the classifier.
- A multilayer perceptron is used as the **classifier** (the configuration shown works quite well).
- A **scaler** also needs to be defined; we use the standard scaler from sklearn.

To gain some speedup, a multiprocessing pool with 6 processes is used (this is optional).

In [1]:
import multiprocessing as mp

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

from crystalanalysis import CrystalAnalyzer
from mixedcrystalsignature import MixedCrystalSignature

In [2]:
sign_calculator=MixedCrystalSignature(solid_thresh=0.55,pool=mp.Pool(6))

classifier = MLPClassifier(max_iter=300,tol=1e-5,
                           hidden_layer_sizes=(250,),
                           solver='adam',random_state=0, shuffle=True,
                           activation='relu',alpha=1e-4)

scaler=StandardScaler()
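
To make the role of the scaler concrete, here is a minimal standard-library sketch of what standard scaling does to a single feature column: subtract the mean and divide by the population standard deviation (sklearn's `StandardScaler` uses the same biased estimator). The helper name `standardize` and the sample values are illustrative only and are not part of this repository.

```python
import statistics


def standardize(values):
    """Scale one feature column to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical feature column, for illustration only.
column = [1.0, 2.0, 3.0, 4.0]
scaled = standardize(column)
print(scaled)
```

Scaling matters here because a multilayer perceptron is sensitive to features that live on very different numeric ranges.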

## Important parameters
- **noiselist** is the list of noise levels used to build the test dataset (default: 0 % to 20 %)
- **train_noiselist** is the list of noise levels used for training (default: 4 % to 11 %)
- **volume** defines the volume to be filled with a crystal structure in the x, y and z dimensions (15,15,15 is just for demonstration; larger datasets are better!)
- **inner_distance** is the distance from the border that defines the inner volume; restricting the data to the inner volume avoids errors at the border of the dataset

In [None]:
noiselist=list(range(0,21))
train_noiselist=list(range(4,12,1))
volume=[15,15,15]
inner_distance=2
ca=CrystalAnalyzer(classifier,scaler,sign_calculator,
                  noiselist=noiselist,train_noiselist=train_noiselist,
                  volume=volume,inner_distance=inner_distance)
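
As a rough illustration of these parameters, the sketch below spells out the two noise lists and shows one plausible reading of **inner_distance**: keep only points that lie at least that far from every face of the box. The helper `inner_mask` and the box-shaped interpretation are assumptions made for illustration; the source does not show how CrystalAnalyzer actually computes the inner volume.

```python
# Noise levels in percent, as configured above.
noiselist = list(range(0, 21))           # 0, 1, ..., 20
train_noiselist = list(range(4, 12, 1))  # 4, 5, ..., 11


def inner_mask(points, volume, inner_distance):
    """Flag points at least `inner_distance` away from every box face.

    Assumption: the box spans [0, volume[i]] along each axis.
    """
    return [
        all(inner_distance <= c <= v - inner_distance
            for c, v in zip(point, volume))
        for point in points
    ]


# Hypothetical particle positions, for illustration only.
points = [(1.0, 1.0, 1.0), (7.0, 7.0, 7.0), (13.5, 7.0, 7.0)]
mask = inner_mask(points, volume=[15, 15, 15], inner_distance=2)
print(mask)
```

Note that the training noise levels are a subset of the test noise levels, so the test set also probes noise the classifier has never seen.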

## Generating the signatures for training and testing
Now that the CrystalAnalyzer is defined, we can generate the training and test signatures.

- **save_training_signatures** can be used to save the training data.
- Likewise, **save_test_signatures** can be used to save the test data.

If you use multiprocessing, it is advisable to close the pool once your calculations are done.

In [3]:
ca.generate_train_signatures()
ca.save_training_signatures("training_data.pkl")
ca.generate_test_signatures()
ca.save_test_signatures("test_data.pkl")

# close and join the pool from multiprocessing
ca.sign_calculator.p.close()
ca.sign_calculator.p.join()

generating training signatures
finished
generating test signatures
finished


## Training from the signature data
After training, the accuracy on the training set is printed (an accuracy this close to 1 is a sign of **overfitting**; more data is needed!)

In [4]:
ca.load_training_signatures("training_data.pkl")
ca.train_classifier()
ca.save_classifier("mlpclassifier.pkl")
ca.save_scaler("standardscaler.pkl")

started training
finished training, time: 34.48882818222046
Accuracy on Train set: 0.9971575809396159
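
One common way to check for overfitting is to compare the training accuracy with the accuracy on held-out data. The sketch below does this with plain sklearn on synthetic stand-in data; it does not use CrystalAnalyzer or real crystal signatures, and the dataset size and the smaller hidden layer are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, NOT crystal signatures.
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit the scaler on the training split only, then reuse it everywhere.
scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=0)
clf.fit(scaler.transform(X_train), y_train)

train_acc = clf.score(scaler.transform(X_train), y_train)
val_acc = clf.score(scaler.transform(X_val), y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

A large gap between the two numbers suggests that the model memorizes the training data; more data or stronger regularization (a larger `alpha`) are the usual remedies.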


## Predicting the test dataset
You will get an accuracy on the test dataset for every noise level.

If you want to plot the results, all of the data is available in **ca.test_signatures** (a simple dictionary).

Beyond that, it is up to you what you do with the trained classifier. This repository also provides pre-trained classifiers and scalers.

In [5]:
ca.load_scaler("standardscaler.pkl")
ca.load_classifier("mlpclassifier.pkl")
ca.load_test_signatures("test_data.pkl")
ca.predict_test()

predicted struc fcc idx 0 1.0
predicted struc fcc idx 1 1.0
predicted struc fcc idx 2 1.0
predicted struc fcc idx 3 1.0
predicted struc fcc idx 4 1.0
predicted struc fcc idx 5 1.0
predicted struc fcc idx 6 0.99951171875
predicted struc fcc idx 7 0.9990229604298974
predicted struc fcc idx 8 0.9985308521057786
predicted struc fcc idx 9 0.9975111996017919
predicted struc fcc idx 10 0.9989550679205852
predicted struc fcc idx 11 0.997628927089508
predicted struc fcc idx 12 0.8085603112840467
predicted struc fcc idx 13 0.732360097323601
predicted struc fcc idx 14 0.7344913151364765
predicted struc fcc idx 15 0.695906432748538
predicted struc fcc idx 16 0.7321428571428571
predicted struc fcc idx 17 0.5555555555555556
predicted struc fcc idx 18 0.6666666666666666
predicted struc fcc idx 19 nan
predicted struc fcc idx 20 nan
predicted struc hcp idx 0 1.0
predicted struc hcp idx 1 1.0
predicted struc hcp idx 2 1.0
predicted struc hcp idx 3 1.0
predicted struc hcp idx 4 1.0
predicted struc hcp id
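
The `nan` entries at idx 19 and 20 most likely mean that no samples were left to score at those noise levels, so the accuracy is a mean over an empty set. This is an assumption; the source does not state the cause. The sketch below, with hypothetical helper names, shows how such a value arises and how to drop it before plotting or averaging.

```python
import math


def accuracy_or_nan(predicted, expected):
    """Fraction of matching labels, or NaN when there is nothing to score."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        return math.nan
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(predicted)


def valid_accuracies(accuracies):
    """Map noise index to accuracy, skipping NaN entries."""
    return {idx: acc for idx, acc in enumerate(accuracies)
            if not math.isnan(acc)}


# Hypothetical labels, for illustration only.
print(accuracy_or_nan(["fcc", "fcc", "hcp"], ["fcc", "fcc", "fcc"]))
print(valid_accuracies([1.0, math.nan, 0.5]))
```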

