# MatSE580 Guest Lecture 2
## Introduction

In this guest lecture we will cover:
1. [Interacting with the database we set up in Lecture 1](#verify-the-connection-to-the-database) and [visualizing the results](#plotting-with-mongodb-charts) - using [pymongo](https://github.com/mongodb/mongo-python-driver) library and [MongoDB Charts](https://www.mongodb.com/docs/charts/) service
2. [Using machine learning (ML) tools to predict stability of materials](#pysipfenn) - using [pySIPFENN](https://pysipfenn.readthedocs.io/en/stable/)
3. [Using ML featurization and dimensionality reduction to embed materials in feature space](#featurization) - using [pySIPFENN](https://pysipfenn.readthedocs.io/en/stable/) with [MongoDB Charts](https://www.mongodb.com/docs/charts/) visualization
4. [Using featurization to guide DFT and improve ML models](#transfer-learning-on-small-dft-dataset)

**This notebook assumes that you already followed the instructions in Lecture 1 and you:**
1. Have a conda environment called `580demo` (or other) with all the packages installed, including:
    - `pymatgen`
    - `pymongo`
    - `pysipfenn`

2. Have a MongoDB database called `matse580` with a collection called `structures`, for which you:
    - have a username (e.g. `student`)
    - have an API key / password string (e.g. `sk39mIM2f35Iwc`)
    - have whitelisted your IP address or `0.0.0.0/0` (entire internet)
    - know the connection string (URI) to the database (e.g. `mongodb+srv://student:sk39mIM2f35Iwc@cluster0.3wlhaan.mongodb.net/?retryWrites=true&w=majority`)

3. Have populated the database with all Sigma phase end members (see Lecture 1 - Inserting Data)

4. Have downloaded all the [pre-trained models](https://zenodo.org/records/7373089) after installing `pysipfenn`, by calling `downloadModels()` and confirming that it finished successfully. If not, run this one-liner:

        python -c "import pysipfenn; c = pysipfenn.Calculator(); c.downloadModels(); c.loadModels();"

If all of the above are true, you are ready to go!
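
The connection string from item 2 embeds the username and password, so it is easy to leak when a notebook is shared. The following is a minimal sketch of one way to assemble the URI instead, assuming the credentials are kept in environment variables. The variable names `MONGO_USER`, `MONGO_PASSWORD`, and `MONGO_HOST` and the helper `build_mongo_uri` are hypothetical and not part of the lecture setup. It also percent-encodes special characters in the credentials, which a URI requires.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str,
                    options: str = "retryWrites=true&w=majority") -> str:
    """Assemble a mongodb+srv URI, percent-encoding the credentials."""
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/?{options}"


# Hypothetical environment variable names; the fallbacks are the
# example values from the checklist above.
uri = build_mongo_uri(
    user=os.environ.get("MONGO_USER", "student"),
    password=os.environ.get("MONGO_PASSWORD", "sk39mIM2f35Iwc"),
    host=os.environ.get("MONGO_HOST", "cluster0.3wlhaan.mongodb.net"),
)

# Print only the host part, never the credentials.
print(uri.split("@")[-1])
```

The resulting `uri` can be passed to `MongoClient(uri)` exactly like a hard-coded string.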

In [1]:
from pprint import pprint            # pretty printing
from collections import defaultdict  # convenience in the example
import os                            # file handling
from datetime import datetime        # time handling
from zoneinfo import ZoneInfo        # time handling
from pymatgen.core import Structure  # pymatgen

## Verify the connection to the database
pymongo is a Python library that allows us to interact with MongoDB databases in a very intuitive way. Let's start by importing its `MongoClient` class and creating a connection to our database:

In [2]:
from pymongo import MongoClient
uri = 'mongodb+srv://amk7137:kASMuF5au1069Go8@cluster0.3wlhaan.mongodb.net/?retryWrites=true&w=majority'
client = MongoClient(uri)

and see what databases are available:

In [3]:
client.list_database_names()

['matse580', 'admin', 'local']

Now connect to the `structures` collection in the `matse580` database:

In [4]:
collection = client['matse580']['structures']

and verify that the Sigma phase structures we created are there:

In [5]:
print(f'Found: {collection.count_documents({})} structures\n')
pprint(collection.find_one({}, skip=100))

Found: 243 structures

{'POSCAR': 'Cr12 Fe10 Ni8\n'
           '1.0\n'
           '   8.5470480000000002    0.0000000000000000    0.0000000000000000\n'
           '   0.0000000000000000    8.5470480000000002    0.0000000000000000\n'
           '   0.0000000000000000    0.0000000000000000    4.4777139999999997\n'
           'Cr Fe Ni Fe Cr\n'
           '8 2 8 8 4\n'
           'direct\n'
           '   0.7377020000000000    0.0637090000000000    0.0000000000000000 '
           'Cr\n'
           '   0.2622980000000000    0.9362910000000000    0.0000000000000000 '
           'Cr\n'
           '   0.4362910000000000    0.2377020000000000    0.5000000000000000 '
           'Cr\n'
           '   0.7622980000000000    0.5637090000000000    0.5000000000000000 '
           'Cr\n'
           '   0.5637090000000000    0.7622980000000000    0.5000000000000000 '
           'Cr\n'
           '   0.2377020000000000    0.4362910000000000    0.5000000000000000 '
           'Cr\n'
           '   0.0637
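
The `POSCAR` field is a plain string in the VASP 5 layout, where line 6 lists the species and line 7 gives the number of atoms in each block. Because the same element can appear in several blocks (here `Cr Fe Ni Fe Cr`), the counts have to be summed per element to recover the formula on the first line. Below is a minimal sketch using only the standard library; the helper `poscar_composition` is hypothetical, and the lattice values are shortened from the output above.

```python
from collections import defaultdict


def poscar_composition(poscar: str) -> dict:
    """Sum the per-block atom counts of a VASP 5 POSCAR string by element."""
    lines = poscar.splitlines()
    species = lines[5].split()                    # e.g. ['Cr', 'Fe', 'Ni', 'Fe', 'Cr']
    counts = [int(n) for n in lines[6].split()]   # e.g. [8, 2, 8, 8, 4]
    totals = defaultdict(int)
    for element, n in zip(species, counts):
        totals[element] += n
    return dict(totals)


# Header of the document printed above (atomic positions omitted).
header = (
    "Cr12 Fe10 Ni8\n"
    "1.0\n"
    "   8.547048 0.000000 0.000000\n"
    "   0.000000 8.547048 0.000000\n"
    "   0.000000 0.000000 4.477714\n"
    "Cr Fe Ni Fe Cr\n"
    "8 2 8 8 4\n"
    "direct\n"
)
print(poscar_composition(header))  # {'Cr': 12, 'Fe': 10, 'Ni': 8}
```

The five blocks correspond to the five sublattices of the Sigma phase, which is why Cr and Fe each appear twice.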

### Plotting with MongoDB Charts

MongoDB Charts is an associated service that allows us to quickly visualize the data in the database online and share it with others, while keeping the source data secure and private.

***Note for Online Students: At this point we will pause the Jupyter Notebook and switch to the MongoDB Atlas website to set up the database, or skip until next week depending on the available time.** The process is fairly straightforward but feel free to stop by office hours for help!*

You should end up with some neat figures like the one below:

<p align="center">
  <img src="assets/MongoDBChartExample.png" width="500"/>
</p>

If you are interested in seeing a couple more examples, you can visit the dashboard of [ULTERA Database](https://ultera.org) for high entropy alloys.

## pySIPFENN

We will now complete a brief walkthrough covering the core functionalities of the **pySIPFENN**, or **py**(**S**tructure-**I**nformed **P**rediction of **F**ormation **E**nergy using **N**eural **N**etworks), package, available through the PyPI repository. For full, up-to-date documentation, please refer to the [pySIPFENN documentation page](https://pysipfenn.org) or [pySIPFENN GitHub repository](https://git.pysipfenn.org). You can also find news about our projects using SIPFENN at our [Phases Research Lab](https://phaseslab.org) group website.

On the conceptual level, pySIPFENN is a framework composed of:

- Featurizers / descriptor calculators, which allow the user to interpret atomic structures (hence **S**tructure-**I**nformed) and represent them with numbers in a way suitable for machine learning (ML) **P**rediction of properties. Those shipped publicly include Ward2017 (general) and KS2022 (general or optimized for different material types), calculating the Ward2017 and KS2022 feature vectors, respectively. Thanks to how modular pySIPFENN is, you can generally just "steal" them as standalone modules and use them in your own projects.

- It can handle any property the user wants to predict, depending purely on the model training, but the key fundamental property of interest has been **F**ormation **E**nergy of materials, and that is what is shipped by default with the package.

- It can use any [Open Neural Network Exchange (ONNX)](https://onnx.ai) model trained on the supported feature vectors (Ward2017 and KS2022 included). The models shipped by default are **N**eural **N**etworks, hence the inclusion in the name, but neither pySIPFENN nor ONNX is limited to NNs. You can export, for instance, complete `scikit-learn` pipelines (as done [here in heaGAN package](https://github.com/amkrajewski/cGAN_demo/blob/master/heagan/notebooks/train_surrogates.ipynb)) and use them in pySIPFENN.

The figure below shows how they fit together conceptually:

<p align="center">
  <img src="assets/neuralnetcolorized.png" width="500"/>
</p>

### Getting Started

To utilize pySIPFENN for straightforward calculations, **only the Calculator class is needed**, which acts as an ***environment*** for all components of the package. Under the hood, it will do a lot of things for you, including both fetching and identification of available NN models. Afterwards, it will expose a very high-level API for you to use. 

In [6]:
from pysipfenn import Calculator     # The only thing needed for calculations

Could not import coremltools.

Dependencies for exporting to CoreML, Torch, and ONNX are not installed by default with pySIPFENN. You need to install pySIPFENN in "dev" mode like: pip install -e "pysipfenn[dev]", or like pip install -e ".[dev]" ifyou are cloned it. See pysipfenn.org for more details.


Now initialize the Calculator. When run, this should display all models detected (e.g. ✔ SIPFENN_Krajewski2020 Standard Materials Model)
and those not detected but declared in the `modelsSIPFENN/models.json` file. If some networks are not detected (prepended with *x*), the download you were asked to do in Lecture 1 may not have completed successfully. You can try again by calling `c.downloadModels()`, which will only download the missing ones.

In [7]:
c = Calculator()

*********  Initializing pySIPFENN Calculator  **********
Loading model definitions from: /Users/adam/opt/anaconda3/envs/580demo/lib/python3.10/site-packages/pysipfenn/modelsSIPFENN/models.json
Found 4 network definitions in models.json
✔ SIPFENN_Krajewski2020 Standard Materials Model
✔ SIPFENN_Krajewski2020 Novel Materials Model
✔ SIPFENN_Krajewski2020 Light Model
✔ SIPFENN_Krajewski2022 KS2022 Novel Materials Model
Loading all available models (autoLoad=True)
Loading models:


  0%|          | 0/4 [00:00<?, ?it/s]

100%|██████████| 4/4 [00:14<00:00,  3.65s/it]

*********  pySIPFENN Successfully Initialized  **********





The simplest and most common usage of pySIPFENN is to deploy it on a directory/folder containing atomic structure files such as POSCAR or CIF. To do so, one simply specifies its location and which descriptor / feature vector should be used. The latter determines which ML models will be run, as each model requires a specific, ordered list of features as input.

    c.runFromDirectory(directory='myInputFiles', descriptor='KS2022')

Furthermore, while the exact model can be specified by the user, by default all applicable models are run, as the run itself is 1-3 orders of magnitude faster than descriptor calculation. Following the link printed during `Calculator` initialization reveals which models will be run.

In this demonstration, we use a set of test files shipped under `assets/examplePOSCARS`. Let's run them with the Ward2017 featurizer.

In [8]:
c.runFromDirectory(directory='assets/examplePOSCARS',
                   descriptor='Ward2017');

Importing structures...


100%|██████████| 6/6 [00:00<00:00, 21.46it/s]



Models that will be run: ['SIPFENN_Krajewski2020_NN9', 'SIPFENN_Krajewski2020_NN20', 'SIPFENN_Krajewski2020_NN24']
Calculating descriptors...


100%|██████████| 6/6 [00:05<00:00,  1.20it/s]


Done!
Making predictions...
Prediction rate: 20.5 pred/s
Obtained 6 predictions from:  SIPFENN_Krajewski2020_NN9
Prediction rate: 21.5 pred/s
Obtained 6 predictions from:  SIPFENN_Krajewski2020_NN20
Prediction rate: 131.0 pred/s
Obtained 6 predictions from:  SIPFENN_Krajewski2020_NN24
Done!


Now, all results are obtained and stored within the **c** Calculator object in two conveniently named attributes, _predictions_ and _inputFiles_. The descriptor data is also retained in _descriptorData_ if needed. Let's look up all 6 entries. Note that the unit of prediction will depend on the model used; in this case, it is eV/atom.

In [9]:
pprint(c.inputFiles)
pprint(c.predictions)

['12-Gd4Cr4O12.POSCAR',
 '13-Fe16Ni14.POSCAR',
 '14-Fe24Ni6.POSCAR',
 '15-Ta4Tl4O12.POSCAR',
 '16-Fe18Ni12.POSCAR',
 '17-Pr4Ga4O12.POSCAR']
[[-3.154766321182251, -3.214848756790161, -3.187128782272339],
 [-0.013867354951798916, 0.04655897989869118, 0.053411152213811874],
 [0.02639671415090561, 0.05997598543763161, 0.06677809357643127],
 [-2.467507839202881, -2.4308743476867676, -2.391871690750122],
 [0.01810809224843979, 0.06462040543556213, 0.10881152749061584],
 [-2.7106518745422363, -2.6583476066589355, -2.727781057357788]]
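
Each row of `predictions` holds one value per model, in the order the models were run, and rows line up with `inputFiles`. One simple way to read them together is to compute the mean and the spread across models for each structure. This is a minimal sketch using values copied from the output above and rounded to six decimals. Averaging the models is an illustrative choice here, not something pySIPFENN does for you.

```python
from statistics import mean

# Rounded copies of c.inputFiles and c.predictions from the output above.
input_files = [
    '12-Gd4Cr4O12.POSCAR', '13-Fe16Ni14.POSCAR', '14-Fe24Ni6.POSCAR',
    '15-Ta4Tl4O12.POSCAR', '16-Fe18Ni12.POSCAR', '17-Pr4Ga4O12.POSCAR',
]
predictions = [
    [-3.154766, -3.214849, -3.187129],
    [-0.013867, 0.046559, 0.053411],
    [0.026397, 0.059976, 0.066778],
    [-2.467508, -2.430874, -2.391872],
    [0.018108, 0.064620, 0.108812],
    [-2.710652, -2.658348, -2.727781],
]

# (name, mean over models, max - min over models), all in eV/atom.
summary = [
    (name, mean(row), max(row) - min(row))
    for name, row in zip(input_files, predictions)
]

for name, avg, spread in summary:
    print(f'{name:22s} mean = {avg:+.3f}  spread = {spread:.3f}')
```

A large spread relative to the mean, as for the Fe-Ni entries, is a hint that the models disagree and the prediction deserves a closer look.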


For user convenience, a few methods are provided for extracting the results. E.g., if pySIPFENN has been run from structure files, the `get_resultDictsWithNames()` method is available to conveniently pass results forward in the code.

In [10]:
c.get_resultDictsWithNames()

[{'name': '12-Gd4Cr4O12.POSCAR',
  'SIPFENN_Krajewski2020_NN9': -3.154766321182251,
  'SIPFENN_Krajewski2020_NN20': -3.214848756790161,
  'SIPFENN_Krajewski2020_NN24': -3.187128782272339},
 {'name': '13-Fe16Ni14.POSCAR',
  'SIPFENN_Krajewski2020_NN9': -0.013867354951798916,
  'SIPFENN_Krajewski2020_NN20': 0.04655897989869118,
  'SIPFENN_Krajewski2020_NN24': 0.053411152213811874},
 {'name': '14-Fe24Ni6.POSCAR',
  'SIPFENN_Krajewski2020_NN9': 0.02639671415090561,
  'SIPFENN_Krajewski2020_NN20': 0.05997598543763161,
  'SIPFENN_Krajewski2020_NN24': 0.06677809357643127},
 {'name': '15-Ta4Tl4O12.POSCAR',
  'SIPFENN_Krajewski2020_NN9': -2.467507839202881,
  'SIPFENN_Krajewski2020_NN20': -2.4308743476867676,
  'SIPFENN_Krajewski2020_NN24': -2.391871690750122},
 {'name': '16-Fe18Ni12.POSCAR',
  'SIPFENN_Krajewski2020_NN9': 0.01810809224843979,
  'SIPFENN_Krajewski2020_NN20': 0.06462040543556213,
  'SIPFENN_Krajewski2020_NN24': 0.10881152749061584},
 {'name': '17-Pr4Ga4O12.POSCAR',
  'SIPFENN_Kr
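
Because every entry is a plain dictionary keyed by model name, standard Python tools are enough to filter and rank the results. The sketch below ranks the three Fe-Ni entries from the output above (values rounded to six decimals) by the prediction of one model. The choice of `SIPFENN_Krajewski2020_NN9` as the ranking model is arbitrary and only for illustration.

```python
# Rounded copies of three entries returned by c.get_resultDictsWithNames().
results_with_names = [
    {'name': '13-Fe16Ni14.POSCAR',
     'SIPFENN_Krajewski2020_NN9': -0.013867,
     'SIPFENN_Krajewski2020_NN20': 0.046559,
     'SIPFENN_Krajewski2020_NN24': 0.053411},
    {'name': '14-Fe24Ni6.POSCAR',
     'SIPFENN_Krajewski2020_NN9': 0.026397,
     'SIPFENN_Krajewski2020_NN20': 0.059976,
     'SIPFENN_Krajewski2020_NN24': 0.066778},
    {'name': '16-Fe18Ni12.POSCAR',
     'SIPFENN_Krajewski2020_NN9': 0.018108,
     'SIPFENN_Krajewski2020_NN20': 0.064620,
     'SIPFENN_Krajewski2020_NN24': 0.108812},
]

model = 'SIPFENN_Krajewski2020_NN9'  # arbitrary choice for this example

# Sort from the lowest (most favorable) to the highest predicted formation energy.
ranked = sorted(results_with_names, key=lambda entry: entry[model])

for entry in ranked:
    print(f"{entry['name']:20s} {entry[model]:+.4f} eV/atom")
```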

Alternatively, if results are to be preserved in a spreadsheet, they can be exported into a CSV.

In [11]:
c.writeResultsToCSV('myFirstResults_pySIPFENN.csv')

### Predicting all Sigma Endmembers from Lecture 1

Now, armed with the power of pySIPFENN, we can quickly get formation energies of all Sigma phase endmembers we defined in the last lecture. We start by getting all the structures from the database:

In [22]:
structList, idList = [], []
for entry in collection.find({}):
    idList.append(entry['_id'])
    structList.append(Structure.from_dict(entry['structure']))
print(f'Fetched {len(structList)} structures')

Fetched 243 structures
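
Each document also carries a long `POSCAR` string, which the loop above downloads but never uses. A projection, passed as the second argument to `find`, limits the transfer to the fields that are needed. This is a minimal sketch, assuming every document of interest stores the pymatgen dictionary under `structure` as set up in Lecture 1; the helper name `fetch_structure_dicts` is hypothetical.

```python
def fetch_structure_dicts(collection, query=None):
    """Return (ids, structure dicts) for documents that carry a 'structure' field."""
    ids, struct_dicts = [], []
    # {'structure': 1} is a projection: only 'structure' is returned,
    # together with '_id', which MongoDB includes by default.
    for entry in collection.find(query or {}, {'structure': 1}):
        if 'structure' in entry:  # skip documents without a stored structure
            ids.append(entry['_id'])
            struct_dicts.append(entry['structure'])
    return ids, struct_dicts


# Usage with the objects defined earlier in this notebook:
#   idList, structDicts = fetch_structure_dicts(collection)
#   structList = [Structure.from_dict(d) for d in structDicts]
```

The function only relies on the `find` method, so it works with any pymongo collection.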


Now, we will use the `runModels` function, which is one layer of abstraction lower than `runFromDirectory`: it skips file processing and takes the structure objects directly. We will set `mode='parallel'` to run in parallel, which is much faster than sequential execution on multi-core machines. Each thread on a modern CPU should be able to process ~1 structure per second, so with 4 workers the 243 structures should take about a minute.

We will also use `get_resultDicts` to get the results in a convenient format. 
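
The "about a minute" figure follows from simple arithmetic, which is worth redoing whenever the number of structures or workers changes. The sketch below assumes the ~1 structure per second per worker quoted above and perfect parallel scaling, which is an idealization. The helper `estimate_runtime_s` is hypothetical.

```python
import math
import os


def estimate_runtime_s(n_structures: int, workers: int,
                       rate_per_worker: float = 1.0) -> float:
    """Idealized featurization time: the busiest worker sets the wall time."""
    structures_per_worker = math.ceil(n_structures / workers)
    return structures_per_worker / rate_per_worker


# Never request more workers than the machine has logical cores.
workers = min(4, os.cpu_count() or 1)

print(estimate_runtime_s(243, 4))  # 61.0
```

Real runs will be somewhat slower because of process start-up and because structures differ in size.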

In [14]:
c.runModels(structList=structList, descriptor='Ward2017', mode='parallel', max_workers=4)
results = c.get_resultDicts()


Models that will be run: ['SIPFENN_Krajewski2020_NN9', 'SIPFENN_Krajewski2020_NN20', 'SIPFENN_Krajewski2020_NN24']
Calculating descriptors...


  0%|          | 0/243 [00:00<?, ?it/s]


In [18]:
pprint(results[0])

{'SIPFENN_Krajewski2020_NN20': 0.07977379858493805,
 'SIPFENN_Krajewski2020_NN24': 0.03619053587317467,
 'SIPFENN_Krajewski2020_NN9': 0.07845475524663925}


and now we can easily upload them back to the database, as we learned in Lecture 1

In [23]:
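for docID, result in zip(idList, results):
    # '$set' adds or overwrites only the prediction fields; the rest of the document is untouched
    collection.update_one({'_id': docID}, {'$set': result})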

and now they are accessible to anyone with access!

In [25]:
collection.find_one({}, skip=100)['SIPFENN_Krajewski2020_NN9']

0.15312525629997253
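
When predictions are written to a shared database, it can help to record when they were produced so that later model updates can be told apart. The following is a minimal sketch of building such an update document. The field name `predictionsUpdatedAt`, the prefix filter, and the helper `build_update` are hypothetical choices for illustration and are not part of the Lecture 1 schema.

```python
from datetime import datetime, timezone


def build_update(result: dict, model_prefix: str = 'SIPFENN_') -> dict:
    """Build a '$set' update holding model predictions plus a UTC timestamp."""
    # Keep only the model outputs and store them as plain Python floats.
    fields = {
        key: float(value)
        for key, value in result.items()
        if key.startswith(model_prefix)
    }
    # Hypothetical metadata field; timezone-aware so it is unambiguous.
    fields['predictionsUpdatedAt'] = datetime.now(timezone.utc)
    return {'$set': fields}


# Usage with the objects defined earlier in this notebook:
#   for docID, result in zip(idList, results):
#       collection.update_one({'_id': docID}, build_update(result))
```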

## Featurization

## Transfer Learning on small DFT dataset

## Further Resources