# Part 1: Tracking ML Model Predictions

In this tutorial notebook, we'll examine some examples of why we may want to track model predictions, and how this can be done.

First, let's import the `SynthesizabilityModel` class. This is a toy model (i.e., a random number generator) that stands in for a binary ML classifier, predicting how likely it is that a hypothetical material can be synthesized.

**Q: Can you navigate to the class definition? What kinds of methods does the class provide?**


_Hint: You'll need to look at another file in this folder to understand the model definition._

In [2]:
from model import SynthesizabilityModel

## Starting with a list of materials

Let's now start by defining a _list_ of materials by their formulas, along with an instance of our imported model class.

**Q: If you'd like, try adding some other materials in the list, or otherwise modifying the list of materials.**

**Q: What happens if you add a material that isn't a string? Why does this happen?**

In [3]:
my_materials = [
    "Li7La3(SnO6)2",
    "Li3(WO3)8",
    "Li4Mn2P4H3O16",
    "Li2Ni(PO3)5",
    "MgV2O4",
]

predictor = SynthesizabilityModel()

### Running predictions

Let's run the prediction function in a loop over the list of materials. 

**Q: Are the results deterministic (do they change when you re-run)? Why or why not?**

**Q: Try timing the runtime -- e.g., uncomment the `%%time` at the start of the next code block. How long does it take to run predictions on these materials? If you had 1 billion material candidates to run predictions on, would you do anything differently?**

In [11]:
# %%time

for material in my_materials:
    print(
        f"The synthesizability score for {material} is",
        f"{round(predictor.predict_single(material), 2)}."
    )

The synthesizability score for Li7La3(SnO6)2 is 0.04.
The synthesizability score for Li3(WO3)8 is 0.72.
The synthesizability score for Li4Mn2P4H3O16 is 0.26.
The synthesizability score for Li2Ni(PO3)5 is 0.24.
The synthesizability score for MgV2O4 is 0.41.
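
If you want to experiment with the determinism and timing questions outside of this notebook, here is a minimal sketch. It does not use the real `SynthesizabilityModel`. `ToyPredictor` and `run_predictions` are hypothetical stand-ins, and the sketch assumes the model draws its scores from a random number generator that can optionally be seeded. Check `model.py` to see whether the real class behaves this way.

```python
import random
import time


class ToyPredictor:
    """Hypothetical stand-in for SynthesizabilityModel (not the real class)."""

    def __init__(self, seed=None):
        # A private generator, so seeding it does not touch global random state.
        self._rng = random.Random(seed)

    def predict_single(self, material):
        # Ignores the input and returns a random score in [0, 1).
        return self._rng.random()


def run_predictions(materials, seed=None):
    """Score every material with a freshly constructed predictor."""
    predictor = ToyPredictor(seed=seed)
    return [predictor.predict_single(m) for m in materials]


materials = ["Li3(WO3)8", "MgV2O4"]

# Two runs that start from the same seed give identical scores.
seeded_a = run_predictions(materials, seed=0)
seeded_b = run_predictions(materials, seed=0)
print("Same seed, same scores:", seeded_a == seeded_b)

# time.perf_counter works in plain scripts, where %%time is unavailable.
start = time.perf_counter()
run_predictions(materials)
elapsed = time.perf_counter() - start
print(f"Unseeded run over {len(materials)} materials took {elapsed:.6f} s")
```

Recording the seed alongside the scores is one simple way to make a run repeatable, which ties into the question of what else should be recorded with a prediction.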


## Predictions from the past

So far, we've played around with a toy model and thought a little bit about running predictions at larger scales. 

In many cases, machine learning work spans multiple teams, and predictions are the handoff point between them.

How can we understand and manage predictions that were made by a model someone else built (e.g., a previous grad student or a collaborator from another lab)?

Let's first take a look at a single predictions file, provided to us by another researcher.

In [12]:
import json

with open("data/predictions_v1.json", "r") as prediction_file:
    print(json.dumps(
        json.load(prediction_file),
        indent=2
    ))

{
  "synthesizability_predictions": {
    "Li7La3(SnO6)2": 0.15,
    "Li3(WO3)8": 0.39,
    "Li4Mn2P4H3O16": 1.0,
    "Li2Ni(PO3)5": 1.0,
    "MgV2O4": 0.22
  },
  "citation": "https://www.nature.com/articles/s43246-021-00219-x"
}


**Q: What kinds of materials are these? If you were running an experimental lab group, which of these materials might you start thinking about how to synthesize? Why?**

_Hint: Take a look at the article cited in the predictions file._

### Comparing predictions

You've received two other prediction files from your collaborators, and you haven't been able to get hold of the grad student who produced the original predictions.

Let's take a look at all of these prediction files side by side. Printing out the entire JSONs makes it tricky to compare... is there a better way?

In [13]:
import pandas

files = [
    "data/predictions_v1.json",
    "data/predictions_v2.json",
    "data/predictions_final.json",
]

all_predictions = pandas.concat([           # Concatenate several Series...
        pandas.read_json(f)                 # Read the JSON file to Pandas
        ["synthesizability_predictions"]    # Select the sub-dictionary as a Series
        .rename(f)                          # Rename this Series to the filename
        for f in files                      # Do this for all files in the list
    ], axis=1)                              # Concatenate into a table (not a line)

The code above is deliberately concise, but it is an example of a classic data transformation pipeline. Each source file is read in, filtered down to a single column of data, labeled with its filename so we can keep track of where each prediction came from, and then joined with the others into one table.
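
To make each step of that pipeline visible, here is a sketch of the same read, select, label, and join sequence written out with plain dictionaries instead of pandas. The function name `join_predictions` is hypothetical, and the in-memory JSON strings are trimmed stand-ins for the files on disk, using scores that appear in this notebook's outputs.

```python
import json

# Trimmed in-memory stand-ins for two of the prediction files. The real files
# live under data/ and may carry extra fields such as "citation".
raw_files = {
    "data/predictions_v1.json":
        '{"synthesizability_predictions": {"MgV2O4": 0.22, "Li3(WO3)8": 0.39}}',
    "data/predictions_v2.json":
        '{"synthesizability_predictions": {"MgV2O4": 0.98, "Li3(WO3)8": 0.93}}',
}


def join_predictions(raw_by_filename):
    """Return {material: {filename: score}} from {filename: JSON text}."""
    table = {}
    for filename, raw in raw_by_filename.items():
        document = json.loads(raw)                          # 1. read
        scores = document["synthesizability_predictions"]   # 2. select
        for material, score in scores.items():
            # 3. label the score with its source file, and
            # 4. join it to the other scores for the same material.
            table.setdefault(material, {})[filename] = score
    return table


table = join_predictions(raw_files)
for material in sorted(table):
    print(material, table[material])
```

The pandas version does the same work in fewer lines because `rename` handles the labeling and `concat` aligns rows on the material name.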

Let's take a look at the predictions now.

In [14]:
all_predictions

| | data/predictions_v1.json | data/predictions_v2.json | data/predictions_final.json |
|---|---|---|---|
| Li2Ni(PO3)5 | 1.0 | 1.0 | 0.1 |
| Li3(WO3)8 | 0.39 | 0.93 | 0.93 |
| Li4Mn2P4H3O16 | 1.0 | 0.99 | 1.0 |
| Li7La3(SnO6)2 | 0.15 | 0.95 | 0.15 |
| MgV2O4 | 0.22 | 0.98 | 0.22 |


**Q: Which version of the predictions is the "right" one?**

**Q: How was the "final" version of the predictions made, and how does it relate to v1 and v2?**

**Q: Which materials would you select to make in a lab? Is it the same choice you made before? Are you as confident as you were before in your choice?**
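
One way to start on these questions is to measure how much the three files disagree for each material. The sketch below copies the scores from the `all_predictions` output above and flags materials whose scores differ by more than a cutoff. The cutoff of 0.5 and the helper name `spread` are arbitrary choices made for illustration.

```python
# Scores copied from the all_predictions output, in the order v1, v2, final.
scores = {
    "Li2Ni(PO3)5": (1.0, 1.0, 0.1),
    "Li3(WO3)8": (0.39, 0.93, 0.93),
    "Li4Mn2P4H3O16": (1.0, 0.99, 1.0),
    "Li7La3(SnO6)2": (0.15, 0.95, 0.15),
    "MgV2O4": (0.22, 0.98, 0.22),
}

THRESHOLD = 0.5  # arbitrary cutoff, chosen only for illustration


def spread(values):
    """Largest gap between any two scores, rounded to hide float noise."""
    return round(max(values) - min(values), 2)


# Keep only the materials on which the files disagree strongly.
disagreements = {
    material: spread(values)
    for material, values in scores.items()
    if spread(values) > THRESHOLD
}

# Report the largest disagreements first.
for material, gap in sorted(disagreements.items(), key=lambda item: -item[1]):
    print(f"{material}: scores differ by up to {gap}")
```

A large spread tells us that the files disagree, but not which one to trust. For that we need to know where each number came from.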

## The key takeaway

Treating predictions as single numbers is dangerous, because it makes them very hard to reproduce and understand!

Instead, think of predictions as being **as important as a model**. At the start of this notebook, we took the time to look up and understand the model definition. Consider what you would gain if every prediction could be looked up and understood in the same way.

Let's look at two examples of structured, rich prediction objects:

In [15]:
with open("data/structured_prediction_1295f_6228a.json", "r") as prediction_file:
    print(json.dumps(
        json.load(prediction_file),
        indent=2
    ))

{
  "model_id": "27f49604-99df-4c90-91c2-353d8ae1295f",
  "prediction_id": "28fb2aea-84bf-4e4e-9d9e-51f8dfc6228a",
  "citation": "https://www.nature.com/articles/s43246-021-00219-x",
  "prediction_result": {
    "type": "binary_classification",
    "name": "synthesizability",
    "value": 0.39
  },
  "prediction_input": {
    "name": "Li3(WO3)8",
    "mp_id": "mp-1222771_Li",
    "material_id": "768a747c-015c-482c-acdf-bd4e7bbb4936"
  }
}


In [17]:
with open("data/structured_prediction_fee96_19f22.json", "r") as prediction_file:
    print(json.dumps(
        json.load(prediction_file),
        indent=2
    ))

{
  "model_id": "159c7941-15c3-4bc2-83f2-71f6b87fee96",
  "prediction_id": "1485d58c-f70d-443d-9b79-b19d34f19f22",
  "citation": "https://www.nature.com/articles/s43246-021-00219-x",
  "prediction_result": {
    "type": "binary_classification",
    "name": "synthesizability",
    "value": 0.42
  },
  "prediction_input": {
    "name": "Li3(WO3)8",
    "mp_id": "mp-1222771_Li",
    "material_id": "768a747c-015c-482c-acdf-bd4e7bbb4936"
  }
}


Now see if you can answer the following questions by referring to the two structured prediction files above.

**Q: Were these two predictions made by the same, or different models? How do you know?**

**Q: If you had to aggregate many of these prediction files, and select only the predictions from a particular model, how would you do that? (Bonus: Can you do it in Python code?)**

**Q: What was the input to these models when the prediction was made? Can you look up the material in an external database to verify exactly what the input is?**

**Q: If you had to ask your collaborator (who built the models and the predictions) to double check a prediction, what would you send over if you wanted to know about: A) a single prediction, B) all the predictions from a single model, C) all the predictions on a single material?**
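
If you'd like a starting point for the Python bonus, here is one possible sketch rather than the only answer. It uses trimmed copies of the two structured predictions printed above and groups them by an identifier. The helper `group_by` is a hypothetical name. In practice you would build `records` by loading each file with `json.load`.

```python
from collections import defaultdict

# Trimmed copies of the two structured predictions shown above.
records = [
    {
        "model_id": "27f49604-99df-4c90-91c2-353d8ae1295f",
        "prediction_id": "28fb2aea-84bf-4e4e-9d9e-51f8dfc6228a",
        "prediction_result": {"name": "synthesizability", "value": 0.39},
        "prediction_input": {
            "name": "Li3(WO3)8",
            "material_id": "768a747c-015c-482c-acdf-bd4e7bbb4936",
        },
    },
    {
        "model_id": "159c7941-15c3-4bc2-83f2-71f6b87fee96",
        "prediction_id": "1485d58c-f70d-443d-9b79-b19d34f19f22",
        "prediction_result": {"name": "synthesizability", "value": 0.42},
        "prediction_input": {
            "name": "Li3(WO3)8",
            "material_id": "768a747c-015c-482c-acdf-bd4e7bbb4936",
        },
    },
]


def group_by(items, key_fn):
    """Group items into {key: [items with that key]}."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


# All predictions from a single model, keyed by model_id.
by_model = group_by(records, lambda r: r["model_id"])
# All predictions on a single material, keyed by material_id.
by_material = group_by(records, lambda r: r["prediction_input"]["material_id"])

for model_id, group in by_model.items():
    print(model_id, [r["prediction_result"]["value"] for r in group])
print("Materials covered:", len(by_material))
```

Because every record carries its own identifiers, the same helper covers the per-model and per-material cases, and a single prediction can be referenced by its `prediction_id` alone.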

## What we've learned

Predictions are often stored and sent as "lists of numbers," which can lead to very simple and efficient data pipelines and handoffs. However, this makes it difficult to track predictions and ensure reproducibility and explainability of our data.

By treating each prediction as being as important as a model, we can ensure that the appropriate metadata is packaged alongside it, so that the provenance of each prediction is easy to understand and communicate.
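
As a closing sketch, here is one way a prediction could be packaged with its metadata at the moment it is made. The field names mirror the structured prediction files shown earlier, but the function `make_prediction_record` is hypothetical, and the `model_id` below is a freshly generated placeholder rather than the identifier of any real model.

```python
import json
import uuid


def make_prediction_record(model_id, material_name, value, citation=None):
    """Bundle a score with the identifiers needed to trace it later."""
    record = {
        "model_id": model_id,
        # A fresh random identifier for this individual prediction.
        "prediction_id": str(uuid.uuid4()),
        "prediction_result": {
            "type": "binary_classification",
            "name": "synthesizability",
            "value": value,
        },
        "prediction_input": {"name": material_name},
    }
    if citation is not None:
        record["citation"] = citation
    return record


# Placeholder model identifier, generated here only for illustration.
example_model_id = str(uuid.uuid4())
record = make_prediction_record(example_model_id, "Li3(WO3)8", 0.39)
print(json.dumps(record, indent=2))
```

Generating the `prediction_id` inside the function means no prediction can be written out without an identifier, so every number that leaves the model can be traced back later.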