# Exploring Chemical Space with SMACT and the Materials Project Database

In this notebook, we explore the space of binary chemical compositions. The same approach can be extended to ternary and quaternary compositions. The workflow relies on two tools: the SMACT filter, used when generating compositions, and the Materials Project database, used to check which compositions already have entries.

The final step sorts the compositions into four categories, according to whether a composition is allowed by the SMACT filter (`smact_allowed`) and whether it is present in the Materials Project database (`mp`). The categories are as follows:

| smact_allowed | mp   | label      |
|---------------|------|------------|
| yes           | yes  | standard   |
| yes           | no   | missing    |
| no            | yes  | interesting|
| no            | no   | unlikely   |

## 1. Generate compositions with the SMACT filter

We begin by generating binary compositions using the SMACT filter. The SMACT filter is a chemical filter that combines an oxidation-state check with an electronegativity test.
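To make the idea of such a filter concrete, here is a minimal toy sketch in plain Python. It is not SMACT's implementation: the function name `toy_binary_allowed` is hypothetical, the oxidation-state lists are a small hand-picked subset chosen for illustration, and only the Pauling electronegativities are standard tabulated values. A composition passes if at least one assignment of oxidation states is charge neutral and the positively charged element is the less electronegative one.

```python
from itertools import product

# Toy data for illustration only; SMACT ships its own, much larger tables.
OXIDATION_STATES = {"Na": [1], "Mg": [2], "O": [-2], "Cl": [-1, 1, 3, 5, 7]}
ELECTRONEGATIVITY = {"Na": 0.93, "Mg": 1.31, "O": 3.44, "Cl": 3.16}  # Pauling scale


def toy_binary_allowed(el_a, el_b, n_a, n_b):
    """Return True if some oxidation-state assignment passes both tests."""
    for q_a, q_b in product(OXIDATION_STATES[el_a], OXIDATION_STATES[el_b]):
        # Test 1: the formula unit must be charge neutral.
        if n_a * q_a + n_b * q_b != 0:
            continue
        # Test 2: the cation must be less electronegative than the anion.
        cation, anion = (el_a, el_b) if q_a > 0 else (el_b, el_a)
        if ELECTRONEGATIVITY[cation] < ELECTRONEGATIVITY[anion]:
            return True
    return False


print(toy_binary_allowed("Na", "Cl", 1, 1))  # NaCl
print(toy_binary_allowed("Na", "Cl", 2, 1))  # Na2Cl
```

With this toy data, NaCl passes because Na(+1) and Cl(-1) balance, while Na2Cl fails because no listed chlorine oxidation state equals -2.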

The [`generate_composition_with_smact`](./generate_composition_with_smact.py) function generates compositions with the SMACT filter. It takes the following parameters:

num_elements: number of elements in the composition

max_stoich: maximum stoichiometry of each element

max_atomic_num: maximum atomic number of each element

num_processes: number of processes to run in parallel

save_path: path to save the dataframe containing the compositions with the SMACT filter
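To get a feel for how `max_atomic_num` and `max_stoich` set the size of the search space, the sketch below counts candidate binary compositions. It assumes that each unordered pair of distinct elements is combined with every stoichiometry pair whose greatest common divisor is 1 (so that A2B2 is not counted separately from AB). Whether `generate_composition_with_smact` enumerates compositions in exactly this way is an assumption, and `count_binary_candidates` is a hypothetical helper.

```python
from itertools import combinations, product
from math import gcd


def count_binary_candidates(max_atomic_num, max_stoich):
    """Count element pairs, reduced stoichiometry pairs, and their product."""
    # Unordered pairs of distinct elements, identified by atomic number.
    n_pairs = sum(1 for _ in combinations(range(1, max_atomic_num + 1), 2))
    # Ordered stoichiometry pairs (a, b) that are already in reduced form.
    n_ratios = sum(
        1
        for a, b in product(range(1, max_stoich + 1), repeat=2)
        if gcd(a, b) == 1
    )
    return n_pairs, n_ratios, n_pairs * n_ratios


n_pairs, n_ratios, n_total = count_binary_candidates(max_atomic_num=103, max_stoich=8)
print(f"{n_pairs} element pairs x {n_ratios} ratios = {n_total} candidates")
```

Under these assumptions the binary space is small enough to enumerate exhaustively. The count grows quickly with the number of elements per composition, which is where `num_processes` becomes useful.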

In [None]:
from generate_composition_with_smact import generate_composition_with_smact

In [None]:
df_smact = generate_composition_with_smact(
    num_elements=2,
    max_stoich=8,
    max_atomic_num=103,
    num_processes=8,
    save_path="data/binary/df_binary_label.pkl",
)

## 2. Download data from the Materials Project database

Next, we download data from the Materials Project database using the `MPRester` class from the [`pymatgen`](https://pymatgen.org/) library. 

The [`download_mp_data`](./download_compounds_with_mp_api.py) function takes the following parameters:

mp_api_key: Materials Project API key

num_elements: number of elements in the composition

max_stoich: maximum stoichiometry of each element

save_dir: path to save the downloaded data

In [None]:
mp_api_key = None  # replace with your own MP API key
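
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below is only a suggestion: the helper `load_mp_api_key` and the variable name `MP_API_KEY` are assumptions made for illustration, not part of the scripts used in this notebook.

```python
import os


def load_mp_api_key(env_var="MP_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable to your MP API key.")
    return key


# Usage (uncomment once the variable is set in your shell):
# mp_api_key = load_mp_api_key()
```

This keeps the key out of saved notebook outputs and version control.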

In [None]:
from download_compounds_with_mp_api import download_mp_data

# download data from MP for binary compounds
save_mp_dir = "data/binary/mp_data"
docs = download_mp_data(
    mp_api_key=mp_api_key,
    num_elements=2,
    max_stoich=8,
    save_dir=save_mp_dir,
)

## 3. Categorize compositions

Finally, we categorize the compositions into four labels: standard, missing, interesting, and unlikely.

In [None]:
from pathlib import Path
import pandas as pd

In [None]:
mp_data = {p.stem: True for p in Path(save_mp_dir).glob("*.json")}
df_mp = pd.DataFrame.from_dict(mp_data, orient="index", columns=["mp"])

In [None]:
# make category dataframe
df_category = df_smact.join(df_mp, how="left").fillna(False)
# make label for each category
dict_label = {
    (True, True): "standard",
    (True, False): "missing",
    (False, True): "interesting",
    (False, False): "unlikely",
}
# look up the label from the (smact_allowed, mp) pair of each row
df_category["label"] = df_category.apply(
    lambda x: dict_label[(x["smact_allowed"], x["mp"])], axis=1
)

# count number of each label
print(df_category["label"].value_counts())

# save dataframe
df_category.to_pickle("data/binary/df_binary_category.pkl")
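
As a quick sanity check of the join-and-label logic, the same steps can be run on a tiny hand-made table. The formulas below are placeholders, not real compounds, and the variable names with the `_toy` suffix are hypothetical. The sketch assumes, as in the cells above, that both dataframes are indexed by the same formula strings.

```python
import pandas as pd

# Placeholder compositions with a made-up SMACT verdict for each.
df_smact_toy = pd.DataFrame(
    {"smact_allowed": [True, True, False, False]},
    index=["A1B1", "A1B2", "A3B1", "A5B7"],
)
# Pretend only two of them were found in the Materials Project download.
df_mp_toy = pd.DataFrame({"mp": [True, True]}, index=["A1B1", "A3B1"])

# Left join keeps every generated composition; rows without a match get NaN.
df_toy = df_smact_toy.join(df_mp_toy, how="left")
# Comparing with True turns NaN into False and yields a clean boolean column.
df_toy["mp"] = df_toy["mp"].eq(True)

dict_label_toy = {
    (True, True): "standard",
    (True, False): "missing",
    (False, True): "interesting",
    (False, False): "unlikely",
}
df_toy["label"] = [
    dict_label_toy[(bool(s), bool(m))]
    for s, m in zip(df_toy["smact_allowed"], df_toy["mp"])
]
print(df_toy)
```

Each of the four rows should receive a different label, matching the table at the top of the notebook.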