# Querying the structure pool for the Cs-Te binary system

_This example reproduces the steps used to create the initial structure pool for the high-throughput calculations published in [doi:10.1063/5.0082710](https://doi.org/10.1063/5.0082710)._

As the initial pool of crystal structures we use the [Materials Project (MP)](https://materialsproject.org) database and the [Open Quantum Materials Database (OQMD)](https://oqmd.org/). Both can be readily interfaced via the ``StructureImporter`` class of the library:

In [None]:
from aim2dat.strct import StructureImporter

strct_imp = StructureImporter()

## Querying crystals from Materials Project and Open Quantum Materials Database

The first argument of each query specifies the chemical system, here via the string `'Cs-Te'`.

For the MP database we query the initial structures (specified via the keyword `structure_type`), since these structures still retain all of their symmetries. Additionally, an individual API key, which can be requested on the MP webpage, has to be passed to the function.

In [None]:
import os

strct_imp.import_from_mp(
    "Cs-Te", os.environ["MP_OPENAPI_KEY"], structure_type="initial",
)
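
The cell above raises a bare `KeyError` when the environment variable `MP_OPENAPI_KEY` is not set. A minimal sketch of a more explicit check, using only the standard library, could look as follows; the helper name `read_api_key` is hypothetical and not part of the library:

```python
import os


def read_api_key(variable="MP_OPENAPI_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(variable, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {variable!r} is not set; "
            "request a key on the MP webpage and export it first."
        )
    return key
```

The returned string can then be passed as the second argument of `import_from_mp` in place of `os.environ["MP_OPENAPI_KEY"]`.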

In [None]:
strct_imp.import_from_oqmd("Cs-Te", query_limit=1000)

The downloaded crystals are stored in a ``StructureCollection`` object which can be accessed via the ``structures`` property. We can check the number of imported structures via `len` or by printing the object:

In [None]:
len(strct_imp.structures)

In [None]:
print(strct_imp)

Since we queried data from two different databases, we should also check whether some crystals are contained in both.
In this case we use the F-fingerprint (<a href="https://doi.org/10.1063/1.3079326" target="_blank">doi:10.1063/1.3079326</a>) to identify duplicate structures. The corresponding function is implemented in the ``StructureOperations`` class.

We can simply pass the ``StructureCollection`` object from the ``StructureImporter`` to the ``StructureOperations`` object upon initialization:

In [None]:
from aim2dat.strct import StructureOperations

strct_op = StructureOperations(structures=strct_imp.structures)
strct_op.n_procs = 2
strct_op.chunksize = 500
strct_op.verbose = False

We use the ``find_duplicates_via_ffingerprint`` function to identify duplicate crystals. The function returns the labels of duplicate pairs and, if `remove_structures` is set to `True`, removes the first member of each pair from the ``StructureCollection`` object:

In [None]:
strct_op.find_duplicates_via_ffingerprint(
    remove_structures=True,
    threshold=0.001,
    r_max=15.0,
    delta_bin=0.005,
    sigma=10.0,
)
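
To illustrate what the `threshold` is compared against: in the cited fingerprint paper, two structures are compared via a cosine distance between their fingerprint vectors. The following is a sketch of that distance on toy vectors, not the library's implementation; the function name and all numbers are made up for illustration.

```python
import math


def cosine_distance(fp_1, fp_2):
    """Cosine distance between two fingerprint vectors, in the range [0, 1]."""
    dot = sum(a * b for a, b in zip(fp_1, fp_2))
    norm_1 = math.sqrt(sum(a * a for a in fp_1))
    norm_2 = math.sqrt(sum(b * b for b in fp_2))
    return 0.5 * (1.0 - dot / (norm_1 * norm_2))


# Toy fingerprint vectors (made-up numbers, one value per distance bin).
fp_a = [-1.0, 0.5, 2.0, -0.3]
fp_b = [-1.0, 0.5, 2.0, -0.3]
fp_c = [0.8, -1.0, 0.1, 1.5]

# Two structures count as duplicates if their distance is below the threshold.
threshold = 0.001
is_duplicate_ab = cosine_distance(fp_a, fp_b) < threshold
is_duplicate_ac = cosine_distance(fp_a, fp_c) < threshold
```

Identical fingerprints give a distance of zero, so a small threshold such as `0.001` only flags structures whose fingerprints are nearly indistinguishable.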

Once again we can check the final number of structures:

In [None]:
len(strct_op.structures)

## Analysing the initial dataset

With the duplicate structures removed, we can split the dataset based on each crystal's source database:

In [None]:
structures_mp = strct_op.structures[:32]
structures_oqmd = strct_op.structures[32:]
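
The slice index `32` is tied to the number of structures returned in this particular run and may change whenever the databases are updated. A sketch of a split that does not depend on counts, assuming every entry carries a label for its source database; the plain dictionaries and the `source` key are stand-ins for illustration, not the library's data model:

```python
from collections import defaultdict

# Stand-in records; the labels and the "source" key are hypothetical.
records = [
    {"label": "mp_1", "source": "MP"},
    {"label": "mp_2", "source": "MP"},
    {"label": "oqmd_1", "source": "OQMD"},
]

# Group the labels by the database they originate from.
by_source = defaultdict(list)
for record in records:
    by_source[record["source"]].append(record["label"])

labels_mp = by_source["MP"]
labels_oqmd = by_source["OQMD"]
```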

We can get a better overview of the crystals by exporting the data into a pandas data frame:

In [None]:
df_mp = structures_mp.to_pandas_df(
    exclude_columns=["functional", "icsd_ids", "magnetic_moment", "direct_band_gap"]
)
df_mp

In [None]:
df_oqmd = structures_oqmd.to_pandas_df(
    exclude_columns=["functional", "icsd_ids", "magnetic_moment", "direct_band_gap"]
)

The dataset can be analyzed in more detail using the ``PhasePlot`` object from the ``plots`` sub-package of the library:

In [None]:
from aim2dat.plots import PhasePlot

Here we use the matplotlib library to create the plots. Interactive plots can also be generated by changing the ``backend`` to `"plotly"`:

In [None]:
phase_diagram = PhasePlot()
phase_diagram.ratio = (9, 4.5)
phase_diagram.show_crystal_system = True
phase_diagram.show_legend = True
phase_diagram.legend_bbox_to_anchor = (1.35, 1.0)
phase_diagram.backend = "matplotlib"

Chemical composition and formation energies can be readily parsed from the pandas data frames:

In [None]:
phase_diagram.import_from_pandas_df("MP", df_mp)
phase_diagram.import_from_pandas_df("OQMD", df_oqmd)

In [None]:
phase_diagram.plot_type = "scatter"
phase_diagram.plot_property = "formation_energy"
phase_diagram.plot(["MP", "OQMD"])

The stability of a phase is defined as its vertical distance from the convex hull:

In [None]:
phase_diagram.plot_property = "stability"
phase_diagram.show_convex_hull = False
phase_diagram.plot(["MP", "OQMD"])
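
The definition of the stability can be made concrete with a small, library-independent sketch for a binary system. It builds the lower convex hull of (concentration, formation energy) points and measures the vertical distance of each phase to it. The function names and the data are made up, and the sign convention (positive values lie above the hull) is an assumption that may differ from the one used by ``PhasePlot``.

```python
def lower_convex_hull(points):
    """Lower convex hull of (concentration, formation energy) points."""
    # Keep only the lowest energy per concentration.
    lowest = {}
    for x, energy in points:
        lowest[x] = min(energy, lowest.get(x, energy))
    hull = []
    for p in sorted(lowest.items()):
        # Drop the previous vertex while it lies on or above the line to p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def stability(point, hull):
    """Vertical distance of a point above the hull (0 for hull members)."""
    x, energy = point
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            # Linear interpolation of the hull energy at concentration x.
            hull_energy = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
            return energy - hull_energy
    raise ValueError("Concentration lies outside the hull range.")


# Made-up data: (Te fraction, formation energy per atom in eV).
phases = [(0.0, 0.0), (0.25, -0.1), (0.5, -0.4), (0.5, -0.35), (0.75, -0.3), (1.0, 0.0)]
hull = lower_convex_hull(phases)
distances = [stability(p, hull) for p in phases]
```

In this toy dataset the phase at a Te fraction of 0.25 lies 0.1 eV above the line connecting the elemental phase and the compound at 0.5, so it is not part of the hull.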

To analyze how the phases are distributed over the composition range, we can plot a histogram of the total number of phases per concentration interval and crystal system:

In [None]:
phase_diagram.plot_type = "numbers"
phase_diagram.y_label = "Nr. of crystals"
phase_diagram.plot(["MP", "OQMD"])
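
The counting behind such a histogram can be sketched with the standard library alone. The entries, the number of bins and the binning rule below are assumptions for illustration and do not reflect how ``PhasePlot`` bins the data internally.

```python
from collections import Counter

# Made-up entries: (Te fraction, crystal system).
entries = [
    (0.0, "cubic"),
    (0.0, "hexagonal"),
    (0.33, "orthorhombic"),
    (0.5, "orthorhombic"),
    (1.0, "trigonal"),
    (1.0, "trigonal"),
]

n_bins = 5
counts = Counter()
for fraction, crystal_system in entries:
    # Map the fraction to a bin index; a fraction of 1.0 falls into the last bin.
    bin_index = min(int(fraction * n_bins), n_bins - 1)
    counts[(bin_index, crystal_system)] += 1
```

Each key of `counts` corresponds to one bar segment: a concentration interval combined with a crystal system.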

## Exploiting chemical similarity to increase the structure pool

The last plot shows that more than two thirds of the structures are elemental phases.
This imbalance arises because most structures in online databases have been determined experimentally.
As a result, the relevant chemical space (in this case the mixed phases) is often under-represented in the dataset, since "simple" compounds are easier to analyze experimentally.

One way to counteract this trend is to exploit the chemical similarity of cations or anions and additionally query structures that contain ions with the same oxidation state as those of the target system. The ions can then be replaced in a second step, yielding a larger variety of structures.
To do so, we import new structures once again.
However, this time we exclude elemental phases straight away by setting the corresponding constraint:

In [None]:
strct_imp = StructureImporter()
strct_imp.neglect_elemental_structures = True

In [None]:
strct_imp.import_from_mp(
    ["K-Te", "Rb-Te", "K-Se", "Rb-Se", "Cs-Se", "K-Po", "Rb-Po", "Cs-Po"],
    os.environ["MP_OPENAPI_KEY"],
    structure_type="initial",
)

In [None]:
strct_imp.import_from_oqmd(
    ["K-Te", "Rb-Te", "K-Se", "Rb-Se", "Cs-Se", "K-Po", "Rb-Po", "Cs-Po"], query_limit=1000
)

Now we can substitute the elements via the ``StructureOperations`` object accordingly:

In [None]:
strct_op.structures = strct_imp.structures
structures_subst = strct_op[strct_op.structures.labels].substitute_elements(
    [("K", "Cs"), ("Rb", "Cs"), ("Se", "Te"), ("Po", "Te")],
    change_label=True,
)
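
Conceptually, each substitution pair maps one element symbol onto another. A minimal sketch on a plain list of element symbols, which is a hypothetical stand-in for a structure and does not mirror the library's internals:

```python
# Substitution pairs as in the cell above: (original element, replacement).
substitutions = dict([("K", "Cs"), ("Rb", "Cs"), ("Se", "Te"), ("Po", "Te")])


def substitute_elements(elements, mapping):
    """Replace element symbols according to mapping; others stay unchanged."""
    return [mapping.get(element, element) for element in elements]


# Hypothetical site list of a K2Se-type cell.
k2se_sites = ["K", "K", "Se"]
cs2te_sites = substitute_elements(k2se_sites, substitutions)
```

In this sketch only the symbols change; the lattice and the atomic positions are not touched.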

Since we now probably have quite a few duplicate structures, we will try to remove them. This time, however, we use a less strict method that filters out likely duplicates using merely the composition and the space group as criteria.

**Note:** In order to reduce the run time, we only take the first 50 crystals for this example.

We can restrict the method to the newly imported structures, in which we substituted the elements, by using the `confined` keyword. This keeps all previous phases in our dataset and applies the composition and space-group criterion only to the newly created phases:

In [None]:
strct_op.structures = structures_mp + structures_oqmd + structures_subst[:50]
strct_op.find_duplicates_via_comp_sym(remove_structures=True, confined=(133, 133 + 50))
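
The idea of a confined duplicate search can be sketched without the library. Each structure is reduced to a key made of its reduced composition and space group number, and only entries inside the confined index range are candidates for removal. The dictionaries, the space group numbers and the rule that a later entry is dropped in favour of an earlier one are assumptions for illustration; the library may resolve duplicates differently.

```python
from functools import reduce
from math import gcd


def comp_sym_key(structure):
    """Key built from the reduced composition and the space group number."""
    counts = structure["composition"]
    divisor = reduce(gcd, counts.values())
    reduced = tuple(sorted((el, n // divisor) for el, n in counts.items()))
    return reduced, structure["space_group"]


def remove_duplicates_confined(structures, confined):
    """Drop entries inside [start, stop) whose key already occurred earlier."""
    start, stop = confined
    seen, kept = set(), []
    for index, structure in enumerate(structures):
        key = comp_sym_key(structure)
        if start <= index < stop and key in seen:
            continue  # duplicate inside the confined range: skip it
        seen.add(key)
        kept.append(structure)
    return kept


# Made-up pool: two original entries followed by two substituted ones.
pool = [
    {"label": "Cs2Te_mp", "composition": {"Cs": 2, "Te": 1}, "space_group": 62},
    {"label": "Cs2Te_mp_copy", "composition": {"Cs": 4, "Te": 2}, "space_group": 62},
    {"label": "Cs2Te_subst", "composition": {"Cs": 8, "Te": 4}, "space_group": 62},
    {"label": "CsTe_subst", "composition": {"Cs": 2, "Te": 2}, "space_group": 12},
]
filtered = remove_duplicates_confined(pool, confined=(2, 4))
kept_labels = [s["label"] for s in filtered]
```

Note that the second entry survives although it shares its key with the first one, because it lies outside the confined range.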

And now we can add the new structures to our plot object:

In [None]:
subst_structures = strct_op.structures[133:]
df_subst = subst_structures.to_pandas_df(
    exclude_columns=["functional", "icsd_ids", "magnetic_moment", "direct_band_gap"]
)
df_subst
phase_diagram.import_from_pandas_df("subst. structures", df_subst)
phase_diagram.plot(["MP", "OQMD", "subst. structures"])

We can clearly see that the number of mixed phases is larger in the new data pool.