# Interacting with ionic species representations using ElementEmbeddings

This notebook will serve as a tutorial for using the ElementEmbeddings package to interact with ionic species representations.

In [1]:
from elementembeddings.core import SpeciesEmbedding
from elementembeddings.composition import SpeciesCompositionalEmbedding, species_composition_featuriser

Elements are the building blocks of chemistry, but species (elements in a given charge state) dictate the structure and properties of inorganic compounds. 

For example, the local spin and atomic environment in Fe(s), FeO, Fe2O3, and Fe3O4 solids are different due to variations in the charge state and coordination of iron.

For composition-only machine learning, there are many representation schemes that encode compounds as vectors built from embeddings of elements. However, these schemes are limited when we want to distinguish ionic species, because the charge state of the element is not taken into account. We therefore need vector representations of the ionic species themselves.

The ElementEmbeddings package contains a set of pre-trained embeddings for elements and ionic species, which can be used to represent ionic species in a vector space.

At the time of writing, the 200-dimensional SkipSpecies vector embeddings are available for ionic species representations. These embeddings are trained using the Skip-gram model on a large dataset of inorganic compounds.

In [2]:
# Load the SkipSpecies vectors as a SpeciesEmbedding object

skipspecies = SpeciesEmbedding.load_data(embedding_name="skipspecies")


print(
    "Below is the representation of Fe3+ using the SkipSpecies vectors."
)

print(skipspecies.embeddings["Fe3+"])

Below is the representation of Fe3+ using the SkipSpecies vectors.
[-3.46536078e-02 -3.23320180e-02 -6.41056001e-02 -6.64595328e-03
 -3.81412022e-02 -9.60185826e-02 -1.92383174e-02 -2.02107765e-02
  8.79131556e-02  9.14798677e-02 -3.54749635e-02 -1.33267939e-01
 -1.77447721e-01 -9.33702961e-02 -7.14094117e-02 -6.68478478e-03
 -1.49846703e-01  3.65290008e-02 -1.11083306e-01  2.04584867e-01
 -7.30767250e-02  7.07381591e-02  1.29051596e-01  8.26864019e-02
 -3.41298096e-02  1.55206323e-01  5.24081439e-02  7.91398287e-02
  1.86461732e-02  1.88235074e-01  1.51956931e-01  1.14296928e-01
 -1.12691864e-01  6.95107281e-02 -1.16133653e-01 -1.42861262e-01
 -3.24610062e-02 -6.37443736e-02  9.47019458e-02 -7.04379454e-02
  1.51012568e-02 -6.04141466e-02 -7.57871270e-02  6.90726042e-02
 -3.73109318e-02 -1.04284994e-01 -7.36037940e-02 -3.05999294e-02
 -4.32690326e-03 -6.09171018e-02  1.28173083e-02  4.53064829e-01
  4.73245084e-02 -1.39801240e+00 -1.01322591e-01 -1.62838653e-01
 -4.33158763e-02 -1.320

We can check which ionic species have a feature vector in a particular embedding.

In [3]:
print("SkipSpecies has feature vectors for the following ionic species:\n")
print(skipspecies.species_list)

SkipSpecies has feature vectors for the following ionic species:

['H+', 'H-', 'Li+', 'Be2+', 'B+', 'B2+', 'B2-', 'B3-', 'B3+', 'B-', 'C4-', 'C-', 'C4+', 'C+', 'C2+', 'C3+', 'C2-', 'C3-', 'N3-', 'N2+', 'N3+', 'N-', 'N+', 'N2-', 'N5+', 'N4+', 'O2-', 'O-', 'F-', 'Na+', 'Mg2+', 'Al3+', 'Al2+', 'Si2+', 'Si4+', 'Si-', 'Si2-', 'Si4-', 'Si3+', 'Si3-', 'P5+', 'P2-', 'P3-', 'P4+', 'P+', 'P-', 'P3+', 'P2+', 'S2-', 'S6+', 'S-', 'S2+', 'S3+', 'S+', 'S4+', 'S5+', 'Cl-', 'Cl7+', 'Cl5+', 'Cl3+', 'K+', 'Ca2+', 'Sc3+', 'Sc+', 'Sc2+', 'Ti3+', 'Ti4+', 'Ti2+', 'V4+', 'V3+', 'V2+', 'V5+', 'Cr3+', 'Cr2+', 'Cr6+', 'Cr4+', 'Cr5+', 'Mn2+', 'Mn3+', 'Mn4+', 'Mn+', 'Mn7+', 'Mn6+', 'Mn5+', 'Fe2+', 'Fe3+', 'Fe+', 'Fe4+', 'Fe6+', 'Fe5+', 'Co2+', 'Co4+', 'Co3+', 'Co+', 'Ni2+', 'Ni4+', 'Ni3+', 'Ni+', 'Cu2+', 'Cu3+', 'Cu+', 'Zn2+', 'Ga+', 'Ga3+', 'Ga4+', 'Ga2+', 'Ge4-', 'Ge4+', 'Ge2-', 'Ge2+', 'Ge3+', 'As-', 'As2-', 'As3+', 'As5+', 'As3-', 'As+', 'As2+', 'As4+', 'Se2-', 'Se-', 'Se4+', 'Se6+', 'Se5+', 'Se2+', 'Se+', 'Se

We can also check which elements have an ionic species representation in the embedding.

In [4]:
print("The following elements have SkipSpecies ionic species representations:\n")
print(skipspecies.element_list)

The following elements have SkipSpecies ionic species representations:

['Ba', 'C', 'Rb', 'Ac', 'Mo', 'Cl', 'U', 'Br', 'Se', 'La', 'Ni', 'Al', 'Au', 'Te', 'Ce', 'Ag', 'Hf', 'Th', 'Mg', 'Cu', 'S', 'Pa', 'Er', 'I', 'Nd', 'Hg', 'Na', 'Ca', 'Cd', 'Be', 'Ir', 'P', 'In', 'Yb', 'As', 'Ru', 'Re', 'Sm', 'Li', 'Tc', 'Cr', 'N', 'Co', 'Tb', 'Pb', 'Sb', 'K', 'Rh', 'Gd', 'Tm', 'W', 'Tl', 'Ge', 'Cs', 'Pm', 'Ho', 'Sn', 'Lu', 'V', 'B', 'Si', 'Ti', 'Pr', 'Bi', 'Ta', 'O', 'Dy', 'Y', 'Os', 'Eu', 'F', 'Fe', 'Sr', 'Pu', 'Mn', 'Sc', 'Pd', 'Np', 'Pt', 'H', 'Zn', 'Nb', 'Zr', 'Ga']
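
The relationship between a species label and its parent element can be sketched in plain Python. The helpers below (`parse_species` and `elements_from_species`) are hypothetical and are not part of ElementEmbeddings; they only assume that labels follow the format seen in the output above, that is, an element symbol, an optional charge magnitude, and a sign.

```python
import re

# Hypothetical helpers, not part of ElementEmbeddings.
# Assumed label format: element symbol + optional magnitude + sign, e.g. "Fe3+".
_SPECIES_PATTERN = re.compile(r"^([A-Z][a-z]?)(\d*)([+-])$")


def parse_species(label):
    """Split a species label into (element symbol, signed integer charge)."""
    match = _SPECIES_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognised species label: {label!r}")
    symbol, magnitude, sign = match.groups()
    # A missing magnitude means a charge of 1, as in "H+" or "Cl-"
    charge = int(magnitude) if magnitude else 1
    return symbol, charge if sign == "+" else -charge


def elements_from_species(species_labels):
    """Return the unique element symbols in order of first appearance."""
    seen = {}
    for label in species_labels:
        symbol, _ = parse_species(label)
        seen.setdefault(symbol, None)
    return list(seen)


species = ["H+", "H-", "Fe2+", "Fe3+", "O2-", "Cl7+"]
print([parse_species(s) for s in species])
# [('H', 1), ('H', -1), ('Fe', 2), ('Fe', 3), ('O', -2), ('Cl', 7)]
print(elements_from_species(species))
# ['H', 'Fe', 'O', 'Cl']
```

Several species map onto the same element, which is why the element list is much shorter than the species list.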


As with the element representations, BibTeX citation information is available for the ionic species embeddings.

In [5]:
print(skipspecies.citation())

['@article{Onwuli_Butler_Walsh_2024, title={Ionic species representations for materials informatics}, DOI={10.26434/chemrxiv-2024-8621l}, journal={ChemRxiv}, author={Onwuli, Anthony and Butler, Keith T. and Walsh, Aron}, year={2024}} This content is a preprint and has not been peer-reviewed.', '@article{antunes2022distributed,title={Distributed representations of atoms and materials for machine learning},author={Antunes, Luis M and Grau-Crespo, Ricardo and Butler, Keith T},journal={npj Computational Materials},volume={8},number={1},pages={1--9},year={2022},publisher={Nature Publishing Group} }']


## Representing ionic compositions using ElementEmbeddings

In addition to representing individual ionic species, we can also represent ionic compositions using the ElementEmbeddings package. This is useful for representing inorganic compounds as vectors. Let's take the example of Fe3O4.

Fe3O4 is a mixed-valence iron oxide, with one Fe2+ and two Fe3+ ions per formula unit. We pass the composition as a dictionary in the following format:

```python
composition = {
    'Fe2+': 1,
    'Fe3+': 2,
    'O2-': 4
    }
```

In [6]:
composition = {
    'Fe2+': 1,
    'Fe3+': 2,
    'O2-': 4
    }

Fe3O4_skipspecies = SpeciesCompositionalEmbedding(formula_dict=composition, embedding=skipspecies)

A few properties are accessible from the `SpeciesCompositionalEmbedding` class.

In [8]:
# Print the pretty formula
print(Fe3O4_skipspecies.formula_pretty)

# Print the list of elements in the composition
print(Fe3O4_skipspecies.element_list)

# Print the list of ionic species in the composition
print(Fe3O4_skipspecies.species_list)

# Print the stoichiometric vector of the composition
print(Fe3O4_skipspecies.stoich_vector)

# Print the normalised stoichiometric vector of the composition
print(Fe3O4_skipspecies.norm_stoich_vector)

# Print the number of atoms
print(Fe3O4_skipspecies.num_atoms)

Fe3O4
['O', 'Fe']
['Fe2+', 'Fe3+', 'O2-']
[1 2 4]
[0.14285714 0.28571429 0.57142857]
7
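
The normalised stoichiometric vector printed above is each species count divided by the total number of atoms. A minimal sketch in plain Python, using a hypothetical helper `normalised_stoichiometry` rather than the package itself, reproduces the same numbers:

```python
# Hypothetical helper, not part of ElementEmbeddings: divide each count
# by the total number of atoms in the formula unit.
def normalised_stoichiometry(formula_dict):
    total = sum(formula_dict.values())
    return {species: amount / total for species, amount in formula_dict.items()}


composition = {"Fe2+": 1, "Fe3+": 2, "O2-": 4}
fractions = normalised_stoichiometry(composition)

for species, fraction in fractions.items():
    print(f"{species}: {fraction:.8f}")
# Fe2+: 0.14285714
# Fe3+: 0.28571429
# O2-: 0.57142857
```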


### Featurising compositions

We can featurise the composition using the `.feature_vector` method, which returns the feature vector for the composition. It works in the same way as the equivalent method of the `CompositionalEmbedding` class for element-based compositions.

The `species_composition_featuriser` can be used to featurise a list of compositions. This is useful for featurising a large number of compositions. It can also export the feature vectors to a pandas DataFrame by setting the `to_dataframe` argument to `True`.

In [9]:
compositions = [
    {"Fe2+": 1, "Fe3+": 2, "O2-": 4},
    {"Fe3+": 2, "O2-": 3},
    {"Li+": 7, "La3+": 3, "Zr4+": 1, "O2-": 12},
    {"Cs+": 1, "Pb2+": 1, "I-": 3},
    {"Pb2+": 1, "Pb4+": 1, "O2-": 3},
]

featurised_comps_df = species_composition_featuriser(
    data=compositions, embedding="skipspecies", stats="mean", to_dataframe=True
)

featurised_comps_df

Computing feature vectors: 100%|██████████| 5/5 [00:00<00:00, 17712.43it/s]


| | formula | composition | mean_0 | mean_1 | mean_2 | mean_3 | mean_4 | mean_5 | mean_6 | mean_7 | ... | mean_190 | mean_191 | mean_192 | mean_193 | mean_194 | mean_195 | mean_196 | mean_197 | mean_198 | mean_199 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fe3O4 | {'Fe2+': 1, 'Fe3+': 2, 'O2-': 4} | -0.018255 | 0.001659 | -0.009839 | 0.00523 | -0.010928 | -0.057023 | -0.002567 | -0.005813 | ... | -0.037202 | -0.008057 | -0.027421 | -0.008534 | -0.009001 | 0.002369 | 0.017834 | -0.055822 | -0.21939 | 0.020507 |
| 1 | Fe2O3 | {'Fe3+': 2, 'O2-': 3} | -0.036597 | -0.009373 | -0.0137 | -0.015516 | -0.020896 | -0.071463 | 0.002221 | -0.014784 | ... | -0.04553 | -0.024589 | -0.037825 | -0.025545 | 0.010654 | -0.002034 | -0.001094 | -0.096479 | -0.211483 | 0.035755 |
| 2 | Li7La3ZrO12 | {'Li+': 7, 'La3+': 3, 'Zr4+': 1, 'O2-': 12} | -0.031236 | -0.015952 | -0.018968 | -0.029273 | -0.005297 | -0.035049 | 0.045972 | -0.032007 | ... | -0.04282 | 0.045177 | -0.056733 | 0.006726 | 0.017449 | -0.023732 | 0.021772 | -0.034134 | -0.102773 | 0.061038 |
| 3 | CsPbI3 | {'Cs+': 1, 'Pb2+': 1, 'I-': 3} | -0.002381 | 0.023988 | -0.026468 | -0.020235 | -0.002876 | -0.033317 | 0.0763 | -0.069057 | ... | 0.055368 | 0.058231 | -0.079549 | -0.032172 | -0.076099 | -0.024554 | 0.108428 | -0.058528 | -0.055804 | -0.031679 |
| 4 | Pb2O3 | {'Pb2+': 1, 'Pb4+': 1, 'O2-': 3} | -0.077403 | -0.015334 | 0.023065 | -0.060073 | -0.04316 | -0.140865 | 0.067917 | -0.044093 | ... | 0.038975 | 0.102474 | -0.051598 | 0.001011 | -0.131225 | -0.026707 | 0.14525 | -0.057493 | -0.18881 | 0.055239 |
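
To make the `stats="mean"` option more concrete, here is a minimal sketch of mean pooling in plain Python. It assumes that the mean is a stoichiometry-weighted average of the species vectors, which is an assumption about the pooling scheme rather than a statement of the package's implementation. The 3-dimensional vectors in `toy_vectors` are invented for illustration and are not SkipSpecies values, and `mean_pool` is a hypothetical helper.

```python
# Toy 3-dimensional vectors, invented for illustration only
# (the real SkipSpecies vectors have 200 dimensions).
toy_vectors = {
    "Fe2+": [1.0, 0.0, 2.0],
    "Fe3+": [0.0, 1.0, 2.0],
    "O2-": [-1.0, -1.0, 0.0],
}


def mean_pool(formula_dict, vectors):
    """Average the species vectors, weighted by their atomic fractions."""
    total = sum(formula_dict.values())
    first_species = next(iter(formula_dict))
    pooled = [0.0] * len(vectors[first_species])
    for species, amount in formula_dict.items():
        weight = amount / total  # normalised stoichiometry
        for i, value in enumerate(vectors[species]):
            pooled[i] += weight * value
    return pooled


fe3o4 = mean_pool({"Fe2+": 1, "Fe3+": 2, "O2-": 4}, toy_vectors)
print([round(x, 4) for x in fe3o4])
# [-0.4286, -0.2857, 0.8571]
```

The pooled vector has the same dimensionality as the species vectors regardless of how many species the composition contains, which is what lets compositions of different sizes share one feature table.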


### Distances between compositions

We can also calculate the "distance" between two compositions using their feature vectors. This can be used to determine which compositions are more similar to each other.

In [11]:
print(
    f"The euclidean distance between Fe3O4 and Fe2O3 is {Fe3O4_skipspecies.distance({'Fe3+': 2, 'O2-': 3}, distance_metric='euclidean', stats='mean'):.2f}"
)
print(
    f"The euclidean distance between Fe3O4 and Pb2O3 is {Fe3O4_skipspecies.distance({'Pb2+': 1, 'Pb4+': 1, 'O2-': 3}, distance_metric='euclidean', stats='mean'):.2f}"
)
print(
    f"The euclidean distance between Fe3O4 and CsPbI3 is {Fe3O4_skipspecies.distance({'Cs+': 1, 'Pb2+': 1, 'I-': 3}, distance_metric='euclidean', stats='mean'):.2f}"
)

The euclidean distance between Fe3O4 and Fe2O3 is 0.38
The euclidean distance between Fe3O4 and Pb2O3 is 1.60
The euclidean distance between Fe3O4 and CsPbI3 is 2.11


Based on the mean-pooled feature vectors, we can see that Fe3O4 is closer to Fe2O3 than to either Pb2O3 or CsPbI3.
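
For reference, the Euclidean metric itself is the square root of the summed squared differences between two feature vectors. The sketch below uses a hypothetical helper `euclidean_distance` and invented 2-dimensional vectors, so the numbers are unrelated to the SkipSpecies results above.

```python
import math


# Hypothetical helper, not part of ElementEmbeddings.
def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Invented 2-dimensional "pooled" vectors, for illustration only
reference = [0.0, 0.0]
near = [0.3, 0.4]
far = [3.0, 4.0]

print(f"reference to near: {euclidean_distance(reference, near):.2f}")
# reference to near: 0.50
print(f"reference to far: {euclidean_distance(reference, far):.2f}")
# reference to far: 5.00
```

A smaller value means the two pooled vectors lie closer together in the embedding space, which is the sense of "similar" used in the comparison above.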