# Parsing Element Compositions
This notebook shows how to load a dataset of elemental compositions and then compute the fraction of each element.

We're going to use [matminer](https://github.com/hackingmaterials/matminer) to do this, which is also a good route [for computing other features](https://github.com/hackingmaterials/matminer_examples/blob/main/matminer_examples/machine_learning-nb/formation_e.ipynb).

In [1]:
from matminer.featurizers.composition import ElementFraction
from matminer.featurizers.conversions import StrToComposition
from pymatgen.core import Composition  # the top-level `from pymatgen import Composition` was removed in newer pymatgen releases
import pandas as pd
import numpy as np

## Make some example data
We'll start with a few simple compositions.

In [2]:
data = pd.DataFrame({'formula': ['NaCl', 'F2O3', 'Ba(NO3)2']})
data

Unnamed: 0,formula
0,NaCl
1,F2O3
2,Ba(NO3)2


## Parsing the compositions
The first thing we need to do before computing element fractions is parse the data from a string to the [pymatgen `Composition` object](https://pymatgen.org/pymatgen.core.composition.html#pymatgen.core.composition.Composition).

In [3]:
comp = Composition('FeO')
comp

Comp: Fe1 O1

These objects provide many methods for studying inorganic compositions (much like what RDKit does for molecules).

In [4]:
comp.get_wt_fraction('Fe')

0.7773048421310499
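As a rough cross-check of the number above, the weight fraction can be reproduced by hand: it is the mass contributed by one element divided by the total mass of the formula unit. The sketch below is not pymatgen's implementation. The atomic masses are approximate values assumed for illustration, and `weight_fraction` is a hypothetical helper name.

```python
# Approximate atomic masses in g/mol (assumed values for this sketch).
ATOMIC_MASS = {"Fe": 55.845, "O": 15.9994}


def weight_fraction(amounts, element):
    """Mass of `element` divided by the total mass of the formula unit."""
    total_mass = sum(ATOMIC_MASS[el] * n for el, n in amounts.items())
    return ATOMIC_MASS[element] * amounts[element] / total_mass


# FeO has one Fe and one O per formula unit.
w_fe = weight_fraction({"Fe": 1, "O": 1}, "Fe")
print(round(w_fe, 4))
```

The result should agree with `comp.get_wt_fraction('Fe')` to within the precision of the assumed masses.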

Matminer provides the ability to parse these compositions in a dataframe automatically.

Create the "conversion tool" and then call `featurize_dataframe` to run the conversion for every entry in the column.

In [5]:
feat = StrToComposition()
feat.featurize_dataframe(data, 'formula', pbar=False, inplace=True)  # Give it the data and the column to be parsed

In [6]:
last = data.iloc[-1]
print(f'The composition of {last["formula"]} was parsed to {last["composition"]}')

The composition of Ba(NO3)2 was parsed to Ba1 N2 O6


Note how it does nice things, like respect parentheses.
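To make the parenthesis handling concrete, here is a minimal standard-library sketch of a formula parser. It is not how pymatgen parses formulas; `parse_formula` is a hypothetical helper that assumes well-formed input with integer subscripts only.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count the atoms in a formula such as 'Ba(NO3)2' (integer subscripts only)."""
    # Split into element symbols, integers, and parentheses.
    tokens = re.findall(r"[A-Z][a-z]?|\d+|[()]", formula)
    pos = 0

    def parse_group():
        nonlocal pos
        counts = Counter()
        while pos < len(tokens) and tokens[pos] != ")":
            token = tokens[pos]
            pos += 1
            if token == "(":
                unit = parse_group()  # recurse into the parenthesized group
                pos += 1              # skip the closing ")"
            else:
                unit = Counter({token: 1})
            # An optional integer multiplier applies to the unit just read.
            multiplier = 1
            if pos < len(tokens) and tokens[pos].isdigit():
                multiplier = int(tokens[pos])
                pos += 1
            for element, amount in unit.items():
                counts[element] += amount * multiplier
        return counts

    return dict(parse_group())


print(parse_formula("Ba(NO3)2"))
```

The multiplier after the closing parenthesis is applied to every element inside the group, which is why `Ba(NO3)2` yields two nitrogen and six oxygen atoms, matching the parsed composition above.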

## Computing element fractions
There is a similar "featurizer" for computing element fractions.

In [7]:
feat = ElementFraction()

In [8]:
feat.featurize_dataframe(data, 'composition', pbar=False, inplace=True)


Note how the dataframe now contains columns named after the elements.

In [9]:
data

Unnamed: 0,formula,composition,H,He,Li,Be,B,C,N,O,...,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr
0,NaCl,"(Na, Cl)",0,0,0,0,0,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,F2O3,"(F, O)",0,0,0,0,0,0,0.0,0.6,...,0,0,0,0,0,0,0,0,0,0
2,Ba(NO3)2,"(Ba, N, O)",0,0,0,0,0,0,0.222222,0.666667,...,0,0,0,0,0,0,0,0,0,0
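The values in the table above are atomic fractions: the amount of each element divided by the total number of atoms in the formula. A minimal sketch of that arithmetic follows. `atomic_fractions` is a hypothetical helper for illustration, not part of matminer.

```python
def atomic_fractions(amounts):
    """Divide each element's amount by the total number of atoms."""
    total = sum(amounts.values())
    return {element: n / total for element, n in amounts.items()}


# Ba(NO3)2 parses to Ba1 N2 O6, which is nine atoms in total.
fractions = atomic_fractions({"Ba": 1, "N": 2, "O": 6})
print({element: round(value, 6) for element, value in fractions.items()})
```

For Ba(NO3)2 this gives 2/9 for N and 6/9 for O, consistent with the `0.222222` and `0.666667` entries in the last row.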


You can get the names of the new columns from the featurizer:

In [10]:
feat.feature_labels()[:10]

['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne']

So, if you wanted to feed them to an ML algorithm, you could select just those columns:

In [11]:
data[feat.feature_labels()]

Unnamed: 0,H,He,Li,Be,B,C,N,O,F,Ne,...,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr
0,0,0,0,0,0,0,0.0,0.0,0.0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0.0,0.6,0.4,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0.222222,0.666667,0.0,0,...,0,0,0,0,0,0,0,0,0,0
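Because most entries in each row are zero, it can help to pair the values with their labels and keep only the nonzero ones when inspecting a row. The sketch below uses plain Python with the first ten labels and the F2O3 row copied from the output above. The variable names are illustrative only.

```python
# First ten feature labels and the matching values of the F2O3 row shown above.
labels = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]
row = [0, 0, 0, 0, 0, 0, 0.0, 0.6, 0.4, 0]

# Keep only the elements that are actually present in the composition.
nonzero = {label: value for label, value in zip(labels, row) if value > 0}
print(nonzero)
```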


You can also call `transform` to compute them on demand:

In [12]:
np.array(feat.transform(data['composition']))

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.5       , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5       , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  