# dataset_composition.ipynb

## Purpose of this notebook

This notebook shows some basic examples of analysis that can be done to study the composition of the LIBE dataset. For instance, we look at the numbers of different species and different types of bonds that appear in the dataset, the different charges and spin multiplicities present, and the sizes of molecules as measured by the number of electrons.

## What you get

A series of bar plots.

In [None]:
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import set_matplotlib_formats

from monty.serialization import loadfn

from pymatgen.core.periodic_table import Element
from pymatgen.analysis.graphs import MoleculeGraph

from deliberate.plotting import plot_bar

set_matplotlib_formats("retina")
%matplotlib inline

First, we need to load the dataset. This may take several minutes.

Note that users will need to change `DATASET_PATH` to the path where the `libe.json` file is located (the dataset is not included in this repository).

In [None]:
DATASET_PATH = "libe.json"

In [None]:
data = loadfn(DATASET_PATH)
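
Because the dataset is not bundled with the repository, a wrong path is a likely first failure. The sketch below checks the path before starting the slow load. `resolve_dataset_path` is a hypothetical helper introduced here for illustration; it is not part of this notebook or of `monty`.

```python
from pathlib import Path


def resolve_dataset_path(path):
    """Return `path` as a Path object, or raise if no file exists there.

    Hypothetical helper, for illustration only.
    """
    p = Path(path).expanduser()
    if not p.is_file():
        raise FileNotFoundError(f"Dataset not found at {p}; update DATASET_PATH.")
    return p
```

Calling `resolve_dataset_path(DATASET_PATH)` before `loadfn` would surface a bad path immediately rather than partway through loading.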

To start, we count how many atoms of each element (C, F, H, Li, N, O, P, and S) are present in the dataset by summing over all molecules. We find that C, H, and O are overrepresented, and there is a relative paucity of N, P, and S. This makes sense, since P is only found in PF6 and associated molecules, while N and S come only from TFSI and FSI fragments.

In [None]:
all_species = np.concatenate([m["species"] for m in data])

plot_bar(
    all_species,
    x_label="Element",
    x_shift=-0.43,
    y_shift=20,
)
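
To read off the numbers behind a plot like this one, the same tally can be made with `collections.Counter`. This is a minimal sketch on made-up entries; it assumes only that each entry has a `species` list of element symbols, as in the cell above. `count_atoms` and `toy_data` are hypothetical names.

```python
from collections import Counter


def count_atoms(entries):
    """Count every atom of each element across all entries."""
    counts = Counter()
    for entry in entries:
        counts.update(entry["species"])
    return counts


# Toy entries for illustration; these are not taken from LIBE.
toy_data = [
    {"species": ["C", "O", "O"]},
    {"species": ["Li", "O", "H"]},
]
atom_counts = count_atoms(toy_data)
print(atom_counts.most_common())
```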

Then, we can look at how many unique molecules include each element. While C, H, and O still dominate (most molecules contain these three elements), the disparity is not so vast.

In [None]:
molecules_with_species = np.concatenate([list(set(m["species"])) for m in data])
plot_bar(
    molecules_with_species,
    x_label="Element",
    x_shift=-0.43,
    y_shift=20
)
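
The difference from the atom tally is the `set(...)` call, which makes each molecule count at most once per element. A small sketch under the same assumption about the `species` key shows this and also expresses the result as a fraction of all molecules. The function name and toy entries are hypothetical.

```python
from collections import Counter


def molecules_per_element(entries):
    """Count how many entries contain each element at least once."""
    counts = Counter()
    for entry in entries:
        # set() removes repeated symbols within one molecule.
        counts.update(set(entry["species"]))
    return counts


# Toy entries for illustration; these are not taken from LIBE.
toy_data = [
    {"species": ["C", "O", "O"]},
    {"species": ["Li", "O", "H"]},
    {"species": ["H", "H"]},
]
presence = molecules_per_element(toy_data)
fractions = {el: n / len(toy_data) for el, n in presence.items()}
```

Here O appears three times as an atom but only in two molecules, which is why this view narrows the gap between common and rare elements.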

A more useful analysis might be looking at the bonding present in this dataset. As perhaps expected from the species information, the most common bonds are C-C, C-H, and C-O. Li-O bonds are also common, however. For most molecules in the dataset, oxygen is the most favored atom that can coordinate with Li.

In [None]:
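bond_types = []
for d in data:
    # Some entries have no bond list; skip them.
    if d["bonds"] is not None:
        for bd in d["bonds"]:
            bd_species = (d["species"][bd[0]], d["species"][bd[1]])
            # Sort the two symbols so that "O-C" and "C-O" are counted together.
            bond_label = "-".join(sorted(bd_species))
            bond_types.append(bond_label)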

plot_bar(
    bond_types,
    x_label="Bond type",
    x_shift=-0.5,
    y_shift=30,
    fig_size=(18, 6),
)
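
The claim about Li coordination can be checked directly by tallying which element sits at the other end of every bond involving Li. The sketch below assumes, as in the cell above, that each bond is a pair of indices into `species`. `li_partners` and the toy entries are hypothetical.

```python
from collections import Counter


def li_partners(entries):
    """Tally the element bonded to Li in every bond that involves Li."""
    partners = Counter()
    for entry in entries:
        species = entry["species"]
        for i, j in entry["bonds"] or []:  # treat a missing bond list as empty
            a, b = species[i], species[j]
            if a == "Li":
                partners[b] += 1
            elif b == "Li":
                partners[a] += 1
    return partners


# Toy entries for illustration; these are not taken from LIBE.
toy_data = [
    {"species": ["Li", "O", "C", "O"], "bonds": [[0, 1], [1, 2], [2, 3], [0, 3]]},
    {"species": ["Li", "F"], "bonds": [[0, 1]]},
    {"species": ["H"], "bonds": None},
]
partners = li_partners(toy_data)
```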

Looking at the number of bonds per molecule, we see a roughly bell-shaped distribution with a long tail toward larger values. Most molecules in LIBE are relatively small and contain a modest number of bonds, but there is a small number of much larger molecules containing many more bonds (up to 27).

In [None]:
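# Skip entries without a bond list, matching the None check in the bond-type cell.
num_bonds = [len(d["bonds"]) for d in data if d["bonds"] is not None]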

plot_bar(
    num_bonds,
    x_label="Number of bonds",
    x_shift=-0.5,
    y_shift=2,
    fig_size=(15, 6),
)
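
A long tail can also be summarized numerically: the median stays small while the maximum sits far above it. The sketch below uses only the standard library. `tail_summary`, the threshold, and the toy values are made up for illustration and are not LIBE statistics.

```python
from statistics import median


def tail_summary(values, threshold):
    """Return the median, the maximum, and the share of values above `threshold`."""
    return {
        "median": median(values),
        "max": max(values),
        "share_above": sum(v > threshold for v in values) / len(values),
    }


# Toy bond counts for illustration; these are not taken from LIBE.
toy_num_bonds = [3, 4, 4, 5, 5, 5, 6, 6, 7, 27]
summary = tail_summary(toy_num_bonds, threshold=10)
```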

Now let's look at the distribution of molecule size as measured by the number of electrons. We plot this as a histogram to make the plot easier to read. We see a distribution that looks rather similar to the "number of bonds" plot above. Again, most molecules are relatively small, but there is a small set of somewhat larger molecules, with as many as 140 electrons.

In [None]:
nelectrons = []
for d in data:
    # Electron count = sum of atomic numbers minus the net charge.
    n_elec = sum(Element(s).Z for s in d["species"]) - d["charge"]
    nelectrons.append(n_elec)

plot_bar(
    nelectrons,
    x_label="Number of electrons",
    x_shift=-0.2,
    y_shift=2,
    fig_size=(18, 6),
    font_scale=0.8,
    num_bins=20
)
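
As a worked example of the electron count, the sum of atomic numbers minus the net charge can be evaluated by hand for two simple ions. To stay self-contained, this sketch replaces `Element(s).Z` with a small lookup table covering the eight elements in the dataset. `ATOMIC_NUMBER` and `count_electrons` are hypothetical names.

```python
# Atomic numbers for the eight elements that appear in the dataset.
ATOMIC_NUMBER = {"H": 1, "Li": 3, "C": 6, "N": 7, "O": 8, "F": 9, "P": 15, "S": 16}


def count_electrons(species, charge):
    """Sum of atomic numbers minus the net charge."""
    return sum(ATOMIC_NUMBER[s] for s in species) - charge


# PF6 anion: 15 + 6 * 9 = 69 protons, plus one extra electron.
pf6_anion = count_electrons(["P"] + ["F"] * 6, charge=-1)
# Li cation: 3 protons, minus one electron.
li_cation = count_electrons(["Li"], charge=1)
```

Subtracting the charge means anions gain electrons and cations lose them, which is why the sign in the notebook cell is a minus.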

Finally, we can look at charges and spin multiplicities. Anionic molecules outnumber both cationic and neutral species, though the disparity is not great. On the other hand, relatively few triplets were analyzed (triplet DFT calculations were only conducted on those singlets with fewer than 50 electrons, and not all of these calculations succeeded).

In [None]:
charges = [d["charge"] for d in data]

plot_bar(
    charges,
    x_label="Molecule charge",
    x_shift=-0.1,
    y_shift=10,
)

In [None]:
spins = [d["spin_multiplicity"] for d in data]

plot_bar(
    spins,
    x_label="Molecule spin multiplicity",
    x_shift=-0.1,
    y_shift=10,
)
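
Charge and spin multiplicity are linked through the electron count. A multiplicity of 2S + 1 implies 2S unpaired electrons, and the remaining electrons must pair up, so the two numbers have to agree in parity. The sketch below is a sanity check one could run on the entries; the notebook itself does not do this. `spin_is_consistent` is a hypothetical helper.

```python
def spin_is_consistent(n_electrons, spin_multiplicity):
    """True if the implied number of unpaired electrons fits the electron count."""
    unpaired = spin_multiplicity - 1
    # The unpaired electrons cannot exceed the total, and the rest must form pairs.
    return 0 <= unpaired <= n_electrons and (n_electrons - unpaired) % 2 == 0
```

For example, a 70-electron species can be a singlet or a triplet but not a doublet, so `spin_is_consistent(70, 2)` evaluates to `False`.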