# filters.ipynb

## Purpose of this notebook

This notebook shows several ways to filter the LIBE dataset. Although LIBE has already been filtered, some applications may call for restricting it further.

A number of filters are currently defined in `src/deliberate/filters.py`:
- `filter_by_charge_and_isomorphism`: Remove duplicate molecules that are isomorphic and have the same charge (keeping, for instance, whichever of the singlet or triplet state has the lower electronic energy)
- `filter_species`: Remove molecules containing certain species
- `filter_bond_species`: Remove molecules with certain types of bonds
- `filter_bond_length`: Remove molecules with bond lengths much longer than is typical
- `filter_num_bonds`: Remove molecules in which any atom has an inappropriate number of bonds (for instance, carbon with more than 4 bonds)

The `filter_dataset` function can apply multiple filters in sequence. Additional filter functions can be added, provided that they return a `List[Dict]`.
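As a rough illustration of what an additional filter might look like, here is a minimal sketch. It assumes that a filter takes a `List[Dict]` of entries and returns the entries to keep; only the return type is stated above, so the input signature is an assumption. The names `filter_max_abs_charge` and `apply_filters` and the `"charge"` key are hypothetical, and `apply_filters` is only a stand-in for `filter_dataset`, not its actual implementation.

```python
from typing import Callable, Dict, List


# Hypothetical custom filter: the signature and the "charge" key are assumptions.
def filter_max_abs_charge(entries: List[Dict], max_abs_charge: int = 1) -> List[Dict]:
    """Keep only entries whose absolute charge is at most max_abs_charge."""
    return [entry for entry in entries if abs(entry["charge"]) <= max_abs_charge]


# Stand-in for filter_dataset, for illustration only: run each filter in turn,
# feeding the output of one filter into the next.
def apply_filters(entries: List[Dict],
                  filter_funcs: List[Callable[[List[Dict]], List[Dict]]]) -> List[Dict]:
    for func in filter_funcs:
        entries = func(entries)
    return entries


# Toy entries, not real LIBE data.
toy = [{"charge": 0}, {"charge": -1}, {"charge": 2}]
kept = apply_filters(toy, [filter_max_abs_charge])
```

Because each filter returns a new list rather than modifying its input, the original data stays available for comparison after filtering.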

## What you get

`final`, a somewhat reduced LIBE dataset.

In [None]:
from typing import Dict, List, Optional, Callable

from monty.serialization import dumpfn, loadfn

import deliberate.filters as filters

First, we need to load the dataset. This may take several minutes.

Change `DATASET_PATH` to the location of your `libe.json` file; the dataset is not included in this repository.

In [None]:
DATASET_PATH = "libe.json"

In [None]:
data = loadfn(DATASET_PATH)

We'll first use each of the implemented filters individually to see their effect on the dataset.
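One simple way to quantify each filter's effect is to compare the number of entries before and after filtering. The sketch below uses a hypothetical helper, `report_removed`, and toy data; it assumes only that the dataset and each filtered result are lists, so the same helper could be called on the real results once the cells below have run.

```python
from typing import Dict, List


def report_removed(original: List[Dict], filtered: Dict[str, List[Dict]]) -> Dict[str, int]:
    """Return, for each named filtered result, how many entries were removed."""
    removed = {name: len(original) - len(result) for name, result in filtered.items()}
    for name, count in removed.items():
        print(f"{name}: removed {count} of {len(original)} entries")
    return removed


# Toy stand-ins for the dataset and two filtered results (not real LIBE data).
toy_original = [{"id": i} for i in range(5)]
toy_results = {"by_species": toy_original[:3], "by_num_bonds": toy_original[:5]}
removed = report_removed(toy_original, toy_results)
```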

In [None]:
by_charge_and_isomorphism = filters.filter_dataset(data,
                                                   [filters.filter_by_charge_and_isomorphism])

In [None]:
by_species = filters.filter_dataset(data,
                                    [filters.filter_species])

In [None]:
by_bond_species = filters.filter_dataset(data,
                                         [filters.filter_bond_species])

In [None]:
by_bond_length = filters.filter_dataset(data,
                                        [filters.filter_bond_length])

In [None]:
by_num_bonds = filters.filter_dataset(data,
                                      [filters.filter_num_bonds])

Now let's apply all of these filters in sequence!

In [None]:
final = filters.filter_dataset(data,
                               [filters.filter_by_charge_and_isomorphism,
                                filters.filter_species,
                                filters.filter_bond_species,
                                filters.filter_bond_length,
                                filters.filter_num_bonds])