# Data-driven filters
We first import the dataframe of compounds that the GBR model predicts to have a useful band gap.

**Note:** As the GBR model constructed previously relies on a certain amount of randomness, there is no guarantee that the same set of compounds will fall within the narrow window of $1.73 < E_g < 1.77$ eV each time. Below, we load the dataframe that was generated for the published work.
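
The window in the note above amounts to a strict inequality filter on the predicted gap. The following is a minimal sketch of that filter using only the standard library. The formulas and gap values are made-up placeholders, not entries from the published dataframe, and `in_window` is a hypothetical helper name.

```python
# Hypothetical predictions: formula -> predicted band gap in eV.
# These are placeholder values, not data from the published work.
predicted_gaps = {
    "A2BCO6": 1.745,
    "AB2CO5": 1.802,
    "ABC2O7": 1.731,
    "A3BCO4": 1.699,
}

E_MIN, E_MAX = 1.73, 1.77  # window bounds in eV, as stated in the note


def in_window(gap, low=E_MIN, high=E_MAX):
    """Return True if the gap lies strictly inside the open interval (low, high)."""
    return low < gap < high


# Keep only the compounds whose predicted gap falls inside the window
useful = {formula: gap for formula, gap in predicted_gaps.items() if in_window(gap)}
print(sorted(useful))
```

Because the bounds are strict, a prediction that moves by a few meV between training runs can enter or leave the window, which is why the set of compounds is not reproducible without the published file.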

In [27]:
### Imports ###
import pandas as pd
import smact
from smact import Species
from smact.oxidation_states import Oxidation_state_probability_finder
# Recent pymatgen releases no longer expose these at the package root
from pymatgen.core import Composition, Structure
import json

In [28]:
useful_BGs_used = pd.read_csv('data/Useful_BGs_published.csv')
useful_BGs_used.describe()

Unnamed: 0.1,Unnamed: 0,0-norm,2-norm,3-norm,5-norm,7-norm,10-norm,minimum Number,maximum Number,range Number,...,avg d valence electrons,avg f valence electrons,compound possible,max ionic char,avg ionic char,band center,HOMO_energy,LUMO_energy,gap_AO,gbr_gap
count,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,...,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0,17833.0
mean,1716694.0,4.0,0.588358,0.525421,0.498867,0.492881,0.489838,7.70235,65.008075,57.305725,...,2.840999,2.518925,0.146358,0.64285,0.115284,-2.321422,-0.225943,-0.222739,0.003204,1.75008
std,936707.8,0.0,0.061514,0.085514,0.10325,0.109118,0.11277,1.035162,16.5043,16.425818,...,1.58399,2.445449,0.353475,0.124394,0.027984,0.370841,0.052302,0.053096,0.013211,0.011513
min,165.0,4.0,0.5,0.39685,0.329877,0.304753,0.287175,3.0,20.0,12.0,...,0.0,0.0,0.0,0.183314,0.022018,-3.070597,-0.338381,-0.338381,0.0,1.730002
25%,860721.0,4.0,0.52915,0.444796,0.407234,0.401326,0.400117,8.0,50.0,43.0,...,1.583333,0.0,0.0,0.563121,0.096728,-2.600631,-0.26654,-0.26654,0.0,1.740014
50%,1831704.0,4.0,0.57735,0.517872,0.501229,0.500098,0.500003,8.0,73.0,65.0,...,2.6,2.0,0.0,0.681744,0.115808,-2.376573,-0.220603,-0.21776,0.0,1.750122
75%,2478602.0,4.0,0.648074,0.60912,0.600524,0.60004,0.600001,8.0,78.0,70.0,...,3.75,4.0,0.0,0.733532,0.135219,-2.067298,-0.182464,-0.180198,0.0,1.76002
max,3216891.0,4.0,0.744123,0.72869,0.727286,0.727273,0.727273,8.0,83.0,80.0,...,9.090909,12.6,1.0,0.908097,0.202455,-1.003085,-0.085375,-0.078699,0.132508,1.769999


### Sort by sustainability according to HHI

In [29]:
# There are some elements we definitely don't want from the get-go
unwanted_els = ['Be','Hg','Pb','Tl','Pr','Nd','Sm','Gd','Dy','Ho','Er','Tm','Lu','Hf','Ta']

# Convert to dict for ease
all_comps = list(useful_BGs_used.T.to_dict().values())

# Reduce down to wanted compounds: drop any composition containing an unwanted element
wanted_comps = []
for i in all_comps:
    list_els = Composition(i['composition_obj']).elements
    if not any(el.symbol in unwanted_els for el in list_els):
        wanted_comps.append(i)

# Work out sustainability score (based on HHI) for each composition:
# the sum over elements of weight fraction times the element's HHI_r value
def sus_calc(comp):
    sus_factor = 0
    for i in comp.elements:
        sus_factor += comp.get_wt_fraction(i) * smact.Element(i.symbol).HHI_r
    return sus_factor

for i in wanted_comps:
    i['sus_factor'] = sus_calc(Composition(i['composition_obj']))
    
# Return to dataframe        
filtered_useful_BGs = pd.DataFrame.from_dict(wanted_comps)

In [30]:
filtered_useful_BGs = filtered_useful_BGs.sort_values(by='sus_factor', ascending=True)
filtered_useful_BGs = filtered_useful_BGs.reset_index(drop=True)
selected_formulas = list(filtered_useful_BGs['pretty_formula'])
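
The score computed by `sus_calc` is a weight-fraction-weighted sum, $S = \sum_i w_i \, \mathrm{HHI}_i$, where $w_i$ is the mass fraction of element $i$, and lower values are treated as more sustainable. A dependency-free sketch of the same arithmetic and of the ascending sort is given below. The element labels, atomic masses and HHI values are illustrative placeholders, not the data shipped with SMACT, and `weight_fractions` and `sus_score` are hypothetical helper names.

```python
# Placeholder inputs only: not real atomic masses or HHI values.
atomic_mass = {"A": 40.0, "B": 60.0, "O": 16.0}
hhi_r = {"A": 1000.0, "B": 4000.0, "O": 500.0}


def weight_fractions(amounts):
    """Map each element to its mass fraction, given {element: atoms per formula unit}."""
    total = sum(n * atomic_mass[el] for el, n in amounts.items())
    return {el: n * atomic_mass[el] / total for el, n in amounts.items()}


def sus_score(amounts):
    """Weight-fraction-weighted sum of per-element HHI values."""
    fracs = weight_fractions(amounts)
    return sum(fracs[el] * hhi_r[el] for el in amounts)


candidates = {
    "ABO3": {"A": 1, "B": 1, "O": 3},
    "A2BO4": {"A": 2, "B": 1, "O": 4},
}

# Sort ascending so that the lowest (most sustainable) score comes first
ranked = sorted(candidates, key=lambda formula: sus_score(candidates[formula]))
print(ranked)
```

In this toy case `A2BO4` ranks first because the high-HHI element `B` makes up a smaller share of its mass.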

### Assign structures
Structures are found using the [method of Hautier et al.](https://pubs.acs.org/doi/10.1021/ic102031h) as [implemented in pymatgen](http://pymatgen.org/pymatgen.analysis.structure_prediction.substitutor.html).
This approach requires a database of structures that includes oxidation states. Such a database cannot be stored publicly in this repository, but it can be generated using the Materials Project API and the following steps:
1. Download structures from the Materials Project. 
2. Use the [Pymatgen bond valence analyser](http://pymatgen.org/pymatgen.analysis.bond_valence.html) to add oxidation states to all structures for which it is possible. 
3. Use these structures as a pool of candidates for the [ionic substitution algorithm](http://pymatgen.org/pymatgen.analysis.structure_prediction.substitutor.html) with a probability threshold set to 1E-5 to find structures for compositions in the `selected_formulas` list above, starting with the most sustainable (top of the list, lowest HHI score).
4. Repeat until enough (in our case 235) structures are found.
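
The control flow of steps 3 and 4 can be sketched as a loop over the ranked formulas that stops once the target count is reached. This sketch does not call pymatgen: `predict` is a hypothetical callable standing in for the ionic substitution step, `toy_predict` is a stub with made-up results, and keeping only the first candidate per formula is an assumption made for illustration.

```python
def collect_structures(formulas, predict, n_target):
    """Walk formulas in ranked order and stop once n_target structures are collected.

    `predict` is a hypothetical callable that takes a formula and returns a list
    of candidate structures (empty if none clears the probability threshold).
    """
    found = []
    for formula in formulas:
        candidates = predict(formula)
        if candidates:
            # Assumption for illustration: keep only the first candidate
            found.append((formula, candidates[0]))
        if len(found) >= n_target:
            break
    return found


def toy_predict(formula):
    """Stand-in for the substitution step; returns made-up structure labels."""
    known = {
        "A2BCO6": ["structure-1"],
        "ABC2O7": ["structure-2", "structure-3"],
        "A3BCO4": ["structure-4"],
    }
    return known.get(formula, [])


# Formulas are assumed to be sorted already, most sustainable first
ranked_formulas = ["A2BCO6", "AB2CO5", "ABC2O7", "A3BCO4"]
result = collect_structures(ranked_formulas, toy_predict, n_target=2)
print(result)
```

Because the list is traversed from the lowest HHI score upwards, stopping early keeps the most sustainable compositions for which a structure could be found.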

We import the 235 structures used in the published work below.

In [32]:
with open('data/structures_published.json', 'r') as f:
    quaternary_oxides_to_calc = json.load(f)

# Convert back to pymatgen Structure objects
quaternary_oxides_to_calc = [Structure.from_dict(i) for i in quaternary_oxides_to_calc]

### Apply oxidation state probability filter
We now check that the oxidation states adopted by the metals in each compound are likely given the anions that are present. This is done according to a statistical model [outlined here](https://pubs.rsc.org/en/Content/ArticleLanding/2018/FD/C8FD00032H) and implemented in the [SMACT oxidation states module](https://smact.readthedocs.io/en/latest/smact.oxidation_states.html).

In this example, because we have imported only the 235 structures that were taken forward to first-principles calculations in the original publication, all of them pass the test at a threshold of 0.005.
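
To make the thresholding step concrete, here is a toy sketch that scores a compound from cation-anion pair probabilities and compares the score with 0.005. It assumes, purely for illustration, that the compound score is the mean of the pair probabilities; the actual SMACT implementation may combine them differently. The species labels and probabilities are placeholders, and `compound_score` is a hypothetical helper name.

```python
from statistics import mean

THRESHOLD = 0.005  # threshold quoted in the text

# Placeholder probabilities of finding each cation alongside the anion.
# These are not values from the SMACT lookup table.
pair_prob = {
    ("A2+", "O2-"): 0.60,
    ("B4+", "O2-"): 0.80,
    ("C3+", "O2-"): 0.002,
    ("D3+", "O2-"): 0.004,
}


def compound_score(cations, anion):
    """Assumed scoring rule: mean of the cation-anion pair probabilities."""
    return mean(pair_prob[(cation, anion)] for cation in cations)


compounds = {
    "X": ["A2+", "B4+"],
    "Y": ["C3+", "D3+"],
}

# Keep the compounds whose score is at or above the threshold
passed = [
    name for name, cations in compounds.items()
    if compound_score(cations, "O2-") >= THRESHOLD
]
print(passed)
```

Here compound `Y` is rejected because both of its cations are rarely observed with the anion in this made-up table.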

In [33]:
ox = Oxidation_state_probability_finder()

# We ignore other non-metal species that may be present in positive oxidation states
metals_and_O = smact.metals + ['O']

num_passes = 0
for struc in quaternary_oxides_to_calc:
    # Get a list of pymatgen species that we want to consider
    species = [i.specie for i in struc]
    species = [i for i in species if i.symbol in metals_and_O]
    
    # pass the species to the probability calculator and filter
    prob = ox.compound_probability(species)
    if prob < 0.005: 
        print(species)
        print('Below threshold!')
    else:
        num_passes += 1
        
print('number of compounds to pass the oxidation state probability test: {}'.format(num_passes))

number of compounds to pass the oxidation state probability test: 235


These compounds are taken forward in the following notebook to calculate thermodynamic stability and electronic properties.