# Checking A,P,R Atoms

In this notebook, I am checking that the identities of the A, P, and R atoms calculated in Notebook 01 with CrystalNN match the actual identities of the atoms from the main database. We will be comparing the dataframe pba_e_hull_df, from Notebook 01, with the json file pba_w_APR, which contains all of the data directly from the main database.

## Importing and Cleaning pba_w_APR

pba_w_APR is a json file. Let's first look to see if it's in the proper format to import:

In [1]:
with open('pba_w_APR.json', 'r') as file :
  pba_json = file.read()

In [2]:
print(pba_json[:1000])

[{"R": "Co", "P": "Fe", "A": "Ca", "n": 4, "input": {"structure": {"@module": "pymatgen.core.structure", "@class": "Structure", "lattice": {"matrix": [[9.95090252, -0.0003358, -0.0003358], [-0.0003358, 9.95090252, 0.0003358], [-0.0003358, 0.0003358, 9.95090252]], "a": 9.9509025313318, "b": 9.9509025313318, "c": 9.9509025313318, "alpha": 89.99613296435679, "beta": 90.00386703564321, "gamma": 90.00386703564321, "volume": 985.3429511575596}, "sites": [{"species": [{"element": "Ca", "occu": 1}], "abc": [0.75135993, 0.75127745, 0.75127745], "xyz": [7.476204862928603, 7.47588864272739, 7.47588864272739], "label": "Ca"}, {"species": [{"element": "Ca", "occu": 1}], "abc": [0.24872255, 0.24864007, 0.75127745], "xyz": [2.47467807727261, 2.474361857071396, 7.47588864272739], "label": "Ca"}, {"species": [{"element": "Ca", "occu": 1}], "abc": [0.24872255, 0.75127745, 0.24864007], "xyz": [2.47467807727261, 7.47588864272739, 2.474361857071396], "label": "Ca"}, {"species": [{"element": "Ca", "occu": 1

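This looks like it's in the right format. We'll import the data using the loadfn function from monty, which loads the file into a list of Python dictionaries and decodes the serialized pymatgen objects (the items tagged with "@module" and "@class") back into pymatgen objects. This is the same function used in Notebook 01.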

In [3]:
from monty.serialization import loadfn

In [4]:
data_1 = loadfn('pba_w_APR.json')

In [5]:
#data_1[0]

Let's loop through data_1 and parse the composition and atom identities into a pandas dataframe, which we'll then use to compare with the previously created dataframe.

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [49]:
pba_APR_df = pd.DataFrame(index=range(len(data_1)),columns=['A_atom','P_atom','R_atom','n_A'])
for i in range(len(data_1)):
    try:
        pba_APR_df.loc[i, 'P_atom'] = data_1[i]['P']
        pba_APR_df.loc[i, 'R_atom'] = data_1[i]['R']
        if 'A' in data_1[i]:
            pba_APR_df.loc[i, 'A_atom'] = data_1[i]['A']
        if 'n' in data_1[i]:
            pba_APR_df.loc[i, 'n_A'] = data_1[i]['n']
    except KeyError:
        print(i) #the printed indices are entries that raised an error, i.e. entries with no P or R field

3755
3756
3757
3758
3759
3760
3761
3762
3763
3764
3765
3766
3767
3768
3769
3770
3771
3772
3773
3774
3775
3776
3777
3778
3779
3780
3781
3782


In [50]:
pba_APR_df.head()

Unnamed: 0,A_atom,P_atom,R_atom,n_A
0,Ca,Fe,Co,4
1,Mg,Cr,Os,4
2,Ca,Fe,Mn,4
3,Ca,Mn,Os,4
4,Li,Cr,Cr,4
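
A possible alternative to the try/except loop above, shown as a sketch on made-up records rather than on the real file: `dict.get` returns `None` for a missing key, so entries that lack a P or R field can be collected explicitly instead of being caught by an exception. The `records` list and the `parse_apr` helper are hypothetical names introduced for illustration.

```python
# Toy stand-ins for entries of data_1; the last one lacks the P and R fields.
records = [
    {"R": "Co", "P": "Fe", "A": "Ca", "n": 4},
    {"R": "Os", "P": "Cr", "A": "Mg", "n": 4},
    {"A": "Li", "n": 4},
]

def parse_apr(entries):
    """Return (rows, missing): one row per entry plus the indices lacking P or R."""
    rows, missing = [], []
    for i, entry in enumerate(entries):
        rows.append({
            "A_atom": entry.get("A"),
            "P_atom": entry.get("P"),
            "R_atom": entry.get("R"),
            "n_A": entry.get("n"),
        })
        if "P" not in entry or "R" not in entry:
            missing.append(i)
    return rows, missing

rows, missing = parse_apr(records)
print(missing)  # [2]
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.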


In [9]:
# len(pba_APR_df)

As we can see, this dataset contains many more entries than the dataset originally analyzed in Notebook 01. For the purposes of checking the A, P, and R atoms from Notebook 01, we won't worry too much about this right now.

The method that we'll use is to loop through pba_e_hull_df, and for each structure check that there is a corresponding structure in pba_APR_df with the same A, P, and R atoms.

If our CrystalNN method misclassified any of the atom identities in Notebook 01, then it likely will not match any of the entries in the pba_APR_df, so we'll know there was an error.

In [10]:
#Importing pba_e_hull_df:
pba_e_hull_df = pd.read_csv('pba_e_hull_df.csv')
pba_e_hull_df.drop('Unnamed: 0', axis = 1, inplace = True)

In [11]:
pba_e_hull_df.head()

Unnamed: 0,Composition,e_above_hull,A_atom,P_atom,R_atom,n_A
0,Ca4 Fe4 Co4 C24 N24,0.435563,Ca,Fe,Co,4.0
1,Mg4 Cr4 Os4 C24 N24,0.319625,Mg,Cr,Os,4.0
2,Ca4 Mn4 Fe4 C24 N24,0.38262,Ca,Fe,Mn,4.0
3,Ca4 Mn4 Os4 C24 N24,0.303886,Ca,Mn,Os,4.0
4,Li4 Cr8 C24 N24,0.320146,Li,Cr,Cr,4.0


In [13]:
# list_of_errors = [] #This will be a list of the structures in the original df that do not correspond to any 
# # of the structures in pba_APR_df.
# for i in range(len(pba_e_hull_df)):
#     A_atom = pba_e_hull_df.iloc[i]['A_atom']
#     P_atom = pba_e_hull_df.iloc[i]['P_atom']
#     R_atom = pba_e_hull_df.iloc[i]['R_atom']
#     A_entries = pba_APR_df[pba_APR_df['A_atom'] == A_atom]
#     AP_entries = A_entries[A_entries['P_atom'] == P_atom]
#     APR_entries = AP_entries[AP_entries['R_atom'] == R_atom]
#     if len(APR_entries) == 0:
#         list_of_errors.append(i)
# print(list_of_errors)

As we can see, the list of errors is empty, so all of the structures from the Notebook 01 dataframe, pba_e_hull_df, have corresponding structures in this new dataframe. This is consistent with our CrystalNN algorithm assigning the A, P, and R atoms correctly.
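
The matching check described above can also be expressed as a set-membership test. The following is a minimal sketch on toy rows copied from the two `head()` outputs, not a replacement for the full comparison; `reference` and `calculated` are illustrative names.

```python
# (A, P, R) triples taken from the first rows of pba_APR_df (database values).
reference = {
    ("Ca", "Fe", "Co"),
    ("Mg", "Cr", "Os"),
    ("Ca", "Fe", "Mn"),
    ("Ca", "Mn", "Os"),
    ("Li", "Cr", "Cr"),
}

# (A, P, R) triples taken from the first rows of pba_e_hull_df (CrystalNN values).
calculated = [
    ("Ca", "Fe", "Co"),
    ("Mg", "Cr", "Os"),
    ("Ca", "Fe", "Mn"),
    ("Ca", "Mn", "Os"),
    ("Li", "Cr", "Cr"),
]

# Indices of calculated rows that have no matching triple in the reference set.
list_of_errors = [i for i, triple in enumerate(calculated) if triple not in reference]
print(list_of_errors)  # prints [] for these five rows
```

Note that a match only shows that the triple exists somewhere in the reference data, not that it belongs to the same structure, so this check can miss a misclassification that happens to coincide with another valid triple.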

## Filtering out structures with H or O

Because the new data contains significantly more structures, we will first filter out any structures containing H or O (because those aren't PBAs), and then we will see whether there are indeed additional PBAs in this data that weren't in the original file.

Let's loop through the list of structures in data_1 and create a pymatgen entry for each. Then we'll check whether each structure contains H or O. If it doesn't, we'll add its pymatgen entry to a list of PBAs; otherwise we'll add an empty string, so that the list indices stay aligned with data_1.

In [14]:
import pymatgen as mg
from pymatgen.entries.computed_entries import ComputedEntry

In [15]:
pba_entries = []
for i in range((len(data_1))):
    #Creating pymatgen entry:
    entry = ComputedEntry(composition=data_1[i]['input']['structure'].composition,
                              energy=data_1[i]['output']['energy'],
                              parameters = {"nelect": data_1[i]['input']['parameters']['NELECT'],
                                            "hubbards": data_1[i]['input']['hubbards'],
                                            "potcar_spec": data_1[i]['input']['potcar_spec'],
                                            "is_hubbard": data_1[i]['input']['is_hubbard']})
    
    if 'H' in entry.composition or 'O' in entry.composition: #Filtering out structures with hydrogen or oxygen
        pba_entries.append('')
    elif entry.composition.as_dict()['C'] != 24 or entry.composition.as_dict()['N'] != 24:
        #Filtering out structures that don't have 24 Cs or 24 Ns - these are also not PBAs
        pba_entries.append('')
    else:
        pba_entries.append(entry)
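
The two filters in the cell above can be isolated as a small predicate. This sketch works on plain element-count dictionaries (the shape returned by `composition.as_dict()`) so that it runs without pymatgen. The function name `looks_like_pba` is made up for illustration, and the second and third test compositions are invented.

```python
def looks_like_pba(element_counts):
    """Apply the notebook's two filters to an {element: count} mapping."""
    # Structures containing hydrogen or oxygen are not PBAs.
    if "H" in element_counts or "O" in element_counts:
        return False
    # A PBA cell in this dataset has exactly 24 C and 24 N atoms.
    # .get() avoids a KeyError when C or N is absent altogether.
    return element_counts.get("C", 0) == 24 and element_counts.get("N", 0) == 24

print(looks_like_pba({"Ca": 4, "Fe": 4, "Co": 4, "C": 24, "N": 24}))  # True
print(looks_like_pba({"Ca": 4, "Fe": 4, "C": 24, "N": 24, "O": 2}))   # False
print(looks_like_pba({"Li": 4, "Cr": 8, "C": 20, "N": 24}))           # False
```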

In [16]:
len(pba_entries)

3783

In [18]:
len(data_1)

3783

After filtering out all the structures containing oxygen or hydrogen, as well as the structures that don't contain 24 nitrogens and 24 carbons, we see that there are still 3259 structures remaining. (pba_entries itself still has 3783 elements because each filtered-out structure is stored as an empty string.) Therefore, it appears that either this new dataset contains a significant number of repeats or there are actually a large number of new PBAs. Let's dive deeper into the data to find out.
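
One way to tell these two cases apart is to count how often each composition occurs. The sketch below uses `collections.Counter` on formula strings. The list `formulas` is a toy example: the formulas appear in this notebook, but the repeat counts are invented.

```python
from collections import Counter

formulas = [
    "Ca4 Fe4 Co4 C24 N24",
    "Mg4 Cr4 Os4 C24 N24",
    "Ca4 Fe4 Co4 C24 N24",
    "Li4 Cr8 C24 N24",
    "Mg4 Cr4 Os4 C24 N24",
    "Ca4 Fe4 Co4 C24 N24",
]

counts = Counter(formulas)
n_unique = len(counts)  # number of distinct compositions
repeated = {formula: n for formula, n in counts.items() if n > 1}

print(n_unique)  # 3
print(repeated)  # {'Ca4 Fe4 Co4 C24 N24': 3, 'Mg4 Cr4 Os4 C24 N24': 2}
```

If the number of distinct compositions is much smaller than the number of remaining structures, repeats rather than new PBAs account for most of the difference.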

In [30]:
# composition_dict = {}
# errors = []
# for i in pba_entries:
#     if type(i) == str:
#         continue
#     if i.composition in composition_dict.keys():
#         if i.energy != composition_dict[i.composition]:
#             errors.append(i)
#     composition_dict[i.composition] = i.energy
# print(errors)

[ComputedEntry None - Mg4 Cr4 Os4 C24 N24
Energy = -500.7650
Correction = 0.0000
Parameters:
nelect = 352.0
hubbards = {}
potcar_spec = [{'titel': 'PAW_PBE Mg_pv 06Sep2000', 'hash': 'bbcf6f81cc34a3090d483ad641178746'}, {'titel': 'PAW_PBE Cr_pv 07Sep2000', 'hash': 'eb23364cc25164418f9f79efd8f04f7d'}, {'titel': 'PAW_PBE Os_pv 20Jan2003', 'hash': '7bb96cace1809ebeb4d030d71024c5bf'}, {'titel': 'PAW_PBE C 08Apr2002', 'hash': 'c0a8167dbb174fe492a3db7f5006c0f8'}, {'titel': 'PAW_PBE N 08Apr2002', 'hash': 'b98fd027ddebc67da4063ff2cabbc04b'}]
is_hubbard = False
Data:, ComputedEntry None - Li4 Cr8 C24 N24
Energy = -496.8777
Correction = 0.0000
Parameters:
nelect = 324.0
hubbards = {}
potcar_spec = [{'titel': 'PAW_PBE Li_sv 23Jan2001', 'hash': '4799bab014a83a07c654d7196c8ecfa9'}, {'titel': 'PAW_PBE Cr_pv 07Sep2000', 'hash': 'eb23364cc25164418f9f79efd8f04f7d'}, {'titel': 'PAW_PBE C 08Apr2002', 'hash': 'c0a8167dbb174fe492a3db7f5006c0f8'}, {'titel': 'PAW_PBE N 08Apr2002', 'hash': 'b98fd027ddebc67da40

## Phase diagrams with multiple PBAs

After checking with Shyam, it turns out that there are repeats in the file because the same atoms can have various configurations within the PBA structure. Therefore, we should make the phase diagram with all of the PBA compositions, and then pymatgen will just calculate the lowest possible energy above hull for each composition.
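
The idea of keeping the lowest value per composition can be sketched without pymatgen. In the toy data below, the formulas come from this notebook but every energy-above-hull value is made up; `lowest_per_composition` is an illustrative helper, not a pymatgen function.

```python
def lowest_per_composition(pairs):
    """Map each formula to the smallest energy above hull among its configurations."""
    best = {}
    for formula, e_above_hull in pairs:
        if formula not in best or e_above_hull < best[formula]:
            best[formula] = e_above_hull
    return best

# (formula, energy above hull); all numbers are hypothetical.
configurations = [
    ("Mg4 Cr4 Os4 C24 N24", 0.32),
    ("Mg4 Cr4 Os4 C24 N24", 0.41),
    ("Li4 Cr8 C24 N24", 0.35),
    ("Li4 Cr8 C24 N24", 0.33),
]

lowest = lowest_per_composition(configurations)
print(lowest)  # {'Mg4 Cr4 Os4 C24 N24': 0.32, 'Li4 Cr8 C24 N24': 0.33}
```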

In [31]:
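import os
from pymatgen.ext.matproj import MPRester
mpr = MPRester(api_key=os.environ['PMG_MAPI_KEY']) #assumes the Materials Project API key is stored in the PMG_MAPI_KEY environment variable rather than hard-coded in the notebook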
from pymatgen.entries.compatibility import MaterialsProjectCompatibility
mpc = MaterialsProjectCompatibility()
from pymatgen.analysis.phase_diagram import PhaseDiagram, PDPlotter
import pymatgen.analysis.local_env as localenv
crys = localenv.CrystalNN()

In [36]:
def my_crystal_nn(struct):
    A_atom_nn,P_atom_nn, R_atom_nn = '','','' #Initializing A, P, and R atoms
    for j in range(len(struct)): #Looping through all of the elements in the PBA
        atom = struct[j].as_dict()['species'][0]['element']
        if atom == 'C' and len(R_atom_nn)==0:
            nn_objs = crys.get_nn_info(struct, j) #getting the nearest neighbors for atom, which is the jth atom in struct
            for k in range(len(nn_objs)):
                nn = nn_objs[k]['site'].as_dict()['species'][0]['element']
                if nn != 'N':
                    R_atom_nn = nn
                    break
        elif atom == 'N' and len(P_atom_nn)==0:
            nn_objs = crys.get_nn_info(struct, j) #getting the nearest neighbors for atom, which is the jth atom in struct
            for k in range(len(nn_objs)):
                nn = nn_objs[k]['site'].as_dict()['species'][0]['element']
                if nn != 'C':
                    P_atom_nn = nn
                    break
    for j in struct.composition.as_dict().keys():
        if j not in [P_atom_nn,R_atom_nn,'C','N']:
            A_atom_nn = j
            break
    return [A_atom_nn,P_atom_nn,R_atom_nn]

In [72]:
my_crystal_nn(struct)

['Ca', 'Fe', 'Co']

In [101]:
pba_e_hull_df = pd.DataFrame(index=range(len(data_1)),columns=['Composition','Formula','e_above_hull','A_atom','P_atom','R_atom','n_A'])
for i in range(1):#(len(pba_entries)):
    pba_entry = pba_entries[i]
    
    if type(pba_entry) == str: #If the entry is a string, then it was filtered out and we didn't create a pymatgen entry for it.
        continue #This check must come before accessing .composition, which a string does not have.
    
    struct=data_1[i]['input']['structure']
    comp = pba_entry.composition
    
    #Making list of other PBAs from list pba_entries with the same composition:
    same_comps = []
    for j in range(len(pba_entries)):
        if type(pba_entries[j]) is str: #skip empty entries
            continue
        if pba_entries[j].composition == comp:
            same_comps.append([pba_entries[j],j])
            
    comp_dict = {'structure': my_crystal_nn(struct)}
    for k in range(len(same_comps)):
        comp_dict[same_comps[k][1]] = my_crystal_nn(data_1[same_comps[k][1]]['input']['structure'])
    #comp_dict is now a dictionary of all structures in data_1 with the same composition. Each key is the index
    #of a structure in data_1 (plus the temporary 'structure' key), and each value is a list of the [A,P,R] atom identities.
    
    for k in range(len(same_comps)):
        if comp_dict[same_comps[k][1]][1] != comp_dict['structure'][1] or comp_dict[same_comps[k][1]][2] != comp_dict['structure'][2]:
            del comp_dict[same_comps[k][1]]

    #We won't need the 'structure' entry in comp_dict anymore as we just used it to compare against the other entries.
    del comp_dict['structure']
    
    #We also want to drop the structures with different A, P, R atoms from the same_comps list.
    same_comps_and_APR = []
    for k in range(len(same_comps)):
        if same_comps[k][1] in comp_dict.keys(): #check if the index for that structure is in the comp_dict keys
            same_comps_and_APR.append(same_comps[k])
    
#     #comp_dict is now a dictionary of the structures in data_1 with same composition AND same A,P,R atoms
    
#     #Accessing MP data for structures with any of the same atoms:
#     comp_atoms = comp.as_dict().keys()
#     entries = mpr.get_entries_in_chemsys(comp_atoms)
#     #Adding our pbas to this list of pymatgen entries:
#     entries = entries + same_comps

#     #Applying correction using MPRester:
#     corrected_entries = mpc.process_entries(entries)

#     #Making phase diagram and accessing e_above_hull:
#     phase_d = PhaseDiagram(corrected_entries)
    
#     #Looping through the PBA structures to find the lowest energy above hull value:
#     e_above_hull_list = []
#     for k in range(len(same_comps)):
#         e_above_hull_list.append(phase_d.get_e_above_hull(corrected_entries[-k]))
#     e_above_hull = min(e_above_hull_list)
    
#     #Adding the values known so far to the main df:
#     pba_e_hull_df.loc[i, 'Composition'] = comp
#     pba_e_hull_df.loc[i, 'Formula'] = comp.formula
#     pba_e_hull_df.loc[i, 'e_above_hull'] = e_above_hull
    
#     #Adding the A, P, and R atoms to main df:
#     pba_e_hull_df.loc[i, 'A_atom'] = A_atom_nn
#     pba_e_hull_df.loc[i, 'P_atom'] = P_atom_nn
#     pba_e_hull_df.loc[i, 'R_atom'] = R_atom_nn
#     pba_e_hull_df.loc[i, 'n_A'] = struct.composition.as_dict()[A_atom_nn]



In [102]:
comp_dict.keys()

dict_keys([0, 899, 988, 1835])

In [105]:
len(same_comps_and_APR)

4

In [104]:
len(same_comps)

5
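
The difference between the two lengths above comes from the P/R filter. A compact sketch of that filter, assuming the [A, P, R] lists have already been computed: indices 0, 899, 988, and 1835 are the keys shown in the `comp_dict.keys()` output, while index 9999 and its swapped P/R assignment are hypothetical and only stand in for the fifth structure.

```python
# [A, P, R] for the structure being processed.
reference = ["Ca", "Fe", "Co"]

# data_1 index -> [A, P, R] for every structure with the same composition.
same_composition = {
    0: ["Ca", "Fe", "Co"],
    899: ["Ca", "Fe", "Co"],
    988: ["Ca", "Fe", "Co"],
    1835: ["Ca", "Fe", "Co"],
    9999: ["Ca", "Co", "Fe"],  # hypothetical entry with P and R swapped
}

# Keep only the entries whose P (position 1) and R (position 2) match the reference.
same_comp_and_apr = {
    idx: apr
    for idx, apr in same_composition.items()
    if apr[1:] == reference[1:]
}

print(list(same_comp_and_apr))  # [0, 899, 988, 1835]
```

Building a new dictionary avoids deleting keys from `comp_dict` in a separate loop and removes the need for the temporary 'structure' key.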