## Notebook for data curation

This notebook generates two datasets in the extended .xyz format: one for the CSD-2K, CSD-1K and CSD-S546 data (CSD-3K+S546_shift_tensors.xyz) and one for the CSD-500 and CSD-S104 data (CSD-500+104-7_shift_tensors.xyz). Each file contains the atom-wise isotropic chemical shifts and shift tensors extracted from the .magres files of the original CSD-X data (archived at https://archive.materialscloud.org/record/2019.0023/v1).

The CSD-500+104-7_shift_tensors.xyz file lacks 7 structures from the originally reported CSD-500+104.xyz file. I noticed that 7 structures in the CSD-500+104.xyz file (‘JEKPIZ’, ‘NECFIM’, ‘CUMZAM’, ‘ODEJAJ01’, ‘ITOFEE’, ‘WEPBAV’, ‘FOQREK’) have no .magres file. These structures were excluded from the tensor-containing .xyz file. I did this by taking the intersection of two dicts containing {key (CSD identifier): value (e.g. atoms object)} pairs.
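The dict intersection described above can be sketched as follows. This is a minimal illustration, not the notebook's actual data: the dict names are hypothetical, the string values stand in for ASE atoms objects, and "ABCDEF" and "GHIJKL" are made-up identifiers.

```python
# Hypothetical stand-ins: in the notebook the values would be ASE Atoms objects.
xyz_entries = {"JEKPIZ": "atoms_a", "ABCDEF": "atoms_b", "GHIJKL": "atoms_c"}
# Structures for which a .magres file was found (only the keys matter here).
magres_entries = {"ABCDEF": None, "GHIJKL": None}

# Keep only the structures that appear in both dicts.
common = {name: xyz_entries[name] for name in xyz_entries if name in magres_entries}

# Listing the dropped identifiers makes the exclusion easy to audit.
missing = sorted(set(xyz_entries) - set(magres_entries))

print(common)
print(missing)
```

Printing the dropped identifiers next to the kept ones is a cheap way to confirm that exactly the expected structures were excluded.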

A second file, helpers.py, contains helper functions that build CSD-identifier:property dicts, testing functions, and a modified version of the .magres parser from the ASE library. The modification is necessary because either the .magres parsers in ASE and at http://tfgg.me/magres-format/build/html/index.html are broken, or the .magres files were not written according to the specification.

The problem arises because large isotropic shift and tensor values are not always white-space separated from the preceding field, for example:
'm N-6.0351   57.0173   -6.0146   39.7698   98.8410   49.6738   13.0950   46.7860-115.2668'

The not-so-nice fix is to replace each "-" with " -" in the file and then restore the only other lines in the .magres file that contain "-" (the format header, some unit information and the calculator version number).
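The effect of this fix on the example line above can be sketched like this. The function name `separate_minus_signs` is an assumption made for illustration; the replacement strings are the ones used in the processing cell of this notebook.

```python
raw_line = ("m N-6.0351   57.0173   -6.0146   39.7698   98.8410   "
            "49.6738   13.0950   46.7860-115.2668")
header = "#$magres-abinitio-v1.0"
units = "units sus 10^-6.cm^3.mol^-1"


def separate_minus_signs(text):
    """Put a space in front of every '-' and undo it where '-' is not a sign."""
    fixed = text.replace("-", " -")
    fixed = fixed.replace("units sus 10^ -6.cm^3.mol^ -1", "units sus 10^-6.cm^3.mol^-1")
    fixed = fixed.replace("#$magres -abinitio -v1.0", "#$magres-abinitio-v1.0")
    fixed = fixed.replace("QE -GIPAW 5.x", "QE-GIPAW 5.x")
    return fixed


tokens_before = raw_line.split()
tokens_after = separate_minus_signs(raw_line).split()

# Before the fix two pairs of fields are fused; afterwards all nine values parse.
values = [float(token) for token in tokens_after[2:]]
print(len(tokens_before), len(tokens_after))
print(values)
```

The approach is blunt: any other line that legitimately contains "-" would also be altered, so it relies on the three restored strings being the only such lines in these files.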

The modified ASE parser takes the already modified file contents as input instead of a file path.

In [3]:
import os
import glob
from ase.io import read, write
import numpy as np

from helpers import *

In [4]:
#Build dictionary with {key: STRUCTURE-CSD-NAME value: atoms object}

extyz_dict = build_extxy_dict("CSD-500+S104.xyz")


#datasets
datasets = ["CSD-500","CSD-S104"] #test directories
contained_in = "magres"
extension = "*magres"

#build path
#./CSD-500/magres/*magres

extyz_dict_tens = {}

for dataset in datasets:
    DATASETPATH = os.path.join(os.getcwd(),dataset,contained_in,extension)
    files = glob.glob(DATASETPATH)
    print(DATASETPATH)
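    for file in files:
        # file name without its directory and without the .nmr.magres suffix
        structname = os.path.basename(file)
        if structname.endswith(".nmr.magres"):
            structname = structname[:-len(".nmr.magres")]
        extyz_dict_tens.update({structname : None})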

#build intersection of two sets (.xyz files) necessary for train set due to the 7 missing files
final_dict = {x:extyz_dict[x] for x in extyz_dict 
                              if x in extyz_dict_tens}  

# else, just do:
# final_dict = {x:extyz_dict[x] for x in extyz_dict}

/Users/matthiaskellner/Desktop/EPFL_2021/COSMO_project/make_tensor_data/CSD-500/magres/*magres
/Users/matthiaskellner/Desktop/EPFL_2021/COSMO_project/make_tensor_data/CSD-S104/magres/*magres


In [5]:
#Build dictionary with {key: STRUCTURE-CSD-NAME value: STATUS}

all_ids = generate_status_dict("CSD-3k+S546.xyz","PASSING")
suspicious = generate_status_dict("./frames_status/frames_suspicious.xyz","SUSPICIOUS")
outliers = generate_status_dict("./frames_status/frames_blatant_outliers.xyz","FAIL")

all_ids.update(suspicious) 
all_ids.update(outliers)

In [6]:
# loop through the datasets (magres directories and the files they contain)
# extract the CSD name from the file name
# read the .magres files and generate atoms objects from them
# remove atoms.info pairs that are garbage
# if set_status (for training) is True: write the status to the info dict
# flatten the shift tensor and store it under another name
# replace the coordinates with those from the .extxyz file (higher precision)


# Directories where the .magres files are located
datasets= ["CSD-500","CSD-S104"] #["CSD-2k","CSD-1k","CSD-S546"]
contained_in = "magres"
extension = "*magres"

structs = []

set_status=False

for dataset in datasets:
    #build the glob pattern from the working directory, the dataset directory and the file extension
    DATASETPATH = os.path.join(os.getcwd(),dataset,contained_in,extension)
    files = glob.glob(DATASETPATH)
    print(DATASETPATH)
    
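    for file in files:
        # file name without its directory and without the .nmr.magres suffix
        structname = os.path.basename(file)
        if structname.endswith(".nmr.magres"):
            structname = structname[:-len(".nmr.magres")]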
        #print(structname)
        
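        with open(file) as f:
            fd = f.read()
            # put a space in front of every "-" so that fused values become separate fields
            fd = fd.replace("-"," -")
            # restore the lines in which "-" is not a minus sign (units, format header, calculator version)
            fd = fd.replace("units sus 10^ -6.cm^3.mol^ -1","units sus 10^-6.cm^3.mol^-1")
            fd = fd.replace("#$magres -abinitio -v1.0","#$magres-abinitio-v1.0")
            fd = fd.replace("QE -GIPAW 5.x","QE-GIPAW 5.x")
            # parse the repaired file contents
            atoms = read_magres_modified(fd)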
        
        if set_status is True:
            atoms.info.update({"STATUS":all_ids[structname]})
        
        #-----flatten TENSOR-----
        atoms.arrays.update({"cs_tensor": atoms.arrays["ms"].reshape((-1,9))})
        atoms.info.update({"magres_units": {'cs_tensor': 'ppm', 'cs_iso': 'ppm'}})
            
        
        #----remove labels and indices
        atoms.arrays.pop("ms")
        atoms.arrays.pop("indices")
        atoms.arrays.pop("labels")
        
        #remove garbage from comments
        atoms.info.pop("magresblock_calculation")
        
        #check if structname is in final_dict:
        #necessary because of the 7 missing files; probably too complicated
        if structname in final_dict:
            atoms.info.update({"NAME":structname})
            atoms.info.update({"ENERGY": final_dict[structname].info["ENERGY"]})
            atoms.set_positions(final_dict[structname].get_positions())
            atoms.set_cell(final_dict[structname].get_cell())
            atoms.arrays.update({"cs_iso": final_dict[structname].arrays["CS"]})
            structs.append(atoms)

/Users/matthiaskellner/Desktop/EPFL_2021/COSMO_project/make_tensor_data/CSD-500/magres/*magres
/Users/matthiaskellner/Desktop/EPFL_2021/COSMO_project/make_tensor_data/CSD-S104/magres/*magres


In [7]:
#write("CSD-500+104-7_shift_tensors.xyz",structs,format="extxyz")

In [8]:
check_plausibility("./test_tensor/CSD-500+104-7_shift_tensors.xyz","CSD-500+S104.xyz") 
# checks that PBC, cell, coordinates and shifts were transferred correctly
# diagonalizes the shift tensor and averages the eigenvalues, then compares the average to the
# isotropic shift to ensure that the tensor values were written correctly
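The idea behind the eigenvalue check described in the comments above can be sketched as follows. This is not the implementation in helpers.py: the tensor values are made up, symmetrizing the tensor before diagonalization is an assumption, and how `check_plausibility` relates the result to the stored `cs_iso` values is not shown here.

```python
import numpy as np

# Hypothetical flattened tensor (9 values, row-major), not taken from the dataset.
flat = np.array([10.0, 2.0, 0.5,
                 1.0, 20.0, 3.0,
                 0.0, 4.0, 30.0])

# Undo the flattening that reshape((-1, 9)) applies per atom.
tensor = flat.reshape(3, 3)

# Assumption: diagonalize the symmetric part, which has real eigenvalues.
symmetric = 0.5 * (tensor + tensor.T)
eigenvalues = np.linalg.eigvalsh(symmetric)

# The mean eigenvalue equals one third of the trace of the tensor.
iso_from_eigenvalues = eigenvalues.mean()
iso_from_trace = np.trace(tensor) / 3.0

print(iso_from_eigenvalues, iso_from_trace)
```

Because the mean eigenvalue is fixed by the trace, a mismatch with the isotropic value points to wrongly written or wrongly ordered diagonal elements; the check does not detect swapped off-diagonal elements.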

In [10]:
#can be used to check if status was written correctly
test_status("./train_tensor/CSD-3k+S546_shift_tensors.xyz",all_ids) 

In [239]:
"""
struct_iso_good = read("teststructs_iso_good.xyz",format="extxyz")
struct_tensor_good = read("teststructs_tens_good.xyz",format="extxyz")

#-----testing the comparison helper functions
bad_structs_iso = [read(this,format="extxyz") for this in ["teststructs_iso_bad_no_PBC.xyz","teststructs_iso_bad_cell.xyz","teststructs_iso_bad_coordinates.xyz","teststructs_iso_bad_shift.xyz"]]
bad_structs_tens = [read(this,format="extxyz") for this in ["teststructs_tens_bad_no_PBC.xyz","teststructs_tens_bad_cell.xyz", "teststructs_tens_bad_shift.xyz","teststructs_tens_bad_coordinates.xyz"]]

for bad_struct in bad_structs_iso:
    print(compaire(bad_struct,struct_tensor_good))
for bad_struct in bad_structs_tens:
    print(compaire(struct_iso_good,bad_struct))
    
compaire(struct_iso_bad,struct_tensor_bad)
"""

[True, False, False, True]
False
[True, False, True, True]
False
[False, True, True, True]
False
[True, True, True, False]
False
[True, False, True, True]
False
[True, False, True, True]
False
[True, True, True, False]
False
[False, True, True, True]
False
