# Benchmark with other quantum chemistry output file parser projects

As far as I know, several mature parsers offer Python APIs for quantum chemistry output files and have been in development for several years:

1. [pymatgen](https://pymatgen.org/pymatgen.io.qchem.html)
2. [cclib](https://cclib.github.io/)
3. [openeye](https://docs.eyesopen.com/toolkits/python/oechemtk/molreadwrite.html) 

*Attention*: According to its documentation, [openeye](https://docs.eyesopen.com/toolkits/python/oechemtk/molreadwrite.html) should be powerful, but its sample code crashed the kernel when I ran it, so I cannot include it in this comparison.

The main functions of the above projects are not exactly the same as those of MolOP. This benchmark only covers performance and functionality comparisons of our completed parsers. Our project currently only supports parsing of Gaussian and xTB related files, while the other projects have extensive support for various quantum chemistry software. We may add support for other software formats in the future, but for now, if you need parsing of other software outputs, you can use the mature projects above to accomplish your goals.

In fact, I didn't realise these projects existed until I decided to develop MolOP. If I had known about them earlier, perhaps MolOP would have been just another tool that provided a way to go **from xyz coordinates to molecular graphs**. You can still use only this functionality: I have wrapped it in a separate function, and you can learn more about it at [**structure_recover_benchmark**](structure_recover_benchmark.md).

In [1]:
from codetiming import Timer
from glob import glob
from molop import AutoParser
from molop.config import molopconfig
from pymatgen.io import gaussian
from cclib.io import ccopen


@Timer(name="decorator")
def pymatgen_parser(file_wildcard: str):
    result = []
    for file in glob(file_wildcard):
        try:
            result.append(gaussian.GaussianOutput(file))
        except Exception as e:
            print(f"Error parsing file {file}: {e}")
    return result


@Timer(name="decorator")
def cclib_parser(file_wildcard: str):
    result = []
    for file in glob(file_wildcard):
        try:
            result.append(ccopen(file).parse())
        except Exception as e:
            print(f"Error parsing file {file}: {e}")
    return result


@Timer(name="decorator")
def molop_parser(file_wildcard: str, quiet: bool = False, **kwargs):
    if quiet:
        molopconfig.quiet()
    else:
        molopconfig.verbose()
    return AutoParser(file_wildcard, **kwargs)
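
The `Timer` decorator above prints the elapsed time of every call. When comparing parsers over repeated runs, it can help to collect the timings per parser and summarise them instead of reading them off the log. The following is a minimal, standard-library-only sketch of that idea; `timed`, `TIMINGS`, `summarize` and `toy_parser` are hypothetical names introduced for illustration and are not part of codetiming or MolOP.

```python
import statistics
import time
from collections import defaultdict
from functools import wraps

# Hypothetical registry: label -> list of elapsed wall-clock times in seconds.
TIMINGS = defaultdict(list)


def timed(label):
    """Record the wall-clock time of every call under `label`."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the time even if the wrapped call raises.
                TIMINGS[label].append(time.perf_counter() - start)

        return wrapper

    return decorator


def summarize(label):
    """Return (number of runs, mean, sample standard deviation) for one label."""
    runs = TIMINGS[label]
    spread = statistics.stdev(runs) if len(runs) > 1 else 0.0
    return len(runs), statistics.mean(runs), spread


@timed("toy_parser")
def toy_parser(n):
    # Stand-in workload; a real benchmark would call one of the parsers here.
    return sum(range(n))


for _ in range(5):
    toy_parser(10_000)

count, mean, spread = summarize("toy_parser")
print(f"toy_parser: {count} runs, mean {mean:.6f} s, stdev {spread:.6f} s")
```

Giving each parser its own label keeps the runs separate, so the mean and spread can be compared directly.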

## Benchmark files

The benchmark files are located in the `tests/test_files` directory. All of them are valid Gaussian output files.

In [2]:
print(AutoParser("../../tests/test_files/g16log/*.log").to_summary_df().to_markdown())

MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:02<00:00, 34.21it/s]
INFO - 0 files failed to parse, 85 successfully parsed


|    | parser            | file_path                                                                                                                 | file_name                                                                    | file_format   | version                                    |   frame_index |   charge |   multiplicity | SMILES                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Canonical SMILES                                                

## Time benchmark

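In short, with its default parallel parsing MolOP is roughly 3 to 6 times faster than pymatgen and cclib on this file set (both of which are run sequentially here), and roughly 8 to 14 times faster when only the last frame is extracted. Single-threaded, it is on par with cclib and slower than pymatgen.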

In [3]:
for i in range(10):
    pymatgen_result = pymatgen_parser("../../tests/test_files/g16log/*.log")

Error parsing file ../../tests/test_files/g16log/anion_0167_opt_g16_nbo_sp.log: list index out of range


../../tests/test_files/g16log/cation_0407_opt_g16.log: Termination error or bad Gaussian output file !


Error parsing file ../../tests/test_files/g16log/radical_0554_opt_g16_nbo_sp.log: list index out of range
Elapsed time: 3.3883 seconds
Error parsing file ../../tests/test_files/g16log/anion_0167_opt_g16_nbo_sp.log: list index out of range
Error parsing file ../../tests/test_files/g16log/radical_0554_opt_g16_nbo_sp.log: list index out of range
Elapsed time: 3.2941 seconds
Error parsing file ../../tests/test_files/g16log/anion_0167_opt_g16_nbo_sp.log: list index out of range
Error parsing file ../../tests/test_files/g16log/radical_0554_opt_g16_nbo_sp.log: list index out of range
Elapsed time: 3.3668 seconds
Error parsing file ../../tests/test_files/g16log/anion_0167_opt_g16_nbo_sp.log: list index out of range
Error parsing file ../../tests/test_files/g16log/radical_0554_opt_g16_nbo_sp.log: list index out of range
Elapsed time: 3.2906 seconds
Error parsing file ../../tests/test_files/g16log/anion_0167_opt_g16_nbo_sp.log: list index out of range
Error parsing file ../../tests/test_files/g1

pymatgen took about 3.3 s per run and failed to parse 2 of the 85 files.

In [4]:
for i in range(10):
    cclib_result = cclib_parser("../../tests/test_files/g16log/*.log")

[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'


  r, _ = scipy.spatial.transform.Rotation.align_vectors(b_, a_)


Elapsed time: 6.1525 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1883 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1815 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1660 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1962 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1503 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1727 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1416 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1910 seconds


[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Encountered error when parsing.
[Gaussian ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log ERROR] Last line read:  Rotational constants (GHZ):      ************     1.38922     1.38922



Error parsing file ../../tests/test_files/g16log/dsgdb9nsd_000484-1+.log: could not convert string to float: '************'
Elapsed time: 6.1752 seconds


cclib took about 6.2 s per run and failed to parse 1 of the 85 files.

In [5]:
for i in range(10):
    molop_result = molop_parser("../../tests/test_files/g16log/*.log", n_jobs=1)

MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.81it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.7441 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.80it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.7448 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.63it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.8119 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.44it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.8897 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.66it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.7997 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.98it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.6773 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.41it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.8998 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.61it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.8223 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.59it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.8303 seconds


MolOP parsing with single thread: 100%|██████████| 85/85 [00:05<00:00, 14.55it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 5.8466 seconds


Sequential MolOP took about 5.8 s per run and raised no errors. Automatic parallelization makes it faster.

In [6]:
for i in range(10):
    molop_result = molop_parser("../../tests/test_files/g16log/*.log")

MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 89.89it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.9506 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 89.17it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.9602 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:01<00:00, 76.70it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 1.1125 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 90.49it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.9442 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 85.21it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 1.0019 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 90.45it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.9470 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:01<00:00, 76.75it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 1.1129 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 92.19it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.9282 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 94.23it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.9066 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:01<00:00, 75.26it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 1.1335 seconds


Parallel MolOP took about 1 s per run (0.91 s to 1.13 s) and raised no errors. It can be faster still if only the optimized structure is needed.

In [7]:
for i in range(10):
    molop_result = molop_parser(
        "../../tests/test_files/g16log/*.log", only_last_frame=True
    )

MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 207.64it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4135 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 196.58it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4362 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 198.95it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4312 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 198.51it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4334 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 197.26it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4364 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 198.97it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4327 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 198.17it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4336 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 198.14it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4348 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 198.97it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4332 seconds


MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 198.72it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.4330 seconds


Extracting only the last frame with parallel MolOP took about 0.43 s per run and raised no errors. In practice, the other tools commonly provide only the last frame.
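
The elapsed times printed above can be turned into mean run times and speedup ratios. The sketch below simply copies the numbers from the outputs in this section; only the first four pymatgen runs are included because the rest of that output is truncated. The dictionary labels and the `speedup` helper are names chosen for this illustration.

```python
from statistics import mean

# Elapsed times in seconds, copied from the outputs above.
# Only four pymatgen runs are visible in the truncated output.
elapsed = {
    "pymatgen": [3.3883, 3.2941, 3.3668, 3.2906],
    "cclib": [6.1525, 6.1883, 6.1815, 6.1660, 6.1962,
              6.1503, 6.1727, 6.1416, 6.1910, 6.1752],
    "MolOP (n_jobs=1)": [5.7441, 5.7448, 5.8119, 5.8897, 5.7997,
                         5.6773, 5.8998, 5.8223, 5.8303, 5.8466],
    "MolOP (16 jobs)": [0.9506, 0.9602, 1.1125, 0.9442, 1.0019,
                        0.9470, 1.1129, 0.9282, 0.9066, 1.1335],
    "MolOP (16 jobs, last frame)": [0.4135, 0.4362, 0.4312, 0.4334, 0.4364,
                                    0.4327, 0.4336, 0.4348, 0.4332, 0.4330],
}

means = {name: mean(times) for name, times in elapsed.items()}


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline` (ratio of means)."""
    return means[baseline] / means[candidate]


for name, value in means.items():
    print(f"{name:<30} {value:.3f} s")

molop_modes = [name for name in elapsed if name.startswith("MolOP")]
for baseline in ("pymatgen", "cclib"):
    for candidate in molop_modes:
        print(f"{candidate} vs {baseline}: {speedup(baseline, candidate):.1f}x")
```

A ratio below 1 means the MolOP mode is slower than the baseline, which is the case for single-threaded MolOP against pymatgen on these numbers.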

## Information benchmark

In [8]:
cclib_result[-1].getattributes().keys()

dict_keys(['atomcharges', 'atomcoords', 'atommasses', 'atomnos', 'charge', 'coreelectrons', 'dispersionenergies', 'enthalpy', 'entropy', 'freeenergy', 'geotargets', 'geovalues', 'grads', 'homos', 'metadata', 'moenergies', 'moments', 'mosyms', 'mult', 'natom', 'nbasis', 'nmo', 'optdone', 'optstatus', 'polarizabilities', 'pressure', 'rotconsts', 'scfenergies', 'scftargets', 'scfvalues', 'temperature', 'vibdisps', 'vibfconsts', 'vibfreqs', 'vibirs', 'vibrmasses', 'vibsyms', 'zpve'])

In [9]:
pymatgen_result[-1].as_dict().keys()

dict_keys(['has_gaussian_completed', 'nsites', 'unit_cell_formula', 'reduced_cell_formula', 'pretty_formula', 'is_pcm', 'errors', 'Mulliken_charges', 'elements', 'nelements', 'charge', 'spin_multiplicity', 'input', 'output', '@module', '@class'])

In [10]:
molop_result[-1][-1].model_dump().keys()

dict_keys(['atoms', 'coords', 'standard_coords', 'charge', 'multiplicity', 'bonds', 'formal_charges', 'formal_num_radicals', 'frame_id', 'status', 'geometry_optimization_status', 'qm_software', 'qm_software_version', 'keywords', 'method', 'basis', 'functional', 'solvent', 'solvent_model', 'solvent_epsilon', 'solvent_epsilon_infinite', 'temperature', 'electron_temperature', 'forces', 'rotation_constants', 'energies', 'thermal_energies', 'molecular_orbitals', 'vibrations', 'charge_spin_populations', 'polarizability', 'bond_orders', 'total_spin', 'single_point_properties', 'running_time', 'smiles', 'canonical_smiles', 'filename', 'is_error', 'is_normal', 'is_TS'])

pymatgen extracts only the basic information from the Gaussian output file.

cclib parses Gaussian output files very thoroughly, covering almost everything in them. Unfortunately, there are bugs in its handling of some fields; for example, it failed on the overflowed rotational constants (`************`) in the timing run above.

MolOP's support for Gaussian output files exceeds that of pymatgen, but it barely exceeds, if at all, the coverage of cclib.

More importantly, the smallest unit of data extracted by MolOP is each valid frame in a file, and every frame is parsed to the same specification. Other projects only process a file as a whole. The structures along an optimisation path are just as important as the final one for tasks such as building datasets of molecular geometries. Each frame in a geometry optimisation is a legitimate SCF-converged structure and comes with DFT-level atomic forces, which may be of great help when fitting neural network potentials.
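
To illustrate how per-frame data could feed a geometry dataset, here is a minimal sketch that works on plain dictionaries such as those returned by `model_dump()` in cell 10. Only the key names (`atoms`, `coords`, `energies`, `forces`, `frame_id`, `is_error`) are taken from the output above; the layout of the values is not assumed and is passed through unchanged. `frames_to_records` and the toy frames are hypothetical and contain made-up placeholder values, not parsed results.

```python
from typing import Any, Dict, Iterable, List

# Field names taken from the `model_dump().keys()` output above; the value
# layout inside each field is not assumed here and is passed through as-is.
FIELDS = ("atoms", "coords", "energies", "forces")


def frames_to_records(frames: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep error-free frames that carry forces and project them onto FIELDS."""
    records = []
    for frame in frames:
        if frame.get("is_error") or frame.get("forces") is None:
            continue  # skip frames that cannot serve as force labels
        record = {key: frame.get(key) for key in FIELDS}
        record["frame_id"] = frame.get("frame_id")
        records.append(record)
    return records


# Made-up placeholder frames, only to show the shape of the transformation.
toy_frames = [
    {
        "frame_id": 0,
        "atoms": [1, 1],
        "coords": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.8]],
        "energies": "placeholder",
        "forces": [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]],
        "is_error": False,
    },
    {
        "frame_id": 1,
        "atoms": [1, 1],
        "coords": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
        "energies": "placeholder",
        "forces": None,  # no forces available, so this frame is dropped
        "is_error": False,
    },
]

records = frames_to_records(toy_frames)
print(f"{len(records)} of {len(toy_frames)} frames kept")
```

With real data, the input would be one dictionary per parsed frame; how to iterate over all frames of a file is left to the MolOP documentation.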

In [11]:
molop_result = molop_parser("../../tests/test_files/g16log/*.log")
print(
    molop_result[
        "/home/tmj/proj/MolOP/tests/test_files/g16log/RE_BOX-Anion-Real_Cu-III-Phenol_Major-Amide-Anion_From-IP_C-O-190_TS_Opt.log"
    ]
    .to_summary_df()
    .to_markdown(floatfmt=".6f")
)

MolOP parsing with 16 jobs: 100%|██████████| 85/85 [00:00<00:00, 90.46it/s]
INFO - 0 files failed to parse, 85 successfully parsed


Elapsed time: 0.9444 seconds
|     | parser            | file_path                                                                                                                 | file_name                                                                    | file_format   | version                                    |   frame_index |   charge |   multiplicity | SMILES                                                                                                                                                                                                                                                                             | Canonical SMILES                                                                             | keywords                                                                                                                                                                                                                                                               