This repository contains example files and code to read, write and transform SURF files containing chemical reaction data.
First, clone this repository and then install the necessary requirements by running the following command (assuming you have pip installed):
pip install -r requirements.txt
Now you should be ready to go.
SURF is a tabular file structure that is both human- and machine-readable. In a SURF spreadsheet, each row stores the data of one reaction. The column headers structure the data and are split into constant (CC) and flexible (FC) categories.
CCs never change and should always be present, independent of the number of reaction components. They capture the identifiers and provenance of the reaction as well as basic characteristics (reaction type, named reaction, reaction technology) and conditions (temperature, time, atmosphere, scale, concentration, stirring/shaking). Add-ons, such as the procedure or comments, belong to the CCs as well.
The FCs describe the more variable part of a reaction: the starting material(s), solvent(s), reagent(s) and product(s). Each reaction component is described by an identifier, such as the CAS number or molecule name, and a SMILES or InChI string storing the chemical structure. While the SMILES/InChI string is available for every compound and can also serve as structural input for machine learning models, the CAS number, even though not always available, is handy for chemists in the lab to order, itemize and find chemicals. For starting materials and reagents (e.g. catalyst, ligand, additive), a third column is added to the two identifiers to cover the stoichiometric amount (equivalents). For products, the respective yield and yield type are recorded. SURF can capture multiple starting materials and reagents, as each one is accommodated by adding three further columns (CAS, SMILES/InChI, and equivalents). If desired, columns for additional identifiers such as names or lot numbers can be added.
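The split between constant and flexible columns can be illustrated with a short Python sketch. It assumes a tab-separated file and a hypothetical header pattern `<role>_<index>_<field>` (for example `startingmat_1_smiles`); the column names and the example row are made up for illustration, so check them against data/surf_template.txt before relying on them. The helper `split_row` is not part of this repository.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed (hypothetical) naming scheme for flexible columns:
# <role>_<index>_<field>, e.g. "startingmat_1_smiles".
FLEXIBLE = re.compile(r"^(startingmat|reagent|solvent|product)_(\d+)_(\w+)$")


def split_row(row):
    """Split one reaction row into constant fields and grouped components."""
    constant = {}
    components = defaultdict(dict)
    for header, value in row.items():
        match = FLEXIBLE.match(header)
        if match is None:
            # Anything that is not a component column is treated as constant.
            constant[header] = value
        elif value:
            # Empty cells (e.g. a missing CAS number) are skipped.
            role, index, field = match.groups()
            components[(role, int(index))][field] = value
    return constant, dict(components)


# A made-up single-reaction table, for illustration only.
example = (
    "rxn_id\ttemperature_deg_c\tstartingmat_1_cas\tstartingmat_1_smiles\t"
    "startingmat_1_eq\tproduct_1_smiles\tproduct_1_yield\n"
    "rxn-001\t25\t\tc1ccncc1\t1.0\tCc1ccccn1\t42\n"
)

reader = csv.DictReader(io.StringIO(example), delimiter="\t")
for row in reader:
    constant, components = split_row(row)
    print(constant)
    print(components)
```

Grouping by role and index means a second starting material or reagent only adds new keys such as `("startingmat", 2)`, which mirrors how SURF grows by three columns per additional component.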
The file data/surf_template.txt provides an example of the SURF structure with five Minisci-type reactions from the literature.
To concatenate multiple SURF files into one larger file, put all SURF files to be combined into one folder and run the following:
python concat_surf.py <folder path> <output file>
Example:
ls data/hte-data
hte-1.txt hte-2.txt hte-3.txt hte-4.txt ...
python concat_surf.py data/hte-data data/hte-data.txt
The idea of SURF is not to replace existing reaction file structures, but to offer a human- and machine-readable format that is interoperable with them. Below we describe how to transform SURF into other existing formats and back.
The Open Reaction Database (ORD) is an open-access schema and infrastructure for structuring and sharing organic reaction data. To translate SURF files into the protocol buffers format used by the Open Reaction Database, run the following:
python surf2ord.py <input SURF file> <output ORD file>
To get all the options of the script, run:
python surf2ord.py --help
If the SURF file contains no or only partial provenance information, the user can provide personal information with the --username, --email, --orcid and --organization options:
python surf2ord.py data/surf_template.txt data/surf_template.pbtxt --username "Alex Mueller" --email "alex.mueller@roche.com"
To translate protocol buffers files used by the Open Reaction Database back into the SURF format, run the following:
python ord2surf.py <input ORD file> <output SURF file>
To get all the options of the script, run:
python ord2surf.py --help
Example:
python ord2surf.py data/ord_search_result.pb data/ord_search_result.txt --validate
Reaction SMILES are a frequently used representation for chemical reactions. However, they represent only the molecular structures involved in the reaction and lack detailed information on conditions, equivalents and analytics. We therefore only provide a way to translate SURF files into Reaction SMILES but not vice versa:
python surf2rxnsmiles.py <input SURF file> <output RXNSMILES file>
To get all the options of the script, run:
python surf2rxnsmiles.py --help
Example:
python surf2rxnsmiles.py data/surf_template.txt data/surf_template.rxnsmi
The Unified Data Model (UDM) is an open, extendable and freely available data format for the exchange of experimental information about compound synthesis and testing, developed by the Pistoia Alliance. To translate SURF files into UDM XML files, run the following:
python surf2udm.py <input SURF file> <output UDM file>
To get all the options of the script, run:
python surf2udm.py --help
Example:
python surf2udm.py data/surf_template.txt data/surf_template.xml
To translate UDM XML files into SURF files, run the following:
python udm2surf.py <input UDM file> <output SURF file>
To get all the options of the script, run:
python udm2surf.py --help
Example:
python udm2surf.py data/udm_file.xml data/surf_file.txt
If you are using SURF in your project, please cite the following reference:
@article{nippa_mueller_atz2023surf,
title={Simple User-Friendly Reaction Format},
author={Nippa, David F. and M{\"u}ller, Alex T. and Atz, Kenneth and Konrad, David B. and Grether, Uwe and Martin, Rainer E. and Schneider, Gisbert},
journal={Chemrxiv-2023-nfq7h},
year={2023},
doi={10.26434/chemrxiv-2023-nfq7h}
}
The release used for publication is v1.0.0, archived under https://doi.org/10.5281/zenodo.13969986