bethanyconnolly/awesome-chemistry-datasets

 
 

awesome-chemistry-datasets

Contributions are very welcome - please follow the guidelines and the Code of Conduct.

text datasets

  • BioRxiv XML - Bulk access to the full text of bioRxiv articles for the purposes of text and data mining (TDM) is available via a dedicated Amazon S3 resource.
  • ChemTables: 788 chemical patent tables labelled by content type. Built for semantic classification of table type. Licensed under CC BY-NC 3.0.
  • Europe PMC - Bulk download of the full text and supplementary information (SI) of more than 5 million articles.
  • IUPAC Gold Book
  • LibreTexts: Open-access chemistry textbooks.
  • MedRxiv XML - Text and data mining is possible via dedicated Amazon S3 resource.
  • NLM literature archive: NLM LitArch (NLM Literature Archive) is a digital archive for books, documents, and articles in the fields of life science, medicine, and healthcare at the National Institutes of Health. Also accessible via NCBI bookshelf.
  • OpenStax Free textbooks, including Chemistry 2e, which is released under CC-BY 4.0.
  • PubChemSTM: 281K chemical structure and text pairs
  • PubMed Central: free full-text archive
  • PubMed: abstracts and outlinks
  • S2ORC: The Semantic Scholar Open Research Corpus. 81.1M English-language academic papers spanning many academic disciplines (the largest publicly available collection of machine-readable academic text). Released under CC BY-NC 4.0.

structures

ml structure-property benchmark datasets

  • ACNet: a benchmark for Activity Cliff Prediction, 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs from ChEMBL (version 28).
  • AqSolDB: Curation of nine open-source datasets on aqueous solubility. The authors also assigned reliability groups.
  • BindingDB: molecular recognition database containing 2.6M data points for 1.1M compounds and 8.10K targets (as of Feb 2023)
  • ChEBI-20: 33,010 molecule-description pairs (for molecule captioning task)
  • ESOL: Water solubility data (log solubility in moles per litre) for common organic small molecules.
  • Flashpoint: Sun et al. collected a dataset of the flashpoints of 10575 molecules from academic papers, the Gelest chemical catalogue, the DIPPR database, Lange's Handbook of Chemistry, the Hazardous Chemicals Handbook, and the PubChem database.
  • FreeSolv: Experimental and Calculated Small Molecule Hydration Free Energies
  • Harvard OPV: "experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of geometries, each with quantum chemical results using a variety of density functionals and basis sets"
  • Hydrogen Storage Materials Database: data on hydrides for hydrogen storage (information such as chemical formula and hydrogen capacity)
  • ILThermo: thermodynamic and transport properties of pure ionic liquids and their mixtures.
  • Leffingwell Odor Dataset: 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database
  • Limiting activity coefficients: for different solvent/solute pairs, used to train a SMILES-based transformer.
  • Lipophilicity: Experimental results for the octanol/water distribution coefficient (logD at pH 7.4).
  • MoleculeNet - Benchmark suite that contains multiple datasets listed here
  • OCHEM: On Feb 17 2023, OCHEM contained 3774118 records for 689 properties (with at least 50 records) collected from 20609 sources. Users are granted a Creative Commons CC-BY (version 4.0) license to the submitted data.
  • Papyrus: A large-scale curated dataset aimed at bioactivity prediction. Combines multiple large publicly available datasets, such as ChEMBL and ExCAPE-DB, with smaller datasets.
  • Photoswitch Dataset: Curated dataset of 405 photoswitch molecules.
  • QM Datasets: QM7, QM7b, QM8, QM9, MD Trajectories
  • SolProp: Database of 1 million solvent/solute COSMO-RS calculations and 10145 experimental solvation free energies (originally published as part of this paper).
  • SOMAS: Experimental and calculated solubilities for small molecules. Originally proposed for the design of redox-flow batteries.
  • Therapeutic Data Commons: ML tasks that cover small molecules and biologics, including antibodies, peptides, miRNAs, and gene editing therapies.
  • ThermoML Archive: experimental thermophysical and thermochemical property data (in ThermoML XML format)
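
Several of the datasets above report properties on a logarithmic scale (for example, ESOL's log solubility in moles per litre and the logD values in the Lipophilicity set). The following is a minimal sketch of the unit arithmetic, assuming base-10 logarithms; the function names are hypothetical helpers and are not part of any dataset's tooling, and the molar mass must be supplied by the caller.

```python
import math


def log_s_to_mol_per_litre(log_s: float) -> float:
    """Convert a base-10 log solubility (log mol/L) to mol/L."""
    return 10.0 ** log_s


def log_s_to_grams_per_litre(log_s: float, molar_mass_g_per_mol: float) -> float:
    """Convert a base-10 log solubility (log mol/L) to g/L.

    The molar mass is an input; it is not looked up here.
    """
    return log_s_to_mol_per_litre(log_s) * molar_mass_g_per_mol


def log_d(conc_octanol: float, conc_water: float) -> float:
    """Base-10 log of the octanol/water concentration ratio.

    Both concentrations must be positive and in the same units.
    """
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)
```

For instance, a log solubility of -2 corresponds to 0.01 mol/L under this assumption, which for a compound with a molar mass of 100 g/mol is about 1 g/L.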

Target identification data

  • Open Targets: a large-scale resource that uses human genetics and genomics data for systematic drug target identification and prioritization.
  • Probes & Drugs Portal: an interactive, open data resource for chemical biology. Provides an overview of libraries of bioactive compounds (e.g., ChEMBL, Guide to PHARMACOLOGY), including commercial screening libraries.

Pharmacology & ADME & Metabolism

  • Guide to PHARMACOLOGY: an expert-curated resource of ligand-activity-target relationships. It includes entries even where the bioactivity value is unknown (under CC BY-SA 4.0).
  • Drug Indications Database (DID): a dataset of structured drug-indication relations. It is intended to facilitate the building of practical, comprehensive, integrated drug ontologies.
  • The Metabolism and Transport Database: a cheminformatics and bioinformatics resource that contains curated data related to human small molecule metabolism and transport.
  • The Human Metabolome Database (HMDB): a freely available electronic database containing detailed information about small molecule metabolites found in the human body.
  • KEGG PATHWAY Database (KEGG): a database resource for understanding high-level functions and utilities of biological systems, such as the cell, the organism, and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
  • MetXBioDB Metabolite Biotransformations: a collection of biotransformation reactions and metabolite information from the BioTransformer database. It covers the transformation and metabolism of metabolites.
  • QSAR datasets - Meta-QSAR (phase I & II): Data (extracted from ChEMBL) used in Olier et al. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery.
  • EPA CompTox: a widely used resource for chemistry, toxicity, and exposure information for hundreds of thousands of chemicals, including, but not limited to, chemical properties, environmental fate and transport, hazard, in vitro to in vivo extrapolation (IVIVE), exposure, and bioactivity (each dataset has its own license).

reactions

  • USPTO: Reactions extracted by text-mining from United States patents published between 1976 and September 2016.

high-throughput screening data

  • Dreher-Doyle: yields and conditions for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings
  • Perera: yields and conditions for 5760 Pd-catalysed Suzuki–Miyaura C–C cross-couplings

eln data

related list

License

CC0

About

overview of datasets for ML in chemistry
