Extraction and re-use(ability) of chemical information from common scientific documents containing ChemDraw files


The ChemScanner library attempts to extract and interpret reaction and molecule information from ChemDraw-related file formats: CDX, CDXML, CDX embedded in DOC and DOCX, and Perkin Elmer ELN.


Installation

Add this line to your application's Gemfile:

gem 'chem_scanner'

And then execute:

$ bundle

Or install it yourself as:

$ gem install chem_scanner

UI for ChemScanner

You can try the ChemScanner at or The UI is more user-friendly, with some additional features:

  • Export to Excel and CML.
  • Preview of the original scheme.
  • Import directly into Chemotion ELN.
  • Add a comment to each extracted scheme. These comments also appear in the export and in the molecules/reactions imported into Chemotion ELN.
  • ...


Usage

To scan or extract a single CDX file:

require 'chem_scanner'

cdx ='/path/to/cdx/file')
# Get array of scanned Canonical SMILES
# Get array of scanned Reactions in SMILES

There are five classes, one for each supported file format: CDX, CDXML, DOC, DOCX, and PerkinELN.



  • Access "scanned" molecules
# Molecules - array of scanned molecules
# Get array of scanned Canonical SMILES
# Get one molecule
molecule = cdx.molecules.first
# Number of scanned molecules
  • Molecule class:
# Canonical SMILES
# Molfile
# Molecule label (bold text near molecule)
# Molecule text (molecule description)
# Molecule details (additional information from Perkin Elmer ELN)
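
As a rough illustration of working with the scanned molecules listed above, the sketch below indexes molecules by label and collects the unique SMILES. It runs against a stand-in Struct rather than the gem itself, and the attribute names `smiles`, `label`, and `text` are placeholders chosen for this example, not the Molecule class's confirmed method names.

```ruby
# Stand-in for a scanned molecule. The attribute names here are hypothetical;
# check the Molecule class for the real accessors.
molecule_stub = Struct.new(:smiles, :label, :text, keyword_init: true)

molecules = [
  molecule_stub.new(smiles: "CCO", label: "1a", text: "ethanol"),
  molecule_stub.new(smiles: "CC(=O)O", label: "1b", text: "acetic acid"),
  molecule_stub.new(smiles: "CCO", label: nil, text: nil)
]

# Index labelled molecules by their label (the bold text near the structure).
by_label = molecules.reject { |m| m.label.nil? }
                    .to_h { |m| [m.label, m.smiles] }

# Unique SMILES across the scan, preserving first-seen order.
unique_smiles = molecules.map(&:smiles).uniq

puts by_label.inspect
puts unique_smiles.inspect
```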

ChemScanner uses a Ruby binding of RDKit as a dependency.


A reaction consists of three groups of molecules: reactants, reagents, and products. Each group is an array whose elements are objects of the Molecule class. In addition, abbreviations that belong to the reaction are represented as SMILES and can be accessed via reagent_smiles:

reaction = cdx.reactions.first
# Access extracted structure group
reactants = reaction.reactants
reagents = reaction.reagents
products = reaction.products
reagent_smiles = reaction.reagent_smiles

Each group can be manipulated further in the same way as objects of the Molecule class.
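
One possible use of the three groups is to assemble a reaction SMILES string of the form reactants>agents>products, with molecules in each part joined by a dot. The sketch below does this on stand-in Structs. The group names `reactants`, `reagents`, `products`, and `reagent_smiles` come from the text above; the `smiles` accessor on each molecule and the assumption that `reagent_smiles` is an array of strings are illustrative guesses.

```ruby
# Stand-ins for scanned objects. `smiles` is a hypothetical accessor name.
molecule_stub = Struct.new(:smiles)
reaction_stub = Struct.new(:reactants, :reagents, :products, :reagent_smiles,
                           keyword_init: true)

# Join each group with "." and the three groups with ">".
# Abbreviation SMILES are appended to the agents part.
def reaction_smiles(reaction)
  agents = reaction.reagents.map(&:smiles) + reaction.reagent_smiles
  [
    reaction.reactants.map(&:smiles).join("."),
    agents.join("."),
    reaction.products.map(&:smiles).join(".")
  ].join(">")
end

reaction = reaction_stub.new(
  reactants: [molecule_stub.new("CC(=O)O"), molecule_stub.new("CCO")],
  reagents: [molecule_stub.new("OS(=O)(=O)O")],
  products: [molecule_stub.new("CC(=O)OCC")],
  reagent_smiles: ["C1CCOC1"] # e.g. an abbreviation such as THF
)

puts reaction_smiles(reaction)
```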

  • Reaction properties

A reaction has description, yield, time, temperature, and details properties. All of them are extracted from the ChemDraw scheme, except details, which holds additional information from PerkinELN.
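
A minimal sketch of collecting these properties into a Hash, skipping the ones that were not found in the scheme. It uses a stand-in Struct with the five property names listed above; that the values are strings or nil is an assumption made for this example.

```ruby
# Stand-in for a scanned reaction, limited to the five properties named above.
reaction_stub = Struct.new(:description, :yield, :time, :temperature, :details,
                           keyword_init: true)

# Collect the non-empty properties into a Hash keyed by property name.
def property_summary(reaction)
  %i[description yield time temperature details].each_with_object({}) do |name, out|
    value = reaction.public_send(name)
    out[name] = value unless value.nil? || value.to_s.empty?
  end
end

reaction = reaction_stub.new(
  description: "Esterification",
  yield: "85%",
  time: "2 h",
  temperature: "rt",
  details: nil # only filled for PerkinELN input
)

puts property_summary(reaction).inspect
```

Note that `yield` is a Ruby keyword, so it has to be called with an explicit receiver, as in `reaction.yield`.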

  • Reaction step

Some multi-step reactions can also be recognized. If a reaction is a multi-step reaction, its steps can be accessed via:

# Get first scanned reaction
reaction = cdx.reactions.first
# Access first step
step = reaction.steps.first
step.number # Should be 1 
# List reagents SMILES

Each step has the following properties: description, time, temperature, and reagents.
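
To illustrate how the step properties might be consumed, the sketch below prints one line per step, ordered by step number. It operates on stand-in Structs that carry only the names mentioned above (`steps`, `number`, `description`, `time`, `temperature`, `reagents`); the value types are assumed to be strings or nil.

```ruby
# Stand-ins for a multi-step reaction and its steps.
step_stub = Struct.new(:number, :description, :time, :temperature, :reagents,
                       keyword_init: true)
reaction_stub = Struct.new(:steps)

# Build one human-readable line per step, in step order.
def step_lines(reaction)
  reaction.steps.sort_by(&:number).map do |step|
    conditions = [step.description, step.temperature, step.time].compact.join(", ")
    "Step #{step.number}: #{conditions}"
  end
end

reaction = reaction_stub.new([
  step_stub.new(number: 2, description: "MeI", time: "12 h",
                temperature: nil, reagents: []),
  step_stub.new(number: 1, description: "NaH, THF", time: "1 h",
                temperature: "rt", reagents: [])
])

puts step_lines(reaction)
```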

Supported File Formats

The usage and API of CDX, CDXML, and PerkinELN are described above. Their outputs are simple molecules and reactions.

The DOC and DOCX classes are a little different. A DOC or DOCX file can contain more than one embedded ChemDraw scheme, each of which is one CDX scheme. ChemScanner attempts to extract all of them and puts them into a single Hash, called cdx_map.

require 'chem_scanner'

doc ='/path/to/doc/file')
doc.cdx_map.each do |key, cdx|

# Access all molecules in all CDXs
# Access all reactions in all CDXs
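
Because cdx_map is a Hash of schemes, molecules and reactions from all embedded schemes can be gathered with ordinary Hash and Enumerable methods. The sketch below shows this on a hand-built Hash. The keys, the stand-in Struct, and the use of plain strings in place of Molecule and Reaction objects are assumptions for illustration only.

```ruby
# Stand-in for one embedded CDX scheme; plain strings replace the real
# Molecule and Reaction objects.
cdx_stub = Struct.new(:molecules, :reactions)

# Hypothetical cdx_map; the real keys are assigned by ChemScanner.
cdx_map = {
  "scheme1" => cdx_stub.new(%w[CCO CC=O], %w[CCO>>CC=O]),
  "scheme2" => cdx_stub.new(%w[c1ccccc1], [])
}

# Flatten the per-scheme arrays into document-wide lists.
all_molecules = cdx_map.values.flat_map(&:molecules)
all_reactions = cdx_map.values.flat_map(&:reactions)

# Number of molecules found in each scheme.
molecules_per_scheme = cdx_map.transform_values { |cdx| cdx.molecules.size }

puts all_molecules.inspect
puts all_reactions.inspect
puts molecules_per_scheme.inspect
```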

DOCX differs slightly: ChemScanner can also extract each CDX together with its preview image from the document.

require 'chem_scanner'

docx ='/path/to/docx/file')
docx.cdx_map.each do |key, cdx_info|
  # Get the CDX scheme
  cdx = cdx_info[:cdx]
  # Preview images, used for ChemScanner UI
  img_ext = cdx_info[:img_ext] # Could be '.png', '.emf'
  img_b64 = cdx_info[:img_b64] # Base64 encoded of image

# Access all molecules in all CDXs
# Access all reactions in all CDXs
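
The Base64 preview can be turned back into an image file with core Ruby alone. The following sketch assumes a cdx_map shaped like the one above, with `:img_ext` and `:img_b64` entries, and uses the map key as the file name; the helper `save_previews`, the sample keys, and the sample image bytes are all hypothetical.

```ruby
require "tmpdir"

# Decode each Base64 preview and write it to dir as "<key><img_ext>".
# Entries without a preview are skipped. Returns the written paths.
def save_previews(cdx_map, dir)
  cdx_map.map do |key, cdx_info|
    next if cdx_info[:img_b64].nil?

    path = File.join(dir, "#{key}#{cdx_info[:img_ext]}")
    # unpack1("m") decodes Base64 without needing an extra library.
    File.binwrite(path, cdx_info[:img_b64].unpack1("m"))
    path
  end.compact
end

# Sample data: only the 8-byte PNG signature, encoded as Base64.
png_signature = "\x89PNG\r\n\x1a\n".b
cdx_map = {
  "scheme1" => { cdx: nil, img_ext: ".png", img_b64: [png_signature].pack("m0") },
  "scheme2" => { cdx: nil, img_ext: ".emf", img_b64: nil }
}

Dir.mktmpdir do |dir|
  paths = save_previews(cdx_map, dir)
  puts paths.map { |p| File.basename(p) }.inspect
end
```

In real use the key may need sanitizing before it is used as a file name.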


Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to


Contributing

Bug reports and pull requests are welcome on GitHub at This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.


License

The gem is available as open source under the terms of the GNU AGPLv3 License.
