
MolMole OCSR Research Environment

For the current experimental write-up and results, see report/report.md.

Why Optical Chemical Structure Recognition (OCSR) matters

Converting chemical diagrams into machine-readable representations (SMILES, InChI, molecular graphs) is fundamental for indexing the chemical literature. Classical OCSR tools often struggle with complex layouts and scanned pages, while modern deep learning approaches such as DECIMER and MolScribe demonstrate strong performance with dedicated training. Given the rapid evolution of multimodal foundation models, it is natural to ask whether general-purpose VLM/LLM systems can perform OCSR without chemistry-specific training, and what their failure modes look like.

Overview

This repository provides a ready-to-use Docker Compose environment for evaluating OCSR with current vision-enabled foundation models.

The project is inspired by MolMole (LG AI Research), which proposes an end-to-end framework for extracting molecules and reactions from full-page patent images and introduces an evaluation benchmark. The MolMole paper evaluates on 550 annotated pages; due to copyright restrictions, only a 300-page patent subset is publicly released, as the MolMole_Patent300 dataset on HuggingFace.

The code in this repo runs holistic extraction: the model sees the entire page and is asked to return all structures and reactions in one JSON response. Extractions are evaluated in three output formats (illustrated below):

  • graph: atoms/bonds JSON (closest to an explicit molecular graph).
  • smiles: SMILES strings.
  • selfies: SELFIES strings.
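
For a concrete sense of the three formats, the sketch below renders the same molecule (ethanol) each way. The atoms/bonds dictionary is an assumed shape for illustration; the schema actually requested from the model is defined by the extractor prompt.

    # Illustration: ethanol in each of the three output formats.
    # The graph dict is an assumed shape, not the extractor's exact schema.
    from rdkit import Chem
    import selfies

    smiles = "CCO"                         # smiles format
    selfies_str = selfies.encoder(smiles)  # selfies format: "[C][C][O]"

    mol = Chem.MolFromSmiles(smiles)
    graph = {                              # graph format (assumed shape)
        "atoms": [atom.GetSymbol() for atom in mol.GetAtoms()],
        "bonds": [
            [b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble()]
            for b in mol.GetBonds()
        ],
    }
    print(graph)  # {'atoms': ['C', 'C', 'O'], 'bonds': [[0, 1, 1.0], [1, 2, 1.0]]}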

All developer workflows run inside Docker/Compose; host-level execution is reserved for CI.

Repository structure

.
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── .env.example
├── experiments_openrouter.yaml
├── report/
│   └── report.md
├── src/
│   └── molmole_research/
│       ├── downloader.py   # download dataset + build labels.json from MOL files
│       ├── extractor.py    # holistic OCSR extraction (graph / SMILES / SELFIES)
│       ├── converter.py    # optional conversion helpers
│       ├── evaluator.py    # compute metrics and write logs
│       └── runner.py       # orchestrate multi-model runs
└── tests/
    └── ...

Quickstart (recommended: Docker + OpenRouter)

This is the intended way to reproduce the pilot runs in results_openrouter_*. Keep secrets in .env (gitignored) and do not put API keys on the command line.

  1. Create .env:

    cp .env.example .env
    # edit .env and set OPENROUTER_KEY=...

    If you plan to run direct OpenAI API experiments (not via OpenRouter), you can also set OPENAI_API_KEY in .env.

  2. Build the image:

    make build

  3. Download MolMole_Patent300 and build labels.json:

    make download

  4. Run OpenRouter experiments (small debug runs; adjust --limit as needed):

    docker compose run --rm --user "$(id -u):$(id -g)" research \
      python -m molmole_research.runner run \
        --config experiments_openrouter.yaml \
        --format graph \
        --limit 5 \
        --results-dir results_openrouter_graph

    Repeat for other formats:

    • --format smiles --results-dir results_openrouter_smiles
    • --format selfies --results-dir results_openrouter_selfies

  5. Inspect outputs (see the reading sketch after this list):

    • Raw model outputs: results_openrouter_*/<experiment>.jsonl
    • Metrics: results_openrouter_*/<experiment>_metrics.json
    • Logs: results_openrouter_*/<experiment>_metrics.log
    • Aggregated summary: results_openrouter_*/summary.json
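
To peek at these files from Python, something like the sketch below works; the experiment file name is a placeholder, and the JSONL record fields depend on the extractor's schema.

    # Sketch: read one raw record and the aggregated summary.
    # "my_experiment.jsonl" is a placeholder file name.
    import json
    from pathlib import Path

    results = Path("results_openrouter_graph")
    with (results / "my_experiment.jsonl").open() as fh:
        first_record = json.loads(next(fh))
    print(sorted(first_record))  # field names of one prediction record

    summary = json.loads((results / "summary.json").read_text())
    print(json.dumps(summary, indent=2))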

Dataset download

The publicly released dataset lives on HuggingFace as doxa-friend/MolMole_Patent300 (license: CC-BY-NC-ND-4.0). The downloader uses huggingface_hub.snapshot_download to fetch the dataset snapshot and builds labels.json by converting the provided MOL files to canonical SMILES via RDKit.
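
The two underlying building blocks look roughly like this (a sketch, not the downloader's actual code; see src/molmole_research/downloader.py for the real paths and options):

    # Sketch: fetch the dataset snapshot, then convert one MOL file to
    # canonical SMILES. The MOL file name is illustrative.
    from huggingface_hub import snapshot_download
    from rdkit import Chem

    local_dir = snapshot_download(
        repo_id="doxa-friend/MolMole_Patent300",
        repo_type="dataset",
    )
    mol = Chem.MolFromMolFile("example.mol")  # one of the provided MOL files
    canonical_smiles = Chem.MolToSmiles(mol)  # RDKit canonical form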

To download the dataset into data/images, run:

make download

If the download fails due to authentication or license acceptance, the downloader prints instructions for manual setup.

Running experiments

All commands below are meant to run via Docker.

Single-model runs

If you are using the OpenAI API directly, ensure OPENAI_API_KEY is set (for example via .env).

  1. Run extraction (example: OpenAI, SMILES output):

    docker compose run --rm --user "$(id -u):$(id -g)" research \
      python -m molmole_research.extractor run \
        --model gpt-4o \
        --dataset-dir data/images \
        --out results \
        --format smiles \
        --limit 5

  2. Evaluate (an illustrative metric sketch follows this list):

    docker compose run --rm --user "$(id -u):$(id -g)" research \
      python -m molmole_research.evaluator run \
        --pred results/gpt-4o.jsonl \
        --dataset-dir data/images \
        --out results

  3. Optional conversion step (mainly useful for debugging):

    docker compose run --rm --user "$(id -u):$(id -g)" research \
      python -m molmole_research.converter run \
        --pred results/gpt-4o.jsonl \
        --out results
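
As an illustration of the kind of check an OCSR evaluator runs, the sketch below scores a prediction by exact match of RDKit-canonical SMILES. This is a common OCSR metric, not necessarily the one evaluator.py implements; see that module for the authoritative metrics.

    # Illustrative metric only: exact match after RDKit canonicalization.
    from rdkit import Chem

    def canonical(smi):
        mol = Chem.MolFromSmiles(smi)
        return Chem.MolToSmiles(mol) if mol is not None else None

    def exact_match(pred_smiles, gold_smiles):
        pred, gold = canonical(pred_smiles), canonical(gold_smiles)
        return pred is not None and pred == gold

    assert exact_match("OCC", "CCO")  # same molecule, different SMILES spelling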

Notes:

  • The extractor resumes by default (appends and skips already-processed pages; an assumed sketch of this check follows these notes). For a clean run, delete the output JSONL or use --no-resume.
  • Use --timeout to bound each request, and --limit for short debug runs.
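
For reference, resuming usually amounts to something like the following (an assumed sketch; the real logic lives in extractor.py, and the "page" field name is hypothetical):

    # Assumed resume pattern: collect identifiers already present in the
    # output JSONL, then skip those pages on the next pass.
    import json
    from pathlib import Path

    out_path = Path("results/gpt-4o.jsonl")
    done = set()
    if out_path.exists():
        for line in out_path.open():
            done.add(json.loads(line).get("page"))  # "page" is a hypothetical field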

Multi-model runs (runner)

To run a YAML-defined set of experiments (recommended for OpenRouter), use:

docker compose run --rm --user "$(id -u):$(id -g)" research \
  python -m molmole_research.runner run \
    --config experiments_openrouter.yaml \
    --format graph \
    --limit 5 \
    --results-dir results_openrouter_graph

The runner writes per-experiment JSONL outputs and per-experiment metrics, plus summary.json in the selected results directory.

OpenRouter notes

  • experiments_openrouter.yaml sets the OpenRouter API base and declares api_key_env: OPENROUTER_KEY (a sketch of this indirection follows this list).
  • The runner reads OPENROUTER_KEY from .env (or the environment) and passes it to the extractor via environment variables.
  • Start with --limit and a small model set; OpenRouter runs can be expensive.
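
For orientation, an experiment entry plausibly looks like the YAML parsed below; api_key_env is documented above, while the other field names are assumptions. The point is the indirection: the config names an environment variable, and the secret itself stays in .env.

    # Assumed config shape, parsed here only to show the env-var indirection.
    import os
    import yaml

    example = yaml.safe_load("""
    api_base: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_KEY
    experiments:
      - model: openai/gpt-4o
    """)
    api_key = os.environ[example["api_key_env"]]  # the key never appears in YAML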

Optional: interactive container shell

If you prefer an interactive session:

make shell

Inside the shell, you can run the same python -m molmole_research.<module> run ... commands.

Using Ollama / Open-WebUI

If an existing Open-WebUI + Ollama stack is available at http://host.docker.internal:11434/v1 with the model ministral-3:14b, you can run a sample extraction from inside the research container:

python -m molmole_research.extractor run \
  --model ministral-3:14b \
  --api-base http://host.docker.internal:11434/v1 \
  --api-key placeholder \
  --dataset-dir data/images \
  --out results \
  --format graph \
  --limit 5

Notes:

  • The Compose service includes extra_hosts: host.docker.internal so the container can reach the host’s Ollama port.
  • The --api-key value is ignored by Ollama but required by the OpenAI client; any non-empty string is fine.

Makefile commands

The provided Makefile defines several convenience targets:

Target               Description
-------------------  ----------------------------------------------------------
make build           Build the Docker image (installs dependencies).
make shell           Open a bash shell inside the research container.
make test            Run the full test suite inside the container.
make lint            Check code style with ruff inside the container.
make format          Format the code base with ruff format inside the container.
make run             Run the default runner inside the container.
make download        Download MolMole_Patent300 and build labels.json.
make up / make down  Start or stop the research container stack.

CI / local-only execution

Automated CI pipelines execute host-level commands (pytest, ruff) to keep runtimes fast. Outside CI, prefer the Docker workflow above and avoid creating local virtual environments. If you must reproduce the CI run locally, mirror its steps in a temporary venv and install requirements.txt, but treat that as an exception rather than the norm.

Notes on models and API compatibility

The extractor uses the OpenAI Python client and targets OpenAI-compatible APIs. For a provider that exposes an OpenAI-compatible API (OpenAI, OpenRouter, local Open-WebUI, etc.), set --api-base and provide credentials via environment variables or the runner configuration.
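
Concretely, the pattern is the standard OpenAI client with a custom base_url; a minimal sketch follows (the extractor's real request, including the page image payload, lives in extractor.py):

    # Sketch: OpenAI-compatible client pointed at an alternative provider.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",  # or e.g. a local Ollama endpoint
        api_key=os.environ["OPENROUTER_KEY"],
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Return the structures as JSON."}],
    )
    print(response.choices[0].message.content)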

Relevant literature

The articles/ directory contains additional papers used to inform this environment (the PDFs themselves are not committed to the repo). A brief summary of each paper is available in articles/relevant_articles.md.

License

This project is released under the MIT License. Individual datasets and published papers retain their respective licenses; please consult the original sources for details.
