A pipeline for parsing scientific and biomedical PDFs using GROBID and the MinerU cloud API.
Every PDF is first registered to receive a stable paper_id. All subsequent steps use the paper ID to locate inputs and write outputs.
GitHub: git@github.com:gaoshang-strong/PDF_parser.git
Local: /ShangGaoAIProjects/PDF_parser
data/
├── raw_pdfs/ Original PDFs (before registration)
├── registered_pdfs/ Registered PDFs + registry.json
├── grobid/ {paper_id}.tei.xml
└── mineru/ {paper_id}/ (content_list.json, .md, images/)
micromamba activate PDF_parser
cd /ShangGaoAIProjects/PDF_parser
python -m pip install -e ".[dev]"
/home/sgao30/micromamba/bin/micromamba run -n PDF_parser python -m pytest
Assign a stable paper_id to each PDF and move it into the registry:
pdf-parser register --pdf data/raw_pdfs/paper.pdf
# → pdf_16edbbde296287d6
The PDF is moved to data/registered_pdfs/{paper_id}.pdf and recorded in
data/registered_pdfs/registry.json. Registration is idempotent: the
same file content always produces the same paper_id.
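The idempotency described above follows naturally if the ID is derived from the file's content hash. The sketch below is an illustration only, not the code in registry.py: the function name `compute_paper_id`, the `pdf_` prefix, and the 16-character length are assumptions inferred from the example IDs and the `sha256` field shown in this README.

```python
import hashlib
from pathlib import Path


def compute_paper_id(pdf_path: Path, prefix: str = "pdf_", hex_chars: int = 16) -> str:
    """Derive an ID from file content: prefix + leading hex digits of SHA-256.

    Assumption: the prefix and the 16-character length are inferred from the
    example IDs in this README, not taken from registry.py.
    """
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return prefix + digest.hexdigest()[:hex_chars]
```

Because only the bytes of the file are hashed, renaming or copying a PDF would not change the resulting ID under this scheme.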
data/registered_pdfs/
├── registry.json
├── pdf_16edbbde296287d6.pdf
└── pdf_1ee686107e655447.pdf
registry.json format:
{
"pdf_16edbbde296287d6": {
"original_filename": "paper.pdf",
"paper_id": "pdf_16edbbde296287d6",
"registered_at": "2026-05-08T10:00:00+00:00",
"sha256": "16edbbde296287d6..."
}
}
GROBID runs as a Docker service:
# Create container (first time only)
docker run -d --init --name grobid -p 127.0.0.1:8070:8070 grobid/grobid:0.9.0
# Start if already created
docker start grobid
# Verify
pdf-parser grobid check
Process a registered PDF:
pdf-parser grobid process --paper-id pdf_16edbbde296287d6
# writes → data/grobid/pdf_16edbbde296287d6.tei.xml
Stop GROBID when done:
docker stop grobid
The MinerU step requires an API token from mineru.net:
export MINERU_API_TOKEN="your-token-here"
pdf-parser mineru run --paper-id pdf_16edbbde296287d6
# writes → data/mineru/pdf_16edbbde296287d6/
The output directory contains *_content_list.json, *.md, and images/.
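To show how downstream code might pick up these outputs, here is a minimal sketch that collects the three kinds of files named above from one paper's directory. The helper name `find_mineru_outputs` and the returned dictionary keys are hypothetical; only the glob patterns come from this README.

```python
from pathlib import Path


def find_mineru_outputs(paper_dir: Path) -> dict:
    """Collect the files a MinerU run is described as producing.

    The dictionary keys are illustrative names, not part of pdf-parser.
    """
    images_dir = paper_dir / "images"
    return {
        "content_lists": sorted(paper_dir.glob("*_content_list.json")),
        "markdown": sorted(paper_dir.glob("*.md")),
        # Tolerate a missing images/ directory (e.g. a paper with no figures).
        "images": sorted(p for p in images_dir.iterdir() if p.is_file())
        if images_dir.is_dir()
        else [],
    }
```

For example, `find_mineru_outputs(Path("data/mineru/pdf_16edbbde296287d6"))` would return the matching paths for that paper, with empty lists for anything not found.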
for pdf in data/raw_pdfs/*.pdf; do
pdf-parser register --pdf "$pdf"
done
docker start grobid
for paper_id in $(jq -r 'keys[]' data/registered_pdfs/registry.json); do
out="data/grobid/${paper_id}.tei.xml"
[ -f "$out" ] && echo "skip $paper_id" && continue
echo "=== grobid $paper_id ==="
pdf-parser grobid process --paper-id "$paper_id"
done
export MINERU_API_TOKEN="your-token-here"
for paper_id in $(jq -r 'keys[]' data/registered_pdfs/registry.json); do
out="data/mineru/${paper_id}"
[ -d "$out" ] && echo "skip $paper_id" && continue
echo "=== mineru $paper_id ==="
pdf-parser mineru run --paper-id "$paper_id"
done
export MINERU_API_TOKEN="your-token-here"
docker start grobid
for paper_id in $(jq -r 'keys[]' data/registered_pdfs/registry.json); do
echo "=== $paper_id ==="
[ -f "data/grobid/${paper_id}.tei.xml" ] || \
pdf-parser grobid process --paper-id "$paper_id"
[ -d "data/mineru/${paper_id}" ] || \
pdf-parser mineru run --paper-id "$paper_id"
done
pdf-parser register --pdf <path> [--papers-dir data/registered_pdfs]
pdf-parser grobid check [--url http://localhost:8070]
pdf-parser grobid process --paper-id <id> [--papers-dir ...] [--out-dir data/grobid] [--url ...]
pdf-parser grobid batch --input-dir <dir> --out-dir <dir> [--url ...]
pdf-parser mineru run --paper-id <id> [--papers-dir ...] [--out-dir data/mineru]
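The batch loops earlier in this document skip a paper when its output already exists. The same check can be sketched in Python to preview which papers still need each step before launching a long run. The function name `pending_papers` is hypothetical and not part of the pdf-parser CLI; the paths follow the default data/ layout described in this README.

```python
import json
from pathlib import Path


def pending_papers(data_dir: Path) -> dict:
    """List registered paper IDs that still lack GROBID or MinerU output.

    Mirrors the shell loops: a paper is done for GROBID when
    grobid/{paper_id}.tei.xml exists, and for MinerU when the directory
    mineru/{paper_id}/ exists.
    """
    registry_path = data_dir / "registered_pdfs" / "registry.json"
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    pending = {"grobid": [], "mineru": []}
    # Sorted for a deterministic order (jq's `keys` also sorts).
    for paper_id in sorted(registry):
        if not (data_dir / "grobid" / f"{paper_id}.tei.xml").is_file():
            pending["grobid"].append(paper_id)
        if not (data_dir / "mineru" / paper_id).is_dir():
            pending["mineru"].append(paper_id)
    return pending
```

Calling `pending_papers(Path("data"))` from the project root would return two lists of IDs. As with the shell loops, an existing MinerU directory counts as done even if a previous run failed partway, so a stricter check may be worth adding.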
src/pdf_parser/
├── cli.py # pdf-parser entry point
├── registry.py # register_paper, get_registered_pdf
├── parsers/
│ └── mineru_api_adapter.py # MinerU cloud API workflow
└── grobid/
└── runtime.py # GROBID Docker service helpers
tests/
├── fixtures/ # TEI XML fixtures for GROBID tests
└── test_*.py # One test file per module; all mocked
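The tree places GROBID service helpers in grobid/runtime.py. As an illustration of what such a helper could look like (not the repository's actual code), the sketch below checks whether a GROBID server is reachable through its `/api/isalive` endpoint. The function names `isalive_url` and `grobid_is_alive` are hypothetical, and whether `pdf-parser grobid check` works this way is an assumption.

```python
import urllib.request


def isalive_url(base_url: str) -> str:
    """Build the health-check URL for a GROBID server."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with HTTP 200.

    Connection failures, timeouts, and HTTP errors all surface as OSError
    subclasses from urllib, and are reported as "not alive" here.
    """
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```

Catching only `OSError` keeps the handler narrow, in line with the rule below about not hiding errors behind broad exception handling.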
The data/ directory is not committed to Git. Do not commit PDFs, TEI XML, MinerU outputs, or API tokens.
- No broad try/except to hide errors; let them surface.
- No placeholder functions or fake test results.
- Add or update tests when changing parser behaviour.
- Run pytest before claiming a change is complete.