Stuck when using with Celery #1277

@lzl12051

Bug

I would like to run Docling as a parser task in Celery, but it only works for non-PDF files.

Steps to reproduce

UPDATE2: Tried setting OMP_NUM_THREADS=1 and running the worker with -P solo. No luck.
UPDATE: I am not a Celery expert. The task itself reported that processing the file took only 32 seconds, but the log timestamps show it actually took about 19 minutes.

[2025-04-01 16:28:46,576: WARNING/MainProcess] Start parsing: somePDF.pdf
[2025-04-01 16:28:46,576: WARNING/MainProcess] Converter initialized.
[2025-04-01 16:47:22,904: WARNING/MainProcess] Conversion secs: [32.63109178701416]
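
To make the mismatch concrete, here is a small sketch that subtracts the reported pipeline time from the wall-clock time between the first and last log lines. The timestamps and the 32.63 s figure are copied from the log above; the variable names are only illustrative.

```python
from datetime import datetime

# Timestamps copied from the worker log above.
FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2025-04-01 16:28:46,576", FMT)
end = datetime.strptime("2025-04-01 16:47:22,904", FMT)

# Wall-clock seconds between "Start parsing" and "Conversion secs".
wall_secs = (end - start).total_seconds()

# Value printed from result.timings["pipeline_total"].times in the log.
pipeline_secs = 32.63109178701416

# Time spent inside the task that the pipeline timing does not account for.
unaccounted_secs = wall_secs - pipeline_secs
print(f"wall clock: {wall_secs:.1f}s, pipeline: {pipeline_secs:.1f}s, "
      f"unaccounted: {unaccounted_secs:.1f}s")
```

The wall-clock gap is about 1116 seconds, so roughly 18 minutes are spent outside what `pipeline_total` measures. Since "Converter initialized." is logged in the same millisecond as "Start parsing", that time falls between constructing the converter and the print that follows `converter.convert(source)`.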

Apparently something is wrong with my Celery config, but I have been unable to get it right after digging for hours.
I would appreciate it if anyone knows what is happening. Sorry if this issue is out of scope for this repo; feel free to close it.


This is a very simple task, yet it fails to parse PDF files.

from celery_app import app
from docling.document_converter import DocumentConverter

# NOTE: get_redis_client() is a project helper that is not shown in this report.

def call_docling(source: str) -> str:
    """Convert `source` with Docling and return the document as Markdown."""
    print(f"Start parsing: {source}")
    converter = DocumentConverter()
    print("Converter initialized.")
    result = converter.convert(source)
    # Durations (in seconds) that Docling recorded for the whole pipeline.
    doc_conversion_secs = result.timings["pipeline_total"].times
    print(f"Conversion secs: {doc_conversion_secs}")
    return result.document.export_to_markdown()

@app.task(bind=True)
def parse(self, source: str):
    task_id = self.request.id
    redis_key = f"parse:{task_id}"
    redis_client = get_redis_client()
    result = call_docling(source)
    # Store the Markdown under the task ID and let it expire after 5 minutes.
    redis_client.set(redis_key, result)
    redis_client.expire(redis_key, 300)
    return task_id

Then call it with:

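url = "somePDF.pdf"
# parse returns the task ID; the Markdown itself is stored in Redis under parse:<task_id>.
task_id = parse.delay(url).get()
content = get_redis_client().get(f"parse:{task_id}").decode("utf-8")
print(content)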

A web URL returns within a second, but a PDF file takes more than 15 minutes. My machine has 4 x RTX 6000 Ada GPUs, and when I run Docling standalone in a Python script it parses the PDF in 3~5 seconds.
Under Celery, when the task is called, I can see in nvitop that the model is loaded onto the GPU (VRAM usage increases), but GPU utilization stays near 0%.
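
One way to see where the worker is sitting while the GPU is idle is to have Python dump every thread's stack if the conversion runs too long. The sketch below uses only the standard-library `faulthandler` module. The helper name `dump_stacks_if_slow`, the log path, and the 60-second threshold are all hypothetical choices, not part of Docling or Celery. It writes to a real file because `faulthandler` needs a file descriptor, and Celery typically redirects `sys.stderr` to its logger.

```python
import faulthandler
from contextlib import contextmanager

@contextmanager
def dump_stacks_if_slow(timeout_secs: float, path: str):
    """Append all thread tracebacks to `path` every `timeout_secs`
    for as long as the wrapped block is still running."""
    with open(path, "a") as fh:
        # A watchdog thread writes the tracebacks; it does not need the GIL,
        # so it still fires if the main thread is blocked in native code.
        faulthandler.dump_traceback_later(timeout_secs, repeat=True, file=fh)
        try:
            yield
        finally:
            # Cancel before the file is closed by the enclosing `with`.
            faulthandler.cancel_dump_traceback_later()

# Hypothetical usage inside call_docling (not executed here):
#
#     with dump_stacks_if_slow(60, "/tmp/docling_stacks.log"):
#         result = converter.convert(source)
```

If the conversion finishes within the threshold, nothing is written. Otherwise the log shows which call each thread was in at that moment, which may help tell a slow model load apart from a lock or a thread-pool stall.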

Thanks for any help!

Docling version

Version: 2.28.4

Python version

3.10.12
