Description
Bug
I would like to run docling as a parser task in Celery, but it only works for non-PDF files.
Steps to reproduce
UPDATE2: Tried the OMP_NUM_THREADS=1 and -P solo settings. No luck.
UPDATE: I am not a Celery expert. The task itself reported that it took only 32 sec to process the file, but in wall-clock time it took about 19 min, as the logger timestamps show.
[2025-04-01 16:28:46,576: WARNING/MainProcess] Start parsing: somePDF.pdf
[2025-04-01 16:28:46,576: WARNING/MainProcess] Converter initialized.
[2025-04-01 16:47:22,904: WARNING/MainProcess] Conversion secs: [32.63109178701416]
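In the log above, the gap falls between the "Converter initialized." print and the "Conversion secs" print, i.e. around the `converter.convert(source)` call. A minimal sketch for logging wall-clock time next to the reported pipeline timing could look like this. The helper name `timed_call` is hypothetical, and the workload in the demo is a stand-in, not docling:

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload. Inside the task this would wrap the real call,
    # e.g. timed_call(converter.convert, source).
    total, wall_secs = timed_call(sum, range(1_000_000))
    print(f"Wall-clock secs: {wall_secs:.3f}")
```

Printing this value alongside `result.timings["pipeline_total"].times` would show whether the extra minutes are spent inside the call but outside what the pipeline timer measures.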
There is apparently something wrong with my Celery config, but I have been unable to get it right after digging for hours.
I would appreciate it if anyone knows what is happening. Sorry if this issue is out of scope for this repo; feel free to close it.
The task is super simple, yet it fails to parse PDF files.
from celery_app import app
from docling.document_converter import DocumentConverter


def call_docling(source: str) -> str:
    print(f"Start parsing: {source}")
    converter = DocumentConverter(
    )
    print(f"Converter initialized.")
    result = converter.convert(source)
    doc_conversion_secs = result.timings["pipeline_total"].times
    print(f"Conversion secs: {doc_conversion_secs}")
    return result.document.export_to_markdown()


@app.task(bind=True)
def parse(self, source: str):
    task_id = self.request.id
    redis_key = f"parse:{task_id}"
    redis_client = get_redis_client()
    result = call_docling(source)
    redis_client.set(redis_key, result)
    redis_client.expire(redis_key, 300)
    return self.request.id

Then call it by:
url = "somePDF.pdf"
content = parse.delay(url).get().decode("utf-8")
print(web)

For a web URL it returns within a second, but for a PDF file it takes more than 15 min. My machine has 4 x RTX 6000 Ada GPUs, and if I use docling standalone in a Python file it parses a PDF in 3~5 sec.
Under Celery, when the task is called, I can see in nvitop that the model is loaded onto the GPU (VRAM usage increases), but GPU utilization stays near 0%.
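To compare the standalone run with the Celery worker, one option is to log a snapshot of the process environment from both. This is only a diagnostic sketch using the standard library; the function name `runtime_snapshot` is hypothetical, and nothing in this report establishes that any of these values is the cause:

```python
import multiprocessing
import os
import threading
from typing import Dict, Optional


def runtime_snapshot() -> Dict[str, Optional[str]]:
    """Collect process, thread, and environment details for the current process."""
    # os.sched_getaffinity is not available on every platform, so guard it.
    usable_cpus = (
        str(len(os.sched_getaffinity(0))) if hasattr(os, "sched_getaffinity") else None
    )
    return {
        "pid": str(os.getpid()),
        "process_name": multiprocessing.current_process().name,
        "thread_name": threading.current_thread().name,
        "cpu_count": str(os.cpu_count()),
        "usable_cpus": usable_cpus,
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    for key, value in runtime_snapshot().items():
        print(f"{key}: {value}")
```

Calling `runtime_snapshot()` at the top of `call_docling` in both setups would show whether the worker sees a different CPU set, thread setting, or GPU visibility than the standalone script.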
Thanks for any help!
Docling version
Version: 2.28.4
Python version
3.10.12