[PDF변환] 사이냅소프트 PDF 컨버터 SDK 도입 #187#197
Conversation
- attachment_processor / convert_processor / intelligent_processor 의
PDF 변환 로직을 LibreOffice 단일 → PDF SDK(default) + LibreOffice fallback 으로 전환
- convert_to_pdf(file_path, use_pdf_sdk=True) 시그니처: kwargs 로 엔진 선택
- intelligent_processor: kwargs 'auto_convert_to_pdf' (default True) 추가.
비-PDF 입력에 대해 자동 PDF 변환 여부 제어 (False=변경 전 동작)
- convert_processor 의 자체 convert_to_pdf 제거 → attachment_processor 에서 import 재사용
- PDF SDK 경로: PDF_SDK_HOME 환경변수 → fallback <repo_root>/pdf_sdk
- fontconfig conf 런타임 동적 패치 (현재 SDK 경로 기준 <dir>/<cachedir> 치환)
[빌드/배포]
- Dockerfile: 'pdf_sdk_artifacts' stage 신설 (HF private dataset
HeechanKim-Genon/pdf_sdk 다운로드 + chmod + fontconfig 보정)
- runtime stage: PDF_SDK_HOME=/app/pdf_sdk ENV + COPY 추가
- HWP SDK 와 동일한 HF_TOKEN 으로 두 SDK 모두 다운로드
[테스트]
- tests/smoke/test_pdf_convert_smoke.py 추가
attachment_processor × {ppt, pptx} × {use_pdf_sdk=True, False} parametrize
→ vectors 정상 반환 + schema 검증 (LibreOffice fallback 회귀 방지)
[문서]
- genon/README.md: HF 다운로드 가이드 (도커 빌드 + 로컬 실행),
SDK 바이너리 단독 실행 가이드 (Linux 전용 명시)
- gitbook_doc/intro.md, attachment_processor.md, convert_processor.md,
intelligent_processor.md: PDF 변환 엔진 분기 / use_pdf_sdk·auto_convert_to_pdf
config / SDK 경로 결정 로직 반영
- .gitignore: pdf_sdk/ 추가
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#175 PR(692a863)에서 추가된 settings.py 의 모듈 레벨 minio_config = MinioConfig() 가 PROD/DEV 환경 무관하게 import 시점에 MINIO_ENDPOINT/MINIO_ACCESS_KEY/MINIO_SECRET_KEY 세 환경변수의 존재를 강제 검증하는 문제 수정. main.py:49 의 로직상 MinIO 다운로드는 settings.PREPROCESSOR_ID 가 있을 때만 호출되므로, PREPROCESSOR_ID 가 없는 환경(MinIO 를 쓰지 않는 환경)에서도 워커 boot 가 실패하는 회귀가 발생함. 수정: - common/settings.py: 모듈 레벨 minio_config 인스턴스 제거 (MinioConfig 클래스 정의는 유지) - util/minio_resource.py: download_resource_files 함수 안에서 MinioConfig() 인스턴스화로 이동 → MinIO 사용 시점에만 환경변수 검증, 사용 안 하는 환경은 영향 없음 (이 수정은 187 브랜치의 PDF SDK 도입과 무관한 별개 fix 이지만, 187 브랜치 빌드/배포 시 관측되어 같은 브랜치에 합쳐 처리) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
이전 커밋(79edf45)에서 코드 중복 제거 목적으로 convert_processor 와 intelligent_processor 가 attachment_processor 의 convert_to_pdf 를 import 하도록 통합했으나, Genos 웹 UI 는 facade 코드를 단일 파일 (preprocessor.py)로 다루기 때문에 사용자가 GitHub 의 원본을 그대로 복붙하면 attachment_processor 모듈을 찾을 수 없어 import 가 깨짐. 각 facade 가 독립적으로 동작하도록 convert_to_pdf / _convert_to_pdf_sdk / _convert_to_pdf_libreoffice / _patch_fontconfig 를 자체 정의로 복원. 세 파일 모두 동일한 구현(코드 중복) 을 보유하지만, 운영 환경의 단일 파일 제약을 만족하는 게 우선. 수정: - convert_processor.py: import 한 줄(convert_to_pdf) 제거 + 4개 함수 자체 정의 + shutil/subprocess/tempfile/unicodedata import 복원 - intelligent_processor.py: 동일 - gitbook_doc/convert_processor.md, intelligent_processor.md: "재사용/import" 표기를 "자체 정의" 로 정정 + 단일 파일 제약 사유 명시 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR integrates a PDF SDK as an alternative to LibreOffice for file-to-PDF conversion. It adds dual-backend routing with environment configuration, updates all preprocessor pipelines to support engine selection via a ChangesPDF SDK Integration & Dual-Backend Conversion
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a new PDF SDK as the primary engine for document conversion, complementing the existing LibreOffice fallback across the attachment, convert, and intelligent processors. Key changes include Dockerfile updates for SDK acquisition, comprehensive documentation updates, and a refactor of the conversion logic to support engine selection via the use_pdf_sdk parameter. Additionally, the PR delays MinioConfig instantiation to prevent worker startup failures in environments where MinIO is not utilized. Review feedback identifies a potential issue in the font configuration patching logic, suggesting the removal of the count=1 constraint in re.sub calls to ensure all relevant XML tags are correctly updated across all processor implementations.
| dst = os.path.join(tmp_dir, "fonts.conf") | ||
| with open(src, "r", encoding="utf-8") as f: | ||
| conf = f.read() | ||
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) |
There was a problem hiding this comment.
Using count=1 in re.sub might not patch all <dir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) | |
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf) |
| with open(src, "r", encoding="utf-8") as f: | ||
| conf = f.read() | ||
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) | ||
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1) |
There was a problem hiding this comment.
Using count=1 in re.sub might not patch all <cachedir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1) | |
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf) |
| dst = os.path.join(tmp_dir, "fonts.conf") | ||
| with open(src, "r", encoding="utf-8") as f: | ||
| conf = f.read() | ||
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) |
There was a problem hiding this comment.
Using count=1 in re.sub might not patch all <dir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) | |
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf) |
| with open(src, "r", encoding="utf-8") as f: | ||
| conf = f.read() | ||
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) | ||
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1) |
There was a problem hiding this comment.
Using count=1 in re.sub might not patch all <cachedir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1) | |
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf) |
| dst = os.path.join(tmp_dir, "fonts.conf") | ||
| with open(src, "r", encoding="utf-8") as f: | ||
| conf = f.read() | ||
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) |
There was a problem hiding this comment.
Using count=1 in re.sub might not patch all <dir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) | |
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf) |
| with open(src, "r", encoding="utf-8") as f: | ||
| conf = f.read() | ||
| conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1) | ||
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1) |
There was a problem hiding this comment.
Using count=1 in re.sub might not patch all <cachedir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1) | |
| conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf) |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
genon/preprocessor/facade/attachment_processor.py (2)
1314-1325:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the converted PDF path instead of discarding it.
For
.doc,.ppt/.pptx, and image inputs,convert_to_pdf()is called and then ignored; the loader still parses the original source file. That meansuse_pdf_sdkdoes not change the extraction path, and broken converters can stay hidden behind a successfulUnstructured*Loaderparse.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@genon/preprocessor/facade/attachment_processor.py` around lines 1314 - 1325, The code calls convert_to_pdf(file_path, use_pdf_sdk=use_pdf_sdk) but ignores its returned PDF path, continuing to pass the original file_path into UnstructuredWordDocumentLoader, UnstructuredPowerPointLoader, and UnstructuredImageLoader; change those branches to capture the returned path (e.g., pdf_path = convert_to_pdf(...)) and pass pdf_path to the respective loaders (UnstructuredWordDocumentLoader, UnstructuredPowerPointLoader, UnstructuredImageLoader) so the loaders operate on the converted PDF rather than the original source when use_pdf_sdk is used.
194-217:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe non-ASCII LibreOffice fallback checks the wrong PDF name.
If the ASCII temp copy succeeds, LibreOffice writes
<ascii_name>.pdfintoout_dir, not the original<in_path.stem>.pdf. The current existence check therefore misses successful conversions for non-ASCII filenames and returnsNoneanyway.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@genon/preprocessor/facade/attachment_processor.py` around lines 194 - 217, The LibreOffice fallback uses the original in_path.stem when checking for the produced PDF, which fails for non-ASCII temp copies; update the check in convert_to_pdf (attachment_processor.py) to derive pdf_path from the actual candidate filename (use cand.stem or the ascii_copy name) — e.g., build expected_pdf = Path(out_dir) / f"{cand.stem}.pdf" for each candidate, test its existence and return its string, and ensure tmp_dir is cleaned up on success or failure.genon/preprocessor/facade/convert_processor.py (1)
228-251:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe LibreOffice temp-copy fallback misses successful conversions for non-ASCII filenames.
sofficewill emit a PDF named aftercand.stem, but the success check only looks forin_path.with_suffix(".pdf"). When the ASCII temp copy is the one that succeeds, this path still returnsNone.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@genon/preprocessor/facade/convert_processor.py` around lines 228 - 251, The LibreOffice fallback currently checks only in_path.with_suffix(".pdf"), missing cases where the ASCII temp copy (ascii_copy) produced the PDF; update the loop that iterates candidates in convert_processor.py to derive the expected output PDF per candidate (e.g., pdf_path = out_dir / f"{cand.stem}.pdf") and check that path after subprocess.run, returning its string when found; also ensure tmp_dir is removed when the temp copy produced the PDF (and still cleaned up on all exits). Use the existing symbols cand, pdf_path, tmp_dir, convert_arg, out_dir, and in_path to implement this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@genon/preprocessor/facade/attachment_processor.py`:
- Around line 141-153: The subprocess calls that run the PDF conversion (the
subprocess.run call creating proc with cmd, checking proc.returncode and
pdf_path) must be bounded by a timeout and handle subprocess.TimeoutExpired
similarly to other failures: add a timeout argument to subprocess.run (for both
the pdfConverter branch shown around the cmd/proc/pdf_path logic and the soffice
branch referenced around lines 205-217), wrap the call in try/except
subprocess.TimeoutExpired, and in the except block log the timeout via
_log.warning (include proc-like context and the command) and return/raise the
same failure response path used for non-zero returncodes so the request isn't
blocked indefinitely. Ensure you reference the same variables (cmd, env, proc,
pdf_path) and reuse the existing failure handling where returncode != 0 is
currently handled.
In `@genon/preprocessor/facade/convert_processor.py`:
- Around line 1313-1318: The PPT and DOCX/PPTX branches in get_loader_langchain
(and the similar branch at lines ~1945-1946) call convert_to_pdf(file_path,
use_pdf_sdk=use_pdf_sdk) for side effects but continue using the original file,
so the use_pdf_sdk choice is never applied and conversion failures are ignored;
change these branches to capture the returned PDF path from convert_to_pdf and
pass that path into the LangChain loader (e.g., replace
UnstructuredPowerPointLoader(file_path) with
UnstructuredPowerPointLoader(pdf_path)) and do likewise for the DOCX/PPTX loader
paths, propagate the use_pdf_sdk argument into those calls, and if
convert_to_pdf returns None or raises, fail fast by raising an exception or
returning an error instead of proceeding with the original file.
- Around line 175-187: Add a bounded timeout to the external subprocess calls:
pass a sensible timeout (e.g. timeout=30) to subprocess.run when building and
running cmd in convert_processor.py, wrap the call in a try/except that catches
subprocess.TimeoutExpired, logs a clear warning including the command and that
it timed out (similar to the existing _log.warning usage for proc.stderr), and
treat the timeout as a conversion failure (return None or the current failure
path) instead of letting the request hang; apply the same change to the other
subprocess.run block around lines 240-251 so both conversions are protected by
timeouts and timeout handling.
In `@genon/preprocessor/facade/intelligent_processor.py`:
- Around line 63-75: The converter subprocess calls currently use subprocess.run
without timeouts, so add a timeout to both runs (the PDF SDK call shown around
the cmd variable and the other soffice call around lines 128-139) by passing a
sensible timeout value (e.g., convert_timeout) and wrap each call in try/except
subprocess.TimeoutExpired; on timeout log a warning (include the context like
"[convert_to_pdf:sdk]" or the soffice context) and treat it as a conversion
failure (return the same failure value/path handling used for non-zero return
codes). Ensure you reference and update the subprocess.run invocations in the
convert_to_pdf-related code paths and the soffice conversion path to include
timeout and TimeoutExpired handling.
- Around line 116-143: The check for successful LibreOffice conversion uses a
fixed pdf_path based on in_path, so when converting the ASCII temp copy
(ascii_copy) the produced file (ascii_name.pdf) isn't recognized; update the
loop in convert_to_pdf (function in intelligent_processor.py) to compute the
expected PDF path for each candidate (e.g., derive pdf_path from the current
cand's name/stem and out_dir) and check that path.exists() after subprocess.run;
keep the same tmp_dir cleanup behavior (remove tmp_dir on success or after the
loop).
In `@genon/preprocessor/tests/smoke/test_pdf_convert_smoke.py`:
- Around line 34-43: The current test_pdf_convert_smoke is insufficient because
basic_processor() still proceeds to parse the original PPT/PPTX instead of
verifying the PDF output; update the test to explicitly exercise convert_to_pdf
(or call a processor that consumes the returned PDF path) and assert on the
converted artifact itself: invoke the convert_to_pdf path used by
basic_processor() (or call convert_to_pdf directly), verify the returned PDF
path exists and is a non-empty file (and/or has a .pdf extension), and then pass
that PDF path into the downstream processor to ensure the PDF backend actually
produced a usable PDF rather than only validating PowerPoint parsing.
---
Outside diff comments:
In `@genon/preprocessor/facade/attachment_processor.py`:
- Around line 1314-1325: The code calls convert_to_pdf(file_path,
use_pdf_sdk=use_pdf_sdk) but ignores its returned PDF path, continuing to pass
the original file_path into UnstructuredWordDocumentLoader,
UnstructuredPowerPointLoader, and UnstructuredImageLoader; change those branches
to capture the returned path (e.g., pdf_path = convert_to_pdf(...)) and pass
pdf_path to the respective loaders (UnstructuredWordDocumentLoader,
UnstructuredPowerPointLoader, UnstructuredImageLoader) so the loaders operate on
the converted PDF rather than the original source when use_pdf_sdk is used.
- Around line 194-217: The LibreOffice fallback uses the original in_path.stem
when checking for the produced PDF, which fails for non-ASCII temp copies;
update the check in convert_to_pdf (attachment_processor.py) to derive pdf_path
from the actual candidate filename (use cand.stem or the ascii_copy name) —
e.g., build expected_pdf = Path(out_dir) / f"{cand.stem}.pdf" for each
candidate, test its existence and return its string, and ensure tmp_dir is
cleaned up on success or failure.
In `@genon/preprocessor/facade/convert_processor.py`:
- Around line 228-251: The LibreOffice fallback currently checks only
in_path.with_suffix(".pdf"), missing cases where the ASCII temp copy
(ascii_copy) produced the PDF; update the loop that iterates candidates in
convert_processor.py to derive the expected output PDF per candidate (e.g.,
pdf_path = out_dir / f"{cand.stem}.pdf") and check that path after
subprocess.run, returning its string when found; also ensure tmp_dir is removed
when the temp copy produced the PDF (and still cleaned up on all exits). Use the
existing symbols cand, pdf_path, tmp_dir, convert_arg, out_dir, and in_path to
implement this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 931c1aec-6945-42a9-a970-af948770595d
📒 Files selected for processing (13)
.gitignoregenon/README.mdgenon/preprocessor/docker/Dockerfilegenon/preprocessor/facade/attachment_processor.pygenon/preprocessor/facade/convert_processor.pygenon/preprocessor/facade/gitbook_doc/attachment_processor.mdgenon/preprocessor/facade/gitbook_doc/convert_processor.mdgenon/preprocessor/facade/gitbook_doc/intelligent_processor.mdgenon/preprocessor/facade/gitbook_doc/intro.mdgenon/preprocessor/facade/intelligent_processor.pygenon/preprocessor/src/common/settings.pygenon/preprocessor/src/util/minio_resource.pygenon/preprocessor/tests/smoke/test_pdf_convert_smoke.py
| cmd = [ | ||
| binary, | ||
| "-i", str(in_path), | ||
| "-o", str(out_dir), | ||
| "-t", tmp, | ||
| "-f", fonts_dir, | ||
| "-e", "-1", | ||
| "-p", "1", | ||
| ] | ||
| proc = subprocess.run(cmd, env=env, capture_output=True, text=True) | ||
| if proc.returncode == 0 and pdf_path.exists(): | ||
| return str(pdf_path) | ||
| _log.warning(f"[convert_to_pdf:sdk] stderr: {proc.stderr.strip()}") |
There was a problem hiding this comment.
Bound the SDK and LibreOffice subprocesses with timeouts.
These conversions execute inline during request handling. A hung pdfConverter or soffice process will block the request indefinitely. Please add a timeout and handle subprocess.TimeoutExpired the same way you handle other conversion failures.
Also applies to: 205-217
🧰 Tools
🪛 Ruff (0.15.12)
[error] 150-150: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/facade/attachment_processor.py` around lines 141 - 153,
The subprocess calls that run the PDF conversion (the subprocess.run call
creating proc with cmd, checking proc.returncode and pdf_path) must be bounded
by a timeout and handle subprocess.TimeoutExpired similarly to other failures:
add a timeout argument to subprocess.run (for both the pdfConverter branch shown
around the cmd/proc/pdf_path logic and the soffice branch referenced around
lines 205-217), wrap the call in try/except subprocess.TimeoutExpired, and in
the except block log the timeout via _log.warning (include proc-like context and
the command) and return/raise the same failure response path used for non-zero
returncodes so the request isn't blocked indefinitely. Ensure you reference the
same variables (cmd, env, proc, pdf_path) and reuse the existing failure
handling where returncode != 0 is currently handled.
| cmd = [ | ||
| binary, | ||
| "-i", str(in_path), | ||
| "-o", str(out_dir), | ||
| "-t", tmp, | ||
| "-f", fonts_dir, | ||
| "-e", "-1", | ||
| "-p", "1", | ||
| ] | ||
| proc = subprocess.run(cmd, env=env, capture_output=True, text=True) | ||
| if proc.returncode == 0 and pdf_path.exists(): | ||
| return str(pdf_path) | ||
| _log.warning(f"[convert_to_pdf:sdk] stderr: {proc.stderr.strip()}") |
There was a problem hiding this comment.
Add timeouts to the conversion subprocesses.
These external binaries run on the request path. If either converter hangs, the request hangs with it. A bounded timeout here would turn that into a normal conversion failure instead of a stuck worker.
Also applies to: 240-251
🧰 Tools
🪛 Ruff (0.15.12)
[error] 184-184: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/facade/convert_processor.py` around lines 175 - 187, Add a
bounded timeout to the external subprocess calls: pass a sensible timeout (e.g.
timeout=30) to subprocess.run when building and running cmd in
convert_processor.py, wrap the call in a try/except that catches
subprocess.TimeoutExpired, logs a clear warning including the command and that
it timed out (similar to the existing _log.warning usage for proc.stderr), and
treat the timeout as a conversion failure (return None or the current failure
path) instead of letting the request hang; apply the same change to the other
subprocess.run block around lines 240-251 so both conversions are protected by
timeouts and timeout handling.
| def get_loader_langchain(self, file_path: str, use_pdf_sdk: bool = True): | ||
| """PPT 파일용 langchain 로더""" | ||
| ext = os.path.splitext(file_path)[-1].lower() | ||
| if ext == '.ppt': | ||
| convert_to_pdf(file_path) | ||
| convert_to_pdf(file_path, use_pdf_sdk=use_pdf_sdk) | ||
| return UnstructuredPowerPointLoader(file_path) |
There was a problem hiding this comment.
The new use_pdf_sdk flag never feeds the parser in these paths.
In both the PPT LangChain branch and the DOCX/PPTX branch, convert_to_pdf() is called only for side effects and the returned PDF path is ignored. Parsing continues from the original file, so SDK vs. LibreOffice cannot change extraction results, and conversion failures degrade silently instead of failing fast.
Also applies to: 1945-1946
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/facade/convert_processor.py` around lines 1313 - 1318, The
PPT and DOCX/PPTX branches in get_loader_langchain (and the similar branch at
lines ~1945-1946) call convert_to_pdf(file_path, use_pdf_sdk=use_pdf_sdk) for
side effects but continue using the original file, so the use_pdf_sdk choice is
never applied and conversion failures are ignored; change these branches to
capture the returned PDF path from convert_to_pdf and pass that path into the
LangChain loader (e.g., replace UnstructuredPowerPointLoader(file_path) with
UnstructuredPowerPointLoader(pdf_path)) and do likewise for the DOCX/PPTX loader
paths, propagate the use_pdf_sdk argument into those calls, and if
convert_to_pdf returns None or raises, fail fast by raising an exception or
returning an error instead of proceeding with the original file.
| cmd = [ | ||
| binary, | ||
| "-i", str(in_path), | ||
| "-o", str(out_dir), | ||
| "-t", tmp, | ||
| "-f", fonts_dir, | ||
| "-e", "-1", | ||
| "-p", "1", | ||
| ] | ||
| proc = subprocess.run(cmd, env=env, capture_output=True, text=True) | ||
| if proc.returncode == 0 and pdf_path.exists(): | ||
| return str(pdf_path) | ||
| _log.warning(f"[convert_to_pdf:sdk] stderr: {proc.stderr.strip()}") |
There was a problem hiding this comment.
Add timeouts to both converter subprocesses.
Both binaries run directly on the request path. If pdfConverter or soffice hangs on a bad document, this coroutine blocks indefinitely and can pin a worker. Bound both calls with a timeout and treat TimeoutExpired as a conversion failure.
Also applies to: 128-139
🧰 Tools
🪛 Ruff (0.15.12)
[error] 72-72: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/facade/intelligent_processor.py` around lines 63 - 75, The
converter subprocess calls currently use subprocess.run without timeouts, so add
a timeout to both runs (the PDF SDK call shown around the cmd variable and the
other soffice call around lines 128-139) by passing a sensible timeout value
(e.g., convert_timeout) and wrap each call in try/except
subprocess.TimeoutExpired; on timeout log a warning (include the context like
"[convert_to_pdf:sdk]" or the soffice context) and treat it as a conversion
failure (return the same failure value/path handling used for non-zero return
codes). Ensure you reference and update the subprocess.run invocations in the
convert_to_pdf-related code paths and the soffice conversion path to include
timeout and TimeoutExpired handling.
| try: | ||
| in_path.name.encode('ascii') | ||
| candidates = [in_path] | ||
| tmp_dir = None | ||
| except UnicodeEncodeError: | ||
| tmp_dir = Path(tempfile.mkdtemp()) | ||
| ascii_name = unicodedata.normalize('NFKD', in_path.stem).encode('ascii', 'ignore').decode('ascii') or "file" | ||
| ascii_copy = tmp_dir / f"{ascii_name}{in_path.suffix}" | ||
| shutil.copy2(in_path, ascii_copy) | ||
| candidates = [ascii_copy, in_path] | ||
|
|
||
| for cand in candidates: | ||
| cmd = [ | ||
| "soffice", "--headless", | ||
| "--convert-to", convert_arg, | ||
| "--outdir", str(out_dir), | ||
| str(cand) | ||
| ] | ||
| proc = subprocess.run(cmd, env=env, capture_output=True, text=True) | ||
| if proc.returncode == 0 and pdf_path.exists(): | ||
| if tmp_dir: | ||
| shutil.rmtree(tmp_dir, ignore_errors=True) | ||
| return str(pdf_path) | ||
| _log.warning(f"[convert_to_pdf:libreoffice] stderr: {proc.stderr.strip()}") | ||
|
|
||
| if tmp_dir: | ||
| shutil.rmtree(tmp_dir, ignore_errors=True) | ||
| return None |
There was a problem hiding this comment.
The LibreOffice non-ASCII fallback never recognizes its own output.
When cand is the ASCII temp copy, LibreOffice writes <ascii_name>.pdf into out_dir, but success is checked only against in_path.with_suffix(".pdf"). For Korean filenames, a successful temp-copy conversion is therefore reported as a failure.
Suggested fix
for cand in candidates:
+ cand_pdf_path = out_dir / f"{cand.stem}.pdf"
cmd = [
"soffice", "--headless",
"--convert-to", convert_arg,
"--outdir", str(out_dir),
str(cand)
]
proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
- if proc.returncode == 0 and pdf_path.exists():
+ if proc.returncode == 0 and cand_pdf_path.exists():
+ if cand_pdf_path != pdf_path:
+ cand_pdf_path.replace(pdf_path)
if tmp_dir:
shutil.rmtree(tmp_dir, ignore_errors=True)
return str(pdf_path)🧰 Tools
🪛 Ruff (0.15.12)
[error] 134-134: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/facade/intelligent_processor.py` around lines 116 - 143,
The check for successful LibreOffice conversion uses a fixed pdf_path based on
in_path, so when converting the ASCII temp copy (ascii_copy) the produced file
(ascii_name.pdf) isn't recognized; update the loop in convert_to_pdf (function
in intelligent_processor.py) to compute the expected PDF path for each candidate
(e.g., derive pdf_path from the current cand's name/stem and out_dir) and check
that path.exists() after subprocess.run; keep the same tmp_dir cleanup behavior
(remove tmp_dir on success or after the loop).
| async def test_pdf_convert_smoke(basic_processor, sample: Path, use_pdf_sdk: bool): | ||
| """입력 ext × {PDF SDK, LibreOffice} 모두 vectors 반환 확인.""" | ||
| dp = basic_processor() | ||
| vectors = await dp(None, str(sample), use_pdf_sdk=use_pdf_sdk) | ||
|
|
||
| assert isinstance(vectors, list) and len(vectors) >= 1 | ||
| v = vectors[0] | ||
| if hasattr(v, "model_dump"): | ||
| v = v.model_dump() | ||
| assert "text" in v and isinstance(v["text"], str) |
There was a problem hiding this comment.
This smoke test can pass even when both PDF backends are broken.
basic_processor() currently goes through a path that still parses the original .ppt/.pptx after calling convert_to_pdf(), so these assertions only prove that PowerPoint loading still works. They do not prove that either backend actually produced a PDF. Please assert on the converted artifact or exercise a processor that consumes the returned PDF path directly.
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 35-35: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/tests/smoke/test_pdf_convert_smoke.py` around lines 34 -
43, The current test_pdf_convert_smoke is insufficient because basic_processor()
still proceeds to parse the original PPT/PPTX instead of verifying the PDF
output; update the test to explicitly exercise convert_to_pdf (or call a
processor that consumes the returned PDF path) and assert on the converted
artifact itself: invoke the convert_to_pdf path used by basic_processor() (or
call convert_to_pdf directly), verify the returned PDF path exists and is a
non-empty file (and/or has a .pdf extension), and then pass that PDF path into
the downstream processor to ensure the PDF backend actually produced a usable
PDF rather than only validating PowerPoint parsing.
Summary
kwargs.get('use_pdf_sdk', ...)패턴 (HwpProcessor 의use_hwp_sdk와 동일 컨셉) 으로 사용자가 엔진 선택 가능, 기본값TrueHeechanKim-Genon/pdf_sdk) 에서 SDK 자동 다운로드. HWP SDK 와 동일 토큰(HF_TOKEN) 사용intelligent_processor는 비-PDF 입력에 대해 PDF SDK 로 자동 변환 후 docling pipeline 진입 (auto_convert_to_pdf, defaultTrue)변경 내용 (커밋 분리)
1.
[PDF변환] 사이냅소프트 PDF 컨버터 SDK 도입 #187(메인)attachment_processor.py/convert_processor.py/intelligent_processor.py의convert_to_pdf시그니처 확장:(file_path, use_pdf_sdk=True)→ 내부 분기로 SDK / LibreOffice 호출intelligent_processor.py진입부에_is_pdf매직 헤더 검사 +auto_convert_to_pdf분기 추가PDF_SDK_HOME환경변수 → fallback<repo_root>/pdf_sdk<dir>/<cachedir>치환)Dockerfile:pdf_sdk_artifactsstage 신설 (HF dataset 다운로드 + chmod + conf 보정), runtime stage 에PDF_SDK_HOME=/app/pdf_sdkENV 추가tests/smoke/test_pdf_convert_smoke.py추가 —attachment_processor× {ppt, pptx} × {use_pdf_sdk=True, False} parametrize, vectors 정상 반환 + schema 검증genon/README.md+gitbook_doc/*.md문서 갱신.gitignore에pdf_sdk/추가2.
[FIX] MinioConfig 모듈 임포트 시 환경변수 강제 검증 제거(별도 fix)692a8637) 에서settings.py모듈 레벨에minio_config = MinioConfig()가 추가되면서, MinIO 를 사용하지 않는 환경에서도 import 시점에MINIO_*환경변수 누락만으로 워커 boot 가 실패하는 문제 발생main.py:49의if settings.PREPROCESSOR_ID:분기에서만 MinIO 가 실제로 호출되므로, 검증도 그 흐름 안에서 일어나야 함common/settings.py: 모듈 레벨 인스턴스 제거 (클래스 정의는 유지)util/minio_resource.py:download_resource_files함수 안에서MinioConfig()인스턴스화로 이동3.
[FIX] facade 간 import 제거 — Genos 단일 파일 환경 호환성 회복convert_processor/intelligent_processor가attachment_processor.convert_to_pdf를 import 하게 통합했었으나, Genos 웹 UI 가 facade 코드를 단일 파일(preprocessor.py)로 다루기 때문에 GitHub 원본 그대로 복붙 시 import 가 깨지는 회귀 발생convert_to_pdf+ 헬퍼 4종을 자체 정의로 복원 (코드 중복 부활하지만 단일 파일 제약이 우선)알려진 운영 환경 요구사항 (인프라 셋업 필요)
본 PR 의 코드 자체는 빌드/임포트 통과하나, PREPROCESSOR_ID 가 truthy 인 deployment 에서 정상 부팅하려면 다음 운영 셋업이 필요함 (#175 의 의존성, 본 PR 무관):
MINIO_ENDPOINT/MINIO_ACCESS_KEY/MINIO_SECRET_KEY환경변수 주입 — 기존 cluster 의minioSecret 활용 가능preprocessorbucket 생성후속 이슈로 분리 권장
download_resource_files를 fail-safe 하게 (bucket / 리소스 부재 시 warning + 진행 — 현재는 hard-fail)Test plan
mncregistry:30500/mnc/doc-parser-preprocessor:187-branch-test)use_pdf_sdk=False) 회귀 검증🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Chores