fix(classifiers): guard against content=None in DocumentLanguageClassifier (fixes #11418) by devteamaegis · Pull Request #11425 · deepset-ai/haystack

devteamaegis · 2026-05-28T08:19:10Z

Summary

Fixes #11418: DocumentLanguageClassifier crashes with a TypeError when a Document has content=None.

Before:

def _detect_language(self, document: Document) -> str | None:
    language = None
    try:
        language = langdetect.detect(document.content)  # TypeError if content is None
    except langdetect.LangDetectException:
        ...

langdetect.detect(None) raises TypeError, which is not caught by the LangDetectException handler. The exception propagates to the caller, crashing the pipeline. This affects any blob-only Document (e.g. images, PDFs loaded without text extraction) since Document.content is explicitly allowed to be None.

After: an explicit None guard is added before calling langdetect.detect(). Documents with content=None log a warning and return None (which causes run() to route them to "unmatched"), consistent with existing behaviour for text that langdetect fails to detect.
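A minimal sketch of the guard described above, not the code from this PR. To keep it runnable without third-party packages, `Doc`, `DetectionError` and `toy_detect` are hypothetical stand-ins for haystack's Document, langdetect.LangDetectException and langdetect.detect, and the log messages are illustrative only.

```python
import logging
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for haystack's Document (only the fields used here)."""
    content: Optional[str] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class DetectionError(Exception):
    """Hypothetical stand-in for langdetect.LangDetectException."""


def toy_detect(text: str) -> str:
    """Toy detector: rejects non-strings and blank text, otherwise answers 'en'."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    if not text.strip():
        raise DetectionError("no features in text")
    return "en"


def detect_language(document: Doc, detect: Callable[[str], str] = toy_detect) -> Optional[str]:
    # Guard first: a document without text never reaches the detector,
    # so the TypeError described above cannot occur.
    if document.content is None:
        logger.warning("Document %s has no content; skipping language detection.", document.id)
        return None
    try:
        return detect(document.content)
    except DetectionError:
        # Same outcome as the guard: warn and report "no language found".
        logger.warning("Could not detect the language of document %s.", document.id)
        return None
```

Both failure paths return None, so the caller needs only one check to decide whether a language was found.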

Changes

haystack/components/classifiers/document_language_classifier.py

Added if document.content is None: guard at the top of _detect_language
Logs a warning including the document ID (same pattern as the LangDetectException branch)
Returns None so the caller routes the document to "unmatched"

test/components/classifiers/test_document_language_classifier.py
Three new tests:

| Test | Assertion |
| --- | --- |
| `test_content_none_does_not_raise` | `run([Document(content=None)])` must not raise; the document gets `language="unmatched"` |
| `test_content_none_emits_warning` | A warning containing the document ID is logged |
| `test_mixed_none_and_text_content` | In a batch with one `None`-content doc and one text doc, both are classified correctly |
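The batch behaviour these tests check can be illustrated roughly as follows. This is a sketch under assumptions, not the component's actual `run()`: `Doc`, `route_by_language` and `toy_detect` are hypothetical names, and storing the result under `meta["language"]` is assumed for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for haystack's Document (only the fields used here)."""
    content: Optional[str] = None
    meta: Dict[str, str] = field(default_factory=dict)


def toy_detect(doc: Doc) -> Optional[str]:
    """Toy detector: None for missing content, 'en' for any text."""
    if doc.content is None:
        return None
    return "en"


def route_by_language(
    documents: List[Doc],
    languages: Iterable[str],
    detect: Callable[[Doc], Optional[str]] = toy_detect,
) -> List[Doc]:
    allowed = set(languages)
    for doc in documents:
        language = detect(doc)
        # None is never in the allowed set, so undetected documents
        # fall through to "unmatched" without a special case.
        doc.meta["language"] = language if language in allowed else "unmatched"
    return documents


docs = route_by_language(
    [Doc(content=None), Doc(content="This is an English sentence.")],
    languages=["en"],
)
print([d.meta["language"] for d in docs])  # ['unmatched', 'en']
```

Because a missing language and an unlisted language take the same branch, a document with content=None does not disturb the classification of the other documents in the batch.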

Test plan

All 10 tests in test_document_language_classifier.py pass: uv run --with langdetect --with pytest python -m pytest test/components/classifiers/test_document_language_classifier.py -v → 10 passed


vercel · 2026-05-28T08:19:16Z

@devteamaegis is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant · 2026-05-28T08:19:18Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

sjrl · 2026-05-28T08:21:13Z

Closing as duplicate of #11419

devteamaegis added 2 commits May 28, 2026 04:18

d218205 fix(classifiers): guard against None content in DocumentLanguageClassifier

345d9b2 test(classifiers): add regression tests for content=None in DocumentLanguageClassifier

devteamaegis requested a review from a team as a code owner May 28, 2026 08:19

devteamaegis requested review from sjrl and removed request for a team May 28, 2026 08:19

github-actions Bot added the topic:tests and type:documentation labels May 28, 2026

sjrl closed this May 28, 2026

devteamaegis wants to merge 2 commits into deepset-ai:main from devteamaegis:fix/document-language-classifier-none-content
