ValueError: Only one of languages and ocr_languages should be specified. languages is preferred. ocr_language... #2293

Closed
sentry-io bot opened this issue Dec 19, 2023 · 3 comments
sentry-io bot commented Dec 19, 2023

The API is hitting this error from Unstructured. We had decided that this shouldn't throw an error, and just take the value from languages if both are provided.

Some example inputs that cause this:

ocr_languages: ["\"es\""], languages: [""]
ocr_languages: ["eng"], languages: [""]
ocr_languages: ["deu"], languages: [""]
ocr_languages: ["deu"], languages: ["deu"]
ocr_languages: ["eng+deu"], languages: ["eng+deu"]
ocr_languages: ["english"], languages: ["english"]
ocr_languages: ["[eng]"], languages: [""]
ocr_languages: ["[eng]"], languages: ["[eng]"]

So it seems we need to:

  • Account for ocr_languages being sent when languages is empty. We can log a warning that this is deprecated.
  • Prefer languages if both are set
  • Remove the exception
ValueError: Only one of languages and ocr_languages should be specified. languages is preferred. ocr_languages is marked for deprecation.
(23 additional frame(s) were not displayed)
...
  File "prepline_general/api/general.py", line 869, in pipeline_1
    list(response_generator(is_multipart=False))[0]
  File "prepline_general/api/general.py", line 782, in response_generator
    response = pipeline_api(
  File "prepline_general/api/general.py", line 470, in pipeline_api
    raise e
  File "prepline_general/api/general.py", line 437, in pipeline_api
    elements = partition(**partition_kwargs)
  File "/home/notebook-user/unstructured/unstructured/partition/auto.py", line 229, in partition
    raise ValueError(
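
The three steps listed in this comment (fall back to ocr_languages when languages is empty, prefer languages when both are set, and drop the exception) could be sketched roughly as follows. This is only an illustration: `resolve_languages` is a hypothetical name, not a function in Unstructured, and treating `[""]` as "not provided" and returning an empty list when neither is set are assumptions based on the example inputs above.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def resolve_languages(
    languages: Optional[List[str]],
    ocr_languages: Optional[List[str]],
) -> List[str]:
    """Hypothetical helper: pick one language list without raising."""
    # Assumption: None, [] and [""] all mean "not provided".
    langs = [item for item in (languages or []) if item and item.strip()]
    ocr = [item for item in (ocr_languages or []) if item and item.strip()]

    if langs:
        # Prefer `languages` whenever it carries a real value.
        return langs
    if ocr:
        # Fall back to the deprecated parameter, but say so in the logs.
        logger.warning("ocr_languages is deprecated; use languages instead.")
        return ocr
    # Assumption: the caller applies its own default when nothing is given.
    return []
```

With this shape, `ocr_languages: ["deu"], languages: [""]` resolves to `["deu"]` with a warning, and any input where both are set resolves to the value of `languages`.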
Coniferish self-assigned this Dec 21, 2023
@Coniferish
Collaborator

@awalker4, I want to double-check that you want to handle the incorrect ways of passing languages. That is, ["[eng]"] should be handled instead of raising a ValueError.

@awalker4
Contributor

Yeah, I suppose that should get turned into a 400 error
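
One way to picture "turned into a 400 error" is a thin wrapper that catches the ValueError and reports it as a client error instead of letting it propagate. The sketch below is framework-free and purely illustrative: `run_with_400_on_value_error` and the `(status, payload)` return shape are assumptions, not how prepline_general is actually written.

```python
from typing import Any, Callable, Dict, Tuple


def run_with_400_on_value_error(
    handler: Callable[..., Any], **kwargs: Any
) -> Tuple[int, Dict[str, Any]]:
    """Hypothetical wrapper: map ValueError from `handler` to a 400 response."""
    try:
        result = handler(**kwargs)
    except ValueError as exc:
        # Bad user input: report it as a client error with the message.
        return 400, {"detail": str(exc)}
    return 200, {"result": result}
```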

github-merge-queue bot pushed a commit that referenced this issue Jan 11, 2024
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

This PR adds a dictionary for helping map fully spelled out languages to
tesseract language codes
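
As a rough idea of what such a dictionary looks like, here is a minimal sketch. The names `LANGUAGE_NAME_TO_TESSERACT_CODE` and `name_to_tesseract_code` are hypothetical, and the four entries are only a sample; the dictionary added by the PR may be named and populated differently.

```python
from typing import Optional

# Hypothetical excerpt: fully spelled out language names mapped to
# Tesseract language codes.
LANGUAGE_NAME_TO_TESSERACT_CODE = {
    "english": "eng",
    "german": "deu",
    "spanish": "spa",
    "french": "fra",
}


def name_to_tesseract_code(name: str) -> Optional[str]:
    """Look up a spelled-out language name, ignoring case and padding."""
    return LANGUAGE_NAME_TO_TESSERACT_CODE.get(name.strip().lower())
```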

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2024
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

Refactor `_convert_language_code_to_pytesseract_lang_code` and extract
`_get_iso639_language_object` to its own function


```
from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert
convert("English") # this will raise an error on both main and this branch
convert("en") # this will return "eng" on both branches
```
github-merge-queue bot pushed a commit that referenced this issue Jan 19, 2024
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

This PR adds `_clean_ocr_languages_arg`. There are no calls to this
function yet, but it will be called in later PRs related to this series.
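
The commit message does not describe what the cleanup does, so the following is only a guess at the kind of normalization the malformed inputs in this issue (`["\"es\""]`, `["[eng]"]`) would call for. `clean_ocr_languages_sketch` is a hypothetical stand-in, not the `_clean_ocr_languages_arg` added by the PR, and joining the result with `+` is an assumption based on the `eng+deu` form seen above.

```python
from typing import List, Union


def clean_ocr_languages_sketch(arg: Union[str, List[str]]) -> str:
    """Hypothetical cleanup: strip stray quotes/brackets, join codes with '+'."""
    parts = [arg] if isinstance(arg, str) else list(arg)
    codes: List[str] = []
    for part in parts:
        # Remove characters that show up in the malformed inputs.
        for ch in "[]\"' ":
            part = part.replace(ch, "")
        # Split combined values such as "eng+deu" and skip empty pieces.
        codes.extend(code for code in part.split("+") if code)
    return "+".join(codes)
```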
github-merge-queue bot pushed a commit that referenced this issue Jan 29, 2024
This PR is the last in a series of PRs for refactoring and fixing the
language parameters (`languages` and `ocr_languages`) so we can address
incorrect input by users. See #2293

It is recommended to go through this PR commit-by-commit and note the
commit message. The most significant commit is "update
check_languages..."
@Coniferish
Collaborator

Closing as final PR for this issue is merged
