Chore: don't pass empty language code to tesseract CLI #1996

Merged
badGarnet merged 20 commits into main from yuming/dont_pass_empty_str_to_tesseract
Nov 7, 2023


yuming-long
Contributor

@yuming-long yuming-long commented Nov 2, 2023

Summary:

Close: #1920

  • Stop passing empty strings from languages to tesseract; an empty entry ends up as an empty value for the tesseract CLI's -l language option.
  • Also stop passing duplicate language codes from languages to tesseract OCR.
  • If none of the ISO language codes in the languages parameter can be converted, proceed with OCR using eng as the default.
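
The three rules above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the function name `clean_languages_for_tesseract`, the tiny `ISO_TO_TESSERACT` mapping, and the warning messages are all hypothetical. The only tesseract-specific fact it relies on is that the CLI accepts multiple languages joined with `+` (for example `-l eng+fra`).

```python
import logging
from typing import Dict, Iterable, List

logger = logging.getLogger(__name__)

# Hypothetical, deliberately tiny mapping from ISO 639-1 codes to tesseract
# language codes. A real mapping would cover far more languages.
ISO_TO_TESSERACT: Dict[str, str] = {"en": "eng", "fr": "fra", "es": "spa"}
KNOWN_TESSERACT_CODES = set(ISO_TO_TESSERACT.values())


def clean_languages_for_tesseract(languages: Iterable[str]) -> str:
    """Build a value for tesseract's -l option (hypothetical helper).

    Empty entries are skipped, duplicates are dropped while keeping the
    original order, and "eng" is used when nothing could be converted.
    """
    converted: List[str] = []
    for raw in languages:
        code = raw.strip().lower()
        if not code:
            # Rule 1: never forward an empty string to the CLI.
            continue
        if code in KNOWN_TESSERACT_CODES:
            tesseract_code = code
        elif code in ISO_TO_TESSERACT:
            tesseract_code = ISO_TO_TESSERACT[code]
        else:
            logger.warning("Skipping unrecognized language code: %r", raw)
            continue
        if tesseract_code not in converted:
            # Rule 2: keep each language code only once.
            converted.append(tesseract_code)
    if not converted:
        # Rule 3: fall back to English when no entry was usable.
        logger.warning("No usable language codes found; defaulting to eng")
        return "eng"
    return "+".join(converted)
```

With this sketch, `clean_languages_for_tesseract([""])` falls back to `"eng"`, and `clean_languages_for_tesseract(["eng", "eng", "fr"])` yields `"eng+fra"`.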

Test

  • First, reproduce the tesseract error "Estimating resolution as X" without this change:
    • In the unstructured-api repo, on the main branch, run make run-web-app.
    • Send a request with curl using an empty languages string, or any other invalid input such as -F 'languages="eng,de"':
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
  -F 'languages=""' \
  -F 'strategy=hi_res' \
  -F 'pdf_infer_table_structure=True' \
  | jq -C . | less -R
  • After this change:
    • In your unstructured API env, cd to the unstructured repo and install it locally with pip install -e .
    • Check out this branch.
    • Run make run-web-app again in the api repo.
    • Run the curl command again; it should now return output, and a warning should appear in the log.
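
For anyone who prefers to script the check, the request above could be sent from Python along these lines. This is a sketch under assumptions: it uses the third-party `requests` package, the helper names `build_form_fields` and `post_document` are hypothetical, and it assumes the curl command is meant to send an empty string for `languages`. The `Content-Type` header is left to `requests`, which sets the multipart boundary itself.

```python
from typing import Any, Dict

# URL taken from the curl command in the test steps.
API_URL = "http://0.0.0.0:8000/general/v0/general"


def build_form_fields(languages: str) -> Dict[str, str]:
    """Return the non-file form fields used in the curl command (hypothetical helper)."""
    return {
        "languages": languages,
        "strategy": "hi_res",
        "pdf_infer_table_structure": "True",
    }


def post_document(path: str, languages: str = "") -> Any:
    """Send one document to a locally running API and return the parsed JSON."""
    import requests  # third-party dependency, imported lazily

    with open(path, "rb") as fh:
        response = requests.post(
            API_URL,
            headers={"accept": "application/json"},
            files={"files": fh},
            data=build_form_fields(languages),
            timeout=300,
        )
    response.raise_for_status()
    return response.json()
```

Calling `post_document("sample-docs/layout-parser-paper-with-table.jpg", languages="")` against the server started by make run-web-app would exercise the same path as the curl command; on main the request is expected to fail with the tesseract error, so `raise_for_status()` would likely raise there.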

@yuming-long yuming-long temporarily deployed to ci November 2, 2023 20:34 — with GitHub Actions Inactive
@Coniferish
Collaborator

@yuming-long, do you use pyenv for env management? I'm having trouble installing the local changes to unstructured in my api environment. Will keep working on it, but figured I'd ask.

@yuming-long
Contributor Author

yuming-long commented Nov 2, 2023

Hi @Coniferish, yes I use pyenv.
From the api repo, I run pyenv activate unstructured-api, then cd ../unstructured, and then pip install -e . and that works for me. Let me know if you need help with the error :)

@Coniferish
Collaborator

Sounds good. It's probably because I set 'local' envs for the different repos and need to clear out the config or something.

@yuming-long yuming-long temporarily deployed to ci November 2, 2023 23:41 — with GitHub Actions Inactive
@yuming-long yuming-long temporarily deployed to ci November 2, 2023 23:43 — with GitHub Actions Inactive
@yuming-long yuming-long added this pull request to the merge queue Nov 7, 2023
@badGarnet badGarnet removed this pull request from the merge queue due to a manual request Nov 7, 2023
@badGarnet badGarnet merged commit ad14321 into main Nov 7, 2023
46 checks passed
@badGarnet badGarnet deleted the yuming/dont_pass_empty_str_to_tesseract branch November 7, 2023 01:30
Development

Successfully merging this pull request may close these issues.

TesseractError: Estimating resolution as x
5 participants