Add support for OCR'ing more languages #942

Closed
wallace11 opened this issue Jul 18, 2021 · 4 comments · Fixed by #1559
Labels
documentation Improvements or additions to documentation

Comments

@wallace11
Contributor

Hi there,
I see that even though tesseract supports over 100 languages already, only a handful are available in docspell.
I was wondering if it was possible to add more languages to the OCR.
It looks like most of the work lies in configuring the language objects in https://github.com/eikek/docspell/blob/master/modules/common/src/main/scala/docspell/common/Language.scala, after which the UI needs to be adjusted accordingly.

If adding a new language is more complicated than that, would it be possible to add a contribution guide so that the community can take care of it via PRs?

Cheers.

@eikek
Owner

eikek commented Jul 19, 2021

Hi @wallace11 ,
Yes, it's not that hard to add more languages for tesseract. If you are missing just one language, you could simply override the tesseract command as a quick (and dirty :-)) workaround.

For adding more languages, I could indeed write a little guide. The only difficulty is that I try to recognize dates by using the date format patterns of the specific locale for the language. If you can give me the language(s) that you miss and the date patterns (like in #679), I could add them with only a little work.
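
To illustrate why each language needs its own date patterns, here is a minimal sketch in Python. It is not docspell's actual implementation (which lives in the Scala code base); the `DATE_FORMATS` table and the `parse_date` helper are hypothetical names made up for this example, and the formats listed are only illustrative.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical per-language date formats (assumption for illustration only,
# not the table docspell uses). Keys are tesseract-style language codes.
DATE_FORMATS = {
    "eng": ["%m/%d/%Y", "%Y-%m-%d"],
    "deu": ["%d.%m.%Y", "%Y-%m-%d"],
}


def parse_date(text: str, lang: str) -> Optional[date]:
    """Try each format registered for `lang`; return None if none matches."""
    for fmt in DATE_FORMATS.get(lang, []):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            # This pattern did not match, try the next one.
            continue
    return None
```

The same digits can mean different dates depending on the language: `parse_date("03/04/2021", "eng")` is read as March 4, while `parse_date("03.04.2021", "deu")` is read as April 3. A language without an entry in the table simply yields no date, which is why the patterns have to be supplied along with each new language.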

@mtonnie

mtonnie commented Mar 1, 2022

What about the idea of installing additional languages on demand?
Add an environment variable, e.g. something like OCR_LANGUAGES, and then use this variable in the entrypoint script to check for and install the defined languages.

This might reduce the base image size dramatically (probably by 700-800 MB).
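
A rough sketch of what such an entrypoint step might compute, written in Python for readability. Everything here is an assumption: the variable name `OCR_LANGUAGES`, the comma-separated format, the helper names, and the Debian-style `tesseract-ocr-<lang>` package naming (other distributions name these packages differently). The set of already installed languages would have to come from somewhere like the output of `tesseract --list-langs`; here it is just passed in.

```python
import os


def requested_languages(env=os.environ, var="OCR_LANGUAGES"):
    """Parse a comma-separated list of language codes from the environment.

    The variable name and format are hypothetical. Blank entries and
    duplicates are dropped; order of first appearance is kept.
    """
    seen = []
    for item in env.get(var, "").split(","):
        code = item.strip().lower()
        if code and code not in seen:
            seen.append(code)
    return seen


def install_command(langs, installed, pkg_prefix="tesseract-ocr-"):
    """Build the package install command for languages not yet present.

    Returns an empty list when nothing is missing. The package prefix
    assumes Debian-style naming and must be adapted to the base image.
    """
    missing = [lang for lang in langs if lang not in installed]
    if not missing:
        return []
    return ["apt-get", "install", "-y"] + [pkg_prefix + lang for lang in missing]
```

With `OCR_LANGUAGES="deu, fra"` and only `eng` and `deu` present, this would produce `apt-get install -y tesseract-ocr-fra`. The command is only built, not executed, so the entrypoint can decide whether to run it or just start the app.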

@eikek
Owner

eikek commented Mar 1, 2022

That would be really nice, though I don't think I understand how it would work. Do you mean that when the container starts, it installs additional packages and then starts the app?

It would be really nice to reduce the image size! OTOH I myself have plenty of space 😃 … and so wouldn't spend much time on it. The downside is that users could select languages in the UI that are not supported. The languages are currently hard-coded. But a subset of those could be put into the config file so that the UI hides non-supported ones. Then it would still be possible to mess it up, but I think that's ok. But then you need to configure more stuff in the docker-compose.yml.
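
The idea of hiding non-supported languages could look roughly like the following Python sketch. It is only an illustration under assumptions: `ALL_LANGUAGES` stands in for the hard-coded list (with a handful of example codes, not the real one), and `enabled_languages` is a hypothetical helper, not an existing docspell function or config key.

```python
# Stand-in for the hard-coded language list (illustrative subset only).
ALL_LANGUAGES = ("deu", "eng", "fra", "ita", "spa")


def enabled_languages(configured):
    """Return the languages the UI should offer.

    `configured` is the optional subset from the config file. An empty or
    missing subset keeps the current behaviour (offer everything). Codes
    that are not in the hard-coded list are rejected early, since they
    could never be selected anyway.
    """
    if not configured:
        return list(ALL_LANGUAGES)
    unknown = [code for code in configured if code not in ALL_LANGUAGES]
    if unknown:
        raise ValueError("unknown language codes: " + ", ".join(unknown))
    # Keep the order of the hard-coded list so the UI stays stable.
    return [lang for lang in ALL_LANGUAGES if lang in configured]
```

This only narrows what the UI shows. It cannot verify that the matching tesseract data is actually installed, so, as noted above, it would still be possible to mess it up.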

eikek added this to the Docspell 0.36.0 milestone May 21, 2022
eikek added a commit that referenced this issue May 21, 2022
eikek linked a pull request May 21, 2022 that will close this issue
eikek added the documentation label May 21, 2022
eikek added a commit that referenced this issue May 21, 2022
mergify bot closed this as completed in #1559 May 21, 2022
@eikek
Owner

eikek commented May 21, 2022

Added a little guide to https://docspell.org/docs/dev/add-language/ (will be published with 0.36.0)
