Support for more document languages #488

Closed
eikek opened this issue Nov 30, 2020 · 5 comments · Fixed by #581
Labels
joex affects the joex component

Comments

@eikek
Owner

eikek commented Nov 30, 2020

The supported languages are currently English, German and French, because these are the ones stanford-nlp supports. Other languages could be added easily without NLP support: a fallback could be provided for them, and then adding more languages is not hard. Maybe get rid of NLP altogether.

See: #461

@mrtnggnn

mrtnggnn commented Jan 1, 2021

Docspell 0.17.1, installed with the docker-compose method, does not seem to have the French language data installed for tesseract.
French is the default language for my collective.

2020-12-31T17:03:10: ocrmypdf stderr: ERROR - The installed version of tesseract does not have language data for the following requested languages: fra
2020-12-31T17:03:10: PDF conversion failed: Command result=3. No output file found.. Go without PDF file
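
One way to confirm which language data is missing is to compare the requested codes against what `tesseract --list-langs` reports. The following is only a minimal sketch: the helper names `parse_list_langs`, `missing_languages` and `installed_languages_output` are hypothetical, and it assumes the command prints a header line starting with "List of available languages" followed by one language code per line.

```python
import subprocess


def parse_list_langs(output):
    """Turn the text printed by `tesseract --list-langs` into a set of codes.

    Assumption: a header line starting with "List of available languages"
    is followed by one language code per line.
    """
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line.
        if not line or line.startswith("List of available languages"):
            continue
        langs.add(line)
    return langs


def missing_languages(requested, output):
    """Return the requested codes that are absent from the listing, in order."""
    installed = parse_list_langs(output)
    return [lang for lang in requested if lang not in installed]


def installed_languages_output():
    """Run `tesseract --list-langs` and return everything it printed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions print the listing to stderr, so both streams are combined.
    return result.stdout + result.stderr
```

Inside the container, `missing_languages(["fra"], installed_languages_output())` would return `["fra"]` if the French data is absent, which would match the ocrmypdf error above.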

@eikek
Owner Author

eikek commented Jan 1, 2021

Thanks @mrtnggnn – this is missing indeed! I'll create a new issue from your comment to fix this bug in the docker file.

@eikek
Owner Author

eikek commented Jan 17, 2021

The next release will include the following languages for document processing: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish, Norwegian, Swedish, Russian and Romanian.

If you'd like others to be included, please let me know.

@maciekb

maciekb commented Feb 4, 2022

Please consider adding Polish language processing support.

@eikek
Owner Author

eikek commented Feb 5, 2022

> Please consider adding Polish language processing support.

@maciekb Yes, of course! Let's create a new issue, #1345, for this; otherwise it gets lost too easily. Can I ping you for help? Unfortunately, I don't know any Polish.
