Determine available languages and provide a choice for them #8

Open
zuphilip opened this issue Nov 23, 2019 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@zuphilip
Member

Currently, we use a fixed language such as deu or eng for OCR with Tesseract. In many cases, however, script/Latin is a better choice, or script/Fraktur for old texts. Other languages and scripts should also be available to choose from.

There are several things to consider here:

  1. How can we find out which languages are available for the currently installed Tesseract? It is possible to run commands like tesseract --list-langs from the extension, but we cannot access the output or pipe it anywhere from Zotero. Should we just ship a one-line script (a shell script for Linux/macOS and a bat file for Windows) that calls the command above and pipes its output to a file, which we can then analyze? Other ideas?
  2. We could have some general options and define a standard model there. In the settings pane you could then change this model depending on the languages you have installed (see 1.).
  3. We could analyze the language field of each Zotero entry to choose a different model. This would allow us, for example, to use the deu model for German texts and the eng model for English texts. However, it might not always be that simple. For older German texts one should perhaps use the script/Fraktur model instead, and the script/Latin model is quite often better for texts that include names in foreign languages.
  4. Maybe it is better to ask before each call which language to use. You could then manually select all the entries that can be recognized with the same language. Moreover, one could possibly offer some more Tesseract options to toggle on and off. What do you think?
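
As an illustration of point 3, here is a minimal sketch of how a language field value could be mapped to a Tesseract model name. The table `ISO_TO_TESSERACT`, the function `pick_model`, and the fallback value are hypothetical names chosen for this example and are not part of the extension; a real mapping would need more entries and would have to deal with free-text values such as "German".

```python
# Hypothetical mapping from two-letter language codes to Tesseract model names.
# Only a few entries are shown; this is an assumption, not the extension's table.
ISO_TO_TESSERACT = {"de": "deu", "en": "eng", "fr": "fra"}


def pick_model(language_field, default="eng"):
    """Return a Tesseract model name for a Zotero language field value."""
    # Reduce values such as "de-DE" or " EN " to a lower-case primary subtag.
    code = language_field.strip().lower().replace("_", "-").split("-")[0]
    # Anything we do not recognise falls back to the configured default.
    return ISO_TO_TESSERACT.get(code, default)


print(pick_model("de-DE"))
print(pick_model("", default="script/Latin"))
```

Passing a script model as `default` would cover the case raised above, where script/Latin is the safer choice for entries without a usable language field.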

CC @stweil @luerhard

@stweil
Member

stweil commented Nov 23, 2019

Keep it simple. I think it would be sufficient to have a user option (similar to the tesseract path option) for the language / script which is preset to eng (the default language which is always installed). The user would be responsible for installing and selecting the right models, otherwise Tesseract would simply fail with an error.

Latin (or script/Latin, depending on your installation) is a good choice for all texts based on Latin script. Some users might need Cyrillic, Greek, Arabic or other scripts. The user option would also allow setting Latin+Greek+Arabic, for example, so I see no need to ask each time.
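
A combined setting like the one above could be passed to Tesseract through its `-l` option, which accepts several models joined with `+`. The following is only a sketch: the helper name `build_tesseract_args` and the file names are made up for illustration, and nothing here reflects how the extension actually assembles its command line.

```python
def build_tesseract_args(image, output_base, models, tesseract="tesseract"):
    """Build an argument list for a Tesseract call with one or more models."""
    if not models:
        raise ValueError("at least one language or script model is required")
    # Tesseract expects multiple models as a single "+"-separated value.
    return [tesseract, image, output_base, "-l", "+".join(models)]


args = build_tesseract_args("page.png", "page", ["Latin", "Greek", "Arabic"])
print(" ".join(args))
```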

@luerhard
Contributor

Regarding 1.
For Unix systems, it would probably be enough to run the command
tesseract --list-langs > /path/to/file.txt
to print all the available languages to a file.

If this works, one could implement a dropdown menu for selecting the language. I think that would be enough.
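
To feed such a dropdown, the file written by the command would have to be parsed. The sketch below assumes the output consists of a header line beginning with "List of available languages" followed by one model name per line. The exact header wording differs between Tesseract versions, and the sample text is illustrative, not captured from a real installation.

```python
def parse_list_langs(text):
    """Extract model names from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the header line (assumed prefix, see lead-in).
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


# Illustrative sample, not real output from a specific installation.
sample = """List of available languages (3):
deu
eng
script/Latin
"""
print(parse_list_langs(sample))
```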

@zuphilip
Member Author

A simple solution, a free-text box in the new preferences as @stweil suggested, is now implemented.

I am aware of the Tesseract command that lists all available languages, but I don't see a way to call it from Zotero and save its output somewhere. But yes, we could create a file with something like this.

Let us wait a little longer and see how well the simple solution works in practice.

@zuphilip zuphilip added enhancement New feature or request help wanted Extra attention is needed labels Jan 23, 2021
@zettelberg

I had a related problem: I am used to typing "de" rather than "deu" in similar cases (which I should of course have verified by running "tesseract --list-langs"), so it took me quite a long time to find the solution, also because the system sadly doesn't show any error message in that case. A dropdown box (or simply more examples!) would have helped a lot!
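
One way to surface this kind of mistake would be to check the configured value against the list of installed models before running OCR. This is a rough sketch only: `check_languages` is a hypothetical helper, the list of available models is assumed to have been obtained elsewhere, and the suggestion comes from Python's `difflib`, not from anything Tesseract provides.

```python
import difflib


def check_languages(setting, available):
    """Return error messages for unknown models; an empty list means OK."""
    errors = []
    # A setting may combine several models, e.g. "deu+eng".
    for part in setting.split("+"):
        part = part.strip()
        if part in available:
            continue
        message = f"Unknown Tesseract language {part!r}"
        # Offer the closest installed model name as a hint, if there is one.
        hint = difflib.get_close_matches(part, available, n=1)
        if hint:
            message += f"; did you mean {hint[0]!r}?"
        errors.append(message)
    return errors


for error in check_languages("de", ["deu", "eng", "osd"]):
    print(error)
```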
