Issue with Japanese and numbers (digits) #973

Closed
wallace11 opened this issue Jul 31, 2021 · 6 comments
Labels
bug (Something isn't working or in unexpected ways), docker (All things regarding docker setup)

Comments

@wallace11
Contributor

Hi there,

Following up #948 and #962, I tested a couple of Japanese documents and the whole process went flawlessly.

To my surprise, the only problem I had is with numbers.
For some reason, ordinary (Arabic) digits are converted to circled numbers.
Besides being incorrect, this messes up the date recognition because 2016年10月25日 is being recognized as ⑳①⑥ 年 ①0 月 ②⑤ 日 (weird, right?)
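
For illustration only, here is a minimal Python sketch of why the misread date still carries the right information: Unicode NFKC normalization folds circled numbers such as ① and ⑳ back into plain digits. This is a possible post-processing workaround, not something Docspell or tesseract is known to do; the function name `normalize_circled_digits` is hypothetical.

```python
import unicodedata


def normalize_circled_digits(text: str) -> str:
    """Fold compatibility characters (e.g. circled numbers) into plain ones.

    NFKC maps "①" to "1" and "⑳" to "20". It leaves kanji such as
    年, 月 and 日 untouched.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # The misrecognized date quoted in this report.
    ocr_output = "⑳①⑥ 年 ①0 月 ②⑤ 日"
    print(normalize_circled_digits(ocr_output))
```

Note that NFKC also rewrites other compatibility characters (full-width Latin letters, half-width katakana) and would erase genuine circled numbers in a document, so it is a blunt tool compared with fixing the OCR model itself.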

I tried to find a solution and came across this issue in the tessdata repo, which explains the issue in more detail and has a potential solution (not sure what they were talking about, exactly).
tesseract-ocr/tessdata#119

I wanted to find a "good" paper for sharing here for people to test on, but all the "good" documents I've got contain personal information, etc. so I just used the back of a movie ticket that I had lying around. You'll notice all the numbers at the bottom become circled after OCR.

z-20210731-145217.pdf

Thanks!

@eikek
Owner

eikek commented Aug 2, 2021

Thanks for reporting. I'll look into it later. Are you using the docker images?

@eikek added the bug and joex labels on Aug 2, 2021
@wallace11
Contributor Author

@eikek Sorry for the late response. Yes, I'm using Docker.

@eikek
Owner

eikek commented Aug 6, 2021

No worries! I'm off for doing something serious for a week or two anyways:-). Thanks for the research and the test paper (this is really helpful!), it seems we need to download a different training set for Japanese. Hope this fixes the problem. Otherwise, I'm quite lost.

@eikek eikek added this to the Docspell 0.26.0 milestone Aug 6, 2021
eikek added a commit that referenced this issue Aug 13, 2021
They seem to work better as suggested here:
tesseract-ocr/tessdata#119

Refs: #973
@eikek added the docker label and removed the joex label on Aug 13, 2021
@eikek
Owner

eikek commented Aug 13, 2021

Hi @wallace11, I changed the joex docker image by adding the other training data. It will be available shortly via the nightly tag. If you could test it out a little, that would be great. I can confirm that your test document no longer shows the circled numbers, but I cannot tell whether it is better or worse than before in other areas.

@wallace11
Contributor Author

@eikek
It's working!
It can even detect ACTUAL circled numbers and differentiate them from the normal numbers, or numbers that are written like this for example 1).
Date detection also works!

Only issue is that it tends to insert unnecessary spaces (in Japanese you rarely need to use spaces). I'm guessing it's the best we can achieve right now which is 95% there in terms of accuracy and usability - for me it's perfect.
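
For anyone bothered by the extra spaces, a rough post-processing sketch in Python could look like the following. It is only an illustration under stated assumptions: the function name `collapse_japanese_spaces` is hypothetical, the character ranges (hiragana, katakana, common CJK ideographs) are not exhaustive, and nothing here reflects how Docspell handles the extracted text.

```python
import re

# Assumed, non-exhaustive ranges: hiragana + katakana (U+3040-U+30FF)
# and common CJK ideographs (U+4E00-U+9FFF).
_JA = r"\u3040-\u30ff\u4e00-\u9fff"

# Horizontal whitespace that sits between two Japanese characters,
# or between a Japanese character and an ASCII digit.
_STRAY_SPACE = re.compile(
    rf"(?<=[{_JA}])[ \t\u3000]+(?=[{_JA}0-9])"
    rf"|(?<=[0-9])[ \t\u3000]+(?=[{_JA}])"
)


def collapse_japanese_spaces(text: str) -> str:
    """Remove spaces that OCR inserted inside Japanese text.

    Spaces between Latin words are left alone, and line breaks are kept.
    """
    return _STRAY_SPACE.sub("", text)


if __name__ == "__main__":
    print(collapse_japanese_spaces("2016 年 10 月 25 日"))
    print(collapse_japanese_spaces("movie ticket"))
```

The pattern deliberately ignores spaces next to Latin letters, since mixed Japanese/English text often does use a space there.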

Thank you!

@eikek
Owner

eikek commented Aug 15, 2021

Hi @wallace11, thank you, great to hear!


2 participants