Issue with Japanese and numbers (digits) #973

wallace11 · 2021-07-31T12:04:42Z

Hi there,

Following up #948 and #962, I tested a couple of Japanese documents and the whole process went flawlessly.

To my surprise, the only problem I had is with numbers.
For some reason, Arabic numerals (ordinary digits) are converted to circled numbers.
Besides being incorrect, this messes up the date recognition because 2016年10月25日 is being recognized as ⑳①⑥ 年 ①0 月 ②⑤ 日 (weird, right?)
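
As a stopgap on already-extracted text, the circled digits can be folded back to plain digits with Unicode NFKC normalization. This is only a post-processing sketch, not something Docspell does and not a fix for the OCR itself; the function name `normalize_circled_digits` is made up for illustration.

```python
import unicodedata


def normalize_circled_digits(text: str) -> str:
    """Fold compatibility characters such as ① or ⑳ to plain digits.

    Assumption: lossy NFKC folding is acceptable for the extracted text.
    NFKC also rewrites other compatibility characters (full-width
    letters, half-width katakana, ...), not only circled numbers.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # The misrecognized date from this report.
    print(normalize_circled_digits("⑳①⑥ 年 ①0 月 ②⑤ 日"))  # 2016 年 10 月 25 日
```

Note that this also flattens circled numbers that really are circled in the document, so it only suits cases where that distinction does not matter.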

I tried to find a solution and came across this issue in the tessdata repo, which explains the problem in more detail and has a potential solution (I'm not sure what exactly they were talking about).
tesseract-ocr/tessdata#119

I wanted to find a "good" paper for sharing here for people to test on, but all the "good" documents I've got contain personal information, etc. so I just used the back of a movie ticket that I had lying around. You'll notice all the numbers at the bottom become circled after OCR.

z-20210731-145217.pdf

Thanks!

eikek · 2021-08-02T07:37:01Z

Thanks for reporting. I'll look into it later. Are you using the docker images?

wallace11 · 2021-08-04T20:33:11Z

@eikek Sorry for the late response. Yes, I'm using Docker.

eikek · 2021-08-06T20:50:06Z

No worries! I'm off doing something serious for a week or two anyway :-). Thanks for the research and the test paper (this is really helpful!). It seems we need to download a different training set for Japanese. I hope this fixes the problem; otherwise, I'm quite lost.

eikek · 2021-08-13T15:04:00Z

Hi @wallace11, I changed the joex docker image by adding the other training data. It will be available shortly via the nightly tag. It would be great if you could test it a little. I can confirm that your test document no longer shows the circled numbers, but I cannot tell whether it is better or worse than before in other areas.

wallace11 · 2021-08-14T21:35:34Z

@eikek
It's working!
It can even detect ACTUAL circled numbers and differentiate them from normal numbers, or from numbers written like this, for example: 1).
Date detection also works!

The only issue is that it tends to insert unnecessary spaces (in Japanese you rarely need spaces). I'm guessing this is the best we can achieve right now, which is about 95% of the way there in terms of accuracy and usability. For me it's perfect.
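
If the extra spaces get in the way, they could be stripped from the extracted text afterwards. The following is a rough sketch under my own assumptions, not part of Docspell: it removes a run of spaces only when a hiragana, katakana, or common kanji character sits directly on one side, so spaces between Latin words or between plain numbers are left alone. The helper name `drop_japanese_spaces` and the character ranges are illustrative.

```python
import re

# Assumption: hiragana, katakana and the basic CJK ideograph block are
# enough to recognize "Japanese" characters for this purpose.
_JP = r"\u3040-\u30ff\u4e00-\u9fff"

# A run of ASCII or ideographic spaces that directly follows or directly
# precedes a Japanese character.
_SPACES = re.compile(
    rf"(?<=[{_JP}])[ \u3000]+(?=\S)|(?<=\S)[ \u3000]+(?=[{_JP}])"
)


def drop_japanese_spaces(text: str) -> str:
    """Remove spaces adjacent to Japanese characters; keep all others."""
    return _SPACES.sub("", text)


if __name__ == "__main__":
    print(drop_japanese_spaces("2016 年 10 月 25 日"))  # 2016年10月25日
    print(drop_japanese_spaces("hello world"))          # hello world
```

This is deliberately conservative, and it will still remove a space that was intentional in the original, for example between a Japanese word and a Latin one.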

Thank you!

eikek · 2021-08-15T00:04:27Z

Hi @wallace11, thank you, great to hear!

eikek added the bug (Something isn't working or in unexpected ways) and joex (affects the joex component) labels Aug 2, 2021

eikek added this to the Docspell 0.26.0 milestone Aug 6, 2021

eikek added a commit that referenced this issue Aug 13, 2021

Use different japanese train files for tesseract

326cf1c

They seem to work better as suggested here: tesseract-ocr/tessdata#119 Refs: #973

eikek mentioned this issue Aug 13, 2021

Use different japanese train files for tesseract #1005

Merged

eikek added the docker (All things regarding docker setup) label and removed the joex (affects the joex component) label Aug 13, 2021

wallace11 closed this as completed Aug 14, 2021
