Error: Permission denied: 'tesseract' #2

leandrodamascena · 2019-08-26T13:57:21Z

Hi man. First of all, thank you for doing this project; it's very interesting to me. I need to OCR more than 6 TB of PDF files, and using an EC2 instance with only ocrmypdf (the Python project) is too slow and doesn't perform well.

I applied all the configuration that you explain in README.md, but I'm facing a tesseract error. I enabled X-Ray debugging and extracted the exact error. Below I'm sending some screenshots of my configuration. Thank you in advance, and let me know if you need more information.

Error in lambda console

Configuration of environment variables

My ROLE

X-RAY debug

hdwatts · 2019-08-26T14:36:37Z

Hi, thanks for submitting the issue.

Can you share an example of your event config? Also did you download the zip from the releases page or clone the source and compress it yourself?

leandrodamascena · 2019-08-26T14:38:05Z

Hi, it's working fine now. The problem was that the files in the bin folder are not executable by default. I downloaded the zip on a Linux AMI instance and changed the permissions to make them executable. Thank you so much.

The remaining problem is that my PDF file is in Portuguese (Brazil) and the OCR result was not very good. I'll read the code and try to implement multi-language support.
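
For anyone hitting the same `Permission denied: 'tesseract'` error, the fix described above amounts to setting the execute bit on the files in the bin folder before re-zipping. A minimal sketch in Python, assuming the binaries sit directly in a folder named `bin`; the helper name `make_executable` is made up for illustration and is not part of the project:

```python
import os
import stat
from pathlib import Path


def make_executable(bin_dir):
    """Add execute bits to every regular file directly under bin_dir.

    Returns the names of the files whose mode was actually changed.
    """
    changed = []
    for path in sorted(Path(bin_dir).iterdir()):
        if not path.is_file():
            continue
        mode = path.stat().st_mode
        # Same effect as `chmod a+x`: owner, group, and others may execute.
        new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
        if new_mode != mode:
            os.chmod(path, new_mode)
            changed.append(path.name)
    return changed
```

Calling `make_executable("bin")` from the unpacked release directory has roughly the same effect as `chmod a+x bin/*`. The zip then has to be rebuilt with a tool that keeps Unix permissions, which is presumably why doing this on a Linux instance worked.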

hdwatts · 2019-08-26T15:11:31Z

Thanks, I'll look into that! I want to note that I just added a release that includes multi-threading. This leads to about a 4x increase in speed (since it runs OCR on 4 pages at a time).

Let me know if downloading the Portuguese .traineddata file from here, placing it into the tessdata folder, and recompressing works.

You will probably have to delete the eng.traineddata file to remain within AWS function size limits.

leandrodamascena · 2019-08-26T16:37:16Z

I downloaded the new version and some new problems are occurring:

1 - Size limit exceeded: the unzipped size of the file exceeds the Lambda limit (262144000 bytes). I removed the file "tessdata/osd.traineddata" and it worked fine. The execution time is better than with the previous version. I'm concerned about the removed file, though; I really don't know the consequences of running without it day to day.

2 - When I add por.traineddata and remove the English file, it doesn't work. The system always expects the English language to be there by default.
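
A quick way to check a package against the size limit quoted in point 1 before uploading it is to sum the uncompressed sizes recorded in the zip. This is only a sketch: the function names are hypothetical, and the limit constant is simply the 262144000 bytes from the error message above.

```python
import zipfile

# Figure quoted in the Lambda error message above (bytes, unzipped).
LAMBDA_UNZIPPED_LIMIT = 262144000


def unzipped_size(zip_path):
    """Return the total uncompressed size, in bytes, of all entries in the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def largest_entries(zip_path, count=5):
    """Return (filename, size) for the `count` largest entries, biggest first."""
    with zipfile.ZipFile(zip_path) as archive:
        entries = sorted(archive.infolist(), key=lambda i: i.file_size, reverse=True)
    return [(info.filename, info.file_size) for info in entries[:count]]


def fits_in_lambda(zip_path, limit=LAMBDA_UNZIPPED_LIMIT):
    """True if the unzipped contents stay within the given limit."""
    return unzipped_size(zip_path) <= limit
```

`largest_entries` helps decide what to trim; in a package like this one the `.traineddata` files are likely candidates, but that is a guess until the numbers are checked.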

hdwatts · 2019-08-26T17:25:56Z

Apologies, I included some unnecessary files in the .zip, leading to it being too big. This is fixed if you re-download the release.

In _validations.py on line 61 OCRmyPDF appears to default to eng. I believe if you modify the ocrmypdf call on line 30 in apply-ocr-to-s3-object.py to be something like:
ocrmypdf.ocr(inputname, outputname, pages=pages, force_ocr=True, lambda_safe=True, language=['por'])
then it should skip the validation for eng and look for por.

As for osd.traineddata, that is Orientation and script detection data, so definitely something that is useful depending on your input. I would keep it in. More information here: https://ai.google/research/pubs/pub35506

leandrodamascena · 2019-08-26T18:44:13Z

Still not working. Things that I tried:

1 - Opened _validation.py, set "por" directly as the default language, and deleted "eng.traineddata". I got the same error about the language.

2 - Tried keeping "eng.traineddata" inside the folder and removing the "dist-info" directories inside the python directory, and the Lambda size limit was exceeded.

3 - Are you sure that you deleted some files from the repo and committed? I didn't see this commit.

Thank you man.

hdwatts · 2019-08-26T19:57:50Z

So I didn't make a commit to remove anything from the repo. I only updated the lambda-OCRmyPDF.zip file in the releases section.

I have done a bunch of tinkering and found that the issues stem from the way tesseract is being called for some utility functions. Even just to print parameters it requires eng.traineddata by default!!!
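
To see this kind of failure coming before deploying, one option is to compare the languages being requested with the `.traineddata` files actually present in the tessdata folder. The sketch below is illustrative only: the function names are hypothetical, and treating `eng` as always required is an assumption based on the behavior observed above, not a documented rule.

```python
from pathlib import Path

SUFFIX = ".traineddata"


def available_languages(tessdata_dir):
    """List the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(
        path.name[: -len(SUFFIX)] for path in Path(tessdata_dir).glob("*" + SUFFIX)
    )


def missing_languages(tessdata_dir, requested, always_required=("eng",)):
    """Return the needed language codes that have no .traineddata file.

    `always_required` models the observation that utility calls to tesseract
    look for eng.traineddata even when another language is requested.
    """
    present = set(available_languages(tessdata_dir))
    needed = list(requested) + [c for c in always_required if c not in requested]
    return [code for code in needed if code not in present]
```

With only `por.traineddata` and `osd.traineddata` in the folder, `missing_languages("tessdata", ["por"])` would report `eng` as missing, which matches the error seen when the English file was deleted.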

I have created a hardcoded por language zip file for you. It can be found here: https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip

Note that the event must have a language='por' param. As shown below:

{
  "pages": "1",
  "awsRegion": "us-east-1",
  "language": "por",
  "s3": {
    "bucket": {
      "name": "[BUCKETNAME]"
    },
    "object": {
      "key": "input.pdf"
    }
  }
}
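
When running this against many files, the event above can also be built in code rather than edited by hand. A minimal sketch, assuming the function expects exactly the fields shown; `build_event` is a hypothetical helper, and the default values are just the ones from the example:

```python
import json


def build_event(bucket, key, language="por", pages="1", region="us-east-1"):
    """Build a test event with the same shape as the example above."""
    return {
        "pages": pages,
        "awsRegion": region,
        "language": language,
        "s3": {
            "bucket": {"name": bucket},
            "object": {"key": key},
        },
    }


if __name__ == "__main__":
    # "my-bucket" is a placeholder; substitute the real bucket name.
    print(json.dumps(build_event("my-bucket", "input.pdf"), indent=2))
```

The printed JSON can be pasted into the Lambda console as a test event.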

Note: I have no idea if this works on Portuguese files, please let me know. I have tested on a basic input.pdf file and the lambda function completes without issue, but do not know if the OCR actually works.

If it does work please let me know and I will work on an official multilanguage support release.

leandrodamascena · 2019-08-27T13:11:44Z

It's now working nicely with the language as a configuration option. But I'm still facing a problem with Portuguese: the PDF OCR is not recognizing words in Portuguese.

Do you have an ephemeral container or another environment for testing this code standalone? I could set it up here and test different scenarios.

Thank you, it's a really nice project!

hdwatts · 2019-08-27T13:56:51Z

That is probably due to the por.traineddata coming from the TessData Fast repository instead of the normal TessData to save on space so it fits in Lambda.

Is it recognizing any Portuguese words or none at all?

And while I do not have an ephemeral container, the library itself has a docker container here: https://ocrmypdf.readthedocs.io/en/latest/docker.html

leandrodamascena · 2019-08-27T14:03:50Z

I've already downloaded the full-size TessData, not the "compressed" one.

No, it isn't recognizing any words.

Do you want the original PDF that I'm testing with? I can send it to your email.

hdwatts · 2019-08-27T14:35:10Z

Sure - send it to the email linked in my github profile.

harnit-bakshi · 2020-02-24T14:32:51Z

Any update on this? Did it work for POR?

krzischp · 2022-02-17T02:26:59Z

Hi, do you have any update, please? Your link is broken:
https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip
