Error: Permission denied: 'tesseract' #2

Open
leandrodamascena opened this issue Aug 26, 2019 · 13 comments

@leandrodamascena

Hi. First of all, thank you for doing this project; it's very interesting to me. I need to OCR more than 6 TB of PDF files, and using an EC2 instance with only ocrmypdf (the Python project) is too slow and does not perform well.

I followed all the configuration steps explained in README.md, but I'm facing a tesseract error. I enabled X-Ray debugging and extracted the exact error. Below are some screenshots of my configuration. Thank you in advance, and let me know if you need more information.

Error in lambda console
image

Configuration of environment variables
image

My ROLE
image

X-RAY debug
image

@hdwatts
Collaborator

hdwatts commented Aug 26, 2019

Hi, thanks for submitting the issue.

Can you share an example of your event config? Also did you download the zip from the releases page or clone the source and compress it yourself?

@leandrodamascena
Author

Hi, it's working fine now. The problem was that the files in the bin folder are not executable by default. I downloaded the release on a Linux AMI instance and changed the permissions to make them executable. Thank you so much.
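For anyone hitting the same "Permission denied" error, the permission fix described above can be sketched in Python. This is only an illustration: `make_executable` is a hypothetical helper, not part of this project, and it assumes the package has been unpacked locally so that the `bin` folder is an ordinary directory.

```python
import os
import stat
from pathlib import Path


def make_executable(bin_dir):
    """Add execute bits to every regular file in bin_dir.

    Returns the names of the files whose mode was changed.
    """
    changed = []
    for path in sorted(Path(bin_dir).iterdir()):
        if not path.is_file():
            continue
        mode = path.stat().st_mode
        # Grant execute permission to user, group and others.
        new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
        if new_mode != mode:
            os.chmod(path, new_mode)
            changed.append(path.name)
    return changed
```

Calling `make_executable("bin")` before recompressing should have the same effect as running `chmod +x` on each file. The archive then needs to be rebuilt with a tool that keeps Unix permissions, otherwise the change is lost again.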

The remaining problem is that my PDF file is in Brazilian Portuguese and the OCR result was not very good. I'll read the code and try to implement multi-language support.

@hdwatts
Collaborator

hdwatts commented Aug 26, 2019

Thanks, I'll look into that! I want to note that I just added a release that includes multi-threading. This leads to about a 4x increase in speed (since it runs OCR on 4 pages at a time).

Let me know if downloading the Portuguese .traineddata file from here, placing it into the tessdata folder, and recompressing works.

You will probably have to delete the eng.traineddata file to remain within AWS function size limits.

@leandrodamascena
Author

I downloaded the new version and some new problems are occurring.

1 - Size limit exceeded: the unzipped size of the file exceeds the Lambda limit (262144000 bytes). I removed the file "tessdata/osd.traineddata" and it worked fine. The execution time is better than with the previous version. I'm concerned about the removed file, though; I don't know what the consequences are for daily use.
image

2 - When I add por.traineddata and remove the English file, it doesn't work. The system always expects the English language by default.
image

image
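The size problem in point 1 can be checked locally before uploading. Below is a minimal sketch using only the standard library; the function names are made up for illustration, and the limit constant is simply the figure quoted in the error above.

```python
import zipfile

# Unzipped size limit in bytes, as quoted in the Lambda error above.
LAMBDA_UNZIPPED_LIMIT = 262144000


def unzipped_size(zip_path):
    """Return the total uncompressed size of all entries in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def largest_entries(zip_path, count=5):
    """Return (name, size) pairs for the biggest entries, largest first."""
    with zipfile.ZipFile(zip_path) as archive:
        entries = sorted(
            archive.infolist(), key=lambda info: info.file_size, reverse=True
        )
        return [(info.filename, info.file_size) for info in entries[:count]]
```

Comparing `unzipped_size("lambda-OCRmyPDF.zip")` with `LAMBDA_UNZIPPED_LIMIT` shows whether the package fits, and `largest_entries` points to the files (such as the `.traineddata` models) that take the most room.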

@hdwatts
Collaborator

hdwatts commented Aug 26, 2019

Apologies, I included some unnecessary files in the .zip, leading to it being too big. This is fixed if you re-download the release.

In _validations.py on line 61 OCRmyPDF appears to default to eng. I believe if you modify the ocrmypdf call on line 30 in apply-ocr-to-s3-object.py to be something like:
ocrmypdf.ocr(inputname, outputname, pages=pages, force_ocr=True, lambda_safe=True, language=['por']) then it should skip the validation for eng and look for por.

As for osd.traineddata, that is Orientation and script detection data, so definitely something that is useful depending on your input. I would keep it in. More information here: https://ai.google/research/pubs/pub35506

@leandrodamascena
Author

Still not working. Things that I tried:

1 - Opened _validation.py, set "por" directly as the default language, and deleted "eng.traineddata". I got the same language error.
image

2 - Tried keeping "eng.traineddata" in the folder and removing the "dist-info" directories inside the python directory, but the Lambda size limit was exceeded.

3 - Are you sure that you deleted some files from the repo and committed? I didn't see that commit.

Thank you man.

@hdwatts
Collaborator

hdwatts commented Aug 26, 2019

So I didn't make a commit to remove anything from the repo. I only updated the lambda-OCRmyPDF.zip file in the releases section.

I have done a bunch of tinkering and found that the issues stem from the way tesseract is being called for some utility functions. Even just to print parameters it requires eng.traineddata by default!!!
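Since a missing model only shows up at run time, it may help to check the tessdata folder before packaging. The sketch below is an assumption-laden illustration: `missing_traineddata` is a hypothetical helper, and treating `eng` as always required only reflects the behaviour observed in this thread, not a documented rule.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages, always_required=("eng",)):
    """Return the language codes that have no <code>.traineddata file.

    always_required defaults to ("eng",) because, as observed in this
    thread, tesseract asked for eng.traineddata even for utility calls.
    """
    tessdata = Path(tessdata_dir)
    # Keep order and drop duplicates, e.g. when "eng" is also requested.
    wanted = list(dict.fromkeys(list(always_required) + list(languages)))
    return [
        code for code in wanted
        if not (tessdata / f"{code}.traineddata").is_file()
    ]
```

For a build that has been hardcoded to another language, `always_required=()` can be passed so that only the requested models are checked.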

I have created a hardcoded por language zip file for you. It can be found here: https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip

Note that the event must have a language='por' param, as shown below:

{
  "pages": "1",
  "awsRegion": "us-east-1",
  "language": "por",
  "s3": {
    "bucket": {
      "name": "[BUCKETNAME]"
    },
    "object": {
      "key": "input.pdf"
    }
  }
}
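As a rough sketch of how a handler could read such an event, assuming the field names shown above: `ocr_options_from_event` is a hypothetical helper and not the code in this repository, and splitting on "+" (the separator tesseract uses for several languages) is an assumption about how multi-language support might look.

```python
def ocr_options_from_event(event, default_language="eng"):
    """Collect the values an OCR call needs from a Lambda event dict."""
    raw = event.get("language") or default_language
    # Tesseract joins several languages with "+", e.g. "por+eng".
    languages = [code.strip() for code in raw.split("+") if code.strip()]
    return {
        "bucket": event["s3"]["bucket"]["name"],
        "key": event["s3"]["object"]["key"],
        "pages": event.get("pages"),  # None when the event has no "pages"
        "language": languages,
    }
```

With the event above this would give `language == ["por"]`, which matches the list form used in the `ocrmypdf.ocr(..., language=['por'])` call suggested earlier in the thread.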

Note: I have no idea if this works on Portuguese files, please let me know. I have tested on a basic input.pdf file and the lambda function completes without issue, but do not know if the OCR actually works.

If it does work please let me know and I will work on an official multilanguage support release.

@leandrodamascena
Author

It now works nicely with the language set in the configuration, but I'm still facing a problem with Portuguese: the OCR is not recognizing Portuguese words.

Do you have an ephemeral container or another environment for testing this code standalone? I could configure it here and test different scenarios.

Thank you, it's a really nice project!

@hdwatts
Collaborator

hdwatts commented Aug 27, 2019

That is probably because the por.traineddata comes from the TessData Fast repository instead of the normal TessData, to save space so that it fits in Lambda.

Is it recognizing any Portuguese words or none at all?

And while I do not have an ephemeral container, the library itself has a docker container here: https://ocrmypdf.readthedocs.io/en/latest/docker.html

@leandrodamascena
Author

I have already downloaded the full-size TessData, not the "compressed" one.

No, it isn't recognizing any words.

Do you want the original PDF that I'm testing with? I can send it to your email.

@hdwatts
Collaborator

hdwatts commented Aug 27, 2019

Sure - send it to the email linked in my github profile.

@harnit-bakshi

Any update on this? Did it work for POR?

@krzischp

Hi, do you have any update, please? Your link is broken:
https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip
