tesseract OCR only takes pdf files as input #46

Closed
patdunlavey opened this issue Mar 8, 2022 · 11 comments

@patdunlavey
Contributor

In OcrPostProcessor, where it builds the command to run tesseract, the command always emerges in the form:

{{ ghostscript command that takes the file and tries to generate a png from it }} && {{ tesseract command that uses the png as input }}

I.e. it only works with PDF files as input! Any raster file (TIFF, JPEG, etc.) results in no OCR being generated.

It should first check the input file to see if tesseract can run on it directly, and if not, then test if ghostscript can convert it into a file that tesseract can run on.
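
A minimal sketch of that check, in Python rather than the module's own PHP, purely to illustrate the branching. The function name `build_ocr_command`, the two extension sets, and the choice to dispatch on file extension are all assumptions for illustration; the real processor may well dispatch on MIME type and use different ghostscript options.

```python
import shlex
from pathlib import Path

# Assumption: raster formats a typical Tesseract build reads directly.
TESSERACT_READABLE = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
# Assumption: formats we hand to ghostscript for rasterizing first.
GHOSTSCRIPT_READABLE = {".pdf", ".ps", ".eps"}


def build_ocr_command(source: str, out_base: str, page: int = 1) -> str:
    """Return a shell command that writes hOCR for `source` to `out_base`.hocr."""
    ext = Path(source).suffix.lower()
    src = shlex.quote(source)
    out = shlex.quote(out_base)

    if ext in TESSERACT_READABLE:
        # Tesseract can read the image as-is: no ghostscript step.
        return f"tesseract {src} {out} hocr"

    if ext in GHOSTSCRIPT_READABLE:
        # Rasterize one page to PNG, then run Tesseract on that PNG.
        png = shlex.quote(f"{out_base}.png")
        gs = (
            "gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=png16m -r300 "
            f"-dFirstPage={page} -dLastPage={page} -sOutputFile={png} {src}"
        )
        return f"{gs} && tesseract {png} {out} hocr"

    raise ValueError(f"No OCR path for file type: {ext or '(none)'}")
```

With this shape, a TIFF yields a bare `tesseract` command, while a PDF still yields the existing `gs ... && tesseract ...` form.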

@DiegoPino I'll take this on, it seems pretty easy -- that is unless I have misdiagnosed this problem!

@DiegoPino
Member

@patdunlavey yes, that particular change is very easy. But there is another blocker: defining the sequence ID and, in the context of a dynamic IIIF manifest, what that means...

Let me explain (which means I need your ideas/help)

  • With PDF, the original use case we built this for, the sequence order of pages is not ambiguous: we know exactly which page number each one is. But with images, knowing which TIFF is page 1 requires extra options in the processor. It depends on how many images there are inside a single ADO and on whether the object that holds them is part of a top-level one (multiple pages bound to a book, or a creative work series).

What Options?

  • Allow setting a JSON key as the source for the sequence. We use sequence_id as the default; it is an implicit default driven by Solr and the Views, so we need to document that if you change it you also need to change the Views that drive sequence order.
  • Also allow setting the internal (per-image) sequence order for multi-image objects.
  • Allow an option (maybe not even desired) to use a IIIF manifest (exposed endpoint) as the source for the sequence, in case your logic is unusual, uses a TOC, etc. This might not be needed IF we let the actual search/interaction logic of the viewers deal with that when hitting the Search.. (does that make sense?)
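
The first two options could be sketched roughly as follows (Python, for illustration only). The metadata shape (one dict per image file) and the helper names `resolve_sequence` and `order_images` are hypothetical; only the `sequence_id` key name and the fallback to page 1 come from this thread.

```python
from typing import Any, Iterable, List, Mapping


def resolve_sequence(
    file_info: Mapping[str, Any], key: str = "sequence_id", default: int = 1
) -> int:
    """Read the sequence number from a configurable JSON key.

    Falls back to `default` when the key is missing, not an integer,
    or smaller than 1.
    """
    raw = file_info.get(key, default)
    try:
        seq = int(raw)
    except (TypeError, ValueError):
        return default
    return seq if seq >= 1 else default


def order_images(
    files: Iterable[Mapping[str, Any]], key: str = "sequence_id"
) -> List[Mapping[str, Any]]:
    """Order image entries by resolved sequence; ties keep their input order."""
    return sorted(files, key=lambda f: resolve_sequence(f, key))
```

Note that entries without a usable key all collapse to the default, so a multi-image object with no sequence metadata would still be ambiguous; that case would need the per-image option or an error.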

Errors/bad design/improvements

  • Right now the ocr tag used in the Strawberry Flavor Solr docs' unique ID (remember, that mix of node UUID, file UUID, parent, plugin ID, etc.) is based on the unique plugin ID given. By default the current PDF one is ocr. If we create a new plugin instance to deal with TIFFs, etc., we want to allow another option to set that key, so that both the TIFFs and the PDF pages share the ocr key. Or make it fixed in this type of processor (that might remove flexibility in the future but would also give us "immediate" usability).
  • We need to check HOW IABookreader is interacting with, and passing via JS (we built this so we can), the right endpoint query to Archipelago, so that multi-TIFF objects OR compound/creative work series are searchable. This is very dependent on the originating IIIF manifest!

@DiegoPino
Member

DiegoPino commented Mar 8, 2022

@patdunlavey this here:

// Sets the default page to 1 if not passed.
needs to come from a setting in case we are not dealing with a PDF (e.g. the value of the sequence_id JSON key), and should be exposed in the config form, maybe even exposed ONLY if the source is as:image.

@patdunlavey
Contributor Author

@DiegoPino I tried to spitball some code in this PR.

@patdunlavey
Contributor Author

patdunlavey commented Mar 8, 2022

Update:

I made a couple of further corrections in the PR. In my testing, it appears to successfully generate and index OCR for single and multiple image file objects. It is not all perfect, though; some quick testing turned up these problems:

  1. Some of my test images are failing to index. Sample error (dumped to the console when I use drush queue:run): "msg":"Exception writing document id gg2me1-default_solr_index-strawberryfield_flavor_datasource/33:1:en:10123392-bafa-45aa-bd50-f9d9636ef6ed:ocr_single to the index; possible analysis error.".
  2. Global search only seems to find content from the first file.
  3. Not seeing search in the Book Reader for OCR'd content.

I checked that the queue entries include a sequence value that corresponds to the sequence number in the metadata, so that part seems to be working.

@DiegoPino
Member

DiegoPino commented Mar 9, 2022 via email

@patdunlavey
Contributor Author

patdunlavey commented Mar 10, 2022

I figured out that the reason some documents were refusing to index was because, as a result of having branched from main rather than 0.3.0, I did not have a fix for this. I merged 0.3.0 into my issue branch, and changed the pull request to target 0.3.0. So that resolved that issue.

Two other issues remain: the fact that global search only discovers content in one of the OCR'd and indexed files on a multi-file object; and the bookreader does not show multi-file books as searchable.

I suspect that the first problem - that text in only one of the OCR'd files is found in global search - is related to something that I see in the solr indexed data. All of the strawberryfield_flavor_datasource records are showing "1" as the sequence_id. This is despite the fact that I'm pretty sure the files are getting proper sequence numbers going into tesseract. If you have any ideas about this, let me know @DiegoPino .
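
One way to confirm that symptom is to parse the Solr item IDs and look for files that share a sequence number. The sketch below is Python for illustration. It assumes, judging only from the sample ID in the error message earlier in this thread, that the part after the last `/` is `node:sequence:langcode:file-uuid:plugin-id`; that field order is a guess, not a documented format, and the names `FlavorId`, `parse_flavor_id`, and `files_sharing_sequence` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FlavorId(NamedTuple):
    node_id: str
    sequence: int
    langcode: str
    file_uuid: str
    plugin_id: str


def parse_flavor_id(item_id: str) -> FlavorId:
    """Split the tail of a Solr item ID into its assumed five fields."""
    tail = item_id.rsplit("/", 1)[-1]
    parts = tail.split(":")
    if len(parts) != 5:
        raise ValueError(f"Unexpected item ID shape: {item_id!r}")
    node_id, seq, langcode, file_uuid, plugin_id = parts
    return FlavorId(node_id, int(seq), langcode, file_uuid, plugin_id)


def files_sharing_sequence(
    item_ids: Iterable[str],
) -> Dict[Tuple[str, str, int], List[str]]:
    """Return (node, plugin, sequence) groups that contain more than one file."""
    seen: Dict[Tuple[str, str, int], set] = defaultdict(set)
    for item_id in item_ids:
        fid = parse_flavor_id(item_id)
        seen[(fid.node_id, fid.plugin_id, fid.sequence)].add(fid.file_uuid)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}
```

Any non-empty result means several files on the same node were indexed under the same sequence, which would match every record showing "1".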

The second problem - that the bookreader doesn't show the multi-file book content as searchable - seems like it could be related to the first. Another possibility is that I am using a separate strawberry runner for non-paged files and maybe that's confusing things.

@patdunlavey
Contributor Author

@DiegoPino I pushed up some more work on this that I think gets the sequence ID working pretty well. The biggest part I'm not sure about is if I may be screwing up other processors that do not use sequence id as their input_argument.

I was wrong in my complaint that global search doesn't find content in some OCR'd files. In fact global search doesn't find any OCR content! I now understand that and why it is so (it seems like maybe adding a relation in the view to strawberry flavor datasources could let us have a search that finds nodes whose associated strawberryflavor datasource entities contain the search string?).

The absence of search within the bookreader remains a problem, but I'm thinking that may be a separate issue? Do you have any enlightenment to provide on that, @DiegoPino? Might it depend on the second item listed on this issue, "Add for each Page (no collapsed data) an extra location of HOCR URL"?

@DiegoPino
Member

DiegoPino commented Mar 10, 2022 via email

@patdunlavey
Contributor Author

Hi Diego,

I am done for now, pending your feedback. I won't be able to do much today (my son is home from his day program), but will try to address anything you put back to me as quickly as I can. If you have thoughts for how to isolate the necessary code changes to the OCR processor, I'm definitely all ears, but since the sequence ID is provided to the OCR processor, and utilized outside of it, I don't see how that's possible.

Derek pointed out that search_pages view to me, which I had forgotten about. But I'm not sure what you mean about a "deal" that implies that global fulltext search is only interested in object metadata. As a user, when I search for a word, I think my expectation is that I'm doing a content and metadata search. But that's not the subject of this issue, so let's not discuss it further here. I'm sorry I brought it into this discussion.

I'm glad to hear that bookreader will not be hard to solve. Will you provide specific direction for that?

Thanks!!

@DiegoPino
Member

No worries at all about response speed. It will take me a lot of testing/debugging and code comparison to do a proper review. I will of course help/code to make the bookreader work.

re: search. You can (in your own institution) mix and match results in a single View. The issue I see is that by default (and we can maybe work on that?) Strawberry Runners have NO View modes. They do not even exist. So you have to depend on fields to display, which means your global search View would need to be tuned. That is all. What a user expects or does not expect is very domain driven, and tbh most users will expect what they are used to, which does not always mean you cannot provide a different alternative/perspective. Not a criticism, just a small statement about expectations.

hugs and good luck today

@DiegoPino
Member

Resolved
