tesseract OCR only takes pdf files as input #46
Comments
@patdunlavey yes, that particular change is very easy. But there is another blocker: defining the sequence ID, and what that means in the context of a Dynamic IIIF Manifest... Let me explain (meaning I need your ideas/help)
What Options?
Errors/bad design/improvements
@patdunlavey this here: strawberry_runners/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php Line 457 in dbbdcb6
sequence_id JSON key (the value) and should be exposed in the config form maybe even exposed ONLY if the source is as:image
@DiegoPino I tried to spitball some code in this PR.
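Update: I made a couple further corrections in the PR. In my testing, it appears to successfully generate and index OCR for single and multiple image file objects. Not all perfect in some quick testing:
1. Some of my test images are failing to index. Sample error (dumped to the console when I use drush queue:run): "msg":"Exception writing document id gg2me1-default_solr_index-strawberryfield_flavor_datasource/33:1:en:10123392-bafa-45aa-bd50-f9d9636ef6ed:ocr_single to the index; possible analysis error.".
2. Global search only seems to find content from the first file.
I checked that the queue entries include a sequence value that corresponds to the sequence number in the metadata, so that part seems to be working.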
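The kind of check described above (queue entries carrying the same sequence value as the metadata) can be sketched roughly as follows. This is an illustration in Python, not the module's PHP code. The metadata layout (a mapping of `as:image` entries, each with a `sequence` key) and the queue item fields `file_key` and `sequence_id` are assumptions made for the example.

```python
def build_ocr_queue_items(as_image):
    """Build one queue item per image, carrying that image's own sequence.

    `as_image` is a hypothetical dict of file key -> metadata dict.
    """
    items = []
    for position, (file_key, info) in enumerate(as_image.items(), start=1):
        # Fall back to the list position only when no sequence is recorded.
        sequence = int(info.get("sequence", position))
        items.append({"file_key": file_key, "sequence_id": sequence})
    # Process pages in reading order.
    return sorted(items, key=lambda item: item["sequence_id"])


def find_sequence_mismatches(as_image, queue_items):
    """Return (file_key, expected, actual) for items whose sequence drifted."""
    expected = {
        key: int(info["sequence"])
        for key, info in as_image.items()
        if "sequence" in info
    }
    mismatches = []
    for item in queue_items:
        key = item["file_key"]
        if key in expected and expected[key] != item["sequence_id"]:
            mismatches.append((key, expected[key], item["sequence_id"]))
    return mismatches
```

An empty list from `find_sequence_mismatches` would mean every queue entry kept its metadata sequence. If every item came back with the same value, that would point at the sequence being lost somewhere between the metadata and the queue.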
Hey! This is wonderful! 🥰 Will do a thorough review (a caring one) first hour in the morning. Thx so much!!!
On Tue, Mar 8, 2022 at 7:31 PM Pat Dunlavey ***@***.***> wrote:
Update:
I made a couple further corrections in the PR. In my testing, it appears to successfully generate and index OCR for single and multiple image file objects. Not all perfect in some quick testing:
1. Some of my test images are failing to index. Sample error (dumped to the console when I use drush queue:run): "msg":"Exception writing document id gg2me1-default_solr_index-strawberryfield_flavor_datasource/33:1:en:10123392-bafa-45aa-bd50-f9d9636ef6ed:ocr_single to the index; possible analysis error.".
2. Global search only seems to find content from the first file.
I checked that the queue entries include a sequence value that corresponds to the sequence number in the metadata, so that part seems to be working.
--
Diego Pino Navarro
Digital Repositories Developer
Metropolitan New York Library Council (METRO)
I figured out that some documents were refusing to index because, having branched from main rather than 0.3.0, I did not have a fix for this. I merged 0.3.0 into my issue branch and changed the pull request to target 0.3.0, which resolved that issue.

Two other issues remain: global search only discovers content in one of the OCR'd and indexed files on a multi-file object, and the bookreader does not show multi-file books as searchable.

I suspect that the first problem (text in only one of the OCR'd files is found in global search) is related to something that I see in the Solr indexed data. All of the strawberryfield_flavor_datasource records are showing "1" as the sequence_id. This is despite the fact that I'm pretty sure the files are getting proper sequence numbers going into tesseract. If you have any ideas about this, let me know @DiegoPino.

The second problem (the bookreader doesn't show the multi-file book content as searchable) seems like it could be related to the first. Another possibility is that I am using a separate strawberry runner for non-paged files and maybe that's confusing things.
@DiegoPino I pushed up some more work on this that I think gets the sequence ID working pretty well. The biggest part I'm not sure about is whether I may be screwing up other processors that do not use sequence id as their input_argument.

I was wrong in my complaint that global search doesn't find content in some OCR'd files. In fact global search doesn't find any OCR content! I now understand that, and why it is so (it seems like maybe adding a relation in the view to strawberry flavor datasources could let us have a search that finds nodes whose associated strawberryflavor datasource entities contain the search string?).

The absence of search within the bookreader remains a problem, but I'm thinking that may be a separate issue? Do you have any enlightenment to provide on that @DiegoPino? Might it depend on the second item listed on this issue, "Add for each Page (no collapsed data) an extra location of HOCR URL"?
Hi Pat,
I need to check that logic (how the id is passed around). It is probably the only thing in your pull request that breaks the idea that a processor should be self-sufficient (the deal), and it might break other processors. Will check everything once you tell me you are done (I was about to today, but then saw some code coming from you).
RC3 has a standard view for that: https://studio.archipelago.nyc/search_pages. Does that one not work? I mean, you could also display everything in a single view, but that would also break a “deal” (relevance of content search vs. metadata search).
Give me a little while, bit stumped with other code, but I will give you a few solutions. The book reader problem is really not big. It's mostly a naming convention for each page (so it should be as easy as documenting/adapting the twig templates for IIIF) so the Search knows where/how to find the values, but it might require some JS to be more robust (just maybe).
More tomorrow, thanks
Hi Diego, I am done for now, pending your feedback. I won't be able to do much today (my son is home from his day program), but will try to address anything you put back to me as quickly as I can.

If you have thoughts for how to isolate the necessary code changes to the OCR processor, I'm definitely all ears, but since the sequence ID is provided to the OCR processor, and utilized outside of it, I don't see how that's possible.

Derek pointed out that search_pages view to me, which I had forgotten about. But I'm not sure what you mean about a "deal" that implies that global fulltext search is only interested in object metadata. As a user, when I search for a word, I think my expectation is that I'm doing a content and metadata search. But that's not the subject of this issue, so let's not discuss it further here. I'm sorry I brought it into this discussion.

I'm glad to hear that bookreader will not be hard to solve. Will you provide specific direction for that? Thanks!!
No worries at all about response speed. It will take me a lot of testing/debugging and code comparison to have a proper review. Will of course help/code to make the bookreader work re: search.

You can (in your own institution) mix and match results in a single View. The issue I see is that by default (and we can maybe work on that?) Strawberry Runners have NO View modes. They do not even exist. So you have to depend on fields to display, which means your global search View would need to be tuned. That is all.

What a user expects or does not expect is very domain driven, and tbh most users will expect what they are used to, which does not always mean you cannot provide a different alternative/perspective. Not a criticism, just a small statement about expectations. Hugs and good luck today
Resolved
In OcrPostProcessor, where it builds the command to run tesseract, the command always emerges in the form:
I.e. it only works with PDF files as input! Any raster file (TIFF, JPEG, etc.) results in no OCR being generated.
It should first check the input file to see if tesseract can run on it directly, and if not, then test whether Ghostscript can convert it into a file that tesseract can run on.
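The proposed check could look roughly like the sketch below. It is written in Python for illustration, not in the module's PHP, and is not the actual OcrPostProcessor logic. The function name `plan_ocr_commands`, the two extension sets, the output naming, and the default resolution are all assumptions. The Ghostscript and tesseract switches used are standard ones, but which image formats tesseract reads depends on how it was built.

```python
from pathlib import Path

# Assumption: raster formats tesseract can usually read directly.
TESSERACT_READABLE = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp"}
# Assumption: formats Ghostscript should rasterize before OCR.
GHOSTSCRIPT_CONVERTIBLE = {".pdf", ".ps", ".eps"}


def plan_ocr_commands(source, workdir="/tmp", page=1, resolution=300):
    """Return the list of commands (as argument lists) needed to OCR `source`.

    An empty list means neither tool is expected to handle the file.
    """
    src = Path(source)
    ext = src.suffix.lower()
    out_base = str(Path(workdir) / src.stem)

    if ext in TESSERACT_READABLE:
        # Run tesseract directly on the raster file, producing hOCR.
        return [["tesseract", str(src), out_base, "hocr"]]

    if ext in GHOSTSCRIPT_CONVERTIBLE:
        # Rasterize the requested page first, then OCR the resulting image.
        raster = f"{out_base}_p{page}.png"
        ghostscript = [
            "gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pnggray",
            f"-r{resolution}", f"-dFirstPage={page}", f"-dLastPage={page}",
            f"-sOutputFile={raster}", str(src),
        ]
        return [ghostscript, ["tesseract", raster, out_base, "hocr"]]

    return []
```

Deciding by file extension keeps the sketch short. A real implementation would more likely decide from the file's MIME type, which the metadata may already record.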
@DiegoPino I'll take this on, it seems pretty easy -- that is unless I have misdiagnosed this problem!