ISSUE-234: Make Flavor search aware of CWS/Children based OCR #235

Merged (7 commits) on Nov 7, 2022

DiegoPino
Member

Still WIP, don't even test yet.

Does not yet ensure the right PAGE ID for IAB
More working/yet uncommitted work happening
@DiegoPino DiegoPino self-assigned this Oct 21, 2022
@DiegoPino DiegoPino added this to the 1.1.0 milestone Oct 21, 2022
@DiegoPino DiegoPino added the enhancement (New feature or request), Typed Data and Search, and Strawberry Flavor Post Processing (data extracted that goes into Solr) labels Oct 21, 2022
The OCR returns were overwriting previous results!
@patdunlavey @giancarlobi this should now match more things on a Book search.
// Still incomplete and needs more checking
(we need to be sure the new Solr Fields are always present)

One note:
If the OCR contains "Queen;", a search for "Queen" won't match. I wonder whether we need to either add spacing in the OCR itself so this can match, or alter the HOCR processor/field type to also tokenize on ",", "-" and ";" (Solr config).
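To illustrate the tokenization concern in the note above, here is a minimal Python sketch. It is not Solr's tokenizer and not the module's code; the function names and the sample OCR line are made up for the example. It only shows why splitting on whitespace alone leaves "queen;" as a token that a search for "queen" cannot match, while also splitting on ",", "-" and ";" would.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to the word."""
    return text.lower().split()


def punctuation_aware_tokens(text):
    """Split on whitespace and also on ',', '-' and ';'."""
    return [t for t in re.split(r"[\s,;\-]+", text.lower()) if t]


# Hypothetical OCR line, for illustration only.
ocr_line = "The Queen; the King, and a pumpkin-pie"

print("queen" in whitespace_tokens(ocr_line))         # False
print("queen" in punctuation_aware_tokens(ocr_line))  # True
```

In a real setup the equivalent change would presumably live in the Solr field type's analyzer chain rather than in application code.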
This allows any change on a parent (meaning the attached ADO) to trigger a refresh. This is needed so that changes in titles, sequence_ids or any other arbitrary metadata permeate into an SBF document.
Basic CWS/parent/child search using the parent sequence_id (meaning ordering based on the children ADOs).
@DiegoPino
Member Author

@patdunlavey this code solves about 85% of the use cases. I will add a guide with screenshots tomorrow, but the code as it is should be ready. The other part goes into format_strawberryfield, since metadata displays are defined there and I cannot introduce a bidirectional dependency between modules.

@DiegoPino DiegoPino changed the title ISSUE-234 ISSUE-234: Make Flavor search aware of CWS/Children based OCR Oct 31, 2022
@DiegoPino
Member Author

@patdunlavey, @alliomeria and @karomabiles (if you want to see this working), here are the instructions for testing OCR for compounds:

  • Get this branch (docker exec -ti esmero-php bash -c "composer require 'strawberryfield/strawberryfield:dev-issue-234 as dev-1.0.0'").
  • Clear caches.
  • Add the following fields to your Strawberry Runners Data Sources at /admin/config/search/search-api/index/default_solr_index/fields
    [screenshot]
  • It is important that both the "Machine names" and the "Property paths" match those shown; if in doubt, let me know.
  • Make sure that your Strawberry Runners HOCR processors (the pager and the OCR) process image/tiff, image/jpeg and application/pdf, and also target "painting" (all this is for the example).
  • Create a new top-level ADO of type collection (a Creative Work Series one) with basic metadata.
  • Add 3 Digital Object ADOs, each with e.g. a single JPEG, using "Painting" as type, and make them children of the first (CWS) one:
    1. page1
    2. page2
    3. page3

Also make sure each one has "sequence_id" set, to 1, 2 and 3 respectively. If your webforms don't have that element/key (we need it), please add it or edit the raw JSON. Save.

Make sure the queue is processed (all the background ones that will generate OCR).

  • Now make sure you have a special View Mode for testing.

Mine is named:

[screenshot]

and has these settings:

[screenshot]

Basically, you want the IABookreader, but using the IIIF V3 CWS template as its source. Now apply that View Mode to the top object by editing it and forcing that Display Mode.
[screenshot]

You should not need to reindex at this stage (if you followed these steps for this demo object).

Search for "Queen", "Pumpkin" and "King". Each should be highlighted correctly on its own page. Now search for "OCR", which should match on multiple pages.

This covers the basic use case where all children have a sequence_id and all are shown in the Manifest.
Still working on the complex use case (a setting) where the structure shown is different, maybe only odd pages, etc.

Please let me know if you have issues/questions/needs.

@patdunlavey
Collaborator

@DiegoPino starting to look at this now (sorry for the delay!!!)

We already cast to (int) later on. I can of course be EVEN more thorough, but is_int causes webform-generated NODE IDs to be skipped. BAD!
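As a rough analogy for the commit note above, the sketch below is in Python rather than the module's PHP. It assumes that webform-generated node IDs arrive as digit-only strings, which is what the note implies but does not state. The helper name `to_node_id` and the sample values are hypothetical. A strict type check drops those string IDs, while a lenient check followed by a cast keeps them.

```python
def to_node_id(value):
    """Return value as a positive int node ID, or None if it is not one.

    Accepts real ints and ASCII digit-only strings such as "34".
    """
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value if value > 0 else None
    if isinstance(value, str):
        stripped = value.strip()
        if stripped.isascii() and stripped.isdigit():
            node_id = int(stripped)
            return node_id if node_id > 0 else None
    return None


# Hypothetical mix of IDs: one real int, two numeric strings, two invalid values.
raw_ids = [12, "34", " 56 ", "abc", None]

strict = [v for v in raw_ids if isinstance(v, int)]
lenient = [to_node_id(v) for v in raw_ids if to_node_id(v) is not None]

print(strict)   # [12]
print(lenient)  # [12, 34, 56]
```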
@patdunlavey
Collaborator

@DiegoPino I was able to reproduce your steps, and your result! The only problem I noticed is that I don't get the pins in the result bar. I suspect that's due to me not being fully caught up to changes in the IIIF Presentation API 3 Creative Works Series Manifest.

I tested what happens when I add a second image file to one of the child objects. It seems to OCR correctly, but it is saved in the key_value table with the sequence number of "1", rather than that found at "as:image".*.sequence. As a result, when I display in the bookviewer, I get the additional page, but highlighting is off. In this case, I added your sample image file to a page in this object, and though it searches successfully for it (the word "queen" in this example), it highlights on the wrong page:
[screenshot]

Not sure if this is a simple problem to solve (and whether it's in the 15% you referred to!).

@patdunlavey
Collaborator

Looking here, it seems like the sequence number should be correct. Not sure why it isn't!

@DiegoPino
Member Author

@patdunlavey adding a new page and having key_value = 1 is OK. I wonder if you added the "sequence_id" JSON key to your new page/ADO?

@DiegoPino
Member Author

The actual page matching here depends on having a sequence_id at the child ADO level. Without it, the Manifest is going to show pages in any order, and that won't match the order of the response (and the relative new ordering of the search results) that happens here now. The re-paging of the results happens here:

if (isset($allfields_translated_to_solr['parent_sequence_id']) &&

So if your ADO (the one that produced the HOCR) has no sequence_id, it will return 1 and thus offset everything. Your new page should now have sequence_id = 4 (in the JSON).
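A minimal Python sketch of that behaviour, under stated assumptions: it is not the module's actual re-paging code, and the hit structure and the `page_for_hit` helper are hypothetical. It only shows how a child saved without a sequence_id falls back to 1 and ends up on the same page position as the real first page.

```python
def page_for_hit(hit):
    """Page position of a hit, taken from its child's sequence_id (default 1)."""
    return int(hit.get("sequence_id", 1))


# Hypothetical search hits, one per child ADO.
hits = [
    {"ado": "page1", "sequence_id": 1, "term": "Queen"},
    {"ado": "page3", "sequence_id": 3, "term": "King"},
    {"ado": "page4", "term": "Queen"},  # new child saved without sequence_id
]

for hit in sorted(hits, key=page_for_hit):
    print(hit["ado"], "->", page_for_hit(hit))
# page1 -> 1
# page4 -> 1
# page3 -> 3
```

With "sequence_id": 4 added to the last entry, the same loop would place page4 after page3, and its highlight would no longer collide with page 1.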

@DiegoPino
Member Author

Also, the lack of pins in the result bar is strange. Are you using this on top of custom code? E.g., have you started modifying any other part of Archipelago already? Weird, because on a fresh 1.0.0 I do see the pins... maybe we need to have a call!

@DiegoPino
Member Author

@patdunlavey I will merge; shall we open a new pull/issue for troubleshooting? There is more work to be done on SBFlavors for sure, and I can add any corrections to a new pull.

@DiegoPino DiegoPino merged commit 911f8cb into 1.1.0 Nov 7, 2022
@patdunlavey
Collaborator

Sorry, I meant to get the results of my investigation in earlier! I'll make a new ticket for the multi-file sequencing issue.
