Indexer efficiency improvements and fixes for side-effects of #3006 #3077

Merged


andrew-morrison
Contributor

@andrew-morrison andrew-morrison commented Nov 13, 2023

Description

I have observed the following side-effects of #3006:

  • Archival objects match keywords in their parent resource when searching in the PUI. That is because the PUI indexer retrieves ancestors in order to do inheritance of selected fields, but all the fields in those ancestors are now being included in the new “fullrecord_published” index.
  • Indexing is slower, because each record is being scanned for fields to include in the new “fullrecord_published” and “notes_published” indexes. It only adds hundredths of a second per record, but for very large institutions that could add hours to a full re-index.

This pull request contains a possible approach to fixing these issues:

  • Change the extract_string_values method to extract both published and unpublished strings at the same time.
  • Off-load the merging of published and unpublished text in the "fullrecord" and "notes" fields to Solr, using copyField (which requires the fields to be changed to multi-valued).
  • Do not call build_fullrecord twice for the PUI indexer. It is now called only once, from the hook defined in the PUIIndexer class, after the ancestor records of archival objects have been deleted.
  • Delete the code in build_fullrecord that adds finding_aid_subtitle, finding_aid_author, and agent names, which no longer appears to be necessary (those fields are already included in the "fullrecord" index).
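
To illustrate the first point, here is a minimal sketch of sorting strings into published and unpublished buckets in a single pass over a record. It is not the code in this pull request: the function name `extract_strings`, the sample record, and the rule that `'publish' => false` hides everything nested beneath it are all assumptions made for illustration.

```ruby
# Hypothetical sketch, not the actual extract_string_values implementation.
# Walks a nested record once and puts every string into one of two buckets.
def extract_strings(value, published = true, out = { published: [], unpublished: [] })
  case value
  when Hash
    # Assumption: a hash with 'publish' => false hides itself and all of its children.
    published &&= value['publish'] != false
    value.each_value { |child| extract_strings(child, published, out) }
  when Array
    value.each { |child| extract_strings(child, published, out) }
  when String
    out[published ? :published : :unpublished] << value
  end
  out
end

# Made-up sample record with one published and one unpublished note.
record = {
  'title' => 'Family papers',
  'publish' => true,
  'notes' => [
    { 'publish' => true,  'content' => 'Open to researchers' },
    { 'publish' => false, 'content' => 'Donor address on file' }
  ]
}

result = extract_strings(record)
```

With the strings separated like this, the indexer would only need to write each bucket to its own field, and a Solr copyField rule could combine them into the field that covers both.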

Related JIRA Ticket or GitHub Issue

https://archivesspace.atlassian.net/browse/ANW-261

How Has This Been Tested?

I set the stored attribute on the modified fields in solr/schema.xml and re-indexed with test data containing a mixture of published and unpublished notes and records. All the unpublished text was restricted to "fullrecord" and "notes", and did not appear in "fullrecord_published" or "notes_published". Also, "fullrecord_published" no longer gets text from ancestors. Finally, the time required for a full re-index is back to a level comparable to what it was before the changes in #3006.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have read the CONTRIBUTING document.
  • I have authority to submit this code.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Collaborator

@donaldjosephsmith donaldjosephsmith left a comment


I took a good look at this and I think I got my head around it. The detailed description was much appreciated. Good thinking on taking advantage of the copyField and multivalued field stuff in Solr.

@donaldjosephsmith donaldjosephsmith dismissed their stale review November 14, 2023 23:30

frontend test failures are real in this case

@donaldjosephsmith donaldjosephsmith merged commit 5a172d9 into archivesspace:master Nov 20, 2023
12 checks passed
@cdibella cdibella added this to the 3.5.0 milestone Dec 8, 2023