Problem: substring searches via Elasticsearch often fail to bring up files and SIPs #144

Closed
sevein opened this issue Sep 6, 2018 · 3 comments

Comments

@sevein
Contributor

sevein commented Sep 6, 2018

Misty originally raised this issue and suggested a solution here: artefactual/archivematica#364. I closed that one because it's not being actively developed by the original author. We may want to re-open it or submit a new PR instead.

Misty explained the issue better than I could:

The default tokenization used by Elasticsearch doesn't deal well with the ways that path strings are constructed; substring searches often fail to bring up files and SIPs which would be expected by users. For example, trying to search for "events" will fail to bring up files in a SIP named "Events_2005-d1c6c1a2-f613-497f-a527-64eb6861f03c".
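To illustrate the failure described above, here is a rough sketch in Python. It is not Elasticsearch code: `approx_default_tokens` is a hypothetical stand-in that assumes the default tokenization lowercases the text and keeps underscores inside a token, which is consistent with the behaviour reported here.

```python
import re


def approx_default_tokens(text):
    """Rough stand-in for a default word tokenizer (an assumption, not ES code).

    Lowercases the input and splits on non-word characters. Python's \\w
    treats "_" as a word character, so "Events_2005" stays in one piece.
    """
    return re.findall(r"\w+", text.lower())


sip_name = "Events_2005-d1c6c1a2-f613-497f-a527-64eb6861f03c"
tokens = approx_default_tokens(sip_name)

print(tokens)
# "events" only exists as part of the longer token "events_2005",
# so a lookup for the exact term "events" finds nothing.
print("events" in tokens)
```

Under that assumption the first token is `events_2005`, which is why a search for "events" alone does not match the SIP name.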

CC @jraddaoui - you may want to look into this as part of the work that you're doing for artefactual/archivematica#1171.

@ross-spencer ross-spencer changed the title Problem: substring searches often fail to bring up files and SIPs Problem: substring searches via Elasticsearch often fail to bring up files and SIPs Nov 8, 2018
@sallain
Member

sallain commented Dec 11, 2018

Added this as a child of #273, which is the epic for all search-related issues.

@sevein sevein added the "Status: in progress" label (issue that is currently being worked on) Dec 17, 2018
@sromkey sromkey added this to the 1.9.0 milestone Jan 21, 2019
@jraddaoui

This has been addressed in the ES upgrade by adding a new analyzer/tokenizer for filepath fields. Those paths should now be split on any of the following characters: - _ . / \
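As a sketch of what such an analyzer could look like, the snippet below shows hypothetical index settings using Elasticsearch's `char_group` tokenizer, together with a plain-Python approximation of the resulting split. The names `filepath_analyzer`, `filepath_tokenizer` and `split_filepath` are invented for illustration, the `lowercase` filter is an assumption, and none of this is necessarily what was merged.

```python
import re

# Hypothetical index settings. The analyzer and tokenizer names are made up,
# and the lowercase filter is an assumption; the merged code may differ.
FILEPATH_ANALYSIS = {
    "analysis": {
        "analyzer": {
            "filepath_analyzer": {
                "type": "custom",
                "tokenizer": "filepath_tokenizer",
                "filter": ["lowercase"],
            }
        },
        "tokenizer": {
            "filepath_tokenizer": {
                "type": "char_group",
                # The five characters listed in the comment above.
                "tokenize_on_chars": ["-", "_", ".", "/", "\\"],
            }
        },
    }
}


def split_filepath(path):
    """Approximate the split in plain Python: break on - _ . / and backslash."""
    return [token for token in re.split(r"[-_./\\]+", path.lower()) if token]


print(split_filepath("Events_2005-d1c6c1a2-f613-497f-a527-64eb6861f03c"))
```

With this split, `events` becomes a token of its own, so the search from the original report would be expected to match.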

@jraddaoui jraddaoui added the "Status: review" label (the issue's code has been merged and is ready for testing/review) and removed the "Status: in progress" label (issue that is currently being worked on) Jan 24, 2019
@sallain sallain assigned sallain and unassigned jraddaoui Feb 4, 2019
@sallain
Member

sallain commented Feb 5, 2019

This works!!!!!! 🎉 🌮 🍾

If I have a filename like View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg, I can search for any of the words in the filename to find my file. Searching for a partial word like Queens doesn't display my result, but adding a wildcard Queens* and using the Phrase parameter (as opposed to Keyword) does work - this is the expected behaviour.
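The behaviour above can be modelled with a small Python sketch. It only imitates token matching and says nothing about how the Phrase and Keyword parameters are translated into Elasticsearch queries; `split_filepath` and `matches` are hypothetical helpers, and the split characters are the ones listed earlier in this thread.

```python
import re
from fnmatch import fnmatchcase


def split_filepath(path):
    """Lowercase the path and split it on - _ . / and backslash."""
    return [token for token in re.split(r"[-_./\\]+", path.lower()) if token]


def matches(query, path):
    """Return True if any token matches the query; "*" acts as a wildcard."""
    pattern = query.lower()
    return any(fnmatchcase(token, pattern) for token in split_filepath(path))


name = "View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg"

print(matches("Queenstown", name))  # whole word: matches a token
print(matches("Queens", name))      # partial word: no token equals "queens"
print(matches("Queens*", name))     # wildcard: matches the token "queenstown"
```

Whole tokens match, a bare partial word does not, and the trailing wildcard restores the match, which mirrors the results reported in this comment.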

There was a change in the search behaviour related to the availability of the raw JSON data; that has been described in #274 (comment). I'll verify this issue and keep #274 open to deal with that issue.

@sallain sallain added the "Status: verified" label and removed the "Status: review" label (the issue's code has been merged and is ready for testing/review) Feb 5, 2019
@sromkey sromkey closed this as completed Mar 6, 2019