search: configure query parser for handling special characters #3054
cernopendata/config.py
Outdated
@@ -281,7 +281,7 @@
     default_order='asc',
     order=2,
 ),
-'title': dict(fields=['title.keyword'],
+'title': dict(fields=['title.exact'],
Sorting is well fixed 👍 Special characters appear at the front, which is OK for now, but we can also see later whether we could classify them into the alphabet, e.g. sort `/Btau` amongst the `B`s rather than in front of the search results.
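To illustrate the ordering idea only, here is a minimal Python sketch of a sort key that ignores leading special characters. The helper name `alphabetic_sort_key` and the sample titles are hypothetical; in the portal itself the equivalent would presumably have to be done on the indexed sort field rather than in Python.

```python
import re


def alphabetic_sort_key(title):
    """Return a sort key that ignores leading non-alphanumeric characters."""
    # Strip leading characters such as "/" so "/BTau" sorts next to "BTau".
    stripped = re.sub(r"^[^0-9A-Za-z]+", "", title)
    # Case-fold so that upper and lower case titles are interleaved.
    return stripped.casefold()


# Hypothetical sample titles, for illustration only.
titles = ["/BTau/Run2010B", "Alpha", "Barn", "Commissioning"]
ordered = sorted(titles, key=alphabetic_sort_key)
print(ordered)
```

With this key, `/BTau/Run2010B` lands among the titles starting with `B` instead of ahead of all of them.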
def _default_parser(_query_string=None):
    """Default parser that uses the Q() from elasticsearch_dsl."""
    if '"' not in _query_string:
        _query_string = _query_string.replace("/", "//")
This fixes the internal error for regexp-like queries, but it also leads to some unexpected findings, such as:
- searching for `10.7483/OPENDATA.9S5F.BY3B` will find `/glossary/Barn` as one of the matching hits. Do you know why this was matched?
- searching for `"10.7483/OPENDATA.9S5F.BY3B"` will find hundreds of unrelated results, like `record/349`.

An alternative approach could be to quote expressions... We can see IRL
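As a rough sketch of the quoting idea, assuming we only want to touch unquoted queries that contain a `/`: the helper name `quote_query` is hypothetical and not part of this PR.

```python
def quote_query(query_string):
    """Wrap an unquoted query containing "/" in double quotes.

    Hypothetical helper: queries that already contain a double quote,
    or that have no "/", are returned unchanged.
    """
    if not query_string:
        return query_string
    if '"' in query_string or "/" not in query_string:
        return query_string
    # Quoting turns the whole input into a single phrase query.
    return '"{}"'.format(query_string)


print(quote_query("10.7483/OPENDATA.9S5F.BY3B"))
print(quote_query("BTau"))
```

One trade-off of this sketch is that a multi-term query containing a slash would be turned into one phrase as a whole, which may be stricter than the user intended.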
Here is a table showing an overview of the above findings:
| query | opendata.cern.ch (production) | opendata-dev.cern.ch (master) | PR 3054 |
|---|---|---|---|
| 10.7483/OPENDATA.9S5F.BY3B | error | 357 results | 359 results |
| "10.7483/OPENDATA.9S5F.BY3B" | 1 result | error | 5444 results |
| 10.7483 | 2123 results | 357 results | 359 results |
Basically, the PROD system gives the best results; the only thing we want to correct is the "error" when people don't use quotes. The PR fixes the error, but finds too many hits for the quoted query and not enough hits for the partial query.
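Since the goal is to leave quoted queries alone and only avoid the error for unquoted ones, another option might be to backslash-escape `/` outside quoted phrases, because the Lucene query-string syntax treats an unescaped `/` as a regular-expression delimiter. The sketch below is an assumption-laden illustration: the helper name `escape_unquoted_slashes` is hypothetical, and whether the resulting hit counts would match production has not been checked.

```python
import re

# Matches, in order of preference: an already escaped character,
# a double-quoted phrase, or a single forward slash.
_TOKEN = re.compile(r'\\.|"[^"]*"|/')


def escape_unquoted_slashes(query_string):
    """Backslash-escape "/" outside double-quoted phrases (hypothetical helper)."""

    def _escape(match):
        token = match.group(0)
        # Only bare slashes are rewritten; quoted phrases and
        # already escaped characters are passed through untouched.
        return "\\/" if token == "/" else token

    return _TOKEN.sub(_escape, query_string or "")


print(escape_unquoted_slashes("10.7483/OPENDATA.9S5F.BY3B"))
print(escape_unquoted_slashes('"/BTau/Run2010B-Apr21ReReco-v1/AOD"'))
```

The escaped string would then be passed to the query parser in place of the raw user input, while a quoted query reaches it unchanged.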
Yes, I am currently testing this one.
Ditto for CMS dataset title search; while the PR fixes the error situations, it also finds way too many "unrelated" results:
| query | opendata.cern.ch (production) | opendata-dev.cern.ch (master) | PR 3054 |
|---|---|---|---|
| /BTau/Run2010B-Apr21ReReco-v1/AOD | error | error | 2 results |
| "/BTau/Run2010B-Apr21ReReco-v1/AOD" | 2 results | error | 2 results |
| BTau | 5 results | 5 results | 5 results |
Edit: Updated with the latest results.
Ping @tiborsimko
# -*- coding: utf-8 -*-
#
# This file is part of CERN Open Data Portal.
# Copyright (C) 2017 CERN.
s/2017/2021/g ?
closes #3042
closes #2930