search: configure query parser for handling special characters #3054

Merged
merged 1 commit into from
Feb 19, 2021

@ParthS007 (Member) commented Feb 9, 2021

closes #3042
closes #2930

@@ -281,7 +281,7 @@
 default_order='asc',
 order=2,
 ),
-'title': dict(fields=['title.keyword'],
+'title': dict(fields=['title.exact'],
Member

Sorting is nicely fixed 👍 Special characters now sort to the front, which is OK for now; later we can see whether we could fold them into the alphabet, e.g. list /Btau amongst the Bs rather than at the front of the search results.
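
To illustrate the ordering suggested above, here is a minimal Python sketch of a sort key that ignores leading special characters and case. It only demonstrates the intended order; it is not the portal's Elasticsearch sort configuration, and `alphabetical_key` and the sample titles are hypothetical.

```python
import re


def alphabetical_key(title):
    """Sort key that drops leading non-alphanumeric characters and case.

    Hypothetical helper: shows the desired ordering only.
    """
    return re.sub(r"^[^0-9A-Za-z]+", "", title).lower()


# Made-up sample titles, not real search results.
titles = ["/BTau/Run2010B", "Barn", "Charm", "Atlas"]
ordered = sorted(titles, key=alphabetical_key)
print(ordered)
```

With such a key, a title starting with "/B" would be listed amongst the other Bs instead of before all letters.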

def _default_parser(_query_string=None):
    """Default parser that uses the Q() from elasticsearch_dsl."""
    if '"' not in _query_string:
        _query_string = _query_string.replace("/", "//")
Member

This fixes the internal error for regexp-like queries, but it also leads to some unexpected findings, such as:

  • searching for 10.7483/OPENDATA.9S5F.BY3B will find /glossary/Barn as one of the matching hits. Do you know why this was matched?

  • searching for "10.7483/OPENDATA.9S5F.BY3B" will find hundreds of results which are not related, like record/349.

An alternative approach could be to quote expressions... We can see IRL
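
The quoting alternative mentioned above could be sketched roughly as follows. This is an assumption about what "quote expressions" might mean, not the code that was merged: `quote_slashed_terms` is a hypothetical helper that wraps each whitespace-separated term containing a slash in double quotes and leaves queries that already contain quotes untouched.

```python
def quote_slashed_terms(query_string):
    """Wrap bare terms containing '/' in double quotes.

    Hypothetical helper. Queries that already contain a double quote are
    returned unchanged, on the assumption that the user quoted them
    deliberately. Runs of whitespace are collapsed to single spaces.
    """
    if not query_string or '"' in query_string:
        return query_string
    terms = query_string.split()
    return " ".join(
        '"{}"'.format(term) if "/" in term else term for term in terms
    )


print(quote_slashed_terms("10.7483/OPENDATA.9S5F.BY3B"))
print(quote_slashed_terms("BTau /BTau/Run2010B-Apr21ReReco-v1/AOD"))
```

Whether the quoted form returns the expected hits still depends on how the fields are analysed, so this would need checking against the instance.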

Member

Here is a table showing an overview of the above findings:

| query | opendata.cern.ch (production) | opendata-dev.cern.ch (master) | PR 3054 |
|---|---|---|---|
| 10.7483/OPENDATA.9S5F.BY3B | error | 357 results | 359 results |
| "10.7483/OPENDATA.9S5F.BY3B" | 1 result | error | 5444 results |
| 10.7483 | 2123 results | 357 results | 359 results |

Basically, the PROD system gives the best results; the only thing we want to correct is the "error" when people don't use quotes. The PR fixes the error, but finds too many hits for the quoted query and not enough hits for the partial query.
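
One way to target only the unquoted error case is to backslash-escape the slash, which is the escape character of the Elasticsearch `query_string` syntax, rather than doubling it. The sketch below is an untested assumption and not the code in this PR: `escape_slashes` is a hypothetical helper that escapes `/` outside double-quoted phrases and leaves quoted phrases and slash-free queries exactly as typed.

```python
import re

# A double-quoted phrase; the capturing group makes re.split keep it.
_QUOTED = re.compile(r'("[^"]*")')


def escape_slashes(query_string):
    """Backslash-escape '/' outside double-quoted phrases.

    Hypothetical helper. It does not detect slashes that are already
    escaped, so it should be applied to raw user input only once.
    """
    if not query_string:
        return query_string
    parts = _QUOTED.split(query_string)
    # Odd indices hold the captured quoted phrases; keep them untouched.
    return "".join(
        part if index % 2 else part.replace("/", "\\/")
        for index, part in enumerate(parts)
    )


print(escape_slashes("10.7483/OPENDATA.9S5F.BY3B"))
```

Whether this reproduces the production hit counts in the table depends on the field mappings and analysers, which has not been verified here.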

Member Author

Yes, I am currently testing this one.

@tiborsimko (Member) commented Feb 16, 2021

Ditto for CMS dataset title search; while the PR fixes the error situations, it also finds way too many "unrelated" results:

| query | opendata.cern.ch (production) | opendata-dev.cern.ch (master) | PR 3054 |
|---|---|---|---|
| /BTau/Run2010B-Apr21ReReco-v1/AOD | error | error | 2 results |
| "/BTau/Run2010B-Apr21ReReco-v1/AOD" | 2 results | error | 2 results |
| BTau | 5 results | 5 results | 5 results |

Edit: updated with the latest results

Member Author

# -*- coding: utf-8 -*-
#
# This file is part of CERN Open Data Portal.
# Copyright (C) 2017 CERN.
Member

s/2017/2021/g ?

@tiborsimko tiborsimko merged commit 8c589cf into cernopendata:master Feb 19, 2021
@ParthS007 ParthS007 deleted the 3042 branch February 19, 2021 09:02
Successfully merging this pull request may close these issues.

  • search: configure query parser internally for handling special characters

  • search: exact dataset title
2 participants