search: configure query parser for handling special characters #3054

ParthS007 · 2021-02-09T15:33:03Z

closes #3042
closes #2930

tiborsimko · 2021-02-12T13:28:39Z

cernopendata/config.py

@@ -281,7 +281,7 @@
            default_order='asc',
            order=2,
        ),
-        'title': dict(fields=['title.keyword'],
+        'title': dict(fields=['title.exact'],


Sorting is well fixed 👍 Special characters appear in the front, which is OK for now, but we can also see later whether we could classify them into the alphabet, e.g. /Btau amongst Bs rather than in front of the search results.
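To illustrate the ordering suggested above (e.g. /BTau listed amongst the Bs), here is a minimal Python sketch of a sort key that ignores leading special characters. It only shows the intended ordering. The function name `alphabetical_key` and the sample titles are hypothetical, and in the portal itself the ordering would have to come from the search index configuration rather than from Python code like this.

```python
import re


def alphabetical_key(title):
    """Return a sort key that ignores leading non-alphanumeric characters."""
    # Drop leading characters such as "/" so "/BTau" sorts under "B".
    stripped = re.sub(r"^[^0-9A-Za-z]+", "", title)
    # Keep the original title as a tie-breaker for a stable, deterministic order.
    return (stripped.casefold(), title)


# Hypothetical sample titles, for illustration only.
titles = ["/BTau/Run2010B", "Barn", "/Mu/Run2010B", "Apple"]
sorted_titles = sorted(titles, key=alphabetical_key)
```

With this key, titles starting with a slash are interleaved with the others instead of being grouped in front.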

tiborsimko · 2021-02-12T13:32:34Z

cernopendata/modules/records/search/query.py

+    def _default_parser(_query_string=None):
+        """Default parser that uses the Q() from elasticsearch_dsl."""
+        if '"' not in _query_string:
+            _query_string = _query_string.replace("/", "//")


This fixes the internal error for regexp-like queries, but it also leads to some unexpected findings, such as:

searching for 10.7483/OPENDATA.9S5F.BY3B will find /glossary/Barn as one of the matching hits. Do you know why this was matched?

searching for "10.7483/OPENDATA.9S5F.BY3B" will find hundreds of results which are not related, like record/349.

An alternative approach could be to quote expressions... We can see IRL

Here is a table showing an overview of the above findings:

| query | opendata.cern.ch (production) | opendata-dev.cern.ch (master) | PR 3054 |
| --- | --- | --- | --- |
| `10.7483/OPENDATA.9S5F.BY3B` | error | 357 results | 359 results |
| `"10.7483/OPENDATA.9S5F.BY3B"` | 1 result | error | 5444 results |
| `10.7483` | 2123 results | 357 results | 359 results |

Basically, the PROD system gives the best results; the only thing we want to correct is the "error" when people don't use quotes. The PR fixes the error, but it finds too many hits for the quoted query and not enough hits for the partial query.
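One way to fix only the unquoted case while leaving quoted phrases untouched could look like the sketch below. It relies on the fact that, in the Elasticsearch query string syntax, a forward slash starts a regular expression and can be escaped with a backslash. The helper name `escape_unquoted_slashes` is hypothetical and this is not the code merged in this PR. It also assumes that double quotes are balanced and that a slash preceded by a backslash is already escaped.

```python
import re


def escape_unquoted_slashes(query_string):
    """Escape "/" outside double-quoted phrases (hypothetical helper)."""
    if not query_string:
        return query_string
    # Splitting on '"' puts text outside quotes at even indexes and
    # quoted phrases at odd indexes (assuming balanced quotes).
    parts = query_string.split('"')
    for i in range(0, len(parts), 2):
        # Add a backslash before every "/" that is not already escaped.
        parts[i] = re.sub(r"(?<!\\)/", r"\\/", parts[i])
    return '"'.join(parts)


unquoted = escape_unquoted_slashes("10.7483/OPENDATA.9S5F.BY3B")
quoted = escape_unquoted_slashes('"10.7483/OPENDATA.9S5F.BY3B"')
```

Here the unquoted DOI gets its slash escaped, so it is no longer parsed as the start of a regular expression, while the quoted DOI is passed through unchanged and keeps its phrase-query behaviour.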

Yes, I am currently testing this one.

Ditto for CMS dataset title search; while the PR fixes the error situations, it also finds way too many "unrelated" results:

| query | opendata.cern.ch (production) | opendata-dev.cern.ch (master) | PR 3054 |
| --- | --- | --- | --- |
| `/BTau/Run2010B-Apr21ReReco-v1/AOD` | error | error | 2 results |
| `"/BTau/Run2010B-Apr21ReReco-v1/AOD"` | 2 results | error | 2 results |
| `BTau` | 5 results | 5 results | 5 results |

Edit: Updated the latest results

Ping @tiborsimko

tiborsimko · 2021-02-18T18:29:04Z

tests/test_cernopendata_query_parser.py

+# -*- coding: utf-8 -*-
+#
+# This file is part of CERN Open Data Portal.
+# Copyright (C) 2017 CERN.


s/2017/2021/g ?

ParthS007 mentioned this pull request Feb 12, 2021

search: add mappings for doi search #3058

Merged

tiborsimko reviewed Feb 12, 2021

tiborsimko reviewed Feb 18, 2021

search: configure query parser for handling special characters

8c589cf

ParthS007 requested a review from tiborsimko February 19, 2021 07:20

tiborsimko approved these changes Feb 19, 2021

tiborsimko merged commit 8c589cf into cernopendata:master Feb 19, 2021

ParthS007 deleted the 3042 branch February 19, 2021 09:02
