Preferring original resources in presentation of record duplicates #316

teckart · 2020-12-01T11:38:35Z

In cases where the VLO importer identifies record duplicates (currently based on name and language), the record presented on the search page might not be the one from the resource owner, but another record provided by an external catalogue. Ways to reduce this behaviour have to be evaluated and implemented.

Example: "Arabic Speech Corpus" OTA vs. ELRA

Helpful links:

https://lucene.apache.org/solr/guide/6_6/collapse-and-expand-results.html#CollapseandExpandResults-CollapsingQueryParser

teckart · 2020-12-03T17:06:12Z

The Solr collapsing mechanism provides min/max/sort parameters to select a group's head. We could create an (optional) index field that indicates the preference for a specific record based on its origin and use it in the query, but it is still unclear what information we would base that on. We could, for example, maintain a list of endpoints that are mostly "aggregators" (of external resources) and downrank their records, but this would mean additional configuration and maintenance and would be somewhat arbitrary in some cases (like LINDAT's "LRT inventory"). The same problem might arise when preferring one dataProvider over others.
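A minimal sketch of what such a query could look like, assuming a numeric preference field existed in the index. The field names `collapseKey` and `originPreference` are hypothetical and not taken from the VLO schema; `max` is the collapsing query parser's parameter for picking the group head by the highest value of a numeric field.

```python
from urllib.parse import urlencode


def collapse_filter(group_field: str, preference_field: str) -> str:
    """Build a collapse filter query whose group head is the document
    with the highest value in preference_field."""
    return "{!collapse field=%s max=%s}" % (group_field, preference_field)


# Hypothetical field names, for illustration only.
fq = collapse_filter("collapseKey", "originPreference")

# URL-encoded request parameters as they might be sent to a Solr select handler.
params = urlencode({"q": "Arabic Speech Corpus", "fq": fq})
```

Note that with `max` alone the head is chosen purely by the preference value, so relevance no longer plays a role in the selection.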

twagoo · 2020-12-04T07:54:08Z

Something to keep in mind: we already have boosts in place for things like availability, presence of description, position in hierarchy (see solrconfig.xml) that now help determine the group's head. By default the selection takes into account relevance with respect to the query as well.

We will have to carefully decide whether we want to add logic 'on top' of this, or have a completely separate policy for the selection of the head. I don't have a clear preference right now but we have to make sure that we don't inadvertently discard a useful ranking mechanism.
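To make the "on top" option concrete, here is a small simulation in plain Python, not VLO or Solr code. It assumes a hypothetical integer `preference` per record and uses the existing relevance score (including boosts) only as a tie-breaker within a duplicate group. In Solr this would roughly correspond to a collapse `sort` such as `sort='originPreference desc, score desc'`, where `originPreference` is again an assumed field name. The records and values below are made up for illustration.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    record_id: str
    group: str        # key shared by all duplicates of a record
    preference: int   # hypothetical origin preference, higher is better
    score: float      # relevance score, including the existing boosts


def select_heads(records: List[Record]) -> Dict[str, Record]:
    """Pick one head per group: highest preference first, then highest score."""
    heads: Dict[str, Record] = {}
    for rec in records:
        current = heads.get(rec.group)
        if current is None or (rec.preference, rec.score) > (current.preference, current.score):
            heads[rec.group] = rec
    return heads


# Illustrative data only.
records = [
    Record("aggregator-record", "corpus-x", preference=0, score=4.2),
    Record("owner-record", "corpus-x", preference=1, score=3.9),
]
heads = select_heads(records)
```

Here the owner's record becomes the head despite its lower score, which shows the trade-off: a strict preference ordering overrides the existing ranking whenever preferences differ, and the boosts only matter among records of equal preference.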

twagoo added this to the VLO 4.10 milestone Dec 1, 2020

teckart self-assigned this Dec 3, 2020

twagoo modified the milestones: VLO 4.10, 4.11 Apr 22, 2021

twagoo modified the milestones: VLO 4.11, VLO 4.12 Sep 20, 2022
