Restrict Wikidata reconciliation queries by item type #101

Closed
dlindem opened this issue May 31, 2021 · 5 comments
Labels
bug (Something isn't working), PIDs (issues involving persistent identifiers such as dois, isbns and qids), wikidata
Comments

@dlindem

dlindem commented May 31, 2021

I had a book chapter item in Zotero, with the ISBN of the containing volume: Wooldridge, Russon (2004): "Lexicography", in Schreibman et al. (eds.): A Companion to Digital Humanities, Oxford: Blackwell, ISBN 978-1-4051-0321-3.

The fetched Q-ID is Q96725112, which refers to a journal (not to an article!) with the same title.

I think reconciliation based only on titles is very error-prone. Restricting matches to items of the same item type would be one strategy (allowing only book chapters in this case).

@diegodlh
Owner

diegodlh commented May 31, 2021

Hi, David. Thank you very much for your report. The bug you found is indeed critical.

First, let me briefly explain how QIDs are fetched using the Wikidata reconciliation service.
Reconciliation queries are built using the following parameters:

  • query: the item's title
  • type: used to "restrict the search to entities which bear those types" or one of their subclasses. Right now this is always set to Q386724 ("work").
  • properties: "a map from property identifiers to" property values. Right now this includes P356 (DOI) → cleaned item's DOI, and P212|P957 (ISBN-13 & ISBN-10) → cleaned item's ISBN
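
As a rough illustration of the three parameters above, here is a minimal Python sketch of how such a query could be assembled. It is not Cita's actual code: the function name `build_query` is hypothetical, and the `properties` layout (a list of `pid`/`v` pairs) is assumed from the generic reconciliation API rather than taken from this thread. Identifier cleaning is left out.

```python
import json

WORK_QID = "Q386724"  # "work": the type the query currently always uses


def build_query(title, doi=None, isbn=None, type_qid=WORK_QID):
    """Build one reconciliation query as a plain dict (hypothetical helper)."""
    properties = []
    if doi:
        # P356 is the DOI property named in the explanation above.
        properties.append({"pid": "P356", "v": doi})
    if isbn:
        # "P212|P957" covers both ISBN-13 and ISBN-10.
        properties.append({"pid": "P212|P957", "v": isbn})
    query = {"query": title, "type": type_qid}
    if properties:
        query["properties"] = properties
    return query


# The book chapter from the report, carrying the ISBN of its containing volume.
q = build_query("Lexicography", isbn="978-1-4051-0321-3")
payload = json.dumps({"q0": q})  # assumed batch form: queries keyed by an id
print(payload)
```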

The reconciliation service returns a series of candidate QIDs, each with a matching score.
As explained here, if a DOI or ISBN was provided, "candidates are first fetched by looking for items with the supplied identifiers". If no candidates are found (or if no DOI or ISBN was provided), the service searches by the query field (i.e., the title).

If there is an exact match, the QID is saved to the item. The reconciliation service decides what counts as an exact match, but the documentation is not explicit about the criteria; I would have to re-check their source code to make sure. If there are partial matches, the user may be asked to choose among them (non-batch operations only).
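
The handling of exact and partial matches described above could look roughly like the sketch below. It assumes each candidate is a dict with an `id` and a boolean `match` flag, as in the generic reconciliation API; the rule of only trusting a single unambiguous exact match is my own assumption, not something stated in this thread. The candidate ids are placeholders.

```python
def classify(candidates):
    """Split candidates into (exact_qid, partial_qids). Hypothetical helper."""
    exact = [c["id"] for c in candidates if c.get("match")]
    partial = [c["id"] for c in candidates if not c.get("match")]
    # Assumption: accept an exact match only when exactly one is flagged.
    if len(exact) == 1:
        return exact[0], partial
    # Zero or several flagged candidates: everything becomes a suggestion.
    return None, exact + partial


candidates = [
    {"id": "Q-example-1", "score": 100, "match": True},
    {"id": "Q-example-2", "score": 61, "match": False},
]
exact_qid, suggestions = classify(candidates)
print(exact_qid, suggestions)
```

In a batch operation the suggestions would simply be dropped, since the user cannot be asked to choose.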

Restricting to items of the same item type sounds like a great suggestion. I will make this the subject of this issue.
Implementing this should be relatively straightforward: it would mean setting the query's type parameter to a more specific Wikidata class (instead of Q386724), according to the Zotero item type.
We would just need to map the different Zotero item types to their corresponding Wikidata class. I guess we could use the map used by Zotero's Wikidata import and export translators.
I would still return matching items of a different type as partial matches (i.e., suggestions), to make sure the user won't create duplicates if the item type was misconfigured in their collection. Edit: To make this point clearer, I would like to highlight that there seem to be two opposing tensions in play here: being too stringent about item matching, which may result in users creating duplicates in Wikidata, and being too flexible about it, which may result in Cita assigning wrong QIDs to items (which is what happened in your bug report). Being strict for exact matches, while leaving room for flexible matching as partial matches (i.e., suggestions), may be a good middle ground.
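
To make the type mapping idea concrete, here is a small Python sketch. The table is illustrative only: it is not the mapping used by the Zotero translators, and the three class QIDs are given from memory and should be checked against Wikidata before being relied on. Unmapped item types fall back to the current generic type.

```python
WORK_QID = "Q386724"  # "work": the generic type used today

# Illustrative subset only; QIDs to be verified against Wikidata.
ZOTERO_TO_WIKIDATA = {
    "journalArticle": "Q13442814",  # scholarly article
    "book": "Q571",                 # book
    "bookSection": "Q1980247",      # chapter
}


def type_for(item_type):
    """Return the Wikidata class to use as the query's type parameter."""
    return ZOTERO_TO_WIKIDATA.get(item_type, WORK_QID)


print(type_for("bookSection"))  # the specific class for book chapters
print(type_for("podcast"))      # unmapped: falls back to "work"
```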

In addition, including author names and publication date in the reconciliation query may help identify these mismatches as well. I had tried to do this, but it didn't work as expected. Apparently it was a bug. Now that it seems to have been solved, I'll try again (#103).

Finally, there is yet another case related to the bug you reported: book chapters in Zotero with an ISBN for the book they belong to. Right now this ISBN is included in the query's properties parameter above. This could result in the reconciliation service returning the book itself (rather than the chapter). I've just posted a separate issue about this: #102. In any case, this would be solved by restricting queries to items of the same type.

@diegodlh diegodlh added bug Something isn't working PIDs issues involving persistent identifiers such as dois, isbns and qids wikidata labels May 31, 2021
@diegodlh diegodlh changed the title Item mismatch (same title, but different item) Restrict Wikidata reconciliation queries by item type May 31, 2021
@dlindem
Author

dlindem commented Jun 1, 2021

Regarding the mapping of item types: I think the mapping can be built straightforwardly (following the mapping in Zotkat), with one exception: a conference paper may appear as a conference paper, as a journal article, or as a book chapter (if the proceedings are published as a serial volume or as a book, which often happens). So, in that case, it would be good to allow all three item types for a possible Wikidata match.
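
A sketch of what I mean, in Python: a conference paper expands to three acceptable item types, and one query is built per type. All names here are hypothetical, and `qid_for` stands in for whatever item type mapping ends up being used; batching the queries under `q0`, `q1`, ... is an assumption about the request format.

```python
# Item types accepted as a match for a given Zotero item type.
ACCEPTED_TYPES = {
    "conferencePaper": ["conferencePaper", "journalArticle", "bookSection"],
}


def candidate_item_types(item_type):
    """Every other item type only accepts itself."""
    return ACCEPTED_TYPES.get(item_type, [item_type])


def build_batch(title, item_type, qid_for):
    """One query per accepted item type; qid_for maps item type to a class."""
    return {
        f"q{i}": {"query": title, "type": qid_for(t)}
        for i, t in enumerate(candidate_item_types(item_type))
    }


# Placeholder mapping, so that no real QIDs are implied here.
batch = build_batch("Some paper", "conferencePaper", lambda t: f"<class for {t}>")
print(batch)
```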
Regarding book chapters: what about ISBN plus starting page? Also, it is a pity that Zotero does not allow the DOI field for book chapters (and books), while publishers now often provide DOIs for these two item types, even for older books (example).

@diegodlh
Owner

diegodlh commented Jun 3, 2021

Hi, @dlindem! Sorry I couldn't reply to your message before.

I'll take your suggestion about conference papers into account when I have time to fix this. Do you know whether a conference paper catalogued as a conference paper, a journal article, or a book chapter should be treated as separate items in Wikidata? I wonder if this may be related to what was discussed here regarding whether a preprint and the final article should have different items in Wikidata (I think they should). Analogously, if Zotero had a separate "preprint" item type (it doesn't, but I'm using it as an analogy to the conference paper case) and a user had a preprint item in their library for which the Wikidata resolution returned the corresponding final article, I wouldn't treat that as an exact match, but as a partial match (i.e., a suggestion), as commented in my reply above.

> Regarding book chapters: what about ISBN plus starting page?

I'm afraid that wouldn't work because of the way the reconciliation service works. A book chapter in Wikidata doesn't seem to have a reference to the ISBN of the book it belongs to. Rather, it has a "part of" statement pointing to a "version, edition, translation" (hopefully), which in turn has an ISBN.
A possibility would be (1) querying the reconciliation service for title only in these cases, then (2) getting the targets of all "part of" claims for the candidate QIDs returned, and (3) matching the ISBN to these target ISBNs. But I'm not sure it'd be worth the effort.
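
For reference, the three steps could be sketched like this in Python. The two lookups are passed in as functions so that the sketch stays independent of how the "part of" targets and their ISBNs are actually fetched; every name and id here is hypothetical, and the comparison ignores hyphens and spaces as an assumption about how ISBNs should be compared.

```python
import re


def normalize_isbn(isbn):
    """Drop hyphens and whitespace so differently formatted ISBNs compare equal."""
    return re.sub(r"[\s-]", "", isbn).upper()


def match_chapter_by_isbn(candidates, part_of, isbns_of, wanted_isbn):
    """candidates: QIDs returned by the title-only query (step 1).
    part_of(qid): QIDs targeted by the candidate's "part of" claims (step 2).
    isbns_of(qid): ISBN strings found on such a target item (step 3)."""
    wanted = normalize_isbn(wanted_isbn)
    matches = []
    for qid in candidates:
        parents = part_of(qid)
        if any(normalize_isbn(i) == wanted for p in parents for i in isbns_of(p)):
            matches.append(qid)
    return matches


# Toy data standing in for the service responses; ids are placeholders.
parents = {"Q-chapter": ["Q-edition"], "Q-journal": []}
isbns = {"Q-edition": ["978-1-4051-0321-3"]}
result = match_chapter_by_isbn(
    ["Q-chapter", "Q-journal"],
    lambda q: parents.get(q, []),
    lambda q: isbns.get(q, []),
    "9781405103213",
)
print(result)
```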

> it is a pity that Zotero does not allow the DOI field for book chapters (and books)

That's a good idea. I can have Cita store these as extra, just like it does with QID and OCC: #109
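
A minimal sketch of what storing a value in the extra field could look like, assuming the field holds one `Key: value` pair per line. This is only an illustration in Python, not Cita's implementation; the helper name and the sample values are made up.

```python
def set_extra_field(extra, key, value):
    """Add or replace a "Key: value" line in an extra-field string."""
    new_line = f"{key}: {value}"
    lines = extra.splitlines()
    for i, line in enumerate(lines):
        # Replace an existing line for the same key, ignoring case.
        if line.lower().startswith(f"{key.lower()}:"):
            lines[i] = new_line
            return "\n".join(lines)
    return "\n".join(lines + [new_line])


extra = set_extra_field("QID: Q-example", "DOI", "10.1000/example")
print(extra)
```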

@diegodlh
Owner

diegodlh commented Jun 18, 2021

Non-strict (partial) type matching is not yet supported by openrefine-wikibase (wetneb/openrefine-wikibase/issues/4). As a workaround, I will send two requests to this API, the first one with a specific item type, and the second one (if the first one returns no exact matches) with a more general item type. Any matches (exact or not) returned from this second request will be treated as partial matches.
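
The two-request workaround could be sketched roughly as follows, in Python for brevity. `send` is a stand-in for whatever function posts a query to the reconciliation API and returns its list of candidates; it is injected so the sketch runs without network access. Keeping the first request's non-exact candidates as suggestions is my assumption. The chapter class QID is given from memory and should be verified.

```python
WORK_QID = "Q386724"  # the general "work" type


def reconcile_two_pass(send, title, specific_type, general_type=WORK_QID):
    """Return (exact_qid, partial_qids) using a specific, then a general, type."""
    first = send({"query": title, "type": specific_type})
    exact = [c["id"] for c in first if c.get("match")]
    if exact:
        return exact[0], [c["id"] for c in first if not c.get("match")]
    # No exact match for the specific type: retry with the general type.
    second = send({"query": title, "type": general_type})
    partial = []
    for c in first + second:
        # Anything found here is only a suggestion, even if flagged as a match.
        if c["id"] not in partial:
            partial.append(c["id"])
    return None, partial


def fake_send(query):
    """Toy service: only the general type finds the journal from the report."""
    if query["type"] == WORK_QID:
        return [{"id": "Q96725112", "match": True}]
    return []


outcome = reconcile_two_pass(fake_send, "Lexicography", "Q1980247")
print(outcome)
```

With this behaviour, the journal from the original report would only be offered as a suggestion instead of being saved to the book chapter.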

In addition, addressing #52 would also help users decide whether a partial match refers to their item or not.

@diegodlh
Owner

Should be published in v0.2.4

2 participants