Output of the check command #25

Open
jneubert opened this issue Jan 16, 2017 · 9 comments


@jneubert
Collaborator

jneubert commented Jan 16, 2017

With the following input

source, target, annotation
124825109, psn9, "Dennis Snower (GND PRESENT AND LINKED)"
170947386, phi58, "Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)"
114008787, pwy2, "Charles Wyplosz (GND NOT PRESENT)"
124825109, pxx999, "Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)"
123292182, psn9, "Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)"

the current version (0.0.7) gives the output

$ wdmapper check P227 P2428 -i tests/gnd_repec_test.csv
#FORMAT: BEACON
#NAME: RePEc Short-ID
#DESCRIPTION: Mapping from GND IDs to RePEc Short-IDs
#PREFIX: http://d-nb.info/gnd/
#TARGET: https://authors.repec.org/pro/

~ 124825109|Q1189225|psn9
+ 170947386|Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)|phi58
- 170947386|Q1082503
+ 114008787|Charles Wyplosz (GND NOT PRESENT)|pwy2
+ 124825109|Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)|pxx999
- 124825109|Q1189225|psn9
+ 123292182|Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)|psn9
- 124825109|Q1189225|psn9
- 123292182|Q18817490
@jneubert
Collaborator Author

I'm not sure how to make sense of the actual output. I would expect the output of check to make it possible to distinguish the different cases, as indicated in parentheses in the input file. But frankly, I don't understand the relation between input and output, even considering https://wdmapper.readthedocs.io/en/latest/commands.html#check.
Is the BEACON-like format of the output required for some postprocessing you have in mind? It is quite restrictive, in that it encodes the outcome in the first character of the line and offers no fields for additional information such as the ID of a found Wikidata item.
Perhaps that is the reason that some input lines produce more than one output line. In my eyes, this makes parsing the output much more difficult (for humans as well as for machines), because you have to check for follow-up lines.
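
To illustrate the follow-up-line problem, here is a minimal Python sketch of what a consumer of the current output has to do. It assumes (and this is only one reading of the documentation) that every line starting with - belongs to the most recent + or ~ line. The helper name group_check_output is hypothetical and not part of wdmapper.

```python
def group_check_output(lines):
    """Group check output into records of one head line plus follow-ups.

    Assumption: a line starting with '-' continues the most recent
    '+', '~' or '=' line. Header lines ('#...') and blank lines are skipped.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        code, _, rest = line.partition(" ")
        fields = rest.split("|")
        if code in ("+", "~", "="):
            records.append({"code": code, "fields": fields, "follow_ups": []})
        elif code == "-" and records:
            records[-1]["follow_ups"].append(fields)
        else:
            raise ValueError(f"unexpected line: {line!r}")
    return records


# The first four output lines quoted in this issue.
output = """\
~ 124825109|Q1189225|psn9
+ 170947386|Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)|phi58
- 170947386|Q1082503
+ 114008787|Charles Wyplosz (GND NOT PRESENT)|pwy2
""".splitlines()

for record in group_check_output(output):
    print(record["code"], record["fields"][0], len(record["follow_ups"]))
```

The state carried from one line to the next (records[-1]) is exactly the extra bookkeeping that a one-line-per-input format would make unnecessary.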

@jneubert
Collaborator Author

jneubert commented Jan 18, 2017

Line by line

124825109, psn9, "Dennis Snower (GND PRESENT AND LINKED)"
~ 124825109|Q1189225|psn9

The output should

  • identify the input line. For machines, the source ID is fine, but for humans it would be nice to include the annotation
  • give the ID of the found Wikidata item
  • state that the link already exists

In my eyes, there is no need for

  • repeating the target ID
  • saying anything about the label of the Wikidata item (difference between = and ~). That could not work in general, because WD has multiple labels (in different languages), while we cannot specify a language for comparison in the input file. Additionally, even simple differences in convention (like 'first-name last-name' vs. 'last-name, first-name') sometimes prevent any exact matching.
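
A minimal Python sketch of why exact label comparison is fragile. The labels list is made up for illustration (it is not taken from Wikidata), and normalize_name is a hypothetical helper that handles only the 'last-name, first-name' convention.

```python
def normalize_name(name):
    """Turn 'Last, First' into 'first last' and lowercase everything."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first.strip()} {last.strip()}"
    return " ".join(name.lower().split())


# Hypothetical labels of a single item, in different forms.
labels = ["Snower, Dennis", "Dennis J. Snower"]
annotation = "Dennis SNOWER"

# Exact comparison fails although the same person is meant.
exact = annotation in labels
# Normalization rescues one label form but still misses "Dennis J. Snower".
normalized = normalize_name(annotation) in {normalize_name(l) for l in labels}
print(exact, normalized)  # False True
```

Even with normalization, each additional naming convention or language needs its own rule, so the = versus ~ distinction cannot be relied on in general.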

@jneubert
Collaborator Author

jneubert commented Jan 18, 2017

170947386, phi58, "Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)"
+ 170947386|Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)|phi58
- 170947386|Q1082503

As above, the output should, preferably in one line

  • identify the input line
  • give the ID of the found Wikidata item
  • state that the link does not exist (and could be added)
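
One possible shape for such a one-line result, as a Python sketch. The class CheckResult, its field names, the status keyword LINK_MISSING and the tab-separated layout are all assumptions made for illustration; none of them exist in wdmapper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    """One input line, one result (all field names are hypothetical)."""
    source_id: str               # source ID from the input file
    status: str                  # hypothetical keyword, e.g. "LINK_MISSING"
    item: Optional[str] = None   # Wikidata item found via the source ID
    annotation: str = ""         # free text from the input, for humans

    def to_line(self) -> str:
        # Tab-separated so the annotation may contain '|' or spaces.
        return "\t".join(
            [self.source_id, self.status, self.item or "", self.annotation]
        )


result = CheckResult(
    source_id="170947386",
    status="LINK_MISSING",
    item="Q1082503",
    annotation="Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)",
)
print(result.to_line())
```

Because the status is a separate field rather than the first character of the line, there is room for the item ID and the annotation without follow-up lines.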

@jneubert
Collaborator Author

jneubert commented Jan 18, 2017

114008787, pwy2, "Charles Wyplosz (GND NOT PRESENT)"
+ 114008787|Charles Wyplosz (GND NOT PRESENT)|pwy2

Here I feel betrayed! By the + code, the output seems to indicate that a link could be added, but no Wikidata item was identified by the source ID, which makes that impossible. (I'd consider adding Wikidata items out of scope.)

The output should

  • identify the input line
  • state that no Wikidata item has been found

@jneubert
Collaborator Author

jneubert commented Jan 18, 2017

124825109, pxx999, "Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)"
+ 124825109|Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)|pxx999
- 124825109|Q1189225|psn9

The output should

  • identify the input line
  • give the ID of the found Wikidata item
  • state (very clearly) that there exists a conflict between the existing target ID and the one in the input file
  • give the existing target ID
  • repeat the target ID from the input file

Perhaps a more general question is touched on here: I don't think that existing links in Wikidata should be overridden automatically. Conflicts should be reported by the tool, then investigated and resolved manually.
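
A minimal Python sketch of that policy, assuming for simplicity that each item carries at most one target ID. The function split_for_review and its data structures are hypothetical; the IDs are the ones quoted in this issue.

```python
def split_for_review(candidates, existing_targets):
    """Separate safe additions from conflicts; never override anything.

    candidates:       list of (item, target_id) pairs from the input file
    existing_targets: dict mapping item -> target ID already in Wikidata
    """
    to_add, conflicts = [], []
    for item, new_target in candidates:
        current = existing_targets.get(item)
        if current is None:
            to_add.append((item, new_target))
        elif current != new_target:
            # Report both values; a human decides which one is right.
            conflicts.append((item, current, new_target))
        # current == new_target: link already exists, nothing to do
    return to_add, conflicts


existing_targets = {"Q1189225": "psn9"}
candidates = [("Q1082503", "phi58"), ("Q1189225", "pxx999"), ("Q1189225", "psn9")]

to_add, conflicts = split_for_review(candidates, existing_targets)
print("add:", to_add)
print("review manually:", conflicts)
```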

@jneubert
Collaborator Author

123292182, psn9, "Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)"
+ 123292182|Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)|psn9
- 124825109|Q1189225|psn9
- 123292182|Q18817490

To me, the output for this case is highly confusing, and I could only guess what it was intended to mean.

The output should

  • identify the input line
  • give the ID of the found Wikidata item
  • state (very clearly) that the target ID is already linked to another Wikidata item (and therefore cannot be added)
  • repeat the target ID from the input file
  • give the ID of the Wikidata item where it is already used
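
Taken together, the five cases could be told apart along the following lines. This is a sketch under stated assumptions, not how wdmapper works: the status keywords are invented, each item is assumed to carry at most one target ID, and the two lookup dicts stand in for whatever wdmapper retrieves from Wikidata. The IDs are the ones from the input and output quoted in this issue.

```python
def classify(source_id, target_id, item_by_source, target_by_item):
    """Return (status, item, extra) for one input line.

    item_by_source: dict source ID -> Wikidata item
    target_by_item: dict Wikidata item -> existing target ID
    """
    item = item_by_source.get(source_id)
    if item is None:
        return ("NO_ITEM", None, None)
    existing = target_by_item.get(item)
    if existing == target_id:
        return ("LINKED", item, None)
    if existing is not None:
        # extra = the target ID that is already on the item
        return ("CONFLICT", item, existing)
    for other_item, other_target in target_by_item.items():
        if other_target == target_id:
            # extra = the item that already uses this target ID
            return ("TARGET_IN_USE", item, other_item)
    return ("LINK_MISSING", item, None)


item_by_source = {
    "124825109": "Q1189225",
    "170947386": "Q1082503",
    "123292182": "Q18817490",
}
target_by_item = {"Q1189225": "psn9"}

rows = [
    ("124825109", "psn9"),
    ("170947386", "phi58"),
    ("114008787", "pwy2"),
    ("124825109", "pxx999"),
    ("123292182", "psn9"),
]
for source_id, target_id in rows:
    print(source_id, target_id, *classify(source_id, target_id, item_by_source, target_by_item))
```

Each input line yields exactly one result, and the third tuple element carries the conflicting target ID or the other item where one applies.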

@jneubert
Collaborator Author

According to https://wdmapper.readthedocs.io/en/latest/commands.html#check, lines starting with - seem to be continuation lines, but they could also indicate that a link should be removed (somewhat the opposite of the + code). That ambiguity of - should be avoided in any case.

@jneubert
Collaborator Author

jneubert commented Jan 18, 2017

Just came across a nice example of the variety of Wikidata labels for a person: https://www.wikidata.org/wiki/Q564905?uselang=en

@jneubert
Collaborator Author

Output of v0.0.9 (the line "Charles Wyplosz (GND NOT PRESENT)" was changed to a non-existent GND):

# wdmapper check P227 P2428 -i /opt/repec-ras/var/ras/example2/map/gnd_ras_mapping.test.csv
#FORMAT: BEACON
#NAME: RePEc Short-ID
#DESCRIPTION: Mapping from GND IDs to RePEc Short-IDs
#PREFIX: http://d-nb.info/gnd/
#TARGET: https://authors.repec.org/pro/
#SOURCESET: http://www.wikidata.org/entity/Q36578
#TARGETSET: http://www.wikidata.org/entity/Q206316

~ 124825109|Q1189225|psn9
+ 170947386|Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)|phi58
- 170947386|Q1082503
+ 999999999|Dummy (GND NOT PRESENT)|pxx99
+ 120068524|Joseph Stiglitz (GND PRESENT BUT LINKED DIFFERENTLY)|pxx999
- 120068524|Q18430|pst33
+ 123292182|Leo H. Klaassen (GND PRESENT BUT LINK TARGET DUPLICATE)|psn9
- 123292182|Q18817490
- 124825109|Q1189225|psn9
