citationIri concerns #452

Open
VladimirAlexiev opened this Issue Apr 21, 2016 · 4 comments

VladimirAlexiev commented Apr 21, 2016

I have a few concerns about citationIri. It's trying to make a URL for the citation from its properties:

  1. @jimkont please confirm that even though it's a for loop, it'll execute no more than once
  2. What if neither of the cases match? We still need a URL, so we must make a local node (see next)
  3. Since the citation may have local props (eg "pages"), it's not quite correct to use a global URL, unless it reflects all these local props. In such case we need to make a local node, which refers to the global URL (eg using dct:isPartOf)
  4. For ISBN and ISSN, how are we sure that they're available on GBooks?
  5. More cases should be added, eg if there's an "arxiv" id, then make an http://arxiv.org URL
  6. @nfreire: TEL has some 109M bibliographic records (adding 60M more), maybe we can use their URLs? How are they identified? BTW they use RDA, so that should be considered for dbpedia/mappings-tracker#79
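
Items 2 and 5 can be illustrated with a small sketch. This is Python for illustration only, not the Scala code in CitationExtractor; the field names, their order, and the URL templates are all assumptions, and `citation_iri` here is a hypothetical stand-in. It tries each known identifier in turn and returns None when nothing matches, which makes the "no case matched" situation explicit for the caller.

```python
from typing import Mapping, Optional
from urllib.parse import quote

# Hypothetical (field name, IRI template) pairs, tried in this order.
# Field names and templates are assumptions for illustration only.
ID_TEMPLATES = [
    ("arxiv", "http://arxiv.org/abs/{}"),
    ("doi", "http://dx.doi.org/{}"),
    ("pmid", "http://www.ncbi.nlm.nih.gov/pubmed/{}"),
    ("isbn", "http://books.google.com/books?vid=ISBN{}"),
    ("issn", "http://books.google.com/books?vid=ISSN{}"),
]


def citation_iri(props: Mapping[str, str]) -> Optional[str]:
    """Return an IRI built from the first identifier present, else None."""
    for field, template in ID_TEMPLATES:
        value = props.get(field, "").strip()
        if value:
            # Keep "/" and ":" unescaped; DOIs and old-style arXiv ids use them.
            return template.format(quote(value, safe="/:"))
    # No identifier matched: the caller still has to mint a local node.
    return None
```

Adding a new identifier then means adding one entry to the table rather than another branch.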

jimkont Apr 22, 2016

Member

> @jimkont please confirm that even though it's a for loop, it'll execute no more than once

Yes, this for loop executes no more than once.

> What if neither of the cases match? We still need a URL, so we must make a local node (see next)
> Since the citation may have local props (eg "pages"), it's not quite correct to use a global URL, unless it reflects all these local props. In such case we need to make a local node, which refers to the global URL (eg using dct:isPartOf)

This needs some investigation.

> For ISBN and ISSN, how are we sure that they're available on GBooks?

I added this mostly to provide a stable ID.

> More cases should be added, eg if there's an "arxiv" id, then make an http://arxiv.org URL

Sounds good :)

> @nfreire: TEL has some 109M bibliographic records (adding 60M more), maybe we can use their URLs? How are they identified? BTW they use RDA, so that should be considered for dbpedia/mappings-tracker#79

Not sure whether that can be done directly at extraction time or needs a post-processing step, but we are open to all suggestions.

jimkont Jun 13, 2016

Member

I improved the citation IRI generation so that it covers most of the IDs I could find in the citation template documentation:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CitationExtractor.scala#L271

The open problem is what to do with citations that have no ID or URL. For now these are skipped, but I could create a UUID for them. What do you think?
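
For reference, the UUID option can be sketched in two ways. This is a Python illustration, not the extractor's Scala code; the namespace IRI and both function names are hypothetical. A random UUID (version 4) changes on every extraction run, while a name-based UUID (version 5) is reproducible when it is derived from the same inputs. Here the inputs are assumed to be the page title and the raw citation text.

```python
import uuid

# Hypothetical namespace for citation nodes; any fixed UUID would serve.
CITATION_NS = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/citation/")


def random_citation_id() -> str:
    """Version 4 UUID: unique, but different on every extraction run."""
    return str(uuid.uuid4())


def stable_citation_id(page_title: str, raw_citation: str) -> str:
    """Version 5 UUID: the same inputs always yield the same identifier."""
    name = page_title + "\n" + raw_citation
    return str(uuid.uuid5(CITATION_NS, name))
```

Which inputs feed the name-based variant decides whether two articles citing the same source share an identifier, so that choice is part of the design question raised here.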

VladimirAlexiev Jun 13, 2016

Member

Both the doi and jstor cases check the field "doi". That's not wrong, as there are many DOI resolvers, see https://www.wikidata.org/wiki/Property:P356#P1630. But the second case is ineffectual, no?

Consider item 3 above. If a book or journal is cited in 1000 Wikipedia articles, each will use the same ISBN or ISSN, and you'll generate the same citationIri. But if each cites a different chapter or article, it will have a different title, pages, authors, etc. You'll emit all these statements against the same citationIri, thus jumbling them together.

Therefore all citations need their own URL, except those for which we can guarantee they cite individual items (arxiv, pmc, pubmed; DOI can reference either a book or an article so is not individual). Then we link this "own" node to the book or article, eg using dct:isPartOf.

"Own URL" could mean:

  • UUID
  • IntermediateNode URL, eg dbpedia.org/resource/<entity>__cite1 etc
  • URL made from a hash of the key citation values (eg ISSN, authors, pages...). This is the best solution since it may facilitate reusing citations between articles.
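
The hash-based option, together with the dct:isPartOf link, could look roughly like the sketch below. It is Python for illustration only; the base namespace, the choice of key fields, and the function names are assumptions, not what the extractor does. Two citations that share an ISBN but differ in pages get different nodes, while identical citations in different articles get the same node.

```python
import hashlib
from typing import List, Mapping, Optional, Tuple

BASE = "http://example.org/citation/"  # hypothetical namespace
DCT_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
# Assumed set of "key" fields; the real selection is an open design choice.
KEY_FIELDS = ("isbn", "issn", "doi", "authors", "title", "pages")


def own_citation_iri(props: Mapping[str, str]) -> str:
    """Build a local IRI from a hash of the key citation values."""
    parts = [
        f"{key}={props[key].strip().lower()}"
        for key in KEY_FIELDS
        if props.get(key)
    ]
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return BASE + digest[:32]


def link_to_global(
    props: Mapping[str, str], global_iri: Optional[str]
) -> List[Tuple[str, str, str]]:
    """Emit a (subject, predicate, object) triple linking the local node
    to the global book or journal IRI, if one is known."""
    node = own_citation_iri(props)
    if global_iri is None:
        return []
    return [(node, DCT_IS_PART_OF, global_iri)]
```

Local properties such as pages and title would then be emitted against the local node, and only the dct:isPartOf statement would point at the shared IRI.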

jimkont added a commit that referenced this issue Jun 14, 2016

jimkont Jun 14, 2016

Member

I kept the existing naming convention for now but included a hash-based IRI for citations that have no ID.
What you say makes sense, but it will be better handled once this is moved to the mappings wiki; otherwise it requires a lot of hardcoding.

jimkont added a commit that referenced this issue Jun 14, 2016
