Specify the type of sources and targets #16

Closed
nichtich opened this Issue Jun 22, 2012 · 10 comments

@nichtich
Gemeinsamer Bibliotheksverbund member

There may be a need to identify the "kind of" resources identified by source URIs and/or target URIs (there is no "kind of" identifier, as all identifiers are URIs). This information could simply be put in the #DESCRIPTION meta field; the concept of a "kind of" thing is rather fuzzy anyway. A formal solution would be to introduce something like #SOURCETYPE and/or #TARGETTYPE, for instance to state that all entities linked to/from are people (foaf:Person). For instance:

#PREFIX: http://d-nb.info/gnd/
#TARGET: http://example.org/{ID}
#SOURCETYPE: http://xmlns.com/foaf/0.1/Person
#TARGETTYPE: http://purl.org/ontology/bibo/Document

115541543

is mapped to the RDF graph:

<http://d-nb.info/gnd/115541543> a foaf:Person ;
  rdfs:seeAlso <http://example.org/115541543> .
<http://example.org/115541543> a bibo:Document .
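
A minimal Python sketch of the mapping above, assuming the meta fields have already been parsed into a dictionary. The function name `beacon_to_triples` and the dictionary layout are hypothetical and not part of any Beacon specification; the sketch only mirrors the example and emits plain (subject, predicate, object) tuples rather than using an RDF library.

```python
# Full URIs for the two predicates used in the example graph.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def beacon_to_triples(meta, ids):
    """Expand meta fields and plain IDs into (subject, predicate, object) tuples.

    `meta` is a hypothetical dict of already-parsed meta fields
    (keys without the leading '#').
    """
    triples = []
    for ident in ids:
        source = meta["PREFIX"] + ident                  # prefix is a plain string
        target = meta["TARGET"].replace("{ID}", ident)   # fill the {ID} placeholder
        if "SOURCETYPE" in meta:
            triples.append((source, RDF_TYPE, meta["SOURCETYPE"]))
        triples.append((source, RDFS_SEE_ALSO, target))
        if "TARGETTYPE" in meta:
            triples.append((target, RDF_TYPE, meta["TARGETTYPE"]))
    return triples


meta = {
    "PREFIX": "http://d-nb.info/gnd/",
    "TARGET": "http://example.org/{ID}",
    "SOURCETYPE": "http://xmlns.com/foaf/0.1/Person",
    "TARGETTYPE": "http://purl.org/ontology/bibo/Document",
}

for triple in beacon_to_triples(meta, ["115541543"]):
    print(triple)
```

Both type fields are treated as optional here, so a file without them would yield only the rdfs:seeAlso link.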
@gymel

No objection to identifying the type of resources addressed by adding #...TYPE URIs.

Typically the canonical identifiers used are not URIs. There undoubtedly exist "things" like ISBNs outside the semantic web, and identifiers under this standard do not contain the string "urn:isbn:", although this is an officially registered namespace for ISBNs. And there are things like IMDb IDs where you have persistent URLs, but any kind of URI you make up will be unofficial at the moment.

Thus #SOURCETYPE may be something identifying "books" (or some restriction like "ebooks" or "books about animals"), and #PREFIX might be "urn:isbn:" (for what it's worth; as you clarified in issue #15, it is a string, not a URI), but a statement about the kind of identifiers we use for the mapping, namely ISBNs, is still lacking. This might be stated as one of the following (I'm not sure whether a reduced namespace prefix or a specification document is better suited in the absence of a "standards vocabulary"):

#SCHEME: urn:isbn
#SCHEME: urn:iso:std:iso:2108:2005
#SCHEME: http://www.isbn-international.org/
#SCHEME: http://www.rfc-editor.org/rfc/rfc3187.txt
#SCHEME: urn:ietf:rfc:3187

(and this time "urn:isbn" is a URI).

@nichtich

There is no need to state the "kind of an identifier" because all identifiers MUST be URIs. This is not a bug but intentional. It's not a problem because Beacon is about links, not about identifiers. In the absence of official URI namespaces, just use an unofficial namespace. Nobody is interested in plain identifiers anyway, but in what these identifiers identify and what is linked by them.

@nichtich nichtich closed this Jul 2, 2012
@gymel

I strongly disagree. If it were all about URIs, then VoID Linksets would be everything you need, and Beacon files in particular would just be an attempt to backport meaningful RDF into yet another silly serialization as text files.

Above I had hoped to outline clearly enough that at least some "classical identifiers" are not born as URIs and still do not have official unique URIs. And even for those that have them, almost no existing software uses these URIs; it almost always reduces them to just plain old "numbers". And these (non-URI) identifiers are more than an enumeration: they form a system governed by assignment rules, syntax specifications and so on.

Therefore there is a huge gap between specifying a #PREFIX /string/ which turns every individual identifier into a (private or official) URI and additionally stating (by a #SCHEME /URI/) that the unprefixed numbers used in the data section of a Beacon File are not arbitrary but rather taken from an established identifier system according to its semantics.

Of course, the spec must be careful to distinguish "identifiers" in the URI sense from "classical identifiers from systems" not yet completely transformed into the semantic web framework.

@nichtich

Please send me a concrete pull request to modify the actual specification.

I don't get the use case of "established identifier systems" without a URI prefix. At least for mapping to HTML you get a URL as target, which conforms to URI syntax. If you don't have a URI prefix for the source, you need to somehow communicate which kind of identifier you are using anyway. Let's say there is an established identifier system called "gnarz". How shall a user of your link dump know that you use this system? The burden of somehow configuring that "this Beacon uses gnarz-ids" is the same as configuring that all URIs starting with http://example.org/gnarz/ (or whatever prefix was used in the link dump) are gnarz-ids.

@gymel

I'll give some examples instead, culminating in directions about handling the famous gnarz-IDs.

I. There is an official HTTP-URI for the identifiers used:
#PREFIX: http://d-nb.info/gnd/
This prefix (we remember that it is a string) does "work" when used as a URL, but you'll be redirected to a page where you'll have trouble learning anything about "GND" (and it is the intention of the Beacon file to state: "Hey! I'm formulated by means of GND identifiers, which are the identifiers used by the GND. The following URI is the identifier of the GND, or at least gives you the opportunity to learn more about the GND and what its identifiers look like").

Unfortunately there is no official URI or persistent URL for "GND" (either seen as a dataset or as an "effort" consisting of objectives and rules) and I have many choices to provide a web-operational URI "about" GND. I'd probably opt for
#SCHEME: http://thedatahub.org/dataset/dnb-gemeinsame-normdatei
as a compromise between giving a stable link with sufficient information and pointing out that "The GND" is more than a published RDF dataset.

This case is typical for "modern" URIs which align web-friendly according to registered domain names at the price that there is no IANA-delegation chain for their identifiers. VIAF-Identifiers fall in this category.

II. Official non-HTTP-URI
#PREFIX: urn:isbn:
this already has very strong semantics but unfortunately does not resolve to anything. To learn about the international ISBN system I already had to admit further above that there is no canonical web URI for all purposes and it is quite arbitrary to decide between
#SCHEME: urn:isbn
#SCHEME: urn:iso:std:iso:2108:2005
#SCHEME: http://www.isbn-international.org/
#SCHEME: http://www.rfc-editor.org/rfc/rfc3187.txt
#SCHEME: urn:ietf:rfc:3187
but I would opt for
#SCHEME: http://www.isbn-international.org/
since this is the address the standards document provides where to look for updated information

Further examples are BNF-identifiers (a subspace of info:ark) and LCNAF (info:lccn), info:oclcnum and so on with the additional obstacle that these info-URIs are quite outdated and one certainly would prefer
#PREFIX: http://catalogue.bnf.fr/ark:/12148/
or even
#PREFIX: http://data.bnf.fr/

III. The World according to the Gnarz community
Not much is known about this community, but the facts are that they have been industrious for decades and heaped up lots of interesting content, yet have still to arrive in the kind of web which emerged after 2000 AD. But since for a long time they have been publishing the only internationally accepted bibliography in their field, they have influenced international normalization a great deal. In a heroic effort they have recently published their vast resources on the World Wide Web, fulfilling international demand. If you happen to find one of their identifiers, say 216640-9, you can construct a query for the resource as follows:
http://dispatch.opac.d-nb.de/DB=1.1/CMD?ACT=SRCHA&TRM=216640-9&IKT=8506
Thus we have a /pattern/ for individual (range) resources, but this cannot necessarily be a #PREFIX (in the "normal" meaning of being a prefix string). Therefore we would have to invent a prefix of our own. On the other hand, we now have complete freedom to choose this prefix so that it is also meaningful as a URI, e.g. http://www.gnarz.org/ or http://zdb-opac.de/ or http://www.zeitschriftendatenbank.de/ or http://thedatahub.org/dataset/zdb

In this situation we still have to use #PREFIX in the extended meaning of providing a URL pattern (one probably would not call the queries above URIs, since so many possible searches in the database yield the same result), although it might be cleaner to use the made-up prefix as #PREFIX and give the hint on how to actually reach the (source) data on the web only in a description field or so. Then (in the case where we drop working URLs on the record level) we are in the comfortable situation of being able to say everything with the #PREFIX and indeed do not need an additional #SCHEME statement.

@nichtich

Unfortunately there is no official URI or persistent URL for "GND

There are at least two of them: http://lobid.org/organisation/DE-588 and http://de.dbpedia.org/resource/Gemeinsame_Normdatei. Crafting yet another identifier with http://thedatahub.org/dataset/dnb-gemeinsame-normdatei does not solve anything. In the current specification one can say:

#NAME: Gemeinsame Normdatei (GND)
#INSTITUTION: http://lobid.org/organisation/DE-588

#SCHEME: ...

The "#SCHEME" meta field is meaningless to me unless you provide an exact definition, say where exactly to put it in the specification, and state what to change there instead.

#PREFIX: urn:isbn:
this already has very strong semantics but unfortunately does not resolve to anything.

It's not the purpose of a URI to resolve to anything, but to identify something. If one wants to know about the nature of a URI, he or she just has to traverse the URI hierarchy. In this case you clearly end up at RFC 3187. This is how URI is defined. Again, it does not help to suggest alternative forms of ISBN as URI that nobody uses anyway. Just stick to the most used form for interoperability. I think ISBN is not a good example to illustrate your point.

Further examples are BNF-identifiers (a subspace of info:ark) and LCNAF (info:lccn), info:oclcnum and so on with
the additional obstacle that these info-URIs are quite outdated

But there at least exists a URI form, so why not use it instead of providing meaningless sequences of characters and hoping that applications will guess their meaning? There is owl:sameAs to map from one URI form to the other, but there is no way to map from an unknown identifier to a known type unless you invent your own mechanism. Don't reinvent the wheel if the problem has been solved with URIs.

III. The World according to the Gnarz community

If there is no URI form of gnarz identifiers, one has to create it. For instance:

#PREFIX: http://purl.org/net/gnarz/
#TARGET: http://dispatch.opac.d-nb.de/DB=1.1/CMD?ACT=SRCHA&TRM={ID}&IKT=8506

This is not less usable than for instance

#SCHEME: http://purl.org/net/gnarz/

or

#BYTHEWAYTHISIDENTIFIERSARE: gnarz

In short: If there is a URI schema of identifiers, then use it. If there are multiple URI schemas, use the most popular form. If there is no URI schema, define one and propagate it so others can benefit from your data. Without a known URI schema, identifiers are not usable without human intervention anyway.
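
The argument can be illustrated with a small Python sketch. It assumes a minted prefix (http://purl.org/net/gnarz/ from the example above) and shows that a consumer needs nothing beyond that prefix string to recognize a gnarz identifier and to expand the #TARGET pattern. The helper names `source_uri`, `local_id` and `target_url` are hypothetical, and plain string substitution without percent-encoding is an assumption of this sketch, not a statement about the specification.

```python
PREFIX = "http://purl.org/net/gnarz/"
TARGET = "http://dispatch.opac.d-nb.de/DB=1.1/CMD?ACT=SRCHA&TRM={ID}&IKT=8506"


def source_uri(prefix, ident):
    """Build the source URI by plain string concatenation."""
    return prefix + ident


def local_id(prefix, uri):
    """Return the plain identifier if `uri` lies under `prefix`, else None."""
    if uri.startswith(prefix) and len(uri) > len(prefix):
        return uri[len(prefix):]
    return None


def target_url(template, ident):
    """Fill the {ID} placeholder of a target pattern."""
    return template.replace("{ID}", ident)


uri = source_uri(PREFIX, "216640-9")
print(uri)                                   # the minted source URI
print(local_id(PREFIX, uri))                 # the plain identifier recovered from it
print(target_url(TARGET, "216640-9"))        # the query URL from the example above
print(local_id(PREFIX, "http://d-nb.info/gnd/115541543"))  # not a gnarz URI
```

Recognizing "this is a gnarz identifier" then reduces to a prefix check on the URI, which is the configuration burden the comment compares against a separate #SCHEME field.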

@gymel

ad I.
O.k., now we already have three existing URIs for "GND", all from honest but unofficial efforts to list datasets. And bringing the ISIL "DE-588" into the game, there should be another URI for DE-588 within the ISIL dataset. But I see that the GND example is flawed or at least special, since by virtue of ISILs it treats itself as an organisation.

ad II.
Well no, RFC 3187 is titled "Using International Standard Book Numbers as Uniform Resource Names" thus this RFC ties the meaning of the ISBN standard to the urn:isbn namespace: It defines urn:isbn URIs but not ISBNs by themselves.

ad III.
Wrong usage of #TARGET: This is not the Beacon File mapping gnarz identifiers to gnarz resource URLs but a Beacon file exploring / linking to an "external" dataset (my gnarzisms) by means of gnarz identifiers. Thus our target is
#TARGET: http://mygoals.example.org/gnarz-resolver/{ID}
and - as always in our context - {ID} is a placeholder for gnarz identifiers, not gnarz URIs.
This mapping has meaning independent of our knowledge of gnarz resource URLs, they don't even have to exist and if they exist they might not be accessible by gnarz identifiers (cf. discussion on IFIS-ID and the impossibility to use them).

Maybe we cannot do better than VoID: Section 4.2 http://www.w3.org/TR/void/#pattern generalizes turtle prefixes (void:uriSpace) to Regex patterns (void:uriRegexPattern) /in case/ the URIs have something in common. The datasets as a whole are identified by URIs preferably "provided by the original data provider". If there is none, one should "mint" one in one's own namespace /and/ include a link to /the/ homepage of the dataset (the VoID primer argues that this at least could help with "discovery"). In our setting #PREFIX takes care of the former, but at the moment there is no meta element taking care of the latter (i.e. identifying the "source" dataset - and accidentally we also somehow lost the identifying URI for the target dataset, leaving us with the #NAME string).
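
The difference between the two VoID notions mentioned here can be sketched in Python: a void:uriSpace is a plain prefix test, whereas a void:uriRegexPattern can also describe URLs in which the identifier sits in the middle, as in the gnarz query URLs. The constants and the regular expression below are illustrative assumptions written for this sketch and are not taken from any published VoID description.

```python
import re

# Prefix-style description, comparable to void:uriSpace.
URI_SPACE = "http://d-nb.info/gnd/"

# Pattern-style description, comparable to void:uriRegexPattern; the
# identifier appears inside the query string rather than at the end.
URI_REGEX = r"^http://dispatch\.opac\.d-nb\.de/DB=1\.1/CMD\?ACT=SRCHA&TRM=([^&]+)&IKT=8506$"


def in_uri_space(uri):
    """True if the URI starts with the prefix string."""
    return uri.startswith(URI_SPACE)


def id_from_pattern(uri):
    """Return the identifier embedded in a matching URL, else None."""
    match = re.match(URI_REGEX, uri)
    return match.group(1) if match else None


print(in_uri_space("http://d-nb.info/gnd/115541543"))
print(id_from_pattern(
    "http://dispatch.opac.d-nb.de/DB=1.1/CMD?ACT=SRCHA&TRM=216640-9&IKT=8506"))
print(id_from_pattern("http://d-nb.info/gnd/115541543"))
```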

But I still think the Beacon situation is not identical to the VoID situation: although you can construct a "source dataset", the primary concern is to assign (or create) target URIs from source /identifiers/. #PREFIX strings are a means to transform this into something which can be expressed in RDF, but this often introduces an element of volatility not present when sticking to the plain old identifiers. Furthermore, the implicit identification of the #PREFIX string with a namespace URI, which is to be taken as the "canonical" URI representing the source dataset, which in turn claims to explain the identifiers used, is neither clean nor does it work (remember the days when http://d-nb.info/gnd/ could not tell us whether we were talking about GKD or SWD?).

Admittedly Beacon has other serializations than RDF/XML, but we define #PREFIX to be a string (or RDF literal) and therefore we must never use its content as a URI (and maybe we should rethink whether PREFIX should be allowed to be a URI template). Now:

Question 1: Can we always identify the two datasets involved by "vendor supplied" URIs or at least their respective homepages?

Question 2: If yes, is this valuable enough to justify meta fields for this or these? (I understand #HOMEPAGE links to an external description of the /linkset/, which should include, perhaps targeted at human readers, some basic information on the datasets involved.)

Question 3: In cases of a very artificially constructed "source dataset" (think of gnarz as the numbers in a phone directory from 1920 not yet digitized) can/should we somehow relax the "source dataset" URI to something more strongly relating to the identifiers used?

Question 4: Should we stress the more formal aspects of the identifiers as textual data e.g. by optionally providing an .xsd description of their form? The URL of this description could serve as the URI of question 3?

Actually, I was quite fond of the old notion
#FORMAT: PND-BEACON
#PREFIX: http://d-nb.info/gnd
where the "PND" was the content of your new proposal #BYTHEWAYTHISIDENTIFIERSARE. Of course it's not a URI, but at least it is some bit of text telling us what the domain of the mapping is about...

@nichtich

Please provide specific changes and separate different issues instead of a general discussion. Issue #16 is solved by introduction of #SOURCETYPE and #TARGETTYPE. I opened issue #20 for how to deal with non-URI identifiers and issue #21 for what about the source dataset (question 2), you may create additional issues for the other questions.

@nichtich

The proposed #SCHEME meta field is now introduced as #SOURCE (see issue #21). However, I'll remove #SOURCETYPE and #TARGETTYPE to simplify the specification; they are nice to have but just too much.

@nichtich nichtich reopened this Aug 21, 2012
@nichtich

This does not fully solve this request, but the introduction of #SOURCESET and #TARGETSET at least covers some use cases as one can identify the dataset that links point from and to. Everything else would make BEACON even more complex, so I'll close and reject additional requests.

@nichtich nichtich closed this Nov 30, 2012