
Mapping to existing RDF ontologies (e.g. Schema.org) #94

Open
joepio opened this issue Feb 13, 2022 · 15 comments

Comments

@joepio
Member

joepio commented Feb 13, 2022

Existing RDF ontologies have some problems that Atomic Data solves:

  • No schema validation: they lack consistent machine-readable descriptions for validating data structures. Some have SHACL or SHEX, but it's highly inconsistent, and these descriptions often cannot be discovered from the predicate URL.
  • Inconsistent serialization: Some are TTL, some N3, some RDF/XML, some JSON-LD...
  • Inconsistent protocol: Some adhere to accept headers, some don't.

Read more about atomic & rdf.

But there are many ontologies in existence, and these describe various domains quite accurately. It would be great if we could still get the benefits of Atomic Data without losing the information stored in these existing ontologies.

Some thoughts / challenges:

  • Most ontologies lack schema constraints, whereas Atomic Data requires typed properties. This means we'll probably have to make assumptions about the best datatype to use. This is sometimes not a trivial matter, though: it will almost certainly require both domain insight and technical knowledge.
  • Many ontologies are more like vocabularies / taxonomies, in that they simply describe a thing and not its relationships or superclass / ontological structure. These are simple to describe in Atomic Data. However, Atomic Data does not have the mathematical ontology / OWL-like concepts such as subClassOf or distinctFrom.
  • We should probably use the .ttl files from one of the releases.

Implementation

Add original-url property to Property class

This original-url would be the URL of the RDF predicate.
When serializing to RDF, we could opt in to using this URL.
Conversely, when importing RDF, we could search for Properties having that predicate as their original URL, and conform to the Atomic Data constraints (namely, they must resolve to JSON-AD Properties).

However, this would come with a challenge. If a server has multiple Properties with the same original-url value, the server can't decide which one should be used. Malicious agents might even inject resources in the Server to mess up mappings.

If we have an explicit mapping resource, we can prevent this.

Mapping resource

A resource that contains a bunch of mappings. This can be referred to while importing RDF.

see lenses #102

Credits to @hoijui for sharing many ideas on this topic

@hoijui
Contributor

hoijui commented Feb 14, 2022

The way it would optimally work, in my head, is like this:

Somewhere under an AD sub-domain, there are AD proxies for the 100 or so most common RDF/OWL ontologies out there.
These would have to be auto-generated as much as possible, and the rest should be done in a semi-automated way. For example:

https://github.com/schemaorg/schemaorg/blob/main/data/schema.ttl

would be fed into a script (rdf2ad), together with another file containing a list of propertyName -> dataType mappings. If any propertyType is missing from that mapping, rdf2ad will print an error message and exit 1; the missing mapping then has to be added manually. Doing it this way, we need relatively little manual work, yet can still deal with changes / different versions of the RDF ontologies pretty well.
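The fail-loudly check described above could be sketched like this (function and file names are hypothetical; only the exit-1-on-missing-mapping behavior is the point):

```python
import sys

# Sketch of the rdf2ad check: given the set of properties found in the
# ontology and a user-supplied propertyName -> dataType map, print an
# error and exit 1 when any property lacks a datatype mapping.

def check_datatype_mappings(ontology_properties, datatype_map):
    missing = sorted(p for p in ontology_properties if p not in datatype_map)
    if missing:
        for prop in missing:
            print(f"error: no datatype mapping for property '{prop}'", file=sys.stderr)
        sys.exit(1)

# Example (would exit with status 1, since schema:birthDate is unmapped):
# check_datatype_mappings(
#     {"schema:name", "schema:birthDate"},
#     {"schema:name": "https://atomicdata.dev/datatypes/string"},
# )
```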
... after converting the RDF ontology to AD, it will be hosted under a URL resembling the original URL, for example:

https://github.com/schemaorg/schemaorg/blob/main/data/schema.ttl

-- converts to -->

# using the source-file URL:
https://rdf-mirror.atomicdata.dev/ontologies/github.com/schemaorg/schemaorg/blob/main/data/schema.ttl.ad

# or the original schema IRI (makes more sense, I think -> easier conversion)
https://rdf-mirror.atomicdata.dev/ontologies/schema.org.ad
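The second option (deriving the mirror URL from the schema IRI) could be sketched as follows; the rdf-mirror host and path layout are taken from the example above, not a spec:

```python
from urllib.parse import urlparse

# Derive a hypothetical mirror URL under rdf-mirror.atomicdata.dev
# from an ontology's schema IRI, using its host name.

def mirror_url(schema_iri: str) -> str:
    host = urlparse(schema_iri).netloc or schema_iri
    return f"https://rdf-mirror.atomicdata.dev/ontologies/{host}.ad"

# mirror_url("https://schema.org/")
#   -> "https://rdf-mirror.atomicdata.dev/ontologies/schema.org.ad"
```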

@hoijui
Contributor

hoijui commented Nov 12, 2022

To start brainstorming ideas for how to practically go about getting there,
I will outline the roadmap I have in my head right now:

  1. Write a script (BASH/Python/Rust?) that creates a table of the most commonly used RDF/OWL(2) ontologies,
    with one line per released version of each of these ontologies,
    each line containing at least: IRI, version, raw-data-download-URL

  2. Write a script that syncs the raw-data-download-URLs to the local file-system.

  3. Write a script that collects statistical data over all these ontologies, e.g.:

    • which ontology links to which other
    • ... and how many times
    • both the above in a format suitable to generate a visual graph
    • ...
  4. Start writing a tool (Rust) that converts an RDF/OWL(2) ontology into an AtomicData one.
    At first, it will only contain classes, properties and their connection.

  5. Test the tool in that state on all the ontologies.

  6. Write a tool/script to convert a "user" data-set (i.e. the OKH-LOSH data) to AtomicData,
    in a very much simplified form.

  7. ... and back. -> PoC done!

  8. Improve the tools from steps 4, 6 and 7.
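Step 1's table could be sketched like this; the entries, version number, and CSV layout are illustrative assumptions, not a curated list:

```python
import csv
import io
from dataclasses import dataclass

# One row per released ontology version: IRI, version, raw-data download URL.

@dataclass
class OntologyRelease:
    iri: str
    version: str
    download_url: str

def write_table(releases, fh):
    """Write the release table as CSV to an open file-like object."""
    writer = csv.writer(fh)
    writer.writerow(["iri", "version", "download_url"])
    for r in releases:
        writer.writerow([r.iri, r.version, r.download_url])

releases = [
    OntologyRelease(
        "https://schema.org/",
        "26.0",  # illustrative version number
        "https://github.com/schemaorg/schemaorg/raw/main/data/schema.ttl",
    ),
]
buf = io.StringIO()
write_table(releases, buf)
```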

@joepio
Member Author

joepio commented Nov 28, 2022

Some ideas on how to tackle nr. 4 (assuming you're using schema.org as the ttl source, and prefer writing stuff in Rust). Consider the following as substeps.

  1. Parse the owl data in a graph using an RDF library.
  2. Iterate over all Properties and create Atomic Data Properties from them. You could create JSON-AD strings, but it's probably better to use the atomic_lib::Resource struct with the .set_propval + save methods. See example.rs or browse through the tests for inspiration.
  3. Then iterate over the Classes. Do something similar as above.
  4. Now, export your data. You'll get a JSON-AD file with the generated Atomic Data classes and properties.
  5. Convert the URLs to something like atomicdata.dev/ontologies/schema/something/ID
  6. Open a PR for atomic-data-rust that contains a JSON-AD file with these ontologies.
  7. ???
  8. Profit!
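Substeps 2-5 could be sketched like this in Python, operating on already-extracted property descriptions rather than a real RDF graph. The atomicdata.dev URLs and the range -> datatype table are illustrative assumptions for the sketch, not the atomic_lib API:

```python
# Turn extracted RDF property descriptions into JSON-AD-style Property
# resources. Unknown ranges fall back to the string datatype.

DATATYPE_MAP = {
    "http://schema.org/Text": "https://atomicdata.dev/datatypes/string",
    "http://schema.org/Date": "https://atomicdata.dev/datatypes/date",
}

def to_ad_property(predicate_iri, shortname, description, range_iri):
    """Build a JSON-AD-style dict for one converted Property (illustrative URLs)."""
    return {
        "@id": f"https://atomicdata.dev/ontologies/schema/property/{shortname}",
        "https://atomicdata.dev/properties/shortname": shortname,
        "https://atomicdata.dev/properties/description": description,
        "https://atomicdata.dev/properties/datatype": DATATYPE_MAP.get(
            range_iri, "https://atomicdata.dev/datatypes/string"  # fallback
        ),
    }

prop = to_ad_property(
    "http://schema.org/name",
    "name",
    "The name of the item.",
    "http://schema.org/Text",
)
```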

@hoijui
Contributor

hoijui commented Dec 17, 2022

I did some initial research for lists of ontologies,
and there seem to be some good options! :-)

  1. https://archivo.dbpedia.org/list
    This is basically perfect.
    The only thing left to do
    is to choose which ones we want,
    or to define a filter that does that for us.
    I am pretty sure we would not want an ontology that is not available there,
    as it scans and crawls the web every 8h,
    and thus catches most of what is out there.
  2. https://lov.linkeddata.es/dataset/lov/sparql
    Similarly useful to 1., but older,
    with fewer vocabularies/ontologies,
    no filtering options, less other meta-data,
    and no info about how often it is updated.
    Allows fetching the data as a dump,
    a single *.tar.gz file.
  3. https://github.com/zazuko/rdf-vocabularies/tree/master/ontologies
    A set of data-files of the most common ontologies, created for a Node.js library, and regularly updated.
    This would be even easier to get and use.
    It is explicitly made so it would be easy for other projects
    to use that data as well.
  4. https://github.com/ruby-rdf/rdf-vocab/blob/develop/lib/rdf/vocab.rb
    Similar to 3., just for a Ruby library instead,
    and a bit harder to use / extract the data from.

@jonassmedegaard

Perl-based RDF libraries commonly use one of these methods for stable references to ontologies:
a) a dump of http://prefix.cc/popular/all at a fixed point in time
b) a manually curated subset of a)

Since it sounds like you want to restrict by certain qualities (e.g. "OWL-based", or "reasonably popular"), I suggest that you do b). If it then turns out that prefix.cc does not cover some ontologies you fancy, there is nothing stopping you from changing the rules of your curation to include non-prefix.cc ontologies (but you might also consider simply registering your pet ontologies at prefix.cc and bumping your fixed time to a moment after your registration).
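Option b) could be sketched as follows, assuming the prefix.cc dump has already been parsed into a prefix -> namespace map (the dump excerpt and allow-list below are illustrative):

```python
# Start from a fixed dump of prefix.cc's popular list and keep only a
# manually curated subset of prefixes.

PREFIX_CC_DUMP = {
    "rdf":    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs":   "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "https://schema.org/",
    "foaf":   "http://xmlns.com/foaf/0.1/",
}

# The curated allow-list: everything else in the dump is ignored.
CURATED = {"rdf", "rdfs", "schema"}

def curated_namespaces(dump, allow_list):
    """Filter the full dump down to the curated prefixes."""
    return {p: iri for p, iri in dump.items() if p in allow_list}
```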

@jonassmedegaard

jonassmedegaard commented Dec 17, 2022

All of prefix.cc is currently ~3000 ontologies.

The perl module RDF::NS::Curated provides a curated set of ~65 ontologies (as I recall it is simply "the most popular at prefix.cc at the time" but if curious you/I can simply ask Kjetil).

@hoijui
Contributor

hoijui commented Dec 18, 2022

ohhh perfect, thank you Jonas! :-)
Sounds like I'll try that ... maybe those same ~65 then.

@hoijui
Contributor

hoijui commented Oct 20, 2023

I applied to NLnet for funding to do this on my own about a year ago, and was refused sometime in Q1 this year.
Since then, no new attempts from my side.

I still very much would like to have this mapping capability.
Right now, I am starting a new project with Lynn from VF,
creating an ontology for OSH.
I would love to do it in AD instead of RDF,
but can't, because this is missing.

@hoijui
Contributor

hoijui commented Oct 20, 2023

@joepio Do you have an idea for how to write an RDF ontology so that it would be easily mappable to AD, now and in the future,
once this mapping is implemented?

Thinking of: things to avoid using in RDF, and maybe extra properties to favor, or to use on all classes/properties in the ontology. I guess the main thing would be the validation/data-type part, right?
Asking, of course, so we can take it into consideration now that we are starting to write our ontology. (We already did start, but it is still very small and completely mouldable.)
It would also be good to know so that, at some point, we have a few such AD-mapping-ready RDF ontologies, ready to test a mapping implementation once development starts on it.
In the best case, such extra properties for AD would be usable / make sense even disregarding AD,
but that is less important.

@joepio
Member Author

joepio commented Oct 20, 2023

Good question @hoijui!

I think most RDF ontologies / shacl shapes should be mappable to Atomic Data.

Some things to keep in mind:

  • avoid relying on language strings wherever possible, as translations will work very differently in AD
  • every form of n-to-many list will be awkward in RDF, so I'm not sure what to suggest here.

@hoijui
Contributor

hoijui commented Oct 20, 2023

What about the data validation... would AD data validation map to SHACL, or to an RDF property specially made for this (e.g. admapping:datatype)?

(something went wrong with the link in your comment)

@hoijui
Contributor

hoijui commented Oct 20, 2023

This gives some hints, I guess:
https://docs.atomicdata.dev/interoperability/rdf.html?highlight=language%20tags#convert-atomic-data-to-rdf

So language tags working differently... is that really an issue when they are used, given there is software (the code doing the mapping) in between?
I am not talking about making the RDF valid AD, just about it having the necessary data to map it to AD.

@joepio
Member Author

joepio commented Oct 20, 2023

Fixed the link!

Yeah gotcha.

I think we can probably map pretty much everything at some point; like, we can always fall back to the 'string' datatype.
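That fallback idea could be sketched as follows; the XSD -> AD datatype table is an assumption for illustration, and only the unconditional string fallback is the point:

```python
# Map common XSD literal datatypes to Atomic Data datatypes, falling
# back to string for anything unrecognised.

XSD = "http://www.w3.org/2001/XMLSchema#"
AD = "https://atomicdata.dev/datatypes/"

KNOWN = {
    XSD + "string":   AD + "string",
    XSD + "boolean":  AD + "boolean",
    XSD + "integer":  AD + "integer",
    XSD + "dateTime": AD + "timestamp",
}

def ad_datatype(xsd_datatype: str) -> str:
    """Return the AD datatype for an XSD datatype, or string as fallback."""
    return KNOWN.get(xsd_datatype, AD + "string")
```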

@hoijui
Contributor

hoijui commented Oct 20, 2023

ok.. I guess.. I'll not do anything special for now then.
thanks!

@joepio
Member Author

joepio commented Oct 20, 2023

gotcha :)
