
Add OpenHistoricalMap backend #1220

Open · 1ec5 opened this issue Dec 28, 2023 · 18 comments
1ec5 commented Dec 28, 2023

OpenHistoricalMap has its own Overpass API instance but no SPARQL endpoint. QLever would be a valuable tool for OHM, because most historical mapping is centered around features that are notable enough for Wikidata items or are depicted in images on Wikimedia Commons.

OHM data is modeled similarly to OSM data; OHM tags are a superset of OSM tags. OHM triples should use separate prefixes like ohmkey: and ohmnode: to avoid polluting OSM queries with OHM results. QLever’s support for XSD dates and SPARQL date functions should be adequate for now; EDTF date parsing would be nice, but I have no idea how that would work anyway.
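For illustration, a minimal sketch of a query under such a prefix scheme (the prefix IRIs here are hypothetical placeholders, not an existing scheme):

PREFIX ohmkey: <https://www.openhistoricalmap.org/wiki/Key:>   # hypothetical IRI
PREFIX ohmnode: <https://www.openhistoricalmap.org/node/>      # hypothetical IRI
SELECT ?feature ?start WHERE {
  # Only OHM triples match; osmkey:-prefixed OSM data stays untouched
  ?feature ohmkey:start_date ?start .
}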

OHM has its own vector map infrastructure. Petrimaps is impressive, but it would need to be integrated with OHM vector maps somehow.

Previously: Sophox/sophox#22.

patrickbr (Member) commented Jan 11, 2024

I just looked into this, and I think this is a great idea. The OHM dataset is fairly small compared to the OSM dataset (the latest planet file is only around 800 MB), so there is no technical reason not to add this data. I think it would even make sense to include the OHM dataset in our regularly updated site https://osm2rdf.cs.uni-freiburg.de/.

I will build a TTL file from the latest OHM planet file and publish it under https://osm2rdf.cs.uni-freiburg.de/ttl/ohm.planet.osm.ttl.bz2. @hannahbast, if you have time, you can then set up a QLever instance from this data.

hannahbast (Member) commented:

@patrickbr I agree, and I also tried to build a TTL file myself using osm2rdf, but failed. I don't remember the error message. But maybe you'll have more luck.

rwelty1889 commented:

I appreciate this work. It goes directly to some experiments I'm doing that will hopefully be presented (at least in part) at SOTM US 2024 later this year.

patrickbr (Member) commented Jan 13, 2024

This is taking a bit longer than expected, for three reasons:

  1. The OHM planet dataset contains at least one invalid node location, and until now we did not properly skip invalid coordinates in several places (this was never a problem, as the OSM dataset has so far never contained invalid coordinates).
  2. The OHM dataset has a completely different area "structure" than the OSM dataset: we have many large areas that are nearly equivalent (borders evolving over time). Roughly speaking, if we have not one border of Germany / the Holy Roman Empire but N of them going back to Charlemagne, every geometric contains/intersects check against "Germany" takes N times as long as with the OSM dataset.
  3. We had an experimental branch of osm2rdf (with lower memory usage) running on our conversion machine, and this branch had a subtle bug, which is now fixed.

The weekly update starts today at 17:56 as scheduled, and will now also convert the OHM dataset and publish it on https://osm2rdf.cs.uni-freiburg.de. I expect it to be finished by Wednesday or Thursday.

hannahbast (Member) commented:

@patrickbr Awesome + looking forward to it!

1ec5 (Author) commented Jan 13, 2024

Thank you for your efforts in producing this file!

The OHM planet dataset contains at least one invalid node location, and until now we did not properly skip invalid coordinates in several places (this was never a problem, as the OSM dataset has so far never contained invalid coordinates).

Interesting, I don’t think there should be any difference between OSM and OHM in terms of the kinds of coordinates that the API would allow into the database. Maybe it indicates a problem in planet generation? Do you recall anything about these nodes? I downloaded the latest planet (planet-240113_0000.osm.pbf) and ran the following commands but didn’t see anything outside the global bounding box. I wonder if that’s because Osmium just short-circuits when extracting the global bounding box.

# Keep only nodes from the planet file
osmium tags-filter planet-240113_0000.osm.pbf -o nodes.osm.pbf 'n/'
# Clip to the global bounding box; any out-of-range nodes would be dropped here
osmium extract -b -180,-90,180,90 nodes.osm.pbf -o nodes-world.osm.pbf
# Compare the two files; any difference would reveal invalid node locations
osmium diff nodes.osm.pbf nodes-world.osm.pbf

The OHM dataset has a completely different area "structure" than the OSM dataset: we have many large areas that are nearly equivalent (borders evolving over time). Roughly speaking, if we have not one border of Germany / the Holy Roman Empire but N of them going back to Charlemagne, every geometric contains/intersects check against "Germany" takes N times as long as with the OSM dataset.

Yes, since you’re precomputing area containment, this will be the dataset’s main source of complexity. The most severely “coincident” case I know of is the city boundary of San José, California with over 1,100 iterations, though maybe QLever will tell me of a larger one. 🙂
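For example, a rough sketch (assuming the osmkey: predicates that osm2rdf currently emits) that counts boundary relations sharing a name, as a proxy for the number of temporal iterations:

PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
SELECT ?name (COUNT(*) AS ?iterations) WHERE {
  # Boundary relations that share a name are likely iterations of the same border
  ?boundary osmkey:boundary "administrative" ;
            osmkey:name ?name .
}
GROUP BY ?name
ORDER BY DESC(?iterations)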

By the way, there’s an impending import of about 17,700 more boundary relations within the United States. We anticipate that many more boundaries will be uploaded over time at the local level, and also in countries that currently have poor coverage, but most of them won’t be as pathological as the ones you’ve encountered.

patrickbr (Member) commented Jan 17, 2024

Sorry for the delay, but the problem with the historic iterations of political boundaries was more difficult than expected.

With ad-freiburg/osm2rdf#67, I was able to build the dataset in a few hours, even on my local laptop.

The weekly update is not finished yet, but I already published an OHM dataset here:

https://osm2rdf.cs.uni-freiburg.de/ttl/ohm.planet.osm.ttl.bz2

Feel free to experiment. This dataset will now be built every week.

patrickbr (Member) commented Jan 17, 2024

@lehmann-4178656ch, one thing we already noticed is that the prefixes are all hardcoded to OSM stuff in osm2rdf (which so far has been a totally valid design choice).

For example, consider the following query for historical streets: https://qlever.cs.uni-freiburg.de/ohm-planet/z2HZ2L

Because of the hardcoded prefixes, if you click on an element, it goes to the OSM element with the same ID. How much work do you think it would be to make these prefixes configurable?

nyurik commented Jan 17, 2024

Just FYI: Sophox is back up and up to date now (I'm the admin on it). I am not too tied to Blazegraph (it is no longer actively maintained), so I wouldn't mind joining forces and possibly migrating to another indexing solution.

hannahbast (Member) commented:

@nyurik Feel free to use https://github.com/ad-freiburg/qlever to provide the SPARQL endpoint. It's much faster than Blazegraph, has fancy autocompletion, a fancy map view, etc. Let us know if you want help with the setup.

I guess one separate question is whether the Sophox RDF has any advantages over the RDF generated here: https://osm2rdf.cs.uni-freiburg.de . But let's start another issue or discussion for that; this is the issue for OpenHistoricalMap, and it's not a good idea to have two separate discussions in one issue.

hannahbast (Member) commented Jan 18, 2024

@patrickbr @lehmann-4178656ch I changed the prefixes for osmnode, osmway, osmrel at the beginning of ohm.planet.osm.ttl.bz2 as follows:

@prefix osmway: <https://www.openhistoricalmap.org/way/> .
@prefix osmnode: <https://www.openhistoricalmap.org/node/> .
@prefix osmrel: <https://www.openhistoricalmap.org/relation/> .

Then I rebuilt the index and now the links point to where they should. For example: https://qlever.cs.uni-freiburg.de/ohm-planet/ZgSBCB

However, the start and end dates are currently not very useful because they don't have a proper type (should be xsd:date or xsd:gYear). Can you fix this?

1ec5 (Author) commented Jan 18, 2024

However, the start and end dates are currently not very useful because they don't have a proper type (should be xsd:date or xsd:gYear). Can you fix this?

Also xsd:gYearMonth, in case the start_date or end_date tag is formatted as YYYY-MM, which is allowed. Note that many OHM dates are BCE, which ISO 8601 represents with a negative year number offset by one, because there is no year zero: 1 BCE is year 0000, 2 BCE is year -0001, and so on.

How would QLever handle malformed dates? Would there be an alternative way to access those values, or is that out of scope? I suppose the same consideration applies to keys such as wikidata and wikipedia, but unfortunately malformed dates are currently more common in OHM, mostly added by OSM mappers who are familiar with that project’s homegrown date approximation format. (We want people to record approximations in *_date:edtf tags instead.)

A common workaround in OverpassQL is to rely on the fact that ISO 8601 dates sort lexicographically, so even a standard string type suffices. But presumably a proper date type would facilitate joins with other datasets.
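A minimal sketch of that workaround in SPARQL, assuming start_date stays a plain string under osmkey::

PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
SELECT ?feature ?start WHERE {
  ?feature osmkey:start_date ?start .
  # Plain string comparison suffices because well-formed ISO 8601 dates sort lexicographically
  FILTER (?start >= "1900" && ?start < "1915")
}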

hannahbast (Member) commented:

@1ec5 If the dates are such that the lexicographic order corresponds to the chronological order, then we can just specify the date as a string and everything is fine. For example, the countries of the world at the beginning of the 20th century (according to OHM): https://qlever.cs.uni-freiburg.de/ohm-planet/xpkeaM

1ec5 (Author) commented Jan 18, 2024

If the dates are such that the lexicographic order corresponds to the chronological order, then we can just specify the date as a string and everything is fine.

As long as the start_date or end_date value is well-formed, the lexicographic order would correspond to the chronological order for any date between 10,000 BCE (-9999-12-31) and 9999 CE (9999-12-31), inclusive. Apparently we do have four features that predate 10,000 BCE, though QLever doesn’t pick them up for some reason. All of them are natural features with very approximate dates. There are also 34 features with dates past 9999 CE, but they all appear to be typos or placeholder values 🙅‍♂️ rather than amazing predictions of the future.

For example, the countries of the world at the beginning of the 20th century (according to OHM): https://qlever.cs.uni-freiburg.de/ohm-planet/xpkeaM

By the way, this query includes still-extant countries. (However, I had to hard-code today’s date, because neither NOW() nor BOUND() is implemented.)

1ec5 (Author) commented Jan 18, 2024

How would QLever handle malformed dates? Would there be an alternative way to access those values, or is that out of scope?

Maybe start_date and end_date could remain strings, but additional schema:startDate and schema:endDate triples could be added?

hannahbast (Member) commented:

@1ec5 Yes, indeed: our approach is to leave the original data intact and add further triples where the data can be curated somehow. For example, Wikidata IDs are stored as simple strings in the OSM data, and you get that string via the osmkey:wikidata predicate. But for many SPARQL queries it is convenient to have the full IRI, and for that purpose we have the related osm:wikidata predicate. For example: https://qlever.cs.uni-freiburg.de/osm-planet/skweDr .
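For instance, a sketch that joins features to Wikidata through those IRIs (the federated endpoint URL here is an assumption):

PREFIX osm: <https://www.openstreetmap.org/>
PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?feature ?item ?population WHERE {
  ?feature osmkey:boundary "administrative" ;
           osm:wikidata ?item .       # full Wikidata IRI, unlike the osmkey:wikidata string
  SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {   # assumed endpoint URL
    ?item wdt:P1082 ?population .     # P1082 = population
  }
}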

@lehmann-4178656ch Can you add date triples with a proper datatype to osm2rdf, at least for the predicates osmkey:start_date and osmkey:end_date?

patrickbr (Member) commented Jan 29, 2024

With ad-freiburg/osm2rdf#70 merged, osm2rdf produces osm:start_date and osm:end_date date triples for full dates, years, and year/month combinations (typed xsd:date, xsd:gYear, and xsd:gYearMonth).

These should be present in the next build, which started today and should be finished on Friday.

lehmann-4178656ch (Member) commented:

ad-freiburg/osm2rdf#70 put the typed data into either osm: or ohm:, depending on the selected base dataset, even though the original values were always stored under osmkey:.

With ad-freiburg/osm2rdf#79, we moved the typed values to the more generic osm2rdfkey: prefix, to indicate that osm2rdf has processed these values.
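So a query can now read both the raw tag value and the typed value side by side, e.g. (sketch; the osm2rdfkey: IRI below is a placeholder for the actual one in the published TTL):

PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
PREFIX osm2rdfkey: <https://osm2rdf.cs.uni-freiburg.de/rdf/key#>   # placeholder IRI
SELECT ?feature ?raw ?typed WHERE {
  ?feature osmkey:start_date ?raw ;          # original tag value, always a plain string
           osm2rdfkey:start_date ?typed .    # parsed value typed xsd:date / xsd:gYear / xsd:gYearMonth
}
ORDER BY ?typed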
