
Hadoop friendly architecture / directly load OSM data #120

Open
geoHeil opened this issue Aug 24, 2018 · 5 comments

@geoHeil

geoHeil commented Aug 24, 2018

How hard do you think it would be to build an add-on that, instead of loading from PostGIS, loads the data directly from Parquet files stored inside Hadoop?

https://github.com/adrianulbona/osm-parquetizer

Data in this format is published daily, already converted, at http://osm-data.skobbler.net.

@oldrev

oldrev commented Aug 26, 2018

Hi, it's not that hard: you could do it by implementing your own RoadReader interface.

Copying and modifying the PostGISReader class is a good start.
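
To make that concrete, here is a minimal, untested sketch of what such a reader could look like. The interface and package names follow my reading of the Barefoot sources; the actual Parquet access is left as a stub because it depends on the osm-parquetizer schema:

```java
// Sketch only: interface/package names taken from the Barefoot sources; the Parquet
// access itself is left out because it depends on the osm-parquetizer schema.
import java.util.HashSet;
import java.util.Iterator;

import com.bmwcarit.barefoot.road.BaseRoad;
import com.bmwcarit.barefoot.road.RoadReader;
import com.esri.core.geometry.Polygon;

public class ParquetRoadReader implements RoadReader {
    private final String path; // e.g. an HDFS path to the converted road data
    private Iterator<BaseRoad> roads = null;

    public ParquetRoadReader(String path) {
        this.path = path;
    }

    @Override
    public boolean isOpen() {
        return roads != null;
    }

    @Override
    public void open() {
        open(null, null);
    }

    @Override
    public void open(Polygon polygon, HashSet<Short> exclusions) {
        // TODO: read the Parquet file(s) at `path`, optionally filter by polygon and
        // excluded road types, and map each record to a BaseRoad -- this is the part
        // that mirrors what PostGISReader does against the database.
        roads = java.util.Collections.<BaseRoad>emptyList().iterator();
    }

    @Override
    public BaseRoad next() {
        return roads.hasNext() ? roads.next() : null;
    }

    @Override
    public void close() {
        roads = null;
    }
}
```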

@jongiddy
Contributor

Barefoot does some transformation of the PBF files to create an efficient dataset for its use. It would be great to be able to do that conversion in a Hadoop cluster, but I don't think it is trivial.

However, once you have the correctly formatted files in the Hadoop cluster, it should be fairly easy to create a new Parquet-aware RoadReader.

I do the initial processing on a local VM, using the map/osm/import.sh script to import PBF data into PostgreSQL, then https://github.com/jongiddy/barefoot-map-db-file to export from PostgreSQL to a single .bfmap file, which I then upload to HDFS. My Spark jobs use https://github.com/jongiddy/barefoot-hdfs-reader to read the map data from HDFS.
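
For reference, the local (non-HDFS) side of that last step looks roughly like the Barefoot README example, just with a BfmapReader instead of a PostGISReader; barefoot-hdfs-reader provides the analogous reader over HDFS. Import paths follow my reading of the sources and the file path is a placeholder:

```java
// Roughly the pattern from the Barefoot README, with a .bfmap file instead of PostGIS;
// "oberbayern.bfmap" is just a placeholder path.
import com.bmwcarit.barefoot.matcher.Matcher;
import com.bmwcarit.barefoot.road.BfmapReader;
import com.bmwcarit.barefoot.roadmap.Road;
import com.bmwcarit.barefoot.roadmap.RoadMap;
import com.bmwcarit.barefoot.roadmap.RoadPoint;
import com.bmwcarit.barefoot.roadmap.TimePriority;
import com.bmwcarit.barefoot.spatial.Geography;
import com.bmwcarit.barefoot.topology.Dijkstra;

public class LocalBfmapExample {
    public static void main(String[] args) {
        // Load the pre-processed road map from a .bfmap file and build the graph in memory.
        RoadMap map = RoadMap.Load(new BfmapReader("oberbayern.bfmap"));
        map.construct();

        // Same matcher setup as in the Barefoot README.
        Matcher matcher = new Matcher(map, new Dijkstra<Road, RoadPoint>(),
                new TimePriority(), new Geography());
    }
}
```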

@geoHeil
Author

geoHeil commented Sep 2, 2018 via email

@geoHeil
Author

geoHeil commented Sep 12, 2018

@jongiddy do I understand correctly that the map is always loaded completely into memory, so that (especially for the whole world) the executors need a fairly large amount of RAM?

Also, when looking at the Hadoop-native file format: would the driver need to collect the whole Parquet file and then broadcast it?

@smattheis

smattheis commented Sep 23, 2018

@jongiddy and @oldrev already pointed out the relevant aspects. (Thanks!) I have only one note to add: the pre-processing step is mostly a transformation of OSM roads into a routable format, which means splitting roads into the edges of a graph. In OSM, roads are often long and cross intersections (e.g. at the intersection of https://www.openstreetmap.org/way/33954504 and https://www.openstreetmap.org/way/31662854), so a road must be split into multiple edges to represent the intersection and to allow turns. This pre-processing is done by the import scripts @jongiddy mentioned; a direct import into HDFS would need to implement that pre-processing step as well.

Further, with the road readers you can define a subregion to be loaded into RAM or saved to an HDFS file, roughly as sketched below. However, routing and map matching across subregions is not supported at the moment. This means it won't help if you have a large map and just want to organize it in tiles; it only helps if, for some use case, you need ONLY a subregion of the map data that you initially imported into the map server.
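
A rough sketch of that subregion export, assuming the RoadReader/RoadWriter method signatures as I read them in the Barefoot sources (the reader, bounding polygon, and output path are placeholders):

```java
// Sketch: copy only the roads inside a bounding polygon from any RoadReader into a
// .bfmap file; method signatures follow my reading of the Barefoot sources and may differ.
import java.util.HashSet;

import com.bmwcarit.barefoot.road.BaseRoad;
import com.bmwcarit.barefoot.road.BfmapWriter;
import com.bmwcarit.barefoot.road.RoadReader;
import com.bmwcarit.barefoot.road.RoadWriter;
import com.esri.core.geometry.Polygon;

public class SubregionExport {
    public static void export(RoadReader reader, Polygon subregion, String bfmapPath) {
        RoadWriter writer = new BfmapWriter(bfmapPath);
        reader.open(subregion, new HashSet<Short>()); // read only roads in the subregion
        writer.open();

        BaseRoad road;
        while ((road = reader.next()) != null) { // readers return null after the last road
            writer.write(road);
        }

        writer.close();
        reader.close();
    }
}
```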
