-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
st_read and requiring layer name #39
Comments
Nice idea, would improve usability! What would be a good approach to try to derive layer name from a single string: remove the file name extension and possibly path from the file name? |
I'm not sure, my use cases have almost always been geojson and shapefiles so I don't have direct experience with the more exotic vector types and their usage of layer names. GDAL peculiarities mean that both of the following work without issue
so I'm not sure if just looking at the path will be helpful. The example I have the most issue with has always been with reading geojson where the user is expected to know that the layer is called |
I agree with the changes suggested by @rundel. That will increase usability lots. It was a godsend when I found out about I think it's easiest to just point it at the file automatically deciding the driver/layer based on the extension. This is what rio does and I think sf should too: https://cran.r-project.org/web/packages/rio/vignettes/rio.html |
So these might be some heuristics then: if
|
Some other common extensions that I think would be good for sf to recognise:
I think raster already does a good job of handling the multiple raster formats. |
Examples of using st_read on .geojson .osm, .gml, .kml and .kmz would be helpful! |
Please be careful to separate out st_read for dummies and the real st_read, which shouldn't make assumptions. A typical problem is with the shapefile driver which falls over in a dsn=directory with a given layer=, but where the directory contains a *.dbf without being part of a shapefile. Robert's use of shapefile as a general term for vector data is very unhelpful. The border between sensible simplification and going too far is very real. Something not (yet) in the frame is encoding, which is often very unpleasant (even CP1252 is a problem), and far from everyone is UTF-8 yet. So either do a wrapper for st_read which handles frequent cases, and explain that more often than we like, real data is messy and needs the underlying complications in the bare interface, or just provide the bare interface and enough guidance. Fixing GeoJSON for the current odd OGR naming may change as soon as the driver gets updated - there are reasons for exposing users to the actual complications, because it makes derailments their responsibility. |
Example of a 5 Mb file I created - cycling potential on major roads in London:
I take Roger's point about precision and robustness vs ease of use and like the idea of having a separate wrapper function for magically guessing, most of the time correctly, how to read in a file. For that I would propose something with a name like Would that satisfy all parties?
Would work for me and think it's a function I can even help writing! |
I will not provide two functions. We can give layer a default name where this makes sense, and an error message when it is missing otherwise (e.g. where dsn is a directory, or an extension is missing). Like rgdal, the function responds with printing which layer is being read from which data source, on success, so that is there. Even |
Sounds reasonable and can understand desire to minimise number of functions. So, to clarify, that would mean |
no, because that doesn't point to a file with an extension that tells us what the layer name should be. I just pushed a commit that accepts
though -- please test (now supports gpkg, geojson, shp). To be discussed: what will it return as default? |
|
Could you clarify what you mean by "that doesn't point to a file with an extension that tells us what the layer name should be" - not understanding 100%. |
I meant: literally the command you typed. It has no file extension since the "/" comes after the ".". If you mean, either
or
then yes, but please don't imply meaning in github issues -- I can only read what you write. Which layer in your .osm?
|
(the last two are empty) |
Many thanks. |
I changed strategy somewhat. In case no layer is specified,
|
Very cool - works a treat, as tested on the osm data: # devtools::install_github("edzer/sfr") # latest version
library(sf)
# user: now I want to read in an osm file...
osm_data = st_read(dsn = "/tmp/doc_txt.osm")
# user: great - now I'll plot it...
# plot(osm_data) # fail
# user: what's in it?
osm_data
# user: ah ok, now I'll read-in the points
osm_points = st_read(dsn = "/tmp/doc_txt.osm", layer = "points")
plot(osm_points) I think there should be a message telling the user that no data has been read in when this happens though, e.g. |
Yes, I think there should be a message, either:
|
And the message could also say something like
|
OK, I now have:
|
Looks like you've cracked it - can't think of any way to make that better. Your work will make loading geographic data into R more accessible for thousands of people when this takes off! |
Although from a grammar perspective I don't think the semi colon works, hence ecea519 (I know, I'm painting bike sheds again but someone's got to do it!) |
I'd say job done and this issue can be closed. This issue has led to an improvement in usability and, although you still need to provide a layer name, it's now clear where to find those names. Solved from your perspective @rundel? |
I'm late to this, but concerned that you can now get a data set from st_read, or an error, or a list with layer names, which will present strange errors downstream when you try to do stuff as if it were a data set. To me it should be an error, or a valid data set (or NULL, maybe if the data source can execute queries). Hadley calls this "type-stability": https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ (and I'm a massive offender in my own packages, but it's important I think. ) Is there a function to get the layer names without using st_read, like ogrListLayers? If you consider multiple NetCDF variables in a single file to be analogous to vector layers: I've always liked raster's behaviour to read the first variable in a NetCDF file as the one to read if the "varname" is not specified, with a message about that decision including a list of the available variable names. That way you get a data set without specifying a variable, but also an indication that a choice for the "varname" argument was made for you. The behaviour would be to have an implicit default value of 1L for "layer" to act as an index into the list of available layers, and to trigger a message with the details in that case. Then an explicit integer for layer could stand in place of a name - sometimes you want to read everything and push it around somewhere else, without knowing what it is. I'd meant to suggest this behaviour, since it's such a huge obstacle for newcomers who start by knowing nothing about multiple layers, and then (from what I've seen), turn from readOGR to the more obvious tools in maptools and raster to read their vector data. (All that said, the current behaviour is a huge user-friendly improvement over readOGR, so I'm very glad to see it.) |
Interesting points @mdsumner. I like the option of providing a NULL output if there are many layers and print the layer names. Then a separate function, e.g. Certainly agree with the general principle of type stability. |
I like type stability. How about this (in case no layer is specified):
|
This looks at least reasonable, assuming one places more importance on user-friendliness than I would (user-friendliness as the opposite of the very beneficial steep learning curve). Should readOGR move to this structure too? |
Wow, lots of discussion since I last had a chance to look at this. I am in favor of @edzer's proposed cases. My only small quibble is including |
@edzer @rsbivand - what's are the future plans for |
They are separate. rgdal will go to maintenance mode for vector, but just as maptools hasn't been deprecated, I guess rgdal vector will not be either. I'd just change a missing layer= argument from: |
OK, I currently see:
and
so that should settle this issue. |
Very comprehensive solution and demonstration of the new functionality, that works for me and it seems @rundel who usefully opened this issue is happy. Job done I'd say! |
Yes, very good indeed. Thanks all (and especially @edzer for the work!) |
Looks great to me, thanks for all the hard work @edzer |
Perhaps this is a local issue or my own ignorance (likely), but as I start to develop on
What would be the correct way to convert
Thanks so much for |
Yes, this is how it works. I'd have opted for
but that really doesn't matter. The extra list nesting however suggests that coordinates may have more rings, in which case you need to do more - you'd need to know whether they are holes or other outer rings. @mdsumner might know this area better? |
@timelyportfolio I've implemented |
@timelyportfolio it will show holes correctly by virtue of the evenodd rule drawn <-
list(structure(
list(
type = "Feature",
properties = structure(
list(`_leaflet_id` = 102L, feature_type = "rectangle"),
.Names = c("_leaflet_id",
"feature_type")
),
geometry = structure(
list(type = "Polygon",
coordinates = list(
list(
list(10.0854730652645, 48.8285498294904),
list(10.0854730652645, 49.393084372626),
list(11.123681073077,
49.393084372626),
list(11.123681073077, 48.8285498294904),
list(10.0854730652645, 48.8285498294904)
),
list(
list(10.3, 48.9), list(10.3, 49.1), list(10.7, 49.1), list(10.7, 48.9),
list(10.3, 48.9)))),
.Names = c("type",
"coordinates")
)
),
.Names = c("type", "properties", "geometry")
))
Note that leaflet the R package will get support for true leaflet-MultiPolygon, but it doesn't have it yet. I'm confused about why addGeoJSON takes an R list rather than a string though, we really need some clarity in this area, and some semantics for transferring between raw formats (like a string, or binary format), their analogous representation within R, and the form needed to drive another tool. Sometimes there are too many steps in the chain. And clearly we need better tools for GeoJSON and TopoJSON, rather than hacking these internal forms, we are half-ish way through that with some tools here - I'm not sure how what we are doing meshes with geojsonio, yet. TopoJSON requires partial indexing between shared tables (or structures), so it's a good bridge for what the broader central scheme I've been working towards can provide. |
Sorry, I now realize that my previous comment probably doesn't make sense in this context. |
Thanks for all the input. I will promote to a new issue for geojson as R |
This is a small pet peeve of mine with regard to rgdal - is there a compelling reason to continue to force the user to provide the layer name when reading in data?
There isn't a technical reason within gdal to require this and in my experience most geometry files I've worked with only ever have one layer. It seems to me a more sensible approach is to make the layer name an optional argument and in the case of a single layer to just automatically return that object.
If more than one layer are present then the current error could be thrown, or alternatively throw a warning and return a list containing all layers.
The text was updated successfully, but these errors were encountered: