Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider read_st and write_st? #140

Closed
hadley opened this issue Jan 2, 2017 · 11 comments
Closed

Consider read_st and write_st? #140

hadley opened this issue Jan 2, 2017 · 11 comments

Comments

@hadley
Copy link
Contributor

hadley commented Jan 2, 2017

Instead of st_read and st_write.

The (small) advantages are greater consistency with the tidyverse (e.g. read_csv()) and somewhat more consistency with base R (e.g. read.csv()).

I think there's something nice about typing packagename::read_ and easily seeing what formats can be read in.

@edzer
Copy link
Member

edzer commented Jan 2, 2017

I think it was you who suggested to have all functions start with a common prefix, so that you could see all functions of the package by typing st_[tab]. Hard to have both...

@hadley
Copy link
Contributor Author

hadley commented Jan 2, 2017

Yeah, I think this is one exception that is useful.

(BTW why is the common prefix st_ and not sf_? I found that slightly confusing)

@edzer
Copy link
Member

edzer commented Jan 2, 2017

This comes from some standard document; all PostGIS commands follow the same pattern.

@hadley
Copy link
Contributor Author

hadley commented Jan 2, 2017

Ah, got it

@etiennebr
Copy link
Member

Why not read_sf <- function(...) st_read(...) ? That would make both ways exist.

@mdsumner
Copy link
Member

mdsumner commented Jan 3, 2017

Finally looked up what the st is, spatial temporal is not what I expected. http://stackoverflow.com/questions/7234679/what-is-st-in-postgis

@etiennebr
Copy link
Member

etiennebr commented Jan 19, 2017

While I understand the stringsAsFactors = TRUE I find it generally confusing. readr package uses stringsAsFactors = FALSE, which I think would generally be a good default for sf given the diversity of data stored in spatial features. Would you consider changing the default?

In fact, my question originates from trying to follow the readr interface, so I was thinking read_sf() could have a stringAsFactors = FALSE default, but it would probably be very confusing to have st_read() with a different default than read_sf(). I know users can change that default, but for reproducibility, I think it's right to assume a --vanilla environment.

@mdsumner
Copy link
Member

mdsumner commented Jan 19, 2017

@etiennebr I agree with this, I think auto-factoring is an unhappy legacy we should actively stamp out. The argument for staying compatible with R's defaults is a reasonable one though, and I know that @edzer made the choice for it purposefully. I do think it's worth breaking the tradition though. Shapefiles, MapInfo, GeoPackage etc. don't have factors afaik, and there's no common-standard across the huge variety of possible drivers and data sources, so I think the auto-factoring is incorrect.

(Manifold GIS does have a "Lookup" type, which is a clear analog to factor, but I don't know how common that is across vendors. "factor_alikes_as_factors = TRUE"?

It might be a partial table read from a database, for example, so the set of available values are not all seen in the first pass, which means subsequent read-bind workflows already don't make sense from the strict view-point of factors as a pre-set of allowable values.

@edzer
Copy link
Member

edzer commented Jan 20, 2017

By that, we'd try to mirror in R what a relational database does. But relational databases were not designed to model data, and that's why we're using R. There is a fundamental difference between a character variable and a factor: character data identifies records, factors group them. In R we already identify records by their row number, so there is no need: the character rownames of mtcars are convenient. These rownames are not a grouping variable, they are identifiers. I notice that tidyverse is trying to replace most default functions in base R, but in this case I think they made the wrong decision, although I consider this on the edge of bikeshedding. Users who hate me for this can simply set

options(stringsAsFactors = FALSE)

in their .Renviron or on the to of their reproducible scripts.

@hadley
Copy link
Contributor Author

hadley commented Jan 20, 2017

I don't want to beat a dead horse, but my reasoning is a little different. To me, the distinction between factors and character vectors is that factors have a fixed and known set of possible values. It is not possible to know the set of possible values by inspecting the data (i.e. you might have a gender column that only contains males) so by default it's safer to load as character and rely on the user to supply the full set of possible values. (The order of the levels often also has some meaning.)

Historically, I think the default simply reflects that character vectors were poorly supported in the early days of R and factors were easier to work with. Unfortunately I don't think it's viable to change the global option for many people because much existing code depends on the default being TRUE (e.g. it also affects every call to data.frame() in every loaded package).

@mdsumner
Copy link
Member

mdsumner commented Jan 20, 2017

"To model data" is not why I'm using R.

The simple features standard has no metadata system or scaling semantics applied to the fields so I don't understand why these would be added automatically for every user of them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants