Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dplyr compatibility? #42

Closed
tiernanmartin opened this issue Oct 31, 2016 · 38 comments
Closed

Dplyr compatibility? #42

tiernanmartin opened this issue Oct 31, 2016 · 38 comments

Comments

@tiernanmartin
Copy link

tiernanmartin commented Oct 31, 2016

I see that working with sf objects with dplyr is listed on the ISC Proposal, but it looks like you haven't gotten around to documenting that yet.

Without going into a vignette-length explanation, could you shed some light on why dplyr functions appear to strip sf objects of their sf class and suggest a way to effectively combine the power of these two tools?

Example:

library(dplyr)
library(sf)

nc <- st_read(system.file("shapes/", package="maptools"), "sids")
## Reading layer sids from data source 
## /Library/Frameworks/R.framework/Versions/3.3/Resources/library/maptools/shapes/ using driver "ESRI Shapefile"
## features:       100
## fields:         14
## converted into: MULTIPOLYGON

class(nc)
## [1] "sf"         "data.frame"

class(nc[nc$AREA > .1,])
## [1] "sf"         "data.frame"

class(dplyr::filter(nc,AREA > .1))
## [1] "data.frame"

Many thanks.

@edzer
Copy link
Member

edzer commented Oct 31, 2016

Good catch! I guess (but am not sure) that this is out of my hands; right now you could coerce back by

nc %>% filter(AREA > .1) %>% st_as_sf() %>% plot()

but this is maybe, ehm, not very tidy? @hadley what do you think?

@hadley
Copy link
Contributor

hadley commented Oct 31, 2016

I think it's just a matter of adding the right dplyr methods to restore the sf class, e,g.:

filter_.sf <- function(.data, ..., .dots) {
  st_as_sf(NextMethod())
}

@edzer
Copy link
Member

edzer commented Oct 31, 2016

Got it, thanks! I see the following functions ending in a _ in dplyr:

 [1] "arrange_"        "count_"          "data_frame_"     "distinct_"      
 [5] "do_"             "filter_"         "funs_"           "group_by_"      
 [9] "group_indices_"  "lst_"            "mutate_"         "mutate_each_"   
[13] "rename_"         "rename_vars_"    "select_"         "select_vars_"   
[17] "slice_"          "summarise_"      "summarise_each_" "summarize_"     
[21] "summarize_each_" "translate_sql_"  "transmute_"     

@mdsumner which ones can we meaningfully support? Are there commands that need additional geometrical operations, like merging/unioning geometries?

@mdsumner
Copy link
Member

mdsumner commented Nov 1, 2016

I short, I don't know the answer, I haven't explored it. At a guess, group_by() %>% summarize() is the main one for geometry, I imagine that it should work like the rgeos functions in byid=FALSE mode, but applied to each grouping. If there's no fun(geom) then perhaps error, or just drop the sf classing.

In https://github.com/mdsumner/spdplyr I only do the simple ones, arrange, distinct, filter, mutate, rename, select, slice, - the group_by/summarize just bangs the geometries together without removing internal boundaries or intersections - but this was done in isolation, and without me knowing how to extend dplyr, really. Joins are another thing I haven't explored much what should/could happen there.

Sadly, despite the common "we need dplyr for Spatial" cries in the community I haven't seen any useable details about what is desired. (Personally, this is not of high interest without exposing the underlying X, Y, [Z], [M] attributes, and the underlying grouping structures in the geom in a consistent way, but to do that you need to go beyond path-based lines and polys, and I've put all of that into work elsewhere. I had hoped for there to be a common framework between ggplot/ggvis/rgl and sf, but I think that all belongs outside of sf. There's no such common framework outside of R, for example, so there's no analogous target to adhere to).

@mdsumner
Copy link
Member

mdsumner commented Nov 1, 2016

I'm really going to try to spend some time on this, here is just what I tried recently, without going into detail.

library(sf)
example(st_read)

## simple (non aggregation) stuff already works
library(sf)
example(st_read)

library(dplyr)
nc %>% slice(10)

nc %>% filter(PERIMETER > 2.5)

## geometry dropped as expected
nc %>% group_by(SID79) %>% summarize(sum(PERIMETER))

## grouped mutate (without aggregation) keeps geometry but drops
## sf defs
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA)) 

## trick fun to apply to the geometry column
## works for grouped union, though drops the sf defs
fun <- function(x) st_union_cascaded(st_sfc(do.call(c, st_geometry(x))))
nc %>% group_by(SID79) %>% summarize(geometry = fun(geometry)) 

I know this requires a lot of thought, but I think there's value in

  • not dropping the sf defs for a mutate into group_by (will require effort on how the grouped_df is handled)
  • a default aggregation behaviour, where any group_by keeps the sf defs and geometry column is unioned, even if it isn't operated on in summarize()

I haven't though deeply about the other dplyr behaviours.

@edzer
Copy link
Member

edzer commented Nov 1, 2016

UPDATE
Please update to c4dde04 which has tested dplyr verb methods for sf objects; an overview:

  +"arrange_"       -"count_"         -"data_frame_"    +"distinct_"
  -"do_"            +"filter_"        -"funs_"          +"group_by_"
  -"group_indices_" -"lst_"           +"mutate_"        -"mutate_each_"
  +"rename_"        -"rename_vars_"   +"select_"        -"select_vars_"
  +"slice_"         +"summarise_"     -"summarise_each_"+"summarize_"
  *"summarize_each_"-"translate_sql_" +"transmute_"

from tidyr:  +gather, +spread

   +: added sf methods
   -: not relevant (?)
   *: will be deprecated?

where summarisze_ automatically merges geometries over groups (when called with a grouped_df).

@tiernanmartin
Copy link
Author

May I also suggesttidyr::gather_() and tidyr::spread_() as worthy candidates for inclusion in the list of tidyverse functions that play well with sf.

@edzer
Copy link
Member

edzer commented Nov 1, 2016

Thanks; I did, but still untested.

@edzer
Copy link
Member

edzer commented Nov 2, 2016

I updated the table 3 comments up; all relevant dplyr verbs + gather & spread are now implemented and lightly tested. Below is my test script. Happy testing - positive feedback also welcome!

library(sf)
library(dplyr)
nc = st_read(system.file("shape/nc.shp", package="sf"), crs = 4267)
nc %>% filter(AREA > .1) %>% plot()

# plot 10 smallest counties in grey:
nc %>% plot()
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# select: check both when geometry is part of the selection, and when not:
nc %>% select(SID74, SID79) %>% names()
nc %>% select(SID74, SID79, geometry) %>% names()
nc %>% select(SID74, SID79) %>% class()
nc %>% select(SID74, SID79, geometry) %>% class()

# arrange: ten smallest counties
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# group_by:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc %>% group_by(area_cl) %>% class()

# mutate:
nc2 <- nc %>% mutate(area10 = AREA/10)

# transmute:
nc %>% transmute(AREA = AREA/10, geometry = geometry) %>% class()
nc %>% transmute(AREA = AREA/10) %>% class()

# rename:
nc2 <- nc %>% rename(area = AREA)

# distinct:
nc[c(1:100,1:10),] %>% distinct() %>% nrow()

# summarize:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.g <- nc %>% group_by(area_cl)
nc.g %>% summarise(mean(AREA))
nc.g %>% summarize(mean(AREA)) %>% plot(col = 3:6/7)

library(tidyr)

# time-wide to long table, using tidyr::gather
# stack the two SID columns for the July 1, 1974 - June 30, 1978 and July 1, 1979 - June 30, 1984 periods
# (see https://cran.r-project.org/web/packages/spdep/vignettes/sids.pdf)
nc %>% select(SID74, SID79, geometry) %>% gather(VAR, SID, -geometry) %>% summary()

# spread:
nc$row = 1:100
nc.g <- nc %>% select(SID74, SID79, row) %>% gather(VAR, SID, -row)
nc.g %>% tail()
nc.g %>% spread(VAR, SID) %>% head()
nc %>% select(SID74, SID79, geometry, row) %>% gather(VAR, SID, -geometry, -row) %>% spread(VAR, SID) %>% head()

@edzer
Copy link
Member

edzer commented Nov 2, 2016

Here's an example computing population (well, birth) densities for aggregated areas:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
## Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.3/sf/gpkg/nc.gpkg' using driver `GPKG'
## features:       100
## fields:         14
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc.ea <- st_transform(nc, 7314) # Lambert equal area
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, dens = BIR74/area) # births/km^2
summary(nc.ea$dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01834 0.08762 0.14850 0.23860 0.27700 1.37700
nc.ea$area_cl <- cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.grp <- nc.ea %>% group_by(area_cl)
out <- nc.grp %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

did anyone get lost?

out %>% summarise(sum(A * new_dens))
##   sum(A * new_dens)
## 1            329962
nc.ea %>% summarise(sum(area * dens))
##   sum(area * dens)
## 1           329962

No. You might have discovered by now that I'm brand new to dplyr - happy to take comments how to tidy this up!

@tiernanmartin
Copy link
Author

Thank you for implementing this in such short order @edzer!

When I opened this issue I had no expectation of seeing progress in the near term and had already started reverting my current project back to sp conventions. But now, thanks to your responsiveness and the feedback from others on this thread, I'm thrilled to start testing the sf x dplyr interface.

Going forward I'll be happy to share my feedback, tests, and any unusual situations I run into. As @mdsumner mentioned, there's plenty of thinking and hard work ahead.

Thanks again and I look forward to following the development of sf.

@tiernanmartin
Copy link
Author

A few observations about summarise()'s interaction with sf objects:

  • the CRS gets dropped
  • the aggregated groups aren't what I expected
  • some geometries may be dropped

You can find tests demonstrating these observations here: https://tiernanmartin.github.io/YCC-Baseline-Conditions/3-communication/other/sp-dplyr-test.nb.html

edzer added a commit that referenced this issue Nov 3, 2016
Signed-off-by: Edzer Pebesma <edzer.pebesma@uni-muenster.de>
@edzer
Copy link
Member

edzer commented Nov 3, 2016

Great observations, beautiful tests! Now, they look even much better:

  • CRS now get passed through by summarise (for the other commands it did work)
  • aggregate groups and dropped geometries have now been fixed.

The issue you discoverd was that the indices attribute in objects returned by group_by is 0-based, where I assumed it was 1-based. @hadley if you are ever going to change this, remember telling sf about it!

@kendonB
Copy link
Contributor

kendonB commented Nov 3, 2016

I'm not sure if this deserves it's own issue, but a full set of *_join operations would be a good addition here.

I haven't dug into sfr yet, so I don't know if you're allowing sf objects to contain multiple geometries (i.e. two MULTIPOLYGONs or a MULTIPOLYGON and a MULTIPOINT). If you can, (and if you can't, I think you should), then *join operations would result in sf objects with multiple geometries.

A proposed combo:

mp1 # sf with MULTIPOLYGON
mp2 # sf with MULTIPOLYGON
mp3 <- left_join(mp1, mp2, border = FALSE)

would result in an sf with two columns geometry1 and geometry2 and would behave just like left_join in dplyr except a match is defined as two geometries with positive spatial overlap; border = TRUE would include matches with common borders. left_join(x, y) includes all observations in x, regardless of whether they match or not. Multiple matches return as multiple rows.

There are maybe more match conditions that you would want to include. Perhaps allowing any function that takes two sfg objects and returns a logical would make the most sense. Does st_intersects, for example, allow sfg objects as inputs? If so, then this could be fully general for any pair of sf types. Usage would be something like:

mp3 <- left_join(mp1, mp2, match_fun = st_touches)

It probably also makes sense to do this so you don't have to reinvent the join wheel. I don't believe dplyr allows for arbitrary conditions for join matches yet (tidyverse/dplyr#557), but there might be a solution that isn't too difficult to implement.

The user could also generate new geometry columns using any function that takes two sfgs and returns an sfg. i.e.:

mp3 <- mp3 %>%
  mutate(geometry3 = purrr::map2(geometry1, geometry2, st_intersection))

@tiernanmartin
Copy link
Author

@edzer Is it better to keep adding dplyr-related requests to this issue or to open fresh issues for each request?

@edzer
Copy link
Member

edzer commented Nov 4, 2016

Yes, @kendonB could you pls make this into a separate issue, for instance called "Dplyr-style join operations with spatial predicates" ?

@mdsumner
Copy link
Member

mdsumner commented Nov 12, 2016

@edzer this is really nice, there's some code reduction if you put the cut and the group by actions in the pipeline together:

nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, 
                          dens = BIR74/area, 
                          area_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(area_cl) %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
## compare
out %>% summarise(sum(A * new_dens))
nc.ea %>% summarise(sum(area * dens))

It reminds me that applying the st_ functions inside verbs is a motivator, and seems fine:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
nc.ea <- nc %>% mutate(area_nonsense = st_area(geom), 
              geom = st_transform(geom, 7314), 
              area = st_area(geom) / 1e6, 
              dens = BIR74 / area,
              AREA_cl = cut(AREA, c(0, .1, .12, .15, .25))) 

out <- nc.ea %>%   group_by(AREA_cl) %>% 
  summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

out %>% summarise(sum(A * new_dens))
##sum(A * new_dens)
## 1            329962

But also I see that 7314 is Transverse Mercator, not Lambert EA - it's not so important re the example, but worth correcting in case this gets used elsewhere.

@mdsumner
Copy link
Member

And a check for the grouping and area calculations sp/rgeos style:

library(rgdal)
spnc <- spTransform(as(nc, "Spatial"), "+proj=tmerc +lat_0=40.35 +lon_0=-86.15000000000001 +k=1.000031 +x_0=240000 +y_0=36000 +ellps=GRS80 +units=us-ft +no_defs")
sp.out <- rgeos::gUnionCascaded(spnc, as.character(cut(spnc$AREA, c(0, .1, .12, .15, .25))))
rgeos::gArea(sp.out, byid = TRUE)/1e6
#    (0,0.1]  (0.1,0.12] (0.12,0.15] (0.15,0.25] 
#  290620.6    182825.3    322369.5    585439.0 
out$A
##[1] 290620.6 182825.3 322369.5 585439.0

@etiennebr
Copy link
Member

To follow up in relation to #121, I like the way tibble handles simple features, because they don't drop a class on select() (data.frames would do the same, though tibbles are very well behaved). Because they have no expectations of a spatial column when selecting they can just drop the geometry, which makes it coherent and explicit. It's also easy to add another spatial column.

I ran an experiment (it fails after summarise, I don't know exactly why, maybe related to #49 ?).

@mstrimas
Copy link

mstrimas commented Mar 7, 2017

I'm noticing some potentially odd behaviour with spread(). When the geometries of features with the same key are different spread() results in a single row with only the geometry of the first instance of that key. Here's an example:

nc <- st_read(system.file("shape/nc.shp", package="sf"))
# make a long format data set
nc_gathered <- nc %>% 
  select(county = NAME, BIR74, BIR79, -geometry) %>% 
  slice(1:3) %>% 
  gather(year, births, BIR74, BIR79)
# works as expected
nc_gathered %>% 
  spread(year, births)
# change second Alleghany geometry
nc_gathered$geometry[[4]] <- nc_gathered$geometry[[2]]
nc_gathered %>% 
  spread(year, births)
# now there's only one Alleghany entry
# in contrast, if this were a normal data frame there would be 
# two Alleghany entries, one for each of the different geometries

@edzer
Copy link
Member

edzer commented Mar 7, 2017

I agree that while duplicating, and then de-duplicating geometries, there is the implicit assumption that no one messes up things in between. Why would you do this? How would you want this to work differently?

@mstrimas
Copy link

mstrimas commented Mar 8, 2017

I don't know what the best approach is, or if this even matters, largely because I'm unclear of how spread() might be used for spatial data in the wild. Just seemed to me that the behaviour of spread_.sf() is inconsistent with spread() for data frames. Without having done much testing, wouldn't this method work:

spread_.sf <- function (data, key_col, value_col, fill = NA, convert = FALSE, 
    drop = TRUE, sep = NULL) 
{
  st_as_sf(NextMethod())
}

@edzer
Copy link
Member

edzer commented Mar 8, 2017

Gives me

> nc_gathered %>% 
+   spread(year, births)
Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in get1index

on your example. Did you do any testing?

@mstrimas
Copy link

mstrimas commented Mar 9, 2017

Oops, sorry, forgot a crucial line! This is actually tested and works for me.

spread_.sf <- function(data, key_col, value_col, fill = NA, 
                       convert = FALSE, drop = TRUE, sep = NULL) {
  data <- as.data.frame(data)
  st_as_sf(NextMethod())
}

Anyhow, not sure if this approach is necessarily better, just thought I'd bring it up... Thanks!

edzer added a commit that referenced this issue Mar 9, 2017
@edzer
Copy link
Member

edzer commented Mar 9, 2017

Thanks; I replace the original spread_.sf with yours.

@mstrimas
Copy link

Above @mdsumner gives an example of a grouped mutate. I get an error when running this example:

library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
#> Warning in is.na(st_agr(x)): is.na() applied to non-(list or vector) of
#> type 'NULL'
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index

Appears to be because mutate() drops the st_column attribute, but keeps the sf class def, when run on grouped data frames (though not on normal data frames for some reason). Then st_as_sf.sf() is called instead of st_as_sf.data.frame() which doesn't add the st_column back. I have no idea if this is the best approach, but the only way I was able to fix this is by removing the sf class with:

mutate_.sf <- function(.data, ..., .dots) {
  class(.data) <- setdiff(class(.data), "sf")
  st_as_sf(NextMethod())
}

Grouped filter() does not suffer from this issue.

edzer added a commit that referenced this issue Mar 10, 2017
@edzer
Copy link
Member

edzer commented Mar 10, 2017

Clumsy as it looks, I agree that might be the best thing to do. Thanks!

@eivindhammers
Copy link

Any reason why class(.data) <- setdiff(class(.data), "sf") was only added to group_by() in f51b9a3?

I'm still getting Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index.

@edzer
Copy link
Member

edzer commented Mar 24, 2017

We might have to do it everywhere. Does it solve your problem? With which function is that?

@eivindhammers
Copy link

It does solve it. mutate()

@yutannihilation
Copy link
Contributor

I know sf is on the mid-way toward achieving compatibility with dplyr, but I'm a bit afraid the compatibility will be degraded with the next release of dplyr. For example, sf object loses its class (sf) with dplyr 0.6.0rc. In case you've not noticed yet, I report the issue here:

0.5.0(current):

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "sf"         "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0'

RC for 0.6.0:

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0.9002'

One more thing I want to ask is, while dplyr has a plan to deprecate SE functions (e.g. select_, filter_, ...), do you plan to introduce tidyeval?

@yutannihilation
Copy link
Contributor

Ah, my comment above may be duplicated with #304.

@edzer
Copy link
Member

edzer commented Apr 14, 2017

There is no next release of dplyr yet, and there are no signs that @hadley has done any reverse dependency checks so far. The rstudio blog mentions that the verb_ methods will be deprecated but remain around for backward compatibility. This is, as you found out, not yet the case, so I consider your worries a little premature. sf is not mid-way something, dplyr is.

@yutannihilation
Copy link
Contributor

Oh, I see. thanks for your quick response:) I will consider filing this issue to dplyr's repo.

@hadley
Copy link
Contributor

hadley commented Apr 14, 2017

We have been careful not to make any backward incompatible changes - we should have a system of default methods that ensures existing backends still work. If that isn't true, I'd really appreciate a minimal reprex filed in dplyr that illustrates the problem.

@edzer revdep emails will go out (hopefully) later today.

@edzer
Copy link
Member

edzer commented Apr 14, 2017

Scripts that use dplyr verbs such as mutate (see 5 comments above for a reprex) no longer work with packages that dispatch on verb_ (mutate_), which is the only way dplyr 0.5 provides. To be compatible, users now need to change mutate into mutate_ in their scripts, did you have in mind to recommend them doing that?

I don't mind modifying sf, but consider it a backward incompatible change.

@hadley
Copy link
Contributor

hadley commented Apr 14, 2017

That is not the intent - can you please file a bug on dplyr?

@edzer
Copy link
Member

edzer commented Apr 14, 2017

OK: tidyverse/dplyr#2664

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants