Extend data.table instead of data.frame? #7

Closed
barryrowlingson opened this issue Feb 11, 2016 · 22 comments

Comments

@barryrowlingson
Contributor

I've occasionally tried to extend data.frame classes and always given up. I've never found a satisfactory way to store non-trivial things in a data frame column.

See my poorly documented spong package for example: https://github.com/barryrowlingson/spong

BUT... data.table provides a more flexible data grid structure that is very happy to store structured data in columns. Example:

sq=cbind(c(0,1,1,0,0),c(0,0,1,1,0))
polys = list(sq, sq+1, sq+2)
class(polys)="sf"
attr(polys, "CRS")="+init=EPSG:4326"

print.sf=function(x,...){print("Geometry...")}
require(data.table)
d = data.table(polys, name=c("Alpha","Bravo","Charlie"), x=runif(3), y=runif(3))

d now prints like this:

> d
                  polys    name         x         y
1: 0,1,1,0,0,0,0,1,1,0,   Alpha 0.3505690 0.1249515
2: 1,2,2,1,1,1,1,2,2,1,   Bravo 0.5857398 0.5670319
3: 2,3,3,2,2,2,2,3,3,2, Charlie 0.1299579 0.1650072

and the polys column still has its class:

> d$polys
[1] "Geometry..."
data.tables are directly compatible with `dplyr` and `ggplot2` (unlike `sp` classes):

> d %>% filter(name=="Alpha")
                  polys  name        x         y
1: 0,1,1,0,0,0,0,1,1,0, Alpha 0.350569 0.1249515
> ggplot(d, aes(x=x,y=y)) + geom_point()

Usefully, data table row selection preserves attributes:

> a = d %>% filter(name=="Alpha")
> attr(a$polys,"CRS")
[1] "+init=EPSG:4326"

which is something even R's default subsetting doesn't do - it drops as much as it can including the class of the object:

> class(polys)
[1] "sf"
> class(polys[1])
[1] "list"

and hence you waste your life writing [.sf methods that do little more than restore the attributes that R took away in the first place (see for example getAnywhere("[.POSIXct"))

The only basic thing I can't figure out at the moment is how to identify a geometry column within a data.table. We could add an attribute to a data table, but that gets lost on selection - perhaps the data.table authors might like to help with this:

> attr(d,"geom")="polys"
> attr(d[1,],"geom")
NULL

or we define a superclass of data.table and write some methods for that.

SpatialPolygonsDataTable anyone?
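
A minimal sketch of that subclass idea, written against data.frame in base R rather than data.table so it runs without extra packages. The class name geo_frame, its constructor, and the "geom" attribute are hypothetical names chosen for illustration; nothing here is an existing API. The subset method re-attaches the geometry column name after the inherited method has run, and falls back to a plain data.frame when the geometry column has been dropped.

```r
# Hypothetical constructor: tag a data.frame with the name of its geometry column.
geo_frame <- function(df, geom) {
  stopifnot(is.data.frame(df), geom %in% names(df))
  attr(df, "geom") <- geom
  class(df) <- c("geo_frame", setdiff(class(df), "geo_frame"))
  df
}

# Subset method: let the data.frame method do the work, then restore the tag.
"[.geo_frame" <- function(x, ...) {
  geom <- attr(x, "geom")
  ret <- NextMethod()
  if (!is.data.frame(ret)) return(ret)   # e.g. a single column dropped to a vector
  if (geom %in% names(ret)) {
    geo_frame(ret, geom)
  } else {
    # Geometry column was dropped: return a plain data.frame without the tag.
    attr(ret, "geom") <- NULL
    class(ret) <- setdiff(class(ret), "geo_frame")
    ret
  }
}

sq <- cbind(c(0, 1, 1, 0, 0), c(0, 0, 1, 1, 0))
d <- data.frame(name = c("Alpha", "Bravo", "Charlie"), x = c(0.1, 0.2, 0.3))
d$polys <- list(sq, sq + 1, sq + 2)      # plain list column as a stand-in geometry
d <- geo_frame(d, "polys")

a <- d[1:2, c("name", "polys")]          # rows and columns selected together
b <- d[c("polys", "name")]               # list-style column selection
p <- d["name"]                           # geometry column dropped
```

The same pattern should carry over to a data.table subclass, but that is untested here and would need to respect data.table's own `[` semantics.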

@edzer
Member

edzer commented Feb 11, 2016

Great ideas! However, if I construct d by

d = data.frame(name=c("Alpha","Bravo","Charlie"), x=runif(3), y=runif(3))
d$polys= polys

everything else works identically to the above. I would prefer a solution where it doesn't matter whether d is a data.frame or a data.table, but if that isn't possible I'm still not convinced. data.table can do clever things with indexes, but that would require writing custom spatial index code, I'm afraid.

I'm not convinced by your ggplot example, it shows points rather than a polygon; I guess you still have to fortify this.

As for the name of the geometry column, why not always call it geometry and give

class(d) = c("sf", "data.frame")

?

@barryrowlingson
Contributor Author

Possibly... but...

dplyr fails:

> d = data.frame(name=c("Alpha","Bravo","Charlie"), x=runif(3), y=runif(3))
> d$polys= polys
> a = d %>% filter(name=="Alpha")
Error: column 'polys' has unsupported type : sf
> # versus:
> dt = data.table(polys, name=c("Alpha","Bravo","Charlie"), x=runif(3), y=runif(3))
> a = dt %>% filter(name=="Alpha")
> 

and attributes (including class) disappear on subsetting:

> attr(d[1,]$polys,"CRS")
NULL
> # versus
> attr(dt[1,]$polys,"CRS")
[1] "+init=EPSG:4326"

Maybe my version of R is a bit old or something if you are getting different behaviour...

The ggplot example was just for contrast with sp classes where you need to convert to data frame (or extract @data) to do a scatterplot of data values - I'm not convinced by ggplot or ggmap for maps yet!

@edzer
Member

edzer commented Feb 11, 2016

I see none of these issues here; maybe update? With data.frame I even get

> attr(d[1,],"geom")
[1] "polys"

I have

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3      data.table_1.9.6

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 magrittr_1.5    R6_2.1.1        assertthat_0.1 
[5] parallel_3.2.3  tools_3.2.3     DBI_0.3.1       Rcpp_0.12.3    
[9] chron_2.3-47   

@barryrowlingson
Contributor Author

I'm on 3.2.0 - hard to believe a minor point release would change low-level fundamentals like that, but I'll do something for an hour while 3.2.3 compiles and let you know...

@barryrowlingson
Contributor Author

Just found my 3.2.3. Here's a session with R --vanilla that shows data frame attribute dropping; sessionInfo follows:

 sq=cbind(c(0,1,1,0,0),c(0,0,1,1,0))
 polys = list(sq, sq+1, sq+2)
 class(polys)="sf"
 attr(polys, "CRS")="+init=EPSG:4326"

 print.sf=function(x,...){print("Geometry...")}
 d = data.frame(name=c("Alpha","Bravo","Charlie"), x=runif(3), y=runif(3))
 d$polys= polys
 str(d)
# 'data.frame': 3 obs. of  4 variables:
# $ name : Factor w/ 3 levels "Alpha","Bravo",..: 1 2 3
# $ x    : num  0.383 0.395 0.836
# $ y    : num  0.749 0.726 0.522
# $ polys:List of 3
# ..$ : num [1:5, 1:2] 0 1 1 0 0 0 0 1 1 0
# ..$ : num [1:5, 1:2] 1 2 2 1 1 1 1 2 2 1
# ..$ : num [1:5, 1:2] 2 3 3 2 2 2 2 3 3 2
#  ..- attr(*, "class")= chr "sf"
#  ..- attr(*, "CRS")= chr "+init=EPSG:4326"
 str(d[1,])
# 'data.frame': 1 obs. of  4 variables:
# $ name : Factor w/ 3 levels "Alpha","Bravo",..: 1
# $ x    : num 0.383
# $ y    : num 0.749
# $ polys:List of 1
#  ..$ : num [1:5, 1:2] 0 1 1 0 0 0 0 1 1 0

Note loss of attributes on polys element when selecting first row of data frame.

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS

locale:
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8    
 [5] LC_MONETARY=en_GB.utf8    LC_MESSAGES=en_GB.utf8   
 [7] LC_PAPER=en_GB.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

Same behaviour if dplyr is loaded, or R started without --vanilla. Nothing else in my workspace (except all the hair I'm pulling out...)

@edzer
Member

edzer commented Feb 11, 2016

I can see that. It's a different point: your first point was about attributes on d itself, this post is about attributes on a column in d. dplyr::filter seems to keep those, [.data.frame doesn't.

@barryrowlingson
Contributor Author

It was the attributes on the column that I was really initially concerned about, since that's where the metadata for the geometry (CRS, at least) probably ought to be.

Sadly subsetting a list like polys (not the dataframe but the list) drops its attributes anyway, so maybe there's no alternative but to define subset methods for them, and they might pass through into data frame methods. The data.table subset code must have a policy of retaining attributes, whereas data.frame uses R's default which drops them.

Which is more useful to us here?

@rsbivand
Member

Who else should be alerted to this discussion? I feel that the gg* and dplyr infrastructure is important at least to track and to try to see how tmap and mapview mesh on the visualization side.

@rsbivand
Member

I have a feeling that the underlying representation could benefit from thinking out-of-workspace - maybe the geom column could be external pointers to objects in an OGR or GEOM abstraction of SF? How does data.table do external pointers (to indices??)? What spatial index data should be in-memory to speed access to stuff outside?

@edzer
Member

edzer commented Feb 11, 2016

@barryrowlingson : adding

"[.sf" = function(x, i, j, ..., drop=FALSE) { 
    a = attributes(x)
    class(x) = NULL
    ret = x[i]
    attributes(ret) = a
    ret 
}

would be enough.
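
A small base-R check of that method against the toy polys object from earlier in the thread. It is only a sketch: it assumes an unnamed list (restoring a full-length names attribute onto a shorter result would not fit) and it ignores the j and drop arguments, as the method itself does.

```r
# Toy geometry list with a class and a CRS attribute, as earlier in the thread.
sq <- cbind(c(0, 1, 1, 0, 0), c(0, 0, 1, 1, 0))
polys <- list(sq, sq + 1, sq + 2)
class(polys) <- "sf"
attr(polys, "CRS") <- "+init=EPSG:4326"

# Subset method from the comment above: strip the class, subset the bare
# list, then put the saved attributes (class and CRS) back on the result.
"[.sf" <- function(x, i, j, ..., drop = FALSE) {
  a <- attributes(x)
  class(x) <- NULL
  ret <- x[i]
  attributes(ret) <- a
  ret
}

sub <- polys[2:3]               # direct subsetting of the list

d <- data.frame(name = c("Alpha", "Bravo", "Charlie"))
d$polys <- polys
row1 <- d[1, ]                  # row subsetting of the data.frame

attr(sub, "CRS")
attr(row1$polys, "CRS")
```

Row selection on the data.frame picks the method up as well, because `[.data.frame` subsets each column with `[`, which dispatches on the column's class.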

@kendonB
Contributor

kendonB commented Aug 1, 2016

@hadley do tibbles solve all the problems here?

@kendonB
Contributor

kendonB commented Aug 3, 2016

See here for a discussion of list columns using tibbles http://r4ds.had.co.nz/many-models.html#list-columns-1

@edzer
Member

edzer commented Aug 3, 2016

@kendonB which problem do you exactly refer to that hasn't been solved in the current version of the sf package, or the one described here?

@hadley
Contributor

hadley commented Aug 16, 2016

The advantage of using (or extending) tibbles would be to avoid yet another type of data frame. In the linked examples you still have the problem of:

roads = data.frame(widths = c(5, 4.5))
roads$geom = sfc
sf(roads)

Instead of:

roads = data.frame(widths = c(5, 4.5), geom = sfc)
sf(roads)
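
For illustration only, here is how the one-step form can be made to work in base R when the geometry is a plain list. The list geom below is a stand-in for sfc, not the real class, so this says nothing about how sfc objects themselves behave; wrapping in I() is the usual base-R way to tell data.frame() to keep a list as a single column instead of expanding it.

```r
# Two toy line geometries as coordinate matrices (stand-ins, not sfc objects).
p1 <- cbind(c(0, 1), c(0, 1))
p2 <- cbind(c(1, 2), c(1, 2))
geom <- list(p1, p2)

# I() marks the list "as is", so data.frame() stores it as one list column.
roads <- data.frame(widths = c(5, 4.5), geom = I(geom))

names(roads)
class(roads$geom)
```

Tibbles accept list columns directly, which is part of the convenience being argued for here.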

@edzer
Member

edzer commented Aug 17, 2016

This has been solved:

> g = ST_sfc(list(ST_Point(1:2)))
> ST_sf(a=3, g)
  a          g
1 3 POINT(1 2)
> class(ST_sf(a=3,g))
[1] "sf"         "data.frame"

@hadley
Contributor

hadley commented Aug 17, 2016

Is there a reason you don't want to build on top of tibbles?

@edzer
Member

edzer commented Aug 17, 2016

I try to minimize dependencies; I like building on top of code that doesn't change; I don't think that the improvement of tibbles over data.frames is very substantial, and I believe that R users are helped by understanding base R before they try to understand CRAN packages. I cannot instruct tibbles how to print simple features nicely, which I can with data.frames:

> library(sf)
> g = ST_sfc(list(ST_Point(1:2)))
> d = ST_sf(a=3, g)
> tbl_df(d)
Source: local data frame [1 x 2]

      a               g
  <dbl>       <S3: sfc>
1     3 <S3: POINT/sfi>
> d
  a          g
1 3 POINT(1 2)

But I'd be happy to help make simple features work for tbl_df, which should be trivial since they subclass data.frame - just not (yet?) as the default output of ST_sf().

@hadley
Contributor

hadley commented Aug 17, 2016

Would you mind filing an issue on tibble? You should be able to control how tibbles print your objects. (You might be able to already but the docs might need improvement)

@edzer
Member

edzer commented Aug 18, 2016

Sure: see tidyverse/tibble#157

@edzer
Member

edzer commented Oct 23, 2016

See also #25 where simple features are used in data.table objects

@mbacou

mbacou commented Nov 5, 2016

This seems very relevant to me as well from an end user's perspective. For now my code is littered with costly (and error-prone) round trips like:

dt <- data.table(sp@data)
dt[, rn := row.names(sp)]
# [...]
sp <- SpatialPolygonsDataFrame(sp, as.data.frame(dt), match.ID="rn")

I'd love for data.table's by-reference operations and indexing to work more seamlessly with spatial features. I'd also like to use data.table's setkey() and merge() on geometry columns.

Expanding on this idea a bit, I could envision:

dt1 = data.table(sf1=polys1, b1=letters[c(1,2,3,3)], c1=runif(4))
dt2 = data.table(sf2=polys2, b2=letters[c(3,3,2,6)], c2=runif(4))

# with...
sapply(dt1, class)
#   sf1           b1         c1
# "sfc"  "character"  "numeric"

sapply(dt2, class)
#   sf2           b2         c2
# "sfc"  "character"  "numeric"

# create a spatial index on sf columns
setkey(dt1, sf1)
setkey(dt2, sf2)

# Return attributes of dt2 at locations of dt1 (e.g. where the geometries intersect or overlap)
dt2[dt1]

# Set attributes of dt2 to attributes of dt1 at locations of dt1
dt2[dt1, c2 := c1]

# And the usual st_* correlates
dt2[st_touches(dt1)]
dt2[st_covers(dt1)]
dt2[st_contains(dt1)]
dt2[st_within_distance(dt1, 10)]
# etc...

# And union operations on geometry columns within a data.table
dt1[, .(st_union(sf1), sum(c2)), by=b2]

@edzer
Copy link
Member

edzer commented Feb 20, 2017

I think this issue has now been settled -- we now extend data.frame as well as tibble, depending on where you start. Feel free to reopen if needed.

@edzer edzer closed this as completed Feb 20, 2017