Conversion between multipart and singlepart #199

wgrundlingh · 2017-01-31T20:34:14Z

Singlepart shapefiles are known to perform better with complex operations (like intersection) compared to multipart shapefiles. Could conversions of this be implemented in sf?

Perhaps the following interface:

st_to_singlepart(x, ...)
st_to_multipart(x, fieldA, fieldB, ...)

edzer · 2017-01-31T20:46:42Z

Could you please provide a minimal reproducible example that supports your issue? I am completely unfamiliar with singlepart or multipart shapefiles - sf reads shapefiles through GDAL, that's it.

wgrundlingh · 2017-01-31T20:57:57Z

@edzer A minimal example that supports the difference in performance, or that highlights what is meant by singlepart and/or multipart? Or perhaps both?

edzer · 2017-01-31T21:08:15Z

both.

wgrundlingh · 2017-02-01T22:16:08Z

Here is a minimal example, with the shape files mentioned below downloadable from this link. Original shape file links are also given below:

library('sf') # To manage simple feature
library('tictoc') # For timing

source <- 'C:/TEMP/sf_test/' # Source folder for shape files

pointfile <- st_read(
  paste(source, 'lda_000b16a_e_pointfile.shp', sep=''))
polygonfile.s <- st_read(
  paste(source, 'lfed000b16a_e_singlepart.shp', sep=''))
polygonfile.m <- st_read(
  paste(source, 'lfed000b16a_e_multipart.shp', sep=''))

# Spatial join: Point file with singlepart polygon file
tic('Singlepart intersection')
intersect_output <- st_intersects(pointfile, polygonfile.s)
toc()
# Spatial join: Point file with multipart polygon file
tic('Multipart intersection')
intersect_output <- st_intersects(pointfile, polygonfile.m)
toc()

There are three shape files used:

lfed000b16a_e_singlepart.shp
lfed000b16a_e_multipart.shp
lda_000b16a_e_pointfile.shp

(1) and (2) represent polygon files from the cartographic version of Statistics Canada's 2016 Census Federal Electoral Districts (2013 Representation order). I downloaded the shape files (which are multipart by default) and created a singlepart version using QGIS's native Multipart to singlepart tool.

(3) represents a point file from the cartographic version of Statistics Canada's 2016 Census Dissemination Areas. I downloaded the shape file (which is multipart by default) and extracted their centroids using QGIS.

In the above case, the intersection between the two layers occurred 40 times faster when working with singlepart polygons (40 seconds compared to ~30 minutes):

> # Spatial join: Point file with singlepart polygon file
> tic('Singlepart intersection')
> intersect_output <- st_intersects(pointfile, polygonfile.s)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Singlepart intersection: 38.04 sec elapsed
> # Spatial join: Point file with multipart polygon file
> tic('Multipart intersection')
> intersect_output <- st_intersects(pointfile, polygonfile.m)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Multipart intersection: 1806.77 sec elapsed

Reference: ARCMap help on Multipart to Singlepart

edzer · 2017-02-02T08:50:22Z

Thanks, now I understand what you meant by multipart and singlepart shapefiles. I got used to calling these MULTIPOLYGON and POLYGON geometries.

You can convert from MULTIPOLYGON to POLYGON by

x = st_cast(polygonfile.m, "POLYGON")

Interestingly, the difference in timing is much more modest if you do this for 1000, 2000, 3000 points; for 7000 it seems to explode; I have no clue why. I recently added the option for prepared geometries, and this brings down the computing time quite dramatically for this case:

> tic('Singlepart intersection, prepared')
> intersect_output <- st_intersects(pointfile, polygonfile.s, prepared = TRUE)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Singlepart intersection, prepared: 2.242 sec elapsed
> tic('Multipart intersection, prepared')
> intersect_output <- st_intersects(pointfile, polygonfile.m, prepared = TRUE)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Multipart intersection, prepared: 5.084 sec elapsed

edzer · 2017-02-02T15:58:56Z

Your shapefiles have btw a wrong projection file: they suggest long/lat WGS84, but the coordinate values suggest otherwise.

More dramatic improvements with prepared = TRUE but reversed arguments:

> st_crs(polygonfile.m) = NA
> st_crs(polygonfile.s) = NA
> st_crs(pointfile) = NA
> 
> # Spatial join: Point file with singlepart polygon file
> tic('Singlepart intersection')
> out.s = intersect_output <- st_intersects(pointfile, polygonfile.s, prepared = TRUE)
> toc()
Singlepart intersection: 2.194 sec elapsed
> # Spatial join: Point file with multipart polygon file
> tic('Multipart intersection')
> out.m = intersect_output <- st_intersects(pointfile, polygonfile.m, prepared = TRUE)
> toc()
Multipart intersection: 5.047 sec elapsed
> 
> tic('Singlepart intersection - x')
> x = intersect_output <- st_intersects(polygonfile.s, pointfile, prepared = TRUE)
> toc()
Singlepart intersection - x: 0.843 sec elapsed
> all.equal(out.s, sp:::.invert(x, nrow(polygonfile.s), nrow(pointfile)))
[1] TRUE
> # Spatial join: Point file with multipart polygon file
> tic('Multipart intersection - x')
> x = intersect_output <- st_intersects(polygonfile.m, pointfile, prepared = TRUE)
> toc()
Multipart intersection - x: 0.426 sec elapsed
> all.equal(out.m, sp:::.invert(x, nrow(polygonfile.m), nrow(pointfile)))
[1] TRUE

@rsbivand do you see any disadvantage if I make prepared=TRUE the default? Many users will otherwise never find out, I fear. Optimizing the argument order would be a second issue.

wgrundlingh · 2017-02-02T16:19:13Z

@edzer Thanks for looking into this.

I'd imagine argument order is important since the intersection of X with Y might be different from the intersection of Y with X in general. The first argument is usually kept intact with the second argument joined. The result therefore has the same number of rows (or features) as X. Swapping them around will lead to a different result.

edzer · 2017-02-02T16:28:38Z

Yes, but as you see in my example above, this different result is easily converted in the result wanted (sp:::.invert is a 6 liner), and this would be done without the user knowing.

Still, we'd needs some heuristics as of when to do this; number of features is clearly not the right thing to look at.

edzer · 2017-02-05T16:58:12Z

Wrote a new issue about which geometry to prepare; closing this one.

edzer closed this as completed Feb 2, 2017

edzer reopened this Feb 2, 2017

edzer mentioned this issue Feb 5, 2017

Which geometry should be prepared? #205

Closed

edzer closed this as completed Feb 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conversion between multipart and singlepart #199

Conversion between multipart and singlepart #199

wgrundlingh commented Jan 31, 2017

edzer commented Jan 31, 2017

wgrundlingh commented Jan 31, 2017

edzer commented Jan 31, 2017

wgrundlingh commented Feb 1, 2017

edzer commented Feb 2, 2017 •

edited

Loading

edzer commented Feb 2, 2017

wgrundlingh commented Feb 2, 2017

edzer commented Feb 2, 2017

edzer commented Feb 5, 2017

Conversion between multipart and singlepart #199

Conversion between multipart and singlepart #199

Comments

wgrundlingh commented Jan 31, 2017

edzer commented Jan 31, 2017

wgrundlingh commented Jan 31, 2017

edzer commented Jan 31, 2017

wgrundlingh commented Feb 1, 2017

edzer commented Feb 2, 2017 • edited Loading

edzer commented Feb 2, 2017

wgrundlingh commented Feb 2, 2017

edzer commented Feb 2, 2017

edzer commented Feb 5, 2017

edzer commented Feb 2, 2017 •

edited

Loading