Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversion between multipart and singlepart #199

Closed
wgrundlingh opened this issue Jan 31, 2017 · 9 comments
Closed

Conversion between multipart and singlepart #199

wgrundlingh opened this issue Jan 31, 2017 · 9 comments

Comments

@wgrundlingh
Copy link

Singlepart shapefiles are known to perform better with complex operations (like intersection) compared to multipart shapefiles. Could conversions of this be implemented in sf?

Perhaps the following interface:

  • st_to_singlepart(x, ...)
  • st_to_multipart(x, fieldA, fieldB, ...)
@edzer
Copy link
Member

edzer commented Jan 31, 2017

Could you please provide a minimal reproducible example that supports your issue? I am completely unfamiliar with singlepart or multipart shapefiles - sf reads shapefiles through GDAL, that's it.

@wgrundlingh
Copy link
Author

@edzer A minimal example that supports the difference in performance, or that highlights what is meant by singlepart and/or multipart? Or perhaps both?

@edzer
Copy link
Member

edzer commented Jan 31, 2017

both.

@wgrundlingh
Copy link
Author

Here is a minimal example, with the shape files mentioned below downloadable from this link. Original shape file links are also given below:

library('sf') # To manage simple feature
library('tictoc') # For timing

source <- 'C:/TEMP/sf_test/' # Source folder for shape files

pointfile <- st_read(
  paste(source, 'lda_000b16a_e_pointfile.shp', sep=''))
polygonfile.s <- st_read(
  paste(source, 'lfed000b16a_e_singlepart.shp', sep=''))
polygonfile.m <- st_read(
  paste(source, 'lfed000b16a_e_multipart.shp', sep=''))

# Spatial join: Point file with singlepart polygon file
tic('Singlepart intersection')
intersect_output <- st_intersects(pointfile, polygonfile.s)
toc()
# Spatial join: Point file with multipart polygon file
tic('Multipart intersection')
intersect_output <- st_intersects(pointfile, polygonfile.m)
toc()

There are three shape files used:

  1. lfed000b16a_e_singlepart.shp
  2. lfed000b16a_e_multipart.shp
  3. lda_000b16a_e_pointfile.shp

(1) and (2) represent polygon files from the cartographic version of Statistics Canada's 2016 Census Federal Electoral Districts (2013 Representation order). I downloaded the shape files (which are multipart by default) and created a singlepart version using QGIS's native Multipart to singlepart tool.

(3) represents a point file from the cartographic version of Statistics Canada's 2016 Census Dissemination Areas. I downloaded the shape file (which is multipart by default) and extracted their centroids using QGIS.

In the above case, the intersection between the two layers occurred 40 times faster when working with singlepart polygons (40 seconds compared to ~30 minutes):

> # Spatial join: Point file with singlepart polygon file
> tic('Singlepart intersection')
> intersect_output <- st_intersects(pointfile, polygonfile.s)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Singlepart intersection: 38.04 sec elapsed
> # Spatial join: Point file with multipart polygon file
> tic('Multipart intersection')
> intersect_output <- st_intersects(pointfile, polygonfile.m)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Multipart intersection: 1806.77 sec elapsed

Reference: ARCMap help on Multipart to Singlepart

@edzer
Copy link
Member

edzer commented Feb 2, 2017

Thanks, now I understand what you meant by multipart and singlepart shapefiles. I got used to calling these MULTIPOLYGON and POLYGON geometries.

You can convert from MULTIPOLYGON to POLYGON by

x = st_cast(polygonfile.m, "POLYGON")

Interestingly, the difference in timing is much more modest if you do this for 1000, 2000, 3000 points; for 7000 it seems to explode; I have no clue why. I recently added the option for prepared geometries, and this brings down the computing time quite dramatically for this case:

> tic('Singlepart intersection, prepared')
> intersect_output <- st_intersects(pointfile, polygonfile.s, prepared = TRUE)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Singlepart intersection, prepared: 2.242 sec elapsed
> tic('Multipart intersection, prepared')
> intersect_output <- st_intersects(pointfile, polygonfile.m, prepared = TRUE)
although coordinates are longitude/latitude, it is assumed that they are planar
> toc()
Multipart intersection, prepared: 5.084 sec elapsed

@edzer edzer closed this as completed Feb 2, 2017
@edzer edzer reopened this Feb 2, 2017
@edzer
Copy link
Member

edzer commented Feb 2, 2017

Your shapefiles have btw a wrong projection file: they suggest long/lat WGS84, but the coordinate values suggest otherwise.

More dramatic improvements with prepared = TRUE but reversed arguments:

> st_crs(polygonfile.m) = NA
> st_crs(polygonfile.s) = NA
> st_crs(pointfile) = NA
> 
> # Spatial join: Point file with singlepart polygon file
> tic('Singlepart intersection')
> out.s = intersect_output <- st_intersects(pointfile, polygonfile.s, prepared = TRUE)
> toc()
Singlepart intersection: 2.194 sec elapsed
> # Spatial join: Point file with multipart polygon file
> tic('Multipart intersection')
> out.m = intersect_output <- st_intersects(pointfile, polygonfile.m, prepared = TRUE)
> toc()
Multipart intersection: 5.047 sec elapsed
> 
> tic('Singlepart intersection - x')
> x = intersect_output <- st_intersects(polygonfile.s, pointfile, prepared = TRUE)
> toc()
Singlepart intersection - x: 0.843 sec elapsed
> all.equal(out.s, sp:::.invert(x, nrow(polygonfile.s), nrow(pointfile)))
[1] TRUE
> # Spatial join: Point file with multipart polygon file
> tic('Multipart intersection - x')
> x = intersect_output <- st_intersects(polygonfile.m, pointfile, prepared = TRUE)
> toc()
Multipart intersection - x: 0.426 sec elapsed
> all.equal(out.m, sp:::.invert(x, nrow(polygonfile.m), nrow(pointfile)))
[1] TRUE

@rsbivand do you see any disadvantage if I make prepared=TRUE the default? Many users will otherwise never find out, I fear. Optimizing the argument order would be a second issue.

@wgrundlingh
Copy link
Author

@edzer Thanks for looking into this.

I'd imagine argument order is important since the intersection of X with Y might be different from the intersection of Y with X in general. The first argument is usually kept intact with the second argument joined. The result therefore has the same number of rows (or features) as X. Swapping them around will lead to a different result.

@edzer
Copy link
Member

edzer commented Feb 2, 2017

Yes, but as you see in my example above, this different result is easily converted in the result wanted (sp:::.invert is a 6 liner), and this would be done without the user knowing.

Still, we'd needs some heuristics as of when to do this; number of features is clearly not the right thing to look at.

@edzer
Copy link
Member

edzer commented Feb 5, 2017

Wrote a new issue about which geometry to prepare; closing this one.

@edzer edzer closed this as completed Feb 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants