Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for parallel tile downloads and control of cache #217

Merged
merged 6 commits into from Jun 12, 2023

Conversation

JacobJeppesen
Copy link
Contributor

@JacobJeppesen JacobJeppesen commented May 15, 2023

This pull request is a fix for #215

I have added support for parallel tile downloads in the bounds2raster and bounds2img functions. It gives some quite significant speed improvements, with minor changes to the code. I wasn't sure were to put max_num_parallel_tile_downloads = 32, so for now it is on line 227 in tile.py (i.e., inside the bounds2img function code). Let me know if there is anything you would like to have changed.

Thanks for an awesome package btw.! 😃

…hile also caching the downloaded tiles. The solution was to use parallel processes instead of threads.
@JacobJeppesen
Copy link
Contributor Author

Just made a small change as downloads occasionally failed due a memory error when using threads for the parallel downloads. It was caused by a clash with the memory caching from joblib to cache the _fetch_tile() function, and was fixed by using processes instead of threads for the parallel download.

The processes take ~0.5s to spawn, so this added a minor delay before downloading starts, but it is still significantly faster compared to the for-loop implementation.

@ljwolf
Copy link
Member

ljwolf commented May 23, 2023

I wonder whether any useful default max_num_parallel_downloads will break some services' TOS?

For example, OSM asks you to restrict yourself to two download connections at a time.. I've raised TOS-violating defaults before w.r.t. caching in #202, but I'm not sure what the maintainers (@darribas, @martinfleis) think? I'm fine with letting the user be the "violator," but we should probably document that the defaults set in the package may violate some TOS for some providers.

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ljwolf's comments are valid here.

I am fine with allowing parallel downloads as there are use case (paid tiles...) where this is perfectly valid use case that is in no conflict with TOS. I would just probably default to 1, to ensure that a user does not violate anything without knowing it.

@@ -74,6 +75,7 @@ def bounds2raster(
ll=False,
wait=0,
max_retries=2,
num_parallel_tile_downloads=16,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
num_parallel_tile_downloads=16,
n_connections=16,

Something shorter like this would be preferable.

… it to default value of 1. Added different n_connections values when testing the bounds2img() function.
@JacobJeppesen
Copy link
Contributor Author

Good point. I've just made a new commit where the default value is changed to 1. The num_parallel_tile_downloads and max_num_parallel_downloads variables were also renamed to n_connections and max_connections, respectively, and I cleaned up the code a bit.

I also added tests of raise ValueError("n_connections must be between 1 and {max_connections}") in bounds2img() to address the CodeCov errors in the checks.

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question on the max connections value, apart from that it is ready to go. Thanks!

tiles = list(mt.tiles(w, s, e, n, [zoom]))
tile_urls = [provider.build_url(x=tile.x, y=tile.y, z=tile.z) for tile in tiles]
# download tiles
max_connections = 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this specific upper limit? I am happy leaving that a responsibility of a user.

…added a parameter to disable caching, which is useful in resource constrained environments when using parallel connections for download.
@JacobJeppesen
Copy link
Contributor Author

Yeah, agree that hardcoding max_connections wasn't ideal. I couldn't figure out a good place to put it, but I think there should be a default limit set, so it is clear to the user to see that setting n_connections too high can put excess stress on the tile server. I've just made a new commit where I've added it as a parameter. That is one solution, but it could also just be a warning instead of an error. Or it could it be omitted entirely, and the documentation for n_connections could just state that the user should be careful with it.

In the same commit, I also added a parameter to disable the tile caching, as that makes using parallel connections in resource constrained environments much faster. I.e., it took a lot of time to spawn the processes in small serverless functions, where disabling the cache so the parallel download can be done with threads avoids this. I wasn't sure whether this should be added as a parameter though (or if it should be added at all).

So, in general I wasn't sure whether these changes were in line with the thoughts behind Contextily 😅 Let me know what you think, and then we might need another round of changes before its ready 🙂

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or it could it be omitted entirely, and the documentation for n_connections could just state that the user should be careful with it.

That would be my preference.

Disabling caching is good, it resolves #202 :)

contextily/tile.py Outdated Show resolved Hide resolved
@JacobJeppesen
Copy link
Contributor Author

Sounds good :)

I've just made a new commit based on the comments. The max_connections parameter is gone, and I've updated the docstrings with more details for n_connections and disable_cache.

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good to me!

@ljwolf do you think this will be a good solution for caching?

@martinfleis martinfleis changed the title Added support for parallel tile downloads Added support for parallel tile downloads and control of cache Jun 9, 2023
@ljwolf
Copy link
Member

ljwolf commented Jun 9, 2023

I think so, this is great!

I would make the option use_cache=True, rather than disable_cache=False, since a double negative (disable and default to False) can be confusing, but that's the only thing I'd change.

@martinfleis
Copy link
Member

@JacobJeppesen Can you change the keyword per @ljwolf's suggestion? It makes sense.

…bounds2img() function parameters to avoid using double negative
@JacobJeppesen
Copy link
Contributor Author

Yeah, I agree that the double negative should be avoided. I've changed it to use_cache=True in the newest commit.

Let me know if there are any more changes needed :)

@martinfleis martinfleis merged commit 4ef4a67 into geopandas:main Jun 12, 2023
10 checks passed
@martinfleis
Copy link
Member

Thanks @JacobJeppesen!

@JacobJeppesen
Copy link
Contributor Author

Thanks to you too! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants