ENH: Method to sample points randomly from within geometries (#2860)

geopandas · May 1, 2023 · e8ddf25
1 parent 35f7004
commit e8ddf25
Showing 15 changed files with 573 additions and 7 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -20,8 +20,10 @@ New features and improvements:
 - Added a ``to_wgs84`` keyword to ``to_json`` allowing automatic re-projecting to follow
   the 2016 GeoJSON specification (#416).
 - ``to_json`` output now includes a ``"crs"`` field if the CRS is not the default WGS84 (#1774).
-- Improve error messages when accessing the `geometry` attribute of GeoDataFrame without an active geometry column 
+- Improve error messages when accessing the `geometry` attribute of GeoDataFrame without an active geometry column
   related to the default name `"geometry"` being provided in the constructor (#2577)
+- Added ``sample_points`` method to sample random points from Polygon or LineString
+  geometries (#2860).
 
 Deprecations and compatibility notes:
 

diff --git a/ci/envs/310-latest-conda-forge.yaml b/ci/envs/310-latest-conda-forge.yaml
@@ -23,6 +23,7 @@ dependencies:
   - xyzservices
   - scipy
   - geopy
+  - pointpats
   # installed in tests.yaml, because not available on windows
   # - postgis
   - SQLalchemy<2

diff --git a/ci/envs/311-latest-conda-forge.yaml b/ci/envs/311-latest-conda-forge.yaml
@@ -24,6 +24,7 @@ dependencies:
   - xyzservices
   - scipy
   - geopy
+  - pointpats
   # installed in tests.yaml, because not available on windows
   # - postgis
   - SQLalchemy>=2

diff --git a/ci/envs/38-latest-conda-forge.yaml b/ci/envs/38-latest-conda-forge.yaml
@@ -28,3 +28,4 @@ dependencies:
   - libspatialite
   - pyarrow
   - pyogrio
+  - pointpats
diff --git a/ci/envs/39-latest-conda-forge.yaml b/ci/envs/39-latest-conda-forge.yaml
@@ -26,6 +26,7 @@ dependencies:
   - xyzservices
   - scipy
   - geopy
+  - pointpats
   # installed in tests.yaml, because not available on windows
   # - postgis
   - SQLalchemy<2

diff --git a/doc/environment.yml b/doc/environment.yml
@@ -40,6 +40,7 @@ dependencies:
   - pygeos
   - xyzservices
   - packaging
+  - pointpats
   - pip
   - pip:
       - sphinx-toggleprompt

diff --git a/doc/source/docs/reference/geoseries.rst b/doc/source/docs/reference/geoseries.rst
@@ -91,6 +91,7 @@ Constructive methods and attributes
    GeoSeries.make_valid
    GeoSeries.minimum_bounding_circle
    GeoSeries.normalize
+   GeoSeries.sample_points
    GeoSeries.simplify
 
 Affine transformations

diff --git a/doc/source/docs/user_guide.rst b/doc/source/docs/user_guide.rst
@@ -21,3 +21,4 @@ Advanced topics can be found in the :doc:`Advanced Guide <advanced_guide>` and f
   Aggregation with dissolve <user_guide/aggregation_with_dissolve>
   Merging data <user_guide/mergingdata>
   Geocoding <user_guide/geocoding>
+  Sampling points <user_guide/sampling>
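
Background for the "Sampling points" guide added to the toctree above: a minimal, standard-library sketch of one common way to draw uniform random points inside a polygon, namely rejection sampling from the bounding box. This is an illustration under stated assumptions, not the implementation used by `sample_points`; the helper names `point_in_polygon` and `sample_in_polygon` are hypothetical and are not part of GeoPandas.

```python
import random


def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_in_polygon(ring, size, seed=None):
    """Draw `size` points uniformly inside `ring` (hypothetical helper).

    Candidates are drawn uniformly from the bounding box and kept only
    if they fall inside the polygon.
    """
    rng = random.Random(seed)
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < size:
        x = rng.uniform(xmin, xmax)
        y = rng.uniform(ymin, ymax)
        if point_in_polygon(x, y, ring):
            points.append((x, y))
    return points


# An L-shaped polygon: the square [0, 4] x [0, 4] minus the notch x > 1, y > 1.
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
samples = sample_in_polygon(l_shape, size=200, seed=0)
```

Rejection sampling is simple and unbiased, but it wastes candidates when the polygon fills only a small share of its bounding box.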
diff --git a/doc/source/docs/user_guide/sampling.ipynb b/doc/source/docs/user_guide/sampling.ipynb
@@ -0,0 +1,268 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "f02b5a40-29b6-4d46-abb5-f84d5ee4da56",
+   "metadata": {},
+   "source": [
+    "# Sampling Points"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "f2fa64d9-7781-4357-8381-6ff64eff7379",
+   "metadata": {},
+   "source": [
+    "Learn how to sample random points using GeoPandas. \n",
+    "\n",
+    "The example below shows you how to sample random locations from shapes in GeoPandas GeoDataFrames."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae0cc935-8940-4cfb-9a62-d3174fc77687",
+   "metadata": {},
+   "source": [
+    "## Import Packages\n",
+    "\n",
+    "To begin with, we need to import packages we'll use: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1b3b0e6e-221c-4b8e-baf8-92bf07e806ea",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import geopandas\n",
+    "import geodatasets"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "9e5ee97a-6647-4686-a3a0-f3dfb7228cd1",
+   "metadata": {},
+   "source": [
+    "For this example, we will use the New York Borough example data (`nybb`) provided by geodatasets. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d013124f-4af5-4380-bf45-aa5fd4887c63",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nybb = geopandas.read_file(geodatasets.get_path(\"nybb\"))\n",
+    "# simplify geometry to save space when rendering many interactive maps\n",
+    "nybb.geometry = nybb.simplify(200) "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "d4703589-f6b5-46a3-9540-ec9a52716747",
+   "metadata": {},
+   "source": [
+    "To see what this looks like, visualize the data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ec65cc25-2baa-431f-adea-9275c231ac47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nybb.explore()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "99883485-2e67-4da8-a3d8-d4724ef8b2a1",
+   "metadata": {},
+   "source": [
+    "## Sampling random points\n",
+    "\n",
+    "To sample points from within a GeoDataFrame, use the `sample_points()` method.\n",
+    "To specify the sample sizes, provide an explicit number of points to sample. For example, we can sample 200 points randomly from each feature: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "76d8658d-7ee8-4ef0-84b4-b8883c921687",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "n200_sampled_points = nybb.sample_points(200)\n",
+    "m = nybb.explore()\n",
+    "n200_sampled_points.explore(m=m, color='red')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14b8628f-3d5e-4fbb-b26e-28cc042ff755",
+   "metadata": {},
+   "source": [
+    "This functionality also works for line geometries. For example, let's look only at the boundary of Manhattan Island:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c47472fc-7ca5-4f69-b86f-93c25a9f2b03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "manhattan_parts = nybb.iloc[[3]].explode(ignore_index=True)\n",
+    "manhattan_island = manhattan_parts.iloc[[30]]\n",
+    "manhattan_island.boundary.explore()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "54280acd-e449-4528-ac7c-c1b2dbbcdc2f",
+   "metadata": {},
+   "source": [
+    "To sample randomly along this boundary, use the same `sample_points()` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "358822ef-a0c6-40cd-9e35-9a688c56f361",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "manhattan_border_points = manhattan_island.boundary.sample_points(200)\n",
+    "m = manhattan_island.explore()\n",
+    "manhattan_border_points.explore(m=m, color='red')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fa0f5a05-e7cc-44cf-b6ac-40afd9753f23",
+   "metadata": {},
+   "source": [
+    "Keep in mind that sampled points are returned as a single multi-part geometry, and that the distances over the line segments are calculated *along* the line. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b0ccad29-d29f-4da1-9151-692bfd20d533",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "manhattan_border_points"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "125f19d9-b3f0-4ab9-b44a-41be1a8cc388",
+   "metadata": {},
+   "source": [
+    "If you want to separate out the individual sampled points, use the `.explode()` method on the result:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "265f2194-a94f-4a3d-9ae8-f7da55893e90",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "manhattan_border_points.explode(ignore_index=True).head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "21a1a9d5",
+   "metadata": {},
+   "source": [
+    "## Variable number of points\n",
+    "\n",
+    "You can also sample a different number of points from each geometry by passing an array that specifies the sample size per geometry."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b76e4cd8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "variable_size = nybb.sample_points([10, 50, 100, 200, 500])\n",
+    "m = nybb.explore()\n",
+    "variable_size.explore(m=m, color='red')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7f041b05-7c12-4a2c-a360-35bc5c79f1f4",
+   "metadata": {},
+   "source": [
+    "## Sampling from more complicated point pattern processes\n",
+    "\n",
+    "Finally, the `sample_points()` method can use sampling processes other than those described above, so long as they are implemented in the `pointpats` package for spatial point pattern analysis. For example, a \"cluster-poisson\" process is a spatially random cluster process in which the \"seeds\" of clusters are chosen randomly, and points are then distributed randomly around those seeds. \n",
+    "\n",
+    "To see what this looks like, consider the following, where ten points will be distributed around each of five seeds within each of the boroughs in New York City:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7868221c-ad0a-44e6-9f2f-41c4d6ac0fcf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sample_t = nybb.sample_points(method='cluster_poisson', size=50, n_seeds=5, cluster_radius=7500)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "abd12d6e-533d-4808-86d7-9df4c097c077",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "m = nybb.explore()\n",
+    "sample_t.explore(m=m, color='red')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "geopandas_dev",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "b1fe2ae8565152c84d3dbd08488d3746f754c9bdf2de9b61cf939da5306d3793"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
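
The notebook above notes that, for line geometries, distances are calculated *along* the line. A minimal standard-library sketch of what that can mean: pick a uniform distance between 0 and the total length, then interpolate the point at that distance. The function `sample_along_line` is a hypothetical helper for illustration only, and this is not claimed to be the code path GeoPandas uses.

```python
import bisect
import math
import random


def sample_along_line(coords, size, seed=None):
    """Draw `size` points uniformly by length along a polyline (hypothetical helper)."""
    rng = random.Random(seed)
    # Cumulative length measured at each vertex, starting from 0.
    cumulative = [0.0]
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x2 - x1, y2 - y1))
    total = cumulative[-1]

    points = []
    for _ in range(size):
        d = rng.uniform(0.0, total)
        # Locate the segment that contains distance d along the line.
        i = bisect.bisect_right(cumulative, d) - 1
        i = min(i, len(coords) - 2)  # guard against d == total
        seg_len = cumulative[i + 1] - cumulative[i]
        t = (d - cumulative[i]) / seg_len if seg_len else 0.0
        (x1, y1), (x2, y2) = coords[i], coords[i + 1]
        points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return points


# A polyline with one long segment (length 9) and one short segment (length 1).
line = [(0, 0), (9, 0), (9, 1)]
samples = sample_along_line(line, size=1000, seed=0)
```

Because the distance is drawn uniformly over the total length, longer segments receive proportionally more points: here roughly nine in ten samples are expected on the first segment.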
diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst
@@ -150,6 +150,7 @@ Further, optional dependencies are:
 - `psycopg2`_ (optional; for PostGIS connection)
 - `GeoAlchemy2`_ (optional; for writing to PostGIS)
 - `geopy`_ (optional; for geocoding)
+- `pointpats`_ (optional; for advanced point sampling)
 
 
 For plotting, these additional packages may be used:
@@ -267,4 +268,6 @@ More specifically, whether the speedups are used or not is determined by:
 
 .. _PyGEOS: https://github.com/pygeos/pygeos/
 
-.. _packaging: https://packaging.pypa.io/en/latest/
+.. _packaging: https://packaging.pypa.io/en/latest/
+
+.. _pointpats: https://pointpats.readthedocs.io/en/latest/
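
Since `pointpats` is listed as optional, calling code may want to check for it before requesting an advanced sampling method. The sketch below shows one generic way to do that with the standard library. The helper names are hypothetical, and the assumption that only non-uniform methods need `pointpats` is inferred from the "advanced point sampling" wording above, not confirmed by this diff.

```python
import importlib.util


def has_optional_dependency(name):
    """Return True if the top-level module `name` is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_sampling_method(requested="uniform"):
    """Validate a requested method name (hypothetical helper).

    Assumption: only methods other than "uniform" require pointpats.
    """
    if requested != "uniform" and not has_optional_dependency("pointpats"):
        raise ImportError(
            f"method={requested!r} needs the optional 'pointpats' package"
        )
    return requested


method = choose_sampling_method("uniform")
```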
diff --git a/environment-dev.yml b/environment-dev.yml
@@ -40,3 +40,6 @@ dependencies:
     - mapclassify
     # spatial access methods
     - rtree>=0.9
+    # point sampling
+    - pointpats
+
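
To make the cluster process described in the user guide concrete, here is a standard-library sketch under simplifying assumptions: seeds are drawn uniformly in a rectangle, the sample is split evenly across seeds, and each point is drawn uniformly within `cluster_radius` of its seed. The function `cluster_poisson_sketch` is hypothetical. It does not call `pointpats`, does not clip points to a polygon, and may differ from the `pointpats` implementation in its details.

```python
import math
import random


def cluster_poisson_sketch(bounds, size, n_seeds, cluster_radius, seed=None):
    """Return (seeds, points) for a simple clustered point pattern.

    bounds is (xmin, ymin, xmax, ymax); size is the total number of points,
    split evenly across the seeds (an assumption of this sketch).
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    seeds = [
        (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n_seeds)
    ]
    per_seed = size // n_seeds
    points = []
    for sx, sy in seeds:
        for _ in range(per_seed):
            # The square root makes points uniform over the disc's area
            # rather than concentrated near the seed.
            r = cluster_radius * math.sqrt(rng.random())
            theta = rng.uniform(0.0, 2.0 * math.pi)
            points.append((sx + r * math.cos(theta), sy + r * math.sin(theta)))
    return seeds, points


# Mirrors the notebook's parameters: 50 points around 5 seeds, radius 7500.
seeds, points = cluster_poisson_sketch(
    bounds=(0, 0, 100_000, 100_000), size=50, n_seeds=5, cluster_radius=7500, seed=0
)
```

With `size=50` and `n_seeds=5`, each seed receives ten points, which matches the description in the notebook.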