# Polygon Matching Walkthrough

In this notebook we show how to use our `PolygonMatcher` class to find areas where existing conservation area boundaries do not match polygons from other sources such as OpenStreetMap or OS Zoomstack.

This is done by 'snapping' our existing boundary to nearby polygon features such as woods, residential areas and similar. We produce a new boundary polygon and measure where this disagrees with the original. 

In [None]:
import urllib.parse

import geopandas as gpd
import numpy as np
import osmnx as ox
import polars as pl
import requests
from brdr.enums import OpenbaarDomeinStrategy
from geopandas import GeoDataFrame
from polars import DataFrame
from shapely.wkt import loads

from data_quality_utils.polygon.plotting import (
    get_plotting_polygons,
    plot_area_with_sliders,
)
from data_quality_utils.polygon.polygon_matcher import PolygonMatcher

In [None]:
datasette_base_url = "https://datasette.planning.data.gov.uk/conservation-area.csv"

query = """
select * 
from entity
"""
encoded_query = urllib.parse.urlencode({"sql": query})

r = requests.get(f"{datasette_base_url}?{encoded_query}", auth=("user", "pass"))
r.raise_for_status()  # fail early rather than writing an error page to disk

filename = "datasette_data.csv"
with open(filename, "wb") as f_out:
    f_out.write(r.content)

data = pl.read_csv(filename)
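As a quick check of what the request above sends, the sketch below rebuilds the query URL and decodes it again. The `build_datasette_url` helper and the `limit 5` clause are illustrative assumptions, not part of the notebook; limiting the rows is simply a convenient way to try a query without downloading the whole table.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_datasette_url(base_url: str, sql: str) -> str:
    """Append a URL-encoded ``sql`` parameter to a Datasette CSV endpoint."""
    return f"{base_url}?{urlencode({'sql': sql})}"


# Hypothetical smaller query, used here only for illustration.
example_sql = "select * from entity limit 5"
example_url = build_datasette_url(
    "https://datasette.planning.data.gov.uk/conservation-area.csv", example_sql
)

# Decoding the query string should give back the SQL we started with.
decoded_sql = parse_qs(urlsplit(example_url).query)["sql"][0]
print(example_url)
print(decoded_sql)
```

Spaces and the `*` are percent-encoded in the URL, which is why the query is passed through `urlencode` rather than pasted into the address directly.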

## Polygon Matcher class

We initialise our class here for our workflow, with parameters that control how sensitive and far-reaching the matcher is. `polygon_snap_distance` is the distance (m) within which the new boundary will consider snapping to polygons, so a larger value results in more extreme boundary changes. `overlap_sensitivity` is a percentage that controls how sensitive the snapping algorithm is to the base features; lower is more sensitive. `snapping_strategy` is a parameter for our snapping library, brdr, that determines how the original boundary changes with the base features. From the brdr documentation:

`EXCLUDE`: Completely exclude everything that is not on the reference layer

`AS_IS`: All parts that are not covered by the reference layer are added to the resulting geometry AS IS

`SNAP_INNER_SIDE`: Everything that falls within the relevant distance over the plot boundary is snapped to the plot. The outer boundary is not used.

`SNAP_ALL_SIDE`: Everything that falls within the relevant distance over the plot boundary is snapped to the plot. The inner and outer boundary is used where possible.

`SNAP_PREFER_VERTICES`: The part on the OD is 'snapped' to the closest reference-polygons. Vertices of the reference-polygons are preferred above edges if they are within the relevant distance.

`SNAP_NO_PREFERENCE`: The part on the OD is 'snapped' to the closest reference-polygons. The full edge of the reference-polygons is used. No preference of reference-vertices.

`SNAP_ONLY_VERTICES`: The part on the OD is 'snapped' to the vertices of reference-polygons.

We also need to specify coordinate reference systems. For the majority of our work, including plotting, we use `base_crs = "EPSG:4326"`, but for our snapping library and for working with distances we use the Web Mercator projection `mercator_crs = "EPSG:3857"`.
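Distances measured in EPSG:3857 are stretched relative to distances on the ground by a factor of roughly 1 / cos(latitude), so a distance given in projected units corresponds to a shorter ground distance at UK latitudes. The sketch below illustrates this with the standard spherical Mercator formulas. The helper names and the example coordinates are assumptions made for illustration, and the sketch says nothing about how `PolygonMatcher` handles projections internally.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Project EPSG:4326 degrees to EPSG:3857 using the spherical formulas."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def approx_ground_distance(projected_distance: float, lat_deg: float) -> float:
    """Approximate ground metres for a short distance measured in EPSG:3857."""
    return projected_distance * math.cos(math.radians(lat_deg))


# Illustrative point in southern England (assumed, not taken from the dataset).
lon, lat = -0.3, 51.7
x, y = lonlat_to_web_mercator(lon, lat)
ground = approx_ground_distance(20, lat)
print(f"EPSG:3857 coordinates: ({x:.0f}, {y:.0f})")
print(f"20 projected units is roughly {ground:.1f} m on the ground")
```

If ground-true metres matter for a given analysis, a local projected CRS such as the British National Grid may be a better fit than Web Mercator, though that is a design choice outside the scope of this walkthrough.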

When working with base features that are represented as lines, notably roads, they need to be converted into polygons for our algorithm to work. This is done by adding circles continuously along the line, then combining them into one polygon at the end. The radius of these circles is determined by `line_buffer`, so the width of the polygon is 2 × `line_buffer`. Finally, we need to specify the distance (m) at which we search for polygons before we even consider snapping to them. This is determined by `polygon_detection_buffer`.
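The union of circles of radius `line_buffer` centred along a line is the set of points within `line_buffer` of that line. A minimal pure-Python sketch of that idea follows; the helper names are hypothetical and the package itself may build the polygon differently.

```python
import math

Point = tuple[float, float]


def distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def in_buffered_line(p: Point, line: list[Point], line_buffer: float) -> bool:
    """True if p lies within line_buffer of any segment of the line."""
    return any(
        distance_to_segment(p, a, b) <= line_buffer
        for a, b in zip(line, line[1:])
    )


road = [(0.0, 0.0), (100.0, 0.0)]  # a straight road along the x-axis
line_buffer = 5

inside_edge = in_buffered_line((50.0, 4.9), road, line_buffer)
outside_edge = in_buffered_line((50.0, 5.1), road, line_buffer)
inside_cap = in_buffered_line((103.0, 0.0), road, line_buffer)
print(inside_edge, outside_edge, inside_cap)
```

The first two points sit either side of the 5-unit offset from the centre line, showing the total width of 2 × `line_buffer`; the third falls in the rounded cap that the end circle adds beyond the end of the road.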

In [None]:
polygon_snap_distance = 20
overlap_sensitivity = 1
snapping_strategy = OpenbaarDomeinStrategy.SNAP_NO_PREFERENCE
base_crs = "EPSG:4326"
mercator_crs = "EPSG:3857"
line_buffer = 5
polygon_detection_buffer = 1

polygon_matcher = PolygonMatcher(
    base_crs=base_crs,
    polygon_snap_distance=polygon_snap_distance,
    overlap_sensitivity=overlap_sensitivity,
    snapping_strategy=snapping_strategy,
    mercator_crs=mercator_crs,
    polygon_detection_buffer=polygon_detection_buffer,
    line_buffer=line_buffer,
)
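As a rough intuition for `polygon_snap_distance`, the toy below moves each boundary vertex to the nearest reference vertex if one lies within the snap distance, and otherwise leaves it untouched. This is a deliberately simplified stand-in and not the brdr algorithm, which operates on whole geometries; all names and coordinates here are hypothetical.

```python
import math

Point = tuple[float, float]


def snap_vertices(
    boundary: list[Point], reference: list[Point], snap_distance: float
) -> list[Point]:
    """Snap each boundary vertex to its nearest reference vertex if close enough."""
    if not reference:
        return list(boundary)
    snapped = []
    for vertex in boundary:
        nearest = min(reference, key=lambda ref: math.dist(vertex, ref))
        if math.dist(vertex, nearest) <= snap_distance:
            snapped.append(nearest)
        else:
            snapped.append(vertex)  # too far away: keep the original vertex
    return snapped


boundary = [(0.0, 0.0), (100.0, 3.0), (100.0, 100.0), (0.0, 160.0)]
reference = [(0.0, 0.0), (100.0, 0.0), (110.0, 110.0), (0.0, 100.0)]

small_snap = snap_vertices(boundary, reference, 20)
large_snap = snap_vertices(boundary, reference, 500)
print(small_snap)
print(large_snap)
```

With the smaller distance the last vertex, 60 units from its nearest reference vertex, stays where it is; with the larger distance it is pulled onto the reference, which mirrors how a larger snap distance produces more extreme boundary changes.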

## Basic Usage

The cells below show the function calls needed to obtain a new boundary. The new boundary is stored in `aligned_df`, and the areas where the new boundary disagrees with the old are stored in `diff_df`.

In [None]:
def geodata_from_string(
    data_table: DataFrame, data_index: int, base_crs: str = "EPSG:4326"
) -> GeoDataFrame:
    """Obtain geodata from datasette query table.

    :param data_table: Polars dataframe from datasette query.
    :param data_index: Index of conservation area we want.
    :param base_crs: CRS to use, defaults to "EPSG:4326"
    :return: GeoDataFrame containing the selected conservation area geometry.
    """
    original_wkt = data_table["geometry"][data_index]
    original_geom = loads(original_wkt)
    original_df = GeoDataFrame([1], geometry=[original_geom], crs=base_crs)

    return original_df

In [None]:
data_index = 0
original_df = geodata_from_string(data, data_index, base_crs)

input_tags = {"landuse": ["residential"]}

In [None]:
base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)

In [None]:
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)

## Case Studies

To demonstrate practical usage, let's look at a few examples from the top of the dataset. Here we expand the set of features considered, plot the results to inspect the areas highlighted as potentially incorrect, and calculate metrics for those areas.

First we look at Sleapshyde.

### Sleapshyde

In [None]:
data_index = 4
original_df = geodata_from_string(data, data_index, base_crs)

input_tags = {
    "landuse": ["residential", "farmyard", "cemetery", "allotments"],
    "natural": ["wood", "grassland", "meadow"],
}

In [None]:
base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)

In [None]:
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)

We can then plot our results. To avoid overwhelming the reader with false positives, we do not plot everywhere the new border differs from the old. Instead, we only highlight differences that intersect with our areas of interest, because these are likely to be where the interesting changes have occurred.
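The filtering idea can be sketched with axis-aligned rectangles standing in for polygons. The `Box` type and the `keep_relevant_differences` helper are hypothetical; the package works on GeoDataFrame geometries and its implementation may differ.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle used as a simple stand-in for a polygon."""

    min_x: float
    min_y: float
    max_x: float
    max_y: float


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area."""
    return (
        a.min_x < b.max_x
        and b.min_x < a.max_x
        and a.min_y < b.max_y
        and b.min_y < a.max_y
    )


def keep_relevant_differences(differences: list[Box], features: list[Box]) -> list[Box]:
    """Keep only the difference pieces that overlap at least one base feature."""
    return [d for d in differences if any(boxes_intersect(d, f) for f in features)]


features = [Box(0, 0, 50, 50)]  # e.g. a farmyard
differences = [
    Box(45, 45, 60, 60),  # overlaps the farmyard corner, so it is kept
    Box(200, 200, 210, 210),  # far from any feature, so it is dropped
]
relevant = keep_relevant_differences(differences, features)
print(relevant)
```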

In [None]:
results_tuple = get_plotting_polygons(
    original_df, base_features_df, aligned_df, diff_df, base_crs
)

original_border, base_features, new_border, difference_area = results_tuple

In [None]:
plot_area_with_sliders(
    original_border,
    base_features,
    new_border,
    difference_area,
    data["name"][data_index],
)

The model highlights a few key areas, notably the top of the farm, where the original conservation area cuts through the top corner of the farmyard. There is an ambiguous area in the top left that is also highlighted. We also see a range of smaller alignments throughout. Of potential interest is the wooded area near the bottom, where a non-trivial chunk of woodland is cut off.

The package also contains utilities for calculating metrics to highlight potentially incorrect boundaries.

In [None]:
area_threshold = 100
large_areas_list = polygon_matcher.calculate_area_of_large_discrepancies(
    base_features_df, diff_df, area_threshold
)
print(f"Red areas larger than {area_threshold}m^2: {large_areas_list}")

In [None]:
area_sum = polygon_matcher.calculate_total_area_of_discrepancies(
    base_features_df, diff_df, area_threshold
)
print(f"Total area of red areas: {area_sum}m^2")

In [None]:
red_area_ratio = polygon_matcher.large_discrepancy_proportion(
    base_features_df, aligned_df, diff_df
)
print(f"Red areas as a percentage of total area: {red_area_ratio}%")
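For intuition, here is a pure-Python sketch of what these three metrics might compute, given a list of discrepancy areas in m². It is an assumption inferred from the method names and arguments, not the package's implementation. In particular, whether the total only counts areas above the threshold, and what the proportion is measured against, are guesses.

```python
def large_areas(areas: list[float], threshold: float) -> list[float]:
    """Discrepancy areas strictly larger than the threshold."""
    return [area for area in areas if area > threshold]


def total_large_area(areas: list[float], threshold: float) -> float:
    """Sum of the discrepancy areas above the threshold (assumed behaviour)."""
    return sum(large_areas(areas, threshold))


def discrepancy_percentage(discrepancy_area: float, boundary_area: float) -> float:
    """Discrepancy area as a percentage of the boundary's area (assumed behaviour)."""
    if boundary_area <= 0:
        raise ValueError("boundary_area must be positive")
    return 100 * discrepancy_area / boundary_area


# Made-up numbers for illustration only.
example_areas = [12.5, 140.0, 860.0, 95.0]
example_threshold = 100
example_boundary_area = 50_000.0

example_large = large_areas(example_areas, example_threshold)
example_total = total_large_area(example_areas, example_threshold)
example_percentage = discrepancy_percentage(example_total, example_boundary_area)
print(example_large, example_total, example_percentage)
```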

### Potters Crouch

In [None]:
data_index = 2
original_df = geodata_from_string(data, data_index, base_crs)

base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)
results_tuple = get_plotting_polygons(
    original_df, base_features_df, aligned_df, diff_df, base_crs
)

original_border, base_features, new_border, difference_area = results_tuple

plot_area_with_sliders(
    original_border,
    base_features,
    new_border,
    difference_area,
    data["name"][data_index],
)

This example has a clear mismatch, where a corner of the farm is cut off. There are very few false positives in this example, partly because highlighted areas have to intersect with base features. To emphasise the importance of this, we can increase `polygon_snap_distance` and see what happens.

In [None]:
extra_polygon_snap_distance = 500

polygon_matcher = PolygonMatcher(
    base_crs=base_crs,
    polygon_snap_distance=extra_polygon_snap_distance,
    overlap_sensitivity=overlap_sensitivity,
    snapping_strategy=snapping_strategy,
    mercator_crs=mercator_crs,
    polygon_detection_buffer=polygon_detection_buffer,
    line_buffer=line_buffer,
)

data_index = 2
original_df = geodata_from_string(data, data_index, base_crs)

base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)
results_tuple = get_plotting_polygons(
    original_df, base_features_df, aligned_df, diff_df, base_crs
)

original_border, base_features, new_border, difference_area = results_tuple

plot_area_with_sliders(
    original_border,
    base_features,
    new_border,
    difference_area,
    data["name"][data_index],
)

A large portion of the original boundary is ignored here, so listing all the differences would wash out the interesting ones. The cut-off corner is still correctly outlined.

### Napsbury

In [None]:
polygon_snap_distance = 200

polygon_matcher = PolygonMatcher(
    base_crs=base_crs,
    polygon_snap_distance=polygon_snap_distance,
    overlap_sensitivity=overlap_sensitivity,
    snapping_strategy=snapping_strategy,
    mercator_crs=mercator_crs,
    polygon_detection_buffer=polygon_detection_buffer,
    line_buffer=line_buffer,
)

data_index = 0
original_df = geodata_from_string(data, data_index, base_crs)

base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)
results_tuple = get_plotting_polygons(
    original_df, base_features_df, aligned_df, diff_df, base_crs
)

original_border, base_features, new_border, difference_area = results_tuple

plot_area_with_sliders(
    original_border,
    base_features,
    new_border,
    difference_area,
    data["name"][data_index],
)

For Napsbury, we see some rather odd behaviour where the conservation area goes through the middle of houses. This shows that we are not suggesting the new boundary <i>should be</i> the boundary; rather, if it is not obvious what the border is from the base features, we get odd behaviour that we can flag. Aside from the houses, there are other smaller issues in the bottom right and bottom left that are highlighted.

Another class of features to note is roads: there is an odd boundary kink on top of the roundabout. We can use our `line_buffer` argument to inspect the roads.

In [None]:
polygon_snap_distance = 20

polygon_matcher = PolygonMatcher(
    base_crs=base_crs,
    polygon_snap_distance=polygon_snap_distance,
    overlap_sensitivity=overlap_sensitivity,
    snapping_strategy=snapping_strategy,
    mercator_crs=mercator_crs,
    polygon_detection_buffer=polygon_detection_buffer,
    line_buffer=line_buffer,
)

data_index = 0
original_df = geodata_from_string(data, data_index, base_crs)

input_tags = {
    "highway": ["unclassified", "primary", "secondary"],
}

base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)
results_tuple = get_plotting_polygons(
    original_df, base_features_df, aligned_df, diff_df, base_crs
)

original_border, base_features, new_border, difference_area = results_tuple

plot_area_with_sliders(
    original_border,
    base_features,
    new_border,
    difference_area,
    data["name"][data_index],
)

We see that the odd geometry is highlighted as an anomaly. Similarly, further up the road another area is highlighted: the model is picking up that there is little reason for the conservation area boundary to cross there rather than anywhere else.

## Random cases

To get a broader view we also picked some random indices and highlight interesting observations here.

In [None]:
np.random.seed(42)
test_indices = np.random.randint(0, 1000, 20)
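Note that `np.random.randint(0, 1000, 20)` can return repeated indices and assumes the table has at least 1000 rows. A sketch of an alternative that samples without replacement from the rows actually present is shown below, using only the standard library. The `sample_indices` helper is hypothetical, `n_rows` is a placeholder for `len(data)`, and the indices it picks will differ from the ones discussed in this notebook.

```python
import random


def sample_indices(n_rows: int, k: int, seed: int = 42) -> list[int]:
    """Pick up to k distinct row indices in [0, n_rows), reproducibly."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), min(k, n_rows))


n_rows = 350  # placeholder for len(data)
unique_test_indices = sample_indices(n_rows, 20)
print(unique_test_indices)
```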

In [None]:
polygon_snap_distance = 50

polygon_matcher = PolygonMatcher(
    base_crs=base_crs,
    polygon_snap_distance=polygon_snap_distance,
    overlap_sensitivity=0.0000001,  # far lower than the earlier value of 1
    snapping_strategy=snapping_strategy,
    mercator_crs=mercator_crs,
    polygon_detection_buffer=polygon_detection_buffer,
    line_buffer=line_buffer,
)

In [None]:
data_index = int(test_indices[0])
original_df = geodata_from_string(data, data_index, base_crs)

input_tags = {
    "landuse": ["residential", "farmyard", "cemetery", "allotments", "farmland"],
    "natural": ["wood", "grassland", "meadow"],
}
base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)
results_tuple = get_plotting_polygons(
    original_df, base_features_df, aligned_df, diff_df, base_crs
)

original_border, base_features, new_border, difference_area = results_tuple

plot_area_with_sliders(
    original_border,
    base_features,
    new_border,
    difference_area,
    data["name"][data_index],
)

Although there are potentially some false positives near the top, the model does well to follow the natural border and to avoid highlighting areas where that would be inappropriate. There are also some clear anomalies near trees along roads in the bottom section.

In [None]:
data_index = int(test_indices[1])
original_df = geodata_from_string(data, data_index, base_crs)

base_features_df = polygon_matcher.download_osm_polygons(original_df, input_tags)
aligned_df, diff_df = polygon_matcher.match_polygon_to_features(
    original_df, base_features_df
)
results_tuple = get_plotting_polygons(
    original_df, base_features_df, aligned_df, diff_df, base_crs
)

original_border, base_features, new_border, difference_area = results_tuple

plot_area_with_sliders(
    original_border,
    base_features,
    new_border,
    difference_area,
    data["name"][data_index],
)

This example shows where this method can fall down: the quality of the polygons. The lighter grey indicates the base map, where no polygon is present. The only polygon present is the small wood, where issues are highlighted. The field at the bottom seems to have the conservation area line going straight through it, not following the treeline or any other features. However, neither OpenStreetMap nor OS Zoomstack has polygons for this.

OS Zoomstack does offer some good polygons for buildings, but these are yet to be implemented here.