# Duplicate Entities

This notebook tests methods for detecting duplicate entities in planning data. It was developed using conservation areas but should extend easily to other entity types, since it relies only on standard entity properties such as name and geometry.

The goal is to flag specific issues and then prioritise entities based on the number of issues that are present.

In [None]:
from itertools import combinations

import plotly.graph_objects as go
import polars as pl
import requests
from shapely import MultiPolygon, overlaps
from shapely.wkt import loads

from data_quality_utils.polygon.plotting import plot_multipolygon
from data_quality_utils.polygon.utils import overlap_ratio, shortest_distance

In [None]:
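def plot_duplicates(duplicate_entities: list[MultiPolygon]):
    """Plot each candidate geometry on one map, centred on the last geometry."""
    # The map centre below uses the last polygon, so an empty list cannot be plotted.
    if not duplicate_entities:
        raise ValueError("duplicate_entities must contain at least one geometry")

    fig = go.Figure()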
    for i, polygon in enumerate(duplicate_entities):
        fig = plot_multipolygon(
            polygon=polygon,
            fig=fig,
            name=f"{i}",
            line_color="black",
            fill_color=(255, 0, 0),
            fill_alpha=0.3,
        )

    fig.update_layout(
        geo_scope="europe",
        map=dict(
            style="open-street-map",
            center=dict(lon=polygon.centroid.x, lat=polygon.centroid.y),
            zoom=13,
        ),
        showlegend=True,
        margin={"r": 100, "t": 50, "l": 100, "b": 50},
        height=800,
        width=1000,
    )
    return fig

## Data

From our search of `datasette`, there appears to be no publicly available endpoint for obtaining all entities in the planning data. Instead, we use the `conservation-area` dataset to get all conservation areas.

In [None]:
conservation_entities_url = "https://datasette.planning.data.gov.uk/conservation-area/entity.csv?_stream=on&_size=max"
r = requests.get(conservation_entities_url, auth=("user", "pass"))
r.raise_for_status()  # fail early if the download did not succeed
conservation_df = pl.read_csv(r.content)
conservation_df = conservation_df.with_columns(
    json=conservation_df["json"].str.json_decode(infer_schema_length=10000)
).unnest("json")

conservation_df = conservation_df.filter(pl.col("name") != "Polygon")

## Duplicate Checks

We do this in two parts. First, we identify all potential duplicates; then we rank them by defining 'issues' and counting the number of issues each set of potential duplicates has. Issues in this context are facts about the entities that would be unusual if they were not duplicates, e.g. a large amount of overlap between their geometries.

### Identifying Candidates
The simplest check is whether two or more entities in the same region share the same name. We group the data by `name` and `organisation_entity` and keep all entities that have two or more rows per group.

In [None]:
count_df = (
    conservation_df.group_by(["name", "organisation_entity"])
    .len()
    .rename({"len": "count"})
)
candidate_df = count_df.filter(count_df["count"] > 1).sort("count", descending=True)

In [None]:
len(candidate_df)

In [None]:
count_df = conservation_df.group_by(["name"]).len().rename({"len": "count"})
candidate_df = count_df.filter(count_df["count"] > 1).sort("count", descending=True)

In [None]:
candidate_df
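
The cells above group first by name and organisation, then regroup by name alone, and the later cells use the name-only result. The difference between the two groupings can be illustrated with a small pure-Python sketch. The records and entity numbers below are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (name, organisation_entity) records, for illustration only.
records = [
    ("Town Centre", 101),
    ("Town Centre", 101),
    ("Mill Lane", 101),
    ("Mill Lane", 202),
    ("Church Street", 303),
]


def repeated(keys):
    """Return the keys that occur more than once, with their counts."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Strict grouping: the name and the organisation must both match.
by_name_and_org = repeated(records)

# Loose grouping: only the name must match.
by_name = repeated(name for name, _ in records)
```

In this toy data `Mill Lane` is only caught by the name-only grouping, because its two records sit under different organisations. That is the pattern one would expect when an entity has been assigned to the wrong organisation.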

### Counting Issues

Having required that candidate duplicates share a name, we can look for issues that may indicate that two entities are duplicates.

We look for missing information, for fewer unique values (such as organisation) than one would expect for the number of entities sharing a name, and for missing or overlapping geometries. Counting the issues present in each candidate set of duplicates lets us prioritise the sets for investigation.

In [None]:
def min_distance(pairs):
    distances = []
    for a, b in pairs:
        if a and b:
            distances.append(shortest_distance(a, b))
    if distances:
        return min(distances)
    else:
        return None


def max_overlap(pairs):
    areas = []
    for a, b in pairs:
        if a and b:
            areas.append(overlap_ratio(a, b))
    if areas:
        return max(areas)
    else:
        return None


def count_real_orgs(organisations):
    real_orgs = organisations.filter(~organisations.is_in([16, 600001]))
    return real_orgs.n_unique()
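
The helpers `overlap_ratio` and `shortest_distance` come from `data_quality_utils` and are not shown in this notebook. As a rough sketch of what such helpers might compute, the block below assumes that the overlap ratio is the intersection area divided by the area of the smaller geometry, and that the distance is Shapely's planar distance in coordinate units. Both definitions, and the `sketch_*` names, are assumptions; the real helpers may differ, for example by converting distances to kilometres.

```python
from shapely.geometry import box


def sketch_overlap_ratio(a, b) -> float:
    """Assumed definition: shared area as a fraction of the smaller geometry."""
    smaller = min(a.area, b.area)
    if smaller == 0:
        return 0.0
    return a.intersection(b).area / smaller


def sketch_shortest_distance(a, b) -> float:
    """Assumed definition: planar distance in the units of the coordinates."""
    return a.distance(b)


# Toy geometries: two squares that half overlap, and one that is far away.
square = box(0, 0, 2, 2)
shifted = box(1, 0, 3, 2)
far = box(5, 0, 6, 1)

half = sketch_overlap_ratio(square, shifted)
disjoint = sketch_overlap_ratio(square, far)
gap = sketch_shortest_distance(square, far)
```

Dividing by the smaller area means a fragment that lies entirely inside a larger polygon scores 1.0, which suits the case where one entity duplicates a section of another.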

First, generate some metrics that we expect to be related to whether two or more entities with the same name are duplicates.

In [None]:
records = []
for name in candidate_df["name"]:

    filtered_entities_df = conservation_df.filter(pl.col("name") == name)

    possible_duplicates: list[MultiPolygon] = [
        loads(shape_str) for shape_str in filtered_entities_df["geometry"].to_list()
    ]
    candidate_pairs = list(combinations(possible_duplicates, 2))

    metrics = {
        "Name": name,
        "Number of candidates": len(possible_duplicates),
        "Number of organisations": count_real_orgs(
            filtered_entities_df["organisation_entity"]
        ),
        # missing URLs are nulls (not NaNs) in string columns, so drop nulls
        "Number of documentation-urls": filtered_entities_df["documentation-url"]
        .drop_nulls()
        .n_unique(),
        "Number of document-urls": filtered_entities_df["document-url"]
        .drop_nulls()
        .n_unique(),
        "Missing Geometries": filtered_entities_df["geometry"].is_null().sum(),
        "Max Overlap": max_overlap(candidate_pairs),
        "Number of Overlaps": sum([overlaps(a, b) for a, b in candidate_pairs]),
        "Smallest Distance": min_distance(candidate_pairs),
    }
    records.append(metrics)

Then count the metrics whose values indicate a problem.

In [None]:
candidate_groups_df = pl.DataFrame(records)
candidate_groups_df = candidate_groups_df.with_columns(
    issue_count=(
        # fewer unique documents than candidates
        (
            candidate_groups_df["Number of document-urls"]
            < candidate_groups_df["Number of candidates"]
        )
        # fewer unique organisations than candidates
        + (
            candidate_groups_df["Number of candidates"]
            > candidate_groups_df["Number of organisations"]
        )
        # fewer unique documentation pages than candidates
        + (
            candidate_groups_df["Number of documentation-urls"]
            < candidate_groups_df["Number of candidates"]
        )
        # candidates have at least some overlap in geography
        + (candidate_groups_df["Max Overlap"].fill_null(0.0) > 0)
        # candidates have more than one overlapping pair of polygons
        + (candidate_groups_df["Number of Overlaps"] > 1)
        # the smallest distance between candidates is less than a km
        + (candidate_groups_df["Smallest Distance"].fill_null(100) < 1)
    )
)

In [None]:
candidate_groups_df.filter(pl.col("issue_count") > 3).sort(
    by="issue_count", descending=True
)
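
To make the arithmetic behind `issue_count` concrete, the six checks can be restated in plain Python for a single candidate group. This is only a sketch that mirrors the Polars expression above, including its null defaults of 0.0 for overlap and 100 for distance. The metric values in `example` are hypothetical.

```python
def count_issues(metrics: dict) -> int:
    """Count how many of the six checks flag a problem for one candidate group."""
    n = metrics["Number of candidates"]
    # Same null handling as the notebook: no overlap -> 0.0, no distance -> 100.
    overlap = metrics["Max Overlap"] if metrics["Max Overlap"] is not None else 0.0
    distance = (
        metrics["Smallest Distance"]
        if metrics["Smallest Distance"] is not None
        else 100
    )
    checks = [
        metrics["Number of document-urls"] < n,       # fewer documents than candidates
        n > metrics["Number of organisations"],       # fewer organisations than candidates
        metrics["Number of documentation-urls"] < n,  # fewer documentation pages
        overlap > 0,                                  # some geographic overlap
        metrics["Number of Overlaps"] > 1,            # more than one overlapping pair
        distance < 1,                                 # closest candidates under 1 apart
    ]
    # Booleans sum as 0/1, so the total is the number of checks that fired.
    return sum(checks)


# Hypothetical metrics for one group of three same-named entities.
example = {
    "Number of candidates": 3,
    "Number of organisations": 1,
    "Number of documentation-urls": 1,
    "Number of document-urls": 3,
    "Max Overlap": None,
    "Number of Overlaps": 0,
    "Smallest Distance": 0.2,
}
issue_count = count_issues(example)
```

For this example three checks fire (organisations, documentation pages and distance), so the group would not pass the `issue_count > 3` filter used above.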

# Results

## Overview

- We find 986 groups of entities with the same name.
- 685 of these have at least one missing geometry; 58 have no geometries at all.
- Assuming at least one issue would be present in any duplicate group, there are 506 groups that truly are duplicates.
- High issue counts do seem to point to clear duplicates.
- Overlap can be used to find obvious duplicates.
- Missing or incomplete data is a major problem that should be resolved before entity de-duplication.


## Case Studies

### Fragmented Entities
Wymondham Conservation Area is typical of the high issue count candidates. There are 13 entities that share a name, documentation-url and document-url. However, there are 13 unique geometries in this set of 13 "duplicates". These 13 areas form one contiguous area, which is documented just once on the South Norfolk District Council website. Wymondham is specifically listed as a single conservation area on their website, so it seems this is a single entity whose defining polygons have been logged separately.

Several other top-listed candidates (Diss, Wingham, Petersfield) have the same issue, though Petersfield has the additional issue of one entity that covers the whole area plus 8 entities that record each of its sections.

In [None]:
name = "Wymondham Conservation Area"
possible_duplicates: list[MultiPolygon] = [
    loads(shape_str)
    for shape_str in conservation_df.filter(pl.col("name") == name)[
        "geometry"
    ].to_list()
    if shape_str
]

fig = plot_duplicates(possible_duplicates)
fig.show()
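
The observation that a group's fragments form one contiguous area could also be checked programmatically rather than by eye. A minimal sketch, assuming Shapely is available and using toy boxes in place of the real geometries: merge the fragments and test whether the result is a single polygon. The helper name `is_single_contiguous_area` is hypothetical.

```python
from shapely.geometry import box
from shapely.ops import unary_union


def is_single_contiguous_area(geometries) -> bool:
    """Return True if the non-empty geometries merge into one polygon."""
    usable = [g for g in geometries if g is not None and not g.is_empty]
    merged = unary_union(usable)
    # A contiguous union collapses to a single Polygon; separate pieces
    # stay as a MultiPolygon (and an empty input gives an empty collection).
    return merged.geom_type == "Polygon"


# Toy fragments: three boxes sharing edges, and two boxes far apart.
adjacent_fragments = [box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)]
separate_areas = [box(0, 0, 1, 1), box(5, 5, 6, 6)]

contiguous = is_single_contiguous_area(adjacent_fragments)
scattered = is_single_contiguous_area(separate_areas)
```

Fragments that touch only at a single point would not merge into one polygon under this test, and real boundaries may leave small gaps, so a small buffer before merging might be needed in practice.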

### Complete Overlap

Sorting instead by the amount of overlap between polygons, we get entities that are clear duplicates. For conservation areas like Stone, we get a small number of duplicate entities with essentially identical geometries. For more complex cases like Farringdon, we get one overarching polygon with several other entities that completely duplicate sections of it, leading to high overlaps.

In [None]:
name = "Stone"
possible_duplicates: list[MultiPolygon] = [
    loads(shape_str)
    for shape_str in conservation_df.filter(pl.col("name") == name)[
        "geometry"
    ].to_list()
    if shape_str
]

fig = plot_duplicates(possible_duplicates)
fig.show()

### Missing Data

Another common failure mode is when two entities have the same name but should be under different organisations. For example, Poulton appears to be the name of a place in the Cotswolds as well as in Cheshire, but both entities have been assigned to MHCLG (entity 600001) instead of their local council.

These cases are also often missing geometries but do have documentation-urls. If some of the previous validation steps were run, documentation-url and document-url might be fixed. A simple rule that validates the organisation against the documentation-url might fix the assignment to MHCLG. Geometry might then be obtained if the organisation leads to the correct endpoint, or if AI tools are used to extract polygons from documents.
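
One possible shape for such a rule is a lookup from the documentation-url's domain to an organisation entity. The sketch below is illustrative only: the domains and entity numbers in `DOMAIN_TO_ORGANISATION` are placeholders rather than real councils, and the function name is hypothetical. Only the MHCLG entity number comes from the text above.

```python
from urllib.parse import urlparse

# Placeholder lookup: these domains and entity numbers are NOT real data.
DOMAIN_TO_ORGANISATION = {
    "council-a.example": 101,
    "council-b.example": 202,
}
MHCLG_ENTITY = 600001  # entity number quoted in the text above


def suggest_organisation(documentation_url, current_entity):
    """Suggest an organisation from the URL's domain, else keep the current one."""
    if not documentation_url:
        return current_entity
    host = (urlparse(documentation_url).hostname or "").lower()
    # Treat "www.example" and "example" as the same domain.
    if host.startswith("www."):
        host = host[4:]
    return DOMAIN_TO_ORGANISATION.get(host, current_entity)


fixed = suggest_organisation(
    "https://www.council-a.example/conservation-areas", MHCLG_ENTITY
)
no_url = suggest_organisation(None, MHCLG_ENTITY)
unknown_domain = suggest_organisation("https://other.example/page", MHCLG_ENTITY)
```

Falling back to the current entity when the URL is missing or the domain is unknown keeps the rule conservative: it only changes an assignment when the lookup gives a positive match.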

In [None]:
name = "Poulton"
conservation_df.filter(pl.col("name") == name)

In [None]:
name = "Tiverton"
conservation_df.filter(pl.col("name") == name)

In [None]:
name = "Moor Park"
conservation_df.filter(pl.col("name") == name)