# Geo dupes in DuckDB
**Author**:  Greg Slater <br>
**Date**:  24th September 2024 <br>
**Dataset Scope**: `dataset` <br>
**Report Type**: Ad-hoc analysis <br>
**Purpose**: Initial test of a possible approach to identifying entities whose geometry facts are very different from each other. Method outline:

* use DuckDB to read the sqlite db and create a new, spatially indexed table with the geometry facts in it
* identify entities with multiple geometry facts
* group and summarise by entity, calculating for each entity:
    * the mean area of its geometry facts
    * the area of the bounding box of all its geometry facts combined (using the ST_Extent_Agg function)
* check entities with a large discrepancy between the two: if one geometry fact is very distant from the others, the combined bounding box of all facts will be far larger than the mean area of the individual geometries.
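
To make the idea behind the ratio concrete, here is a minimal sketch in plain Python. It is not the DuckDB implementation used in this notebook: geometries are simplified to axis-aligned boxes, and the helper names (`box_area`, `combined_extent`, `envelope_to_mean_ratio`) and the example coordinates are made up for illustration.

```python
from statistics import mean

# Illustrative only: each "geometry fact" is an axis-aligned box
# given as (min_x, min_y, max_x, max_y). These names are hypothetical.

def box_area(box):
    """Area of a single axis-aligned box."""
    min_x, min_y, max_x, max_y = box
    return (max_x - min_x) * (max_y - min_y)

def combined_extent(boxes):
    """Bounding box enclosing every box (the role ST_Extent_Agg plays in SQL)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )

def envelope_to_mean_ratio(boxes):
    """Area of the combined extent divided by the mean area of the boxes."""
    return box_area(combined_extent(boxes)) / mean(box_area(b) for b in boxes)

# Two facts that almost overlap: the ratio stays close to 1.
close_facts = [(0, 0, 1, 1), (0.1, 0.1, 1.1, 1.1)]

# Same-sized facts, but one is far away: the ratio becomes very large.
distant_facts = [(0, 0, 1, 1), (100, 100, 101, 101)]

print(envelope_to_mean_ratio(close_facts))
print(envelope_to_mean_ratio(distant_facts))  # 10201.0
```

The second case shows why a single distant fact dominates the measure: the combined extent has to stretch across the gap, while the mean area of the individual facts stays the same.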


In [2]:
import pandas as pd
import geopandas as gpd
import numpy as np
import os
import duckdb as ddb
from datetime import datetime

pd.set_option("display.max_rows", 100)

td = datetime.today().strftime('%Y-%m-%d')
data_dir = "../../data/endpoint_checker/entity_resolution/"


In [3]:
# if sqlite DB isn't downloaded yet, get it from here: https://datasette.planning.data.gov.uk/conservation-area.db
ca_sqlite_path = os.path.join(data_dir, "conservation-area.sqlite3")

# Connect to DuckDB
con = ddb.connect()

# Install and load the SQLite and spatial extensions
con.execute("INSTALL sqlite;")
con.execute("LOAD sqlite;")
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Attach the SQLite database
con.execute(f"ATTACH DATABASE '{ca_sqlite_path}' AS sqlite_db;")

# Create a new table in DuckDB, load in the geometry facts from the sqlite fact table and create a spatial index on the geom field
# Note - remove the LIMIT clause to run on the full fact table; restricted for now for easier testing
con.execute("""
    DROP TABLE IF EXISTS fact_spatial;
            
    CREATE TABLE fact_spatial (
    entity INTEGER,
    fact TEXT,
    entry_date TEXT,
    geom GEOMETRY);

    INSERT INTO fact_spatial (entity, fact, entry_date, geom)
    SELECT 
        entity, fact, entry_date, 
        -- ST_Transform(ST_GeomFromText(value, ignore_invalid => TRUE), 'EPSG:4326', 'EPSG:27700', always_xy := true)
        ST_GeomFromText(value, ignore_invalid => TRUE)
    FROM sqlite_db.fact
    WHERE field = 'geometry'
    LIMIT 10000;
            
    CREATE INDEX idx ON fact_spatial USING RTREE (geom);
""")


<duckdb.duckdb.DuckDBPyConnection at 0x120461930>

In [4]:
con.sql(f"""
    -- identify entities with multiple facts
    with entity_multi_facts as (
        select entity, count(*) as n_facts
        from fact_spatial
        group by 1 
        having count(*) > 1
        order by count(*) desc
    ),
    
    -- get facts for multi fact entities
    test as (
        select *
        from fact_spatial fs
        inner join entity_multi_facts emf on fs.entity = emf.entity
        order by fs.entity
        --limit 1000
    )
        
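    -- calculate average poly area and area of ST_Extent_Agg for polys
    -- note: ST_Area returns planar area in the units of the coordinates, so the km2 labels only hold
    -- if the commented-out ST_Transform to EPSG:27700 is enabled when building fact_spatial
    , calc as (
        select
            entity,
            AVG(ST_Area(geom)) / 1000000 as mean_poly_area_km2,
            ST_Area(ST_Extent_Agg(geom)) / 1000000 as envelope_area_km2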
        from test
        group by entity
    )
        
    select 
        *,
        envelope_area_km2 / mean_poly_area_km2 as envelope_avg_delta
    from calc
    order by envelope_area_km2 / mean_poly_area_km2 desc
""")
          
# con.close()

┌──────────┬────────────────────────┬────────────────────────┬────────────────────┐
│  entity  │   mean_poly_area_km2   │   envelope_area_km2    │ envelope_avg_delta │
│  int32   │         double         │         double         │       double       │
├──────────┼────────────────────────┼────────────────────────┼────────────────────┤
│ 44005152 │ 1.0861711000000817e-11 │ 3.5089390949971176e-06 │  323055.8330079721 │
│ 44012481 │ 1.8857191166664477e-11 │  2.043697388626953e-06 │ 108377.61417192277 │
│ 44001586 │  8.207400174996816e-11 │ 1.4145518785880995e-08 │ 172.35078690294864 │
│ 44005076 │   7.92943292499932e-11 │  1.103977642742393e-08 │  139.2252955771724 │
│ 44001175 │  2.887642500001199e-11 │ 2.5127988046733664e-09 │  87.01904078057872 │
│ 44000050 │   4.09581997499899e-11 │  2.810219475577469e-09 │  68.61188950518172 │
│ 44001471 │   1.22752972500123e-11 │  6.070460358387208e-10 │  49.45265466692568 │
│ 44001128 │  3.656501624997056e-11 │ 1.5105813299669534e-09 │ 41.3122017952

From some spot checks, entities whose geometries are very long, thin polygons have a delta of the order of 100 (e.g. 44001586, 44005076). Setting the threshold lower than that would pick up false positives: these entities have a large delta because of their shape (small area but very large extent) rather than because they have very distant geometry facts.
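
As a rough illustration of the shape effect (a sketch with made-up coordinates, not one of the entities above), a thin diagonal strip gives a delta of about 100 even when every fact is identical. The helper names are hypothetical.

```python
# Illustrative only: a thin parallelogram running diagonally,
# 1 unit wide and spanning 100 units in both x and y.
thin_strip = [(0, 0), (1, 0), (101, 100), (100, 100)]

def polygon_area(vertices):
    """Planar area of a simple polygon via the shoelace formula."""
    total = 0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def bbox_area(vertices):
    """Area of the axis-aligned bounding box of the vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

# With identical facts, the mean fact area equals the polygon's own area and
# the combined extent equals its bounding box, so the delta reduces to:
strip_delta = bbox_area(thin_strip) / polygon_area(thin_strip)
print(strip_delta)  # 101.0
```

Here the strip has area 100 but a bounding box of 10100, so shape alone produces a delta of 101 with no distant facts involved.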

Where the delta is much larger than this, i.e. > 100,000, the entities reliably have distant facts (e.g. 44005152, 44012481). This threshold seems fairly robust for identifying issues in the conservation-area dataset, but may not generalise as well to datasets with larger or more unusually shaped geometries, such as flood-risk-zones. It is still worth testing, as this runs very quickly in DuckDB. Further tests could try aggregate measures other than ST_Extent_Agg: see the [aggregate spatial functions](https://duckdb.org/docs/stable/extensions/spatial/functions#aggregate-functions) in the DuckDB docs.
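
A small sketch of applying that threshold, using the deltas from the complete rows of the output table above (the final, truncated row is left out). The constant name `DISTANT_FACT_THRESHOLD` is hypothetical, and 100,000 is only the provisional cut-off discussed here, not a validated value.

```python
# Deltas copied from the complete rows of the query output above.
envelope_avg_delta = {
    44005152: 323055.8330079721,
    44012481: 108377.61417192277,
    44001586: 172.35078690294864,
    44005076: 139.2252955771724,
    44001175: 87.01904078057872,
    44000050: 68.61188950518172,
    44001471: 49.45265466692568,
}

# Hypothetical constant: the provisional cut-off from the spot checks.
DISTANT_FACT_THRESHOLD = 100_000

# Entities whose delta exceeds the threshold are candidates for distant facts.
flagged = sorted(
    entity
    for entity, delta in envelope_avg_delta.items()
    if delta > DISTANT_FACT_THRESHOLD
)
print(flagged)  # [44005152, 44012481]
```

The long, thin entities (44001586, 44005076) sit well below the cut-off, so they are not flagged.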

In [11]:
# save an example entity's facts as GeoJSON to examine in QGIS
con.sql(f"""
    COPY(
        select 
            entity, fact, 
            -- area of each individual geometry fact (not a mean)
            ST_Area(geom) / 1000000 as poly_area_km2, 
            geom
        from fact_spatial
        where entity = 44012481
    )
        to 'test_distant_geom_facts_44012481.geojson'
        WITH (FORMAT gdal, DRIVER 'GeoJSON');
""")


In [12]:
# print out the facts for an example flagged entity
con.sql(f"""
    select 
        entity, fact, 
        -- area of each individual geometry fact (not a mean)
        ST_Area(geom) / 1000000 as poly_area_km2, 
        geom
    from fact_spatial
    where entity = 44005152

""")


┌──────────┬──────────────────────────────────────────────────────────────────┬────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────