## Examining duplication between organisations

In our data we often receive multiple data sources per dataset. Unfortunately, this leads to duplicated geometries and other data points within a dataset. This notebook investigates how to identify these duplicates between organisations.

In [None]:
# from download_data import download_dataset
# from data import get_entity_dataset, nrow
# from plot import plot_map, plot_issues_map
import spatialite
import pandas as pd
import geopandas as gpd
import os
import itertools
import shapely.wkt

import matplotlib.pyplot as plt
import time
import urllib

import numpy as np

pd.set_option("display.max_rows", 100)


In [None]:
# if running on Colab, uncomment and run this line below too:
# !pip install mapclassify

### Functions

In [350]:
def nrow(df):
    return print(f"No. of records in df: {len(df):,}")


import logging


def plot_issues_map(gdf: gpd.GeoDataFrame, entity_link, chloro_var, palette):
    """Map the entities named in a hyphen-separated string, e.g. "44007798-44007822"."""

    if not isinstance(gdf, gpd.GeoDataFrame):
        logging.error("input is not a GeoDataFrame")

    entity_list = entity_link.split("-")

    base = gdf[gdf["entity"].isin(entity_list)].explore(
        column = chloro_var,  # colour the polygons by this column
        cmap = palette,
        tooltip = False,
        popup = ["organisation_name", "entity", "name", "entry_date", "reference"],
        tiles = "CartoDB positron",  # use "CartoDB positron" tiles
        highlight = False,
        style_kwds = {
        "fillOpacity" : "0.1"
        }
    )

    return base

def get_all_organisations():
    params = urllib.parse.urlencode({
        "sql": f"""
        select organisation, name, entity as organisation_entity, statistical_geography
        from organisation
        """,
        "_size": "max"
        })
    url = f"https://datasette.planning.data.gov.uk/digital-land.csv?{params}"
    df = pd.read_csv(url)
    return df


def get_old_entity(collection_name):
    params = urllib.parse.urlencode({
        "sql": f"""
        select *
        from old_entity
        """,
        "_size": "max"
        })
    url = f"https://datasette.planning.data.gov.uk/{collection_name}.csv?{params}"
    df = pd.read_csv(url)
    return df

### Data import

In [None]:
# get LAD to LPA lookup from github
lookup_lad_lpa = pd.read_csv("https://github.com/digital-land/organisation-collection/raw/main/data/local-authority.csv",
                             usecols = ["entity", "local-authority-district", "local-planning-authority"])

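# rename by name rather than by position: usecols keeps the file's column order, not the order of the list above
lookup_lad_lpa = lookup_lad_lpa.rename(columns = {
    "entity": "organisation_entity",
    "local-authority-district": "LADCD",
    "local-planning-authority": "LPACD"})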

nrow(lookup_lad_lpa)
lookup_lad_lpa.head()

**Note on LAD to LPA mapping**   
Currently this [lookup file from github](https://github.com/digital-land/organisation-collection/raw/main/data/local-authority.csv) just records a 1:1 link between LADs and LPAs, but according to the ONS this relationship is actually 1:many. 
See [2020 lookup file](https://geoportal.statistics.gov.uk/datasets/ons::local-planning-authority-to-local-authority-district-april-2020-in-the-united-kingdom-lookup-1/about) and the example of Ryedale [`E07000167`], which is mapped to the following two LPAs:

* Ryedale LPA [`E60000061`]
* North York Moors National Park LPA [`E60000322`]

We need to agree some validation rules around this, i.e. can we expect Ryedale to submit data that might sit within either of these LPA areas, or for any London Boroughs to submit within the "London Legacy Development Corporation LPA" area?
For simplicity, and to get things up and running (as per Owen's advice), we will test with the existing 1:1 mapping for now and develop the logic once there is more clarity about how multiple areas should be handled.

The GitHub lookup file also seems to be missing some areas, e.g. "Peak District National Park Authority" (entity 405).
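
One way the 1:many relationship could be represented is sketched below in plain Python, using only the Ryedale codes quoted above. The `lad_to_lpas` structure and the `is_valid_lpa` helper are hypothetical names for illustration; they are not part of the existing lookup file or pipeline, and the validation rule itself is still to be agreed.

```python
from collections import defaultdict

# (LADCD, LPACD) pairs; only the Ryedale example quoted above is included
lad_lpa_pairs = [
    ("E07000167", "E60000061"),  # Ryedale LPA
    ("E07000167", "E60000322"),  # North York Moors National Park LPA
]

# build a 1:many lookup from each LAD code to the set of LPA codes it maps to
lad_to_lpas = defaultdict(set)
for ladcd, lpacd in lad_lpa_pairs:
    lad_to_lpas[ladcd].add(lpacd)


def is_valid_lpa(ladcd, lpacd):
    """Hypothetical rule: accept data that sits in any LPA mapped to the LAD."""
    return lpacd in lad_to_lpas.get(ladcd, set())


print(is_valid_lpa("E07000167", "E60000322"))
```

Under this assumed rule, Ryedale data would be accepted in either of its two LPA areas, whereas the current 1:1 lookup can only express one of them.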

In [None]:
# note: lookup_org is created in the next cell, so run that cell first
lookup_org[lookup_org["LADCD"] == "E07000167"]

In [None]:
# get org data from datasette
lookup_org = get_all_organisations()

# lookup_org["organisation_entity"] = lookup_org["organisation_entity"].astype(str)
lookup_org.columns = ["organisation", "organisation_name", "organisation_entity", "statistical_geography"]

# split out org type and join on LPA codes from LAD to LPA lookup
lookup_org["organisation_type"] = lookup_org["organisation"].apply(lambda x: x.split(":")[0])
lookup_org = lookup_org.merge(lookup_lad_lpa, how = "left", on = "organisation_entity")

nrow(lookup_org)
lookup_org.head()

In [None]:
# check what types of org are missing the LPA code
nrow(lookup_org[lookup_org["LPACD"].isnull()])
lookup_org[lookup_org["LPACD"].isnull()].groupby("organisation_type").size()

In [None]:
# LPA boundary data from planning.data.gov

LPA_boundary_df = pd.read_csv("https://files.planning.data.gov.uk/dataset/local-planning-authority.csv", 
                                  usecols = ["reference", "name", "geometry"])

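# rename by name rather than by position: usecols keeps the file's column order, not the order of the list above
LPA_boundary_df = LPA_boundary_df.rename(columns = {"reference": "LPACD"})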


# load geometry and create GDF
LPA_boundary_df['geometry'] = LPA_boundary_df['geometry'].apply(shapely.wkt.loads)
LPA_boundary_gdf = gpd.GeoDataFrame(LPA_boundary_df, geometry='geometry')

# Transform to EPSG:27700 for more interpretable area units
LPA_boundary_gdf.set_crs(epsg=4326, inplace=True)
LPA_boundary_gdf.to_crs(epsg=27700, inplace=True)

nrow(LPA_boundary_gdf)
LPA_boundary_gdf.head()


In [291]:
# load conservation area entity dataset from planning.data.gov into geopandas and transform CRS to EPSG:27700

entity_df = pd.read_csv("https://files.planning.data.gov.uk/dataset/conservation-area.csv",
                            usecols = ["entity", "name", "organisation-entity", "reference", "entry-date", "geometry"])
            
# entity_df.head()
entity_df.columns = [x.replace("-", "_") for x in entity_df.columns]



# set entity to string, needed later to sort and remove duplicate self intersections
entity_df["entity"] = entity_df["entity"].astype(str)
# entity_df["organisation_entity"] = entity_df["organisation_entity"].astype(str)

# join organisation name and LPA codes from lookup
entity_df = entity_df.merge(
    lookup_org[["organisation_name", "organisation_type", "organisation_entity", "LPACD"]], 
    how = "left",
    on = "organisation_entity")

# load geometry and create GDF
entity_df['geometry'] = entity_df['geometry'].apply(shapely.wkt.loads)
entity_gdf = gpd.GeoDataFrame(entity_df, geometry='geometry')

# Transform to EPSG:27700 for more interpretable area units
entity_gdf.set_crs(epsg=4326, inplace=True)
entity_gdf.to_crs(epsg=27700, inplace=True)

# calculate area
entity_gdf["area"] = entity_gdf["geometry"].area

nrow(entity_gdf)
entity_gdf.head()

No. of records in df: 8,826


Unnamed: 0,entity,entry_date,geometry,name,organisation_entity,reference,organisation_name,organisation_type,LPACD,area
0,44000001,2022-04-12,"MULTIPOLYGON (((516981.159 204270.242, 516973....",Napsbury,16,5080,Historic England,government-organisation,,495087.300218
1,44000002,2022-04-12,"MULTIPOLYGON (((512390.333 209659.962, 512382....",Shafford Mill,16,5071,Historic England,government-organisation,,136187.979619
2,44000003,2022-04-12,"MULTIPOLYGON (((511610.510 205098.079, 511611....",Potters Crouch,16,5074,Historic England,government-organisation,,34603.675292
3,44000004,2022-04-12,"MULTIPOLYGON (((512515.275 200300.431, 512520....",Old Brickett Wood,16,5075,Historic England,government-organisation,,55128.469061
4,44000005,2022-04-12,"MULTIPOLYGON (((520248.830 206717.191, 520410....",Sleapshyde,16,5078,Historic England,government-organisation,,44167.433073


In [None]:
# check which organisations we don't have an LPA code for
entity_df[entity_df["LPACD"].isnull()].groupby(["organisation_type", "organisation_name"]).size()

In [371]:
old_entity_df = get_old_entity("conservation-area")
old_entity_df["entity"] = old_entity_df["entity"].astype('str')
old_entity_df["old_entity"] = old_entity_df["old_entity"].astype('str')

nrow(old_entity_df)
old_entity_df.head()

No. of records in df: 529


Unnamed: 0,end_date,entity,entry_date,notes,old_entity,start_date,status
0,,44009617,,,44008389,,301
1,,44009617,,,44008390,,301
2,,44009621,,,44008391,,301
3,,44009621,,,44008392,,301
4,,44009621,,,44008393,,301


In [361]:
pd.concat([old_entity_df["entity"], old_entity_df["old_entity"]], ignore_index=True).drop_duplicates()

0       44009617
2       44009621
11      44009646
12      44009653
13      44009642
          ...   
1053    44006298
1054    44002984
1055    44002982
1056    44004833
1057    44002803
Length: 856, dtype: int64

# Checking expected bounds of data

In [None]:
# List LPA codes from entity df and check they're all in the LPA gdf
lpa_list = entity_df["LPACD"][entity_df["LPACD"].notnull()].drop_duplicates().to_list()

# check every one of our entity LPAs is in the LPA gdf
print(len(lpa_list))
nrow(LPA_boundary_gdf[LPA_boundary_gdf["LPACD"].isin(lpa_list)])

In [None]:
geogs_out_entities = []

# loop through LPA codes and for each check whether any conservation areas with that code don't intersect at all with the LPA boundary
for lpa_code in lpa_list:

    cons_areas = entity_gdf[entity_gdf["LPACD"] == lpa_code]
    cons_areas_intersect = cons_areas.geometry.intersects(LPA_boundary_gdf[LPA_boundary_gdf["LPACD"] == lpa_code].iloc[0].geometry)

    # add areas which don't intersect to the list
    geogs_out_entities.extend(cons_areas.loc[~cons_areas_intersect]["entity"].to_list())


entity_outside_LPA_df = entity_df[entity_df["entity"].isin(geogs_out_entities)]

# list of LPAs with entities outside them
LPAs_with_bads = entity_outside_LPA_df["LPACD"].drop_duplicates().to_list()

print(f"No. of bad entities found: {len(entity_outside_LPA_df):,}")
entity_outside_LPA_df.groupby("organisation_name").size()


No. of bad entities found: 15


organisation_name
Babergh District Council                  12
East Suffolk Council                       1
London Borough of Ealing                   1
London Borough of Hammersmith & Fulham     1
dtype: int64

In [453]:
entity_outside_LPA_df

Unnamed: 0,entity,geometry,name,organisation_entity,reference,organisation_name,organisation_type,LPACD
5828,44005968,"MULTIPOLYGON (((0.829991 52.230274, 0.83007 52...",Beyton,33,1,Babergh District Council,local-authority-eng,E60000183
5829,44005969,"MULTIPOLYGON (((0.891414 52.224025, 0.891411 5...",Woolpit,33,27,Babergh District Council,local-authority-eng,E60000183
5830,44005970,"MULTIPOLYGON (((0.995931 52.185989, 0.995939 5...",Stowmarket,33,19,Babergh District Council,local-authority-eng,E60000183
5834,44005974,"MULTIPOLYGON (((1.121016 52.25949, 1.121033 52...",Wetheringsett,33,25,Babergh District Council,local-authority-eng,E60000183
5837,44005977,"MULTIPOLYGON (((1.27374 52.343237, 1.273816 52...",Wingfield,33,29,Babergh District Council,local-authority-eng,E60000183
5844,44005984,"MULTIPOLYGON (((1.275201 52.31659, 1.275205 52...",Stradbroke,33,20,Babergh District Council,local-authority-eng,E60000183
5845,44005985,"MULTIPOLYGON (((1.364114 52.300877, 1.364176 5...",Laxfield,33,11,Babergh District Council,local-authority-eng,E60000183
5847,44005987,"MULTIPOLYGON (((1.108521 52.147819, 1.108344 5...",Coddenham,33,3,Babergh District Council,local-authority-eng,E60000183
5848,44005988,"MULTIPOLYGON (((0.896575 52.192579, 0.896612 5...",Rattlesden,33,17,Babergh District Council,local-authority-eng,E60000183
5853,44005993,"MULTIPOLYGON (((1.054631 52.152315, 1.054704 5...",Needham Market,33,15,Babergh District Council,local-authority-eng,E60000183


In [452]:
# map for LPA with entities outside it

LPA_code = LPAs_with_bads[0]
bad_ents = entity_outside_LPA_df["entity"][entity_outside_LPA_df["LPACD"] == LPA_code]


map_entities = entity_gdf[entity_gdf["entity"].isin(bad_ents)].explore(
        # column = chloro_var,  # make choropleth based on "BoroName" column
        # cmap = palette,
    color = "red",
        # tooltip = False,
        # popup = ["organisation_name", "entity", "name", "reference"],
        tiles = "CartoDB positron",  # use "CartoDB positron" tiles
        highlight = False,
        style_kwds = {
        "fillOpacity" : "0.1"
        }
)

LPA_boundary_gdf[LPA_boundary_gdf["LPACD"] == LPA_code].explore(
    m = map_entities,
    color = "blue",
        style_kwds = {
        "fillOpacity" : "0"
        }
)

In [None]:
# map for LPA with entities outside it

LPA_code = LPAs_with_bads[2]
bad_ents = entity_outside_LPA_df["entity"][entity_outside_LPA_df["LPACD"] == LPA_code]


map_entities = entity_gdf[entity_gdf["entity"].isin(bad_ents)].explore(
        # column = chloro_var,  # make choropleth based on "BoroName" column
        # cmap = palette,
    color = "red",
        # tooltip = False,
        # popup = ["organisation_name", "entity", "name", "reference"],
        tiles = "CartoDB positron",  # use "CartoDB positron" tiles
        highlight = False,
        style_kwds = {
        "fillOpacity" : "0.1"
        }
)

LPA_boundary_gdf[LPA_boundary_gdf["LPACD"] == LPA_code].explore(
    m = map_entities,
    color = "blue",
        style_kwds = {
        "fillOpacity" : "0"
        }
)

# Identifying geographical duplicates  
## Report

The aim is to quickly categorise each overlap into one of the following groups.

An entity overlaps with another entity:

1. within the same organisation
    
2. from a different organisation   

    a. LPA entity overlaps with entity from another LPA
        
    b. LPA entity overlaps with entity from Historic England

Within some of these categories (1 and 2.b), the type of overlap is also distinguished, with different actions recommended for each type.
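
The overlap types can be sketched for a single pair of entities as below. This is a minimal illustration that assumes the two inputs are the share of each entity's area covered by the intersection; `classify_overlap` is a hypothetical helper name, and the checks are applied in order with the first matching rule winning.

```python
MATCH_LOWER_THRESH = 0.9  # both shares at or above this: near-identical geometries
EDGE_UPPER_THRESH = 0.1   # both shares at or below this: boundaries only just overlap


def classify_overlap(p_pct, s_pct):
    """Return the overlap type given the share of each entity covered by the intersection."""
    if p_pct >= MATCH_LOWER_THRESH and s_pct >= MATCH_LOWER_THRESH:
        return "> 90% combined match"
    if p_pct <= EDGE_UPPER_THRESH and s_pct <= EDGE_UPPER_THRESH:
        return "edge intersection"
    if p_pct >= MATCH_LOWER_THRESH or s_pct >= MATCH_LOWER_THRESH:
        return "> 90% single match"
    return "-"


for shares in [(0.95, 0.97), (0.01, 0.02), (0.95, 0.30), (0.50, 0.50)]:
    print(shares, classify_overlap(*shares))
```

A pair where only one share is high (the third case) typically means one entity sits almost entirely inside a larger one, which is a different situation from two near-identical geometries.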



In [442]:
MATCH_LOWER_THRESH = 0.9  # defines the lower limit of the shared overlap between two entities to be called a match
EDGE_UPPER_THRESH = 0.1   # defines the upper limit of the shared overlap between two entities to be called an edge intersection


# full join of all geometries
entity_join_all = gpd.overlay(
    entity_gdf, 
    entity_gdf,
    how = "intersection", keep_geom_type=False 
)


# remove self-intersections and duplicates of the same intersections
entity_join_all = entity_join_all[entity_join_all["entity_1"] != entity_join_all["entity_2"]]

entity_join_all["entity_join"] = entity_join_all.apply(lambda x: '-'.join(sorted(x[["entity_1", "entity_2"]])), axis=1)
entity_join_all.drop_duplicates(subset="entity_join", inplace = True) #Drop them by name

# nrow(entity_join_all)

# flag the types of intersections between organisations
# is org the same
entity_join_all["int_org_match"] = np.where(entity_join_all["organisation_entity_1"] == entity_join_all["organisation_entity_2"], True, False)

# the types of org-org matches
entity_join_all["int_org_types"] = np.select(
    [
        (entity_join_all["organisation_entity_1"] == 16) & (entity_join_all["organisation_entity_2"] == 16),
        (entity_join_all["organisation_entity_1"] != 16) & (entity_join_all["organisation_entity_2"] != 16),
        ((entity_join_all["organisation_entity_1"] != 16) & (entity_join_all["organisation_entity_2"] == 16)) |
        ((entity_join_all["organisation_entity_1"] == 16) & (entity_join_all["organisation_entity_2"] != 16))
    ],
    ["HE - HE", "LPA - LPA", "HE - other"],
    default = "-"
)

# does the entity entry date match?
entity_join_all["date_match"] = np.where(entity_join_all["entry_date_1"] == entity_join_all["entry_date_2"], True, False)

# has one of the intersected entities already been re-mapped?
entity_join_all["entity_old"] = np.where(entity_join_all["entity_1"].isin(old_entity_df["old_entity"]) |
                                         entity_join_all["entity_2"].isin(old_entity_df["old_entity"]), True, False)


# calculate overlap %'s

entity_join_all["area_intersection"] = entity_join_all["geometry"].area

entity_join_all["p_pct_intersect"] = entity_join_all["area_intersection"] / entity_join_all["area_1"]
entity_join_all["pct_intersection"] = entity_join_all["area_intersection"] / (entity_join_all["area_1"] + entity_join_all["area_2"] - entity_join_all["area_intersection"])
entity_join_all["s_pct_intersect"] = entity_join_all["area_intersection"] / entity_join_all["area_2"]

# intersection area as % of smallest primary or secondary area
entity_join_all["pct_min_intersection"] = entity_join_all["area_intersection"] / entity_join_all[["area_1", "area_2"]].min(axis = 1)


entity_join_all["issue_type"] = np.select(
    [
        (entity_join_all["p_pct_intersect"] >= MATCH_LOWER_THRESH) & (entity_join_all["s_pct_intersect"] >= MATCH_LOWER_THRESH),
        (entity_join_all["p_pct_intersect"] <= EDGE_UPPER_THRESH) & (entity_join_all["s_pct_intersect"] <= EDGE_UPPER_THRESH),
        ((entity_join_all["p_pct_intersect"] >= MATCH_LOWER_THRESH) | (entity_join_all["s_pct_intersect"] >= MATCH_LOWER_THRESH)),
        
    ],
    [
        "> 90% combined match", "edge intersection", "> 90% single match"
    ],
    default = "-"
)

nrow(entity_join_all)
entity_join_all.head()

No. of records in df: 2,546


Unnamed: 0,entity_1,entry_date_1,name_1,organisation_entity_1,reference_1,organisation_name_1,organisation_type_1,LPACD_1,area_1,entity_2,...,int_org_match,int_org_types,date_match,entity_old,area_intersection,p_pct_intersect,pct_intersection,s_pct_intersect,pct_min_intersection,issue_type
7,44000009,2022-04-12,Childwickbury,16,5063,Historic England,government-organisation,,1885513.0,44000007,...,True,HE - HE,True,False,2.170036,1e-06,4.225103e-07,6.675918e-07,1e-06,edge intersection
18,44000732,2022-04-12,Bodenham Road,16,5104,Historic England,government-organisation,,96781.47,44000016,...,True,HE - HE,True,False,2.162021,2.2e-05,5.452397e-06,7.212809e-06,2.2e-05,edge intersection
22,44000770,2022-04-12,Leominster Town,16,2499,Historic England,government-organisation,,255623.2,44000017,...,True,HE - HE,True,False,2.033437,8e-06,4.665495e-06,1.128278e-05,1.1e-05,edge intersection
50,44000043,2022-04-12,Butterworth Hall,16,7716,Historic England,government-organisation,,29687.92,44000042,...,True,HE - HE,True,False,0.0,0.0,0.0,0.0,0.0,edge intersection
60,44003132,2022-04-12,Worcester and Birmingham Canal,16,449,Historic England,government-organisation,,253931.3,44000050,...,True,HE - HE,True,False,4.129991,1.6e-05,7.315925e-06,1.329709e-05,1.6e-05,edge intersection
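
To make the four overlap measures concrete, the sketch below computes them by hand for two axis-aligned rectangles instead of real geometries. The rectangle helpers are hypothetical and exist only for this illustration; the notebook itself takes the areas from the geopandas overlay.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def rect_intersection_area(a, b):
    """Area shared by two axis-aligned rectangles (0 if they do not overlap)."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return rect_area(overlap)


def overlap_measures(a, b):
    """Compute the same four ratios as the notebook, for two rectangles."""
    area_1, area_2 = rect_area(a), rect_area(b)
    inter = rect_intersection_area(a, b)
    return {
        "p_pct_intersect": inter / area_1,                       # share of entity 1 covered
        "s_pct_intersect": inter / area_2,                       # share of entity 2 covered
        "pct_intersection": inter / (area_1 + area_2 - inter),   # intersection over union
        "pct_min_intersection": inter / min(area_1, area_2),     # share of the smaller entity
    }


# a small square sitting entirely inside a larger one
large = (0.0, 0.0, 10.0, 10.0)  # area 100
small = (2.0, 2.0, 4.0, 4.0)    # area 4
measures = overlap_measures(large, small)
print(measures)
```

Here the small square is fully covered, so `s_pct_intersect` and `pct_min_intersection` are 1.0, while the union-based `pct_intersection` is only 0.04. This suggests why a single measure would not separate containment from a near-identical match.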


In [None]:
# check the flagging of intersections between different org types is correct
entity_join_all.groupby(["int_org_match", "int_org_types", "organisation_entity_1", "organisation_entity_2"]).size()

int_org_match  int_org_types  organisation_entity_1  organisation_entity_2
False          HE - other     3                      16                        1
                              16                     73                        3
                                                     80                        5
                                                     90                        1
                                                     100                       1
                                                                              ..
True           LPA - LPA      309                    309                       2
                              319                    319                       3
                              329                    329                      28
                              352                    352                       1
                              376                    376                       4
Length: 156, dtype: int64

In [518]:
# count of issues by type breakdown
entity_join_all.groupby(['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type']).size()

# write to csv to add in further descriptions
# entity_join_all.groupby(['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type']).size().to_csv("temp_issue_mapping_table.csv")

entity_old  int_org_match  int_org_types  date_match  issue_type          
False       False          HE - other     False       -                        18
                                                      > 90% combined match    209
                                                      > 90% single match      229
                                                      edge intersection       390
                           LPA - LPA      False       edge intersection        30
                                          True        -                         1
                                                      edge intersection        15
            True           HE - HE        False       -                         2
                                                      > 90% combined match      1
                                                      > 90% single match        4
                                                      edge intersection       129
                       

In [488]:
# read back in issue flag mapping table with extra fields
# note: issue_mapping_df must be loaded (uncomment the line below) before the merge will run
# issue_mapping_df = pd.read_csv("temp_issue_mapping_table.csv")

# check join works
nrow(entity_join_all)
nrow(entity_join_all.merge(
        issue_mapping_df, 
        how = "inner", 
        on = ['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type']))

entity_join_all = entity_join_all.merge(
        issue_mapping_df, 
        how = "inner", 
        on = ['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type'])

No. of records in df: 2,546
No. of records in df: 2,546


In [492]:
# write full report table to csv
# entity_join_all[[
#     'entity_join', 'entry_date_1', 'name_1', 'organisation_entity_1',
#     'reference_1', 'organisation_name_1', 'organisation_type_1',
#     'entry_date_2', 'name_2', 'organisation_entity_2',
#     'reference_2', 'organisation_name_2', 'organisation_type_2',
#     'p_pct_intersect', 'pct_intersection', 's_pct_intersect', 'pct_min_intersection',
#     'int_org_match', 'int_org_types', 'date_match', 'entity_old',
#     'issue_type', 'issue_description', 'action']].to_csv("issues_report_test.csv", index=False)

In [519]:
plot_issues_map(entity_gdf, "44007798-44007822", "name", "Accent")

## 1 - Intersection within organisation

In [392]:
# Overlay all non-Historic England entities (the conservation area data HE publishes contains overlaps, so not trying to flag those here)
LPA_LPA_join = gpd.overlay(
    # entity_gdf, entity_gdf,
    entity_gdf[entity_gdf["organisation_entity"] != 16],
    entity_gdf[entity_gdf["organisation_entity"] != 16],
    how = "intersection", keep_geom_type=False 
)

# remove entity self-intersections and intersections across organisations
LPA_LPA_join = LPA_LPA_join[(LPA_LPA_join["organisation_entity_1"] == LPA_LPA_join["organisation_entity_2"]) &
             (LPA_LPA_join["entity_1"] != LPA_LPA_join["entity_2"])]

nrow(LPA_LPA_join)
# each intersection will be in there twice because we're joining the same dataset 
# (e.g. polygon1-polygon2 and polygon2-polygon1), so remove these
LPA_LPA_join["entity_join"] = LPA_LPA_join.apply(lambda x: '-'.join(sorted(x[["entity_1", "entity_2"]])), axis=1)
LPA_LPA_join.drop_duplicates(subset="entity_join", inplace = True) #Drop them by name

# calculate overlap %'s

LPA_LPA_join["area_intersection"] = LPA_LPA_join["geometry"].area

LPA_LPA_join["p_pct_intersect"] = LPA_LPA_join["area_intersection"] / LPA_LPA_join["area_1"]
LPA_LPA_join["pct_intersection"] = LPA_LPA_join["area_intersection"] / (LPA_LPA_join["area_1"] + LPA_LPA_join["area_2"] - LPA_LPA_join["area_intersection"])
LPA_LPA_join["s_pct_intersect"] = LPA_LPA_join["area_intersection"] / LPA_LPA_join["area_2"]

# intersection area as % of smallest primary or secondary area
LPA_LPA_join["pct_min_intersection"] = LPA_LPA_join["area_intersection"] / LPA_LPA_join[["area_1", "area_2"]].min(axis = 1)

LPA_LPA_join["issue_type"] = np.select(
    [
        (LPA_LPA_join["p_pct_intersect"] >= MATCH_LOWER_THRESH) & (LPA_LPA_join["s_pct_intersect"] >= MATCH_LOWER_THRESH),
        (LPA_LPA_join["p_pct_intersect"] <= EDGE_UPPER_THRESH) & (LPA_LPA_join["s_pct_intersect"] <= EDGE_UPPER_THRESH),
        ((LPA_LPA_join["p_pct_intersect"] >= MATCH_LOWER_THRESH) | (LPA_LPA_join["s_pct_intersect"] >= MATCH_LOWER_THRESH)),
        
    ],
    [
        "> 90% combined match", "edge intersection", "> 90% single match"
    ],
    default = "-"
)

LPA_LPA_join["date_match"] = np.where(LPA_LPA_join["entry_date_1"] == LPA_LPA_join["entry_date_2"], True, False)

LPA_LPA_join = LPA_LPA_join[['entity_1', 'entry_date_1', 'name_1', 'organisation_entity_1',
       'reference_1', 'organisation_name_1', 'entity_2', 'entry_date_2', 'name_2', 'organisation_entity_2',
       'reference_2', 'organisation_name_2', 
       'geometry', 'entity_join', 'area_intersection',
       'p_pct_intersect', 'pct_intersection', 's_pct_intersect',
       'pct_min_intersection', 'date_match', 'issue_type']]

nrow(LPA_LPA_join)
LPA_LPA_join.head()

No. of records in df: 1,136
No. of records in df: 568


Unnamed: 0,entity_1,entry_date_1,name_1,organisation_entity_1,reference_1,organisation_name_1,entity_2,entry_date_2,name_2,organisation_entity_2,...,organisation_name_2,geometry,entity_join,area_intersection,p_pct_intersect,pct_intersection,s_pct_intersect,pct_min_intersection,date_match,issue_type
19,44004074,2020-09-04,Landford Road Cons Area,376,COA00000876,London Borough of Wandsworth,44001043,2020-09-04,Charlwood Road/Lifford Street Cons Area,376,...,London Borough of Wandsworth,GEOMETRYCOLLECTION (POLYGON ((523627.357 17554...,44001043-44004074,3.904702,2.3e-05,1.434359e-05,3.8e-05,3.8e-05,True,edge intersection
22,44004075,2020-09-04,Putney Lower Common Cons Area,376,COA00000836,London Borough of Wandsworth,44004074,2020-09-04,Landford Road Cons Area,376,...,London Borough of Wandsworth,GEOMETRYCOLLECTION (POLYGON ((522799.363 17559...,44004074-44004075,12.182724,0.000275,5.673875e-05,7.1e-05,0.000275,True,edge intersection
26,44004073,2020-09-04,Westmead Cons Area,376,COA00000866,London Borough of Wandsworth,44001054,2020-09-04,Roehampton Village Cons Area,376,...,London Borough of Wandsworth,GEOMETRYCOLLECTION (POLYGON ((522443.085 17387...,44001054-44004073,10.79905,4e-05,3.163079e-05,0.000149,0.000149,True,edge intersection
34,44008940,2020-09-04,Cheyne,182,COA00000543,Royal Borough of Kensington and Chelsea,44008941,2020-09-04,Thames,182,...,Royal Borough of Kensington and Chelsea,GEOMETRYCOLLECTION (POLYGON ((527132.055 17758...,44008940-44008941,0.490527,2e-06,9.088055e-07,2e-06,2e-06,True,edge intersection
36,44008942,2020-09-04,Royal Hospital,182,COA00000546,Royal Borough of Kensington and Chelsea,44008941,2020-09-04,Thames,182,...,Royal Borough of Kensington and Chelsea,GEOMETRYCOLLECTION (POLYGON ((527584.415 17772...,44008941-44008942,79.867954,0.000143,9.243221e-05,0.00026,0.00026,True,edge intersection
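
The removal of mirrored pairs used above can be sketched in plain Python as follows. It assumes entity ids are held as strings, which is also why `entity` is cast to `str` earlier; `pair_key` is a hypothetical helper name, and the ids are taken from the output table above.

```python
def pair_key(entity_a, entity_b):
    """Order-independent key for a pair of entity ids held as strings."""
    return "-".join(sorted([entity_a, entity_b]))


# the first two pairs are the same intersection seen from either side
pairs = [
    ("44004074", "44001043"),
    ("44001043", "44004074"),
    ("44004075", "44004074"),
]

seen = set()
unique_pairs = []
for entity_a, entity_b in pairs:
    key = pair_key(entity_a, entity_b)
    if key not in seen:  # keep only the first occurrence of each pair
        seen.add(key)
        unique_pairs.append(key)

print(unique_pairs)
```

Sorting before joining means both orderings of a pair collapse to the same key, so dropping duplicate keys halves a self-overlay such as the 1,136 rows reduced to 568 above.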


In [339]:
# how many entities with a greater than 10% intersection?
nrow(LPA_LPA_join[(LPA_LPA_join["pct_min_intersection"] > 0.1)])

# LPA_LPA_join[(LPA_LPA_join["pct_min_intersection"] > 0.1)].sort_values("pct_min_intersection", ascending=False)

No. of records in df: 49


In [345]:
LPA_LPA_join.groupby(["date_match", "issue_type", "organisation_name_1"]).size()

date_match  issue_type            organisation_name_1                    
False       -                     London Borough of Lambeth                   3
            > 90% combined match  Epsom and Ewell Borough Council             2
                                  London Borough of Hammersmith & Fulham      2
                                  London Borough of Lambeth                   2
                                  London Borough of Southwark                 1
                                  Sheffield City Council                      3
            > 90% single match    Epsom and Ewell Borough Council             5
                                  London Borough of Hammersmith & Fulham      2
                                  London Borough of Lambeth                   3
                                  London Borough of Southwark                 3
            edge intersection     Cornwall Council                            1
                                  Epsom and Ew

In [349]:
# count by organisation of entities with intersections > 10%
LPA_LPA_join[(LPA_LPA_join["pct_min_intersection"] > 0.1)].groupby(["organisation_name_1"]).size().sort_values(ascending = False)

organisation_name_1
Maldon District Council                   20
London Borough of Lambeth                  8
Epsom and Ewell Borough Council            7
London Borough of Hammersmith & Fulham     4
London Borough of Southwark                4
Sheffield City Council                     3
Dudley Metropolitan Borough Council        2
Buckinghamshire Council                    1
dtype: int64

**Notes from run-through with Swati**

* Solution: go back to the LPA.
* Possible explanation: the data is coming from a different endpoint and the first one has not been retired. We need to rule this out before going back to the LPA.
* When a new endpoint is added we want to keep both, so that there is a record of the data over time. The platform should only present the latest version.
* We need to understand the entity creation process better to see how geographical duplicates could be created: talk to Kena.

In [521]:
# inspect example
# plot_issues_map(entity_gdf, ["44005062", "44002577"], "name", "Accent")
plot_issues_map(entity_gdf, "44000171-44000170", "name", "Accent")


In [522]:
# inspect example
plot_issues_map(entity_gdf, "44008830-44006848", "name", "Accent")


## 2 - Intersection across organisations
   
### 2.a LPA entity overlaps with entity from another LPA

In [415]:
# Overlay all non-Historic England entities
LPA_cross_join = gpd.overlay(
    entity_gdf[entity_gdf["organisation_entity"] != 16],
    entity_gdf[entity_gdf["organisation_entity"] != 16],
    how = "intersection", keep_geom_type=False 
)

# filter to join across organisations and entities
LPA_cross_join = LPA_cross_join[(LPA_cross_join["organisation_entity_1"] != LPA_cross_join["organisation_entity_2"]) &
             (LPA_cross_join["entity_1"] != LPA_cross_join["entity_2"])]

# each intersection will be in there twice because we're joining the same dataset 
# (e.g. polygon1-polygon2 and polygon2-polygon1), so remove these
LPA_cross_join["entity_join"] = LPA_cross_join.apply(lambda x: '-'.join(sorted(x[["entity_1", "entity_2"]])), axis=1)
LPA_cross_join.drop_duplicates(subset="entity_join", inplace = True) #Drop them by name

# # calculate overlap %'s

LPA_cross_join["area_intersection"] = LPA_cross_join["geometry"].area

# # LPA_LPA_join["p_pct_intersect"] = LPA_LPA_join["area_intersection"] / LPA_LPA_join["area_1"]
# # LPA_LPA_join["pct_intersection"] = LPA_LPA_join["area_intersection"] / (LPA_LPA_join["area_1"] + LPA_LPA_join["area_2"] - LPA_LPA_join["area_intersection"])
# # LPA_LPA_join["s_pct_intersect"] = LPA_LPA_join["area_intersection"] / LPA_LPA_join["area_2"]

# intersection area as % of smallest primary or secondary area
LPA_cross_join["pct_min_intersection"] = LPA_cross_join["area_intersection"] / LPA_cross_join[["area_1", "area_2"]].min(axis = 1)

nrow(LPA_cross_join)
LPA_cross_join.head()

No. of records in df: 46


Unnamed: 0,entity_1,entry_date_1,name_1,organisation_entity_1,reference_1,organisation_name_1,organisation_type_1,LPACD_1,area_1,entity_2,...,organisation_entity_2,reference_2,organisation_name_2,organisation_type_2,LPACD_2,area_2,geometry,entity_join,area_intersection,pct_min_intersection
32,44008941,2020-09-04,Thames,182,COA00000544,Royal Borough of Kensington and Chelsea,local-authority-eng,E60000194,306674.786572,44001064,...,376,COA00000338,London Borough of Wandsworth,local-authority-eng,E60000200,123464.2,"POLYGON ((526643.359 176971.706, 526643.284 17...",44001064-44008941,0.00654,5.296789e-08
143,44008828,2020-09-04,Walcot,192,COA00000224,London Borough of Lambeth,local-authority-eng,E60000195,80645.499631,44002361,...,329,14,London Borough of Southwark,local-authority-eng,E60000198,156723.5,"MULTIPOLYGON (((531397.460 178999.938, 531451....",44002361-44008828,61.364033,0.0007609108
149,44008827,2020-09-04,St Marks,192,COA00000222,London Borough of Lambeth,local-authority-eng,E60000195,258058.687895,44002362,...,329,9,London Borough of Southwark,local-authority-eng,E60000198,34491.27,"POLYGON ((531567.046 177878.773, 531566.975 17...",44002362-44008827,1.594233,4.622134e-05
184,44008587,2022-01-19,Minet Estate,192,CA25,London Borough of Lambeth,local-authority-eng,E60000195,254651.578819,44002371,...,329,5,London Borough of Southwark,local-authority-eng,E60000198,86119.13,"POLYGON ((532016.682 176737.603, 532035.491 17...",44002371-44008587,287.899188,0.003343034
203,44008841,2020-09-04,0,198,COA00000275,London Borough of Lewisham,local-authority-eng,E60000196,61970.45524,44002380,...,329,24,London Borough of Southwark,local-authority-eng,E60000198,1806972.0,"MULTIPOLYGON (((534602.962 172572.973, 534608....",44002380-44008841,731.761008,0.01180822
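The `entity_join` values in the output above (for example `44001064-44008941`) look like order-independent pair keys, which is what makes `drop_duplicates` remove the mirrored A-B / B-A rows. A minimal sketch of that idea, assuming the key is formed by sorting the two entity ids as strings; the helper name `pair_key` is hypothetical and not taken from this notebook:

```python
def pair_key(entity_a, entity_b):
    """Hypothetical helper: build an order-independent key for a pair of entity ids."""
    # Sorting as strings is safe here because the ids all have the same length
    first, second = sorted([str(entity_a), str(entity_b)])
    return f"{first}-{second}"


# The first two pairs are the same overlap seen from each side
pairs = [
    ("44008941", "44001064"),
    ("44001064", "44008941"),
    ("44008828", "44002361"),
]

# A set keeps one key per unordered pair
unique_keys = sorted({pair_key(a, b) for a, b in pairs})
print(unique_keys)
```

Dropping duplicates on a key like this keeps a single row per pair, regardless of which entity ended up on the `_1` or `_2` side of the join.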


In [None]:
# Look at distribution to check how many are edges vs. major overlaps
plt.hist(LPA_cross_join["pct_min_intersection"], bins=50);

In [254]:
# which entity pairs intersect by more than 10% of the smaller entity's area?
LPA_cross_join[(LPA_cross_join["pct_min_intersection"] > 0.1)]

Unnamed: 0,entity_1,name_1,organisation_entity_1,reference_1,organisation_name_1,organisation_type_1,LPACD_1,area_1,entity_2,name_2,organisation_entity_2,reference_2,organisation_name_2,organisation_type_2,LPACD_2,area_2,geometry,entity_join,area_intersection,pct_min_intersection
644,44009059,,329,COA00000781,London Borough of Southwark,local-authority-eng,E60000198,18976.083189,44008830,South Bank,192,COA00000240,London Borough of Lambeth,local-authority-eng,E60000195,536689.574671,"POLYGON ((531348.617 180459.492, 531276.469 18...",44008830-44009059,9639.44694,0.507979


In [527]:
plot_issues_map(entity_gdf, "44009059-44008830", "organisation_name", "Accent")

### 2.b LPA entity overlaps with Historic England entities

In [416]:
# start_time = time.time()

LPA_HE_join = gpd.overlay(
    entity_gdf[entity_gdf["organisation_entity"] != 16],
    entity_gdf[entity_gdf["organisation_entity"] == 16],
    how = "intersection", keep_geom_type=False
)

LPA_HE_join["area_intersection"] = LPA_HE_join["geometry"].area

LPA_HE_join["p_pct_intersect"] = LPA_HE_join["area_intersection"] / LPA_HE_join["area_1"]
LPA_HE_join["pct_intersection"] = LPA_HE_join["area_intersection"] / (LPA_HE_join["area_1"] + LPA_HE_join["area_2"] - LPA_HE_join["area_intersection"])
LPA_HE_join["s_pct_intersect"] = LPA_HE_join["area_intersection"] / LPA_HE_join["area_2"]


# intersection area as a proportion of the smaller of the two entity areas
LPA_HE_join["pct_min_intersection"] = LPA_HE_join["area_intersection"] / LPA_HE_join[["area_1", "area_2"]].min(axis = 1)


# end_time = time.time()

# elapsed_time = (end_time - start_time) 
# print(f"Elapsed time: {elapsed_time:.2f} ")

nrow(LPA_HE_join)
LPA_HE_join.head()

No. of records in df: 847


Unnamed: 0,entity_1,entry_date_1,name_1,organisation_entity_1,reference_1,organisation_name_1,organisation_type_1,LPACD_1,area_1,entity_2,...,organisation_name_2,organisation_type_2,LPACD_2,area_2,geometry,area_intersection,p_pct_intersect,pct_intersection,s_pct_intersect,pct_min_intersection
0,44000865,2022-01-19,Lansdowne Gardens,192,CA3,London Borough of Lambeth,local-authority-eng,E60000195,59097.031686,44006594,...,Historic England,government-organisation,,37843.726017,GEOMETRYCOLLECTION (POLYGON ((530194.816 17674...,2.391815,4e-05,2.5e-05,6.3e-05,6.3e-05
1,44000873,2022-01-19,La Retraite,192,CA36,London Borough of Lambeth,local-authority-eng,E60000195,46132.000399,44001073,...,Historic England,government-organisation,,26904.215511,"MULTIPOLYGON (((528970.549 173582.032, 528972....",2.786277,6e-05,3.8e-05,0.000104,0.000104
2,44000880,2022-01-19,Streatham Park & Garrads Road,192,CA12,London Borough of Lambeth,local-authority-eng,E60000195,185497.367902,44000883,...,Historic England,government-organisation,,268240.152734,POINT (530014.382 171844.897),0.0,0.0,0.0,0.0,0.0
3,44000880,2022-01-19,Streatham Park & Garrads Road,192,CA12,London Borough of Lambeth,local-authority-eng,E60000195,185497.367902,44001076,...,Historic England,government-organisation,,73176.136666,GEOMETRYCOLLECTION (POLYGON ((529607.084 17219...,16.250818,8.8e-05,6.3e-05,0.000222,0.000222
4,44000889,2022-01-19,Brockwell Park,192,CA39,London Borough of Lambeth,local-authority-eng,E60000195,631719.05047,44000881,...,Historic England,government-organisation,,17371.062248,"MULTIPOINT (531356.603 174586.838, 531288.586 ...",0.0,0.0,0.0,0.0,0.0


In [None]:
# plot the issues by the amount the two entities which make up each issue intersect each other
# this is useful to start to define categories for the types of issues they represent

fig = plt.figure()
plt.grid()
plt.scatter(LPA_HE_join["p_pct_intersect"], LPA_HE_join["s_pct_intersect"], s = 8, alpha=0.6)
fig.suptitle('Entity intersection %s', fontsize=14)
plt.xlabel('% of LPA entity intersected', fontsize=10)
plt.ylabel('% of Historic England entity intersected', fontsize=10)

The dense band of points on the far right of the chart shows that many LPA entities are entirely or almost entirely contained within an HE entity. How much of the HE entity is covered in return varies widely, from almost none of it to almost all of it.

The bottom left holds a cluster of tiny edge intersections, and the top left shows a small number of cases where an HE entity is contained within an LPA one.
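To make the chart axes concrete, here is a small worked example of the four overlap ratios used in this notebook. It is only a sketch: it uses axis-aligned rectangles instead of real geometries so the areas can be checked by hand, and the helper names (`rect_area`, `rect_intersection_area`, `overlap_metrics`) are hypothetical.

```python
def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax); 0 if it is empty."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def rect_intersection_area(a, b):
    """Area shared by two axis-aligned rectangles."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return rect_area(overlap)


def overlap_metrics(primary, secondary):
    """Same ratios as the notebook columns, for two rectangles."""
    area_1, area_2 = rect_area(primary), rect_area(secondary)
    inter = rect_intersection_area(primary, secondary)
    return {
        "p_pct_intersect": inter / area_1,       # share of the primary (LPA) covered
        "s_pct_intersect": inter / area_2,       # share of the secondary (HE) covered
        "pct_intersection": inter / (area_1 + area_2 - inter),  # intersection over union
        "pct_min_intersection": inter / min(area_1, area_2),
    }


lpa = (0.0, 0.0, 10.0, 10.0)    # area 100
he = (-5.0, -5.0, 15.0, 15.0)   # area 400, fully contains the LPA rectangle
m = overlap_metrics(lpa, he)
print(m)
```

In this example the LPA shape sits at the far right of the chart (`p_pct_intersect` of 1.0) but low on the vertical axis (`s_pct_intersect` of 0.25), which is the "LPA covered by HE" pattern described above.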

In [None]:
# flag issue types - defined to pick up the main issue clusters on the chart above, using 90% and 10% intersection cutoffs

LPA_HE_join["issue_type"] = np.select(
    [
        (LPA_HE_join["p_pct_intersect"] >= 0.9) & (LPA_HE_join["s_pct_intersect"] >= 0.9),
        (LPA_HE_join["p_pct_intersect"] <= 0.1) & (LPA_HE_join["s_pct_intersect"] <= 0.1),
        (LPA_HE_join["p_pct_intersect"] >= 0.9),
        (LPA_HE_join["s_pct_intersect"] >= 0.9)
    ],
    [
        "LPA and HE cover each other", "edge intersection", "LPA covered by HE", "LPA covers HE"
    ],
    default = "-"
)
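`np.select` assigns the label of the first condition that is true, so the order of the conditions matters: a pair with both ratios at or above 90% is labelled as mutual cover before the one-sided checks are reached. The plain-Python sketch below mirrors that ordering for a single pair; the function name `classify_overlap` and its `cover` / `edge` parameters are hypothetical and only for illustration.

```python
def classify_overlap(p_pct, s_pct, cover=0.9, edge=0.1):
    """Label one LPA/HE pair; checks run in the same order as the np.select conditions."""
    if p_pct >= cover and s_pct >= cover:
        return "LPA and HE cover each other"
    if p_pct <= edge and s_pct <= edge:
        return "edge intersection"
    if p_pct >= cover:
        return "LPA covered by HE"
    if s_pct >= cover:
        return "LPA covers HE"
    return "-"  # neither an edge touch nor a 90% cover on either side


examples = [(0.95, 0.97), (0.01, 0.02), (0.99, 0.30), (0.20, 0.95), (0.50, 0.40)]
labels = [classify_overlap(p, s) for p, s in examples]
print(labels)
```

The last example falls through to `"-"`: at least one side is more than 10% intersected, but neither reaches 90%.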

In [None]:
# LPA_HE_join[(LPA_HE_join["pct_intersection"] >= 0.9)].sort_values("pct_intersection")
# LPA_HE_join[(LPA_HE_join["issue_type"] == "LPA covers HE")].sort_values("pct_intersection")

In [257]:
# count of issue types (where cover is defined as >=90% intersection, and edge as <=10%)
LPA_HE_join.groupby(["issue_type"]).size()

issue_type
-                               18
LPA and HE cover each other    210
LPA covered by HE              201
LPA covers HE                   28
edge intersection              390
dtype: int64

In [256]:
# count of non-edge issues
nrow(LPA_HE_join[(LPA_HE_join["issue_type"] != "edge intersection")])

# LPAs with the most non-edge issues
LPA_HE_join[(LPA_HE_join["issue_type"] != "edge intersection")].groupby(["organisation_name_1"]).size().sort_values(ascending = False).head(15)

No. of records in df: 457


organisation_name_1
Maldon District Council                   60
Doncaster Metropolitan Borough Council    45
North Dorset District Council             37
London Borough of Bromley                 37
Medway Council                            24
Great Yarmouth Borough Council            21
Peak District National Park Authority     19
Sheffield City Council                    15
Gloucester City Council                   14
London Borough of Hounslow                14
Waverley Borough Council                  13
London Borough of Bexley                  12
Mole Valley District Council              12
London Borough of Richmond upon Thames    12
East Suffolk Council                      11
dtype: int64

### Issue examples by type

#### LPA and HE cover each other (almost perfect matches)

- need to establish who has authority here, i.e. who can create conservation areas
- could we just switch off HE conservation areas for an area when we add data for a new LPA?

In [528]:
plot_issues_map(entity_gdf, "44002803-44008960", "organisation_name", "Accent")


#### LPA covered by HE

In [None]:
# LPA_HE_join[(LPA_HE_join["issue_type"] == "LPA covered by HE")].sort_values("pct_min_intersection", ascending = False).head()

In [529]:
plot_issues_map(entity_gdf, "44002322-44009177", "organisation_name", "Accent")


#### LPA covers HE

In [531]:
plot_issues_map(entity_gdf, "44005188-44009160", "organisation_name", "Accent")

#### Edge intersection

In [533]:
plot_issues_map(entity_gdf, "44006481-44006512", "organisation_name", "Accent")

#### Entities with multiple issues

In [534]:
# LPA_HE_join[(LPA_HE_join["entity_1"] == "44006512")]

entity_count = LPA_HE_join.groupby(["entity_1"]).size().reset_index()
entity_count.columns = ["entity_1", "count"]
entity_count[entity_count["count"] > 1].sort_values('count', ascending = False)

Unnamed: 0,entity_1,count
446,44009667,8
398,44009090,8
56,44006848,7
369,44009057,7
287,44008830,7
...,...,...
211,44007968,2
376,44009066,2
375,44009065,2
223,44008198,2


In [None]:
t = LPA_HE_join[LPA_HE_join["entity_1"] == "44009090"]

# grab all entities that have an issue with 44009090
te = np.concatenate((
    t["entity_1"].drop_duplicates().values,
    t["entity_2"].drop_duplicates().values
))

# plot_issues_map expects a "-"-separated string of entity ids, not an array
plot_issues_map(entity_gdf, "-".join(map(str, te)), "organisation_name", "Accent")


#### Entities with non-classified issues

In [None]:
# unclassified pairs: at least one side is more than 10% intersected, but neither side reaches 90%
LPA_HE_join[(LPA_HE_join["issue_type"] == "-")].sort_values("pct_min_intersection", ascending = False).head()

In [None]:
# looking at entity re-directs
entity_df[entity_df["entity"].isin(["44000549", "44008664"])]

## Questions to resolve
* How to find the endpoint / resource for each entity?
* What existing issues / replacements have been documented for the dataset?
* Which entity takes precedence? Oldest / newest?
* What threshold to set for removing duplicates?
* How to extract the data required for updating through the lookups file?
* How to replicate this check in the endpoint checker with a new dataset?
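One way to approach the threshold question is to count how many pairs each candidate cutoff would flag. The sketch below is illustrative only: it uses the six `pct_min_intersection` values printed earlier in this notebook for the LPA cross join (the five `head()` rows plus the single pair above 10%), and the helper name `count_at_or_above` and the candidate thresholds are assumptions.

```python
# pct_min_intersection values copied from the LPA cross-join outputs shown earlier
values = [5.296789e-08, 0.0007609108, 4.622134e-05, 0.003343034, 0.01180822, 0.507979]


def count_at_or_above(values, thresholds):
    """Hypothetical helper: number of pairs flagged at each candidate threshold."""
    return {t: sum(v >= t for v in values) for t in thresholds}


counts = count_at_or_above(values, [0.01, 0.1, 0.5, 0.9])
for threshold, n in counts.items():
    print(f"threshold {threshold:.2f}: {n} pair(s) flagged")
```

Running the same count over the full `pct_min_intersection` column would show how sensitive the number of flagged duplicates is to the cutoff.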

**Notes from run-through with Swati**

* Feed in LPA boundaries here to make sure we contact the right LPA: change this query to use LPA boundaries.
* Check with Carlos how this has been done for brownfield.