## Examining Duplication Between Organisations

In our data we often receive multiple data sources per dataset. Unfortunately, this leads to duplication of geometries and other data points in the datasets. This notebook investigates how to identify these duplications between organisations.

In [104]:
# from download_data import download_dataset
# from data import get_entity_dataset, nrow
# from plot import plot_map, plot_issues_map
import spatialite
import pandas as pd
import geopandas as gpd
import os
import itertools
import shapely.wkt
import logging

import matplotlib.pyplot as plt
import time
import urllib.parse

import numpy as np

pd.set_option("display.max_rows", 100)


In [105]:
# if running on Colab, uncomment and run this line below too:
# !pip install mapclassify

### Functions

In [106]:
def nrow(df):
    """Print the number of records in a DataFrame."""
    print(f"No. of records in df: {len(df):,}")


def plot_issues_map(gdf: gpd.GeoDataFrame, entity_link, chloro_var, palette):
    """Map the entities listed in entity_link, a '-'-joined pair of entity ids."""
    if not isinstance(gdf, gpd.GeoDataFrame):
        logging.error("input is not a GeoDataFrame")
        raise TypeError("input is not a GeoDataFrame")

    entity_list = entity_link.split("-")

    base = gdf[gdf["entity"].isin(entity_list)].explore(
        column=chloro_var,  # colour the choropleth by this column
        cmap=palette,
        tooltip=False,
        popup=["organisation_name", "entity", "name", "entry_date", "reference"],
        tiles="CartoDB positron",  # use "CartoDB positron" tiles
        highlight=False,
        style_kwds={"fillOpacity": 0.1},
    )

    return base

def get_all_organisations():
    params = urllib.parse.urlencode({
        "sql": f"""
        select organisation, name, entity as organisation_entity, statistical_geography
        from organisation
        """,
        "_size": "max"
        })
    url = f"https://datasette.planning.data.gov.uk/digital-land.csv?{params}"
    df = pd.read_csv(url)
    return df


def get_old_entity(collection_name):
    params = urllib.parse.urlencode({
        "sql": f"""
        select *
        from old_entity
        """,
        "_size": "max"
        })
    url = f"https://datasette.planning.data.gov.uk/{collection_name}.csv?{params}"
    df = pd.read_csv(url)
    return df

### Data import

In [107]:
# get LAD to LPA lookup from github
lookup_lad_lpa = pd.read_csv("https://github.com/digital-land/organisation-collection/raw/main/data/local-authority.csv",
                             usecols = ["entity", "local-authority-district", "local-planning-authority"])

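# rename by name rather than by position: usecols does not control the column order
lookup_lad_lpa = lookup_lad_lpa.rename(columns={
    "entity": "organisation_entity",
    "local-authority-district": "LADCD",
    "local-planning-authority": "LPACD",
})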

nrow(lookup_lad_lpa)
lookup_lad_lpa.head()

No. of records in df: 376


|  | organisation_entity | LADCD | LPACD |
| --- | --- | --- | --- |
| 0 | 26 | E07000223 | E60000281 |
| 1 | 27 | E07000026 | E60000019 |
| 2 | 28 | E07000032 | E60000077 |
| 3 | 29 | E07000224 | E60000282 |
| 4 | 30 | E07000105 | E60000253 |


**Note on LAD to LPA mapping**   
Currently this [lookup file from github](https://github.com/digital-land/organisation-collection/raw/main/data/local-authority.csv) just records a 1:1 link between LADs and LPAs, but according to the ONS this relationship is actually 1:many. 
See [2020 lookup file](https://geoportal.statistics.gov.uk/datasets/ons::local-planning-authority-to-local-authority-district-april-2020-in-the-united-kingdom-lookup-1/about) and the example of Ryedale [`E07000167`], which is mapped to the following two LPAs:

* Ryedale LPA [`E60000061`]
* North York Moors National Park LPA [`E60000322`]

We need to agree some validation rules around this: for example, can we expect Ryedale to submit data that sits within either of these LPA areas, or any London Borough to submit data within the "London Legacy Development Corporation LPA" area?
For simplicity, and to get things up and running (as per Owen's advice), we will test with the existing 1:1 mapping and develop further logic once there is more clarity on how multiple areas should be handled.

The GitHub lookup file also seems to be missing some areas, e.g. "Peak District National Park Authority" (entity 405).
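
One way to represent the 1:many relationship while the validation rules are still being agreed is a mapping from each LAD code to the set of LPA codes linked to it. The sketch below is illustrative only: `LAD_TO_LPAS` and `lpa_allowed_for_lad` are hypothetical names, the data contains just the Ryedale example from the ONS lookup, and the rule it encodes (accept any LPA linked to the LAD) is an assumption rather than an agreed policy.

```python
from collections import defaultdict

# (LAD code, LPA code) pairs; only the Ryedale example from the note above.
lad_lpa_pairs = [
    ("E07000167", "E60000061"),  # Ryedale LPA
    ("E07000167", "E60000322"),  # North York Moors National Park LPA
]

# Hypothetical 1:many lookup: LAD code -> set of LPA codes.
LAD_TO_LPAS = defaultdict(set)
for lad_code, lpa_code in lad_lpa_pairs:
    LAD_TO_LPAS[lad_code].add(lpa_code)


def lpa_allowed_for_lad(lad_code, lpa_code):
    """Return True if the LPA is one of those linked to the LAD (assumed rule)."""
    return lpa_code in LAD_TO_LPAS.get(lad_code, set())


print(lpa_allowed_for_lad("E07000167", "E60000322"))  # True
print(lpa_allowed_for_lad("E07000167", "E60000001"))  # False
```

With the current 1:1 lookup, each set would hold a single code, so the same check could be used before and after the lookup is extended.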

In [108]:
# get org data from datasette
lookup_org = get_all_organisations()

# lookup_org["organisation_entity"] = lookup_org["organisation_entity"].astype(str)
lookup_org.columns = ["organisation", "organisation_name", "organisation_entity", "statistical_geography"]

# split out org type and join on LPA codes from LAD to LPA lookup
lookup_org["organisation_type"] = lookup_org["organisation"].apply(lambda x: x.split(":")[0])
lookup_org = lookup_org.merge(lookup_lad_lpa, how = "left", on = "organisation_entity")

nrow(lookup_org)
lookup_org.head()

No. of records in df: 441


|  | organisation | organisation_name | organisation_entity | statistical_geography | organisation_type | LADCD | LPACD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | development-corporation:Q20648596 | Old Oak and Park Royal Development Corporation | 1 | E51000002 | development-corporation |  |  |
| 1 | development-corporation:Q4916714 | Birmingham Heartlands Development Corporation | 2 |  | development-corporation |  |  |
| 2 | development-corporation:Q6670544 | London Legacy Development Corporation | 3 | E51000001 | development-corporation |  |  |
| 3 | development-corporation:Q6670837 | London Thames Gateway Development Corporation | 4 |  | development-corporation |  |  |
| 4 | development-corporation:Q72456968 | South Tees Development Corporation | 5 | E51000004 | development-corporation |  |  |


In [109]:
# check what types of org are missing the LPA code
nrow(lookup_org[lookup_org["LPACD"].isnull()])
lookup_org[lookup_org["LPACD"].isnull()].groupby("organisation_type").size()

No. of records in df: 108


organisation_type
development-corporation          14
government-organisation          20
local-authority                   4
local-authority-eng              43
national-park-authority          10
nonprofit                         1
passenger-transport-executive     1
public-authority                  1
regional-park-authority           1
transport-authority               8
waste-authority                   5
dtype: int64

In [110]:
# LPA boundary data from planning.data.gov

LPA_boundary_df = pd.read_csv("https://files.planning.data.gov.uk/dataset/local-planning-authority.csv", 
                                  usecols = ["reference", "name", "geometry"])

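# rename by name rather than by position: usecols does not control the column order
LPA_boundary_df = LPA_boundary_df.rename(columns={"reference": "LPACD"})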


# load geometry and create GDF
LPA_boundary_df['geometry'] = LPA_boundary_df['geometry'].apply(shapely.wkt.loads)
LPA_boundary_gdf = gpd.GeoDataFrame(LPA_boundary_df, geometry='geometry')

# Transform to EPSG:27700 for more interpretable area units
LPA_boundary_gdf.set_crs(epsg=4326, inplace=True)
LPA_boundary_gdf.to_crs(epsg=27700, inplace=True)

nrow(LPA_boundary_gdf)
LPA_boundary_gdf.head()


No. of records in df: 337


|  | geometry | name | LPACD |
| --- | --- | --- | --- |
| 0 | MULTIPOLYGON (((428366.003 554230.393, 428288.... | County Durham LPA | E60000001 |
| 1 | MULTIPOLYGON (((436388.046 522354.244, 436372.... | Darlington LPA | E60000002 |
| 2 | MULTIPOLYGON (((449073.036 536806.421, 448888.... | Hartlepool LPA | E60000003 |
| 3 | MULTIPOLYGON (((451894.321 521145.352, 451858.... | Middlesbrough LPA | E60000004 |
| 4 | MULTIPOLYGON (((429247.025 604972.344, 429241.... | Northumberland LPA | E60000005 |


In [111]:
# load conservation area entity dataset from planning.data.gov into geopandas and transform CRS to EPSG:27700

entity_df = pd.read_csv("https://files.planning.data.gov.uk/dataset/conservation-area.csv",
                            usecols = ["entity", "name", "organisation-entity", "reference", "entry-date", "geometry"])
            
# entity_df.head()
entity_df.columns = [x.replace("-", "_") for x in entity_df.columns]



# set entity to string, needed later to sort and remove duplicate self intersections
entity_df["entity"] = entity_df["entity"].astype(str)
# entity_df["organisation_entity"] = entity_df["organisation_entity"].astype(str)

# join organisation name and LPA codes from lookup
entity_df = entity_df.merge(
    lookup_org[["organisation_name", "organisation_type", "organisation_entity", "LPACD"]], 
    how = "left",
    on = "organisation_entity")

# load geometry and create GDF
entity_df['geometry'] = entity_df['geometry'].apply(shapely.wkt.loads)
entity_gdf = gpd.GeoDataFrame(entity_df, geometry='geometry')

# Transform to EPSG:27700 for more interpretable area units
entity_gdf.set_crs(epsg=4326, inplace=True)
entity_gdf.to_crs(epsg=27700, inplace=True)

# calculate area
entity_gdf["area"] = entity_gdf["geometry"].area

nrow(entity_gdf)
entity_gdf.head()

No. of records in df: 8,923


|  | entity | entry_date | geometry | name | organisation_entity | reference | organisation_name | organisation_type | LPACD | area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 44000001 | 2022-04-12 | MULTIPOLYGON (((516981.159 204270.242, 516973.... | Napsbury | 16 | 5080 | Historic England | government-organisation |  | 495087.300218 |
| 1 | 44000002 | 2022-04-12 | MULTIPOLYGON (((512390.333 209659.962, 512382.... | Shafford Mill | 16 | 5071 | Historic England | government-organisation |  | 136187.979619 |
| 2 | 44000003 | 2022-04-12 | MULTIPOLYGON (((511610.510 205098.079, 511611.... | Potters Crouch | 16 | 5074 | Historic England | government-organisation |  | 34603.675292 |
| 3 | 44000004 | 2022-04-12 | MULTIPOLYGON (((512515.275 200300.431, 512520.... | Old Brickett Wood | 16 | 5075 | Historic England | government-organisation |  | 55128.469061 |
| 4 | 44000005 | 2022-04-12 | MULTIPOLYGON (((520248.830 206717.191, 520410.... | Sleapshyde | 16 | 5078 | Historic England | government-organisation |  | 44167.433073 |


In [112]:
# check which organisations we don't have an LPA code for
entity_df[entity_df["LPACD"].isnull()].groupby(["organisation_type", "organisation_name"]).size()

organisation_type        organisation_name                    
development-corporation  London Legacy Development Corporation       2
government-organisation  Historic England                         7077
local-authority-eng      North Dorset District Council              37
                         Purbeck District Council                  126
national-park-authority  Peak District National Park Authority      21
dtype: int64

In [113]:
old_entity_df = get_old_entity("conservation-area")
old_entity_df["entity"] = old_entity_df["entity"].astype('str')
old_entity_df["old_entity"] = old_entity_df["old_entity"].astype('str')

nrow(old_entity_df)
old_entity_df.head()

No. of records in df: 529


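|  | end_date | entity | entry_date | notes | old_entity | start_date | status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 |  | 44009617 |  |  | 44008389 |  | 301 |
| 1 |  | 44009617 |  |  | 44008390 |  | 301 |
| 2 |  | 44009621 |  |  | 44008391 |  | 301 |
| 3 |  | 44009621 |  |  | 44008392 |  | 301 |
| 4 |  | 44009621 |  |  | 44008393 |  | 301 |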


In [114]:
# pd.concat([old_entity_df["entity"], old_entity_df["old_entity"]], ignore_index=True).drop_duplicates()

# Identifying geographical duplicates  
## Report

The aim of this report is to quickly categorise the overlaps according to which of the following groups they fall into:

An entity overlaps with another entity:

1. within the same organisation
    
2. from a different organisation   

    a. LPA entity overlaps with entity from another LPA
        
    b. LPA entity overlaps with entity from Historic England

Within some of these categories (1 and 2b), a distinction is made by the type of overlap that is occurring, with different actions recommended for each type.
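
As a rough illustration of the grouping above, the categories can be expressed as a small function over the two organisation entity numbers. This is a sketch, not the notebook's implementation: `overlap_category` is a hypothetical helper, it assumes Historic England is organisation entity 16 (as in the data loaded earlier), and it treats every other organisation as an LPA. The entity numbers in the usage lines are arbitrary examples.

```python
HISTORIC_ENGLAND = 16  # organisation entity for Historic England in this dataset


def overlap_category(org_1, org_2):
    """Map a pair of organisation entities to the report groups 1, 2a or 2b."""
    if org_1 == org_2:
        return "1"   # within the same organisation
    if HISTORIC_ENGLAND in (org_1, org_2):
        return "2b"  # LPA entity overlaps with a Historic England entity
    return "2a"      # LPA entity overlaps with an entity from another LPA


print(overlap_category(192, 192))  # 1
print(overlap_category(192, 129))  # 2a
print(overlap_category(192, 16))   # 2b
```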



In [115]:
MATCH_LOWER_THRESH = 0.9  # defines the lower limit of the shared overlap between two entities to be called a match
EDGE_UPPER_THRESH = 0.1   # defines the upper limit of the shared overlap between two entities to be called an edge intersection


# full join of all geometries
entity_join_all = gpd.overlay(
    entity_gdf, 
    entity_gdf,
    how = "intersection", keep_geom_type=False 
)


# remove self-intersections and duplicates of the same intersections
# (.copy() avoids a pandas SettingWithCopyWarning when columns are added below)
entity_join_all = entity_join_all[entity_join_all["entity_1"] != entity_join_all["entity_2"]].copy()

# order-independent key for each pair of entities
entity_join_all["entity_join"] = entity_join_all.apply(lambda x: '-'.join(sorted(x[["entity_1", "entity_2"]])), axis=1)

# extra sort so that, for matches involving Historic England (organisation entity 16), the row kept has Historic England as org 2
entity_join_all["name_for_sort"] = np.where(entity_join_all["organisation_entity_1"] == 16, "Z", "A")
entity_join_all.sort_values(["entity_join", "name_for_sort"], ascending=True, inplace=True)

# keep only the first row for each entity pair
entity_join_all.drop_duplicates(subset="entity_join", inplace=True)

# nrow(entity_join_all)

# flag the types of intersections between organisations
# is org the same
entity_join_all["int_org_match"] = np.where(entity_join_all["organisation_entity_1"] == entity_join_all["organisation_entity_2"], True, False)

# the types of org-org matches
entity_join_all["int_org_types"] = np.select(
    [
        (entity_join_all["organisation_entity_1"] == 16) & (entity_join_all["organisation_entity_2"] == 16),
        (entity_join_all["organisation_entity_1"] != 16) & (entity_join_all["organisation_entity_2"] != 16),
        ((entity_join_all["organisation_entity_1"] != 16) & (entity_join_all["organisation_entity_2"] == 16)) |
        ((entity_join_all["organisation_entity_1"] == 16) & (entity_join_all["organisation_entity_2"] != 16))
    ],
    ["HE - HE", "LPA - LPA", "HE - other"],
    default = "-"
)

# does the entity entry date match?
entity_join_all["date_match"] = np.where(entity_join_all["entry_date_1"] == entity_join_all["entry_date_2"], True, False)

# has one of the intersected entities already been re-mapped?
entity_join_all["entity_old"] = np.where(entity_join_all["entity_1"].isin(old_entity_df["old_entity"]) |
                                         entity_join_all["entity_2"].isin(old_entity_df["old_entity"]), True, False)


# calculate overlap %'s

entity_join_all["area_intersection"] = entity_join_all["geometry"].area

entity_join_all["p_pct_intersect"] = entity_join_all["area_intersection"] / entity_join_all["area_1"]
entity_join_all["pct_intersection"] = entity_join_all["area_intersection"] / (entity_join_all["area_1"] + entity_join_all["area_2"] - entity_join_all["area_intersection"])
entity_join_all["s_pct_intersect"] = entity_join_all["area_intersection"] / entity_join_all["area_2"]

# intersection area as % of smallest primary or secondary area
entity_join_all["pct_min_intersection"] = entity_join_all["area_intersection"] / entity_join_all[["area_1", "area_2"]].min(axis = 1)


entity_join_all["issue_type"] = np.select(
    [
        (entity_join_all["p_pct_intersect"] >= MATCH_LOWER_THRESH) & (entity_join_all["s_pct_intersect"] >= MATCH_LOWER_THRESH),
        (entity_join_all["p_pct_intersect"] <= EDGE_UPPER_THRESH) & (entity_join_all["s_pct_intersect"] <= EDGE_UPPER_THRESH),
        ((entity_join_all["p_pct_intersect"] >= MATCH_LOWER_THRESH) | (entity_join_all["s_pct_intersect"] >= MATCH_LOWER_THRESH)),
        
    ],
    [
        "> 90% combined match", "edge intersection", "> 90% single match"
    ],
    default = "-"
)

nrow(entity_join_all)
entity_join_all.head()

No. of records in df: 2,786


|  | entity_1 | entry_date_1 | name_1 | organisation_entity_1 | reference_1 | organisation_name_1 | organisation_type_1 | LPACD_1 | area_1 | entity_2 | ... | int_org_match | int_org_types | date_match | entity_old | area_intersection | p_pct_intersect | pct_intersection | s_pct_intersect | pct_min_intersection | issue_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 44000009 | 2022-04-12 | Childwickbury | 16 | 5063 | Historic England | government-organisation |  | 1885513.0 | 44000007 | ... | True | HE - HE | True | False | 2.170036 | 1e-06 | 4.225103e-07 | 6.675918e-07 | 1e-06 | edge intersection |
| 18 | 44000732 | 2022-04-12 | Bodenham Road | 16 | 5104 | Historic England | government-organisation |  | 96781.47 | 44000016 | ... | True | HE - HE | True | False | 2.162021 | 2.2e-05 | 5.452397e-06 | 7.212809e-06 | 2.2e-05 | edge intersection |
| 22 | 44000770 | 2022-04-12 | Leominster Town | 16 | 2499 | Historic England | government-organisation |  | 255623.2 | 44000017 | ... | True | HE - HE | True | False | 2.033437 | 8e-06 | 4.665495e-06 | 1.128278e-05 | 1.1e-05 | edge intersection |
| 50 | 44000043 | 2022-04-12 | Butterworth Hall | 16 | 7716 | Historic England | government-organisation |  | 29687.92 | 44000042 | ... | True | HE - HE | True | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | edge intersection |
| 60 | 44003132 | 2022-04-12 | Worcester and Birmingham Canal | 16 | 449 | Historic England | government-organisation |  | 253931.3 | 44000050 | ... | True | HE - HE | True | False | 4.129991 | 1.6e-05 | 7.315925e-06 | 1.329709e-05 | 1.6e-05 | edge intersection |
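
To make the overlap measures concrete, the sketch below recomputes them for single pairs of entities using plain arithmetic. The areas are invented for illustration; the formulas and thresholds follow the cell above, and `classify_overlap` is a hypothetical helper name rather than part of the notebook.

```python
MATCH_LOWER_THRESH = 0.9
EDGE_UPPER_THRESH = 0.1


def classify_overlap(area_1, area_2, area_intersection):
    """Return the overlap shares and issue type for one pair of entities."""
    p_pct = area_intersection / area_1  # share of entity 1 covered
    s_pct = area_intersection / area_2  # share of entity 2 covered
    # intersection as a share of the combined (union) area
    pct = area_intersection / (area_1 + area_2 - area_intersection)

    # same order of checks as the np.select call: the first match wins
    if p_pct >= MATCH_LOWER_THRESH and s_pct >= MATCH_LOWER_THRESH:
        issue = "> 90% combined match"
    elif p_pct <= EDGE_UPPER_THRESH and s_pct <= EDGE_UPPER_THRESH:
        issue = "edge intersection"
    elif p_pct >= MATCH_LOWER_THRESH or s_pct >= MATCH_LOWER_THRESH:
        issue = "> 90% single match"
    else:
        issue = "-"
    return p_pct, s_pct, pct, issue


# Invented areas in square metres
print(classify_overlap(1000.0, 1000.0, 950.0)[3])  # > 90% combined match
print(classify_overlap(1000.0, 5000.0, 950.0)[3])  # > 90% single match
print(classify_overlap(1000.0, 5000.0, 50.0)[3])   # edge intersection
print(classify_overlap(1000.0, 1000.0, 500.0)[3])  # -
```

The order of the conditions matters: a pair where both shares exceed 90% also satisfies the "single match" condition, so the combined check has to come first.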


In [116]:
# check the flagging of intersections between different org types is correct
entity_join_all.groupby(["int_org_match", "int_org_types", "organisation_entity_1", "organisation_entity_2"]).size()

int_org_match  int_org_types  organisation_entity_1  organisation_entity_2
False          HE - other     3                      16                        1
                              33                     16                        2
                              43                     16                       14
                              65                     16                       38
                              67                     16                        1
                                                                              ..
True           LPA - LPA      309                    309                       2
                              319                    319                       3
                              329                    329                      28
                              352                    352                       1
                              376                    376                       4
Length: 133, dtype: int64
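
The mixed pairs in this output always list Historic England (organisation entity 16) second. A plain-Python sketch of the deduplication that produces this ordering is shown below. The rows are invented, and `dedupe_pairs` is a hypothetical helper that mirrors the sort and drop-duplicates logic rather than reproducing the pandas code.

```python
HISTORIC_ENGLAND = 16


def dedupe_pairs(rows):
    """Keep one row per unordered entity pair, preferring HE as organisation 2."""
    def sort_key(row):
        pair_key = "-".join(sorted([row["entity_1"], row["entity_2"]]))
        # Rows with Historic England as organisation 1 sort last within a pair
        he_first = row["organisation_entity_1"] == HISTORIC_ENGLAND
        return (pair_key, he_first)

    kept = {}
    for row in sorted(rows, key=sort_key):
        pair_key = sort_key(row)[0]
        kept.setdefault(pair_key, row)  # first row per pair wins
    return list(kept.values())


# Invented example: the same overlap seen from both directions
rows = [
    {"entity_1": "44000001", "entity_2": "44009000",
     "organisation_entity_1": 16, "organisation_entity_2": 192},
    {"entity_1": "44009000", "entity_2": "44000001",
     "organisation_entity_1": 192, "organisation_entity_2": 16},
]
result = dedupe_pairs(rows)
print(len(result), result[0]["organisation_entity_2"])  # 1 16
```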

In [117]:
# count of issues by type breakdown
entity_join_all.groupby(['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type']).size()

# write to csv to add in further descriptions
# entity_join_all.groupby(['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type']).size().to_csv("temp_issue_mapping_table.csv")

entity_old  int_org_match  int_org_types  date_match  issue_type          
False       False          HE - other     False       -                        19
                                                      > 90% combined match    301
                                                      > 90% single match      233
                                                      edge intersection       484
                           LPA - LPA      False       edge intersection        32
                                          True        -                         1
                                                      edge intersection        15
            True           HE - HE        False       -                         2
                                                      > 90% combined match      1
                                                      > 90% single match        4
                                                      edge intersection       129
                       

In [None]:
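# read back in issue flag mapping table with extra fields
# (assumes temp_issue_mapping_table.csv was written by the commented-out line in the
# previous cell and then extended with the issue_description and action fields)
issue_mapping_df = pd.read_csv("temp_issue_mapping_table.csv")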

# check join works
nrow(entity_join_all)
nrow(entity_join_all.merge(
        issue_mapping_df, 
        how = "inner", 
        on = ['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type']))

entity_join_all = entity_join_all.merge(
        issue_mapping_df, 
        how = "inner", 
        on = ['entity_old', 'int_org_match', 'int_org_types', 'date_match', 'issue_type'])

In [None]:
# write full report table to csv

nicecols = [
    'entity_join', 'entity_1', 'entity_2', 'entry_date_1', 'name_1', 'organisation_entity_1',
    'reference_1', 'organisation_name_1', 'organisation_type_1',
    'entry_date_2', 'name_2', 'organisation_entity_2',
    'reference_2', 'organisation_name_2', 'organisation_type_2',
    'p_pct_intersect', 'pct_intersection', 's_pct_intersect', 'pct_min_intersection',
    'int_org_match', 'int_org_types', 'date_match', 'entity_old',
    'issue_type', 'issue_description', 'action']

# entity_join_all[nicecols].to_csv("issues_report_test.csv", index=False)

### Checking issues for funded LPAs

In [None]:
lpa_fund_list = ['Buckinghamshire Council','Doncaster Metropolitan Borough Council','Gloucester City Council','London Borough of Camden','London Borough of Lambeth','London Borough of Southwark','Medway Council','Newcastle City Council','Birmingham City Council','Canterbury City Council','Epsom and Ewell Borough Council','London Borough of Barnet','Gateshead Metropolitan Borough Council','Great Yarmouth Borough Council','Royal Borough of Kingston upon Thames','St Albans City and District Council','Tewkesbury Borough Council','West Berkshire Council','Dorset District Council','Dover District Council','Liverpool City Council','London Borough of Redbridge','London Borough of Waltham Forest','North Lincolnshire Council','North Somerset Council','Salford City Council','Wirral Borough Council']

lpa_fund_issues = entity_join_all[
    entity_join_all["organisation_name_1"].isin(lpa_fund_list) | entity_join_all["organisation_name_2"].isin(lpa_fund_list)
    ]

In [None]:
# export issues list for all funded lpas
# lpa_fund_issues[nicecols].to_csv("temp_issues_funded.csv")

In [None]:
# epsom ones to remove
# entity_gdf[(entity_gdf["organisation_entity"] == 129) & (entity_gdf["reference"].apply(lambda x: len(x)) < 5)].to_csv("temp_epsom_to_remove.csv")

In [140]:
# read in current lambeth endpoint and outer join to existing entities
lambeth_endpoint_gdf = gpd.read_file("https://gis.lambeth.gov.uk/arcgis/rest/services/LambethConservationAreas/MapServer/0/query?where=1%3D1&text=&objectIds=&time=&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&distance=&units=esriSRUnit_Foot&relationParam=&outFields=*&returnGeometry=true&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=&havingClause=&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&historicMoment=&returnDistinctValues=false&resultOffset=&resultRecordCount=&returnExtentOnly=false&datumTransformation=&parameterValues=&rangeValues=&quantizationParameters=&featureEncoding=esriDefault&f=geojson")

lambeth_endpoint_gdf = lambeth_endpoint_gdf[["CA_REF_NO"]]
lambeth_endpoint_gdf["record_in_endpoint"] = True

# nrow(lambeth_endpoint_gdf)
# lambeth_endpoint_gdf.head()

entity_df[entity_df["organisation_entity"] == 192][["entity", "reference", "name"]].merge(
    lambeth_endpoint_gdf,
    how =  "outer",
    left_on  = "reference", right_on = "CA_REF_NO"
).to_csv("../data/geo_analysis/funded_lpa_checks/temp_lambeth_entity_endpoint_check.csv")



In [141]:
epsom_endpoint_gdf = pd.read_csv("../data/geo_analysis/funded_lpa_checks/Epsom_conservation-area_WFS.csv")
epsom_endpoint_gdf = epsom_endpoint_gdf[["name", "reference"]]
epsom_endpoint_gdf["record_in_endpoint"] = True

nrow(epsom_endpoint_gdf)
epsom_endpoint_gdf.head()

epsom_pdp_gdf = gpd.read_file("https://www.planning.data.gov.uk/entity.geojson?organisation_entity=129&dataset=conservation-area&limit=100")
epsom_pdp_gdf["record_in_pdp"] = True

nrow(epsom_pdp_gdf)
# epsom_pdp_gdf.head()

epsom_pdp_gdf[["entity", "reference", "name", "record_in_pdp"]].merge(
    epsom_endpoint_gdf,
    how =  "outer",
    on  = "reference"
).to_csv("../data/geo_analysis/funded_lpa_checks/temp_epsom_entity_endpoint_check.csv")

No. of records in df: 21




No. of records in df: 41
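
The outer joins above are written to CSV for manual checking. As a quick complement, the same comparison can be summarised with sets of references. The sketch below uses invented reference values, and `compare_references` is a hypothetical helper; it assumes references are comparable strings in both sources.

```python
def compare_references(pdp_refs, endpoint_refs):
    """Split references into those in both sources or in only one of them."""
    pdp, endpoint = set(pdp_refs), set(endpoint_refs)
    return {
        "both": sorted(pdp & endpoint),
        "pdp_only": sorted(pdp - endpoint),
        "endpoint_only": sorted(endpoint - pdp),
    }


# Invented references for illustration
summary = compare_references(["CA1", "CA2", "CA3"], ["CA2", "CA3", "CA4"])
print(summary)
# {'both': ['CA2', 'CA3'], 'pdp_only': ['CA1'], 'endpoint_only': ['CA4']}
```

References that appear only on the platform side might point to entities no longer published by the organisation, while references only in the endpoint might be records not yet collected; either case would still need checking by hand.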
