# Listed building grade population checks

**Author**: Greg Slater  
**Date**: 2024-02-12  
**Data scope**: Listed buildings  
**Report type**: Ad-hoc 

## Purpose
This is a short piece of analysis to inform decisions about how important the `listed-building-grade` field is in the [`listed-building-outline`](https://www.planning.data.gov.uk/dataset/listed-building-outline) dataset (supplied by individual LPAs). The field is not always provided, but where an outline carries a `listed-building` reference we may be able to take the grade from the [`listed-building`](https://www.planning.data.gov.uk/dataset/listed-building) dataset (supplied by Historic England), in which case the field need not be made mandatory.

**Outputs**:  
* how frequently the `listed-building` (reference) field is populated in the `listed-building-outline` dataset
* whether it can be used to reliably link to the `listed-building` dataset and take the grade from there

## Links
[Trello ticket](https://trello.com/c/8kWs5zOi/1042-check-the-link-between-listed-building-and-listed-building-outlines)

In [1]:
# pip install -e git+https://github.com/digital-land/pipeline.git#egg=digital-land

In [1]:
import os
import pandas as pd
import urllib.parse
# from functions import run_endpoint_workflow
# from sqlite_query_functions import DatasetSqlite
# from convert_functions import convert_resource
# from digital_land.collection import Collection
# from data_file import get_duplicates_between_orgs
# from download_data import download_dataset

import numpy as np

# import shapely.wkt
# import geopandas as gpd

In [2]:
def nrow(df):
    """Print the number of records in a dataframe."""
    print(f"No. of records in df: {len(df):,}")

def get_all_organisations():
    """Fetch the organisation lookup table from datasette."""
    params = urllib.parse.urlencode({
        "sql": """
        select organisation, name as organisation_name, entity as organisation_entity, statistical_geography
        from organisation
        """,
        "_size": "max"
        })
    url = f"https://datasette.planning.data.gov.uk/digital-land.csv?{params}"
    df = pd.read_csv(url)
    return df

### Data import

In [3]:
# get org data from datasette
lookup_org = get_all_organisations()

# # lookup_org["organisation_entity"] = lookup_org["organisation_entity"].astype(str)
lookup_org.columns = ["organisation", "organisation-name", "organisation-entity", "statistical-geography"]

# # split out org type and join on LPA codes from LAD to LPA lookup
# lookup_org["organisation_type"] = lookup_org["organisation"].apply(lambda x: x.split(":")[0])
# lookup_org = lookup_org.merge(lookup_lad_lpa, how = "left", on = "organisation_entity")

nrow(lookup_org)
lookup_org.head()

No. of records in df: 437


Unnamed: 0,organisation,organisation-name,organisation-entity,statistical-geography
0,passenger-transport-executive:Q25171369,West Midlands Passenger Transport Executive,408,
1,passenger-transport-executive:Q6820591,Merseytravel,409,
2,passenger-transport-executive:Q682520,Transport for London,410,
3,passenger-transport-executive:Q7569004,South Yorkshire Passenger Transport Executive,411,
4,passenger-transport-executive:Q7834921,Transport for Greater Manchester,412,


In [4]:
lb_df = pd.read_csv("https://files.planning.data.gov.uk/dataset/listed-building.csv")
lb_df["reference"] = lb_df["reference"].astype(str)

nrow(lb_df)
lb_df.head()

No. of records in df: 379,176


Unnamed: 0,dataset,end-date,entity,entry-date,geojson,geometry,name,organisation-entity,point,prefix,reference,start-date,typology,documentation-url,listed-building,listed-building-grade,notes,organisation,wikidata,wikipedia
0,listed-building,,31479292,2023-05-25,,,20 and 20A Whitbourne Springs,16,POINT (-2.239114 51.198840),listed-building,1021466,1987-11-05,geography,https://historicengland.org.uk/listing/the-lis...,,II,,,,
1,listed-building,,31479293,2023-05-25,,,TENNIS CORNER FARMHOUSE WITH GRANARY AND STABLE,16,POINT (-2.247296 51.256559),listed-building,1021467,1987-11-05,geography,https://historicengland.org.uk/listing/the-lis...,,II,,,,
2,listed-building,,31479294,2023-05-25,,,CHALCOT HOUSE,16,POINT (-2.226356 51.238375),listed-building,1021468,1968-09-11,geography,https://historicengland.org.uk/listing/the-lis...,,II*,,,,
3,listed-building,,31479295,2023-05-25,,,FIVE LORDS FARMHOUSE,16,POINT (-2.248224 51.250587),listed-building,1021469,1987-11-05,geography,https://historicengland.org.uk/listing/the-lis...,,II,,,,
4,listed-building,,31479296,2023-05-25,,,PENLEIGH MILL,16,POINT (-2.205967 51.253187),listed-building,1021470,1987-11-05,geography,https://historicengland.org.uk/listing/the-lis...,,II,,,,


In [5]:
lbo_df = pd.read_csv("https://files.planning.data.gov.uk/dataset/listed-building-outline.csv")

nrow(lbo_df)
lbo_df.head()

No. of records in df: 22,262


Unnamed: 0,dataset,end-date,entity,entry-date,geojson,geometry,name,organisation-entity,point,prefix,...,address-text,description,document-url,documentation-url,listed-building,listed-building-grade,notes,organisation,wikidata,wikipedia
0,listed-building-outline,,42101001,2021-12-08,,"MULTIPOLYGON (((-0.104751 51.488986,-0.104432 ...","Church of St Mary, Newington",329,POINT(-0.104512 51.489008),listed-building-outline,...,,,,https://geo.southwark.gov.uk/connect/analyst/I...,,II,,,,
1,listed-building-outline,,42101080,2021-11-23,,"MULTIPOLYGON (((-0.113036 51.477708,-0.113054 ...","Christ Church Hall, Mowll Street",192,POINT(-0.113270 51.477574),listed-building-outline,...,,,,,,II,,,,
2,listed-building-outline,,42101128,2021-11-23,,"MULTIPOLYGON (((-0.109618 51.505660,-0.109546 ...",1-18 (cons) Aquinas Street,192,POINT(-0.108960 51.505824),listed-building-outline,...,,,,,,II,,,,
3,listed-building-outline,,42101170,2021-11-23,,"MULTIPOLYGON (((-0.117867 51.461044,-0.117765 ...",24 & 26 Acre Lane,192,POINT(-0.117708 51.460939),listed-building-outline,...,,,,,,II,,,,
4,listed-building-outline,,42101171,2021-11-23,,"MULTIPOLYGON (((-0.118338 51.460796,-0.118311 ...",1 to 12 Acre Lane (consec),192,POINT(-0.118094 51.460831),listed-building-outline,...,,,,,,II,,,,


In [6]:
lbo_df = lbo_df.merge(
    lookup_org[["organisation-entity", "organisation-name"]],
    how = "inner",
    on = "organisation-entity")

nrow(lbo_df)

No. of records in df: 22,262
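
The row count is unchanged by the inner join (22,262 before and after), which suggests every outline record found its organisation in the lookup. A more explicit way to check this is sketched below on toy data; the frame names and values are made up for illustration and are not taken from the datasets above. It uses a left join with pandas' `indicator` and `validate` options so that unmatched records are surfaced rather than silently dropped.

```python
import pandas as pd

# Toy stand-ins for lbo_df and lookup_org (values are illustrative only)
outlines = pd.DataFrame({
    "entity": [1, 2, 3],
    "organisation-entity": [192, 192, 329],
})
orgs = pd.DataFrame({
    "organisation-entity": [192, 329],
    "organisation-name": ["Org A", "Org B"],
})

# A left join keeps every outline record; validate raises a MergeError
# if the lookup side contains duplicate organisation-entity values
checked = outlines.merge(
    orgs,
    how="left",
    on="organisation-entity",
    validate="many_to_one",
    indicator=True,
)

# Records whose organisation was not found in the lookup
unmatched = checked[checked["_merge"] == "left_only"]
print(f"Unmatched records: {len(unmatched):,}")
```

If `unmatched` is empty, the inner join used above is safe for the counts that follow.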


## Analysis

In [7]:
# check listed-building-grade field is fully populated in listed-building dataset
lb_df[["listed-building-grade"]].notnull().groupby("listed-building-grade").size()

listed-building-grade
True    379176
dtype: int64

In [8]:
# add flags for whether grade and reference are populated, and whether reference joins correctly
lbo_df["status_grade"] = np.where(lbo_df["listed-building-grade"].isna(), "missing", "populated")
lbo_df["status_ref"] = np.where(lbo_df["listed-building"].isna(), "missing", "populated")
lbo_df["status_ref_join"] = np.where(lbo_df["listed-building"].isin(lb_df["reference"]), "match", "no match")

In [68]:
# check how populated listed-building-grade field is in listed-building-outline dataset
lbo_count_grade = lbo_df.groupby("status_grade").size().reset_index(name = "n_records")
lbo_count_grade["n_record_pct"] = lbo_count_grade["n_records"] / sum(lbo_count_grade["n_records"])

lbo_count_grade

Unnamed: 0,status_grade,n_records,n_record_pct
0,missing,3151,0.141542
1,populated,19111,0.858458


In [69]:
# check how populated listed-building (reference) field is in listed-building-outline dataset
lbo_count_ref = lbo_df.groupby("status_ref").size().reset_index(name = "n_records")
lbo_count_ref["n_record_pct"] = lbo_count_ref["n_records"] / sum(lbo_count_ref["n_records"])

lbo_count_ref

Unnamed: 0,status_ref,n_records,n_record_pct
0,missing,10517,0.472419
1,populated,11745,0.527581


In [11]:
# table for summary
lbo_cross_ref = lbo_df.groupby(["status_grade", "status_ref", "status_ref_join"]).size().reset_index(name = "n_records")
lbo_cross_ref["n_records_pct"] = lbo_cross_ref["n_records"] / sum(lbo_cross_ref["n_records"])

lbo_cross_ref

Unnamed: 0,status_grade,status_ref,status_ref_join,n_records,n_records_pct
0,missing,missing,no match,1468,0.065942
1,missing,populated,match,1677,0.07533
2,missing,populated,no match,6,0.00027
3,populated,missing,no match,9049,0.406477
4,populated,populated,match,9157,0.411329
5,populated,populated,no match,905,0.040652


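14.2% of records in `listed-building-outline` (3,151) are missing the grade. Of these, 1,677 (7.5% of all records) have a reference that matches the `listed-building` dataset, so the grade can be picked up from there. A further 1,468 (6.6% of all records) have no reference to join on, and 6 have a reference that doesn't match, so their grade can't be populated this way.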
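
As a rough illustration of how the grade could be picked up from `listed-building`, here is a minimal sketch on toy data. The frames, reference values and the `grade_from_lb` column name are hypothetical, and it assumes each reference appears only once in `listed-building`; this is not a tested pipeline step.

```python
import pandas as pd

# Toy stand-ins for lbo_df and lb_df (values are illustrative only)
outlines = pd.DataFrame({
    "entity": [1, 2, 3, 4],
    "listed-building": ["1000001", "1000002", None, "9999999"],
    "listed-building-grade": ["II", None, None, None],
})
listed = pd.DataFrame({
    "reference": ["1000001", "1000002"],
    "listed-building-grade": ["II", "II*"],
})

# Bring the listed-building grade alongside each outline via its reference
lookup = listed.rename(columns={
    "reference": "listed-building",
    "listed-building-grade": "grade_from_lb",  # hypothetical helper column
})
filled = outlines.merge(lookup, how="left", on="listed-building", validate="many_to_one")

# Keep the supplied grade where present; otherwise fall back to the joined grade
filled["listed-building-grade"] = filled["listed-building-grade"].fillna(filled["grade_from_lb"])
filled = filled.drop(columns="grade_from_lb")
print(filled)
```

In the toy data the second record gains its grade from the join, while the records with a missing or non-matching reference stay empty, mirroring the three "missing" groups in the summary table.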

In [14]:
# check what match rate is like for records which do have a `listed-building` value
lbo_join_check = lbo_df[lbo_df["status_ref"] == "populated"].groupby(["status_ref", "status_ref_join"]).size().reset_index(name = "n_records")

lbo_join_check["n_records_pct"] = lbo_join_check["n_records"] / sum(lbo_join_check["n_records"])
lbo_join_check

Unnamed: 0,status_ref,status_ref_join,n_records,n_records_pct
0,populated,match,10834,0.922435
1,populated,no match,911,0.077565
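
The cause of the 911 non-matching references is not investigated here. One possibility, which is an assumption and not something the analysis above confirms, is formatting differences such as padding or a trailing `.0` from float parsing. The sketch below shows how references could be normalised before re-running the match; `normalise_reference` is a hypothetical helper and the values are toy data.

```python
import pandas as pd

def normalise_reference(values: pd.Series) -> pd.Series:
    """Render references as trimmed strings, dropping a trailing '.0' left by float parsing."""
    text = values.map(lambda v: v if pd.isna(v) else str(v).strip())
    return text.str.replace(r"\.0$", "", regex=True)

# Toy values (illustrative only): a float-parsed reference, a padded one, and a missing one
outline_refs = pd.Series([1021466.0, " 1021467 ", None], dtype="object")
lb_refs = pd.Series(["1021466", "1021467"])

# Compare like with like: both sides go through the same normalisation
matches = normalise_reference(outline_refs).isin(normalise_reference(lb_refs))
print(matches.tolist())
```

If normalising does not change the match rate, the remaining non-matches are more likely to be genuinely wrong or out-of-date references.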


In [15]:
# count and % of records in listed-building-outline dataset with `listed-building-grade` field missing - by organisation
lbo_grouped_na = lbo_df.groupby(["organisation-entity", "organisation-name", "status_grade"]).size().reset_index(name = "n_records")

lbo_grouped_na["org_n_record_pct"] = lbo_grouped_na["n_records"] / lbo_grouped_na.groupby("organisation-name")["n_records"].transform("sum")

lbo_grouped_na

Unnamed: 0,organisation-entity,organisation-name,status_grade,n_records,org_n_record_pct
0,41,London Borough of Barking and Dagenham,populated,48,1.0
1,48,London Borough of Barnet,populated,649,1.0
2,67,Buckinghamshire Council,missing,1035,0.165627
3,67,Buckinghamshire Council,populated,5214,0.834373
4,75,Canterbury City Council,populated,1778,1.0
5,90,London Borough of Camden,populated,1961,1.0
6,109,Doncaster Metropolitan Borough Council,populated,798,1.0
7,111,Dover District Council,populated,1699,1.0
8,129,Epsom and Ewell Borough Council,populated,272,1.0
9,142,Gateshead Metropolitan Borough Council,populated,248,1.0
