# Dataset staleness and related issues

This notebook contains queries that highlight data quality issues, such as staleness, in a downloaded copy of the digital-land database.

## Using

Before running the notebook:

1. Download the `digital-land` SQLite database from https://datasette.planning.data.gov.uk/digital-land
1. Point the `source_file` variable (below) at your download.

The first run takes a few minutes because it builds indexes that speed up subsequent queries.

### Variables

These determine what data is processed.

* `source_file` - the file you downloaded above
* `collection` - the collection to process
* `staleness_days` - data that has been collected unchanged for more than this number of days is considered stale
* `recent_entry_cutoff` - an endpoint counts as recently added if its entry date is on or after this date
* `organisation` - restricts the single-LPA query to the endpoints of one organisation

In [90]:
import sqlite3
import urllib.parse
from datetime import date

import matplotlib
import numpy as np
import pandas as pd

In [91]:
datasette_url = "https://datasette.planning.data.gov.uk/"

source_file = "/mnt/c/Users/MarkSmith/Downloads/digital-land-2023-10-05.sqlite3"  # or whatever you called your download
dest_file = "/mnt/c/Users/MarkSmith/Downloads/entity_2023_10_14_crosstab.csv"  # or wherever you want your output

collection = "brownfield-land"
staleness_days = 365 * 3  # roughly three years


cnx = sqlite3.connect(source_file)

cursor = cnx.cursor()

def add_index(table, column):
    # Create the index only if it is missing, so re-running the cell is cheap.
    cnx.execute(f"CREATE INDEX IF NOT EXISTS idx_{table}_{column} ON {table}({column})")

add_index ("log", "endpoint")
add_index ("log", "resource")
add_index ("log", "status")

add_index ("source", "collection")
add_index ("source", "endpoint")
add_index ("source", "start_date")
add_index ("source", "end_date")

add_index ("endpoint", "endpoint")
add_index ("endpoint", "entry_date")
add_index ("endpoint", "start_date")
add_index ("endpoint", "end_date")

def query_datasette (query_text):
    return pd.read_sql_query(query_text, cnx)
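
The setup cell creates each index only if it does not already exist, so re-running it is harmless. To confirm that the indexes are really in place, one option is to read them back from SQLite's `sqlite_master` catalogue table. The sketch below is illustrative only: the helper name `list_indexes` and the in-memory demo table are assumptions, not part of the notebook.

```python
import sqlite3


def list_indexes(connection, table):
    """Return the names of the indexes defined on `table`, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? ORDER BY name",
        (table,),
    ).fetchall()
    return [row[0] for row in rows]


# Demo on a throwaway in-memory database (hypothetical table layout).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE log (endpoint TEXT, resource TEXT, status TEXT)")
demo.execute("CREATE INDEX IF NOT EXISTS idx_log_endpoint ON log(endpoint)")
print(list_indexes(demo, "log"))
```

Against the real download, the same helper could be called with the notebook's `cnx` connection and any of the indexed table names.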


# Live endpoints still using HTTP instead of HTTPS

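These endpoints are likely to be old. Most publishers moved from HTTP to HTTPS years ago, so an endpoint URL that still uses plain HTTP was probably registered a long time ago and may no longer be maintained.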

In [92]:
http_sql = F"""
  select src.organisation, src.endpoint, ep.entry_date, ep.endpoint_url, ep.start_date, ep.end_date  from source src
  inner join endpoint ep on src.endpoint = ep.endpoint
  where src.collection  = "{collection}"
  and ep.end_date = ""
  and ep.endpoint_url like "http:%"
  order by src.organisation, ep.start_date ASC NULLS LAST
"""

query_datasette(http_sql)


Unnamed: 0,organisation,endpoint,entry_date,endpoint_url,start_date,end_date
0,development-corporation:Q6670544,4c238528f325bcca9a03697583d9f39a91ecf1ec1ecb66...,2018-07-05T00:00:00Z,http://www.queenelizabetholympicpark.co.uk/-/m...,,
1,local-authority-eng:AMB,1945e57b32134a160bfb4972bcd6f86935b4c6fc0bd000...,2019-11-23T00:00:00Z,http://info.ambervalley.gov.uk/shareddatasets/...,,
2,local-authority-eng:AMB,239f7a38efe51fb6271d40bd76dd7413fecb458a5c4d1e...,2019-11-23T00:00:00Z,http://info.ambervalley.gov.uk/shareddatasets/...,,
3,local-authority-eng:AMB,c0fcdc64525d3be69736427cd397bf5a3c2d68104e8a90...,2018-05-22T00:00:00Z,http://info.ambervalley.gov.uk/shareddatasets/...,2017-12-22,
4,local-authority-eng:AMB,533d03f6fd2b5717597c4f3057106b907f4b23c8fb2e7b...,2020-05-21T00:00:00Z,http://info.ambervalley.gov.uk/shareddatasets/...,2020-05-21,
...,...,...,...,...,...,...
71,local-authority-eng:WOI,30b94b097f288a5f2df6ad2cae6fa403a5599e39bece53...,2018-05-22T00:00:00Z,http://woking2027.info/res/uploads/WokingBorou...,,
72,local-authority-eng:WYE,36c1c91ee6ec46da952246ce66916adf3200f7b14d42f5...,2018-05-22T00:00:00Z,http://www.wyreforestdc.gov.uk/media/3585747/W...,,
73,local-authority-eng:WYO,466796e9f6e18591dead03d9e7836e02d6494ec1a5ac9e...,2018-05-22T00:00:00Z,http://data.wycombe.gov.uk/download/planning/b...,2019-06-21,
74,local-authority-eng:WYO,7b6a413b1c1018215fe400dae82e6366b80fb8f89c0240...,2019-11-24T00:00:00Z,http://data.wycombe.gov.uk/download/planning/b...,2019-08-21,
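
A possible follow-up to the list above is to derive the HTTPS form of each URL so that it can be checked by hand. The sketch below only rewrites the scheme; it does not verify that the publisher actually serves the file over HTTPS. The helper name `https_candidate` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def https_candidate(url):
    """Return `url` with its scheme switched from http to https.

    URLs that do not use plain http are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    return parts._replace(scheme="https").geturl()


print(https_candidate("http://example.org/brownfield.csv?year=2019"))
```

Applied to the `endpoint_url` column of the query result, this would give a list of candidate replacement URLs to test.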


# Live endpoints with stale data

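These are live endpoints that have returned the same resource for more than `staleness_days` days. The query counts the successful (status 200) log entries for each resource and, on the assumption that each endpoint is collected about once a day, treats a count above `staleness_days` as data that has gone unchanged for at least that long.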

In [102]:
sql = F"""
    select src.organisation, src.start_date, src.end_date, log.endpoint, log.resource, count(log.resource) as c
    from log
    inner join source src on src.endpoint = log.endpoint
    where log.status = 200
    and src.end_date = ""
    and src.collection = "{collection}"
    group by 1, 2, 3, 4, 5
    having c > {staleness_days}
    order by c desc
    limit 100
"""

query_datasette(sql)

Unnamed: 0,organisation,start_date,end_date,endpoint,resource,c
0,local-authority-eng:ALL,2017-12-20,,9c2e8adfd12b4f474e7d511580029d1e69d1e08a17e0cb...,3b694528538878d378fd6892a649fa634f0013ef299331...,1389
1,local-authority-eng:BDF,,,f8979dadf073ed0fa1373b6c018fa376833c5aa9e740a2...,12e709c847614f924fc6bd9dcb9c162816e59623887765...,1389
2,local-authority-eng:CAN,2017-12-01,,660ed3b8f8bffe8a326dfb597ddce7a3b4c9b28a04fe78...,4c524bd78ea05989aaaabdcad0cae87d48f4ad9eaf18a4...,1389
3,local-authority-eng:CAS,2017-12-01,,b7d3310848346fdcc872fb40dcb80697f05dfa1f593565...,cd0c958ec2843364957c2fd9757151fdfb6659bb8e0d51...,1389
4,local-authority-eng:CHR,,,26ed7ad003da61c66f9656ba61750b14a53b3ac74a1865...,1a349651d94dfdbdc44773a256313c391c7418f0bcad53...,1389
...,...,...,...,...,...,...
95,local-authority-eng:WRL,,,560d9611860046e86e142c95bf632c39b28f67e03bbd9e...,6a9a2fadde5d38023cc88d97ce04c3e52781d6f5332c79...,1370
96,local-authority-eng:YOR,2017-12-22,,2fc9a0b88861aa02584f2b90292bff7e1ccba9d420ef08...,7bbf483dd896de667a8476ab102c8aa657f01c9cb8dea5...,1370
97,local-authority-eng:YOR,2017-12-22,,ed8725acf5769b2dd2467dcab1783eebc46afcdd7718d6...,7e47edd7c0f31c2a1c6b164b638058bc95ce053743da6f...,1370
98,local-authority-eng:CAN,,,e0cab939ecfa32577b6180ef89187a57ec62309511afa3...,a42627cfea482a8ed81312034f8a6af8e4147bea6e1fac...,1369
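
Counting log rows only approximates elapsed time if every endpoint is fetched successfully once a day. An alternative, sketched below, measures the number of days between the first and last time each resource was seen. It assumes the `log` table carries a date column named `entry_date`, which the queries in this notebook do not confirm; the function name `unchanged_resources` and the demo rows are likewise illustrative.

```python
import sqlite3


def unchanged_resources(connection, min_days):
    """Return (endpoint, resource, days) for resources whose first and last
    successful sightings are at least `min_days` apart."""
    query = """
        select endpoint, resource,
               cast(julianday(max(entry_date)) - julianday(min(entry_date)) as integer) as days
        from log
        where status = '200'
        group by endpoint, resource
        having days >= ?
        order by days desc
    """
    return connection.execute(query, (min_days,)).fetchall()


# Demo data in a throwaway in-memory database (assumed column layout).
demo = sqlite3.connect(":memory:")
demo.execute("create table log (endpoint text, resource text, status text, entry_date text)")
demo.executemany(
    "insert into log values (?, ?, ?, ?)",
    [
        ("ep1", "resA", "200", "2020-01-01"),
        ("ep1", "resA", "200", "2020-01-11"),
        ("ep2", "resB", "200", "2020-01-01"),
        ("ep2", "resB", "200", "2020-01-03"),
        ("ep2", "resB", "404", "2020-02-01"),  # failed fetch, ignored
    ],
)
print(unchanged_resources(demo, 5))
```

This measure is not affected by missed or repeated collection runs, but it still needs the same filters on `source` (collection and open end date) that the query above applies.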


# Endpoints with no documentation URL

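Only sources from local planning authorities (LPAs) are included, that is, organisations whose identifier starts with `local-authority-eng`.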

In [101]:
sql = F"""
select organisation, entry_date,  source, start_date from source 
where collection = "{collection}" and documentation_url = ""  and end_date = "" and organisation like "local-authority-eng%"  
order by 2 
"""

query_datasette(sql)


Unnamed: 0,organisation,entry_date,source,start_date
0,local-authority-eng:HAL,2018-05-22T00:00:00Z,59748eb2bbe6634a99620f87acaf937e,2017-12-31
1,local-authority-eng:BAB,2018-05-22T00:00:00Z,f92ceb2080500b1b68f7d4b33ceadedc,2017-12-21
2,local-authority-eng:THE,2018-05-22T00:00:00Z,f7721c61cb8732bf0fcad7083400d972,
3,local-authority-eng:HIG,2018-05-22T00:00:00Z,efb9c28a1591def1700ee912bb07bfe1,
4,local-authority-eng:EXE,2018-05-22T00:00:00Z,0787bb89a3ec6994493219b2c8551daa,
5,local-authority-eng:BRT,2018-05-22T00:00:00Z,07c446af20f78efab3ba5ab35e1a77ef,2017-12-13
6,local-authority-eng:DUR,2018-05-22T00:00:00Z,09de2780b40099bf37cd0828834dbb93,2017-12-01
7,local-authority-eng:HAE,2018-05-22T00:00:00Z,09e319746177cd551d1ac5542bb0d622,
8,local-authority-eng:EPS,2018-05-22T00:00:00Z,e6d3bda67c7c5f30dd70846bb1f38267,
9,local-authority-eng:FOR,2018-05-22T00:00:00Z,e284fb7807bef3d01ff56ec43512d3cc,2017-12-13


# Recently added endpoints with no start date

A recently added endpoint should have a start date, derived either from the LPA's documentation page or from the data itself.

In [100]:
recent_entry_cutoff = "2023-06-01"

sql = F"""
    select src.organisation, src.documentation_url, src.start_date, src.end_date, ep.endpoint, ep.endpoint_url, ep.entry_date
    from source src
    inner join endpoint ep on ep.endpoint = src.endpoint
    where src.collection = "{collection}"
    and ep.entry_date >= "{recent_entry_cutoff}"
    and ep.start_date = ""
    and src.end_date = ""
"""

query_datasette(sql)


Unnamed: 0,organisation,documentation_url,start_date,end_date,endpoint,endpoint_url,entry_date
0,local-authority-eng:WOK,https://www.wokingham.gov.uk/council-and-meeti...,,,b23d3b7b05aac7b35722507f35cf8c1292d67e7897b878...,https://www.woking2027.info/ldfregisters/brown...,2023-08-11T12:12:19Z
1,local-authority-eng:BAB,https://www.midsuffolk.gov.uk/planning/plannin...,,,3198d071873720b401c9a37ce0f9ae2c450736af0dd316...,https://www.midsuffolk.gov.uk/assets/Planning-...,2023-08-16T14:14:48Z
2,local-authority-eng:NGM,https://www.opendatanottingham.org.uk/dataset....,2021-11-05,,c883436d69d3f852a8225da903fda36aa29d86f9ca5c4f...,https://geoserver.nottinghamcity.gov.uk/openda...,2023-09-13T00:00:00Z
3,local-authority-eng:BAN,,,,8440f35a3e42919f5e9c851e98d5445a3c418c8ac314f6...,https://www.basingstoke.gov.uk/content/page/72...,2023-06-28T13:13:00Z
4,local-authority-eng:BAR,https://www.barrowbc.gov.uk/residents/planning...,,,c78c84a3bde8a5ea9bad3faa9e9a1146679975a454eb84...,https://www.barrowbc.gov.uk/_resources/assets/...,2023-07-04T07:07:29Z
...,...,...,...,...,...,...,...
173,local-authority-eng:HAM,https://www.easthants.gov.uk/planning-services...,,,57ab427342eab58168dce8bf1c03a2351b8ad46d17bd3f...,https://www.easthants.gov.uk/media/6377/downlo...,2023-09-13T00:00:00Z
174,local-authority-eng:NTT,https://www.opendatanottingham.org.uk/dataset....,,,c883436d69d3f852a8225da903fda36aa29d86f9ca5c4f...,https://geoserver.nottinghamcity.gov.uk/openda...,2023-09-13T00:00:00Z
175,local-authority-eng:GLA,https://data.london.gov.uk/dataset/brownfield_...,2020-07-12,,890c3ac73da82610fe1b7d444c8c89c92a7f368316e3c0...,https://data.london.gov.uk/download/brownfield...,2023-09-13T00:00:00Z
176,local-authority-eng:HRW,https://www.harrow.gov.uk/planning-development...,,,890c3ac73da82610fe1b7d444c8c89c92a7f368316e3c0...,https://data.london.gov.uk/download/brownfield...,2023-09-13T00:00:00Z
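
The cutoff in the cell above is a fixed date, so it drifts out of date as the notebook ages. A minimal sketch of deriving it relative to the run date follows. The helper name `cutoff_days_ago` and the 90-day window are assumptions for illustration. The comparison in the query works on plain strings because ISO 8601 dates and timestamps sort in chronological order.

```python
from datetime import date, timedelta


def cutoff_days_ago(days, today=None):
    """Return the ISO date `days` before `today` (defaults to the current date)."""
    today = today or date.today()
    return (today - timedelta(days=days)).isoformat()


# Hypothetical 90-day window, pinned to a fixed date so the result is repeatable.
example_cutoff = cutoff_days_ago(90, today=date(2023, 10, 14))
print(example_cutoff)

# A timestamp such as an endpoint's entry_date compares correctly as a string.
print("2023-08-11T12:12:19Z" >= example_cutoff)
```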


# Inspect a single LPA

You can use the query below to get a quick overview of the sources and endpoints we hold for a single LPA in the collection.

In [99]:
organisation = "local-authority-eng:ALL"


sql = F"""
    select src.organisation, src.documentation_url, src.start_date, src.end_date, ep.endpoint_url, ep.entry_date, ep.start_date, ep.end_date
    from source src
    inner join endpoint ep on ep.endpoint = src.endpoint
    where src.organisation = "{organisation}"
    and src.collection = "{collection}"
"""

query_datasette(sql)

Unnamed: 0,organisation,documentation_url,start_date,end_date,endpoint_url,entry_date,start_date.1,end_date.1
0,local-authority-eng:ALL,https://www.allerdale.gov.uk/en/planning-build...,2017-12-20,,https://df4iy9syor5px.cloudfront.net/media/fil...,2018-05-22T00:00:00Z,2017-12-20,
1,local-authority-eng:ALL,https://www.allerdale.gov.uk/en/planning-build...,2017-12-20,2019-11-25,https://www-cloudfront.allerdale.gov.uk/media/...,2018-07-30T00:00:00Z,2017-12-20,2019-11-25
2,local-authority-eng:ALL,https://www.allerdale.gov.uk/en/planning-build...,2023-07-06,,https://www-cloudfront.allerdale.gov.uk/media/...,2023-07-06T11:11:52Z,2023-07-06,
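
The queries in this notebook build their SQL by interpolating variables into the query text. That is fine for trusted values, but binding the values as parameters avoids quoting problems if an organisation or collection name ever contains a quote character. The sketch below shows the idea with `sqlite3` placeholders. The function name `entries_for`, the reduced column list and the demo rows are illustrative assumptions.

```python
import sqlite3


def entries_for(connection, organisation, collection):
    """Return (organisation, endpoint_url, entry_date) rows for one organisation."""
    query = """
        select src.organisation, ep.endpoint_url, ep.entry_date
        from source src
        inner join endpoint ep on ep.endpoint = src.endpoint
        where src.organisation = ?
        and src.collection = ?
        order by ep.entry_date
    """
    # Values are bound by the driver, not pasted into the SQL text.
    return connection.execute(query, (organisation, collection)).fetchall()


# Demo data in a throwaway in-memory database (made-up organisations and URLs).
demo = sqlite3.connect(":memory:")
demo.execute("create table source (organisation text, collection text, endpoint text)")
demo.execute("create table endpoint (endpoint text, endpoint_url text, entry_date text)")
demo.executemany(
    "insert into source values (?, ?, ?)",
    [
        ("local-authority-eng:AAA", "brownfield-land", "e1"),
        ("local-authority-eng:BBB", "brownfield-land", "e2"),
    ],
)
demo.executemany(
    "insert into endpoint values (?, ?, ?)",
    [
        ("e1", "https://example.org/a.csv", "2020-01-01"),
        ("e2", "https://example.org/b.csv", "2021-01-01"),
    ],
)
print(entries_for(demo, "local-authority-eng:AAA", "brownfield-land"))
```

`pd.read_sql_query` also accepts a `params` argument, so the same `?` placeholders should work with the notebook's own connection if a DataFrame is wanted.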
