# Compliance to specification report
**Author**:  Greg Slater <br>
**Date**:  14th March 2024 <br>
**Data Scope**: article-4-direction, listed-building, conservation-area, and tree-preservation-order collections <br>
**Report Type**: Recurring daily <br>

## Purpose
This report measures how well each resource (an instance of a dataset supplied by an LPA) complies with the specification it should follow. The datasets covered are those in the article-4-direction, listed-building, conservation-area, and tree-preservation-order collections that LPAs in the ODP and RIPA-BOPS cohorts have supplied through their latest endpoints.

Each resource is scored using three metrics:

* **Fields supplied**: The number of fields provided which can be mapped to the specification (this may include manual re-mapping done when the endpoint was added, as well as any automatic matching done by the pipeline).
* **Fields with no errors**: The number of fields for which there are no issues with a severity level of "error" raised for any values (to see all issue types and definitions see the [`issue_type` table](https://datasette.planning.data.gov.uk/digital-land/issue_type) in Datasette).
* **Fields with correct names**: The number of fields which exactly match the specification field names.

See [this sheet](https://docs.google.com/spreadsheets/d/1DJ0wqMj-vMidzaUIqbvP0nEIY0kOcj1dd_HJk7aZuTE/edit#gid=905004357) for an illustrated example of how a dataset is scored. Note that currently scoring is done only against the full spec, not against any minimum requirement or 'mandatory' fields.
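
As a minimal sketch of how the three metrics combine, the toy example below scores a single hypothetical resource against a five-field specification. The field names, column names and set of fields with errors are invented for illustration; the real calculation is done later in this notebook using data from Datasette.

```python
# Hypothetical specification with five fields (illustrative only)
spec_fields = ["reference", "name", "start-date", "end-date", "notes"]

# Hypothetical mapping of supplied column name -> specification field
column_to_field = {
    "reference": "reference",    # column name exactly matches the field name
    "NAME": "name",              # re-mapped, so supplied but not a correct name
    "StartDate": "start-date",   # re-mapped, so supplied but not a correct name
}

# Hypothetical fields with at least one issue of severity "error"
fields_with_errors = {"name"}

# Fields supplied: any specification field that a column maps to
supplied = set(column_to_field.values()) & set(spec_fields)
# Fields with no errors: supplied fields minus those with error-level issues
error_free = supplied - fields_with_errors
# Fields with correct names: column name is identical to the field name
correct_names = {f for c, f in column_to_field.items() if c == f and f in spec_fields}

total = len(spec_fields)
scores = {
    "fields_supplied": f"{len(supplied)}/{total}",
    "fields_with_no_errors": f"{len(error_free)}/{total}",
    "fields_with_correct_names": f"{len(correct_names)}/{total}",
}
print(scores)
```

Each score is expressed against the full number of specification fields, which matches the note above that scoring is done against the full spec rather than a mandatory subset.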

In [1]:
%pip install wget
import wget
import pandas as pd
import os
import numpy as np
import urllib.parse


Download helper utility files from GitHub:

In [2]:
util_file = "master_report_endpoint_utils.py"
if os.path.isfile(util_file):
    from master_report_endpoint_utils import *
else:
    url = "https://raw.githubusercontent.com/digital-land/jupyter-analysis/main/service_report/master_report/master_report_endpoint_utils.py"
    wget.download(url)
    from master_report_endpoint_utils import *

The default prioritised LPAs are used unless a specific set of LPAs is detected using an 'organisation_input.csv' file in the same directory as this notebook.

In [3]:
# Get input from .csv or use default prioritised LPAs
input_path = './organisation_input.csv'
if os.path.isfile(input_path):
    input_df = pd.read_csv(input_path)
    organisation_list = input_df['organisation'].tolist()
    print('Input file found. Using', len(organisation_list), 'organisations from input file.')
else:
    provision_df = get_provisions()
    organisation_list = provision_df["organisation"].str.replace(":","-eng:")
    print('Input file not found. Using default list of organisations.')

Input file not found. Using default list of organisations.
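
To run the report for a specific set of LPAs, an input file with a single `organisation` column is needed. The sketch below shows one way such a file might be created. The two organisation codes are examples only, and the assumption that codes should use the `-eng` form is inferred from the report output further down rather than stated anywhere in this notebook.

```python
import os
import tempfile

import pandas as pd

# Example organisation codes (assumed to need the "-eng" form)
organisations = ["local-authority-eng:BIR", "local-authority-eng:CAT"]

# Written to a temporary directory here; for the notebook to detect it, the
# file would need to be saved as ./organisation_input.csv next to the notebook
input_path = os.path.join(tempfile.mkdtemp(), "organisation_input.csv")
pd.DataFrame({"organisation": organisations}).to_csv(input_path, index=False)

# Read it back in the same way the cell above does
organisation_list = pd.read_csv(input_path)["organisation"].tolist()
print(organisation_list)
```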


In [4]:
def get_endpoint_resource_data():
    datasette_url = "https://datasette.planning.data.gov.uk/"

    params = urllib.parse.urlencode({
        "sql": """
        select *
        from reporting_latest_endpoints
        """,
        "_size": "max"
    })

    url = f"{datasette_url}digital-land.csv?{params}"
    df = pd.read_csv(url)
    return df

def get_fields_for_resource(resource, dataset):
    datasette_url = "https://datasette.planning.data.gov.uk/"
    params = urllib.parse.urlencode({
        "sql": f"""
        select f.field, fr.resource
        from 
            fact_resource fr
            inner join fact f on fr.fact = f.fact
        where 
            resource = '{resource}'
        group by
            f.field
        """,
        "_size": "max"
    })
    url = f"{datasette_url}{dataset}.csv?{params}"
    facts_df = pd.read_csv(url)
    # facts_list = facts_df['field'].tolist()
    return facts_df

def get_column_mappings_for_resource(resource, dataset):
    datasette_url = "https://datasette.planning.data.gov.uk/"
    params = urllib.parse.urlencode({
        "sql": f"""
        select column, field
        from 
          column_field  
        where 
            resource = '{resource}'
        """,
        "_size": "max"
    })
    url = f"{datasette_url}{dataset}.csv?{params}"
    column_field_df = pd.read_csv(url)
    return column_field_df


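def get_all_issues_for_resource(resource, dataset):
    # define the base URL locally, as the other query functions do
    datasette_url = "https://datasette.planning.data.gov.uk/"
    params = urllib.parse.urlencode({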
        "sql": f"""
        select field, issue_type, count(*) as count_issues
        from issue
        where resource = '{resource}'
        group by field, issue_type
        """,
        "_size": "max"
    })
    url = f"{datasette_url}{dataset}.csv?{params}"
    issues_df = pd.read_csv(url)
    return issues_df
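
The query functions above all build a Datasette CSV URL in the same way: URL-encode the SQL and a `_size` parameter, then append them to `<database>.csv`. A minimal sketch of that shared step is shown below. The helper name `build_datasette_csv_url` and the resource value `abc123` are hypothetical and not part of this notebook or the utils file; the example only builds the URL and makes no network request.

```python
import urllib.parse

DATASETTE_URL = "https://datasette.planning.data.gov.uk/"


def build_datasette_csv_url(database, sql):
    """Return the CSV export URL for a SQL query against a Datasette database."""
    params = urllib.parse.urlencode({"sql": sql, "_size": "max"})
    return f"{DATASETTE_URL}{database}.csv?{params}"


# Hypothetical resource hash, used only to show the shape of the URL
sql = "select column, field from column_field where resource = 'abc123'"
url = build_datasette_csv_url("conservation-area", sql)
print(url)

# The SQL can be recovered from the URL, which is handy when debugging
decoded_sql = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)["sql"][0]
```

The resulting URL could then be passed to `pd.read_csv`, as the functions above do.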


## Get endpoint data

In [5]:
# get data from datasette
endpoint_resource_df = get_endpoint_resource_data()

# filter to org_list, valid, active endpoints and resources
endpoint_resource_filtered_df = endpoint_resource_df[
    (endpoint_resource_df["organisation"].isin(organisation_list)) &
    (endpoint_resource_df["status"] == 200) &
    (endpoint_resource_df["endpoint_end_date"].isnull()) &
    (endpoint_resource_df["resource_end_date"].isnull())
].copy()

print(len(endpoint_resource_df))
print(len(endpoint_resource_filtered_df))

print(len(endpoint_resource_filtered_df[["endpoint", "pipeline"]].drop_duplicates()))
print(len(endpoint_resource_filtered_df[["resource"]].drop_duplicates()))
print(len(endpoint_resource_filtered_df[["endpoint"]].drop_duplicates()))

162
87
87
82
82


## Get field and col mapping data

In [6]:
# table of unique resources and pipelines
resource_df = endpoint_resource_filtered_df[["pipeline", "resource"]].drop_duplicates().dropna(axis = 0)
print(len(resource_df))

issue_severity_lookup = get_issue_types_by_severity(["error"])

87


In [7]:
# generic function to try the resource datasette queries 
# will return a df with resource and dataset fields as keys, and query results as other fields
def try_results(function, resource, dataset):

    # try grabbing results
    try:
        df = function(resource, dataset)

        df["resource"] = resource
        df["dataset"] = dataset

    # if the query fails, record the resource and dataset; other fields will be given NaNs in concat
    except Exception:
        df = pd.DataFrame({"resource" : [resource],
                           "dataset" : [dataset]
        })

    return df


# get results for col mappings and fields in arrays
results_col_map = [try_results(get_column_mappings_for_resource, r["resource"], r["pipeline"]) for index, r in resource_df.iterrows()]
results_field_resource = [try_results(get_fields_for_resource, r["resource"], r["pipeline"]) for index, r in resource_df.iterrows()]
results_issues = [try_results(get_all_issues_for_resource, r["resource"], r["pipeline"]) for index, r in resource_df.iterrows()]

# concat the results; resources which errored will have NaNs in the query result fields
results_col_map_df = pd.concat(results_col_map)
results_field_resource_df = pd.concat(results_field_resource)
results_issues_df = pd.concat(results_issues)

# join on severity to issues
results_issues_df = results_issues_df.merge(
    issue_severity_lookup,
    how = "inner",
    on = "issue_type"
)

# the inner join above keeps only issues with severity "error"; get a unique list of fields with errors per dataset and resource
resource_issue_errors_df = results_issues_df[["dataset", "resource", "field"]].drop_duplicates()


# no. of resources in each query response array
print(len(results_col_map))
print(len(results_field_resource))

# no of records in each results df
print(len(results_col_map_df))
print(len(results_field_resource_df))


87
87
603
549
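
Because `try_results` swallows query failures, a failed resource shows up only as a row whose query result fields are NaN. The sketch below illustrates this with a hypothetical stand-in query (`fake_query`) that fails for one resource, and shows one way the failed resources might be listed afterwards. The resource names are invented.

```python
import pandas as pd


def try_results(function, resource, dataset):
    # Same pattern as the cell above: fall back to a key-only row on failure
    try:
        df = function(resource, dataset)
        df["resource"] = resource
        df["dataset"] = dataset
    except Exception:
        df = pd.DataFrame({"resource": [resource], "dataset": [dataset]})
    return df


# Hypothetical stand-in for a Datasette query that fails for one resource
def fake_query(resource, dataset):
    if resource == "bad":
        raise ValueError("query failed")
    return pd.DataFrame({"field": ["name", "reference"]})


results = [try_results(fake_query, r, "conservation-area") for r in ["good", "bad"]]
results_df = pd.concat(results, ignore_index=True)

# Resources whose query failed are the rows with no value in the result fields
failed = results_df.loc[results_df["field"].isnull(), "resource"].tolist()
print(failed)
```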


In [8]:
# add in match field for column mappings 
results_col_map_df["field_matched"] = np.where(
        (results_col_map_df["field"].isin(["geometry", "point"])) |
        (results_col_map_df["field"] == results_col_map_df["column"]),
        1, 
        0
)

# add in flag for fields supplied (i.e. they're in the mapping table)
results_col_map_df["field_supplied"] = 1

# add in flag for fields present
results_field_resource_df["field_loaded"] = 1

# add in flag for fields with errors
resource_issue_errors_df["field_errors"] = 1
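
To illustrate the matching rule used for `field_matched`, the sketch below applies the same `np.where` condition to a small, invented mapping table. A field counts as matched when the supplied column name is identical to the field name, or when the field is `geometry` or `point` regardless of the column name.

```python
import numpy as np
import pandas as pd

# Hypothetical column mappings for one resource
col_map = pd.DataFrame({
    "column": ["reference", "NAME", "WKT"],
    "field": ["reference", "name", "geometry"],
})

col_map["field_matched"] = np.where(
    (col_map["field"].isin(["geometry", "point"])) |
    (col_map["field"] == col_map["column"]),
    1,
    0
)
print(col_map)
```

Here `NAME` is supplied but not matched because the name differs from `name`, while `WKT` is counted as matched because it maps to `geometry`.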

## Calculate match rates

In [9]:
dataset_field_df = pd.read_csv('https://raw.githubusercontent.com/digital-land/specification/main/specification/dataset-field.csv')

# dataset_field_df.head()

In [10]:
# rename pipeline to dataset in endpoint_resource table
endpoint_resource_filtered_df.rename(columns={"pipeline":"dataset"}, inplace=True)

# left join from endpoint resource table to all the fields that each dataset should have
resource_spec_fields_df = endpoint_resource_filtered_df[
        ["organisation", "name", "dataset", "endpoint", "status", "latest_log_entry_date", "endpoint_entry_date", "resource"]
    ].merge(
        dataset_field_df[["dataset", "field"]],
        on = "dataset"
)

# join on field loaded flag for each resource and field
resource_fields_match = resource_spec_fields_df.merge(
    results_field_resource_df[["dataset", "resource", "field", "field_loaded"]],
    how = "left",
    on = ["dataset", "resource", "field"]
)

# join on field supplied and matched flag for each resource and field
resource_fields_map_match = resource_fields_match.merge(
    results_col_map_df[["dataset", "resource", "field", "field_supplied", "field_matched"]],
    how = "left",
    on = ["dataset", "resource", "field"]
)

# join on field errors flag for each resource and field
resource_fields_map_issues = resource_fields_map_match.merge(
    resource_issue_errors_df,
    how = "left",
    on = ["dataset", "resource", "field"]
)

# check we're not getting dupes in the left joins
print(len(resource_spec_fields_df))
print(len(resource_fields_match))
print(len(resource_fields_map_match))
print(len(resource_fields_map_issues))
# resource_fields_map_issues.head()

1401
1401
1402
1402
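
The row count rising from 1401 to 1402 suggests that at least one `dataset`, `resource`, `field` key appears more than once in the column mapping table, for example where two supplied columns map to the same field. That is a plausible explanation rather than a confirmed one. The sketch below uses invented data to show how such duplicate keys might be located, and how the `validate` argument of `merge` can make pandas raise an error instead of silently adding rows.

```python
import pandas as pd

keys = ["dataset", "resource", "field"]

# Hypothetical left table: one row per specification field for a resource
left = pd.DataFrame({
    "dataset": ["tree", "tree"],
    "resource": ["abc", "abc"],
    "field": ["name", "reference"],
})

# Hypothetical right table in which two columns map to the same field
right = pd.DataFrame({
    "dataset": ["tree", "tree", "tree"],
    "resource": ["abc", "abc", "abc"],
    "field": ["name", "reference", "reference"],
    "column": ["name", "ref", "REF_NO"],
})

# Rows in the right table that share a join key with another row
dupes = right[right.duplicated(subset=keys, keep=False)]
print(dupes)

# The left join gains a row for every extra match on the right
merged = left.merge(right, how="left", on=keys)
print(len(left), len(merged))

# validate="many_to_one" checks that the join keys are unique in the right table
try:
    left.merge(right, how="left", on=keys, validate="many_to_one")
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True
print(duplicate_keys_found)
```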


In [11]:
# remove fields that are auto-created in the pipeline from final table to avoid mis-counting
# ("entity", "organisation", "prefix", "point" for all but tree, and "entity", "organisation", "prefix" for tree)
resource_fields_scored = resource_fields_map_issues[
    ((resource_fields_map_issues["dataset"] != "tree") & (~resource_fields_map_issues["field"].isin(["entity", "organisation", "prefix", "point"])) |
     (resource_fields_map_issues["dataset"] == "tree") & (~resource_fields_map_issues["field"].isin(["entity", "organisation", "prefix"])))
].copy()  # copy so the .loc assignment below modifies this table, not a view of the original

# where entry-date hasn't been supplied it is auto-created - change field_loaded to NaN in these instances so we don't count it as a loaded field
entry_date_mask = ((resource_fields_scored["field"] == "entry-date") &
    (resource_fields_scored["field_supplied"].isnull()) &
    (resource_fields_scored["field_loaded"] == 1))

resource_fields_scored.loc[entry_date_mask, "field_loaded"] = np.nan

In [12]:
# group by and aggregate for final summaries
final_count = resource_fields_scored.groupby(
    ["organisation", "name", "dataset", "endpoint", "resource", "status", "latest_log_entry_date", "endpoint_entry_date"]
    ).agg(
        {"field":"count",
         "field_supplied" : "sum",
         "field_matched" : "sum",
         "field_loaded" : "sum",
         "field_errors" : "sum"}
         ).reset_index(
         ).sort_values(["name"])

# add a field for the endpoint number (so that orgs and datasets with multiple endpoints are split out and in index)
final_count["endpoint_number"] = final_count.groupby(["organisation", "name", "dataset"]).cumcount() + 1
final_count["field_error_free"] = final_count["field_supplied"] - final_count["field_errors"]

# add string fields for [n fields]/[total fields] style counts
final_count["field_supplied_count"] = final_count["field_supplied"].astype(int).map(str) + "/" + final_count["field"].map(str)
final_count["field_error_free_count"] = final_count["field_error_free"].astype(int).map(str) + "/" + final_count["field"].map(str)
final_count["field_matched_count"] = final_count["field_matched"].astype(int).map(str) + "/" + final_count["field"].map(str)
# final_count["field_loaded_count"] = final_count["field_loaded"].astype(int).map(str) + "/" + final_count["field"].map(str)

# create % columns
final_count["field_supplied_pct"] = final_count["field_supplied"] / final_count["field"]
final_count["field_error_free_pct"] = final_count["field_error_free"] / final_count["field"]  
final_count["field_matched_pct"] = final_count["field_matched"] / final_count["field"] 
# final_count["field_loaded_pct"] = final_count["field_loaded"] / final_count["field"] 

final_count.reset_index(drop=True, inplace=True)
final_count.to_csv("compliance_to_standard_report.csv")
# final_count.head()

## Report output

In [13]:
final_count_out = final_count[
    ["organisation", "name", "dataset", "endpoint_number", "field_supplied_count", "field_supplied_pct", 
     "field_error_free_count", "field_error_free_pct", "field_matched_count", "field_matched_pct"]
].copy()

final_count_out.sort_values(["name", "dataset", "endpoint_number"], inplace=True)

slice_ = ["field_supplied_pct", "field_error_free_pct", "field_matched_pct"]

final_count_out.style \
    .relabel_index(["Organisation", "Org Name", "Dataset", "Endpoint no.", "Fields supplied", "Fields supplied (%)", 
                    "Fields with no errors", "Fields with no errors (%)", "Field with correct names", "Field with correct names (%)"], axis=1) \
    .format("{:.0%}", subset = slice_) \
    .background_gradient(axis=None, vmin=0, vmax=1, cmap="YlGn", subset = slice_)

|  | Organisation | Org Name | Dataset | Endpoint no. | Fields supplied | Fields supplied (%) | Fields with no errors | Fields with no errors (%) | Field with correct names | Field with correct names (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | local-authority-eng:BIR | Birmingham City Council | article-4-direction-area | 1 | 11/12 | 92% | 11/12 | 92% | 1/12 | 8% |
| 1 | local-authority-eng:BIR | Birmingham City Council | conservation-area | 1 | 8/10 | 80% | 8/10 | 80% | 1/10 | 10% |
| 2 | local-authority-eng:BOS | Bolsover District Council | conservation-area | 1 | 0/10 | 0% | 0/10 | 0% | 0/10 | 0% |
| 5 | local-authority-eng:CAT | Canterbury City Council | article-4-direction-area | 1 | 0/12 | 0% | 0/12 | 0% | 0/12 | 0% |
| 6 | local-authority-eng:CAT | Canterbury City Council | conservation-area | 1 | 4/10 | 40% | 4/10 | 40% | 1/10 | 10% |
| 4 | local-authority-eng:CAT | Canterbury City Council | listed-building-outline | 1 | 2/16 | 12% | 1/16 | 6% | 1/16 | 6% |
| 3 | local-authority-eng:CAT | Canterbury City Council | locally-listed-building | 1 | 4/12 | 33% | 4/12 | 33% | 1/12 | 8% |
| 8 | local-authority-eng:DNC | Doncaster Metropolitan Borough Council | article-4-direction-area | 1 | 6/12 | 50% | 6/12 | 50% | 1/12 | 8% |
| 9 | local-authority-eng:DNC | Doncaster Metropolitan Borough Council | conservation-area | 1 | 4/10 | 40% | 4/10 | 40% | 1/10 | 10% |
| 10 | local-authority-eng:DNC | Doncaster Metropolitan Borough Council | listed-building-outline | 1 | 1/16 | 6% | 1/16 | 6% | 1/16 | 6% |


In [14]:
resource_fields_scored.head()

|  | organisation | name | dataset | endpoint | status | latest_log_entry_date | endpoint_entry_date | resource | field | field_loaded | field_supplied | field_matched | field_errors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | local-authority-eng:BIR | Birmingham City Council | article-4-direction-area | 2d9575d771afff89f6d731be59a1ff8cedfd99efcd8bb2... | 200.0 | 2024-04-09T00:18:04Z | 2023-11-14T00:00:00Z | 7a937605655b895bf9ebfbe29f8e35af8d3f606fd811b4... | address-text |  | 1.0 | 0.0 |  |
| 1 | local-authority-eng:BIR | Birmingham City Council | article-4-direction-area | 2d9575d771afff89f6d731be59a1ff8cedfd99efcd8bb2... | 200.0 | 2024-04-09T00:18:04Z | 2023-11-14T00:00:00Z | 7a937605655b895bf9ebfbe29f8e35af8d3f606fd811b4... | article-4-direction |  | 1.0 | 0.0 |  |
| 2 | local-authority-eng:BIR | Birmingham City Council | article-4-direction-area | 2d9575d771afff89f6d731be59a1ff8cedfd99efcd8bb2... | 200.0 | 2024-04-09T00:18:04Z | 2023-11-14T00:00:00Z | 7a937605655b895bf9ebfbe29f8e35af8d3f606fd811b4... | description |  |  |  |  |
| 3 | local-authority-eng:BIR | Birmingham City Council | article-4-direction-area | 2d9575d771afff89f6d731be59a1ff8cedfd99efcd8bb2... | 200.0 | 2024-04-09T00:18:04Z | 2023-11-14T00:00:00Z | 7a937605655b895bf9ebfbe29f8e35af8d3f606fd811b4... | end-date |  | 1.0 | 0.0 |  |
| 5 | local-authority-eng:BIR | Birmingham City Council | article-4-direction-area | 2d9575d771afff89f6d731be59a1ff8cedfd99efcd8bb2... | 200.0 | 2024-04-09T00:18:04Z | 2023-11-14T00:00:00Z | 7a937605655b895bf9ebfbe29f8e35af8d3f606fd811b4... | entry-date | 1.0 | 1.0 | 0.0 |  |


In [21]:
dataset_name = "article-4-direction-area"

scored_dataset = resource_fields_scored[resource_fields_scored["dataset"] == dataset_name]

scored_dataset.groupby(["dataset", "field"]).agg(
        {"resource":"count",
         "field_supplied" : "sum",
         "field_matched" : "sum",
         "field_loaded" : "sum",
         "field_errors" : "sum"}
         ).reset_index()

|  | dataset | field | resource | field_supplied | field_matched | field_loaded | field_errors |
|---|---|---|---|---|---|---|---|
| 0 | article-4-direction-area | address-text | 16 | 8.0 | 2.0 | 1.0 | 0.0 |
| 1 | article-4-direction-area | article-4-direction | 16 | 10.0 | 3.0 | 9.0 | 0.0 |
| 2 | article-4-direction-area | description | 16 | 3.0 | 2.0 | 2.0 | 0.0 |
| 3 | article-4-direction-area | end-date | 16 | 11.0 | 3.0 | 0.0 | 0.0 |
| 4 | article-4-direction-area | entry-date | 16 | 12.0 | 3.0 | 12.0 | 0.0 |
| 5 | article-4-direction-area | geometry | 16 | 15.0 | 15.0 | 13.0 | 0.0 |
| 6 | article-4-direction-area | name | 16 | 14.0 | 11.0 | 13.0 | 0.0 |
| 7 | article-4-direction-area | notes | 16 | 9.0 | 8.0 | 5.0 | 0.0 |
| 8 | article-4-direction-area | permitted-development-rights | 16 | 7.0 | 2.0 | 6.0 | 1.0 |
| 9 | article-4-direction-area | reference | 16 | 14.0 | 10.0 | 13.0 | 0.0 |
