# Provision data quality report
**Author**:  Greg Slater <br>
**Date created**:  April 2025 <br>
**Dataset Scope**: ODP datasets <br>
**Report Type**: Ad-hoc <br>

**Purpose**: This report demonstrates the basic methodology for scoring provisions using the data quality measurement framework. The data sources needed to assess whether a provision passes or fails each criterion in the framework vary, so this stripped-back method currently uses only two: issues logged in the `digital-land.issue` table (summarised in `performance.endpoint_dataset_issue_type_summary`) and failed expectations logged in `digital-land.expectation`. The starting point is the `performance.provision_summary` table, which lists all provisions and whether each currently has an active endpoint. For simplicity, only provisions with one or more active endpoints are scored.

## Data quality framework
The table below visualises the framework that is used to assign a quality level to each ODP data provision. 

The criteria marked as "true" at each level must be met by a data provision in order for it to be scored at that level. The levels are cumulative, so all criteria must be met in order for a provision to be scored as *data that is trustworthy*.

![quality framework table](quality-framework-table.png)



Criteria not included in this example method are:

* *Data from an authoritative source*
* *No deleted entities*

In [1]:
import os
import pandas as pd
import numpy as np
from datetime import datetime
import json

import functions_core as fc

td = datetime.today().strftime('%Y-%m-%d')

db_dir = "../../data/db_downloads/"
os.makedirs(db_dir, exist_ok=True)

output_dir = "../../data/quality_report/"
os.makedirs(output_dir, exist_ok=True)

## 1. Import

In [None]:
# performance db
fc.download_dataset("performance", db_dir, overwrite=False)
path_perf_db = os.path.join(db_dir, "performance.db")

# Issue quality criteria lookup
lookup_issue_qual = fc.datasette_query(
    "digital-land",
    """
    SELECT 
        description,
        issue_type,
        name,
        severity,
        responsibility,
        quality_criteria_level || ' - ' || quality_criteria as quality_criteria,
        quality_criteria_level as quality_level
    FROM issue_type
    WHERE quality_criteria_level != ''
    AND quality_criteria != ''
    """ 
    )


## 2. Transform

In [3]:
provision = fc.query_sqlite(
    path_perf_db,
    """
    SELECT organisation, dataset, active_endpoint_count
    FROM provision_summary
""")

# provision
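
`fc.query_sqlite` comes from the local `functions_core` module, which is not shown in this report. As a rough illustration only, assuming it takes a database path and a SQL string and returns a DataFrame, a stand-in could look like the sketch below. The name `query_sqlite_sketch` and the demo rows are hypothetical.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def query_sqlite_sketch(db_path, sql):
    """Hypothetical stand-in for fc.query_sqlite: run a query, return a DataFrame."""
    conn = sqlite3.connect(db_path)
    try:
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()


# Build a throwaway database with made-up rows to exercise the helper
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(demo_path)
conn.execute(
    "CREATE TABLE provision_summary "
    "(organisation TEXT, dataset TEXT, active_endpoint_count INTEGER)"
)
conn.executemany(
    "INSERT INTO provision_summary VALUES (?, ?, ?)",
    [("org-a", "example-dataset", 1), ("org-b", "example-dataset", 0)],
)
conn.commit()
conn.close()

demo_provision = query_sqlite_sketch(
    demo_path,
    "SELECT organisation, dataset, active_endpoint_count FROM provision_summary",
)
print(demo_provision)
```

The real helper may differ, for example in how it handles connections or parameters.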

In [4]:
# IDENTIFY PROBLEMS - issues

# extract issue count by provision from endpoint_dataset_issue_type_summary
qual_issue = fc.query_sqlite(
    path_perf_db,
    """
    SELECT 
        organisation, dataset,
        'issue' as problem_source,
        issue_type as problem_type, 
        sum(count_issues) as count
    FROM endpoint_dataset_issue_type_summary
    WHERE resource_end_date is not NULL
    AND issue_type is not NULL
    GROUP BY organisation, dataset, issue_type
""")

# join on quality criteria and level from the issue_type lookup (the inner join keeps only issues linked to a quality criterion)
qual_issue = qual_issue.merge(
    lookup_issue_qual[["issue_type", "quality_criteria", "quality_level"]],
    how = "inner",
    left_on = "problem_type",
    right_on = "issue_type"
)

qual_issue.drop("issue_type", axis=1, inplace=True)

qual_issue.to_csv("01_quality_problems-source-issues_all.csv", index = False)
# qual_issue
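
To show how the inner join in the cell above restricts the issue table, here is a small sketch on made-up data. The issue type names and the criterion label are placeholders, not values from `issue_type`.

```python
import pandas as pd

# Made-up issue counts; "example-issue-b" has no entry in the lookup
issues = pd.DataFrame({
    "organisation": ["org-a", "org-a", "org-b"],
    "dataset": ["example-dataset"] * 3,
    "problem_type": ["example-issue-a", "example-issue-b", "example-issue-a"],
    "count": [5, 2, 1],
})

# Made-up lookup in the same shape as lookup_issue_qual
lookup = pd.DataFrame({
    "issue_type": ["example-issue-a"],
    "quality_criteria": ["3 - example criterion"],
    "quality_level": [3],
})

# Inner join: rows whose problem_type is absent from the lookup are dropped
linked = issues.merge(
    lookup, how="inner", left_on="problem_type", right_on="issue_type"
).drop(columns="issue_type")

print(linked)
```

Only the two `example-issue-a` rows survive, so issue types without a quality criterion never affect a provision's score.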

In [5]:
# IDENTIFY PROBLEMS - expectations - entity beyond LPA bounds

qual_expectation_bounds = fc.datasette_query(
    "digital-land", 
    """
    SELECT organisation, dataset, details
    FROM expectation
    WHERE 1=1
        AND name = 'Check no entities are outside of the local planning authority boundary'
        AND passed = 'False'
        AND message not like '%error%'
    """)

qual_expectation_bounds["problem_source"] = "expectation"
qual_expectation_bounds["problem_type"] = "entity outside of the local planning authority boundary"
qual_expectation_bounds["count"] = [json.loads(v)["actual"] for v in qual_expectation_bounds["details"]]
qual_expectation_bounds["quality_criteria"] = "3 - entities within LPA boundary"
qual_expectation_bounds["quality_level"] = 3
qual_expectation_bounds.drop("details", axis=1, inplace=True)
qual_expectation_bounds.to_csv("01_quality_problems-source-expectation_bounds.csv", index = False)
# qual_expectation_bounds
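
The list comprehension above assumes every `details` value is valid JSON with an `actual` key. If that cannot be guaranteed, a more defensive extraction might look like the sketch below. The sample strings and the `expected` key are made up for illustration, and `extract_actual` is a hypothetical helper.

```python
import json

# Made-up `details` strings; only the "actual" key is taken from the notebook
details_rows = [
    '{"actual": 3, "expected": 0}',
    '{"actual": 0, "expected": 0}',
    '{"expected": 0}',  # "actual" key missing
    "not json",         # malformed value
]


def extract_actual(raw):
    """Return the "actual" value from a JSON string, or None if unavailable."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError covers non-string input, ValueError covers malformed JSON
        return None
    if not isinstance(parsed, dict):
        return None
    return parsed.get("actual")


counts = [extract_actual(v) for v in details_rows]
print(counts)
```

Rows that yield `None` would then need an explicit decision (drop, or treat as an unknown count) before aggregation.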

In [6]:
# IDENTIFY PROBLEMS - expectations - entity count doesn't match manual count

qual_expectation_count = fc.datasette_query(
    "digital-land", 
    """
    SELECT organisation, dataset, details
    FROM expectation
    WHERE 1=1
        AND name = 'Check number of entities inside the local planning authority boundary matches the manual count' 
        AND passed = 'False'
        AND message not like '%error%'
    """)

qual_expectation_count["problem_source"] = "expectation"
qual_expectation_count["problem_type"] = "entity count doesn't match manual count"
qual_expectation_count["count"] = [json.loads(v)["actual"] for v in qual_expectation_count["details"]]
qual_expectation_count["quality_criteria"] = "3 - conservation area entity count matches LPA"
qual_expectation_count["quality_level"] = 3
qual_expectation_count.drop("details", axis=1, inplace=True)
qual_expectation_count.to_csv("01_quality_problems-source-expectation_count.csv", index = False)
# qual_expectation_count

In [7]:
# combine all problem source tables, and aggregate to criteria level

qual_all_criteria = pd.concat(
    [qual_issue, qual_expectation_bounds, qual_expectation_count]
).groupby(
    ["organisation", "dataset", "quality_criteria", "quality_level"],
    as_index=False
).agg(
    count_failures = ("count", "sum")
)

# qual_all_criteria
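
As a small worked example of the combine-and-aggregate step, the sketch below stacks two made-up problem tables and sums their counts per criterion. All names and numbers are placeholders.

```python
import pandas as pd

# Two issue types that map to the same (made-up) criterion
issue_rows = pd.DataFrame({
    "organisation": ["org-a", "org-a"],
    "dataset": ["example-dataset", "example-dataset"],
    "problem_type": ["example-issue-a", "example-issue-b"],
    "count": [5, 2],
    "quality_criteria": ["2 - example criterion", "2 - example criterion"],
    "quality_level": [2, 2],
})

# One expectation failure under a different (made-up) criterion
expectation_rows = pd.DataFrame({
    "organisation": ["org-a"],
    "dataset": ["example-dataset"],
    "problem_type": ["example-expectation"],
    "count": [4],
    "quality_criteria": ["3 - another example criterion"],
    "quality_level": [3],
})

by_criteria = pd.concat([issue_rows, expectation_rows]).groupby(
    ["organisation", "dataset", "quality_criteria", "quality_level"],
    as_index=False,
).agg(count_failures=("count", "sum"))

print(by_criteria)
```

The two issue rows collapse into a single criterion row with `count_failures` of 7, while the expectation row keeps its own count of 4.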

In [8]:
prov_qual_all = provision.merge(
    qual_all_criteria,
    how = "left",
    on = ["organisation", "dataset"]
)

prov_qual_all.to_csv("02_provision_quality_all_criteria.csv", index = False)
# prov_qual_all

In [9]:
# np.select takes the first matching condition for each row, in this order:
#   active_endpoint_count = 0 -> 0 (no active endpoints, so quality = no score)
#   quality_level is not null -> keep the level of the failed criterion
#   quality_level is null & active_endpoint_count > 0 -> 4 (active endpoints with no logged problems, so quality = trustworthy)
prov_qual_all["quality_level_for_sort"] = np.select(
    [
        (prov_qual_all["active_endpoint_count"] == 0),
        (prov_qual_all["quality_level"].notnull()),
        (prov_qual_all["active_endpoint_count"] > 0) & (prov_qual_all["quality_level"].isnull())
    ],
    [
        0, prov_qual_all["quality_level"], 4
    ]
)
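
The left join and the `np.select` rule can be traced on three made-up provisions: one with a logged problem, one with active endpoints and no problems, and one with no active endpoints. This is an illustrative sketch with placeholder values.

```python
import numpy as np
import pandas as pd

provisions = pd.DataFrame({
    "organisation": ["org-a", "org-b", "org-c"],
    "dataset": ["example-dataset"] * 3,
    "active_endpoint_count": [1, 2, 0],
})

# Only org-a has a logged problem
problems = pd.DataFrame({
    "organisation": ["org-a"],
    "dataset": ["example-dataset"],
    "quality_level": [3],
})

# Left join keeps every provision; those without problems get NaN
merged = provisions.merge(problems, how="left", on=["organisation", "dataset"])

merged["quality_level_for_sort"] = np.select(
    [
        merged["active_endpoint_count"] == 0,
        merged["quality_level"].notnull(),
        (merged["active_endpoint_count"] > 0) & merged["quality_level"].isnull(),
    ],
    [0, merged["quality_level"], 4],
)

print(merged)
```

Here `org-a` keeps level 3, `org-b` is assigned 4, and `org-c` is assigned 0.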

In [10]:
level_map = {
    4: "4. data that is trustworthy",
    3: "3. data that is good for ODP",
    2: "2. authoritative data from the LPA",
    1: "1. some data",
    0: "0. no score"}

# score each provision at the lowest level among its rows
prov_quality = prov_qual_all.groupby(
    ["organisation", "dataset"],
    as_index=False,
    dropna=False
).agg(
    quality_level = ("quality_level_for_sort", "min")
)

prov_quality["quality"] = prov_quality["quality_level"].map(level_map)
prov_quality["notes"] = ""
prov_quality["end-date"] = ""
prov_quality["start-date"] = td
prov_quality["entry-date"] = td

print(prov_quality.value_counts(["quality_level", "quality"]))
prov_quality[["dataset", "end-date", "entry-date", "notes", "organisation", "quality", "start-date"]].to_csv("03_provision_quality_scored.csv", index = False)
# prov_quality

quality_level  quality                           
0.0            0. no score                           4956
3.0            3. data that is good for ODP           314
4.0            4. data that is trustworthy            310
2.0            2. authoritative data from the LPA     177
Name: count, dtype: int64
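
To make the final scoring step concrete, the sketch below applies the same minimum-then-label logic to made-up rows. A provision with rows at levels 3 and 2 ends up at level 2. The data is a placeholder; the labels are copied from `level_map`.

```python
import pandas as pd

level_labels = {
    4: "4. data that is trustworthy",
    3: "3. data that is good for ODP",
    2: "2. authoritative data from the LPA",
    1: "1. some data",
    0: "0. no score",
}

# org-a has two criterion rows, org-b has none and was assigned 4 upstream
rows = pd.DataFrame({
    "organisation": ["org-a", "org-a", "org-b"],
    "dataset": ["example-dataset"] * 3,
    "quality_level_for_sort": [3.0, 2.0, 4.0],
})

scored = rows.groupby(
    ["organisation", "dataset"], as_index=False
).agg(quality_level=("quality_level_for_sort", "min"))

scored["quality"] = scored["quality_level"].map(level_labels)
print(scored)
```

Taking the minimum means one low-level problem determines the provision's score, however many higher-level criteria it also fails.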
