# Provision data quality report
**Author**:  Greg Slater <br>
**Date**:  November 2024 <br>
**Dataset Scope**: ODP datasets <br>
**Report Type**: Ad-hoc <br>

**Purpose**: This report measures the quality of the data behind each data provision on the platform. It applies a data quality framework whose criteria must be met in order to reach one of four quality levels. The levels are based on the quality requirements of the ODP software, which uses platform data.

Note: this report only covers active resources of active endpoints. Where an endpoint has been retired, data for a provision may still appear on the platform but will not appear in this report. As a result, the "Dataset quality scoring detail" table shows 68 ODP provisions rather than the full 73: for 5 ODP providers we have neither endpoints nor HE conservation-area data, so there are no issues to display.

Future improvements:
* Error handling. A failing query can break parts of the report. This is low priority while the report is still a proof of concept.
* Base tables. Expand the summaries to cover every ODP provision, including those with no data at all. This could be done by building the `qual_cat_summary` table from the provision table, rather than from `qual_all` (which only includes provisions with quality issues).
* More quality checks. This depends on more checks going live in the issues or expectations tables, but once they are live it should be easy to add extra criteria checks through the `qual_` table structure.
* Data from old endpoints. This will need the base table query (from `fi.get_endpoint_res_issues()`) to be reworked to include old endpoints and resources. It will also add complexity in working out which are the "latest" endpoints and resources to include, especially for provisions with multiple endpoints. May be low priority.
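
The "Base tables" improvement above could look roughly like the sketch below. It is only an illustration on made-up data: `provision_toy` and `qual_all_toy` are hypothetical stand-ins for the real provision table and `qual_all`, and the organisation codes and category name are invented.

```python
import pandas as pd

# Hypothetical stand-in for the full provision table (one row per expected provision)
provision_toy = pd.DataFrame({
    "organisation": ["org:A", "org:A", "org:B"],
    "pipeline": ["tree", "conservation-area", "tree"],
})

# Hypothetical stand-in for qual_all (only provisions with quality rows appear)
qual_all_toy = pd.DataFrame({
    "organisation": ["org:A"],
    "pipeline": ["tree"],
    "quality_category": ["example category"],
})

# Start from the provision table so provisions with no quality rows are kept
base_toy = provision_toy.merge(
    qual_all_toy,
    how="left",
    on=["organisation", "pipeline"],
    indicator=True,  # adds a "_merge" column recording where each row came from
)
base_toy["has_quality_rows"] = base_toy["_merge"] == "both"
print(base_toy)
```

Every expected provision survives the join, and `has_quality_rows` separates provisions with quality records from those with none.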


### Data quality framework  
The table below visualises the framework that is used to assign a quality level to each ODP data provision. 

The criteria marked as "true" at each level must be met by a data provision in order for it to be scored at that level. The levels are cumulative, so all criteria must be met in order for a provision to be scored as *data that is trustworthy*. Where we have data from alternative providers (e.g. Historic England conservation-area data) the first criterion cannot be met, so the provision is scored at the first quality level, *some data*.

![quality framework table](quality-framework-table.png)
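
As a rough illustration of the cumulative scoring described above, the sketch below assigns a level to a provision from a set of pass/fail criteria. The criterion names and the levels they are attached to are hypothetical placeholders, not the real framework criteria (those are in the table above), and this is not how `ft.make_score_summary_table` is implemented.

```python
# Hypothetical criteria: name -> the lowest level at which the criterion is required
CRITERIA_LEVELS = {
    "authoritative_source": 2,
    "no_odp_blocking_issues": 3,
    "no_trust_issues": 4,
}


def score_provision(criteria_met):
    """Return the highest level for which all criteria up to that level are met."""
    level = 1  # any provision with data scores at least "some data"
    for candidate in (2, 3, 4):
        required = [name for name, lvl in CRITERIA_LEVELS.items() if lvl <= candidate]
        if all(criteria_met.get(name, False) for name in required):
            level = candidate
        else:
            break  # levels are cumulative, so stop at the first failure
    return level


# Data from an alternative provider fails the first criterion, whatever else it meets
alt_provider = {"authoritative_source": False, "no_odp_blocking_issues": True, "no_trust_issues": True}
# Authoritative data that only fails a level 4 criterion
lpa_provider = {"authoritative_source": True, "no_odp_blocking_issues": True, "no_trust_issues": False}

print(score_provision(alt_provider), score_provision(lpa_provider))
```
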

In [1]:
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from datetime import datetime

td = datetime.today().strftime('%Y-%m-%d')

In [None]:
def save_util_file(file_name):

    if not os.path.isfile(file_name):
        url = f"https://raw.githubusercontent.com/digital-land/jupyter-analysis/refs/heads/main/service_report/quality_report/{file_name}"
        !wget {url}
        print(f"downloaded {file_name} from github")

    else:
        print("file available locally")

for f in ["functions_core.py", "functions_import.py", "functions_transform.py"]:
    save_util_file(f)

import functions_core as fc
import functions_import as fi
import functions_transform as ft

In [3]:
db_dir = "../../data/db_downloads/"
os.makedirs(db_dir, exist_ok=True)

output_dir = "../../data/quality_report/"
os.makedirs(output_dir, exist_ok=True)

## 1. Import

In [None]:
# performance db
fc.download_dataset("performance", db_dir, overwrite=True)
path_perf_db = os.path.join(db_dir, "performance.db")

# Issue quality criteria lookup
lookup_issue_qual = pd.read_csv("https://raw.githubusercontent.com/digital-land/jupyter-analysis/refs/heads/main/service_report/input/issue_type_quality.csv")

# Provision lookups
lookup_provision_odp = fi.get_odp_provision_lookup()
lookup_provision_odp.rename(columns={"dataset" : "pipeline"}, inplace=True)


# Dataset subset dict for chart
dataset_subset_dict = {
    "ODP" : ["conservation-area", "conservation-area-document", "article-4-direction-area", "article-4-direction", "listed-building-outline", "tree", "tree-preservation-zone", "tree-preservation-order"],
    "BFL" : ["brownfield-land"],
    "Developers" : ["developer-agreement", "developer-agreement-contribution", "developer-agreement-transaction"]
}

# Base table
ep_res_issues = fi.get_endpoint_res_issues(path_perf_db)


# Everything below is extra input for the conservation-area checks (authoritative source or not, count match)
# ---------------------------------------------------------------------------------------------------------
# Organisation lookups
lookup_org = fi.get_organisation_lookup()
lookup_org[["lpa_flag", "organisation_entity"]] = lookup_org[["lpa_flag", "organisation_entity"]].astype(int)

# Conservation area dataset - for non-auth issues
ca_gdf = fc.get_pdp_dataset("conservation-area", "point")
ca_gdf[["organisation_entity"]] = ca_gdf[["organisation_entity"]].astype(int)

# LPA boundaries
lpa_gdf = fc.get_pdp_dataset("local-planning-authority", "geometry")

# conservation area manual counts
ca_count_df = pd.read_csv("https://raw.githubusercontent.com/digital-land/conservation-area-data/refs/heads/main/data/conservation-area-count.csv")
ca_count_df.columns = [x.replace("-", "_") for x in ca_count_df.columns]
ca_count_df[["organisation_entity"]] = ca_count_df[["organisation_entity"]].astype(int)

## 2. Transform

In [6]:
# sort out CA and LPA tables for joining

# rename for easier joining
lpa_gdf.rename(
    columns = {
        'name':'lpa_name',
        'reference':'LPACD'
    },
    inplace=True)

# restrict LPAs to un-ended ones and join on organisation field
lpa_live_gdf = lpa_gdf[["LPACD", "geometry"]].merge(
    lookup_org[lookup_org["end_date"].isnull()][["LPACD", "organisation", "organisation_name", "organisation_entity"]],
    how = "inner",
    on = "LPACD"
)

# set up base table - will now include LPAs with no data, and the outer join keeps datasets provided by non-LPA organisations
base = lpa_live_gdf[["LPACD", "organisation"]].merge(
    ep_res_issues,
    how = "outer",
    on = "organisation"
)

# add lpa flag field to ca_gdf - used to calculate provenance
ca_gdf = ca_gdf.merge(
    lookup_org[["organisation_entity", "organisation_name", "lpa_flag"]],
    how = "left",
    on = "organisation_entity"
)


In [7]:
# PROVENANCE TABLE - flagging when conservation-area provisions are from alternative sources

qual_prov = ft.make_ca_provenance_issues_table(lpa_live_gdf, ca_gdf)

# CA MATCH CHECK TABLE - flagging when conservation-area counts per LPA don't match manual count

qual_match = ft.make_ca_count_match_issues_table(lpa_live_gdf, ca_gdf, ca_count_df)


# ISSUES TABLE - flagging when provisions have data quality issues

qual_issues = ft.make_issues_input_table(base, lookup_issue_qual)


# # FRESHNESS TABLE - flagging when provisions haven't been updated in last year - not included in quality framework for now

# # create table of old resources and flag quality level as 5
# ep_res_fresh_qual = ft.make_freshness_input_table(ep_res_issues, age_days = 365)


# ALL QUALITY CATEGORIES TABLE - joining all records of quality categories (provenance, count match & DQ issues) into one long table
# concat tables for each type
qual_all = pd.concat([qual_prov, qual_match, qual_issues])
# qual_all.head()

In [8]:
# # store functions & arguments that return quality calculation data as a list of tuples 
# qual_calc_functions = [
#     (ft.make_freshness_input_table, [ep_res_issues, 365]),
#     (ft.make_issues_input_table, [ep_res_issues, lookup_issue_qual])
# ]

# tables = [func(*args) for func, args in qual_calc_functions if isinstance(func(*args), pd.DataFrame)]
# print(len(tables))

In [None]:
level_map = {
    4: "4. data that is trustworthy",
    3: "3. data that is good for ODP",
    2: "2. authoritative data from the LPA",
    1: "1. some data"}


qual_summary = ft.make_score_summary_table(qual_all, level_map)
print(len(qual_summary))

## 3. Summarise

### ODP LPA x Dataset quality table

In [10]:

# subset to ODP and pivot
odp_lpa_summary = qual_summary.merge(
    lookup_provision_odp[["organisation", "pipeline", "cohort"]],
    how = "inner",
    on = ["organisation", "pipeline"]
)

odp_lpa_summary_wide = odp_lpa_summary.pivot(
    columns = "pipeline",
    values = "quality_level_label",
    index = ["cohort", "organisation", "organisation_name"]
).reset_index(
).sort_values(
    ["cohort", "organisation_name"]
)

odp_lpa_summary_wide.replace(np.nan, "0. no data", inplace=True)

In [11]:
# flag whether LPAs are "ready for ODP" (must have at least quality level 3 for all geography datasets)
# count and min quality of geography datasets for each provider
ready_for_odp_calc = qual_summary[qual_summary["pipeline"].isin(
    ["article-4-direction-area", "conservation-area", "listed-building-outline", "tree", "tree-preservation-zone"]
    )].groupby(
    ["organisation"], as_index=False
).agg(
    area_dataset_count = ("pipeline", "count"),
    min_quality_level = ("quality_level", "min")
)

# add flag - count == 5 means all datasets must be provided
ready_for_odp_calc["ready_for_ODP_adoption"] = np.where(
    (ready_for_odp_calc["area_dataset_count"] == 5) &
    (ready_for_odp_calc["min_quality_level"] >= 3),
    "yes", "no"
)

# add flag to summary wide table
odp_lpa_summary_wide = odp_lpa_summary_wide.merge(
    ready_for_odp_calc[["organisation", "ready_for_ODP_adoption"]],
    how = "left",
    on = "organisation"
)
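
To make the "ready for ODP" rule easier to check, here is a small sketch of the same groupby-and-flag pattern on made-up data. It follows the rule stated in the comment above (all five geography datasets provided, each at quality level 3 or higher). The organisations and quality levels are invented for illustration.

```python
import numpy as np
import pandas as pd

GEOGRAPHY_DATASETS = [
    "article-4-direction-area", "conservation-area", "listed-building-outline",
    "tree", "tree-preservation-zone",
]

# Made-up scores: org:A is complete and good, org:B has one level 2 dataset,
# org:C only provides two of the five datasets
toy_summary = pd.DataFrame({
    "organisation": ["org:A"] * 5 + ["org:B"] * 5 + ["org:C"] * 2,
    "pipeline": GEOGRAPHY_DATASETS * 2 + GEOGRAPHY_DATASETS[:2],
    "quality_level": [3, 4, 3, 3, 4] + [3, 2, 4, 4, 4] + [4, 4],
})

toy_calc = toy_summary.groupby("organisation", as_index=False).agg(
    area_dataset_count=("pipeline", "count"),
    min_quality_level=("quality_level", "min"),
)

# ready only if every geography dataset is present and none is below level 3
toy_calc["ready_for_ODP_adoption"] = np.where(
    (toy_calc["area_dataset_count"] == len(GEOGRAPHY_DATASETS))
    & (toy_calc["min_quality_level"] >= 3),
    "yes", "no",
)
print(toy_calc)
```
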

In [12]:
level_background_colours = {
    "4. data that is trustworthy" : "background-color: #1a6837",
    "3. data that is good for ODP" : "background-color: #87cb67",
    "2. authoritative data from the LPA" : "background-color: #fefebf",
    "1. some data" : "background-color: #f78c51"
    }

ready_flag_colours = {
        "yes" : "color:green"
    }

def make_color_mask_odp_lpa(df):
    # DataFrame with the same index and column names as the original, filled with empty strings
    df_color_map =  pd.DataFrame("", index=df.index, columns=df.columns)

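    # dataset columns only: skip the 3 id columns (cohort, organisation, organisation_name) and the trailing ready flag
    flag_slice = df.columns[3:-1]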
    for s in flag_slice:
        df_color_map[s] = df[s].map(level_background_colours)

    df_color_map["ready_for_ODP_adoption"] = df["ready_for_ODP_adoption"].map(ready_flag_colours)

    return df_color_map

# make_color_mask_odp_lpa(odp_lpa_summary)
# odp_lpa_summary.style.apply(make_color_mask_odp_lpa, axis=None)
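
The colour-mask pattern used above can be tried on a small table. The sketch below builds a mask of CSS strings with the same shape as the data. The toy table and the `level_cols` parameter are assumptions made for illustration, and `.fillna("")` is an optional extra that keeps unmapped labels (such as "0. no data") as empty strings rather than NaN.

```python
import pandas as pd

toy_colours = {
    "3. data that is good for ODP": "background-color: #87cb67",
    "1. some data": "background-color: #f78c51",
}

toy_wide = pd.DataFrame({
    "organisation_name": ["A council", "B council"],
    "tree": ["3. data that is good for ODP", "0. no data"],
    "conservation-area": ["1. some data", "3. data that is good for ODP"],
})


def toy_colour_mask(df, level_cols=("tree", "conservation-area")):
    # same index and columns as the data, with "" meaning "no styling"
    mask = pd.DataFrame("", index=df.index, columns=df.columns)
    for col in level_cols:
        # labels without a colour map to NaN, so replace those with ""
        mask[col] = df[col].map(toy_colours).fillna("")
    return mask


toy_mask = toy_colour_mask(toy_wide)
print(toy_mask)
```

In a notebook the mask function would be passed to the Styler in the same way as above, e.g. `toy_wide.style.apply(toy_colour_mask, axis=None)`, where `axis=None` means the function receives the whole DataFrame at once.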

### Dataset x quality categories table

In [13]:
# count issues by the quality category 
qual_cat_count = qual_all.groupby(
        ["pipeline", "organisation", "organisation_name", "quality_category"],
        as_index=False
    ).agg(
        n_issues = ("quality_level", "count")
    )

In [None]:
# create a base table with each quality category for each provision - this is so it can be pivoted correctly with all categories included
prov = qual_all[["pipeline", "organisation", "organisation_name"]].drop_duplicates()
prov["key"] = 1

qual_cat = qual_all[qual_all["quality_category"].notnull()][["quality_category"]].drop_duplicates()
qual_cat["key"] = 1

qual_cat_summary = prov.merge(
    qual_cat,
    how = "left",
    on = "key"
)
print(len(qual_cat_summary))

# left join on the counts to the base table
qual_cat_summary = qual_cat_summary.merge(
    qual_cat_count,
    how = "left",
    on = ['pipeline', 'organisation', 'organisation_name', 'quality_category']
)

# create boolean flag for each category: True when the provision has no issues in that category (criterion met), False otherwise
qual_cat_summary["issue_flag"] = np.where(qual_cat_summary["n_issues"] > 0, False, True)
print(len(qual_cat_summary))
# qual_cat_summary.head()
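
The cell above builds every provision and category combination by joining on a constant `key` column. As a minimal sketch on made-up data, the same grid can be produced with `how="cross"` in `DataFrame.merge` (available from pandas 1.2), which avoids the helper column. The table contents and category names here are invented.

```python
import pandas as pd

prov_toy = pd.DataFrame({
    "pipeline": ["tree", "tree"],
    "organisation": ["org:A", "org:B"],
})
cats_toy = pd.DataFrame({"quality_category": ["category 1", "category 2", "category 3"]})

# every provision paired with every quality category (2 x 3 = 6 rows)
grid_toy = prov_toy.merge(cats_toy, how="cross")

# issue counts exist only for combinations that actually have issues
counts_toy = pd.DataFrame({
    "pipeline": ["tree"],
    "organisation": ["org:A"],
    "quality_category": ["category 2"],
    "n_issues": [4],
})
grid_toy = grid_toy.merge(
    counts_toy, how="left", on=["pipeline", "organisation", "quality_category"]
)

# True when there are no issues in the category; a missing count compares as False to "> 0"
grid_toy["criterion_met"] = ~(grid_toy["n_issues"] > 0)
print(grid_toy)
```
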

In [15]:
# pivot quality category summary table so that quality categories are columns, join on overall quality level per provision
qual_cat_summary_wide = qual_cat_summary.pivot(
        columns = "quality_category",
        values = "issue_flag",
        index = ["pipeline", "organisation", "organisation_name"]
    ).reset_index(
    ).merge(
        qual_summary[["pipeline", "organisation", "quality_level_label"]],
        how = "left",
        on = ["pipeline", "organisation"]
    )

def get_dataset_qual_detail(dataset):
    # just subsets and styles main wide quality detail table

    qual_detail = qual_cat_summary_wide[qual_cat_summary_wide["pipeline"] == dataset].copy()

    return qual_detail.style.apply(make_color_mask_dataset_lpa, axis=None)


flag_colours = {
        True : "color:green",
        False : "color:red"
    }

def make_color_mask_dataset_lpa(df):
    # DataFrame with the same index and column names as the original, filled with empty strings
    df_color_map =  pd.DataFrame("", index=df.index, columns=df.columns)
    # turn label column into colours
    df_color_map["quality_level_label"] = df["quality_level_label"].map(level_background_colours)

    flag_slice = df.columns[3:-1]
    for s in flag_slice:
        df_color_map[s] = df[s].map(flag_colours)

    return df_color_map


# make widget

dataset_dropdown = widgets.Dropdown(
    options = dataset_subset_dict["ODP"],
    value = "article-4-direction",
    description = "Select Dataset: ",
)


### ODP quality maps

In [16]:
level_colours = {
    "4. data that is trustworthy" : "#1a6837",
    "3. data that is good for ODP" : "#87cb67",
    "2. authoritative data from the LPA" : "#fefebf",
    "1. some data" : "#f78c51",
    "0. no data" : "#eaeaea"
}

def map_odp_quality_scores(dataset):

    map_score = lpa_live_gdf.merge(
        qual_summary[qual_summary["pipeline"] == dataset][["LPACD", "pipeline", "quality_level_label"]],
        how = "left",
        on = "LPACD"
    )

    map_score["quality_level_label"] = map_score["quality_level_label"].replace(np.nan, "0. no data")
    map_score["colour"] = map_score["quality_level_label"].map(level_colours)
    map_score["geometry"] = map_score["geometry"].simplify(0.001)

    m = map_score.explore(
        tiles = "CartoDB positron",  # use "CartoDB positron" tiles
        popup = ["organisation_name", "pipeline", "quality_level_label"],  # add in field names to show in popups
        tooltip = False,
        color = map_score["colour"]
    )

    return display(m)


# make widget
odp_dataset_list = dataset_subset_dict["ODP"]

odp_dataset_dropdown = widgets.Dropdown(
    options = odp_dataset_list,
    value = "conservation-area",
    description = "Select Dataset: ",
)
# map_odp_quality_scores("tree")


### Chart

In [17]:
# VISUALISE

# color map to use in chart
cmap = plt.get_cmap('RdYlGn')
colors = [cmap(i / 4) for i in np.arange(1, 5)]

def make_quality_overview_chart(subset):
    """
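    Uses the qual summary table to display a stacked horizontal bar chart of the
    number of providers at each quality level, per dataset.

    subset: a key of dataset_subset_dict (e.g. "ODP")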
    """

    qual_summary_subset = qual_summary[qual_summary["pipeline"].isin(dataset_subset_dict[subset])]

    # count providers by dataset & quality level
    qual_chart = qual_summary_subset.groupby(["pipeline", "quality_level", "quality_level_label"], as_index=False).agg(
        n_providers = ("quality_level", "count")
    )

    qual_chart.sort_values(["pipeline", "quality_level_label"], inplace=True)
    qual_chart_wide = qual_chart.pivot(columns = "quality_level_label", values = "n_providers", index = "pipeline")
    
    qual_chart_wide.plot.barh(
        stacked = True, 
        color = colors, 
        figsize = (9, 6))

    # Add labels and title
    plt.xlabel('Count of providers')
    plt.ylabel('Dataset')
    plt.title('Quality levels for ODP datasets')
    plt.legend(title='Quality level')

    return plt.show()


subset_dropdown = widgets.Dropdown(
    options = ["ODP"],
    # value = dataset_list[0],
    description = "Select Dataset subset: ",
)

# widgets.interact(make_quality_overview_chart, subset = subset_dropdown)
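
One thing to watch with the chart above is that the pivot only creates a column for quality levels that actually occur in the subset, while the colour list always has four entries. The sketch below, on made-up data, shows one possible way to keep a fixed column order so that each level always lines up with the same colour. It is a suggestion under those assumptions rather than part of the report.

```python
import pandas as pd

ALL_LABELS = [
    "1. some data",
    "2. authoritative data from the LPA",
    "3. data that is good for ODP",
    "4. data that is trustworthy",
]

# Made-up summary in which level 2 never occurs
toy_scores = pd.DataFrame({
    "pipeline": ["tree", "tree", "tree", "conservation-area"],
    "quality_level_label": [ALL_LABELS[0], ALL_LABELS[2], ALL_LABELS[2], ALL_LABELS[3]],
})

toy_chart_wide = (
    toy_scores.groupby(["pipeline", "quality_level_label"])
    .size()                                     # providers per dataset and level
    .unstack(fill_value=0)                      # levels become columns
    .reindex(columns=ALL_LABELS, fill_value=0)  # add missing levels, fix the order
)
print(toy_chart_wide)
```

With all four columns always present, a four-colour list passed to `plot.barh` maps to the same level in every chart.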

## 4. Present

### Data quality overview chart - by dataset groups

In [None]:
widgets.interact(make_quality_overview_chart, subset = subset_dropdown)

### Data quality overview map - for ODP datasets

In [None]:
widgets.interact(map_odp_quality_scores, dataset = odp_dataset_dropdown)

### ODP LPA overview table by dataset & quality

In [None]:
odp_lpa_summary_wide.style.apply(make_color_mask_odp_lpa, axis=None)

### Dataset quality scoring detail table

In [None]:
widgets.interact(get_dataset_qual_detail, dataset = dataset_dropdown)

### Output
Save report files

In [22]:
fn = os.path.join(output_dir, f"quality_ODP-dataset-scores-by-LPA_{td}.xlsx")
odp_lpa_summary_wide.style.apply(make_color_mask_odp_lpa, axis=None).to_excel(fn, index = False)

In [23]:
fn = os.path.join(output_dir, f"quality_dataset-quality-detail_{td}.csv")
qual_cat_summary_wide.to_csv(fn, index = False)