## Setup

*You must run the cells in this section each time you connect to a new runtime. For example, when you return to the notebook after an idle timeout, when the runtime crashes, or when you restart or factory reset the runtime.*

Set the name of the spreadsheet to export results to:

In [None]:
spreadsheet_name = 'feedback_results'

Enter credentials:

In [None]:
import getpass

print('Enter your Kingfisher credentials')
user = input('Username:')
password = getpass.getpass('Password:')

Set up the notebook environment:

In [None]:
# Install Kingfisher Colab and required packages
%shell pip install --upgrade 'ocdskingfishercolab<0.4' psycopg2-binary > pip.log

# Import libraries and functions
from google.colab.data_table import DataTable
from ocdskingfishercolab import (
    list_source_ids,
    list_collections,
    render_json,
    set_spreadsheet_name,
    save_dataframe_to_sheet,
    save_dataframe_to_spreadsheet,
    set_search_path)
import pandas as pd

# Load https://pypi.org/project/ipython-sql/
%load_ext sql 

# Load https://colab.research.google.com/notebooks/data_table.ipynb
%load_ext google.colab.data_table
DataTable.max_columns = 50 # Increase max columns so that dataframes with many columns are rendered as data tables
DataTable.include_index = False # Remove the index from data tables for easier copy-pasting to Google Docs

# Set config
%config SqlMagic.autopandas = True  # Return Pandas DataFrames instead of regular result sets
%config SqlMagic.displaycon = False  # Don't show connection string after execute
%config SqlMagic.feedback = False  # Don't print number of rows affected by DML
set_spreadsheet_name(spreadsheet_name) # Set name of spreadsheet for exporting results

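# Connect to database
# URL-encode the credentials so that special characters (like @ or /) don't break the connection string
from urllib.parse import quote
connection_string = 'postgresql://' + quote(user, safe='') + ':' + quote(password, safe='') + '@postgres-readonly.kingfisher.open-contracting.org/ocdskingfisherprocess?sslmode=require'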
%sql $connection_string
  
# Install and set up plotting library
# Maybe this can also be moved to Kingfisher-Colab?
!pip install seaborn

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr

colab_dark_style = {'figure.facecolor': '#383838',
                       'axes.edgecolor': '#d5d5d5',
                       'axes.facecolor': '#383838',
                       'axes.labelcolor': '#d5d5d5',
                       'text.color': '#d5d5d5',
                       'xtick.color': '#d5d5d5',
                       'ytick.color': '#d5d5d5'}

sns.set_style('dark', colab_dark_style)

# Define function to apply number formatting to axis labels
# Maybe this can also be moved to Kingfisher-Colab?
# Needs updating to support other locales
def format_thousands(axis):
  axis.set_major_formatter(tkr.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x)))
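
The comment above notes that `format_thousands` does not yet support other locales. One possible approach, sketched here using only the standard library, is to format the number with Python's `,` grouping option and then swap the separators. The `format_number` helper and its parameters are hypothetical and are not part of Kingfisher Colab.

```python
def format_number(value, thousands=',', decimal='.', places=0):
    """Format a number with configurable separators (hypothetical helper)."""
    # Start from the same grouping format that format_thousands uses.
    text = '{:,.{places}f}'.format(value, places=places)
    # Swap the separators via a placeholder so that they don't overwrite each other.
    return text.replace(',', '\0').replace('.', decimal).replace('\0', thousands)


print(format_number(1234567.891))
print(format_number(1234567.891, thousands='.', decimal=',', places=2))
```

To apply it to an axis, the helper could be passed to `tkr.FuncFormatter` in place of the lambda, for example `tkr.FuncFormatter(lambda x, pos: format_number(x, thousands='.', decimal=','))`.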

## Choose collections and schema

*Use this section to choose the collections and schema that you want to query.*

### Set the collection(s)

Update `collection_ids` with the `id`(s) of the [Kingfisher Process collection(s)](https://kingfisher-process.readthedocs.io/en/latest/data-model.html#collections):

In [None]:
collection_ids = (2358, 2359)
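
The queries in this notebook pass `collection_ids` to SQL `IN` clauses, so the value needs to be a non-empty tuple. A single collection still needs a trailing comma, for example `(2358,)`. The sketch below shows one way to build such a tuple safely. It assumes that the database driver adapts Python tuples to SQL lists, as psycopg2 does, and `as_collection_ids` is a hypothetical helper that is not part of Kingfisher Colab.

```python
def as_collection_ids(*ids):
    """Return the given ids as a non-empty tuple of integers (hypothetical helper)."""
    if not ids:
        raise ValueError('at least one collection id is required')
    return tuple(int(i) for i in ids)


print(as_collection_ids(2358))        # a one-element tuple, with a trailing comma
print(as_collection_ids(2358, 2359))  # the same value as in the cell above
```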

If you don't know which collections you need, run the next cell and use the **Filter** button to filter the [collection table](https://kingfisher-process.readthedocs.io/en/latest/database-structure.html#collection-table) to find the collection(s). You can use the `source_id` column to filter on the `name` of the [Kingfisher Collect spider](https://kingfisher-collect.readthedocs.io/en/latest/spiders.html) used to collect the data. Use the value(s) from the `id` column to update the previous cell.

In [None]:
list_collections()

### Set the schema

Update `schema_name` with the name of the [Kingfisher Summarize schema](https://kingfisher-summarize.readthedocs.io/en/latest/index.html#how-it-works).

In [None]:
schema_name = 'view_data_collection_2358_2359'
set_search_path(schema_name)
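
The example value above suggests a naming pattern of `view_data_collection_` followed by the collection ids joined with underscores. Assuming that pattern holds (it is inferred from the example only, and schemas can also be given custom names), a candidate schema name could be derived from `collection_ids` as sketched below. `guess_schema_name` is a hypothetical helper.

```python
def guess_schema_name(collection_ids, prefix='view_data_collection_'):
    """Build a candidate schema name from collection ids (assumed naming pattern)."""
    return prefix + '_'.join(str(collection_id) for collection_id in collection_ids)


print(guess_schema_name((2358, 2359)))
```

Treat the result as a guess and confirm it against the `schema` column of the selected collections table before using it.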

If you don't know which schema you need, run the next cell and use the **Filter** button to filter the [selected collections table](https://kingfisher-summarize.readthedocs.io/en/latest/database.html#summaries-selected-collections) to find the schema. You can use the `collection_id` column to filter on the `id` of the collections that you identified in the previous step. Alternatively, you can filter on the `source_id` column. Use the value from the `schema` column to update the previous cell.

In [None]:
%%sql

SELECT
    summaries.selected_collections.*,
    source_id
FROM
    summaries.selected_collections
    JOIN collection ON summaries.selected_collections.collection_id = collection.id


If you can't find a schema containing the collections that you want to query, you can create a schema using Kingfisher Summarize's [add command](https://kingfisher-summarize.readthedocs.io/en/latest/cli.html#add).

## Check for data collection and processing errors

### Collection notes

Generate a list of notes for each collection from the `collection_note` table.
Users can add notes when starting a spider or via the CLI. Transforms can also add notes.

In [None]:
%%sql

SELECT
    collection_id,
    note
FROM
    collection_note
WHERE
    collection_id IN :collection_ids


### Collection file errors

Generate a summary of errors and warnings from the `collection_file` table.
Kingfisher Collect and the `local-load` command report errors when they cannot retrieve a file.
Kingfisher Process reports warnings when it needs to modify the contents of a file in order to store it.
Presently, the only warning is about the removal of control characters.

In [None]:
%%sql collection_file_error_summary <<

SELECT
    collection_id,
    warnings,
    errors,
    count(*),
    (array_agg(filename ORDER BY random()))[1:3] AS example_filenames,
    (array_agg(url ORDER BY random()))[1:3] AS example_urls
FROM
    collection_file
WHERE
    collection_id IN :collection_ids
    AND (errors IS NOT NULL
        OR warnings IS NOT NULL)
GROUP BY
    1,
    2,
    3
ORDER BY
    4 DESC;



In [None]:
collection_file_error_summary
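
For readers less familiar with SQL aggregation, the summary query groups files by collection, warning and error, counts each group, and keeps up to three randomly chosen examples per group. The sketch below reproduces that logic in plain Python. The rows, filenames and error messages are hypothetical stand-ins for the `collection_file` table.

```python
import random
from collections import defaultdict

# Hypothetical rows standing in for the collection_file table.
rows = [
    {'collection_id': 2358, 'filename': 'a.json', 'warnings': None, 'errors': 'HTTP 500'},
    {'collection_id': 2358, 'filename': 'b.json', 'warnings': None, 'errors': 'HTTP 500'},
    {'collection_id': 2358, 'filename': 'c.json', 'warnings': None, 'errors': 'HTTP 500'},
    {'collection_id': 2358, 'filename': 'd.json', 'warnings': None, 'errors': 'HTTP 500'},
    {'collection_id': 2359, 'filename': 'e.json', 'warnings': 'control characters removed', 'errors': None},
    {'collection_id': 2359, 'filename': 'f.json', 'warnings': None, 'errors': None},
]


def summarize_file_errors(rows, max_examples=3):
    """Count files per (collection, warning, error) and keep a few random examples."""
    groups = defaultdict(list)
    for row in rows:
        # Skip files without errors or warnings, like the WHERE clause does.
        if row['errors'] is None and row['warnings'] is None:
            continue
        key = (row['collection_id'], row['warnings'], row['errors'])
        groups[key].append(row['filename'])

    summary = []
    for (collection_id, warnings, errors), filenames in groups.items():
        summary.append({
            'collection_id': collection_id,
            'warnings': warnings,
            'errors': errors,
            'count': len(filenames),
            # Equivalent to (array_agg(filename ORDER BY random()))[1:3]
            'example_filenames': random.sample(filenames, min(max_examples, len(filenames))),
        })
    # Most frequent problems first, like ORDER BY 4 DESC.
    return sorted(summary, key=lambda item: item['count'], reverse=True)


summary = summarize_file_errors(rows)
for item in summary:
    print(item)
```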

Generate a full list of errors and warnings from the `collection_file` table.

In [None]:
%%sql collection_file_errors <<

SELECT
    collection_id,
    filename,
    warnings,
    url,
    errors
FROM
    collection_file
WHERE
    collection_id IN :collection_ids
    AND (errors IS NOT NULL
        OR warnings IS NOT NULL);



In [None]:
collection_file_errors

### Collection file item errors

Generate a summary of errors and warnings from the `collection_file_item` table.
Kingfisher Process reports errors when it cannot load a file item.

In [None]:
%%sql collection_file_item_error_summary <<

SELECT
    collection_id,
    cfi.warnings,
    cfi.errors,
    count(*)
FROM
    collection_file_item AS cfi
    JOIN collection_file AS cf ON cfi.collection_file_id = cf.id
WHERE
    cf.collection_id IN :collection_ids
    AND (cfi.errors IS NOT NULL
        OR cfi.warnings IS NOT NULL)
GROUP BY
    1,
    2,
    3
ORDER BY
    4 DESC;



In [None]:
collection_file_item_error_summary

Generate a full list of errors and warnings from the `collection_file_item` table.

In [None]:
%%sql collection_file_item_errors <<

SELECT
    cfi.number,
    cfi.warnings,
    cfi.errors
FROM
    collection_file_item AS cfi
    JOIN collection_file AS cf ON cfi.collection_file_id = cf.id
WHERE
    cf.collection_id IN :collection_ids
    AND (cfi.errors IS NOT NULL
        OR cfi.warnings IS NOT NULL);



In [None]:
collection_file_item_errors

### Check errors

Generate a summary of errors from the `release_check_error` and `record_check_error` tables.
CoVE reports errors when it cannot check a release or record.

In [None]:
%%sql check_error_summary <<

WITH errors AS (
    SELECT
        collection_id,
        'release' AS type,
        release.id AS id,
        release_check_error.error
    FROM
        release_check_error
        JOIN release ON release_check_error.release_id = release.id
    WHERE
        release.collection_id IN :collection_ids
    UNION
    SELECT
        collection_id,
        'record' AS type,
        record.id AS id,
        record_check_error.error
    FROM
        record_check_error
        JOIN record ON record_check_error.record_id = record.id
    WHERE
        record.collection_id IN :collection_ids
)
SELECT
    collection_id,
    type,
    error,
    count(*),
    -- The ids are release ids or record ids, depending on the type column
    (array_agg(id ORDER BY random()))[1:3] AS example_ids
FROM
    errors
GROUP BY
    1,
    2,
    3
ORDER BY
    4 DESC;



In [None]:
check_error_summary

Generate a full list of errors from the `release_check_error` and `record_check_error` tables.

In [None]:
%%sql check_errors <<

SELECT
    collection_id,
    'release' AS TYPE,
    release.id AS release_id,
    release_check_error.error
FROM
    release_check_error
    JOIN RELEASE ON release_check_error.release_id = release.id
WHERE
    release.collection_id IN :collection_ids
UNION
SELECT
    collection_id,
    'record' AS TYPE,
    record.id AS record_id,
    record_check_error.error
FROM
    record_check_error
    JOIN record ON record_check_error.record_id = record.id
WHERE
    record.collection_id IN :collection_ids;



In [None]:
check_errors

## Check scope



Use this section to check:

* how many releases, records and compiled releases your data contains
* what stages of the contracting process your data covers
* what date range your data covers

### Release and record counts

Collections in Kingfisher Process contain either [releases](https://standard.open-contracting.org/latest/en/schema/reference/), [records](https://standard.open-contracting.org/latest/en/schema/records_reference/) or [compiled releases](https://standard.open-contracting.org/latest/en/schema/records_reference/#compiled-release). Kingfisher generates compiled release collections from release or record collections.

Use this section to check that the data contains the expected number of releases, records and compiled releases.

Generate a count of releases, records and compiled releases for each collection. Where possible, you should check these numbers against the total number of results available in the front-end of the partner's data source.

In [None]:
%%sql

SELECT
    id AS collection_id,
    cached_releases_count AS releases_count,
    cached_records_count AS records_count,
    cached_compiled_releases_count AS compiled_releases_count
FROM
    collection
WHERE
    id IN :collection_ids


### Contracting process stages

Use this section to check that the data covers the expected stages of the contracting process.

#### Release tags

[Release tags](https://standard.open-contracting.org/latest/en/schema/codelists/#release-tag) indicate the stage of the contracting process an OCDS release relates to.

Generate a summary of releases by `tag`.

In [None]:
%%sql

SELECT
    collection_id,
    release_type,
    tag,
    count(*)
FROM
    release_summary
GROUP BY
    collection_id,
    release_type,
    tag
ORDER BY
    collection_id;



#### Objects per stage

In OCDS, data is organized into objects for each stage of a contracting process. Each compiled release has at most one `Planning` object, at most one `Tender` object, any number of `Award` objects, and any number of `Contract` objects. Each `Contract` object has at most one `Implementation` object. As such, the number of `Award` objects can exceed the number of unique OCIDs, but the number of `Tender` objects can't.
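
To make the counting rules concrete, here is a minimal sketch that counts the objects per stage in a single compiled release represented as a Python dictionary. The compiled release, its OCID and the `count_objects_per_stage` helper are hypothetical and for illustration only.

```python
def count_objects_per_stage(compiled_release):
    """Count the objects for each stage in one compiled release (hypothetical helper)."""
    contracts = compiled_release.get('contracts', [])
    return {
        # Planning and tender are single objects: 0 or 1 per compiled release.
        'planning': int('planning' in compiled_release),
        'tender': int('tender' in compiled_release),
        # Awards and contracts are arrays: any number per compiled release.
        'awards': len(compiled_release.get('awards', [])),
        'contracts': len(contracts),
        # Implementation is a single object nested inside each contract.
        'implementation': sum(1 for contract in contracts if 'implementation' in contract),
    }


# Hypothetical compiled release with one tender, two awards and two contracts.
compiled_release = {
    'ocid': 'ocds-213czf-000-00001',
    'tender': {'id': 't1'},
    'awards': [{'id': 'a1'}, {'id': 'a2'}],
    'contracts': [
        {'id': 'c1', 'awardID': 'a1', 'implementation': {'transactions': []}},
        {'id': 'c2', 'awardID': 'a2'},
    ],
}

counts = count_objects_per_stage(compiled_release)
print(counts)
```

Summing these counts over every compiled release in a collection gives the kind of totals that this section reports.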

Generate and plot a count of objects per stage:

In [None]:
%%sql objects_per_stage << 

SELECT
    CASE WHEN paths.path = 'contracts/implementation' THEN
        'implementation'
    ELSE
        paths.path
    END AS stage,
    CASE WHEN paths.path IN ('planning',
        'tender',
        'contracts/implementation') THEN
        GREATEST (object_property,
            0)
    ELSE
        GREATEST (array_count,
            0)
    END AS object_count
FROM (
    SELECT
        unnest(ARRAY['planning', 'tender', 'awards', 'contracts', 'contracts/implementation']) AS path) AS paths
    LEFT JOIN (
        SELECT
            *
        FROM
            field_counts
        WHERE
            collection_id IN :collection_ids
            AND release_type = 'compiled_release'
            AND path IN ('planning', 'tender', 'awards', 'contracts', 'contracts/implementation')) AS field_counts USING (path)


In [None]:
objects_per_stage_chart = sns.catplot(x="stage", y="object_count", kind="bar", data=objects_per_stage).set_xticklabels(rotation=90)

for ax in objects_per_stage_chart.axes.flat:
  format_thousands(ax.yaxis)

objects_per_stage

### Date ranges


Use this section to check that the data covers the expected date range.

Generate a summary of the earliest and latest `date`, `awards/date` and `contracts/dateSigned`.

In [None]:
%%sql

SELECT
    collection_id,
    release_type,
    'release_date' AS date_type,
    min(date) AS min,
    max(date) AS max
FROM
    release_summary
GROUP BY
    collection_id,
    release_type,
    date_type
UNION ALL
SELECT
    collection_id,
    release_type,
    'award_date' AS date_type,
    min(first_award_date) AS min,
    max(last_award_date) AS max
FROM
    release_summary
GROUP BY
    collection_id,
    release_type,
    date_type
UNION ALL
SELECT
    collection_id,
    release_type,
    'contract_datesigned' AS date_type,
    min(first_contract_datesigned) AS min,
    max(last_contract_datesigned) AS max
FROM
    release_summary
GROUP BY
    collection_id,
    release_type
ORDER BY
    collection_id,
    release_type,
    date_type;



### Release date distribution

Use this section to check that releases are distributed as expected.

Plot the count of releases per month:

In [None]:
%%sql release_dates <<

SELECT
    collection_id::text,
    release_type,
    date,
    count(*) AS release_count
FROM
    release_summary rs
WHERE
    collection_id IN :collection_ids
GROUP BY
    collection_id,
    release_type,
    date
ORDER BY
    date ASC;



In [None]:
# Resample by month, keeping the result in a new dataframe so that this cell can be re-run
release_dates_by_month = (
    release_dates.set_index('date')
    .groupby(['collection_id', 'release_type'])
    .resample('M')['release_count']
    .sum()
    .reset_index())

fig, ax = plt.subplots(figsize = [15,5])
sns.lineplot(data = release_dates_by_month, x='date', y='release_count', hue = 'collection_id', style = 'release_type')

format_thousands(ax.yaxis)
sns.despine()
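
Resampling by month means adding up the per-date release counts that fall within the same calendar month. The sketch below shows the same idea using only the standard library. The dates and counts are hypothetical, and `count_by_month` is an illustrative helper rather than part of the notebook.

```python
from collections import Counter
from datetime import date

# Hypothetical (date, release_count) pairs, like the rows returned by the query.
daily_counts = [
    (date(2020, 1, 5), 10),
    (date(2020, 1, 20), 5),
    (date(2020, 2, 1), 7),
]


def count_by_month(daily_counts):
    """Sum release counts per (year, month)."""
    monthly = Counter()
    for day, count in daily_counts:
        monthly[(day.year, day.month)] += count
    return dict(sorted(monthly.items()))


monthly_counts = count_by_month(daily_counts)
print(monthly_counts)
```

One difference to keep in mind: this sketch omits months without any releases, whereas the pandas resampling includes them with a sum of zero, which shows up as gaps dropping to zero in the chart.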

### Extensions 

Use this section to check which extensions the data uses.

Generate a list of extensions declared in the package metadata:

In [None]:
%%sql

SELECT
    collection_id,
    release_type,
    jsonb_array_elements(package_data -> 'extensions') AS EXTENSION,
    count(*) AS count
FROM
    release_summary
WHERE
    collection_id IN :collection_ids
    AND package_data IS NOT NULL
GROUP BY
    collection_id,
    release_type,
    EXTENSION
ORDER BY
    collection_id,
    release_type,
    count DESC;
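
The query above expands the `extensions` array in each package's metadata into one row per extension and then counts the rows. A minimal Python sketch of the same logic follows. The package metadata and extension URLs are hypothetical.

```python
from collections import Counter

# Hypothetical package metadata; the extension URLs are placeholders.
packages = [
    {'extensions': ['https://example.com/extension_a/extension.json',
                    'https://example.com/extension_b/extension.json']},
    {'extensions': ['https://example.com/extension_a/extension.json']},
    {},  # a package that declares no extensions contributes no rows
]


def count_extensions(packages):
    """Count how many packages declare each extension."""
    return Counter(
        url
        for package in packages
        for url in package.get('extensions', [])
    )


extension_counts = count_extensions(packages)
for url, count in extension_counts.most_common():
    print(count, url)
```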



## Check for structure and format errors

Kingfisher Process checks data against the OCDS schema using [CoVE](https://github.com/OpenDataServices/cove). For release collections, Kingfisher Process stores check results in the `release_check` table. For record collections, Kingfisher Process stores check results in the `record_check` table.

### Confirm that checks are complete

By default, Kingfisher Process checks all data, so there is often a long queue of collections waiting to be checked. Use the following query to confirm that checks are complete for your collection(s).

If checks for your collection(s) have not started yet, you can use the [`check collection` command](https://kingfisher-process.readthedocs.io/en/latest/cli/check-collection.html) to start the checks manually.

If checks are in progress, you should wait for the checks to finish before running the queries in this section.

In [None]:
%%sql

SELECT
    collection_id,
    'release' AS collection_type,
    CASE WHEN count(release.id) = count(release_check.id) THEN
        'complete'
    WHEN count(release_check.id) = 0 THEN
        'not_started'
    ELSE
        'in_progress'
    END AS check_status,
    count(release_check.id)::text || '/' || count(release.id)::text AS check_progress
FROM
    release_check
    RIGHT JOIN release ON release_check.release_id = release.id
WHERE
    collection_id IN :collection_ids
GROUP BY
    collection_id
UNION
SELECT
    collection_id,
    'record' AS collection_type,
    CASE WHEN count(record.id) = count(record_check.id) THEN
        'complete'
    WHEN count(record_check.id) = 0 THEN
        'not_started'
    ELSE
        'in_progress'
    END AS check_status,
    count(record_check.id)::text || '/' || count(record.id)::text AS check_progress
FROM
    record_check
    RIGHT JOIN record ON record_check.record_id = record.id
WHERE
    collection_id IN :collection_ids
GROUP BY
    collection_id;
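
The `CASE` expression in the query above can be read as a simple decision rule based on two numbers: how many releases (or records) a collection has, and how many of them have a check result. The sketch below expresses that rule in Python. The function names are illustrative only.

```python
def check_status(total, checked):
    """Mirror the CASE expression: compare the item count with the checked count."""
    if total == checked:
        return 'complete'
    if checked == 0:
        return 'not_started'
    return 'in_progress'


def check_progress(total, checked):
    """Mirror the check_progress column, for example '40/100'."""
    return f'{checked}/{total}'


for total, checked in [(100, 100), (100, 0), (100, 40)]:
    print(check_status(total, checked), check_progress(total, checked))
```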



### Error summary

Generate a summary of errors from the `release_check` and `record_check` tables.

In [None]:
%%sql structure_and_format_error_summary <<

WITH errors AS (
    SELECT
        collection_id,
        errors ->> 'type' AS error_type,
        left(errors ->> 'description', 49000) AS error,
        ocid,
        errors ->> 'value' AS value,
        row_number() OVER (
            PARTITION BY collection_id, errors ->> 'type', left(errors ->> 'description', 49000)
        ) AS rownum
    FROM
        release_check rc
        CROSS JOIN jsonb_array_elements(cove_output -> 'validation_errors') AS errors
        JOIN release r ON rc.release_id = r.id
    WHERE
        collection_id IN :collection_ids
    UNION ALL
    SELECT
        collection_id,
        errors ->> 'type' AS error_type,
        left(errors ->> 'description', 49000) AS error,
        ocid,
        errors ->> 'value' AS value,
        row_number() OVER (
            PARTITION BY collection_id, errors ->> 'type', left(errors ->> 'description', 49000)
        ) AS rownum
    FROM
        record_check rc
        CROSS JOIN jsonb_array_elements(cove_output -> 'validation_errors') AS errors
        JOIN record r ON rc.record_id = r.id
    WHERE
        collection_id IN :collection_ids
),
examples AS (
    SELECT
        collection_id,
        error_type,
        error,
        array_agg(ocid) AS example_ocids,
        array_agg(value) AS example_values
    FROM
        errors
    WHERE
        rownum <= 3
    GROUP BY
        collection_id,
        error_type,
        error
)
SELECT
    collection_id,
    error_type,
    error,
    count(*) AS count,
    example_ocids,
    example_values
FROM
    errors
    JOIN examples USING (collection_id, error_type, error)
GROUP BY
    collection_id,
    error_type,
    error,
    example_ocids,
    example_values;



In [None]:
structure_and_format_error_summary
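
The summary query does three things: it flattens the `validation_errors` array of each check result, groups the errors by collection, type and description (truncated to 49,000 characters), and keeps up to three example OCIDs and values per group. The sketch below reproduces those steps in plain Python. The check results, OCIDs and error messages are hypothetical.

```python
from collections import defaultdict

MAX_LENGTH = 49000  # same truncation length as in the query

# Hypothetical check results: four releases share one error, a fifth has another.
checks = [
    {'collection_id': 2358, 'ocid': f'ocds-example-{n}', 'validation_errors': [
        {'type': 'required', 'description': "'id' is missing but required", 'value': ''},
    ]}
    for n in range(1, 5)
] + [
    {'collection_id': 2358, 'ocid': 'ocds-example-5', 'validation_errors': [
        {'type': 'date-time', 'description': 'Date is not in the correct format', 'value': '2020-01-01'},
    ]},
]


def summarize_validation_errors(checks, max_examples=3):
    """Count errors per (collection, type, description) and keep a few examples."""
    groups = defaultdict(lambda: {'count': 0, 'example_ocids': [], 'example_values': []})
    for check in checks:
        for error in check['validation_errors']:
            key = (check['collection_id'], error['type'], error['description'][:MAX_LENGTH])
            group = groups[key]
            group['count'] += 1
            # Keep only the first few examples, like the rownum <= 3 filter.
            if len(group['example_ocids']) < max_examples:
                group['example_ocids'].append(check['ocid'])
                group['example_values'].append(error.get('value'))
    return dict(groups)


error_summary = summarize_validation_errors(checks)
for key, group in error_summary.items():
    print(key, group)
```

The sketch keeps the first three examples it encounters, whereas the SQL query numbers rows without an explicit order, so the examples it returns are arbitrary.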

### Error details

Generate a full list of errors from the `release_check` and `record_check` tables.

In [None]:
%%sql structure_and_format_errors <<

SELECT
    collection_id,
    'release' AS collection_type,
    errors ->> 'type' AS error_type,
    errors ->> 'field' AS field,
    left(errors ->> 'description', 49000) AS error,
    ocid,
    errors ->> 'value' AS value
FROM
    release_check rc
    CROSS JOIN jsonb_array_elements(cove_output -> 'validation_errors') AS errors
    JOIN RELEASE r ON rc.release_id = r.id
WHERE
    collection_id IN :collection_ids
UNION ALL
SELECT
    collection_id,
    'record' AS collection_type,
    errors ->> 'type' AS error_type,
    errors ->> 'field' AS field,
    left(errors ->> 'description', 49000) AS error,
    ocid,
    errors ->> 'value' AS value
FROM
    record_check rc
    CROSS JOIN jsonb_array_elements(cove_output -> 'validation_errors') AS errors
    JOIN record r ON rc.record_id = r.id
WHERE
    collection_id IN :collection_ids


In [None]:
structure_and_format_errors