### Check an endpoint
This notebook runs the pipeline against a given endpoint to check whether it will process successfully. It covers the collection, pipeline and dataset stages, and summarises whether the endpoint would work on our platform. It downloads all the data it needs to do this, so it can be disk intensive. You'll need to provide the following information:
- collection - the collection that the dataset belongs to. This could be extracted from the specification, but the notebook asks for it in case you want to test the pipeline on something that isn't included in the main site yet
- dataset - the dataset that the endpoint is meant to provide data for. Technically an endpoint can serve multiple datasets, but for this use case it should be just one. It is also the name of the pipeline that is run on the individual resources downloaded from the endpoint; the dataset and pipeline names are often the same
- organisation - the organisation identifier to be used for the endpoint
- endpoint url - the URL of the endpoint itself
- plugin - the plugin used to download the data; this is only needed for specific endpoints
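
Before running the workflow it can help to sanity-check these inputs. The sketch below is not part of the notebook's helper modules; `check_inputs` is a hypothetical helper, and it assumes that collection names end in `-collection` and that organisation identifiers take the form `prefix:reference`, as in the examples used in this notebook.

```python
from urllib.parse import urlparse


def check_inputs(collection_name, dataset, organisation, endpoint_url):
    """Return a list of problems found in the inputs; an empty list means none were spotted."""
    problems = []
    # Assumption: collection names follow the '<name>-collection' pattern.
    if not collection_name.endswith("-collection"):
        problems.append(f"collection '{collection_name}' does not end with '-collection'")
    if not dataset:
        problems.append("dataset is empty")
    # Assumption: organisation identifiers look like 'prefix:reference'.
    prefix, separator, reference = organisation.partition(":")
    if not (prefix and separator and reference):
        problems.append(f"organisation '{organisation}' is not of the form 'prefix:reference'")
    # The endpoint must be an absolute http(s) URL.
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"endpoint url '{endpoint_url}' is not an absolute http(s) URL")
    return problems
```

An empty list means the values are at least well formed; it does not confirm that the endpoint exists or returns data.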

If you see errors regarding digital-land, try:

`pip install -e git+https://github.com/digital-land/pipeline.git#egg=digital-land`

And restart the notebook.
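
One way to tell whether the package is visible to the notebook kernel is to ask Python whether it can locate the module. This is a small sketch using the standard library; `is_importable` is a hypothetical helper name, and `digital_land` is the module name used in the imports of this notebook.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Prints False if the digital-land package is not installed in this environment.
print(is_importable("digital_land"))
```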

In [1]:
import os
import pandas as pd
import urllib
from functions import run_endpoint_workflow, missing_columns
from sqlite_query_functions import DatasetSqlite
from convert_functions import convert_resource
from digital_land.collection import Collection
from data_file import get_duplicates_between_orgs
from download_data import download_dataset

In [2]:
# Extend these lists as/when you need to add other collections

# collection_name = 'article-4-direction-collection'
# collection_name = 'brownfield-land-collection'
# collection_name = 'conservation-area-collection'
# collection_name = 'flood-risk-zone-collection'
collection_name = 'listed-building-collection'
# collection_name = 'tree-preservation-order-collection'

# dataset = 'article-4-direction'
# dataset = 'article-4-direction-area'
# dataset = 'brownfield-land'
# dataset = 'conservation-area'
# dataset = 'conservation-area-document'
# dataset = 'flood-risk-zone'
dataset = 'listed-building-outline'
# dataset = 'tree'
# dataset = 'tree-preservation-order'
# dataset = 'tree-preservation-zone'

# additional_column_mappings=None
# additional_concats=None

# plugin = None
# plugin = 'arcgis'
# plugin = 'wfs'
 
# EXAMPLE / TEST DATA HERE

# collection_name = 'brownfield-land-collection'
# dataset = 'brownfield-land'
organisation = 'local-authority-eng:SAL'
endpoint_url = 'https://www.stalbans.gov.uk/sites/default/files/documents/publications/planning-building-control/Agile/Listed_Buildings_Dataset.json'
documentation_url = "https://opendata.camden.gov.uk/Environment/Brownfield-Land-Register/izhm-jdrx/about_data"
start_date="2024-01-31"
plugin = ''
reference_column = ""

additional_column_mappings=None
additional_concats=None

# generic data_dir setting
data_dir = '../data/endpoint_checker'

# example of additional configs
# additional_concats = [{
#     'dataset':'tree-preservation-zone',
#     'endpoint':'de1eb90a8b037292ef8ae14bfabd1184847ef99b7c6296bb6e75379e6c1f9572',
#     'resource':'e6b0ccaf9b50a7f57a543484fd291dbe43f52a6231b681c3a7cc5e35a6aba254',
#     'field':'reference',
#     'fields':'REFVAL;LABEL',
#     'separator':'/'
# }]


In [3]:
run_endpoint_workflow(
    collection_name,
    dataset,
    organisation,
    endpoint_url,
    plugin,
    data_dir,
    additional_col_mappings=additional_column_mappings,
    additional_concats=additional_concats
)

HTTP Error 404: Not Found
Hello
../data/endpoint_checker/collection
Hello again
[{'resource': 'e28b118b256247283cca33d03a2cfddbc89c01478ae05bf52020e963687a2b18', 'bytes': '', 'endpoints': '4ecb586385c9fbdaec5df01e81f63ea222d2a7e79c9209d4253e7ffc6fca20f0', 'organisations': 'local-authority-eng:SAL', 'datasets': 'listed-building-outline', 'start-date': '2024-02-21', 'end-date': ''}]




#### Collection log summaries

We need to establish whether a resource was downloaded from the endpoint and whether there were any issues during the collection process. Examine the output of the cells below. There should be one log entry for the attempt made at downloading from the endpoint. If the status code is 200, the resource was downloaded successfully.

In [4]:
try:
    # Initialize the collection from the specified directory
    collection = Collection(os.path.join(data_dir,'collection'))
    
    # Load the collection
    collection.load(directory=os.path.join(data_dir,'collection'))
    
    # Access the records of a resource within the collection
    collection.resource.records
    
    print("Download has succeeded.")
except Exception as e:
    # If anything goes wrong, print the error and status code
    print(f"Download failed with error: {e}")
    if hasattr(e, 'status'):
        print(f"Status code: {e.status}")


Download successful 


In [5]:
logs = collection.log.entries
logs = pd.DataFrame.from_records(logs)
logs


Unnamed: 0,bytes,content-type,elapsed,endpoint,resource,status,entry-date,start-date,end-date,exception
0,,,0.327,4ecb586385c9fbdaec5df01e81f63ea222d2a7e79c9209...,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,200,2024-02-21T12:41:55.807203,,,
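
When there are several log entries, it can be quicker to pick out the failed attempts than to read the whole table. The sketch below assumes each log entry is a dictionary of strings with `status` and `exception` keys, matching the columns shown above; `failed_attempts` and the example entries are hypothetical.

```python
def failed_attempts(log_entries):
    """Return the log entries that did not finish with a 200 status or that recorded an exception."""
    failures = []
    for entry in log_entries:
        status = str(entry.get("status", "")).strip()
        if status != "200" or entry.get("exception"):
            failures.append(entry)
    return failures


# Hypothetical entries for illustration only.
example_logs = [
    {"endpoint": "endpoint-a", "status": "200", "exception": ""},
    {"endpoint": "endpoint-b", "status": "404", "exception": ""},
    {"endpoint": "endpoint-c", "status": "", "exception": "ConnectionError"},
]

for entry in failed_attempts(example_logs):
    print(entry["endpoint"], entry["status"], entry["exception"])
```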


### Check unassigned entities
Detect and assign entity numbers where entries are currently unassigned. Examine the list below to see which entities (if any) have been assigned. If you were to include these in an actual pipeline, you would need to update the lookup.csv configuration with these values. It's worth checking they are sensible before this happens.

In [6]:
unassigned_entries = pd.read_csv(os.path.join(data_dir,'var','cache','unassigned-entries.csv'))
if len(unassigned_entries) == 0:
    print('No additional entity numbers required')
else:
    print(F"{len(unassigned_entries)} unassigned entities\n")
    print(unassigned_entries)

839 unassigned entities

                organisation                   prefix  reference
0    local-authority-eng:SAL  listed-building-outline    1295678
1    local-authority-eng:SAL  listed-building-outline    1175422
2    local-authority-eng:SAL  listed-building-outline    1102888
3    local-authority-eng:SAL  listed-building-outline    1295340
4    local-authority-eng:SAL  listed-building-outline    1347243
..                       ...                      ...        ...
834  local-authority-eng:SAL  listed-building-outline    1347202
835  local-authority-eng:SAL  listed-building-outline    1102969
836  local-authority-eng:SAL  listed-building-outline    1174580
837  local-authority-eng:SAL  listed-building-outline    1347203
838  local-authority-eng:SAL  listed-building-outline    1459132

[839 rows x 3 columns]
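
As one possible way of checking that the unassigned entries are sensible, the sketch below looks for blank references and for repeated organisation/prefix/reference combinations. `check_unassigned` is a hypothetical helper, not part of the pipeline; it assumes rows are dictionaries with the three columns shown above, and the example rows reuse references from that output.

```python
from collections import Counter


def check_unassigned(rows):
    """Return (blank_rows, duplicate_keys) for rows with organisation, prefix and reference."""
    # Rows whose reference is an empty or whitespace-only string.
    blank_rows = [row for row in rows if not str(row.get("reference", "")).strip()]
    # Count each (organisation, prefix, reference) combination.
    counts = Counter(
        (row["organisation"], row["prefix"], str(row["reference"])) for row in rows
    )
    duplicate_keys = [key for key, count in counts.items() if count > 1]
    return blank_rows, duplicate_keys


example_rows = [
    {"organisation": "local-authority-eng:SAL", "prefix": "listed-building-outline", "reference": 1295678},
    {"organisation": "local-authority-eng:SAL", "prefix": "listed-building-outline", "reference": 1295678},
    {"organisation": "local-authority-eng:SAL", "prefix": "listed-building-outline", "reference": ""},
]
blank, duplicated = check_unassigned(example_rows)
print(len(blank), duplicated)
```

With the DataFrame loaded in the cell above, the rows could be supplied as `unassigned_entries.to_dict("records")`.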


#### Check logs collated from the pipeline process
We need to read the logs and check that the data points were all read in correctly. This uses the sqlite database with some custom queries. You could examine the CSVs directly if the pipeline fails.

First, check the column mappings to see what columns the pipeline automatically mapped. If this is empty or has missing values, then it's likely to be the reason data isn't appearing at the end.

In [7]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_column_mappings()
df = pd.read_csv('https://raw.githubusercontent.com/digital-land/specification/main/specification/dataset-field.csv')
expected_columns = df.groupby('dataset')['field'].apply(list).to_dict()
missing_columns(results, dataset, expected_columns)
results

Missing columns for dataset 'listed-building-outline': address, point, entity, organisation, wikidata, document-url, description, wikipedia, address-text, prefix, documentation-url.



Unnamed: 0,end_date,entry_date,field,dataset,start_date,resource,column
0,,2024-02-21T12:42:16Z,geometry,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,WKT
1,,2024-02-21T12:42:16Z,end-date,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,end_date
2,,2024-02-21T12:42:16Z,entry-date,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,entry_date
3,,2024-02-21T12:42:16Z,listed-building,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,listed_building
4,,2024-02-21T12:42:16Z,listed-building-grade,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,listed_building_grade
5,,2024-02-21T12:42:16Z,name,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,name
6,,2024-02-21T12:42:16Z,notes,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,notes
7,,2024-02-21T12:42:16Z,reference,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,reference
8,,2024-02-21T12:42:16Z,start-date,listed-building-outline,,e28b118b256247283cca33d03a2cfddbc89c01478ae05b...,start_date
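
The idea behind the missing-columns check can be sketched as a comparison between the fields the specification expects and the fields that were actually mapped. This is not the implementation of `missing_columns` from the notebook's helper module, which is not shown here; `find_missing_fields` and the shortened field lists are illustrative only.

```python
def find_missing_fields(mapped_fields, expected_fields):
    """Return the expected fields that have no mapped column, in specification order."""
    mapped = set(mapped_fields)
    return [field for field in expected_fields if field not in mapped]


# A subset of the fields from the output above, used for illustration.
mapped = ["geometry", "end-date", "entry-date", "name", "notes", "reference", "start-date"]
expected = ["reference", "name", "geometry", "point", "organisation"]

print(find_missing_fields(mapped, expected))  # ['point', 'organisation']
```

Not every missing field is a problem: in the output above, fields such as `entity`, `organisation` and `prefix` are reported as missing from the mapping, yet the final dataset further down has them populated.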


## Issues Logs

List all of the issues and warnings. This could be improved in the future by examining the severity. For example 'OSGB' or 'default-value' issues are just warnings.

In [8]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_issues_by_type()
results

Unnamed: 0,issue_type,count,responsibility,severity
0,default-value,4637,internal,info
1,invalid geometry - fixed,3,external,warning
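
A minimal sketch of the severity-based filtering mentioned above follows. It assumes each issue summary is a dictionary with a `severity` key; the `info` and `warning` values come from the table above, while `error` and the ranking itself are assumptions made for illustration. `issues_at_or_above` is a hypothetical helper.

```python
# Assumed ordering of severities; only 'info' and 'warning' appear in the output above.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}


def issues_at_or_above(issues, minimum="warning"):
    """Return issues whose severity is at least `minimum`, most severe first."""
    threshold = SEVERITY_RANK[minimum]

    def rank(issue):
        # Unknown severities are kept by treating them as equal to the threshold.
        return SEVERITY_RANK.get(issue["severity"], threshold)

    selected = [issue for issue in issues if rank(issue) >= threshold]
    return sorted(selected, key=rank, reverse=True)


issue_summary = [
    {"issue_type": "default-value", "count": 4637, "severity": "info"},
    {"issue_type": "invalid geometry - fixed", "count": 3, "severity": "warning"},
]
for issue in issues_at_or_above(issue_summary):
    print(issue["issue_type"], issue["count"])
```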


## Look at a specific problem type in more detail

Take the issue_type from the above.

In [9]:
results = dataset_db.get_issues()

#problem = 'OSGB out of bounds of England'
#problem = 'invalid geometry'
# problem = 'Unexpected geom type'
problem = 'OSGB'

results.loc[(results.issue_type == problem) ]



Unnamed: 0,end_date,entry_date,entry_number,field,issue_type,line_number,dataset,resource,start_date,value,message
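
The filter in the cell above uses exact equality, so `problem = 'OSGB'` only matches rows whose issue_type is exactly `OSGB`. If the issue types are longer strings, as the commented-out examples suggest, a substring match may be more useful. The sketch below works on a list of dictionaries; `issues_matching` and the example rows are hypothetical.

```python
def issues_matching(issues, text):
    """Return issues whose issue_type contains `text`, ignoring case."""
    needle = text.lower()
    return [
        issue for issue in issues
        if needle in str(issue.get("issue_type", "")).lower()
    ]


example_issues = [
    {"issue_type": "OSGB out of bounds of England", "value": "a"},
    {"issue_type": "invalid geometry", "value": "b"},
    {"issue_type": "default-value", "value": "c"},
]
print(issues_matching(example_issues, "OSGB"))
```

On a pandas DataFrame, `Series.str.contains` with `regex=False` offers a comparable substring filter.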


#### Final dataset 

Shows the end result of the processing. You should see a decent number of these columns populated with data from the raw resources above.

In [10]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_entities()

final_length = len(results)

print("")
print (F"Final data contains {final_length} records")

results


Final data contains 833 records


Unnamed: 0,dataset,end_date,entity,entry_date,geojson,geometry,json,name,organisation_entity,point,prefix,reference,start_date,typology
0,listed-building-outline,,42125879,2004-06-01,,"MULTIPOLYGON (((-0.407211 51.827450,-0.407261 ...","{""listed-building"": ""1295678"", ""listed-buildin...","Turners Hall House & Rear Outbuilding, Annable...",278,POINT(-0.407362 51.827393),listed-building-outline,1295678,1953-10-19,geography
1,listed-building-outline,,42125880,2004-06-01,,"MULTIPOLYGON (((-0.363897 51.761415,-0.363871 ...","{""listed-building"": ""1175422"", ""listed-buildin...","The Pre Hotel, Redbourn Road, St Albans",278,POINT(-0.363904 51.761558),listed-building-outline,1175422,1984-09-05,geography
2,listed-building-outline,,42125881,2004-06-01,,"MULTIPOLYGON (((-0.411670 51.743107,-0.411589 ...","{""listed-building"": ""1102888"", ""listed-buildin...","West range of outbuildings, inc former pigsty,...",278,POINT(-0.411716 51.743165),listed-building-outline,1102888,1978-03-06,geography
3,listed-building-outline,,42125882,2004-06-01,,"MULTIPOLYGON (((-0.411590 51.742953,-0.411760 ...","{""listed-building"": ""1295340"", ""listed-buildin...","Corner Farmhouse, Hemel Hempstead Road, St Albans",278,POINT(-0.411755 51.742966),listed-building-outline,1295340,1978-03-06,geography
4,listed-building-outline,,42125883,2004-06-01,,"MULTIPOLYGON (((-0.411224 51.743066,-0.411323 ...","{""listed-building"": ""1347243"", ""listed-buildin...","L-plan range of outbuildings at Corner Farm, H...",278,POINT(-0.411281 51.742879),listed-building-outline,1347243,1978-03-06,geography
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
828,listed-building-outline,,42126713,2004-06-01,,"MULTIPOLYGON (((-0.404076 51.827962,-0.404085 ...","{""listed-building"": ""1347202"", ""listed-buildin...","Well House at Annables Manor, Annables Lane, K...",278,POINT(-0.404152 51.827978),listed-building-outline,1347202,1984-09-27,geography
829,listed-building-outline,,42126714,2004-06-01,,"MULTIPOLYGON (((-0.404347 51.828060,-0.404506 ...","{""listed-building"": ""1102969"", ""listed-buildin...","Annables Lodge, Annables Lane, Kinsbourne Green",278,POINT(-0.404380 51.828048),listed-building-outline,1102969,1984-09-27,geography
830,listed-building-outline,,42126715,2004-06-01,,"MULTIPOLYGON (((-0.404517 51.827833,-0.404379 ...","{""listed-building"": ""1174580"", ""listed-buildin...","Barn adjoining Annables Lodge, Annables Lane, ...",278,POINT(-0.404483 51.827935),listed-building-outline,1174580,1986-03-25,geography
831,listed-building-outline,,42126716,2004-06-01,,"MULTIPOLYGON (((-0.404418 51.827547,-0.404376 ...","{""listed-building"": ""1347203"", ""listed-buildin...",Barn to SW side at Annables Farm (Annables Hou...,278,POINT(-0.404508 51.827620),listed-building-outline,1347203,1984-09-27,geography
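
In the run shown in this notebook, 839 references were unassigned but the final dataset holds 833 records. One way to investigate a gap like this is to list the references that do not appear in the final data. The sketch below is a hypothetical helper working on plain lists; the example values are illustrative.

```python
def references_not_in_final(expected_references, final_references):
    """Return, as sorted strings, the expected references absent from the final dataset."""
    final = {str(reference) for reference in final_references}
    expected = {str(reference) for reference in expected_references}
    return sorted(expected - final)


# Illustrative values only; references are compared as strings.
expected_refs = [1295678, 1175422, 1459132]
final_refs = ["1295678", "1175422"]
print(references_not_in_final(expected_refs, final_refs))
```

With the DataFrames from this notebook, the inputs could be `unassigned_entries['reference']` and `results['reference']`. A difference in counts may also come from repeated references rather than dropped rows, so an empty result does not by itself mean nothing was merged.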


#### Existing Duplicate Entities between organisations

This downloads a sqlite db for the current dataset,  
compares the current endpoint entities with the existing ones,  
and identifies duplicates between all organisations.

In [11]:
download_dataset(dataset,collection_name,f"{data_dir}/entity_resolution")
dataset_path = os.path.join(f"{data_dir}/entity_resolution",f'{dataset}.sqlite3')
# Use data_dir rather than a hard-coded path so the cell follows the setting in the config cell
duplicates = get_duplicates_between_orgs(dataset_path,os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))

if duplicates.empty:
    print("No duplicate entities found with existing entities")

duplicates

No duplicate entities found with existing entities


Unnamed: 0,primary_entity,primary_name,primary_reference,primary_organisation_entity,primary_geometry,secondary_entity,secondary_name,secondary_reference,secondary_organisation_entity,secondary_geometry,pct_overlap


### Duplicates with different entity numbers

In [17]:
if not duplicates.empty:
    filtered_df = duplicates[duplicates['primary_entity'] != duplicates['secondary_entity']]
    # A bare expression inside an if block is not rendered, so display it explicitly
    display(filtered_df)

To merge the duplicated entities, use the corresponding primary_entity value as the entity number for each duplicate.
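
The merge rule can be sketched as a mapping from each secondary entity number to its primary entity number. `build_merge_map` is a hypothetical helper, the entity numbers in the example are made up, and the column names are those of the duplicates table above.

```python
def build_merge_map(duplicate_rows):
    """Map each secondary_entity to the primary_entity that should replace it."""
    merge_map = {}
    for row in duplicate_rows:
        primary = row["primary_entity"]
        secondary = row["secondary_entity"]
        # Rows where both sides already share an entity number need no change.
        if primary != secondary:
            merge_map[secondary] = primary
    return merge_map


# Made-up entity numbers for illustration.
example_duplicates = [
    {"primary_entity": 1001, "secondary_entity": 2001},
    {"primary_entity": 1002, "secondary_entity": 1002},
]
print(build_merge_map(example_duplicates))
```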

#### Possible Internal Duplicate Entities

The table below displays duplicates within the provided data, identified using the geographical information (the geometry and point columns).  
Sometimes a duplicate is legitimate, but it is worth checking the source data to make sure.

In [13]:
if not (results[['geometry', 'point']].apply(lambda x: x.str.strip() == '')).all().all():
    grouped = results.groupby(['geometry', 'point'])
    grouped_list=[]
    for key, value in (grouped.groups).items():
        if len(value) > 1 and key[1] !='':
            filtered_df = results[(results['geometry'] == key[0]) & (results['point'] == key[1])]
            grouped_list.append(filtered_df)

    # Report duplicates even when only one duplicated group was found
    if len(grouped_list)>0:
        for i in range(len(grouped_list)):
            display(grouped_list[i])
    else:
        print("No internal duplicates found in the given endpoint")
else:
    print("No geometry or point data in the dataset")

No geometry or point data in the dataset


## RAW (ish) DATA

This is the lightly processed data. 

In [14]:
# load in raw resources
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))
resources = collection.resource.entries
resources

[{'resource': 'b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1',
  'bytes': '197667',
  'organisations': 'local-authority-eng:EPS',
  'datasets': 'article-4-direction-area',
  'endpoints': '99dd0c0c5c8d8b3d4250ac27ed6e5c04c7a0c661f7f5e698d84c140ac240d1bc',
  'start-date': '2023-12-19',
  'end-date': ''}]

In [15]:
resource = resources[0]['resource']
resource_path = os.path.join(data_dir,'collection','resource',resource)

print (F"Reading raw resource from {resource_path}")

try:
    raw_resource = pd.read_csv(resource_path)
except (UnicodeDecodeError,TypeError,pd.errors.ParserError):
    converted_resource_dir = os.path.join(data_dir,'var','converted_resources')
    converted_resource_path = os.path.join(converted_resource_dir,f'{resource}.csv') 
    if not os.path.exists(converted_resource_path):
        convert_resource(resource,resource_path,converted_resource_dir,dataset)
    print (F"Failed - reading from {converted_resource_path} instead.")
    raw_resource = pd.read_csv(converted_resource_path)

raw_length = len(raw_resource)
print("")
print (F"Raw data contains {raw_length} records")

raw_resource


Reading raw resource from ../data/endpoint_checker/collection/resource/b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1
Failed - reading from ../data/endpoint_checker/var/converted_resources/b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1.csv instead.

Raw data contains 30 records


Unnamed: 0,WKT,gml_id,reference,name,article_4_direction,permitted_development_rights,entry_date,documentation_url,document_url,description,ogc_fid,start_date,uprn,address_text
0,"MULTIPOLYGON (((521712.6495 161746.9547,521715...",article4areas,11/00002/ART4,Ewell Village,11/00002/ART4,1a; 1c;1d;1f;2a;2b;,2011-07-08,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,1,,,
1,"MULTIPOLYGON (((521634.4945 160872.1447,521517...",article4areas,12/00002/ART4,Pikes Hill (Wyeths Road front gardens),12/00002/ART4,1a;1b;1c;1d;1f;2a;,2012-10-23,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Permitted Development Rights revoked within Co...,2,2012-09-11,,
2,"MULTIPOLYGON (((520019.1995 160362.3447,520013...",article4areas,05/00001/ART4,Stamford Green,05/00001/ART4,1a;1c;1d;1f;2a;2b;,2005-01-24,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,3,2005-01-24,,
3,"MULTIPOLYGON (((522162.5495 160777.5947,522159...",article4areas,05/00002/ART4,Higher Green/Longdown Lane,05/00002/ART4,1a;1c;1d;1f;2a;2b;,2005-01-24,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,4,2005-01-24,,
4,"MULTIPOLYGON (((522319.8495 160906.5247,522316...",article4areas,05/00004/ART4,The Green/Ewell Downs Road,05/00004/ART4,1a;1c;1d;1f;2a;2b;,2005-01-24,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,5,2000-02-03,,
5,"MULTIPOLYGON (((521630.0637 160359.3092,521625...",article4areas,11/00003/ART4,Burgh Heath Road,11/00003/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Permitted Development Rights revoked within Co...,6,2011-08-12,,
6,"MULTIPOLYGON (((521357.7495 160752.1547,521368...",article4areas,11/00004/ART4,Church Street (Epsom),11/00004/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,7,2011-08-12,,
7,"MULTIPOLYGON (((521737.7995 160365.9547,521759...",article4areas,11/00005/ART4,College Road,11/00005/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,8,2011-08-12,,
8,"MULTIPOLYGON (((521685.4495 159613.3547,521685...",article4areas,11/00006/ART4,Downs Road Estate,11/00006/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,9,2011-08-12,,
9,"MULTIPOLYGON (((521259.1995 161345.1047,521264...",article4areas,11/00007/ART4,Lintons Lane (Part),11/00007/ART4,1a;1b;1c;1f;2a;,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Permitted Development Rights revoked within Co...,10,2011-08-12,,


## Scripting

If everything above looks OK, you can use the scripts below to insert the relevant updates into the collection.

In [16]:
print ( F"IMPORTING INTO {collection_name} -------------------")
print ("")
print ("touch import.csv")

header = "organisation,documentation-url,endpoint-url,start-date,pipelines,plugin"

line = F"{organisation},{documentation_url},{endpoint_url},{start_date},{dataset},"
if plugin is not None:
    line = line + F"{plugin}"

# Strip the '-collection' suffix; a separate name avoids overwriting the Collection object held in `collection`
collection_short_name = collection_name.rsplit('-', 1)[0]

print ("")
print (header)
print (line)
print ("")
print (F"Save the two lines above to `import.csv` and run the line below from inside your collection folder. You need a .venv in place.\n")
print ("")
print (F"digital-land add-endpoints-and-lookups ./import.csv {collection_short_name}")
print ("")





IMPORTING INTO article-4-direction-collection -------------------

touch import.csv

organisation,documentation-url,endpoint-url,start-date,pipelines,plugin
local-authority-eng:EPS,https://ckan.publishing.service.gov.uk/dataset/article-4-directions_for_planx,https://maps.epsom-ewell.gov.uk/getOWS.ashx?MapSource=EEBC/planx&service=WFS&version=1.1.0&request=GetFeature&Typename=article4areas,2023-10-21,article-4-direction-area,wfs

Save the two lines above to `import.csv` and run the line below from inside your collection folder. You need a .venv in place.


digital-land add-endpoints-and-lookups ./import.csv article-4-direction-area
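
The cell above builds the CSV line by joining strings, which would break if a URL contained a comma. As an alternative sketch, Python's `csv` module can build the same two lines and quote values where needed. `build_import_csv` is a hypothetical helper and the `example.com` URLs are placeholders; the header matches the one printed above.

```python
import csv
import io

HEADER = ["organisation", "documentation-url", "endpoint-url", "start-date", "pipelines", "plugin"]


def build_import_csv(organisation, documentation_url, endpoint_url, start_date, dataset, plugin=None):
    """Return the contents of import.csv as text, with values quoted where necessary."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    # An absent plugin is written as an empty final column.
    writer.writerow([organisation, documentation_url, endpoint_url, start_date, dataset, plugin or ""])
    return buffer.getvalue()


text = build_import_csv(
    "local-authority-eng:SAL",
    "https://example.com/docs",
    "https://example.com/data?fields=a,b",  # placeholder URL containing a comma
    "2024-01-31",
    "listed-building-outline",
)
print(text)
```

Writing `text` to `import.csv` would replace the manual copy-and-paste step described above.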

