### Check an endpoint
This notebook runs the pipeline against a given endpoint to check whether it will succeed. This covers the collection, pipeline and dataset stages. It also summarises useful information about whether the endpoint would work on our platform. It downloads all the relevant data to do this, so it may be disk intensive. You'll need to provide the following information:
- collection - the collection that the dataset belongs to. This can be extracted from the specification, but this notebook asks for it in case you want to test the pipeline on something that isn't currently included in the main site
- dataset - the dataset that the endpoint is meant to provide data for. Technically this can be multiple datasets, but for this use case it should be just one. It is also the name of the pipeline that is run on the individual resources downloaded from the endpoint; the dataset and pipeline names are often the same
- organisation - the organisation identifier to be used for the endpoint
- endpoint url - the actual URL of the endpoint
- plugin - the plugin used to download the data; this is only needed for specific endpoints
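
Before running anything that downloads data, it can help to sanity-check these inputs. The following is a minimal sketch, not part of the notebook's helper modules: `EndpointCheckInputs` and its `problems` method are hypothetical names, and the rules it applies (a collection name ending in `-collection`, an organisation written as `prefix:reference`) are assumptions drawn from the example values used in this notebook.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class EndpointCheckInputs:
    """Hypothetical container for the values this notebook asks for."""

    collection_name: str
    dataset: str
    organisation: str
    endpoint_url: str
    plugin: Optional[str] = None

    def problems(self) -> List[str]:
        """Return readable problems; an empty list means the inputs look plausible."""
        found = []
        # Assumption: collection names follow the '<name>-collection' pattern.
        if not self.collection_name.endswith("-collection"):
            found.append("collection name does not end in '-collection'")
        # Assumption: organisations are written as 'prefix:reference'.
        if ":" not in self.organisation:
            found.append("organisation is not in 'prefix:reference' form")
        parsed = urlparse(self.endpoint_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            found.append("endpoint url is not an absolute http(s) url")
        return found


inputs = EndpointCheckInputs(
    collection_name="conservation-area-collection",
    dataset="conservation-area",
    organisation="local-authority-eng:GRY",
    endpoint_url="https://example.org/FeatureServer/0",  # placeholder URL
    plugin="arcgis",
)
print(inputs.problems())
```

An empty list only means the values are well formed; it says nothing about whether the endpoint itself responds.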

In [1]:
import os
import pandas as pd
from functions import run_endpoint_workflow
from sqlite_query_functions import DatasetSqlite
from convert_functions import convert_resource
from digital_land.collection import Collection

In [2]:
# southwark conservation area endpoint
# collection_name = 'conservation-area-collection'
# dataset = 'conservation-area'
# organisation = 'local-authority-eng:SWK'
# endpoint_url = 'https://www.southwark.gov.uk/assets/attach/194104/Conservation-Areas.gpkg'
# plugin = None
# additional_column_mappings=None
# additional_concats=None

# doncaster tpos
# collection_name = 'tree-preservation-order-collection'
# dataset = 'tree-preservation-zone'
# organisation = 'local-authority-eng:DNC'
# endpoint_url='https://maps.doncaster.gov.uk/server/rest/services/Planning/TPO_Map/MapServer/1'
# plugin = 'arcgis'
# additional_column_mappings=None
# additional_concats=None

# great yarmouth conservation areas
collection_name = 'conservation-area-collection'
dataset = 'conservation-area'
organisation = 'local-authority-eng:GRY'
endpoint_url = 'https://services7.arcgis.com/NXULfNGIlsKmpCJD/ArcGIS/rest/services/Great_Yarmouth_Borough_Council_Conservation_Areas_(GeoJSON)/FeatureServer/0'
plugin = 'arcgis'
additional_column_mappings=None
additional_concats=None

# generic data_dir setting
data_dir = '../data/endpoint_checker'

# example playing with additional configs
# additional_concats = [{
#     'dataset':'tree-preservation-zone',
#     'endpoint':'de1eb90a8b037292ef8ae14bfabd1184847ef99b7c6296bb6e75379e6c1f9572',
#     'resource':'e6b0ccaf9b50a7f57a543484fd291dbe43f52a6231b681c3a7cc5e35a6aba254',
#     'field':'reference',
#     'fields':'REFVAL;LABEL',
#     'separator':'/'
# }]


In [3]:
run_endpoint_workflow(
    collection_name,
    dataset,
    organisation,
    endpoint_url,
    plugin,
    data_dir,
    additional_col_mappings=additional_column_mappings,
    additional_concats=additional_concats
)

HTTP Error 404: Not Found
HTTP Error 404: Not Found
HTTP Error 404: Not Found
../data/endpoint_checker/collection
[{'resource': '02a5e9a92b566d8a94f6571e4b85362bfdcfa6222a8a1fff50af3476746db275', 'bytes': '', 'endpoints': '212d2778ce05c5aa782ddbfd9d737e54d25747846e25d79e0929854b9545fee8', 'organisations': 'local-authority-eng:GRY', 'datasets': 'conservation-area', 'start-date': '2023-07-26', 'end-date': ''}]




['categories', 'conservation-area', 'documentation-url', 'end-date', 'entity', 'entry-date', 'geometry', 'legislation', 'name', 'notes', 'organisation', 'point', 'prefix', 'reference', 'start-date']


In [5]:
# validate endpoint


#### collection log summaries

We need to establish whether a resource was downloaded from the endpoint and whether there were any issues during the collection process. Examine the output below. There should be one log entry for the attempt made to download from the endpoint. If the status code is 200, the resource was downloaded successfully.

In [7]:
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))
logs = collection.log.entries
logs = pd.DataFrame.from_records(logs)
logs

Unnamed: 0,bytes,content-type,elapsed,endpoint,resource,status,entry-date,start-date,end-date,exception
0,4006,text/html; charset=utf-8,0.129,65e1a8d8e1fe2c7995543a290474017df2c82c30cdaa9f...,,200,2023-07-26T10:03:32.457156,,,
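
The manual check described above can also be expressed as a filter on the log table. This is a sketch only: `failed_log_entries` is a hypothetical helper, the sample rows are invented for illustration, and it assumes the log DataFrame has `status` and `resource` columns as in the output above. It treats a non-200 status or an empty `resource` value as worth a closer look.

```python
import pandas as pd


def failed_log_entries(logs: pd.DataFrame) -> pd.DataFrame:
    """Return log rows whose status is not 200 or that have no resource hash."""
    # Status may be stored as text, so coerce it; blanks become NaN.
    status = pd.to_numeric(logs["status"], errors="coerce")
    has_resource = logs["resource"].fillna("").astype(str).str.strip() != ""
    return logs[(status != 200) | ~has_resource]


# Invented sample rows, not real log data.
example_logs = pd.DataFrame.from_records(
    [
        {"endpoint": "endpoint-a", "status": "200", "resource": "abc123"},
        {"endpoint": "endpoint-b", "status": "404", "resource": ""},
        {"endpoint": "endpoint-c", "status": "", "resource": ""},
        {"endpoint": "endpoint-d", "status": "200", "resource": ""},
    ]
)
failed = failed_log_entries(example_logs)
print(failed)
```

With the real data, the same function could be called on the `logs` DataFrame built in the cell above.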


### Check unassigned entities
This process automatically aims to detect and assign entity numbers where entries are currently unassigned. Examine the list below to see which entities (if any) have been assigned. If you were to include these in an actual pipeline, you would need to update the configuration lookup.csv with these values. It's worth checking that they are sensible before this happens.

In [4]:
unassigned_entries = pd.read_csv(os.path.join(data_dir,'var','cache','unassigned-entries.csv'))
if len(unassigned_entries) == 0:
    print('no additional entity numbers were required')
else:
    print(unassigned_entries)

               organisation             prefix reference
0   local-authority-eng:GRY  conservation-area      CA01
1   local-authority-eng:GRY  conservation-area     CA01E
2   local-authority-eng:GRY  conservation-area      CA02
3   local-authority-eng:GRY  conservation-area     CA02E
4   local-authority-eng:GRY  conservation-area      CA03
5   local-authority-eng:GRY  conservation-area     CA03E
6   local-authority-eng:GRY  conservation-area      CA04
7   local-authority-eng:GRY  conservation-area     CA04E
8   local-authority-eng:GRY  conservation-area      CA05
9   local-authority-eng:GRY  conservation-area     CA05E
10  local-authority-eng:GRY  conservation-area      CA06
11  local-authority-eng:GRY  conservation-area      CA07
12  local-authority-eng:GRY  conservation-area      CA08
13  local-authority-eng:GRY  conservation-area      CA09
14  local-authority-eng:GRY  conservation-area      CA10
15  local-authority-eng:GRY  conservation-area      CA11
16  local-authority-eng:GRY  co
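
One way to check that the unassigned entries are sensible is to look for blank or repeated references before they go anywhere near lookup.csv. The sketch below assumes the `organisation`, `prefix` and `reference` columns shown above; `check_unassigned_entries` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd


def check_unassigned_entries(entries: pd.DataFrame) -> dict:
    """Count blank references and repeated organisation/prefix/reference rows."""
    reference = entries["reference"].fillna("").astype(str).str.strip()
    key_columns = ["organisation", "prefix", "reference"]
    return {
        "blank_references": int((reference == "").sum()),
        # duplicated() marks every repeat after the first occurrence.
        "duplicate_rows": int(entries.duplicated(subset=key_columns).sum()),
    }


# Invented sample: one repeated reference and one missing reference.
example_entries = pd.DataFrame(
    {
        "organisation": ["local-authority-eng:GRY"] * 4,
        "prefix": ["conservation-area"] * 4,
        "reference": ["CA01", "CA01E", "CA01", None],
    }
)
report = check_unassigned_entries(example_entries)
print(report)
```

Both counts being zero does not prove the references are correct, but a non-zero count is a good reason to inspect the source data before assigning entity numbers.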

#### Check logs collated from the pipeline process
We need to read in the logs and examine them to see whether the data points were all read in correctly. This uses the sqlite database, with some custom queries I have written. You could examine the CSVs directly if the pipeline fails.

First check the column mappings to see which columns the pipeline mapped automatically. If this is empty or has missing values, it's likely to be the reason data isn't appearing at the end.

In [5]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_column_mappings()
results

Unnamed: 0,end_date,entry_date,field,dataset,start_date,resource,column
0,,2023-07-26T10:23:38Z,geometry,conservation-area,,02a5e9a92b566d8a94f6571e4b85362bfdcfa6222a8a1f...,Geometry
1,,2023-07-26T10:23:38Z,name,conservation-area,,02a5e9a92b566d8a94f6571e4b85362bfdcfa6222a8a1f...,Name
2,,2023-07-26T10:23:38Z,notes,conservation-area,,02a5e9a92b566d8a94f6571e4b85362bfdcfa6222a8a1f...,Notes
3,,2023-07-26T10:23:38Z,reference,conservation-area,,02a5e9a92b566d8a94f6571e4b85362bfdcfa6222a8a1f...,Reference
4,,2023-07-26T10:23:38Z,geometry,conservation-area,,02a5e9a92b566d8a94f6571e4b85362bfdcfa6222a8a1f...,WKT
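
Reading the mapping table by eye works for a handful of rows; for larger resources a small summary may be easier. The sketch below is an assumption-laden illustration: `summarise_column_mappings` is a hypothetical helper, the `expected_fields` list is chosen by hand for the example, and the sample rows copy only the `field` and `column` values from the output above.

```python
import pandas as pd


def summarise_column_mappings(mappings: pd.DataFrame, expected_fields: list):
    """Return (fields with no mapped column, fields mapped from several columns)."""
    columns_by_field = mappings.groupby("field")["column"].apply(list).to_dict()
    unmapped = [f for f in expected_fields if f not in columns_by_field]
    multi = {f: cols for f, cols in columns_by_field.items() if len(cols) > 1}
    return unmapped, multi


# field/column pairs as shown in the column mapping output above.
example_mappings = pd.DataFrame(
    {
        "field": ["geometry", "name", "notes", "reference", "geometry"],
        "column": ["Geometry", "Name", "Notes", "Reference", "WKT"],
    }
)
# Hand-picked list of fields we would hope to see mapped (an assumption).
expected_fields = ["geometry", "name", "reference", "start-date"]

unmapped, multi = summarise_column_mappings(example_mappings, expected_fields)
print("unmapped:", unmapped)
print("mapped from several columns:", multi)
```

A field mapped from several columns is not necessarily a problem, but it is worth confirming which source column actually carries the data.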


Examine the issue logs: we'll look at the types of errors being raised and list all of them. This could be improved in the future by examining the severity.

In [6]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_issues_by_type()
results

Unnamed: 0,issue_type,count
0,OSGB,1035
1,default-value,1187
2,invalid geometry,30
3,patch,1035


In [7]:
results = dataset_db.get_issues()
results

Unnamed: 0,end_date,entry_date,entry_number,field,issue_type,line_number,dataset,resource,start_date,value
0,,,1,organisation,patch,2,conservation-area,f617d13ef7ff061424ec50ed641f05779797ff03beb2cd...,,Havering
1,,,1,geometry,OSGB,2,conservation-area,f617d13ef7ff061424ec50ed641f05779797ff03beb2cd...,,
2,,,1,entry-date,default-value,2,conservation-area,f617d13ef7ff061424ec50ed641f05779797ff03beb2cd...,,2023-07-24
3,,,2,organisation,patch,3,conservation-area,f617d13ef7ff061424ec50ed641f05779797ff03beb2cd...,,Havering
4,,,2,geometry,OSGB,3,conservation-area,f617d13ef7ff061424ec50ed641f05779797ff03beb2cd...,,
...,...,...,...,...,...,...,...,...,...,...
3282,,,51,entry-date,default-value,52,conservation-area,274345f7c75fbb670408a5d57b5cdd72ba201e2f5b1ed9...,,2023-07-24
3283,,,52,organisation,default-value,53,conservation-area,274345f7c75fbb670408a5d57b5cdd72ba201e2f5b1ed9...,,local-authority-eng:SWK
3284,,,52,entry-date,default-value,53,conservation-area,274345f7c75fbb670408a5d57b5cdd72ba201e2f5b1ed9...,,2023-07-24
3285,,,53,organisation,default-value,54,conservation-area,274345f7c75fbb670408a5d57b5cdd72ba201e2f5b1ed9...,,local-authority-eng:SWK
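
The resource hashes in the issue table above start differently from the resource downloaded in this run, so it may be useful to narrow the issues to a single resource before counting them. This is a sketch with invented data: `issue_counts_for_resource` is a hypothetical helper, and `res-a` and `res-b` stand in for real resource hashes.

```python
import pandas as pd


def issue_counts_for_resource(issues: pd.DataFrame, resource: str) -> dict:
    """Count issues by type for one resource only."""
    subset = issues[issues["resource"] == resource]
    return subset.groupby("issue_type").size().to_dict()


# Invented sample rows; 'res-a' and 'res-b' are placeholder resource hashes.
example_issues = pd.DataFrame(
    {
        "resource": ["res-a", "res-a", "res-a", "res-b"],
        "field": ["organisation", "geometry", "entry-date", "entry-date"],
        "issue_type": ["patch", "OSGB", "default-value", "default-value"],
    }
)
counts_a = issue_counts_for_resource(example_issues, "res-a")
counts_b = issue_counts_for_resource(example_issues, "res-b")
print(counts_a)
print(counts_b)
```

With the real data, the DataFrame returned by `get_issues()` and the resource hash under test could be passed in instead, assuming the `resource` column holds full hashes rather than the truncated display values.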


#### Final dataset comparison against the sqlite database

Below are two tables which show the difference between what was provided to us and what is currently in the entity table. It is important to bear in mind that we assign entities automatically during this process; the table above shows what we have added.

In [8]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_entities()
results

Unnamed: 0,dataset,end_date,entity,entry_date,geojson,geometry,json,name,organisation_entity,point,prefix,reference,start_date,typology
0,conservation-area,,44009677,2023-07-26,,"MULTIPOLYGON (((1.732998 52.598272,1.732999 52...",,No.1 Camperdown,152,POINT(1.734384 52.599775),conservation-area,CA01,,geography
1,conservation-area,,44009678,2023-07-26,,"MULTIPOLYGON (((1.733197 52.598323,1.733170 52...",,No.1 Camperdown Extension,152,POINT(1.732864 52.596802),conservation-area,CA01E,,geography
2,conservation-area,,44009679,2023-07-26,,"MULTIPOLYGON (((1.725801 52.606626,1.727346 52...",,"No.2 Market Place, Rows & North Quay",152,POINT(1.726043 52.608407),conservation-area,CA02,,geography
3,conservation-area,,44009680,2023-07-26,,"MULTIPOLYGON (((1.721599 52.608503,1.721852 52...",,"No.2 Market Place, Rows & North Quay Extension",152,POINT(1.725341 52.609483),conservation-area,CA02E,,geography
4,conservation-area,,44009681,2023-07-26,,"MULTIPOLYGON (((1.725801 52.606626,1.725599 52...",,No.3 Hall Quay / South Quay,152,POINT(1.725013 52.605461),conservation-area,CA03,,geography
5,conservation-area,,44009682,2023-07-26,,"MULTIPOLYGON (((1.722014 52.606203,1.722609 52...",,No.3 Hall Quay / South Quay Extension,152,POINT(1.722801 52.606116),conservation-area,CA03E,,geography
6,conservation-area,,44009683,2023-07-26,,"MULTIPOLYGON (((1.729086 52.607073,1.728459 52...",,No.4 King Street,152,POINT(1.729306 52.603959),conservation-area,CA04,,geography
7,conservation-area,,44009684,2023-07-26,,"MULTIPOLYGON (((1.731135 52.601797,1.731489 52...",,No.4 King Street Extension,152,POINT(1.730760 52.601018),conservation-area,CA04E,,geography
8,conservation-area,,44009685,2023-07-26,,"MULTIPOLYGON (((1.725371 52.610008,1.725405 52...",,No.5 St Nicholas / Northgate Street,152,POINT(1.726889 52.611277),conservation-area,CA05,,geography
9,conservation-area,,44009686,2023-07-26,,"MULTIPOLYGON (((1.730376 52.614018,1.730387 52...",,No.5 St Nicholas / Northgate Street Extension,152,POINT(1.731805 52.614110),conservation-area,CA05E,,geography


In [9]:
# load in raw resources
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))
resources = collection.resource.entries
resources

[{'resource': '02a5e9a92b566d8a94f6571e4b85362bfdcfa6222a8a1fff50af3476746db275',
  'bytes': '',
  'organisations': 'local-authority-eng:GRY',
  'datasets': 'conservation-area',
  'endpoints': '212d2778ce05c5aa782ddbfd9d737e54d25747846e25d79e0929854b9545fee8',
  'start-date': '2023-07-26',
  'end-date': ''}]

In [14]:
# currently this reads the raw resource first and only converts it if that fails;
# in the future this should check for a converted resource first
resource = resources[0]['resource']
resource_path = os.path.join(data_dir,'collection','resource',resource)
try:
    raw_resource = pd.read_csv(resource_path)
except (UnicodeDecodeError,TypeError,pd.errors.ParserError):
    converted_resource_dir = os.path.join(data_dir,'var','converted_resources')
    converted_resource_path = os.path.join(converted_resource_dir,f'{resource}.csv') 
    if not os.path.exists(converted_resource_path):
        convert_resource(resource,resource_path,converted_resource_dir,dataset)
    raw_resource = pd.read_csv(converted_resource_path)
    

In [15]:
raw_resource

Unnamed: 0,WKT,Id,Reference,Name,Doc_url,Notes,Start_date,End_date,Entry_date,Geometry,ObjectId,Shape__Area,Shape__Length
0,"MULTIPOLYGON (((1.7329984 52.5982717,1.7329972...",0,CA01,No.1 Camperdown,,,-2592000000,,1687478400000,,1,152690.5,1783.579191
1,"MULTIPOLYGON (((1.7331975 52.5983235,1.7332193...",0,CA01E,No.1 Camperdown Extension,,,1065744000000,,1687824000000,,2,270956.8,2415.602492
2,"MULTIPOLYGON (((1.7258015 52.6066257,1.7255991...",0,CA02,"No.2 Market Place, Rows & North Quay",,,175478400000,,1687824000000,,3,259940.3,3261.361601
3,"MULTIPOLYGON (((1.7215994 52.6085035,1.7215853...",0,CA02E,"No.2 Market Place, Rows & North Quay Extension",,,1065744000000,,1687824000000,,4,54256.17,2449.845027
4,"MULTIPOLYGON (((1.7258015 52.6066257,1.7260825...",0,CA03,No.3 Hall Quay / South Quay,,,175478400000,,1687824000000,,5,306439.1,3094.843406
5,"MULTIPOLYGON (((1.7220141 52.606203,1.7219689 ...",0,CA03E,No.3 Hall Quay / South Quay Extension,,,1065744000000,,1687824000000,,6,10350.76,595.59588
6,"MULTIPOLYGON (((1.7290864 52.6070726,1.7291139...",0,CA04,No.4 King Street,,,175478400000,,1687824000000,,7,308390.0,4635.825873
7,"MULTIPOLYGON (((1.7311355 52.6017967,1.7311392...",0,CA04E,No.4 King Street Extension,,,1065744000000,,1687824000000,,8,9704.828,719.597124
8,"MULTIPOLYGON (((1.7253708 52.6100084,1.7253086...",0,CA05,No.5 St Nicholas / Northgate Street,,,175478400000,,1687824000000,,9,459344.9,3094.684691
9,"MULTIPOLYGON (((1.7303756 52.6140176,1.7299218...",0,CA05E,No.5 St Nicholas / Northgate Street Extension,,,1065744000000,,1687824000000,,10,289284.3,2350.928523
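
Comparing the two tables by eye gets harder as the number of rows grows. One possible approach is an outer merge on the reference, which labels each reference as present in the raw resource, the entity table, or both. The sketch below assumes the raw column is called `Reference` and the entity column `reference`, as in the outputs above; `compare_references` is a hypothetical helper and the sample values are invented.

```python
import pandas as pd


def compare_references(raw: pd.DataFrame, entities: pd.DataFrame) -> pd.DataFrame:
    """Label each reference as found in the raw resource, the entity table, or both."""
    raw_refs = (
        raw[["Reference"]].rename(columns={"Reference": "reference"}).drop_duplicates()
    )
    entity_refs = entities[["reference"]].drop_duplicates()
    merged = raw_refs.merge(entity_refs, on="reference", how="outer", indicator=True)
    labels = {"left_only": "raw only", "right_only": "entity only", "both": "both"}
    merged["found_in"] = merged["_merge"].astype(str).map(labels)
    return merged.drop(columns="_merge")


# Invented sample values: CA99 is missing from the entities, CA02 from the raw data.
example_raw = pd.DataFrame({"Reference": ["CA01", "CA01E", "CA99"]})
example_entities = pd.DataFrame({"reference": ["CA01", "CA01E", "CA02"]})

comparison = compare_references(example_raw, example_entities)
print(comparison)
```

Rows labelled "raw only" are the ones to investigate first, since they were provided by the endpoint but did not reach the entity table.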

