### Check an endpoint
This notebook runs the pipeline against a given endpoint to check whether it will be processed successfully. This covers the collection, pipeline and dataset stages. It also summarises useful information on whether the endpoint would succeed on our platform. It downloads all the relevant data to do this, so it may be disk intensive. You'll need to provide the following information:
- collection - the collection that the dataset belongs to. This can be extracted from the specification, but this notebook asks you to provide it in case you want to test the pipeline on something that isn't currently included in the main site
- dataset - the dataset that the endpoint is meant to provide data for. Technically this can be multiple datasets, but for this use case it should be just one. It is also the name of the pipeline that is run on the individual resources downloaded from the endpoint; the dataset and pipeline names are often the same
- organisation - the organisation identifier to be used for the endpoint
- endpoint url - the actual URL of the endpoint
- plugin - some endpoints need a plugin to download the data; this is only needed for specific endpoints
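
Before running the workflow it can help to sanity-check these inputs. The sketch below is not part of the notebook's own helpers: `check_inputs` and `KNOWN_PLUGINS` are hypothetical names, and the plugin set only lists the values that appear in this notebook's examples.

```python
from urllib.parse import urlparse

# Plugin values that appear in this notebook's examples; other values may exist.
KNOWN_PLUGINS = {None, "arcgis", "wfs"}


def check_inputs(collection_name, dataset, organisation, endpoint_url, plugin=None):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for label, value in [
        ("collection", collection_name),
        ("dataset", dataset),
        ("organisation", organisation),
    ]:
        if not value or not value.strip():
            problems.append(f"{label} is empty")
    # Only the URL's shape is checked here; nothing is fetched.
    parsed = urlparse(endpoint_url or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("endpoint url is not an absolute http(s) URL")
    if plugin not in KNOWN_PLUGINS:
        problems.append(f"plugin {plugin!r} is not one of the values used in this notebook")
    return problems
```

An empty list from `check_inputs(collection_name, dataset, organisation, endpoint_url, plugin)` only means the values look plausible; it does not show that the endpoint responds.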

If you see errors regarding digital-land, try:

`pip install -e git+https://github.com/digital-land/pipeline.git#egg=digital-land`

And restart the notebook.

In [17]:
import os
import pandas as pd
from functions import run_endpoint_workflow
from sqlite_query_functions import DatasetSqlite
from convert_functions import convert_resource
from digital_land.collection import Collection

In [18]:
# Extend these lists as/when you need to add other collections

# collection_name = 'conservation-area-collection'
# collection_name = 'listed-building-collection'
# collection_name = 'tree-preservation-order-collection'

# dataset = 'conservation-area'
# dataset = 'listed-building-outline'
# dataset = 'tree-preservation-zone'
# dataset = 'tree'


# additional_column_mappings=None
# additional_concats=None

# plugin = None
# plugin = 'arcgis'
# plugin = 'wfs'
 

collection_name = 'article-4-direction-collection'
dataset = 'article-4-direction-area'
organisation = 'local-authority-eng:EPS'
endpoint_url = 'https://maps.epsom-ewell.gov.uk/getOWS.ashx?MapSource=EEBC/planx&service=WFS&version=1.1.0&request=GetFeature&Typename=article4areas'
documentation_url = "https://ckan.publishing.service.gov.uk/dataset/article-4-directions_for_planx"
start_date="2023-10-21"
plugin = 'wfs'


additional_column_mappings=None
additional_concats=None

# generic data_dir setting
data_dir = '../data/endpoint_checker'

# example of playing with additional configs
# additional_concats = [{
#     'dataset':'tree-preservation-zone',
#     'endpoint':'de1eb90a8b037292ef8ae14bfabd1184847ef99b7c6296bb6e75379e6c1f9572',
#     'resource':'e6b0ccaf9b50a7f57a543484fd291dbe43f52a6231b681c3a7cc5e35a6aba254',
#     'field':'reference',
#     'fields':'REFVAL;LABEL',
#     'separator':'/'
# }]


In [19]:
run_endpoint_workflow(
    collection_name,
    dataset,
    organisation,
    endpoint_url,
    plugin,
    data_dir,
    additional_col_mappings=additional_column_mappings,
    additional_concats=additional_concats
)

HTTP Error 404: Not Found




../data/endpoint_checker/collection
[{'resource': 'b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1', 'bytes': '197667', 'endpoints': '99dd0c0c5c8d8b3d4250ac27ed6e5c04c7a0c661f7f5e698d84c140ac240d1bc', 'organisations': 'local-authority-eng:EPS', 'datasets': 'article-4-direction-area', 'start-date': '2023-11-01', 'end-date': ''}]


#### Collection log summaries

We need to establish whether a resource was downloaded from the endpoint and whether there were any issues during the collection process. Examine the output of the cells below. There should be one log entry for the attempt made to download from the endpoint. If the status code is 200, the resource was downloaded successfully.

In [20]:
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))

collection.resource.records

{'b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1': [{'resource': 'b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1',
   'bytes': '197667',
   'organisations': 'local-authority-eng:EPS',
   'datasets': 'article-4-direction-area',
   'endpoints': '99dd0c0c5c8d8b3d4250ac27ed6e5c04c7a0c661f7f5e698d84c140ac240d1bc',
   'start-date': '2023-11-01',
   'end-date': ''}]}

In [21]:

logs = collection.log.entries
logs = pd.DataFrame.from_records(logs)
logs

Unnamed: 0,bytes,content-type,elapsed,endpoint,resource,status,entry-date,start-date,end-date,exception
0,197667,text/xml; subtype=gml/3.1.1; charset=utf-8,0.58,99dd0c0c5c8d8b3d4250ac27ed6e5c04c7a0c661f7f5e6...,b4601330812f8710e184930f9d415e4beadd36399700d2...,200,2023-11-01T17:48:25.741280,,,
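
The status check described above can also be done programmatically. This is a minimal sketch, assuming the log entries are dictionaries with a `status` key (as the log table above suggests); `summarise_log_entries` is a hypothetical helper and the example entries are illustrative, not real log rows.

```python
def summarise_log_entries(entries):
    """Split collection log entries into successful and failed download attempts."""
    ok, failed = [], []
    for entry in entries:
        # Status may arrive as an int or a string depending on how the log was read.
        status = str(entry.get("status", "")).strip()
        (ok if status == "200" else failed).append(entry)
    return ok, failed


# Illustrative entries only, shaped like the log row shown above.
example_entries = [
    {"endpoint": "example-endpoint", "status": 200, "exception": ""},
    {"endpoint": "example-endpoint", "status": "", "exception": "HTTP Error 404: Not Found"},
]
ok, failed = summarise_log_entries(example_entries)
print(f"{len(ok)} successful, {len(failed)} failed")
for entry in failed:
    print("failed:", entry.get("status") or entry.get("exception"))
```

In the notebook, something like `summarise_log_entries(collection.log.entries)` would be the natural call, provided the entries behave like dictionaries.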


### Check unassigned entities
This process automatically aims to detect and assign entity numbers where entries are currently unassigned. Examine the list below to see which entities (if any) have been assigned. If you were to include these in an actual pipeline, you would need to update the configuration lookup.csv with these values. It's worth checking that they are sensible before this happens.

In [22]:
unassigned_entries = pd.read_csv(os.path.join(data_dir,'var','cache','unassigned-entries.csv'))
if len(unassigned_entries) == 0:
    print('No additional entity numbers required')
else:
    print(F"{len(unassigned_entries)} unassigned entities\n")
    print(unassigned_entries)



No additional entity numbers required
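
One way to check that newly assigned entries are sensible is to look for blank or repeated references. The sketch below assumes each row has a `reference` field, which is not confirmed by the output above; `find_suspect_references` is a hypothetical helper and the example rows are made up for illustration.

```python
from collections import Counter


def find_suspect_references(rows, reference_field="reference"):
    """Return (number of blank references, sorted list of duplicated references)."""
    refs = [str(row.get(reference_field) or "").strip() for row in rows]
    blank_count = sum(1 for ref in refs if not ref)
    counts = Counter(ref for ref in refs if ref)
    duplicates = sorted(ref for ref, n in counts.items() if n > 1)
    return blank_count, duplicates


# Made-up rows for illustration only.
example_rows = [
    {"reference": "11/00002/ART4"},
    {"reference": "11/00002/ART4"},
    {"reference": ""},
    {"reference": "05/00001/ART4"},
]
print(find_suspect_references(example_rows))  # → (1, ['11/00002/ART4'])
```

Reading the file with `csv.DictReader` gives plain strings for every cell, which suits this helper; rows taken from a pandas DataFrame may contain `NaN` for empty cells and would need converting first.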


#### Check logs collated from the pipeline process
We need to read the logs and check whether the data points were all read in correctly. This uses the sqlite database with some custom queries. You could examine the CSVs directly if the pipeline fails.

First, check the column mappings to see which columns the pipeline mapped automatically. If this is empty or has missing values, that is likely to be the reason data isn't appearing at the end.

In [23]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_column_mappings()

results

Unnamed: 0,end_date,entry_date,field,dataset,start_date,resource,column
0,,2023-11-01T18:03:12Z,geometry,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,WKT
1,,2023-11-01T18:03:12Z,address-text,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,address_text
2,,2023-11-01T18:03:12Z,article-4-direction,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,article_4_direction
3,,2023-11-01T18:03:12Z,description,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,description
4,,2023-11-01T18:03:12Z,entry-date,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,entry_date
5,,2023-11-01T18:03:12Z,name,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,name
6,,2023-11-01T18:03:12Z,permitted-development-rights,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,permitted_development_rights
7,,2023-11-01T18:03:12Z,reference,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,reference
8,,2023-11-01T18:03:12Z,start-date,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,start_date
9,,2023-11-01T18:03:12Z,uprn,article-4-direction-area,,b4601330812f8710e184930f9d415e4beadd36399700d2...,uprn
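
To spot gaps in the mappings quickly, the mapped fields can be compared with the fields you expect the dataset to carry. This is a sketch only: `unmapped_fields` is a hypothetical helper, the mapped list is copied from the table above, and the expected list is an assumption chosen for illustration rather than the dataset's real specification.

```python
def unmapped_fields(expected_fields, mapped_fields):
    """Return the expected fields that have no column mapping, in their original order."""
    mapped = set(mapped_fields)
    return [field for field in expected_fields if field not in mapped]


# Fields taken from the column-mapping table above.
mapped = [
    "geometry", "address-text", "article-4-direction", "description",
    "entry-date", "name", "permitted-development-rights", "reference",
    "start-date", "uprn",
]
# Hypothetical expectation; check the specification for the real field list.
expected = ["reference", "name", "geometry", "documentation-url"]

print(unmapped_fields(expected, mapped))  # → ['documentation-url']
```

In the notebook, the mapped list could be taken from the `field` column of the `results` DataFrame instead of being typed out.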


Examine the issue logs. We'll look at the types of errors being raised and list all of them. This could be improved in the future by examining the severity.

In [24]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_issues_by_type()
results

Unnamed: 0,issue_type,count
0,OSGB,70
1,default-field,40
2,default-value,142
3,invalid geometry,12
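
As a rough idea of what examining severity could look like, the counts above can be grouped by a severity level. The mapping below is entirely hypothetical and the platform may classify these issue types differently; `count_by_severity` is an illustrative helper, not an existing function.

```python
# Hypothetical severity per issue type; the real classification may differ.
SEVERITY = {
    "invalid geometry": "error",
    "OSGB": "info",
    "default-field": "info",
    "default-value": "info",
}


def count_by_severity(issue_counts, severity=SEVERITY, default="unknown"):
    """Sum issue counts per severity level; unmapped issue types fall under `default`."""
    totals = {}
    for issue_type, count in issue_counts.items():
        level = severity.get(issue_type, default)
        totals[level] = totals.get(level, 0) + count
    return totals


# Counts copied from the issue summary above.
issue_counts = {"OSGB": 70, "default-field": 40, "default-value": 142, "invalid geometry": 12}
print(count_by_severity(issue_counts))  # → {'info': 252, 'error': 12}
```

Keeping an explicit `unknown` bucket means a new issue type shows up in the summary rather than being silently ignored.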


In [25]:
results = dataset_db.get_issues()

#results

results.loc[(results.issue_type != 'default-value') ]

# Some alternative views below. 
# duff_geom = results.loc[(results.issue_type == 'invalid geometry')]
# duff_geom = results.loc[(results.issue_type == 'Unexpected geom type')]
# duff_geom

# cols = duff_geom[["line_number", "value"]]
# print (cols)


Unnamed: 0,end_date,entry_date,entry_number,field,issue_type,line_number,dataset,resource,start_date,value
0,,,1,geometry,OSGB,2,article-4-direction-area,b4601330812f8710e184930f9d415e4beadd36399700d2...,,
2,,,2,geometry,OSGB,3,article-4-direction-area,b4601330812f8710e184930f9d415e4beadd36399700d2...,,
4,,,3,geometry,OSGB,4,article-4-direction-area,b4601330812f8710e184930f9d415e4beadd36399700d2...,,
6,,,4,geometry,OSGB,5,article-4-direction-area,b4601330812f8710e184930f9d415e4beadd36399700d2...,,
8,,,5,geometry,OSGB,6,article-4-direction-area,b4601330812f8710e184930f9d415e4beadd36399700d2...,,
...,...,...,...,...,...,...,...,...,...,...
249,,,12,geometry,invalid geometry,13,article-4-direction-area,ceb8832d8f0d93156ebf7e7ce72faa737010fc09dab0f6...,,Self-intersection[-0.17860633008658 51.5571355...
250,,,12,geometry,invalid geometry,13,article-4-direction-area,ceb8832d8f0d93156ebf7e7ce72faa737010fc09dab0f6...,,Too few points in geometry component[-0.164362...
253,,,13,geometry,invalid geometry,14,article-4-direction-area,ceb8832d8f0d93156ebf7e7ce72faa737010fc09dab0f6...,,Too few points in geometry component[-0.189968...
256,,,14,geometry,invalid geometry,15,article-4-direction-area,ceb8832d8f0d93156ebf7e7ce72faa737010fc09dab0f6...,,Self-intersection[-0.146218 51.536104]


#### Final dataset comparison against the sqlite database

Below are two tables which show the difference between what was provided to us and what is currently in the entity table. It is important to bear in mind that we assign entities automatically in this process; the table above shows what we have added.

In [26]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_entities()
results.head(26)

Unnamed: 0,dataset,end_date,entity,entry_date,geojson,geometry,json,name,organisation_entity,point,prefix,reference,start_date,typology
0,article-4-direction-area,,6100156,2011-07-08,,"MULTIPOLYGON (((-0.253935 51.341786,-0.253814 ...","{""article-4-direction"": ""11/00002/ART4"", ""desc...",Ewell Village,129,POINT(-0.251449 51.350634),article-4-direction-area,11/00002/ART4,,geography
1,article-4-direction-area,,6100157,2012-10-23,,"MULTIPOLYGON (((-0.255355 51.333941,-0.255621 ...","{""article-4-direction"": ""12/00002/ART4"", ""desc...",Pikes Hill (Wyeths Road front gardens),129,POINT(-0.256291 51.333998),article-4-direction-area,12/00002/ART4,2012-09-11,geography
2,article-4-direction-area,,6100158,2005-01-24,,"MULTIPOLYGON (((-0.278702 51.329702,-0.277627 ...","{""article-4-direction"": ""05/00001/ART4"", ""desc...",Stamford Green,129,POINT(-0.281016 51.332719),article-4-direction-area,05/00001/ART4,2005-01-24,geography
3,article-4-direction-area,,6100159,2005-01-24,,"MULTIPOLYGON (((-0.247811 51.332978,-0.247591 ...","{""article-4-direction"": ""05/00002/ART4"", ""desc...",Higher Green/Longdown Lane,129,POINT(-0.244961 51.332815),article-4-direction-area,05/00002/ART4,2005-01-24,geography
4,article-4-direction-area,,6100160,2005-01-24,,"MULTIPOLYGON (((-0.245510 51.334103,-0.245293 ...","{""article-4-direction"": ""05/00004/ART4"", ""desc...",The Green/Ewell Downs Road,129,POINT(-0.246739 51.337815),article-4-direction-area,05/00004/ART4,2000-02-03,geography
5,article-4-direction-area,,6100161,2011-08-12,,"MULTIPOLYGON (((-0.255593 51.329332,-0.256100 ...","{""article-4-direction"": ""11/00003/ART4"", ""desc...",Burgh Heath Road,129,POINT(-0.256471 51.327048),article-4-direction-area,11/00003/ART4,2011-08-12,geography
6,article-4-direction-area,,6100162,2011-08-12,,"MULTIPOLYGON (((-0.259366 51.332921,-0.259330 ...","{""article-4-direction"": ""11/00004/ART4"", ""desc...",Church Street (Epsom),129,POINT(-0.260777 51.331192),article-4-direction-area,11/00004/ART4,2011-08-12,geography
7,article-4-direction-area,,6100163,2011-08-12,,"MULTIPOLYGON (((-0.254045 51.329369,-0.253832 ...","{""article-4-direction"": ""11/00005/ART4"", ""desc...",College Road,129,POINT(-0.253319 51.328533),article-4-direction-area,11/00005/ART4,2011-08-12,geography
8,article-4-direction-area,,6100164,2011-08-12,,"MULTIPOLYGON (((-0.255053 51.322616,-0.255040 ...","{""article-4-direction"": ""11/00006/ART4"", ""desc...",Downs Road Estate,129,POINT(-0.258010 51.323635),article-4-direction-area,11/00006/ART4,2011-08-12,geography
9,article-4-direction-area,,6100165,2011-08-12,,"MULTIPOLYGON (((-0.260579 51.338271,-0.260828 ...","{""article-4-direction"": ""11/00007/ART4"", ""desc...",Lintons Lane (Part),129,POINT(-0.260179 51.337938),article-4-direction-area,11/00007/ART4,2011-08-12,geography


In [27]:
# load in raw resources
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))
resources = collection.resource.entries
resources

[{'resource': 'b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1',
  'bytes': '197667',
  'organisations': 'local-authority-eng:EPS',
  'datasets': 'article-4-direction-area',
  'endpoints': '99dd0c0c5c8d8b3d4250ac27ed6e5c04c7a0c661f7f5e698d84c140ac240d1bc',
  'start-date': '2023-11-01',
  'end-date': ''}]

In [28]:
# currently this just reads in the raw resource but in the future this should check for 
# converted resource first
resource = resources[0]['resource']
resource_path = os.path.join(data_dir,'collection','resource',resource)

print (F"Reading raw resource from {resource_path}")
try:
    raw_resource = pd.read_csv(resource_path)
except (UnicodeDecodeError,TypeError,pd.errors.ParserError):
    converted_resource_dir = os.path.join(data_dir,'var','converted_resources')
    converted_resource_path = os.path.join(converted_resource_dir,f'{resource}.csv') 
    if not os.path.exists(converted_resource_path):
        convert_resource(resource,resource_path,converted_resource_dir,dataset)
    raw_resource = pd.read_csv(converted_resource_path)
    

Reading raw resource from ../data/endpoint_checker/collection/resource/b4601330812f8710e184930f9d415e4beadd36399700d27e8836243001fc37e1


In [29]:
raw_resource

Unnamed: 0,WKT,gml_id,reference,name,article_4_direction,permitted_development_rights,entry_date,documentation_url,document_url,description,ogc_fid,start_date,uprn,address_text
0,"MULTIPOLYGON (((521712.6495 161746.9547,521715...",article4areas,11/00002/ART4,Ewell Village,11/00002/ART4,1a; 1c;1d;1f;2a;2b;,2011-07-08,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,1,,,
1,"MULTIPOLYGON (((521634.4945 160872.1447,521517...",article4areas,12/00002/ART4,Pikes Hill (Wyeths Road front gardens),12/00002/ART4,1a;1b;1c;1d;1f;2a;,2012-10-23,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Permitted Development Rights revoked within Co...,2,2012-09-11,,
2,"MULTIPOLYGON (((520019.1995 160362.3447,520013...",article4areas,05/00001/ART4,Stamford Green,05/00001/ART4,1a;1c;1d;1f;2a;2b;,2005-01-24,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,3,2005-01-24,,
3,"MULTIPOLYGON (((522162.5495 160777.5947,522159...",article4areas,05/00002/ART4,Higher Green/Longdown Lane,05/00002/ART4,1a;1c;1d;1f;2a;2b;,2005-01-24,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,4,2005-01-24,,
4,"MULTIPOLYGON (((522319.8495 160906.5247,522316...",article4areas,05/00004/ART4,The Green/Ewell Downs Road,05/00004/ART4,1a;1c;1d;1f;2a;2b;,2005-01-24,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,5,2000-02-03,,
5,"MULTIPOLYGON (((521630.0637 160359.3092,521625...",article4areas,11/00003/ART4,Burgh Heath Road,11/00003/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Permitted Development Rights revoked within Co...,6,2011-08-12,,
6,"MULTIPOLYGON (((521357.7495 160752.1547,521368...",article4areas,11/00004/ART4,Church Street (Epsom),11/00004/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,7,2011-08-12,,
7,"MULTIPOLYGON (((521737.7995 160365.9547,521759...",article4areas,11/00005/ART4,College Road,11/00005/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,8,2011-08-12,,
8,"MULTIPOLYGON (((521685.4495 159613.3547,521685...",article4areas,11/00006/ART4,Downs Road Estate,11/00006/ART4,1a;1b;1c;1d;1f;2a,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Conservation area with Development rights revoked,9,2011-08-12,,
9,"MULTIPOLYGON (((521259.1995 161345.1047,521264...",article4areas,11/00007/ART4,Lintons Lane (Part),11/00007/ART4,1a;1b;1c;1f;2a;,2011-08-12,https://www.epsom-ewell.gov.uk/residents/plann...,https://www.epsom-ewell.gov.uk/sites/default/f...,Permitted Development Rights revoked within Co...,10,2011-08-12,,
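
Comparing the two tables by eye gets harder as the resource grows, so a set comparison on the `reference` column (present in both tables above) can help. This is a minimal sketch; `compare_references` is a hypothetical helper, and the shortened entity list in the example is made up to show a mismatch.

```python
def compare_references(raw_references, entity_references):
    """Report references present on one side but not the other."""
    raw, entities = set(raw_references), set(entity_references)
    return {
        "missing_from_entities": sorted(raw - entities),
        "not_in_raw": sorted(entities - raw),
    }


# References from the raw resource above; the entity list is deliberately cut short.
raw_refs = ["11/00002/ART4", "12/00002/ART4", "05/00001/ART4"]
entity_refs = ["11/00002/ART4", "12/00002/ART4"]
print(compare_references(raw_refs, entity_refs))
```

In the notebook, the inputs could be `raw_resource["reference"]` and the `reference` column of the entities DataFrame. A non-empty `missing_from_entities` list would point at rows that were dropped somewhere in the pipeline.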


## Scripting

If everything above looks OK, you can use the scripts below to insert the relevant updates into the collection.

In [33]:
add_source =F"digital-land collection-add-source {dataset} '{endpoint_url}' organisation {organisation} documentation-url '{documentation_url}'"

if plugin is not None:
    add_source = add_source + F" plugin {plugin}"

print ("OPTION 1 -------------------")
print ("")
print ("")
print (add_source)
print ("")
print (F"make")
print ("")

print ("OPTION 2 (Better)-------------------")
header = "organisation,documentation-url,endpoint-url,start-date,pipelines,plugin"
# Leave the plugin column empty rather than writing the literal text "None"
line = F"{organisation},{documentation_url},{endpoint_url},{start_date},{dataset},{plugin or ''}"
print ("")
print (header)
print (line)
print ("")
print (F"Save the two lines above to `import.csv` and run ")
print (F"  digital-land add-endpoints-and-lookups ./import.csv {dataset}")
print ("From inside your collection folder. You need a .venv in place.\n")




OPTION 1 -------------------


digital-land collection-add-source article-4-direction-area 'https://maps.epsom-ewell.gov.uk/getOWS.ashx?MapSource=EEBC/planx&service=WFS&version=1.1.0&request=GetFeature&Typename=article4areas' organisation local-authority-eng:EPS documentation-url 'https://ckan.publishing.service.gov.uk/dataset/article-4-directions_for_planx' plugin wfs

make

OPTION 2 (Better)-------------------

organisation,documentation-url,endpoint-url,start-date,pipelines,plugin
local-authority-eng:EPS,https://ckan.publishing.service.gov.uk/dataset/article-4-directions_for_planx,https://maps.epsom-ewell.gov.uk/getOWS.ashx?MapSource=EEBC/planx&service=WFS&version=1.1.0&request=GetFeature&Typename=article4areas,2023-10-21,article-4-direction-area,wfs

Save the two lines above to `import.csv` and run 
  digital-land add-endpoints-and-lookups ./import.csv article-4-direction-area
From inside your collection folder. You need a .venv in place.
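
Joining the values with commas by hand works for this endpoint, but it would produce a broken row if a URL contained a comma. A possible alternative, sketched here with Python's standard `csv` module, quotes such values automatically. `build_import_csv` is a hypothetical helper, and writing an empty plugin column when no plugin is set is an assumption about what the command line tool expects.

```python
import csv
import io


def build_import_csv(organisation, documentation_url, endpoint_url, start_date, dataset, plugin=None):
    """Return the header and one data row as CSV text, quoting values where needed."""
    header = ["organisation", "documentation-url", "endpoint-url", "start-date", "pipelines", "plugin"]
    # Assumption: an empty string is the right way to express "no plugin".
    row = [organisation, documentation_url, endpoint_url, start_date, dataset, plugin or ""]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerow(row)
    return buffer.getvalue()


print(build_import_csv(
    "local-authority-eng:EPS",
    "https://ckan.publishing.service.gov.uk/dataset/article-4-directions_for_planx",
    "https://maps.epsom-ewell.gov.uk/getOWS.ashx?MapSource=EEBC/planx&service=WFS&version=1.1.0&request=GetFeature&Typename=article4areas",
    "2023-10-21",
    "article-4-direction-area",
    "wfs",
))
```

The returned text can be written straight to `import.csv`.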

