### Check an endpoint
This notebook runs the pipeline against a given endpoint to check whether it will succeed. The run covers the collection, pipeline and dataset stages. It also summarises useful information about whether the endpoint would be successful on our platform. It downloads all the relevant data to do this, so it may be disk intensive. You'll need to provide the following information:
- collection - the collection that the dataset belongs to. This can be extracted from the specification, but this notebook asks you to provide it in case you want to test the pipeline on something that isn't currently included in the main site
- dataset - the dataset that the endpoint is meant to provide data for. Technically this can be multiple datasets, but for this use case it should be just one. It is also the name of the pipeline that is run on the individual resources downloaded from the endpoint; the dataset and pipeline names are often the same
- organisation - the organisation identifier to be used for the endpoint
- endpoint url - the actual URL needed for the endpoint
- plugin - we often use plugins to download the data; this is only needed for specific endpoints
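
Before running the workflow it can help to sanity-check the inputs listed above. The sketch below is not part of this repository: `check_inputs` is a hypothetical helper, and the rules it applies (a `prefix:reference` style organisation identifier, an http(s) URL) are assumptions based on the example values used in this notebook.

```python
from urllib.parse import urlparse


def check_inputs(collection_name, dataset, organisation, endpoint_url, plugin=None):
    """Return a list of human-readable problems; an empty list means the inputs look plausible.

    Hypothetical helper: the checks are assumptions inferred from the example
    values in this notebook, not rules enforced by digital-land.
    """
    problems = []
    if not collection_name:
        problems.append("collection name is empty")
    if not dataset:
        problems.append("dataset is empty")
    # Example identifiers look like 'local-authority-eng:SWK'
    if ":" not in (organisation or ""):
        problems.append("organisation should look like 'prefix:reference'")
    parsed = urlparse(endpoint_url or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("endpoint url should be an absolute http(s) URL")
    if plugin is not None and not isinstance(plugin, str):
        problems.append("plugin should be None or a string such as 'arcgis'")
    return problems


example_problems = check_inputs(
    "conservation-area-collection",
    "conservation-area",
    "local-authority-eng:SWK",
    "https://www.southwark.gov.uk/assets/attach/194104/Conservation-Areas.gpkg",
)
print(example_problems)
```

An empty list only means the values are well formed; it says nothing about whether the endpoint actually responds.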

If you see errors regarding digital-land, try:

`pip install -e git+https://github.com/digital-land/pipeline.git#egg=digital-land`

And restart the notebook.
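
To confirm whether the package is visible to the current kernel before running the imports, a minimal sketch using only the standard library is shown here. `is_installed` is a hypothetical helper name; the module name `digital_land` is taken from the import used in the first code cell.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("digital_land"):
    print("digital_land not found: run the pip command above and restart the kernel")
```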

In [1]:
import os
import pandas as pd
from functions import run_endpoint_workflow
from sqlite_query_functions import DatasetSqlite
from convert_functions import convert_resource
from digital_land.collection import Collection

In [2]:
# southwark conservation area endpoint
# collection_name = 'conservation-area-collection'
# dataset = 'conservation-area'
# organisation = 'local-authority-eng:SWK'
# endpoint_url = 'https://www.southwark.gov.uk/assets/attach/194104/Conservation-Areas.gpkg'
# plugin = None
# additional_column_mappings=None
# additional_concats=None

# doncaster tpos
# collection_name = 'tree-preservation-order-collection'
# dataset = 'tree-preservation-zone'
# organisation = 'local-authority-eng:DNC'
# endpoint_url='https://maps.doncaster.gov.uk/server/rest/services/Planning/TPO_Map/MapServer/1'
# plugin = 'arcgis'
# additional_column_mappings=None
# additional_concats=None

# 
collection_name = 'conservation-area-collection'
dataset = 'conservation-area'
organisation = 'local-authority-eng:NET'
endpoint_url = 'https://datamillnorth.org/download/2rlwm/5785bd0c-b3d2-4724-b7f4-8b6f3ff231c9/conservation_areas.csv'
plugin = None

additional_column_mappings=None
additional_concats=None

# generic data_dir setting
data_dir = '../data/endpoint_checker'

# example playing with additional configs
# additional_concats = [{
#     'dataset':'tree-preservation-zone',
#     'endpoint':'de1eb90a8b037292ef8ae14bfabd1184847ef99b7c6296bb6e75379e6c1f9572',
#     'resource':'e6b0ccaf9b50a7f57a543484fd291dbe43f52a6231b681c3a7cc5e35a6aba254',
#     'field':'reference',
#     'fields':'REFVAL;LABEL',
#     'separator':'/'
# }]


In [3]:
run_endpoint_workflow(
    collection_name,
    dataset,
    organisation,
    endpoint_url,
    plugin,
    data_dir,
    additional_col_mappings=additional_column_mappings,
    additional_concats=additional_concats
)

HTTP Error 404: Not Found
HTTP Error 404: Not Found
../data/endpoint_checker/collection
[{'resource': '222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a210bd79cb5def57e31', 'bytes': '', 'endpoints': '431efe2f2cbc73eb181f67ce9cc27cc61f88236835eb2b3b96ee6a54ceb6458a', 'organisations': 'local-authority-eng:CAM', 'datasets': 'listed-building-outline', 'start-date': '2023-10-10', 'end-date': ''}]




#### Collection log summaries

We need to establish whether a resource was downloaded from the endpoint and whether there were any issues during the collection process. Examine the output of the cells below. There should be one log entry for the attempt made at downloading from the endpoint. If the status code is 200, the resource was downloaded successfully.

In [4]:
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))

collection.resource.records

{'222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a210bd79cb5def57e31': [{'resource': '222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a210bd79cb5def57e31',
   'bytes': '',
   'organisations': 'local-authority-eng:CAM',
   'datasets': 'listed-building-outline',
   'endpoints': '431efe2f2cbc73eb181f67ce9cc27cc61f88236835eb2b3b96ee6a54ceb6458a',
   'start-date': '2023-10-10',
   'end-date': ''}]}

In [5]:

logs = collection.log.entries
logs = pd.DataFrame.from_records(logs)
logs

| | bytes | content-type | elapsed | endpoint | resource | status | entry-date | start-date | end-date | exception |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | text/csv; charset=utf-8 | 1.162 | 431efe2f2cbc73eb181f67ce9cc27cc61f88236835eb2b... | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | 200 | 2023-10-10T10:17:04.542539 | | | |
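
The status check described in this section can be sketched as a small filter over the logs DataFrame. This is illustrative only: `failed_attempts` is a hypothetical helper, the example rows are made up, and it assumes the `status` column may hold strings or blanks, so it is coerced to numbers first.

```python
import pandas as pd


def failed_attempts(logs: pd.DataFrame) -> pd.DataFrame:
    """Return log rows whose status is not 200 (blank or non-numeric statuses count as failures)."""
    status = pd.to_numeric(logs["status"], errors="coerce")
    return logs[status != 200]


# Made-up rows for illustration; real rows come from collection.log.entries
example_logs = pd.DataFrame(
    [
        {"endpoint": "a", "status": "200", "exception": ""},
        {"endpoint": "b", "status": "404", "exception": ""},
        {"endpoint": "c", "status": "", "exception": "ConnectionError"},
    ]
)
print(failed_attempts(example_logs))
```

An empty result means every download attempt in the log returned 200.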


### Check unassigned entities
This process automatically detects entries that do not yet have an entity number and assigns one. Examine the list below to see which entities (if any) have been assigned. If you were to include these in an actual pipeline, you would need to update the configuration lookup.csv with these values. It's worth checking that they are sensible before this happens.

In [6]:
unassigned_entries = pd.read_csv(os.path.join(data_dir,'var','cache','unassigned-entries.csv'))
if len(unassigned_entries) == 0:
    print('no additional entity numbers were required')
else:
    print(unassigned_entries)

                 organisation                   prefix reference
0     local-authority-eng:CAM  listed-building-outline    LB1859
1     local-authority-eng:CAM  listed-building-outline    LB1481
2     local-authority-eng:CAM  listed-building-outline    LB1872
3     local-authority-eng:CAM  listed-building-outline    LB1531
4     local-authority-eng:CAM  listed-building-outline    LB1532
...                       ...                      ...       ...
1956  local-authority-eng:CAM  listed-building-outline     LB117
1957  local-authority-eng:CAM  listed-building-outline    LB1648
1958  local-authority-eng:CAM  listed-building-outline    LB1854
1959  local-authority-eng:CAM  listed-building-outline    LB1724
1960  local-authority-eng:CAM  listed-building-outline    LB1958

[1961 rows x 3 columns]
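
One way to check that the unassigned entries are sensible is to count them per organisation and prefix and to look for repeated references. The sketch below assumes the three columns shown in the output above (`organisation`, `prefix`, `reference`); `summarise_unassigned` is a hypothetical helper and the example rows are illustrative, with a repeated reference added deliberately.

```python
import pandas as pd


def summarise_unassigned(entries: pd.DataFrame):
    """Return (counts per organisation/prefix, rows whose reference appears more than once)."""
    counts = (
        entries.groupby(["organisation", "prefix"])
        .size()
        .reset_index(name="count")
    )
    repeated = entries[
        entries.duplicated(["organisation", "prefix", "reference"], keep=False)
    ]
    return counts, repeated


# Illustrative rows only; LB1859 is repeated on purpose
example_entries = pd.DataFrame(
    {
        "organisation": ["local-authority-eng:CAM"] * 3,
        "prefix": ["listed-building-outline"] * 3,
        "reference": ["LB1859", "LB1481", "LB1859"],
    }
)
example_counts, example_repeated = summarise_unassigned(example_entries)
print(example_counts)
print(example_repeated)
```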


#### Check logs collated from the pipeline process
We need to read the logs and examine to see if the data points were all read in correctly. This uses the sqlite database to do so with some custom queries. You could directly examine the csvs if the pipeline fails.

First, check the column mappings to see which columns the pipeline mapped automatically. If this is empty or missing values, that is likely to be the reason data isn't appearing at the end.

In [7]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_column_mappings()

results

| | end_date | entry_date | field | dataset | start_date | resource | column |
|---|---|---|---|---|---|---|---|
| 0 | | 2023-10-10T10:17:22Z | end-date | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | end-date |
| 1 | | 2023-10-10T10:17:22Z | entry-date | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | entry-date |
| 2 | | 2023-10-10T10:17:22Z | geometry | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | geometry |
| 3 | | 2023-10-10T10:17:22Z | listed-building | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | listed-building |
| 4 | | 2023-10-10T10:17:22Z | listed-building-grade | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | listed-building-grade |
| 5 | | 2023-10-10T10:17:22Z | name | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | name |
| 6 | | 2023-10-10T10:17:22Z | notes | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | notes |
| 7 | | 2023-10-10T10:17:22Z | reference | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | reference |
| 8 | | 2023-10-10T10:17:22Z | start-date | listed-building-outline | | 222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a... | start-date |
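
To spot missing mappings quickly, the mapped fields can be compared against the fields you expect the dataset to have. This is a sketch under assumptions: `unmapped_fields` is a hypothetical helper, and the expected field list is something you would supply yourself (for example from the specification); it is not read from the pipeline here.

```python
import pandas as pd


def unmapped_fields(mappings: pd.DataFrame, expected_fields) -> list:
    """Return the expected fields that have no row in the column mappings."""
    mapped = set(mappings["field"].dropna())
    return sorted(set(expected_fields) - mapped)


# Illustrative mappings with the same 'field' and 'column' columns as the output above
example_mappings = pd.DataFrame(
    {
        "field": ["geometry", "name", "reference"],
        "column": ["geometry", "name", "reference"],
    }
)
# Assumed expectation for illustration only
example_expected = ["geometry", "name", "reference", "start-date"]
print(unmapped_fields(example_mappings, example_expected))
```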


Next, examine the issue logs. We'll look at the types of issues being raised and list all of them. This could be improved in the future by examining the severity of each issue.

In [8]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_issues_by_type()
results

| | issue_type | count |
|---|---|---|
| 0 | default-value | 3922 |
| 1 | invalid geometry | 6 |
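
As a rough idea of what examining severity might look like, the sketch below attaches a severity label to each issue type. The `SEVERITY` mapping is entirely an assumption made for illustration; it is not taken from the pipeline or the specification, and any issue type it does not list is marked `unknown`.

```python
import pandas as pd

# Assumed, illustrative mapping; adjust to whatever your team agrees on
SEVERITY = {
    "default-value": "info",
    "invalid geometry": "error",
}


def add_severity(issues_by_type: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the issue counts with a 'severity' column added."""
    labelled = issues_by_type.copy()
    labelled["severity"] = labelled["issue_type"].map(SEVERITY).fillna("unknown")
    return labelled


# Counts copied from the output above
example_issues = pd.DataFrame(
    {"issue_type": ["default-value", "invalid geometry"], "count": [3922, 6]}
)
example_labelled = add_severity(example_issues)
print(example_labelled[example_labelled["severity"] == "error"])
```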


In [9]:
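results = dataset_db.get_issues()

# issues other than default-value (not displayed, as this is not the last expression in the cell)
results.loc[results.issue_type != 'default-value']

# alternative filter for the 'invalid geometry' issues counted above
# duff_geom = results.loc[results.issue_type == 'invalid geometry']
duff_geom = results.loc[results.issue_type == 'Unexpected geom type']

# show only the source line number and the offending value
cols = duff_geom[["line_number", "value"]]
print(cols)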


Empty DataFrame
Columns: [line_number, value]
Index: []


#### Final dataset comparison against the sqlite database

Below are two tables which show the difference between what was provided to us and what is currently in the entity table. It is important to bear in mind that we assign entities automatically in this process; the unassigned entries table above shows what we have added.

In [10]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_entities()
results.head(26)

| | dataset | end_date | entity | entry_date | geojson | geometry | json | name | organisation_entity | point | prefix | reference | start_date | typology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | listed-building-outline | | 42118245 | 2023-10-10 | | MULTIPOLYGON (((-0.129989 51.535897,-0.130021 ... | {"listed-building": "10271", "listed-building-... | (East, off) Court Building, St Pancras Coroner... | 71 | POINT(-0.130051 51.535969) | listed-building-outline | LB1859 | 2003-09-05 | geography |
| 1 | listed-building-outline | | 42118246 | 2023-10-10 | | MULTIPOLYGON (((-0.192568 51.561490,-0.192569 ... | {"listed-building": "477772", "listed-building... | (West side) Cattle Trough at junction with Her... | 71 | POINT(-0.192542 51.561486) | listed-building-outline | LB1481 | 1998-07-01 | geography |
| 2 | listed-building-outline | | 42118247 | 2023-10-10 | | MULTIPOLYGON (((-0.174260 51.541968,-0.174272 ... | {"listed-building": "492770", "listed-building... | HAMPSTEAD, ADELAIDE ROAD Swiss Cottage Regency... | 71 | POINT(-0.174453 51.542105) | listed-building-outline | LB1872 | 2006-09-18 | geography |
| 3 | listed-building-outline | | 42118248 | 2023-10-10 | | MULTIPOLYGON (((-0.134375 51.520073,-0.134312 ... | {"listed-building": "1061382", "listed-buildin... | Nos. 64-67, Nos. 2-8 | 71 | POINT(-0.134257 51.520167) | listed-building-outline | LB1531 | 2002-07-19 | geography |
| 4 | listed-building-outline | | 42118249 | 2023-10-10 | | MULTIPOLYGON (((-0.133429 51.520708,-0.133451 ... | {"listed-building": "1061383", "listed-buildin... | NORTH CRESCENT War Memorial | 71 | POINT(-0.133435 51.520699) | listed-building-outline | LB1532 | 2002-07-19 | geography |
| 5 | listed-building-outline | | 42118250 | 2023-10-10 | | MULTIPOLYGON (((-0.125589 51.513687,-0.125239 ... | {"listed-building": "1061403", "listed-buildin... | (South side) Nos.42-54 | 71 | POINT(-0.125752 51.513737) | listed-building-outline | LB1533 | 2002-07-25 | geography |
| 6 | listed-building-outline | | 42118251 | 2023-10-10 | | MULTIPOLYGON (((-0.179434 51.555601,-0.179424 ... | {"listed-building": "1067340", "listed-buildin... | (North side) No.5 | 71 | POINT(-0.179406 51.555659) | listed-building-outline | LB337 | 1950-08-11 | geography |
| 7 | listed-building-outline | | 42118252 | 2023-10-10 | | MULTIPOLYGON (((-0.179434 51.555601,-0.179442 ... | {"listed-building": "1067341", "listed-buildin... | (North side) No.6 and attached railings | 71 | POINT(-0.179495 51.555648) | listed-building-outline | LB338 | 1950-08-11 | geography |
| 8 | listed-building-outline | | 42118253 | 2023-10-10 | | MULTIPOLYGON (((-0.179549 51.555617,-0.179548 ... | {"listed-building": "1067342", "listed-buildin... | (North side) No.7 and attached railings | 71 | POINT(-0.179574 51.555673) | listed-building-outline | LB339 | 1950-08-11 | geography |
| 9 | listed-building-outline | | 42118254 | 2023-10-10 | | MULTIPOLYGON (((-0.179596 51.555590,-0.179589 ... | {"listed-building": "1067343", "listed-buildin... | (North side) No.8 and attached railings and gate | 71 | POINT(-0.179621 51.555667) | listed-building-outline | LB340 | 1950-08-11 | geography |


In [11]:
# load in raw resources
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))
resources = collection.resource.entries
resources

[{'resource': '222aa207c5ad94d02272ac94ac4d0ea93f37a733dd217a210bd79cb5def57e31',
  'bytes': '',
  'organisations': 'local-authority-eng:CAM',
  'datasets': 'listed-building-outline',
  'endpoints': '431efe2f2cbc73eb181f67ce9cc27cc61f88236835eb2b3b96ee6a54ceb6458a',
  'start-date': '2023-10-10',
  'end-date': ''}]

#### Duplicate entities within organisation

The table below displays duplicates in the provided data, identified using the geographical information (the geometry and point columns).

In [12]:
if not (results[['geometry', 'point']].apply(lambda x: x.str.strip() == '')).all().all():
    grouped = results.groupby(['geometry', 'point'])
    grouped_list=[]
    for key, value in (grouped.groups).items():
        if len(value) > 1 and key[1] !='':
            filtered_df = results[(results['geometry'] == key[0]) & (results['point'] == key[1])]
            grouped_list.append(filtered_df)

    # report duplicates even when only a single duplicate group is found
    if len(grouped_list) > 0:
        for duplicate_group in grouped_list:
            display(duplicate_group)
    else:
        print("No duplicates found in the given dataset")
else:
    print("No geometry or point data in the dataset")

No duplicates found in the given dataset
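
The same duplicate check can also be expressed with pandas' `DataFrame.duplicated`, which avoids building the groups by hand. This is an alternative sketch, not the notebook's method: `find_geo_duplicates` is a hypothetical helper, the example rows are made up, and unlike the groupby version it treats two missing geometries with the same point as equal.

```python
import pandas as pd


def find_geo_duplicates(entities: pd.DataFrame) -> pd.DataFrame:
    """Return rows that share both geometry and point with at least one other row."""
    # Ignore rows without a point, mirroring the key[1] != '' check in the cell above
    has_point = entities["point"].fillna("").str.strip() != ""
    candidates = entities[has_point]
    duplicates = candidates[candidates.duplicated(["geometry", "point"], keep=False)]
    return duplicates.sort_values(["geometry", "point"])


# Made-up rows for illustration
example_entities = pd.DataFrame(
    {
        "entity": [1, 2, 3, 4],
        "geometry": ["POLYGON A", "POLYGON A", "POLYGON B", "POLYGON C"],
        "point": ["POINT(0 0)", "POINT(0 0)", "POINT(1 1)", ""],
    }
)
example_duplicates = find_geo_duplicates(example_entities)
print(example_duplicates)
```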


In [13]:
# currently this just reads in the raw resource, but in the future this should check for
# a converted resource first
resource = resources[0]['resource']
resource_path = os.path.join(data_dir,'collection','resource',resource)
try:
    raw_resource = pd.read_csv(resource_path)
except (UnicodeDecodeError,TypeError,pd.errors.ParserError):
    converted_resource_dir = os.path.join(data_dir,'var','converted_resources')
    converted_resource_path = os.path.join(converted_resource_dir,f'{resource}.csv') 
    if not os.path.exists(converted_resource_path):
        convert_resource(resource,resource_path,converted_resource_dir,dataset)
    raw_resource = pd.read_csv(converted_resource_path)
    

In [14]:
raw_resource

| | reference | name | listed-building | listed-building-grade | notes | start-date | end-date | entry-date | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LB1859 | (East, off) Court Building, St Pancras Coroner... | 10271 | II | | 2003-09-05 | | | POLYGON ((-0.129989 51.535897,-0.130021 51.535... |
| 1 | LB1481 | (West side) Cattle Trough at junction with Her... | 477772 | II | | 1998-07-01 | | | POLYGON ((-0.192568 51.56149,-0.192569 51.5614... |
| 2 | LB1872 | HAMPSTEAD, ADELAIDE ROAD Swiss Cottage Regency... | 492770 | II | | 2006-09-18 | | | POLYGON ((-0.17426 51.541968,-0.174269 51.5419... |
| 3 | LB1531 | Nos. 64-67, Nos. 2-8 | 1061382 | II | | 2002-07-19 | | | POLYGON ((-0.134375 51.520073,-0.134506 51.520... |
| 4 | LB1532 | NORTH CRESCENT War Memorial | 1061383 | II | | 2002-07-19 | | | POLYGON ((-0.133429 51.520708,-0.133419 51.520... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1956 | LB117 | (West side) Tomb of Joseph Edwards in Highgate... | 1378949 | II | | 1974-05-14 | | | POLYGON ((-0.149322 51.567993,-0.149323 51.567... |
| 1957 | LB1648 | Thirteen lamp posts and one bollard at south e... | 1067388 | II | | 1974-05-14 | | | MULTIPOLYGON (((-0.146389 51.532315,-0.14639 5... |
| 1958 | LB1854 | (South West side) Wall to south-east of Terrac... | 1113180 | II | | 1999-12-30 | | | MULTIPOLYGON (((-0.181143 51.566891,-0.181083 ... |
| 1959 | LB1724 | Railings around Euston Square Gardens | 1342039 | II | | 1974-05-14 | | | MULTIPOLYGON (((-0.132753 51.527112,-0.13276 5... |
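
Rather than comparing the raw resource and the entity table by eye, the two can be joined on `reference` to list rows that appear on only one side. The sketch below assumes both tables carry a `reference` column, as in the outputs above; `compare_references` is a hypothetical helper and the example frames are made up.

```python
import pandas as pd


def compare_references(raw: pd.DataFrame, entities: pd.DataFrame):
    """Return (references only in the raw resource, references only in the entity table)."""
    left = raw[["reference"]].astype(str)
    right = entities[["reference"]].astype(str)
    # indicator=True adds a '_merge' column saying which side each row came from
    merged = left.merge(right, on="reference", how="outer", indicator=True)
    only_raw = merged.loc[merged["_merge"] == "left_only", "reference"].tolist()
    only_entities = merged.loc[merged["_merge"] == "right_only", "reference"].tolist()
    return only_raw, only_entities


# Made-up data for illustration
example_raw = pd.DataFrame({"reference": ["LB1", "LB2", "LB3"]})
example_entity_table = pd.DataFrame({"reference": ["LB1", "LB2"], "entity": [101, 102]})
missing_from_entities, missing_from_raw = compare_references(example_raw, example_entity_table)
print(missing_from_entities, missing_from_raw)
```

In the notebook this would be called with `raw_resource` and the `results` DataFrame returned by `get_entities()`; references that appear only in the raw resource are the rows worth investigating in the issue logs.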
