## Check an endpoint
This notebook runs the pipeline against a given endpoint to check whether it will succeed. It covers the collection, pipeline and dataset stages, and summarises useful information about whether the endpoint would be successful on our platform. It downloads all the relevant data to do this, so it may use a significant amount of disk space. You'll need to provide the following information:
- collection - the collection that the dataset belongs to. This can be extracted from the specification, but this notebook asks for it explicitly in case you want to test the pipeline on something that isn't included in the main site yet
- dataset - the dataset that the endpoint is meant to provide data for. Technically an endpoint can provide multiple datasets, but for this use case it should be just one. It is also the name of the pipeline that is run on the individual resources downloaded from the endpoint; the dataset and pipeline names are often the same
- organisation - the organisation identifier to be used for the endpoint
- endpoint url - the actual URL of the endpoint
- plugin - the plugin used to download the data; this is only needed for specific endpoints

If you see errors relating to digital-land, try:

`pip3 install -e git+https://github.com/digital-land/pipeline.git#egg=digital-land`

And restart the notebook.

In [None]:
import os
import pandas as pd
import urllib.parse
from functions import run_endpoint_workflow, missing_columns
from sqlite_query_functions import DatasetSqlite
from convert_functions import convert_resource
from digital_land.collection import Collection
from data_file import get_duplicates_between_orgs
from download_data import download_dataset

In [None]:
# Extend these lists as/when you need to add other collections

# collection_name = 'article-4-direction-collection'
# collection_name = 'brownfield-land-collection'
# collection_name = 'conservation-area-collection'
# collection_name = 'flood-risk-zone-collection'
# collection_name = 'listed-building-collection'
# collection_name = 'tree-preservation-order-collection'
collection_name = 'developer-contributions'

# dataset = 'article-4-direction'
# dataset = 'article-4-direction-area'
# dataset = 'brownfield-land'
# dataset = 'conservation-area'
# dataset = 'conservation-area-document'
# dataset = 'flood-risk-zone'
# dataset = 'listed-building-outline'
# dataset = 'tree'
# dataset = 'tree-preservation-order'
# dataset = 'tree-preservation-zone'
# dataset = 'developer-agreement-transaction'
# dataset = 'developer-agreement-contribution'
dataset = 'developer-agreement'

# additional_column_mappings=None
# additional_concats=None

# plugin = None
# plugin = 'arcgis'
# plugin = 'wfs'
 

# EXAMPLE / TEST DATA HERE

# collection_name = 'brownfield-land-collection'
# dataset = 'brownfield-land'
organisation = 'local-authority-eng:TEW'
endpoint_url = 'https://tewkesburyborough-my.sharepoint.com/:x:/g/personal/website_tewkesburyborough_onmicrosoft_com/EUtZYznG_r9DmnRTN2GJFywBQDyfS9oWczBkcHk6DEqj5A?e=V8myPt&download=1'
documentation_url = "https://tewkesbury.gov.uk/services/planning/community-infrastructure-levy-cil/developer-contributions/"
start_date=""
plugin = ''
licence = "ogl3"

reference_column = ""

additional_column_mappings=None
additional_concats=None

# generic data_dir setting
data_dir = '../data/endpoint_checker'

# example of playing with additional configs
# additional_concats = [{
#     'dataset':'tree-preservation-zone',
#     'endpoint':'de1eb90a8b037292ef8ae14bfabd1184847ef99b7c6296bb6e75379e6c1f9572',
#     'resource':'e6b0ccaf9b50a7f57a543484fd291dbe43f52a6231b681c3a7cc5e35a6aba254',
#     'field':'reference',
#     'fields':'REFVAL;LABEL',
#     'separator':'/'
# }]


In [None]:
run_endpoint_workflow(
    collection_name,
    dataset,
    organisation,
    endpoint_url,
    plugin,
    data_dir,
    additional_col_mappings=additional_column_mappings,
    additional_concats=additional_concats
)

### Collection log summaries

We need to establish whether a resource was downloaded from the endpoint and whether there were any issues during the collection process. Examine the output of the cells below. There should be one log entry for the attempt to download from the endpoint. If the status code is 200, the resource was downloaded successfully.

In [None]:
try:
    # Initialize the collection from the specified directory
    collection = Collection(os.path.join(data_dir,'collection'))
    
    # Load the collection
    collection.load(directory=os.path.join(data_dir,'collection'))
    
    # Access the records of a resource within the collection
    collection.resource.records
    
    print("Download has succeeded.")
except Exception as e:
    # If anything goes wrong, print the error and status code
    print(f"Download failed with error: {e}")
    if hasattr(e, 'status'):
        print(f"Status code: {e.status}")


In [None]:
logs = collection.log.entries
logs = pd.DataFrame.from_records(logs)
logs
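
When there are many log entries, it can help to filter out the failures rather than read the whole table. The sketch below uses toy data and a hypothetical helper, `failed_downloads`; it assumes the log table has a `status` column holding the HTTP status code, which may not match the real column names in `collection.log.entries`.

```python
import pandas as pd

# Toy log entries for illustration only; real logs may use different columns.
example_logs = pd.DataFrame.from_records([
    {"endpoint": "abc123", "status": "200", "exception": ""},
    {"endpoint": "def456", "status": "404", "exception": ""},
    {"endpoint": "ghi789", "status": "", "exception": "ConnectionError"},
])


def failed_downloads(logs: pd.DataFrame) -> pd.DataFrame:
    """Return the log rows whose status is anything other than 200."""
    # Compare as strings so that '200' and 200 are treated alike.
    return logs[logs["status"].astype(str) != "200"]


failed = failed_downloads(example_logs)
print(failed)
```

An empty result would suggest every attempt returned a 200; any rows left over are the ones worth investigating.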


### Check unassigned entities
Detect and assign entity numbers where entries are currently unassigned. Examine the list below to see which entities (if any) have been assigned. If you were to include these in an actual pipeline, you would need to update the configuration lookup.csv with these values. It's worth checking that they are sensible before doing so.

In [None]:
unassigned_entries = pd.read_csv(os.path.join(data_dir,'var','cache','unassigned-entries.csv'))
if len(unassigned_entries) == 0:
    print('No additional entity numbers required')
else:
    print(F"{len(unassigned_entries)} unassigned entities\n")
    print(unassigned_entries)

### Check logs collated from the pipeline process
We need to read the logs and check whether all the data points were read in correctly. This uses the sqlite database with some custom queries. If the pipeline fails, you can examine the CSVs directly instead.

First, check the column mappings to see which columns the pipeline mapped automatically. If this is empty or has missing values, that is likely the reason data isn't appearing at the end.

In [None]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_column_mappings()
df = pd.read_csv('https://raw.githubusercontent.com/digital-land/specification/main/specification/dataset-field.csv')
expected_columns = df.groupby('dataset')['field'].apply(list).to_dict()
missing_columns(results, dataset, expected_columns)
results
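
The idea behind the missing-column check can be illustrated with a plain comparison between the fields the specification expects and the fields that were mapped. This is a minimal sketch with a hypothetical helper, `find_missing_fields`, and made-up field lists; it is not the implementation of `missing_columns` in `functions`.

```python
def find_missing_fields(mapped_fields, expected_fields):
    """Return the expected fields that have no column mapped to them.

    The order of `expected_fields` is preserved in the result.
    """
    mapped = set(mapped_fields)
    return [field for field in expected_fields if field not in mapped]


# Illustrative values only; real field names come from dataset-field.csv.
expected = ["reference", "name", "geometry", "start-date"]
mapped = ["reference", "geometry"]

missing = find_missing_fields(mapped, expected)
print(missing)
```

Fields that appear in the result are the ones to look for in the raw resource under a different column name, as they may need an additional column mapping.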

### Issues Logs

Lists all of the issues and warnings, their severities and who is responsible for addressing them.

In [None]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
dataset_issues = dataset_db.get_issues_by_type()

def get_issue_types_with_responsibility_and_severity():
    """Fetch the issue_type reference table (responsibility and severity) from datasette."""
    datasette_url = "https://datasette.planning.data.gov.uk/"
    params = urllib.parse.urlencode({
        "sql": """
        select issue_type, responsibility, severity
        from issue_type
        """,
        "_size": "max"
    })

    url = f"{datasette_url}digital-land.csv?{params}"
    df = pd.read_csv(url)
    return df
    
issue_reference = get_issue_types_with_responsibility_and_severity()
matched_dataset_issues = dataset_issues.merge(issue_reference, on='issue_type', how='left')
matched_dataset_issues
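
Because the merge above is a left join, any issue type missing from the reference table ends up with empty responsibility and severity values. The sketch below shows one way to spot those rows and to total issues per responsibility. The data, including the `count` column and the responsibility and severity values, is made up for illustration and may not match what `get_issues_by_type` returns.

```python
import pandas as pd

# Toy issue counts and reference table; values are illustrative assumptions.
issues = pd.DataFrame({
    "issue_type": ["invalid geometry", "unknown entity", "made-up type"],
    "count": [3, 10, 1],
})
reference = pd.DataFrame({
    "issue_type": ["invalid geometry", "unknown entity"],
    "responsibility": ["external", "internal"],
    "severity": ["error", "error"],
})

# indicator=True adds a `_merge` column showing where each row came from.
matched = issues.merge(reference, on="issue_type", how="left", indicator=True)

# Rows marked 'left_only' had no match in the reference table.
unmatched = matched[matched["_merge"] == "left_only"]

# groupby drops rows with a missing key, so unmatched rows are excluded here.
by_responsibility = matched.groupby("responsibility")["count"].sum()

print(unmatched[["issue_type", "count"]])
print(by_responsibility)
```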

#### Look at a specific problem type in more detail

Take an `issue_type` value from the table above.

In [None]:
results = dataset_db.get_issues()

#problem = 'OSGB out of bounds of England'
#problem = 'invalid geometry'
# problem = 'Unexpected geom type'
problem = 'OSGB'

results.loc[(results.issue_type == problem) ]
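
Note that `==` only matches the issue type exactly, so a short value such as 'OSGB' will not match a longer type such as 'OSGB out of bounds of England'. If a partial match is wanted, a substring filter is one option. The sketch below uses toy data; the `line_number` column is an assumption for illustration.

```python
import pandas as pd

# Toy issues table for illustration only.
example_issues = pd.DataFrame({
    "issue_type": ["OSGB out of bounds of England", "invalid geometry", "OSGB"],
    "line_number": [4, 9, 12],
})

# Exact match: only rows whose issue_type is exactly 'OSGB'.
exact = example_issues.loc[example_issues.issue_type == "OSGB"]

# Substring match: regex=False treats the pattern as plain text,
# and na=False treats missing values as non-matching.
partial = example_issues.loc[
    example_issues.issue_type.str.contains("OSGB", regex=False, na=False)
]

print(len(exact), len(partial))
```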



### Final dataset 

Shows the end result of the processing. You should see a decent number of these columns populated with data from the raw resources above.

In [None]:
dataset_db = DatasetSqlite(os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))
results = dataset_db.get_entities()

final_length = len(results)

print("")
print (F"Final data contains {final_length} records")

results

### Existing Duplicate Entities between organisations

This downloads the sqlite database for the current dataset, compares the entities from the current endpoint with the existing ones and identifies duplicates across all organisations.

In [None]:
download_dataset(dataset,collection_name,f"{data_dir}/entity_resolution")
dataset_path = os.path.join(f"{data_dir}/entity_resolution",f'{dataset}.sqlite3')
# Build the local dataset path from data_dir rather than hard-coding it
duplicates = get_duplicates_between_orgs(dataset_path,os.path.join(data_dir,'dataset',f'{dataset}.sqlite3'))

if duplicates.empty:
    print("No duplicate entities found with existing entities")

duplicates

### Duplicates with different entity numbers

In [None]:
if not duplicates.empty:
    filtered_df = duplicates[duplicates['primary_entity'] != duplicates['secondary_entity']]
    # A bare expression inside an if block is not rendered, so display it explicitly
    display(filtered_df)

To merge the duplicated entities, use the corresponding `primary_entity` value as the entity number for each entity.

### Possible Internal Duplicate Entities

The table below displays duplicates within the provided data, identified using the geographical information (the geometry and point columns).  
Sometimes these are legitimate, but it is worth checking the source data to make sure they pass the sniff test.

In [None]:
if not (results[['geometry', 'point']].apply(lambda x: x.str.strip() == '')).all().all():
    grouped = results.groupby(['geometry', 'point'])
    grouped_list=[]
    for key, value in (grouped.groups).items():
        if len(value) > 1 and key[1] !='':
            filtered_df = results[(results['geometry'] == key[0]) & (results['point'] == key[1])]
            grouped_list.append(filtered_df)

    # Report duplicates even when there is only one duplicated group
    if grouped_list:
        for duplicate_group in grouped_list:
            display(duplicate_group)
    else:
        print("No internal duplicates found in the given endpoint")
else:
    print("No geometry or point data in the dataset")
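
The same kind of check can also be written with `DataFrame.duplicated`, which flags rows that share values in the chosen columns. This is an alternative sketch on toy data, not the approach used in the cell above. Unlike that cell, it treats a row as a candidate if either the geometry or the point is populated.

```python
import pandas as pd

# Toy entities: 1 and 2 share a geometry and point; 3 and 4 have no geo data.
example = pd.DataFrame({
    "entity": [1, 2, 3, 4],
    "geometry": ["POLYGON A", "POLYGON A", "", ""],
    "point": ["POINT A", "POINT A", "", ""],
})

# Ignore rows where both geographic columns are blank, otherwise every
# empty row would be reported as a duplicate of every other empty row.
has_geo = (example["geometry"].str.strip() != "") | (example["point"].str.strip() != "")
candidates = example[has_geo]

# keep=False marks every member of a duplicated group, not just the repeats.
internal_duplicates = candidates[
    candidates.duplicated(subset=["geometry", "point"], keep=False)
]

print(internal_duplicates)
```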

### RAW (ish) DATA

This is the lightly processed data. 

In [None]:
# load in raw resources
collection = Collection(os.path.join(data_dir,'collection'))
collection.load(directory=os.path.join(data_dir,'collection'))
resources = collection.resource.entries
resources

In [None]:
resource = resources[0]['resource']
resource_path = os.path.join(data_dir,'collection','resource',resource)

print (F"Reading raw resource from {resource_path}")

try:
    raw_resource = pd.read_csv(resource_path)
except (UnicodeDecodeError,TypeError,pd.errors.ParserError):
    converted_resource_dir = os.path.join(data_dir,'var','converted_resources')
    converted_resource_path = os.path.join(converted_resource_dir,f'{resource}.csv') 
    if not os.path.exists(converted_resource_path):
        convert_resource(resource,resource_path,converted_resource_dir,dataset)
    print (F"Failed - reading from {converted_resource_path} instead.")
    raw_resource = pd.read_csv(converted_resource_path)

raw_length = len(raw_resource)
print("")
print (F"Raw data contains {raw_length} records")

raw_resource


### Duplicate Values in Reference Columns
The provided reference field must contain unique values. This will check whether there are any duplicated values in the reference column selected (at the top of the Jupyter Notebook). If the reference_column variable is an empty string ("") the aggregates will be calculated for all fields.

Please note "**count**" is the number of entries in a field **not including NaN values**, "**size**" is the number of entries in a field **including NaN values**, and "**nunique**" is the number of unique values. An appropriate primary key should not include any NaN values, and all values must be unique. 

For the ideal reference/primary key field: `count = nunique = size`.

Please note calculations are run on raw(ish) data generated above.

In [None]:
if reference_column == "":
    duplicate_reference_check = raw_resource.agg(['count', 'size', 'nunique'])
else:
    duplicate_reference_check = raw_resource[reference_column].agg(['count', 'size', 'nunique'])
duplicate_reference_check
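
To make the three aggregates concrete, here is a small worked example on toy data. The `reference` column contains one repeated value and one missing value, so it fails the `count = nunique = size` test, while the `name` column passes. The helper `is_valid_key` is hypothetical and only illustrates the rule described above.

```python
import pandas as pd

# Toy data: 'reference' has a duplicate ('A2') and a missing value.
example = pd.DataFrame({
    "reference": ["A1", "A2", "A2", None],
    "name": ["one", "two", "three", "four"],
})

summary = example.agg(["count", "size", "nunique"])
print(summary)


def is_valid_key(column: pd.Series) -> bool:
    """True when the column has no missing values and no repeated values."""
    return bool(column.count() == column.size == column.nunique())


print(is_valid_key(example["reference"]))
print(is_valid_key(example["name"]))
```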

### Scripting

If everything above looks OK, you can use the script below to generate the relevant updates for the collection.

In [None]:
print ( F"IMPORTING INTO {collection_name} -------------------")
print ("")
print ("touch import.csv")



header = "organisation,documentation-url,endpoint-url,start-date,pipelines,plugin,licence"

# Values must follow the header order: plugin comes before licence
line = F"{organisation},{documentation_url},{endpoint_url},{start_date},{dataset},{plugin or ''},{licence}"

# Strip the '-collection' suffix only when it is present, so names such as
# 'developer-contributions' are passed through unchanged
suffix = '-collection'
collection = collection_name[:-len(suffix)] if collection_name.endswith(suffix) else collection_name

if not licence:
    print("The licence field cannot be null or empty")
elif not documentation_url:
    print("The documentation_url field cannot be null or empty")
else:
    print ("")
    print (header)
    print (line)
    print ("")
    print (F"Save the two lines above to `import.csv` and run the line below from inside your collection folder. You need a .venv in place.\n")
    print ("")
    print (F"digital-land add-endpoints-and-lookups ./import.csv {collection}")
    print ("")
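
Joining the values with commas by hand works as long as no value contains a comma itself. If a URL ever did, Python's `csv` module would quote it correctly. The sketch below is one possible way to build the same two lines; the helper `build_import_csv` and the example.com URLs are hypothetical, and the column names are taken from the header above.

```python
import csv
import io

FIELDNAMES = [
    "organisation", "documentation-url", "endpoint-url",
    "start-date", "pipelines", "plugin", "licence",
]


def build_import_csv(row: dict) -> str:
    """Return header and one data row as CSV text, quoting values as needed."""
    buffer = io.StringIO()
    # Keys missing from `row` are written as empty strings (restval default).
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


# Placeholder values for illustration; the endpoint URL contains a comma.
import_text = build_import_csv({
    "organisation": "local-authority-eng:TEW",
    "documentation-url": "https://example.com/docs",
    "endpoint-url": "https://example.com/data?fields=a,b",
    "pipelines": "developer-agreement",
    "licence": "ogl3",
})
print(import_text)
```

The returned text could then be written to `import.csv` instead of being copied from the printed output.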



