# Data downloading architectural decision record

The problem: how do we efficiently download data from DNO data portals?

We have a few sub-problems:
1. We don't know in advance where the files we're going to download are (what URLs)
2. DNOs use different data portal tech (CKAN, OpenDataSoft, Custom) with different APIs
  (or none)
3. The files are either numerous (e.g. SSEN's one per day, NGED's hundreds per month)
  or large (e.g. NPg/UKPN's once per month)

We want to make sure we have the latest data, but minimise the bandwidth costs
to ourselves and the DNOs by only downloading stuff we haven't seen before.

We need to decide whether we're building a complete historical archive, or just trying 
to get the latest data. We also need to decide if we care about protecting ourselves 
from data loss - e.g. if a DNO publishes a new version of a file with less data in it.
This may have data privacy implications, e.g. if they publish stuff they shouldn't and
later redact it.

In [1]:
import requests

## SSEN

SSEN's dataset list is a bit hidden; I found it by looking at the network requests made by the website. It's a JSON API endpoint that lists all the datasets, and it looks like it's basically coming direct from an S3 bucket ListObjects call (judging by the format of their etags anyway).

I think SSEN do use CKAN, but all I can find in their actual CKAN instance is the postcode mapping dataset file, not this stuff.

The data returned from this API includes an etag and an "uploaded" timestamp for each file. Etags seem like the most
reliable thing to use for versioning: a hash derived from the contents is exactly what we want.
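
For reference, here's a rough sketch of how S3 derives etags with a `-N` suffix like the ones below. It assumes plain (non-KMS-encrypted) multipart uploads, where the etag is the MD5 of the concatenated per-part MD5 digests followed by the part count. `multipart_etag` is my own illustrative name, and we don't know the part size SSEN's uploader uses, so this can't reproduce their etags as-is.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute an S3-style multipart etag for `data`.

    Assumption: the object was uploaded in fixed-size parts of `part_size`
    bytes (the last part may be shorter) without KMS encryption.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # MD5 digest (raw bytes) of each part, in upload order
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    # MD5 of the concatenated digests, then "-<number of parts>"
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

The practical consequence is that the etag depends on the part size as well as the contents. A file re-uploaded with a different part size would look "changed" to us even if the bytes are identical, but a change in contents should always change the etag.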

Because they publish files so frequently (one per day), they're the data source most likely to update old files. However, they do seem to wait 5 days or so before publishing, which might suggest they allow a grace period for late meter readings to arrive so that they don't have to republish.

In [2]:
response = requests.get("https://ssen-smart-meter-prod.datopian.workers.dev/LV_FEEDER_USAGE/").json()
files = ((o["etag"], o["downloadLink"]) for o in response["objects"])
list(files)

[('59afefb87a92823d8ab0bb1489fbbe5e-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-12.csv'),
 ('597ff9204b0417c36d5b3e4079d74647-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-13.csv'),
 ('db8d997bf2f9a218e8487dca906d544f-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-14.csv'),
 ('0d3b07711f27fdeb543ea2e180aa0c9b-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-15.csv'),
 ('3d98935cdd80690c536fbc236ad53043-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-16.csv'),
 ('a4a22ba2e59e3c5e39d96035fd1cd279-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-17.csv'),
 ('793b77b5d7ef2931066e5b241d018608-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-18.csv'),
 ('20619dd86b56e1b35ccf58f2e1bd0398-97',
  'https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-02-19.csv'),
 ('a86595e531312bc1c8c0045b84632

## UKPN

UKPN use a different data portal technology, OpenDataSoft, and have (so far) put their data actually _inside_ the data portal's database. They have just a single "dataset" for everything, and the file downloads are "exports" of that dataset (in OpenDataSoft's terminology). There's therefore no real need to list anything: we can just go straight to the export of the dataset. I've chosen parquet in my example but there are lots of options.

In theory we could use their portal's query language, ODSQL (https://help.opendatasoft.com/apis/ods-explore-v2/#tag/Dataset), to reduce the data returned, but I don't think it's worth the investment: this approach surely isn't going to scale, so they're going to have to do something totally different.

For now I think we should never download this again after the first time - we're fairly confident it isn't going to change because it's just their prototype. 

Note you need to authenticate, but it returns a valid 0 row file if you don't!

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()
ukpn_api_key = os.environ["UKPN_API_KEY"]
ukpn_url = f"https://ukpowernetworks.opendatasoft.com/api/explore/v2.1/catalog/datasets/ukpn-smart-meter-consumption-lv-feeder/exports/parquet?apikey={ukpn_api_key}"
# Can't fetch exactly but you get the idea
# Note it takes bloody ages to get this because it's being exported on the fly!
# import pandas as pd
# df = pd.read_parquet(ukpn_url)
# df.info()

## Northern Powergrid

NPg also use OpenDataSoft, but they have modelled it differently. Their "dataset" contains no data, it's just a placeholder to attach a bunch of file "attachments" to. They attach both substation- and feeder-level data to the same dataset, so we have to do a bit of fragile filtering based on the names of the files. They also make you log in to download it.

There are no etags or other metadata to help us here, so we can't tell if a file has changed or not. We can either assume it's not going to change and just download it once, or download it every time. They wait even longer than SSEN to publish stuff, so it seems unlikely the meter readings are going to change.

In [20]:
import os
from dotenv import load_dotenv
load_dotenv()
npg_api_key = os.environ["NPG_API_KEY"]
npg_url = f"https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart-metering-data/attachments/?apikey={npg_api_key}"
data = requests.get(npg_url).json()
urls = [a["href"] for a in data["attachments"] if "feeder" in a["metas"]["id"]]
list(urls)

['https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart-metering-data/attachments/npg_feeder_december2023_zip',
 'https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart-metering-data/attachments/npg_feeder_november2023_zip',
 'https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart-metering-data/attachments/npg_feeder_january2024_zip',
 'https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart-metering-data/attachments/npg_feeder_february2024_zip',
 'https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart-metering-data/attachments/npg_feeder_march2024_zip',
 'https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart-metering-data/attachments/npg_feeder_april2024_zip',
 'https://northernpowergrid.opendatasoft.com/api/explore/v2.1/catalog/datasets/aggregated-smart

## NGED

NGED use CKAN, so we can use the CKAN API to list the "resources" attached to the "package" (aka dataset) they've set up. Resources are the individual files, of which there are many: nearly 1,000 already. They also require authentication - if you miss it, the URLs are "redacted" instead.

There are several metadata fields that could be used to determine if a file has changed, but I think the best one is `last_modified` - they don't provide any hash or etag.

Given the number of files I'm concerned they might overload the portal at some point, but there's no mention of pagination in the API docs for this endpoint. I think we just have to assume it's all there, but we need to keep an eye on it, especially if we don't see over 1,000 next month (the limit they apply on other API responses).

Again they wait at least a month before publishing data, so it's hard to imagine they'll add meter readings after that to older files.

In [33]:
import os
from dotenv import load_dotenv
load_dotenv()
nged_api_key = os.environ["NATIONAL_GRID_API_TOKEN"]
headers = {"Authorization": nged_api_key}
nged_url = "https://connecteddata.nationalgrid.co.uk/api/3/action/package_show?id=aggregated-smart-meter-data-lv-feeder"
response = requests.get(nged_url, headers=headers).json()
urls = [(r["last_modified"], r["url"]) for r in response["result"]["resources"]]
list(urls)


[('2024-03-01T14:01:07.253873',
  'https://connecteddata.nationalgrid.co.uk/dataset/a920c581-9c6f-4788-becc-9d2caf20050c/resource/105a7821-7f5c-4591-90e8-5915f253b1ff/download/aggregated-smart-meter-data-lv-feeder-2024-01-part0000.csv'),
 ('2024-03-01T14:01:08.136233',
  'https://connecteddata.nationalgrid.co.uk/dataset/a920c581-9c6f-4788-becc-9d2caf20050c/resource/a9cea137-9149-44df-a90e-f8a89ab8dcfa/download/aggregated-smart-meter-data-lv-feeder-2024-01-part0001.csv'),
 ('2024-03-01T14:02:51.447894',
  'https://connecteddata.nationalgrid.co.uk/dataset/a920c581-9c6f-4788-becc-9d2caf20050c/resource/a0598187-c6a7-4314-9072-ca87d3c529e1/download/aggregated-smart-meter-data-lv-feeder-2024-01-part0003.csv'),
 ('2024-03-01T14:03:47.373096',
  'https://connecteddata.nationalgrid.co.uk/dataset/a920c581-9c6f-4788-becc-9d2caf20050c/resource/23541896-d38e-45ad-9271-a1e8dbad8a35/download/aggregated-smart-meter-data-lv-feeder-2024-01-part0002.csv'),
 ('2024-03-01T14:04:53.347418',
  'https://conne

## Options

Our approach might have to vary by DNO:
- UKPN we will just download the file once and never again (for now)
- NPG we can only de-dupe based on name, unless we hash them ourselves
- SSEN we could de-dupe based on name and etag
- NGED we could de-dupe based on name and last_modified
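
To make the list above concrete, here's a minimal sketch of comparing "version tokens" per filename, where the token is whatever the DNO gives us (SSEN's etag, NGED's `last_modified`) or a hash we compute ourselves for NPg. The function names and the idea of a `seen` mapping are my own assumptions for illustration, not something we've built.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large monthly files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(remote: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return names that are new, or whose version token differs from last time.

    `remote` and `seen` both map filename -> version token (hypothetical shape).
    """
    return [name for name, token in remote.items() if seen.get(name) != token]
```

Note that hashing NPg's files ourselves means downloading them first, so it would only save us storage and reprocessing, not bandwidth.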

Having explored the options with S3 a bit, I see three main options:

1. Use S3 with versioning and custom object metadata for deduping
2. Use S3 with versioning and some additional metadata index for deduping
3. Use S3 with name-based de-duping

### Option 1
We could store the metadata we need to de-dupe in S3's custom metadata, so each file can have a DNO-etag or DNO-last-modified value. Combined with S3's versioning, this would give us a way to store a complete archive and make any pipeline run reproducible, whilst only uploading new & changed data.

The downside is that the comparison process would need to make a HEAD request for every single object in our bucket every time we run it. There's no way to list the metadata for a whole bucket in one API call. This isn't too bad for DNOs like NPg, but for NGED it could grow into several thousand S3 API requests.
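
A rough sketch of what that per-object lookup might look like, to show where the request count comes from. It assumes a boto3 S3 client is passed in and that we stored the DNO's etag under a custom metadata field called `dno-etag` (a hypothetical name); `stored_version_tokens` is likewise just illustrative.

```python
from typing import Optional


def stored_version_tokens(
    s3_client, bucket: str, keys: list[str], field: str = "dno-etag"
) -> dict[str, Optional[str]]:
    """Fetch one custom metadata value per object.

    This costs one HEAD request per key, which is the downside of Option 1.
    """
    tokens: dict[str, Optional[str]] = {}
    for key in keys:
        head = s3_client.head_object(Bucket=bucket, Key=key)
        # User-defined metadata comes back in the "Metadata" dict
        tokens[key] = head.get("Metadata", {}).get(field)
    return tokens
```

With a real client this would be called as `stored_version_tokens(boto3.client("s3"), "our-bucket", keys)`, where the bucket name is a placeholder.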

### Option 2
An obvious solution to the problems of 1. is to store the metadata elsewhere, in some kind of index file or database. Then our download process could make one API request (to our index) to get the latest metadata for every file, so it can compare to the results from the various DNOs.

The best way to do this within the S3/AWS world seems to be to set up a Lambda function to receive S3 events and then write the metadata into DynamoDB. [Here's AWS's example code for that](https://github.com/aws-samples/aws-big-data-blog/blob/master/aws-blog-s3-index-with-lambda-ddb/example-indexer-app/s3Indexer.js).

This looks pretty simple, but it's still a lot more AWS infra just to store some metadata - it adds complexity to development, deployment and monitoring.
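
For a sense of scale, a Python version of such an indexer might look roughly like the sketch below. It assumes the standard S3 event notification shape and a boto3 DynamoDB `Table` resource passed in as `table`; the function name and the attribute names in the item are hypothetical. The event only carries S3's own key, size and etag, so indexing a DNO-supplied value would presumably need an extra lookup per object.

```python
from urllib.parse import unquote_plus


def index_s3_event(event: dict, table) -> int:
    """Write one index item per object in an S3 event notification.

    Returns the number of records processed.
    """
    records = event.get("Records", [])
    for record in records:
        s3 = record["s3"]
        table.put_item(
            Item={
                # Object keys arrive URL-encoded in S3 event notifications
                "key": unquote_plus(s3["object"]["key"]),
                "bucket": s3["bucket"]["name"],
                "etag": s3["object"].get("eTag"),
                "size": s3["object"].get("size"),
            }
        )
    return len(records)
```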

### Option 3
The final option is the simplest: just use the names (object keys) of the files in the S3 bucket to de-dupe them. We can simply list the bucket to see what files we already have (n / 1000 API requests, rounded up, where n is the number of objects in the bucket).

One nice property of this is that it's also easier to swap out S3 for any other object store, or filesystem.
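
The bucket listing could be sketched like this, assuming a boto3 S3 client is passed in and that we keep each DNO's files under its own key prefix (the prefix layout and the function name are assumptions on my part). The paginator handles the 1,000-keys-per-request limit for us.

```python
def list_existing_keys(s3_client, bucket: str, prefix: str = "") -> set[str]:
    """Collect every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys: set[str] = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from the response when there are no matching objects
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys
```

Swapping S3 for a filesystem would only mean replacing this one function with a directory listing that returns the same set of names.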

To dedupe based on content or other metadata, we would need to include that in the filename, and have some way of determining the latest version. Then we would parse the listing results to separate out the data we need.

This gives us more explicit versioning of files, just without the nice API S3 offers around it - we'd probably need some shared code for listing and parsing our custom naming scheme.
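
As one possible shape for that shared code, here's a sketch of a naming scheme where the original filename becomes a "folder" and each version is stored under it as `<fetched_at>_<token>`. The scheme itself, the function names and the compact UTC timestamp format are all assumptions for illustration; it relies on timestamps that sort lexicographically and tokens that contain no `/`.

```python
def versioned_key(dno: str, filename: str, fetched_at: str, token: str) -> str:
    """Build an object key that embeds when we fetched the file and its version token."""
    return f"{dno}/{filename}/{fetched_at}_{token}"


def latest_versions(keys: list[str]) -> dict[str, tuple[str, str]]:
    """Map each '<dno>/<filename>' to the (fetched_at, token) of its newest version."""
    latest: dict[str, tuple[str, str]] = {}
    for key in keys:
        name, _, version = key.rpartition("/")
        fetched_at, _, token = version.partition("_")
        # Timestamps like 20240301T000000Z compare correctly as plain strings
        if name not in latest or fetched_at > latest[name][0]:
            latest[name] = (fetched_at, token)
    return latest
```

The download process would then compare each DNO-supplied token against `latest_versions(...)` and only upload a new version when they differ.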

Note: S3 has a 1KB object key limit, our largest filenames (before versioning info etc) are maybe 60B, so I think we're fine for space.

## Decision
Exploring these options has shown that versioning and de-duping files based on their content is more complex than I originally thought, so I'm going to take a YAGNI approach for v1:

- We'll use Option 3, de-duping solely based on original filename
- We'll simply list the bucket during downloads, assuming that any files we already have haven't changed
- We can extend this later to have an "ignore existing" option on our pipelines that downloads everything
- We won't version files in S3
- We'll ask DNOs if they ever change existing files
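
A minimal sketch of the v1 selection logic described above, assuming the object key is just a per-DNO prefix plus the last path segment of the download URL. That naming, and the function names, are my assumptions; it wouldn't work as-is for UKPN's export URL, which has no real filename.

```python
from posixpath import basename
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Take the last path segment of a URL, ignoring any query string."""
    return basename(urlsplit(url).path)


def select_downloads(
    urls: list[str],
    existing_keys: set[str],
    prefix: str = "",
    ignore_existing: bool = False,
) -> list[str]:
    """Return the URLs we still need to fetch.

    With ignore_existing=True everything is downloaded again, regardless of
    what is already in the bucket.
    """
    if ignore_existing:
        return list(urls)
    return [u for u in urls if prefix + filename_from_url(u) not in existing_keys]
```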