This repository has been archived by the owner on Jun 21, 2024. It is now read-only.

Adds data pipeline framework and some UK geographical datasources #9

Merged – 14 commits into main from data-pipeline, Dec 7, 2021

Conversation

chrisdevereux (Contributor)

Description

This PR introduces:

  • A standard way of interacting with external APIs.
  • A standard way of registering and running cron jobs.
  • A model subclass that uses both of these to regularly sync local models to an external API.
  • Data sources for UK parliamentary constituencies, current MPs and parties.
  • A Postcodes.IO client.

I won't detail it extensively here – you can read the included documentation (both code-level and conceptual) for more detail.
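As a rough illustration of the sync pattern this PR introduces – note this is plain Python, not Groundwork's actual API; the `Datasource` and `SyncedStore` names here are hypothetical stand-ins:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Datasource:
    # Hypothetical stand-in for an external API client:
    # fetch() returns a list of remote records keyed by "id".
    fetch: Callable[[], list]

@dataclass
class SyncedStore:
    # Keeps a local dict of records up to date with a Datasource.
    datasource: Datasource
    records: dict = field(default_factory=dict)

    def sync(self):
        # A registered cron job would call this on a schedule.
        for remote in self.datasource.fetch():
            self.records[remote["id"]] = remote

source = Datasource(fetch=lambda: [{"id": 1, "name": "Bristol West"}])
store = SyncedStore(datasource=source)
store.sync()
```

In the real library the synced store is a Django model subclass, so the local copy lives in the database rather than in memory.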

Motivation and Context

Django ships out of the box with an excellent ORM for modeling data created and managed by our application. This is
useful for several reasons, but mainly:

  • Consistent conventions mean fewer surprises about how different things work.
  • The standardised model/queryset format allows the framework to support generic views, forms and other components.

Unfortunately, it doesn't help so much with external data – either public data that we want to be informed of changes to, or data managed by external services.

This gets particularly tricky when we want to augment the remote data with our own data, or there are API limits that
require us to store the data locally and keep it up to date – we often end up writing lots of bug-prone glue code to
manage this.

Handling this in a standardised way decreases the amount of surprise involved when interacting with external APIs. It also makes it much easier for us to quickly support new APIs and write generic code for functionality like the above.

Possible future directions:

  • Building out a suite of external APIs.
  • DatasourceView – like Django's ModelView, but wrapping an external API instead of a model.
  • Making a signal fire when SyncedModel detects that a particular attribute has changed.
  • Integrated webhook support on Datasources so that we can support 'push' as well as 'pull'.

How Can It Be Tested?

  • Clone this PR in a development container (this should be working now)
  • Run the tests with `make test`
  • Look at the example by:
    • Running the cron tasks to pull in some test data (`python manage.py run_cron_tasks --all`)
    • Launching the example app (F5 in VS Code)
    • Navigating to http://localhost:8000/uk/mps/ to display the synced data
  • Read the docsite by:
    • Running `make serve-docs`
    • Navigating to http://localhost:8001 and looking at the Guides and Reference sections of the site.

Screenshots (if appropriate):

[Screenshot 2021-12-03 at 12 42 47]

[Screenshot 2021-12-03 at 12 42 03]

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Documentation change

Checklist:

  • I have documented all new methods using google docstring format.
  • I have added type annotations to any public API methods.
  • I have updated any relevant high-level documentation.
  • I have added a usage example to the example app if relevant.
  • I have written tests covering all new changes.

@conatus (Member) left a comment


As far as I understand this code, I think this is sound. There are a few minor documentation pieces I have highlighted, that it would be good to fix.

One question I do have is what does one do when one doesn't want to sync the data periodically with a Cron. The documentation doesn't answer this directly. We can imagine some cases where the data changes quite rapidly and is important to keep up to date. I presume we somehow override the method, but can't be sure. I don't think this blocks this very good work going in however! Maybe take a ticket to document this situation?

docs/guides/data-pipeline.md – two resolved inline comments (outdated)

> Unfortunately, they don't help so much with external data - either public data that we want to be infomred of changes to or data managed by external services.
>
> This gets particuarly tricky when we want to augment the remote data with our own data or there are api limits that
Member:

API as opposed to api.

@conatus (Member), Dec 6, 2021:


This to me is the crux of the "sell" of datasources, so I'm going to throw out a suggestion of how to re-word this, to tie it strongly to the reason behind this library.

Unfortunately, they don't help so much with external data - either public data that we want to be informed of changes to or data managed by external services.

This gets particularly tricky when we want to augment the remote data with our own data or there are API limits that require us to store the data locally and keep it up to date. We often end up writing lots of bug-prone glue code to manage this.

Groundwork helps here by introducing a lightweight abstraction called a `Datasource`. It might be helpful to
think of these as similar to Django's models and querysets, but for external APIs.

In the examples that follow, we use a very common use case for building out applications that help people organise. There is a campaign that needs to carve up people by the UK parliamentary constituency they are in, and to add other information the campaign is concerned about: the number of people who support an action, or the number of letters sent in a constituency to lobby an MP. There might be a model to represent such a letter, for example.

So we need to represent the constituencies and information about them against a source of truth, but augment this with things that we want to know about. Loading in all constituencies, or looking this data up on the fly, is slow or error-prone – and the data around constituencies changes very infrequently.

Datasources try to solve this situation, which we have observed a fair amount in our own work, and provide a lightweight API for doing so.
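The campaign use case above can be sketched in plain Python (the IDs, names, and counts here are purely illustrative, not real constituency data or Groundwork API):

```python
# Remote source of truth: constituency data that changes very infrequently.
remote_constituencies = [
    {"id": "C1", "name": "Bristol West"},
    {"id": "C2", "name": "Hackney North"},
]

# Local, campaign-owned data we want to layer on top of the synced copy.
letters_sent = {"C1": 42}

def constituency_report(constituencies, letters):
    # Join the locally synced constituencies with our own data,
    # avoiding slow or rate-limited on-the-fly API lookups.
    return [{**c, "letters_sent": letters.get(c["id"], 0)} for c in constituencies]

report = constituency_report(remote_constituencies, letters_sent)
```

With a synced local copy, this join is an ordinary database query rather than a round-trip to the external API.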

chrisdevereux (Author):


> One question I do have is what does one do when one doesn't want to sync the data periodically with a Cron. The documentation doesn't answer this directly. We can imagine some cases where the data changes quite rapidly and is important to keep up to date. I presume we somehow override the method, but can't be sure. I don't think this blocks this very good work going in however! Maybe take a ticket to document this situation?

So, this is actually covered in the API reference, but not in the tutorial. If you pass `None` as the value of `sync_frequency`, it won't sync on a cron. This is currently mainly useful if you are only interested in embedded resources returned by the main resource, but could extend to other use cases if we want.

One way we might want to do this (which would be particularly good for fast-changing data) is adding webhook support to datasources. I'm imagining that might look something like this:

```python
# app/urls.py

from django.urls import path, include

urlpatterns = [
    path('webhooks/', include('groundwork.core.webhooks')),
]
```

with datasources then registering themselves so that they can 'push' into features that use them, such as SyncedModel, rather than being 'pulled' on a schedule.

Or, in cases where a local copy of the data isn't important (other than view-level caching, maybe), the Datasource interface could be used to produce generic views similar to Django's ModelView and ListView.

Definitely something for the backlog.
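The 'push' idea could look roughly like this – an entirely hypothetical plain-Python sketch, not the library's API; the `register_webhook`/`receive_webhook` names and topic strings are made up:

```python
# Registry mapping webhook topics to local stores that want pushes.
handlers = {}

def register_webhook(topic, store):
    # A datasource would call this at startup to subscribe a
    # SyncedModel-like store to a topic it cares about.
    handlers.setdefault(topic, []).append(store)

def receive_webhook(topic, payload):
    # Called by the webhook endpoint: push the update into every
    # subscriber, instead of each store pulling on a cron schedule.
    for store in handlers.get(topic, []):
        store[payload["id"]] = payload

mps = {}
register_webhook("uk.mps", mps)
receive_webhook("uk.mps", {"id": 7, "name": "Example MP"})
```

The appeal for fast-changing data is that updates land as soon as the remote sends them, with no polling interval to tune.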

chrisdevereux (Author):


> This to me is the crux of the "sell" of datasources, so I'm going to throw out a suggestion of how to re-word this, to tie it strongly to the reason behind this library.

Nice one. In general, it'd be really nice to get PRs submitted for improvements to documentation, especially around the 'sell' of the different features. It's not the easiest context-switch out of writing the code.

Member:


@chrisdevereux webhook support built in would be very cool. Uncertain of a concrete use case but... very neat.


> !!! warning
>
>     Just because you _can_ pull in lots of data in from other systems doesn't mean you _should_. Be mindful about any
Member:


A very vital point!

Member:


This sort of relates to my overall comment re: non-slow data. What does one do in this situation? Not a blocking consideration, however.

@chrisdevereux (Author), Dec 6, 2021:


Something we also want on the backlog is some kind of deletion strategy for SyncedModel. Currently this plays it safe by not deleting things when they disappear from the remote API. But that is.... also not safe.

@janbaykara (Member), Dec 9, 2021:


@chrisdevereux perhaps we can add a property to SyncedModel to flag when things disappear from the remote source: `still_exists_in_remote_source = BooleanField(default=True)`, or somesuch.
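A minimal sketch of how that soft-delete flag could behave during a sync, in plain Python rather than Django (the `still_exists_in_remote_source` name comes from the suggestion above; `sync_with_flag` and the records are illustrative):

```python
def sync_with_flag(local, remote_records):
    """Soft-delete sync: flag records missing from the remote, don't delete them."""
    remote_ids = set()
    for record in remote_records:
        remote_ids.add(record["id"])
        # Upsert everything the remote still returns.
        local[record["id"]] = {**record, "still_exists_in_remote_source": True}
    for record_id, record in local.items():
        # Keep the local row, but mark it if the remote no longer returns it.
        if record_id not in remote_ids:
            record["still_exists_in_remote_source"] = False

local = {1: {"id": 1, "name": "Old MP", "still_exists_in_remote_source": True}}
sync_with_flag(local, [{"id": 2, "name": "New MP"}])
```

Views can then filter on the flag, while any locally attached data survives a disappearance from the remote API.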

@chrisdevereux chrisdevereux merged commit e21435b into main Dec 7, 2021
@chrisdevereux chrisdevereux deleted the data-pipeline branch December 7, 2021 10:20
@janbaykara (Member) commented on Dec 9, 2021:

One comment on the docs: could we find a way in https://groundwork.commonknowledge.coop/api/pyck.geo.territories.uk.parliament/ to have the list of actual resources (constituencies, members, parties) at the top of the page, rather than at the bottom beneath the internals? Usage starts by pulling one of those into the datasource argument, after all, so it would be nice to have the menu options up front.


3 participants