This repository has been archived by the owner on Jun 21, 2024. It is now read-only.

Adds data pipeline framework and some UK geographical datasources #9

Merged – 14 commits into main from data-pipeline, Dec 7, 2021

Conversation

chrisdevereux (Contributor)

Description

This PR introduces:

  • A standard way of interacting with external APIs.
  • A standard way of registering and running cron jobs.
  • A model subclass that uses both of these to regularly sync local models to an external API.
  • Data sources for UK parliamentary constituencies, current MPs and parties.
  • A Postcodes.IO client.

I won't detail it extensively here – you can read the included documentation (both code-level and conceptual) for more detail.
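As a rough illustration of the sync pattern this PR introduces – note this is plain Python, not Groundwork's actual API; the `Datasource` and `SyncedStore` names here are hypothetical stand-ins:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Datasource:
    # Hypothetical stand-in for an external API client:
    # fetch() returns a list of remote records keyed by "id".
    fetch: Callable[[], list]

@dataclass
class SyncedStore:
    # Keeps a local dict of records up to date with a Datasource.
    datasource: Datasource
    records: dict = field(default_factory=dict)

    def sync(self):
        # A registered cron job would call this on a schedule.
        for remote in self.datasource.fetch():
            self.records[remote["id"]] = remote

source = Datasource(fetch=lambda: [{"id": 1, "name": "Bristol West"}])
store = SyncedStore(datasource=source)
store.sync()
```

In the real library the synced store is a Django model subclass, so the local copy lives in the database rather than in memory.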

Motivation and Context

Django ships out of the box with an excellent ORM for modeling data created and managed by our application. This is
useful for several reasons, but mainly:

  • Consistent conventions mean fewer surprises about how different things work.
  • The standardised model/queryset format allows the framework to support generic views, forms and other components.

Unfortunately, it doesn't help so much with external data – either public data that we want to be informed of changes to, or data managed by external services.

This gets particularly tricky when we want to augment the remote data with our own data, or there are API limits that
require us to store the data locally and keep it up to date – we often end up writing lots of bug-prone glue code to
manage this.

Handling this in a standardised way decreases the amount of surprise involved when interacting with external APIs. It also makes it much easier for us to quickly support new APIs and write generic code for functionality like the above.

Possible future directions:

  • Building out a suite of external APIs.
  • DatasourceView – like Django's ModelView, but wrapping an external API instead of a model.
  • Making a signal fire when SyncedModel detects that a particular attribute has changed.
  • Integrated webhook support on Datasources so that we can support 'push' as well as 'pull'.

How Can It Be Tested?

  • Clone this PR in a development container (this should be working now)
  • Run the tests with `make test`
  • Look at the example by:
    • Running the cron tasks to pull in some test data (`python manage.py run_cron_tasks --all`)
    • Launching the example app (F5 in VS Code)
    • Navigating to http://localhost:8000/uk/mps/ to display the synced data
  • Read the docsite by:
    • Running `make serve-docs`
    • Navigating to http://localhost:8001 and looking at the Guides and Reference sections of the site.

Screenshots (if appropriate):

[Screenshot 2021-12-03 at 12 42 47]

[Screenshot 2021-12-03 at 12 42 03]

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Documentation change

Checklist:

  • I have documented all new methods using google docstring format.
  • I have added type annotations to any public API methods.
  • I have updated any relevant high-level documentation.
  • I have added a usage example to the example app if relevant.
  • I have written tests covering all new changes.

@conatus (Member) left a comment


As far as I understand this code, I think this is sound. There are a few minor documentation pieces I have highlighted, that it would be good to fix.

One question I do have is what does one do when one doesn't want to sync the data periodically with a Cron. The documentation doesn't answer this directly. We can imagine some cases where the data changes quite rapidly and is important to keep up to date. I presume we somehow override the method, but can't be sure. I don't think this blocks this very good work going in however! Maybe take a ticket to document this situation?

docs/guides/data-pipeline.md – two resolved inline comments (outdated)

> Unfortunately, they don't help so much with external data - either public data that we want to be infomred of changes to or data managed by external services.
>
> This gets particuarly tricky when we want to augment the remote data with our own data or there are api limits that
Member:

API as opposed to api.

@conatus (Member), Dec 6, 2021:


This to me is the crux of the "sell" of datasources, so I'm going to throw out a suggestion of how to re-word this, to tie it strongly to the reason behind this library.

Unfortunately, they don't help so much with external data - either public data that we want to be informed of changes to or data managed by external services.

This gets particularly tricky when we want to augment the remote data with our own data or there are API limits that require us to store the data locally and keep it up to date. We often end up writing lots of bug-prone glue code to manage this.

Groundwork helps here by introducing a lightweight abstraction called a `Datasource`. It might be helpful to
think of these as similar to Django's models and querysets, but for external APIs.

In the examples that follow, we use a very common use case for building out applications that help people organise. There is a campaign that needs to carve up people by the UK parliamentary constituency they are in, and to add other information the campaign is concerned about: the number of people who support an action, or the number of letters sent in a constituency to lobby an MP. There might be a model to represent such a letter, for example.

So we need to represent the constituencies and information about them against a source of truth, but augment this with things that we want to know about. Loading in all constituencies, or looking this data up on the fly, is slow or error-prone – and the data around constituencies changes very infrequently.

Datasources try to solve this situation, which we have observed a fair amount in our own work, and provide a lightweight API for doing so.
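The campaign use case above can be sketched in plain Python (the IDs, names, and counts here are purely illustrative, not real constituency data or Groundwork API):

```python
# Remote source of truth: constituency data that changes very infrequently.
remote_constituencies = [
    {"id": "C1", "name": "Bristol West"},
    {"id": "C2", "name": "Hackney North"},
]

# Local, campaign-owned data we want to layer on top of the synced copy.
letters_sent = {"C1": 42}

def constituency_report(constituencies, letters):
    # Join the locally synced constituencies with our own data,
    # avoiding slow or rate-limited on-the-fly API lookups.
    return [{**c, "letters_sent": letters.get(c["id"], 0)} for c in constituencies]

report = constituency_report(remote_constituencies, letters_sent)
```

With a synced local copy, this join is an ordinary database query rather than a round-trip to the external API.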

chrisdevereux (Author):


> One question I do have is what does one do when one doesn't want to sync the data periodically with a Cron. The documentation doesn't answer this directly. We can imagine some cases where the data changes quite rapidly and is important to keep up to date. I presume we somehow override the method, but can't be sure. I don't think this blocks this very good work going in however! Maybe take a ticket to document this situation?

So, this is actually covered in the API reference, but not in the tutorial. If you pass `None` as the value of `sync_frequency`, it won't sync on a cron. This is currently mainly useful if you are only interested in embedded resources returned by the main resource, but could extend to other use cases if we want.

One way we might want to do this (which would be particularly good for fast-changing data) is adding webhook support to datasources. I'm imagining that might look something like this:

```python
# app/urls.py

from django.urls import path, include

urlpatterns = [
    path('webhooks/', include('groundwork.core.webhooks')),
]
```

with datasources then registering themselves so that they can 'push' into features that use them, such as SyncedModel, rather than being 'pulled' on a schedule.

Or, in cases where a local copy of the data isn't important (other than view-level caching, maybe), the Datasource interface could be used to produce generic views similar to Django's ModelView and ListView.

Definitely something for the backlog.
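The 'push' idea could look roughly like this – an entirely hypothetical plain-Python sketch, not the library's API; the `register_webhook`/`receive_webhook` names and topic strings are made up:

```python
# Registry mapping webhook topics to local stores that want pushes.
handlers = {}

def register_webhook(topic, store):
    # A datasource would call this at startup to subscribe a
    # SyncedModel-like store to a topic it cares about.
    handlers.setdefault(topic, []).append(store)

def receive_webhook(topic, payload):
    # Called by the webhook endpoint: push the update into every
    # subscriber, instead of each store pulling on a cron schedule.
    for store in handlers.get(topic, []):
        store[payload["id"]] = payload

mps = {}
register_webhook("uk.mps", mps)
receive_webhook("uk.mps", {"id": 7, "name": "Example MP"})
```

The appeal for fast-changing data is that updates land as soon as the remote sends them, with no polling interval to tune.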

chrisdevereux (Author):


> This to me is the crux of the "sell" of datasources, so I'm going to throw out a suggestion of how to re-word this, to tie it strongly to the reason behind this library.

Nice one. In general, it'd be really nice to get PRs submitted for improvements to documentation, especially around the 'sell' of the different features. It's not the easiest context-switch out of writing the code.

Member:


@chrisdevereux webhook support built in would be very cool. Uncertain of a concrete use case but... very neat.


> !!! warning
>
>     Just because you _can_ pull in lots of data in from other systems doesn't mean you _should_. Be mindful about any
Member:


A very vital point!

Member:


This sort of relates to my overall comment re: non-slow data. What does one do in this situation? Not a blocking consideration, however.

@chrisdevereux (Author), Dec 6, 2021:


Something we also want on the backlog is some kind of deletion strategy for SyncedModel. Currently this plays it safe by not deleting things when they disappear from the remote API. But that is.... also not safe.

@janbaykara (Member), Dec 9, 2021:


@chrisdevereux perhaps we can add a property to SyncedModel to flag when things disappear from the remote source: `still_exists_in_remote_source = BooleanField(default=True)`, or somesuch.
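A minimal sketch of how that soft-delete flag could behave during a sync, in plain Python rather than Django (the `still_exists_in_remote_source` name comes from the suggestion above; `sync_with_flag` and the records are illustrative):

```python
def sync_with_flag(local, remote_records):
    """Soft-delete sync: flag records missing from the remote, don't delete them."""
    remote_ids = set()
    for record in remote_records:
        remote_ids.add(record["id"])
        # Upsert everything the remote still returns.
        local[record["id"]] = {**record, "still_exists_in_remote_source": True}
    for record_id, record in local.items():
        # Keep the local row, but mark it if the remote no longer returns it.
        if record_id not in remote_ids:
            record["still_exists_in_remote_source"] = False

local = {1: {"id": 1, "name": "Old MP", "still_exists_in_remote_source": True}}
sync_with_flag(local, [{"id": 2, "name": "New MP"}])
```

Views can then filter on the flag, while any locally attached data survives a disappearance from the remote API.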

@chrisdevereux chrisdevereux merged commit e21435b into main Dec 7, 2021
@chrisdevereux chrisdevereux deleted the data-pipeline branch December 7, 2021 10:20
@janbaykara (Member) commented on Dec 9, 2021:

One comment on the docs: could we find a way in https://groundwork.commonknowledge.coop/api/pyck.geo.territories.uk.parliament/ to have the list of actual resources (constituencies, members, parties) at the top of the page, rather than at the bottom beneath the internals? Usage starts by pulling one of those into the datasource argument, after all, so it would be nice to have the menu options up front.


3 participants