Adds data pipeline framework and some uk geographical datasources #9
As far as I understand this code, I think this is sound. There are a few minor documentation pieces I have highlighted, that it would be good to fix.
One question I do have: what does one do when one doesn't want to sync the data periodically with a cron? The documentation doesn't answer this directly. We can imagine some cases where the data changes quite rapidly and it is important to keep it up to date. I presume we somehow override the method, but can't be sure. I don't think this blocks this very good work going in, however! Maybe take a ticket to document this situation?
docs/guides/data-pipeline.md
Unfortunately, they don't help so much with external data - either public data that we want to be infomred of changes to or data managed by external services.
This gets particuarly tricky when we want to augment the remote data with our own data or there are api limits that
API as opposed to api.
This to me is the crux of the "sell" of datasources, so I'm going to throw out a suggestion of how to re-word this, to tie it strongly to the reason behind this library.
Unfortunately, they don't help so much with external data - either public data that we want to be informed of changes to or data managed by external services.
This gets particularly tricky when we want to augment the remote data with our own data or there are API limits that require us to store the data locally and keep it up to date. We often end up writing lots of bug-prone glue code to manage this.
Groundwork helps here by introducing a lightweight abstraction called a `Datasource`. It might be helpful to
think of these as similar to Django's models and querysets, but for external APIs.
In the examples that follow, we use a very common use case when building applications that help people organise. A campaign needs to group people by the UK parliamentary constituency they are in and attach other information the campaign cares about to each constituency, such as the number of people who support an action or the number of letters sent in that constituency to lobby an MP. There might be a model to represent such a letter, for example.
So we need to represent the constituencies and information about them against a source of truth, but augment this with things that we want to know about. But loading in all constituencies, or looking up this data on the fly, is slow or error-prone. The data around constituencies also changes very infrequently.
Datasources try to solve for this situation, which we have observed a fair amount in our own work, and provide a lightweight API for doing so.
> One question I do have is what does one do when one doesn't want to sync the data periodically with a Cron. The documentation doesn't answer this directly. We can imagine some cases where the data changes quite rapidly and is important to keep up to date. I presume we somehow override the method, but can't be sure. I don't think this blocks this very good work going in however! Maybe take a ticket to document this situation?
So, this is actually covered in the API reference, but not in the tutorial. If you pass `None` as the value of `sync_frequency`, it won't sync on a cron. This currently is mainly useful if you are only interested in embedded resources returned by the main resource, but could extend to other use cases if we want.
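To make the `sync_frequency=None` behaviour concrete, here is a minimal, self-contained sketch of how a cron runner might decide whether a datasource is due. It is not groundwork's actual implementation: `SyncSettings`, `due_for_cron_sync` and the `last_synced` field are hypothetical names invented for illustration, and only the `sync_frequency` name comes from the discussion above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SyncSettings:
    """Hypothetical stand-in for a datasource's sync configuration."""

    name: str
    # None is taken to mean "never sync this on a cron".
    sync_frequency: Optional[timedelta]
    last_synced: Optional[datetime] = None


def due_for_cron_sync(settings: SyncSettings, now: datetime) -> bool:
    """Return True if a scheduled run should sync this datasource."""
    if settings.sync_frequency is None:
        # Opted out of periodic syncing entirely.
        return False
    if settings.last_synced is None:
        # Never synced before, so sync at the first opportunity.
        return True
    return now - settings.last_synced >= settings.sync_frequency
```

With this shape, a datasource configured with `sync_frequency=None` is simply skipped by every scheduled run, while still being usable by anything that reads it directly.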
One way we might want to do this (which would be particularly good for fast-changing data) is adding webhook support to datasources. I'm imagining that might look something like this:
# app/urls.py
from django.urls import path, include

urlpatterns = [
    # 'groundwork.core.webhooks' is the imagined module from this proposal; it does not exist yet
    path('webhooks/', include('groundwork.core.webhooks')),
]
with datasources then registering themselves so that they can 'push' into features that use them, such as `SyncedModel`, rather than being 'pulled' on a schedule.
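As a rough illustration of the 'push' idea, the sketch below shows a registry that a webhook view could dispatch into. None of this exists in groundwork: `WebhookRegistry`, `register`, `dispatch`, `upsert_constituency` and the payload shape are all hypothetical, and the example is kept free of Django so that it runs on its own.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Listener = Callable[[Dict[str, Any]], None]


class WebhookRegistry:
    """Hypothetical registry mapping a datasource name to push listeners."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Listener]] = defaultdict(list)

    def register(self, source_name: str, listener: Listener) -> None:
        """Subscribe a listener to pushes for one datasource."""
        self._listeners[source_name].append(listener)

    def dispatch(self, source_name: str, payload: Dict[str, Any]) -> int:
        """Forward a pushed payload to each listener; return how many ran."""
        listeners = self._listeners.get(source_name, [])
        for listener in listeners:
            listener(payload)
        return len(listeners)


# A stand-in for the local copy that a synced model would maintain.
local_copy: Dict[str, Dict[str, Any]] = {}


def upsert_constituency(payload: Dict[str, Any]) -> None:
    """Insert or update the local record keyed by its remote id."""
    local_copy[payload["id"]] = payload


registry = WebhookRegistry()
registry.register("constituencies", upsert_constituency)

# A webhook view would call this with the parsed request body.
registry.dispatch("constituencies", {"id": "example-1", "name": "Example North"})
```

The point of the design is that the remote service decides when an update happens, so the local copy changes as soon as a payload arrives rather than at the next scheduled run.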
Or, in cases where a local copy of the data isn't important (other than view-level caching maybe), the Datasource interface could be used to produce generic views similar to Django's `ModelView` and `ListView`.
Definitely something for the backlog.
> This to me is the crux of the "sell" of datasources, so I'm going to throw out a suggestion of how to re-word this, to tie it strongly to the reason behind this library.
Nice one. In general, it'd be really nice to get PRs submitted for improvements to documentation, especially around the 'sell' of the different features. It's not the easiest context-switch out of writing the code.
@chrisdevereux webhook support built in would be very cool. Uncertain of a concrete usecase but... very neat.
!!! warning
Just because you _can_ pull in lots of data in from other systems doesn't mean you _should_. Be mindful about any
A very vital point!
This sort of relates to my overall comment re: non-slow data. What does one do in this situation? Not a blocking consideration, however.
Something we also want on the backlog is some kind of deletion strategy for `SyncedModel`. Currently this plays it safe by not deleting things when they disappear from the remote API. But that is... also not safe.
@chrisdevereux perhaps we can add a property to `SyncedModel` to flag when things disappear from the remote source. `still_exists_in_remote_source = BooleanField(default=True)` or somesuch.
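A minimal sketch of that flagging idea, using plain dataclasses rather than Django models so it runs standalone. `LocalRecord` and `flag_missing` are hypothetical names, and only the `still_exists_in_remote_source` field name comes from the suggestion above. Nothing is deleted: records are marked, so application code can decide how to treat them.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class LocalRecord:
    """Hypothetical stand-in for a row of a synced model."""

    external_id: str
    still_exists_in_remote_source: bool = True


def flag_missing(local: Dict[str, LocalRecord], remote_ids: Iterable[str]) -> None:
    """Mark each local record by whether the latest sync still returned it."""
    seen = set(remote_ids)
    for external_id, record in local.items():
        # A record that reappears remotely is flagged as existing again.
        record.still_exists_in_remote_source = external_id in seen


records = {
    "a": LocalRecord("a"),
    "b": LocalRecord("b"),
}

# Suppose the latest sync only returned "a".
flag_missing(records, ["a"])
```

Because the flag is recomputed on every sync, a resource that disappears temporarily and later returns is restored without manual intervention.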
One comment on the docs: could we find a way in https://groundwork.commonknowledge.coop/api/pyck.geo.territories.uk.parliament/ to have the list of actual resources (constituencies, members, parties) at the top of the page, rather than the bottom beneath the internals? Usage starts by pulling one of those into the the |
Description
This PR introduces:
I won't detail it extensively here – you can read the included documentation (both code-level and conceptual) for more detail.
Motivation and Context
Django ships out of the box with an excellent ORM for modeling data created and managed by our application. This is
useful for several reasons, but mainly:
Unfortunately, they don't help so much with external data - either public data that we want to be informed of changes to or data managed by external services.
This gets particularly tricky when we want to augment the remote data with our own data or there are API limits that
require us to store the data locally and keep it up to date; we often end up writing lots of bug-prone glue code to
manage this.
Handling this in a standardised way decreases the amount of surprise involved when interacting with external APIs. It also makes it much easier for us to quickly support new APIs and write generic code for functionality like the above.
Possible future directions:
How Can It Be Tested?
- `make test`
- `python manage.py run_cron_tasks --all`
- `F5` in vscode
- `make serve-docs`, then visiting `http://localhost:8001` and looking at the Guides and Reference sections of the site.
Screenshots (if appropriate):
Types of changes
Checklist: