Break out QA into a separate service #65
This is a good aim, to provide QA on all CKANs everywhere. It might be worth breaking it down a bit to decide what you want from the Archiver/QA features as it would be simpler to only have some:
Doing the QA of a file is pretty quick - you can do 50k resources in a couple of hours or so, since it is just a matter of running file/magic on it or other analysis of the first 1k. The slow bits are: downloading the file (actually done by the archiver) and, if you go over the API, reading the resource and writing it back. There really are two different use cases - one-offs and the whole of CKAN in bulk. For the latter case, you could read the resources in one batch by getting hold of the CKAN dump, and writing the results in batch would be good too. One more complication that could usefully be dealt with is files that change every time you read them, meaning you write to the resource every time you check (hash, file length etc.). That makes no sense, creates excessive revisions/history and gives false alerts to watchers. |
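A minimal sketch of that quick format check, assuming the python-magic package and looking only at the first 1k of the file (the function name is illustrative):

```python
# Sketch of the quick QA check described above: identify a file's format by
# running libmagic over just its first 1 KB. Assumes the python-magic package.
import magic

def sniff_format(path):
    with open(path, "rb") as f:
        head = f.read(1024)  # the first 1k is enough for format detection
    return magic.from_buffer(head, mime=True)  # e.g. "text/csv"
```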
Do you mean it'd be a separate service like datapusher is right now? |
Perhaps, but I was really thinking about something you can install independently of CKAN, with a nice simple API/interface and no dependency on CKAN itself. There's nothing in QA that needs CKAN, and I think it would be valuable to others - like csvlint.io but more complete. |
It might be good to start by thinking independently of the exact implementation and focus on getting a good list of user stories that we're trying to solve and then work from that to the simplest/best solution. |
My first pass, off the top of my head.

QA User Stories
- As a user …
- As a user …
- As a user …
- As a user …
- As a user, developer, sysadmin …
- As a data-portal developer …
- As a data-portal developer …
- As a data-portal developer …
- As a data-portal developer …
- As a data-portal developer …
- As a data-portal developer/sysadmin …
- As a developer …
- As a system administrator …
- As a system administrator …
- As a system administrator, data-portal developer …
|
I think what I'm suggesting is that ckanext-qa become a very simple extension, delegating the work out to a centralised service (although obviously it should be possible to host your own). Something like the attached diagram. The 'magic' section is the archiving, pregenerating previews, and QA checking, although it could incorporate things like csvlint.io. Also possibly s/token/apikey/. Although the more I think about it, the more this is really not very CKAN specific, and I wonder if this is really the right place to discuss it. |
@rossjones I think this is spot-on and totally agree with the separation of concerns. In fact, the original QA design had this, and we were focusing on the "microformat" for storing QA results on datasets and resources. I would emphasize KISS here in terms of next steps - whilst it is true that one could make something very generic (and very powerful), it might be worth focusing on just a few use cases (e.g. just QA, leaving out previewing) and trying to implement that. |
If the services are always calling back to a CKAN instance then auth on the services could be quite simple. Each service would have a list of CKAN instances (API URLs) it's willing to do work for. Every time a job is started from an instance, the instance would contact the service and prove that the work request came from it by having the service "phone back" to the instance's API to verify the request and establish a token for the job. This makes it possible to handle all user authorization on the CKAN side, keeping the services simple. We probably don't even need API keys for the services. |
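A minimal sketch of that phone-back flow from the service's side; the endpoint name and response shape are assumptions, not an existing CKAN API:

```python
# Sketch of the "phone back" verification described above. The action name
# (verify_work_request) and the response shape are hypothetical.
import uuid
import requests

# CKAN instances (API URLs) this service is willing to do work for.
ALLOWED_CKAN_URLS = {"https://demo.ckan.org/api/3"}

def start_job(ckan_url, request_id):
    """Accept a work request only after the CKAN instance confirms it."""
    if ckan_url not in ALLOWED_CKAN_URLS:
        raise PermissionError("unknown CKAN instance")
    # Phone back to the instance's API to check the request really came from it.
    response = requests.get(
        ckan_url + "/action/verify_work_request",  # hypothetical action
        params={"request_id": request_id},
        timeout=30,
    )
    if not response.json().get("result", {}).get("valid"):
        raise PermissionError("CKAN instance did not confirm this request")
    return uuid.uuid4().hex  # token established for this job
```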
That would be nice and clean, and would make it easy to generalise a lot of the CKAN side for handling those tokens. |
I've started work on a broken link checking "service" for CKAN here: https://github.com/seanh/ckanext-deadoralive that's similar to Ross's idea. Feedback welcome. (Many of the design decisions are Ross and Ian's ideas from IRC...)

It's just for checking broken links at the moment, not doing five-star ratings or detecting file formats etc. That's all we need for the project I'm doing this for. It could easily be extended to do the other things as well (but I wonder if each should be a separate service?). I hope this'll be much simpler and more reliable than ckanext-qa.

It's implemented in two parts: a CKAN plugin (one that's trivial to install, with no dependencies or setup) and a CKAN API client designed to be run manually in a terminal or by cron (also no dependencies or setup - just copy-paste a line into crontab). I'm doing it as a cron job rather than a web service because it's much quicker and simpler to implement, gets you 90% of the way there, and is very instructive for how a web service could be designed. As long as it communicates with CKAN purely via the API and can be run on a different machine from where CKAN is installed (so not a paster command), the code should be completely reusable if you were to wrap it in a web service.

The way it works is:
The link checker is completely dumb and has no state. The CKAN plugin is quite careful about which resource IDs it returns. It's designed so that you can potentially have many link checkers running at once. It first sends resources that have never been checked (oldest resources first), then if there are no more of those it starts re-sending resources that were last checked more than 24 hours ago. After those it will resend resources that have had a pending check (i.e. CKAN hasn't received the result back from the link checker yet) for more than 2 hours. (It has to keep track of pending checks so that if multiple link checkers are requesting links to check simultaneously, CKAN doesn't send the same link to two different checkers at the same time.) Eventually it will send back empty lists, if all the resources have been checked recently. But then after a while it'll start sending back resource IDs again, once it's time to re-check them. The checkers are dumb: they just keep periodically asking for more links to check. So this is resilient to link checker service failures and can handle an unknown number of link checkers running simultaneously (a sketch of this selection order follows at the end of this comment).

This is rough but implemented and working now. The major piece that isn't done is the display layer: I need to add code to the CKAN plugin that builds on top of the link check results it has in the db and provides "this resource is broken" notices on resource pages, a report page listing datasets with broken resources grouped by organization, etc. That should be quite straightforward, although it may have to query multiple tables of the CKAN db directly to build reports like a summary of datasets with broken links grouped by organization.
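A sketch of that selection order, assuming in-memory rows with illustrative field names (the real plugin presumably queries its db table instead):

```python
# Sketch of the resource-ID selection order described above. Field names are
# illustrative; both timeouts are configurable per the description.
import datetime

RECHECK_AFTER = datetime.timedelta(hours=24)
PENDING_TIMEOUT = datetime.timedelta(hours=2)

def resources_to_check(rows, now, limit=50):
    """rows: dicts with resource_id, created, last_checked, pending_since."""
    # 1. Resources that have never been checked, oldest resources first.
    never = sorted((r for r in rows if r["last_checked"] is None
                    and r["pending_since"] is None),
                   key=lambda r: r["created"])
    # 2. Resources last checked more than 24 hours ago.
    due = [r for r in rows if r["last_checked"] is not None
           and r["pending_since"] is None
           and now - r["last_checked"] > RECHECK_AFTER]
    # 3. Pending checks older than 2 hours (the checker probably died).
    stale = [r for r in rows if r["pending_since"] is not None
             and now - r["pending_since"] > PENDING_TIMEOUT]
    selected = (never + due + stale)[:limit]
    for r in selected:
        r["pending_since"] = now  # so no other checker gets the same link
    return [r["resource_id"] for r in selected]
```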
Thoughts

In Ross's diagram he has CKAN posting "please check this link" to the QA service (presumably CKAN would do this each time a new resource is created, or a resource URL is changed?), and then the QA service calling a callback URL with the check result. I've got it the other way round: the service calls CKAN to say "please give me some links to check." I think my way avoids a few problems in Ross's version:

With my approach, the CKAN plugin has no dependencies and isn't running any long-running, periodic or asynchronous tasks. The link checker service runs periodically. In the current implementation, the link checker gets a list of resource IDs from CKAN and then has to call the API again for each resource to get its URL.

This also avoids a problem with private resources. If there was a bug in the … Still, this does mean that each link checker has to call the …

Being coupled to CKAN: the … Particularly if we were going to make this into a web service, could we change the design to make it not tied to CKAN? Would it even be worthwhile to do so? Are there likely to be any non-CKAN users of a QA service? It seems any QA microservice is going to have to define an API of its own, and then we're gonna have to write a CKAN plugin to make CKAN use that API, and any non-CKAN applications are gonna need their own plugins to use it as well. Is this likely to happen? Or realistically are we just reimplementing …?

What are the pros of extending this to become a web service? As far as I can think, the only advantage of a web service over a cron job is that it can support a "check this resource / dataset / organization for broken links now" button in the CKAN HTML interface for sysadmins, kicking off ad-hoc link-checker runs in response to user interaction (because CKAN can send a request to the link checker service to do the check). The primary mode of operation would still be automatic periodic checks, but admins could request ad-hoc checks as well. That is an advantage, and I do think a web service would ultimately be better than a cron job. But I think the effort-benefit ratio is quite bad, because a web service is a lot more work. |
Another pro of doing a web service is that CKAN can automatically request an ad-hoc check when a user creates or updates a resource, so the user can see their link check result right away. With the cron version, they have to wait until the next time a link checker cron job runs. |
This sounds good. Any thoughts on how we'd implement the web service? I'm thinking Flask (very nice for quick, small web apps with JSON APIs) and Celery with the Periodic Tasks feature. I haven't done much Celery coding though, and it does require you to run a celery worker in a terminal in development, and to set up e.g. supervisord to run celery in production. Something with less setup would be nice, but it needs to be able to run a "check for new links to check or re-check" task periodically. Maybe it should be one link checker script that can be called by a web service for ad-hoc tasks, but can also be called by a cron job for periodic tasks! But it'd be nice if all you had to do was install the Flask app, not install the Flask app and set up a cron job. |
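A minimal sketch of that "one script, two entry points" idea, assuming Flask and requests (the endpoint path and payload are illustrative):

```python
# Sketch: the same check function is callable from a tiny Flask app (ad-hoc
# checks) and directly from a cron job (periodic checks). Names illustrative.
import sys

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

def check_url(url):
    """Return True if the URL looks alive, False otherwise."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=30)
        return response.status_code < 400
    except requests.RequestException:
        return False

@app.route("/check", methods=["POST"])
def check_now():
    # Ad-hoc check kicked off from e.g. a button in the CKAN UI.
    url = request.get_json()["url"]
    return jsonify({"url": url, "alive": check_url(url)})

if __name__ == "__main__":
    # Called from cron instead: python checker.py http://example.com/data.csv
    print(check_url(sys.argv[1]))
```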
Cool! What do the results look like? |
A link checker result (as posted back to CKAN by the link checker service) currently looks like:
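A plausible shape, going by the fields discussed below (the keys are illustrative):

```python
# Illustrative result payload; only resource_id and alive are implied by the
# surrounding discussion ("reason" and a status code get proposed later).
result = {
    "resource_id": "resource-id-here",
    "alive": False,
}
```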
That goes into a db table with columns:
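Again illustrative, expressed as SQLAlchemy (the exact column set isn't given in this thread):

```python
# Illustrative guess at the results table; the column names are assumptions.
import sqlalchemy as sa

metadata = sa.MetaData()

link_checker_results = sa.Table(
    "link_checker_results", metadata,
    sa.Column("resource_id", sa.UnicodeText, nullable=False),
    sa.Column("alive", sa.Boolean, nullable=False),
    sa.Column("datetime", sa.DateTime, nullable=False),  # when the result arrived
)
```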
I may add a "reason" string but so far I don't think it's strictly necessary. |
It'd be handy to have a reason, I think, for when alive=False, so we could have "Authentication required", "Resource does not exist" or "An error occurred on the server". |
BTW is it worth checking back at our original spec re the Data Quality info structure: http://web.archive.org/web/20111221052527/http://wiki.ckan.org/Data_Quality In particular, the field had: … |
What about the web UI? I agree with Ross and Rufus that it is not good enough just to give one result. It is much more robust and indeed helpful to be able to respond like this: "It has been working on the weekly tests we've done since 4th March, but the last two tests, the latest of which was yesterday at 4pm, gave an error "504 Server Error". The full results are on web page X and there is a public message on the dataset page to warn people before they try your link, which will be removed as soon as you fix it. Your publisher is below average in the public rankings, so let's sort it out before someone notices." |
Maybe store the HTTP status code? |
I've put some thought into defining a non-CKAN-specific protocol for this: ckan/ckanext-deadoralive#1 - feedback welcome. |
I'm happy to add a reason string. |
The way I'm planning to implement the web UI is that if a link has been down for at least 3 checks in a row over a period of at least 3 days (both numbers configurable) then we mark it as broken. That's why the CKAN plugin is storing a history of link checker results, not just the most recent one. Since we're keeping a history of results we could potentially show detailed info about the history of checks like in the example you link to. The only thing is that I currently have it deleting results that are more than 1 week (configurable) old to stop the db table from growing indefinitely. |
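A sketch of that rule, assuming results are stored as (timestamp, alive) pairs:

```python
# Sketch of the "mark as broken" rule described above: down for at least
# min_fails consecutive checks spanning at least min_period. Both configurable.
import datetime

def is_broken(results, min_fails=3, min_period=datetime.timedelta(days=3)):
    """results: (checked_at, alive) tuples, sorted newest first."""
    run = []
    for checked_at, alive in results:
        if alive:
            break  # a successful check ends the consecutive-failure run
        run.append(checked_at)
    if len(run) < min_fails:
        return False
    # run[0] is the newest failure in the run, run[-1] the oldest.
    return run[0] - run[-1] >= min_period
```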
Right. So I still need to define my "reporting" APIs that will be used to put the link checking info into CKAN's HTML. That's what this seems to apply to. Note that I don't consider this to be part of the link checker web service protocol - this is just part of what the CKAN plugin does with the results. I can compute … I don't have … |
Actually, I think CKAN should do the reporting of what has happened - but the link checker needs to supply enough information for it to do that. IMO keep the link checker simple and neutral (no CKAN-specific code). I'd suggest you need to add: the HTTP status code, a reason string, and a timestamp for each check.
CKAN can then (if it wishes) store this information somewhere and calculate the other information based on the timestamps and status codes. |
Good, so you have the test history covered. BTW we're really happy with just storing what gets reported, rather than the full details of every old test - i.e. just storing the date of the last successful test and the number of fails, rather than the dates of all the tests. I didn't see the justification for storing the full details of older tests, and it would be another order of magnitude of data to store. But maybe you have the use cases. |
One more thought about the protocol between the CKAN plugin and the link checker service: one of the diagrams above has the web service supporting an optional "check status of a link checker task" API (where the CKAN site makes a request to the web service to get the status of a task, e.g. queued, succeeded, failed etc.). I don't think we should support this:
I'm trying to get all the storage/state in the CKAN plugin, and all the async and periodic stuff in the service. I think this keeps both things simpler: CKAN already provides a db to the plugin, so it can have storage easily. CKAN doesn't provide a good, easy way to do async or periodic tasks, so that's what the service should do - but to keep the service as simple as possible it shouldn't do anything else. |
I'm happy to add status and reason. I think CKAN can easily compute the timestamp itself though (it can just record the time it receives each result). |
I think you might be right here. Currently I store every result for each resource, and delete the ones older than a week. But it might be much simpler to store only the latest result for each resource, but add "last successful" and "number of fails" columns that get recomputed on each write, instead of computing these at read time. |
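A sketch of that recompute-on-write idea (one stored row per resource; field names illustrative):

```python
# Sketch: keep only the latest result per resource, updating the aggregate
# columns on every write instead of computing them at read time.
def upsert_result(row, alive, checked_at):
    """row: the single stored record for this resource, as a dict."""
    row["alive"] = alive
    row["last_checked"] = checked_at
    if alive:
        row["last_successful"] = checked_at
        row["num_fails"] = 0
    else:
        row["num_fails"] = row.get("num_fails", 0) + 1
    return row
```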
This is what became of the link checker in the end: https://github.com/ckan/deadoralive It's "finished for now" (fulfils our particular client's needs, bugs notwithstanding) but there are still plenty of potential enhancements in the GitHub issues. |
Wanted to let you know that I am running a broken link checker against CKAN. It works by harvesting some metadata, fetching all sorts of URIs (not just resources but also contact points, etc.) and reporting into PostgreSQL. As I am not good at UI, I stopped at aggregating statistics into Redis. It's available here: https://github.com/the42/ogdat/tree/master/ogdatwatcher |
What?
Unlike the datastore, I think QA is a viable candidate for breaking out into a separate service, as there is no hard requirement for it to be performant (not a real word, I know) and the bindings to CKAN could be conceptually looser than the current implementations imply.
Why?
The ckanext-qa/ckanext-archiver combo is hard to set up and can consume a rather large amount of resources. More value could be provided to the wider open data community by opening up the QA processing.
How?
The QA service should be easier to set up, with a simpler interface for the caller. Having a separate service that responds to POST requests with asynchronous responses would provide value to others not using CKAN as well. QA/Archiver doesn't have any hard dependencies on CKAN (other than various branches writing directly to the DB, or using the API) and could be modified to just POST a response in a known format.
The QA extension itself would become a thin extension that would POST changes to a resource to the service, plus an API for receiving responses and writing them to known extra fields on the resource.
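A minimal sketch of that thin-extension flow; the service URL, endpoint and payload keys are assumptions:

```python
# Sketch of the flow described above: CKAN POSTs a changed resource to the QA
# service, telling it where to POST the result when done. Names hypothetical.
import requests

QA_SERVICE_URL = "https://qa.example.com/jobs"  # hypothetical central service

def notify_qa_service(ckan_site_url, resource):
    requests.post(QA_SERVICE_URL, json={
        "resource_id": resource["id"],
        "url": resource["url"],
        # Callback the service POSTs its result to; CKAN then writes the
        # result into known extra fields on the resource.
        "callback": ckan_site_url + "/api/3/action/qa_result_update",
    }, timeout=30)
```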