Explain how datapusher works and add API documentation #18

Open
rufuspollock opened this issue Jan 23, 2014 · 6 comments
Comments

@rufuspollock
Member

Need to add the following to the documentation:

  • How does the datapusher push to CKAN: is it via a direct DB connection or via the CKAN DataStore API?
  • Does datapusher expose its own API so I can POST a file, or is it tightly integrated with CKAN? If it does have an API, where is that documented?
@nigelbabu
Contributor

@rgrp The datapusher uses ckanserviceprovider. It runs independently of CKAN and uses the CKAN API. As for the documentation for the API, ckanserviceprovider should give you an idea; further pull requests welcome. I've updated the issue slightly to read as a bug.

@amercader
Member

These are really good questions and really timely as we are working on the DataPusher docs before the release.

As @nigelbabu mentions, the DataPusher is a standalone application (although generally installed on the same server) and all communication with CKAN core and the DataStore is done via HTTP.

  1. CKAN talks with the DataPusher using the CKAN Service Provider protocol, telling it "Please, upload this resource to the DataStore". The request sent is something like:

    http POST http://localhost:8800/job Content-Type:application/json < dp.json
    {
        "api_key": "XXXXXX",
        "job_type": "push_to_datastore",
        "result_url": "http://localhost:5000/api/3/action/datapusher_hook",
        "metadata": {
            "ckan_url": "http://localhost:5000",
            "resource_id": "08872bf2-c620-4555-97ed-18e9f874a314"
        }
    }
    

    You can of course send these requests from another client.

  2. Once the job is created, the DataPusher will request the remote file contents, process them and push them to the DataStore via the datastore_create action.
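To make step 2 a bit more concrete, here is a rough sketch of what building a datastore_create call from a fetched CSV file could look like. This is not the DataPusher's actual code: the function name `build_datastore_create_request` is made up for illustration, every column is declared as `text` instead of guessing types, and the sample CSV is invented. It only builds the URL and JSON body and does not send anything.

```python
import csv
import io
import json
from urllib.parse import urljoin


def build_datastore_create_request(ckan_url, resource_id, csv_text):
    """Turn CSV text into a (url, body) pair for CKAN's datastore_create action.

    Hypothetical helper for illustration only; not taken from the DataPusher.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [dict(row) for row in reader]
    # Assumption: declare every column as text; real type guessing is out of scope.
    fields = [{"id": name, "type": "text"} for name in (reader.fieldnames or [])]
    url = urljoin(ckan_url.rstrip("/") + "/", "api/3/action/datastore_create")
    body = json.dumps({
        "resource_id": resource_id,
        "fields": fields,
        "records": records,
    })
    return url, body


url, body = build_datastore_create_request(
    "http://localhost:5000",
    "08872bf2-c620-4555-97ed-18e9f874a314",
    "city,count\nA,1\n",  # invented sample data
)
print(url)
print(body)
```

The body would then be POSTed to that URL with the user's API key in the `Authorization` header, which is how CKAN action calls are normally authenticated.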

Here's a simple schema of the whole process in case it helps:

[Image: glasgow workshop]

We'll try and improve the docs with these details.

@rufuspollock
Member Author

Also I now understand this is an instance of CKAN Service Provider and follows its docs.

The actual job type is push_to_datastore. Example code grabbed from ckanext-datapusherext is:

    import json
    import urlparse  # Python 2 module; urllib.parse in Python 3

    import pylons
    import requests

    # Ask the DataPusher to create a push_to_datastore job for one resource
    requests.post(
        urlparse.urljoin(datapusher_url, 'job'),
        headers={
            'Content-Type': 'application/json'
        },
        data=json.dumps({
            'api_key': user['apikey'],
            'job_type': 'push_to_datastore',
            'result_url': callback_url,
            'metadata': {
                'ckan_url': pylons.config['ckan.site_url'],
                'resource_id': res_id,
                'set_url_type': data_dict.get('set_url_type', False)
            }
        }))
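For anyone wanting to send the same job from their own client, a standard-library-only Python 3 sketch might look like the following. The helper name `build_push_job` and the example values are assumptions for illustration; the payload keys are the ones shown in this thread. It only constructs the request object, so nothing goes over the wire until you pass it to `urllib.request.urlopen`.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request


def build_push_job(datapusher_url, ckan_url, api_key, resource_id):
    """Build (but do not send) a POST request creating a push_to_datastore job.

    Hypothetical helper; the payload mirrors the examples quoted in this issue.
    """
    payload = {
        "api_key": api_key,
        "job_type": "push_to_datastore",
        # CKAN endpoint the DataPusher reports back to when the job finishes
        "result_url": urljoin(ckan_url, "/api/3/action/datapusher_hook"),
        "metadata": {
            "ckan_url": ckan_url,
            "resource_id": resource_id,
        },
    }
    return Request(
        urljoin(datapusher_url, "job"),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_push_job(
    "http://localhost:8800",  # assumed DataPusher address, as in the example above
    "http://localhost:5000",  # assumed CKAN address
    "XXXXXX",
    "08872bf2-c620-4555-97ed-18e9f874a314",
)
print(request.get_method(), request.full_url)
```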

@florianm

For us non-core developers, it would be great to have some docs on the requests sent between datapusher and the CKAN API. It is relevant to deployment behind firewalls and proxies to understand that datapusher will send HTTP requests to the ckan.site_url, which must pass firewall and proxy.

It's quite tricky to figure out why a perfectly fine datapusher gets a mysterious "could not post to result_url" from a perfectly fine CKAN API. Of course this is not a problem of datapusher per se, but it's in the nature of CKAN/datapusher that they will get installed for bigger audiences, often on cloud services with weird and wonderful proxy and firewall settings. I'm happy to contribute a section on using curl to debug failing HTTP requests between datapusher and the CKAN API if that's any good!
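As a starting point for that kind of debugging, here is a small Python sketch (in place of curl) that could be run on the datapusher host to check whether the CKAN API is reachable at all. The function name `diagnose_ckan_reachability` and its messages are made up for illustration; it calls CKAN's `status_show` action, and the `fetch` parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def diagnose_ckan_reachability(ckan_url, fetch=urllib.request.urlopen, timeout=10):
    """Report whether the CKAN API at ckan_url answers HTTP requests from this host.

    Hypothetical helper for illustration; `fetch` is injectable for offline testing.
    """
    url = urljoin(ckan_url.rstrip("/") + "/", "api/3/action/status_show")
    try:
        with fetch(url, timeout=timeout) as response:
            return f"OK: {url} answered with HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # Something answered (CKAN or a proxy in between) but refused the request
        return f"REACHED but rejected: {url} answered with HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # DNS failure, refused connection, firewall drop, and similar
        return f"UNREACHABLE: {url} ({exc.reason})"
    except OSError as exc:
        # Lower-level socket errors such as a read timeout
        return f"UNREACHABLE: {url} ({exc})"
```

Calling `print(diagnose_ckan_reachability("http://localhost:5000"))` with your own ckan.site_url from the machine running datapusher should tell you whether the failure happens before the request ever reaches CKAN. Note that urllib picks up proxy settings from environment variables such as `http_proxy`, so run it in the same environment as the datapusher process.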

@smrgeoinfo

+1 on documentation for this HTTP traffic-- we have been stuck for several weeks trying to figure out why datapusher and harvesting aren't working on our deployments. It's cost US A LOT of money.
https://github.com/ngds/ckanext-ngds/issues/580

@florianm

Update for those stuck between their firewall and a hard place: see the multi-tenant setup from source (should also work for single-tenant installs) and a diagram illustrating HTTP traffic crossing the installation localhost's boundaries.

Also worth reading is boxkite's setup.
