DataStore and the Data API

The CKAN DataStore provides a database for structured storage of data, together with a powerful web-accessible Data API, all seamlessly integrated into the CKAN interface and authorization system.

Relationship to FileStore

The DataStore is distinct from but complementary to the FileStore (see filestore). In contrast to the FileStore, which provides 'blob' storage of whole files with no way to access or query parts of a file, the DataStore is like a database in which individual data elements are accessible and queryable. To illustrate this distinction, consider storing a spreadsheet file such as a CSV or Excel document. In the FileStore this file would be stored directly; to access it you would download the file as a whole. By contrast, if the spreadsheet data is stored in the DataStore, you can access individual spreadsheet rows via a simple web API as well as make queries over the spreadsheet contents.

The DataStore Data API

The DataStore's Data API, which derives from the underlying data-table, is RESTful and JSON-based with extensive query capabilities.

Each resource in a CKAN instance can have an associated DataStore 'table'. The basic API for accessing the DataStore is detailed below. For a detailed tutorial on using this API see using-data-api.

Installation and Configuration

Warning

This is an advanced topic.

Earlier versions of the DataStore required a custom set-up of ElasticSearch and Nginx. This is no longer the case: the DataStore can now use any relational database management system (PostgreSQL, for example). However, you should set up a separate database for the DataStore and create a read-only user to keep your CKAN installation secure.

To create a new database and a read-only user, use the SQL script provided in ckanext/datastore/bin.

Edit the script to your needs and then execute it:

sudo -u postgres psql postgres -f create_read_only_user.sql

In your config file ensure that the datastore extension is enabled:

ckan.plugins = datastore

Also ensure that the ckan.datastore_write_url and ckan.datastore_read_url variables are set:

ckan.datastore_write_url = postgresql://ckanuser:pass@localhost/datastore
ckan.datastore_read_url = postgresql://readonlyuser:pass@localhost/datastore

To test the set-up, you can create a new DataStore table. On the command line, run:

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{"resource_id": "{RESOURCE-ID}", "fields": [ {"id": "a"}, {"id": "b"} ],
      "records": [ { "a": 1, "b": "xyz"}, {"a": 2, "b": "zzz"} ]}'

DataStorer: Automatically Add Data to the DataStore

Often, you will want data that is added to CKAN (whether it is linked to or uploaded to the FileStore, see filestore) to be automatically added to the DataStore. This requires some processing to extract the data from your files and add it to the DataStore in a format the DataStore can handle.

This task of automatically parsing and then adding data to the datastore is performed by a DataStorer, a queue process that runs asynchronously and can be triggered by uploads or other activities. The DataStorer is an extension and can be found, along with installation instructions, at:

API Reference

Note

Lists can always be expressed in different ways: as a list, as a comma separated string, or as a single item. For example, ['foo', 'bar'], 'foo, bar' and 'foo' are all valid list values.

datastore_create

The datastore_create API endpoint allows a user to post JSON data to be stored against a resource. This endpoint also supports altering tables, creating aliases and indexes, and bulk insertion. The JSON must be in the following form:

{
   resource_id: resource_id, # the resource the data is going to be stored against
   aliases: # list of names for read-only aliases to the resource
   fields: [] # a list of dictionaries of fields/columns and their extra metadata
   records: [] # a list of dictionaries of the data, e.g. [{"dob": "2005", "some_stuff": ['a', 'b']}, ...]
   primary_key: # list of fields that represent a unique key
   indexes: # indexes on table
}
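
For example, a create request that also defines a primary key and a read-only alias could look like the following. This is a sketch only: the resource id, API key and alias name are placeholders, and the field names are illustrative.

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{"resource_id": "{RESOURCE-ID}",
       "aliases": ["{ALIAS-NAME}"],
       "fields": [{"id": "dob"}, {"id": "some_stuff"}],
       "primary_key": ["dob"],
       "records": [{"dob": "2005", "some_stuff": ["a", "b"]}]}'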

datastore_delete

The datastore_delete API endpoint allows a user to delete records from a resource. The JSON must be in the following form:

{
   resource_id: resource_id # the resource the data is going to be deleted from
   filter: # dictionary of matching conditions to delete
           # e.g. {'key1': 'a', 'key2': 'b'}
           # this is equivalent to "delete from table where key1 = 'a' and key2 = 'b'"
}
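
As an illustrative sketch following the form above (placeholders as before), a delete request that removes all records matching two keys might look like this:

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_delete \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{"resource_id": "{RESOURCE-ID}",
       "filter": {"key1": "a", "key2": "b"}}'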

datastore_upsert

The datastore_upsert API endpoint allows a user to add data to an existing DataStore resource. In order for upsert and update to work, a unique key has to be defined via the datastore_create API endpoint. The JSON must be in the following form:

{
   resource_id: resource_id # resource id that the data is going to be stored under.
   records: [] # a list of dictionaries of the data, e.g. [{"dob": "2005", "some_stuff": ['a', 'b']}, ...]
   method: # the method to use to put the data into the datastore
           # possible options: upsert (default), insert, update
}
upsert

Update if record with same key already exists, otherwise insert. Requires unique key.

insert

Insert only. This method is faster than upsert because checks are omitted. Does not require a unique key.

update

Update only. Exception will occur if the key that should be updated does not exist. Requires unique key.
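
A sketch of an upsert request following the form above, assuming a unique key on the dob field was declared when the table was created (placeholders as before):

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_upsert \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{"resource_id": "{RESOURCE-ID}",
       "method": "upsert",
       "records": [{"dob": "2005", "some_stuff": ["a", "b", "c"]}]}'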

datastore_search

The datastore_search API endpoint allows a user to search data in a resource. The JSON for searching must be in the following form:

{
    resource_id: # the resource id to be searched against
    filters: # dictionary of matching conditions to select, e.g. {'key1': 'a', 'key2': 'b'}
       # this is equivalent to "select * from table where key1 = 'a' and key2 = 'b'"
    q: # full text query
    plain: # treat as plain text query (default: true)
    language: # language of the full text query (default: english)
    limit: # maximum number of rows to return (default: 100)
    offset: # offset the returned rows by this number
    fields:  # list of fields to return, in that order; defaults (empty or not present) to all fields in their original order
    sort: # ordered list of field names as, eg: "fieldname1, fieldname2 desc"
}
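
A sketch of a search request following the form above, combining a filter with a limit and a sort order (placeholders and field names as before):

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_search \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{"resource_id": "{RESOURCE-ID}",
       "filters": {"key1": "a"},
       "limit": 5,
       "sort": "dob desc"}'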

datastore_search_sql

The datastore_search_sql API endpoint allows a user to search data in a resource or to connect multiple resources with join expressions. The underlying SQL engine is PostgreSQL. The JSON for searching must be in the following form:

{
   sql: # a single sql select statement
}
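
A sketch of an SQL search request following the form above; the select statement is illustrative only and the resource id is a placeholder (an alias, see below, can be used instead):

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_search_sql \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{"sql": "SELECT * FROM \"{RESOURCE-ID}\" LIMIT 5"}'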

Table aliases

Resources in the datastore can have multiple aliases that are easier to remember than the resource id. Aliases can be created and edited with the datastore_create API endpoint. All aliases can be found in a special view called _table_metadata.
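
One way to inspect the available aliases is to query the _table_metadata view itself. A minimal sketch, assuming it can be read through datastore_search like any other table:

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_search \
  -d '{"resource_id": "_table_metadata"}'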

Comparison of different querying methods

The DataStore supports querying with the datastore_search and datastore_search_sql API endpoints. They are similar but support different features. The following table gives an overview of the different methods.

========================== ================ ======================== ==============
\                          datastore_search datastore_search_sql SQL HTSQL
========================== ================ ======================== ==============
Status                     Stable           Stable                   In development
Ease of use                Easy             Complex                  Medium
Query language             Custom (JSON)    SQL                      HTSQL
Connect multiple resources No               Yes                      Yes
Use aliases                Yes              Yes                      Yes
========================== ================ ======================== ==============