
DataStore and the Data API

The CKAN DataStore provides a database for structured storage of data together with a powerful Web-accessible Data API, all seamlessly integrated into the CKAN interface and authorization system.

Note

The DataStore requires PostgreSQL 9.0 or later. It is possible to use the DataStore on versions prior to 9.0 (for example 8.4), but :ref:`datastore_search_sql` will not be available and the set-up is slightly different. Make sure you read :ref:`old_pg` for more details.

Warning

The DataStore does not support hiding resources in a private dataset.

Installation and Configuration

Warning

Make sure that you follow the steps below and make sure that the settings are correct. Wrong settings could lead to serious security issues.

Previous versions of the DataStore required a custom set-up of ElasticSearch and Nginx. This is no longer the case: the DataStore now uses the relational database management system PostgreSQL. However, you should set up a separate database for the DataStore and create a read-only user to keep your CKAN and DataStore installation safe.

1. Enable the plugin

In your config file ensure that the datastore extension is enabled:

ckan.plugins = datastore

2. Set up the database

The DataStore requires a separate PostgreSQL database to save the resources to.

List existing databases:

sudo -u postgres psql -l

Check that the encoding of the databases is 'UTF8'; if not, internationalisation may be a problem. Since changing the encoding of PostgreSQL may mean deleting existing databases, it is suggested that this is fixed before continuing with the CKAN install.

Next you'll need to create two database users for the DataStore. One user will be the write user, which can create, edit and delete resources. The second will be a read-only user, which can only read resources.

A few things have to be kept in mind:

  • The DataStore cannot be on the CKAN database (except for testing)
  • The write user (i.e. writeuser) and read-only user (i.e. readonlyuser) cannot be the same

Create users and databases

Tip

The write user does not have to be created since you can also use the CKAN user. However, this might not be possible if the CKAN database and the DataStore database are on different servers. We recommend that you use the same user for CKAN and the write datastore user if possible.

Create a write user called writeuser, and enter pass for the password when prompted:

sudo -u postgres createuser -S -D -R -P -l writeuser

Create a read-only user called readonlyuser, and enter pass for the password when prompted:

sudo -u postgres createuser -S -D -R -P -l readonlyuser

Create the database (owned by writeuser), which we’ll call datastore:

sudo -u postgres createdb -O writeuser datastore

Set URLs

Now, ensure that the ckan.datastore.write_url and ckan.datastore.read_url variables are set:

ckan.datastore.write_url = postgresql://writeuser:pass@localhost/datastore
ckan.datastore.read_url = postgresql://readonlyuser:pass@localhost/datastore
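As a quick sanity check, the two URLs can be parsed to confirm they point at the same database but use different users (the forbidden set-up noted above). This is a small sketch using the example values from this section; `check_datastore_urls` is a hypothetical helper, not part of CKAN:

```python
from urllib.parse import urlparse

# Example values from the config snippet above.
write_url = "postgresql://writeuser:pass@localhost/datastore"
read_url = "postgresql://readonlyuser:pass@localhost/datastore"

def check_datastore_urls(write_url, read_url):
    """Raise if the write and read-only users are identical (a forbidden set-up)."""
    w, r = urlparse(write_url), urlparse(read_url)
    if w.username == r.username:
        raise ValueError("write and read-only users must be different")
    return w.username, r.username

print(check_datastore_urls(write_url, read_url))  # ('writeuser', 'readonlyuser')
```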

Set permissions

Once the datastore database and the users are created, the permissions on the datastore and CKAN databases have to be set. Since there are different set-ups, there are different ways of setting the permissions. Use only one of the following options.

  1. Use the paster command if CKAN and Postgres are on the same server

To set the permissions, use this paster command after you've set the database urls:

paster datastore set-permissions SQL_SUPER_USER

  2. Use the command line tool in datastore/bin/datastore_setup.py

Note

This option should be used if the CKAN server is different from the database server.

Copy the content from the datastore/bin/ directory to the database server. Then run the command line tool to set the permissions on the database. To see all available options, run:

python datastore_setup.py -h

Once you are confident that you know the right names, set the permissions (assuming that the CKAN database is called ckan and the CKAN PostgreSQL user is called ckanuser):

python datastore_setup.py ckan datastore ckanuser writeuser readonlyuser -p postgres

  3. Run the SQL commands manually on the database

Note

This option is for more complex set-ups and requires understanding of SQL and PostgreSQL.

Copy the set_permissions.sql file to the server that the database runs on. Make sure you set all variables in the file correctly and comment out the parts that are not needed for your setup. Then, run the script:

sudo -u postgres psql postgres -f set_permissions.sql

3. Test the set-up

The DataStore is now set up. To test it, you can create a new DataStore table by running the following command:

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create -H "Authorization: {YOUR-API-KEY}" -d '{"resource_id": "{RESOURCE-ID}", "fields": [ {"id": "a"}, {"id": "b"} ], "records": [ { "a": 1, "b": "xyz"}, {"a": 2, "b": "zzz"} ]}'

A table named after the resource id should have been created on your DataStore database. Visiting the following URL should return a response from the DataStore with the records inserted above:

http://127.0.0.1:5000/api/3/action/datastore_search?resource_id={RESOURCE_ID}
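The same test can be scripted in Python with only the standard library. This is a sketch of the curl call above: the API key and resource id are placeholders you must substitute, and the final request is left commented out since it needs a running CKAN instance:

```python
import json
from urllib.request import Request, urlopen

# Placeholders: substitute your real API key and an existing resource id.
API_KEY = "{YOUR-API-KEY}"
RESOURCE_ID = "{RESOURCE-ID}"
BASE = "http://127.0.0.1:5000/api/3/action"

# The same payload as the curl example above.
payload = {
    "resource_id": RESOURCE_ID,
    "fields": [{"id": "a"}, {"id": "b"}],
    "records": [{"a": 1, "b": "xyz"}, {"a": 2, "b": "zzz"}],
}

req = Request(
    BASE + "/datastore_create",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Authorization": API_KEY, "Content-Type": "application/json"},
)
# response = json.load(urlopen(req))  # uncomment against a running CKAN
```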

Notes on how to install the DataStore on old PostgreSQL versions

To be continued...

Relationship to FileStore

The DataStore is distinct from but complementary to the FileStore (see :doc:`filestore`). In contrast to the FileStore, which provides 'blob' storage of whole files with no way to access or query parts of a file, the DataStore is like a database in which individual data elements are accessible and queryable. To illustrate this distinction, consider storing a spreadsheet file like a CSV or Excel document. In the FileStore, this file would be stored directly and accessed only by downloading it as a whole. By contrast, if the spreadsheet data is stored in the DataStore, individual spreadsheet rows are accessible via a simple web API, and queries can be made over the spreadsheet contents.

DataStorer: Automatically Add Data to the DataStore

Often, one wants data that is added to CKAN (whether it is linked to or uploaded to the :doc:`FileStore <filestore>`) to be automatically added to the DataStore. This requires some processing, to extract the data from your files and to add it to the DataStore in the format the DataStore can handle.

This task of automatically parsing and then adding data to the DataStore is performed by a DataStorer, a queue process that runs asynchronously and can be triggered by uploads or other activities. The DataStorer is an extension and can be found, along with installation instructions, at: https://github.com/okfn/ckanext-datastorer

The DataStore Data API

The DataStore's Data API, which derives from the underlying data table, is RESTful and JSON-based with extensive query capabilities.

Each resource in a CKAN instance can have an associated DataStore 'table'. The basic API for accessing the DataStore is outlined below. For a detailed tutorial on using this API see :doc:`using-data-api`.

API Reference

Note

Lists can always be expressed in different ways. It is possible to use actual lists, comma separated strings, or single items. These are all valid lists: ['foo', 'bar'], 'foo, bar', "foo", "bar" and 'foo'.

datastore_create

The datastore_create API endpoint allows a user to post JSON data to be stored against a resource. This endpoint also supports altering tables, aliases and indexes and bulk insertion. The JSON must be in the following form:

{
   resource_id:  # the resource id the data will be stored against.
   aliases:      # list of names for read-only aliases to the resource
   fields:       # a list of dictionaries of fields/columns and their extra metadata.
   records:      # a list of dictionaries of the data, eg:  [{"dob": "2005", "some_stuff": ['a', 'b']}, ..]
   primary_key:  # list of fields that represent a unique key
   indexes:      # indexes on table
}

See :ref:`fields` and :ref:`records` for details on how to lay out records.
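A fuller datastore_create payload, exercising the optional keys above, could look like the following sketch. The resource id, alias, and field names are invented for illustration:

```python
import json

# An illustrative datastore_create payload; resource id, alias and
# field names are made up for this sketch.
payload = {
    "resource_id": "my-resource-id",
    "aliases": ["weather_readings"],
    "fields": [
        {"id": "station", "type": "text"},
        {"id": "reading", "type": "float"},
    ],
    "records": [
        {"station": "A1", "reading": 17.4},
        {"station": "B2", "reading": 12.9},
    ],
    "primary_key": ["station"],   # needed later for upsert/update
    "indexes": ["station"],
}
body = json.dumps(payload)  # this is what gets POSTed to the endpoint
```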

datastore_delete

The datastore_delete API endpoint allows a user to delete records from a resource. The JSON must be in the following form:

{
   resource_id: # the resource id whose data will be deleted.
   filter:      # dictionary of matching conditions to delete
                # e.g. {'key1': 'a', 'key2': 'b'}
                # this is equivalent to "delete from table where key1 = 'a' and key2 = 'b'"
}
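As a sketch, a delete request matching the filter shown above would be serialised like this (resource id and keys are illustrative):

```python
import json

# Delete all records where key1 = 'a' and key2 = 'b'.
# Resource id and filter keys are invented for this sketch.
payload = {
    "resource_id": "my-resource-id",
    "filter": {"key1": "a", "key2": "b"},
}
body = json.dumps(payload)  # POSTed to /api/3/action/datastore_delete
```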

datastore_upsert

The datastore_upsert API endpoint allows a user to add or edit records in an existing DataStore resource. In order for the upsert and update methods to work, a unique key has to be defined via the datastore_create API endpoint. The JSON must be in the following form:

{
   resource_id: # resource id that the data is going to be stored under.
   records:     #  a list of dictionaries of the data, eg:  [{"dob": "2005", "some_stuff": ['a', 'b']}, ..]
   method:      #  the method to use to put the data into the datastore
                #  possible options: upsert (default), insert, update
}
upsert
Update if record with same key already exists, otherwise insert. Requires unique key.
insert
Insert only. This method is faster than upsert, but will fail if any inserted record matches an existing one. Does not require a unique key.
update
Update only. An exception will occur if the key that should be updated does not exist. Requires unique key.

datastore_search

The datastore_search API endpoint allows a user to search data in a resource. The JSON for searching must be in the following form:

{
    resource_id:  # the resource id to be searched against
    filters:      # dictionary of matching conditions to select, e.g. {'key1': 'a', 'key2': 'b'}
                  # this is equivalent to "select * from table where key1 = 'a' and key2 = 'b'"
    q:            # full text query
    plain:        # treat as plain text query (default: true)
    language:     # language of the full text query (default: english)
    limit:        # limit the amount of rows to size (default: 100)
    offset:       # offset the amount of rows
    fields:       # list of fields to return, in that order; defaults (empty or not present) to all fields in the table's field order.
    sort:         # ordered list of field names as, eg: "fieldname1, fieldname2 desc"
}
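Since datastore_search parameters can also be passed in the query string of a GET request, a query can be assembled like the following sketch. The resource id and filter keys are illustrative, and we assume the filters value is passed as a JSON-encoded string:

```python
import json
from urllib.parse import urlencode

# Build a datastore_search GET query string. Resource id and filter
# keys are invented; `filters` is assumed to be JSON-encoded.
params = {
    "resource_id": "my-resource-id",
    "filters": json.dumps({"key1": "a", "key2": "b"}),
    "limit": 5,
    "sort": "key1 desc",
}
url = "http://127.0.0.1:5000/api/3/action/datastore_search?" + urlencode(params)
```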

datastore_search_sql

The datastore_search_sql API endpoint allows a user to search data in a resource or connect multiple resources with join expressions. The underlying SQL engine is the PostgreSQL engine. The JSON for searching must be in the following form:

{
   sql:  # a single sql select statement
}
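A sketch of such a payload follows. The table is addressed by its resource id (or an alias); the id is double-quoted here because resource ids typically contain hyphens. All names are illustrative:

```python
import json

# A datastore_search_sql payload; the resource id and column names are
# invented. The id is double-quoted because it contains hyphens.
payload = {
    "sql": 'SELECT "station", "reading" FROM "my-resource-id" WHERE "reading" > 10'
}
body = json.dumps(payload)  # POSTed to /api/3/action/datastore_search_sql
```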

datastore_search_htsql

Note

HTSQL is not in the core DataStore. To use it, it is necessary to install the ckanext-htsql extension available at https://github.com/okfn/ckanext-htsql.

The datastore_search_htsql API endpoint allows a user to search data in a resource using the HTSQL query expression language. The JSON for searching must be in the following form:

{
   htsql:  # an HTSQL query statement.
}

Fields

Fields define the column names and the type of the data in a column. A field is defined as follows:

{
    "id":    # a string which defines the column name
    "type":  # the data type for the column
}

Field types are optional and will be guessed by the DataStore from the provided data. However, setting the types ensures that future inserts will not fail because of wrong types. See :ref:`valid-types` for details on which types are valid.

Example:

[
    {
        "id": "foo",
        "type": "int4"
    },
    {
        "id": "bar"
        # type is optional
    }
]

Records

A record is the data to be inserted in a table and is defined as follows:

{
    "<id>":  # data to be set
    # .. more data
}

Example:

[
    {
        "foo": 100,
        "bar": "Here's some text"
    },
    {
        "foo": 42
    }
]

Field types

The DataStore supports all types supported by PostgreSQL, as well as a few additions. A list of the PostgreSQL types can be found in the type section of the PostgreSQL documentation. Below is a list of the most common data types. The json type has been added as storage for nested data.

In addition to the types listed below, you can also use array types. They are defined by prepending a _ or appending [] or [n], where n denotes the length of the array. An arbitrarily long array of integers would be defined as int[].

text
Arbitrary text data, e.g. Here's some text.
json
Arbitrary nested json data, e.g. {"foo": 42, "bar": [1, 2, 3]}. Please note that this type is a custom type that is wrapped by the DataStore.
date
Date without time, e.g. 2012-5-25.
time
Time without date, e.g. 12:42.
timestamp
Date and time, e.g. 2012-10-01T02:43Z.
int
Integer numbers, e.g. 42, 7.
float
Floats, e.g. 1.61803.
bool
Boolean values, e.g. true, 0.

You can find more information about the formatting of dates in the date/time types section of the PostgreSQL documentation.
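The array notation described above can be combined with field definitions. A small sketch, with invented column names:

```python
# Field definitions using array types. Per the rules above, text[]
# declares an arbitrarily long text array and int[3] a fixed-length
# integer array. Column names are invented for this sketch.
fields = [
    {"id": "tags", "type": "text[]"},    # arbitrarily long text array
    {"id": "triple", "type": "int[3]"},  # integer array of length 3
]
```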

Table aliases

A resource in the DataStore can have multiple aliases that are easier to remember than the resource id. Aliases can be created and edited with the datastore_create API endpoint. All aliases can be found in a special view called _table_metadata.
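Since _table_metadata is itself readable through datastore_search, the aliases on an instance can be listed with an ordinary search request. A sketch (assumes a CKAN instance at the address used in the examples above):

```python
from urllib.parse import urlencode

# Query the special _table_metadata view to list all aliases.
# Assumes a running CKAN instance at this address.
url = ("http://127.0.0.1:5000/api/3/action/datastore_search?"
       + urlencode({"resource_id": "_table_metadata"}))
# Fetching this URL returns one record per table/alias in the DataStore.
```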

Comparison of different querying methods

The DataStore supports querying with multiple API endpoints. They are similar but support different features. The following list gives an overview of the different methods.

                             :ref:`datastore_search`   :ref:`datastore_search_sql`   :ref:`datastore_search_htsql`
Status                       Stable                    Stable                        Available as extension
Ease of use                  Easy                      Complex                       Medium
Flexibility                  Low                       High                          Medium
Query language               Custom (JSON)             SQL                           HTSQL
Connect multiple resources   No                        Yes                           Not yet
Use aliases                  Yes                       Yes                           Yes