Performance tips for large imports

Adrià Mercader edited this page Jun 25, 2013 · 1 revision

Note: this page collects different approaches to improving performance that were proposed during this discussion: https://github.com/okfn/ckan/issues/681. If the tips on this page don't help, it is worth reading the whole thread, as it includes an excellent analysis by @wardi.


In some cases you need to import a large number of datasets into a new or existing instance. Whether it runs via the harvesters or a custom script, this import is likely to go through the action API. Because of the way CKAN has historically been structured, at some point the import (and, with a very large number of datasets, the site in general) will start to run slower. This page describes some approaches used on existing instances with large numbers of datasets (like the Canada and US national portals).
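As an illustration of the kind of custom script discussed here, the following is a minimal sketch of building a package_create request against the action API, using only the Python 3 standard library. The function name, site URL and API key are placeholders made up for this example; they are not taken from any existing importer.

```python
import json
import urllib.request


def build_package_create_request(site_url, api_key, dataset):
    """Build (but do not send) a POST request for the package_create action.

    site_url, api_key and dataset are supplied by the caller; the values
    used in any example call are hypothetical.
    """
    url = site_url.rstrip("/") + "/api/3/action/package_create"
    body = json.dumps(dataset).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # CKAN reads the API key from the Authorization header.
        "Authorization": api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned request to urllib.request.urlopen sends it. A real import would do this once per dataset, which is why the per-dataset costs described below add up.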

Note that all these tweaks are relatively custom, and it is important to plan in advance and understand their consequences. Remember to always back up any existing data and code before implementing any of them.

Solr

A significant amount of time after creations and updates is spent committing the changes to the Solr search index. Depending on the Solr version you are using, there are different workarounds to avoid committing after each dataset is indexed. If you are using Solr 1.3 or 1.4, the easiest way is to disable synchronous indexing with the following config option, let the import finish, and then rebuild the search index afterwards (with paster search-index rebuild):

ckan.search.automatic_indexing = false

You can also disable committing with the following option, and then either do a manual commit on Solr afterwards or run a cron job that does the commits:

ckan.search.solr_commit = false
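One way to issue the manual commit is to POST a commit message to Solr's update handler. The sketch below only builds that request; the function name is made up for this example, and the Solr URL you pass in (including any core name) depends on your setup.

```python
import urllib.request


def build_solr_commit_request(solr_url):
    """Build (but do not send) a request asking Solr to commit pending changes.

    solr_url is the base URL of the Solr instance or core, for example the
    value of solr_url in the CKAN config file.
    """
    url = solr_url.rstrip("/") + "/update"
    return urllib.request.Request(
        url,
        data=b"<commit/>",  # XML update message that triggers a commit
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
```

A small script that sends this request with urllib.request.urlopen could be run once at the end of the import, or periodically from cron while the import is in progress.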

On Solr 4 you can combine the previous flag with a Solr config option that automatically does a commit every 30 seconds or so, plus a soft commit option that can run about every second (the autoCommit and autoSoftCommit settings in solrconfig.xml).

Database

There are a lot of constraints defined on a CKAN database that slow insertions significantly when you have a large number of datasets.

Performance improved greatly when dropping these constraints:

https://github.com/okfn/ckanext-geodatagov/blob/master/constraints.sql

Whether to restore them once the import is finished will probably depend on whether you need to do more big imports in the future. Bear in mind that dropping some of them could cause errors if the application is not handling constraints properly, or cause problems later, e.g. during a database migration.

Modifying these indexes also helped quite a lot:

https://github.com/okfn/ckanext-geodatagov/blob/master/what_to_alter.sql

Hopefully some of these changes will find their way into CKAN's codebase and become the default at some point.

Tags and extras

Because of how the database is defined, inserting new datasets with several tags can be very slow in CKAN. We ran into the same situation on geodatagov and decided to store the tags from the remote document as an extra in the CKAN dataset, and use the before_view extension point to put them back into the dataset dict [1], so they are shown as usual on the dataset page. (You can use the after_show extension point if you also want the tags to come up elsewhere, e.g. in the API.)
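The idea can be sketched with two plain functions operating on a dataset dict. This is not the geodatagov code linked in [1]: the extra key, the JSON encoding and the function names are assumptions made for illustration, and in a plugin the unpacking step would be called from the extension point mentioned above.

```python
import json

# Hypothetical key for the extra that holds the tag names.
TAGS_EXTRA_KEY = "packed_tags"


def pack_tags(dataset):
    """Return a copy of the dataset with its tags moved into a single extra."""
    packed = dict(dataset)
    names = [tag["name"] for tag in packed.pop("tags", [])]
    extras = list(packed.get("extras", []))
    extras.append({"key": TAGS_EXTRA_KEY, "value": json.dumps(names)})
    packed["extras"] = extras
    return packed


def unpack_tags(dataset):
    """Return a copy of the dataset with tags restored from the packed extra."""
    unpacked = dict(dataset)
    tags = list(unpacked.get("tags", []))
    remaining = []
    for extra in unpacked.get("extras", []):
        if extra["key"] == TAGS_EXTRA_KEY:
            tags.extend({"name": name} for name in json.loads(extra["value"]))
        else:
            remaining.append(extra)
    unpacked["tags"] = tags
    unpacked["extras"] = remaining
    return unpacked
```

Packing before the dataset is created means no tag rows are written at all, at the cost that anything relying on real tag records will not see these tags unless they are unpacked first.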

The same principle can be applied to extras, storing all of them in a combined extra on before_action [2] and unpacking them on after_show [3].
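For extras, a comparable sketch collapses the whole list into one JSON-encoded extra and expands it again on the way out. As before, the key name and function names are hypothetical and this is not the implementation linked in [2] and [3].

```python
import json

# Hypothetical key for the single extra that holds all the others.
COMBINED_EXTRA_KEY = "combined_extras"


def combine_extras(extras):
    """Collapse a list of {'key', 'value'} extras into a single extra."""
    combined = {extra["key"]: extra["value"] for extra in extras}
    value = json.dumps(combined, sort_keys=True)
    return [{"key": COMBINED_EXTRA_KEY, "value": value}]


def expand_extras(extras):
    """Expand the combined extra back into individual extras."""
    expanded = []
    for extra in extras:
        if extra["key"] == COMBINED_EXTRA_KEY:
            for key, value in json.loads(extra["value"]).items():
                expanded.append({"key": key, "value": value})
        else:
            expanded.append(extra)
    return expanded
```

With this layout each dataset writes a single extras row however many extras it carries.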

[1] https://github.com/okfn/ckanext-geodatagov/blob/master/ckanext/geodatagov/plugins.py#L269

[2] https://github.com/okfn/ckanext-geodatagov/blob/master/ckanext/geodatagov/plugins.py#L239

[3] https://github.com/okfn/ckanext-geodatagov/blob/master/ckanext/geodatagov/plugins.py#L335