diff --git a/doc/importing-datasets.rst b/doc/importing-datasets.rst
index 1df5e191557..3611e33a55b 100644
--- a/doc/importing-datasets.rst
+++ b/doc/importing-datasets.rst
@@ -1,29 +1,31 @@
-=============
-Load Datasets
-=============
+==================
+Importing Datasets
+==================
 
-You can upload individual datasets through the CKAN front-end, but for importing datasets on masse, you have two choices:
+You can create individual datasets using the CKAN front-end.
+However, when importing multiple datasets it is generally more efficient to
+automate this process in some way.
+There are two common approaches to importing datasets in CKAN:
 
-* :ref:`load-data-api`. You can use the `CKAN API `_ to script import. To simplify matters, we offer provide standard loading scripts for Google Spreadsheets, CSV and Excel.
+* :ref:`load-data-api`. Using the `CKAN API `_.
 
-* :ref:`load-data-harvester`. The `CKAN harvester extension `_ provides web and command-line interfaces for larger import tasks.
+* :ref:`load-data-harvester`. Using the
+  `CKAN harvester extension `_.
+  This provides web and command-line interfaces for larger import tasks.
 
-If you need advice on data import, `contact the ckan-dev mailing list `_.
+.. note :: If loading your data requires scraping a web page regularly, you
+   may find it best to write a scraper on
+   `ScraperWiki `_ and combine this with either of
+   the methods above.
 
-.. note :: If loading your data requires scraping a web page regularly, you may find it best to write a scraper on `ScraperWiki `_ and combine this with either of the methods above.
 
 .. _load-data-api:
 
 Import Data with the CKAN API
 -----------------------------
 
-You can use the `CKAN API `_ to upload datasets directly into your CKAN instance.
-
-The Simplest Approach - CKAN API
-++++++++++++++++++++++++++++++++
-
-The simplest way to automate dataset loading is with a Python script using
-:doc:`CKAN's API `. Here's an example script to create a new dataset::
+You can use the `CKAN API `_ to upload datasets directly into your
+CKAN instance. Here's an example script that creates a new dataset::
 
     #!/usr/bin/env python
     import urllib2
@@ -33,8 +35,8 @@ The simplest way to automate dataset loading is with a Python script using
 
     # Put the details of the dataset we're going to create into a dict.
     dataset_dict = {
-        'name': 'my_dataset_name',
-        'notes': 'A long description of my dataset',
+        'name': 'my_dataset_name',
+        'notes': 'A long description of my dataset',
     }
 
     # Use the json module to dump the dictionary to a string for posting.
@@ -42,7 +44,7 @@ The simplest way to automate dataset loading is with a Python script using
 
     # We'll use the package_create function to create a new dataset.
     request = urllib2.Request(
-        'http://www.my_ckan_site.com/api/action/package_create')
+        'http://www.my_ckan_site.com/api/action/package_create')
 
     # Creating a dataset requires an authorization header.
     # Replace *** with your API key, from your user account on the CKAN site
@@ -62,56 +64,20 @@ The simplest way to automate dataset loading is with a Python script using
     pprint.pprint(created_package)
 
 
-Loader Scripts
-++++++++++++++
-
-'Loader scripts' provide a simple way to take any format metadata and bulk upload it to a remote CKAN instance.
-
-Essentially each set of loader scripts converts the dataset metadata to the standard 'dataset' format, and then loads it into CKAN.
-
-To get a flavour of what loader scripts look like, take a look at `the ONS scripts `_.
-
-Loader Scripts for CSV and Excel
-********************************
-
-For CSV and Excel formats, the `SpreadsheetPackageImporter` (found in ``ckanext-importlib/ckanext/importlib/spreadsheet_importer.py``) loader script wraps the file in `SpreadsheetData` before extracting the records into `SpreadsheetDataRecords`.
-
-SpreadsheetPackageImporter copes with multiple title rows, data on multiple sheets, dates. The loader can reload datasets based on a unique key column in the spreadsheet, choose unique names for datasets if there is a clash, add/merge new resources for existing datasets and manage dataset groups.
-
-Loader Scripts for Google Spreadsheets
-**************************************
-
-The `SimpleGoogleSpreadsheetLoader` class (found in ``ckanclient.loaders.base``) simplifies the process of loading data from Google Spreadsheets (there is an additional dependency on the ``gdata`` Python package).
-
-`This script `_ has a simple example of loading data from Google Spreadsheets.
-
-Write Your Own Loader Script
-****************************
-
-## this needs work ##
-
-First, you need an importer that derives from `PackageImporter` (found in ``ckan/lib/importer.py``). This takes whatever format the metadata is in and sorts it into records of type `DataRecord`.
-
-Next, each DataRecord is converted into the correct fields for a dataset using the `record_2_package` method. This results in dataset dictionaries.
-
-The `PackageLoader` takes the dataset dictionaries and loads them onto a CKAN instance using the ckanclient. There are various settings to determine:
-
- * ##how to identify the same dataset, previously been loaded into CKAN.## This can be simply by name or by an identifier stored in another field.
- * how to merge in changes to an existing datasets. It can simply replace it or maybe merge in resources etc.
-
-The loader should be given a command-line interface using the `Command` base class (``ckanext/command.py``).
-
-You need to add a line to the CKAN ``setup.py`` (under ``[console_scripts]``) and when you run ``python setup.py develop`` it creates a script for you in your Python environment.
-
 .. _load-data-harvester:
 
 Import Data with the Harvester Extension
 ----------------------------------------
 
-The `CKAN harvester extension `_ provides useful tools for more advanced data imports.
+The `CKAN harvester extension `_
+provides useful tools for more advanced data imports.
 
-These include a command-line interface and a web user interface for running harvesting jobs.
+These include a command-line interface and a web user interface for running
+harvesting jobs.
 
-To use the harvester extension, create a class that implements the `harvester interface ` derived from the `base class of the harvester extension `_.
+To use the harvester extension, create a class that implements the
+`harvester interface `
+derived from the
+`base class of the harvester extension `_.
 
 For more information on working with extensions, see :doc:`extensions`.
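
Note on the API example in this patch: the script creates a single dataset with Python 2's ``urllib2``, while the new introduction recommends automating imports of multiple datasets. As a rough illustration only, the same ``package_create`` call could be looped over several dataset dicts in Python 3. This sketch is not part of the patch. The helper names ``build_create_request`` and ``create_datasets``, the JSON request body with a ``Content-Type`` header, and the ``success``/``result`` response keys are assumptions to check against the target CKAN version.

```python
import json
import urllib.request


def build_create_request(site_url, api_key, dataset_dict):
    """Build (but do not send) a package_create request for one dataset.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Encode the dataset dict as a JSON request body (assumed to be accepted
    # by the action API of the target CKAN version).
    body = json.dumps(dataset_dict).encode('utf-8')
    request = urllib.request.Request(
        site_url.rstrip('/') + '/api/action/package_create',
        data=body,
        method='POST')
    # Creating a dataset requires an authorization header carrying the API key.
    request.add_header('Authorization', api_key)
    request.add_header('Content-Type', 'application/json')
    return request


def create_datasets(site_url, api_key, dataset_dicts):
    """Send one package_create request per dataset and collect the results.

    Hypothetical helper: stops at the first failure by raising an exception.
    """
    created = []
    for dataset_dict in dataset_dicts:
        request = build_create_request(site_url, api_key, dataset_dict)
        with urllib.request.urlopen(request) as response:
            response_dict = json.loads(response.read().decode('utf-8'))
        # Assumption: the response carries 'success' and 'result' keys.
        if not response_dict.get('success'):
            raise RuntimeError(
                'package_create failed for %r' % dataset_dict.get('name'))
        created.append(response_dict['result'])
    return created
```

Calling ``create_datasets('http://www.my_ckan_site.com', '***', [dataset_dict])`` with the placeholder site and key from the patch would send one request per dict. Keeping request construction separate from sending makes the payload easy to inspect without contacting a CKAN site.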