Permalink
124 lines (87 sloc) 2.94 KB

Helpers

Collection of simple helper functions that abstract some specifics or the raw API.

Bulk helpers

There are several helpers for the bulk API since its requirement for specific formatting and other considerations can make it cumbersome if used directly.

All bulk helpers accept an instance of Elasticsearch class and an iterable actions (any iterable, can also be a generator, which is ideal in most cases since it will allow you to index large datasets without the need of loading them into memory).

The items in the action iterable should be the documents we wish to index in several formats. The most common one is the same as returned by :meth:`~elasticsearch.Elasticsearch.search`, for example:

{
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
    '_routing': 5,
    'pipeline': 'my-ingest-pipeline',
    '_source': {
        "title": "Hello World!",
        "body": "..."
    }
}

Alternatively, if _source is not present, it will pop all metadata fields from the doc and use the rest as the document data:

{
    "_id": 42,
    "_routing": 5,
    "title": "Hello World!",
    "body": "..."
}

The :meth:`~elasticsearch.Elasticsearch.bulk` api accepts index, create, delete, and update actions. Use the _op_type field to specify an action (_op_type defaults to index):

{
    '_op_type': 'delete',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
}
{
    '_op_type': 'update',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
    'doc': {'question': 'The life, universe and everything.'}
}

Example:

Lets say we have an iterable of data. Lets say a list of words called mywords and we want to index those words into individual documents where the structure of the document is like {"word": "<myword>"}.

def gendata():
    mywords = ['foo', 'bar', 'baz']
    for word in mywords:
        yield {
            "_index": "mywords",
            "_type": "document",
            "doc": {"word": word},
        }

bulk(es, gendata())

For a more complete and complex example please take a look at https://github.com/elastic/elasticsearch-py/blob/master/example/load.py#L76-L130

Note

When reading raw json strings from a file, you can also pass them in directly (without decoding to dicts first). In that case, however, you lose the ability to specify anything (index, type, even id) on a per-record basis, all documents will just be sent to elasticsearch to be indexed as-is.

.. py:module:: elasticsearch.helpers

.. autofunction:: streaming_bulk

.. autofunction:: parallel_bulk

.. autofunction:: bulk


Scan

.. autofunction:: scan


Reindex

.. autofunction:: reindex