# Getting started with Elastic via Python API

Start by importing the `elasticsearch` library. Make sure it is installed with `python3 -m pip install --user elasticsearch`.

## Connecting and testing

In [7]:
from elasticsearch import Elasticsearch

Establish a connection. It defaults to `localhost:9200` if the `hosts` argument is omitted.

In [8]:
es = Elasticsearch(hosts=["192.168.10.14:9200"])

Always make sure your cluster connection is actually alive. 

In [9]:
es.ping()

True
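
A cluster that has just been started may not answer the first ping. The following is a minimal sketch of a retry loop, assuming that `ping()` returns `False` rather than raising when the node is unreachable. The helper name `wait_until_alive` and its parameters are hypothetical and not part of the `elasticsearch` library. It accepts any callable, so a stand-in is used here instead of a live connection.

```python
import time

def wait_until_alive(ping, attempts=5, delay=1.0):
    """Call ping() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if ping():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts:
            time.sleep(delay)
    return False

# With a real client this would be: wait_until_alive(es.ping)
# Here a stand-in callable fails twice and then succeeds.
answers = iter([False, False, True])
alive = wait_until_alive(lambda: next(answers), attempts=5, delay=0.01)
print(alive)
```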

## Indexing your first document

In [45]:
document = {
    "field1": "val1",
    "field2": "val1",
    "field3": 123
}
es.index("second", doc_type="doc", body=document, id="BBBB")

{'_index': 'second',
 '_type': 'doc',
 '_id': 'BBBB',
 '_version': 3,
 'result': 'updated',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 2,
 '_primary_term': 1}

Note that the `elasticsearch` library is just a wrapper around the HTTP API, so the previous example is roughly equivalent to this:

In [46]:
import requests
import json
url = "http://192.168.10.14:9200/second/doc/BBBB"
headers = { "Content-Type": "application/json" }

resp = requests.post(url, data=json.dumps(document), headers=headers)
print(resp.json())

{'_index': 'second', '_type': 'doc', '_id': 'BBBB', '_version': 4, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1}


Then retrieve it.

In [47]:
newdoc = es.get("second", doc_type="doc", id="BBBB")
print(newdoc)

{'_index': 'second', '_type': 'doc', '_id': 'BBBB', '_version': 4, '_seq_no': 3, '_primary_term': 1, 'found': True, '_source': {'field1': 'val1', 'field2': 'val1', 'field3': 123}}


Elastic attaches a fair amount of metadata. The actual source document is in the `_source` field.

In [48]:
newdoc = newdoc["_source"]
print(newdoc)

{'field1': 'val1', 'field2': 'val1', 'field3': 123}


## Bulk API

Clients talk to Elasticsearch over HTTP or the transport protocol, and every request carries overhead, so indexing documents one at a time is fairly expensive. This matters especially for high-volume data such as IDS logs. The proper way is to use the `bulk` API.

See:
 * https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html

The bulk format requires a metadata line before each document to indicate which action should be taken, which index to use, and so on. Consider this illustration:

In [49]:
bulk = []
for i in range(100):
    # Action line: index the next document into "third" under this ID.
    meta = {
        "index": {
            "_index": "third",
            "_type": "_doc",
            "_id": i
        }
    }
    # Source line: the document itself.
    doc = {
        "message": "this is message {}".format(i),
        "count": i
    }

    # The bulk body alternates between action and source lines.
    bulk.append(meta)
    bulk.append(doc)

Verify the message structure by printing the first 10 elements.

In [54]:
for msg in bulk[0:10]:
    print(msg)

{'index': {'_index': 'third', '_type': '_doc', '_id': 0}}
{'message': 'this is message 0', 'count': 0}
{'index': {'_index': 'third', '_type': '_doc', '_id': 1}}
{'message': 'this is message 1', 'count': 1}
{'index': {'_index': 'third', '_type': '_doc', '_id': 2}}
{'message': 'this is message 2', 'count': 2}
{'index': {'_index': 'third', '_type': '_doc', '_id': 3}}
{'message': 'this is message 3', 'count': 3}
{'index': {'_index': 'third', '_type': '_doc', '_id': 4}}
{'message': 'this is message 4', 'count': 4}
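
On the wire, the bulk endpoint expects newline-delimited JSON: one JSON object per line, with a newline after the last line as well. The client library handles that serialization when given a list like the one above. The sketch below only shows roughly what the request body looks like. The helper name `to_ndjson` is hypothetical, and the sample is a shortened copy of the list built earlier.

```python
import json

def to_ndjson(lines):
    """Serialize a list of dicts as newline-delimited JSON."""
    # Every line, including the last one, is terminated by a newline.
    return "".join(json.dumps(line) + "\n" for line in lines)

sample = [
    {"index": {"_index": "third", "_id": 0}},
    {"message": "this is message 0", "count": 0},
    {"index": {"_index": "third", "_id": 1}},
    {"message": "this is message 1", "count": 1},
]
payload = to_ndjson(sample)
print(payload, end="")
```

When sending such a body manually with `requests`, the `Content-Type` header would be `application/x-ndjson`.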


Then send the bulk request to Elasticsearch and verify that everything was indexed correctly.

In [59]:
resp = es.bulk(bulk)
#print(json.dumps(resp, indent=2))
print(resp["errors"])

False


The result for each indexed document is returned, so we can check a subset of them.

In [61]:
for result in resp["items"][0:10]:
    print(result)

{'index': {'_index': 'third', '_type': '_doc', '_id': '0', '_version': 7, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 600, '_primary_term': 1, 'status': 200}}
{'index': {'_index': 'third', '_type': '_doc', '_id': '1', '_version': 7, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 601, '_primary_term': 1, 'status': 200}}
{'index': {'_index': 'third', '_type': '_doc', '_id': '2', '_version': 7, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 602, '_primary_term': 1, 'status': 200}}
{'index': {'_index': 'third', '_type': '_doc', '_id': '3', '_version': 7, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 603, '_primary_term': 1, 'status': 200}}
{'index': {'_index': 'third', '_type': '_doc', '_id': '4', '_version': 7, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 604, '_primary_term': 1, 'statu
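
When `errors` is `True`, only some of the items may have failed, so it helps to pick out the rejected ones instead of printing everything. The following is a minimal sketch. The helper name `failed_items` is hypothetical, and the sample response is hand-written for illustration, not output from a real cluster.

```python
def failed_items(bulk_response):
    """Return (action, id, status, error) for every item that was rejected."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a one-key dict such as {"index": {...}}.
        for action, result in item.items():
            if "error" in result or result.get("status", 200) >= 300:
                failures.append(
                    (action, result.get("_id"), result.get("status"), result.get("error"))
                )
    return failures

# Hand-written response for illustration only.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_index": "third", "_id": "0", "status": 200, "result": "updated"}},
        {"index": {"_index": "third", "_id": "1", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse"}}},
    ],
}
for failure in failed_items(sample_response):
    print(failure)
```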

## Running your first search query

Now we can run a search against this index, looking for documents where the `count` field is `>= 12` and `<= 20`. Because `size` is set to 3, only three of the matching documents are returned.

In [63]:
results = es.search(index="third", doc_type="_doc", body={
    "size": 3,
    "query": {
        "range": {
            "count": {
                "gte": 12,
                "lte": 20,
            }
        }
    }
})
print(results.keys())
print(results["hits"].keys())

dict_keys(['took', 'timed_out', '_shards', 'hits'])
dict_keys(['total', 'max_score', 'hits'])


In [64]:
if not results["timed_out"]:
    for result in results["hits"]["hits"]:
        print(result)

{'_index': 'third', '_type': '_doc', '_id': '12', '_score': 1.0, '_source': {'message': 'this is message 12', 'count': 12}}
{'_index': 'third', '_type': '_doc', '_id': '13', '_score': 1.0, '_source': {'message': 'this is message 13', 'count': 13}}
{'_index': 'third', '_type': '_doc', '_id': '14', '_score': 1.0, '_source': {'message': 'this is message 14', 'count': 14}}
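
The bounds in a `range` query are combined with AND, so a document must satisfy both of them. The sketch below mimics that logic in plain Python over the same `count` values that were indexed, to show how many documents match before `size` trims the result. The helper name `in_range` is hypothetical, and this is a local model, not how Elasticsearch evaluates the query internally.

```python
def in_range(value, gte=None, lte=None):
    """True when the value satisfies every bound that is set."""
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True

counts = range(100)  # the same `count` values as the indexed documents
matching = [c for c in counts if in_range(c, gte=12, lte=20)]
print(len(matching))  # how many values fall inside both bounds
print(matching[:3])   # a "size" of 3 keeps only three of them
```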


Note that omitting `_id` causes Elasticsearch to autogenerate one. If you index the same log again with the same explicit ID, the existing document is updated. Without an explicit ID, the second indexing round creates a duplicate.

Here the second indexing call updates the first document.

In [66]:
es.index(index="fourth", doc_type="doc", body=document, id="BBBB")
es.index(index="fourth", doc_type="doc", body=document, id="BBBB")

{'_index': 'fourth',
 '_type': 'doc',
 '_id': 'BBBB',
 '_version': 8,
 'result': 'updated',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 7,
 '_primary_term': 1}

Here two separate documents will be created.

In [67]:
es.index(index="fifth", doc_type="doc", body=document)
es.index(index="fifth", doc_type="doc", body=document)

{'_index': 'fifth',
 '_type': 'doc',
 '_id': '13qrD3ABOj6xcWqo7S_q',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 3,
 '_primary_term': 1}

In [68]:
idx = es.cat.indices()

We can verify this via the `_cat` API: the *fourth* index has only one document while *fifth* has more.

In [70]:
print(idx)

yellow open third  eYOOxir8TG2nOEBgy9i5JQ 1 1 100 400 29.5kb 29.5kb
yellow open fifth  z6oY9K_rSvWnyLabwhW5DQ 1 1   2   0  4.4kb  4.4kb
yellow open fourth gPJphi7_QBq8vKjSDzarUg 1 1   1   1 11.2kb 11.2kb
yellow open second d2blTfNLTd6AJjvHfIID6Q 1 1   1   1  5.8kb  5.8kb



Or we can verify the same thing by querying these indices.

In [73]:
es.search(index="fifth")

{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 4, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'fifth',
    '_type': 'doc',
    '_id': '1HqED3ABOj6xcWqoby8C',
    '_score': 1.0,
    '_source': {'field1': 'val1', 'field2': 'val1', 'field3': 123}},
   {'_index': 'fifth',
    '_type': 'doc',
    '_id': '1XqED3ABOj6xcWqoby87',
    '_score': 1.0,
    '_source': {'field1': 'val1', 'field2': 'val1', 'field3': 123}},
   {'_index': 'fifth',
    '_type': 'doc',
    '_id': '1nqrD3ABOj6xcWqo7S_K',
    '_score': 1.0,
    '_source': {'field1': 'val1', 'field2': 'val1', 'field3': 123}},
   {'_index': 'fifth',
    '_type': 'doc',
    '_id': '13qrD3ABOj6xcWqo7S_q',
    '_score': 1.0,
    '_source': {'field1': 'val1', 'field2': 'val1', 'field3': 123}}]}}

In [74]:
es.search(index="fourth")

{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'fourth',
    '_type': 'doc',
    '_id': 'BBBB',
    '_score': 1.0,
    '_source': {'field1': 'val1', 'field2': 'val1', 'field3': 123}}]}}
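
Logs rarely come with a natural ID. One way to still get the update-instead-of-duplicate behaviour is to derive the ID from the document content. The following is a minimal sketch under that assumption. The helper name `content_id` is hypothetical, and SHA-1 is just one possible choice of hash.

```python
import hashlib
import json

def content_id(doc):
    """Derive a stable ID from the content of a document."""
    # Sorting keys makes the ID independent of dict insertion order.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

first = content_id({"field1": "val1", "field2": "val1", "field3": 123})
second = content_id({"field3": 123, "field2": "val1", "field1": "val1"})
other = content_id({"field1": "val1", "field2": "val1", "field3": 124})
print(first == second)  # same content, same ID
print(first == other)   # different content, different ID
```

The result could then be passed as the `id` argument when indexing, for example `id=content_id(document)`. Note that two genuinely distinct events with identical content would collapse into one document with this approach.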

## Managing templates

Finally, note that we can manage pretty much anything via the Elastic Python API. For example, we can create a mapping template programmatically. This is a typical base template, similar to what Logstash creates, but it is easier to customize this way. For example, instead of applying only to the `logstash-*` index pattern, it also covers `events-*` and `suricata-*`.

In [88]:
DEFAULT_SETTINGS = {
    "index": {
        "number_of_shards": 3,
        "number_of_replicas": 0,
        "refresh_interval": "30s"
    }
}

DEFAULT_PROPERTIES = {
    "@timestamp": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||date_time"
    },
    "@version": {
        "type": "keyword"
    },
    "ip": {
        "type": "ip"
    }
}

DEFAULT_MAPPINGS = {
    "dynamic_templates": [
        {
            "message_field": {
                "path_match": "message",
                "mapping": {
                    "norms": False,
                    "type": "text"
                },
                "match_mapping_type": "string"
            }
        },
        {
            "string_fields": {
                "mapping": {
                    "norms": False,
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword"
                        }
                    }
                },
                "match_mapping_type": "string",
                "match": "*"
            }
        }
    ],
    "properties": DEFAULT_PROPERTIES
}

DEFAULT_PATTERNS = [
    "logstash-*",
    "events-*",
    "suricata-*"
]

DEFAULT_TEMPLATE = {
    "order": 0,
    "version": 0,
    "index_patterns": DEFAULT_PATTERNS,
    "settings": DEFAULT_SETTINGS,
    "mappings": DEFAULT_MAPPINGS,
    "aliases": {}
}

In [89]:
print(DEFAULT_TEMPLATE)

{'order': 0, 'version': 0, 'index_patterns': ['logstash-*', 'events-*', 'suricata-*'], 'settings': {'index': {'number_of_shards': 3, 'number_of_replicas': 0, 'refresh_interval': '30s'}}, 'mappings': {'dynamic_templates': [{'message_field': {'path_match': 'message', 'mapping': {'norms': False, 'type': 'text'}, 'match_mapping_type': 'string'}}, {'string_fields': {'mapping': {'norms': False, 'type': 'text', 'fields': {'keyword': {'type': 'keyword'}}}, 'match_mapping_type': 'string', 'match': '*'}}], 'properties': {'@timestamp': {'type': 'date', 'format': 'strict_date_optional_time||epoch_millis||date_time'}, '@version': {'type': 'keyword'}, 'ip': {'type': 'ip'}}}, 'aliases': {}}


In [90]:
tpl = DEFAULT_TEMPLATE
resp = es.indices.put_template("default", body=tpl)
print(resp)

{'acknowledged': True}


In [91]:
resp = es.indices.get_template("default")
print(resp)

{'default': {'order': 0, 'version': 0, 'index_patterns': ['logstash-*', 'events-*', 'suricata-*'], 'settings': {'index': {'number_of_shards': '3', 'number_of_replicas': '0', 'refresh_interval': '30s'}}, 'mappings': {'dynamic_templates': [{'message_field': {'path_match': 'message', 'mapping': {'norms': False, 'type': 'text'}, 'match_mapping_type': 'string'}}, {'string_fields': {'mapping': {'norms': False, 'type': 'text', 'fields': {'keyword': {'type': 'keyword'}}}, 'match_mapping_type': 'string', 'match': '*'}}], 'properties': {'@timestamp': {'format': 'strict_date_optional_time||epoch_millis||date_time', 'type': 'date'}, 'ip': {'type': 'ip'}, '@version': {'type': 'keyword'}}}, 'aliases': {}}}
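
To see which indices this template would apply to, the index patterns can be approximated locally with shell-style wildcards. This is a sketch only: `fnmatchcase` also understands `?` and `[...]`, which goes beyond the simple `*` wildcard used here, and the dated index names are hypothetical examples.

```python
from fnmatch import fnmatchcase

PATTERNS = ["logstash-*", "events-*", "suricata-*"]

def matching_patterns(index_name, patterns=PATTERNS):
    """Return the patterns that a given index name matches."""
    return [p for p in patterns if fnmatchcase(index_name, p)]

print(matching_patterns("suricata-2020.02.04"))
print(matching_patterns("windows-2020.02.04"))
```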


Templates can be layered on top of each other, with the `order` value specifying precedence: templates with a higher `order` are applied later and override those with a lower one. Therefore, we can create a base template that applies to all index patterns, and then a more specific template for indices containing a particular event type. The former ensures that basic settings like sharding and replicas are configured properly, along with sane default mappings. The latter can be used to explicitly map certain fields.

For example, we might want the suricata `src_ip` and `dest_ip` fields to be mapped as the `ip` datatype to enable subnet queries, and the `payload` field to be mapped as `binary` for more efficient storage. By default they would be mapped as strings. However, we do not want the same logic applied to other indices, like `windows-*`, as they may have the same fields with conflicting types.

In [94]:
SURICATA_TEMPLATE = {
  "order": 10,
  "version": 0,
  "index_patterns": [
    "suricata-*",
    "logstash-*"
  ],
  "mappings":{
    "properties": {
      "src_ip": { 
        "type": "ip",
        "fields": {
          "keyword" : { "type": "keyword", "ignore_above": 256 }
        }
      },
      "dest_ip": { 
        "type": "ip",
        "fields": {
          "keyword" : { "type": "keyword", "ignore_above": 256 }
        }
      },
      "payload": { "type": "binary" }
    }
  }
}

Note that the IP fields have a dual mapping. In other words, this template also creates `src_ip.keyword` and `dest_ip.keyword` fields. The reason is simple: some frontend tools assume the default Logstash mappings, where everything is a string, and thus execute string aggregations against `.keyword` fields. This template should give the best of both worlds. The raw field has the correct IP mapping, which means more efficient storage and enables subnet queries. The dual mapping ensures that regular `term` aggregations still work and stay compatible with tools like `evebox` and `scirius`. The `payload` field is a base64-encoded blob, so there is no point in tokenizing it.

In [97]:
tpl = SURICATA_TEMPLATE
# Use a separate name; reusing "default" would overwrite the base template.
resp = es.indices.put_template("suricata", body=tpl)
print(resp)

{'acknowledged': True}
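
To make the layering concrete, the sketch below models how templates with different `order` values combine into the configuration of a new index. It is a simplified local model, assuming plain dictionary merging where the higher `order` wins on conflicts. Elasticsearch performs the real merge internally, and the function names and shortened templates here are hypothetical.

```python
import copy

def deep_merge(base, override):
    """Return a copy of base with override merged in; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

def effective_template(templates):
    """Merge templates in ascending order, so higher orders override lower ones."""
    result = {}
    for tpl in sorted(templates, key=lambda t: t.get("order", 0)):
        body = {k: v for k, v in tpl.items() if k in ("settings", "mappings")}
        result = deep_merge(result, body)
    return result

# Shortened stand-ins for a base and a more specific template.
base = {
    "order": 0,
    "settings": {"index": {"number_of_shards": 3, "number_of_replicas": 0}},
    "mappings": {"properties": {"ip": {"type": "ip"}}},
}
specific = {
    "order": 10,
    "mappings": {"properties": {"src_ip": {"type": "ip"},
                                "payload": {"type": "binary"}}},
}
merged = effective_template([specific, base])
print(sorted(merged["mappings"]["properties"]))
print(merged["settings"]["index"]["number_of_shards"])
```

The merged result keeps the shard settings from the base template and gains the explicit field mappings from the more specific one.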
