## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

### Mapping and Analysis

> GET /gb/_mapping/tweet

In [2]:
res = es.indices.get_mapping(index='gb', doc_type='tweet')
pprint(res)

{'gb': {'mappings': {'tweet': {'properties': {'date': {'type': 'date'},
                                              'name': {'fields': {'keyword': {'ignore_above': 256,
                                                                              'type': 'keyword'}},
                                                       'type': 'text'},
                                              'tweet': {'fields': {'keyword': {'ignore_above': 256,
                                                                               'type': 'keyword'}},
                                                        'type': 'text'},
                                              'user_id': {'type': 'long'}}}}}}


The next two searches return the same results because the analyzer has tokenized and normalized the string `2014-09-15` into the terms `2014`, `09`, and `15`.

> GET /_search?q=2014-09-15        # 12 results !

In [3]:
res = es.search(q='2014-09-15')
print(res['hits']['total'])

12


This search runs against the `_all` meta field, so a hit is registered for every tweet in which any of these terms appears, whichever field it came from.

In [4]:
s = Search(using=es) \
    .query('match', _all='2014-09-15')
response = s.execute()
print('Total hits:{}\n'.format(response['hits']['total']))

Total hits:12
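To see why a search for `2014-09-15` on an analyzed string field matches so widely, it can help to mimic the tokenization in plain Python. The sketch below is only a crude stand-in for the standard analyzer (the real one follows Unicode text segmentation rules), and `rough_tokens` is a hypothetical helper, not part of either client library.

```python
import re

def rough_tokens(text):
    """Lowercase the text and split it on anything that is not a letter, digit, or underscore."""
    return re.findall(r'\w+', text.lower())

# The date string falls apart into three separate terms.
print(rough_tokens('2014-09-15'))
```

Any document containing one of those three terms can match, which is consistent with the broad hit count above.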



If we change the field explicitly to date, then only the one tweet (with that date) is returned:

In [5]:
s = Search(using=es) \
    .query('match', date='2014-09-15')
response = s.execute()
print('Total hits:{}\n'.format(response['hits']['total']))

Total hits:1



#### Testing Analyzers

We can test analyzers using the _analyze API:
```
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
```

In [6]:
text = "Text to analyze"
res = es.indices.analyze(analyzer='standard', body=text)
pprint(res)

{'tokens': [{'end_offset': 4,
             'position': 0,
             'start_offset': 0,
             'token': 'text',
             'type': '<ALPHANUM>'},
            {'end_offset': 7,
             'position': 1,
             'start_offset': 5,
             'token': 'to',
             'type': '<ALPHANUM>'},
            {'end_offset': 15,
             'position': 2,
             'start_offset': 8,
             'token': 'analyze',
             'type': '<ALPHANUM>'}]}


Note that the token is the actual term that will be stored in the inverted index:

In [7]:
for token in res['tokens']:
    print(token['token'])

text
to
analyze
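As a rough illustration of what "stored in the inverted index" means, the following sketch builds a toy term-to-documents map in plain Python. It assumes a simplified tokenizer (`rough_tokens`) and two made-up documents; Lucene's real index structure is far more elaborate.

```python
import re
from collections import defaultdict

def rough_tokens(text):
    """Crude approximation of an analyzer: lowercase, then split on non-word characters."""
    return re.findall(r'\w+', text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in rough_tokens(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents, keyed by id.
docs = {1: 'Text to analyze', 2: 'More text'}
inverted = build_inverted_index(docs)
print(inverted['text'])
```

Looking up a term returns the documents containing it, which is why the stored tokens, not the original text, decide what a search can find.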


#### Built-in Analyzers (Examples)

In [8]:
text = "I want to buy an i-pad and use it to purchase some socks on e-bay"

In [9]:
#standard
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='standard', body=text)['tokens']]
print(','.join(analyzed_text))

i,want,to,buy,an,i,pad,and,use,it,to,purchase,some,socks,on,e,bay


In [10]:
#simple
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='simple', body=text)['tokens']]
print(','.join(analyzed_text))

i,want,to,buy,an,i,pad,and,use,it,to,purchase,some,socks,on,e,bay


In [11]:
#whitespace
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='whitespace', body=text)['tokens']]
print(','.join(analyzed_text))

I,want,to,buy,an,i-pad,and,use,it,to,purchase,some,socks,on,e-bay


In [12]:
#english (language)
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='english', body=text)['tokens']]
print(','.join(analyzed_text))

i,want,bui,i,pad,us,purchas,some,sock,e,bai
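The difference between the `whitespace` and `simple` analyzers can be approximated without a cluster. This is a sketch under assumptions: `whitespace_tokens` and `simple_tokens` are hypothetical helpers that imitate the documented behaviour (split on whitespace; split on non-letters and lowercase), not calls into Elasticsearch.

```python
import re

text = "I want to buy an i-pad and use it to purchase some socks on e-bay"

def whitespace_tokens(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_tokens(text):
    """Lowercase, then keep runs of letters only ([^\\W\\d_] matches a single letter)."""
    return re.findall(r'[^\W\d_]+', text.lower())

print(','.join(whitespace_tokens(text)))
print(','.join(simple_tokens(text)))
```

For this sentence the two lines printed match the recorded `whitespace` and `simple` outputs above; on other inputs the approximations may diverge from the real analyzers.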


### Mapping

Elasticsearch supports the following simple field types:

* String: string
* Whole number: byte, short, integer, long
* Floating-point: float, double
* Boolean: boolean
* Date: date

> GET /gb/_mapping/tweet

In [17]:
es.indices.get_mapping(index='gb', doc_type='tweet')

{'gb': {'mappings': {'tweet': {'properties': {'date': {'type': 'date'},
     'name': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
      'type': 'text'},
     'tweet': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
      'type': 'text'},
     'user_id': {'type': 'long'}}}}}}

Note that these results differ from the book because here we are using Elasticsearch 5.x, where the core datatypes [are different](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html): for example, `string` has been replaced by `text` and `keyword`.
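When the mapping response gets long, a small helper can reduce it to a field-to-type summary. The sketch below works on a copy of the response shown above; `field_types` is a hypothetical helper, and treating a field without a `type` key as an object is an assumption about how object fields appear in the response.

```python
def field_types(mapping_response, index, doc_type):
    """Return {field name: type} for one index and type in a get_mapping response."""
    props = mapping_response[index]['mappings'][doc_type]['properties']
    # Assumption: fields with no explicit 'type' are inner objects.
    return {name: spec.get('type', 'object') for name, spec in props.items()}

# Copy of the response printed above.
res = {'gb': {'mappings': {'tweet': {'properties': {
    'date': {'type': 'date'},
    'name': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
             'type': 'text'},
    'tweet': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
              'type': 'text'},
    'user_id': {'type': 'long'}}}}}}

print(field_types(res, 'gb', 'tweet'))
```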

Let's delete the 'gb' index to experiment with different mappings.

> DELETE /gb

In [18]:
es.indices.delete(index='gb')

{'acknowledged': True}

Let's create the index again but with different field mappings. In particular, per the example, we will use a different analyzer ("english") for the 'tweet' mapping.

In [20]:
index_template = {
  "mappings": {
    "tweet" : {
      "properties" : {
        "tweet" : {
          "type" :    "text",
          "analyzer": "english"
        },
        "date" : {
          "type" :   "date"
        },
        "name" : {
          "type" :   "text"
        },
        "user_id" : {
          "type" :   "long"
        }
      }
    }
  }
}

In [21]:
es.indices.create(index='gb', body=index_template)

{'acknowledged': True, 'shards_acknowledged': True}

Later on, we decide to add a new not_analyzed text field called tag to the tweet mapping, using the _mapping endpoint:

>PUT /gb/_mapping/tweet
{
  "properties" : {
    "tag" : {
      "type" :    "string",
      "index":    "not_analyzed"
    }
  }
}

In [22]:
mapping_template = {
  "properties" : {
    "tag" : {
      "type" :    "text",
      "index":    "not_analyzed"
    }
  }
}
es.indices.put_mapping(index='gb', doc_type='tweet', body=mapping_template)

{'acknowledged': True}

#### Testing the Mapping

You can use the analyze API to test the mapping for string fields by name. Compare the output of these two requests:

>GET /gb/_analyze
{
  "field": "tweet",
  "text": "Black-cats" 
}

>GET /gb/_analyze
{
  "field": "tag",
  "text": "Black-cats" 
}

In [23]:
test_tweet = {
  "field": "tweet",
  "text": "Black-cats" 
}

test_tag = {
  "field": "tag",
  "text": "Black-cats" 
}

In [25]:
es.indices.analyze(index='gb', body=test_tweet)

{'tokens': [{'end_offset': 5,
   'position': 0,
   'start_offset': 0,
   'token': 'black',
   'type': '<ALPHANUM>'},
  {'end_offset': 10,
   'position': 1,
   'start_offset': 6,
   'token': 'cat',
   'type': '<ALPHANUM>'}]}

In [26]:
es.indices.analyze(index='gb', body=test_tag)

{'tokens': [{'end_offset': 5,
   'position': 0,
   'start_offset': 0,
   'token': 'black',
   'type': '<ALPHANUM>'},
  {'end_offset': 10,
   'position': 1,
   'start_offset': 6,
   'token': 'cats',
   'type': '<ALPHANUM>'}]}

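In the book, the tweet field produces the two terms black and cat, while the tag field produces the single term Black-cats. The output above matches the book for the tweet field only: the tag field produces the two terms black and cats, because it was mapped as "text" in In [22] and is therefore still analyzed under Elasticsearch 5.x. In 5.x the counterpart of a not_analyzed string is the "keyword" type.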
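In Elasticsearch 5.x the `keyword` type takes over the role of a `not_analyzed` string. A minimal sketch of such a mapping follows, assuming the same `gb` index and `tweet` type as above; `keyword_mapping` and `put_tag_mapping` are hypothetical names, and the function reuses the `put_mapping` call from In [22].

```python
# Mapping body for an exact-value (non-analyzed) tag field in Elasticsearch 5.x.
keyword_mapping = {
    "properties": {
        "tag": {
            "type": "keyword"
        }
    }
}

def put_tag_mapping(es, index='gb', doc_type='tweet'):
    """Send the keyword mapping with the same put_mapping call used in In [22]."""
    return es.indices.put_mapping(index=index, doc_type=doc_type, body=keyword_mapping)
```

Because the type of an existing field cannot be changed in place, this would need a freshly created index (or a different field name) rather than the `gb` index as it stands after In [22].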

### Complex Core Field Types

#### Multivalue Fields

It is quite possible that we want our tag field to contain more than one tag. Instead of a single string, we could index an array of tags:

> ```{ "tag": [ "search", "nosql" ]}```

There is no special mapping required for arrays. Any field can contain zero, one, or more values, in the same way as a full-text field is analyzed to produce multiple terms.
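One way to picture "zero, one, or more values" is to normalize whatever a field holds into a flat list of values. The helper below is purely illustrative (`field_values` is a hypothetical name, and it ignores nested arrays); it is not how Elasticsearch is implemented.

```python
def field_values(value):
    """Return the list of non-null values a field contributes."""
    if value is None:
        return []                                   # null: no values
    if isinstance(value, (list, tuple)):
        return [v for v in value if v is not None]  # arrays: drop nulls
    return [value]                                  # single value: a list of one

print(field_values(["search", "nosql"]))
print(field_values("search"))
print(field_values([None]))
```

A single tag and an array of tags end up in the same shape, and a null or an array of nulls contributes nothing.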

Empty fields are also allowed:

>"null_value":               null,

>"empty_array":              [],

>"array_with_null_value":    [ null ]

In [29]:
# Objects can be nested and accept nested mappings:
mapping = {
  "gb": {
    "tweet": { 
      "properties": {
        "tweet":            { "type": "string" },
        "user": { 
          "type":             "object",
          "properties": {
            "id":           { "type": "string" },
            "gender":       { "type": "string" },
            "age":          { "type": "long"   },
            "name":   { 
              "type":         "object",
              "properties": {
                "full":     { "type": "string" },
                "first":    { "type": "string" },
                "last":     { "type": "string" }
              }
            }
          }
        }
      }
    }
  }
}

#### How Inner Objects are Indexed

Lucene (Elasticsearch's core library) doesn’t understand inner objects. A Lucene document consists of a flat list of key-value pairs. In order for Elasticsearch to index inner objects usefully, it converts our document into something like this:

In [32]:
'''
data = {
    "tweet":            [elasticsearch, flexible, very],
    "user.id":          [@johnsmith],
    "user.gender":      [male],
    "user.age":         [26],
    "user.name.full":   [john, smith],
    "user.name.first":  [john],
    "user.name.last":   [smith]
}
'''

'\ndata = {\n    "tweet":            [elasticsearch, flexible, very],\n    "user.id":          [@johnsmith],\n    "user.gender":      [male],\n    "user.age":         [26],\n    "user.name.full":   [john, smith],\n    "user.name.first":  [john],\n    "user.name.last":   [smith]\n}\n'
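The key flattening described above can be sketched in a few lines of Python. This only shows how nested keys turn into dotted paths; it does not analyze the values into terms, and both `flatten` and the sample document are assumptions made for illustration.

```python
def flatten(doc, prefix=''):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, path + '.'))
        else:
            flat[path] = value
    return flat

# Hypothetical document shaped like the mapping above.
doc = {
    "tweet": "Elasticsearch is very flexible",
    "user": {
        "id": "@johnsmith",
        "gender": "male",
        "age": 26,
        "name": {"full": "John Smith", "first": "John", "last": "Smith"},
    },
}

for path, value in sorted(flatten(doc).items()):
    print(path, '=', value)
```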