## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.populate()
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

14 items created


### Stop Words

#### Stopwords and the Standard Analyzer

To use custom stopwords in conjunction with the standard analyzer, all we need to do is to create a configured version of the analyzer and pass in the list of stopwords that we require:

In [3]:
settings = {
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 
          "type": "standard", 
          "stopwords": [ "and", "the" ] 
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [4]:
# test with the standard analyzer
text = "The quick and the dead."
response = es.indices.analyze(index='my_index', analyzer='standard', text=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

the,quick,and,the,dead


In [5]:
# test with my_analyzer
text = "The quick and the dead."
response = es.indices.analyze(index='my_index', analyzer='my_analyzer', text=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

quick,dead


In [6]:
# Note that the word positions (quick - pos 1, dead - pos 4) have been maintained:
es.indices.analyze(index='my_index', analyzer='my_analyzer', text=text)

{'tokens': [{'end_offset': 9,
   'position': 1,
   'start_offset': 4,
   'token': 'quick',
   'type': '<ALPHANUM>'},
  {'end_offset': 22,
   'position': 4,
   'start_offset': 18,
   'token': 'dead',
   'type': '<ALPHANUM>'}]}

Preserving word positions, as in the example above, is important for phrase queries: if the positions of the remaining terms had been renumbered, a phrase query for `quick dead` would have incorrectly matched the preceding example.
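To make the point about positions concrete, here is a minimal pure-Python sketch. It is not how Elasticsearch implements analysis or phrase matching; the `analyze` and `phrase_matches` helpers are hypothetical names used only for illustration.

```python
import re

def analyze(text, stopwords):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stopwords,
    but keep each surviving token's original position."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in stopwords]

def phrase_matches(analyzed, phrase_terms):
    """True if the phrase terms occur at consecutive positions."""
    positions = {}
    for pos, tok in analyzed:
        positions.setdefault(tok, set()).add(pos)
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(phrase_terms))
        for start in positions.get(phrase_terms[0], ())
    )

doc = analyze("The quick and the dead.", {"and", "the"})
print(doc)                                      # [(1, 'quick'), (4, 'dead')]
print(phrase_matches(doc, ["quick", "dead"]))   # False: positions 1 and 4 are not adjacent

# If positions were renumbered after stopword removal, the phrase would match.
renumbered = [(i, tok) for i, (_, tok) in enumerate(doc)]
print(phrase_matches(renumbered, ["quick", "dead"]))  # True
```

The gap left by the removed stopwords is what keeps `quick dead` from matching as a phrase.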

Note that the ```stopwords``` field accepts a range of settings:

##### Array of stop words

> ```"stopwords": [ "and", "the" ]```

##### Default language stopwords

> ```"stopwords": "_english_"```

##### No stopwords

> ```"stopwords": "_none_"```

The default stopwords for `_english_`:

```a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with```
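The three forms of the `stopwords` setting can be produced by one small helper. This is only a sketch: `stopword_settings` is a hypothetical function, not part of any Elasticsearch library, and it simply builds the same settings body used in the cells above.

```python
def stopword_settings(analyzer_name, stopwords, analyzer_type="standard"):
    """Build an index-settings body for an analyzer with the given stopwords.

    `stopwords` may be a list of words or a predefined name such as
    "_english_" or "_none_".
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {"type": analyzer_type, "stopwords": stopwords}
                }
            }
        }
    }

custom = stopword_settings("my_analyzer", ["and", "the"])   # explicit list
english = stopword_settings("my_analyzer", "_english_")     # default English list
disabled = stopword_settings("my_analyzer", "_none_")       # no stopwords

print(custom["settings"]["analysis"]["analyzer"]["my_analyzer"])
```

Any of these bodies could then be passed to `index.create_my_index(body=...)`, as in the earlier cell.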

Stopwords can also be placed in a file (by default under `config/stopwords`).
I placed a file at `<es-home>/config/stopwords/english.txt` with one stopword per line:
```
a
the
dead
```
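Writing such a file from Python might look like the sketch below. The `write_stopwords_file` helper is hypothetical, and a temporary directory stands in for the real Elasticsearch `config` directory, whose location depends on your installation.

```python
import os
import tempfile

def write_stopwords_file(config_dir, relative_path, words):
    """Write one stopword per line and return the full path of the file."""
    path = os.path.join(config_dir, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")
    return path

# Assumption: a temporary directory stands in for <es-home>/config.
config_dir = tempfile.mkdtemp()
path = write_stopwords_file(
    config_dir, os.path.join("stopwords", "english.txt"), ["a", "the", "dead"]
)

with open(path, encoding="utf-8") as f:
    print(f.read().splitlines())  # ['a', 'the', 'dead']
```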

In [7]:
settings={
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" 
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [8]:
# test with my_english
text = "The quick and the dead is a good film."
response = es.indices.analyze(index='my_index', analyzer='my_english', text=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

quick,and,is,good,film


Updating stopwords is easier if you specify them in a file with the `stopwords_path` parameter. You can just update the file (on every node in the cluster) and then force the analyzers to be re-created by either of these actions:

* Closing and reopening the index (see open/close index), or
* Restarting each node in the cluster, one by one

Of course, updating the stopwords list will not change any documents that have already been indexed. It will apply only to searches and to new or updated documents. To apply the changes to existing documents, you will need to reindex your data. See [Reindexing Your Data](https://www.elastic.co/guide/en/elasticsearch/guide/master/reindex.html).
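Closing and reopening an index could be wrapped as below. This is a sketch that assumes the `indices.close` and `indices.open` calls of the elasticsearch-py client; the `reopen_index` helper and the recording stand-in client are hypothetical and exist only so the snippet runs without a cluster.

```python
def reopen_index(client, index_name):
    """Close and reopen an index so that its analyzers are re-created."""
    client.indices.close(index=index_name)
    client.indices.open(index=index_name)

# Hypothetical stand-in for an Elasticsearch client: it only records calls.
class RecordingIndices:
    def __init__(self):
        self.calls = []

    def close(self, index):
        self.calls.append(("close", index))

    def open(self, index):
        self.calls.append(("open", index))

class RecordingClient:
    def __init__(self):
        self.indices = RecordingIndices()

fake = RecordingClient()
reopen_index(fake, "my_index")
print(fake.indices.calls)  # [('close', 'my_index'), ('open', 'my_index')]
```

With a live cluster, the call would be `reopen_index(es, 'my_index')`. Keep in mind that a closed index cannot be searched or written to until it is reopened.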

#### and Operator

A search for "quick and the dead" is really the search:

>`quick OR and OR the OR dead`

This is problematic because every doc containing any of those words needs to be included in the calculation of relevance (`_score`).

A more precise search might be:

>`quick AND and AND the AND dead`

```
{
    "match": {
        "title": {
            "query":    "quick and the dead",
            "operator": "and"
        }
    }
}
```
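The difference between the two operators can be illustrated without a cluster. The sketch below is a toy model over two sample titles, not Elasticsearch's matching or scoring; `match` and `tokens` are hypothetical helpers.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match(docs, query, operator="or"):
    """Return ids of docs containing any ("or") or all ("and") query terms."""
    terms = tokens(query)
    combine = all if operator == "and" else any
    return [
        doc_id
        for doc_id, text in docs.items()
        if combine(term in tokens(text) for term in terms)
    ]

docs = {
    1: "The quick and the dead is a good film.",
    2: "The quick and the alive is a bad film.",
}

print(match(docs, "quick and the dead"))         # [1, 2]
print(match(docs, "quick and the dead", "and"))  # [1]
```

With `or`, every document sharing even one term is a candidate; with `and`, only documents containing all terms remain.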



In [9]:
body = { 'title': 'The quick and the dead is a good film.'}
es.create(index='my_index', doc_type='test', body=body, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [10]:
body = { 'title': 'The quick and the alive is a bad film.'}
es.create(index='my_index', doc_type='test', body=body, id=2)

{'_id': '2',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [11]:
s = Search(using=es)
s = s.query('match', title='the quick and the dead')
s.execute()

<Response: [<Hit(my_index/test/1): {'title': 'The quick and the dead is a good film.'}>, <Hit(my_index/test/2): {'title': 'The quick and the alive is a bad film.'}>]>

In [12]:
q = {
    "match": {
        "title": {
            "query":    "quick and the dead",
            "operator": "and"
        }
    }
}

In [13]:
q = Q(q)
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'title': 'The quick and the dead is a good film.'}>]>

In [14]:
# Same as this:
q = (Q('term', title='quick') & Q('term', title='and')
     & Q('term', title='the')
     & Q('term', title='dead'))
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'title': 'The quick and the dead is a good film.'}>]>

In [15]:
# Same as this:
q = Q('bool',
      must=[Q('term', title='quick'), Q('term', title='and'),
            Q('term', title='the'), Q('term', title='dead')])
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'title': 'The quick and the dead is a good film.'}>]>

Or we can narrow the results by the `minimum_should_match` parameter:

In [16]:
q = {
    "match": {
        "title": {
            "query":    "quick dead good film",
            "minimum_should_match": "100%"
        }
    }
}

In [17]:
# 100% of terms should match - all 4 of them
q = Q(q)
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'title': 'The quick and the dead is a good film.'}>]>

In [18]:
# 75% of terms should match - any 3 of them
q = {
    "match": {
        "title": {
            "query":    "quick dead good film",
            "minimum_should_match": "75%"
        }
    }
}
q = Q(q)
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'title': 'The quick and the dead is a good film.'}>]>

In [19]:
# 50% of terms should match - any 2 of them
q = {
    "match": {
        "title": {
            "query":    "quick dead alive film",
            "minimum_should_match": "50%"
        }
    }
}
q = Q(q)
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/2): {'title': 'The quick and the alive is a bad film.'}>, <Hit(my_index/test/1): {'title': 'The quick and the dead is a good film.'}>]>

This offers a huge performance gain over a simple query with the default `or` operator! But we can do better yet...
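The arithmetic behind a percentage `minimum_should_match` can be sketched in plain Python: the percentage of the term count is rounded down to get the required number of matching terms. This is a toy model with hypothetical helper names, assuming at least one term must always match.

```python
import re

def terms_of(text):
    """Lowercase the text and return its alphanumeric tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def required_matches(n_terms, percent):
    """Number of terms that must match: the percentage, rounded down.
    Assumption in this sketch: at least one term is always required."""
    return max(1, n_terms * percent // 100)

def match_min(docs, query, percent):
    """Return ids of docs matching at least the required number of terms."""
    terms = terms_of(query)
    need = required_matches(len(terms), percent)
    hits = []
    for doc_id, text in docs.items():
        doc_terms = set(terms_of(text))
        if sum(term in doc_terms for term in terms) >= need:
            hits.append(doc_id)
    return hits

docs = {
    1: "The quick and the dead is a good film.",
    2: "The quick and the alive is a bad film.",
}

print(required_matches(4, 75))                        # 3
print(match_min(docs, "quick dead good film", 100))   # [1]
print(match_min(docs, "quick dead good film", 75))    # [1]
print(match_min(docs, "quick dead alive film", 50))   # [1, 2]
```

The matching documents agree with the query results shown in the cells above, although the order of hits there depends on scoring.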

#### Divide and Conquer

The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible.

The `match` query accepts a `cutoff_frequency` parameter, which allows it to divide the terms in the query string into a low-frequency and a high-frequency group.
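The idea of splitting terms by frequency can be sketched as follows. This is a simplified illustration, not Elasticsearch's implementation: `split_by_frequency` is a hypothetical helper, it treats the cutoff as a fraction of documents, and it computes frequencies over a tiny in-memory corpus.

```python
import re

def split_by_frequency(docs, query, cutoff_frequency):
    """Split query terms into (low_freq, high_freq) groups.

    A term is high-frequency when the fraction of documents containing it
    is greater than `cutoff_frequency`.
    """
    doc_tokens = [set(re.findall(r"[a-z0-9]+", d.lower())) for d in docs]
    low, high = [], []
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        doc_fraction = sum(term in toks for toks in doc_tokens) / len(doc_tokens)
        (high if doc_fraction > cutoff_frequency else low).append(term)
    return low, high

docs = [
    "The quick and the dead is a good film.",
    "The quick and the alive is a bad film.",
]

low, high = split_by_frequency(docs, "quick dead alive film", 0.5)
print(low)   # ['dead', 'alive']
print(high)  # ['quick', 'film']
```

Here "quick" and "film" appear in every document, so they carry little information, while "dead" and "alive" are the terms that actually distinguish the two films.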

First, consider the scores for the earlier 50% `minimum_should_match` query:

In [20]:
# 50% of terms should match - any 2 of them
q = {
    "match": {
        "title": {
            "query":    "quick dead alive film",
            "minimum_should_match": "50%"
        }
    }
}
q = Q(q)
s = Search(using=es).query(q)
res = s.execute()
for hit in res:
    print(hit.title, hit.meta.score, hit.meta.id)

The quick and the alive is a bad film. 0.8169974 2
The quick and the dead is a good film. 0.8169974 1


As expected, these have identical scores, because each document matches three of the four query terms:

* doc id=1 matches quick, dead, and film
* doc id=2 matches quick, alive, and film

Let's consider a strategy that splits the terms into a required "low-frequency" group (`must`) and an optional "high-frequency" group (`should`):

In [21]:
q = {
  "bool": {
    "must": { 
      "bool": {
        "should": [
          { "term": { "title": "alive" }},
          { "term": { "title": "dead"  }},
        ]
      }
    },
    "should": { 
      "bool": {
        "should": [
          { "term": { "title": "film" }},
          { "term": { "title": "good" }}
        ]
      }
    }
  }
}

In [22]:
q = Q(q)
s = Search(using=es).query(q)
res = s.execute()
for hit in res:
    print(hit.title, hit.meta.score, hit.meta.id)

The quick and the dead is a good film. 0.8169974 1
The quick and the alive is a bad film. 0.5446649 2


Note that the film with id=1 now wins (higher relevance `_score`) because:

1. It contains one of the mandatory terms, "dead" (every hit must contain "alive" or "dead")
2. It also contains **two** of the optional terms in the `should` clause: "film" and "good"

The film with id=2 contains the mandatory term "alive" but only **one** of the optional terms: "film".
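The query body used above can be generated from two term lists. The helper below is a hypothetical sketch (`split_bool_query` is not part of `elasticsearch_dsl`); it only builds the same dictionary structure as the hand-written query.

```python
def split_bool_query(field, required_terms, optional_terms):
    """Build a bool query: at least one required term must match,
    and each optional term that matches raises the score."""
    def any_of(terms):
        return {"bool": {"should": [{"term": {field: t}} for t in terms]}}

    return {
        "bool": {
            "must": any_of(required_terms),
            "should": any_of(optional_terms),
        }
    }

q = split_bool_query("title", ["alive", "dead"], ["film", "good"])
print(q["bool"]["must"])
# {'bool': {'should': [{'term': {'title': 'alive'}}, {'term': {'title': 'dead'}}]}}
```

The resulting dictionary can be wrapped with `Q(q)` and passed to `Search(using=es).query(...)` exactly as in the cell above.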

#### Stopwords and Phrase Queries

** TO BE COMPLETED **