## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [3]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [4]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb
r = index.populate()
print('{} items created'.format(len(r['items'])))

14 items created


### Stop Words

#### Stopwords and the Standard Analyzer

To use custom stopwords with the standard analyzer, create a configured version of the analyzer and pass in the list of stopwords that we require:

In [5]:
settings = {
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 
          "type": "standard", 
          "stopwords": [ "and", "the" ] 
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [11]:
# test with the built-in standard analyzer (no stopwords removed)
text = "The quick and the dead." 
analyzed_text = [x['token'] for x in es.indices.analyze(
    index='my_index', analyzer='standard', text=text)['tokens']]
print(','.join(analyzed_text))

the,quick,and,the,dead


In [12]:
# test with my_analyzer
text = "The quick and the dead." 
analyzed_text = [x['token'] for x in es.indices.analyze(
    index='my_index', analyzer='my_analyzer', text=text)['tokens']]
print(','.join(analyzed_text))

quick,dead


In [13]:
# Note that the word positions (quick - pos 1, dead - pos 4) have been maintained:
es.indices.analyze(index='my_index', analyzer='my_analyzer', text=text)

{'tokens': [{'end_offset': 9,
   'position': 1,
   'start_offset': 4,
   'token': 'quick',
   'type': '<ALPHANUM>'},
  {'end_offset': 22,
   'position': 4,
   'start_offset': 18,
   'token': 'dead',
   'type': '<ALPHANUM>'}]}

Preserving word positions, as in the example above, is important for phrase queries. If the positions had been renumbered after the stopwords were removed, `quick` and `dead` would appear adjacent, and a phrase query for `quick dead` would incorrectly match this text.
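To make the point about positions concrete, here is a minimal pure-Python sketch. It does not call Elasticsearch: `analyze` and `phrase_matches` are hypothetical helpers written for this illustration, and the matching logic is a deliberate simplification of how a phrase query compares term positions.

```python
import re

def analyze(text, stopwords=()):
    """Toy analyzer: lowercase, split into words, drop stopwords,
    but keep each surviving token's original position."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        if match.group() not in stopwords:
            tokens.append((match.group(), position))
    return tokens

def phrase_matches(tokens, phrase_terms):
    """Simplified phrase check: the terms must sit at consecutive positions."""
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(phrase_terms))
        for start in positions.get(phrase_terms[0], ())
    )

# Positions preserved, as in the analyze output above: quick=1, dead=4
tokens = analyze("The quick and the dead.", stopwords={"and", "the"})
print(tokens)
print(phrase_matches(tokens, ["quick", "dead"]))

# If positions were renumbered after stopword removal, the phrase would match
renumbered = [(term, i) for i, (term, _) in enumerate(tokens)]
print(phrase_matches(renumbered, ["quick", "dead"]))
```

With the original positions kept, `quick` and `dead` are three positions apart, so the toy phrase check fails; after renumbering they become neighbours and the check passes, which is the false match the preserved positions prevent.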

Note that the ```stopwords``` setting accepts several kinds of value:

##### Array of stop words

> ```"stopwords": [ "and", "the" ]```

##### Default language stopwords

> ```"stopwords": "_english_"```

##### No stopwords

> ```"stopwords": "_none_"```

The default stopwords for ```_english_```:

```a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with```
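The three forms of the ```stopwords``` setting listed above can all be plugged into the same settings body. The sketch below only builds the dictionaries; `standard_analyzer_settings` is a hypothetical helper introduced for this illustration, not part of the Elasticsearch client.

```python
def standard_analyzer_settings(name, stopwords):
    """Build an index-settings body that configures a standard analyzer
    called `name` with the given `stopwords` value."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    name: {"type": "standard", "stopwords": stopwords}
                }
            }
        }
    }

# One example per form of the stopwords setting
variants = {
    "explicit list": ["and", "the"],
    "language default": "_english_",
    "disabled": "_none_",
}

for label, value in variants.items():
    body = standard_analyzer_settings("my_analyzer", value)
    print(label, "->", body["settings"]["analysis"]["analyzer"]["my_analyzer"])
```

Each resulting body has the same shape as the `settings` dictionary passed to `index.create_my_index(body=settings)` earlier in this notebook.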

Stopwords can also be kept in a file, referenced through the `stopwords_path` setting. The path is resolved relative to the Elasticsearch config directory.
I placed a file `<es-home>/config/stopwords/english.txt` with contents:
```
a
the
dead
```
i.e. one stopword per line.
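As a rough sketch, such a file could be generated from Python rather than written by hand. The temporary directory below is an assumption standing in for the real Elasticsearch config directory, which depends on your installation. The snippet writes one stopword per line and reads the file back to confirm the format.

```python
import os
import tempfile

stopwords = ["a", "the", "dead"]

# Assumption: a temporary directory stands in for <es-home>/config
config_dir = tempfile.mkdtemp()
path = os.path.join(config_dir, "stopwords", "english.txt")
os.makedirs(os.path.dirname(path))

# Write one stopword per line
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(stopwords) + "\n")

# Read the file back, skipping blank lines
with open(path, encoding="utf-8") as f:
    loaded = [line.strip() for line in f if line.strip()]

print(loaded)
```

For a real cluster the file would need to be written under the config directory of the Elasticsearch installation instead of a temporary location.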

In [21]:
settings={
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" 
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [22]:
# test with my_english, which reads its stopwords (a, the, dead) from the file
text = "The quick and the dead is a good film." 
analyzed_text = [x['token'] for x in es.indices.analyze(
    index='my_index', analyzer='my_english', text=text)['tokens']]
print(','.join(analyzed_text))

quick,and,is,good,film
