# Elasticsearch Demo

*by Damian Trilling*

http://www.damiantrilling.net

I created this demo on a pristine Lubuntu 18.04 installation in VirtualBox.

The only steps I took were:


1. Install git and clone my teaching repo (to make this doc)
```
sudo apt-get install git
git clone https://github.com/damian0604/bdaca
```

2. Install Java (a prerequisite for elasticsearch):
```
sudo apt-get install openjdk-8-jre-headless
```

3. Install Elasticsearch by copy-pasting the commands from https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html

4. Start Elasticsearch with
```
sudo service elasticsearch start
```

5. Install Jupyter Notebook and the Python wrapper for Elasticsearch

```
sudo apt-get install jupyter-notebook
sudo pip3 install elasticsearch
```

6. Start Jupyter Notebook with ```jupyter-notebook```.

## Interacting with Elasticsearch

You can interact with Elasticsearch via curl, which is handy for diagnosis and for some ad-hoc repairs, but we're not masochistic and will use Python instead. Because ES communicates with you through JSON objects, you can just throw Python dicts at it and receive Python dicts back. That's cool!
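To make the dict/JSON correspondence a bit more concrete, here is a minimal sketch that uses only the standard library. The document `doc` is a made-up example; the point is merely that a Python dict and the JSON text that travels over HTTP are two views of the same thing.

```python
import json

# A hypothetical document, written as a plain Python dict
doc = {'title': 'Some title', 'shared': 22, 'is_paywalled': False}

# Roughly what ends up in the HTTP request body: the dict as JSON text
as_json = json.dumps(doc)
print(as_json)

# Parsing JSON text gives us a Python dict again
roundtrip = json.loads(as_json)
print(roundtrip == doc)
```

Note how Python's `False` becomes JSON's `false` and back again; the Python client takes care of this conversion for us.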

In [7]:
from elasticsearch import Elasticsearch
import json

In [3]:
client = Elasticsearch()

In [5]:
client.indices.exists('test')

False

In [32]:
client.indices.create("test")

{'acknowledged': True, 'index': 'test', 'shards_acknowledged': True}

In [31]:
client.indices.delete('test')

{'acknowledged': True}

In [None]:
client.indices.exists('test')

client.indices.create("test")

## Inserting documents

There are helpers for bulk insert, but that's not important for now...

In [79]:
docs = [{'text': 'Some weird text, whatetever',
         'title': 'Some title',
        'shared':22},
        {'text':'Another text, well we know how this looks like',
        'title':'it has a title too',
        'is_paywalled':True},
        {'title':'Shorty',
        'shared':1,
        'source':'http://example.com/whatever'}
        ]

In [80]:
for doc in docs:
    client.index('test',doc_type='article',body=doc)
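For larger collections, the bulk helper mentioned above can replace the loop. The following is only a sketch: `bulk_actions` and `example_docs` are names I made up for illustration, and the `_type` key assumes an ES version that still uses document types (as in this demo). The call to `helpers.bulk` is left as a comment, because it needs a running ES instance.

```python
def bulk_actions(docs, index, doc_type='article'):
    """Wrap plain dicts in the action format that helpers.bulk() accepts."""
    for doc in docs:
        yield {'_index': index, '_type': doc_type, '_source': doc}

# Made-up documents for illustration
example_docs = [{'title': 'First'}, {'title': 'Second', 'shared': 3}]
actions = list(bulk_actions(example_docs, 'test'))
print(actions[0])

# With a running ES instance, one would then call (not executed here):
# from elasticsearch import helpers
# helpers.bulk(client, bulk_actions(example_docs, 'test'))
```

On newer ES versions that no longer have document types, the `_type` key would have to be dropped.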

## Elasticsearch and schemas
As we have seen, we didn't need to specify any schema, but could get started right away. In SQL, we would have to decide on columns and data types in advance, like this:

```
CREATE TABLE articles (
    shared int,
    title varchar(255),
    ...
    ...
    ...
);
```

Elasticsearch does not require this. This is especially nice if newly inserted documents suddenly have additional keys (e.g., the key 'source' in the last doc above).

**How does this work?**

Well, let's try it out:

In [38]:
# this will fail...
newdoc = {'title': 'Tekjkje kklegjekjgekjg', 'shared':'no'}
client.index('test',doc_type='article',body=newdoc)

POST http://localhost:9200/test/article [status:400 request:0.009s]


RequestError: RequestError(400, 'mapper_parsing_exception', 'failed to parse field [shared] of type [long]')

In [40]:
# but this works (b/c '99' can be converted to an int)
newdoc = {'title': 'Tekjkje kklegjekjgekjg', 'shared':'99'}
client.index('test',doc_type='article',body=newdoc)

{'_id': 'GsLoW2cBYLprx5XowGoI',
 '_index': 'test',
 '_primary_term': 1,
 '_seq_no': 0,
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'article',
 '_version': 1,
 'result': 'created'}

What happened?

Well, once a new key is added for the first time, ES determines a data type (string, int, ...). That's less strict than, for instance, in MySQL (you don't need to know the maximum number of characters etc.), and there is a lot of under-the-hood conversion going on (that's why the last example worked) -- but if some key is expected to contain an int, you can't throw some random text in there.
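To check which data type ES picked for each key, one option is to look at the mapping via `client.indices.get_mapping(index='test')`. The sketch below does not talk to ES; it uses a hand-written, simplified stand-in for such a response (the real one contains more detail), and `field_types` is a hypothetical helper. The nesting with a document type assumes an ES version that still has types, as in this demo.

```python
def field_types(mapping_response, index, doc_type):
    """Return {field name: ES data type} from a get_mapping()-style response."""
    properties = mapping_response[index]['mappings'][doc_type]['properties']
    return {field: spec.get('type') for field, spec in properties.items()}

# Simplified, hand-written stand-in for client.indices.get_mapping(index='test')
sample_response = {'test': {'mappings': {'article': {'properties': {
    'shared': {'type': 'long'},
    'title': {'type': 'text'},
    'is_paywalled': {'type': 'boolean'}}}}}}

types = field_types(sample_response, 'test', 'article')
print(types)
```

Here, `shared` shows up as `long`, which is exactly why the string `'no'` was rejected above.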

The upside of this behavior is that it makes it possible to insert new keys that were unknown before - and ES doesn't complain. The downside is that the first inserted doc that has the key 'claims' the data type. (I could tell a long debugging story here...)

We can therefore optionally specify, *when creating the index*, rules on which keys should map to which data type (and analyzer).

In [78]:
schema = json.load(open('schema.json'))
schema

{'mappings': {'doc': {'dynamic_templates': [{'id': {'mapping': {'type': 'keyword'},
      'match': 'id',
      'match_mapping_type': 'string'}},
    {'es': {'mapping': {'analyzer': 'spanish', 'type': 'text'},
      'match': '*_es',
      'match_mapping_type': 'string'}},
    {'nl': {'mapping': {'analyzer': 'dutch', 'type': 'text'},
      'match': '*_nl',
      'match_mapping_type': 'string'}},
    {'raw': {'mapping': {'type': 'text'},
      'match': '*_raw',
      'match_mapping_type': 'string'}},
    {'en': {'mapping': {'analyzer': 'english', 'type': 'text'},
      'match': '*_en',
      'match_mapping_type': 'string'}},
    {'default': {'mapping': {'fields': {'exact': {'analyzer': 'whitespace',
         'type': 'text'}},
       'filter': ['stop'],
       'type': 'text'},
      'match': '*',
      'match_mapping_type': 'string'}}],
   'properties': {'doctype': {'type': 'keyword'},
    'functiontype': {'type': 'keyword'},
    'id': {'type': 'keyword'},
    'source': {'type': 'keyword'}

In [82]:
client.indices.create("test2", body=schema)

{'acknowledged': True, 'index': 'test2', 'shards_acknowledged': True}
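To get a feeling for how the `dynamic_templates` in the schema above are applied, here is a toy model in plain Python. It only imitates the name matching for string values (templates are tried in order, and the first one that matches wins); it ignores `match_mapping_type` and everything else ES does, and it uses `fnmatch` as a stand-in for ES's own wildcard matching, which is an assumption that holds for simple `*` patterns like these.

```python
from fnmatch import fnmatchcase

# (template name, match pattern) pairs, in the order of the schema above
templates = [('id', 'id'), ('es', '*_es'), ('nl', '*_nl'),
             ('raw', '*_raw'), ('en', '*_en'), ('default', '*')]

def first_matching_template(field_name):
    """Return the name of the first template whose pattern fits the field name."""
    for name, pattern in templates:
        if fnmatchcase(field_name, pattern):
            return name
    return None

for field in ['id', 'text_nl', 'title_en', 'title']:
    print(field, '->', first_matching_template(field))
```

So a key called `text_nl` would get the Dutch analyzer, while a plain `title` falls through to the default template.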

Some additional elements of the schema:

- analyzers (for full text queries)
- type 'text' vs type 'keyword' 
- possibility of having multiple fields

The last one is interesting. If we had used the schema above for our test documents, we could, next to searching for some search string in the field `text` (where standard analysis is performed, e.g., lowercasing, or stemming for the language-specific analyzers), also search in the automatically created (!) and hidden field `text.exact`, which uses a minimal analyzer that only splits on whitespace and does little else. That's important for case-sensitive search, for instance.
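A query against such a sub-field could look like the sketch below. `exact_query` is a hypothetical helper, and the example assumes that the documents were indexed into `test2` (the index created with the schema), which this demo does not actually do. The search call is therefore left as a comment.

```python
def exact_query(field, term):
    """Build a match query against the '<field>.exact' sub-field."""
    return {'query': {'match': {field + '.exact': term}}}

exact_es_query = exact_query('text', 'Weird')
print(exact_es_query)

# With a running ES instance and documents in test2 (not executed here):
# client.search(index='test2', body=exact_es_query)
```

Because the whitespace analyzer does not lowercase, `Weird` and `weird` would be treated as different terms in `text.exact`.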

## Querying

I'll be honest - writing ES queries is not really simple. The most user-friendly way is probably the `query string syntax` (see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html), which allows you to formulate your query as a string. That can be easier than constructing nested dicts, as outlined, for instance, here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html

It will take some time to look everything up in the documentation, but you'll get used to it.

In [76]:
query = 'title:Shorty AND shared: >2'   # change to <2 to make it work ;-)
es_query = {"query":{"bool":{"must":{"query_string":{"query":query}}}}}

r = client.search(index='test', body=es_query)
r

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 33}
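For comparison, this is roughly how a similar condition could be written as nested dicts instead of a query string. It is a sketch, not a drop-in equivalent: the helper name `title_and_max_shares` is made up, and I put the numeric condition in the `filter` part (which does not influence the score) and the text condition in the `must` part.

```python
def title_and_max_shares(title, max_shared):
    """Nested-dict counterpart of a query string like 'title:Shorty AND shared:<2'."""
    return {'query': {'bool': {
        'must': [{'match': {'title': title}}],
        'filter': [{'range': {'shared': {'lt': max_shared}}}]}}}

nested_query = title_and_max_shares('Shorty', 2)
print(nested_query)

# With a running ES instance (not executed here):
# client.search(index='test', body=nested_query)
```

More typing than the query string, but easier to build programmatically.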

In [57]:
client.search(index='test',body= {'query': {
            'match': {
                 'is_paywalled':True
            }}})

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': 'FcLmW2cBYLprx5Xon2pe',
    '_index': 'test',
    '_score': 0.2876821,
    '_source': {'is_paywalled': True,
     'text': 'Another text, well we know how this looks like',
     'title': 'it has a title too'},
    '_type': 'article'}],
  'max_score': 0.2876821,
  'total': 1},
 'timed_out': False,
 'took': 27}
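As the output shows, the documents we are actually interested in are buried in `['hits']['hits']`, each under the key `_source`. A small helper (the name `sources` is my own) can pull them out. The `response` below is a shortened copy of the output above, so that the sketch runs without ES.

```python
def sources(response):
    """Return the original documents (the '_source' dicts) from a search response."""
    return [hit['_source'] for hit in response['hits']['hits']]

# Shortened copy of the response shown above
response = {'hits': {'hits': [{'_id': 'FcLmW2cBYLprx5Xon2pe',
                               '_source': {'is_paywalled': True,
                                           'title': 'it has a title too'}}],
                     'total': 1}}

for doc in sources(response):
    print(doc['title'])
```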

See the docs for more:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

## Aggregations

Example from our project

In [85]:
# q = some string query
# timefield = key of time variable
# granularity = 'month', 'day', ...

elastic_query = {'query':{"bool": { 'must': [ {'query_string':{'query':q}}]}},
                             'aggs':{'timeline' : {"date_histogram": {
                                 "field":timefield,
                                 "interval":granularity
} }}}

NameError: name 'q' is not defined
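The error simply means that `q`, `timefield` and `granularity` were never defined in this notebook. Wrapping the query in a function avoids that. The sketch below also shows how the resulting buckets could be read out; the field name `publication_date` and the `fake_response` are made up for illustration, and the response is shortened to the part that matters here.

```python
def timeline_query(q, timefield, granularity):
    """Build the query from above: a string query plus a date histogram."""
    return {'query': {'bool': {'must': [{'query_string': {'query': q}}]}},
            'aggs': {'timeline': {'date_histogram': {
                'field': timefield, 'interval': granularity}}}}

def timeline_counts(response):
    """Return (date string, number of documents) pairs from the aggregation."""
    buckets = response['aggregations']['timeline']['buckets']
    return [(b['key_as_string'], b['doc_count']) for b in buckets]

# Made-up values for illustration
elastic_query = timeline_query('title:Shorty', 'publication_date', 'month')
print(elastic_query['aggs'])

# Hand-written stand-in for client.search(index=..., body=elastic_query)
fake_response = {'aggregations': {'timeline': {'buckets': [
    {'key_as_string': '2018-10-01T00:00:00.000Z', 'doc_count': 12},
    {'key_as_string': '2018-11-01T00:00:00.000Z', 'doc_count': 7}]}}}
print(timeline_counts(fake_response))
```

Note that newer ES versions deprecate `interval` in favor of `calendar_interval` and `fixed_interval`; the sketch sticks to the key used in the original query.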

## Some scratch
Useful snippets

### Scroll queries

In [None]:
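from elasticsearch import helpers
from tqdm import tqdm

# wrapped in a function (the name scroll_docs is made up), because `yield` only works inside one
def scroll_docs(client, elastic_index, query, scroll_time='5m'):
    # number of hits, only needed for the progress bar
    total = client.search(index=elastic_index, body=query)['hits']['total']
    for doc in tqdm(helpers.scan(client, index=elastic_index, query=query, scroll=scroll_time), total=total):
        yield doc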

### Updating

In [94]:
old = client.search(index='test')['hits']['hits'][0]

In [95]:
old

{'_id': 'FsLmW2cBYLprx5Xon2qV',
 '_index': 'test',
 '_score': 1.0,
 '_source': {'shared': 1, 'title': 'Shorty'},
 '_type': 'article'}

In [100]:
import copy
new = copy.deepcopy(old)  # deep copy, so that old['_source'] stays untouched
new['_source']['text']='Text we forgot'

client.update(index='test',
              id=new['_id'],
              doc_type=new['_type'],
              body={'doc':new['_source']})

{'_id': 'FsLmW2cBYLprx5Xon2qV',
 '_index': 'test',
 '_primary_term': 1,
 '_seq_no': 2,
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'article',
 '_version': 2,
 'result': 'updated'}
