# ElasticSearch & Python
---



<h3>Ankit Maheshwari</h3>
<br/><br/><br/>
<div style="text-align: right"> 

Twitter: @ankitind <br/>
Web: www.betout.com 

</div>


# 1. What is Elasticsearch
---
- Full-text Search Engine
- NoSQL Database
- Analytics Engine
- Lucene based
- Schemaless
- RESTful interface
- Inverted Indices
- (Nearly) Real time
- ELK Stack (Elasticsearch, Logstash, Kibana)

# 2. Installing ElasticSearch
---
- Install via docker
- Create a directory to store ES Data
- Start ELK Stack

> docker pull qnib/elk <br>
> mkdir -p ~/data <br>
> docker run -d --name elasticsearch -v ~/data:/usr/share/elasticsearch/data -p 9200:9200 elasticsearch

Test the stack <br>
Elasticsearch http://localhost:9200 & Kibana http://localhost:5601

# 3. Comparison with an RDBMS
### like MySQL/Oracle

| Relational DBs | Elasticsearch                         |
|:---------------|:--------------------------------------|
| Database       | Index                                 |
| Partition      | Shard                                 |
| Table          | Type                                  |
| Row            | Document                              |
| Column         | Field                                 |
| Schema         | Mapping                               |
| Index          | Everything is <br> already Indexed :) |
| SQL            | DSL <br> Domain Specific Language     |



### 3.1 JSON schema cheatsheet / quick reference

| Javascript     | Python                                | Example                    |
|:---------------|:--------------------------------------|:---------------------------|
| string         | string                                | "Name"                     |
| number         | int/float                             | 42                         |
| object         | dict (dictionary)                     | {"name": "sam", "age": 26} |
| array          | list                                  | ["foo","bar", 5,"hello"]   |
| boolean        | bool                                  | true                       |
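
The type correspondence in the table can be checked with a short sketch. It uses the standard library `json` module (an assumption; the notebook itself imports `ujson`), and the sample document is made up for illustration. `null`/`None` is not in the table but follows the same pattern.

```python
import json

# JSON text using each type from the table above (sample values are illustrative)
raw = '{"name": "sam", "age": 26, "tags": ["foo", "bar", 5], "active": true, "cabin": null}'

# Parsing turns JSON types into their Python counterparts
doc = json.loads(raw)
for key, value in doc.items():
    print(key, type(value).__name__, value)

# Serialising goes the other way: True/None become true/null again
print(json.dumps({"active": True, "cabin": None}))
```

This is why the query bodies later in the notebook can be written as plain Python dicts: the client serialises them to JSON before sending them to Elasticsearch.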



### 3.2 Importing Python libraries

In [288]:
# A few libraries we will be using
import requests
import ujson as json
from elasticsearch import Elasticsearch
from elasticsearch import helpers
from elasticsearch_dsl import Search, DocType, Date, Integer, Keyword, Text
from datetime import datetime
from elasticsearch_dsl.connections import connections
import pandas as pd
from ipywidgets import interact, interactive, fixed, interact_manual

ES_HOST = 'http://ec2-52-91-189-234.compute-1.amazonaws.com:9200'
es = Elasticsearch(ES_HOST)
print(es)


<Elasticsearch([{'port': 9200, 'host': 'ec2-52-91-189-234.compute-1.amazonaws.com'}])>


Print all indices of Elasticsearch

In [9]:
# list all the indexes
indices=es.indices.get_alias().keys()
sorted(indices)

['.kibana',
 'books',
 'index_test',
 'logstash-2017.05.22',
 'logstash-2017.05.23',
 'megacorp',
 'schools',
 'something',
 'test',
 'test-index',
 'test-index1',
 'test_index',
 'titanic']

# 4. Mapping
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. 

#### Mapping Types
Each index has one or more mapping types, which are used to divide the documents in an index into logical groups. 
>User documents might be stored in a user type, and blog posts in a blogpost type.

##### Meta-fields
>Meta-fields are used to customize how a document’s associated metadata is treated. 
Examples of meta-fields include the document’s _index, _type, _id, and _source fields.

##### Fields or properties
>Each mapping type contains a list of fields or properties pertinent to that type. 
A user type might contain title, name, and age fields

```
PUT my_index
{
  "mappings": {
    "user": {
      "_all":       { "enabled": false },
      "properties": {
        "name":     { "type": "text" },
        "age":      { "type": "integer" }
      }
    },
    "blogpost": {
      "_all":       { "enabled": false },
      "properties": {
        "title":    { "type": "text" },
        "body":     { "type": "text" },
        "user_id":  { "type": "keyword" },
        "created":  { "type":   "date",
                      "format": "strict_date" }
      }
    }
  }
}
```
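
The same mapping can be sent from Python. The following is a minimal sketch, not a definitive recipe: `build_mapping` and `create_index` are hypothetical helper names, and it assumes an Elasticsearch 5.x cluster (as the notebook appears to use), since multiple mapping types per index and the `_all` field are not available in later major versions. It also assumes a client version whose `indices.create` accepts a `body` argument.

```python
def build_mapping():
    """Return the mapping body shown above as a Python dict."""
    return {
        "mappings": {
            "user": {
                "_all": {"enabled": False},
                "properties": {
                    "name": {"type": "text"},
                    "age": {"type": "integer"},
                },
            },
            "blogpost": {
                "_all": {"enabled": False},
                "properties": {
                    "title": {"type": "text"},
                    "body": {"type": "text"},
                    "user_id": {"type": "keyword"},
                    "created": {"type": "date", "format": "strict_date"},
                },
            },
        }
    }


def create_index(es, index_name="my_index"):
    """Create the index; `es` is expected to be an elasticsearch.Elasticsearch client."""
    return es.indices.create(index=index_name, body=build_mapping())
```

With the client from the import cell, `create_index(es)` would issue the equivalent of the `PUT my_index` request.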

# 5. Query and filter context

### Query context
In query context, a query clause answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a _score representing how well the document matches.

### Filter context
In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple yes or no; no scores are calculated. Filter context is mostly used for filtering structured data, for example: does this timestamp fall into the range 2015 to 2016? Is the status field set to "published"?
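
To see both contexts side by side, here is a small sketch of one request body that uses each. The helper `search_body` is hypothetical; the `Name` and `Sex` fields are those of the `titanic` index queried later in this notebook. Clauses under `must` run in query context and contribute to `_score`, while clauses under `filter` only include or exclude documents.

```python
def search_body(name_text, sex):
    """Build a bool query mixing query context and filter context."""
    return {
        "query": {
            "bool": {
                # Query context: how well does the Name match? Affects _score.
                "must": [{"match": {"Name": name_text}}],
                # Filter context: yes/no check only, no scoring.
                "filter": [{"term": {"Sex": sex}}],
            }
        }
    }


body = search_body("Thomas Joseph", "male")
print(body)
```

The resulting dict could be passed as `body` to `es.search(index="titanic", body=body)`.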

## 5.1 Filter query
- When working with exact values, we will be working with non-scoring, filtering queries. 
- Filters are important because they are very fast. 
- They do not calculate relevance (avoiding the entire scoring phase) and are easily cached.
- We use a constant_score to convert the term query into a filter


In [258]:
myquery = {"query":  {"constant_score" : {"filter" : {"term" : {"Sex":"female"}}}}}
res = es.search(index="titanic", body=myquery)
for items in res['hits']['hits']:
    print(items['_source']['Name'] + " (Id: " + items['_id'] + ") has a score of: " + str(items['_score']))

Abrahim, Mrs. Joseph (Sophie Halaut Easu) (Id: 900) has a score of: 1.0
Corbett, Mrs. Walter H (Irene Colvin) (Id: 935) has a score of: 1.0
Connolly, Miss. Kate (Id: 898) has a score of: 1.0
Davidson, Mrs. Thornton (Orian Hays) (Id: 984) has a score of: 1.0
Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson) (Id: 982) has a score of: 1.0
Hocking, Miss. Ellen Nellie"" (Id: 944) has a score of: 1.0
O'Donoghue, Ms. Bridget (Id: 980) has a score of: 1.0
Straus, Mrs. Isidor (Rosalie Ida Blun) (Id: 1006) has a score of: 1.0
Evans, Miss. Edith Corse (Id: 1004) has a score of: 1.0
Goodwin, Miss. Jessie Allis (Id: 1032) has a score of: 1.0


## 6. Types of Queries
Leaf Queries
- Match all
- Full Text
- Term Level
>"query" : {"queryType" : {"fieldname" : "fieldvalue"}}

Compound Queries
- Bool Query
- Constant Score Queries
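
The leaf-query template above, and the way a compound query wraps a leaf query, can be sketched with two small helpers. Both function names are hypothetical and only build Python dicts; nothing is sent to Elasticsearch.

```python
def leaf_query(query_type, field, value):
    """Fill in the template {"query": {queryType: {fieldname: fieldvalue}}}."""
    return {"query": {query_type: {field: value}}}


def as_filter(leaf):
    """Wrap a leaf query in constant_score so it runs as a non-scoring filter."""
    return {"query": {"constant_score": {"filter": leaf["query"]}}}


term_query = leaf_query("term", "Sex", "female")
print(term_query)
print(as_filter(term_query))
```

The second dict has the same shape as the `constant_score` request used in section 5.1.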




### 6.1 Match all
The simplest query, which matches all documents, giving them all a _score of 1.0.

In [306]:
# save match all query as python variable
myquery={"query": 
         {"match_all": {}}
        }

# execute the query using body parameter and return total number of records
# select count(*) from table
res = es.search(index="titanic", body=myquery)  

print("Total records found: {rec}".format(rec=res['hits']['total']))
for x in range(0, res['hits']['total']):
    print("\n" + str(x+1))
    for key, value in res['hits']['hits'][x]['_source'].items():
        print(str(key) + ": " + str(value))
    if x == 1:
        print("-- breaking--")
        break 

In [351]:
field=[]
res = requests.get(ES_HOST + '/titanic/passenger/_mapping')
for a in res.json()['titanic']['mappings']['passenger']['properties'].keys():
    field.append(a)
def func(Query_type):
    myquery={"query":{Query_type: {}}}
    res = es.search(index="titanic", body=myquery) 
    print("Total records found: {rec}".format(rec=res['hits']['total']))
    for x in range(0, res['hits']['total']):
        print("\n" + str(x+1))
        for key, value in res['hits']['hits'][x]['_source'].items():
            print(str(key) + ": " + str(value))
        if x == 1:
            print("-- breaking--")
            break 


interact_manual(func,  Query_type={'Get All Records':'match_all'});


### 6.2 Full text queries


<img src="https://qbox.io/img/blog/elasticsearch-queries-example.png">

### 6.2.1 Match Query
- match queries accept text, numerics, or dates, analyze them, and construct a query.
- A document matches even if only one of the terms matches.
- The more terms that match, the better the score.
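
The last two points can be illustrated with a toy model in plain Python. This is only a sketch: `analyze` is a crude stand-in for a real analyzer, and `toy_match_score` simply counts matching terms, which is not how Lucene actually computes `_score`. The first two names come from the output below; the last two are made up for the example.

```python
import re


def analyze(text):
    """Crude analyzer stand-in: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def toy_match_score(field_value, query_text):
    """Count how many distinct query terms occur in the field value."""
    field_terms = set(analyze(field_value))
    return sum(1 for term in set(analyze(query_text)) if term in field_terms)


names = [
    "Thomas, Mr. Tannous",    # contains one query term
    "Lamb, Mr. John Joseph",  # contains one query term
    "Thomas, Mr. Joseph",     # made-up name containing both terms
    "Kelly, Miss. Mary",      # made-up name containing neither term
]
scores = {name: toy_match_score(name, "Thomas Joseph") for name in names}

# Documents with score 0 would not be returned at all
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(score, name)
```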

In [111]:
myquery={
    "query": {
        "match" : {"Name" : "Thomas Joseph"}
    }}
res = es.search(index="titanic", body=myquery) 
print("Total records found: {rec}".format(rec=res['hits']['total']))
for x in range(0, res['hits']['total']):
    print("\n" + str(x+1))
    for key, value in res['hits']['hits'][x]['_source'].items():
        print(str(key) + ": " + str(value))
    if x == 1:
        print("-- breaking--")
        break 

Total records found: 26

1
Sex: male
Name: Thomas, Mr. Tannous
Embarked: C
Pclass: 3
Cabin: None
SibSp: 0
Parch: 0
Age: None
PassengerId: 1224
Ticket: 2684
Fare: 7.225

2
Sex: male
Name: Lamb, Mr. John Joseph
Embarked: Q
Pclass: 2
Cabin: None
SibSp: 0
Parch: 0
Age: None
PassengerId: 976
Ticket: 240261
Fare: 10.7083
-- breaking--


In [358]:
def func(Query_type, Dimension, Value):
    myquery={
    "query": {
        Query_type : {Dimension : Value}
    }}
    res = es.search(index="titanic", body=myquery) 
    print("Total records found: {rec}".format(rec=res['hits']['total']))
    for x in range(0, res['hits']['total']):
        print("\n" + str(x+1))
        for key, value in res['hits']['hits'][x]['_source'].items():
            print(str(key) + ": " + str(value))
        if x == 1:
            print("-- breaking--")
            break 


interact_manual(func,  Query_type={'Match':'match'}, Dimension = field, Value = '');

### 6.2.2 "Match Phrase"  (match_phrase) Query
The match_phrase query analyzes the text and returns a document only if all the terms appear in it in the same order.
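
The ordering requirement can be mimicked in plain Python. The sketch below is a toy, not the real implementation: `analyze` is a crude analyzer stand-in, `toy_phrase_match` is a hypothetical helper, and options such as `slop` are ignored.

```python
import re


def analyze(text):
    """Crude analyzer stand-in: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def toy_phrase_match(field_value, phrase):
    """True if the phrase terms occur consecutively, in order, in the field."""
    field_terms = analyze(field_value)
    phrase_terms = analyze(phrase)
    n = len(phrase_terms)
    if n == 0:
        return False
    return any(
        field_terms[i:i + n] == phrase_terms
        for i in range(len(field_terms) - n + 1)
    )


print(toy_phrase_match("Lamb, Mr. John Joseph", "John Joseph"))  # same order
print(toy_phrase_match("Lamb, Mr. John Joseph", "Joseph John"))  # reversed order
```

A single-word phrase such as "Samuel" behaves like a plain match on that word, which is why the query below still returns several passengers.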

In [113]:
myquery={ "query": {
        "match_phrase" : { "Name" : "Samuel"}
     }}
res = es.search(index="titanic", body=myquery) 
print("Total records found: {rec}".format(rec=res['hits']['total']))
for x in range(0, res['hits']['total']):
    print("\n" + str(x+1))
    for key, value in res['hits']['hits'][x]['_source'].items():
        print(str(key) + ": " + str(value))
    if x == 1:
        print("-- breaking--")
        break 

Total records found: 6

1
Sex: male
Name: Andersson, Mr. Johan Samuel
Embarked: S
Pclass: 3
Cabin: None
SibSp: 0
Parch: 0
Age: 26.0
PassengerId: 1212
Ticket: 347075
Fare: 7.775

2
Sex: male
Name: Davies, Mr. John Samuel
Embarked: S
Pclass: 3
Cabin: None
SibSp: 2
Parch: 0
Age: 21.0
PassengerId: 901
Ticket: A/4 48871
Fare: 24.15
-- breaking--


In [360]:
def func(Query_type, Dimension, Value):
    myquery={
    "query": {
        Query_type : {Dimension : Value}
    }}
    print(myquery)
    res = es.search(index="titanic", body=myquery) 
    print("Total records found: {rec}".format(rec=res['hits']['total']))
    for x in range(0, res['hits']['total']):
        print("\n" + str(x+1))
        for key, value in res['hits']['hits'][x]['_source'].items():
            print(str(key) + ": " + str(value))
        if x == 1:
            print("-- breaking--")
            break 


interact_manual(func,  Query_type={'Match':'match', 'Match Phrase':'match_phrase'}, Dimension = field, Value = '');

### 6.2.3 Term Query
While the full text queries will analyze the query string before executing, the term-level queries operate on the **exact terms** that are stored in the inverted index.

These queries are usually used for structured data like numbers, dates, and enums, rather than full text fields.
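
The difference matters most on analyzed text fields. The toy inverted index below is a sketch under the assumption that the field is a `text` field whose analyzer lowercases tokens (as the standard analyzer does); all function names are hypothetical. The two documents reuse names and PassengerIds from the outputs in this notebook.

```python
import re
from collections import defaultdict


def analyze(text):
    """Crude analyzer stand-in: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_inverted_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def toy_term_query(index, term):
    """Exact lookup: the query term is NOT analyzed."""
    return sorted(index.get(term, set()))


def toy_match_query(index, text):
    """Full-text lookup: the query text is analyzed first."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return sorted(hits)


docs = {1008: "Thomas, Mr. John", 1224: "Thomas, Mr. Tannous"}
index = build_inverted_index(docs)

print(toy_term_query(index, "thomas"))   # matches the stored lowercase term
print(toy_term_query(index, "Thomas"))   # no such term in the index
print(toy_match_query(index, "Thomas"))  # analyzed to "thomas" before lookup
```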

In [121]:
myquery={"query": {"term" : { "Ticket" : "2681" }  }}

res = es.search(index="titanic", body=myquery) 
for itemkey, itemvalue in res['hits']['hits'][0]["_source"].items():
    print(str(itemkey) + ": " + str(itemvalue))

Sex: male
Name: Thomas, Mr. John
Embarked: C
Pclass: 3
Cabin: None
SibSp: 0
Parch: 0
Age: None
PassengerId: 1008
Ticket: 2681
Fare: 6.4375


In [361]:
def func(Query_type, Dimension, Value):
    myquery={
    "query": {
        Query_type : {Dimension : Value}
    }}
    res = es.search(index="titanic", body=myquery) 
    print("Total records found: {rec}".format(rec=res['hits']['total']))
    for x in range(0, res['hits']['total']):
        print("\n" + str(x+1))
        for key, value in res['hits']['hits'][x]['_source'].items():
            print(str(key) + ": " + str(value))
        if x == 1:
            print("-- breaking--")
            break 


interact_manual(func,  Query_type={'Match':'match', 'Match Phrase':'match_phrase', 'Absolute Phrase':'term'}, Dimension = field, Value = '');

### 6.2.4 Terms query
Filters documents that have fields that match any of the provided terms (not analyzed).
>The terms query works like the **IN** operator in SQL databases such as MySQL.

In [110]:
myquery={"query": { "terms" : { "Parch" : [2, 3, 5]}}}
res = es.search(index="titanic",  body=myquery)
print("Total records found - {a}".format(a=(res['hits']['total'])))
print("Third record : {a}".format(a=(res['hits']['hits'][2]['_source'])))

Total records found - 34
Third record : {'Sex': 'female', 'Name': 'Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)', 'Embarked': 'S', 'Pclass': 3, 'Cabin': None, 'SibSp': 0, 'Parch': 2, 'Age': 36.0, 'PassengerId': 1045, 'Ticket': '350405', 'Fare': 12.1833}


### 6.2.5 Range Query
Matches documents with fields that have terms within a certain range. 
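
A range clause accepts the bounds `gt`, `gte`, `lt` and `lte`, and any of them may be left out. The sketch below uses a hypothetical helper, `range_query`, that keeps only the bounds that were actually supplied.

```python
def range_query(field, gt=None, gte=None, lt=None, lte=None):
    """Build a range query, omitting any bound that was not given."""
    bounds = {"gt": gt, "gte": gte, "lt": lt, "lte": lte}
    given = {name: value for name, value in bounds.items() if value is not None}
    return {"query": {"range": {field: given}}}


# Closed interval, as in the example that follows
print(range_query("PassengerId", gte=990, lte=1000))

# Open-ended range: only a lower bound
print(range_query("Fare", gt=100))
```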


In [125]:
myquery = {"query" : {"range": {"PassengerId":{"gte":990,"lte":1000}}}}
res = es.search(index="titanic",  body=myquery)
print("Total records found - {a}".format(a=res['hits']['total']))

myqueryTimeExample = {"query" : {"range":{"timestamp":{"gte":"2015-01-01 00:00:00", "lte":"now"}}}}


Total records found - 11


In [369]:
def func(Query_type, Dimension,  Less_Than, Greater_Than):
    myquery = {"query" : {Query_type: {Dimension:{"gte":Greater_Than,"lte":Less_Than}}}}
    res = es.search(index="titanic", body=myquery) 
    print("Total records found: {rec}".format(rec=res['hits']['total']))
    for x in range(0, res['hits']['total']):
        print("\n" + str(x+1))
        for key, value in res['hits']['hits'][x]['_source'].items():
            print(str(key) + ": " + str(value))
        if x == 1:
            print("-- breaking--")
            break 


interact_manual(func,  Query_type='range', Dimension = field,   Greater_Than = '', Less_Than = '');

In [274]:
## Summary of Leaf Queries
# 1. Match_all
q1 = {"query" : {"match_all":{}}}

# 2. Match
q2 = {"query" : {"match":{"Sex":"male"}}}

# 3. Match Phrase
q3 = {"query" : {"match_phrase":{"Sex":"male"}}}

# 4. Term
q4 = {"query" : {"term":{"Sex":"male"}}}

#5. Terms
q5 = {"query" : {"terms":{"Sex":["male", "female"]}}}

#6. Range
q6 = {"query" : {"range" : {"PassengerId" : {"gte" : 1000, "lte" : 2000}}}}

res = es.search(index="titanic", body=q5)
print(res['hits']['total'])


377


## 7. Compound queries
Compound queries wrap other compound or leaf queries, either to combine their results and scores, to change their behaviour, or to switch from query to filter context.

### 7.1 Boolean Query
A query that matches documents matching boolean combinations of other queries.
A bool query is composed of four sections:

```
{
   "bool" : {
      "must" :     [],
      "should" :   [],
      "must_not" : [],
      "filter":    []
   }
}
```
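
Since each section is just a list of clauses, a bool query is easy to assemble programmatically. The following sketch uses two hypothetical helpers, `term` and `bool_query`; the parameter is named `filter_` only to avoid shadowing Python's built-in `filter`. Empty sections are left out of the body.

```python
def term(field, value):
    """Build a single term clause."""
    return {"term": {field: value}}


def bool_query(must=None, should=None, must_not=None, filter_=None):
    """Assemble a bool query from lists of clauses, skipping empty sections."""
    sections = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,
    }
    return {
        "query": {
            "bool": {name: clauses for name, clauses in sections.items() if clauses}
        }
    }


q = bool_query(
    filter_=[term("Age", 22)],
    must=[term("Pclass", 2)],
    must_not=[term("Sex", "female")],
)
print(q)
```

The result can be passed as `body` to `es.search` or `es.count`, in the same way as the hand-written queries in this section.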

In [273]:
myquery = {"query":
               {"bool":{
                    "filter":{"term":{"Age":"22"}},
                    "must":{"term":{"Pclass":"2"}},
                    "must_not":{"term":{"Sex":"female"}}
                    }}}



res = es.count(index="titanic", body=myquery)
print("Total count is: {a}".format(a=res['count']))

Total count is: 2


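In [None]:
# A range stored as a single dict: each bound is addressed by its key
A = {"gte" : 1000, "lte" : 2000}
A["gte"] =  1000
A["lte"] =  2000

# The same bounds stored as a list of dicts: index first, then key
A = [{"gte" : 1000}, {"lte" : 2000}]
A[0]["gte"] = 1000
A[1]["lte"] = 2000

# A bool query whose filter section is a list of term clauses
b={"query":
               {"bool":{
                    "filter":[{"term":{"Age":"22"}},
                              {"term":{"Rev":"222"}}
                             ],
                    "must":{"term":{"Pclass":"2"}},
                    "must_not":{"term":{"Sex":"female"}}
                    }}}

# Each filter clause is addressed by its position in the list
b["query"]["bool"]["filter"][0]["term"]["Age"] = 22
b["query"]["bool"]["filter"][1]["term"]["Rev"] = 222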


In [287]:

b={"query":
               {"bool":{
                    "filter":[{"term":{"Age":"36"}},
                              {"term":{"Pclass":"1"}}
                             ],
                    "must":{"term":{"Pclass":"1"}},
                    "must_not":{"term":{"Sex":"female"}}
                    }}}
print(json.dumps(b, indent=4, sort_keys=True ))

res = es.count(index="titanic", body = b)
print("Total count is: {a}".format(a=res['count']))


{
    "query":{
        "bool":{
            "filter":[
                {
                    "term":{
                        "Age":"36"
                    }
                },
                {
                    "term":{
                        "Pclass":"1"
                    }
                }
            ],
            "must":{
                "term":{
                    "Pclass":"1"
                }
            },
            "must_not":{
                "term":{
                    "Sex":"female"
                }
            }
        }
    }
}
Total count is: 1


In [392]:
def func(Query_type, subQuery_type, Dimension,  Value):
    myquery = {"query":{Query_type:{subQuery_type:{"term":{Dimension:Value}}}}}
    res = es.search(index="titanic", body=myquery) 
    print(myquery)
    print("Total records found: {rec}".format(rec=res['hits']['total']))
    for x in range(0, res['hits']['total']):
        print("\n" + str(x+1))
        for key, value in res['hits']['hits'][x]['_source'].items():
            print(str(key) + ": " + str(value))
        if x == 1:
            print("-- breaking--")
            break 

           
interact_manual(func,  
                Query_type={'Compound':'bool'},
                subQuery_type ={'Filter':'filter', 'Must have':'must', 'Must NOT have':'must_not', 'Maybe Have':'should'},
                Dimension = field,   
                Value = ''
               )

<function __main__.func>