<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Elasticsearch

In this notebook you will see how to interact with Elasticsearch using Python.

In general, the **elasticsearch** module functions as a wrapper around the REST API.

In [None]:
from elasticsearch import Elasticsearch
import pandas as pd
import qcutils

In [None]:
HOST = qcutils.read_config_value(key="es.host", cf_path="config/nosql-config.yaml")
PORT = qcutils.read_config_value(key="es.port", cf_path="config/nosql-config.yaml")

COMPLETE_HOST = "{}:{}".format(HOST, PORT)
USER = qcutils.read_config_value(key="es.username", cf_path="config/nosql-config.yaml")
PASSWORD = qcutils.read_config_value(key="es.pwd", cf_path="config/nosql-config.yaml")

## Load the driver

In [None]:
es = Elasticsearch([COMPLETE_HOST],http_auth=(USER, PASSWORD))

Let's make a test call to check the health of your cluster.

In [None]:
es.cluster.health()
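
The health response is a dictionary whose `status` field is `green`, `yellow` or `red`. A minimal sketch of how you might act on it follows; `cluster_is_usable` is a hypothetical helper and `sample_health` is a hand-written stand-in, not real output from your cluster.

```python
def cluster_is_usable(health):
    """Return True when the cluster can serve requests (green or yellow)."""
    # "red" means at least one primary shard is not allocated.
    return health.get("status") in ("green", "yellow")


# Hand-written sample with the shape of a health response (not real output).
sample_health = {"cluster_name": "demo", "status": "yellow", "number_of_nodes": 1}
print(cluster_is_usable(sample_health))
```

With a live connection you would pass `es.cluster.health()` instead of the sample dictionary.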

## Create Index and insert a document

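To create an index we need to pass the mapping, unless we want to rely on the dynamic mapping features of Elasticsearch. Remember that you **CANNOT** change the mapping of an existing field once the index has been created: you can add new fields, but changing the type of a field requires reindexing the documents into a new index.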

In [None]:
index_configuration = {
    "mappings": {
        "properties": {
            "author": {
                "type": "keyword"
            },
            "categories": {
                "type": "keyword"
            },
            "date": {
                "type": "date",
                # Lowercase yyyy is the calendar year; uppercase YYYY is the week-based year
                "format": "yyyy-MM-dd"
            },
            "lang": {
                "type": "keyword"
            },
            "title": {
                "type": "text",
                "analyzer": "english"
            }
        }
    }
}


In [None]:
# Select a unique name for your index
INDEX_NAME = "my_index"

In [None]:

es.indices.create(index=INDEX_NAME, body=index_configuration)
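
Because a mapping is hard to change once documents are indexed, it can help to check a document against the explicit mapping before sending it. The sketch below is only an illustration: `fields_outside_mapping` is a hypothetical helper, it looks at top-level fields only, and the sample mapping and document are shortened stand-ins for the ones in this notebook.

```python
def fields_outside_mapping(document, index_configuration):
    """Return the top-level document fields that the mapping does not declare."""
    mapped = set(index_configuration["mappings"]["properties"])
    return sorted(set(document) - mapped)


# Shortened stand-ins for the notebook's mapping and document.
sample_configuration = {
    "mappings": {
        "properties": {
            "author": {"type": "keyword"},
            "title": {"type": "text"},
        }
    }
}
sample_document = {"author": "Andrea", "title": "A title", "tags": ["python"]}

# "tags" is not declared, so dynamic mapping would have to guess its type.
print(fields_outside_mapping(sample_document, sample_configuration))
```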

Insert a document in the index

In [None]:
document = {
    "author":"Andrea",
    "categories":["tutorial","python","elasticsearch","jupyter notebook"],
    "title":"How to use Elasticsearch from a Jupyter Notebook",
    "date":"2020-10-14",
    "lang":"en-US"
}

In [None]:
response = es.index(index=INDEX_NAME, body=document)

Now we can retrieve the same document by its id.

In [None]:
es.get(index=INDEX_NAME, id=response["_id"])

## Modify a document

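You can modify a document using its id. The fields under **doc** are merged into the stored document, so the fields you do not mention are left unchanged.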

In [None]:
edit = { 
    "doc":{
        "title":"Learn how to use Elasticsearch with Python"
    }
}

In [None]:
es.update(index=INDEX_NAME, id=response["_id"], body=edit)

In [None]:
es.get(index=INDEX_NAME, id=response["_id"])
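
To make the effect of a partial update easier to picture, here is a local approximation of the merge, written under the assumption that nested objects are merged recursively while other values (including lists) are replaced. `apply_partial_update` is a hypothetical helper that runs entirely in Python; it is not the server's implementation.

```python
import copy


def apply_partial_update(source, partial):
    """Merge `partial` into a copy of `source`, recursing into nested dicts."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # Scalars and lists replace the previous value.
            merged[key] = copy.deepcopy(value)
    return merged


stored = {
    "author": "Andrea",
    "title": "How to use Elasticsearch from a Jupyter Notebook",
    "lang": "en-US",
}
partial = {"title": "Learn how to use Elasticsearch with Python"}

print(apply_partial_update(stored, partial))
```

Only `title` changes in the printed result; `author` and `lang` keep their stored values.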

## Delete a document

To delete a document, you can pass its id to the **delete** method, or delete every document that matches a query with the **delete_by_query** method.

In [None]:
es.delete(index=INDEX_NAME, id=response["_id"])

In [None]:
# Re-insert the document to demonstrate the other method.
# refresh=True makes it searchable right away, so the query below can find it.
response = es.index(index=INDEX_NAME, body=document, refresh=True)

In [None]:
query = {
    "query":{
        "term": {
            "author": "Andrea"
            }
        }
}

es.delete_by_query(index=INDEX_NAME, body=query)

## Search documents

For this part we will use the *logstash-0* index, which contains the HTTP traffic log of a web application. The log contains events from 2020-10-14T00:00:00Z to 2020-10-22T23:59:59Z.

In [None]:
# The documents in the index have many more fields; for the sake of simplicity we will limit
# the output to the following. Feel free to change the list.

fields = [
    "hits.hits._source.@timestamp", "hits.hits._source.geo",
    "hits.hits._source.agent", "hits.hits._source.ip",
    "hits.hits._source.extension", "hits.hits._source.response",
    "hits.hits._source.request", "hits.hits._source.machine",
    "hits.hits._score", "hits.hits._source.@message",
]

In [None]:
es.search(index="logstash-0")

In [None]:
def to_dataframe(results):
    """Build a DataFrame from the `_source` of each search hit."""
    return pd.DataFrame([hit["_source"] for hit in results])

Let's use the function above to put the results in a pandas DataFrame.

In [None]:
results = es.search(index="logstash-0",filter_path=fields)

to_dataframe(results["hits"]["hits"])
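
Fields such as `geo` and `machine` are nested objects, so they end up as dictionaries inside a single DataFrame column. One possible way to get a column per nested field is to flatten each `_source` into dotted names first. This is a sketch: `flatten_source` is a hypothetical helper and `sample_hit` is a hand-written hit whose nested layout is assumed, not copied from the index.

```python
def flatten_source(source, prefix=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in source.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_source(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hand-written hit; the nested layout is an assumption for illustration.
sample_hit = {
    "_score": 1.0,
    "_source": {
        "ip": "10.0.0.1",
        "geo": {"src": "CA", "coordinates": {"lat": 45.4, "lon": -75.7}},
        "machine": {"os": "ios"},
    },
}

print(flatten_source(sample_hit["_source"]))
```

A list of such flattened dictionaries can be passed to `pd.DataFrame` in the same way as in `to_dataframe`.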



Select all the events that happened on 15 October 2020.

In [None]:
query = {
    "query":{
        "range": {
            "@timestamp": {
                "gte": "2020-10-15T00:00:00.000",
                "lt": "2020-10-16T00:00:00.000"
                }
            }
        }
    }

results = es.search(index="logstash-0",body=query,filter_path=fields)

to_dataframe(results["hits"]["hits"])
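
If you want to repeat the query for other days, the two boundaries can be derived from a date instead of being typed by hand. The sketch below builds the same half-open range (`gte` on the day, `lt` on the next day); `day_range_query` is a hypothetical helper, not part of the elasticsearch module.

```python
from datetime import date, timedelta


def day_range_query(field, day):
    """Build a range query covering one calendar day: [day, day + 1)."""
    next_day = day + timedelta(days=1)
    return {
        "query": {
            "range": {
                field: {
                    "gte": f"{day.isoformat()}T00:00:00.000",
                    "lt": f"{next_day.isoformat()}T00:00:00.000",
                }
            }
        }
    }


print(day_range_query("@timestamp", date(2020, 10, 15)))
```

Using `lt` on the following day avoids having to spell out the last millisecond of the day, and `timedelta` takes care of month and year boundaries.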

Retrieve all the requests that caused an error, that is, those with a response code other than 200.

In [None]:
query = {
    "query": {
        "bool": {
            "must_not": {
                "term": {
                    "response": "200"
                }
            }
        }
    }
}

results = es.search(index="logstash-0",body=query,filter_path=fields)

to_dataframe(results["hits"]["hits"])

Select all the traffic generated by the Safari browser.

In [None]:
query = {
  "query": {
    "match": {
      "agent": {
        "query": "safari"
      }
    }
  }
}

results = es.search(index="logstash-0",body=query,filter_path=fields)

to_dataframe(results["hits"]["hits"])

Select all the traffic generated by Safari, but give more relevance to iOS devices. Adding a **should** condition raises the **relevance** score of the documents coming from iOS devices without excluding the others.

In [None]:
query = {
  "query": {
      "bool":{
          "must":[{
              "match":{
                  "agent":"safari"
              }}],
          "should":{
              "term":{
                  "machine.os":"ios"
              }
          }
      }
  }
}

results = es.search(index="logstash-0",body=query,filter_path=fields)

to_dataframe(results["hits"]["hits"])
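
In a bool query, the **must** clauses decide which documents match, while a **should** clause that sits next to a **must** is optional and only adds to the score. If you write many of these queries, a small builder keeps the nesting readable. `bool_query` below is a hypothetical helper written for this notebook, not a function of the elasticsearch module.

```python
def bool_query(must=None, should=None, must_not=None, filter_=None):
    """Assemble a bool query, leaving out the clauses that were not given."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,
    }
    return {
        "query": {
            "bool": {name: clause for name, clause in clauses.items() if clause}
        }
    }


safari_boost_ios = bool_query(
    must=[{"match": {"agent": "safari"}}],
    should={"term": {"machine.os": "ios"}},
)
print(safari_boost_ios)
```

The resulting dictionary can be passed as `body` to `es.search` like the hand-written queries.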

### Geo Query

Select all the traffic generated from Canada. 

In [None]:
import json

canada_file = "data/canada.json"

# Read the GeoJSON file and keep only the coordinates of the first feature.
# The with block closes the file as soon as it has been parsed.
with open(canada_file, "r") as json_data:
    canada = json.load(json_data)["features"][0]["geometry"]["coordinates"]

query = {
    "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "geo_shape": {
                    "geo.coordinates": {
                        "shape": {
                            "type": "multipolygon",
                            "coordinates": canada
                        },
                        "relation": "INTERSECTS"
                    }
                }
            }
        }
    }
}

results = es.search(index="logstash-0",body=query,filter_path=fields)

to_dataframe(results["hits"]["hits"])
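
The query above hard-codes the shape type, which only works if the file really holds a multipolygon. A possible variation is to read the type from the GeoJSON geometry itself. This is a sketch: `geo_shape_filter` is a hypothetical helper, and `sample_geometry` is a tiny hand-written polygon, not the content of `data/canada.json`.

```python
def geo_shape_filter(field, geometry, relation="INTERSECTS"):
    """Build a geo_shape filter from a GeoJSON geometry dictionary."""
    return {
        "geo_shape": {
            field: {
                "shape": {
                    # Lowercased to match the style used in this notebook.
                    "type": geometry["type"].lower(),
                    "coordinates": geometry["coordinates"],
                },
                "relation": relation,
            }
        }
    }


# Hand-written square in (lon, lat) order, for illustration only.
sample_geometry = {
    "type": "Polygon",
    "coordinates": [[[-80, 43], [-70, 43], [-70, 48], [-80, 48], [-80, 43]]],
}

print(geo_shape_filter("geo.coordinates", sample_geometry))
```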



## Aggregations

Find the countries that generate the most traffic.

In [None]:
query = {
  "aggs" : {
    "country_count" : { "terms" : { "field" : "geo.src" } }
  }
}

results = es.search(index="logstash-0", body=query, size=0)

pd.DataFrame(results["aggregations"]["country_count"]["buckets"])
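
A terms aggregation returns only the top buckets (10 by default) and reports the documents that fell outside them in `sum_other_doc_count`. The sketch below turns the counts into shares of the total while taking that remainder into account. `bucket_shares` is a hypothetical helper and `sample_aggregation` is hand-written, not real output.

```python
def bucket_shares(aggregation):
    """Return each bucket with its share of all the aggregated documents."""
    buckets = aggregation["buckets"]
    total = sum(bucket["doc_count"] for bucket in buckets)
    total += aggregation.get("sum_other_doc_count", 0)
    return [
        {
            "key": bucket["key"],
            "doc_count": bucket["doc_count"],
            "share": bucket["doc_count"] / total if total else 0.0,
        }
        for bucket in buckets
    ]


# Hand-written sample with the shape of a terms aggregation result.
sample_aggregation = {
    "sum_other_doc_count": 1,
    "buckets": [
        {"key": "CN", "doc_count": 6},
        {"key": "IN", "doc_count": 3},
    ],
}

print(bucket_shares(sample_aggregation))
```

With a live result you would pass `results["aggregations"]["country_count"]` and wrap the returned list in `pd.DataFrame`.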

You can use the geo centroid aggregation to add information about the actual location.

In [None]:
query = {
  "aggs": {
    "cities": {
      "terms": { "field": "geo.src" },
      "aggs": {
        "centroid": {
          "geo_centroid": { "field": "geo.coordinates" }
        }
      }
    }
  }
}

results = es.search(index="logstash-0", body=query, size=0)

mapped = [
    {
        "country": bucket["key"],
        "count": bucket["doc_count"],
        "lat": bucket["centroid"]["location"]["lat"],
        "lon": bucket["centroid"]["location"]["lon"],
    }
    for bucket in results["aggregations"]["cities"]["buckets"]
]
pd.DataFrame(mapped)


Find which operating systems have the most errors.

In [None]:
query = {
  "query": {
    "bool": {
      "must_not": {
        "term": {
            "response":"200"
        }
      }
    }
  },
  "aggs":{
      "os_count" : { "terms" : { "field" : "machine.os.keyword" } }
  }
}

results = es.search(index="logstash-0",body=query ,size=0)


pd.DataFrame(results["aggregations"]["os_count"]["buckets"])

Daily number of errors by operating system, broken down by response code.

In [None]:
query = {
    "query": {
        "bool": {
            "must_not": {
                "term": {
                    "response": "200"
                }
            }
        }
    },
    "aggs": {
        "errors_over_time": {
            "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "day"
            },
            "aggs": {
                "errors_by_os": {
                    "terms": {
                        "field": "machine.os.keyword"
                    },
                    "aggs": {
                        "type_of_errors": {
                            "terms": {
                                "field": "response.keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}



results = es.search(index="logstash-0",body=query ,size=0)

In [None]:
data = []

for day_bucket in results["aggregations"]["errors_over_time"]["buckets"]:
    day = day_bucket["key_as_string"]

    # One row per (day, operating system), with one column per response code
    for os_bucket in day_bucket["errors_by_os"]["buckets"]:
        day_data = {"day": day, "os_name": os_bucket["key"]}

        for error in os_bucket["type_of_errors"]["buckets"]:
            day_data[error["key"]] = error["doc_count"]

        data.append(day_data)

# Response codes that never occurred for a given row become 0
df = pd.DataFrame(data).fillna(0)
df

Now you can plot one chart for each day, and then one chart for each response code.

In [None]:
df.groupby(["day"]).plot.bar(x="os_name")

In [None]:
table = pd.pivot_table(df,values=["404","503"],index=["day"],columns="os_name").fillna(0)

table["404"].plot.bar()
table["503"].plot.bar()

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.