[View in Colaboratory](https://colab.research.google.com/github/davidvela/testColabGH/blob/master/testElasticSearch.ipynb)

# Elasticsearch testing
localhost:9200

In [0]:
from datetime import datetime
import json 
from collections import namedtuple


host = 'http://localhost:9200/'
ref = 'accounts/person/1 '
url = host + ref 

In [0]:
doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
person = {
      "name" : "John2",
      "lastname" : "Doe2",
      "job_description" : "22 windows Systems administrator and Linux specialit linux"
}
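
The `doc` above contains a `datetime`, which the standard `json` module cannot serialize on its own. The `elasticsearch` client converts it internally, but a body built by hand for `requests` needs an explicit conversion. A minimal sketch, assuming ISO 8601 strings are acceptable for the `timestamp` field; the helper name `to_json` is hypothetical and the timestamp is fixed only to make the output reproducible:

```python
import json
from datetime import date, datetime

def to_json(document):
    """Serialize a dict to JSON, converting date/datetime values to ISO 8601 strings."""
    def encode(value):
        # json.dumps calls this only for objects it cannot serialize itself
        if isinstance(value, (date, datetime)):
            return value.isoformat()
        raise TypeError("Cannot serialize %s" % type(value).__name__)
    return json.dumps(document, default=encode)

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime(2018, 5, 21, 20, 8, 55),  # fixed value for illustration
}
print(to_json(doc))
```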

## elasticsearch python package

### Basic example ... 

In [0]:
from elasticsearch import Elasticsearch
es = Elasticsearch()

In [0]:
# indexing... 
res = es.index(index="test-index", doc_type='tweet', id=1, body=doc)
print(res['result'])

updated


In [0]:
res = es.get(index="test-index", doc_type='tweet', id=1)
# res = es.get(index="accounts", doc_type='person', id=1)
print(res)
print(res['_source'])

{'_index': 'test-index', '_type': 'tweet', '_id': '1', '_version': 3, 'found': True, '_source': {'author': 'kimchy', 'text': 'Elasticsearch: cool. bonsai cool. ch', 'timestamp': '2018-05-21T20:08:55.066455'}}
{'author': 'kimchy', 'text': 'Elasticsearch: cool. bonsai cool. ch', 'timestamp': '2018-05-21T20:08:55.066455'}


In [0]:
es.indices.refresh(index="test-index")

In [0]:
res = es.search(index="test-index", body={"query": {"match_all": {}} })
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])

Got 1 Hits:
2018-05-21T20:07:01.476602 kimchy: Elasticsearch: cool. bonsai cool.
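
The search cell above reads `res['hits']['total']` as a plain integer, which matches the Elasticsearch version used in this notebook. From Elasticsearch 7 onward the same field is an object with a `value` key. A small sketch that tolerates both shapes; the helper names and the hand-written `sample` response are assumptions for illustration:

```python
def total_hits(response):
    """Return the hit count of a search response as an int."""
    total = response['hits']['total']
    # Elasticsearch 7+ returns {"value": n, "relation": ...}; older versions return n
    return total['value'] if isinstance(total, dict) else total

def format_hits(response):
    """Build the same lines the cell above prints."""
    lines = ["Got %d Hits:" % total_hits(response)]
    for hit in response['hits']['hits']:
        lines.append("%(timestamp)s %(author)s: %(text)s" % hit['_source'])
    return lines

# Hand-written stand-in for a response, mirroring the output above
sample = {
    'hits': {
        'total': {'value': 1, 'relation': 'eq'},
        'hits': [{'_source': {
            'author': 'kimchy',
            'text': 'Elasticsearch: cool. bonsai cool.',
            'timestamp': '2018-05-21T20:07:01.476602',
        }}],
    }
}
print("\n".join(format_hits(sample)))
```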


### Features
* Persistent connections
* Automatic retries
* Sniffing
* Thread safety
* SSL and authentication
* Logging
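
Some of the features listed above are switched on through keyword arguments of the client. The sketch below assumes the 6.x/7.x `elasticsearch` Python client that matches the rest of this notebook (newer major versions changed some option names); the concrete values and the helper names `build_client_options` and `make_client` are illustrative assumptions:

```python
def build_client_options():
    """Keyword arguments for Elasticsearch(...) enabling retries and sniffing."""
    return {
        'hosts': ['localhost:9200'],
        'max_retries': 3,                  # automatic retries per request
        'retry_on_timeout': True,          # also retry when a request times out
        'sniff_on_start': True,            # discover cluster nodes at startup
        'sniff_on_connection_fail': True,  # re-discover nodes when one fails
        'sniffer_timeout': 60,             # and refresh the node list every 60 s
    }

def make_client():
    """Create the client; needs the elasticsearch package and a running cluster."""
    from elasticsearch import Elasticsearch  # imported lazily on purpose
    return Elasticsearch(**build_client_options())

print(build_client_options())
```

`make_client()` is not called here because sniffing on start contacts the cluster immediately.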

## Requests library - HTTP lib

In [0]:
import requests 

In [0]:
# http://localhost:9200/test-index/tweet/1
ref = 'accounts/person/2 '  # note: the trailing space becomes part of the document id ("2 ")
# ref = 'test-index/tweet/1 '
url = host + ref 
print(url)

headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}


http://localhost:9200/accounts/person/2 


In [0]:
# ***** GET 
r = requests.get(url); show_r()

<Response [200]>
200
b'{"_index":"accounts","_type":"person","_id":"2 ","_version":3,"found":true,"_source":{"name": "John2", "lastname": "Doe2", "job_description": "22 windows Systems administrator and Linux specialit linux"}}'


In [0]:
# **** POST
# payload = {'username': 'bob', 'email': 'bob@bob.com'}
# r = requests.put("http://somedomain.org/endpoint", data=payload)
doc_data = person
query = json.dumps(doc_data)
r = requests.post(url, data=query, headers= headers); show_r()

<Response [200]>
200
b'{"_index":"accounts","_type":"person","_id":"2 ","_version":4,"result":"updated","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":3,"_primary_term":2}'


In [0]:
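# ***** SEARCH
term = "windows"
# Option 1: match a term in one field
# (note: the person documents have no "content" field)
query = json.dumps({
    "query": {
        "match": {
            "content": term
        }
    }
})
# Option 2: match everything (this assignment overrides option 1)
query = json.dumps({
    "query": {
        "match_all": { }
    }
})
ref = 'accounts/person/'  # search the whole type rather than a single document
url = host + ref + "_search"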

print(url)
# r = requests.get(host, data=query); show_r() # 406 error - header not supported
r = requests.get(url, data=query, headers = headers); show_r()
# r = requests.get(url, params=query)

http://localhost:9200/accounts/person/_search
<Response [200]>
200
b'{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"accounts","_type":"person","_id":"1 ","_score":1.0,"_source":{"name": "John", "lastname": "Doe", "job_description": "Systems administrator and Linux specialit"}},{"_index":"accounts","_type":"person","_id":"2 ","_score":1.0,"_source":{"name": "John2", "lastname": "Doe2", "job_description": "22 windows Systems administrator and Linux specialit linux"}}]}}'
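
The raw bytes printed above are hard to read. A small sketch of decoding such a body and pulling out the ids and sources; with `requests`, `r.json()` gives the same parsed dictionary. The helper name `extract_sources` is hypothetical, and `raw` is an abbreviated copy of the output above:

```python
import json

def extract_sources(raw):
    """Decode a raw _search response body; return (total, [(_id, _source), ...])."""
    payload = json.loads(raw)
    hits = payload['hits']
    return hits['total'], [(hit['_id'], hit['_source']) for hit in hits['hits']]

# Abbreviated copy of the response body shown above
raw = (
    b'{"took":2,"timed_out":false,"hits":{"total":2,"max_score":1.0,"hits":['
    b'{"_index":"accounts","_type":"person","_id":"1 ","_score":1.0,'
    b'"_source":{"name":"John","lastname":"Doe"}},'
    b'{"_index":"accounts","_type":"person","_id":"2 ","_score":1.0,'
    b'"_source":{"name":"John2","lastname":"Doe2"}}]}}'
)

total, sources = extract_sources(raw)
print(total)
for doc_id, source in sources:
    print(repr(doc_id), source['name'], source['lastname'])
```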


In [0]:
# ***** SEARCH with HIGHLIGHT
# curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' 
# -d' { "query" : { "match": { "content": "kimchy" } }, "highlight" : { "fields" : { "content" : {} } } } '
# localhost:9200/_search?q=linux

term = "linux"
query = json.dumps({
    "query" : {
        "match": { "job_description": term }
    }
    ,
    "highlight" : {
        "fields" : {
            "job_description" : {}
        }
    }
})

# query = json.dumps({
#     "query": {
#         "match_all": {  }
#     }
# })
ref = 'accounts/person/'  # search the whole type rather than a single document
url = host + ref + "_search"

print(url)
print(query)
# r = requests.get(host, data=query); show_r() # 406 error - header not supported
r = requests.get(url, data=query, headers = headers); show_r()
# r = requests.get(url, params=query)

http://localhost:9200/accounts/person/_search
{"query": {"match": {"job_description": "linux"}}, "highlight": {"fields": {"job_description": {}}}}
<Response [200]>
200
b'{"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":2,"max_score":0.39556286,"hits":[{"_index":"accounts","_type":"person","_id":"2 ","_score":0.39556286,"_source":{"name": "John2", "lastname": "Doe2", "job_description": "22 windows Systems administrator and Linux specialit linux"},"highlight":{"job_description":["22 windows Systems administrator and <em>Linux</em> specialit <em>linux</em>"]}},{"_index":"accounts","_type":"person","_id":"1 ","_score":0.2876821,"_source":{"name": "John", "lastname": "Doe", "job_description": "Systems administrator and Linux specialit"},"highlight":{"job_description":["Systems administrator and <em>Linux</em> specialit"]}}]}}'
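
In the response above, each hit carries a `highlight` object next to `_source`, with the matched terms wrapped in `<em>` tags. A sketch of collecting those fragments per document; the helper name `collect_highlights` is hypothetical and `sample` is a trimmed, hand-copied version of the output above:

```python
def collect_highlights(response, field):
    """Map each hit's _id to its highlighted fragments for `field`."""
    return {
        hit['_id']: hit.get('highlight', {}).get(field, [])
        for hit in response['hits']['hits']
    }

# Trimmed copy of the parsed response shown above
sample = {
    'hits': {
        'total': 2,
        'hits': [
            {'_id': '2 ', '_score': 0.39556286,
             'highlight': {'job_description': [
                 '22 windows Systems administrator and <em>Linux</em> specialit <em>linux</em>']}},
            {'_id': '1 ', '_score': 0.2876821,
             'highlight': {'job_description': [
                 'Systems administrator and <em>Linux</em> specialit']}},
        ],
    }
}

for doc_id, fragments in collect_highlights(sample, 'job_description').items():
    print(repr(doc_id), fragments)
```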


In [0]:
#GET localhost:9200/_search?q=john
# GET localhost:9200/_search?q=job_description:john
# GET localhost:9200/accounts/person/_search?q=job_description:linux

item = "job_description:linux"
# item = "john"

query = "_search?q="+item

ref = 'accounts/person/'
url = host + ref + query
print(url)
r = requests.get(url, headers = headers); show_r()

http://localhost:9200/accounts/person/_search?q=job_description:linux
<Response [200]>
200
b'{"took":167,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":2,"max_score":0.39556286,"hits":[{"_index":"accounts","_type":"person","_id":"2 ","_score":0.39556286,"_source":{"name": "John2", "lastname": "Doe2", "job_description": "22 windows Systems administrator and Linux specialit linux"}},{"_index":"accounts","_type":"person","_id":"1 ","_score":0.2876821,"_source":{"name": "John", "lastname": "Doe", "job_description": "Systems administrator and Linux specialit"}}]}}'
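
The URI search above concatenates the query string by hand, which works for simple terms but breaks once the query contains spaces or reserved characters. A sketch of building the same URL with proper encoding, using only the standard library; the helper name `build_search_url` is hypothetical. With `requests`, passing `params={'q': item}` performs the same encoding.

```python
from urllib.parse import urlencode

def build_search_url(host, path, query):
    """Build a URI-search URL such as <host>/<path>/_search?q=<encoded query>."""
    base = host.rstrip('/') + '/' + path.strip('/') + '/_search'
    return base + '?' + urlencode({'q': query})

url = build_search_url('http://localhost:9200/', 'accounts/person/', 'job_description:linux')
print(url)
```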


In [0]:
# curl 
# curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' 
# -d' { "query" : { "match": { "content": "kimchy" } }, "highlight" : { "fields" : { "content" : {} } } } '
# json: 
# {
#     "query" : {
#         "match": { "content": "kimchy" }
#     },
#     "highlight" : {
#         "fields" : {
#             "content" : {}
#         }
#     }
# }

In [0]:
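# **** SHOW RESULTS
# Run this cell before the request cells above: they call show_r(),
# which prints the most recent response stored in the global variable r.
def show_r():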
    # {"_index":"test-index","_type":"tweet","_id":"1","_version":3,
    #  "found":true,
    #  "_source":{"author":"kimchy","text":"Elasticsearch: cool. bonsai cool. ch","timestamp":"2018-05-21T20:08:55.066455"}}
    print(r)
    print(r.status_code)
    print(r.content)

    # x = json.loads(cc, object_hook=lambda d: namedtuple('X', d.keys())(*d.values()))
    # print(x)

## others: httplib, urllib2 (http.client and urllib.request in Python 3)

In [0]:
import http.client  # this module was named httplib in Python 2
connection = http.client.HTTPConnection('1.2.3.4:1234')
body_content = 'BODY CONTENT GOES HERE'
connection.request('PUT', '/url/path/to/put/to', body_content)
result = connection.getresponse()
# Now result.status and result.reason contain interesting stuff

In [0]:
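import urllib.request  # urllib2 was merged into urllib.request in Python 3
opener = urllib.request.build_opener(urllib.request.HTTPHandler)
# In Python 3 the body must be bytes, and the HTTP method can be set directly
request = urllib.request.Request('http://example.org', data=b'your_put_data', method='PUT')
request.add_header('Content-Type', 'your/contenttype')
response = opener.open(request)  # separate name, so the `url` variable used earlier is not overwritten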

# Elasticsearch - Text Classification


Documentation:
  * [Blog - tutorial](https://www.elastic.co/blog/text-classification-made-easy-with-elasticsearch)
  * [Ingest - attachment plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html)
  
  

In [0]:
sample_mapping = {
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "german"
        }
    }
}

In [0]:
# language analyzer: curl request, followed by the response it returns
curl -XGET "http://localhost:9200/_analyze?analyzer=english" -d'
  {
   "text" : "This is a test."
  }'
  {
    "tokens":[
       {
          "token":"test",
          "start_offset":10,
          "end_offset":14,
          "type":"<ALPHANUM>",
          "position":3
       }
    ]
  }
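
The curl call above passes the analyzer as a query-string parameter, as in the blog post. As far as I know, Elasticsearch 6 and later expect the analyzer inside the JSON body instead. A sketch of preparing that request for `requests` and reading the tokens from the reply; the helper names are hypothetical, and `sample_response` is copied from the output above:

```python
import json

def build_analyze_request(host, analyzer, text):
    """Return (url, body, headers) for a call to the _analyze API."""
    url = host.rstrip('/') + '/_analyze'
    body = json.dumps({'analyzer': analyzer, 'text': text})
    headers = {'Content-Type': 'application/json'}
    return url, body, headers

def token_strings(analyze_response):
    """Extract just the token texts from a parsed _analyze response."""
    return [token['token'] for token in analyze_response['tokens']]

url, body, headers = build_analyze_request('http://localhost:9200/', 'english', 'This is a test.')
print(url)
print(body)
# With a running cluster: r = requests.get(url, data=body, headers=headers)

# Response copied from the output above
sample_response = {'tokens': [{'token': 'test', 'start_offset': 10, 'end_offset': 14,
                               'type': '<ALPHANUM>', 'position': 3}]}
print(token_strings(sample_response))
```

Only `test` survives because the `english` analyzer drops the stop words "this", "is" and "a".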

You just need to execute 4 steps:

1. Configure your mapping ("content" : "text", "category" : "keyword")
2. Index your documents
3. Run a More Like This Query (MLT Query)
4. Write a small script that aggregates the hits of that query by score

## example 

In [0]:
PUT sample
  POST sample/document/_mapping
  {
    "properties":{
       "content":{
          "type":"text",
          "analyzer":"english"
       },
       "category":{
          "type":"text",
          "analyzer":"english",
          "fields":{
             "raw":{
                "type":"keyword"
             }
          }
       }
    }
  }
  POST sample/document/1
  {
    "category":"Apple (Fruit)",
    "content":"Granny Smith, Royal Gala, Golden Delicious and Pink Lady are just a few of the thousands of different kinds of apple that are grown around the world! You can make dried apple rings at home - ask an adult to help you take out the core, thinly slice the apple and bake the rings in the oven at a low heat."
  }
  POST sample/document/2
  {
    "category":"Apple (Company)",
    "content":"Apple is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. Its hardware products include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, the iPod portable media player, the Apple Watch smartwatch, and the Apple TV digital media player. Apple's consumer software includes the macOS and iOS operating systems, the iTunes media player, the Safari web browser, and the iLife and iWork creativity and productivity suites. Its online services include the iTunes Store, the iOS App Store and Mac App Store, Apple Music, and iCloud."
  }


GET sample/document/_search
  {
    "query":{
       "more_like_this":{
          "fields":[
             "content",
             "category"
          ],
          "like":"The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple. It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus. The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found today. Apples have been grown for thousands of years in Asia and Europe, and were brought to North America by European colonists. Apples have religious and mythological significance in many cultures, including Norse, Greek and European Christian traditions.",
          "min_term_freq":1,
          "max_query_terms":20
       }
    }
  }
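
The More Like This request above can also be built from Python, so that the input text becomes a parameter. A minimal sketch, assuming the same `sample` index and field names as the console example; the helper name `build_mlt_query` is hypothetical:

```python
import json

def build_mlt_query(text, fields=('content', 'category'),
                    min_term_freq=1, max_query_terms=20):
    """Build the body of a more_like_this search for the given input text."""
    return {
        'query': {
            'more_like_this': {
                'fields': list(fields),
                'like': text,
                'min_term_freq': min_term_freq,
                'max_query_terms': max_query_terms,
            }
        }
    }

query = build_mlt_query('The apple tree is a deciduous tree in the rose family.')
print(json.dumps(query, indent=2))
# With a running cluster (URL and headers as used earlier in this notebook):
# r = requests.get('http://localhost:9200/sample/document/_search',
#                  data=json.dumps(query),
#                  headers={'Content-Type': 'application/json'})
```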


## python script 

In [0]:
# And here's a little Python script that processes the response and returns the most likely category for the input document.

from operator import itemgetter

def get_best_category(response):
    """Sum the hit scores per category; return the best category, or None without hits."""
    categories = {}
    for hit in response['hits']['hits']:
        score = hit['_score']
        value = hit['_source']['category']
        # 'category' may be a single string or a list of strings
        for category in ([value] if isinstance(value, str) else value):
            categories[category] = categories.get(category, 0) + score
    if not categories:
        return None
    sorted_categories = sorted(categories.items(), key=itemgetter(1), reverse=True)
    return sorted_categories[0][0]
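
To see how the aggregation by score behaves, here is a self-contained sketch that returns the full ranking instead of only the winner. The helper name `rank_categories` is hypothetical, each `category` is assumed to be a single string as in the sample documents, and the scores in `response` are made up for illustration:

```python
from collections import defaultdict

def rank_categories(response):
    """Return [(category, summed_score), ...] sorted from best to worst."""
    totals = defaultdict(float)
    for hit in response['hits']['hits']:
        totals[hit['_source']['category']] += hit['_score']
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hand-written response with made-up scores
response = {
    'hits': {
        'hits': [
            {'_score': 1.5, '_source': {'category': 'Apple (Fruit)'}},
            {'_score': 1.0, '_source': {'category': 'Apple (Company)'}},
            {'_score': 0.5, '_source': {'category': 'Apple (Fruit)'}},
        ]
    }
}

for category, score in rank_categories(response):
    print(category, score)
```

Two weaker hits for the same category can outrank a single stronger hit, which is the point of summing rather than taking the top hit.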

## article summary

### Use cases
Classification of text is a very common real-world use case for NLP. Think of e-commerce data (products). Lots of people run e-commerce shops with affiliate links. The data is provided by several shops and often comes with a category tag, but each shop uses a different category system. The category systems therefore need to be unified, and all the data needs to be re-classified according to the new category tree. Or think of a Business Intelligence application where company websites need to be classified according to their sector (hairdresser vs. bakery, etc.).

### Evaluation
I evaluated this approach with a standard text classification dataset: the 20 Newsgroups dataset. The highest precision (92% correct labels) was achieved with a high score threshold that included only 12% of the documents. When labelling all documents (100% recall), 72% of the predictions were correct.
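
The article does not show how the score threshold was applied. One plausible reading is to label a document only when the summed score of its best category reaches a minimum, trading coverage for precision. A sketch under that assumption; the function name, the threshold values and the example scores are all hypothetical:

```python
def classify_with_threshold(category_scores, min_score):
    """Return the best category, or None when its summed score is below min_score."""
    if not category_scores:
        return None
    category, score = max(category_scores.items(), key=lambda item: item[1])
    return category if score >= min_score else None

scores = {'Apple (Fruit)': 2.0, 'Apple (Company)': 0.5}  # made-up summed scores
print(classify_with_threshold(scores, min_score=1.0))
print(classify_with_threshold(scores, min_score=3.0))
```

A higher `min_score` leaves more documents unlabelled but makes the remaining labels more reliable.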

The best algorithms for text classification on the 20 Newsgroups dataset are usually SVM and Naive Bayes. They have a higher average accuracy on the entire dataset.

So why should you consider using Elasticsearch for classification if there are better algorithms?

There are a few practical reasons: training an SVM model takes a lot of time. Especially when you work in a startup, or need to adapt quickly to different customers or use cases, that can become a real problem. You may not be able to retrain your model every time your data changes. I experienced this myself while working on a project for a big German bank. You then end up working with outdated models, which will not score as well anymore.

With the Elasticsearch approach training happens at index time and your model can be updated dynamically at any point in time with zero downtime of your application. If your data is stored in Elasticsearch anyway, you don't need any additional infrastructure. With over 10% highly accurate results you can usually fill the first page. In many applications that's enough for a first good impression.

Why then use Elasticsearch when there are other tools?

Because your data is already there and it's going to pre-compute the underlying statistics anyway. It's almost like you get some NLP for free!

# next