# Elasticsearch (ES) in Google Colab

Troubleshooting: Elasticsearch can be unstable in Colab, and you may run into errors from time to time. Here are some troubleshooting tips:
* Run the code in Section 1 only once per session. The server and client are then downloaded and installed once, and you avoid starting duplicate server processes.
* If an index you have just created reports zero documents, something went wrong during indexing. Delete the index and recreate it using the code provided.
* If your index contains twice as many documents as expected, the indexing cell has probably been run twice. Delete the index (code provided below) and recreate it.
* Duplicate ES server instances can cause additional errors. To check whether several daemon processes are running, run `ps -ef | grep elasticsearch` in a `%%bash` cell. To kill a daemon process, use `!kill -9 <id>`, where `<id>` is the number in the second column of the line for that process. Failing that, ask one of the demonstrators for help.
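
The process check in the last tip can also be done from Python. The sketch below is an illustration only and not part of the original notebook: `find_es_pids` is a hypothetical helper, and the sample `ps -ef` text is made up for the demonstration. It assumes the usual `ps -ef` layout, where the PID is the second column and the command starts at the eighth.

```python
def find_es_pids(ps_output):
    """Return the PIDs (second column) of lines that mention Elasticsearch."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # skip blank or short lines, and the grep process itself
        if len(fields) < 8 or 'grep' in fields[7:]:
            continue
        if 'elasticsearch' in line:
            pids.append(int(fields[1]))
    return pids


# made-up sample in `ps -ef` format (UID PID PPID C STIME TTY TIME CMD)
sample = """UID          PID    PPID  C STIME TTY          TIME CMD
daemon       321       1 12 10:01 ?        00:00:41 /content/elasticsearch-7.9.2/jdk/bin/java -Des.path.home=/content/elasticsearch-7.9.2 org.elasticsearch.bootstrap.Elasticsearch
root         407     399  0 10:05 ?        00:00:00 grep elasticsearch
"""
print(find_es_pids(sample))
```

In a live session the text could come from `subprocess.run(['ps', '-ef'], capture_output=True, text=True).stdout`. More than one PID in the result suggests that duplicate servers are running.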


### 1. Download Elasticsearch server and client

In [1]:
# The following shell commands download Elasticsearch and install it
# on the Google Colab instance.

# You only need to run these once while you work on your search engine notebook.

# NOTE: If you are working on a large dataset (20k+ docs) you should do this locally,
# i.e. in a Jupyter notebook. This way you only need to install ES once and
# index your data once.

In [None]:
!wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
!wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
# verify the download before unpacking it
!shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
!tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
# the server refuses to run as root, so hand the files to the daemon user
!sudo chown -R daemon:daemon elasticsearch-7.9.2/

In [47]:
# https://stackoverflow.com/questions/68762774/elasticsearchunsupportedproducterror-the-client-noticed-that-the-server-is-no#answer-68918449
!pip install elasticsearch==7.9.1 -q

In [None]:
# check elasticsearch version in environment
!pip freeze | grep elasticsearch

In [49]:
# import utility packages
import urllib.request 
from bs4 import BeautifulSoup 
import re
import time

# let's import ES
from elasticsearch import Elasticsearch

In [None]:
%%bash --bg
sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

In [None]:
%%bash
ps -ef | grep elasticsearch

In [None]:
# wait for the server started above, then check that it responds
time.sleep(20)  # give the server 20 seconds to start
!curl -X GET "http://localhost:9200"
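
A fixed 20-second sleep can be too short on a slow Colab instance and needlessly long on a fast one. One alternative, sketched here as an assumption rather than as part of the original lab, is to poll until the server answers. `wait_until_ready` and `fake_probe` are hypothetical names, and the stand-in probe only simulates a server that becomes available on the third attempt.

```python
import time


def wait_until_ready(probe, timeout=60.0, interval=2.0):
    """Call probe() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # a server that is still starting typically refuses connections
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# demonstration with a stand-in probe that succeeds on the third call
calls = {'n': 0}


def fake_probe():
    calls['n'] += 1
    return calls['n'] >= 3


print(wait_until_ready(fake_probe, timeout=5.0, interval=0.01))
```

Once the client `es` has been created in a later cell, `wait_until_ready(es.ping)` would play the same role against the real server.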

In [52]:
def test_ES(es):
  """
  Check that the ES instance started successfully and that the Python
  client can reach it. Returns True if the server responds to a ping.
  """
  return es.ping()

In [None]:
# connect to ES and test the connection
es = Elasticsearch("http://localhost:9200")
if test_ES(es):
  print('ES instance working')
else:
  print('ES instance not working')

In [None]:
# Server information
es.info()

### 2. Document Retrieval

In [55]:
# First thing we want to do is index some data. Let's use our poems from the
# previous lab:

In [56]:
def get_boat_poems():
  poem_data = []
  # get poems html
  url = 'https://discoverpoetry.com/poems/poems-about-ships/'
  contents = urllib.request.urlopen(url).read()
  # parse html; naming the parser explicitly avoids a BeautifulSoup warning
  soup = BeautifulSoup(contents, 'lxml')
  for poem_html in soup.find_all('article', {'class': 'poem-listing'}):
    # note: .groups(1) returns a tuple of the captured groups; clean_corpus unpacks it with [0]
    poem = re.search('<p class="ExcerptText">(.*?)</blockquote>', str(poem_html), re.DOTALL).groups(1)
    title = re.search('<h3 class="cat-poem-title">(.*?)</h3>', str(poem_html), re.DOTALL).groups(1)
    try:
      author = re.search('<div class="intro">by (.*?)</div>', str(poem_html), re.DOTALL).groups(1)
    except AttributeError:
      # no author line found: show the offending html and fall back to a
      # placeholder so that a stale or undefined author is never appended
      print(poem_html)
      author = ('unknown',)
    poem_data.append((title, poem, author))
  return poem_data

In [57]:
def clean_text(raw_html):
  """
  borrowed from David Beauchemin: https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
  """
  return BeautifulSoup(raw_html, "lxml").text

In [58]:
def clean_corpus(corpus):
  titles = [clean_text(x[0][0]) for x in corpus]
  bodies = [clean_text(x[1][0]) for x in corpus]
  authors = [clean_text(x[2][0]) for x in corpus]
  return list(zip(titles, bodies, authors))
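
The `[0]` indexing in `clean_corpus` is needed because `get_boat_poems` calls `.groups(1)` on each match, which returns a tuple of all captured groups rather than a string. The short check below illustrates the difference from `.group(1)`; the HTML snippet is made up and only mimics the shape of the title markup.

```python
import re

# made-up snippet in the same shape as the title markup searched for above
html = '<h3 class="cat-poem-title">An Example Title</h3>'
match = re.search('<h3 class="cat-poem-title">(.*?)</h3>', html, re.DOTALL)

# tuple of all groups; the argument is only a default for unmatched groups
as_tuple = match.groups(1)
# text of the first group
as_string = match.group(1)

print(as_tuple)
print(as_string)
print(as_tuple[0] == as_string)
```

Switching the scraper to `.group(1)` would make the `[0]` indexing unnecessary, but both functions would then have to change together.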

In [59]:
# note that we have now stored the author and title fields as well.
corpus = get_boat_poems()
corpus = clean_corpus(corpus)

In [60]:
# Now let's start indexing some documents.

# The two key places to find information on ES functionality are:
# 1. the Python library documentation: https://elasticsearch-py.readthedocs.io/en/v7.13.2/api.html
# 2. the general ES documentation: https://www.elastic.co/guide/index.html
#    in particular
#    2.1 the index API: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/docs-index_.html
#    2.2 the search API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html


In [None]:
# Uncomment if you have already created the index: it must be deleted before it can be recreated.
# (index_name is defined in the next cell.)
#es.indices.delete(index=index_name)

In [None]:
# Mappings define the structure of your data. Here an explicit mapping is used:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/explicit-mapping.html

# The mapping is passed in the request body when the index is created:

request_body = {
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1,
        
    },
    'mappings': {
          'properties': {
              'title': {'type': 'text'},
              'body': {'type': 'text'},
              'author': {'type': 'text'}
          }
    }
}

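index_name = 'test'
# check for the index explicitly instead of catching every exception, so that
# unrelated errors (e.g. a server that is down) are not mistaken for a missing index
if es.indices.exists(index=index_name):
  print('index {} already exists'.format(index_name))
else:
  print('creating index {}'.format(index_name))
  es.indices.create(index=index_name, body=request_body)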

In [63]:
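# Now we want to put some data in the index, i.e. index it.
# Note: every run of this cell adds the documents again, so run it only once per index.
for title, body, author in corpus:
  doc_body = {
      'title': title,
      'body': body,
      'author': author
  }
  es.index(index=index_name, body=doc_body)

# make the new documents visible to search and count requests straight away
es.indices.refresh(index=index_name)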
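
The loop above sends one request per document and lets ES generate a fresh ID each time. One possible alternative, sketched here under assumptions, is to give every document a deterministic `_id` and build actions in the format accepted by `elasticsearch.helpers.bulk`. The helper `make_actions`, the ID scheme (a SHA-1 hash of title and author) and the sample tuples are illustrative choices, not part of the original notebook.

```python
import hashlib


def make_actions(corpus, index_name):
    """Yield one bulk action per (title, body, author) tuple."""
    for title, body, author in corpus:
        # the same title/author pair always maps to the same ID, so reindexing
        # overwrites the existing document instead of adding a duplicate
        key = '{}|{}'.format(title, author).encode('utf-8')
        doc_id = hashlib.sha1(key).hexdigest()
        yield {
            '_index': index_name,
            '_id': doc_id,
            '_source': {'title': title, 'body': body, 'author': author},
        }


# placeholder tuples standing in for the scraped corpus
sample_corpus = [
    ('Title A', 'first body', 'Author A'),
    ('Title B', 'second body', 'Author B'),
]
actions = list(make_actions(sample_corpus, 'test'))
print(len(actions), actions[0]['_index'])
```

With a running server, `from elasticsearch import helpers` followed by `helpers.bulk(es, make_actions(corpus, index_name))` would send these actions in batches. Note that under this ID scheme two poems with the same title and author would collapse into one document.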

In [None]:
# Now let's have a look at our index:
doc_count = es.count(index=index_name)['count']
print('we have made an index called {} with {} documents'.format(index_name, doc_count))

In [65]:
def index_info(index_name):
  # the cat API returns a single line of whitespace-separated text
  count, deleted, shards = es.cat.indices(index=index_name, h=['docs.count', 'docs.deleted', 'pri']).split()
  print(
      """
      #### INDEX INFO #####
      index_name = {}
      doc_count = {}
      shard_count = {}
      deleted_doc_count = {}
      """.format(index_name, count, shards, deleted)
  )

In [None]:
index_info(index_name)

In [None]:
# Now let's try some queries.
# The key here is the es.search method and the Search API documentation:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html

query_body = {
    'query':{
        'term': {
            'body':  'ship'
        }
    }
}
print('### RESULTS ####')
explain = True
results = es.search(index=index_name, body=query_body, explain=explain)['hits']['hits']
for hit in results:
  print('title: {} - score: {}'.format(hit['_source']['title'], hit['_score']))
if explain and results:
  # show how the score of the last hit was computed
  print('some info on the last result')
  print(results[-1]['_explanation'])
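
A `term` query looks up the query string as a single exact token and does not analyze it, whereas a `match` query runs the query text through the field's analyzer first. Because `text` fields are lowercased by the default standard analyzer at index time, a `term` query for 'Ship' would find nothing while a `match` query for the same word would. The two helpers below are hypothetical conveniences for building both request bodies; they are a sketch, not part of the original notebook.

```python
def term_query(field, value):
    """Exact-token lookup: the value is not analyzed."""
    return {'query': {'term': {field: value}}}


def match_query(field, text):
    """Full-text lookup: the text is analyzed like the indexed field."""
    return {'query': {'match': {field: text}}}


print(term_query('body', 'ship'))
print(match_query('body', 'Ship at sea'))
```

Either dictionary can be passed as `body` to `es.search(index=index_name, body=...)` in place of `query_body`.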

In [39]:
# What about using the DFR (divergence from randomness) similarity instead of the default BM25?

request_body = {
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1,
        'index': {
            'similarity': {
                'dfr_similarity': {
                    'type': 'DFR',
                    'basic_model': 'g',
                    'after_effect': 'l',
                    'normalization': 'h2',
                    'normalization.h2.c': '3.0'
                }
            }
        }
    },
    'mappings': {
          'properties': {
              'title': {'type': 'text', 'similarity': 'dfr_similarity'},
              'body': {'type': 'text', 'similarity': 'dfr_similarity'},
              'author': {'type': 'text', 'similarity': 'dfr_similarity'}
          }
    }
}
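
The `normalization.h2.c` setting controls how strongly term frequencies are adjusted for document length. As a rough illustration, the H2 normalization as it is usually stated for DFR rescales a raw term frequency to tf * log2(1 + c * avgdl / dl), where dl is the document length and avgdl the average length. The function below is a sketch of that formula only: it is not Elasticsearch code, the names are made up, and the full DFR score also involves the basic model and the after effect, which are not reproduced here.

```python
import math


def h2_normalized_tf(tf, doc_len, avg_doc_len, c=3.0):
    """Rescale a raw term frequency with the H2 length normalization."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


# a document of average length, one twice as long, and one half as long
for doc_len in (100, 200, 50):
    print(doc_len, round(h2_normalized_tf(2, doc_len, 100), 3))
```

The same raw frequency counts for less in a longer-than-average document and for more in a shorter one, and a larger `c` raises the normalized value for every document length.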