This notebook shows how to use Elasticsearch to index and search data. We will use the [CMU Book Summary Dataset](http://www.cs.cmu.edu/~dbamman/booksummaries.html). Alternatively, the dataset's link can be found in the `BookSummaries_Link.md` file under the Data folder in Ch7. 

For this code to work, an Elasticsearch instance has to be running in the background. 
To start one, follow these steps:

Linux :

   1. Go to the `elasticsearch-X.Y.Z/bin` folder on your machine
   2. Run `./elasticsearch`
    
Windows :

   1.  Download the latest [release](https://www.elastic.co/guide/en/elasticsearch/reference/current/windows.html)
   2.  Run `.\bin\elasticsearch.bat`
   
[ElasticSearch Documentation](https://www.elastic.co/guide/index.html)
    
You should now be able to access this instance on localhost:9200



In [21]:
%%bash
wget -q http://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
tar -xzf booksummaries.tar.gz

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz

sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512 

elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK


In [22]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

Starting job # 2 in a separate thread.


In [23]:
%%bash
sleep 10

ps -ef | grep elasticsearch

curl -sX GET "localhost:9200/"

root         123     121  0 Sep12 ?        00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
daemon       124     123  1 Sep12 ?        00:00:38 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-16259466427278423838 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:fileco
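
A fixed `sleep 10` can be too short on a slow machine and needlessly long on a fast one. As a minimal sketch, the check could instead poll the HTTP endpoint until it answers. The function name `wait_for_elasticsearch` and its `fetch` hook are hypothetical names introduced here for illustration; only the standard library is used.

```python
import time
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200/", timeout=60.0,
                           interval=1.0, fetch=None):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse.

    `fetch` is a hypothetical hook (url -> status code) so the loop can be
    exercised without a live server; by default it uses urllib.
    """
    def default_fetch(u):
        with urllib.request.urlopen(u, timeout=5) as response:
            return response.status

    fetch = fetch or default_fetch
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Covers urllib.error.URLError and refused connections:
            # the server is not accepting requests yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_elasticsearch()` before creating the client returns `True` as soon as the node responds, or `False` if it never comes up within the timeout.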

In [24]:
!pip install 'elasticsearch<7.14.0'

from elasticsearch import Elasticsearch 
from datetime import datetime

# An Elasticsearch instance has to be running on this machine. The default port is 9200.
# Connect to the instance and delete any pre-existing index so the notebook starts clean.
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
if es.indices.exists(index="myindex"):
    es.indices.delete(index='myindex', ignore=[400, 404])  # Deleting existing index for now



In [39]:
# Build an index from the booksummaries dataset, using only the first 500 documents for now.
path = "booksummaries/booksummaries.txt"  # Add your path.
with open(path, encoding="utf-8") as f:
    # Counting from 1 makes the progress message fire after exactly 100, 200, ... documents.
    for count, line in enumerate(f, start=1):
        fields = line.split("\t")
        doc = {'id': fields[0],
               'title': fields[2],
               'author': fields[3],
               'summary': fields[6]
               }

        res = es.index(index="myindex", id=fields[0], body=doc)
        if count % 100 == 0:
            print("indexed 100 documents")
        if count == 500:
            break

indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
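
Indexing one document per request is fine for 500 documents but slow for the full dataset. A possible alternative, sketched here under the assumption that the `bulk` helper from `elasticsearch.helpers` is available in the installed client, is to turn the file into a stream of bulk actions. `generate_actions` is a hypothetical helper name. Unlike the loop above, it strips the trailing newline from each line.

```python
from itertools import islice


def generate_actions(lines, index_name="myindex", limit=500):
    """Yield one bulk action per tab-separated line, for at most `limit` lines."""
    for line in islice(lines, limit):
        fields = line.rstrip("\n").split("\t")
        yield {
            "_index": index_name,
            "_id": fields[0],
            "_source": {
                "id": fields[0],
                "title": fields[2],
                "author": fields[3],
                "summary": fields[6],
            },
        }


# Usage against a running instance (not executed here):
# from elasticsearch.helpers import bulk
# with open("booksummaries/booksummaries.txt", encoding="utf-8") as f:
#     indexed, errors = bulk(es, generate_actions(f))
```

Because the actions are generated lazily, the file is never loaded into memory as a whole.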


In [49]:
#Check how big the index is
res = es.search(index="myindex", body={"query": {"match_all": {}}})
print("Your index has %d entries" % res['hits']['total']['value'])

Your index has 500 entries
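
The `['hits']['total']['value']` lookup relies on the Elasticsearch 7.x response shape, where the total is a dictionary with a `value` and a `relation` (`"eq"` for an exact count, `"gte"` for a lower bound). Older servers returned a bare integer. A small helper, hypothetically named `total_hits` here, could hide that difference. This is a sketch, not part of the client library.

```python
def total_hits(response):
    """Return (count, is_exact) from a search response.

    Handles both the 7.x dictionary form {"value": N, "relation": ...}
    and the bare integer returned by earlier versions.
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        return total["value"], total.get("relation", "eq") == "eq"
    return total, True
```

For a plain document count, `es.count(index="myindex")["count"]` is another option. If the count looks too low right after indexing, calling `es.indices.refresh(index="myindex")` first makes recently indexed documents visible to search.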


In [None]:
res['hits']['hits'][0]

#res['hits']['total']['value']

In [54]:
#Try a test query. The query searches "summary" field which contains the text
#and does a full text query on that field.
res = es.search(index="myindex", body={"query": {"match": {"summary": "animal"}}})
print("Your search returned %d results." % res['hits']['total']['value'])

Your search returned 16 results.
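
To build intuition for what a full-text query does, here is a rough in-memory imitation of `match` (any term may match) and `match_phrase` (terms must be adjacent and in order). It assumes text is lowercased and split on non-word characters, which is only an approximation of the default analysis. It does no scoring and is not how Elasticsearch is implemented. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def matches_any(query, document):
    """Rough analogue of `match`: true if any query token occurs in the document."""
    doc_tokens = set(tokenize(document))
    return any(token in doc_tokens for token in tokenize(query))


def matches_phrase(query, document):
    """Rough analogue of `match_phrase`: query tokens must appear consecutively."""
    q, d = tokenize(query), tokenize(document)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))
```

Note that whole tokens are compared: under this model `animal` does not match a document that only contains `animals`, since no stemming is applied.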


In [56]:
res['hits']['hits'][0]

{'_id': '620',
 '_index': 'myindex',
 '_score': 6.8344107,
 '_source': {'author': 'George Orwell',
  'id': '620',
  'summary': ' Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, \'Beasts of England\'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. 

In [28]:
#Printing the title field and the first 100 characters of the summary field for the third result (index 2)
print(res["hits"]["hits"][2]["_source"]["title"])
print(res["hits"]["hits"][2]["_source"]["summary"][:100])


The Murders in the Rue Morgue
 The story surrounds the baffling double murder of Madame L'Espanaye and her daughter in the Rue Mor
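
Rather than indexing into `res["hits"]["hits"]` one result at a time, the hits could be flattened into rows for a quick overview. The helper below is a sketch with a hypothetical name, `summarize_hits`. It assumes every hit carries the `title`, `author`, and `summary` fields indexed earlier.

```python
def summarize_hits(response, width=100):
    """Return (score, title, author, summary prefix) for each hit, in response order."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        rows.append((
            hit["_score"],
            source["title"],
            source["author"],
            source["summary"].strip()[:width],  # drop the leading space, keep a prefix
        ))
    return rows
```

Keep in mind that a search returns at most 10 hits unless a `size` is specified, so the list can be shorter than the reported total.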


In [67]:
query = 'animal'

res = es.search(index="myindex", body={"query": {"match_phrase": {"summary": query}}})
print("Your search returned %d results:" % res['hits']['total']['value'])
hit = res["hits"]["hits"][0]

print(hit["_source"]["title"])
#to get a snippet of up to 100 characters before and after the match;
#the start is clamped at 0 because a negative start would slice from the end of the string
loc = hit["_source"]["summary"].lower().index(query)
print(f"location: {loc}")
print(hit["_source"]["summary"][:100])
print(hit["_source"]["summary"][max(loc-100, 0):loc+100])

Your search returned 16 results:
Animal Farm
location: 54
 Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he co
animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals
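
Extracting a snippet by string offset has a few edge cases: a start index below zero slices from the end of the string, a query typed in a different case is not found, and a plain substring search finds `animal` inside `animals` even though the index matched a different occurrence. A possible helper that handles these is sketched below. `make_snippet` is a hypothetical name, and whole-word matching is only an approximation of how the index tokenizes text.

```python
import re


def make_snippet(text, query, radius=100):
    """Return up to `radius` characters on each side of the first whole-word,
    case-insensitive occurrence of `query`. Falls back to the opening
    2 * radius characters when the query does not occur verbatim."""
    pattern = r"\b" + re.escape(query) + r"\b"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return text[:2 * radius]
    start = max(match.start() - radius, 0)  # clamp so the slice never wraps around
    end = match.end() + radius
    return text[start:end]
```

With this helper, the snippet line becomes `print(make_snippet(hit["_source"]["summary"], query))`.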


In [57]:
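#A match query analyzes the query text and, by default, works as an OR query over its terms;
#fuzzy matching applies only when the "fuzziness" parameter is set.
#match_phrase requires all the terms to appear together, in the same order.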
while True:
    query = input("Enter your search query: ")
    if query == "STOP":
        break
    res = es.search(index="myindex", body={"query": {"match_phrase": {"summary": query}}})
    print("Your search returned %d results:" % res['hits']['total']['value'])
    for hit in res["hits"]["hits"]:
        print(hit["_source"]["title"])
        #to get a snippet of up to 100 characters before and after the match
        summary = hit["_source"]["summary"]
        loc = summary.lower().find(query.lower())
        print(summary[:100])
        if loc != -1:  #the literal query text may not occur verbatim in the summary
            print(summary[max(loc-100, 0):loc+100])

    

Enter your search query: animal
Your search returned 16 results:
Animal Farm
 Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he co

P.S. Your Cat Is Dead
 Abandoned by his girlfriend on New Year's Eve, and still unaware that his beloved cat Tennessee (na
aware that his beloved cat Tennessee (named after the playwright Tennessee Williams) has died in an animal clinic, hopeless New York actor Jimmy Zoole is feeling depressed and unstable when he happens
Dead Air
 The first person narrative begins on 11 September 2001, and Banks uses the protagonist's conversati
only; makes sense.") and sees him described as a drug and booze fuelled, sexually promiscuous party animal. His politics are left-wing and libertarian, and he rants at every chance. Nott's various gir
The Murders in the Rue Morgue
 The story surrounds the baffling double murder of Madame L'Espanaye and her daughter in the Rue Mor
he sailor reveals that he had been keeping a captive or
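
Elasticsearch can also produce the snippet itself through the `highlight` section of the search request, which avoids locating the match by hand. The sketch below only builds the request body and reads the first fragment from a hit. The helper names `build_search_body` and `first_fragment` are hypothetical, and the fragment size of 200 is an arbitrary choice that mirrors the 100 characters on each side used above.

```python
def build_search_body(field, text, phrase=True, fragment_size=200):
    """Build a search body with a match or match_phrase query plus highlighting."""
    query_type = "match_phrase" if phrase else "match"
    return {
        "query": {query_type: {field: text}},
        "highlight": {
            "fields": {
                field: {"fragment_size": fragment_size, "number_of_fragments": 1}
            }
        },
    }


def first_fragment(hit, field):
    """Return the first highlighted fragment for `field`, or None if there is none."""
    fragments = hit.get("highlight", {}).get(field, [])
    return fragments[0] if fragments else None
```

A call such as `es.search(index="myindex", body=build_search_body("summary", "animal"))` would then return hits whose `highlight` entry wraps the matched terms in `<em>` tags by default.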