Two types of search engines:
- Generic search engines, such as Google and Bing, that crawl the web and aim to cover as much as possible by constantly looking for new webpages
- Enterprise search engines, where our search space is restricted to a smaller set of already existing documents within an organization

# A Typical Enterprise Search Pipeline

Steps to build search engine:
- Crawling/content acquisition: read data from the location where all the news articles are stored
- Text normalization: extracting the main text and discarding additional information, such as tokenizing, lowercasing, etc.
- Indexing: vectorize the text, TF-IDF, BERT.


What happens when the user types a query?
1. Query processing and execution

The search query is passed through the text normalization process as above. Once the query is framed, it’s executed, and results are retrieved and ranked according to some notion of relevance. Elasticsearch provide custom scoring functions to modify the ranking of documents retrieved for a given query.

2. Feedback and ranking

To evaluate search results and make them more relevant to the user, user behavior is recorded and analyzed, and signals such as click action on result and time spent on a result page are used to improve the ranking algorithm.

# Elasticsearch

In [1]:
from elasticsearch import Elasticsearch 
from datetime import datetime

In [3]:
#elastic search instance has to be running on the machine. Default port is 9200. 

#Call the Elastic Search instance, and delete any pre-existing index
es = Elasticsearch([{'host':'localhost','port':9200}])
if es.indices.exists(index="myindex"):
    es.indices.delete(index='myindex', ignore=[400, 404]) #Deleting existing index for now

In [15]:
#Build an index from booksummaries dataset. I am using only 500 documents for now.
path = "Data/booksummaries/booksummaries.txt" #Add your path.
count = 1
for line in open(path, encoding="utf8"):
    fields = line.split("\t")
    doc = {'id' : fields[0],
            'title': fields[2],
            'author': fields[3],
            'summary': fields[6]
          }

    res = es.index(index="myindex", id=fields[0], body=doc)
    count = count+1
    if count%100 == 0:
        print("indexed 100 documents")
#     if count == 501:
#         break

indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 100 documents
indexed 10

In [16]:
#Check to see how big is the index
res = es.search(index="myindex", body={"query": {"match_all": {}}})
print("Your index has %d entries" % res['hits']['total']['value'])

Your index has 10000 entries


In [17]:
#Try a test query. The query searches "summary" field which contains the text
#and does a full text query on that field.
res = es.search(index="myindex", body={"query": {"match": {"summary": "animal"}}})
print("Your search returned %d results." % res['hits']['total']['value'])

Your search returned 381 results.


In [18]:
#Printing the title field and summary field's first 100 characters for 2nd result
print(res["hits"]["hits"][2]["_source"]["title"])
print(res["hits"]["hits"][2]["_source"]["summary"][:100])

Animal Farm
 Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he co


In [19]:
#match query considers both exact matches, and fuzzy matches and works as a OR query. 
#match_phrase looks for exact matches.
while True:
    query = input("Enter your search query: ")
    if query == "STOP":
        break
    res = es.search(index="myindex", body={"query": {"match_phrase": {"summary": query}}})
    print("Your search returned %d results:" % res['hits']['total']['value'])
    for hit in res["hits"]["hits"]:
        print(hit["_source"]["title"])
        #to get a snippet 100 characters before and after the match
        loc = hit["_source"]["summary"].lower().index(query)
        print(hit["_source"]["summary"][:100])
        print(hit["_source"]["summary"][loc-100:loc+100])

Enter your search query: wind
Your search returned 235 results:
At the Back of the North Wind
 The book tells the story of a young boy named Diamond. He is a very sweet little boy who makes joy 
trying to sleep, Diamond repeatedly plugs up a hole in the loft (also his bedroom) wall to stop the wind from blowing in. However, he soon finds out that this is stopping the North Wind from seeing th
Sword Quest
 There is war in the kingdom of birds, which was started by the prehistoric birds known as the archa
o escape and lays an egg. When the egg hatches to reveal a fully feathered hatchling, she names him Wind-Voice. Meanwhile, the four-winged creature, named Yin Soul, is stuck between the world of the l
Clash Of The Sky Galleons
 The story is set aboard the Sky pirate ship The Galerider. Wind Jackal wants revenge against his pr

The Wind Boy
 The novel is about a boy named Kay and a girl named Gentian who are foreign children and somewhat o
 mirrors their own. In the Clear Land, they each