# Text Similarity Search with Modern NLP and Elasticsearch Service

![alt text](https://drive.google.com/uc?export=view&id=1p9T9oXfeRoCd6vtN_TPuUSBA4vDoeIeF)


**Objective:** In this notebook, we will learn how to use modern NLP techniques to build a simple prototype of a search application. After completing this tutorial, you will be able to use pretrained language models to build a search application powered by an open-source tool called [Elasticsearch](https://www.elastic.co/downloads/).

This project will be maintained here: https://github.com/dair-ai/covid_19_search_application

If you have any questions about this tutorial or run into any issues, reach out to me on [Twitter](https://twitter.com/omarsar0). If you have feedback or are interested in continuing to build on this project, please reach out via email at ellfae@gmail.com or on Twitter.

Note that this tutorial is still a draft, so you may encounter typos and rough edges.






## Load the data from Kaggle
- First, we need to download a dataset provided on Kaggle for the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). You need a Kaggle account to download the dataset. We will use the `kaggle` client to connect to Kaggle and download the dataset directly into this environment, so you don't need to upload it manually.
- Create a Kaggle account, create an API token in your account, and download the `kaggle.json` file provided. It holds the credentials you will use to access the dataset from here.
- Manually upload the `kaggle.json` file to this Colab environment. On the left-hand side you will find an option to upload files.
- Once the file has been uploaded to this environment, run the commands below:


In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge

Downloading CORD-19-research-challenge.zip to /content
100% 1.47G/1.47G [00:21<00:00, 68.8MB/s]
100% 1.47G/1.47G [00:21<00:00, 74.0MB/s]


The commands above download the entire dataset to the environment, which you can view by clicking on the "folder" option on the left-hand side.
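If the `kaggle` command fails with a credentials error, a quick sanity check of the token file can help. The sketch below is only an illustration: the helper name `check_kaggle_credentials` is made up for this example, and it assumes the token file holds the usual `username` and `key` fields.

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return ["file not found: {}".format(path)]

    with open(path) as f:
        try:
            creds = json.load(f)
        except json.JSONDecodeError:
            return ["file is not valid JSON"]

    if not isinstance(creds, dict):
        return ["file does not contain a JSON object"]

    problems = []
    # Assumption: a Kaggle API token file holds these two fields.
    for key in ("username", "key"):
        if not creds.get(key):
            problems.append("missing or empty field: {}".format(key))

    # Flag token files that other users on the machine can read.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        problems.append("file is readable by other users; consider chmod 600")

    return problems


# Usage in this notebook (after copying the file):
# print(check_kaggle_credentials(os.path.expanduser("~/.kaggle/kaggle.json")))
```

An empty list means the file looks usable; anything else points at what to fix before retrying the download.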

Unzip the files by running the command below (**note that it may take up to 5 minutes to unzip all files**):

In [0]:
%%capture
## takes ~5 mins
!unzip CORD-19-research-challenge.zip

## Import libraries
We are going to use common libraries such as TensorFlow, which are already available in this environment, so we don't need to install anything at this point.

In [0]:
from tqdm.notebook import tqdm
import os
import json
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import time

print(tf.__version__)

2.2.0-rc2


## Load the data
We are interested in two batches of articles from the CORD-19 dataset: the commercial-use and non-commercial-use subsets. Together they amount to roughly 12,000 scholarly articles.

In [0]:
## Adapted from Souvik's Kaggle notebook
## https://www.kaggle.com/souviksbhowmik/transmission-incubation-and-stability-by-souvik

working_list=[]
paths = ["comm_use_subset/comm_use_subset/pdf_json/", "noncomm_use_subset/noncomm_use_subset/pdf_json/"]
for p in paths:
    for file in tqdm(os.listdir(p)):
        #print(file)
        text_content = '' # combine title, abstract, and body text of the article
        working_dict = dict({"abstract": [], "body_text":[]})
        with open(p+file,'r') as f:  # close the file handle once parsed
            content = json.load(f)
        #print(type(content))
        working_dict['paper_id'] = content['paper_id']
        working_dict['path'] = p
        if 'metadata' in content and 'title' in content['metadata']:
            working_dict['title'] = content['metadata']['title']
            #print('found title')
            text_content = text_content  +' '+ content['metadata']['title']
        if 'abstract' in content:
            #print('found abstract')
            for abst in content['abstract'] :
                text_content = text_content+' '+abst['text']
                working_dict['abstract'].append(abst['text'])    
        if 'body_text' in content:
            #print('found body text')
            for bt in content['body_text'] :
                text_content = text_content+' '+bt['text']
                working_dict['body_text'].append(bt['text'])
        
        working_dict['text'] = text_content
        working_list.append(working_dict)

In [0]:
len(working_list)

12014
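To make the structure produced by the loading loop more concrete, here is a small sketch that applies the same flattening logic to a single hand-made record. The function name `flatten_article` and the sample record are made up for illustration and are not part of the dataset or of the original loading code.

```python
def flatten_article(content, path):
    """Collect title, abstract and body paragraphs of one parsed CORD-19 JSON record."""
    record = {"abstract": [], "body_text": [], "paper_id": content["paper_id"], "path": path}
    pieces = []

    title = content.get("metadata", {}).get("title")
    if title is not None:
        record["title"] = title
        pieces.append(title)

    for field in ("abstract", "body_text"):
        for segment in content.get(field, []):
            record[field].append(segment["text"])
            pieces.append(segment["text"])

    # The loading loop prefixes every piece with a space; join() produces the
    # same text without the leading space.
    record["text"] = " ".join(pieces)
    return record


# Hypothetical record shaped like the files in pdf_json/ (not real data).
sample = {
    "paper_id": "abc123",
    "metadata": {"title": "A made-up title"},
    "abstract": [{"text": "First abstract paragraph."}],
    "body_text": [{"text": "Body paragraph one."}, {"text": "Body paragraph two."}],
}

record = flatten_article(sample, "comm_use_subset/comm_use_subset/pdf_json/")
print(record["text"])
print(len(record["body_text"]))
```

Each entry of `working_list` has this shape: paragraph lists under `abstract` and `body_text`, plus a single concatenated `text` string.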

Preview the articles:

In [0]:
working_list[0:2]

[{'abstract': ["Fusion between the viral and target cell membranes is an obligatory step for the infectivity of all enveloped virus, and blocking this process is a clinically validated therapeutic strategy. Viral fusion is driven by specialized proteins which, although specific to each virus, act through a common mechanism, the formation of a complex between two heptad repeat (HR) regions. The HR regions are initially separated in an intermediate termed ''prehairpin'', which bridges the viral and cell membranes, and then fold onto each other to form a 6-helical bundle (6HB), driving the two membranes to fuse. HR-derived peptides can inhibit viral infectivity by binding to the prehairpin intermediate and preventing its transition to the 6HB. The antiviral activity of HR-derived peptides differs considerably among enveloped viruses. For weak inhibitors, potency can be increased by peptide engineering strategies, but sequence-specific optimization is time-consuming. In seeking ways to inc

In [0]:
## Uncomment the lines below if you want to quickly preview the data using a pandas dataframe
## we don't need it here because we will be directly focusing on storing the data in json format
## to an Elasticsearch instance on the cloud.

## data = pd.DataFrame(working_list)
## data.head()

## Connect to an Elasticsearch instance on the Cloud
- To build the text similarity search application, we need a scalable search engine in which we can store the embeddings of the text we want to query. We are going to use Elasticsearch on the cloud. The idea is simple: we store dense vectors for pieces of text from the scholarly articles. In this notebook, we mostly focus on storing the embeddings for the body text of the articles. Once those are stored, we can search using a text query. At query time, the text query is also transformed into an embedding and then compared with the embeddings stored in Elasticsearch. We use `cosine similarity` to compare the query embedding with the body text embeddings stored in Elasticsearch.
- First, set up your 14-day free trial account on elastic.co using the link: https://www.elastic.co/downloads/
- Use the **cloud** option as we want to keep using this online environment. This whole exercise can be replicated on your local machine, but for the purpose of this notebook, we would like to test using the cloud option.
- Once you have registered, the setup process is relatively easy. You will receive an email with a link to set up your cloud instance; just follow the wizard.
- At the end you will have access to an Elasticsearch cloud instance as well as a Kibana instance. We won't use Kibana here, but it can be a fun way to explore the data and create visualizations.
- In your account, you will need to set up a password for your Elasticsearch instance. Generate one and create a `json` file called `elastic.json` containing your credentials:

```json
{
"user": "elastic", 
"password": "<password>",
"cloud_id": "<cloud_id>"
}
```

- Then upload that json file to this environment, as we did with the `kaggle.json` file. Once the file is uploaded, we can read its contents using the code below:


In [0]:
with open("elastic.json") as elastic_file:
    ELASTIC_SETTINGS = json.loads(elastic_file.read().strip())
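Before connecting, it can be worth checking that the settings file was actually filled in. The helper below is a hypothetical addition (the name `missing_settings` is not part of the original notebook); it only inspects a dictionary shaped like the `elastic.json` template shown above.

```python
REQUIRED_KEYS = ("user", "password", "cloud_id")


def missing_settings(settings):
    """Return the required keys that are absent, empty, or still a <placeholder>."""
    missing = []
    for key in REQUIRED_KEYS:
        value = str(settings.get(key, ""))
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing


# Example where one placeholder was never replaced (all values are made up).
example = {"user": "elastic", "password": "<password>", "cloud_id": "my-deployment:abc"}
print(missing_settings(example))
```

In this notebook the call would be `missing_settings(ELASTIC_SETTINGS)`, and an empty list means all three fields are present.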

Since we are going to use Elasticsearch to store our data, we need to provide a schema, referred to as a `mapping`, that tells Elasticsearch how to structure the data we store in it. The mapping also specifies the type of each field we are storing. Elasticsearch is a data store that holds documents in `JSON` format. Fortunately, the data already comes in that format, so all we need to do is restructure it slightly before sending it to our Elasticsearch instance. That's what we will focus on in this section.
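The mapping files used in this notebook are not reproduced in the text, so the following is only a guess at what a mapping for the body paragraphs could look like. The field names are taken from the indexing code later in the notebook; the field types, other than `dense_vector`, are assumptions, and the real `body_index.json` may differ.

```python
import json

# Hypothetical mapping for an index of body paragraphs. The real
# body_index.json used by this notebook may contain different settings.
body_mapping = {
    "mappings": {
        "properties": {
            "paper_id": {"type": "keyword"},
            "body_text": {"type": "text"},
            # 512 is the embedding size produced by the sentence encoder
            # used in this notebook.
            "body_text_vector": {"type": "dense_vector", "dims": 512},
        }
    }
}

print(json.dumps(body_mapping, indent=2))
```

The important part is the `dense_vector` field: its `dims` value has to match the length of the vectors that are indexed into it.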

First, let's download to this environment the json files that contain the mapping information that will determine the structure of our data when stored into Elasticsearch:

In [0]:
## mappings for our indices
!wget https://raw.githubusercontent.com/dair-ai/odsc_2020_nlp/master/articles_index.json -O articles_index.json
!wget https://raw.githubusercontent.com/dair-ai/odsc_2020_nlp/master/body_index.json -O body_index.json


--2020-04-12 12:46:58--  https://raw.githubusercontent.com/dair-ai/odsc_2020_nlp/master/articles_index.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 826 [text/plain]
Saving to: ‘articles_index.json’


2020-04-12 12:46:58 (154 MB/s) - ‘articles_index.json’ saved [826/826]

--2020-04-12 12:47:01--  https://raw.githubusercontent.com/dair-ai/odsc_2020_nlp/master/body_index.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 538 [text/plain]
Saving to: ‘body_index.json’


2020-04-12 12:47:01 (101 MB/s) - ‘body_index.json’ saved [538/538]

Now install the Python Elasticsearch client:

In [0]:
!pip install elasticsearch

Collecting elasticsearch
  Downloading https://files.pythonhosted.org/packages/cc/cf/7973ac58090b960857da04add0b345415bf1e1741beddf4cbe136b8ad174/elasticsearch-7.6.0-py2.py3-none-any.whl (88kB)
Installing collected packages: elasticsearch
Successfully installed elasticsearch-7.6.0


The code block below creates a client connection to your Elasticsearch instance on the cloud. This is where we use the credentials from the `elastic.json` file loaded earlier:

In [0]:
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch(
    cloud_id=ELASTIC_SETTINGS["cloud_id"],
    http_auth=(ELASTIC_SETTINGS["user"], ELASTIC_SETTINGS["password"]),
)
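Creating the client object does not by itself prove that the cluster is reachable. A small check along the following lines may save some debugging later. The function `check_connection` and the `FakeClient` stand-in are made up for this sketch; the only client methods it relies on are `ping()` and `info()`.

```python
def check_connection(client):
    """Return True if the cluster answers a ping, printing its version if so."""
    if not client.ping():
        print("Could not reach the cluster; check cloud_id and password.")
        return False
    info = client.info()
    print("Connected to Elasticsearch {}".format(info["version"]["number"]))
    return True


class FakeClient:
    """Stand-in used only to demonstrate the function without a real cluster."""

    def ping(self):
        return True

    def info(self):
        return {"version": {"number": "7.6.0"}}


check_connection(FakeClient())
# With the real client from the cell above: check_connection(es)
```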

## Load TensorFlow Pretrained Model
We want to use sentence embedding techniques to embed the scholarly texts. We will use the [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder-large/5) from TensorFlow Hub. You could also use other pretrained models that produce such representations, for example those from Hugging Face Transformers. Load the pretrained model:

In [0]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

Let's generate an embedding for one abstract paragraph as an example.

In [0]:
sample_doc = working_list[0]['abstract'][0]
print(sample_doc)

Fusion between the viral and target cell membranes is an obligatory step for the infectivity of all enveloped virus, and blocking this process is a clinically validated therapeutic strategy. Viral fusion is driven by specialized proteins which, although specific to each virus, act through a common mechanism, the formation of a complex between two heptad repeat (HR) regions. The HR regions are initially separated in an intermediate termed ''prehairpin'', which bridges the viral and cell membranes, and then fold onto each other to form a 6-helical bundle (6HB), driving the two membranes to fuse. HR-derived peptides can inhibit viral infectivity by binding to the prehairpin intermediate and preventing its transition to the 6HB. The antiviral activity of HR-derived peptides differs considerably among enveloped viruses. For weak inhibitors, potency can be increased by peptide engineering strategies, but sequence-specific optimization is time-consuming. In seeking ways to increase potency wi

In [0]:
embeddings = embed([sample_doc])
print(embeddings.shape)

(1, 512)


You can see that the embedding generated for the piece of text has shape (1, 512). These vector representations will be stored in Elasticsearch using the `dense_vector` type, alongside the original pieces of text.

## Index Sentence Embeddings into Elasticsearch
In this section, we will finally be indexing/storing the documents into Elasticsearch. This will be done in batches and via the client we set up before. A few settings first:

In [0]:
ARTICLE_INDEX_NAME = "covid_articles"
ARTICLE_INDEX_FILE = "articles_index.json"

BODY_INDEX_NAME = "body_vectors"
BODY_INDEX_FILE = "body_index.json"

BATCH_SIZE = 1000
SEARCH_SIZE = 10

In [0]:
## Function to embed text
def embed_text(text):
    vectors = embed(text)
    return [vector.numpy().tolist() for vector in vectors]

In [0]:
## docs format:
## [{abstract: [...], "body_text": [...], title: "", path: "", "paper_id":"", "text":""},
##  {abstract: [...], "body_text": [...], title: "", path: "", "paper_id":"", "text":""},
##  ...]

## Function to index abstracts/body paragraphs and their vector representations
def index_batch_vectors(docs, field, index_name):
    requests = []

    ## index body vectors for every document in batch
    for doc in docs:
        paragraphs = list(doc[field])

        ## TODO: Embed individually below so as to avoid OOM error
        vectors = embed_text(paragraphs)

        for j in range(len(paragraphs)):
            request = {"_op_type": "index", 
                       "_index": index_name,
                       field+"_vector": vectors[j],
                       field: paragraphs[j],
                       "paper_id": doc["paper_id"]}
            requests.append(request)
    helpers.bulk(es, requests)

def index_batch(docs):
    ## embed the titles of all documents in the batch
    titles = [doc.get("title", "") for doc in docs]
    title_vectors = embed_text(titles)

    requests = []
    for i, doc in enumerate(docs):
        request = dict(doc)  ## copy so the original document is not modified
        request["_op_type"] = "index"
        request["_index"] = ARTICLE_INDEX_NAME
        request["title_vector"] = title_vectors[i]
        requests.append(request)
    helpers.bulk(es, requests)

    ## index body vectors
    index_batch_vectors(docs, "body_text", BODY_INDEX_NAME)

In the function directly above (`index_batch`), we are indexing the original scholarly text into an index called `covid_articles`. This includes all the text fields and the embedding of the titles. 

We also need to index the embeddings for the individual abstract segments and body text segments. Remember that, unlike the title, the abstract and the body text can include more than one paragraph or piece of text, which means that we need to store these separately. For this, we created a function above called `index_batch_vectors` that produces the embeddings for each paragraph in the body text or abstract. These are then stored separately in an index called `body_vectors`, since in the example above we are mostly interested in the body of the article and its paragraphs.
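To see what the per-paragraph bulk actions look like without contacting Elasticsearch, here is a sketch of the same idea written as a generator. The names `paragraph_actions`, `fake_embed`, and the index name `abstract_vectors` are hypothetical; `fake_embed` merely stands in for `embed_text` so the example runs without the TensorFlow model.

```python
def paragraph_actions(docs, field, index_name, embed_fn):
    """Yield one bulk 'index' action per paragraph found in doc[field]."""
    for doc in docs:
        paragraphs = list(doc[field])
        if not paragraphs:
            continue  # nothing to embed for this document
        vectors = embed_fn(paragraphs)
        for paragraph, vector in zip(paragraphs, vectors):
            yield {
                "_op_type": "index",
                "_index": index_name,
                field: paragraph,
                field + "_vector": vector,
                "paper_id": doc["paper_id"],
            }


def fake_embed(texts):
    """Placeholder for embed_text: returns a tiny made-up vector per text."""
    return [[float(len(t)), 0.0] for t in texts]


docs = [{"paper_id": "p1", "abstract": ["one", "three"], "body_text": []}]
actions = list(paragraph_actions(docs, "abstract", "abstract_vectors", fake_embed))
print(len(actions))
print(actions[0]["abstract_vector"])
```

Because `helpers.bulk` accepts any iterable of actions, a generator like this could be passed to it directly instead of building the full `requests` list in memory first.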

As an exercise, you can reuse the function to do the same for all the abstracts, but you will need to create another index to store them. Remember that an index in Elasticsearch is a collection of documents that have similar properties, so it makes sense to keep these collections in separate indices. In essence, if you want to retrieve or search the original text, use the `covid_articles` index; if you want to query individual body paragraphs, which we will do below, search the `body_vectors` index.

Let's implement one more function to create the indices and start the batch processing to index the articles into Elasticsearch:

In [0]:
## index data in batches
def index_data(article):
   
    with open(ARTICLE_INDEX_FILE) as index_file:
        source = index_file.read().strip()
        es.indices.create(index=ARTICLE_INDEX_NAME, body=source)
    
    with open(BODY_INDEX_FILE) as index_file:
        source = index_file.read().strip()
        es.indices.create(index=BODY_INDEX_NAME, body=source)

    docs = []
    count = 0

    for line in article:
        #print (line)
        #doc = json.loads(line))
        docs.append(line)
        count += 1

        if count % BATCH_SIZE == 0:
            index_batch(docs)
            docs = []
            print("Indexed {} documents.".format(count))
            
    if docs:
        index_batch(docs)
        print("Indexed {} documents.".format(count))

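    ## refresh both indices so the new documents become searchable right away
    es.indices.refresh(index=ARTICLE_INDEX_NAME)
    es.indices.refresh(index=BODY_INDEX_NAME)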
    print("Done indexing.")
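The batching pattern inside `index_data` (collect items, flush every `BATCH_SIZE`, then flush the remainder) can be isolated into a small generator. This is just an illustrative sketch; the name `batched` is chosen for this example and is not used elsewhere in the notebook.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


print([len(b) for b in batched(range(2500), 1000)])
```

With 2500 items and a batch size of 1000 this yields two full batches and one final batch of 500, which mirrors how the last `index_batch(docs)` call in `index_data` handles the leftover documents.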

And finally, let's run the indexing operation with the code below. We have selected 2000 articles to avoid waiting too long. If you have the time, you can choose to index all the articles.

Note that if you choose to index all articles, it will take longer and you may run into an out-of-memory (OOM) error. It's a simple fix, but I will leave it as an exercise. I have provided a hint in the `index_batch_vectors` function above, marked with a `TODO` tag. The issue is that this environment doesn't allow tensors bigger than a certain size, and one of our batches exceeds it.

If you don't want that error, just run the code below. I will provide the fix later.

In [0]:
index_data(working_list[0:2000])

Indexed 1000 documents.
Indexed 2000 documents.
Done indexing.
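One possible reading of the `TODO` hint in `index_batch_vectors` is to pass the paragraphs to the model a few at a time instead of all at once. The sketch below shows that idea under stated assumptions: `embed_in_chunks` and the `chunk_size` value are hypothetical, and `fake_embed` stands in for `embed_text` so the example runs without the model. Whether this resolves the OOM error in a given environment has not been verified here.

```python
def embed_in_chunks(texts, embed_fn, chunk_size=32):
    """Embed texts a few at a time so no single call sees a very large batch."""
    vectors = []
    for start in range(0, len(texts), chunk_size):
        vectors.extend(embed_fn(texts[start:start + chunk_size]))
    return vectors


calls = []


def fake_embed(texts):
    """Placeholder for embed_text that records how many texts each call received."""
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


vectors = embed_in_chunks(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, chunk_size=2)
print(calls)
print(len(vectors))
```

Setting `chunk_size=1` corresponds to embedding each paragraph individually, as the hint suggests; larger values trade memory for fewer model calls.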


## Text Similarity Search
If the indexing above finished successfully, we are now ready to test text similarity using the embeddings we have stored into Elasticsearch. 

Elasticsearch has a very rich API that supports many different types of search. Here we are going to focus on a simple search scenario.

First, we ask the user to input a query. The query is then encoded into its dense vector representation using the same pretrained model from TensorFlow. Then we query Elasticsearch for similar body text by computing the cosine similarity between the stored body text vectors and the query vector:

```python
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, doc['body_text_vector']) + 1.0",
            "params": {"query_vector": query_vector}
        }
    }
}
```
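The `+ 1.0` in the script is there because Elasticsearch does not accept negative scores, while cosine similarity ranges from -1 to 1. The following plain-Python sketch reproduces the arithmetic on toy two-dimensional vectors; the function names are made up for illustration and nothing here talks to Elasticsearch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def shifted_score(query_vector, doc_vector):
    """Mirror of the script above: cosine similarity shifted into [0, 2]."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


query = [1.0, 0.0]
print(shifted_score(query, [2.0, 0.0]))   # same direction as the query
print(shifted_score(query, [0.0, 3.0]))   # orthogonal to the query
print(shifted_score(query, [-1.0, 0.0]))  # opposite direction
```

The three calls give scores of 2.0, 1.0 and 0.0 respectively, so a score close to 2 in the search results indicates a paragraph whose embedding points in almost the same direction as the query embedding.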

If you would like to know more about this Elasticsearch functionality, check the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-query.html#script-query-ex-request) for further details. A deeper understanding of Elasticsearch is left as an exercise and is out of scope for this tutorial.

In [0]:
## Function to handle the query

def run_query_loop():
    while True:
        try:
            handle_query()
        except KeyboardInterrupt:
            return

            
def handle_query():
    query = input("Enter query: ")

    embedding_start = time.time()
    query_vector = embed_text([query])[0]
    embedding_time = time.time() - embedding_start

    script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, doc['body_text_vector']) + 1.0",
                "params": {"query_vector": query_vector}
            }
        }
    }

    search_start = time.time()
    response = es.search(
        index=BODY_INDEX_NAME,
        body={
            "size": SEARCH_SIZE,
            "query": script_query,
            "_source": {"includes": ["paper_id", "body_text"]}
        }
    )
    search_time = time.time() - search_start

    print()
    print("{} total hits.".format(response["hits"]["total"]["value"]))
    print("embedding time: {:.2f} ms".format(embedding_time * 1000))
    print("search time: {:.2f} ms".format(search_time * 1000))
    for hit in response["hits"]["hits"]:
        print("id: {}, score: {}".format(hit["_id"], hit["_score"]))
        print(hit["_source"])
        print()
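Some body paragraphs are long, which makes the raw printout hard to scan. As an optional variation, the hits could be shortened before printing. The helper `summarize_hits` and the hand-made `fake_response` below are hypothetical; the response layout simply follows the fields that `handle_query` reads.

```python
def summarize_hits(response, max_chars=80):
    """Return (paper_id, score, shortened text) tuples from a search response."""
    rows = []
    for hit in response["hits"]["hits"]:
        text = hit["_source"]["body_text"]
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        rows.append((hit["_source"]["paper_id"], hit["_score"], text))
    return rows


# Hand-made response with the same layout as the one used in handle_query.
fake_response = {
    "hits": {
        "total": {"value": 2},
        "hits": [
            {"_id": "1", "_score": 1.95,
             "_source": {"paper_id": "p1", "body_text": "Herbal treatments"}},
            {"_id": "2", "_score": 1.41,
             "_source": {"paper_id": "p2", "body_text": "x" * 200}},
        ],
    }
}

for paper_id, score, text in summarize_hits(fake_response):
    print("{}  {:.2f}  {}".format(paper_id, score, text))
```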

You are now ready to enjoy your search application. Run the code below and input your phrase to search for:

In [0]:
run_query_loop()

Enter query: herbal treatment

10000 total hits.
embedding time: 22.38 ms
search time: 1232.09 ms
id: qtiSbnEBtvu42qbNUfex, score: 1.9520377
{'body_text': 'Herbal treatments', 'paper_id': 'b9f617f6229dcea28f092bc9fa9c2bbe3c8e9b81'}

id: a9mTbnEBtvu42qbN2x5K, score: 1.437657
{'body_text': 'Geranium macrorrhizum herb ----', 'paper_id': 'f8c2f1a1903f8370c15b5be2e4d32e91f2641b03'}

id: mdmTbnEBtvu42qbN2x1J, score: 1.4076636
{'body_text': 'Anti-inflammatory', 'paper_id': '8c73602b18bc6b2fb17e8f37552995f6e310c6fd'}

id: ctmTbnEBtvu42qbN2x5K, score: 1.396433
{'body_text': 'Echinacea purpurea herb + ---', 'paper_id': 'f8c2f1a1903f8370c15b5be2e4d32e91f2641b03'}

id: hNmTbnEBtvu42qbN2x1J, score: 1.3913385
{'body_text': 'The use of traditional medicine is still deep ingrained in some cultures even today; therefore thousands of people rely on the therapeutic potential of plants for treating certain diseases in their daily lives [76] [77] [78] [79] . Benefits from of this kind of therapy are not un

## References:

- [Elastic Stack and Product Documentation](https://www.elastic.co/guide/index.html)
- [CORD-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
- [transmission, incubation, and stability by Souvik](https://www.kaggle.com/souviksbhowmik/transmission-incubation-and-stability-by-souvik)
- [TensorFlow Hub](https://tfhub.dev/google/universal-sentence-encoder-large/5)
- [Text Similarity Search in Elasticsearch (blog)](https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch)
- [Text Embeddings in Elasticsearch](https://github.com/jtibshirani/text-embeddings)