<a href="https://colab.research.google.com/github/antonpolishko/A_colab_collection/blob/master/CoronaWhy_Elasticsearch_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example Elasticsearch Queries and Convenience Functions for http://search.coronawhy.org

CoronaWhy has ingested the CORD-19 dataset into an Elasticsearch instance deployed on our server. You can use it to make queries, get relevant documents, do some data exploration, and check the quality of keywords (among other things). I can't provide a guide for everything possible with Elasticsearch, but in this notebook I'll try to cover the most common use cases so you can get started, without needing to learn an entirely new package and query language. 

## Installation

Make sure you're running Python 3 and, if you can, install the Elasticsearch client in your virtualenv with:
pip install elasticsearch==7.6.0

In [None]:
# Make sure you're running python3 for this, I haven't tested python2! 
!pip install elasticsearch==7.6.0

from elasticsearch import helpers, Elasticsearch
ESURL = "http://elastic:changeme@search.coronawhy.org:80"
es = Elasticsearch(ESURL, Port=80) # Do not change this! 

Collecting elasticsearch==7.6.0
  Downloading elasticsearch-7.6.0-py2.py3-none-any.whl (88kB)
Installing collected packages: elasticsearch
Successfully installed elasticsearch-7.6.0


## What does the index look like? 

There are indexes at three levels: sentence, paragraph, and full article. They are:
1. v9sentences (Sentence level annotations)
2. v9sections (Paragraph level annotations)
3. v9papers (Document level annotations)

Which index you use will depend on your use case. If you want to find papers that are related to each other, use v9papers. If you want to find a paragraph of text that talks about PCR results or the effect of smoking on COVID-19 comorbidity, use v9sections. If you want only the number of patients involved in studies, use v9sentences and filter by keywords. 

As of writing this document, only v9sentences is available. We will make the other two available as soon as possible. :) 

Please see the documentation here to learn about how the data is processed and which fields are available:

https://drive.google.com/open?id=1FesFFx5LLrWCUTBLFBbTQKCwCWFGN8z7


Finally, let's take a look at the fields available in our index.

In [None]:
import json
import requests

# get mapping fields for a specific index:
index = "v9sentences"
#ESURL = "http://elastic:changeme@search.coronawhy.org:80"
elastic_url = ESURL 
mapping_fields_request = "_mapping/field/*?ignore_unavailable=false&allow_no_indices=false&include_defaults=true"
mapping_fields_url = "/".join([elastic_url, index, mapping_fields_request])
print(mapping_fields_url)
response = requests.get(mapping_fields_url)

# parse the data:
data = response.content.decode()
parsed_data = json.loads(data)
keys = sorted(parsed_data[index]["mappings"].keys())
print("index= {} has a total of {} keys".format(index, len(keys)))

# keys of the fields, numbered as (position, name) pairs:
fields = list(enumerate(keys))
print(fields)

# 87 keys!? Please read!

Yes -- but most of these are duplicates. You need to understand the difference between a text field and a keyword field. 

Text fields ("language", "cord_uid") are ingested into Elasticsearch, split on whitespace, and analyzed. You can do basic search over these fields. 

Keyword fields ("language.keyword", "cord_uid.keyword") are fields that Elasticsearch does not analyze; they are matched as exact values. Lucky for you, we provide these fields pre-analyzed! We create lists of lemmas (basic word forms) for every sentence in the entire corpus, with common words, punctuation, and pure numbers removed. That means instead of 

"The 25 rocks are insanely big, and COVID-19 is bad"... 
the lemma column has:
"rock, insane, big, covid-19, bad"

If you're looking for a specific number, use the non-keyword field. If you're looking for keywords, use the keyword field. It's also possible to use both! 
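
To see how much of that key count is just duplication, here's a minimal sketch that groups field names into base fields and their ".keyword" twins. The function name `split_field_names` and the sample list are made up for illustration, and I'm assuming that any key starting with an underscore is an Elasticsearch metadata field that you don't want to search.

```python
def split_field_names(keys):
    """Group mapping keys into base fields and their '.keyword' twins."""
    keyword_fields = sorted(k for k in keys if k.endswith(".keyword"))
    # Everything else, skipping keys that start with "_" (assumed to be metadata)
    base_fields = sorted(
        k for k in keys if not k.endswith(".keyword") and not k.startswith("_")
    )
    return base_fields, keyword_fields


# Hypothetical sample; in this notebook you would pass the `keys` list built earlier.
sample_keys = ["_id", "cord_uid", "cord_uid.keyword", "language", "language.keyword", "lemma"]
base_fields, keyword_fields = split_field_names(sample_keys)
print(base_fields)     # ['cord_uid', 'language', 'lemma']
print(keyword_fields)  # ['cord_uid.keyword', 'language.keyword']
```

Note that a base field is not necessarily a text field: list fields such as "lemma" have no ".keyword" twin because they are already keyword fields.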

!!!!!PLEASE NOTE!!!!! UMLS, lemma, and all the other NER fields and fields that contain lists are ALREADY keyword fields. You cannot search them using sentences; they must be searched with lists. An example of this will be provided below.

Every sentence has a unique identifier, sentence_id. For sections, that's section_id. For documents, that's cord_uid, which is the same cord_uid as in the dataset provided by AI2 in CORD-19. 

# The Search Methods

There are several search methods I've written here to make your life more convenient when working with our ES instance. This is not exhaustive, but should get you started. For additional help, please use the Elasticsearch documentation to create queries: https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html

## More Like This

A More Like This (MLT) query finds documents similar to the text you give it and ranks them by relevance according to the BM25 algorithm. 

Use this query if you're trying to answer the question: "Are there other sentences, sections, or documents like this one?"

Keep in mind, if you're searching a keyword field, you should provide lemmatized text. We'll build this into our search engine later so you don't have to worry about it, so stay tuned for updates.

Additional info here: https://qbox.io/blog/mlt-similar-documents-in-elasticsearch-more-like-this-query
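
If you're curious what BM25 actually rewards, here's a toy sketch of the textbook formula on a three-document corpus. It's only meant to build intuition: the function name `bm25_score`, the sample corpus, and the constants `k1=1.2` and `b=0.75` are my assumptions, and the scoring inside Elasticsearch may differ in its details.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against the query terms with textbook BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        # Number of documents in the corpus that contain the term
        doc_freq = sum(1 for doc in corpus if term in doc)
        if doc_freq == 0:
            continue
        # Rare terms get a higher inverse document frequency
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        tf = doc_terms.count(term)
        # Term frequency is dampened and normalized by document length
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical, already lemmatized toy corpus
corpus = [
    ["comorbidity", "death", "coronavirus"],
    ["coronavirus", "vaccine", "trial"],
    ["hospital", "staff", "schedule"],
]
query = ["comorbidity", "coronavirus"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third matches nothing and scores zero.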

In [None]:
def more_like_this(query, match_phrase="", index="v9sentences", match_field="", size=1000, fields=["sentence"], min_term_freq=1, max_query_terms=12):
    
    # This function makes a query to Elasticsearch and returns up to `size`
    # (default 1000) of the most similar documents based on a query document
    # and, optionally, a phrase that must occur in a given field.
    #
    # --VARIABLE DEFINITIONS--
    # query: the main text you want to measure relevance against.
    #        Can be a word, sentence, paragraph, or whole text.
    #
    # fields: Optional list of the fields you want to search in. Full-text
    #         searches only work with ["sentence"], while searching for lists
    #         should work in most other fields (keyword fields).
    #
    # match_phrase: Optional string. The search will ONLY return documents
    #               where the whole phrase is matched in match_field.
    #
    # match_field: The field that must contain match_phrase.
    #
    # size: Maximum number of documents to return.
    
    if len(match_phrase):
        search_body = {
                    "size": size,
                    "query": {
                       "bool": {
                          "must": [
                            {
                             "more_like_this": {
                                "fields" : fields,
                                "like" : query,
                                "min_term_freq" : min_term_freq,
                                "max_query_terms" : max_query_terms
                             }
                            },
                             {
                                "match_phrase": {
                                   match_field: match_phrase
                                }
                             }
                          ]
                       }
                    }
                 }

    else:
        search_body = {
            "size": size,
            "query": {
                "more_like_this": {
                    "fields" : fields,
                    "like" : query,
                    "min_term_freq" : min_term_freq,
                    "min_doc_freq": 1
                }
            }
        }
    
    res = es.search(index=index, body=search_body)
    return [hit["_source"] for hit in res["hits"]["hits"]]

### Test MLT

Alright, let's say I want to find sentences related to comorbidity, and I only want 5 documents. Describe what you're looking for in a summarized way. Do not ask a question! 

In [None]:
results = more_like_this(query="Comorbidity, death, coronavirus", 
                         size=5)

How many hits did we get? We're expecting 5. 

In [None]:
len(results)

Let's look at the top hit. We only want to know what the sentence says, plus the IDs of its UMLS entities and its lemmas. 

In [None]:
print(results[0]["sentence"])
print(results[0]["UMLS_IDS"])
print(results[0]["lemma"])

Excellent. Now let's get documents by matching a whole phrase instead of just single words. This time, we want to find "survival without comorbidity" in the sentence. 

In [None]:
results = more_like_this(query="Comorbidity, death, coronavirus", 
                         match_field="sentence",
                         match_phrase="survival without comorbidity",
                         size=5)
print(len(results))

Notice we wanted 5, but only got 4 results. That's because the phrase only occurs in 4 sentences within our corpus. Let's take a look at those sentences in order of MLT relevance.

In [None]:
for i in results:
    print(i["sentence"] + "\n" + i["sentence_id"] + "\n")

Oho! Looks like we've found some duplicate sentences, which means some documents were indexed twice. (Ignore the duplicates, we're doing our best!) Can we search keyword fields in the same way? Yep!

In [None]:
results = more_like_this(fields=["UMLS"],
                         query="Comorbidity, death, coronavirus",
                         size=5)
for i in results:
    print(i["sentence"] + "\n" + i["sentence_id"] + "\n")
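
Until the duplicates are cleaned up in the index, you can drop them on your side. Here's a minimal sketch that keeps the first hit for each distinct sentence; the function name `drop_duplicate_sentences` and the sample hits are hypothetical, and I'm assuming that two hits with the same text (ignoring case and extra whitespace) are duplicates.

```python
def drop_duplicate_sentences(hits, key="sentence"):
    """Keep the first hit for each distinct value of `key`, preserving order."""
    seen = set()
    unique_hits = []
    for hit in hits:
        # Normalize case and whitespace so trivially different copies collapse
        fingerprint = " ".join(str(hit.get(key, "")).lower().split())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique_hits.append(hit)
    return unique_hits


# Hypothetical hits shaped like the dicts returned by more_like_this()
sample_hits = [
    {"sentence": "Survival without comorbidity was higher.", "sentence_id": "a1"},
    {"sentence": "Survival without  comorbidity was higher.", "sentence_id": "b7"},
    {"sentence": "Comorbidity increased mortality.", "sentence_id": "c3"},
]
unique_hits = drop_duplicate_sentences(sample_hits)
print([hit["sentence_id"] for hit in unique_hits])  # ['a1', 'c3']
```

Because the hits come back in relevance order, keeping the first copy also keeps the ranking intact.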

## Using match_phrase to filter by cord_uid

If we know the cord_uid of the paper we want to grab, and only want to search within that paper, we can provide the cord_uid as the match_phrase, and set match_field to "cord_uid" to achieve that functionality. 

In [None]:
results = more_like_this(fields=["UMLS"],
                         query="Comorbidity, death, coronavirus", 
                         match_field="cord_uid",
                         match_phrase="ow2xqhmp",
                         size=3)
print(len(results))

In [None]:
for i in results:
    print(i["sentence"] + "\n" + 
          "Cord_uid: " + f'\x1b[31m{i["cord_uid"]}\x1b[0m' + "\n" + 
          "Sentence ID: " + i["sentence_id"] + "\n")


## Filter query

If you know what you're looking for and you want to pull all documents that contain a specific word or term, use this type of search. We're not interested in relevance here, just a simple keyword filter! 

In [None]:
def simple_filter(terms="covid-19", field="UMLS", size=10, match_all=False, index="v9sentences"):
    
    # This method searches the index and returns only documents where the
    # field contains the terms you're looking for. You can enter a string, or
    # a list of strings, and the method will handle them accordingly.
    #
    # match_all: only used when terms is a list. If False, a document matches
    #            when it contains ANY of the terms; if True, it must contain
    #            ALL of them.
    
    if isinstance(terms, str):
        # A single exact term, applied as a filter (no relevance scoring)
        clauses = [
            { "term":  { field: terms }},
            #{ "range": { "publish_date": { "gte": "2015-01-01" }}} # Get papers published after date
        ]
    elif isinstance(terms, list):
        if match_all:
            # One term clause per keyword: every keyword must be present
            clauses = [{ "term": { field: term }} for term in terms]
        else:
            # A terms clause matches documents containing any of the keywords
            clauses = [{ "terms": { field: terms }}]
    else:
        raise TypeError("You need to provide a list or a string for the terms variable!")

    search_body = {
        "size": size,
        "query": { "bool": { "filter": clauses }}
    }

    res = es.search(index=index, body=search_body)
    
    if len(res["hits"]["hits"]) == 0:
        print("No hits!")
    
    return [hit["_source"] for hit in res["hits"]["hits"]]
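
The commented-out range clause in simple_filter hints at filtering by date. Here's a minimal sketch of how such a search body could be put together; the helper name `build_filter_body` is made up, and I'm assuming the index really has a `publish_date` field, so check the mapping before relying on it.

```python
import json


def build_filter_body(field, term, published_after=None, size=10):
    """Build a bool/filter search body, optionally restricted by publish date."""
    filters = [{"term": {field: term}}]
    if published_after is not None:
        # `publish_date` is taken from the commented-out line in simple_filter;
        # it is an assumption that this field exists in the index.
        filters.append({"range": {"publish_date": {"gte": published_after}}})
    return {"size": size, "query": {"bool": {"filter": filters}}}


body = build_filter_body("UMLS", "covid-19", published_after="2015-01-01")
print(json.dumps(body, indent=2))
# To run it against the index: res = es.search(index="v9sentences", body=body)
```

All clauses inside "filter" must match, and none of them contribute to the relevance score, which is what we want for a plain keyword filter.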

### Cool (but problematic) fact about the UMLS column

The UMLS column actually represents a normalized version of various concepts and named entities in the "sentence" column. That means if "coronary" or "blood-pumping organ" are present in the text, they'll get mapped to something like "heart" in UMLS! 

In [None]:
results = simple_filter(terms=["heart", "dead"])

You can add the "match all" flag if you want to make sure ALL of your keywords are contained in the document. 

In [None]:
results = simple_filter(field="lemma", terms=["test", "heart"], match_all=True)

In [None]:
def get_all_the_results(query, field="sentence", index="v9sentences"):
    # Type a string query, get a list of the _source of every document
    # in the ES index that matches it (not just the first page of hits)
    ESURL = "http://elastic:changeme@search.coronawhy.org:80/"
    es = Elasticsearch(ESURL, Port=80, timeout=60) # Do not change this! (timeout is in seconds)
    article_list = []
    res = es.search(
        index=index,
        scroll='60s',
        size=1000,
        body={
           "query": {
                       "match": {
                          field: query
                       }
                    }
           }
        )   

    # Get the scroll ID
    sid = res['_scroll_id']
    scroll_size = len(res['hits']['hits'])
    article_list.extend([hit['_source'] for hit in res['hits']['hits']])
    
    while scroll_size > 0:
        res = es.scroll(scroll_id=sid, scroll='2m')

        # Update the scroll ID
        sid = res['_scroll_id']

        # Get the number of results that returned in the last scroll
        scroll_size = len(res['hits']['hits'])
        article_list.extend([hit['_source'] for hit in res['hits']['hits']])
    return article_list
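
The function above pages through results with the scroll API: the first search returns a scroll ID, and each scroll call fetches the next page until a page comes back empty. Here's a minimal sketch of the same loop that also frees the scroll context when it's done. The names `scroll_all` and `FakeClient` are hypothetical; `FakeClient` only stands in for the real client so the example runs offline, and with the real `es` object you would pass that instead.

```python
def scroll_all(client, index, body, page_size=1000, keep_alive="2m"):
    """Collect every hit's _source with the scroll API, then free the scroll context."""
    results = []
    res = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    scroll_id = res["_scroll_id"]
    try:
        # An empty page means there is nothing left to fetch
        while res["hits"]["hits"]:
            results.extend(hit["_source"] for hit in res["hits"]["hits"])
            res = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = res["_scroll_id"]
    finally:
        # Release server-side resources instead of waiting for the keep-alive to expire
        client.clear_scroll(scroll_id=scroll_id)
    return results


class FakeClient:
    """Hypothetical stand-in for an Elasticsearch client, for offline demonstration only."""

    def __init__(self, docs):
        self.docs = docs
        self.position = 0
        self.page_size = 0
        self.cleared = False

    def _page(self):
        end = self.position + self.page_size
        hits = [{"_source": doc} for doc in self.docs[self.position:end]]
        self.position = end
        return {"_scroll_id": "demo-scroll", "hits": {"hits": hits}}

    def search(self, index, body, scroll, size):
        self.page_size = size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


fake = FakeClient([{"sentence_id": str(i)} for i in range(5)])
all_docs = scroll_all(fake, index="v9sentences", body={"query": {"match_all": {}}}, page_size=2)
print(len(all_docs), fake.cleared)  # 5 True
```

The `helpers` module imported at the top of this notebook also has a `scan` function that wraps this pattern, if you'd rather not write the loop yourself.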

# Converting to a Pandas dataframe

As I'm sure some of you will want to work with Pandas, I've written the code so it's as easy as 1, 2, 3! 

In [None]:
import pandas as pd
results_df = pd.DataFrame(results)
results_df.head()

Since you probably don't care about the vectors:

In [None]:
if 'w2vVector' in results_df:
  results_df.drop(columns=["w2vVector"], inplace=True)
results_df

Have fun and good luck! 