# Lexical vs. Semantic Search
<p>Actual search applications can be complex and consist of several components in addition to the search engine itself: 
query auto-completion, query spell-correction, search filtering, integration with user preferences and profiles, boolean
and/or regular expression search, etc. Here, however, we will focus only on the core search engine and its 
linguistic ability to match a natural language query to relevant fragments in the search corpus; in short, we will
focus on the significance of the core engine algorithm within NLP.</p>
<p>Search tests the ability of an NLP model to represent meaning at the <u>text</u> level: where by text
we mean a sentence, or paragraph, or sequence of paragraphs; all these are generically referred to as documents, 
irrespective of size. A search engine is, at its core, a similarity metric for the distance between the documents in the search
corpus and the user query, itself represented as a (mini) document. Search is organized in two stages: <ul>
<li>at <u>index time</u>, each document in the search corpus is encoded into some kind of representation. </li>
<li>at <u>run time</u>, the user query is also encoded into the same kind of representation, and compared on the fly
to the representations in the corpus. The most similar corpus documents are returned as the search results.</li> </ul>
<p>The encoding is typically a vector of features, representing the information in the text. Consequently, 
the similarity metric is one of the standard vector similarity metrics, such as cosine similarity.</p>
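<p>As a rough illustration of the metric just mentioned, here is a minimal, dependency-free sketch of cosine similarity. The vectors are made-up toy values, and returning 0.0 for an all-zero vector is a convention chosen for this sketch only.</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy 3-dimensional document vectors (made-up numbers).
query = [1.0, 2.0, 0.0]
doc_a = [2.0, 4.0, 0.0]  # points in the same direction as the query
doc_b = [0.0, 0.0, 3.0]  # shares no feature with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

<p>Because the measure depends only on the direction of the vectors, a long and a short document with the same proportions of features score the same against a query.</p>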
<p>The traditional information-retrieval based search engine represents text as a <u>lexical frequency</u> vector.
Lexical items are basically the words in the text: often just the 'content' words ('grammatical' words such as articles, 
pronouns, and the like are ignored); sometimes words are stemmed or lemmatized, i.e., reduced to their
'root' form, to achieve more general representations: e.g., in English, 'loves', 'love', 'loved', 'loving', and even
'lovable' or 'lovingly' could be unified under the same underlying 'lov' root. <br/>
In this type of vector representation, each word in the vocabulary is a dimension, and its value is the TF/IDF of the word, 
that is, Term Frequency over Document Frequency: the frequency of the word in the text at hand, divided by the number 
of documents/texts the word occurs in. In practice, TF/IDF is more refined: the logarithm of the (inverse) document frequency is used, 
and adjustments are made to compensate for the length of the document. <br/>
Importantly, TF/IDF based models adopt the <u>bag-of-words</u> assumption: the (syntactic) order of words in the text
is irrelevant. This is a helpful simplification, but of course can also be a serious limitation.<br/>
<b>Elasticsearch</b> is among the most popular implementations of this lexical search framework; its default scoring function, BM25, is a refinement of TF/IDF.</p>
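<p>To make the weighting and the bag-of-words assumption concrete, here is a small sketch of one common textbook variant of TF/IDF (length-normalized term frequency times the log of the inverse document frequency). It is not Elasticsearch's actual scoring function, and the three documents are invented for the example.</p>

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {word: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors

docs = [
    "the dog bites the man",
    "the man bites the dog",  # same words as above, different order
    "the cat sleeps",
]
vectors = tf_idf_vectors(docs)

# Word order is lost: the first two documents get identical vectors.
print(vectors[0] == vectors[1])
# A word that occurs in every document carries no weight.
print(vectors[0]["the"])
```

<p>The first two documents mean opposite things, yet their vectors are indistinguishable, which is exactly the limitation of the bag-of-words assumption.</p>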
<p>More recently, a new paradigm has been emerging for search: <b>Semantic Search</b>, using Neural Network models. In this 
framework, the lexical identity of the words is not important. Words, or rather, words and sub-words (word fragments), are instead 
represented as vectors of semantic features. These (sub)word vectors are usually referred to as embeddings. The number 
of dimensions is not the size of the vocabulary anymore, but depends on the model implementation: typically 384 or 768. 
The model is usually trained for the Masked-Language-Model task (MLM): some 15 percent of the words in the training materials 
are hidden (=masked) and the model is trained to predict the missing words. The resulting features or dimensions of 
the word vectors are the weights the model learns through the MLM task. Although the model does not specifically assign 
a meaning to the features, they can be thought of as fine-grained semantic traits that make up the meaning of the words: 
e.g. features like 'is a person', 'has gender', 'is inanimate', etc. These models assign a vector to each word (or sub-word) in the text
that is encoded; to provide a single vector representation of the whole text/document, the word vectors are typically
averaged. </p> 
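<p>The averaging step described above can be sketched in a few lines. The 4-dimensional token vectors are made-up numbers; real models produce vectors with hundreds of dimensions.</p>

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one text vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    n = len(word_vectors)
    # zip(*...) groups the values of each dimension across all words.
    return [sum(dim) / n for dim in zip(*word_vectors)]

# Made-up vectors for the three tokens of a short text.
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
text_vector = mean_pool(token_vectors)
print(text_vector)
```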
<p>State-of-the-art models for Semantic Search implement the <b>Transformers</b> neural architecture. These 
models are able to produce <b>contextual</b> word vectors: richer vector representations of words that reflect 
their semantic and syntactic relationships with the other words in the text. The averaged single vector encoding of 
a text is usually referred to as a 'sentence embedding'. </p>
<p>At index time, all the texts in the document corpus to be searched are encoded by the model into sentence
vectors and stored. At runtime, the user query is converted into its own sentence embedding and compared with the stored 
embeddings using a vector distance metric. The most similar sentence embeddings from the document corpus are then returned 
as the search results.</p>
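<p>The two stages can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and a toy keyword-count encoder stands in for the neural model so that the example runs without any dependency.</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_index(texts, encode):
    """Index time: encode every corpus text once and keep the vectors."""
    return [(text, encode(text)) for text in texts]

def search(query, index, encode, top_k=5):
    """Run time: encode the query and rank the stored vectors by similarity."""
    query_vec = encode(query)
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Stand-in encoder for illustration only: counts of a few hand-picked words.
# With the Sentence-Transformers library one would instead pass something like
# SentenceTransformer("all-MiniLM-L12-v2").encode (not run here).
KEYWORDS = ["heart", "mask", "coffee"]

def toy_encode(text):
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in KEYWORDS]

corpus = [
    "is it normal to feel your heart beat",
    "does a mask protect the wearer",
    "does coffee cause dehydration",
]
index = build_index(corpus, toy_encode)
results = search("can i hear my heart", index, toy_encode, top_k=2)
for score, text in results:
    print(round(score, 2), text)
```

<p>Passing the encoder as an argument keeps the search logic independent of the model, so the same two functions would serve any encoder that maps a text to a fixed-size vector.</p>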

## Baselines and Experiments
<p>When developing or choosing machine learning models to handle a task, it is helpful to establish <b>baselines</b>, i.e., benchmark models 
against which iterative development can be evaluated, and which provide guidance for further development. </p>
<p>We will set up two baselines: one for <b>Elasticsearch</b> with its standard TF/IDF information retrieval approach, and another one for Semantic Search using a Neural Network model. </p> 
<p>The latter is a state-of-the-art model built by fine-tuning a <b>Transformer</b> model by Microsoft ('MiniLM-L12-H384') on a very large collection of sentence similarity datasets, 
covering a total of 1 billion sentence pairs. This model is '<b>all-MiniLM-L12-v2</b>'. It is a fast, small (384 dimensional embeddings) yet powerful model for search applications.</p>
<p>To summarize, the two approaches featured here differ in the following respects:
<table><tr><th>Model</th><th>Word representation</th><th>Word order and dependencies</th><th>Vector size</th><th>Vector Feature Values</th></tr>
    <tr><td>ElasticSearch 7.12</td><td>word or word-root, no internal structure</td><td>not represented</td><td>large, sparse vectors: size is entire vocabulary</td><td>TF/IDF</td></tr>
    <tr><td>All-MiniLM-L12-v2</td><td>word-vectors of semantic features</td><td>encoded in word vectors</td><td>384 dimensional word embeddings, averaged into single text-embedding</td><td>neuron weights</td></tr>
</table>
<p>So, from the point of view of search, our focus is going to be to compare the pros and cons of lexical vs. semantic search, by setting up baselines for each, and 
furthermore, by developing a basic infrastructure to run and evaluate <b>experiments</b>. An experiment consists in running a search model with some specific configuration, 
evaluating its performance over a test set, while keeping track of the results. Experiments further develop and hopefully improve the baselines.</p>


## Evaluation with Test Set
<p>To compare models and their development iterations, it is necessary to develop a <u>Test Dataset</u>. </p>
<p>A test set for a search application consists of a number of simulated user queries, each paired with its search results in the form of corpus snippets, their ids, and a human-assigned 
relevance grade. </p>
<p>For our purposes here, which are to obtain an initial comparison of models and to develop a methodology, we will set up a small set of test queries, say 30, and grade just the top 5 results for each. <br/>
We will assign a numerical relevance grade of '3' to search results that are relevant; a grade of '2' to partially relevant search results; and a grade of '1' when a result is 
not relevant. The queries will be set up so that there is at least some content in the corpus that is relevant. We want to avoid the situation where the user query simply 
has no relevant match in the corpus. </p>
<p>To evaluate each model and configuration against the Test Set, we will first run the model in the given configuration, collect the search results, grade them, and add them to the 
Test Set. Once all the (top 5) search results for each query are graded, we can produce evaluation metrics for the model. So this is an iterative process of running, 
grading, evaluating for each new model; the manual grading effort will tend to decrease as more and more graded search results are added to the test set. </p>
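<p>One way to picture why the manual effort shrinks over time is the sketch below. It is an assumption about how the bookkeeping could be done, not the project's implementation: the function name is hypothetical, and the second snippet id is made up.</p>

```python
def split_by_grade(results, graded):
    """Separate new search results into already-graded and still-ungraded ones.

    results: list of (query, snippet_id) pairs returned by a model run
    graded:  dict mapping (query, snippet_id) to a grade (1, 2 or 3)
    """
    known, todo = [], []
    for query, snippet_id in results:
        key = (query, snippet_id)
        if key in graded:
            known.append((query, snippet_id, graded[key]))
        else:
            todo.append(key)
    return known, todo

query = "Is it ok to hear my heart beat?"
# Grades accumulated from earlier runs.
graded = {(query, "4542:0"): 3, (query, "5264:5"): 2}
# Results of a new run: one snippet seen before, one new (made-up id).
new_results = [(query, "4542:0"), (query, "9999:0")]

known, todo = split_by_grade(new_results, graded)
print(len(known), "already graded;", len(todo), "left to grade by hand")
```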
<p>A note about our grading: search applications typically display search results by providing the document's title, and sometimes the tags, along with the portion of the 
document that is (supposed to be) most relevant to the query. Our preprocessed text includes the text of the post's tags, and these are part of the searchable content. 
Consequently, when grading, we have implicitly considered the tags as part of the matched text. Some results that may not appear entirely relevant to the query have sometimes
been graded higher, in anticipation of how they would be presented to the end-user.</p>
<p>Below is an example of how we have set up the Test Set. We save this as 'data/health_search_results_true.csv':</p>

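<pre>
import os
import pandas as pd

# Location of the graded ("true") search results that make up the Test Set
data_dir = "../data"
true_results_file = os.path.join(data_dir, "health_search_results_true.csv")
# Show long snippets in full rather than truncating them
pd.options.display.max_colwidth = 180
# Skip malformed rows instead of failing on them
test_set_df = pd.read_csv(true_results_file, on_bad_lines='skip')
test_set_df.head(10)
</pre>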

<pre>
Unnamed: 0,query,grade,id,snippet,notes
0,Is it ok to hear my heart beat?,3,4542:0,Why can I hear my heart beat louder after rigorous exercise,
1,Is it ok to hear my heart beat?,3,5264:0,Is there such a thing as a hard heart beat? (as opposed to a fast heart beat),
2,Is it ok to hear my heart beat?,2,5264:5,Would this 'harder' heat beat burn more calories than a normal strength heart beat at the same bpm?,
3,Is it ok to hear my heart beat?,3,825:0,Is it normal to feel your heart beat in your chest?,
4,Is it ok to hear my heart beat?,2,20236:0,"How does cold weather affect blood flow, heart beat, and heart attack?",
5,Is it ok to hear my heart beat?,3,825:0,Is it normal to feel your heart beat in your chest?,
6,Is it ok to hear my heart beat?,3,12427:3,"The other day when i woke up i felt my heart beating hard, and i occasionally feel my heart beating without touching it with my hands.",
7,Is it ok to hear my heart beat?,3,5264:2,After a heavy set of squats for instance I can hear my heart beating in my ears and feel it beating in my chest.,beating matched to beat
8,Is it ok to hear my heart beat?,3,825:1,"Is it normal for a person to at times feel their heart beat in their chest without actually placing their hand on their chest, while at other times not be able to (even though ...",
9,Is it ok to hear my heart beat?,3,4542:1,As I understand exercise increases hear rate. Why should an increase in hear rate feel like the heart is also beating louder (not just faster).,
</pre>


## Testing and Exploring the search frameworks interactively
<p>In addition to testing search frameworks through complete experiments, interactive testing is also available in this project. In interactive testing, frameworks can be tested and explored 
by submitting queries about health. </p>
<p>The end-point for interactive testing is <b>get-search-results-interactive</b>. This is an example of interactive search, where the user is prompted to submit queries:</p>
<pre>
HealthSearch>  python3 health_exchange/main.py get-search-results-interactive "./config/args_senttrans_1.json"
Usage: enter search query after prompt; enter 'q' to exit

Query: are masks helpful to prevent covid?
   1  [id: 21689:2 tags: covid-19 who-world-health-org]  I don't understand why wearing mask is not good way to prevent infection of COVID-19. Is it just because masks are out of stock?  (Score 0.83)
   2  [id: 21459:5 tags: covid-19]  I understand that we should not wear masks because there is a shortage of them, they do not protect eyes, do not protect from touching face, etc.  (Score 0.81)
   3  [id: 23489:2 tags: covid-19]  Early reports said that wearing a mask may not help to protect myself from other infected people, but it protects others from me (who might be infected).  (Score 0.8)
   4  [id: 22939:3 tags: covid-19 prevention epidemiology covid-19-datasets]  Is this empirical evidence we should be wearing surgical masks?  (Score 0.79)
   5  [id: 24411:4 tags: covid-19 face-mask-respirator]  So the question is does a shield help (to any significant degree) the wearer from spreading COVID-19 in the same way as a mask?  (Score 0.77)
Query: 

</pre>

## Evaluation Metrics
<p>As for the metrics, we provide the following: 
<ul><li><b>Relevance Table</b>: a table showing the percentage of relevant, partial, and non-relevant results</li>
    <li><b>Discounted Cumulative Gain (DCG)</b>: this is an overall score that combines relevance and ranking, where ranking is the ability of the 
    search engine to return a more relevant result before a less relevant one. In DCG, the relevance score of each search result is 'discounted' by its ranking, or 
    position, in the result set. So a relevant search result returned as the first answer will count more than a relevant search result returned as the second, and so forth.</li>
</ul></p>
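<p>Both metrics can be sketched in a few lines. This is a minimal illustration using the common DCG formulation (grade divided by log2 of rank + 1); how the project scales the grades into gains is not stated here, so the absolute values need not match the figures reported for the experiments.</p>

```python
import math

def relevance_table(grades):
    """Percentage of relevant (3), partial (2) and non-relevant (1) results."""
    total = len(grades)
    return {
        "percent relevant": round(100 * grades.count(3) / total, 2),
        "percent partial": round(100 * grades.count(2) / total, 2),
        "percent notrelevant": round(100 * grades.count(1) / total, 2),
    }

def dcg(grades):
    """Discounted cumulative gain of one ranked result list (best rank first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def average_dcg(grades_per_query):
    """Mean DCG over the result lists of all test queries."""
    return sum(dcg(g) for g in grades_per_query) / len(grades_per_query)

# The same two grades in a different order: ranking the relevant result
# first yields the higher score.
good_first = [3, 1]
good_last = [1, 3]
print(dcg(good_first), dcg(good_last))
print(relevance_table(good_first + good_last))
```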

## Experiment Infrastructure
<p>Experiments consist in running a search model against a Test Set of simulated user queries, and extracting evaluation metrics.</p>
<p>Experiments center around a configuration: for us, a json file that includes settings like the experiment name, the framework (Elasticsearch or Sentence Transformer), and various hyperparameters and preprocessing choices, as needed.</p>
<p>We perform experiments in three phases:
<ol><li>create a search index from the search corpus and store it </li>
    <li>run the model on the simulated user queries in the Test Set, and generate search results using the search index</li>
    <li>extract search metrics after manually grading the search results</li>
</ol>
<p>Finally, we need a way to keep track of our experiments and their results, so that we can proceed with iterative development of the baselines.</p>
<p>To implement this infrastructure and manage experiments, we therefore need to put in place the following components:
<ul><li>Python functions that execute each phase, depending on the framework of the experiment</li>
    <li>a convenient interface to execute these functions. A CLI (Command Line Interface) can suffice for this project (though in real life, a REST implementation would be preferable)</li>
    <li>a framework to store, keep track of, and manage all the experiments. </li>
</ul>

### Experiment Configurations
<p>Let's start with the experiment configurations. Here are the contents of the json files for the two baselines:</p>
<pre>
{
    "model_type": "elasticsearch",
    "index_name": "healthex__tags_snippet_tttf",
    "run_name": "es_baseline",
    "prep_lower": true,
    "prep_skipw": true,
    "prep_stem": false,
    "prep_remove_punct": true,
    "prep_prefix_col": "Tags",
    "prep_col": "Snippet",
    "prep_max_words_per_snippet": 30,
    "num_search_results": 5
}
{
    "model_type": "sent_transformer",
    "model_filepath": "./Models/sentence-transformers_all-MiniLM-L12-v2",
    "run_name": "sent_trans_baseline",
    "prep_lower": true,   
    "prep_skipw": false,
    "prep_stem": false,
    "prep_remove_punct": false,
    "prep_prefix_col": "Tags",
    "prep_col": "Snippet",
    "prep_max_words_per_snippet": 30,
    "num_search_results": 5
}
</pre>
<p>These are rather minimal configurations, but sufficient for this project. In real life, we would want a number of additional settings, such as json strings for the (field) "mappings", "settings", 
and "query" elements of the Elasticsearch configuration, and others. For Transformer models, one may want to configure the max sequence length (in tokens) of the inputs, the method for producing sentence embeddings from the final layers of the Transformer model, whether we encode individual texts or pairs of texts (this is an option with Transformer models), and many other hyperparameters.</p> <p>This does get quite involved, and for simplicity we have chosen here a very high-level implementation of Transformers, the <b>Sentence-Transformers</b> Python library, using the PyTorch framework. This implementation is curated by Nils Reimers and hosted at https://www.sbert.net/ (cf. Reimers, Nils and Gurevych, Iryna, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", 2019, https://arxiv.org/abs/1908.10084); it makes using these complex models very easy, by hiding a lot of the details. </p>
<p>Furthermore, we are not training a Transformer model in this project; instead we are using a ready-made, pre-trained and already fine-tuned model, 'all-MiniLM-L12-v2', which is a good fit for the non-technical language of the StackExchange Health posts. So we can dispense with a long list of training hyperparameters.</p>
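<p>To illustrate how the <code>prep_*</code> switches of the configurations above might drive preprocessing, here is a hedged sketch. The function, the tiny skip-word list, the example tag, and the reading of <code>prep_prefix_col</code> as "prepend the tags to the snippet" are assumptions made for this example; stemming is left out because both baselines disable it.</p>

```python
import json
import string

# Tiny illustrative list; a real run would use a full skip-word list.
SKIP_WORDS = {"a", "an", "the", "is", "it", "to", "of", "my"}

def preprocess(snippet, tags, config):
    """Apply the prep_* switches of a configuration to one corpus snippet."""
    text = snippet
    if config.get("prep_prefix_col"):  # assumed meaning: prepend the tags
        text = f"{tags} {text}"
    if config.get("prep_lower"):
        text = text.lower()
    if config.get("prep_remove_punct"):
        text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    if config.get("prep_skipw"):
        words = [w for w in words if w.lower() not in SKIP_WORDS]
    max_words = config.get("prep_max_words_per_snippet")
    if max_words:
        words = words[:max_words]
    return " ".join(words)

# A subset of the Elasticsearch baseline settings, parsed from json.
config = json.loads("""
{"model_type": "elasticsearch", "prep_lower": true, "prep_skipw": true,
 "prep_stem": false, "prep_remove_punct": true, "prep_prefix_col": "Tags",
 "prep_col": "Snippet", "prep_max_words_per_snippet": 30}
""")
snippet = "Is it normal to feel your heart beat in your chest?"
prepped = preprocess(snippet, "heart", config)  # "heart" is a made-up tag
print(prepped)
```

<p>Keeping every preprocessing choice in the configuration means that an experiment such as adding stemming only requires a new json file, not a code change.</p>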

### Entry points for running the Experiments
<p>The Python functions to drive the experiment phases are detailed in the 'main.py' file of the 'health_exchange' module. For the CLI, we use the 'typer' package and annotate the entry-point
functions. This conveniently allows members of a development team, at least, to run the experiments without having to delve into the Python code. </p>
<p>Here is a summary of the high-level steps for the essential CLI entry points:
<ul><li>create-index:
        <ul><li>Elasticsearch:</li>
                <ul><li>connect to Elasticsearch server</li>
                    <li>preprocess the raw text in the search corpus, storing in a csv file the mapping from the preprocessed texts to the raw text and id, so we can display the latter in 
                        a user-friendly way in the search results</li>
                    <li>add the preprocessed text to the search index</li>
                </ul>
            <li>Sentence Transformer:</li>
                <ul><li>load the Transformer model from a local folder ("health_exchange/Models")</li>
                    <li>preprocess the raw text in the search corpus, storing in a csv file the mapping from the preprocessed texts to the raw text, so we can 
                        display the latter in a user-friendly way in the search results</li>
                    <li>encode the preprocessed text into sentence embeddings and store in a compressed ('pkl') file. Note that hardly any pre-processing needs to be done. </li>
                    <li>additionally, we want to store the preprocessed text into a file, so that we can map the embeddings to the text snippets.</li>
                </ul>
        </ul>
    </li>
    <li>run-test-queries
        <ul><li>Elasticsearch:</li>
                <ul><li>connect to Elasticsearch server</li>
                    <li>retrieve the mapping preprocessed-text to raw-text</li>
                    <li>preprocess the test query, and get results from the Elasticsearch server</li>
                    <li>map the results to the raw-text along with the snippet ids, and store into a results file</li>
                </ul>
            <li>Sentence Transformer:</li>
                <ul><li>retrieve the mapping preprocessed-text to raw-text, the file with the corpus embeddings, and the file with the (prepped) text snippets</li>
                    <li>create an embedding for the query</li>
                    <li>compare the query embedding with the stored corpus embeddings, and get the indices of the most similar corpus embeddings</li>
                    <li>using the stored mappings and text snippets, flesh out the search results in a user-friendly format, and store in a results file.</li>
                </ul>
        </ul>
    </li>
    <li>evaluate-run
        <ul><li>this step is the same for Elasticsearch and Sentence Transformer:</li>
                <ul><li>NOTE: prior to running this entry-point, the search results need to be manually graded for relevance, and the graded results
                    stored in a csv file.</li>
                    <li>using the "true" results file, compute and publish metrics for the search results file and other stats</li>
                </ul>
        </ul>
    </li>
</ul>
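<p>The way a single entry point can serve both frameworks can be sketched as a dispatch on the <code>model_type</code> setting. The project itself uses 'typer'; this sketch swaps in the standard library's <code>argparse</code> to stay dependency-free, and the phase functions are hypothetical placeholders that only return a message.</p>

```python
import argparse
import json

# Placeholder phase functions (hypothetical names); real ones would build
# the index rather than return a message.
def create_index_elastic(config):
    return f"Elasticsearch index for run '{config['run_name']}'"

def create_index_senttrans(config):
    return f"Sentence Transformer embeddings for run '{config['run_name']}'"

# command -> framework -> function; other commands would be added likewise.
PHASES = {
    "create-index": {
        "elasticsearch": create_index_elastic,
        "sent_transformer": create_index_senttrans,
    },
}

def dispatch(command, config):
    """Pick the implementation matching the framework named in the config."""
    return PHASES[command][config["model_type"]](config)

def main(argv=None):
    parser = argparse.ArgumentParser(description="Run one phase of an experiment")
    parser.add_argument("command", choices=sorted(PHASES))
    parser.add_argument("config_path", help="path to the experiment json file")
    args = parser.parse_args(argv)
    with open(args.config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    print(dispatch(args.command, config))
```

<p>Under these assumptions, calling <code>main(["create-index", "./config/args_senttrans_1.json"])</code> would read the configuration file and run the Sentence Transformer variant of the phase.</p>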
<p>As can be seen, to ensure artifact persistence between entry-points, the artifact data is simply written to local files. While this serves the purposes of this project, 
in real life a more principled solution would need to be adopted. There are a number of frameworks, from Redis to MongoDB and others, that can be used for this 
purpose. The same applies to storing Transformer models.</p>

## Experiment Tracking with MLflow
<p>As featured in "MadeWithML" (Goku Mohandas, https://madewithml.com/, 2022), we use MLflow as the framework for tracking and managing experiments. The Python client 
allows us to pass the configuration parameters, the artifacts, and the metrics to MLflow, and to display them by running a server with a convenient GUI.</p>
<p>The Python code for the MLflow client is integrated into the code for the experiment entry-points. Once the MLflow server has been started, point your browser to "http://localhost:8000/" to open its UI.</p>
<p>To illustrate, here is a screenshot of the opening page after running the experiments:</p>
<p><img src="../images/mlflow_main.png" alt="main" width="2000"></p>
<p>Next, by clicking on the 'es_baseline_run_queries' run name, you can access the artifacts: specifically, the search results (ungraded):</p>
<p><img src="../images/mlflow_elastic_results.png" alt="results" width="2000"></p>
<p>Note: clicking on the larger file '...info.json' may freeze the UI, so it is best avoided.
<br/>As always, refer to "COMMANDS.md" for a description of the commands to run.</p>

## Search Result Evaluation for Baselines
<p>We have set up two baseline experiments, "elastic_1" and "senttrans_1". Additionally, we feature another experiment for Elasticsearch, "elastic_1.1". 
The latter modifies "elastic_1" by adding stemming to the preprocessing. </p>
<p>Once the suite of commands "create-index", "run-test-queries", and "evaluate-run" has been executed, these experiments produce the following evaluation metrics:</p>
<h4> Elasticsearch (args_elastic_1):</h4>
<pre>
{
    "percent relevant": 44.67,
    "percent partial": 14.67,
    "percent notrelevant": 40.67,
    "average dcg": 2.515,
    "total queries": 30,
    "total ungraded": 0,
    "total results": 150
}
</pre>
<h4> Elasticsearch (args_elastic_1.1)</h4>
<pre>
{
    "percent relevant": 44.67,
    "percent partial": 19.33,
    "percent notrelevant": 36.0,
    "average dcg": 2.592,
    "total queries": 30,
    "total ungraded": 0,
    "total results": 150
}
</pre>
<p></p>
<h4> Sentence Transformers (args_senttrans_1):</h4>
<pre>
{
    "percent relevant": 63.33,
    "percent partial": 18.0,
    "percent notrelevant": 18.67,
    "average dcg": 3.144,
    "total queries": 30,
    "total ungraded": 0,
    "total results": 150
}
</pre>
<p>As can be seen, the 'senttrans_1' experiment achieves the best scores on every metric. On the other hand, experiment "elastic_1.1" barely improves on the "elastic_1" baseline: the percentage of relevant results is unchanged, and the average DCG rises only from 2.515 to 2.592. <br/>
We will comment on these results below.</p>

## Conclusion: a (subjective) comparison of Elasticsearch vs. Semantic Search with Transformers
<p>This project is meant as a preliminary investigation of two popular search frameworks, Elasticsearch and Semantic Search, as language models,
that is, as models of a human's language competence: specifically, the ability to understand language as text. The Test Set assembled here is 
strongly slanted toward natural language understanding: most of the questions are relatively long, syntactically well formed sentences. There are 
just a few keyword-style questions.</p>
<p>From this point of view, Semantic Search using Transformer models clearly comes out as superior. The sense one gets in reviewing the search results 
for the Test Set is that the 'lexical' assumptions of Elasticsearch prevent it from adequately representing the overall meaning of texts. 
Text meaning is clearly more than the sum of the words (cf. Elasticsearch's bag-of-words assumption). In fact, Elasticsearch falls short in the
following respects:
<ul><li>It does not represent the main intent of the query: what are the most important words that need to be matched, as opposed to less 
        important ones. Linguists refer to this as the 'theme-rheme' or 'topic-comment' structure of sentences. Most of the irrelevant search 
        results produced by Elasticsearch derive from getting matches only for words that don't convey the main focus of the question.</li>
    <li>It identifies word matches in a rather superficial way (cf. 'lexical' assumption), by the accidental shape of words. Standardizing text 
        in various ways, such as removing skip-words, lower-casing, removing punctuation, does not really help, and it may even make the problem worse. 
        As the results for experiment 'elastic_1.1' show, adding stemming leaves the percentage of relevant results unchanged and raises the average DCG only marginally.
        Related to this, there is no built-in provision for synonymity. Although Elasticsearch allows for importing synonym lists, this approach
        is laborious to implement and, especially, to maintain, and, like stemming, can sometimes do more harm than good. Synonyms are highly domain-specific.</li>
    <li>It does not represent the relationships between words in the text, both syntactically (=word order) and semantically</li>
</ul></p>
<p>Conversely, these are the respects in which Transformers provide a superior representation:
<ul><li>Models like 'all-MiniLM-L12-v2' are trained on a very large set (1 billion) of paraphrase pairs, including NLI (=Natural Language Inference) 
        triplets, from which they learn which sentences are in a paraphrase or inferential relationship. Although Transformers are not perfect 
        at identifying the intent of the user queries, they make some definite strides in this respect.</li>
    <li>The lexical shape of words is largely irrelevant to Transformers models, since words (and sub-words) are immediately mapped
        to their base vectors: these are representations of the co-occurrence patterns of the words, hence they are inherently 
        semantic representations. Words that are natural synonyms tend to occur in the same contexts, and their base vectors end 
        up being similar. So, synonymity comes automatically with word-vectors. Note that with these models, it is not necessary 
        (and may actually be harmful) to standardize the text. Furthermore, the tokenization approach used by Transformers, where
        many words are split into sub-words, depending on relative frequency, automatically makes similar representations for 
        word variants. But importantly, all the text can be modelled, including punctuation and skip-words, thus obtaining a 
        richer representation of the text.</li>
    <li>Transformers, via their 'attention heads', are able to capture the dependencies between words in the text. They also
        keep track of the order of the words. The final word-vectors produced by Transformers reflect the syntactic and 
        semantic role of each word in the text, and are therefore 'contextualized' word-vectors. A model such as 'all-MiniLM-L12-v2'
        successively produces 12 vector representations of the base vectors, each capturing a different kind of relationship
        between words.</li>
</ul>

### What is the best approach for Search?
<p>Although Transformers are better language models, this does not mean they are always the best solution for search. As noted, search applications 
tend to be complex, and linguistic accuracy is only one of the aspects by which to evaluate a search application; although, admittedly, it is a very
fun and interesting one. </p>
<p>Elasticsearch is still a very popular framework, in part because it's highly configurable to suit the specific needs of applications. Here, we 
have featured a very basic, actually simplistic, configuration for Elasticsearch, one that hardly does justice to its power. Compared to Elasticsearch, 
our version of Semantic Search may seem like a 'black box'. In fact, Semantic Search with neural network models tends to suffer from its own 
limitations, including:
<ul><li>Speed of indexing and index footprint</li>
    <li>Lack of configurability: while, as we have seen, it's very easy to set up a sophisticated Semantic Search using a pre-trained model and 
    the SentenceTransformers library, it's quite hard to effect specific changes in the search results. </li>
    <li>Difficulty with 'literal' matches: because word-vector models do not maintain a representation of the surface form of text, it's 
    difficult to guarantee a 'literal' match, as when people enclose a text string in double quotes in the query. A typical use-case is exact matching 
    of error messages.</li>
    <li>Difficulty with numbers: exact matching of multi-digit numbers, versions, and codes with a numerical part is often imprecise with 
    these models. This may be because, owing to the endless variety of numerical expressions, it is difficult for the model to 
    see a sufficient number of training examples to identify a specific number and its role in the meaning of the text.</li>
    <li>Difficulty with very short keyword-style questions: there is no provision for how to interpret a single-keyword question, for example
    as a request for a definition or an introductory explanation. In fact, the lack of linguistic context tends to retrieve results where 
    the keyword is matched alongside the least informative content. </li>
</ul>
<p>Despite the prowess of semantic search with natural language understanding, these limitations indicate that currently, semantic search via BERT-like Transformer
models may not yet be a wholesale solution for search. These limitations may, however, be removed in future developments: as has been the case with 
OpenAI's ChatGPT, additional training with reinforcement learning and input from human users may well produce semantic search AI applications that 
are much more tuned to human expectations. More immediately, and in part, some of these limitations may be overcome by training additional network 
agents to filter the search results produced by the search. For example, a classifier that could predict if a search result provides a definition, 
or explains an error message could be invoked when the user question is keyword-like or is an error code. </p>
<p>Alternatively, one emerging approach to alleviate the limitations consists in blending Elasticsearch and Semantic Search. Recent versions of Elasticsearch 
allow sentence vectors to be used as a type of search field, and integrated into the search alongside more traditional field representations. 
This may allow for longer, 'natural language' user questions to benefit from the power of contextualized word vectors, whereas short, keyword-like 
questions can receive the standard TF/IDF vector representation and IR similarity metric. Additionally, the distributed infrastructure and 
other facilities of Elasticsearch can be leveraged in this approach. 
</p>
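<p>The idea of sending different kinds of queries to different representations can be sketched as a simple router. This is an illustrative heuristic only: the function name, the two-token threshold, and the treatment of quoted strings are assumptions for this example, not features of Elasticsearch or of this project.</p>

```python
def choose_engine(query, max_keyword_tokens=2):
    """Route a query to 'lexical' or 'semantic' search (illustrative heuristic)."""
    stripped = query.strip()
    # An explicitly quoted string asks for a literal match.
    if len(stripped) >= 2 and stripped[0] == '"' and stripped[-1] == '"':
        return "lexical"
    # Very short, keyword-style queries carry little context for embeddings.
    if len(stripped.split()) <= max_keyword_tokens:
        return "lexical"
    return "semantic"

for q in ["covid", "gallstones relief", "are masks helpful to prevent covid?"]:
    print(q, "->", choose_engine(q))
```

<p>A production system would more likely combine the scores of both engines than choose one outright, but even this crude split addresses the keyword and literal-match weaknesses listed above.</p>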

## Examples from the Test Set
To make the above point a bit more concrete, here are some examples of search results, along with an analysis:
<ul><li>Elasticsearch, missing match to the focus of the query: <p>
        <table><tr><th>Query</th><th>Grade</th><th>Id</th><th>Result</th><th>Notes</th></tr>
               <tr><td>what are negative eye powers?</td><td>1</td><td>20794:3</td><td>"Does it not let the muscles they eye use relax? If so, why not?"</td><td>missing focus: negative powers</td></tr>
               <tr><td>can ultrasound cause damage?</td><td>1</td><td>18108:0</td><td>Does walking daily cause knee joint damage?</td><td>missing focus: ultrasound</td></tr>
               <tr><td>gallstones relief</td><td>1</td><td>5664:0</td><td>Why does exercise relief stomach pain/bloating</td><td>missing focus: gallstones</td></tr>
               <tr><td>toothpaste with or without fluoride? Pros and cons?</td><td>1</td><td>3616:0</td><td>What are the pros and cons of personalized medicine?</td><td>missing focus: fluoride toothpaste</td></tr>
        </table>
        <p></p>
    </li>
    <li>Elasticsearch, wrong meaning of word: <p>
        <table><tr><th>Query</th><th>Grade</th><th>Id</th><th>Result</th><th>Notes</th></tr>
               <tr><td>is it ok to go to the pool after I had a piercing?</td><td>1</td><td>11846:6</td><td>"A good example is playing pool. When I look at the ball, [...] and am unable to make the shot."</td><td>wrong sense of 'pool'</td></tr>
        </table>
        <p></p>
    </li>
    <li>Sentence Transformer, synonym matching: <p>
        <table><tr><th>Query</th><th>Grade</th><th>Id</th><th>Result</th><th>Notes</th></tr>
               <tr><td>How many calories do I have to burn to lose 1 Kilogram in weight?</td><td>3</td><td>797:0</td><td>Does 3500 calories really equal a pound?</td><td>'Kilogram' ~ 'pound'</td></tr>
               <tr><td>what are negative eye powers?</td><td>3</td><td>3355:0</td><td>What does eye power -6 means and how close to blindness is it?</td><td>'negative' ~ '-'</td></tr>
               <tr><td>what are the effects of drinking coffee?</td><td>3</td><td>17497:0</td><td>Did science backtrack regarding coffee causing dehydration?</td><td>'effects' ~ 'cause'</td></tr>
        </table>
        <p></p>
    </li>
    <li>Sentence Transformer, word variation: <p>
        <table><tr><th>Query</th><th>Grade</th><th>Id</th><th>Result</th><th>Notes</th></tr>
               <tr><td>is it ok to go to the pool after I had a piercing?</td><td>1</td><td>5733:1</td><td>I am a male and I am considering to get one of my nipples pierced</td><td>'piercing' ~ 'pierced'</td></tr>
               <tr><td>How many calories do I have to burn to lose 1 Kilogram in weight?</td><td>3</td><td>5394:0</td><td>"How many Calories Deficit Equals 1 KG Loss, approximately"</td><td>'Kilogram' ~ 'KG', 'lose' ~ 'loss'</td></tr>
               <tr><td>chicken pox</td><td></td><td>5120:0</td><td>Is getting the chickenpox vaccination twice in a month harmful?  (topics: vaccination, chickenpox)</td><td>'chicken pox' ~ 'chickenpox'</td></tr>
        </table>
        <p></p>
    </li>
    <li>Sentence Transformer, single keyword queries: <p>
        <table><tr><th>Query</th><th>Grade</th><th>Id</th><th>Result</th><th>Notes</th></tr>
               <tr><td>covid</td><td></td><td>22823:14</td><td>Or something else?  (topics: covid-19)</td><td>uninformative snippet, correct topic tag</td></tr>
               <tr><td>antibiotics</td><td></td><td>17787:5</td><td>I tried to look up the answer but besides some forum-posts there doesn't seem to be any kind of "official" statement (from a pharma company or something equivalent) about how to handle such a situation (or at least I wasn't able to find it).  (topics: medications, antibiotics)</td><td>uninformative snippet, correct topic tag</td></tr>
        </table>
    </li>
</ul>
