# Lexical search

So far, semantic search has performed very well: when we asked questions, it found appropriate answers.

In real-world scenarios, this does not always work as well. Often, abbreviations and highly specialized vocabulary
are used. Semantic models do not excel in those situations, even though they can be *finetuned* (which is the subject
of another live course!). A simpler solution is to complement the semantic search with a lexical search. This works
very well for abbreviations and for domain-specific vocabulary because it matches the words directly.
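
To make "directly matches the words" concrete, here is a minimal sketch of the idea behind a lexical index. It is a toy illustration only: the example sentences, the `tokenize` helper and the `lexical_lookup` function are made up for this purpose, and a real engine adds relevance scoring, persistence and a query language on top.

```python
import re
from collections import defaultdict

# Toy corpus (hypothetical sentences, not taken from sentences.json).
docs = [
    "The IPCC published a new assessment report.",
    "Global warming affects poorer countries more strongly.",
    "The panel on climate change met in Geneva.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Inverted index: token -> set of ids of the documents containing it.
inverted = defaultdict(set)
for doc_id, text in enumerate(docs):
    for token in tokenize(text):
        inverted[token].add(doc_id)

def lexical_lookup(query):
    """Return the ids of all documents containing at least one query token."""
    hits = set()
    for token in tokenize(query):
        hits |= inverted.get(token, set())
    return sorted(hits)

print(lexical_lookup("IPCC"))  # [0]
```

The abbreviation is found because the exact token occurs in the first sentence. Conversely, a query whose words appear in no document returns nothing, no matter how closely related it is in meaning.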

Many lexical search engines exist, a lot of them based on [Apache Lucene](https://lucene.apache.org/) (like
[Apache Solr](https://solr.apache.org/) or [Elastic](https://www.elastic.co/)). However, this is complex software
which deserves its own live course. Therefore, we use a very simple (although fast) alternative here:
[tantivy](https://github.com/quickwit-oss/tantivy), which is implemented in [Rust](https://rust-lang.org/) and also
has excellent [Python bindings](https://pypi.org/project/tantivy/).

## Load data (from previous notebook)

In [None]:
import json
with open("sentences.json") as f:
    sentences = json.load(f)

In [None]:
len(sentences)

## Create index

In [None]:
import tantivy

In [None]:
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_integer_field("id", stored=True)
schema_builder.add_text_field("text", stored=True)
schema = schema_builder.build()

Remove a possible old index:

In [None]:
import os
import shutil
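# Delete any index left over from a previous run (a missing directory is not an error),
# then create an empty directory for the new index.
shutil.rmtree("tantivy-index", ignore_errors=True)
os.mkdir("tantivy-index")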

In [None]:
index = tantivy.Index(schema, "tantivy-index")
writer = index.writer()

In [None]:
from tqdm.auto import tqdm
for i, t in tqdm(enumerate(sentences), total=len(sentences)):
    writer.add_document(tantivy.Document(id=i, text=t))

In [None]:
writer.commit()

## Search

In [None]:
# Reload the index to ensure it points to the last commit.
index.reload()

In [None]:
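import pandas as pd

def search(query, index, top=20):
    """Run a lexical query against the "text" field and return the hits as a DataFrame."""
    searcher = index.searcher()
    parsed_query = index.parse_query(query, ["text"])
    search_results = searcher.search(parsed_query, limit=top).hits
    res = []
    for score, doc_address in search_results:
        # Fetch the stored fields of the matching document.
        doc = searcher.doc(doc_address)
        res.append({"id": doc["id"][0], "text": doc["text"][0], "score": score})

    return pd.DataFrame(res)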

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

In [None]:
search("Is the climate crisis worse for poorer countries?", index)

In [None]:
search("$10.5 billion", index)