# Hybrid Search with Elasticsearch and local embeddings

## Preconditions

To run this notebook, you need an Elasticsearch service (> `v8.9.0`) running on [http://localhost:9200](http://localhost:9200). To use Reciprocal Rank Fusion in the hybrid search (at the end of the notebook), you also need a Platinum or Enterprise subscription: [https://www.elastic.co/subscriptions](https://www.elastic.co/subscriptions).
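
Before running the cells below, it can help to confirm that the running service meets the version requirement. The following is a minimal sketch and not part of the original notebook: the helper names `parse_version`, `meets_minimum` and `fetch_version` are hypothetical, and it assumes that the root endpoint of an unsecured local service returns JSON with a `version.number` field. The network call is left commented out so that the block also runs without a service.

```python
import json
from urllib.request import urlopen

MINIMUM_VERSION = (8, 9, 0)  # taken from the precondition stated above


def parse_version(number: str) -> tuple:
    """Turn a version string such as '8.11.3' or '8.12.0-SNAPSHOT' into (8, 11, 3)."""
    core = number.split("-")[0]  # drop suffixes such as '-SNAPSHOT'
    return tuple(int(part) for part in core.split(".")[:3])


def meets_minimum(number: str, minimum: tuple = MINIMUM_VERSION) -> bool:
    """Return True if the version is strictly newer than the minimum (the text asks for > v8.9.0)."""
    return parse_version(number) > minimum


def fetch_version(es_url: str = "http://localhost:9200") -> str:
    """Read the version number from the root endpoint (assumes no authentication is required)."""
    with urlopen(es_url) as response:
        info = json.load(response)
    return info["version"]["number"]


# Example usage against a running local service:
# print(meets_minimum(fetch_version()))
```

The comparison is strict because the precondition is written as "> `v8.9.0`"; if `8.9.0` itself turns out to be sufficient for your setup, `>=` would be the appropriate operator.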


## Helper Methods

In [None]:
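def show_chunks(docs):
    """Print page number, text and (if present) score of each retrieved chunk."""
    for doc in docs:
        # similarity_search_with_score returns (document, score) tuples
        if isinstance(doc, tuple):
            doc, score = doc
        else:
            score = None
        print("Page", doc.metadata["page"], ":")
        print(doc.page_content.replace('\n', ' '))
        # compare against None so that a score of 0.0 is still printed
        if score is not None:
            print("Score:", score)
        print(100 * "-")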


## Load and chunk Documents

In [None]:
from pathlib import Path

raw_data_file = Path.cwd() / "sbb-geschaeftsbericht-2023.pdf"

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader(str(raw_data_file))
chunks = loader.load_and_split(text_splitter=RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0))
print("total chunks: ", len(chunks))
print('-' * 100)
print("Example:")
chunks[25]


## Create Embeddings using an Open-Source Embedding Model

In [8]:
from langchain_community.embeddings import HuggingFaceEmbeddings

e5_multilingual_embedding_model = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")
e5_multilingual_embedding_model

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='intfloat/multilingual-e5-large', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

## Create Embedding from Question

In [9]:
query = "Wie viel weniger will die SBB bis 2030 ausgeben?"
e5_multilingual_embedding_model.embed_query(query)

[-0.020724598318338394,
 0.003353245323523879,
 -0.04718921333551407,
 -0.04045533016324043,
 0.026072410866618156,
 -0.018294258043169975,
 -0.01445701438933611,
 0.02227230742573738,
 0.06754158437252045,
 0.001783400890417397,
 0.0425213947892189,
 0.030309047549962997,
 -0.012533361092209816,
 -0.02684326469898224,
 -0.026066487655043602,
 -0.05336495116353035,
 0.023507127538323402,
 0.00879379641264677,
 -0.012215628288686275,
 -0.020137257874011993,
 0.03075406886637211,
 -0.04414905235171318,
 -0.01693112961947918,
 -0.000751925224903971,
 -0.007161107379943132,
 0.011271339841187,
 -0.032673366367816925,
 -0.019964460283517838,
 -0.007265779655426741,
 0.01016951072961092,
 -0.0591048002243042,
 -0.0019779279828071594,
 -0.043426934629678726,
 -0.020695149898529053,
 -0.020205454900860786,
 0.011324039660394192,
 0.05723607540130615,
 0.029726767912507057,
 -0.04964124783873558,
 0.01168445497751236,
 -0.027545282617211342,
 0.026116862893104553,
 -0.003331550629809499,
 -0.01

## Setup Connection to Elasticsearch

In [11]:
from langchain_community.vectorstores.elasticsearch import ElasticsearchStore

es_vector_search = ElasticsearchStore(
    es_url="http://localhost:9200",
    index_name="dummy_index_vector_only",
    embedding=e5_multilingual_embedding_model
)

es_vector_search.add_documents(chunks)

['ff513831-389d-4b41-b4af-186d5b3ca2aa',
 '4b018649-1939-4d45-aadd-8fc48351a7cb',
 'e43e69b6-e9e5-4d88-a27b-a11599fcf09c',
 '69bf7870-bde4-4620-a7ac-a113e32e93eb',
 '4021146c-1ba3-45b8-ad11-aa6cfd24ff0a',
 '0c16c0af-c37a-426b-a979-a8cdaab33b95',
 '931c44e2-261c-4da0-83f2-a575a19939af',
 'cfbd1f19-ad30-4235-aef1-c05d3eb6f083',
 'a2933140-d862-46c4-9d1d-f59ff5f0eede',
 '5d08fcc5-aa7d-4a78-b955-6cafd5fad75a',
 '36000474-88da-4689-bd3f-25d7dafb3c6b',
 '07566eaa-54bf-4803-9628-7703a6b7e836',
 '35b3cc1e-e82c-4547-b555-efd487513200',
 '1f0599fd-f0fd-439d-a63d-aee3ee0e241b',
 'fe212d7f-4bbe-4ba0-966e-4cac7c2ca28a',
 'cb27137d-743e-4828-94fd-7fd2444dd924',
 'dc12a627-8ed6-40bf-bf7a-4764bb3fbcb5',
 'e858da0f-a02f-4286-8cef-df2e5ef59538',
 '35097135-51d4-4110-9807-7418003103e4',
 '6b5bd9ca-c8b2-4c67-a660-71c8c561a1d6',
 '08e260c8-185a-4177-b32b-e39e53e44231',
 'ee872ef1-d346-44be-ad9a-5ea8f14cd613',
 '5c11a45b-d5c5-4ac8-8ac7-64ab0e81c70d',
 'b8027420-b9ba-432c-99e6-e5b782119711',
 '0d2af9b6-2c85-

## Search with native Elasticsearch similarity search

In [15]:
similar_documents = es_vector_search.similarity_search_with_score(query, k=4)
show_chunks(similar_documents)

Page 45 :
ten. Trotz eines positiven Konzernergebnisses bleibt der Spar- und Effizienzdruck hoch. Das positive Jahresergeb - nis reicht noch nicht aus, um den notwendigen Schulden - abbau zu erreichen. Für eine nachhaltig finanzierte  SBB  sind jährlich rund 500 Millionen Franken Gewinn nötig. Nur  so können die Schulden abgebaut und anstehende In - vestitionen, unter anderem in neues Rollmaterial und in  Bahnhöfe, getätigt werden. Stabilisierungspaket – nachhaltige   Finanzierung bis 2030.
Score: 0.93489087
----------------------------------------------------------------------------------------------------
Page 45 :
Gesunde Finanzen sind für die Zukunft der SBB  essen - ziell. Das Verschuldungsniveau muss gesenkt werden  können, die Zinslast tragbar sein und das Eignerziel zum Schuldendeckungsgrad erreicht werden. Um dies zu er - wirken, leistet die  SBB  selbst einen grossen Beitrag, in - dem sie bis 2030 rund sechs Milliarden Franken weniger  ausgibt. Erträge aus Immobilien sichern 
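
The scores above are not raw cosine similarities. Assuming the index uses cosine similarity (believed to be the default distance strategy of `ElasticsearchStore`, although the notebook does not state it), Elasticsearch reports the kNN score as `(1 + cosine) / 2`, so that it is never negative. A minimal sketch of converting between the two, with hypothetical helper names:

```python
def cosine_to_es_score(cosine: float) -> float:
    """Map a cosine similarity in [-1, 1] to the score range [0, 1]."""
    return (1.0 + cosine) / 2.0


def es_score_to_cosine(score: float) -> float:
    """Invert the mapping: recover the cosine similarity from a reported score."""
    return 2.0 * score - 1.0


# Score of the first hit shown above
print(es_score_to_cosine(0.93489087))
```

Under this assumption, the first hit's score of 0.93489087 corresponds to a cosine similarity of roughly 0.87.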

## Search with a hybrid approach
To understand this approach better, have a look at [https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking).


In [13]:
es_hybrid_search = ElasticsearchStore.from_documents(
    chunks, 
    e5_multilingual_embedding_model, 
    es_url="http://localhost:9200", 
    index_name="dummy_index_hybrid",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(
        hybrid=True,
    )
)

The following cell only works with a paid Elasticsearch license that supports Reciprocal Rank Fusion (RRF). Alternatively, you can build the fusion yourself, for example with the LangChain **EnsembleRetriever**. Have a look at [https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)

In [18]:
similar_documents = es_hybrid_search.similarity_search(query, k=3)
show_chunks(similar_documents)

AuthorizationException: AuthorizationException(403, 'security_exception', 'current license is non-compliant for [Reciprocal Rank Fusion (RRF)]')
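
Without the required license, the fusion step can also be done on the client side. The following is a minimal sketch of Reciprocal Rank Fusion in plain Python, not the Elasticsearch or LangChain implementation: each document receives the sum of `1 / (k + rank)` over all result lists in which it appears, with ranks starting at 1. The function name `reciprocal_rank_fusion`, the constant `k=60` (a commonly used value) and the example chunk ids are assumptions for illustration.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    result_lists: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked lists of document ids into one ranking.

    Each list is expected to be ordered best-first and to contain every id at most once.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a vector search and a keyword search
vector_hits = ["chunk_b", "chunk_a", "chunk_c"]
keyword_hits = ["chunk_a", "chunk_d", "chunk_b"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

In this example `chunk_a` ends up first because it ranks high in both lists, while `chunk_c` and `chunk_d` each appear in only one list and fall behind. Applied to the notebook, the two input lists would come from a vector search and a keyword search over the same chunks, each chunk identified by a stable id.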