# Retrieval augmented generation using Elasticsearch and Llama 3

This notebook demonstrates how to:
- Transform mimacom's handbook into a vector dataset and index it into Elasticsearch.
- Embed a question using `SentenceTransformer`.
- Perform a vector search (k-NN search) and retrieve relevant documents.
- Pass the top results to Llama 3 using __[Ollama](http://ollama.com)__ for RAG.

## Install packages

In [1]:
# install packages
!python3 -m pip install -qU PyPDF2 elasticsearch sentence-transformers ollama

# import modules
import PyPDF2, re, os
from elasticsearch import Elasticsearch

## Connect to Elastic
An Elasticsearch instance must be running before you continue. A docker-compose file for an Elasticsearch container is provided.

In [2]:
# Replace if needed.
ES_HOST = "http://localhost:9200"
ES_INDEX = "ragdemo-1"

client = Elasticsearch(ES_HOST)

# Check that the Elasticsearch instance is running.
print(client.info())

{'name': '7917c3e5768b', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'wFoUA_1xQdaT0elbUn8r1Q', 'version': {'number': '8.8.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'c01029875a091076ed42cdb3a41c10b1a9a5a20f', 'build_date': '2023-05-23T17:16:07.179039820Z', 'build_snapshot': False, 'lucene_version': '9.6.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}
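
If the container is still starting, `client.info()` raises a connection error. A minimal sketch of a readiness check, assuming the `ping()` method of the official Python client, which returns `True` or `False` instead of raising. The helper name `wait_for_elasticsearch` and its retry parameters are hypothetical:

```python
import time

def wait_for_elasticsearch(client, retries=5, delay=2.0):
    """Poll client.ping() until it succeeds or the retries run out."""
    for attempt in range(1, retries + 1):
        if client.ping():
            return True
        print(f"Elasticsearch not reachable (attempt {attempt}/{retries})")
        if attempt < retries:
            # Wait before the next attempt, but not after the last one
            time.sleep(delay)
    return False
```

With the client from the cell above, the call would be `wait_for_elasticsearch(client)`.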


## Create index mapping
Now we create an index mapping.
Elasticsearch uses the `dense_vector` data type to store a numeric representation of the meaning (semantics) of a piece of data.

To perform a kNN search, the query must target a `dense_vector` field.
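
To make "comparing meanings" concrete, here is a small sketch of cosine similarity, the metric that the mapping below selects with `"similarity": "cosine"`. The three-dimensional vectors are made-up toy values, not real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made-up values; the real ones have 384 dims)
car = [0.9, 0.1, 0.0]
vehicle = [0.8, 0.2, 0.1]
holiday = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(car, vehicle) > cosine_similarity(car, holiday))  # True
```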


In [3]:
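# Define index mapping
mappings = {
    "properties": {
        "content": {
            "type": "text"
        },
        "embedding": {
            "type": "dense_vector",
            "dims": 384,  # Must match the encoder's output size
            "index": True,  # Required for kNN search (the default only in newer Elasticsearch releases)
            "similarity": "cosine"  # Also the default only in newer Elasticsearch releases
        }
    }
}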

# Create the index; if it already exists, the error is printed and ignored
try:
    client.indices.create(index = ES_INDEX, mappings = mappings)
except Exception as e:
    print("Create index ignored", e)
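
Catching every exception also hides real failures such as an invalid mapping. A possible alternative, sketched under the assumption that the client's `indices.exists` and `indices.create` methods are available (they are in the 8.x Python client). `ensure_index` is a hypothetical helper name:

```python
def ensure_index(client, index, mappings):
    """Create the index only when it is missing; return True if it was created."""
    if client.indices.exists(index=index):
        return False
    client.indices.create(index=index, mappings=mappings)
    return True
```

In this notebook the call would be `ensure_index(client, ES_INDEX, mappings)`; any other error then surfaces instead of being printed and ignored.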


## Extract data from PDF
For this example we use my company's handbook as data. The data is a PDF, and each page becomes one document.

In [4]:
# Resolve the notebook's working directory (abspath("__file__") is relative to it)
home = os.path.dirname(os.path.abspath("__file__"))

# Cleanup data. Remove the repeated page footer and line breaks.
pattern = r'page \d+ of \d+ Internal Use'
toIndex = []
# The with-block closes the file even if extraction fails
with open(os.path.join(home, 'company_book.pdf'), 'rb') as pdfFile:
    pdfReader = PyPDF2.PdfReader(pdfFile)
    for page in pdfReader.pages:
        text = page.extract_text()
        cleanedText = re.sub(pattern, '', text)
        cleanedText = cleanedText.replace("\n", "")
        toIndex.append(cleanedText)
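
The cleanup step can be tried on a plain string without opening the PDF. The sketch below applies the same footer pattern to a made-up page text. Unlike the cell above, it replaces line breaks with a space rather than removing them, which is an assumption about what reads better, not the notebook's behaviour. `clean_page` is a hypothetical helper:

```python
import re

FOOTER = re.compile(r'page \d+ of \d+ Internal Use')

def clean_page(text):
    """Drop the page footer and collapse all whitespace runs to one space."""
    without_footer = FOOTER.sub('', text)
    return re.sub(r'\s+', ' ', without_footer).strip()

# Made-up page text for illustration
sample = "Company cars\nare available.\npage 3 of 40 Internal Use"
print(clean_page(sample))  # Company cars are available.
```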

## Setup encoders
In this demo we use the `all-MiniLM-L6-v2` model. This encoder embeds text as a 384-dimensional vector, which is why the mapping sets `dims` to 384. Loosely speaking, the model describes the text using 384 learned features.

In [5]:
# Import
import torch
from sentence_transformers import SentenceTransformer

# Set up the encoder and use the GPU (MPS backend) on Apple Silicon when available
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = SentenceTransformer('all-MiniLM-L6-v2').to(device)


In [6]:
# Index the extracted data to ES
for i, t in enumerate(toIndex):
    embeddedContent = model.encode(t, device = device)
    document = {
        'content': t,
        'embedding': embeddedContent  
    }
    # `document` replaces the deprecated `body` parameter in the 8.x client
    response = client.index(index = ES_INDEX, document = document, id = i)
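
Indexing one page per request is fine for a handbook, but larger datasets are usually sent in batches. A sketch, assuming the `elasticsearch.helpers.bulk` helper of the Python client. `build_actions` and its `encode` argument are hypothetical, and vectors are converted to plain lists on the assumption that `encode` returns NumPy arrays, as `SentenceTransformer.encode` does by default:

```python
def build_actions(index, texts, encode):
    """Yield one bulk "index" action per text, using its position as the id."""
    for i, text in enumerate(texts):
        vector = encode(text)
        # NumPy arrays expose tolist(); plain lists pass through unchanged
        if hasattr(vector, "tolist"):
            vector = vector.tolist()
        yield {
            "_index": index,
            "_id": i,
            "_source": {"content": text, "embedding": vector},
        }

# With a live cluster (not run here):
#   from elasticsearch import helpers
#   helpers.bulk(client, build_actions(ES_INDEX, toIndex, model.encode))
```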


## Define knn query and retrieve relevant documents

We define a function that builds an Elasticsearch kNN query, performs the search, and returns the relevant documents.
The query must target the field that contains the vector data, which in our case is called `embedding`. The function takes the embedded question or instruction from the user as input.

In [7]:
def prepareAndDoSearch(embeddedQuery):
    response = client.search(
        index = ES_INDEX,
        knn = {
            'field': 'embedding',
            'query_vector': embeddedQuery,
            'num_candidates': 500,
            'k': 10,
        },
        size = 3
    )

    documents = []
    for hit in response['hits']['hits']:
        # Keep the text as stored; re.escape would add backslashes to the prompt
        documents.append(hit['_source']['content'])
    
    return documents
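
The three numbers in the query interact: each shard considers `num_candidates` vectors, the kNN stage returns the `k` best of them, and `size` caps the final hit list, so with `k = 10` and `size = 3` only three documents come back. A small sketch of a builder that keeps these values consistent. `build_knn_clause` is a hypothetical helper, and the check reflects the Elasticsearch requirement that `num_candidates` be at least `k`:

```python
def build_knn_clause(field, query_vector, k=3, num_candidates=100):
    """Return a dict usable as the knn argument of client.search."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        "field": field,
        # Plain floats serialize to JSON regardless of the input array type
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
    }

knn = build_knn_clause("embedding", [0.1, 0.2, 0.3], k=3, num_candidates=50)
print(knn["k"], knn["num_candidates"])  # 3 50
```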

## Write and encode the prompt
Since the search targets a `dense_vector` field, we must encode our question/instruction with the same encoder that we used for indexing.

To build the final prompt, we attach the text of the relevant documents as context, followed by the actual question.

In [13]:
# Encode the query
query = "Tell me if they provide employees with corporate cars"
embeddedQuery = model.encode(query, device = device)

# Make a prompt based on context and question
documents = prepareAndDoSearch(embeddedQuery)
print("Amount of retrieved documents:" + str(len(documents)))

prompt = "Context: " + ''.join(documents) + "  answer the following question briefly: " + query

Amount of retrieved documents:3
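
Concatenating with `''.join` runs the pages together without any boundary between them. A sketch of a slightly more structured prompt builder, with a hypothetical `build_prompt` helper and an arbitrary character budget (`max_chars` is an assumption, not a value from this notebook):

```python
def build_prompt(documents, question, max_chars=8000):
    """Join documents with blank lines, trim to max_chars, append the question."""
    context = "\n\n".join(documents)
    if len(context) > max_chars:
        # Crude cut-off; it may end mid-sentence
        context = context[:max_chars]
    return (
        "Context:\n" + context
        + "\n\nAnswer the following question briefly: " + question
    )

print(build_prompt(["Page one.", "Page two."], "Are there company cars?"))
```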


## Let Llama shine
`ollama` is a user-friendly platform for running LLMs on your local machine. Through `ollama` we invoke `llama3.1`.

*Note*
The `num_ctx` option sets the size of the context window: here we give the model at most 32000 tokens to process.
By default Ollama sets this value to roughly 2000 tokens, and the maximum the model can handle is 128000 tokens. This limit constrains how much retrieved context fits into a single prompt.
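
Before calling the model it can help to check roughly whether the prompt fits into `num_ctx`. The sketch below uses a crude rule of thumb of about four characters per token for English text. That ratio is an assumption, not the Llama tokenizer, so treat the result only as an estimate. `estimate_tokens` and `fits_context` are hypothetical helpers, and `reserve` is an arbitrary allowance for the generated answer:

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers will differ."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_context(text, num_ctx, reserve=512):
    """True if the estimate leaves `reserve` tokens free for the answer."""
    return estimate_tokens(text) + reserve <= num_ctx

sample_prompt = "x" * 10_000
print(estimate_tokens(sample_prompt))       # 2500
print(fits_context(sample_prompt, 32000))   # True
print(fits_context(sample_prompt, 2000))    # False
```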

In [None]:
import ollama

output = ollama.generate(
  model = "llama3.1",
  prompt = prompt,
  options = {
    "num_ctx": 32000
  }
)

print(output['response'])