#### Building the Retrieval-Augmented Generation (RAG) model

In this notebook, we describe the whole process of building a RAG model using open-source language models.

We start with a high-level overview of the system, then implement each component and set it up for production.

![Rag system highlevel](./images/RAG-high-level.jpeg)

RAG high-level overview ([source](https://blog.griddynamics.com/retrieval-augmented-generation-llm/))

A RAG system is an Information Retrieval and Question Answering system that uses Natural Language Processing (or Generative AI, for those who care about buzzwords) to answer users' questions using a knowledge base.
The main benefit of a RAG system is that you can use your own knowledge base to answer questions. This limits hallucination and, more importantly, helps the system answer questions using knowledge from a specific domain.

### How does the RAG system work?

A canonical RAG pipeline is built from the components illustrated in the figure below.
![RAG system](./images/RAG-process-flow-scheme.jpg)

The nine steps in the figure above can be grouped into three major stages:

knowledge encoding, retrieval, and generation.

#### Knowledge encoding

We start with a corpus of documents, which can be a collection of PDF documents containing information relevant to the questions, or simply an internal website with documentation. We split each document into manageable chunks, such as paragraphs, compute an embedding vector for each chunk, and save the vectors in a vector database.
This part corresponds to steps 1, 2, 3, and 4 in the figure above.
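
The encoding stage can be sketched roughly as follows. This is only an illustration: it assumes paragraphs are separated by blank lines, `toy_embed` is a hypothetical stand-in for a real embedding model (it simply hashes words into a small vector), and a plain Python list stands in for the vector database.

```python
import hashlib
import math


def split_into_chunks(document: str) -> list[str]:
    """Split a document into paragraph chunks on blank lines."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vector = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def build_index(documents: list[str]) -> list[dict]:
    """Chunk every document and store each chunk next to its embedding."""
    index = []
    for doc_id, document in enumerate(documents):
        for chunk in split_into_chunks(document):
            index.append({"doc_id": doc_id, "text": chunk, "embedding": toy_embed(chunk)})
    return index


corpus = [
    "First paragraph about RAG.\n\nSecond paragraph about retrieval.",
    "A single-paragraph document.",
]
index = build_index(corpus)
print(len(index))  # 3 chunks in total
```

In a real system the list would be replaced by a vector database and `toy_embed` by the embedding model built in the rest of this notebook.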

#### Retrieval 

This step retrieves the documents most similar to the question from the document database. Given a question, we encode it to obtain its embedding, then query the vector database with a similarity search to retrieve the context relevant to the question. This part corresponds to steps 5 and 6 in the figure.
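
A minimal sketch of the similarity search, assuming cosine similarity as the scoring function (the text does not fix a particular metric). The three-dimensional embeddings are hard-coded and purely hypothetical; a real vector database would do this ranking internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, index, top_k=2):
    """Return the top_k (score, text) pairs, best match first."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings, made up for illustration only.
index = [
    ("chunk about viruses", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about vaccines", [0.7, 0.6, 0.1]),
]
query_embedding = [1.0, 0.2, 0.0]
results = retrieve(query_embedding, index, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```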

#### Generation step

In this step, we take the question and the retrieved context and feed them into the language model, which generates an answer to the question given the context.
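
As a rough illustration, the question and the retrieved context can be combined into a single prompt before calling the language model. The `question: … context: … answer:` layout mirrors the test input used later in this notebook; whether the generator expects exactly this template is an assumption, and `build_prompt` is a hypothetical helper. The call to the language model itself is left out.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Join the retrieved chunks and place them after the question."""
    context_block = " ".join(c.strip() for c in contexts)
    # The trailing "answer:" leaves the completion to the language model.
    return f"question: {question.strip()} context: {context_block} answer:"


prompt = build_prompt(
    "what cause covid",
    ["SARS-CoV-2 causes Covid-19.", "It spread worldwide."],
)
print(prompt)
```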

Enough talking; let us build the system.

#### The retrieval

The retrieval system has three main components: the language model, the APIs, and the database.

In [1]:
from sentence_transformers import SentenceTransformer



In [2]:
embedding_model_id = 'michiyasunaga/BioLinkBERT-large'

In [3]:
from pathlib import Path

In [4]:
model_cache = Path().cwd().joinpath("models")

In [5]:
sentence_transformer = SentenceTransformer(embedding_model_id, cache_folder=model_cache)

No sentence-transformers model found with name /Users/esp.py/Projects/Personal/end-to-end-rag/models/michiyasunaga_BioLinkBERT-large. Creating a new one with MEAN pooling.
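
The message above says that the checkpoint ships without a sentence-transformers configuration, so the library wraps it with mean pooling over the token embeddings. For intuition, here is a rough pure-Python sketch of masked mean pooling for one sequence. The numbers are made up, `mean_pool` is a hypothetical helper, and real code would operate on tensors rather than nested lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, ignoring padded positions."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 marks a real token, 0 marks padding
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    count = max(count, 1)  # guard against an all-padding sequence
    return [t / count for t in totals]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padded row is excluded from the average, which is why the attention mask has to be passed along with the token embeddings.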


In [6]:
from src.retriever.embedding_model import SBertOnnxConfig, CustomEmbeddingBertModel

In [7]:
from transformers import AutoTokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained(embedding_model_id)

In [9]:
bert_embedding_path = model_cache.joinpath(embedding_model_id.replace("/", "_"))

In [10]:
bert_embedding_path

PosixPath('/Users/esp.py/Projects/Personal/end-to-end-rag/models/michiyasunaga_BioLinkBERT-large')

In [11]:
base_model = CustomEmbeddingBertModel.from_pretrained(bert_embedding_path)

In [12]:
bert_embedding_onnx_path = bert_embedding_path.parent.joinpath("onnx")

In [13]:
bert_embedding_onnx_path.mkdir(exist_ok=True, parents=True)

In [14]:
onnx_model_config = SBertOnnxConfig.from_model_config(base_model.config)

In [15]:
from torch.onnx import export

In [16]:
test_input = f"question: Is cytokeratin immunoreactivity useful in the diagnosis of short-segment Barrett's oesophagus in Korea? context: Cytokeratin 7/20 staining has been reported to be helpful in diagnosing Barrett's oesophagus and gastric intestinal metaplasia. However, this is still a matter of some controversy. To determine the diagnostic usefulness of cytokeratin 7/20 immunostaining for short-segment Barrett's oesophagus in Korea. In patients with Barrett's oesophagus, diagnosed endoscopically, at least two biopsy specimens were taken from just below the squamocolumnar junction. If goblet cells were found histologically with alcian blue staining, cytokeratin 7/20 immunohistochemical stains were performed. Intestinal metaplasia at the cardia was diagnosed whenever biopsy specimens taken from within 2 cm below the oesophagogastric junction revealed intestinal metaplasia. Barrett's cytokeratin 7/20 pattern was defined as cytokeratin 20 positivity in only the superficial gland, combined with cytokeratin 7 positivity in both the superficial and deep glands. Barrett's cytokeratin 7/20 pattern was observed in 28 out of 36 cases (77.8%) with short-segment Barrett's oesophagus, 11 out of 28 cases (39.3%) with intestinal metaplasia at the cardia, and nine out of 61 cases (14.8%) with gastric intestinal metaplasia. The sensitivity and specificity of Barrett's cytokeratin 7/20 pattern were 77.8 and 77.5%, respectively. answer: Barrett's cytokeratin 7/20 pattern can be a useful marker for the diagnosis of short-segment Barrett's oesophagus, although the false positive or false negative rate is approximately 25%."
encoded_input = tokenizer([test_input],
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)

In [17]:
_ = encoded_input.pop("token_type_ids")

In [18]:
model_output = base_model(**encoded_input)

In [19]:
custom_model_embeddings = model_output.last_hidden_state.detach().numpy().reshape(-1)

In [20]:
learned_embeddings = sentence_transformer.encode(test_input)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [21]:
from numpy.testing import assert_array_almost_equal

In [22]:
assert_array_almost_equal(custom_model_embeddings, learned_embeddings, decimal=5)

In [23]:
from torch.onnx import export as torch_onnx_export

In [24]:
torch_onnx_export(
    base_model,
    tuple(encoded_input.values()),
    f=bert_embedding_onnx_path.joinpath('bio-bert-embedder.onnx'),
    input_names=['input_ids', 'attention_mask'],
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'},
                  'attention_mask': {0: 'batch_size', 1: 'sequence'},
                  'last_hidden_state': {0: 'batch_size', 1: 'sequence'}},
    do_constant_folding=True,
    opset_version=13,
)



In [25]:
base_model.config.save_pretrained(bert_embedding_onnx_path)

### Models Comparison

Let us now compare the embeddings produced by the exported ONNX model with those produced by the sentence-transformers model.

In [26]:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction
sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained(str(bert_embedding_path))
onnx_model = ORTModelForFeatureExtraction.from_pretrained(
    bert_embedding_onnx_path)
inputs_2 = tokenizer([sentences[0], sentences[1]],
                     padding="longest", return_tensors="pt")


The ONNX file bio-bert-embedder.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.


In [27]:
inputs_2.pop("token_type_ids")

tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])

In [28]:
inputs_2

{'input_ids': tensor([[    2,  1805,  1744,  1683,  6239, 21011,     3],
        [    2,  2562, 21011,  1744, 10215,     3,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}

In [29]:
onnx_inputs = {
    "input_ids": inputs_2.get("input_ids").numpy(),
    "attention_mask": inputs_2.get("attention_mask").numpy(),
}
output_two = onnx_model.model.run(None, onnx_inputs)

In [30]:
onnx_embeddings = output_two[0]

In [31]:
sentence_transformer_embeddings = sentence_transformer.encode(sentences)

In [32]:
sentence_transformer_embeddings.shape

(2, 1024)

In [33]:
onnx_embeddings[0]

array([-0.31763083,  0.4523321 ,  0.50023586, ...,  0.13559678,
       -0.1584263 , -0.04599122], dtype=float32)

In [35]:
output_two

[array([[-0.31763083,  0.4523321 ,  0.50023586, ...,  0.13559678,
         -0.1584263 , -0.04599122],
        [ 0.02632874,  0.01244779, -0.08892691, ...,  0.13396056,
          0.11261779,  0.08904688]], dtype=float32)]
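
One way to quantify such a comparison is to compute, row by row, the largest absolute difference and the cosine similarity between two sets of embeddings. The sketch below is a pure-Python illustration with made-up vectors, and `compare_embeddings` is a hypothetical helper. In this notebook the inputs would be `onnx_embeddings` and `sentence_transformer_embeddings`.

```python
import math


def compare_embeddings(a, b):
    """Return (max absolute difference, cosine similarity) for each pair of rows."""
    report = []
    for row_a, row_b in zip(a, b):
        max_diff = max(abs(x - y) for x, y in zip(row_a, row_b))
        dot = sum(x * y for x, y in zip(row_a, row_b))
        norm = math.sqrt(sum(x * x for x in row_a)) * math.sqrt(sum(y * y for y in row_b))
        report.append((max_diff, dot / norm if norm else 0.0))
    return report


# Made-up vectors: the second pair differs only in scale.
reference = [[1.0, 0.0], [0.0, 2.0]]
candidate = [[1.0, 0.0], [0.0, 2.5]]
report = compare_embeddings(reference, candidate)
for max_diff, cosine in report:
    print(f"max abs diff = {max_diff:.4f}, cosine = {cosine:.4f}")
```

Reporting both numbers is deliberate: cosine similarity ignores a difference in scale, while the maximum absolute difference exposes it.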

Note that there is some odd behaviour here, but it is not expected to occur in production.

### Converting the Model to ONNX

The model was successfully exported to ONNX. The next step is to try it for inference.

### Making inference

In [1]:
import numpy as np
import tritonclient.http as httpclient

Run the following command in a terminal:

```
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size 128M -v ${PWD}/models_repository/retrieval:/models espymur/triton-onnx:dev tritonserver --model-repository=/models
```

In [74]:
client = httpclient.InferenceServerClient(url="127.0.0.1:60815")

In [75]:
text_input = httpclient.InferInput('TEXT', shape=[-1], datatype='BYTES')

In [76]:
outputs = httpclient.InferRequestedOutput('sequences_text', binary_data=False)

In [77]:
sentences = ["what cause covid"]

In [78]:
np_input_data = np.asarray([sentences], dtype=object)

text = httpclient.InferInput('TEXT', [1], "BYTES")
text.set_data_from_numpy(np_input_data.reshape([1]))

<tritonclient.http._infer_input.InferInput at 0x131182680>

In [79]:
results = client.infer(model_name="ensemble_model", inputs=[text], outputs=[outputs])

InferenceServerException: [400] unexpected inference output 'sequences_text' for model 'ensemble_model'
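
The 400 error above means the requested output name is not one that the model declares. A possible way to check is to ask the server for the model metadata and list its output names. The sketch below assumes the HTTP client's `get_model_metadata` returns a dictionary with an `outputs` list. The helper names and the example metadata (including `hypothetical_output`) are made up, and `fetch_output_names` needs a running server, so it is defined but not called here.

```python
def declared_output_names(model_metadata: dict) -> list[str]:
    """Extract the output tensor names from a model-metadata dictionary."""
    return [output["name"] for output in model_metadata.get("outputs", [])]


def fetch_output_names(url: str, model_name: str) -> list[str]:
    """Ask a running Triton server which outputs `model_name` declares."""
    import tritonclient.http as httpclient  # imported lazily; needs a live server

    client = httpclient.InferenceServerClient(url=url)
    return declared_output_names(client.get_model_metadata(model_name))


# Made-up metadata, shaped like the JSON the HTTP client is assumed to return.
example_metadata = {
    "name": "ensemble_model",
    "outputs": [{"name": "hypothetical_output", "datatype": "BYTES", "shape": [-1]}],
}
names = declared_output_names(example_metadata)
print(names)
```

The name passed to `InferRequestedOutput` and to `results.as_numpy` should be one of the names returned by the server.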

In [71]:
inference_output = results.as_numpy('sequences_text')

In [45]:
from pprint import pprint

In [46]:
pprint(inference_output)

array('what cause covid-19.]? context: The novel coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes coronavirus disease 2019 (Covid-19), has rapidly spread throughout the world leading to hundreds of deaths. The prevalence of antibodies against SARS-CoV-2 in the general population is unknown. To determine the prevalence of anti-SARS-CoV-2 antibodies in the general population in Buenos Aires, Argentina. We performed a cross-sectional study on a representative sample of the general population aged 18 years or older from the province of Buenos Aires, Argentina using an enzyme-linked immunosorbent assay (ELISA) to detect anti-SARS-CoV-2 antibodies. The overall prevalence of anti-SARS-CoV-2 antibodies was 25% (95% confidence interval [95% CI], 22% -31%). The prevalence was higher in women (34%) than in men (19%) (p = 0.002). The age-adjusted prevalence was higher in women (33%) than in men (19%) (p = 0.005). The highest prevalence was found in individual

The embedding model is working as expected. Let us write the code for the API.