# Techniques for Improving the Effectiveness of RAG Systems

---

> If you haven't already, please visit the [main course page](https://apps.learn.learn.nvidia.com/learning/course/course-v1:DLI+S-FX-20+V1/block-v1:DLI+S-FX-20+V1+type@sequential+block@43eee6e2d779407286f142ccb8483fe0/block-v1:DLI+S-FX-20+V1+type@vertical+block@e2b8cfd88f2a45c89ef908f7929c266c) and watch the introduction presentation video.

## Lesson 00: Introduction

Welcome to *Techniques for Improving the Effectiveness of RAG Systems*.

In this workshop, you will learn techniques that can take your RAG system from an interesting proof-of-concept to a serious asset. 

We'll cover the design of hybrid retrievers, the use of multiple smaller fine-tuned expert models instead of a single large general-purpose model, and methods to evaluate RAG performance with each iterative design change, using both human-as-a-judge and LLM-as-a-judge evaluation frameworks. 

With the lessons learned in this workshop, you’ll be able to build applications that deliver on the expectations of what serious LLM-based RAG applications can do.

## Workshop Structure

In addition to this introduction, the workshop consists of four lesson notebooks, which you will be working through in order.

- **Lesson 1: Exploring and Preparing your Dataset for Retrieval.** We will prepare the data we will use in the rest of the app, using strategies for splitting data into chunks for easy retrieval and using an LLM on ingest to facilitate other use cases. We will use the Router, Chunker, and LLM.

- **Lesson 2: Loading the Vector/Document Database.** We will create indexes with which to search our data--particularly vector indexes that rely on representations of the text as vectors (embeddings). We will use the Router, Embedder, and Hybrid Search.

- **Lesson 3: Evaluating Retrieval.** We will implement an interface that allows us to collect data on the performance of our app--a notoriously difficult challenge with the wide scope of many language use cases. We will use the Router, Hybrid Search, LLM, Judge UI, and Human Eval Database.

- **Lesson 4: Better Generations.** We will combine the previous elements into the final web app, including an initial triage step to assess the user's intent and intelligently select the right settings for the search and LLM prompt. We will use the App, the Hybrid Search, and the LLM.



---

## RAG Application

You will be working with a RAG application developed largely for internal use at NVIDIA.

**The final RAG web app that we are going to build together...**


<div style="text-align: center;">
<img src="./img/Librarian.png" width="850" alt="Librarian Final Web RAG app">
</div>

**... relying on the below architecture and its components:**


<div style="text-align: center;">
<img src="img/00_overview.png" width="850" alt="Overall Architecture">
</div>

Here is a brief overview of the components and services used in this course:
1. the Router: coordinates data movement between services and components
2. the Chunker: splits long texts into more manageable pieces
3. the Embedder: converts text to numbers that encode the meaning of the text
4. the Hybrid Search: enables retrieval of the chunked and embedded text in addition to typical keyword search
5. the LLM (large language model) Service: synthesizes retrieved text into something useful
6. the Judge UI: allows data collection to evaluate system performance
7. the Human Eval Database: stores results from the Judge UI
8. the App: makes the whole system easy to use

Note that each notebook will focus on a subset of the components, until the end when they are brought together into the full app.

---

## Application Microservices

## Modular RAG System Components

We're going to be building a modular RAG system in this course--the foundation for a robust, scalable app. Each RAG component will be running in its own container. You will be launching these containers below for your present work, but we are also providing you with all the assets and source code needed to launch them yourself at another time.

To take a look at the source code, navigate in the left-hand panel to the various directories, each of which represents a different component such as `chunking` or `router`. The source code for all these containers is yours to use as you see fit! It's ultimately intended as a starting point or inspiration for any application that you might be looking to build.

---

## Launch the Application Components

As a first order of business, we are going to launch all the services that you will be working with throughout the workshop. To do this we are going to use `docker-compose`.

The command `docker-compose up -d` will look at `docker-compose.yml` and run container-based services based on the configuration there. For the sake of time we've prebuilt the Docker images for you, but again, you can build these container images yourself at a later time using the provided source code in the component directories.

Launching the component services will take about a minute. The containers themselves spin up faster than that, but some wait for others to meet specific initialization criteria because of interdependencies.

To run `docker-compose up -d`, copy and paste the command below into the open terminal tab and run it. You'll find the Jupyter Lab terminal tab already open near the top of the Jupyter Lab environment, next to the currently focused "Lesson 00.ipynb" tab. We *could* run this command here in the notebook, but its output is lengthy and can cause the notebook to stall.

```
docker-compose up -d
```

---

## View the Services

We can view all the container services launched by `docker-compose` with the `docker-compose ps` command, which we run here with some additional formatting to make the output easier to read.

Note: we are now running bash commands directly in this notebook, which uses a `bash` kernel (the code execution backend). We did not do this above because `docker-compose up -d` streams output that can get quite long and cause the Jupyter notebook to hang.

```
docker-compose ps --format "table {{.Service}}\t{{.State}}\t{{.Ports}}"
```

```
SERVICE    STATE     PORTS
chunking   running   0.0.0.0:5005->5005/tcp, :::5005->5005/tcp
judge      running   0.0.0.0:5007->5007/tcp, :::5007->5007/tcp
mixtral    running   0.0.0.0:9998-9999->9998-9999/tcp, :::9998-9999->9998-9999/tcp
redis      running   6379/tcp, 8001/tcp
router     running   0.0.0.0:5006->5006/tcp, :::5006->5006/tcp
triton     running   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp
web        running   0.0.0.0:5000->5000/tcp, :::5000->5000/tcp
```
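
Because some containers wait on others during startup, it can be useful to confirm that every service has reached the `running` state before moving on. The following is a minimal sketch and not part of the workshop's provided code: `not_running` is a hypothetical helper that reads the table printed by `docker-compose ps` and lists any service whose state is something other than `running`. The sample input, including the `exited` state for `redis`, is invented for illustration.

```shell
# Hypothetical helper: read a `docker-compose ps` table on stdin and print the
# name of every service whose STATE column is not "running".
# Exits non-zero if at least one such service is found.
not_running() {
  awk 'NR > 1 && NF > 0 && $2 != "running" { print $1; bad = 1 } END { exit bad + 0 }'
}

# Invented sample input in the same table layout as the output above.
sample='SERVICE    STATE     PORTS
chunking   running   0.0.0.0:5005->5005/tcp
redis      exited    6379/tcp'

# `|| true` keeps the script going even though the helper reports a problem.
stalled=$(printf '%s\n' "$sample" | not_running) || true
echo "not running: $stalled"

# Against the live services, the helper would be fed the real table:
#   docker-compose ps --format "table {{.Service}}\t{{.State}}" | not_running
```

With the sample input this prints `not running: redis`. When every service is up, the helper prints nothing and exits with status 0.
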


---

## Viewing Service Logs

We can use `docker-compose logs <service_name>` to view the logs for any of the running services. We can obtain the names of the services from the output above, or by inspecting `docker-compose.yml`. Here we look at the logs for a few of the services as an example.

```
docker-compose logs chunking
```

```
chunking-1  | INFO:     Started server process [7]
chunking-1  | INFO:     Waiting for application startup.
chunking-1  | INFO:     Application startup complete.
chunking-1  | INFO:     Uvicorn running on http://0.0.0.0:5005 (Press CTRL+C to quit)
chunking-1  | INFO:     108.31.235.14:49205 - "GET / HTTP/1.1" 307 Temporary Redirect
chunking-1  | INFO:     108.31.235.14:49205 - "GET /docs HTTP/1.1" 200 OK
chunking-1  | INFO:     108.31.235.14:49205 - "GET /openapi.json HTTP/1.1" 200 OK
chunking-1  | INFO:     172.19.0.2:38518 - "POST /api/chunking HTTP/1.1" 200 OK
chunking-1  | INFO:     172.19.0.2:35340 - "POST /api/chunking HTTP/1.1" 200 OK
chunking-1  | INFO:     172.19.0.2:42638 - "POST /api/chunking HTTP/1.1" 200 OK
chunking-1  | INFO:     172.19.0.2:52374 - "POST /api/chunking HTTP/1.1" 200 OK
chunking-1  | INFO:     172.19.0.2:46368 - "POST /api/chunking HTTP/1.1" 200 OK
```

```
docker-compose logs triton
```

```
triton-1  | 
triton-1  | == Triton Inference Server ==
triton-1  | 
triton-1  | NVIDIA Release 22.01 (build 31237563)
triton-1  | 
triton-1  | Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-1  | 
triton-1  | Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
triton-1  | 
triton-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
triton-1  | By pulling and using the container, you accept the terms and conditions of this license:
triton-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
triton-1  | 
triton-1  | I0809 18:56:38.813183 7 metrics.cc:298] Collecting metrics for GPU 0: NVIDIA A100 80GB PCIe
triton-1  | I0809 18:56:39.060123 7 libtorch.cc:1227] TRITONBACKEND_Initialize: pytorch
```

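
Compose colorizes the service-name prefix of each log line, so logs that are captured to a file or pasted elsewhere can contain ANSI escape sequences. The sketch below shows one way to remove them after the fact; `strip_ansi` is a hypothetical helper name, and the colored sample line is constructed by hand for illustration.

```shell
# The ESC control character that starts every ANSI color sequence.
esc=$(printf '\033')

# Hypothetical helper: remove ANSI color sequences (ESC [ ... m) from stdin.
strip_ansi() {
  sed "s/${esc}\[[0-9;]*m//g"
}

# Hand-built sample: a colored service prefix followed by a log message.
colored="${esc}[36mchunking-1  | ${esc}[0mINFO:     Application startup complete."

plain=$(printf '%s\n' "$colored" | strip_ansi)
echo "$plain"

# Against the live services, built-in flags can avoid the post-processing:
#   docker-compose logs --no-color --tail=20 chunking
```

The `--no-color` and `--tail` options shown in the final comment produce monochrome output and limit it to the most recent lines, which is usually simpler than filtering when you control the command being run.
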
---

## (Optional) Stopping and Restarting the Services

If at any point you need to stop and restart the services, for example because you inadvertently crash one of the applications, you can restart all of them by executing the `restart.sh` script below. The script essentially runs `docker-compose down && docker-compose up -d` and also resets the state of the `redis` service. Setting up that state is something you will be doing later in the workshop; it takes time, and we would not want you to have to repeat it if a restart is required.

```
./restart.sh
```

```
Bringing containerized services down...
```
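
For orientation, the down-then-up portion of a restart can be sketched as follows. This is an illustrative approximation only, not the contents of `restart.sh`: the `restart_services` function and the `COMPOSE` variable are hypothetical names, and the sketch deliberately leaves out the `redis` state handling that the real script performs.

```shell
# Command used to talk to Compose. Setting COMPOSE="echo docker-compose"
# turns the function below into a dry run that only prints each step.
COMPOSE="${COMPOSE:-docker-compose}"

# Bring all services down, then start them again in detached mode.
# The second step runs only if the first one succeeds.
# NOTE: unlike restart.sh, this sketch does not touch the redis state.
restart_services() {
  $COMPOSE down && $COMPOSE up -d
}

# Dry run in a subshell: print the commands without executing them.
( COMPOSE="echo docker-compose"; restart_services )
```

The dry run prints the two commands in order, which is a low-risk way to check the sequence before running it for real. During the workshop, prefer the provided `restart.sh`.
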


---

## Next Lesson

Move to the next lesson by double-clicking *Lesson 01.ipynb* in the file browser on the left-hand side of your Jupyter Lab environment.