This tutorial from Arcee and AI Makerspace shows how to train an end-to-end (E2E) RAG model, with the retriever and generator trained jointly, to create a Domain Adapted Language Model (DALM). E2E RAG training lets you build performant, contextualized retrievers for your RAG system.

The model we train will learn how to retrieve documents from our context documents and generate responses in a unified system.

We will:

* Install DALM and its dependencies
* Prepare a toy dataset
* Run E2E RAG training for our DALM
* Run inference on our trained DALM
* Export our trained retriever model
* Query example DALMs at https://app.arcee.ai

![e2erag](https://i.imgur.com/0uMWN8H.png)




In [None]:
#Make sure you have a GPU enabled
!nvidia-smi

Sat Sep 23 09:36:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
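
If you would rather check for a GPU from Python than read the table above, a minimal sketch follows. It assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` CSV output; the helper names `parse_gpu_line` and `list_gpus` are illustrative and not part of DALM.

```python
import shutil
import subprocess


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Parse one 'name, memory' CSV row, e.g. 'Tesla T4, 15360 MiB'."""
    name, memory = (part.strip() for part in line.split(","))
    return name, int(memory.split()[0])  # total memory in MiB


def list_gpus() -> list[tuple[str, int]]:
    """Return (name, total memory in MiB) per GPU, or [] if none is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]


if __name__ == "__main__":
    print(list_gpus() or "No GPU detected; enable one before training.")
```

An empty list means no GPU is visible to the runtime, in which case training will be impractically slow.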

# Install the DALM package `indomain`

We will install the DALM package (https://github.com/arcee-ai/DALM) from PyPI, where it is published as `indomain`.

**DALM**




In [1]:
!pip install -qqq indomain

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ...

# Prepare Dataset of Examples

E2E RAG requires a dataset of [passage, question, answer] triples.

At inference time, our model will take a user's query, retrieve the most relevant of the available passages, and pass them as context to the generator to create an answer.
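
The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration only: it scores passages by word overlap instead of learned embeddings, and it stops at building the generator prompt. The function names and the prompt layout are assumptions, not DALM's implementation.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Concatenate retrieved context and the query into a generator prompt."""
    return "\n".join(context) + f"\nQuestion: {query}\nAnswer:"


# Two passages taken from the toy dataset used in this notebook.
passages = [
    "The capital city of France is Paris.",
    "Famous play written by William Shakespeare.",
]
query = "Capital of France"
top = retrieve(query, passages)
print(build_prompt(query, top))
```

In a trained DALM the overlap score is replaced by similarity between retriever embeddings, and the prompt is passed to the generator rather than printed.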

## Toy Dataset

For this tutorial, we will train on a simple toy dataset to show how E2E RAG training works. To build your own DALM, you should train on your own documents.

## Generating Q+A Pairs

If you do not have labeled triples, you can generate QA pairs with:

```
dalm qa-gen path/to/dataset.csv
```

For example, to generate QA pairs from the `taesiri/arxiv_qa` Hugging Face dataset:

```
!dalm qa-gen "taesiri/arxiv_qa" --output-dir qa-outputs --passage-column-name text --title-column-name title
```

For more information, run `dalm qa-gen --help`.



In [2]:
# Load our toy dataset
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/arcee-ai/DALM/main/dalm/datasets/toy_data_train.csv")
df.to_csv("dataset.csv")
df.head()

| Unnamed: 0 | Question | Abstract | Answer |
|---|---|---|---|
| 0 | Photosynthesis definition | Process where plants convert light into energy... | Energy conversion in plants |
| 1 | Author of "Romeo and Juliet" | Famous play written by William Shakespeare. | William Shakespeare |
| 2 | Capital of France | The capital city of France is Paris. | Paris |
| 3 | Declaration of Independence date | Document signed on July 4, 1776 declaring Amer... | 1776 |
| 4 | World's largest ocean | The Pacific Ocean is the largest and deepest o... | Pacific Ocean |
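
Before training, it can help to confirm that every row is a complete triple. The sketch below uses only the standard library and assumes the column names `Question`, `Abstract`, and `Answer` shown above. `find_incomplete_rows` is a hypothetical helper, not part of DALM, and the second sample row is deliberately left incomplete for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("Question", "Abstract", "Answer")


def find_incomplete_rows(csv_text: str) -> list[int]:
    """Return 0-based indices of data rows with a missing or empty field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return [
        index
        for index, row in enumerate(reader)
        if any(not (row[c] or "").strip() for c in REQUIRED_COLUMNS)
    ]


sample = (
    "Question,Abstract,Answer\n"
    "Capital of France,The capital city of France is Paris.,Paris\n"
    "Photosynthesis definition,,Energy conversion in plants\n"  # empty passage
)
print(find_incomplete_rows(sample))  # [1]
```

To check the file written earlier, pass its contents instead of the sample, for instance `find_incomplete_rows(open("dataset.csv").read())`.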


# Train DALM model with gpt-neo-125m generator and bge-large-en retriever

For this demo notebook, we will train a small 125-million-parameter GPT-Neo generator together with a BGE retriever.

For better results, you will want to substitute a larger generator such as `meta-llama/Llama-2-7b` together with `bge-large-en`. You will need a bigger GPU (such as an A100 80GB) to train this, and you should adjust the batch size to saturate your GPU memory.
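
Since you will likely rerun training with different models and batch sizes, one option is to assemble the command from Python. The sketch below uses only the positional arguments and flags that appear in this notebook's training cell; `build_train_command` is a hypothetical helper, not part of DALM.

```python
import shlex


def build_train_command(
    dataset_path: str,
    retriever: str,
    generator: str,
    output_dir: str = "./rag_e2e_checkpoints",
    batch_size: int = 1,
) -> list[str]:
    """Assemble the `dalm train-rag-e2e` argument list used in this notebook."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [
        "dalm", "train-rag-e2e",
        dataset_path, retriever, generator,
        "--output-dir", output_dir,
        "--with-tracking",
        "--report-to", "all",
        "--per-device-train-batch-size", str(batch_size),
    ]


cmd = build_train_command(
    "./dataset.csv", "BAAI/bge-large-en", "EleutherAI/gpt-neo-125m", batch_size=1
)
print(shlex.join(cmd))  # shell-ready string for inspection or copy-paste
```

To launch training from Python, the list can be passed to `subprocess.run(cmd, check=True)`. Raising `batch_size` is the knob to turn when moving to a larger GPU.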


In [3]:
!dalm train-rag-e2e \
"./dataset.csv" \
"BAAI/bge-large-en" \
"EleutherAI/gpt-neo-125m" \
--output-dir "./rag_e2e_checkpoints" \
--with-tracking \
--report-to all \
--per-device-train-batch-size 1

Downloading (…)lve/main/config.json: 100% 720/720 [00:00<00:00, 3.53MB/s]
Downloading model.safetensors: 100% 1.34G/1.34G [00:05<00:00, 265MB/s]
Downloading (…)okenizer_config.json: 100% 366/366 [00:00<00:00, 1.63MB/s]
Downloading (…)solve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 46.0MB/s]
Downloading (…)/main/tokenizer.json: 100% 711k/711k [00:00<00:00, 28.2MB/s]
Downloading (…)cial_tokens_map.json: 100% 125/125 [00:00<00:00, 591kB/s]
Downloading (…)lve/main/config.json: 100% 1.01k/1.01k [00:00<00:00, 5.51MB/s]
Downloading model.safetensors: 100% 526M/526M [00:01<00:00, 366MB/s]
Downloading (…)okenizer_config.json: 100% 560/560 [00:00<00:00, 2.53MB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 3.64MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 47.8MB/s]
Downloading (…)cial_tokens_map.json: 100% 357/357 [00:00<00:00, 1.84MB/s]
10/04/2023 19:52:43 - INFO - dalm.training.rag_e2e.train_rage2e - Distributed environment: NO
Num processes: 

# Evaluate Retriever Training

Here we will evaluate our contextualized retriever against the passage data that will be in our database at retrieval time.
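
One common way to score a retriever is recall@k: the fraction of queries whose gold passage appears among the top k retrieved passages. The sketch below is a generic illustration with made-up rankings. It is not taken from `dalm eval-rag`, whose exact metrics are not shown here.

```python
def recall_at_k(ranked_ids: list[list[int]], gold_ids: list[int], k: int) -> float:
    """Fraction of queries whose gold passage id is in the top-k ranking."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Hypothetical rankings for three queries (passage ids, best match first).
ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
gold = [2, 0, 1]
print(recall_at_k(ranked, gold, k=1))  # 1 of 3 queries hit at rank 1
print(recall_at_k(ranked, gold, k=2))  # 2 of 3 queries hit within rank 2
```

Comparing recall@k for the base retriever and the fine-tuned one on the same passages gives a quick read on whether training helped.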

In [10]:
!dalm eval-rag "./dataset.csv" \
  --retriever-name-or-path "BAAI/bge-large-en" \
  --generator-name-or-path "EleutherAI/gpt-neo-125m" \
  --passage-column-name Abstract \
  --query-column-name Question \
  --answer-column-name Answer \
  --evaluate-generator \
  --query-batch-size 5 \
  --retriever-peft-model-path ./rag_e2e_checkpoints/retriever \
  --generator-peft-model-path ./rag_e2e_checkpoints/generator

Running tokenizer on dataset (num_proc=4): 100% 19/19 [00:00<00:00, 40.43 examples/s]
Filter: 100% 19/19 [00:00<00:00, 187.35 examples/s]
Starting to generate passage embeddings (Number of passages: 19)
100% 3/3 [00:00<00:00,  7.81it/s]
Construct passage index
Evaluation start
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

# Query Example DALM

In this tutorial, we showed how to train a retriever model with a small generator. To see a DALM in production, we can query DALM models on https://app.arcee.ai to get an idea of their behavior at scale.

For example, we can query `DALM-Patent` on https://app.arcee.ai, which has been trained on the USPTO Patent database.

![infergif](https://arcee-public.s3.us-east-2.amazonaws.com/infer.gif)
