This tutorial from Arcee and AI Makerspace shows how to train an end-to-end (E2E) RAG model, with the retriever and generator trained jointly, to create a Domain Adapted Language Model (DALM). E2E RAG training lets you create performant, contextualized retrievers for your RAG system.

The model we train will learn how to retrieve documents from our context documents and generate responses in a unified system.

We will:

* Clone and Install DALM dependencies
* Prepare a toy dataset
* Run E2E RAG training for our DALM
* Run inference on our trained DALM
* Export our trained retriever model
* Query example DALMs at https://app.arcee.ai

![e2erag](https://i.imgur.com/0uMWN8H.png)




In [None]:
#Make sure you have a GPU enabled
!nvidia-smi

Sat Sep 23 09:36:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Clone DALM repo and Install dependencies

We will clone https://github.com/arcee-ai/DALM and install its dependencies to prepare for our E2E RAG training.





In [None]:
!git clone https://github.com/arcee-ai/DALM

Cloning into 'DALM'...
remote: Enumerating objects: 1366, done.
remote: Counting objects: 100% (253/253), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 1366 (delta 192), reused 180 (delta 160), pack-reused 1113
Receiving objects: 100% (1366/1366), 18.82 MiB | 22.33 MiB/s, done.
Resolving deltas: 100% (782/782), done.


In [None]:
%cd DALM
!pip install --upgrade -e .

/content/DALM
Obtaining file:///content/DALM
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting accelerate (from indomain==0.0.4)
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 258.1/258.1 kB 5.6 MB/s eta 0:00:00
Collecting bitsandbytes (from indomain==0.0.4)
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.6/92.6 MB 10.0 MB/s eta 0:00:00
Collecting datasets (from indomain==0.0.4)
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 519.6/519.6 kB 42.4 MB/s eta 0:00:00
Collecti

# Prepare Dataset of Examples

E2E RAG requires a dataset of [passage, question, answer] triples.

At inference time, our model will take a user's query, retrieve the most relevant of the available passages, and pass them as context to the generator to create an answer.
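
To make this inference flow concrete, here is a minimal sketch. It is not DALM's implementation: a toy word-overlap score stands in for the trained retriever's embedding similarity, and a plain prompt template stands in for the generator input. The helper names `tokenize`, `score`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, passage):
    """Toy relevance score: shared-token count (stand-in for embedding similarity)."""
    return len(tokenize(query) & tokenize(passage))

def retrieve(query, passages, top_k=1):
    """Return the top_k passages ranked by the toy score."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context):
    """Assemble the text a generator would be asked to complete."""
    return "Context: " + " ".join(context) + "\nQuestion: " + query + "\nAnswer:"

passages = [
    "The capital city of France is Paris.",
    "The Moon is Earth's only natural satellite.",
    "Hydrogen is a chemical element with the symbol H.",
]
query = "Capital of France"
context = retrieve(query, passages, top_k=1)
prompt = build_prompt(query, context)
print(prompt)
```

In a trained DALM, both steps are learned models rather than hand-written functions, and E2E training updates them together.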

## Toy Dataset

For this tutorial, we will train on a simple toy dataset to show how E2E RAG training can be accomplished. To build your own DALM, you should train on your own documents.
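
If you are assembling triples from your own documents, a minimal sketch of writing them to CSV with Python's standard `csv` module might look like the following. The column names `Question`, `Abstract`, and `Answer` are chosen to match the toy CSV shipped in the repo; the example rows and the `write_triples` helper are hypothetical. Letting `csv` handle quoting keeps fields that contain commas (such as dates) intact.

```python
import csv
import io

COLUMNS = ["Question", "Abstract", "Answer"]

# Hypothetical example rows; replace with triples drawn from your own documents.
triples = [
    {
        "Question": "Declaration of Independence date",
        "Abstract": "Document signed on July 4, 1776 declaring American independence.",
        "Answer": "July 4, 1776",
    },
    {
        "Question": "Capital of France",
        "Abstract": "The capital city of France is Paris.",
        "Answer": "Paris",
    },
]

def write_triples(rows, handle):
    """Write rows to an open text handle, quoting fields only where needed."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("my_triples.csv", "w", newline="") instead (the file name is an assumption).
buffer = io.StringIO()
write_triples(triples, buffer)

# Read the data back to confirm every row still has exactly three fields.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
```

Reading the file back with `csv.DictReader` is a cheap way to catch rows whose quoting has gone wrong before you start training.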

## Generating Q+A Pairs

If you do not have labeled triples, you can generate QA pairs with:

```
dalm qa-gen dalm/datasets/toy_data_train.csv
```

For example, to generate QA pairs from the `taesiri/arxiv_qa` Hugging Face dataset:

```
!dalm qa-gen "taesiri/arxiv_qa" --output-dir qa-outputs --passage-column-name text --title-column-name title
```



In [None]:
#let's take a look at our toy_data
%cat ./dalm/datasets/toy_data_train.csv

Question,Abstract,Answer
Photosynthesis definition,Process where plants convert light into energy through chlorophyll.,Energy conversion in plants
Author of "Romeo and Juliet",Famous play written by William Shakespeare.,William Shakespeare
Capital of France,The capital city of France is Paris.,Paris
Declaration of Independence date,"Document signed on July 4, 1776 declaring American independence.,July 4", 1776
World's largest ocean,The Pacific Ocean is the largest and deepest ocean on Earth.,Pacific Ocean
Inventor of the telephone,Alexander Graham Bell invented the first practical telephone.,Alexander Graham Bell
Natural satellite of Earth,The Moon is Earth's only natural satellite.,Moon
Element with symbol "H",Hydrogen is a chemical element with the symbol H.,Hydrogen
Novel "Pride and Prejudice" "author,Classic novel written by Jane Austen.",Jane Austen
Human body's powerhouse,Mitochondria are the powerhouse of the cell.,Mitochondria
Light's speed,"The speed of light in a vacuum is ab

# Train DALM model with gpt-neo-125m generator and bge-large-en retriever

For this demo notebook, we will train a small generator, the 125-million-parameter GPT-Neo model, jointly with the `BAAI/bge-large-en` retriever model.

For better results, you will want to use a larger generator such as `meta-llama/Llama-2-7b` together with the `bge-large-en` retriever. You will need a bigger GPU (such as an A100 80GB) to train this, and make sure to adjust the batch size to saturate your GPU memory.
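
As a rough guide to why the larger generator needs a bigger GPU, the sketch below estimates the memory taken by model weights alone. It is back-of-the-envelope arithmetic under stated assumptions (4 bytes per parameter for fp32, 2 for fp16, round parameter counts), not a measurement: gradients, optimizer state, and activations add to this, and the helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# Assumed round parameter counts and precisions.
gpt_neo_fp32 = weight_memory_gb(125e6, 4)   # 125M parameters in fp32
llama_7b_fp16 = weight_memory_gb(7e9, 2)    # 7B parameters in fp16

print(f"gpt-neo-125m weights (fp32): ~{gpt_neo_fp32:.1f} GB")
print(f"7B-parameter weights (fp16): ~{llama_7b_fp16:.1f} GB")
```

Whatever memory remains after weights and training state is what the batch size has to fit into, so it is usually found by increasing `--per-device-train-batch-size` until the GPU is nearly full.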


In [None]:
!dalm train-rag-e2e \
"./dalm/datasets/toy_data_train.csv" \
"BAAI/bge-large-en" \
"EleutherAI/gpt-neo-125m" \
--output-dir "./dalm/training/rag_e2e/rag_e2e_checkpoints" \
--with-tracking \
--report-to all \
--per-device-train-batch-size 1

Downloading (…)lve/main/config.json: 100% 720/720 [00:00<00:00, 3.64MB/s]
Downloading model.safetensors: 100% 1.34G/1.34G [00:05<00:00, 266MB/s]
Downloading (…)okenizer_config.json: 100% 366/366 [00:00<00:00, 2.44MB/s]
Downloading (…)solve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 3.69MB/s]
Downloading (…)/main/tokenizer.json: 100% 711k/711k [00:00<00:00, 10.4MB/s]
Downloading (…)cial_tokens_map.json: 100% 125/125 [00:00<00:00, 772kB/s]
Downloading (…)lve/main/config.json: 100% 1.01k/1.01k [00:00<00:00, 7.07MB/s]
Downloading model.safetensors: 100% 526M/526M [00:01<00:00, 291MB/s]
Downloading (…)okenizer_config.json: 100% 560/560 [00:00<00:00, 3.20MB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 6.96MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 6.87MB/s]
Downloading (…)cial_tokens_map.json: 100% 357/357 [00:00<00:00, 2.43MB/s]
09/23/2023 09:37:46 - INFO - dalm.training.rag_e2e.train_rage2e - Distributed environment: NO
Num processes: 

# Evaluate Retriever Training

Here we will evaluate our contextualized retriever against the passage data that will be in our database at retrieval time.
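
To give an intuition for what evaluating a retriever against a passage index means, here is a minimal sketch of one common retrieval metric, hit rate at k: the fraction of queries whose gold passage appears among the k most similar passages. This illustrates the idea only and is not necessarily what `eval_rag.py` computes; the tiny 2-dimensional vectors stand in for real embeddings, and `cosine` and `hit_rate_at_k` are hypothetical helpers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hit_rate_at_k(query_vectors, passage_vectors, gold_indices, k):
    """Fraction of queries whose gold passage ranks in the top k by cosine similarity."""
    hits = 0
    for query, gold in zip(query_vectors, gold_indices):
        ranked = sorted(
            range(len(passage_vectors)),
            key=lambda i: cosine(query, passage_vectors[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vectors)

# Made-up embeddings: query i is meant to match passage i.
passage_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vectors = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
gold_indices = [0, 1, 2]

print(hit_rate_at_k(query_vectors, passage_vectors, gold_indices, k=1))
print(hit_rate_at_k(query_vectors, passage_vectors, gold_indices, k=2))
```

A retriever adapted to the domain should place the gold passage higher in the ranking, which shows up as a higher hit rate at small k.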

In [None]:
# You must log in to Hugging Face for this eval
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!python dalm/eval/eval_rag.py \
  --dataset_path "./dalm/datasets/toy_data_train.csv" \
  --retriever_name_or_path "BAAI/bge-large-en" \
  --generator_name_or_path "EleutherAI/gpt-neo-125m" \
  --passage_column_name Abstract \
  --query_column_name Question \
  --answer_column_name Answer \
  --evaluate_generator \
  --query_batch_size 5 \
  --retriever_peft_model_path ./dalm/training/rag_e2e/rag_e2e_checkpoints/retriever \
  --generator_peft_model_path ./dalm/training/rag_e2e/rag_e2e_checkpoints/generator

Filter: 100% 19/19 [00:00<00:00, 1220.08 examples/s]
Starting to generate passage embeddings (Number of passages: 19)
100% 3/3 [00:00<00:00, 14.76it/s]
Construct passage index
Evaluation start
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


# Query Example DALM

In this tutorial, we showed how to train a retriever model with a small generator. To see a DALM in production, we can query DALM models on https://app.arcee.ai to get an idea of their behavior at scale.

For example, we can query `DALM-Patent` on https://app.arcee.ai, which has been trained on the USPTO Patent database.

![infergif](https://arcee-public.s3.us-east-2.amazonaws.com/infer.gif)
