<a href="https://colab.research.google.com/github/abdulsamadkhan/Llama2_Chat/blob/main/RAFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAFT
## Creating a RAFT dataset using GPT-3.5
This tutorial is about Retrieval Augmented Fine Tuning (RAFT). In RAFT, given a question and a set of retrieved documents, the model is trained to ignore the documents that do not help answer the question; these are called distractor documents. RAFT also uses chain-of-thought style answers to improve the model's ability to reason. According to the paper, RAFT consistently improves the model's performance across the PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe for adapting pre-trained LLMs to in-domain RAG.




To understand the goal better, the paper draws an analogy between training an LLM and preparing for an exam in a real-world setting.

### Closed-Book Exam:
A closed-book exam refers to a scenario where the LLM does not have access to any additional documents or references to answer the questions during the exam. In our case, this is the scenario in which we fine-tune a general LLM on our domain knowledge, as shown in the figure.

### Open-Book Exam:
This refers to the scenario in which the LLM uses external sources of information (e.g., a website or a book chapter). In such scenarios, the LLM is typically paired with a retriever which retrieves ‘k’ documents (or specific segments of a document), which are appended to the prompt. It is only through these retrieved documents that the LLM gains access to “new knowledge”. As a result, the paper argues that the performance of an LLM trained as a general-purpose model depends largely on the quality of the retriever and on how accurately it can identify the most relevant piece of information.
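
The open-book setup can be sketched in a few lines. This is only a toy illustration and not part of the RAFT pack: the word-overlap `retrieve` function and the `build_prompt` helper are hypothetical stand-ins for a real retriever and prompt template.

```python
def retrieve(question, documents, k=2):
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Append the top-k retrieved documents to the prompt, then the question."""
    retrieved = retrieve(question, documents, k)
    context = "\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(retrieved))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


docs = [
    "Wasim Akram is a Pakistani cricketer.",
    "Lahore is a city in Pakistan.",
    "Swing bowling moves the ball sideways in the air.",
]
prompt = build_prompt("Who is Wasim Akram?", docs, k=1)
print(prompt)
```

If the retriever ranks the wrong document first, the model never sees the relevant text, which is why retriever quality matters so much in this setting.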

### Domain-Specific Open-Book Exam (RAFT):
In this scenario, we know a priori the domain in which the LLM will be tested. The LLM can respond to the prompt using any and all information from this specific domain, on which it has been fine-tuned. The crux is that the LLM will be used to answer questions whose answers can be found within a collection of documents.

### RAFT-DATASET
The main factor that distinguishes RAFT from other methods is its data generation. The paper uses some notation, which I will explain by creating our own RAFT dataset using `gpt-3.5-turbo`.
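
To make the data generation idea concrete, here is a minimal sketch of how one RAFT-style data point could be assembled. It follows the recipe described in the paper, where a fraction `p` of examples keeps the oracle document in the context and the rest contain distractors only. The function `make_raft_example`, its parameters, and the flat list used for the context are assumptions for illustration; the dataset pack used later stores the context in a different shape.

```python
import random


def make_raft_example(question, answer, chunks, oracle_index,
                      num_distract=3, p=0.8, rng=None):
    """Build one RAFT-style data point from a list of text chunks.

    With probability p the oracle chunk stays in the context; otherwise it
    is swapped for another distractor, so the context has distractors only.
    """
    rng = rng or random.Random()
    others = [i for i in range(len(chunks)) if i != oracle_index]
    # Cannot sample more distractors than there are other chunks.
    num_distract = min(num_distract, len(others))
    picked = rng.sample(others, num_distract)
    docs = [chunks[oracle_index]] + [chunks[i] for i in picked]
    if rng.random() >= p:
        # Drop the oracle: replace it with a chunk not already in the context.
        unused = [i for i in others if i not in picked]
        if unused:
            docs[0] = chunks[rng.choice(unused)]
        else:
            docs.pop(0)
    rng.shuffle(docs)
    return {
        "question": question,
        "context": docs,
        "oracle_context": chunks[oracle_index],
        "cot_answer": answer,
    }


chunks = [f"chunk {i}" for i in range(9)]
example = make_raft_example("a question about chunk 0", "a placeholder answer",
                            chunks, oracle_index=0, rng=random.Random(0))
print(example["context"])
```

Shuffling the context keeps the model from learning that the relevant document always sits in a fixed position.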


# Installing Libraries

In [8]:
#required libraries
!pip install llama-index
!pip install llama-index-packs-raft-dataset




In [9]:
# class for generating the RAFTDataset
from llama_index.packs.raft_dataset import RAFTDatasetPack
#from base import RAFTDatasetPack

In [10]:
#loading openAI API Key
import openai
from google.colab import userdata
openai.api_key = userdata.get('OPENAI_KEY')


# Download Data
For this tutorial, I made a text file about Wasim Akram, the well-known Pakistani cricketer. I gathered the information from Wikipedia.

In [11]:
# this data will be stored in output directory ./WasimAkram.txt
!wget  "https://raw.githubusercontent.com/abdulsamadkhan/Llama2_Chat/main/Data/WasimAkram.txt" -O './WasimAkram.txt'



--2024-03-31 06:26:11--  https://raw.githubusercontent.com/abdulsamadkhan/Llama2_Chat/main/Data/WasimAkram.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17286 (17K) [text/plain]
Saving to: ‘./WasimAkram.txt’


2024-03-31 06:26:11 (28.6 MB/s) - ‘./WasimAkram.txt’ saved [17286/17286]



Since we want to use `gpt-3.5-turbo`, we need to create our own `llm` object. If we do not create this object and pass it to `RAFTDatasetPack`, it will use `gpt-4` by default.

In [12]:
#create llm object
from llama_index.llms.openai import OpenAI
llm = OpenAI(temperature=0, n=1, model="gpt-3.5-turbo")

# Arguments of RAFTDatasetPack
Arguments:
* `file_path` - the path at which the document is located
* `llm` - the LLM object; any LLM can be used here. I have used GPT-3.5 Turbo. If no LLM is passed, it picks GPT-4 by default, which is expensive.
* `embed_model` - the embedding model; if none is provided, the default OpenAI embedding model is picked
* `num_questions_per_chunk` - the number of data points / triplets to generate per chunk
* `num_distract_docs` - the number of distractor documents to include per data point / triplet
* `chunk_size` - the size of each chunk in number of tokens


In [13]:


raft_dataset = RAFTDatasetPack(file_path="./WasimAkram.txt",llm=llm,
                                 num_questions_per_chunk=2,
                                 num_distract_docs=3,
                                 )

# Chunk details
After creating the `raft_dataset` object, I explored the `RAFTDatasetPack` class. It uses `SemanticSplitterNodeParser(...)`, which splits documents into semantically similar chunks. In our case it creates 9 chunks.
Note: Because of this splitter, I found that passing different values for the `chunk_size` parameter has no effect.
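
The general idea behind semantic splitting can be sketched as follows: embed each sentence, measure the distance between neighbouring sentences, and start a new chunk wherever that distance is unusually large. The boundaries then depend on the content rather than on a token budget, which is consistent with `chunk_size` having no visible effect. This is a rough sketch and not the implementation of `SemanticSplitterNodeParser`: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `semantic_split` with its `percentile` parameter is hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy embedding: bag-of-words counts (a real splitter uses an embedding model)."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_split(sentences, percentile=95):
    """Start a new chunk wherever neighbouring sentences are unusually dissimilar."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    # Breakpoint threshold: the requested percentile of the neighbour distances.
    ordered = sorted(distances)
    threshold = ordered[min(len(ordered) - 1, int(len(ordered) * percentile / 100))]
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Akram bowled fast.",
    "Akram bowled swing.",
    "Rain delayed the match.",
    "Rain stopped the match.",
]
for chunk in semantic_split(sentences, percentile=50):
    print(chunk)
```

With only four sentences a lower percentile is needed to see a split; on a longer document a high percentile keeps only the sharpest topic changes as boundaries.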



In [14]:
chunks = raft_dataset.get_chunks("./WasimAkram.txt", raft_dataset.chunk_size)
len(chunks)

In [36]:
# generate 3 questions from the first chunk (this calls the LLM)
qs = raft_dataset.generate_instructions_gen(chunks[0], 3)
qs




["What is Wasim Akram's nickname in the cricket world?",
 'How many siblings does Wasim Akram have?',
 'When was Wasim Akram diagnosed with diabetes?']

# Some important points
* The effective number of distractor documents is `min(num_distract_docs, len(chunks)-1)`; in our example, it is `3`.
* Each chunk is passed to the LLM to create questions using a well-crafted prompt -> `generate_instructions_gen(chunk, <number of questions>)`.
* `oracle_context` is the true chunk from which both the questions and the answers are derived. This is termed `D*` in the paper.
* `cot_answer` is generated from a prompt that uses the `oracle_context` and the `question`. This is termed `A*` in the paper.

The prompt for generating `cot_answer` from a given question and chunk is given below:

          prompt = f"""
            Question: {question}\nContext: {chunk}\n
            Answer this question using the information given in the context above. Here is things to pay attention to:
            - First provide step-by-step reasoning on how to answer the question.
            - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
            - End your response with final answer in the form <ANSWER>: $answer, the answer should be succinct.
        """   

In [16]:
dataset = raft_dataset.run()

* For each chunk, `run()` uses `gpt-3.5-turbo` to synthetically generate the data points.
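
Because the prompt asks for quotes wrapped in `##begin_quote##` / `##end_quote##` and a final answer after `<ANSWER>:`, a generated `cot_answer` can be taken apart with a small parser. The helper `parse_cot_answer` below is a hypothetical sketch, and the sample string is made up to match the requested format rather than taken from the generated dataset.

```python
import re


def parse_cot_answer(cot_answer):
    """Split a chain-of-thought answer into quoted evidence, reasoning, and final answer."""
    quotes = re.findall(r"##begin_quote##(.*?)##end_quote##", cot_answer, flags=re.DOTALL)
    # rpartition returns ('', '', text) when the tag is missing.
    reasoning, tag, final = cot_answer.rpartition("<ANSWER>:")
    return {
        "quotes": [q.strip() for q in quotes],
        "reasoning": reasoning.strip() if tag else cot_answer.strip(),
        "final_answer": final.strip() if tag else None,
    }


sample = (
    "The context says ##begin_quote##an example sentence copied from the context##end_quote##, "
    "which answers the question.\n<ANSWER>: an example answer"
)
parsed = parse_cot_answer(sample)
print(parsed["quotes"])
print(parsed["final_answer"])
```

A check like this is handy for filtering out rows where the model did not follow the requested format.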


In [38]:
#printing the dataset
dataset

Dataset({
    features: ['id', 'type', 'question', 'context', 'oracle_context', 'cot_answer', 'instruction'],
    num_rows: 20
})

## Mapping to notation:
 * `question` maps to `Q` in the paper
 * `context` maps to the set of documents `D_k` in the paper
 * `oracle_context` maps to `D*` in the paper
 * `cot_answer` maps to `A*` in the paper
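
As a sketch of how this notation lines up with a row, the hypothetical helper `to_training_pair` below builds a prompt from `Q` and `D_k` and uses `A*` as the target. It assumes the row layout seen in this notebook, where the documents sit under `row['context']['sentences'][0]`; the prompt template and the mock row are made up for illustration, and the dataset's own `instruction` column is not reconstructed here.

```python
def to_training_pair(row):
    """Map one dataset row to (prompt, completion, oracle_present) in the paper's notation."""
    q = row["question"]                    # Q
    d_k = row["context"]["sentences"][0]   # D_k: the documents shown to the model
    d_star = row["oracle_context"]         # D*: the chunk the question came from
    a_star = row["cot_answer"]             # A*: the chain-of-thought answer
    documents = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(d_k))
    prompt = f"{documents}\n\nQuestion: {q}"
    return prompt, a_star, d_star in d_k


# Mock row with the assumed layout (not real dataset content).
mock_row = {
    "question": "an example question",
    "context": {"title": [["t1", "t2"]], "sentences": [["a distractor chunk", "the oracle chunk"]]},
    "oracle_context": "the oracle chunk",
    "cot_answer": "some reasoning <ANSWER>: an example answer",
}
prompt, completion, oracle_present = to_training_pair(mock_row)
print(prompt)
print(oracle_present)
```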

In [18]:
#oracle document is relevant, distractor document is irrelevant
#exploring the first data point
len(dataset[0]['context']['sentences'][0])

3

In [19]:
output_path = "./RAFTDATA"
# Save as .arrow format
dataset.save_to_disk(output_path)

# Save as .jsonl format
dataset.to_json(output_path + ".jsonl")

Saving the dataset (0/1 shards):   0%|          | 0/20 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

279797
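
The `.jsonl` file holds one JSON object per line, so it can be inspected with the standard library alone. The `read_jsonl` helper below is a hypothetical sketch; the demo writes a tiny stand-in file so that it runs without the generated dataset.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a stand-in file; point `path` at "./RAFTDATA.jsonl" for the real data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"question": "q1", "cot_answer": "a1"}) + "\n")
        f.write(json.dumps({"question": "q2", "cot_answer": "a2"}) + "\n")
    rows = read_jsonl(path)

print(len(rows), rows[0]["question"])
```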

https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-raft-dataset/examples/raft_dataset.ipynb
https://github.com/ShishirPatil/gorilla/tree/main/raft
