In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")

Enter Your OpenAI API Key:··········


# FLARE

### What is FLARE?

[FLARE](https://arxiv.org/pdf/2305.06983.pdf) stands for `Forward-Looking Active REtrieval` augmented generation. The method was introduced in 2023 in the paper "Active Retrieval Augmented Generation" by Zhengbao Jiang et al. The original implementation lives in [the paper's GitHub repo](https://github.com/jzbjyb/FLARE), from which the LangChain implementation draws heavy inspiration.


## 🚀 **Why FLARE Matters**

- Enhances large language models by retrieving information from external sources while they generate content.

- Tackles hallucination, where a model invents facts or states them incorrectly.

- Aims for outputs that are more accurate and better informed.

## 🔄 **FLARE's Approach:**
- Starts by answering the question.

- Flags "uncertain" tokens in the draft, based on their log probabilities.

- Has an LLM form questions whose answers are the uncertain spans.

- Looks up relevant documents for those questions.

- Uses these documents to keep generating.

- Repeats the process until done.
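
The "uncertain token" step above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the function name, the sample tokens, and the log probabilities are made up for the example, and the merging rule (`min_token_gap`) is a simplified assumption about how neighbouring uncertain tokens get grouped into one span.

```python
import math


def low_confidence_spans(tokens, log_probs, min_prob=0.7, min_token_gap=2):
    """Group tokens whose probability is below min_prob into text spans.

    Uncertain tokens that sit fewer than `min_token_gap` positions apart are
    merged, so a phrase is queried once rather than token by token.
    (Hypothetical helper for illustration only.)
    """
    # Indices of tokens whose probability exp(log_prob) is below the threshold.
    uncertain = [i for i, lp in enumerate(log_probs) if math.exp(lp) < min_prob]
    if not uncertain:
        return []

    # Merge nearby indices into [start, end] index ranges.
    ranges = [[uncertain[0], uncertain[0]]]
    for i in uncertain[1:]:
        if i - ranges[-1][1] < min_token_gap:
            ranges[-1][1] = i
        else:
            ranges.append([i, i])

    return ["".join(tokens[start:end + 1]) for start, end in ranges]


# Made-up tokens and log probabilities for a hallucinated draft.
tokens = ["FLARE", " stands", " for", " Fast", " Light", " Attack", "."]
log_probs = [-0.05, -0.10, -0.02, -1.60, -1.20, -0.90, -0.01]

spans = low_confidence_spans(tokens, log_probs)
print(spans)
```

Each returned span is what a question generator would then be asked about, for example "what does FLARE stand for?".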




<img src="https://raw.githubusercontent.com/jzbjyb/FLARE/main/res/flare.gif">

## 🤔 **How FLARE Stands Out from Traditional RAG:**

- Traditional RAG fetches info once, based on the input, then generates an output.

- That single retrieval can be restrictive for long texts, where the information needed changes as generation proceeds.

- FLARE keeps deciding **when and what to fetch** during generation.

- It predicts the upcoming sentence and uses it to guide the next retrieval query.

- If that sentence has low-confidence tokens, it fetches relevant info and rewrites the sentence.


### 💬 **FLARE Prompt Structure:**

- Guides language models to create search queries during answer generation.

- For example: "The colors on the flag of Ghana mean... Red stands for [Search(Ghana flag red meaning)] the blood of martyrs..."
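
To make the prompt format concrete, here is a small sketch that pulls the embedded queries out of a generation like the Ghana example. The regular expression and the helper name are assumptions made for illustration; the original FLARE code may parse these markers differently.

```python
import re

# Matches markers of the form [Search(some query)] (assumed format).
SEARCH_PATTERN = re.compile(r"\[Search\((.*?)\)\]")


def extract_search_queries(text):
    """Return the embedded queries and the text with the markers removed."""
    queries = SEARCH_PATTERN.findall(text)
    cleaned = SEARCH_PATTERN.sub("", text)
    # Collapse the double spaces left behind where a marker was removed.
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return queries, cleaned


generated = (
    "The colors on the flag of Ghana mean... Red stands for "
    "[Search(Ghana flag red meaning)] the blood of martyrs..."
)
queries, cleaned = extract_search_queries(generated)
print(queries)
print(cleaned)
```

Each extracted query would be sent to the retriever, and generation would resume with the retrieved documents in context.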

### 🔄 **Direct FLARE Method:**

- Uses the next sentence generated by the language model for retrieval decisions.

- Accepts the sentence if the model is confident.

- If unsure, transforms the sentence into search queries to get more info, then revises the sentence.
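
The accept-or-revise loop described above can be sketched with toy stand-ins for the model and the retriever. Everything here is hypothetical scaffolding: `direct_flare`, the `toy_*` functions, and the token probabilities are invented so the example runs without an API key. As a simplification, the draft sentence itself is used as the search query instead of being transformed into queries first.

```python
def direct_flare(question, draft_sentence, retrieve, revise, min_prob=0.7, max_steps=10):
    """Sentence-level active retrieval (illustrative sketch).

    draft_sentence(question, answer) -> (sentence, token_probs), or None when done
    retrieve(query)                  -> list of document strings
    revise(question, answer, sentence, docs) -> regenerated sentence
    """
    answer = []  # sentences accepted so far
    for _ in range(max_steps):
        draft = draft_sentence(question, answer)
        if draft is None:  # the model has nothing more to say
            break
        sentence, token_probs = draft
        if min(token_probs) >= min_prob:
            # Every token is confident enough: accept the sentence as is.
            answer.append(sentence)
        else:
            # Simplification: the draft sentence doubles as the search query.
            docs = retrieve(sentence)
            answer.append(revise(question, answer, sentence, docs))
    return " ".join(answer)


# Toy stand-ins so the sketch runs without a model or a search backend.
DRAFTS = [
    ("FLARE is a retrieval-augmented generation method.", [0.95, 0.90, 0.92]),
    ("It stands for Fast Light Attack.", [0.90, 0.30, 0.40]),
]
calls = []  # records every query sent to the toy retriever


def toy_draft(question, answer):
    return DRAFTS[len(answer)] if len(answer) < len(DRAFTS) else None


def toy_retrieve(query):
    calls.append(query)
    return ["FLARE: Forward-Looking Active REtrieval augmented generation"]


def toy_revise(question, answer, sentence, docs):
    return "It stands for Forward-Looking Active REtrieval augmented generation."


result = direct_flare("What is FLARE?", toy_draft, toy_retrieve, toy_revise)
print(result)
print(calls)
```

Only the second, low-confidence sentence triggers a retrieval; the first is accepted without any lookup, which is the point of retrieving actively instead of once up front.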

### 🛠 **Setting Up the FLARE Chain:**

1. **LLM for Answers**: Needs to return log probabilities to spot uncertain tokens. Recommended: OpenAI wrapper (not ChatOpenAI wrapper, as it doesn't return logprobs).

2. **LLM for Hypothetical Questions**: Can be any model. In this case, we use ChatOpenAI for its speed and cost-effectiveness.

3. **Retriever**: Any retriever works. Here, a FAISS vector store built from the FLARE paper serves as the retriever.

#### 🔑 **Key Parameters in FLARE:**

1. **`max_generation_len`**: Caps the number of tokens generated before checking for uncertainties.

2. **`min_prob`**: Tokens generated with a probability below this threshold are marked as uncertain.
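
A quick way to build intuition for `min_prob` is to see how the set of flagged tokens grows as the threshold rises. The snippet below is a standalone sketch with made-up tokens and log probabilities; `flag_uncertain` is a hypothetical helper, not part of LangChain.

```python
import math


def flag_uncertain(tokens, log_probs, min_prob):
    """Return the tokens whose probability exp(log_prob) is below min_prob."""
    return [tok for tok, lp in zip(tokens, log_probs) if math.exp(lp) < min_prob]


# Made-up values; the probabilities are roughly 0.95, 0.74, 0.55, 0.22.
tokens = ["FLARE", " retrieves", " documents", " iteratively"]
log_probs = [-0.05, -0.30, -0.60, -1.50]

for min_prob in (0.3, 0.7, 0.9):
    print(min_prob, flag_uncertain(tokens, log_probs, min_prob))
```

A low threshold flags only the shakiest token and so retrieves rarely, while a high threshold flags most of the sentence and triggers retrieval far more often.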

In [None]:
pip install langchain langchain-openai langchain-community

In [3]:
from langchain_community.document_loaders import ArxivLoader
from langchain_community.vectorstores import FAISS

from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI

from langchain.chains import FlareChain

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Print verbose chain logs so we can see exactly what is going on
from langchain.globals import set_verbose

set_verbose(True)

## Same pattern as we saw in the RAG lecture.

In [None]:
pip install arxiv pymupdf faiss-cpu

In [5]:
!mkdir static

In [8]:
# instantiate the LLM
llm = ChatOpenAI(model="gpt-4-0125-preview")

# instantiate the embeddings model
embeddings = OpenAIEmbeddings()

# fetch docs from arXiv by paper ID; 2305.06983 is the FLARE paper
docs = ArxivLoader(query="2305.06983", load_max_docs=2).load()

# instantiate the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=16)

# split the documents into chunks
doc_chunks = text_splitter.split_documents(docs)

# create a vector store holding the embeddings of the document chunks
vector_store = FAISS.from_documents(doc_chunks, embedding=embeddings)

# expose the vector store as a retriever for the FLARE chain
vector_store_retriever = vector_store.as_retriever()
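
To see what `chunk_size=256` and `chunk_overlap=16` mean in practice, here is a simplified sliding-window chunker. It is only a sketch: `sliding_chunks` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its real chunk boundaries will differ.

```python
def sliding_chunks(text, chunk_size=256, chunk_overlap=16):
    """Split text into character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 600 characters of sample text (repeating digits).
sample = "".join(str(i % 10) for i in range(600))
chunks = sliding_chunks(sample)

print([len(c) for c in chunks])
# The last 16 characters of one chunk reappear at the start of the next.
print(chunks[0][-16:] == chunks[1][:16])
```

The overlap keeps a sentence that straddles a boundary at least partly present in both neighbouring chunks, which helps retrieval when chunks are as small as 256 characters.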

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Set up the FLARE chain

In [10]:
flare = FlareChain.from_llm(
    llm=llm,
    retriever=vector_store_retriever,
    max_generation_len=1024,
    min_prob=0.7,
    verbose=True,
)

# Start generation

In [11]:
flare.invoke("What is FLARE?")



> Entering new FlareChain chain...
Current Response: 
Prompt after formatting:
Respond to the user message using any relevant context. If context is provided, you should ground your answer in that context. Once you're done responding return FINISHED.

>>> CONTEXT: 
>>> USER INPUT: What is FLARE?
>>> RESPONSE: 


> Entering new QuestionGeneratorChain chain...
Prompt after formatting:
Given a user input and an existing partial response as context, ask a question to which the answer is the given term/entity/phrase:

>>> USER INPUT: What is FLARE?
>>> EXISTING PARTIAL RESPONSE:  
FLARE stands for Fast Light Attack and Reconnaissance Aircraft. It is a type of military aircraft designed for quick response and reconnaissance missions. It is often used for close air support, surveillance, and intelligence gathering.

The question to which the answer is the term/entity/phrase " stands for Fast Light Attack and Reconnaissance" is:

{'user_input': 'What is FLARE?',
 'response': 'FLARE stands for Forward-Looking Active Retrieval augmented generation. It is a method that uses reinforcement learning or supervised training to retrieve information before generation, potentially improving the quality of the generated text. It has been shown to achieve superior or competitive performance on various knowledge-intensive generation tasks and datasets. \n\n'}

In [12]:
flare.invoke("What are the steps in a FLARE pipeline?")



> Entering new FlareChain chain...
Current Response: 
Prompt after formatting:
Respond to the user message using any relevant context. If context is provided, you should ground your answer in that context. Once you're done responding return FINISHED.

>>> CONTEXT: 
>>> USER INPUT: What are the steps in a FLARE pipeline?
>>> RESPONSE: 


> Entering new QuestionGeneratorChain chain...
Prompt after formatting:
Given a user input and an existing partial response as context, ask a question to which the answer is the given term/entity/phrase:

>>> USER INPUT: What are the steps in a FLARE pipeline?
>>> EXISTING PARTIAL RESPONSE:  
The steps in a FLARE pipeline typically include data preprocessing, feature extraction, model training, and model evaluation. Data preprocessing involves cleaning and formatting the data to make it suitable for analysis. Feature extraction involves selecting and transforming relevant features from the

{'user_input': 'What are the steps in a FLARE pipeline?',
 'response': 'The steps in a FLARE pipeline include iterative retrieval and refinement, generating temporary next sentences, and checking for low-probability tokens. FLARE can also be combined with a browser to potentially improve retrieval quality. '}

In [13]:
flare.invoke("What does a typical retrieval instruction look like in FLARE?")



> Entering new FlareChain chain...
Current Response: 
Prompt after formatting:
Respond to the user message using any relevant context. If context is provided, you should ground your answer in that context. Once you're done responding return FINISHED.

>>> CONTEXT: 
>>> USER INPUT: What does a typical retrieval instruction look like in FLARE?
>>> RESPONSE: 


> Entering new QuestionGeneratorChain chain...
Prompt after formatting:
Given a user input and an existing partial response as context, ask a question to which the answer is the given term/entity/phrase:

>>> USER INPUT: What does a typical retrieval instruction look like in FLARE?
>>> EXISTING PARTIAL RESPONSE:  A typical retrieval instruction in FLARE would look like a query that specifies the type of data you want to retrieve, the location of the data, and the format in which you want the data to be returned. For example, you could specify that you want to retrieve

{'user_input': 'What does a typical retrieval instruction look like in FLARE?',
 'response': 'A typical retrieval instruction in FLARE is a prompt that guides the language model to generate search queries when necessary, while also encouraging retrieval during the generation process. This can be done using reinforcement learning or supervised training, and the retrieval queries should reflect the intents of future generations. '}