Source:
- https://huggingface.co/learn/cookbook/en/advanced_rag
- https://arc.net/l/quote/vntkseji

- Flare:
  - https://ayushtues.medium.com/flare-advanced-rag-implemented-from-scratch-07ca75c89800

# Generate Answers

### Assumptions
- the faiss_index embeddings are up to date 

In [3]:
# give the paths
# CHANGE ME
QUESTIONS_FILE = 'data/test/questions_webpages.txt'
OUTPUT_FILE = 'system_outputs/webpages.txt'
FAISS_FILE = 'faiss_index_total' # it's actually a folder but whatever

EMBEDDING_MODEL = "thenlper/gte-base" # make sure this matches whatever was used to create the doc embeddings
GENERATOR_MODEL = "google/flan-t5-large"
RERANKER_MODEL = "colbert-ir/colbertv2.0"


In [15]:
# https://arc.net/l/quote/vntkseji
# https://huggingface.co/learn/cookbook/en/advanced_rag

# Flare+T5: https://ayushtues.medium.com/flare-advanced-rag-implemented-from-scratch-07ca75c89800
# NOTE: faiss-gpu has no matching wheel in this environment (see the error output); faiss-cpu is a possible substitute.
!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-gpu openpyxl

ERROR: Could not find a version that satisfies the requirement faiss-gpu (from versions: none)
ERROR: No matching distribution found for faiss-gpu

In [16]:
pip install -U "transformers==4.38.0"

Note: you may need to restart the kernel to use updated packages.


In [17]:
!pip install unstructured



In [18]:
!pip install torch



In [19]:
# fix colab error: https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [20]:
!pip install ragatouille



In [1]:
from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import matplotlib.pyplot as plt

pd.set_option(
    "display.max_colwidth", None
)

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
%cd drive/MyDrive/ANLP
!ls

### Load your knowledge base

In [2]:
from langchain.docstore.document import Document as LangchainDocument
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Retriever - embeddings

### 1.1 Split the documents into chunks
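
The splitting code is not shown in this notebook, so the following is only a minimal sketch of the idea, not the method used to build the index. It assumes plain character-based chunks with a fixed overlap; `split_into_chunks`, `chunk_size` and `chunk_overlap` are hypothetical names chosen for illustration. The imported `RecursiveCharacterTextSplitter` goes further and tries to break on separators such as blank lines and newlines before falling back to characters.

```python
def split_into_chunks(text, chunk_size=200, chunk_overlap=20):
    """Split text into fixed-size character chunks that overlap slightly.

    Hypothetical helper: a simplified stand-in for a real text splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


example_chunks = split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

The overlap means the last character of one chunk is repeated at the start of the next, which reduces the chance that a fact is cut in half at a chunk boundary.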

### 1.2 Building the vector database

##### Nearest Neighbor search algorithm

[FAISS](https://github.com/facebookresearch/faiss)

##### Distances
The common distance measures between embeddings are described [here](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/#distance-between-embeddings).
- **Cosine similarity** measures the cosine of the angle between two vectors, so it compares their directions regardless of their magnitudes. In practice this means normalizing all vectors to unit norm, after which cosine similarity equals the dot product.
- **Dot product** takes into account magnitude, with the sometimes undesirable effect that increasing a vector's length will make it more similar to all others.
- **Euclidean distance** is the distance between the ends of vectors.
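
The three measures above can be compared on toy vectors. This is a minimal pure-Python sketch for illustration only; the helper names (`dot`, `norm`, `normalize`, `cosine_similarity`, `euclidean_distance`) are assumptions and are not part of FAISS or LangChain.

```python
import math


def dot(u, v):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def normalize(u):
    """Rescale a vector to unit norm."""
    n = norm(u)
    return [a / n for a in u]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: direction only."""
    return dot(u, v) / (norm(u) * norm(v))


def euclidean_distance(u, v):
    """Distance between the end points of u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


a = [1.0, 0.0]
b = [3.0, 0.0]  # same direction as a, three times longer
c = [0.0, 1.0]  # orthogonal to a

print("cosine(a, b):   ", cosine_similarity(a, b))
print("dot(a, b):      ", dot(a, b))
print("euclidean(a, b):", euclidean_distance(a, b))
print("cosine(a, c):   ", cosine_similarity(a, c))
print("dot of normalized a, b:", dot(normalize(a), normalize(b)))
```

Lengthening `b` raises its dot product with `a` but leaves the cosine similarity unchanged, which is why the embeddings in this notebook are normalized before indexing.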

In [4]:
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
EMBEDDING_MODEL_NAME = EMBEDDING_MODEL
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    # model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  #  True for cosine similarity
)

In [24]:
new_db = FAISS.load_local(FAISS_FILE, embedding_model)  # reuse the path configured at the top
docs = new_db.similarity_search("When is Andrew Carnegie's birthday?", k=3)
docs = [doc.page_content for doc in docs]
docs

['History -\n\nCMU - Carnegie Mellon University\n\nCarnegie Mellon University\n\n— — —\n\nAndrew Carnegie\n\nA self-educated "working boy" who loved books, Andrew Carnegie emigrated from Scotland in 1848 and settled in Pittsburgh, Pa. Attending night school and borrowing books, Carnegie went from factory worker in a textile mill to successful entrepreneur and industrialist. He rose to prominence by founding what became the world\'s largest steel producing company by the end of the 19th century.\n\nCarnegie Technical Schools\n\nAt one point the richest man in the world, Carnegie believed that "to die rich is to die disgraced." He turned his attention to writing, social activism and philanthropy, determined to establish educational opportunities for the general public where few existed.\n\nIn 1900, he donated $1 million for the creation of a technical institute for the city of Pittsburgh, envisioning a school where working-class men and women of Pittsburgh could learn practical skills, t

# 2. Reader - LLM

### 2.1. Reader model


In [5]:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import T5Tokenizer, T5ForConditionalGeneration
import os

In [6]:
# Read the access token from the environment instead of hard-coding a secret in the notebook.
HUGGINGFACEHUB_API_TOKEN = os.environ.get("HUGGINGFACEHUB_API_TOKEN")

In [36]:
# initialize the LLM and its tokenizer, we are using Flan T5 Large for this
tokenizer = T5Tokenizer.from_pretrained(GENERATOR_MODEL)
model = T5ForConditionalGeneration.from_pretrained(GENERATOR_MODEL)

# function to get the prediction and scores from the LLM, given a prompt
def get_prediction_and_scores(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    outputs =  model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_length=100)
    generated_sequence = outputs.sequences[0]

    # get the probability scores for each generated token
    transition_scores = torch.exp(model.compute_transition_scores(
        outputs.sequences, outputs.scores, normalize_logits=True
    )[0])
    return tokenizer.decode(generated_sequence), generated_sequence, transition_scores

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [28]:
# Google Gemma

# huggingfacehub_api_token = HUGGINGFACEHUB_API_TOKEN
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b",
#                                              quantization_config=quantization_config,
#                                              token = huggingfacehub_api_token)
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token= huggingfacehub_api_token)



In [29]:
# # smaller
# name = 'MBZUAI/LaMini-GPT-774M'
# model = AutoModelForCausalLM.from_pretrained(name)
# tokenizer = AutoTokenizer.from_pretrained(name)

In [30]:
# READER_LLM = pipeline(
#     model=model,
#     tokenizer=tokenizer,
#     task="text-generation",
#     do_sample=True,
#     temperature=0.2,
#     repetition_penalty=1.3,
#     return_full_text=False,
#     max_new_tokens=30,
# )

## Re-ranking Retriever

In [10]:
from ragatouille import RAGPretrainedModel

RERANKER = RAGPretrainedModel.from_pretrained(RERANKER_MODEL)

[Mar 01, 02:31:23] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
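
As a rough illustration of what a re-ranking step does: the retriever returns a generous candidate list, a second scorer re-scores each candidate against the query, and only the top few are kept. The word-overlap scorer below is a stand-in assumption and is not how ColBERT scores documents; `overlap_score` and `rerank_by_overlap` are hypothetical helpers. The result dictionaries use a `content` key only to mirror how this notebook reads reranked results.

```python
def overlap_score(query, doc):
    """Fraction of query words that also appear in the document (toy scorer)."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank_by_overlap(query, docs, k):
    """Re-score retrieved documents and keep the k best ones."""
    scored = [{"content": doc, "score": overlap_score(query, doc)} for doc in docs]
    # Highest score first; Python's sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:k]


candidates = ["buggy race schedule", "steel company history", "race day"]
top_docs = rerank_by_overlap("buggy race", candidates, k=2)
print([doc["content"] for doc in top_docs])
```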




### 2.2. Prompt

The RAG prompt template below is what we will feed to the Reader LLM: it is important to have it formatted in the Reader LLM's chat template.

We give it our context and the user's question.
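
A minimal sketch of how the context and the question can be combined into one prompt string. It assumes a plain-text prompt, which is how Flan-T5 is prompted in this notebook, rather than a chat template; `PROMPT_TEMPLATE`, `build_context` and `build_prompt` are hypothetical names, and the wording follows the instruction text used in the answer functions.

```python
PROMPT_TEMPLATE = (
    "Keep your answers short and concise. Given the below context:\n"
    "{context}\n\n"
    "Answer the following\n"
    "{question}\n"
)


def build_context(docs):
    """Number each retrieved document so the model can tell them apart."""
    return "\nExtracted documents:\n" + "".join(
        f"Document {i}:::\n{doc}\n" for i, doc in enumerate(docs)
    )


def build_prompt(question, docs):
    """Fill the template with the retrieved context and the user's question."""
    return PROMPT_TEMPLATE.format(context=build_context(docs), question=question)


prompt = build_prompt("Who founded the school?", ["Doc A text.", "Doc B text."])
print(prompt)
```

For a chat model, the same two fields would instead be placed into the message list expected by that model's chat template.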

In [None]:
# prompt_in_chat_format = '''
# <start_of_turn>user
# Instructions for you: Using the information contained in the context,
# give a comprehensive answer to the question.
# Respond only to the question asked, response should be concise and relevant to the question.
# Provide the number of the source document when relevant.
# If the answer cannot be deduced from the context, do not give an answer <end_of_turn>
# <start_of_turn>model
# sounds good!<end_of_turn>
# <start_of_turn>user
# Here is the context {context}
# and the Question: {question}<end_of_turn>
# '''

In [None]:
# prompt_in_chat_format = [
#     {
#         "role": "system",
#         "content": """Using the information contained in the context, give a comprehensive answer to the question.
# Respond only to the question asked, response should be concise and relevant to the question.
# Provide the number of the source document when relevant.
# Give very short answers..
# If the answer cannot be deduced from the context, do not give an answer.""",
#     },
#     {
#         "role": "user",
#         "content": """Context:
# {context}
# ---
# Here is the question you need to answer.
# Question: {question}""",
#     },
# ]
# from langchain.prompts import PromptTemplate
# # RAG_PROMPT_TEMPLATE = PromptTemplate(
# #  template=prompt_in_chat_format, input_variables=["context", "question"]
# # )

# RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
#     prompt_in_chat_format, tokenize=False, add_generation_prompt=True)
# print(RAG_PROMPT_TEMPLATE)

In [8]:
KNOWLEDGE_VECTOR_DATABASE = FAISS.load_local(FAISS_FILE, embedding_model)

In [32]:
from transformers import Pipeline


def answer_with_rag_without_flare(
    question: str,
    # llm: Pipeline,
    knowledge_index: FAISS,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 5,
    num_docs_final: int = 3,
) -> Tuple[str, List[LangchainDocument]]:


    # Gather documents with retriever
    print("=> Retrieving documents...")
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        print("=> Reranking documents...")
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    # final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)
    input_text = question
    new_input_text = f"Keep your answers short and concise. Given the below context:\n{context}\n\n Answer the following \n{input_text}\n"

    # Generate an answer
    print("=> Generating answer...")
    generated_sequence, _, _ = get_prediction_and_scores(new_input_text)
    input_text = f"{input_text} {generated_sequence}"

    answer = input_text
    return answer, relevant_docs

In [33]:
# user_query = 'What are the masters programs in LTI?'
# answer, relevant_docs = answer_with_rag_without_flare(
#     user_query, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER
# )

=> Retrieving documents...




=> Reranking documents...
Your documents are roughly 331.0 tokens long at the 90th percentile! This is quite long and might slow down reranking!
 Provide fewer documents, build smaller chunks or run on GPU if it takes too long for your needs!


100%|██████████| 1/1 [00:00<00:00,  1.11it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (989 > 512). Running this sequence through the model will result in indexing errors


=> Generating answer...


In [34]:

# print("==================================Answer==================================")
# print(len(relevant_docs))
# print(f"{answer}")

3
What are the masters programs in LTI? <pad> Master of Language Technologies</s>


In [35]:

# print("==================================Source docs==================================")
# for i, doc in enumerate(relevant_docs):
#     print(f"Document {i}------------------------------------------------------------")
#     print(doc)

Document 0------------------------------------------------------------
Courses that satisfy LTI Ph.D. degree requirements may also be used to satisfy requirements for

one M.S. degree. The most common choice is the LTI s Master of Language Technologies (MLT)

degree because its requirements are similar (but not identical) to the Ph.D. requirements. Other M.S. degrees within the LTI and outside of the LTI are also possible. LTI Ph.D.  Graduate Student Handbook  Page 20

Students interested in an M.S. degree other than the MLT degree should  discuss their plans with

their Ph.D. advisor due to the additional courses and project work that may be involved.

3.7 Grading and Evaluation

3.7.1 University Policy on Grades

Carnegie Mellon s Grading  policy offers details concerning university grading principles for

students taking courses and covers the specifics of assigning and c hanging grades, grading

options, drop/ withdrawals, and course repeats. It also defines the undergraduate and g

In [36]:
# user_query = 'What is the Buggy race schedule this year?'

In [37]:
# answer, relevant_docs = answer_with_rag_without_flare(
#     user_query, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER
# )

=> Retrieving documents...
=> Reranking documents...
Your documents are roughly 317.2 tokens long at the 90th percentile! This is quite long and might slow down reranking!
 Provide fewer documents, build smaller chunks or run on GPU if it takes too long for your needs!


100%|██████████| 1/1 [00:01<00:00,  1.10s/it]


=> Generating answer...


In [38]:

# print("==================================Answer==================================")
# print(len(relevant_docs))
# print(f"{answer}")

3
What is the Buggy race schedule this year? <pad> April 13, 2024</s>


In [39]:

# print("==================================Source docs==================================")
# for i, doc in enumerate(relevant_docs):
#     print(f"Document {i}------------------------------------------------------------")
#     print(doc)

Document 0------------------------------------------------------------
Buggy Races Keep Rolling at Carnegie Mellon -

News - Carnegie Mellon University

Carnegie Mellon University

— — —

Buggy Races Keep Rolling at Carnegie Mellon

April 10, 2019

Buggy Races Keep Rolling at Carnegie Mellon

In its 99th year, the tradition is a Spring Carnival treat

By Heidi Opdyke

opdyke(through)andrew.cmu.edu

Media Inquiries

Julie Mattera

Marketing and Communications

jmattera(through)cmu.edu

412-268-2902

Sweepstakes, also known as the

Buggy Races , has come a long way at Carnegie Mellon University. The slick, torpedo-like vessels carrying drivers with nerves of steel are a far cry from the two-man teams that once changed places halfway through a race and rode in everything from rain barrels with bicycle wheels to three-wheeled ash cans 99 years ago.

Today, it takes six people to maneuver the .84 -mile course around Schenley Park's Flagstaff Hill.

But while five pushers and a driver naviga

## Flare T5
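- essentially an extra acceptance step: the model first drafts an answer without context, and retrieval is triggered only if some generated token falls below a confidence threshold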
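
The acceptance step can be sketched on its own, without a model. This is an illustrative simplification under stated assumptions: `acceptance_step` is a hypothetical helper, the tokens are plain strings, and the probabilities are made up for the example rather than produced by Flan-T5.

```python
def acceptance_step(tokens, probabilities, threshold=0.5):
    """Decide whether a draft answer is accepted as-is.

    Returns (accepted, confident_tokens):
    - accepted is True when every token probability reaches the threshold,
      so no retrieval is needed;
    - confident_tokens keeps only the tokens above the threshold, which can
      serve as a retrieval query when the draft is rejected.
    """
    if len(tokens) != len(probabilities):
        raise ValueError("tokens and probabilities must have the same length")

    accepted = all(p >= threshold for p in probabilities)
    confident_tokens = [t for t, p in zip(tokens, probabilities) if p > threshold]
    return accepted, confident_tokens


draft_tokens = ["April", "13", ",", "2019"]
draft_probs = [0.9, 0.4, 0.8, 0.3]  # made-up values for illustration

accepted, query_tokens = acceptance_step(draft_tokens, draft_probs)
print("accepted:", accepted)
print("query tokens:", query_tokens)
```

A higher `threshold` makes the step stricter, so more drafts are rejected and retrieval runs more often.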

In [33]:
from transformers import Pipeline


def answer_with_rag_flare(
    input_text: str,
    # llm: Pipeline,
    knowledge_index: FAISS,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 5,
    num_docs_final: int = 3,
    threshold = .5
) -> Tuple[str, List[LangchainDocument]]:

    relevant_docs = None
    while True:
        generated_sequence, tokens, scores = get_prediction_and_scores(input_text)
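        if torch.min(scores) < threshold:
            # `tokens` starts with the decoder start token, which has no score,
            # so drop it to keep tokens and scores aligned before filtering.
            confident_tokens = tokens[1:][scores > threshold]
            # NOTE: `query` is not used yet; retrieval below still searches with `input_text`.
            query = tokenizer.decode(confident_tokens)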

            # Gather documents with retriever
            print("=> Retrieving documents...")
            relevant_docs = knowledge_index.similarity_search(query=input_text, k=num_retrieved_docs)
            relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

            # Optionally rerank results
            if reranker:
                print("=> Reranking documents...")
                relevant_docs = reranker.rerank(input_text, relevant_docs, k=num_docs_final)
                relevant_docs = [doc["content"] for doc in relevant_docs]

            relevant_docs = relevant_docs[:num_docs_final]

            # Build the final prompt
            context = "\nExtracted documents:\n"
            context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

            # final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)
            new_input_text = f"Keep your answers short and concise. Given the below context:\n{context}\n\n Answer the following \n{input_text}\n"

            # Generate an answer
            print("=> Generating answer...")
            generated_sequence, seq, _ = get_prediction_and_scores(new_input_text)
            # input_text = f"{input_text} {generated_sequence}"
            if "</s>" in generated_sequence:
                input_text = tokenizer.decode(seq, skip_special_tokens=True)
                break
        else:  # tokens are already high confidence
            # input_text = f'{input_text} {generated_sequence}'
            if "</s>" in generated_sequence:
                input_text = tokenizer.decode(tokens, skip_special_tokens=True)
                break
    answer = input_text
    print(relevant_docs)
    # relevant_docs is still None if the first draft was confident and nothing was retrieved
    return answer, relevant_docs

In [34]:
user_query = 'What is the Buggy race schedule this year?'
answer, relevant_docs = answer_with_rag_flare(
    user_query, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER
)

tensor([   0,    3,    7,  152,    3,    9, 6992,   23,   32,    1])
=> Retrieving documents...




=> Reranking documents...
Your documents are roughly 317.2 tokens long at the 90th percentile! This is quite long and might slow down reranking!
 Provide fewer documents, build smaller chunks or run on GPU if it takes too long for your needs!


100%|██████████| 1/1 [00:01<00:00,  1.17s/it]


=> Generating answer...
tensor([    0,  1186, 10670,   460,  2266,     1])
['Buggy Races Keep Rolling at Carnegie Mellon -\n\nNews - Carnegie Mellon University\n\nCarnegie Mellon University\n\n— — —\n\nBuggy Races Keep Rolling at Carnegie Mellon\n\nApril 10, 2019\n\nBuggy Races Keep Rolling at Carnegie Mellon\n\nIn its 99th year, the tradition is a Spring Carnival treat\n\nBy Heidi Opdyke\n\nopdyke(through)andrew.cmu.edu\n\nMedia Inquiries\n\nJulie Mattera\n\nMarketing and Communications\n\njmattera(through)cmu.edu\n\n412-268-2902\n\nSweepstakes, also known as the\n\nBuggy Races , has come a long way at Carnegie Mellon University. The slick, torpedo-like vessels carrying drivers with nerves of steel are a far cry from the two-man teams that once changed places halfway through a race and rode in everything from rain barrels with bicycle wheels to three-wheeled ash cans 99 years ago.\n\nToday, it takes six people to maneuver the .84 -mile course around Schenley Park\'s Flagstaff Hill.\n\

In [35]:

print("==================================Answer==================================")
print(len(relevant_docs))
# tokenizer.decode("hello",skip_special_tokens=True)
# print(f"{answer}")
print(answer)

3
April 13, 2024


In [None]:
# user_query = 'What did the first doctorate graduate from CMU study?'
# answer, relevant_docs = answer_with_rag_flare(
#     user_query, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER
# )

In [None]:
# print("==================================Answer==================================")
# # print(len(relevant_docs))
# print(f"{answer}")

## Generate Answers

In [None]:
def generate_answer(question):
    answer, _ = answer_with_rag_flare(
        question, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER
    )
    return answer

# note that this overwrites previously generated answers to the answer file
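def generate_answers_all(qfile, afile):
    # Context managers close both files even if generation fails midway.
    with open(qfile, "r") as questions_file, open(afile, "w") as ans_file:
        for q in questions_file:
            # Drop the trailing newline so it is not sent to the retriever and model.
            ans = generate_answer(q.strip())
            # One answer per line, in the same order as the questions.
            ans_file.write(ans + "\n")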


In [None]:
generate_answers_all(QUESTIONS_FILE, OUTPUT_FILE)
