## Mixtral + RAG
- Mixtral + RAG achieves what Mixtral alone can't: it retrieves content Mixtral was not trained on. It also seems to do fine on topics Mixtral already knows about.
- Evaluate starting from Mistral + basic FAISS.
- Will likely need to improve the retriever (options from easiest to hardest): chunk by paragraph (already implemented somewhere) and write those paragraphs to FAISS docs, use Contriever, or finetune the retriever.
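
The easiest option in the list above, chunking by paragraph, could look roughly like the sketch below. It is not the existing paragraph code mentioned above. The helper name `split_into_paragraphs` and the `min_chars` threshold are hypothetical, and the sketch assumes paragraphs are separated by blank lines, which PDF text extraction does not always preserve.

```python
import re


def split_into_paragraphs(text: str, min_chars: int = 200) -> list[str]:
    """Split on blank lines, merging short paragraphs into the following one."""
    raw = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    merged: list[str] = []
    buffer = ""
    for para in raw:
        # Accumulate paragraphs until the buffer is long enough to stand alone.
        buffer = f"{buffer}\n{para}" if buffer else para
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing fragment is attached to the previous paragraph, if any.
        if merged:
            merged[-1] = f"{merged[-1]}\n{buffer}"
        else:
            merged.append(buffer)
    return merged
```

Each returned paragraph could then be wrapped in a document and written to the FAISS index in place of the fixed-size chunks.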

### Get the data and build a Retriever

In [31]:

output_dir = '/home/mainuser/Desktop/LLMs/RagOverArXiv/rag_output_dir'
logging_dir = '/home/mainuser/Desktop/LLMs/RagOverArXiv/rag_logging_dir'
index_dir = '/home/mainuser/Desktop/LLMs/RagOverArXiv/rag_index_dir'



- The documents are 26 arXiv papers about modern LLMs.

In [None]:
from pathlib import Path
PDFS_PATH = Path('/content/drive/MyDrive/PdfRag/clusterofstars')
PDFS = list(PDFS_PATH.glob('*.pdf'))
PDFS[0], len(PDFS)

(PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/In-Context Retrieval-Augmented Language Models.pdf'),
 26)

In [None]:
# fastai function to clean GPU memory
import sys,gc,traceback
import torch
def clean_ipython_hist():
    # Code in this function mainly copied from IPython source
    if 'get_ipython' not in globals(): return
    ip = get_ipython()
    user_ns = ip.user_ns
    ip.displayhook.flush()
    pc = ip.displayhook.prompt_count + 1
    for n in range(1, pc): user_ns.pop('_i'+repr(n),None)
    user_ns.update(dict(_i='',_ii='',_iii=''))
    hm = ip.history_manager
    hm.input_hist_parsed[:] = [''] * pc
    hm.input_hist_raw[:] = [''] * pc
    hm._i = hm._ii = hm._iii = hm._i00 =  ''



def clean_tb():
    # h/t Piotr Czapla
    if hasattr(sys, 'last_traceback'):
        traceback.clear_frames(sys.last_traceback)
        delattr(sys, 'last_traceback')
    if hasattr(sys, 'last_type'): delattr(sys, 'last_type')
    if hasattr(sys, 'last_value'): delattr(sys, 'last_value')

def clean_mem():
    clean_tb()
    clean_ipython_hist()
    gc.collect()
    torch.cuda.empty_cache()



### Task 1: Prepare the data and build a PDF Data Loader

In [None]:
import os

from PyPDF2 import PdfReader
reader = PdfReader(os.path.expanduser(PDFS[0]))
pages = reader.pages
documents = []
for page in pages:
  documents.append(page.extract_text())
#print(documents[-1])

#### Drop everything from References onwards

The reference lists were 'confusing' RAG into retrieving mostly the titles of cited papers, which is likely not very useful.

In [None]:
import os

import PyPDF2

def load_pdf_to_string(pdf_path):
    # Open the PDF file in binary mode
    with open(pdf_path, 'rb') as file:
        # Create a PDF file reader object
        pdf_reader = PyPDF2.PdfReader(file)

        # Initialize an empty string to hold the text
        text = ''

        # Loop through each page and extract the text
        for page in pdf_reader.pages:
            page_text = page.extract_text()
            # Stop at the first page that contains a standalone REFERENCES heading
            references_index = page_text.upper().find('\nREFERENCES\n')
            if references_index != -1:
                return text + page_text[:references_index]
            text += page_text
    return text

# Use the function to load a PDF into a string
text = load_pdf_to_string(os.path.expanduser(PDFS[1]))

In [None]:
def get_title(pdf_path): return os.path.expanduser(pdf_path).split('/')[-1]

In [None]:
get_title(PDFS[-1])

'TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise.pdf'

In [None]:
text.find('References\n')

-1
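
A result of `-1` is what we expect once the reference list has been cut, but it would also be `-1` if the heading had been extracted in a different shape and never matched. Small-caps headings in these PDFs can come out with a space after the first letter (the Self-RAG paper yields `6 C ONCLUSION`). The sketch below is a more tolerant matcher under that assumption. The names `REFERENCES_HEADING` and `truncate_at_references` are hypothetical and not part of the loader above.

```python
import re

# A line consisting only of a "References" heading, optionally numbered,
# tolerating a single space after the first letter ("R EFERENCES").
REFERENCES_HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?R ?EFERENCES\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def truncate_at_references(text: str) -> str:
    """Return the text up to (not including) the first References heading line."""
    match = REFERENCES_HEADING.search(text)
    return text if match is None else text[: match.start()]
```

Unlike the per-page check in `load_pdf_to_string`, this would be applied to the full concatenated text, so a heading that is the first line of a page is also caught.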

In [None]:
all_docs_and_titles = [(load_pdf_to_string(os.path.expanduser(pdf_path)),get_title(pdf_path)) for pdf_path in PDFS]

In [None]:
all_docs = [doc[0] for doc in all_docs_and_titles]
all_titles = [doc[1] for doc in all_docs_and_titles]

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 30

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap = CHUNK_OVERLAP,
    length_function=len,
)
#text_splitter.split_text(all_pages[0])
# docs = [Document(page_content=pages) for pages in all_pages]
docs  = [text_splitter.split_text(doc) for doc in all_docs]
# # docs

In [None]:
len(docs)

26

In [None]:
tot_len = 0
for text in docs[0]:
    tot_len += len(text)
tot_len #OK, makes sense

37667

In [None]:
len(docs[0])

39
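
As a sanity check, 37,667 characters over 39 chunks is about 966 characters per chunk, just under `CHUNK_SIZE`. The sketch below illustrates how `chunk_size` and `chunk_overlap` interact, using a plain sliding window. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators), and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, overlap: int = 30) -> list[str]:
    """Fixed-width character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With this scheme the summed chunk length exceeds the document length by `overlap` for every chunk after the first, which is why a total slightly above the raw text length is plausible.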

### Task 2: Create an "Index"

- Not yet sure whether to use Qdrant or FAISS; the cells below use FAISS.


#### Selecting the VectorStore


In [3]:
from langchain.vectorstores import Qdrant, FAISS

In [30]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)


In [None]:
from langchain.schema.document import Document

# Flatten the per-paper chunk lists into Documents, tagging each chunk with its paper title
docs = [
    Document(page_content=chunk, metadata={'source': all_titles[j]})
    for j, chunks in enumerate(docs)
    for chunk in chunks
]

# Embed all chunks and build the FAISS index in a single call
vector_store = FAISS.from_documents(docs, embedder)

vector_store
#vector_store.save_local(index_dir)

<langchain.vectorstores.faiss.FAISS at 0x7cf4a9a4ff40>

In [None]:
vector_store.save_local(index_dir)

### To reload the index built above in a later Colab notebook session, run the code below.

In [3]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)
embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.load_local(index_dir, embedder,allow_dangerous_deserialization=True)

Check that the VectorStore is working by embedding a query and retrieving the passages from our papers that are closest to it.

In [32]:
query = "What is Retrieval-augmented generation?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

In [33]:
query = "What is Self-Rag?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)
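
Conceptually, `similarity_search_by_vector` compares the query vector against every stored chunk vector and returns the `k` closest. The sketch below shows that idea as a brute-force search in plain Python. It uses cosine similarity for illustration, which is an assumption: the FAISS index built above may rank by a different metric such as Euclidean distance. The helper names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]
```

A real vector store does the same job with optimized index structures, so it stays fast when there are many more chunks than this notebook holds.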

Let's see how much time the `CacheBackedEmbeddings` pattern saves us:

In [10]:
%%timeit -n 1 -r 1
query = "What is Self-Rag?"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

5.34 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [None]:
%%timeit
query = "What is Self-Rag?"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

7.37 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
docs[0].page_content

'predictions are mostly aligned with their assessments. Appendix Table 6 shows several annotated\nexamples and explanations on assessments.\n6 C ONCLUSION\nThis work introduces SELF-RAG, a new framework to enhance the quality and factuality of LLMs\nthrough retrieval on demand and self-reflection. SELF-RAGtrains an LM to learn to retrieve, generate,\nand critique text passages and its own generation by predicting the next tokens from its original\nvocabulary as well as newly added special tokens, called reflection tokens. SELF-RAGfurther enables\nthe tailoring of LM behaviors at test time by leveraging reflection tokens. Our holistic evaluations on\nsix tasks using multiple metrics demonstrate that SELF-RAGsignificantly outperforms LLMs with\nmore parameters or with conventional retrieval-augmented generation approaches.\n10Preprint.\nETHICAL CONCERNS\nThis work aims to improve the factuality of LLM outputs, the lack of which continues to cause nu-'

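These timings do not show a clear benefit from caching: the single first run took 5.34 ms, while the repeated runs averaged 7.37 ms per loop. One likely reason is that `CacheBackedEmbeddings` caches document embeddings, whereas query embeddings are not cached unless a query cache is configured. The cache should pay off mainly when the index is rebuilt from the same chunks.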
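
The caching idea behind `CacheBackedEmbeddings` can be illustrated with a small in-memory sketch: hash the text, and only call the underlying model on a cache miss. This is not LangChain's implementation. `CachedEmbedder` and `toy_embed` are hypothetical names, and the toy embedding function merely stands in for a real model.

```python
import hashlib


class CachedEmbedder:
    """Wraps an embedding function and memoizes results by a hash of the input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.misses = 0  # number of times the underlying function was actually called

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: two cheap text statistics."""
    return [float(len(text)), float(text.count(" "))]
```

Counting misses like this is a more direct check of whether a cache is being hit than comparing wall-clock timings of a few milliseconds.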

With that, we're ready to move onto Task 3!

To make the model as small as possible, we load a quantized build of Mixtral (the EXL2 checkpoint in the cell below, via `exllamav2`) instead of Tim Dettmers' `bitsandbytes` with `accelerate` and `transformers` from Hugging Face. The overall quality of the model is fairly well retained!

#### Mixtral

In [6]:
from exllamav2 import *
from exllamav2.generator import *
import sys, torch

config = ExLlamaV2Config()
config.model_dir = "/home/mainuser/Desktop/LLMs/MixtralInference/Mixtral-8x7B-instruct-exl2"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)

print("Loading model...")
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])
gen_settings = ExLlamaV2Sampler.Settings()

Loading model...


In [28]:
from typing import List, Optional, Tuple

from langchain.docstore.document import Document as LangchainDocument
from ragatouille import RAGPretrainedModel

# RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
#     prompt_in_chat_format, tokenize=False, add_generation_prompt=True
# )
RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
def answer_with_rag(
    question: str,
    generator: ExLlamaV2StreamingGenerator,
    knowledge_index: FAISS = vector_store,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 5,
) -> None:  # the answer is streamed to stdout, not returned
    # Gather documents with retriever
    print("=> Retrieving documents...")
    embedding_vector = core_embeddings_model.embed_query(question)
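    # Use the index passed in as an argument rather than the global vector_store
    relevant_docs = knowledge_index.similarity_search_by_vector(embedding_vector, k=num_docs_final)  # num_retrieved_docs once reranking is enabled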
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    # if reranker:
    #     print("=> Reranking documents...")
    #     relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
    #     #print(f"Type is : {type(relevant_docs[0])}")
    #     print(dir(relevant_docs[0]))
    #     relevant_docs = [doc['page_content'] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    # Separate documents with a newline so one chunk does not run into the next "Document N:::" header
    context += "\n".join([f"Document {i}:::\n{doc}" for i, doc in enumerate(relevant_docs)])

    #final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)


    ######
    print(f"Context: {context}")
    #print("Without RAG")
    #instruction_ids = tokenizer.encode(f"[INST] {question} [/INST]", add_bos = True)
    print('With RAG')
    instruction_ids = tokenizer.encode(f"[INST] Use the following context: {context}.  Using this context, please answer the following question: {question} [/INST]", add_bos = True)
    context_ids = instruction_ids if generator.sequence_ids is None \
            else torch.cat([generator.sequence_ids, instruction_ids], dim = -1)

    generator.begin_stream(context_ids, gen_settings)

    while True:
        chunk, eos, _ = generator.stream()
        if eos: break
        print(chunk, end = "")
        sys.stdout.flush()
    ######



    # Redact an answer
    # print("=> Generating answer...")
    # answer = llm(final_prompt)[0]["generated_text"]

    # return answer, relevant_docs

In [29]:
question = "What is DsPy?"
answer_with_rag(question, generator, vector_store, RERANKER)

=> Retrieving documents...
Context: 
Extracted documents:
Document 0:::
Akin to type signatures in programming languages, DSPy signatures simply define an interface and
provide type-like hints on the expected behavior. To use a signature, we must declare a module with
that signature, like we instantiated a Predict module above. A module declaration like this returns
afunction having that signature.
ThePredict Module The core module for working with signatures in DSPy is Predict (simplified
pseudocode in Appendix D.1). Internally, Predict stores the supplied signature, an optional LM to
use (initially None , but otherwise overrides the default LM for this module), and a list of demon-
strations for prompting (initially empty). Like layers in PyTorch, the instantiated module behaves as
a callable function: it takes in keyword arguments corresponding to the signature input fields (e.g.,
question ), formats a prompt to implement the signature and includes the appropriate demonstra-Document

- Self-RAG is answered perfectly fine without context, since Mixtral already knows about it.
- However, for DSPy without context, I get: "Without RAG
 Based on the provided context, there is no direct mention of a technology or concept called "DSPy". It is possible that there may be a typo or misunderstanding in the question. If "DSPy" is intended to refer to a specific technology or concept, please provide additional context or clarify the question so that I can give an accurate and helpful response.
"
- With RAG, I get a good answer:
"With RAG
 DsPy is a framework for natural language processing that uses natural language signatures to assign work to a language model (LM). It is designed to improve the efficiency and effectiveness of prompts used for language models, and to abstract the input/output behavior of a module. DsPy includes a number of modules like Predict, ChainOfThought, ProgramOfThought, MultiChainComparison, and ReAct, which can be used interchangeably to implement a DsPy signature. DsPy signatures define an interface and provide type-like hints on the expected behavior, similar to type signatures in programming languages. They can be compiled into self-improving and pipeline-adaptive prompts or finetunes, and handle structured formatting and parsing logic to reduce brittle string manipulation in user programs. DsPy is the second iteration of the Demonstrate–Search–Predict framework (DSP) and has been shown to significantly improve the quality of simple programs when composed with language models like GPT-3.5 and llama2-13b-chat."

- Mixtral + RAG achieves what Mixtral alone can't: it retrieves content Mixtral was not trained on. It also seems to do fine on topics Mixtral already knows about.
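
To make the with/without-RAG comparison above repeatable, both prompts can be built from the same question by a small pure function. The sketch below follows the `[INST] ... [/INST]` wording used in `answer_with_rag`. The helper names `format_context` and `build_prompts` are hypothetical, and the prompt wording is an assumption rather than a tuned template.

```python
def format_context(docs: list[str]) -> str:
    """Number the retrieved chunks and separate them with newlines."""
    parts = [f"Document {i}:::\n{doc}" for i, doc in enumerate(docs)]
    return "\nExtracted documents:\n" + "\n".join(parts)


def build_prompts(question: str, docs: list[str]) -> dict[str, str]:
    """Return the same question as a plain prompt and as a context-augmented prompt."""
    plain = f"[INST] {question} [/INST]"
    context = format_context(docs)
    rag = (
        f"[INST] Use the following context: {context}\n"
        f"Using this context, please answer the following question: {question} [/INST]"
    )
    return {"without_rag": plain, "with_rag": rag}
```

When comparing the two, it is probably safest to send each prompt in a fresh sequence. The "Without RAG" answer above begins with "Based on the provided context", which suggests that earlier context was still present in the generator.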
