# Retrieval Augmented Generation

## Intro to RAG

* RAG is a framework that combines
    * a pre-trained LLM (the generator)
    * an external knowledge source (retriever) to provide accurate domain-specific information without retraining the LLM
    * why?
        * LLMs lack domain-specific knowledge
        * LLMs have no access to internal and/or confidential data (e.g. a company's travel policy)

* RAG Workflow
    * text embedding
        * convert the user's prompt/question into a high-dim vec using a question encoder
        * convert each text chunk of the knowledge base into a high-dim vec using a context encoder
    * retrieval
        * compare the prompt vector to vectors in the knowledge base
        * identify the top-k matches
    * augmented query creation
        * combine the retrieved text chunks with the original prompt to form an augmented query
    * model generation
        * LLM uses the augmented query to produce a contextually relevant response
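
The four workflow steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not a production pipeline: `toy_embed` is a hypothetical stand-in for the question/context encoders (a bag-of-words count vector rather than a learned dense embedding), and `generate` is any callable that stands in for the LLM.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_embed(text: str) -> Counter:
    """Stand-in encoder (assumption): sparse bag-of-words counts, not a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Steps 1-2: embed the question and every chunk, keep the k most similar chunks."""
    q_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


def rag_answer(question: str, chunks: List[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Steps 3-4: build the augmented query and hand it to the generator."""
    context = "\n".join(retrieve(question, chunks, k))
    augmented_query = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(augmented_query)


chunks = [
    "The cafeteria opens at eight every morning.",
    "Employees may book economy flights for trips under six hours.",
    "Hotel stays are reimbursed up to 200 dollars per night.",
]
question = "Which flights can employees book?"
# An identity "generator" makes the augmented query itself visible.
print(rag_answer(question, chunks, generate=lambda prompt: prompt, k=1))
```

In a real system the chunk embeddings would be computed once and stored, rather than recomputed for every question as this sketch does.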

* Details
    * prompt encoding
        * use token embeddings to convert words/sub-words into vectors
        * average those token vectors to get a single vector representation of the prompt
    * context encoding
        * split large docs into smaller chunks
        * embed each chunk separately
        * store these chunk embeddings in a vector database
    * retrieval
        * the system computes similarity/distance between the prompt vector and each chunk vector
        * dot product is sensitive to both magnitude and direction; cosine similarity depends on direction only
        * select the top-k chunks with the smallest distance (or the highest similarity, depending on the metric)
    * final response
        * feed the retrieved text together with the original question into the LLM to generate a domain-specific, accurate response
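
To make the pooling and similarity details above concrete, here is a small NumPy sketch. The 2-d vectors are made-up toy values chosen for illustration (real token embeddings have hundreds of dimensions), and the helper names `mean_pool`, `cosine_scores`, and `top_k` are hypothetical.

```python
import numpy as np


def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Average token vectors of shape (num_tokens, dim) into a single (dim,) vector."""
    return token_vectors.mean(axis=0)


def cosine_scores(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of matrix."""
    return (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))


def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]


# Toy token embeddings for a 3-token prompt.
prompt_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], dtype=np.float32)
prompt_vec = mean_pool(prompt_tokens)  # -> [1.0, 1.0]

chunk_vecs = np.array([
    [2.0, 2.0],    # same direction as the prompt
    [10.0, 0.0],   # different direction, large magnitude
    [-1.0, 0.5],   # roughly opposite direction
], dtype=np.float32)

dot_scores = chunk_vecs @ prompt_vec          # magnitude and direction
cos_scores = cosine_scores(prompt_vec, chunk_vecs)  # direction only

print("dot ranking:   ", top_k(dot_scores, 3))
print("cosine ranking:", top_k(cos_scores, 3))
```

The dot product ranks the long vector `[10, 0]` first, while cosine similarity prefers `[2, 2]`, which points the same way as the prompt. If all vectors are normalized to unit length, the two scores coincide, and squared L2 distance becomes `2 - 2 * cosine`, so "smallest distance" and "highest similarity" select the same chunks.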



## RAG Encoders and FAISS

* Context encoder
    * context encoder + tokenizer -> encodes text passages into dense vector embeddings
    * process
        * tokenize text passages (paragraphs)
        * use a pre-trained DPR context encoder to create embeddings
        * store these embeddings in a vector database

* FAISS
    * a library for efficient similarity search on large sets of high-dim vectors
    * process
        * convert embeddings to a numpy float32 arr
        * initialize FAISS index (eg for L2 distance)
        * add context embeddings into FAISS for fast lookups

* Question encoder
    * DPR question encoder + tokenizer, similar to context encoder but for questions, returns question embedding

* Retrieval w/ FAISS
    * encode the question
    * compute distances to all stored context embeddings
    * return top-k closest matches
    
* Response generation with decoder
    * w/o context: the generator relies only on what the pre-trained model already knows
    * w/ augmentation
        * retrieve the top-k passages from FAISS
        * concatenate the passages with the question
        * feed that combined input into the generator
        * generates a final, context-informed answer
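
The difference between the two generation modes comes down to what text the generator receives. The sketch below shows one way to build that input from the row ids a FAISS search returns. The function name `build_generator_input` and the `Context:/Question:/Answer:` template are illustrative assumptions, not a format required by any particular model.

```python
from typing import Optional, Sequence


def build_generator_input(question: str,
                          passages: Sequence[str] = (),
                          top_indices: Optional[Sequence[int]] = None) -> str:
    """Return the text fed to the generator, with or without retrieved context."""
    if not top_indices:
        # Baseline: the model must answer from its pre-trained knowledge alone.
        return f"Question: {question}\nAnswer:"
    # Map retrieved row ids back to passage text, closest match first.
    context = "\n\n".join(passages[i] for i in top_indices)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


passages = [
    "Economy class is required for flights under six hours.",
    "The office is closed on public holidays.",
    "Business class may be approved for flights over ten hours.",
]
question = "When can I fly business class?"

baseline_input = build_generator_input(question)
augmented_input = build_generator_input(question, passages, top_indices=[2, 0])
print(augmented_input)
```

Either string would then be tokenized and passed to the generator; only the augmented one carries the retrieved passages.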