# Q&A pipeline with Retrieval Augmented Generation (RAG) 

This notebook is a demo for a small project completed as the take-home assignment for the KPMG Machine Learning position. The task was to create a chatbot that answers mock "business inquiries", following these instructions:
- The solution should be based on a RAG (Retrieval Augmented Generation) model, which consists of
    - a Retriever layer
    - an LLM Layer
- The suggested [dataset](https://huggingface.co/datasets/wikipedia/viewer/20220301.en) for the Retriever is the English Wikipedia dataset from HuggingFace. Instead, I used its smaller sibling, the Simple English Wikipedia [dataset](https://huggingface.co/datasets/wikipedia/viewer/20220301.simple)
- "Business inquiries" for this example should be understood as anything a person might ask that can be answered from, e.g., the Wikipedia dataset

### RAG

LLMs have proven to be powerful and versatile tools for a broad range of applications. However, as their name suggests, they are very large, so training one from scratch for a new, specific task requires a tremendous amount of computational power, time and resources. One way to overcome this is to take a pretrained model and tailor it to our needs. Retrieval Augmented Generation (RAG) is one such method. A RAG model consists of two main blocks: the retriever block and the LLM block.

![RAG Structure](F:/Users/bence/Documents/_PROJECTS/KPMG_LLM/large-language-models/rag/rag_s.png)

**Retriever block**

The aim of the retriever block is to create a richer prompt for the LLM block by adding context to the original query. To do this, it maintains its own knowledge base, which in this case is the Simple English Wikipedia dataset from HuggingFace. The text in this knowledge base is converted into vector representations with an embedding network (here a pretrained transformer model that maps text into a 384-dimensional vector space). The original prompt/query is passed through the same embedding network, which ensures that the query and the knowledge base live in the same representation space.

Afterwards, the *k* vectors closest to the query vector are found and selected as context. The corresponding texts from the document dataset are then inserted into a template prompt (e.g. "Consider the following context and then answer the question: *context*, *question*").
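
The nearest-neighbour step can be illustrated with a minimal, dependency-free sketch. It assumes cosine similarity as the closeness measure and uses made-up 3-dimensional vectors in place of real 384-dimensional embeddings; the `VectorDatabase` class used later in this notebook may rely on a different metric or index. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy data: the vectors are invented for illustration, not produced by an encoder.
doc_texts = [
    "April Fools' Day is on 1 April.",
    "Art is a creative activity.",
    "April is the fourth month of the year.",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.6, 0.1]]
query_vec = [1.0, 0.2, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)
# The texts behind the selected vectors become the context of the enriched prompt.
context = "\n".join(doc_texts[i] for i in nearest)
print(context)
```

Sorting every score is fine for a handful of documents; a dedicated vector index becomes worthwhile once the knowledge base grows.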

**LLM block**

The LLM block, as its name suggests, contains a pretrained LLM that answers the enriched prompt. In this example, I used a pretrained Llama 2 model (the open source LLM developed by Meta). With the added context, the LLM is expected to produce a better-quality answer than the base LLM on its own. This is tested at the end of the notebook.

### How to use? 

- Enter a query in the appropriate cell (QUERY variable)
- Run all the code
- **Note**: the RAG model can take a few minutes to produce a result, because it receives a long prompt and the model runs locally

In [3]:
from langchain.docstore.document import Document
from langchain.document_loaders import HuggingFaceDatasetLoader

from encoder.encoder import Encoder
from generator.generator import Generator
from retriever.vector_db import VectorDatabase

The following cell contains the template for the enhanced prompt: {context} is replaced with the selected text from the Wikipedia dataset, and {question} is replaced with the original query.

In [4]:
TEMPLATE = """
Use the following pieces of context to answer the question at the end. 
{context}
Question: {question}
Answer:
"""
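
As a rough illustration of how such a template can be filled, the sketch below uses plain `str.format`. This is an assumption made for demonstration: the `Generator` class may fill the template differently (for example through a LangChain prompt template), and `build_prompt` is a hypothetical helper.

```python
TEMPLATE = """
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Answer:
"""


def build_prompt(template, context, question):
    """Substitute the retrieved context and the user query into the template."""
    return template.format(context=context, question=question)


prompt = build_prompt(
    TEMPLATE,
    context="April Fools' Day is on 1 April.",
    question="When is April Fools day?",
)
print(prompt)
```

Passing an empty string as the context yields the prompt for the base-model comparison, which is how the last cell of the notebook calls the generator.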

In [5]:
# load wikipedia dataset
loader = HuggingFaceDatasetLoader("wikipedia", name="20220301.simple")
docs = loader.load()[:100]


In [6]:
# use CUDA if a GPU is available, otherwise fall back to the CPU
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [None]:
# initiate our classes for the Encoder, Retriever and Generator
encoder = Encoder()
faiss_db = VectorDatabase()
generator = Generator(TEMPLATE)

In [None]:
# Create passages of equal length from the documents.
# Note: this step is meant to split the documents, but I do not think it works correctly,
# so each whole article ends up represented as a single vector in the vector space.
passages = faiss_db.create_passages(docs)
faiss_db.store_passages_db(passages, encoder.encoder)
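
For reference, one common way to split an article into passages is fixed-size character chunks with a small overlap, so that sentences cut at a boundary still appear whole in one of the neighbouring chunks. The sketch below is a generic illustration, not the implementation behind `create_passages`; the function name and the default sizes are assumptions.

```python
def split_into_passages(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    passages = []
    for start in range(0, len(text), step):
        passages.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return passages


article = "abcdefghij" * 12  # 120 characters of dummy text
passages = split_into_passages(article, chunk_size=50, overlap=10)
print([len(p) for p in passages])
```

Character-based chunks are the simplest option; splitting on sentence or token boundaries usually gives cleaner passages at the cost of a little more code.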

### Enter the query

In [None]:
QUERY = "When is April Fools day?"

In [None]:
#QUERY_USER = input('What would you like to know?\n')

### Find the closest documents

Convert the query to the vector space and find the closest documents.

In [None]:
# retrieve the k most similar documents to our query
context = faiss_db.retrieve_most_similar_document(QUERY, k=1)
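
With `k` greater than 1, the retrieved passages have to be merged into a single context string that still fits the prompt. The sketch below shows one possible approach, assuming the retriever returns a list of passage strings ordered from most to least similar (the actual return type of `retrieve_most_similar_document` is not shown in this notebook). `build_context` is a hypothetical helper, and the 4096-character default mirrors the slice applied to the context later in the notebook.

```python
def build_context(passages, max_chars=4096, separator="\n\n"):
    """Join ranked passages, stopping before the character budget is exceeded."""
    selected = []
    used = 0
    for passage in passages:
        # Account for the separator that precedes every passage but the first.
        extra = len(passage) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(passage)
        used += extra
    if not selected and passages:
        # Even the best passage is too long: fall back to truncating it.
        return passages[0][:max_chars]
    return separator.join(selected)


retrieved = ["a" * 30, "b" * 30, "c" * 30]  # dummy passages, best match first
context_text = build_context(retrieved, max_chars=70)
print(len(context_text))
```

Compared with slicing the joined text, this keeps every included passage intact and only truncates when a single passage is larger than the whole budget.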

### Answers - comparison

Get results from both the RAG model and the base Llama 2 model, and compare them.

In [None]:
# RAG response; context[:4096] caps the retrieved context to keep the prompt short
print(generator.get_answer(context[:4096], QUERY))

In [None]:
# Base LLM response (empty context, so the model answers from its own knowledge)
print(generator.get_answer('', QUERY))