# What is RAG?

RAG stands for Retrieval-Augmented Generation.

Simple (naive) RAG can be roughly broken down into three steps:

* Retrieval - finding relevant information in a source (or sources) stored in an embeddings database, given a specific query. For example, getting relevant passages of source text from a database given a question.
* Augmented - using the retrieved information to modify the input to a generative model (e.g. an LLM).
* Generation - generating an output from the augmented input. For example, in the case of an LLM, generating a passage of text given an input prompt.
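
The three steps above can be sketched in plain Python. This is only a toy illustration: word overlap stands in for embedding similarity, `fake_llm` stands in for a real model, and all function names here (`retrieve`, `augment`, `generate`) are hypothetical, not part of any library.

```python
import re

# Tiny stand-in for a document store
DOCS = [
    "A Random Forest is an ensemble of Decision Trees.",
    "Gradient Descent minimizes a cost function iteratively.",
    "PCA projects data onto a lower-dimensional hyperplane.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, docs, k=1):
    """Retrieval: rank passages by word overlap with the query (toy scoring)."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(query, passages):
    """Augmented: put the retrieved passages into the model input."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt, llm):
    """Generation: hand the augmented prompt to a generative model."""
    return llm(prompt)

def fake_llm(prompt):
    # Placeholder for a real LLM call; it only reports the prompt size
    return f"(model answer based on a prompt of {len(prompt)} characters)"

query = "What is a Random Forest?"
passages = retrieve(query, DOCS)
prompt = augment(query, passages)
answer = generate(prompt, fake_llm)
print(prompt)
print(answer)
```

The rest of this notebook replaces each toy piece with a real one: FAISS with sentence embeddings for retrieval, and a hosted LLM for generation.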

Besides the simple RAG technique, various other RAG approaches exist, such as Advanced RAG, Speculative RAG, Fusion RAG, and more. For a better understanding, I highly recommend checking out these two articles: https://homayounsrp.medium.com/6-types-of-retrieval-augmented-generation-rag-techniques-you-should-know-b45de9071c79 and https://bhavikjikadara.medium.com/exploring-the-different-types-of-rag-in-ai-c118edf6d73c.

RAG is therefore very useful when we need to work with specific custom data and want to reduce hallucinations in the answers. It is also typically much cheaper than fine-tuning an LLM.

As mentioned, RAG works well with custom data, for example:
* Customer Q&A support chat
* Textbook Q&A
* Company internal documentation chat

# Import necessary packages and dependencies

In [None]:
!pip install -U langchain-community PyMuPDF faiss-cpu langchain_huggingface -q

In [None]:
import fitz
import re
import os
import pandas as pd
import time

from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_huggingface import (HuggingFaceEmbeddings,
                                   ChatHuggingFace,
                                   HuggingFaceEndpoint)

from google.colab import userdata
token = userdata.get('token_huggingface')
os.environ['HUGGINGFACEHUB_API_TOKEN'] = token
# spacy can be used for making chunks by sentence
# from spacy.lang.en import English

# Loading PDF files and building the vector DB

We need to build a custom embeddings database for our RAG pipeline. For this we can use a vector store such as FAISS or Chroma; in my case, I'll be using FAISS. We also need an encoder model that turns text into vectors, and I've chosen 'all-MiniLM-L6-v2' from Hugging Face for this.
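
To make the idea of an embeddings database concrete, here is a minimal sketch of nearest-neighbour search over vectors. The 3-dimensional vectors and passage labels are made up for illustration (all-MiniLM-L6-v2 actually produces 384-dimensional vectors), and a real vector store such as FAISS uses optimized indexes rather than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings": passage label -> vector
index = {
    "random forest passage": [0.9, 0.1, 0.0],
    "gradient descent passage": [0.1, 0.9, 0.1],
    "pca passage": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the user's question
query_vector = [0.8, 0.2, 0.1]

def nearest(query_vector, index, k=2):
    """Score every stored vector against the query and return the k best."""
    scored = [(cosine_similarity(query_vector, vec), name) for name, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

for score, name in nearest(query_vector, index):
    print(f"{score:.3f}  {name}")
```

The encoder model decides where each passage lands in the vector space; the database only has to find the stored vectors closest to the query vector.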

In [None]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


In [None]:
def text_formatter(text: str) -> str:
    cleaned_text = text.replace("\n", " ").strip()
    # we can also provide more formatting functions here
    return cleaned_text

def create_db(root_directory, embedding_model):
  if len(os.listdir(root_directory)) == 0:
    print('Nothing was added to the source folder!')
    return None
  else:
    prepared_documents = []
    for filename in os.listdir(root_directory):
      if filename.endswith('.pdf'):
        print(f'Reading {filename}...')

        file_path = os.path.join(root_directory, filename)

        # fitz (PyMuPDF) is used here as a more robust PDF reader
        pdf_document = fitz.open(file_path)
        # document_metadata = {}
        # document_metadata['total_pages'] = len(pdf_document)
        # document_metadata['title'] = pdf_document.metadata['title']
        doc = [Document(page_content=text_formatter(page.get_text())) for page in pdf_document]
        # doc = [Document(page_content=text_formatter(page.get_text()),
        #                 metadata={**document_metadata, 'page': page_number}
        #                 ) for page_number, page in enumerate(pdf_document)]
        # langchain pdf reader
        # loader = PyPDFLoader(file_path)
        # pdf_document = loader.load()

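        # can use RecursiveCharacterTextSplitter, TokenTextSplitter, MarkdownHeaderTextSplitter
        # text_formatter removed the newlines, so split on spaces instead of the default "\n\n";
        # otherwise every page would stay as one oversized chunk
        text_splitter = CharacterTextSplitter(separator=" ", chunk_size=512, chunk_overlap=64)
        splited_documents = text_splitter.split_documents(doc)
        prepared_documents.extend(splited_documents)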
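        print('Sleeping...')
        time.sleep(10)
    if not prepared_documents:
      print('No PDF text was found in the source folder!')
      return None
    # index the chunks from every file, not only the last one processed
    db = FAISS.from_documents(prepared_documents, embedding_model)
    return db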


destination_file_path = '/db'
db = create_db("/content_files", embedding_model)
if db is not None:
  db.save_local(destination_file_path)
else:
  print("Root file directory is empty!")

Reading Aurélien-Géron-Hands-On-Machine-Learning-with-Scikit-Learn-Keras-and-Tensorflow_-Concepts-Tools-and-Techniques-to-Build-Intelligent-Systems-O’Reilly-Media-2019.pdf...
Sleeping...
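
The `chunk_size` and `chunk_overlap` parameters used above can be illustrated with a simplified character-window splitter. This is a sketch of the idea only: `chunk_text` is a hypothetical helper, and LangChain's `CharacterTextSplitter` splits on a separator and merges the pieces, so its chunk boundaries will differ from these fixed windows.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=64):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 1200 characters of dummy text
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)

print([len(c) for c in chunks])
# the tail of one chunk is repeated at the head of the next
print(chunks[0][-64:] == chunks[1][:64])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.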


# Retrieval part of the RAG

In [None]:
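# a classic retriever: similarity search with a relevance score threshold
def retriever(query, db=db, k=3):
  # return at most k chunks whose relevance score is at least 0.4
  doc_retriever = db.as_retriever(search_type="similarity_score_threshold",
                                  search_kwargs={"score_threshold": .4, "k": k})
  # invoke() replaces the deprecated get_relevant_documents()
  return doc_retriever.invoke(query)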

In [None]:
query = 'What is Random Forest?'
retrieved_text = retriever(query)
retrieved_text

[Document(id='7a10b208-72b8-43d7-a884-1c88ed29b6b7', metadata={}, page_content='9 “Random Decision Forests,” T. Ho (1995). 10 The BaggingClassifier class remains useful if you want a bag of something other than Decision Trees. 11 There are a few notable exceptions: splitter is absent (forced to "random"), presort is absent (forced to False), max_samples is absent (forced to 1.0), and base_estimator is absent (forced to DecisionTreeClassi fier with the provided hyperparameters). Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance. Random Forests As we have discussed, a Random Forest9 is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and pass‐ ing it a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees
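
The retriever above combines two filters: a minimum relevance score (`score_threshold`) and a maximum number of results (`k`). Here is a minimal sketch of that selection logic, assuming relevance scores where higher means more similar. The scores, labels and the `filter_by_score` helper are all made up for illustration; this is not LangChain's internal code.

```python
def filter_by_score(scored_passages, score_threshold=0.4, k=3):
    """Keep passages scoring at least score_threshold, best first, at most k of them."""
    kept = [pair for pair in scored_passages if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]

# Hypothetical (relevance score, passage label) pairs
candidates = [
    (0.35, "page about SVMs"),
    (0.82, "page about Random Forests"),
    (0.41, "page about bagging"),
    (0.67, "page about Decision Trees"),
    (0.55, "page about boosting"),
]

results = filter_by_score(candidates)
for score, label in results:
    print(f"{score:.2f}  {label}")

# If no candidate reaches the threshold, nothing is returned
print(filter_by_score(candidates, score_threshold=0.9))
```

Note that the result can be empty when nothing is similar enough to the query, so the generation step should be prepared to handle a missing context.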

# Generation part of the RAG

For this part, we need access to an LLM API, or alternatively, a local LLM such as Llama. In my case, I will use the Hugging Face LLM API for 'HuggingFaceH4/zephyr-7b-beta'. If a more powerful model is required, one can consult the LLM leaderboard on Hugging Face (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/).

## A simple approach to the generation part of RAG

In [None]:
llm = HuggingFaceEndpoint(repo_id='HuggingFaceH4/zephyr-7b-beta',
                          task='text-generation')

In [None]:
chat_model = ChatHuggingFace(llm=llm)

In [None]:
def ask(query, context):
  messages = [
      SystemMessage(content='You are an AI assistant that helps people find information based on provided context. **Always answer in English**'),
      # HumanMessage(content = query)
      HumanMessage(content=f'Answer the next question: {query} based only on the following context information: {context}. If you don\'t have specific context, you must answer that you don\'t know!')
  ]
  response = chat_model.invoke(messages)
  return response

In [None]:
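def ask_rag(query):
  # join the retrieved chunks into one string so the prompt contains plain text,
  # not the repr of a Python list
  context = '\n\n'.join(doc.page_content for doc in retriever(query))
  return ask(query, context)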
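
The way the retrieved passages are placed into the prompt matters for the answer. Below is a small sketch of one possible prompt builder that numbers the passages and handles the case where nothing was retrieved. `build_prompt` is a hypothetical helper, not something used elsewhere in this notebook, and the wording of the instructions is only an example.

```python
def build_prompt(query, passages):
    """Return a prompt string, or None when there is no context to answer from."""
    if not passages:
        # The caller can reply "I don't know" without calling the model at all
        return None
    # Number each passage so it is clear where one ends and the next begins
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

example_passages = [
    "A Random Forest is an ensemble of Decision Trees.",
    "Random Forests are generally trained via the bagging method.",
]

print(build_prompt("What is Random Forest?", example_passages))
print(build_prompt("What is Random Forest?", []))
```

Returning `None` for an empty context lets the calling code skip the LLM call entirely instead of relying on the model to admit that it has nothing to work with.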

In [None]:
query = 'What is Random Forest?'
response = ask_rag(query)

In [None]:
response.content

"Random Forest is an ensemble learning algorithm that combines multiple decision trees to make more accurate and reliable predictions than a single decision tree can provide. Each tree is trained on a different random subset of the features and observations (with replacement) in the training set, which introduces variance in the model and helps to reduce overfitting. The final prediction is made by selecting the class that receives the majority vote from all the individual trees. Random Forest also has additional hyperparameters that allow for greater control over the trees' depth and complexity. It is a popular and effective machine learning algorithm that is widely used in various applications such as fraud detection, customer segmentation, and image classification.\n\nThe algorithm is typically trained using the bagging (bootstrapped aggregating) method, which involves creating several samples of the training set with replacement and training a decision tree on each sample. The fina

## Conversation memory
...