<center><h1>RAG using Gemma, Langchain and ChromaDB</h1></center>
<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

This notebook demonstrates how to build a retrieval augmented generation (RAG) system using Gemma as a large language model (LLM), Langchain for tools to process input files, and ChromaDB as vector database.

## What is RAG?

Retriever augmented generation (RAG) is a system that improves the response generated by a LLM in two ways:
- First, the information is retrieved from a dataset that is stored in vector database; the query is used to perform similarity search in the documents stored in the vector database.
- Second, by restraining the context provided to the LLM to content that is similar with the initial query, stored in the vector database, we can reduce significantly (or even eliminate) LLM's halucinations, since the answer is provided from the context of the stored documents.

An important advantage of this approach is that we do not need to fine-tune the LLM with our custom data; instead, the data is ingested (cleaned, transformed, chunked, and indexed in the vector database).

## Procedure

We create two classes:
* AIAgent - An AI Agent that query Gemma LLM using a custom prompt that instruct Gemma to generate and answer (from the query) by refering to the context (as well provided); the answer to the AI Agent query function is then returned.
* RAGSystem - initialized with the dataset with Data Science information, with an AIAgent object. In the init function of this class, we ingest the data from the dataset in the vector database. This class have as well a query member function. In this function we first perform similarity search with the query to the vector database. Then, we call the generate function of the ai agent object. Before returning the answer, we use a predefined template to compose the overal response from the question, answer and the context retrieved.


# Packages instalation and configurations

In [1]:
# install required libraries
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install langchain
!pip install sentence-transformers
!pip install chromadb

Looking in indexes: https://pypi.org/simple/
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl.metadata (1.8 kB)
Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m102.2/102.2 MB[0m [31m16.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.0
Collecting langchain
  Downloading langchain-0.1.14-py3-none-any.whl.metadata (13 kB)
Collecting langchain-community<0.1,>=0.0.30 (from langchain)
  Downloading langchain_community-0.0.31-py3-none-any.whl.metadata (8.4 kB)
Collecting langchain-core<0.2.0,>=0.1.37 (from langchain)
  Downloading langchain_core-0.1.37-py3-none-any.whl.metadata (6.0 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl.metadata (2.0 kB)
Collecting langsmith<0.2.0,>=0.1.17 (fro

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

from langchain.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

from IPython.display import display, Markdown


# AI Agent class

In [3]:
class AIAgent:
    """Gemma 2b-it based assistant that replies given the retrieved documents"""
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/2b-it/2")
        self.gemma_lm = AutoModelForCausalLM.from_pretrained("/kaggle/input/gemma/transformers/2b-it/2")

    def create_prompt(self, query, context):
        # prompt template
        prompt = f"""
        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        Only use the context provided to answer.
        Answer with simple words. If needed, include explanations.
        Question: {query}
        Context: {context}
        Answer:
        """
        return prompt
    
    def generate(self, query, retrieved_info):
        prompt = self.create_prompt(query, retrieved_info)
        input_ids = self.tokenizer(query, return_tensors="pt").input_ids
        # Answer generation
        answer = self.gemma_lm.generate(
            input_ids,
            max_length=512, # limit the answer to 512
        )
        # Decode and return the answer
        answer = self.tokenizer.decode(answer[0], skip_special_tokens=True)
        return answer

In [4]:
class RAGSystem:
    """Sentence embedding based Retrieval Based Augmented generation.
        Given database of pdf files, retriever finds num_retrieved_docs relevant documents"""
    def __init__(self, ai_agent, num_retrieved_docs=3):
        # load the data
        self.num_docs = 3
        self.ai_agent = ai_agent
        loader = CSVLoader("/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv")
        documents = loader.load()
        self.template = "\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\nContext:\n{context}"
        
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=800, 
            chunk_overlap=100)
        all_splits = text_splitter.split_documents(documents)
        # create a vectorstore database
        embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        self.vector_db = Chroma.from_documents(documents=all_splits, 
                                               embedding=embeddings, 
                                               persist_directory="chroma_db")
        self.retriever = self.vector_db.as_retriever()

    def retrieve(self, query):
        # retrieve top k similar documents to query
        docs = self.retriever.get_relevant_documents(query)
        return docs
    
    def query(self, query):
        # generate the answer
        context = self.retrieve(query)
        answer = self.ai_agent.generate(query, context)
        
        return self.template.format(question=query, 
                                   answer=answer,
                                   context=context)
        
        

In [5]:
def colorize_text(text):
    for word, color in zip(["Question", "Answer", "Context"], ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Test the RAG system

In [6]:
ai_agent = AIAgent()
rag_system = RAGSystem(ai_agent)

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [7]:
answer = rag_system.query("What is SVM?")
display(Markdown(colorize_text(answer)))



**<font color='blue'>Question:</font>**
What is SVM?

**<font color='red'>Answer:</font>**
What is SVM?

SVM stands for Support Vector Machine. It is a supervised machine learning algorithm used for both classification and regression tasks.

**Key Concepts of SVM:**

* **Support Vector Machines (SVMs):** These are hyperplanes in a high-dimensional space that best separate the different classes of data.
* **Kernels:** A kernel function is used to map the data into a higher-dimensional space, where the SVMs are constructed.
* **Cost Function:** A cost function is used to penalize the model for misclassifying data points.
* **Training Data:** The algorithm is trained on a set of labeled data points.
* **Testing Data:** The trained model is then tested on a set of unlabeled data points.

**How SVMs Work:**

1. **Data Transformation:** The data is transformed using a kernel function.
2. **Finding the Optimal Hyperplane:** The algorithm finds the best hyperplane that maximizes the margin between the different classes of data.
3. **Cost Function Minimization:** The cost function is minimized to penalize the model for misclassifying data points.
4. **Classification:** New data points are classified by finding the hyperplane that best separates them from the training data.

**Types of SVMs:**

* **Linear SVMs:** The kernel function is the linear function.
* **RBF (Radial Basis Function) SVMs:** The kernel function is the RBF function.
* **Polynomial SVMs:** The kernel function is the polynomial function.

**Advantages of SVMs:**

* High accuracy
* Robust to noise and outliers
* Can handle high-dimensional data

**Disadvantages of SVMs:**

* Can be sensitive to the choice of kernel function
* Not suitable for high-dimensional data with a large number of features

**<font color='green'>Context:</font>**
[Document(page_content='question: What’s singular value decomposition? How is it typically used for machine learning? \u200d⭐️\nanswer: * Singular Value Decomposition (SVD) is a general matrix decomposition method that factors a matrix X into three matrices L (left singular values), Σ (diagonal matrix) and R^T (right singular values).\n* For machine learning, Principal Component Analysis (PCA) is typically used. It is a special type of SVD where the singular values correspond to the eigenvectors and the values of the diagonal matrix are the squares of the eigenvalues. We use these features as they are statistically descriptive.', metadata={'row': 141, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: What is supervised machine learning? 👶\nanswer: Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the\xa0input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc.', metadata={'row': 0, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: Do you know any dimensionality reduction techniques? \u200d⭐️\nanswer: * Singular Value Decomposition (SVD)\n* Principal Component Analysis (PCA)\n* Linear Discriminant Analysis (LDA)\n* T-distributed Stochastic Neighbor Embedding (t-SNE)\n* Autoencoders\n* Fourier and Wavelet Transforms', metadata={'row': 140, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: What is unsupervised learning? 👶\nanswer: Unsupervised learning aims to detect patterns in data where no labels are given.', metadata={'row': 132, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'})]

In [8]:
answer = rag_system.query("What is regularization?")
display(Markdown(colorize_text(answer)))



**<font color='blue'>Question:</font>**
What is regularization?

**<font color='red'>Answer:</font>**
What is regularization?

Regularization is a technique used in machine learning and data science to reduce overfitting and improve the generalization performance of a model. It involves adding a penalty term to the loss function that is proportional to the size of the model parameters. This penalty term encourages the model to find a simpler solution that is less likely to overfit to the training data.

Here are some key points about regularization:

* **Types of regularization:**
    * L1 regularization (Lasso): penalizes the absolute value of the model parameters.
    * L2 regularization (Ridge): penalizes the squared value of the model parameters.
    * Elastic net regularization: combines L1 and L2 penalties.
* **Regularization techniques:**
    * Early stopping: stops the training process when the validation error starts to increase.
    * Dropout: randomly drops out some neurons during training.
    * Early stopping with L2 regularization: stops the training process when the L2 norm of the model parameters starts to increase.

Regularization can be used to improve the performance of a model in a variety of ways, including:

* Reducing overfitting: By encouraging the model to find a simpler solution, regularization helps to prevent it from memorizing the training data and making poor predictions on unseen data.
* Improving generalization performance: Regularization helps to ensure that the model is able to make accurate predictions on data that it has not seen during training.
* Preventing overfitting: Regularization helps to prevent the model from overfitting to the training data.

Regularization is a powerful technique that can be used to improve the performance of machine learning models. However, it is important to choose the right regularization technique for the task at hand, as there is no one-size-fits-all solution.

**<font color='green'>Context:</font>**
[Document(page_content='question: What is regularization? Why do we need it? 👶\nanswer: Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.', metadata={'row': 41, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: How does L2 regularization look like in a linear model? \u200d⭐️\nanswer: L2 regularization adds a penalty term to our cost function which is equal to the sum of squares of models coefficients multiplied by a lambda hyperparameter. This technique makes sure that the coefficients are close to zero and is widely used in cases when we have a lot of features that might correlate with each other.', metadata={'row': 44, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: What regularization techniques for neural nets do you know? \u200d⭐️\nanswer: * L1 Regularization - Defined as the sum of absolute values of the individual parameters. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded. \n* L2 Regularization - Defined as the sum of square of individual parameters. Often supported by regularization hyperparameter alpha. It results in weight decay. \n* Data Augmentation - This requires some fake data to be created as a part of training set. \n* Drop Out : This is most effective regularization technique for neural nets. Few random nodes in each layer is deactivated in forward pass. This allows the algorithm to train on different set of nodes in each iterations.', metadata={'row': 91, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: What kind of regularization techniques are applicable to linear models? \u200d⭐️\nanswer: AIC/BIC, Ridge regression, Lasso, Elastic Net, Basis pursuit denoising, Rudin–Osher–Fatemi model (TV), Potts model, RLAD,\nDantzig Selector,SLOPE', metadata={'row': 43, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'})]

In [9]:
answer = rag_system.query("Please explain bias and variance?")
display(Markdown(colorize_text(answer)))



**<font color='blue'>Question:</font>**
Please explain bias and variance?

**<font color='red'>Answer:</font>**
Please explain bias and variance?

**Bias** is the systematic error in a statistical estimate that is consistently larger or smaller than the true value. Bias can be caused by a variety of factors, including sampling error, measurement error, and sampling bias.

**Variance** is a measure of how much the sample mean varies from the population mean. Variance is affected by the sample size, the variability of the population, and the sampling method.

Here is a table summarizing the key differences between bias and variance:

| Feature | Bias | Variance |
|---|---|---|
| Cause | Systematic error | Variability |
| Type | Large | Small |
| Effect on estimate | Overestimates the true value | Underestimates the true value |
| Measures | Bias, sample mean | Variance, sample standard deviation |

Bias and variance are two important concepts in statistical inference. By understanding bias and variance, you can make more accurate and reliable statistical inferences.

**<font color='green'>Context:</font>**
[Document(page_content='question: What’s the interpretation of the bias term in linear models? \u200d⭐️\nanswer: Bias is simply, a difference between predicted value and actual/true value. It can be interpreted as the distance from the average prediction and true value i.e. true value minus mean(predictions). But dont get confused between accuracy and bias.', metadata={'row': 50, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: What is the bias-variance trade-off? 👶\nanswer: **Bias** is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model. **Variance** is a model sensitivity to changes in the training dataset.\n\n**Bias-variance trade-off** is a relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible:\n\n```\nExpectedTestError = Variance + Bias² + IrreducibleError\n```\n\nBut as a model complexity increases, the bias decreases and the variance increases which leads to *overfitting*. And vice versa, model simplification helps to decrease the variance but it increases the bias which leads to *underfitting*.', metadata={'row': 13, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='question: What’s the normal distribution? Why do we care about it? 👶\nanswer: The normal distribution is a continuous probability distribution whose probability density function takes the following formula:\n\n![formula](https://mathworld.wolfram.com/images/equations/NormalDistribution/NumberedEquation1.gif)\n\nwhere μ is the mean and σ is the standard deviation of the distribution.\n\nThe normal distribution derives its importance from the **Central Limit Theorem**, which states that if we draw a large enough number of samples, their mean will follow a normal distribution regardless of the initial distribution of the sample, i.e **the distribution of the mean of the samples is normal**. It is important that each sample is independent from the other.', metadata={'row': 4, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}), Document(page_content='Assigning random values to weights is better than just 0 assignment. \n* a) If weights are initialized with very high values the term np.dot(W,X)+b becomes significantly higher and if an activation function like sigmoid() is applied, the function maps its value near to 1 where the slope of gradient changes slowly and learning takes a lot of time.\n* b) If weights are initialized with low values it gets mapped to 0, where the case is the same as above. This problem is often referred to as the vanishing gradient.', metadata={'row': 89, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'})]

# Conclusions

We tested a RAG system developed with Gemma as LLM, Langchain for data loaders utilities, and ChromaDB as database. 
The RAG system is initialized with a dataset, that is used to populate the vector database, and with an AI Agent, that will query Gemma, given the initial query and the retrieved context.
To verify that the result is composed based on the context provided, we include as well the context in the exported result.
