# NPR MC1: Cleantech Retrieval Augmented Generation

**Dominik Filliger, Nils Fahrni, Noah Leuenberger**

> The topic of Mini-Challenge 1 is retrieval augmented generation (RAG) incorporating a combination of unsupervised learning, pre-training and in-context learning techniques.

- [Description of the task](https://spaces.technik.fhnw.ch/storage/uploads/spaces/81/exercises/NPR-Mini-Challenge-1-Cleantech-RAG-1708982891.pdf)
- [Introduction to RAG](https://spaces.technik.fhnw.ch/storage/uploads/spaces/81/exercises/Retrieval-Augmented-Generation-Intro-1709021241.pdf)

This notebook is the main entry point for our solution to NPR Mini-Challenge 1. It explains our approach and walks through the code we used to solve the task. The code for evaluation, LangChain LLM model creation and vector store interaction lives in script files in the `src` directory.

Additionally, scripts for the development subset and subset evaluation set creation can be found in the `scripts` directory and will be referenced in their respective sections.


# Setup


In [None]:
import os

from dotenv import load_dotenv

load_dotenv()
from src.generation import get_llm_model, LLMModel

azure_model = get_llm_model(LLMModel.GPT_3_AZURE)

## Observability & Monitoring

> Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting. It allows AI Engineers and Data Scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve.

We use Phoenix to visualize traces so that we can debug our pipelines quickly. The library offers many more features, which we do not use here. Below we add the Phoenix callbacks to LangChain, the main library of our solution, so that its traces are recorded.


In [None]:
from phoenix.trace.langchain import LangChainInstrumentor
import phoenix as px

px.close_app()
session = px.launch_app()

LangChainInstrumentor().instrument()

To get quick access to the Phoenix dashboard, the dashboard is rendered in the notebook. The dashboard is interactive and can be used to explore the traces.


In [None]:
session.view()

# Data Loading & Preprocessing


In [None]:
import pandas as pd

df = pd.read_csv('data/Cleantech Media Dataset/cleantech_media_dataset_v2_2024-02-23.csv')
df.head()

In [None]:
from src.preprocessing.preprocessor import Preprocessor

preprocessed_df_path = 'data/Cleantech Media Dataset/cleantech_media_dataset_v2_2024-02-23_preprocessed.csv'
if os.path.exists(preprocessed_df_path):
    preprocessed_df = pd.read_csv(preprocessed_df_path)
else:
    preprocessed_df = Preprocessor(df).preprocess()
    preprocessed_df.to_csv(preprocessed_df_path, index=False)

# Indexing

The indexing involves creating a vector representation of the content.

The chosen embedding model is `BAAI/bge-small-en`, available on HuggingFace. It is a small, transformer-based English text embedding model from the BGE (BAAI General Embedding) family. We use it to create embeddings for the content of the documents.

In [None]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

In [None]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)


def get_document_metadata(row):
    return {
        "url": row['url'],
        "domain": row['domain'],
        "title": row['title'],
        "author": row['author'],
        "date": row['date'],
        "origin_doc_id": row['id']
    }


# Create one Document per chunk; each chunk inherits the metadata of its source row.
documents = [Document(page_content=split, metadata=get_document_metadata(row))
             for index, row in preprocessed_df.iterrows()
             for split in recursive_text_splitter.split_text(row['content'])]

print(f"Number of documents: {len(documents)}, Number of rows in df: {len(preprocessed_df)}")

## Vector Store

We will use [ChromaDB](https://www.trychroma.com/) to store the embeddings. For easier interaction with the embeddings, we will use the VectorStore class which is a wrapper around the embeddings and ChromaDB. It provides a simple interface to interact with the embeddings and ChromaDB functionality we need for the task.

### ChromaDB Setup

If the environment variables `CHROMADB_HOST` or `CHROMADB_PORT` are not set, the VectorStore will use a local non-persistent ChromaDB client, which is not recommended. Instead we recommend setting up a ChromaDB instance. The ChromaDB instance can be set up using the following command and Docker:

```bash
docker-compose up -d chromadb
```

Set the environment variables `CHROMADB_HOST` and `CHROMADB_PORT` to the host and port of the ChromaDB instance. The default values are `localhost` and `8192`.
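
The fallback behaviour described above could be sketched roughly as follows. The helper `resolve_chroma_settings` is hypothetical and not taken from `src/vectorstore`; it only illustrates how the two environment variables might decide between a remote instance and a local, non-persistent client.

```python
import os
from typing import Mapping, Optional


def resolve_chroma_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: decide which kind of ChromaDB client to create."""
    env = os.environ if env is None else env
    host = env.get("CHROMADB_HOST")
    port = env.get("CHROMADB_PORT")
    # Both values are needed to reach a remote instance.
    if host and port:
        return {"mode": "http", "host": host, "port": int(port)}
    # If either one is missing, fall back to a local, non-persistent client.
    return {"mode": "ephemeral"}


print(resolve_chroma_settings({"CHROMADB_HOST": "localhost", "CHROMADB_PORT": "8192"}))
print(resolve_chroma_settings({}))
```

Passing the environment as an argument keeps the decision easy to test without touching the real process environment.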

### VectorStore Usage

The vector store is directly tied to the embeddings. Therefore a vector store is embedding specific and can only be used with the embeddings it was created with.

In [None]:
from src.vectorstore import VectorStore

print("ChromaDB Host: ", os.getenv('CHROMADB_HOST'))
print("ChromaDB Port: ", os.getenv('CHROMADB_PORT'))
print("ChromaDB Collection: ", os.getenv('CHROMADB_COLLECTION'))

bge_vector_store = VectorStore(embedding_function=bge_embeddings)

In the next step we will add the prepared documents from the previous step to the VectorStore.

In [None]:
%%script false --no-raise-error
bge_vector_store.add_documents(documents, verbose=True, batch_size=128)

After adding the documents to the vector store we can now perform similarity searches on the documents to verify that the interaction with the vector store works as expected.

In [None]:
bge_vector_store.similarity_search_w_scores("The company is also aiming to reduce gas flaring?")

# Baseline Pipeline

The baseline pipeline is a first simple implementation of the RAG pipeline.


In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel

retriever = bge_vector_store.get_retriever()

In [None]:
base_rag_prompt = """
Answer the question to your best knowledge when looking at the following context:
{context}
                
Question: {question}
"""

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


base_rag_chain = (
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | ChatPromptTemplate.from_template(base_rag_prompt)
        | azure_model
        | StrOutputParser()
)

base_rag = RunnableParallel(
    {
        "context": retriever,
        "question": RunnablePassthrough()
    }
).assign(answer=base_rag_chain)

In [None]:
base_rag.invoke("Is the company aiming to reduce gas flaring?")

# Evaluation

To compare different pipelines we need to evaluate them on a common basis. We use the `ragas` library for this. It provides an evaluation function together with predefined metrics, which are described in the [documentation](https://docs.ragas.io/en/stable/concepts/metrics/index.html). We use the following metric to evaluate our pipelines:

- **Context Relevancy**: Measures how relevant the retrieved context is to the question. According to the `ragas` documentation, the score is the share of sentences in the retrieved context that are needed to answer the question, so it ranges from 0 to 1 and higher values are better.

## Evaluation Set
To compare the pipelines fairly, we use the same evaluation set for all of them. The evaluation set was created beforehand with the script `scripts/generate_testset.py`. Evaluating on a subset of the data saves time and resources.

In [None]:
df_eval_subset = pd.read_csv('data/Cleantech Media Dataset/cleantech_media_dataset_v2_2024-02-23_subset_eval.csv')
df_eval_subset = df_eval_subset.dropna(subset=['answer'])
df_eval_subset = df_eval_subset.drop_duplicates().sample(10)
df_eval_subset

## RAGEvaluator
The `RAGEvaluator` class is a thin wrapper around the `ragas` library. It evaluates a pipeline and returns the results as a pandas DataFrame. The metrics are calculated for each example in the evaluation set and can then be aggregated over the whole set to get an overall score for the pipeline.
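
The step from per-example scores to an overall score can be illustrated with a minimal sketch. The rows and the `aggregate` helper are hypothetical and not part of `RAGEvaluator`; the sketch also assumes that an example whose metric could not be computed simply carries no score and is skipped.

```python
from statistics import mean

# Hypothetical per-example scores; real values come from an evaluation run.
per_example = [
    {"question": "q1", "context_relevancy": 0.50},
    {"question": "q2", "context_relevancy": 0.25},
    {"question": "q3", "context_relevancy": None},  # metric failed for this row
]


def aggregate(rows, metric):
    """Average a metric over all rows that produced a score."""
    scores = [row[metric] for row in rows if row[metric] is not None]
    return mean(scores) if scores else float("nan")


print(aggregate(per_example, "context_relevancy"))
```

Averaging only over scored rows keeps a single failed example from dragging the result down, at the cost of hiding how many examples failed.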

In [None]:
from src.evaluation import RAGEvaluator

base_evaluator = RAGEvaluator(name="Baseline",
                              chain=base_rag,
                              llm_model=azure_model,
                              embeddings=bge_embeddings)

In [None]:
base_evaluator.create_dataset_from_df(df_eval_subset)
default_eval_result = base_evaluator.evaluate(raise_exceptions=False)

In [None]:
base_evaluator.summarize_metrics()

# Experiment 1: Looking at the impact of context and its chunking strategy

The dataset appears to be pre-chunked already, in line with the first proposed chunking strategy. In this experiment we instead concatenate these pre-made chunks into one single document per article. This lets us see whether the LLM benefits from having the entire document as context instead of only a chunk of it.

First, to restructure the content of the cleantech dataset, we call the `preprocess()` method on a `Preprocessor` object that was instantiated with `concatenate_contents=True`. This turns the list of pre-chunked contents into one joined string that represents the full content of each document.
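
What such a concatenation might look like is sketched below. The function is a hypothetical stand-in, not the actual `Preprocessor` code, and it assumes that the content column stores a stringified Python list of text chunks.

```python
import ast


def concatenate_contents(raw_content: str, separator: str = " ") -> str:
    """Hypothetical helper: join pre-chunked content into one string.

    Assumes the column stores a stringified Python list of text chunks.
    """
    chunks = ast.literal_eval(raw_content)
    return separator.join(chunk.strip() for chunk in chunks)


# Toy input, not taken from the dataset.
raw = "['Solar capacity grew.', ' Wind followed closely. ']"
print(concatenate_contents(raw))
```

`ast.literal_eval` only parses Python literals, which makes it a safer choice than `eval` for reading such a column.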

To embed the processed documents, we again turn them into LangChain `Document` objects.

In [None]:
full_content_documents = [Document(page_content=row['content'], metadata=get_document_metadata(row))
                          for _, row in preprocessed_df.iterrows()]

assert len(full_content_documents) == len(preprocessed_df)

To keep this experiment isolated from the baseline, we create a new `VectorStore` with its own collection.

In [None]:
bge_full_content_vector_store = VectorStore(embedding_function=bge_embeddings,
                                            collection="cleantech-full-content-bge-small-en")

In [None]:
full_retriever = bge_full_content_vector_store.get_retriever()

In [None]:
%%script false --no-raise-error
bge_full_content_vector_store.add_documents(full_content_documents, verbose=True, batch_size=128)

In [None]:
bge_full_content_vector_store.similarity_search_w_scores("The company is also aiming to reduce gas flaring?")

In [None]:
full_rag = RunnableParallel(
    {
        "context": full_retriever,
        "question": RunnablePassthrough()
    }
).assign(answer=base_rag_chain)

In [None]:
full_rag.invoke("Is the company aiming to reduce gas flaring?")

In [None]:
full_evaluator = RAGEvaluator(name="Full Content",
                              chain=full_rag,
                              llm_model=azure_model,
                              embeddings=bge_embeddings)

In [None]:
full_evaluator.create_dataset_from_df(df_eval_subset)
full_content_eval_results = full_evaluator.evaluate(raise_exceptions=False)

# Experiment 2: Using a Multi-Query Retrieval Strategy

At the heart of a RAG pipeline is the retriever, which is responsible for finding the most relevant documents for a given question. The baseline RAG uses the vector retriever, which ranks documents by cosine similarity between the question embedding and the document embeddings.
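
As a small illustration of cosine similarity, the sketch below uses toy three-dimensional vectors instead of real embeddings. Because the BGE embeddings in this notebook are created with `normalize_embeddings=True`, the cosine similarity of two embeddings equals their plain dot product, which the second printed value demonstrates.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings"; real BGE vectors have many more dimensions.
query = [1.0, 2.0, 2.0]
doc = [2.0, 1.0, 2.0]

unit_query, unit_doc = normalize(query), normalize(doc)
dot_of_units = sum(x * y for x, y in zip(unit_query, unit_doc))

print(round(cosine_similarity(query, doc), 4))
print(round(dot_of_units, 4))
```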

We will now experiment with a multi-query retrieval strategy. The idea is to rephrase the question into multiple queries, retrieve documents for each of them and take the unique union of the results. This can increase the diversity of the retrieved documents and potentially improve the quality of the generated answer.

For this we will use the `MultiQueryRetriever` from LangChain.
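
The unique-union idea can be shown without an LLM or a vector store. The following sketch uses hypothetical document IDs as retrieval results for three rephrasings of one question; it is an illustration only, not the retriever's internal implementation.

```python
def unique_union(result_lists):
    """Merge several result lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical retrieval results for three rephrasings of one question.
results_per_query = [
    ["doc-1", "doc-2", "doc-3"],
    ["doc-2", "doc-4"],
    ["doc-1", "doc-5"],
]

print(unique_union(results_per_query))
```

Keeping the first-seen order is a design choice of this sketch: it makes the merged result deterministic, whereas a plain `set` does not guarantee any order.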


In [None]:
from langchain.retrievers.multi_query import MultiQueryRetriever

In [None]:
mqr_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever, llm=azure_model
)

In [None]:

# Use the LangChain template for the prompt
template = """You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector 
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search. 
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)

generate_queries = (
        prompt_perspectives
        | azure_model
        | StrOutputParser()
        | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])  # drop empty lines
)

In [None]:
from langchain.load import dumps, loads


def get_unique_union(documents: list[list]):
    """ Unique union of retrieved docs """
    # Flatten list of lists, and convert each Document to string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    # Get unique documents
    unique_docs = list(set(flattened_docs))
    # Return
    return [loads(doc) for doc in unique_docs]


# Retrieve
mqr_retrieval_chain = (
        generate_queries
        | mqr_retriever.map()
        | get_unique_union
)

In [None]:
mqr_rag = RunnableParallel(
    {
        "context": mqr_retrieval_chain,
        "question": RunnablePassthrough()
    }
).assign(answer=base_rag_chain)

In [None]:
mqr_rag.invoke("Is the company aiming to reduce gas flaring?")

In [None]:
mqr_evaluator = RAGEvaluator(name="Multi-Query Retrieval",
                             chain=mqr_rag,
                             llm_model=azure_model,
                             embeddings=bge_embeddings)

In [None]:
mqr_evaluator.create_dataset_from_df(df_eval_subset)
mqr_eval_results = mqr_evaluator.evaluate(raise_exceptions=False)

In [None]:
mqr_evaluator.summarize_metrics()

In [None]:
base_evaluator.summarize_metrics()

In [None]:
full_evaluator.summarize_metrics()