# NPR MC1: Cleantech Retrieval Augmented Generation

**Dominik Filliger, Nils Fahrni, Noah Leuenberger**

> The topic of Mini-Challenge 1 is retrieval augmented generation (RAG) incorporating a combination of unsupervised learning, pre-training and in-context learning techniques.

- [Description of the task](https://spaces.technik.fhnw.ch/storage/uploads/spaces/81/exercises/NPR-Mini-Challenge-1-Cleantech-RAG-1708982891.pdf)
- [Introduction to RAG](https://spaces.technik.fhnw.ch/storage/uploads/spaces/81/exercises/Retrieval-Augmented-Generation-Intro-1709021241.pdf)

This notebook is the main entry point for our solution to the NPR Mini-Challenge 1. It explains our approach in detail and walks through the code we used to solve the task. The code for the evaluation, the LangChain LLM model creation and the vector store interaction lives in script files in the `src` directory.

The scripts that create the development subset and the evaluation subset can be found in the `scripts` directory and are referenced in their respective sections.


# Setup


In [None]:
import os
from dotenv import load_dotenv
from tqdm import tqdm
load_dotenv()
from src.generation import get_llm_model, LLMModel
azure_model = get_llm_model(LLMModel.GPT_3_AZURE)

## Observability & Monitoring

> Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting. It allows AI Engineers and Data Scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve.

We will use Phoenix to visualize traces so that we can debug pipelines quickly. The library offers many more features, which we will not use here. Below, we add the Phoenix callbacks to LangChain, the main library of our solution, to record and visualize the traces.


In [None]:
from phoenix.trace.langchain import LangChainInstrumentor
import phoenix as px

px.close_app()
session = px.launch_app()

LangChainInstrumentor().instrument()

For quick access, the Phoenix dashboard is rendered directly in the notebook. It is interactive and can be used to explore the traces.


In [None]:
session.view()

# Data Loading & Preprocessing

In order to save time and resources, we will only load a randomly sampled subset of the data. This subset will be used for development and testing purposes. The full dataset will be used for a final evaluation of our chosen approach.

In [None]:
import pandas as pd
df = pd.read_csv('data/Cleantech Media Dataset/cleantech_media_dataset_v2_2024-02-23_subset.csv')
df.head()

## Splitting content into paragraphs

The content is currently stored as a string that represents a list of strings.

As a first step, we parse this string into a list of strings, explode the list into one row per paragraph and then remove duplicates. This lets us work with the data in a more structured way.

We also remove special characters and empty strings from the content, in keeping with our embedding model (bge-small-en).

To do all this we will use the custom `Preprocessor` class.

###### TODO: validate whether this is the correct approach

In [None]:
from src.preprocessing.preprocessor import Preprocessor

preprocessor = Preprocessor(df)
df = preprocessor.preprocess()
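
The steps described above can be illustrated on a single `content` value. This is a minimal sketch using only the standard library, not the actual `Preprocessor` implementation: the helper names `split_paragraphs` and `deduplicate`, the set of characters that is kept and the sample string are all assumptions made for illustration.

```python
import ast
import re


def split_paragraphs(raw: str) -> list[str]:
    """Parse a stringified list of paragraphs and clean each entry."""
    paragraphs = ast.literal_eval(raw)  # '["a", "b"]' -> ['a', 'b']
    cleaned = []
    for paragraph in paragraphs:
        # Assumption: keep word characters, whitespace and basic punctuation only.
        text = re.sub(r"[^\w\s.,;:!?%()'-]", " ", paragraph)
        # Collapse the whitespace left behind by removed characters.
        text = re.sub(r"\s+", " ", text).strip()
        if text:  # drop entries that are empty after cleaning
            cleaned.append(text)
    return cleaned


def deduplicate(paragraphs: list[str]) -> list[str]:
    """Remove exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(paragraphs))


raw_content = '["Solar power grew.", "", "Solar power grew.", "Wind farms * expand."]'
paragraphs = deduplicate(split_paragraphs(raw_content))
print(paragraphs)
```

On a DataFrame, the same idea would typically be expressed by applying the parsing function to the `content` column, followed by `DataFrame.explode` and `DataFrame.drop_duplicates`.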

# Indexing

Indexing creates a vector representation (embedding) of each paragraph so that it can later be retrieved by similarity.

In [None]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en", 
    model_kwargs={"device": "cpu"}, 
    encode_kwargs={"normalize_embeddings": True}
)

In [None]:
from langchain_core.documents import Document

def create_documents(df):
    docs = []
    for _, row in tqdm(df.iterrows(), total=len(df)):
        content = row['content']

        # Replace missing values with empty strings so no NaN ends up in the metadata
        row = row.fillna('')

        metadata = {
            "url": row['url'],
            "domain": row['domain'],
            "title": row['title'],
            "author": row['author'],
            "date": row['date']
        }

        docs.append(Document(page_content=content, metadata=metadata))

    return docs

documents = create_documents(df)

assert len(documents) == len(df)

## Vector Store

We will use [ChromaDB](https://www.trychroma.com/) to store the embeddings. For easier interaction with the embeddings, we will use the `VectorStore` class, a wrapper around the embeddings and ChromaDB that exposes only the functionality we need for the task.

### ChromaDB Setup

If the environment variables `CHROMADB_HOST` or `CHROMADB_PORT` are not set, the `VectorStore` will use a local, non-persistent ChromaDB client, which is not recommended. Instead, we recommend running a dedicated ChromaDB instance, which can be started with Docker using the following command:

```bash
docker-compose up -d chromadb
```

Set the environment variables `CHROMADB_HOST` and `CHROMADB_PORT` to the host and port of the ChromaDB instance. The default values are `localhost` and `8192`.
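
The fallback behaviour described above could be decided roughly as follows. This is only a sketch under assumptions: `ChromaSettings` and `resolve_chroma_settings` are hypothetical names that do not come from `src.vectorstore`, and the rule "fall back as soon as either variable is missing" is our reading of the description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ChromaSettings:
    """Hypothetical container, not part of `src.vectorstore`."""
    mode: str                   # "http" or "ephemeral"
    host: Optional[str] = None
    port: Optional[int] = None


def resolve_chroma_settings(env: Mapping[str, str] = os.environ) -> ChromaSettings:
    """Pick a remote client if host and port are set, else a local in-memory one."""
    host = env.get("CHROMADB_HOST")
    port = env.get("CHROMADB_PORT")
    if not host or not port:
        # Assumption: a single missing variable is enough to trigger the fallback.
        return ChromaSettings(mode="ephemeral")
    return ChromaSettings(mode="http", host=host, port=int(port))


settings = resolve_chroma_settings({"CHROMADB_HOST": "localhost", "CHROMADB_PORT": "8192"})
fallback = resolve_chroma_settings({})
print(settings)
print(fallback)
```

With the `chromadb` package, the `"http"` case would typically map to `chromadb.HttpClient(host=..., port=...)` and the `"ephemeral"` case to `chromadb.EphemeralClient()`.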

### VectorStore Usage

The vector store is directly tied to the embeddings: a vector store is embedding-specific and can only be queried with the embedding model it was created with.

In [None]:
from src.vectorstore import VectorStore

print("ChromaDB Host: ", os.getenv('CHROMADB_HOST'))
print("ChromaDB Port: ", os.getenv('CHROMADB_PORT'))

bge_vector_store = VectorStore(embedding_function=bge_embeddings,
                               collection="cleantech-bge-small-en")

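In the next step we add the prepared documents from the previous step to the `VectorStore`. The cell is disabled with the `%%script false` magic; remove that line to populate a fresh collection.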

In [None]:
%%script false --no-raise-error
bge_vector_store.add_documents(documents, verbose=True, batch_size=128)

After adding the documents to the vector store we can now perform similarity searches on the documents to verify that the interaction with the vector store works as expected.

In [None]:
bge_vector_store.similarity_search_w_scores("The company is also aiming to reduce gas flaring?")
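
Conceptually, a similarity search embeds the query, compares it with every stored vector and returns the best matches together with their scores. The following sketch shows this as a brute-force cosine similarity search over toy vectors. It is not how ChromaDB or our `VectorStore` is implemented: the function names and the three-dimensional vectors are made up for illustration, and whether the real wrapper reports similarities or distances is not shown here.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_search_with_scores(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        # For unit-length vectors the dot product equals the cosine similarity.
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; a real model such as bge-small-en produces much longer vectors.
toy_index = {
    "flaring": [0.9, 0.1, 0.0],
    "solar": [0.0, 1.0, 0.2],
    "wind": [0.1, 0.2, 0.9],
}
results = similarity_search_with_scores([1.0, 0.0, 0.0], toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Because the embeddings in this notebook are created with `normalize_embeddings=True`, the normalization step would already be done at embedding time.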

# Baseline Pipeline

The baseline pipeline is a first, simple implementation of RAG: retrieve the most similar paragraphs for a question, insert them into a prompt and let the LLM answer.


In [None]:
rag_prompt = """
Answer the question to your best knowledge when looking at the following context:
{context}

Question: {question}
"""

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_from_docs = (
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | ChatPromptTemplate.from_template(rag_prompt)
        | azure_model
        | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {
        "context": bge_vector_store.get_retriever(), 
        "question": RunnablePassthrough()
    }
).assign(answer=rag_chain_from_docs)

In [None]:
rag_chain_with_source.invoke("Is the company aiming to reduce gas flaring?")
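
To make the data flow of the chain easier to follow, here is the same sequence of steps written as plain Python functions. This is a sketch only: `fake_retriever` and `fake_llm` are stand-ins invented for illustration and replace the vector store retriever and the Azure model, so no network access is needed.

```python
PROMPT_TEMPLATE = """
Answer the question to your best knowledge when looking at the following context:
{context}

Question: {question}
"""


def fake_retriever(question: str) -> list[str]:
    """Stand-in for the vector store retriever; returns fixed passages."""
    return [
        "Company X plans to cut gas flaring by 2030.",
        "Flaring releases CO2 and methane.",
    ]


def fake_llm(prompt: str) -> str:
    """Stand-in for the chat model; reports the prompt size instead of answering."""
    return f"(model received {len(prompt)} characters)"


def run_pipeline(question: str) -> dict:
    # Step 1: retrieve context (the "context" branch of RunnableParallel).
    context = fake_retriever(question)
    # Step 2: join the passages and fill the prompt (format_docs + prompt template).
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(context), question=question)
    # Step 3: call the model and keep the answer as a plain string.
    answer = fake_llm(prompt)
    # The result keeps the retrieved sources next to the answer.
    return {"context": context, "question": question, "answer": answer}


result = run_pipeline("Is the company aiming to reduce gas flaring?")
print(sorted(result))
```

Keeping `context` in the output is what later allows the evaluation to score the retrieved passages and not only the generated answer.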

# Evaluation

To compare different pipelines we need to evaluate them. We do this with the `ragas` library, which provides predefined metrics that are described in the [documentation](https://docs.ragas.io/en/stable/concepts/metrics/index.html). We will use the following metrics to evaluate the performance of our pipelines:

- **Context Relevancy**: Measures how relevant the retrieved context is to the question. `ragas` estimates it as the share of sentences in the retrieved context that are needed to answer the question, so values range from 0 to 1 and higher is better.

## Evaluation Set
To provide a fair comparison, we use the same evaluation set for all pipelines. The evaluation set was created beforehand with the script `scripts/generate_testset.py`. Evaluating on this subset of the data saves time and resources.

In [None]:
df_eval_subset = pd.read_csv('data/Cleantech Media Dataset/cleantech_media_dataset_v2_2024-02-23_subset_eval.csv')
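# Drop rows without an answer
df_eval_subset = df_eval_subset.dropna(subset=['answer'])
# Fixed seed so that every pipeline is evaluated on the same two samples
df_eval_subset = df_eval_subset.drop_duplicates().sample(2, random_state=42)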
df_eval_subset

## RAGEvaluator
The `RAGEvaluator` class is a wrapper around the `ragas` library that provides a simple interface for evaluating a pipeline. Its evaluation method returns the results as a pandas DataFrame. The metrics are calculated for each example in the evaluation set, and the results can be aggregated over the whole evaluation set to obtain the overall performance of the pipeline.

In [None]:
from src.evaluation import RAGEvaluator
rag_evaluator = RAGEvaluator(chain=rag_chain_with_source,
                             llm_model=azure_model,
                             embeddings=bge_embeddings)

In [None]:
rag_evaluator.create_dataset_from_df(df_eval_subset)
rag_evaluator.evaluate(raise_exceptions=False)
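
Aggregating the per-example results into one value per metric can be sketched as follows. The scores below are invented placeholders, not results of our pipeline, and `aggregate` is a hypothetical helper that is not part of `RAGEvaluator`. The sketch assumes that failed examples show up as NaN when exceptions are not raised, and skips them.

```python
import math
from statistics import mean

# Hypothetical per-example scores; real values come from the evaluator.
per_example_scores = [
    {"question": "Q1", "context_relevancy": 0.50},
    {"question": "Q2", "context_relevancy": 0.25},
    {"question": "Q3", "context_relevancy": math.nan},  # assumed failed example
]


def aggregate(rows, metric_names):
    """Average each metric over all examples, ignoring NaN entries."""
    summary = {}
    for name in metric_names:
        values = [row[name] for row in rows if not math.isnan(row[name])]
        summary[name] = mean(values) if values else math.nan
    return summary


overall = aggregate(per_example_scores, ["context_relevancy"])
print(overall)
```

With a pandas DataFrame, `DataFrame.mean` gives the same result, since it skips NaN values by default.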