**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: RAG

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/rag_guide.ipynb](guides/rag_guide.ipynb)


***
<br>

* Remember to include some reflections on your results. Are there, for example, any hyperparameters that are particularly important?

* You should follow the steps given in the `rag_guide` notebook to create your own RAG system.

<br>

***

#### Imports

In [2]:
from decouple import config

from langchain_ibm import WatsonxLLM
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams

In [3]:
from typing import Literal, Any
from copy import deepcopy

from typing_extensions import TypedDict
import matplotlib.pyplot as plt
import numpy as np
from pydantic import BaseModel, Field
from IPython.display import Image, display
from tqdm import tqdm

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter
from langchain.prompts import PromptTemplate
from langchain_ibm import WatsonxEmbeddings, WatsonxLLM
from langgraph.graph import START, StateGraph
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams

import litellm
from litellm import completion
import instructor
from instructor import Mode

In [4]:
import os
from dotenv import load_dotenv

load_dotenv()  # Ensure this runs before accessing environment variables

WX_API_KEY = os.getenv("WX_API_KEY")


#### Retrieve secrets

In [5]:
WX_API_KEY = config("WX_API_KEY")
WX_PROJECT_ID = config("WX_PROJECT_ID")
WX_API_URL = "https://us-south.ml.cloud.ibm.com"


#### Authenticate and initialize LLM

In [17]:
llm = WatsonxLLM(

        model_id= "ibm/granite-3-8b-instruct",
        url=WX_API_URL,
        apikey=WX_API_KEY,
        project_id=WX_PROJECT_ID,

        params={
            GenParams.DECODING_METHOD: "greedy",  # Greedy decoding selects the highest-probability token at each step
            GenParams.TEMPERATURE: 0.4,  # Only influences output with "sample" decoding; it has no effect under "greedy"
            GenParams.MIN_NEW_TOKENS: 5,
            GenParams.MAX_NEW_TOKENS: 1_000,  # Allows lengthy responses, but may generate unnecessary text
            GenParams.REPETITION_PENALTY: 1.2,  # Discourages repetition, which can improve output quality
        }

)

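Changes:

Temperature: raised from 0 to 0.4 (the value set in the code above).

-> Since RAG depends on retrieved documents, a moderate non-zero temperature (e.g., 0.3-0.7) could add some response variety while keeping answers grounded in the context. Note that temperature only influences generation when the decoding method is `sample`; with `greedy` decoding, as configured above, the highest-probability token is always chosen and the temperature value makes no difference.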
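
The interplay between decoding method and temperature can be illustrated with a small standalone sketch. It is not how watsonx implements decoding; the logits are made-up numbers and `next_token_probs` is a hypothetical helper. It only shows that greedy selection is unaffected by temperature, while sampling becomes more concentrated on the top token as temperature drops.

```python
import math
import random


def next_token_probs(logits, temperature):
    """Softmax over logits after dividing each one by the temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

# Greedy decoding: always take the highest-scoring token.
# Dividing by a positive temperature never changes which token that is.
greedy_choice = max(range(len(logits)), key=lambda i: logits[i])

# Sampling: a lower temperature puts more probability on the top token.
probs_low = next_token_probs(logits, temperature=0.4)
probs_high = next_token_probs(logits, temperature=1.0)

rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled_choice = rng.choices(range(len(logits)), weights=probs_low, k=1)[0]

print("greedy:", greedy_choice)
print("p(top token) at T=0.4:", round(probs_low[0], 3))
print("p(top token) at T=1.0:", round(probs_high[0], 3))
print("sampled:", sampled_choice)
```

If response variety is actually wanted, the decoding method would need to be switched to sampling for the temperature setting to matter.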

#### Use LLM

In [18]:
llm_result = llm.generate(["Hi how are you?"])

print(type(llm_result))
print(llm_result)

<class 'langchain_core.outputs.llm_result.LLMResult'>
generations=[[Generation(text="\nI'm an artificial intelligence and don't have feelings, but I'm here to help you. How can I assist you today?", generation_info={'finish_reason': 'eos_token'})]] llm_output={'token_usage': {'generated_token_count': 31, 'input_token_count': 5}, 'model_id': 'ibm/granite-3-8b-instruct', 'deployment_id': None} run=[RunInfo(run_id=UUID('72e3b4f1-abd5-4d0c-8064-855aabe9c33a'))] type='LLMResult'


#### Load documents

Load text documents from the local file system to build a knowledge base.

In [19]:
document = TextLoader("data/madeup_company.md").load()[0]
document.metadata

{'source': 'data/madeup_company.md'}

#### Split documents

Since we are dealing with a markdown file, we can use `MarkdownHeaderTextSplitter`. This splitter will split the document into chunks based on the headers in the markdown file. This is a good way to maintain the structure of the document and ensure that the chunks are coherent.

In [20]:
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"), ("####", "Header 4")]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = text_splitter.split_text(document.page_content)
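
To make the idea of header-based splitting concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation (it ignores, for example, `#` characters inside code fences), and `split_by_headers` is a hypothetical helper; it only shows how each chunk can carry the path of headers it sits under.

```python
import re


def split_by_headers(markdown_text, max_level=4):
    """Split markdown into (header_path, body) chunks at header lines."""
    header_re = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    path = {}    # header level -> current title at that level
    body = []    # lines collected for the current chunk
    chunks = []

    def flush():
        # Store the collected body together with a snapshot of the header path.
        text = "\n".join(body).strip()
        if text:
            chunks.append((dict(path), text))
        body.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header invalidates any same-level or deeper headers seen earlier.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


sample = "# About\nIntro text.\n## Products\n### CloudMate\nStorage.\n## Support\nHelp."
for headers, text in split_by_headers(sample):
    print(headers, "->", text)
```

Each chunk keeps its full header path, which is the same kind of information `MarkdownHeaderTextSplitter` exposes through the `metadata` of each chunk.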

#### Preprocess chunks

If we look at the metadata for each chunk, we see that it is neatly split into header types (Header 1, Header 2, etc.) and the text content. This is useful! We can write a function that prepends the headers to the text content itself, so that the model can use this information to generate better responses.

In [21]:
def update_documents_with_headers(chunks):
    """
    Creates a new list of Document objects with page_content prepended with headers
    in [Header1/Header2/Header3]: format
    
    Returns new objects rather than modifying the original chunks
    """
    updated_chunks = []
    
    for doc in chunks:
        # Create a deep copy of the document to avoid modifying the original
        new_doc = deepcopy(doc)
        
        # Collect Header 1-3 from the metadata (Header 4 is split on, but not added to the prefix)
        headers = []
        for i in range(1, 4):
            key = f'Header {i}'
            if key in new_doc.metadata:
                headers.append(new_doc.metadata[key])
        
        # Create the header prefix and update page_content
        if headers:
            prefix = f"[{'/'.join(headers)}]: "
            new_doc.page_content = prefix + "\n" + new_doc.page_content
        
        updated_chunks.append(new_doc)
    
    return updated_chunks


docs = update_documents_with_headers(chunks)

#### Initialize the embedding model


In [22]:
embed_params = {}

watsonx_embedding = WatsonxEmbeddings(
    model_id="ibm/granite-embedding-278m-multilingual",
    url=WX_API_URL,
    project_id=WX_PROJECT_ID,
    apikey=WX_API_KEY,
    params=embed_params,
)

#### Create vector index

In [23]:
local_vector_db = Chroma.from_documents(
    collection_name="my_collection",
    embedding=watsonx_embedding,
    persist_directory="my_vector_db", # This will save the vector database to disk! Delete it if you want to start fresh.
    documents=docs,
    
)
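
Because `persist_directory` keeps the collection on disk, re-running the cell above adds the same chunks to the existing collection again, which can show up as duplicate documents in retrieval results. One possible mitigation, sketched below, is to derive a deterministic ID from each chunk's content. The helper `stable_ids` is hypothetical, and the sketch assumes that `Chroma.from_documents` accepts an `ids` argument and that re-adding an existing ID overwrites the entry instead of duplicating it; both are worth verifying against the installed `langchain_chroma` version.

```python
import hashlib


def stable_ids(texts):
    """Derive a deterministic ID for each chunk from its text content."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


# Made-up chunk texts standing in for doc.page_content values.
chunk_texts = ["[About]: \nIntro text.", "[About/Products]: \nStorage."]

ids_first_run = stable_ids(chunk_texts)
ids_second_run = stable_ids(chunk_texts)

# Identical content yields identical IDs on every run,
# while different chunks get different IDs.
print(ids_first_run == ids_second_run)
print(len(set(ids_first_run)) == len(chunk_texts))
```

Under those assumptions, the IDs could be passed as `ids=stable_ids([d.page_content for d in docs])`. Chunks with identical text would share an ID, so they would need to be deduplicated first.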

#### Retrieve documents with semantic search

In LangChain we use `VectorStoreRetriever` as a sort of wrapper around the vector index. This retriever is used to search for documents based on their embeddings. 

In [24]:
# Use the vectorstore as a retriever
retriever = local_vector_db.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 3, # Retrieves the top k=3 most similar documents.
    }
)
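
What "top k most similar" means can be sketched with toy vectors. This is an illustration only: the two-dimensional vectors are made up, the helpers `cosine_similarity` and `top_k` are hypothetical, and the metric Chroma actually uses depends on how the collection is configured (it is not necessarily cosine similarity).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy "embeddings": four documents and one query.
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [0.9, 0.1]

print(top_k(query_vec, doc_vecs, k=3))
```

A larger `k` gives the LLM more context to draw on but also more potentially irrelevant text, which is one reason `k` is a hyperparameter worth reflecting on.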

#### Create a RAG prompt template

In [25]:
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. Base your answer strictly on retrieved context. If uncertain, state 'I don’t know'. Use three sentences maximum and keep the answer concise.

Question:
{question}

Context: 
{context} 

Answer:
"""

prompt = PromptTemplate.from_template(template)

Changes:

Added the instruction "Base your answer strictly on retrieved context. If uncertain, state 'I don’t know'." to the prompt.

#### Combining our RAG pipeline

In [26]:
question = "What is CloudMate?"

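# Note: this queries the vector store directly (default k=4) rather than the `retriever` (k=3) defined above
retrieved_docs = local_vector_db.similarity_search(question)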
docs_content = "\n\n".join(f"Document {i+1}:\n{doc.page_content}" for i, doc in enumerate(retrieved_docs))
formated_prompt = prompt.invoke({"question": question, "context": docs_content})

In [27]:
print(formated_prompt.to_string()[:1000])

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. Base your answer strictly on retrieved context. If uncertain, state 'I don’t know'. Use three sentences maximum and keep the answer concise.

Question:
What is CloudMate?

Context: 
Document 1:
[About MadeUpCompany/Products and Services/CloudMate – Secure and Scalable Cloud Storage]: 
CloudMate is our flagship cloud storage solution, designed for businesses of all sizes. Features include:
- ✅ Seamless data migration with automated backups
- ✅ Military-grade encryption and multi-factor authentication
- ✅ Role-based access control for enterprise security
- ✅ AI-powered file organization and search capabilities

Document 2:
[About MadeUpCompany/Products and Services/CloudMate – Secure and Scalable Cloud Storage]: 
CloudMate is our flagship cloud storage solution, designed for businesses of all sizes. Features include:
- ✅ Seamless data migration with automated backups
- 

In [28]:
answer = llm.invoke(formated_prompt)

In [29]:
print(answer)

CloudMate is a flagship cloud storage solution by MadeUpCompany, offering seamless data migration, military-grade encryption, role-based access control, and AI-powered file organization across various business tiers.


# Evaluation


The RAG system is evaluated using an LLM-as-a-judge approach.
After retrieving relevant documents and generating an answer, the same LLM is prompted to assess the response on three criteria:
- Faithfulness – Are all claims in the answer directly supported by the context?
- Answer Relevance – Does the answer directly address the question?
- Context Relevance – Does the context contain only what's needed to answer the question?

The model outputs scores from 1 to 5 for each category.
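
Since the judge is itself an LLM, its output is not guaranteed to respect this format. A minimal sketch of a sanity check on the parsed result is shown below; `validate_scores` and `EXPECTED_KEYS` are hypothetical names that are not part of the guide notebook, and the key names are assumed to match the JSON format requested in the judge prompt.

```python
EXPECTED_KEYS = ("faithfulness", "answer_relevance", "context_relevance")


def validate_scores(scores):
    """Return a list of problems found in a judge result; an empty list means valid."""
    if not isinstance(scores, dict):
        return ["result is not a JSON object"]
    problems = []
    for key in EXPECTED_KEYS:
        value = scores.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key}: expected an integer, got {value!r}")
        elif not 1 <= value <= 5:
            problems.append(f"{key}: {value} is outside the 1-5 range")
    return problems


good = {"faithfulness": 5, "answer_relevance": 5, "context_relevance": 5}
bad = {"faithfulness": 7, "answer_relevance": "high"}

print(validate_scores(good))
print(validate_scores(bad))
```

A result that fails validation could be discarded or re-requested rather than silently included in any aggregate.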

In [38]:

import json


def llm_judge(question, context, generated_answer):
    """
    Evaluates the generated answer using an LLM judge and returns structured scores as JSON.
    """
    strict_prompt = f"""
You are an evaluation system. Given a question, the retrieved context, and a generated answer, evaluate the answer based on:

1. Faithfulness (1-5): Are all claims in the answer directly inferable from the context?
2. Answer Relevance (1-5): Does the answer directly address the question?
3. Context Relevance (1-5): Does the context exclusively contain information needed to answer the question?

Only return a JSON object in the following format and NOTHING ELSE:

{{
  "faithfulness": <score>,
  "answer_relevance": <score>,
  "context_relevance": <score>
}}

Question:
{question}

Context:
{context}

Answer:
{generated_answer}

Evaluation:
"""
    # Invoke the model with stricter formatting request
    evaluation_raw = llm.invoke(strict_prompt)

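    # Parse the first JSON object in the output, ignoring markdown fences or stray text around it
    start = evaluation_raw.find("{")
    try:
        if start == -1:
            raise json.JSONDecodeError("No JSON object found", evaluation_raw, 0)
        scores, _ = json.JSONDecoder().raw_decode(evaluation_raw[start:])
        return scores
    except json.JSONDecodeError:
        print("Raw model output (for debugging):", evaluation_raw)
        return {"error": "Invalid JSON output from LLM judge."}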



In [39]:
# Example usage:
question = "What is CloudMate?"
retrieved_docs = local_vector_db.similarity_search(question)
docs_content = "\n\n".join(f"Document {i+1}:\n{doc.page_content}" for i, doc in enumerate(retrieved_docs))

# Generate answer
formated_prompt = prompt.invoke({"question": question, "context": docs_content})
answer = llm.invoke(formated_prompt)
print("Generated Answer:", answer)

# Evaluate answer
evaluation_result = llm_judge(question, docs_content, answer)
print("Evaluation Result:", evaluation_result)

Generated Answer: CloudMate is a flagship cloud storage solution by MadeUpCompany, offering seamless data migration, military-grade encryption, role-based access control, and AI-powered file organization across various business tiers.
Evaluation Result: {'faithfulness': 5, 'answer_relevance': 5, 'context_relevance': 5}


The LLM judge assigned the following scores:

Faithfulness = 5: Every claim in the answer is judged to be backed by the retrieved documents (no hallucinations).

Answer Relevance = 5: The answer is judged to address the question directly and completely.

Context Relevance = 5: The retrieved context is judged to contain only information needed to answer the question.



### Drawbacks of the LLM-as-a-judge evaluation method:
One key limitation of this approach is that a model may be biased when evaluating its own outputs, and here the same model (`ibm/granite-3-8b-instruct`) is used for both generation and evaluation. This can inflate scores, which may be why Faithfulness, Answer Relevance, and Context Relevance are all rated 5 above.

Additionally, without reference answers or human oversight, the evaluation lacks an objective baseline, making it harder to detect hallucinations or overly generic responses.
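
One way to add such a baseline is to compare generated answers against hand-written reference answers. The sketch below uses a simple token-overlap F1 score; it is a crude lexical measure rather than a replacement for human review, the helper names are hypothetical, and the reference answer is invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(generated, reference):
    """Token-overlap F1 between a generated answer and a reference answer."""
    gen_tokens, ref_tokens = tokenize(generated), tokenize(reference)
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer written by a human for this question.
reference = "CloudMate is MadeUpCompany's flagship cloud storage solution for businesses."
generated = "CloudMate is a flagship cloud storage solution by MadeUpCompany."

print(round(token_f1(generated, reference), 2))
```

A low overlap score combined with a high judge score would be a signal to inspect that answer manually.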