# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
%cd [change to the directory you prefer]

[Errno 2] No such file or directory: '[change to the directory you prefer]'
/content


In this section, we install the necessary Python packages, download the weights of the quantized version of LLaMA 3.1 8B, and download the dataset. Note that the model weights are around 8 GB. If you are using your Google Drive as the working directory, make sure it has enough free space for the model.
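Before starting the download, it may help to confirm that the working directory has room for the roughly 8 GB weight file. The following is a minimal sketch using only the standard library; the helper name `has_enough_space` and the 9 GB threshold are illustrative assumptions, not part of the homework code.

```python
import shutil

def has_enough_space(path: str = ".", required_gb: float = 9.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

if not has_enough_space("."):
    print("Warning: less than 9 GB free; the model download may fail.")
```

The threshold is set slightly above the file size so that the dataset and the per-question answer files also fit.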

In [2]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean
!python3 -m pip install trafilatura
from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl (445.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 445.2/445.2 MB 219.9 MB/s eta 0:00:00
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 7.1 MB/s eta 0:00:00
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.meta

In [3]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLMs.

In the following code block, we first load the downloaded model weights onto the GPU.
Then, we implement the generate_response() function so that you can get the generated response from the model more easily.

You can ignore "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.
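The utility function takes chat-format messages: a list of dictionaries, each with a "role" and a "content" key, as used in the test cell later in this notebook. A minimal sketch of building such a list is shown here; `build_messages` is a hypothetical helper introduced for illustration only.

```python
from typing import Dict, List

def build_messages(system_prompt: str, user_prompt: str) -> List[Dict[str, str]]:
    """Build a two-turn chat message list: one system prompt, one user prompt."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("You are a helpful assistant.", "What is RAG?")
# In the notebook, this list would then be passed to the model as:
# generate_response(llama3, messages)
```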

In [4]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: list) -> str:
    '''
    Run inference on the model with the given chat messages (a list of {"role": ..., "content": ...} dicts).
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [5]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:    # Timeouts and network errors; a bare except would also swallow task cancellation.
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function will search the keyword and return the text content in the first n_results web pages.

    Warning: You may get HTTP 429 errors if you search too many times within a short period. This is unavoidable, and you search for more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP or wait until Google lifts the ban (we don't know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Request three times as many results as needed in case some of them are invalid.
    results = list(_search(keyword, n_results * 3, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
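Because rate limiting is a known risk with the search tool, one common mitigation is to retry with exponential backoff. The sketch below is a generic wrapper, not part of the TA's code; `retry_with_backoff` and its parameters are hypothetical, and it assumes that a failed search surfaces as an exception. It cannot lift a ban that Google has already imposed; it only spaces out repeated attempts.

```python
import asyncio
import random

async def retry_with_backoff(make_call, max_attempts: int = 4, base_delay: float = 2.0):
    """Await make_call(); on failure, sleep base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Hypothetical usage inside an async notebook cell:
# results = await retry_with_backoff(lambda: search("RTX 5090 price"))
```

The random jitter keeps several retries from firing at exactly the same intervals.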

## Test the LLM inference pipeline

In [None]:
# You can try out different questions here.
test_question='RTX 5090多少錢'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

我無法提供最新的價格資訊，因為市場上商品和服務可能會有變動。然而，我可以告訴你，RTX 5090是一款高端顯示卡，由NVIDIA推出。

根據目前可得知的情況，這些是 RTx系列的一部分，但我無法提供最新的價格資訊。如果您想知道最新的市場報導或購買建議，我會推薦查詢線上商店、專業評論家，或直接聯繫NVIDIA官方網站。


## Agents

The TA has implemented an agent class, LLMAgent, for you. You can use this class to create agents that interact with the LLM. The class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
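    - llm: An identifier of the LLM backend used by the agent.
    - temperature, max_tokens, verbose: Per-agent settings. In the code below, verbose enables debug printing; temperature and max_tokens are stored on the agent but are not passed to generate_response(), which uses its own fixed values.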
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [None]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",temperature=0.2,max_tokens=512, verbose=False):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.temperature=temperature
        self.verbose=verbose
        self.max_tokens=max_tokens
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            if self.verbose:
                print(f" Agent Role {self.role_description}")
                print(f" Tasks: {self.task_description}")
                print(f" User message: {message}")
            messages = [
                {"role": "system", "content": f"你的角色：{self.role_description}，請用繁體中文回答"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"你的任務：{self.task_description}\n 訊息，資料：{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separation text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""
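The hint in the class above recommends a proper separation text between the task description and the user message. One possible way to do this is with labeled delimiters, sketched below. `format_agent_messages` and the "### Task" / "### Message" labels are illustrative assumptions; any unambiguous delimiter would serve.

```python
from typing import Dict, List

def format_agent_messages(role: str, task: str, message: str) -> List[Dict[str, str]]:
    """Keep the task and the user message visibly apart with labeled delimiters."""
    user_content = (
        "### Task\n"
        f"{task.strip()}\n"
        "### Message\n"
        f"{message.strip()}"
    )
    return [
        {"role": "system", "content": f"Your role: {role.strip()}"},
        {"role": "user", "content": user_content},
    ]

example = format_agent_messages(
    "History expert", "Answer yes or no.", "Was Rome founded before Athens?"
)
print(example[1]["content"])
```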

TODO: Design the role description and task description for each agent.

In [7]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="你是一個精通問題分析的AI，專門負責提取關鍵問題的部分，並且避免回覆與問題無關的內容",
    task_description="""1. 請從以下問題中提取核心問題，刪除不相關資訊
                        2. 請不要回答問題答案
                        3. 生成回覆的時候請直接回答該問題，不要有廢話
                    """,
    verbose=False
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description="你是個專門提取關鍵字的AI，負責從問題中找出最關鍵的關鍵字，以便進行有效的網路搜尋",
    task_description="""1. 請從以下問題中取出重要關鍵字，去掉助詞及不必要的敘述
                        2. 取出重要關鍵字的額外規則:若問題中有類似"最"、"多少"、"多久"、"多長"、"是誰"、"在哪裡"、"第一"、"最後"、"誰的"、"哪個"，則這些詞必須被列為關鍵字
                        3. 請不要回答問題
                        4. 回答的時候請只說關鍵字，中間請以頓號分開
                        5. 若題目中有"依據..."，則也要將"依據..."列為關鍵字
                     """,
    verbose=False
)

# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。",
    task_description="""以下列點為將給你的資訊:
              1. 先給你你要回答的題目
              2. 網路上搜尋到的資訊
              目標:請依1.要回答的題目，再從2.去尋找用來回答的正確答案
              回答方式要求:要以精簡的方式回答，且回答的語言為繁體中文，若答案有人名，則僅限人名的部分可用英文回答，不得使用簡體中文與殘體中文
              尋找答案方法提示:裡面會有一大堆跟正確答案不相關的資訊，從中過濾出必要資訊。舉例:在整個台灣本島中，有幾個直轄市? 那你去分析這個問題的時候就要先知道整個範圍是在台灣本島，而不是泰國，日本這些國家中的直轄市數量，目標是去尋找直轄市數量。
              回答格式要求:請提供明確答案，答案只會有一個正確的，不會有什麼答案是A和B都是這種，不要重複問題，若無法回答請不要說「我不知道」，而是以你原本就知道的知識回答。舉例:題目:台灣首都在哪裡? 你的回答是台北市，而非台北市和新北市(請不要有這種有多於一個答案的回答)
              另外你的回答不能出現任何問句，若你想要用問句回答則請用你原本的資料庫的答案回答
              你的回答請多解釋一兩句話，但答案的關鍵字一定要有""",
    verbose=False
)
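The sample outputs later in this notebook show the keyword agent using inconsistent separators (semicolons, enumeration commas, middle dots, "#") and sometimes appending extra text after a blank line. A small post-processing step before searching may make the query more predictable. This is a sketch under that assumption; `normalize_keywords` and the cap of five keywords are hypothetical choices.

```python
import re

# Separators observed in the sample outputs, plus their common full-width variants.
_SEPARATORS = re.compile(r"[;；、,，·#\n]+")

def normalize_keywords(raw: str, max_keywords: int = 5) -> str:
    """Split the agent's raw output on common separators and rejoin with spaces."""
    first_block = raw.strip().split("\n\n")[0]  # Drop anything after the first blank line.
    parts = [p.strip() for p in _SEPARATORS.split(first_block)]
    keywords = [p for p in parts if p]
    return " ".join(keywords[:max_keywords])

print(normalize_keywords("虎山雄風飛揚; 校歌"))
```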

## RAG pipeline

In [8]:
import requests

def fetch_html(url):
    """ Fetches clean text from a webpage using requests & BeautifulSoup. """
    headers = {"User-Agent": "Mozilla/5.0"}

    try:
        response = requests.get(url, headers=headers, timeout=10)

        if response.status_code != 200:
            print(f"Failed to fetch {url}, Status Code: {response.status_code}")
            return None

        # Parse HTML
        soup = BeautifulSoup(response.text, "html.parser")

        # Remove unnecessary elements
        for script in soup(["script", "style", "header", "footer", "nav", "aside"]):
            script.extract()

        # Extract visible text
        text = soup.get_text(separator=" ")

        # Clean up extra spaces
        clean_text = " ".join(text.split())

        return clean_text[:10000]  # Truncate to avoid excessive length

    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining whether the question needs a search or not, reconfirming the answer before returning it to the user......) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
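One heuristic in the spirit of the suggestions above is to split each retrieved page into chunks and keep only the chunks that mention the search keywords most often, so that less irrelevant text reaches the model. The sketch below uses plain substring counts; this is a deliberately simple stand-in, not the method from the homework slides, and both function names are hypothetical.

```python
from typing import List

def split_into_chunks(text: str, chunk_size: int = 500) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def rank_chunks(chunks: List[str], keywords: List[str], top_k: int = 3) -> List[str]:
    """Return the top_k chunks by total number of keyword occurrences."""
    def score(chunk: str) -> int:
        return sum(chunk.count(k) for k in keywords if k)
    # sorted() is stable, so equally scored chunks keep their original order.
    return sorted(chunks, key=score, reverse=True)[:top_k]

demo_chunks = ["weather report", "a try scores points; a try is a touchdown", "try"]
print(rank_chunks(demo_chunks, ["try"], top_k=2))
```

Fixed-size chunks can cut a relevant sentence in half; overlapping chunks are a common refinement.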

In [9]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # Currently, it only feeds the question directly to the LLM.
    # You may want to get the final results through multiple inferences.
    # Just a quick reminder, make sure your input length is within the limit of the model context window (16384 tokens), you may want to truncate some excessive texts.
    # print("=== Step 1: Extracting Core Question ===")
    # core_question=question_extraction_agent.inference(question)
    # print(f"Extracted Question: {core_question}")

    # print("=== Step 2: Extracting Keywords for Search ===")
    # search_keywords=keyword_extraction_agent.inference(core_question)
    # print(f"Search Keywords:{search_keywords}")

    # print("=== Step 3: Retrieving Relevant Information ===")
    # search_results=list(_search(search_keywords,num_results=3))
    # print(f"Search Results: {search_results}")
    # if not search_results:
    #   print("No relevant search results found. The model will attempt to answer based on its internal knowledge.")
    #   retrieved_text="無搜尋結果，請根據內部知識回答"
    # else:
    #   retrieved_texts=[fetch_html(url) for url in search_results]
    #   retrieved_texts=[text for text in retrieved_texts if text]
    #   retrieved_text="\n\n".join(retrieved_texts)[:16000]
    #   print(f"Retrieved Text: {retrieved_text}")
    core_question=question_extraction_agent.inference(question)
    print(f"core question:{core_question}")
    search_keywords=keyword_extraction_agent.inference(core_question)
    print(f"search keywords:{search_keywords}")
    search_results = await search(search_keywords)
    # Character budget (not tokens) for the retrieved text; leaves room for the question and system prompt.
    MAX_CONTEXT_SIZE = 14000
    print('\n')
    # Join all search results into one text block, then truncate it to the budget.
    retrieved_text = "\n\n".join(search_results)
    retrieved_text = retrieved_text[:MAX_CONTEXT_SIZE]
    # print("=== Step 4 : Answering the Question ===")
    qa_prompt= f"""
    1. 問題：{core_question}
    2. 請根據以下檢索結果回答問題：
    ===============================
    {retrieved_text}
    ===============================
    """
    answer=qa_agent.inference(qa_prompt)
    # print(f"Answer: {answer}")
    return answer
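The pipeline above joins all search results and then cuts the combined text at a fixed length, so a long first page can crowd out the others. An alternative is to give each result an equal share of the budget, sketched below. `truncate_fairly` is a hypothetical helper, and the budget is counted in characters, which only approximates the model's token limit.

```python
from typing import List

def truncate_fairly(results: List[str], budget: int = 14000, separator: str = "\n\n") -> str:
    """Give each result an equal share of the character budget, then join them."""
    if not results:
        return ""
    # Reserve room for the separators so the joined text stays within the budget.
    usable = budget - len(separator) * (len(results) - 1)
    share = max(usable // len(results), 0)
    return separator.join(r[:share] for r in results)

demo_text = truncate_fairly(["a" * 10, "b" * 10], budget=12)
print(len(demo_text))
```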


## Answer the questions using your pipeline!

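Colab enforces usage limits, so your session may disconnect. The following code saves your answer to each question in its own file. If you have mounted your Google Drive as instructed, you can rerun the whole notebook to continue. Note that the skip-if-exists checks in the cell below are commented out; uncomment them to resume from where you stopped instead of regenerating every answer.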
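To see how much work remains after a disconnection, it may help to list the question IDs that do not yet have a saved answer. The sketch below assumes the per-question file naming used in this notebook (student ID, underscore, question ID, ".txt"); `pending_ids` is a hypothetical helper, and treating empty files as unanswered is an added assumption.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

def pending_ids(output_dir: str, student_id: str, ids: Iterable[int]) -> List[int]:
    """Return the question IDs whose answer file is missing or empty."""
    todo = []
    for qid in ids:
        path = Path(output_dir) / f"{student_id}_{qid}.txt"
        if not path.exists() or path.stat().st_size == 0:
            todo.append(qid)
    return todo

# Demonstration in a temporary directory with one answered question out of three.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "demo_1.txt").write_text("answer\n", encoding="utf-8")
    remaining = pending_ids(demo_dir, "demo", range(1, 4))
print(remaining)
```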

In [10]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "b12901166"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        # if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            # continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = [l.strip() for l in input_f.readlines()]    # Strip trailing newlines, as done for the public set.
    for id, question in enumerate(questions, 31):
        # if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            # continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:    # 'w' so that a rerun overwrites the old answer; the combine step reads only the first line.
            print(answer, file=output_f)

core question:核心問題：「虎山雄風飛揚」是哪間學校的校歌？
search keywords:虎山雄風飛揚; 校歌


1 虎山雄風飛揚是光華國小的校歌。
core question:核心問題：2025年初，NCC規定民眾透過境外郵購自用產品回台加收審查費多少錢？
search keywords:2025年初、NCC規定民眾透過境外郵購自用產品回台加收審查費多少錢？

最關鍱字： 
# N CC # 境 外 郡 購 自 用 產 品 回 台 加 收 審 查 費


2 NCC規定民眾透過境外郵購自用產品回台加收審查費750元，僅適用於通過寄送的案件，而非攜帶入國。
core question:核心問題：第一代 iPhone 是由哪位蘋果 CEO 發表？
search keywords:第一代 iPhone · 蘋果 CEO


3 史蒂夫·乔布斯是第一代 iPhone 的发表者。
core question:核心問題：托福網路測驗 TOEFL iBT 要達到多少分才能申請進階英文免修？
search keywords:托福網路測驗; TOEFL iBT ; 免修


4 托福網路測驗 TOEFL iBT 要達到多少分才能申請進階英文免修？  根據國立臺南大學通識教育中心的資訊，TOFELi BT 72 分以上可獲得大一學生及轉入生的「英語」課程全額退費。
core question:核心問題：Rugby Union 中觸地 try 可得幾分？
search keywords:觸地 try、Rugby Union


5 觸地 try 可得 5 分。
core question:核心問題：卑南族的祖先發源地位於現今哪個行政區劃？
search keywords:卑南族 # 祖先發源地 依據資料


6 卑南族的祖先發源地位於現今台東縣太麻里鄉美和村附近。
core question:核心問題：熊仔的碩班指導教授為誰？
search keywords:熊仔 · 碩班指導教授


7 熊仔的碩班指導教授為李琳山。
core question:核心問題：誰發現了電磁感應定律？
search keywords:誰 · 發現了電磁感應定律


8 迈克尔·法拉第
core question:核心問題：距離國立臺灣史前文化博物館最近的臺鐵車站為？
search keyw

In [11]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
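The combining cell above raises an error if any per-question file is missing, for example after an interrupted run. A more forgiving variant could record the missing IDs and keep the line positions aligned, as sketched here. `combine_answers` is a hypothetical helper; writing an empty line for a missing answer is an assumption about what is acceptable, so check the submission rules before relying on it.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

def combine_answers(output_dir: str, student_id: str, n_questions: int) -> Tuple[List[str], List[int]]:
    """Read the first line of each per-question file; record IDs with no file."""
    answers, missing = [], []
    for qid in range(1, n_questions + 1):
        path = Path(output_dir) / f"{student_id}_{qid}.txt"
        if path.exists():
            lines = path.read_text(encoding="utf-8").splitlines()
            answers.append(lines[0].strip() if lines else "")
        else:
            answers.append("")  # Keep line numbers aligned with question IDs.
            missing.append(qid)
    return answers, missing

# Demonstration in a temporary directory where question 2 has no file.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "demo_1.txt").write_text("first answer\n", encoding="utf-8")
    (Path(demo_dir) / "demo_3.txt").write_text("third answer\nstale line\n", encoding="utf-8")
    answers, missing = combine_answers(demo_dir, "demo", 3)
print(answers, missing)
```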