<a href="https://colab.research.google.com/github/catherineabcde/Generative-AI-Text-and-Image-Synthesis/blob/main/Week%207_1%20Create%20a%20RAG%20database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **<font color="#0000FF">Building Your Own RAG-Enhanced Chatbot (Week 7 Assignment)</font>**

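For this week's assignment, I want to build a chatbot that answers questions about the characters in the Japanese anime Chiikawa (吉伊卡哇). I love this show and hope the chatbot helps more people get to know its characters!

After looking through the existing character guides, I found Wikipedia's to be the most complete, so I first scraped that page and organized the data into a JSON file that serves as the database RAG retrieves from.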

### 1. Data processing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**1.1 Import the packages needed for scraping**

In [None]:
# HTTP request
import requests

# HTML parsing
from bs4 import BeautifulSoup
import lxml

# data saving
import json
import csv
import pandas as pd

**1.2 Scraper setup**

In [None]:
# 1. HTTP request
url = "https://zh.wikipedia.org/zh-tw/%E5%90%89%E4%BC%8A%E5%8D%A1%E5%93%87"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # stop early if the page could not be fetched
response.encoding = 'utf-8'

# 2. Parsing HTML
soup = BeautifulSoup(response.text, 'html.parser')

**1.3 Main scraping function**

In [None]:
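def extract_characters(soup):
    """Collect one record per <dl> block: name (<dt>), voice actor and description (<dd>)."""

    characters = []

    # Each character entry is assumed to be a <dl> holding one <dt> and one or more <dd>
    for dl in soup.find_all('dl'):

        dt = dl.find('dt')
        if dt is None:
            # Skip definition lists without a <dt>; they carry no character name
            continue
        name = dt.get_text(strip=True)

        # Process every <dd> in the block
        dds = dl.find_all('dd')

        voice_actor = ""
        description_parts = []

        for dd in dds:
            text = dd.get_text(strip=True)

            # '聲優' means "voice actor"
            if '聲優' in text:
                # Keep the full line; voice_actor_cln() later strips the label and footnotes
                voice_actor = text
            else:
                # Keep every description <dd> instead of only the last one
                description_parts.append(text)

        # Store the character's info
        characters.append({
            'name': name,
            'voice_actor': voice_actor,
            'description': '\n'.join(description_parts)
        })

    return characters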

**1.4 Run the scraper and save the data**

In [None]:
characters = extract_characters(soup)

with open('chiikawa_characters.json', 'w', encoding='utf-8') as f:
    json.dump(characters, f, ensure_ascii=False, indent=4)

# Check the number of characters
print(f"Number of characters: {len(characters)}")

Once the JSON file is ready, save it to [my Google Drive](https://drive.google.com/file/d/1Ze3bvDChoZw2Vu9N2lElDwFldeB3jONC/view?usp=sharing).
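
Before uploading, it can help to confirm that every scraped record actually has a name, a voice actor, and a description. The following is a minimal sketch of such a check; `find_incomplete_records` and the placeholder records are hypothetical and not part of the original notebook.

```python
import json

# Fields written by extract_characters() for every record
REQUIRED_FIELDS = ("name", "voice_actor", "description")

def find_incomplete_records(records):
    """Return (index, missing_fields) for each record with a missing or blank field."""
    problems = []
    for i, record in enumerate(records):
        missing = []
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                missing.append(field)
        if missing:
            problems.append((i, missing))
    return problems

# Placeholder records standing in for the contents of chiikawa_characters.json
sample = [
    {"name": "角色A", "voice_actor": "聲優：某某", "description": "角色A的介紹"},
    {"name": "", "voice_actor": "", "description": "沒有名字的條目"},
]

# ensure_ascii=False keeps the Chinese text readable and still round-trips cleanly
restored = json.loads(json.dumps(sample, ensure_ascii=False))
print(restored == sample)
print(find_incomplete_records(restored))
```

Records flagged here are worth inspecting by hand, since a `<dl>` on the page is not guaranteed to be a character entry.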

### 2. Create the folder

In [None]:
import os
import json
from langchain_core.documents import Document

upload_dir = "uploaded_docs"
os.makedirs(upload_dir, exist_ok=True)
print(f"Please put your .txt, .pdf, .docx, ... files in this folder: {upload_dir}")

Here I also saved the JSON file to the folder the instructor specified.

### 3. Update and import the required packages

In [None]:
!pip install -U langchain langchain-community langchain-text-splitters langchain-huggingface faiss-cpu pypdf python-docx sentence-transformers transformers

In [None]:
from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

### 4. Add the EmbeddingGemma prefixes recommended by Google

Google recommends embedding document text in the following format:

    title: {title|none} | text: ...

and embedding queries in this format:

    task: search result | query: ...
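
To make the two formats concrete, here is a small sketch that builds both strings. The helper names `format_document` and `format_query` are hypothetical; only the prefix text itself comes from the formats above.

```python
def format_document(text, title="none"):
    """Build the document-side string: 'title: <title> | text: <text>'."""
    return f"title: {title} | text: {text}"

def format_query(question):
    """Build the query-side string: 'task: search result | query: <question>'."""
    return f"task: search result | query: {question}"

print(format_document("角色介紹文字"))
print(format_document("角色介紹文字", title="chiikawa library"))
print(format_query("吉伊卡哇的聲優是誰？"))
```

Documents and queries get different prefixes, so the same raw sentence produces a different input string depending on which side of the search it is on.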

In [None]:
class EmbeddingGemmaEmbeddings(HuggingFaceEmbeddings):
    def __init__(self, **kwargs):
        super().__init__(
            model_name="google/embeddinggemma-300m",   # official model on Hugging Face
            encode_kwargs={"normalize_embeddings": True},  # common convention for retrieval
            **kwargs
        )

    def embed_documents(self, texts):
        # Document vectors: the title may be "none", or pass a file name / section title for a slight boost
        texts = [f'title: chiikawa library | text: {t}' for t in texts]
        return super().embed_documents(texts)

    def embed_query(self, text):
        # Query vectors: the retrieval-query prefix recommended by Google
        return super().embed_query(f'task: search result | query: {text}')
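
The `normalize_embeddings` setting scales every vector to unit length. A short worked example in plain Python shows why that is convenient: for unit vectors, the dot product equals the cosine similarity, and the squared L2 distance equals `2 - 2 * cosine`, so ranking by distance and ranking by cosine agree. This is a sketch with made-up two-dimensional vectors, not output from the actual model; as far as I know LangChain's FAISS wrapper uses an L2 index by default, but treat that as an assumption.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors standing in for a document embedding and a query embedding
doc_vec = l2_normalize([3.0, 4.0])
query_vec = l2_normalize([1.0, 0.0])

cosine = dot(doc_vec, query_vec)
distance = squared_l2(doc_vec, query_vec)

# For unit vectors the two quantities are tied together
print(cosine, distance, 2 - 2 * cosine)
```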

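### 5. Load the documents

Following the teaching assistant's approach, I convert the data into documents with metadata, in the hope of improving retrieval quality.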

In [None]:
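import re

def classify(name, content):
    """Assign a coarse character type from the name and the description text."""

    # Main character ('主角' = main character, '吉伊' = Chiikawa)
    if any(kw in content for kw in ['主角', '吉伊']):
        return '主角'

    # Armored characters ('盔甲' = armor)
    if '盔甲' in name:
        return '盔甲'

    # Monsters ('怪物' = monster, '奇美拉' = chimera)
    if '怪物' in content or '奇美拉' in content:
        return '怪物'

    # Everything else is a supporting character ('配角')
    return '配角'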

def extract_keywords(description):

    keywords = []

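    # Keywords stay in Chinese because they are matched against the Chinese description text
    keyword_list = [
        # Personality
        '可愛', '溫柔', '開朗', '害羞', '勇敢', '膽小',
        # Traits
        '主角', '朋友', '盔甲',
        # Food
        '草莓', '甜食', '拉麵',
        # Abilities
        '除草檢定', '證照', '討伐', '攝影', '喝酒'
    ]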

    for keyword in keyword_list:
        if keyword in description:
            keywords.append(keyword)

    return keywords

def voice_actor_cln(voice_actor):

    text = voice_actor.replace('聲優：', '').replace('聲優:', '')
    text = re.sub(r'\[\d+\]', '', text)
    text = text.strip()

    return text

def convert_to_documents(characters):

    docs = []
    for i, character in enumerate(characters, 1):
        name = character['name']
        voice_actor = voice_actor_cln(character['voice_actor'])
        content = character['description']

        if not content or not content.strip():
            continue

        # classify
        character_type = classify(name, content)
        # extract keywords
        keywords = extract_keywords(content)

        # create documents
        doc = Document(
            page_content=content,
            metadata={
                'doc_id': str(i),
                'name': name,
                'character_type': character_type,
                'voice_actor': voice_actor,
                'keywords': keywords,
                'source': 'wikipedia'
            }
        )
        docs.append(doc)

    return docs
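
`voice_actor_cln` removes footnote markers such as `[12]` from the voice actor field only. Descriptions scraped from the same page may carry the same markers, so a similar cleanup could be applied to `content` before it becomes `page_content`. The sketch below is one way to do that; `strip_footnotes` is a hypothetical helper, and the `[註 1]` note style is an assumption about how the page marks notes.

```python
import re

# Matches numeric references like [12] and, as an assumption, note markers like [註 1]
FOOTNOTE_PATTERN = re.compile(r'\[(?:\d+|註\s*\d+)\]')

def strip_footnotes(text):
    """Remove bracketed footnote markers and surrounding whitespace."""
    return FOOTNOTE_PATTERN.sub('', text).strip()

# Placeholder description text, not taken from the real page
print(strip_footnotes("喜歡拉麵[3]。[註 1]"))
```

Removing the markers keeps stray digits out of the embedded text, where they add nothing useful for retrieval.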

In [None]:
folder_path = upload_dir
documents = []
for file in os.listdir(folder_path):
    path = os.path.join(folder_path, file)
    if file.endswith(".txt"):
        loader = TextLoader(path)
        documents.extend(loader.load())
    elif file.endswith(".pdf"):
        loader = PyPDFLoader(path)
        documents.extend(loader.load())
    elif file.endswith(".docx"):
        loader = UnstructuredWordDocumentLoader(path)
        documents.extend(loader.load())
    elif file.endswith(".json"):
        with open(path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        docs = convert_to_documents(data)
        documents.extend(docs)
    else:
        continue

### 6. Build the vector database

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = splitter.split_documents(documents)
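
To see what `chunk_size=500` and `chunk_overlap=100` mean in practice, the sketch below cuts a string with a plain sliding window. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which this sketch ignores, so it illustrates only the window arithmetic. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# A 1200-character placeholder text
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)

print([len(c) for c in chunks])
# The last 100 characters of one chunk are the first 100 of the next
print(chunks[0][-100:] == chunks[1][:100])
```

A description shorter than 500 characters stays in a single chunk, so short character entries are not split at all.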

In [None]:
from google.colab import userdata
hf_token = userdata.get('HuggingFace')

In [None]:
from huggingface_hub import login
login(token=hf_token)

In [None]:
embedding_model = EmbeddingGemmaEmbeddings()
vectorstore = FAISS.from_documents(split_docs, embedding_model)

### 7. Save the vector database

In [None]:
vectorstore.save_local("faiss_db")

In [None]:
!zip -r faiss_db.zip faiss_db
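
The `!zip` command relies on the shell that Colab provides. When running somewhere the `zip` tool is unavailable, the same archive can be produced with Python's standard library. This is a minimal sketch; `zip_folder` is a hypothetical helper.

```python
import shutil
from pathlib import Path

def zip_folder(folder):
    """Create <folder>.zip next to the folder and return the archive path."""
    folder = Path(folder).resolve()
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")
    # base_dir keeps the folder name as the top-level entry inside the archive
    archive = shutil.make_archive(
        str(folder), "zip", root_dir=folder.parent, base_dir=folder.name
    )
    return Path(archive)

# Only runs when the folder written by vectorstore.save_local() exists
if Path("faiss_db").is_dir():
    print(zip_folder("faiss_db"))
```

Like `zip -r faiss_db.zip faiss_db`, the archive unpacks into a `faiss_db` folder, not loose files.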

In [None]:
print("✅ The compressed vector database has been saved as 'faiss_db.zip'. Please download this file as a backup.")