# 🤖💼 Capstone Project: Taiwan Enterprise QA System（台灣企業內部智能問答系統）
## 🔍 Project Introduction（專案簡介）

### 👉 Blog post: https://hackmd.io/zmmxNnCrQMWTchRSEE63Mg
### 👉 YouTube video: https://youtu.be/66rlfSux6OU

This project is the capstone submission for the Gen AI Intensive Course (Q1 2025), co-hosted by Google and Kaggle. The core of this project is a prototype of a Generative AI-powered internal Q&A assistant, specifically designed for a Taiwan-based company. The system aims to assist employees in efficiently accessing information related to HR, finance, and IT policies, thereby reducing the repetitive workload on administrative staff.

Built on the Retrieval-Augmented Generation (RAG) architecture, this solution integrates the Google Gemini API with ChromaDB, enabling the assistant to retrieve the most relevant content from internal documents and generate accurate, semantically coherent responses in Traditional Chinese. This project demonstrates the practical potential and creative applications of LLMs in internal business environments in Taiwan.

---

本專案為參加 2025 年第一季 Google x Kaggle《Gen AI Intensive》課程之結業專題。專案核心為一個以生成式 AI 技術為基礎，專為台灣公司設計的「內部問答機器人原型」，旨在協助企業內部員工更有效率地查詢人資、財務與資訊部門的相關制度資訊，減少重複性行政詢答所耗費的人工成本與時間。

本系統整合了 Google Gemini API 與 ChromaDB 向量資料庫，實現 Retrieval-Augmented Generation（RAG）架構，能根據使用者問題，自內部文件中擷取最相關段落，再搭配大語言模型生成精準、語意連貫的中文回答。透過此專案，展示了 LLM 在繁體中文企業內部應用上的可行性與創意潛力。

## ⚙️ Technologies Used（使用技術）

This project demonstrates **three official Gen AI capabilities** as defined in the capstone criteria:

1. **Document Understanding**  
　The system loads and parses internal `.docx` files (HR, Finance, IT), segments them into meaningful paragraphs, and prepares them for semantic embedding. This simulates real-world enterprise documents and enables accurate, context-aware retrieval.

2. **Embeddings & Vector Database (Vector Search)**  
　Using the `text-embedding-004` model from the Google Gemini API, each document chunk is embedded into semantic vectors and stored in ChromaDB, a vector database. Queries are also embedded and matched using vector similarity search.

3. **Retrieval-Augmented Generation (RAG)**  
　When a user asks a question, the system retrieves the most relevant document chunks and injects them into a prompt. Gemini then generates answers in Traditional Chinese that are grounded in the retrieved text, which helps keep responses aligned with company policies.  
　To keep answers stable and consistent, the model runs with a fixed `temperature=0`, which makes decoding close to deterministic. This lowers, but does not eliminate, the risk of hallucinated details in enterprise use cases.

These three capabilities form a complete RAG-based Q&A system for internal enterprise use. The assistant demonstrates the practical integration of LLMs into real-world workflows and offers a foundation for future expansion.

---

本專案依照官方定義，實作並展示了 **三項指定的 Gen AI 能力**：

1. **文件理解（Document Understanding）**  
　系統讀取 `.docx` 格式的公司內部規章文件（人資、財務、資訊），並切分為段落語意單位，作為語意向量化與檢索的基礎，模擬企業真實文件應用場景。

2. **語意向量與向量資料庫（Embeddings & Vector Search）**  
　使用 Gemini 的 `text-embedding-004` 模型將每段內容轉為語意向量，並儲存在 ChromaDB 資料庫中。使用者提問後，系統透過語意相似度進行段落匹配與檢索。

3. **檢索增強生成（Retrieval-Augmented Generation, RAG）**  
　針對每個問題，系統自資料庫找出最相關段落，插入 prompt 中，並由 Gemini 模型產生結合文件知識的回覆，保證回覆語意連貫且符合公司制度。  
　此外，系統設定固定低溫度（`temperature=0`）以強化輸出一致性並降低幻覺風險，提升企業環境下的信賴度與可控性。

這三項技術整合形成一個可實際應用的企業內部問答解決方案，展示生成式 AI 技術在繁體中文場域的落地潛力與擴展性。
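
To make the role of `temperature` more concrete, the sketch below applies temperature-scaled softmax to a few made-up token scores. The scores and the helper name `softmax_with_temperature` are illustrative assumptions, not values or functions from Gemini. The only point is that as the temperature approaches zero, probability mass concentrates on the top-scoring token, which is why `temperature=0` is generally treated as greedy, near-deterministic decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities for the given logits at a sampling temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.1, 0.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A more deterministic output is not the same as a more correct one. In this system, factual grounding comes from the retrieved passages, while the temperature setting mainly keeps repeated runs consistent.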

## 🔄 System Workflow & Architecture（系統流程與架構）

This project follows a modular architecture based on the Retrieval-Augmented Generation (RAG) pipeline. The full system flow includes document ingestion, embedding generation, vector indexing, similarity search, and response generation. The detailed steps are as follows:

1. Upload `.docx` files containing internal policy documents (e.g., HR, Finance, IT)
2. Segment documents into smaller text chunks (typically at paragraph level)
3. Convert each chunk into a semantic vector using `text-embedding-004` (Gemini embedding model)
4. Store the vectors in ChromaDB to enable semantic similarity search
5. User submits a natural language query
6. The system embeds the query and retrieves the top-N most relevant chunks
7. Retrieved chunks are inserted into a prompt and passed to `generate_content()` for response generation
8. A fluent and contextually grounded answer is returned in Traditional Chinese

---

本系統採用 RAG（檢索增強生成）架構，流程模組化、可擴展，涵蓋從資料前處理到回答生成的完整鏈條。具體步驟如下：

1. 上傳包含公司內部規章的 `.docx` 文件（如人資、財務、資訊等）
2. 將文件切分為段落單位的文字區塊
3. 使用 Gemini 的 `text-embedding-004` 模型將每段轉為語意向量
4. 將向量儲存於 ChromaDB 向量資料庫中，便於後續語意檢索
5. 使用者輸入自然語言問題作為查詢
6. 系統將問題向量化，並從資料庫中找出最相關的 N 段文字
7. 將檢索段落插入 prompt 中，送入 Gemini `generate_content()` 模型進行回答生成
8. 最終回傳一則流暢、依據文件內容生成的繁體中文回答
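
The eight steps above can be sketched end to end in plain Python. This is a toy illustration only: `toy_embed` is a character-bigram stand-in for `text-embedding-004`, the in-memory list stands in for ChromaDB, and the sample chunks are made up. The helper names (`toy_embed`, `cosine`, `retrieve`, `assemble_prompt`) are assumptions introduced for this example, and the final generation call is left out.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: character-bigram counts (NOT text-embedding-004)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_n=3):
    """Steps 5-6: embed the query and return the top_n most similar chunks."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_n]

def assemble_prompt(query, passages):
    """Step 7: put the question and the retrieved passages into one prompt."""
    lines = [f"Question: {query}"]
    lines += [f"Passage: {p}" for p in passages]
    return "\n".join(lines)

# Steps 1-4: made-up chunks standing in for parsed and indexed .docx paragraphs.
chunks = [
    "Sick leave requests must be submitted in the HR portal.",
    "Travel expenses are reimbursed after manager approval.",
    "VPN issues should be reported to the IT helpdesk.",
]
top = retrieve("How do I request sick leave?", chunks, top_n=1)
print(assemble_prompt("How do I request sick leave?", top))
```

Swapping `toy_embed` for a real embedding call and the list for a vector database changes the quality of the ranking but not the shape of the pipeline.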

## 💡 Use Cases & Innovation Highlights（應用場景與創意亮點）

This internal Q&A assistant was developed to address a real pain point commonly found in Taiwan-based companies: the repetitive administrative questions faced by HR, IT, and Finance departments. Employees often inquire about topics such as leave policies, reimbursement procedures, account permissions, and software installations—questions that are typically answered manually by staff over and over again.

The assistant enables employees to ask natural language questions and receive document-grounded answers instantly, improving response speed and reducing the burden on administrative teams. This boosts operational efficiency while freeing up staff to focus on higher-value tasks.

---
 
本問答機器人專案聚焦於台灣企業中常見的行政痛點：來自員工的大量重複性詢問，經常壓垮人資、資訊與財務部門的工作效率。例如請假制度、報帳流程、系統帳號權限、軟體安裝方式等問題，皆需仰賴人工反覆說明與回覆。

本系統可讓員工以自然語言提問，並立即取得來自公司內部文件的準確回答，不僅提升查詢效率，也有效降低行政部門的工作負擔，使其能專注於更具策略性的任務上。

## 🔚 Conclusion & Future Directions（結語與未來展望）

This project demonstrates a practical and scalable application of Generative AI technologies in a real-world enterprise setting. From embedding document knowledge to generating stable, policy-aligned responses in Traditional Chinese, the system successfully integrates Gemini's generative capabilities with ChromaDB's retrieval strength in a full RAG pipeline.

Through this hands-on experience, I gained valuable insights into prompt design, vector database management, and building reliable Q&A workflows with LLMs. More importantly, I learned how to align AI system design with actual user needs in a business environment.

Looking ahead, this system has strong potential to be deployed within a real organization, either as a chatbot widget on internal portals or integrated with existing employee helpdesk systems. Future improvements include multi-turn conversations, a voice interface, support for additional document formats (e.g., PDFs, Google Docs), and **FAQ shortcut buttons** to improve accessibility and ease of use for employees.

---

本專案成功展現了生成式 AI 技術在企業內部場域的實用性與可擴展性。透過向量化內部文件內容，搭配穩定生成的繁體中文回覆，系統完整實現了以 Gemini 與 ChromaDB 為核心的 RAG 架構，並能對應企業制度問答需求。

在本次實作中，我實際操作了 prompt 設計、向量資料庫建構與 LLM 應用流程，也學會了如何從使用者角度出發，設計出貼近需求的 AI 系統。

此系統具備導入企業實務的潛力，例如可作為內部入口網站的智慧客服，或串接既有的人力資源與資訊服務系統。功能上則可進一步擴充：多輪對話、語音查詢介面、更多文件格式支援（如 PDF、Google 文件），以及**FAQ 快速查詢按鈕**等輔助功能，提升使用體驗與資訊可近性。

## ⚙️ Setup（環境安裝）

### 📦 Install dependencies（安裝依賴套件）

This notebook uses the following libraries:

- `google-genai==1.7.0`: Gemini API client for embeddings and text generation  
- `chromadb==0.6.3`: Vector database for semantic search  
- `python-docx`: For parsing internal `.docx` documents  
- `protobuf==3.20.3`, `google-api-core==2.11.1`: Compatible versions to avoid dependency conflicts

To prevent installation errors or warnings caused by background packages  
(such as `google-cloud-bigtable`, `automl`, or `pandas-gbq`),  
we first uninstall these **unused but pre-installed** packages.  
This ensures a smooth setup with **no pip errors or warnings**.

---

本 Notebook 使用以下核心套件：

- `google-genai==1.7.0`：呼叫 Gemini API 生成回答  
- `chromadb==0.6.3`：用於語意搜尋的向量資料庫  
- `python-docx`：讀取公司內部 Word 文件  
- `protobuf==3.20.3`, `google-api-core==2.11.1`：穩定相容版本，避免套件依賴衝突

由於 Kaggle 環境中預設安裝了一些非本專案使用的 Google 套件（如 `pandas-gbq`, `bigtable`, `automl` 等），  
這些套件會對 `protobuf` 或 `google-api-core` 有不同的版本需求，導致安裝時出現錯誤或警告。

因此我們先移除這些不必要的預設套件，再安裝專案真正需要的相容版本，確保安裝過程**零錯誤、零衝突**，提交時更安心。

In [None]:
# 🧹 Step 1: Remove preinstalled packages that may cause dependency conflicts
!pip uninstall -qqy jupyterlab kfp protobuf google-api-core tensorflow \
                   google-cloud-bigtable google-cloud-automl pandas-gbq

# ✅ Step 2: Install the pinned, mutually compatible versions this project needs
!pip install -qU \
    google-genai==1.7.0 \
    chromadb==0.6.3 \
    python-docx \
    protobuf==3.20.3 \
    google-api-core==2.11.1
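
After installation, it can be useful to confirm that the environment actually matches the pins. The following is a minimal sketch using only the standard library; the helper name `find_version_mismatches` is an assumption introduced here, and the pin values are copied from the install cell above.

```python
from importlib import metadata

def find_version_mismatches(pins):
    """Return {package: installed_version_or_None} for every pin that is not satisfied."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = installed
    return mismatches

# Pins copied from the install cell above.
PINS = {
    "google-genai": "1.7.0",
    "chromadb": "0.6.3",
    "protobuf": "3.20.3",
    "google-api-core": "2.11.1",
}
print(find_version_mismatches(PINS) or "All pinned versions match.")
```

An empty result means every pinned package is present at the expected version. A `None` value means the package is missing entirely.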

### 📥 Import Gemini SDK（匯入 Gemini SDK 並確認版本）
  
We import the core modules from the `google.genai` library and display the installed version to ensure correct API usage.

匯入 Gemini API 的核心模組，並顯示目前安裝版本，以確保後續使用的 API 正確性。

In [None]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

### 🔐 Load Gemini API Key（從 Kaggle Secret 載入 Gemini 金鑰）
  
To authenticate access to Gemini API, the API key is retrieved securely from Kaggle Secrets. This avoids exposing the key in plain text.
 
為了安全地使用 Gemini API，本專案透過 Kaggle Secrets 取得 API 金鑰，避免將金鑰明文寫入程式中。

In [None]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
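
Outside Kaggle, the `kaggle_secrets` module is not available, so the cell above fails on import. One possible workaround, sketched below, is to fall back to an environment variable. The helper name `load_api_key` and the choice of an environment variable with the same name as the secret are assumptions for illustration, not part of the original notebook.

```python
import os

def load_api_key(name="GOOGLE_API_KEY"):
    """Read a secret from Kaggle Secrets if available, else from the environment."""
    try:
        from kaggle_secrets import UserSecretsClient  # only importable on Kaggle
    except ImportError:
        key = os.environ.get(name)
    else:
        key = UserSecretsClient().get_secret(name)
    if not key:
        raise RuntimeError(f"Secret {name!r} was not found.")
    return key
```

With this helper, `GOOGLE_API_KEY = load_api_key()` would work both on Kaggle and on a local machine where the variable has been exported, while still keeping the key out of the notebook source.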

## 📄 Load Internal Documents（載入公司內部文件）
  
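This step loads three internal `.docx` files related to HR, Finance, and IT policies using the `python-docx` library. For each file, the non-empty paragraphs are stripped of surrounding whitespace and joined into a single string, and the three resulting strings are collected into a list for later processing.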
 
The three internal documents (HR, Finance, and IT) used in this project are mock data manually created for testing purposes.  
The content was designed to simulate realistic company policy documents while avoiding the use of any sensitive or proprietary information.

使用 `python-docx` 套件載入三份內部文件（人資、財務、資訊），並將每份文件中的段落整理為純文字，儲存於 list 供後續處理使用。

本專案中使用的三份公司內部文件（人資、財務、資訊）皆為為測試目的所自製的模擬資料。  
內容設計模擬真實公司制度說明，並未涉及任何敏感或真實商業資訊。

In [None]:
from docx import Document

def load_docx_text(path):
    """Return the non-empty paragraphs of a .docx file joined by newlines."""
    doc = Document(path)
    return "\n".join(para.text.strip() for para in doc.paragraphs if para.text.strip())

# Load the three policy documents (one string per document)
hr_doc = load_docx_text("/kaggle/input/company-hr-qa/HR_QA.docx")
finance_doc = load_docx_text("/kaggle/input/company-finance-qa/Finance_QA.docx")
it_doc = load_docx_text("/kaggle/input/company-it-qa/IT_QA.docx")

# Collect them in a list
documents = [hr_doc, finance_doc, it_doc]

# Preview the first 300 characters of each document
for i, doc in enumerate(documents):
    print(f"Document {i+1} preview:\n{doc[:300]}\n{'-'*40}")
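
The loading cell keeps each file as one long string, so the vector store later receives three entries in total. If finer-grained retrieval is wanted, the newline-joined text can be split back into paragraph-sized chunks. The sketch below shows one possible approach; `chunk_paragraphs` and the `max_chars` budget are assumptions introduced for illustration, and the sample text is made up.

```python
def chunk_paragraphs(text, max_chars=300):
    """Group consecutive non-empty lines into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = paragraph      # start a new one with this paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Q1: How do I request leave?\nA1: Use the HR portal.\nQ2: Who approves it?\nA2: Your manager."
for i, chunk in enumerate(chunk_paragraphs(sample, max_chars=60)):
    print(i, repr(chunk))
```

Applied to `hr_doc`, `finance_doc`, and `it_doc`, the resulting chunks could each be stored under their own id (for example a department prefix plus an index), so that a query retrieves only the most relevant passages instead of whole documents.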

## 🤖 Initialize Gemini Client（初始化 Gemini 並列出支援模型）

Create a Gemini client instance and list available models that support `embedContent`. This ensures we are using a model compatible with the embedding task.

初始化 Gemini 用戶端，並列出支援 `embedContent` 功能的模型，以確認所選模型可執行語意向量生成任務。

In [None]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)

## 🧠 Define GeminiEmbeddingFunction（定義嵌入函式供 ChromaDB 使用）
 
Define a custom class `GeminiEmbeddingFunction` that uses Gemini’s `text-embedding-004` model to generate embeddings. Calls are retried automatically when the API returns HTTP 429 (quota exceeded) or 503 (service unavailable).

自訂一個嵌入函式 `GeminiEmbeddingFunction`，使用 Gemini 模型 `text-embedding-004` 進行語意向量生成，並加入自動重試機制，以處理配額錯誤。

In [None]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Retry predicate: retry on HTTP 429 (quota exceeded) and 503 (service unavailable).
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]
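
To illustrate what a predicate-based retry does conceptually, here is a small standalone sketch with exponential backoff. It is not how `google.api_core.retry.Retry` is implemented, and that class has its own defaults for delays and timeouts. The names `call_with_retry`, `QuotaError`, and `flaky` are hypothetical and exist only for this example.

```python
import time

def call_with_retry(fn, is_retriable, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a retriable error wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_try = attempt == max_attempts - 1
            if last_try or not is_retriable(exc):
                raise  # give up: out of attempts, or not a transient error
            sleep(base_delay * 2 ** attempt)

class QuotaError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status code."""
    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code

attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise QuotaError(429)  # fail twice, then succeed
    return "ok"

result = call_with_retry(
    flaky,
    lambda e: getattr(e, "code", None) in {429, 503},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result, attempts["n"])
```

The predicate plays the same role as `is_retriable` in the notebook: it decides which failures are transient and worth another attempt, while every other error is raised immediately.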

## 🗃️ Initialize Vector Store（初始化向量資料庫）
 
Set up a ChromaDB vector database with the custom Gemini embedding function. Add the internal documents to the collection, ready for similarity-based retrieval.
 
使用先前定義的 Gemini 嵌入函式初始化 ChromaDB，並將三份內部文件加入資料庫，以利後續語意相似度查詢。

In [None]:
import chromadb

DB_NAME = "googlecardb"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

db.add(documents=documents, ids=[str(i) for i in range(len(documents))])

## 💬 Generate Answer from User Query（根據使用者提問產生回覆）

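This part switches the embedding function into query mode so that user questions are embedded with the `retrieval_query` task type and used for vector search. The system requests the top 3 most relevant entries from ChromaDB (`n_results=3`) and injects them into a carefully designed prompt. Because the collection built above holds exactly three entries, one per document, this returns all three documents ranked by similarity.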

The prompt guides Gemini to act as a helpful assistant, replying in clear and friendly Traditional Chinese suitable for non-technical employees and then appending an English translation. If the retrieved content does not cover the question, the model is instructed to fall back on its general knowledge.

此部分將嵌入函式切換為「查詢模式」，使使用者的問題也能進行語意嵌入與檢索。系統從資料庫中取出最相關的三段內部文件內容，並將其整合至精心設計的提示詞（prompt）中。

提示內容引導 Gemini 扮演一位親切、知識豐富的助理，以自然流暢的【台灣繁體中文】回答問題，適合非技術背景的員工閱讀。若文件無相關資訊，則允許 Gemini 回復一般知識型答案。

### 🧪 Test Four Query Scenarios（測試四種查詢情境）

To demonstrate the system's ability to handle different query types, we run four representative questions:

- An HR-related question (document-supported)
- A finance-related question (document-supported)
- An IT-related question (document-supported)
- A general question outside document scope (fallback to Gemini knowledge)

Each question is used to perform vector search, and the retrieved passages are passed into the Gemini prompt to generate the final answer in Traditional Chinese.

本段示範系統如何處理四種常見問題類型：

- 人事部門相關（可由文件回答）
- 財務制度相關（可由文件回答）
- 資訊作業相關（可由文件回答）
- 公司文件範圍外問題（由 Gemini 自有知識生成）

每個問題皆經由語意檢索找出相關段落，並傳入 prompt 中請 Gemini 產生最終回應。

In [None]:
# 🧪 Four test queries (Chinese query + English description)
test_questions = [
    ("如何請病假", "HR-related question: How to apply for sick leave"),
    ("如何報差旅費", "Finance-related question: How to file travel expense reimbursement"),
    ("VPN無法連線", "IT-related question: VPN connection issue"),
    ("Excel 的 SUM 函數怎麼寫？", "Out-of-scope question: How to write the SUM function in Excel?")
]

# Set to query mode for question embedding
embed_fn.document_mode = False

# Iterate over each test query
for query_zh, label_en in test_questions:
    print(f"\n🔎 {label_en}\n❓ 中文問題：{query_zh}")
    
    result = db.query(query_texts=[query_zh], n_results=3)
    [all_passages] = result["documents"]
    
    query_oneline = query_zh.replace("\n", " ")

    # Prompt includes instruction to return both Chinese and English
    prompt = f"""你是一個親切且知識豐富的 AI 助理，會根據公司內部文件內容來回答問題。
請以【台灣繁體中文】回答，語氣要自然、清楚，適合給非技術背景的一般員工閱讀。
請務必用完整句子回答問題，內容要詳細，若有背景資料可以一起補充說明。
請在中文回答後，**附上對應的英文翻譯版本**。
如果提供的段落跟問題無關，你可以忽略那些段落。
如果使用者問的是公司內部文件內容以外的問題，請以原本 Gemini 的數據生成回答，並同樣提供中英文版本。

You are a helpful and knowledgeable AI assistant. Please answer the following question based on internal company documents.
Respond in **Traditional Chinese**, using clear and friendly language suitable for non-technical employees.
Please use complete sentences and provide detailed answers. If helpful, include relevant background information.
**After the Traditional Chinese response, please provide an English translation of your answer.**
If any retrieved content is irrelevant, you may ignore it.
If the question is beyond the scope of internal documents, answer as Gemini normally would, and still provide both Chinese and English versions.

問題 (Question)：{query_oneline}
"""


    # Append retrieved document passages
    for passage in all_passages:
        passage_oneline = passage.replace("\n", " ")
        prompt += f"文件段落：{passage_oneline}\n"

    # Generate answer using Gemini
    low_temp_config = types.GenerateContentConfig(temperature=0)
    answer = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        config=low_temp_config
    )


    display(Markdown(f"💡 Gemini 回答（Answer）：\n\n{answer.text}\n\n{'-'*80}"))
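
The same prompt text is assembled in both the test loop and the interactive section. One way to avoid that duplication is a small helper, sketched below. The function name `build_rag_prompt`, its parameters, and the shortened instruction string are assumptions for illustration. Note also that this version collapses all runs of whitespace, which is slightly stricter than the notebook's `replace("\n", " ")`.

```python
def build_rag_prompt(instructions, query, passages, passage_label="文件段落 (Document passage)"):
    """Combine fixed instructions, the user question, and retrieved passages into one prompt."""
    def one_line(text):
        return " ".join(text.split())  # collapse newlines and repeated whitespace

    parts = [instructions.strip(), "", f"問題 (Question)：{one_line(query)}"]
    parts += [f"{passage_label}：{one_line(p)}" for p in passages]
    return "\n".join(parts) + "\n"

demo = build_rag_prompt(
    instructions="Answer in Traditional Chinese, then add an English translation.",
    query="如何請病假",
    passages=["Hypothetical passage one.\nSecond line.", "Hypothetical passage two."],
)
print(demo)
```

Keeping the instructions in one place also makes it easier to adjust the wording once and have every query path pick up the change.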

### 🧪 User Input Demo Section（使用者互動輸入區塊）

In the previous section, we demonstrated responses to four common office-related questions to showcase the system’s capability.
This section is designed to allow users to input their own question for a more personalized and interactive experience.

⚠️ Note: Calling Python's `input()` in a Kaggle Notebook raises an error during “Save Version” or any other non-interactive run, which interrupts execution.
To avoid this, the code sets a default question, `"請問要怎麼請病假？"`, as a fallback so that the notebook runs to completion.

🔧 If you’d like to enter your own question, uncomment the line below by removing the `#`:
```python
#query = input("請輸入您的問題：")
```
前面我們示範了 **四種常見辦公室問題** 的查詢結果，讓大家快速了解系統的功能。  
本區塊則是設計讓使用者可以 **自行輸入問題**，體驗更個人化的互動回應。

⚠️ **提醒**：由於 Kaggle Notebook 在儲存（Save Version）或執行整份筆記時，無法處理 `input()` 函式，會導致程式出現 `ERROR` 並中斷執行。  
因此在程式中，我們預設放入了一個範例問題 `"請問要怎麼請病假？"` 作為備用，確保 Notebook 可以順利執行。

🔧 **如果您希望輸入自己的問題**，請將下列程式碼中的註解解除（刪除前面的 `#`）即可啟用互動輸入模式：
```python
#query = input("請輸入您的問題：")
```

In [None]:
# Switch to query mode for embedding user questions (vs. document embedding)
embed_fn.document_mode = False

# 💬 預設查詢問題（Kaggle notebook 中使用）
# For Kaggle notebook auto-execution: preset a default query
# 📌 這樣設計是為了避免使用 input() 導致 notebook 無法自動執行並出現錯誤
# 📌 This avoids using input() which causes errors during auto-execution in Kaggle notebooks
query = "請問要怎麼請病假？"  # Default query for demo purposes

# ❗ 如果您希望自行輸入問題，請取消下方 input() 這行的註解
# ❗ To input your own question, uncomment the line below
#query = input("請輸入您的問題：")  # Please enter your question (in Traditional Chinese)


# Perform semantic search in ChromaDB using the input query
result = db.query(query_texts=[query], n_results=3)
[all_passages] = result["documents"]

# Sanitize query by removing newline characters
query_oneline = query.replace("\n", " ")

# Build a bilingual prompt for Gemini model
prompt = f"""你是一個親切且知識豐富的 AI 助理，會根據公司內部文件內容來回答問題。
請以【台灣繁體中文】回答，語氣要自然、清楚，適合給非技術背景的一般員工閱讀。
請務必用完整句子回答問題，內容要詳細，若有背景資料可以一起補充說明。
請在中文回答後，**附上對應的英文翻譯版本**。
如果提供的段落跟問題無關，你可以忽略那些段落。
如果使用者問的是公司內部文件內容以外的問題，請以原本 Gemini 的數據生成回答，並同樣提供中英文版本。

You are a helpful and knowledgeable AI assistant. Please answer the following question based on internal company documents.
Respond in **Traditional Chinese**, using clear and friendly language suitable for non-technical employees.
Please use complete sentences and provide detailed answers. If helpful, include relevant background information.
**After the Traditional Chinese response, please provide an English translation of your answer.**
If any retrieved content is irrelevant, you may ignore it.
If the question is beyond the scope of internal documents, answer as Gemini normally would, and still provide both Chinese and English versions.

問題 (Question)：{query_oneline}
"""


# Append retrieved document passages to the prompt
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"文件段落 (Document passage)：{passage_oneline}\n"

# Send the prompt to Gemini model and generate an answer
low_temp_config = types.GenerateContentConfig(temperature=0)
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt,
    config=low_temp_config
)

# Display the result as Markdown-formatted output
Markdown(answer.text)
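
As an alternative to commenting and uncommenting the `input()` line, the fallback can be handled in code. The sketch below assumes that a non-interactive run surfaces the failure as `EOFError`, `NotImplementedError`, or `OSError` (IPython's `StdinNotImplementedError` is understood to derive from `NotImplementedError`). If Kaggle raises a different exception type, the `except` clause would need to be widened. The helper name `ask_or_default` and its `reader` parameter are hypothetical.

```python
def ask_or_default(prompt_text, default, reader=input):
    """Return the user's typed question, or default when no input is available."""
    try:
        typed = reader(prompt_text).strip()
    except (EOFError, NotImplementedError, OSError):
        # Non-interactive run: stdin is closed or the frontend refuses input requests.
        return default
    return typed or default
```

A call such as `query = ask_or_default("請輸入您的問題：", "請問要怎麼請病假？")` would then prompt the user in an interactive session and quietly use the default question otherwise. The `reader` parameter exists mainly so the behaviour can be exercised without a real keyboard.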