# Project Walkthrough: Intelligent Question-Answering System (QA System with RAG)

**Project type**: Retrieval-Augmented Generation (RAG)
**Difficulty**: ⭐⭐⭐⭐⭐ Expert
**Estimated time**: 5-6 hours
**Tech stack**: Hugging Face, FAISS, LangChain, Sentence Transformers

---

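## 📚 Learning Objectives

After completing this project, you will be able to:

1. ✅ Understand the core architecture of a question-answering system
2. ✅ Apply Retrieval-Augmented Generation (RAG)
3. ✅ Build a vector index with FAISS
4. ✅ Integrate LangChain to build a complete QA system
5. ✅ Deploy a production-grade QA assistant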

---

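## Part 1: RAG Architecture Overview

### 1.1 What Is RAG?

**RAG (Retrieval-Augmented Generation)** combines two steps:
- **Retrieval**: find relevant information in a knowledge base
- **Generation**: produce an answer based on the retrieved results

### 1.2 RAG vs. Traditional QA

| Approach | Knowledge source | Accuracy | Scalability | Cost |
|------|---------|--------|----------|------|
| **Traditional QA** | Model parameters | Medium | Low (requires retraining) | High (training cost) |
| **RAG** | External knowledge base | High | High (updated dynamically) | Low (retrieval only) |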

### 1.3 System Architecture

```
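User question
    ↓
Question encoding (Sentence Transformer)
    ↓
Vector retrieval (FAISS)
    ├── Find the K most relevant documents
    └── Rank by vector distance (L2 in this notebook)
    ↓
Context construction
    ├── Merge the retrieved documents
    └── Add them to the prompt
    ↓
Answer generation (LLM)
    ├── Generate from the context
    └── Cite sources
    ↓
Answer output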
```
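
The context-construction step in the diagram can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: `build_prompt` is a hypothetical helper, and the prompt wording is an assumption. It numbers each retrieved passage so that a generator could cite its sources.

```python
def build_prompt(question, retrieved_docs):
    """Assemble a grounded prompt from retrieved passages (hypothetical helper)."""
    # Number each passage so the model can cite it as [1], [2], ...
    context_lines = [
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite the passage numbers you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs = [
    "BERT stands for Bidirectional Encoder Representations from Transformers.",
    "GPT is an autoregressive language model developed by OpenAI.",
]
prompt = build_prompt("What is BERT?", docs)
print(prompt)
```

The resulting string would be passed to whichever generator is in use; the extractive pipeline in Part 5.1 takes the question and context as separate arguments instead.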

## Part 2: Environment Setup

In [None]:
# Install the required packages
# !pip install transformers sentence-transformers faiss-cpu langchain -q

import transformers
import sentence_transformers
import faiss
import langchain

print(f"✅ Transformers: {transformers.__version__}")
print(f"✅ Sentence-Transformers: {sentence_transformers.__version__}")
print(f"✅ FAISS: {faiss.__version__}")
print(f"✅ LangChain: {langchain.__version__}")

## Part 3: Building the Knowledge Base

### 3.1 Preparing the Knowledge Documents

In [None]:
# Demo knowledge base (NLP-related facts)
knowledge_base = [
    "Natural Language Processing (NLP) is a field of AI that focuses on the interaction between computers and human language.",
    "Transformers are a type of neural network architecture introduced in the paper 'Attention is All You Need' in 2017.",
    "BERT stands for Bidirectional Encoder Representations from Transformers, developed by Google in 2018.",
    "GPT (Generative Pre-trained Transformer) is an autoregressive language model developed by OpenAI.",
    "Word embeddings are dense vector representations of words that capture semantic relationships.",
    "Tokenization is the process of splitting text into smaller units called tokens.",
    "Named Entity Recognition (NER) identifies and classifies entities like person names, organizations, and locations.",
    "Sentiment analysis determines the emotional tone of a text, such as positive, negative, or neutral.",
    "The attention mechanism allows models to focus on relevant parts of the input when generating output.",
    "Transfer learning in NLP involves pre-training a model on large datasets and fine-tuning it for specific tasks.",
    "LSTM (Long Short-Term Memory) networks are a type of RNN designed to handle long-term dependencies.",
    "Hugging Face is a popular platform for sharing and using pre-trained NLP models.",
    "TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate word importance.",
    "Machine translation automatically converts text from one language to another.",
    "Text summarization condenses long documents into shorter versions while preserving key information."
]

print(f"✅ Knowledge base contains {len(knowledge_base)} documents")
print("\nPreview of the first 3:")
for i, doc in enumerate(knowledge_base[:3], 1):
    print(f"{i}. {doc[:80]}...")

### 3.2 Embedding the Documents

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the Sentence Transformer model
embedding_model_name = "all-MiniLM-L6-v2"  # small and fast

print(f"Loading embedding model: {embedding_model_name}")
embedding_model = SentenceTransformer(embedding_model_name)
print("✅ Model loaded")

# Embed the knowledge base
print("\nEmbedding the knowledge base...")
document_embeddings = embedding_model.encode(
    knowledge_base,
    show_progress_bar=True,
    convert_to_numpy=True
)

print("✅ Embedding complete")
print(f"   Documents: {document_embeddings.shape[0]}")
print(f"   Vector dimension: {document_embeddings.shape[1]}")

### 3.3 Building the FAISS Index

In [None]:
import faiss

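# Build the FAISS index
dimension = document_embeddings.shape[1]  # embedding dimension

# Exact (brute-force) search with L2 distance; IndexFlatIP (inner product) is an alternative
index = faiss.IndexFlatL2(dimension)

# Add the vectors to the index (FAISS expects float32)
index.add(document_embeddings.astype('float32'))

print("✅ FAISS index built")
print(f"   Documents in index: {index.ntotal}")
print(f"   Vector dimension: {dimension}")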
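
`IndexFlatL2` reports squared L2 distances, where lower means more similar, while the architecture overview talks about cosine similarity. The two agree when the embeddings have unit length (for example, when `normalize_embeddings=True` is passed to `encode`), because squared L2 distance then equals `2 - 2 * cosine`. The pure-Python sketch below checks that identity on two made-up vectors; the helper names are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 0.5, 1.0]

lhs = squared_l2(normalize(a), normalize(b))  # distance on unit vectors
rhs = 2 - 2 * cosine(a, b)                    # the identity's right-hand side
print(lhs, rhs)
```

Under that assumption, ranking by ascending L2 distance and ranking by descending cosine similarity give the same order.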

## Part 4: Implementing Retrieval

### 4.1 Semantic Search

In [None]:
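def semantic_search(query, top_k=3):
    """
    Semantic search: find the most relevant documents.

    Args:
        query: the question to search for
        top_k: number of results to return

    Returns:
        List of dicts with keys 'document', 'score' (squared L2 distance;
        lower means more similar), and 'index'
    """
    # Encode the query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)

    # Search the index
    distances, indices = index.search(query_embedding.astype('float32'), top_k)

    # Collect the results
    results = []
    for idx, distance in zip(indices[0], distances[0]):
        if idx < 0:  # FAISS pads with -1 when the index holds fewer than top_k vectors
            continue
        results.append({
            'document': knowledge_base[idx],
            'score': float(distance),
            'index': int(idx)
        })

    return results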


# Test retrieval
test_queries = [
    "What is BERT?",
    "How does attention mechanism work?",
    "Explain word embeddings"
]

print("🔍 Semantic search test\n")
for query in test_queries:
    print(f"Query: {query}")
    results = semantic_search(query, top_k=2)

    for i, result in enumerate(results, 1):
        print(f"  {i}. (distance: {result['score']:.4f})")
        print(f"     {result['document'][:100]}...")
    print()

## Part 5: RAG Question Answering

### 5.1 Basic RAG Implementation

In [None]:
from transformers import pipeline

# Load the extractive question-answering model
qa_pipeline = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad"
)

def rag_qa(question, top_k=3):
    """
    RAG question answering

    Pipeline:
    1. Retrieve relevant documents
    2. Merge them into a context
    3. Extract the answer from the context
    """
    # Step 1: retrieve relevant documents
    retrieved_docs = semantic_search(question, top_k=top_k)

    # Step 2: build the context
    context = " ".join([doc['document'] for doc in retrieved_docs])

    # Step 3: extract the answer span
    result = qa_pipeline(
        question=question,
        context=context
    )

    return {
        'answer': result['answer'],
        'confidence': result['score'],
        'context': context,
        'retrieved_docs': retrieved_docs
    }


# Test RAG QA
questions = [
    "What is BERT?",
    "Who developed GPT?",
    "What is the purpose of tokenization?"
]

print("🤖 RAG QA system test\n")
print("=" * 70)

for q in questions:
    result = rag_qa(q)

    print(f"\n❓ Question: {q}")
    print(f"✅ Answer: {result['answer']}")
    print(f"   Confidence: {result['confidence']:.2%}")
    print(f"   Retrieved docs: {len(result['retrieved_docs'])}")
    print("-" * 70)

### 5.2 Simplifying the Implementation with LangChain

In [None]:
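# Note: these import paths match older LangChain releases. Newer releases move
# HuggingFaceEmbeddings, FAISS, and HuggingFacePipeline to `langchain_community`.
from langchain.embeddings import HuggingFaceEmbeddings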
from langchain.vectorstores import FAISS as LangChainFAISS
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import pipeline as hf_pipeline

# 1. Create the embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# 2. Build the vector store
vectorstore = LangChainFAISS.from_texts(
    texts=knowledge_base,
    embedding=embeddings
)

print("✅ LangChain vector store built")

# 3. Create the LLM
llm_pipeline = hf_pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    max_length=512
)

llm = HuggingFacePipeline(pipeline=llm_pipeline)

# 4. Build the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff", "map_reduce", "refine"
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

print("✅ QA chain built")

### 5.3 Using the QA Chain

In [None]:
# Question-answering test
question = "What is BERT and who developed it?"

result = qa_chain({"query": question})

print(f"Question: {question}")
print(f"\nAnswer: {result['result']}")
print("\nSource documents:")
for i, doc in enumerate(result['source_documents'], 1):
    print(f"  {i}. {doc.page_content[:100]}...")

## Part 6: A Complete QA System Class

In [None]:
class IntelligentQASystem:
    """
    Intelligent question-answering system
    """
    def __init__(self, knowledge_base, embedding_model_name="all-MiniLM-L6-v2"):
        print("Initializing the QA system...")

        # Load the embedding model
        self.embedding_model = SentenceTransformer(embedding_model_name)

        # Load the extractive QA model
        self.qa_model = pipeline(
            "question-answering",
            model="distilbert-base-cased-distilled-squad"
        )

        # Build the vector index (copy the list so add_documents does not mutate the caller's list)
        self.knowledge_base = list(knowledge_base)
        self.index = self._build_index()

        print("✅ System initialized")

    def _build_index(self):
        """Build the FAISS index."""
        embeddings = self.embedding_model.encode(
            self.knowledge_base,
            convert_to_numpy=True
        )

        dimension = embeddings.shape[1]
        index = faiss.IndexFlatL2(dimension)
        index.add(embeddings.astype('float32'))

        return index

    def retrieve(self, query, top_k=3):
        """Retrieve relevant documents."""
        query_embedding = self.embedding_model.encode(
            [query],
            convert_to_numpy=True
        )

        distances, indices = self.index.search(
            query_embedding.astype('float32'),
            top_k
        )

        results = []
        for idx, dist in zip(indices[0], distances[0]):
            results.append({
                'text': self.knowledge_base[idx],
                'distance': float(dist),
                'relevance': 1 / (1 + float(dist))  # map distance to a (0, 1] relevance score
            })

        return results

    def answer(self, question, top_k=3, min_confidence=0.3):
        """
        Answer a question.

        Args:
            question: the user's question
            top_k: number of documents to retrieve
            min_confidence: minimum confidence threshold

        Returns:
            Dict with answer, confidence, status, and (except on error) retrieved_docs
        """
        # Step 1: retrieve
        retrieved_docs = self.retrieve(question, top_k=top_k)

        # Step 2: build the context
        context = " ".join([doc['text'] for doc in retrieved_docs])

        # Step 3: extract the answer
        try:
            qa_result = self.qa_model(
                question=question,
                context=context
            )

            answer = qa_result['answer']
            confidence = qa_result['score']

            # Check the confidence threshold
            if confidence < min_confidence:
                return {
                    'answer': "I'm not confident about the answer. Could you rephrase?",
                    'confidence': confidence,
                    'status': 'low_confidence',
                    'retrieved_docs': retrieved_docs
                }

            return {
                'answer': answer,
                'confidence': confidence,
                'status': 'success',
                'context': context,
                'retrieved_docs': retrieved_docs
            }

        except Exception as e:
            return {
                'answer': "Sorry, I encountered an error processing your question.",
                'confidence': 0.0,
                'status': 'error',
                'error': str(e)
            }

    def add_documents(self, new_documents):
        """Add new documents to the knowledge base at runtime."""
        if not new_documents:
            return  # nothing to embed

        # Embed the new documents
        new_embeddings = self.embedding_model.encode(
            new_documents,
            convert_to_numpy=True
        )

        # Add them to the index
        self.index.add(new_embeddings.astype('float32'))

        # Update the knowledge base
        self.knowledge_base.extend(new_documents)

        print(f"✅ Added {len(new_documents)} documents")
        print(f"   Knowledge base size: {len(self.knowledge_base)}")


# Create a QA system instance
qa_system = IntelligentQASystem(knowledge_base)

### 6.1 Testing the QA System

In [None]:
# End-to-end test
test_questions = [
    "What is Natural Language Processing?",
    "Who developed BERT?",
    "What is the attention mechanism?",
    "Explain transfer learning in NLP.",
    "What platform is used for sharing NLP models?"
]

print("🎯 QA system end-to-end test")
print("=" * 80)

for i, question in enumerate(test_questions, 1):
    result = qa_system.answer(question)

    print(f"\nQ{i}: {question}")
    print(f"A: {result['answer']}")
    print(f"   Confidence: {result['confidence']:.2%}")
    print(f"   Status: {result['status']}")

    if result.get('retrieved_docs'):
        print(f"   Retrieved documents: {len(result['retrieved_docs'])}")
        print(f"   Most relevant: {result['retrieved_docs'][0]['text'][:80]}...")

    print("-" * 80)

## Part 7: Building a Knowledge Base from Files

### 7.1 Loading and Splitting Long Documents

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

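def load_and_split_documents(file_paths, chunk_size=500, chunk_overlap=50):
    """
    Load documents and split them into chunks.

    Args:
        file_paths: list of document paths
        chunk_size: maximum number of characters per chunk
        chunk_overlap: number of characters shared by adjacent chunks

    Returns:
        List of text chunks
    """
    # Text splitter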
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )

    all_chunks = []

    for file_path in file_paths:
        # Load the document
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()

        # Split into chunks
        chunks = text_splitter.split_text(text)
        all_chunks.extend(chunks)

    return all_chunks


# Example: load README documents
# doc_paths = ['README.md', 'docs/GUIDE.md']
# chunks = load_and_split_documents(doc_paths)
# print(f"✅ Loaded {len(chunks)} chunks")
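
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker in plain Python. It is a sketch for illustration only: `chunk_text` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` works, since that splitter prefers to break on separators such as paragraph and line breaks.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.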

### 7.2 Building a Knowledge Base from Web Pages

In [None]:
# Demo: scrape web page content (requires beautifulsoup4)
import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    """
    Fetch a web page and return its visible text, or None on failure.
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
        soup = BeautifulSoup(response.content, 'html.parser')

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.decompose()

        # Extract the text
        text = soup.get_text(separator='\n', strip=True)

        return text

    except Exception as e:
        print(f"Scrape failed: {e}")
        return None


# Demo: build a knowledge base from the Hugging Face documentation
# urls = [
#     'https://huggingface.co/docs/transformers/index',
#     'https://huggingface.co/docs/transformers/quicktour'
# ]
#
# web_docs = []
# for url in urls:
#     text = scrape_webpage(url)
#     if text:
#         web_docs.append(text)
#
# qa_system.add_documents(web_docs)

## Part 8: Advanced Features

### 8.1 Hybrid Retrieval

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class HybridRetriever:
    """
    Hybrid retrieval: semantic search + keyword search
    """
    def __init__(self, documents, semantic_weight=0.7):
        self.documents = documents
        self.semantic_weight = semantic_weight
        self.keyword_weight = 1 - semantic_weight

        # Semantic search: Sentence Transformer
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.doc_embeddings = self.embedding_model.encode(documents)

        # Keyword search: TF-IDF
        self.tfidf = TfidfVectorizer()
        self.tfidf_matrix = self.tfidf.fit_transform(documents)

    def retrieve(self, query, top_k=5):
        """Hybrid retrieval"""
        # Semantic scores
        query_embedding = self.embedding_model.encode([query])
        semantic_scores = cosine_similarity(query_embedding, self.doc_embeddings)[0]

        # Keyword scores
        query_tfidf = self.tfidf.transform([query])
        keyword_scores = cosine_similarity(query_tfidf, self.tfidf_matrix)[0]

        # Weighted combination of the two scores
        hybrid_scores = (
            self.semantic_weight * semantic_scores +
            self.keyword_weight * keyword_scores
        )

        # Sort by hybrid score, highest first
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append({
                'text': self.documents[idx],
                'hybrid_score': hybrid_scores[idx],
                'semantic_score': semantic_scores[idx],
                'keyword_score': keyword_scores[idx]
            })

        return results


# Test hybrid retrieval
hybrid_retriever = HybridRetriever(knowledge_base, semantic_weight=0.7)

query = "transformer attention mechanism"
results = hybrid_retriever.retrieve(query, top_k=3)

print(f"Query: {query}\n")
for i, result in enumerate(results, 1):
    print(f"{i}. Hybrid score: {result['hybrid_score']:.4f}")
    print(f"   (semantic: {result['semantic_score']:.4f}, keyword: {result['keyword_score']:.4f})")
    print(f"   {result['text'][:100]}...\n")
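
The weighted sum above adds raw semantic and keyword scores, which can sit on different scales. One common variation, shown here as an assumption rather than as part of `HybridRetriever`, is to min-max normalize each score list before weighting. The helper names and sample scores below are made up for illustration.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(semantic_scores, keyword_scores, semantic_weight=0.7):
    """Weighted sum of the two normalized score lists."""
    sem = min_max(semantic_scores)
    kw = min_max(keyword_scores)
    return [
        semantic_weight * s + (1 - semantic_weight) * k
        for s, k in zip(sem, kw)
    ]


# Made-up scores for three documents
semantic = [0.62, 0.55, 0.20]
keyword = [0.00, 0.40, 0.10]

fused = fuse(semantic, keyword)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(fused)
print(ranking)
```

In this example the second document overtakes the first once its strong keyword match is put on the same scale as the semantic scores.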

### 8.2 Re-ranking Retrieved Documents

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Use a cross-encoder for re-ranking
reranker_model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
reranker_tokenizer = AutoTokenizer.from_pretrained(reranker_model_name)
reranker_model = AutoModelForSequenceClassification.from_pretrained(reranker_model_name)

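def rerank_documents(query, documents, top_k=3):
    """
    Re-rank documents with a cross-encoder.
    """
    # Build query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Tokenize
    inputs = reranker_tokenizer(
        pairs,
        padding=True,
        truncation=True,
        return_tensors='pt',
        max_length=512
    )

    # Compute relevance scores; squeeze only the last dimension so that a
    # single document still yields a list rather than a bare float
    with torch.no_grad():
        scores = reranker_model(**inputs).logits.squeeze(-1).tolist()

    # Sort by score, highest first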
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )

    return ranked[:top_k]


# Test re-ranking
query = "What is BERT?"
candidate_docs = [doc['text'] for doc in qa_system.retrieve(query, top_k=5)]

reranked = rerank_documents(query, candidate_docs, top_k=3)

print(f"Query: {query}\n")
print("Re-ranked results:")
for i, (doc, score) in enumerate(reranked, 1):
    print(f"{i}. (score: {score:.4f})")
    print(f"   {doc[:100]}...\n")

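## Part 9: Production Deployment (FastAPI)

### 9.1 FastAPI Service

In [None]:
%%writefile qa_api.py
# qa_api.py - QA system API
#
# NOTE: this file also needs IntelligentQASystem (Part 6) and a
# load_knowledge_base() function; neither is defined or imported here, so
# add them (or import them from your own module) before running the server.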

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import List, Optional
import uvicorn

# Initialize FastAPI
app = FastAPI(
    title="Intelligent QA System API",
    description="RAG-based Question Answering System",
    version="1.0.0"
)

# Global: QA system instance
qa_system = None

@app.on_event("startup")
async def load_qa_system():
    """Load the QA system at startup."""
    global qa_system

    print("Loading the QA system...")
    # Load the knowledge base
    knowledge_base = load_knowledge_base()  # load from a file or database
    qa_system = IntelligentQASystem(knowledge_base)
    print("✅ QA system loaded")

# Request/Response Models
class QuestionInput(BaseModel):
    question: str = Field(..., min_length=5, max_length=500)
    top_k: Optional[int] = Field(default=3, ge=1, le=10)

class AnswerResponse(BaseModel):
    answer: str
    confidence: float
    status: str
    sources: List[str]

class DocumentInput(BaseModel):
    documents: List[str]

# Question-answering endpoint
@app.post("/ask", response_model=AnswerResponse)
async def ask_question(input_data: QuestionInput):
    """
    Answer a user question.
    """
    try:
        result = qa_system.answer(
            question=input_data.question,
            top_k=input_data.top_k
        )

        sources = [
            doc['text'][:100] + "..."
            for doc in result.get('retrieved_docs', [])
        ]

        return AnswerResponse(
            answer=result['answer'],
            confidence=result['confidence'],
            status=result['status'],
            sources=sources
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Add-documents endpoint
@app.post("/add_documents")
async def add_documents(input_data: DocumentInput):
    """
    Add new documents to the knowledge base at runtime.
    """
    try:
        qa_system.add_documents(input_data.documents)
        return {
            "message": f"Added {len(input_data.documents)} documents",
            "total_documents": len(qa_system.knowledge_base)
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Health check
@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "qa_system_loaded": qa_system is not None,
        "knowledge_base_size": len(qa_system.knowledge_base) if qa_system else 0
    }

# System information
@app.get("/info")
async def system_info():
    return {
        "knowledge_base_size": len(qa_system.knowledge_base),
        "embedding_model": "all-MiniLM-L6-v2",
        "qa_model": "distilbert-base-cased-distilled-squad"
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

### 9.2 Testing the API

In [None]:
%%writefile test_qa_api.py
# test_qa_api.py

import requests
import json

API_URL = "http://localhost:8000"

# Test question answering
response = requests.post(
    f"{API_URL}/ask",
    json={
        "question": "What is BERT?",
        "top_k": 3
    }
)

result = response.json()
print(f"Question: What is BERT?")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.2%}")
print(f"\nSources:")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. {source}")

# Add a new document
response = requests.post(
    f"{API_URL}/add_documents",
    json={
        "documents": [
            "RAG is a powerful technique that combines retrieval and generation."
        ]
    }
)

print(f"\nAdd documents: {response.json()}")

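## Part 10: Summary and Extensions

### ✅ What This Project Covered

1. **Core RAG system**
   - Document embedding (Sentence Transformer)
   - FAISS vector retrieval
   - QA model integration

2. **Advanced features**
   - Hybrid retrieval (semantic + keyword)
   - Re-ranking of retrieved documents
   - Confidence filtering

3. **Production deployment**
   - FastAPI service
   - Dynamic knowledge base updates
   - API documentation (Swagger)

### 🚀 Directions for Further Work

#### Technical improvements
- [ ] Use a larger embedding model
- [ ] Integrate the GPT-3.5/GPT-4 API
- [ ] Implement conversational (multi-turn) QA
- [ ] Add source citation tracking

#### Feature extensions
- [ ] Multilingual support
- [ ] QA over structured data (tables, charts)
- [ ] Complex reasoning (Chain-of-Thought)
- [ ] Fact verification

#### Application scenarios
- [ ] Enterprise knowledge management systems
- [ ] Intelligent document assistants
- [ ] Legal/medical QA
- [ ] Educational support systems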

### 📚 Further Reading

- [RAG paper](https://arxiv.org/abs/2005.11401)
- [FAISS documentation](https://github.com/facebookresearch/faiss)
- [LangChain documentation](https://python.langchain.com/docs/get_started/introduction)
- [Sentence Transformers](https://www.sbert.net/)

---

**Project version**: v1.0
**Created**: 2025-10-17
**Author**: iSpan NLP Team
**License**: MIT License