{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UwcN41PQf73W"
   },
   "source": [
    "# <b>Hugging Face 金融情緒分析實作</b>\n",
    "\n",
    "這個筆記本將引導您完成使用 Hugging Face `transformers` 和 `datasets` 函式庫，對金融文本進行情緒分析的完整流程。我們將涵蓋從資料載入、前處理、模型微調（Fine-tuning）到結果評估與預測的每一個步驟。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. 環境設定與套件安裝\n",
    "首先，我們需要安裝 Hugging Face 提供的 `datasets` 和 `transformers` 函式庫。`datasets` 可以幫助我們輕鬆下載和處理資料集，而 `transformers` 則提供了預訓練模型和訓練工具。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Pfwn6jilq0ro"
   },
   "outputs": [],
   "source": [
    "!pip install datasets transformers scikit-learn pandas torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pbAzKp4WM4O9"
   },
   "source": [
    "## 2. 資料載入與前處理\n",
    "我們將使用 `FinanceInc/auditor_sentiment` 資料集，這是一個專為金融領域設計的情緒分析資料集，包含正面（Positive）、中性（Neutral）和負面（Negative）三種標籤。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "KpamGxqud9SG"
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "from collections import Counter\n",
    "\n",
    "# 從 Hugging Face Hub 載入資料集\n",
    "dataset = load_dataset(\"FinanceInc/auditor_sentiment\")\n",
    "\n",
    "print(\"資料集結構：\")\n",
    "print(dataset)\n",
    "\n",
    "# 查看訓練集和測試集的前5筆資料\n",
    "print(\"\\n訓練集範例：\")\n",
    "print(dataset[\"train\"][:5])\n",
    "\n",
    "# 標籤對應關係：{0: \"Negative\", 1: \"Neutral\", 2: \"Positive\"}\n",
    "# 分析標籤分佈，檢查是否存在類別不平衡問題\n",
    "print(\"\\n訓練集標籤分佈：\", Counter(dataset[\"train\"][\"label\"]))\n",
    "print(\"測試集標籤分佈：\", Counter(dataset[\"test\"][\"label\"]))\n",
    "\n",
    "# 將資料轉換為 Pandas DataFrame 以便進行更複雜的處理\n",
    "df_train = pd.DataFrame(dataset[\"train\"])\n",
    "df_test = pd.DataFrame(dataset[\"test\"])\n",
    "\n",
    "# 檢查並移除重複值，避免模型過擬合\n",
    "df_train.drop_duplicates(subset=[\"sentence\"], inplace=True)\n",
    "df_test.drop_duplicates(subset=[\"sentence\"], inplace=True)\n",
    "print(\"\\n移除重複值後的大小：\")\n",
    "print(f\"訓練集大小: {df_train.shape}\")\n",
    "print(f\"測試集大小: {df_test.shape}\")"
   ]
  },
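  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The deduplicated DataFrames above are only inspected; the later cells keep working with `dataset`. The sketch below uses a made-up toy DataFrame to show what `drop_duplicates` does and how one might also check for sentences shared between the training and test splits. The names `toy_train`, `toy_test`, and `overlap` are illustrative assumptions, not part of the dataset. To actually train on cleaned data, one option would be to convert a cleaned DataFrame back with `Dataset.from_pandas` before splitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy data standing in for the real splits (sentences are made up)\n",
    "toy_train = pd.DataFrame({\n",
    "    \"sentence\": [\"Profit rose 10%.\", \"Profit rose 10%.\", \"Sales fell.\", \"No change.\"],\n",
    "    \"label\": [2, 2, 0, 1],\n",
    "})\n",
    "toy_test = pd.DataFrame({\"sentence\": [\"Sales fell.\", \"Costs increased.\"], \"label\": [0, 0]})\n",
    "\n",
    "# Drop exact duplicate sentences inside the training split\n",
    "toy_train = toy_train.drop_duplicates(subset=[\"sentence\"]).reset_index(drop=True)\n",
    "\n",
    "# Sentences present in both splits would leak test data into training\n",
    "overlap = set(toy_train[\"sentence\"]) & set(toy_test[\"sentence\"])\n",
    "toy_train = toy_train[~toy_train[\"sentence\"].isin(overlap)].reset_index(drop=True)\n",
    "\n",
    "print(f\"Rows kept: {len(toy_train)}, overlapping sentences: {sorted(overlap)}\")"
   ]
  },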
  {
   "cell_type": "markdown",
   "source": [
    "## 3. 資料集劃分與 Tokenization\n",
    "接著，我們將訓練集切分為訓練集和驗證集，並使用與預訓練模型匹配的 Tokenizer 將文本轉換為模型可以理解的數字格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yt-UA3uXsrD4"
   },
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "# 將原始訓練集切分為 80% 訓練集和 20% 驗證集\n",
    "train_val_split = dataset[\"train\"].train_test_split(test_size=0.2, seed=42)\n",
    "train_dataset = train_val_split[\"train\"]\n",
    "val_dataset = train_val_split[\"test\"]\n",
    "test_dataset = dataset[\"test\"]\n",
    "\n",
    "print(\"資料集劃分：\")\n",
    "print(f\"訓練集大小：{len(train_dataset)}\")\n",
    "print(f\"驗證集大小：{len(val_dataset)}\")\n",
    "print(f\"測試集大小：{len(test_dataset)}\")\n",
    "\n",
    "# =============================================================================\n",
    "# 【作業要求 1：嘗試不同預訓練BERT模型】\n",
    "# 請修改下方的 model_name 變數來更換不同的模型。\n",
    "# 推薦嘗試的模型：\n",
    "# 1. ProsusAI/finbert (金融領域特化模型，效果最好)\n",
    "# 2. distilbert-base-uncased (輕量化模型，速度快)\n",
    "# 3. roberta-base (改進版BERT，性能更強)\n",
    "# 4. bert-base-uncased (基礎BERT模型，作為基準線)\n",
    "# =============================================================================\n",
    "model_name = \"ProsusAI/finbert\"  # 推薦使用金融領域模型\n",
    "\n",
    "print(f\"\\n使用的模型：{model_name}\")\n",
    "\n",
    "# 使用 AutoTokenizer 載入與模型對應的 Tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "\n",
    "# 定義 Tokenization 函數\n",
    "def tokenize_function(example):\n",
    "    # padding=\"max_length\" -> 將所有句子填充到相同長度\n",
    "    # truncation=True -> 截斷超過模型最大長度的句子\n",
    "    return tokenizer(example[\"sentence\"], padding=\"max_length\", truncation=True)\n",
    "\n",
    "# 對所有資料集進行 Tokenization\n",
    "tokenized_train = train_dataset.map(tokenize_function, batched=True)\n",
    "tokenized_val = val_dataset.map(tokenize_function, batched=True)\n",
    "tokenized_test = test_dataset.map(tokenize_function, batched=True)\n",
    "\n",
    "print(\"\\nTokenization 完成！\")\n",
    "print(\"Tokenized 訓練集範例：\", tokenized_train[0])"
   ]
  },
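  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Padding every sentence to the maximum length is simple but spends most of each batch on padding tokens when sentences are short. The pure-Python sketch below illustrates the difference between fixed-length padding and padding to the longest sequence in a batch. The token IDs, `PAD_ID`, `MAX_LENGTH = 512`, and the helper `pad_to` are assumptions made for illustration and do not come from the tokenizer. In `transformers`, per-batch padding is usually achieved by tokenizing without `padding` and passing a `DataCollatorWithPadding` to `Trainer`; that variation is not used in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy token-id sequences of different lengths (values are made up)\n",
    "batch = [[101, 2054, 102], [101, 2054, 2003, 1996, 102], [101, 102]]\n",
    "PAD_ID = 0\n",
    "MAX_LENGTH = 512  # assumed model maximum, typical for BERT-style models\n",
    "\n",
    "def pad_to(sequences, length, pad_id=PAD_ID):\n",
    "    \"\"\"Right-pad every sequence with pad_id up to the given length.\"\"\"\n",
    "    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]\n",
    "\n",
    "# Fixed-length padding, analogous to padding=\"max_length\"\n",
    "fixed = pad_to(batch, MAX_LENGTH)\n",
    "# Dynamic padding: pad only up to the longest sequence in this batch\n",
    "dynamic = pad_to(batch, max(len(seq) for seq in batch))\n",
    "\n",
    "fixed_pad_tokens = sum(row.count(PAD_ID) for row in fixed)\n",
    "dynamic_pad_tokens = sum(row.count(PAD_ID) for row in dynamic)\n",
    "print(f\"Padding tokens with fixed length: {fixed_pad_tokens}\")\n",
    "print(f\"Padding tokens with dynamic length: {dynamic_pad_tokens}\")"
   ]
  },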
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1B6hjgeKNCOY"
   },
   "source": [
    "## 4. 模型建立與訓練\n",
    "現在我們可以載入預訓練模型，並使用 `Trainer` API 來進行微調（Fine-tuning）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CrZKiuOAARK0"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer\n",
    "from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n",
    "import numpy as np\n",
    "\n",
    "# 檢查是否有可用的 GPU，若有則使用 GPU 進行訓練\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "print(f\"使用的裝置: {device}\")\n",
    "\n",
    "# 載入預訓練模型，並指定 num_labels=3 (對應三種情緒)\n",
    "model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)\n",
    "model.to(device) # 將模型移至 GPU\n",
    "\n",
    "# 定義評估指標計算函數\n",
    "def compute_metrics(pred):\n",
    "    labels = pred.label_ids\n",
    "    preds = np.argmax(pred.predictions, axis=1)\n",
    "    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average=\"weighted\")\n",
    "    acc = accuracy_score(labels, preds)\n",
    "    return {\n",
    "        'accuracy': acc,\n",
    "        'f1': f1,\n",
    "        'precision': precision,\n",
    "        'recall': recall\n",
    "    }\n",
    "\n",
    "# 設定訓練參數\n",
    "training_args = TrainingArguments(\n",
    "    output_dir=\"./results\",\n",
    "    eval_strategy=\"epoch\",      # 每個 epoch 結束後進行一次評估\n",
    "    learning_rate=2e-5,\n",
    "    per_device_train_batch_size=16,\n",
    "    per_device_eval_batch_size=16,\n",
    "    num_train_epochs=3,\n",
    "    weight_decay=0.01,\n",
    "    logging_dir='./logs',\n",
    "    logging_steps=10,\n",
    "    report_to=\"none\" # 關閉 wandb 等報告\n",
    ")\n",
    "\n",
    "# 建立 Trainer\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=tokenized_train,\n",
    "    eval_dataset=tokenized_val,\n",
    "    compute_metrics=compute_metrics,\n",
    ")\n",
    "\n",
    "# 開始訓練\n",
    "trainer.train()"
   ]
  },
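  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`compute_metrics` takes the argmax of the logits and reports weighted-average scores. On an imbalanced dataset, a weighted average can look good even when a minority class is never predicted. The sketch below uses made-up logits and labels (not model output) to compare weighted and macro F1 on a small toy example. `macro_f1` is shown only as a possible additional metric and is not part of the training cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import accuracy_score, f1_score\n",
    "\n",
    "# Toy logits for 10 sentences and 3 classes (made-up values)\n",
    "logits = np.array([[0.1, 2.0, 0.3]] * 8 + [[2.5, 0.2, 0.1]] + [[0.2, 1.5, 1.0]])\n",
    "# True labels: eight Neutral (1), one Negative (0), one Positive (2)\n",
    "labels = np.array([1] * 8 + [0, 2])\n",
    "\n",
    "# The predicted class is the index of the largest logit in each row\n",
    "preds = np.argmax(logits, axis=1)\n",
    "\n",
    "accuracy = accuracy_score(labels, preds)\n",
    "# Weighted F1 averages per-class F1 weighted by class frequency\n",
    "weighted_f1 = f1_score(labels, preds, average=\"weighted\", zero_division=0)\n",
    "# Macro F1 gives every class equal weight, so a missed minority class hurts more\n",
    "macro_f1 = f1_score(labels, preds, average=\"macro\", zero_division=0)\n",
    "\n",
    "print(f\"accuracy={accuracy:.4f} weighted_f1={weighted_f1:.4f} macro_f1={macro_f1:.4f}\")"
   ]
  },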
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O2CSuye8NPbm"
   },
   "source": [
    "## 5. 結果評估\n",
    "訓練完成後，我們在驗證集和從未見過的測試集上評估模型的最終表現。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QUKztZcotkI7"
   },
   "outputs": [],
   "source": [
    "# 評估模型在驗證集的表現\n",
    "print(\"\\n=== 驗證集評估結果 ===\")\n",
    "val_result = trainer.evaluate(eval_dataset=tokenized_val)\n",
    "print(val_result)\n",
    "\n",
    "# 評估模型在測試集的最終表現\n",
    "print(\"\\n=== 測試集評估結果 (最終成績) ===\")\n",
    "test_result = trainer.evaluate(eval_dataset=tokenized_test)\n",
    "print(test_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PSgCkDKhwF_o"
   },
   "source": [
    "## 6. 自行選擇句子進行預測\n",
    "最後，我們可以載入訓練好的模型，並對自己選擇的句子進行情緒預測。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "V5ypsiDh1ACF"
   },
   "outputs": [],
   "source": [
    "# 先展示測試集中的句子，方便您選擇\n",
    "print(\"=== 測試集句子展示 ===\")\n",
    "print(\"以下是測試集中的前 50 個句子：\\n\")\n",
    "\n",
    "label_map = {0: \"Negative\", 1: \"Neutral\", 2: \"Positive\"}\n",
    "\n",
    "for i in range(50):\n",
    "    sentence = test_dataset[i][\"sentence\"]\n",
    "    label = test_dataset[i][\"label\"]\n",
    "    print(f\"[{i}] {sentence}\")\n",
    "    print(f\"    真實標籤：{label_map[label]}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZmWdctbTwO_f"
   },
   "outputs": [],
   "source": [
    "# =============================================================================\n",
    "# 【作業要求 2：自行選擇測試句子進行預測】\n",
    "# 請從上方顯示的測試集句子中，選擇 5 個你感興趣的句子索引填入下方列表。\n",
    "# 建議可以選擇不同情緒的句子，或是看起來比較模稜兩可的句子來測試模型。\n",
    "# =============================================================================\n",
    "test_indices = [13, 26, 36, 43, 48]  # 請修改這裡！\n",
    "\n",
    "print(f\"\\n您選擇的測試句子索引：{test_indices}\")\n",
    "# =============================================================================\n",
    "\n",
    "# 根據選擇的索引，取得句子和真實標籤\n",
    "test_texts = [test_dataset[i][\"sentence\"] for i in test_indices]\n",
    "true_labels = [test_dataset[i][\"label\"] for i in test_indices]\n",
    "\n",
    "print(\"\\n=== 開始預測 ===\\n\")\n",
    "\n",
    "# 使用 pipeline 可以更方便地進行預測\n",
    "from transformers import pipeline\n",
    "\n",
    "classifier = pipeline(\"sentiment-analysis\", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)\n",
    "predictions = classifier(test_texts)\n",
    "\n",
    "# Hugging Face pipeline 的標籤可能是 LABEL_0, LABEL_1, ... 需要對應回我們的標籤\n",
    "pipe_label_map = {\"LABEL_0\": \"Negative\", \"LABEL_1\": \"Neutral\", \"LABEL_2\": \"Positive\"}\n",
    "\n",
    "correct_count = 0\n",
    "for i, idx in enumerate(test_indices):\n",
    "    text = test_texts[i]\n",
    "    true_label_text = label_map[true_labels[i]]\n",
    "    pred_label_text = pipe_label_map[predictions[i]['label']]\n",
    "    \n",
    "    is_correct = (true_label_text == pred_label_text)\n",
    "    if is_correct:\n",
    "        correct_count += 1\n",
    "    correct_mark = \"✓ 正確\" if is_correct else \"✗ 錯誤\"\n",
    "\n",
    "    print(f\"{i+1}. 測試集索引 [{idx}] {correct_mark}\")\n",
    "    print(f\"   句子：{text}\")\n",
    "    print(f\"   真實標籤：{true_label_text}\")\n",
    "    print(f\"   預測標籤：{pred_label_text} (信心分數: {predictions[i]['score']:.4f})\\n\")\n",
    "\n",
    "accuracy = (correct_count / len(test_indices)) * 100\n",
    "print(f\"預測準確率：{correct_count}/{len(test_indices)} = {accuracy:.2f}%\")\n",
    "\n",
    "print(\"\\n【作業要求 3：使用大語言模型進行情緒分類】\")\n",
    "print(\"對於上方預測標示為 '✗ 錯誤' 的句子，請參考 'LLM情緒分類比較.md' 檔案的說明，\")\n",
    "print(\"嘗試使用 Zero-shot 和 Few-shot prompting 讓大語言模型進行預測，並比較結果。\")"
   ]
  }
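  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confidence score printed with each prediction is, for a single-label classifier like this one, the softmax probability of the predicted class. The sketch below reproduces that calculation by hand on made-up logits. The helper `softmax` and the values in `toy_logits` are illustrative assumptions, not output from the fine-tuned model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "label_map = {0: \"Negative\", 1: \"Neutral\", 2: \"Positive\"}\n",
    "\n",
    "def softmax(logits):\n",
    "    \"\"\"Turn raw logits into probabilities that sum to 1.\"\"\"\n",
    "    peak = max(logits)  # subtract the maximum for numerical stability\n",
    "    exps = [math.exp(x - peak) for x in logits]\n",
    "    total = sum(exps)\n",
    "    return [e / total for e in exps]\n",
    "\n",
    "toy_logits = [-1.2, 0.3, 2.1]  # made-up values, not real model output\n",
    "probs = softmax(toy_logits)\n",
    "\n",
    "# The predicted class is the index with the highest probability\n",
    "pred_id = max(range(len(probs)), key=lambda i: probs[i])\n",
    "\n",
    "print(f\"Predicted: {label_map[pred_id]} (confidence: {probs[pred_id]:.4f})\")"
   ]
  }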
 ],
 "metadata": {
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}