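# Data Preprocessing - Generating filtered_df.db

This notebook reads listing data from vga.db without modifying it and writes the processed result to filtered_df.db. Note that section 1 does overwrite the gpus table in gpus.db with its deduplicated contents.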

In [13]:
import sqlite3
import json
import pandas as pd

## 1. Preprocessing gpus

In [14]:
# Connect to gpus.db
gpus_conn = sqlite3.connect('gpus.db')
df = pd.read_sql_query("SELECT * FROM gpus", gpus_conn)

# Convert score to a nullable integer, then keep the highest-scoring row per name
df['score'] = pd.to_numeric(df['score'], errors='coerce').astype('Int64')
df_deduplicated = df.sort_values(by='score', ascending=False).drop_duplicates(subset='name', keep='first')

# Overwrite the gpus table in gpus.db with the deduplicated rows
cursor = gpus_conn.cursor()
cursor.execute("DELETE FROM gpus")
gpus_conn.commit()
df_deduplicated.to_sql('gpus', gpus_conn, if_exists='append', index=False)

print(f"GPU deduplication complete: {len(df_deduplicated)} models")
df_deduplicated

GPU deduplication complete: 134 models


Unnamed: 0,id,name,score
0,1,NVIDIA RTX PRO 6000 Blackwell,15792
1,2,NVIDIA GeForce RTX 5090,14480
2,3,NVIDIA GeForce RTX 5090 D,14425
3,4,NVIDIA GeForce RTX 4090,9236
4,5,NVIDIA GeForce RTX 5080,8762
...,...,...,...
129,137,AMD Radeon Graphics (Granite Ridge),129
130,138,NVIDIA GeForce GTX 750 Ti,123
131,139,AMD Radeon Graphics (Raphael),122
132,140,NVIDIA GeForce GTX 1050,120
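
The deduplication rule used in the cell above (sort by score, then keep the first row per name) can be checked on a small example. This is a minimal sketch: the `toy` frame and its values are hypothetical and are not taken from gpus.db.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from gpus.db).
toy = pd.DataFrame({
    "name": ["GPU A", "GPU B", "GPU A", "GPU C"],
    "score": ["100", "250", "180", "n/a"],
})

# Non-numeric scores become <NA>; Int64 is pandas' nullable integer dtype.
toy["score"] = pd.to_numeric(toy["score"], errors="coerce").astype("Int64")

# Sort descending so the best score comes first, then keep one row per name.
best = (
    toy.sort_values(by="score", ascending=False)
       .drop_duplicates(subset="name", keep="first")
       .reset_index(drop=True)
)
print(best)
```

Because missing scores sort last by default, a name keeps a missing score only when none of its rows has a numeric one.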


## 2. Loading the GPU mapping

In [15]:
with open("3 gpu_mapping_checklist.json", "r", encoding="utf-8") as f:
    mapping = json.load(f)

print(f"Mapping loaded: {len(mapping)} entries")

Mapping loaded: 124 entries
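
Assuming the JSON file is a flat object that maps a listing's chipset label to a canonical GPU name (which is how the notebook uses it with `Series.map`), a quick coverage check can show which mapping targets have no benchmark entry. In this sketch the third mapping entry, `benchmark_names`, and `chipsets` are hypothetical illustrations.

```python
import pandas as pd

# The first two pairs appear in this notebook's output; the third is hypothetical.
mapping = {
    "NVIDIA RTX3090": "NVIDIA GeForce RTX 3090",
    "AMD Radeon RX9060XT-8G": "AMD Radeon RX 9060 XT",
    "Some Listing Label": "Unknown GPU",
}
# Hypothetical stand-in for the set of names in the gpus table.
benchmark_names = {"NVIDIA GeForce RTX 3090", "AMD Radeon RX 9060 XT"}

# Mapping targets without a benchmark entry would end up with a missing score.
unmatched = sorted(set(mapping.values()) - benchmark_names)
print(unmatched)

chipsets = pd.Series(["NVIDIA RTX3090", "Not In Mapping"])
# Series.map with a dict returns NaN for keys that are absent from the dict.
mapped = chipsets.map(mapping)
print(mapped)
```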


## 3. Reading and processing data from vga.db

In [16]:
# Read all rows from vga.db (the source database is not modified)
vga_conn = sqlite3.connect("vga.db")
vga_df = pd.read_sql_query("SELECT * FROM vga", vga_conn)
vga_conn.close()

print(f"Raw data: {len(vga_df)} rows")

# Add the pure_chipset column (canonical GPU name; NaN when the chipset has no mapping)
vga_df['pure_chipset'] = vga_df['chipset'].map(mapping)

# Add the score column (kept as a nullable integer)
gpus_dict = df_deduplicated.set_index('name')['score'].to_dict()
vga_df['score'] = vga_df['pure_chipset'].map(gpus_dict).astype('Int64')

# Compute the CP value (score per unit of price); missing score or zero price gives None
vga_df['CP'] = vga_df.apply(
    lambda row: row['score'] / row['price'] if pd.notna(row['score']) and row['price'] != 0 else None,
    axis=1
)

print(f"Processed data: {len(vga_df)} rows")
vga_df.head()

Raw data: 101558 rows
Processed data: 101558 rows


Unnamed: 0,date,chipset,product,price,pure_chipset,score,CP
0,20200105,NVIDIA / AMD 顯示卡周邊配件,酷碼 VGA Holder 顯卡用支架 千斤頂顯卡支撐架/(0005-KUH00)*任搭顯卡價,369,,,
1,20200105,NVIDIA / AMD 顯示卡周邊配件,酷碼 ELV8 A.RGB 顯卡支撐架(MAZ-IMGB-N30NA-R1),790,,,
2,20200105,NVIDIA / AMD 顯示卡周邊配件,NVIDIA GEFORCE RTX NVLINK BRIDGE 3-SLOT(間隔 60m...,2790,,,
3,20200105,NVIDIA / AMD 顯示卡周邊配件,NVIDIA GEFORCE RTX NVLINK BRIDGE 4-SLOT(間隔 80m...,2790,,,
4,20200105,NVIDIA / AMD 顯示卡周邊配件,華碩 ROG-NVLINK 4 SLOT橋接器(間隔 80mm/RTX 2080 2080T...,2990,,,
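
The CP value is computed row by row with `apply`, which can be slow on roughly 100,000 rows. A vectorized equivalent is sketched below under the assumption that `price` is numeric; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values): a normal row, a zero price, and a missing score.
toy = pd.DataFrame({
    "price": [5000, 0, 8000],
    "score": pd.array([2500, 1200, None], dtype="Int64"),
})

score = toy["score"].astype("float64")         # <NA> becomes NaN
price = toy["price"].where(toy["price"] != 0)  # a zero price becomes NaN
# NaN propagates through the division, so no explicit condition is needed.
toy["CP"] = score / price
print(toy)
```

This gives the same missing values as the conditional lambda: CP is undefined whenever the score is missing or the price is zero.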


## 4. Filtering the data

In [17]:
# Define the exclusion keywords (matched as substrings against the raw Chinese listing text)
chipset_exclude_keywords = [
    'AMD 工作站繪圖卡 (客訂交貨.歡迎議價)',
    'NVIDIA / AMD 外接顯卡轉接盒 (需另購顯卡)',
    'NVIDIA / AMD 顯示卡周邊配件',
    'NVIDIA Quadro 專業繪圖卡 (歡迎議價)',
    'NVIDIA / AMD 外接式顯卡轉接盒',
    'NVIDIA Quadro 專業繪圖卡',
    'NVIDIA 外接式顯卡轉接盒',
    'AMD 工作站繪圖卡'
]

product_exclude_keywords = [
    '贈', '抽', '送', '加購', '登錄', '活動', '限量', '現省',
    '現折', '現賺', '再加', '加送', '加價購', '送ROG', '延長線',
    '[合購]', '[紅包'
]

# Apply the filters: drop any row whose product or chipset contains an excluded keyword
product_mask = ~vga_df['product'].astype(str).apply(
    lambda p: any(keyword in p for keyword in product_exclude_keywords)
)
chipset_mask = ~vga_df['chipset'].astype(str).apply(
    lambda c: any(keyword in c for keyword in chipset_exclude_keywords)
)

filtered_df = vga_df[product_mask & chipset_mask].copy()
print(f"After keyword filtering: {len(filtered_df)} rows")

# Drop rows whose pure_chipset is missing or blank
filtered_df = filtered_df[
    filtered_df['pure_chipset'].notna() &
    (filtered_df['pure_chipset'].str.strip() != '')
]
print(f"After pure_chipset filtering: {len(filtered_df)} rows")

# Drop rows without a score; score is a nullable integer, so a missing-value check is sufficient
filtered_df = filtered_df[filtered_df['score'].notna()]
print(f"After score filtering: {len(filtered_df)} rows")

# Sort by CP value, highest first
filtered_df = filtered_df.sort_values(by='CP', ascending=False)

filtered_df.head(20)

After keyword filtering: 84413 rows
After pure_chipset filtering: 73034 rows
After score filtering: 72217 rows


Unnamed: 0,date,chipset,product,price,pure_chipset,score,CP
45328,20210823,NVIDIA RTX3090,❤ 華碩 TUF-RTX3090-24G-GAMING(1725MHz/30cm/三風扇) ...,5990,NVIDIA GeForce RTX 3090,5117,0.854257
101179,20251116,AMD Radeon RX9060XT-8G,[雙11任搭]藍寶石 脈動 PULSE RX9060XT GAMING OC 8GB(329...,8888,AMD Radeon RX 9060 XT,3719,0.418429
101507,20251118,AMD Radeon RX9060XT-8G,[雙11任搭]藍寶石 脈動 PULSE RX9060XT GAMING OC 8GB(329...,8888,AMD Radeon RX 9060 XT,3719,0.418429
99912,20250924,AMD Radeon RX9060XT-8G,撼訊 RX9060XT 8G-A 遊蕩者Reaper(3130MHz/22cm/雙風扇/三年...,9990,AMD Radeon RX 9060 XT,3719,0.372272
101504,20251118,AMD Radeon RX9060XT-8G,Acer Nitro RX9060XT OC 8GB(3320MHz/27cm/雙風扇/三年保固),9990,AMD Radeon RX 9060 XT,3719,0.372272
100856,20251106,AMD Radeon RX9060XT-8G,Acer Nitro RX9060XT OC 8GB(3320MHz/27cm/雙風扇/三年保固),9990,AMD Radeon RX 9060 XT,3719,0.372272
98718,20250817,AMD Radeon RX9060XT-8G,撼訊 RX9060XT 8G-A 遊蕩者Reaper(3130MHz/22cm/雙風扇/三年...,9990,AMD Radeon RX 9060 XT,3719,0.372272
97123,20250704,AMD Radeon RX9060XT-8G,撼訊 RX9060XT 8G-A 遊蕩者Reaper(3130MHz/22cm/雙風扇/三年...,9990,AMD Radeon RX 9060 XT,3719,0.372272
99597,20250917,AMD Radeon RX9060XT-8G,[任搭AMD CPU] 藍寶石 脈動 PULSE RX9060XT GAMING OC 8G...,9990,AMD Radeon RX 9060 XT,3719,0.372272
99598,20250917,AMD Radeon RX9060XT-8G,撼訊 RX9060XT 8G-A 遊蕩者Reaper(3130MHz/22cm/雙風扇/三年...,9990,AMD Radeon RX 9060 XT,3719,0.372272
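
The keyword masks above test each keyword with Python's `in` operator inside `apply`. One possible alternative is a single `str.contains` call, sketched here with hypothetical product names and a subset of the keyword list. The keywords must be escaped first, because entries such as `[紅包` are not valid regular expressions on their own.

```python
import re
import pandas as pd

# Hypothetical product names; the keywords are a subset of the list above.
products = pd.Series([
    "Card A 8GB (dual fan)",
    "Card B 12GB [合購] bundle",
    "Card C 16GB 加購 cable",
])
keywords = ["加購", "[合購]", "[紅包"]

# re.escape keeps "[" and "]" literal instead of starting a character class.
pattern = "|".join(re.escape(k) for k in keywords)
keep = ~products.astype(str).str.contains(pattern, regex=True)
print(products[keep])
```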


## 5. Saving to filtered_df.db

In [18]:
# Save to filtered_df.db (an existing filtered_df table is replaced)
filtered_conn = sqlite3.connect("filtered_df.db")
filtered_df.to_sql("filtered_df", filtered_conn, if_exists="replace", index=False)
filtered_conn.commit()
filtered_conn.close()

# Close the gpus connection
gpus_conn.close()

print("\n✓ Done!")
print("  - vga.db: unchanged")
print(f"  - filtered_df.db: created ({len(filtered_df)} rows)")


✓ Done!
  - vga.db: unchanged
  - filtered_df.db: created (72217 rows)
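
To confirm that a write like this succeeded, the stored row count can be compared with the DataFrame length. The sketch below uses an in-memory SQLite database and hypothetical rows as a stand-in for filtered_df.db.

```python
import sqlite3
import pandas as pd

# In-memory database and toy rows stand in for filtered_df.db (illustrative values).
toy = pd.DataFrame({"product": ["Card A", "Card B"], "price": [5000, 8000]})

conn = sqlite3.connect(":memory:")
try:
    toy.to_sql("filtered_df", conn, if_exists="replace", index=False)
    # Count the rows actually stored and compare with the DataFrame length.
    (stored_rows,) = conn.execute("SELECT COUNT(*) FROM filtered_df").fetchone()
finally:
    conn.close()

print(stored_rows == len(toy))
```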
