<a href="https://colab.research.google.com/github/ginny0410/aop113b/blob/main/HW04_%E7%B4%85%E9%85%92%E5%93%81%E8%B3%AA%E9%A0%90%E6%B8%AC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

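## Problem Definition

* **Goal**: Predict a red wine's quality tier (low, medium, or high) from its 11 chemical features.
* **Task type**: Supervised multiclass classification.
* **Core model**: **K-Nearest Neighbors (KNN)**.
* **Evaluation metrics**: Primarily **Accuracy**, supplemented by Precision, Recall, F1, and the confusion matrix.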

## Data Collection

| Source | Features | Samples | Classes |
| --- | --- | --- | --- |
| UCI Machine Learning Repository, Wine Quality (red), `winequality-red.csv` | 11 | 1599 | 3 (after binning `quality`) |

In [None]:
# Load required packages
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load the Wine Quality (red) dataset from the UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df_raw = pd.read_csv(url, sep=';')

# Sanity check
print("Dataset shape:", df_raw.shape)
print("\nFirst 5 rows:")
print(df_raw.head())
print("\nDataset info:")
df_raw.info()  # info() prints directly; wrapping it in print() would also print "None"

Dataset shape: (1599, 12)

First 5 rows:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8      

## Data Preprocessing

### Data Cleaning

In [None]:
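# Check for missing values
print("Missing values per column:")
print(df_raw.isnull().sum())

# Check for duplicate rows
print(f"\nDuplicate rows: {df_raw.duplicated().sum()}")

# Drop duplicates; .copy() avoids SettingWithCopyWarning when a column is added below
df_clean = df_raw.drop_duplicates().copy()
print(f"Shape after cleaning: {df_clean.shape}")

# Map the numeric quality score to a category label
def quality_to_category(quality):
    if quality <= 5:
        return 0  # low quality (score <= 5)
    elif quality <= 6:
        return 1  # medium quality (score == 6)
    else:
        return 2  # high quality (score >= 7)

df_clean['quality_category'] = df_clean['quality'].apply(quality_to_category)
category_names = ['Low', 'Medium', 'High']

print("\nQuality distribution:")
print(df_clean['quality_category'].value_counts().sort_index())

Missing values per column: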
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Duplicate rows: 240
Shape after cleaning: (1359, 12)

Quality distribution:
quality_category
0    640
1    535
2    184
Name: count, dtype: int64
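The row-by-row `apply` above can also be written as a vectorized binning step. The following is a minimal sketch using `pd.cut` on a small hypothetical series of scores (`scores` is illustrative, not the notebook's data); the bin edges are chosen to match the thresholds in `quality_to_category`.

```python
import pandas as pd

# Hypothetical quality scores covering the range seen in the dataset
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-closed bins: (-inf, 5] -> 0, (5, 6] -> 1, (6, inf) -> 2
categories = pd.cut(
    scores,
    bins=[-float("inf"), 5, 6, float("inf")],
    labels=[0, 1, 2],
).astype(int)

print(categories.tolist())
```

For integer scores this produces the same labels as the `if`/`elif` function, without a Python-level loop.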


### Exploratory Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Optional: configure a CJK-capable font (only needed for non-English labels)
#plt.rcParams['font.sans-serif'] = ['SimHei', 'Arial Unicode MS']
#plt.rcParams['axes.unicode_minus'] = False

# Quality distribution plots
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
df_clean['quality'].hist(bins=6, alpha=0.7, color='skyblue')
plt.title('Original quality score distribution')
plt.xlabel('Quality score')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
quality_counts = df_clean['quality_category'].value_counts().sort_index()
plt.bar(range(len(category_names)), quality_counts.values, color=['red', 'orange', 'green'])
plt.title('Quality category distribution')
plt.xlabel('Quality category')
plt.ylabel('Count')
plt.xticks(range(len(category_names)), category_names)
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
correlation_matrix = df_clean[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature correlation heatmap')
plt.tight_layout()
plt.show()

### Data Splitting

In [None]:
# Build the feature matrix and target vector
feature_columns = [col for col in df_clean.columns if col not in ['quality', 'quality_category']]
X = df_clean[feature_columns].values
y = df_clean['quality_category'].values

print("Feature names:")
for i, col in enumerate(feature_columns):
    print(f"{i+1}. {col}")

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

Feature names:
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Feature matrix shape: (1359, 11)
Target vector shape: (1359,)


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"Training label distribution: {np.bincount(y_train)}")
print(f"Test label distribution: {np.bincount(y_test)}")

Training set size: (1087, 11)
Test set size: (272, 11)
Training label distribution: [512 428 147]
Test label distribution: [128 107  37]
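The point of `stratify=y` is that each split keeps roughly the same class proportions as the full dataset. The sketch below checks this on synthetic labels that reuse the class counts reported above (640 / 535 / 184); `X_demo` and `y_demo` are placeholders for illustration, not the wine features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the cleaned dataset
y_demo = np.repeat([0, 1, 2], [640, 535, 184])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single-column features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Class proportions in the full data, the training split, and the test split
full_frac = np.bincount(y_demo) / len(y_demo)
train_frac = np.bincount(y_tr) / len(y_tr)
test_frac = np.bincount(y_te) / len(y_te)

print("full :", full_frac.round(3))
print("train:", train_frac.round(3))
print("test :", test_frac.round(3))
```

Without stratification, the smallest class (high quality) could end up noticeably over- or under-represented in a 272-sample test set.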


### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# KNN relies on distance computations, so features must be standardized.
# The scaler is fit on the training set only to avoid leaking test statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature range before standardization:")
print(f"Min: {X_train.min(axis=0)[:5]}...")  # show only the first 5 features
print(f"Max: {X_train.max(axis=0)[:5]}...")

print("\nFeature statistics after standardization:")
print(f"Mean: {X_train_scaled.mean(axis=0)[:5]}...")
print(f"Std: {X_train_scaled.std(axis=0)[:5]}...")

Feature range before standardization:
Min: [4.7   0.12  0.    0.9   0.012]...
Max: [15.9    1.58   1.    15.5    0.611]...

Feature statistics after standardization:
Mean: [-4.83002074e-15  4.55487636e-15 -7.66227519e-16 -9.61001696e-16
  4.64490962e-15]...
Std: [1. 1. 1. 1. 1.]...
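To see why scaling matters for a distance-based model, the sketch below uses four hypothetical rows with two features on very different scales (loosely modeled on total sulfur dioxide and pH; the values are made up for illustration). It checks which of two candidates is closest to a query point before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [total sulfur dioxide, pH]
data = np.array([
    [34.0, 3.51],  # query point
    [36.0, 3.00],  # candidate A: close in sulfur dioxide, far in pH
    [44.0, 3.50],  # candidate B: far in sulfur dioxide, close in pH
    [6.0, 3.20],   # extra row so the column statistics are less degenerate
])

def nearest_candidate(points):
    """Return 0 if candidate A is nearest to the query, 1 if candidate B is."""
    query, candidates = points[0], points[1:3]
    dists = np.linalg.norm(candidates - query, axis=1)
    return int(np.argmin(dists))

raw_nearest = nearest_candidate(data)
scaled_nearest = nearest_candidate(StandardScaler().fit_transform(data))

print("nearest on raw features:   ", "AB"[raw_nearest])
print("nearest on scaled features:", "AB"[scaled_nearest])
```

On raw values the large-range feature dominates the Euclidean distance, so the nearest neighbor changes once both features are put on a comparable scale.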


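## Model Training

The model is a KNN classifier. It is wrapped in a pipeline together with `StandardScaler`, so the unscaled `X_train` is passed to `fit` and scaling is applied inside the pipeline.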
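As a reminder of what KNN does at prediction time, here is a minimal from-scratch sketch: compute Euclidean distances to all training points, take the k closest, and return the majority label. The function `knn_predict` and the toy arrays are illustrative only; scikit-learn's `KNeighborsClassifier` is the implementation actually used in this notebook.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict one label by unweighted majority vote among the k nearest points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    # On a tie, most_common returns the label encountered first
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.15, 0.10]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([1.00, 1.05]), k=3)
print(pred_a, pred_b)
```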

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

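# Build the KNN classifier pipeline
knn_clf = make_pipeline(
    StandardScaler(),          # keeps feature scaling consistent at inference time
    KNeighborsClassifier(
        n_neighbors=5,         # default k=5
        weights="uniform",     # equal weight for every neighbor
        metric="euclidean"     # Euclidean distance
    )
)

# Train the model
knn_clf.fit(X_train, y_train)
print("Training complete")

Training complete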


## Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns


# Predict on the test set
y_pred = knn_clf.predict(X_test)

# Compute accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.3f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=category_names))

# Confusion matrix (rows: actual labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=category_names,
            yticklabels=category_names)
plt.title('Confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.show()
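The per-class precision and recall in the classification report can be read directly off the confusion matrix: precision divides the diagonal by the column sums, recall divides it by the row sums. The sketch below shows this on hypothetical labels (`y_true_demo` and `y_pred_demo` are made up, not the notebook's predictions) and cross-checks against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for three classes
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 0, 2, 1, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)  # rows: actual, cols: predicted

precision = np.diag(cm_demo) / cm_demo.sum(axis=0)  # correct / all predicted as class
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)     # correct / all actually in class
f1 = 2 * precision * recall / (precision + recall)

print(cm_demo)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("macro F1: ", f1.mean().round(3))
```

Macro averaging weights every class equally, which makes it more sensitive than accuracy to a weak minority class.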

## Model Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Build the pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Define the parameter grid
param_grid = {
    "kneighborsclassifier__n_neighbors": range(3, 21, 2),  # odd k from 3 to 19
    "kneighborsclassifier__weights": ["uniform", "distance"],
    # note: "minkowski" with the default p=2 is equivalent to "euclidean"
    "kneighborsclassifier__metric": ["euclidean", "manhattan", "minkowski"]
}

# Grid search
grid = GridSearchCV(
    pipe,
    param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring="accuracy",
    n_jobs=-1,
    verbose=1
)

print("Starting grid search...")
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Mean CV accuracy: {grid.best_score_:.3f}")

# Use the best model
best_model = grid.best_estimator_

Starting grid search...
Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best parameters: {'kneighborsclassifier__metric': 'euclidean', 'kneighborsclassifier__n_neighbors': 19, 'kneighborsclassifier__weights': 'distance'}
Mean CV accuracy: 0.604
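The "54 candidates" in the log is the product of the grid sizes (9 values of k, 2 weightings, 3 metrics). The sketch below reproduces that count with `ParameterGrid` and, on random placeholder data, checks that `metric="minkowski"` (default `p=2`) predicts the same labels as `metric="euclidean"`, which suggests a third of the grid is redundant. The arrays `X_rand`, `y_rand`, and `X_new` are synthetic and for illustration only.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier

# Same grid as above, without the pipeline step prefix
grid_demo = {
    "n_neighbors": range(3, 21, 2),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],
}
n_candidates = len(ParameterGrid(grid_demo))
print("candidates:", n_candidates)

# Random placeholder data with 11 features and 3 classes
rng = np.random.default_rng(0)
X_rand = rng.normal(size=(200, 11))
y_rand = rng.integers(0, 3, size=200)
X_new = rng.normal(size=(50, 11))

pred_euclid = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_rand, y_rand).predict(X_new)
pred_mink = KNeighborsClassifier(n_neighbors=5, metric="minkowski").fit(X_rand, y_rand).predict(X_new)

same = bool(np.array_equal(pred_euclid, pred_mink))
print("minkowski (p=2) matches euclidean:", same)
```

Dropping `"minkowski"` from the grid, or searching over `p` explicitly, would avoid fitting duplicate candidates.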


In [None]:
# Predict with the best model
y_pred_best = best_model.predict(X_test)
acc_best = accuracy_score(y_test, y_pred_best)

print(f"Baseline accuracy: {acc:.3f}")
print(f"Tuned accuracy: {acc_best:.3f}")
print(f"Accuracy gain: {acc_best - acc:.3f}")

# Detailed evaluation of the best model
print("\nClassification report (tuned model):")
print(classification_report(y_test, y_pred_best, target_names=category_names))

Baseline accuracy: 0.603
Tuned accuracy: 0.614
Accuracy gain: 0.011

Classification report (tuned model):
              precision    recall  f1-score   support

         Low       0.70      0.72      0.71       128
      Medium       0.55      0.55      0.55       107
        High       0.50      0.43      0.46        37

    accuracy                           0.61       272
   macro avg       0.58      0.57      0.57       272
weighted avg       0.61      0.61      0.61       272



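**Tuning notes**

| Parameter | Notes |
| --- | --- |
| `n_neighbors` | A small k is sensitive to noise; a large k over-smooths. The grid search here selected k=19, the largest value in the grid, so a wider range may be worth searching. |
| `weights` | `"uniform"`: every neighbor votes equally; `"distance"`: votes are weighted by inverse distance, so closer neighbors count more. This can help when a minority class has few nearby points, and it was the setting selected here. |
| `metric` | `euclidean` suits continuous features such as these chemical measurements once they are standardized, and it was selected by the grid search. |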
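The difference between the two `weights` settings can be sketched with a single hypothetical neighborhood: one very close neighbor from class 2 and two farther neighbors from class 0. The helper `vote` below is illustrative and assumes all distances are nonzero (scikit-learn handles exact matches as a special case).

```python
import numpy as np

def vote(labels, dists, n_classes, weights="uniform"):
    """Return per-class vote shares for one query point (distances must be > 0)."""
    w = np.ones_like(dists) if weights == "uniform" else 1.0 / dists
    scores = np.bincount(labels, weights=w, minlength=n_classes)
    return scores / scores.sum()

# Hypothetical neighborhood: labels and distances of the 3 nearest neighbors
neighbor_labels = np.array([2, 0, 0])
neighbor_dists = np.array([0.1, 1.0, 1.0])

uniform_share = vote(neighbor_labels, neighbor_dists, n_classes=3, weights="uniform")
distance_share = vote(neighbor_labels, neighbor_dists, n_classes=3, weights="distance")

print("uniform :", uniform_share.round(3), "-> class", int(uniform_share.argmax()))
print("distance:", distance_share.round(3), "-> class", int(distance_share.argmax()))
```

With equal votes the two class-0 neighbors win; with inverse-distance weights the single close class-2 neighbor outweighs them.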

## Model Deployment

### Saving the Model

In [None]:
import joblib

# Save the best model together with related metadata
model_artifacts = {
    "pipeline": best_model,            # includes its own fitted StandardScaler
    "feature_names": feature_columns,
    "target_names": category_names,
    # For reference only: statistics of the standalone scaler fit on X_train
    "scaler_params": {
        "mean": scaler.mean_,
        "scale": scaler.scale_
    }
}

joblib.dump(model_artifacts, "wine_quality_knn_pipeline.joblib")
print("Model saved to wine_quality_knn_pipeline.joblib")

Model saved to wine_quality_knn_pipeline.joblib


### Inference

In [None]:
# Load the model for inference
loaded_artifacts = joblib.load("wine_quality_knn_pipeline.joblib")
loaded_pipeline = loaded_artifacts["pipeline"]
target_names = loaded_artifacts["target_names"]

# Example: predict a new wine sample
# Feature order: fixed acidity, volatile acidity, citric acid, residual sugar,
#                chlorides, free sulfur dioxide, total sulfur dioxide, density,
#                pH, sulphates, alcohol
# Note: these values match row 0 of the dataset. With weights="distance", an exact
# match among the training neighbors takes all the weight, which can yield 1.000.
sample_wine = [[7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]]

# Run the prediction
pred_idx = loaded_pipeline.predict(sample_wine)[0]
pred_proba = loaded_pipeline.predict_proba(sample_wine)[0]

print(f"Predicted quality category: {target_names[pred_idx]}")
print("Class probabilities:")
for i, prob in enumerate(pred_proba):
    print(f"  {target_names[i]}: {prob:.3f}")

Predicted quality category: Low
Class probabilities:
  Low: 1.000
  Medium: 0.000
  High: 0.000
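Because the pipeline receives a plain list, the feature order in `sample_wine` has to match the training order exactly. One way to guard against mistakes is to build the row from a name-to-value mapping using the saved `feature_names`. The helper `to_feature_row` below is a hypothetical sketch, not part of the saved artifacts.

```python
# Feature order used in training (same as feature_columns above)
feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

def to_feature_row(sample, names):
    """Order a {feature name: value} dict into a single-row 2D list."""
    missing = [n for n in names if n not in sample]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [[float(sample[n]) for n in names]]

# Keys may be given in any order
sample = {
    "alcohol": 9.4, "pH": 3.51, "sulphates": 0.56, "density": 0.9978,
    "fixed acidity": 7.4, "volatile acidity": 0.7, "citric acid": 0.0,
    "residual sugar": 1.9, "chlorides": 0.076,
    "free sulfur dioxide": 11.0, "total sulfur dioxide": 34.0,
}

row = to_feature_row(sample, feature_names)
print(row)
```

The resulting `row` can be passed to `loaded_pipeline.predict` in place of a hand-ordered list.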


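## Conclusion

KNN was applied to the Wine Quality dataset to classify red wines into three quality tiers:

- Baseline performance: the default k=5 reached 60.3% test accuracy.
- Tuned performance: after GridSearchCV, test accuracy rose to 61.4% (mean CV accuracy 0.604). The gain of 1.1 percentage points corresponds to about 3 of the 272 test samples, so the improvement is small.
- Model characteristics: KNN is sensitive to feature scaling, so StandardScaler is essential. The grid search selected k=19 with distance weighting and Euclidean distance. The high-quality class remains the weakest (recall 0.43, F1 0.46).
- Practical use: the model could serve as a starting point for automated quality grading at a winery, although its current accuracy would need to improve first.

Future improvements:
- Try feature engineering (e.g. polynomial features, feature selection)
- Consider other algorithms (Random Forest, SVM)
- Address class imbalance (e.g. SMOTE)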
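As a rough sketch of how the alternative algorithms could be compared, the snippet below cross-validates KNN, Random Forest, and an RBF SVM on synthetic data. Everything here is an assumption for illustration: the data comes from `make_classification` with class weights loosely mirroring the wine class balance, `class_weight="balanced"` stands in for resampling techniques such as SMOTE, and the scores say nothing about performance on the actual wine data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data with 11 features and an imbalanced class mix
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_redundant=2,
    n_classes=3, weights=[0.47, 0.39, 0.14], random_state=42,
)

candidates = {
    "knn": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=19, weights="distance")
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(class_weight="balanced")),
}

# Macro F1 treats all classes equally, so the minority class is not ignored
scores = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="f1_macro").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

Swapping `X_syn, y_syn` for `X_train, y_train` would run the same comparison on the wine data.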