## Load sample dataset

In [2]:
from pycaret.datasets import get_data
data = get_data('diabetes')

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## Functional API

In [3]:
from pycaret.classification import *
s = setup(data, target = 'Class variable', session_id = 123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Class variable
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


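## 📊 Understanding the summary table returned by PyCaret `setup()`

PyCaret displays this table after `setup()` runs. It summarizes the preprocessing and modeling configuration. The output above shows only rows 0 to 9; the full table continues through row 18:

| No. | Description | Value | Explanation |
|------|----------------------|-------------|-----------|
| 0 | Session id | 123 | Random seed that makes results reproducible. Reusing the same value gives the same data split and cross-validation folds. |
| 1 | Target | Class variable | Name of the target variable you specified (the column to predict). |
| 2 | Target type | Binary | Binary classification, for example 0/1 or yes/no. |
| 3 | Original data shape | (768, 9) | The original data has 768 rows and 9 columns (including the target). |
| 4 | Transformed data shape | (768, 9) | Shape after preprocessing. It is unchanged here, so preprocessing neither added nor dropped rows or columns. |
| 5 | Transformed train set shape | (537, 9) | Training set (70%): 537 rows. |
| 6 | Transformed test set shape | (231, 9) | Test set (30%): 231 rows. |
| 7 | Numeric features | 8 | 8 numeric feature columns (the remaining column is the target). |
| 8 | Preprocess | True | The preprocessing pipeline is enabled (imputation and, where applicable, encoding). Feature scaling is a separate option and is not applied by default. |
| 9 | Imputation type | simple | Missing-value strategy: simple imputation. |
| 10 | Numeric imputation | mean | Missing numeric values are filled with the column mean. |
| 11 | Categorical imputation | mode | Missing categorical values are filled with the mode (the most frequent category). |
| 12 | Fold Generator | StratifiedKFold | Stratified K-Fold cross-validation, which keeps class proportions consistent across folds. |
| 13 | Fold Number | 10 | 10-fold cross-validation. |
| 14 | CPU Jobs | -1 | Use all CPU cores to speed up training. |
| 15 | Use GPU | False | GPU is not used. It can be enabled manually for models that support it. |
| 16 | Log Experiment | False | Experiment logging (for example MLflow) is off. |
| 17 | Experiment Name | clf-default-name | Default experiment name (customizable). |
| 18 | USI | 238b | Unique Session Identifier, used to track the experiment. |

---

📌 **Note**: This table appears only when `setup()` completes successfully, which confirms that the preprocessing and modeling environment is ready.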
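The split sizes and the idea behind stratified folds can be checked with plain Python. The sketch below is a simplified illustration, not PyCaret's or scikit-learn's implementation: `split_sizes` and `stratified_folds` are hypothetical helpers, the labels are made up, and the round-robin assignment only mimics what `StratifiedKFold` achieves (equal class proportions per fold, reproducible under a fixed seed).

```python
import math
import random
from collections import Counter, defaultdict


def split_sizes(n_rows, train_size=0.7):
    """Return (n_train, n_test) for a fractional train size."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train


def stratified_folds(labels, n_folds, seed):
    """Assign sample indices to folds class by class, round-robin."""
    rng = random.Random(seed)  # fixed seed -> identical folds on every run
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


n_train, n_test = split_sizes(768)
print(n_train, n_test)  # 537 231

# Hypothetical labels: 20 negatives and 10 positives.
labels = [0] * 20 + [1] * 10
folds = stratified_folds(labels, n_folds=5, seed=123)
for fold in folds:
    print(Counter(labels[i] for i in fold))  # each fold: 4 zeros, 2 ones
```

Calling `stratified_folds` again with the same seed returns the same folds, which is the role `session_id` plays in the table above.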

## Compare Models

In [4]:
# functional API
best = compare_models()
print(best)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7689,0.8047,0.5602,0.7208,0.6279,0.4641,0.4736,0.812
ridge,Ridge Classifier,0.767,0.806,0.5497,0.7235,0.6221,0.4581,0.469,0.006
lda,Linear Discriminant Analysis,0.767,0.8055,0.555,0.7202,0.6243,0.4594,0.4695,0.006
rf,Random Forest Classifier,0.7466,0.792,0.5284,0.6795,0.5908,0.4117,0.421,0.053
nb,Naive Bayes,0.7427,0.7955,0.5702,0.6543,0.6043,0.4156,0.4215,0.006
catboost,CatBoost Classifier,0.741,0.7993,0.5278,0.663,0.5851,0.4005,0.4078,0.633
gbc,Gradient Boosting Classifier,0.7373,0.7909,0.555,0.6445,0.5931,0.4013,0.4059,0.041
ada,Ada Boost Classifier,0.7372,0.7799,0.5275,0.6585,0.5796,0.3926,0.4017,0.02
qda,Quadratic Discriminant Analysis,0.7282,0.7894,0.5281,0.6558,0.5736,0.3785,0.391,0.007
et,Extra Trees Classifier,0.7243,0.7793,0.4857,0.6419,0.5487,0.3565,0.3663,0.041


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
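The leaderboard above is ordered by Accuracy, so `best` is the top row by that metric only. As a rough illustration of how the ranking shifts with the metric, the sketch below re-sorts a few rows copied from the table using plain Python. `rank_by` is a hypothetical helper; within PyCaret itself, the `sort` argument of `compare_models` serves this purpose.

```python
# Rows copied from the leaderboard above: (id, Accuracy, AUC, Recall, F1)
leaderboard = [
    ("lr", 0.7689, 0.8047, 0.5602, 0.6279),
    ("ridge", 0.7670, 0.8060, 0.5497, 0.6221),
    ("lda", 0.7670, 0.8055, 0.5550, 0.6243),
    ("rf", 0.7466, 0.7920, 0.5284, 0.5908),
    ("nb", 0.7427, 0.7955, 0.5702, 0.6043),
]
COLUMNS = {"Accuracy": 1, "AUC": 2, "Recall": 3, "F1": 4}


def rank_by(metric, rows=leaderboard):
    """Return model ids ordered from best to worst on the given metric."""
    col = COLUMNS[metric]
    return [row[0] for row in sorted(rows, key=lambda r: r[col], reverse=True)]


print(rank_by("Accuracy"))  # ['lr', 'ridge', 'lda', 'rf', 'nb']
print(rank_by("AUC"))       # ['ridge', 'lda', 'lr', 'nb', 'rf']
print(rank_by("Recall"))    # ['nb', 'lr', 'lda', 'ridge', 'rf']
```

The top three models differ by less than 0.002 in Accuracy and AUC, so the choice among them depends on which metric matters for the task.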


## Analyze Model

In [5]:
# functional API
evaluate_model(best)

[Interactive widget output: "Plot Type" toggle buttons, starting with "Pipeline Plot"]

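## 📊 PyCaret model analysis: Logistic Regression (key plots)

### 🔄 Pipeline Plot
This plot shows the full data-processing pipeline that PyCaret builds automatically:

Raw Data → SimpleImputer (numeric) → SimpleImputer (categorical) → CleanColumnNames → LogisticRegression

- **SimpleImputer**: fills missing values, using the mean for numeric columns and the mode for categorical columns
- **CleanColumnNames**: strips characters such as square brackets, braces, commas, double quotes, and colons from column names (see the `match` pattern in the loaded pipeline output); spaces and parentheses are left intact
- **LogisticRegression**: the classification model, here logistic regression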
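The two preprocessing steps can be approximated in a few lines of standard-library Python. This is a sketch under assumptions, not PyCaret's code: `impute` and `clean_column_name` are hypothetical helpers, the sample values are made up, and the regular expression is copied from the `CleanColumnNames(match=...)` repr in the loaded pipeline output, assuming the matched characters are simply deleted.

```python
import re
from statistics import mean, mode


def impute(values, strategy):
    """Replace None entries with the mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


# Pattern copied from the CleanColumnNames repr in the loaded pipeline.
MATCH = r'[\]\[\,\{\}\"\:]+'


def clean_column_name(name):
    """Delete the characters matched by MATCH; everything else is kept."""
    return re.sub(MATCH, "", name)


print(impute([1.0, None, 3.0], "mean"))         # [1.0, 2.0, 3.0]
print(impute(["a", "b", None, "b"], "mode"))    # ['a', 'b', 'b', 'b']
print(clean_column_name('Age [years]: "raw"'))  # Age years raw
print(clean_column_name("Age (years)"))         # Age (years)
```

The last line shows why the diabetes column names keep their spaces and parentheses in the prediction outputs: those characters are not in the pattern.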

---

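### ⚙️ Hyperparameters
The Logistic Regression model was created with the following hyperparameters, all left at the values PyCaret assigned:

| Parameter | Description |
|------|------|
| C = 1.0 | Inverse of regularization strength (smaller values mean stronger regularization) |
| penalty = l2 | L2 regularization |
| solver = lbfgs | L-BFGS optimizer, a limited-memory quasi-Newton method |
| max_iter = 1000 | Maximum number of solver iterations |
| fit_intercept = True | Fit an intercept term |
| random_state = 123 | Random seed for reproducibility (matches `session_id`) |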
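To make the role of `C` concrete, the sketch below evaluates an L2-penalized logistic regression objective of the form `0.5 * ||w||^2 + C * (sum of log-losses)`, which is the formulation scikit-learn documents for this penalty. The function name, weights, and toy data are hypothetical; the point is only that `C` scales the data term, so a small `C` lets the penalty dominate.

```python
import math


def l2_logistic_objective(weights, intercept, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of log-losses, with labels y in {0, 1}."""
    penalty = 0.5 * sum(w * w for w in weights)  # intercept is not penalized
    loss = 0.0
    for row, label in zip(X, y):
        z = intercept + sum(w * x for w, x in zip(weights, row))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted P(y = 1)
        loss -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return penalty + C * loss


# Hypothetical toy data: two features, four samples.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
y = [0, 0, 1, 1]
w, b = [1.0, 0.5], -2.0

for C in (0.1, 1.0, 10.0):
    print(C, round(l2_logistic_objective(w, b, X, y, C), 4))
```

With the weights fixed, the penalty term stays at 0.625 while the data term grows in proportion to `C`.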

---

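### 📈 AUC (ROC Curve)

This plot shows how well the Logistic Regression model separates the classes across decision thresholds.

It contains four ROC curves:

- **ROC of class 0 (AUC = 0.86)**  
  Treats class 0 as the positive class and plots the true positive rate against the false positive rate.

- **ROC of class 1 (AUC = 0.86)**  
  Treats class 1 as the positive class.

- **micro-average ROC (AUC = 0.86)**  
  Pools the TP, FP, FN, and TN counts over all samples before computing the curve. With imbalanced classes it leans toward the larger class.

- **macro-average ROC (AUC = 0.86)**  
  Computes the AUC for each class separately and averages them with equal weight, **ignoring class sizes**.

📌 **Takeaway**:  
All four AUC values are 0.86, which indicates that the model discriminates well between class 0 and class 1. In a binary problem where the class 0 score is one minus the class 1 score, the two per-class AUCs are equal by construction, so their agreement alone does not show balanced performance; the confusion matrix gives a clearer per-class view.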
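The equality of the class 0 and class 1 AUCs in a binary problem can be verified directly. The sketch below computes AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative (the Mann-Whitney formulation). The `roc_auc` helper and the six probabilities are hypothetical, chosen only for illustration.

```python
def roc_auc(scores, labels, positive):
    """AUC as P(random positive outranks random negative); ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of class 1 for six samples.
p_class1 = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
p_class0 = [1.0 - p for p in p_class1]  # class 0 score is the complement

auc_1 = roc_auc(p_class1, labels, positive=1)
auc_0 = roc_auc(p_class0, labels, positive=0)
print(round(auc_1, 4), round(auc_0, 4))  # 0.8889 0.8889
```

Flipping the positive class and complementing the scores reverses every pairwise comparison twice, so the count of winning pairs is unchanged.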

---

### 📊 Confusion Matrix

|               | Predicted 0 | Predicted 1 |
|---------------|--------|--------|
| **Actual 0**     | 132    | 18     |
| **Actual 1**     | 38     | 43     |

- **True Negatives (TN) = 132**  
  Actual 0, predicted 0 → ✅ no event, correctly identified

- **False Positives (FP) = 18**  
  Actual 0, predicted 1 → ❌ wrongly flagged as an event (false alarm)

- **False Negatives (FN) = 38**  
  Actual 1, predicted 0 → ❌ an event that the model missed

- **True Positives (TP) = 43**  
  Actual 1, predicted 1 → ✅ an event that the model caught

#### 📌 Summary and implications:

- The model predicts **class 0** (no event) well: TN = 132 against only 18 FP
- It is weaker on **class 1**: it misses 38 cases (FN) and catches only 43 (TP)

The same counts yield the usual metrics:

- **Accuracy** = (TP + TN) / total = (132 + 43) / 231 ≈ 0.758
- **Precision (class 1)** = TP / (TP + FP) = 43 / (43 + 18) ≈ 0.705
- **Recall (class 1)** = TP / (TP + FN) = 43 / (43 + 38) ≈ 0.531
- **F1 Score** = harmonic mean of precision and recall ≈ 0.606

These numbers also appear in the Class Report.
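The four formulas above can be checked with a few lines of Python. This is a plain arithmetic sketch using the counts from the confusion matrix; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above.
TN, FP, FN, TP = 132, 18, 38, 43

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)  # of the predicted 1s, how many are real
recall = TP / (TP + FN)     # of the actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.7576 precision=0.7049 recall=0.5309 f1=0.6056
```

These values match the Accuracy, Prec., Recall, and F1 columns that `predict_model(best)` reports for the test set.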

---

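### 🧭 Threshold Plot

This plot shows **how the classification threshold affects precision, recall, and F1 score**.
It helps identify **the threshold that best balances these metrics**.

---

#### 🔍 The three main curves:

| Curve | Metric |
|----------|-----------|
| Blue     | **Precision**: of the samples predicted as 1, the share that are truly 1 |
| Green     | **Recall**: of the samples that are truly 1, the share that were found |
| Red     | **F1 Score**: harmonic mean of precision and recall (a combined metric) |

- **Best threshold ≈ 0.48** (where the F1 score peaks)

You can pick the threshold from this plot and then use it for prediction: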

```python
predict_model(best, probability_threshold=0.48)
```
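The idea behind reading the best threshold off the F1 curve can be sketched as a simple sweep. The probabilities and labels below are hypothetical, and `f1_at` is an illustrative helper that predicts class 1 whenever the probability is at or above the threshold; it is not how the plot itself is produced.

```python
def f1_at(threshold, probs, labels):
    """F1 for class 1 when predicting 1 whenever prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical class-1 probabilities and true labels.
probs = [0.95, 0.80, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

candidates = [t / 100 for t in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
best_threshold = max(candidates, key=lambda t: f1_at(t, probs, labels))
print(best_threshold, round(f1_at(best_threshold, probs, labels), 4))
# 0.35 0.8889
```

A threshold tuned this way should be chosen on validation data rather than on the test set, otherwise the reported test metrics become optimistic.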

---

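### 🧪 Class Report

| Class | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| 1    | 0.705     | 0.531  | 0.606    | 81      |
| 0    | 0.776     | 0.880  | 0.825    | 150     |

- Class 0 is classified more reliably; class 1 is weaker, mainly in recall (0.531)
- The test set is imbalanced (150 samples of class 0 against 81 of class 1), and the model favors the majority class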
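Every row of the class report follows from the confusion matrix by treating each class in turn as the positive one. The sketch below assumes the matrix layout used earlier (rows are actual classes, columns are predicted classes); `class_report` is a hypothetical helper, not the function PyCaret calls.

```python
# Confusion matrix from the section above: rows = actual, columns = predicted.
matrix = [[132, 18],
          [38, 43]]


def class_report(matrix):
    """Per-class precision, recall, F1, and support from a confusion matrix."""
    report = {}
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column total
        support = sum(matrix[k])                     # row total
        precision = tp / predicted_k
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = (round(precision, 3), round(recall, 3), round(f1, 3), support)
    return report


for label, row in class_report(matrix).items():
    print(label, row)
# 0 (0.776, 0.88, 0.825, 150)
# 1 (0.705, 0.531, 0.606, 81)
```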

---


## Predictions

In [6]:
# functional API
predict_model(best)  # no data argument: scores the hold-out test set

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7576,0.8568,0.5309,0.7049,0.6056,0.4356,0.4447


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
552,6,114,88,0,0,27.799999,0.247,66,0,0,0.8037
438,1,97,70,15,0,18.200001,0.147,21,0,0,0.9648
149,2,90,70,17,0,27.299999,0.085,22,0,0,0.9394
373,2,105,58,40,94,34.900002,0.225,25,0,0,0.7999
36,11,138,76,0,0,33.200001,0.420,35,0,1,0.6393
...,...,...,...,...,...,...,...,...,...,...,...
85,2,110,74,29,125,32.400002,0.698,27,0,0,0.8002
7,10,115,0,0,0,35.299999,0.134,29,0,1,0.6230
298,14,100,78,25,184,36.599998,0.412,46,1,0,0.5984
341,1,95,74,21,73,25.900000,0.673,36,0,0,0.9244
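The Kappa and MCC columns in the metrics row above are not derived elsewhere in this notebook. A minimal sketch, assuming they follow the standard definitions of Cohen's kappa and the Matthews correlation coefficient, reproduces both values from the test-set confusion matrix counts (TN = 132, FP = 18, FN = 38, TP = 43):

```python
import math

TN, FP, FN, TP = 132, 18, 38, 43
n = TN + FP + FN + TP

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
p_observed = (TP + TN) / n
p_chance = ((TN + FP) * (TN + FN) + (TP + FN) * (TP + FP)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

# Matthews correlation coefficient: correlation between actual and predicted labels.
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(round(kappa, 4), round(mcc, 4))  # 0.4356 0.4447
```

Both metrics account for class imbalance, which is why they sit well below the 0.7576 accuracy.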


In [7]:
# functional API
print(data.shape)  # Print the shape (rows, columns) of the dataset
# Score the full dataset with the best model; it includes the training rows,
# so the reported metrics are not a hold-out estimate
predictions = predict_model(best, data=data)
predictions.head()

(768, 9)


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7773,0.8357,0.5709,0.7321,0.6415,0.4836,0.4915


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
0,6,148,72,35,0,33.599998,0.627,50,1,1,0.694
1,1,85,66,29,0,26.6,0.351,31,0,0,0.9419
2,8,183,64,0,0,23.299999,0.672,32,1,1,0.7976
3,1,89,66,23,94,28.1,0.167,21,0,0,0.9454
4,0,137,40,35,168,43.099998,2.288,33,1,1,0.8394


In [8]:
# functional API
predictions = predict_model(best, data=data, raw_score=True) # return raw class probabilities
predictions.head()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7773,0.8357,0.5709,0.7321,0.6415,0.4836,0.4915


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score_0,prediction_score_1
0,6,148,72,35,0,33.599998,0.627,50,1,1,0.306,0.694
1,1,85,66,29,0,26.6,0.351,31,0,0,0.9419,0.0581
2,8,183,64,0,0,23.299999,0.672,32,1,1,0.2024,0.7976
3,1,89,66,23,94,28.1,0.167,21,0,0,0.9454,0.0546
4,0,137,40,35,168,43.099998,2.288,33,1,1,0.1606,0.8394
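Comparing the two outputs suggests how the columns relate: `prediction_score` is the probability of whichever class was predicted, while `raw_score=True` exposes both class probabilities. The sketch below encodes that reading as a hypothetical helper, assuming the default 0.5 threshold and that a probability equal to the threshold predicts class 1; it is inferred from the rows shown, not taken from PyCaret's source.

```python
def label_and_score(prob_0, prob_1, threshold=0.5):
    """Derive (prediction_label, prediction_score) from two class probabilities."""
    label = 1 if prob_1 >= threshold else 0
    score = prob_1 if label == 1 else prob_0  # probability of the predicted class
    return label, score


# (prediction_score_0, prediction_score_1) pairs from the first rows above.
raw = [(0.306, 0.694), (0.9419, 0.0581), (0.2024, 0.7976)]
for p0, p1 in raw:
    print(label_and_score(p0, p1))
# (1, 0.694)
# (0, 0.9419)
# (1, 0.7976)
```

The results match the `prediction_label` and `prediction_score` columns of the earlier output for the same rows.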


## Save the model


In [9]:
# functional API
save_model(best, 'best_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Number of times pregnant',
                                              'Plasma glucose concentration a 2 '
                                              'hours in an oral glucose '
                                              'tolerance test',
                                              'Diastolic blood pressure (mm Hg)',
                                              'Triceps skin fold thickness (mm)',
                                              '2-Hour serum insulin (mu U/ml)',
                                              'Body mass index (weight in '
                                              'kg/(height in m)^2)',
                                              'Diabetes pedigre...
                  TransformerWrapper(exclude=None, include=None,
                                     transformer=CleanC

## Load the model back into the environment

In [10]:
# functional API
loaded_model = load_model('best_model')
print(loaded_model)

Transformation Pipeline and Model Successfully Loaded
Pipeline(memory=FastMemory(location=/var/folders/ty/th__2_pn1js9hzn2y7l830wr0000gn/T/joblib),
         steps=[('numerical_imputer',
                 TransformerWrapper(exclude=None,
                                    include=['Number of times pregnant',
                                             'Plasma glucose concentration a 2 '
                                             'hours in an oral glucose '
                                             'tolerance test',
                                             'Diastolic blood pressure (mm Hg)',
                                             'Triceps skin fold thickness (mm)',
                                             '2-Hour serum insulin (mu U/ml)',
                                             'Body...
                 TransformerWrapper(exclude=None, include=None,
                                    transformer=CleanColumnNames(match='[\\]\\[\\,\\{\\}\\"\\:]+'))),
             