# **1. Import Libraries**

In this step, we import the Python libraries needed for data analysis and for building the machine learning models.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import (train_test_split, StratifiedKFold, 
                                    StratifiedGroupKFold, cross_val_score, KFold)
from sklearn.preprocessing import (LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler, 
                                   RobustScaler, OrdinalEncoder, TargetEncoder)
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report)
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
import optuna
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

# **2. Loading the Dataset from the Clustering Results**

Load the clustering output from a CSV file into a DataFrame.

In [2]:
df = pd.read_csv('beverage_labeled.csv')
# df.info() prints its summary and returns None, which is why "None" appears in the output
display(df.info(), df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Product          5000 non-null   float64
 1   Unit_Price       5000 non-null   float64
 2   Quantity         5000 non-null   float64
 3   Discount         5000 non-null   float64
 4   Total_Price      5000 non-null   float64
 5   Region           5000 non-null   float64
 6   Last_Order_Date  5000 non-null   float64
 7   Cluster          5000 non-null   int64  
 8   Customer_Type    5000 non-null   object 
 9   Category         5000 non-null   object 
dtypes: float64(7), int64(1), object(2)
memory usage: 390.8+ KB


None

|   | Product | Unit_Price | Quantity | Discount | Total_Price | Region | Last_Order_Date | Cluster | Customer_Type | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 107.0 | 1.12 | 59.0 | 0.05 | 62.78 | 289.0 | 379.0 | 0 | B2B | Soft Drinks |
| 1 | 153.0 | 0.89 | 1.0 | 0.0 | 0.89 | 317.0 | 646.0 | 0 | B2C | Water |
| 2 | 59.0 | 50.25 | 11.0 | 0.0 | 552.75 | 303.0 | 644.0 | 0 | B2C | Alcoholic Beverages |
| 3 | 63.0 | 14.26 | 11.0 | 0.0 | 156.86 | 331.0 | 27.0 | 0 | B2C | Alcoholic Beverages |
| 4 | 118.0 | 1.99 | 7.0 | 0.0 | 13.93 | 303.0 | 780.0 | 0 | B2C | Soft Drinks |


# **3. Data Splitting**

This step splits the dataset into two parts: a training set and a test set. The split is stratified on `Cluster` so that both parts keep the same class proportions.

In [3]:
X = df.drop(columns='Cluster')  # `columns=` already selects the axis, so `axis=1` is redundant
y = df['Cluster']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

print(f'y_train value counts:\n{y_train.value_counts(normalize=True)}')
print(f'y_test value counts:\n{y_test.value_counts(normalize=True)}')

y_train value counts:
Cluster
0    0.97425
1    0.02575
Name: proportion, dtype: float64
y_test value counts:
Cluster
0    0.974
1    0.026
Name: proportion, dtype: float64
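
The proportions above show a heavily imbalanced target: about 97.4% of the rows belong to cluster 0. As a rough sanity check, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The counts (974 and 26) correspond to the 1000-row test split, and the function name `majority_baseline_accuracy` is only illustrative.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Test split from above: 1000 rows, 97.4% cluster 0 and 2.6% cluster 1.
y_test_labels = [0] * 974 + [1] * 26
baseline = majority_baseline_accuracy(y_test_labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Any useful model has to beat this baseline clearly, which is why per-class recall and F1 are more telling than accuracy on this dataset.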


# **4. Building the Classification Model**


## **a. Building the Classification Model**

After choosing a suitable classification algorithm, the next step is to train the model on the training data.

The recommended steps are:
1. Choose a suitable classification algorithm, such as Logistic Regression, Decision Tree, Random Forest, or K-Nearest Neighbors (KNN).
2. Train the model on the training data.

In [8]:
preprocessor = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['Customer_Type', 'Category']) 
],
    remainder='passthrough',
    verbose_feature_names_out=False).set_output(transform='pandas')

X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

display(X_train_prep.shape, X_test_prep.shape, X_train_prep, X_test_prep)

(4000, 13)

(1000, 13)

|   | Customer_Type_B2B | Customer_Type_B2C | Category_Alcoholic Beverages | Category_Juices | Category_Soft Drinks | Category_Water | Product | Unit_Price | Quantity | Discount | Total_Price | Region | Last_Order_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1491 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 153.0 | 0.96 | 16.0 | 0.05 | 14.59 | 311.0 | 147.0 |
| 1573 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 131.0 | 1.61 | 13.0 | 0.00 | 20.93 | 296.0 | 1061.0 |
| 2337 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 82.0 | 0.97 | 35.0 | 0.05 | 32.25 | 315.0 | 893.0 |
| 2297 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 149.0 | 1.07 | 13.0 | 0.00 | 13.91 | 337.0 | 265.0 |
| 3298 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 172.0 | 2.94 | 74.0 | 0.10 | 195.80 | 329.0 | 974.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3878 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 74.0 | 1.08 | 7.0 | 0.00 | 7.56 | 321.0 | 45.0 |
| 2251 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 176.0 | 2.93 | 11.0 | 0.00 | 32.23 | 331.0 | 776.0 |
| 3480 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 139.0 | 0.47 | 8.0 | 0.00 | 3.76 | 315.0 | 679.0 |
| 2840 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 139.0 | 1.46 | 6.0 | 0.00 | 8.76 | 302.0 | 335.0 |


|   | Customer_Type_B2B | Customer_Type_B2C | Category_Alcoholic Beverages | Category_Juices | Category_Soft Drinks | Category_Water | Product | Unit_Price | Quantity | Discount | Total_Price | Region | Last_Order_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 960 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 139.0 | 1.38 | 2.0 | 0.00 | 2.76 | 337.0 | 1014.0 |
| 1859 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 149.0 | 1.52 | 72.0 | 0.10 | 98.50 | 307.0 | 891.0 |
| 3913 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 169.0 | 2.49 | 70.0 | 0.10 | 156.87 | 296.0 | 538.0 |
| 4249 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 64.0 | 1.52 | 96.0 | 0.10 | 131.33 | 311.0 | 1023.0 |
| 4545 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 172.0 | 3.23 | 3.0 | 0.00 | 9.69 | 331.0 | 736.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1041 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 149.0 | 1.59 | 38.0 | 0.15 | 51.36 | 315.0 | 1011.0 |
| 2194 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 174.0 | 2.80 | 95.0 | 0.10 | 239.40 | 329.0 | 460.0 |
| 2105 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 103.0 | 3.01 | 27.0 | 0.05 | 77.21 | 307.0 | 275.0 |
| 4596 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 177.0 | 2.81 | 11.0 | 0.10 | 27.82 | 315.0 | 1056.0 |


In [9]:
# Helper function for evaluation: prints weighted metrics, the per-class report, and the confusion matrix

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print('==='*100)
    print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
    print(f'Precision: {precision_score(y_test, y_pred, average="weighted")}')
    print(f'Recall: {recall_score(y_test, y_pred, average="weighted")}')
    print(f'F1 Score: {f1_score(y_test, y_pred, average="weighted")}')
    print('==='*100)
    print(classification_report(y_test, y_pred))
    print('==='*100)
    print(confusion_matrix(y_test, y_pred))
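
`evaluate` reports `average="weighted"` scores, which weight each class's score by its support, so the majority class dominates the headline numbers. The toy example below (hypothetical labels, not taken from the dataset) sketches that effect in plain Python for recall; the helper names are illustrative only.

```python
def recall_per_class(y_true, y_pred):
    """Return {label: recall} and {label: support} for each label in y_true."""
    recalls, supports = {}, {}
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        supports[label] = len(idx)
        recalls[label] = hits / len(idx)
    return recalls, supports

def weighted_average(scores, supports):
    """Support-weighted mean, mirroring what average="weighted" does."""
    total = sum(supports.values())
    return sum(scores[k] * supports[k] for k in scores) / total

# Hypothetical toy labels: 8 negatives, 2 positives, one positive missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]

recalls, supports = recall_per_class(y_true, y_pred)
weighted_recall = weighted_average(recalls, supports)
print(recalls)          # per-class recall
print(weighted_recall)  # support-weighted recall
```

In this toy case the weighted recall stays high even though half of the positives are missed, which is why the per-class rows of `classification_report` deserve attention on imbalanced data.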

## **b. Evaluating the Classification Model**

The **recommended** steps are:
1. Generate predictions on the test data.
2. Compute evaluation metrics such as Accuracy and F1-Score (optional: Precision and Recall).
3. Build a confusion matrix to inspect correct and incorrect predictions in detail.

### XGBoost

In [12]:
xgb = XGBClassifier(n_estimators=10000,
                    eval_metric='auc',
                    learning_rate=0.01,
                    random_state=42, 
                    n_jobs=-1,
                    early_stopping_rounds=1000)

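# Caveat: the test set doubles as the early-stopping eval set here, so the
# test metrics below can be optimistic; a separate validation split avoids this.
xgb.fit(X_train_prep, y_train,
        verbose=100,
        eval_set=[(X_test_prep, y_test)])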

evaluate(xgb, X_test_prep, y_test)

[0]	validation_0-auc:0.99931
[100]	validation_0-auc:0.99992
[200]	validation_0-auc:0.99992
[300]	validation_0-auc:1.00000
[400]	validation_0-auc:1.00000
[500]	validation_0-auc:1.00000
[600]	validation_0-auc:1.00000
[700]	validation_0-auc:1.00000
[800]	validation_0-auc:1.00000
[900]	validation_0-auc:1.00000
[1000]	validation_0-auc:1.00000
[1100]	validation_0-auc:1.00000
[1200]	validation_0-auc:1.00000
[1240]	validation_0-auc:1.00000
Accuracy: 0.999
Precision: 0.9990370370370371
Recall: 0.999
F1 Score: 0.9990091771569225
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       974
           1       0.96      1.00      0.98        26

    accuracy                           1.00      1000
   macro avg       0.98      1.00      0.99      1000
weighted avg       1.00      1.00      1.00      1000

[[973   1]
 [  0  26]]
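
In the cell above, the test set is also passed as `eval_set`, so the early-stopping point is chosen using the same data that produces the final metrics. One common alternative, sketched below on synthetic stand-in data (the names `X_demo`, `y_demo`, `X_fit`, and `X_val` are hypothetical), is to carve a validation split out of the training data and reserve the test set for the final evaluation only.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5% positives (not the beverage dataset).
X_demo = [[i, i % 7] for i in range(1000)]
y_demo = [1 if i % 20 == 0 else 0 for i in range(1000)]

# First hold out the test set, as in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Then split the training part again to get a validation set for early stopping.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, random_state=42, stratify=y_tr)

print(len(X_fit), len(X_val), len(X_te))

# The validation split would then replace the test set in eval_set, e.g.:
# xgb.fit(X_fit, y_fit, eval_set=[(X_val, y_val)], verbose=100)
# evaluate(xgb, X_te, y_te)
```

With this layout the test set is touched only once, after the number of boosting rounds has been fixed.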


### CatBoost

In [13]:
cat = CatBoostClassifier(n_estimators=10000,
                         eval_metric='AUC',
                         learning_rate=0.01,
                         random_state=42,
                         early_stopping_rounds=1000,
                         verbose=100)
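# Note: eval_set is the test set, so early stopping is tuned on the data used for the final metrics.
cat.fit(X_train_prep, y_train,
        eval_set=[(X_test_prep, y_test)])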

evaluate(cat, X_test_prep, y_test)

0:	test: 0.9788343	best: 0.9788343 (0)	total: 176ms	remaining: 29m 15s
100:	test: 1.0000000	best: 1.0000000 (22)	total: 995ms	remaining: 1m 37s
200:	test: 1.0000000	best: 1.0000000 (22)	total: 1.86s	remaining: 1m 30s
300:	test: 1.0000000	best: 1.0000000 (22)	total: 2.69s	remaining: 1m 26s
400:	test: 1.0000000	best: 1.0000000 (22)	total: 3.47s	remaining: 1m 23s
500:	test: 1.0000000	best: 1.0000000 (22)	total: 4.28s	remaining: 1m 21s
600:	test: 1.0000000	best: 1.0000000 (22)	total: 5.08s	remaining: 1m 19s
700:	test: 1.0000000	best: 1.0000000 (22)	total: 5.74s	remaining: 1m 16s
800:	test: 1.0000000	best: 1.0000000 (22)	total: 6.7s	remaining: 1m 16s
900:	test: 1.0000000	best: 1.0000000 (22)	total: 7.59s	remaining: 1m 16s
1000:	test: 1.0000000	best: 1.0000000 (22)	total: 8.35s	remaining: 1m 15s
Stopped by overfitting detector  (1000 iterations wait)

bestTest = 1
bestIteration = 22

Shrink model to first 23 iterations.
Accuracy: 0.998
Precision: 0.9980040983606558
Recall: 0.998
F1 Score: 0.

## **Analysis of the Classification Model Evaluation Results**

### 1. **Overview of the Evaluation Results**
In this scenario, we compare two models, **XGBoost** and **CatBoost**, using common evaluation metrics: *accuracy*, *precision*, *recall*, *F1-score*, and the *confusion matrix*. The results are summarized below:

1. **XGBoost**  
   - **Confusion Matrix**  

     $$
     \begin{bmatrix}
       973 & 1 \\
       0   & 26
     \end{bmatrix}
     $$

     - True Negative (TN) = 973  
     - False Positive (FP) = 1  
     - False Negative (FN) = 0  
     - True Positive (TP) = 26  
   - **Accuracy** = (973 + 26) / 1000 = 0.999  
   - **Precision (class 1)** = 26 / (26 + 1) ≈ 0.963  
   - **Recall (class 1)** = 26 / (26 + 0) = 1.00  
   - **F1-Score (class 1)** ≈ 0.981

2. **CatBoost**  
   - **Confusion Matrix**  

     $$
     \begin{bmatrix}
       974 & 0 \\
       2   & 24
     \end{bmatrix}
     $$

     - True Negative (TN) = 974  
     - False Positive (FP) = 0  
     - False Negative (FN) = 2  
     - True Positive (TP) = 24  
   - **Accuracy** = (974 + 24) / 1000 = 0.998  
   - **Precision (class 1)** = 24 / (24 + 0) = 1.00  
   - **Recall (class 1)** = 24 / (24 + 2) ≈ 0.923  
   - **F1-Score (class 1)** = 2 · 24 / (2 · 24 + 0 + 2) = 0.960

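Overall, both models reach an *accuracy* above 0.99, with XGBoost (0.999) marginally ahead of CatBoost (0.998). Because class 0 makes up 97.4% of the test set, a model that always predicts class 0 would already score 0.974, so the class 1 metrics are the more informative comparison.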
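
The per-class figures above follow directly from the confusion matrix entries. As a minimal sketch (the helper name `positive_class_metrics` is illustrative), the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall) can be checked in a few lines:

```python
def positive_class_metrics(tn, fp, fn, tp):
    """Accuracy plus precision, recall, and F1 for the positive class (class 1)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Confusion matrix entries as reported above.
xgb_metrics = positive_class_metrics(tn=973, fp=1, fn=0, tp=26)
cat_metrics = positive_class_metrics(tn=974, fp=0, fn=2, tp=24)

for name, m in [("XGBoost", xgb_metrics), ("CatBoost", cat_metrics)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The helper assumes at least one predicted and one actual positive; otherwise the divisions would need a guard.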

---

### 2. **Analysis Based on the Confusion Matrix**

1. **XGBoost**  
   - TN = 973, TP = 26, FP = 1, FN = 0  
   - There is a single misclassification: one class 0 instance predicted as class 1 (FP).  
   - The model never misses an actual class 1 instance (FN = 0).  
   - This shows that the model is highly sensitive to class 1 (recall of 1.00).

2. **CatBoost**  
   - TN = 974, TP = 24, FP = 0, FN = 2  
   - The model never flags a class 0 instance as class 1 (FP = 0), but it predicts 2 class 1 instances as class 0 (FN).  
   - Its *sensitivity* to class 1 is therefore slightly lower, since 2 of the 26 class 1 instances go undetected.

**Confusion Matrix Summary**  
- **XGBoost** is better at detecting class 1 (no FN) but produces 1 *false positive*.  
- **CatBoost** produces no *false positives* but misses 2 class 1 instances (FN).  
- Overall, the *trade-off* is between FP (XGBoost) and FN (CatBoost).
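
One way to make this trade-off concrete is to attach a cost to each error type and compare the totals. The sketch below uses made-up costs (`cost_fp` and `cost_fn` are assumptions for illustration, not values from the analysis) together with the FP and FN counts reported above.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's errors under per-error costs."""
    return fp * cost_fp + fn * cost_fn

# FP/FN counts from the two confusion matrices.
errors = {"XGBoost": {"fp": 1, "fn": 0}, "CatBoost": {"fp": 0, "fn": 2}}

# Hypothetical scenario: a missed class 1 instance costs 5x a false alarm.
costs = {name: total_error_cost(e["fp"], e["fn"], cost_fp=1.0, cost_fn=5.0)
         for name, e in errors.items()}
cheapest = min(costs, key=costs.get)
print(costs, "->", cheapest)
```

With these counts, CatBoost comes out ahead only when a false positive costs more than twice as much as a false negative.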

---

### 3. **Analysis Based on the Classification Report**

1. **XGBoost**  
   - **Precision (class 1)**: ~0.963  
     - Still very high, despite 1 class 0 instance being predicted as class 1.  
   - **Recall (class 1)**: 1.00  
     - The model misses no class 1 instances (FN = 0).  
   - **F1-Score (class 1)**: ~0.981  
     - A near-perfect combination of precision and recall.  
   - **Macro average**: 0.98 precision, 1.00 recall, and 0.99 F1 across both classes, per the classification report.

2. **CatBoost**  
   - **Precision (class 1)**: 1.00  
     - No *false positives* for class 1.  
   - **Recall (class 1)**: ~0.923  
     - There are 2 *false negatives*, so recall is lower than for XGBoost.  
   - **F1-Score (class 1)**: 0.960  
     - Still high, although slightly below XGBoost.  
   - **Macro average** is also high (above 0.95 for precision, recall, and F1 when computed from the confusion matrix), indicating balanced performance across both classes.

**Classification Report Summary**  
- XGBoost has the edge in recall for class 1, while CatBoost has perfect precision for class 1.  
- The performance gap is small: both models reach an *accuracy* of at least 0.998 on this test set.

---

### 4. **Conclusions and Recommendations**

1. **Conclusions**  
   - **XGBoost** reaches an *accuracy* of 0.999 and has the better *recall* on class 1 (no *false negatives*).  
   - **CatBoost** reaches an *accuracy* of 0.998 and has perfect *precision* on class 1 (no *false positives*).  
   - Both models perform very well and differ only in where they sit on the FP/FN *trade-off*.

2. **Recommendations**  
   - **Model selection**:  
     - If the priority is **not missing** any class 1 instance (for example, in fraud detection), XGBoost may be the better fit because FN = 0.  
     - If the priority is **avoiding false flags**, where class 0 is labeled as class 1, CatBoost is preferable because FP = 0.  
   - **Further evaluation**:  
     - Run *cross-validation* or test on another dataset to confirm that performance stays stable.  
   - **Further tuning**:  
     - Although the results are already good, additional *hyperparameter tuning* could improve performance further.  
   - **Collect more data**:  
     - If the dataset is not diverse enough, adding new data can improve the model's generalization.  
   - **Explore other models**:  
     - Trying similar models (LightGBM or an ensemble) can offer alternatives, especially when criteria such as inference speed or interpretability matter.

Both models therefore perform well enough to be considered for deployment. The final choice can follow the business objective (minimizing FN vs. minimizing FP).
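
The cross-validation suggested above can be sketched as follows. This is a minimal example on synthetic imbalanced data from `make_classification` with a `DecisionTreeClassifier` as a stand-in (both are assumptions for illustration; in the notebook the estimator would be the XGBoost or CatBoost model and the data `X_train_prep`, `y_train`).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 5% positives), standing in for the real features.
X_cv, y_cv = make_classification(n_samples=1000, n_features=10,
                                 weights=[0.95, 0.05], random_state=42)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Score with F1 on the positive class rather than accuracy.
scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

A large spread across folds would suggest that the single-split results reported here are not stable.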