# **1. Import Libraries**

In this step, import the Python libraries needed for data analysis and for building the machine learning models.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# **2. Load the Clustered Dataset**

Load the clustering results from the CSV file into a DataFrame.

In [2]:
train_df = pd.read_csv("clustered_co2.csv")
train_df.sample(5)

Unnamed: 0.1,Unnamed: 0,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),Fuel Consumption Comb (mpg),CO2 Emissions(g/km),Outlier,cluster
1682,1830,MERCEDES-BENZ,B 250 4MATIC,STATION WAGON - SMALL,2.0,4,AS7,Z,10.0,7.5,8.9,32,205,1,1
1354,1401,DODGE,CHARGER FFV,FULL-SIZE,3.6,6,A5,E,18.7,12.4,15.9,18,254,1,2
1914,2110,TOYOTA,HIGHLANDER AWD,SUV - STANDARD,3.5,6,AS6,X,13.0,9.8,11.6,24,267,1,0
5877,6661,CADILLAC,CT5 AWD,MID-SIZE,3.0,6,AS10,Z,13.3,9.3,11.5,25,268,1,1
4435,4872,HONDA,CIVIC SEDAN Si,MID-SIZE,1.5,4,M6,Z,8.4,6.2,7.4,38,173,1,1


# **3. Data Splitting**

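Data splitting separates the dataset into two parts: a training set and a test set. Here the `cluster` column is the target, all object-typed (categorical) columns are dropped from the features, and 20% of the rows are held out for testing.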

In [3]:
categorical_features = train_df.select_dtypes(include=['object']).columns
X = train_df.drop(columns=['cluster', *categorical_features])
y = train_df['cluster']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
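To make the feature selection above concrete, here is a minimal sketch on a hypothetical three-row frame (`toy_df` and its values are invented for illustration; only the column names mirror the real dataset). It shows that dropping object-typed columns leaves index-like columns such as `Unnamed: 0` among the features. The last step, which removes them, is an optional variant and an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical frame that mimics a few columns of the real dataset.
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Make": pd.Series(["HONDA", "TOYOTA", "DODGE"], dtype=object),
    "Engine Size(L)": [1.5, 3.5, 3.6],
    "cluster": [1, 0, 2],
})

# Same logic as the cell above: drop the target and every object-typed column.
categorical = toy_df.select_dtypes(include=["object"]).columns
X_toy = toy_df.drop(columns=["cluster", *categorical])
print(list(X_toy.columns))    # ['Unnamed: 0', 'Engine Size(L)']

# Optional variant (assumption): also drop leftover index columns.
index_like = [c for c in X_toy.columns if c.startswith("Unnamed")]
X_clean = X_toy.drop(columns=index_like)
print(list(X_clean.columns))  # ['Engine Size(L)']
```

Whether the index-like columns should stay in `X` is a judgment call; removing them would change the reported scores, so the results below apply only to the feature set as originally defined.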

In [4]:
# total number of rows
print("Total rows: ",len(X))
# number of rows in the training set (X_train)
print("Training rows: ",len(X_train))
# number of rows in the test set (X_test)
print("Test rows: ",len(X_test))

Total rows:  6282
Training rows:  5025
Test rows:  1257
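The counts above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set, which is consistent with the numbers printed above.

```python
import math

n_samples = 6282   # total rows reported above
test_size = 0.2    # fraction passed to train_test_split

# Test rows are rounded up; the training set receives the rest.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print(n_train, n_test)  # 5025 1257
```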


# **4. Building the Classification Model**


## **a. Building the Classification Model**

After choosing a suitable classification algorithm, the next step is to train the model on the training data.

The recommended steps are:
1. Choose a suitable classification algorithm, such as Logistic Regression, Decision Tree, Random Forest, or K-Nearest Neighbors (KNN).
2. Train the model on the training data.
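The two steps above can be sketched as a loop over candidate models. This is a minimal illustration on synthetic data from `make_classification`, not the CO2 dataset; the names `X_demo`, `candidates`, and `fitted` are hypothetical, and the hyperparameters are arbitrary choices made only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real features and clusters.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step 1: choose the candidate algorithms.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Step 2: train each model on the training data and report test accuracy.
fitted = {}
scores = {}
for name, model in candidates.items():
    fitted[name] = model.fit(Xd_train, yd_train)
    scores[name] = fitted[name].score(Xd_test, yd_test)
    print(name, round(scores[name], 3))
```

Keeping the models in a dictionary makes it easy to reuse the same evaluation code for each one later.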

### Random Forest

In [5]:
RF_model = RandomForestClassifier()
RF_model.fit(X_train, y_train)

RF_y_train_pred = RF_model.predict(X_train)
RF_y_test_pred = RF_model.predict(X_test)

The first model tried is a random forest with scikit-learn's default hyperparameters.

### Logistic Regression

In [6]:
LR_model = LogisticRegression(max_iter=1000)
LR_model.fit(X_train, y_train)

LR_y_train_pred = LR_model.predict(X_train)
LR_y_test_pred = LR_model.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


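Logistic regression is tried next. Even with `max_iter=1000`, the solver stops at the iteration limit, as the warning above shows; the features are passed in unscaled.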
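The warning recommends scaling the data. One way to do that, sketched here on synthetic data rather than the CO2 dataset, is to put a `StandardScaler` in front of `LogisticRegression` with `make_pipeline`. The data, the artificially inflated first feature, and the name `scaled_lr` are assumptions for illustration; whether scaling helps on the real features has not been checked here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; one feature is inflated to mimic mixed units.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_demo[:, 0] *= 1000.0

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fitted on the training data only, then applied before the classifier.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xd_train, yd_train)
test_accuracy = scaled_lr.score(Xd_test, yd_test)
print(round(test_accuracy, 3))
```

Wrapping both steps in a pipeline keeps the scaling statistics from leaking test-set information into training.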

### Decision Tree

In [7]:
DT_model = DecisionTreeClassifier()
DT_model.fit(X_train, y_train)

DT_y_train_pred = DT_model.predict(X_train)
DT_y_test_pred = DT_model.predict(X_test)

## **b. Evaluating the Classification Model**

The **recommended** steps are:
1. Make predictions on the test data.
2. Compute evaluation metrics such as Accuracy and F1-Score (optional: Precision and Recall).
3. Build a confusion matrix to see the correct and incorrect predictions in detail.
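The steps above can be sketched with `accuracy_score`, `f1_score`, and `confusion_matrix` from `sklearn.metrics`. The labels below are a tiny hand-made example (hypothetical, not taken from the dataset) so that each number can be checked by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted cluster labels for six samples.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2]

acc = accuracy_score(y_true_demo, y_pred_demo)             # 5 of 6 correct
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")
cm = confusion_matrix(y_true_demo, y_pred_demo)            # rows: true, columns: predicted

print(round(acc, 3))       # 0.833
print(round(macro_f1, 3))  # 0.822
print(cm.tolist())         # [[1, 1, 0], [0, 2, 0], [0, 0, 2]]
```

The off-diagonal `1` in the first row of the matrix is the single class-0 sample that was predicted as class 1.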

### Random Forest

In [8]:

print("Training Set Classification Report:")
print(classification_report(y_train, RF_y_train_pred))
print( '-' * 50)
print("Test Set Classification Report:")
print(classification_report(y_test, RF_y_test_pred))

Training Set Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2544
           1       1.00      1.00      1.00      2213
           2       1.00      1.00      1.00       268

    accuracy                           1.00      5025
   macro avg       1.00      1.00      1.00      5025
weighted avg       1.00      1.00      1.00      5025

--------------------------------------------------
Test Set Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       643
           1       0.93      0.91      0.92       552
           2       1.00      1.00      1.00        62

    accuracy                           0.93      1257
   macro avg       0.95      0.95      0.95      1257
weighted avg       0.93      0.93      0.93      1257



The evaluation results for the random forest model on the test set show an accuracy of 93%, with macro-averaged precision, recall, and F1-score of 95% each.
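The macro and weighted averages in the test report differ because cluster 2 is small but scored perfectly. The arithmetic below recomputes both F1 averages from the per-class values printed above; since those inputs are already rounded to two decimals, the results are approximate.

```python
# Per-class test F1-scores and supports copied from the random forest report.
f1_per_class = {0: 0.93, 1: 0.92, 2: 1.00}
support = {0: 643, 1: 552, 2: 62}

# Macro average: every class counts equally.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1_per_class[c] * support[c] for c in f1_per_class) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.95 0.93
```

The macro average gives the 62-sample cluster the same weight as the two large clusters, which lifts it above the weighted average.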

### Logistic Regression

In [9]:
print("Training Set Classification Report:")
print(classification_report(y_train, LR_y_train_pred))
print( '-' * 50)
print("Test Set Classification Report:")
print(classification_report(y_test, LR_y_test_pred))

Training Set Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.73      0.69      2544
           1       0.64      0.56      0.60      2213
           2       1.00      1.00      1.00       268

    accuracy                           0.67      5025
   macro avg       0.76      0.76      0.76      5025
weighted avg       0.67      0.67      0.66      5025

--------------------------------------------------
Test Set Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.71      0.68       643
           1       0.62      0.55      0.58       552
           2       1.00      1.00      1.00        62

    accuracy                           0.65      1257
   macro avg       0.76      0.75      0.75      1257
weighted avg       0.65      0.65      0.65      1257



### Decision Tree

In [10]:
print("Training Set Classification Report:")
print(classification_report(y_train, DT_y_train_pred))
print( '-' * 50)
print("Test Set Classification Report:")
print(classification_report(y_test, DT_y_test_pred))

Training Set Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2544
           1       1.00      1.00      1.00      2213
           2       1.00      1.00      1.00       268

    accuracy                           1.00      5025
   macro avg       1.00      1.00      1.00      5025
weighted avg       1.00      1.00      1.00      5025

--------------------------------------------------
Test Set Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.90      0.90       643
           1       0.88      0.90      0.89       552
           2       1.00      1.00      1.00        62

    accuracy                           0.90      1257
   macro avg       0.93      0.93      0.93      1257
weighted avg       0.90      0.90      0.90      1257



## **c. Tuning the Classification Model (Optional)**

Use GridSearchCV, RandomizedSearchCV, or another method to search for the best combination of hyperparameters.

In [11]:
#Type your code here
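Since the tuning cell is left empty, here is a minimal sketch of what a `GridSearchCV` run could look like. It uses synthetic data and a decision tree to stay quick; the grid values, the `f1_macro` scoring choice, and all variable names are assumptions for illustration, not settings used by the submitter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the real features and clusters.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Hypothetical grid: 3 depths x 2 leaf sizes = 6 combinations.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # macro F1 treats all clusters equally
    cv=3,                # 3-fold cross-validation on the training set only
)
search.fit(Xd_train, yd_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The search only ever sees the training split, so the test set stays untouched for the final comparison.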

## **d. Evaluating the Classification Model after Tuning (Optional)**

The recommended steps are:
1. Use the model with the best hyperparameters.
2. Recompute the evaluation metrics to see whether performance has improved.

In [12]:
#Type your code here
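This cell is also empty in the submission. As a hedged sketch of the two steps above, the example below tunes a random forest with `RandomizedSearchCV` on synthetic data, takes `best_estimator_`, and compares its test macro F1 against an untuned baseline. The search space, `n_iter`, and all names are hypothetical, and nothing here says whether tuning would help on the CO2 data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Untuned baseline for comparison.
baseline = RandomForestClassifier(n_estimators=30, random_state=0)
baseline.fit(Xd_train, yd_train)

# Randomly sample 4 of the 9 combinations in this hypothetical search space.
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions={"max_depth": [3, 5, None], "min_samples_leaf": [1, 3, 5]},
    n_iter=4,
    scoring="f1_macro",
    cv=3,
    random_state=0,
)
search.fit(Xd_train, yd_train)

# Step 1: use the model with the best hyperparameters.
tuned = search.best_estimator_

# Step 2: recompute the metrics on the held-out test set.
baseline_f1 = f1_score(yd_test, baseline.predict(Xd_test), average="macro")
tuned_f1 = f1_score(yd_test, tuned.predict(Xd_test), average="macro")
print("baseline macro F1:", round(baseline_f1, 3))
print("tuned macro F1:   ", round(tuned_f1, 3))
print(classification_report(yd_test, tuned.predict(Xd_test)))
```

A tuned model is not guaranteed to beat the baseline on the test set, so both numbers are worth reporting.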

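## **e. Analysis of the Classification Model Evaluation Results**

The submitter applied three algorithms: random forest, logistic regression, and decision tree. Random forest and decision tree both reach perfect scores on the training set, while logistic regression scores lowest of the three (67% training accuracy and 65% test accuracy).
On the test set, random forest performs best overall, with macro-averaged precision, recall, and F1-score of 95% each and an accuracy of 93%, followed by decision tree at 93% each and an accuracy of 90%. The gap between the perfect training scores and the lower test scores suggests that both tree-based models overfit to some degree.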