The data is described below.

The diabetes dataset contains measurements taken from 442 diabetes patients and consists of:

a. 10 baseline variables (features):
- age - age in years
- sex - male or female
- bmi - body mass index
- bp - average blood pressure
- s1 - TC: total serum cholesterol
- s2 - LDL: low-density lipoproteins
- s3 - HDL: high-density lipoproteins
- s4 - TCH: total cholesterol / HDL
- s5 - LTG: possibly the log of the serum triglycerides level
- s6 - GLU: blood sugar level

b. Target variable (y)

- One target variable: a quantitative measure of disease progression one year after baseline
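
The feature list above can be checked directly against the dataset. The sketch below is one way to do that, assuming a scikit-learn version that supports the `as_frame=True` option of `load_diabetes`; the names `X_df` and `y_series` are illustrative and are not used elsewhere in this notebook. The scikit-learn documentation describes the ten features as mean-centered and scaled, which is why the values are small numbers around zero rather than raw ages or blood pressure readings.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects with named columns
diabetes_frame = load_diabetes(as_frame=True)
X_df = diabetes_frame.data        # DataFrame, one column per feature
y_series = diabetes_frame.target  # Series holding the progression measure

print(X_df.shape)          # number of patients and number of features
print(list(X_df.columns))  # feature names in column order
print(X_df.describe().loc[["mean", "std"]].round(3))
```

Working with named columns makes it easier to relate a column such as `bmi` or `s5` back to the description above than the positional indices 0 to 9.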

### Import Package

In [17]:
# Import the required libraries
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
import pandas as pd

### Load Dataset

In [7]:
# Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

In [19]:
pd.DataFrame(X)

,0,1,2,3,4,5,6,7,8,9
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


In [20]:
pd.DataFrame(y)

,0
0,151.0
1,75.0
2,141.0
3,206.0
4,135.0
...,...
437,178.0
438,104.0
439,132.0
440,220.0


### Train Test Split

In [8]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
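
To make the effect of `test_size=0.2` concrete, the following minimal sketch repeats the split and prints the resulting shapes. It reloads the data with `return_X_y=True` only so that the snippet runs on its own.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Same split settings as the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With 442 rows, the test share is rounded up: 89 test rows, 353 training rows
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the same rows in each subset on every run, so the metrics reported later are comparable across the different `n_estimators` values.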

### Define n_estimators values

In [9]:
# Initialize the list of n_estimators values to test
n_estimators_list = [50, 100, 200]

### Loop over each n_estimators value

In [10]:
# Dictionary to store the RMSE and MAE for each n_estimators value
results = {}


# Loop over each n_estimators value
for n_estimators in n_estimators_list:
    # Build the Random Forest Regressor
    rf_regressor = RandomForestRegressor(n_estimators=n_estimators, random_state=42)

    # Train the model
    rf_regressor.fit(X_train, y_train)

    # Predict on the test set
    y_pred = rf_regressor.predict(X_test)

    # Compute RMSE and MAE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)

    # Store the results in the dictionary
    results[n_estimators] = {'RMSE': rmse, 'MAE': mae}


In [22]:
# Note: y_pred holds the predictions from the last loop iteration (n_estimators = 200)
pd.DataFrame(y_pred)

,0
0,142.925
1,176.095
2,152.285
3,255.045
4,108.525
...,...
84,82.995
85,71.975
86,92.955
87,74.425


### Evaluation Metrics (RMSE & MAE)

In [11]:
# Print the results
for n_estimators, metrics in results.items():
    print(f"n_estimators = {n_estimators}: RMSE = {metrics['RMSE']}, MAE = {metrics['MAE']}")

n_estimators = 50: RMSE = 55.17426146392798, MAE = 44.76539325842697
n_estimators = 100: RMSE = 54.332408273184846, MAE = 44.053033707865175
n_estimators = 200: RMSE = 54.461217375612414, MAE = 44.276123595505624
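
For readers who want to see what the two metrics compute, here is a small worked sketch using NumPy only. The three-element arrays are made-up illustration values, not taken from the diabetes data.

```python
import numpy as np

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

errors = y_hat - y_true             # [-1, 0, 2]
mae = np.mean(np.abs(errors))       # (1 + 0 + 2) / 3
rmse = np.sqrt(np.mean(errors**2))  # sqrt((1 + 0 + 4) / 3)

print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")
```

Because RMSE squares the errors before averaging, large misses weigh more heavily than in MAE. RMSE is therefore never smaller than MAE on the same predictions, which matches the results printed above.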


### Select the best model

In [12]:
# Select the model with the lowest RMSE
best_model_n_estimators = min(results, key=lambda x: results[x]['RMSE'])

print(f"\nBest model: n_estimators = {best_model_n_estimators}")


Best model: n_estimators = 100
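
The loop above overwrites `rf_regressor` and `y_pred` on every iteration, so after it finishes they belong to the last value tried rather than to the selected one. One possible way to keep the selected model available is sketched below. It is a variation on the notebook's loop, not the original code, and the names `models`, `rmse_by_n`, `best_n`, `best_model` and `best_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {}     # fitted model for each n_estimators value
rmse_by_n = {}  # test RMSE for each n_estimators value

for n in [50, 100, 200]:
    model = RandomForestRegressor(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    models[n] = model
    rmse_by_n[n] = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Pick the n_estimators value with the lowest RMSE and reuse its fitted model
best_n = min(rmse_by_n, key=rmse_by_n.get)
best_model = models[best_n]
best_pred = best_model.predict(X_test)

print(f"Best model: n_estimators = {best_n}, RMSE = {rmse_by_n[best_n]:.3f}")
```

Storing the fitted models costs some memory but avoids retraining. The exact RMSE values, and possibly the selected value, may differ slightly from the output shown above depending on the scikit-learn version.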
