
# Used Car Price Prediction: E2E Workflow (EDA ➜ Modeling ➜ Export `model.pkl`)
This notebook walks through one continuous workflow: understanding the data, **EDA**, modeling, evaluation, and exporting artifacts for a Streamlit application.
Dataset used: **`toyota.csv`** (must be in the **same folder** as the notebook).



## 0. Setup


In [None]:

# !pip install -q pandas numpy scikit-learn matplotlib
import os, json, pickle, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor

pd.set_option("display.max_columns", 100)
print("Libraries ready.")


## 1. Load Data & Initial Understanding
Place `toyota.csv` in the same folder as the notebook, then set the target column (the price).


In [None]:

DATA_PATH = "toyota.csv"          # fixed to this file
TARGET_COLUMN = "price"           # change if needed

assert os.path.exists(DATA_PATH), "File 'toyota.csv' must be in the same folder as the notebook."
df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
display(df.head())

print("\nColumn dtypes:")
print(df.dtypes)

print("\nDescriptive statistics (numeric):")
display(df.describe())

# Detect column types
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()
cat_cols = df.select_dtypes(include=["object","category","bool"]).columns.tolist()
if TARGET_COLUMN in numeric_cols:
    numeric_cols.remove(TARGET_COLUMN)

print("\nNumeric features     :", numeric_cols)
print("Categorical features :", cat_cols)


## 2. Data Quality Check
Check for missing values, duplicate rows, and simple anomalies.


In [None]:

# Missing values
na_counts = df.isna().sum().sort_values(ascending=False)
print("Missing values per column:")
display(na_counts)

# Missing percentage
na_pct = (df.isna().mean()*100).sort_values(ascending=False).round(2)
print("Missing percentage (%):")
display(na_pct)

# Duplicates
dup_count = df.duplicated().sum()
print(f"Number of duplicate rows: {dup_count}")
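
The check above covers missing values and duplicates, while the simple anomalies mentioned in the section intro are left open. One possible sketch, not part of the original notebook, counts negative values and values outside the 1.5 × IQR fences for each numeric column. The helper name `simple_anomaly_report` and the toy data are hypothetical, since the actual columns of `toyota.csv` are not shown here.

```python
import pandas as pd

def simple_anomaly_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count negative values and 1.5*IQR outliers per numeric column."""
    rows = []
    for col in frame.select_dtypes(include=["number"]).columns:
        s = frame[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        rows.append({
            "column": col,
            "n_negative": int((s < 0).sum()),
            "n_iqr_outliers": int(((s < lower) | (s > upper)).sum()),
        })
    return pd.DataFrame(rows, columns=["column", "n_negative", "n_iqr_outliers"])

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "price": [9000, 9500, 10000, 10500, 11000, 95000],
    "mileage": [12000, 15000, -5, 18000, 20000, 22000],
    "model": ["A", "B", "A", "C", "B", "A"],
})
report = simple_anomaly_report(toy)
print(report)
```

In the notebook itself, the same helper could be called as `simple_anomaly_report(df)`. Flagged rows are candidates for inspection rather than automatic removal.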



## 3. EDA: Univariate (Distributions/Counts)
- Numeric: histogram
- Categorical: top 10 categories by frequency
> **Plotting note:** Uses **matplotlib**, with **one plot per figure**.


In [None]:

# Histograms for numeric columns (one plot per feature), plus the target if it is numeric
hist_cols = numeric_cols.copy()
if TARGET_COLUMN in df.columns and pd.api.types.is_numeric_dtype(df[TARGET_COLUMN]):
    hist_cols.append(TARGET_COLUMN)

for col in hist_cols:
    plt.figure()
    df[col].dropna().hist(bins=30)
    plt.title(f"Distribution: {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.show()

# Bar charts for categorical columns (top 10)
for col in cat_cols:
    vc = df[col].astype(str).value_counts().head(10)
    plt.figure()
    vc.plot(kind="bar")
    plt.title(f"Top 10 Categories: {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()


## 4. EDA: Numeric Correlation
A simple heatmap of the correlations among the numeric features and the target.


In [None]:

num_for_corr = numeric_cols.copy()
if TARGET_COLUMN in df.columns and pd.api.types.is_numeric_dtype(df[TARGET_COLUMN]):
    num_for_corr = num_for_corr + [TARGET_COLUMN]

if len(num_for_corr) >= 2:
    corr = df[num_for_corr].corr(numeric_only=True)
    plt.figure()
    plt.imshow(corr, interpolation='nearest')
    plt.title("Correlation (numeric)")
    plt.colorbar()
    ticks = range(len(corr.columns))
    plt.xticks(ticks, corr.columns, rotation=45, ha="right")
    plt.yticks(ticks, corr.columns)
    plt.tight_layout()
    plt.show()
else:
    print("Not enough numeric columns to compute correlations.")


## 5. (Optional) Simple Cleaning
An example of a light cleaning step: dropping duplicate rows. Other steps (outlier handling, imputation) depend on what the data needs.


In [None]:

# Drop duplicates if any
before = df.shape[0]
df = df.drop_duplicates().reset_index(drop=True)
after = df.shape[0]
print(f"Rows before/after drop_duplicates: {before} -> {after}")

# Recompute column types after cleaning
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()
cat_cols = df.select_dtypes(include=["object","category","bool"]).columns.tolist()
if TARGET_COLUMN in numeric_cols:
    numeric_cols.remove(TARGET_COLUMN)
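
The cleaning step above only drops duplicates. If the quality check reports missing values, one option is to handle imputation inside the preprocessing itself, so the fill values are learned from the training data only. The sketch below is an illustration under assumptions, not the notebook's own method: the column names `mileage` and `fuelType` and the toy values are hypothetical, and median / most-frequent imputation is just one reasonable default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "mileage": [12000.0, np.nan, 18000.0, 30000.0],
    "fuelType": ["Petrol", "Hybrid", np.nan, "Petrol"],
})

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess_with_imputation = ColumnTransformer(
    [
        ("num", numeric_transformer, ["mileage"]),
        ("cat", categorical_transformer, ["fuelType"]),
    ],
    sparse_threshold=0,
)

transformed = preprocess_with_imputation.fit_transform(toy)
print(transformed)
```

The missing mileage is replaced by the median of the observed values and the missing fuel type by the most frequent category before encoding.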


## 6. Split Data


In [None]:

assert TARGET_COLUMN in df.columns, f"Target column '{TARGET_COLUMN}' not found."
y = df[TARGET_COLUMN]
X = df.drop(columns=[TARGET_COLUMN])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)
print("Train:", X_train.shape, " Test:", X_test.shape)
display(X_train.head())


## 7. Pipeline (Preprocess ➜ Model) & Training
- Numeric: `StandardScaler(with_mean=False)` (scales to unit variance without centering)
- Categorical: `OneHotEncoder(handle_unknown="ignore")`
- Model: `RandomForestRegressor`


In [None]:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

numeric_transformer = Pipeline([("scaler", StandardScaler(with_mean=False))])
categorical_transformer = Pipeline([("ohe", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numeric_transformer, X_train.select_dtypes(include=["number"]).columns.tolist()),
    ("cat", categorical_transformer, X_train.select_dtypes(include=["object","category","bool"]).columns.tolist()),
])

model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
pipe = Pipeline([("preprocess", preprocess), ("model", model)])
pipe.fit(X_train, y_train)
print("Training finished.")
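
To illustrate why the pipeline uses `handle_unknown="ignore"`, here is a minimal sketch with a hypothetical `fuelType` column (the real columns of `toyota.csv` may differ). A category that was never seen during fitting is encoded as a row of zeros instead of raising an error at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories; "Electric" does not appear in the training data
train = pd.DataFrame({"fuelType": ["Petrol", "Diesel", "Hybrid"]})
new = pd.DataFrame({"fuelType": ["Diesel", "Electric"]})

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# The encoder returns a sparse matrix by default; convert it for display
encoded = ohe.transform(new).toarray()
print(ohe.categories_[0])  # learned categories, in sorted order
print(encoded)             # the "Electric" row contains only zeros
```

This matters for the Streamlit app, where user input may contain values that were absent from the training split.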


## 8. Evaluation


In [None]:

pred_train = pipe.predict(X_train)
pred_test  = pipe.predict(X_test)

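# RMSE as the square root of MSE; the `squared=False` argument was
# deprecated and later removed in newer scikit-learn releases
rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))
rmse_test  = np.sqrt(mean_squared_error(y_test, pred_test))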
mae_test   = mean_absolute_error(y_test, pred_test)
r2_test    = r2_score(y_test, pred_test)

print(f"RMSE train: {rmse_train:.4f}")
print(f"RMSE test : {rmse_test:.4f}")
print(f"MAE test  : {mae_test:.4f}")
print(f"R2  test  : {r2_test:.4f}")
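
As a reference for reading these numbers, the sketch below computes the same three metrics by hand with NumPy on made-up values (the prices are illustrative only, not results from this notebook). RMSE and MAE are in the units of the target, while R² compares the model's squared error to that of always predicting the mean.

```python
import numpy as np

# Made-up true and predicted prices, for illustration only
y_true = np.array([10000.0, 12000.0, 15000.0, 20000.0])
y_pred = np.array([11000.0, 11500.0, 14000.0, 21000.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(f"MAE : {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2  : {r2:.4f}")
```

A training RMSE far below the test RMSE is a common sign of overfitting, which is worth checking for a random forest with fully grown trees.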


## 9. (Optional) Feature Importance
Show the 20 most important engineered features (after One-Hot encoding) according to the RandomForest.


In [None]:

importances = pipe.named_steps["model"].feature_importances_

# Get feature names after transformation
try:
    feat_names = pipe.named_steps["preprocess"].get_feature_names_out()
except Exception:
    # Fallback: generic names, one per transformed column. The raw column names
    # cannot be used here because One-Hot encoding changes the feature count.
    feat_names = [f"feature_{i}" for i in range(len(importances))]

imp = pd.DataFrame({"feature": feat_names, "importance": importances}).sort_values("importance", ascending=False)
display(imp.head(20))

# Plot top 20 (single plot)
top = imp.head(20).iloc[::-1]  # reverse so the largest bar is drawn at the top
plt.figure()
plt.barh(top["feature"], top["importance"])
plt.title("Top 20 Feature Importance (engineered)")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()


## 10. Export Artifacts
- `model.pkl`: the full pipeline
- `columns.json`: the order of the raw feature columns (before transformation)
- `example_features.json`: one example row for a quick test in the application


In [None]:

with open("model.pkl", "wb") as f:
    pickle.dump(pipe, f)

with open("columns.json", "w") as f:
    json.dump(X.columns.tolist(), f, ensure_ascii=False, indent=2)

example = X_train.iloc[0:1].to_dict(orient="records")[0]
with open("example_features.json", "w") as f:
    json.dump(example, f, ensure_ascii=False, indent=2)

print("Saved: model.pkl, columns.json, example_features.json")


## 11. Verify Artifacts
Load `model.pkl` and predict a single row.


In [None]:

with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

pred_one = float(loaded.predict(X_test.iloc[0:1])[0])
print("Single-row prediction (sanity check):", pred_one)
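
On the application side, an incoming record may list fields in a different order or omit some of them. A minimal sketch of how `columns.json` could be used to build a one-row input frame follows. It assumes hypothetical column names and parses the list from an inline string instead of reading the file, so it runs on its own.

```python
import json
import pandas as pd

# Stand-in for json.load(open("columns.json")); these names are hypothetical
columns_json = '["model", "year", "mileage", "fuelType"]'
expected_columns = json.loads(columns_json)

# Hypothetical user input: different key order, one field missing
record = {"mileage": 24000, "model": "Yaris", "year": 2018}

# Reindex to the training-time column order; absent fields become NaN
row = pd.DataFrame([record]).reindex(columns=expected_columns)
missing = [c for c in expected_columns if c not in record]

print(row)
print("Missing fields:", missing)
```

Because the exported pipeline has no imputation step, a row with missing fields should be completed or rejected before it is passed to `loaded.predict(row)`.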


## 12. Moving On to the Streamlit Application
1) Make sure the following files are in the same folder:
   - `model.pkl`
   - `columns.json`
   - `streamlit_app_chat_car_price.py`  *(code prepared earlier)*
2) Set the Gemini API key and run:
```bash
pip install streamlit google-generativeai scikit-learn pandas numpy
export GEMINI_API_KEY=YOUR_KEY
streamlit run streamlit_app_chat_car_price.py
```
In the application sidebar:
- Upload `model.pkl` **or** enable "Gunakan path lokal" (use local path).
- Enable **Gemini (LLM Chat)** for the expert conversation mode.
