# Regression Pipeline (End-to-End)

**Name**: Anom Nur Maulid  
**Class**: TK4601  
**NIM**: 1103223193  

## Objective
Build an end-to-end regression model to predict a continuous target value from numeric audio features.


## Dataset Overview and Task Description

1. Main Objective:
"To design and implement an end-to-end regression pipeline (using machine learning and/or deep learning) that can predict a continuous target value from the input features (for example, the release year of a song)."

2. Task Overview:
"In this assignment, you will build an end-to-end regression model. You will work with the provided dataset, perform data cleaning and preprocessing, handle missing values and outliers, and engineer or select relevant features. You are required to implement machine learning or deep learning regression algorithms to predict the target variable. The workflow should include data preprocessing, model training, basic hyperparameter tuning, and evaluation using appropriate regression metrics (such as MSE, RMSE, MAE, or R²), along with a brief interpretation of the results."

3. Dataset Link:
https://drive.google.com/file/d/1f8eaAZY-7YgFxLcrL3OkvSRa3onNNLb9/view

## 1. Mount Google Drive
Mount Google Drive so the notebook can access the dataset stored in Drive.


In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## 2. Import Libraries
We import libraries for:
- data processing (pandas, numpy)
- model training (scikit-learn)
- evaluation metrics (MAE, RMSE, R²)


In [2]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


## 3. Locate Dataset File
We verify the dataset folder and confirm the dataset file exists before loading.


In [3]:
DATA_DIR = "/content/drive/MyDrive/UAS ML DL/Regression (ML)"

print("DATA_DIR exists?", os.path.exists(DATA_DIR))
print("\nFolder contents:")
for f in sorted(os.listdir(DATA_DIR)):
    print("-", f)

data_path = os.path.join(DATA_DIR, "midterm-regresi-dataset.csv")
print("\nDataset exists?", os.path.exists(data_path))
print("Dataset path:", data_path)


DATA_DIR exists? True

Folder contents:
- Regression.ipynb
- midterm-regresi-dataset.csv

Dataset exists? True
Dataset path: /content/drive/MyDrive/UAS ML DL/Regression (ML)/midterm-regresi-dataset.csv


## 4. Load Data & Sanity Check
The dataset has no header:
- first column = target (`y`)
- remaining columns = features (`X`)

We check:
- shapes
- target range
- missing value ratio


In [4]:
df = pd.read_csv(data_path, header=None)

# Convert defensively (if any non-numeric strings exist)
df = df.apply(pd.to_numeric, errors="coerce")

y = df.iloc[:, 0].astype(float)
X = df.iloc[:, 1:].astype(float)

print("df shape:", df.shape)
print("X shape :", X.shape)
print("y shape :", y.shape)
print("Target min/max:", y.min(), y.max())
print("Missing ratio X:", float(X.isna().mean().mean()))

df.head()


df shape: (515345, 91)
X shape : (515345, 90)
y shape : (515345,)
Target min/max: 1922.0 2011.0
Missing ratio X: 0.0


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 49.94357 | 21.47114 | 73.0775 | 8.74861 | -17.40628 | -13.09905 | -25.01202 | -12.23257 | 7.83089 | ... | 13.0162 | -54.40548 | 58.99367 | 15.37344 | 1.11144 | -23.08793 | 68.40795 | -1.82223 | -27.46348 | 2.26327 |
| 1 | 2001 | 48.73215 | 18.4293 | 70.32679 | 12.94636 | -10.32437 | -24.83777 | 8.7663 | -0.92019 | 18.76548 | ... | 5.66812 | -19.68073 | 33.04964 | 42.87836 | -9.90378 | -32.22788 | 70.49388 | 12.04941 | 58.43453 | 26.92061 |
| 2 | 2001 | 50.95714 | 31.85602 | 55.81851 | 13.41693 | -6.57898 | -18.5494 | -3.27872 | -2.35035 | 16.07017 | ... | 3.038 | 26.05866 | -50.92779 | 10.93792 | -0.07568 | 43.2013 | -115.00698 | -0.05859 | 39.67068 | -0.66345 |
| 3 | 2001 | 48.2475 | -1.89837 | 36.29772 | 2.58776 | 0.9717 | -26.21683 | 5.05097 | -10.34124 | 3.55005 | ... | 34.57337 | -171.70734 | -16.96705 | -46.67617 | -12.51516 | 82.58061 | -72.08993 | 9.90558 | 199.62971 | 18.85382 |
| 4 | 2001 | 50.9702 | 42.20998 | 67.09964 | 8.46791 | -15.85279 | -16.81409 | -12.48207 | -9.37636 | 12.63699 | ... | 9.92661 | -55.95724 | 64.92712 | -17.72522 | -1.49237 | -7.50035 | 51.76631 | 7.88713 | 55.66926 | 28.74903 |


## 5. Train/Validation Split
We split the dataset into training and validation sets to evaluate model performance on unseen data.


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train:", X_train.shape, "Valid:", X_valid.shape)


Train: (412276, 90) Valid: (103069, 90)


## 6. Evaluation Metrics
We evaluate regression performance using:
- MAE (Mean Absolute Error)
- RMSE (Root Mean Squared Error)
- R² (higher is better)


In [8]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

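def eval_reg(model, X_tr, y_tr, X_va, y_va, name="model"):
    """Fit `model` on the training split and report MAE, RMSE and R² on the validation split."""
    # Note: this always (re)fits the estimator, even if it was fitted before
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    mae = mean_absolute_error(y_va, pred)
    # RMSE is the square root of MSE, so it is in the same unit as the target (years)
    rmse = np.sqrt(mean_squared_error(y_va, pred))
    r2 = r2_score(y_va, pred)
    print(f"{name} | MAE={mae:.3f} | RMSE={rmse:.3f} | R2={r2:.4f}")
    return {"model": name, "mae": mae, "rmse": rmse, "r2": r2}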
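
As a cross-check on what `eval_reg` reports, the three metrics can also be computed directly from their definitions with NumPy. This is a minimal sketch on made-up numbers; the helper name `manual_metrics` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np

def manual_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² from their definitions (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                       # average absolute error
    rmse = np.sqrt(np.mean(err ** 2))                # root of the average squared error
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy example: four songs with true and predicted release years (made-up values)
y_true = [1990, 2000, 2005, 2010]
y_pred = [1995, 1998, 2005, 2004]
mae, rmse, r2 = manual_metrics(y_true, y_pred)
print(f"MAE={mae:.3f} | RMSE={rmse:.3f} | R2={r2:.4f}")
```

Because RMSE squares the errors before averaging, the single 6-year miss weighs more heavily in RMSE than in MAE, which is why RMSE is the larger of the two here.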


## 7. Baseline Model (Ridge Regression)
Because features are numeric, we use:
- StandardScaler for scaling
- Ridge Regression as a strong linear baseline


In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

ridge_baseline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("reg", Ridge(alpha=1.0))
])

results = []
results.append(eval_reg(ridge_baseline, X_train, y_train, X_valid, y_valid, "Ridge_baseline"))
results


Ridge_baseline | MAE=6.778 | RMSE=9.523 | R2=0.2380


[{'model': 'Ridge_baseline',
  'mae': 6.778169897633017,
  'rmse': np.float64(9.523311986020673),
  'r2': 0.23796617303765089}]
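
For intuition about what `Ridge(alpha=1.0)` fits after scaling: ridge regression minimizes the squared error plus `alpha` times the squared norm of the coefficients, and that problem has a closed-form solution. The sketch below uses synthetic data; the helper `ridge_closed_form` and the data are illustrative assumptions, and scikit-learn uses its own solvers internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - b - Xw||^2 + alpha * ||w||^2 with an unpenalized intercept b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # centering takes care of the intercept
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])   # alpha added on the diagonal
    w = np.linalg.solve(A, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data: three features, target centered near the year 2000
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2000 + X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w_small, b_small = ridge_closed_form(X, y, alpha=1e-3)
w_large, b_large = ridge_closed_form(X, y, alpha=1e3)
print("alpha=1e-3:", w_small.round(3), round(b_small, 2))
print("alpha=1e+3:", w_large.round(3), round(b_large, 2))
```

A larger `alpha` shrinks the coefficients toward zero, trading variance for bias. With hundreds of thousands of rows and only 90 features, as in this notebook, that penalty has little to act on.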

## 8. Model Comparison (Non-linear Models)
We compare the linear baseline with non-linear/tree-based models:
- RandomForestRegressor
- GradientBoostingRegressor

These models can capture non-linear relationships between audio features and the target.


In [10]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("reg", RandomForestRegressor(
        n_estimators=300,
        random_state=42,
        n_jobs=-1
    ))
])

gbr = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("reg", GradientBoostingRegressor(random_state=42))
])

results.append(eval_reg(rf, X_train, y_train, X_valid, y_valid, "RandomForest"))
results.append(eval_reg(gbr, X_train, y_train, X_valid, y_valid, "GradientBoosting"))

import pandas as pd
pd.DataFrame(results).sort_values("rmse")


RandomForest | MAE=6.436 | RMSE=9.064 | R2=0.3097
GradientBoosting | MAE=6.561 | RMSE=9.305 | R2=0.2724


| | model | mae | rmse | r2 |
|---|---|---|---|---|
| 1 | RandomForest | 6.436016 | 9.064085 | 0.309687 |
| 2 | GradientBoosting | 6.560876 | 9.305405 | 0.27244 |
| 0 | Ridge_baseline | 6.77817 | 9.523312 | 0.237966 |


## 9. Basic Hyperparameter Tuning (Ridge Alpha)
We tune Ridge regularization strength (`alpha`) using cross-validation.
This satisfies the basic hyperparameter tuning requirement.


In [12]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

param_grid = {"reg__alpha": np.logspace(-3, 3, 13)}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

ridge_for_tuning = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("reg", Ridge())
])

gs = GridSearchCV(
    ridge_for_tuning,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=-1
)

gs.fit(X_train, y_train)

print("Best alpha:", gs.best_params_)
print("Best CV RMSE:", -gs.best_score_)

best_ridge = gs.best_estimator_
results.append(eval_reg(best_ridge, X_train, y_train, X_valid, y_valid, "Ridge_tuned"))

pd.DataFrame(results).sort_values("rmse")


Best alpha: {'reg__alpha': np.float64(100.0)}
Best CV RMSE: 9.557535103431007
Ridge_tuned | MAE=6.778 | RMSE=9.523 | R2=0.2380


| | model | mae | rmse | r2 |
|---|---|---|---|---|
| 1 | RandomForest | 6.436016 | 9.064085 | 0.309687 |
| 2 | GradientBoosting | 6.560876 | 9.305405 | 0.27244 |
| 3 | Ridge_tuned | 6.77829 | 9.523306 | 0.237967 |
| 0 | Ridge_baseline | 6.77817 | 9.523312 | 0.237966 |
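
What the grid search does can be sketched by hand: for each `alpha` in the grid, fit on four folds, compute RMSE on the held-out fold, and average the five scores. The sketch below uses synthetic data and a closed-form ridge fit; `fit_ridge` and `cv_rmse` are hypothetical helpers, and the per-fold imputation and scaling that the notebook's pipeline performs are omitted for brevity.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept (illustrative helper)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def cv_rmse(X, y, alpha, n_splits=5, seed=42):
    """Mean validation RMSE over shuffled K folds for one alpha value."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_splits)
    scores = []
    for k in range(n_splits):
        va = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        w, b = fit_ridge(X[tr], y[tr], alpha)
        pred = X[va] @ w + b
        scores.append(np.sqrt(np.mean((y[va] - pred) ** 2)))
    return float(np.mean(scores))

# Synthetic data: five features, two of them irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2000 + X @ np.array([4.0, 0.0, -3.0, 0.0, 1.0]) + rng.normal(scale=2.0, size=300)

alphas = np.logspace(-3, 3, 13)   # same grid as the notebook: 0.001 up to 1000
cv_scores = {float(a): cv_rmse(X, y, a) for a in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)
print("Best alpha:", best_alpha, "| CV RMSE:", round(cv_scores[best_alpha], 3))
```

The grid is spaced by half a decade, so neighboring candidates differ by a factor of about 3.16. A flat score curve across several candidates, as seen with the real data, suggests the model is not sensitive to `alpha` in that range.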


## Interim Conclusion & Interpretation
- The dataset contains **515,345 rows** and **90 numeric features**, with **no missing values**.
- Baseline Ridge achieved **RMSE ≈ 9.52** and **R² ≈ 0.238**, so a linear model explains only about 24% of the variance in the target.
- RandomForest produced the best validation performance (**MAE ≈ 6.44**, **RMSE ≈ 9.06**, **R² ≈ 0.310**), suggesting that non-linear models capture patterns better for this task.
- Tuning Ridge `alpha` (best value 100) left validation performance essentially unchanged (**RMSE ≈ 9.52** before and after), so regularization strength is not what limits the linear model. Ridge remains a useful linear baseline but is not the best overall model.


In [13]:
import os
import pandas as pd

metrics_df = pd.DataFrame(results).sort_values("rmse")
out_metrics = os.path.join(DATA_DIR, "regression_model_results.csv")
metrics_df.to_csv(out_metrics, index=False)

print("Saved:", out_metrics)
metrics_df


Saved: /content/drive/MyDrive/UAS ML DL/Regression (ML)/regression_model_results.csv


| | model | mae | rmse | r2 |
|---|---|---|---|---|
| 1 | RandomForest | 6.436016 | 9.064085 | 0.309687 |
| 2 | GradientBoosting | 6.560876 | 9.305405 | 0.27244 |
| 3 | Ridge_tuned | 6.77829 | 9.523306 | 0.237967 |
| 0 | Ridge_baseline | 6.77817 | 9.523312 | 0.237966 |


## 10. Deep Learning Model (MLP)
We add a simple Multi-Layer Perceptron (MLP) regressor using TensorFlow/Keras.
This satisfies the "deep learning" option and allows GPU usage (if available).
We evaluate using MAE, RMSE, and R² on the same validation split.


In [14]:
import tensorflow as tf
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("TF version:", tf.__version__)
print("GPU devices:", tf.config.list_physical_devices('GPU'))

# Prepare numeric data for DL (impute + scale)
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

X_train_dl = imputer.fit_transform(X_train)
X_valid_dl = imputer.transform(X_valid)

X_train_dl = scaler.fit_transform(X_train_dl).astype("float32")
X_valid_dl = scaler.transform(X_valid_dl).astype("float32")

y_train_dl = y_train.values.astype("float32")
y_valid_dl = y_valid.values.astype("float32")

print("X_train_dl:", X_train_dl.shape, "X_valid_dl:", X_valid_dl.shape)


TF version: 2.19.0
GPU devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
X_train_dl: (412276, 90) X_valid_dl: (103069, 90)


## Train MLP (Keras) + Evaluation
We train an MLP regressor with EarlyStopping to reduce overfitting.
Then we evaluate using MAE, RMSE, and R² on the validation set.
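
The early-stopping rule can be sketched in plain Python: training stops once `val_loss` has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are kept. The function `early_stop_epoch` and the loss values are illustrative only, and this simplified version ignores options such as `min_delta`.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # Stop here; the weights from best_epoch would be restored
                return epoch, best_epoch
    # Patience never ran out: training used every epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improvement stalls after epoch 3
losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.9]
stop, best = early_stop_epoch(losses, patience=3)
print(f"stopped after epoch {stop}, restoring weights from epoch {best}")
```

In this toy run the seventh epoch would have been better than the third, which shows the trade-off: a small `patience` saves time but can stop before a late improvement.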


In [15]:
import tensorflow as tf
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

tf.keras.backend.clear_session()
tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train_dl.shape[1],)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1)  # regression output
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="mse",
    metrics=[tf.keras.metrics.MeanAbsoluteError(name="mae")]
)

early = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True
)

history = model.fit(
    X_train_dl, y_train_dl,
    validation_data=(X_valid_dl, y_valid_dl),
    epochs=30,
    batch_size=4096,
    callbacks=[early],
    verbose=1
)

pred_dl = model.predict(X_valid_dl, batch_size=8192).ravel()

mae_dl = mean_absolute_error(y_valid, pred_dl)
rmse_dl = np.sqrt(mean_squared_error(y_valid, pred_dl))
r2_dl = r2_score(y_valid, pred_dl)

print(f"MLP (DL) | MAE={mae_dl:.3f} | RMSE={rmse_dl:.3f} | R2={r2_dl:.4f}")


Epoch 1/30
101/101 ━━━━━━━━━━━━━━━━━━━━ 6s 29ms/step - loss: 3774403.7500 - mae: 1938.5902 - val_loss: 1636306.7500 - val_mae: 1227.6024
Epoch 2/30
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 1036544.6875 - mae: 898.1799 - val_loss: 402688.7500 - val_mae: 508.2179
Epoch 3/30
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 375174.9062 - mae: 486.0881 - val_loss: 260150.2812 - val_mae: 395.2066
Epoch 4/30
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 264901.5312 - mae: 399.6921 - val_loss: 202166.9062 - val_mae: 343.2538
Epoch 5/30
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 212668.9531 - mae: 355.1539 - val_loss: 166404.1562 - val_mae: 307.9325
Epoch 6/30
101/101 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 180469.0312 - mae: 325.0807 - val_loss: 141237.7188 - val_mae: 282.204

## Add DL result to the comparison table
We append the MLP (DL) metrics to the existing results table and re-rank models by RMSE.


In [16]:
import pandas as pd

# Append the DL result to the results list
results.append({"model": "MLP_DL", "mae": float(mae_dl), "rmse": float(rmse_dl), "r2": float(r2_dl)})

# Show the updated comparison table
pd.DataFrame(results).sort_values("rmse")


| | model | mae | rmse | r2 |
|---|---|---|---|---|
| 1 | RandomForest | 6.436016 | 9.064085 | 0.309687 |
| 2 | GradientBoosting | 6.560876 | 9.305405 | 0.27244 |
| 3 | Ridge_tuned | 6.77829 | 9.523306 | 0.237967 |
| 0 | Ridge_baseline | 6.77817 | 9.523312 | 0.237966 |
| 4 | MLP_DL | 25.171946 | 32.954737 | -8.125022 |
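
One plausible contributor to the MLP's poor score, offered as an assumption that this notebook does not test: the target is a raw year near 2000, so a freshly initialized network whose outputs are near zero starts with a squared error on the order of 2000² ≈ 4 million. That is in line with the first-epoch loss of about 3.8 million in the training log. Standardizing the target before training and inverting the transform on the predictions removes that offset. A minimal NumPy sketch with synthetic years (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for release years in the notebook's 1922-2011 range
y_train_demo = rng.integers(1922, 2012, size=10_000).astype(float)

# Fit the scaling on the training target only
y_mean = y_train_demo.mean()
y_std = y_train_demo.std()
y_train_scaled = (y_train_demo - y_mean) / y_std

# MSE of an all-zero prediction, roughly what an untrained network outputs
mse_raw = float(np.mean((y_train_demo - 0.0) ** 2))
mse_scaled = float(np.mean((y_train_scaled - 0.0) ** 2))
print(f"initial MSE on raw years:     {mse_raw:,.0f}")
print(f"initial MSE on scaled target: {mse_scaled:.3f}")

def to_years(pred_scaled):
    """Map predictions from the scaled space back to years."""
    return pred_scaled * y_std + y_mean

# Predictions from a model trained on y_train_scaled would be inverted like this
print(to_years(np.array([0.0, 1.0, -1.0])))
```

In the notebook this would mean fitting the scaling on `y_train_dl`, training on the scaled target, and applying the inverse transform before computing MAE, RMSE, and R². Whether that closes the gap to RandomForest is untested here.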


## Save updated metrics (including DL)
We save the updated comparison table (including the DL model) to CSV for documentation.


In [17]:
import os
import pandas as pd

metrics_df = pd.DataFrame(results).sort_values("rmse")
out_metrics = os.path.join(DATA_DIR, "regression_model_results.csv")
metrics_df.to_csv(out_metrics, index=False)

print("Saved:", out_metrics)
metrics_df


Saved: /content/drive/MyDrive/UAS ML DL/Regression (ML)/regression_model_results.csv


| | model | mae | rmse | r2 |
|---|---|---|---|---|
| 1 | RandomForest | 6.436016 | 9.064085 | 0.309687 |
| 2 | GradientBoosting | 6.560876 | 9.305405 | 0.27244 |
| 3 | Ridge_tuned | 6.77829 | 9.523306 | 0.237967 |
| 0 | Ridge_baseline | 6.77817 | 9.523312 | 0.237966 |
| 4 | MLP_DL | 25.171946 | 32.954737 | -8.125022 |


## Conclusion & Interpretation
- The dataset contains **515,345 rows** and **90 numeric features**, with **no missing values**.
- Baseline Ridge achieved **RMSE ≈ 9.52** and **R² ≈ 0.238**, showing limited linear fit.
- RandomForest produced the best validation performance (**MAE ≈ 6.44**, **RMSE ≈ 9.06**, **R² ≈ 0.310**), indicating non-linear models capture patterns better on this dataset.
- Tuning Ridge `alpha` (best value 100) did not change validation performance in any meaningful way.
- A simple MLP (deep learning) model was also tested but performed far worse (**RMSE ≈ 32.95**, **R² ≈ −8.13**, i.e. worse than always predicting the mean), so RandomForest remains the best model.
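
To put these numbers in context, R² compares a model's squared error with that of always predicting the validation mean: R² = 1 − RMSE² / Var(y). Rearranging gives the RMSE of that constant predictor from any row of the results table. A short worked calculation using the reported metrics (variable names are illustrative):

```python
import math

# (RMSE, R²) pairs copied from the results table
reported = {
    "RandomForest": (9.064085, 0.309687),
    "GradientBoosting": (9.305405, 0.27244),
    "Ridge_baseline": (9.523312, 0.237966),
    "MLP_DL": (32.954737, -8.125022),
}

baselines = {}
for name, (rmse, r2) in reported.items():
    # R² = 1 - RMSE² / Var(y)  =>  std(y) = RMSE / sqrt(1 - R²)
    baselines[name] = rmse / math.sqrt(1.0 - r2)
    reduction = 1.0 - rmse / baselines[name]
    print(f"{name:17s} mean-predictor RMSE={baselines[name]:.2f}  error reduction={reduction:+.1%}")
```

All four rows imply the same validation standard deviation of about 10.9 years. RandomForest therefore cuts RMSE by roughly 17% relative to predicting the mean and Ridge by roughly 13%, while the MLP's error is about three times that of the constant predictor.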
