# Regression model on European Motorbike Listings

This notebook builds a machine learning model that predicts the price of used motorbikes from mileage, power, make/model, version, and the month and year parsed from the `date` column. The model is a Random Forest Regressor trained on the log of the price. The workflow covers data cleaning, label encoding of the categorical features, feature standardization, and hyperparameter tuning with a randomized search.

The dataset was provided by Mexwell on Kaggle: https://www.kaggle.com/datasets/mexwell/motorbike-marketplace

## Import Libraries

In [None]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
from sklearn.metrics import mean_squared_error

## Preprocessing and splitting the dataset

In [None]:
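df = pd.read_csv("europe-motorbikes-zenrows.csv")
df = df.dropna()
df = df.drop(columns=["fuel", "gear", "offer_type", "link"])

# Parse the "MM/YYYY" date once and split it into numeric month and year.
date = pd.to_datetime(df["date"], format="%m/%Y")
df["month"] = date.dt.month
df["year"] = date.dt.year
df = df.drop(columns=["date"])

# Keep listings priced strictly between €500 and €50,000 to trim outliers.
df = df[(df["price"] > 500) & (df["price"] < 50000)]

# Target: natural log of the price in thousands of euros.
y = np.log(df["price"] / 1000.0)
X = df.drop(columns=["price"])

# Map each category string to an integer code.
le_model = LabelEncoder()
le_version = LabelEncoder()
X["make_model"] = le_model.fit_transform(X["make_model"])
X["version"] = le_version.fit_transform(X["version"])

# Standardize the features. Note that the encoders and the scaler are fitted
# on the full dataset before the split, so the test set influences preprocessing.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)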
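The cell above fits the encoders and the scaler before splitting. One way to keep preprocessing inside the training folds is to wrap it in a scikit-learn pipeline. The following is a minimal sketch under stated assumptions, not the notebook's method: the `toy` data frame and its values are made up, `OrdinalEncoder` is swapped in for `LabelEncoder` so unseen categories map to `-1`, and the scaler is dropped because tree-based models do not need feature scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the listings (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "mileage": rng.integers(0, 80_000, n),
    "power": rng.integers(10, 220, n).astype(float),
    "make_model": rng.choice(["Honda", "Yamaha", "BMW"], n),
    "version": rng.choice(["A", "B", "C", "D"], n),
})
price = 1_000 + 40 * toy["power"] - 0.02 * toy["mileage"] + rng.normal(0, 200, n)
toy["price"] = price.clip(lower=600)  # keep prices positive so the log is defined

X_toy = toy.drop(columns=["price"])
y_toy = toy["price"] / 1000.0  # price in thousands of euros

# Split first, so nothing fitted below ever sees the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Encode the categorical columns; pass the numeric ones through unchanged.
encode = ColumnTransformer(
    [("cat",
      OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      ["make_model", "version"])],
    remainder="passthrough",
)

# Train on log(y) and return predictions already converted back with exp.
model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("encode", encode),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ]),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X_tr, y_tr)
pred_te = model.predict(X_te)  # predictions in thousands of euros
```

Because the whole pipeline is a single estimator, it could be passed to `RandomizedSearchCV` or `cross_val_score` directly, and the encoder would then be refitted on each training fold.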

## Define and train the model

In [None]:
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf = RandomForestRegressor(random_state=42)

random_search = RandomizedSearchCV(
    rf,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42,
)
random_search.fit(X_train, y_train)

print(f"Best Hyperparameters: {random_search.best_params_}")
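To put `n_iter=10` in context, it can help to count how many combinations the search space contains. The short sketch below copies `param_dist` from the cell above and uses only the standard library; the variable names `n_combinations`, `coverage`, and `cv_fits` are introduced here for illustration.

```python
from math import prod

param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Size of the full grid: the product of the number of options per parameter.
n_combinations = prod(len(values) for values in param_dist.values())

n_iter, cv = 10, 5
coverage = n_iter / n_combinations  # fraction of the grid that gets sampled
cv_fits = n_iter * cv               # model fits during the search, before the final refit

print(f"{n_combinations} combinations, {coverage:.1%} sampled, {cv_fits} CV fits")
```

With these settings the search tries 10 of 288 combinations, so raising `n_iter` is one option if the tuning budget allows it.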

## Evaluate the model

In [None]:
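best_rf_model = random_search.best_estimator_

# The target is log(price / 1000), so both MSE values below are on that log scale.
# cross_val_score refits a clone of the estimator on each fold of the full dataset.
cv_scores = cross_val_score(best_rf_model, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-validated MSE (log of price in thousands of €): {-cv_scores.mean():.4f}")

y_pred = best_rf_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE (log of price in thousands of €): {test_mse:.4f}")

# Persist the fitted model and scaler for reuse.
joblib.dump(best_rf_model, 'motorbike_price_model.joblib')
joblib.dump(scaler, 'scaler.joblib')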

### Model accuracy at the moment
Both values are measured on the log-transformed target, log(price in thousands of €), not in euros.
 - Cross-validated MSE: 0.0823
 - Test MSE: 0.0780
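Since the model is trained on `np.log(price / 1000)`, an MSE on that target is easier to read after conversion. The sketch below shows two options: turning a log-scale MSE into a typical multiplicative error, and computing an RMSE in euros from back-transformed values. The helper names `log_mse_to_factor` and `rmse_in_euros` are hypothetical, and the toy prices are made up.

```python
import numpy as np

def log_mse_to_factor(mse_log):
    """Typical multiplicative error implied by an MSE measured on log(price)."""
    return float(np.exp(np.sqrt(mse_log)))

def rmse_in_euros(y_true_log, y_pred_log):
    """RMSE in euros for values encoded as log(price / 1000)."""
    true_eur = np.exp(np.asarray(y_true_log)) * 1000.0
    pred_eur = np.exp(np.asarray(y_pred_log)) * 1000.0
    return float(np.sqrt(np.mean((true_eur - pred_eur) ** 2)))

# A log-scale MSE of 0.0780 means predictions are typically off by this factor.
factor = log_mse_to_factor(0.0780)
print(f"Typical error factor: {factor:.2f}")

# Toy example: true prices €5,000 and €10,000, predicted €6,000 and €9,000.
toy_true = np.log([5.0, 10.0])
toy_pred = np.log([6.0, 9.0])
toy_rmse = rmse_in_euros(toy_true, toy_pred)
print(f"Toy RMSE: €{toy_rmse:.2f}")
```

In the notebook, `rmse_in_euros(y_test, y_pred)` would give the test error in euros, assuming `y_test` and `y_pred` come from the evaluation cell.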


## Prediction

In [None]:
new_input = {
    "mileage": 150,  # numeric, same unit as the dataset's mileage column
    "power": 218.0,  # numeric, same unit as the dataset's power column
    "make_model": "Honda",  # categorical; must be a label seen during fitting
    "version": "CBR1000RR-R Fireblade SP",  # categorical; must be a label seen during fitting
    "month": 3,  # month taken from the date field
    "year": 2020  # year taken from the date field
}

# Encode the categorical fields with the encoders fitted above.
# LabelEncoder.transform raises ValueError for labels it has not seen.
new_input_encoded = new_input.copy()
new_input_encoded["make_model"] = le_model.transform([new_input["make_model"]])[0]
new_input_encoded["version"] = le_version.transform([new_input["version"]])[0]

# Build a one-row frame in the training column order so the scaler
# receives the same feature names it was fitted on.
input_df = pd.DataFrame([new_input_encoded], columns=X.columns)
input_data_scaled = scaler.transform(input_df)

# The model predicts log(price / 1000); undo the transform to get euros.
predicted_price = best_rf_model.predict(input_data_scaled)[0]

print(f"Predicted Price: €{np.exp(predicted_price) * 1000:.2f}")
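`LabelEncoder.transform` raises a `ValueError` when given a label that was not present during fitting, which can happen for a make or version missing from the dataset. A small guard can turn that into a clearer message. This is a sketch only: `encode_known` is a hypothetical helper, and `le_demo` is a toy encoder standing in for `le_model`.

```python
from sklearn.preprocessing import LabelEncoder

def encode_known(encoder, value):
    """Return the integer code for value, or raise a readable error if unseen."""
    if value not in encoder.classes_:
        sample = list(encoder.classes_[:5])
        raise ValueError(
            f"{value!r} was not seen during fitting; known labels include {sample}"
        )
    return int(encoder.transform([value])[0])

# Toy encoder fitted on three made-up labels (classes_ is stored sorted).
le_demo = LabelEncoder().fit(["Honda", "Yamaha", "BMW"])

code = encode_known(le_demo, "Honda")
print(code)

try:
    encode_known(le_demo, "Ducati")
except ValueError as err:
    message = str(err)
    print(message)
```

In the prediction cell, the same check could be applied as `encode_known(le_model, new_input["make_model"])` before scaling.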
