In [None]:
<div style="border: solid blue 2px; padding: 15px; margin: 10px">
  <b>Overall Summary of the Project – Iteration 1</b><br><br>

  Hi Bailey, I’m <b>Victor Camargo</b> (<a href="https://hub.tripleten.com/u/e9cc9c11" target="_blank">TripleTen Hub profile</a>). Thanks for submitting your project — I’m happy to say it is complete and approved.<br><br>

  <b>Nice work on:</b><br>
  ✔️ Preparing the data properly, dropping irrelevant columns, splitting into train/validation/test sets, and identifying categorical features.<br>
  ✔️ Training three different models (Linear Regression, Random Forest, and LightGBM) with appropriate preprocessing and hyperparameters.<br>
  ✔️ Measuring and comparing RMSE, training time, and prediction speed, then selecting the best-performing model.<br>
  ✔️ Providing a clear and well-structured conclusion with practical takeaways for Rusty Bargain.<br><br>

  This is a strong, well-executed project that meets all requirements. Congratulations on the milestone — excellent work.<br><br>

  <hr>

  🔹 <b>Legend:</b><br>
  🟢 Green = well done<br>
  🟡 Yellow = suggestions<br>
  🔴 Red = must fix<br>
  🔵 Blue = your comments or questions<br><br>
  
  <b>Please ensure</b> that all cells run smoothly from top to bottom and display their outputs before submitting; this helps keep your analysis easy to follow.<br>
  <b>Kind reminder:</b> try not to move, change, or delete reviewer comments, as they are there to track progress and provide better support during your revisions.<br><br>

  <b>Feel free to reach out in the Questions channel if you need help.</b><br>
</div>


Rusty Bargain, a used car sales service, is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines this value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('/datasets/car_data.csv')

df = df.drop(['DateCrawled','DateCreated','LastSeen','NumberOfPictures'], axis=1)

y = df['Price']
X = df.drop('Price', axis=1)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

cat_features = ['VehicleType','Gearbox','Model','FuelType','Brand','NotRepaired']

print("Train shape:", X_train.shape)
print("Valid shape:", X_valid.shape)
print("Test shape:", X_test.shape)

Train shape: (212621, 11)
Valid shape: (70874, 11)
Test shape: (70874, 11)


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Great job on the data preparation step. You correctly dropped irrelevant columns, separated features from the target, and created proper train/validation/test splits with random state set for reproducibility. Also, listing the categorical features here is a good move to streamline model building later.
</div>
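
The two-stage split above can be checked on a small example. The sketch below uses a synthetic frame with hypothetical column values (not the car dataset) and assumes the same `test_size` settings as the cell above; it only illustrates why holding out 40% and then halving it gives roughly 60/20/20 proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (hypothetical values, 1,000 rows).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Mileage": rng.integers(5_000, 150_000, size=1_000),
    "Power": rng.integers(50, 300, size=1_000),
    "Price": rng.integers(500, 20_000, size=1_000),
})
X_demo = demo.drop("Price", axis=1)
y_demo = demo["Price"]

# Step 1: hold out 40% of the rows.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)
# Step 2: split the held-out 40% in half, giving 20% validation and 20% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

# Share of the full dataset that ends up in each part.
split_shares = {
    name: len(part) / len(demo)
    for name, part in [("train", X_tr), ("valid", X_va), ("test", X_te)]
}
print(split_shares)
```

The same arithmetic explains the shapes printed above: 212,621 training rows and 70,874 rows each for validation and test.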


## Model training

In [2]:
import time
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

results = {}

cat_cols = ['VehicleType','Gearbox','Model','FuelType','Brand','NotRepaired']
num_cols = [c for c in X_train.columns if c not in cat_cols]

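def fit_eval(name, model, Xtr, ytr, Xva, yva):
    """Fit on train, predict on validation; return (RMSE, train seconds, predict seconds)."""
    t0 = time.perf_counter()
    model.fit(Xtr, ytr)
    train_time = time.perf_counter() - t0
    t1 = time.perf_counter()
    preds = model.predict(Xva)
    pred_time = time.perf_counter() - t1
    # Take the root manually: newer scikit-learn releases no longer accept squared=False
    rmse = np.sqrt(mean_squared_error(yva, preds))
    print(f"{name:>15} | RMSE: {rmse:,.2f} | train: {train_time:.2f}s | predict: {pred_time:.2f}s")
    return rmse, train_time, pred_time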

prep_ohe = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), num_cols),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),  # fill missing categories with the mode
            ('ohe', OneHotEncoder(handle_unknown='ignore'))  # sparse output is the default
        ]), cat_cols)
    ],
    remainder='drop'
)
lr_pipe = Pipeline([('prep', prep_ohe), ('model', LinearRegression())])
results['Linear'] = fit_eval('LinearRegression', lr_pipe, X_train, y_train, X_valid, y_valid)

prep_ord = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), num_cols),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),  # fill missing categories with the mode
            ('ord', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
        ]), cat_cols)
    ]
)
rf_pipe = Pipeline([
    ('prep', prep_ord),
    ('model', RandomForestRegressor(
        n_estimators=150,
        max_depth=14,
        max_features='sqrt',
        n_jobs=-1,
        random_state=42
    ))
])
results['RandomForest'] = fit_eval('RandomForest', rf_pipe, X_train, y_train, X_valid, y_valid)

try:
    import lightgbm as lgb
    Xtr_lgb = X_train.copy(); Xva_lgb = X_valid.copy()
    for c in num_cols:
        med = Xtr_lgb[c].median()
        Xtr_lgb[c] = Xtr_lgb[c].fillna(med); Xva_lgb[c] = Xva_lgb[c].fillna(med)
    for c in cat_cols:
        Xtr_lgb[c] = Xtr_lgb[c].astype('object').fillna('missing').astype('category')
        Xva_lgb[c] = Xva_lgb[c].astype('object').fillna('missing').astype('category')

    lgbm = lgb.LGBMRegressor(
        objective='regression',
        learning_rate=0.1,
        n_estimators=600,
        num_leaves=31,
        subsample=0.9,
        colsample_bytree=0.9,
        random_state=42
    )
    t0 = time.time()
    lgbm.fit(
        Xtr_lgb, y_train,
        categorical_feature=cat_cols,
        eval_set=[(Xva_lgb, y_valid)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=40, verbose=False)]
    )
    train_time = time.time() - t0
    t1 = time.time(); preds = lgbm.predict(Xva_lgb); pred_time = time.time() - t1
    rmse = np.sqrt(mean_squared_error(y_valid, preds))  # root taken manually for scikit-learn version portability
    print(f"{'LightGBM':>15} | RMSE: {rmse:,.2f} | train: {train_time:.2f}s | predict: {pred_time:.2f}s")
    results['LightGBM'] = (rmse, train_time, pred_time)
except Exception as e:
    # Catches a missing lightgbm install as well as any failure during fit or predict
    print("LightGBM step skipped. Details:", e)

print("\nSummary (lower RMSE is better):")
for k, (rmse, tr, pr) in results.items():
    print(f"{k:>12}: RMSE={rmse:,.2f} | train={tr:.2f}s | predict={pr:.2f}s")

LinearRegression | RMSE: 3,499.49 | train: 1.01s | predict: 0.17s
   RandomForest | RMSE: 1,856.16 | train: 12.16s | predict: 0.69s


New categorical_feature is ['Brand', 'FuelType', 'Gearbox', 'Model', 'NotRepaired', 'VehicleType']


       LightGBM | RMSE: 1,707.89 | train: 11.20s | predict: 1.31s

Summary (lower RMSE is better):
      Linear: RMSE=3,499.49 | train=1.01s | predict=0.17s
RandomForest: RMSE=1,856.16 | train=12.16s | predict=0.69s
    LightGBM: RMSE=1,707.89 | train=11.20s | predict=1.31s


In [3]:
print(results.keys())

dict_keys(['Linear', 'RandomForest', 'LightGBM'])


We trained three different models to compare prediction quality, training speed, and prediction speed:

1. **Linear Regression** – a baseline sanity check using one-hot encoding (OHE).
2. **Random Forest Regressor** – a tree-based ensemble model using ordinal encoding.
3. **LightGBM** – a gradient boosting model that handles categorical features natively.
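
To illustrate how the two encoders differ, here is a minimal sketch on a toy frame. The category values are hypothetical and do not come from the dataset; the encoder settings mirror those in the training cell. One-hot encoding creates one column per category, which avoids imposing an artificial order on a linear model, while ordinal encoding keeps one integer column per feature, which tree-based models can split on directly.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical data (hypothetical values, not rows from the dataset).
toy_train = pd.DataFrame({
    "Gearbox": ["manual", "auto", "manual"],
    "FuelType": ["petrol", "diesel", "lpg"],
})
# A new row whose fuel type was never seen during fitting.
toy_new = pd.DataFrame({"Gearbox": ["auto"], "FuelType": ["electric"]})

# One-hot: one output column per category seen in training (2 + 3 = 5 here).
# An unseen category becomes all zeros within its feature's block of columns.
ohe = OneHotEncoder(handle_unknown="ignore").fit(toy_train)
ohe_new = ohe.transform(toy_new).toarray()

# Ordinal: one integer column per feature; unseen categories map to -1.
ord_enc = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit(toy_train)
ord_new = ord_enc.transform(toy_new)

print("OHE shape:", ohe_new.shape)
print("Ordinal shape:", ord_new.shape)
print("Ordinal row:", ord_new[0])
```

With high-cardinality features such as `Model`, the one-hot matrix becomes very wide, which is one reason the sparse output format is useful for the linear baseline.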

For each model, we measured:
- **RMSE** (Root Mean Squared Error) on the validation set, which reflects prediction quality.
- **Training time**, since Rusty Bargain is interested in how long it takes to prepare the model.
- **Prediction time**, to evaluate speed in a real-time application.

These results allow us to identify the trade-off between accuracy and efficiency across the three approaches.
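
For reference, the metric and the timing approach can be written out by hand. The sketch below is illustrative only: `rmse` and `timed` are hypothetical helper names, and the prices are made-up numbers chosen so the arithmetic is easy to follow.

```python
import time
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy prices (hypothetical): the absolute errors are 100, 200, and 200.
y_true = [1000, 2000, 3000]
y_pred = [1100, 1800, 3200]

# Squared errors are 10,000, 40,000, and 40,000; their mean is 30,000.
toy_rmse = rmse(y_true, y_pred)

# Time an arbitrary call to show the pattern used for fit() and predict().
_, elapsed = timed(sum, range(1_000_000))
print(f"RMSE: {toy_rmse:.2f} | elapsed: {elapsed:.4f}s")
```

Because RMSE squares each error before averaging, a few large mispricings raise it more than many small ones, and the result stays in the same units as the price.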

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent implementation of the model training stage. You correctly compared three distinct models: Linear Regression (baseline), Random Forest (tree-based ensemble), and LightGBM (gradient boosting). Each was set up with suitable preprocessing for numerical and categorical data, and you included metrics for RMSE, training time, and prediction speed. This gives a well-rounded comparison that addresses all the project requirements.
</div>


## Model analysis

In [4]:
import pandas as pd

if 'results' not in globals():
    raise RuntimeError("Run the Model training cell first (it creates the `results` dict).")

df_results = pd.DataFrame(results, index=['RMSE','Train time (s)','Predict time (s)']).T
display(df_results.sort_values('RMSE'))

best_model_name = df_results['RMSE'].idxmin()
best_rmse = df_results.loc[best_model_name, 'RMSE']
print(f"\nBest on validation: {best_model_name} (RMSE={best_rmse:,.2f})")

| Model | RMSE | Train time (s) | Predict time (s) |
|---|---|---|---|
| LightGBM | 1707.886017 | 11.199506 | 1.305753 |
| RandomForest | 1856.159471 | 12.157575 | 0.689686 |
| Linear | 3499.492992 | 1.009378 | 0.171097 |



Best on validation: LightGBM (RMSE=1,707.89)


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Great work on the model analysis stage. You collected the RMSE, training time, and prediction time into a summary DataFrame and sorted by RMSE to clearly highlight the best-performing model. Printing the best model name and its RMSE adds clarity and ensures the results are easy to interpret.
</div>
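
The validation results can also be read in relative terms. The following sketch copies the numbers from the summary table and derives two quantities: how much lower the best RMSE is compared with each model, and the average prediction time per row. The per-row figure assumes the 70,874-row validation batch was scored in a single call, so it is only a rough guide to single-request latency in an app.

```python
# Validation metrics copied from the summary table in this notebook.
n_valid_rows = 70_874
metrics = {
    "LightGBM":     {"rmse": 1707.886017, "predict_s": 1.305753},
    "RandomForest": {"rmse": 1856.159471, "predict_s": 0.689686},
    "Linear":       {"rmse": 3499.492992, "predict_s": 0.171097},
}

# Model with the lowest validation RMSE.
best = min(metrics, key=lambda name: metrics[name]["rmse"])
best_rmse = metrics[best]["rmse"]

comparison = {}
for name, m in metrics.items():
    comparison[name] = {
        # Percentage by which the best RMSE is lower than this model's RMSE.
        "best_rmse_reduction_pct": 100 * (m["rmse"] - best_rmse) / m["rmse"],
        # Average batch prediction time per row, in microseconds.
        "predict_us_per_row": 1e6 * m["predict_s"] / n_valid_rows,
    }

for name, c in comparison.items():
    print(f"{name:>12}: best RMSE is {c['best_rmse_reduction_pct']:.1f}% lower | "
          f"{c['predict_us_per_row']:.1f} µs per row")
```

On these numbers, LightGBM's RMSE is about 8% lower than Random Forest's and about 51% lower than Linear Regression's, while its batch prediction works out to roughly 18 microseconds per row.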


In this project, we built and compared models to predict used car prices for Rusty Bargain.
The key steps were:

1. **Data preparation**
   - Removed irrelevant columns (`DateCrawled`, `DateCreated`, `LastSeen`, `NumberOfPictures`).
   - Split the data into training, validation, and test sets (60/20/20).
   - Identified categorical and numerical features and imputed missing values: the median for numerical features, and the most frequent value (or a `'missing'` category for LightGBM) for categorical ones.

2. **Model training**
   - Trained three models:
     - **Linear Regression** (with OHE) as a simple baseline.
     - **Random Forest Regressor** (with OrdinalEncoder for categoricals).
     - **LightGBM** (handling categorical features natively, with early stopping).
   - Collected metrics on **RMSE**, **training time**, and **prediction time**.

3. **Model analysis**
   - Results on the validation set:
     - **LightGBM**: RMSE ≈ 1,707.89 (best), training ≈ 11.20s, prediction ≈ 1.31s
     - **Random Forest**: RMSE ≈ 1,856.16, training ≈ 12.16s, prediction ≈ 0.69s
     - **Linear Regression**: RMSE ≈ 3,499.49 (worst), training ≈ 1.01s, prediction ≈ 0.17s
   - LightGBM achieved the lowest RMSE, meaning it delivered the most accurate predictions.
   - Random Forest performed reasonably well, but its RMSE was about 148 higher than LightGBM's and it took slightly longer to train; it was, however, nearly twice as fast at prediction.
   - Linear Regression was by far the fastest to train and predict, but its RMSE was roughly double that of LightGBM, so it served only as a sanity-check baseline.

4. **Final evaluation**
   - Based on the validation results, **LightGBM** was chosen as the best model.
   - Its prediction time (≈ 1.31s for 70,874 validation rows) was the slowest of the three but remains small per car, so it offers a good balance of accuracy and speed for real-time pricing in Rusty Bargain’s app.
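
The notebook selects the model on the validation set and does not show a score on the held-out test set. A final check could look like the sketch below. It is a minimal illustration under stated assumptions: `evaluate_on_test` is a hypothetical helper, the data is synthetic, and `LinearRegression` merely stands in for whichever model was selected on validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_on_test(model, X_test, y_test):
    """Return the RMSE of an already-fitted model on held-out test data."""
    preds = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, preds)))

# Synthetic stand-in data so the sketch runs on its own (not the car dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Placeholder for the selected model; it is fitted once and scored once.
chosen_model = LinearRegression().fit(X_fit, y_fit)
test_rmse = evaluate_on_test(chosen_model, X_hold, y_hold)
print(f"Test RMSE: {test_rmse:.3f}")
```

In this notebook the equivalent call would pass the fitted `lgbm` together with a copy of `X_test` that has received the same median and `'missing'` fills as the training data. Scoring the test set only once, after model selection, keeps the estimate independent of the choices made on the validation set.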

---

### Key Takeaways
- **LightGBM is the most effective model**, achieving the lowest RMSE with acceptable training and prediction times.
- **Random Forest** is a solid alternative but less accurate.
- **Linear Regression** is too simplistic for this task, confirming the need for more advanced methods.
- Rusty Bargain can deploy the LightGBM model to provide reliable car price predictions to users, improving customer trust and engagement.


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent job on the conclusion and final evaluation. You summarized the entire workflow clearly, compared the models with their RMSE and timing metrics, and selected LightGBM as the best option with solid justification. The takeaways section is well written and highlights the practical impact of your results for Rusty Bargain.
</div>


# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed

We tested three models to predict used car prices for Rusty Bargain.
LightGBM gave the best balance of accuracy and speed, achieving the lowest error (RMSE ≈ 1,708).
This model is recommended for deployment in the app to provide fast and reliable price estimates for customers.