# 🚗 Vehicle Price Prediction

**A beginner-friendly machine learning project that predicts vehicle prices using specifications like make, model, mileage, and fuel type with Random Forest in Python.**

---

## 📌 Project Overview
This project estimates the market price of a vehicle using historical listings and technical specifications.  
It covers the **full ML pipeline**, from **data cleaning** to **model evaluation**, which makes it a good fit for beginners.

---

## 🛠️ Workflow

### **1️⃣ Data Cleaning**
- Removed rows with missing `price`.
- Filled missing **numeric values** with the mean.
- Filled missing **categorical values** with the mode.
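
The three rules above can be tried on a tiny frame before running them on the real data. This is a minimal sketch: the `clean` helper and the toy values are made up for illustration, and the check with `is_numeric_dtype` is an assumption chosen so that text columns are handled the same way whichever string dtype pandas assigns them.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def clean(df: pd.DataFrame, target: str = "price") -> pd.DataFrame:
    """Drop rows without a target value, then impute the remaining gaps."""
    out = df.dropna(subset=[target]).copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())     # numeric: mean
        else:
            out[col] = out[col].fillna(out[col].mode()[0])  # categorical: mode
    return out


# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "price":   [20000.0, None, 30000.0, 40000.0],
    "mileage": [10.0, 5.0, None, 30.0],
    "fuel":    ["Gasoline", "Diesel", "Gasoline", None],
})
cleaned = clean(toy)
print(cleaned)
```

The row without a price is dropped first, so its mileage and fuel do not influence the mean and mode used to fill the other rows.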

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
import datetime

In [42]:
df = pd.read_csv("vehicle dataset.csv")

In [43]:
df.head()

| | name | description | make | model | year | price | engine | cylinders | fuel | mileage | transmission | trim | body | doors | exterior_color | interior_color | drivetrain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 Jeep Wagoneer Series II | \n \n Heated Leather Seats, Nav Sy... | Jeep | Wagoneer | 2024 | 74600.0 | 24V GDI DOHC Twin Turbo | 6.0 | Gasoline | 10.0 | 8-Speed Automatic | Series II | SUV | 4.0 | White | Global Black | Four-wheel Drive |
| 1 | 2024 Jeep Grand Cherokee Laredo | Al West is committed to offering every custome... | Jeep | Grand Cherokee | 2024 | 50170.0 | OHV | 6.0 | Gasoline | 1.0 | 8-Speed Automatic | Laredo | SUV | 4.0 | Metallic | Global Black | Four-wheel Drive |
| 2 | 2024 GMC Yukon XL Denali | | GMC | Yukon XL | 2024 | 96410.0 | 6.2L V-8 gasoline direct injection, variable v... | 8.0 | Gasoline | 0.0 | Automatic | Denali | SUV | 4.0 | Summit White | Teak/Light Shale | Four-wheel Drive |
| 3 | 2023 Dodge Durango Pursuit | White Knuckle Clearcoat 2023 Dodge Durango Pur... | Dodge | Durango | 2023 | 46835.0 | 16V MPFI OHV | 8.0 | Gasoline | 32.0 | 8-Speed Automatic | Pursuit | SUV | 4.0 | White Knuckle Clearcoat | Black | All-wheel Drive |
| 4 | 2024 RAM 3500 Laramie | \n \n 2024 Ram 3500 Laramie Billet... | RAM | 3500 | 2024 | 81663.0 | 24V DDI OHV Turbo Diesel | 6.0 | Diesel | 10.0 | 6-Speed Automatic | Laramie | Pickup Truck | 4.0 | Silver | Black | Four-wheel Drive |


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1002 entries, 0 to 1001
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            1002 non-null   object 
 1   description     946 non-null    object 
 2   make            1002 non-null   object 
 3   model           1002 non-null   object 
 4   year            1002 non-null   int64  
 5   price           979 non-null    float64
 6   engine          1000 non-null   object 
 7   cylinders       897 non-null    float64
 8   fuel            995 non-null    object 
 9   mileage         968 non-null    float64
 10  transmission    1000 non-null   object 
 11  trim            1001 non-null   object 
 12  body            999 non-null    object 
 13  doors           995 non-null    float64
 14  exterior_color  997 non-null    object 
 15  interior_color  964 non-null    object 
 16  drivetrain      1002 non-null   object 
dtypes: float64(4), int64(1), object(12)

In [45]:
df.describe()

| | year | price | cylinders | mileage | doors |
|---|---|---|---|---|---|
| count | 1002.0 | 979.0 | 897.0 | 968.0 | 995.0 |
| mean | 2023.916168 | 50202.9857 | 4.975474 | 69.033058 | 3.943719 |
| std | 0.298109 | 18700.392062 | 1.392526 | 507.435745 | 0.274409 |
| min | 2023.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 2024.0 | 36600.0 | 4.0 | 4.0 | 4.0 |
| 50% | 2024.0 | 47165.0 | 4.0 | 8.0 | 4.0 |
| 75% | 2024.0 | 58919.5 | 6.0 | 13.0 | 4.0 |
| max | 2025.0 | 195895.0 | 8.0 | 9711.0 | 5.0 |


In [46]:
df.columns

Index(['name', 'description', 'make', 'model', 'year', 'price', 'engine',
       'cylinders', 'fuel', 'mileage', 'transmission', 'trim', 'body', 'doors',
       'exterior_color', 'interior_color', 'drivetrain'],
      dtype='object')

In [47]:
df.shape

(1002, 17)

In [48]:
df.dtypes

name               object
description        object
make               object
model              object
year                int64
price             float64
engine             object
cylinders         float64
fuel               object
mileage           float64
transmission       object
trim               object
body               object
doors             float64
exterior_color     object
interior_color     object
drivetrain         object
dtype: object

In [49]:
df.isnull().sum()

name                0
description        56
make                0
model               0
year                0
price              23
engine              2
cylinders         105
fuel                7
mileage            34
transmission        2
trim                1
body                3
doors               7
exterior_color      5
interior_color     38
drivetrain          0
dtype: int64

In [50]:
# Step 3: Drop rows where 'price' is missing (target variable)
df = df.dropna(subset=["price"])


### **2️⃣ Feature Engineering**
- Created `vehicle_age` = `current_year` - `year`.
- Dropped `year` column after transformation.
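
Because the notebook derives `current_year` from the system clock, the resulting ages shift by one each calendar year and a rerun may not reproduce the reported scores exactly. The sketch below shows the same transformation with a pinned reference year; `REFERENCE_YEAR` and its value are hypothetical, not something the project defines.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # hypothetical fixed value; the notebook uses the current year

# Toy frame covering the model years seen in the dataset summary (2023 to 2025)
toy = pd.DataFrame({"year": [2023, 2024, 2025]})
toy["vehicle_age"] = REFERENCE_YEAR - toy["year"]
toy = toy.drop(columns=["year"])  # the raw year is redundant once age exists

print(toy["vehicle_age"].tolist())  # [2, 1, 0]
```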


In [51]:
# Step 4: Feature engineering - vehicle age
current_year = datetime.datetime.now().year
df["vehicle_age"] = current_year - df["year"]
df = df.drop(columns=["year"])  # Remove original year



In [52]:
# Step 5: Fill missing values
for col in df.columns:
    if df[col].dtype == "object":  # Categorical
        df[col] = df[col].fillna(df[col].mode()[0])
    else:  # Numeric
        df[col] = df[col].fillna(df[col].mean())

In [53]:
df.isnull().sum()

name              0
description       0
make              0
model             0
price             0
engine            0
cylinders         0
fuel              0
mileage           0
transmission      0
trim              0
body              0
doors             0
exterior_color    0
interior_color    0
drivetrain        0
vehicle_age       0
dtype: int64

### **3️⃣ Encoding**
- Used **Label Encoding** for categorical columns.
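
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a single column. The fuel values are illustrative. The integer codes follow the alphabetical order of the labels and carry no real ranking, which tree-based models tend to tolerate better than linear ones.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Gasoline", "Diesel", "Gasoline", "Electric"])

print(list(le.classes_))  # ['Diesel', 'Electric', 'Gasoline']
print(codes.tolist())     # [2, 0, 2, 1]

# A fitted encoder can translate in both directions
print(le.inverse_transform([0, 1, 2]).tolist())  # ['Diesel', 'Electric', 'Gasoline']
print(int(le.transform(["Diesel"])[0]))          # 0

# Labels that were not present during fit are rejected
try:
    le.transform(["Hydrogen"])
except ValueError as err:
    print("Unseen label:", err)
```

Keeping each fitted encoder, as the notebook does in `label_encoders`, is what allows a new vehicle's text values to be mapped to the same codes at prediction time.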


In [54]:
# Step 5 (continued): Convert categorical columns to numbers using Label Encoding
label_encoders = {}
for col in df.columns:
    if df[col].dtype == "object":
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        label_encoders[col] = le

### **4️⃣ Model Training**
- Started with **Linear Regression** → R² ≈ 0.35 (poor fit).
- Switched to **Random Forest Regressor** → R² ≈ 0.77 (much better fit).
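
The notebook keeps only the final model, with the linear baseline left as a commented-out line. One way to compare both on the same split is sketched below. The `compare_models` helper is hypothetical, and the data is synthetic and deliberately non-linear, so the scores it prints say nothing about the vehicle dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def compare_models(X, y, random_state=42):
    """Fit both models on one split and return their test-set R² scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "linear_regression": LinearRegression(),
        "random_forest": RandomForestRegressor(
            n_estimators=100, random_state=random_state
        ),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }


# Synthetic, non-linear data purely for demonstration (not the vehicle dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = 10 * np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.5, 500)

scores = compare_models(X_demo, y_demo)
print(scores)
```

Passing the notebook's `X` and `y` to the same helper should give the two R² values on an identical split, which keeps the comparison fair.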


In [55]:
# Step 6: Features (X) and Target (y)
X = df.drop("price", axis=1)
y = df["price"]


In [56]:
# Step 7: Split data into train & test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [57]:
# Step 8: Train Random Forest model
# model = LinearRegression()  # earlier baseline; needs `from sklearn.linear_model import LinearRegression`
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


In [58]:
# Step 9: Make predictions
y_pred = model.predict(X_test)

### **5️⃣ Evaluation**
- **MSE:** `68,860,248`
- **R² Score:** `0.7745` (model explains ~77% of price variation)
- **RMSE:** ~`8,300 USD`
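
The RMSE figure is simply the square root of the MSE, which brings the error back into dollars. The short check below uses the MSE printed by the notebook and the mean price from the `df.describe()` output. Relating the two as a percentage is an added illustration, not a metric the project reports.

```python
import math

mse = 68860248.77433605  # Mean Squared Error printed by the notebook
mean_price = 50202.9857  # mean of `price` from df.describe()

rmse = math.sqrt(mse)
print(f"RMSE: {rmse:,.0f} USD")                           # RMSE: 8,298 USD
print(f"RMSE relative to mean price: {rmse / mean_price:.1%}")  # 16.5%
```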


In [59]:
# Step 10: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Model R² Score:", model.score(X_test, y_test))


Mean Squared Error: 68860248.77433605
Model R² Score: 0.7745172929240168


In [60]:
# Step 11: Example prediction. The 16 values are label-encoded placeholders and must follow the column order of X.
example_vehicle = [0, 0, 12, 150, 2, 30000, 1, 5, 3, 4, 10, 7, 8, 5, 6, 4]  
predicted_price = model.predict([example_vehicle])
print("Predicted Vehicle Price: $", predicted_price[0])

Predicted Vehicle Price: $ 152639.01
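
A positional list is easy to get wrong, since every value has to sit in the slot of the matching feature column. A less error-prone pattern is to describe the vehicle by column name, encode the text fields with the stored encoders, and reorder the row to match the training columns. The sketch below shows the idea on a made-up three-feature frame loosely modeled on the sample rows; the real notebook has 16 feature columns, and all names and values here are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Toy training data, made up for illustration
train = pd.DataFrame({
    "make":        ["Jeep", "GMC", "Dodge", "RAM", "Jeep", "GMC"],
    "mileage":     [10.0, 0.0, 32.0, 10.0, 1.0, 5.0],
    "vehicle_age": [1, 1, 2, 1, 1, 2],
    "price":       [74600.0, 96410.0, 46835.0, 81663.0, 50170.0, 90000.0],
})

# Fit and keep one encoder per text column, as the notebook does
encoders = {"make": LabelEncoder()}
train["make"] = encoders["make"].fit_transform(train["make"])

X = train.drop(columns=["price"])
y = train["price"]
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Describe the vehicle by name; key order does not matter here
vehicle = {"vehicle_age": 1, "make": "Jeep", "mileage": 8.0}
row = pd.DataFrame([vehicle])
for col, le in encoders.items():
    row[col] = le.transform(row[col])  # reuse the codes learned during training
row = row[X.columns]                   # same column order as during fit

predicted = model.predict(row)[0]
print(f"Predicted price: ${predicted:,.2f}")
```

Passing a DataFrame with the original column names also avoids the feature-name warning scikit-learn emits when a model fitted on a DataFrame receives a plain list.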




## 📈 Results
✅ **R² improved from about 0.35 to about 0.77** after switching to Random Forest.  
✅ **Prediction error reduced by ~6,000 USD** compared to the first model (linear regression).  
✅ Ready for further tuning, for example with gradient boosting models.
