# Day 29: Car Price Prediction - Comprehensive ML Analysis

## Overview
This notebook provides a thorough exploration and modeling of car price prediction using multiple regression algorithms. We compare Lasso Regression, Random Forest, Gradient Boosting, and XGBoost to determine the best performing model.

## Dataset
- **Source**: Car Price Prediction Dataset
- **Size**: 2,500 samples
- **Features**: Brand, Year, Engine Size, Fuel Type, Transmission, Mileage, Condition, Model
- **Target**: Price

## Objective
Build regression models to predict car prices based on vehicle characteristics and compare model performance.

## 1. Import Required Libraries

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import warnings
import os
warnings.filterwarnings('ignore')

# Set plotly to dark theme
px.defaults.template = "plotly_dark"

# Create directories
os.makedirs('../viz', exist_ok=True)
os.makedirs('../models', exist_ok=True)

print("Libraries imported successfully!")
print(f"Visualization directory: ../viz")
print(f"Models directory: ../models")

## 2. Load and Explore the Dataset

In [None]:
# Load the Dataset
cars = pd.read_csv("../data/car_price_prediction_.csv")
display(cars.head())

print("=" * 40)
display(cars.info())

print("=" * 40)
display(cars.describe())

print("=" * 40)
print(f"Dataset contains {cars.shape[0]} rows and {cars.shape[1]} columns.")

display(cars.isnull().sum())
print("=" * 40)

## 3. Visualize Brand Distribution and Pricing

In [None]:
# Distribution of car brands
fig = px.histogram(cars, x='Brand', 
                   title='Distribution of Car Brands', 
                   labels={'Brand': 'Car Brand', 'count': 'Number of Cars'}, 
                   template='plotly_dark')
fig.update_layout(bargap=0.2)
fig.write_html('../viz/brand_distribution.html')
fig.show()
print("[SAVED] ../viz/brand_distribution.html")

In [None]:
# Average price by brand
fig = px.bar(cars.groupby('Brand')['Price'].mean().reset_index(), 
             x='Brand', y='Price', 
             title='Average Price by Car Brand', 
             labels={'Brand': 'Car Brand', 'Price': 'Average Price'}, 
             template='plotly_dark')
fig.write_html('../viz/avg_price_by_brand.html')
fig.show()
print("[SAVED] ../viz/avg_price_by_brand.html")

## 4. Feature Engineering

In [None]:
# Feature Engineering
cars['Car_Age'] = 2025 - cars['Year']
cars["Mileage_per_Year"] = cars["Mileage"] / (cars["Car_Age"] + 1)

luxury_brands = ["Tesla", "BMW", "Audi", "Mercedes"]
cars["Is_Luxury"] = cars["Brand"].isin(luxury_brands).astype(int)

cars["Fuel_Group"] = cars["Fuel Type"].replace({
    "Petrol": "Traditional", 
    "Diesel": "Traditional", 
    "Hybrid": "Eco", 
    "Electric": "Eco"
})

print("Feature Engineering Complete!")
print(f"\nNew Features:")
print(f"  - Car_Age: Current year (2025) - Year")
print(f"  - Mileage_per_Year: Average annual mileage")
print(f"  - Is_Luxury: 1 if luxury brand, 0 otherwise")
print(f"  - Fuel_Group: Traditional (Petrol/Diesel) vs Eco (Hybrid/Electric)")
print(f"\nTotal features: {cars.shape[1]}")
display(cars.head())

## 5. Data Preprocessing - Encoding and Scaling

In [None]:
# Import preprocessing libraries
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

# Split data before encoding
X = cars.drop(["Price", "Car ID", "Model", "Engine Size", "Condition"], axis=1)
y = cars["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)

print(f"Training set: {X_train.shape}")
print(f"Testing set: {X_test.shape}")

In [None]:
# Column transformer for encoding and scaling
categorical_features = ['Brand', 'Fuel Type', 'Transmission', 'Fuel_Group']
numerical_features = ['Year', 'Mileage', 'Car_Age', 'Mileage_per_Year', 'Is_Luxury']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

print(f"\n[OK] Data preprocessing complete!")
print(f"[OK] Numerical features scaled: {len(numerical_features)}")
print(f"[OK] Categorical features encoded: {len(categorical_features)}")
print(f"[OK] Total transformed features: {X_train.shape[1]}")

## 6. Model Training - Lasso Regression

In [None]:
# Import ML libraries
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)

print("Lasso Regression Performance:")
print(f"MAE: {mean_absolute_error(y_test, y_pred_lasso):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_lasso):.2f}")
print(f"R²: {r2_score(y_test, y_pred_lasso):.4f}")

## 7. Model Training - Random Forest

In [None]:
# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=69)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Regressor Performance:")
print(f"MAE: {mean_absolute_error(y_test, y_pred_rf):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_rf):.2f}")
print(f"R²: {r2_score(y_test, y_pred_rf):.4f}")

## 8. Model Training - Gradient Boosting

In [None]:
# Gradient Boosting Regressor
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=69)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)

print("Gradient Boosting Regressor Performance:")
print(f"MAE: {mean_absolute_error(y_test, y_pred_gb):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_gb):.2f}")
print(f"R²: {r2_score(y_test, y_pred_gb):.4f}")

## 9. Model Training - XGBoost

In [None]:
# XGBoost Regressor
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, objective='reg:squarederror', random_state=69)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

print("XGBoost Regressor Performance:")
print(f"MAE: {mean_absolute_error(y_test, y_pred_xgb):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred_xgb):.2f}")
print(f"R²: {r2_score(y_test, y_pred_xgb):.4f}")

## 10. Model Comparison

In [None]:
# Model Comparison
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)

models = ['Lasso Regression', 'Random Forest', 'Gradient Boosting', 'XGBoost']
mae_values = [
    mean_absolute_error(y_test, y_pred_lasso),
    mean_absolute_error(y_test, y_pred_rf),
    mean_absolute_error(y_test, y_pred_gb),
    mean_absolute_error(y_test, y_pred_xgb)
]
mse_values = [
    mean_squared_error(y_test, y_pred_lasso),
    mean_squared_error(y_test, y_pred_rf),
    mean_squared_error(y_test, y_pred_gb),
    mean_squared_error(y_test, y_pred_xgb)
]
r2_values = [
    r2_score(y_test, y_pred_lasso),
    r2_score(y_test, y_pred_rf),
    r2_score(y_test, y_pred_gb),
    r2_score(y_test, y_pred_xgb)
]

comparison_df = pd.DataFrame({
    'Model': models,
    'MAE': mae_values,
    'MSE': mse_values,
    'R²': r2_values
})

display(comparison_df)

In [None]:
# Visualize Model Comparison
# Normalize for better visualization
comparison_viz = comparison_df.copy()
comparison_viz['MAE_norm'] = comparison_viz['MAE'] / comparison_viz['MAE'].max()
comparison_viz['MSE_norm'] = comparison_viz['MSE'] / comparison_viz['MSE'].max()
comparison_viz['R²_norm'] = comparison_viz['R²'] / comparison_viz['R²'].max()

fig = go.Figure(data=[
    go.Bar(name='MAE', x=comparison_viz['Model'], y=comparison_viz['MAE_norm']),
    go.Bar(name='MSE', x=comparison_viz['Model'], y=comparison_viz['MSE_norm']),
    go.Bar(name='R²', x=comparison_viz['Model'], y=comparison_viz['R²_norm'])
])

fig.update_layout(
    barmode='group', 
    title='Model Performance Comparison (Normalized)', 
    yaxis_title='Normalized Score', 
    template='plotly_dark'
)
fig.write_html('../viz/model_comparison.html')
fig.show()
print("[SAVED] ../viz/model_comparison.html")

## 11. Save Models and Preprocessor

In [None]:
# Model Serialization
import joblib

# Save the best model (XGBoost)
model_path = '../models/car_price_xgboost_model.joblib'
joblib.dump(xgb, model_path)
print(f"[OK] XGBoost model saved to: {model_path}")

# Save the preprocessor
preprocessor_path = '../models/car_price_preprocessor.joblib'
joblib.dump(preprocessor, preprocessor_path)
print(f"[OK] Preprocessor saved to: {preprocessor_path}")

# Save feature names and structure
feature_info = {
    'categorical_features': categorical_features,
    'numerical_features': numerical_features,
    'all_features': X.columns.tolist()
}
feature_info_path = '../models/car_price_feature_info.joblib'
joblib.dump(feature_info, feature_info_path)
print(f"[OK] Feature information saved to: {feature_info_path}")

## 12. Prediction Function for New Cars

In [None]:
# Prediction Function for New Cars
def predict_car_price(year, brand, fuel_type, transmission, mileage):
    """
    Predict car price using the trained XGBoost model.
    
    Parameters:
    -----------
    year : int
        Year of manufacture (e.g., 2020, 2023, 2024)
    brand : str
        Car brand (e.g., 'Toyota', 'BMW', 'Tesla', 'Honda')
    fuel_type : str
        Fuel type ('Petrol', 'Diesel', 'Hybrid', 'Electric')
    transmission : str
        Transmission type ('Manual', 'Automatic')
    mileage : float
        Current mileage in kilometers
    
    Returns:
    --------
    float : Predicted price in currency units
    """
    
    # Calculate derived features
    current_year = 2025
    car_age = current_year - year
    mileage_per_year = mileage / (car_age + 1)
    
    # Determine if luxury brand
    luxury_brands = ["Tesla", "BMW", "Audi", "Mercedes"]
    is_luxury = 1 if brand in luxury_brands else 0
    
    # Determine fuel group
    fuel_group_map = {
        "Petrol": "Traditional",
        "Diesel": "Traditional",
        "Hybrid": "Eco",
        "Electric": "Eco"
    }
    fuel_group = fuel_group_map.get(fuel_type, "Traditional")
    
    # Create input dataframe with the same structure as training data
    input_data = pd.DataFrame({
        'Brand': [brand],
        'Year': [year],
        'Mileage': [mileage],
        'Car_Age': [car_age],
        'Mileage_per_Year': [mileage_per_year],
        'Is_Luxury': [is_luxury],
        'Fuel Type': [fuel_type],
        'Transmission': [transmission],
        'Fuel_Group': [fuel_group]
    })
    
    # Apply the same preprocessing
    input_transformed = preprocessor.transform(input_data)
    
    # Make prediction
    predicted_price = xgb.predict(input_transformed)[0]
    
    return predicted_price

print("[OK] Prediction function defined successfully!")

## 13. Test Predictions

In [None]:
# Test the prediction function with sample cars
print("\n" + "="*80)
print("CAR PRICE PREDICTION - SAMPLE TESTS")
print("="*80)

test_cars = [
    {
        "name": "2023 Toyota Camry",
        "year": 2023,
        "brand": "Toyota",
        "fuel_type": "Petrol",
        "transmission": "Automatic",
        "mileage": 15000
    },
    {
        "name": "2020 BMW X5",
        "year": 2020,
        "brand": "BMW",
        "fuel_type": "Diesel",
        "transmission": "Automatic",
        "mileage": 45000
    },
    {
        "name": "2024 Tesla Model 3",
        "year": 2024,
        "brand": "Tesla",
        "fuel_type": "Electric",
        "transmission": "Automatic",
        "mileage": 5000
    },
    {
        "name": "2019 Honda Civic",
        "year": 2019,
        "brand": "Honda",
        "fuel_type": "Petrol",
        "transmission": "Manual",
        "mileage": 60000
    },
    {
        "name": "2022 Audi A6 Hybrid",
        "year": 2022,
        "brand": "Audi",
        "fuel_type": "Hybrid",
        "transmission": "Automatic",
        "mileage": 25000
    }
]

for i, car in enumerate(test_cars, 1):
    car_name = car.pop("name")
    predicted_price = predict_car_price(**car)
    print(f"\n{i}. {car_name}")
    print(f"   Year: {car['year']} | Brand: {car['brand']} | Fuel: {car['fuel_type']} | Transmission: {car['transmission']}")
    print(f"   Mileage: {car['mileage']:,} km")
    print(f"   Predicted Price: Rs.{predicted_price:,.2f}")

print("\n" + "="*80)

## 14. Robust Predictor Class with Validation

In [None]:
# Input Validation and Error Handling
class CarPricePredictor:
    """
    Complete car price prediction system with validation and error handling.
    """
    
    def __init__(self, model_path, preprocessor_path, feature_info_path):
        """
        Initialize the predictor with saved model and preprocessor.
        
        Parameters:
        -----------
        model_path : str
            Path to saved XGBoost model
        preprocessor_path : str
            Path to saved preprocessor
        feature_info_path : str
            Path to saved feature information
        """
        self.model = joblib.load(model_path)
        self.preprocessor = joblib.load(preprocessor_path)
        self.feature_info = joblib.load(feature_info_path)
        
        self.valid_brands = ["Toyota", "Honda", "BMW", "Audi", "Mercedes", "Tesla", "Maruti", "Hyundai", "Ford", "Mahindra"]
        self.valid_fuel_types = ["Petrol", "Diesel", "Hybrid", "Electric"]
        self.valid_transmissions = ["Manual", "Automatic"]
        
    def validate_input(self, year, brand, fuel_type, transmission, mileage):
        """Validate input parameters."""
        errors = []
        
        current_year = 2025
        if year < 1990 or year > current_year:
            errors.append(f"Year must be between 1990 and {current_year}")
        
        if brand not in self.valid_brands:
            errors.append(f"Brand must be one of: {', '.join(self.valid_brands)}")
        
        if fuel_type not in self.valid_fuel_types:
            errors.append(f"Fuel type must be one of: {', '.join(self.valid_fuel_types)}")
        
        if transmission not in self.valid_transmissions:
            errors.append(f"Transmission must be one of: {', '.join(self.valid_transmissions)}")
        
        if mileage < 0:
            errors.append("Mileage cannot be negative")
        
        if errors:
            return False, errors
        return True, []
    
    def predict(self, year, brand, fuel_type, transmission, mileage):
        """
        Predict car price with validation.
        
        Returns:
        --------
        dict : Prediction result with price and confidence metrics
        """
        # Validate input
        is_valid, errors = self.validate_input(year, brand, fuel_type, transmission, mileage)
        if not is_valid:
            return {
                "status": "error",
                "errors": errors,
                "price": None
            }
        
        # Calculate derived features
        current_year = 2025
        car_age = current_year - year
        mileage_per_year = mileage / (car_age + 1)
        
        # Determine if luxury brand
        luxury_brands = ["Tesla", "BMW", "Audi", "Mercedes"]
        is_luxury = 1 if brand in luxury_brands else 0
        
        # Determine fuel group
        fuel_group_map = {
            "Petrol": "Traditional",
            "Diesel": "Traditional",
            "Hybrid": "Eco",
            "Electric": "Eco"
        }
        fuel_group = fuel_group_map.get(fuel_type, "Traditional")
        
        # Create input dataframe
        input_data = pd.DataFrame({
            'Brand': [brand],
            'Year': [year],
            'Mileage': [mileage],
            'Car_Age': [car_age],
            'Mileage_per_Year': [mileage_per_year],
            'Is_Luxury': [is_luxury],
            'Fuel Type': [fuel_type],
            'Transmission': [transmission],
            'Fuel_Group': [fuel_group]
        })
        
        # Apply preprocessing
        input_transformed = self.preprocessor.transform(input_data)
        
        # Make prediction
        predicted_price = self.model.predict(input_transformed)[0]
        
        return {
            "status": "success",
            "year": year,
            "brand": brand,
            "fuel_type": fuel_type,
            "transmission": transmission,
            "mileage": mileage,
            "predicted_price": round(predicted_price, 2),
            "car_age": car_age,
            "is_luxury": is_luxury
        }

print("[OK] CarPricePredictor class defined successfully!")

In [None]:
# Initialize the predictor
print("\nInitializing Car Price Predictor...")
predictor = CarPricePredictor(
    model_path='../models/car_price_xgboost_model.joblib',
    preprocessor_path='../models/car_price_preprocessor.joblib',
    feature_info_path='../models/car_price_feature_info.joblib'
)
print("[OK] Predictor initialized successfully!")

# Test with sample predictions
print("\n" + "="*80)
print("CAR PRICE PREDICTOR - ROBUST SYSTEM TEST")
print("="*80)

test_cases = [
    {"year": 2023, "brand": "Toyota", "fuel_type": "Petrol", "transmission": "Automatic", "mileage": 15000},
    {"year": 2020, "brand": "BMW", "fuel_type": "Diesel", "transmission": "Automatic", "mileage": 45000},
    {"year": 2024, "brand": "Tesla", "fuel_type": "Electric", "transmission": "Automatic", "mileage": 5000},
    {"year": 2019, "brand": "Honda", "fuel_type": "Petrol", "transmission": "Manual", "mileage": 60000},
]

for i, case in enumerate(test_cases, 1):
    result = predictor.predict(**case)
    
    if result["status"] == "success":
        print(f"\n{i}. {result['year']} {result['brand']}")
        print(f"   Fuel: {result['fuel_type']} | Transmission: {result['transmission']}")
        print(f"   Mileage: {result['mileage']:,} km | Age: {result['car_age']} years")
        print(f"   Luxury: {'Yes' if result['is_luxury'] else 'No'}")
        print(f"   Predicted Price: Rs.{result['predicted_price']:,.2f}")
    else:
        print(f"\n{i}. Prediction Error:")
        for error in result["errors"]:
            print(f"   - {error}")

print("\n" + "="*80)

## Summary

### Key Results
- **Dataset**: 2,500 car samples with comprehensive features
- **Models Compared**: Lasso, Random Forest, Gradient Boosting, XGBoost
- **Best Model**: XGBoost with highest R² score
- **Feature Engineering**: Car age, mileage per year, luxury indicator, fuel grouping

### Insights
1. **Brand positioning** significantly impacts pricing (luxury vs standard)
2. **Car age and mileage** are strong depreciation indicators
3. **Fuel type evolution**: Electric/Hybrid gaining premium positioning
4. **Transmission preference**: Automatic correlates with higher prices

### Production Ready
- Models saved for deployment
- Robust prediction system with validation
- Easy-to-use API for price estimation