# Car Price Prediction with Regression Model

## Project Overview

This project aims to predict the **price** of cars based on various features, such as the **make**, **model**, **year**, **mileage**, and **condition**. By analyzing these factors and their impact on car prices, we can build a predictive model to estimate the value of a car based on its attributes.

## About the Dataset

The dataset used in this project is a synthetic dataset generated to simulate real-world car price variability. It contains information on car prices and associated features, offering a solid foundation for data analysis and regression modeling.

## Data Source

The dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/mrsimple07/car-prices-prediction-data/data

## Dataset Summary
The dataset contains multiple features related to car characteristics, which are detailed below:

- **Make**: The brand or manufacturer of the car (e.g., Toyota, Honda, Ford).
- **Model**: The specific model of the car (e.g., Camry, Civic, F-150).
- **Year**: The manufacturing year of the car.
- **Mileage**: The total mileage (in miles) of the car.
- **Condition**: The condition of the car, categorized as Excellent, Good, or Fair.
- **Price**: The target variable, representing the price of the car.
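
The cleaned file loaded later in this notebook contains an `Age` column and a numeric `Condition` column instead of `Year` and the text labels. The cleaning step itself is not shown here, so the following is only a minimal sketch of how such columns could be derived. The mapping `Fair=1, Good=2, Excellent=3`, the reference year `2024`, and all sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw rows in the shape described above (values are made up).
raw = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford"],
    "Model": ["Camry", "Civic", "F-150"],
    "Year": [2020, 2014, 2018],
    "Mileage": [21000, 88000, 54000],
    "Condition": ["Excellent", "Fair", "Good"],
})

# Assumed ordinal encoding; the notebook's real mapping is not shown.
condition_order = {"Fair": 1, "Good": 2, "Excellent": 3}
REFERENCE_YEAR = 2024  # assumed reference year for computing the car's age

prepared = raw.assign(
    Condition=raw["Condition"].map(condition_order),  # text label -> integer rank
    Age=REFERENCE_YEAR - raw["Year"],                 # manufacturing year -> age in years
)[["Mileage", "Condition", "Age"]]

print(prepared)
```

An ordinal encoding keeps the natural order of the condition labels, which a linear model can use directly; a one-hot encoding would be the alternative if that order were not meaningful.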

## Problem Statement

This project is designed for exploratory data analysis (EDA) and predictive modeling. The main objectives include:
1. Conducting exploratory data analysis to understand data distribution and relationships.
2. Building and evaluating regression models to predict car prices based on available features.
3. Implementing feature engineering techniques for improved model performance.

### Techniques Used
- **Data Cleaning**: Handling data types, encoding categorical variables, and checking data consistency.
- **Data Visualization**: Plotting distributions, correlations, and feature impacts on car prices.
- **Regression Modeling**: Using models like Linear Regression, Ridge, Lasso, and others to predict car prices.
- **Evaluation Metrics**: Analyzing model performance using metrics like R2, RMSE, and MAE.
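
To make the evaluation metrics concrete, here is a small sketch that computes R2, MAE, MSE, and RMSE by hand in plain Python. The helper name `regression_metrics` and the toy values are hypothetical; the notebook itself relies on the scikit-learn functions imported below.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MSE and RMSE from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "mse": mse, "rmse": rmse}

# Toy example: every prediction is off by exactly 10.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(metrics)
```

MAE and RMSE are expressed in the same units as the price, which makes them easier to interpret than MSE.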

### Load Libraries

In [2]:
# General
import pandas as pd
import numpy as np
import os
import warnings

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model and Evaluation Metrics
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

### Settings

In [3]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
csv_path = os.path.join(data_path, "CarPrice_cleaned.csv")

### Load Data

In [4]:
df = pd.read_csv(csv_path)

In [5]:
# Check data
df.head()

Unnamed: 0,Mileage,Condition,Price,Age
0,18107,3,19094.75,2
1,13578,3,27321.1,10
2,46054,1,23697.3,8
3,34981,3,18251.05,2
4,63565,3,19821.85,5


### Preprocessing

In [6]:
# Separate Input and Output features
X = df.drop(columns="Price")
y = df["Price"]

In [7]:
# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
# Standardize the data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
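
The scaler is fitted on the training split only, so the test split is transformed with the training mean and standard deviation. The sketch below illustrates this on synthetic data (the `*_demo` names and the random values are assumptions, not the car dataset) by reproducing `transform` by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features (not the car dataset).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50_000, scale=20_000, size=(80, 2))
X_test_demo = rng.normal(loc=50_000, scale=20_000, size=(20, 2))

# Fit on the training split only, then reuse those statistics on the test split.
scaler_demo = StandardScaler().fit(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# The same transformation by hand: training mean and population std (ddof=0).
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(np.allclose(X_test_scaled, manual))
```

Fitting the scaler on the full dataset instead would let test-set statistics leak into preprocessing.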

### Model Training and Evaluation

In [9]:
# Define function to train the model and evaluate by different metrics

def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict with trained model
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print the evaluation metrics
    print("=" * 60)
    print("EVALUATION METRICS ON TRAIN DATA")
    print("=" * 60)
    print(f"R2 Score: {r2_score(y_train, y_train_pred)}")
    print(f"MAE: {mean_absolute_error(y_train, y_train_pred)}")
    print(f"MSE: {mean_squared_error(y_train, y_train_pred)}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred))}")
    print("=" * 60)
    print("EVALUATION METRICS ON TEST DATA")
    print("=" * 60)
    print(f"R2 Score: {r2_score(y_test, y_test_pred)}")
    print(f"MAE: {mean_absolute_error(y_test, y_test_pred)}")
    print(f"MSE: {mean_squared_error(y_test, y_test_pred)}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred))}")
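
When several models need to be compared, it can be convenient to return the metrics instead of printing them. The following is a sketch of one possible variant, not the function used in this notebook: the name `evaluate_model`, its explicit data arguments, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return train/test metrics as a nested dict."""
    model.fit(X_tr, y_tr)
    results = {}
    for name, X_part, y_part in (("train", X_tr, y_tr), ("test", X_te, y_te)):
        pred = model.predict(X_part)
        mse = mean_squared_error(y_part, pred)  # computed once, reused for RMSE
        results[name] = {
            "r2": r2_score(y_part, pred),
            "mae": mean_absolute_error(y_part, pred),
            "mse": mse,
            "rmse": float(np.sqrt(mse)),
        }
    return results

# Synthetic linear data, used only to show the call.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = evaluate_model(Ridge(), X_demo[:160], y_demo[:160], X_demo[160:], y_demo[160:])
print(scores["test"])
```

Passing the data as arguments, rather than reading global variables, makes the function reusable with other splits.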

In [10]:
# Try Ridge
r = Ridge()
train_evaluate(r)

EVALUATION METRICS ON TRAIN DATA
R2 Score: 0.9999983326732879
MAE: 4.529409091646016
MSE: 29.067126463763692
RMSE: 5.391393740375831
EVALUATION METRICS ON TEST DATA
R2 Score: 0.9999983172826974
MAE: 4.823706075947467
MSE: 34.09718222813823
RMSE: 5.83927925587895


### Insights

The Ridge regression model performs exceptionally well, with near-perfect R² scores and low errors on both the training and test data. A closer look:

#### Training Performance:

- **R2 Score: 0.9999983** This indicates that the model explains nearly all the variance in car prices on the training data.
- **MAE: 4.53** On average, predictions are off by about 4.53 price units, which is small next to the prices in the sample rows above (roughly 18,000 to 27,000).
- **MSE: 29.07, RMSE: 5.39** The RMSE is close to the MAE, which suggests the errors are fairly uniform in size and not dominated by a few large misses.

#### Testing Performance:

- **R2 Score: 0.9999983** This is almost identical to the training R2 score, which suggests the model generalizes well to this test split and shows no sign of overfitting.
- **MAE: 4.82** A slight increase from the training data but still very low, showing the model’s stability.
- **MSE: 34.10, RMSE: 5.84** Slightly higher than the training error, but these values remain low, indicating effective performance on unseen data.

#### Analysis:

- **Very High R2 and Low Errors:** With R2 values close to **1** and very low error metrics, the model captures the relationship in the data almost perfectly.
- **Slightly Higher Testing Errors:** A small increase in test errors (MAE, MSE, and RMSE) is normal and expected, indicating that the model generalizes well without significant overfitting.
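
A single train/test split gives one estimate of generalization. As a complementary check, k-fold cross-validation could be used; the sketch below assumes synthetic data and a pipeline that refits the scaler inside each fold. It is not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic linear data standing in for the car features (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([5.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=300)

# The pipeline refits the scaler on each training fold, avoiding leakage.
pipeline = make_pipeline(StandardScaler(), Ridge())
cv_r2 = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")

print(cv_r2.mean(), cv_r2.std())
```

A small spread across folds would support the conclusion that the train/test agreement is not an artifact of one particular split.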

#### Conclusion:

The Ridge regression model is highly effective for car price prediction, with a near-perfect R2 and very small errors. It was trained with scikit-learn's default settings (`Ridge()`), and since the test RMSE is already about 5.84, further tuning can only yield small improvements in absolute terms.
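
If tuning were attempted, the regularization strength `alpha` would be the natural parameter to search. Here is a minimal sketch using `RidgeCV` on synthetic data; the alpha grid and the data are assumptions chosen for illustration, and no tuning results for the car dataset are implied.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic linear data (not the car dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([4.0, -1.0, 2.0]) + rng.normal(scale=0.3, size=200)

# Candidate regularization strengths on a log scale: 0.001, 0.01, ..., 100.
alphas = np.logspace(-3, 2, 6)

# RidgeCV picks the alpha with the best cross-validated score.
ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
best_alpha = ridge_cv.alpha_

print(best_alpha)
```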

In [11]:
# Try Lasso
l = Lasso()
train_evaluate(l)

EVALUATION METRICS ON TRAIN DATA
R2 Score: 0.999999880290013
MAE: 1.180915114828806
MSE: 2.086948711100448
RMSE: 1.4446275336917984
EVALUATION METRICS ON TEST DATA
R2 Score: 0.9999998771698237
MAE: 1.2403011227311618
MSE: 2.488928411245058
RMSE: 1.5776338013763074


### Insights

The Lasso regression model achieves exceptionally high accuracy, with even lower errors than Ridge regression. A detailed breakdown:

#### Training Performance:

- **R2 Score: 0.99999988** Indicates that nearly **100%** of the variance in car prices is explained by the model on the training data.
- **MAE: 1.18, MSE: 2.09, RMSE: 1.44** These low error values highlight that the model’s predictions are extremely close to the actual values, suggesting a high level of accuracy.

#### Testing Performance:

- **R2 Score: 0.99999988** Practically identical to the training R2 score, showing that the model generalizes very well to unseen data.
- **MAE: 1.24** Slightly higher than the training MAE but still remarkably low, indicating the model’s stability.
- **MSE: 2.49, RMSE: 1.58** Both metrics are slightly higher on the test set but remain very low overall, reinforcing the model’s strong predictive power.

#### Analysis:

- **High Accuracy and Minimal Errors:** The R2 values are close to **1**, and the MAE, MSE, and RMSE metrics are exceptionally low, indicating the model’s excellent fit to the data.
- **Good Generalization:** The minimal increase in error metrics on the test data implies that the model avoids overfitting and performs consistently well on unseen data.

#### Conclusion:
The Lasso regression model performs better than Ridge on this dataset, with a test RMSE of about 1.58 versus 5.84, although both errors are tiny relative to the prices. One possible explanation is Lasso's L1 penalty, which can set the coefficients of less relevant features to exactly zero. The fitted coefficients are not inspected in this notebook, so this remains a hypothesis.
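
One way to check the sparsity explanation would be to compare the fitted `coef_` arrays of both models. The sketch below does this on synthetic data in which the third feature has no effect on the target; the data, the `alpha=0.1` setting for Lasso, and the variable names are assumptions, and the result says nothing about the coefficients in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: the third feature is irrelevant (true coefficient 0).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=500)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)  # alpha chosen for illustration

# Ridge shrinks coefficients toward zero; Lasso can set them exactly to zero.
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
print("Coefficients zeroed by Lasso:", int(np.sum(lasso.coef_ == 0.0)))
```

If Lasso zeroes no coefficient on the real data, the gap between the two models more likely comes from how strongly each default penalty shrinks the coefficients.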