# Project Objective

The objective of this project is to develop a regression-based machine learning model that accurately predicts the market price of used cars based on their technical specifications and usage history.
This system helps buyers and sellers estimate a fair price for a vehicle and understand the key factors influencing car value.


## Dataset Description

The dataset contains historical records of used cars along with their features and selling prices.


### Data Dictionary

	•	Brand: Manufacturer of the car (e.g., Toyota, BMW, Honda)
	•	Model: Specific model name of the car
	•	Year: Year of manufacture
	•	Mileage: Total distance the car has been driven (km or miles)
	•	Fuel Type: Type of fuel used (Petrol, Diesel, Electric, Hybrid)
	•	Transmission: Transmission type (Manual or Automatic)
	•	Engine Size: Engine displacement (in cc or liters)
	•	Owner Type: First owner, second owner, etc. (if available)
	•	Location: City or region where the car is listed
	•	Price: Selling price of the car (target variable)


## Problem Type

This is a regression problem: the goal is to predict a continuous numerical value, the car’s selling price.



## Machine Learning Algorithms Considered

Multiple regression models are trained and compared to identify the most accurate approach:
	•	Linear Regression (baseline model)
	•	Polynomial Regression
	•	Decision Tree Regressor
	•	Random Forest Regressor
	•	Gradient Boosting Regressor (optional advanced model)



# Methodology

1. Data Loading and Initial Inspection
	•	Load the dataset using Pandas.
	•	Inspect the dataset structure using head(), info(), and describe().
	•	Check for missing values and handle them through imputation or removal.
	•	Identify and assess outliers in features such as price and mileage.
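
The inspection steps above can be sketched as follows. The rows in `CSV_TEXT` are hypothetical placeholder values, not records from the actual dataset, and the median imputation and 1.5 × IQR outlier rule are only one reasonable choice among several.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real dataset file.
CSV_TEXT = """Brand,Model,Year,Mileage,Fuel Type,Transmission,Engine Size,Owner Type,Location,Price
Toyota,Corolla,2015,60000,Petrol,Manual,1.8,First,CityA,9000
BMW,X1,2018,,Diesel,Automatic,2.0,First,CityB,25000
Honda,City,2012,95000,Petrol,Manual,1.5,Second,CityC,6000
Toyota,Camry,2017,40000,Hybrid,Automatic,2.5,First,CityB,21000
Honda,Civic,2016,52000,Petrol,Automatic,1.8,Second,CityA,300000
"""

# With a real file this would be pd.read_csv("<path to the dataset>").
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Structure, dtypes, and summary statistics.
print(df.head())
df.info()
print(df.describe())

# Count missing values per column.
missing = df.isna().sum()
print(missing)

# Impute the missing mileage with the column median (one simple option).
df["Mileage"] = df["Mileage"].fillna(df["Mileage"].median())


def iqr_outliers(series):
    """Return a boolean mask marking values outside 1.5 * IQR of the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


price_outliers = iqr_outliers(df["Price"])
print(df[price_outliers])
```

Flagged rows are only candidates for review; whether to remove or cap them is decided later, during preprocessing.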


2. Exploratory Data Analysis (EDA)
	•	Analyze distributions of numerical features such as Year, Mileage, Engine Size, and Price.
	•	Examine how car price varies with Brand, Fuel Type, and Transmission.
	•	Visualize correlations to identify features with the strongest influence on price.
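
A minimal sketch of these EDA steps, using a small made-up DataFrame in place of the real data. The numbers are illustrative only. Plots are omitted here; the same tables could be passed to a plotting library for histograms or a correlation heatmap.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the project dataset.
df = pd.DataFrame({
    "Year": [2012, 2015, 2016, 2017, 2018, 2019],
    "Mileage": [95000, 60000, 52000, 40000, 30000, 15000],
    "Engine Size": [1.5, 1.8, 1.8, 2.5, 2.0, 2.0],
    "Fuel Type": ["Petrol", "Petrol", "Petrol", "Hybrid", "Diesel", "Diesel"],
    "Transmission": ["Manual", "Manual", "Automatic", "Automatic", "Automatic", "Automatic"],
    "Price": [6000, 9000, 11000, 21000, 25000, 28000],
})

numeric_cols = ["Year", "Mileage", "Engine Size", "Price"]

# Distribution summary of the numerical features.
numeric_summary = df[numeric_cols].describe()

# How price varies across a categorical feature.
median_by_fuel = (
    df.groupby("Fuel Type")["Price"].median().sort_values(ascending=False)
)

# Pearson correlation of each numeric feature with Price,
# ordered by absolute strength.
corr = df[numeric_cols].corr()["Price"].drop("Price")
ranked_corr = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(numeric_summary)
print(median_by_fuel)
print(ranked_corr)
```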



3. Data Preprocessing
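	•	Encode categorical variables such as Brand, Fuel Type, Transmission, and Location using:
		•	Label Encoding (suitable when the categories have a natural order, such as Owner Type), or
		•	One-Hot Encoding (preferred for non-ordinal features)
	•	Apply a transformation (for example, a log transform) to strongly skewed variables such as Price or Mileage if necessary.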
	•	Remove or cap extreme outliers where appropriate.
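
The preprocessing steps could look roughly like this. The data is made up, and the 5th/95th percentile caps and the `LogPrice` column name are assumptions chosen for the example rather than project settings.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; one mileage value is deliberately extreme.
df = pd.DataFrame({
    "Brand": ["Toyota", "BMW", "Honda", "Toyota"],
    "Fuel Type": ["Petrol", "Diesel", "Petrol", "Hybrid"],
    "Transmission": ["Manual", "Automatic", "Manual", "Automatic"],
    "Mileage": [60000.0, 20000.0, 95000.0, 900000.0],
    "Price": [9000.0, 25000.0, 6000.0, 8000.0],
})

# Cap extreme mileage values at the 5th and 95th percentiles.
low, high = df["Mileage"].quantile([0.05, 0.95])
df["Mileage"] = df["Mileage"].clip(lower=low, upper=high)

# Reduce right skew in the target with log(1 + x); invert with np.expm1.
df["LogPrice"] = np.log1p(df["Price"])

# One-hot encode the non-ordinal categorical features.
encoded = pd.get_dummies(df, columns=["Brand", "Fuel Type", "Transmission"])
print(encoded.columns.tolist())
```

If the model is trained on `LogPrice`, its predictions need to be converted back with `np.expm1` before they are reported as prices.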



4. Feature Scaling
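	•	Apply feature scaling to numerical variables such as Mileage, Engine Size, and Year.
	•	Use StandardScaler or MinMaxScaler where required; scaling matters most for linear and polynomial models, while tree-based models are largely insensitive to it.
	•	Fit the scaler on the training data only and reuse the fitted scaler at test and inference time, so that preprocessing stays consistent and no test information leaks into training.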
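
One way to keep scaling consistent between training and inference is to fit the scaler once and reuse the fitted object. A small sketch with made-up values for Year, Mileage, and Engine Size:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training rows: [Year, Mileage, Engine Size].
X_train = np.array([
    [2012, 95000, 1.5],
    [2015, 60000, 1.8],
    [2017, 40000, 2.5],
    [2018, 20000, 2.0],
], dtype=float)

# A new car seen only at inference time.
X_new = np.array([[2016, 52000, 1.8]], dtype=float)

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the same statistics for new data; do not refit here.
X_new_scaled = scaler.transform(X_new)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_new_scaled)
```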



5. Train-Test Split
	•	Split the dataset into training and testing sets:
		•	Training set: 70–80%
		•	Testing set: 20–30%
	•	Use a fixed random state for reproducibility.
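
A sketch of an 80/20 split with a fixed seed. The arrays are placeholders, and `random_state=42` is an arbitrary choice; any fixed integer gives a reproducible split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (20 rows, 2 columns) and target vector.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 20% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```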



6. Model Training
	•	Train multiple regression models using the training dataset.
	•	Tune hyperparameters for tree-based models (for example, tree depth and number of estimators) using cross-validation on the training set.
	•	Store trained models for evaluation and deployment.
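
The training step might be organised as below. The data is synthetic, generated from an assumed linear price formula purely so the example runs; the hyperparameter grid and the file name `best_model.joblib` are likewise illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data: age (years), mileage (thousand km), engine size (l).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 15, n)
mileage = rng.uniform(5, 200, n)
engine = rng.uniform(1.0, 3.0, n)
X_train = np.column_stack([age, mileage, engine])
y_train = 30000 - 1200 * age - 40 * mileage + 4000 * engine + rng.normal(0, 500, n)

# Candidate models; polynomial regression is a pipeline around LinearRegression.
models = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
for model in models.values():
    model.fit(X_train, y_train)

# Tune the random forest with a small cross-validated grid search.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    cv=3,
    scoring="r2",
)
search.fit(X_train, y_train)
models["random_forest"] = search.best_estimator_
print(search.best_params_)

# Persist a trained model so it can be reloaded for evaluation or deployment.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(models["random_forest"], model_path)
restored = joblib.load(model_path)
```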


7. Model Evaluation

Models are evaluated on the test dataset using:
	•	R² Score (primary metric)
	•	Mean Absolute Error (MAE)
	•	Mean Squared Error (MSE)
	•	Root Mean Squared Error (RMSE)
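
These metrics can be computed with scikit-learn as sketched below. The true and predicted prices are made-up numbers chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices for four test cars.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 23000.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
rmse = np.sqrt(mse)                        # same units as the price
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"R2={r2:.3f}  MAE={mae:.0f}  MSE={mse:.0f}  RMSE={rmse:.1f}")
```

MAE and RMSE are expressed in the same currency units as the price, which makes them easier to interpret than MSE; RMSE penalises large errors more heavily than MAE.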



8. Model Selection and Validation
	•	Compare model performance across all evaluation metrics.
	•	Select the best-performing model based on predictive accuracy and generalization.
	•	Check for overfitting by comparing training and testing performance.
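
A sketch of the overfitting check, comparing training and testing R² for two candidates on synthetic, noisy data. The data-generating formula and the model names in `candidates` are assumptions made for the example. An unrestricted decision tree is included because it typically memorises the training set and so shows a large gap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus substantial noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 3))
y = 5000 * X[:, 0] - 2000 * X[:, 1] + rng.normal(0, 8000, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "linear": LinearRegression(),
    "deep_tree": DecisionTreeRegressor(random_state=42),  # no depth limit
}


def train_test_gap(model):
    """Fit the model and return (train R2, test R2, train minus test)."""
    model.fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    return train_r2, test_r2, train_r2 - test_r2


report = {name: train_test_gap(model) for name, model in candidates.items()}
for name, (train_r2, test_r2, gap) in report.items():
    print(f"{name}: train R2={train_r2:.3f}  test R2={test_r2:.3f}  gap={gap:.3f}")

# Pick the candidate with the best test score.
best_name = max(report, key=lambda name: report[name][1])
print("selected:", best_name)
```

A large positive gap between training and testing R² suggests overfitting; a model with a slightly lower training score but a smaller gap usually generalises better.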



## Final Outcome

The final model is integrated into a web-based application that allows users to input car details and receive a real-time estimated market price along with insights into key influencing factors.

In [1]:
import pandas as pd