# Project Objective

The objective of this project is to develop a regression-based machine learning model that accurately predicts the market price of used cars based on their technical specifications and usage history.
This system helps buyers and sellers estimate a fair price for a vehicle and understand the key factors influencing car value.


## Dataset Description

The dataset contains historical records of used cars along with their features and selling prices.


## Data Dictionary

	•	Brand: Manufacturer of the car (e.g., Toyota, BMW, Honda)
	•	Model: Specific model name of the car
	•	Year: Year of manufacture
	•	Mileage: Total distance the car has been driven (km or miles)
	•	Fuel Type: Type of fuel used (Petrol, Diesel, Electric, Hybrid)
	•	Transmission: Transmission type (Manual or Automatic)
	•	Engine Size: Engine displacement (in cc or liters)
	•	Owner Type: First owner, second owner, etc. (if available)
	•	Location: City or region where the car is listed
	•	Price: Selling price of the car (target variable)


## Problem Type

This is a regression problem: the goal is to predict a continuous numerical value, the car’s selling price.



## Machine Learning Algorithms Considered

Multiple regression models are trained and compared to identify the most accurate approach:
	•	Linear Regression (baseline model)
	•	Polynomial Regression
	•	Decision Tree Regressor
	•	Random Forest Regressor
	•	Gradient Boosting Regressor (optional advanced model)



# Methodology

1. Data Loading and Initial Inspection
	•	Load the dataset using Pandas.
	•	Inspect the dataset structure using head(), info(), and describe().
	•	Check for missing values and handle them through imputation or removal.
	•	Identify and assess outliers in features such as price and mileage.
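
The missing-value and outlier checks above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the project's actual cleaning code. The median imputation, the 1.5 × IQR rule, and the helper name `iqr_outlier_mask` are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy = pd.DataFrame({
    "odometer": [16639.0, 9393.0, np.nan, 14282.0, 999999.0],
    "sellingprice": [21500.0, 21500.0, 30000.0, np.nan, 67000.0],
})

# Removal: a row without a target price cannot be used for training.
toy = toy.dropna(subset=["sellingprice"])

# Imputation: fill missing feature values with the column median.
toy["odometer"] = toy["odometer"].fillna(toy["odometer"].median())

def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR outside the interquartile range."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

outliers = iqr_outlier_mask(toy["odometer"])
print(toy[outliers])
```

Whether to impute or drop, and which outlier rule to use, depends on how much data is missing and on the feature's distribution. The choices here are only one reasonable option.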


2. Exploratory Data Analysis (EDA)
	•	Analyze distributions of numerical features such as Year, Mileage, Engine Size, and Price.
	•	Examine how car price varies with Brand, Fuel Type, and Transmission.
	•	Visualize correlations to identify features with the strongest influence on price.
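
As a rough sketch of the EDA steps above, the snippet below computes correlations with price and a per-brand price summary. The toy frame is a hypothetical sample with made-up scale. On the real data the same calls would be applied to the full dataframe, and plots would normally accompany these numbers.

```python
import pandas as pd

# Hypothetical sample rows, used only to illustrate the calls.
toy = pd.DataFrame({
    "make": ["Kia", "Kia", "BMW", "Volvo", "BMW"],
    "year": [2015, 2015, 2014, 2015, 2014],
    "odometer": [16639.0, 9393.0, 1331.0, 14282.0, 2641.0],
    "sellingprice": [21500.0, 21500.0, 30000.0, 27750.0, 67000.0],
})

# Correlation of each numerical feature with the target,
# ordered by absolute strength.
numeric = toy[["year", "odometer", "sellingprice"]]
corr_with_price = (
    numeric.corr()["sellingprice"]
    .drop("sellingprice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

# How price varies with a categorical feature: median price per brand.
median_price_by_make = toy.groupby("make")["sellingprice"].median()

print(corr_with_price)
print(median_price_by_make)
```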



3. Data Preprocessing
	•	Encode categorical variables such as Brand, Fuel Type, Transmission, and Location using:
		•	Label Encoding, or
		•	One-Hot Encoding (preferred for non-ordinal features)
	•	Handle skewed variables using transformations if necessary.
	•	Remove or cap extreme outliers where appropriate.
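
A minimal sketch of these preprocessing steps, assuming one-hot encoding via `pd.get_dummies`, a `log1p` transform for skew, and capping at the 99th percentile. The toy data, the new column names (`log_odometer`, `odometer_capped`), and the 99% threshold are illustrative assumptions rather than the project's settings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "make": ["Kia", "BMW", "Volvo", "BMW"],
    "transmission": ["automatic", "automatic", "manual", "automatic"],
    "odometer": [16639.0, 1331.0, 14282.0, 2641.0],
})

# One-hot encode non-ordinal categorical features.
encoded = pd.get_dummies(toy, columns=["make", "transmission"])

# Reduce right skew with log(1 + x), which is defined at zero.
encoded["log_odometer"] = np.log1p(encoded["odometer"])

# Cap extreme values at the 99th percentile instead of dropping rows.
upper = encoded["odometer"].quantile(0.99)
encoded["odometer_capped"] = encoded["odometer"].clip(upper=upper)

print(encoded.columns.tolist())
```

One-hot encoding a high-cardinality column such as a model name can create a very wide matrix, so grouping rare categories first may be worth considering.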



4. Feature Scaling
	•	Apply feature scaling to numerical variables such as Mileage, Engine Size, and Year.
	•	Use StandardScaler or MinMaxScaler where required.
	•	Ensure consistent preprocessing between training and inference.
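
To illustrate consistent preprocessing, the sketch below fits a `StandardScaler` on training rows only and reuses the fitted scaler for a new row. The arrays are hypothetical (columns: mileage, year). The key point is that `fit` is never called on test or inference data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training features: [mileage, year].
X_train = np.array([
    [16639.0, 2015.0],
    [9393.0, 2015.0],
    [1331.0, 2014.0],
    [14282.0, 2015.0],
])
# Hypothetical unseen row arriving at inference time.
X_new = np.array([[2641.0, 2014.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_new_scaled = scaler.transform(X_new)          # reuse the same mean/std

print(X_train_scaled.mean(axis=0), X_new_scaled)
```

Tree-based models generally do not need scaling, so this step mainly matters for the linear and polynomial models.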



5. Train-Test Split
	•	Split the dataset into training and testing sets:
		•	Training set: 70–80%
		•	Testing set: 20–30%
	•	Use a fixed random state for reproducibility.
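
A short sketch of the split using scikit-learn's `train_test_split`. The placeholder arrays and the specific values `test_size=0.2` and `random_state=42` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (10 rows, 2 columns) and target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% training, 20% testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```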



6. Model Training
	•	Train multiple regression models using the training dataset.
	•	Tune hyperparameters for tree-based models to improve performance.
	•	Store trained models for evaluation and deployment.
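
Training several candidate models could look like the sketch below. The synthetic data is generated only so the example runs on its own. The hyperparameters shown (`max_depth=5`, `n_estimators=50`) are placeholders, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for this example).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Candidate models keyed by a short name.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# fit() returns the estimator itself, so the dict holds trained models.
fitted = {name: model.fit(X, y) for name, model in models.items()}

print(fitted["linear"].coef_)
```

Keeping the fitted models in one dictionary makes it easy to loop over them during evaluation.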


7. Model Evaluation

Models are evaluated on the test dataset using:
	•	R² Score (primary metric)
	•	Mean Absolute Error (MAE)
	•	Mean Squared Error (MSE)
	•	Root Mean Squared Error (RMSE)
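
The four metrics can be computed directly from their definitions, as in this small sketch. The helper name `regression_metrics` and the example prices are hypothetical. `sklearn.metrics` offers equivalent functions (`r2_score`, `mean_absolute_error`, `mean_squared_error`).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², MAE, MSE, and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total variation
    return {"r2": 1.0 - ss_res / ss_tot, "mae": mae, "mse": mse, "rmse": mse ** 0.5}

# Illustrative actual vs. predicted prices.
metrics = regression_metrics(
    [21500, 30000, 27750, 67000],
    [20500, 31900, 27500, 66000],
)
print(metrics)
```

MAE and RMSE are in the same units as price, which makes them easier to interpret than MSE. RMSE penalizes large errors more heavily than MAE.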



8. Model Selection and Validation
	•	Compare model performance across all evaluation metrics.
	•	Select the best-performing model based on predictive accuracy and generalization.
	•	Check for overfitting by comparing training and testing performance.
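
One simple way to check for overfitting is to compare R² on the training and testing sets. The sketch below uses synthetic noisy data and an unconstrained decision tree, both chosen as assumptions because they make the gap easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A tree with no depth limit can memorize the training noise.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = deep_tree.score(X_train, y_train)  # score() returns R² for regressors
test_r2 = deep_tree.score(X_test, y_test)
gap = train_r2 - test_r2

print(f"train R2={train_r2:.3f}, test R2={test_r2:.3f}, gap={gap:.3f}")
```

A large positive gap suggests the model fits the training data much better than unseen data. Limiting tree depth or using cross-validation are common responses.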



## Final Outcome

The final model is integrated into a web-based application that allows users to input car details and receive a real-time estimated market price along with insights into key influencing factors.

In [5]:
import pandas as pd

In [None]:
df = pd.read_csv('car_prices.csv')
# Read the car prices dataset and store it in a dataframe called df

In [8]:
# Display the first few rows of the dataframe to understand its structure
df.head()

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,seller,mmr,sellingprice,saledate
0,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg566472,ca,5.0,16639.0,white,black,kia motors america inc,20500.0,21500.0,Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
1,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg561319,ca,5.0,9393.0,white,beige,kia motors america inc,20800.0,21500.0,Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
2,2014,BMW,3 Series,328i SULEV,Sedan,automatic,wba3c1c51ek116351,ca,45.0,1331.0,gray,black,financial services remarketing (lease),31900.0,30000.0,Thu Jan 15 2015 04:30:00 GMT-0800 (PST)
3,2015,Volvo,S60,T5,Sedan,automatic,yv1612tb4f1310987,ca,41.0,14282.0,white,black,volvo na rep/world omni,27500.0,27750.0,Thu Jan 29 2015 04:30:00 GMT-0800 (PST)
4,2014,BMW,6 Series Gran Coupe,650i,Sedan,automatic,wba6b2c57ed129731,ca,43.0,2641.0,gray,black,financial services remarketing (lease),66000.0,67000.0,Thu Dec 18 2014 12:30:00 GMT-0800 (PST)


In [None]:
# List the column names of the dataframe
df.columns

Index(['year', 'make', 'model', 'trim', 'body', 'transmission', 'vin', 'state',
       'condition', 'odometer', 'color', 'interior', 'seller', 'mmr',
       'sellingprice', 'saledate'],
      dtype='object')

In [10]:
# Check the shape of the dataframe to see how many rows and columns it has
df.shape

(558837, 16)

The dataframe has 558,837 rows and 16 columns.

## Exploratory Data Analysis

In [None]:
# Get summary information about the dataframe, including data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 558837 entries, 0 to 558836
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   year          558837 non-null  int64  
 1   make          548536 non-null  object 
 2   model         548438 non-null  object 
 3   trim          548186 non-null  object 
 4   body          545642 non-null  object 
 5   transmission  493485 non-null  object 
 6   vin           558833 non-null  object 
 7   state         558837 non-null  object 
 8   condition     547017 non-null  float64
 9   odometer      558743 non-null  float64
 10  color         558088 non-null  object 
 11  interior      558088 non-null  object 
 12  seller        558837 non-null  object 
 13  mmr           558799 non-null  float64
 14  sellingprice  558825 non-null  float64
 15  saledate      558825 non-null  object 
dtypes: float64(4), int64(1), object(11)
memory usage: 68.2+ MB


In [None]:
# Check for missing values in each column
df.isnull().sum()

year                0
make            10301
model           10399
trim            10651
body            13195
transmission    65352
vin                 4
state               0
condition       11820
odometer           94
color             749
interior          749
seller              0
mmr                38
sellingprice       12
saledate           12
dtype: int64
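
Raw counts are easier to judge as a share of all rows. The sketch below converts the non-zero counts from the output above into percentages. The counts and the row total are copied from the outputs shown in this notebook, and the variable names are illustrative. On the loaded dataframe, `df.isnull().mean() * 100` would give the same result directly.

```python
import pandas as pd

# Non-zero missing counts copied from the df.isnull().sum() output.
missing_counts = pd.Series({
    "make": 10301,
    "model": 10399,
    "trim": 10651,
    "body": 13195,
    "transmission": 65352,
    "vin": 4,
    "condition": 11820,
    "odometer": 94,
    "color": 749,
    "interior": 749,
    "mmr": 38,
    "sellingprice": 12,
    "saledate": 12,
})
n_rows = 558837  # from df.shape

# Percentage of rows missing per column, largest first.
missing_pct = (missing_counts / n_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

By this measure `transmission` stands out at roughly 11.7% missing, while every other column is missing in under 2.5% of rows.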