# Used Cars Price Prediction for UK-based Online Car Marketplace
--- 

# **1. Introduction**
---
## 1.1 Business & Context Understanding

### Business Context
The client is an emerging online used car listing company, similar to platforms like OLX Autos, Carmudi, or Carsome.id. Their business model revolves around acquiring pre-owned vehicles in good condition at prices below market value, with the intention of reselling them on their platform at a higher price point.

### Business Model: 
The company strategically purchases used cars at favorable prices, aiming to maximize profit margins upon resale. The success of their business model depends on acquiring vehicles at a cost that allows for competitive pricing in the online marketplace while ensuring a reasonable profit.

### Business Problem:
The primary challenge faced by the company is the potential reduction in profit margins if they acquire used cars at prices higher than the market value. This emphasizes the need for a predictive model that can assist in estimating the appropriate purchase price range for pre-owned vehicles.
- Overpricing: If the purchase price is too high, the company must either list the vehicle above market value, where it is unlikely to sell, or accept a reduced margin or a loss.
- Underpricing: If the offered purchase price is too low, the seller is likely to reject the offer, and the company misses the opportunity to acquire the vehicle.

### Business Success Criteria:
The success of the project is defined by achieving the following criteria:

- Accurate Price Prediction: Develop a machine learning model capable of accurately predicting the resale price of used cars based on relevant features.
- Profit Consideration: Ensure that the predicted price range allows for a reasonable profit margin, aligning with the company's business goals.
- Optimized Purchase Decisions: Empower agents purchasing used cars to make informed decisions by providing them with a predicted price range for each vehicle.




## 1.2 Proposed Solution: 

###  **Machine Learning Regression Model for Price Prediction:**
To address the business problem, the proposed solution involves creating a robust machine learning model for predicting the resale price of used cars. This model will take into account various features such as the make, model, mileage, year of manufacture, and any other relevant factors that influence the pricing of used cars.

The model's implementation aims to empower purchasing agents with a tool that can provide a reasonable estimate of the resale value of a potential acquisition. This ensures that the company can make well-informed decisions when acquiring used cars, optimizing the balance between competitive pricing and maintaining healthy profit margins.




###  **Model Evaluation:**
####  Evaluation Metrics:

*   **Mean Absolute Error (MAE):**
    
    *   MAE represents the average absolute difference between the predicted prices and the actual prices.
    *   Business Relation: A low MAE indicates that, on average, the model's predictions are close to the true prices. This is crucial for the business, as it directly aligns with the goal of accurate price prediction, helping purchasing agents make informed decisions.
*   **Mean Squared Error (MSE):**
    
    *   MSE calculates the average squared difference between predicted and actual prices.
    *   Business Relation: MSE gives more weight to large errors. A low MSE indicates that the model is effective in minimizing significant deviations, which is important for ensuring that extreme pricing errors are minimized, contributing to more consistent and reliable predictions.
*   **R-squared (R2):**
    
    *   R2 measures the proportion of the variance in the target variable explained by the model.
    *   Business Relation: A high R2 suggests that a significant portion of the variability in used car prices is captured by the model. This is important for business success as it indicates that the model is effectively leveraging the provided features to predict prices, contributing to better decision-making.
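
The three metrics above can be sketched with NumPy alone. This is a minimal illustration rather than the evaluation code used later in the project; the function name `regression_metrics` and the example prices are assumptions made up for this sketch.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R2 for two equal-length sequences of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))   # average absolute error, in price units
    mse = np.mean(errors ** 2)      # squaring penalizes large errors more
    # R2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "r2": r2}


# Illustrative prices in GBP (made up, not taken from the dataset)
actual = [30000, 25000, 20000, 15000]
predicted = [29000, 26000, 20500, 14500]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

For these made-up values the MAE is 750, the MSE is 625,000, and R2 is 0.98. If scikit-learn is available, `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` compute the same quantities.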




### **Model Interpretation:**
#### Feature Importance:
*   **Feature Importance Analysis:**
    
    *   Interpret the importance of each feature in the model and identify which features have the most significant impact on price predictions.
    *   Business Relation: Understanding feature importance helps the business identify key factors influencing used car prices. This insight can guide strategic decisions, such as focusing on specific features during the car acquisition process.
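
One model-agnostic way to measure feature importance is permutation importance: shuffle one feature at a time and record how much the error grows. The sketch below is only an illustration under stated assumptions. The data is synthetic, the stand-in model is a plain least-squares fit, and the names `permutation_importance_mae`, `age`, `mileage`, and `noise_feature` are hypothetical. It is not the method the project necessarily uses.

```python
import numpy as np


def permutation_importance_mae(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MAE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(np.abs(y - predict(X)))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            shuffled_mae = np.mean(np.abs(y - predict(X_shuffled)))
            increases.append(shuffled_mae - baseline)
        importances[col] = np.mean(increases)
    return importances


# Synthetic data (an assumption for illustration only): price depends strongly
# on age, weakly on mileage, and not at all on a pure-noise column.
rng = np.random.default_rng(42)
n = 500
age = rng.uniform(0, 10, n)
mileage = rng.uniform(0, 100, n)  # thousands of miles
noise_feature = rng.normal(size=n)
X = np.column_stack([age, mileage, noise_feature])
y = 30000 - 2000 * age - 20 * mileage + rng.normal(0, 200, n)

# Stand-in model: ordinary least squares with an intercept term.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)


def predict(features):
    return coef[0] + features @ coef[1:]


feature_names = ["age", "mileage", "noise_feature"]
importances = permutation_importance_mae(predict, X, y)
for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name}: {value:.1f}")
```

Features whose shuffling barely changes the error contribute little to the predictions. scikit-learn offers the same idea as `sklearn.inspection.permutation_importance`, and its tree-based models also expose a `feature_importances_` attribute.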


#### Visualization:
*   **Prediction vs. Actual Plots:**
    
    *   Visualize model predictions against actual prices, considering different subsets of the data if necessary.
    *   Business Relation: Visualization helps stakeholders grasp the overall performance of the model. If the predicted prices align closely with actual prices, it instills confidence in the model's reliability and contributes to better decision-making.
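
A predicted-versus-actual plot could look like the following minimal sketch, assuming matplotlib is installed. The helper name `plot_predicted_vs_actual` and the example prices are made up for illustration. Points close to the dashed diagonal are listings where the prediction matches the actual price.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_predicted_vs_actual(y_true, y_pred, ax=None):
    """Scatter predicted against actual prices with a y = x reference line."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, alpha=0.6)
    # Diagonal spanning the full value range: perfect predictions lie on it
    low = min(y_true.min(), y_pred.min())
    high = max(y_true.max(), y_pred.max())
    ax.plot([low, high], [low, high], linestyle="--", color="black")
    ax.set_xlabel("Actual price (GBP)")
    ax.set_ylabel("Predicted price (GBP)")
    ax.set_title("Predicted vs. actual price")
    return ax


# Illustrative prices in GBP (made up, not taken from the dataset)
actual = [30000, 25000, 20000, 15000]
predicted = [29000, 26000, 20500, 14500]
ax = plot_predicted_vs_actual(actual, predicted)
```

To compare subsets such as fuel types or manufacturers, the same function can be called once per subset with a separate `ax` for each.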



## 1.3 Data Understanding

### About the Data
- The dataset contains used car listings in the UK from several providers.
- The listings were scraped: 100,000 listings in total, separated into one file per car manufacturer.
- The cleaned dataset includes price, transmission, mileage, fuel type, road tax, miles per gallon (mpg), and engine size.
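
Because the listings are split into one file per manufacturer, they have to be combined before modeling. The following is a minimal sketch of one way to do that, not necessarily how the later notebooks do it. The function name `load_listings`, the `skip_prefix` parameter, and the `source_file` column are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_listings(data_dir, skip_prefix="unclean"):
    """Concatenate every CSV in data_dir, tagging rows with their source file."""
    frames = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        if csv_path.stem.startswith(skip_prefix):
            continue  # leave the raw, uncleaned exports out
        frame = pd.read_csv(csv_path)
        frame["source_file"] = csv_path.stem  # hypothetical helper column
        frames.append(frame)
    # Columns missing from a file are filled with NaN in the combined frame
    return pd.concat(frames, ignore_index=True, sort=False)


# Example call, assuming the files live in a "Raw Dataset" directory:
# listings = load_listings("Raw Dataset")
```

Keeping the file name in a column preserves the manufacturer information that is otherwise only encoded in the file name.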


---

The raw data provided contains several files corresponding to each car manufacturer. Most files are already pre-cleaned and ready for analysis. However, some files require additional cleaning and feature engineering to prepare them for the modeling phase. 

Note on Data Cleaning: The data cleaning process is documented in the notebook `2-Cleaning.ipynb`.
- The `unclean cclass.csv` file was cleaned and saved as `cclass.csv` by the data provider.
- The `unclean focus.csv` file was cleaned and saved as `focus.csv` by the data provider.
- Both `cclass.csv` and `focus.csv` contain different columns from the other files. Therefore, in the first part of this project, these two files are handled separately: we examine their data and then apply it to modeling.


### Data Check on uncleaned data vs cleaned data 

To verify that the cleaned and uncleaned data are essentially the same, we compared:
- `unclean cclass.csv` vs `cclass.csv`
- `unclean focus.csv` vs `focus.csv`

In [1]:
# EDA standard libraries

import pandas as pd
import numpy as np

In [2]:
# Load the cleaned and uncleaned C-Class datasets

cclass = pd.read_csv("Raw Dataset/cclass.csv")
unclean_cclass = pd.read_csv("Raw Dataset/unclean cclass.csv")

#### Check General Data Information

**Mercedes cclass Data**

In [3]:
cclass.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3899 entries, 0 to 3898
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         3899 non-null   object 
 1   year          3899 non-null   int64  
 2   price         3899 non-null   int64  
 3   transmission  3899 non-null   object 
 4   mileage       3899 non-null   int64  
 5   fuelType      3899 non-null   object 
 6   engineSize    3899 non-null   float64
dtypes: float64(1), int64(3), object(3)
memory usage: 213.4+ KB


In [4]:
unclean_cclass.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4006 entries, 0 to 4005
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         3907 non-null   object 
 1   year          3904 non-null   float64
 2   price         3907 non-null   object 
 3   transmission  3907 non-null   object 
 4   mileage       3808 non-null   object 
 5   fuel type     1329 non-null   object 
 6   engine size   3842 non-null   object 
 7   mileage2      3890 non-null   object 
 8   fuel type2    3808 non-null   object 
 9   engine size2  3808 non-null   object 
 10  reference     3907 non-null   object 
dtypes: float64(1), object(10)
memory usage: 344.4+ KB


#### Check on Data Rows and Columns

In [5]:
cclass.head()

|   | model | year | price | transmission | mileage | fuelType | engineSize |
|---|---|---|---|---|---|---|---|
| 0 | C Class | 2020 | 30495 | Automatic | 1200 | Diesel | 2.0 |
| 1 | C Class | 2020 | 29989 | Automatic | 1000 | Petrol | 1.5 |
| 2 | C Class | 2020 | 37899 | Automatic | 500 | Diesel | 2.0 |
| 3 | C Class | 2019 | 30399 | Automatic | 5000 | Diesel | 2.0 |
| 4 | C Class | 2019 | 29899 | Automatic | 4500 | Diesel | 2.0 |


In [6]:
unclean_cclass.head()

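|   | model | year | price | transmission | mileage | fuel type | engine size | mileage2 | fuel type2 | engine size2 | reference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C Class | 2020.0 | £30,495 | Automatic | NaN | Diesel | 2.0 | 1200 | NaN | NaN | /ad/25017331 |
| 1 | C Class | 2020.0 | £29,989 | Automatic | NaN | Petrol | 1.5 | 1000 | NaN | NaN | /ad/25043746 |
| 2 | C Class | 2020.0 | £37,899 | Automatic | NaN | Diesel | 2.0 | 500 | NaN | NaN | /ad/25142894 |
| 3 | C Class | 2019.0 | £30,399 | Automatic | NaN | Diesel | 2.0 | 5000 | NaN | NaN | /ad/24942816 |
| 4 | C Class | 2019.0 | £29,899 | Automatic | NaN | Diesel | 2.0 | 4500 | NaN | NaN | /ad/24913660 |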


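Comparing the two files shows that they describe **the same listings**: `cclass.csv` is the cleaned version of `unclean cclass.csv`. In the cleaned file, `price` is stored as an integer rather than as a string such as "£30,495", the split `mileage`/`mileage2`, `fuel type`/`fuel type2`, and `engine size`/`engine size2` columns appear as single `mileage`, `fuelType`, and `engineSize` columns, the `reference` column is dropped, and the row count falls from 4,006 to 3,899.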
> We won't be using the `unclean cclass.csv` & `unclean focus.csv` files in this project.
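
To illustrate how the uncleaned columns relate to the cleaned ones, the sketch below rebuilds `price` and `mileage` for the first three rows of the `unclean_cclass.head()` output. It is only an assumption about how such cleaning could work: the actual cleaning was done by the data provider, the helper name `tidy_unclean` is hypothetical, and rows further down the raw file may need extra handling.

```python
import pandas as pd

# First three rows of `unclean cclass.csv`, copied from the head() output
unclean_sample = pd.DataFrame({
    "model": ["C Class", "C Class", "C Class"],
    "year": [2020.0, 2020.0, 2020.0],
    "price": ["£30,495", "£29,989", "£37,899"],
    "mileage": [None, None, None],
    "mileage2": ["1200", "1000", "500"],
})


def tidy_unclean(frame):
    """Parse price strings and fall back to mileage2 where mileage is missing."""
    tidy = pd.DataFrame({"model": frame["model"]})
    # Nullable integers tolerate missing values in the full raw file
    tidy["year"] = frame["year"].astype("Int64")
    price_text = (
        frame["price"]
        .str.replace("£", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    tidy["price"] = pd.to_numeric(price_text, errors="coerce").astype("Int64")
    # Keep mileage where present, otherwise take the value from mileage2
    mileage = frame["mileage"].where(frame["mileage"].notna(), frame["mileage2"])
    tidy["mileage"] = pd.to_numeric(mileage, errors="coerce").astype("Int64")
    return tidy


tidy_sample = tidy_unclean(unclean_sample)
print(tidy_sample)
```

For these three rows the result matches the `year`, `price`, and `mileage` values of the first three rows of `cclass.head()`.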