# **Car Price Prediction Analysis**

## **1. Defining the Analytical Question**
### **Can we predict the selling price of a car based on its attributes?**
- This analysis will help us understand which features most influence car pricing.
- The insights can aid sellers in setting fair prices and help buyers assess reasonable values.


In [2]:
import pandas as pd
from sqlalchemy import create_engine

In [4]:
df = pd.read_csv("Car Price Prediction Feature Engineering.csv")
df.head()


Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,...,interior,seller,mmr,sellingprice,saledate,log_sellingprice,car_age,odometer_per_year,price_difference_from_mmr,is_price_outlier
0,2015,0,0,LX,0,0,5xyktca69fg566472,0,5.0,16639.0,...,0,kia motors america inc,20500.0,21500.0,Tue Dec 16 2014 12:30:00 GMT-0800 (PST),9.975808,10,1663.9,1000.0,0
1,2015,0,0,LX,0,0,5xyktca69fg561319,0,5.0,9393.0,...,1,kia motors america inc,20800.0,21500.0,Tue Dec 16 2014 12:30:00 GMT-0800 (PST),9.975808,10,939.3,700.0,0
2,2014,1,1,328i SULEV,1,0,wba3c1c51ek116351,0,45.0,1331.0,...,0,financial services remarketing (lease),31900.0,30000.0,Thu Jan 15 2015 04:30:00 GMT-0800 (PST),10.308953,11,121.0,-1900.0,0
3,2015,2,2,T5,1,0,yv1612tb4f1310987,0,41.0,14282.0,...,0,volvo na rep/world omni,27500.0,27750.0,Thu Jan 29 2015 04:30:00 GMT-0800 (PST),10.230991,10,1428.2,250.0,0
4,2014,1,3,650i,1,0,wba6b2c57ed129731,0,43.0,2641.0,...,0,financial services remarketing (lease),66000.0,67000.0,Thu Dec 18 2014 12:30:00 GMT-0800 (PST),11.112448,11,240.090909,1000.0,0


In [6]:
df.columns

Index(['year', 'make', 'model', 'trim', 'body', 'transmission', 'vin', 'state',
       'condition', 'odometer', 'color', 'interior', 'seller', 'mmr',
       'sellingprice', 'saledate', 'log_sellingprice', 'car_age',
       'odometer_per_year', 'price_difference_from_mmr', 'is_price_outlier'],
      dtype='object')

Data contains combination of categorical and numerical data. As a first step, we will convert all the columns to float type with round to 3 decimal places for continuous data and convert categorical data to numerical data.


In [9]:
# Dropping the columns which are not required for the analysis
df = df.drop(columns=["vin", "seller", "saledate", 'is_price_outlier', 'odometer', 'trim', 'model', 'state', 'year', 'body'])
df.head()


Unnamed: 0,make,transmission,condition,color,interior,mmr,sellingprice,log_sellingprice,car_age,odometer_per_year,price_difference_from_mmr
0,0,0,5.0,0,0,20500.0,21500.0,9.975808,10,1663.9,1000.0
1,0,0,5.0,0,1,20800.0,21500.0,9.975808,10,939.3,700.0
2,1,0,45.0,1,0,31900.0,30000.0,10.308953,11,121.0,-1900.0
3,2,0,41.0,0,0,27500.0,27750.0,10.230991,10,1428.2,250.0
4,1,0,43.0,1,0,66000.0,67000.0,11.112448,11,240.090909,1000.0


In [11]:
categorical_data = ["condition", "color", "interior", "transmission", "make"]

for col in categorical_data:
    df[col] = df[col].astype('category')

df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482346 entries, 0 to 482345
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   make                       482346 non-null  category
 1   transmission               482346 non-null  category
 2   condition                  482346 non-null  category
 3   color                      482346 non-null  category
 4   interior                   482346 non-null  category
 5   mmr                        482346 non-null  float64 
 6   sellingprice               482346 non-null  float64 
 7   log_sellingprice           482346 non-null  float64 
 8   car_age                    482346 non-null  int64   
 9   odometer_per_year          482346 non-null  float64 
 10  price_difference_from_mmr  482346 non-null  float64 
dtypes: category(5), float64(5), int64(1)
memory usage: 24.4 MB


In [13]:
# Converting the categorical data to numerical data
df = pd.get_dummies(df, columns=categorical_data, dtype=int, drop_first=True)

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Convert all the columns to float type with round to 3 decimal places
for col in df.columns:
    df[col] = df[col].astype(float).round(3)

# Separate features and target
X = df.drop(['log_sellingprice', 'sellingprice'], axis=1)
y = df['log_sellingprice']

# Split the data into training, testing, and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, 
    y, 
    test_size=0.3,  # 30% of the data will be used for testing and validation
    random_state=42
)

X_test, X_val, y_test, y_val = train_test_split(
    X_temp, 
    y_temp, 
    test_size=0.5,  # 50% of the remaining data will be used for validation
    random_state=42
)

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Scale the features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_val_scaled = scaler.transform(X_val)

print("Shape of training data:", X_train_scaled.shape)
print("Shape of testing data:", X_test_scaled.shape)
print("Shape of validation data:", X_val_scaled.shape)

Shape of training data: (337642, 157)
Shape of testing data: (72352, 157)
Shape of validation data: (72352, 157)


In [17]:
import pandas as pd

# Convert the scaled validation data back to a DataFrame
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=X_val.columns)

# Add the target column to the DataFrame
X_val_scaled_df['log_sellingprice'] = y_val.reset_index(drop=True)

# Save the validation data to a CSV file
X_val_scaled_df.to_csv("validation_data.csv", index=False)

In [19]:
## Training the model using Linear Regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train_scaled, y_train)

# Make predictions on the training set
y_train_pred = model.predict(X_train_scaled)

# Make predictions on the testing set
y_test_pred = model.predict(X_test_scaled)

In [21]:
# Evaluate the model on Training set and Testing set

rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
mse = mean_squared_error(y_train, y_train_pred)
r2 = r2_score(y_train, y_train_pred)

print("Training Set Metrics:")
print(f"RMSE: {rmse:.4f}")
print(f"MSE: {mse:.4f}")
print(f"R2 Score: {r2:.4f}")

# Calculate metrics for testing set
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)

print("--------------------------------")
print("Testing Set Metrics:")
print(f"RMSE: {rmse_test:.4f}")
print(f"MSE: {mse_test:.4f}")
print(f"R2 Score: {r2_test:.4f}")


Training Set Metrics:
RMSE: 0.3257
MSE: 0.1061
R2 Score: 0.8697
--------------------------------
Testing Set Metrics:
RMSE: 0.3308
MSE: 0.1094
R2 Score: 0.8657


The linear regression model demonstrates strong performance with an RMSE of 0.3257 and an R² score of 0.8697 on the training set, indicating a good fit. On the testing set, the RMSE is slightly higher at 0.3308, with an R² score of 0.8657, suggesting the model generalizes well to unseen data with minimal overfitting. The close alignment of training and testing metrics reflects a robust model capable of accurately predicting car prices based on the given features.

In [26]:
## Experimenting with different models
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Initialize the models
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
dt_model = DecisionTreeRegressor(random_state=42)

# Train the models
rf_model.fit(X_train_scaled, y_train)
dt_model.fit(X_train_scaled, y_train)

# Make predictions on the training set
y_train_pred_rf = rf_model.predict(X_train_scaled)
y_train_pred_dt = dt_model.predict(X_train_scaled)

In [27]:
# Evaluate the models on Training set and Testing set

# Decision Tree Model
rmse_dt = np.sqrt(mean_squared_error(y_train, y_train_pred_dt))
mse_dt = mean_squared_error(y_train, y_train_pred_dt)
r2_dt = r2_score(y_train, y_train_pred_dt)

print("Decision Tree Model Training Metrics:")
print(f"RMSE: {rmse_dt:.4f}")
print(f"MSE: {mse_dt:.4f}")
print(f"R2 Score: {r2_dt:.4f}")

# Make predictions on the testing set
y_test_pred_rf = rf_model.predict(X_test_scaled)
y_test_pred_dt = dt_model.predict(X_test_scaled)

print("--------------------------------")
print("Decision Tree Model Testing Metrics:")
print(f"RMSE: {rmse_dt:.4f}")
print(f"MSE: {mse_dt:.4f}")
print(f"R2 Score: {r2_dt:.4f}")


# Random Forest Model
print("--------------------------------")
print("Random Forest Model Training Metrics:")
rmse_train_rf = np.sqrt(mean_squared_error(y_train, y_train_pred_rf))
mse_train_rf = mean_squared_error(y_train, y_train_pred_rf)
r2_train_rf = r2_score(y_train, y_train_pred_rf)

print(f"RMSE: {rmse_train_rf:.4f}")
print(f"MSE: {mse_train_rf:.4f}")
print(f"R2 Score: {r2_train_rf:.4f}")

print("--------------------------------")
print("Random Forest Model Testing Metrics:")
rmse_test_rf = np.sqrt(mean_squared_error(y_test, y_test_pred_rf))
mse_test_rf = mean_squared_error(y_test, y_test_pred_rf)
r2_test_rf = r2_score(y_test, y_test_pred_rf)

print(f"RMSE: {rmse_test_rf:.4f}")
print(f"MSE: {mse_test_rf:.4f}")
print(f"R2 Score: {r2_test_rf:.4f}")

Decision Tree Model Training Metrics:
RMSE: 0.0000
MSE: 0.0000
R2 Score: 1.0000
--------------------------------
Decision Tree Model Testing Metrics:
RMSE: 0.0000
MSE: 0.0000
R2 Score: 1.0000
--------------------------------
Random Forest Model Training Metrics:
RMSE: 0.0078
MSE: 0.0001
R2 Score: 0.9999
--------------------------------
Random Forest Model Testing Metrics:
RMSE: 0.0275
MSE: 0.0008
R2 Score: 0.9991


The Decision Tree model achieves perfect performance with RMSE and MSE of 0.0000 and an R² score of 1.0000 for both training and testing, indicating overfitting. The Random Forest model shows excellent performance with a training RMSE of 0.0078 and an R² score of 0.9999, and a testing RMSE of 0.0273 with an R² score of 0.9991, suggesting it generalizes well with minimal overfitting. The Random Forest model is more robust, providing a balance between accuracy and generalization.

In [13]:
import joblib

# Save the trained models
joblib.dump(rf_model, 'random_forest_model.joblib')


['random_forest_model.joblib']