# Car Price Prediction and Condition Analysis

This notebook is designed to analyze car price data and predict car conditions using various machine learning models. It explores regression and classification techniques to evaluate the performance of different algorithms, refine models, and provide insights into the factors influencing car prices and conditions. The workflow includes data preprocessing, feature selection, model training, evaluation, and interpretation of results.


In [2]:
import sys
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

cleaned_data_path = os.path.abspath(os.path.join('..', 'data', 'cleaned', 'car_prices_cleaned.csv'))
car_df = pd.read_csv(cleaned_data_path)
car_df.head(5)

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,seller,mmr,sellingprice,saledate
0,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg566472,ca,5.0,16639.0,white,black,"kia motors america, inc",20500,21500,2014-12-16
1,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg561319,ca,5.0,9393.0,white,beige,"kia motors america, inc",20800,21500,2014-12-16
2,2014,BMW,3 Series,328i SULEV,Sedan,automatic,wba3c1c51ek116351,ca,4.5,1331.0,gray,black,financial services remarketing (lease),31900,30000,2015-01-14
3,2015,Volvo,S60,T5,Sedan,automatic,yv1612tb4f1310987,ca,4.1,14282.0,white,black,volvo na rep/world omni,27500,27750,2015-01-28
4,2014,BMW,6 Series Gran Coupe,650i,Sedan,automatic,wba6b2c57ed129731,ca,4.3,2641.0,gray,black,financial services remarketing (lease),66000,67000,2014-12-18


In [2]:
# Feature and target selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import r2_score
#Split the data 80/20 for train and test
x_train_df, x_test_df, y_train_df, y_test_df = train_test_split(
    car_df[['year', 'odometer', 'condition', 'mmr']],
    car_df['sellingprice'],
    test_size=0.2
)

# Split the train data 80/20 for train and valid
x_train_df, x_valid_df, y_train_df, y_valid_df = train_test_split(
    x_train_df,
    y_train_df,
    test_size=0.2
)

print("Train: ", x_train_df.shape, y_train_df.shape)
print("Valid: ", x_valid_df.shape, y_valid_df.shape)
print("Test: ", x_test_df.shape, y_test_df.shape)

# Convert the dataframes to numpy array
x_train, y_train = x_train_df.values, y_train_df.values
x_valid, y_valid = x_valid_df.values, y_valid_df.values
x_test, y_test = x_test_df.values, y_test_df.values

Train:  (302294, 4) (302294,)
Valid:  (75574, 4) (75574,)
Test:  (94468, 4) (94468,)
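
The two chained 80/20 splits above leave roughly 64% of the rows for training, 16% for validation, and 20% for testing. The arithmetic below is a small sketch that reproduces the printed shapes. It assumes scikit-learn rounds a fractional `test_size` up to a whole number of rows, and `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# Total rows in the cleaned dataset, taken from the shapes printed above
total_rows = 302294 + 75574 + 94468

# First split: 80% train+valid, 20% test
n_train_valid, n_test = split_sizes(total_rows, 0.2)
# Second split: 80% train, 20% valid, applied to the remaining rows
n_train, n_valid = split_sizes(n_train_valid, 0.2)

print(n_train, n_valid, n_test)
print(f"train share: {n_train / total_rows:.2f}, valid share: {n_valid / total_rows:.2f}")
```

Passing a fixed `random_state` to both `train_test_split` calls would keep the rows in each split the same across runs, which makes the metrics reproducible.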


In [3]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
# First 3 experiments
def regression_analysis(model, model_name):
    """Fit on the training split and report RMSE and R² on validation and test."""
    model.fit(x_train, y_train)

    # Predict on validation and testing set
    y_valid_prediction = model.predict(x_valid)
    y_test_prediction = model.predict(x_test)

    # Check the performance of the model (these are RMSE values, not MSE)
    valid_rmse = root_mean_squared_error(y_valid, y_valid_prediction)
    test_rmse = root_mean_squared_error(y_test, y_test_prediction)
    valid_r2 = r2_score(y_valid, y_valid_prediction)
    test_r2 = r2_score(y_test, y_test_prediction)

    print(model_name + " Performance:")
    print(f"Validation RMSE: {valid_rmse:.2f}, Validation R²: {valid_r2:.2f}")
    print(f"Test RMSE: {test_rmse:.2f}, Test R²: {test_r2:.2f}")



# Initialize each model and run its experiment
experiments = [
    ("Linear Regression", LinearRegression()),
    ("Random Forest Regressor", RandomForestRegressor(n_estimators=100, max_depth=10, random_state=50)),
    ("Decision Tree Regressor", DecisionTreeRegressor(max_depth=7, random_state=50)),
]
for number, (model_name, model) in enumerate(experiments, start=1):
    print(f"\nExperiment {number}: {model_name}:\n")
    regression_analysis(model, model_name)




Experiment 1: Linear Regression:

Linear Regression Performance:
Validation RMSE: 1504.18, Validation R²: 0.98
Test RMSE: 1559.36, Test R²: 0.97

Experiment 2: Random Forest Regressor:

Random Forest Regressor Performance:
Validation RMSE: 1462.64, Validation R²: 0.98
Test RMSE: 1507.52, Test R²: 0.98

Experiment 3: Decision Tree Regressor:

Decision Tree Regressor Performance:
Validation RMSE: 1545.56, Validation R²: 0.97
Test RMSE: 1597.30, Test R²: 0.97
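
For readers less familiar with the two metrics printed above, here is a minimal sketch of how RMSE and R² are defined. The notebook itself uses scikit-learn's `root_mean_squared_error` and `r2_score`; the `rmse` and `r2` functions below are hand-written stand-ins. The inputs are the five rows from the `head()` output, with `mmr` treated as the prediction purely for illustration, so the result says nothing about the full dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# sellingprice and mmr for the five rows shown by car_df.head(5)
selling_price = [21500, 21500, 30000, 27750, 67000]
mmr_as_prediction = [20500, 20800, 31900, 27500, 66000]

sample_rmse = rmse(selling_price, mmr_as_prediction)
sample_r2 = r2(selling_price, mmr_as_prediction)
print(f"RMSE: {sample_rmse:.2f}, R²: {sample_r2:.4f}")
```

RMSE is expressed in the same units as the target (dollars here), while R² is unitless and compares the model's squared error with that of always predicting the mean.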


In [4]:
# Experiments 4-6: regularized linear models
experiments = [
    ("Ridge Regression", Ridge(alpha=1.0)),  # alpha=1.0 is the scikit-learn default
    ("Lasso Regression", Lasso(alpha=0.1)),  # weaker L1 penalty than the default alpha=1.0
    # Elastic net combines the lasso (L1) and ridge (L2) penalties; l1_ratio sets the mix
    ("Elastic Net Regression", ElasticNet(alpha=0.1, l1_ratio=0.5)),
]
for number, (model_name, model) in enumerate(experiments, start=4):
    print(f"\nExperiment {number}: {model_name}:\n")
    regression_analysis(model, model_name)


Experiment 4: Ridge Regression:

Ridge Regression Performance:
Validation RMSE: 1586.12, Validation R²: 0.97
Test RMSE: 1545.41, Test R²: 0.97

Experiment 5: Lasso Regression:

Lasso Regression Performance:
Validation RMSE: 1586.12, Validation R²: 0.97
Test RMSE: 1545.41, Test R²: 0.97

Experiment 6: Elastic Net Regression:

Elastic Net Regression Performance:
Validation RMSE: 1587.41, Validation R²: 0.97
Test RMSE: 1546.88, Test R²: 0.97
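
The three regularized models above differ in the penalty they add to the least-squares loss. As a rough sketch, the function below evaluates the penalty term from scikit-learn's documented `ElasticNet` objective, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The coefficient vector `w` is made up for illustration, and `elastic_net_penalty` is a hypothetical helper rather than a library function.

```python
def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term of the elastic net objective for coefficient vector w."""
    l1_norm = sum(abs(c) for c in w)
    l2_norm_squared = sum(c * c for c in w)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1 - l1_ratio) * l2_norm_squared

w = [3.0, -4.0]  # illustrative coefficients, not fitted values

mixed = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5)       # setting used in Experiment 6
lasso_like = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)  # pure L1 penalty
ridge_like = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)  # pure L2 penalty

print(f"mixed: {mixed:.3f}, L1 only: {lasso_like:.3f}, L2 only: {ridge_like:.3f}")
```

With `l1_ratio=1.0` the penalty reduces to the lasso penalty. With `l1_ratio=0.0` only the squared L2 term remains, although scikit-learn's `Ridge` scales its loss differently, so the two are ridge-like rather than identical for the same `alpha`.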


## For each experiment, answer the following:

## Experiment 1: Linear Regression
1. What input data and target (output) data did you use for the prediction task?
* The input data initially consisted of the odometer and condition columns. Year and MMR were added later to improve performance. The target is the selling price.

2. How did your model perform on the train and test set?  
The performance for linear regression was:
* Validation RMSE: 1550.07, Validation R²: 0.97
* Test RMSE: 1563.32, Test R²: 0.97
* An R² of 0.97 means the model explains about 97% of the variance in selling price. The figures quoted in these answers differ slightly from the cell outputs above: `train_test_split` is called without a `random_state`, so the splits, and therefore the scores, change from run to run.

3. Did your model overfit or underfit?
* With a high R² score, an RMSE that is low relative to the selling prices in the data, and validation and test scores that are close to each other, the model appears to be neither overfitting nor underfitting.

4. No changes required.

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)
* Performance could potentially be increased by including data from another source with more information, such as gas mileage or engine size.

## Experiment 2: Random Forest Regressor
1. What input data and target (output) data did you use for the prediction task?
* The input data initially consisted of the odometer and condition columns. Year and MMR were added later to improve performance. The target is the selling price.

2. How did your model perform on the train and test set? 
The performance for random forest regressor was:
* Validation RMSE: 1500.37, Validation R²: 0.98
* Test RMSE: 1530.58, Test R²: 0.97
* With R² at 0.97 to 0.98, the model explains nearly all of the variance in selling price, and its RMSE is the lowest of the first three experiments.

3. Did your model overfit or underfit?
* With a high R² score, an RMSE that is low relative to the selling prices in the data, and validation and test scores that are close to each other, the model appears to be neither overfitting nor underfitting.

4. No changes required.

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)
* Performance could potentially be increased by including data from another source with more information, such as gas mileage or engine size.

## Experiment 3: Decision Tree Regressor
1. What input data and target (output) data did you use for the prediction task?
* The input data initially consisted of the odometer and condition columns. Year and MMR were added later to improve performance. The target is the selling price.

2. How did your model perform on the train and test set? 
The performance for decision tree regressor was:
* Validation RMSE: 1582.18, Validation R²: 0.97
* Test RMSE: 1603.22, Test R²: 0.97
* An R² of 0.97 means the model explains about 97% of the variance in selling price.

3. Did your model overfit or underfit? 
* The fit is good overall, as with the other models. The RMSE is the highest of the first three experiments, but the validation and test RMSE are close to each other (1582.18 vs 1603.22), so there is no clear sign of overfitting. Comparing against the RMSE on the training set would be needed to confirm this.

4. If so, try addressing this problem either by modifying the model or modifying the data. This counts as another iteration within the same experiment. What changes did you make and how much did they help?
* One change that could reduce any overfitting is adjusting the tree depth (`max_depth`, currently 7).

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)
* Performance could potentially be increased by including data from another source with more information, such as gas mileage or engine size. Exploring feature transformations could also help improve the model.

## Experiment 4: Ridge Regression
1. What input data and target (output) data did you use for the prediction task?
* The input data is the odometer, the condition rating, the year, and the MMR. 

2. How did your model perform on the train and test set?
The performance for Ridge Regression was:
* Validation RMSE: 1532.66, Validation R²: 0.97
* Test RMSE: 1570.18, Test R²: 0.97
* An R² of 0.97 on both sets means the model explains about 97% of the variance in selling price.

3. Did your model overfit or underfit? 
* The model does not appear to be underfitting or overfitting based on the R² score and RMSE score.

4. If so, try addressing this problem either by modifying the model or modifying the data. This counts as another iteration within the same experiment. What changes did you make and how much did they help?
* No changes required.

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)
* One beneficial step would be to use more of the available columns as features. This is difficult with car data because many of the columns (make, model, trim, body, color) are strings that must be carefully encoded before a model can use them.

## Experiment 5: Lasso Regression
1. What input data and target (output) data did you use for the prediction task?
* The input data is the odometer, the condition rating, the year, and the MMR. 

2. How did your model perform on the train and test set?
The performance for Lasso Regression was:
* Validation RMSE: 1532.66, Validation R²: 0.97
* Test RMSE: 1570.18, Test R²: 0.97
* As with ridge regression, an R² of 0.97 on both sets means the model explains about 97% of the variance in selling price.

3. Did your model overfit or underfit? 
* The model does not appear to be underfitting or overfitting based on the R² score and RMSE score.

4. If so, try addressing this problem either by modifying the model or modifying the data. This counts as another iteration within the same experiment. What changes did you make and how much did they help?
* No changes required.

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)
* One beneficial step would be to use more of the available columns as features. This is difficult with car data because many of the columns (make, model, trim, body, color) are strings that must be carefully encoded before a model can use them.

## Experiment 6: Elastic Net Regression
1. What input data and target (output) data did you use for the prediction task?
* The input data is the odometer, the condition rating, the year, and the MMR. 

2. How did your model perform on the train and test set?
The performance for Elastic Net Regression was:
* Validation RMSE: 1533.65, Validation R²: 0.97
* Test RMSE: 1570.98, Test R²: 0.97
* Performance is essentially the same as for ridge and lasso, with an R² of 0.97 on both sets.

3. Did your model overfit or underfit? 
* The model does not appear to be underfitting or overfitting based on the R² score and RMSE score.

4. If so, try addressing this problem either by modifying the model or modifying the data. This counts as another iteration within the same experiment. What changes did you make and how much did they help?
* No changes required.

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)
* One beneficial step would be to use more of the available columns as features. This is difficult with car data because many of the columns (make, model, trim, body, color) are strings that must be carefully encoded before a model can use them.
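
The answers above note that string columns must be encoded before a model can use them. One common option is one-hot encoding; the KNN section of this notebook imports scikit-learn's `OneHotEncoder`, although the cells shown here do not use it. The sketch below hand-rolls the idea on the `make` values from the first five rows of the dataset. `one_hot` is a hypothetical helper for illustration, not the scikit-learn API.

```python
def one_hot(values):
    """Map each string to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # switch on the slot for this row's category
        encoded.append(row)
    return categories, encoded

# `make` column for the five rows shown by car_df.head(5)
makes = ["Kia", "Kia", "BMW", "Volvo", "BMW"]

categories, encoded = one_hot(makes)
print(categories)
for make, row in zip(makes, encoded):
    print(make, row)
```

Each distinct value becomes its own column, so columns with many distinct values (model and trim are plausible examples) can widen the feature matrix considerably and may need grouping or a different encoding.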

## Experiment 7 & 8: KNN Implementation

### Dependencies

In [8]:
# First, let's bring in the dependencies for this experiment
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import OneHotEncoder


### First implementation of KNN (Iteration 1)

In [25]:
# Select features and target
X = car_df[['year', 'odometer', 'mmr']]  # numerical features
y = pd.cut(car_df['condition'], bins=[0, 2, 3, 5], labels=["Poor", "Fair", "Good"])  # target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict
y_pred_train = knn.predict(X_train_scaled)
y_pred_test = knn.predict(X_test_scaled)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_pred_train))
print("Test Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nClassification Report (Test Set):\n", classification_report(y_test, y_pred_test))

Train Accuracy: 0.7467025522140006
Test Accuracy: 0.6561586992420714

Classification Report (Test Set):
               precision    recall  f1-score   support

        Fair       0.36      0.33      0.34     21236
        Good       0.76      0.85      0.80     61242
        Poor       0.46      0.25      0.32     11990

    accuracy                           0.66     94468
   macro avg       0.53      0.48      0.49     94468
weighted avg       0.63      0.66      0.64     94468
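
The target in the cell above comes from `pd.cut` with edges `[0, 2, 3, 5]`, which by default produces right-closed intervals: (0, 2] is Poor, (2, 3] is Fair, and (3, 5] is Good. The sketch below mimics that rule with the standard library so the boundary behaviour is easy to check. `bin_condition` is a hypothetical stand-in for `pd.cut`, not pandas code.

```python
from bisect import bisect_left

EDGES = [0, 2, 3, 5]
LABELS = ["Poor", "Fair", "Good"]

def bin_condition(score):
    """Return the label of the right-closed interval containing score, or None outside (0, 5]."""
    if not (EDGES[0] < score <= EDGES[-1]):
        return None  # pd.cut would produce NaN for these values
    # bisect_left on the inner edges sends a score equal to an edge to the lower bin
    return LABELS[bisect_left(EDGES[1:-1], score)]

for score in [1.0, 2.0, 2.1, 3.0, 4.5, 5.0, 0.0]:
    print(score, bin_condition(score))
```

A condition of exactly 2.0 lands in Poor and exactly 3.0 lands in Fair, while a score of 0 falls outside every bin.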



### Part two: Refining the KNN algorithm (Second iteration)

Now that we know our model performed poorly, it makes sense to refine the data, predict something else, or refine the algorithm.

In [27]:
# Select features and target
X = car_df[['year', 'odometer', 'mmr']]  # numerical features
y = pd.cut(car_df['condition'], bins=[0, 2, 3, 5], labels=["Poor", "Fair", "Good"])  # target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN model
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train_scaled, y_train)

# Predict
y_pred_train = knn.predict(X_train_scaled)
y_pred_test = knn.predict(X_test_scaled)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_pred_train))
print("Test Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nClassification Report (Test Set):\n", classification_report(y_test, y_pred_test))

Train Accuracy: 0.7205585019107201
Test Accuracy: 0.6747575898717025

Classification Report (Test Set):
               precision    recall  f1-score   support

        Fair       0.38      0.29      0.33     21236
        Good       0.76      0.89      0.82     61242
        Poor       0.47      0.28      0.35     11990

    accuracy                           0.67     94468
   macro avg       0.54      0.48      0.50     94468
weighted avg       0.64      0.67      0.65     94468
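
To judge these accuracies it helps to compare them with a trivial baseline that always predicts the most frequent class. The sketch below derives that baseline from the support column of the report above. The accuracies are copied (rounded) from the printed outputs, and the baseline itself is not something the notebook computes.

```python
# Test-set support per class, from the classification report
support = {"Fair": 21236, "Good": 61242, "Poor": 11990}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Test accuracies printed by the two KNN iterations
knn_test_accuracy = {"n_neighbors=5": 0.6562, "n_neighbors=11": 0.6748}

print(f"Always predicting {majority_class}: {baseline_accuracy:.4f}")
for name, accuracy in knn_test_accuracy.items():
    gain = (accuracy - baseline_accuracy) * 100
    print(f"{name}: {accuracy:.4f} ({gain:+.1f} points vs baseline)")
```

On these numbers, always predicting Good would score about 0.648, so the two KNN iterations improve on the baseline by roughly 0.8 and 2.6 percentage points.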



For the first and second iterations of the KNN algorithm on our dataset:

1. What input data and target (output) data did you use for the prediction task?
* The input data are the numerical columns year, odometer, and mmr. The target is the condition rating binned into three classes: Poor, Fair, and Good.

2. How did your model perform on the train and test set?
* The model performed only moderately. It predicted the Good class well (test F1 of 0.80 in the first iteration) but did poorly on the Poor and Fair classes (test F1 of 0.32 and 0.34).

3. Did your model overfit or underfit?
* The train accuracy (0.7467) is significantly higher than the test accuracy (0.6562), which indicates overfitting.

4. If so, try addressing this problem either by modifying the model or modifying the data. This counts as another iteration within the same experiment. What changes did you make and how much did they help?
* The second iteration tried a range of values for n_neighbors, and the best was 11 neighbors. This narrows the gap between train and test accuracy from about 9 to about 4.6 percentage points, but test accuracy only rises from 0.656 to 0.675. The next changes would therefore likely have to come from modifying the dataset or providing more data for the model to learn from.

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)? (answered in next cell)
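
The overfitting comparison above comes down to the gap between train and test accuracy. Here is a small sketch of that arithmetic, using the accuracies printed by the two KNN cells; `generalization_gap` is a hypothetical helper introduced only for this illustration.

```python
def generalization_gap(train_accuracy, test_accuracy):
    """Train minus test accuracy, expressed in percentage points."""
    return (train_accuracy - test_accuracy) * 100

# (train accuracy, test accuracy) as printed by each KNN iteration
runs = {
    "n_neighbors=5": (0.7467025522140006, 0.6561586992420714),
    "n_neighbors=11": (0.7205585019107201, 0.6747575898717025),
}

gaps = {name: generalization_gap(train, test) for name, (train, test) in runs.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.2f} points")

reduction = gaps["n_neighbors=5"] - gaps["n_neighbors=11"]
print(f"gap reduction: {reduction:.2f} points")
```

A smaller gap means the model generalizes more consistently, but it does not by itself mean the model is more accurate, which is why test accuracy is reported alongside it.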

### What are the real world uses for a model like this?


This model could help car dealerships, or anyone buying and selling cars in bulk, to quickly estimate the condition of a car from data alone. This is helpful because, instead of someone manually inspecting the car to assess its condition, the model can be applied to predict it, and with more data it could achieve higher accuracy and precision.

## Experiment 9: Logistic Regression

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# input and output
X = car_df[['year', 'odometer', 'mmr']]  # numerical features
y = pd.cut(car_df['condition'], bins=[0, 2, 3, 5], labels=["Poor", "Fair", "Good"])  # target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train_scaled, y_train)

# Predict
y_pred_train = log_reg.predict(X_train_scaled)
y_pred_test = log_reg.predict(X_test_scaled)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_pred_train))
print("Test Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nClassification Report (Test Set):\n", classification_report(y_test, y_pred_test))


Train Accuracy: 0.6908629468491643
Test Accuracy: 0.6925202184866833

Classification Report (Test Set):
               precision    recall  f1-score   support

        Fair       0.42      0.22      0.29     21236
        Good       0.75      0.93      0.83     61242
        Poor       0.51      0.29      0.37     11990

    accuracy                           0.69     94468
   macro avg       0.56      0.48      0.50     94468
weighted avg       0.65      0.69      0.65     94468



1. What input data and target (output) data did you use for the prediction task?
* The input data are 'year', 'odometer', and 'mmr'. The target is again the binned 'condition', so that the models can be compared.

2. How did your model perform on the train and test set?
* It did not do well on most classes, but it performed well on the Good class (test recall of 0.93, against 0.22 for Fair and 0.29 for Poor).

3. Did your model overfit or underfit?
* Train accuracy (0.6909) and test accuracy (0.6925) are nearly identical, so the model is not overfitting, but it is still not very accurate. It looks like we need a larger data sample to better predict results. Generalization is the main area where this model does better than our previous KNN.

4. If so, try addressing this problem either by modifying the model or modifying the data. This counts as another iteration within the same experiment. What changes did you make and how much did they help?
* No further iteration was run. The support column shows that the Good class has far more cars (61,242) than Fair (21,236) or Poor (11,990), and the model performs well only on that class, so the data sample for the smaller classes would have to be larger.

5. What can you potentially do on the data side to increase the performance further (find more data, more of a specific type of data etc.)
* With our current dataset, adding more input data from other columns might improve the accuracy of the model. Otherwise, creating synthetic (pseudo) data may help during training. Incorporating more data and refining the features could significantly enhance the model's ability to make accurate predictions.
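
The support column in the report above shows that the three classes are unevenly represented. One option the notebook does not try is reweighting the classes, for example with `LogisticRegression(class_weight='balanced')`, which scikit-learn documents as using weights of `n_samples / (n_classes * count)`. The sketch below only evaluates that formula. It uses the test-set support as a stand-in for the class counts, which is an assumption: in practice the weights would be computed from the training labels.

```python
# Test-set support per class, from the classification report (assumed proxy for class counts)
support = {"Fair": 21236, "Good": 61242, "Poor": 11990}

n_samples = sum(support.values())
n_classes = len(support)

# "Balanced" weighting: rarer classes get proportionally larger weights
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in support.items()
}

for label, weight in balanced_weights.items():
    print(f"{label}: {weight:.3f}")
```

Under this scheme errors on Poor cars would count about five times as much as errors on Good cars. Whether that improves the minority-class scores here is untested.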


In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, classification_report

# Prepare features
X = car_df[['year', 'odometer', 'mmr']]  # Numerical features

# Create binary target: 1 for "Good," 0 for "Not Good"
y = pd.cut(car_df['condition'], bins=[0, 2, 3, 5], labels=["Poor", "Fair", "Good"])
y_binary = (y == "Good").astype(int)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN model
knn = KNeighborsClassifier(n_neighbors=7, weights='distance')
knn.fit(X_train_scaled, y_train)

# Predict
y_pred_train = knn.predict(X_train_scaled)
y_pred_test = knn.predict(X_test_scaled)

# Evaluate performance
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)
test_precision = precision_score(y_test, y_pred_test)

print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")

# Detailed classification report
print("\nClassification Report (Test Set):\n")
print(classification_report(y_test, y_pred_test, target_names=["Not Good", "Good"]))


Train Accuracy: 0.9996
Test Accuracy: 0.7202
Test Precision: 0.7703

Classification Report (Test Set):

              precision    recall  f1-score   support

    Not Good       0.61      0.55      0.58     33226
        Good       0.77      0.81      0.79     61242

    accuracy                           0.72     94468
   macro avg       0.69      0.68      0.69     94468
weighted avg       0.71      0.72      0.72     94468
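
The train accuracy of 0.9996 in this run is much higher than in the earlier KNN cells. A likely reason, stated here as an assumption rather than something the notebook verifies, is `weights='distance'`: each training row is its own nearest neighbour at distance zero, so it dominates the vote for itself. The toy 1-D classifier below is a hand-written sketch (not scikit-learn code) that shows the effect on made-up data.

```python
from collections import defaultdict

def knn_predict(train_x, train_y, query, k, weights="uniform"):
    """Predict a label for query from its k nearest 1-D training points."""
    neighbours = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))[:k]
    votes = defaultdict(float)
    exact_matches = [label for x, label in neighbours if x == query]
    if weights == "distance" and exact_matches:
        # A zero-distance neighbour takes the whole vote
        for label in exact_matches:
            votes[label] += 1.0
    else:
        for x, label in neighbours:
            votes[label] += 1.0 if weights == "uniform" else 1.0 / abs(x - query)
    return max(votes, key=votes.get)

# Made-up data with two labels (at x=3 and x=4) that disagree with their neighbours
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [0, 0, 1, 0, 1, 1, 1]

def train_accuracy(weights):
    """Accuracy when the model predicts the points it was trained on."""
    hits = sum(
        knn_predict(train_x, train_y, x, k=3, weights=weights) == y
        for x, y in zip(train_x, train_y)
    )
    return hits / len(train_x)

uniform_acc = train_accuracy("uniform")
distance_acc = train_accuracy("distance")
print(f"uniform: {uniform_acc:.2f}, distance: {distance_acc:.2f}")
```

If this explanation holds, train accuracy under distance weighting says little about overfitting, and the test accuracy (0.7202) is the more meaningful figure to compare with the earlier iterations.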

