Car Price Prediction: Machine Leaning Models

Cyrus Kolahi

run proj3_data_preprocess.ipynb to preprocess and create train and test set

In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import sklearn.utils, sklearn.preprocessing, sklearn.decomposition, sklearn.svm
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader, TensorDataset

from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

from tqdm import tqdm
import pickle




load data from data folder

In [153]:
X_train = pd.read_csv("data/X_train_scaled.csv")
X_test = pd.read_csv("data/X_test_scaled.csv")
X_val = pd.read_csv("data/X_val_scaled.csv")
y_train = pd.read_csv("data/y_train_scaled.csv")
y_test = pd.read_csv("data/y_test_scaled.csv")
y_val = pd.read_csv("data/y_val_scaled.csv")


Modeling

Basic Linear Regression:

In [163]:
LinReg = LinearRegression()
LinReg.fit(X_train, y_train)
lr_val_pred = LinReg.predict(X_val)
lr_pred = LinReg.predict(X_test)
print("Linear Regression Results:")
print(f"Test MSE: {mean_squared_error(y_test, lr_pred):.2f}")
print(f"Val MSE: {mean_squared_error(y_val, lr_val_pred):.2f}")
print(f"Test R2 Score: {r2_score(y_test, lr_pred):.2f}")
print(f"Val R2 Score: {r2_score(y_val, lr_val_pred):.2f}")

Linear Regression Results:
Test MSE: 0.00
Val MSE: 0.00
Test R2 Score: 1.00
Val R2 Score: 1.00


Regression with Kernels:

In [165]:
kernels = ['linear', 'rbf', 'poly']
for kernel in kernels:
    kr = KernelRidge(kernel=kernel)
    kr.fit(X_train, y_train)
    test_pred = kr.predict(X_test)
    val_pred = kr.predict(X_val)
    print(f"Regression with {kernel} kernel Results:")
    print(f"Test MSE: {mean_squared_error(y_test, test_pred):.2f}")
    print(f"Val MSE: {mean_squared_error(y_val, val_pred):.2f}")
    print(f"Test R2 Score: {r2_score(y_test, test_pred):.2f}")
    print(f"Val R2 Score: {r2_score(y_val, val_pred):.2f}\n")


Regression with linear kernel Results:
Test MSE: 0.00
Val MSE: 0.00
Test R2 Score: 1.00
Val R2 Score: 1.00

Regression with rbf kernel Results:
Test MSE: 0.00
Val MSE: 0.00
Test R2 Score: 1.00
Val R2 Score: 1.00

Regression with poly kernel Results:
Test MSE: 0.00
Val MSE: 0.00
Test R2 Score: 1.00
Val R2 Score: 1.00



Support Vector Regression with different kernels:

In [166]:
for kernel in ['linear', 'rbf', 'poly']:
    svr = SVR(kernel=kernel)
    svr.fit(X_train, y_train)
    svr_pred = svr.predict(X_test)
    svr_val_pred = svr.predict(X_val)
    print(f"Support Vector Regression ({kernel} kernel) Results:")
    print(f"Test MSE: {mean_squared_error(y_test, svr_pred):.2f}")
    print(f"Val MSE: {mean_squared_error(y_val, svr_val_pred):.2f}")
    print(f"Test R2 Score: {r2_score(y_test, svr_pred):.2f}")
    print(f"Val R2 Score: {r2_score(y_val, svr_val_pred):.2f}\n")

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Support Vector Regression (linear kernel) Results:
Test MSE: 0.00
Val MSE: 0.00
Test R2 Score: 1.00
Val R2 Score: 1.00

Support Vector Regression (rbf kernel) Results:
Test MSE: 0.00
Val MSE: 0.00
Test R2 Score: 1.00
Val R2 Score: 1.00



  y = column_or_1d(y, warn=True)


Support Vector Regression (poly kernel) Results:
Test MSE: 0.05
Val MSE: 0.05
Test R2 Score: 0.95
Val R2 Score: 0.95



Decision Tree

In [170]:
dt = DecisionTreeRegressor(random_state=9)
dt.fit(X_train, y_train)
test_pred = dt.predict(X_test)
val_pred = dt.predict(X_val)
print("Decision Tree Results:")
print(f"Test MSE: {mean_squared_error(y_test, test_pred):.2f}")
print(f"Val MSE: {mean_squared_error(y_val, val_pred):.2f}")
print(f"Test R2 Score: {r2_score(y_test, test_pred):.2f}")
print(f"Val R2 Score: {r2_score(y_val, val_pred):.2f}\n")

Decision Tree Results:
Test MSE: 0.07
Val MSE: 0.08
Test R2 Score: 0.93
Val R2 Score: 0.92



Random Forest

In [171]:
rf = RandomForestRegressor(n_estimators=100, random_state=20)
rf.fit(X_train, y_train)
test_pred = rf.predict(X_test)
val_pred = rf.predict(X_val)
print("Random Forest Results:")
print(f"Test MSE: {mean_squared_error(y_test, test_pred):.2f}")
print(f"Val MSE: {mean_squared_error(y_val, val_pred):.2f}")
print(f"Test R2 Score: {r2_score(y_test, test_pred):.2f}")
print(f"Val R2 Score: {r2_score(y_val, val_pred):.2f}\n")


  return fit_method(estimator, *args, **kwargs)


Random Forest Results:
Test MSE: 0.03
Val MSE: 0.03
Test R2 Score: 0.97
Val R2 Score: 0.97



Neural Network:

In [148]:
class CarPriceNN(nn.Module):
    def __init__(self, input_dim):
        super(CarPriceNN, self).__init__()
        
        # First block with wider layers
        self.layer1 = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.2)
        )
        
        # Second block with residual connection
        self.layer2 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.2)
        )
        
        # Third block decreasing dimensions
        self.layer3 = nn.Sequential(
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(0.2)
        )
        
        # Final prediction layers
        self.output_layers = nn.Sequential(
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )
        
        # Initialize weights
        #self.apply(self._init_weights)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)
                
    def forward(self, x):
        # Forward pass with residual connection
        x1 = self.layer1(x)
        x2 = self.layer2(x1)
        x3 = self.layer3(x2)
        out = self.output_layers(x3)
        return out

def calculate_r2(y_true, y_pred):
    # Ensure inputs are the right shape and scale
    y_true = y_true.squeeze()  # Remove extra dimensions
    y_pred = y_pred.squeeze()
    
    # Convert to numpy if they're torch tensors
    if torch.is_tensor(y_true):
        y_true = y_true.detach().cpu().numpy()
    if torch.is_tensor(y_pred):
        y_pred = y_pred.detach().cpu().numpy()
    
    # Calculate R2
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - (ss_res / ss_tot)
    return r2

def train_model(model, train_loader, val_loader, test_loader, epochs=50):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)

    
    best_val_loss = float('inf')
    patience_counter = 0
    patience = 10  # Early stopping patience
    
    for epoch in tqdm(range(epochs)):
        # Training
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            y_batch = y_batch
            #print(y_batch.shape, X_batch.shape)

            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            
        # Validation/Test
        model.eval()
        val_preds =[]
        val_true=[]
        test_preds=[]
        test_true=[]
        val_loss = 0
        test_loss = 0

        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                #y_batch = y_batch
                #print(y_batch.shape, X_batch.shape)
                y_pred = model(X_batch)
                val_loss += criterion(y_pred, y_batch).item()
                val_preds.append(y_pred)
                val_true.append(y_batch)
        

            for X_batch, y_batch in test_loader:
                #y_batch = y_batch
                #print(y_batch.shape, X_batch.shape)
                y_pred = model(X_batch)
                test_loss += criterion(y_pred, y_batch).item()
                test_preds.append(y_pred)
                test_true.append(y_batch)
            
        # Average losses
        avg_train_loss = train_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)
        avg_test_loss = test_loss / len(test_loader)
        
         # Concatenate all predictions and true values
        val_true = torch.cat(val_true)
        val_pred = torch.cat(val_preds)
        test_true = torch.cat(test_true)
        test_pred = torch.cat(test_preds)

    
        # Calculate R2 score
        val_r2 = calculate_r2(val_true, val_pred)
        test_r2 = calculate_r2(test_true, test_pred)
    
       
        
        # Early stopping
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            patience_counter = 0
        else:
            patience_counter += 1
            
        if patience_counter >= patience:
            print(f'Early stopping at epoch {epoch}')
            break
            
        if epoch % 10 == 0:
            print(f'Epoch {epoch}: \n Train Loss: {avg_train_loss:.4f}, \n Val Loss: {avg_val_loss:.4f} \n Test Loss: {avg_test_loss:.4f}')
            print(f'Val R2: {np.mean(val_r2):.4f}, \n'
                  f'Test R2: {np.mean(test_r2):.4f}')
            
        return avg_test_loss, avg_val_loss, avg_train_loss, val_r2, test_r2
            
       

In [172]:
X_train = pd.read_csv("data/X_train_scaled.csv")
y_train = pd.read_csv("data/y_train_scaled.csv")
X_test = pd.read_csv("data/X_test_scaled.csv")
y_test = pd.read_csv("data/y_test_scaled.csv")
X_val = pd.read_csv("data/X_val_scaled.csv")
y_val = pd.read_csv("data/y_val_scaled.csv")

scaler_y = pickle.load(open("data/scaler_y.pkl", "rb"))
scaler_X = pickle.load(open("data/scaler_X.pkl", "rb"))
scalers = pickle.load(open("data/scalers.pkl", "rb"))


X_train_tensor = torch.FloatTensor(X_train.values) 
y_train_tensor = torch.FloatTensor(y_train.values)

train_dataset=TensorDataset(X_train_tensor,y_train_tensor)

X_test_tensor = torch.FloatTensor(X_test.values) 
y_test_tensor = torch.FloatTensor(y_test.values)

test_dataset=TensorDataset(X_test_tensor,y_test_tensor)

X_val_tensor = torch.FloatTensor(X_val.values) 
y_val_tensor = torch.FloatTensor(y_val.values)

val_dataset=TensorDataset(X_val_tensor,y_val_tensor)


batch_size = 32
train_data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_data_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [173]:
model = CarPriceNN(10)
train_model(model,train_data_loader,val_data_loader,test_data_loader)

  0%|          | 0/50 [00:01<?, ?it/s]

Epoch 0: 
 Train Loss: 0.2627, 
 Val Loss: 0.0383 
 Test Loss: 0.0375
Val R2: 0.9596, 
Test R2: 0.9613





(0.03751357157338173,
 0.038292456576798826,
 0.26269428743279144,
 np.float32(0.95958364),
 np.float32(0.9613352))

In [174]:
X_train = pd.read_csv("data/X_train_scaled.csv")
y_train = pd.read_csv("data/y_train_scaled.csv")
X_test = pd.read_csv("data/X_test_scaled.csv")
y_test = pd.read_csv("data/y_test_scaled.csv")
X_val = pd.read_csv("data/X_val_scaled.csv")
y_val = pd.read_csv("data/y_val_scaled.csv")


In [175]:

def set_seed(seed):
    """Set all random seeds for reproducibility"""
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

def run_trial(seed, X_train, y_train, X_val, y_val, X_test, y_test):
    """Run a single trial with given random seed"""
    # Set the seed
    set_seed(seed)
    
    # Create datasets
    X_train_tensor = torch.FloatTensor(X_train.values) 
    y_train_tensor = torch.FloatTensor(y_train.values)

    train_dataset=TensorDataset(X_train_tensor,y_train_tensor)

    X_test_tensor = torch.FloatTensor(X_test.values) 
    y_test_tensor = torch.FloatTensor(y_test.values)

    test_dataset=TensorDataset(X_test_tensor,y_test_tensor)

    X_val_tensor = torch.FloatTensor(X_val.values) 
    y_val_tensor = torch.FloatTensor(y_val.values)

    val_dataset=TensorDataset(X_val_tensor,y_val_tensor)


    batch_size = 32
    train_data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    test_data_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    

    
    # Initialize model
    input_dim = X_train.shape[1]
    model = CarPriceNet(input_dim)
    
    # Train model
    avg_test_loss, avg_val_loss, avg_train_loss, val_r2, test_r2 = train_model(model, train_data_loader, val_data_loader, test_data_loader)

    return avg_test_loss, avg_val_loss, avg_train_loss, val_r2, test_r2



    

# Run multiple trials with different seeds
seeds = [42, 43, 44, 45, 46, 100, 200, 400, 333, 73] 
results = []

for seed in seeds:
    print(f"\nRunning trial with seed {seed}")
    avg_test_loss, avg_val_loss, avg_train_loss, val_r2, test_r2 = run_trial(seed, X_train, y_train, X_val, y_val, X_test, y_test)
    results.append({
        'seed': seed,
        'val loss': avg_val_loss,   
        'test loss': avg_test_loss,
        'train loss': avg_train_loss,
        'val r2': val_r2,
        'test r2': test_r2
    })
    print(f"MSE: {avg_test_loss:.4f}")
    print(f"R2 Score: {test_r2:.4f}")

# Calculate average performance
avg_mse_test = np.mean([r['test loss'] for r in results])
avg_r2_test = np.mean([r['test r2'] for r in results])
avg_mse_val = np.mean([r['val loss'] for r in results])
avg_r2_val = np.mean([r['val r2'] for r in results])

print("\nOverall Results:")
print(f"Average Test Loss: {avg_mse_test:.4f}")
print(f"Average Test r2 Score: {avg_r2_test:.4f}")

# Print individual results



Running trial with seed 42


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.2721, 
 Val Loss: 0.0051 
 Test Loss: 0.0048
Val R2: 0.9946, 
Test R2: 0.9950
MSE: 0.0048
R2 Score: 0.9950

Running trial with seed 43


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.1880, 
 Val Loss: 0.0052 
 Test Loss: 0.0052
Val R2: 0.9945, 
Test R2: 0.9946
MSE: 0.0052
R2 Score: 0.9946

Running trial with seed 44


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.1862, 
 Val Loss: 0.0042 
 Test Loss: 0.0044
Val R2: 0.9955, 
Test R2: 0.9955
MSE: 0.0044
R2 Score: 0.9955

Running trial with seed 45


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.1864, 
 Val Loss: 0.0038 
 Test Loss: 0.0038
Val R2: 0.9960, 
Test R2: 0.9960
MSE: 0.0038
R2 Score: 0.9960

Running trial with seed 46


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.2050, 
 Val Loss: 0.0054 
 Test Loss: 0.0055
Val R2: 0.9943, 
Test R2: 0.9943
MSE: 0.0055
R2 Score: 0.9943

Running trial with seed 100


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.2615, 
 Val Loss: 0.0054 
 Test Loss: 0.0054
Val R2: 0.9943, 
Test R2: 0.9944
MSE: 0.0054
R2 Score: 0.9944

Running trial with seed 200


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.1949, 
 Val Loss: 0.0057 
 Test Loss: 0.0057
Val R2: 0.9940, 
Test R2: 0.9941
MSE: 0.0057
R2 Score: 0.9941

Running trial with seed 400


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.2425, 
 Val Loss: 0.0055 
 Test Loss: 0.0052
Val R2: 0.9942, 
Test R2: 0.9946
MSE: 0.0052
R2 Score: 0.9946

Running trial with seed 333


  0%|          | 0/50 [00:00<?, ?it/s]


Epoch 0: 
 Train Loss: 0.2674, 
 Val Loss: 0.0056 
 Test Loss: 0.0056
Val R2: 0.9941, 
Test R2: 0.9942
MSE: 0.0056
R2 Score: 0.9942

Running trial with seed 73


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 0: 
 Train Loss: 0.2060, 
 Val Loss: 0.0047 
 Test Loss: 0.0051
Val R2: 0.9951, 
Test R2: 0.9947
MSE: 0.0051
R2 Score: 0.9947

Overall Results:
Average Test Loss: 0.0051
Average Test r2 Score: 0.9948



