https://www.kaggle.com/competitions/playground-series-s4e9/overview

best score was using Linear Regression model w default hyperparameters

prompt:
The goal of this competition is to predict the price of used cars based on various attributes.
Submissions are scored on the root mean squared error. 

The submission.csv file should be formatted like this:
id,price
188533,43878.016
188534,43878.016
188535,43878.016
etc.

the data for this challenge is:
/home/john/ai/kaggle2/data/regression/used-car-prices/train.csv
the training dataset; price is the continuous target

/home/john/ai/kaggle2/data/regression/used-car-prices/test.csv
the test dataset; your objective is to predict the value of price for each row

/home/john/ai/kaggle2/data/regression/used-car-prices/sample_submission.csv
a sample submission file in the correct format.

I would like to prepare the data to train four popular models.
Some of the features are catgorial so they will need to be encoded.

Let's start with the data preparation in one block

Then we will train the models.

Important: after each of the four models are trained, pick the one with the best root mean squared error and create a submission.csv as per the prescribed format.

I would also like some kind of way to monitor progress of the training in progress.


In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np

# Load datasets
train_df = pd.read_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/train.csv')
test_df = pd.read_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/test.csv')
sample_submission = pd.read_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/sample_submission.csv')

# Separate features and target variable from the training set
X = train_df.drop(columns=['price'])
y = train_df['price']

# Split training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify categorical and numerical columns
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(exclude=['object']).columns

# Preprocessing pipeline for numerical and categorical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing values with median
    ('scaler', StandardScaler())  # Scale numerical features
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing categorical values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

# Combine the transformations into a preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# Return the preprocessed training and validation data for model training
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit the pipeline on the training data
X_train_preprocessed = pipeline.fit_transform(X_train)
X_val_preprocessed = pipeline.transform(X_val)
X_test_preprocessed = pipeline.transform(test_df)

# Monitor progress using callbacks (for tree-based models, you can track iterations)


In [2]:
# Import necessary libraries for progress tracking
import time
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb
from tqdm import tqdm  # to add progress bars where necessary

# Define the models with initial hyperparameters
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42, verbosity=1)  # Verbosity for XGBoost
}

# Dictionary to store RMSE for each model
rmse_scores = {}

# Function to calculate RMSE
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Train each model and monitor progress
for model_name, model in models.items():
    print(f'Training {model_name}...')
    start_time = time.time()

    if model_name == 'Linear Regression':
        # No progress tracking available, so we only measure time and RMSE
        model.fit(X_train_preprocessed, y_train)
    
    elif model_name == 'Random Forest':
        # Monitor progress of Random Forest training
        model = RandomForestRegressor(n_estimators=100, warm_start=True, random_state=42)
        for i in tqdm(range(1, 101, 10)):  # Adding trees in increments of 10 to monitor progress
            model.set_params(n_estimators=i)
            model.fit(X_train_preprocessed, y_train)
    
    elif model_name == 'Gradient Boosting':
        # Monitor progress of Gradient Boosting training
        model = GradientBoostingRegressor(n_estimators=100, warm_start=True, random_state=42)
        for i in tqdm(range(1, 101, 10)):  # Incrementally add 10 trees
            model.set_params(n_estimators=i)
            model.fit(X_train_preprocessed, y_train)
    
    elif model_name == 'XGBoost':
        # XGBoost has built-in verbosity for monitoring
        model.fit(X_train_preprocessed, y_train, eval_set=[(X_val_preprocessed, y_val)], verbose=True)

    # Predict on validation set
    y_val_pred = model.predict(X_val_preprocessed)
    
    # Calculate RMSE
    rmse = calculate_rmse(y_val, y_val_pred)
    rmse_scores[model_name] = rmse
    print(f'{model_name} RMSE: {rmse}')
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'Time taken for {model_name}: {elapsed_time:.2f} seconds')

# Select the model with the lowest RMSE
best_model_name = min(rmse_scores, key=rmse_scores.get)
best_model = models[best_model_name]

print(f'Best model: {best_model_name} with RMSE: {rmse_scores[best_model_name]}')

# Train the best model on the entire training data
best_model.fit(pipeline.transform(X), y)

# Make predictions on the test set
y_test_pred = best_model.predict(X_test_preprocessed)

# Prepare the submission file
submission = pd.DataFrame({'id': test_df['id'], 'price': y_test_pred})
submission.to_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/submission.csv', index=False)
print('Submission file created.')



Training Linear Regression...
Linear Regression RMSE: 69312.44919246348
Time taken for Linear Regression: 12.66 seconds
Training Random Forest...


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [1:14:49<00:00, 449.00s/it]


Random Forest RMSE: 75637.44195670752
Time taken for Random Forest: 4492.63 seconds
Training Gradient Boosting...


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00,  2.27s/it]


Gradient Boosting RMSE: 69688.3041183227
Time taken for Gradient Boosting: 22.75 seconds
Training XGBoost...
[0]	validation_0-rmse:72263.82524
[1]	validation_0-rmse:70963.23408
[2]	validation_0-rmse:70375.14722
[3]	validation_0-rmse:69993.30649
[4]	validation_0-rmse:69811.84043
[5]	validation_0-rmse:69823.25140
[6]	validation_0-rmse:69794.29815
[7]	validation_0-rmse:69887.63693
[8]	validation_0-rmse:69831.57490
[9]	validation_0-rmse:69791.82599
[10]	validation_0-rmse:69787.73642
[11]	validation_0-rmse:69762.64038
[12]	validation_0-rmse:69772.29454
[13]	validation_0-rmse:69814.86923
[14]	validation_0-rmse:69803.35074
[15]	validation_0-rmse:69809.04029
[16]	validation_0-rmse:69776.45313
[17]	validation_0-rmse:69769.15829
[18]	validation_0-rmse:69842.79340
[19]	validation_0-rmse:69933.25848
[20]	validation_0-rmse:69935.46454
[21]	validation_0-rmse:69953.39074
[22]	validation_0-rmse:69949.06340
[23]	validation_0-rmse:69975.16626
[24]	validation_0-rmse:69961.36185
[25]	validation_0-rmse:699

Best model: Linear Regression with RMSE: 69312.44919246348
Submission file created.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import itertools

# Function to calculate RMSE
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Define the hyperparameters for Linear Regression
fit_intercept_options = [True, False]
positive_options = [True, False]

# Store the best combination of hyperparameters and the corresponding RMSE
best_rmse = float("inf")
best_hyperparameters = {}

# Get all combinations of hyperparameters using itertools.product
hyperparameter_combinations = list(itertools.product(fit_intercept_options, positive_options))

# Loop through each combination of hyperparameters
for i, (fit_intercept, positive) in enumerate(hyperparameter_combinations):
    print(f"Training permutation {i+1}/{len(hyperparameter_combinations)}:")
    print(f"  - fit_intercept: {fit_intercept}")
    print(f"  - positive: {positive}")

    # Create the model with the current set of hyperparameters
    model = LinearRegression(fit_intercept=fit_intercept, positive=positive)
    
    # Fit the model
    model.fit(X_train_preprocessed, y_train)
    
    # Predict on the validation set
    y_val_pred = model.predict(X_val_preprocessed)
    
    # Calculate RMSE
    rmse = calculate_rmse(y_val, y_val_pred)
    print(f"  - RMSE: {rmse}\n")
    
    # Check if this RMSE is the best so far
    if rmse < best_rmse:
        best_rmse = rmse
        best_hyperparameters = {'fit_intercept': fit_intercept, 'positive': positive}

# Output the best hyperparameter combination and RMSE
print(f"Best hyperparameters:")
print(f"  - fit_intercept: {best_hyperparameters['fit_intercept']}")
print(f"  - positive: {best_hyperparameters['positive']}")
print(f"Best RMSE: {best_rmse}")

# Train the best model on the full training data
best_model = LinearRegression(**best_hyperparameters)
best_model.fit(pipeline.transform(X), y)

# Make predictions on the test set
y_test_pred = best_model.predict(X_test_preprocessed)

# Prepare the submission file
submission = pd.DataFrame({'id': test_df['id'], 'price': y_test_pred})
submission.to_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/submission.csv', index=False)
print('Submission file created.')


prompt:
I would find out which permutation of all possible hyperparameters of the Linear Regression model results the lowest RMSE. Please write the code for this.  after each round of training give me the permutation and RMSE. After all the permutations have been tried, choose the combination of hyperparameters that has the lowest RMSE and create the submission.csv.   

In [5]:
# Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
import itertools

# Load datasets
train_df = pd.read_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/train.csv')
test_df = pd.read_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/test.csv')

# Separate features and target variable from the training set
X = train_df.drop(columns=['price'])
y = train_df['price']

# Split training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify categorical and numerical columns
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(exclude=['object']).columns

# Preprocessing pipeline for numerical and categorical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))  # Corrected sparse argument
])

# Combine the transformations into a preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# Fit the preprocessor on the training data and transform both training and validation sets
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_val_preprocessed = preprocessor.transform(X_val)
X_test_preprocessed = preprocessor.transform(test_df)

# Function to calculate RMSE
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Define the hyperparameters for Linear Regression
fit_intercept_options = [True, False]
positive_options = [True, False]

# Store the best combination of hyperparameters and the corresponding RMSE
best_rmse = float("inf")
best_hyperparameters = {}

# Get all combinations of hyperparameters using itertools.product
hyperparameter_combinations = list(itertools.product(fit_intercept_options, positive_options))

# Loop through each combination of hyperparameters
for i, (fit_intercept, positive) in enumerate(hyperparameter_combinations):
    print(f"Training permutation {i+1}/{len(hyperparameter_combinations)}:")
    print(f"  - fit_intercept: {fit_intercept}")
    print(f"  - positive: {positive}")

    # Create the model with the current set of hyperparameters
    model = LinearRegression(fit_intercept=fit_intercept, positive=positive)
    
    # Fit the model
    model.fit(X_train_preprocessed, y_train)
    
    # Predict on the validation set
    y_val_pred = model.predict(X_val_preprocessed)
    
    # Calculate RMSE
    rmse = calculate_rmse(y_val, y_val_pred)
    print(f"  - RMSE: {rmse}\n")
    
    # Check if this RMSE is the best so far
    if rmse < best_rmse:
        best_rmse = rmse
        best_hyperparameters = {'fit_intercept': fit_intercept, 'positive': positive}

# Output the best hyperparameter combination and RMSE
print(f"Best hyperparameters:")
print(f"  - fit_intercept: {best_hyperparameters['fit_intercept']}")
print(f"  - positive: {best_hyperparameters['positive']}")
print(f"Best RMSE: {best_rmse}")

# Train the best model on the full training data
best_model = LinearRegression(**best_hyperparameters)
best_model.fit(preprocessor.transform(X), y)

# Make predictions on the test set
y_test_pred = best_model.predict(X_test_preprocessed)

# Prepare the submission file
submission = pd.DataFrame({'id': test_df['id'], 'price': y_test_pred})
submission.to_csv('/home/john/ai/kaggle2/data/regression/used-car-prices/submission.csv', index=False)
print('Submission file created.')



Training permutation 1/4:
  - fit_intercept: True
  - positive: True
  - RMSE: 69636.20854076598

Training permutation 2/4:
  - fit_intercept: True
  - positive: False
  - RMSE: 1813767573304.7864

Training permutation 3/4:
  - fit_intercept: False
  - positive: True
  - RMSE: 69490.48201420742

Training permutation 4/4:
  - fit_intercept: False
  - positive: False
  - RMSE: 197604789544355.44

Best hyperparameters:
  - fit_intercept: False
  - positive: True
Best RMSE: 69490.48201420742
Submission file created.


best score was using Linear Regression model w default hyperparameters