https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales/overview

initial prompt:

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

The submission.csv file should be formatted like this:
ID,item_cnt_month
0,0.5
1,0.5
2,0.5
3,0.5
etc.

the data for this challenge will be stored here:
/home/john/ai/kaggle2/data/regression/predict-future-sales/

data description:
item_categories.csv:
has a linking key column called item_category_id

items.csv:
has two linking key columns:
item_id and item_category_id

sales_train.csv:
has 6 columns, including linking key columns shop_id and item_id

sample_submission.csv:
a sample of the submission file. The final file, called submission.csv, must follow the same format (see the prescribed format above).

shops.csv:
has a linking key shop_id

test.csv:
has two linking keys, shop_id and item_id

Here is additional information regarding the data:
You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

File descriptions
sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
sample_submission.csv - a sample submission file in the correct format.
items.csv - supplemental information about the items/products.
item_categories.csv - supplemental information about the item categories.
shops.csv - supplemental information about the shops.
Data fields
ID - an Id that represents a (Shop, Item) tuple within the test set
shop_id - unique identifier of a shop
item_id - unique identifier of a product
item_category_id - unique identifier of item category
item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
item_price - current price of an item
date - date in format dd.mm.yyyy
date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
item_name - name of item
shop_name - name of shop
item_category_name - name of item category

Submissions are evaluated by root mean squared error (RMSE). True target values are clipped into [0,20] range.

I would like to prepare the data to train up to four popular models, and then train the models
using default hyperparameters.

Important: after all of the models are trained, pick the one with the best root mean squared error and create a submission.csv in the prescribed format.

I would also like some way to monitor the training while it is in progress.

using grok

1. Loading the Data
You'll need to load all relevant CSV files (train, test, items, item_categories, shops) into dataframes.

In [None]:
'''
import pandas as pd

# Define file paths
data_path = "/home/john/ai/kaggle2/data/regression/predict-future-sales/"
sales_train = pd.read_csv(data_path + 'sales_train.csv')
test = pd.read_csv(data_path + 'test.csv')
items = pd.read_csv(data_path + 'items.csv')
item_categories = pd.read_csv(data_path + 'item_categories.csv')
shops = pd.read_csv(data_path + 'shops.csv')
sample_submission = pd.read_csv(data_path + 'sample_submission.csv')

# Display the first few rows of each dataframe
sales_train.head(), test.head(), items.head(), item_categories.head(), shops.head()
'''

2. Data Preprocessing
You'll need to aggregate sales data by month since you're predicting monthly sales (item_cnt_day needs to be summed).

In [None]:
'''
# Aggregate sales by month
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year

# Group by month, shop, and item
monthly_sales = sales_train.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum'}).reset_index()
monthly_sales.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

# Merge with test set to ensure consistent data structure
train_data = test.merge(monthly_sales, on=['shop_id', 'item_id'], how='left').fillna(0)
'''
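
To make the aggregation step concrete, here is a small sketch on made-up rows. The column names follow sales_train.csv, but the values are invented. Clipping the monthly target to [0, 20] is an assumption added here to mirror the evaluation range; the cell above does not do it.

```python
import pandas as pd

# Toy daily sales; column names follow sales_train.csv, values are made up.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [5, 5, 5, 5],
    "item_id":        [10, 10, 11, 10],
    "item_cnt_day":   [1.0, 2.0, 30.0, -1.0],
})

# Sum the daily counts into one row per (month, shop, item).
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# Assumption: clip the monthly target to [0, 20] to match the evaluation range.
monthly["item_cnt_month"] = monthly["item_cnt_month"].clip(0, 20)
print(monthly)
```

With these rows the three monthly totals are 3, 30 and -1 before clipping, and 3, 20 and 0 after.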

3. Prepare Features
You can now merge additional features (like item categories and shops) into the dataset.

In [None]:
'''
# Merge additional features
train_data = train_data.merge(items[['item_id', 'item_category_id']], on='item_id', how='left')
train_data = train_data.merge(item_categories[['item_category_id']], on='item_category_id', how='left')
train_data = train_data.merge(shops[['shop_id']], on='shop_id', how='left')

# Handle missing values
train_data.fillna(0, inplace=True)
'''

4. Model Training
You will now prepare to train four popular models: Linear Regression, Random Forest Regressor, XGBoost, and LightGBM.



In [None]:
'''
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb

# Prepare feature matrix (X) and target (y)
X = train_data.drop(['item_cnt_month'], axis=1)
y = train_data['item_cnt_month']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(),
    "XGBoost": xgb.XGBRegressor(),
    "LightGBM": lgb.LGBMRegressor()
}

# Train and evaluate each model
best_model = None
best_rmse = float('inf')

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmse = mean_squared_error(y_val, y_pred) ** 0.5  # RMSE; newer scikit-learn no longer accepts squared=False
    print(f"{name} RMSE: {rmse}")
    
    if rmse < best_rmse:
        best_rmse = rmse
        best_model = model
'''
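
Because the competition clips true targets into [0, 20], a validation RMSE computed on unclipped monthly sums is not directly comparable to the leaderboard score. A minimal sketch of a clipped RMSE follows. The helper name `clipped_rmse` is hypothetical, and clipping the predictions as well as the targets is an assumption.

```python
import math

def clipped_rmse(y_true, y_pred, low=0.0, high=20.0):
    """RMSE after clipping both targets and predictions into [low, high]."""
    def clip(value):
        return min(max(value, low), high)

    squared_errors = [(clip(t) - clip(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# The third pair (50 vs 25) contributes no error once both values are clipped to 20.
print(clipped_rmse([0, 3, 50], [0.5, 2.0, 25.0]))  # about 0.6455
```

In the training loop this could be called as `clipped_rmse(y_val, y_pred)` in place of the plain RMSE, if the goal is a number closer to what the leaderboard measures.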

5. Generate Submission File
After training the models, use the best-performing one to make predictions on the test set and generate the submission.csv file.

In [None]:
'''
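# Make predictions on the test set, aligned to the training feature columns
# (test.csv lacks date_block_num and item_category_id; 0 is only a placeholder)
test_features = test.reindex(columns=X.columns, fill_value=0)
test_preds = best_model.predict(test_features)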

# Clip the predictions between 0 and 20 as required by the competition rules
test_preds = test_preds.clip(0, 20)

# Create submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "item_cnt_month": test_preds
})

# Save submission file
submission.to_csv(data_path + 'submission.csv', index=False)
'''
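
A model fitted on X expects the same columns at prediction time, while test.csv only carries ID, shop_id and item_id. The sketch below shows one way to build the missing columns, assuming the category comes from items.csv and the month is block 34 (November 2015, the block after October 2015 = 33). The helper name `build_test_features` and the toy values are hypothetical.

```python
import pandas as pd

def build_test_features(test, items, feature_columns, block_num=34):
    """Give the test rows the same columns the model was trained on."""
    # Look up each item's category instead of filling it with a constant.
    feats = test.merge(items[["item_id", "item_category_id"]], on="item_id", how="left")
    # November 2015 follows block 33 (October 2015).
    feats["date_block_num"] = block_num
    return feats[list(feature_columns)]

# Toy frames with made-up values; column names follow test.csv and items.csv.
test_toy = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5], "item_id": [10, 11]})
items_toy = pd.DataFrame({"item_id": [10, 11], "item_category_id": [40, 55]})
cols = ["ID", "shop_id", "item_id", "date_block_num", "item_category_id"]

test_features_toy = build_test_features(test_toy, items_toy, cols)
print(test_features_toy)
```

Passing the training column list as `feature_columns` keeps the column order identical between fitting and prediction.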

In [None]:
'''
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
from tqdm import tqdm

# Step 1: Load the data
data_path = "/home/john/ai/kaggle2/data/regression/predict-future-sales/"
sales_train = pd.read_csv(data_path + 'sales_train.csv')
test = pd.read_csv(data_path + 'test.csv')
items = pd.read_csv(data_path + 'items.csv')
item_categories = pd.read_csv(data_path + 'item_categories.csv')
shops = pd.read_csv(data_path + 'shops.csv')
sample_submission = pd.read_csv(data_path + 'sample_submission.csv')

# Step 2: Data Preprocessing
# Convert date to datetime and extract month/year
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year

# Aggregate sales by month, shop, and item
monthly_sales = sales_train.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum'}).reset_index()
monthly_sales.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

# Merge with test set to ensure consistent data structure
train_data = test.merge(monthly_sales, on=['shop_id', 'item_id'], how='left').fillna(0)

# Merge additional features
train_data = train_data.merge(items[['item_id', 'item_category_id']], on='item_id', how='left')
train_data = train_data.merge(item_categories[['item_category_id']], on='item_category_id', how='left')
train_data = train_data.merge(shops[['shop_id']], on='shop_id', how='left')

# Handle missing values
train_data.fillna(0, inplace=True)

# Step 3: Prepare Features and Targets
X = train_data.drop(['item_cnt_month'], axis=1)  # Features
y = train_data['item_cnt_month']  # Target

# Step 4: Split the Data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Check for missing features in test set
train_features = X_train.columns
test_features = test.columns

# Add missing features to test set if necessary (here date_block_num and
# item_category_id, which end up as 0 for every test row)
missing_features = set(train_features) - set(test_features)
for feature in missing_features:
    test[feature] = 0

# Ensure the test set has the same features as the train set
test = test[X_train.columns]

# Step 6: Model Training
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(),
    "XGBoost": xgb.XGBRegressor(),
    "LightGBM": lgb.LGBMRegressor()
}

best_model = None
best_rmse = float('inf')

# Train and evaluate each model with tqdm progress bar
for name, model in tqdm(models.items()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmse = mean_squared_error(y_val, y_pred) ** 0.5  # RMSE; newer scikit-learn no longer accepts squared=False
    print(f"{name} RMSE: {rmse}")
    
    if rmse < best_rmse:
        best_rmse = rmse
        best_model = model

# Step 7: Generate Submission File
# Make predictions on the test set (ensure the same columns in test set)
test_preds = best_model.predict(test)

# Clip the predictions between 0 and 20 as required by the competition rules
test_preds = test_preds.clip(0, 20)

# Create submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "item_cnt_month": test_preds
})

# Save submission file
submission.to_csv(data_path + 'submission.csv', index=False)
'''


The submission scored 1.23040 (RMSE) and ranked #11305 on the leaderboard.

Next, try to improve the score by tuning hyperparameters.

prompt:
I would like to do the same thing, but I'd like to experiment with different hyperparameter permutations for each model. Please perform a grid search for each model and let me know which permutation of hyperparameters resulted in the best score for each model. Then choose the one with the best score and create a submission.csv for it.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
from tqdm import tqdm

# Step 1: Load the data
data_path = "/home/john/ai/kaggle2/data/regression/predict-future-sales/"
sales_train = pd.read_csv(data_path + 'sales_train.csv')
test = pd.read_csv(data_path + 'test.csv')
items = pd.read_csv(data_path + 'items.csv')
item_categories = pd.read_csv(data_path + 'item_categories.csv')
shops = pd.read_csv(data_path + 'shops.csv')
sample_submission = pd.read_csv(data_path + 'sample_submission.csv')

# Step 2: Data Preprocessing
# Convert date to datetime and extract month/year
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year

# Aggregate sales by month, shop, and item
monthly_sales = sales_train.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum'}).reset_index()
monthly_sales.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

# Merge with test set to ensure consistent data structure
train_data = test.merge(monthly_sales, on=['shop_id', 'item_id'], how='left').fillna(0)

# Merge additional features
train_data = train_data.merge(items[['item_id', 'item_category_id']], on='item_id', how='left')
train_data = train_data.merge(item_categories[['item_category_id']], on='item_category_id', how='left')
train_data = train_data.merge(shops[['shop_id']], on='shop_id', how='left')

# Handle missing values
train_data.fillna(0, inplace=True)

# Step 3: Prepare Features and Targets
X = train_data.drop(['item_cnt_month'], axis=1)  # Features
y = train_data['item_cnt_month']  # Target

# Step 4: Split the Data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Check for missing features in test set
train_features = X_train.columns
test_features = test.columns

# Add missing features to test set if necessary (here date_block_num and
# item_category_id, which end up as 0 for every test row)
missing_features = set(train_features) - set(test_features)
for feature in missing_features:
    test[feature] = 0

# Ensure the test set has the same features as the train set
test = test[X_train.columns]

# Step 6: Grid Search for Hyperparameters for each model
model_params = {
    "Linear Regression": {
        "model": LinearRegression(),
        "params": {
            # Linear Regression doesn't have many hyperparameters to tune
        }
    },
    "Random Forest": {
        "model": RandomForestRegressor(),
        "params": {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'bootstrap': [True, False]
        }
    },
    "XGBoost": {
        "model": xgb.XGBRegressor(),
        "params": {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.1, 0.3],
            'subsample': [0.8, 0.9, 1],
            'colsample_bytree': [0.8, 0.9, 1]
        }
    },
    "LightGBM": {
        "model": lgb.LGBMRegressor(),
        "params": {
            'n_estimators': [50, 100, 200],
            'max_depth': [-1, 10, 20],
            'learning_rate': [0.01, 0.1, 0.3],
            'num_leaves': [31, 50, 100],
            'subsample': [0.8, 1],
            'colsample_bytree': [0.8, 1]
        }
    }
}

best_models = {}
best_rmse = float('inf')
best_model = None

# Perform grid search for each model
for model_name, mp in tqdm(model_params.items()):
    print(f"Performing grid search for {model_name}...")
    
    grid = GridSearchCV(mp["model"], mp["params"], cv=3, scoring='neg_root_mean_squared_error', verbose=1, n_jobs=4)
    grid.fit(X_train, y_train)
    
    # Record best model and hyperparameters
    best_models[model_name] = {
        'best_score': -grid.best_score_,  # Convert negative RMSE to positive
        'best_params': grid.best_params_,
        'best_model': grid.best_estimator_
    }
    
    print(f"{model_name} best score: {best_models[model_name]['best_score']}")
    print(f"{model_name} best params: {best_models[model_name]['best_params']}")
    
    # Track the best model based on RMSE
    if best_models[model_name]['best_score'] < best_rmse:
        best_rmse = best_models[model_name]['best_score']
        best_model = best_models[model_name]['best_model']

# Step 7: Generate Submission File
# Make predictions on the test set using the best model
test_preds = best_model.predict(test)

# Clip the predictions between 0 and 20 as required by the competition rules
test_preds = test_preds.clip(0, 20)

# Create submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "item_cnt_month": test_preds
})

# Save submission file
submission.to_csv(data_path + 'submission.csv', index=False)

print(f"Best model used for submission: {best_model}")


  0%|                                                                                                                                   | 0/4 [00:00<?, ?it/s]

Performing grid search for Linear Regression...
Fitting 3 folds for each of 1 candidates, totalling 3 fits


 25%|██████████████████████████████▊                                                                                            | 1/4 [00:01<00:03,  1.18s/it]

Linear Regression best score: 10.243375357204238
Linear Regression best params: {}
Performing grid search for Random Forest...
Fitting 3 folds for each of 162 candidates, totalling 486 fits


 50%|██████████████████████████████████████████████████████████▌                                                          | 2/4 [4:33:24<5:21:38, 9649.45s/it]

Random Forest best score: 4.9594025922828555
Random Forest best params: {'bootstrap': True, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Performing grid search for XGBoost...
Fitting 3 folds for each of 243 candidates, totalling 729 fits


 75%|███████████████████████████████████████████████████████████████████████████████████████▊                             | 3/4 [4:39:30<1:30:10, 5410.47s/it]

XGBoost best score: 5.203037560432357
XGBoost best params: {'colsample_bytree': 1, 'learning_rate': 0.3, 'max_depth': 7, 'n_estimators': 200, 'subsample': 1}
Performing grid search for LightGBM...
Fitting 3 folds for each of 324 candidates, totalling 972 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003406 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 649
[LightGBM] [Info] Number of data points in the train set: 374909, number of used features: 5
[LightGBM] [Info] Start training from score 2.308856
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.027724 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 649
[LightGBM] [Info] Number of data points in the train set: 374909, number of used features: 5
[LightGBM] [Info] Start training from score 2.2



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.032872 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 649
[LightGBM] [Info] Number of data points in the train set: 374909, number of used features: 5
[LightGBM] [Info] Start training from score 2.308856
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.035880 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 649
[LightGBM] [Info] Number of data points in the train set: 374910, number of used features: 5
[LightGBM] [Info] Start training from score 2.318343
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.027612 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 649
[LightGBM] [Info] Number of data points in the train set: 3

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [5:37:45<00:00, 5066.35s/it]

LightGBM best score: 5.231673857672423
LightGBM best params: {'colsample_bytree': 0.8, 'learning_rate': 0.3, 'max_depth': 20, 'n_estimators': 200, 'num_leaves': 100, 'subsample': 0.8}





Best model used for submission: RandomForestRegressor(max_depth=20)

This one scored 1.23978, slightly worse than the 1.23040 from the run with default parameters.
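
The candidate counts in the log above (162, 243 and 324) follow directly from the grid sizes: each is the product of the number of values per hyperparameter, and the number of fits is that product times the 3 CV folds. The whole run took over five and a half hours according to the progress bar. A quick sketch for estimating the cost of a grid before launching it, using the same grids as the cell above:

```python
from math import prod

# Grids copied from the grid-search cell; only the list lengths matter here.
grids = {
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "bootstrap": [True, False],
    },
    "XGBoost": {
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.1, 0.3],
        "subsample": [0.8, 0.9, 1],
        "colsample_bytree": [0.8, 0.9, 1],
    },
    "LightGBM": {
        "n_estimators": [50, 100, 200],
        "max_depth": [-1, 10, 20],
        "learning_rate": [0.01, 0.1, 0.3],
        "num_leaves": [31, 50, 100],
        "subsample": [0.8, 1],
        "colsample_bytree": [0.8, 1],
    },
}

cv_folds = 3
counts = {name: prod(len(values) for values in grid.values()) for name, grid in grids.items()}
for name, candidates in counts.items():
    print(f"{name}: {candidates} candidates, {candidates * cv_folds} fits")
```

Dropping a single three-valued hyperparameter from a grid cuts its number of fits to a third, which is one way to shorten a search of this size.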

prompt:
I would like to do the exact same thing as the last code block, but I would like to use 4 other models that are specialized for regression analysis.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from catboost import CatBoostRegressor
from tqdm import tqdm

# Step 1: Load the data
data_path = "/home/john/ai/kaggle2/data/regression/predict-future-sales/"
sales_train = pd.read_csv(data_path + 'sales_train.csv')
test = pd.read_csv(data_path + 'test.csv')
items = pd.read_csv(data_path + 'items.csv')
item_categories = pd.read_csv(data_path + 'item_categories.csv')
shops = pd.read_csv(data_path + 'shops.csv')
sample_submission = pd.read_csv(data_path + 'sample_submission.csv')

# Step 2: Data Preprocessing
# Convert date to datetime and extract month/year
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year

# Aggregate sales by month, shop, and item
monthly_sales = sales_train.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum'}).reset_index()
monthly_sales.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

# Merge with test set to ensure consistent data structure
train_data = test.merge(monthly_sales, on=['shop_id', 'item_id'], how='left').fillna(0)

# Merge additional features
train_data = train_data.merge(items[['item_id', 'item_category_id']], on='item_id', how='left')
train_data = train_data.merge(item_categories[['item_category_id']], on='item_category_id', how='left')
train_data = train_data.merge(shops[['shop_id']], on='shop_id', how='left')

# Handle missing values
train_data.fillna(0, inplace=True)

# Step 3: Prepare Features and Targets
X = train_data.drop(['item_cnt_month'], axis=1)  # Features
y = train_data['item_cnt_month']  # Target

# Step 4: Split the Data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Check for missing features in test set
train_features = X_train.columns
test_features = test.columns

# Add missing features to test set if necessary (here date_block_num and
# item_category_id, which end up as 0 for every test row)
missing_features = set(train_features) - set(test_features)
for feature in missing_features:
    test[feature] = 0

# Ensure the test set has the same features as the train set
test = test[X_train.columns]

# Step 6: Grid Search for Hyperparameters for each new model
model_params = {
    "Gradient Boosting Regressor": {
        "model": GradientBoostingRegressor(),
        "params": {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.1, 0.3],
            'subsample': [0.8, 1]
        }
    },
    "CatBoost Regressor": {
        "model": CatBoostRegressor(verbose=0),
        "params": {
            'iterations': [100, 200],
            'depth': [4, 6, 10],
            'learning_rate': [0.01, 0.1, 0.3],
            'l2_leaf_reg': [3, 5, 7]
        }
    },
    "ElasticNet": {
        "model": ElasticNet(),
        "params": {
            'alpha': [0.1, 1.0, 10.0],
            'l1_ratio': [0.2, 0.5, 0.8],
            'max_iter': [1000, 2000]
        }
    },
    "HistGradientBoosting Regressor": {
        "model": HistGradientBoostingRegressor(),
        "params": {
            'max_iter': [100, 200],
            'max_depth': [None, 10, 20],
            'learning_rate': [0.01, 0.1, 0.3],
            'l2_regularization': [0.0, 1.0, 10.0]
        }
    }
}

best_models = {}
best_rmse = float('inf')
best_model = None

# Perform grid search for each model
for model_name, mp in tqdm(model_params.items()):
    print(f"Performing grid search for {model_name}...")
    
    grid = GridSearchCV(mp["model"], mp["params"], cv=3, scoring='neg_root_mean_squared_error', verbose=1, n_jobs=2)
    grid.fit(X_train, y_train)
    
    # Record best model and hyperparameters
    best_models[model_name] = {
        'best_score': -grid.best_score_,  # Convert negative RMSE to positive
        'best_params': grid.best_params_,
        'best_model': grid.best_estimator_
    }
    
    print(f"{model_name} best score: {best_models[model_name]['best_score']}")
    print(f"{model_name} best params: {best_models[model_name]['best_params']}")
    
    # Track the best model based on RMSE
    if best_models[model_name]['best_score'] < best_rmse:
        best_rmse = best_models[model_name]['best_score']
        best_model = best_models[model_name]['best_model']

# Step 7: Generate Submission File
# Make predictions on the test set using the best model
test_preds = best_model.predict(test)

# Clip the predictions between 0 and 20 as required by the competition rules
test_preds = test_preds.clip(0, 20)

# Create submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "item_cnt_month": test_preds
})

# Save submission file
submission.to_csv(data_path + 'submission.csv', index=False)

print(f"Best model used for submission: {best_model}")


  0%|                                                                                                                                   | 0/4 [00:00<?, ?it/s]

Performing grid search for Gradient Boosting Regressor...
Fitting 3 folds for each of 54 candidates, totalling 162 fits


prompt:
I would like to have more granular progress for each model, so give me an output of each model's permutations and their results, as well as the final best one that is used to generate the submission.csv.
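
GridSearchCV only exposes `cv_results_` after `fit` returns, so per-permutation results read from it can only be printed once a model's whole search has finished. If results are wanted as each candidate completes, one option is to walk the grid manually. The sketch below uses only the standard library. The names `iterate_grid`, `search_with_progress` and `fake_rmse` are hypothetical, and the scorer is a stand-in, not a real model.

```python
import itertools
import time

def iterate_grid(param_grid):
    """Yield every combination of a {name: [values]} grid as a dict."""
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def search_with_progress(param_grid, evaluate):
    """Score each candidate in turn and report it as soon as it finishes."""
    candidates = list(iterate_grid(param_grid))
    best_params, best_score = None, float("inf")
    for i, params in enumerate(candidates, start=1):
        started = time.perf_counter()
        score = evaluate(params)  # hypothetical callback returning an RMSE
        elapsed = time.perf_counter() - started
        print(f"[{i}/{len(candidates)}] {params} RMSE={score:.4f} ({elapsed:.1f}s)")
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in scorer for illustration only; a real one would fit a model with
# these parameters and return its cross-validated RMSE.
def fake_rmse(params):
    return abs(params["learning_rate"] - 0.1) + 1.0 / params["n_estimators"]

best_params, best_score = search_with_progress(
    {"learning_rate": [0.01, 0.1, 0.3], "n_estimators": [50, 100]},
    fake_rmse,
)
print("best:", best_params, best_score)
```

The trade-off is that a manual loop gives up the parallelism that `n_jobs` provides. Raising GridSearchCV's `verbose` level is a lighter alternative that logs each fit as it runs, though how that output appears in a notebook with parallel workers may vary.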


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from catboost import CatBoostRegressor
from tqdm import tqdm

# Step 1: Load the data
data_path = "/home/john/ai/kaggle2/data/regression/predict-future-sales/"
sales_train = pd.read_csv(data_path + 'sales_train.csv')
test = pd.read_csv(data_path + 'test.csv')
items = pd.read_csv(data_path + 'items.csv')
item_categories = pd.read_csv(data_path + 'item_categories.csv')
shops = pd.read_csv(data_path + 'shops.csv')
sample_submission = pd.read_csv(data_path + 'sample_submission.csv')

# Step 2: Data Preprocessing
# Convert date to datetime and extract month/year
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year

# Aggregate sales by month, shop, and item
monthly_sales = sales_train.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum'}).reset_index()
monthly_sales.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

# Merge with test set to ensure consistent data structure
train_data = test.merge(monthly_sales, on=['shop_id', 'item_id'], how='left').fillna(0)

# Merge additional features
train_data = train_data.merge(items[['item_id', 'item_category_id']], on='item_id', how='left')
train_data = train_data.merge(item_categories[['item_category_id']], on='item_category_id', how='left')
train_data = train_data.merge(shops[['shop_id']], on='shop_id', how='left')

# Handle missing values
train_data.fillna(0, inplace=True)

# Step 3: Prepare Features and Targets
X = train_data.drop(['item_cnt_month'], axis=1)  # Features
y = train_data['item_cnt_month']  # Target

# Step 4: Split the Data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Check for missing features in test set
train_features = X_train.columns
test_features = test.columns

# Add missing features to test set if necessary (here date_block_num and
# item_category_id, which end up as 0 for every test row)
missing_features = set(train_features) - set(test_features)
for feature in missing_features:
    test[feature] = 0

# Ensure the test set has the same features as the train set
test = test[X_train.columns]

# Step 6: Grid Search for Hyperparameters for each new model
model_params = {
    "Gradient Boosting Regressor": {
        "model": GradientBoostingRegressor(),
        "params": {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.1, 0.3],
            'subsample': [0.8, 1]
        }
    },
    "CatBoost Regressor": {
        "model": CatBoostRegressor(verbose=0),
        "params": {
            'iterations': [100, 200],
            'depth': [4, 6, 10],
            'learning_rate': [0.01, 0.1, 0.3],
            'l2_leaf_reg': [3, 5, 7]
        }
    },
    "ElasticNet": {
        "model": ElasticNet(),
        "params": {
            'alpha': [0.1, 1.0, 10.0],
            'l1_ratio': [0.2, 0.5, 0.8],
            'max_iter': [1000, 2000]
        }
    },
    "HistGradientBoosting Regressor": {
        "model": HistGradientBoostingRegressor(),
        "params": {
            'max_iter': [100, 200],
            'max_depth': [None, 10, 20],
            'learning_rate': [0.01, 0.1, 0.3],
            'l2_regularization': [0.0, 1.0, 10.0]
        }
    }
}

best_models = {}
best_rmse = float('inf')
best_model = None

# Perform grid search for each model and print granular progress
for model_name, mp in tqdm(model_params.items()):
    print(f"\nPerforming grid search for {model_name}...")
    
    grid = GridSearchCV(mp["model"], mp["params"], cv=3, scoring='neg_root_mean_squared_error', verbose=0, n_jobs=2)
    
    # Fit the model and extract detailed results
    grid.fit(X_train, y_train)
    
    # Output all results for every permutation
    print(f"\nResults for {model_name}:")
    for idx, params in enumerate(grid.cv_results_['params']):
        mean_test_score = -grid.cv_results_['mean_test_score'][idx]  # Convert to positive RMSE
        print(f"Permutation {idx+1}: {params}, RMSE: {mean_test_score}")
    
    # Record best model and hyperparameters
    best_models[model_name] = {
        'best_score': -grid.best_score_,  # Convert negative RMSE to positive
        'best_params': grid.best_params_,
        'best_model': grid.best_estimator_
    }
    
    print(f"\nBest for {model_name}:")
    print(f"RMSE: {best_models[model_name]['best_score']}")
    print(f"Best parameters: {best_models[model_name]['best_params']}")
    
    # Track the best model based on RMSE
    if best_models[model_name]['best_score'] < best_rmse:
        best_rmse = best_models[model_name]['best_score']
        best_model = best_models[model_name]['best_model']

# Step 7: Generate Submission File
# Make predictions on the test set using the best model
test_preds = best_model.predict(test)

# Clip the predictions between 0 and 20 as required by the competition rules
test_preds = test_preds.clip(0, 20)

# Create submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "item_cnt_month": test_preds
})

# Save submission file
submission.to_csv(data_path + 'submission.csv', index=False)

print(f"\nBest model used for submission: {best_model}")
print(f"\nBest overall RMSE: {best_rmse}")


  0%|                                                                                                                                   | 0/4 [00:00<?, ?it/s]


Performing grid search for Gradient Boosting Regressor...


 25%|█████████████████████████████▎                                                                                       | 1/4 [1:13:41<3:41:03, 4421.14s/it]


Results for Gradient Boosting Regressor:
Permutation 1: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.8}, RMSE: 9.083251026477036
Permutation 2: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50, 'subsample': 1}, RMSE: 9.088565693263703
Permutation 3: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}, RMSE: 8.549367767171502
Permutation 4: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1}, RMSE: 8.566128177033638
Permutation 5: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}, RMSE: 8.25054838546299
Permutation 6: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1}, RMSE: 8.28938793217935
Permutation 7: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50, 'subsample': 0.8}, RMSE: 8.342469254388762
Permutation 8: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50, 'subsample': 1}, RMSE: 8.374923073196337
Permutation 9: {'lea

 50%| 2/4 [1:20:41<1:08:55, 2067.80s/it]


Results for CatBoost Regressor:
Permutation 1: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 3, 'learning_rate': 0.01}, RMSE: 8.777989119314102
Permutation 2: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 3, 'learning_rate': 0.1}, RMSE: 7.275562601100202
Permutation 3: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 3, 'learning_rate': 0.3}, RMSE: 6.550030597424609
Permutation 4: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 5, 'learning_rate': 0.01}, RMSE: 8.779845537325913
Permutation 5: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 5, 'learning_rate': 0.1}, RMSE: 7.259931422290021
Permutation 6: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 5, 'learning_rate': 0.3}, RMSE: 6.602757969197564
Permutation 7: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 7, 'learning_rate': 0.01}, RMSE: 8.78289616800469
Permutation 8: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 7, 'learning_rate': 0.1}, RMSE: 7.424436457004595
Permutation 9: {'depth': 4, 'iterations': 100, 'l2_leaf_reg': 7, 'lea

 75%| 3/4 [1:20:48<18:46, 1126.65s/it]


Results for ElasticNet:
Permutation 1: {'alpha': 0.1, 'l1_ratio': 0.2, 'max_iter': 1000}, RMSE: 10.243375569424545
Permutation 2: {'alpha': 0.1, 'l1_ratio': 0.2, 'max_iter': 2000}, RMSE: 10.243375569424545
Permutation 3: {'alpha': 0.1, 'l1_ratio': 0.5, 'max_iter': 1000}, RMSE: 10.243377260426847
Permutation 4: {'alpha': 0.1, 'l1_ratio': 0.5, 'max_iter': 2000}, RMSE: 10.243377260426847
Permutation 5: {'alpha': 0.1, 'l1_ratio': 0.8, 'max_iter': 1000}, RMSE: 10.243380471160686
Permutation 6: {'alpha': 0.1, 'l1_ratio': 0.8, 'max_iter': 2000}, RMSE: 10.243380471160686
Permutation 7: {'alpha': 1.0, 'l1_ratio': 0.2, 'max_iter': 1000}, RMSE: 10.243415587285247
Permutation 8: {'alpha': 1.0, 'l1_ratio': 0.2, 'max_iter': 2000}, RMSE: 10.243415587285247
Permutation 9: {'alpha': 1.0, 'l1_ratio': 0.5, 'max_iter': 1000}, RMSE: 10.24360237448141
Permutation 10: {'alpha': 1.0, 'l1_ratio': 0.5, 'max_iter': 2000}, RMSE: 10.24360237448141
Permutation 11: {'alpha': 1.0, 'l1_ratio': 0.8, 'max_iter': 1000},

100%| 4/4 [1:26:38<00:00, 1299.53s/it]


Results for HistGradientBoosting Regressor:
Permutation 1: {'l2_regularization': 0.0, 'learning_rate': 0.01, 'max_depth': None, 'max_iter': 100}, RMSE: 7.395456645048995
Permutation 2: {'l2_regularization': 0.0, 'learning_rate': 0.01, 'max_depth': None, 'max_iter': 200}, RMSE: 6.915582902460227
Permutation 3: {'l2_regularization': 0.0, 'learning_rate': 0.01, 'max_depth': 10, 'max_iter': 100}, RMSE: 7.396052867777897
Permutation 4: {'l2_regularization': 0.0, 'learning_rate': 0.01, 'max_depth': 10, 'max_iter': 200}, RMSE: 6.695244197690471
Permutation 5: {'l2_regularization': 0.0, 'learning_rate': 0.01, 'max_depth': 20, 'max_iter': 100}, RMSE: 7.356747704752841
Permutation 6: {'l2_regularization': 0.0, 'learning_rate': 0.01, 'max_depth': 20, 'max_iter': 200}, RMSE: 6.702371049480921
Permutation 7: {'l2_regularization': 0.0, 'learning_rate': 0.1, 'max_depth': None, 'max_iter': 100}, RMSE: 5.7285128058324375
Permutation 8: {'l2_regularization': 0.0, 'learning_rate': 0.1, 'max_depth': None





Best model used for submission: GradientBoostingRegressor(learning_rate=0.3, max_depth=7, n_estimators=200,
                          subsample=0.8)

Best overall RMSE: 4.913338708252161
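
As a sanity check on the submission step in the code above, here is a minimal sketch that builds a frame in the ID,item_cnt_month format from the competition prompt and clips it to [0, 20]. The prediction values and the helper name make_submission are made up for illustration; they are not outputs of the models above.

```python
import numpy as np
import pandas as pd


def make_submission(ids, preds, low=0.0, high=20.0):
    """Hypothetical helper: clip predictions and return a submission frame."""
    clipped = np.clip(np.asarray(preds, dtype=float), low, high)
    return pd.DataFrame({"ID": list(ids), "item_cnt_month": clipped})


# Made-up predictions, including values outside the [0, 20] range.
example = make_submission(range(4), [-1.3, 0.5, 7.25, 42.0])

# Same layout as sample_submission.csv: header row, then one row per ID.
print(example.to_csv(index=False))
```

Keeping the clip inside one small function makes it easy to confirm that no out-of-range value reaches submission.csv, whichever model produced the predictions.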


Results:
Best for Gradient Boosting Regressor:
RMSE: 4.913338708252161
Best parameters: {'learning_rate': 0.3, 'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}

Best for CatBoost Regressor:
RMSE: 5.1890807530591205
Best parameters: {'depth': 10, 'iterations': 200, 'l2_leaf_reg': 3, 'learning_rate': 0.3}

Best for ElasticNet:
RMSE: 10.243375569424545
Best parameters: {'alpha': 0.1, 'l1_ratio': 0.2, 'max_iter': 1000}

Best for HistGradientBoosting Regressor:
RMSE: 5.348628068537356
Best parameters: {'l2_regularization': 1.0, 'learning_rate': 0.3, 'max_depth': 20, 'max_iter': 200}

Best public score: 4.82933
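
The per-model cross-validation results listed above can be ranked with a few lines of Python. This is only a sketch: the numbers are copied from the "Best for ..." entries, and the variable names (cv_rmse, ranking) are illustrative. Linear Regression is not included because its score does not appear in these results.

```python
# Best cross-validated RMSE per model, copied from the results above.
cv_rmse = {
    "Gradient Boosting Regressor": 4.913338708252161,
    "CatBoost Regressor": 5.1890807530591205,
    "ElasticNet": 10.243375569424545,
    "HistGradientBoosting Regressor": 5.348628068537356,
}

# Sort ascending, since a lower RMSE is better.
ranking = sorted(cv_rmse.items(), key=lambda kv: kv[1])
best_name, best_rmse = ranking[0]

for name, rmse in ranking:
    # Show each model's gap to the best one, in RMSE units.
    print(f"{name:32s} RMSE {rmse:8.4f}  (+{rmse - best_rmse:.4f})")
```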

Conclusion:
The best approach is to use the Linear Regression model, which is the one I ran first.