## Predicting unit sales from grocery store transaction data

We will create a model to predict unit sales using grocery store transaction data. We will try several models and compare them using root mean squared error (RMSE). We want a model whose training and test RMSE are both lower than the standard deviation of the target variable, unit sales, and close to each other, which indicates that the model is not overfitting.
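
The acceptance criteria stated above can be sketched as a small check. This is a minimal illustration on toy data, not part of the original notebook: `rmse`, `check_model`, and the `max_gap_ratio` tolerance are hypothetical names, and the notebook itself does not define how close "close together" has to be.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def check_model(y_train, pred_train, y_test, pred_test, max_gap_ratio=0.10):
    """Hypothetical helper applying the two criteria from the introduction.

    max_gap_ratio is an assumed tolerance for "close together".
    """
    train_rmse = rmse(y_train, pred_train)
    test_rmse = rmse(y_test, pred_test)
    # Predicting the test mean for every row gives an RMSE equal to the
    # population standard deviation of y_test, so that is the baseline to beat.
    baseline = float(np.std(y_test))
    return {
        "train_rmse": train_rmse,
        "test_rmse": test_rmse,
        "baseline_std": baseline,
        "beats_baseline": train_rmse < baseline and test_rmse < baseline,
        "gap_ok": abs(test_rmse - train_rmse) <= max_gap_ratio * test_rmse,
    }

# Toy data only, to show the call; these are not the notebook's values.
rng = np.random.default_rng(0)
y_tr = rng.normal(10, 5, size=1000)
y_te = rng.normal(10, 5, size=500)
report = check_model(y_tr, y_tr + rng.normal(0, 2, 1000),
                     y_te, y_te + rng.normal(0, 2, 500))
print(report)
```

In the notebook, the same check could be called with `y_train`, `y_pred_train`, `y_test`, and `y_pred_test` after each model is fitted.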

In [1]:
import pandas as pd
import numpy as np

import dask.dataframe as dd

# Define the expected data types for columns
dtype = {
    'store_nbr': 'int64',
    'unit_sales': 'float64'
}

# Read the CSV file into a Dask DataFrame with specified data types
df = dd.read_csv('all_info.csv', dtype=dtype)

# Materialize the Dask DataFrame as an in-memory pandas DataFrame
df_merged = df.compute()

## Create X and y variables, with y being unit sales, then create the training and test sets.

In [13]:

from sklearn.model_selection import train_test_split
X=df_merged.drop(columns=['unit_sales','Unnamed: 0'])
y=df_merged['unit_sales']
X=X[~y.isna()]
y=y.dropna()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)


## We try a Random Forest Regressor as a first model. We use RandomizedSearchCV to choose the best hyperparameters. Subsampling for quicker tuning is left commented out in this cell, so the search runs on the full training set.

In [4]:
print(X_train.shape)
print(y_train.shape)


(417722, 582)
(417722,)


In [9]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
# Optional: subsample 50% of the training data for faster tuning (disabled here)
# train_data = pd.concat([X_train, y_train], axis=1)
# train_data_sampled = train_data.sample(frac=0.5, random_state=42)
# X_train_sampled = train_data_sampled.drop(columns=[y_train.name])
# y_train_sampled = train_data_sampled[y_train.name]

# Use the full training set, dropping two columns that are excluded from the model
X_train_sampled = X_train.drop(columns=['dcoilwtico','transactions'])
y_train_sampled = y_train

# Check the sizes to ensure they match
print(f"X_train_sampled size: {X_train_sampled.shape}")
print(f"y_train_sampled size: {y_train_sampled.shape}")


# Define a smaller, more focused parameter distribution
param_dist = {
    'n_estimators': [100, 150, 200],  # Fewer trees for quicker iterations
    'max_depth': [5, 10],  # Limit tree depth
    'min_samples_split': [5, 10],  # Focus on slightly larger splits
    'min_samples_leaf': [2, 4],  # Prevent very small leaves
    'max_features': ['sqrt'],  # Stick to 'sqrt' for speed
    'bootstrap': [True]  # Keep bootstrapping enabled
}

# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Use RandomizedSearchCV for hyperparameter tuning
rf_random = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=param_dist, 
    n_iter=10,  # Reduce the number of iterations
    cv=3,  # Reduce the number of cross-validation folds
    verbose=2, 
    random_state=42, 
    n_jobs=-1,  # Ensure parallel processing
    scoring='neg_root_mean_squared_error'
)

# Fit the RandomizedSearchCV model on the sampled data
rf_random.fit(X_train_sampled, y_train_sampled)

# Get the best model
best_rf = rf_random.best_estimator_

# Predict on the full training and test sets
y_pred_train = best_rf.predict(X_train.drop(columns=['dcoilwtico','transactions']))
y_pred_test = best_rf.predict(X_test.drop(columns=['dcoilwtico','transactions']))

# Calculate RMSE for training and test sets
# np.sqrt of the MSE works across scikit-learn versions (squared=False is deprecated since 1.4)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Print results
print(f"Best Hyperparameters: {rf_random.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


X_train_sampled size: (417722, 580)
y_train_sampled size: (417722,)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Hyperparameters: {'n_estimators': 150, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'bootstrap': True}
Training RMSE: 18.584561910524652
Test RMSE: 20.92317119424786




## Adjust the model to reduce underfitting

1. Increase the Number of Estimators:
More trees can lead to a better generalization of the model, although this comes at the cost of increased training time.

2. Expand the Hyperparameter Grid:
Experiment with a wider range of values for some parameters, such as max_depth and min_samples_split. Adding more flexibility to the search space may allow for discovering better-performing configurations.

3. Optimize Cross-Validation (CV):
Use a higher number of CV folds for more reliable results (but keep in mind that this will increase computation time).

4. Tune min_samples_split and min_samples_leaf:
These parameters trade flexibility for regularization: smaller values let the trees fit finer patterns, while slightly larger values can help the model generalize better to the test set.

5. Feature Engineering:
Check if the features dcoilwtico and transactions should be dropped or engineered differently. Removing features without testing their significance may lead to suboptimal performance.

6. Experiment with Other Sampling Fractions:
If the data is subsampled for tuning, a small fraction such as 50% may miss valuable information. Try a higher fraction (e.g., 70-80%).

Key Adjustments (proposed; the cell below instead runs a reduced search with n_iter=5 and cv=3 to save time):

Larger range of n_estimators and max_depth values to allow more complex models.

Adding min_samples_split=2 and min_samples_leaf=1 to allow more granular splits.

Increasing the number of iterations in RandomizedSearchCV to 20 for a more thorough search.

Setting cross-validation to cv=5 for more reliable performance estimates.
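
The wider search described in this list could look roughly like the sketch below. It is an illustration under assumptions: `build_wider_search` is a hypothetical helper, the ranges are placeholders rather than tuned values, and the demo runs on a small synthetic dataset with a reduced `n_iter` and `cv` so it finishes quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

def build_wider_search(n_iter=20, cv=5, n_jobs=-1, random_state=42):
    """Hypothetical helper: a wider random search than the reduced grid.

    The ranges below are illustrative assumptions, not recommended values.
    """
    param_dist = {
        "n_estimators": randint(100, 501),    # integers 100..500 (upper bound is exclusive)
        "max_depth": [10, 20, 30, None],      # None lets trees grow until leaves are pure
        "min_samples_split": randint(2, 11),  # includes 2 for more granular splits
        "min_samples_leaf": randint(1, 5),    # includes 1
        "max_features": ["sqrt"],
    }
    return RandomizedSearchCV(
        estimator=RandomForestRegressor(random_state=random_state),
        param_distributions=param_dist,
        n_iter=n_iter,
        cv=cv,
        scoring="neg_root_mean_squared_error",
        random_state=random_state,
        n_jobs=n_jobs,
    )

# Small synthetic run so the sketch finishes quickly
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
search = build_wider_search(n_iter=2, cv=3, n_jobs=1)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

Sampling integers from `randint` instead of listing a few fixed values lets the same number of iterations cover more of the range, which is the main reason to prefer a randomized search over a grid here.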

In [15]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

# Use the full training set (no subsampling), dropping two columns excluded from the model
X_train_sampled = X_train.drop(columns=['dcoilwtico', 'transactions'])
y_train_sampled = y_train

# Reduced hyperparameter grid for quicker tuning
param_dist = {
    'n_estimators': [50, 100, 150],  # Fewer trees for quicker iterations
    'max_depth': [5, 10],  # Limit tree depth to prevent overfitting and speed up training
    'min_samples_split': [5, 10],  # Slightly larger splits
    'min_samples_leaf': [2, 4],  # Avoid very small leaves
    'max_features': ['sqrt'],  # Stick to 'sqrt' for speed
    'bootstrap': [True]  # Keep bootstrapping enabled
}

# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Use RandomizedSearchCV with reduced iterations and CV folds
rf_random = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=param_dist, 
    n_iter=5,  # Reduce the number of iterations
    cv=3,  # Use fewer CV folds
    verbose=2, 
    random_state=42, 
    n_jobs=-1,  # Ensure parallel processing
    scoring='neg_root_mean_squared_error'
)

# Fit the RandomizedSearchCV model on the sampled data
rf_random.fit(X_train_sampled, y_train_sampled)

# Get the best model
best_rf = rf_random.best_estimator_

# Predict on the full training and test sets
y_pred_train = best_rf.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = best_rf.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Print results
print(f"Best Hyperparameters: {rf_random.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Fitting 3 folds for each of 5 candidates, totalling 15 fits
Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'bootstrap': True}
Training RMSE: 18.587528815929943
Test RMSE: 20.92796610469261




[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=  43.7s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=  44.0s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=  44.7s
[CV] END bootstrap=True, max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=50; total time=  16.6s
[CV] END bootstrap=True, max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=50; total time=  21.6s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=50; total time=  14.5s
[CV] END bootstrap=True, max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=150; total time=  38.9s
[CV] END bootstrap=True, max_depth=10, max_features

## Try to improve the model by changing parameters
## Key Modifications:
Add More Depth for Complexity: Increasing the max_depth can allow the model to capture more complex patterns.

Increase n_estimators Gradually: A slight increase in the number of trees helps improve performance while balancing training time.

Fine-Tune the min_samples_split and min_samples_leaf: Making these values smaller can help the model capture more variance in the data.

Focus on Feature Selection: Remove highly correlated or low-importance features to avoid overfitting and improve model generalization.

Enable OOB (Out-of-Bag) Scoring: Use Out-of-Bag (OOB) error to validate performance during training and adjust accordingly.

Use Full Dataset for Final Predictions: After tuning, apply the model to the full dataset for accurate RMSE evaluation.
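
As a rough illustration of the out-of-bag idea mentioned above, the sketch below fits a forest with `oob_score=True` on synthetic data and derives an OOB RMSE from `oob_prediction_`. The dataset and settings are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=200,
    max_features="sqrt",
    bootstrap=True,   # OOB estimates require bootstrap sampling
    oob_score=True,
    random_state=42,
)
rf.fit(X_demo, y_demo)

# oob_score_ is R^2 for regressors; oob_prediction_ holds each row's prediction
# averaged over the trees that did not see that row during training.
oob_rmse = float(np.sqrt(np.mean((y_demo - rf.oob_prediction_) ** 2)))
train_rmse = float(np.sqrt(np.mean((y_demo - rf.predict(X_demo)) ** 2)))
print(f"OOB R^2: {rf.oob_score_:.3f}")
print(f"OOB RMSE: {oob_rmse:.2f} | in-sample RMSE: {train_rmse:.2f}")
```

Because the OOB RMSE is computed on rows each tree did not train on, it behaves like a validation estimate and is expected to sit well above the in-sample RMSE.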


In [18]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

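# Subsample for hyperparameter tuning (30% to reduce time).
# X and y are sampled separately; the rows stay aligned only because both calls
# use the same frac and random_state on inputs of equal length.
X_train_sampled = X_train.drop(columns=['dcoilwtico', 'transactions']).sample(frac=0.3, random_state=42)
y_train_sampled = y_train.sample(frac=0.3, random_state=42)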

# Updated hyperparameter grid
param_dist = {
    'n_estimators': [150, 200, 250],  # More trees to reduce variance
    'max_depth': [15, 20, 25],  # Deeper trees to capture complexity
    'min_samples_split': [2, 5],  # Smaller splits to capture finer details
    'min_samples_leaf': [1, 2],  # Smaller leaf nodes to increase model flexibility
    'max_features': ['sqrt'],  # Keep sqrt for speed
    'bootstrap': [True],  # Enable bootstrap sampling
    'oob_score': [True]  # Enable Out-of-Bag scoring to validate model during training
}

# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Use RandomizedSearchCV for hyperparameter tuning
rf_random = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=param_dist, 
    n_iter=10,  # More iterations for better tuning
    cv=3,  # Cross-validation folds
    verbose=2, 
    random_state=42, 
    n_jobs=-1,  # Use all available CPU cores
    scoring='neg_root_mean_squared_error'
)

# Fit RandomizedSearchCV on sampled training data
rf_random.fit(X_train_sampled, y_train_sampled)

# Get the best model from tuning
best_rf = rf_random.best_estimator_

# Predict on full training and test sets
y_pred_train = best_rf.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = best_rf.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Output the best hyperparameters and RMSE results
print(f"Best Hyperparameters: {rf_random.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")



Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END bootstrap=True, max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   4.7s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   5.9s
[CV] END bootstrap=True, max_depth=25, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=250, oob_score=True; total time= 1.7min




[CV] END bootstrap=True, max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   4.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=50; total time=   3.7s
[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200, oob_score=True; total time=  56.2s
[CV] END bootstrap=True, max_depth=25, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=250, oob_score=True; total time= 1.9min
Best Hyperparameters: {'oob_score': True, 'n_estimators': 250, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 15, 'bootstrap': True}
Training RMSE: 18.144720523054666
Test RMSE: 20.914435850377977




## Strategies to Prevent Overfitting:
Limit Tree Depth: Reduce the maximum depth of trees to prevent them from growing too complex.

Increase Minimum Samples for Splitting and Leaves: This forces the trees to generalize by ensuring that splits occur only when there is enough data.

Reduce Number of Features Considered for Splitting: Restrict the number of features available for splitting at each node, preventing overly specific splits.

Bootstrap Sampling: Continue using bootstrap sampling so that each tree sees a different resample of the data; averaging these less correlated trees lowers the variance of the ensemble.

Tune Hyperparameters for Regularization: Use cross-validation and a smaller set of more focused hyperparameters for RandomizedSearchCV.

Early Stopping: scikit-learn's random forest has no built-in early stopping. A rough equivalent is to add trees in batches with warm_start=True and stop once the out-of-bag error stops improving.
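
One way to approximate early stopping for a random forest is to grow it in batches and watch the out-of-bag error. The sketch below is an assumption-laden illustration on synthetic data: `grow_until_plateau` and its `tol` threshold are hypothetical, and the notebook does not use this approach.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def grow_until_plateau(X, y, start=50, step=50, max_trees=300, tol=0.001):
    """Hypothetical helper: add trees in batches, stop when OOB RMSE plateaus.

    tol is an assumed relative-improvement threshold, not a recommended value.
    """
    rf = RandomForestRegressor(
        n_estimators=start, warm_start=True, oob_score=True,
        bootstrap=True, max_features="sqrt", random_state=42,
    )
    history = []
    best = np.inf
    for n in range(start, max_trees + 1, step):
        rf.set_params(n_estimators=n)
        rf.fit(X, y)  # with warm_start=True only the newly added trees are trained
        oob_rmse = float(np.sqrt(np.mean((y - rf.oob_prediction_) ** 2)))
        history.append((n, oob_rmse))
        if best - oob_rmse < tol * best:
            break  # improvement too small (or negative): stop adding trees
        best = oob_rmse
    return rf, history

X_demo, y_demo = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)
rf, history = grow_until_plateau(X_demo, y_demo)
for n_trees, score in history:
    print(n_trees, round(score, 3))
```

More trees rarely make a random forest overfit, so this mainly saves training time; depth and leaf-size limits remain the main regularization controls.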

In [19]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

# Subsample for hyperparameter tuning (e.g., 30% to reduce time)
X_train_sampled = X_train.drop(columns=['dcoilwtico', 'transactions']).sample(frac=0.3, random_state=42)
y_train_sampled = y_train.sample(frac=0.3, random_state=42)

# Updated hyperparameter grid for regularization
param_dist = {
    'n_estimators': [100, 150],  # Moderate number of trees to prevent overfitting
    'max_depth': [10, 15],  # Limit tree depth to control complexity
    'min_samples_split': [10, 15],  # Increase minimum samples required to split
    'min_samples_leaf': [5, 10],  # Larger leaves to prevent overfitting
    'max_features': ['sqrt'],  # Use 'sqrt' for balanced feature selection
    'bootstrap': [True],  # Bootstrap sampling to introduce variance
    'oob_score': [True]  # Out-of-Bag scoring for internal validation
}

# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Use RandomizedSearchCV for hyperparameter tuning with regularization
rf_random = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=param_dist, 
    n_iter=10,  # Perform 10 iterations
    cv=3,  # Cross-validation for performance estimation
    verbose=2, 
    random_state=42, 
    n_jobs=-1,  # Use all available CPU cores
    scoring='neg_root_mean_squared_error'
)

# Fit RandomizedSearchCV on the sampled training data
rf_random.fit(X_train_sampled, y_train_sampled)

# Get the best model from tuning
best_rf = rf_random.best_estimator_

# Predict on full training and test sets
y_pred_train = best_rf.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = best_rf.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Output the best hyperparameters and RMSE results
print(f"Best Hyperparameters: {rf_random.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Hyperparameters: {'oob_score': True, 'n_estimators': 150, 'min_samples_split': 15, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 15, 'bootstrap': True}
Training RMSE: 18.37479537855853
Test RMSE: 20.780135073853046




## CatBoost Model
Strengths: CatBoost is a gradient-boosting library designed for datasets with many categorical variables. It encodes categorical features internally using target statistics and categorical feature combinations. It also handles missing values in numeric features natively, without the need for imputation.

Why Use It:

Efficient handling of categorical variables without the need for extensive preprocessing.

Handles missing values internally.

Often performs well with little hyperparameter tuning.

Key Features:

Supports categorical features natively.

Includes regularization options (such as l2_leaf_reg) to limit overfitting.


In [20]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

# Identify categorical columns (assuming they are in object or categorical data types)
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Subsample the data for quicker iterations (if necessary)
X_train_sampled = X_train.drop(columns=['dcoilwtico', 'transactions']).sample(frac=0.3, random_state=42)
y_train_sampled = y_train.sample(frac=0.3, random_state=42)

# Initialize CatBoost model with basic parameters (these can be tuned further)
catboost_model = CatBoostRegressor(
    iterations=1000,           # Number of boosting iterations (you can tune this)
    learning_rate=0.05,        # Lower learning rate to prevent overfitting
    depth=10,                  # Depth of trees (tune based on dataset)
    cat_features=categorical_columns,  # Specify the categorical features
    random_seed=42,
    verbose=100,               # Verbose to show progress after every 100 iterations
    early_stopping_rounds=50,  # Only takes effect if an eval_set is passed to fit()
)

# Fit the model to the training data
catboost_model.fit(X_train_sampled, y_train_sampled)

# Predict on the full training and test sets
y_pred_train = catboost_model.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = catboost_model.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


0:	learn: 22.0768852	total: 79.4ms	remaining: 1m 19s
100:	learn: 19.9998952	total: 1.35s	remaining: 12s
200:	learn: 19.6807294	total: 2.62s	remaining: 10.4s
300:	learn: 19.4508601	total: 3.95s	remaining: 9.18s
400:	learn: 19.3072090	total: 5.25s	remaining: 7.84s
500:	learn: 19.1919513	total: 6.58s	remaining: 6.55s
600:	learn: 19.1077763	total: 7.9s	remaining: 5.24s
700:	learn: 19.0370250	total: 9.21s	remaining: 3.93s
800:	learn: 18.9803398	total: 10.6s	remaining: 2.63s
900:	learn: 18.9281981	total: 11.9s	remaining: 1.3s
999:	learn: 18.8810039	total: 13.5s	remaining: 0us
Training RMSE: 18.19464112041713
Test RMSE: 21.32739157240641





## The baseline CatBoost model matches the random forest on training RMSE (18.19) but has a higher test RMSE (21.33 vs 20.78), so we tune it further
Hyperparameter Tuning:

iterations: Searched over 500 to 1500 boosting rounds.

depth: Explored deeper trees (up to 12) to capture more complexity in the data.

learning_rate: Searched lower values (0.01 to 0.05) so the model learns more slowly but more precisely.

l2_leaf_reg: L2 regularization on leaf values, added to control overfitting.

bagging_temperature: Controls how strongly the Bayesian bootstrap reweights training rows; higher values add more randomness.

random_strength: Adds randomness to split scoring, which helps with noisy data.

Cross-Validation: cv=3 scores each candidate on three held-out folds, so the chosen hyperparameters are less likely to be tuned to a single split.

Subsampling: Kept subsampling at 30% for faster iteration during the hyperparameter search. After finding the best parameters, you can train on the full dataset for better results.

Early Stopping: early_stopping_rounds=50 stops training when the validation metric has not improved for 50 rounds. It only takes effect when a validation set is passed to fit() through eval_set; the cell below does not pass one, so every candidate runs for its full number of iterations.

Tips for Further Improvement:

Feature Engineering: Adding or transforming features (e.g., creating interaction features) can sometimes improve model performance significantly.

Handling Imbalance: If the target variable is heavily skewed, consider tuning the loss function or using balanced folds during cross-validation.

Data Imputation: If missing values are still a problem, try different imputation methods (e.g., median, mode) before modeling.
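
The cells in this notebook subsample `X_train` and `y_train` with two separate `.sample()` calls that share a `random_state`. A more explicit alternative is to draw row positions once and apply them to both objects. The sketch below is a hypothetical helper (`subsample_aligned` is not part of the notebook) shown on toy data.

```python
import numpy as np
import pandas as pd

def subsample_aligned(X, y, frac=0.3, random_state=42):
    """Hypothetical helper: draw the same row positions from X and y.

    Works by position, so it does not depend on the index being unique.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(random_state)
    n = int(round(frac * len(X)))
    pos = rng.choice(len(X), size=n, replace=False)
    return X.iloc[pos], y.iloc[pos]

# Toy frame with a deliberately non-unique index
X_demo = pd.DataFrame({"a": range(10), "b": range(10, 20)}, index=[0, 1, 2, 3, 4] * 2)
y_demo = pd.Series(range(100, 110), index=X_demo.index)
X_s, y_s = subsample_aligned(X_demo, y_demo, frac=0.3)
print(X_s.assign(y=y_s.values))
```

Selecting by position rather than by label avoids surprises when the index contains repeated values, which can happen when a frame is assembled from several partitions.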


In [21]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Identify categorical columns
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Subsample the data for quicker iterations (if necessary)
X_train_sampled = X_train.drop(columns=['dcoilwtico', 'transactions']).sample(frac=0.3, random_state=42)
y_train_sampled = y_train.sample(frac=0.3, random_state=42)

# Define the parameter grid for RandomizedSearchCV
param_dist = {
    'iterations': [500, 1000, 1500],      # Higher iterations for better learning
    'depth': [6, 8, 10, 12],              # Depth of the trees
    'learning_rate': [0.01, 0.03, 0.05],  # Lower learning rate to prevent overfitting
    'l2_leaf_reg': [1, 3, 5, 7],          # L2 regularization to control overfitting
    'bagging_temperature': [0.2, 0.4, 0.6], # Higher bagging temperature for diversity in trees
    'random_strength': [1, 5, 10],        # Randomness for dealing with noisy data
}

# Initialize the CatBoost model
catboost_model = CatBoostRegressor(
    cat_features=categorical_columns,
    early_stopping_rounds=50,   # Only takes effect if an eval_set is passed to fit()
    random_seed=42,
    verbose=100,                # Show progress every 100 iterations
)

# Use RandomizedSearchCV for hyperparameter tuning
random_search = RandomizedSearchCV(
    estimator=catboost_model,
    param_distributions=param_dist,
    n_iter=10,                  # Number of random combinations to try
    cv=3,                       # 3-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1,                  # Use all available cores for speed
    scoring='neg_root_mean_squared_error'
)

# Fit the RandomizedSearchCV model on the sampled data
random_search.fit(X_train_sampled, y_train_sampled)

# Get the best model
best_catboost_model = random_search.best_estimator_

# Predict on the full training and test sets
y_pred_train = best_catboost_model.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = best_catboost_model.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Print results
print(f"Best Hyperparameters: {random_search.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
0:	learn: 22.1727403	total: 17.8ms	remaining: 17.8s
100:	learn: 21.3129862	total: 1.35s	remaining: 12.1s
200:	learn: 20.8995862	total: 2.73s	remaining: 10.8s
300:	learn: 20.7057169	total: 4.3s	remaining: 10s
400:	learn: 20.5582570	total: 5.39s	remaining: 8.05s
500:	learn: 20.4554844	total: 6.52s	remaining: 6.49s
600:	learn: 20.3648671	total: 7.64s	remaining: 5.07s
700:	learn: 20.2982619	total: 8.84s	remaining: 3.77s
800:	learn: 20.2294936	total: 10s	remaining: 2.49s
900:	learn: 20.1799529	total: 11.2s	remaining: 1.23s
999:	learn: 20.1323101	total: 12.3s	remaining: 0us
Best Hyperparameters: {'random_strength': 5, 'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 1000, 'depth': 10, 'bagging_temperature': 0.6}
Training RMSE: 18.226256385317935
Test RMSE: 20.835654350124567




## Test RMSE is lower (20.84 vs 21.33), but the gap of about 2.6 between training and test RMSE still indicates overfitting
Adjustments for Reducing Overfitting:

Shallower Trees (depth): Reducing the depth of the trees helps prevent the model from capturing noise in the training data. This lowers complexity.

Lower Learning Rate (learning_rate): By reducing the learning rate, we ensure that the model learns more gradually, avoiding overfitting to the training data.

L2 Regularization (l2_leaf_reg): Stronger regularization (higher l2_leaf_reg) penalizes large leaf values, helping to reduce the variance and improve generalization.

Bagging (bagging_temperature): Higher bagging temperature introduces more randomness in the Bayesian bootstrap weights, which can help reduce overfitting.

Random Strength (random_strength): Higher values add more randomness to split scoring, which helps smooth out noisy patterns.

Random Subspace Method (rsm): Samples a fraction of the features at each split selection, reducing overfitting.

Early Stopping: Halts training if the validation metric does not improve for 50 iterations. This requires a validation set passed to fit() through eval_set.

Cross-Validation: Using cv=3 in RandomizedSearchCV scores each candidate on different subsets of the data, so the selected hyperparameters generalize better.


In [22]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Identify categorical columns
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Subsample the data for quicker iterations (if necessary)
X_train_sampled = X_train.drop(columns=['dcoilwtico', 'transactions']).sample(frac=0.3, random_state=42)
y_train_sampled = y_train.sample(frac=0.3, random_state=42)

# Define the parameter grid for RandomizedSearchCV
param_dist = {
    'iterations': [500, 1000],               # Sufficiently high iterations with early stopping
    'depth': [4, 6, 8],                      # Reduce tree depth to prevent overfitting
    'learning_rate': [0.01, 0.03],           # Lower learning rate for gradual learning
    'l2_leaf_reg': [5, 7, 10],               # Increase regularization to penalize complexity
    'bagging_temperature': [0.5, 0.7],       # More aggressive subsampling to improve generalization
    'random_strength': [5, 10],              # Higher random strength to avoid overfitting
    'rsm': [0.8, 1.0],                       # Random subspace method to introduce more randomness
    'border_count': [128, 254],              # Increase border count to handle feature binning better
}

# Initialize the CatBoost model with more regularization and early stopping
catboost_model = CatBoostRegressor(
    cat_features=categorical_columns,
    early_stopping_rounds=50,    # Only takes effect if an eval_set is passed to fit()
    random_seed=42,
    verbose=100,                 # Show progress every 100 iterations
)

# Use RandomizedSearchCV for hyperparameter tuning with cross-validation
random_search = RandomizedSearchCV(
    estimator=catboost_model,
    param_distributions=param_dist,
    n_iter=10,                    # Number of random combinations to try
    cv=3,                         # 3-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1,                    # Use all available cores for faster processing
    scoring='neg_root_mean_squared_error'
)

# Fit the RandomizedSearchCV model on the sampled data
random_search.fit(X_train_sampled, y_train_sampled)

# Get the best model from the random search
best_catboost_model = random_search.best_estimator_

# Predict on the full training and test sets
y_pred_train = best_catboost_model.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = best_catboost_model.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Print results
print(f"Best Hyperparameters: {random_search.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
0:	learn: 22.1565258	total: 14.4ms	remaining: 7.17s
100:	learn: 21.3034964	total: 730ms	remaining: 2.88s
200:	learn: 21.0669931	total: 1.45s	remaining: 2.15s
300:	learn: 20.9094659	total: 2.14s	remaining: 1.42s
400:	learn: 20.7645771	total: 2.86s	remaining: 706ms
499:	learn: 20.3743819	total: 3.65s	remaining: 0us
Best Hyperparameters: {'rsm': 0.8, 'random_strength': 10, 'learning_rate': 0.03, 'l2_leaf_reg': 10, 'iterations': 500, 'depth': 8, 'border_count': 128, 'bagging_temperature': 0.5}
Training RMSE: 18.230352999231844
Test RMSE: 20.732974410857167




## Try a LightGBM model
Explanation of the Adjustments:
Hyperparameters:

n_estimators: Number of boosting iterations (trees).
learning_rate: Shrinks the contribution of each tree; lower rates usually need more trees but tend to generalize better.
num_leaves: Maximum number of leaves per tree and the main control on model complexity; fewer leaves reduce the risk of overfitting.
max_depth: Limits tree depth and works together with num_leaves; shallower trees are less likely to overfit.
min_child_samples (alias min_data_in_leaf): Minimum number of samples per leaf, which prevents splits that fit only a handful of rows.
feature_fraction: Fraction of features randomly selected for each tree, which reduces overfitting.
bagging_fraction and bagging_freq: Control row subsampling, which acts as regularization; bagging_fraction only takes effect when bagging_freq is greater than 0.
lambda_l1 and lambda_l2: L1 and L2 regularization terms that help prevent overfitting by penalizing large leaf values.
Cross-Validation: cv=3 gives 3-fold cross-validation, so hyperparameters are selected on held-out folds rather than on the training fit.

Efficiency: LightGBM handles large datasets efficiently and can handle categorical variables natively when they are passed as category-typed columns.
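Because num_leaves and max_depth both cap tree size, it can help to check which of the two actually binds. A binary tree of depth d has at most 2^d leaves, so a num_leaves value above that cap cannot be reached. The sketch below is only an illustration: the helper name `effective_max_leaves` is hypothetical and not part of LightGBM, and the pairs are taken from the search spaces used in this notebook.

```python
def effective_max_leaves(num_leaves: int, max_depth: int) -> int:
    """Upper bound on leaves per tree when both limits are set.

    A binary tree of depth d has at most 2**d leaves. LightGBM treats
    max_depth <= 0 as "no depth limit", so only num_leaves applies then.
    """
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


# (num_leaves, max_depth) pairs that appear in this notebook's search spaces
for num_leaves, max_depth in [(31, 6), (60, 10), (50, 3), (20, 7)]:
    cap = effective_max_leaves(num_leaves, max_depth)
    print(f"num_leaves={num_leaves}, max_depth={max_depth} -> at most {cap} leaves")
```

When max_depth is small, as in the (50, 3) pair, the depth limit is the binding constraint and changing num_leaves has no effect.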



In [29]:
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np

# Define the parameter grid for RandomizedSearchCV with better regularization
param_dist = {
    'n_estimators': [500, 1000],  # Number of boosting rounds
    'learning_rate': [0.01, 0.05, 0.1],  # Learning rate
    'num_leaves': [31, 60],  # Reduce range to avoid complexity
    'max_depth': [6, 10],  # Limit depth to prevent overfitting
    'min_child_samples': [20, 50],  # Increase this to control split complexity
    'feature_fraction': [0.6, 0.8],  # Fraction of features to use
    'bagging_fraction': [0.6, 0.8],  # Fraction of data for bagging
    'bagging_freq': [1],  # Bagging frequency
    'lambda_l1': [0.01, 0.1],  # L1 regularization to prevent overfitting
    'lambda_l2': [0.01, 0.1],  # L2 regularization
}

# Initialize the LightGBM model
lgb_model = lgb.LGBMRegressor(
    random_state=42,
    objective='regression',  # Objective for regression tasks
    n_jobs=-1  # Use all available cores
)

# Use RandomizedSearchCV for hyperparameter tuning
lgb_random = RandomizedSearchCV(
    estimator=lgb_model, 
    param_distributions=param_dist, 
    n_iter=10,  # Number of parameter settings sampled
    cv=3,  # Cross-validation folds
    verbose=2, 
    random_state=42, 
    n_jobs=-1,  # Parallel processing
    scoring='neg_root_mean_squared_error'  # Metric for optimization
)

# Fit the model on the training data
lgb_random.fit(X_train, y_train)

# Get the best model from the search
best_lgb = lgb_random.best_estimator_

# Make predictions on training and test sets
y_pred_train = best_lgb.predict(X_train)
y_pred_test = best_lgb.predict(X_test)

# Calculate RMSE for training and test sets
# (square root of MSE; squared=False is deprecated in newer scikit-learn releases)
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
test_rmse = mean_squared_error(y_test, y_pred_test) ** 0.5

# Print the results
print(f"Best Hyperparameters: {lgb_random.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.165027 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1431
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 327
[LightGBM] [Info] Start training from score 9.016899
[CV] END bagging_fraction=0.6, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=0.01, learning_rate=0.1, max_depth=6, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 2.1min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.037248 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1455
[LightGBM] [Info] Number of data points in the train set: 278481,

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.086571 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1431
[LightGBM] [Info] Number of data points in the train set: 278482, number of used features: 326
[LightGBM] [Info] Start training from score 9.059431
[CV] END bagging_fraction=0.6, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=0.01, learning_rate=0.1, max_depth=6, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 2.1min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.065669 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1455
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 339
[LightGBM] [Info] Start trainin

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.021561 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1444
[LightGBM] [Info] Number of data points in the train set: 417722, number of used features: 332
[LightGBM] [Info] Start training from score 9.020188




Best Hyperparameters: {'num_leaves': 31, 'n_estimators': 500, 'min_child_samples': 50, 'max_depth': 6, 'learning_rate': 0.1, 'lambda_l2': 0.01, 'lambda_l1': 0.1, 'feature_fraction': 0.8, 'bagging_freq': 1, 'bagging_fraction': 0.6}
Training RMSE: 17.633711500745374
Test RMSE: 20.428758847107172




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.135020 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1455
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 339
[LightGBM] [Info] Start training from score 9.016899
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.01, lambda_l2=0.1, learning_rate=0.01, max_depth=6, min_child_samples=20, n_estimators=500, num_leaves=31; total time= 2.3min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.037690 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1457
[LightGBM] [Info] Number of data points in the train set: 278482, number of used features: 339
[LightGBM] [Info] Start traini

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.131275 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1431
[LightGBM] [Info] Number of data points in the train set: 278482, number of used features: 326
[LightGBM] [Info] Start training from score 9.059431
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.01, lambda_l2=0.1, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 2.6min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.039184 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1455
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 339
[LightGBM] [Info] Start traini

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.051750 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1431
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 327
[LightGBM] [Info] Start training from score 9.016899
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.01, lambda_l2=0.1, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 2.6min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.041996 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1457
[LightGBM] [Info] Number of data points in the train set: 278482, number of used features: 339
[LightGBM] [Info] Start traini

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.097051 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1433
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 328
[LightGBM] [Info] Start training from score 8.984235
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.01, lambda_l2=0.1, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 2.5min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.288383 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1455
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 339
[LightGBM] [Info] Start traini

[CV] END bagging_fraction=0.6, bagging_freq=1, feature_fraction=0.6, lambda_l1=0.1, lambda_l2=0.01, learning_rate=0.01, max_depth=6, min_child_samples=20, n_estimators=500, num_leaves=60; total time= 2.7min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.037685 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1433
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 328
[LightGBM] [Info] Start training from score 8.984235
[CV] END bagging_fraction=0.6, bagging_freq=1, feature_fraction=0.6, lambda_l1=0.1, lambda_l2=0.1, learning_rate=0.01, max_depth=6, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 1.1min


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.120407 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1433
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 328
[LightGBM] [Info] Start training from score 8.984235
[CV] END bagging_fraction=0.6, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=0.01, learning_rate=0.1, max_depth=6, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 2.1min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032854 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1457
[LightGBM] [Info] Number of data points in the train set: 278482, number of used features: 339
[LightGBM] [Info] Start trainin

[CV] END bagging_fraction=0.6, bagging_freq=1, feature_fraction=0.6, lambda_l1=0.1, lambda_l2=0.01, learning_rate=0.01, max_depth=6, min_child_samples=20, n_estimators=500, num_leaves=60; total time= 2.6min
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.049795 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1431
[LightGBM] [Info] Number of data points in the train set: 278481, number of used features: 327
[LightGBM] [Info] Start training from score 9.016899
[CV] END bagging_fraction=0.6, bagging_freq=1, feature_fraction=0.6, lambda_l1=0.1, lambda_l2=0.1, learning_rate=0.01, max_depth=6, min_child_samples=50, n_estimators=500, num_leaves=31; total time= 1.2min


In [33]:
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np

# Adjust the parameter grid to focus on reducing overfitting
param_grid = {
    'learning_rate': [0.05, 0.1],  # Moderate learning rates (0.01 dropped from the earlier search)
    'n_estimators': [500, 1000],
    'max_depth': [3, 5, 7],  # Shallower trees than in the earlier search
    'num_leaves': [20, 31, 50],  # Fewer leaves than in the earlier search
    'min_child_samples': [20, 30, 50],  # Larger min_child_samples to reduce overfitting
    'lambda_l1': [0.1, 0.5, 1.0],  # Increase L1 regularization
    'lambda_l2': [0.1, 0.5, 1.0],  # Increase L2 regularization
    'bagging_fraction': [0.7, 0.8],  # Bagging to reduce variance
    'bagging_freq': [1],
    'feature_fraction': [0.6, 0.8]  # Feature sampling to prevent overfitting
}

# Initialize the LightGBM model
lgbm = lgb.LGBMRegressor(random_state=42)

# Perform RandomizedSearchCV to tune the hyperparameters
lgbm_random = RandomizedSearchCV(
    estimator=lgbm,
    param_distributions=param_grid,
    n_iter=10,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)

# Fit the model on the training data
lgbm_random.fit(X_train_sampled, y_train_sampled)

# Get the best model
best_lgbm = lgbm_random.best_estimator_

# Predict on the full training and test sets
y_pred_train = best_lgbm.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = best_lgbm.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE for training and test sets
# (square root of MSE; squared=False is deprecated in newer scikit-learn releases)
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
test_rmse = mean_squared_error(y_test, y_pred_test) ** 0.5

# Print results
print(f"Best Hyperparameters: {lgbm_random.best_params_}")
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.092685 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83544, number of used features: 301
[LightGBM] [Info] Start training from score 8.995700
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.5, lambda_l2=1.0, learning_rate=0.05, max_depth=7, min_child_samples=50, n_estimators=1000, num_leaves=20; total time= 2.7min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.034858 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83544, 

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.097910 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1144
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 314
[LightGBM] [Info] Start training from score 9.046809
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=1.0, learning_rate=0.05, max_depth=7, min_child_samples=30, n_estimators=500, num_leaves=20; total time= 1.3min
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.043955 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1122
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 303
[LightGBM] [Info] Start training from score 8.969688
[CV] END bagging_fraction=0.7, bagging_freq=

[CV] END bagging_fraction=0.7, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=0.5, learning_rate=0.05, max_depth=3, min_child_samples=50, n_estimators=1000, num_leaves=20; total time=12.6min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.204707 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 301
[LightGBM] [Info] Start training from score 9.046809


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.123797 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83544, number of used features: 301
[LightGBM] [Info] Start training from score 8.995700
[CV] END bagging_fraction=0.7, bagging_freq=1, feature_fraction=0.6, lambda_l1=1.0, lambda_l2=0.5, learning_rate=0.1, max_depth=7, min_child_samples=50, n_estimators=500, num_leaves=50; total time= 2.7min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.061333 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1122
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 303
[LightGBM] [Info] Start training from score 8.969688


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.143757 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1122
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 303
[LightGBM] [Info] Start training from score 8.969688
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.5, lambda_l2=1.0, learning_rate=0.05, max_depth=7, min_child_samples=50, n_estimators=1000, num_leaves=20; total time= 2.6min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.024795 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 301
[LightGBM] [Info] Start training

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.094481 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 301
[LightGBM] [Info] Start training from score 9.046809
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.5, lambda_l2=1.0, learning_rate=0.05, max_depth=7, min_child_samples=50, n_estimators=1000, num_leaves=20; total time= 2.6min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.066777 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83544, number of used features: 301
[LightGBM] [Info] Start training

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.096288 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 301
[LightGBM] [Info] Start training from score 9.046809
[CV] END bagging_fraction=0.7, bagging_freq=1, feature_fraction=0.6, lambda_l1=1.0, lambda_l2=0.5, learning_rate=0.1, max_depth=7, min_child_samples=50, n_estimators=500, num_leaves=50; total time= 2.4min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.034455 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1122
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 303
[LightGBM] [Info] Start training from score 8.969688


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.111971 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1144
[LightGBM] [Info] Number of data points in the train set: 83544, number of used features: 314
[LightGBM] [Info] Start training from score 8.995700
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=1.0, learning_rate=0.05, max_depth=7, min_child_samples=30, n_estimators=500, num_leaves=20; total time= 1.6min
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.051720 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1118
[LightGBM] [Info] Number of data points in the train set: 83544, number of used features: 301
[LightGBM] [Info] Start training from score 8.995700
[CV] END bagging_fraction=0.7, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=1.0, learning_

[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=0.5, learning_rate=0.1, max_depth=3, min_child_samples=50, n_estimators=500, num_leaves=50; total time=  27.0s
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.042765 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1122
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 303
[LightGBM] [Info] Start training from score 8.969688
[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=0.5, learning_rate=0.1, max_depth=3, min_child_samples=50, n_estimators=500, num_leaves=50; total time=  31.6s
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.036565 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1156
[LightGBM] [Info] Number of data points in the trai

[CV] END bagging_fraction=0.8, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=0.5, learning_rate=0.1, max_depth=3, min_child_samples=50, n_estimators=500, num_leaves=50; total time=  27.4s
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.020860 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1158
[LightGBM] [Info] Number of data points in the train set: 83544, number of used features: 321
[LightGBM] [Info] Start training from score 8.995700
[CV] END bagging_fraction=0.7, bagging_freq=1, feature_fraction=0.8, lambda_l1=1.0, lambda_l2=0.1, learning_rate=0.05, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=20; total time= 1.1min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.035931 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not

[CV] END bagging_fraction=0.7, bagging_freq=1, feature_fraction=0.8, lambda_l1=0.1, lambda_l2=1.0, learning_rate=0.1, max_depth=7, min_child_samples=50, n_estimators=1000, num_leaves=31; total time=15.6min
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.027953 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1144
[LightGBM] [Info] Number of data points in the train set: 83545, number of used features: 314
[LightGBM] [Info] Start training from score 9.046809


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008645 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1142
[LightGBM] [Info] Number of data points in the train set: 125317, number of used features: 313
[LightGBM] [Info] Start training from score 9.004066








Best Hyperparameters: {'num_leaves': 50, 'n_estimators': 500, 'min_child_samples': 50, 'max_depth': 3, 'learning_rate': 0.1, 'lambda_l2': 0.5, 'lambda_l1': 0.1, 'feature_fraction': 0.8, 'bagging_freq': 1, 'bagging_fraction': 0.8}
Training RMSE: 18.09958726588607
Test RMSE: 20.526955214264422




In [38]:
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

# Hyperparameter adjustments
lgbm = LGBMRegressor(
    boosting_type='gbdt',
    objective='regression',
    n_estimators=1500,  # More trees to compensate for the lower learning rate
    learning_rate=0.05,  # Lower learning rate
    max_depth=7,  # Control overfitting with a moderate depth
    num_leaves=50,  # Cap on the number of leaves per tree
    min_child_samples=15,  # Minimum samples per leaf (lower than in the searches above)
    bagging_fraction=0.75,  # Row subsampling; ignored unless bagging_freq > 0 is also set
    feature_fraction=0.8,  # Fraction of features sampled for each tree
    lambda_l1=0.1,  # L1 regularization
    lambda_l2=5.0  # L2 regularization
)

# X_train and X_test must already be defined.
# Fit the model and track RMSE on the evaluation set. No early-stopping
# callback is passed, so all 1500 trees are built.
lgbm.fit(
    X_train.drop(columns=['dcoilwtico', 'transactions']),
    y_train,
    eval_set=[(X_test.drop(columns=['dcoilwtico', 'transactions']), y_test)],  # Evaluation set (here, the test set)
    eval_metric='rmse',  # Specify RMSE as the evaluation metric
)

# Predictions
y_pred_train = lgbm.predict(X_train.drop(columns=['dcoilwtico', 'transactions']))
y_pred_test = lgbm.predict(X_test.drop(columns=['dcoilwtico', 'transactions']))

# Calculate RMSE (square root of MSE; squared=False is deprecated in newer scikit-learn releases)
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
test_rmse = mean_squared_error(y_test, y_pred_test) ** 0.5

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.022304 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1193
[LightGBM] [Info] Number of data points in the train set: 417722, number of used features: 338
[LightGBM] [Info] Start training from score 9.020188














Training RMSE: 17.476291890209875
Test RMSE: 20.399061645219223




## Try an XGBoost Model.

In [40]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Define the XGBoost model with tuned hyperparameters
xgb_model = XGBRegressor(
    n_estimators=500,      # Number of trees
    learning_rate=0.05,    # Step size shrinkage
    max_depth=6,           # Maximum depth of a tree
    subsample=0.8,         # Fraction of samples used for fitting the individual trees
    colsample_bytree=0.8,  # Fraction of features sampled for each tree
    reg_alpha=1,           # L1 regularization term
    reg_lambda=1,          # L2 regularization term
    random_state=42
)

# Fit the model on training data
xgb_model.fit(X_train, y_train)

# Predict on training and test sets
y_pred_train = xgb_model.predict(X_train)
y_pred_test = xgb_model.predict(X_test)

# Calculate RMSE for training and test sets
# (square root of MSE; squared=False is deprecated in newer scikit-learn releases)
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
test_rmse = mean_squared_error(y_test, y_pred_test) ** 0.5

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Training RMSE: 0.04200875527710678
Test RMSE: 0.15826920794236946




## The XGBoost model performs much better than our other models, but we need to reduce overfitting.

The XGBoost RMSE values (about 0.04 on training and 0.16 on test) are on a very different scale from the CatBoost and LightGBM results (about 18 and 20), so the comparison only holds if unit_sales was on the same scale in every run. With that caveat, XGBoost often performs well on complex datasets with categorical variables and missing values for several reasons. Many of these features are shared with CatBoost and LightGBM, so they explain why gradient boosting suits this problem rather than proving that XGBoost is always the better choice:

1. Advanced Gradient Boosting Algorithm:
XGBoost uses an optimized implementation of gradient boosting. Its tree-building process is made more robust by features such as:

Regularization: XGBoost applies both L1 (Lasso-style) and L2 (Ridge-style) penalties to the leaf weights, which helps limit model complexity and prevent overfitting.
Tree Pruning: XGBoost grows each tree up to max_depth and then prunes back splits whose gain does not exceed the gamma threshold, which makes it less likely to fit noise.

2. Handling Sparse Data Efficiently:
XGBoost natively handles missing values and sparse data. At each split it learns a default direction for missing values, sending them to whichever branch gives the better gain. This is useful in real-world datasets where missing data or sparse dummy variables are common.

3. Distributed and Efficient Computation:
XGBoost stores data in a block structure for memory access, which makes it faster on large datasets like ours. It parallelizes the search for splits within each tree, which shortens training time without changing what the model learns.

4. Tunable Regularization Terms:
XGBoost exposes its regularization terms (reg_alpha and reg_lambda) directly, so they can be adjusted to limit overfitting. In our case, XGBoost may have struck a good balance between model complexity and generalization, hence achieving a lower RMSE than the other models.

5. Handling of Non-Linear Interactions:
Like other tree-based models, XGBoost can capture complex non-linear interactions between features. Each new tree is fit to the errors of the previous trees, and deeper trees can capture interactions among more features, including dummy-encoded categorical variables.

6. Efficient Feature Utilization:
In comparison to models like Random Forest, XGBoost chooses each split to reduce the remaining error, so uninformative features are used less often. LightGBM can underperform when many categorical variables are present unless they are properly encoded or declared as categorical.

7. Early Stopping:
XGBoost supports early stopping, which halts training once performance on a validation set stops improving and so avoids building unnecessary trees. It was not used in the model above, but it is set in the tuned model later in this notebook.
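To make the regularization and pruning points above concrete, the regularized split gain described in the XGBoost paper can be evaluated by hand. In this sketch, G and H are the sums of first and second derivatives of the loss over the rows in a node, `reg_lambda` is the L2 term, and `gamma` is the minimum gain a split must reach to be kept. The function name and all numbers are made up for illustration, and the L1 term is left out for brevity.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Regularized gain of splitting a node into a left and a right child."""
    def score(g, h):
        # Quality of a leaf holding gradient sum g and Hessian sum h
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Hypothetical gradient statistics for one candidate split
weak = split_gain(-4.0, 10.0, 5.0, 12.0, reg_lambda=1.0, gamma=0.0)
strict = split_gain(-4.0, 10.0, 5.0, 12.0, reg_lambda=10.0, gamma=2.0)
print(f"gain with lambda=1, gamma=0:  {weak:.3f}")
print(f"gain with lambda=10, gamma=2: {strict:.3f}")
```

With the stronger settings the same split ends up with a negative gain, which is the situation in which pruning would remove it.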

Why It Worked Better in this Case:
Large and Complex Dataset: A dataset with 12 million rows and a mix of categorical variables benefits from XGBoost’s ability to handle sparsity and missing data efficiently.
Regularization: The regularization in XGBoost likely helped control overfitting, although the gap between training and test RMSE shows that some overfitting remains.
Strong Feature Utilization: XGBoost’s tree-building process was likely able to identify the most important interactions between our features, reducing noise.
In short, XGBoost is well suited to large, sparse, and complex datasets like ours because it balances regularization, feature selection, and handling of missing data.
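RMSE is measured in the units of the target, so scores are only comparable between models when unit_sales is on the same scale in every run. One hedged way to check this is to divide each RMSE by the standard deviation of the target it was computed on; a value below 1 means the model beats a baseline that always predicts the mean. The helpers and toy numbers below are illustrative only and do not come from this dataset.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the population standard deviation of y_true."""
    return rmse(y_true, y_pred) / statistics.pstdev(y_true)


# Toy target on its raw scale, and the same target divided by 10
y_raw = [2.0, 5.0, 9.0, 14.0, 30.0]
pred_raw = [3.0, 4.0, 10.0, 12.0, 26.0]
y_scaled = [v / 10 for v in y_raw]
pred_scaled = [v / 10 for v in pred_raw]

# Plain RMSE changes with the scale of the target
print(rmse(y_raw, pred_raw), rmse(y_scaled, pred_scaled))
# Normalized RMSE is the same on both scales (up to floating-point rounding)
print(normalized_rmse(y_raw, pred_raw), normalized_rmse(y_scaled, pred_scaled))
```

Reporting the normalized value next to the raw RMSE makes it easier to compare runs in which the target may have been transformed.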








## Reduce overfitting
1. Increase Regularization:
alpha (L1 regularization, reg_alpha): Adds a penalty on the absolute value of the leaf weights, which can shrink some of them to exactly zero and reduce overfitting.
lambda (L2 regularization, reg_lambda): Adds a penalty on the squared leaf weights, which shrinks them smoothly toward zero to prevent overfitting.
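The effect of the two penalties on a single leaf can be worked through by hand. As a sketch that ignores min_child_weight and other constraints, the optimal leaf value is w = -T(G, alpha) / (H + lambda), where G and H are the sums of gradients and Hessians in the leaf and T soft-thresholds G by alpha. The function name and the leaf statistics below are hypothetical; the (alpha, lambda) pairs mirror the settings tried in this notebook.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Optimal leaf value under L1 (reg_alpha) and L2 (reg_lambda) penalties."""
    # L1: a gradient sum no larger than reg_alpha gives a leaf of exactly zero
    if abs(grad_sum) <= reg_alpha:
        return 0.0
    # Otherwise move the gradient sum toward zero by reg_alpha
    shrunk = grad_sum - reg_alpha if grad_sum > 0 else grad_sum + reg_alpha
    # L2: a larger denominator shrinks the weight smoothly toward zero
    return -shrunk / (hess_sum + reg_lambda)


# Hypothetical leaf with gradient sum -8 and Hessian sum 20
for alpha, lam in [(0, 0), (1, 1), (10, 10)]:
    w = leaf_weight(-8.0, 20.0, reg_alpha=alpha, reg_lambda=lam)
    print(f"reg_alpha={alpha}, reg_lambda={lam}: weight={w:.4f}")
```

Raising both terms from 1 to 10 removes this leaf's contribution entirely, which shows how strong regularization trades training fit for simpler trees.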

In [42]:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Assuming X and y are your features and target variable
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the XGBoost model with adjusted hyperparameters
model = XGBRegressor(
    n_estimators=1000,
    max_depth=4,
    learning_rate=0.01,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=10,
    reg_lambda=10,
    min_child_weight=5,
    early_stopping_rounds=10,
    random_state=42
)

# Fit on the training data; the validation set is used only to monitor RMSE and for early stopping
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

# Predict on the training and validation sets
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)

# Calculate RMSE as sqrt(MSE); the `squared=False` argument was removed in newer scikit-learn releases
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
val_rmse = mean_squared_error(y_val, y_pred_val) ** 0.5

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Validation RMSE: {val_rmse}")


[0]	validation_0-rmse:1.02197
[50]	validation_0-rmse:0.76232
[100]	validation_0-rmse:0.59658
[150]	validation_0-rmse:0.48487
[200]	validation_0-rmse:0.40077
[250]	validation_0-rmse:0.34085
[300]	validation_0-rmse:0.30165
[350]	validation_0-rmse:0.27181
[400]	validation_0-rmse:0.25124
[450]	validation_0-rmse:0.23467
[500]	validation_0-rmse:0.22332
[550]	validation_0-rmse:0.21457
[600]	validation_0-rmse:0.20778
[650]	validation_0-rmse:0.20213
[700]	validation_0-rmse:0.19791
[750]	validation_0-rmse:0.19480
[800]	validation_0-rmse:0.19226
[850]	validation_0-rmse:0.19009
[900]	validation_0-rmse:0.18824
[950]	validation_0-rmse:0.18670
[999]	validation_0-rmse:0.18545
Training RMSE: 0.1482615886044696
Validation RMSE: 0.18544888072387364




2. Lower max_depth:
A lower max_depth can prevent the model from learning highly specific patterns in the training data. You can experiment with lowering the value from 6 to 3 or 4.
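
As a rough illustration of why depth matters: a binary tree of depth d has at most 2^d leaves, so each reduction in max_depth halves the number of distinct regions a single tree can carve out of the data. The helper name `max_leaves` below is made up for this sketch.

```python
def max_leaves(max_depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** max_depth


for depth in (6, 4, 3):
    print(f"max_depth={depth}: at most {max_leaves(depth)} leaves per tree")
```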

In [43]:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Assuming X and y are your features and target variable
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the XGBoost model with adjusted hyperparameters
model = XGBRegressor(
    n_estimators=1000,
    max_depth=4,  # Try reducing depth
    learning_rate=0.01,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=10,
    reg_lambda=10,
    early_stopping_rounds=10
)
# Fit on the training data; the validation set is used only to monitor RMSE and for early stopping
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

# Predict on the training and validation sets
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)

# Calculate RMSE as sqrt(MSE); the `squared=False` argument was removed in newer scikit-learn releases
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
val_rmse = mean_squared_error(y_val, y_pred_val) ** 0.5

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Validation RMSE: {val_rmse}")


[0]	validation_0-rmse:1.02251
[50]	validation_0-rmse:0.76595
[100]	validation_0-rmse:0.60053
[150]	validation_0-rmse:0.49054
[200]	validation_0-rmse:0.40851
[250]	validation_0-rmse:0.34908
[300]	validation_0-rmse:0.31046
[350]	validation_0-rmse:0.28151
[400]	validation_0-rmse:0.25988
[450]	validation_0-rmse:0.24417
[500]	validation_0-rmse:0.23270
[550]	validation_0-rmse:0.22378
[600]	validation_0-rmse:0.21691
[650]	validation_0-rmse:0.21142
[700]	validation_0-rmse:0.20722
[750]	validation_0-rmse:0.20387
[800]	validation_0-rmse:0.20113
[850]	validation_0-rmse:0.19885
[900]	validation_0-rmse:0.19700
[950]	validation_0-rmse:0.19543
[999]	validation_0-rmse:0.19403
Training RMSE: 0.14589152628105612
Validation RMSE: 0.19402913387635518




3. Lower learning_rate and Increase n_estimators:
A smaller learning rate helps the model to learn slowly and thus reduces overfitting, but compensating with a higher number of estimators ensures that enough learning occurs.
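
The trade-off between learning rate and number of estimators can be seen in an idealized model. The sketch assumes squared-error loss and that every tree fits the current residual perfectly before being scaled by the learning rate; real trees of limited depth fit the residual only partially, so actual convergence is much slower. The function name is hypothetical.

```python
def remaining_error_fraction(learning_rate, n_rounds):
    """Fraction of the initial residual left after n_rounds of boosting,
    assuming each tree fits the residual exactly and is scaled by learning_rate."""
    return (1.0 - learning_rate) ** n_rounds


for lr, rounds in [(0.01, 1000), (0.005, 1000), (0.005, 2000)]:
    frac = remaining_error_fraction(lr, rounds)
    print(f"learning_rate={lr}, n_estimators={rounds}: {frac:.2e} of the initial residual remains")
```

Under these assumptions, halving the learning rate needs roughly twice as many rounds to reach the same point, which is why n_estimators is raised together with the lower learning_rate.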

In [44]:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Assuming X and y are your features and target variable
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(
    n_estimators=2000,  # Increase estimators
    max_depth=4,
    learning_rate=0.005,  # Reduce learning rate
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=10,
    reg_lambda=10,
    early_stopping_rounds=10
)
# Fit on the training data; the validation set is used only to monitor RMSE and for early stopping
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

# Predict on the training and validation sets
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)

# Calculate RMSE as sqrt(MSE); the `squared=False` argument was removed in newer scikit-learn releases
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
val_rmse = mean_squared_error(y_val, y_pred_val) ** 0.5

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Validation RMSE: {val_rmse}")


[0]	validation_0-rmse:1.02597
[50]	validation_0-rmse:0.88346
[100]	validation_0-rmse:0.77250
[150]	validation_0-rmse:0.68333
[200]	validation_0-rmse:0.60526
[250]	validation_0-rmse:0.53867
[300]	validation_0-rmse:0.48823
[350]	validation_0-rmse:0.44268
[400]	validation_0-rmse:0.40709
[450]	validation_0-rmse:0.37598
[500]	validation_0-rmse:0.35045
[550]	validation_0-rmse:0.32907
[600]	validation_0-rmse:0.31068
[650]	validation_0-rmse:0.29473
[700]	validation_0-rmse:0.28096
[750]	validation_0-rmse:0.26978
[800]	validation_0-rmse:0.26020
[850]	validation_0-rmse:0.25210
[900]	validation_0-rmse:0.24514
[950]	validation_0-rmse:0.23882
[1000]	validation_0-rmse:0.23311
[1050]	validation_0-rmse:0.22786
[1100]	validation_0-rmse:0.22399
[1150]	validation_0-rmse:0.22039
[1200]	validation_0-rmse:0.21703
[1250]	validation_0-rmse:0.21409
[1300]	validation_0-rmse:0.21142
[1350]	validation_0-rmse:0.20902
[1400]	validation_0-rmse:0.20706
[1450]	validation_0-rmse:0.20511
[1500]	validation_0-rmse:0.20365




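4. Reduce subsample and colsample_bytree:
Training each tree on a random fraction of the rows (subsample) and of the columns (colsample_bytree) adds randomness that can reduce overfitting.

In [46]: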
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Assuming X and y are your features and target variable
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(
    n_estimators=1000,
    max_depth=4,
    learning_rate=0.01,
    subsample=0.7,  # Reduce subsample
    colsample_bytree=0.7,  # Reduce column sampling
    reg_alpha=10,
    reg_lambda=10,
    early_stopping_rounds=10
)

# Fit on the training data; the validation set is used only to monitor RMSE and for early stopping
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

# Predict on the training and validation sets
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)

# Calculate RMSE as sqrt(MSE); the `squared=False` argument was removed in newer scikit-learn releases
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
val_rmse = mean_squared_error(y_val, y_pred_val) ** 0.5

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Validation RMSE: {val_rmse}")

[0]	validation_0-rmse:1.02326
[50]	validation_0-rmse:0.81385
[100]	validation_0-rmse:0.65091
[150]	validation_0-rmse:0.53929
[200]	validation_0-rmse:0.45193
[250]	validation_0-rmse:0.38759
[300]	validation_0-rmse:0.34541
[350]	validation_0-rmse:0.31190
[400]	validation_0-rmse:0.28725
[450]	validation_0-rmse:0.26748
[500]	validation_0-rmse:0.25292
[550]	validation_0-rmse:0.24121
[600]	validation_0-rmse:0.23294
[650]	validation_0-rmse:0.22589
[700]	validation_0-rmse:0.22025
[750]	validation_0-rmse:0.21622
[800]	validation_0-rmse:0.21241
[850]	validation_0-rmse:0.20930
[900]	validation_0-rmse:0.20682
[950]	validation_0-rmse:0.20469
[999]	validation_0-rmse:0.20267
Training RMSE: 0.15665978252929624
Validation RMSE: 0.2026694887433994




5. Tune min_child_weight:
This parameter controls the minimum sum of instance weight (hessian) needed in a child. Increasing min_child_weight makes the algorithm more conservative and prevents overfitting by requiring larger leaf sizes.
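
A minimal sketch of the rule, assuming the default squared-error objective and unit sample weights, in which case each row's hessian is 1 and the hessian sum of a child is simply its row count. The function name `split_allowed` and the row counts are made up for illustration.

```python
def split_allowed(left_hessians, right_hessians, min_child_weight):
    """A split is kept only if both children have enough total hessian."""
    return (sum(left_hessians) >= min_child_weight
            and sum(right_hessians) >= min_child_weight)


# With squared-error loss and unit sample weights, each row's hessian is 1.0
left = [1.0] * 3    # child with 3 rows
right = [1.0] * 40  # child with 40 rows

print(split_allowed(left, right, min_child_weight=1))  # True
print(split_allowed(left, right, min_child_weight=5))  # False: the left child is too small
```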

In [47]:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Assuming X and y are your features and target variable
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(
    n_estimators=1000,
    max_depth=4,
    learning_rate=0.01,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=10,
    reg_lambda=10,
    min_child_weight=5,  # Increase to prevent overfitting
    early_stopping_rounds=10
)

# Fit on the training data; the validation set is used only to monitor RMSE and for early stopping
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

# Predict on the training and validation sets
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)

# Calculate RMSE as sqrt(MSE); the `squared=False` argument was removed in newer scikit-learn releases
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5
val_rmse = mean_squared_error(y_val, y_pred_val) ** 0.5

# Print results
print(f"Training RMSE: {train_rmse}")
print(f"Validation RMSE: {val_rmse}")

[0]	validation_0-rmse:1.02245
[50]	validation_0-rmse:0.76356
[100]	validation_0-rmse:0.59574
[150]	validation_0-rmse:0.48426
[200]	validation_0-rmse:0.40166
[250]	validation_0-rmse:0.34161
[300]	validation_0-rmse:0.30176
[350]	validation_0-rmse:0.27208
[400]	validation_0-rmse:0.25060
[450]	validation_0-rmse:0.23464
[500]	validation_0-rmse:0.22289
[550]	validation_0-rmse:0.21403
[600]	validation_0-rmse:0.20737
[650]	validation_0-rmse:0.20174
[700]	validation_0-rmse:0.19752
[750]	validation_0-rmse:0.19414
[800]	validation_0-rmse:0.19149
[850]	validation_0-rmse:0.18922
[900]	validation_0-rmse:0.18748
[950]	validation_0-rmse:0.18590
[999]	validation_0-rmse:0.18463
Training RMSE: 0.14738945572373655
Validation RMSE: 0.18463001787469321





## The training and held-out (validation) RMSE are best in our XGBoost model by far. The training RMSE is 0.147 and the validation RMSE is 0.185. This is the model we will choose to predict grocery store sales.
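
The XGBoost runs above can be compared side by side. The sketch below uses only the RMSE values printed in this notebook; the dictionary labels are informal names chosen here, and the lower-learning-rate run (In [44]) is left out because its final RMSE lines do not appear in the output.

```python
# (training RMSE, validation RMSE) copied from the printed outputs above
runs = {
    "regularization (In [42])": (0.1482615886044696, 0.18544888072387364),
    "max_depth (In [43])": (0.14589152628105612, 0.19402913387635518),
    "subsample/colsample (In [46])": (0.15665978252929624, 0.2026694887433994),
    "min_child_weight (In [47])": (0.14738945572373655, 0.18463001787469321),
}

for name, (train_rmse, val_rmse) in runs.items():
    gap = val_rmse - train_rmse  # a larger gap suggests more overfitting
    print(f"{name}: train={train_rmse:.3f}  val={val_rmse:.3f}  gap={gap:.3f}")

best_run = min(runs, key=lambda name: runs[name][1])
print(f"Lowest validation RMSE: {best_run}")
```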