# **FOREWORD**

This is my starter kernel for the [MITSUI&CO. Commodity Prediction Challenge](https://www.kaggle.com/competitions/mitsui-commodity-prediction-challenge) competition. The excerpts below are taken from the competition overview page. <br>

### **The Challenge: Aiming for More Accurate Commodity Price Forecasting**
This competition tackles the critical need for more accurate and stable long-term commodity price predictions. Getting these predictions right has significant implications for both businesses and the global economy. Inaccurate forecasts can lead to suboptimal trading strategies, poor investment decisions, and increased financial risk for companies involved in commodity markets. By encouraging the development of advanced AI models that can accurately predict future commodity returns using historical data from LME, JPX, US Stock, and Forex, this competition aims to directly improve the precision of financial forecasting and enable the optimization of automated trading strategies.

In particular, participants are challenged to predict price-difference series—derived from the time-series differences between two distinct assets’ prices—to extract robust price-movement signals as features and deploy AI-driven trading techniques that turn those signals into sustainable trading profits.

### **Evaluation metric**
The competition metric is a custom variant of the Sharpe ratio; the code for it is in the kernel here. <br>
It is computed by dividing the mean Spearman rank correlation between the predictions and targets by the standard deviation of those correlations.
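To make the definition concrete, here is a minimal sketch of such a metric. It is not the `utils.ScoreMetric` implementation used later in this kernel (that code is not shown here); it assumes one Spearman correlation per row (one row per `date_id`), skips rows with fewer than two valid pairs or constant values, and uses a population standard deviation (`ddof=0`). The function name `rank_correlation_sharpe` is hypothetical.

```python
import numpy as np
import pandas as pd


def rank_correlation_sharpe(y_true, y_pred) -> float:
    """Mean of per-row Spearman correlations divided by their standard deviation.

    Assumption: each row holds all targets for one date, and NaNs mark
    targets that are unavailable on that date.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    daily_corr = []
    for true_row, pred_row in zip(y_true, y_pred):
        # Use only the targets that are present in both rows
        mask = ~np.isnan(true_row) & ~np.isnan(pred_row)
        if mask.sum() < 2:
            continue

        # Spearman correlation = Pearson correlation of the ranks
        true_rank = pd.Series(true_row[mask]).rank().to_numpy()
        pred_rank = pd.Series(pred_row[mask]).rank().to_numpy()
        if true_rank.std() == 0 or pred_rank.std() == 0:
            continue  # correlation is undefined for constant rows

        daily_corr.append(np.corrcoef(true_rank, pred_rank)[0, 1])

    daily_corr = np.asarray(daily_corr, dtype=float)
    if daily_corr.size == 0 or daily_corr.std(ddof=0) == 0:
        return float("nan")

    return float(daily_corr.mean() / daily_corr.std(ddof=0))
```

Because the score depends only on ranks within each date, the scale of the predictions does not matter; only their ordering across targets does.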

### **What I do here**
I start with basic feature engineering (FE) and simple boosted tree models using a time-series-based cross-validation. I also illustrate how to submit the model to the inference API for evaluation on the public leaderboard.
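As an illustration of what a time-series-based split can look like, the sketch below generates expanding-window folds in which every validation block lies strictly after its training data. The helper `expanding_window_splits` and its parameters are hypothetical and are not taken from this kernel; the behaviour is similar in spirit to scikit-learn's `TimeSeriesSplit` with a fixed `test_size`.

```python
import numpy as np


def expanding_window_splits(n_samples: int, n_splits: int = 5, test_size: int = 90):
    """Yield (train_idx, test_idx) pairs ordered in time.

    Assumption: rows are already sorted chronologically, one row per day.
    """
    if n_splits * test_size >= n_samples:
        raise ValueError("Not enough samples for the requested splits")

    indices = np.arange(n_samples)
    first_test_start = n_samples - n_splits * test_size

    for fold in range(n_splits):
        start = first_test_start + fold * test_size
        stop = start + test_size
        # Train on everything before the block, validate on the block itself
        yield indices[:start], indices[start:stop]


# Example with the 1917 training rows of this competition and three folds
for train_idx, test_idx in expanding_window_splits(1917, n_splits=3, test_size=90):
    print(len(train_idx), test_idx[0], test_idx[-1])
```

Unlike a shuffled `KFold`, this scheme never trains on rows that come after the validation block, which avoids look-ahead leakage in time-ordered data.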

Wishing the best for the competition!


# **IMPORTS**

In [1]:
%%time 

!pip install -q lightgbm==4.6.0      --no-index --find-links=/kaggle/input/mitsui2025-public-imports-v1/packages
!pip install -q xgboost==3.0.2       --no-index --find-links=/kaggle/input/mitsui2025-public-imports-v1/packages
!pip install -q scikit-learn==1.7.1  --no-index --find-links=/kaggle/input/mitsui2025-public-imports-v1/packages
!pip install -q polars==1.31.0       --no-index --find-links=/kaggle/input/mitsui2025-public-imports-v1/packages

exec(open(f"/kaggle/input/mitsui2025-public-imports-v1/myimports.py", "r").read())
exec(open(f"/kaggle/input/mitsui2025-public-imports-v1/myutils.py","r").read())
exec(open(f"/kaggle/input/mitsui2025-public-imports-v1/training.py","r").read())

print()

---> Imports- part 1 done
---> Sklearn = 1.7.1 | Pandas = 2.2.3 | Polars = 1.31.0
---> Commencing imports-part2
---> XGBoost = 3.0.2 | LightGBM = 4.6.0
---> Imports- part 2 done
---> Seeding everything

---> Imports done

CPU times: user 6.96 s, sys: 1.5 s, total: 8.46 s
Wall time: 44 s


# **CONFIGURATION**

In [2]:
%%time

utils = Utils()

class CFG:
    """
    Configuration class for parameters and CV strategy for tuning and training
    Some parameters may be unused here as this is a general configuration class
    """

    # Data preparation:-
    version_nb         = 1
    model_id           = "V1_2"
    model_label        = "ML"
    test_req           = True
    test_iter          = 25
    gpu_switch         = "ON" if torch.cuda.is_available() else "OFF"
    state              = 42
    target             = "target"
    grouper            = ""
    
    tgt_mapper         = {}
    
    ip_path            = f"/kaggle/input/mitsui-commodity-prediction-challenge"
    op_path            = f"/kaggle/working"
    dtl_preproc_req    = False
    ftre_plots_req     = False
    ftre_imp_req       = False

    # Model Training:-
    pstprcs_oof        = False
    ML                 = True
    test_preds_req     = True
    n_splits           = 5
    n_repeats          = 1
    nbrnd_erly_stp     = 0
    mdlcv_mthd         = 'KF'
    metric_obj         = 'maximize'

    # Global variables for plotting:-
    grid_specs = {'visible'  : True,
                  'which'    : 'both',
                  'linestyle': '--',
                  'color'    : 'lightgrey',
                  'linewidth': 0.75
                 }

    title_specs = {'fontsize'   : 9,
                   'fontweight' : 'bold',
                   'color'      : '#992600',
                  }


cv_selector = \
{
 "RKF"   : RepeatedKFold(n_splits = CFG.n_splits, n_repeats= CFG.n_repeats, random_state= CFG.state),
 "RSKF"  : RepeatedStratifiedKFold(n_splits  = CFG.n_splits, n_repeats= CFG.n_repeats, random_state= CFG.state),
 "SKF"   : StratifiedKFold(n_splits = CFG.n_splits, shuffle = True, random_state= CFG.state),
 "KF"    : KFold(n_splits = CFG.n_splits, shuffle = True, random_state= CFG.state),
 "GKF"   : GroupKFold(n_splits = CFG.n_splits)
}

collect()


CPU times: user 162 ms, sys: 43.6 ms, total: 205 ms
Wall time: 247 ms


0

# **PREPROCESSING**

We load the datasets here and preprocess them so that all targets are stacked into a single target column, as below: <br>
- Load the datasets
- For each target, create an interim table with all features and one target id
- Purge rows with null and unrealistic targets
- Append the interim table to a list and concatenate the list at the end, giving one table that hosts the features and a single target column
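The steps above can be sketched as follows. This is a minimal illustration on toy data, not the code of the referenced kernel: the function name `stack_targets`, the `target_id` column, and the `max_abs_target` threshold used to flag "unrealistic" targets are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def stack_targets(features, labels, target_cols, key="date_id", max_abs_target=None):
    """Stack one interim table per target into a single long table."""
    frames = []
    for target_id, col in enumerate(target_cols):
        # Interim table: all features plus the values of one target
        interim = features.merge(labels[[key, col]], on=key, how="inner")
        interim = interim.rename(columns={col: "target"})
        interim["target_id"] = target_id

        # Purge rows with null targets
        interim = interim.dropna(subset=["target"])

        # Purge rows with unrealistic targets (hypothetical threshold)
        if max_abs_target is not None:
            interim = interim.loc[interim["target"].abs() <= max_abs_target]

        frames.append(interim)

    # One table with the features and a single target column
    return pd.concat(frames, axis=0, ignore_index=True)


# Toy example with two targets
features = pd.DataFrame({"date_id": [0, 1, 2], "feat_a": [1.0, 2.0, 3.0]})
labels = pd.DataFrame(
    {
        "date_id": [0, 1, 2],
        "target_0": [0.01, np.nan, -0.02],
        "target_1": [0.5, 0.03, 9.0],
    }
)
stacked = stack_targets(features, labels, ["target_0", "target_1"], max_abs_target=1.0)
print(stacked)
```

With this layout a single model can be trained across all targets, using `target_id` as a feature, instead of fitting one model per target.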

Thanks to the kernel [here](https://www.kaggle.com/code/takaito/mitsui-cpc-gradient-boosting-models-training) for this idea.

In [3]:
%%time 

train        = pd.read_csv(f"{CFG.ip_path}/train.csv")
test         = pd.read_csv(f"{CFG.ip_path}/test.csv")
train_labels = pd.read_csv(f"{CFG.ip_path}/train_labels.csv")
target_pairs = pd.read_csv(f"{CFG.ip_path}/target_pairs.csv")
target_cols  = list( train_labels.columns[1:] )

PrintColor(f"\n---> Shape = {train.shape} {train_labels.shape}\n")

display(
    target_pairs.groupby("lag").first().
    style.
    set_caption(f"Target categories by lags")
)

print()
Xtrain, Xtest = train.copy(), test.copy()
Xtrain.index  = range(len(Xtrain))
Ytrain        = train_labels.copy()

PrintColor(f"\n---> Shape = {Xtrain.shape} {Xtest.shape} {Ytrain.shape}\n")

_ = utils.CleanMemory()
print()

[1m[34m
---> Shape = (1917, 558) (1917, 425)
[0m


Target categories by lags

| lag | target | pair |
|---|---|---|
| 1 | target_0 | US_Stock_VT_adj_close |
| 2 | target_106 | US_Stock_VXUS_adj_close |
| 3 | target_212 | FX_ZARUSD |
| 4 | target_318 | FX_NOKEUR |



[1m[34m
---> Shape = (1917, 558) (90, 559) (1917, 425)
[0m

CPU times: user 547 ms, sys: 74.5 ms, total: 622 ms
Wall time: 829 ms


## **INITIAL INSIGHTS**

- All columns are numerical and contain nulls
- Categorical columns could be hidden within the dataset; we need to check the value counts across features to find columns with few distinct values (potential categorical columns)
- Analysis of nulls and outliers across individual features will be useful
- Targets are differences of returns, hence lagged and moving-window features will be useful
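As a rough illustration of the last point, the sketch below adds lagged and rolling-window features with pandas. The function name, the default lags and windows, and the generated column names are assumptions for the example and are not used elsewhere in this kernel.

```python
import pandas as pd


def add_lag_and_rolling_features(df, cols, lags=(1, 2, 3, 4), windows=(5, 20)):
    """Return a copy of df with lagged and rolling mean/std columns added.

    Assumption: df has a `date_id` column with one row per day.
    """
    out = df.sort_values("date_id").reset_index(drop=True)

    new_cols = {}
    for col in cols:
        for lag in lags:
            # Value observed `lag` days earlier
            new_cols[f"{col}_lag{lag}"] = out[col].shift(lag)
        for window in windows:
            # Window ends at the current row, so no future values are used
            rolled = out[col].rolling(window, min_periods=window)
            new_cols[f"{col}_rmean{window}"] = rolled.mean()
            new_cols[f"{col}_rstd{window}"] = rolled.std()

    return pd.concat([out, pd.DataFrame(new_cols)], axis=1)


# Toy example
prices = pd.DataFrame({"date_id": [0, 1, 2, 3], "price": [10.0, 11.0, 13.0, 16.0]})
features = add_lag_and_rolling_features(prices, ["price"], lags=(1,), windows=(2,))
print(features)
```

The first rows of each new column are null until enough history is available; tree models such as LightGBM handle these nulls natively.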

# **BASELINE MODELS**

We use the last 90 days (`date_id` >= 1827) as out-of-fold (OOF) data and fit offline models for **each target** without FE. We then refit the models on the full data and store the fitted models in a dictionary for evaluation and prediction.
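The hold-out logic can be sketched without hard-coding the cut-off date. The helpers below are hypothetical and assume that `date_id` values are consecutive integers; the second helper shows a simple way to blend several models by averaging their predictions.

```python
import numpy as np
import pandas as pd


def split_last_n_days(df, n_days=90, date_col="date_id"):
    """Split df into a training part and the last `n_days` as a hold-out part.

    Assumption: `date_col` holds consecutive integer day ids.
    """
    cutoff = df[date_col].max() - n_days + 1
    train_part = df.loc[df[date_col] < cutoff]
    dev_part = df.loc[df[date_col] >= cutoff]
    return train_part, dev_part, cutoff


def average_predictions(pred_list):
    """Blend several prediction vectors with a simple mean per row."""
    return np.mean(np.stack(pred_list, axis=1), axis=1)


# Toy frame with the same number of days as train.csv (1917 rows)
toy = pd.DataFrame({"date_id": np.arange(1917), "feat_a": np.zeros(1917)})
train_part, dev_part, cutoff = split_last_n_days(toy, n_days=90)
print(cutoff, len(train_part), len(dev_part))

blend = average_predictions([[1.0, 2.0], [3.0, 4.0]])
print(blend)
```

Deriving the cut-off from the data keeps the split valid if more days are appended to the training set later.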



In [4]:
%%time 

OOF_Preds    = []
FittedModels = {}

Mdl_Master = {

    "LGBM1R" : [LGBMR(
                        objective        = "regression",
                        device           = "cpu" if CFG.gpu_switch == "OFF" else "gpu",
                        n_estimators     = 200,
                        metric           = "rmse",
                        learning_rate    = 0.05,
                        random_state     = CFG.state,
                        verbosity        = -1,
                    ),
                 {"callbacks" : [log_evaluation(0)] },
               ], 

    "LGBM2R" : [LGBMR(
                        objective        = "regression",
                        device           = "cpu" if CFG.gpu_switch == "OFF" else "gpu",
                        n_estimators     = 225,
                        data_sample_strategy = "goss",
                        metric           = "rmse",
                        learning_rate    = 0.02,
                        reg_alpha        = 0.001,
                        reg_lambda       = 1.25,
                        colsample_bytree = 0.20,
                        subsample        = 0.30,
                        random_state     = CFG.state,
                        verbosity        = -1,
                    ),
                 {"callbacks" : [log_evaluation(0)] },
               ], 
    
    "LGBM3R" : [LGBMR(
                        objective        = "regression",
                        device           = "cpu" if CFG.gpu_switch == "OFF" else "gpu",
                        n_estimators     = 150,
                        metric           = "rmse",
                        colsample_bytree = 0.55,
                        subsample        = 0.55,
                        learning_rate    = 0.005,
                        random_state     = CFG.state,
                        verbosity        = -1,
                    ),
                 {"callbacks" : [log_evaluation(0)] },
               ],     
}

targets   = list(train_labels.columns[1:])
drop_cols = ["Source", "date_id", "id", "row_id", "is_scored"]
scores    = {}

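for target in tqdm(targets):

    # Attach the current target to the features and drop rows where it is missing
    Xtr = Xtrain.copy()
    Xtr[target] = train_labels[target].values
    Xtr  = Xtr.dropna(subset = [target])
    
    # Development (hold-out) set: the last 90 days, i.e. date_id >= 1827
    Xdev = Xtr.loc[Xtr.date_id >= 1827]
    dev_idx = Xdev["date_id"].values
    Xdev = Xdev.drop(drop_cols, axis=1, errors = "ignore")
    ydev = Xdev[target]
    del Xdev[target]
    
    # Training set: everything before the hold-out window
    Xtr  = Xtr.loc[Xtr.date_id < 1827].drop(drop_cols, axis=1, errors = "ignore")
    ytr  = Xtr[target]
    del Xtr[target]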

    # Fit each model on the training window and score it on the hold-out window
    scores_ = []
    oof_    = []
    for method, (mymodel, fit_params) in Mdl_Master.items():

        model = clone(mymodel)
        model.fit(Xtr, ytr, **fit_params)
        dev_preds = model.predict(Xdev)
        
        scores_.append( root_mean_squared_error(ydev, dev_preds) )
        _ = utils.CleanMemory()
        oof_.append(dev_preds)

    # Average the hold-out predictions across the models (simple blend)
    oof_ = np.mean(np.stack(oof_, axis=1), axis=1)
    OOF_Preds.append(
        pd.Series(oof_, index = dev_idx, name = target)
    )
    scores[target] = scores_
    
    # Refit every model on all available rows for this target
    Xtr = Xtrain.copy()
    Xtr[target] = train_labels[target].values
    Xtr  = Xtr.dropna(subset = [target]).drop(drop_cols, axis=1, errors = "ignore")
    ytr  = Xtr[target]
    del Xtr[target]

    f_models = []
    for method, (mymodel, fit_params) in Mdl_Master.items():
    
        model = clone(mymodel)
        model.fit(Xtr, ytr, **fit_params)
        f_models.append(model)
        
    # Keep the refitted models for inference
    FittedModels[target] = f_models
    
joblib.dump(FittedModels, "FittedModels.joblib")
print()
_ = utils.CleanMemory()

  0%|          | 0/424 [00:00<?, ?it/s]




CPU times: user 7h 51min 46s, sys: 17min 24s, total: 8h 9min 11s
Wall time: 4h 9min 7s


## **CV-SCORE**

In [5]:
%%time 

submission            = pd.concat(OOF_Preds, axis=1).sort_index(ascending = True)
submission["date_id"] = np.array(submission.index)
# Copy the slice so that assigning a column does not trigger SettingWithCopyWarning
solution              = train_labels.loc[train_labels.date_id >= 1827].copy()
solution["date_id"]   = submission["date_id"].values

score = \
utils.ScoreMetric(
    solution,
    submission,
    "date_id"
)

PrintColor(f"\n---> Score = {score:,.8f}\n")

[1m[34m
---> Score = -0.00549112
[0m
CPU times: user 454 ms, sys: 3.08 ms, total: 457 ms
Wall time: 496 ms
