Add sample weight to LightGBM #147

Merged: artpdr merged 27 commits into feedzai:master from jpleitao:ft-jl-lightgbm-sample-weight on May 19, 2026

Conversation

@jpleitao
Contributor

Summary

This MR adds support for sample weights in LightGBM model training. This is achieved by:

  • Adding a sample_weight parameter to the utility class that organizes all of LightGBM's training hyper-parameters.

  • Introducing utility classes for managing schema fields, in particular a SampleWeightParamParserUtil for the sample weight hyper-parameter.

  • Modifying LightGBM's training logic to make use of the sample weight hyper-parameter. The new logic:

    • Creates SWIG objects to hold the sample weights.
    • Adjusts the set of features passed to LightGBM's native implementation, to exclude the sample weight from the predictive fields.
    • Initializes the sample weight objects with the values from the training dataset (according to the hyper-parameter specification) and copies them to the SWIG data arrays, so they can be used in model training.
    • Validates the sample weight parameter, to ensure it references an existing field.
    • Adjusts the model loading verifications, in particular those on the number and names of the predictive fields, so they do not include sample weights (as these are not predictive fields).
  • Tests

    • Added two new test resources, with their schemas, for two sample datasets with different sample weights.
    • Added new tests:
      • Load model with schema containing sample weights.
      • Validation error is thrown when the sample weight references a field that does not exist.
      • Validation error is thrown when the sample weight references a non-numeric field.
      • Successful model training with sample weight.
      • Validate sample weight usage in model training: train two models with the same data but different sample weights:
        • Model 1 is trained with random sample weights, uniformly distributed across positive and negative samples (it doesn't give more weight to a particular class).
        • Model 2 is trained with a significantly higher sample weight in the positive class: all positive class samples have weight = 10, whereas negative class samples have weight = 1.
        • Model 2 is expected to have higher recall, as the assigned sample weights make it pay more attention to the positive class.
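The field handling described above can be illustrated with a minimal Java sketch. It is not the code from this PR: `SampleWeightSketch`, `Split`, and `split` are hypothetical names, the dataset is a plain `double[][]` rather than the provider's dataset and schema types, and the target column is left out for brevity. It only shows the idea of locating the weight field by name, rejecting a missing field or a negative weight, and keeping the weight column out of the predictive features.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not the openml-java implementation. */
final class SampleWeightSketch {

    /** Predictive features and per-row weights extracted from a row-major dataset. */
    static final class Split {
        final double[][] features;
        final double[] weights;

        Split(double[][] features, double[] weights) {
            this.features = features;
            this.weights = weights;
        }
    }

    /** Separates the weight column from the predictive columns, validating as it goes. */
    static Split split(List<String> fieldNames, double[][] rows, String weightField) {
        int weightIndex = fieldNames.indexOf(weightField);
        if (weightIndex < 0) {
            // Mirrors the "references an existing field" validation.
            throw new IllegalArgumentException("Sample weight field not found: " + weightField);
        }
        int numFeatures = fieldNames.size() - 1;
        double[][] features = new double[rows.length][numFeatures];
        double[] weights = new double[rows.length];
        for (int r = 0; r < rows.length; r++) {
            if (rows[r].length != fieldNames.size()) {
                throw new IllegalArgumentException("Row " + r + " does not match the schema size");
            }
            double w = rows[r][weightIndex];
            if (Double.isNaN(w) || w < 0) {
                throw new IllegalArgumentException("Invalid sample weight at row " + r + ": " + w);
            }
            weights[r] = w;
            int out = 0;
            for (int c = 0; c < rows[r].length; c++) {
                if (c != weightIndex) {
                    features[r][out++] = rows[r][c]; // the weight is not a predictive field
                }
            }
        }
        return new Split(features, weights);
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("amount", "weight", "age");
        double[][] rows = { {10.0, 1.0, 30.0}, {25.0, 10.0, 41.0} };
        Split s = split(fields, rows, "weight");
        System.out.println(Arrays.deepToString(s.features));
        System.out.println(Arrays.toString(s.weights));
    }
}
```

In this sketch the non-numeric check has no counterpart, because the rows are already `double` values; in a schema-based setting that check would happen on the field's declared type.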

@codecov

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 88.97638% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.99%. Comparing base (4ab3e8e) to head (c515b98).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...tgbm/LightGBMBinaryClassificationModelTrainer.java | 86.36% | 3 Missing and 3 partials ⚠️ |
| ...eedzai/openml/provider/lightgbm/SWIGTrainData.java | 80.00% | 1 Missing and 3 partials ⚠️ |
| ...openml/provider/lightgbm/LightGBMModelCreator.java | 92.30% | 1 Missing and 2 partials ⚠️ |
| ...zai/openml/provider/lightgbm/SchemaFieldsUtil.java | 94.44% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #147      +/-   ##
============================================
+ Coverage     80.52%   80.99%   +0.46%     
- Complexity      479      516      +37     
============================================
  Files            50       52       +2     
  Lines          1648     1757     +109     
  Branches        158      177      +19     
============================================
+ Hits           1327     1423      +96     
- Misses          233      238       +5     
- Partials         88       96       +8     

Comment thread .gitmodules Outdated
@jpleitao jpleitao requested a review from artpdr May 18, 2026 11:50
@jpleitao jpleitao force-pushed the ft-jl-lightgbm-sample-weight branch from 1aabfce to dde15e0 Compare May 18, 2026 17:22
@jpleitao jpleitao requested a review from artpdr May 18, 2026 17:26
@jpleitao jpleitao requested a review from artpdr May 19, 2026 09:21
@AlbertoEAF AlbertoEAF self-requested a review May 19, 2026 10:44
Contributor

@AlbertoEAF AlbertoEAF left a comment

I didn't have time for a thorough review, but from a quick look the logic seems sound in terms of the LGBM lib calls and the column selection.

I'd suggest a couple of manual UATs to be sure everything works as expected: passing the field, omitting it, passing an invalid one, and using a small synthetic dataset that shows the model took the weights into consideration.
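For the synthetic-dataset check, one way to compare two trained models is by recall on the positive class, which is the metric the PR's own test relies on. The sketch below is only a helper for that comparison, under stated assumptions: `RecallSketch` is a hypothetical name, and the prediction arrays in `main` are hand-written placeholders, not outputs of any real model.

```java
/** Hypothetical helper; the predictions below are placeholders, not real model outputs. */
final class RecallSketch {

    /** Recall = true positives / (true positives + false negatives), for 0/1 labels. */
    static double recall(int[] labels, int[] predictions) {
        if (labels.length != predictions.length) {
            throw new IllegalArgumentException("labels and predictions differ in length");
        }
        int truePositives = 0;
        int falseNegatives = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == 1) {
                if (predictions[i] == 1) {
                    truePositives++;
                } else {
                    falseNegatives++;
                }
            }
        }
        if (truePositives + falseNegatives == 0) {
            throw new IllegalArgumentException("No positive samples, recall is undefined");
        }
        return (double) truePositives / (truePositives + falseNegatives);
    }

    public static void main(String[] args) {
        int[] labels = {1, 1, 1, 1, 0, 0, 0, 0};
        // Placeholder predictions standing in for a uniformly weighted model...
        int[] uniformModel = {1, 0, 0, 1, 0, 0, 0, 0};
        // ...and for a model trained with heavier weights on the positive class.
        int[] positiveWeightedModel = {1, 1, 1, 0, 0, 1, 0, 0};
        System.out.println(recall(labels, uniformModel));          // prints 0.5
        System.out.println(recall(labels, positiveWeightedModel)); // prints 0.75
    }
}
```

A manual check along these lines would assert that the second recall is higher than the first, while keeping in mind that higher recall from up-weighting positives usually comes with more false positives.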

@artpdr artpdr merged commit f31e058 into feedzai:master May 19, 2026
3 checks passed
@jpleitao jpleitao deleted the ft-jl-lightgbm-sample-weight branch May 19, 2026 22:58
artpdr pushed a commit that referenced this pull request May 20, 2026
…ter training (#150)

# Summary

This PR fixes a memory leak introduced in #147, which did not ensure that all allocated SWIG resources were released, in particular when training crashes (e.g., when an `IllegalArgumentException` is thrown because negative sample weights were provided). To ensure allocated resources are always released, the training logic now uses a try-with-resources statement, which guarantees that `.close()` is called on the two resources created, `swigTrainData` and `swigTrainBooster`, whether or not training succeeds.
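The pattern can be sketched in plain Java as follows. This is an illustration, not the code from #150: `NativeResource` is a hypothetical stand-in for the SWIG wrapper classes, and only the variable names `swigTrainData` and `swigTrainBooster` come from the description above. The point it shows is that both resources are closed even when the training body throws.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the try-with-resources pattern; not the code from #150. */
final class TryWithResourcesSketch {

    /** Stand-in for a wrapper around natively allocated memory. */
    static final class NativeResource implements AutoCloseable {
        private final String name;
        private final List<String> closeLog;

        NativeResource(String name, List<String> closeLog) {
            this.name = name;
            this.closeLog = closeLog;
        }

        String name() {
            return name;
        }

        @Override
        public void close() {
            // A real wrapper would free its native arrays here.
            closeLog.add(name);
        }
    }

    static void train(double[] sampleWeights, List<String> closeLog) {
        try (NativeResource swigTrainData = new NativeResource("swigTrainData", closeLog);
             NativeResource swigTrainBooster = new NativeResource("swigTrainBooster", closeLog)) {
            for (double w : sampleWeights) {
                if (w < 0) {
                    throw new IllegalArgumentException("Negative sample weight: " + w);
                }
            }
            System.out.println("Training with " + swigTrainData.name()
                    + " and " + swigTrainBooster.name());
        } // close() runs here on both resources, in reverse order of declaration
    }

    public static void main(String[] args) {
        List<String> closeLog = new ArrayList<>();
        try {
            train(new double[] {1.0, -2.0}, closeLog);
        } catch (IllegalArgumentException e) {
            System.out.println("Training failed: " + e.getMessage());
        }
        System.out.println(closeLog); // prints [swigTrainBooster, swigTrainData]
    }
}
```

Declaring both resources in the same `try` header means a failure while constructing the second one still closes the first, which a hand-written `finally` block can easily get wrong.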

3 participants