Add sample weight to LightGBM by jpleitao · Pull Request #147 · feedzai/feedzai-openml-java

jpleitao · 2026-05-14T08:51:18Z

Summary

This PR adds support for sample weights in LightGBM model training. It does so by:

Adding a sample_weight parameter to the utility class that organizes all of LightGBM's training hyper-parameters.
Introducing utility classes for managing schema fields, in particular a SampleWeightParamParserUtil for the sample weight hyper-parameter.
Modifying LightGBM's training logic to make use of the sample weight hyper-parameter:
- Creates SWIG objects to hold the sample weights.
- Adjusts the set of features in LightGBM's native implementation to exclude the sample weight from the predictive fields.
- Initializes the sample weight objects with the values from the training dataset (according to the hyper-parameter specification) and ensures they are copied to SWIG data arrays, so they can be used in model training.
- Adds validation of the sample weight parameter, to ensure it references an existing field.
- Adjusts the model loading verifications, in particular those on the number and names of the predictive fields, so they do not include sample weights (as these are not predictive fields).
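
The handling of the sample weight field listed above can be illustrated with a minimal sketch. It is not the code from this PR: the class `SampleWeightSketch` and its method names are hypothetical, and rows are assumed to be plain `double[]` arrays. It shows three ideas: checking that the configured field exists, leaving the weight column out of the predictive features, and rejecting negative weights.

```java
import java.util.List;

/** Hypothetical illustration only; these are not the classes changed in this PR. */
final class SampleWeightSketch {

    /** Returns the index of the weight field, or throws if the schema has no such field. */
    static int findWeightIndex(List<String> fieldNames, String weightFieldName) {
        int index = fieldNames.indexOf(weightFieldName);
        if (index < 0) {
            throw new IllegalArgumentException("Sample weight field not found: " + weightFieldName);
        }
        return index;
    }

    /** Copies every column except the weight column, so the weight is never a predictive feature. */
    static double[] predictiveValues(double[] row, int weightIndex) {
        double[] features = new double[row.length - 1];
        int target = 0;
        for (int i = 0; i < row.length; i++) {
            if (i != weightIndex) {
                features[target++] = row[i];
            }
        }
        return features;
    }

    /** Reads the weight of a row and rejects negative values. */
    static double weightOf(double[] row, int weightIndex) {
        double weight = row[weightIndex];
        if (weight < 0) {
            throw new IllegalArgumentException("Negative sample weight: " + weight);
        }
        return weight;
    }
}
```

With a schema of `amount, weight, age`, the weight index is 1, so the row `{1.0, 10.0, 3.0}` yields the features `{1.0, 3.0}` and the weight `10.0`.
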
Tests
- Added two new test resources, with corresponding schemas, for two sample datasets with different sample weights.
- Added new tests:
  - Load model with schema containing sample weights.
  - Validation error is thrown when the sample weight references a field that does not exist.
  - Validation error is thrown when the sample weight references a non-numeric field.
  - Successful model training with sample weight.
  - Validate sample weight usage in model training: train two models with the same data but different sample weights:
    - Model 1 is trained with random sample weights, uniformly distributed across positive and negative samples (no class is given more weight than the other).
    - Model 2 is trained with significantly higher sample weights for the positive class: all positive class samples have weight = 10, whereas negative class samples have weight = 1.
    - Model 2 is expected to have higher recall, as the assigned sample weights make it pay more attention to the positive class.
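
The comparison in the last test can be sketched as follows. This is an illustration under assumptions, not the test code from the PR: `RecallCheckSketch` and its methods are hypothetical names, labels and predictions are assumed to be 0/1 integers, and no model is trained here.

```java
/** Hypothetical illustration only; not the test code added in this PR. */
final class RecallCheckSketch {

    /** Weighting described for Model 2: weight 10 for positives, weight 1 for negatives. */
    static double[] classBasedWeights(int[] labels) {
        double[] weights = new double[labels.length];
        for (int i = 0; i < labels.length; i++) {
            weights[i] = labels[i] == 1 ? 10.0 : 1.0;
        }
        return weights;
    }

    /** Recall = true positives / (true positives + false negatives). */
    static double recall(int[] labels, int[] predictions) {
        if (labels.length != predictions.length) {
            throw new IllegalArgumentException("labels and predictions differ in length");
        }
        int truePositives = 0;
        int falseNegatives = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == 1) {
                if (predictions[i] == 1) {
                    truePositives++;
                } else {
                    falseNegatives++;
                }
            }
        }
        int positives = truePositives + falseNegatives;
        // Assumption of this sketch: recall is reported as 0 when there are no positive samples.
        return positives == 0 ? 0.0 : (double) truePositives / positives;
    }
}
```

In a test of this kind, `recall` would be computed for both models on the same evaluation data, followed by an assertion that the value for Model 2 is the higher one.
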

codecov · 2026-05-15T10:42:47Z

Codecov Report

❌ Patch coverage is 88.97638% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.99%. Comparing base (4ab3e8e) to head (c515b98).

Files with missing lines	Patch %	Lines
...tgbm/LightGBMBinaryClassificationModelTrainer.java	86.36%	3 Missing and 3 partials ⚠️
...eedzai/openml/provider/lightgbm/SWIGTrainData.java	80.00%	1 Missing and 3 partials ⚠️
...openml/provider/lightgbm/LightGBMModelCreator.java	92.30%	1 Missing and 2 partials ⚠️
...zai/openml/provider/lightgbm/SchemaFieldsUtil.java	94.44%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #147      +/-   ##
============================================
+ Coverage     80.52%   80.99%   +0.46%     
- Complexity      479      516      +37     
============================================
  Files            50       52       +2     
  Lines          1648     1757     +109     
  Branches        158      177      +19     
============================================
+ Hits           1327     1423      +96     
- Misses          233      238       +5     
- Partials         88       96       +8
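
As a worked check of the figures above, the percentages can be reproduced assuming the usual coverage ratio of hits over hits plus misses plus partials. The class name below is hypothetical. The patch hit count of 113 is not printed in the report; it is derived here from the 14 uncovered lines (6 missing, 8 partial) and the reported 88.97638%.

```java
/** Hypothetical helper reproducing the percentages in the report. */
final class CoverageArithmeticSketch {

    /** Coverage in percent, assuming ratio = hits / (hits + misses + partials). */
    static double coveragePercent(int hits, int misses, int partials) {
        return 100.0 * hits / (hits + misses + partials);
    }
}
```

Under that assumption, `coveragePercent(1327, 233, 88)` gives about 80.52 for the base, `coveragePercent(1423, 238, 96)` gives about 80.99 for the head, and `coveragePercent(113, 6, 8)` gives about 88.976 for the patch.
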

AlbertoEAF

I didn't have time for a thorough review, but at a quick look the logic seems sound in terms of LGBM lib calls and column selection.

I'd suggest a couple of manual UATs to be sure everything works as expected, such as passing that field, omitting it, and passing an invalid one, plus a small synthetic dataset that shows the model took the weights into account.

…ter training (#150)

# Summary

This PR addresses a memory leak introduced in #147: not all allocated SWIG resources were guaranteed to be released, in particular when training crashes (e.g., when an `IllegalArgumentException` is thrown because negative sample weights were provided). To ensure allocated resources are always released, the training logic now uses a try-with-resources, which guarantees that `.close()` is called on the two resources created, `swigTrainData` and `swigTrainBooster`, regardless of whether training succeeds.
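
A minimal sketch of the try-with-resources pattern described in that message. The `NativeResource` class is a hypothetical stand-in for the SWIG wrappers, and the declaration order of `swigTrainData` and `swigTrainBooster` is an assumption. The sketch only shows that `close()` runs on both resources even when an exception interrupts training.

```java
import java.util.List;

/** Hypothetical illustration only; not the training code of the provider. */
final class TrainingResourcesSketch {

    /** Stand-in for a wrapper that owns natively allocated memory. */
    static final class NativeResource implements AutoCloseable {
        private final String name;
        private final List<String> closeLog;

        NativeResource(String name, List<String> closeLog) {
            this.name = name;
            this.closeLog = closeLog;
        }

        @Override
        public void close() {
            // A real wrapper would free its native memory here; the sketch only records the call.
            closeLog.add(name);
        }
    }

    /** Validates the weights inside the try block, so both resources are closed on failure too. */
    static void train(double[] weights, List<String> closeLog) {
        try (NativeResource swigTrainData = new NativeResource("swigTrainData", closeLog);
             NativeResource swigTrainBooster = new NativeResource("swigTrainBooster", closeLog)) {
            for (double weight : weights) {
                if (weight < 0) {
                    throw new IllegalArgumentException("Negative sample weight: " + weight);
                }
            }
            // Training with the two resources would happen here.
        }
    }
}
```

Resources declared in a try-with-resources statement are closed in reverse order of declaration, so in this sketch the booster is closed before the train data, whether or not the exception is thrown.
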

AlbertoEAF and others added 13 commits May 13, 2026 17:57

lightgbm: Add SoftLabelParamParserUtil

3575f54

chore(lightgbm): move SchemaFieldsUtil to lightgbm package

3a320d2

feat(lightgbm): added utility class to parse lightgbm sample weight train parameter

4b743e3

feat(lightgbm): add sample weight parameter to lightgbm descriptor

b6b0ce3

feat(lightgbm): modify training to make use of sample weight, when provided

8ab146c

chore(lightgbm): updated tests to validate sample weight

74ceac0

chore(cicd): bump cache maven packages

629ced4

fix(submodules): pointing make-lightgbm submodule to fork with AMD64 dockerfile based on manylinux_2_28

496f967

fix(submodules): update make-lightgbm submodule

690313c

fix(submodules): update make-lightgbm submodule

bcb3b8e

fix(submodules): update make-lightgbm submodule

241ec89

fix(pom): pin maven-antrun-plugin version

905a26f

chore(cicd): bumping codecod to v5

d1ce4d4

artpdr requested changes May 15, 2026

Comment thread .gitmodules Outdated

Joaquim Leitao added 6 commits May 15, 2026 18:31

fix(lightgbm): add check for sample weight values

47faf1f

chore(tests): added tests for SchemaFieldsUtil class

35bdb1b

chore(tests): add test for negative sample weights

817ecec

fix(submodules): update make-lightgbm submodule

b23a131

chore(tests): added tests to cover untested lines

d3e0230

fix(submodules): pointing make-lightgbm to latest upstream

f7fa44e

jpleitao requested a review from artpdr May 18, 2026 11:50

artpdr reviewed May 18, 2026

Comment thread ...provider/src/main/java/com/feedzai/openml/provider/lightgbm/SampleWeightParamParserUtil.java Outdated

artpdr reviewed May 18, 2026

Comment thread ...main/java/com/feedzai/openml/provider/lightgbm/LightGBMBinaryClassificationModelTrainer.java Outdated

artpdr reviewed May 18, 2026

Comment thread ...tgbm/lightgbm-provider/src/main/java/com/feedzai/openml/provider/lightgbm/SWIGTrainData.java

artpdr reviewed May 18, 2026

Comment thread ...main/java/com/feedzai/openml/provider/lightgbm/LightGBMBinaryClassificationModelTrainer.java Outdated

Joaquim Leitao added 3 commits May 18, 2026 18:06

chore(fairgbm-descriptor): fix link to feedzai's fairgbm documentation

95a6836

chore(lint): fix variable name and added missing parameter to method documentation

d0b10d0

fix(lightgbm): simplified logic to determine the number of features, and replaced inefficient comparison during copy to SWIG arrays

dde15e0

jpleitao force-pushed the ft-jl-lightgbm-sample-weight branch from 1aabfce to dde15e0 Compare May 18, 2026 17:22

jpleitao requested a review from artpdr May 18, 2026 17:26

artpdr reviewed May 18, 2026

Joaquim Leitao added 5 commits May 19, 2026 09:55

chore(lightgbm): added method to retrieve the relevant schema for loading the model

f6d1c1a

chore(lightgbm): updated javadoc and parameter names in LightGBMModelCreator.schemaMatchAllFeatures

9013fd1

chore(lightgbm): unified logic to sanitize field names (replace space with underscore) in single method

5b7b9be

fix(lightgbm): fixed validations for model's predictive fields - ensuring only the sample weight field can differ between model and dataset schema

0ba4b77

chore(lightgbm): fix constructor argument

c515b98

jpleitao requested a review from artpdr May 19, 2026 09:21

AlbertoEAF self-requested a review May 19, 2026 10:44

AlbertoEAF approved these changes May 19, 2026

artpdr merged commit f31e058 into feedzai:master May 19, 2026
3 checks passed

jpleitao deleted the ft-jl-lightgbm-sample-weight branch May 19, 2026 22:58

jpleitao mentioned this pull request May 20, 2026

fix(lightgbm): ensuring c++ resources created in swig are released after training #150

Merged

artpdr merged 27 commits into feedzai:master from jpleitao:ft-jl-lightgbm-sample-weight

3 participants
