Store ground truth and predictions on validation folds to output folder #230

iaroslav-ai · 2021-12-10T13:45:24Z

Description of changes:
Stores the XGBoost model's predictions on cross-validation folds in the output folder, so that those predictions can be used for any postprocessing.

Our use case is to calculate a confusion matrix and a few other statistics for the model produced by this container. This requires predictions on data that was not used for training. Currently, with CV enabled, this container loads all data into memory (training + validation) and uses both partitions to train the model ensemble. Because we only have training and validation partitions (and cannot easily add an extra test partition), this change stores predictions in a way similar to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html, which is the only way for us to calculate performance metrics correctly after the container has terminated.
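To illustrate the cross_val_predict idea referenced above, here is a minimal pure-Python sketch, not the container's code. Each row is predicted by a model that never saw that row during training, and the collected (ground truth, prediction) pairs feed a confusion matrix. `MajorityClassModel`, `out_of_fold_predictions`, and `confusion_matrix` are hypothetical names invented for this illustration; the real container trains XGBoost models.

```python
from collections import Counter


class MajorityClassModel:
    """Toy stand-in for a real model (hypothetical): predicts the most common training label."""

    def fit(self, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_rows):
        return [self.label] * n_rows


def out_of_fold_predictions(labels, n_folds):
    """Return one prediction per row, each made by a model trained without that row."""
    n = len(labels)
    predictions = [None] * n
    for fold in range(n_folds):
        # Row i belongs to validation fold i % n_folds.
        val_idx = [i for i in range(n) if i % n_folds == fold]
        train_labels = [labels[i] for i in range(n) if i % n_folds != fold]
        model = MajorityClassModel().fit(train_labels)
        for i, pred in zip(val_idx, model.predict(len(val_idx))):
            predictions[i] = pred
    return predictions


def confusion_matrix(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


y_true = [0, 0, 0, 0, 0, 1, 1]
y_pred = out_of_fold_predictions(y_true, n_folds=3)
matrix = confusion_matrix(y_true, y_pred)
```

Because every prediction comes from a model trained on the other folds, statistics computed from `matrix` are not inflated by training-set leakage.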

To store the additional outputs, this change uses the training platform's feature for writing additional output data: https://github.com/aws/sagemaker-containers#sm-output-data-dir .

Within the output folder, the data is written to the “data” subdirectory, following the convention here: https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/environment.py#L136 .
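As a rough sketch of what writing to that “data” subdirectory could look like, the snippet below stores one (ground truth, prediction) row per validation example in a CSV file. The helper name `write_predictions`, the file name `predictions.csv`, and the two-column layout are assumptions made for illustration; they are not taken from this PR's diff.

```python
import csv
import os
import tempfile


def write_predictions(output_data_dir, y_true, y_pred, file_name="predictions.csv"):
    """Write one (ground truth, prediction) row per validation example."""
    # Create the directory if the platform has not created it yet.
    os.makedirs(output_data_dir, exist_ok=True)
    path = os.path.join(output_data_dir, file_name)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(zip(y_true, y_pred))
    return path


# Usage against a temporary directory standing in for <output dir>/data.
demo_dir = os.path.join(tempfile.mkdtemp(), "output", "data")
written_path = write_predictions(demo_dir, [1.0, 0.0], [0.9, 0.2])
```

In a real training job the directory would come from the platform rather than from `tempfile`.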

Testing performed:

Unit tests passing
Integration tests passing
Ran XGBoost training on ~166 benchmarking datasets

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mabunday

Looks good to me aside from some very minor comments.

src/sagemaker_xgboost_container/algorithm_mode/train.py

awsbmillare · 2021-12-10T19:34:59Z

src/sagemaker_xgboost_container/algorithm_mode/train.py

@@ -232,6 +234,8 @@ def train_job(train_cfg, train_dmatrix, val_dmatrix, train_val_dmatrix, model_di

        else:
            num_cv_round = train_cfg.pop("_num_cv_round", 1)
+            additional_output_path = os.environ[SM_OUTPUT_DATA_DIR]


Have you confirmed that this env var is always set? It might be a good idea to fall back to a sane default here. Alternatively, we'd need to check later whether features are required while this is not set, and throw an error there.

I think this should always be set by the training platform: https://github.com/aws/sagemaker-containers#sm-output-data-dir

I will double check
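One way the fallback suggested in this thread could look is sketched below. This is not the code that was merged. The default path `/opt/ml/output/data` is an assumption based on the usual SageMaker container layout, and `resolve_output_data_dir` is a hypothetical helper name.

```python
import os

# Assumed default: the conventional SageMaker location for extra output data.
DEFAULT_OUTPUT_DATA_DIR = "/opt/ml/output/data"


def resolve_output_data_dir(environ=None):
    """Return SM_OUTPUT_DATA_DIR if it is set and non-empty, otherwise the default."""
    environ = os.environ if environ is None else environ
    return environ.get("SM_OUTPUT_DATA_DIR") or DEFAULT_OUTPUT_DATA_DIR
```

Compared with indexing `os.environ` directly, this avoids a `KeyError` when the variable is missing, at the cost of silently writing to the default location.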


iaroslav-ai · 2021-12-13T19:11:58Z

Thanks all for your comments! I refactored the code that stores the predictions on validation folds into its own class. Unit tests for this class are still missing, but integration tests are passing. I will add unit tests if we are OK with this new class.
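For readers without the diff at hand, a class of this kind might look roughly like the sketch below. It is a hypothetical illustration, not the class added in `prediction_utils.py`: the name `FoldPredictionRecorder`, its methods, and the choice to average repeated predictions per row are all assumptions.

```python
class FoldPredictionRecorder:
    """Hypothetical sketch: accumulate validation-fold predictions per row."""

    def __init__(self, y_true):
        self.y_true = list(y_true)
        self.pred_sum = [0.0] * len(self.y_true)
        self.pred_count = [0] * len(self.y_true)

    def record(self, row_indices, predictions):
        """Store the predictions made for the rows of one validation fold."""
        for i, pred in zip(row_indices, predictions):
            self.pred_sum[i] += pred
            self.pred_count[i] += 1

    def averaged(self):
        """Average repeated predictions; fail if any row was never predicted."""
        if 0 in self.pred_count:
            raise ValueError("some rows have no recorded prediction")
        return [s / c for s, c in zip(self.pred_sum, self.pred_count)]


recorder = FoldPredictionRecorder(y_true=[1, 0, 1])
recorder.record([0, 2], [0.75, 0.5])             # first fold of repetition 1
recorder.record([1], [0.25])                     # second fold of repetition 1
recorder.record([0, 1, 2], [0.25, 0.75, 1.0])    # a second repetition
averages = recorder.averaged()
```

Keeping a per-row counter makes it possible to check, before anything is written out, that every row received at least one validation prediction.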

src/sagemaker_xgboost_container/prediction_utils.py

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

src/sagemaker_xgboost_container/prediction_utils.py

mabunday · 2021-12-15T00:25:07Z

Minor comments, but looks good to me overall. Thanks for the changes!

awsbmillare

I have only a very minor nit, otherwise LGTM. Really nice work, thanks for all the changes.

awsbmillare · 2021-12-16T16:31:25Z

test/integration/local/test_kfold.py

+)
+def test_xgboost_abalone_kfold(dataset, extra_hps, model_file_count, docker_image, opt_ml):
+    hyperparameters = get_abalone_default_hyperparameters()
+    data_path = os.path.join(path, "..", "..", "resources", dataset, "data")


nit: can we reuse data_root above here?

Fixed, thanks!

* Store ground truth and predictions on validation folds to output folder (#230)
* Store ground truth and predictions on validation folds to output folder
* Refactor functionality to store validation set predictions into a separate class
* Update src/sagemaker_xgboost_container/prediction_utils.py
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
* Update src/sagemaker_xgboost_container/prediction_utils.py
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
* Update src/sagemaker_xgboost_container/prediction_utils.py
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
* Update src/sagemaker_xgboost_container/prediction_utils.py
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
* Update src/sagemaker_xgboost_container/prediction_utils.py
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
* Update src/sagemaker_xgboost_container/prediction_utils.py
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
* Add checks for recorded predictions, refactor helper functions
* More accurate final repeated prediction counter check
* Add unit tests for prediction_utils.py
* Using data_root for kfold tests to reduce code duplication
Co-authored-by: Iaroslav Shcherbatyi <siarosla@amazon.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
* upgrading pillow to 9.0.0 to address high security risks from CVE https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22817 https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22816 https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22815
* upgrading pillow to 9.0.0 to address high security risks from CVE https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22817 https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22816 https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22815
* Using keyring to prevent GPG key from being outdated
* Using keyring to prevent GPG key from being outdated
* Bump numpy from 1.20.3 to 1.21.0
* Bump numpy from 1.20.3 to 1.21.0
* Bump numpy from 1.20.3 to 1.21.0
* bump up version to 1.5.2
* Revert "Merge branch '1.5.2-draft'" This reverts commit 633f5bb, reversing changes made to a9524d2.
* Track buildspecs in repo like sklearn container (#248)
Co-authored-by: Brent Millare <bmillare@amazon.com>
* Pin flask dep itsdangerous (#251)
Co-authored-by: Brent Millare <bmillare@amazon.com>
* Add metrics for MultiClass (#250)
* Add metrics for MultiClass
* --amend
Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>
* Add multiclass metrics (#252)
* Add metrics for MultiClass
* --amend
* Allow new metrics for eval_metric
Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>
* pin markupsafe jinja2
* Using ensemble flag for multi-model endpoint
* Add accept encodings to alg_mode (#270)
Co-authored-by: Brent Millare <bmillare@amazon.com>
* Add missing integ test
* added HP for `sampling_method`
* added HP for `prob_buffer_row` remove after fixed in console
* Fix JSON output format
* removing duplicate files from data_path while creating symlink
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* not creating duplicate symlinks to resolve FileExistsError
* Add warnings when validation files are suspected to be identical with training files (#273)
* Add warnings when validation files are suspected to be identical with training files
* Resolving comments
Co-authored-by: Iaroslav Shcherbatyi <iaroslav.github@gmail.com>
Co-authored-by: Iaroslav Shcherbatyi <siarosla@amazon.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Nikhil Raverkar <nraverka@amazon.com>
Co-authored-by: haixiw <haixiw@amazon.com>
Co-authored-by: Brent Millare <69818968+awsbmillare@users.noreply.github.com>
Co-authored-by: Brent Millare <bmillare@amazon.com>
Co-authored-by: Kritt <krittaphat.pug@gmail.com>
Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>
Co-authored-by: Dewan Choudhury <cdewan@amazon.com>
Co-authored-by: Haixin Wang <98612668+haixiw@users.noreply.github.com>

Store ground truth and predictions on validation folds to output folder

4d2d8a6

mabunday self-requested a review December 10, 2021 15:05

awsbmillare self-requested a review December 10, 2021 15:36

mabunday approved these changes Dec 10, 2021

awsbmillare reviewed Dec 10, 2021

Refactor functionality to store validation set predictions into a separate class

b34d02d

mabunday suggested changes Dec 13, 2021

iaroslav-ai and others added 7 commits December 14, 2021 17:30

Update src/sagemaker_xgboost_container/prediction_utils.py

301887e

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

Update src/sagemaker_xgboost_container/prediction_utils.py

70e4357

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

Update src/sagemaker_xgboost_container/prediction_utils.py

452cbb0

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

Update src/sagemaker_xgboost_container/prediction_utils.py

61bb822

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

Update src/sagemaker_xgboost_container/prediction_utils.py

d3d6fcb

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

Update src/sagemaker_xgboost_container/prediction_utils.py

f3d7821

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

Add checks for recorded predictions, refactor helper functions

b5c1031

iaroslav-ai requested a review from mabunday December 14, 2021 19:38

More accurate final repeated prediction counter check

15832a0

iaroslav-ai requested a review from awsbmillare December 14, 2021 19:46

mabunday suggested changes Dec 15, 2021

mabunday approved these changes Dec 15, 2021

Add unit tests for prediction_utils.py

96e26f0

mabunday approved these changes Dec 15, 2021

awsbmillare approved these changes Dec 16, 2021

Using data_root for kfold tests to reduce code duplication

37c4c74

mabunday merged commit f77551e into aws:master Dec 16, 2021

iaroslav-ai mentioned this pull request Jan 20, 2022

Backporting to 1.2-2 version of the container - store ground truth and predictions on validation folds to output folder #236

Merged
