Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store ground truth and predictions on validation folds to output folder #230

Merged
merged 12 commits into from
Dec 16, 2021

Conversation

iaroslav-ai
Copy link
Contributor

Description of changes:
Stores predictions of xgboost model on cross validation folds to output folder, so that those predictions can be used for any postprocessing.

Our use case is to calculate confusion matrix and a few other statistics for model produced by this container; This requires predictions data that is not used for training. Currently this container with cv enabled loads all data in memory (training + validation) and uses both partitions for training of model ensemble. As we only have training / validation partition (and cannot easily add extra test partition) this change is storing predictions similar to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html, which is the only way for us to be able to calculate correctly performance metrics after the container has terminated.

To store additional outputs, the feature of training platform to output additional arguments is used: https://github.com/aws/sagemaker-containers#sm-output-data-dir .

For output folder, the data is written into “data” subdirectory following convention here: https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/environment.py#L136 .

Testing performed:

  • Unit tests passing
  • Integration tests passing
  • Run xgboost training on ~166 benchmarking datasets

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mabunday mabunday self-requested a review December 10, 2021 15:05
@awsbmillare awsbmillare self-requested a review December 10, 2021 15:36
Copy link
Contributor

@mabunday mabunday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me aside from some very minor comments.

src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
@@ -232,6 +234,8 @@ def train_job(train_cfg, train_dmatrix, val_dmatrix, train_val_dmatrix, model_di

else:
num_cv_round = train_cfg.pop("_num_cv_round", 1)
additional_output_path = os.environ[SM_OUTPUT_DATA_DIR]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you confirmed if this ENV var is always set? Might be a good idea to fallback to a sane default here. Alternatively, we'd need to check later if features are required and this is not set, then throw error there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be always set by the training platform https://github.com/aws/sagemaker-containers#sm-output-data-dir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will double check

src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/algorithm_mode/train.py Outdated Show resolved Hide resolved
@iaroslav-ai
Copy link
Contributor Author

Thanks all for your comments! I refactored the code for storage of the predictions on validation folds into its own class. Unittests are missing for the functionality of this class, but integration tests are passing. I will add unittests if we are ok with this new class.

src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
src/sagemaker_xgboost_container/prediction_utils.py Outdated Show resolved Hide resolved
iaroslav-ai and others added 7 commits December 14, 2021 17:30
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
@mabunday
Copy link
Contributor

Minor comments but looks good to me overall. Thanks for the changed!

Copy link
Contributor

@awsbmillare awsbmillare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only a very minor nit, otherwise LGTM. Really nice work, thanks for all the changes.

)
def test_xgboost_abalone_kfold(dataset, extra_hps, model_file_count, docker_image, opt_ml):
hyperparameters = get_abalone_default_hyperparameters()
data_path = os.path.join(path, "..", "..", "resources", dataset, "data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we reuse data_root above here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

@mabunday mabunday merged commit f77551e into aws:master Dec 16, 2021
NikhilRaverkar added a commit that referenced this pull request Apr 27, 2022
* Store ground truth and predictions on validation folds to output folder (#230)

* Store ground truth and predictions on validation folds to output folder

* Refactor functionality to store validation set predictions into a separate class

* Update src/sagemaker_xgboost_container/prediction_utils.py

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

* Update src/sagemaker_xgboost_container/prediction_utils.py

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

* Update src/sagemaker_xgboost_container/prediction_utils.py

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

* Update src/sagemaker_xgboost_container/prediction_utils.py

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

* Update src/sagemaker_xgboost_container/prediction_utils.py

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

* Update src/sagemaker_xgboost_container/prediction_utils.py

Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

* Add checks for recorded predictions, refactor helper functions

* More accurate final repeated prediction counter check

* Add unit tests for prediction_utils.py

* Using data_root for kfold tests to reduce code duplication

Co-authored-by: Iaroslav Shcherbatyi <siarosla@amazon.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>

* upgrading pillow to 9.0.0 to address high security risks from CVE
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22817
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22816
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22815

* upgrading pillow to 9.0.0 to address high security risks from CVE
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22817
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22816
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22815

* Using keyring to prevent GPG key from being outdated

* Using keyring to prevent GPG key from being outdated

* Bump numpy from 1.20.3 to 1.21.0

* Bump numpy from 1.20.3 to 1.21.0

* Bump numpy from 1.20.3 to 1.21.0

* bump up version to 1.5.2

* Revert "Merge branch '1.5.2-draft'"

This reverts commit 633f5bb, reversing
changes made to a9524d2.

* Track buildspecs in repo like sklearn container (#248)

Co-authored-by: Brent Millare <bmillare@amazon.com>

* Pin flask dep itsdangerous (#251)

Co-authored-by: Brent Millare <bmillare@amazon.com>

* Add metrics for MultiClass (#250)

* Add metrics for MultiClass

* --amend

Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>

* Add multiclass metrics (#252)

* Add metrics for MultiClass

* --amend

* Allow new metrics for eval_metric

Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>

* pin markupsafe jinja2

* Using ensemble flag for multi-model endpoint

* Add accept encodings to alg_mode (#270)

Co-authored-by: Brent Millare <bmillare@amazon.com>

* Add missing integ test

* added HP for `sampling_method`

* added HP for `prob_buffer_row` remove after fixed in console

* Fix JSON output format

* removing duplicate files from data_path while creating symlink

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* not creating duplicate symlinks to resolve FileExistsError

* Add warnings when validation files are suspected to be identical with training files (#273)

* Add warnings when validation files are suspected to be identical with training files

* Resolving comments

Co-authored-by: Iaroslav Shcherbatyi <iaroslav.github@gmail.com>
Co-authored-by: Iaroslav Shcherbatyi <siarosla@amazon.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Nikhil Raverkar <nraverka@amazon.com>
Co-authored-by: haixiw <haixiw@amazon.com>
Co-authored-by: Brent Millare <69818968+awsbmillare@users.noreply.github.com>
Co-authored-by: Brent Millare <bmillare@amazon.com>
Co-authored-by: Kritt <krittaphat.pug@gmail.com>
Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>
Co-authored-by: Dewan Choudhury <cdewan@amazon.com>
Co-authored-by: Haixin Wang <98612668+haixiw@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants