Store ground truth and predictions on validation folds to output folder #230
Conversation
Looks good to me aside from some very minor comments.
@@ -232,6 +234,8 @@ def train_job(train_cfg, train_dmatrix, val_dmatrix, train_val_dmatrix, model_di
    else:
        num_cv_round = train_cfg.pop("_num_cv_round", 1)
        additional_output_path = os.environ[SM_OUTPUT_DATA_DIR]
Have you confirmed that this env var is always set? It might be a good idea to fall back to a sane default here. Alternatively, we would need to check later whether features are required while this is not set, and throw an error there.
I think this should always be set by the training platform: https://github.com/aws/sagemaker-containers#sm-output-data-dir
I will double check
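The defensive pattern the reviewer suggests could look like the sketch below. The fallback path `/opt/ml/output/data` is the value documented for SageMaker training containers, used here as an assumption rather than something taken from this PR's diff:

```python
import os

# Fall back to the documented SageMaker default instead of raising KeyError
# when SM_OUTPUT_DATA_DIR is not set in the environment.
additional_output_path = os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data")
```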
Thanks all for your comments! I refactored the code for storing predictions on validation folds into its own class. Unit tests for this class are still missing, but integration tests pass. I will add unit tests if we are OK with this new class.
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Minor comments, but looks good to me overall. Thanks for the changes!
I have only a very minor nit, otherwise LGTM. Really nice work, thanks for all the changes.
test/integration/local/test_kfold.py (outdated)
)
def test_xgboost_abalone_kfold(dataset, extra_hps, model_file_count, docker_image, opt_ml):
    hyperparameters = get_abalone_default_hyperparameters()
    data_path = os.path.join(path, "..", "..", "resources", dataset, "data")
nit: can we reuse data_root above here?
Fixed, thanks!
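The nit amounts to deriving the dataset path from a shared root rather than rejoining the full relative path in each test. A hypothetical sketch, where `path` stands in for the test module's directory and `data_root` is assumed to be defined once at module scope:

```python
import os

# Hypothetical: a single module-level resources root that tests can reuse.
path = os.getcwd()  # stand-in for the test module's directory
data_root = os.path.join(path, "..", "..", "resources")

def dataset_data_path(dataset):
    # Reuse data_root instead of repeating os.path.join(path, "..", "..", ...)
    return os.path.join(data_root, dataset, "data")
```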
* Store ground truth and predictions on validation folds to output folder (#230)
* Refactor functionality to store validation set predictions into a separate class
* Update src/sagemaker_xgboost_container/prediction_utils.py (Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>)
* Add checks for recorded predictions, refactor helper functions
* More accurate final repeated prediction counter check
* Add unit tests for prediction_utils.py
* Using data_root for kfold tests to reduce code duplication (Co-authored-by: Iaroslav Shcherbatyi <siarosla@amazon.com>, Mark Bunday <15115482+mabunday@users.noreply.github.com>)
* Upgrading pillow to 9.0.0 to address high-severity CVEs: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22817 , https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22816 , https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-22815
* Using keyring to prevent GPG key from being outdated
* Bump numpy from 1.20.3 to 1.21.0
* Bump up version to 1.5.2
* Revert "Merge branch '1.5.2-draft'" (reverts commit 633f5bb, reversing changes made to a9524d2)
* Track buildspecs in repo like sklearn container (#248) (Co-authored-by: Brent Millare <bmillare@amazon.com>)
* Pin flask dep itsdangerous (#251) (Co-authored-by: Brent Millare <bmillare@amazon.com>)
* Add metrics for MultiClass (#250) (Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>)
* Add multiclass metrics (#252): allow new metrics for eval_metric (Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>)
* Pin markupsafe, jinja2
* Using ensemble flag for multi-model endpoint
* Add accept encodings to alg_mode (#270) (Co-authored-by: Brent Millare <bmillare@amazon.com>)
* Add missing integ test
* Added HP for `sampling_method`
* Added HP for `prob_buffer_row` (remove after fixed in console)
* Fix JSON output format
* Removing duplicate files from data_path while creating symlink
* Not creating duplicate symlinks to resolve FileExistsError
* Add warnings when validation files are suspected to be identical with training files (#273)

Co-authored-by: Iaroslav Shcherbatyi <iaroslav.github@gmail.com>
Co-authored-by: Iaroslav Shcherbatyi <siarosla@amazon.com>
Co-authored-by: Mark Bunday <15115482+mabunday@users.noreply.github.com>
Co-authored-by: Nikhil Raverkar <nraverka@amazon.com>
Co-authored-by: haixiw <haixiw@amazon.com>
Co-authored-by: Brent Millare <69818968+awsbmillare@users.noreply.github.com>
Co-authored-by: Brent Millare <bmillare@amazon.com>
Co-authored-by: Kritt <krittaphat.pug@gmail.com>
Co-authored-by: Krittaphat Pugdeethosapol <krittp@amazon.com>
Co-authored-by: Dewan Choudhury <cdewan@amazon.com>
Co-authored-by: Haixin Wang <98612668+haixiw@users.noreply.github.com>
Description of changes:
Stores predictions of the xgboost model on cross-validation folds to the output folder, so that those predictions can be used for any postprocessing.
Our use case is to calculate a confusion matrix and a few other statistics for the model produced by this container; this requires prediction data that was not used for training. Currently, with CV enabled, this container loads all data into memory (training + validation) and uses both partitions to train the model ensemble. As we only have training/validation partitions (and cannot easily add an extra test partition), this change stores out-of-fold predictions, similar to https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html, which is the only way for us to correctly calculate performance metrics after the container has terminated.
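The idea can be sketched in a minimal, library-free form: every sample receives exactly one prediction, made by the fold model that did not see that sample during training. `train_fn` and `predict_fn` are placeholders for the real XGBoost training and inference calls, and the round-robin fold assignment is purely illustrative:

```python
def out_of_fold_predictions(data, labels, k, train_fn, predict_fn):
    # Collect one out-of-fold prediction per sample, as cross_val_predict does.
    n = len(data)
    preds = [None] * n
    for fold in range(k):
        # Round-robin fold assignment, for illustration only.
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = train_fn([data[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in val_idx:
            # The model trained on the other folds predicts this sample.
            preds[i] = predict_fn(model, data[i])
    return preds
```

Metrics such as a confusion matrix computed on these predictions are then honest estimates, since no sample is scored by a model that trained on it.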
To store additional outputs, the training platform's facility for writing additional artifacts is used: https://github.com/aws/sagemaker-containers#sm-output-data-dir .
Within the output folder, the data is written into the “data” subdirectory, following the convention here: https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/environment.py#L136 .
Testing performed:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.