-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fixed errors Case2272044573 #294
Conversation
if train_path == val_path or os.path.basename(train_path) == os.path.basename(val_path): | ||
logger.warning('Found same path for training and validation. This is not recommended and results may not ' | ||
'be correct.') | ||
elif not is_pipe: | ||
# Check if there is potential data redundancy between training and validation sets | ||
check_data_redundancy(train_path, val_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? This is saying
- If we suspect data and validation redundancy => log a warning
- Else (we DON'T suspect data and validation redundancy) AND we're in file mode => check data redundancy
Shouldn't the call to check_data_redundancy
fall under the same parent condition as the path check and the warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with @haixiw
The idea was:
- If we KNOW that the data and validation paths are same, then log a warning (log a warning soon as we can)
- Else(Shallow path check did not find redundancy and we want file level comparison) AND if we are in file mode , run the file-by-file comparison
We want to run check_data_redundancy
only when necessary, that's why it's under "else". We don't want to run this file-by-file check on "pipe" mode that's why the condition.
""" | ||
:param data_path: Either directory or file | ||
""" | ||
# For pipe mode, we leverages mlio directly by creating a list of SageMakerPipe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this comment means or what its purpose is. Let's re-word it or remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove in next rev.
:param data_path: Either directory or file | ||
""" | ||
# In file mode, we create a temp directory with symlink to all input files or | ||
# directories to meet XGB's assumption that all files are in the same directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Optional) Since we're referencing an assumption XGB makes, it would be helpful to include a doc link to that assumption.
os.symlink(path, base_name) | ||
|
||
|
||
def check_data_redundancy(train_path, validate_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function only runs once per training job, right? Do we have an idea of how it performs with very large training or validation data sets? Would it be enough just to log the warning that the paths are the same without doing a file by file comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this with @haixiw
- Yes this only runs once per training job
2.He had a discussion earlier regarding this with Brent, since we are just comparing file name and file size, this should be insignificant as compared to creating DMatrix for huge training and validation data.
@@ -225,3 +226,19 @@ def test_parse_protobuf_dmatrix_single_feature_label(self): | |||
pb_path = os.path.join(self.data_path, 'recordio_protobuf', file_path) | |||
reader = data_utils.get_recordio_protobuf_dmatrix | |||
self._check_dmatrix(reader, pb_path, 1, 1) | |||
|
|||
@patch("logging.warning") | |||
def test_check_data_redundancy_positive(self, mock_log_warning): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add additional test(s) to cover file and pipe mode differences since that failed tests last time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure will add a couple of unit tests to verify we are calling this only from file mode and not from pipe mode.
Issue #, if available:
Fixed an issues with Case2272044573
The issue was we were doing the data redundancy check, which relies on files for Pipe mode as well.
Description of changes:
bb integration --image=515193369038.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1.0.41-nraverka-test --stage=dev
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.