Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed errors Case2272044573 #294

Merged
merged 5 commits into from
Jun 15, 2022
Merged

fixed errors Case2272044573 #294

merged 5 commits into from
Jun 15, 2022

Conversation

NikhilRaverkar
Copy link
Contributor

@NikhilRaverkar NikhilRaverkar commented Jun 7, 2022

Issue #, if available:
Fixed an issues with Case2272044573
The issue was we were doing the data redundancy check, which relies on files for Pipe mode as well.

Description of changes:

  • Added a condition to not run file comparison check for Pipe mode
  • All integ tests passed. bb integration --image=515193369038.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1.0.41-nraverka-test --stage=dev
----------------------------------------------------------- generated xml file: /workplace/nraverka/SMFrameworkXGBoost/src/SMFrameworksXGBoost1_3-1Tests/build/brazil-integ-tests/TEST-results.xml -----------------------------------------------------------
---------------------------- generated html file: /workplace/nraverka/SMFrameworkXGBoost/build/SMFrameworksXGBoost1_3-1Tests/SMFrameworksXGBoost1_3-1Tests-1.0/AL2_x86_64/DEV.STD.PTHREAD/build/brazil-unit-tests/python3.7.html -----------------------------
--------------------------- generated xml file: /workplace/nraverka/SMFrameworkXGBoost/build/SMFrameworksXGBoost1_3-1Tests/SMFrameworksXGBoost1_3-1Tests-1.0/AL2_x86_64/DEV.STD.PTHREAD/build/brazil-unit-tests/TEST-python3.7.xml ---------------------------
---------------------------- generated html file: /workplace/nraverka/SMFrameworkXGBoost/build/SMFrameworksXGBoost1_3-1Tests/SMFrameworksXGBoost1_3-1Tests-1.0/AL2_x86_64/DEV.STD.PTHREAD/build/brazil-unit-tests/python3.7.html -----------------------------
====================================================================================================================== warnings summary ======================================================================================================================
:68
  'pytest_runtest_protocol' hook uses deprecated __multicall__ argument

-- Docs: http://doc.pytest.org/en/latest/warnings.html
========================================================================================================== 122 passed, 1 warnings in 921.17 seconds ==========================================================================================================
BUILD SUCCEEDED


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@NikhilRaverkar NikhilRaverkar marked this pull request as ready for review June 8, 2022 18:42
Comment on lines +144 to +149
if train_path == val_path or os.path.basename(train_path) == os.path.basename(val_path):
logger.warning('Found same path for training and validation. This is not recommended and results may not '
'be correct.')
elif not is_pipe:
# Check if there is potential data redundancy between training and validation sets
check_data_redundancy(train_path, val_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? This is saying

  1. If we suspect data and validation redundancy => log a warning
  2. Else (we DON'T suspect data and validation redundancy) AND we're in file mode => check data redundancy

Shouldn't the call to check_data_redundancy fall under the same parent condition as the path check and the warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @haixiw
The idea was:

  1. If we KNOW that the data and validation paths are same, then log a warning (log a warning soon as we can)
  2. Else(Shallow path check did not find redundancy and we want file level comparison) AND if we are in file mode , run the file-by-file comparison

We want to run check_data_redundancy only when necessary, that's why it's under "else". We don't want to run this file-by-file check on "pipe" mode that's why the condition.

"""
:param data_path: Either directory or file
"""
# For pipe mode, we leverages mlio directly by creating a list of SageMakerPipe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this comment means or what its purpose is. Let's re-word it or remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove in next rev.

:param data_path: Either directory or file
"""
# In file mode, we create a temp directory with symlink to all input files or
# directories to meet XGB's assumption that all files are in the same directory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Optional) Since we're referencing an assumption XGB makes, it would be helpful to include a doc link to that assumption.

os.symlink(path, base_name)


def check_data_redundancy(train_path, validate_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function only runs once per training job, right? Do we have an idea of how it performs with very large training or validation data sets? Would it be enough just to log the warning that the paths are the same without doing a file by file comparison?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this with @haixiw

  1. Yes this only runs once per training job
    2.He had a discussion earlier regarding this with Brent, since we are just comparing file name and file size, this should be insignificant as compared to creating DMatrix for huge training and validation data.

@@ -225,3 +226,19 @@ def test_parse_protobuf_dmatrix_single_feature_label(self):
pb_path = os.path.join(self.data_path, 'recordio_protobuf', file_path)
reader = data_utils.get_recordio_protobuf_dmatrix
self._check_dmatrix(reader, pb_path, 1, 1)

@patch("logging.warning")
def test_check_data_redundancy_positive(self, mock_log_warning):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add additional test(s) to cover file and pipe mode differences since that failed tests last time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure will add a couple of unit tests to verify we are calling this only from file mode and not from pipe mode.

@NikhilRaverkar NikhilRaverkar merged commit 7da4fd6 into 1.3-1 Jun 15, 2022
@mabunday mabunday deleted the Case2272044573 branch August 4, 2022 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants