fixed errors Case2272044573 #294

NikhilRaverkar · 2022-06-07T20:59:06Z

Issue #, if available:
Fixed an issues with Case2272044573
The issue was we were doing the data redundancy check, which relies on files for Pipe mode as well.

Description of changes:

Added a condition to not run file comparison check for Pipe mode
All integ tests passed. bb integration --image=515193369038.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1.0.41-nraverka-test --stage=dev

----------------------------------------------------------- generated xml file: /workplace/nraverka/SMFrameworkXGBoost/src/SMFrameworksXGBoost1_3-1Tests/build/brazil-integ-tests/TEST-results.xml -----------------------------------------------------------
---------------------------- generated html file: /workplace/nraverka/SMFrameworkXGBoost/build/SMFrameworksXGBoost1_3-1Tests/SMFrameworksXGBoost1_3-1Tests-1.0/AL2_x86_64/DEV.STD.PTHREAD/build/brazil-unit-tests/python3.7.html -----------------------------
--------------------------- generated xml file: /workplace/nraverka/SMFrameworkXGBoost/build/SMFrameworksXGBoost1_3-1Tests/SMFrameworksXGBoost1_3-1Tests-1.0/AL2_x86_64/DEV.STD.PTHREAD/build/brazil-unit-tests/TEST-python3.7.xml ---------------------------
---------------------------- generated html file: /workplace/nraverka/SMFrameworkXGBoost/build/SMFrameworksXGBoost1_3-1Tests/SMFrameworksXGBoost1_3-1Tests-1.0/AL2_x86_64/DEV.STD.PTHREAD/build/brazil-unit-tests/python3.7.html -----------------------------
====================================================================================================================== warnings summary ======================================================================================================================
:68
  'pytest_runtest_protocol' hook uses deprecated __multicall__ argument

-- Docs: http://doc.pytest.org/en/latest/warnings.html
========================================================================================================== 122 passed, 1 warnings in 921.17 seconds ==========================================================================================================
BUILD SUCCEEDED

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mabunday · 2022-06-08T18:48:32Z

src/sagemaker_xgboost_container/algorithm_mode/train.py

+        if train_path == val_path or os.path.basename(train_path) == os.path.basename(val_path):
+            logger.warning('Found same path for training and validation. This is not recommended and results may not '
+                           'be correct.')
+        elif not is_pipe:
+            # Check if there is potential data redundancy between training and validation sets
+            check_data_redundancy(train_path, val_path)


Is this correct? This is saying

If we suspect data and validation redundancy => log a warning

Else (we DON'T suspect data and validation redundancy) AND we're in file mode => check data redundancy

Shouldn't the call to check_data_redundancy fall under the same parent condition as the path check and the warning?

Discussed with @haixiw
The idea was:

If we KNOW that the data and validation paths are same, then log a warning (log a warning soon as we can)

Else(Shallow path check did not find redundancy and we want file level comparison) AND if we are in file mode , run the file-by-file comparison

We want to run check_data_redundancy only when necessary, that's why it's under "else". We don't want to run this file-by-file check on "pipe" mode that's why the condition.

mabunday · 2022-06-08T18:49:05Z

src/sagemaker_xgboost_container/data_utils.py

+    """
+    :param data_path: Either directory or file
+    """
+    # For pipe mode, we leverages mlio directly by creating a list of SageMakerPipe.


I'm not sure what this comment means or what its purpose is. Let's re-word it or remove it.

will remove in next rev.

mabunday · 2022-06-08T18:49:56Z

src/sagemaker_xgboost_container/data_utils.py

+    :param data_path: Either directory or file
+    """
+    # In file mode, we create a temp directory with symlink to all input files or
+    # directories to meet XGB's assumption that all files are in the same directory.


(Optional) Since we're referencing an assumption XGB makes, it would be helpful to include a doc link to that assumption.

mabunday · 2022-06-08T18:52:06Z

src/sagemaker_xgboost_container/data_utils.py

+    os.symlink(path, base_name)
+
+
+def check_data_redundancy(train_path, validate_path):


This function only runs once per training job, right? Do we have an idea of how it performs with very large training or validation data sets? Would it be enough just to log the warning that the paths are the same without doing a file by file comparison?

Discussed this with @haixiw

Yes this only runs once per training job
2.He had a discussion earlier regarding this with Brent, since we are just comparing file name and file size, this should be insignificant as compared to creating DMatrix for huge training and validation data.

mabunday · 2022-06-08T18:52:44Z

test/unit/test_data_utils.py

@@ -225,3 +226,19 @@ def test_parse_protobuf_dmatrix_single_feature_label(self):
                pb_path = os.path.join(self.data_path, 'recordio_protobuf', file_path)
                reader = data_utils.get_recordio_protobuf_dmatrix
                self._check_dmatrix(reader, pb_path, 1, 1)
+
+    @patch("logging.warning")
+    def test_check_data_redundancy_positive(self, mock_log_warning):


Let's add additional test(s) to cover file and pipe mode differences since that failed tests last time.

sure will add a couple of unit tests to verify we are calling this only from file mode and not from pipe mode.

Nikhil Raverkar added 3 commits June 7, 2022 16:55

fixed errors Case2272044573

29d471b

fixed errors Case2272044573

6824795

fixed errors Case2272044573

4a4f980

NikhilRaverkar requested review from mabunday and dewan-c June 8, 2022 18:39

NikhilRaverkar marked this pull request as ready for review June 8, 2022 18:42

mabunday suggested changes Jun 8, 2022

View reviewed changes

fixed errors Case2272044573

9763de9

mabunday approved these changes Jun 15, 2022

View reviewed changes

fixed errors Case2272044573

9b710b2

NikhilRaverkar requested a review from haixiw June 15, 2022 17:39

dewan-c approved these changes Jun 15, 2022

View reviewed changes

NikhilRaverkar merged commit 7da4fd6 into 1.3-1 Jun 15, 2022

mabunday deleted the Case2272044573 branch August 4, 2022 20:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixed errors Case2272044573 #294

fixed errors Case2272044573 #294

NikhilRaverkar commented Jun 7, 2022 •

edited

mabunday Jun 8, 2022

NikhilRaverkar Jun 14, 2022

mabunday Jun 8, 2022

NikhilRaverkar Jun 14, 2022

mabunday Jun 8, 2022

mabunday Jun 8, 2022

NikhilRaverkar Jun 14, 2022

mabunday Jun 8, 2022

NikhilRaverkar Jun 14, 2022

		os.symlink(path, base_name)


		def check_data_redundancy(train_path, validate_path):

fixed errors Case2272044573 #294

fixed errors Case2272044573 #294

Conversation

NikhilRaverkar commented Jun 7, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NikhilRaverkar commented Jun 7, 2022 •

edited