Initial commit for pipe mode support. #36
Conversation
I feel like there are a lot of changes for a single commit. Could we group the changes into smaller, more purposeful commits, perhaps moving the refactoring/indentation changes into a separate commit? Let me know what you think.
Good point. Let me see if I can group this better into a couple of commits. I'll separate it into build-related changes, implementation, and testing.
Added mlio installation to docker build and tox. CSV pipe mode support + unit tests.
Commits have been split into Initial + CSV, Parquet, and RecordIO-Protobuf.
src/sagemaker_xgboost_container/algorithm_mode/channel_validation.py
```python
try:
    table = pq.read_table(files_path)
    data = table.to_pandas()
```
Just a warning, but pandas is really slow. Can we avoid using pandas here?
Agreed. I'm assuming the result is a `numpy.array` if it's all of the same type (can we confirm that behavior?). Since we're only interested in float arrays for the XGBoost DMatrix, can we verify this earlier and enforce `numpy.array` instead of `pandas.DataFrame`?
Unfortunately this seems to be the only method of converting the pyarrow Table into a format that DMatrix can read, which is reflected in file mode training being slower than pipe mode. I'm not certain there's a way to verify and enforce the conversion, but I will look into it.
```python
for root, dirs, files in os.walk(data_path):
    if dirs == []:
        files_path = root
        break
if content_type.lower() == CSV:
    dmatrix = get_csv_dmatrix(files_path, csv_weights)
if is_pipe:
```
Ideally this logic would look something like:

```python
reader = get_content_reader('csv')  # returns the correct reader for the format, e.g. CSVReader for csv
dmatrix = get_dmatrix(reader, args, pipe)
```

I know that requires changing code out of scope for this task, but can we at least add a TODO?
There would need to be a few prerequisite changes in order for this to work:
- Use mlio for csv file mode
- Have a data reader for libsvm (not yet supported)
- Have a data reader for parquet (the current ParquetRecordReader is actually an iterator over the records and does not follow the same interface as the other readers)
Should I leave a todo as a comment?
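For reference, the factory shape suggested above could look roughly like this once those prerequisites land. All names here (`ContentReader`, `CSVReader`, `get_content_reader`) are illustrative sketches, not the container's actual API:

```python
import csv
from abc import ABC, abstractmethod

class ContentReader(ABC):
    """Common interface every format-specific reader would implement."""
    @abstractmethod
    def read(self, path):
        """Return rows of floats suitable for building a DMatrix."""

class CSVReader(ContentReader):
    def read(self, path):
        with open(path) as f:
            return [[float(value) for value in row] for row in csv.reader(f)]

# Registry mapping content types to reader classes; libsvm and
# parquet entries would be added once readers exist for them.
_READERS = {"csv": CSVReader}

def get_content_reader(content_type):
    try:
        return _READERS[content_type]()
    except KeyError:
        raise ValueError("unsupported content type: %s" % content_type)
```

`get_dmatrix(reader, args, pipe)` would then depend only on the `ContentReader` interface, so supporting a new format means registering one more reader rather than extending an if/elif chain.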
Force-pushed from 1d58497 to aee26e5.
```python
from mlio.integ.numpy import as_numpy
from mlio.integ.arrow import as_arrow_file

BATCH_SIZE = 4000
```
Where does this come from?
I arbitrarily chose a value, since we need to specify a batch size for the mlio data readers. I have not yet tested whether this value affects memory consumption.
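For context, the batch size bounds how many records are materialized at once, which is the knob a memory experiment would turn. A generic sketch of the idea (plain Python, not the mlio API):

```python
BATCH_SIZE = 4000  # arbitrary default, as discussed above; worth benchmarking

def iter_batches(records, batch_size=BATCH_SIZE):
    """Yield fixed-size lists of records so peak memory is bounded
    by one batch rather than the whole dataset."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

A larger batch trades memory for fewer per-batch conversion calls; benchmarking a few values against training-set size would justify the constant.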
Approved with minor comments. No need for additional review
Force-pushed from 9263fd1 to 0703598.
…on feedback. Added doc strings. Improved parquet file mode memory efficiency. Improved recordio-protobuf memory efficiency.
Description of changes:
Added support for training with CSV pipe mode, Parquet file and pipe mode, and RecordIO-Protobuf file and pipe mode.
Unit tests were added for the new data reading methods; sagemaker-pipe.py from https://github.com/ishaaq/sagemaker-pipe was added to emulate pipe mode during unit tests.
NOTE: Integration tests for checkpointing, inference, and HPO are failing when run on a local image. The error produced by the checkpointing tests is the same as before boto3 was updated to support `CheckpointConfig`, yet pip freeze shows the latest version of boto3 installed in the container. The HPO test fails due to a missing metric definition, which also doesn't make sense. The inference tests simply fail during endpoint creation. I'm not sure what the cause is; don't merge until resolved.

UPDATE: With an image built via CodeBuild, the inference and HPO tests succeed, and the checkpointing tests succeed after creating a new workspace, so these failures were not due to any change in this PR.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.