
Initial commit for pipe mode support. #36

Merged
merged 4 commits into from
Oct 14, 2019

Conversation

aws-patlin
Contributor

@aws-patlin aws-patlin commented Oct 7, 2019

Description of changes:
Added support for training with CSV pipe mode, Parquet file and pipe mode, and RecordIO-Protobuf file and pipe mode.

Unit tests were added for the new data reading methods; sagemaker-pipe.py from https://github.com/ishaaq/sagemaker-pipe was added to emulate pipe mode during unit tests.

NOTE: Integration tests for checkpointing, inference, and HPO are failing when run on a local image. The error produced by the checkpointing tests is the same as before boto3 was updated to support CheckpointConfig, even though pip freeze shows the latest version of boto3 installed in the container. The HPO test is failing due to a missing metric definition, which also doesn't make sense. The inference tests simply fail during endpoint creation. I'm not sure what is causing this; don't merge until it's resolved.

UPDATE: With an image built via CodeBuild, the inference and HPO tests succeed, and the checkpointing tests succeed after creating a new workspace, so these failures were not due to any change in this PR.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@rizwangilani
Contributor

I feel like there are a lot of changes for a single commit. Could we group the changes into smaller, more purposeful commits? Perhaps move the refactoring/indentation changes to a separate commit? Let me know what you think.

@aws-patlin
Contributor Author

> I feel like there are a lot of changes for a single commit. Could we group the changes into smaller/more purposeful commits?

Good point. Let me see if I can group this better into a couple of commits. I'll separate it into build-related changes, implementation, and testing.

Added mlio installation to docker build and tox.

CSV pipe mode support + unit tests.
@aws-patlin
Contributor Author

Commits have been split into Initial + CSV, Parquet, and RecordIO-Protobuf.

try:
    table = pq.read_table(files_path)

    data = table.to_pandas()
Contributor


Just a warning, but pandas is really slow. Can we avoid using pandas here?

Contributor


Agreed. I'm assuming the result is a numpy.array if the columns are all of the same type (can we confirm that behavior?). Since we're only interested in float arrays for the XGBoost DMatrix, can we verify this earlier and enforce numpy.array instead of pandas.DataFrame?

Contributor Author


Unfortunately, this seems to be the only method of converting the pyarrow Table into a format that DMatrix can read; this is reflected in file mode training being slower than pipe mode. I'm not certain there's a way to verify and enforce the conversion, but I will look into it.
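For reference, the dtype check the reviewer is asking for can be made explicit with a small pandas/numpy-only sketch (the `ensure_float_matrix` helper below is hypothetical, not part of this PR): enforcing a float ndarray up front surfaces mixed-type columns immediately instead of failing later inside XGBoost.

```python
import numpy as np
import pandas as pd


def ensure_float_matrix(df):
    """Hypothetical helper: enforce a float numpy array before building a DMatrix.

    pyarrow's Table.to_pandas() returns a DataFrame; converting it to a
    single float ndarray early makes non-numeric columns fail fast
    (astype raises ValueError) rather than deep inside XGBoost.
    """
    arr = df.to_numpy()
    if not np.issubdtype(arr.dtype, np.floating):
        arr = arr.astype(np.float64)
    return arr


# With mixed int/float numeric columns, pandas already upcasts to float64.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})
mat = ensure_float_matrix(df)
```

The resulting `mat` could then be passed straight to `xgboost.DMatrix`, skipping the intermediate DataFrame checks.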

test/unit/test_data_utils.py (resolved thread)
for root, dirs, files in os.walk(data_path):
    if dirs == []:
        files_path = root
        break
if content_type.lower() == CSV:
    dmatrix = get_csv_dmatrix(files_path, csv_weights)
    if is_pipe:
Contributor


Ideally this logic would look something like:

reader = get_content_reader('csv')  # returns the correct reader for the format, e.g. CSVReader for csv
dmatrix = get_dmatrix(reader, args, pipe)

I know that requires changing code out of scope for this task, but can we at least add a TODO?

Contributor Author


There would need to be a few prerequisite changes for this to work:

  • Use mlio for CSV file mode
  • Have a data reader for libsvm (not yet supported)
  • Have a data reader for Parquet (the current ParquetRecordReader is actually an iterator over the records and does not follow the same interface as the other readers)

Should I leave a TODO as a comment?
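A rough sketch of the registry-based dispatch the reviewer describes might look like the following; all names here are illustrative, not the container's actual API, and the readers would still need the prerequisite changes listed above.

```python
# Hypothetical sketch of content-type -> reader dispatch; names are illustrative.
_READERS = {}


def register_reader(content_type):
    """Decorator that maps a content type string to a reader class."""
    def decorator(cls):
        _READERS[content_type] = cls
        return cls
    return decorator


@register_reader("csv")
class CSVReader:
    def read(self, path):
        # A real implementation would stream records from `path`.
        raise NotImplementedError


def get_content_reader(content_type):
    """Return the reader instance for a content type, or raise ValueError."""
    try:
        return _READERS[content_type.lower()]()
    except KeyError:
        raise ValueError(f"Unsupported content type: {content_type}")


reader = get_content_reader("CSV")
```

With this shape, a single `get_dmatrix(reader, args, pipe)` entry point could stay content-type agnostic, which is the refactor the TODO would track.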


src/sagemaker_xgboost_container/data_utils.py (outdated, resolved thread)
src/sagemaker_xgboost_container/data_utils.py (resolved thread)
from mlio.integ.numpy import as_numpy
from mlio.integ.arrow import as_arrow_file

BATCH_SIZE = 4000
Contributor


Where does this come from?

Contributor Author


I chose the value arbitrarily, since we need to specify a batch size for the mlio data readers. I haven't yet experimented to see whether this value affects memory consumption.
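The trade-off behind a batch size like this can be illustrated with a plain generator (a generic sketch, not mlio's API): peak memory per chunk scales with the batch size times the row width, while very small batches add per-batch overhead.

```python
import numpy as np

BATCH_SIZE = 4000  # the PR's arbitrarily chosen default


def iter_batches(rows, batch_size=BATCH_SIZE):
    """Yield fixed-size row chunks; peak memory per chunk ~ batch_size * row width."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# 9000 rows of 2 float32 columns -> chunks of 4000, 4000, and 1000 rows.
data = np.arange(18_000, dtype=np.float32).reshape(-1, 2)
chunks = list(iter_batches(data))
rebuilt = np.vstack(chunks)
```

Benchmarking a few batch sizes against container memory limits would be one way to replace the arbitrary constant with a measured default.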

test/utils/sagemaker_pipe.py (outdated, resolved thread)
test/utils/sagemaker_pipe.py (resolved thread)
Contributor

@ericangelokim ericangelokim left a comment


Approved with minor comments. No need for additional review.

@aws-patlin aws-patlin force-pushed the pipe_mode branch 3 times, most recently from 9263fd1 to 0703598 Compare October 11, 2019 22:26
…on feedback.

Added doc strings.

Improved parquet file mode memory efficiency.

Improved recordio-protobuf memory efficiency.
@aws-patlin aws-patlin deleted the pipe_mode branch October 14, 2019 19:49
@aws-patlin aws-patlin restored the pipe_mode branch October 14, 2019 21:06