Initial commit for pipe mode support. #36
Conversation
I feel like there are a lot of changes for a single commit. Could we group the changes into smaller, more purposeful commits, perhaps moving the refactoring/indentation changes into a separate commit? Let me know what you think.
Good point. Let me see if I can group this better into a couple of commits. I'll separate it into build-related changes, implementation, and testing.
Added mlio installation to docker build and tox. CSV pipe mode support + unit tests.
Commits have been split into Initial + CSV, Parquet, and RecordIO-Protobuf.
src/sagemaker_xgboost_container/algorithm_mode/channel_validation.py
```python
try:
    table = pq.read_table(files_path)
    data = table.to_pandas()
```
Just a warning, but pandas is really slow. Can we avoid using pandas here?
Agreed. I'm assuming the result is a `numpy.array` if it's all of the same type (can we confirm that behavior?). Since we're only interested in float arrays for the XGBoost DMatrix, can we verify this earlier and enforce `numpy.array` instead of `pandas.DataFrame`?
Unfortunately this seems to be the only method of converting the pyarrow Table into a format that DMatrix can read, which is reflected in file mode training being slower than pipe mode. I'm not certain there's a way to verify and enforce the conversion, but I will look into it.
```python
for root, dirs, files in os.walk(data_path):
    if dirs == []:
        files_path = root
        break
if content_type.lower() == CSV:
    dmatrix = get_csv_dmatrix(files_path, csv_weights)
if is_pipe:
```
Ideally this logic would look something like:

```python
reader = get_content_reader('csv')  # returns the correct reader for the format, e.g. CSVReader for csv
dmatrix = get_dmatrix(reader, args, pipe)
```

I know that requires changing code out of scope for this task, but can we at least add a TODO?
There would need to be a few prerequisite changes in order for this to work:
- Use mlio for csv file mode
- Have a data reader for libsvm (not yet supported)
- Have a data reader for parquet (the current ParquetRecordReader is actually an iterator over the records and does not follow the same interface as the other readers)
Should I leave a todo as a comment?
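For reference, the factory shape suggested above could look roughly like this once those prerequisites land. All names here (`ContentReader`, `CSVReader`, `get_content_reader`) are illustrative sketches, not the container's actual API:

```python
import csv
from abc import ABC, abstractmethod

class ContentReader(ABC):
    """Common interface every format-specific reader would implement."""
    @abstractmethod
    def read(self, path):
        """Return rows of floats suitable for building a DMatrix."""

class CSVReader(ContentReader):
    def read(self, path):
        with open(path) as f:
            return [[float(value) for value in row] for row in csv.reader(f)]

# Registry mapping content types to reader classes; libsvm and
# parquet entries would be added once readers exist for them.
_READERS = {"csv": CSVReader}

def get_content_reader(content_type):
    try:
        return _READERS[content_type]()
    except KeyError:
        raise ValueError("unsupported content type: %s" % content_type)
```

`get_dmatrix(reader, args, pipe)` would then depend only on the `ContentReader` interface, so supporting a new format means registering one more reader rather than extending an if/elif chain.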
Force-pushed from 1d58497 to aee26e5.
```python
from mlio.integ.numpy import as_numpy
from mlio.integ.arrow import as_arrow_file

BATCH_SIZE = 4000
```
Where does this come from?
I arbitrarily chose a value, since we need to specify a batch size for the mlio data readers. I have not yet tested whether this value affects memory consumption.
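For context, the batch size bounds how many records are materialized at once, which is the knob a memory experiment would turn. A generic sketch of the idea (plain Python, not the mlio API):

```python
BATCH_SIZE = 4000  # arbitrary default, as discussed above; worth benchmarking

def iter_batches(records, batch_size=BATCH_SIZE):
    """Yield fixed-size lists of records so peak memory is bounded
    by one batch rather than the whole dataset."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

A larger batch trades memory for fewer per-batch conversion calls; benchmarking a few values against training-set size would justify the constant.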
Approved with minor comments. No need for additional review
Force-pushed from 9263fd1 to 0703598.
…on feedback. Added doc strings. Improved parquet file mode memory efficiency. Improved recordio-protobuf memory efficiency.
Description of changes:
Added support for training with CSV pipe mode, Parquet file and pipe mode, and RecordIO-Protobuf file and pipe mode.
Unit tests were added for the new data reading methods; sagemaker-pipe.py from https://github.com/ishaaq/sagemaker-pipe was added to emulate pipe mode during unit tests.
NOTE: Integration tests for checkpointing, inference, and HPO are failing when run on a local image. The error produced by the checkpointing tests is the same as before boto3 was updated to support `CheckpointConfig`, yet pip freeze shows the latest version of boto3 installed in the container. The HPO test fails due to a missing metric definition, which also doesn't make sense. The inference tests simply fail during endpoint creation. I'm not sure what the cause is; don't merge until resolved.

UPDATE: With an image built via CodeBuild, the inference and HPO tests succeed, and the checkpointing tests succeed after creating a new workspace, so these failures were not due to any change in this PR.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.