
Add subsampling support for CSV and RecordIO-protobuf datasets #46

Merged
merged 1 commit into from
Nov 21, 2019

Conversation

cbalioglu
Contributor

This PR introduces support for subsampling-on-read for CSV datasets read via SageMaker Pipe mode. In this context, subsampling means artificially reducing the size of a dataset so that it fits into the memory of a given host machine.

The logic implemented here simply takes the first subsample_ratio * mini_batch_size records of each mini-batch and discards the rest; this effectively reduces the size of the dataset by the specified ratio.
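A minimal sketch of that idea (the helper name and rounding mode are assumptions for illustration; the actual implementation lives in data_utils.py):

```python
import math

def truncate_batch(batch, subsample_ratio):
    # Keep only the first subsample_ratio * len(batch) records of the
    # mini-batch and discard the rest. (Hypothetical helper; flooring
    # the record count is an assumption.)
    keep = int(math.floor(len(batch) * subsample_ratio))
    return batch[:keep]

mini_batch = list(range(1000))
subsampled = truncate_batch(mini_batch, 0.5)
assert len(subsampled) == 500
```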

Also included in this PR is an upgrade to ML-IO v0.2. The new version of ML-IO uses Apache Arrow v0.15.1, which brings numerous bug fixes and performance improvements.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -300,15 +300,17 @@ def _get_csv_dmatrix_file_mode(files_path, csv_weights):
return dmatrix


-def _get_csv_dmatrix_pipe_mode(pipe_path, csv_weights):
+def _get_csv_dmatrix_pipe_mode(pipe_path, csv_weights, subsample_on_read):
Contributor


  1. Should we rename to indicate a ratio? I read this right now as a boolean.
  2. Should it have a default of 1 i.e. no subsampling?

Contributor


Agree with the first point.

The second point is debatable, because if the value is None then the code flow doesn't even go through the subsampling section. However, I'm also not fond of overloading arguments (i.e., None in one case and a ratio in another; is this a Pythonic best practice?).

We could also have the default be 1 and write the subsampling check as:

if 0 < subsampling < 1:
    # perform subsampling

And this way we don't need a None default.

Contributor Author


  1. Yeah I can rename it to subsample_ratio_on_read.
  2. This is an internal function. The default value (None) is set in get_csv_dmatrix.

Contributor Author


If you think of subsample_ratio_on_read as an optional parameter, a value of None vs. 1 has a semantic difference. As an API user, with None you explicitly state that you do not want any subsampling; with 1, although the result will be the same as with None, you cannot know for sure whether some additional subsampling-related logic will still be executed.

Contributor


FYI: XGBoost has a similar parameter (https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster) that has default value 1.

Q: Under what circumstance would an API user specify subsample_ratio_on_read=None?

  • If no subsampling is desired then user could just not specify the parameter.
  • Additionally if subsample_ratio_on_read=1 then behavior should be same as None.

So it's not clear to me when the None parameter should be used.

Contributor Author


I changed the accepted range of values to (0, 1); meaning that, if specified, the value has to be greater than zero and less than one. If None, no subsampling is performed. I still think that None is more distinct than 1 for indicating that no subsampling should be performed.
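A minimal sketch of that rule (the helper name is hypothetical, not the PR's actual validation code):

```python
def validate_subsample_ratio(ratio):
    # None disables subsampling entirely; an explicit value must lie
    # strictly inside the open interval (0, 1).
    if ratio is None:
        return False
    if not 0 < ratio < 1:
        raise ValueError(
            "subsample_ratio_on_read must be in (0, 1), got {}".format(ratio))
    return True

assert validate_subsample_ratio(None) is False
assert validate_subsample_ratio(0.5) is True
```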

src/sagemaker_xgboost_container/data_utils.py (resolved)
@cbalioglu cbalioglu changed the title Add subsampling support for CSV datasets read from pipe Add subsampling support for CSV and RecordIO-protobuf datasets Nov 13, 2019
@aws-patlin
Contributor

Did you remove the unit test you initially wrote?

@cbalioglu
Contributor Author

Did you remove the unit test you initially wrote?

Yes. ML-IO uses different logic than what I had here before: if the batch size is less than or equal to the (remaining) dataset size, the mini-batch always gets filled. What ML-IO does is, after reading a mini-batch, it skips (batch_size / subsample_ratio) - batch_size records and starts reading the next mini-batch from there. Practically, it does the same thing as what was previously implemented here, with the difference that it discards records before putting them into a mini-batch.
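The skip count described above works out as follows (a sketch; the function name is illustrative, and integer truncation is an assumption):

```python
def records_to_skip(batch_size, subsample_ratio):
    # After reading a full mini-batch, ML-IO skips this many records
    # before starting to read the next mini-batch (per the description
    # above).
    return int(batch_size / subsample_ratio) - batch_size

assert records_to_skip(100, 0.5) == 100   # read 100, skip 100
assert records_to_skip(100, 0.25) == 300  # read 100, skip 300
```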

Contributor

@iyerr3 iyerr3 left a comment


Did you remove the unit test you initially wrote?

Yes. ML-IO uses different logic than what I had here before: if the batch size is less than or equal to the (remaining) dataset size, the mini-batch always gets filled. What ML-IO does is, after reading a mini-batch, it skips (batch_size / subsample_ratio) - batch_size records and starts reading the next mini-batch from there. Practically, it does the same thing as what was previously implemented here, with the difference that it discards records before putting them into a mini-batch.

I'm not clear on this, and I'm unsure whether this is the same logic as before.

For example: dataset = 300, batch_size = 100, subsample_ratio = 0.5, expected output = 150 rows.
By the above logic, we'll read 100 rows, then skip 100 rows (100 / 0.5 - 100), and then read 100 again. So we'll end up with 200 rows instead of 150. Am I reading this wrong?

src/sagemaker_xgboost_container/data_utils.py (outdated, resolved)
src/sagemaker_xgboost_container/data_utils.py (outdated, resolved)
@cbalioglu
Contributor Author

For example: dataset = 300, batch_size = 100, subsample_ratio = 0.5, expected output = 150 rows.
By the above logic, we'll read 100 rows, then skip 100 rows (100 / 0.5 - 100), and then read 100 again. So we'll end up with 200 rows instead of 150. Am I reading this wrong?

If your dataset is fairly small, as in your example, then yes, it does not match the exact ratio. I have an explicit remark in the API documentation that mentions this fact. ML-IO has no other way to subsample a dataset: if you do not know the size of the dataset in advance and the user requests a mini-batch size of 100, your only option is to skip a fixed number of records between mini-batches. If the dataset is sufficiently large, the actual ratio converges to the requested ratio.
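That convergence claim can be checked with a small simulation of the read-then-skip scheme (all names hypothetical; this models the behavior described above, not ML-IO's actual code):

```python
def effective_ratio(dataset_size, batch_size, subsample_ratio):
    # Simulate ML-IO's read-then-skip subsampling and return the
    # fraction of records actually read.
    skip = int(batch_size / subsample_ratio) - batch_size
    pos = records_read = 0
    while pos < dataset_size:
        take = min(batch_size, dataset_size - pos)
        records_read += take
        pos += take + skip
    return records_read / dataset_size

# Small dataset (the example above): 300 rows yields 200 reads, i.e.
# an effective ratio of ~0.667 instead of the requested 0.5.
print(effective_ratio(300, 100, 0.5))        # ≈ 0.667
# Large dataset: the achieved ratio matches the requested 0.5.
print(effective_ratio(1_000_000, 100, 0.5))  # 0.5
```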

@cbalioglu cbalioglu removed the request for review from edwardjkim November 14, 2019 16:15
@aws-patlin
Contributor

You need to update tox.ini to use the updated mlio and pyarrow versions.

@@ -14,8 +14,8 @@
import unittest
import os
from pathlib import Path
import shutil
import signal
Contributor


You can remove this, since you also removed the only instance it was used.

Contributor Author


It is still being used. I deleted the _clear_folder function and started using shutil.rmtree in _check_piped_dmatrix.

@@ -202,5 +202,8 @@ def eval_metric_dep_validator(value, dependencies):
required=False),
hpv.IntegerHyperparameter(name="seed", range=hpv.Interval(min_open=-2**31, max_open=2**31-1),
required=False),

# For internal use only!
hpv.ContinuousHyperparameter(name="_subsample_ratio_on_read", range=hpv.Interval(min_open=0, max_open=1), required=False),
Contributor


This line is too long. Can you break it up, one arg per line?

Contributor Author


I broke it up. Overall the whole code base needs to be linted though. Looks like no one has run flake8 or a similar tool yet.

Contributor Author

@cbalioglu cbalioglu left a comment


You need to update tox.ini to use the updated mlio and pyarrow versions.

Fixed in the new revision.

@rizwangilani
Contributor

I was testing this, and it seems we do not support 1 as a value for the subsample_ratio_on_read hyperparameter. Is this the best experience for the customer?

@cbalioglu
Contributor Author

I was testing this, and it seems we do not support 1 as a value for the subsample_ratio_on_read hyperparameter. Is this the best experience for the customer?

First a sidenote: _subsample_ratio_on_read is an internal "hyperparameter", so it should not be used by customers.

Now back to the main part of the question: my intuition was that if subsampling is requested, the ratio has to be in the range (0, 1). Subsampling with 1 does not make any sense, as you are practically not subsampling anything in that case. So if you specify None (the default value), subsampling is disabled; if you specify a value, it has to be a floating-point number greater than 0 and less than 1. I personally find a value of None more descriptive than 1 when subsampling is not desired; with 1, it is not clear whether we are actually performing subsampling.

@cbalioglu cbalioglu merged commit 56837eb into aws:master Nov 21, 2019