
Add subsampling support for CSV and RecordIO-protobuf datasets #46

Merged
merged 1 commit into from
Nov 21, 2019

Conversation

cbalioglu
Contributor

This PR introduces support for subsampling-on-read for CSV datasets read via SageMaker Pipe mode. In this context, subsampling means artificially reducing the size of a dataset so that it fits into the memory of a given host machine.

The logic implemented here simply takes the first subsample_ratio * mini_batch_size records of each mini-batch and discards the rest; this effectively reduces the size of the dataset by the specified ratio.
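A minimal sketch of that idea (the helper name and rounding mode are assumptions for illustration; the actual implementation lives in data_utils.py):

```python
import math

def truncate_batch(batch, subsample_ratio):
    # Keep only the first subsample_ratio * len(batch) records of the
    # mini-batch and discard the rest. (Hypothetical helper; flooring
    # the record count is an assumption.)
    keep = int(math.floor(len(batch) * subsample_ratio))
    return batch[:keep]

mini_batch = list(range(1000))
subsampled = truncate_batch(mini_batch, 0.5)
assert len(subsampled) == 500
```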

Also included in this PR is an upgrade to ML-IO v0.2. The new version of ML-IO uses Apache Arrow v0.15.1, which brings numerous bug fixes and performance improvements.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -300,15 +300,17 @@ def _get_csv_dmatrix_file_mode(files_path, csv_weights):
return dmatrix


-def _get_csv_dmatrix_pipe_mode(pipe_path, csv_weights):
+def _get_csv_dmatrix_pipe_mode(pipe_path, csv_weights, subsample_on_read):
Contributor


  1. Should we rename to indicate a ratio? I read this right now as a boolean.
  2. Should it have a default of 1 i.e. no subsampling?

Contributor


Agree with the first point.

The second point is debatable, because if the value is None then the code flow doesn't even go through the subsampling section. However, I'm also not fond of overloading arguments (i.e., None in one case and a ratio in another; is this a Pythonic best practice?).

We could also have the default be 1 and write the subsampling check as:

if 0 < subsampling < 1:
    # perform subsampling

And this way we don't need a None default.

Contributor Author


  1. Yeah I can rename it to subsample_ratio_on_read.
  2. This is an internal function. The default value (None) is set in get_csv_dmatrix.

Contributor Author


If you think of subsample_ratio_on_read as an optional parameter, a value of None vs. 1 has a semantic difference. As an API user, with None you explicitly state that you do not want any subsampling; with 1, although the result will be the same as with None, you cannot know for sure whether some additional subsampling-related logic will still be executed.

Contributor


FYI: XGBoost has a similar parameter (https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster) that has default value 1.

Q: Under what circumstance would an API user specify subsample_ratio_on_read=None?

  • If no subsampling is desired then user could just not specify the parameter.
  • Additionally if subsample_ratio_on_read=1 then behavior should be same as None.

So it's not clear to me when the None parameter should be used.

Contributor Author


I changed the accepted range of values to (0, 1); meaning that, if specified, the value has to be greater than zero and less than one. If None, no subsampling is performed. I still think that None is more distinct than 1 for indicating that no subsampling should be performed.
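A minimal sketch of that rule (the helper name is hypothetical, not the PR's actual validation code):

```python
def validate_subsample_ratio(ratio):
    # None disables subsampling entirely; an explicit value must lie
    # strictly inside the open interval (0, 1).
    if ratio is None:
        return False
    if not 0 < ratio < 1:
        raise ValueError(
            "subsample_ratio_on_read must be in (0, 1), got {}".format(ratio))
    return True

assert validate_subsample_ratio(None) is False
assert validate_subsample_ratio(0.5) is True
```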

src/sagemaker_xgboost_container/data_utils.py (resolved)
@cbalioglu cbalioglu changed the title Add subsampling support for CSV datasets read from pipe Add subsampling support for CSV and RecordIO-protobuf datasets Nov 13, 2019
@aws-patlin
Contributor

Did you remove the unit test you initially wrote?

@cbalioglu
Contributor Author

Did you remove the unit test you initially wrote?

Yes. ML-IO uses different logic than what I had here before: if the batch size is less than or equal to the (remaining) dataset size, the mini-batch always gets filled. What ML-IO does is, after reading a mini-batch, it skips (batch_size / subsample_ratio) - batch_size records and starts reading the next mini-batch from there. Practically, it does the same thing as what was previously implemented here, with the difference that it discards records before putting them into a mini-batch.
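The skip count described above works out as follows (a sketch; the function name is illustrative, and integer truncation is an assumption):

```python
def records_to_skip(batch_size, subsample_ratio):
    # After reading a full mini-batch, ML-IO skips this many records
    # before starting to read the next mini-batch (per the description
    # above).
    return int(batch_size / subsample_ratio) - batch_size

assert records_to_skip(100, 0.5) == 100   # read 100, skip 100
assert records_to_skip(100, 0.25) == 300  # read 100, skip 300
```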

Contributor

@iyerr3 iyerr3 left a comment


Did you remove the unit test you initially wrote?

Yes. ML-IO uses different logic than what I had here before: if the batch size is less than or equal to the (remaining) dataset size, the mini-batch always gets filled. What ML-IO does is, after reading a mini-batch, it skips (batch_size / subsample_ratio) - batch_size records and starts reading the next mini-batch from there. Practically, it does the same thing as what was previously implemented here, with the difference that it discards records before putting them into a mini-batch.

I'm not clear on this, and I'm unsure whether this is the same logic as before.

For example: dataset = 300, batch_size = 100, subsample_ratio = 0.5, expected output = 150 rows.
By the above logic, we'll read 100 rows, then skip 100 rows (100 / 0.5 - 100), and then read 100 again. So we'll end up with 200 rows instead of 150. Am I reading this wrong?

src/sagemaker_xgboost_container/data_utils.py (outdated, resolved)
src/sagemaker_xgboost_container/data_utils.py (outdated, resolved)
@cbalioglu
Contributor Author

For example: dataset = 300, batch_size = 100, subsample_ratio = 0.5, expected output = 150 rows.
By the above logic, we'll read 100 rows, then skip 100 rows (100 / 0.5 - 100), and then read 100 again. So we'll end up with 200 rows instead of 150. Am I reading this wrong?

If your dataset is fairly small, as in your example, then yes, it does not match the exact ratio. I have an explicit remark in the API documentation that mentions this fact. ML-IO has no other way to subsample a dataset: if you do not know the size of the dataset in advance and the user requests a mini-batch size of 100, your only option is to skip a fixed number of records between mini-batches. If the dataset is sufficiently large, the actual ratio converges to the requested ratio.
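That convergence claim can be checked with a small simulation of the read-then-skip scheme (all names hypothetical; this models the behavior described above, not ML-IO's actual code):

```python
def effective_ratio(dataset_size, batch_size, subsample_ratio):
    # Simulate ML-IO's read-then-skip subsampling and return the
    # fraction of records actually read.
    skip = int(batch_size / subsample_ratio) - batch_size
    pos = records_read = 0
    while pos < dataset_size:
        take = min(batch_size, dataset_size - pos)
        records_read += take
        pos += take + skip
    return records_read / dataset_size

# Small dataset (the example above): 300 rows yields 200 reads, i.e.
# an effective ratio of ~0.667 instead of the requested 0.5.
print(effective_ratio(300, 100, 0.5))        # ≈ 0.667
# Large dataset: the achieved ratio matches the requested 0.5.
print(effective_ratio(1_000_000, 100, 0.5))  # 0.5
```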

@cbalioglu cbalioglu removed the request for review from edwardjkim November 14, 2019 16:15
@aws-patlin
Contributor

You need to update tox.ini to use the updated mlio and pyarrow versions.

@@ -14,8 +14,8 @@
import unittest
import os
from pathlib import Path
import shutil
import signal
Contributor


You can remove this, since you also removed the only instance it was used.

Contributor Author


It is still being used. I deleted the _clear_folder function and started using shutil.rmtree in _check_piped_dmatrix.

@@ -202,5 +202,8 @@ def eval_metric_dep_validator(value, dependencies):
required=False),
hpv.IntegerHyperparameter(name="seed", range=hpv.Interval(min_open=-2**31, max_open=2**31-1),
required=False),

# For internal use only!
hpv.ContinuousHyperparameter(name="_subsample_ratio_on_read", range=hpv.Interval(min_open=0, max_open=1), required=False),
Contributor


This line is too long. Can you break it up, one arg per line?

Contributor Author


I broke it up. Overall the whole code base needs to be linted though. Looks like no one has run flake8 or a similar tool yet.

Contributor Author

@cbalioglu cbalioglu left a comment


You need to update tox.ini to use the updated mlio and pyarrow versions.

Fixed in the new revision.

@rizwangilani
Contributor

I was testing this, and it seems we do not support 1 as a value for the subsample_ratio_on_read hyperparameter. Is this the best experience for the customer?

@cbalioglu
Contributor Author

I was testing this, and it seems we do not support 1 as a value for the subsample_ratio_on_read hyperparameter. Is this the best experience for the customer?

First a sidenote: _subsample_ratio_on_read is an internal "hyperparameter", so it should not be used by customers.

Now back to the main part of the question: my intuition was that if subsampling is requested, the ratio has to be in the range (0, 1). Subsampling with 1 does not make any sense, as you are practically not subsampling anything in that case. So if you specify None (the default value), subsampling is disabled; if you specify a value, it has to be a floating-point number greater than 0 and less than 1. I personally find a value of None more descriptive than 1 when subsampling is not desired; with 1, it is not clear whether we are actually performing subsampling.

@cbalioglu cbalioglu merged commit 56837eb into aws:master Nov 21, 2019