-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add subsampling support for CSV and RecordIO-protobuf datasets #46
Conversation
@@ -300,15 +300,17 @@ def _get_csv_dmatrix_file_mode(files_path, csv_weights): | |||
return dmatrix | |||
|
|||
|
|||
def _get_csv_dmatrix_pipe_mode(pipe_path, csv_weights): | |||
def _get_csv_dmatrix_pipe_mode(pipe_path, csv_weights, subsample_on_read): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Should we rename to indicate a ratio? I read this right now as a boolean.
- Should it have a default of
1
i.e. no subsampling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the first point.
The second point is debateable because if the value is None then it the code flow doesn't even go through the subsampling section. However, I'm also not fond of overloading arguments (ie, None in one case, and then ratio in another. Is this pythonic best practices?).
We can also have default be 1, and the check for subsampling does:
if submsampling is < 1 and subsampling > 0:
do stuff
And this way we don't need a None default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yeah I can rename it to
subsample_ratio_on_read
. - This is an internal function. The default value (
None
) is set inget_csv_dmatrix
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think of subsample_ratio_on_read
as an optional parameter, having a value of None
vs. 1
has a semantic difference. As an API user with None
you explicitly specify that you do not want to have any subsampling; with 1
, although the result will be the same as if None
is specified, you do not know for sure whether there will be some kind of additional logic executed related to subsampling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: XGBoost has a similar parameter (https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster) that has default value 1
.
Q: Under what circumstance would an API user specify subsample_ratio_on_read=None
?
- If no subsampling is desired then user could just not specify the parameter.
- Additionally if
subsample_ratio_on_read=1
then behavior should be same asNone
.
So it's not clear to me when the None
parameter should be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the accepted range of values to (0, 1); meaning, if specified, it has to be greater than zero and less than 1. If None
, then no subsampling is performed. I still think that None
is more distinct than 1
to indicate that no subsampling should be performed.
Did you remove the unit test you initially wrote? |
Yes, ML-IO uses a different logic than what I had here before. If the batch size is less than or equal to the (remaining) dataset size, it will always get filled in. What ML-IO does is, after reading a mini-batch, it skips |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you remove the unit test you initially wrote?
Yes, ML-IO uses a different logic than what I had here before. If the batch size is less than or equal to the (remaining) dataset size, it will always get filled in. What ML-IO does is, after reading a mini-batch, it skips
(batch_size / subsample_ratio) - batch_size
records and starts reading the next mini-batch from there. Practically it does the same thing as what was previously implemented here only with the difference that it discards records before putting them into a mini-batch.
I'm not clear on this and unsure if this is the same logic as previous.
For eg. dataset = 300, batch_size = 100, subsample_ratio = 0.5, expected output = 150
So by above logic, we'll read 100 rows, then skip 100 rows (100/0.5 - 100) and then read 100 again. So we'll end up with 200 rows instead of 150. Am I reading this wrong?
If your dataset is fairly small, as in your example, then yes, it does not match the exact ratio. I explicitly have a remark in the API documentation that mentions this fact. ML-IO has no other way to subsample a dataset. If you do not know the size of the dataset in advance and the user requests a mini-batch size of 100, your only option is to skip |
You need to update tox.ini to use the updated mlio and pyarrow versions. |
@@ -14,8 +14,8 @@ | |||
import unittest | |||
import os | |||
from pathlib import Path | |||
import shutil | |||
import signal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this, since you also removed the only instance it was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still being used. I deleted the _clear_folder
function and started using shutil.rmtree
in _check_piped_dmatrix
.
@@ -202,5 +202,8 @@ def eval_metric_dep_validator(value, dependencies): | |||
required=False), | |||
hpv.IntegerHyperparameter(name="seed", range=hpv.Interval(min_open=-2**31, max_open=2**31-1), | |||
required=False), | |||
|
|||
# For internal use only! | |||
hpv.ContinuousHyperparameter(name="_subsample_ratio_on_read", range=hpv.Interval(min_open=0, max_open=1), required=False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is too long. Can you break it up, one arg per line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I broke it up. Overall the whole code base needs to be linted though. Looks like no one has run flake8
or a similar tool yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to update tox.ini to use the updated mlio and pyarrow versions.
Fixed in the new revision.
I was testing this, and seems like we do not support 1 as value for subsample_ratio_on_read hyperparameter. Is this the best experience for the customer? |
First a sidenote: Now back to the main part of the question: My intuition was that if subsampling is requested, it has to be in the range of (0, 1). Subsampling with 1 does not make any sense as you are practically not subsamling anything in such case. So if you specify |
This PR introduces support for subsampling-on-read to CSV datasets read from the SageMaker pipe mode. What subsampling means in this context is reducing the size of dataset artificially to ensure that it fits into the memory of a given host machine.
The logic implemented here simply takes the first
subsample_ratio * mini_batch_size
records of the mini-batch and discards the rest; this effectively reduces the size of the dataset by the specified ratio.Also included in this PR is an upgrade to ML-IO v0.2. The new version of ML-IO now uses Apache Arrow v0.15.1 which brings lots of bug fixes and performance improvements.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.