Minibatch Preprocessing: change default buffer size formula for grouping #256
…ouping This commit changes the formula used to calculate the default buffer size. Previously, we used num_rows_processed/num_of_segments to estimate how the data is distributed across segments. To adapt this to a grouping scenario, we now use avg_num_rows_processed/num_of_segment to estimate the distribution when there is more than one group of data. The other code changes follow from this change.
We seem to be computing the batch size using master + num segments, i.e. the segment count includes the master.
Previously, this function returned the total number of segments, including the master segment. This commit changes it to return only the number of primary segments.
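To illustrate the distinction, here is a minimal sketch of counting only primary segments. It assumes segment metadata shaped like Greenplum's `gp_segment_configuration` catalog, where the master has `content = -1` and primaries have `role = 'p'`. The function name `count_primary_segments` and the sample rows are hypothetical and are not the helper changed in this PR.

```python
def count_primary_segments(segment_rows):
    """Count primary segments, excluding the master and any mirrors.

    Assumes each row is a dict with 'content' and 'role' keys, modelled
    on Greenplum's gp_segment_configuration (an assumption, see above).
    """
    return sum(
        1
        for row in segment_rows
        if row["role"] == "p" and row["content"] >= 0  # content -1 is the master
    )


# Hypothetical cluster: one master, two primaries, two mirrors.
sample_rows = [
    {"content": -1, "role": "p"},  # master
    {"content": 0, "role": "p"},
    {"content": 1, "role": "p"},
    {"content": 0, "role": "m"},
    {"content": 1, "role": "m"},
]

print(count_primary_segments(sample_rows))
```

With the master included, the same cluster would be counted as three segments, which would shrink the computed buffer size.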
Is this expected behavior? The last group for NJ gets only 1 observation.
Oh I see, with the averaging approach, buffer_size = avg_num_rows_per_group / num_segments, and rounding up we get 11. Can you think of any drawbacks of using this approach?
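The averaging formula discussed above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `default_buffer_size` and `batch_sizes` are hypothetical, and the row counts are made-up numbers chosen so that the result is 11, not the data from the example in this thread.

```python
import math


def default_buffer_size(rows_per_group, num_segments):
    """Return ceil(avg rows per group / num_segments).

    rows_per_group: list of row counts, one per group (a single entry
    when there is no grouping). num_segments: number of primary segments.
    """
    if not rows_per_group:
        raise ValueError("rows_per_group must not be empty")
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    avg_rows = sum(rows_per_group) / len(rows_per_group)
    return int(math.ceil(avg_rows / num_segments))


def batch_sizes(num_rows, buffer_size):
    """Split num_rows into consecutive buffers of at most buffer_size rows."""
    full, remainder = divmod(num_rows, buffer_size)
    return [buffer_size] * full + ([remainder] if remainder else [])


# Illustrative numbers only: two groups with 30 and 12 rows, 2 segments.
size = default_buffer_size([30, 12], 2)
print(size)                   # average 21 rows per group, 21 / 2 = 10.5, rounded up
print(batch_sizes(12, size))  # how the smaller group is packed
```

One possible drawback shows up in the second call: when a group's row count is just above a multiple of the averaged buffer size, its final buffer can hold a single row.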
LGTM
LGTM
Default selection looks reasonable:
(0) data
(1) no groups, 2 segments, default buffer size
(2) no groups, 2 segments, buffer size=10
(3) groups, 2 segments, default buffer size
^^^ Above buffer size is based on average group size.
(4) groups, 2 segments, buffer size=10
(5) mnist