Minibatch Preprocessing: change default buffer size formula for grouping #256

jingyimei

This commit changes the formula used to calculate the default buffer
size. Previously, we used num_rows_processed/num_of_segments to indicate
the data distribution in each segment. To adapt this to a grouping
scenario, we use avg_num_rows_processed/num_of_segments to indicate the data
distribution when there is more than one group of data. The other code changes
follow from this change.
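
The difference between the two formulas can be sketched in Python. This is a minimal illustration, not the MADlib implementation: the function names and the explicit `num_groups` argument are assumptions made for the example, and rounding up to a whole number of rows is assumed as well.

```python
import math


def default_buffer_size_old(num_rows_processed, num_segments):
    """Previous default: rows per segment, ignoring grouping."""
    return int(math.ceil(num_rows_processed / float(num_segments)))


def default_buffer_size_grouped(num_rows_processed, num_segments, num_groups=1):
    """New default: average rows per group, divided across segments.

    With num_groups == 1 this reduces to the previous formula.
    Rounding up is an assumption of this sketch.
    """
    avg_num_rows_processed = num_rows_processed / float(num_groups)
    return int(math.ceil(avg_num_rows_processed / num_segments))
```

With hypothetical numbers (100 rows, 4 segments, 5 groups), the old formula gives 25 rows per buffer even though each group holds only 20 rows on average, while the grouped formula gives 5.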

@asfgit

asfgit commented Apr 5, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/424/

@fmcquillan99

fmcquillan99 commented Apr 6, 2018

We seem to be computing batch size using master + num segments,
but we probably should just consider num segments.

Previously, this function returned the total number of segments, including
the master segment. This commit changes it to return only the number of
primary segments.
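
To illustrate why the master should be excluded, here is a small Python sketch. It assumes a Greenplum-style segment configuration in which the master has `content = -1` and primary segments have `role = 'p'`. The sample rows and the helper name `count_primary_segments` are hypothetical and are not taken from the patch.

```python
def count_primary_segments(segment_config):
    """Count primary data segments, skipping the master and any mirrors."""
    return sum(
        1
        for row in segment_config
        if row["role"] == "p" and row["content"] >= 0
    )


# Hypothetical cluster: one master and two primary segments.
sample_config = [
    {"content": -1, "role": "p"},  # master
    {"content": 0, "role": "p"},
    {"content": 1, "role": "p"},
]

total_entries = len(sample_config)                    # master included
num_segments = count_primary_segments(sample_config)  # master excluded
```

Here `total_entries` is 3 while `num_segments` is 2, so dividing by the former would count a node that does not take part in processing the data and would yield a smaller default buffer than intended.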
@asfgit

asfgit commented Apr 6, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/425/

@fmcquillan99

Is this expected behavior? The last buffer for the NJ group gets only 1 observation.

DROP TABLE IF EXISTS iris_data;
CREATE TABLE iris_data(
    id serial,
    attributes numeric[],
    class_text text,
    class integer,
    state text
);
INSERT INTO iris_data(id, attributes, class_text, class, state) VALUES
(1,ARRAY[5.0,3.2,1.2,0.2],'Iris_setosa',1,'Alaska'),
(2,ARRAY[5.5,3.5,1.3,0.2],'Iris_setosa',1,'Alaska'),
(3,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'Alaska'),
(4,ARRAY[4.4,3.0,1.3,0.2],'Iris_setosa',1,'Alaska'),
(5,ARRAY[5.1,3.4,1.5,0.2],'Iris_setosa',1,'Alaska'),
(6,ARRAY[5.0,3.5,1.3,0.3],'Iris_setosa',1,'Alaska'),
(7,ARRAY[4.5,2.3,1.3,0.3],'Iris_setosa',1,'Alaska'),
(8,ARRAY[4.4,3.2,1.3,0.2],'Iris_setosa',1,'Alaska'),
(9,ARRAY[5.0,3.5,1.6,0.6],'Iris_setosa',1,'Alaska'),
(10,ARRAY[5.1,3.8,1.9,0.4],'Iris_setosa',1,'Alaska'),
(11,ARRAY[4.8,3.0,1.4,0.3],'Iris_setosa',1,'Alaska'),
(12,ARRAY[5.1,3.8,1.6,0.2],'Iris_setosa',1,'Alaska'),
(13,ARRAY[5.7,2.8,4.5,1.3],'Iris_versicolor',2,'Alaska'),
(14,ARRAY[6.3,3.3,4.7,1.6],'Iris_versicolor',2,'Alaska'),
(15,ARRAY[4.9,2.4,3.3,1.0],'Iris_versicolor',2,'Alaska'),
(16,ARRAY[6.6,2.9,4.6,1.3],'Iris_versicolor',2,'Alaska'),
(17,ARRAY[5.2,2.7,3.9,1.4],'Iris_versicolor',2,'Alaska'),
(18,ARRAY[5.0,2.0,3.5,1.0],'Iris_versicolor',2,'Alaska'),
(19,ARRAY[5.9,3.0,4.2,1.5],'Iris_versicolor',2,'Alaska'),
(20,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'Alaska'),
(21,ARRAY[5.0,3.2,1.2,0.2],'Iris_setosa',1,'NJ'),
(22,ARRAY[5.5,3.5,1.3,0.2],'Iris_setosa',1,'NJ'),
(23,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'NJ'),
(24,ARRAY[4.4,3.0,1.3,0.2],'Iris_setosa',1,'NJ'),
(25,ARRAY[5.1,3.4,1.5,0.2],'Iris_setosa',1,'NJ'),
(26,ARRAY[5.0,3.5,1.3,0.3],'Iris_setosa',1,'NJ'),
(27,ARRAY[4.5,2.3,1.3,0.3],'Iris_setosa',1,'NJ'),
(28,ARRAY[4.4,3.2,1.3,0.2],'Iris_setosa',1,'NJ'),
(29,ARRAY[5.0,3.5,1.6,0.6],'Iris_setosa',1,'NJ'),
(30,ARRAY[5.1,3.8,1.9,0.4],'Iris_setosa',1,'NJ'),
(31,ARRAY[4.8,3.0,1.4,0.3],'Iris_setosa',1,'NJ'),
(32,ARRAY[5.1,3.8,1.6,0.2],'Iris_setosa',1,'NJ'),
(33,ARRAY[5.7,2.8,4.5,1.3],'Iris_versicolor',2,'NJ'),
(34,ARRAY[6.3,3.3,4.7,1.6],'Iris_versicolor',2,'NJ'),
(35,ARRAY[4.9,2.4,3.3,1.0],'Iris_versicolor',2,'NJ'),
(36,ARRAY[6.6,2.9,4.6,1.3],'Iris_versicolor',2,'NJ'),
(37,ARRAY[5.2,2.7,3.9,1.4],'Iris_versicolor',2,'NJ'),
(38,ARRAY[5.0,2.0,3.5,1.0],'Iris_versicolor',2,'NJ'),
(39,ARRAY[5.9,3.0,4.2,1.5],'Iris_versicolor',2,'NJ'),
(40,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'NJ'),
(41,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'NJ'),
(42,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'NJ'),
(43,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'NJ');
DROP TABLE IF EXISTS iris_data_packed, iris_data_packed_summary, iris_data_packed_standardization;
SELECT madlib.minibatch_preprocessor('iris_data',         -- Source table
                                     'iris_data_packed',  -- Output table
                                     'class_text',        -- Dependent variable
                                     'attributes',        -- Independent variables
                                     'state'              -- Grouping
                                     );
SELECT * FROM iris_data_packed ORDER BY state, __id__;
__id__	state	dependent_varname	independent_varname

0	Alaska	[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]	[[-0.71340313126425, -0.0601082924775639, -0.815378124905574, -0.706014621503565], [1.1550336410945, -0.0601082924775639, 1.26960703467032, 1.61512933960405], [1.8344651946795, 0.540974632298078, 1.64192581316602, 1.80855800302968], [-0.373687354471751, -2.06371804172971, 0.748360744776349, 0.647986022475874], [-0.203829466075501, 1.54277950692415, -0.666450613507296, -0.899443284929199], [1.32489152949075, -1.66299609187928, 1.12067952327204, 0.647986022475874], [0.475602087509498, 0.941696582148507, -0.889841880604713, -0.899443284929199], [-0.203829466075501, 0.741335607223293, -0.740914369206435, -0.899443284929199], [-0.373687354471751, 0.941696582148507, -0.889841880604713, -0.706014621503565], [-0.373687354471751, 0.941696582148507, -0.666450613507296, -0.125728631226662], [-0.543545242868, 0.14025268244765, -0.740914369206435, -1.09287194835483]]

1	Alaska	[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]	[[-1.222976796453, -1.46263511695406, -0.889841880604713, -0.706014621503565], [-0.373687354471751, 0.340613657372865, -0.964305636303853, -0.899443284929199], [-0.543545242868, -1.26227414202885, 0.59943323337807, 0.647986022475874], [0.815317864301998, -0.460830242327993, 1.49299830176774, 1.22827201275278], [-0.0339715776792507, -0.661191217253206, 1.04621576757291, 1.42170067617841], [-1.39283468484925, -0.0601082924775639, -0.889841880604713, -0.899443284929199], [2.34403885986824, -0.260469267402778, 1.56746205746688, 1.22827201275278], [-1.39283468484925, 0.340613657372865, -0.889841880604713, -0.899443284929199], [-0.203829466075501, 1.54277950692415, -0.443059346409878, -0.512585958077931]]

0	NJ	[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]	[[-1.51451754593938, 0.144152305294091, -1.02984968208236, -1.02742318255518], [-0.528652350941107, 1.06512536689522, -0.80778834438335, -0.219534868067347], [-0.528652350941107, -1.69779381790817, 0.598600127710372, 0.588353446420489], [-0.528652350941107, 0.512541529934543, -1.10387012798203, -1.02742318255518], [0.621523709890219, -0.224236919346362, 1.33880458670707, 1.19426968228637], [-0.364341485108061, 1.6177092038559, -0.585727006684342, -0.623479025311265], [-0.692963216774152, 0.328346917614317, -0.88180879028302, -1.22939526117714], [-0.692963216774152, -0.961015368627265, 0.450559235911032, 0.588353446420489], [-0.364341485108061, 1.6177092038559, -0.80778834438335, -1.02742318255518], [-0.364341485108061, 0.880930754574994, -0.88180879028302, -1.02742318255518], [-1.51451754593938, 0.512541529934543, -1.02984968208236, -1.02742318255518]]

1	NJ	[[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]	[[2.10032150238764, -0.0400423070261354, 1.41282503260674, 1.19426968228637], [1.11445630738936, -1.32940459326772, 0.96870235720872, 0.588353446420489], [-0.8572740826072, 0.144152305294091, -0.95582923618269, -0.825451103933224], [0.292901978224126, 1.06512536689522, -1.02984968208236, -1.02742318255518], [-1.35020668010634, -1.14520998094749, -1.02984968208236, -0.825451103933224], [-0.200030619275013, -0.408431531666587, 0.89468191130905, 1.39624176090833], [0.950145441556312, 0.144152305294091, 1.11674324900806, 1.59821383953028], [1.11445630738936, -1.32940459326772, 0.96870235720872, 0.588353446420489], [1.6073889048885, 0.696736142254768, 1.48684547850641, 1.80018591815224], [1.11445630738936, -1.32940459326772, 0.96870235720872, 0.588353446420489], [1.11445630738936, -1.32940459326772, 0.96870235720872, 0.588353446420489]]

2	NJ	[[1.0, 0.0]]	[[-0.528652350941107, 1.06512536689522, -1.02984968208236, -0.825451103933224]]

@fmcquillan99

fmcquillan99 commented Apr 6, 2018

Oh I see, with the averaging approach:

buffer_size = avg_num_rows_per_group / num_segments
= 21.5 / 2
= 10.75

and rounding up we get 11.

Can you think of any drawbacks of using this approach?
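
One consequence of the averaging approach can be seen with a short Python sketch. It assumes each group is packed independently into buffers of at most `buffer_size` rows, which is consistent with the output above but is not taken from the MADlib source. The helper name `buffer_lengths` is hypothetical.

```python
import math


def buffer_lengths(group_size, buffer_size):
    """Sizes of the packed buffers for one group: full buffers, then a remainder."""
    full, remainder = divmod(group_size, buffer_size)
    return [buffer_size] * full + ([remainder] if remainder else [])


group_sizes = {"Alaska": 20, "NJ": 23}  # row counts from the INSERT above
num_segments = 2

avg_rows_per_group = sum(group_sizes.values()) / float(len(group_sizes))  # 21.5
buffer_size = int(math.ceil(avg_rows_per_group / num_segments))           # 11

for state in sorted(group_sizes):
    print(state, buffer_lengths(group_sizes[state], buffer_size))
# Alaska [11, 9]
# NJ [11, 11, 1]
```

A group that is larger than the average can spill a few rows into a small trailing buffer, which is what happens to NJ here.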

Contributor

@njayaram2 njayaram2 left a comment

LGTM

@fmcquillan99

fmcquillan99 commented Apr 10, 2018

LGTM

Default selection looks reasonable:

(0) data

DROP TABLE IF EXISTS iris_data;
CREATE TABLE iris_data(
    id serial,
    attributes numeric[],
    class_text text,
    class integer,
    state text
);
INSERT INTO iris_data(id, attributes, class_text, class, state) VALUES
(1,ARRAY[5.0,3.2,1.2,0.2],'Iris_setosa',1,'Alaska'),
(2,ARRAY[5.5,3.5,1.3,0.2],'Iris_setosa',1,'Alaska'),
(3,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'Alaska'),
(4,ARRAY[4.4,3.0,1.3,0.2],'Iris_setosa',1,'Alaska'),
(5,ARRAY[5.1,3.4,1.5,0.2],'Iris_setosa',1,'Alaska'),
(6,ARRAY[5.0,3.5,1.3,0.3],'Iris_setosa',1,'Alaska'),
(7,ARRAY[4.5,2.3,1.3,0.3],'Iris_setosa',1,'Alaska'),
(8,ARRAY[4.4,3.2,1.3,0.2],'Iris_setosa',1,'Alaska'),
(9,ARRAY[5.0,3.5,1.6,0.6],'Iris_setosa',1,'Alaska'),
(10,ARRAY[5.1,3.8,1.9,0.4],'Iris_setosa',1,'Alaska'),
(11,ARRAY[4.8,3.0,1.4,0.3],'Iris_setosa',1,'Alaska'),
(12,ARRAY[5.1,3.8,1.6,0.2],'Iris_setosa',1,'Alaska'),
(13,ARRAY[5.7,2.8,4.5,1.3],'Iris_versicolor',2,'Alaska'),
(14,ARRAY[6.3,3.3,4.7,1.6],'Iris_versicolor',2,'Alaska'),
(15,ARRAY[4.9,2.4,3.3,1.0],'Iris_versicolor',2,'Alaska'),
(16,ARRAY[6.6,2.9,4.6,1.3],'Iris_versicolor',2,'Alaska'),
(17,ARRAY[5.2,2.7,3.9,1.4],'Iris_versicolor',2,'Alaska'),
(18,ARRAY[5.0,2.0,3.5,1.0],'Iris_versicolor',2,'Alaska'),
(19,ARRAY[5.9,3.0,4.2,1.5],'Iris_versicolor',2,'Alaska'),
(20,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'Alaska'),
(21,ARRAY[6.1,2.9,4.7,1.4],'Iris_versicolor',2,'Alaska'),
(22,ARRAY[5.6,2.9,3.6,1.3],'Iris_versicolor',2,'Alaska'),
(23,ARRAY[6.7,3.1,4.4,1.4],'Iris_versicolor',2,'Alaska'),
(24,ARRAY[5.6,3.0,4.5,1.5],'Iris_versicolor',2,'Alaska'),
(25,ARRAY[5.8,2.7,4.1,1.0],'Iris_versicolor',2,'Alaska'),
(26,ARRAY[6.2,2.2,4.5,1.5],'Iris_versicolor',2,'Alaska'),
(27,ARRAY[5.6,2.5,3.9,1.1],'Iris_versicolor',2,'Alaska'),
(28,ARRAY[5.0,3.4,1.5,0.2],'Iris_setosa',1,'Tennessee'),
(29,ARRAY[4.4,2.9,1.4,0.2],'Iris_setosa',1,'Tennessee'),
(30,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'Tennessee'),
(31,ARRAY[5.4,3.7,1.5,0.2],'Iris_setosa',1,'Tennessee'),
(32,ARRAY[4.8,3.4,1.6,0.2],'Iris_setosa',1,'Tennessee'),
(33,ARRAY[4.8,3.0,1.4,0.1],'Iris_setosa',1,'Tennessee'),
(34,ARRAY[4.3,3.0,1.1,0.1],'Iris_setosa',1,'Tennessee'),
(35,ARRAY[5.8,4.0,1.2,0.2],'Iris_setosa',1,'Tennessee'),
(36,ARRAY[5.7,4.4,1.5,0.4],'Iris_setosa',1,'Tennessee'),
(37,ARRAY[5.4,3.9,1.3,0.4],'Iris_setosa',1,'Tennessee'),
(38,ARRAY[6.0,2.9,4.5,1.5],'Iris_versicolor',2,'Tennessee'),
(39,ARRAY[5.7,2.6,3.5,1.0],'Iris_versicolor',2,'Tennessee'),
(40,ARRAY[5.5,2.4,3.8,1.1],'Iris_versicolor',2,'Tennessee'),
(41,ARRAY[5.5,2.4,3.7,1.0],'Iris_versicolor',2,'Tennessee'),
(42,ARRAY[5.8,2.7,3.9,1.2],'Iris_versicolor',2,'Tennessee'),
(43,ARRAY[6.0,2.7,5.1,1.6],'Iris_versicolor',2,'Tennessee'),
(44,ARRAY[5.4,3.0,4.5,1.5],'Iris_versicolor',2,'Tennessee'),
(45,ARRAY[6.0,3.4,4.5,1.6],'Iris_versicolor',2,'Tennessee'),
(46,ARRAY[6.7,3.1,4.7,1.5],'Iris_versicolor',2,'Tennessee'),
(47,ARRAY[6.3,2.3,4.4,1.3],'Iris_versicolor',2,'Tennessee'),
(48,ARRAY[5.6,3.0,4.1,1.3],'Iris_versicolor',2,'Tennessee'),
(49,ARRAY[5.5,2.5,4.0,1.3],'Iris_versicolor',2,'Tennessee'),
(50,ARRAY[5.5,2.6,4.4,1.2],'Iris_versicolor',2,'Tennessee'),
(51,ARRAY[6.1,3.0,4.6,1.4],'Iris_versicolor',2,'Tennessee'),
(52,ARRAY[5.8,2.6,4.0,1.2],'Iris_versicolor',2,'Tennessee');

(1) no groups, 2 segments, default buffer size

select * from iris_data_packed_summary;

-[ RECORD 1 ]------------+------------------------------
source_table             | iris_data
output_table             | iris_data_packed
dependent_varname        | class_text
independent_varname      | attributes
buffer_size              | 26
class_values             | {Iris_setosa,Iris_versicolor}
num_rows_processed       | 52
num_missing_rows_skipped | 0
grouping_cols            | 

(2) no groups, 2 segments, buffer size=10

madlib=# select * from iris_data_packed_summary;

-[ RECORD 1 ]------------+------------------------------
source_table             | iris_data
output_table             | iris_data_packed
dependent_varname        | class_text
independent_varname      | attributes
buffer_size              | 10
class_values             | {Iris_setosa,Iris_versicolor}
num_rows_processed       | 52
num_missing_rows_skipped | 0
grouping_cols            | 

(3) groups, 2 segments, default buffer size

select * from iris_data_packed_summary;

-[ RECORD 1 ]------------+------------------------------
source_table             | iris_data
output_table             | iris_data_packed
dependent_varname        | class_text
independent_varname      | attributes
buffer_size              | 13
class_values             | {Iris_setosa,Iris_versicolor}
num_rows_processed       | 52
num_missing_rows_skipped | 0
grouping_cols            | state
select __id__, state , dependent_varname from iris_data_packed order by state, __id__;
 __id__ |   state   |                                dependent_varname                                
--------+-----------+---------------------------------------------------------------------------------
      0 | Alaska    | {{1,0},{0,1},{0,1},{0,1},{0,1},{1,0},{1,0},{1,0},{1,0},{0,1},{0,1},{0,1},{1,0}}
      1 | Alaska    | {{1,0},{1,0},{0,1},{0,1},{0,1},{1,0},{1,0},{1,0},{1,0},{0,1},{0,1},{0,1},{0,1}}
      2 | Alaska    | {{0,1}}
      0 | Tennessee | {{0,1},{1,0},{0,1},{0,1},{0,1},{0,1},{0,1},{1,0},{0,1},{0,1},{0,1},{0,1},{0,1}}
      1 | Tennessee | {{0,1},{1,0},{1,0},{0,1},{1,0},{1,0},{1,0},{0,1},{0,1},{1,0},{1,0},{1,0}}
(5 rows)

^^^ The buffer size above is based on the average group size
(Alaska=27, Tennessee=25, so avg=26)
and the number of segments (2),
i.e., 26/2=13

(4) groups, 2 segments, buffer size=10

-[ RECORD 1 ]------------+------------------------------
source_table             | iris_data
output_table             | iris_data_packed
dependent_varname        | class_text
independent_varname      | attributes
buffer_size              | 10
class_values             | {Iris_setosa,Iris_versicolor}
num_rows_processed       | 52
num_missing_rows_skipped | 0
grouping_cols            | state
select __id__, state , dependent_varname from iris_data_packed order by state, __id__;

 __id__ |   state   |                       dependent_varname                       
--------+-----------+---------------------------------------------------------------
      0 | Alaska    | {{0,1},{1,0},{1,0},{1,0},{0,1},{0,1},{1,0},{0,1},{0,1},{0,1}}
      1 | Alaska    | {{0,1},{0,1},{1,0},{1,0},{0,1},{1,0},{0,1},{1,0},{0,1},{0,1}}
      2 | Alaska    | {{1,0},{0,1},{1,0},{0,1},{0,1},{1,0},{1,0}}
      0 | Tennessee | {{0,1},{0,1},{1,0},{0,1},{0,1},{1,0},{0,1},{0,1},{1,0},{0,1}}
      1 | Tennessee | {{1,0},{0,1},{1,0},{0,1},{0,1},{0,1},{1,0},{0,1},{0,1},{1,0}}
      2 | Tennessee | {{1,0},{1,0},{0,1},{1,0},{0,1}}
(6 rows)

(5) MNIST
Tested the MNIST training set of 60,000 rows and got a buffer size of 30,000, which is correct for 2 segments.

@asfgit asfgit closed this in 3e519dc Apr 10, 2018