forked from myui/hivemall
Pig_Logistic regression dataset generation
daijyc edited this page Feb 24, 2015 · 1 revision
Contents of the input file 1.txt, which the scripts below load:
1
register hivemall-0.3-with-dependencies.jar
define lr_datagen HiveUDTF('hivemall.dataset.LogisticRegressionDataGeneratorUDTF', '("-n_examples 10k -n_features 10 -seed 100")');
rmf regression_data1
a = load '1.txt';
b = foreach a generate flatten(lr_datagen('-n_examples 10k -n_features 10 -seed 100')) as (label,features);
store b into 'regression_data1';
See LogisticRegressionDataGeneratorUDTF.java for details of the options.
You can generate either a sparse or a dense dataset. By default, a sparse dataset is generated.
You can use the "-cl" option to generate 0/1 labels.
register hivemall-0.3-with-dependencies.jar
define lr_datagen HiveUDTF('hivemall.dataset.LogisticRegressionDataGeneratorUDTF', '("-cl")');
rmf cl
a = load '1.txt';
b = foreach a generate flatten(lr_datagen('-cl')) as (label,features);
store b into 'cl';
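To check that the "-cl" run really produced classification labels, one option is to load the stored output back and list the distinct label values. The following is a minimal sketch, assuming the output in 'cl' was written with Pig's default storage (tab-delimited text); the aliases stored, labels, and uniq are illustrative names, not part of Hivemall.

```pig
-- Load the stored dataset back as plain text fields (assumes default PigStorage output).
stored = load 'cl' as (label:chararray, features:chararray);
-- Keep only the label column and reduce it to its distinct values.
labels = foreach stored generate label;
uniq = distinct labels;
-- With "-cl", only two distinct label values are expected.
dump uniq;
```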
register hivemall-0.3-with-dependencies.jar
define lr_datagen HiveUDTF('hivemall.dataset.LogisticRegressionDataGeneratorUDTF', '("-dense -n_examples 9999 -n_features 100 -n_dims 100")');
rmf regression_data_dense
a = load '1.txt';
b = foreach a generate flatten(lr_datagen('-dense -n_examples 9999 -n_features 100 -n_dims 100')) as (label,features);
store b into 'regression_data_dense';
Dataset generation using at most 10 reducers:
register hivemall-0.3-with-dependencies.jar
%default n_parallel_datagen 10
define generate_series HiveUDTF('hivemall.tools.GenerateSeriesUDTF', '(1, ${n_parallel_datagen})');
define lr_datagen HiveUDTF('hivemall.dataset.LogisticRegressionDataGeneratorUDTF', '("-n_examples 100")');
rmf lrdata1k
a = load '1.txt';
b = foreach a generate flatten(generate_series(1, ${n_parallel_datagen}));
c = group b by $0 partition by RoundRobinPartitioner parallel ${n_parallel_datagen};
d = foreach c generate flatten(lr_datagen('-n_examples 100'));
store d into 'lrdata1k';
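The output name 'lrdata1k' suggests that the 10 groups each emit 100 examples, for 1,000 rows in total. A minimal sketch for checking the row count is shown here, assuming the generator is invoked once per group and that the output was written with Pig's default storage; the aliases gen, grp, and cnt are illustrative.

```pig
-- Load the generated dataset back (assumes default PigStorage output).
gen = load 'lrdata1k';
-- Collapse all rows into a single group so they can be counted.
grp = group gen all;
-- COUNT_STAR counts every tuple, including those with null fields.
cnt = foreach grp generate COUNT_STAR(gen);
dump cnt;
```

If the count differs from n_parallel_datagen times the -n_examples value, the generator was probably not invoked once per group as assumed.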