
[HIVEMALL-101] Separate optimizer implementation #79

Closed
takuti wants to merge 36 commits into apache:master from takuti:HIVEMALL-101

Conversation

@takuti (Member) commented May 15, 2017

What changes were proposed in this pull request?

Finalize #14

What type of PR is it?

Improvement, Feature

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-101

How was this patch tested?

  • Unit test
  • Manual test on EMR

Todo:

  • Compare the accuracy of -loss logloss with the current logress() UDTF
  • Add documentation with a general explanation of SGD, regularization, and optimization techniques, as MLlib and sklearn provide
  • Support the -mini_batch option in a similar way to what RegressionBaseUDTF does; accumulate gradients over M samples and update with the mean value
  • Save samples to external files as the other UDTFs do (see the LDA/pLSA UDTFs) and add an -iter option. This should be a separate issue [HIVEMALL-108]
  • Manage cumulative loss for future -iter support
  • Anything else for mix-server-related things?

}

@Override
protected final void checkTargetValue(final float target) throws UDFArgumentException {
takuti (Member Author): @maropu This is a regressor that simply predicts real values. Why did you create this method? Are only values in [0, 1] allowed?

myui (Member): @takuti Maybe it is for logistic regression, which is actually a classifier taking 0/1 target values. @maropu is not an expert in machine learning algorithms.

takuti (Member Author): @myui Ah, that makes sense, since the generic regressor originally used LossFunctions.logisticLoss(target, predicted). Thanks!

@coveralls commented May 16, 2017

Coverage increased (+0.3%) to 38.948% when pulling 2b965fc on takuti:HIVEMALL-101 into 68f6b46 on apache:master.

@myui (Member) commented May 16, 2017

@takuti checkTargetValue() is needed for some loss functions, e.g., the logistic loss.

@takuti (Member Author) commented May 16, 2017

Yep, that's why logistic loss is not selectable for now. checkTargetValue() will come back later.
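
For reference, a hypothetical sketch of what such a target check could look like for the logistic loss (illustrative only, not necessarily the actual Hivemall implementation):

// Illustrative only. Assumes org.apache.hadoop.hive.ql.exec.UDFArgumentException is imported,
// matching the method signature shown in the diff above.
@Override
protected final void checkTargetValue(final float target) throws UDFArgumentException {
    // Logistic loss treats the target as a probability/binary label, so restrict it to [0, 1].
    if (target < 0.f || target > 1.f) {
        throw new UDFArgumentException(
            "target must be in [0, 1] for logistic loss, but got " + target);
    }
}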

@coveralls commented May 18, 2017

Changes unknown when pulling c57d09e on takuti:HIVEMALL-101 into apache:master.

@coveralls commented May 19, 2017

Coverage increased (+0.4%) to 39.072% when pulling 0f26894 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@takuti (Member Author) commented May 19, 2017

^^^ Since the generic regressor does not accept classification losses (e.g., logloss), just like sklearn, I have removed checkTargetValue() from the GeneralRegression class again.

@coveralls commented May 19, 2017

Coverage increased (+0.6%) to 39.251% when pulling 34cf8a1 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@takuti (Member Author) commented May 19, 2017

I listed the TODOs in the top comment. If there is anything else I need to take care of, please let me know.

@coveralls commented May 19, 2017

Coverage increased (+0.5%) to 39.22% when pulling 5dc6f4e on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@myui (Member) commented May 20, 2017

@takuti -iter support should be another ticket. -mini_batch support can be within this ticket.

Functional tests confirming the accuracy of -loss logistic against the existing logress are required.

@coveralls commented May 22, 2017

Coverage increased (+0.7%) to 39.438% when pulling c3b89f8 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@coveralls commented May 22, 2017

Coverage increased (+0.7%) to 39.424% when pulling c3b89f8 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@takuti (Member Author) commented May 22, 2017

I added support for the -mini_batch option for both the regressor and the classifier (shared code).

The idea is simply to accumulate the new_weight values obtained from optimizer.update(). Once miniBatchSize samples have been observed, the mean of the accumulated new_weight values is set to the model via model.setWeight().
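
For reference, a minimal sketch of that accumulation logic (not the actual UDTF code; the class, field, and method names below are illustrative):

// Minimal sketch only: accumulate the weights returned by optimizer.update() per feature
// over miniBatchSize samples, then set the per-feature mean to the model.
import java.util.HashMap;
import java.util.Map;

final class MiniBatchAccumulator {
    private final int miniBatchSize;
    private final Map<Object, Float> sumOfNewWeights = new HashMap<>();
    private final Map<Object, Integer> counts = new HashMap<>();
    private int observed = 0;

    MiniBatchAccumulator(int miniBatchSize) {
        this.miniBatchSize = miniBatchSize;
    }

    /** Record the new_weight returned by optimizer.update() for a feature. */
    void accumulate(Object feature, float newWeight) {
        sumOfNewWeights.merge(feature, newWeight, Float::sum);
        counts.merge(feature, 1, Integer::sum);
    }

    /** Call once per training sample; flushes mean weights every miniBatchSize samples. */
    void onSampleProcessed(WeightSetter model) {
        if (++observed % miniBatchSize == 0) {
            for (Map.Entry<Object, Float> e : sumOfNewWeights.entrySet()) {
                float mean = e.getValue() / counts.get(e.getKey());
                model.setWeight(e.getKey(), mean); // mean of the accumulated new_weight
            }
            sumOfNewWeights.clear();
            counts.clear();
        }
    }

    interface WeightSetter {
        void setWeight(Object feature, float weight);
    }
}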

For SGD, this is clearly equivalent to what RegressionBaseUDTF does. However, I'm a little concerned about whether the same approach is valid for Adagrad, Adam, Adadelta, and AdagradRDA. (Currently, it is allowed for Adagrad, Adam, and Adadelta; by contrast, AdagradRDA with the -mini_batch option is not supported.)

BTW, in practice, I observed that the naive Adagrad + -mini_batch implementation seems to work correctly, as shown in the next comment:

@takuti (Member Author) commented May 22, 2017

I tested the generic classifier and regressor on EMR using the a9a dataset.

Classifier

set hivevar:n_samples=16281;
set hivevar:total_steps=32562;

logress

drop table if exists logress_model;
create table logress_model as
select
 feature,
 avg(weight) as weight
from
 (
  select
     logress(add_bias(features), label, '-total_steps ${total_steps}') as (feature, weight)
     -- logress(add_bias(features), label, '-total_steps ${total_steps} -mini_batch 10') as (feature, weight)
  from
     train_x3
 ) t
group by feature;
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob,
    CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
  from
    test_exploded t LEFT OUTER JOIN
    logress_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.label as predicted,
    pd.prob as probability
  from
    test t JOIN predict pd
      on (t.rowid = pd.rowid)
)
select count(1) / ${n_samples} from submit
where actual = predicted;

train_classifier

train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}') as (feature, weight)
-- train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps} -mini_batch 10') as (feature, weight)

Results were exactly the same:

                   online                mini-batch
logress            0.8414716540753026    0.848965051286776
train_classifier   0.8414716540753026    0.848965051286776

Regression

Solved the a9a label prediction as a regression problem.

Note: since the non-generic Adagrad was designed for logistic loss (i.e., classification), we cannot compare it with the generic regressor under exactly the same conditions.

train_adagrad_regr (internally uses logistic loss)

drop table if exists adagrad_model;
create table adagrad_model as
select
 feature,
 avg(weight) as weight
from
 (
  select
     train_adagrad_regr(features, label) as (feature, weight)
  from
     train_x3
 ) t
group by feature;
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob
  from
    test_exploded t LEFT OUTER JOIN
    adagrad_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.prob as probability
  from
    test t JOIN predict pd
      on (t.rowid = pd.rowid)
)
select rmse(probability, actual) from submit;

train_regression

train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no') as (feature, weight)
-- train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no -mini_batch 10') as (feature, weight)

                                      online                mini-batch
train_adagrad_regr (logistic loss)    0.3254586866367811    --
train_regression (squared loss)       0.3356422627079689    0.3348889704327727

As I mentioned in the previous comment, I'm not fully sure whether the -mini_batch option works correctly for Adagrad. Fortunately, this example shows that the option slightly improved the prediction accuracy in terms of RMSE.

@coveralls commented May 22, 2017

Coverage increased (+0.7%) to 39.422% when pulling f98bc73 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@myui (Member) commented May 22, 2017

@takuti train() can be modified to return the current loss, and the cumulative loss should be managed for future iteration support, e.g., using
https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/common/ConversionState.java
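
A rough sketch of the idea, assuming train() returns a per-sample loss (this does not use ConversionState's actual API; the class and method names are illustrative):

// Illustrative sketch only; ConversionState's real API may differ.
final class CumulativeLoss {
    private double totalLoss = 0.d;
    private long observations = 0L;

    /** Accumulate the loss value returned by train() for one sample. */
    void add(double loss) {
        totalLoss += loss;
        observations++;
    }

    /** Mean loss over the current pass, usable as a convergence criterion for future -iter support. */
    double meanLoss() {
        return observations == 0L ? 0.d : totalLoss / observations;
    }

    /** Reset between iterations. */
    void reset() {
        totalLoss = 0.d;
        observations = 0L;
    }
}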

@myui (Member) commented May 23, 2017

@takuti I guess there are no mix-server-related issues in this PR. I will review for that, though.

@coveralls

Coverage increased (+1.07%) to 39.767% when pulling 0d573a0 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@coveralls commented May 23, 2017

Coverage increased (+0.7%) to 39.441% when pulling 0d573a0 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

takuti changed the title from [WIP][HIVEMALL-101] Separate optimizer implementation to [HIVEMALL-101] Separate optimizer implementation on May 24, 2017
@takuti (Member Author) commented May 24, 2017

@myui It's basically done. Could you review it when you get a chance?

One thing I'd like to discuss here is that GeneralClassifierUDTF and GeneralRegressionUDTF currently have a lot of duplicated code. Thus, the current class structure

  • Learner Base
    • Binary Online Classifier
      • General Classifier
    • Regression Base
      • General Regression

can be modified to

  • Learner Base
    • General Predictor Base
      • General Classifier
      • General Regression

for example.

If that sounds good to @myui, I will do so. Of course, it's not mandatory, so keeping the current duplicated code is also fine.

@coveralls commented May 24, 2017

Coverage increased (+0.7%) to 39.422% when pulling 2724dbc on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@myui (Member) commented May 24, 2017

@takuti It's preferred to have an abstract class. Please create it.

  • hivemall.LearnerBase
    • hivemall.GeneralLearnerBase
      • hivemall.classifier.GeneralClassifier
      • hivemall.regression.GeneralRegression
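
A hypothetical skeleton of that structure (only the package/class layout follows the list above; the member names are illustrative, not the actual Hivemall code):

// Hypothetical skeleton only; method names below are illustrative.
package hivemall;

public abstract class GeneralLearnerBase {

    // Shared logic for the generic classifier and regressor would live here,
    // e.g., parsing of the -loss/-opt/-reg/-eta options and the mini-batch update loop.

    /** Validate the target value against the selected loss function. */
    protected abstract void checkTargetValue(float target);

    /** Map the raw score to the final prediction (probability, class, or real value). */
    protected abstract float predict(float score);
}

// hivemall.classifier.GeneralClassifier and hivemall.regression.GeneralRegression
// would then extend hivemall.GeneralLearnerBase and override only what differs.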

@takuti (Member Author) commented May 24, 2017

@myui Finished~

takuti added 23 commits June 9, 2017 10:09

  • for regression and classification, respectively.
  • + updated the order of loss functions.
  • `loss_function` is not a part of Optimizer
  • Use appropriate (i.e. strongly correlated) data
  • Target value has to be float OIs
  • except for AdagradRDA
  • update unit tests accordingly
  • and update Regularizer implementation to integrate L1/L2 with ElasticNet
  • It's more useful for the future `-iter` support
@coveralls commented Jun 9, 2017

Coverage increased (+0.5%) to 39.978% when pulling 5439bd8 on takuti:HIVEMALL-101 into 1db5358 on apache:master.

asfgit closed this in 3848ea6 on Jun 14, 2017
@myui (Member) commented Jun 14, 2017

@maropu @takuti Finally merged this huge patch. Thank you for your contribution!
