
[HIVEMALL-101] Separate optimizer implementation #79

Closed
takuti wants to merge 36 commits into apache:master from takuti:HIVEMALL-101

Conversation

@takuti (Member) commented May 15, 2017

What changes were proposed in this pull request?

Finalize #14

What type of PR is it?

Improvement, Feature

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-101

How was this patch tested?

  • Unit test
  • Manual test on EMR

Todo:

  • Compare the accuracy of -loss logloss with the current logress() UDTF
  • Add documentation with a general explanation of SGD, regularization, and optimization techniques, as MLlib and sklearn provide
  • Support the -mini_batch option in a similar way to what RegressionBaseUDTF does; accumulate gradients over M samples and update with the mean value
  • Save samples to external files as the other UDTFs do (see the LDA/pLSA UDTFs) and add an -iter option. This should be a separate issue [HIVEMALL-108]
  • Manage cumulative loss for future -iter support
  • Anything else for mix-server-related things?

}

@Override
protected final void checkTargetValue(final float target) throws UDFArgumentException {
takuti (Member Author): @maropu This is a regressor that simply predicts real values. Why did you create this method? Are only values in [0, 1] allowed?

myui (Member): @takuti Maybe it is for logistic regression, which is actually a classifier taking 0/1 target values. @maropu is not an expert in machine learning algorithms.

takuti (Member Author): @myui Ah, that makes sense, since the generic regressor originally used LossFunctions.logisticLoss(target, predicted). Thanks!

@coveralls commented May 16, 2017

Coverage increased (+0.3%) to 38.948% when pulling 2b965fc on takuti:HIVEMALL-101 into 68f6b46 on apache:master.

@myui (Member) commented May 16, 2017

@takuti checkTargetValue() is needed for some loss functions, e.g., the logistic loss.

@takuti (Member Author) commented May 16, 2017

Yep, that's why logistic loss is not selectable for now. checkTargetValue() will come back later.
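
For reference, a hypothetical sketch of what such a target check could look like for the logistic loss (illustrative only, not necessarily the actual Hivemall implementation):

// Illustrative only. Assumes org.apache.hadoop.hive.ql.exec.UDFArgumentException is imported,
// matching the method signature shown in the diff above.
@Override
protected final void checkTargetValue(final float target) throws UDFArgumentException {
    // Logistic loss treats the target as a probability/binary label, so restrict it to [0, 1].
    if (target < 0.f || target > 1.f) {
        throw new UDFArgumentException(
            "target must be in [0, 1] for logistic loss, but got " + target);
    }
}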

@coveralls commented May 18, 2017

Changes unknown when pulling c57d09e on takuti:HIVEMALL-101 into apache:master.

@coveralls commented May 19, 2017

Coverage increased (+0.4%) to 39.072% when pulling 0f26894 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@takuti (Member Author) commented May 19, 2017

^^^ Since the generic regressor does not accept classification losses (e.g., logloss), just like sklearn, I have removed checkTargetValue() from the GeneralRegression class again.

@coveralls commented May 19, 2017

Coverage increased (+0.6%) to 39.251% when pulling 34cf8a1 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@takuti (Member Author) commented May 19, 2017

I listed the TODOs in the top comment. If there is anything else I need to take care of, please let me know.

@coveralls commented May 19, 2017

Coverage increased (+0.5%) to 39.22% when pulling 5dc6f4e on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@myui (Member) commented May 20, 2017

@takuti -iter support should be another ticket. -mini_batch support can be within this ticket.

Functional tests confirming the accuracy of -loss logistic against the existing logress are required.

@coveralls commented May 22, 2017

Coverage increased (+0.7%) to 39.438% when pulling c3b89f8 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@coveralls commented May 22, 2017

Coverage increased (+0.7%) to 39.424% when pulling c3b89f8 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@takuti (Member Author) commented May 22, 2017

I added support for the -mini_batch option for both the regressor and the classifier (shared code).

The idea is simply to accumulate the new_weight values obtained from optimizer.update(). Once miniBatchSize samples have been observed, the mean of the accumulated new_weight values is set to the model via model.setWeight().
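
For reference, a minimal sketch of that accumulation logic (not the actual UDTF code; the class, field, and method names below are illustrative):

// Minimal sketch only: accumulate the weights returned by optimizer.update() per feature
// over miniBatchSize samples, then set the per-feature mean to the model.
import java.util.HashMap;
import java.util.Map;

final class MiniBatchAccumulator {
    private final int miniBatchSize;
    private final Map<Object, Float> sumOfNewWeights = new HashMap<>();
    private final Map<Object, Integer> counts = new HashMap<>();
    private int observed = 0;

    MiniBatchAccumulator(int miniBatchSize) {
        this.miniBatchSize = miniBatchSize;
    }

    /** Record the new_weight returned by optimizer.update() for a feature. */
    void accumulate(Object feature, float newWeight) {
        sumOfNewWeights.merge(feature, newWeight, Float::sum);
        counts.merge(feature, 1, Integer::sum);
    }

    /** Call once per training sample; flushes mean weights every miniBatchSize samples. */
    void onSampleProcessed(WeightSetter model) {
        if (++observed % miniBatchSize == 0) {
            for (Map.Entry<Object, Float> e : sumOfNewWeights.entrySet()) {
                float mean = e.getValue() / counts.get(e.getKey());
                model.setWeight(e.getKey(), mean); // mean of the accumulated new_weight
            }
            sumOfNewWeights.clear();
            counts.clear();
        }
    }

    interface WeightSetter {
        void setWeight(Object feature, float weight);
    }
}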

For SGD, this is clearly equivalent to what RegressionBaseUDTF does. However, I'm a little concerned about whether the same approach is valid for Adagrad, Adam, Adadelta, and AdagradRDA. (Currently, it is allowed for Adagrad, Adam, and Adadelta; by contrast, AdagradRDA with the -mini_batch option is not supported.)

BTW, in practice, I observed that the naive Adagrad + -mini_batch implementation seems to work correctly, as shown in the next comment:

@takuti (Member Author) commented May 22, 2017

I tested the generic classifier and regressor on EMR using the a9a dataset.

Classifier

set hivevar:n_samples=16281;
set hivevar:total_steps=32562;

logress

drop table if exists logress_model;
create table logress_model as
select
 feature,
 avg(weight) as weight
from
 (
  select
     logress(add_bias(features), label, '-total_steps ${total_steps}') as (feature, weight)
     -- logress(add_bias(features), label, '-total_steps ${total_steps} -mini_batch 10') as (feature, weight)
  from
     train_x3
 ) t
group by feature;
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob,
    CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
  from
    test_exploded t LEFT OUTER JOIN
    logress_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.label as predicted,
    pd.prob as probability
  from
    test t JOIN predict pd
      on (t.rowid = pd.rowid)
)
select count(1) / ${n_samples} from submit
where actual = predicted;

train_classifier

train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}') as (feature, weight)
-- train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps} -mini_batch 10') as (feature, weight)

Results were exactly the same:

                   online                mini-batch
logress            0.8414716540753026    0.848965051286776
train_classifier   0.8414716540753026    0.848965051286776

Regression

Solved the a9a label prediction as a regression problem.

Note: since the non-generic Adagrad was designed for logistic loss (i.e., classification), we cannot compare it with the generic regressor under exactly the same conditions.

train_adagrad_regr (internally uses logistic loss)

drop table if exists adagrad_model;
create table adagrad_model as
select
 feature,
 avg(weight) as weight
from
 (
  select
     train_adagrad_regr(features, label) as (feature, weight)
  from
     train_x3
 ) t
group by feature;
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob
  from
    test_exploded t LEFT OUTER JOIN
    adagrad_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.prob as probability
  from
    test t JOIN predict pd
      on (t.rowid = pd.rowid)
)
select rmse(probability, actual) from submit;

train_regression

train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no') as (feature, weight)
-- train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no -mini_batch 10') as (feature, weight)

                                      online                mini-batch
train_adagrad_regr (logistic loss)    0.3254586866367811    --
train_regression (squared loss)       0.3356422627079689    0.3348889704327727

As I mentioned in the previous comment, I'm not fully sure whether the -mini_batch option works correctly for Adagrad. Fortunately, this example shows that the option slightly improved the prediction accuracy in terms of RMSE.

@coveralls commented May 22, 2017

Coverage increased (+0.7%) to 39.422% when pulling f98bc73 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@myui (Member) commented May 22, 2017

@takuti train() can be modified to return the current loss, and the cumulative loss should be managed for future iteration support, e.g., using
https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/common/ConversionState.java
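
A rough sketch of the idea, assuming train() returns a per-sample loss (this does not use ConversionState's actual API; the class and method names are illustrative):

// Illustrative sketch only; ConversionState's real API may differ.
final class CumulativeLoss {
    private double totalLoss = 0.d;
    private long observations = 0L;

    /** Accumulate the loss value returned by train() for one sample. */
    void add(double loss) {
        totalLoss += loss;
        observations++;
    }

    /** Mean loss over the current pass, usable as a convergence criterion for future -iter support. */
    double meanLoss() {
        return observations == 0L ? 0.d : totalLoss / observations;
    }

    /** Reset between iterations. */
    void reset() {
        totalLoss = 0.d;
        observations = 0L;
    }
}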

@myui (Member) commented May 23, 2017

@takuti I guess there are no mix-server-related issues in this PR. I will review for that, though.

@coveralls

Coverage increased (+1.07%) to 39.767% when pulling 0d573a0 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@coveralls commented May 23, 2017

Coverage increased (+0.7%) to 39.441% when pulling 0d573a0 on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

takuti changed the title from [WIP][HIVEMALL-101] Separate optimizer implementation to [HIVEMALL-101] Separate optimizer implementation on May 24, 2017
@takuti (Member Author) commented May 24, 2017

@myui It's basically done. Could you review it when you get a chance?

One thing I'd like to discuss here is that GeneralClassifierUDTF and GeneralRegressionUDTF currently have a lot of duplicated code. Thus, the current class structure

  • Learner Base
    • Binary Online Classifier
      • General Classifier
    • Regression Base
      • General Regression

can be modified to

  • Learner Base
    • General Predictor Base
      • General Classifier
      • General Regression

for example.

If that sounds good to @myui, I will do so. Of course, it's not mandatory, so keeping the current duplicated code is also fine.

@coveralls commented May 24, 2017

Coverage increased (+0.7%) to 39.422% when pulling 2724dbc on takuti:HIVEMALL-101 into 10e7d45 on apache:master.

@myui (Member) commented May 24, 2017

@takuti It's preferred to have an abstract class. Please create it.

  • hivemall.LearnerBase
    • hivemall.GeneralLearnerBase
      • hivemall.classifier.GeneralClassifier
      • hivemall.regression.GeneralRegression
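
A hypothetical skeleton of that structure (only the package/class layout follows the list above; the member names are illustrative, not the actual Hivemall code):

// Hypothetical skeleton only; method names below are illustrative.
package hivemall;

public abstract class GeneralLearnerBase {

    // Shared logic for the generic classifier and regressor would live here,
    // e.g., parsing of the -loss/-opt/-reg/-eta options and the mini-batch update loop.

    /** Validate the target value against the selected loss function. */
    protected abstract void checkTargetValue(float target);

    /** Map the raw score to the final prediction (probability, class, or real value). */
    protected abstract float predict(float score);
}

// hivemall.classifier.GeneralClassifier and hivemall.regression.GeneralRegression
// would then extend hivemall.GeneralLearnerBase and override only what differs.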

@takuti (Member Author) commented May 24, 2017

@myui Finished~

takuti added 23 commits June 9, 2017 10:09

  • for regression and classification, respectively.
  • + updated the order of loss functions.
  • `loss_function` is not a part of Optimizer
  • Use appropriate (i.e. strongly correlated) data
  • Target value has to be float OIs
  • except for AdagradRDA
  • update unit tests accordingly
  • and update Regularizer implementation to integrate L1/L2 with ElasticNet
  • It's more useful for the future `-iter` support
@coveralls commented Jun 9, 2017

Coverage increased (+0.5%) to 39.978% when pulling 5439bd8 on takuti:HIVEMALL-101 into 1db5358 on apache:master.

asfgit closed this in 3848ea6 on Jun 14, 2017
@myui (Member) commented Jun 14, 2017

@maropu @takuti Finally merged this huge patch. Thank you for your contribution!
