
[SPARK-2309][MLlib] Generalize the binary logistic regression into multinomial logistic regression #1379

Merged into apache:master on Aug 3, 2014

Conversation

@dbtsai
Member

commented Jul 12, 2014

Currently, there is no multi-class classifier in MLlib. Logistic regression can be extended to a multinomial classifier straightforwardly.
The following formula will be implemented:
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25

Note: In multi-class mode there are multiple intercepts, so we don't use the single intercept in GeneralizedLinearModel and instead fold all of the intercepts into the weights. This introduces an inconsistency: in binary mode the intercept cannot be specified by users, but in multinomial mode, because the intercepts are combined into the weights, users can specify them.

@mengxr Should we just deprecate the intercept and keep everything in weights? It makes sense from an optimization point of view, and it also makes the interface cleaner. Thanks.
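
For illustration, here is a minimal sketch (not the PR's code; the helper name is made up) of how intercepts can be folded into the weight vector by appending a constant bias term to each feature vector:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch only: append a constant 1.0 "bias" feature so that every class's
// intercept becomes just another entry of the weight vector.
def appendBias(features: Vector): Vector = {
  val values = features.toArray
  Vectors.dense(values :+ 1.0) // the last weight of each class block acts as its intercept
}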

@SparkQA


commented Jul 12, 2014

QA tests have started for PR 1379. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16579/consoleFull

@SparkQA


commented Jul 12, 2014

QA results for PR 1379:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* as used in multi-class classification (it is also used in binary logistic regression).

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16579/consoleFull

@mengxr

Contributor

commented Jul 21, 2014

Jenkins, retest this please.

@dbtsai

Member Author

commented Jul 21, 2014

I think it fails because the Apache license header is not in the test data file. As you suggested, I'll generate the data at runtime instead. I'd like to hear the general feedback first; I'll make the tests pass tomorrow. Thanks.

@SparkQA


commented Jul 22, 2014

QA tests have started for PR 1379. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16937/consoleFull

@SparkQA


commented Jul 22, 2014

QA results for PR 1379:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16937/consoleFull

@mengxr

Contributor

commented Jul 22, 2014

It is easier to review if it passes the tests. @SparkQA shows new public classes and interface changes. Could you remove the data file and generate some synthetic data for unit tests? Thanks!

asfgit merged commit 18f29b9 into apache:master on Aug 3, 2014

@dbtsai

Member Author

commented Aug 3, 2014

@mengxr Is there a problem with asfgit? This is not finished yet, so why does asfgit say it's merged into apache:master?

@mengxr

Contributor

commented Aug 3, 2014

... I have no idea. Let me check.

@mengxr

Contributor

commented Aug 3, 2014

@pwendell I didn't see Closes #1379 in the merged commit. Is something wrong with asfgit?

@BigCrunsh

Contributor

commented Oct 28, 2014

What is the current state of the PR? I can't see any changes in the code...

@dbtsai

Member Author

commented Oct 28, 2014

@BigCrunsh I'm working on this. Let's see if we can get it merged in Spark 1.2.

@avulanov

Contributor

commented Nov 19, 2014

@dbtsai Hi! What is the current state of this PR? I would like to download and test it. Could you point me to the sources?

@avulanov

Contributor

commented Nov 20, 2014

Apparently, I've found this implementation: https://github.com/dbtsai/spark/tree/dbtsai-mlor. It did work on my examples, producing reasonable results. Could you comment on the following: why is the number of parameters (weights) equal to (num_features + 1)(num_classes - 1)? I would expect (num_features + 1)(num_classes), as it is here, for example: http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression

@dbtsai

Member Author

commented Nov 21, 2014

@avulanov I will merge this for Spark 1.3; sorry for the delay, I've been very busy recently. Yes, the branch you found should work, but it cannot be merged cleanly into upstream, and I'm working on that. You can try that branch for now. Also, that branch doesn't use LBFGS as the optimizer, so the convergence rate will be slow.

Basically, you can model the whole problem with (num_features + 1)(num_classes) parameters, but the solution will not be unique. You can choose one of the classes as a base class to make the solution unique, and I chose the first class as the base class. See "Properties of softmax regression parameterization" in the wiki page you referred to, or my presentation http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 for more technical detail. Think of binary logistic regression: you only have (num_features + 1) coefficients instead of 2 * (num_features + 1).
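
For readers following along, a minimal sketch of the pivoted parameterization (illustrative only, not the PR's code): class 0 is the base class, and weights holds (num_classes - 1) blocks, each of length num_features (with the intercept already folded in as an extra feature).

// Pivoted softmax: P(y = 0 | x) = 1 / (1 + sum_i exp(w_i . x)),
// P(y = k | x) = exp(w_k . x) / (1 + sum_i exp(w_i . x)) for k = 1 .. numClasses - 1.
def classProbabilities(x: Array[Double], weights: Array[Double], numClasses: Int): Array[Double] = {
  val numFeatures = x.length
  require(weights.length == numFeatures * (numClasses - 1))
  val expMargins = Array.tabulate(numClasses - 1) { i =>
    var dot = 0.0
    var j = 0
    while (j < numFeatures) { dot += weights(i * numFeatures + j) * x(j); j += 1 }
    math.exp(dot)
  }
  val denom = 1.0 + expMargins.sum // the base class contributes exp(0) = 1
  Array(1.0 / denom) ++ expMargins.map(_ / denom)
}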

@avulanov

Contributor

commented Nov 21, 2014

@dbtsai Thanks for the explanation! Do I understand correctly that, if I want to get (num_features + 1)*(num_classes) parameters from your model, I need to prepend a zero vector of length (num_features + 1) to the vector that your model returns as model.weights?

@dbtsai

Member Author

commented Nov 21, 2014

No, in the algorithm I already model the problem as in http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/24 , so there are always only (num_features + 1)(num_classes - 1) parameters. Of course, you can choose any transformation to make it over-parameterized; see the "Properties of softmax regression parameterization" section in the wiki for details.
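
As an aside, under the pivoting described above, the trivial over-parameterizing transformation is to give the base class an all-zero block; a sketch (illustrative only, not part of the model itself):

// Expand the pivoted weights into the (num_features + 1)(num_classes) form by
// prepending a zero block for the base class. The probabilities are unchanged,
// since softmax only depends on the differences between class blocks.
def toOverParameterized(pivotedWeights: Array[Double], blockSize: Int): Array[Double] =
  Array.fill(blockSize)(0.0) ++ pivotedWeights // blockSize = num_features + 1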

@avulanov

Contributor

commented Dec 3, 2014

@dbtsai I've tried your implementation with the LBFGS optimizer, and it seems to have similar performance, in terms of running time and accuracy, to the SGD you have right now. Do you think it's worth testing against our implementation of an artificial neural network with no hidden layer (#1290)? It uses a different cost function, but it still might be interesting to compare.

@dbtsai

Member Author

commented Dec 3, 2014

@avulanov Sure, it would be interesting to see the comparison. Let me know the result once you have it. I'm going to get this merged for 1.3, so it will be easier to use in the future.

@avulanov

Contributor

commented Dec 6, 2014

@dbtsai Here are the results of my tests:

  • Settings:
  • Result
    • ANN classifier: training time: 00:47:55; accuracy: 0.848
    • MLOR: training time: 01:30:45; accuracy: 0.864
  • Average gradient(?) compute time (mapPartitionsWithIndex at RDDFunctions.scala:108)
    • ANN classifier: 51 seconds
    • MLOR: 2.1 minutes
  • Average update(optimize?) time (reduce at RDDFunctions.scala:112)
    • ANN classifier: 90 ms
    • MLOR: 90 ms

It seems that ANN is almost 2x faster (with the mentioned settings), though its accuracy is 1.6% lower. The difference in accuracy can be explained by the fact that ANN uses a (half) squared error cost function instead of cross entropy, and no softmax; the latter are supposed to be better for classification.

@dbtsai

Member Author

commented Dec 8, 2014

@avulanov I did a couple of performance tunings in the MLOR gradient calculation in my company's proprietary implementation, which make it 4x faster than the open-source one on GitHub that you tested. I'm trying to open-source it and merge it into Spark soon. (PS: a simple polynomial expansion with MLOR can increase the mnist8m accuracy from 86% to 94% in my experiments. See Prof. CJ Lin's talk: https://www.youtube.com/watch?v=GCIJP0cLSmU )
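
For context, a minimal sketch of a degree-2 polynomial feature expansion (illustrative only; not the exact pipeline used in the experiment above):

// Keep the original features and append all pairwise products x_i * x_j (i <= j),
// mapping the data into a higher-dimensional space before running MLOR.
def polyExpand2(x: Array[Double]): Array[Double] = {
  val pairs = for {
    i <- x.indices
    j <- i until x.length
  } yield x(i) * x(j)
  x ++ pairs
}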

@jkbradley

Member

commented Dec 8, 2014

@avulanov Nice tests! A few comments:

  • Computing accuracy: It would be good to test on the original MNIST test set, rather than a subset of the training set. The training set includes a bunch of duplicates of images with slight modifications, so results on it might be misleading.
  • The timing tests look pretty convincing for ANN! Can you please confirm whether both algorithms did all 40 iterations? Or did they sometimes stop early b/c of the convergence tolerance?
@avulanov

Contributor

commented Dec 9, 2014

@dbtsai 1) Could you elaborate on what kinds of optimizations you did? They could probably be applied to the broader MLlib, which would be beneficial. 2) Do you know why our ANN implementation worked faster than the MLOR you shared? This could also be interesting in terms of MLlib optimization. 3) Did you mean fitting an n-th degree polynomial instead of a linear function? Thanks for the link, it seems very interesting!

@avulanov

Contributor

commented Dec 9, 2014

@jkbradley Thank you! They took some time.

  • I totally agree with you; I need to perform tests on the original test set. It contains fewer attributes, i.e. 778 vs. 784 in mnist8m, so one needs to pad it with zeros to make it compatible.
  • They both did all 40 iterations.
@dbtsai

Member Author

commented Dec 9, 2014

@avulanov

  1. I did the same optimizations for MLlib in my recent PRs (see the sketch after this list).
     • Accessing the values in a dense/sparse vector directly is very slow without holding a local reference to the primitive array, because of the dereferencing. See #3577 and #3435; there is a bytecode analysis of this issue in #3435.
     • Breeze's foreachActive is very slow, so I implemented a 4x faster version in #3288. My experience is that if Breeze is used in a critical code path, one has to be cautious.
  2. I haven't checked out your ANN implementation yet, but I will check today. I'll send you our optimized gradient computation code for MLOR. It will be interesting to see the new benchmark compared with the one you tested.
  3. See page 27 of Prof. CJ Lin's slides: http://www.csie.ntu.edu.tw/~cjlin/talks/SFmeetup.pdf It's just doing feature expansion by mapping the data into a higher-dimensional space.
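
A minimal sketch of the first point (illustrative only, not the PR's code): hoisting the primitive values array out of the hot loop avoids repeated dereferencing through the Vector wrapper.

import org.apache.spark.mllib.linalg.DenseVector

def dotSlow(v: DenseVector, w: DenseVector): Double = {
  var sum = 0.0
  var i = 0
  while (i < v.size) { sum += v(i) * w(i); i += 1 } // every v(i)/w(i) goes through the wrapper
  sum
}

def dotFast(v: DenseVector, w: DenseVector): Double = {
  val vv = v.values // local references to the primitive arrays
  val wv = w.values
  var sum = 0.0
  var i = 0
  while (i < vv.length) { sum += vv(i) * wv(i); i += 1 }
  sum
}
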
@avulanov

Contributor

commented Dec 10, 2014

@dbtsai Thank you, I look forward to your code so I can run benchmarks. Thanks again for the video! I've enjoyed it, especially the Q&A after the talk. At 51:23 Prof. CJ Lin mentions that "we released a dataset of about 600 gigabytes". Do you know where I can download it? It should be quite a challenging workload for classification in Spark! Update: is it this one? http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site

@dbtsai

Member Author

commented Dec 10, 2014

@avulanov I remember CJ Lin saying he posted the 600GB dataset on his website.

@avulanov

Contributor

commented Dec 17, 2014

@dbtsai Hi! Did you have a chance to check our implementation and send me the optimized one?

@dbtsai

Member Author

commented Dec 19, 2014

@avulanov I haven't checked your implementation yet, but the optimized MLOR is ready for you to test. Can you try the LogisticGradient in https://github.com/AlpineNow/spark/commits/mlor ?

import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient

@DeveloperApi
class LogisticGradient extends Gradient {
  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val gradient = Vectors.zeros(weights.size)
    val loss = compute(data, label, weights, gradient)
    (gradient, loss)
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    assert((weights.size % data.size) == 0)
    val dataSize = data.size
    // n = numClasses - 1; class 0 is the base (pivot) class and carries no weights
    val n = (weights.size / dataSize)
    val numerators = Array.ofDim[Double](n)

    var denominator = 0.0
    var margin = 0.0

    val weightsArray = weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${weights.getClass}.")
    }
    val cumGradientArray = cumGradient match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"cumGradient only supports dense vector but got type ${cumGradient.getClass}.")
    }

    // First pass: compute margin w_i . x and exp(margin) for each non-base class i.
    var i = 0
    while (i < n) {
      var sum = 0.0
      data.foreachActive { (index, value) =>
        if (value != 0.0) sum += value * weightsArray((i * dataSize) + index)
      }
      if (i == label.toInt - 1) margin = sum
      numerators(i) = math.exp(sum)
      denominator += numerators(i)
      i += 1
    }

    // Second pass: accumulate each non-base class's gradient block into cumGradient.
    i = 0
    while (i < n) {
      val multiplier = numerators(i) / (denominator + 1.0) - {
        if (label != 0.0 && label == i + 1) 1.0 else 0.0
      }
      data.foreachActive { (index, value) =>
        if (value != 0.0) cumGradientArray(i * dataSize + index) += multiplier * value
      }
      i += 1
    }

    if (label > 0.0) {
      math.log1p(denominator) - margin
    } else {
      math.log1p(denominator)
    }
  }
}
@dbtsai

Member Author

commented Dec 20, 2014

@avulanov PS: you can just replace the gradient function without any other changes. Let me know how much performance gain you see; I'm very interested in this. Thanks.
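
For anyone trying this, a minimal sketch of wiring the LogisticGradient above into MLlib's LBFGS optimizer (the iteration and convergence settings are placeholders, and the feature vectors are assumed to already include the bias term if intercepts are wanted):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{LBFGS, SquaredL2Updater}
import org.apache.spark.rdd.RDD

// data: RDD of (label, features); labels are 0.0 .. numClasses - 1, with 0.0 as the base class.
def trainMLOR(data: RDD[(Double, Vector)], numFeatures: Int, numClasses: Int): Vector = {
  val optimizer = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
    .setNumIterations(100)
    .setConvergenceTol(1e-4)
  val initialWeights = Vectors.zeros(numFeatures * (numClasses - 1))
  optimizer.optimize(data, initialWeights)
}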

@avulanov

Contributor

commented Dec 20, 2014

@dbtsai Thank you! Should I use the latest Spark with this Gradient?

@dbtsai

Member Author

commented Dec 20, 2014

Yes, foreachActive is the new API in Spark 1.2.

@avulanov

Contributor

commented Dec 20, 2014

@dbtsai GeneralizedLinearAlgorithm throws an exception: org.apache.spark.SparkException: Input validation failed. Moreover, there is no regression with LBFGS. Probably I need to use some of your other files, as I did before. Should I clone https://github.com/AlpineNow/spark/commits/mlor and merge it with the latest Spark?

@dbtsai

Member Author

commented Dec 20, 2014

@avulanov The new branch is not finished yet. You need to rebase https://github.com/dbtsai/spark/tree/dbtsai-mlor onto master and just replace the gradient function.

@avulanov

Contributor

commented Dec 22, 2014

@dbtsai I did a local experiment on MNIST, and your new implementation seems to be more than 2x faster than the previous one! I am going to perform bigger experiments. In the meantime, could you say whether the optimizations you did are applicable to the ANN Gradient? That would be extremely helpful for us. https://github.com/bgreeven/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/ann/ArtificialNeuralNetwork.scala#L467

@avulanov

Contributor

commented Dec 23, 2014

New results of the experiments with the optimized ANN and MLOR are below. I used the same cluster of 6 machines with 12 workers total, the mnist8m dataset for training, and the standard MNIST test set converted to 784 attributes.

  • Results
    • ANN classifier: training time: 00:16:58 (was 00:47:55); accuracy: 0.9021
    • MLOR: training time: 00:09:46 (was 01:30:45); accuracy: 0.9084
  • Average step time (reduce at RDDFunctions.scala:112):
    • ANN classifier: 23 seconds (was 51 s)
    • MLOR: 14 seconds (was 2.1 mins)

ANN became ~3x and MLOR ~10x faster (!) than before. The current MLOR is ~60% faster than the current ANN. I assume ANN has the following overheads: 1) it uses back-propagation, so there are two matrix-vector multiplications, on the forward and backward passes; 2) it rolls the parameters stored in matrices into vector form. I will be happy to learn how these overheads can be reduced. We can't compare with the previously obtained accuracy because I used a different test set.

@dbtsai

Member Author

commented Dec 24, 2014

@avulanov The benchmark result you saw in a real-world cluster setup is very encouraging. Since I've been on vacation recently, I haven't actually deployed and benchmarked the new code on our cluster. It's great to see such a huge 10x performance gain (actually bigger than I expected; in my local single-machine testing, I only saw a 2~4x difference).

What optimizations did you do in your ANN implementation? The same things as in MLOR?

@mengxr Is it possible to reopen this closed PR on GitHub? There is a lot of useful discussion here, so I don't want to open another PR. I think I'm mostly done except for the unit tests, and I can push the code for review now, before our meeting. (PS: the new code is more general than the binary one, and it has the same performance in the binary special case in my local testing.)

@avulanov

Contributor

commented Jan 5, 2015

@dbtsai
Just back from vacation too:)

I used my old implementation of the matrix form of back-propagation and made sure that it properly uses the stride of matrices in breeze. Also, I optimized the rolling of parameters into a vector, combined with an in-place update of the cumulative sum.
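
For illustration (not your actual code), the in-place roll can look roughly like adding each layer's gradient matrix directly into its slice of the cumulative vector, assuming each matrix owns its data (no views):

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

// Roll per-layer gradient matrices into one flat cumulative vector in place.
// Breeze stores dense matrices column-major, so m.data can be traversed linearly.
def rollInto(cumGradient: BDV[Double], layerGrads: Seq[BDM[Double]]): Unit = {
  var offset = 0
  layerGrads.foreach { m =>
    val data = m.data
    var i = 0
    while (i < data.length) {
      cumGradient(offset + i) += data(i)
      i += 1
    }
    offset += data.length
  }
}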

@avulanov

Contributor

commented Jan 8, 2015

@dbtsai BTW, have you thought about batch processing of input vectors, i.e. stacking N vectors into a matrix and performing the optimization with this matrix instead of individual vectors? With native BLAS enabled, this might improve performance.
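
A sketch of the idea (illustrative only), assuming breeze with a native BLAS backend: stacking a batch of examples as the columns of a matrix turns many per-example dot products into a single matrix-matrix multiply (GEMM).

import breeze.linalg.{DenseMatrix => BDM}

// weights: (numClasses - 1) x numFeatures, batch: numFeatures x batchSize
// (one example per column). One GEMM produces all margins for the whole batch.
def batchMargins(weights: BDM[Double], batch: BDM[Double]): BDM[Double] = {
  require(weights.cols == batch.rows)
  weights * batch // (numClasses - 1) x batchSize matrix of margins
}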

@dbtsai

Member Author

commented Jan 8, 2015

@avulanov I've thought about that. However, @mengxr told me that they had an intern try this type of experiment last year, and they didn't see a significant performance gain. I'm thinking of implementing the whole gradient function in native code/SIMD, batching the input vectors as a matrix, since for MLOR the computation of the objective function is very expensive.

@avulanov

This comment has been minimized.

Copy link
Contributor

commented Jan 26, 2015

@dbtsai I did batching for artificial neural networks and the performance improved ~5x #1290 (comment)

asfgit pushed a commit that referenced this pull request Feb 3, 2015
[SPARK-2309][MLlib] Multinomial Logistic Regression
#1379 is automatically closed by asfgit, and github can not reopen it once it's closed, so this will be the new PR.

Binary Logistic Regression can be extended to Multinomial Logistic Regression by running K-1 independent Binary Logistic Regression models. The following formula is implemented.
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3833 from dbtsai/mlor and squashes the following commits:

4e2f354 [DB Tsai] triger jenkins
697b7c9 [DB Tsai] address some feedback
4ce4d33 [DB Tsai] refactoring
ff843b3 [DB Tsai] rebase
f114135 [DB Tsai] refactoring
4348426 [DB Tsai] Addressed feedback from Sean Owen
a252197 [DB Tsai] first commit
preaudc pushed a commit to preaudc/spark that referenced this pull request Apr 17, 2015