
[MLLIB] [spark-2352] Implementation of an Artificial Neural Network (ANN) #1290

Closed
wants to merge 143 commits

Conversation

bgreeven

@bgreeven bgreeven commented Jul 3, 2014

The code contains a multi-layer ANN implementation, with a variable number of inputs, outputs and hidden nodes. It takes as input an RDD of vector pairs, corresponding to the inputs and outputs of the training set.

In addition to two automated tests, an example program is included, which also provides a graphical representation.
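For illustration, a minimal usage sketch (hypothetical: the train arguments follow the call quoted later in this thread, and the package path and predict method are assumptions, not confirmed by this PR):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.ann.ArtificialNeuralNetwork // package path assumed
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Training set: an RDD of (input, output) vector pairs, here the XOR function.
def trainXorExample(sc: SparkContext): Unit = {
  val data: RDD[(Vector, Vector)] = sc.parallelize(Seq(
    (Vectors.dense(0.0, 0.0), Vectors.dense(0.0)),
    (Vectors.dense(0.0, 1.0), Vectors.dense(1.0)),
    (Vectors.dense(1.0, 0.0), Vectors.dense(1.0)),
    (Vectors.dense(1.0, 1.0), Vectors.dense(0.0))
  ))
  // Arguments mirror the call quoted further down in this conversation:
  // hidden layer sizes, maximum number of iterations, convergence tolerance.
  val model = ArtificialNeuralNetwork.train(data, Array(5), 1000, 1e-4)
  println(model.predict(Vectors.dense(1.0, 0.0))) // predict method assumed
}
```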

@mengxr
Contributor

mengxr commented Jul 3, 2014

@bgreeven Please add [MLLIB] to your PR title, following https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . It makes it easier for people who want to search MLlib's PRs. Thanks!

@mengxr
Contributor

mengxr commented Jul 3, 2014

Jenkins, test this please.

@bgreeven bgreeven changed the title [spark-2352] Implementation of an 1-hidden layer Artificial Neural Network (ANN) [MLLIB] [spark-2352] Implementation of an 1-hidden layer Artificial Neural Network (ANN) Jul 4, 2014
@mengxr
Contributor

mengxr commented Jul 16, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 16, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16708/consoleFull

@SparkQA

SparkQA commented Jul 16, 2014

QA results for PR 1290:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN( noInp: Integer, noHid: Integer, noOut: Integer, b: Double ) extends Gradient with ANN {
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16708/consoleFull

@matthelb

@bgreeven Are you continuing work on this pull request so that it passes all unit tests?

@bgreeven
Author

Hi Matthew,

Sure, I can. I was on holiday during the last two weeks, but am now back in the office. I'll update the code this week.

Best regards,

Bert


@bgreeven
Author

I updated the two sources to comply with "sbt/sbt scalastyle". Maybe retry the unit tests with the new modifications?

@mengxr
Contributor

mengxr commented Jul 30, 2014

Jenkins, add to whitelist.

@mengxr
Contributor

mengxr commented Jul 30, 2014

Jenkins, test this please.

@mengxr
Contributor

mengxr commented Jul 30, 2014

@bgreeven Jenkins will be automatically triggered for future updates.

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1290:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17440/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1290:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17440/consoleFull

@mengxr
Contributor

mengxr commented Aug 1, 2014

@bgreeven The filename mllib/src/main/scala/org/apache/spark/mllib/ann/GeneralizedSteepestDescendAlgorithm doesn't have the .scala extension.

@bgreeven
Author

bgreeven commented Aug 1, 2014

Thanks a lot! I have added the extension now.

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17649/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1290:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17649/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17665/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1290:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17665/consoleFull

@hunggpham

Hi Bert,

I want to try your ANN on Spark but could not find it in the latest clone. It's probably not there yet, despite the successful tests and merge messages above (10 days ago). How can I get a copy of your ANN code and try it out?

Thanks,
Hung Pham

@debasish83

Hung,
You can merge the repository into your Spark fork and you should be able to see the code.

@debasish83

SteepestDescend should be SteepestDescent!

@bgreeven
Author

SteepestDescend -> SteepestDescent can be changed. Thanks for noticing.

Hung Pham, did it work out for you now?

@hunggpham

Yes, I forked your repository and can see the code now. One question:
I don't see the backprop code. Will that be added later? Thanks.


@bgreeven
Author

The ANN uses the existing GradientDescent from mllib.optimization for back propagation. It uses the gradient from the new LeastSquaresGradientANN class, and updates the weights using the new ANNUpdater class.

This line in ANNUpdater.compute is the backbone of the back propagation:

brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
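For context, here is a minimal standalone sketch of that update step in Breeze (illustrative values; only the axpy call mirrors the line above):

```scala
import breeze.linalg.{DenseVector, axpy}

// Gradient-descent update: weights := weights - stepSize * gradient.
// axpy(alpha, x, y) computes y += alpha * x in place.
val weights  = DenseVector(0.5, -0.3, 0.8)
val gradient = DenseVector(0.1, 0.2, -0.1)
val thisIterStepSize = 0.03
axpy(-thisIterStepSize, gradient, weights)
println(weights) // DenseVector(0.497, -0.306, 0.803)
```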

@hunggpham

I finally see the backprop code in the two for loops inside LeastSquaresGradientANN that calculate the gradient, which is then used to update the weights by ANNUpdater.

Thanks, Bert.


@avulanov
Contributor

@bgreeven I've tried to train ParallelANNWithSGD with 3 layers 1000x500x18, numiterations 1000, stepSize 1. My dataset has ~2000 instances, 1000 features, 18 classes. After 17 hours it didn't finish and I killed the Spark process. I think there are some performance issues. I'll try to look at your code but without comments it would be challenging :)

@hntd187

hntd187 commented Jun 9, 2015

@avulanov If it would aid in speeding this up, I can test or benchmark on some EC2 instances we have, which run on Mesos. If you want to give me a general dataset to use, we could work something out.

@avulanov
Contributor

@hntd187 It would be good to discuss this. Currently I plan to use mnist8m and a 6-layer network 784-2500-2000-1500-1000-500-10, which is the best fully-connected configuration for MNIST from http://yann.lecun.com/exdb/mnist/. However, I am still looking for a more modern dataset, probably with more features, and a corresponding configuration. Are you aware of any?

@hntd187

hntd187 commented Jun 12, 2015

@avulanov To be perfectly honest, does the "modern-ness" of the dataset really matter? This dataset has been a standard in this area for a long time, so it seems perfectly reasonable to use it: most people working in this area would recognize the data and roughly know how to compare results with their own implementations.

@avulanov
Contributor

@hntd187 This is true; however, "modern" datasets tend to have more features, so the 784 features of MNIST might seem too few these days. Anyway, the basic idea of the benchmark is as follows: compare the performance of Caffe and this implementation, in both CPU and GPU mode, with different numbers of nodes (workers) for Spark. Performance should be measured in samples processed per second. Here comes another problem: the data formats supported by Spark and Caffe do not intersect. I can convert mnist8m (libsvm) to HDF5 for Caffe, but it will have a different size, which means Caffe will read a different amount of data from disk. Do you have an idea how to handle this problem?

@hntd187

hntd187 commented Jun 12, 2015

@avulanov Can Spark even read an HDF5 file, or would we have to write that as well? I can't donate any professional time to this conversion problem, but I may be able to assist if we want to write a conversion independently. I suppose the problem is that even if we get HDF5 working and run it in Caffe, how would we get Spark to use it? Reading a bit online and looking around, the consensus seems to be to use the pyhdf5 library to read the files in and do a flatMap to RDDs, but that would be horribly inefficient on a large dataset and we'd be shooting ourselves in the foot trying to make that scale. So I think our best bet, if we want to compare to Caffe, is either to get Caffe to read another format or to add HDF5 reading capability to Spark, via either a hack or an actual contribution. The first is not ideal; the second is obviously more time consuming.

thvasilo pushed a commit to thvasilo/flink that referenced this pull request Jun 12, 2015
@avulanov
Contributor

@hntd187 Thanks for the suggestion; it seems that implementing an HDF5 reader for Spark is the most reasonable option. I need to think about what the minimum viable implementation would be.

@thvasilo You should consider using the latest version, https://github.com/avulanov/spark/tree/ann-interface-gemm and also DBN from https://github.com/witgo/spark/tree/ann-interface-gemm-dbn

@hntd187

hntd187 commented Jun 12, 2015

@avulanov Would you like to split some of this work up or do you want to tackle this alone?

@avulanov
Contributor

@hntd187 Any help is really appreciated. We can split it into two functions: read and write. A good place to implement them is MLUtils, as saveAsHDF5 and loadHDF5.
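A rough sketch of what that interface might look like (purely hypothetical: these method names and signatures were only proposed in this discussion and are not part of MLlib):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical interface only: names, parameters and placement are assumptions
// drawn from the discussion above, not an existing API.
object HDF5Utils {
  /** Load a dataset stored in an HDF5 file into an RDD of labeled points. */
  def loadHDF5(sc: SparkContext, path: String, datasetName: String): RDD[LabeledPoint] = ???

  /** Save an RDD of labeled points to an HDF5 file. */
  def saveAsHDF5(data: RDD[LabeledPoint], path: String, datasetName: String): Unit = ???
}
```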

@hntd187

hntd187 commented Jun 17, 2015

@avulanov How about I take the read and you take the write?

In an ideal world we should be able to take the implementation from here https://github.com/h5py/h5py and load it into some form of RDD.

Here are the Java tools for HDF5, http://www.hdfgroup.org/products/java/release/download.html, which are the bindings for the file format. Hopefully, given the implementations out there, this should be pretty straightforward.

@hntd187

hntd187 commented Jun 18, 2015

@avulanov Also, we're going to have to add a dependency on the HDF5 library. I think this should be handled the way netlib is handled, with the user having to enable a profile when building Spark. So normally it wouldn't be available, but if you build with the profile you can use it. I'll update the POM to account for that.

@avulanov
Contributor

@hntd187 Thanks for the links.

I am not sure that the presence of the HDF5 library should be handled at the compilation step, because there will be no fallback for the functions we are going to implement, as is the case for netlib (it falls back to a Java implementation if you don't include the JNI binaries). Let's continue our discussion here: https://issues.apache.org/jira/browse/SPARK-8449

@Myasuka
Member

Myasuka commented Jul 16, 2015

Hi @avulanov, I have forked your ann-benchmark repository: https://github.com/avulanov/ann-benchmark/blob/master/spark/spark.scala . I feel a little confused about the mini-batch training. It seems that batchSize in val trainer = new FeedForwardTrainer(topology, 780, 10).setBatchSize(batchSize) means the size of the sub-block matrices you group the original input matrix into, and setMiniBatchFraction(1.0) in trainer.SGDOptimizer.setNumIterations(numIterations).setMiniBatchFraction(1.0).setStepSize(0.03) means you actually use full-batch gradient descent, not the mini-batch gradient descent method. Does it perform well on mnist8m data? Maybe you can share the training parameters in detail, such as layer units, mini-batch size, step size and so on.
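For reference, a hedged sketch of the two configurations being contrasted (the calls are those quoted from the benchmark script; the types, package path and the 0.1 fraction are assumptions, and the real classes are package-private in Spark):

```scala
import org.apache.spark.ml.ann.{FeedForwardTrainer, Topology} // package path assumed

// batchSize: how many samples are stacked into one matrix per gradient call.
// miniBatchFraction: fraction of the data sampled per SGD iteration;
// 1.0 means every iteration uses the full training set (batch gradient descent),
// while e.g. 0.1 would sample 10% of the data per iteration (mini-batch SGD).
def configureSgd(topology: Topology, batchSize: Int, numIterations: Int,
                 miniBatchFraction: Double): FeedForwardTrainer = {
  val trainer = new FeedForwardTrainer(topology, 780, 10).setBatchSize(batchSize)
  trainer.SGDOptimizer
    .setNumIterations(numIterations)
    .setMiniBatchFraction(miniBatchFraction)
    .setStepSize(0.03)
  trainer
}
```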

@avulanov
Contributor

@Myasuka Thank you for your interest in the benchmark. The goal of the benchmark is to measure the scalability of my implementation and to compare its efficiency with other tools, such as Caffe. I measure the time needed for one epoch of batch gradient descent on a large network with ~12M parameters. I don't measure the convergence rate or the accuracy, because they are very use-case specific and they don't directly show how scalable a particular machine learning tool is. The benchmark could be improved though, and I am working on it, so thank you for your suggestion.

@Myasuka
Member

Myasuka commented Jul 17, 2015

@avulanov I tried to run a test on MNIST data with the SGD optimizer; however, I cannot reproduce the result in #1290 (comment). I use a topology of (780, 10), and set batchSize to 1000, miniBatchFraction to 1.0 and numIterations to 500; the accuracy is only 75%. If I set miniBatchFraction to 0.1, the accuracy still stays at 75%. Would you please share your training parameters in detail so that I can raise the accuracy to 90%?

@avulanov
Contributor

@Myasuka LBFGS was used in the mentioned experiment. SGD needs more iterations to converge in this case.

@bnoreus

bnoreus commented Jul 24, 2015

Hey guys,

I want to start off by saying thank you for this piece of code. The ANN has been working beautifully so far. I have one question though: when I run training on a dataset imported from AWS S3, the logging says "Opening s3://blabla.txt for reading" over and over again. I interpret this as the program opening the S3 file many times, instead of just once. Is this true? Wouldn't it be much faster if the file were only opened once?

@avulanov
Contributor

@bnoreus Thank you for your feedback. This code does not implement any file-related operations; it works with RDDs only. I assume the logging comes from another piece of code you are using.
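One likely cause, as a general Spark note rather than anything specific to this PR: if the RDD read from S3 is not cached, an iterative optimizer recomputes it from the source on every iteration, re-opening the S3 object each time. A minimal sketch, assuming a simple CSV layout:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Caching the parsed training RDD keeps it in executor memory after the first
// pass, so iterative optimization does not re-read the S3 object repeatedly.
def loadTrainingData(sc: SparkContext, path: String): RDD[(Vector, Vector)] = {
  sc.textFile(path)
    .map { line =>
      val cols = line.split(',').map(_.toDouble)
      // Assumed layout: all but the last column are inputs, the last is the target.
      (Vectors.dense(cols.init), Vectors.dense(cols.last))
    }
    .cache()
}
```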

asfgit pushed a commit that referenced this pull request Jul 31, 2015
This pull request contains the following feature for ML:
   - Multilayer Perceptron classifier

This implementation is based on our initial pull request with bgreeven: #1290 and inspired by very insightful suggestions from mengxr and witgo (I would like to thank all other people from the mentioned thread for useful discussions). The original code was extensively tested and benchmarked. Since then, I've addressed two main requirements that prevented the code from merging into the main branch:
   - Extensible interface, so it will be easy to implement new types of networks
     - Main building blocks are traits `Layer` and `LayerModel`. They are used for constructing the layers of the ANN. New layers can be added by extending the `Layer` and `LayerModel` traits. These traits are private in this release in order to preserve a path to improve them based on community feedback
     - Back propagation is implemented in general form, so there is no need to change it (optimization algorithm) when new layers are implemented
   - Speed and scalability: this implementation has to be comparable in terms of speed to the state of the art single node implementations.
     - The developed benchmark for a large ANN shows that the proposed code is on par with a C++ CPU implementation and scales nicely with the number of workers. Details can be found here: https://github.com/avulanov/ann-benchmark

   - DBN and RBM by witgo https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
   - Dropout https://github.com/avulanov/spark/tree/ann-interface-gemm

mengxr and dbtsai kindly agreed to perform code review.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: Bert Greevenbosch <opensrc@bertgreevenbosch.nl>

Closes #7621 from avulanov/SPARK-2352-ann and squashes the following commits:

4806b6f [Alexander Ulanov] Addressing reviewers comments.
a7e7951 [Alexander Ulanov] Default blockSize: 100. Added documentation to blockSize parameter and DataStacker class
f69bb3d [Alexander Ulanov] Addressing reviewers comments.
374bea6 [Alexander Ulanov] Moving ANN to ML package. GradientDescent constructor is now spark private.
43b0ae2 [Alexander Ulanov] Addressing reviewers comments. Adding multiclass test.
9d18469 [Alexander Ulanov] Addressing reviewers comments: unnecessary copy of data in predict
35125ab [Alexander Ulanov] Style fix in tests
e191301 [Alexander Ulanov] Apache header
a226133 [Alexander Ulanov] Multilayer Perceptron regressor and classifier
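For readers curious about the `Layer`/`LayerModel` split mentioned in the commit message above, here is a simplified, hedged sketch of the idea (shapes only; the real traits in the merged code are private[ml], have different signatures and carry more methods):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Simplified sketch of the extensible layer interface described above.
trait Layer {
  /** Number of weights this layer contributes to the global weight vector. */
  def weightSize: Int
  /** Instantiate a model of this layer from its slice of the weight vector. */
  def getModel(weights: DenseVector[Double]): LayerModel
}

trait LayerModel {
  /** Forward pass: map an input batch (one column per sample) to this layer's output. */
  def eval(data: DenseMatrix[Double]): DenseMatrix[Double]
  /** Backward pass: propagate the error (delta) from the next layer to the previous one. */
  def prevDelta(nextDelta: DenseMatrix[Double], input: DenseMatrix[Double]): DenseMatrix[Double]
}
```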
@mengxr
Contributor

mengxr commented Jul 31, 2015

@bgreeven We recently merged #7621 from @avulanov . Under the hood, it contains the ANN implementation based on this PR. Additional features should come in follow-up PRs. So do you mind closing this PR for now? We can move the discussion to the JIRA page on individual features. Thanks a lot for your contribution and everyone for the discussion!

@AmplabJenkins

Merged build finished. Test FAILed.

@08s011003

@bgreeven Hi, I tried to train the model using this implementation and found a weird outcome in the output from LBFGS, as follows:
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 6.22e-08) 3.59731
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 3.17e-08) 1.77486
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 1.56e-08) 0.885332
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 7.91e-09) 0.442205

I launch model training with: var model = ArtificialNeuralNetwork.train(trainData, Array(2, 3), 5000, 1e-8)

The problem is that the training process only iterates a few steps before returning. Obviously the error on the validation set is too large to meet expectations. What is the problem?

@asfgit asfgit closed this in 423cdfd Aug 11, 2015
@mengxr
Contributor

mengxr commented Aug 11, 2015

I closed this PR. We can use Apache JIRA to continue discussion on individual issues.

wangyum pushed a commit that referenced this pull request May 26, 2023