[MLLIB] [spark-2352] Implementation of an Artificial Neural Network (ANN) #1290

Closed · wants to merge 143 commits into base: master

Conversation
bgreeven commented Jul 3, 2014

The code contains a multi-layer ANN implementation with a variable number of inputs, outputs and hidden nodes. It takes as input an RDD of vector pairs, corresponding to the training set with inputs and outputs.

In addition to two automated tests, an example program is included, which also provides a graphical representation.
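
A minimal usage sketch (the train call reuses the arguments quoted later in this thread; the meaning of the Array argument, the predict accessor, the XOR data and the implicit SparkContext `sc` are assumptions, and the API changed over the PR's 143 commits):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Training set as (input, output) vector pairs, here learning XOR.
val trainData: RDD[(Vector, Vector)] = sc.parallelize(Seq(
  (Vectors.dense(0.0, 0.0), Vectors.dense(0.0)),
  (Vectors.dense(0.0, 1.0), Vectors.dense(1.0)),
  (Vectors.dense(1.0, 0.0), Vectors.dense(1.0)),
  (Vectors.dense(1.0, 1.0), Vectors.dense(0.0))
))

// Array(2, 3): hidden-layer sizes (assumed meaning); 5000 max iterations,
// 1e-8 convergence tolerance, as quoted in a comment below.
val model = ArtificialNeuralNetwork.train(trainData, Array(2, 3), 5000, 1e-8)
val out: Vector = model.predict(Vectors.dense(1.0, 0.0)) // hypothetical accessor
```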

mengxr (Contributor) commented Jul 3, 2014

@bgreeven Please add [MLLIB] to your PR title, following https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . It makes it easier for people who want to search MLlib's PRs. Thanks!

mengxr (Contributor) commented Jul 3, 2014

Jenkins, test this please.

bgreeven changed the title from [spark-2352] Implementation of an 1-hidden layer Artificial Neural Network (ANN) to [MLLIB] [spark-2352] Implementation of an 1-hidden layer Artificial Neural Network (ANN) Jul 4, 2014

mengxr (Contributor) commented Jul 16, 2014

Jenkins, retest this please.

SparkQA commented Jul 16, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16708/consoleFull

SparkQA commented Jul 16, 2014

QA results for PR 1290:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN( noInp: Integer, noHid: Integer, noOut: Integer, b: Double ) extends Gradient with ANN {
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16708/consoleFull

matthelb commented Jul 19, 2014

@bgreeven Are you continuing work on this pull request so that it passes all unit tests?

bgreeven commented Jul 29, 2014

Hi Matthew,

Sure, I can. I was on holiday during the last two weeks, but am now back in the office. I'll update the code this week.

Best regards,

Bert

bgreeven commented Jul 30, 2014

I updated the two sources to comply with "sbt/sbt scalastyle". Maybe retry the unit tests with the new modifications?

mengxr (Contributor) commented Jul 30, 2014

Jenkins, add to whitelist.

mengxr (Contributor) commented Jul 30, 2014

Jenkins, test this please.

mengxr (Contributor) commented Jul 30, 2014

@bgreeven Jenkins will be automatically triggered for future updates.

SparkQA commented Jul 30, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull

SparkQA commented Jul 30, 2014

QA results for PR 1290:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull

SparkQA commented Jul 30, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17440/consoleFull

SparkQA commented Jul 30, 2014

QA results for PR 1290:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17440/consoleFull

mengxr (Contributor) commented Aug 1, 2014

@bgreeven The file mllib/src/main/scala/org/apache/spark/mllib/ann/GeneralizedSteepestDescendAlgorithm doesn't have a .scala extension.

bgreeven commented Aug 1, 2014

Thanks a lot! I have added the extension now.

SparkQA commented Aug 1, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17649/consoleFull

SparkQA commented Aug 1, 2014

QA results for PR 1290:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17649/consoleFull

SparkQA commented Aug 1, 2014

QA tests have started for PR 1290. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17665/consoleFull

SparkQA commented Aug 1, 2014

QA results for PR 1290:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class GeneralizedSteepestDescendModel(val weights: Vector )
trait ANN {
class LeastSquaresGradientANN(
class ANNUpdater extends Updater {
class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17665/consoleFull

hunggpham commented Aug 11, 2014

Hi Bert,

I want to try your ANN on Spark but could not find it in the latest clone. It's probably not there yet, despite the successful tests and merge messages above (10 days ago). How can I get a copy of your ANN code and try it out?

Thanks,
Hung Pham

debasish83 commented Aug 11, 2014

Hung,
You can merge this branch into your Spark fork, and then you should be able to see the code.

debasish83 commented Aug 11, 2014

SteepestDescend should be SteepestDescent!

bgreeven commented Aug 12, 2014

SteepestDescend -> SteepestDescent can be changed. Thanks for noticing.

Hung Pham, did it work out for you now?

hunggpham commented Aug 12, 2014

Yes, I forked your repository and can see the code now. One question: I don't see backprop code. Will that be added later? Thanks.

bgreeven commented Aug 13, 2014

The ANN uses the existing GradientDescent from mllib.optimization for back propagation. It uses the gradient from the new LeastSquaresGradientANN class, and updates the weights using the new ANNUpdater class.

This line in ANNUpdater.compute is the backbone of the back propagation:

brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
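
In context, that update looks roughly like the following sketch, modeled on MLlib's SimpleUpdater. Only the brzAxpy line is quoted from the PR; the decaying step size is an assumption, and toBreeze/fromBreeze are package-private, so this only compiles inside org.apache.spark.mllib:

```scala
import breeze.linalg.{axpy => brzAxpy, Vector => BV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Updater

class ANNUpdater extends Updater {
  override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    // Decaying step size, as in MLlib's SimpleUpdater (assumption).
    val thisIterStepSize = stepSize / math.sqrt(iter)
    val brzWeights: BV[Double] = weightsOld.toBreeze.toDenseVector
    // The quoted line: w := w - thisIterStepSize * gradient.
    brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
    // No regularization term, so the second tuple element is 0.
    (Vectors.fromBreeze(brzWeights), 0)
  }
}
```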

hunggpham commented Aug 13, 2014

I finally see the backprop code: the two for loops inside LeastSquaresGradientANN calculate the gradient, which is then used to update the weights by ANNUpdater.

Thanks, Bert.

avulanov (Contributor) commented Aug 15, 2014

@bgreeven I've tried to train ParallelANNWithSGD with 3 layers (1000x500x18), numIterations 1000, stepSize 1. My dataset has ~2000 instances, 1000 features, 18 classes. After 17 hours it hadn't finished and I killed the Spark process. I think there are some performance issues. I'll try to look at your code, but without comments it will be challenging :)

hntd187 commented Jun 9, 2015

@avulanov If it would aid in speeding this up, I can test or benchmark on some EC2 instances we have, which run on Mesos. If you want to give a general dataset to use, we could work something out.

avulanov (Contributor) commented Jun 12, 2015

@hntd187 It would be good to discuss this. Currently I plan to use mnist8m and the 6-layer network 784-2500-2000-1500-1000-500-10, which is the best fully-connected configuration for MNIST from http://yann.lecun.com/exdb/mnist/. However, I am still looking for a more modern dataset, probably with more features, and a corresponding configuration. Are you aware of any?
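
In the ANN interface discussed in this thread, that layer string maps onto a plain array of layer sizes (the variable name is illustrative):

```scala
// 784 inputs (28x28 MNIST pixels), five hidden layers, 10 output classes.
val topology = Array(784, 2500, 2000, 1500, 1000, 500, 10)
```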

hntd187 commented Jun 12, 2015

@avulanov To be perfectly honest, does the "modern-ness" of the dataset really matter? This dataset has been a standard in this area for a long time, so it seems perfectly reasonable to use it: most people working in this area would recognize the data and know roughly how to compare it to their own implementation.

avulanov (Contributor) commented Jun 12, 2015

@hntd187 This is true; however, "modern" datasets tend to have more features, so the 784 features of MNIST might seem too few these days. Anyway, the basic idea of the benchmark is as follows: compare the performance of Caffe and this implementation, both in CPU and GPU mode, with different numbers of nodes (workers) for Spark. Performance should be measured in samples/second processed. Here comes another problem: the data formats supported by Spark and Caffe do not intersect. I can convert mnist8m (libsvm) to HDF5 for Caffe, but it will have a different size, which means that Caffe will read a different amount of data from disk. Do you have an idea how to handle this problem?

hntd187 commented Jun 12, 2015

@avulanov Can Spark even read an HDF5 file, or would we have to write that as well? While I can't donate any professional time to this conversion problem, I may be able to assist if we want to write a conversion independently. I suppose the problem is, even if we get HDF5 and run it in Caffe, how would we get Spark to use it? Reading a bit online, the consensus seems to be to use the pyhdf5 library to read the files in and do a flatMap to RDDs, but that would be horribly inefficient on a large dataset, and we'd be shooting ourselves in the foot trying to make that scale. So I think our best bet, if we want to compare with Caffe, is either to get Caffe to read another format or to add HDF5 reading capability to Spark, via a hack or an actual contribution. The first is not ideal; the second is obviously more time-consuming.

thvasilo added a commit to thvasilo/flink that referenced this pull request Jun 12, 2015

avulanov (Contributor) commented Jun 12, 2015

@hntd187 Thanks for the suggestion; implementing an HDF5 reader for Spark seems the most reasonable option. I need to think about what the minimum viable implementation would be.

@thvasilo You should consider using the latest version, https://github.com/avulanov/spark/tree/ann-interface-gemm and also DBN from https://github.com/witgo/spark/tree/ann-interface-gemm-dbn

hntd187 commented Jun 12, 2015

@avulanov Would you like to split some of this work up or do you want to tackle this alone?

avulanov (Contributor) commented Jun 17, 2015

@hntd187 Any help is really appreciated. We can split it into two functions, read and write; a good place to implement them is MLUtils, as saveAsHDF5 and loadHDF5. A sketch of the proposed interface follows.
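
Only the names saveAsHDF5 and loadHDF5 come from the comment above; the signatures, the LabeledPoint element type and the dataset parameter are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object MLUtilsHDF5 {
  // Hypothetical: read one HDF5 dataset into an RDD, analogous to
  // MLUtils.loadLibSVMFile.
  def loadHDF5(sc: SparkContext, path: String, dataset: String): RDD[LabeledPoint] = ???

  // Hypothetical: write an RDD out as a single HDF5 dataset.
  def saveAsHDF5(data: RDD[LabeledPoint], path: String, dataset: String): Unit = ???
}
```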

hntd187 commented Jun 17, 2015

@avulanov How about I take the read and you take the write?

In an ideal world, we should be able to take the implementation from https://github.com/h5py/h5py and load it into some form of RDD.

Here are the Java tools for HDF5, http://www.hdfgroup.org/products/java/release/download.html, which are the bindings for the file format; hopefully, given the implementations out there, this should be pretty straightforward.

hntd187 commented Jun 18, 2015

@avulanov Also, this is going to add a dependency on the HDF5 library. I think it should be handled the way netlib is handled, with the user having to enable a profile when building Spark: normally it wouldn't be available, but if you build with the profile you can use it. I'll update the POM to account for that.

avulanov (Contributor) commented Jun 18, 2015

@hntd187 Thanks for the links.

I am not sure that the presence of the HDF5 library should be handled at the compilation step, because there will be no fallback for the functions we are going to implement, as there is for netlib (it falls back to a Java implementation if you don't include the JNI binaries). Let's continue our discussion here: https://issues.apache.org/jira/browse/SPARK-8449

Myasuka commented Jul 16, 2015

Hi @avulanov, I have forked your ann-benchmark repository, https://github.com/avulanov/ann-benchmark/blob/master/spark/spark.scala . I feel a little confused about the mini-batch training. It seems that batchSize in val trainer = new FeedForwardTrainer(topology, 780, 10).setBatchSize(batchSize) means the size of the sub-block matrices the original input matrix is grouped into, while setMiniBatchFraction(1.0) in trainer.SGDOptimizer.setNumIterations(numIterations).setMiniBatchFraction(1.0).setStepSize(0.03) means you actually use full-batch gradient descent, not mini-batch gradient descent. Does it perform well on mnist8m data? Maybe you can share the training parameters in detail, such as layer units, mini-batch size, step size and so on.
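
For readability, the two quoted lines reassemble to roughly the following; everything other than those two lines (topology, the placeholder values, and the final train call) is an assumption:

```scala
// Reassembled from the quoted benchmark code; values are placeholders.
val batchSize = 1000     // rows per stacked sub-block matrix (illustrative)
val numIterations = 100  // illustrative
val trainer = new FeedForwardTrainer(topology, 780, 10).setBatchSize(batchSize)
trainer.SGDOptimizer
  .setNumIterations(numIterations)
  .setMiniBatchFraction(1.0) // 1.0 => each iteration computes the gradient on the full batch
  .setStepSize(0.03)
val model = trainer.train(data) // assumed: data is an RDD of (input, label) vector pairs
```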

avulanov (Contributor) commented Jul 16, 2015

@Myasuka Thank you for your interest in the benchmark. Its goal is to measure the scalability of my implementation and to compare its efficiency with other tools, such as Caffe. I measure the time needed for one epoch of batch gradient descent on a large network with ~12M parameters. I don't measure the convergence rate or the accuracy, because they are very use-case specific and don't show directly how scalable a particular machine learning tool is. The benchmark could be improved, though, and I am working on it, so thank you for your suggestion.

Myasuka commented Jul 17, 2015

@avulanov I tried to run a test on MNIST data with the SGD optimizer; however, I cannot reproduce the result in #1290 (comment). I use a topology of (780, 10), and set batchSize to 1000, miniBatchFraction to 1.0 and numIterations to 500; the accuracy is only 75%. If I set miniBatchFraction to 0.1, the accuracy still stays at 75%. Would you please share your training parameters in detail so that I can raise the accuracy to 90%?

avulanov (Contributor) commented Jul 17, 2015

@Myasuka LBFGS was used in the mentioned experiment. SGD needs more iterations to converge in this case.

bnoreus commented Jul 24, 2015

Hey guys,

I want to start off by saying thank you for this piece of code. The ANN has been working beautifully so far. I have one question though: when I run training on a dataset I imported from AWS S3, the logging says "Opening s3://blabla.txt for reading" over and over again. I interpret this as the program opening the S3 file many times instead of just once. Is this true? Wouldn't it be much faster if the file were only opened once?

avulanov (Contributor) commented Jul 27, 2015

@bnoreus Thank you for your feedback. This code does not implement any file-related operations; it works with RDDs only. I assume the logging comes from some other piece of code you are using.

asfgit pushed a commit that referenced this pull request Jul 31, 2015

[SPARK-9471] [ML] Multilayer Perceptron
This pull request contains the following feature for ML:
   - Multilayer Perceptron classifier

This implementation is based on our initial pull request with bgreeven (#1290) and inspired by very insightful suggestions from mengxr and witgo (I would like to thank all other people from the mentioned thread for useful discussions). The original code was extensively tested and benchmarked. Since then, I've addressed two main requirements that prevented the code from merging into the main branch:
   - Extensible interface, so it will be easy to implement new types of networks
     - The main building blocks are the traits `Layer` and `LayerModel`, used for constructing the layers of an ANN. New layers can be added by extending the `Layer` and `LayerModel` traits. These traits are private in this release in order to preserve the ability to improve them based on community feedback
     - Back propagation is implemented in general form, so there is no need to change it (the optimization algorithm) when new layers are implemented
   - Speed and scalability: this implementation has to be comparable in terms of speed to state-of-the-art single-node implementations.
     - The developed benchmark for large ANNs shows that the proposed code is on par with a C++ CPU implementation and scales nicely with the number of workers. Details can be found here: https://github.com/avulanov/ann-benchmark

   - DBN and RBM by witgo https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
   - Dropout https://github.com/avulanov/spark/tree/ann-interface-gemm

mengxr and dbtsai kindly agreed to perform code review.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: Bert Greevenbosch <opensrc@bertgreevenbosch.nl>

Closes #7621 from avulanov/SPARK-2352-ann and squashes the following commits:

4806b6f [Alexander Ulanov] Addressing reviewers comments.
a7e7951 [Alexander Ulanov] Default blockSize: 100. Added documentation to blockSize parameter and DataStacker class
f69bb3d [Alexander Ulanov] Addressing reviewers comments.
374bea6 [Alexander Ulanov] Moving ANN to ML package. GradientDescent constructor is now spark private.
43b0ae2 [Alexander Ulanov] Addressing reviewers comments. Adding multiclass test.
9d18469 [Alexander Ulanov] Addressing reviewers comments: unnecessary copy of data in predict
35125ab [Alexander Ulanov] Style fix in tests
e191301 [Alexander Ulanov] Apache header
a226133 [Alexander Ulanov] Multilayer Perceptron regressor and classifier
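
The extensible interface described in this commit message can be sketched roughly as follows (an illustrative reduction, not the merged code; all method names and signatures here are assumptions):

```scala
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Vector

// Describes a layer's shape and how to instantiate its weights.
trait Layer extends Serializable {
  // Number of weights this layer contributes to the flat weight vector.
  def weightSize: Int
  // Build a concrete model from a slice of the global weight vector.
  def getInstance(weights: Vector, position: Int): LayerModel
}

// Holds concrete weights and the forward/backward computations, so the
// generic back-propagation loop never inspects a layer's internals.
trait LayerModel extends Serializable {
  def eval(input: BDM[Double]): BDM[Double]                              // forward pass
  def prevDelta(nextDelta: BDM[Double], input: BDM[Double]): BDM[Double] // backward pass
  def grad(delta: BDM[Double], input: BDM[Double]): Array[Double]        // weight gradient
}
```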
mengxr (Contributor) commented Jul 31, 2015

@bgreeven We recently merged #7621 from @avulanov. Under the hood, it contains the ANN implementation based on this PR. Additional features should come in follow-up PRs. So do you mind closing this PR for now? We can move the discussion to the JIRA page on individual features. Thanks a lot for your contribution and everyone for the discussion!

markgrover pushed a commit to markgrover/spark that referenced this pull request Jul 31, 2015

[SPARK-9471] [ML] Multilayer Perceptron (same squashed commit message as above)
AmplabJenkins commented Aug 4, 2015

Merged build finished. Test FAILed.

dennishuo added a commit to dennishuo/spark that referenced this pull request Aug 7, 2015

[SPARK-9471] [ML] Multilayer Perceptron (same squashed commit message as above)
08s011003 commented Aug 11, 2015

@bgreeven Hi, I tried to train a model using this implementation and found a weird outcome in the output from LBFGS, as follows:
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 6.22e-08) 3.59731
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 3.17e-08) 1.77486
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 1.56e-08) 0.885332
15/08/11 08:52:24 INFO optimize.LBFGS: Step Size: 1.000
15/08/11 08:52:24 INFO optimize.LBFGS: Val and Grad Norm: 5.77126e+07 (rel: 7.91e-09) 0.442205

I launch model training with: var model = ArtificialNeuralNetwork.train(trainData, Array(2, 3), 5000, 1e-8)

The problem is that the training process iterates only a few steps before returning, and the error on the validation set is far too large. What is the problem?

asfgit closed this in 423cdfd Aug 11, 2015

mengxr (Contributor) commented Aug 11, 2015

I closed this PR. We can use Apache JIRA to continue discussion on individual issues.

liufengdb pushed a commit to liufengdb/spark that referenced this pull request Dec 10, 2017

[SC-8968][SQL] Add new built-in function date_trunc()
Adding `date_trunc()` as a built-in function.
`date_trunc()` is common in other databases and is one of the functions triggered by Superset, but neither Spark nor Hive supports it.
We do have [`trunc()`](https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html#trunc-org.apache.spark.sql.Column-java.lang.String-), but it only works at the 'MONTH' and 'YEAR' levels on DateType input.

`date_trunc()` in other databases:
AWS Redshift: http://docs.aws.amazon.com/redshift/latest/dg/r_DATE_TRUNC.html
PostgreSQL: https://www.postgresql.org/docs/9.1/static/functions-datetime.html
Presto: https://prestodb.io/docs/current/functions/datetime.html

Added new unit tests for the new `date_trunc()` function.

Author: Youngbin Kim <ykim828@hotmail.com>

Closes #1290 from youngbink/date_trunc.
Closes #1461 from youngbink/date_trunc-3.x.

tdas pushed a commit to tdas/spark that referenced this pull request Feb 13, 2018

[SC-8968][SQL] Add new built-in function date_trunc() (same commit message as above)
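
For context, typical usage of the new function looks like this (a hedged illustration; the example timestamp and result are mine, and the format-first argument order follows the date_trunc shipped in Spark 2.3+, assuming a SparkSession `spark`):

```scala
// Truncate a timestamp to the start of its hour.
val df = spark.sql(
  "SELECT date_trunc('HOUR', cast('2015-03-05 09:32:05.359' as timestamp)) AS t")
df.show() // t = 2015-03-05 09:00:00
```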