A Scalable Implementation of Deep Learning on Spark

This library is based on the implementation of artificial neural networks in Spark ML. In addition to the multilayer perceptron, it contains new Spark deep learning features that have not yet been merged into Spark ML: currently, the Stacked Autoencoder and tensor data flow. Highlights of the library:

  • Provides Spark ML pipeline API
  • Implements data parallel training
  • Supports native CPU BLAS
  • Employs tensor data flow
  • Provides extensible API for developers of new features

Installation

Requirements

  • Apache Spark 2.0 or higher
  • Java and Scala
  • Maven

Build

Clone and compile:

git clone https://github.com/avulanov/scalable-deeplearning.git
cd scalable-deeplearning
sbt assembly (or mvn assembly)

The jar library will be available in the target folder. The assembly includes the optimized numerical processing library netlib-java. Optionally, one can build a plain package (without bundled dependencies) with sbt package (or mvn package).

Performance configuration

Scaladl uses the netlib-java library for optimized numerical processing with native BLAS. All netlib-java classes are included in scaladl.jar. The jar has to appear in the classpath before Spark's own libraries, because Spark bundles only a subset of netlib-java. To do this, set spark.driver.userClassPathFirst to true in spark-defaults.conf.
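A minimal spark-defaults.conf fragment might look as follows (the executor setting is an assumption: workers typically need the same classpath ordering as the driver to pick up scaladl's netlib classes):

```
# conf/spark-defaults.conf
spark.driver.userClassPathFirst   true
# assumption: executors usually need the same ordering as the driver
spark.executor.userClassPathFirst true
```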

If native BLAS libraries are not available at runtime, or scaladl is not first in the classpath, you will see the warning WARN BLAS: Failed to load implementation from: and the reference (pure JVM) implementation will be used. A native BLAS library such as OpenBLAS (libopenblas.so or .dll) or ATLAS (libatlas.so) must be on the library path of all nodes that run Spark. Netlib-java requires the library to be named libblas.so.3, so one has to create a symlink; the same applies to Windows and libblas3.dll. Below are the setup details for different platforms. With proper configuration, you will see the message INFO JniLoader: successfully loaded ...netlib-native_system-....

Linux:

Install a native BLAS library (depending on your distribution):

yum install openblas <OR> apt-get install libopenblas-base <OR> download and compile OpenBLAS

Create a symlink to the native BLAS library inside its folder, e.g. /your/blas:

ln -s libopenblas.so libblas.so.3

Add the folder to your library path, and make sure no other folder on the path contains libblas.so.3:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/your/blas

Windows:

Copy the following DLLs from the MinGW distribution and from OpenBLAS to a folder, e.g. blas. Make sure they are all 64-bit or all 32-bit. Add that folder to your PATH variable.

libquadmath-0.dll  // MinGW
libgcc_s_seh-1.dll // MinGW
libgfortran-3.dll  // MinGW
libopenblas.dll    // OpenBLAS binary
liblapack3.dll     // copy of libopenblas.dll
libblas3.dll       // copy of libopenblas.dll

Example of use

Built-in examples

Scaladl provides working examples of MNIST classification and of pre-training with a stacked autoencoder. The examples are in the scaladl.examples package and can be run via spark-submit:

./spark-submit --class scaladl.examples.MnistClassification --master spark://master:7077 /path/to/scaladl.jar /path/to/mnist-libsvm

Spark shell

Start Spark with this library:

./spark-shell --jars scaladl.jar

Or use it as an external dependency for your application.

Multilayer perceptron

MNIST classification

  • Load the MNIST handwritten digit recognition data, stored in LIBSVM format, as a DataFrame
  • Initialize the multilayer perceptron classifier with 784 inputs, 32 neurons in the hidden layer and 10 outputs
  • Train and predict
import org.apache.spark.ml.scaladl.MultilayerPerceptronClassifier
val train = spark.read.format("libsvm").option("numFeatures", 784).load("mnist.scale").persist()
val test = spark.read.format("libsvm").option("numFeatures", 784).load("mnist.scale.t").persist()
train.count() // materialize the lazily persisted data in memory
test.count() // materialize the lazily persisted data in memory
val trainer = new MultilayerPerceptronClassifier().setLayers(Array(784, 32, 10)).setMaxIter(100)
val model = trainer.fit(train)
val result = model.transform(test)
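For reference, setLayers(Array(784, 32, 10)) implies a single packed weight vector: each consecutive layer pair contributes in*out weights plus out biases (the layout Spark ML's MLP uses for its weight vector), giving 25450 parameters here. A quick pure-Scala sketch, independent of Spark:

```scala
// Parameter count of a fully connected network whose weights are packed
// layer by layer as (weights, biases), as in Spark ML's MLP weight vector.
def numParams(layers: Array[Int]): Int =
  layers.sliding(2).map { case Array(in, out) => in * out + out }.sum

println(numParams(Array(784, 32, 10))) // 25450 = 784*32 + 32 + 32*10 + 10
```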

Stacked Autoencoder

Pre-training

  • Load the MNIST data
  • Initialize the stacked autoencoder with 784 inputs and 32 neurons in the hidden layer
  • Train the stacked autoencoder
  • Initialize the multilayer perceptron classifier with 784 inputs, 32 neurons in the hidden layer and 10 outputs
  • Copy the encoder weights into the classifier's initial weights, then train
import org.apache.spark.ml.scaladl.{MultilayerPerceptronClassifier, StackedAutoencoder}
val train = spark.read.format("libsvm").option("numFeatures", 784).load("mnist.scale").persist()
train.count() // materialize the lazily persisted data in memory
val stackedAutoencoder = new StackedAutoencoder().setLayers(Array(784, 32))
  .setInputCol("features")
  .setOutputCol("output")
  .setDataIn01Interval(true)
  .setBuildDecoder(false)
val saModel = stackedAutoencoder.fit(train)
val autoWeights = saModel.encoderWeights
val trainer = new MultilayerPerceptronClassifier().setLayers(Array(784, 32, 10)).setMaxIter(1)
val initialWeights = trainer.fit(train).weights
System.arraycopy(autoWeights.toArray, 0, initialWeights.toArray, 0, autoWeights.toArray.length)
trainer.setInitialWeights(initialWeights).setMaxIter(10)
val model = trainer.fit(train)
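The System.arraycopy step works because, with weights packed front-to-back as one (weights, biases) block per layer, the encoder's 784*32 + 32 = 25120 values coincide exactly with the classifier's first-layer block. A pure-Scala sketch of the transplant on stand-in arrays (no Spark required; the array contents are placeholders, not real weights):

```scala
// Stand-in arrays for saModel.encoderWeights and the classifier's initial weights.
val encoderLen = 784 * 32 + 32               // first-layer weights + biases
val mlpLen     = encoderLen + (32 * 10 + 10) // total parameters of a 784-32-10 network

val autoWeights    = Array.fill(encoderLen)(1.0) // pretend pre-trained encoder weights
val initialWeights = Array.fill(mlpLen)(0.0)     // pretend classifier weights

// Transplant the pre-trained encoder into the front of the weight vector.
System.arraycopy(autoWeights, 0, initialWeights, 0, autoWeights.length)

// The first layer now holds the encoder weights; the output layer is untouched.
assert(initialWeights.take(encoderLen).forall(_ == 1.0))
assert(initialWeights.drop(encoderLen).forall(_ == 0.0))
```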

Contributions

Contributions are welcome, in particular in the following areas:

  • New layers
    • Convolutional
    • ReLU
  • Flexibility
    • Implement a reader for Caffe or other deep learning configuration formats
    • Implement Python/R/Java interfaces
  • Efficiency
    • Switch from double to single precision
    • Implement wrappers for specialized deep learning libraries, e.g. TensorFlow
  • Refactoring
    • Implement our own version of L-BFGS to remove the dependency on Breeze
