
[SPARK-10409] [ML] Add Multilayer Perceptron Regression to ML #13617

Closed
wants to merge 19 commits

Conversation

JeremyNixon
Contributor

@JeremyNixon JeremyNixon commented Jun 11, 2016

What changes were proposed in this pull request?

This is a pull request adding support for Multilayer Perceptron Regression, the counterpart to the Multilayer Perceptron Classifier (hereafter MLPR and MLPC).

Outline

  1. Major Changes
  2. API Decisions
  3. Automating Scaling
  4. Naming
  5. Features
  6. Reference Resources
  7. Testing

Convenience Link to JIRA: https://issues.apache.org/jira/browse/SPARK-10409

Major Changes

There are two major differences between MLPR and MLPC. The first is the use of a linear (identity) activation function and a sum-of-squared-error cost function in the last layer of the network. The second is the requirement to scale the labels to [0,1] and back, which makes it easy for the weights to fit a value in the proper range.

Linear, Relu, Tanh Activations

In the forward pass, the linear activation passes the value from the fully connected layer straight through to become the network prediction. During weight adjustment in the backward pass, its derivative is one. All regression models will use the linear activation in the last layer, and so there is no option (as there is in MLPC) to use a different activation function or cost function in the last layer.

Relu and Tanh are activation functions that will benefit the accuracy and convergence speed of both MLPC and MLPR. Tanh zero-centers the data passed to neurons, which aids optimization. Relu avoids saturating the gradients.
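
For concreteness, here is a minimal sketch of the three activations as forward/derivative pairs, loosely modeled on the private ActivationFunction machinery in org.apache.spark.ml.ann; the trait shape and names below are assumptions for illustration, not the exact internal API:

```scala
// Sketch only: the trait shape is an assumption modeled on Spark's private
// ml.ann internals, not the actual internal API.
trait ActivationFunction extends Serializable {
  def eval: Double => Double        // forward pass
  def derivative: Double => Double  // backward pass, written in terms of the output z
}

object LinearFunction extends ActivationFunction {
  override def eval: Double => Double = x => x
  override def derivative: Double => Double = _ => 1.0  // identity: slope is one
}

object TanhFunction extends ActivationFunction {
  override def eval: Double => Double = x => math.tanh(x)
  override def derivative: Double => Double = z => 1.0 - z * z  // d/dx tanh(x) = 1 - z^2
}

object ReluFunction extends ActivationFunction {
  override def eval: Double => Double = x => math.max(0.0, x)
  override def derivative: Double => Double = z => if (z > 0.0) 1.0 else 0.0
}
```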

Automated Scaling

The data scaling is done through min-max scaling: the minimum label is subtracted from every value (giving a range of [0, max - min]), and the result is then divided by max - min to get a scale from 0 to 1. The corner case where max - min = 0 is resolved by omitting the division step.
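
As a concrete sketch of the transform and its inverse (plain Scala for illustration; not the PR's actual code):

```scala
// Min-max label scaling with the max == min corner case handled by
// omitting the division. Illustrative only; not this PR's implementation.
def scaleLabel(y: Double, min: Double, max: Double): Double = {
  val range = max - min
  if (range == 0.0) y - min        // all labels equal: shift only
  else (y - min) / range           // maps [min, max] onto [0, 1]
}

def unscaleLabel(yScaled: Double, min: Double, max: Double): Double = {
  val range = max - min
  if (range == 0.0) yScaled + min  // inverse of the shift-only case
  else yScaled * range + min       // maps [0, 1] back onto [min, max]
}
```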

Motivating Examples

[Two screenshots (2016-06-10) illustrating the motivating examples.]

API Decisions

The API is identical to MLPC except for softmaxOnTop: there is no option for the last-layer activation function or for the cost function (MLPC gives a choice between cross entropy and sum of squared error). This API has the user call MLPR with a set of layers that represents the topology of the network. The number of hidden layers is inferred from the layers parameter and is equal to the total number of layers - 2. Each hidden layer is a feedforward layer with a sigmoid activation function, up to the output layer with its linear activation.
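
For illustration, usage under the proposed API might look like the following; the class name and setters are assumed by analogy with MultilayerPerceptronClassifier rather than taken verbatim from this PR:

```scala
// Hypothetical usage of the proposed MLPR API; names assumed by analogy
// with MultilayerPerceptronClassifier.
import org.apache.spark.ml.regression.MultilayerPerceptronRegressor

val mlpr = new MultilayerPerceptronRegressor()
  .setLayers(Array(10, 5, 4, 1)) // 10 inputs, hidden layers of 5 and 4, 1 output
  .setMaxIter(100)
  .setSeed(42L)

val model = mlpr.fit(train)      // train: DataFrame with "features" and "label"
val predictions = model.transform(test)
```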

Input/Output Layer Argument

For MLPR, the output count will always be 1, and the number of inputs will always be equal to the number of features in the training dataset. One API choice could be to omit the input and output counts, have the user supply only the number of neurons in the hidden layers, and infer the input and output counts from the training data. At the very least, it makes sense to validate the user's layers parameter and display a helpful error message instead of the error from the data stacker that currently appears when an improper number of inputs or outputs is provided.
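
A minimal sketch of that validation (a hypothetical helper, assuming a Vector-typed "features" column; not code from this PR):

```scala
// Hypothetical validation helper; not code from this PR.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

def validateLayers(layers: Array[Int], dataset: DataFrame): Unit = {
  val numFeatures = dataset.select("features").head.getAs[Vector](0).size
  require(layers.length >= 2, "layers must include the input and output sizes.")
  require(layers.head == numFeatures,
    s"Input layer size ${layers.head} must equal the feature count $numFeatures.")
  require(layers.last == 1, s"MLPR output layer size must be 1, got ${layers.last}.")
}
```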

Modular API

It also would make sense for the API to be modular. A user will want the flexibility to use the linear layer at different points in the network (in MLPC as well), and will certainly want to be able to use new activation functions (tanh, relu) that are added to improve the performance of these models. That flexibility allows a user to tune the network to their dataset and will be particularly important for convnets or recurrent nets in the future. For the time being, we should decide on the best way to enable tanh and relu activations in this algorithm and in the classifier.

Automating Scaling

Current behavior is to automatically scale the data for the user. This makes a pass over the data. There are a few options. We could:

  1. Scale data internally, always.
  2. Scale data internally unless user provides min/max themselves.
  3. Create argument that turns internal scaling off/on. Default it to one or the other. Warn user if running on unscaled data.

There are also all the variants combining autoscaling or not, adding an argument or not, and warning the user or not.

The algorithm will run quite poorly on unscaled data, and so it makes sense to safeguard the user from this. But the same is true of data that is not centered and scaled, and we don't provide that automatically (though it may not be a bad idea as an option, given how sensitive this function, which is non-convex whenever there are hidden layers, can be to unscaled data). So there's a question of how much we hold the user's hand. I advocate for helpful defaults that can be overridden: scale automatically, give an option to run without scaling, and skip autoscaling when both the min and max are provided by the user, as sketched below.
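
A sketch of that default behavior (the user-supplied min/max parameters here are hypothetical, not part of this PR's API):

```scala
// "Autoscale unless the user supplies the range": skip the extra pass over
// the data when both bounds are given. Hypothetical parameter handling;
// assumes a DoubleType "label" column.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max => sqlMax, min => sqlMin}

def resolveLabelRange(dataset: DataFrame,
                      userMin: Option[Double],
                      userMax: Option[Double]): (Double, Double) =
  (userMin, userMax) match {
    case (Some(lo), Some(hi)) => (lo, hi)  // user-supplied: no data pass needed
    case _ =>
      val row = dataset.agg(sqlMin(col("label")), sqlMax(col("label"))).head
      (row.getDouble(0), row.getDouble(1))
  }
```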

Naming

Lastly, there's the naming of the multiLayerPerceptron / multiLayerPerceptronRegression functions in the FeedForwardTopology class in Layer.scala. For consistency it may make sense to rename multiLayerPerceptron to multiLayerPerceptronClassifier.

Features

There are a few features that have been checked:

  1. Integrates cleanly with pipeline API
  2. Model save/load is enabled
  3. Example data: the popular Boston housing dataset (scikit-learn's load_boston), scaled.
  4. Example code is included

Reference Resources

Christopher M. Bishop. Neural Networks for Pattern Recognition.
Patrick Nicolas. Scala for Machine Learning, Chapter 9.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, Chapter 6.

How was this patch tested?

The unit tests follow MLPC with the addition of a test for gradient descent. There are unit tests for:

  1. L-BFGS behavior on toy data
  2. Gradient descent on toy data
  3. Input Validation
  4. Set Weights Parameter
  5. Save/Load Functionality working
  6. Read / Write returns a model with similar layers and weights
  7. Support for all Numeric Types


@MLnick
Contributor

MLnick commented Jun 11, 2016

jenkins add to whitelist

@SparkQA

SparkQA commented Jun 11, 2016

Test build #60347 has finished for PR 13617 at commit 46783ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 12, 2016

Test build #60354 has finished for PR 13617 at commit 138fd25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks for the PR (with a great description)! FYI review on a big new feature may need to wait until 2.0 QA is done.

Also, @mengxr @avulanov have discussed regression before, and IIRC it was unclear if there were many use cases relative to classification. It may make sense to focus on merging improvements to classification first, though please push back if you can cite some use cases. Thanks!

@JeremyNixon
Contributor Author

JeremyNixon commented Jun 20, 2016

@jkbradley, of course you’re welcome for the PR! I’d be happy to discuss a few use cases.

Among MLlib algorithms, MLPR has the unique ability to generalize to unseen feature values that have a nonlinear relationship with the output. Examples of learning relationships such as x^2 = y and x1*x2 = y show that its performance is substantially better on such problems whenever the features leave the range the model was trained on. These types of relationships show up in almost every important modeling problem.

Anyone looking to put a model into production who wants it to perform well on new data that isn't well represented in training needs an algorithm that can generalize to that range. Below is a classic example: two variables in the dataset interact to predict the outcome variable well. Within the range of the training data, MLPR's performance is on par with gradient boosting, random forests, and linear regression. But outside the range of the training data, the tree-based models are incapable of generalizing. Linear regression can only generalize simple linear relationships, so it forces the user to manually encode the complex relationships they want modeled. Because MLPR automatically models the target as a nonlinear function of the features with a structure that generalizes well, it outperforms every other algorithm in MLlib in a context like this.
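
As a sketch of how such an extrapolation experiment can be set up (illustrative code, not from this PR; assumes a spark-shell style SparkSession in scope): train on x in [-10, 10] and evaluate on x outside that range.

```scala
// Toy data for the x^2 = y extrapolation experiment. Illustrative only.
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val train = (-100 to 100).map { i =>
  val x = i / 10.0
  (x * x, Vectors.dense(x))        // label = x^2, with x in [-10, 10]
}.toDF("label", "features")

val test = (110 to 200).map { i =>
  val x = i / 10.0                 // x in [11, 20]: outside the training range
  (x * x, Vectors.dense(x))
}.toDF("label", "features")
```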

[Screenshot (2016-06-19): predictions within vs. beyond the training range for MLPR, tree-based models, and linear regression.]

MLPR also shows consistent, robust performance on standard datasets. Below are examples of its performance relative to other models on Boston housing, Diabetes, and Iris (available here: http://scikit-learn.org/stable/datasets/#toy-datasets). All models use their default parameters (tanh activations and 50 neurons in a single hidden layer for MLPR) and are evaluated using RMSE. The train/test split is a random 70/30 split with no validation set. All data is scaled (mean and std) in preprocessing.

Boston Housing
NN - 3.87
DT - 4.17
RF - 3.23
GBT - 4.34
L2 LR - 4.4

Diabetes
NN - 51.3
DT - 65.2
RF - 55.6
GBT - 67.4
L2 LR - 52.24

Iris (Predicting Sepal Length)
NN - 0.376
DT - 0.451
RF - 0.386
GBT - 0.444
LR - 0.295

Together these properties (generalization to unseen feature values + consistent performance) make it a valuable algorithm to have in a production system that demands robust predictions. It learns a very different type of structure from the decision-tree-based models already in MLlib, and so has value as part of an ensemble whether or not it has the highest predictive score on the validation data. Situations where it does have the best predictive score are clear use cases.

You bring up improvements to classification as well. One downside to the current implementation of MLPC is that it forces users to use a sigmoid activation function, which has the unfortunate property of saturating the gradients. I provide support here for the more modern Tanh, Relu, and Linear activations, giving the user options that are zero-centered or avoid killing gradients, which can speed up convergence dramatically and improve accuracy. These benefits apply to both MLPR and MLPC, and should be included regardless of the decision on the MLPR API.

With a linear activation/layer and squared-error loss included, the library has all the functionality necessary to run MLPR. That functionality effectively exists in the library already: all of the critical components, from the topology to the optimizer to the activation functions, are already supported and maintained in MLlib. All we require is an API to call the algorithm.

That API could be as minimal as a single parameter to MLPC that replaces the last layer with a linear layer with squared error.

The downside to that is inconsistency with the rest of MLlib, and skimping on automated scaling would put users through a lot more work or risk them getting extremely poor results from misuse. The naming may also lead to confusion, where the user would be doing regression with an algorithm named for classification.

The currently proposed API is consistent with the rest of MLlib and with MLPC. It enables automated scaling and gives users a consistent experience, and so I recommend it. I can understand wanting the algorithm without having to support another API, so we can entertain more flexible options if that looks attractive.

I entirely understand w.r.t. 2.0 QA - I look forward to hearing the thoughts of @avulanov and @mengxr!

/**
 * Creates a multi-layer perceptron for regression
 *
 * @param layerSizes sizes of layers including input and output size
Member

Does it include output size? I think your output size is always 1?

Contributor

Yeah, I wanted to check - is there a reason we can't support n-dimensional output? It's less common, but the MLP can support it.

Contributor Author

That's right @viirya, the output is almost always 1. There are important corner cases with multiple outputs, like Overfeat's state-of-the-art object localization predictions (see section 4.2). Instead of training 1 linear regression model on top of the network's generated features, it trains 4 linear regression models that each predict a corner of a box that surrounds an object in the image. I think keeping the flexibility makes sense, but we'll need to add SPARK-9120.

@SparkQA

SparkQA commented Jun 20, 2016

Test build #60844 has finished for PR 13617 at commit 2dc114f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Jun 20, 2016

As per @avulanov's comment on SPARK-15581, if we do indeed plan to add the "essentials" for DL to Spark (e.g. MLP, CNN, autoencoder), then MLPR seems like it should be in there too. Especially since this PR is mostly "wrapper" code to expose the DF-based API, example, and tests. The core changes are minimal and open up a powerful model for users - I guess what I am saying is the "risk vs reward" here seems good.

Also, FWIW this is in scikit-learn dev (http://scikit-learn.org/dev/modules/neural_networks_supervised.html)

@sethah
Contributor

sethah commented Jun 21, 2016

@JeremyNixon Does it make sense to pull the new activation functions out of this PR and into a standalone one? I know this PR depends on some of them, but since it's a WIP and the other change is smaller, it can likely be merged before this one.

Regarding the use cases, you mention that MLPR has advantages in generalizing and learning non-linear relationships (advantages over what is currently in MLlib, anyway). Linear regression can be used to model non-linear relationships with some feature engineering, though it can be cumbersome and is not always practical. MLPR should be better, but presumably takes longer to train. It would be nice to show an example where the output is non-linear in the features, comparing MLPR against LR with polynomial expansion in spark.ml on a large dataset, as sketched below. Comparing predictive performance and algorithm runtimes would help paint a clearer picture of the tradeoffs. At some point, the number of features makes modeling higher-order interactions with linear regression impractical, but I'm not sure exactly where that point is or how well MLPR performs on the same data.
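
A sketch of the linear regression side of that comparison (PolynomialExpansion and LinearRegression are existing spark.ml classes; the MLPR side would use the API proposed in this PR):

```scala
// LinearRegression over polynomially expanded features, for comparison
// against the proposed MLPR on the same train/test split.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.regression.LinearRegression

val poly = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(2)                      // squares and pairwise interactions

val lr = new LinearRegression().setFeaturesCol("polyFeatures")
val lrModel = new Pipeline().setStages(Array(poly, lr)).fit(train)

val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .evaluate(lrModel.transform(test))
```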

@jkbradley
Member

@JeremyNixon Thanks for your thoughts. I agree this should get in, but want to make sure the priorities are clear.

With respect to examples of improvements, I really meant either (a) research papers showing the importance or (b) industry use cases. One can always construct examples where an algorithm is helpful, and I agree that feature engineering is likely a good use case. But references are very helpful for guidance.

+1 for separating the activation functions out into another PR

About scaling: I'd say this should mimic LinearRegression's standardization API.
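
For reference, the existing param on LinearRegression that the MLPR scaling option could mirror:

```scala
// Existing spark.ml API; setStandardization defaults to true.
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setStandardization(true) // standardize features before fitting; the model
                            // is still returned on the original scale
```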

@avulanov
Contributor

@JeremyNixon Thank you for your PR! Actually, regression was in the original Multilayer perceptron PR: a226133. However, we removed it after discussion with @mengxr. The reason is that regression needs to have only one output to be consistent with RegressionModel in Spark ML. We did not find evidence that a multilayer perceptron with one output is widely used in research or in industry. We posted a JIRA issue indicating that use cases are needed to justify the implementation of this model: https://issues.apache.org/jira/browse/SPARK-10409. There was no discussion until now, and I am glad that we finally have it. I think we are still missing some strong motivating use cases. Could you provide a few references to research papers or industrial applications that rely on MLP regression?

The other way of addressing this problem would be to implement multilayer perceptron regression with multiple outputs. Justifying its usefulness might be simpler. We might need to implement a multivariate regression interface beforehand: https://issues.apache.org/jira/browse/SPARK-9120

+1 for separating the activation functions into another PR. Currently, there is no public API to specify activation functions in hidden layers.

@JeremyNixon
Contributor Author

@avulanov Great to hear from you! I'd love to give you a short tour of MLPR's use cases.
@jkbradley Wonderful to hear that you agree this should get in, and I'm happy to provide a few applications and results from academia and industry.

  1. Computer Vision
    a. Object Localization / Detection as DNN Regression
    b. Human Pose Regression
  2. Finance
    a. Currency Exchange Rate
    b. Stock Price Prediction
    c. Forecasting Financial Time Series
    d. Crude Oil Price Prediction
  3. Atmospheric Sciences
    a. Air Quality Prediction
    b. Carbon Dioxide Pollution Prediction
    c. Ozone Concentration Modeling
    d. Sulphur Dioxide Concentration Prediction
  4. Infrastructure
    a. Road Tunnel Cost Estimation
    b. Highway Engineering Cost Estimation
  5. Geophysics
    a. Meteorology and Oceanography Applications
    b. Pacific Sea Surface Temperature Prediction
    c. Hydrological Modeling
  6. Summary

Computer Vision

(Assumes we include convolutional and pooling layer types)

Detection as DNN Regression - Object Localization Detection

Precise object localization is necessary to track an object's shape or movement. The approach includes a regression layer which generates an object binary mask, a binary representation of the object in the image. This creates an object detector, learning the location of an object or even specific parts of an object in an image.
http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf

ImageNet winning solution for Object Localization

Overfeat: http://arxiv.org/pdf/1312.6229v4.pdf

It would be nice to support multiple outputs for an application like object localization -
“4.2, Regressor Training: The regression network takes as input the pooled feature maps from layer 5. It has 2 fully-connected hidden layers of size 4096 and 1024 channels, respectively. The final output layer has 4 units which specify the coordinates for the bounding box edges.”

Pose Regression

Estimates the pose of humans in video, with results significantly better than the previous state of the art. The method is able to detect sign language and generalizes to finding the location of elbows/hands/head, etc.
https://www.robots.ox.ac.uk/~vgg/publications/2014/Pfister14a/pfister14a.pdf

Finance

Currency Exchange Rate

Neural Network Regression for forecasting the exchange rate between currencies. NN outperforms standard ARIMA methodology for forecasting.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.2442

Accurate Currency Exchange Rate Forecasting using MLPR
http://liawww.epfl.ch/uploads/project_reports/report_282.pdf

Stock Price Prediction: Comparison of Methods

Neural Network Regression outperforms other regression methods in stock price prediction.
https://arxiv.org/pdf/1003.1457.pdf

Forecasting Financial Time Series

Applying deep regression networks to forecast market prices.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.8688&rep=rep1&type=pdf

Crude Oil Price Prediction

Spot price forecasting for world crude oil.
http://www.sciencedirect.com/science/article/pii/S0140988308000765

Atmospheric Sciences

Overview

There are numerous applications across the atmospheric sciences, where highly nonlinear relationships need to be appropriately modeled.
https://www.researchgate.net/publication/263416087_Artificial_Neural_Networks_The_Multilayer_Perceptron_-_A_Review_of_Applications_in_the_Atmospheric_Sciences

Air Quality Prediction

Modeling nonlinear relationship between meteorology and pollution for surface ozone concentrations in industrialized areas.
https://www.researchgate.net/profile/VR_Prybutok/publication/8612909_Prybutok_R._A_neural_network_model_forecasting_for_prediction_of_daily_maximum_ozone_concentration_in_an_industrialized_urban_area._Environ._Pollut._92(3)_349-357/links/0deec53babcab9c32f000000.pdf

Air Pollution Prediction - Carbon Dioxide

Neural Network Regression outperforms multiple linear regression for carbon dioxide air pollution prediction in China.
http://202.116.197.15/cadalcanton/Fulltext/21276_2014319_102457_186.pdf

Atmospheric Sulphur dioxide concentrations

Many applications of Neural Network Regression to air pollution, including predicting sulfur dioxide concentration.
http://cdn.intechweb.org/pdfs/17396.pdf

Ozone Concentration Comparison

Neural Networks for Regression outperform decision trees and linear regression for modeling nonlinear relationships required to predict ozone concentration.
https://www.researchgate.net/publication/263416130_Statistical_Surface_Ozone_Models_An_Improved_Methodology_to_Account_for_Non-Linear_Behaviour

Infrastructure

Road Tunnel Cost Estimation

Regression Neural Network leads to accurate cost estimation for road tunnels.
http://ascelibrary.org/doi/abs/10.1061/(ASCE)CO.1943-7862.0000479

Highway Engineering Cost Estimation

Neural Networks reliably predict the cost of highway construction projects.
http://www.jcomputers.us/vol5/jcp0511-19.pdf

Geophysics

Pacific Sea Surface Temperature

Sea surface temperature dynamics are nonlinear; presents an MLPR outperforming linear regression models over this domain.
http://www.ncbi.nlm.nih.gov/pubmed/16527455

Meteorology and Oceanography

Improving neural network methods for many tasks in meteorology and oceanography, including seasonal climate forecasting, various time series, satellite imagery analysis, ocean acoustics and more.
https://open.library.ubc.ca/cIRcle/collections/facultyresearchandpublications/32536/items/1.0041821

Hydrological Modeling

River flow forecasting from satellite data with neural networks.
http://hydrol-earth-syst-sci.net/13/1607/2009/hess-13-1607-2009.pdf
Modeling of nonlinear hydrological relationships for river basin (watershed) management.
http://jh.iwaponline.com/content/ppiwajhydro/10/1/3.full.pdf

@JeremyNixon
Contributor Author

+1 for multiple outputs. Deep NN regression with multiple outputs has achieved state-of-the-art performance on object localization tasks.

Let's have a conversation about the public API for activations but also for flexible neural network models in general - I'll put together a design doc and ping you on JIRA as well as break the activation functions into another PR.

@avulanov
Contributor

avulanov commented Jul 12, 2016

@JeremyNixon Thanks for the comprehensive list of references!

The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One of the options for specifying the architecture of an ANN is to use a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementing a parser for a subset of the Caffe format might be the first step towards the support of general ANN architectures in Spark. However, other ANN features are of higher priority for Spark ML right now: https://issues.apache.org/jira/browse/SPARK-15581. In particular, Autoencoder and CNN. It would be great if you could help with them. For example, review the Autoencoder PR: #13621

With regards to the advanced ANN features, we are currently building a package that is intended to support them. Eventually, some of them might find their place in the main branch. It would be great to collaborate on this effort.

@avulanov
Contributor

avulanov commented Sep 9, 2016

@JeremyNixon I have released version 1.0.0 of the scalable-deeplearning package. This package is based on the implementation of artificial neural networks in Spark ML. It is intended for new Spark deep learning features that have not yet been merged to Spark ML or that are too specific to be merged. Contributions are very welcome. I think we can merge your MLP regression proposal after some modifications. Are you interested?

@JeremyNixon
Contributor Author

@avulanov I am interested - how about I replicate this PR at github.com/avulanov/scalable-deeplearning and we discuss details there?

@SparkQA

SparkQA commented Sep 12, 2016

Test build #65278 has finished for PR 13617 at commit 509cb23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@avulanov
Contributor

@JeremyNixon Sounds good!

@SparkQA

SparkQA commented Nov 1, 2016

Test build #67929 has finished for PR 13617 at commit a5d9972.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2016

Test build #67932 has finished for PR 13617 at commit 322f3bd.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2016

Test build #67934 has finished for PR 13617 at commit be4c5ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2016

Test build #67933 has finished for PR 13617 at commit f3a1193.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74122 has finished for PR 13617 at commit 16b7bc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MultilayerPerceptronRegressor @Since("2.0.2") (

@Nickersoft

@JeremyNixon @avulanov Any update on this? I noticed neither this PR nor the one on the deeplearning package was ever merged, and it is the only resource I can find regarding neural net-based regression in Spark.

@yolile

yolile commented Jan 12, 2018

@JeremyNixon @avulanov @MLnick @mengxr @jkbradley any update? Is this PR going to be merged?

@Neuw84

Neuw84 commented Jun 4, 2018

Hi,
Another important use case for the MLP regressor is forecasting machine sensor data. One of our clients uses this approach for predictive maintenance on Industry 4.0 assets. We were hoping to replace their custom implementation, built on an ad-hoc library, with the Spark ML implementation, but we are blocked until this gets merged.

Can't go into too much detail about the use case, but it's in production in industrial environments. The general approach is to predict one sensor value based on others.

Any updates?

P.S. Anyway, just in case it helps anyone, I have backported the pull request code to the 2.3 branch on my fork.

Neuw84 added a commit to Neuw84/spark that referenced this pull request Jun 5, 2018
Neuw84 added a commit to Neuw84/spark that referenced this pull request Jun 7, 2018
@d-kulikov

Hi, does anyone have an idea why this was abandoned?

@srowen
Member

srowen commented Oct 14, 2019

Generally speaking, I'd say this was superseded by third-party deep learning packages, several of which can be used on top of Spark.

@srowen srowen closed this Oct 26, 2019