[SPARK-10409] [ML] Add Multilayer Perceptron Regression to ML #13617
Conversation
jenkins add to whitelist
Test build #60347 has finished for PR 13617 at commit
Test build #60354 has finished for PR 13617 at commit
Thanks for the PR (with a great description)! FYI, review on a big new feature may need to wait until 2.0 QA is done. Also, @mengxr and @avulanov have discussed regression before, and IIRC it was unclear whether there were many use cases relative to classification. It may make sense to focus on merging improvements to classification first, though please push back if you can cite some use cases. Thanks!
@jkbradley, of course, you're welcome for the PR! I'd be happy to discuss a few use cases.

Among MLlib algorithms, MLPR has the unique ability to generalize to unseen feature values that have a nonlinear relationship with the output. Examples of learning relationships such as x^2 = y and x1*x2 = y show that its performance is exceptionally better on such problems whenever the features leave the range the model was trained on. These types of relationships show up in almost every important modeling problem. Anyone looking to put a model into production who wants it to perform well on new data that isn't well represented in training needs an algorithm that can generalize to that range.

Below is a classic example: two variables in the dataset interact to predict the outcome variable well. Within the range of the training data, MLPR's performance is on par with Gradient Boosting, Random Forests, and Linear Regression. But outside the range of the training data, the tree-based models are incapable of generalizing. Linear regression can only generalize simple linear relationships, so it forces the user to manually encode the complex relationships they want modeled. Because MLPR automatically models the target as a nonlinear function of the features with a structure that generalizes well, it outperforms every other algorithm in MLlib in a context like this.

MLPR also shows consistent, robust performance on standard datasets. Below are examples of its performance relative to other models on Boston, Diabetes, and Iris (available here: http://scikit-learn.org/stable/datasets/#toy-datasets). All models use their default parameters (tanh activations and 50 neurons in a single hidden layer for MLPR) and are evaluated using RMSE. The train/test split is a random 70/30 split with no validation set. All data is scaled (mean and std) in preprocessing.

[Results tables: Boston, Diabetes, Iris (Predicting Sepal Length)]

Together these properties (generalization to unseen feature values + consistent performance) make it a valuable algorithm to have in a production system that demands robust predictions. It learns a very different type of structure from the decision-tree-based models already in MLlib, and so has value as part of an ensemble whether or not it has the highest predictive score on the validation data. Situations where it does have the best predictive score are clear use cases.

You bring up improvements to classification as well. One downside to the current implementation of MLPC is that it forces users to use a sigmoid activation function, which has the unfortunate property of saturating the gradients. I provide support here for the more modern tanh, ReLU, and linear activations, which gives the user options that are zero-centered or do not kill gradients, and that can speed up convergence dramatically and improve accuracy. These benefits will go to both MLPR and MLPC, and should be included regardless of the decision on the MLPR API.

With a linear activation/layer and squared error loss included, the library has all the functionality necessary to run MLPR. That functionality already effectively exists in the library: all of the critical components, from the topology to the optimizer to the activation functions, are already supported and maintained in MLlib. All we require is an API to call the algorithm. That API could be as minimal as a single parameter to MLPC that replaces the last layer with a linear layer with squared error.
The downside to that is inconsistency with the rest of MLlib, and skimping on automated scaling would put users through a lot more work or risk them getting extremely poor results from misuse. The naming may also lead to confusion, since the user would be doing regression with an algorithm named for classification. The current proposed API is consistent with the rest of MLlib and with MLPC. It enables automated scaling and gives users a consistent experience, and so I recommend it. I can understand wanting the algorithm without having to support another API, so we can entertain more flexible options if that looks attractive.

I entirely understand w.r.t. 2.0 QA. I look forward to hearing the thoughts of @avulanov and @mengxr!
/**
 * Creates a multi-layer perceptron for regression
 *
 * @param layerSizes sizes of layers including input and output size
Does it include output size? I think your output size is always 1?
Yeah I wanted to check - is there a reason we can't support n-dimension output? It's less common but the MLP can support it.
That's right @viirya, the output is almost always 1. There are important corner cases with multiple outputs, like OverFeat's state-of-the-art object localization predictions (see section 4.2). Instead of training 1 linear regression model on top of the network's generated features, it trains 4 linear regression models that each predict a corner of a box that surrounds an object in the image. I think that keeping the flexibility makes sense, but we'll need to add SPARK-9120.
Test build #60844 has finished for PR 13617 at commit
As per @avulanov's comment on SPARK-15581, if we do indeed plan to add the "essentials" for DL to Spark (e.g. MLP, CNN, autoencoder), then MLPR seems like it should be in there too, especially since this PR is mostly "wrapper" code to expose the DF-based API, an example, and tests. The core changes are minimal and open up a powerful model for users; I guess what I am saying is the risk vs. reward here seems good. Also, FWIW, this is in scikit-learn dev (http://scikit-learn.org/dev/modules/neural_networks_supervised.html)
@JeremyNixon Does it make sense to pull the new activation functions out of this PR and into a standalone one? I know this PR depends upon some of them, but since it's a WIP and the other change is smaller, the other change could likely be merged before this one.

Regarding the use cases, you mention that MLPR has advantages in generalizing and learning non-linear relationships (advantages over what is currently in MLlib, anyway). Linear regression can be used to model non-linear relationships with some feature engineering, though that can be cumbersome and is not always practical. MLPR should be better, but presumably takes longer to train. It might be nice to show examples of a case where the output is non-linear in the features with MLPR and LR in Spark.ML, where LR is used with polynomial expansion, on a large dataset. Comparing predictive performance and algorithm runtimes would help paint a clearer picture of the tradeoffs. At some point, the number of features makes modeling higher-order interactions with linear regression impractical, but I'm not sure exactly where that point is or how well MLPR can perform on the same data.
@JeremyNixon Thanks for your thoughts. I agree this should get in, but I want to make sure the priorities are clear. With respect to examples of improvements, I really meant either (a) research papers showing the importance or (b) industry use cases. One can always construct examples where an algorithm is helpful, and I agree that feature engineering is likely a good use case. But references are very helpful for guidance.

+1 for separating the activation functions out into another PR.

About scaling: I'd say this should mimic LinearRegression's standardization API.
@JeremyNixon Thank you for your PR! Actually, regression was in the original multilayer perceptron PR: a226133. However, we removed it after discussion with @mengxr. The reason is that regression needs to have only one output to be consistent with the existing regression interface. The other way of addressing this problem would be to implement multilayer perceptron regression with multiple outputs. Justifying its usefulness might be simpler. We might need to implement a multivariate regression interface beforehand: https://issues.apache.org/jira/browse/SPARK-9120

+1 for separating the activation functions into another PR. Currently, there is no public API to specify activation functions in hidden layers.
@avulanov Great to hear from you! I'd love to give you a short tour of MLPR's use cases.
Computer Vision (assumes we include convolutional and pooling layer types)

- Detection as DNN Regression, Object Localization: Precise object localization is necessary to track an object's shape or movement. Includes a regression layer which generates an object binary mask, a binary representation of the object in the image. This creates an object detector, learning the location of an object or even specific parts of an object in an image.
- ImageNet-winning solution for Object Localization, OverFeat: http://arxiv.org/pdf/1312.6229v4.pdf. It would be nice to support multiple outputs for an application like object localization.
- Pose Regression: Estimate the pose of humans in video, with results significantly better than the previous state of the art. Able to detect sign language; generalizes to finding the location of elbows/hands/head, etc.

Finance

- Currency Exchange Rate: Neural network regression for forecasting the exchange rate between currencies. NN outperforms standard ARIMA methodology for forecasting (Accurate Currency Exchange Rate Forecasting using MLPR).
- Stock Price Prediction, Comparison of Methods: Neural network regression outperforms other regression methods in stock price prediction.
- Forecasting Financial Time Series: Applying deep regression networks to forecast market prices.
- Crude Oil Price Prediction: Spot price forecasting for world crude oil.

Atmospheric Sciences

- Overview: There are numerous applications across the atmospheric sciences where highly nonlinear relationships need to be appropriately modeled.
- Air Quality Prediction: Modeling the nonlinear relationship between meteorology and pollution for surface ozone concentrations in industrialized areas.
- Air Pollution Prediction, Carbon Dioxide: Neural network regression outperforms multiple linear regression for carbon dioxide air pollution prediction in China.
- Atmospheric Sulphur Dioxide Concentrations: Many applications of neural network regression to air pollution, including predicting sulfur dioxide concentration.
- Ozone Concentration Comparison: Neural networks for regression outperform decision trees and linear regression when modeling the nonlinear relationships required to predict ozone concentration.

Infrastructure

- Road Tunnel Cost Estimation: Regression neural network leads to accurate cost estimation for road tunnels.
- Highway Engineering Cost Estimation: Neural networks reliably predict the cost of highway construction projects.

Geophysics

- Pacific Sea Surface Temperature: Surface temperature prediction environments are nonlinear; presentation of an MLPR outperforming linear regression models over the domain.
- Meteorology and Oceanography: Improving neural network methods for many tasks in meteorology and oceanography, including seasonal climate forecasting, various time series, satellite imagery analysis, ocean acoustics, and more.
- Hydrological Modeling: River flow forecasting from satellite data with neural networks.
+1 for multiple outputs. Deep NN regression with multiple outputs has achieved state-of-the-art performance on object localization tasks. Let's have a conversation about the public API for activations, but also for flexible neural network models in general. I'll put together a design doc and ping you on JIRA, as well as break the activation functions into another PR.
@JeremyNixon Thanks for the comprehensive list of references! The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One of the options to specify the architecture of an ANN is to use a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementing a parser for a subset of the Caffe format might be the first step towards supporting general ANN architectures in Spark.

However, other ANN features are of higher priority for Spark ML right now: https://issues.apache.org/jira/browse/SPARK-15581. In particular, Autoencoder and CNN. It would be great if you could help with them, for example by reviewing the Autoencoder PR: #13621

With regards to the advanced ANN features, we are currently building a package that is supposed to support them. Eventually, some of them might find their place in the main branch. It would be great to collaborate on this effort.
@JeremyNixon I have released version 1.0.0 of the scalable-deeplearning package. This package is based on the implementation of artificial neural networks in Spark ML. It is intended for new Spark deep learning features that have not yet been merged to Spark ML or that are too specific to be merged. Contributions are very welcome. I think we can merge your MLP regression proposal after some modifications. Are you interested?
@avulanov I am interested - how about I replicate this PR at github.com/avulanov/scalable-deeplearning and we discuss details there?
Test build #65278 has finished for PR 13617 at commit
@JeremyNixon Sounds good! |
Test build #67929 has finished for PR 13617 at commit
Test build #67932 has finished for PR 13617 at commit
Test build #67934 has finished for PR 13617 at commit
Test build #67933 has finished for PR 13617 at commit
Test build #74122 has finished for PR 13617 at commit
@JeremyNixon @avulanov Any update on this? I noticed neither this PR nor the one on the deep learning package was ever merged, and this is the only resource I can find regarding neural-net-based regression in Spark.
@JeremyNixon @avulanov @MLnick @mengxr @jkbradley any update? Is this PR going to be merged?
Hi. I can't go into too much detail about the use case, but it's in production in industrial environments. The general approach is to predict one sensor value based on others. Any updates?

P.S. Anyway, just in case it helps anyone, I have backported the pull request code to the 2.3 branch on my fork.
Hi, does anyone have an idea why this was abandoned?
Generally speaking, I'd say this was superseded by third-party deep learning packages, several of which can be used on top of Spark.
What changes were proposed in this pull request?
This is a pull request adding support for Multilayer Perceptron Regression, the counterpart to the Multilayer Perceptron Classifier (hereafter MLPR and MLPC).
Outline
Convenience Link to JIRA: https://issues.apache.org/jira/browse/SPARK-10409
Major Changes
There are two major differences between MLPR and MLPC. The first is the use of a linear (identity) activation function and a sum-of-squared-error cost function in the last layer of the network. The second is the requirement to scale the data to [0, 1] and back, to make it easy for the weights to fit a value in the proper range.
Linear, Relu, Tanh Activations
In the forward pass, the linear activation passes the value from the fully connected layer through to become the network prediction. In weight adjustment during the backward pass, its derivative is one. All regression models use the linear activation in the last layer, so there is no option (as there is in MLPC) to use another activation function and cost function in the last layer.
ReLU and tanh are activation functions that will benefit the accuracy and convergence speed of both MLPC and MLPR. Tanh zero-centers the data passed to neurons, which aids optimization. ReLU avoids saturating the gradients.
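For intuition, here is a minimal Scala sketch of the scalar forms of these activations and their derivatives. This is illustration only, not MLlib's internal ActivationFunction API (which operates on matrices in place):

```scala
// Scalar forms of the three activations discussed above (sketch only).
object Activations {
  // Linear (identity): passes values through; derivative is 1, so the
  // output layer's gradient is just the loss gradient.
  def linear(x: Double): Double = x
  def linearDeriv(x: Double): Double = 1.0

  // Tanh: zero-centered output in (-1, 1), which aids optimization.
  def tanh(x: Double): Double = math.tanh(x)
  def tanhDeriv(x: Double): Double = { val t = math.tanh(x); 1.0 - t * t }

  // ReLU: does not saturate for positive inputs, so gradients are not
  // "killed" the way sigmoid gradients can be.
  def relu(x: Double): Double = math.max(0.0, x)
  def reluDeriv(x: Double): Double = if (x > 0.0) 1.0 else 0.0
}
```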
Automated Scaling
The data scaling is done through min-max scaling: the minimum label is subtracted from every value (leading to a range of [0, max - min]), and the result is then divided by max - min to get a scale from 0 to 1. The corner case where max - min = 0 is resolved by omitting the division step.
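A minimal sketch of that label scaling, with hypothetical helper names (the PR wires the equivalent logic into the training and prediction paths; min and max would come from one pass over the labels):

```scala
// Hypothetical sketch of the min-max label scaling described above.
case class LabelScaler(min: Double, max: Double) {
  private val range = max - min

  // Map a raw label into [0, 1]; when max == min, skip the division.
  def scale(label: Double): Double =
    if (range == 0.0) label - min else (label - min) / range

  // Invert the scaling so predictions return in the original units.
  def unscale(scaled: Double): Double =
    if (range == 0.0) scaled + min else scaled * range + min
}
```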
Motivating Examples
API Decisions
The API is identical to MLPC with the exception of softmaxOnTop: there is no option on the last-layer activation function, or on the cost function to be used (MLPC gives a choice between cross entropy and sum of squared error). This API has the user call MLPR with a set of layers that represents the topology of the network. The number of hidden layers is inferred from the layers parameter and is equal to the total number of layers minus 2. Each hidden layer is a feedforward layer with a sigmoid activation function, up to the output layer and its linear activation.
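A hedged usage sketch, assuming the proposed regressor mirrors MultilayerPerceptronClassifier's builder-style API; the class name, import path, and the trainingData/testData DataFrames are assumptions, not merged Spark API:

```scala
// Hypothetical usage of the proposed regressor (mirrors MLPC's API).
import org.apache.spark.ml.regression.MultilayerPerceptronRegressor

// layers = Array(numFeatures, hidden sizes..., 1): 4 input features,
// two hidden layers of 5 neurons, and a single linear output unit.
val mlpr = new MultilayerPerceptronRegressor()
  .setLayers(Array(4, 5, 5, 1))
  .setMaxIter(100)
  .setSeed(42L)

// trainingData/testData: DataFrames with "features" and "label" columns.
val model = mlpr.fit(trainingData)
val predictions = model.transform(testData)
```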
Input/Output Layer Argument
For MLPR, the output count will always be 1, and the number of inputs will always be equal to the number of features in the training dataset. One API choice could be to omit the input and output counts, have the user supply only the number of neurons in the hidden layers, and automate the input and output counts by looking at the training data. At the very least, it makes sense to validate the user's layers parameter and display a helpful error message instead of the error in the data stacker that currently appears if an improper number of inputs or outputs is provided.
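A sketch of the kind of validation suggested here, with a hypothetical helper name (the real check would live in the fit path):

```scala
// Hypothetical validation of the user-supplied layers parameter,
// producing a clear error instead of a failure deep in the data stacker.
def validateLayers(layers: Array[Int], numFeatures: Int): Unit = {
  require(layers.length >= 2, "layers must include input and output sizes.")
  require(layers.head == numFeatures,
    s"Input layer size ${layers.head} did not match the number of features $numFeatures.")
  require(layers.last == 1,
    s"Output layer size must be 1 for regression, but got ${layers.last}.")
}
```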
Modular API
It would also make sense for the API to be modular. A user will want the flexibility to use the linear layer at different points in the network (as well as in MLPC), and will certainly want to be able to use the new activation functions (tanh, ReLU) that are added to improve the performance of these models. That flexibility allows a user to tune the network to their dataset and will be particularly important for convnets or recurrent nets in the future. We should decide on the best way to enable tanh and ReLU activations in this algorithm and in the classifier for the time being.
Automating Scaling
Current behavior is to automatically scale the data for the user, which makes a pass over the data. There are a few options: we could autoscale or not, add an argument controlling it or not, and warn the user or not, as well as all the variants between these.
The algorithm will run quite poorly on unscaled data, so it makes sense to safeguard the user from this. But the same is true of data that is not centered and scaled, and we don't provide that automatically (though it may not be a bad idea as an option, given how sensitive this non-convex function can be to unscaled data whenever there are hidden layers). So there's a question of how much we hold the user's hand. I advocate for helpful defaults that can be overridden: scale automatically, but give an option to run without scaling, and skip autoscaling if both the min and max are provided by the user.
Naming
Lastly, there's the naming of the multiLayerPerceptron / multilayerPerceptronRegression functions in the FeedForwardTopology class in Layer.scala. For consistency, it may make sense to change multiLayerPerceptron to multiLayerPerceptronClassifier.
Features
There are a few features that have been checked:
Reference Resources
Christopher M. Bishop. Neural Networks for Pattern Recognition.
Patrick Nicolas. Scala for Machine Learning, Chapter 9.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, Chapter 6.
How was this patch tested?
The unit tests follow MLPC with the addition of a test for gradient descent. There are unit tests for: