[SPARK-9671] [MLLIB] re-org user guide and add migration guide
This PR updates the MLlib user guide and adds migration guide for 1.4->1.5.

* merge migration guide for `spark.mllib` and `spark.ml` packages
* remove dependency section from `spark.ml` guide
* move the paragraph about `spark.mllib` and `spark.ml` to the top and recommend `spark.ml`
* move Sam's talk to footnote to make the section focus on dependencies

Minor changes to code examples and other wording will be in a separate PR.

jkbradley srowen feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8498 from mengxr/SPARK-9671.
mengxr committed Aug 28, 2015
1 parent 4572321 commit 88032ec
Showing 3 changed files with 95 additions and 106 deletions.
52 changes: 6 additions & 46 deletions docs/ml-guide.md
@@ -21,19 +21,11 @@ title: Spark ML Programming Guide
\]`


Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.

*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.

Note that we will keep supporting and adding features to `spark.mllib` along with the
development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
to `spark.ml`.

See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of `spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.

The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of
[DataFrames](sql-programming-guide.html#dataframes) that help users create and tune practical
machine learning pipelines.
See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of
`spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
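As a rough illustration (not part of this commit), a minimal `spark.ml` pipeline in the 1.5-era API might look like the sketch below; the `training` and `test` DataFrames and their column names are assumptions made for the example:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Assumed: `training` and `test` are DataFrames with "text" and "label" columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Chain the stages into one Pipeline; fitting it runs each stage in order.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// The fitted PipelineModel is itself a Transformer that can score new data.
val predictions = model.transform(test)
```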

**Table of Contents**

@@ -171,7 +163,7 @@ This is useful if there are two algorithms with the `maxIter` parameter in a `Pi

# Algorithm Guides

There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in the Pipelines.
There are now several algorithms in the Pipelines API which are not in the `spark.mllib` API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in the Pipelines.

**Pipelines API Algorithm Guides**

@@ -880,35 +872,3 @@ jsc.stop();
</div>

</div>

# Dependencies

Spark ML currently depends on MLlib and has the same dependencies.
Please see the [MLlib Dependencies guide](mllib-guide.html#dependencies) for more info.

Spark ML also depends upon Spark SQL, but the relevant parts of Spark SQL do not bring additional dependencies.

# Migration Guide

## From 1.3 to 1.4

Several major API changes occurred, including:
* `Param` and other APIs for specifying parameters
* `uid` unique IDs for Pipeline components
* Reorganization of certain classes
Since the `spark.ml` API was an Alpha Component in Spark 1.3, we do not list all changes here.

However, now that `spark.ml` is no longer an Alpha Component, we will provide details on any API changes for future releases.

## From 1.2 to 1.3

The main API changes are from Spark SQL. We list the most important changes here:

* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.

Other changes were in `LogisticRegression`:

* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
119 changes: 59 additions & 60 deletions docs/mllib-guide.md
@@ -5,21 +5,28 @@ displayTitle: Machine Learning Library (MLlib) Guide
description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
---

MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization primitives.
Guides for individual algorithms are listed below.
MLlib is Spark's machine learning (ML) library.
Its goal is to make practical machine learning scalable and easy.
It consists of common learning algorithms and utilities, including classification, regression,
clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization
primitives and higher-level pipeline APIs.

The API is divided into 2 parts:
It divides into two packages:

* [The original `spark.mllib` API](mllib-guide.html#mllib-types-algorithms-and-utilities) is the primary API.
* [The "Pipelines" `spark.ml` API](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) is a higher-level API for constructing ML workflows.
* [`spark.mllib`](mllib-guide.html#mllib-types-algorithms-and-utilities) contains the original API
built on top of RDDs.
* [`spark.ml`](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) provides a higher-level API
built on top of DataFrames for constructing ML pipelines.

We list major functionality from both below, with links to detailed guides.
Using `spark.ml` is recommended because, with DataFrames, the API is more versatile and flexible.
But we will keep supporting `spark.mllib` along with the development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.ml` if they fit the ML pipeline concept well,
e.g., feature extractors and transformers.

# MLlib types, algorithms and utilities
We list major functionality from both below, with links to detailed guides.

This lists functionality included in `spark.mllib`, the main MLlib API.
# spark.mllib: data types, algorithms, and utilities

* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)
@@ -56,71 +63,63 @@ This lists functionality included in `spark.mllib`, the main MLlib API.
* [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)
* [PMML model export](mllib-pmml-model-export.html)

MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

# spark.ml: high-level APIs for ML pipelines

Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.

*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.

Note that we will keep supporting and adding features to `spark.mllib` along with the
development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
to `spark.ml`.

Guides for `spark.ml` include:
**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major
concepts. It also contains sections on using algorithms within the Pipelines API, for example:

* **[spark.ml programming guide](ml-guide.html)**: overview of the Pipelines API and major concepts
* Guides on using algorithms within the Pipelines API:
* [Feature transformers](ml-features.html), including a few not in the lower-level `spark.mllib` API
* [Decision trees](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods](ml-linear-methods.html)
* [Feature Extraction, Transformation, and Selection](ml-features.html)
* [Decision Trees for Classification and Regression](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods with elastic net regularization](ml-linear-methods.html)
* [Multilayer perceptron classifier](ml-ann.html)

# Dependencies

MLlib uses the linear algebra package
[Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised
numerical processing. If natives are not available at runtime, you
will see a warning message and a pure JVM implementation will be used
instead.
MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
implementation will be used instead.

To learn more about the benefits and background of system optimised
natives, you may wish to watch Sam Halliday's ScalaX talk on
[High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/)).
Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
proxies by default.
To configure `netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
platform's additional installation instructions.
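For instance, a sketch of pulling in the optimised natives with sbt (illustrative only, using the coordinates from the guide above; `pomOnly()` reflects that the `all` artifact is published as a POM):

```scala
// build.sbt — a hedged sketch; consult the netlib-java documentation for your platform
libraryDependencies += ("com.github.fommil.netlib" % "all" % "1.1.2").pomOnly()
```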

Due to licensing issues with runtime proprietary binaries, we do not
include `netlib-java`'s native proxies by default. To configure
`netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with
`-Pnetlib-lgpl`) as a dependency of your project and read the
[netlib-java](https://github.com/fommil/netlib-java) documentation for
your platform's additional installation instructions.
To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.

To use MLlib in Python, you will need [NumPy](http://www.numpy.org)
version 1.4 or newer.
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

---
# Migration guide

# Migration Guide
MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

## From 1.4 to 1.5

For the `spark.ml` package, please see the [spark.ml Migration Guide](ml-guide.html#migration-guide).
In the `spark.mllib` package, there are no breaking API changes, but there are several behavior changes:

## From 1.3 to 1.4
* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
`RegressionMetrics.explainedVariance` returns the average regression sum of squares.
* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` are now
sorted.
* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
convergence tolerance of `1e-3`, and hence iterations might end earlier than in 1.4.

In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
In the `spark.ml` package, there is one breaking API change and one behavior change:

* Gradient-Boosted Trees
* *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
* *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
from `Params.setDefault` due to a
[Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
added to indicate metric ordering. Metrics such as RMSE no longer flip sign as they did in 1.4 (see
the sketch below).
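A small sketch of the new flag (assuming the 1.5 `RegressionEvaluator`; no SparkContext is needed just to inspect the property):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// RMSE is a "smaller is better" metric, so the evaluator now reports that
// explicitly instead of flipping the metric's sign as was done in 1.4.
val evaluator = new RegressionEvaluator().setMetricName("rmse")
println(evaluator.isLargerBetter) // false
```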

## Previous Spark Versions
## Previous Spark versions

Earlier migration guides are archived [on this page](mllib-migration-guides.html).

---
30 changes: 30 additions & 0 deletions docs/mllib-migration-guides.md
@@ -7,6 +7,25 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT

The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).

## From 1.3 to 1.4

In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:

* Gradient-Boosted Trees
* *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
* *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.

In the `spark.ml` package, several major API changes occurred, including:

* `Param` and other APIs for specifying parameters
* `uid` unique IDs for Pipeline components
* Reorganization of certain classes

Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API
changes for future releases.

## From 1.2 to 1.3

In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
@@ -23,6 +42,17 @@ In the `spark.mllib` package, there were several breaking changes. The first ch
* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
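A quick sketch of why this works (ignoring constant factors such as `1/n` and writing a generic regularizer `R(w)`): the 1.3 objective with half the regularization parameter is

`\[
f_{1.3}(w;\ \lambda/2) = \tfrac{1}{2}\sum_i (w^\top x_i - y_i)^2 + \tfrac{\lambda}{2} R(w) = \tfrac{1}{2} f_{1.2}(w;\ \lambda),
\]`

so both objectives share the same minimizer, and since the gradient is also halved, doubling the step size reproduces the 1.2 updates.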

In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:

* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._` (see the sketch after this list).
* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
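A minimal sketch of the new import pattern (illustrative app name and data; assumes a 1.3+ `SQLContext`):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext(new SparkConf().setAppName("ImplicitsMigration").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Spark 1.2 (old):  import sqlContext._
// Spark 1.3+ (new): the RDD-to-DataFrame implicits now live here:
import sqlContext.implicits._

val points = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(0.5, 1.5))))
val df = points.toDF() // conversion provided by sqlContext.implicits._
df.show()
```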

Other changes were in `LogisticRegression`:

* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.

## From 1.1 to 1.2

The only API changes in MLlib v1.2 are in
