[SPARK-9671] [MLLIB] re-org user guide and add migration guide #8498

Closed · wants to merge 2 commits
Changes from 1 commit
51 changes: 6 additions & 45 deletions docs/ml-guide.md
@@ -21,19 +21,10 @@ title: Spark ML Programming Guide
\]`


Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.

*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.

Note that we will keep supporting and adding features to `spark.mllib` along with the
development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
to `spark.ml`.

See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of `spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.

The `spark.ml` package aims to provide a uniform set of high-level APIs that help users create and
tune practical machine learning pipelines.

Contributor:
I know it is already mentioned in mllib-guide, but should we mention DataFrames here in ml-guide's introduction, since it is the key distinction in comparison to `spark.mllib`?

Contributor (author):
okay

See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of
`spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.

**Table of Contents**

@@ -171,7 +162,7 @@ This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.

# Algorithm Guides

There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in the Pipelines.
There are now several algorithms in the Pipelines API which are not in the `spark.mllib` API, so we link to their documentation here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.
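As a rough sketch of how those two abstractions compose (a minimal example; `training` and `test` are assumed DataFrames with `"text"` and `"label"` columns):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Transformers rewrite a DataFrame column by column; Estimators learn a model from one.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// A Pipeline chains them into a single Estimator; fit() returns a PipelineModel Transformer.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)      // `training`: assumed DataFrame with "text" and "label"
val predictions = model.transform(test) // `test`: assumed DataFrame with a "text" column
```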

**Pipelines API Algorithm Guides**

@@ -868,34 +859,4 @@ jsc.stop();

</div>

# Dependencies

Spark ML currently depends on MLlib and has the same dependencies.
Please see the [MLlib Dependencies guide](mllib-guide.html#dependencies) for more info.

Spark ML also depends upon Spark SQL, but the relevant parts of Spark SQL do not bring additional dependencies.

# Migration Guide

## From 1.3 to 1.4

Several major API changes occurred, including:
* `Param` and other APIs for specifying parameters
* `uid` unique IDs for Pipeline components
* Reorganization of certain classes
Since the `spark.ml` API was an Alpha Component in Spark 1.3, we do not list all changes here.

However, now that `spark.ml` is no longer an Alpha Component, we will provide details on any API changes for future releases.

## From 1.2 to 1.3

The main API changes are from Spark SQL. We list the most important changes here:

* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.

Other changes were in `LogisticRegression`:

* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
---

Contributor:
Why are these dividers only present in ml-guide and mllib-guide but not others?

Contributor (author):
There is one footnote in mllib-guide.md with this PR. I should remove the one in ml-guide.md.

119 changes: 59 additions & 60 deletions docs/mllib-guide.md
@@ -5,21 +5,28 @@ displayTitle: Machine Learning Library (MLlib) Guide
description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
---

MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization primitives.
Guides for individual algorithms are listed below.
MLlib is Spark's machine learning (ML) library.
Its goal is to make practical machine learning scalable and easy.
It consists of common learning algorithms and utilities, including classification, regression,
clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization
primitives and higher-level pipeline APIs.

The API is divided into 2 parts:
It is divided into two packages:

* [The original `spark.mllib` API](mllib-guide.html#mllib-types-algorithms-and-utilities) is the primary API.
* [The "Pipelines" `spark.ml` API](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) is a higher-level API for constructing ML workflows.
* [`spark.mllib`](mllib-guide.html#mllib-types-algorithms-and-utilities) contains the original API
built on top of RDDs.
* [`spark.ml`](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) provides a higher-level API
built on top of DataFrames for constructing ML pipelines.

We list major functionality from both below, with links to detailed guides.
Using `spark.ml` is recommended because, with DataFrames, the API is more versatile and flexible.
However, we will keep supporting `spark.mllib` along with the development of `spark.ml`.
Users should feel comfortable using `spark.mllib` features and can expect more features to come.
Developers should contribute new algorithms to `spark.ml` if they fit the ML pipeline concept well,
e.g., feature extractors and transformers.

# MLlib types, algorithms and utilities
We list major functionality from both below, with links to detailed guides.

This lists functionality included in `spark.mllib`, the main MLlib API.
# spark.mllib: data types, algorithms, and utilities

* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)
@@ -56,71 +63,63 @@ This lists functionality included in `spark.mllib`, the main MLlib API.
* [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)
* [PMML model export](mllib-pmml-model-export.html)

MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

# spark.ml: high-level APIs for ML pipelines

Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.

*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.

Note that we will keep supporting and adding features to `spark.mllib` along with the
development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
to `spark.ml`.

Guides for `spark.ml` include:
**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major
concepts. It also contains sections on using algorithms within the Pipelines API, for example:

* **[spark.ml programming guide](ml-guide.html)**: overview of the Pipelines API and major concepts
* Guides on using algorithms within the Pipelines API:
* [Feature transformers](ml-features.html), including a few not in the lower-level `spark.mllib` API
* [Decision trees](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods](ml-linear-methods.html)
* [Feature extractors and transformers](ml-features.html)
* [Linear methods](ml-linear-methods.html)
* [Decision trees](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Artificial neural network](ml-ann.html)

Contributor:
This is referred to as "multilayer perceptron classifier" in ml-guide; we should be consistent in how we refer to it (I prefer MLP because ANN usually includes other flavors, e.g. convnets, RBMs). ml-ann.md will also have to be renamed if we make this change.

Contributor:
Should we just duplicate what's in ml-guide, since the content is identical with just differences in naming?

Contributor (author):
Okay. I tried to keep this list as a set of examples rather than a full list. It would be hard to keep two full lists in sync.


# Dependencies

MLlib uses the linear algebra package
[Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised
numerical processing. If natives are not available at runtime, you
will see a warning message and a pure JVM implementation will be used
instead.
MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
implementation will be used instead.

To learn more about the benefits and background of system optimised
natives, you may wish to watch Sam Halliday's ScalaX talk on
[High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/)).
Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
proxies by default.
To configure `netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
platform's additional installation instructions.
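For example, with sbt the extra artifact might be declared roughly like this (a sketch; the Spark artifact and versions shown are assumptions):

```scala
// build.sbt — add netlib-java's native proxies next to Spark MLlib
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib" % "1.5.0" % "provided",
  // the "all" artifact is pom-packaged, hence pomOnly()
  ("com.github.fommil.netlib" % "all" % "1.1.2").pomOnly()
)
```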

Due to licensing issues with runtime proprietary binaries, we do not
include `netlib-java`'s native proxies by default. To configure
`netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with
`-Pnetlib-lgpl`) as a dependency of your project and read the
[netlib-java](https://github.com/fommil/netlib-java) documentation for
your platform's additional installation instructions.
To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.

To use MLlib in Python, you will need [NumPy](http://www.numpy.org)
version 1.4 or newer.
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

---
# Migration guide

# Migration Guide
MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

## From 1.4 to 1.5

For the `spark.ml` package, please see the [spark.ml Migration Guide](ml-guide.html#migration-guide).
In the `spark.mllib` package, there are no breaking API changes, but there are several behavior changes:

## From 1.3 to 1.4
* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
`RegressionMetrics.explainedVariance` returns the average regression sum of squares.
* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` are now
sorted.
* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
convergence tolerance of `1e-3`, and hence iterations might end earlier than they did in 1.4.
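For instance, a run that should behave as in 1.4 (no early stopping) can disable the new tolerance explicitly — a minimal sketch, assuming an `RDD[LabeledPoint]` named `trainingData`:

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val lr = new LinearRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setConvergenceTol(0.0)  // 1.5 defaults to 1e-3, which can stop iterations early
val model = lr.run(trainingData)
```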

In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
In the `spark.ml` package, there is one breaking API change and one behavior change:

* Gradient-Boosted Trees
* *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issues for users who wrote their own losses for GBTs.
* *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
from `Params.setDefault` due to a
[Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
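A minimal sketch of using `isLargerBetter` when comparing two candidate models (assuming DataFrames `predictionsA` and `predictionsB` with prediction and label columns):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator().setMetricName("rmse")
val rmseA = evaluator.evaluate(predictionsA)
val rmseB = evaluator.evaluate(predictionsB)
// Respect the metric's ordering instead of assuming "larger is better":
val aIsBetter = if (evaluator.isLargerBetter) rmseA > rmseB else rmseA < rmseB
```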

## Previous Spark Versions
## Previous Spark versions

Earlier migration guides are archived [on this page](mllib-migration-guides.html).

---

Contributor:
Ditto on divider

30 changes: 30 additions & 0 deletions docs/mllib-migration-guides.md
@@ -7,6 +7,25 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT

The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).

## From 1.3 to 1.4

Contributor (author):
No content change here. Just moved the paragraphs from mllib-guide and ml-guide.


In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:

* Gradient-Boosted Trees
* *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
* *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
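
A short sketch of the `LDA.run` change (assuming `corpus` is an `RDD[(Long, Vector)]` of document IDs and term-count vectors):

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

val model = new LDA().setK(10).run(corpus)  // now typed as the abstract LDAModel
model match {
  case dist: DistributedLDAModel => dist.topicDistributions  // EM optimizer (the default)
  case other                     => other.describeTopics(5)  // e.g. a LocalLDAModel
}
```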

In the `spark.ml` package, several major API changes occurred, including:

* `Param` and other APIs for specifying parameters
* `uid` unique IDs for Pipeline components
* Reorganization of certain classes

Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
However, as of 1.4 `spark.ml` is no longer an alpha component, so we will provide details on any API
changes for future releases.

## From 1.2 to 1.3

In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
@@ -23,6 +42,17 @@ In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
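
For example, a ridge regression run tuned on 1.2 with `stepSize = 0.1` and `regParam = 0.01` (hypothetical values) would be reproduced on 1.3 roughly like this, assuming an `RDD[LabeledPoint]` named `trainingData`:

```scala
import org.apache.spark.mllib.regression.RidgeRegressionWithSGD

val model = RidgeRegressionWithSGD.train(
  trainingData,
  numIterations = 100,
  stepSize = 0.2,      // 0.1 * 2, because the squared loss is now halved
  regParam = 0.005)    // 0.01 / 2, for the same reason
```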

In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:

* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
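
A minimal before/after sketch of the implicit-conversion change above (assuming a `SparkContext` named `sc` and an `RDD` of case-class rows named `labeledPoints`):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Spark 1.2:  import sqlContext._
// Spark 1.3+:
import sqlContext.implicits._

val training = labeledPoints.toDF()  // the toDF conversion now comes from `implicits`
```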

Other changes were in `LogisticRegression`:

* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
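
Relating to the `probabilityCol` change above, a sketch of recovering the old scalar score from the new `Vector`-valued column (assuming `predictions` is a DataFrame produced by a fitted `spark.ml` `LogisticRegressionModel`):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Recover the old scalar "score" (probability of class 1.0) from the "probability" Vector.
val probOfClassOne = udf { p: Vector => p(1) }
val withScore = predictions.withColumn("score", probOfClassOne(col("probability")))
```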

## From 1.1 to 1.2

The only API changes in MLlib v1.2 are in