[SPARK-10348] [MLLIB] updates ml-guide
* replace `ML Dataset` by `DataFrame` to unify the abstraction
* ML algorithms -> pipeline components to describe the main concept
* remove Scala API doc links from the main guide
* `Section Title` -> `Section title` to be consistent with other section titles in MLlib guide
* modified lines to break at 100 chars or periods

jkbradley feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8517 from mengxr/SPARK-10348.
mengxr committed Aug 30, 2015
1 parent 13f5f8e commit 905fbe4
Showing 2 changed files with 78 additions and 52 deletions.
118 changes: 72 additions & 46 deletions docs/ml-guide.md
The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of
[DataFrames](sql-programming-guide.html#dataframes) that help users create and tune practical
machine learning pipelines.
See the [algorithm guides](#algorithm-guides) section below for guides on sub-packages of
`spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.

**Table of contents**

* This will become a table of contents (this text will be scraped).
{:toc}

# Main concepts

Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple
algorithms into a single pipeline, or workflow.
This section covers the key concepts introduced by the Spark ML API, where the pipeline concept is
mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.

* **[`DataFrame`](ml-guide.html#dataframe)**: Spark ML uses `DataFrame` from Spark SQL as an ML
dataset, which can hold a variety of data types.
E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.

* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.

* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.

* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.

* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.

## DataFrame

Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
Spark ML adopts the `DataFrame` from Spark SQL in order to support a variety of data types.

`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) for a list of supported types.
In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.

A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.

Columns in a `DataFrame` are named. The code examples below use names such as "text," "features," and "label."
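
For instance, a minimal sketch in Scala (Spark 1.5-era API; assumes an existing `SparkContext` named `sc`, and the data values are illustrative):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Create a DataFrame from a local sequence; toDF assigns the column names.
val training = sqlContext.createDataFrame(Seq(
  ("spark is great", 1.0),
  ("hadoop mapreduce", 0.0)
)).toDF("text", "label")

training.printSchema()  // two named columns: "text" and "label"
```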

## Pipeline components

### Transformers

A `Transformer` is an abstraction that includes feature transformers and learned models.

> **jaceklaskowski** (Contributor) commented on Sep 1, 2015: *learned*, not *trained*? Are they interchangeable?

Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into
another, generally by appending one or more columns.
For example:

* A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new
column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
* A learning model might take a `DataFrame`, read the column containing feature vectors, predict the
label for each feature vector, and output a new `DataFrame` with predicted labels appended as a
column.
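
As a concrete sketch of the first case, `Tokenizer` is a feature transformer shipped with `spark.ml` (assuming a `DataFrame` `df` with a "text" column):

```scala
import org.apache.spark.ml.feature.Tokenizer

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// transform() returns a new DataFrame with a "words" column appended.
val tokenized = tokenizer.transform(df)
```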

### Estimators

An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on
data.
Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a
`Model`, which is a `Transformer`.
For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling
`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
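
A minimal sketch, assuming a `DataFrame` `training` with "features" and "label" columns:

```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(10)

// fit() returns a LogisticRegressionModel, which is a Transformer.
val model = lr.fit(training)
```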

### Properties of pipeline components

`Transformer.transform()`s and `Estimator.fit()`s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).

## Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data.
E.g., a simple text document processing workflow might include several stages:

* Split each document's text into words.
* Convert each document's words into a numerical feature vector.
* Learn a prediction model using the feature vectors and labels.

Spark ML represents such a workflow as a `Pipeline`, which consists of a sequence of
`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
We will use this simple workflow as a running example in this section.

### How it works

A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.

We illustrate this for the simple text document workflow. The figure below is for the *training time* usage of a `Pipeline`.

<p style="text-align: center;">
  <img src="img/ml-Pipeline.png" title="Spark ML Pipeline Example" alt="Spark ML Pipeline Example" width="80%" />
</p>
Above, the top row represents a `Pipeline` with three stages.
The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()`
method on the `DataFrame` before passing the `DataFrame` to the next stage.
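
The training-time usage above can be sketched as follows (reusing the `training` `DataFrame` with "text" and "label" columns from the earlier sketch; the column names are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() runs the stages in order and returns a PipelineModel (a Transformer).
val model = pipeline.fit(training)
```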

A `Pipeline` is an `Estimator`.
Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
`Transformer`.
This `PipelineModel` is used at *test time*; the figure below illustrates this usage.

<p style="text-align: center;">
  <img src="img/ml-PipelineModel.png" title="Spark ML PipelineModel" alt="Spark ML PipelineModel" width="80%" />
</p>

In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed
through the fitted pipeline in order.
Each stage's `transform()` method updates the dataset and passes it to the next stage.
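
Continuing the sketch above, test-time usage is a single `transform()` call (assuming a `test` `DataFrame` with a "text" column):

```scala
// Each stage's transform() runs in order inside the fitted pipeline.
val predictions = model.transform(test)
predictions.select("text", "prediction").show()
```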

`Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.

*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array. The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage. It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the `Pipeline` forms a DAG, then the stages must be specified in topological order.

*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use
compile-time type checking.
`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.
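
For example, one can inspect the schema a `Pipeline` will validate against (a sketch, using the `training` `DataFrame` from earlier):

```scala
training.printSchema()
// prints something like:
// root
//  |-- text: string (nullable = true)
//  |-- label: double (nullable = false)
```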

## Parameters

Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying parameters.

A `Param` is a named parameter with self-contained documentation.
A `ParamMap` is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

1. Set parameters for an instance. E.g., if `lr` is an instance of `LogisticRegression`, one could
call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
This API resembles the API used in the `spark.mllib` package.
2. Pass a `ParamMap` to `fit()` or `transform()`. Any parameters in the `ParamMap` will override parameters previously specified via setter methods.

Parameters belong to specific instances of `Estimator`s and `Transformer`s.
For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
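
Both mechanisms can be sketched together (assuming the `training` `DataFrame` from earlier):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()
lr.setMaxIter(10)  // 1. set a parameter on the instance

// 2. a ParamMap passed to fit() overrides the setter above.
val paramMap = ParamMap(lr.maxIter -> 20)
val model = lr.fit(training, paramMap)  // trains with maxIter = 20
```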

# Algorithm guides

There are now several algorithms in the Pipelines API which are not in the `spark.mllib` API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.

**Pipelines API Algorithm Guides**

* [Feature extraction, transformation, and selection](ml-features.html)
* [Decision trees for classification and regression](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods with elastic net regularization](ml-linear-methods.html)
* [Multilayer perceptron classifier](ml-ann.html)

# Code examples

This section gives code examples illustrating the functionality discussed above.
For more info, please refer to the API documentation
([Scala](api/scala/index.html#org.apache.spark.ml.package),
[Java](api/java/org/apache/spark/ml/package-summary.html),
and [Python](api/python/pyspark.ml.html)).
Some Spark ML algorithms are wrappers for `spark.mllib` algorithms, and the
[MLlib programming guide](mllib-guide.html) has details on specific algorithms.

## Example: Estimator, Transformer, and Param
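
A condensed sketch of this example (Spark 1.5-era API; the data values are illustrative, and an existing `SparkContext` `sc` is assumed):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Prepare training data as (label, features) pairs.
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")

// Estimator: configure via setters, then fit to get a Model (a Transformer).
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)

// Param: a ParamMap passed to fit() overrides previously set parameters.
val paramMap = ParamMap(lr.maxIter -> 20).put(lr.regParam, 0.1)
val model2 = lr.fit(training, paramMap)

// Transformer: transform() appends prediction columns.
model2.transform(training)
  .select("features", "label", "prediction")
  .show()
```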

## Example: model selection via cross-validation

An important task in ML is *model selection*, or using data to find the best model or parameters for a given task. This is also called *tuning*.
`Pipeline`s facilitate model selection by making it easy to tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
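
A minimal sketch (reusing the `pipeline`, `hashingTF`, `lr`, and `training` names from the earlier pipeline sketch):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Grid of parameter combinations to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)  // tune the whole Pipeline at once
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Picks the best ParamMap and refits on the full training data.
val cvModel = cv.fit(training)
```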

## Example: model selection via train validation split

In addition to `CrossValidator`, Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
`TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in
the case of `CrossValidator`. It is therefore less expensive,
but will not produce as reliable results when the training dataset is not sufficiently large.

`TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in the `estimatorParamMaps` parameter,
and an `Evaluator`.
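
A minimal sketch, reusing the `pipeline`, `paramGrid`, and `training` names from the cross-validation sketch above:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.TrainValidationSplit

val tvs = new TrainValidationSplit()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)  // 80% for training, 20% for validation

// Each parameter combination is evaluated once on the validation split.
val tvsModel = tvs.fit(training)
```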
12 changes: 6 additions & 6 deletions docs/mllib-guide.md
It divides into two packages:

* [`spark.mllib`](mllib-guide.html#mllib-types-algorithms-and-utilities) contains the original API
built on top of [RDDs](programming-guide.html#resilient-distributed-datasets-rdds).
* [`spark.ml`](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) provides higher-level API
built on top of [DataFrames](sql-programming-guide.html#dataframes) for constructing ML pipelines.

Using `spark.ml` is recommended because, with DataFrames, the API is more versatile and flexible.
But we will keep supporting `spark.mllib` along with the development of `spark.ml`.
We list major functionality from both below, with links to detailed guides.
* [FP-growth](mllib-frequent-pattern-mining.html#fp-growth)
* [association rules](mllib-frequent-pattern-mining.html#association-rules)
* [PrefixSpan](mllib-frequent-pattern-mining.html#prefix-span)
* [Evaluation metrics](mllib-evaluation-metrics.html)
* [PMML model export](mllib-pmml-model-export.html)
* [Optimization (developer)](mllib-optimization.html)
* [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd)
* [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)

# spark.ml: high-level APIs for ML pipelines

**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major
concepts. It also contains sections on using algorithms within the Pipelines API, for example:

* [Feature extraction, transformation, and selection](ml-features.html)
* [Decision trees for classification and regression](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods with elastic net regularization](ml-linear-methods.html)
* [Multilayer perceptron classifier](ml-ann.html)
