SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs #653

Closed. Wants to merge 5 commits.

9 changes: 5 additions & 4 deletions docs/README.md
@@ -14,9 +14,10 @@ The markdown code can be compiled to HTML using the
[Jekyll tool](http://jekyllrb.com).
To use the `jekyll` command, you will need to have Jekyll installed.
The easiest way to do this is via a Ruby Gem, see the
-[jekyll installation instructions](http://jekyllrb.com/docs/installation).
-Compiling the site with Jekyll will create a directory called
-_site containing index.html as well as the rest of the compiled files.
+[jekyll installation instructions](http://jekyllrb.com/docs/installation).
+If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
+Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
+`_site` containing index.html as well as the rest of the compiled files.

You can modify the default Jekyll build as follows:

@@ -44,6 +45,6 @@ You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PR

Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.

-When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).

NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.
2 changes: 1 addition & 1 deletion docs/_config.yml
@@ -8,5 +8,5 @@ SPARK_VERSION_SHORT: 1.0.0
SCALA_BINARY_VERSION: "2.10"
SCALA_VERSION: "2.10.4"
MESOS_VERSION: 0.13.0
-SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
+SPARK_ISSUE_TRACKER_URL: https://issues.apache.org/jira/browse/SPARK
SPARK_GITHUB_URL: https://github.com/apache/spark
2 changes: 1 addition & 1 deletion docs/bagel-programming-guide.md
@@ -46,7 +46,7 @@ import org.apache.spark.bagel.Bagel._
Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.

{% highlight scala %}
-val input = sc.textFile("pagerank_data.txt")
+val input = sc.textFile("data/pagerank_data.txt")

val numVerts = input.count()

2 changes: 1 addition & 1 deletion docs/cluster-overview.md
@@ -181,7 +181,7 @@ The following table summarizes terms you'll see used to refer to cluster concept
<td>Distinguishes where the driver process runs. In "cluster" mode, the framework launches
the driver inside of the cluster. In "client" mode, the submitter launches the driver
outside of the cluster.</td>
-<tr>
+</tr>
<tr>
<td>Worker node</td>
<td>Any node that can run application code in the cluster</td>
10 changes: 5 additions & 5 deletions docs/configuration.md
@@ -26,10 +26,10 @@ application name), as well as arbitrary key-value pairs through the `set()` meth
initialize an application as follows:

{% highlight scala %}
-val conf = new SparkConf()
-.setMaster("local")
-.setAppName("My application")
-.set("spark.executor.memory", "1g")
+val conf = new SparkConf().
+setMaster("local").

Contributor: Shall we ask users to use `:paste` instead of putting `.` at the end of the line?

Member Author: Yeah, the problem is that it didn't work as-is, since the first line can be interpreted as a complete statement. I figured it's best if the snippets work as given, without additional commands or config. I reviewed most of the other Scala snippets in the docs here, and there were only a few cases like this.

Contributor: This is not for running in the REPL, where we don't really need to create a SparkConf. With the latest spark-submit, this line should change to `new SparkConf().setAppName("My application")`.

Member Author: Ah right, this snippet won't be pasted into a REPL, true. (I think the other case I saw is for the REPL, so it should have this kind of change.) But you're saying it can be simplified anyway to that one line? I can change that, but I wonder if the idea is just to show use of the setters; if so, I could revert the change, or just leave it for consistency with the other REPL-friendly snippet.

Contributor: In v1.0, the recommended way of launching an app is through spark-submit, where you set Spark configurations through command-line arguments. It is easier to switch masters than to hard-code `setMaster("local")`. Also, it works for YARN.

Member Author: Agree, although on re-reading I think the purpose of this snippet is to explain how one would invoke Spark programmatically via SparkConf (or else the whole thing should go away). It is something you might want to do in a Scala program, and might even want to pop into a Scala REPL (i.e. not spark-shell). I suggest leaving it; am I really off-base on that?

Contributor: If we remove `setMaster` and `set("spark.executor.memory", ...)`, then it fits on a single line. Those properties should be set with spark-submit.

Contributor: Hey @mengxr, we want to allow these properties to be set directly as well; some applications will use this. I think it might be good here to show both: first say that they can be set directly, and then explain that they can be set through arguments to spark-submit if you create a SparkContext with an empty conf.

./bin/spark-submit --name "My application" --master local --executor-memory 1g

+setAppName("My application").
+set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
{% endhighlight %}
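
As a minimal sketch of the alternative raised at the end of the review thread above (not part of this patch; the flag names come from the reviewer's own example), an application can leave its `SparkConf` empty and let `spark-submit` supply the master, app name, and executor memory:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]) {
    // Nothing is hard-coded here; the equivalent settings come from, e.g.:
    //   ./bin/spark-submit --name "My application" --master local --executor-memory 1g ...
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // ... application logic ...
    sc.stop()
  }
}
{% endhighlight %}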

@@ -318,7 +318,7 @@ Apart from these, the following properties are also available, and may be useful
When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
objects to prevent writing redundant data, however that stops garbage collection of those
objects. By calling 'reset' you flush that info from the serializer, and allow old
-objects to be collected. To turn off this periodic reset set it to a value of <= 0.
+objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
By default it will reset the serializer every 10,000 objects.
</td>
</tr>
20 changes: 10 additions & 10 deletions docs/java-programming-guide.md
@@ -55,7 +55,7 @@ classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
-etc (this acheives the "same-result-type" principle used by the [Scala collections
+etc (this achieves the "same-result-type" principle used by the [Scala collections
framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).

## Function Interfaces
@@ -102,7 +102,7 @@ the following changes:
`Function` classes will need to use `implements` rather than `extends`.
* Certain transformation functions now have multiple versions depending
on the return type. In Spark core, the map functions (`map`, `flatMap`, and
-`mapPartitons`) have type-specific versions, e.g.
+`mapPartitions`) have type-specific versions, e.g.
[`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
@@ -115,11 +115,11 @@ As an example, we will implement word count using the Java API.
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;

-JavaSparkContext sc = new JavaSparkContext(...);
-JavaRDD<String> lines = ctx.textFile("hdfs://...");
+JavaSparkContext jsc = new JavaSparkContext(...);
+JavaRDD<String> lines = jsc.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
-public Iterable<String> call(String s) {
+@Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
}
@@ -140,10 +140,10 @@ Here, the `FlatMapFunction` was created inline; another option is to subclass

{% highlight java %}
class Split extends FlatMapFunction<String, String> {
-public Iterable<String> call(String s) {
+@Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
-);
+}
JavaRDD<String> words = lines.flatMap(new Split());
{% endhighlight %}

@@ -162,8 +162,8 @@ Continuing with the word count example, we map each word to a `(word, 1)` pair:
import scala.Tuple2;
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
-public Tuple2<String, Integer> call(String s) {
-return new Tuple2(s, 1);
+@Override public Tuple2<String, Integer> call(String s) {
+return new Tuple2<String, Integer>(s, 1);
}
}
);
@@ -178,7 +178,7 @@ occurrences of each word:
{% highlight java %}
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>() {
-public Integer call(Integer i1, Integer i2) {
+@Override public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}
14 changes: 9 additions & 5 deletions docs/mllib-basics.md
@@ -9,7 +9,7 @@ title: <a href="mllib-guide.html">MLlib</a> - Basics
MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
In the current implementation, local vectors and matrices are simple data models
-to serve public interfaces. The underly linear algebra operations are provided by
+to serve public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
A training example used in supervised learning is called "labeled point" in MLlib.
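
As a minimal sketch of the data models just described (not part of this patch; it assumes the MLlib 1.0 `Vectors` and `LabeledPoint` APIs), a local vector can be wrapped with a label to form a training example:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A local dense vector backed by a double array.
val features = Vectors.dense(1.0, 0.0, 3.0)
// A labeled point pairs a training label with a feature vector.
val example = LabeledPoint(1.0, features)
{% endhighlight %}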

@@ -205,7 +205,7 @@ import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.rdd.RDDimport;

-RDD[LabeledPoint] training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
+RDD<LabeledPoint> training = MLUtils.loadLibSVMData(jsc, "mllib/data/sample_libsvm_data.txt");
{% endhighlight %}
</div>
</div>
@@ -307,6 +307,7 @@ A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.R
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.

{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

@@ -348,10 +349,10 @@ val mat: RowMatrix = ... // a RowMatrix
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
-println(summary.numNonzers) // number of nonzeros in each column
+println(summary.numNonzeros) // number of nonzeros in each column

// Compute the covariance matrix.
-val Cov: Matrix = mat.computeCovariance()
+val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>
</div>
@@ -397,11 +398,12 @@ wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `Row
its row indices.

{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

-JavaRDD[IndexedRow] rows = ... // a JavaRDD of indexed rows
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());

@@ -458,7 +460,9 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a
with sparse rows by calling `toIndexedRowMatrix`.

{% highlight scala %}
+import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
4 changes: 2 additions & 2 deletions docs/mllib-clustering.md
@@ -18,7 +18,7 @@ models are trained for each cluster).
MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
the most commonly used clustering algorithms that clusters the data points into
-predfined number of clusters. The MLlib implementation includes a parallelized
+predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
The implementation in MLlib has the following parameters:
@@ -30,7 +30,7 @@ initialization via k-means\|\|.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
-* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
+* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
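
As a hedged sketch of how the parameters listed above map onto the API (not part of this patch; it assumes the MLlib 1.0 `KMeans` builder and its setters), the k-means\|\| variant can be configured explicitly:

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy data with two obvious clusters; sc is an existing SparkContext.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.5),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.5, 8.5)))

val model = new KMeans().
  setK(2).
  setMaxIterations(20).
  setRuns(5).
  setInitializationMode(KMeans.K_MEANS_PARALLEL).
  setInitializationSteps(5).
  setEpsilon(1e-4).
  run(data)
{% endhighlight %}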

## Examples
2 changes: 1 addition & 1 deletion docs/mllib-collaborative-filtering.md
@@ -77,7 +77,7 @@ val ratesAndPreds = ratings.map{
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
-}.reduce(_ + _)/ratesAndPreds.count
+}.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}
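
The change above swaps a hand-rolled sum-divided-by-count for `mean()`. As a small self-contained sketch of the pattern (not part of this patch; `mean()` comes from the implicit conversion to `DoubleRDDFunctions`), the two expressions compute the same value, but `mean()` needs only a single pass over the data:

{% highlight scala %}
import org.apache.spark.SparkContext._  // implicits, including mean() on RDD[Double]

// Hypothetical (rating, prediction) pairs, purely for illustration.
val pairs = sc.parallelize(Seq((3.0, 2.5), (1.0, 1.5), (4.0, 4.0)))
val squaredErrors = pairs.map { case (r1, r2) => math.pow(r1 - r2, 2) }

val mseViaMean   = squaredErrors.mean()
val mseViaReduce = squaredErrors.reduce(_ + _) / squaredErrors.count
// mseViaMean == mseViaReduce, but mean() avoids the extra count() pass.
{% endhighlight %}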

8 changes: 4 additions & 4 deletions docs/mllib-decision-tree.md
@@ -83,19 +83,19 @@ Section 9.2.4 in
[Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
details). For example, for a binary classification problem with one categorical feature with three
categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
-features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
+features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
and A , B \| C where \| denotes the split.

### Stopping rule

The recursive tree construction is stopped at a node when one of the two conditions is met:

-1. The node depth is equal to the `maxDepth` training parammeter
+1. The node depth is equal to the `maxDepth` training parameter
2. No split candidate leads to an information gain at the node.

### Practical limitations

-1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
+1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
in memory for aggregating histograms over partitions. The current implementation might not scale
to very deep trees since the memory requirement grows exponentially with tree depth.
2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
@@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
-val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
+val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}
</div>
7 changes: 7 additions & 0 deletions docs/mllib-dimensionality-reduction.md
@@ -44,6 +44,10 @@ say, less than $1000$, but many rows, which we call *tall-and-skinny*.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.SingularValueDecomposition

val mat: RowMatrix = ...

// Compute the top 20 singular values and corresponding singular vectors.
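// Sketch (not part of this patch) of how this example continues, assuming the
// MLlib 1.0 RowMatrix.computeSVD API:
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)
val U: RowMatrix = svd.U // left singular vectors, as a distributed RowMatrix
val s = svd.s            // singular values, in descending order
val V: Matrix = svd.V    // right singular vectors, a local dense matrix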
@@ -74,6 +78,9 @@ and use them to project the vectors into a low-dimensional space.
The number of columns should be small, e.g, less than 1000.

{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ...

// Compute the top 10 principal components.
Expand Down
2 changes: 1 addition & 1 deletion docs/mllib-guide.md
@@ -94,7 +94,7 @@ import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

double[] array = ... // a double array
-Vector vector = Vectors.dense(array) // a dense vector
+Vector vector = Vectors.dense(array); // a dense vector
{% endhighlight %}

[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to
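
As a hedged Scala illustration of the kind of factory methods referred to just above (not part of this patch; it assumes the MLlib 1.0 `Vectors` API), a sparse vector is built from a size plus parallel index and value arrays:

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sparse vector of size 3 with nonzeros 1.0 at index 0 and 3.0 at index 2.
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
{% endhighlight %}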
13 changes: 7 additions & 6 deletions docs/mllib-linear-methods.md
@@ -63,7 +63,7 @@ methods MLlib supports:
<tbody>
<tr>
<td>hinge loss</td><td>$\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$</td>
-<td>$\begin{cases}-y \cdot \x & \text{if $y \wv^T \x <1$}, \\ 0 &
+<td>$\begin{cases}-y \cdot \x &amp; \text{if $y \wv^T \x &lt;1$}, \\ 0 &amp;
\text{otherwise}.\end{cases}$</td>
</tr>
<tr>
@@ -225,10 +225,11 @@ algorithm for 200 iterations.
import org.apache.spark.mllib.optimization.L1Updater

val svmAlg = new SVMWithSGD()
-svmAlg.optimizer.setNumIterations(200)
-.setRegParam(0.1)
-.setUpdater(new L1Updater)
-val modelL1 = svmAlg.run(parsedData)
+svmAlg.optimizer.
+setNumIterations(200).
+setRegParam(0.1).
+setUpdater(new L1Updater)
+val modelL1 = svmAlg.run(training)
{% endhighlight %}

Similarly, you can use replace `SVMWithSGD` by
@@ -322,7 +323,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
-val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.reduce(_ + _) / valuesAndPreds.count
+val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}
