diff --git a/docs/README.md b/docs/README.md
index 75b1811ba99af..f1eb644f93406 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -14,9 +14,10 @@ The markdown code can be compiled to HTML using the
 [Jekyll tool](http://jekyllrb.com).
 To use the `jekyll` command, you will need to have Jekyll installed.
 The easiest way to do this is via a Ruby Gem, see the
-[jekyll installation instructions](http://jekyllrb.com/docs/installation).
-Compiling the site with Jekyll will create a directory called
-_site containing index.html as well as the rest of the compiled files.
+[jekyll installation instructions](http://jekyllrb.com/docs/installation).
+If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
+Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
+`_site` containing index.html as well as the rest of the compiled files.
 
 You can modify the default Jekyll build as follows:
 
@@ -44,6 +45,6 @@ You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PR
 
 Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
 
-When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
 
 NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 79b0061e2c5a1..162c415b5883e 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -181,7 +181,7 @@ The following table summarizes terms you'll see used to refer to cluster concept
       Distinguishes where the driver process runs. In "cluster" mode, the framework launches
       the driver inside of the cluster. In "client" mode, the submitter launches the driver outside
       of the cluster.</td>
-  </tr>
+  </tr>
   <tr>
     <td>Worker node</td>
     <td>Any node that can run application code in the cluster</td>
diff --git a/docs/configuration.md b/docs/configuration.md
index d6f316ba5f773..b94f94f3c0688 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -318,7 +318,7 @@ Apart from these, the following properties are also available, and may be useful
     When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
     objects to prevent writing redundant data, however that stops garbage collection of those
     objects. By calling 'reset' you flush that info from the serializer, and allow old
-    objects to be collected. To turn off this periodic reset set it to a value of <= 0.
+    objects to be collected. To turn off this periodic reset set it to a value <= 0.
     By default it will reset the serializer every 10,000 objects.
   </td>
 </tr>
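A note on the `configuration.md` hunk just above: `spark.serializer.objectStreamReset` is set like any other Spark property. The sketch below is illustrative only and is not part of the patch; the application name, the master URL, and the chosen value `-1` are assumptions (the hunk itself only says that a value <= 0 disables the periodic reset, and that the default resets every 10,000 objects).

```scala
import org.apache.spark.SparkConf

// Illustrative sketch only; not part of the patch.
val conf = new SparkConf()
  .setAppName("SerializerResetExample")  // hypothetical application name
  .setMaster("local[2]")                 // hypothetical master URL
  // The property applies to Spark's default Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  // Per the hunk above, any value <= 0 turns the periodic reset off;
  // by default the serializer is reset every 10,000 objects.
  .set("spark.serializer.objectStreamReset", "-1")
```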
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index ba67450089aa2..4f8b4b41d8538 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -95,7 +95,7 @@ The recursive tree construction is stopped at a node when one of the two conditi
 
 ### Practical limitations
 
-1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
+1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
    in memory for aggregating histograms over partitions. The current implementation might not scale
    to very deep trees since the memory requirement grows exponentially with tree depth.
 2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index fe132e2abf771..fe52cb20f8135 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -63,7 +63,7 @@ methods MLlib supports:
   <tbody>
     <tr>
      <td>hinge loss</td><td>$\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$</td>
-      <td>$\begin{cases}-y \cdot \x & \text{if $y \wv^T \x <1$}, \\ 0 &
+      <td>$\begin{cases}-y \cdot \x & \text{if $y \wv^T \x <1$}, \\ 0 &
       \text{otherwise}.\end{cases}$</td>
     </tr>
     <tr>
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index 531c7b22c4b5e..c47508b7daa2d 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -109,7 +109,7 @@ smoothing parameter `lambda` as input, and output a
 [NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
 used for evaluation and prediction.
 
-
+
 {% highlight python %}
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import NaiveBayes
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index b8d89cf00ffbd..e7ceaa22c3302 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -48,12 +48,12 @@ how to access a cluster. To create a `SparkContext` you first need to build a `S
 that contains information about your application.
 
 {% highlight scala %}
-val conf = new SparkConf().setAppName(<app name>).setMaster(<master>)
+val conf = new SparkConf().setAppName(appName).setMaster(master)
 new SparkContext(conf)
 {% endhighlight %}
 
-The `<master>` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
-to connect to, or a special "local" string to run in local mode, as described below. `<app name>` is
+The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
+to connect to, or a special "local" string to run in local mode, as described below. `appName` is
 a name for your application, which will be shown in the cluster web UI.
 It's also possible to set these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
 which avoids hard-coding the master name in your application.
@@ -81,9 +81,8 @@ The master URL passed to Spark can be in one of the following formats:
 <table class="table">
 <tr><th>Master URL</th><th>Meaning</th></tr>
 <tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
-<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
-</td>
-</tr>
+<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
+<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
 <tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
   cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
 </td></tr>
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 0c743c9d6010d..8a785450adfde 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -416,3 +416,4 @@ results = hiveCtx.hql("FROM src SELECT key, value").collect()
 
 {% endhighlight %}
 
+
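For context on the `scala-programming-guide.md` hunks above, here is a minimal, self-contained sketch of the `SparkConf` pattern and of the `local[*]` master URL that the patch documents. The application name, the toy job, and the choice of `local[*]` are illustrative assumptions, not part of the patch.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only; not part of the patch.
// "local[*]" is the master URL the patch adds to the table: run Spark locally
// with as many worker threads as there are logical cores. A standalone cluster
// would instead use something like "spark://HOST:7077".
val conf = new SparkConf()
  .setAppName("MyExampleApp")  // hypothetical name, shown in the cluster web UI
  .setMaster("local[*]")
val sc = new SparkContext(conf)

// Small smoke test so the snippet does something observable.
val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
println(s"even numbers: $evens")
sc.stop()
```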
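Finally, the `mllib-linear-methods.md` hunk touches the hinge-loss row of the loss-function table: the loss is max{0, 1 - y * w^T x} for y in {-1, +1}, and its sub-gradient is -y * x when y * w^T x < 1 and 0 otherwise. The stand-alone sketch below just evaluates those two formulas on plain arrays; it is illustrative and does not use MLlib's own vector types.

```scala
// Illustrative sketch only; not part of the patch.
// Hinge loss and its sub-gradient for a single (x, y) example, with the label
// y in {-1, +1} and w, x as plain dense arrays.
def dot(w: Array[Double], x: Array[Double]): Double =
  w.zip(x).map { case (wi, xi) => wi * xi }.sum

// L(w; x, y) = max{0, 1 - y * w^T x}
def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double =
  math.max(0.0, 1.0 - y * dot(w, x))

// Sub-gradient: -y * x if y * w^T x < 1, and 0 otherwise.
def hingeSubgradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] =
  if (y * dot(w, x) < 1.0) x.map(xi => -y * xi)
  else Array.fill(x.length)(0.0)

// Worked example: w = (0.5, -0.25), x = (1.0, 2.0), y = +1 gives w^T x = 0.0,
// so the margin is violated: loss = 1.0 and sub-gradient = (-1.0, -2.0).
val w = Array(0.5, -0.25)
val x = Array(1.0, 2.0)
println(hingeLoss(w, x, 1.0))                                  // 1.0
println(hingeSubgradient(w, x, 1.0).mkString("(", ", ", ")"))  // (-1.0, -2.0)
```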