
[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.

## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

Author: Shashwat Anand <me@shashwat.me>

Closes #20336 from ashashwat/SPARK-23165.

(cherry picked from commit 84a076e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
ashashwat authored and gatorsmile committed Jan 20, 2018
1 parent 0cde521 commit e11d5eaf79ffccbe3a5444a5b9ecf3a203e1fc90
@@ -180,10 +180,10 @@ under the path, not the number of *new* files, so it can become a slow operation
The size of the window needs to be set to handle this.

1. Files only appear in an object store once they are completely written; there
-is no need for a worklow of write-then-rename to ensure that files aren't picked up
+is no need for a workflow of write-then-rename to ensure that files aren't picked up
while they are still being written. Applications can write straight to the monitored directory.

-1. Streams should only be checkpointed to an store implementing a fast and
+1. Streams should only be checkpointed to a store implementing a fast and
atomic `rename()` operation. Otherwise the checkpointing may be slow and potentially unreliable.

## Further Reading
@@ -79,7 +79,7 @@ Then, you can supply configuration values at runtime:
{% endhighlight %}

The Spark shell and [`spark-submit`](submitting-applications.html)
-tool support two ways to load configurations dynamically. The first are command line options,
+tool support two ways to load configurations dynamically. The first is command line options,
such as `--master`, as shown above. `spark-submit` can accept any Spark property using the `--conf`
flag, but uses special flags for properties that play a part in launching the Spark application.
Running `./bin/spark-submit --help` will show the entire list of these options.
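As a loose illustration of the flag handling described above, a plain-Python sketch (hypothetical helper, not Spark's actual launcher code) of how `--conf key=value` pairs and a special flag like `--master` fold into one configuration map:

```python
# Simplified sketch, NOT Spark's parser: fold spark-submit-style
# arguments into a single property map.
def parse_spark_args(args):
    conf = {}
    it = iter(args)
    for arg in it:
        if arg == "--conf":
            # --conf takes a "key=value" pair; split on the first "=".
            key, _, value = next(it).partition("=")
            conf[key] = value
        elif arg == "--master":
            # Special flags map onto ordinary Spark properties.
            conf["spark.master"] = next(it)
    return conf

print(parse_spark_args(["--master", "local[2]",
                        "--conf", "spark.executor.memory=2g"]))
```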
@@ -413,7 +413,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
Enable profiling in Python worker. The profile result will show up by <code>sc.show_profiles()</code>,
-or it will be displayed before the driver exiting. It also can be dumped into disk by
+or it will be displayed before the driver exits. It also can be dumped into disk by
<code>sc.dump_profiles(path)</code>. If some of the profile results have been displayed manually,
they will not be displayed automatically before the driver exits.

@@ -446,7 +446,7 @@ Apart from these, the following properties are also available, and may be useful
<td>true</td>
<td>
Reuse Python worker or not. If yes, it will use a fixed number of Python workers,
-does not need to fork() a Python process for every tasks. It will be very useful
+does not need to fork() a Python process for every task. It will be very useful
if there is a large broadcast, then the broadcast will not need to be transferred
from JVM to Python worker for every task.
</td>
@@ -1294,7 +1294,7 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.files.openCostInBytes</code></td>
<td>4194304 (4 MB)</td>
<td>
-The estimated cost to open a file, measured by the number of bytes could be scanned in the same
+The estimated cost to open a file, measured by the number of bytes could be scanned at the same
time. This is used when putting multiple files into a partition. It is better to over-estimate;
then the partitions with small files will be faster than partitions with bigger files.
</td>
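The packing heuristic described above can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation; the 128 MB per-partition budget is an assumed figure:

```python
# Rough sketch, NOT Spark's code: each file is charged its size plus a
# fixed "open cost", and files are packed greedily into partitions up
# to a target byte budget.
OPEN_COST = 4 * 1024 * 1024    # default of spark.files.openCostInBytes
TARGET = 128 * 1024 * 1024     # assumed per-partition budget

def pack(file_sizes, open_cost=OPEN_COST, target=TARGET):
    partitions, current, used = [], [], 0
    for size in file_sizes:
        # Over-estimating via the open cost keeps partitions full of
        # small files from becoming slower than big-file partitions.
        cost = size + open_cost
        if current and used + cost > target:
            partitions.append(current)
            current, used = [], 0
        current.append(size)
        used += cost
    if current:
        partitions.append(current)
    return partitions
```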
@@ -1855,8 +1855,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.user.groups.mapping</code></td>
<td><code>org.apache.spark.security.ShellBasedGroupsMappingProvider</code></td>
<td>
-The list of groups for a user are determined by a group mapping service defined by the trait
-org.apache.spark.security.GroupMappingServiceProvider which can configured by this property.
+The list of groups for a user is determined by a group mapping service defined by the trait
+org.apache.spark.security.GroupMappingServiceProvider which can be configured by this property.
A default unix shell based implementation is provided <code>org.apache.spark.security.ShellBasedGroupsMappingProvider</code>
which can be specified to resolve a list of groups for a user.
<em>Note:</em> This implementation supports only a Unix/Linux based environment. Windows environment is
@@ -2465,7 +2465,7 @@ should be included on Spark's classpath:

The location of these configuration files varies across Hadoop versions, but
a common location is inside of `/etc/hadoop/conf`. Some tools create
-configurations on-the-fly, but offer a mechanisms to download copies of them.
+configurations on-the-fly, but offer a mechanism to download copies of them.

To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/conf/spark-env.sh`
to a location containing the configuration files.
@@ -708,7 +708,7 @@ messages remaining.
> messaging function. These constraints allow additional optimization within GraphX.
The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note: to avoid stackOverflowError due to long lineage chains, pregel support periodcally
+of its implementation (note: to avoid stackOverflowError due to long lineage chains, pregel support periodically
checkpoint graph and messages by setting "spark.graphx.pregel.checkpointInterval" to a positive number,
say 10. And set checkpoint directory as well using SparkContext.setCheckpointDir(directory: String)):
@@ -928,7 +928,7 @@ switch to 2D-partitioning or other heuristics included in GraphX.
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
-Once the edges have be partitioned the key challenge to efficient graph-parallel computation is
+Once the edges have been partitioned the key challenge to efficient graph-parallel computation is
efficiently joining vertex attributes with the edges. Because real-world graphs typically have more
edges than vertices, we move vertex attributes to the edges. Because not all partitions will
contain edges adjacent to all vertices we internally maintain a routing table which identifies where
@@ -118,7 +118,7 @@ The history server can be configured as follows:
<td>
The number of applications to retain UI data for in the cache. If this cap is exceeded, then
the oldest applications will be removed from the cache. If an application is not in the cache,
-it will have to be loaded from disk if its accessed from the UI.
+it will have to be loaded from disk if it is accessed from the UI.
</td>
</tr>
<tr>
@@ -407,7 +407,7 @@ can be identified by their `[attempt-id]`. In the API listed below, when running
</tr>
</table>
-The number of jobs and stages which can retrieved is constrained by the same retention
+The number of jobs and stages which can be retrieved is constrained by the same retention
mechanism of the standalone Spark UI; `"spark.ui.retainedJobs"` defines the threshold
value triggering garbage collection on jobs, and `spark.ui.retainedStages` that for stages.
Note that the garbage collection takes place on playback: it is possible to retrieve
@@ -422,10 +422,10 @@ These endpoints have been strongly versioned to make it easier to develop applic
* Individual fields will never be removed for any given endpoint
* New endpoints may be added
* New fields may be added to existing endpoints
-* New versions of the api may be added in the future at a separate endpoint (eg., `api/v2`). New versions are *not* required to be backwards compatible.
+* New versions of the api may be added in the future as a separate endpoint (eg., `api/v2`). New versions are *not* required to be backwards compatible.
* Api versions may be dropped, but only after at least one minor release of co-existing with a new api version.

-Note that even when examining the UI of a running applications, the `applications/[app-id]` portion is
+Note that even when examining the UI of running applications, the `applications/[app-id]` portion is
still required, though there is only one application available. Eg. to see the list of jobs for the
running app, you would go to `http://localhost:4040/api/v1/applications/[app-id]/jobs`. This is to
keep the paths consistent in both modes.
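The path layout described here can be shown with a tiny helper (hypothetical, not part of Spark):

```python
# Hypothetical helper: build the versioned REST path described above.
# The applications/[app-id] segment is required even when only one
# application is running.
def jobs_endpoint(base, app_id, version="v1"):
    return f"{base}/api/{version}/applications/{app_id}/jobs"

print(jobs_endpoint("http://localhost:4040", "app-1"))
```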
@@ -67,7 +67,7 @@ res3: Long = 15
./bin/pyspark


-Or if PySpark is installed with pip in your current enviroment:
+Or if PySpark is installed with pip in your current environment:

pyspark

@@ -156,7 +156,7 @@ One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can i
>>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
{% endhighlight %}
-Here, we use the `explode` function in `select`, to transfrom a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: "word" and "count". To collect the word counts in our shell, we can call `collect`:
+Here, we use the `explode` function in `select`, to transform a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: "word" and "count". To collect the word counts in our shell, we can call `collect`:

{% highlight python %}
>>> wordCounts.collect()
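The split/explode/groupBy/count pipeline above needs a live SparkSession; the same per-word counting can be sketched in plain Python (an analogy only, with `Counter` standing in for `groupBy("word").count()` and sample lines assumed):

```python
# Plain-Python analogy of the PySpark word count above, NOT Spark code.
import re
from collections import Counter

lines = ["a b b", "c a"]  # assumed sample input lines

# Mirrors explode(split(value, "\s+")): one row per word.
words = [w for line in lines for w in re.split(r"\s+", line) if w]

# Mirrors groupBy("word").count(): (word, count) pairs.
word_counts = Counter(words)
print(word_counts)
```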
@@ -422,7 +422,7 @@ $ YOUR_SPARK_HOME/bin/spark-submit \
Lines with a: 46, Lines with b: 23
{% endhighlight %}
-If you have PySpark pip installed into your enviroment (e.g., `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided 'spark-submit' as you prefer.
+If you have PySpark pip installed into your environment (e.g., `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided 'spark-submit' as you prefer.
{% highlight bash %}
# Use the Python interpreter to run your application
@@ -154,7 +154,7 @@ can find the results of the driver from the Mesos Web UI.
To use cluster mode, you must start the `MesosClusterDispatcher` in your cluster via the `sbin/start-mesos-dispatcher.sh` script,
passing in the Mesos master URL (e.g: mesos://host:5050). This starts the `MesosClusterDispatcher` as a daemon running on the host.

-By setting the Mesos proxy config property (requires mesos version >= 1.4), `--conf spark.mesos.proxy.baseURL=http://localhost:5050` when launching the dispacther, the mesos sandbox URI for each driver is added to the mesos dispatcher UI.
+By setting the Mesos proxy config property (requires mesos version >= 1.4), `--conf spark.mesos.proxy.baseURL=http://localhost:5050` when launching the dispatcher, the mesos sandbox URI for each driver is added to the mesos dispatcher UI.

If you would like to run the `MesosClusterDispatcher` with Marathon, you need to run the `MesosClusterDispatcher` in the foreground (i.e: `bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher`). Note that the `MesosClusterDispatcher` does not yet support multiple instances for HA.

@@ -445,7 +445,7 @@ To use a custom metrics.properties for the application master and executors, upd
<code>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</code> should be
configured in yarn-site.xml.
This feature can only be used with Hadoop 2.6.4+. The Spark log4j appender needs be changed to use
-FileAppender or another appender that can handle the files being removed while its running. Based
+FileAppender or another appender that can handle the files being removed while it is running. Based
on the file name configured in the log4j configuration (like spark.log), the user should set the
regex (spark*) to include all the log files that need to be aggregated.
</td>
@@ -62,7 +62,7 @@ component-specific configuration namespaces used to override the default setting
</tr>
</table>

The full breakdown of available SSL options can be found on the [configuration page](configuration.html).
SSL must be configured on each node and configured for each component involved in communication using the particular protocol.

### YARN mode
@@ -1253,7 +1253,7 @@ provide a ClassTag.
(Note that this is different than the Spark SQL JDBC server, which allows other applications to
run queries using Spark SQL).
-To get started you will need to include the JDBC driver for you particular database on the
+To get started you will need to include the JDBC driver for your particular database on the
spark classpath. For example, to connect to postgres from the Spark Shell you would run the
following command:
@@ -1793,7 +1793,7 @@ options.
- Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string regardless of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`.
- Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string regardless of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`.

-- Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible (instead of returning NULL). This is compliant to SQL ANSI 2011 specification and Hive's new behavior introduced in Hive 2.2 (HIVE-15331). This involves the following changes
+- Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible (instead of returning NULL). This is compliant with SQL ANSI 2011 specification and Hive's new behavior introduced in Hive 2.2 (HIVE-15331). This involves the following changes
- The rules to determine the result type of an arithmetic operation have been updated. In particular, if the precision / scale needed are out of the range of available values, the scale is reduced up to 6, in order to prevent the truncation of the integer part of the decimals. All the arithmetic operations are affected by the change, ie. addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), remainder (`%`) and positive module (`pmod`).
- Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them.
- The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark uses previous rules, ie. it doesn't adjust the needed scale to represent the values and it returns NULL if an exact representation of the value is not possible.
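As a loose analogy in plain Python (not Spark's decimal rules), the standard `decimal` module shows the same round-rather-than-fail behavior when an exact result would exceed the available precision; the precision of 6 is an assumed value chosen to force rounding:

```python
# Plain-Python analogy, NOT Spark: with a bounded precision, an exact
# product needing more digits is rounded instead of becoming NULL/error.
from decimal import Decimal, getcontext

getcontext().prec = 6  # assumed small precision to force rounding

# Exact product is 1.3717319616 (11 significant digits).
product = Decimal("1.23456") * Decimal("1.11111")
print(product)  # rounded to 6 significant digits
```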
@@ -1821,7 +1821,7 @@ options.
transformations (e.g., `map`, `filter`, and `groupByKey`) and untyped transformations (e.g.,
`select` and `groupBy`) are available on the Dataset class. Since compile-time type-safety in
Python and R is not a language feature, the concept of Dataset does not apply to these languages’
-APIs. Instead, `DataFrame` remains the primary programing abstraction, which is analogous to the
+APIs. Instead, `DataFrame` remains the primary programming abstraction, which is analogous to the
single-node data frame notion in these languages.

- Dataset and DataFrame API `unionAll` has been deprecated and replaced by `union`
@@ -1997,7 +1997,7 @@ Java and Python users will need to update their code.

Prior to Spark 1.3 there were separate Java compatible classes (`JavaSQLContext` and `JavaSchemaRDD`)
that mirrored the Scala API. In Spark 1.3 the Java API and Scala API have been unified. Users
-of either language should use `SQLContext` and `DataFrame`. In general theses classes try to
+of either language should use `SQLContext` and `DataFrame`. In general these classes try to
use types that are usable from both languages (i.e. `Array` instead of language specific collections).
In some cases where no common type exists (e.g., for passing in closures or Maps) function overloading
is used instead.
@@ -42,7 +42,7 @@ Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code>
The main category of parameters that should be configured is the authentication parameters
required by Keystone.

The following table contains a list of Keystone mandatory parameters. <code>PROVIDER</code> can be
any (alphanumeric) name.

<table class="table">
@@ -74,7 +74,7 @@ import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
-// The master requires 2 cores to prevent from a starvation scenario.
+// The master requires 2 cores to prevent a starvation scenario.

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
@@ -172,7 +172,7 @@ each line will be split into multiple words and the stream of words is represent
`words` DStream. Note that we defined the transformation using a
[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
As we will discover along the way, there are a number of such convenience classes in the Java API
-that help defines DStream transformations.
+that help define DStream transformations.
Next, we want to count these words.
@@ -125,7 +125,7 @@ df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
### Creating a Kafka Source for Batch Queries
If you have a use case that is better suited to batch processing,
-you can create an Dataset/DataFrame for a defined range of offsets.
+you can create a Dataset/DataFrame for a defined range of offsets.
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -597,7 +597,7 @@ Note that the following Kafka params cannot be set and the Kafka source or sink
- **key.serializer**: Keys are always serialized with ByteArraySerializer or StringSerializer. Use
DataFrame operations to explicitly serialize the keys into either strings or byte arrays.
- **value.serializer**: values are always serialized with ByteArraySerializer or StringSerializer. Use
-DataFrame oeprations to explicitly serialize the values into either strings or byte arrays.
+DataFrame operations to explicitly serialize the values into either strings or byte arrays.
- **enable.auto.commit**: Kafka source doesn't commit any offset.
- **interceptor.classes**: Kafka source always reads keys and values as byte arrays. It's not safe to
use ConsumerInterceptor as it may break the query.
