From 20e0e31158fe0350b8f59617f2228a48c34274ef Mon Sep 17 00:00:00 2001
From: Andrew Ash
Date: Wed, 24 Sep 2014 01:50:07 -0700
Subject: [PATCH 1/3] SPARK-3526 Add section about data locality to the tuning guide

---
 docs/tuning.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/docs/tuning.md b/docs/tuning.md
index 8fb2a0433b1a8..f249de4bb9b52 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -247,6 +247,39 @@ Spark prints the serialized size of each task on the master, so you can look at
 decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
 worth optimizing.
 
+## Data Locality
+
+One of the most important principles of distributed computing is data locality. If data and the
+code that operates on it are together then computation tends to be fast. But if code and data are
+separated, one must move to the other. Typically it is faster to ship serialized code from place to
+place than a chunk of data because code size is much smaller than data. Spark builds its scheduling
+around this general principle of data locality.
+
+Data locality is how close data is to the code processing it. There are several levels of
+locality based on the data's current location. In order from closest to farthest:
+
+- `PROCESS_LOCAL` data is in the same JVM as the running code. This is the best locality
+  possible.
+- `NODE_LOCAL` data is on the same node. Examples might be in HDFS on the same node, or in
+  another executor on the same node. This is a little slower than `PROCESS_LOCAL` because the data
+  has to travel between processes.
+- `NO_PREF` data is accessed equally quickly from anywhere and has no locality preference.
+- `RACK_LOCAL` data is on the same rack of servers, but on a different server, so it needs to
+  be sent over the network, typically through a single switch.
+- `ANY` data is elsewhere on the network and not in the same rack.
+
+Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In
+situations where there is no unprocessed data on any idle executor, Spark switches to lower locality
+levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same
+server, or b) immediately start a new task in a farther away place that requires moving data there.
+
+What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout
+expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback
+between each level can be configured individually via `spark.locality.wait.process`,
+`spark.locality.wait.node`, and `spark.locality.wait.rack`, or all together via `spark.locality.wait`.
+You should increase these settings if your tasks are long and you see poor locality, but the
+defaults usually work well.
+
 # Summary
 
 This has been a short guide to point out the main concerns you should know about when tuning a

From 6d5d966a2dcf08c9f6f08d84d5e79641fdf655d6 Mon Sep 17 00:00:00 2001
From: Andrew Ash
Date: Thu, 25 Sep 2014 21:01:46 -0700
Subject: [PATCH 2/3] Stay focused on Spark, no astronaut architecture mumbo-jumbo

---
 docs/tuning.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/tuning.md b/docs/tuning.md
index f249de4bb9b52..7a258fc371168 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -249,11 +249,11 @@ worth optimizing.
 
 ## Data Locality
 
-One of the most important principles of distributed computing is data locality. If data and the
-code that operates on it are together then computation tends to be fast. But if code and data are
-separated, one must move to the other. Typically it is faster to ship serialized code from place to
-place than a chunk of data because code size is much smaller than data. Spark builds its scheduling
-around this general principle of data locality.
+Data locality can have a major impact on the performance of Spark jobs. If data and the code that
+operates on it are together then computation tends to be fast. But if code and data are separated,
+one must move to the other. Typically it is faster to ship serialized code from place to place than
+a chunk of data because code size is much smaller than data. Spark builds its scheduling around
+this general principle of data locality.
 
 Data locality is how close data is to the code processing it. There are several levels of
 locality based on the data's current location. In order from closest to farthest:

From 44cff28f183d5ba85d9395dda699faa137cad377 Mon Sep 17 00:00:00 2001
From: Andrew Ash
Date: Thu, 25 Sep 2014 21:08:10 -0700
Subject: [PATCH 3/3] Link to spark.locality parameters rather than copying the list

---
 docs/tuning.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/tuning.md b/docs/tuning.md
index 7a258fc371168..4eff88ab5acb0 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -275,8 +275,8 @@ server, or b) immediately start a new task in a farther away place that requires
 
 What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout
 expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback
-between each level can be configured individually via `spark.locality.wait.process`,
-`spark.locality.wait.node`, and `spark.locality.wait.rack`, or all together via `spark.locality.wait`.
+between each level can be configured individually or all together in one parameter; see the
+`spark.locality` parameters on the [configuration page](configuration.html#scheduling) for details.
 You should increase these settings if your tasks are long and you see poor locality, but the
 defaults usually work well.
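As a quick illustration of the parameters this patch documents, the sketch below shows one way an application might raise the locality wait timeouts through `SparkConf`. It is a minimal sketch, not part of the patch itself: the application name and the wait values are illustrative assumptions (given in milliseconds), not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: raise the locality wait timeouts for a job whose tasks are
// long and which is seeing poor locality. All values are illustrative, in ms.
val conf = new SparkConf()
  .setAppName("locality-tuning-example") // hypothetical app name
  // Shared fallback wait applied at every locality level.
  .set("spark.locality.wait", "10000")
  // Per-level overrides, matching the parameters named in the section above.
  .set("spark.locality.wait.process", "10000")
  .set("spark.locality.wait.node", "6000")
  .set("spark.locality.wait.rack", "3000")

val sc = new SparkContext(conf)
```

The same settings can also be supplied outside the application, for example in `spark-defaults.conf` or via `--conf` on `spark-submit`, which keeps tuning decisions out of the job code.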