Skip to content

avoid uploading dispensably the file existed in HDFS when the host name is a virtrual host for HA Hadoop cluster#1530

Closed
pengyanhong wants to merge 12 commits intoapache:masterfrom
pengyanhong:master
Closed

avoid uploading dispensably the file existed in HDFS when the host name is a virtrual host for HA Hadoop cluster#1530
pengyanhong wants to merge 12 commits intoapache:masterfrom
pengyanhong:master

Conversation

@pengyanhong
Copy link

if the host name is a virtual machine name, the "compareFs" method will throw a unknown host exception, then the JARS file (include the spark assembly jar) will upload every time, it will take more time dispensable.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

mengxr and others added 10 commits July 22, 2014 11:45
Fix Mima issues in #1521.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1533 from mengxr/mima-als and squashes the following commits:

78386e1 [Xiangrui Meng] make Mima ignore updateFeatures (private) in ALS
### Why and what?
Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.

This patch adds a Sorter API, intended for in memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.

Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl

### Memory implications
An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.

Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).

The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
This results in a worst-case sorting overhead of 4N bytes.

Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.

### Performance implications
As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.

Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".

<table>
<tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
<tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
<tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
<tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
<tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr>
</table>

The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.

In short, this patch should significantly improve performance for users running either Java 6 or 7.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1502 from aarondav/sort and squashes the following commits:

652d936 [Aaron Davidson] Update license, move Sorter to java src
a7b5b1c [Aaron Davidson] fix licenses
5c0efaf [Aaron Davidson] Update tmpLength
ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
034bf10 [Aaron Davidson] Change to Apache v2 Timsort
b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
Currently, the "==" in HiveQL expression will cause exception thrown, this patch will fix it.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1522 from chenghao-intel/equal and squashes the following commits:

f62a0ff [Cheng Hao] Add == Support for HiveQl
fix examples

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:

663457a [CrazyJvm] outDegrees does not take parameters
7cfff1d [CrazyJvm] fix example for joinVertices
The issue is caused by #1112 .

Author: GuoQiang Li <witgo@qq.com>

Closes #1501 from witgo/webui_style and squashes the following commits:

4b34998 [GuoQiang Li] In some cases, pages display incorrect in WebUI
… re...

Opting to the option 2 defined in SPARK-2577, i.e., retrieve and pass the correct file system object to addResource.

Author: Gera Shegalov <gera@twitter.com>

Closes #1483 from gerashegalov/master and squashes the following commits:

90c9087 [Gera Shegalov] [YARN] SPARK-2577: File upload to viewfs is broken due to mount point resolution
…uffle

MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).

Author: Ankur Dave <ankurdave@gmail.com>

Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:

f9d0054 [Ankur Dave] Remove MessageToPartition
ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants