@wesleydias

[SPARK-6639] Create a new script to start multiple masters

srowen and others added 30 commits February 19, 2015 15:35
Another one from JoshRosen's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger and is necessary for consolidating the streaming one. I put together implementations in the way that seemed simplest. Almost all of the change is standardizing class and method names.

Author: Sean Owen <sowen@cloudera.com>

Closes #4514 from srowen/SPARK-4682 and squashes the following commits:

5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock

(cherry picked from commit 34b7c35)
Signed-off-by: Andrew Or <andrew@databricks.com>
…emory", ...) will not work

I've updated documentation to reflect true behavior of this setting in client vs. cluster mode.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4665 from ilganeli/SPARK-5570 and squashes the following commits:

5d1c8dd [Ilya Ganelin] Added example configuration code
a51700a [Ilya Ganelin] Getting rid of extra spaces
85f7a08 [Ilya Ganelin] Reworded note
5889d43 [Ilya Ganelin] Formatting adjustment
f149ba1 [Ilya Ganelin] Minor updates
1fec7a5 [Ilya Ganelin] Updated to add clarification for other driver properties
db47595 [Ilya Ganelin] Slight formatting update
c899564 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5570
17b751d [Ilya Ganelin] Updated documentation for driver-memory to reflect its true behavior in client vs cluster mode

(cherry picked from commit 6bddc40)
Signed-off-by: Andrew Or <andrew@databricks.com>
In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java, so Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Integer)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.

Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`, it doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.

I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:

865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly

(cherry picked from commit 0cfd2ce)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
In the general case, Spillable's heuristic of checking for memory stress
on every 32nd item after 1000 items are read is good enough. In general,
we do not want to be enacting the spilling checks until later on in the
job; checking for disk-spilling too early can produce unacceptable
performance impact in trivial cases.

However, there are non-trivial cases, particularly if each serialized
object is large, where checking for the necessity to spill too late
would allow the memory to overflow. Consider if every item is 1.5 MB in
size, and the heap size is 1000 MB. Then clearly if we only try to spill
the in-memory contents to disk after 1000 items are read, we would have
already accumulated 1500 MB of RAM and overflowed the heap.

Patch #3656 attempted to circumvent this by checking the need to spill
on every single item read, but that would cause unacceptable performance
in the general case. However, the convoluted cases above should not be
forced to be refactored to shrink the data items. Therefore it makes
sense that the memory spilling thresholds be configurable.
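
The arithmetic in this message can be checked with a small calculation. This is only a sketch, not Spark's `Spillable` code: the function names and parameters are hypothetical, and it assumes the heuristic runs its first check on the first item count that is greater than the minimum and a multiple of the check interval.

```python
def first_check_index(min_items: int = 1000, interval: int = 32) -> int:
    """Item count at which the first spill check runs (assumed reading of the heuristic)."""
    n = min_items + 1
    while n % interval != 0:
        n += 1
    return n


def memory_at_first_check(item_mb: float, min_items: int = 1000, interval: int = 32) -> float:
    """Memory (in MB) already accumulated when the first spill check runs."""
    return first_check_index(min_items, interval) * item_mb


if __name__ == "__main__":
    heap_mb = 1000
    for min_items in (1000, 0):
        used = memory_at_first_check(1.5, min_items=min_items)
        status = "overflows" if used > heap_mb else "fits in"
        print(f"min_items={min_items}: {used} MB accumulated, {status} a {heap_mb} MB heap")
```

With 1.5 MB items the default minimum lets well over 1000 MB accumulate before any check happens, while a lower minimum brings the first check forward to the first multiple of the interval.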

Author: mcheah <mcheah@palantir.com>

Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits:

6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check

(cherry picked from commit 3be92cd)
Signed-off-by: Andrew Or <andrew@databricks.com>
…ease

For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.

For SPARK-5892:
* Fix Python docs
* Various other cleanups

BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]

CC: mengxr  (ML),  davies  (Python docs)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:

f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib

(cherry picked from commit 4a17eed)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…s and tuple/list

Fix createDataFrame() from pandas DataFrame (not tested by jenkins, depends on SPARK-5693).

It also supports creating a DataFrame from plain tuples/lists without column names; `_1`, `_2` will be used as column names.
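
The positional naming scheme can be illustrated in plain Python. This is a sketch of the naming rule only, not PySpark's implementation; `default_column_names` and `rows_to_records` are hypothetical helpers.

```python
def default_column_names(row):
    """Generate positional names _1, _2, ... for an unnamed tuple or list."""
    return ["_%d" % (i + 1) for i in range(len(row))]


def rows_to_records(rows):
    """Pair each value with its positional column name (names taken from the first row)."""
    names = default_column_names(rows[0]) if rows else []
    return [dict(zip(names, row)) for row in rows]


if __name__ == "__main__":
    print(rows_to_records([("alice", 1), ["bob", 2]]))
```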

Author: Davies Liu <davies@databricks.com>

Closes #4679 from davies/pandas and squashes the following commits:

c0cbe0b [Davies Liu] fix tests
8466d1d [Davies Liu] fix create DataFrame from pandas

(cherry picked from commit 5b0a42c)
Signed-off-by: Michael Armbrust <michael@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-5909

Author: Yin Huai <yhuai@databricks.com>

Closes #4694 from yhuai/clearCache and squashes the following commits:

397ecc4 [Yin Huai] Address comments.
a2702fc [Yin Huai] Update parser.
3a54506 [Yin Huai] add isEmpty to CacheManager.
6d14460 [Yin Huai] Python clearCache.
f7b8dbd [Yin Huai] Add clear cache command.
We continue to see IllegalStateException in YARN cluster mode. Adding a simple workaround for now.

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #4690 from nishkamravi2/master_nravi and squashes the following commits:

d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles

(cherry picked from commit d3cbd38)
Signed-off-by: Andrew Or <andrew@databricks.com>
…ect class is used in t...

...ests.

Without this SparkHadoopUtil is used by the Client instead of YarnSparkHadoopUtil.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #4711 from harishreedharan/SPARK-5937 and squashes the following commits:

d154de6 [Hari Shreedharan] Use System.clearProperty() instead of setting the value of SPARK_YARN_MODE to empty string.
f729f70 [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.

(cherry picked from commit 7138816)
Signed-off-by: Andrew Or <andrew@databricks.com>
…Reduce API

This looks like a simple typo: ```SparkContext.newHadoopRDD``` instead of ```SparkContext.newAPIHadoopRDD```, as in the actual API docs at http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.SparkContext

Author: Alexander <abezzubov@nflabs.com>

Closes #4718 from bzz/hadoop-InputFormats-doc-fix and squashes the following commits:

680a4c4 [Alexander] Fix typo in docs on custom Hadoop InputFormats

(cherry picked from commit a7f9039)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4717 from chenghao-intel/typo1 and squashes the following commits:

858d7b0 [Cheng Hao] update the typo

(cherry picked from commit 275b1be)
Signed-off-by: Sean Owen <sowen@cloudera.com>
CC mengxr

Author: Sean Owen <sowen@cloudera.com>

Closes #4715 from srowen/SPARK-5669.3 and squashes the following commits:

b27ffa9 [Sean Owen] Reverse exclusion of JBLAS libs for 1.3
Author: Makoto Fukuhara <fukuo33@gmail.com>

Closes #4724 from fukuo33/fix-typo and squashes the following commits:

8c806b9 [Makoto Fukuhara] fix typo.

(cherry picked from commit 9348767)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…warning

Author: Saisai Shao <saisai.shao@intel.com>

Closes #4722 from jerryshao/SPARK-5943 and squashes the following commits:

1b01233 [Saisai Shao] Update the test to use new API to reduce the warning

(cherry picked from commit 757b14b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Add parameter parsing to the FPGrowth example app in Scala and Java.
A sample data file is also added in the data/mllib folder.

Author: Jacky Li <jacky.likun@huawei.com>

Closes #4714 from jackylk/parameter and squashes the following commits:

8c478b3 [Jacky Li] fix according to comments
3bb74f6 [Jacky Li] make FPGrowth exampl app take parameters
f0e4d10 [Jacky Li] make FPGrowth exampl app take parameters

(cherry picked from commit 651a1c0)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Added a description of ChiSqSelector and a few words about feature selection in general. I could add a code example; however, it would not look reasonable in the absence of a feature discretizer or a dataset in the `data` folder that has redundant features.

Author: Alexander Ulanov <nashb@yandex.ru>

Closes #4709 from avulanov/SPARK-5912 and squashes the following commits:

19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
eb6b9fe [Alexander Ulanov] Typo
2921a1d [Alexander Ulanov] ChiSqSelector example of use
c845350 [Alexander Ulanov] ChiSqSelector docs

(cherry picked from commit 28ccf5e)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity

CC: avulanov  I noticed a typo post-hoc and ended up making a few small edits.  Do the changes look OK?

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4732 from jkbradley/chisqselector-docs and squashes the following commits:

9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs

(cherry picked from commit 59536cc)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…aset.

JIRA: https://issues.apache.org/jira/browse/SPARK-5935

Author: Yin Huai <yhuai@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #4710 from yhuai/jsonMapType and squashes the following commits:

3e40390 [Yin Huai] Remove unnecessary changes.
f8e6267 [Yin Huai] Fix test.
baa36e3 [Yin Huai] Accept MapType in the schema provided to jsonFile/jsonRDD.

(cherry picked from commit 48376bf)
Signed-off-by: Michael Armbrust <michael@databricks.com>
…xecution

Author: Michael Armbrust <michael@databricks.com>

Closes #4684 from marmbrus/explainAnalysis and squashes the following commits:

afbaa19 [Michael Armbrust] fix python
d93278c [Michael Armbrust] fix hive
e5fa0a4 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
52119f2 [Michael Armbrust] more tests
82a5431 [Michael Armbrust] fix tests
25753d2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
aee1e6a [Michael Armbrust] fix hive
b23a844 [Michael Armbrust] newline
de8dc51 [Michael Armbrust] more comments
acf620a [Michael Armbrust] [SPARK-5873][SQL] Show partially analyzed plans in query execution

(cherry picked from commit 1ed5708)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* Removed SVD code from examples.
* Corrected Java API doc link.
* Updated variable names: `AtransposeA` -> `ata`.
* Minor changes.

brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #4737 from mengxr/update-block-matrix-user-guide and squashes the following commits:

70f53ac [Xiangrui Meng] update block matrix user guide

(cherry picked from commit cf2e416)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Please refer to the [JIRA ticket] [1] for the motivation.

[1]: https://issues.apache.org/jira/browse/SPARK-5968

Author: Cheng Lian <lian@databricks.com>

Closes #4744 from liancheng/spark-5968 and squashes the following commits:

caac6a8 [Cheng Lian] Suppresses ParquetOutputCommitter WARN logs

(cherry picked from commit 8403331)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #4736 from marmbrus/asExprs and squashes the following commits:

5ba97e4 [Michael Armbrust] [SPARK-5910][SQL] Support for as in selectExpr

(cherry picked from commit 0a59e45)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #4738 from marmbrus/udtRepart and squashes the following commits:

c06d7b5 [Michael Armbrust] fix compilation
91c8829 [Michael Armbrust] [SQL][SPARK-5532] Repartition should not use external rdd representation

(cherry picked from commit 2012366)
Signed-off-by: Michael Armbrust <michael@databricks.com>
…JobIds

Patch should be self-explanatory
pwendell JoshRosen

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4741 from tdas/SPARK-5967 and squashes the following commits:

653b5bb [Tathagata Das] Fixed the fix and added test
e2de972 [Tathagata Das] Clear stages which have no corresponding active jobs.

(cherry picked from commit 64d2c01)
Signed-off-by: Andrew Or <andrew@databricks.com>
For screenshot see: https://issues.apache.org/jira/browse/SPARK-5965
This was caused by 20a6013.

Author: Andrew Or <andrew@databricks.com>

Closes #4739 from andrewor14/user-jar-blocker and squashes the following commits:

23c4a9e [Andrew Or] Use right argument

(cherry picked from commit 6d2caa5)
Signed-off-by: Andrew Or <andrew@databricks.com>
Variance is calculated on labels/responses.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4740 from mengxr/patch-1 and squashes the following commits:

673317b [Xiangrui Meng] [MLLIB] Change x_i to y_i in Variance's user guide

(cherry picked from commit 105791e)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #4746 from marmbrus/hiveLock and squashes the following commits:

8b871cf [Michael Armbrust] [SPARK-5952][SQL] Lock when using hive metastore client

(cherry picked from commit a2b9137)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4745 from davies/fix_zip and squashes the following commits:

2124b2c [Davies Liu] Update tests.py
b5c828f [Davies Liu] increase the number of records
c1e40fd [Davies Liu] fix zip with two RDDs with AutoBatchedSerializer

(cherry picked from commit da505e5)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
**NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged.

`HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicating with external server processes. This PR revamps this test suite for better robustness:

1. Fixes a race condition that occurred while using `tail -f` to check the log file

   It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked.

2. Retries up to 3 times if the server fails to start

   In most of the cases, the server fails to start because of port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start.

3. A server instance is reused among all test cases within a single suite

   The original `HiveThriftServer2Suite` is split into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases.
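
The retry strategy from item 2 can be sketched as follows. This is not the test suite's code: `start_with_retries`, the `start_server` callback, the port range, and the choice of `OSError` as the failure signal are all assumptions made for illustration, and "up to 3 times" is read here as three attempts in total.

```python
import random


def start_with_retries(start_server, max_attempts=3, rng=None, port_range=(10000, 60000)):
    """Try a fresh random port on each attempt; give up after max_attempts failures."""
    rng = rng or random.Random()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        port = rng.randint(*port_range)
        try:
            # start_server is a caller-supplied function that raises OSError on failure.
            return start_server(port), port, attempt
        except OSError as err:
            last_error = err
    raise RuntimeError("server failed to start after %d attempts" % max_attempts) from last_error


if __name__ == "__main__":
    seen = []

    def flaky_start(port):
        # Simulate a port conflict on the first attempt only.
        seen.append(port)
        if len(seen) == 1:
            raise OSError("port %d already in use" % port)
        return "server"

    print(start_with_retries(flaky_start))
```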

**TODO**

- [ ] Starts the Thrift server in foreground once #3881 is merged (adding `--foreground` flag to `spark-daemon.sh`)

Author: Cheng Lian <lian@databricks.com>

Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits:

d6c80eb [Cheng Lian] Relaxes server startup timeout
6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness

(cherry picked from commit f816e73)
Signed-off-by: Cheng Lian <lian@databricks.com>
marmbrus and others added 26 commits March 27, 2015 11:40
This is based on the bug and test case proposed by viirya.  See #5203 for an excellent description of the problem.

TL;DR: The problem occurs because the function `groupBy(String)` calls `resolve`, which returns an `AttributeReference`.  However, this `AttributeReference` is based on an analyzed plan which is thrown away.  At execution time, we once again analyze the plan.  However, in the case of self-joins, each call to analyze will produce a new tree for the left side of the join, rendering the previously returned `AttributeReference` invalid.

As a fix, I propose we keep the analyzed plan instead of the unresolved plan inside of a `DataFrame`.

Author: Michael Armbrust <michael@databricks.com>

Closes #5217 from marmbrus/preanalyzer and squashes the following commits:

1f98e2d [Michael Armbrust] revert change
dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame
089c52e [Michael Armbrust] WIP

(cherry picked from commit 5d9c37c)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Fix the import in sql example.

Author: Davies Liu <davies@databricks.com>

Closes #5230 from davies/fix_sql_example and squashes the following commits:

7ecc5f4 [Davies Liu] fix sql example

(cherry picked from commit 887e1b7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Fixes bug causing Kryo serialization to fail with Avro files in between stages.

https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249

Author: Dean Chen <deanchen5@gmail.com>

Closes #5193 from deanchen/SPARK-6544 and squashes the following commits:

813d4c5 [Dean Chen] [SPARK-6544][build] Increment Avro version from 1.7.6 to 1.7.7
… 1 row

Author: Reynold Xin <rxin@databricks.com>

Closes #5226 from rxin/empty-df and squashes the following commits:

1306d88 [Reynold Xin] Proper fix.
e135bb9 [Reynold Xin] [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 rows, not 1 row.

(cherry picked from commit 3af7334)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…a Parquet schema

Opening to replace #5188.

When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.

In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.

In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The **mergeMetastoreParquetSchema()** method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't match the same set of fields specified by the metastore.

This pull request alters the behavior of **mergeMetastoreParquetSchema()** by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.
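
The merging rule described here can be sketched in a few lines. This is an illustration in plain Python, not the Scala code in ParquetRelation2: the `Field` type and `add_missing_nullable_fields` are hypothetical, and fields are matched by exact name for simplicity.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    name: str
    data_type: str
    nullable: bool


def add_missing_nullable_fields(metastore_schema: List[Field],
                                parquet_schema: List[Field]) -> List[Field]:
    """Append nullable metastore fields that the Parquet file schema lacks."""
    present = {f.name for f in parquet_schema}
    missing = [f for f in metastore_schema if f.nullable and f.name not in present]
    return list(parquet_schema) + missing


if __name__ == "__main__":
    metastore = [Field("id", "int", False),
                 Field("name", "string", True),
                 Field("email", "string", True)]
    parquet_file = [Field("id", "int", False), Field("name", "string", True)]
    for field in add_missing_nullable_fields(metastore, parquet_file):
        print(field)
```

Rows read from such a file would then carry nulls for the appended fields. A missing field that is not nullable is left out, so the later schema comparison can still reject the mismatch.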

Author: Adam Budde <budde@amazon.com>

Closes #5214 from budde/nullable-fields and squashes the following commits:

a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema

(cherry picked from commit 5909f09)
Signed-off-by: Cheng Lian <lian@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #5238 from rxin/pyspark-docs and squashes the following commits:

c285951 [Reynold Xin] Reset deprecation warning.
8c1031e [Reynold Xin] inferSchema
dd91b1a [Reynold Xin] [DOC] Improvements to Python docs.

(cherry picked from commit 5eef00d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…supported types in Schema creation

Currently, trying to register an RDD (or a DataFrame in 1.3) as a table whose types have no supported schema representation (e.g. type "Any") throws a match error, e.g. scala.MatchError: Any (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)

This fix is just to have a nicer error message than a MatchError

Author: Eran Medan <ehrann.mehdan@gmail.com>

Closes #5235 from eranation/patch-2 and squashes the following commits:

af4b1a2 [Eran Medan] Line should be under 100 chars
0c69e9d [Eran Medan] Change from sys.error UnsupportedOperationException
524be86 [Eran Medan] better exception than scala.MatchError: Any

(cherry picked from commit 17b13c5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…ning shuffle

I've updated the Spark Programming Guide to add a section on the shuffle operation providing some background on what it does. I've also addressed some of its performance impacts.

I've included documentation to address the following issues:
https://issues.apache.org/jira/browse/SPARK-5836
https://issues.apache.org/jira/browse/SPARK-3441
https://issues.apache.org/jira/browse/SPARK-5750

https://issues.apache.org/jira/browse/SPARK-4227 is related but can be addressed in a separate PR since it involves updates to the Spark Configuration Guide.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Author: Ilya Ganelin <ilganeli@gmail.com>

Closes #5074 from ilganeli/SPARK-5750 and squashes the following commits:

6178e24 [Ilya Ganelin] Update programming-guide.md
7a0b96f [Ilya Ganelin] Update programming-guide.md
2c5df08 [Ilya Ganelin] Merge branch 'SPARK-5750' of github.com:ilganeli/spark into SPARK-5750
dffbd2d [Ilya Ganelin] [SPARK-5750] Slight wording update
1ff4eb4 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
85f9c6e [Ilya Ganelin] Update programming-guide.md
349d1fa [Ilya Ganelin] Added cross linkf or configuration page
eeb5a7a [Ilya Ganelin] [SPARK-5750] Added some minor fixes
dd5cc9d [Ilya Ganelin] [SPARK-5750] Fixed some factual inaccuracies with regards to shuffle internals.
a8adb57 [Ilya Ganelin] [SPARK-5750] Incoporated feedback from Sean Owen
9954bbe [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
159dd1c [Ilya Ganelin] [SPARK-5750] Style fixes from rxin.
75ef67b [Ilya Ganelin] [SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining the shuffle operation and included errata from a number of other JIRAs

(cherry picked from commit 4bdfb7b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
… instead of login name

Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName()
getUserName() returns the user's full principal name (i.e. user1@CORP.COM). getShortUserName() returns just the user's login name (user1).

This just happens to work on YARN because the Client code sets:
env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName()
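
The difference between the two names can be illustrated with a small helper. This is a simplified sketch: Hadoop derives the short name through its own principal-to-local-name mapping, whereas the hypothetical `short_user_name` below only strips the realm and any host component, which covers the common case.

```python
def short_user_name(principal: str) -> str:
    """Approximate a login name from a principal of the form user[/host][@REALM]."""
    without_realm = principal.split("@", 1)[0]
    return without_realm.split("/", 1)[0]


if __name__ == "__main__":
    for name in ("user1@CORP.COM", "user1", "hdfs/host.example.com@CORP.COM"):
        print(name, "->", short_user_name(name))
```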

Author: Thomas Graves <tgraves@apache.org>

Closes #5229 from tgravescs/SPARK-6558 and squashes the following commits:

24830bf [Thomas Graves] Utils.getCurrentUserName returns the full principal name instead of login name
Now that we have `DataFrame`s it is possible to have multiple copies in a single query plan.  As such, it needs to inherit from `MultiInstanceRelation` or self joins will break.  I also add better debugging errors when our self join handling fails in case there are future bugs.

Author: Michael Armbrust <michael@databricks.com>

Closes #5251 from marmbrus/multiMetaStore and squashes the following commits:

4272f6d [Michael Armbrust] [SPARK-6595][SQL] MetastoreRelation should be MuliInstanceRelation

(cherry picked from commit fe81f6c)
Signed-off-by: Cheng Lian <lian@databricks.com>
This fixes `predictAll` after load. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:

82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load

(cherry picked from commit f75f633)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
… class under catalyst dir

https://issues.apache.org/jira/browse/SPARK-6592

The current implementation in SparkBuild.scala filters out all classes under the catalyst directory. However, there is a corner case: the Row class under that directory is a public API.

We need to include Row in the scaladoc while still excluding the other classes of the catalyst project.

Thanks for the help on this patch from rxin and liancheng

Author: CodingCat <zhunansjtu@gmail.com>

Closes #5252 from CodingCat/SPARK-6592 and squashes the following commits:

02098a4 [CodingCat] ignore collection, enable types (except those protected classes)
f7af2cb [CodingCat] commit
3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory

(cherry picked from commit 32259c6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…hema() and applySchema

This PR creates an alias for `registerFunction` as `udf.register`, to be consistent with the Scala API.

It also deprecates inferSchema() and applySchema(), and shows a warning for them.
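
The alias-plus-deprecation pattern can be sketched with the standard `warnings` module. This is a stand-in, not PySpark's source: `SketchSQLContext`, `_UDFRegistration`, the warning text, and the list-returning `createDataFrame` are all hypothetical.

```python
import warnings


class _UDFRegistration:
    """Thin wrapper so that ctx.udf.register(...) forwards to registerFunction."""

    def __init__(self, ctx):
        self._ctx = ctx

    def register(self, name, f):
        return self._ctx.registerFunction(name, f)


class SketchSQLContext:
    def __init__(self):
        self._functions = {}
        self.udf = _UDFRegistration(self)

    def registerFunction(self, name, f):
        self._functions[name] = f

    def createDataFrame(self, rows):
        # Placeholder: a real context would build a DataFrame here.
        return list(rows)

    def inferSchema(self, rows):
        # Deprecated entry point: warn, then delegate to the replacement.
        warnings.warn("inferSchema is deprecated, use createDataFrame instead",
                      DeprecationWarning, stacklevel=2)
        return self.createDataFrame(rows)


if __name__ == "__main__":
    ctx = SketchSQLContext()
    ctx.udf.register("twice", lambda x: 2 * x)
    warnings.simplefilter("always")
    print(ctx.inferSchema([(1, "a")]))
```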

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #5273 from davies/udf and squashes the following commits:

476e947 [Davies Liu] address comments
c096fdb [Davies Liu] add SQLContext.udf and deprecate inferSchema() and applySchema

(cherry picked from commit f76d2e5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…d Parquet tables

This PR leverages the output commit coordinator introduced in #4066 to help commit Hive and Parquet tables.

This PR extracts output commit code in `SparkHadoopWriter.commit` to `SparkHadoopMapRedUtil.commitTask`, and reuses it for committing Parquet and Hive tables on executor side.

TODO

- [ ] Add tests


Author: Cheng Lian <lian@databricks.com>

Closes #5139 from liancheng/spark-6369 and squashes the following commits:

72eb628 [Cheng Lian] Fixes typo in javadoc
9a4b82b [Cheng Lian] Adds javadoc and addresses @aarondav's comments
dfdf3ef [Cheng Lian] Uses commit coordinator to help committing Hive and Parquet tables

(cherry picked from commit fde6945)
Signed-off-by: Cheng Lian <lian@databricks.com>
This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.
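
To illustrate the intended semantics, here is a plain-Python emulation over a list of dict rows, with `None` standing for a missing value. It is only a sketch under those assumptions and does not use Spark; the `how`, `subset`, and dict-valued fill parameters are modeled on the commit notes below ("how", column subsets, fill with map).

```python
def dropna(rows, how="any", subset=None):
    """Drop rows that contain missing (None) values.

    how="any" drops a row if any considered column is None;
    how="all" drops it only if every considered column is None.
    """
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        missing = [row[c] is None for c in cols]
        drop = any(missing) if how == "any" else all(missing)
        if not drop:
            kept.append(row)
    return kept


def fillna(rows, value, subset=None):
    """Replace None with `value`.

    `value` may be a scalar (optionally limited to `subset` columns)
    or a dict mapping column name -> replacement.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, current in row.items():
            if current is not None:
                continue
            if isinstance(value, dict):
                if col in value:
                    new_row[col] = value[col]
            elif subset is None or col in subset:
                new_row[col] = value
        filled.append(new_row)
    return filled
```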

Author: Reynold Xin <rxin@databricks.com>

Closes #5274 from rxin/df-missing-value and squashes the following commits:

4ee1b98 [Reynold Xin] Improve error reporting in Python.
33a330c [Reynold Xin] Remove replace for now.
bc4fdbb [Reynold Xin] Added documentation for replace.
d56f5a5 [Reynold Xin] Added replace for Scala/Java.
2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
914a374 [Reynold Xin] fill with map.
185c67e [Reynold Xin] Allow specifying column subsets in fill.
749eb47 [Reynold Xin] fillna
249b94e [Reynold Xin] Removing undefined functions.
6a73c68 [Reynold Xin] Missing file.
67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)

(cherry picked from commit b8ff2bc)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Filters such as startsWith, endsWith, contains will be very useful for data sources that provide search functionality, e.g. Succinct, Elastic Search, Solr.

I also took this chance to improve documentation for the data source filters.
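
As a rough illustration of how a data source might evaluate such pushed-down string filters, consider the sketch below. The Python classes are illustrative only: their names are modeled on `sources.StringContains`, which is mentioned later in this log, and the `matches` and `scan` helpers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringStartsWith:
    attribute: str
    value: str


@dataclass(frozen=True)
class StringEndsWith:
    attribute: str
    value: str


@dataclass(frozen=True)
class StringContains:
    attribute: str
    value: str


def matches(row, flt):
    """Evaluate one string filter against a dict row; None never matches."""
    field = row.get(flt.attribute)
    if field is None:
        return False
    if isinstance(flt, StringStartsWith):
        return field.startswith(flt.value)
    if isinstance(flt, StringEndsWith):
        return field.endswith(flt.value)
    if isinstance(flt, StringContains):
        return flt.value in field
    raise TypeError(f"unsupported filter: {flt!r}")


def scan(rows, filters):
    """Return only the rows that satisfy every filter (logical AND)."""
    return [row for row in rows if all(matches(row, f) for f in filters)]
```

A search-backed source would typically translate each filter into its own query syntax instead of scanning rows; the scan here only shows the expected results.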

Author: Reynold Xin <rxin@databricks.com>

Closes #5285 from rxin/ds-string-filters and squashes the following commits:

f021727 [Reynold Xin] Fixed grammar.
7695a52 [Reynold Xin] [SPARK-6625][SQL] Add common string filters to data sources.

(cherry picked from commit f07e714)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…thon.

To maintain consistency with the Scala API.

Author: Reynold Xin <rxin@databricks.com>

Closes #5284 from rxin/df-na-alias and squashes the following commits:

19f46b7 [Reynold Xin] Show DataFrameNaFunctions in docs.
6618118 [Reynold Xin] [SPARK-6623][SQL] Alias DataFrame.na.drop and DataFrame.na.fill in Python.

(cherry picked from commit b80a030)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…-grained lock

JIRA: https://issues.apache.org/jira/browse/SPARK-6618

Author: Yin Huai <yhuai@databricks.com>

Closes #5281 from yhuai/lookupRelationLock and squashes the following commits:

591b4be [Yin Huai] A test?
b3a9625 [Yin Huai] Just protect client.

(cherry picked from commit 314afd0)
Signed-off-by: Cheng Lian <lian@databricks.com>
…ation

Also removes temporary workarounds made in #5183 and #5251.


Author: Cheng Lian <lian@databricks.com>

Closes #5289 from liancheng/spark-6555 and squashes the following commits:

d0095ac [Cheng Lian] Removes unused imports
cfafeeb [Cheng Lian] Removes outdated comment
75a2746 [Cheng Lian] Overrides equals() and hashCode() for MetastoreRelation

(cherry picked from commit a7992ff)
Signed-off-by: Michael Armbrust <michael@databricks.com>
… converting metastore Parquet tables

Consider a metastore Parquet table that

1. doesn't have schema evolution issue
2. has lots of data files and/or partitions

In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration that lets the user disable schema merging when converting such a metastore Parquet table.


Author: Cheng Lian <lian@databricks.com>

Closes #5231 from liancheng/spark-6575 and squashes the following commits:

cd96159 [Cheng Lian] Adds configuration to disable schema merging while converting metastore Parquet tables

(cherry picked from commit 8102014)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This PR is based on work by cloud-fan in #4904, but with two differences:
 - We isolate the logic for Sort's special handling into `ResolveSortReferences`
 - We avoid creating UnresolvedGetField expressions during resolution.  Instead we either resolve GetField or we return None.  This avoids us going down the wrong path early on.

Author: Michael Armbrust <michael@databricks.com>

Closes #5189 from marmbrus/nestedOrderBy and squashes the following commits:

b8cae45 [Michael Armbrust] fix another test
0f36a11 [Michael Armbrust] WIP
91820cd [Michael Armbrust] Fix bug.

(cherry picked from commit cd48ca5)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
…uther resolution

Previously, it was possible for a query to flip back and forth from a resolved state, allowing resolution to propagate up before coercion had stabilized.  The issue was that `ResolvedReferences` would run after `FunctionArgumentConversion`, but before `PropagateTypes` had run.  This PR ensures we correctly `PropagateTypes` after any coercion has been applied.

Author: Michael Armbrust <michael@databricks.com>

Closes #5278 from marmbrus/unionNull and squashes the following commits:

dc3581a [Michael Armbrust] [SPARK-5371][SQL] Propagate types after function conversion / before further resolution

(cherry picked from commit beebb7f)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
…structing sources.StringContains

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5299 from viirya/stringcontains and squashes the following commits:

c1ece4c [Liang-Chi Hsieh] Should be Contains instead of EndsWith.

(cherry picked from commit 2036bc5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…r only after authorized committer fails, not after any failure

In OutputCommitCoordinator, there is some logic to clear the authorized committer's lock on committing in case that task fails.  However, it looks like the current code also clears this lock if other non-authorized tasks fail, which is an obvious bug.

In theory, it's possible that this could allow a new committer to start, run to completion, and commit output before the authorized committer finished, but it's unlikely that this race occurs often in practice due to the complex combination of failure and timing conditions that would be required to expose it.

This patch addresses this issue and adds a regression test.
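
The corrected behavior can be sketched as follows. This is a simplified, hypothetical model of the coordinator, not the actual `OutputCommitCoordinator` code: the class name, method names, and the first-come-first-served authorization policy are assumptions for illustration.

```python
class CommitCoordinator:
    """Simplified model: at most one authorized committer per partition."""

    def __init__(self):
        # partition id -> attempt id currently holding the commit lock
        self._authorized = {}

    def can_commit(self, partition, attempt):
        """Grant the lock to the first attempt that asks.

        In this sketch any later request for the same partition is denied
        until the lock is released.
        """
        if partition not in self._authorized:
            self._authorized[partition] = attempt
            return True
        return False

    def task_failed(self, partition, attempt):
        """Release the lock only if the failed attempt is the lock holder.

        The bug described above corresponds to clearing the lock here
        unconditionally, whichever attempt failed.
        """
        if self._authorized.get(partition) == attempt:
            del self._authorized[partition]
```

With this check, a failure of a non-authorized attempt leaves the lock in place, so no second committer can start while the authorized one is still running.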

Thanks to aarondav for spotting this issue.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5276 from JoshRosen/SPARK-6614 and squashes the following commits:

d532ba7 [Josh Rosen] Check whether failed task was authorized committer
cbb3784 [Josh Rosen] Add regression test for SPARK-6614
Author: Reynold Xin <rxin@databricks.com>

Closes #5287 from rxin/pyspark-df-doc-cleanup-context and squashes the following commits:

1841b60 [Reynold Xin] Lint.
f2007f1 [Reynold Xin] functions and types.
bc3b72b [Reynold Xin] More improvements to DataFrame Python doc.
ac1d4c0 [Reynold Xin] Bug fix.
b163365 [Reynold Xin] Python fix. Added Experimental flag to DataFrameNaFunctions.
608422d [Reynold Xin] [Doc] Cleanup context.py Python docs.

(cherry picked from commit 305abe1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@srowen
Member

srowen commented Apr 1, 2015

I think this was opened in error. Mind closing this PR?

@AmplabJenkins

Can one of the admins verify this patch?

@wesleydias
Author

please ignore this git pull request. It was an error pull.

@wesleydias wesleydias closed this Apr 1, 2015