[SPARK-6747] [SQL] Support List<> as a return type in Hive UDF #6179

maropu · 2015-05-15T07:23:03Z

This patch supports List<> as a return type in Hive UDF.

Consider the UDF below:

    public class UDFToListString extends UDF {
      public List evaluate(Object o) {
        return Arrays.asList("xxx", "yyy", "zzz");
      }
    }

With the current implementation, using this UDF throws a scala.MatchError:
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...
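
The value in the error message can be reproduced without Hive or Spark. The sketch below assumes the UDF's return type is found by reflecting on `evaluate`, which the `(of class java.lang.Class)` part of the message suggests. `ListReturningUdf` and `ReturnTypeProbe` are hypothetical names used only for illustration, and the stand-in class deliberately does not extend Hive's `UDF`.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF above. It does not extend Hive's UDF
// class, so the sketch compiles without Hive on the classpath.
final class ListReturningUdf {
    public List<String> evaluate(Object o) {
        return Arrays.asList("xxx", "yyy", "zzz");
    }
}

final class ReturnTypeProbe {
    // Looks up evaluate(Object) reflectively and returns its erased return type.
    static Class<?> erasedReturnType() {
        try {
            Method m = ListReturningUdf.class.getMethod("evaluate", Object.class);
            return m.getReturnType();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints: interface java.util.List
        System.out.println(erasedReturnType());
    }
}
```

Even though the method is declared as `List<String>`, the reflected return type is the bare `java.util.List` interface, so a type match that only covers concrete classes and arrays has no case for it.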

AmplabJenkins · 2015-05-15T07:27:10Z

Can one of the admins verify this patch?

maropu · 2015-05-15T07:28:24Z

This PR reopens #5395, which I closed by mistake.

chenghao-intel · 2015-05-15T15:15:08Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala

@@ -214,8 +217,16 @@ private[hive] trait HiveInspectors {

    case c: Class[_] if c.isArray => ArrayType(javaClassToDataType(c.getComponentType))

+    // list type
+    case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>


Type information is erased at compile time, so both expressions refer to the same Class object.
For example:

    scala> classOf[java.util.List[_]] == classOf[java.util.List[java.lang.Object]]
    res5: Boolean = true

Nit: replace java.lang.Object with _?

maropu (May 16, 2015): Thanks, fixed.
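
The same erasure can be observed from plain Java. This is a minimal sketch and not part of the patch; `ErasureDemo` is a hypothetical name. It checks that two lists with different element types share a single runtime class, which is why the element type in the pattern above makes no difference.

```java
import java.util.ArrayList;
import java.util.List;

final class ErasureDemo {
    // Returns true if lists with different element types share one runtime class.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        return strings.getClass() == ints.getClass();
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass()); // prints "true"
    }
}
```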

chenghao-intel · 2015-05-15T15:20:55Z

@rxin @marmbrus can you trigger the unit test?

marmbrus · 2015-05-15T18:51:13Z

ok to test

AmplabJenkins · 2015-05-15T18:52:10Z

Merged build triggered.

AmplabJenkins · 2015-05-15T18:52:19Z

Merged build started.

SparkQA · 2015-05-15T18:54:30Z

Test build #32843 has started for PR 6179 at commit 2b3f8a1.

SparkQA · 2015-05-15T21:02:02Z

Test build #32843 has finished for PR 6179 at commit 2b3f8a1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-15T21:02:06Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-15T21:02:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32843/

AmplabJenkins · 2015-05-16T03:02:10Z

Merged build triggered.

AmplabJenkins · 2015-05-16T03:02:19Z

Merged build started.

SparkQA · 2015-05-16T03:04:19Z

Test build #32879 has started for PR 6179 at commit 1e82316.

SparkQA · 2015-05-16T04:58:35Z

Test build #32879 has finished for PR 6179 at commit 1e82316.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-16T04:58:39Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-16T04:58:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32879/

maropu · 2015-05-21T11:33:59Z

@marmbrus please merge it.

marmbrus · 2015-05-21T17:44:57Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUdfSuite.scala

+    sql(s"CREATE TEMPORARY FUNCTION testUDFToListString AS '${classOf[UDFToListString].getName}'")
+    checkAnswer(
+      sql("SELECT testUDFToListString(s) FROM inputTable"),
+      Seq(Row(u"data1" :: u"data2" :: u"data3" :: Nil)))


I am pretty concerned about internal types leaking out of the execution engine into user code here. Are there real UDFs that we are trying to support here?

maropu (May 27, 2015): Yes, some libraries already depend on this kind of UDF:
https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/ftvec/AddBiasUDF.java#L37

Any ideas on how to avoid this leak?

AmplabJenkins · 2015-07-06T23:48:19Z

Merged build triggered.

AmplabJenkins · 2015-07-06T23:48:27Z

Merged build started.

SparkQA · 2015-07-06T23:51:51Z

Test build #36620 has started for PR 6179 at commit feb1129.

SparkQA · 2015-07-07T01:35:30Z

Test build #36620 has finished for PR 6179 at commit feb1129.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-07-07T01:36:02Z

Merged build finished. Test PASSed.

chenghao-intel reviewed May 15, 2015

maropu force-pushed the FixBugInHiveInspectors branch from 2b3f8a1 to 1e82316 Compare May 16, 2015 03:01

marmbrus reviewed May 21, 2015

pradeepchhetri and others added 10 commits June 22, 2015 11:45

[HOTFIX] [TESTS] Typo mqqt -> mqtt

1dfb0f7

This was introduced in apache#6866.

[SPARK-7153] [SQL] support all integral type ordinal in GetArrayItem

860a49e

first convert `ordinal` to `Number`, then convert to int type. Author: Wenchen Fan <cloud0fan@outlook.com> Closes apache#5706 from cloud-fan/7153 and squashes the following commits: 915db79 [Wenchen Fan] fix 7153

rekhajoshm and others added 24 commits July 5, 2015 12:58

[SQL][Minor] Update the DataFrame API for encode/decode

6d0411b

This is a the follow up of apache#6843. Author: Cheng Hao <hao.cheng@intel.com> Closes apache#7230 from chenghao-intel/str_funcs2_followup and squashes the following commits: 52cc553 [Cheng Hao] update the code as comment

[SPARK-8837][SPARK-7114][SQL] support using keyword in column name

0e19464

Author: Wenchen Fan <cloud0fan@outlook.com> Closes apache#7237 from cloud-fan/parser and squashes the following commits: e7b49bb [Wenchen Fan] support using keyword in column name

[MINOR] [SQL] remove unused code in Exchange

132e7fc

Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes apache#7234 from adrian-wang/exchangeclean and squashes the following commits: b093ec9 [Daoyuan Wang] remove unused code

Support List as a return type in Hive UDF

ee232db

Add a blank line at the end of UDFToListString

93e3d4e

Apply review comments

6984bf4

Fix code-style errors

7f812fd

Remove a new type

af61f2e

Add StringToUtf8 to comvert String into UTF8String

fdb2ae4

Add TODO comments in UDFToListString of HiveUdfSuite

7114a47

Apply comments

2844a8e

Throw an exception when java list type used

92ed7a6

Fix conflicts

feb1129

maropu closed this Jul 7, 2015
