[SPARK-3007][SQL] Add "Dynamic Partition" support to Spark SQL Hive #1919
Conversation
Can one of the admins verify this patch?
I didn't add the related tests since I don't know how to write them. Can anyone give me some instructions? :)
There are a couple of ways we can add tests; ideally we would do a little of both:
```
@@ -93,6 +93,33 @@ private[hive] class SparkHiveHadoopWriter(
      null)
  }

  def open(dynamicPartPath: String) {
    val numfmt = NumberFormat.getInstance()
```
`NumberFormat.getInstance()` is not thread-safe. We can use a thread-local variable to hold this object, similar to `Cast.threadLocalDateFormat`.
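A minimal sketch of the suggested pattern, modeled on `Cast.threadLocalDateFormat` (the `formatPartName` helper and the five-digit padding are illustrative assumptions, not part of the patch):

```scala
import java.text.NumberFormat

// Each thread lazily gets its own NumberFormat, avoiding the shared mutable
// state that makes a single NumberFormat.getInstance() unsafe across tasks.
val threadLocalNumberFormat = new ThreadLocal[NumberFormat] {
  override def initialValue(): NumberFormat = NumberFormat.getInstance()
}

// Hypothetical usage: format a partition/task id as a zero-padded file name.
def formatPartName(id: Int): String = {
  val numfmt = threadLocalNumberFormat.get()
  numfmt.setMinimumIntegerDigits(5)
  numfmt.setGroupingUsed(false)
  "part-" + numfmt.format(id.toLong)
}
```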
Just realized this function is a variant of the original `open()` method within the same file. This should be a bug in the master branch.

Another issue is that `SparkHadoopWriter` resides in the `core` project, which is an indirect dependency of `sql/hive`. Thus, logically, it's not proper to put `open(dynamicPartPath: String)` here.
Oh, it is actually `SparkHiveHadoopWriter` in `sql/hive`. It seems we need to rename this file.
```
@@ -271,4 +272,9 @@ object Cast {
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    }
  }
  private[sql] val threadLocalNumberFormat = new ThreadLocal[NumberFormat] {
```
Ah, sorry, I didn't make myself clear enough. I meant you can refer to `Cast.threadLocalDateFormat`, not add the thread-local version of `NumberFormat` here, since it's not related to `Cast`. A better place to hold this could be `object SparkHadoopWriter`.
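To illustrate the suggestion, a sketch of hosting the thread-local in the writer's companion object instead of `Cast` (the object name follows the rename discussion above, so treat it as an assumption):

```scala
import java.text.NumberFormat

object SparkHiveHadoopWriter {
  // Shared by all writer instances; each thread still gets its own
  // NumberFormat via the ThreadLocal, so formatting stays thread-safe.
  private[hive] val threadLocalNumberFormat = new ThreadLocal[NumberFormat] {
    override def initialValue(): NumberFormat = NumberFormat.getInstance()
  }
}
```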
Please don't forget to add golden answer files for those test cases newly added to the whitelist in
```
      count += 1
      writer2.write(record)
    }
    for((k,v) <- writerMap) {
```
Space before `(`, i.e. `for ((k,v) <- writerMap) {`
@baishuo Thank you for working on it. I have three general comments.
Thanks a lot @yhuai and @liancheng :)
Hi @marmbrus and @liancheng, I have made some modifications and run the tests with `sbt/sbt catalyst/test sql/test hive/test`. Please help me check whether it is proper when you have time. Thank you :)
Hmm, I see 17 newly whitelisted test cases, but only golden answers for the
I am also curious about that.
Here I try to explain my design idea (the code is mostly in `InsertIntoHiveTable.scala`).

First: to implement dynamic partitioning, we need the Hive API `loadDynamicPartitions` to move the data and update the metadata. However, the directory layout that `loadDynamicPartitions` requires is a little different from what `loadPartition` expects:

1. The case of one static partition and one dynamic partition (HQL like "…")
2. The case of zero static partitions and two dynamic partitions (HQL like "…")

So whether the HQL contains a static partition determines how we create the subdirectories under TMPLOCATION. That is why the function `getDynamicPartDir` exists.

Second: when the next RDD (the closure in `writeToFile`) gets a record and its `dynamicPartPath`, we check whether `dynamicPartPath` is null. If it is not null, we check whether a corresponding writer already exists in `writerMap`, which stores one writer per partition. If one exists, we use that writer to write the record; this ensures that data belonging to the same partition is written to the same directory. `loadDynamicPartitions` requires that there are no files under TMPLOCATION other than the subdirectories for the dynamic partitions; that is why there are several `if (dynamicPartNum == 0)` checks in `writeToFile`.
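To make the second point concrete, here is a simplified sketch of the writer-caching idea (the `FileSinkWriter` trait, the `newWriter` factory, and the class name are stand-ins for illustration; the real patch uses Hive's writer classes inside `writeToFile`):

```scala
import scala.collection.mutable

// Stand-in for the Hive record writer used by the patch.
trait FileSinkWriter {
  def write(record: AnyRef): Unit
  def close(): Unit
}

class DynamicPartitionWriters(newWriter: String => FileSinkWriter) {
  // One writer per dynamic-partition subdirectory, so every record that
  // shares a dynamicPartPath lands in the same directory under TMPLOCATION.
  private val writerMap = mutable.Map.empty[String, FileSinkWriter]

  def write(record: AnyRef, dynamicPartPath: String): Unit = {
    if (dynamicPartPath == null) {
      // Static-partition-only case: the patch falls back to the single
      // pre-opened writer here (omitted in this sketch).
    } else {
      // Reuse the existing writer for this partition path, or create one
      // lazily on first use.
      val writer =
        writerMap.getOrElseUpdate(dynamicPartPath, newWriter(dynamicPartPath))
      writer.write(record)
    }
  }

  // Mirrors the `for((k,v) <- writerMap)` cleanup loop in the patch.
  def closeAll(): Unit = writerMap.values.foreach(_.close())
}
```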
Hi @marmbrus, I have updated the files related to the tests, and all tests passed on my machine. Would you please help verify this patch when you have time? :) I have written out the thinking behind the code above. Thank you.
Thanks for working on this! We will have more time to review it after the Spark 1.1 release.
ok to test
QA tests have started for PR 1919 at commit
QA tests have finished for PR 1919 at commit
Hi @marmbrus, can you help me check why the test fails? I compiled and ran the tests locally, so I thought it would pass the Spark QA test :). There is also a new PR, #2226 (with the same changes, tested locally), based on the new master. Would you please run a test on it if this PR still fails? Thank you :)
Would you mind closing this PR, since #2226 was opened as a replacement?
No problem, closing this PR.
A new PR based on the new master; the changes are the same as #1919.

Author: baishuo(白硕) <vc_java@hotmail.com>
Author: baishuo <vc_java@hotmail.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2226 from baishuo/patch-3007 and squashes the following commits:

e69ce88 [Cheng Lian] Adds tests to verify dynamic partitioning folder layout
b20a3dc [Cheng Lian] Addresses @yhuai's comments
096bbbc [baishuo(白硕)] Merge pull request #1 from liancheng/refactor-dp
1093c20 [Cheng Lian] Adds more tests
5004542 [Cheng Lian] Minor refactoring
fae9eff [Cheng Lian] Refactors InsertIntoHiveTable to a Command
528e84c [Cheng Lian] Fixes typo in test name, regenerated golden answer files
c464b26 [Cheng Lian] Refactors dynamic partitioning support
5033928 [baishuo] pass check style
2201c75 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name
b47c9bf [baishuo] modify according micheal's advice
c3ab36d [baishuo] modify for some bad indentation
7ce2d9f [baishuo] modify code to pass scala style checks
37c1c43 [baishuo] delete a empty else branch
66e33fc [baishuo] do a little modify
88d0110 [baishuo] update file after test
a3961d9 [baishuo(白硕)] Update Cast.scala
f7467d0 [baishuo(白硕)] Update InsertIntoHiveTable.scala
c1a59dd [baishuo(白硕)] Update Cast.scala
0e18496 [baishuo(白硕)] Update HiveQuerySuite.scala
60f70aa [baishuo(白硕)] Update InsertIntoHiveTable.scala
0a50db9 [baishuo(白硕)] Update HiveCompatibilitySuite.scala
491c7d0 [baishuo(白硕)] Update InsertIntoHiveTable.scala
a2374a8 [baishuo(白硕)] Update InsertIntoHiveTable.scala
701a814 [baishuo(白硕)] Update SparkHadoopWriter.scala
dc24c41 [baishuo(白硕)] Update HiveQl.scala
For details, please refer to the comments on https://issues.apache.org/jira/browse/SPARK-3007