[SPARK-28050][SQL] DataFrameWriter support insertInto a specific table partition #25657
Closed
Le-Dai wants to merge 34 commits into apache:master from
Conversation
Can one of the admins verify this patch?
…Source inserting for partitioned table ## What changes were proposed in this pull request? Datasource tables have supported partitioned tables for a long time. This commit adds the ability to translate an InsertIntoTable(HiveTableRelation) into a datasource table insertion. ## How was this patch tested? Existing tests with some modifications Closes apache#25306 from advancedxy/SPARK-28573. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…R on Windows ### What changes were proposed in this pull request? This PR adds three more pieces of information: - Mentions that `bash` in `PATH` is required to build. - Specifies supported JDK and Maven versions - Explicitly mentions that building on Windows is not officially supported ### Why are the changes needed? To enable SparkR developers on Windows to work, and to describe what is needed for the AppVeyor build. ### Does this PR introduce any user-facing change? No. It just adds some information in `R/WINDOWS.md` ### How was this patch tested? This is already being tested as such in AppVeyor. I also tested it myself (long ago though). Closes apache#25647 from HyukjinKwon/SPARK-28946. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request? Document SHOW COLUMNS statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing that issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After:** <img width="1234" alt="Screen Shot 2019-09-02 at 11 07 48 PM" src="https://user-images.githubusercontent.com/14225158/64148033-0fe77300-cdd7-11e9-93ee-e5951c7ed33c.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 08 08 PM" src="https://user-images.githubusercontent.com/14225158/64148039-137afa00-cdd7-11e9-8bec-634ea9d2594c.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 11 45 PM" src="https://user-images.githubusercontent.com/14225158/64148046-17a71780-cdd7-11e9-91c3-95a9c97e7a77.png"> ### How was this patch tested? Tested using `jekyll build --serve`. Closes apache#25531 from dilipbiswal/ref-doc-show-columns. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? Document DESCRIBE FUNCTION statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing that issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After:** <img width="1234" alt="Screen Shot 2019-09-02 at 11 14 09 PM" src="https://user-images.githubusercontent.com/14225158/64148193-85534380-cdd7-11e9-9c07-5956b5e8276e.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 14 29 PM" src="https://user-images.githubusercontent.com/14225158/64148201-8a17f780-cdd7-11e9-93d8-10ad9932977c.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 14 42 PM" src="https://user-images.githubusercontent.com/14225158/64148208-8dab7e80-cdd7-11e9-97c5-3a4ce12cac7a.png"> ### How was this patch tested? Tested using `jekyll build --serve`. Closes apache#25530 from dilipbiswal/ref-doc-desc-function. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? This PR upgrades the Maven dependency from 3.6.1 to 3.6.2. ### Why are the changes needed? All the builds are broken because 3.6.1 is no longer available. http://ftp.wayne.edu/apache//maven/maven-3/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/485/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10536/ ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes apache#25665 from gatorsmile/upgradeMVN. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request? This adds namespace support to V2SessionCatalog. ## How was this patch tested? WIP: will add tests for v2 session catalog namespace methods. Closes apache#25363 from rdblue/SPARK-28628-support-namespaces-in-v2-session-catalog. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
… grained KeyLock ### What changes were proposed in this pull request? This PR provides a new lock mechanism `KeyLock` to lock with a given key. Also use this new lock in `TorrentBroadcast` to avoid blocking tasks from fetching different broadcast values. ### Why are the changes needed? `TorrentBroadcast.readObject` uses a global lock so only one task can be fetching the blocks at the same time. This is not optimal if we are running multiple stages concurrently because they should be able to independently fetch their own blocks. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes apache#25612 from zsxwing/SPARK-3137. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
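The per-key locking idea in the `KeyLock` commit above can be sketched outside Spark. Below is a hypothetical Python rendering of the mechanism (Spark's actual `KeyLock` is Scala and uses JVM monitors; the class and method names here are illustrative, not Spark's API):

```python
import threading

class KeyLock:
    """Illustrative per-key lock: threads contend only when they use the
    same key, so tasks fetching different broadcast values don't block
    each other the way a single global lock would make them."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the lock table itself
        self._locks = {}                # key -> per-key lock

    def with_lock(self, key, body):
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:  # only holders of the SAME key serialize here
            return body()
```

This simple version never removes entries from the lock table; a production version would also need cleanup of unused per-key locks.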
…ence ### What changes were proposed in this pull request? Document ANALYZE TABLE statement in SQL Reference ### Why are the changes needed? To complete the SQL reference ### Does this PR introduce any user-facing change? Yes ***Before***: There was no documentation for this. ***After***: ### How was this patch tested? Tested using `jekyll build --serve`. Closes apache#25524 from huaxingao/spark-28788. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? We are adding other resource type support to the executors and Spark. We should show the resource information for each executor on the UI Executors page. This also adds a toggle button to show the resources column. It is off by default. ### Why are the changes needed? To show users what resources the executors have, like GPUs, FPGAs, etc. ### Does this PR introduce any user-facing change? Yes, introduces UI and REST API changes to show the resources ### How was this patch tested? Unit tests and manual UI tests on yarn and standalone modes. Closes apache#25613 from tgravescs/SPARK-27489-gpu-ui-latest. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
…ount examples
### What changes were proposed in this pull request?
Add Java/Scala StructuredKerberizedKafkaWordCount examples to test kerberized kafka.
### Why are the changes needed?
Now, the `StructuredKafkaWordCount` example does not support accessing Kafka using Kerberos authentication.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
```
Yarn client:
$ bin/run-example --files ${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab \
--driver-java-options "-Djava.security.auth.login.config=${path}/kafka_driver_jaas.conf" \
--conf \
"spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
--master yarn \
sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
subscribe topic1,topic2
Yarn cluster:
$ bin/run-example --files \
${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab,${krb5_path}/krb5.conf \
--driver-java-options \
"-Djava.security.auth.login.config=./kafka_jaas.conf \
-Djava.security.krb5.conf=./krb5.conf" \
--conf \
"spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
--master yarn --deploy-mode cluster \
sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
subscribe topic1,topic2
```
Closes apache#25649 from hddong/Add-StructuredKerberizedKafkaWordCount-examples.
Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
… older releases
### What changes were proposed in this pull request?
Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release.
### Why are the changes needed?
If an older release's specified Maven doesn't exist in the mirrors, `build/mvn` will fail.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested different paths and failures by commenting in/out parts of the script and modifying it directly.
Closes apache#25667 from srowen/SPARK-28963.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
maropu reviewed Sep 4, 2019
| * | ||
| * @since 3.0 | ||
| */ | ||
| def insertInto(tableName: String,partionInfo: String): Unit = { |
Member
nit: add a space after the comma: `String, partionInfo`. (Have you run `./dev/sbt-checkstyle`?)
Member
Can you add some tests before accepting Jenkins?
# Conflicts:
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala
Author
@maropu How is the general format of the Spark code? The last commit looks OK.
…dd offHeapMemorySize ## What changes were proposed in this pull request? If MEMORY_OFFHEAP_ENABLED is true, add MEMORY_OFFHEAP_SIZE to the resources requested for the executor to ensure the instance has enough memory to use. This PR adds a helper method `executorOffHeapMemorySizeAsMb` in `YarnSparkHadoopUtil`. ## How was this patch tested? Added 3 new tests for `YarnSparkHadoopUtil#executorOffHeapMemorySizeAsMb` Closes apache#25309 from LuciferYang/spark-28577. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Thomas Graves <tgraves@apache.org>
…n JDBC Tab UI ### What changes were proposed in this pull request? Currently Spark Thrift Server can't cancel a SQL job. When we use Hue to query through Spark Thrift Server, run a SQL query and then click the cancel button, the cancellation won't take effect in the backend, and in the Spark JDBC UI tab the SQL's status stays COMPILED while its duration keeps increasing, which may confuse people. ### Why are the changes needed? If the SQL status can't reflect the SQL's true status, it will confuse users. ### Does this PR introduce any user-facing change? Spark Thrift Server's UI tab will show a SQL's status as CANCELED when we cancel it. ### How was this patch tested? Manually tested the UI tab status and the backend log Closes apache#25611 from AngersZhuuuu/SPARK-28901. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? This patch fixes the bugs in test code itself, FsHistoryProviderSuite. 1. When creating log file via `newLogFile`, codec is ignored, leading to wrong file name. (No one tends to create test for test code, as well as the bug doesn't affect existing tests indeed, so not easy to catch.) 2. When writing events to log file via `writeFile`, metadata (in case of new format) gets written to file regardless of its codec, and the content is overwritten by another stream, hence no information for Spark version is available. It affects existing test, hence we have wrong expected value to workaround the bug. This patch also removes redundant parameter `isNewFormat` in `writeFile`, as according to review comment, Spark no longer supports old format. ### Why are the changes needed? Explained in above section why they're bugs, though they only reside in test-code. (Please note that the bug didn't come from non-test side of code.) ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modified existing UTs, as well as read event log file in console to see metadata is not overwritten by other contents. Closes apache#25629 from HeartSaVioR/MINOR-FIX-FsHistoryProviderSuite. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request? This patch does pooling for both Kafka consumers as well as fetched data. The overall benefits of the patch are the following: * Both pools support eviction of idle objects, which helps close invalid idle objects whose topic or partition is no longer assigned to any task. * It also enables applying different policies per pool, which helps optimize pooling for each pool. * We were concerned about multiple tasks pointing to the same topic partition as well as the same group id; existing code can't handle this, hence excess seeks and fetches could happen. This patch properly handles the case. * It also makes the code always safe to leverage the cache, hence no need to maintain the reuseCache parameter. Moreover, pooling Kafka consumers is implemented based on Apache Commons Pool, which also gives a couple of benefits: * We can get rid of synchronization on the KafkaDataConsumer object while acquiring and returning InternalKafkaConsumer. * We can extract the object-pool feature to outside of the class, so that the behaviors of the pool can be tested easily. * We can get various statistics for the object pool, and also be able to enable JMX for the pool. FetchedData instances are pooled by a custom pool implementation instead of leveraging Apache Commons Pool, because they have CacheKey as first key and "desired offset" as second key, where the "desired offset" changes - I haven't found any general pool implementation supporting this. This patch brings an additional dependency, Apache Commons Pool 2.6.0, into the `spark-sql-kafka-0-10` module. ## How was this patch tested? Existing unit tests as well as new tests for the object pool. Also did some experiments regarding proving concurrent access of consumers for the same topic partition. * Made changes on both sides (master and patch) to log when creating a Kafka consumer or fetching records from Kafka.
* branches
  * master: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging
  * patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging
* Test query (doing self-join)
  * https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2
* Ran the query from spark-shell, using `local[*]` to maximize the chance of concurrent access
* Collected the count of Kafka consumer creations via command: `grep "creating new Kafka consumer" logfile | wc -l`
* Collected the count of fetch requests to Kafka via command: `grep "fetching data from Kafka consumer" logfile | wc -l`

Topic and data distribution is as follows:

```
truck_speed_events_stream_spark_25151_v1:0:99440
truck_speed_events_stream_spark_25151_v1:1:99489
truck_speed_events_stream_spark_25151_v1:2:397759
truck_speed_events_stream_spark_25151_v1:3:198917
truck_speed_events_stream_spark_25151_v1:4:99484
truck_speed_events_stream_spark_25151_v1:5:497320
truck_speed_events_stream_spark_25151_v1:6:99430
truck_speed_events_stream_spark_25151_v1:7:397887
truck_speed_events_stream_spark_25151_v1:8:397813
truck_speed_events_stream_spark_25151_v1:9:0
```

The experiment only used the smallest 4 partitions (0, 1, 4, 6) to finish the query earlier. The result of the experiment is below:

branch | create Kafka consumer | fetch request
-- | -- | --
master | 1986 | 2837
patch | 8 | 1706

Closes apache#22138 from HeartSaVioR/SPARK-25151. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Co-authored-by: Jungtaek Lim <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
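The pooling behavior measured above can be sketched minimally. Below is a hypothetical Python rendering of a keyed object pool (the actual patch is Scala and builds on Apache Commons Pool; eviction, idle timeouts, and JMX statistics are omitted, and all names here are illustrative):

```python
from collections import defaultdict, deque

class KeyedPool:
    """Illustrative keyed pool: borrow() reuses an idle object for the key
    if one exists, otherwise creates a new one; release() returns it."""

    def __init__(self, factory):
        self._factory = factory          # called with the key to make a new object
        self._idle = defaultdict(deque)  # key -> idle objects for that key

    def borrow(self, key):
        if self._idle[key]:
            return self._idle[key].popleft()
        return self._factory(key)

    def release(self, key, obj):
        self._idle[key].append(obj)
```

With such a pool, repeatedly borrowing and releasing a consumer for one topic partition reuses a single object instead of recreating it each time, which mirrors the drop from 1986 consumer creations to 8 in the experiment above.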
### What changes were proposed in this pull request? Document SHOW TBLPROPERTIES statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing that issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After:** ### How was this patch tested? Tested using `jekyll build --serve`. Closes apache#25571 from dilipbiswal/ref-show-tblproperties. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? Document SHOW FUNCTIONS statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing that issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After:** <img width="589" alt="Screen Shot 2019-09-04 at 11 41 44 AM" src="https://user-images.githubusercontent.com/11567269/64281911-0fe79000-cf09-11e9-955f-21b44590707c.png"> <img width="572" alt="Screen Shot 2019-09-04 at 11 41 54 AM" src="https://user-images.githubusercontent.com/11567269/64281916-12e28080-cf09-11e9-9187-688c2c751559.png"> ### How was this patch tested? Tested using `jekyll build --serve`. Closes apache#25539 from dilipbiswal/ref-doc-show-functions. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: 1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or 2. As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise 3. As a bypassed condition (`true`). ### Why are the changes needed? This is an important performance feature. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT - Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature. - Testing DPP with reused broadcast results. - Testing the key iterators on different HashedRelation types. - Testing the packing and unpacking of the broadcast keys in a LongType. Closes apache#25600 from maryannxue/dpp. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
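The three-way heuristic described above can be sketched as a small decision function. This is an illustrative Python rendering of the logic, not Spark's planner code; the function and parameter names are made up for the sketch:

```python
def plan_pruning_filter(is_broadcast_hash_join, scan_savings, subquery_cost):
    """Decide how to plan the dynamic-partition-pruning filter.

    scan_savings:  estimated benefit of the partition-table scan being saved
    subquery_cost: estimated cost of the extra scan of the duplicated subquery
    """
    if is_broadcast_hash_join:
        # 1. Reuse the broadcast relation as a reused broadcast exchange.
        return "reused_broadcast"
    if scan_savings > subquery_cost:
        # 2. Duplicate the dim-table filter as a subquery.
        return "duplicated_subquery"
    # 3. Bypass: the pruning condition degenerates to `true`.
    return "bypassed_true"
```

The ordering matters: broadcast reuse is essentially free (the exchange already exists), so it is preferred before any cost comparison is made.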
…ion-tests ### What changes were proposed in this pull request? Per apache#25640 (comment) also bump K8S client version in integration-tests module. ### Why are the changes needed? Harmonize the version as intended. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes apache#25664 from srowen/SPARK-28921.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request? Replaces some incorrect usage of `new Configuration()` as it will load default configs defined in Hadoop ### Why are the changes needed? Unexpected config could be accessed instead of the expected config, see SPARK-28203 for example ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existed tests. Closes apache#25616 from advancedxy/remove_invalid_configuration. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request? Rename `UnresolvedTable` to `V1Table` because it is not unresolved. ### Why are the changes needed? The class name is inaccurate. This should be fixed before it is in a release. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes apache#25683 from rdblue/SPARK-28979-rename-unresolved-table. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ence ### What changes were proposed in this pull request? Document UNCACHE TABLE statement in SQL Reference ### Why are the changes needed? To complete the SQL Reference ### Does this PR introduce any user-facing change? Yes. After change: ### How was this patch tested? Tested using `jekyll build --serve`. Closes apache#25540 from huaxingao/spark-28830. Lead-authored-by: Huaxin Gao <huaxing@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? Use `KeyLock` added in apache#25612 to simplify `MapOutputTracker.getStatuses`. It also has some improvements after the refactoring: - `InterruptedException` is no longer swallowed. - When a shuffle block is fetched, we don't need to wake up unrelated sleeping threads. ### Why are the changes needed? `MapOutputTracker.getStatuses` is pretty hard to maintain right now because it has a special lock mechanism which we need to pay attention to whenever updating this method. As we can use `KeyLock` to hide the complexity of locking behind a dedicated lock class, it's better to refactor it to make it easy to understand and maintain. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes apache#25680 from zsxwing/getStatuses. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
… saveAsTable ### What changes were proposed in this pull request? Adds the provider information to the table properties in saveAsTable. ### Why are the changes needed? Otherwise, catalog implementations don't know what kind of Table definition to create. ### Does this PR introduce any user-facing change? nope ### How was this patch tested? Existing unit tests check the existence of the provider now. Closes apache#25669 from brkyvz/provider. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…batches ### What changes were proposed in this pull request? Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution. ### Why are the changes needed? Removes an extra projection and copy. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes apache#25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
# Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala # sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala
Member
you failed to rebase...
Author
Sorry, I am a community novice. Should I create a new PR?
Member
you can fix it by a git command, but reopening is easier I think.
### What changes were proposed in this pull request?
Support insertInto a specific table partition.
### Why are the changes needed?
Make the API more convenient.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Used the API to write data to a partitioned Hive table:
```
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
```
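The partition spec passed above is a plain string of `key='value'` pairs. As an illustration of what such a spec encodes, here is a hypothetical Python sketch that parses one into a column-to-value map (this is not the PR's actual parser, which lives in the Scala `DataFrameWriter` changes; it assumes the simple comma-separated form shown above):

```python
def parse_partition_spec(spec):
    """Parse "pt1='2018',pt2='0601'" into {"pt1": "2018", "pt2": "0601"}.
    Illustrative only: assumes no quoted commas or escaped characters."""
    out = {}
    for pair in spec.split(","):
        key, value = pair.split("=", 1)
        out[key.strip()] = value.strip().strip("'")
    return out
```

A static partition spec like this tells the writer exactly which partition directory to insert into, instead of deriving partition values from the DataFrame's columns.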