
[HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer #3120

Merged
merged 1 commit into apache:master from dev_metasync on Jul 13, 2021

Conversation

pengzhiwei2018
Contributor

[HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer

What is the purpose of the pull request

Currently we only support reading a Hudi table as a datasource table in Spark, added in #2283.
That PR only syncs the Spark table properties from HoodieSparkSqlWriter, so they cannot be used by other engines such as Flink.
In order to support this feature for Flink and DeltaStreamer, we need to sync the Spark table properties required by a datasource table to the metastore in HiveSyncTool.
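As an illustration, the kind of properties involved looks roughly like the minimal sketch below. The values are hypothetical (schema JSON and partition column are made up), but the property keys (spark.sql.sources.provider, spark.sql.sources.schema.*) are the ones Spark expects for a datasource table registered in the Hive metastore.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch (hypothetical values) of the Spark datasource table properties
// that need to end up in the Hive metastore so that Spark reads the table
// through its datasource relation instead of as a plain Hive table.
public class SparkDataSourcePropsExample {
  public static void main(String[] args) {
    Map<String, String> tableProps = new LinkedHashMap<>();
    tableProps.put("spark.sql.sources.provider", "hudi");
    // The table schema is stored as JSON and split into numbered parts so that
    // no single metastore value exceeds the configured length threshold.
    tableProps.put("spark.sql.sources.schema.numParts", "1");
    tableProps.put("spark.sql.sources.schema.part.0",
        "{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\","
            + "\"nullable\":true,\"metadata\":{}}]}");
    tableProps.put("spark.sql.sources.schema.numPartCols", "1");
    tableProps.put("spark.sql.sources.schema.partCol.0", "dt");
    tableProps.forEach((k, v) -> System.out.println(k + " = " + v));
  }
}
```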

Brief change log

Sync the Spark table properties and serde properties needed by a Spark datasource table in HiveSyncTool, so that all engines can use this feature.

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot

hudi-bot commented Jun 21, 2021

CI report:

  • aaca30fffd1ea37f803f51ef3cf49c59ed79badc UNKNOWN
  • fcd06c8bccfc90b272b51d3511094e6617ec25bd UNKNOWN
  • 96947d0419df5f8bab10072eb64afecd29326e55 UNKNOWN
  • 02acd1127b72470f6d7adffb787179f0cddfa954 UNKNOWN
  • 504a6770be5d4cd3a78d61129be5b1aaadd515df UNKNOWN
  • 75aadbc834d6606527764468dd3dbcb1e802b171 UNKNOWN
  • f14ffb1f08820146e5d26616aa9b956ff99ec604 UNKNOWN
  • 06dff3c437b7b3f1aa227b700cf8c34669b067ed UNKNOWN
  • 97ba05a69199cff86cebbe25732097e3a68284f1 UNKNOWN
  • 3948fff7aacd6c97dcbe053a59a1208dae875607 UNKNOWN
  • 8ff6a0af2f53984c5864b04156a5b942400811c3 UNKNOWN
  • 3bb76014c4a7c7eb58a4f2c382f83bde474995c7 UNKNOWN
  • 5bbcd6a3d7460f76fee4c539c5b8bb9aeb1dcdd8 UNKNOWN
  • 1742e1831691ef9ebbf98d3fa29fe24aa1077072 UNKNOWN
  • 09020d66b59cb051cccacd894203ee7c6859ee3e UNKNOWN
  • 0fb47c68d425f44a4b3a0be43a047215f2f37a83 UNKNOWN
  • ffa9341 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

codecov-commenter commented Jun 21, 2021

Codecov Report

Merging #3120 (ffa9341) into master (5804ad8) will increase coverage by 0.01%.
The diff coverage is 70.67%.


@@             Coverage Diff              @@
##             master    #3120      +/-   ##
============================================
+ Coverage     47.72%   47.74%   +0.01%     
- Complexity     5528     5555      +27     
============================================
  Files           934      935       +1     
  Lines         41457    41536      +79     
  Branches       4166     4180      +14     
============================================
+ Hits          19786    19830      +44     
- Misses        19914    19943      +29     
- Partials       1757     1763       +6     
| Flag | Coverage Δ |
| --- | --- |
| hudicli | 39.97% <ø> (ø) |
| hudiclient | 34.45% <ø> (ø) |
| hudicommon | 48.58% <ø> (+0.02%) ⬆️ |
| hudiflink | 60.03% <ø> (ø) |
| hudihadoopmr | 51.29% <ø> (ø) |
| hudisparkdatasource | 67.20% <55.55%> (-0.47%) ⬇️ |
| hudisync | 55.73% <71.77%> (+1.22%) ⬆️ |
| huditimelineservice | 64.07% <ø> (ø) |
| hudiutilities | 59.26% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| ...in/java/org/apache/hudi/hive/util/ConfigUtils.java | 73.91% <ø> (ø) |
| ...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 71.51% <40.00%> (-0.53%) ⬇️ |
| ...pache/hudi/hive/util/Parquet2SparkSchemaUtils.java | 56.94% <56.94%> (ø) |
| ...src/main/scala/org/apache/hudi/DefaultSource.scala | 74.77% <75.00%> (-0.46%) ⬇️ |
| ...c/main/java/org/apache/hudi/hive/HiveSyncTool.java | 77.84% <91.48%> (+5.49%) ⬆️ |
| ...main/java/org/apache/hudi/hive/HiveSyncConfig.java | 98.24% <100.00%> (+0.16%) ⬆️ |
| ...in/scala/org/apache/hudi/HoodieStreamingSink.scala | 28.00% <0.00%> (-10.40%) ⬇️ |
| ...ache/hudi/common/fs/inline/InMemoryFileSystem.java | 89.65% <0.00%> (+10.34%) ⬆️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5804ad8...ffa9341.

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 6 times, most recently from 716127b to 8d23e1d on June 22, 2021 06:43
@pengzhiwei2018
Contributor Author

Hi @umehrot2, currently HoodieBootstrapRelation cannot read a bootstrap MOR rt table; it returns the same result as the ro table. After we sync the Spark datasource properties in HiveSyncTool in this PR, the test case ITTestHoodieDemo#testSparkSQLAfterSecondBatch will fail, because Spark will read the bootstrap MOR rt table through HoodieBootstrapRelation.
So I think we should support reading the rt table in HoodieBootstrapRelation before we can merge this PR. Can you give me some suggestions? Thanks!

@vinothchandar vinothchandar added this to Ready for Review in PR Tracker Board Jun 23, 2021
@nsivabalan
Contributor

@pengzhiwei2018: Looks like an integ test is failing. Can you please check that out?
ITTestHoodieDemo.testParquetDemo

@pengzhiwei2018
Contributor Author

@pengzhiwei2018: Looks like an integ test is failing. Can you please check that out?
ITTestHoodieDemo.testParquetDemo

Hi @nsivabalan, the test failure is because we cannot read a bootstrap MOR rt table through the Spark datasource path. This PR enables the datasource table by default, which causes the failure. I disabled the datasource sync in HiveSyncTool for the bootstrap MOR table to avoid the problem (see the sketch below). After https://issues.apache.org/jira/browse/HUDI-2071 is resolved, we can remove this limitation.
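For reference, a minimal sketch of the guard described above; the class, method, and parameter names are hypothetical, not the actual HiveSyncTool code:

```java
// Hypothetical guard: only sync the Spark datasource properties when the table
// is not a bootstrapped MOR table, until HUDI-2071 adds rt-view support to
// HoodieBootstrapRelation.
public class DataSourceSyncGuard {
  static boolean shouldSyncAsDataSourceTable(boolean syncAsSparkDataSourceTable,
                                             boolean isBootstrapTable,
                                             boolean isMergeOnRead) {
    return syncAsSparkDataSourceTable && !(isBootstrapTable && isMergeOnRead);
  }
}
```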

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 11 times, most recently from adba2dc to dbcd6ae on July 2, 2021 03:07
@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 2 times, most recently from 936f3c5 to 5497716 on July 6, 2021 03:18
@pengzhiwei2018 pengzhiwei2018 mentioned this pull request Jul 6, 2021
@yanghua yanghua self-assigned this Jul 6, 2021
Contributor

@yanghua yanghua left a comment


@pengzhiwei2018 Left some comments.

@@ -110,6 +110,12 @@
@Parameter(names = {"--batch-sync-num"}, description = "The number of partitions one batch when synchronous partitions to hive")
public Integer batchSyncNum = 1000;

@Parameter(names = {"--sparkDataSource"}, description = "Whether save this table as spark data source table.")
Contributor

Let's keep the style consistent? Using - to split words?

Contributor Author

done!

@Parameter(names = {"--sparkDataSource"}, description = "Whether save this table as spark data source table.")
public Boolean saveAsSparkDataSourceTable = true;

@Parameter(names = {"--spark-schemaLengthThreshold"}, description = "The maximum length allowed in a single cell when storing additional schema information in Hive's metastore.")
Contributor

ditto

@@ -141,6 +141,13 @@

<include>org.apache.hbase:hbase-common</include>
<include>commons-codec:commons-codec</include>
<include>org.apache.spark:spark-sql_${scala.binary.version}</include>
Contributor

Do we really need to include the Spark dependencies in Flink's bundle?

Contributor Author

If we enable Hive sync for Flink, we need to include the Spark dependencies, because currently HiveSyncTool needs them to generate the Spark table properties.
One thing I can improve here is changing the scope of the Spark dependencies to ${flink.bundle.hive.scope}.

Contributor

@danny0405 Any thoughts on how we can optimize this? IMO, it does not seem very graceful.

Contributor Author

Well, I think I can remove the Spark dependencies by writing a util to convert the Parquet schema to the Spark schema JSON without the Spark lib. It may need some more work.

</dependencies>


Contributor

Useless empty line?

Contributor Author

fixed!

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 5 times, most recently from a2a3e94 to a7f17df on July 7, 2021 07:53
Contributor

@yanghua yanghua left a comment


Left some comments.

@@ -26,6 +26,7 @@ import org.apache.hudi.common.model.HoodieTableType.{COPY_ON_WRITE, MERGE_ON_REA
import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
import org.apache.hudi.exception.HoodieException
import org.apache.hudi.hadoop.HoodieROTablePathFilter
import org.apache.hudi.hive.util.ConfigUtils
Contributor

split it via an empty line

Comment on lines 51 to 52
import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}
Contributor

wrong position

@@ -160,6 +168,8 @@ public String toString() {
+ ", supportTimestamp=" + supportTimestamp
+ ", decodePartition=" + decodePartition
+ ", createManagedTable=" + createManagedTable
+ ", saveAsSparkDataSourceTable=" + syncAsSparkDataSourceTable
Contributor

save → sync?

Contributor Author

done!

@@ -141,6 +141,13 @@

<include>org.apache.hbase:hbase-common</include>
<include>commons-codec:commons-codec</include>
<include>org.apache.spark:spark-sql_${scala.binary.version}</include>
Contributor

@danny0405 Any thoughts on how we can optimize this? IMO, it does not seem very graceful.

@yanghua
Contributor

yanghua commented Jul 7, 2021

@leesf Would like to receive your input on this PR.

@danny0405
Contributor

@yanghua We should not include the spark-sql dependency in hudi-flink-bundle.

@vinothchandar vinothchandar changed the title [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And Del… [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer Jul 8, 2021
@vinothchandar
Member

+1 on @danny0405 's comment on the deps

@pengzhiwei2018
Contributor Author

pengzhiwei2018 commented Jul 8, 2021

+1 on @danny0405 's comment on the deps

I am writing a util to convert the Parquet schema to the Spark schema JSON string without the Spark dependencies. After this, we can drop the Spark dependency.

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 2 times, most recently from 4ed02dc to b5bf84a on July 9, 2021 05:03
@pengzhiwei2018
Contributor Author

Hi @yanghua, I have written a util to convert the Parquet schema to the Spark schema JSON without the Spark dependencies. Please take a look again.

@yanghua
Contributor

yanghua commented Jul 9, 2021

Hi @yanghua, I have written a util to convert the Parquet schema to the Spark schema JSON without the Spark dependencies. Please take a look again.

OK, sounds good. Will review it soon.

Contributor

@yanghua yanghua left a comment


Left some comments.

@@ -70,6 +69,10 @@
return Arrays.asList(new Object[][] {{true, true, true}, {true, false, false}, {false, true, true}, {false, false, false}});
}

private static Iterable<Object[]> useJdbcAndSchemaFromCommitMetadataAndSyncAsDataSource() {
Contributor

IMO, this method name is too long and does not explain the purpose of the method body.

Contributor Author

Yes, will rename it.


import static org.junit.jupiter.api.Assertions.assertEquals;

public class TestParquet2SparkSchemaUtils {
Contributor

Can we add more test cases for the multiple integer types?

Contributor Author

done!

partitionCols.add(column2Field.getOrDefault(partitionName,
new PrimitiveType(Type.Repetition.REQUIRED, BINARY, partitionName, UTF8)));
}
for (Type field : originGroupType.getFields()) {
Contributor

another one

Contributor Author

fixed

public Boolean syncAsSparkDataSourceTable = true;

@Parameter(names = {"--spark-schema-length-threshold"}, description = "The maximum length allowed in a single cell when storing additional schema information in Hive's metastore.")
public int sparkSchemaLengthThreshold = 4000;
Contributor

Is 4000 a special value here?

Contributor Author

It is the default value of the Spark conf spark.sql.sources.schemaStringLengthThreshold (see the sketch below).
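For illustration, a minimal sketch (the helper class and method names are made up) of how a schema JSON string longer than the threshold could be split into the numbered spark.sql.sources.schema.part.N properties:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: split a schema JSON string across numbered table
// properties so that no single value exceeds the threshold (default 4000,
// mirroring spark.sql.sources.schemaStringLengthThreshold).
public class SchemaPropertySplitter {
  static Map<String, String> splitSchema(String schemaJson, int threshold) {
    Map<String, String> props = new LinkedHashMap<>();
    int numParts = Math.max(1, (schemaJson.length() + threshold - 1) / threshold);
    props.put("spark.sql.sources.schema.numParts", String.valueOf(numParts));
    for (int i = 0; i < numParts; i++) {
      int start = i * threshold;
      int end = Math.min(start + threshold, schemaJson.length());
      props.put("spark.sql.sources.schema.part." + i, schemaJson.substring(start, end));
    }
    return props;
  }
}
```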

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 2 times, most recently from 09020d6 to ec60b9c on July 12, 2021 03:59
* Config stored in hive serde properties to tell query engine (spark/flink) to
* read the table as a read-optimized table when this config is true.
*/
public static final String IS_QUERY_AS_RO_TABLE = "hoodie.query.as.ro.table";

Contributor

What's the relationship between this key IS_QUERY_AS_RO_TABLE and SPARK_QUERY_AS_RO_KEY / SPARK_QUERY_AS_RT_KEY?

Contributor Author

SPARK_QUERY_AS_RO_KEY was introduced by #2925 for the Spark SQL writer to pass some params, and it can only be used by the Spark engine. We do not need it in this PR any more; we use IS_QUERY_AS_RO_TABLE instead, which can be used by both Spark and Flink (see the sketch below).
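For illustration, a minimal sketch (not the actual Hudi code path; class and method names are made up) of how an engine could consult this serde property when deciding between the read-optimized and real-time view:

```java
import java.util.Map;

// Sketch only: read the hoodie.query.as.ro.table serde property synced by
// HiveSyncTool and decide whether this table registration should be queried
// as a read-optimized view or as a real-time view.
public class QueryViewResolver {
  private static final String IS_QUERY_AS_RO_TABLE = "hoodie.query.as.ro.table";

  static String resolveQueryView(Map<String, String> serdeProperties) {
    boolean readAsRo = Boolean.parseBoolean(
        serdeProperties.getOrDefault(IS_QUERY_AS_RO_TABLE, "false"));
    return readAsRo ? "read_optimized" : "real_time";
  }
}
```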

Contributor

Then we should add some comments and some strategy to deprecate (or delete) them.

Contributor Author

PR #2925 has not been released yet, so we can just delete them.

* Convert the parquet schema to spark schema's json string.
* This code refers to org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter
* in the spark project.
*/
Contributor

Why not just use ParquetToSparkSchemaConverter directly?

Contributor Author

Using ParquetToSparkSchemaConverter directly would require the Spark dependencies for hive-sync, and the flink-bundle would then also need Spark. To remove the Spark dependencies, I wrote this util (sketched below).
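A heavily simplified sketch of that idea (not the actual Parquet2SparkSchemaUtils implementation): walk the Parquet MessageType with the parquet-mr API and emit Spark's Catalyst struct JSON directly. Only a few primitive types are handled here; the real converter also has to deal with logical types, decimals, groups, lists and maps.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

// Sketch: convert a flat Parquet schema to Spark's schema JSON without any
// Spark dependency by emitting the Catalyst struct format by hand.
public class ParquetToSparkJsonSketch {

  static String toSparkSchemaJson(MessageType parquetSchema) {
    StringBuilder json = new StringBuilder("{\"type\":\"struct\",\"fields\":[");
    for (int i = 0; i < parquetSchema.getFieldCount(); i++) {
      Type field = parquetSchema.getType(i);
      if (i > 0) {
        json.append(',');
      }
      boolean nullable = field.getRepetition() != Type.Repetition.REQUIRED;
      json.append("{\"name\":\"").append(field.getName())
          .append("\",\"type\":\"").append(sparkTypeName(field.asPrimitiveType()))
          .append("\",\"nullable\":").append(nullable)
          .append(",\"metadata\":{}}");
    }
    return json.append("]}").toString();
  }

  private static String sparkTypeName(PrimitiveType primitive) {
    switch (primitive.getPrimitiveTypeName()) {
      case BOOLEAN: return "boolean";
      case INT32:   return "integer";
      case INT64:   return "long";
      case FLOAT:   return "float";
      case DOUBLE:  return "double";
      case BINARY:  return "string"; // assumes UTF8-annotated binary in this sketch
      default:
        throw new UnsupportedOperationException("Not handled in this sketch: " + primitive);
    }
  }
}
```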

Contributor

Okay, got your idea. It still looks a bit odd, because this Spark schema can only be recognized by the Spark engine. Looks like a special case for the writer here.

Contributor Author

Yes, Spark needs these schema configs to read the table as a datasource table. We have to sync them.

@@ -70,6 +69,10 @@
return Arrays.asList(new Object[][] {{true, true, true}, {true, false, false}, {false, true, true}, {false, false, false}});
}

private static Iterable<Object[]> syncDataSourceTableParams() {
return Arrays.asList(new Object[][] {{true, true, true}, {true, false, false}, {false, true, true}, {false, false, false}});
Contributor

Can we add some comments explaining what each flag means?

Contributor Author

done

@danny0405
Contributor

+1, LGTM.

PR Tracker Board automation moved this from Ready for Review to Nearing Landing Jul 13, 2021
@pengzhiwei2018 pengzhiwei2018 merged commit f0a2f37 into apache:master Jul 13, 2021
PR Tracker Board automation moved this from Nearing Landing to Done Jul 13, 2021