
[HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer #3120

Merged
merged 1 commit into apache:master from dev_metasync on Jul 13, 2021

Conversation

pengzhiwei2018
Contributor

[HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer

What is the purpose of the pull request

Currently we only support reading a Hudi table as a datasource table in Spark, added in #2283.
That PR only syncs the Spark table properties from HoodieSparkSqlWriter, so they cannot be used by other engines such as Flink.
In order to support this feature for Flink and DeltaStreamer, we need to sync the Spark table properties required by a datasource table to the metastore in HiveSyncTool.
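As an illustration, the kind of properties involved looks roughly like the minimal sketch below. The values are hypothetical (schema JSON and partition column are made up), but the property keys (spark.sql.sources.provider, spark.sql.sources.schema.*) are the ones Spark expects for a datasource table registered in the Hive metastore.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch (hypothetical values) of the Spark datasource table properties
// that need to end up in the Hive metastore so that Spark reads the table
// through its datasource relation instead of as a plain Hive table.
public class SparkDataSourcePropsExample {
  public static void main(String[] args) {
    Map<String, String> tableProps = new LinkedHashMap<>();
    tableProps.put("spark.sql.sources.provider", "hudi");
    // The table schema is stored as JSON and split into numbered parts so that
    // no single metastore value exceeds the configured length threshold.
    tableProps.put("spark.sql.sources.schema.numParts", "1");
    tableProps.put("spark.sql.sources.schema.part.0",
        "{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\","
            + "\"nullable\":true,\"metadata\":{}}]}");
    tableProps.put("spark.sql.sources.schema.numPartCols", "1");
    tableProps.put("spark.sql.sources.schema.partCol.0", "dt");
    tableProps.forEach((k, v) -> System.out.println(k + " = " + v));
  }
}
```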

Brief change log

Sync the Spark table properties and serde properties needed by a Spark datasource table in HiveSyncTool, so that all engines can use this feature.

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot

hudi-bot commented Jun 21, 2021

CI report:

  • aaca30fffd1ea37f803f51ef3cf49c59ed79badc UNKNOWN
  • fcd06c8bccfc90b272b51d3511094e6617ec25bd UNKNOWN
  • 96947d0419df5f8bab10072eb64afecd29326e55 UNKNOWN
  • 02acd1127b72470f6d7adffb787179f0cddfa954 UNKNOWN
  • 504a6770be5d4cd3a78d61129be5b1aaadd515df UNKNOWN
  • 75aadbc834d6606527764468dd3dbcb1e802b171 UNKNOWN
  • f14ffb1f08820146e5d26616aa9b956ff99ec604 UNKNOWN
  • 06dff3c437b7b3f1aa227b700cf8c34669b067ed UNKNOWN
  • 97ba05a69199cff86cebbe25732097e3a68284f1 UNKNOWN
  • 3948fff7aacd6c97dcbe053a59a1208dae875607 UNKNOWN
  • 8ff6a0af2f53984c5864b04156a5b942400811c3 UNKNOWN
  • 3bb76014c4a7c7eb58a4f2c382f83bde474995c7 UNKNOWN
  • 5bbcd6a3d7460f76fee4c539c5b8bb9aeb1dcdd8 UNKNOWN
  • 1742e1831691ef9ebbf98d3fa29fe24aa1077072 UNKNOWN
  • 09020d66b59cb051cccacd894203ee7c6859ee3e UNKNOWN
  • 0fb47c68d425f44a4b3a0be43a047215f2f37a83 UNKNOWN
  • ffa9341 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

codecov-commenter commented Jun 21, 2021

Codecov Report

Merging #3120 (ffa9341) into master (5804ad8) will increase coverage by 0.01%.
The diff coverage is 70.67%.


@@             Coverage Diff              @@
##             master    #3120      +/-   ##
============================================
+ Coverage     47.72%   47.74%   +0.01%     
- Complexity     5528     5555      +27     
============================================
  Files           934      935       +1     
  Lines         41457    41536      +79     
  Branches       4166     4180      +14     
============================================
+ Hits          19786    19830      +44     
- Misses        19914    19943      +29     
- Partials       1757     1763       +6     
| Flag | Coverage Δ |
| --- | --- |
| hudicli | 39.97% <ø> (ø) |
| hudiclient | 34.45% <ø> (ø) |
| hudicommon | 48.58% <ø> (+0.02%) ⬆️ |
| hudiflink | 60.03% <ø> (ø) |
| hudihadoopmr | 51.29% <ø> (ø) |
| hudisparkdatasource | 67.20% <55.55%> (-0.47%) ⬇️ |
| hudisync | 55.73% <71.77%> (+1.22%) ⬆️ |
| huditimelineservice | 64.07% <ø> (ø) |
| hudiutilities | 59.26% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| ...in/java/org/apache/hudi/hive/util/ConfigUtils.java | 73.91% <ø> (ø) |
| ...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 71.51% <40.00%> (-0.53%) ⬇️ |
| ...pache/hudi/hive/util/Parquet2SparkSchemaUtils.java | 56.94% <56.94%> (ø) |
| ...src/main/scala/org/apache/hudi/DefaultSource.scala | 74.77% <75.00%> (-0.46%) ⬇️ |
| ...c/main/java/org/apache/hudi/hive/HiveSyncTool.java | 77.84% <91.48%> (+5.49%) ⬆️ |
| ...main/java/org/apache/hudi/hive/HiveSyncConfig.java | 98.24% <100.00%> (+0.16%) ⬆️ |
| ...in/scala/org/apache/hudi/HoodieStreamingSink.scala | 28.00% <0.00%> (-10.40%) ⬇️ |
| ...ache/hudi/common/fs/inline/InMemoryFileSystem.java | 89.65% <0.00%> (+10.34%) ⬆️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5804ad8...ffa9341.

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 6 times, most recently from 716127b to 8d23e1d on June 22, 2021 06:43
@pengzhiwei2018
Contributor Author

Hi @umehrot2, currently HoodieBootstrapRelation cannot read a bootstrap MOR rt table; it returns the same result as the ro table. After we sync the Spark datasource properties in HiveSyncTool in this PR, the test case ITTestHoodieDemo#testSparkSQLAfterSecondBatch will fail, because Spark will read the bootstrap MOR rt table through HoodieBootstrapRelation.
So I think we should support reading the rt table in HoodieBootstrapRelation before we can merge this PR. Can you give me some suggestions? Thanks!

@vinothchandar vinothchandar added this to Ready for Review in PR Tracker Board Jun 23, 2021
@nsivabalan
Contributor

@pengzhiwei2018: Looks like an integ test is failing. Can you please check that out?
ITTestHoodieDemo.testParquetDemo

@pengzhiwei2018
Contributor Author

@pengzhiwei2018: Looks like an integ test is failing. Can you please check that out?
ITTestHoodieDemo.testParquetDemo

Hi @nsivabalan, the test failure is because we cannot read a bootstrap MOR rt table through the Spark datasource path. This PR enables the datasource table by default, which causes the failure. I disabled the datasource sync in HiveSyncTool for the bootstrap MOR table to avoid the problem (see the sketch below). After https://issues.apache.org/jira/browse/HUDI-2071 is resolved, we can remove this limitation.
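For reference, a minimal sketch of the guard described above; the class, method, and parameter names are hypothetical, not the actual HiveSyncTool code:

```java
// Hypothetical guard: only sync the Spark datasource properties when the table
// is not a bootstrapped MOR table, until HUDI-2071 adds rt-view support to
// HoodieBootstrapRelation.
public class DataSourceSyncGuard {
  static boolean shouldSyncAsDataSourceTable(boolean syncAsSparkDataSourceTable,
                                             boolean isBootstrapTable,
                                             boolean isMergeOnRead) {
    return syncAsSparkDataSourceTable && !(isBootstrapTable && isMergeOnRead);
  }
}
```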

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 11 times, most recently from adba2dc to dbcd6ae on July 2, 2021 03:07
@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 2 times, most recently from 936f3c5 to 5497716 on July 6, 2021 03:18
@pengzhiwei2018 pengzhiwei2018 mentioned this pull request Jul 6, 2021
@yanghua yanghua self-assigned this Jul 6, 2021
Contributor

@yanghua yanghua left a comment


@pengzhiwei2018 Left some comments.

@@ -110,6 +110,12 @@
@Parameter(names = {"--batch-sync-num"}, description = "The number of partitions one batch when synchronous partitions to hive")
public Integer batchSyncNum = 1000;

@Parameter(names = {"--sparkDataSource"}, description = "Whether save this table as spark data source table.")
Contributor

Let's keep the style consistent? Using - to split words?

Contributor Author

done!

@Parameter(names = {"--sparkDataSource"}, description = "Whether save this table as spark data source table.")
public Boolean saveAsSparkDataSourceTable = true;

@Parameter(names = {"--spark-schemaLengthThreshold"}, description = "The maximum length allowed in a single cell when storing additional schema information in Hive's metastore.")
Contributor

ditto

@@ -141,6 +141,13 @@

<include>org.apache.hbase:hbase-common</include>
<include>commons-codec:commons-codec</include>
<include>org.apache.spark:spark-sql_${scala.binary.version}</include>
Contributor

Do we really need to include the Spark dependencies in Flink's bundle?

Contributor Author

If we enable Hive sync for Flink, we need to include the Spark dependencies, because currently HiveSyncTool needs them to generate the Spark table properties.
One thing I can improve here is changing the scope of the Spark dependencies to ${flink.bundle.hive.scope}.

Contributor

@danny0405 Any thoughts on how we can optimize this? IMO, it does not seem very graceful.

Contributor Author

Well, I think I can remove the Spark dependencies by writing a util to convert the Parquet schema to the Spark schema JSON without the Spark lib. It may need some more work.

</dependencies>


Contributor

Useless empty line?

Contributor Author

fixed!

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 5 times, most recently from a2a3e94 to a7f17df on July 7, 2021 07:53
Contributor

@yanghua yanghua left a comment


Left some comments.

@@ -26,6 +26,7 @@ import org.apache.hudi.common.model.HoodieTableType.{COPY_ON_WRITE, MERGE_ON_REA
import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
import org.apache.hudi.exception.HoodieException
import org.apache.hudi.hadoop.HoodieROTablePathFilter
import org.apache.hudi.hive.util.ConfigUtils
Contributor

split it via an empty line

Comment on lines 51 to 52
import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}
Contributor

wrong position

@@ -160,6 +168,8 @@ public String toString() {
+ ", supportTimestamp=" + supportTimestamp
+ ", decodePartition=" + decodePartition
+ ", createManagedTable=" + createManagedTable
+ ", saveAsSparkDataSourceTable=" + syncAsSparkDataSourceTable
Contributor

save → sync?

Contributor Author

done!

@@ -141,6 +141,13 @@

<include>org.apache.hbase:hbase-common</include>
<include>commons-codec:commons-codec</include>
<include>org.apache.spark:spark-sql_${scala.binary.version}</include>
Contributor

@danny0405 Any thoughts on how we can optimize this? IMO, it does not seem very graceful.

@yanghua
Contributor

yanghua commented Jul 7, 2021

@leesf Would like to receive your input on this PR.

@danny0405
Contributor

@yanghua We should not include the spark-sql dependency in hudi-flink-bundle.

@vinothchandar vinothchandar changed the title [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And Del… [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer Jul 8, 2021
@vinothchandar
Member

+1 on @danny0405 's comment on the deps

@pengzhiwei2018
Contributor Author

pengzhiwei2018 commented Jul 8, 2021

+1 on @danny0405 's comment on the deps

I am writing a util to convert the Parquet schema to the Spark schema JSON string without the Spark dependencies. After this, we can drop the Spark dependency.

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 2 times, most recently from 4ed02dc to b5bf84a on July 9, 2021 05:03
@pengzhiwei2018
Contributor Author

Hi @yanghua, I have written a util to convert the Parquet schema to the Spark schema JSON without the Spark dependencies. Please take a look again.

@yanghua
Contributor

yanghua commented Jul 9, 2021

Hi @yanghua, I have written a util to convert the Parquet schema to the Spark schema JSON without the Spark dependencies. Please take a look again.

OK, sounds good. Will review it soon.

Contributor

@yanghua yanghua left a comment


Left some comments.

@@ -70,6 +69,10 @@
return Arrays.asList(new Object[][] {{true, true, true}, {true, false, false}, {false, true, true}, {false, false, false}});
}

private static Iterable<Object[]> useJdbcAndSchemaFromCommitMetadataAndSyncAsDataSource() {
Contributor

IMO, this method name is too long and does not explain the purpose of the method body.

Contributor Author

Yes, will rename it.


import static org.junit.jupiter.api.Assertions.assertEquals;

public class TestParquet2SparkSchemaUtils {
Contributor

Can we add more test cases for the multiple integer types?

Contributor Author

done!

partitionCols.add(column2Field.getOrDefault(partitionName,
new PrimitiveType(Type.Repetition.REQUIRED, BINARY, partitionName, UTF8)));
}
for (Type field : originGroupType.getFields()) {
Contributor

another one

Contributor Author

fixed

public Boolean syncAsSparkDataSourceTable = true;

@Parameter(names = {"--spark-schema-length-threshold"}, description = "The maximum length allowed in a single cell when storing additional schema information in Hive's metastore.")
public int sparkSchemaLengthThreshold = 4000;
Contributor

Is 4000 a special value here?

Contributor Author

It is the default value of the Spark conf spark.sql.sources.schemaStringLengthThreshold (see the sketch below).
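For illustration, a minimal sketch (the helper class and method names are made up) of how a schema JSON string longer than the threshold could be split into the numbered spark.sql.sources.schema.part.N properties:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: split a schema JSON string across numbered table
// properties so that no single value exceeds the threshold (default 4000,
// mirroring spark.sql.sources.schemaStringLengthThreshold).
public class SchemaPropertySplitter {
  static Map<String, String> splitSchema(String schemaJson, int threshold) {
    Map<String, String> props = new LinkedHashMap<>();
    int numParts = Math.max(1, (schemaJson.length() + threshold - 1) / threshold);
    props.put("spark.sql.sources.schema.numParts", String.valueOf(numParts));
    for (int i = 0; i < numParts; i++) {
      int start = i * threshold;
      int end = Math.min(start + threshold, schemaJson.length());
      props.put("spark.sql.sources.schema.part." + i, schemaJson.substring(start, end));
    }
    return props;
  }
}
```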

@pengzhiwei2018 pengzhiwei2018 force-pushed the dev_metasync branch 2 times, most recently from 09020d6 to ec60b9c on July 12, 2021 03:59
* Config stored in hive serde properties to tell query engine (spark/flink) to
* read the table as a read-optimized table when this config is true.
*/
public static final String IS_QUERY_AS_RO_TABLE = "hoodie.query.as.ro.table";

Contributor

What's the relationship between this key IS_QUERY_AS_RO_TABLE and SPARK_QUERY_AS_RO_KEY / SPARK_QUERY_AS_RT_KEY?

Contributor Author

SPARK_QUERY_AS_RO_KEY was introduced by #2925 for the Spark SQL writer to pass some params, and it can only be used by the Spark engine. We do not need it in this PR any more; we use IS_QUERY_AS_RO_TABLE instead, which can be used by both Spark and Flink (see the sketch below).
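For illustration, a minimal sketch (not the actual Hudi code path; class and method names are made up) of how an engine could consult this serde property when deciding between the read-optimized and real-time view:

```java
import java.util.Map;

// Sketch only: read the hoodie.query.as.ro.table serde property synced by
// HiveSyncTool and decide whether this table registration should be queried
// as a read-optimized view or as a real-time view.
public class QueryViewResolver {
  private static final String IS_QUERY_AS_RO_TABLE = "hoodie.query.as.ro.table";

  static String resolveQueryView(Map<String, String> serdeProperties) {
    boolean readAsRo = Boolean.parseBoolean(
        serdeProperties.getOrDefault(IS_QUERY_AS_RO_TABLE, "false"));
    return readAsRo ? "read_optimized" : "real_time";
  }
}
```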

Contributor

Then we should add some comments and some strategy to deprecate (or delete) them.

Contributor Author

PR #2925 has not been released yet, so we can just delete them.

* Convert the parquet schema to spark schema's json string.
* This code refers to org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter
* in the spark project.
*/
Contributor

Why not just use ParquetToSparkSchemaConverter directly?

Contributor Author

Using ParquetToSparkSchemaConverter directly would require the Spark dependencies for hive-sync, and the flink-bundle would then also need Spark. To remove the Spark dependencies, I wrote this util (sketched below).
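A heavily simplified sketch of that idea (not the actual Parquet2SparkSchemaUtils implementation): walk the Parquet MessageType with the parquet-mr API and emit Spark's Catalyst struct JSON directly. Only a few primitive types are handled here; the real converter also has to deal with logical types, decimals, groups, lists and maps.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

// Sketch: convert a flat Parquet schema to Spark's schema JSON without any
// Spark dependency by emitting the Catalyst struct format by hand.
public class ParquetToSparkJsonSketch {

  static String toSparkSchemaJson(MessageType parquetSchema) {
    StringBuilder json = new StringBuilder("{\"type\":\"struct\",\"fields\":[");
    for (int i = 0; i < parquetSchema.getFieldCount(); i++) {
      Type field = parquetSchema.getType(i);
      if (i > 0) {
        json.append(',');
      }
      boolean nullable = field.getRepetition() != Type.Repetition.REQUIRED;
      json.append("{\"name\":\"").append(field.getName())
          .append("\",\"type\":\"").append(sparkTypeName(field.asPrimitiveType()))
          .append("\",\"nullable\":").append(nullable)
          .append(",\"metadata\":{}}");
    }
    return json.append("]}").toString();
  }

  private static String sparkTypeName(PrimitiveType primitive) {
    switch (primitive.getPrimitiveTypeName()) {
      case BOOLEAN: return "boolean";
      case INT32:   return "integer";
      case INT64:   return "long";
      case FLOAT:   return "float";
      case DOUBLE:  return "double";
      case BINARY:  return "string"; // assumes UTF8-annotated binary in this sketch
      default:
        throw new UnsupportedOperationException("Not handled in this sketch: " + primitive);
    }
  }
}
```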

Contributor

Okay, got your idea. It still looks a bit odd, because this Spark schema can only be recognized by the Spark engine. Looks like a special case for the writer here.

Contributor Author

Yes, Spark needs these schema configs to read the table as a datasource table. We have to sync them.

@@ -70,6 +69,10 @@
return Arrays.asList(new Object[][] {{true, true, true}, {true, false, false}, {false, true, true}, {false, false, false}});
}

private static Iterable<Object[]> syncDataSourceTableParams() {
return Arrays.asList(new Object[][] {{true, true, true}, {true, false, false}, {false, true, true}, {false, false, false}});
Contributor

Can we add some comments explaining what each flag means?

Contributor Author

done

@danny0405
Contributor

+1, LGTM.

PR Tracker Board automation moved this from Ready for Review to Nearing Landing Jul 13, 2021
@pengzhiwei2018 pengzhiwei2018 merged commit f0a2f37 into apache:master Jul 13, 2021
PR Tracker Board automation moved this from Nearing Landing to Done Jul 13, 2021