
[HUDI-2063] Add Doc For Spark Sql Integrates With Hudi #3140

Merged: 6 commits merged into apache:asf-site on Aug 12, 2021

Conversation

pengzhiwei2018
Contributor

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Hudi supports using spark sql to write and read data with the **HoodieSparkSessionExtension** sql extension.
```shell
# spark sql for spark 3
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Contributor

0.8.0 does not support spark sql? move to 0.9.0-SNAPSHOT and would update the version when releasing.

Contributor Author

Yes, will fix this.
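For context, the quoted hunk cuts off after the first line of the command. The full launch command would look roughly like the sketch below; the two `--conf` lines (and the extension class path) are assumptions filling in what the diff truncates, not text from the patch:

```shell
# Sketch of the full command; the --conf lines are assumed, not quoted from the diff
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```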

ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)

-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
Contributor

support adding partitions?

Contributor Author

Yes, it is the same as for other spark datasource tables, so I have not covered ALTER PARTITION here.
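To make the quoted grammar concrete, a hedged sketch; the table `h0` and its columns are hypothetical:

```sql
-- Hypothetical table h0: add a column, then change a column's type.
-- In CHANGE COLUMN the name appears twice: old name, then new name and type.
ALTER TABLE h0 ADD COLUMNS (remark STRING);
ALTER TABLE h0 CHANGE COLUMN price price DOUBLE;
```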

options (
primaryKey = 'id',
type = 'mor',
hoodie.index.type = 'GLOBAL_BLOOM'
Contributor

Here we should describe how to set all hoodie configs through options?

Contributor Author

Users can set hoodie configs in two ways: 1. using the SET command; 2. using table options. I'm not clear what "set all hoodie config through options" means?

Contributor

What I mean here is that we should describe how to set hoodie configs such as hoodie.datasource.write.operation and other configs.

Contributor Author

OK, I will add some notes here.
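A hedged sketch of the two ways just described; the config names and values shown are illustrative:

```sql
-- Way 1: session level, via the SET command
set hoodie.datasource.write.operation = upsert;

-- Way 2: table level, via table options at create time
create table if not exists h0 (
  id int,
  name string,
  price double
) using hudi
options (
  primaryKey = 'id',
  type = 'mor',
  hoodie.index.type = 'GLOBAL_BLOOM'
);
```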

type = 'cow'
);
```
**Create Non-Partitioned Table**
Contributor

partitioned ?

Contributor Author

Good catch, will fix this soon.


| Parameter Name | Introduction |
|------------|--------|
| primaryKey | The primary key names of the table, multiple fields separated by commas. |
Contributor

Do we support TimestampBasedKeyGenerator now? If yes, more configs may be needed.

Contributor Author @pengzhiwei2018 Jun 28, 2021

We already support all kinds of partition data types, including date & timestamp, via the sql internal key generator SqlKeyGenerator. So users do not need to set a custom key generator.
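As a hedged illustration of that point, a hypothetical table with a date-typed partition column, which SqlKeyGenerator would handle without any extra key generator config:

```sql
-- Hypothetical table h1 with a date-typed partition column
create table if not exists h1 (
  id int,
  name string,
  price double,
  dt date
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id',
  type = 'cow'
);
```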

@pengzhiwei2018 force-pushed the dev_sql_doc branch 4 times, most recently from 1bebbd8 to b3a34be on June 28, 2021 12:02
Contributor @wangxianghu left a comment

LGTM

Member @vinothchandar left a comment

I am cool to land this and then iterate. I am not sure if this should sit inside Quickstart or on a separate page; we can discuss that separately.

Member @vinothchandar

@leesf Leave it to you to land when ready

Contributor @leesf commented Jul 12, 2021

> @leesf Leave it to you to land when ready

@vinothchandar I am going to land it after we cut the release, since the spark shell examples use the 0.9.0 version.

@@ -300,6 +300,268 @@ spark.
show(100, false)
```

Contributor

@pengzhiwei2018 Hello, does spark sql now support compaction, cleaner, and clustering?

Contributor Author

Yes, I am planning to support these in follow-up PRs.

Member @vinothchandar

@leesf @pengzhiwei2018 Can we redo this based on the new site? We have code tabs, so we can just add a sql code tab for this and make it very simple.

@@ -119,6 +119,11 @@ By default, Spark SQL will try to use its own parquet reader instead of Hive Ser
both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`.
This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark).

**NOTICE**

Since 0.9.0, hudi will sync the table to hive as a spark datasource table, so we no longer need `spark.sql.hive.convertMetastoreParquet=false`.
Contributor

Not required in this patch, but just curious: did we test an upgrade scenario? i.e. an already existing table that was created w/ 0.8.0 and upgraded to 0.9.0, where flipping the hive sync by default works smoothly w/o any issues. If not, can you create a follow-up ticket? I can look into it.

Contributor Author

Well, for an existing old table, hive sync will not add the spark datasource table properties to the table, so users still need this config when querying.
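In other words, for such pre-0.9.0 tables the session setting from the quoted doc text still applies; a one-line illustration:

```sql
-- Still needed when querying tables synced before 0.9.0
set spark.sql.hive.convertMetastoreParquet = false;
```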

website/docs/querying_data.md
]}>
<TabItem value="scala">

```scala
// spark-shell for spark 3
spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Contributor

Guess we should not update these versions in this patch. As part of the release, the release manager will do a one-time update of all such versions. Can we revert these, please?

Contributor Author

OK

Hudi supports using spark sql to write and read data with the **HoodieSparkSessionExtension** sql extension.
```shell
# spark sql for spark 3
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Contributor

These versions can be left as 0.9.0 since this feature is available only in 0.9.0.

Contributor Author

Ok

@@ -51,17 +74,17 @@ export PYSPARK_PYTHON=$(which python3)

# for spark3
pyspark
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1
Contributor

Same comment as above. Let's not flip to 0.9.0 yet.

|------------|--------|
| primaryKey | The primary key names of the table, multiple fields separated by commas. |
| type | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Default value is 'cow' if this option is not specified.|
| preCombineField | The Pre-Combine field of the table. |
Contributor

Can we add details on CTAS as well? Either in this section or in the insert section (lines 250-ish).

Contributor

When you add info on CTAS, add a note that bulk_insert will be used with CTAS.
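A hedged sketch of what such a CTAS example might look like; the table `h2` and the selected literals are hypothetical:

```sql
-- Hypothetical CTAS; per the comment above, bulk_insert is the write operation used
create table h2 using hudi
options (primaryKey = 'id')
as
select 1 as id, 'a1' as name, 10.0 as price;
```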

Contributor

I would also suggest moving the "set options" section here, because it goes with create table. Maybe briefly talk about how to set hudi configs for a given table here, and dive into the details in the later section. As of now it's towards the very end, and I'm not sure users will get that far; some may skip those sections.

Contributor Author

Well, I think making the hudi config settings a separate part is more reasonable. When users want to find out how to set hudi configs via spark sql, they can easily find the info in that part. It's hard to associate this with Create Table, although we can do that in the table options.

Contributor

Sure, sounds good. But just a one-liner like "to set custom hoodie configs for a table, check the Set Hudi Config section" would help.


In non-strict mode, hudi just does the insert operation for the pk-table.

We can set the inset mode by the config: **hoodie.sql.insert.mode**
Contributor

typo. "insert"

Contributor Author

done!
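For reference, a hedged one-liner showing the config in use; upsert, strict, and non-strict are the modes this discussion implies:

```sql
-- Illustrative: pick the insert mode for pk tables
set hoodie.sql.insert.mode = non-strict;
```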

when matched then update set price = s0.price * 2
```
This works well for Cow-On-Write table which support update only the **price** field. But it do not work
for Merge-ON-READ table.
Contributor

Instead of saying something works for COW and does not work for MOR, we can word it differently: "This works well for Copy_On_Write, and support for Merge_On_Read will be added in a future release". Can you revisit the entire patch and fix all such phrases?

Contributor Author

Makes sense.

select("uuid","partitionpath").
show(10, false)

```

**NOTICE**

The insert overwrite non-partitioned table sql statement will convert to the ***insert_overwrite_table*** operation.
Contributor

Can we add an example command here?

Contributor Author

Makes sense.
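A hedged sketch of such a command, reusing the hypothetical non-partitioned table `h0` from earlier:

```sql
-- Overwrites the whole table; maps to the insert_overwrite_table operation
insert overwrite table h0 select 1, 'a1', 10.0;
```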

select("uuid","partitionpath").
sort("partitionpath","uuid").
show(100, false)
```
**NOTICE**

The insert overwrite partitioned table sql statement will convert to the ***insert_overwrite*** operation.
Contributor

Again, can we add an example command here please?
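A hedged sketch here too, reusing the hypothetical partitioned table `h1` from earlier:

```sql
-- Overwrites one static partition; maps to the insert_overwrite operation
insert overwrite table h1 partition (dt = '2021-12-01') select 2, 'a2', 20.0;
```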

Contributor @nsivabalan left a comment

Very few minor comments. I will push an update and land it by today. Good job on the docs :)


Hudi supports creating tables using spark-sql.

**Create Non-Partitioned Table**
Contributor

Is this feedback addressed? i.e. a 1-2 line brief intro about pk vs non-pk tables, partitioned vs non-partitioned tables, managed vs external tables.

Introduce a section called "Terminologies" or something at the beginning of the sql dml docs and explain these details there before we dive into create table commands.
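For the managed vs external distinction mentioned above, a hedged sketch; the table names and the location path are hypothetical:

```sql
-- Managed table: Spark manages the data under the warehouse directory
create table h_managed (id int, name string) using hudi
options (primaryKey = 'id');

-- External table: data lives at an explicit user-provided location
create table h_external (id int, name string) using hudi
options (primaryKey = 'id')
location '/tmp/hudi/h_external';
```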

@nsivabalan nsivabalan merged commit 252e906 into apache:asf-site Aug 12, 2021