[HUDI-2063] Add Doc For Spark Sql Integrates With Hudi #3140
Conversation
Hudi supports reading and writing data with Spark SQL via the **HoodieSparkSessionExtension** SQL extension.
```shell
# spark sql for spark 3
# (the --conf lines below complete the command truncated in the diff view;
#  HoodieSparkSessionExtension is the extension named above)
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.hudi.HoodieSparkSessionExtension'
```
0.8.0 does not support spark sql? Move to 0.9.0-SNAPSHOT and update the version when releasing.
Yes, will fix this.
```sql
ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)

-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
```
support adding partitions?
Yes, it is the same as for other spark datasource tables, so I have not called out alter partition here.
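For reference, a minimal sketch of the standard Spark SQL partition DDL the reply refers to (`hudi_tbl` and `dt` are placeholder names, not from the doc):
```sql
-- Standard Spark SQL partition DDL, same as for other datasource tables
-- per the reply above; `hudi_tbl` and `dt` are placeholders.
ALTER TABLE hudi_tbl ADD PARTITION (dt = '2021-06-01');
ALTER TABLE hudi_tbl DROP PARTITION (dt = '2021-06-01');
```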
```sql
options (
  primaryKey = 'id',
  type = 'mor',
  hoodie.index.type = 'GLOBAL_BLOOM'
```
Here we should describe how to set all the hoodie configs through options?
Users can set hoodie configs in two ways: 1. using the SET command; 2. using table options. I am not clear what "set all hoodie config through options" means?
What I mean here is that we should describe how to set hoodie configs such as hoodie.datasource.write.operation and other configs.
ok, I will add some notes here.
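To make the two ways concrete, a minimal sketch (the table name and schema are placeholders; `hoodie.datasource.write.operation` is the config named in the comment above):
```sql
-- Way 1: session level, via the SET command
set hoodie.datasource.write.operation = bulk_insert;

-- Way 2: table level, via table options at create time
-- (`hudi_mor_tbl` and its columns are placeholder names)
create table hudi_mor_tbl (
  id int,
  name string,
  price double
) using hudi
options (
  primaryKey = 'id',
  type = 'mor',
  hoodie.index.type = 'GLOBAL_BLOOM'
);
```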
(force-pushed 2a77233 to f45936b)
  type = 'cow'
);
```
**Create Non-Partitioned Table**
partitioned ?
Good catch, will fix this soon.
| Parameter Name | Introduction |
|------------|--------|
| primaryKey | The primary key names of the table, multiple fields separated by commas. |
Do we support TimestampBasedKeyGenerator now? If yes, more configs may be needed.
We already support all kinds of partition data types, including date & timestamp, via the sql internal key generator: SqlKeyGenerator. So users do not need to set a custom key generator.
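As a hedged illustration of that point (all names are placeholders): a table partitioned by a date-typed column needs no explicit key generator config, since SqlKeyGenerator handles the conversion internally:
```sql
-- Placeholder names; no key generator option is set because the sql
-- internal SqlKeyGenerator handles the date-typed partition column.
create table hudi_date_part_tbl (
  id int,
  name string,
  ts bigint,
  dt date
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id'
);
```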
(force-pushed 1bebbd8 to b3a34be)
LGTM
(force-pushed 3954495 to d5d04df)
I am cool to land this and then iterate. I am not sure if this should sit inside Quickstart or a separate page. We can discuss that separately.
@leesf Leave it to you to land when ready.
@vinothchandar I am going to land it after we cut the release, since the spark shell uses the 0.9.0 version.
@@ -300,6 +300,268 @@ spark.
show(100, false)
```
@pengzhiwei2018 hello, does spark sql now support compaction, cleaner, clustering?
Yes, I am planning to support this in the next PRs.
@leesf @pengzhiwei2018 can we redo this based on the new site? We have code tabs, so we can just add a sql code tab for this and make it very simple.
(force-pushed a8e8b75 to 4d57211)
@@ -119,6 +119,11 @@ By default, Spark SQL will try to use its own parquet reader instead of Hive Serde
both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`.
This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark).

**NOTICE**

Since 0.9.0, Hudi syncs the table to Hive as a Spark datasource table, so the `spark.sql.hive.convertMetastoreParquet=false` setting is no longer needed.
Not required in this patch, but just curious: did we test an upgrade scenario? i.e. an already existing table that was created w/ 0.8.0 and upgraded to 0.9.0, and flipping the hive sync by default works smoothly w/o any issues. If not, can you create a follow-up ticket? I can look into it.
Well, for an existing old table, hive sync will not add the spark datasource table properties to the table, so users still need this config when querying.
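In other words (a sketch; `old_hudi_table` is a placeholder name), querying such a pre-0.9.0 table still requires:
```sql
-- Still needed for tables synced to Hive before 0.9.0, per the reply above
set spark.sql.hive.convertMetastoreParquet = false;
select * from old_hudi_table;
```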
website/docs/quick-start-guide.md
]}>
<TabItem value="scala">

```scala
// spark-shell for spark 3
spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Guess we should not update these versions in this patch. As part of the release, the release manager will do a one-time update of all such versions. Can we revert these please?
OK
Hudi supports reading and writing data with Spark SQL via the **HoodieSparkSessionExtension** SQL extension.
```shell
# spark sql for spark 3
# (the --conf lines below complete the command truncated in the diff view)
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.hudi.HoodieSparkSessionExtension'
```
These versions can be left as 0.9.0 since this feature is available only in 0.9.0.
Ok
website/docs/quick-start-guide.md
@@ -51,17 +74,17 @@ export PYSPARK_PYTHON=$(which python3)

# for spark3
pyspark
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1
same comment as above. lets not flip to 0.9.0 yet
| Parameter Name | Introduction |
|------------|--------|
| primaryKey | The primary key names of the table, multiple fields separated by commas. |
| type | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. The default value is 'cow' if this option is not specified. |
| preCombineField | The Pre-Combine field of the table. |
Can we add details on CTAS as well, either in this section or in the insert section (lines 250 ish)?
when you add info on CTAS, add a note that bulk_insert will be used with CTAS
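A minimal sketch of the CTAS example being requested (table names are placeholders); per the note above, the bulk_insert operation is used for the write:
```sql
-- CTAS; bulk_insert is used for the write, per the review note above.
-- `hudi_ctas_tbl` and `parquet_src_tbl` are placeholder names.
create table hudi_ctas_tbl
using hudi
options (primaryKey = 'id')
as select id, name, price from parquet_src_tbl;
```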
I would also suggest moving the "set options" section here, because it goes with create table. Maybe briefly talk about how to set hudi configs for a given table here, and in the later section you can dive into details. But as of now, it's towards the very end; not sure if users will get to the end to read those sections, likely some may skip.
Well, I think making the hudi config settings a separate part is more reasonable. As users may want to find how to set hudi configs via spark sql, they can easily find the info in that part. But it's hard to associate this with Create Table, although we can do that in the table options.
sure, sounds good. but just a 1 liner like, "to set custom hoodie configs for a table, check Set Hudi Config section" would help.
website/docs/quick-start-guide.md
For non-strict mode, hudi just does the insert operation for the pk-table.

We can set the inset mode by the config: **hoodie.sql.insert.mode**
typo. "insert"
done!
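For illustration, a one-line sketch of setting that config (only the non-strict value described in the doc text above is assumed here):
```sql
-- non-strict mode: hudi just does the insert operation for the pk-table
set hoodie.sql.insert.mode = non-strict;
```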
website/docs/quick-start-guide.md
when matched then update set price = s0.price * 2
```
This works well for Cow-On-Write table which support update only the **price** field. But it do not work
for Merge-ON-READ table.
Instead of saying something works for COW and does not work for MOR, we can word it differently: "This works well for Copy_On_Write, and support for Merge_On_Read will be added in a future release". Can you revisit the entire patch and fix all such phrases?
Makes sense.
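For context, a hedged sketch of a complete MERGE INTO statement around the quoted fragment (table names and the join key are placeholders):
```sql
-- Placeholder names; only the `price` field is updated on match, which
-- is the partial-update case the comment above refers to.
merge into hudi_tbl as t0
using (select id, price from src_tbl) as s0
on t0.id = s0.id
when matched then update set price = s0.price * 2;
```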
select("uuid","partitionpath"). | ||
show(10, false) | ||
|
||
``` | ||
|
||
**NOTICE** | ||
|
||
The insert overwrite non-partitioned table sql statement will convert to the ***insert_overwrite_table*** operation. |
Can we add an example command here
Makes sense.
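A sketch of the example command being requested (placeholder names):
```sql
-- Overwrites the whole non-partitioned table; converted to the
-- insert_overwrite_table operation per the NOTICE above.
insert overwrite table hudi_nonpart_tbl
select id, name, price, ts from src_tbl;
```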
select("uuid","partitionpath"). | ||
sort("partitionpath","uuid"). | ||
show(100, false) | ||
``` | ||
**NOTICE** | ||
|
||
The insert overwrite partitioned table sql statement will convert to the ***insert_overwrite*** operation. |
again, can we add an example command here please
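Likewise, a sketch for the partitioned case (placeholder names; static partition syntax assumed):
```sql
-- Overwrites a single partition; converted to the insert_overwrite
-- operation per the NOTICE above. `dt` is a placeholder partition column.
insert overwrite table hudi_part_tbl partition (dt = '2021-06-01')
select id, name, price, ts from src_tbl;
```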
Very few minor comments. I will push an update and land it by today. Good job on the docs :)
Hudi supports creating tables using spark-sql.

**Create Non-Partitioned Table**
Is this feedback addressed? i.e. 1 to 2 line brief intro about pk, non-pk table, partitioned, non-partitioned table, managed vs external table.
Introduce a section called "Terminologies" or something at the beginning of sql dml and then explain these details there before we dive into create table commands.
Tips
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.