
[HUDI-2063] Add Doc For Spark Sql Integrates With Hudi #3140

Merged: 6 commits merged into apache:asf-site on Aug 12, 2021

Conversation

pengzhiwei2018
Contributor

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Hudi supports using spark sql to write and read data with the **HoodieSparkSessionExtension** sql extension.
```shell
# spark sql for spark 3
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Contributor

0.8.0 does not support spark sql? move to 0.9.0-SNAPSHOT and would update the version when releasing.

Contributor Author

Yes, will fix this.
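For context, the quoted hunk cuts off after the first line of the command. The full launch command would look roughly like the sketch below; the two `--conf` lines (and the extension class path) are assumptions filling in what the diff truncates, not text from the patch:

```shell
# Sketch of the full command; the --conf lines are assumed, not quoted from the diff
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```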

ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)

-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
Contributor

support adding partitions?

Contributor Author

Yes, it is the same as for other spark datasource tables, so I have not covered ALTER PARTITION here.
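To make the quoted grammar concrete, a hedged sketch; the table `h0` and its columns are hypothetical:

```sql
-- Hypothetical table h0: add a column, then change a column's type.
-- In CHANGE COLUMN the name appears twice: old name, then new name and type.
ALTER TABLE h0 ADD COLUMNS (remark STRING);
ALTER TABLE h0 CHANGE COLUMN price price DOUBLE;
```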

options (
primaryKey = 'id',
type = 'mor',
hoodie.index.type = 'GLOBAL_BLOOM'
Contributor

Here we should describe how to set all hoodie configs through options?

Contributor Author

Users can set hoodie configs in two ways: 1. using the SET command; 2. using table options. I'm not clear what "set all hoodie config through options" means?

Contributor

What I mean here is that we should describe how to set hoodie configs such as hoodie.datasource.write.operation and other configs.

Contributor Author

OK, I will add some notes here.
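A hedged sketch of the two ways just described; the config names and values shown are illustrative:

```sql
-- Way 1: session level, via the SET command
set hoodie.datasource.write.operation = upsert;

-- Way 2: table level, via table options at create time
create table if not exists h0 (
  id int,
  name string,
  price double
) using hudi
options (
  primaryKey = 'id',
  type = 'mor',
  hoodie.index.type = 'GLOBAL_BLOOM'
);
```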

type = 'cow'
);
```
**Create Non-Partitioned Table**
Contributor

partitioned ?

Contributor Author

Good catch, will fix this soon.


| Parameter Name | Introduction |
|------------|--------|
| primaryKey | The primary key names of the table, multiple fields separated by commas. |
Contributor

Do we support TimestampBasedKeyGenerator now? If yes, more configs may be needed.

Contributor Author @pengzhiwei2018 Jun 28, 2021

We already support all kinds of partition data types, including date & timestamp, via the sql internal key generator SqlKeyGenerator. So users do not need to set a custom key generator.
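As a hedged illustration of that point, a hypothetical table with a date-typed partition column, which SqlKeyGenerator would handle without any extra key generator config:

```sql
-- Hypothetical table h1 with a date-typed partition column
create table if not exists h1 (
  id int,
  name string,
  price double,
  dt date
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id',
  type = 'cow'
);
```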

@pengzhiwei2018 force-pushed the dev_sql_doc branch 4 times, most recently from 1bebbd8 to b3a34be on June 28, 2021 12:02
Contributor @wangxianghu left a comment

LGTM

Member @vinothchandar left a comment

I am cool to land this and then iterate. I am not sure if this should sit inside Quickstart or on a separate page; we can discuss that separately.

Member @vinothchandar

@leesf Leave it to you to land when ready

Contributor @leesf commented Jul 12, 2021

> @leesf Leave it to you to land when ready

@vinothchandar I am going to land it after we cut the release, since the spark shell examples use the 0.9.0 version.

@@ -300,6 +300,268 @@ spark.
show(100, false)
```

Contributor

@pengzhiwei2018 Hello, does spark sql now support compaction, cleaner, and clustering?

Contributor Author

Yes, I am planning to support these in follow-up PRs.

Member @vinothchandar

@leesf @pengzhiwei2018 Can we redo this based on the new site? We have code tabs, so we can just add a sql code tab for this and make it very simple.

@@ -119,6 +119,11 @@ By default, Spark SQL will try to use its own parquet reader instead of Hive Ser
both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`.
This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark).

**NOTICE**

Since 0.9.0, hudi will sync the table to hive as a spark datasource table, so we no longer need `spark.sql.hive.convertMetastoreParquet=false`.
Contributor

Not required in this patch, but just curious: did we test an upgrade scenario? i.e. an already existing table that was created w/ 0.8.0 and upgraded to 0.9.0, where flipping the hive sync by default works smoothly w/o any issues. If not, can you create a follow-up ticket? I can look into it.

Contributor Author

Well, for an existing old table, hive sync will not add the spark datasource table properties to the table, so users still need this config when querying.
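In other words, for such pre-0.9.0 tables the session setting from the quoted doc text still applies; a one-line illustration:

```sql
-- Still needed when querying tables synced before 0.9.0
set spark.sql.hive.convertMetastoreParquet = false;
```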

website/docs/querying_data.md
]}>
<TabItem value="scala">

```scala
// spark-shell for spark 3
spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Contributor

Guess we should not update these versions in this patch. As part of the release, the release manager will do a one-time update of all such versions. Can we revert these, please?

Contributor Author

OK

Hudi supports using spark sql to write and read data with the **HoodieSparkSessionExtension** sql extension.
```shell
# spark sql for spark 3
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Contributor

These versions can be left as 0.9.0 since this feature is available only in 0.9.0.

Contributor Author

Ok

@@ -51,17 +74,17 @@ export PYSPARK_PYTHON=$(which python3)

# for spark3
pyspark
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1
Contributor

Same comment as above. Let's not flip to 0.9.0 yet.

|------------|--------|
| primaryKey | The primary key names of the table, multiple fields separated by commas. |
| type | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Default value is 'cow' if this option is not specified.|
| preCombineField | The Pre-Combine field of the table. |
Contributor

Can we add details on CTAS as well? Either in this section or in the insert section (lines 250-ish).

Contributor

When you add info on CTAS, add a note that bulk_insert will be used with CTAS.
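A hedged sketch of what such a CTAS example might look like; the table `h2` and the selected literals are hypothetical:

```sql
-- Hypothetical CTAS; per the comment above, bulk_insert is the write operation used
create table h2 using hudi
options (primaryKey = 'id')
as
select 1 as id, 'a1' as name, 10.0 as price;
```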

Contributor

I would also suggest moving the "set options" section here, because it goes with create table. Maybe briefly talk about how to set hudi configs for a given table here, and dive into the details in the later section. As of now it's towards the very end, and I'm not sure users will get that far; some may skip those sections.

Contributor Author

Well, I think making the hudi config settings a separate part is more reasonable. When users want to find out how to set hudi configs via spark sql, they can easily find the info in that part. It's hard to associate this with Create Table, although we can do that in the table options.

Contributor

Sure, sounds good. But just a one-liner like "to set custom hoodie configs for a table, check the Set Hudi Config section" would help.


In non-strict mode, hudi just does the insert operation for the pk-table.

We can set the inset mode by the config: **hoodie.sql.insert.mode**
Contributor

typo. "insert"

Contributor Author

done!
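For reference, a hedged one-liner showing the config in use; upsert, strict, and non-strict are the modes this discussion implies:

```sql
-- Illustrative: pick the insert mode for pk tables
set hoodie.sql.insert.mode = non-strict;
```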

when matched then update set price = s0.price * 2
```
This works well for Cow-On-Write table which support update only the **price** field. But it do not work
for Merge-ON-READ table.
Contributor

Instead of saying something works for COW and does not work for MOR, we can word it differently: "This works well for Copy_On_Write, and support for Merge_On_Read will be added in a future release". Can you revisit the entire patch and fix all such phrases?

Contributor Author

Makes sense.

select("uuid","partitionpath").
show(10, false)

```

**NOTICE**

The insert overwrite non-partitioned table sql statement will convert to the ***insert_overwrite_table*** operation.
Contributor

Can we add an example command here?

Contributor Author

Makes sense.
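A hedged sketch of such a command, reusing the hypothetical non-partitioned table `h0` from earlier:

```sql
-- Overwrites the whole table; maps to the insert_overwrite_table operation
insert overwrite table h0 select 1, 'a1', 10.0;
```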

select("uuid","partitionpath").
sort("partitionpath","uuid").
show(100, false)
```
**NOTICE**

The insert overwrite partitioned table sql statement will convert to the ***insert_overwrite*** operation.
Contributor

Again, can we add an example command here please?
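A hedged sketch here too, reusing the hypothetical partitioned table `h1` from earlier:

```sql
-- Overwrites one static partition; maps to the insert_overwrite operation
insert overwrite table h1 partition (dt = '2021-12-01') select 2, 'a2', 20.0;
```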

Contributor @nsivabalan left a comment

Very few minor comments. I will push an update and land it by today. Good job on the docs :)


Hudi supports creating tables using spark-sql.

**Create Non-Partitioned Table**
Contributor

Is this feedback addressed? i.e. a 1-2 line brief intro about pk vs non-pk tables, partitioned vs non-partitioned tables, managed vs external tables.

Introduce a section called "Terminologies" or something at the beginning of the sql dml docs and explain these details there before we dive into create table commands.
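For the managed vs external distinction mentioned above, a hedged sketch; the table names and the location path are hypothetical:

```sql
-- Managed table: Spark manages the data under the warehouse directory
create table h_managed (id int, name string) using hudi
options (primaryKey = 'id');

-- External table: data lives at an explicit user-provided location
create table h_external (id int, name string) using hudi
options (primaryKey = 'id')
location '/tmp/hudi/h_external';
```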

@nsivabalan nsivabalan merged commit 252e906 into apache:asf-site Aug 12, 2021