Doc: Updates Writing to Partitioned Table Spark Docs #7499
Conversation
Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write
All this old data is misleading so I removed it
Some assorted comments
docs/spark-writes.md
Outdated
the [Spark's Adaptive Query planning](#controlling-file-sizes).
* `range` - This mode requests that Spark perform a range based exchanged to shuffle the data before writing. This is
a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
on the partition and sort columns, this information is then used in the second stage to shuffle data into tasks. Each
Nit: run-on, add 'and' before 'this'?
docs/spark-writes.md
Outdated
will not be able to grow to that size if the task is not large enough. The
on disk file size will also be much smaller than the Spark task size since the on disk data will be both compressed
and in columnar format as opposed to Spark's uncompressed row representation. This means a 100 megabyte task will
always corrospond to on an on disk file of much less than 100 megabytes even when writing to a single Iceberg partition.
Some extra words here.
docs/spark-writes.md
Outdated
## Controlling File Sizes

When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark
task. This means although Iceberg will always roll over a file when it grows to
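As a sketch of how this section's advice is applied in practice — the table name here is hypothetical, and `write.target-file-size-bytes` is the standard Iceberg table property controlling the roll-over target (512 MB is its documented default):

```sql
-- Hypothetical table name; sets the file roll-over target to 512 MB
ALTER TABLE prod.db.sample SET TBLPROPERTIES (
    'write.target-file-size-bytes' = '536870912'
);
```

Even with this target set, a file can only reach that size if its Spark task contains enough data, per the caveat above.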
This is a great section. While we are at it, would it also help new users to explicitly mention partitions? I.e.,
"it's important to note that Spark cannot write a file larger than a Spark task, and files cannot span across Iceberg partitions"
I've been relying on the IntelliJ MD renderer; I'll need to copy this over and check it in the doc repo.
Looks great, thanks @RussellSpitzer
just a nit comment
Thanks for writing this @RussellSpitzer, some nitpicking style comments
docs/spark-writes.md
Outdated
* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
of files produced during writing, using a global sort is recommended.
<p>A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all
file handles to remain open until each write task has completed.
* `hash` - This mode is the new default and requests that Spark use a hash-based exchange to shuffle the incoming
write data before writing. Practically, this means that each row is hashed based on the row's partition value and then placed
in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place because of
[Spark's Adaptive Query planning](#controlling-file-sizes).
* `range` - This mode requests that Spark perform a range-based exchange to shuffle the data before writing. This is
a two-stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
on the partition and sort columns. The second stage uses the range information to shuffle the input data into Spark
tasks. Each task gets an exclusive range of the input data, which clusters the data by partition and also globally sorts it.
While this is more expensive than the hash distribution, the global ordering can be beneficial for read performance if
sorted columns are used during queries. This mode is used by default if a table is created with a
sort order. Further division and coalescing of tasks may take place because of
[Spark's Adaptive Query planning](#controlling-file-sizes).
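A minimal sketch of how one of these modes would be selected per table — the table name is hypothetical, and `write.distribution-mode` is the table property this list describes:

```sql
-- Hypothetical table name; requests the hash-based exchange described above
ALTER TABLE prod.db.sample SET TBLPROPERTIES (
    'write.distribution-mode' = 'hash'
);
```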
Actually, I just realized your comment above #7499 (comment). It looks like the doc renderer can do much better than just raw markdown, so I guess those are not really needed.
docs/spark-writes.md
Outdated
There are 3 options for `write.distribution-mode`

* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
If `<p>` is used to start a new paragraph, do we need `</p>` to end it? I also can't find it in line 350 for the paragraph below.
Hmm, sorry, my change here must not have gotten pushed. I no longer have any of that.
docs/spark-writes.md
Outdated
* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
If you intend to italicize the word "manually" here using the markdown syntax `*manually*`, I think it might not work as intended within `<p>`. I think the below will work. Please ignore me if you want a literal asterisk:

<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be <em>manually</em> locally or globally sorted by partition value. To reduce the number
of files produced during writing, using a global sort is recommended.</p>
Also a nit: "the data must be manually locally or globally sorted by partition value" seems a bit weird to read. Maybe:
"The data must be manually sorted by partition value. Sorting can be done locally or globally to reduce the number of files produced during writing, and a global sort is recommended."
docs/spark-writes.md
Outdated
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
of files produced during writing, using a global sort is recommended.
<p>A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all
The hyperlink `[write fanout](#write-properties)` also does not seem to work within `<p>`; it might need HTML syntax instead, like `<a href="url">link text</a>`.
Yep, the `<p>` is removed so this should be ok now.
@dramaticlly Sorry, I forgot to push that last set of changes. Please check it out now.
Thank you @RussellSpitzer, LGTM. Always enjoy your in-depth writing about Iceberg and Spark.
Great write-up! I've left a few minor comments but LGTM otherwise
If you're inserting data with a SQL statement, you can use `ORDER BY` to achieve it, like below:
To write data to the sample table, data needs to be sorted by `days(ts), category` but this is taken care
of automatically by the default `hash` distribution. Previously this would have required a manual sort, but this
is no longer the case.
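A minimal sketch of the write described above, assuming the docs' sample table partitioned by `days(ts), category` (table and column names are illustrative):

```sql
-- With the default hash distribution mode, no explicit ORDER BY is required;
-- Spark shuffles rows by partition value before writing.
INSERT INTO prod.db.sample
SELECT id, data, category, ts FROM another_table;
```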
When finishing this sentence, it's not clear what the SQL example below is trying to tell me. Maybe add a sentence saying that previously an `ORDER BY` was required in the SQL below.
The ORDER BY wasn't required before either; the "OrderBy" would automatically set the distribution mode to range. I feel like that was just confusing. Now this is mentioned in the "range" section below.
Thanks for the review @nastra, @stevenzwu, @dramaticlly and @szehon-ho. Hopefully this will make Iceberg a little less mysterious!
Closes #7037