Doc: Updates Writing to Partitioned Table Spark Docs #7499

Merged 5 commits into apache:master on May 10, 2023

Conversation

@RussellSpitzer (Member)

No description provided.


Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write
@RussellSpitzer (Member Author):

All this old data is misleading, so I removed it.

@szehon-ho (Collaborator) left a comment:

Some assorted comments

the [Spark's Adaptive Query planning](#controlling-file-sizes).
* `range` - This mode requests that Spark perform a range based exchange to shuffle the data before writing. This is
a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
on the partition and sort columns, this information is then used in the second stage to shuffle data into tasks. Each
@szehon-ho (Collaborator) · May 2, 2023:

Nit: run-on, add 'and' before 'this'?

will not be able to grow to that size if the task is not large enough. The
on disk file size will also be much smaller than the Spark task size since the on disk data will be both compressed
and in columnar format as opposed to Spark's uncompressed row representation. This means a 100 megabyte task will
always correspond to on an on disk file of much less than 100 megabytes even when writing to a single Iceberg partition.
@szehon-ho (Collaborator):

Some extra words here.

## Controlling File Sizes

When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark
task. This means although Iceberg will always roll over a file when it grows to
@szehon-ho (Collaborator):
This is a great section. While we are at it, would it also help new users to explicitly mention partitions, ie,

it's important to note that Spark cannot write a file larger than a Spark task, and files cannot span across Iceberg partitions
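
For a concrete picture of the roll-over behavior discussed in this thread, here is a minimal SQL sketch of tuning it via the `write.target-file-size-bytes` table property; the table name `db.sample` is a placeholder:

```sql
-- Sketch: lower the size at which Iceberg rolls over to a new file.
-- 'db.sample' is a placeholder table name; the default target is 512 MB.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '268435456'  -- 256 MB
);
```

Note that, per the section above, this is a target for the file as written, so the corresponding Spark task must carry more than this amount of (uncompressed, row-format) data for a file to actually reach it.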

@RussellSpitzer (Member Author):

I've been relying on the IntelliJ MD renderer; I'll need to copy this over and check it in the doc repo.

@RussellSpitzer (Member Author):

[screenshots of the rendered doc pages]

@szehon-ho (Collaborator) left a comment:

Looks great, thanks @RussellSpitzer

@stevenzwu (Contributor) left a comment:

just a nit comment

@dramaticlly (Contributor) left a comment:

Thanks for writing this @RussellSpitzer, some nitpicking style comments.

Comment on lines 346 to 363
* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
of files produced during writing, using a global sort is recommended.
<p>A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all
file handles to remain open until each write task has completed.
* `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming
write data before writing. Practically, this means that each row is hashed based on the row's partition value and then placed
in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place because of
the [Spark's Adaptive Query planning](#controlling-file-sizes).
* `range` - This mode requests that Spark perform a range based exchange to shuffle the data before writing. This is
a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
on the partition and sort columns. The second stage uses the range information to shuffle the input data into Spark
tasks. Each task gets an exclusive range of the input data which clusters the data by partition and also globally sorts.
While this is more expensive than the hash distribution, the global ordering can be beneficial for read performance if
sorted columns are used during queries. This mode is used by default if a table is created with a
sort-order. Further division and coalescing of tasks may take place because of
[Spark's Adaptive Query planning](#controlling-file-sizes).
@dramaticlly (Contributor):

When reading the rich markdown diff, I noticed that the 3 modes are concatenated together and it seems hard to read, as in:

[screenshot of the concatenated rendering]

Maybe you want to add a new line before hash and range on lines 352 and 356 to render it properly?

@dramaticlly (Contributor):

Actually, I just realized from your comment above (#7499 (comment)) that the doc renderer can do much better than raw markdown, so I guess those are not really needed.
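
To make the three `write.distribution-mode` options discussed in this thread concrete, a minimal sketch of selecting them per table; `db.sample` and the columns `category` and `ts` are placeholders:

```sql
-- Sketch: request a hash-based exchange before writes.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.distribution-mode' = 'hash'
);

-- Attaching a sort order via the Iceberg Spark SQL extensions; per the
-- doc text above, a table with a sort order uses 'range' by default.
ALTER TABLE db.sample WRITE ORDERED BY category, ts;
```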

There are 3 options for `write.distribution-mode`

* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
@dramaticlly (Contributor):

If <p> is used to start a new paragraph, do we need </p> to end it? I also can't find it on line 350 for the paragraph below.

@RussellSpitzer (Member Author):

Hmm, sorry, my change here must not have gotten pushed. I no longer have any of that.


* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
@dramaticlly (Contributor):

If you intend to italicize the word manually here using the markdown syntax *manually*, I think it might not work as intended within <p>. I think the below will work. Please ignore me if you want a literal asterisk:

<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done 
automatically by Spark, the data must be <em>manually</em> locally or globally sorted by partition value. To reduce the number 
of files produced during writing, using a global sort is recommended.</p>

@dramaticlly (Contributor):

Also a nit: "the data must be manually locally or globally sorted by partition value" seems a bit weird to read.

Maybe:

The data must be manually sorted by partition value. Sorting can be done locally or globally to reduce the number of files produced during writing, and a global sort is recommended.

<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
of files produced during writing, using a global sort is recommended.
<p>A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all
@dramaticlly (Contributor):

The hyperlink for [write fanout](#write-properties) also does not seem to work in <p>; it might need HTML syntax instead, like <a href="url">link text</a>.

@RussellSpitzer (Member Author):

Yep, the <p> is removed so this should be ok now.

@RussellSpitzer (Member Author):

@dramaticlly Sorry I forgot to push that last set of changes. Please check it out now

[screenshots of the updated rendered doc pages]

@dramaticlly (Contributor):

> @dramaticlly Sorry I forgot to push that last set of changes. Please check it out now

Thank you @RussellSpitzer, LGTM. Always enjoy your in-depth writing about Iceberg and Spark.

@nastra (Contributor) left a comment:

Great write-up! I've left a few minor comments but LGTM otherwise

If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
To write data to the sample table, data needs to be sorted by `days(ts), category` but this is taken care
of automatically by the default `hash` distribution. Previously this would have required manually sorting, but this
is no longer the case.
@nastra (Contributor):

When finishing this sentence, it's not clear what the SQL example below is trying to tell me. Maybe add a sentence saying that previously an ORDER BY was required in the SQL below.

@RussellSpitzer (Member Author):

The ORDER BY wasn't required before either; the "OrderBy" would automatically set the distribution mode to range. I feel like that was just confusing. Now this is mentioned in the "range" section below.
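
For reference, the manual global sort that `none` mode would require (and that the old doc text alluded to) might look like the sketch below. The sort requirement `days(ts), category` comes from the sample table discussed above; the source table and extra column names are hypothetical:

```sql
-- Sketch: with write.distribution-mode=none, a global ORDER BY
-- clusters rows by partition value before the write.
-- 'source_table' and the columns id/data are hypothetical.
INSERT INTO db.sample
SELECT id, data, category, ts
FROM source_table
ORDER BY ts, category;  -- ordering by ts also clusters rows by days(ts)
```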

@RussellSpitzer (Member Author):

Thanks for the review @nastra, @stevenzwu, @dramaticlly, and @szehon-ho. Hopefully this will make Iceberg a little less mysterious!

@RussellSpitzer RussellSpitzer merged commit 2a06bb5 into apache:master May 10, 2023
2 checks passed
@RussellSpitzer RussellSpitzer deleted the UpdateWriteDocs branch May 10, 2023 20:58
@RussellSpitzer (Member Author):

Closes #7037
