Doc: Updates Writing to Partitioned Table Spark Docs #7499

Merged 5 commits into apache:master on May 10, 2023

Conversation

@RussellSpitzer (Member)

No description provided.


Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write
@RussellSpitzer (Member Author):

All this old data is misleading, so I removed it.

@szehon-ho (Collaborator) left a comment:

Some assorted comments

the [Spark's Adaptive Query planning](#controlling-file-sizes).
* `range` - This mode requests that Spark perform a range based exchange to shuffle the data before writing. This is
a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
on the partition and sort columns, this information is then used in the second stage to shuffle data into tasks. Each
@szehon-ho (Collaborator) · May 2, 2023:

Nit: run-on, add 'and' before 'this'?

will not be able to grow to that size if the task is not large enough. The
on disk file size will also be much smaller than the Spark task size since the on disk data will be both compressed
and in columnar format as opposed to Spark's uncompressed row representation. This means a 100 megabyte task will
always correspond to on an on disk file of much less than 100 megabytes even when writing to a single Iceberg partition.
@szehon-ho (Collaborator):

Some extra words here.

## Controlling File Sizes

When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark
task. This means although Iceberg will always roll over a file when it grows to
@szehon-ho (Collaborator):
This is a great section. While we are at it, would it also help new users to explicitly mention partitions, ie,

it's important to note that Spark cannot write a file larger than a Spark task, and files cannot span across Iceberg partitions
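
For a concrete picture of the roll-over behavior discussed in this thread, here is a minimal SQL sketch of tuning it via the `write.target-file-size-bytes` table property; the table name `db.sample` is a placeholder:

```sql
-- Sketch: lower the size at which Iceberg rolls over to a new file.
-- 'db.sample' is a placeholder table name; the default target is 512 MB.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '268435456'  -- 256 MB
);
```

Note that, per the section above, this is a target for the file as written, so the corresponding Spark task must carry more than this amount of (uncompressed, row-format) data for a file to actually reach it.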

@RussellSpitzer (Member Author):

I've been relying on the IntelliJ MD renderer; I'll need to copy this over and check it in the doc repo.

@RussellSpitzer (Member Author):

[screenshots of the rendered doc pages]

@szehon-ho (Collaborator) left a comment:

Looks great, thanks @RussellSpitzer

@stevenzwu (Contributor) left a comment:

just a nit comment

@dramaticlly (Contributor) left a comment:

Thanks for writing this @RussellSpitzer, some nitpicking style comments.

Comment on lines 346 to 363
* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
of files produced during writing, using a global sort is recommended.
<p>A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all
file handles to remain open until each write task has completed.
* `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming
write data before writing. Practically, this means that each row is hashed based on the row's partition value and then placed
in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place because of
the [Spark's Adaptive Query planning](#controlling-file-sizes).
* `range` - This mode requests that Spark perform a range based exchange to shuffle the data before writing. This is
a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
on the partition and sort columns. The second stage uses the range information to shuffle the input data into Spark
tasks. Each task gets an exclusive range of the input data which clusters the data by partition and also globally sorts.
While this is more expensive than the hash distribution, the global ordering can be beneficial for read performance if
sorted columns are used during queries. This mode is used by default if a table is created with a
sort-order. Further division and coalescing of tasks may take place because of
[Spark's Adaptive Query planning](#controlling-file-sizes).
@dramaticlly (Contributor):

When reading the rich markdown diff, I noticed that the 3 modes are concatenated together and it seems hard to read, as in:

[screenshot of the concatenated rendering]

Maybe you want to add a new line before hash and range on lines 352 and 356 to render it properly?

@dramaticlly (Contributor):

Actually, I just realized from your comment above (#7499 (comment)) that the doc renderer can do much better than raw markdown, so I guess those are not really needed.
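
To make the three `write.distribution-mode` options discussed in this thread concrete, a minimal sketch of selecting them per table; `db.sample` and the columns `category` and `ts` are placeholders:

```sql
-- Sketch: request a hash-based exchange before writes.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.distribution-mode' = 'hash'
);

-- Attaching a sort order via the Iceberg Spark SQL extensions; per the
-- doc text above, a table with a sort order uses 'range' by default.
ALTER TABLE db.sample WRITE ORDERED BY category, ts;
```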

There are 3 options for `write.distribution-mode`

* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
@dramaticlly (Contributor):

If <p> is used to start a new paragraph, do we need </p> to end it? I also can't find it on line 350 for the paragraph below.

@RussellSpitzer (Member Author):

Hmm, sorry, my change here must not have gotten pushed. I no longer have any of that.


* `none` - This is the previous default for Iceberg.
<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
@dramaticlly (Contributor):

If you intend to italicize the word manually here using the markdown syntax *manually*, I think it might not work as intended within <p>. I think the below will work. Please ignore me if you want a literal asterisk:

<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done 
automatically by Spark, the data must be <em>manually</em> locally or globally sorted by partition value. To reduce the number 
of files produced during writing, using a global sort is recommended.</p>

@dramaticlly (Contributor):

Also a nit: "the data must be manually locally or globally sorted by partition value" seems a bit weird to read.

Maybe:

The data must be manually sorted by partition value. Sorting can be done locally or globally to reduce the number of files produced during writing, and a global sort is recommended.

<p>This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done
automatically by Spark, the data must be *manually* locally or globally sorted by partition value. To reduce the number
of files produced during writing, using a global sort is recommended.
<p>A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all
@dramaticlly (Contributor):

The hyperlink for [write fanout](#write-properties) also does not seem to work in <p>; it might need HTML syntax instead, like <a href="url">link text</a>.

@RussellSpitzer (Member Author):

Yep, the <p> is removed so this should be ok now.

@RussellSpitzer (Member Author):

@dramaticlly Sorry I forgot to push that last set of changes. Please check it out now

[screenshots of the updated rendered doc pages]

@dramaticlly (Contributor):

> @dramaticlly Sorry I forgot to push that last set of changes. Please check it out now

Thank you @RussellSpitzer, LGTM. Always enjoy your in-depth writing about Iceberg and Spark.

@nastra (Contributor) left a comment:

Great write-up! I've left a few minor comments but LGTM otherwise

If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
To write data to the sample table, data needs to be sorted by `days(ts), category` but this is taken care
of automatically by the default `hash` distribution. Previously this would have required manually sorting, but this
is no longer the case.
@nastra (Contributor):

When finishing this sentence, it's not clear what the SQL example below is trying to tell me. Maybe add a sentence saying that previously an ORDER BY was required in the SQL below.

@RussellSpitzer (Member Author):

The ORDER BY wasn't required before either; the "OrderBy" would automatically set the distribution mode to range. I feel like that was just confusing. Now this is mentioned in the "range" section below.
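
For reference, the manual global sort that `none` mode would require (and that the old doc text alluded to) might look like the sketch below. The sort requirement `days(ts), category` comes from the sample table discussed above; the source table and extra column names are hypothetical:

```sql
-- Sketch: with write.distribution-mode=none, a global ORDER BY
-- clusters rows by partition value before the write.
-- 'source_table' and the columns id/data are hypothetical.
INSERT INTO db.sample
SELECT id, data, category, ts
FROM source_table
ORDER BY ts, category;  -- ordering by ts also clusters rows by days(ts)
```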

@RussellSpitzer (Member Author):

Thanks for the review @nastra, @stevenzwu, @dramaticlly, and @szehon-ho. Hopefully this will make Iceberg a little less mysterious!

@RussellSpitzer RussellSpitzer merged commit 2a06bb5 into apache:master May 10, 2023
2 checks passed
@RussellSpitzer RussellSpitzer deleted the UpdateWriteDocs branch May 10, 2023 20:58
@RussellSpitzer (Member Author):

Closes #7037
