
Spark Integration to read from Snapshot ref #5150

Merged
merged 24 commits into apache:master on Nov 9, 2022

Conversation


@namrathamyske (Contributor) commented Jun 28, 2022

Issue addressed: #3899

This PR provides a way to query a table from Spark using a snapshot ref.

@namrathamyske changed the title Spark integration ref to Spark 3.2 Integration to read from Snapshot ref on Jun 28, 2022
@namrathamyske

@amogh-jahagirdar Took a pull from your PR. Let me know if my commit 8132d20 looks good.

@namrathamyske marked this pull request as ready for review on August 5, 2022
@namrathamyske changed the title Spark 3.2 Integration to read from Snapshot ref to Spark Integration to read from Snapshot ref on Aug 5, 2022

@Test
public void testSnapshotSelectionByRef() throws IOException {
String tableLocation = temp.newFolder("iceberg-table").toString();
Contributor Author

Still need to polish these test cases!




amogh-jahagirdar commented Aug 7, 2022

@namrathamyske Thanks for this!
High-level comment: I think we should separate the API changes from the Spark integration changes. Also, at the API level, I think it makes sense to separate useBranch and useTag rather than having one useSnapshotRef, because branching can be combined with time travel but tagging cannot (although that could just be an implementation detail we handle). Semantically, from an API perspective, it seems cleaner to separate the two.

Let me know what you think. Checkout this thread #4428 (comment)

I have this PR #5364 for branching + time travel, I think we could do a separate one for tagging.

* @return a new scan based on this with the given snapshot Ref
* @throws IllegalArgumentException if the snapshot cannot be found
*/
TableScan useSnapshotRef(String snapshotRef);
Contributor

I think this should be useRef(String branchOrTagName). The term SnapshotRef is internal and I don't think it should be exposed.

Contributor

I think we need to separate the useBranch and useTag APIs. As you said, refs are internal. From a Spark user's perspective we also want to expose only the branch/tag terms; imo the same case could be made at the API level. Also, considering branches can be combined with time travel, we could add a separate API for that, although there's an argument to be made for just combining useBranch with asOfTime.

Contributor

Yeah, I considered that as well. The problem is that the caller doesn't know whether the ref is a tag or a branch before calling the method. That's determined when we look at table metadata and we don't want to force the caller to do that.

There may be a better name than "ref" for useRef. That seems like the problem to me. Maybe we could simplify it to use? I'm not sure that's obvious enough.

@aokolnychyi, do you have any thoughts on the name here?

Contributor Author

@rdblue @amogh-jahagirdar I agree that we can use a common API like useRef for a tag or branch.

We have two signatures:

useRef(String refName)

useRef(String refName, Long timestampMillis) -> will throw an exception for the tag type, since we can't do time travel on a tag.

Contributor

Sure, this sounds reasonable. The only thing is, if we do useRef (or come up with a better name), then we would not want useRef(String refName, Long timestampMillis). A user would chain it with the existing useTimestamp, and the validation that it's a branch would happen in the scan context: useRef().asOfTime(). I don't think we would want the extra method, because time travel only applies to branches, so having the ref in that signature doesn't make sense to me since it's really only supported for one ref type, the branch.

If we have consensus on this, then I can update https://github.com/apache/iceberg/pull/5364/files with the updated approach. Then this PR could be focused on the Spark side of the integration. Will wait to hear what @aokolnychyi suggests as well!

Contributor

"will throw exception for tag type, since we cant do time travel for tag."

In that case I would suggest:

  • useRef(String refName)
  • useBranchAsOfTime(String branchName, Long timeStampMillis)

Contributor

Oh I see, the alternative is just .useRef(refName).asOfTime(timestampMillis). That also works; in that case, +1 for useRef(String refName)

Contributor

Sounds like there is consensus for useRef.
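The consensus above can be sketched as a minimal, self-contained model. All class and field names here are illustrative stand-ins, not Iceberg's actual implementation: a single useRef(name) selects either a branch or a tag, time travel is chained via asOfTime(...), and the branch-only restriction is validated at that point.

```java
import java.util.Map;

// Sketch of the agreed API shape: one useRef for both ref types,
// with time travel validated against the ref type when chained.
final class ScanSketch {
    enum RefType { BRANCH, TAG }

    private final Map<String, RefType> refs;   // the table's known refs
    private final String ref;                  // selected ref, or null (main)
    private final Long asOfTimestamp;          // time-travel point, or null

    ScanSketch(Map<String, RefType> refs) { this(refs, null, null); }

    private ScanSketch(Map<String, RefType> refs, String ref, Long ts) {
        this.refs = refs;
        this.ref = ref;
        this.asOfTimestamp = ts;
    }

    // Immutable builder style: returns a NEW scan, the receiver is unchanged.
    ScanSketch useRef(String name) {
        if (!refs.containsKey(name)) {
            throw new IllegalArgumentException("Cannot find ref: " + name);
        }
        return new ScanSketch(refs, name, asOfTimestamp);
    }

    // Time travel is only meaningful on a branch; a tag is already fixed.
    ScanSketch asOfTime(long timestampMillis) {
        if (ref != null && refs.get(ref) == RefType.TAG) {
            throw new IllegalArgumentException("Cannot time travel on a tag: " + ref);
        }
        return new ScanSketch(refs, ref, timestampMillis);
    }

    String ref() { return ref; }
    Long timestamp() { return asOfTimestamp; }
}
```

With this shape, scan.useRef("someBranch").asOfTime(ts) composes, while the same chain on a tag fails fast, which is exactly the separation the thread settled on.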

@namrathamyske

@rdblue @amogh-jahagirdar Thanks for your review! Working on the changes for this.

@@ -226,4 +226,88 @@ public void testSnapshotSelectionBySnapshotIdAndTimestamp() throws IOException {
.hasMessageContaining("Cannot specify both snapshot-id")
.hasMessageContaining("and as-of-timestamp");
}

@Test
public void testSnapshotSelectionByTag() throws IOException {
@rdblue commented Oct 17, 2022

I think we also need tests to show that branch and tag options can't be used at the same time, and tests to validate what happens when snapshot or timestamp are set along with branch or tag. It should be easy to make a few tests for those error cases.
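As a rough sketch of the validation those error-case tests would exercise (names here are hypothetical; the real checks live in Iceberg's Spark read path), the four selectors are mutually exclusive and setting more than one should fail fast:

```java
// Hypothetical sketch: branch, tag, snapshot-id and as-of-timestamp are
// mutually exclusive read options, so any combination of two should throw.
final class ReadConflicts {
    static void validate(String branch, String tag, Long snapshotId, Long asOfTimestamp) {
        int set = 0;
        if (branch != null) set++;
        if (tag != null) set++;
        if (snapshotId != null) set++;
        if (asOfTimestamp != null) set++;
        if (set > 1) {
            throw new IllegalArgumentException(
                "Can specify only one of snapshot-id, as-of-timestamp, branch, tag");
        }
    }
}
```

Each error-case test then just sets a conflicting pair (branch + tag, snapshot-id + branch, and so on) and asserts the exception message.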

@rdblue left a comment

Thanks, @namrathamyske! This looks close.

// branch ref of the table snapshot to read from
public static final String BRANCH = "branch";

// tag ref of the table snapshot to read from
Contributor

Just a nit: I think for the comments we can leave off the "table snapshot" and "ref" parts, so they could read something like:

"Tag to read from"
"Branch to read from"

@github-actions github-actions bot removed the API label Oct 20, 2022

namrathamyske commented Oct 23, 2022

I came across duplicate implementations of the timestamp option, one of which is Matcher at = AT_TIMESTAMP.matcher(ident.name());. I am unsure how to proceed for the branches and tags use case. Just making changes to read from a branch/tag in SparkScanBuilder worked for previous versions of Spark (3.1, 3.2), but for 3.3 the properties get overridden in SparkCatalog.java before reaching SparkScanBuilder.java. @rdblue @amogh-jahagirdar any thoughts appreciated.
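For context, the identifier-matching approach that comment refers to can be sketched as follows. SparkCatalog inspects identifier names with regexes like AT_TIMESTAMP and resolves the selector before the scan builder sees the options; the exact patterns below are illustrative, not Iceberg's actual ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hedged sketch: parse a time-travel or ref selector out of an identifier
// name, the way SparkCatalog does with its AT_TIMESTAMP-style matchers.
final class IdentSelectors {
    // e.g. "at_timestamp_12345" -> time travel to 12345
    static final Pattern AT_TIMESTAMP = Pattern.compile("at_timestamp_(\\d+)");
    // e.g. "branch_audit" / "tag_v1" -> branch or tag selection (assumed prefixes)
    static final Pattern BRANCH = Pattern.compile("branch_(.+)");
    static final Pattern TAG = Pattern.compile("tag_(.+)");

    static String describe(String identName) {
        Matcher at = AT_TIMESTAMP.matcher(identName);
        if (at.matches()) {
            return "timestamp=" + at.group(1);
        }
        Matcher branch = BRANCH.matcher(identName);
        if (branch.matches()) {
            return "branch=" + branch.group(1);
        }
        Matcher tag = TAG.matcher(identName);
        if (tag.matches()) {
            return "tag=" + tag.group(1);
        }
        return "main";
    }
}
```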

@amogh-jahagirdar left a comment

@namrathamyske I checked out your branch and debugged; see my comment for what the bug is. After the fix, the tests should start passing (verified locally).

Comment on lines 225 to 229
if (branch != null) {
scan.useRef(branch);
} else if (tag != null) {
scan.useRef(tag);
}
Contributor

I think the issue is that scan.useRef(tag) by itself doesn't update the snapshot in the context of the existing scan; it's a builder-like pattern, so it needs to be scan = scan.useRef(ref). That's why the tests are failing: the context snapshot isn't set, so by default the scan reads main.
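This pitfall is easy to reproduce in isolation. Here RefScan is a hypothetical stand-in for the scan class: the builder-style method returns a new object, so discarding the return value silently keeps the old state.

```java
// Minimal sketch of the bug: an immutable, builder-like API where
// useRef returns a NEW scan and never mutates the receiver.
final class RefScan {
    private final String ref;

    RefScan() { this("main"); }
    private RefScan(String ref) { this.ref = ref; }

    // Returns a new scan with the ref applied; the receiver is unchanged.
    RefScan useRef(String name) { return new RefScan(name); }

    String ref() { return ref; }
}
```

Calling scan.useRef("someTag") without capturing the result leaves the scan on main; the fix in the PR is exactly the reassignment form, scan = scan.useRef(ref).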

Contributor Author

My bad, I missed this! Thanks for pointing it out!

Contributor

No problem!

@@ -270,6 +282,8 @@ && readSchema().equals(that.readSchema())
&& Objects.equals(startSnapshotId, that.startSnapshotId)
&& Objects.equals(endSnapshotId, that.endSnapshotId)
&& Objects.equals(asOfTimestamp, that.asOfTimestamp);
// && Objects.equals(branch, that.branch)
// && Objects.equals(tag, that.tag);
}
Contributor Author

I have to uncomment this, but I'm getting a checkstyle cyclomatic complexity error.

Contributor

Considering it's required for a correct equals implementation of SparkBatchQueryScan, I think it makes the most sense to just suppress the warning on the method with @SuppressWarnings("checkstyle:CyclomaticComplexity").
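A hedged sketch of that resolution, with a simplified stand-in class (only the fields visible in the diff above are shown; the real SparkBatchQueryScan compares more state): uncomment the branch/tag comparisons and suppress the checkstyle warning on the method.

```java
import java.util.Objects;

// Simplified stand-in for SparkBatchQueryScan's equality: branch and tag
// must participate in equals, so the complexity warning is suppressed.
final class ScanEquality {
    private final Long startSnapshotId;
    private final Long endSnapshotId;
    private final Long asOfTimestamp;
    private final String branch;
    private final String tag;

    ScanEquality(Long start, Long end, Long ts, String branch, String tag) {
        this.startSnapshotId = start;
        this.endSnapshotId = end;
        this.asOfTimestamp = ts;
        this.branch = branch;
        this.tag = tag;
    }

    @Override
    @SuppressWarnings("checkstyle:CyclomaticComplexity")
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ScanEquality)) return false;
        ScanEquality that = (ScanEquality) o;
        return Objects.equals(startSnapshotId, that.startSnapshotId)
            && Objects.equals(endSnapshotId, that.endSnapshotId)
            && Objects.equals(asOfTimestamp, that.asOfTimestamp)
            && Objects.equals(branch, that.branch)
            && Objects.equals(tag, that.tag);
    }

    @Override
    public int hashCode() {
        return Objects.hash(startSnapshotId, endSnapshotId, asOfTimestamp, branch, tag);
    }
}
```

Without the branch/tag terms, two scans over different refs would compare equal, which is exactly the correctness problem the suppression trades against.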

@amogh-jahagirdar left a comment

Overall looks great to me, just some nits. Thanks for contributing this, @namrathamyske! cc @rdblue


rdblue commented Nov 7, 2022

"I am unsure of how to proceed for branches and tags usecase. Just making changes to read from branch/tag in SparkScanBuilder worked before for previous versions of spark - 3.1, 3.2. But for 3.3, properties get overridden in SparkCatalog.java before reaching SparkScanBuilder.java"

The current implementation looks fine to me. If those code paths are taken, it indicates that Spark was passed syntax like TIMESTAMP AS OF '...', which should be incompatible with branch or tag options. This PR already implements that because the snapshot passed to SparkTable is added to options.


rdblue commented Nov 7, 2022

This looks good to me. I'm rerunning CI since the failures don't look related to this.

@rdblue rdblue merged commit 305e320 into apache:master Nov 9, 2022
Fokko pushed a commit to Fokko/iceberg that referenced this pull request Nov 11, 2022

5 participants