Spark: add procedure to generate symlink manifests #4401
jackye1995 wants to merge 1 commit into apache:master from jackye1995:symlink
Conversation
if (partitionType.fields().isEmpty()) {
  entries.select("data_file.file_path")
      .write()
      .format("parquet")
[question] Should the format here be "text" instead? As per this: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java#L47-L52
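For reference, a rough sketch of what a text-format write might look like here (a guess at intent, not this PR's code; `entries` and `symlinkLocation` stand in for the surrounding procedure variables, and Spark's text source expects a single string column):

```java
// SymlinkTextInputFormat reads plain text manifest files that list one data file path
// per line, so writing with the "text" source (rather than "parquet") would match that.
entries.select("data_file.file_path")
    .coalesce(1)                // optional: keep each manifest in a single text file
    .write()
    .format("text")
    .mode("overwrite")
    .save(symlinkLocation);     // hypothetical target, e.g. the symlink manifest directory
```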
Types.StructType partitionType = Partitioning.partitionType(icebergTable);
Dataset<Row> entries = SparkTableUtil.loadCatalogMetadataTable(spark(), icebergTable, MetadataTableType.ENTRIES)
    .filter("status < 2 AND data_file.content = 0");
We should cache the DF as well; otherwise we will end up scanning the metadata table twice (see the sketch below), once in each of:
- count()
- write()
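A minimal sketch of that suggestion against the snippet above (the `unpersist()` placement and the write target `symlinkLocation` are illustrative, not part of the diff):

```java
// Cache so the two actions below (count and write) don't each re-scan the ENTRIES metadata table.
Dataset<Row> entries = SparkTableUtil.loadCatalogMetadataTable(spark(), icebergTable, MetadataTableType.ENTRIES)
    .filter("status < 2 AND data_file.content = 0")
    .cache();
try {
  long dataFileCount = entries.count();          // first action: materializes and caches the rows
  entries.select("data_file.file_path")
      .write()
      .format("parquet")
      .save(symlinkLocation);                    // second action: served from the cache
} finally {
  entries.unpersist();                           // release the cached blocks once the procedure is done
}
```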
private void checkDirectoryExists(String... paths) {
  String path = String.join("/", paths);
  Assert.assertTrue("Directory should exist: " + path, Files.isDirectory(Paths.get(URI.create(path))));
}
Should we also check that the symlink folder, when used as the external table, returns the same result as the Iceberg table? WDYT?
kbendick left a comment
Thanks for this addition @jackye1995!
> Regarding merge-on-read, generated symlink table does not consider delete files. This is basically a "snapshot view" of the table. A compaction is needed to generate the most up-to-date view of the table.
Should we error out if there are delete files present?
We can add an option to still allow for ignoring them to get this "snapshot view", but I think by default a symlink manifest shouldn't be generated that can represent a table at a state it never existed in.
For example, if we're using MOR, and we have one file that gets written as a plain data file (as it only contains appends), and then some number of delta files, the currently proposed symlink-manifest-based table will contain that new file but won't contain the deltas.
Correct me if I'm wrong, but I think that would be a state the table was never in. I can see how it would avoid unnecessary compute to skip trying to process any delta files, e.g. if the user did compact it ahead of time, but I don't think that should be the default for the procedure.
The default output should return a symlink that represents the table at least as it is at some point in time.
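A rough sketch of what that default could look like in the procedure (the `ignore_deletes` option and the content-code filter are assumptions here, not part of this PR):

```java
// Hypothetical guard: fail by default when live delete files exist, unless the caller
// explicitly asks for the "snapshot view" behavior via an ignore_deletes flag.
long deleteFileCount = SparkTableUtil.loadCatalogMetadataTable(spark(), icebergTable, MetadataTableType.ENTRIES)
    .filter("status < 2 AND data_file.content != 0")   // 0 = data, 1/2 = position/equality deletes
    .count();
Preconditions.checkArgument(ignoreDeletes || deleteFileCount == 0,
    "Cannot generate symlink manifests: table has %s delete files; " +
    "compact the table first, or set ignore_deletes => true to produce a snapshot view",
    deleteFileCount);
```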
In the above default location `<table_root>/_symlink_format_manifest/<snapshot_id>`, in either case, does this mean whenever I regenerate the symlink format manifest file, I have to alter the external table location to reflect the latest manifest file? How do I execute the procedure to override / exclude the snapshot_id prefix and generate the manifest file like below?
Preconditions.checkArgument(tableIdent != null && !tableIdent.isEmpty(),
    "Cannot handle an empty identifier for argument table");

CatalogPlugin defaultCatalog = spark().sessionState().catalogManager().currentCatalog();
Nit: these lines can be replaced by: Spark3Util.loadIcebergTable()
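Something along these lines, I think (a sketch; the exact exceptions thrown by Spark3Util.loadIcebergTable may differ, so treat the handling as illustrative):

```java
// Hypothetical simplification: resolve the identifier against the session catalog manager
// and load the underlying Iceberg table in a single call.
org.apache.iceberg.Table icebergTable;
try {
  icebergTable = Spark3Util.loadIcebergTable(spark(), tableIdent);
} catch (Exception e) {
  throw new IllegalArgumentException("Cannot load Iceberg table: " + tableIdent, e);
}
```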
int len = tableLocation.length();
StringBuilder sb = new StringBuilder();
sb.append(tableLocation);
if (sb.charAt(len - 1) != '/') {
Optional: can save one line above to get "len" by just checking tableLocation.endsWith(). (Not sure if there's any other reason to do it this way)
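Something like this, presumably (a one-line sketch; `tableLocation` comes from the surrounding code):

```java
// Append the trailing slash only when it's missing, without tracking the length separately.
String rootLocation = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
```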
public InternalRow[] call(InternalRow args) {
  String tableIdent = args.getString(0);
  Preconditions.checkArgument(tableIdent != null && !tableIdent.isEmpty(),
      "Cannot handle an empty identifier for argument table");
Nit: maybe a slightly more user-friendly error message, "table cannot be null or empty"?
| "Cannot generate symlink manifests for an empty table"); | ||
|
|
||
| long snapshotId = icebergTable.currentSnapshot().snapshotId(); | ||
| String symlinkRootLocation = args.isNullAt(1) ? |
Curious, do we need to add any validation that path ends with /? (Noticed we put a / at the end of the default path, not sure if that was a hard requirement)
| sql("INSERT INTO TABLE %s VALUES (2, 'b')", tableName); | ||
| List<Object[]> result = sql("CALL %s.system.generate_symlink_format_manifest('%s')", catalogName, tableIdent); | ||
| Table table = validationCatalog.loadTable(tableIdent); | ||
| List<Object[]> expected = Lists.newArrayList(); |
Nit: can we use Lists.newArrayList(row(table...))?
(same for other tests)
  String path = String.join("/", paths);
  Assert.assertTrue("Directory should exist: " + path, Files.isDirectory(Paths.get(URI.create(path))));
}
}
If not too hard, can we add one more test about evolving partition spec?
List<Object[]> expected = Lists.newArrayList();
expected.add(row(table.currentSnapshot().snapshotId(), 2L));
assertEquals("Should find 2 files", expected, result);
checkDirectoryExists(customLocation);
Can we add a check (maybe in this test or another) that a directory exists under customLocation for each partition? To make sure that the custom location works with a partitioned table.
}

@Override
public InternalRow[] call(InternalRow args) {
Do you think it's beneficial to add an Action and have the Procedure call it, like the other ones? I am not sure. cc @RussellSpitzer @aokolnychyi for any thoughts.
@jackye1995 Is there a way to do the same thing using the Java API? How can we read the locations of all data files given a tableId and snapshotId?
@prashantgohel1 You may have already figured it out, but if you're looking to get the data files for a table at a given snapshot, going through the Table API's table.snapshot to get the snapshot, and then using the snapshot's data manifest API to get all the data file manifests, should work: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Snapshot.java#L101 Then it's a matter of using the Java library to read the manifest files: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestFiles.java#L71 Another way is to query the data files metadata table.
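A rough sketch of that manifest-based approach, assuming a recent Iceberg version where Snapshot exposes dataManifests(FileIO); method names may differ slightly across versions, and loading the Table itself is left out:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileIO;

public class DataFileLocations {
  // Collects the locations of all data files that belong to the given snapshot of a table.
  public static List<String> forSnapshot(Table table, long snapshotId) throws IOException {
    Snapshot snapshot = table.snapshot(snapshotId);
    FileIO io = table.io();
    List<String> locations = new ArrayList<>();
    for (ManifestFile manifest : snapshot.dataManifests(io)) {
      // Each data manifest lists a set of data files; read it and record each file's path.
      try (CloseableIterable<DataFile> files = ManifestFiles.read(manifest, io)) {
        for (DataFile file : files) {
          locations.add(file.path().toString());
        }
      }
    }
    return locations;
  }
}
```

As noted above, querying the data files metadata table (for example via Spark) gives the same information without touching the manifest APIs directly.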
Discussed offline with @jackye1995; I'll be carrying this PR forward. My thinking is the following:
1. I do think we need an explicit flag for ignoring deletes; it makes sense to me that the operation fails if there are deletes. This is because, as a user, by default I expect the actual state of the table to be represented in the symlink file. If the operation fails, it does force a user to do a compaction prior to running this procedure, but I think that makes sense. And if they really don't want this behavior, they can pass in the flag to ignore deletes. @kbendick @jackye1995 let me know your thinking, or we can discuss on the PRs I plan on raising.
2. Looks like there's interest in having this be an actual Spark Action, so I will add that.
@jackye1995 Feel free to close this; I will add you as a co-author on the PR for the procedure implementation. Thanks!
Add a Spark procedure to generate symlink manifests, so that systems without Iceberg support can read Iceberg table data using an external table.
I did not add an action for this because it is just meant to give a gateway for users of any existing query engine that does not natively support Iceberg (in my case it's Redshift Spectrum) to start reading Iceberg, because most engines support Hive with the symlink input format to some extent. If we think it deserves an action in the core API, I can also add that.
The procedure looks like:
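Roughly, based on the tests in this PR (here `spark` is an assumed SparkSession, and the catalog, table, and path are placeholders; the second argument is the optional root location, per the args handling shown above):

```java
// Default symlink location: <table_root>/_symlink_format_manifest/<snapshot_id>
spark.sql("CALL my_catalog.system.generate_symlink_format_manifest('db.sample')");

// Optional second argument overrides the symlink root location.
spark.sql("CALL my_catalog.system.generate_symlink_format_manifest('db.sample', 's3://bucket/custom/symlink/')");
```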
The `symlink_root_location` is optional. The default is `<table_root>/_symlink_format_manifest/<snapshot_id>`. A snapshot ID suffix is added because, if this procedure is executed twice against the same table after the table has been updated, we don't want to mix the results. If users want to use a consistent root path for the symlink table, it can be input as an override.

I thought about adding another option for `snapshot_id` in the input, so we can generate a symlink table for any historical snapshot, but decided not to do that to avoid making the procedure too complicated. We can add it as a follow-up if needed.

The procedure currently returns the `snapshot_id` that the procedure is executed against, and `data_file_count` for the number of data files in the symlink manifests.

Regarding partitioning, the generated symlink table exposes all the hidden partitions and uses the union of all historical table partition specs. For example, if the table is partitioned by spec 1 `category` and spec 2 `bucket(16, id)`, users are expected to create a symlink table with `PARTITIONED BY (id_bucket int, category string)`.

Regarding merge-on-read, the generated symlink table does not consider delete files. This is basically a "snapshot view" of the table. A compaction is needed to generate the most up-to-date view of the table.
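For context, the kind of external table this feeds would look roughly like the following Hive-style DDL (not part of this PR; the table name, columns, and location are placeholders, and the SerDe / input format classes shown are the usual Hive ones for symlink manifests over Parquet data):

```java
// Hypothetical external table over the generated symlink manifests, created from a Spark
// session with Hive support (the same DDL could be run in Hive or another Hive-compatible engine).
// SymlinkTextInputFormat reads the manifest files under LOCATION and then scans the
// Parquet data files listed inside them.
spark.sql(
    "CREATE EXTERNAL TABLE db.sample_symlink (id bigint, data string) " +
    "PARTITIONED BY (id_bucket int, category string) " +
    "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' " +
    "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' " +
    "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' " +
    "LOCATION 's3://bucket/warehouse/db/sample/_symlink_format_manifest/8744736658442914487'");
```

For a partitioned table like this, each partition would still need to be registered (for example via MSCK REPAIR TABLE or explicit ADD PARTITION) so that it points at the corresponding manifest directory.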