-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Core: Add V4 location relativization utilities #16174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
7f9b5d5
3306f14
cbfb44b
5534732
fa10500
08374e9
6449a75
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -57,4 +57,59 @@ public static String tableLocation(TableIdentifier tableIdentifier, boolean useU | |
| return tableIdentifier.name(); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Returns true if the location contains a URI scheme (e.g. {@code s3:}, {@code hdfs:}, {@code | ||
| * file:}), per <a href="https://datatracker.ietf.org/doc/html/rfc3986#section-3.1">RFC 3986 | ||
| * section 3.1</a>. | ||
| */ | ||
| private static boolean hasScheme(String location) { | ||
| if (location.isEmpty()) { | ||
| return false; | ||
| } | ||
|
|
||
| // Early termination for relative locations since most commonly start with / | ||
| if (location.charAt(0) == '/') { | ||
| return false; | ||
| } | ||
|
|
||
| for (int i = 0; i < location.length(); i += 1) { | ||
| char ch = location.charAt(i); | ||
| if (ch == ':') { | ||
| return i > 0; | ||
| } | ||
|
|
||
| if (!Character.isLetterOrDigit(ch) && ch != '+' && ch != '-' && ch != '.') { | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| return false; | ||
| } | ||
|
|
||
| /** | ||
| * Resolves a location against a table location. If the location has a URI scheme, it is returned | ||
| * as-is. Otherwise, the location is appended to the table location without any additional | ||
| * separator. | ||
| */ | ||
| public static String resolveLocation(String tableLocation, String location) { | ||
| if (hasScheme(location)) { | ||
| return location; | ||
| } | ||
|
|
||
| return tableLocation + location; | ||
| } | ||
|
|
||
| /** | ||
| * Relativizes a location against a table location. If the location starts with the table | ||
| * location, the prefix is removed and the remaining relative portion is returned. Otherwise, the | ||
| * location is returned as-is. | ||
| */ | ||
| public static String relativizeLocation(String tableLocation, String location) { | ||
| if (location.startsWith(tableLocation)) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Prefix-collision bug: The new One possible solution if the path separator is if (location.startsWith(tableLocation)) {
String rest = location.substring(tableLocation.length());
if (rest.isEmpty() || rest.startsWith("/")) {
return rest;
}
}
return location;
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had this defensive check in the initial version of this PR, but removed it because I got feedback that we should not do assumptions about path separators. cc @danielcweeks
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The issue here isn't really about whether The actual problem with byte-prefix Same-location round-trip looks fine in isolation: But move the table to The file that originally lived at The spec PR's rule is "If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored." — but it doesn't pin down what "common prefix" means. Byte-prefix permits the case above; a boundary/segment-aware notion forbids it. The relocation argument points to the latter — otherwise relative paths are unsafe for any sibling-prefix layout. cc @rdblue / @danielcweeks: should
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the scenario Steven described is definitely possible. I had called out this in the proposal doc. I agree that this is something we need to consider in the spec.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's reasonable to add defensive checks if it can be done efficiently in the reference implementation, but I'm not convinced there's a way to cleanly solve for this in the spec. This scenario should resolve to an absolute path, but I don't think we can provide a concrete solution without relying on a separator character.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Great. This is the first thing that we need to agree on.
This is the tricky part. The location is an URI (not a s3 object key string). It might be reasonable to assume The spec already mentioned the path separator. Can we require that the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Alternatively, we can flip it require that the table location must end with a path separator. Both options solve this problem. Implementors will find it a bit surprising that the relative path starts with a This can be avoided if we flip it and require the location must end with the separator instead. The downside of this approach is that we might lose the early termination done in the current version of this PR and may have to look at more characters for a colon instead. But honestly it may not make much of a difference in modern CPUs if JVM is using JIT compilation with SIMD instructions.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was originally pushback against including the separator at the end of the table prefix (I don't remember the specific argument, but I'll look into that). I don't see a good reason at this point to not change the separator location. That would also align better with what people intuitively think of when looking at a path (relative paths typically don't start with a I'll follow up on this. |
||
| return location.substring(tableLocation.length()); | ||
| } | ||
|
|
||
| return location; | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -84,4 +84,141 @@ void testStripTrailingSlashForRootPathWithTrailingSlashes() { | |
| .as("Should be root path") | ||
| .isEqualTo(rootPath); | ||
| } | ||
|
|
||
| @Test | ||
| public void testResolveRelativeLocations() { | ||
| String tableLocation = "s3://bucket/table"; | ||
|
|
||
| assertThat(LocationUtil.resolveLocation(tableLocation, "/metadata/file.parquet")) | ||
| .isEqualTo("s3://bucket/table/metadata/file.parquet"); | ||
|
|
||
| assertThat(LocationUtil.resolveLocation(tableLocation, "/data/00000-0.parquet")) | ||
| .isEqualTo("s3://bucket/table/data/00000-0.parquet"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testResolveLocationsWithColonsInSegments() { | ||
| String tableLocation = "s3://bucket/table"; | ||
|
|
||
| assertThat( | ||
| LocationUtil.resolveLocation(tableLocation, "/data/partition=key:value/file.parquet")) | ||
| .isEqualTo("s3://bucket/table/data/partition=key:value/file.parquet"); | ||
|
|
||
| assertThat(LocationUtil.resolveLocation(tableLocation, "/metadata/snap-123:456.avro")) | ||
| .isEqualTo("s3://bucket/table/metadata/snap-123:456.avro"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testResolveAbsoluteLocationsUnchanged() { | ||
| String tableLocation = "s3://bucket/table"; | ||
|
|
||
| assertThat(LocationUtil.resolveLocation(tableLocation, "s3://other-bucket/path/file.parquet")) | ||
| .isEqualTo("s3://other-bucket/path/file.parquet"); | ||
|
|
||
| assertThat(LocationUtil.resolveLocation(tableLocation, "hdfs://namenode/path/file.parquet")) | ||
| .isEqualTo("hdfs://namenode/path/file.parquet"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testRelativize() { | ||
| String tableLocation = "s3://bucket/table"; | ||
|
|
||
| assertThat( | ||
| LocationUtil.relativizeLocation( | ||
| tableLocation, "s3://bucket/table/metadata/file.parquet")) | ||
| .isEqualTo("/metadata/file.parquet"); | ||
|
|
||
| assertThat( | ||
| LocationUtil.relativizeLocation( | ||
| tableLocation, "s3://bucket/table/data/00000-0.parquet")) | ||
| .isEqualTo("/data/00000-0.parquet"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeLocationNotUnderTableLocation() { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd like for this to test both a table/path mismatch and a bucket mismatch.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good idea. Done. |
||
| String tableLocation = "s3://bucket/table"; | ||
|
|
||
| // different bucket | ||
| assertThat( | ||
| LocationUtil.relativizeLocation(tableLocation, "s3://other-bucket/path/file.parquet")) | ||
| .isEqualTo("s3://other-bucket/path/file.parquet"); | ||
|
|
||
| // same bucket, different path | ||
| assertThat( | ||
| LocationUtil.relativizeLocation( | ||
| tableLocation, "s3://bucket/other-table/data/file.parquet")) | ||
| .isEqualTo("s3://bucket/other-table/data/file.parquet"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeLocationEqualToTableLocation() { | ||
| String tableLocation = "s3://bucket/table"; | ||
|
|
||
| assertThat(LocationUtil.relativizeLocation(tableLocation, "s3://bucket/table")).isEqualTo(""); | ||
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeMismatchedFileSchemeNotRelativized() { | ||
| // mixed file: variants are NOT relativized. Consistent URI forms are the caller's | ||
| // responsibility | ||
| assertThat( | ||
| LocationUtil.relativizeLocation( | ||
| "file:/tmp/table", "file:///tmp/table/metadata/file.parquet")) | ||
| .isEqualTo("file:///tmp/table/metadata/file.parquet"); | ||
|
|
||
| assertThat( | ||
| LocationUtil.relativizeLocation( | ||
| "file:///tmp/table", "file:/tmp/table/metadata/file.parquet")) | ||
| .isEqualTo("file:/tmp/table/metadata/file.parquet"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testResolveAbsoluteLocationWithNonAlphanumericScheme() { | ||
| String tableLocation = "s3://bucket/table"; | ||
|
|
||
| assertThat(LocationUtil.resolveLocation(tableLocation, "git+ssh://host/repo")) | ||
| .isEqualTo("git+ssh://host/repo"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testResolveEmptyLocationReturnsTableLocation() { | ||
| String tableLocation = "s3://bucket/table"; | ||
| assertThat(LocationUtil.resolveLocation(tableLocation, "")).isEqualTo(tableLocation); | ||
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeResolveRoundTrip() { | ||
| String tableLocation = "s3://bucket/table"; | ||
| String absoluteLocation = "s3://bucket/table/metadata/root-manifest.parquet"; | ||
|
|
||
| String relativized = LocationUtil.relativizeLocation(tableLocation, absoluteLocation); | ||
| assertThat(relativized).isEqualTo("/metadata/root-manifest.parquet"); | ||
|
|
||
| String resolved = LocationUtil.resolveLocation(tableLocation, relativized); | ||
| assertThat(resolved).isEqualTo(absoluteLocation); | ||
|
stevenzwu marked this conversation as resolved.
|
||
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeResolveRoundTripWithFileScheme() { | ||
| String tableLocation = "file:///tmp/warehouse/table"; | ||
| String absoluteLocation = "file:///tmp/warehouse/table/metadata/root-manifest.parquet"; | ||
|
|
||
| String relativized = LocationUtil.relativizeLocation(tableLocation, absoluteLocation); | ||
| assertThat(relativized).isEqualTo("/metadata/root-manifest.parquet"); | ||
|
|
||
| String resolved = LocationUtil.resolveLocation(tableLocation, relativized); | ||
| assertThat(resolved).isEqualTo(absoluteLocation); | ||
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeResolveRoundTripWithHDFS() { | ||
| String tableLocation = "hdfs://namenode/warehouse/table"; | ||
| String absoluteLocation = "hdfs://namenode/warehouse/table/data/00000-0.parquet"; | ||
|
|
||
| String relativized = LocationUtil.relativizeLocation(tableLocation, absoluteLocation); | ||
| assertThat(relativized).isEqualTo("/data/00000-0.parquet"); | ||
|
|
||
| String resolved = LocationUtil.resolveLocation(tableLocation, relativized); | ||
| assertThat(resolved).isEqualTo(absoluteLocation); | ||
| } | ||
| } | ||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable to try to speed this up, but I'm concerned about being outside of the RFC. Section 3.1 says that the scheme can contain
+,-, and.:Rather than only using
isLetterOrDigit, I think this should also coverch == '+' || ch == '-' || ch == '.'.There is also no length check. I'm torn on this because I don't think we are effectively limiting many cases by choosing a limit. At the same time, how often are we going to hit paths of only alphanumeric characters without
/?The false-positive case is that we have a long relative path directory name that is not a scheme, since we think schemes are almost always < 10 chars. Initial directory names are almost always "data/" or "metadata/", which will exit the loop on
/at index 4 or 8. By that logic, this only helps if we're checking very strange paths that are not absolute, not regular table paths, and have long alphanumeric directory names. Doesn't seem worth the potential bug for this optimization to me.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually agree. Simplified.