Skip to content

Add Container/File export and check for workdir escaping#3348

Merged
vito merged 10 commits into
dagger:mainfrom
vito:container-export
Nov 2, 2022
Merged

Add Container/File export and check for workdir escaping#3348
vito merged 10 commits into
dagger:mainfrom
vito:container-export

Conversation

@vito
Copy link
Copy Markdown
Contributor

@vito vito commented Oct 12, 2022

Adds the following schema:

extend type Container {
  "Export the container an OCI tarball to the destination path."
  export(path: String!): Boolean!
}

extend type File {
  "Write the file to a directory on the host"
  export(path: String!): Boolean!
}

Side note: I thought about supporting export(path: String!, address: ContainerAddress!) so you can set a name for the exported image, but supporting that isn't possible with the OCI exporter until Buildkit ships a new version (possibly v0.11?) with annotation support (moby/buildkit#2879).

@shykes
Copy link
Copy Markdown
Contributor

shykes commented Oct 12, 2022

Since under the hood this gives us a tar stream, would it make sense to just expose it as a Directory, then you can do whatever you want with it, including but not limited to writing it to host filesystem?

In other words:

extend type Container {
  "Export the container to an OCI-compatible image archive"
  export: Directory!
}

Then compose that with anything that can take a Directory?

@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 12, 2022

@shykes I really like the idea of funneling everything into Directory so HostDirectory only has the one touch point, but I don't know of a clean way to take data exported from Buildkit and put it back into LLB, which is what Directory currently maps 1:1 to. Besides exporting it to a temporary directory and adding it as a LocalDir.

I'm pretty hesitant to change Directory to be anything but LLB + path since it has so many touch points. We need to be careful not to let that turn into a M:N maintenance burden (M flavors of Directory * N different users).

Here's a counter-proposal to tear apart:

extend type HostDirectory {
  export(container: ContainerID!, path: String!)
}

This avoids the issue of not having a clear HostDirectoryID for the workdir, at the expense of blessing "Container" as another special type that can be written to the host. Container and Directory effectively become the king of the jungle, which is already kind of the case.

@shykes
Copy link
Copy Markdown
Contributor

shykes commented Oct 13, 2022

That seems fine. Just to play devil's advocate: could it be orchestrated on top of a primitive that unpacks a tarball into a Directory? Possibly with the help of the shim?

By the way this could be done later as a future replacement to the "king of the jungle" approach.

@aluzzardi
Copy link
Copy Markdown
Contributor

Bikeshedding welcome.

Here we go :)

extend type Container {
    "Write the container image to the host directory as an OCI tarball"
    export(to: HostDirectoryID!, path: String!): Boolean!
}
"A directory on the host"
type HostDirectory {
  "Write the contents of another directory to the directory"
  write(contents: DirectoryID!, path: String): Boolean!
}

These 2 are doing about the same (at least in terms of BuildKit call, they're both an Export), however their GraphQL API is very different:

Exporting the raw tarball is done in the Container object by passing a HostDirectory as an argument, while exporting the unpacked tarball is done in the HostDirectory object by passing the Directory as an argument (e.g. swapping object<->subject).

I think we should keep it consistent with one or the other.

=== unrelated to this PR, but still related ===

While looking at the API, I realize there's something funky: the only way to "write" (whether it's a host directory write or container export) is to do so through HostDirectory, which itself must be "read" before.

e.g. writing to /tmp involves loading all of /tmp, exporting to /project/does-not-exist-yet will probably fail?

Basically writing forces us to read first. Is that a problem?

Alternatively, we could use Directory instead of HostDirectory (e.g. make it read-only, and therefore Directory does the trick) and have writer fields directly in Host,e.g.

"Information about the host execution environment"
type Host {
  "The current working directory on the host"
  workdir: Directory!

  "Write the contents of a directory to the host"
  exportDirectory(id: DirectoryID!, path: String!): Boolean!

  "Write the container image to the host directory as an OCI tarball"
  exportContainer(id: ContainerID!, path: String!): Boolean!
}

This would:

  • Get consistent between container and directory exports (e.g. the original thing I raised above)
  • Get rid of must-read-before-write
  • Get rid of HostDirectory and HostDirectoryID types
  • Shorten a little bit the call path (e.g. core.Host().Workdir().Read().File("README.md").Contents(ctx))

@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 13, 2022

@aluzzardi One correction: you don't need to read a HostDirectory to get a HostDirectoryID - you just need to know the ID ahead of time; it's a static scalar value passed to engine.Start for each LocalDir. When you read a HostDirectory you get a DirectoryID, not a HostDirectoryID.

The weird part to me is having to know the workdir's HostDirectoryID, especially since it'll be randomized soon per @sipsma, but that can be fixed by moving these calls to HostDirectory instead.

re:

  "Write the contents of a directory to the host"
  exportDirectory(id: DirectoryID!, path: String!): Boolean!

  "Write the container image to the host directory as an OCI tarball"
  exportContainer(id: ContainerID!, path: String!): Boolean!

Fully agree with this especially since I realized I also need an exportFile. At this point a pattern seems to have emerged and we should keep them all consistent. But I think it makes more sense for these to live on HostDirectory instead of Host if we're keeping support for multiple host dirs instead of single workdir. My vote is to keep it; I need it for Bass since a single thunk might make use of multiple host paths from completely different (non-overlapping) contexts.

@aluzzardi aluzzardi changed the base branch from cloak to main October 14, 2022 18:24
@vito vito force-pushed the container-export branch from e51a72b to 907cdfb Compare October 14, 2022 20:52
@shykes
Copy link
Copy Markdown
Contributor

shykes commented Oct 15, 2022

Inspired by this discussion, here's a proposal for both 1) a simpler way to write to host filesystem, and 2) container export to OCI archive.

@vito vito force-pushed the container-export branch from 907cdfb to e6ca864 Compare October 20, 2022 00:40
@vito vito changed the title add container { export } for exporting OCI tarballs to the host Container: add export for exporting OCI tarballs to the host Oct 20, 2022
@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 20, 2022

Rebased on #3421 and turned it into the following API:

extend type Container {
  export(path: String!): Boolean!
}

...as originally proposed in #3401.

@vito vito force-pushed the container-export branch from e6ca864 to 04c4370 Compare October 31, 2022 18:44
@vito vito changed the title Container: add export for exporting OCI tarballs to the host Add Container/File export and check for workdir escaping Oct 31, 2022
@vito vito requested review from aluzzardi, shykes and sipsma October 31, 2022 19:36
@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 31, 2022

@shykes Updated this PR and added File export along the way (if ya don't recognize it, this PR used to just be Container export).

This should mark the completion of the new host API as proposed in #3401. Also added an error check + tests for trying to write outside of the workdir.

@vito vito added this to the Go SDK 0.4.0 milestone Oct 31, 2022
Comment thread core/container.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this API is different than the local dir export one in that it directly creates files and thus forces the engine to be local to the client.

From the discussion earlier, it seems like we might end up going down that route anyways, but this makes the decision for us if we keep the implementation as is (unless I'm missing something of course).

My best guess at long term solution here is services (can stream this back to clients over websocket), but depending on how we end up implementing file sync between clients and remote engine maybe this will end up using that same mechanism.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me all these host APIs are a stopgap until we figure out how the remote engine architecture should work and how files should be synced back and forth. It's pretty hard for me to make sense of any of this API in that world.

My mental picture is that the routing API will always be co-located with Buildkit, so for this particular piece of code it shouldn't make much difference between doing an os.Create here vs in Buildkit against the same filesystem, it's more just dealing with the fact that his export type gives us a stream instead.

But it's strange to think about how this code would work in that context and whether it's consistent with the rest because I feel like we'd want a fundamentally different interface based on filesync anyhow. :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm okay with merging this implementation for now. The only other options would be to get the exported file into buildkit's cache somehow and then do a local export of that. Probably possible but not worth it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I'm slow here... I want to make sure I understand.

  • Directory export works by receiving the buildkit exporter tar stream and unpacking it in the target host directory;
  • File export works by querying the file's contents from the buildkit state, then writing it to the target host file;
  • Both Directory and File export write the result to the client's host filesystem (ie. where the client code is running)
  • But @sipsma you're concerned that by using different buildkit APIs (llb state inspection vs. exporter), these two implementations may be affected differently by the future change to a remote engine architecture

I think I'm missing something as I don't understand the last part. Sorry if I'm missing something obvious.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Directory export works by receiving the buildkit exporter tar stream and unpacking it in the target host directory;
  • File export works by querying the file's contents from the buildkit state, then writing it to the target host file;
  • Both Directory and File export write the result to the client's host filesystem (ie. where the client code is running)

Yes (technically it's not a tar stream but rather buildkit's own diffcopy protocol, but same idea)

But @sipsma you're concerned that by using different buildkit APIs (llb state inspection vs. exporter), these two implementations may be affected differently by the future change to a remote engine architecture

I'm just pointing out that this API implementation is different in that nothing is streamed back to the client; we just os.Create the file in the engine code directly and presume that it's the same filesystem as the clients. So before we move to a remote engine we need to fix this as, unlike the other file export mechanism, it's a hard assumption on the engine being local. That's totally fine for now, I just wanted to make sure we're aware of this and tracking it (#3624)

@vito vito force-pushed the container-export branch from ad9a114 to af5e2f0 Compare October 31, 2022 23:38
@sipsma sipsma mentioned this pull request Oct 31, 2022
Comment thread core/container.go
defer out.Close()

return host.Export(ctx, bkclient.ExportEntry{
Type: bkclient.ExporterOCI,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaict this exporter type supports multiplatform too, so we should export it for each platform (same as here)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied everything from Publish into a private function shared by Export, but one caveat is that the multi-platform OCI images created by it aren't compatible with docker load since they don't contain a manifest.json.

As a compromise I've made it include a manifest.json if there's only one platform, which should be compatible with the previous behavior.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch on that issue, thanks!

Comment thread core/schema/file.graphqls Outdated
Comment on lines +22 to +23
"Write the file to a directory on the host"
export(path: String!): Boolean!
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get why we want to make path be a directory on the host (it's annoying to have to specify the filename all the time), but my initial intuition would have been that path should be set to the path where I want the file at, not the parent dir.

The problem is that if a user has that expectation and does something like export a file named "foo.txt" like this:

file.Export(ctx, "/dir/subdir/bar.txt")

The actual result will be that we create /dir/subdir/bar.txt/foo.txt even though they probably expected us to create /dir/subdir/bar.txt with the contents of foo.txt.

I don't have great solutions in the short term for this other than considering making it so path points to the path of the file rather than the parent dir. But I think at minimum we could be more explicit in the docstring here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I actually started with 'file or dir' but then realized there's no way for us to tell the user's intention (besides requiring a trailing / which is too subtle).

So I went with just directory, but now that you mention it exporting to a file path feels much more intuitive alongside the container export.

I think it can be done easily by specifying a new name for the file on the right-hand side of llb.Copy and just exporting it to the specified file path's parent dir.

Starting on that now!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s.: the only downside to this is kind of the opposite of what you mention: if someone does file.Export(ctx, ".") (targeting a dir) it won't work. I would expect some kind of 'is a directory' error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yikes, the Buildkit exporter actually seems to do a os.RemoveAll of the destination directory in order to replace it with a file. So that actually "works" but doesn't do NEARLY what the user wanted. 😱

I'll add a guard for this. Yet another thing to keep in mind with remote filesync. 🤔

Comment thread core/schema/container.graphqls Outdated
Copy link
Copy Markdown
Contributor

@sipsma sipsma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

vito added 5 commits November 2, 2022 00:51
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 3 commits November 2, 2022 00:53
Signed-off-by: Alex Suraci <alex@dagger.io>
also updated GraphQL descriptions

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
@vito vito force-pushed the container-export branch from 0a273ce to a06ff1f Compare November 2, 2022 04:53
vito added 2 commits November 2, 2022 11:11
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
@vito vito merged commit dfd214a into dagger:main Nov 2, 2022
@vito vito mentioned this pull request Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants