Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement external data support for Truss #267

Merged
merged 3 commits into from
Apr 14, 2023
Merged

Conversation

pankajroark
Copy link
Collaborator

@pankajroark pankajroark commented Apr 14, 2023

Truss now allows specifying external data. This allows offloading Truss's data, e.g. model weight, to a remote storage and reference that in the Truss. This is mainly targeted at large blobs, which are problematic to keep in git. There are several solutions for managing large files outside of git, such as git-lfs and DVC, and those can still be used. But by providing in-built support for this in Truss, necessary dependency on additional tools can be avoided.

Example of specifying external data

external_data:
- at: test.txt
  backend: http_public
  url: https://my.website/test.txt

This is an experimental feature at this point and is limited in functionality. Currently, only public URLs available over http/https are supported. The implementation is extensible and more backends such as S3 and baseten would very likely be added soon.

Currently, external data links are resolved at docker build time. But, Truss's model code would not depend on that. Protocol is that these blobs will be made available at the specified location in the data directory passed to the Truss model instance. Bundling into the docker image is current mechanism. This can be changed e.g. to download this data on container start, without affecting the operation of Truss model code.

There's some refactoring in this change as well. The util.py was becoming too much of a grab-bag, it has now been broken into more specific files.

There's an integration to test the flow end to end. Want to send this out for review early, while I write the unit tests.

  • Write unit tests

url: str

# This should be relative path. This is where the remote file will be downloaded.
at: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me what "at" means, maybe "destination", or "local_destination"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, how about local_data_path, to make it clear it's relative to the data directory?

name: Optional[str] = None

@staticmethod
def from_dict(d: Dict[str, str]) -> "ExternalDataItem":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[absolutely non-blocking] I don't know if we've discussed this in the past -- have we considered using pydantic instead of dataclasses for cases like this? You get built-in validations / runtime type checking so we wouldn't have to do these types of checks ourselves.

This might be a bigger change, and I know we probably want to be cautious about adding new dependencies to Truss, but wanted to throw this out there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion. Yes, I think the config is getting complex enough that it's worth looking into that. It seems that pydantic is really lean:

(bin) @pankajroark ➜ /workspaces/foo $ pip show pydantic
Name: pydantic
Version: 1.10.7
Summary: Data validation and settings management using python type hints
Home-page: https://github.com/pydantic/pydantic
Author: Samuel Colvin
Author-email: s@muelcolvin.com
License: MIT
Location: /workspaces/foo/bin/lib/python3.9/site-packages
Requires: typing-extensions
Required-by: 

I'll look into moving to pydantic as a follow-up.

@pankajroark pankajroark merged commit 4a1fbee into main Apr 14, 2023
3 checks passed
@pankajroark pankajroark deleted the pg/external_data_review branch April 14, 2023 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants