-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement external data support for Truss #267
Conversation
truss/truss_config.py
Outdated
url: str | ||
|
||
# This should be relative path. This is where the remote file will be downloaded. | ||
at: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious to me what "at" means, maybe "destination", or "local_destination"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, how about local_data_path, to make it clear it's relative to the data directory?
name: Optional[str] = None | ||
|
||
@staticmethod | ||
def from_dict(d: Dict[str, str]) -> "ExternalDataItem": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[absolutely non-blocking] I don't know if we've discussed this in the past -- have we considered using pydantic instead of dataclasses for cases like this? You get built-in validations / runtime type checking so we wouldn't have to do these types of checks ourselves.
This might be a bigger change, and I know we probably want to be cautious about adding new dependencies to Truss, but wanted to throw this out there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion. Yes, I think the config is getting complex enough that it's worth looking into that. It seems that pydantic is really lean:
(bin) @pankajroark ➜ /workspaces/foo $ pip show pydantic
Name: pydantic
Version: 1.10.7
Summary: Data validation and settings management using python type hints
Home-page: https://github.com/pydantic/pydantic
Author: Samuel Colvin
Author-email: s@muelcolvin.com
License: MIT
Location: /workspaces/foo/bin/lib/python3.9/site-packages
Requires: typing-extensions
Required-by:
I'll look into moving to pydantic as a follow-up.
Truss now allows specifying external data. This allows offloading Truss's
data
, e.g. model weight, to a remote storage and reference that in the Truss. This is mainly targeted at large blobs, which are problematic to keep in git. There are several solutions for managing large files outside of git, such as git-lfs and DVC, and those can still be used. But by providing in-built support for this in Truss, necessary dependency on additional tools can be avoided.Example of specifying external data
This is an experimental feature at this point and is limited in functionality. Currently, only public URLs available over http/https are supported. The implementation is extensible and more backends such as S3 and baseten would very likely be added soon.
Currently, external data links are resolved at docker build time. But, Truss's model code would not depend on that. Protocol is that these blobs will be made available at the specified location in the data directory passed to the Truss model instance. Bundling into the docker image is current mechanism. This can be changed e.g. to download this data on container start, without affecting the operation of Truss model code.
There's some refactoring in this change as well. The
util.py
was becoming too much of a grab-bag, it has now been broken into more specific files.There's an integration to test the flow end to end. Want to send this out for review early, while I write the unit tests.