Cache Refactor and Improvements #710

varunshenoy · 2023-10-26T00:57:52Z

This PR adds the following features:

Caching with public S3 buckets
A refactored cache_warmer.py
Individual trusses can contain cached files from different cloud stores
Update docs to include information about S3
Alias model_cache to hf_cache.
models are now saved in app/model_cache instead of app/hf_cache

Tested the following on dev:

public gcs
private gcs
public s3
private s3
model_cache

varunshenoy · 2023-10-26T17:00:32Z

truss/contexts/image_builder/serving_image_builder.py

@@ -364,15 +371,17 @@ def create_vllm_build_dir(
    nginx_template = read_template_from_fs(TEMPLATES_DIR, "vllm/proxy.conf.jinja")

    data_dir = build_dir / "data"
-    credentials_file = data_dir / "service_account.json"
+    gcs_credentials_file = data_dir / "service_account.json"
+    s3_credentials_file = data_dir / "s3_credentials.json"
    dockerfile_content = dockerfile_template.render(


This is starting to get very unwieldy — is there a better way to format it?

what specifically did you have in mind here?

squidarth

overall looks good, thanks for making these changes. My main high-level feedback is that while we're at this, we should rename a couple other things:

HuggingFaceCache -> ModelCache
HuggingFaceCache.repo_id -> ModelCache.path

wdyt?

squidarth · 2023-10-30T19:57:50Z

truss/contexts/image_builder/cache_warmer.py

+        try:
+            proc = _download_from_url_using_b10cp(_b10cp_path(), url, dst_file)
+            proc.wait()
+        except Exception as e:


we should almost never do except Exception. Imagine we misspelled proc.wait() as proc.wai. This would throw an AttributeError that gets swallowed here, and it would be very hard to figure out. Let's instead enumerate the network-related errors that could happen here.
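A minimal sketch of the concern above, using a stand-in process object rather than the real b10cp call. The helper names (`FakeProc`, `download_broad`, `download_narrow`) are hypothetical, and the set of exception types the real download can raise is an assumption.

```python
import subprocess


class FakeProc:
    """Stand-in for the process object returned by the downloader (hypothetical)."""

    def wait(self) -> int:
        return 0


def download_broad(proc) -> str:
    # Anti-pattern: the typo `wai` raises AttributeError, which is swallowed
    # and reported as if the download itself had failed.
    try:
        proc.wai()
        return "ok"
    except Exception as e:
        return f"download failed: {type(e).__name__}"


def download_narrow(proc) -> str:
    # Only failures expected from launching or waiting on a process are handled.
    # This list of exception types is an assumption, not taken from the PR.
    try:
        proc.wait()
        return "ok"
    except (OSError, subprocess.SubprocessError) as e:
        return f"download failed: {type(e).__name__}"
```

With the narrow handler, a programming error such as a misspelled method name propagates with its own traceback instead of being disguised as a download failure.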

squidarth · 2023-10-30T20:05:01Z

truss/contexts/image_builder/cache_warmer.py

-    # open the json file
-    with open(file_path, "r") as f:
-        data = json.load(f)
+class RepositoryFile:


could you make this an abstract base class (https://docs.python.org/3/library/abc.html)
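For reference, a rough sketch of what an abstract base class could look like here. The constructor arguments follow the `create(repo_name, file_name, revision_name)` signature shown elsewhere in this review; the method name `download_to_cache` and the subclass `ExampleFile` are assumptions made up for illustration.

```python
from abc import ABC, abstractmethod


class RepositoryFile(ABC):
    def __init__(self, repo_name: str, file_name: str, revision_name: str) -> None:
        self.repo_name = repo_name
        self.file_name = file_name
        self.revision_name = revision_name

    @abstractmethod
    def download_to_cache(self) -> str:
        """Fetch the file and return the local path it was written to."""


class ExampleFile(RepositoryFile):
    # Hypothetical subclass: it only computes a destination path, no I/O.
    def download_to_cache(self) -> str:
        return f"/app/model_cache/{self.repo_name}/{self.file_name}"
```

Instantiating `RepositoryFile` directly raises `TypeError`, so a subclass that forgets to implement the abstract method fails as soon as it is constructed rather than at download time.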

squidarth · 2023-10-30T20:29:08Z

truss/contexts/image_builder/cache_warmer.py

+        self.is_private = True
+
+    @staticmethod
+    def create(repo_name, file_name, revision_name):


nit: I tend to prefer from over create

squidarth · 2023-10-30T20:30:13Z

truss/contexts/image_builder/cache_warmer.py

@@ -35,6 +36,23 @@ def _download_from_url_using_b10cp(
    )


+def parse_s3_service_account_file(file_path):


please add input & output types

squidarth · 2023-10-30T20:33:57Z

truss/contexts/image_builder/cache_warmer.py

+def parse_s3_service_account_file(file_path):
+    # open the json file
+    with open(file_path, "r") as f:
+        data = json.load(f)


nit: consider using something like python dataclass or pydantic to define the type.
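A possible shape for the two suggestions on this function (typed signature plus a dataclass), sketched with the standard library only. The `aws_secret_access_key` and `aws_region` keys appear in the diff; `aws_access_key_id` and the class name `S3Credentials` are assumptions.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class S3Credentials:
    aws_access_key_id: str  # assumed field, not visible in the diff excerpt
    aws_secret_access_key: str
    aws_region: str


def parse_s3_service_account_file(file_path: Path) -> S3Credentials:
    with open(file_path, "r") as f:
        data = json.load(f)
    # A missing key raises KeyError here, naming the offending field.
    return S3Credentials(
        aws_access_key_id=data["aws_access_key_id"],
        aws_secret_access_key=data["aws_secret_access_key"],
        aws_region=data["aws_region"],
    )
```

Callers then read `creds.aws_region` instead of unpacking a tuple or indexing a dict, and a type checker can flag misspelled attribute names.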

squidarth · 2023-10-30T23:05:23Z

truss/contexts/image_builder/cache_warmer.py

+        try:
+            proc = _download_from_url_using_b10cp(_b10cp_path(), url, dst_file)
+            proc.wait()
+        except Exception as e:


see below my note about using except Exception as e

squidarth · 2023-10-30T23:07:08Z

truss/contexts/image_builder/cache_warmer.py

+        cache_dir = Path(f"/app/model_cache/{self.bucket_name}")
+        cache_dir.mkdir(parents=True, exist_ok=True)
+
+        dst_file = Path(f"{cache_dir}/{self.file_name}")


I think dst_file = cache_dir / self.file_name or dst_file = cache_dir / Path(self.file_name) should work here
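A quick check of the equivalence suggested above. The bucket and file names are placeholder values.

```python
from pathlib import Path

bucket_name = "my-bucket"        # placeholder value
file_name = "weights/model.bin"  # placeholder value

cache_dir = Path("/app/model_cache") / bucket_name

# Original form: build a string, then wrap it in Path.
via_fstring = Path(f"{cache_dir}/{file_name}")

# Suggested form: join with the `/` operator; a plain str on the right is fine.
via_operator = cache_dir / file_name
via_path_operand = cache_dir / Path(file_name)
```

All three expressions produce the same path, so the explicit `Path(...)` around the right-hand operand is optional.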

squidarth · 2023-10-30T23:08:58Z

truss/contexts/image_builder/cache_warmer.py

-    aws_secret_access_key = data["aws_secret_access_key"]
-    aws_region = data["aws_region"]
+class GCSFile(RepositoryFile):
+    def connect(self, key_file="/app/data/service_account.json"):


what is the point of this connect function? could we do all of this as a part of download?

squidarth · 2023-10-30T23:12:11Z

truss/truss_config.py

@@ -502,6 +503,12 @@ def from_dict(d):
    def from_yaml(yaml_path: Path):
        with yaml_path.open() as yaml_file:
            raw_data = yaml.safe_load(yaml_file) or {}
+            if "hf_cache" in raw_data:
+                warnings.warn(


let's use logger instead of warnings here
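One way the logger-based version could look, shown on a plain dict instead of the real config loader. The function name `normalize_cache_key` and the message wording are assumptions; the sketch also assumes that `hf_cache` should simply be renamed to `model_cache`, overwriting any existing `model_cache` entry.

```python
import logging

logger = logging.getLogger(__name__)


def normalize_cache_key(raw_data: dict) -> dict:
    """Return a copy of raw_data with the legacy `hf_cache` key renamed to `model_cache`."""
    data = dict(raw_data)
    if "hf_cache" in data:
        # Message text is illustrative only.
        logger.warning("Found `hf_cache` in config; treating it as `model_cache`.")
        data["model_cache"] = data.pop("hf_cache")
    return data
```

Unlike `warnings.warn`, which by default shows a given warning only once per call location, `logger.warning` emits a record on every call and follows whatever logging configuration the application has set up.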

squidarth · 2023-10-30T23:13:38Z

truss/tests/test_config.py

-def test_null_hf_cache_key():
-    config_yaml_dict = {"hf_cache": None}
+def test_null_model_cache_key():
+    config_yaml_dict = {"model_cache": None}


In case I missed this, could we have a case where there's a yaml file with the key "hf_cache" and check that it parses out correctly?

varunshenoy · 2023-10-31T18:00:06Z

overall looks good, thanks for making these changes. My main high-level feedback is that while we're at this, we should rename a couple other things:

HuggingFaceCache -> ModelCache

HuggingFaceCache.repo_id -> ModelCache.path

wdyt?

Agree with HuggingFaceCache -> ModelCache, but think it might be better to keep repo_id since most folks are caching from Hugging Face anyways.

squidarth · 2023-10-31T22:01:04Z

Agree with HuggingFaceCache -> ModelCache, but think it might be better to keep repo_id since most folks are caching from Hugging Face anyways.

k - I think that's fine. Still technically makes sense with gcs & s3

squidarth · 2023-10-30T23:15:45Z

truss/contexts/image_builder/serving_image_builder.py

@@ -364,15 +370,17 @@ def create_vllm_build_dir(
    nginx_template = read_template_from_fs(TEMPLATES_DIR, "vllm/proxy.conf.jinja")

    data_dir = build_dir / "data"
-    credentials_file = data_dir / "service_account.json"
+    gcs_credentials_file = data_dir / "service_account.json"
+    s3_credentials_file = data_dir / "s3_credentials.json"


could we move these key files to constants?
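A small sketch of the constants idea. The two file names come from the diff; the constant names and the helper `credentials_files` are hypothetical.

```python
from pathlib import Path
from typing import Dict

# File names as they appear in the diff; constant names are made up here.
GCS_CREDENTIALS_FILENAME = "service_account.json"
S3_CREDENTIALS_FILENAME = "s3_credentials.json"


def credentials_files(data_dir: Path) -> Dict[str, Path]:
    """Return the expected credentials file path for each cloud store."""
    return {
        "gcs": data_dir / GCS_CREDENTIALS_FILENAME,
        "s3": data_dir / S3_CREDENTIALS_FILENAME,
    }
```

Defining the names once means the image builder, the cache warmer, and the templates can all refer to the same values instead of repeating string literals.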

squidarth · 2023-10-30T23:17:21Z

truss/contexts/image_builder/serving_image_builder.py

@@ -253,13 +257,13 @@ def fetch_files_to_cache(cached_files: list, repo_id: str, filtered_repo_files:
        repo_id = f"gs://{bucket_name}"

        for filename in filtered_repo_files:
-            cached_files.append(f"/app/hf_cache/{bucket_name}/{filename}")
+            cached_files.append(f"/app/model_cache/{bucket_name}/{filename}")
    elif repo_id.startswith("s3://"):


could you use the nice new classes that you made for this instead of a big if statement?
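A sketch of how the branching could move into classes. `RemoteCache.from_repo` and `prepare_for_cache` are names that appear later in this PR, but the signatures here are simplified (no `data_dir`), and the subclass names and bodies are assumptions. Hugging Face handling is left out because its cache layout is not shown in the diff.

```python
from abc import ABC, abstractmethod
from typing import List


class RemoteCache(ABC):
    prefix: str = ""

    def __init__(self, repo_id: str) -> None:
        self.repo_id = repo_id

    @staticmethod
    def from_repo(repo_id: str) -> "RemoteCache":
        # The only place that inspects the URI scheme.
        for cls in (GCSCache, S3Cache):
            if repo_id.startswith(cls.prefix):
                return cls(repo_id)
        raise ValueError(f"unsupported repo in this sketch: {repo_id}")

    @abstractmethod
    def prepare_for_cache(self, filenames: List[str]) -> List[str]:
        """Return the in-image paths the given files should be cached at."""


class BucketCache(RemoteCache):
    def prepare_for_cache(self, filenames: List[str]) -> List[str]:
        # "s3://bucket/sub/dir" -> "bucket"
        bucket_name = self.repo_id[len(self.prefix):].split("/", 1)[0]
        return [f"/app/model_cache/{bucket_name}/{name}" for name in filenames]


class GCSCache(BucketCache):
    prefix = "gs://"


class S3Cache(BucketCache):
    prefix = "s3://"
```

Call sites then reduce to `RemoteCache.from_repo(repo_id).prepare_for_cache(files)`, and supporting another store means adding a subclass rather than another `elif`.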

squidarth · 2023-10-30T23:17:36Z

truss/contexts/image_builder/serving_image_builder.py

@@ -364,15 +371,17 @@ def create_vllm_build_dir(
    nginx_template = read_template_from_fs(TEMPLATES_DIR, "vllm/proxy.conf.jinja")

    data_dir = build_dir / "data"
-    credentials_file = data_dir / "service_account.json"
+    gcs_credentials_file = data_dir / "service_account.json"
+    s3_credentials_file = data_dir / "s3_credentials.json"
    dockerfile_content = dockerfile_template.render(


what specifically did you have in mind here?

squidarth · 2023-10-31T22:03:18Z

truss/contexts/image_builder/cache_warmer.py

+        # Create S3 Client
+        bucket_name, _ = split_path(repo_name, prefix="s3://")
+
+        key_file = "/app/data/s3_credentials.jso"


typo? Also let's move this path to a constant

@varunshenoy how did this work before?

squidarth · 2023-10-31T22:05:11Z

truss/contexts/image_builder/cache_warmer.py

+    except ValueError as value_error:
+        raise RuntimeError(f"Failure due to an error: {value_error}")
+
+    except Exception as general_error:


I think it's ok to just let this issue throw, we don't need to catch it

squidarth · 2023-10-31T22:05:37Z

truss/contexts/image_builder/cache_warmer.py

            )
-        except FileNotFoundError:
+        except Exception as exc:


can we be more specific w/ this exception?

Feel free to do this in a follow-up, but let's try to be more specific here

squidarth · 2023-10-31T22:07:58Z

truss/contexts/image_builder/serving_image_builder.py


        config.build.arguments[
            model_key
-        ] = f"/app/hf_cache/{model_name.replace('gs://', '')}"
+        ] = f"/app/model_cache/{model_name.replace('gs://', '')}"


could we move this string into a function w/ comments? it's not clear to me why we need to do this transformation of the config object

This is very specifically for TGI/vLLM, where the model may be specified directly as an HF repo. If it's a GCS or S3 bucket, we want to alias that bucket to the cache and make sure the model server pulls from the cache instead of throwing an error.
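The transformation described above could be pulled into a commented helper along these lines. The function name `cached_model_path` is hypothetical, and handling `s3://` the same way as `gs://` is an assumption based on the explanation rather than on the diff, which only shows the `gs://` replacement.

```python
def cached_model_path(model_name: str) -> str:
    """Return the value to pass to the model server as its model argument.

    TGI/vLLM accept a Hugging Face repo id directly. A bucket URI is not
    something they can load, so it is rewritten to the directory inside the
    image where the cache step placed the bucket's files.
    """
    for prefix in ("gs://", "s3://"):
        if model_name.startswith(prefix):
            return f"/app/model_cache/{model_name[len(prefix):]}"
    # Anything else is assumed to be a Hugging Face repo id and is left alone.
    return model_name
```

For example, `cached_model_path("gs://my-bucket/llama")` gives `/app/model_cache/my-bucket/llama`, while a repo id such as `org/model` passes through unchanged.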

squidarth · 2023-10-31T22:09:43Z

truss/templates/copy_cache_files.Dockerfile.jinja

@@ -1,5 +1,5 @@
 {% for file in cached_files %}
-    {%- if credentials_exists %}
+    {%- if file.startswith("/app/model_cache/") %}


I think this would be a little bit cleaner if we could have these templates be more logicless. Is there something else that we can check here? It's not clear from reading this template file what the implications of the file being named /app/model_cache/ are

This is the cache copying mechanism. The HuggingFace files have a special root directory while other files do not. Let me think about this and get back.

It might make sense to keep the .startswith but just check for /app/ instead. If the file is relative to app we want to copy it to the same place.

"The HuggingFace files have a special root directory while other files do not. Let me think about this and get back."

This assumption is baked in here & implicit but not made explicit anywhere. In the future someone might wonder why we're doing this. Here's an example of an approach that makes this explicit instead of implicit:

    @dataclass
    class CachedFile:
        src: str
        dst: str

    cached_files = [
        # Huggingface files have a special root directory, while the others do not.
        # Being in app/model_cache implies that it is not a huggingface file
        CachedFile(src=... if file.startswith(...) , dst=...)
        for file in files
    ]

    data = {
        ...
        cached_files: cached_files
        ...
    }
    render_template(data)

squidarth

Just a couple more comments! Lmk when you've tested it again and i'll throw a ✅

squidarth · 2023-11-02T02:10:38Z

truss/contexts/image_builder/cache_warmer.py

+        raise RuntimeError(f"Failure due to file ({file_name}) not found: {file_error}")
+
+    except TimeoutError as timeout_error:
+        raise RuntimeError(f"Failure due to timeout: {timeout_error}")


For TimeoutError, OSError, and ValueError, why not just throw that exception (Instead of catching and reraising a RuntimeError)? I don't think the RuntimeError adds anything here
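To make the difference concrete, a small sketch using file reads as a stand-in for the download. Both function names are hypothetical.

```python
def read_wrapped(path: str) -> str:
    # Pattern questioned above: the original exception type is replaced
    # by a generic RuntimeError.
    try:
        with open(path) as f:
            return f.read()
    except OSError as os_error:
        raise RuntimeError(f"Failure due to an error: {os_error}")


def read_plain(path: str) -> str:
    # Letting the exception propagate keeps its type, so callers can still
    # write `except FileNotFoundError` or `except TimeoutError`.
    with open(path) as f:
        return f.read()
```

With `read_plain`, a missing file surfaces as `FileNotFoundError`; with `read_wrapped`, the caller only sees `RuntimeError` and has to parse the message to tell the cases apart.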

squidarth · 2023-11-02T02:11:53Z

truss/templates/copy_cache_files.Dockerfile.jinja

@@ -1,7 +1,3 @@
 {% for file in cached_files %}
-    {%- if credentials_exists %}


squidarth · 2023-11-02T02:12:58Z

truss/templates/cache.Dockerfile.jinja

 WORKDIR /app

 {% if hf_access_token %}
 ENV HUGGING_FACE_HUB_TOKEN {{hf_access_token}}
 {% endif %}
-{%- if credentials_exists %}
+{%- if gcs_credentials_exists %}


we repeat these magic file paths in a lot of places, I wonder if we could just pass in credentials here, and can just do:

COPY ./data/{{ credentials}} ...

instead of having branching logic? And then we can define these constants in one place

squidarth · 2023-11-02T02:14:31Z

truss/contexts/image_builder/serving_image_builder.py

+        self.revision = revision
+
+    @staticmethod
+    def from_repo(repo_name, data_dir):


please add types

squidarth · 2023-11-02T02:25:02Z

truss/contexts/image_builder/serving_image_builder.py

-            )
+            model_cache = RemoteCache.from_repo(repo_id, truss_dir / config.data_dir)
+            remote_filtered_files = model_cache.filter(allow_patterns, ignore_patterns)
+            local_cached_files += model_cache.prepare_for_cache(remote_filtered_files)


I think local_cached_files makes it seem like they are already cached. That hasn't happened yet, maybe files_to_cache?

squidarth

awesome work here! I think there's a little bit of cleanup we can do in the serving_builder, but let's try to get this in.

The only thing that I'd consider is moving the doc changes to a different PR if you want to merge this now. If we merge this now, the docs will automatically deploy and be incorrect until we push the new context builder.

So I'd say if you want to merge this now, let's move the doc changes to a different PR, else, we can merge on mon and do a new context builder

squidarth · 2023-11-03T22:59:30Z

truss/contexts/image_builder/cache_warmer.py

            )
-        except FileNotFoundError:
+        except Exception as exc:


Feel free to do this in a follow-up, but let's try to be more specific here

varunshenoy added 14 commits October 25, 2023 21:21

init s3 caching

0abbf6d

update toml to test on dev

cb09ff1

fix gcs tests + add s3 tests

4bfa10e

cleanup

d5108c7

add boto to deps

63814ae

update pyproject to include boto

079143a

bump dev

9360961

update poetry lock

b0346e1

public s3 buckets are working

72566fe

update dev

0079b24

bump rc

63b2765

refactored cache warmer

27206fc

public s3 + code refactors

52f1000

Merge branch 'main' into varun/cache-refactor

14d8838

varunshenoy changed the title ~~Varun/cache refactor~~ [WIP] Cache Refactor and Improvements Oct 26, 2023

can pull from public GCS now

332096a

varunshenoy commented Oct 26, 2023

varunshenoy added 5 commits October 26, 2023 17:10

clean up

53cf08c

filter by path for buckets

f227212

add model_cache aliasing

c28d5d3

new version

e9b4e66

Merge branch 'main' into varun/cache-refactor

1580819

varunshenoy marked this pull request as ready for review October 28, 2023 01:12

varunshenoy requested review from bolasim and squidarth October 28, 2023 01:12

varunshenoy changed the title ~~[WIP] Cache Refactor and Improvements~~ Cache Refactor and Improvements Oct 28, 2023

varunshenoy added 4 commits October 30, 2023 18:23

convert hf_cache to model_cache

0dc1396

add info callout on model_cache

c656675

Merge branch 'varun/cache-refactor' of https://github.com/basetenlabs/truss into varun/cache-refactor

820d1c6

add warning on hf_cache key

54a1d7d

varunshenoy added 2 commits October 30, 2023 19:26

add warning on hf_cache key

003723e

add warning on hf_cache key

bd80e1f

squidarth reviewed Oct 30, 2023

added some changes, still need to add typing

a3be3b5

renamed static file

9d2b9ee

varunshenoy requested a review from squidarth October 31, 2023 21:56

squidarth reviewed Oct 31, 2023

varunshenoy added 2 commits October 31, 2023 23:03

clean up constants

58a9eb9

Merge branch 'main' into varun/cache-refactor

4b9e446

varunshenoy requested a review from squidarth October 31, 2023 23:17

varunshenoy added 4 commits November 1, 2023 18:33

abstract out cache files

f13098d

clean up strings

4756491

refactor serving image builder

f7e5fbd

Merge branch 'main' into varun/cache-refactor

6caf737

squidarth reviewed Nov 2, 2023

varunshenoy added 4 commits November 3, 2023 21:21

added fixes

fbe76ed

refactor credentials

d016697

add typing

17efb6a

add typing

2e00983

varunshenoy requested a review from squidarth November 3, 2023 22:33

squidarth approved these changes Nov 3, 2023

varunshenoy added 3 commits November 6, 2023 16:47

bump dev

5bd9826

fixed credentials_to_cache, time to retest

8f6874c

bump toml

c1421bc

varunshenoy merged commit 1dc9be5 into main Nov 6, 2023
3 checks passed

varunshenoy deleted the varun/cache-refactor branch November 6, 2023 19:33
