
Performance characteristics of absolute path vs relative path (and excessive HEAD requests) #500

Closed
apanloco opened this issue Sep 5, 2023 · 1 comment
Labels
question Further information is requested

Comments


apanloco commented Sep 5, 2023

Mountpoint for Amazon S3 version

mount-s3 1.0.0

AWS Region

eu-central-1

Describe the running environment

Running in EC2, Debian 12, using instance profile credentials against an S3 Bucket in the same account

What happened?

Hello fellow coders!
I'm trying to understand the behaviour and performance characteristics of mountpoint-s3.

To demonstrate a potential issue I have a screenshot showing:

  1. I read the first byte of a file in the bucket, using a relative path (the file is in my current working directory).
  2. I hit Enter in both windows, so it's easy to see which logs each command generated.
  3. I read one byte of the same file, but using an absolute path.
[screenshot: two terminal windows showing the commands and the corresponding Mountpoint logs]
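The comparison above can be reproduced with a short script. This is a sketch under assumptions: the file name `testfile` is hypothetical, and a throwaway local file stands in for an object in the bucket; on a real mount you would point both paths at a file inside the Mountpoint mount instead.

```python
import os
import tempfile
import time

def time_first_byte(path):
    """Return the wall-clock seconds taken to open `path` and read one byte."""
    start = time.monotonic()
    with open(path, "rb") as f:
        f.read(1)
    return time.monotonic() - start

# Hypothetical setup: a local temp file stands in for an object in the
# mounted bucket, so this script runs anywhere.
old_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "testfile"), "wb") as f:
        f.write(b"hello")
    os.chdir(d)
    try:
        rel = "testfile"                 # relative path: file is in the cwd
        abs_path = os.path.abspath(rel)  # absolute path to the same file
        print(f"relative: {time_first_byte(rel) * 1000:.3f} ms")
        print(f"absolute: {time_first_byte(abs_path) * 1000:.3f} ms")
    finally:
        os.chdir(old_cwd)
```

On a local filesystem the two timings are effectively identical; the gap only appears on a Mountpoint mount, where each path component costs extra requests.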

A few observations from the logs in the screenshot:

  • Using an absolute path is roughly 40% slower than using a relative path (i.e. the current working directory): 243 ms vs 175 ms.
  • For the relative path, there are 7 connections established for 7 HTTP commands.
  • For the absolute path, there are 10 connections established for 11 HTTP commands.

I wonder about all the HEAD requests -- does this look correct? I have a feeling it's not optimal.
I also wonder why Mountpoint traverses each component of the path like this, instead of just reading the file when it has the absolute path. The deeper the path, the longer it takes to read the byte.

I'd appreciate any thoughts and input on this.

Relevant log output

The log file with --debug and --debug-crt for one of the commands was too long to attach. Any guidance on which parts are relevant, and I'll attach them.
@apanloco apanloco added the "bug (Something isn't working)" label Sep 5, 2023
@passaro (Contributor)

passaro commented Sep 5, 2023

Hi @apanloco, thanks for the detailed issue. The short answer is that the number of requests you are seeing is expected. More details below.

But first, a few suggestions for reading the logs. To better understand why Mountpoint is making a request, you may want to expand your log filter with e.g.:

  • "fuser": this will show you the FUSE requests Mountpoint received from the kernel, e.g.

    2023-09-05T09:25:11.726409Z DEBUG fuser::request: FUSE( 76) ino 0x0000000000000003 LOOKUP name "00"

  • "new request": this marks the start of each S3 client request, decorated with the FUSE command triggering it, e.g.

    2023-09-05T09:39:55.933864Z DEBUG lookup{req=114 ino=1 name="00"}:list_objects{id=73 bucket="<redacted>" continued=false delimiter="/" max_keys="1" prefix="00/"}: mountpoint_s3_client::s3_crt_client::list_objects: new request

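Applying those two filters can be sketched as a small script. This is a hypothetical helper, not part of Mountpoint; the keyword strings are the ones suggested above, and the script name `filter_log.py` is an assumption.

```python
import sys

# Keywords suggested above: "fuser" marks kernel FUSE requests,
# "new request" marks the start of each S3 client request.
KEYWORDS = ("fuser", "new request")

def filter_log(lines):
    """Return only the log lines containing one of the keywords."""
    return [line for line in lines if any(k in line for k in KEYWORDS)]

if __name__ == "__main__":
    # Usage (hypothetical): python filter_log.py < mountpoint.log
    sys.stdout.writelines(filter_log(sys.stdin))
```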

To explain the requests Mountpoint makes to S3, consider that, when resolving a path, the kernel will ask Mountpoint to look up each of its components and return their inode number (ino) and associated stats.

For each lookup, Mountpoint will issue 2 simultaneous requests, HeadObject and ListObjects, to determine whether the name maps to a file or a sub-directory. This is required to implement directory shadowing (described here: https://github.com/awslabs/mountpoint-s3/blob/main/doc/SEMANTICS.md#directories). We are considering potential optimisations here: #12.
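The arithmetic behind the request counts can be sketched as follows. This is a back-of-the-envelope model of the behaviour described above (two concurrent requests per uncached path component); it ignores kernel caching, the GetObject for the actual read, and any requests for the mount point itself.

```python
def estimated_lookup_requests(path):
    """Estimate S3 requests needed to resolve `path` with cold caches:
    each component costs one HeadObject plus one ListObjects."""
    components = [c for c in path.strip("/").split("/") if c]
    return 2 * len(components)

# A file in the current working directory needs a single lookup; every
# extra directory level adds another HeadObject/ListObjects pair.
print(estimated_lookup_requests("file.bin"))        # → 2
print(estimated_lookup_requests("a/b/c/file.bin"))  # → 8
```

This is why an absolute path is slower than a relative one for a single-byte read: the cost grows linearly with path depth.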

The result of lookups can be cached in the kernel for a certain period of time. The longer the expiration time, the fewer repeated Head and List requests are required, with the drawback of returning potentially stale information when the content of the S3 bucket changes. To reduce consistency issues, Mountpoint only caches metadata for up to 1 second and invalidates it on certain operations, like open (see here: https://github.com/awslabs/mountpoint-s3/blob/main/doc/SEMANTICS.md#consistency-and-concurrency).
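The TTL trade-off can be illustrated with a toy cache. This is a sketch of the general idea only, not Mountpoint's actual implementation (the lookup caching happens in the kernel); the class name and stats layout are invented for illustration.

```python
import time

class MetadataCache:
    """Toy TTL cache: longer TTLs mean fewer repeated Head/List requests,
    at the cost of potentially stale results."""

    def __init__(self, ttl=1.0):
        self.ttl = ttl
        self.entries = {}  # name -> (stats, expiry timestamp)

    def put(self, name, stats):
        self.entries[name] = (stats, time.monotonic() + self.ttl)

    def get(self, name):
        entry = self.entries.get(name)
        if entry and time.monotonic() < entry[1]:
            return entry[0]           # fresh: no Head/List round-trip needed
        self.entries.pop(name, None)  # expired or missing: caller must re-lookup
        return None

    def invalidate(self, name):
        # Mirrors invalidation on operations like open(), to limit staleness.
        self.entries.pop(name, None)

cache = MetadataCache(ttl=1.0)
cache.put("testfile", {"ino": 3, "size": 5})
print(cache.get("testfile") is not None)  # → True (entry still fresh)
```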

With this in mind, the difference in performance between relative and absolute paths is not surprising, especially when measuring only the time to read a single byte from a file; high-throughput workloads will not be affected in the same way. That said, if you have a specific use case that is negatively impacted by the current behavior, we would be happy to hear about it.

@passaro passaro added the "question (Further information is requested)" label and removed the "bug (Something isn't working)" label Sep 5, 2023
@passaro passaro closed this as completed Sep 11, 2023