Drive File Stream Quota Management #1093

Open · whitephoenix117 opened this issue Apr 9, 2020 · 19 comments

@whitephoenix117 commented Apr 9, 2020

I believe this will fit the requirements for a bug. I apologize ahead of time for the length.

Note: as I am not a Google engineer, I am taking some liberties with how exactly Drive File Stream works, based on my observations.

Edit: Added Case 3

Background

Google Drive File Stream allows you to "stream" files from Drive without having them synchronized locally. This is very helpful for managing disk space.

Normally when a file is accessed via Drive File Stream, Google's "magic" will download only the specific portions of the file that are required for the task at hand, similar to how a physical disk only accesses the sectors of data needed. For example, if you are viewing a 90-minute video file, Drive FS will only download the blocks related to the 2 minutes that your player is locally buffering, even when seeking to an arbitrary point within the video. Drive FS will then continue to download blocks as requested by the OS, just like a traditional disk.

Typically when a file is accessed by "streaming", Drive FS will create only a single download request for a series of blocks; as more blocks are needed, it will "resume" this download until the needed blocks have been downloaded and provided to the OS. This resume process repeats as needed.
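
A minimal Java sketch of what this looks like from the application side (the path, block size and seek offset are made up for illustration): sequential positioned reads from a single open channel, which a streaming filesystem like Drive FS can serve from one resumable ranged download.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class StreamingReadSketch {

    public static void main(String[] args) throws IOException {
        // Hypothetical path to a large video on a Drive File Stream mount
        Path video = Path.of("G:\\My Drive\\movies\\film.mp4");

        try (FileChannel ch = FileChannel.open(video, StandardOpenOption.READ)) {
            ByteBuffer block = ByteBuffer.allocate(1 << 20); // 1 MiB read buffer
            long position = ch.size() / 2;                   // seek to an arbitrary point

            // A player buffering ~10 MiB: consecutive reads from one open channel,
            // which the filesystem can satisfy by extending a single download.
            for (int i = 0; i < 10; i++) {
                block.clear();
                int read = ch.read(block, position);
                if (read < 0) break;
                position += read;
            }
        }
    }
}
```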

OK, so what's the problem?

Quotas.

For "security reasons" Google limits the number of "download requests" and if you exceed it they ban you for 24 hours. For security reasons google does not publish exactly what these limits are.

Cryptomator breaks Drive FS's ability to "resume" downloads, creating a massive number of requests to Google's servers, which will result in you getting banned.

Things I have noticed that trigger excessive download requests to Google:

- Browsing/scanning a large vault (1 download per file/folder/etc. to get basic metadata)
- Searching a vault (same reason as above)
- Opening/consuming large files

Summary

Google will ban you.

System Setup

Windows 10 64-bit
Cryptomator 1.4.15

Steps to Reproduce

Case 1

  1. Open a vault that is not stored locally (online only) or cached by Drive FS
  2. Check your download log from google
  3. You will see a unique download request for every file/directory you browsed inside the vault.
  4. Get banned by google

Case 2.

  1. Create an .ISO file container and fill it with a bunch of stuff, let's say 20,000 pictures.
  2. Store the ISO in the vault in Drive FS
  3. Ensure the ISO is in "online only" mode (no locally downloaded/cached copy)
  4. Mount the ISO using the mount tool of your choice
  5. Start a "slide show" viewing a new picture every few seconds
  6. Check your download log from google
  7. You will see a unique download request for each picture being shown, despite them being in a single file container (.ISO file)
  8. Get banned by google

Edit: Added case 3
Case 3:

  1. Load a video file into the vault
  2. Ensure the video is in "online only" mode (no locally downloaded/cached copy)
  3. Play the video
  4. Check your download log from google
  5. You will see a large number of individual download requests for the same file
  6. Get banned by google

Edit:
Here is an example of a Google access log; you can see there are multiple download requests per second for the same file, totaling ~1,400 in 20 minutes.

Expected Behavior

- 1 download "request" for each file accessed
- Metadata management/local caching to prevent file-explorer activities from triggering a download for each file/directory in the vault

Actual Behavior

Many, many, many download requests for each file accessed

Reproducibility

Always

Additional Information

Can provide on request

@whitephoenix117 whitephoenix117 added the type:bug Something isn't working label Apr 9, 2020
@overheadhunter
Member

Interesting observation, thanks for sharing this.

Case 1: Accessing multiple files (e.g. during a search operation)

I don't believe there is really anything we can do about it. Cryptomator is just the middleman between the process accessing files and the underlying file system. If a process decides to not just look at metadata but actually read from files, Cryptomator has to obey and access the corresponding ciphertext file, thus triggering a download.

Case 2: Accessing multiple blocks within a single large file

Of course there is no way to tell the underlying file system to "resume a download", since there is no API for this. However, we can investigate what our access pattern looks like.

It should be:

open file
read file
read file
read file
close file

It should not behave like:

open file
read file
close file

open file
read file
close file

open file
read file
close file
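
To make the difference concrete, a minimal Java sketch of both patterns (hypothetical helper methods, not actual Cryptomator code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AccessPatterns {

    // Desired: one open, many reads, one close -> one resumable download
    static void readInOneSession(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(32 * 1024);
            while (ch.read(buf) >= 0) {
                buf.clear(); // hand the data off, then reuse the buffer
            }
        }
    }

    // Undesired: reopening the file for every block -> one request per block
    static void readWithReopens(Path file, long size) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(32 * 1024);
        for (long pos = 0; pos < size; pos += buf.capacity()) {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                buf.clear();
                ch.read(buf, pos);
            } // channel is closed again after a single read
        }
    }
}
```

On a streaming backend, the first variant maps naturally to one resumable download, while the second forces a fresh request per block.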

Components we have to look at: fuse-nio-adapter (linux/mac), dokany-nio-adapter (win) and cryptofs. @cryptomator/libraries

@whitephoenix117
Author

Case 1:

Would it be possible to have an encrypted local DB (SQLite?) that would be able to store basic file information, to try and manage the impact of this? Of course this would need to be optional and treated more like a cache, to manage all the syncing-related issues/conflicts.

@overheadhunter
Member

How would you define "basic file information"?

For metadata like file name, size and modification date, it is already not necessary to download the file.
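
To illustrate the distinction, a minimal Java sketch (hypothetical path) of a metadata-only lookup, which does not open the file's content and therefore should not force a content download:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class MetadataOnly {

    public static void main(String[] args) throws IOException {
        // Hypothetical file inside a mounted vault
        Path file = Path.of("M:\\documents\\report.pdf");

        // An attribute lookup reads name, size and timestamps without
        // opening the file's content, so no content download is needed.
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        System.out.printf("%s: %d bytes, modified %s%n",
                file.getFileName(), attrs.size(), attrs.lastModifiedTime());
    }
}
```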

@infeo
Member

infeo commented Apr 9, 2020

I suggest transforming this issue into a feature request and optionally opening up a separate bug report to investigate the behaviour mentioned by @overheadhunter.

Cryptomator is first of all designed to access locally stored files. In this case it wouldn't be a problem if a requested file were downloaded as a whole when it is needed, because then you can make as many filesystem calls as you want.

As far as I know, cryptofs splits up read & write operations into chunks of a certain size (@overheadhunter please correct me if I'm wrong). If you want to read a big file as a whole, there will be a lot of single read operations in cryptofs, but in the end you get your file. An example using fictional values: let's say cryptofs splits up read operations into chunks of 32 KiB. Then a 1 GiB (= 1,048,576 KiB) file needs a fantastic 32,768 calls to be read.

Even if these values were real, on today's hardware this is not a problem when everything is stored locally, thanks to optimization. But with Drive File Stream you instead only get the chunk which you actually want to read:

Normally when a file is accessed via Drive File Stream, Google's "magic" will download only the specific portions of the file that are required for the task at hand,

This means that for each call a request is sent to the server, and it counts against the quota.

I don't know the exact chunk size. But by design we can't improve this situation, except by allowing a different chunk size to be used.

So, the crucial fact here is the number of filesystem calls. I know from the dokany-nio-adapter that for big files a lot of read requests are made. Another example:

The basic Dokan mirror example is used to mirror a directory which contains a file of size ~310,148 MB. When I copied this file to another location, 424 calls to the ReadFile function were made.
The dokany-nio-adapter is, like the name suggests, an adapter to fit the Dokan API to cryptofs. Therefore, in the example you made, at least 424 read calls go to the Drive File Stream driver. If and how this driver caches things is beyond my knowledge, but let us assume there is no optimization and all calls are translated into web requests. Comparing this number with the provided Drive File Stream log, this could well be the case.

Edit: Updated due to direct comment below.

@overheadhunter
Member

As far as I know, cryptofs splits up read & write operations into chunks of a certain size (@overheadhunter please correct me if I'm wrong). If you want to read a big file as a whole, there will be a lot of single read operations in cryptofs, but in the end you get your file.

This is not entirely true. CryptoFS creates a file channel when it is asked to create one, and closes it when it is asked to close it. Between those two events the requester can read from the file. This is normal I/O behaviour for any process.

The only thing CryptoFS does is read a bit more than requested, as it needs whole chunks in order to do the MAC checks. Due to chunk buffering, it won't read things twice, unless cache eviction happens.
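
For illustration only, a conceptual sketch of such chunk buffering with LRU eviction; this is not CryptoFS's actual implementation, and the loader is a stand-in for fetching and decrypting one ciphertext chunk:

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Conceptual sketch only, not CryptoFS's actual code: an LRU chunk cache so
// that several small reads inside the same chunk fetch it only once.
class ChunkCache {

    private final LongFunction<ByteBuffer> loader; // stand-in: fetch + decrypt one chunk
    private final Map<Long, ByteBuffer> cache;

    ChunkCache(int maxChunks, LongFunction<ByteBuffer> loader) {
        this.loader = loader;
        // Access-ordered map; the eldest (least recently used) chunk is
        // evicted once the cache holds more than maxChunks entries.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                return size() > maxChunks;
            }
        };
    }

    ByteBuffer get(long chunkIndex) {
        return cache.computeIfAbsent(chunkIndex, loader::apply);
    }
}
```

With such a cache, many small reads that land in the same chunk invoke the loader only once, until eviction.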

@whitephoenix117
Author

@infeo
Let's say cryptofs splits up read operations into chunks of 32 KiB. Then a 1 GiB (= 1,048,576 KiB) file needs a fantastic 32,768 calls to be read.

I am not sure how much you can vary the chunk size, but depending on the use case it will take a long time to get banned by Google; up to a few hours. If you could reduce the request count by 10x, this might be enough not to hit Google's limits.
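
A back-of-the-envelope sketch of that arithmetic, assuming (worst case) one download request per chunk read; the chunk sizes are made up:

```java
public class RequestCountEstimate {

    public static void main(String[] args) {
        long fileSize = 1L << 30; // 1 GiB
        // Worst case: every chunk read becomes one download request,
        // so the request count scales inversely with the chunk size.
        for (int chunkKiB : new int[] {32, 128, 512, 1024}) {
            long requests = fileSize / (chunkKiB * 1024L);
            System.out.printf("%5d KiB chunks -> %,6d requests%n", chunkKiB, requests);
        }
    }
}
```

Going from 32 KiB to 512 KiB chunks would already cut the request count by 16x, which is in the ballpark of the 10x reduction mentioned above.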

@overheadhunter
For metadata like file name, size and modification date it is already not required to download a file.

I am rapidly approaching the limit of my technical expertise. Whatever data is needed in order for Windows Explorer to list the files in a directory, perform a search, or for another application to do a library scan. This could vary greatly depending on the use case. Perhaps it could include the last accessed blocks of a file up to a certain size limit; hopefully this would be enough to keep certain requests local to the PC.

Making a generalization: I only use Drive FS as a sync/backup tool, but as the world continues to move to the cloud, I would expect more and more providers to adopt this streaming model, especially for enterprise. All providers will likely have these request caps to prevent abuse. Cryptomator's ability to support this type of streaming use case will likely become more and more relevant as time goes on.

@infeo infeo added type:feature-request New feature or request and removed type:bug Something isn't working labels Apr 9, 2020
@infeo
Member

infeo commented Apr 9, 2020

Cryptomator breaks Drive FS's ability to "resume" downloads, creating a massive number of requests to Google's servers, which will result in you getting banned.

What I can imagine is that Drive File Stream uses certain system features. On Windows, the filesystem can in some cases determine whether a file is in use by another program. Maybe Drive File Stream also has this ability and can continue streaming a file. Even with just some basic caching mechanism, it could detect that the same file is read twice.

@infeo
Member

infeo commented Apr 9, 2020

@whitephoenix117 Can you run similar tests with the Dokan mirror example? It would be interesting to see whether this application, using the Windows API directly, also quickly hits the limit.

I added the log of my test run with it, and it can be seen that the reads are mostly consecutive.

@whitephoenix117
Author

@infeo
Yes, I should be able to do some testing tonight.

I have tried copying files directly from the vault to a local location using Windows Explorer. In this case it triggers only a single download request to Google, and the file transfer rate is limited by your internet bandwidth, or whatever your system bottleneck is if you have fast internet.

@whitephoenix117
Author

whitephoenix117 commented Apr 9, 2020

@infeo

I think I got this correct, but I couldn't figure out how to get the debug version of Dokan to log. From the Google end it doesn't appear that it worked.

Here is the chain of virtualization levels: Drive FS --> Cryptomator --> Dokan Mirror

The file was a video, accessed through the M:\ directory. I played the first 2 minutes. Here is the Google access log:

(screenshot: Google access log)

@whitephoenix117
Author

What I can imagine is that Drive File Stream uses certain system features. On Windows, the filesystem can in some cases determine whether a file is in use by another program. Maybe Drive File Stream also has this ability and can continue streaming a file. Even with just some basic caching mechanism, it could detect that the same file is read twice.

According to their open source attribution, Drive FS uses Dokan/FUSE too.

@infeo
Member

infeo commented Apr 14, 2020

Ohh, I'm sorry, I was not totally clear. 🙈

I meant trying the mirror example without Cryptomator. Cryptomator uses Dokan to get an unencrypted view of your vault (the mounted drive). Mirror any directory on your File Stream drive and access it by e.g. streaming a movie file.

Here is a short instruction on how to use it:
Presumably, since you use Cryptomator 1.4.15, Dokan is already installed.

  1. Open a terminal and navigate to the Dokan installation:
    cd "C:\Program Files\Dokan\DokanLibrary-1.3.1\sample\mirror\"
  2. Start the Dokan mirror example. The following command mirrors a directory from your File Stream drive to M:\, with debug output enabled and redirected to the file dokanMirror.log on your Desktop:
    .\mirror.exe /r G:\oogle\File\Stream\Dir /l M /d /s > %userprofile%\Desktop\dokanMirror.log
  3. Stream the file (e.g. movie)
  4. End the program by hitting CTRL+C at least twice in the terminal window.
  5. Upload the log here.

Regarding the open source attribution: interesting! But I think stacking these drivers into each other should not cause a problem.

@whitephoenix117
Author

@infeo
OK, it looks like there were still a lot of access requests using the mirror, but I'm not sure there were as many as when using Cryptomator directly.

@whitephoenix117
Author

@infeo

Anything else I can do to help with troubleshooting for this?

@infeo infeo added this to the Backlog milestone Jun 9, 2020
@infeo
Member

infeo commented Jun 9, 2020

Not that I know of. This feature is not very high on the priority list, so don't expect results soon.

@whitephoenix117
Author

Not that I know of. This feature is not very high on the priority list, so don't expect results soon.

Thanks. I understand you set your priorities based on impact and the number of affected users, and this one is not very high. Let me know if there is anything I can do to contribute.

@whitephoenix117
Author

I'm not sure this is especially useful for troubleshooting, since the integration is completely different, but it appears Mountain Duck (https://mountainduck.io/) is a workaround for this issue. I am currently doing some more testing to confirm.

@dosentmatter
Copy link

@whitephoenix117, did you reach a conclusion on Mountain Duck? Does it let you stream files? Does it have quota issues?

@whitephoenix117
Author

whitephoenix117 commented Sep 27, 2021 via email

@infeo infeo modified the milestones: Backlog, 1.7.0 Mar 2, 2023