
Use BoltDB to map "Names" to Files #8

Open
djherbis opened this issue Mar 2, 2016 · 16 comments

Comments

@djherbis
Owner

djherbis commented Mar 2, 2016

The default FileSystem has to do some ugly things to map the "key/name" into a filename. I should just use a Key/Value store like BoltDB.
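Roughly what I have in mind (just a sketch; the "names" bucket and helper names are placeholders, not a final design):

// Sketch only; assumes the github.com/boltdb/bolt package is imported as bolt.
func rememberName(db *bolt.DB, key, filename string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("names"))
		if err != nil {
			return err
		}
		return b.Put([]byte(key), []byte(filename))
	})
}

func lookupName(db *bolt.DB, key string) (name string, err error) {
	err = db.View(func(tx *bolt.Tx) error {
		if b := tx.Bucket([]byte("names")); b != nil {
			name = string(b.Get([]byte(key)))
		}
		return nil
	})
	return name, err
}

The point is that the key survives verbatim in the db, so it can be recovered without decoding anything from the filename.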

@amozoss
Contributor

amozoss commented Mar 2, 2016

As far as caching is concerned, it doesn't really care about what the original name was. Why do you need to know the name at all?

Seems like md5.Sum of the path combined with the cache root is all that is needed for all file operations.

// Assuming "crypto/md5", "fmt", "os" and "path/filepath" are imported;
// md5.Sum returns a [16]byte, so hex-encode it before using it as a filename.
md5sum := fmt.Sprintf("%x", md5.Sum([]byte("path/to/my/file")))
fpath := filepath.Join(fs.root, md5sum)
os.Create(fpath)
os.Open(fpath)
os.Stat(fpath)
os.Remove(fpath)

@djherbis
Owner Author

djherbis commented Mar 2, 2016

That's only the start of the ugliness I do in there.

It basically says: if the key can be base64 encoded and fit in a std. filename then do that (then the key is decodable from the filename).

Otherwise, hash the key and use that as the filename. Then create another file with the name "${hash}.key" and write the key to that file so it can be restored later.
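Roughly something like this (a sketch of the scheme described above, not the actual fscache code; keyToFilename and the 255-byte limit are illustrative):

// Sketch only; assumes "crypto/md5", "encoding/base64", "fmt", "os" and "path/filepath".
func keyToFilename(root, key string) (string, error) {
	if enc := base64.URLEncoding.EncodeToString([]byte(key)); len(enc) <= 255 {
		return filepath.Join(root, enc), nil // key is recoverable by base64-decoding the filename
	}
	// Too long: hash the key, and persist the original key in "${hash}.key" so it can be restored later.
	hash := fmt.Sprintf("%x", md5.Sum([]byte(key)))
	if err := os.WriteFile(filepath.Join(root, hash+".key"), []byte(key), 0600); err != nil {
		return "", err
	}
	return filepath.Join(root, hash), nil
}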

I'd much rather have a nice clean, key/value store for managing which keys map to which files.

@ghost

ghost commented Jan 28, 2018

Did this go anywhere? I am using this in a prototype to build multi-topology caches.
The cache code runs on many server layers and finally in the browser.

djherbis added a commit that referenced this issue Jan 28, 2018
@djherbis
Owner Author

Submitting a sample version (very rough), but boltdb is incredibly slow on my windows box (boltdb/bolt#516), so I don't know if it's worth including.

@ghost

ghost commented Jun 1, 2018

Badger will get you better perf. It's often used instead of bolt.
Also, for files you might want to consider minio as your file store?

@djherbis
Owner Author

djherbis commented Jun 3, 2018

Badger seems like a neat idea to use instead of bolt for managing the key db (certainly open to pull requests to add new "FileSystems" etc. though I might consider adding a subpkg. or something for holding extra ones besides the basic set).

I've certainly considered trying to build an FS which uses a distributed/cloud-based file system; the main issues I've run into are:

  1. Most cloud filesystems don't show the object as 'existing' until it's completely written; one of the major features of fscache is that you can read data while it's still being written.

  2. Currently, fscache uses sync.Cond signalling to notify readers when there is more data to read (see the sketch below). This requires that everything runs within the same executable, which definitely makes it impossible for the cache to work across multiple running instances/tasks.
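For context, a stripped-down sketch of that sync.Cond pattern (illustrative only, not fscache's actual internals):

// Sketch only; assumes "sync". A writer appends bytes and broadcasts, readers wait for more data.
type stream struct {
	mu   sync.Mutex
	cond *sync.Cond // created with sync.NewCond(&s.mu)
	buf  []byte
	done bool
}

func (s *stream) write(p []byte) {
	s.mu.Lock()
	s.buf = append(s.buf, p...)
	s.mu.Unlock()
	s.cond.Broadcast() // wake readers blocked in waitForData
}

func (s *stream) waitForData(off int) {
	s.mu.Lock()
	for len(s.buf) <= off && !s.done {
		s.cond.Wait() // releases mu while waiting, reacquires before returning
	}
	s.mu.Unlock()
}

Since the Cond lives in process memory, a reader in another process never sees the Broadcast, which is why the cache can't be shared across instances as-is.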

I'd like to build a new library actually that takes advantage of:

  • abstracting over distributed fs (so backends like minio/aws/google cloud storage are an option)
  • using a key/value store w/ distributed notifications (instead of signals), so that readers in different instances can access the same data (the cache can be shared between tasks).

@deluan

deluan commented Apr 15, 2020

Hey @djherbis, I didn't get why you can't use an md5 hash of the key as the filename and avoid having a key file. That would also help with a potential issue of having too many files in one folder.

Ex:
Key is abc123.300x300.jpg
Hash would be: b6c59a0370c107676c1cb457ccf02310
File in cache would be:
b6/c5/b6c59a0370c107676c1cb457ccf02310

This way, every time you need to do a lookup for some key, you'd need to do something like this:

// Assuming "crypto/md5" and "fmt" are imported.
key := "abc123.300x300.jpg"
hash := fmt.Sprintf("%x", md5.Sum([]byte(key)))
filePath := fmt.Sprintf("%s/%s/%s", hash[0:2], hash[2:4], hash)

Thoughts?

@djherbis
Owner Author

The main issue was with Filesystem.Reload which needs to produce the key from the name.

That being said, I might be able to drop that for a Filesystem impl, depending on which features that degrades.

What's the number-of-files-in-a-folder limitation you're seeing? I thought the limit on that was pretty high.

@deluan

deluan commented Apr 15, 2020

I haven't experienced any slowdowns based on the number of files (yet), but this is a notorious thing to avoid when storing data in the filesystem. A good example is Git, which implements a scheme similar to what I proposed.

For my use case (the music server) I can get to as many as 2-10 times the number of music files in a library (one file for each image size requested by the clients). For my personal library of 40K songs, this would be 400K files. I have a user with 150K+ songs, which would yield 1.5 million files!

I was looking into Reload; can't the Filesystem implementation just return the hash identifiers for the files, instead of a key/name pair? That way, you convert the key to the hash before calling the Filesystem methods (Create, Delete, etc...)

@djherbis
Owner Author

I think the main problem is that things like expiration run over all the keys, and we don't know what all the keys are if we can't reload them. But that may not be all that important.

Good to know about files per dir and performance, I'll look into another Filesystem implementation.

@deluan

deluan commented Oct 24, 2020

Hey @djherbis, I took a stab at creating a FileSystem with the structure I proposed above, storing only the files and not the keys: https://github.com/deluan/navidrome/blob/spread-fscache/core/cache/spread_fs.go

Without making any changes to the fscache lib, I had to make some assumptions, and it may feel a bit... hacky? Let me know what you think.

@djherbis
Owner Author

djherbis commented Oct 24, 2020

@deluan I took a quick look, I think the problem is that your Reload method returns different keys than were passed to Create. This means when you call cache.Get with a key that should already exist (from Reload), the cache won't think it has it (the map will contain the hashed paths, instead of the passed in key) and you'll get a cache miss on things that should have been reloaded.

Now I'm thinking of a simpler solution though: I think I could lift the key-hashing logic into the cache map. Then the cache map keys and the filenames would match, instead of having this silly conversion logic.

I might run some tests to verify backwards compatibility, or just introduce a feature to control the hashing and a boring filesystem.

@deluan

deluan commented Oct 24, 2020

You are right, it was late at night and I overlooked the key mapping... I was trying to do two things with this filesystem:

  1. Spread the files to avoid overpopulating the cache dir (I have a use case with 2 million+ files)
  2. Get rid of the key file

Well, it looks like item 2 is unattainable without changes to the fscache library. Couldn't we just move the key mapping to the FileSystem? Maybe add Traverse(func) (to be used by the cache eviction routines) and Exist(key) to the FileSystem's interface, and change Open(), Stat() and Remove() to accept keys instead of filenames (a rough sketch follows below). With this, we wouldn't need to load all keys in memory, reducing memory requirements. Of course this is based on my understanding of the internals of the library :D
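Something along these lines is what I have in mind (names and signatures are just my guess, not fscache's actual API):

// Hypothetical shape only; assumes "io" and "os" for the return types.
type FileSystem interface {
	Create(key string) (io.WriteCloser, error)
	Open(key string) (io.ReadCloser, error)
	Stat(key string) (os.FileInfo, error)
	Remove(key string) error
	Exist(key string) bool
	// Traverse visits every cached key, for use by the eviction routines,
	// without the cache having to hold all keys in memory.
	Traverse(fn func(key string) error) error
}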

Anyway, I added a key mapping, similar to the original FS, to my SpreadFS, seems to be working now. Would really appreciate if you could take another look: https://github.com/deluan/navidrome/blob/spread-fscache/core/cache/spread_fs.go

Again, thanks for your awesome library!

@djherbis
Owner Author

From a very brief look at your updates, I think your FS should work (though testing is of course the best way to verify).

Key mapping does already happen in the filesystem.
The Open/Stat/Remove methods actually accept whatever is returned by a Filesystem.Create'd File.Name() method. Since you control that File interface, you could theoretically make this whatever you want (key or filename); other Filesystem operations just need to work on whatever you choose.

Not sure what you mean about not needing to load all keys into memory; we do need an internal map of [key] => [fileStream] at least for running streams, because that's how we're able to internally look up the stream.Stream which lets you read while it is being written.

I think signature-wise Reload/Traverse would actually be the same, as both iterate the FS and report all (key, filename) pairs.
However, I'm not sure of the race conditions that might arise during such a traversal on a running cache, and the possible performance implications (iterating the map is very fast, iterating the filesystem, probably not).

I considered allowing a cache miss in memory to check the filesystem and load it 'on demand'. I suppose we'd need to be able to do a spot-check on that load to verify it hasn't expired, if that's even possible (depends on how the expiration system works).

I'm still gonna give a shot at my other idea of moving the key production up a level (optionally), since that would solve "getting rid of the key file".

Thanks for the appreciation for this library :) I'm really amazed how much value people get from it, because looking back at it 4-5 years later I'd probably have built it differently from the ground up, but I'm still happy it's useful even if the API is fairly messy imho.

Someday I'll get around to building a v2 or new package redesigned from the ground-up that has a cleaner API, and hopefully support for more of the cool 'cloud' use cases that are more relevant today.

@djherbis
Owner Author

I pushed some changes which should allow this kind of control, dumb example in the tests.

I'll expand a bit here:

// Create the FileSystem which doesn't mess with the filenames it is given:
fs, _ := NewFs("./cachex", 0700)
fs.EncodeKey = IdentityCodeKey // In the Filesystem, Create(key) creates a File named key.
fs.DecodeKey = IdentityCodeKey // Since File.Name() == key, just returns File.Name().

// Create a Cache with that Filesystem:
ck, _ := NewCache(fs, NewReaper(time.Hour, time.Hour))

// Set the function which is used in the cache to create cache keys from the keys given to things like Get/Exists etc.
// This function essentially ends up creating the Filename when combined with the above FileSystem changes.
ck.SetKeyMapper(func(key string) string {
	// This is just a boring example which moved the EncodeKey from the Filesystem into the Cache.
	name, _ := B64OrMD5HashEncodeKey(key)
	return name
})

I think it would take a little more messing with EncodeKey/DecodeKey/SetKeyMapper to get a set of functions that will work correctly going forward but can also load existing cached files. But I think this is enough for a Saturday :)

@djherbis
Owner Author

^ note I think your new FS is just a special case of this: if you want to remove the keyfiles, you can just do the above but set your KeyMapper function to your function which produces the /xx/xx/xxxxxxxx hashed filenames.
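e.g. something like this (sketch only; it assumes your FileSystem creates the intermediate directories, and reuses md5 like your earlier example):

// Sketch only; assumes "crypto/md5", "fmt" and "path/filepath".
ck.SetKeyMapper(func(key string) string {
	hash := fmt.Sprintf("%x", md5.Sum([]byte(key)))
	// Produces names like b6/c5/b6c59a0370c107676c1cb457ccf02310,
	// so no single directory ends up with millions of entries.
	return filepath.Join(hash[0:2], hash[2:4], hash)
})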
