Feature proposal: Song cache #75

Cephian · 2022-02-03T03:48:20Z

We tend to run !list more than once per bust, to do things like get Google Form info, make sure everything looks good, etc. We also run it a lot locally when testing. On a large number of songs there can be significant delay and bandwidth consumption from downloading all of the files. It would be nice to have a feature which allows songs to be "cached" so that runs of !list occurring close in time don't both need to download everything.

Despite my feature proposal being sort of long, this wouldn't actually be all that complicated to code. It just involves a lot of arbitrary choices for which I'd like to make recommendations, and I want to hear what others think before implementing it.

More concrete feature proposal:

Save file <discord_filename> as <discord_message_id>.<attachment_num>.<discord_fileext>
- We need <attachment_num> because discord messages can have multiple attachments
- Discord message ids are universally unique
- Previously we needed unique file ids per-channel and chose <index>.<discord_filename>
  - Now we need them globally unique, someone could delete a message with an old version of their song and resend it for example
  - Our current choice actually has a bug! Discord seems to accept 255-length filenames which is is the max filename length on most common filesystems (more or less) so we'd need to truncate very long filenames, ruining our claim to perfect archival through filenames anyways
On each !list:
1. When scraping channel for files, first check to see if cached version exists in attachments/ (as <message_id>.<file_ext>)
2. If it doesn't exist, download it
3. Set the modified time of the file to now
4. Keep only the max(X, # of songs in current !list) newest files and delete all others, where X is say 50?
- I previously argued that I was against indiscriminately deleting files in the attachment folder mainly because "delete all files in X directory" is never a piece of code you want to have a bug with. But I think in this case it's warranted.
- We could also do something like "delete all files older than 24 hours", but I don't like this because then we can't prerun !list at any point during the two weeks and it stays cached
- We could also do something like "delete the oldest files until total filesize is under a certain amount OR only bust files remain" but unless we're extremely space limited this seems like overkill

The text was updated successfully, but these errors were encountered:

Cephian · 2022-02-03T03:49:43Z

@anoadragon453 Let me know what you think of this when you get the chance

anoadragon453 · 2022-02-14T23:13:06Z

I think at this point we're starting to exhaust the amount of information we can easily encode in a filename, and we may want to start looking at more structured data on disk in order to make this easier on ourselves.

What about adding a bust.json file, containing something like the following:

{
  "attachments": {
    "<message snowflake>.<attachment id>": {
      "filename": "<song filename>",
      "busts_ago": 0
    },
    ...
}

This could also exist as an in-memory dictionary, but this way the data persists across restarts.

Filenames would still be the original filenames from Discord, but we could just truncate them if >255 chars.

Every time a song is queued for download, create a new entry in "attachments". If a matching entry already exists, reset busts_ago to zero and load the associated file. For cleaning up old attachments, every time a bust finishes (or the bot is exiting), we increment the "busts_ago" value for every attachment. If busts_ago for an attachment is >X, delete it from the JSON and delete the corresponding file from disk.

If a user deletes and re-uploads their track between two !lists (thus generating a new snowflake), then I think we just take the hit and re-download the song. They probably changed the file anyways.

While this does involve adding a new file to manage, I think it's simpler to reason about.

Cephian · 2022-02-15T20:40:00Z

The reason I suggested using <message_id>.<attachment_id>.<file_ext> was not to encode information really but rather to serve as a simple function unique file on discord --> unique filename on disk.

Filenames would still be the original filenames from Discord, but we could just truncate them if >255 chars.

Maybe this is a nitpick but remember that we might have to cache two files w the same name, which is why I think naming files by their unique discord IDs is simpler.

An external database definitely seems like the best way to implement it if we do the "busts ago" culling, but I'm not sure I like that the best. For ex, it seems weird that re-listing the same cached files on the same channel twice in a row without actually downloading anything new to the cache might cause a purge, especially given how often i repeatedly !list the same channel in a row when testing the bot.

When I think it over again, both the modified date culling and the json database culling sound somewhat complicated for busty's actually simple use case, where there is just one channel per two weeks we ever want to !list on. I propose that our first (and maybe also last, if it's good enough) implementation is just the following:

save files as <message_id>.<attachment_id>.<file_ext>
if the file already exists, don't download it again
just delete all files in the attachments folder which !list didn't either download or pull the cached version of.

This is equivalent to X = 1 in your proposal. Considering in the real server that we basically only ever want to run !list on the current submissions channel, all this fancy stuff might just be overkill. And even this simple form of caching will help speed up testing wait times a lot.

anoadragon453 · 2022-03-09T23:57:23Z

Maybe this is a nitpick but remember that we might have to cache two files w the same name, which is why I think naming files by their unique discord IDs is simpler.

Ah yeah indeed. You would end up with naming collisions. <message_id>.<attachment_id>.<file_ext> sounds good then - if we did ever want a "bust export" feature than we could just pull the names for each file from memory.

For ex, it seems weird that re-listing the same cached files on the same channel twice in a row without actually downloading anything new to the cache might cause a purge, especially given how often i repeatedly !list the same channel in a row when testing the bot.

Note that we'd reset the busts_ago to 0 if we end up finding a song in the cache, in which case I don't !listing the same channel repeatedly would cause a purge of its files.

Regardless though, I agree that we don't need anything very fancy right now.

I propose that our first (and maybe also last, if it's good enough) implementation is just the following:

save files as <message_id>.<attachment_id>.<file_ext>

if the file already exists, don't download it again

just delete all files in the attachments folder which !list didn't either download or pull the cached version of.

sgtm

Cephian added the enhancement New feature or request label Feb 3, 2022

Cephian mentioned this issue Mar 11, 2022

Cache files from previous !list invocations #91

Merged

Cephian closed this as completed in #91 Mar 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature proposal: Song cache #75

Feature proposal: Song cache #75

Cephian commented Feb 3, 2022

Cephian commented Feb 3, 2022

anoadragon453 commented Feb 14, 2022 •

edited

Loading

Cephian commented Feb 15, 2022

anoadragon453 commented Mar 9, 2022

Feature proposal: Song cache #75

Feature proposal: Song cache #75

Comments

Cephian commented Feb 3, 2022

Cephian commented Feb 3, 2022

anoadragon453 commented Feb 14, 2022 • edited Loading

Cephian commented Feb 15, 2022

anoadragon453 commented Mar 9, 2022

anoadragon453 commented Feb 14, 2022 •

edited

Loading