Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature proposal: Song cache #75

Closed
Cephian opened this issue Feb 3, 2022 · 4 comments · Fixed by #91
Closed

Feature proposal: Song cache #75

Cephian opened this issue Feb 3, 2022 · 4 comments · Fixed by #91
Labels
enhancement New feature or request

Comments

@Cephian
Copy link
Collaborator

Cephian commented Feb 3, 2022

We tend to run !list more than once per bust, to do things like get Google Form info, make sure everything looks good, etc. We also run it a lot locally when testing. On a large number of songs there can be significant delay and bandwidth consumption from downloading all of the files. It would be nice to have a feature which allows songs to be "cached" so that runs of !list occurring close in time don't both need to download everything.

Despite my feature proposal being sort of long, this wouldn't actually be all that complicated to code. It just involves a lot of arbitrary choices for which I'd like to make recommendations, and I want to hear what others think before implementing it.

More concrete feature proposal:

  • Save file <discord_filename> as <discord_message_id>.<attachment_num>.<discord_fileext>
    • We need <attachment_num> because discord messages can have multiple attachments
    • Discord message ids are universally unique
    • Previously we needed unique file ids per-channel and chose <index>.<discord_filename>
      • Now we need them globally unique, someone could delete a message with an old version of their song and resend it for example
      • Our current choice actually has a bug! Discord seems to accept 255-length filenames which is is the max filename length on most common filesystems (more or less) so we'd need to truncate very long filenames, ruining our claim to perfect archival through filenames anyways
  • On each !list:
    1. When scraping channel for files, first check to see if cached version exists in attachments/ (as <message_id>.<file_ext>)
    2. If it doesn't exist, download it
    3. Set the modified time of the file to now
    4. Keep only the max(X, # of songs in current !list) newest files and delete all others, where X is say 50?
    • I previously argued that I was against indiscriminately deleting files in the attachment folder mainly because "delete all files in X directory" is never a piece of code you want to have a bug with. But I think in this case it's warranted.
    • We could also do something like "delete all files older than 24 hours", but I don't like this because then we can't prerun !list at any point during the two weeks and it stays cached
    • We could also do something like "delete the oldest files until total filesize is under a certain amount OR only bust files remain" but unless we're extremely space limited this seems like overkill
@Cephian Cephian added the enhancement New feature or request label Feb 3, 2022
@Cephian
Copy link
Collaborator Author

Cephian commented Feb 3, 2022

@anoadragon453 Let me know what you think of this when you get the chance

@anoadragon453
Copy link
Owner

anoadragon453 commented Feb 14, 2022

I think at this point we're starting to exhaust the amount of information we can easily encode in a filename, and we may want to start looking at more structured data on disk in order to make this easier on ourselves.

What about adding a bust.json file, containing something like the following:

{
  "attachments": {
    "<message snowflake>.<attachment id>": {
      "filename": "<song filename>",
      "busts_ago": 0
    },
    ...
}

This could also exist as an in-memory dictionary, but this way the data persists across restarts.

Filenames would still be the original filenames from Discord, but we could just truncate them if >255 chars.

Every time a song is queued for download, create a new entry in "attachments". If a matching entry already exists, reset busts_ago to zero and load the associated file. For cleaning up old attachments, every time a bust finishes (or the bot is exiting), we increment the "busts_ago" value for every attachment. If busts_ago for an attachment is >X, delete it from the JSON and delete the corresponding file from disk.

If a user deletes and re-uploads their track between two !lists (thus generating a new snowflake), then I think we just take the hit and re-download the song. They probably changed the file anyways.

While this does involve adding a new file to manage, I think it's simpler to reason about.

@Cephian
Copy link
Collaborator Author

Cephian commented Feb 15, 2022

The reason I suggested using <message_id>.<attachment_id>.<file_ext> was not to encode information really but rather to serve as a simple function unique file on discord --> unique filename on disk.

Filenames would still be the original filenames from Discord, but we could just truncate them if >255 chars.

Maybe this is a nitpick but remember that we might have to cache two files w the same name, which is why I think naming files by their unique discord IDs is simpler.

An external database definitely seems like the best way to implement it if we do the "busts ago" culling, but I'm not sure I like that the best. For ex, it seems weird that re-listing the same cached files on the same channel twice in a row without actually downloading anything new to the cache might cause a purge, especially given how often i repeatedly !list the same channel in a row when testing the bot.

When I think it over again, both the modified date culling and the json database culling sound somewhat complicated for busty's actually simple use case, where there is just one channel per two weeks we ever want to !list on. I propose that our first (and maybe also last, if it's good enough) implementation is just the following:

  1. save files as <message_id>.<attachment_id>.<file_ext>
  2. if the file already exists, don't download it again
  3. just delete all files in the attachments folder which !list didn't either download or pull the cached version of.

This is equivalent to X = 1 in your proposal. Considering in the real server that we basically only ever want to run !list on the current submissions channel, all this fancy stuff might just be overkill. And even this simple form of caching will help speed up testing wait times a lot.

@anoadragon453
Copy link
Owner

Maybe this is a nitpick but remember that we might have to cache two files w the same name, which is why I think naming files by their unique discord IDs is simpler.

Ah yeah indeed. You would end up with naming collisions. <message_id>.<attachment_id>.<file_ext> sounds good then - if we did ever want a "bust export" feature than we could just pull the names for each file from memory.

For ex, it seems weird that re-listing the same cached files on the same channel twice in a row without actually downloading anything new to the cache might cause a purge, especially given how often i repeatedly !list the same channel in a row when testing the bot.

Note that we'd reset the busts_ago to 0 if we end up finding a song in the cache, in which case I don't !listing the same channel repeatedly would cause a purge of its files.

Regardless though, I agree that we don't need anything very fancy right now.

I propose that our first (and maybe also last, if it's good enough) implementation is just the following:

  1. save files as <message_id>.<attachment_id>.<file_ext>
  2. if the file already exists, don't download it again
  3. just delete all files in the attachments folder which !list didn't either download or pull the cached version of.

sgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants