
Asset server for older uploads #2500

Closed · r888888888 opened this issue Aug 19, 2015 · 10 comments

@r888888888
Collaborator

Eventually the primary web servers will run out of hard drive space. Older images should be offloaded onto asset servers to free up space for newer content.

@Type-kun
Collaborator

Just a note that this can be done with very few code changes. Offload the old files to the assets server, mount the remote directory with sshfs, then use unionfs to merge the remote and local image stores into a single read-only directory, and use that directory in the webserver config and when checking for existence (see the sketch below). The downside is high internal traffic: older images will first be pulled from the assets server to the main server, then delivered to the user.
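
A minimal sketch of that mount setup, assuming the assets server is reachable as assets.example.com and using made-up paths (the real layout would differ):

# Mount the remote image store read-only over SSH (sshfs is FUSE-based).
sshfs danbooru@assets.example.com:/srv/danbooru/images /mnt/assets -o ro,reconnect

# Merge the local and remote stores into a single read-only view with unionfs-fuse.
# Lookups hit the local branch first and fall through to the remote one.
unionfs-fuse /srv/danbooru/images=RO:/mnt/assets=RO /srv/danbooru/merged

# The webserver root and the app's existence checks would then point at /srv/danbooru/merged.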

@r888888888
Collaborator Author

The posts table actually has bit flags, so it'd be trivial to add a flag for whether a post has been archived or not.
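
For illustration, a tiny sketch of how such a flag could sit in an integer bit_flags column; the bit position here is hypothetical, not Danbooru's actual flag layout:

ARCHIVED_FLAG = 1 << 3  # hypothetical bit position; existing flags occupy other bits

def is_archived(bit_flags: int) -> bool:
    return bool(bit_flags & ARCHIVED_FLAG)

def mark_archived(bit_flags: int) -> int:
    return bit_flags | ARCHIVED_FLAG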

@r888888888
Collaborator Author

Servers have around 350 GB free, so it's time to start thinking about this.

I think the immediate step is to start hosting older files out of S3. The files are already there; the ACL just has to be set and they have to be brought out of Glacier. It is a bit pricey, but accesses should be infrequent, and not having the headache of maintaining a separate asset server (and all the infrastructure and development that entails) is probably worth it.

The oldest 100k posts should be hosted out of S3. I will see how that affects costs and make further decisions from there.
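
A rough sketch of that step with boto3; the bucket name, key layout, and restore window are assumptions for illustration, not the real setup:

import boto3

s3 = boto3.client("s3")
BUCKET = "danbooru-images"  # assumed bucket name

def publish_old_upload(key):
    """Bring an object out of Glacier and make it publicly readable."""
    # Start a Glacier restore job; the data only becomes readable once the job
    # completes (hours on the Standard tier), and the restored copy expires after
    # Days days unless the object is copied back into a standard storage class.
    s3.restore_object(
        Bucket=BUCKET,
        Key=key,
        RestoreRequest={"Days": 30, "GlacierJobParameters": {"Tier": "Standard"}},
    )
    # Make the object readable without signed URLs.
    s3.put_object_acl(Bucket=BUCKET, Key=key, ACL="public-read")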

@evazion
Member

evazion commented Feb 8, 2017

Why not just host everything out of S3 and configure nginx as a simple caching proxy in front of it? That way everything is handled transparently: unpopular images naturally fall out of the cache, but they're automatically pulled in from S3 when they're accessed again. That seems simpler than handling archival at the app level.

The image host doesn't necessarily need to be a separate server, just a separate subdomain (which would be beneficial regardless).
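
A minimal sketch of that idea as an nginx config (bucket name, subdomain, cache sizes, and paths are all placeholders, not the real setup):

# In the http{} block: a cache big enough for the hot set of images.
proxy_cache_path /var/cache/nginx/images levels=1:2 keys_zone=s3_images:50m
                 max_size=200g inactive=30d use_temp_path=off;

server {
    listen 80;  # TLS config omitted for brevity
    server_name img.donmai.us;  # hypothetical separate image subdomain

    location /data/ {
        proxy_pass https://danbooru-images.s3.amazonaws.com;  # keys assumed to keep the /data/ prefix
        proxy_set_header Host danbooru-images.s3.amazonaws.com;

        proxy_cache s3_images;
        proxy_cache_valid 200 30d;                    # keep hits for a month
        proxy_cache_use_stale error timeout updating;
    }
}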

@r888888888
Collaborator Author

r888888888 commented Feb 9, 2017

That's actually a really good idea.

I was able to implement a proof of concept on Testbooru.

I guess the downsides are these:

  • The first time an image is loaded will be slower, since it needs to be downloaded from S3 first. Nginx can actually be configured to only cache a file once it has been accessed N times or more, which is probably a smart idea to avoid a lot of churn (see the config sketch after this comment).
  • Pretty much the entirety of the S3 database will be exposed to the public. There isn't really any advantage to signing requests, since a leecher could just hammer the web servers to get the signed URLs.
  • Higher cost, both from serving out of S3 and not being able to use Amazon Glacier for cost savings.
  • Uploads will be slower since the files need to be uploaded to S3 synchronously.

There are some pretty strong advantages though:

  • Adding additional web servers is trivial, since the size of the proxy cache can be adjusted for available disk space.
  • Would effectively never have to worry about disk space again, since the most popular images will always be biased towards more recent uploads.

I think it's possible to do this gradually by configuring Nginx to only proxy if the file doesn't exist locally (sketched below).
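
A sketch of both tweaks, reusing the s3_images cache zone from the earlier sketch (paths, bucket, and thresholds are assumptions): try_files only falls through to the S3 proxy when the file is missing locally, and proxy_cache_min_uses defers caching until the Nth request.

location /data/ {
    root /var/www/danbooru/public;   # assumed local image root
    # Serve from local disk if the file exists, otherwise proxy it from S3.
    try_files $uri @s3;
}

location @s3 {
    proxy_pass https://danbooru-images.s3.amazonaws.com;  # assumed bucket
    proxy_set_header Host danbooru-images.s3.amazonaws.com;

    proxy_cache s3_images;
    proxy_cache_min_uses 3;          # only cache after the third request, to limit churn
    proxy_cache_valid 200 30d;
}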

@evazion
Member

evazion commented Mar 25, 2017

From https://danbooru.donmai.us/forum_topics/9127?page=167#forum_post_128781:

The resized versions of post #4087, post #1357 and post #930 are broken, not sure if anything can be done about it.

These URLs return 403 Forbidden with this error:

<Error>
  <Code>InvalidObjectState</Code>
  <Message>
    The operation is not valid for the object's storage class
  </Message>
  <RequestId>329863F0B4A14EA0</RequestId>
  <HostId>
    Fcqbtntbkk6aOQUhrlYDt4AAdkWrQtsCLtMxtb7KEtgCuoH7RwQSVetuVuIfcgBvoivMtJhFLVU=
  </HostId>
</Error>
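
That InvalidObjectState error means the objects are still in the Glacier storage class and haven't been restored yet. A small diagnostic sketch with boto3 (bucket and key names are whatever the real layout uses) that checks an object's restore status and kicks off a restore if needed:

import boto3

s3 = boto3.client("s3")

def ensure_restored(bucket, key):
    """Report an object's storage class / restore status, starting a restore if needed."""
    head = s3.head_object(Bucket=bucket, Key=key)
    storage_class = head.get("StorageClass", "STANDARD")
    restore = head.get("Restore")  # e.g. 'ongoing-request="true"' while a restore runs

    if storage_class == "GLACIER" and restore is None:
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={"Days": 30, "GlacierJobParameters": {"Tier": "Standard"}},
        )
        return "restore started"
    return restore or storage_class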

@Type-kun
Collaborator

Yeah, I guess we kind of forgot that the app assumes the file is always physically present and runs ImageMagick queries on every save. There's also the APNG detector, which likely broke as well. Actually, I have a feeling this will bring us more problems in the future. Can't this be configured transparently at the OS filesystem level instead of in nginx, via SSHFS for example, or maybe using some smarter remote FS?

@r888888888
Collaborator Author

The plan will be to migrate older posts instead of making this happen all at once during upload, so I think the file existence check is good enough.

@Type-kun
Collaborator

Type-kun commented Apr 4, 2017

This has actually already resulted in a problem. Since the animated_gif tag is fully automated, when the existence check fails on a post edit, the post is considered NOT to be an animated GIF and the tag is removed. It's impossible to add the tag to https://danbooru.donmai.us/posts/159 now, for example.
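
One way to avoid this, as a sketch (in Python for brevity; the real tagger lives in the Rails app, and the helper names here are made up): only recompute file-derived tags when the file can actually be read, and otherwise keep whatever auto tags the post already has.

import os

AUTO_TAGS = {"animated_gif", "animated_png"}  # hypothetical set of file-derived tags

def detect_file_tags(file_path):
    # Placeholder for the real ImageMagick / APNG checks; assumed for the sketch.
    return set()

def update_auto_tags(post_tags, edited_tags, file_path):
    """Recompute file-derived tags only when the file is present on disk."""
    if not os.path.exists(file_path):
        # File archived to S3: keep the existing auto tags instead of treating
        # "can't check" as "not animated".
        return edited_tags | (post_tags & AUTO_TAGS)
    return (edited_tags - AUTO_TAGS) | detect_file_tags(file_path)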

@r888888888
Collaborator Author

I don't see why auto-tags should ever be removed. If the file itself never changes, then properties like whether it's animated or not will never change.
