
Access-controlled service for large static files #2099

Open
senderle opened this issue Jun 6, 2019 · 3 comments

@senderle commented Jun 6, 2019

Description

I'm proposing that a feature be added to serve large static files to authenticated users.

It might not be obvious why this is a problem. Here are some of the possible solution paths, and why they are blocked:

  1. Can't we use a static file service like WhiteNoise?

    • WhiteNoise doesn't provide any kind of authentication or access control, and I'm not even sure it can handle very large files.
  2. Can't Django just serve the files through a FileResponse object?

    • FileResponse objects do a decent job of serving small and medium-sized files, but for very large files, problems arise. (In my case, when files get big enough, I hit a memory error.) It appears that if a given environment has a wsgi.file_wrapper defined, FileResponse objects may use it to serve access-controlled files efficiently. But that seems to require that Django be running on the same machine as the web server. (See the first sketch after this list.)
  3. Isn't there some kind of funky thing you can do with headers?

    • Yes! Or rather, there was when cookiecutter-django used Caddy. Caddy supported the X-Accel-Redirect header and could be configured similarly to nginx (as described here). After the switch to Traefik, this approach no longer works, because Traefik is a reverse proxy, not a web server. (See the second sketch after this list.)
  4. Could you use AWS somehow?

    • Maybe. I haven't looked into this option carefully, but it seems like it would be very complicated to get right.
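
To make item 2 concrete, here is a minimal sketch of the plain FileResponse approach (the view, URL pattern, and path are all illustrative, and path sanitization is omitted). Without a wsgi.file_wrapper in the environment, every byte of the file passes through Django itself:

    # Hypothetical view for item 2: Django streams the file directly.
    from django.contrib.auth.decorators import login_required
    from django.http import FileResponse

    @login_required
    def download(request, filename):
        # FileResponse iterates over the file in chunks, tying up a
        # worker for the whole download; this is where very large
        # files start causing trouble.
        return FileResponse(
            open(f"/srv/files/{filename}", "rb"), as_attachment=True
        )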
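
And here is a sketch of the header trick from item 3, assuming the web server in front of Django exposes the files through an internal location (the /protected/ prefix is illustrative):

    # Hypothetical view for item 3: Django only does the access check;
    # the web server (nginx, or formerly Caddy) streams the file after
    # seeing the X-Accel-Redirect header, so the file itself never
    # passes through Django.
    from django.contrib.auth.decorators import login_required
    from django.http import HttpResponse

    @login_required
    def download(request, filename):
        response = HttpResponse()
        response["Content-Type"] = ""  # let the web server set it
        response["X-Accel-Redirect"] = f"/protected/{filename}"
        return response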

How should it be implemented? I don't know. This is where I am stuck, and I would welcome discussion. I posted a question on Stack Overflow and got crickets; if you see a way around this that doesn't require a pull request, please feel free to answer there.

Rationale

In a sense, this is not a "feature" but a fix. The change from Caddy to Traefik arguably broke functionality that was working pretty well before.

What it really means for me, concretely, is this: now that I want to do something similar with a new app, I can't use cookiecutter-django without a fairly elaborate and awkward reconfiguration, something like standing up an nginx container between the django service and the traefik service. If that's the only option, my instinct is not to use cookiecutter-django at all. I probably don't need everything else it provides, and the configuration work will wind up being about the same either way. And maybe that's fine; this could just be an "it might not be what you want" situation.

But I'm proposing the alternative narrative that this would actually fix something that worked before and is now broken. I don't honestly imagine that many people are doing what I'm doing, so I can't argue that you will lose a bunch of users over this. It's just kind of annoying that it used to be easy and is now hard.

Use case(s) / visualization(s)

Here's my use case: I am developing new apps for researchers in several departments at the University of Pennsylvania who are doing large-scale statistical text analysis. I need to be able to automatically distribute copyright-protected data to authorized users in bulk, without risking leaking the data.

@browniebroke (Member)

> I need to be able to automatically distribute copyright-protected data to authorized users in bulk, without risking leaking the data.

Are you positive these should be distributed as static files and not media files? It sounds like this data would be uploaded by your application's users to a FileField or ImageField rather than tracked in version control like your code base. django-storages provides some options to restrict access.

@senderle (Author) commented Jun 6, 2019

The data is generated by a crawling process and aggregated into large zip files that the user then downloads. There's no uploading involved. (It's also not tracked in version control.)

But if there are ways to restrict access to the files using some other mechanism that I haven't mentioned above, I'm all ears! It just has to be able to efficiently handle multi-gigabyte files.

@browniebroke (Member)

> The data is generated by a crawling process and aggregated into large zip files that the user then downloads. There's no uploading involved.

OK, so when I have to do this type of thing, there is an "upload" involved at some point, not from a user, but from the crawling process. Here is how I usually handle it (assuming a Docker-based config):

  • Create a storage class to store files as private in an AWS S3 bucket:

     from storages.backends.s3boto3 import S3Boto3Storage

     class PrivateStorage(S3Boto3Storage):
         default_acl = 'private'
         file_overwrite = False
         bucket_name = 'my-private-bucket'

    More options are detailed in the documentation. You can use AWS_QUERYSTRING_AUTH and AWS_QUERYSTRING_EXPIRE to control access to your files (see the settings sketch after this list).

  • Create a model with a FileField that will be used to represent these files in my application, using this private storage, for example:

     from django.db import models

     class LargeZipFile(models.Model):
         name = models.CharField(max_length=50)
         zip = models.FileField(
             storage=PrivateStorage()
         )
  • Use Celery to generate the data and, when the files are ready, create instances of LargeZipFile to upload them (see the task sketch after this list).
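
For reference, a minimal sketch of the query-string settings mentioned in the first bullet (the values are illustrative):

    # django-storages settings for signed, expiring S3 URLs.
    AWS_QUERYSTRING_AUTH = True    # .url returns a signed URL, not a public one
    AWS_QUERYSTRING_EXPIRE = 600   # signed URLs expire after 10 minutes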
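
And a sketch of the Celery step, assuming the crawler writes its archive to a local path first (the task name and arguments are hypothetical):

    from celery import shared_task
    from django.core.files import File

    from .models import LargeZipFile

    @shared_task
    def publish_archive(name, local_path):
        # Saving the FileField uploads the archive to the private
        # bucket via PrivateStorage and persists the model row.
        with open(local_path, "rb") as fh:
            archive = LargeZipFile(name=name)
            archive.zip.save(f"{name}.zip", File(fh), save=True)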

Each time a user wants to download a file, your application exposes LargeZipFile.zip.url on some page; the generated URL carries query parameters that grant access for a short amount of time.
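
Concretely, the download page can be as small as this (the view is hypothetical); .zip.url delegates to the storage backend and comes back as a signed, time-limited S3 link because of the query-string settings above:

    from django.contrib.auth.decorators import login_required
    from django.shortcuts import get_object_or_404, redirect

    from .models import LargeZipFile

    @login_required
    def download_archive(request, pk):
        archive = get_object_or_404(LargeZipFile, pk=pk)
        # Send the user straight to the presigned S3 URL; the file
        # itself never passes through Django.
        return redirect(archive.zip.url)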

That being said, I don't know how your server is deployed at the University of Pennsylvania; it might be on a dedicated, non-cloud server. I don't know what your storage options are, but if AWS is not suitable, DigitalOcean Spaces might be: it has an S3-compatible API, which is supported by django-storages.

> It just has to be able to efficiently handle multi-gigabyte files.

A word of warning: this could generate significant costs from Amazon, since S3 bills for storage and for data transfer out.
