Enhancement: Dataverse files are downloaded one-by-one, not concurrently #61

Open
ross-spencer opened this issue Aug 6, 2018 · 2 comments
Labels
OCUL: AM-Dataverse

Comments

@ross-spencer
Contributor

ross-spencer commented Aug 6, 2018

Please describe the problem you'd like to be solved.

I would like to see this portion of the code corrected to enable concurrent downloading of a Dataverse payload: https://github.com/artefactual/archivematica-storage-service/blob/9e6f97392042997bfd7ee251308e0708f514860e/storage_service/locations/models/dataverse.py#L253-L275

Downloading files one-by-one is slow at present, which impacts user efficiency, or at least the speed at which a Dataverse transfer completes.

Describe the solution you'd like to see implemented.

The Python requests library can be used for concurrent downloads; we could try a solution such as the one outlined on Stack Overflow: https://stackoverflow.com/a/9189249

Describe alternatives you've considered.

Alternative libraries, or mechanisms we already use for this in Archivematica, might exist.

Additional Context.

Async requests: http://docs.python-requests.org/en/v0.10.6/user/advanced/#asynchronous-requests

@ross-spencer ross-spencer added the OCUL: AM-Dataverse label Aug 6, 2018
@ross-spencer
Contributor Author

Hi @sevein, I wondered if you had an opinion about the above? Further, would our own async module affect doing something like `from requests import async`?

@sevein
Contributor

sevein commented Aug 7, 2018

You could do what you're describing. You'd need to remember that we're using gevent so the standard library is already patched and it's cooperative. E.g. if you decided to implement a solution based on concurrent.futures.ThreadPoolExecutor (also available in py2 via futures) it's good to know that you're not really using threads but greenlets. It shouldn't change much but it's worth understanding the difference.
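
For illustration, here is a minimal sketch of the ThreadPoolExecutor approach mentioned above. It is not the storage service's code: `download_all` and `fetch_one` are hypothetical names, and `fetch_one` stands in for whatever currently downloads a single Dataverse file. Under gevent's monkey-patching the workers would be greenlets rather than OS threads, as noted.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch_one, max_workers=4):
    """Fetch every URL concurrently and return results in input order.

    fetch_one is a hypothetical callable (an assumption for this sketch)
    that takes a URL and returns whatever a single download produces,
    e.g. a local file path.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits all calls up front and yields results in the same
        # order as `urls`. If a call raised, its exception is re-raised
        # when that result is consumed here.
        return list(pool.map(fetch_one, urls))


if __name__ == "__main__":
    # Stand-in for a real download so the sketch runs without a network.
    def fake_fetch(url):
        return "/tmp/" + url.rsplit("/", 1)[-1]

    urls = ["https://example.org/a.csv", "https://example.org/b.csv"]
    print(download_all(urls, fake_fetch))
```

Passing the single-file download in as a callable keeps the concurrency separate from the Dataverse-specific logic, so the existing per-file code could be reused largely as it is.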

Be aware that it's an area where we may not be doing our best to handle errors properly, e.g. what would happen if the operation fails / raises an exception? Are we doing proper error handling? Should we update AsyncManager so it can retry operations? Are we reporting the error to the user? These are questions that you may want to ask yourself while planning to work on performance improvements.
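
One possible way to approach those questions, sketched under assumptions: each file is retried a fixed number of times, and failures are collected per URL rather than aborting the whole transfer, so they can be reported to the user. The function names, the `attempts` parameter and the `fetch_one` callable are all hypothetical; this does not reflect how AsyncManager works today.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_retry(fetch_one, url, attempts=3):
    """Call fetch_one(url), retrying on failure; `attempts` must be >= 1."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch_one(url)
        except Exception as err:  # a real implementation should catch narrower errors
            last_error = err
    # Every attempt failed: surface the last error to the caller.
    raise last_error


def download_all_reporting(urls, fetch_one, max_workers=4, attempts=3):
    """Download concurrently and return (results, errors), both keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(fetch_with_retry, fetch_one, url, attempts): url
            for url in urls
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as err:
                # Record the failure instead of aborting the other downloads.
                errors[url] = err
    return results, errors
```

The caller can then decide what to do with a non-empty `errors` dict, e.g. fail the transfer or report the affected files to the user.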

@joel-simpson joel-simpson changed the title Problem: Dataverse files are downloaded one-by-one, not concurrently Enhancement: Dataverse files are downloaded one-by-one, not concurrently Aug 20, 2018