Use Hadoop to run source dataset downloads in parallel #21

Closed
tomwhite opened this issue Apr 10, 2015 · 0 comments

DownloadDatasetParallelTask in dag.py currently parallelizes the source downloads itself. It would be preferable to let Hadoop do this, for two reasons: 1) it removes the SSH dependency, which would allow Eggo to run on a cluster that was not provisioned with Fabric, and 2) fault tolerance is handled by Hadoop rather than by Eggo code.

The simplest way is probably to write a Hadoop Streaming script that uses NLineInputFormat, so that each mapper receives one line of the input file and downloads a single source file.
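
As a rough illustration of the streaming approach, here is a minimal sketch of what such a mapper could look like. It is not Eggo's actual code: the function names (`parse_url`, `download`, `run_mapper`), the one-URL-per-line input file, and the local destination directory are all assumptions made for the example. It also assumes that streaming may prefix each line with a byte-offset key and a tab when an input format other than the default is used, so the parser tolerates both forms.

```python
#!/usr/bin/env python
"""Hypothetical streaming mapper: download one source file per input line."""
import os
import shutil
import sys
import urllib.parse
import urllib.request


def parse_url(line):
    """Return the URL on one input line, or None if the line is blank.

    Assumption: the line may arrive as "<byte offset>\t<url>", so a
    leading numeric key is dropped when present.
    """
    line = line.rstrip("\n")
    key, sep, rest = line.partition("\t")
    if sep and key.strip().isdigit():
        line = rest
    return line.strip() or None


def download(url, dest_dir="."):
    """Fetch url into dest_dir and return the local path (illustrative only)."""
    name = os.path.basename(urllib.parse.urlparse(url).path) or "download"
    dest = os.path.join(dest_dir, name)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


def run_mapper(lines, fetch=download, out=sys.stdout):
    """Download each URL and emit "<url>\t<local path>" per file.

    Exceptions are deliberately not caught: a failed download fails the
    task, and Hadoop's own task retry takes over from there.
    """
    for line in lines:
        url = parse_url(line)
        if url is None:
            continue
        dest = fetch(url)
        out.write("%s\t%s\n" % (url, dest))
```

To act as a mapper, the script would end by calling `run_mapper(sys.stdin)`. The job would presumably be submitted as a map-only streaming job with `-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat`, whose default of one line per split gives each mapper a single URL. Where the downloaded file should go afterwards (for example, into HDFS) is left open here.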
