Use Hadoop to run source dataset downloads in parallel #21

tomwhite · 2015-04-10T15:56:19Z

DownloadDatasetParallelTask in dag.py currently parallelizes the source downloads itself. It would be preferable to let Hadoop do this, for two reasons: 1) it removes the SSH dependency, which would allow Eggo to run on a cluster that was not provisioned with fabric, and 2) fault tolerance is handled by Hadoop rather than by Eggo code.

The simplest way is probably to write a streaming script that uses NLineInputFormat so each mapper can download a single source file.
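
A rough sketch of what such a streaming mapper might look like, not the actual Eggo implementation. It assumes the job input is a text file with one source URL per line, and that streaming passes each record as `<byte offset><TAB><line>` when a non-default input format such as NLineInputFormat is used. The names `parse_url`, `target_name`, `download` and `main` are hypothetical, and writing to the task's working directory stands in for whatever destination (e.g. HDFS) the real job would use.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper: download one source URL per input line."""
import os
import shutil
import sys
import urllib.request
from urllib.parse import urlparse


def parse_url(line):
    """Return the URL from an input record, dropping any leading key field."""
    # Assumption: the record is either "<url>" or "<byte offset>\t<url>".
    return line.rstrip("\n").split("\t")[-1].strip()


def target_name(url):
    """Derive a local file name from the last path component of the URL."""
    return os.path.basename(urlparse(url).path) or "download"


def download(url, dest_dir="."):
    """Stream the URL into dest_dir and return the local path."""
    path = os.path.join(dest_dir, target_name(url))
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Download every URL read from stdin; emit "<url>\t<local path>" per file."""
    for line in stdin:
        url = parse_url(line)
        if not url:
            continue  # skip blank records
        # An exception here fails the task, so Hadoop retries it
        # instead of Eggo having to handle the failure itself.
        path = download(url)
        stdout.write("%s\t%s\n" % (url, path))
```

To use this as a mapper, the script would end with a call to `main()` and be submitted as a map-only streaming job with `-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat`, so that each map task receives one line of the URL list. Letting a failed download raise is deliberate: task-level retries are the fault tolerance this issue wants to delegate to Hadoop.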

tomwhite mentioned this issue Apr 10, 2015

Support local testing. #22

Merged

tomwhite closed this as completed Apr 14, 2015
