DownloadDatasetParallelTask in dag.py parallelizes the source downloads itself. It would be preferable to let Hadoop do this, for two reasons: 1) it removes the SSH dependency, which would allow Eggo to run on a cluster that was not provisioned with Fabric, and 2) fault tolerance is handled by Hadoop rather than by Eggo code.
The simplest way is probably to write a streaming script that uses NLineInputFormat so each mapper can download a single source file.
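As a rough illustration of the idea, here is a minimal sketch of what such a streaming mapper could look like. It assumes the job input is a text file with one source URL per line; the function names (`parse_source`, `target_name`, `download`) and the choice to write into the task's working directory are hypothetical, not existing Eggo code, and copying the result into HDFS is left out.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper: download one source URL per input line."""
import os
import shutil
import sys
import urllib.request
from urllib.parse import urlparse


def parse_source(line):
    """Return the URL from an input record, or None for a blank line.

    Streaming may pass the record as "<byte offset>\t<url>" for input
    formats other than TextInputFormat, so only the last tab-separated
    field is used.
    """
    fields = line.rstrip("\n").split("\t")
    url = fields[-1].strip()
    return url or None


def target_name(url):
    """Derive a local file name from the URL path."""
    return os.path.basename(urlparse(url).path) or "download"


def download(url, dest_dir):
    """Stream the URL into dest_dir and return the local path."""
    path = os.path.join(dest_dir, target_name(url))
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path


def main(stdin=sys.stdin, stdout=sys.stdout, dest_dir="."):
    for line in stdin:
        url = parse_source(line)
        if url is None:
            continue
        # An uncaught exception here fails the task, so Hadoop retries it;
        # no retry logic is needed in the script itself.
        path = download(url, dest_dir)
        stdout.write("%s\t%s\n" % (url, path))


if __name__ == "__main__":
    main()
```

Under these assumptions, the job would be submitted through Hadoop Streaming with `-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat` and `mapreduce.input.lineinputformat.linespermap` set to 1, so that each map task receives exactly one URL.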