dl

rust utility for more quickly downloading big files using parallel range requests

Overview

dl is a program that lets you download big files fast(er) by parallelizing downloads across many threads and cores.

It was written in response to a take-home challenge for a software company, but lives on as an interesting playground for experimenting with network programming in Rust!

Table Of Contents

  • Getting Started
  • Application Design
  • Profiling
  • TODO

Getting Started

Get the code!

Assuming you don't have the repo yet, get it with:

git clone https://0xacab.org/aguestuser/dl.git
cd dl

System Dependencies

You will need rust and its build tool, cargo, installed to run dl. The fastest/simplest way to install them is by running this in a shell:

curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env

If curl-ing bash scripts from the internet into your shell makes you nervous, you can try one of the options here.

Whatever method you chose, you can check to make sure everything worked with:

rustc --version
cargo --version

If you would like to update all your packages (likely unnecessary), you can run:

rustup update

Building dl

Yay! Now that you have rust, you can build dl with:

cd <path/to/this/repo>
cargo build --release

To make it easier to run, let's put the build artifact on our PATH:

sudo ln -s ${PWD}/target/release/dl /usr/bin/dl

(If you don't want to do that, that's okay! We can use a slightly longer command later!)

Running dl

Okay! Time for the main event! Now you can run dl with:

dl <url_to_download_from> <path_to_save_file_to>

(Filling in <url_to_download_from> and <path_to_save_file_to> with your own values.)

For example, to download this ~20 MB copy of Designing Data-Intensive Applications (disguised as my resume on RC's S3 bucket), you could run:

dl https://recurse-uploads-production.s3.amazonaws.com/dc12e4d0-3c82-45b8-9cb7-6c64a8f50cfb/austin_guest_resume.pdf data/book.pdf

If you didn't symlink the build artifact above, you could run:

./target/release/dl <url> <path>

Developing dl

In the above we used production builds because they are faster, and this is a challenge! However, if we wanted to hack on the project to change it, we'd want a faster build/run cycle than comes with the release flag and invoking a binary directly.

Thankfully, we can build and run in one step with a simple:

cargo run

We can build and run the tests (frequently, right?) with:

cargo test

If you want to check out the (nifty, autogenerated!) docs, you can always run:

cargo doc --open

And if you want to just re-compile, you can run:

cargo build

Application Design

If you want to find your way around the code, starting in main, then looking at lib and the modules it calls, is a good way in.

The main action happens in lib::run, which is structured as a pipeline of futures, each of which consumes and produces a struct that encapsulates the current state of the program.

More specifically, program flow is modeled as a sequence of transitions between the following structs:

Config -> MetadataDownloader -> FileDownloader -> HashChecker

This encoding of application state as transitions between structs is a rust convention which, in addition to providing clarity about what phase of execution the app is in, does some memory management work for us under the hood (each stage consumes its predecessor, so the old state can be dropped as soon as the program moves on).
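To make that concrete, here is a compressed, synchronous sketch of the convention. Plain Results stand in for the futures the real code uses, and the field names are illustrative guesses rather than dl's actual fields:

// Synchronous, stripped-down sketch of the struct-transition convention described
// above. The real pipeline is built from futures; plain Results are used here so the
// shape is visible without the async machinery. Field names are illustrative.
struct Config { url: String, path: String }
struct MetadataDownloader { url: String, path: String }
struct FileDownloader { url: String, path: String, file_size: u64, etag: Option<String> }
struct HashChecker { path: String, etag: Option<String> }

impl Config {
    fn into_metadata_downloader(self) -> MetadataDownloader {
        MetadataDownloader { url: self.url, path: self.path }
    }
}

impl MetadataDownloader {
    // In dl: HEAD request for file size + etag. Stubbed here.
    fn fetch(self) -> Result<FileDownloader, String> {
        Ok(FileDownloader { url: self.url, path: self.path, file_size: 0, etag: None })
    }
}

impl FileDownloader {
    // In dl: parallel range requests written into a placeholder file. Stubbed here.
    fn fetch(self) -> Result<HashChecker, String> {
        Ok(HashChecker { path: self.path, etag: self.etag })
    }
}

impl HashChecker {
    // In dl: hash the downloaded file and compare it to the advertised etag.
    fn check(self) -> Result<(), String> {
        Ok(())
    }
}

// Each stage consumes its predecessor, so earlier state can be dropped as soon as
// the program moves on -- the "memory management work" mentioned above.
fn run(config: Config) -> Result<(), String> {
    config
        .into_metadata_downloader()
        .fetch()
        .and_then(FileDownloader::fetch)
        .and_then(HashChecker::check)
}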

As should be evident from the above, the flow of the program is to:

  • accept configs from the user
  • make a HEAD request to the user-supplied url to see if it supports range requests, and if so, retrieve the length of the file and (if it exists) its etag
  • download the file in several parallel chunks
  • take the hash of the downloaded file to see if it matches the advertised etag

Most of the heavy lifting comes in dl::file::FileDownloader::fetch. This function:

  • creates a blank placeholder file into which chunks will be written as soon as they come off the wire
  • calculates optimally-sized chunks in which to download the file (where "optimal" is file_size / P; P = degree of parallelism in app) and generates a stream of chunk-sized offsets to use in range requests
  • issues parallel range requests for the chunk at each offset (where each request is represented as a future and the set of all requests is represented as a stream of futures, derived from our original stream of offsets)
  • reads the bytes of each response into a buffer, whose contents are written to the placeholder file after seeking to the correct offset (note: all writes are performed in parallel via stream composition)
  • collects the stream of parallel futures described above into a single future (via chained calls to buffer_unordered and collect), which resolves successfully if all requests resolve successfully and fails if any request fails (yes: we could be less brittle than that in future iterations!); a sketch of this composition follows the list
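Here is a hedged sketch of that composition, in the futures-0.1 style the combinator names suggest (stream::iter_ok, buffer_unordered, collect). download_piece is stubbed, error types are simplified to String, and remainder handling for the last chunk is elided:

// Requires futures = "0.1". `download_piece` is a stub standing in for a range
// request plus a positioned write; in dl it resolves to the chunk's offset.
use futures::{future, stream, Future, Stream};

fn download_piece(offset: u64) -> impl Future<Item = u64, Error = String> {
    future::ok(offset)
}

fn fetch(file_size: u64, parallelism: usize) -> impl Future<Item = (), Error = String> {
    // "Optimal" chunk size = file_size / P (remainder handling elided).
    let chunk_size = file_size / parallelism as u64;
    let offsets = (0..parallelism as u64).map(move |i| i * chunk_size);

    stream::iter_ok::<_, String>(offsets) // stream of chunk-sized offsets
        .map(download_piece)              // stream of futures, one range request each
        .buffer_unordered(parallelism)    // run up to P of them in parallel
        .collect()                        // single future over all the results...
        .map(|_offsets: Vec<u64>| ())     // ...which fails if any single request fails
}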

A brief note on types:

Each call to download_piece returns a value that we represent as a Future<u64, DlError>. The u64 represents the offset of the downloaded chunk -- in case we wanted to track the status of each request for retries, which we don't currently do.

The DlError type is a custom error class that serves as an enumerable set of all possible errors we might see in the program's execution. Constructing this type involves a lot of boilerplate (see dl::errors). BUT... having this type around turns out to do a lot of work. For starters, it's necessary to make the "types line up": all of our variously-typed futures return an error component of the same type, which allows us to use (monadic) map, map_err, and and_then combinators to manage control flow.

It also makes error-handling in the main loop a breeze, since we have a strong guarantee that our main loop will only ever produce (future) errors of a certain type and that execution will short circuit as soon as one is encountered. Thus, we can just call our main function and map_err over its results to print the correct error at the correct time without worrying too much about it.

If you see a Box<Future> and wonder what's going on: its main purpose is to make sure the compiler has enough information about what kind of future we're returning when we compose Futures of slightly varying types.
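For concreteness, here is a pared-down sketch of what such an error type tends to look like -- the variant names below are illustrative, not necessarily the ones in dl::errors:

use std::{fmt, io};

// One variant per failure source, so every future in the pipeline can share a
// single error type. Variant names are illustrative.
#[derive(Debug)]
pub enum DlError {
    Http(String),      // request-level failures
    Io(io::Error),     // filesystem failures
    ChecksumMismatch,  // downloaded file's hash doesn't match the advertised etag
}

impl fmt::Display for DlError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            DlError::Http(msg) => write!(f, "http error: {}", msg),
            DlError::Io(e) => write!(f, "io error: {}", e),
            DlError::ChecksumMismatch => write!(f, "checksum mismatch"),
        }
    }
}

impl std::error::Error for DlError {}

// This `From` boilerplate is what lets map_err / ? convert library errors into
// DlError so all the futures' error types line up.
impl From<io::Error> for DlError {
    fn from(e: io::Error) -> Self {
        DlError::Io(e)
    }
}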

A note on parallelism:

There are several strategies I could have used to leverage parallelism in this application. I chose the simplest one I could think of:

  • Discover the number of logical cores on my machine (in my case 12). Call this number P, and assume it is the optimal level of parallelism for the application.[1]
  • Divide the file into P pieces and download them in a stream of P parallel tasks (i.e., "futures") using an https client configured with a thread pool of size P
  • When all tasks have completed, collect their results using built-in stream combinators provided with the tokio framework.

I could alternatively have spawned P threads or P tasks and had them send their results over a channel to an aggregator thread or task (which I may try out in further experimentation), but for current purposes this seemed the most straightforward approach.

[1] See below for benchmarking experiments to test this assumption.
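As a small aside on the first bullet above: discovering P can be a one-liner if you reach for the num_cpus crate (whether dl actually uses this crate is an assumption on my part):

// Assumed approach for picking P: one task per logical core.
// Requires num_cpus = "1" in Cargo.toml; dl's actual discovery mechanism may differ.
fn parallelism() -> usize {
    num_cpus::get() // logical cores; 12 on the author's Thinkpad
}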

Profiling

To make sure the program was successfully leveraging parallelism to increase performance (and validate my assumption of the optimal level of parallelization), I performed some elementary profiling with various levels of parallelism.

The benchmarks from these profiling experiments live with the repo. You can view the reports with (for example):

cd <this repo>
firefox target/criterion/report/index.html
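The reports come from ordinary criterion benches. For orientation, here is a minimal sketch of the shape of such a bench -- the function name, file size, and P values below are illustrative, not the repo's actual benchmark code:

// Minimal sketch of a criterion bench (criterion in [dev-dependencies]).
// `download_file` stands in for whatever entry point the real benches drive.
use criterion::{criterion_group, criterion_main, Criterion};

fn download_file(_parallelism: usize) {
    // in the real benches this would drive dl against a test URL
}

fn bench_parallelism(c: &mut Criterion) {
    for p in [1, 6, 12, 24, 48].iter() {
        c.bench_function(&format!("~25 MB file, P={}", p), |b| b.iter(|| download_file(*p)));
    }
}

criterion_group!(benches, bench_parallelism);
criterion_main!(benches);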

I performed the benchmarks on a Thinkpad with 12 logical (6 physical) cores with internet speeds of ~850Mbps up / 930Mbps down. For each trial, I downloaded files of varying sizes (~50 KB, ~25 MB, and ~500 MB) with varying levels of parallelism (1, 6, 12, 24, 48) -- running 20 trials per permutation.

The benchmarks demonstrated that:

  • Increasing parallelism did not yield measurable performance gains for files on the order of KB (presumably b/c the overhead of context switching dominates any gains from parallelism given the small amount of data to be downloaded and the speed of the network on which I was testing).
  • Gains from parallelism begin accruing for files on the order of MB and increase with file size thereafter -- producing speedups of roughly 2x and 3x for files of ~25 MB and ~500 MB, respectively.
  • At sizes large enough for parallelism to increase performance, setting the parallelism level P to the number of logical cores on the machine (in my case 12) was a more or less optimal choice.[1]

[1] More specifically: for files of size ~25 MB, download times decreased as P increased from 1 to 12, then increased again for values of P higher than 12. For files of size ~500 MB, download times decreased hyperbolically as P increased, with the knee of the curve appearing somewhere before P=12. Since in the first case 12 was the optimal choice, and in the latter case download times only decreased asymptotically after reaching 12, setting P=12 seems an optimal choice when we must pick a single value of P for all file sizes and network speeds.

TODO

As noted above, my solution has two major flaws:

  1. It does not retry failed chunk requests and halts with an error when one occurs

Additionally:

  2. It relies on discovering the size of the file in advance (to calculate chunk sizes) but has no fallback measures for discovering that information if the target server does not supply it in response to a HEAD request.

If I had more time to work on this project, I'd want to solve the above problems in roughly that order. Here are some preliminary thoughts on how I'd do so:

1. Handling Failed Requests

To handle failed requests, first we'd want to track the status of all requests. We'd need to store this data in some sort of thread-safe data structure that a bunch of parallel threads could read from and write to in parallel without producing contention or draining resources in acquiring locks, etc. My data structure of choice would be some sort of concurrent hash map. It looks like rust has a decent implementation in chashmap.

When FileDownloader#fetch was called, I'd pass it an empty chashmap. As requests or writes fail or succeed, I'd record that fact in an entry in the chashmap. As a first approximation, let's say the keys of the map would be the offsets and the values would be an enum with possible values Success, FailedRequest, and FailedWrite. If either a request or a write failed, one of the futures in the stream being collected into one future would fail, meaning that the entire future would fail. I'd map_err over this future and recursively call fetch again, feeding it the (now non-empty) version of the hash map.

On subsequent calls, fetch should omit offsets whose entries contain Success values from the stream of offsets it generates for range requests. (Which means that on the initial call, it should also consult the empty map to help it construct a full set of offsets.)

To add further resilience, we could consider the case that execution might terminate in the middle of downloading or writing. To recover from this, we could serialize our chashmap at some sane interval and write it to disk, and always attempt to build the initial chashmap that we pass into fetch by deserializing whatever we do (or don't) have stored on disk.
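Here is a hedged sketch of that bookkeeping, assuming the chashmap crate -- the status enum and helper below are hypothetical scaffolding, not existing dl code:

// Requires chashmap = "2". Everything here is hypothetical scaffolding for the
// retry scheme described above.
use chashmap::CHashMap;

#[derive(Clone, Copy, PartialEq)]
enum PieceStatus {
    Success,
    FailedRequest,
    FailedWrite,
}

// On each (re)invocation, fetch would only issue range requests for these offsets:
// everything not yet marked Success in the shared map.
fn pending_offsets(all_offsets: &[u64], statuses: &CHashMap<u64, PieceStatus>) -> Vec<u64> {
    all_offsets
        .iter()
        .copied()
        .filter(|offset| match statuses.get(offset) {
            Some(status) => *status != PieceStatus::Success,
            None => true, // never attempted yet
        })
        .collect()
}

Recording an outcome from whichever future observed it would then be a single statuses.insert(offset, PieceStatus::FailedWrite) (or Success) call.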

2. Fallback file-size discovery

I left a seam for introducing a functional take on a "strategy pattern" solution to this problem in MetadataDownloader::fetch. The idea would be that we would call a pipeline of composed functions to try to discover the file size. If any of them find the answer, we return early. If all of them fail, we return an error, causing the caller's execution to short circuit.

I can think of 3 strategies to try (in increasing order of cleverness):

  1. Ask for the length in a HEAD request (what we currently do). If that fails...
  2. Issue a range request for bytes 0-1 of the file. Inspect the Content-Length and/or Content-Range headers and extract the size of the file from the header values, if present. If that fails...
  3. Perform a binary-search-like probe for the last byte; when you find it, its (zero-based) position plus one is the file size

As (3) is the only non-trivial solution, its implementation bears some discussion. I'd imagine something like the following:

  • write a recursive function that requests a one-byte chunk of the file at index 2^N, with N starting at 0. If you get data in response, increase N by 1 and repeat until you don't get a response.
  • remember the first index (2^N) at which you didn't get a response and the last one at which you did. Call the former upper and the latter lower.
  • write a recursive function that implements binary search between upper and lower:
    • query the server for data at an index halfway between upper and lower
    • if you find data at that index, set lower to that index and recurse
    • if you don't find data at that index, set upper to that index and recurse
    • terminate the recursion when lower is one less than upper, and return lower (plus one, to convert a zero-based index into a size)

This should find the size of the file in roughly O(log N) requests, where N is the size of the file in bytes.
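Here is an iterative sketch of that search, where probe_byte is a hypothetical closure standing in for "issue a one-byte range request at this index and report whether data came back":

// Illustrative sketch of strategy (3). Written iteratively rather than recursively
// for simplicity; the probe is abstracted as a closure so the search logic is
// testable without a network.
fn find_file_size<F>(mut probe_byte: F) -> u64
where
    F: FnMut(u64) -> bool, // true if a byte exists at this index
{
    if !probe_byte(0) {
        return 0; // empty file
    }
    // Exponential phase: double the probe index until we overshoot the end of the file.
    let mut lower: u64 = 0; // last index known to hold data
    let mut upper: u64 = 1; // candidate index past the end
    while probe_byte(upper) {
        lower = upper;
        upper *= 2;
    }
    // Binary-search phase: close the gap between lower (data) and upper (no data).
    while upper - lower > 1 {
        let mid = lower + (upper - lower) / 2;
        if probe_byte(mid) {
            lower = mid;
        } else {
            upper = mid;
        }
    }
    // `lower` is the index of the last byte, so the size is lower + 1.
    lower + 1
}

For an ~8 GB file that works out to roughly 33 probes for the exponential phase plus about as many again for the binary search -- well under a hundred extra requests.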
