turbo is a tool for quickly downloading and uploading large datasets from and to AWS S3. turbo is available as a Rust library, a Python library or a CLI tool.
turbo was written specifically for quickly moving large machine learning datasets (e.g. tens or hundreds of thousands of images) between S3 and a local or virtual machine used for model development and training, but it is useful for any use case that involves a large number of fairly small files.
For all use cases turbo requires AWS credentials to access private buckets, and the region must be set. turbo uses the AWS Rust SDK, so the usual methods for providing credentials (i.e. a credentials file or environment variables) are supported. See the AWS Rust SDK documentation for more details.
For env variables the following need to be set:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION
These can be set like
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=...
or using a .env file that looks like
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=...
turbo uses dotenv, so there's no need to source your .env file if you choose to use one.
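Before starting a long transfer it can help to confirm that the three variables listed above are actually present. The following is a minimal Python sketch of such a check; the helper name `missing_aws_vars` is hypothetical and not part of turbo. It only inspects the process environment, so values that live in a .env file or an AWS credentials file will be reported as missing even though turbo may still find them.

```python
import os

# Variable names taken from the list above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env=None):
    """Return the required AWS variables that are unset or empty in `env`.

    `env` defaults to the current process environment; passing a plain
    dict makes the check easy to exercise without touching real secrets.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing AWS settings: " + ", ".join(missing))
    else:
        print("All required AWS variables are set")
```

An empty list means all three variables are set to non-empty values in the environment that was checked.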
To install from source you'll first need to install rustup by following the official rustup installation instructions.
Then clone this repo
git clone git@github.com:benjaminjellis/turbo.git
Navigate to the cloned repo and run
cargo build --release
This will create a binary called turbo (or turbo.exe on Windows) in the directory target/release.
If you add this binary to a location in your path you'll be able to run turbo.
Pre-compiled binaries for Windows, Linux and macOS are provided with each release.
turbolib is the back end for the CLI tool turbo. It is distributed via crates.io.
To use turbolib simply add it to the dependencies section of your Cargo.toml
[dependencies]
turbolib = "*"
turbos3-py is a Python package that provides Python bindings for turbolib and is available using pip
pip install turbos3_py
To download an entire bucket (e.g. my_bucket) navigate to the directory where you'd like to save the download and run
turbo download --bucket my_bucket --output data
This will download the my_bucket bucket into a directory called data.
Using the --filter flag, regular expressions can be used to specify what in a bucket to download.
For example take a bucket my_bucket that has three subfolders: test, train and val
📂 my_bucket
 ┣ 📂 test
 ┃ ┣ somefile.txt
 ┃ ┣ another_file.txt
 ┃ ┃ ...
 ┃ ┗ etc.
 ┃
 ┣ 📂 train
 ┃ ┣ somefile.txt
 ┃ ┣ another_file.txt
 ┃ ┃ ...
 ┃ ┗ etc.
 ┃
 ┗ 📂 val
   ┣ somefile.txt
   ┣ another_file.txt
   ┃ ...
   ┗ etc.
To download just the val directory you can run
turbo download --bucket my_bucket --output data --filter 'val/*'
Note the single quotes around val/*, which stop the shell from expanding the pattern before turbo sees it.
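Because the filter is a regular expression rather than a shell glob, val/* reads as "val followed by zero or more slashes". The Python sketch below illustrates what that means for a handful of hypothetical object keys. It assumes the pattern is applied as an unanchored search against each key, which this README does not confirm, and it uses Python's `re` module as a stand-in for whatever regex engine turbo uses; `select_keys` is an illustrative helper, not part of turbo.

```python
import re

# Hypothetical object keys mirroring the example bucket layout above.
keys = [
    "test/somefile.txt",
    "train/somefile.txt",
    "val/somefile.txt",
    "val/another_file.txt",
]


def select_keys(keys, pattern):
    """Keep the keys in which `pattern` matches anywhere (unanchored search).

    Assumption: this mimics how a regex filter might be applied; the real
    matching rules are defined by turbo, not by this helper.
    """
    regex = re.compile(pattern)
    return [key for key in keys if regex.search(key)]


# Only the two keys under val/ contain the text "val".
print(select_keys(keys, "val/*"))

# Under search semantics a key such as "eval/notes.txt" also matches,
# while anchoring the pattern with ^ restricts it to the val/ prefix.
extra = keys + ["eval/notes.txt"]
print(select_keys(extra, "val/*"))
print(select_keys(extra, "^val/"))
```

If matching does work this way, an anchored pattern such as ^val/ is the stricter choice when other keys merely contain "val" somewhere in their path.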
To upload an entire local directory my_local_dir run
turbo upload --input my_local_dir --bucket my_bucket
Using the --filter flag, regular expressions can be used to specify what in a local directory to upload.
For example take a local directory my_local_dir that has three subdirectories: test, train and val
📂 my_local_dir
 ┣ 📂 test
 ┃ ┣ somefile.txt
 ┃ ┣ another_file.txt
 ┃ ┃ ...
 ┃ ┗ etc.
 ┃
 ┣ 📂 train
 ┃ ┣ somefile.txt
 ┃ ┣ another_file.txt
 ┃ ┃ ...
 ┃ ┗ etc.
 ┃
 ┗ 📂 val
   ┣ somefile.txt
   ┣ another_file.txt
   ┃ ...
   ┗ etc.
To upload just the val directory you can run
turbo upload --input my_local_dir --bucket my_bucket --filter 'val/*'
turbos3-py provides the same uploading and downloading via two functions:
- upload
- download
Note that because the backend turbolib is async, so are the Python functions
from turbos3_py import download, upload
import asyncio

async def main():
    await download(bucket="my-bucket", output="./data")
    await upload(bucket="my-other-bucket", input="./some_local_dir")

if __name__ == "__main__":
    asyncio.run(main())
The same filtering that the turbo CLI allows can be used from Python by passing the regular expression via the filter kwarg.
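As a rough sketch of how the filter kwarg might be wired in, the snippet below builds the keyword arguments for a download and only adds filter when a pattern is given. The `bucket`, `output` and `filter` names come from this README; the `download_args` helper and its `pattern` parameter are hypothetical, and the bucket name and paths are placeholders.

```python
import asyncio


def download_args(bucket, output, pattern=None):
    """Build keyword arguments for turbos3_py.download.

    The `filter` key is only included when a pattern is supplied, so the
    same helper covers both whole-bucket and filtered downloads.
    """
    kwargs = {"bucket": bucket, "output": output}
    if pattern is not None:
        kwargs["filter"] = pattern
    return kwargs


async def main():
    # Imported here so the helper above can be used (and tested) on a
    # machine where turbos3_py is not installed.
    from turbos3_py import download

    # Download only the keys matching the regular expression val/*.
    await download(**download_args("my-bucket", "./data", pattern="val/*"))
```

To run the download, call `asyncio.run(main())` with the AWS credentials and region configured as described earlier.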
- The underlying AWS Rust SDK is in developer preview, so there may be bugs
- turbo's API isn't yet stable, so it may change across versions