benjaminjellis/turbo

turbo

turbo is a tool for quickly downloading large datasets from, and uploading them to, AWS S3. It is available as a Rust library, a Python library, or a CLI tool.

turbo was written specifically for quickly moving large machine learning datasets (e.g. tens or hundreds of thousands of images) between S3 and a local or virtual machine used for model development or training. It is useful for any use case where you have a large number of fairly small files.

1 Using Turbo

1.2 General Setup

For all use cases, turbo requires AWS credentials to access private buckets, and the region must be set. turbo uses the AWS Rust SDK, so the usual methods for providing credentials (i.e. a credentials file or environment variables) are supported. See here for more details.

If you use environment variables, the following need to be set:

- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION

These can be set like

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=...

or using a .env file that looks like

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=...

turbo uses dotenv, so there's no need to source your .env file if you choose to use one.
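
Before starting a long transfer it can help to confirm that the three variables are actually visible to the process. The following is a minimal Python sketch, not part of turbo: the helper name missing_aws_vars is hypothetical, and it only inspects environment variables, so it will report names as missing if you rely on a credentials file or a .env file instead.

```python
import os

# The three variables listed above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; it is not part of turbo.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required AWS variables are set.")
```

Passing a plain dict instead of os.environ makes the check easy to try without touching your real environment.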

1.3 Installation

turbo - CLI tool

From source

To install from source you'll need to install rustup by following the instructions here.

Then clone this repo

git clone git@github.com:benjaminjellis/turbo.git

Navigate to the cloned repo and run

cargo build --release

This will create a binary called turbo (or turbo.exe on Windows) in the directory target/release.

If you add this binary to a location on your PATH you'll be able to run turbo from anywhere.

From pre-built binaries

Pre-compiled binaries for Windows, Linux and macOS are available from here for each release.

turbolib - Rust Library

turbolib is the back end for the CLI tool turbo and is distributed via crates.io.

To use turbolib, add it to the dependencies section of your Cargo.toml

[dependencies]
turbolib = "*"

turbos3-py - Python package

turbos3-py is a Python package that provides Python bindings for turbolib. It can be installed with pip

pip install turbos3_py

1.4 Usage

turbo

Download a bucket

To download an entire bucket (e.g. my_bucket), navigate to the directory where you'd like to save the download and run

turbo download --bucket my_bucket --output data

This will download the my_bucket bucket into a directory called data.

Using regular expressions to filter downloads

Using the --filter flag, regular expressions can be used to select which objects in a bucket to download.

For example, take a bucket my_bucket that has three subfolders: test, train and val

📂 my_bucket
┣ 📂 test
┃ ┣ somefile.txt
┃ ┣ another_file.txt
┃ ┃ ...
┃ ┗ etc.
┃
┣ 📂 train
┃ ┣ somefile.txt
┃ ┣ another_file.txt
┃ ┃ ...
┃ ┗ etc.
┃
┣ 📂 val
┃ ┣ somefile.txt
┃ ┣ another_file.txt
┃ ┃ ...
┃ ┗ etc.

To download just the val directory you can run

turbo download --bucket my_bucket --output data --filter 'val/*'

Note the single quotes around val/*, which stop the shell from expanding the pattern before turbo sees it.
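
Because the filter is a regular expression rather than a shell glob, val/* reads as "val followed by zero or more slashes". The sketch below uses Python's re module to illustrate the difference on a few made-up object keys. It assumes the pattern is searched for anywhere in the key, which this README does not confirm, and Python's regex dialect is only a stand-in for whatever turbo uses internally. The keys, the select helper and the stricter pattern ^val/ are all illustrative.

```python
import re

# Made-up object keys for illustration.
keys = [
    "val/somefile.txt",
    "train/somefile.txt",
    "test/interval_notes.txt",
]


def select(pattern, keys):
    """Return keys in which the pattern matches anywhere (assumed semantics)."""
    regex = re.compile(pattern)
    return [key for key in keys if regex.search(key)]


# "val/*" also matches "interval" because the slash is optional.
loose = select("val/*", keys)

# Anchoring to the start of the key and requiring the slash is stricter.
strict = select("^val/", keys)

print(loose)
print(strict)
```

If turbo's matching behaves like this sketch, an anchored pattern is the safer choice when other keys happen to contain the folder name.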

Uploading a directory

To upload an entire local directory my_local_dir run

turbo upload --input my_local_dir --bucket my_bucket

Uploading using filters

Using the --filter flag, regular expressions can be used to select which files in a local directory to upload.

For example, take a local directory my_local_dir that has three subdirectories: test, train and val

📂 my_local_dir
┣ 📂 test
┃ ┣ somefile.txt
┃ ┣ another_file.txt
┃ ┃ ...
┃ ┗ etc.
┃
┣ 📂 train
┃ ┣ somefile.txt
┃ ┣ another_file.txt
┃ ┃ ...
┃ ┗ etc.
┃
┣ 📂 val
┃ ┣ somefile.txt
┃ ┣ another_file.txt
┃ ┃ ...
┃ ┗ etc.

To upload just the val directory you can run

turbo upload --input my_local_dir --bucket my_bucket --filter 'val/*'

turbos3-py

turbos3-py provides the same uploading and downloading via two functions:

  • upload
  • download

Note that because the backend, turbolib, is async, so are the Python functions.

from turbos3_py import download, upload
import asyncio


async def main():
    await download(bucket="my-bucket", output="./data")
    await upload(bucket="my-other-bucket", input="./some_local_dir")


if __name__ == "__main__":
    asyncio.run(main())

The same filtering that the turbo CLI allows is available in Python by passing the regular expression via the filter kwarg.
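
As a rough sketch of a filtered download, the wrapper below passes the CLI example's pattern through the filter kwarg. The wrapper name download_val is hypothetical, and passing filter as a plain string alongside bucket and output is an assumption based on the description above rather than a confirmed signature.

```python
import asyncio


async def download_val(bucket, output):
    """Download only the objects matching the filter (hypothetical wrapper)."""
    # Imported inside the function so this module can be loaded
    # even where turbos3_py is not installed.
    from turbos3_py import download

    # Assumption: filter takes the same regular expression string as the CLI.
    await download(bucket=bucket, output=output, filter="val/*")


if __name__ == "__main__":
    asyncio.run(download_val("my-bucket", "./data"))
```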

2 🔪 Sharp Bits

  • The underlying AWS Rust SDK is in developer preview, so there may be bugs.
  • turbo's API isn't yet stable, so it may change between versions.
