# Getting started with the Amazon S3 Connector for PyTorch

The Amazon S3 Connector for PyTorch provides high-throughput data access between Amazon S3 and a PyTorch training job, speeding up reads of machine learning training data and reads and writes of model checkpoints. The S3 Connector for PyTorch automatically optimizes performance when downloading training data from and writing checkpoints to Amazon S3, so you do not need to write your own code to list the objects in an S3 bucket and make concurrent requests.

The S3 Connector for PyTorch provides implementations of PyTorch's [dataset primitives](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) that you can use to load training data from Amazon S3. It supports both [map-style datasets](https://pytorch.org/docs/stable/data.html#map-style-datasets) for random data access patterns and [iterable-style datasets](https://pytorch.org/docs/stable/data.html#iterable-style-datasets) for streaming sequential data access patterns. The S3 Connector for PyTorch also includes a checkpointing interface to save and load checkpoints directly to Amazon S3, without first saving to local storage.

## Installation

To install the S3 Connector for PyTorch, **[...installation instructions...]**

In [10]:
import s3dataset
import torch.utils.data

## Simple examples

To illustrate how to use the S3 Connector for PyTorch, we've created a public Amazon S3 bucket with some example training data. [... talk about the dataset and the structure -- some images, from some friendly-licensed source. later we'll talk about WebDataset]

The simplest way to use the S3 Connector for PyTorch is to construct an `S3MapDataset`, a map-style dataset, by specifying an S3 URI:

In [7]:
dataset = s3dataset.S3MapDataset.from_prefix("s3://s3torchconnector-demo/images/", region="us-west-2")

You can randomly access a map-style dataset by indexing into it:

In [13]:
# TODO can you index only by integers, or by S3 URIs too?
dataset[0]

list_objects; id=2 bucket="s3torchconnector-demo" continued=false delimiter="" max_keys="1000" prefix="images/"
request failed request_type=Default http_status=-1 range=None duration=1.61000175s ttfb=None request_id=<unknown>
meta request failed duration=1.611079583s error=ClientError(NoSigningCredentials)


S3DatasetException: Client error: No signing credentials found

**[what do you actually get back? explain `S3Reader`, it's a stream not a bytes]**
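The distinction between a stream and a `bytes` value matters for how you consume each sample. The following is a minimal sketch, assuming only that the returned object behaves like a standard Python binary file object. `io.BytesIO` stands in for the real reader, and the payload and the helper `read_sample` are made up for illustration.

```python
import io


def read_sample(reader: io.BufferedIOBase) -> bytes:
    """Read whatever remains in a binary stream into memory."""
    return reader.read()


# Stand-in for an object returned by the dataset (hypothetical content).
fake_reader = io.BytesIO(b"\x89PNG fake image bytes")

# A stream can be consumed incrementally, for example to peek at a header.
header = fake_reader.read(4)

# A second read continues from the current position; it does not restart.
rest = read_sample(fake_reader)
```

Because a stream is consumed as it is read, read it once and keep the resulting bytes if you need the data more than once.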

Map-style datasets are also iterable, so you can loop over one to retrieve every object under the S3 prefix you specified:

In [9]:
# TODO visualize the output -- plot all the images or something
for obj in dataset:  # `obj` avoids shadowing the built-in name `object`
    print(obj)

list_objects; id=1 bucket="s3torchconnector-demo" continued=false delimiter="" max_keys="1000" prefix="images/"
request failed request_type=Default http_status=-1 range=None duration=1.431013042s ttfb=None request_id=<unknown>
meta request failed duration=1.431248958s error=ClientError(NoSigningCredentials)


S3DatasetException: Client error: No signing credentials found

## Working with `DataLoader`s

While you can work directly with datasets, most PyTorch training loops will instead use a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), a wrapper around a dataset that supports customizable ordering, automatic batching, and multi-process data loading.

You can construct a `DataLoader` as a wrapper around an `S3MapDataset` or `S3IterableDataset`, passing in arguments such as the batch size you want to use:

In [11]:
loader = torch.utils.data.DataLoader(dataset, batch_size=2)

Iterating over a `DataLoader` yields *batches* of data samples (in this case, images):

In [14]:
# TODO: iterate the loader, demonstrate the iterator yields batches
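
As a rough mental model of what `batch_size` does, the following plain-Python sketch groups samples into batches. It is not PyTorch's implementation: it ignores shuffling, collation into tensors, and worker processes. The helper `iter_batches` and the object keys are hypothetical.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(samples: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Keep the final partial batch, matching DataLoader's default
        # of drop_last=False.
        yield batch


# Hypothetical object keys standing in for dataset samples.
keys = [f"images/{i}.jpg" for i in range(5)]
batches = list(iter_batches(keys, batch_size=2))
```

With five samples and a batch size of two, the last batch holds a single sample. Pass `drop_last=True` to a real `DataLoader` if your training loop needs batches of uniform size.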

### Multi-process data loading

To speed up data loading, you can configure a `DataLoader` to automatically spawn a number of worker processes and load data in parallel in each process, using the `num_workers` argument:

In [15]:
loader = torch.utils.data.DataLoader(dataset, batch_size=4, num_workers=4)

Parallel data loading is especially important when loading training data from cloud storage services like Amazon S3, where the latency of loading each individual training sample may be high but many samples can be loaded in parallel. Using multiple workers generally gives higher training throughput than a single worker. As a starting point, we recommend setting `num_workers` to the number of vCPUs on your instance.
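
One way to derive a starting value for `num_workers` is to query the number of CPUs available to the current process. This is a sketch, not a tuned recommendation, and the helper name `default_num_workers` is hypothetical.

```python
import os


def default_num_workers() -> int:
    """Return the number of CPUs usable by this process (at least 1)."""
    if hasattr(os, "sched_getaffinity"):
        # Available on Linux: honors the CPU affinity mask of the process.
        return max(1, len(os.sched_getaffinity(0)))
    # Fallback: logical CPU count of the machine, which may be unknown.
    return os.cpu_count() or 1


num_workers = default_num_workers()
# The value would then be passed to the loader, for example:
# loader = torch.utils.data.DataLoader(dataset, batch_size=4, num_workers=num_workers)
```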

**Important**: When combining multi-process data loading with `S3IterableDataset`, by default each worker process will get its own replica of the dataset, and so each training sample will be duplicated `num_workers` times by the `DataLoader`. This is very likely not the behavior you want. **[we're working on it, link the github issue. in the meantime, show the torchdata version or something]**
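
A common PyTorch pattern for avoiding duplicates with iterable-style datasets is to have each worker keep only its own slice of the samples. Inside a worker process, `torch.utils.data.get_worker_info()` returns an object with `id` and `num_workers` attributes (and `None` in the main process). The sketch below shows only the slicing logic in plain Python, with the worker ID passed in explicitly. The helper `shard_for_worker` and the keys are hypothetical, and this is not necessarily how the connector will address the issue.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_worker(
    samples: Iterable[T], worker_id: int, num_workers: int
) -> Iterator[T]:
    """Yield every `num_workers`-th sample, starting at offset `worker_id`."""
    return islice(samples, worker_id, None, num_workers)


# Hypothetical object keys standing in for dataset samples.
keys = [f"images/{i}.jpg" for i in range(10)]

# Simulate four workers, each iterating its own replica of the dataset.
per_worker = [list(shard_for_worker(keys, w, 4)) for w in range(4)]
```

Each worker still enumerates the full sequence and skips the samples that belong to other workers, so this removes duplicates but not the cost of listing.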

## Training data formats for Amazon S3

When storing training data in your Amazon S3 bucket, collecting training samples into preprocessed *shards* can improve the throughput of your training jobs and reduce the cost of loading the data. A shard is a single Amazon S3 object that contains many samples, as opposed to storing each training sample as a separate object. Collecting samples into larger shards allows your training jobs to make the best use of S3's elastic throughput by streaming the shards in their entirety, which hides the latency of individual requests and lowers request costs.

There are several ways to collect training data into shards. For textual data used to pre-train or fine-tune large language models, even a simple format such as text files with one sample per line can be effective. For larger datasets or other data formats, consider open-source sharding formats like WebDataset or TensorFlow’s TFRecord.
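
As an illustration of the one-sample-per-line approach, the following sketch streams samples out of a text shard without loading the whole object into memory. It assumes the shard is exposed as a binary file-like object. `io.BytesIO` stands in for an object streamed from S3, and the helper `iter_text_samples` and the sample text are hypothetical.

```python
import io
from typing import BinaryIO, Iterator


def iter_text_samples(stream: BinaryIO, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one sample per non-empty line from a binary stream."""
    text = io.TextIOWrapper(stream, encoding=encoding)
    for line in text:
        line = line.rstrip("\n")
        if line:  # skip blank lines between samples
            yield line


# Stand-in for a shard object; a real shard would hold many more lines.
shard = io.BytesIO("first sample\nsecond sample\n\nthird sample\n".encode("utf-8"))
samples = list(iter_text_samples(shard))
```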

For example, you can use `S3IterableDataset` to stream training data stored in S3 in WebDataset format:


In [16]:
# TODO an example of WebDataset parsing and streaming
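
A WebDataset shard is a tar archive in which files that share a base name form one sample. The following sketch shows that idea using only Python's `tarfile` module, not the `webdataset` library: it packs two samples into an in-memory shard and then reads them back sequentially. The helpers `write_shard` and `iter_shard`, the sample keys, and the payloads are hypothetical, and the sketch assumes the shard is available as a binary file-like object.

```python
import io
import tarfile
from typing import BinaryIO, Dict, Iterator, Optional, Tuple


def write_shard(samples: Dict[str, Dict[str, bytes]]) -> bytes:
    """Pack {key: {extension: payload}} into an in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def iter_shard(stream: BinaryIO) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Stream a tar shard, grouping consecutive files that share a key."""
    current_key: Optional[str] = None
    fields: Dict[str, bytes] = {}
    # Mode "r|" reads the archive strictly sequentially, without seeking.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if current_key is not None and key != current_key:
                yield current_key, fields
                fields = {}
            current_key = key
            # In stream mode, read each member before advancing to the next.
            fields[ext] = tar.extractfile(member).read()
    if current_key is not None:
        yield current_key, fields


shard_bytes = write_shard(
    {
        "sample0": {"jpg": b"img0", "cls": b"0"},
        "sample1": {"jpg": b"img1", "cls": b"1"},
    }
)
decoded = list(iter_shard(io.BytesIO(shard_bytes)))
```

Reading in stream mode matches how a shard arrives from S3: bytes are consumed front to back, so samples can be yielded before the whole object has been downloaded.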

Preprocessing your training samples into shards is also a good opportunity to optimize your training data format to reduce cost and improve training throughput. For example, you can pre-apply transformations like resizing, normalization, and tensor conversion to image or video datasets to avoid the overhead of these transformations during training. You can also compress the sharded objects before uploading them to Amazon S3 to reduce storage costs. Finally, sharded objects are more likely to exceed the 128 KiB minimum object size required for automatic tiering in [S3 Intelligent-Tiering](https://aws.amazon.com/s3/storage-classes/intelligent-tiering/), which can further reduce storage costs for infrequently accessed training data.
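
As a sketch of the compression step, the following example gzip-compresses a line-delimited text shard and then decompresses it on the fly while reading. It assumes gzip is an acceptable codec for your data. The helpers `compress_shard` and `iter_lines_gz` and the sample text are hypothetical, and `io.BytesIO` stands in for an object streamed from S3.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def compress_shard(data: bytes) -> bytes:
    """Gzip-compress a shard before upload."""
    return gzip.compress(data)


def iter_lines_gz(stream: BinaryIO) -> Iterator[str]:
    """Decompress a gzip stream incrementally and yield one line at a time."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as text:
        for line in text:
            yield line.rstrip("\n")


# Highly repetitive made-up data, so it compresses very well.
raw = ("the same sample text\n" * 1000).encode("utf-8")
compressed = compress_shard(raw)
lines = list(iter_lines_gz(io.BytesIO(compressed)))
```

Compression trades storage and transfer size for CPU time spent decompressing in the data loading workers, so measure both before adopting it.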

## Model checkpointing

**[talk about checkpointing, torch.save]**

## An end-to-end example

**[train an actual model, using dataset for training data and checkpointing for saves. grab a simple example from HuggingFace or something.]**