# FineWeb dataset

The [🍷 FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on 🏭 datatrove, Hugging Face's large-scale data processing library.

The full dataset takes up 51.3 TB, which overwhelms most local setups, so smaller random samples of 10B, 100B, and 350B tokens are also provided.

In this project, we use the 10B-token sample, which should suffice to train a 124M GPT-2 model (as Andrej Karpathy did in [his GPT-2 tutorial](https://github.com/karpathy/llm.c/discussions/481)).

Along with the `default` config (all the data) and the configs for each individual dump, the following sample configs can be downloaded:

- sample-350BT: a subset randomly sampled from the whole dataset of around 350B gpt2 tokens (388GB)
- sample-100BT: a subset randomly sampled from the whole dataset of around 100B gpt2 tokens (277.4GB)
- sample-10BT: a subset randomly sampled from the whole dataset of around 10B gpt2 tokens (27.6GB)

sample-10BT was sampled from sample-100BT, which in turn was sampled from sample-350BT.
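
As a minimal sketch of selecting one of these configs by name, the helper below streams a sample through `load_dataset`. The helper name `stream_fineweb_sample` and the `SAMPLE_CONFIGS` tuple are illustrative and not part of any library; the notebook itself takes a different route and downloads the parquet files with `snapshot_download`.

```python
from typing import Any, Iterable

# Config names as listed above.
SAMPLE_CONFIGS = ("sample-10BT", "sample-100BT", "sample-350BT")


def stream_fineweb_sample(name: str = "sample-10BT") -> Iterable[dict[str, Any]]:
    """Return a streaming iterator over one FineWeb sample config (hypothetical helper)."""
    if name not in SAMPLE_CONFIGS:
        raise ValueError(f"Unknown sample config: {name!r}")

    # Imported lazily so the name check works without the `datasets` package.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole config before iterating.
    return load_dataset(
        "HuggingFaceFW/fineweb", name=name, split="train", streaming=True
    )
```

Calling `stream_fineweb_sample()` needs network access; a misspelled name such as `"sample-10B"` is rejected before any download starts.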

In [1]:
import json
import multiprocessing as mp
import os
import pathlib

from typing import Any, Dict, Iterable, List, Tuple

from datasets import load_dataset
from huggingface_hub import snapshot_download
import numpy as np
import tiktoken
import torch
from tqdm.notebook import tqdm

# Fetching the dataset

In [2]:
# Download the 10B token sample of the FineWeb dataset.
local_dir = os.path.expanduser("~/data/datasets/fineweb/")
folder = snapshot_download(
    "HuggingFaceFW/fineweb",
    repo_type="dataset",
    local_dir=local_dir,
    allow_patterns="sample/10BT/*",
)

fw = load_dataset(
    "parquet",
    data_files={"train": f"{local_dir}/sample/10BT/*.parquet"},
    streaming=True,
)["train"]

for i, row in enumerate(fw):
    print(row)
    if i > 5:
        break

type(fw)

Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

{'text': '|Viewing Single Post From: Spoilers for the Week of February 11th|\n|Lil||Feb 1 2013, 09:58 AM|\nDon\'t care about Chloe/Taniel/Jen-Jen. Don\'t care about Sami, really, but hoping that we get some good "SAMANTHA GENE!!" Marlena Death-Stares out of it. And "newfound" feelings. Please. If only.\nSTEFANO!! STEFANO, STEFANO, STEFANO!!!! :cheer:\n|Spoilers for the Week of February 11th · DAYS: News, Spoilers & Discussion|', 'id': '<urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7>', 'dump': 'CC-MAIN-2013-20', 'url': 'http://daytimeroyaltyonline.com/single/?p=8906650&t=8780053', 'date': '2013-05-18T05:48:59Z', 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz', 'language': 'en', 'language_score': 0.8232095837593079, 'token_count': 142}
{'text': '*sigh* Fundamentalist community, let me pass on some advice to you I learned from the atheistic community:\nIf you have set yourself on fire

datasets.iterable_dataset.IterableDataset

# Tokenization

## Single document tokenizer

In [12]:
# Borrowed from Andrej Karpathy's GPT-2 tutorial.
# See  https://github.com/karpathy/llm.c/blob/master/dev/data/fineweb.py


def tokenize_gpt2(doc: Dict[str, Any]) -> np.ndarray:
    """Tokenizes a single document and returns a numpy array of uint16 tokens."""
    # Initialize the tokenizer.
    enc = tiktoken.get_encoding("gpt2")
    encode = lambda s: enc.encode_ordinary(s)

    # Define the end of text token.
    eot = enc._special_tokens["<|endoftext|>"]  # end of text token

    # The special <|endoftext|> token delimits all documents.
    # NOTE: The end of text token is prepended to the document (rather than appended).
    tokens = [eot]

    # Encode the document.
    tokens.extend(encode(doc["text"]))
    tokens_np = np.array(tokens)

    # Check that the tokens are within the range of uint16.
    assert (0 <= tokens_np).all() and (
        tokens_np < 2**16
    ).all(), "Token dictionary too large for uint16"

    # Convert the tokens to a numpy array of uint16 tokens.
    tokens_np_uint = tokens_np.astype(np.uint16)
    return tokens_np_uint


def decode_gpt2(tokens: torch.Tensor) -> str:
    """Decodes a 1-D tensor of GPT-2 token ids to a string."""
    enc = tiktoken.get_encoding("gpt2")
    return enc.decode(tokens.tolist())


for i, row in enumerate(fw):
    print(row)
    tokens = tokenize_gpt2(row)
    print(tokens)
    print(type(tokens), tokens.dtype)
    break

{'text': '|Viewing Single Post From: Spoilers for the Week of February 11th|\n|Lil||Feb 1 2013, 09:58 AM|\nDon\'t care about Chloe/Taniel/Jen-Jen. Don\'t care about Sami, really, but hoping that we get some good "SAMANTHA GENE!!" Marlena Death-Stares out of it. And "newfound" feelings. Please. If only.\nSTEFANO!! STEFANO, STEFANO, STEFANO!!!! :cheer:\n|Spoilers for the Week of February 11th · DAYS: News, Spoilers & Discussion|', 'id': '<urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7>', 'dump': 'CC-MAIN-2013-20', 'url': 'http://daytimeroyaltyonline.com/single/?p=8906650&t=8780053', 'date': '2013-05-18T05:48:59Z', 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz', 'language': 'en', 'language_score': 0.8232095837593079, 'token_count': 142}
[50256    91  7680   278 14206  2947  3574    25  1338  9437   364   329
   262  6119   286  3945  1367   400    91   198    91    43   346 15886
 151
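
The `uint16` range check in `tokenize_gpt2` matters because every token is later stored in exactly two bytes. A small standard-library sketch of that constraint follows; the helper names are illustrative, and 50256 is the `<|endoftext|>` id visible as the first token in the output above.

```python
import struct

GPT2_EOT = 50256  # First token in the output above.


def pack_uint16_le(tokens: list[int]) -> bytes:
    """Pack token ids as little-endian uint16; raises struct.error if an id does not fit."""
    return struct.pack(f"<{len(tokens)}H", *tokens)


def unpack_uint16_le(blob: bytes) -> list[int]:
    """Inverse of pack_uint16_le."""
    return list(struct.unpack(f"<{len(blob) // 2}H", blob))


blob = pack_uint16_le([GPT2_EOT, 91, 7680])
print(len(blob))               # Two bytes per token.
print(unpack_uint16_le(blob))  # Round-trips to the same ids.
```

An id of 65536 or more cannot be packed this way, which is the situation the assertion in `tokenize_gpt2` guards against.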

## Data file writing

In [None]:
HEADERS_INFO = {
    "gpt-2": {
        "magic": 20240520,
        "version": 1,
        "token_dtype": np.uint16,
    },
}


def write_token_shard_uint16_to_json(
    tokens: np.ndarray,
    out_path: pathlib.Path,
) -> None:
    """Write token data to a .json file for easier handling in the data loader.

    Args:
        tokens: The tokens to write to the file (as a numpy array of uint16).
        out_path: The path to the file to write to.
    """
    assert len(tokens) < 2**31, "token count too large"  # ~2.1B tokens

    # Construct the header.
    info = HEADERS_INFO["gpt-2"]
    data = {
        "header": {
            "model": "gpt-2",
            "magic": info["magic"],
            "version": info["version"],
            "num_tokens": len(tokens),
        }
    }

    # Construct the data (numpy array of tokens).
    data["tokens"] = tokens.tolist()

    # Write to file.
    print(f"Writing {len(tokens):,} tokens to {out_path} in the gpt-2 format")
    with open(out_path, "w") as f:
        json.dump(data, f)


# write_bin_shard.py
def write_token_shard_uint16_to_bin(
    tokens: Iterable[np.ndarray | list[int]],
    out_path: pathlib.Path,
    *,
    chunk_size: int = 1_000_000,
) -> int:
    """
    Convert a stream/array of uint16 GPT-style tokens to a contiguous .bin
    file with *little-endian uint16* entries.

    Parameters
    ----------
    tokens       : Either a single np.ndarray / list of uint16s, or an
                   **iterator** that yields many such chunks.
    out_path     : Destination *.bin* file.
    chunk_size   : If `tokens` is a single ndarray, we still slice it
                   into <=chunk_size pieces to keep peak RAM small.

    Returns
    -------
    int - number of tokens written (so you can update metadata.json).
    """
    # Create the parent directory if it doesn't exist.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # open in binary write‑mode once, stream successive chunks
    with out_path.open("wb") as f:
        total = 0

        def _writer(chunk: np.ndarray):
            """Write a chunk of tokens to the file."""
            # Required to modify the 'total' variable from outer scope.
            nonlocal total

            # Require uint16 input; callers convert beforehand with np.asarray(..., dtype=np.uint16).
            assert chunk.dtype == np.uint16, "chunk datatype must be uint16"

            # Cast to uint16 little-endian *without* changing the underlying values.
            view16 = chunk.astype("<u2", copy=False)
            view16.tofile(f)
            total += len(view16)

        # ------------------------------------------------------------
        # branch: `tokens` is already an iterator vs. a monolithic array
        if isinstance(tokens, (list, np.ndarray)):
            arr = np.asarray(tokens, dtype=np.uint16)
            for i in range(0, len(arr), chunk_size):
                _writer(arr[i : i + chunk_size])
        else:
            # user supplied an iterator / generator of chunks
            for chunk in tokens:
                _writer(np.asarray(chunk, dtype=np.uint16))

    return total


test_filename = pathlib.Path(
    os.path.expanduser(
        "~/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_test.bin"
    )
)
write_token_shard_uint16_to_bin(tokens=tokenize_gpt2(row), out_path=test_filename)

### Verify the binary file

In [None]:
mm = np.memmap(test_filename, dtype="<u2", mode="r")
assert mm.dtype == np.uint16 and mm[0] < 65_536
print(f"{len(mm)} tokens loaded via memmap: {mm[:5]}")

## Full dataset tokenization

In [17]:
# Define a helper function for file writing.
def write_data_to_file(
    filename_without_extension: pathlib.Path, data: np.ndarray, file_type: str
) -> pathlib.Path:
    """Write data to a file and return the path that was written."""
    # Create the parent directory if it doesn't exist.
    filename_without_extension.parent.mkdir(parents=True, exist_ok=True)

    if file_type == "json":
        filename = filename_without_extension.with_suffix(".json")
        write_token_shard_uint16_to_json(tokens=data, out_path=filename)
    elif file_type == "bin":
        filename = filename_without_extension.with_suffix(".bin")
        write_token_shard_uint16_to_bin(tokens=data, out_path=filename)
    else:
        raise ValueError(f"Invalid file type: {file_type}")

    return filename

In [None]:
# Tokenize all documents and write output shards, each of shard_size tokens (last shard has
# remainder)

# Set the shard size (size of each data shard in the output files, in tokens).
SHARD_SIZE = 10**8  # 100M tokens

# Set the token dtype.
TOKEN_DTYPE = np.uint16

# Define the dataset name.
DATASET_NAME = "fineweb_sample_10BT"

# Define the data cache directory.
DATA_CACHE_DIR = os.path.expanduser("~/data/datasets/fineweb/sample/10BT_tokenized")

# Allocate the number of processes to use (N - 2 to avoid hogging the entire system).
N_PROCS = max(1, (os.cpu_count() or 1) - 2)  # os.cpu_count() may return None.

# Define the file type.
FILE_TYPE = "bin"

# Instantiate the metadata.json file.
metadata = {
    "dataset_name": "fineweb_sample_10BT",
    "data_cache_dir": os.path.expanduser(
        "~/data/datasets/fineweb/sample/10BT_tokenized"
    ),
    "metadata": {
        "train": [],
        "val": [],
    },
}


def get_filename(shard_index: int) -> pathlib.Path:
    split = "val" if shard_index == 0 else "train"
    return pathlib.Path(
        os.path.join(DATA_CACHE_DIR, f"{DATASET_NAME}_{split}_{shard_index:06d}")
    )


process = True
if process:
    with mp.Pool(N_PROCS) as pool:
        # Initialize the shard index to 0.
        shard_index = 0

        # Preallocate buffer to hold current shard.
        all_tokens_np = np.empty((SHARD_SIZE,), dtype=TOKEN_DTYPE)
        token_count = 0
        progress_bar = None

        for tokens in pool.imap(tokenize_gpt2, fw, chunksize=16):

            # Is there enough space in the current shard for the new tokens?
            if token_count + len(tokens) < SHARD_SIZE:
                # Simply append tokens to current shard.
                all_tokens_np[token_count : token_count + len(tokens)] = tokens
                token_count += len(tokens)

                # Update progress bar.
                if progress_bar is None:
                    progress_bar = tqdm(
                        total=SHARD_SIZE, unit="tokens", desc=f"Shard {shard_index}"
                    )
                progress_bar.update(len(tokens))
            else:
                # Write the current shard and start a new one.
                filename = get_filename(shard_index)

                # Split the document into whatever fits in this shard, the remainder goes to next one.
                remainder = SHARD_SIZE - token_count
                progress_bar.update(remainder)
                all_tokens_np[token_count : token_count + remainder] = tokens[
                    :remainder
                ]

                write_data_to_file(filename, all_tokens_np, FILE_TYPE)
                shard_index += 1
                progress_bar = None

                # Populate the next shard with the leftovers of the current doc.
                all_tokens_np[0 : len(tokens) - remainder] = tokens[remainder:]
                token_count = len(tokens) - remainder

        # Write any remaining tokens as the last shard.
        if token_count != 0:
            filename = get_filename(shard_index)
            write_data_to_file(filename, all_tokens_np[:token_count], FILE_TYPE)
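
The sharding loop above can be reduced to a small pure-Python generator, which makes the boundary handling easier to check on toy data. This is a sketch, not the notebook's code: `shard_token_stream` is a hypothetical name, it works on plain lists instead of a preallocated NumPy buffer, and unlike the loop above it also handles a document longer than one shard.

```python
from typing import Iterable, Iterator


def shard_token_stream(
    docs: Iterable[list[int]], shard_size: int
) -> Iterator[list[int]]:
    """Yield shards of exactly shard_size tokens; the last shard holds the remainder."""
    buf: list[int] = []
    for tokens in docs:
        start = 0
        while start < len(tokens):
            # Take as much of the document as still fits into the current shard.
            take = min(shard_size - len(buf), len(tokens) - start)
            buf.extend(tokens[start : start + take])
            start += take
            if len(buf) == shard_size:
                yield buf
                buf = []
    if buf:
        yield buf


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(list(shard_token_stream(docs, shard_size=4)))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

The second document is split across the first two shards, which mirrors the `remainder` logic in the loop above.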

## Create metadata

In [None]:
# Initialize metadata.
metadata = {
    "dataset_name": "fineweb_sample_10BT",
    "data_cache_dir": os.path.expanduser(
        "~/data/datasets/fineweb/sample/10BT_tokenized"
    ),
    "metadata": {
        "train": [],
        "val": [],
    },
}

# Get a list of all pre-processed data shards in the DATA_CACHE_DIR.
DATA_CACHE_DIR = os.path.expanduser("~/data/datasets/fineweb/sample/10BT_tokenized")
for file_path in tqdm(
    sorted(
        [
            os.path.join(DATA_CACHE_DIR, f)
            for f in os.listdir(DATA_CACHE_DIR)
            if f.endswith(".json")
        ]
    ),
    desc="Loading data shards",
):
    # Load the data shard.
    with open(file_path, "r") as f:
        data = json.load(f)

    # Get the number of tokens in the data shard.
    header = data["header"]
    header["filename"] = os.path.basename(file_path)

    if "train" in file_path:
        metadata["metadata"]["train"].append(header)
    elif "val" in file_path:
        metadata["metadata"]["val"].append(header)
    else:
        raise ValueError(f"Unknown file type: {file_path}")

metadata

# Save the metadata to a JSON file.
with open(os.path.join(DATA_CACHE_DIR, "metadata.json"), "w") as f:
    json.dump(metadata, f, indent=4)
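
The loop above only picks up `.json` shards. For `.bin` shards, which carry no header, one hedged alternative is to derive the token count from the file size (two bytes per `uint16` token). `build_bin_metadata` is a hypothetical helper, and it assumes the `_train_` / `_val_` naming produced by `get_filename`.

```python
import os
import pathlib
from typing import Any

BYTES_PER_TOKEN = 2  # uint16


def build_bin_metadata(data_dir: pathlib.Path) -> dict[str, list[dict[str, Any]]]:
    """Collect per-shard token counts for headerless .bin shards (illustrative helper)."""
    meta: dict[str, list[dict[str, Any]]] = {"train": [], "val": []}
    for path in sorted(data_dir.glob("*.bin")):
        # Match the split on the file name only, so directory names cannot interfere.
        if "_train_" in path.name:
            split = "train"
        elif "_val_" in path.name:
            split = "val"
        else:
            raise ValueError(f"Unknown split for file: {path.name}")
        n_bytes = os.path.getsize(path)
        meta[split].append(
            {"filename": path.name, "num_tokens": n_bytes // BYTES_PER_TOKEN}
        )
    return meta
```

The result can be merged into the `metadata` dictionary before it is written to `metadata.json`; unlike the JSON route, no shard has to be loaded to read its length.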

## Load a shard

### JSON

In [None]:
filename = pathlib.Path(
    os.path.expanduser(
        "~/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_000001.json"
    )
)

with open(filename, "r") as f:
    data = json.load(f)

print(data.keys())
print(data["header"])

### BIN

In [3]:
filename = pathlib.Path(
    os.path.expanduser(
        "~/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_000001.bin"
    )
)

mm = np.memmap(filename, dtype="<u2", mode="r")
assert mm.dtype == np.uint16 and mm[0] < 65_536
print(f"{len(mm)} tokens loaded via memmap: {mm[:5]}")

100000000 tokens loaded via memmap: [9370 4146 2662 2767 1581]


# Dataset and Dataloader

## Design choices

The following table summarizes the design choices for this dataset and data loader (as suggested by OpenAI's o3-pro model).

| Pain point                                    | Design choice                                                                         | Ref                |
| --------------------------------------------- | ------------------------------------------------------------------------------------- | ------------------ |
| **RAM blow‑up**                               | `np.memmap` ➜ pages are faulted in *on demand*, O/S handles LRU                       | ([comet.com][1])   |
| **10 B tokens won’t fit in a Python list**    | `IterableDataset` yields one sample at a time, no `__len__` or index repo needed      | ([pytorch.org][2]) |
| **Duplicate data across  n × gpus × workers** | stride `start::stride` partitions the shard list deterministically by `<rank,worker>` | –                  |
| **Epoch‑wise shuffling**                      | Shuffle order of shards *and* intra‑shard offsets with epoch‑seeded RNG               | –                  |

[1]: https://www.comet.com/site/blog/understanding-memory-mapping-in-numpy-for-deep-learning-pt-1/ "Understanding Memory Mapping in Numpy for Deep Learning: Pt 1"
[2]: https://pytorch.org/docs/stable/data.html "torch.utils.data - PyTorch documentation"
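
The epoch-seeded shuffling in the last table row can be sketched with the standard library alone. `epoch_shuffled` and `base_seed` are illustrative names; the point is that a private `random.Random` seeded from the epoch gives every rank the same order without touching the global RNG.

```python
import random


def epoch_shuffled(items: list[str], epoch: int, base_seed: int = 0) -> list[str]:
    """Return a copy of `items` shuffled by an RNG seeded with base_seed + epoch."""
    out = list(items)
    random.Random(base_seed + epoch).shuffle(out)
    return out


shards = [f"shard_{i:03d}" for i in range(8)]

# Same epoch -> same order, no matter which process computes it.
assert epoch_shuffled(shards, epoch=0) == epoch_shuffled(shards, epoch=0)

# Shuffling only reorders; no shard is lost or duplicated.
assert sorted(epoch_shuffled(shards, epoch=1)) == shards
```

Because every rank derives the identical order, slicing that order by `(rank, worker)` afterwards still yields disjoint subsets.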

## Why this is distributed-safe

| DDP/FSDP requirement                    | How the dataset meets it                                                                                                                                        |
| --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Every rank sees different data**      | `stride = world × workers` slices shard list by `(rank,worker)`                                                                                                 |
| **Repeatable shuffling**                | seed = `base_seed + epoch` ⇒ deterministic order per epoch                                                                                                      |
| **Equal *number* of samples per rank**  | shards are equal‑size (100 M tokens) apart from a shorter last shard; every `(rank,worker)` gets the same number of shards only when the shard count is a multiple of `stride`, otherwise some get one shard more and finish later |
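
A toy version of the `(rank, worker)` slicing makes the partition concrete. The function name `shards_for` is hypothetical, and the 2 ranks × 2 workers layout is an assumption chosen for illustration; 103 matches the number of training shards found later in this notebook.

```python
def shards_for(
    shards: list[int], rank: int, world: int, worker_id: int, num_workers: int
) -> list[int]:
    """Return the shards owned by one (rank, worker) pair using start::stride slicing."""
    stride = world * num_workers
    start = rank * num_workers + worker_id
    return shards[start::stride]


shards = list(range(103))
world, num_workers = 2, 2

parts = [
    shards_for(shards, rank, world, worker_id, num_workers)
    for rank in range(world)
    for worker_id in range(num_workers)
]

print([len(p) for p in parts])  # [26, 26, 26, 25]
```

The four slices are disjoint and together cover all 103 shards, but the last worker receives one shard less, so its rank runs out of data slightly earlier.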


In [7]:
import os, random, numpy as np, torch, pathlib
from typing import Iterator, List, Tuple
from torch.utils.data import IterableDataset, get_worker_info, DataLoader
import torch.distributed as dist


class TokenShardDataset(IterableDataset):
    """
    Streams fixed‑length training examples from many >RAM shards
    without ever loading a full shard into memory.
    """

    def __init__(
        self, shard_paths: List[pathlib.Path], seq_len: int = 1024, shuffle: bool = True
    ):
        """
        Initialize the dataset.

        Args:
            shard_paths: List of paths to the shard files.
            seq_len: The length of the sequence to yield.
            shuffle: Whether to shuffle the shards.
        """
        super().__init__()
        self.shard_paths = list(shard_paths)
        self.seq_len = seq_len
        self.shuffle = shuffle

        # Discover world context once so every epoch can reuse it.
        self.rank, self.world = (
            (dist.get_rank(), dist.get_world_size())
            if dist.is_initialized()
            else (0, 1)
        )
        print(f"Rank: {self.rank}, World: {self.world}")

    # ------------------------------------------------------------------ helpers
    def _open_shard(self, path: pathlib.Path) -> Tuple[np.ndarray, int]:
        """
        Open a shard file and return a memory-mapped array of the tokens.

        NOTE: The shard files are memory-mapped as unsigned 16-bit (2-byte) little-endian (as
              written by `write_token_shard_uint16_to_bin` in this notebook).

        Args:
            path: Path to the shard file.

        Returns:
            Tuple[np.ndarray, int]: The memory-mapped array of the tokens and the number of tokens.
        """
        # Memory‑map as unsigned 16‑bit (2 bytes) little‑endian.
        n_bytes = os.path.getsize(path)
        n_tokens = n_bytes // 2
        mm = np.memmap(path, dtype="<u2", mode="r", shape=(n_tokens,))
        return mm, n_tokens

    def _iter_one_shard(
        self, mm: np.ndarray, n_tokens: int
    ) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        """
        Iterate over a shard file and yield (input, target) pairs.

        Args:
            mm: The memory-mapped array of the tokens.
            n_tokens: The number of tokens in the shard.
        """
        # +1 because we need (seq_len+1) tokens to create an (input, target) pair.
        max_offset = n_tokens - (self.seq_len + 1)
        if max_offset < 0:
            return  # Skip shards shorter than one (input, target) window.

        # A deterministic but different start per GPU & worker.
        g = random.Random()  # Local RNG; thread‑safe.
        seed = (self.epoch * 17) ^ (self.rank * 971) ^ (self.worker_id * 31)
        g.seed(seed)

        # Streaming: walk through the shard once each epoch.
        # `max_offset` itself is a valid start, hence the inclusive upper bound.
        offsets = list(range(0, max_offset + 1, self.seq_len))
        if self.shuffle:
            g.shuffle(offsets)

        for off in offsets:
            # Copy the window out of the read-only memmap as int64, since older
            # PyTorch releases cannot convert uint16 arrays with from_numpy.
            seq = mm[off : off + self.seq_len + 1].astype(np.int64)
            x = torch.from_numpy(seq[:-1])  # Input ids.
            y = torch.from_numpy(seq[1:])  # Targets.
            yield x, y

    # ------------------------------------------------------------------ main iterator
    def __iter__(self) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        """
        Iterate over the dataset.
        """
        self.worker_id = 0
        if (info := get_worker_info()) is not None:
            self.worker_id = info.id
            self.num_workers = info.num_workers
        else:
            self.num_workers = 1
        self.epoch = getattr(self, "_epoch", 0)  # Set via set_epoch(); defaults to 0.

        # slice the shard list: each <rank,worker> gets every N‑th file
        shards = self.shard_paths.copy()
        if self.shuffle:
            random.Random(self.epoch).shuffle(shards)

        # unique stride = world * num_workers
        stride = self.world * self.num_workers
        start = self.rank * self.num_workers + self.worker_id
        my_shards = shards[start::stride]

        for path in my_shards:
            mm, n_tokens = self._open_shard(path)
            yield from self._iter_one_shard(mm, n_tokens)

    def set_epoch(self, epoch: int):
        """
        Set the epoch number.

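        NOTE: The `DataLoader` does not call this hook; the training loop must call it
              before each epoch. With `persistent_workers=True`, workers keep their own
              copy of the dataset, so a call made in the main process after the workers
              have started does not reach them.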

        Args:
            epoch: The epoch number.
        """
        self._epoch = epoch


# Get a list of all pre-processed data shards in the DATA_CACHE_DIR.
DATA_CACHE_DIR = os.path.expanduser("~/data/datasets/fineweb/sample/10BT_tokenized")
file_paths = sorted(
    [
        os.path.join(DATA_CACHE_DIR, f)
        for f in os.listdir(DATA_CACHE_DIR)
        if f.endswith(".bin") and "train" in f
    ]
)

# Print the number of files.
print(f"Found {len(file_paths)} training data files in {DATA_CACHE_DIR}")
for f in file_paths[:5]:
    print(f)

Found 103 training data files in /home/dpickem/data/datasets/fineweb/sample/10BT_tokenized
/home/dpickem/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_000001.bin
/home/dpickem/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_000002.bin
/home/dpickem/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_000003.bin
/home/dpickem/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_000004.bin
/home/dpickem/data/datasets/fineweb/sample/10BT_tokenized/fineweb_sample_10BT_train_000005.bin
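
How `_iter_one_shard` turns a flat token array into `(input, target)` pairs is easiest to see on a tiny list. The sketch below uses plain Python lists and hypothetical helper names, and it counts `max_offset` itself as a valid window start.

```python
from typing import Iterator


def iter_pairs(tokens: list[int], seq_len: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (input, target) windows; the target is the input shifted by one token."""
    max_offset = len(tokens) - (seq_len + 1)
    for off in range(0, max_offset + 1, seq_len):
        window = tokens[off : off + seq_len + 1]
        yield window[:-1], window[1:]


def count_pairs(n_tokens: int, seq_len: int) -> int:
    """Number of windows iter_pairs yields for a shard of n_tokens tokens."""
    max_offset = n_tokens - (seq_len + 1)
    return 0 if max_offset < 0 else max_offset // seq_len + 1


for x, y in iter_pairs(list(range(10)), seq_len=3):
    print(x, y)
# [0, 1, 2] [1, 2, 3]
# [3, 4, 5] [4, 5, 6]
# [6, 7, 8] [7, 8, 9]

print(count_pairs(100_000_000, 1024))  # 97656
```

Consecutive windows advance by `seq_len`, so the last target of one window is the first input of the next; a full 100M-token shard yields 97,656 sequences of length 1024.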


## Common pitfalls

| Symptom                                                               | Fix                                                                                                  |
| --------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| **“Too many open files”** after a few epochs                          | `ulimit -n 4096` or open shards lazily as shown rather than in `__init__`.                           |
| **Imbalanced batches across ranks** (some GPUs finish an epoch early) | Make sure every shard is *exactly* the same length or pad the last sequence to `seq_len+1`.          |
| CUDA error at first all‑reduce                                        | Confirm every rank sees the *same* batch size (no drop‑last) and `dist.barrier()` after `set_epoch`. |
| Slow first epoch, fast later                                          | Normal: the kernel page‑faults the first pass, then serves from cache.                               |


In [14]:
# Default batch size. This determines how many sequences are processed together in each iteration.
DEFAULT_BATCH_SIZE = 16

# Default context length. This determines the maximum length of the input sequences (1024 is
# the maximum context length supported by our GPT-2 model).
DEFAULT_CONTEXT_LENGTH = 1024

# Number of subprocesses for data loading. This enables parallel data loading using multiple
# worker processes
N_PROCS = 2

# Initialize the dataset.
ds = TokenShardDataset(shard_paths=file_paths, seq_len=DEFAULT_CONTEXT_LENGTH)

# Initialize the data loader.
dl = torch.utils.data.DataLoader(
    ds,
    batch_size=DEFAULT_BATCH_SIZE,
    num_workers=N_PROCS,
    # pin_memory: If True, the data loader will copy tensors into CUDA pinned memory
    #             This speeds up GPU transfer by using page-locked memory that can't be swapped out
    pin_memory=True,
    # prefetch_factor: Number of batches loaded in advance by each worker
    #                  Here set to 2, meaning each worker queues 2 batches ahead
    #                  This helps maintain a steady flow of data to the GPU
    prefetch_factor=2,
    # persistent_workers: If True, worker processes are not killed between epochs
    #                     This avoids the overhead of recreating workers for each epoch
    persistent_workers=True,
    # drop_last: If True, drops the last incomplete batch if dataset size is not divisible by
    #            batch_size. This ensures all batches have the same size, which is often required
    #            for training.
    drop_last=True,
    # collate_fn: Custom function to merge a list of samples into a batch
    #             Here it takes a batch of (data, labels) tuples and stacks them into tensors
    #             The lambda function: lambda batch: tuple(torch.stack(x) for x in zip(*batch))
    #             - zip(*batch) groups all data samples together and all label samples together
    #             - torch.stack(x) converts each group into a single tensor
    #             - tuple(...) returns the stacked data and labels as a tuple
    collate_fn=lambda batch: tuple(torch.stack(x) for x in zip(*batch)),
)

Rank: 0, World: 1


In [15]:
for step, (data, labels) in enumerate(dl):
    print(f"Step {step}:")
    print(f"  data.shape: {data.shape}")
    print(f"  labels.shape: {labels.shape}")
    print(f"  data: {data}")
    print(f"  labels: {labels}")
    break

Step 0:
  data.shape: torch.Size([16, 1024])
  labels.shape: torch.Size([16, 1024])
  data: tensor([[  366,  1462, 12475,  ...,   606,   625,   612],
        [42791, 42693,    11,  ...,    89, 23641,  1722],
        [  250, 27903, 10211,  ...,    13, 10127,   345],
        ...,
        [ 1532,   612,   447,  ...,  4838,    13,   314],
        [  509,  1040,   805,  ...,   389, 26731,  2004],
        [  737, 16227,    11,  ...,    82,   737, 20401]])
  labels: tensor([[ 1462, 12475,   262,  ...,   625,   612,    13],
        [42693,    11,   290,  ..., 23641,  1722,    13],
        [27903, 10211, 12840,  ..., 10127,   345,   821],
        ...,
        [  612,   447,   247,  ...,    13,   314,  2497],
        [ 1040,   805,  2297,  ..., 26731,  2004,   284],
        [16227,    11, 10812,  ...,   737, 20401,    12]])


In [13]:
print(decode_gpt2(tokens=data[15]))

). Hence, genes in multiple-gene systems are called quantitative trait loci (QTLs). The objective of QTL study is to discover multiple genes with different effect sizes, contributing to the variation of the trait. Improved QTL linkage studies, which use many small families instead of few large ones, can be used to study the extremes of a dimension and are capable of finding genes with more than 10% effect size . Cardon et al. identified the first QTL linkage, a linkage for reading disability.
QTLs can also be identified with a simpler method that can detect QTLs with even much smaller effect sizes on the variance of the trait. This method is called allelic association and detects the correlation between a specific allele and trait in the population.
Benjamin et al.  reported one of the first associations for personality: the association between dopamine D4 receptor (DRD4), and novelty seeking. DRD4 gene has two types of alleles, short and long. Novelty seeking scores were higher in sub