# LLM Pre-training Dataset Preparation
## Load, Merge, and Split Dataset for Tokenization

This notebook provides optimized methods for:
1. Loading text from local .txt files
2. Loading datasets from HuggingFace Hub (with proper streaming support)
3. Merging multiple data sources
4. Creating input-target pairs for causal language modeling

## Installation

In [1]:
!pip install datasets transformers torch accelerate -q

In [2]:
!pip3 install tiktoken > /dev/null 2>&1

### Install Weights and Biases

In [None]:
!pip install wandb -q

### MODEL CONFIGURATION

In [3]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 128, # Context length
    "emb_dim": 64,         # Embedding dimension
    "n_heads": 8,          # Number of attention heads
    "n_layers": 8,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False,       # Query-Key-Value bias
    "max_length": 128,      # Maximum sequence length
    "output_dimension": 64, # Output dimension
    "batch_size": 2        # batch size
}

## Imports

In [4]:
from datasets import Dataset, DatasetDict, load_dataset, concatenate_datasets
from typing import List, Dict, Optional, Union, Tuple
import torch
from pathlib import Path
import os
from transformers import AutoTokenizer
import numpy as np
import wandb

In [5]:
import importlib
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


### Initialise W&B

In [None]:
# Initialize Weights & Biases
def init_wandb(config, project_name="gpt-pretraining"):
    """Initialize W&B tracking"""
    wandb.init(
        project=project_name,
        config=config,
        name=f"gpt-{config['num_epochs']}ep-lr{config['learning_rate']}",
        tags=["gpt", "pretraining", "custom"]
    )
    wandb.watch(model, log="all", log_freq=100)  # Log gradients and parameters

## 1. Load Text from Local .txt Files

In [6]:
def load_txt_file(
    file_path: str,
    encoding: str = 'utf-8',
    chunk_size: Optional[int] = None,
    overlap: int = 0
) -> Dataset:
    """
    Load text data from a .txt file and convert to HuggingFace Dataset.

    Args:
        file_path: Path to the .txt file
        encoding: Text encoding (default: 'utf-8')
        chunk_size: Optional size to split text into chunks (in characters)
        overlap: Number of overlapping characters between chunks

    Returns:
        HuggingFace Dataset containing text data
    """
    print(f"Loading text from: {file_path}")

    # Read the file
    with open(file_path, 'r', encoding=encoding) as f:
        text_content = f.read()

    # Split into chunks if specified
    if chunk_size:
        texts = []
        start = 0
        while start < len(text_content):
            end = start + chunk_size
            texts.append(text_content[start:end])
            start = end - overlap
    else:
        # Split by paragraphs (double newline) or keep as single text
        texts = [t.strip() for t in text_content.split('\n\n') if t.strip()]

    # Create dataset
    dataset = Dataset.from_dict({"text": texts})

    print(f"✓ Loaded {len(dataset):,} text samples from .txt file")
    print(f"Total characters: {sum(len(t) for t in texts):,}")
    return dataset

## 2. Load Dataset from HuggingFace Hub

In [7]:
def load_huggingface_dataset(
    dataset_name: str,
    text_column: str = "text",
    split: str = "train",
    name: Optional[str] = None,
    num_samples: Optional[int] = None,
    streaming: bool = True,
    trust_remote_code: bool = False
) -> Dataset:
    """
    Load dataset from HuggingFace Hub with optimized streaming support.

    Args:
        dataset_name: Name of the dataset on HuggingFace Hub
                     (e.g., 'HuggingFaceFW/fineweb', 'openwebtext')
        text_column: Name of the column containing text data (default: 'text')
        split: Dataset split to load (default: 'train')
        name: Dataset configuration name (e.g., 'sample-10BT' for fineweb)
        num_samples: Limit number of samples to load (recommended for large datasets)
        streaming: Use streaming mode for memory efficiency (default: True)
        trust_remote_code: Trust remote code for custom datasets

    Returns:
        HuggingFace Dataset with 'text' column

    Examples:
        # Load FineWeb dataset
        dataset = load_huggingface_dataset(
            dataset_name="HuggingFaceFW/fineweb",
            name="sample-10BT",
            num_samples=10000
        )

        # Load OpenWebText
        dataset = load_huggingface_dataset(
            dataset_name="openwebtext",
            num_samples=5000
        )
    """
    print(f"\nLoading HuggingFace dataset: '{dataset_name}'" +
          (f" (config: {name})" if name else "") +
          f" (split: {split})")

    try:
        # Load dataset with streaming for memory efficiency
        dataset = load_dataset(
            dataset_name,
            name=name,
            split=split,
            streaming=streaming,
            trust_remote_code=trust_remote_code
        )

        # Extract text from samples
        texts = []

        if streaming:
            # Iterate through streaming dataset (memory efficient)
            print(f"Extracting text from streaming dataset...")
            for i, sample in enumerate(dataset):
                if num_samples and i >= num_samples:
                    break

                # Extract text from the specified column
                if text_column in sample:
                    texts.append(sample[text_column])
                else:
                    available = list(sample.keys())
                    raise KeyError(
                        f"Column '{text_column}' not found. "
                        f"Available columns: {available}"
                    )

                # Progress indicator
                if (i + 1) % 1000 == 0:
                    print(f"  Processed {i + 1:,} samples...", end="\r")

            if texts:
                print(f"\n✓ Extracted {len(texts):,} samples from streaming dataset")
        else:
            # Non-streaming mode (loads entire dataset into memory)
            print(f"Loading in non-streaming mode...")
            if num_samples:
                dataset = dataset.select(range(min(num_samples, len(dataset))))

            # Extract text column
            if text_column in dataset.column_names:
                texts = dataset[text_column]
            else:
                raise KeyError(
                    f"Column '{text_column}' not found. "
                    f"Available columns: {dataset.column_names}"
                )

            print(f"✓ Loaded {len(texts):,} samples")

        # Create a new Dataset with only the text column
        final_dataset = Dataset.from_dict({"text": texts})

        total_chars = sum(len(t) for t in texts)
        print(f"Total characters: {total_chars:,}")
        print(f"Average text length: {total_chars / len(texts):.0f} chars per sample")

        return final_dataset

    except Exception as e:
        print(f"❌ Error loading dataset: {str(e)}")
        raise

## 3. Merge Multiple Datasets

In [8]:
def merge_datasets(
    datasets: List[Dataset],
    shuffle: bool = True,
    seed: int = 42,
    interleave: bool = False
) -> Dataset:
    """
    Merge multiple datasets into a single dataset.

    Args:
        datasets: List of HuggingFace Datasets to merge
        shuffle: Whether to shuffle the merged dataset
        seed: Random seed for shuffling
        interleave: If True, interleave datasets instead of concatenating

    Returns:
        Merged HuggingFace Dataset
    """
    print(f"\nMerging {len(datasets)} datasets...")

    # Validate all datasets have 'text' column
    for i, ds in enumerate(datasets):
        if 'text' not in ds.column_names:
            raise ValueError(f"Dataset {i} does not have 'text' column")
        print(f"  Dataset {i+1}: {len(ds):,} samples")

    # Merge datasets
    if interleave:
        # Interleave datasets (useful for balanced sampling)
        from datasets import interleave_datasets
        merged_dataset = interleave_datasets(datasets, seed=seed)
        print("Using interleave strategy...")
    else:
        # Concatenate datasets
        merged_dataset = concatenate_datasets(datasets)
        print("Using concatenation strategy...")

    # Shuffle if requested
    if shuffle:
        print("Shuffling merged dataset...")
        merged_dataset = merged_dataset.shuffle(seed=seed)

    print(f"✓ Merged dataset contains {len(merged_dataset):,} total samples")
    return merged_dataset

## 4. Create Input-Target Pairs for Pre-training\n\n**Two Approaches Available:**\n1. **Implicit Shifting** (HuggingFace standard) - labels = input_ids\n2. **Explicit Shifting** (Custom training) - labels = input_ids shifted by 1

### 4.1 Implicit Shifting (HuggingFace Style - Recommended)

In [9]:
# def create_input_target_pairs(
#     dataset: Dataset,
#     tokenizer: AutoTokenizer,
#     max_length: int = 512,
#     stride: Optional[int] = None,
#     preprocessing_num_workers: int = 4,
#     batch_size: int = 1000
# ) -> Dataset:
#     """
#     Create input-target pairs for causal language modeling (IMPLICIT SHIFTING).

#     This is the HuggingFace standard approach where:
#     - input_ids = [token_0, token_1, token_2, ..., token_n]
#     - labels = [token_0, token_1, token_2, ..., token_n] (SAME as input_ids)

#     The model shifts internally during loss calculation:
#     - At position i, the model uses input[0:i] to predict label[i]

#     ✅ Use this for: HuggingFace Transformers, Trainer API, pretrained models

#     Args:
#         dataset: HuggingFace Dataset with 'text' column
#         tokenizer: HuggingFace tokenizer
#         max_length: Maximum sequence length
#         stride: Stride for sliding window (None = max_length // 2)
#         preprocessing_num_workers: Number of parallel workers
#         batch_size: Batch size for processing

#     Returns:
#         Dataset with 'input_ids', 'attention_mask', and 'labels' columns
#     """
#     print(f"\nCreating input-target pairs (IMPLICIT SHIFTING - HuggingFace style)...")
#     print(f"Max length: {max_length} tokens")

#     # Set stride (overlap) - default to half of max_length for context preservation
#     if stride is None:
#         stride = max_length // 2
#     print(f"Stride: {stride} tokens (overlap for longer texts)")

#     def tokenize_function(examples):
#         """
#         Tokenize text and create input-target pairs.
#         Uses efficient batched tokenization with sliding window.
#         """
#         # Tokenize with return_overflowing_tokens for chunking long texts
#         tokenized = tokenizer(
#             examples["text"],
#             truncation=True,
#             max_length=max_length,
#             stride=stride,
#             return_overflowing_tokens=True,
#             return_length=True,
#             padding=False,  # We'll pad during batching in training
#         )

#         # Create labels (same as input_ids for causal LM with HuggingFace)
#         # The model will shift internally during training
#         tokenized["labels"] = tokenized["input_ids"].copy()

#         return tokenized

#     # Apply tokenization with multiprocessing
#     print("Tokenizing dataset...")
#     tokenized_dataset = dataset.map(
#         tokenize_function,
#         batched=True,
#         num_proc=preprocessing_num_workers,
#         remove_columns=dataset.column_names,
#         batch_size=batch_size,
#         desc="Tokenizing and creating pairs"
#     )

#     # Filter out sequences that are too short
#     min_length = 10  # Minimum viable sequence length
#     print(f"Filtering sequences shorter than {min_length} tokens...")
#     tokenized_dataset = tokenized_dataset.filter(
#         lambda x: len(x["input_ids"]) >= min_length,
#         num_proc=preprocessing_num_workers,
#         desc="Filtering short sequences"
#     )

#     # Calculate statistics
#     lengths = [len(x) for x in tokenized_dataset['input_ids']]
#     avg_length = np.mean(lengths)

#     print(f"\n✓ Created {len(tokenized_dataset):,} input-target pairs")
#     print(f"Average sequence length: {avg_length:.1f} tokens")
#     print(f"Total tokens: {sum(lengths):,}")

#     return tokenized_dataset

### 4.2 Explicit Shifting (Custom Training Style)

In [10]:
def create_input_target_pairs_explicit(
    dataset: Dataset,
    tokenizer,
    max_length: int = 512,
    stride: Optional[int] = None,
    preprocessing_num_workers: int = 1,  # DEFAULT to 1 (safer)
    batch_size: int = 100  # REDUCED default
) -> Dataset:
    """
    Create input-target pairs for causal language modeling (EXPLICIT SHIFTING).
    """
    print(f"\nCreating input-target pairs (EXPLICIT SHIFTING - Custom style)...")
    print(f"Max length: {max_length} tokens")

    if stride is None:
        stride = max_length
    print(f"Stride: {stride} tokens")

    # Warning for large datasets with multiprocessing
    if len(dataset) > 10000 and preprocessing_num_workers > 1:
        print(f"⚠️  Large dataset ({len(dataset):,} samples) with multiprocessing may cause OOM")
        print(f"   Recommend: preprocessing_num_workers=1")

    def tokenize_and_shift(examples):
        """
        Tokenize text and create explicitly shifted input-target pairs.
        """
        all_input_ids = []
        all_labels = []
        all_attention_mask = []

        try:
            for text in examples["text"]:
                # Skip empty texts
                if not text or len(text.strip()) == 0:
                    continue

                # Tokenize with tiktoken
                try:
                    token_ids = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
                except Exception as e:
                    print(f"Warning: Tokenization failed for text, skipping: {str(e)[:100]}")
                    continue

                # Truncate if too long
                if len(token_ids) > 1024:
                    token_ids = token_ids[:1024]

                # Skip if too short
                if len(token_ids) < max_length + 1:
                    continue

                # Create sliding window chunks
                for i in range(0, len(token_ids) - max_length, stride):
                    input_chunk = token_ids[i : i + max_length]
                    target_chunk = token_ids[i + 1 : i + max_length + 1]

                    # Only add if we have complete sequences
                    if len(input_chunk) == max_length and len(target_chunk) == max_length:
                        all_input_ids.append(input_chunk)
                        all_labels.append(target_chunk)
                        all_attention_mask.append([1] * max_length)

        except Exception as e:
            print(f"Error in tokenize_and_shift: {e}")
            # Return empty to avoid crashing
            return {
                "input_ids": [],
                "labels": [],
                "attention_mask": []
            }

        return {
            "input_ids": all_input_ids,
            "labels": all_labels,
            "attention_mask": all_attention_mask
        }

    # Apply tokenization
    print("Tokenizing and shifting dataset...")
    print(f"Using {preprocessing_num_workers} worker(s)")

    try:
        tokenized_dataset = dataset.map(
            tokenize_and_shift,
            batched=True,
            num_proc=preprocessing_num_workers if preprocessing_num_workers > 1 else None,
            remove_columns=dataset.column_names,
            batch_size=batch_size,
            desc="Tokenizing and shifting"
        )
    except Exception as e:
        print(f"\n❌ Error during tokenization: {e}")
        print("Retrying with num_proc=1 (no multiprocessing)...")

        # Retry without multiprocessing
        tokenized_dataset = dataset.map(
            tokenize_and_shift,
            batched=True,
            num_proc=None,  # Disable multiprocessing
            remove_columns=dataset.column_names,
            batch_size=batch_size,
            desc="Tokenizing and shifting (retry)"
        )

    # Filter out empty sequences
    original_len = len(tokenized_dataset)
    tokenized_dataset = tokenized_dataset.filter(
        lambda x: len(x['input_ids']) > 0,
        desc="Filtering empty sequences"
    )

    if len(tokenized_dataset) < original_len:
        print(f"Filtered out {original_len - len(tokenized_dataset)} empty sequences")

    # Calculate statistics
    if len(tokenized_dataset) > 0:
        print(f"\n✓ Created {len(tokenized_dataset):,} input-target pairs")
        print(f"Sequence length: {max_length} tokens (fixed)")
        print(f"Total tokens: {len(tokenized_dataset) * max_length:,}")
    else:
        print("\n⚠️  Warning: No sequences created! Check your data and max_length")

    return tokenized_dataset

### LOAD, MERGE AND CREATING TOKEN EMBEDDINGS

In [11]:
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
import torch

# Load tokenizer from transformers
# tokenizer = AutoTokenizer.from_pretrained("gpt2")

#Load tokenizer from tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

# Step 1: Load raw datasets (text only, no tokenization yet)
txt_dataset = load_txt_file("the-verdict.txt")

hf_dataset = load_huggingface_dataset(
    dataset_name="HuggingFaceFW/fineweb",
    name="sample-10BT",
    num_samples=5000,
    streaming=True
)

# Step 2: Merge raw datasets (still have 'text' column)
merged_dataset = merge_datasets(
    datasets=[txt_dataset, hf_dataset],
    shuffle=False
)

# Step 3: NOW apply explicit shifting tokenization
explicit_dataset = create_input_target_pairs_explicit(
    dataset=merged_dataset,  # Raw text dataset
    tokenizer=tokenizer,
    max_length=GPT_CONFIG_124M['max_length'],
    stride=4
)

# Create PyTorch DataLoader
def collate_fn(batch):
    """Convert batch to tensors"""
    input_ids = torch.tensor([item['input_ids'] for item in batch])
    labels = torch.tensor([item['labels'] for item in batch])
    return input_ids, labels

dataloader = DataLoader(
    explicit_dataset,
    batch_size=8,
    shuffle=False,
    collate_fn=collate_fn
)

# Iterate through batches
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print(f"Inputs shape:  {inputs.shape}")
print(f"Targets shape: {targets.shape}")
print(f"\nInputs:\n{inputs}")
print(f"\nTargets:\n{targets}")

# Create token embeddings
token_embedding_layer = torch.nn.Embedding(GPT_CONFIG_124M['vocab_size'], GPT_CONFIG_124M['emb_dim'])
token_embeddings = token_embedding_layer(inputs)
print(f"\nToken embeddings shape: {token_embeddings.shape}")

Loading text from: the-verdict.txt
✓ Loaded 83 text samples from .txt file
Total characters: 20,315

Loading HuggingFace dataset: 'HuggingFaceFW/fineweb' (config: sample-10BT) (split: train)


README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/27468 [00:02<?, ?it/s]

Extracting text from streaming dataset...
  Processed 5,000 samples...
✓ Extracted 5,000 samples from streaming dataset
Total characters: 15,132,101
Average text length: 3026 chars per sample

Merging 2 datasets...
  Dataset 1: 83 samples
  Dataset 2: 5,000 samples
Using concatenation strategy...
✓ Merged dataset contains 5,083 total samples

Creating input-target pairs (EXPLICIT SHIFTING - Custom style)...
Max length: 128 tokens
Stride: 4 tokens
Tokenizing and shifting dataset...
Using 1 worker(s)


Tokenizing and shifting:   0%|          | 0/5083 [00:00<?, ? examples/s]

Filtering empty sequences:   0%|          | 0/453228 [00:00<?, ? examples/s]


✓ Created 453,228 input-target pairs
Sequence length: 128 tokens (fixed)
Total tokens: 58,013,184
Inputs shape:  torch.Size([8, 128])
Targets shape: torch.Size([8, 128])

Inputs:
tensor([[    1,   464,  6001,  ..., 11161,   407,   262],
        [  465, 13476,     1,  ..., 18113,   544,  9325],
        [ 5562,   373,   644,  ...,    11,   379,   262],
        ...,
        [   13, 46606,   536,  ...,   878,   402,   271],
        [  438, 14363,   938,  ...,   338,   366, 31640],
        [ 1650,   353,   438,  ...,    67, 20811,     1]])

Targets:
tensor([[  464,  6001,   286,  ...,   407,   262, 40123],
        [13476,     1,   438,  ...,   544,  9325,   701],
        [  373,   644,   262,  ...,   379,   262,   938],
        ...,
        [46606,   536,  5469,  ...,   402,   271, 10899],
        [14363,   938,  4842,  ...,   366, 31640,    12],
        [  353,   438,  2934,  ..., 20811,     1,   284]])

Token embeddings shape: torch.Size([8, 128, 64])


In [12]:
print(token_embeddings.shape)

torch.Size([8, 128, 64])


### CREATING POSITIONAL EMBEDDINGS

In [13]:
context_length = GPT_CONFIG_124M['max_length'] # Set the context length and max length the same
pos_embedding_layer = torch.nn.Embedding(context_length, GPT_CONFIG_124M['emb_dim'])

In [14]:
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([128, 64])


### CREATE INPUT AND POSITIONAL EMBEDDING

In [15]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 128, 64])


### IMPLEMENTING MULTI-HEAD ATTENTION

In [16]:
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

###  THE BUILDING BLOCKS-LAYER NORMALIZATION, GELU AND FEED-FORWARD NEURAL NETWORK

In [17]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]), ## Expansion
            GELU(), ## Activation
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]), ## Contraction
        )

    def forward(self, x):
        return self.layers(x)

### TRANSFORMER BLOCK

In [18]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        # 2*4*768
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x
        # 2*4*768

###  ENTIRE GPT MODEL ARCHITECTURE IMPLEMENTATION

In [19]:
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

In [20]:
torch.manual_seed(123)

batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)

model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

Input batch:
 tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])

Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.6504, -0.2906,  0.6864,  ..., -0.1355, -0.3266,  0.5960],
         [-0.5232,  0.8551,  0.2914,  ...,  1.4764,  0.7322,  0.2059],
         [ 0.8078, -0.2497,  1.3213,  ...,  0.5581, -0.3751, -0.6980],
         [-1.1853,  0.4824,  0.6537,  ..., -0.1297,  0.4060, -0.2244]],

        [[-0.5507,  0.2227,  0.1802,  ...,  0.0180, -0.5139,  0.6615],
         [-0.0905,  0.7796,  0.2680,  ...,  1.4981, -0.4316,  0.4002],
         [ 0.4891,  0.1639,  0.4259,  ...,  0.2368, -0.2964, -0.7698],
         [-0.7117,  0.9490,  0.4286,  ...,  0.0579,  0.1252, -0.2558]]],
       grad_fn=<UnsafeViewBackward0>)


### MODEL SIZE CALCULATION

In [21]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 6,839,552


In [22]:
print("Token embedding layer shape:", model.tok_emb.weight.shape)
print("Output layer shape:", model.out_head.weight.shape)

Token embedding layer shape: torch.Size([50257, 64])
Output layer shape: torch.Size([50257, 64])


In [23]:
total_size_bytes = total_params * 4 #A
total_size_mb = total_size_bytes / (1024 * 1024) #B
print(f"Total size of the model: {total_size_mb:.2f} MB")

Total size of the model: 26.09 MB


### GENERATING TEXT FROM OUTPUT TOKENS - INFERENCE

In [24]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context

    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond) ### batch, n_tokens, vocab_size

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

In [25]:
import tiktoken

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())

start_context = "He said we came here"



token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 He said we came here explosionsCorptimeout crust angeredbons Raymond trauma Rapp diagnosis


###  CREATING TRAINING, TESTING AND VALIDATION DATA

In [26]:
import torch
from torch.utils.data import DataLoader

# Split the tokenized dataset into train/validation
train_ratio = 0.85
split_idx = int(train_ratio * len(explicit_dataset))

# Split using HuggingFace datasets
train_dataset = explicit_dataset.select(range(split_idx))
val_dataset = explicit_dataset.select(range(split_idx, len(explicit_dataset)))

print(f"\nDataset split:")
print(f"Training samples: {len(train_dataset):,}")
print(f"Validation samples: {len(val_dataset):,}")

# Collate function
def collate_fn(batch):
    """Convert batch to tensors"""
    input_ids = torch.tensor([item['input_ids'] for item in batch])
    labels = torch.tensor([item['labels'] for item in batch])
    return input_ids, labels

# Set manual seed for reproducibility
torch.manual_seed(123)

# Create training dataloader
train_loader = DataLoader(
    train_dataset,
    batch_size=GPT_CONFIG_124M["batch_size"],
    shuffle=True,
    drop_last=True,
    collate_fn=collate_fn,
    num_workers=0
)

# Create validation dataloader
val_loader = DataLoader(
    val_dataset,
    batch_size=GPT_CONFIG_124M["batch_size"],
    shuffle=False,
    drop_last=False,
    collate_fn=collate_fn,
    num_workers=0
)

print(f"\nDataloaders created:")
print(f"Training batches: {len(train_loader)}")
print(f"Validation batches: {len(val_loader)}")

# Test iteration
print("\nTesting dataloaders...")
train_iter = iter(train_loader)
inputs, targets = next(train_iter)
print(f"Train batch - Inputs shape: {inputs.shape}, Targets shape: {targets.shape}")

val_iter = iter(val_loader)
inputs, targets = next(val_iter)
print(f"Val batch - Inputs shape: {inputs.shape}, Targets shape: {targets.shape}")


Dataset split:
Training samples: 385,243
Validation samples: 67,985

Dataloaders created:
Training batches: 192621
Validation batches: 33993

Testing dataloaders...
Train batch - Inputs shape: torch.Size([2, 128]), Targets shape: torch.Size([2, 128])
Val batch - Inputs shape: torch.Size([2, 128]), Targets shape: torch.Size([2, 128])


###  DEFINING THE CROSS ENTROPY LOSS FUNCTION

In [27]:
print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
print(f"Total samples: {len(train_loader) + len(val_loader)}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model.to(device)


print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"CUDA device name: {torch.cuda.get_device_name(0)}")

Train batches: 192621
Val batches: 33993
Total samples: 226614
Using device: cuda
CUDA available: True
CUDA device count: 1
CUDA device name: Tesla T4


In [28]:
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

# Full calculation
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches



# Subset Calculation
def calc_loss_loader_subset(data_loader, model, device, num_batches=10):
    """Calculate loss on first num_batches only"""
    total_loss = 0.
    count = 0

    with torch.no_grad():
        for i, (inputs, targets) in enumerate(data_loader):
            if i >= num_batches:
                break
            inputs = inputs.to(device)
            targets = targets.to(device)
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), targets.flatten()
            )
            total_loss += loss.item()
            count += 1

    return total_loss / count if count > 0 else 0

# Use subset calculation (much faster!)
with torch.no_grad():
    train_loss = calc_loss_loader_subset(train_loader, model, device, num_batches=10)
    val_loss = calc_loss_loader_subset(val_loader, model, device, num_batches=10)

print(f"Training loss (first 10 batches): {train_loss}")
print(f"Validation loss (first 10 batches): {val_loss}")

Training loss (first 10 batches): 10.990106773376464
Validation loss (first 10 batches): 10.996171188354491


In [29]:
print(device)

cuda


### CHCEK TO MAKE SURE ENOUGH DATASET FOR TRAINING

In [30]:
# Check your dataset size BEFORE training
print(f"\nDataset statistics:")
print(f"Training samples: {len(train_loader.dataset):,}")
print(f"Training batches: {len(train_loader):,}")
print(f"Validation samples: {len(val_loader.dataset):,}")
print(f"Validation batches: {len(val_loader):,}")

# You need AT LEAST 10,000+ samples for meaningful training
# If you have less, load more from HuggingFace:
if len(train_loader.dataset) < 10000:
    print("\n⚠️  WARNING: Dataset too small! Load more samples from HuggingFace")


Dataset statistics:
Training samples: 385,243
Training batches: 192,621
Validation samples: 67,985
Validation batches: 33,993


### TRAINING LOOP FOR THE LLM

In [31]:
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    """
    Evaluate model on train and val sets.
    Note: Caller should handle setting model back to train mode.
    """
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    # Don't call model.train() here - let caller decide
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    """
    Generate and print sample text.
    Note: Caller should handle setting model back to train mode.
    """
    model.eval()
    context_size = model.pos_emb.weight.shape[0]

    try:
        encoded = text_to_token_ids(start_context, tokenizer).to(device)
        with torch.no_grad():
            token_ids = generate_text_simple(
                model=model, idx=encoded,
                max_new_tokens=50, context_size=context_size
            )
        decoded_text = token_ids_to_text(token_ids, tokenizer)
        print(decoded_text.replace("\n", " "))  # Compact print format
    except Exception as e:
        print(f"Error generating sample: {e}")

In [32]:
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer,
                       max_grad_norm=1.0, save_checkpoints=True, checkpoint_path="model_checkpoint.pt",
                       use_amp=True, scheduler=None, gradient_accumulation_steps=1):  # ADD THIS PARAMETER
    """
    Train model following OpenAI best practices.

    New Args:
        gradient_accumulation_steps: Accumulate gradients over N steps (effective batch size = batch_size * N)
    """
    from tqdm import tqdm

    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1
    best_val_loss = float('inf')

    scaler = GradScaler() if use_amp and torch.cuda.is_available() else None

    for epoch in range(num_epochs):
        model.train()
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")

        for batch_idx, (input_batch, target_batch) in enumerate(progress_bar):

            # Forward pass with mixed precision
            if scaler is not None:
                with autocast():
                    loss = calc_loss_batch(input_batch, target_batch, model, device)
                    loss = loss / gradient_accumulation_steps  # SCALE LOSS

                scaler.scale(loss).backward()

                # Only update weights every N steps
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    scaler.unscale_(optimizer)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
                    scaler.step(optimizer)
                    scaler.update()
                    optimizer.zero_grad()

                    # Update learning rate AFTER optimizer step
                    if scheduler is not None:
                        scheduler.step()

                    global_step += 1
            else:
                # Without mixed precision
                loss = calc_loss_batch(input_batch, target_batch, model, device)
                loss = loss / gradient_accumulation_steps  # SCALE LOSS
                loss.backward()

                # Only update weights every N steps
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
                    optimizer.step()
                    optimizer.zero_grad()

                    if scheduler is not None:
                        scheduler.step()

                    global_step += 1

            tokens_seen += input_batch.numel()

            # Update progress bar
            current_lr = optimizer.param_groups[0]['lr']
            progress_bar.set_postfix({
                'loss': f'{(loss.item() * gradient_accumulation_steps):.3f}',  # Unscale for display
                'lr': f'{current_lr:.2e}'
            })

            # Evaluation (only on actual gradient steps)
            if global_step > 0 and global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)

                print(f"\nEp {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}, "
                      f"LR {current_lr:.2e}")

                if save_checkpoints and val_loss < best_val_loss:
                    best_val_loss = val_loss
                    torch.save({
                        'epoch': epoch,
                        'global_step': global_step,
                        'model_state_dict': model.state_dict(),
                        'optimizer_state_dict': optimizer.state_dict(),
                        'train_loss': train_loss,
                        'val_loss': val_loss,
                        'tokens_seen': tokens_seen,
                    }, checkpoint_path)
                    print(f"✓ Saved best checkpoint (val_loss: {val_loss:.3f})")

                model.train()

        # Generate sample after each epoch
        print("\n" + "="*70)
        print("Generated sample:")
        generate_and_print_sample(model, tokenizer, device, start_context)
        print("="*70 + "\n")
        model.train()

    return train_losses, val_losses, track_tokens_seen

In [41]:
import time
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.cuda.amp import autocast, GradScaler

# Set all random seeds for reproducibility
def set_seed(seed=123):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

set_seed(123)

# Configuration following OpenAI best practices
CONFIG = {
    'num_epochs': 20,
    'learning_rate': 3e-3,           # 0.0004
    'weight_decay': 0.01,
    'beta1': 0.9,                     # AdamW beta1 (OpenAI standard)
    'beta2': 0.95,                    # AdamW beta2 (OpenAI uses 0.95 instead of default 0.999)
    'epsilon': 1e-8,
    'max_grad_norm': 1.0,             # Gradient clipping
    'warmup_steps': 100,              # LR warmup steps
    'eval_freq': 200,                 # Evaluate every N steps (not every 5 steps - too frequent)
    'eval_iter': 50,                  # Use more batches for evaluation (not 1)
    'use_amp': torch.cuda.is_available(),  # Automatic Mixed Precision (faster training)
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

# Calculate total steps
total_steps = len(train_loader) * CONFIG['num_epochs']
print(f"\nTotal steps: {total_steps}")
print(f"Warmup ratio: {CONFIG['warmup_steps'] / total_steps * 100:.1f}%")
print(f"Expected evaluations: {total_steps // CONFIG['eval_freq']}")

# CRITICAL FIX: Better LR scheduler
def get_lr_scheduler_fixed(optimizer, warmup_steps, total_steps):
    """
    Fixed learning rate schedule with proper warmup and cosine decay.
    """
    def lr_lambda(current_step):
        # Linear warmup
        if current_step < warmup_steps:
            return float(current_step) / float(max(1, warmup_steps))
        # Cosine decay from 1.0 to 0.1
        progress = float(current_step - warmup_steps) / float(max(1, total_steps - warmup_steps))
        return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

import math

# Initialize model
model = GPTModel(GPT_CONFIG_124M)
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"\nModel parameters: {total_params:,}")

# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=CONFIG['learning_rate'],
    betas=(CONFIG['beta1'], CONFIG['beta2']),
    eps=CONFIG['epsilon'],
    weight_decay=CONFIG['weight_decay']
)

# FIXED: Correct scheduler
scheduler = get_lr_scheduler_fixed(optimizer, CONFIG['warmup_steps'], total_steps)

# Gradient scaler
scaler = GradScaler() if CONFIG['use_amp'] else None

print(f"\nStarting training with:")
print(f"  Initial LR: {CONFIG['learning_rate']}")
print(f"  After warmup: {CONFIG['learning_rate']}")
print(f"  Final LR (min): {CONFIG['learning_rate'] * 0.1}")

Configuration:
  num_epochs: 20
  learning_rate: 0.003
  weight_decay: 0.01
  beta1: 0.9
  beta2: 0.95
  epsilon: 1e-08
  max_grad_norm: 1.0
  warmup_steps: 100
  eval_freq: 200
  eval_iter: 50
  use_amp: True

Total steps: 3852420
Warmup ratio: 0.0%
Expected evaluations: 19262

Model parameters: 6,839,552

Starting training with:
  Initial LR: 0.003
  After warmup: 0.003
  Final LR (min): 0.00030000000000000003


  scaler = GradScaler() if CONFIG['use_amp'] else None


In [42]:
# Start training
start_time = time.time()


# If GPU memory is limited, use gradient accumulation
GRADIENT_ACCUMULATION_STEPS = 2  # Effective batch size = 8 * 2 = 16

try:
    # In your training code:
    train_losses, val_losses, tokens_seen = train_model_simple(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        device=device,
        num_epochs=CONFIG['num_epochs'],
        eval_freq=CONFIG['eval_freq'],
        eval_iter=CONFIG['eval_iter'],
        start_context="He said we came here",
        tokenizer=tokenizer,
        max_grad_norm=CONFIG['max_grad_norm'],
        save_checkpoints=True,
        checkpoint_path="gpt_model_best.pt",
        use_amp=CONFIG['use_amp'],
        scheduler=scheduler,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS  # ADD THIS
    )

    # Save final model
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'config': GPT_CONFIG_124M,
        'train_losses': train_losses,
        'val_losses': val_losses,
        'tokens_seen': tokens_seen,
    }, "gpt_model_final.pt")

    print("\n" + "="*70)
    print("✓ Training completed successfully!")
    print("="*70)

except KeyboardInterrupt:
    print("\n" + "="*70)
    print("Training interrupted by user")
    print("="*70)

    # Save interrupted model
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'config': GPT_CONFIG_124M,
    }, "gpt_model_interrupted.pt")
    print("✓ Saved interrupted model checkpoint")

finally:
    end_time = time.time()
    execution_time_minutes = (end_time - start_time) / 60
    execution_time_hours = execution_time_minutes / 60

    print(f"\nTraining time: {execution_time_minutes:.2f} minutes ({execution_time_hours:.2f} hours)")

    if torch.cuda.is_available():
        print(f"Peak GPU memory: {torch.cuda.max_memory_allocated(device) / 1e9:.2f} GB")

  scaler = GradScaler() if use_amp and torch.cuda.is_available() else None
  with autocast():
Epoch 1/20:   0%|          | 401/192621 [00:18<2:14:14, 23.86it/s, loss=8.078, lr=3.00e-03]


Ep 1 (Step 000200): Train loss 7.669, Val loss 7.961, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.961)


Epoch 1/20:   0%|          | 407/192621 [00:20<11:07:20,  4.80it/s, loss=8.641, lr=3.00e-03]


Ep 1 (Step 000200): Train loss 7.609, Val loss 7.961, LR 3.00e-03


Epoch 1/20:   0%|          | 801/192621 [00:40<2:22:54, 22.37it/s, loss=6.922, lr=3.00e-03]


Ep 1 (Step 000400): Train loss 7.345, Val loss 7.757, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.757)


Epoch 1/20:   0%|          | 807/192621 [00:42<11:34:03,  4.61it/s, loss=6.257, lr=3.00e-03]


Ep 1 (Step 000400): Train loss 7.299, Val loss 7.757, LR 3.00e-03


Epoch 1/20:   1%|          | 1200/192621 [00:59<2:10:37, 24.42it/s, loss=7.324, lr=3.00e-03]


Ep 1 (Step 000600): Train loss 7.251, Val loss 7.614, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.614)


Epoch 1/20:   1%|          | 1205/192621 [01:01<12:44:21,  4.17it/s, loss=7.533, lr=3.00e-03]


Ep 1 (Step 000600): Train loss 7.203, Val loss 7.614, LR 3.00e-03


Epoch 1/20:   1%|          | 1601/192621 [01:20<2:15:18, 23.53it/s, loss=7.681, lr=3.00e-03]


Ep 1 (Step 000800): Train loss 7.113, Val loss 7.662, LR 3.00e-03


Epoch 1/20:   1%|          | 1604/192621 [01:21<16:44:13,  3.17it/s, loss=7.487, lr=3.00e-03]


Ep 1 (Step 000800): Train loss 7.169, Val loss 7.662, LR 3.00e-03


Epoch 1/20:   1%|          | 2002/192621 [01:40<12:54:33,  4.10it/s, loss=7.281, lr=3.00e-03]


Ep 1 (Step 001000): Train loss 7.025, Val loss 7.587, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.587)


Epoch 1/20:   1%|          | 2007/192621 [01:42<12:14:30,  4.33it/s, loss=7.759, lr=3.00e-03]


Ep 1 (Step 001000): Train loss 7.074, Val loss 7.587, LR 3.00e-03


Epoch 1/20:   1%|          | 2402/192621 [02:00<9:01:48,  5.85it/s, loss=7.440, lr=3.00e-03]


Ep 1 (Step 001200): Train loss 7.050, Val loss 7.489, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.489)


Epoch 1/20:   1%|          | 2407/192621 [02:01<10:18:04,  5.13it/s, loss=6.781, lr=3.00e-03]


Ep 1 (Step 001200): Train loss 6.924, Val loss 7.489, LR 3.00e-03


Epoch 1/20:   1%|▏         | 2801/192621 [02:20<2:16:34, 23.17it/s, loss=7.100, lr=3.00e-03]


Ep 1 (Step 001400): Train loss 6.927, Val loss 7.423, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.423)


Epoch 1/20:   1%|▏         | 2807/192621 [02:22<11:28:31,  4.59it/s, loss=7.450, lr=3.00e-03]


Ep 1 (Step 001400): Train loss 7.233, Val loss 7.423, LR 3.00e-03


Epoch 1/20:   2%|▏         | 3202/192621 [02:42<12:58:33,  4.05it/s, loss=7.125, lr=3.00e-03]


Ep 1 (Step 001600): Train loss 7.005, Val loss 7.530, LR 3.00e-03


Epoch 1/20:   2%|▏         | 3207/192621 [02:43<12:28:56,  4.22it/s, loss=7.409, lr=3.00e-03]


Ep 1 (Step 001600): Train loss 7.087, Val loss 7.530, LR 3.00e-03


Epoch 1/20:   2%|▏         | 3601/192621 [03:02<2:16:01, 23.16it/s, loss=7.655, lr=3.00e-03]


Ep 1 (Step 001800): Train loss 6.979, Val loss 7.456, LR 3.00e-03


Epoch 1/20:   2%|▏         | 3607/192621 [03:03<10:47:06,  4.87it/s, loss=6.606, lr=3.00e-03]


Ep 1 (Step 001800): Train loss 6.779, Val loss 7.456, LR 3.00e-03


Epoch 1/20:   2%|▏         | 4002/192621 [03:23<8:12:38,  6.38it/s, loss=7.078, lr=3.00e-03]


Ep 1 (Step 002000): Train loss 6.879, Val loss 7.466, LR 3.00e-03


Epoch 1/20:   2%|▏         | 4007/192621 [03:24<9:48:12,  5.34it/s, loss=7.013, lr=3.00e-03] 


Ep 1 (Step 002000): Train loss 6.936, Val loss 7.466, LR 3.00e-03


Epoch 1/20:   2%|▏         | 4402/192621 [03:43<13:46:59,  3.79it/s, loss=6.321, lr=3.00e-03]


Ep 1 (Step 002200): Train loss 6.783, Val loss 7.442, LR 3.00e-03


Epoch 1/20:   2%|▏         | 4407/192621 [03:45<12:39:30,  4.13it/s, loss=7.163, lr=3.00e-03]


Ep 1 (Step 002200): Train loss 6.779, Val loss 7.442, LR 3.00e-03


Epoch 1/20:   2%|▏         | 4801/192621 [04:03<2:17:04, 22.84it/s, loss=6.882, lr=3.00e-03]


Ep 1 (Step 002400): Train loss 6.850, Val loss 7.343, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.343)


Epoch 1/20:   2%|▏         | 4806/192621 [04:05<13:03:21,  4.00it/s, loss=6.832, lr=3.00e-03]


Ep 1 (Step 002400): Train loss 6.990, Val loss 7.343, LR 3.00e-03


Epoch 1/20:   3%|▎         | 5200/192621 [04:25<2:16:16, 22.92it/s, loss=7.169, lr=3.00e-03]


Ep 1 (Step 002600): Train loss 6.892, Val loss 7.391, LR 3.00e-03


Epoch 1/20:   3%|▎         | 5206/192621 [04:26<10:51:25,  4.79it/s, loss=7.017, lr=3.00e-03]


Ep 1 (Step 002600): Train loss 6.740, Val loss 7.391, LR 3.00e-03


Epoch 1/20:   3%|▎         | 5601/192621 [04:46<2:35:13, 20.08it/s, loss=7.372, lr=3.00e-03]


Ep 1 (Step 002800): Train loss 6.804, Val loss 7.344, LR 3.00e-03


Epoch 1/20:   3%|▎         | 5607/192621 [04:47<11:34:23,  4.49it/s, loss=6.892, lr=3.00e-03]


Ep 1 (Step 002800): Train loss 6.792, Val loss 7.344, LR 3.00e-03


Epoch 1/20:   3%|▎         | 6002/192621 [05:06<9:23:08,  5.52it/s, loss=6.079, lr=3.00e-03]


Ep 1 (Step 003000): Train loss 6.820, Val loss 7.323, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.323)


Epoch 1/20:   3%|▎         | 6006/192621 [05:08<12:51:43,  4.03it/s, loss=6.276, lr=3.00e-03]


Ep 1 (Step 003000): Train loss 6.751, Val loss 7.323, LR 3.00e-03


Epoch 1/20:   3%|▎         | 6402/192621 [05:27<8:18:04,  6.23it/s, loss=6.063, lr=3.00e-03]


Ep 1 (Step 003200): Train loss 6.736, Val loss 7.352, LR 3.00e-03


Epoch 1/20:   3%|▎         | 6407/192621 [05:28<9:45:39,  5.30it/s, loss=6.428, lr=3.00e-03] 


Ep 1 (Step 003200): Train loss 6.777, Val loss 7.352, LR 3.00e-03


Epoch 1/20:   4%|▎         | 6800/192621 [05:47<2:18:53, 22.30it/s, loss=6.806, lr=3.00e-03]


Ep 1 (Step 003400): Train loss 6.811, Val loss 7.290, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.290)


Epoch 1/20:   4%|▎         | 6806/192621 [05:49<11:20:19,  4.55it/s, loss=6.656, lr=3.00e-03]


Ep 1 (Step 003400): Train loss 6.728, Val loss 7.290, LR 3.00e-03


Epoch 1/20:   4%|▎         | 7200/192621 [06:07<2:13:36, 23.13it/s, loss=6.259, lr=3.00e-03]


Ep 1 (Step 003600): Train loss 6.682, Val loss 7.276, LR 3.00e-03


Epoch 1/20:   4%|▎         | 7200/192621 [06:08<2:13:36, 23.13it/s, loss=6.276, lr=3.00e-03]

✓ Saved best checkpoint (val_loss: 7.276)


Epoch 1/20:   4%|▎         | 7205/192621 [06:10<14:37:47,  3.52it/s, loss=6.995, lr=3.00e-03]


Ep 1 (Step 003600): Train loss 6.764, Val loss 7.276, LR 3.00e-03


Epoch 1/20:   4%|▍         | 7601/192621 [06:29<2:19:10, 22.16it/s, loss=6.422, lr=3.00e-03]


Ep 1 (Step 003800): Train loss 6.792, Val loss 7.268, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.268)


Epoch 1/20:   4%|▍         | 7607/192621 [06:31<11:11:37,  4.59it/s, loss=7.164, lr=3.00e-03]


Ep 1 (Step 003800): Train loss 6.578, Val loss 7.268, LR 3.00e-03


Epoch 1/20:   4%|▍         | 8000/192621 [06:49<2:13:03, 23.13it/s, loss=7.730, lr=3.00e-03]


Ep 1 (Step 004000): Train loss 6.687, Val loss 7.231, LR 3.00e-03
✓ Saved best checkpoint (val_loss: 7.231)


Epoch 1/20:   4%|▍         | 8006/192621 [06:52<11:09:08,  4.60it/s, loss=6.967, lr=3.00e-03]


Ep 1 (Step 004000): Train loss 6.760, Val loss 7.231, LR 3.00e-03


Epoch 1/20:   4%|▍         | 8401/192621 [07:10<2:42:15, 18.92it/s, loss=6.646, lr=3.00e-03]


Ep 1 (Step 004200): Train loss 6.711, Val loss 7.086, LR 3.00e-03


Epoch 1/20:   4%|▍         | 8401/192621 [07:12<2:42:15, 18.92it/s, loss=8.508, lr=3.00e-03]

✓ Saved best checkpoint (val_loss: 7.086)


Epoch 1/20:   4%|▍         | 8406/192621 [07:13<13:51:22,  3.69it/s, loss=7.185, lr=3.00e-03]


Ep 1 (Step 004200): Train loss 6.747, Val loss 7.086, LR 3.00e-03


Epoch 1/20:   4%|▍         | 8637/192621 [07:23<2:37:35, 19.46it/s, loss=6.570, lr=3.00e-03]



Training interrupted by user
✓ Saved interrupted model checkpoint

Training time: 7.40 minutes (0.12 hours)
Peak GPU memory: 0.49 GB
