# Sweep Framework Documentation

This notebook documents the components of a modular sweep framework for text classification.

We'll explore each module (configuration, dataset, metrics, loss, training, sweeps, and reporting) and show how they work together.

The goal: make experimentation reproducible, extensible, and easy to analyze.

## Model Configuration

### Purpose

Defines the configuration object for a model run.  Stores hyperparameters and provides builder methods for the model, optimizer, and scheduler.

### Class: `ModelConfig`

#### Attributes

`run_type: str` 

Specifies the type of recurrent model to build.

* `LSTM` is a Long Short-Term Memory network, good for handling long-range dependencies.
* `GRU` is a Gated Recurrent Unit, simpler and faster, but often comparable in performance to an LSTM.

`num_layers: int`

The number of stacked recurrent layers.

* `1` generates a single-layer RNN.
* `>1` generates a deeper RNN, potentially capturing more complex patterns but increasing training time.

`hidden_dim: int` 

Dimensionality of the hidden state of the RNN.

* Larger values add capacity but raise the risk of overfitting and slow down training.
* Typical range: 64-512.

`bidirectional: bool` 

Specifies whether to use a bidirectional RNN.

* `True` processes sequences both forward and backward, which is useful for tasks where context on both sides matters, e.g. sentiment analysis.
* `False` uses a standard forward-only RNN.

`dropout: float` 

Sets the dropout probability applied between layers of the RNN.

* Range: `0.0-1.0`
* Example: `0.3` means 30% of units are randomly dropped during training to reduce overfitting.
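
To make the `0.3` example concrete, here is a minimal pure-Python sketch of inverted dropout, the scheme PyTorch's dropout layers follow: surviving units are scaled by `1 / (1 - p)` during training so the expected activation stays the same, and nothing is dropped at evaluation time. The function name `inverted_dropout` is illustrative and not part of the framework.

```python
import random


def inverted_dropout(values, p, rng, training=True):
    """Zero each value with probability p and rescale the survivors."""
    # p = 1.0 is excluded in this sketch to avoid dividing by zero below.
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0.0, 1.0)")
    if not training or p == 0.0:
        return list(values)  # evaluation mode: values pass through unchanged
    scale = 1.0 / (1.0 - p)  # keeps the expected value of each unit the same
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
activations = [1.0] * 1000
dropped = inverted_dropout(activations, p=0.3, rng=rng)
print(sum(v == 0.0 for v in dropped))  # roughly 300 of the 1000 units are zeroed
```
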

`embedding_dim: int` 

Sets the size of word embeddings.

* Determines how each token is represented numerically.
* Typical values: 50, 100, 300 (GloVe) or 768 (BERT).

`vocab_size: int` 

Number of unique tokens in the vocabulary (must be set before building a model).

* Must match the tokenizer's vocabulary size.
* Used to initialize the embedding layer.
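
The attributes above jointly determine model size. The sketch below estimates parameter counts, assuming the framework builds on PyTorch's `nn.Embedding` and `nn.LSTM` / `nn.GRU` with default settings (biases enabled, no projection). The helper `rnn_param_count` is hypothetical. The last line assumes the classifier consumes the concatenated forward and backward states, which the source does not confirm.

```python
def rnn_param_count(run_type, input_dim, hidden_dim, num_layers, bidirectional):
    """Count weights and biases using the nn.LSTM / nn.GRU parameter layout."""
    gates = {"LSTM": 4, "GRU": 3}[run_type]  # gate blocks per recurrent cell
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the concatenated directional outputs.
        layer_input = input_dim if layer == 0 else hidden_dim * directions
        per_direction = gates * (
            layer_input * hidden_dim    # input-to-hidden weights
            + hidden_dim * hidden_dim   # hidden-to-hidden weights
            + 2 * hidden_dim            # two bias vectors
        )
        total += per_direction * directions
    return total


embedding_params = 30522 * 100  # vocab_size * embedding_dim
rnn_params = rnn_param_count(
    "LSTM", input_dim=100, hidden_dim=128, num_layers=2, bidirectional=True
)
classifier_in = 128 * 2  # hidden_dim * directions (assumed classifier input size)
print(embedding_params, rnn_params, classifier_in)
```

With a large vocabulary, the embedding table can account for most of the parameters, so `vocab_size` and `embedding_dim` are worth including when comparing runs.
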

`num_classes: int` 

Specifies the number of output classes.

* For SST-2: `2` (negative, positive)
* For SST-5: `5` (very negative, negative, neutral, positive, very positive)

`learning_rate: float` 

Step size for optimizer updates.

* Typical values: `1e-3` (Adam), `1e-2` (SGD).
* Too high leads to unstable training; too low leads to slow convergence.
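
The trade-off in the last bullet can be seen on a toy problem. This sketch minimizes `f(w) = w**2` with plain gradient descent. It is only an illustration of step-size behavior, not the framework's training loop, and `run_gradient_descent` is a hypothetical helper.

```python
def run_gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w -= learning_rate * grad
    return w


# 1.1 overshoots and diverges, 0.1 converges quickly, 0.001 barely moves.
for lr in (1.1, 0.1, 0.001):
    print(lr, run_gradient_descent(lr))
```
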

`optimizer_type: str` 

Specifies the optimizer type (`Adam`, `SGD`, `AdamW`).

* `Adam` provides adaptive learning rates and is a good default.
* `SGD` is stochastic gradient descent; it usually requires more tuning.
* `AdamW` is `Adam` with decoupled weight decay; often better for transformers.

`num_epochs: int`

Maximum number of training epochs.

* Each epoch equals one full pass through the training set.
* Early stopping may halt training before this.

`patience: int` 

Number of epochs to wait for improvement before early stopping.

* Example: `patience = 2` stops training if the validation metric does not improve for 2 consecutive epochs.
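
A minimal sketch of the patience rule follows. It assumes that a higher validation score is better and that only a strict improvement resets the counter. The framework's actual criterion is not shown in this section, and `epochs_until_stop` is a hypothetical helper.

```python
def epochs_until_stop(val_scores, patience, max_epochs=None):
    """Return how many epochs run before early stopping halts training."""
    best = float("-inf")
    epochs_without_improvement = 0
    limit = len(val_scores) if max_epochs is None else min(max_epochs, len(val_scores))
    for epoch in range(limit):
        if val_scores[epoch] > best:
            best = val_scores[epoch]
            epochs_without_improvement = 0  # strict improvement resets the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch + 1  # stop after this epoch
    return limit  # ran to the epoch cap without triggering


scores = [0.80, 0.84, 0.83, 0.84, 0.86]
print(epochs_until_stop(scores, patience=2))
```

Here training stops after the fourth epoch: the third epoch is worse than the best score and the fourth only ties it, so the run never reaches the stronger fifth epoch. A larger `patience` tolerates such plateaus at the cost of longer runs.
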

`run_group: str` 

Optional label for grouping runs in sweeps.

* Useful for organizing experiments, e.g. `lstm_baseline` versus `gru_baseline`.

#### Methods

#### `build_model`

Returns a PyTorch model configured with embeddings, RNN, and classifier.

#### Parameters

None (uses attributes).

#### Outputs

PyTorch model (`nn.Module`) configured with embeddings, RNN, and classifier.

#### Notes

Requires `vocab_size` and `num_classes` to be set.

#### `build_optimizer(parameters)`

Returns an optimizer instance bound to the model parameters.

#### Parameters

`parameters`: iterable of model parameters.

#### Outputs

Optimizer instance (`torch.optim.Adam`, `SGD`, `AdamW`).

#### `build_scheduler(optimizer)`

Returns a learning rate scheduler.

#### Parameters

`optimizer`: optimizer instance.

#### Outputs

Learning rate scheduler (`torch.optim.lr_scheduler.StepLR`).
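
`StepLR` multiplies the learning rate by `gamma` every `step_size` epochs. The sketch below reproduces that schedule in plain Python. The values `step_size=2` and `gamma=0.5` are assumptions chosen for illustration; the values the framework actually passes are not stated here.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate after `epoch` scheduler steps under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Illustrative settings only: halve the rate every 2 epochs.
schedule = [step_lr(1e-3, epoch, step_size=2, gamma=0.5) for epoch in range(6)]
print(schedule)
```
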

#### `to_dict`

Returns a dictionary of all configuration attributes for logging and reporting.

#### Parameters

None.

#### Outputs

Dictionary of all configuration attributes for logging and reporting.
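
One common way to get this behavior is a dataclass plus `dataclasses.asdict`. The sketch below uses a hypothetical `ExampleConfig` with a subset of the attributes described above. Whether `ModelConfig` itself is implemented as a dataclass is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for ModelConfig with a subset of its attributes."""

    run_type: str = "LSTM"
    num_layers: int = 2
    hidden_dim: int = 128
    dropout: float = 0.3
    learning_rate: float = 1e-3
    run_group: Optional[str] = None

    def to_dict(self):
        # asdict returns a plain dict that loggers can serialize directly.
        return asdict(self)


record = ExampleConfig(run_type="GRU", run_group="gru_baseline").to_dict()
print(record)
```
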

### Example

We define a configuration for an LSTM model with 2 layers, hidden size 128, dropout 0.3, and Adam optimizer.

```python
from sweep_framework.config.model_config import ModelConfig

config = ModelConfig(
    run_type="LSTM",
    num_layers=2,
    hidden_dim=128,
    bidirectional=True,
    dropout=0.3,
    embedding_dim=100,
    vocab_size=30522,   # from tokenizer
    num_classes=2,      # SST-2 is binary
    learning_rate=1e-3,
    optimizer_type="Adam",
    num_epochs=3,
    patience=2
)

print(config.to_dict())

# Build model, optimizer, scheduler
model = config.build_model()
optimizer = config.build_optimizer(model.parameters())
scheduler = config.build_scheduler(optimizer)

print(model)
```

## Data

### Purpose

The `Dataset` class manages raw examples, stratified splits, and PyTorch DataLoaders. It acts as the bridge between Hugging Face datasets (or any `(text, label)` pairs) and the training loop.

### Class: `Dataset`

#### Attributes

`examples: List[Tuple[str, int]]`

Raw dataset examples as `(text, label)` pairs.  Labels must be integers in `[0, num_classes - 1]`.

Note: filter out invalid labels, e.g. `-1` in the SST-2 test set.
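
The filtering step in the note can be sketched as a simple list comprehension. The helper `filter_valid_examples` is hypothetical and not part of the `Dataset` API.

```python
def filter_valid_examples(examples, num_classes):
    """Keep only (text, label) pairs whose label is an int in [0, num_classes - 1]."""
    return [
        (text, label)
        for text, label in examples
        if isinstance(label, int) and 0 <= label < num_classes
    ]


raw = [("a charming film", 1), ("unlabeled test sentence", -1), ("flat and dull", 0)]
clean = filter_valid_examples(raw, num_classes=2)
print(clean)
```
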

`train_examples: List[Tuple[str, int]]`

Subset of examples used for training.

`val_examples: List[Tuple[str, int]]`

Subset of examples used for validation.

`test_examples: List[Tuple[str, int]]`

Subset of examples used for testing.

`train_loader: DataLoader`

PyTorch DataLoader for training set.

`val_loader: DataLoader`

PyTorch DataLoader for validation set.

`test_loader: DataLoader`

PyTorch DataLoader for test set.

#### Methods

#### `__init__(self, examples: List[Tuple[str, int]])`

#### Parameters

`examples`: List of `(text, label)` pairs.

#### Outputs

Initializes dataset with raw examples.

#### `stratify_split(self, train_ratio = 0.8, val_ratio = 0.1, test_ratio = 0.1, seed = 42)`

#### Parameters

`train_ratio`: Proportion of examples for training.

`val_ratio`: Proportion for validation.

`test_ratio`: Proportion for testing.

`seed`: Random seed for reproducibility.
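
As a rough illustration of how these parameters interact, the following standalone sketch splits `(text, label)` pairs per label so that each subset keeps approximately the same class mix. It is not the framework's implementation: the function `stratified_split`, the requirement that the ratios sum to 1.0, and the choice to let the test split absorb rounding remainders are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(examples, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Split (text, label) pairs so each subset keeps roughly the same label mix."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    rng = random.Random(seed)  # local generator: same seed, same split
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    train, val, test = [], [], []
    for label in sorted(by_label):  # fixed label order keeps the split deterministic
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * train_ratio)
        n_val = int(len(group) * val_ratio)
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder absorbs rounding
    return train, val, test


examples = [(f"text {i}", i % 2) for i in range(100)]
train, val, test = stratified_split(examples)
print(len(train), len(val), len(test))
```
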