<a href="https://colab.research.google.com/github/g24ait127/MLOps-Jan2025/blob/main/colabs/intro/Intro_to_Weights_%26_Biases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/intro/Intro_to_Weights_&_Biases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{intro-colab} -->

<img src="http://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />

Use [W&B](https://wandb.ai/site?utm_source=intro_colab&utm_medium=code&utm_campaign=intro) for machine learning experiment tracking, model checkpointing, collaboration with your team and more. See the full W&B Documentation [here](https://docs.wandb.ai/).

In this notebook, you will create and track a machine learning experiment using a simple PyTorch model. By the end of the notebook, you will have an interactive project dashboard that you can share and customize with other members of your team. [View an example dashboard here](https://wandb.ai/wandb/wandb_example).

## Prerequisites

Install the W&B Python SDK and log in:

In [1]:
!pip install wandb -qU

In [2]:
# Log in to your W&B account
import wandb
import random
import math

In [3]:
wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mg24ait127[0m ([33mg24ait127-abbott[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Simulate and track a machine learning experiment with W&B

Create, track, and visualize a machine learning experiment. To do this:

1. Initialize a [W&B run](https://docs.wandb.ai/guides/runs) and pass in the hyperparameters you want to track.
2. Within your training loop, log metrics such as the accuracy and loss.

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import wandb


In [5]:
import random
import math

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
  # 1️. Start a new run to track this script
  wandb.init(
      # Set the project where this run will be logged
      project="basic-intro",
      # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
      name=f"experiment_{run}",
      # Track hyperparameters and run metadata
      config={
      "learning_rate": 0.02,
      "architecture": "CNN",
      "dataset": "CIFAR-100",
      "epochs": 10,
      })

  # This simple block simulates a training loop logging metrics
  epochs = 10
  offset = random.random() / 5
  for epoch in range(2, epochs):
      acc = 1 - 2 ** -epoch - random.random() / epoch - offset
      loss = 2 ** -epoch + random.random() / epoch + offset

      # 2️. Log metrics from your script to W&B
      wandb.log({"acc": acc, "loss": loss})

  # Mark the run as finished
  wandb.finish()

Run history:
acc,▁▅▆▇▅█▇▆
loss,█▄▃▅▂▁▂▁

Run summary:
acc,0.8659
loss,0.04363


Run history:
acc,▁▅▅███▇▇
loss,█▃▃▃▁▂▁▂

Run summary:
acc,0.88269
loss,0.13727


Run history:
acc,▁▆▇▆▇▆▇█
loss,█▄▂▅▃▂▁▂

Run summary:
acc,0.96408
loss,0.08672


Run history:
acc,▁▃▂█▇▇██
loss,█▆▇▄▁▁▃▁

Run summary:
acc,0.8203
loss,0.16191


Run history:
acc,▁▄▃▇▆▇▇█
loss,▇▆█▂▃▁▃▃

Run summary:
acc,0.9298
loss,0.09507
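
To inspect what the simulated loop produces without starting a W&B run, the two metric formulas can be pulled into a plain function. This is only a sketch: `simulate_metrics` and its `rng` parameter are hypothetical names introduced here, and the seed is arbitrary.

```python
import random


def simulate_metrics(epochs=10, rng=None):
    """Return the per-epoch dicts that the simulated loop passes to wandb.log."""
    rng = rng or random.Random()
    offset = rng.random() / 5  # per-run shift in [0, 0.2)
    history = []
    # Start at epoch 2, as in the loop above, so `epoch` is never zero in the divisions
    for epoch in range(2, epochs):
        acc = 1 - 2 ** -epoch - rng.random() / epoch - offset
        loss = 2 ** -epoch + rng.random() / epoch + offset
        history.append({"acc": acc, "loss": loss})
    return history


history = simulate_metrics(epochs=10, rng=random.Random(0))
print(len(history))  # 8 logged steps, one per epoch in range(2, 10)
```

Each run therefore logs eight points per metric. The noise term shrinks as `epoch` grows, which is why the curves tend to settle toward the end of a run.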


View how your machine learning experiment performed in your W&B project. Copy and paste the URL that the previous cell printed. The URL takes you to a W&B project with a dashboard of graphs showing how the logged metrics, such as `acc` and `loss`, changed over each run.

The following image shows what a dashboard can look like:

![](https://i.imgur.com/Pell4Oo.png)

Now that we know how to integrate W&B into a pseudo machine learning training loop, let's track a machine learning experiment using a basic PyTorch neural network. The following code will also upload model checkpoints to W&B that you can then share with other teams in your organization.

## Track a machine learning experiment using PyTorch

The following code cell defines and trains a simple MNIST classifier. During training, W&B prints URLs. Click the project page link to watch your results stream live into a W&B project.

W&B runs automatically log [metrics](https://docs.wandb.ai/ref/app/pages/run-page#charts-tab),
[system information](https://docs.wandb.ai/ref/app/pages/run-page#system-tab),
[hyperparameters](https://docs.wandb.ai/ref/app/pages/run-page#overview-tab), and
[terminal output](https://docs.wandb.ai/ref/app/pages/run-page#logs-tab).
You'll also see an [interactive table](https://docs.wandb.ai/guides/data-vis)
with model inputs and outputs.

### Set up PyTorch Dataloader
The following cell defines some useful functions that we will need to train our machine learning model. The functions themselves are not unique to W&B, so we won't cover them in detail here. See the PyTorch documentation for more information on how to define a [forward and backward training loop](https://pytorch.org/tutorials/beginner/nn_tutorial.html), how to use [PyTorch DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to load data for training, and how to define PyTorch models using the [`torch.nn.Sequential` class](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html).

In [6]:
#@title
import torch, torchvision
import torch.nn as nn
from torchvision.datasets import MNIST
import torchvision.transforms as T

MNIST.mirrors = [mirror for mirror in MNIST.mirrors if "http://yann.lecun.com/" not in mirror]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def get_dataloader(is_train, batch_size, slice=5):
    "Get a training or validation dataloader over every `slice`-th sample of MNIST"
    full_dataset = MNIST(root=".", train=is_train, transform=T.ToTensor(), download=True)
    sub_dataset = torch.utils.data.Subset(full_dataset, indices=range(0, len(full_dataset), slice))
    loader = torch.utils.data.DataLoader(dataset=sub_dataset,
                                         batch_size=batch_size,
                                         shuffle=is_train,
                                         pin_memory=True, num_workers=2)
    return loader

def get_model(dropout):
    "A simple model"
    model = nn.Sequential(nn.Flatten(),
                         nn.Linear(28*28, 256),
                         nn.BatchNorm1d(256),
                         nn.ReLU(),
                         nn.Dropout(dropout),
                         nn.Linear(256,10)).to(device)
    return model

def validate_model(model, valid_dl, loss_func, log_images=False, batch_idx=0):
    "Compute performance of the model on the validation dataset and log a wandb.Table"
    model.eval()
    val_loss = 0.
    with torch.inference_mode():
        correct = 0
        for i, (images, labels) in enumerate(valid_dl):
            images, labels = images.to(device), labels.to(device)

            # Forward pass ➡
            outputs = model(images)
            val_loss += loss_func(outputs, labels)*labels.size(0)

            # Compute accuracy and accumulate
            _, predicted = torch.max(outputs.data, 1)
            correct += (predicted == labels).sum().item()

            # Log one batch of images to the dashboard, always same batch_idx.
            if i==batch_idx and log_images:
                log_image_table(images, predicted, labels, outputs.softmax(dim=1))
    return val_loss / len(valid_dl.dataset), correct / len(valid_dl.dataset)

### Create a table to compare the predicted values with the true values

The following cell is unique to W&B, so let's go over it.

In the cell we define a function called `log_image_table`. Though technically optional, this function creates a W&B Table object. We will use the table object to create a table that shows what the model predicted for each image.

More specifically, each row will consist of the image fed to the model, along with the predicted value, the actual value (label), and the model's score for each of the 10 classes.

In [7]:
def log_image_table(images, predicted, labels, probs):
    "Log a wandb.Table with (img, pred, target, scores)"
    # Create a wandb Table to log images, labels and predictions to
    table = wandb.Table(columns=["image", "pred", "target"]+[f"score_{i}" for i in range(10)])
    for img, pred, targ, prob in zip(images.to("cpu"), predicted.to("cpu"), labels.to("cpu"), probs.to("cpu")):
        table.add_data(wandb.Image(img[0].numpy()*255), pred, targ, *prob.numpy())
    wandb.log({"predictions_table":table}, commit=False)
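
To make the row layout concrete, here is a sketch in plain Python that builds one row with the same columns as `log_image_table`. It needs neither W&B nor PyTorch. The logits, the `softmax` helper, and the `"<image>"` placeholder are made up for illustration.

```python
import math

# Same column layout as log_image_table: three fixed columns plus one score per digit class
columns = ["image", "pred", "target"] + [f"score_{i}" for i in range(10)]


def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image whose true label is 7
logits = [0.1, -1.2, 0.3, 0.0, -0.5, 0.2, -0.8, 4.0, 0.4, 1.1]
probs = softmax(logits)
pred = probs.index(max(probs))

# "<image>" stands in for the wandb.Image object used in the real function
row = ["<image>", pred, 7, *probs]
print(dict(zip(columns[:3], row[:3])))  # {'image': '<image>', 'pred': 7, 'target': 7}
```

Keeping one score column per class lets you sort or filter the table in the W&B UI by how confident the model was in any given digit.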

### Train your model and upload checkpoints

The following code trains and saves model checkpoints to your project. Use model checkpoints like you normally would to assess how the model performed during training.

W&B also makes it easy to share your saved models and model checkpoints with other members of your team or organization. To learn how to share your model and model checkpoints with members outside of your team, see [W&B Registry](https://docs.wandb.ai/guides/registry).

In [8]:
# Launch 3 experiments, trying different dropout rates
for _ in range(3):
    # initialise a wandb run
    wandb.init(
        project="pytorch-intro",
        config={
            "epochs": 5,
            "batch_size": 128,
            "lr": 1e-3,
            "dropout": random.uniform(0.01, 0.80),
            })

    # Copy your config
    config = wandb.config

    # Get the data
    train_dl = get_dataloader(is_train=True, batch_size=config.batch_size)
    valid_dl = get_dataloader(is_train=False, batch_size=2*config.batch_size)
    n_steps_per_epoch = math.ceil(len(train_dl.dataset) / config.batch_size)

    # A simple MLP model
    model = get_model(config.dropout)

    # Make the loss and optimizer
    loss_func = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)

    # Training
    example_ct = 0
    step_ct = 0
    for epoch in range(config.epochs):
        model.train()
        for step, (images, labels) in enumerate(train_dl):
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            train_loss = loss_func(outputs, labels)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()

            example_ct += len(images)
            metrics = {"train/train_loss": train_loss,
                       "train/epoch": (step + 1 + (n_steps_per_epoch * epoch)) / n_steps_per_epoch,
                       "train/example_ct": example_ct}

            if step + 1 < n_steps_per_epoch:
                # Log train metrics to wandb
                wandb.log(metrics)

            step_ct += 1

        val_loss, accuracy = validate_model(model, valid_dl, loss_func, log_images=(epoch==(config.epochs-1)))

        # Log train and validation metrics to wandb
        val_metrics = {"val/val_loss": val_loss,
                       "val/val_accuracy": accuracy}
        wandb.log({**metrics, **val_metrics})

        # Save the model checkpoint to wandb
        torch.save(model, "my_model.pt")
        wandb.log_model("./my_model.pt", "my_mnist_model", aliases=[f"epoch-{epoch+1}_dropout-{round(wandb.config.dropout, 4)}"])

        print(f"Epoch: {epoch+1}, Train Loss: {train_loss:.3f}, Valid Loss: {val_loss:.3f}, Accuracy: {accuracy:.2f}")

    # If you had a test set, this is how you could log it as a Summary metric
    wandb.summary['test_accuracy'] = 0.8

    # Close your wandb run
    wandb.finish()

Epoch: 1, Train Loss: 0.397, Valid Loss: 0.306698, Accuracy: 0.91
Epoch: 2, Train Loss: 0.171, Valid Loss: 0.258912, Accuracy: 0.93
Epoch: 3, Train Loss: 0.203, Valid Loss: 0.227712, Accuracy: 0.93
Epoch: 4, Train Loss: 0.178, Valid Loss: 0.209507, Accuracy: 0.94
Epoch: 5, Train Loss: 0.286, Valid Loss: 0.196954, Accuracy: 0.94


Run history:
train/epoch,▁▁▁▁▁▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▅▅▅▅▅▆▆▆▆▇▇▇███
train/example_ct,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇▇█
train/train_loss,█▇▇▇▅▅▅▅▂▃▃▂▂▂▃▄▂▄▂▂▁▂▃▁▂▂▁▁▃▁▂▂▂▂▁▁▂▂▁▁
val/val_accuracy,▁▅▆█▇
val/val_loss,█▅▃▂▁

Run summary:
test_accuracy,0.8
train/epoch,5.0
train/example_ct,60000.0
train/train_loss,0.28647
val/val_accuracy,0.936
val/val_loss,0.19695


Epoch: 1, Train Loss: 0.306, Valid Loss: 0.286804, Accuracy: 0.92
Epoch: 2, Train Loss: 0.322, Valid Loss: 0.243012, Accuracy: 0.92
Epoch: 3, Train Loss: 0.220, Valid Loss: 0.205662, Accuracy: 0.94
Epoch: 4, Train Loss: 0.171, Valid Loss: 0.188443, Accuracy: 0.94
Epoch: 5, Train Loss: 0.201, Valid Loss: 0.179674, Accuracy: 0.94


Run history:
train/epoch,▁▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇█
train/example_ct,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▆▆▆▆▆▆▆▇▇▇▇████
train/train_loss,█▄▄▃▂▃▃▂▃▃▂▂▃▃▂▂▁▂▂▂▂▁▂▂▁▂▁▂▂▂▁▁▂▁▁▁▂▂▁▁
val/val_accuracy,▁▂▇█▇
val/val_loss,█▅▃▂▁

Run summary:
test_accuracy,0.8
train/epoch,5.0
train/example_ct,60000.0
train/train_loss,0.2006
val/val_accuracy,0.9415
val/val_loss,0.17967


Epoch: 1, Train Loss: 0.388, Valid Loss: 0.293546, Accuracy: 0.92
Epoch: 2, Train Loss: 0.158, Valid Loss: 0.235513, Accuracy: 0.93
Epoch: 3, Train Loss: 0.266, Valid Loss: 0.206807, Accuracy: 0.94
Epoch: 4, Train Loss: 0.335, Valid Loss: 0.188292, Accuracy: 0.94
Epoch: 5, Train Loss: 0.124, Valid Loss: 0.185726, Accuracy: 0.94


Run history:
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇▇▇████
train/example_ct,▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇█████
train/train_loss,██▄▃▃▃▃▂▃▂▂▂▃▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▁
val/val_accuracy,▁▄▇▆█
val/val_loss,█▄▂▁▁

Run summary:
test_accuracy,0.8
train/epoch,5.0
train/example_ct,60000.0
train/train_loss,0.12386
val/val_accuracy,0.9425
val/val_loss,0.18573
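
The `train/epoch` and `train/example_ct` values in the summaries above follow directly from the dataset and batch sizes. The arithmetic can be checked in plain Python. This is a sketch: `fractional_epoch` is a hypothetical helper that repeats the expression from the training loop, and 60,000 is the size of the standard MNIST training split.

```python
import math

MNIST_TRAIN_SIZE = 60_000  # size of the standard MNIST training split
slice_step = 5             # default value of `slice` in get_dataloader
batch_size = 128
epochs = 5

# Subset keeps indices 0, 5, 10, ..., so one sample in five
subset_size = len(range(0, MNIST_TRAIN_SIZE, slice_step))
n_steps_per_epoch = math.ceil(subset_size / batch_size)


def fractional_epoch(step, epoch):
    """Same expression as the "train/epoch" metric in the training loop."""
    return (step + 1 + n_steps_per_epoch * epoch) / n_steps_per_epoch


print(subset_size)              # 12000
print(n_steps_per_epoch)        # 94
print(fractional_epoch(46, 0))  # 0.5, halfway through the first epoch
print(fractional_epoch(93, 4))  # 5.0, last step of the last epoch
print(subset_size * epochs)     # 60000, the final train/example_ct
```

Logging a fractional epoch gives training and validation metrics a shared x-axis, even though validation is logged only once per epoch.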


In [9]:
wandb.login()
wandb.init(project="MLOps2025_7", name="image_classification")


In [10]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

train_dataset = torchvision.datasets.FashionMNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.FashionMNIST(root="./data", train=False, download=True, transform=transform)

train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_data, val_data = torch.utils.data.random_split(train_dataset, [train_size, val_size])


In [11]:
class SimpleNN(nn.Module):
    def __init__(self, hidden_units=128):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28*28, hidden_units)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_units, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleNN()


In [12]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [13]:
def train_model(model, train_loader, val_loader, epochs=5):
    for epoch in range(epochs):
        model.train()
        train_loss, correct, total = 0, 0, 0

        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            _, predicted = outputs.max(1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

        train_acc = 100 * correct / total
        val_acc, val_loss = evaluate_model(model, val_loader)

        wandb.log({"Train Loss": train_loss / len(train_loader), "Train Accuracy": train_acc,
                   "Validation Loss": val_loss, "Validation Accuracy": val_acc})

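        # Note: train_loss is the sum over all batches here; the value logged to W&B above is the per-batch mean
        print(f"Epoch {epoch+1}/{epochs} - Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%, Val Acc: {val_acc:.2f}%")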

def evaluate_model(model, val_loader):
    model.eval()
    val_loss, correct, total = 0, 0, 0

    with torch.no_grad():
        for images, labels in val_loader:
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item()

            _, predicted = outputs.max(1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

    return 100 * correct / total, val_loss / len(val_loader)

# Set initial batch size
batch_size = 64
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size, shuffle=False)

train_model(model, train_loader, val_loader, epochs=5)


Epoch 1/5 - Train Loss: 388.7733, Train Acc: 81.47%, Val Acc: 83.83%
Epoch 2/5 - Train Loss: 294.3023, Train Acc: 85.92%, Val Acc: 83.97%
Epoch 3/5 - Train Loss: 264.8664, Train Acc: 87.12%, Val Acc: 85.90%
Epoch 4/5 - Train Loss: 245.6521, Train Acc: 87.95%, Val Acc: 87.21%
Epoch 5/5 - Train Loss: 228.9844, Train Acc: 88.81%, Val Acc: 87.58%


In [14]:
sweep_config = {
    "method": "grid",
    "metric": {"goal": "maximize", "name": "Validation Accuracy"},
    "parameters": {
        "batch_size": {"values": [32, 64, 128]}  # Sweeping different batch sizes
    }
}

sweep_id = wandb.sweep(sweep_config, project="MLOps2025_7")


Create sweep with ID: sta538bf
Sweep URL: https://wandb.ai/g24ait127-abbott/MLOps2025_7/sweeps/sta538bf


In [16]:
import wandb

sweep_config = {
    "method": "grid",  # Choose "random", "grid", or "bayes"
    "metric": {"name": "Validation Loss", "goal": "minimize"},  # name must match a key logged with wandb.log
    "parameters": {
        "batch_size": {"values": [32, 64, 128]},  # Change as per roll number 7 (batch size sweep)
        "learning_rate": {"values": [0.001, 0.0005]},
        "epochs": {"values": [5]},
    },
}


In [17]:
sweep_id = wandb.sweep(sweep_config, project="MLOps2025_YourRollNo")
print(f"Sweep ID: {sweep_id}")


Create sweep with ID: th5pkys2
Sweep URL: https://wandb.ai/g24ait127-abbott/MLOps2025_YourRollNo/sweeps/th5pkys2
Sweep ID: th5pkys2
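
A grid sweep tries every combination of the listed parameter values. To see how many runs that implies for the configuration above, the grid can be enumerated with `itertools.product`. This sketch only counts the combinations. It does not model the order in which W&B schedules them.

```python
from itertools import product

# Parameter grid copied from sweep_config above
parameters = {
    "batch_size": [32, 64, 128],
    "learning_rate": [0.001, 0.0005],
    "epochs": [5],
}

names = list(parameters)
grid = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(grid))  # 6 combinations: 3 batch sizes x 2 learning rates x 1 epoch setting
for combo in grid:
    print(combo)
```

With six combinations in the grid, an agent limited to `count=3` covers only half of them. Raise `count` to 6 or leave it unset to run the full grid.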


In [18]:
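def train_sweep():
    with wandb.init() as run:
        config = wandb.config

        # Note: `model` and the module-level `optimizer` used inside train_model are
        # shared across sweep runs, so each run continues training from the previous
        # one and config.learning_rate is not applied. Re-create both here if the
        # runs need to be independent.

        # Update batch size dynamically
        train_loader = torch.utils.data.DataLoader(train_data, batch_size=config.batch_size, shuffle=True)
        val_loader = torch.utils.data.DataLoader(val_data, batch_size=config.batch_size, shuffle=False)

        train_model(model, train_loader, val_loader, epochs=config.epochs)

# count=3 runs only the first three of the six grid combinations
wandb.agent(sweep_id, function=train_sweep, count=3)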


[34m[1mwandb[0m: Agent Starting Run: wq6tpkpw with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	learning_rate: 0.001


Epoch 1/5 - Train Loss: 485.5600, Train Acc: 88.05%, Val Acc: 86.91%
Epoch 2/5 - Train Loss: 442.8030, Train Acc: 89.12%, Val Acc: 87.13%
Epoch 3/5 - Train Loss: 417.3646, Train Acc: 89.61%, Val Acc: 87.66%
Epoch 4/5 - Train Loss: 401.1395, Train Acc: 90.03%, Val Acc: 87.38%
Epoch 5/5 - Train Loss: 386.0308, Train Acc: 90.40%, Val Acc: 87.78%


Run history:
Train Accuracy,▁▄▆▇█
Train Loss,█▅▃▂▁
Validation Accuracy,▁▃▇▅█
Validation Loss,▅█▃█▁

Run summary:
Train Accuracy,90.40208
Train Loss,0.25735
Validation Accuracy,87.78333
Validation Loss,0.34477


[34m[1mwandb[0m: Agent Starting Run: wwu49ceo with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	learning_rate: 0.0005


Epoch 1/5 - Train Loss: 368.1658, Train Acc: 90.97%, Val Acc: 88.21%
Epoch 2/5 - Train Loss: 355.9574, Train Acc: 91.16%, Val Acc: 88.29%
Epoch 3/5 - Train Loss: 344.6764, Train Acc: 91.23%, Val Acc: 87.92%
Epoch 4/5 - Train Loss: 332.0869, Train Acc: 91.62%, Val Acc: 87.64%
Epoch 5/5 - Train Loss: 323.7187, Train Acc: 91.91%, Val Acc: 88.00%


Run history:
Train Accuracy,▁▂▃▆█
Train Loss,█▆▄▂▁
Validation Accuracy,▇█▄▁▅
Validation Loss,▃▁▃▇█

Run summary:
Train Accuracy,91.91458
Train Loss,0.21581
Validation Accuracy,88.0
Validation Loss,0.37004


[34m[1mwandb[0m: Agent Starting Run: k7dchvzt with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	learning_rate: 0.001


Epoch 1/5 - Train Loss: 133.8768, Train Acc: 93.33%, Val Acc: 88.58%
Epoch 2/5 - Train Loss: 132.1137, Train Acc: 93.41%, Val Acc: 88.58%
Epoch 3/5 - Train Loss: 130.1729, Train Acc: 93.54%, Val Acc: 88.33%
Epoch 4/5 - Train Loss: 129.2628, Train Acc: 93.62%, Val Acc: 87.82%
Epoch 5/5 - Train Loss: 126.6405, Train Acc: 93.66%, Val Acc: 88.83%


Run history:
Train Accuracy,▁▃▅▇█
Train Loss,█▆▄▄▁
Validation Accuracy,▆▆▅▁█
Validation Loss,▁▂▇█▄

Run summary:
Train Accuracy,93.65625
Train Loss,0.16885
Validation Accuracy,88.825
Validation Loss,0.36042
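
The printed `Train Loss` values (in the hundreds) and the `Train Loss` in each run summary (below 1) describe the same quantity at different scales. `train_model` prints the loss summed over all batches but logs that sum divided by `len(train_loader)`. The sketch below reproduces the conversion. `mean_batch_loss` is a hypothetical helper, and the 48,000 figure assumes the 80/20 split of the 60,000 Fashion-MNIST training images made earlier.

```python
import math

train_size = 48_000  # int(0.8 * 60000) training images after the 80/20 split


def mean_batch_loss(summed_loss, batch_size):
    """Convert the summed loss printed per epoch into the per-batch average logged to W&B."""
    n_batches = math.ceil(train_size / batch_size)  # len(train_loader) with the default drop_last=False
    return summed_loss / n_batches


# Final-epoch values printed by the three sweep runs above
print(round(mean_batch_loss(386.0308, 32), 5))  # 0.25735
print(round(mean_batch_loss(323.7187, 32), 5))  # 0.21581
print(round(mean_batch_loss(126.6405, 64), 5))  # 0.16885
```

Because the summed loss grows with the number of batches, the printed values are not comparable across batch sizes. The per-batch averages in the run summaries are.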


In [20]:
import wandb
import torch

# Initialize Weights & Biases
wandb.init(project="MLOps2025_YourRollNo", name="model_artifact_logging")  # Replace YourRollNo

# Save the trained model
torch.save(model.state_dict(), "fashion_mnist_model.pth")

# Create and log the artifact
artifact = wandb.Artifact("fashion_mnist_model", type="model")
artifact.add_file("fashion_mnist_model.pth")
wandb.log_artifact(artifact)

# Finish the wandb run properly
wandb.finish()
