TensorBoard is a front-end web interface that essentially reads data from a file and displays it. To use TensorBoard, our task is to save the data we want displayed to a file that TensorBoard can read.

To make this easy for us, PyTorch has created a utility class called SummaryWriter. To get access to this class we use the following import:

from torch.utils.tensorboard import SummaryWriter

Once we have imported the class, we can create an instance of the class that we'll then use to get the data out of our program and onto the file system where it can then be consumed by TensorBoard.

In [23]:
from torch.utils.tensorboard import SummaryWriter

In [24]:
import torch 
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [25]:
train_set = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train=True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
    ])
)
train_loader = torch.utils.data.DataLoader(train_set
    ,batch_size=10
    ,shuffle=True
)

In [26]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
        
        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)  # "linear", "dense", and "fully connected" all name the same kind of layer
        self.out = nn.Linear(in_features=60, out_features=10)
    def forward(self, t):
        # (1) input layer
        t = t

        # (2) hidden conv layer
        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (3) hidden conv layer
        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (4) hidden linear layer
        t = t.reshape(-1, 12 * 4 * 4)
        t = self.fc1(t)
        t = F.relu(t)

        # (5) hidden linear layer
        t = self.fc2(t)
        t = F.relu(t)

        # (6) output layer
        t = self.out(t)
        #t = F.softmax(t, dim=1)

        return t
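
The `12 * 4 * 4` used for `fc1` and for the reshape follows from the layer arithmetic. As a minimal sketch, assuming 28x28 FashionMNIST inputs and the default stride of 1 and padding of 0 for `nn.Conv2d` (the helper names `conv_out` and `pool_out` are hypothetical, not part of PyTorch):

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size, stride):
    """Spatial output size of a max-pool along one dimension."""
    return (size - kernel_size) // stride + 1


size = 28                      # FashionMNIST images are 28x28
size = conv_out(size, 5)       # conv1, kernel_size=5
size = pool_out(size, 2, 2)    # max_pool2d, kernel_size=2, stride=2
size = conv_out(size, 5)       # conv2, kernel_size=5
size = pool_out(size, 2, 2)    # max_pool2d, kernel_size=2, stride=2

out_channels = 12              # out_channels of conv2
flat_features = out_channels * size * size
print(size, flat_features)
```

The final spatial size is 4, so each image flattens to 12 * 4 * 4 = 192 features before `fc1`.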

In [27]:
def get_num_correct(preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()
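
To see what `get_num_correct` computes, here is a small sketch with made-up predictions for three samples and three classes (the tensor values are illustrative assumptions, not network output):

```python
import torch


def get_num_correct(preds, labels):
    # argmax picks the predicted class per row; eq marks matches with the labels
    return preds.argmax(dim=1).eq(labels).sum().item()


# Each row holds the scores for one sample.
preds = torch.tensor([
    [0.1, 0.9, 0.0],   # predicted class 1
    [0.8, 0.1, 0.1],   # predicted class 0
    [0.2, 0.3, 0.5],   # predicted class 2
])
labels = torch.tensor([1, 0, 1])

num_correct = get_num_correct(preds, labels)
print(num_correct)  # the first two predictions match their labels
```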


In [8]:
tb = SummaryWriter()

network = Network()
images, labels = next(iter(train_loader))
grid = torchvision.utils.make_grid(images)

tb.add_image('images', grid)
tb.add_graph(network, images)
tb.close()

This code creates a SummaryWriter instance, writes a grid of images and the network graph to it, and then closes the writer.

## Running TensorBoard
To launch TensorBoard, we need to run the tensorboard command at our terminal. This will launch a local server that will serve the TensorBoard UI and the data our SummaryWriter wrote to disk.

By default, the PyTorch SummaryWriter object writes the data to disk in a directory called ./runs that is created in the current working directory.

When we run the tensorboard command, we pass an argument that tells TensorBoard where the data is:

tensorboard --logdir=runs

The TensorBoard server will launch and listen for HTTP requests on port 6006. These details will be displayed in the console.

Access the TensorBoard UI by browsing to:

http://localhost:6006

## TensorBoard Histograms and Scalars

In [14]:
network = Network()

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)
images, labels = next(iter(train_loader))
grid = torchvision.utils.make_grid(images)

tb = SummaryWriter()
tb.add_image('images', grid)
tb.add_graph(network, images)

for epoch in range(10):
    
    total_loss = 0
    total_correct = 0
    
    for batch in train_loader: # Get Batch
        images, labels = batch 

        preds = network(images) # Pass Batch
        loss = F.cross_entropy(preds, labels) # Calculate Loss

        optimizer.zero_grad()
        loss.backward() # Calculate Gradients
        optimizer.step() # Update Weights

        total_loss += loss.item()
        total_correct += get_num_correct(preds, labels)
        
    tb.add_scalar('Loss', total_loss, epoch)
    tb.add_scalar('Number Correct', total_correct, epoch)
    tb.add_scalar('Accuracy', total_correct / len(train_set), epoch)

    tb.add_histogram('conv1.bias', network.conv1.bias, epoch)
    tb.add_histogram('conv1.weight', network.conv1.weight, epoch)
    tb.add_histogram(
            'conv1.weight.grad'
            ,network.conv1.weight.grad
            ,epoch
        )

    print(
        "epoch", epoch, 
        "total_correct:", total_correct, 
        "loss:", total_loss
    )

epoch 0 total_correct: 46637 loss: 348.67361275851727
epoch 1 total_correct: 51120 loss: 239.23234041035175
epoch 2 total_correct: 51864 loss: 217.5779768228531
epoch 3 total_correct: 52276 loss: 209.0313512980938
epoch 4 total_correct: 52432 loss: 203.20759464800358
epoch 5 total_correct: 52620 loss: 197.5764152854681
epoch 6 total_correct: 52879 loss: 194.0111052542925
epoch 7 total_correct: 52754 loss: 192.17617286741734
epoch 8 total_correct: 53186 loss: 184.22582198679447
epoch 9 total_correct: 53121 loss: 186.3282471075654


This will add these values to TensorBoard. The values even update in real-time as the network trains.

The real power of TensorBoard is its out-of-the-box capability of comparing multiple runs. This allows us to rapidly experiment by varying the hyperparameter values and comparing runs to see which parameters are working best.

## Naming the Training Runs for TensorBoard
To take advantage of TensorBoard's comparison capabilities, we need to do multiple runs and name each run in such a way that we can identify it uniquely.

With PyTorch's SummaryWriter, a run starts when the writer object instance is created and ends when the writer instance is closed or goes out of scope.

To uniquely identify each run, we can either set the run's log directory directly with the log_dir argument, or pass a comment string to the constructor that will be appended to the auto-generated directory name.
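
The SummaryWriter documentation describes the default log directory as `runs/CURRENT_DATETIME_HOSTNAME`, with the comment appended as a suffix. The sketch below only imitates that naming scheme to show where the comment ends up. The helper `sketch_log_dir` is hypothetical, and the exact timestamp format is an assumption:

```python
import os
import socket
from datetime import datetime


def sketch_log_dir(comment=""):
    """Imitate the default run directory name; not the library's own code."""
    timestamp = datetime.now().strftime("%b%d_%H-%M-%S")  # assumed format
    return os.path.join("runs", timestamp + "_" + socket.gethostname() + comment)


batch_size, lr = 100, 0.01
comment = f" batch_size={batch_size} lr={lr}"
print(sketch_log_dir(comment))
```

Note that when log_dir is passed explicitly, the comment argument has no effect.
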
### Choosing a Name for the Run

One way to name the run is to add the parameter names and values as a comment for the run. This will allow us to see how each parameter value stacks up with the others later when we are reviewing the runs inside TensorBoard.

In [8]:
# Lists of hyperparameter values to test the model with
batch_size_list = [100, 1000, 10000]
lr_list = [.01, .001, .0001, .00001]

In [9]:
# One nested loop per hyperparameter we want to tune
for batch_size in batch_size_list:
    for lr in lr_list:
        # batch_size and lr were previously fixed at 100 and 0.01;
        # they now come from the loops because they are the hyperparameters being tuned
        network = Network()

        train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)
        optimizer = optim.Adam(network.parameters(), lr=lr)

        images, labels = next(iter(train_loader))
        grid = torchvision.utils.make_grid(images)

        comment = f' batch_size={batch_size} lr={lr}'
        tb = SummaryWriter(comment=comment)
        # SummaryWriter appends the comment to the run name, which identifies the run uniquely
        tb.add_image('images', grid)
        tb.add_graph(network, images)

        for epoch in range(10):

            total_loss = 0
            total_correct = 0

            for batch in train_loader: # Get Batch
                images, labels = batch 

                preds = network(images) # Pass Batch
                loss = F.cross_entropy(preds, labels) # Calculate Loss

                optimizer.zero_grad()
                loss.backward() # Calculate Gradients
                optimizer.step() # Update Weights

                # Previously: total_loss += loss.item()
                # cross_entropy returns the mean loss over the batch, so scale it by
                # batch_size to make totals comparable across runs with different batch sizes
                total_loss += loss.item() * batch_size
                total_correct += get_num_correct(preds, labels)

            tb.add_scalar('Loss', total_loss, epoch)
            tb.add_scalar('Number Correct', total_correct, epoch)
            tb.add_scalar('Accuracy', total_correct / len(train_set), epoch)

            tb.add_histogram('conv1.bias', network.conv1.bias, epoch)
            tb.add_histogram('conv1.weight', network.conv1.weight, epoch)
            tb.add_histogram(
                    'conv1.weight.grad'
                    ,network.conv1.weight.grad
                    ,epoch
                )

            print(
                "epoch", epoch, 
                "total_correct:", total_correct, 
                "loss:", total_loss
            )

epoch 0 total_correct: 47371 loss: 33295.010417699814
epoch 1 total_correct: 51277 loss: 23253.64562869072
epoch 2 total_correct: 51949 loss: 21531.316296756268
epoch 3 total_correct: 52303 loss: 20488.59374523163
epoch 4 total_correct: 52578 loss: 19908.43445956707
epoch 5 total_correct: 52674 loss: 19767.661049962044
epoch 6 total_correct: 52838 loss: 19177.610696852207
epoch 7 total_correct: 52749 loss: 19427.634966373444
epoch 8 total_correct: 53126 loss: 18471.25412374735
epoch 9 total_correct: 53226 loss: 18369.066247344017
epoch 0 total_correct: 41507 loss: 48938.33390176296
epoch 1 total_correct: 48013 loss: 31853.487139940262
epoch 2 total_correct: 50184 loss: 26767.59902536869
epoch 3 total_correct: 51257 loss: 23788.830955326557
epoch 4 total_correct: 51884 loss: 21942.797024548054
epoch 5 total_correct: 52424 loss: 20649.516186118126
epoch 6 total_correct: 52814 loss: 19634.12436544895
epoch 7 total_correct: 53144 loss: 18772.239856421947
epoch 8 total_correct: 53368 loss: 
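
The loop above multiplies each batch loss by the batch size because `F.cross_entropy` averages over the batch by default. Here is a plain-Python sketch of that arithmetic, using made-up per-sample losses (the helper names and the loss values are hypothetical):

```python
def batches(values, batch_size):
    """Split a list into consecutive batches."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def summed_mean_losses(per_sample_losses, batch_size):
    """Mimics `total_loss += loss.item()` with a mean-reduced loss."""
    return sum(sum(b) / len(b) for b in batches(per_sample_losses, batch_size))


def rescaled_total_loss(per_sample_losses, batch_size):
    """Mimics `total_loss += loss.item() * batch_size`."""
    return sum((sum(b) / len(b)) * len(b) for b in batches(per_sample_losses, batch_size))


per_sample_losses = [0.5, 1.5, 2.0, 1.0, 3.0, 2.0]

for batch_size in (2, 3):
    print(
        "batch_size:", batch_size,
        "summed means:", summed_mean_losses(per_sample_losses, batch_size),
        "rescaled total:", rescaled_total_loss(per_sample_losses, batch_size),
    )
```

The summed means depend on how many batches there are, while the rescaled totals agree across batch sizes. The sketch scales by the actual length of each batch, which also covers a shorter final batch. In the loop above the nominal batch size works because 60,000 training images divide evenly by 100, 1000, and 10000.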

## Adding Network Parameters & Gradients to TensorBoard

## Adding More Hyperparameters Without Nesting

This is cool. However, what if we want to add a third or even a fourth parameter to iterate on? Well, this is going to get messy with many nested for-loops.

If we have a list of parameters, we can package them up into a set for each of our runs using the Cartesian product. For this we'll use the product function from the itertools library.

In [10]:
from itertools import product

Next, we define a dictionary that contains parameters as keys and the parameter values we want to use as values.

In [11]:
parameters = dict(
    lr = [.01, .001]
    ,batch_size = [100, 1000]
    ,shuffle = [True, False]
)

Next, we'll create a list of iterables that we can pass to the product function.

In [14]:
param_values = [v for v in parameters.values()]


In [15]:
param_values

[[0.01, 0.001], [100, 1000], [True, False]]

We have three lists of parameter values. After we take the Cartesian product of these three lists, we'll have a set of parameter values for each of our runs.

Now we can iterate over each set of parameters using a single for-loop. All we have to do is unpack each set using sequence unpacking.

When the lists are passed to product(), notice the * operator. This is a special way in Python to unpack a list into a set of arguments.
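
A minimal sketch of that single loop, reusing the parameters dictionary defined above (the print call stands in for a training run):

```python
from itertools import product

parameters = dict(
    lr=[.01, .001],
    batch_size=[100, 1000],
    shuffle=[True, False],
)
param_values = list(parameters.values())

# product(*param_values) is the same as product([.01, .001], [100, 1000], [True, False])
combinations = list(product(*param_values))

for lr, batch_size, shuffle in combinations:
    print(lr, batch_size, shuffle)
```

With two values for each of the three parameters, the loop runs 2 * 2 * 2 = 8 times, once per combination.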

## Training Loop Run Builder
We’ll code a RunBuilder class that will allow us to generate multiple runs with varying parameters.

We’ll start with our imports.

In [18]:
from collections import OrderedDict
from collections import namedtuple
from itertools import product

In [19]:
class RunBuilder():
    @staticmethod
    def get_runs(params):

        Run = namedtuple('Run', params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs

The main thing to note about using this class is that it has a static method called get_runs(). This method will get the runs for us that it builds based on the parameters we pass in.

Let’s define some parameters now.

In [20]:
params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
)

To get these runs, we just call the get_runs() function of the RunBuilder class, passing in the parameters we’d like to use.

In [21]:
runs = RunBuilder.get_runs(params)

In [6]:
runs
# runs contains every combination of the parameter values

[Run(lr=0.01, batch_size=1000),
 Run(lr=0.01, batch_size=10000),
 Run(lr=0.001, batch_size=1000),
 Run(lr=0.001, batch_size=10000)]

We can access an individual run by indexing into the list like so:

In [8]:
runs[0]

Run(lr=0.01, batch_size=1000)

Since the run object is a tuple with named attributes, we can access the values using dot notation like so:

In [11]:
run = runs[0]
print(run.lr, run.batch_size)

0.01 1000


In [12]:
# we can iterate over the runs cleanly like so:

for run in runs:
    print(run, run.lr, run.batch_size)

Run(lr=0.01, batch_size=1000) 0.01 1000
Run(lr=0.01, batch_size=10000) 0.01 10000
Run(lr=0.001, batch_size=1000) 0.001 1000
Run(lr=0.001, batch_size=10000) 0.001 10000


### Let's look inside the RunBuilder class

In [14]:
# We can get the keys of the dictionary like so:
params.keys()

odict_keys(['lr', 'batch_size'])

In [15]:
# We can get the values of the dictionary like so:
params.values()

odict_values([[0.01, 0.001], [1000, 10000]])

In [16]:
# We use these keys and values for what comes next. We'll start with the keys.
Run = namedtuple('Run', params.keys())

This Run class is used to encapsulate the data for each of our runs.

The field names of this class are set by the list of names passed to the constructor. First, we are passing the class name. Then, we are passing the field names, and in our case, we are passing the list of keys from our dictionary.
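
As a small sketch of what namedtuple builds from those two arguments (the `example_run` values are chosen only for illustration):

```python
from collections import OrderedDict, namedtuple

params = OrderedDict(
    lr=[.01, .001],
    batch_size=[1000, 10000],
)

# First argument: the class name. Second argument: the field names.
Run = namedtuple('Run', params.keys())

print(Run.__name__)   # the class name we passed in
print(Run._fields)    # the field names, taken from the dictionary keys

example_run = Run(lr=0.01, batch_size=1000)
print(example_run.lr, example_run.batch_size)
```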

In [17]:
runs = []
for v in product(*params.values()):
    runs.append(Run(*v))

First we create a list called runs. Then, we use the product() function from itertools to create the Cartesian product of the value lists for each parameter in our dictionary. This gives us a set of ordered tuples that define our runs. We iterate over these, adding a run to the runs list for each one.

Each element of the Cartesian product is an ordered tuple. Because the product contains every such tuple, we get all possible ordered pairs of learning rates and batch sizes. When we pass a tuple to the Run constructor, we use the * operator to tell the constructor to accept the tuple's values as separate arguments, as opposed to the tuple itself.
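
To illustrate the difference the * operator makes, here is a short sketch that calls the constructor both ways (the variable names are illustrative):

```python
from collections import namedtuple

Run = namedtuple('Run', ['lr', 'batch_size'])
v = (0.01, 1000)

# With *, the tuple is unpacked into two positional arguments.
run = Run(*v)
print(run)

# Without *, the whole tuple is passed as the single argument lr,
# so batch_size is missing and the constructor raises a TypeError.
try:
    Run(v)
    unpacking_required = False
except TypeError:
    unpacking_required = True

print(unpacking_required)
```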

Since the get_runs() method is static, we can call it using the class itself. We don’t need an instance of the class.
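
A compact sketch of that point, using a condensed version of the RunBuilder class from above. It also shows the comment string that the next cell builds from each run:

```python
from collections import OrderedDict, namedtuple
from itertools import product


class RunBuilder:
    @staticmethod
    def get_runs(params):
        Run = namedtuple('Run', params.keys())
        return [Run(*v) for v in product(*params.values())]


params = OrderedDict(
    lr=[.01, .001],
    batch_size=[1000, 10000],
)

# Called on the class: no instance is needed.
runs_from_class = RunBuilder.get_runs(params)

# Calling through an instance also works, but the instance adds nothing.
runs_from_instance = RunBuilder().get_runs(params)

comment = f'-{runs_from_class[0]}'
print(comment)
```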

In [32]:
for run in RunBuilder.get_runs(params):
        comment = f'-{run}'
        network = Network()

        train_loader = torch.utils.data.DataLoader(train_set, batch_size=run.batch_size)
        optimizer = optim.Adam(network.parameters(), lr=run.lr)

        images, labels = next(iter(train_loader))
        grid = torchvision.utils.make_grid(images)

        tb = SummaryWriter(comment=comment)
        # SummaryWriter appends the comment to the run name, which identifies the run uniquely
        tb.add_image('images', grid)
        tb.add_graph(network, images)

        for epoch in range(5):

            total_loss = 0
            total_correct = 0

            for batch in train_loader: # Get Batch
                images, labels = batch 

                preds = network(images) # Pass Batch
                loss = F.cross_entropy(preds, labels) # Calculate Loss

                optimizer.zero_grad()
                loss.backward() # Calculate Gradients
                optimizer.step() # Update Weights

                # Previously: total_loss += loss.item()
                # cross_entropy returns the mean loss over the batch, so scale it by
                # batch_size to make totals comparable across runs with different batch sizes
                total_loss += loss.item() * run.batch_size
                total_correct += get_num_correct(preds, labels)

            tb.add_scalar('Loss', total_loss, epoch)
            tb.add_scalar('Number Correct', total_correct, epoch)
            tb.add_scalar('Accuracy', total_correct / len(train_set), epoch)

            tb.add_histogram('conv1.bias', network.conv1.bias, epoch)
            tb.add_histogram('conv1.weight', network.conv1.weight, epoch)
            tb.add_histogram(
                    'conv1.weight.grad'
                    ,network.conv1.weight.grad
                    ,epoch
                )

            print(
                "epoch", epoch, 
                "total_correct:", total_correct, 
                "loss:", total_loss
            )

epoch 0 total_correct: 37780 loss: 58599.32154417038
epoch 1 total_correct: 47574 loss: 32896.0095345974
epoch 2 total_correct: 50360 loss: 26575.594872236252
epoch 3 total_correct: 51531 loss: 23226.498186588287
epoch 4 total_correct: 52153 loss: 21278.14558148384
epoch 0 total_correct: 15934 loss: 122171.53668403625
epoch 1 total_correct: 29269 loss: 77561.65146827698
epoch 2 total_correct: 37412 loss: 60099.947452545166
epoch 3 total_correct: 40782 loss: 49917.40643978119
epoch 4 total_correct: 43081 loss: 44133.17859172821
epoch 0 total_correct: 29046 loss: 92105.70079088211
epoch 1 total_correct: 40240 loss: 51495.10437250137
epoch 2 total_correct: 43045 loss: 43882.4542760849
epoch 3 total_correct: 44734 loss: 39440.139174461365
epoch 4 total_correct: 45993 loss: 36319.87351179123
epoch 0 total_correct: 6682 loss: 137506.74724578857
epoch 1 total_correct: 12265 loss: 133907.208442688
epoch 2 total_correct: 21892 loss: 124497.26819992065
epoch 3 total_correct: 26896 loss: 106004.6

## PyTorch DataLoader num_workers
In this section, we'll see how to speed up the neural network training process by using the multiprocessing capabilities of the PyTorch DataLoader class.

The num_workers attribute tells the data loader instance how many sub-processes to use for data loading. By default, the num_workers value is set to zero, and a value of zero tells the loader to load the data inside the main process.

Now, if we have a worker process, we can make use of the fact that our machine has multiple cores. This means that the next batch can already be loaded and ready to go by the time the main process is ready for another batch. This is where the speed up comes from. The batches are loaded using additional worker processes and are queued up in memory.
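
To observe the effect of num_workers, one option is to time a full pass over the loader. The sketch below is a rough illustration under stated assumptions: it uses a small synthetic TensorDataset in place of FashionMNIST, performs no training step, and the helper `time_one_pass` is hypothetical.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_pass(dataset, batch_size, num_workers):
    """Iterate over the dataset once; return (elapsed seconds, batch count)."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    num_batches = 0
    for images, labels in loader:  # loading only, no training step
        num_batches += 1
    return time.perf_counter() - start, num_batches


# The guard matters when worker processes are started with the spawn method.
if __name__ == "__main__":
    # Synthetic stand-in for FashionMNIST: 1,000 random 1x28x28 images.
    dataset = TensorDataset(
        torch.randn(1000, 1, 28, 28),
        torch.randint(0, 10, (1000,)),
    )
    for num_workers in (0, 2):
        seconds, num_batches = time_one_pass(dataset, 100, num_workers)
        print(f"num_workers={num_workers}: {num_batches} batches in {seconds:.3f}s")
```

On a tiny in-memory dataset like this one, the cost of starting worker processes may outweigh any gain, so the timings are only meaningful when measured on the real dataset and training loop.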

In [33]:
params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000, 10000]
    ,num_workers = [1,2,4,8]
)

In [35]:
for run in RunBuilder.get_runs(params):
        comment = f'-{run}'
        network = Network()

        train_loader = torch.utils.data.DataLoader(train_set, batch_size=run.batch_size, num_workers=run.num_workers)
        optimizer = optim.Adam(network.parameters(), lr=run.lr)

        images, labels = next(iter(train_loader))
        grid = torchvision.utils.make_grid(images)

        tb = SummaryWriter(comment=comment)
        # SummaryWriter appends the comment to the run name, which identifies the run uniquely
        tb.add_image('images', grid)
        tb.add_graph(network, images)

        for epoch in range(2):

            total_loss = 0
            total_correct = 0

            for batch in train_loader: # Get Batch
                images, labels = batch 

                preds = network(images) # Pass Batch
                loss = F.cross_entropy(preds, labels) # Calculate Loss

                optimizer.zero_grad()
                loss.backward() # Calculate Gradients
                optimizer.step() # Update Weights

                # Previously: total_loss += loss.item()
                # cross_entropy returns the mean loss over the batch, so scale it by
                # batch_size to make totals comparable across runs with different batch sizes
                total_loss += loss.item() * run.batch_size
                total_correct += get_num_correct(preds, labels)

            tb.add_scalar('Loss', total_loss, epoch)
            tb.add_scalar('Number Correct', total_correct, epoch)
            tb.add_scalar('Accuracy', total_correct / len(train_set), epoch)

            tb.add_histogram('conv1.bias', network.conv1.bias, epoch)
            tb.add_histogram('conv1.weight', network.conv1.weight, epoch)
            tb.add_histogram(
                    'conv1.weight.grad'
                    ,network.conv1.weight.grad
                    ,epoch
                )

            print(
                "epoch", epoch, 
                "total_correct:", total_correct, 
                "loss:", total_loss
            )

epoch 0 total_correct: 37212 loss: 58812.82848119736
epoch 1 total_correct: 48630 loss: 30188.325732946396
epoch 0 total_correct: 36806 loss: 59732.29068517685
epoch 1 total_correct: 47691 loss: 31755.951434373856
epoch 0 total_correct: 38101 loss: 57306.09983205795
epoch 1 total_correct: 48455 loss: 30193.024933338165
epoch 0 total_correct: 38005 loss: 58287.00304031372
epoch 1 total_correct: 48141 loss: 31524.434506893158
epoch 0 total_correct: 15734 loss: 126101.9766330719
epoch 1 total_correct: 31145 loss: 78653.46789360046
epoch 0 total_correct: 12641 loss: 127871.3583946228
epoch 1 total_correct: 30506 loss: 76154.60276603699
epoch 0 total_correct: 13269 loss: 128005.11002540588
epoch 1 total_correct: 29120 loss: 82087.17823028564
epoch 0 total_correct: 11345 loss: 130399.26767349243
epoch 1 total_correct: 23567 loss: 95575.30760765076
