# Training GNNs on Large Graphs

We have seen an example of training GNNs on an entire graph.  In practice, however, graphs are often very large: they can contain millions or even billions of nodes and edges, and the storage required grows many times over once node and edge features are included.  If we want to use GPUs for faster computation, full-graph training is often impossible because the graph and its features cannot fit into the memory of a single GPU.  On top of that, the node representations of the intermediate layers must also be stored for backpropagation.

To get around this limit, we combine two techniques:

1. Stochastic training on graphs.
2. Neighbor sampling on graphs.

## Stochastic Training on Graphs

If you are familiar with deep learning for images/texts/etc., you should know stochastic gradient descent (SGD) very well.  In SGD, you sample a minibatch of examples, compute the loss on those examples only, find the gradients, and update the model parameters.
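
As a reminder, the four SGD steps above can be sketched without any deep learning library.  The toy problem (fitting the slope of `y = 3 * x`), the function name `sgd_fit_slope`, and all hyperparameters are made up for illustration.

```python
import random

def sgd_fit_slope(xs, ys, lr=0.1, batch_size=4, steps=500, seed=0):
    """Fit y ~ w * x with minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Sample a minibatch of examples.
        batch = rng.sample(indices, batch_size)
        # Gradient of the loss on those examples only:
        # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
        grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        # Update the model parameter.
        w -= lr * grad
    return w

xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x for x in xs]
w = sgd_fit_slope(xs, ys)
print(w)
```

Each update touches only `batch_size` examples, which is the property stochastic training on graphs tries to retain.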

Stochastic training on graphs resembles SGD on image/text datasets in that one also samples a minibatch of nodes (or pairs/tuples of nodes, depending on the task) and computes the loss on those nodes only.  The difference is that the output representation of a small set of nodes may depend on the input features of a substantially larger set of nodes.

### An Example of GraphSAGE

This hands-on tutorial takes GraphSAGE as an example.

The output representation $\boldsymbol{y}_u$ of node $u$ from a GraphSAGE layer is simply computed by:

* Aggregating the input features of all neighbors of $u$, for instance by averaging.
* Concatenating the aggregation with node $u$'s own representation.
* Passing the concatenation to an MLP.
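
In symbols, and assuming mean aggregation, the three steps can be summarized as below.  The notation is introduced here for illustration only: $\boldsymbol{x}_u$ is the input feature of node $u$, $\mathcal{N}(u)$ is the set of neighbors of $u$, and $\Vert$ denotes concatenation.

$$
\boldsymbol{y}_u = \mathrm{MLP}\left(\boldsymbol{x}_u \,\Vert\, \frac{1}{|\mathcal{N}(u)|} \sum_{v \in \mathcal{N}(u)} \boldsymbol{x}_v\right)
$$

In the code of this tutorial, the MLP is a single linear layer followed by a ReLU.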

#### GraphSAGE Layer on Full Graph

If we write the above in the full-graph training manner, we will have this piece of code:

In [1]:
import dgl
import dgl.function as fn
import torch
from torch import nn
import torch.nn.functional as F

In [2]:
class SAGEConv(nn.Module):
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.W = nn.Linear(2 * in_feats, out_feats)
        
    def forward(self, g, x):
        with g.local_scope():
            # Aggregate input features of neighbors
            g.ndata['x'] = x
            g.update_all(fn.copy_u('x', 'm'), fn.mean('m', 'x_neigh'))
            # Concatenate aggregation with the node representation itself
            x_neigh = g.ndata['x_neigh']
            x_concat = torch.cat([x, x_neigh], -1)
            # Pass the concatenation to an MLP
            y = F.relu(self.W(x_concat))
            return y

For instance, consider the following graph:

![Graph](assets/graph.png)

In [3]:
# A small graph

import networkx as nx

example_graph = nx.Graph(
    [(0, 2), (0, 4), (0, 6), (0, 7), (0, 8), (0, 9), (0, 10),
     (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 6), (3, 5),
     (3, 8), (4, 7), (8, 9), (8, 11), (9, 10), (9, 11)])
example_graph = dgl.graph(example_graph)
# We also assign features for each node
INPUT_FEATURES = 5
OUTPUT_FEATURES = 6
example_graph.ndata['features'] = torch.randn(12, INPUT_FEATURES)

If we wish to compute the output representations of nodes 4 and 6 with a GraphSAGE layer, we need the input features of nodes 4 and 6 themselves, as well as those of their neighbors (nodes 7, 0 and 2):

![Graph](assets/graph_1layer_46.png)

We can see that nodes 7, 0, and 2 contribute to the representation of node 4, while nodes 0 and 2 contribute to that of node 6.

### Finding Neighbors of Nodes

DGL provides an API, `dgl.in_subgraph`, that takes a set of nodes and returns a graph consisting of all edges going into any of the given nodes.  Such a graph exactly describes the computation dependency above.

In [4]:
sampled_node_batch = torch.LongTensor([4, 6])   # These are the nodes whose outputs are to be computed
sampled_graph = dgl.in_subgraph(example_graph, sampled_node_batch)
print(sampled_graph.all_edges())

(tensor([0, 2, 7, 0, 2]), tensor([4, 4, 4, 6, 6]))


The result above reads that nodes 0, 2 and 7 connect to node 4, while nodes 0 and 2 connect to node 6.
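
To make the idea concrete, here is a dependency-free sketch of the same selection on the example graph's edge list.  It only mimics the result shown above and is not how DGL implements `dgl.in_subgraph`; the helper name `incoming_edges` is hypothetical.

```python
# Undirected edge list of the example graph; each edge is used in both directions.
UNDIRECTED_EDGES = [
    (0, 2), (0, 4), (0, 6), (0, 7), (0, 8), (0, 9), (0, 10),
    (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 6), (3, 5),
    (3, 8), (4, 7), (8, 9), (8, 11), (9, 10), (9, 11)]

def incoming_edges(undirected_edges, seeds):
    """Return all (src, dst) pairs whose destination is a seed node."""
    seed_set = set(seeds)
    directed = []
    for u, v in undirected_edges:
        directed.append((u, v))
        directed.append((v, u))
    kept = [(s, d) for s, d in directed if d in seed_set]
    # Sort by destination first, then source, for a stable listing.
    return sorted(kept, key=lambda edge: (edge[1], edge[0]))

edges_into_seeds = incoming_edges(UNDIRECTED_EDGES, [4, 6])
print(edges_into_seeds)
# [(0, 4), (2, 4), (7, 4), (0, 6), (2, 6)]
```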

### Message Passing on One Layer

Since we don't want to put the entire set of node features onto the GPU, we typically need to compact the sampled graph so that only the nodes necessary for the computation are kept.  To save further memory, we would like to have two sets of nodes:

* The input nodes, which contain only the nodes that send messages, i.e. the neighbors of the output nodes.
* The output nodes, which only contain the nodes whose outputs are to be computed.

If we want to perform message passing from nodes 0, 2, and 7 to nodes 4 and 6 in the example above, we would want only nodes 0, 2 and 7 as input nodes, and only nodes 4 and 6 as output nodes.  Input nodes 0, 2 and 7 connect to output node 4, and input nodes 0 and 2 connect to output node 6.  For message passing alone, there would be 3 input nodes and 2 output nodes in total.

#### Compacting the Sampled Graph

DGL provides a function `dgl.to_block` that converts a graph to such a bipartite-structured *block*, which divides the nodes into input nodes and output nodes.

In [5]:
sampled_block = dgl.to_block(sampled_graph, sampled_node_batch)

def print_block_info(sampled_block):
    sampled_input_nodes = sampled_block.srcdata[dgl.NID]
    print('Node ID of input nodes in original graph:', sampled_input_nodes)

    sampled_output_nodes = sampled_block.dstdata[dgl.NID]
    print('Node ID of output nodes in original graph:', sampled_output_nodes)

    sampled_block_edges_src, sampled_block_edges_dst = sampled_block.all_edges()
    # We need to map the src and dst node IDs in the blocks to the node IDs in the original graph.
    sampled_block_edges_src_mapped = sampled_input_nodes[sampled_block_edges_src]
    sampled_block_edges_dst_mapped = sampled_output_nodes[sampled_block_edges_dst]
    print('Edge connections:', sampled_block_edges_src_mapped, sampled_block_edges_dst_mapped)
    
print_block_info(sampled_block)

Node ID of input nodes in original graph: tensor([4, 6, 0, 2, 7])
Node ID of output nodes in original graph: tensor([4, 6])
Edge connections: tensor([0, 2, 7, 0, 2]) tensor([4, 4, 4, 6, 6])


We can see that the input nodes also include nodes 4 and 6, which are the output nodes themselves.  The reason for also including the output nodes among the input nodes is explained in the following sections.

We can see that the edge connections are preserved (i.e. they map to the same ones in `sampled_graph`).
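
The relabeling can be illustrated with a small pure-Python sketch.  It is an assumption-laden imitation, not DGL's implementation: the function name `to_block_sketch` is hypothetical, and the ordering of the extra input nodes (first appearance in the edge list) is chosen only to reproduce the output printed above.

```python
def to_block_sketch(edges, output_nodes):
    """Relabel (src, dst) edges into a bipartite block.

    Returns (input_nodes, output_nodes, local_edges); local_edges index into
    input_nodes on the source side and output_nodes on the destination side.
    """
    output_nodes = list(output_nodes)
    # Output nodes come first on the input side, followed by the remaining
    # source nodes in order of first appearance.
    input_nodes = list(output_nodes)
    seen = set(input_nodes)
    for src, _ in edges:
        if src not in seen:
            seen.add(src)
            input_nodes.append(src)
    src_index = {node: i for i, node in enumerate(input_nodes)}
    dst_index = {node: i for i, node in enumerate(output_nodes)}
    local_edges = [(src_index[s], dst_index[d]) for s, d in edges]
    return input_nodes, output_nodes, local_edges

edges = [(0, 4), (2, 4), (7, 4), (0, 6), (2, 6)]
input_nodes, output_nodes, local_edges = to_block_sketch(edges, [4, 6])
print(input_nodes)   # [4, 6, 0, 2, 7]
print(local_edges)   # [(2, 0), (3, 0), (4, 0), (2, 1), (3, 1)]
```

Mapping the local IDs back through `input_nodes` and `output_nodes` recovers the original edges, which is what `print_block_info` does with `dgl.NID`.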

#### GraphSAGE Layer on Bipartite Graphs

Now, since the sampled block is a bipartite graph, we need to modify our `SAGEConv` module a little bit to make it work on bipartite graphs:

In [6]:
class SAGEConvBipartite(nn.Module):
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.W = nn.Linear(2 * in_feats, out_feats)
        
    def forward(self, g, x):
        # Because g is now a bipartite graph, we now need two input tensors, one on the input
        # side and another on the output side.
        x_src, x_dst = x
        
        with g.local_scope():
            # Aggregate input features of neighbors
            g.srcdata['x'] = x_src                                      # ndata here is changed to srcdata
                                                                        # x is also changed to x_src (input side)
            g.update_all(fn.copy_u('x', 'm'), fn.mean('m', 'x_neigh'))
            # Concatenate aggregation with the node representation itself
            x_neigh = g.dstdata['x_neigh']                              # ndata here is changed to dstdata
            x_concat = torch.cat([x_dst, x_neigh], -1)                  # x is changed to x_dst (output side)
            # Pass the concatenation to an MLP
            y = F.relu(self.W(x_concat))
            return y

We can now compute the output of nodes 4 and 6 from a GraphSAGE layer by applying `SAGEConvBipartite` to `sampled_block`:

In [7]:
sageconv_module = SAGEConvBipartite(INPUT_FEATURES, OUTPUT_FEATURES)

sampled_block_src_features = example_graph.ndata['features'][sampled_block.srcdata[dgl.NID]]
sampled_block_dst_features = example_graph.ndata['features'][sampled_block.dstdata[dgl.NID]]

output_of_sampled_node_batch = sageconv_module(
    sampled_block, (sampled_block_src_features, sampled_block_dst_features))
print(output_of_sampled_node_batch)

tensor([[0.6833, 1.1329, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],
       grad_fn=<ReluBackward0>)
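
Because the numbers above come from random features and weights, the aggregation step is easier to follow with hand-picked values.  The sketch below reproduces only the "aggregate, then concatenate" part on the example block, using plain Python and made-up scalar features (10 times the node ID).  The helper `mean_aggregate_on_block` is hypothetical, and returning `0.0` for an output node with no incoming edge is a choice made for this sketch.

```python
def mean_aggregate_on_block(local_edges, src_features, num_dst):
    """Average the source features arriving at each destination node."""
    sums = [0.0] * num_dst
    counts = [0] * num_dst
    for src, dst in local_edges:
        sums[dst] += src_features[src]
        counts[dst] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Block from the example: input nodes [4, 6, 0, 2, 7], output nodes [4, 6].
local_edges = [(2, 0), (3, 0), (4, 0), (2, 1), (3, 1)]
src_features = [40.0, 60.0, 0.0, 20.0, 70.0]  # made-up: 10 * original node ID
num_dst = 2

neighbor_means = mean_aggregate_on_block(local_edges, src_features, num_dst)
# The output nodes are the first `num_dst` input nodes, so their own features
# are the leading entries of `src_features`.
concatenated = [(src_features[i], neighbor_means[i]) for i in range(num_dst)]
print(concatenated)
# [(40.0, 30.0), (60.0, 10.0)]
```

Node 4 averages the features of nodes 0, 2 and 7, and node 6 averages those of nodes 0 and 2; the linear layer and ReLU would then be applied to each concatenated pair.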


#### Using DGL's Built-in Layers

DGL's built-in neural network layers currently work on both homogeneous graphs and bipartite graphs, so you don't have to write `SAGEConvBipartite` on your own; you can simply reuse `dgl.nn.SAGEConv`, which is also what we recommend:

In [8]:
import dgl.nn as dglnn
sageconv_module = dglnn.SAGEConv(INPUT_FEATURES, OUTPUT_FEATURES, 'mean', activation=F.relu)

sampled_block_src_features = example_graph.ndata['features'][sampled_block.srcdata[dgl.NID]]
sampled_block_dst_features = example_graph.ndata['features'][sampled_block.dstdata[dgl.NID]]

output_of_sampled_node_batch = sageconv_module(
    sampled_block, (sampled_block_src_features, sampled_block_dst_features))
print(output_of_sampled_node_batch)

tensor([[0.0000, 0.0000, 1.1611, 0.0000, 1.4844, 0.0000],
        [1.2050, 0.6417, 0.0000, 1.0638, 0.0000, 0.0000]],
       grad_fn=<ReluBackward0>)


### Multiple Layers

Now we wish to compute the output of nodes 4 and 6 from a 2-layer GraphSAGE.  This requires the input features of not only the nodes themselves and their neighbors, but also the neighbors of these neighbors.

![](assets/graph_2layer_46.png)

To compute the 2-layer output of nodes 4 and 6, we first need the 1-layer output of nodes 4 and 6 as well as that of their neighbors (nodes 7, 0, and 2).  To obtain the 1-layer output of all these nodes, we in turn need the input features of these nodes (nodes 4, 6, 7, 0, 2) as well as those of *their* neighbors (nodes 10, 9, 8, 1, and 3).  This is one reason why `dgl.to_block` also includes the output nodes in the input nodes.

We can see that generating the computation dependency for multi-layer GNNs is a bottom-up process: we start from the output layer and grow the node set towards the input layer.
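
The growth of the node set can be traced with a few lines of plain Python.  This is a sketch on the example graph's edge list, with hypothetical helper names; it tracks only node sets, not edges or blocks.

```python
UNDIRECTED_EDGES = [
    (0, 2), (0, 4), (0, 6), (0, 7), (0, 8), (0, 9), (0, 10),
    (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 6), (3, 5),
    (3, 8), (4, 7), (8, 9), (8, 11), (9, 10), (9, 11)]

def build_neighbors(undirected_edges):
    """Map every node to the set of its neighbors."""
    neighbors = {}
    for u, v in undirected_edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors

def node_sets_per_layer(neighbors, seeds, num_layers):
    """Return the required node sets, ordered from input layer to output layer."""
    frontier = set(seeds)
    layers = [frontier]
    for _ in range(num_layers):
        # The layer below needs the current nodes plus all of their neighbors.
        frontier = frontier.union(*(neighbors[n] for n in frontier))
        # Generated from the output side, so prepend.
        layers.insert(0, frontier)
    return layers

neighbors = build_neighbors(UNDIRECTED_EDGES)
layers = node_sets_per_layer(neighbors, [4, 6], num_layers=2)
for node_set in layers:
    print(sorted(node_set))
# [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
# [0, 2, 4, 6, 7]
# [4, 6]
```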

The following sampler generates the computation dependency of a multi-layer GNN and returns it as a list of blocks, one per layer.

In [9]:
class FullNeighborBlockSampler(object):
    def __init__(self, g, num_layers):
        self.g = g
        self.num_layers = num_layers
        
    def sample(self, seeds):
        blocks = []
        for i in range(self.num_layers):
            sampled_graph = dgl.in_subgraph(self.g, seeds)
            sampled_block = dgl.to_block(sampled_graph, seeds)
            seeds = sampled_block.srcdata[dgl.NID]
            # Because the computation dependency is generated bottom-up, we prepend the new block instead of
            # appending it.
            blocks.insert(0, sampled_block)
            
        return blocks

In [10]:
block_sampler = FullNeighborBlockSampler(example_graph, 2)
sampled_blocks = block_sampler.sample(sampled_node_batch)

print('Block for first layer')
print('---------------------')
print_block_info(sampled_blocks[0])
print()
print('Block for second layer')
print('----------------------')
print_block_info(sampled_blocks[1])

Block for first layer
---------------------
Node ID of input nodes in original graph: tensor([ 4,  6,  0,  2,  7,  8,  9, 10,  1,  3])
Node ID of output nodes in original graph: tensor([4, 6, 0, 2, 7])
Edge connections: tensor([ 0,  2,  7,  0,  2,  2,  4,  6,  7,  8,  9, 10,  0,  4,  6,  1,  3,  0,
         4]) tensor([4, 4, 4, 6, 6, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 7, 7])

Block for second layer
----------------------
Node ID of input nodes in original graph: tensor([4, 6, 0, 2, 7])
Node ID of output nodes in original graph: tensor([4, 6])
Edge connections: tensor([0, 2, 7, 0, 2]) tensor([4, 4, 4, 6, 6])


Message propagation, in contrast to computation dependency generation, is a top-down process: we start from the input layer and compute the representations towards the output layer.

The following code defines a multi-layer GraphSAGE.  It takes in a list of blocks generated by the block sampler above.

In [11]:
class SAGENet(nn.Module):
    def __init__(self, n_layers, in_feats, out_feats, hidden_feats=None):
        super().__init__()
        self.convs = nn.ModuleList()
        
        if hidden_feats is None:
            hidden_feats = out_feats
        
        if n_layers == 1:
            self.convs.append(dglnn.SAGEConv(in_feats, out_feats, 'mean'))
        else:
            self.convs.append(dglnn.SAGEConv(in_feats, hidden_feats, 'mean', activation=F.relu))
            for i in range(n_layers - 2):
                self.convs.append(dglnn.SAGEConv(hidden_feats, hidden_feats, 'mean', activation=F.relu))
            self.convs.append(dglnn.SAGEConv(hidden_feats, out_feats, 'mean'))
        
    def forward(self, blocks, input_features):
        """
        blocks : List of blocks generated by block sampler.
        input_features : Input features of the first block.
        """
        h = input_features
        for layer, block in zip(self.convs, blocks):
            h = self.propagate(block, h, layer)
        return h
    
    def propagate(self, block, h, layer):
        # Because GraphSAGE requires not only the features of the neighbors, but also the features
        # of the output nodes themselves on the current layer, we need to copy the output node features
        # from the input side to the output side ourselves to make GraphSAGE work correctly.
        # The output nodes of a block are guaranteed to appear first among the input nodes, so we can
        # conveniently write:
        h_dst = h[:block.number_of_dst_nodes()]
        h = layer(block, (h, h_dst))
        return h

In [12]:
sagenet = SAGENet(2, INPUT_FEATURES, OUTPUT_FEATURES)

# The input nodes for computing 2-layer GraphSAGE output on the given output nodes can be obtained like this:
sampled_input_nodes = sampled_blocks[0].srcdata[dgl.NID]

# Get the input features.
# In real life we want to copy this to GPU.  But in this hands-on tutorial we don't have GPUs.
sampled_input_features = example_graph.ndata['features'][sampled_input_nodes]

output_of_sampled_node_batch = sagenet(sampled_blocks, sampled_input_features)
print(output_of_sampled_node_batch)

tensor([[ 2.5556,  0.2520, -0.7534,  0.7471,  2.5092,  3.3257],
        [ 1.1765,  0.5756, -1.0913, -0.1211,  0.5537,  1.8392]],
       grad_fn=<AddBackward0>)


## Neighborhood Sampling

We may notice in the above example that the 2-hop neighborhood already covers almost the entire graph (10 of the 12 nodes).  In real-world graphs, whose node degrees often follow a power-law distribution (i.e. there exist a few "hub" nodes with lots of edges), we indeed often observe that even for a small set of output nodes of a multi-layer GNN, the input nodes cover a large part of the graph.  The goal of saving GPU memory is thus defeated again in this setting.

Neighborhood sampling offers a solution by *not* taking all neighbors of every node during computation dependency generation.  Instead, we pick a small subset of neighbors and estimate the aggregation over all neighbors from this subset.  This trick often not only reduces memory consumption, but can also improve model generalization.
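
For a single mean aggregation, the estimate is correct on average: under uniform sampling without replacement, the mean over a sampled subset is an unbiased estimate of the mean over all neighbors.  The sketch below checks this by brute force on made-up scalar features.  It says nothing about the effect of nonlinearities or of stacking several sampled layers.

```python
from itertools import combinations

def mean(values):
    values = list(values)
    return sum(values) / len(values)

neighbor_features = [1.0, 4.0, 2.0, 9.0, 5.0, 3.0, 11.0]  # made-up values
fanout = 2

full_mean = mean(neighbor_features)
# Enumerate every possible sample of `fanout` neighbors; each is equally likely
# under uniform sampling, so averaging the sampled means gives the expectation.
sampled_means = [mean(subset) for subset in combinations(neighbor_features, fanout)]
expected_sampled_mean = mean(sampled_means)

print(full_mean, expected_sampled_mean)
print(min(sampled_means), max(sampled_means))
```

Individual samples can still be far from the full mean (compare the minimum and maximum printed above), which is the variance one accepts in exchange for the smaller memory footprint.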

DGL provides a function `dgl.sampling.sample_neighbors` for uniformly sampling a fixed number of neighbors for each node.  One can also replace `dgl.sampling.sample_neighbors` with any other neighborhood sampling algorithm (including your own).

In [13]:
class NeighborSampler(object):
    def __init__(self, g, num_fanouts):
        """
        num_fanouts : list of fanouts on each layer.
        """
        self.g = g
        self.num_fanouts = num_fanouts
        
    def sample(self, seeds):
        seeds = torch.LongTensor(seeds)
        blocks = []
        for fanout in reversed(self.num_fanouts):
            # We simply switch from in_subgraph to sample_neighbors for neighbor sampling.
            sampled_graph = dgl.sampling.sample_neighbors(self.g, seeds, fanout)
            
            sampled_block = dgl.to_block(sampled_graph, seeds)
            seeds = sampled_block.srcdata[dgl.NID]
            # Because the computation dependency is generated bottom-up, we prepend the new block instead of
            # appending it.
            blocks.insert(0, sampled_block)
            
        return blocks

In [14]:
block_sampler = NeighborSampler(example_graph, [2, 2])
sampled_blocks = block_sampler.sample(sampled_node_batch)

print('Block for first layer')
print('---------------------')
print_block_info(sampled_blocks[0])
print()
print('Block for second layer')
print('----------------------')
print_block_info(sampled_blocks[1])

Block for first layer
---------------------
Node ID of input nodes in original graph: tensor([ 4,  6,  7,  2,  0,  1,  3, 10])
Node ID of output nodes in original graph: tensor([4, 6, 7, 2, 0])
Edge connections: tensor([ 0,  7,  0,  2,  0,  4,  1,  3,  2, 10]) tensor([4, 4, 6, 6, 7, 7, 2, 2, 0, 0])

Block for second layer
----------------------
Node ID of input nodes in original graph: tensor([4, 6, 7, 2, 0])
Node ID of output nodes in original graph: tensor([4, 6])
Edge connections: tensor([7, 2, 0, 2]) tensor([4, 4, 6, 6])


We can see that each output node now has at most 2 neighbors.
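
The fanout limit itself is easy to imitate in plain Python.  The following sketch is a stand-in written for this explanation, not DGL's sampler: the helper `sample_in_neighbors` is hypothetical, and the adjacency lists cover only a few nodes of the example graph.

```python
import random

def sample_in_neighbors(neighbors, seeds, fanout, rng):
    """Pick at most `fanout` neighbors per seed, uniformly without replacement."""
    sampled_edges = []
    for dst in seeds:
        candidates = sorted(neighbors[dst])
        k = min(fanout, len(candidates))  # keep all neighbors if there are few
        for src in rng.sample(candidates, k):
            sampled_edges.append((src, dst))
    return sampled_edges

# Neighbors of a few nodes of the example graph.
neighbors = {4: [0, 2, 7], 6: [0, 2], 0: [2, 4, 6, 7, 8, 9, 10]}
rng = random.Random(0)  # fixed seed only to make the run repeatable
sampled_edges = sample_in_neighbors(neighbors, [4, 6, 0], fanout=2, rng=rng)
print(sampled_edges)
```

Node 0 has seven neighbors but contributes only two edges, which is why the sampled blocks stay small even around hub nodes.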

The code for message passing on blocks generated with neighborhood sampling does not change at all.

In [15]:
sagenet = SAGENet(2, INPUT_FEATURES, OUTPUT_FEATURES)

# The input nodes for computing 2-layer GraphSAGE output on the given output nodes can be obtained like this:
sampled_input_nodes = sampled_blocks[0].srcdata[dgl.NID]

# Get the input features.
# In real life we want to copy this to GPU.  But in this hands-on tutorial we don't have GPUs.
sampled_input_features = example_graph.ndata['features'][sampled_input_nodes]

output_of_sampled_node_batch = sagenet(sampled_blocks, sampled_input_features)
print(output_of_sampled_node_batch)

tensor([[ 1.4713,  1.4072,  2.4270,  2.9882,  2.9500, -1.4210],
        [ 0.5040,  0.1228,  1.1560,  0.7806,  1.7027, -0.5086]],
       grad_fn=<AddBackward0>)


### Inference with Models Trained with Neighbor Sampling

Recall that modules such as Dropout or batch normalization have different formulations in training and inference.  The reason is that we do not wish to introduce any randomness during inference or model deployment.  Similarly, we do not want to sample neighbors during inference; aggregation should be performed over all neighbors, without sampling, to eliminate randomness.  However, directly using the multi-layer `FullNeighborBlockSampler` would still cost a lot of memory even during inference, because of the large number of input nodes being covered.

The solution is to compute the representations of all nodes one layer at a time.  More specifically, for a multi-layer GraphSAGE model, we first compute the representations of all nodes on the 1st GraphSAGE layer, using a 1-layer `FullNeighborBlockSampler` to take all neighbors into account.  These representations are computed in minibatches.  Once all the representations from the 1st GraphSAGE layer are computed, we start from there and compute the representations of all nodes on the 2nd GraphSAGE layer.  We repeat the process until we reach the last layer.

In [16]:

def inference_with_sagenet(sagenet, graph, input_features, batch_size):
    block_sampler = FullNeighborBlockSampler(graph, 1)
    h = input_features
    
    with torch.no_grad():
        # We are computing all representations of one layer at a time.
        # The outer loop iterates over GNN layers.
        for conv in sagenet.convs:
            new_h_list = []
            node_ids = torch.arange(graph.number_of_nodes())
            # The inner loop iterates over batch of nodes.
            for batch_start in range(0, graph.number_of_nodes(), batch_size):
                # Sample a block with full neighbors of the current node batch
                block = block_sampler.sample(node_ids[batch_start:batch_start+batch_size])[0]
                # Get the necessary input node IDs for this node batch on this layer
                input_node_ids = block.srcdata[dgl.NID]
                # Get the input features
                h_input = h[input_node_ids]
                # Compute the output of this node batch on this layer
                new_h = sagenet.propagate(block, h_input, conv)
                new_h_list.append(new_h)
            # We finished computing all representations on this layer.  We need to compute the
            # representations of next layer.
            h = torch.cat(new_h_list)
        
    return h
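
That the layer-at-a-time schedule gives the same result as expanding the full multi-hop dependency of every node can be checked on a toy model.  Everything in the sketch below is made up for illustration: a 4-node graph, scalar features, and a "layer" that is just a weighted sum of a node's own value and the mean of its neighbors.

```python
def layer_fn(h_self, h_neighbors, weights):
    """Toy layer: weighted sum of a node's own value and its neighbors' mean."""
    w_self, w_neigh = weights
    return w_self * h_self + w_neigh * sum(h_neighbors) / len(h_neighbors)

def layerwise_inference(neighbors, features, layer_weights, batch_size):
    h = dict(features)
    node_ids = sorted(neighbors)
    for weights in layer_weights:                          # outer loop: layers
        new_h = {}
        for start in range(0, len(node_ids), batch_size):  # inner loop: node batches
            for u in node_ids[start:start + batch_size]:
                new_h[u] = layer_fn(h[u], [h[v] for v in neighbors[u]], weights)
        h = new_h                                          # input of the next layer
    return h

def recursive_output(neighbors, features, layer_weights, u, depth):
    """Compute one node's output by expanding its full multi-hop dependency."""
    if depth == 0:
        return features[u]
    own = recursive_output(neighbors, features, layer_weights, u, depth - 1)
    below = [recursive_output(neighbors, features, layer_weights, v, depth - 1)
             for v in neighbors[u]]
    return layer_fn(own, below, layer_weights[depth - 1])

neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
features = {0: 1.0, 1: 2.0, 2: -1.0, 3: 0.5}
layer_weights = [(0.5, 1.0), (2.0, -1.0)]

layerwise = layerwise_inference(neighbors, features, layer_weights, batch_size=2)
direct = {u: recursive_output(neighbors, features, layer_weights, u, len(layer_weights))
          for u in neighbors}
print(layerwise)
print(direct)
```

The layer-wise version only ever needs 1-hop neighborhoods of one batch at a time, at the cost of storing the intermediate representation of every node.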

In [17]:
print(inference_with_sagenet(sagenet, example_graph, example_graph.ndata['features'], 2))

tensor([[ 1.7681e+00, -3.3778e-02,  8.7602e-03, -9.2630e-01, -1.4338e+00,
          1.1513e+00],
        [-8.6763e-01,  5.2748e-01,  1.5401e-01,  1.1183e+00,  4.7106e-01,
          3.1708e-01],
        [ 6.2210e-01,  1.4617e-02,  7.6488e-01,  2.8973e-01,  6.6038e-01,
          7.6737e-01],
        [ 1.9717e+00, -4.4922e-01, -1.0719e-03, -8.4052e-01, -2.3791e-01,
          6.2204e-01],
        [ 1.2272e+00,  1.1383e+00,  2.0385e+00,  2.7903e+00,  2.5650e+00,
         -7.2432e-01],
        [ 7.9594e-01,  8.5921e-02,  3.5123e-01,  4.9928e-01,  9.7032e-01,
          8.0882e-01],
        [ 5.9009e-02,  4.4664e-01,  9.2594e-01,  1.3676e+00,  1.1535e+00,
          4.4968e-01],
        [ 1.6940e+00,  8.8744e-02,  1.0162e+00,  7.7402e-01,  8.0411e-01,
          9.1836e-01],
        [-2.1243e-01,  1.0425e+00,  8.5567e-01,  1.0732e+00,  1.9559e-01,
          1.4029e+00],
        [ 1.7215e-01,  1.8499e+00,  3.0013e+00,  2.5618e+00,  3.0003e+00,
         -8.3607e-01],
        [ 4.7615e-01,  1.0388e

## Putting Together

Now let's see how we could apply stochastic training to a node classification task.  We take the PubMed dataset as an example.

### Load Dataset

In [18]:
import dgl.data

dataset = dgl.data.citation_graph.load_pubmed()

# Set features and labels for each node
graph = dgl.graph(dataset.graph)
graph.ndata['features'] = torch.FloatTensor(dataset.features)
graph.ndata['labels'] = torch.LongTensor(dataset.labels)

# Find the node IDs in the training, validation, and test set.
train_nid = dataset.train_mask.nonzero()[0]
val_nid = dataset.val_mask.nonzero()[0]
test_nid = dataset.test_mask.nonzero()[0]

Finished data loading and preprocessing.
  NumNodes: 19717
  NumEdges: 88651
  NumFeats: 500
  NumClasses: 3
  NumTrainingSamples: 60
  NumValidationSamples: 500
  NumTestSamples: 1000


### Define Neighbor Sampler

We can reuse our neighbor sampler code above.

In [19]:
neighbor_sampler = NeighborSampler(graph, [10, 10])

### Define DataLoader

PyTorch generates minibatches with a `DataLoader` object, and we can use it here as well.

Note that to compute the output of a minibatch of nodes, we need a list of blocks as described above.  Therefore, we need to change the `collate_fn` argument, which defines how individual examples are composed into a minibatch.

In [20]:
import torch.utils.data

BATCH_SIZE = 5

train_dataloader = torch.utils.data.DataLoader(
    train_nid, batch_size=BATCH_SIZE, collate_fn=neighbor_sampler.sample, shuffle=True)
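
To see what role `collate_fn` plays, here is a heavily simplified, torch-free imitation of the batching loop.  It is only a sketch: `iterate_minibatches` and `toy_collate` are hypothetical names, and the real `DataLoader` does considerably more (worker processes, samplers, and so on).

```python
import random

def iterate_minibatches(node_ids, batch_size, collate_fn, shuffle, rng):
    """Yield collate_fn(batch) for consecutive batches of node IDs."""
    order = list(node_ids)
    if shuffle:
        rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        # The collate function receives the raw list of examples (seed node IDs)
        # and decides what a "minibatch" looks like.
        yield collate_fn(order[start:start + batch_size])

def toy_collate(seed_nodes):
    # Stand-in for `neighbor_sampler.sample`, which would return a list of
    # blocks for these seeds instead of a plain dictionary.
    return {'seeds': sorted(seed_nodes), 'num_seeds': len(seed_nodes)}

batches = list(iterate_minibatches(range(12), 5, toy_collate, True, random.Random(0)))
for batch in batches:
    print(batch)
```

Passing the sampler's `sample` method as the collate function means that neighbor sampling happens at batch construction time, once per minibatch.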

### Define Model and Optimizer

In [21]:
HIDDEN_FEATURES = 10
model = SAGENet(2, dataset.features.shape[1], dataset.num_labels, HIDDEN_FEATURES)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)

### Evaluation

In [22]:
def compute_accuracy(pred, labels):
    return (pred.argmax(1) == labels).float().mean().item()

### Training Loop

In [23]:
NUM_EPOCHS = 50
EVAL_BATCH_SIZE = 1000
for epoch in range(NUM_EPOCHS):
    # `model` (not the earlier `sagenet`) is the network being optimized here.
    model.train()
    for blocks in train_dataloader:
        input_nodes = blocks[0].srcdata[dgl.NID]
        output_nodes = blocks[-1].dstdata[dgl.NID]
        
        input_features = graph.ndata['features'][input_nodes]
        output_labels = graph.ndata['labels'][output_nodes]
        
        output_predictions = model(blocks, input_features)
        loss = F.cross_entropy(output_predictions, output_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        
    model.eval()
    all_predictions = inference_with_sagenet(model, graph, graph.ndata['features'], EVAL_BATCH_SIZE)
    
    val_predictions = all_predictions[val_nid]
    val_labels = graph.ndata['labels'][val_nid]
    test_predictions = all_predictions[test_nid]
    test_labels = graph.ndata['labels'][test_nid]
    
    print('Validation acc:', compute_accuracy(val_predictions, val_labels),
          'Test acc:', compute_accuracy(test_predictions, test_labels))

Validation acc: 0.3919999897480011 Test acc: 0.41600000858306885
Validation acc: 0.3959999978542328 Test acc: 0.4180000126361847
Validation acc: 0.40799999237060547 Test acc: 0.42500001192092896
Validation acc: 0.4359999895095825 Test acc: 0.4390000104904175
Validation acc: 0.4860000014305115 Test acc: 0.4860000014305115
Validation acc: 0.6200000047683716 Test acc: 0.6029999852180481
Validation acc: 0.6759999990463257 Test acc: 0.6769999861717224
Validation acc: 0.6620000004768372 Test acc: 0.6299999952316284
Validation acc: 0.6880000233650208 Test acc: 0.6729999780654907
Validation acc: 0.7239999771118164 Test acc: 0.7129999995231628
Validation acc: 0.7080000042915344 Test acc: 0.7070000171661377
Validation acc: 0.7099999785423279 Test acc: 0.7099999785423279
Validation acc: 0.6959999799728394 Test acc: 0.6880000233650208
Validation acc: 0.6980000138282776 Test acc: 0.6990000009536743
Validation acc: 0.7260000109672546 Test acc: 0.7089999914169312
Validation acc: 0.6959999799728394 Te