In [1]:
# Data processing
using MLDatasets;
using MLUtils: DataLoader;
using MLDataPattern;
using ImageCore;
using Augmentor;
using ImageFiltering;
using MappedArrays;
using Random;
using Flux: DataLoader;
using Functors;
using Optimisers;
using Zygote;
using Statistics;
using Yota;
using TimerOutputs;

In [2]:
using Flux;

# Data Pre-processing

* Inputs: batches of (32 x 32) RGB images
    * Tensor size (32, 32, 3, N) in WHCN dimensions
    * Values in the range [0, 1]
* For all data: ImageNet normalization
    * Subtract means [0.485, 0.456, 0.406]
    * Divide by standard deviations [0.229, 0.224, 0.225]
* Augment training data only:
    * 4 pixel zero padding on each spatial side (40, 40, 3, N)
    * Permute each image to CWH (3, 40, 40)
    * Convert to an RGB image for the Augmentor.jl package to process (40, 40)
    * Random horizontal flip
    * Random (32 x 32) crop from the padded image (32, 32)
    * Convert back to a tensor (3, 32, 32)
    * Permute to WHC (32, 32, 3), collected into WHCN batches (32, 32, 3, N)
* Batch and shuffle data

In [3]:
train_data = MLDatasets.CIFAR10(Tx=Float32, split=:train)
test_data = MLDatasets.CIFAR10(Tx=Float32, split=:test)

dataset CIFAR10:
  metadata  =>    Dict{String, Any} with 2 entries
  split     =>    :test
  features  =>    32×32×3×10000 Array{Float32, 4}
  targets   =>    10000-element Vector{Int64}

In [4]:
train_x = train_data.features;
train_y = train_data.targets;

test_x = test_data.features;
test_y = test_data.targets;
size(train_x), size(test_x)  # Data is in WHCN layout (width, height, channels, batch)

((32, 32, 3, 50000), (32, 32, 3, 10000))

In [5]:
# Train-validation split
# Adapted from https://github.com/JuliaML/MLUtils.jl/blob/v0.2.11/src/splitobs.jl#L65
# obsview doesn't work with this data, so use getobs instead

import MLDataPattern.splitobs;

function splitobs(data; at, shuffle::Bool=false)
    if shuffle
        data = shuffleobs(data)
    end
    n = numobs(data)
    return map(idx -> MLDataPattern.getobs(data, idx), splitobs(n, at))
end

splitobs (generic function with 11 methods)

In [6]:
train, val = splitobs((train_x, train_y), at=0.9, shuffle=true);

train_x, train_y = train;
val_x, val_y = val;

size(train_x), size(val_x)

((32, 32, 3, 45000), (32, 32, 3, 5000))
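
The split above can be pictured as shuffling the observation indices and cutting them at a fraction `at`. The following is a minimal sketch in plain Julia, not the MLDataPattern or MLUtils implementation; `split_indices` is a hypothetical helper introduced only for illustration.

```julia
using Random

# Hypothetical helper: shuffle 1:n and cut the permutation at the fraction `at`
function split_indices(n::Int, at::Real; rng=Random.default_rng())
    idx = randperm(rng, n)            # random permutation of 1:n
    n_first = round(Int, at * n)      # size of the first part
    return idx[1:n_first], idx[n_first+1:end]
end

train_idx, val_idx = split_indices(50_000, 0.9)
length(train_idx), length(val_idx)
```

With 50000 observations and `at=0.9` the two parts have 45000 and 5000 indices, matching the sizes printed above.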

In [7]:
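# Normalize all the data with the ImageNet channel statistics
# Float32 constants keep the normalized data in Float32 (Float64 constants would promote it)

means = reshape(Float32[0.485, 0.456, 0.406], (1, 1, 3, 1))
stdevs = reshape(Float32[0.229, 0.224, 0.225], (1, 1, 3, 1))
normalize(x) = (x .- means) ./ stdevs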

train_x = normalize(train_x);
val_x = normalize(val_x);
test_x = normalize(test_x);
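
As a small illustration of why the statistics are reshaped to (1, 1, 3, 1): broadcasting then applies one mean and one standard deviation per channel across all pixels and observations. The toy batch below is made up for the example and is not CIFAR10 data.

```julia
# Toy WHCN batch: 2x2 pixels, 3 channels, 2 observations, every value 0.5
x = fill(0.5f0, 2, 2, 3, 2)

# Per-channel statistics shaped so they broadcast along the channel dimension
ch_means  = reshape(Float32[0.485, 0.456, 0.406], (1, 1, 3, 1))
ch_stdevs = reshape(Float32[0.229, 0.224, 0.225], (1, 1, 3, 1))

z = (x .- ch_means) ./ ch_stdevs

size(z), eltype(z)
```

Every pixel in a given channel is shifted and scaled by that channel's statistics, and the element type stays `Float32` because the constants are `Float32`.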

In [8]:
# Notebook testing: Use less data
train_x, train_y = MLDatasets.getobs((train_x, train_y), 1:500);

val_x, val_y = MLDatasets.getobs((val_x, val_y), 1:50);

test_x, test_y = MLDatasets.getobs((test_x, test_y), 1:50);

# Data augmentation pipeline with Augmentor.jl

By default, Augmentor.jl treats the last dimension as the batch (observation) dimension.

In [9]:
# Pad the training data with 4 zero pixels on each spatial side for the random crop
train_x_padded = padarray(train_x, Fill(0, (4, 4, 0, 0)));
size(train_x_padded)  # Should be (40, 40, 3, N), with N the number of training images

(40, 40, 3, 500)
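
For reference, the same zero padding can be written with plain arrays. This is only a sketch of the idea, not what `padarray` does internally; `zero_pad_spatial` is a hypothetical helper.

```julia
# Hypothetical helper: zero-pad the two spatial dimensions of a WHCN array by `p` pixels per side
function zero_pad_spatial(x::AbstractArray{T,4}, p::Int) where {T}
    w, h, c, n = size(x)
    out = zeros(T, w + 2p, h + 2p, c, n)
    # Copy the original data into the centre of the padded array
    out[p+1:p+w, p+1:p+h, :, :] .= x
    return out
end

toy = ones(Float32, 32, 32, 3, 5)
toy_padded = zero_pad_spatial(toy, 4)
size(toy_padded)
```

Only the width and height grow (32 to 40); the channel and batch dimensions are unchanged.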

In [10]:
pl = PermuteDims((3, 1, 2)) |> CombineChannels(RGB) |> Either(FlipX(), NoOp()) |> RCropSize(32, 32) |> SplitChannels() |> PermuteDims((2, 3, 1))


6-step Augmentor.ImmutablePipeline:
 1.) Permute dimension order to (3, 1, 2)
 2.) Combine color channels into colorant RGB
 3.) Either: (50%) Flip the X axis. (50%) No operation.
 4.) Crop random window with size (32, 32)
 5.) Split colorant into its color channels
 6.) Permute dimension order to (2, 3, 1)
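
To make steps 3 and 4 of the pipeline concrete, here is a rough plain-Julia sketch of a random flip followed by a random crop on a single padded WHC image. It is not Augmentor.jl code: `flip_and_crop` is a hypothetical helper, and it assumes that a horizontal flip means reversing the first (width) axis of a WHC array.

```julia
using Random

# Hypothetical helper: random flip + random crop of one padded WHC image
function flip_and_crop(img::AbstractArray{T,3}, out::Int; rng=Random.default_rng()) where {T}
    w, h, _ = size(img)
    # Flip with probability 0.5 (assumption: width is the first axis)
    if rand(rng, Bool)
        img = reverse(img; dims=1)
    end
    # Pick a random top-left corner so the crop stays inside the padded image
    x0 = rand(rng, 1:(w - out + 1))
    y0 = rand(rng, 1:(h - out + 1))
    return img[x0:x0+out-1, y0:y0+out-1, :]
end

padded_img = rand(Float32, 40, 40, 3)
cropped = flip_and_crop(padded_img, 32)
size(cropped)
```

With 4 pixels of padding per side there are 9 possible crop offsets along each axis, so the crop shifts the original image by up to 4 pixels in any direction.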

In [11]:
# Create an output array for augmented images
outbatch(X) = Array{Float32}(undef, (32, 32, 3, nobs(X)))

outbatch (generic function with 1 method)

In [12]:
# Function that takes a batch (images and targets) and augments the images
augmentbatch((X, y)) = (augmentbatch!(outbatch(X), X, pl), y)

augmentbatch (generic function with 1 method)

In [13]:
# Shuffled and batched dataset of augmented images
train_batch_size = 16

train_batches = mappedarray(augmentbatch, batchview(shuffleobs((train_x_padded, train_y)), size=train_batch_size));



In [14]:
# Test and Validation data
test_batch_size = 32

val_loader = DataLoader((val_x, val_y), shuffle=true, batchsize=test_batch_size);
test_loader = DataLoader((test_x, test_y), shuffle=true, batchsize=test_batch_size);
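
Conceptually, a shuffled, batched loader iterates over index groups like the ones built below. This is a simplified sketch in plain Julia rather than the `DataLoader` implementation; `minibatch_indices` and its keyword names are hypothetical.

```julia
using Random

# Hypothetical helper: split 1:n into (optionally shuffled) groups of at most `batchsize` indices
function minibatch_indices(n::Int, batchsize::Int; do_shuffle::Bool=true, rng=Random.default_rng())
    idx = do_shuffle ? randperm(rng, n) : collect(1:n)
    # The last group is smaller when n is not a multiple of batchsize
    return [collect(part) for part in Iterators.partition(idx, batchsize)]
end

index_batches = minibatch_indices(50, 32)
length.(index_batches)
```

For the 50 validation images used in this notebook and a batch size of 32, this gives one full batch and one partial batch of 18.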

## 2D Convolution in Flux


**Flux.Conv — Type** 

Conv(filter, in => out, σ = identity;
     stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])


Standard convolutional layer. _filter_ is a tuple of integers specifying the size of the convolutional kernel; _in_ and _out_ specify the number of input and output channels.

Image data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. This has N = 2 spatial dimensions, and needs a kernel size like (5,5), a 2-tuple of integers.

To take convolutions along N feature dimensions, this layer expects as input an array with ndims(x) == N+2, where size(x, N+1) == in is the number of input channels, and size(x, ndims(x)) is (as always) the number of observations in a batch. Then:

- filter should be a tuple of N integers.
- Keywords stride and dilation should each be either a single integer, or a tuple with N integers.
- Keyword pad specifies the number of elements added to the borders of the data array. It can be
    - a single integer for equal padding all around,
    - a tuple of N integers, to apply the same padding at begin/end of each spatial dimension,
    - a tuple of 2*N integers, for asymmetric padding, or
    - the singleton _SamePad()_, to calculate padding such that size(output,d) == size(x,d) / stride (possibly rounded) for each spatial dimension.
- Keyword groups is expected to be an Int. It specifies the number of groups to divide a convolution into.

Keywords to control initialization of the layer:

- init - Function used to generate initial weights. Defaults to glorot_uniform.
- bias - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to false, or another vector can be provided such as bias = randn(Float32, out).
            
            
**Flux.Conv - Method**
_Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])_

Constructs a convolutional layer with the given weight and bias. Accepts the same keywords and has the same defaults as Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...).
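
The effect of `stride`, `pad`, and `dilation` on the output size can be checked with the usual convolution arithmetic. The helper below is a sketch for one spatial dimension with symmetric integer padding; `conv_output_size` is a hypothetical name, not a Flux function.

```julia
# Hypothetical helper: output length of a convolution along one spatial dimension,
# assuming the same integer padding `pad` on both sides
function conv_output_size(in_size::Int, kernel::Int; stride::Int=1, pad::Int=0, dilation::Int=1)
    effective_kernel = dilation * (kernel - 1) + 1
    return fld(in_size + 2 * pad - effective_kernel, stride) + 1
end

conv_output_size(100, 5)                   # 5x5 kernel, no padding
conv_output_size(32, 3; pad=1)             # 3x3 kernel with pad=1 keeps the size
conv_output_size(64, 3; stride=2, pad=1)   # stride 2 halves the size
```

The last two cases correspond to the 3x3 convolutions with `pad=1` used in the residual layers below, with stride 1 and stride 2 respectively.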

## ResNet Layer

In [15]:
mutable struct ResNetLayer
    conv1::Flux.Conv
    conv2::Flux.Conv
    bn1::Flux.BatchNorm
    bn2::Flux.BatchNorm
    f::Function
    in_channels::Int
    channels::Int
    stride::Int
    # stride2::Int
    # pad1::Int
    # pad2::Int
end

@functor ResNetLayer (conv1, conv2, bn1, bn2)

In [16]:
function residual_identity(layer::ResNetLayer, x::AbstractArray{T, 4}) where {T<:Number}
    (w, h, c, b) = size(x)
    stride = layer.stride
    if stride > 1
        @assert (w % stride == 0 && h % stride == 0) "Spatial dimensions are not divisible by `stride`"
    
        # Strided downsample
        inds = CartesianIndices((1:stride:w, 1:stride:h))
        x_id = copy(x[inds, :, :])
    else
        x_id = x
    end

    channels = layer.channels
    in_channels = layer.in_channels
    if in_channels < channels
        # Zero padding on extra channels
        (w, h, c, b) = size(x_id)
        pad = zeros(T, w, h, channels - in_channels, b)
        x_id = cat(x_id, pad; dims=3)
    elseif in_channels > channels
        error("in_channels > out_channels not supported")
    end
    return x_id
end

residual_identity (generic function with 1 method)
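
The two operations inside `residual_identity` can be seen on a tiny array: strided indexing keeps every `stride`-th pixel, and concatenating zeros along dimension 3 adds the extra channels. The toy values below are made up for illustration.

```julia
# Toy WHCN input: 4x4 pixels, 2 channels, 1 observation, filled with 1, 2, 3, ...
x_toy = reshape(Float32.(1:32), 4, 4, 2, 1)

# Strided downsample with stride 2: keep rows and columns 1 and 3
x_ds = x_toy[1:2:end, 1:2:end, :, :]

# Zero-pad from 2 to 3 channels along the channel dimension
extra = zeros(Float32, size(x_ds, 1), size(x_ds, 2), 3 - 2, size(x_ds, 4))
x_short = cat(x_ds, extra; dims=3)

size(x_short)
```

The result has half the spatial size and one extra all-zero channel, so it can be added to the output of a stride-2 layer that maps 2 channels to 3.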

In [17]:
function ResNetLayer(in_channels::Int, channels::Int; stride=1, f=relu)
    bn1 = Flux.BatchNorm(in_channels)
    conv1 = Flux.Conv((3,3), in_channels=>channels; stride=stride, pad=1, init=Flux.kaiming_uniform, bias=false)
    bn2 = Flux.BatchNorm(channels)
    conv2 = Flux.Conv((3,3), channels=>channels; stride=1, pad=1, init=Flux.kaiming_uniform, bias=false)

    return ResNetLayer(conv1, conv2, bn1, bn2, f, in_channels, channels, stride)
end

ResNetLayer

In [18]:
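function (self::ResNetLayer)(x::AbstractArray)
    # Shortcut branch: downsampled and channel-padded copy of the input
    # (named `shortcut` to avoid shadowing Base.identity)
    shortcut = residual_identity(self, x)

    # Residual branch in pre-activation order: BN -> activation -> conv, twice
    z = self.bn1(x)
    z = self.f(z)
    z = self.conv1(z)
    z = self.bn2(z)
    z = self.f(z)
    z = self.conv2(z)

    y = z + shortcut
    return y
end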

In [19]:
l = ResNetLayer(3, 10; stride=2);

In [20]:
x = randn(Float32, (64, 64, 3, 2));
y = l(x);
size(y)

(32, 32, 10, 2)

# ResNet20 Model

In [21]:
mutable struct ResNet20
    input_conv::Flux.Conv
    resnet_blocks::Chain
    pool::GlobalMeanPool
    dense::Flux.Dense
end

@functor ResNet20

function ResNet20(in_channels::Int, num_classes::Int)
    resnet_blocks = Chain(
        block_1 = ResNetLayer(16, 16),
        block_2 = ResNetLayer(16, 16),
        block_3 = ResNetLayer(16, 16),
        block_4 = ResNetLayer(16, 32; stride=2),
        block_5 = ResNetLayer(32, 32),
        block_6 = ResNetLayer(32, 32),
        block_7 = ResNetLayer(32, 64; stride=2),
        block_8 = ResNetLayer(64, 64),
        block_9 = ResNetLayer(64, 64)
    )
    return ResNet20(
        Flux.Conv((3,3), in_channels=>16, init=Flux.kaiming_uniform, pad=1, bias=false),
        resnet_blocks,
        GlobalMeanPool(),
        Dense(64 => num_classes)
    )
end

function (self::ResNet20)(x::AbstractArray)
    z = self.input_conv(x)
    z = self.resnet_blocks(z)
    z = self.pool(z)
    z = dropdims(z, dims=(1, 2))
    y = self.dense(z)
    return y
end

In [22]:
# Testing ResNet20 model
# Expected output: (10, 4)
m = ResNet20(3, 10);
inputs = randn(Float32, (32, 32, 3, 4))
outputs = m(inputs);
size(outputs)

(10, 4)
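
The name ResNet20 and the expected shapes can be checked with a little bookkeeping in plain Julia. The block list mirrors the `Chain` above; `trace_shapes` is a hypothetical helper that only tracks sizes and does not run the model.

```julia
# (in_channels, out_channels, stride) for the nine residual blocks defined above
blocks = [(16, 16, 1), (16, 16, 1), (16, 16, 1),
          (16, 32, 2), (32, 32, 1), (32, 32, 1),
          (32, 64, 2), (64, 64, 1), (64, 64, 1)]

# Hypothetical helper: (width, height, channels) after the input conv and after each block
function trace_shapes(w::Int, h::Int, blocks)
    shapes = [(w, h, first(blocks)[1])]   # input conv (pad=1) keeps the spatial size
    for (_, c_out, stride) in blocks
        w, h = w ÷ stride, h ÷ stride     # a stride-2 block halves width and height
        push!(shapes, (w, h, c_out))
    end
    return shapes
end

shapes = trace_shapes(32, 32, blocks)

# Weight layers: 1 input conv + 2 convs per block + 1 dense layer
n_weight_layers = 1 + 2 * length(blocks) + 1
```

The final feature map is 8x8 with 64 channels, which `GlobalMeanPool` reduces to the 64 features expected by `Dense(64 => num_classes)`.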

# Training setup

## Sparse Cross Entropy Function

In [23]:
"""
    sparse_logit_cross_entropy(logits, labels)

Efficient computation of the cross entropy loss from model logits and integer label indices.
Labels are 0-based integer indices in [0, N-1], where N is the number of classes.
Similar to TensorFlow's SparseCategoricalCrossentropy with `from_logits=True`.

# Arguments
- `logits::AbstractArray`: 2D model logits tensor of shape (classes, batch size)
- `labels::AbstractArray`: 1D integer label indices of shape (batch size,)

# Returns
- `loss::Float32`: Cross entropy loss
"""
function sparse_logit_cross_entropy(logits, labels)
    log_probs = logsoftmax(logits);
    inds = CartesianIndex.(labels .+ 1, axes(log_probs, 2));
    # Select indices of labels for loss
    log_probs = log_probs[inds];
    loss = -mean(log_probs);
    return loss
end

sparse_logit_cross_entropy (generic function with 1 method)
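
A small worked example of the same computation, written in plain Julia so it runs without Flux. `logsoftmax_cols` is a hypothetical stand-in for `logsoftmax`, and the logits and labels are made up.

```julia
using Statistics

# Hypothetical stand-in for logsoftmax: column-wise, shifted by the column maximum for stability
function logsoftmax_cols(z)
    shifted = z .- maximum(z; dims=1)
    return shifted .- log.(sum(exp.(shifted); dims=1))
end

toy_logits = Float32[2 0; 0 2; 0 0]   # 3 classes x 2 observations
toy_labels = [0, 1]                   # 0-based class labels

toy_log_probs = logsoftmax_cols(toy_logits)
# Pair each (1-based) class row with its column to pick one log-probability per observation
toy_inds = CartesianIndex.(toy_labels .+ 1, axes(toy_log_probs, 2))
toy_loss = -mean(toy_log_probs[toy_inds])
```

Each observation has logit 2 for its true class and 0 for the other two, so the loss equals `log(exp(2) + 2) - 2`.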

In [24]:
# Create model with 3 input channels and 10 classes
model = ResNet20(3, 10);

In [25]:
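# Set up the AdamW optimizer
# Note: the third positional argument of Optimisers.Adam is ϵ, not weight decay,
# so AdamW is used to apply the decay
β = (0.9, 0.999);
decay = 1e-4;
state = Optimisers.setup(Optimisers.AdamW(1e-3, β, decay), model);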
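
For intuition about the AdamW optimizer named above, the sketch below performs a single update with decoupled weight decay on a plain vector. It follows one common textbook formulation; how the decay term is scaled differs between libraries and versions, so this is not a description of what Optimisers.jl does internally. `adamw_step!` is a hypothetical helper.

```julia
# Hypothetical helper: one AdamW-style update (Adam step plus decoupled weight decay)
function adamw_step!(θ, g, m, v, t; η=1e-3, β=(0.9, 0.999), λ=1e-4, ϵ=1e-8)
    β1, β2 = β
    @. m = β1 * m + (1 - β1) * g        # first-moment (mean) estimate
    @. v = β2 * v + (1 - β2) * g^2      # second-moment estimate
    m̂ = m ./ (1 - β1^t)                 # bias correction for step t
    v̂ = v ./ (1 - β2^t)
    # Assumption: the decay term λ * θ is scaled by the learning rate η
    @. θ = θ - η * (m̂ / (sqrt(v̂) + ϵ) + λ * θ)
    return θ
end

θ = [1.0, -1.0]
g = [0.5, -0.5]
m = zeros(2)
v = zeros(2)
adamw_step!(θ, g, m, v, 1)
```

On the first step the bias-corrected Adam direction is close to `sign(g)`, so each parameter moves by roughly `η` towards zero, plus a small extra pull from the weight decay.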

In [26]:
# Create objective function to optimize
function loss_function(model::ResNet20, x::AbstractArray, y::AbstractArray)
    ŷ = model(x)
    loss = sparse_logit_cross_entropy(ŷ, y)
    return loss
end

loss_function (generic function with 1 method)

In [27]:
(x, y) = first(train_batches);

In [28]:
mutable struct ResNet5
    input_conv::Flux.Conv
    resnet_block::ResNetLayer
    pool::GlobalMeanPool
    dense::Flux.Dense
end

@functor ResNet5

function ResNet5(in_channels::Int, num_classes::Int)
    return ResNet5(
        Flux.Conv((3,3), in_channels=>16, init=Flux.kaiming_uniform, pad=1, bias=false),
        ResNetLayer(16, 16),
        GlobalMeanPool(),
        Dense(16 => num_classes)
    )
end

function (self::ResNet5)(x::AbstractArray)
    z = self.input_conv(x)
    z = self.resnet_block(z)
    z = self.pool(z)
    z = dropdims(z, dims=(1, 2))
    y = self.dense(z)
    return y
end


function loss_function(model::ResNet5, x::AbstractArray, y::AbstractArray)
    ŷ = model(x)
    loss = sparse_logit_cross_entropy(ŷ, y)
    return loss
end

loss_function (generic function with 2 methods)

In [29]:
model = ResNet5(3, 10);

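# Zygote.gradient returns one gradient per argument (model, x, y), not (loss, gradient).
# Use Zygote.withgradient if the loss value is also needed.
g_model, g_x, g_y = Zygote.gradient(loss_function, model, x, y);

In [30]:
g_x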

32×32×3×16 Array{Float32, 4}:
[:, :, 1, 1] =
  1.7362f-5   -8.45275f-5   1.64143f-5  …  -9.56953f-5   -7.42138f-5
 -1.32453f-7  -7.54285f-5  -1.66397f-5     -0.000151026  -8.3889f-5
  8.06739f-6  -1.90499f-5  -1.45963f-5     -0.000106016  -9.65289f-6
 -1.22822f-5  -7.12256f-5  -4.67133f-5     -0.000145646  -3.9911f-5
 -3.10252f-5  -7.81925f-5  -5.927f-5       -0.000151137  -6.68164f-5
 -1.25631f-5  -8.11842f-5  -5.6855f-5   …  -0.000117506  -5.66902f-5
 -1.83619f-5  -5.34015f-5  -2.15533f-5     -2.6127f-5    -4.03683f-5
 -8.06493f-6  -3.37832f-5  -4.21246f-5     -4.20609f-5   -7.1381f-5
 -1.78168f-5  -4.56914f-5  -1.03866f-5     -5.23648f-5   -0.000112848
 -3.2396f-6   -3.73311f-5   1.89601f-5     -2.05763f-5   -0.000129008
 -1.57936f-5  -7.01726f-5  -5.20718f-7  …  -8.20537f-5   -0.000104548
 -4.92688f-6  -6.64523f-5  -2.49975f-5     -8.93846f-5   -0.000101201
 -5.268f-6    -6.71184f-5  -2.32403f-5      3.93376f-6   -0.000128303
  ⋮                                     ⋱   ⋮           

In [31]:
model = ResNet5(3, 10);

loss, g = grad(loss_function, model, x, y)



(2.6952226f0, (ChainRulesCore.ZeroTangent(), Tangent{ResNet5}(input_conv = Tangent{Conv{2, 4, typeof(identity), Array{Float32, 4}, Bool}}(weight = [0.0033848141 0.00492214 0.0155685; 0.0019094192 0.0043721674 0.014206535; 0.0050633624 0.006268614 0.015757428;;; -0.011066678 -0.009763804 0.00187172; -0.013735583 -0.011424305 -0.00047328882; -0.010526525 -0.009302213 0.0015942054;;; -0.001088581 0.0016271224 0.011891334; -0.0050204634 -0.0012168038 0.008474737; -0.0023286804 0.00045990234 0.009934018;;;; -0.004183669 -0.007880375 -0.008217068; -0.006327441 -0.009227365 -0.0077464064; -0.0045296703 -0.00671673 -0.004411768;;; -0.02649877 -0.029833863 -0.02973304; -0.029033566 -0.03159094 -0.02957295; -0.026717253 -0.028665334 -0.025971035;;; -0.036283102 -0.040567905 -0.038424168; -0.038532212 -0.0420971 -0.03835934; -0.03484244 -0.037923675 -0.033869643;;;; -0.2046649 -0.20975651 -0.20377469; -0.22296588 -0.22449858 -0.2156485; -0.22561207 -0.22498934 -0.21568228;;; -0.13329147 …

In [32]:
g

(ChainRulesCore.ZeroTangent(), Tangent{ResNet5}(input_conv = Tangent{Conv{2, 4, typeof(identity), Array{Float32, 4}, Bool}}(weight = [0.0033848141 0.00492214 0.0155685; 0.0019094192 0.0043721674 0.014206535; 0.0050633624 0.006268614 0.015757428;;; -0.011066678 -0.009763804 0.00187172; -0.013735583 -0.011424305 -0.00047328882; -0.010526525 -0.009302213 0.0015942054;;; -0.001088581 0.0016271224 0.011891334; -0.0050204634 -0.0012168038 0.008474737; -0.0023286804 0.00045990234 0.009934018;;;; -0.004183669 -0.007880375 -0.008217068; -0.006327441 -0.009227365 -0.0077464064; -0.0045296703 -0.00671673 -0.004411768;;; -0.02649877 -0.029833863 -0.02973304; -0.029033566 -0.03159094 -0.02957295; -0.026717253 -0.028665334 -0.025971035;;; -0.036283102 -0.040567905 -0.038424168; -0.038532212 -0.0420971 -0.03835934; -0.03484244 -0.037923675 -0.033869643;;;; -0.2046649 -0.20975651 -0.20377469; -0.22296588 -0.22449858 -0.2156485; -0.22561207 -0.22498934 -0.21568228;;; -0.13329147 -0.1358981 -0 …

# Evaluation Function

In [33]:
function evaluate(model, test_loader)
    preds = Int[]
    targets = Int[]
    for (x, y) in test_loader
        # Get model predictions
        # argmax over dims=1 returns a 1×N matrix of CartesianIndex values;
        # i[1] is the 1-based class row, so subtract 1 to match the 0-based labels
        logits = model(x)
        ŷ = map(i -> i[1] - 1, argmax(logits, dims=1))
        append!(preds, ŷ)

        # Get true labels
        append!(targets, y)
    end
    accuracy = sum(preds .== targets) / length(targets)
    return accuracy
end

evaluate (generic function with 1 method)
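
The index handling in `evaluate` is easy to get wrong, so here is a toy check in plain Julia. It assumes 0-based class labels, as in the loss function above (which adds 1 before indexing); the logits and targets are made up.

```julia
toy_eval_logits = Float32[0.1 2.0 0.3;
                          1.5 0.2 0.1;
                          0.0 0.1 3.0]   # 3 classes x 3 observations
toy_targets = [1, 0, 1]                  # 0-based true labels

# argmax over dims=1 gives one CartesianIndex(row, column) per observation
idx = argmax(toy_eval_logits; dims=1)

# Row index is the 1-based class; subtract 1 to compare with 0-based labels
toy_preds = vec(map(i -> i[1] - 1, idx))
toy_accuracy = sum(toy_preds .== toy_targets) / length(toy_targets)
```

The predicted classes are 1, 0, and 2, so two of the three toy predictions match and the accuracy is 2/3.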

# Training Loop

In [34]:
# Set up timing output
const to = TimerOutput()

 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:      1.11s /   0.0%           56.9MiB /   0.0%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 ────────────────────────────────────────────────────────────────────