In this tutorial, we implement a very simple recurrent network that sums up an arbitrary sequence of numbers. The network is constructed from scratch, without using predefined layers.

In [1]:
import torch

Generate a sequence of 5 random numbers. It will serve as a signal of 5 frames with 1 channel (feature) per frame:

In [3]:
SIGNAL = torch.rand(5)
print(SIGNAL)

tensor([0.6207, 0.3961, 0.9252, 0.3399, 0.0822])


Sum the numbers using a built-in function:

In [4]:
TARGET = SIGNAL.sum()
print(TARGET)

tensor(2.3641)


Calculate the result manually by iterating over the signal and accumulating the sum in a variable that will serve as the hidden state:

In [5]:
STATE = torch.zeros(())
for frame in range(5):
    STATE = SIGNAL[frame] + STATE
print(STATE)

tensor(2.3641)


Move the addition to a separate function called cell:

In [6]:
def cell(FRAME, STATE):
    STATE = FRAME + STATE
    return STATE

STATE = torch.zeros(())
for frame in range(5):
    STATE = cell(SIGNAL[frame], STATE)
print(STATE)

tensor(2.3641)


The cell function is our recurrent neural network. It takes a frame of the signal together with the previous hidden state and returns the new hidden state. This state is then fed back into the function together with the next frame. In this case, the final sum is simply the hidden state after the last frame. In general, however, the result may differ from the hidden state, so introduce a separate variable for the result:

In [7]:
STATE = torch.zeros(())
for frame in range(5):
    STATE = cell(SIGNAL[frame], STATE)
RESULT = STATE
print(RESULT)

tensor(2.3641)
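
The feedback loop above follows the same pattern as a left fold: the cell is the combining function and the hidden state is the accumulator. As a minimal sketch in plain Python (the names `fold_cell` and `values` are illustrative and not part of the tutorial code), the same loop can be written with `functools.reduce`:

```python
from functools import reduce

def fold_cell(state, frame):
    # Same update as the tutorial's cell, with the accumulator as the first argument
    return frame + state

values = [0.5, 0.25, 1.0, 2.0, 0.125]  # illustrative signal, not the tutorial's random one

# reduce feeds each returned state back in together with the next frame
final_state = reduce(fold_cell, values, 0.0)
print(final_state)
```

The initial value `0.0` plays the role of the zero hidden state created with `torch.zeros(())`.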


We did not need to train the net because the cell already computes the sum by construction. Instead of simply adding two numbers, a more general cell takes their linear combination with unknown weights and a bias term:

In [8]:
BIAS = torch.randn(())
WEIGHTF = torch.randn(())
WEIGHTS = torch.randn(())

def cell(FRAME, STATE):
    STATE = BIAS + FRAME * WEIGHTF + STATE * WEIGHTS
    return STATE

STATE = torch.zeros(())
for frame in range(5):
    STATE = cell(SIGNAL[frame], STATE)
RESULT = STATE
print(RESULT)

tensor(21.8161)
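
Unrolling this linear cell from a zero state gives a closed form: each frame's contribution `BIAS + FRAME * WEIGHTF` is multiplied by `WEIGHTS` once for every later step. The following is a minimal check of that identity in plain Python, using fixed illustrative values instead of the random ones above (all lowercase names are hypothetical):

```python
bias, weight_f, weight_s = 0.5, 2.0, 0.5   # illustrative values, not trained ones
signal = [1.0, 2.0, 3.0, 4.0]

# Recurrent evaluation, exactly as in the unrolling loop
state = 0.0
for frame in signal:
    state = bias + frame * weight_f + state * weight_s

# Closed form: frame t is scaled by weight_s once per remaining step
last = len(signal) - 1
closed_form = sum((bias + frame * weight_f) * weight_s ** (last - t)
                  for t, frame in enumerate(signal))

print(state, closed_form)
```

When `weight_s` differs from 1, earlier frames are scaled geometrically relative to later ones, so the result is no longer a plain sum of the frames.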


The unknown bias and weights need to be determined via training. To perform training, calculate the squared error and use it as the loss:

In [9]:
STATE = torch.zeros(())
for frame in range(5):
    STATE = cell(SIGNAL[frame], STATE)
RESULT = STATE
LOSS = torch.square(RESULT - TARGET)
print(LOSS)

tensor(378.3821)


Now minimize the loss using plain gradient descent:

In [11]:
BIAS = torch.randn((), requires_grad = True)
WEIGHTF = torch.randn((), requires_grad = True)
WEIGHTS = torch.randn((), requires_grad = True)

def cell(FRAME, STATE):
    STATE = BIAS + FRAME * WEIGHTF + STATE * WEIGHTS
    return STATE

optimizer = torch.optim.SGD([BIAS, WEIGHTF, WEIGHTS], lr = 0.01)
for epoch in range(1000):
    STATE = torch.zeros(())
    for frame in range(5):
        STATE = cell(SIGNAL[frame], STATE)
    RESULT = STATE
    LOSS = torch.square(RESULT - TARGET)
    optimizer.zero_grad()
    LOSS.backward()
    optimizer.step()
    print(epoch, LOSS.item())

0 7.164686679840088
1 6.962890148162842
2 6.773201942443848
3 6.592353343963623
4 6.41819429397583
5 6.249287128448486
6 6.084649085998535
7 5.923592567443848
8 5.765627861022949
9 5.610398769378662
10 5.457635402679443
11 5.307131290435791
12 5.158723831176758
13 5.012278079986572
14 4.867682933807373
15 4.724843502044678
16 4.583674907684326
17 4.444103240966797
18 4.30605936050415
19 4.169482707977295
20 4.034314155578613
21 3.9004974365234375
22 3.7679824829101562
23 3.6367170810699463
24 3.506653308868408
25 3.3777430057525635
26 3.24993896484375
27 3.123194694519043
28 2.997464418411255
29 2.8727006912231445
30 2.748857259750366
31 2.625886917114258
32 2.5037436485290527
33 2.382380962371826
34 2.2617545127868652
35 2.14182186126709
36 2.0225465297698975
37 1.9038982391357422
38 1.7858586311340332
39 1.6684257984161377
40 1.5516220331192017
41 1.43550443649292
42 1.3201748132705688
43 1.20579993724823
44 1.0926276445388794
45 0.9810134768486023
46 0.8714427947998047
47 0.76455932
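
For plain `torch.optim.SGD` without momentum or weight decay, `optimizer.step()` subtracts the learning rate times the gradient from each parameter. The sketch below performs the same update by hand on a single parameter and compares it with the optimizer. The names `w_manual` and `w_optim` and the toy loss are illustrative assumptions, not part of the tutorial:

```python
import torch

lr = 0.01
w_manual = torch.tensor(2.0, requires_grad=True)
w_optim = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=lr)

for _ in range(3):
    # Toy squared-error loss with target 1.0, evaluated for both copies
    loss_manual = torch.square(w_manual * 3.0 - 1.0)
    loss_optim = torch.square(w_optim * 3.0 - 1.0)

    # Manual update: parameter <- parameter - lr * gradient
    w_manual.grad = None
    loss_manual.backward()
    with torch.no_grad():
        w_manual -= lr * w_manual.grad

    # Same step through the optimizer
    optimizer.zero_grad()
    loss_optim.backward()
    optimizer.step()

print(w_manual.item(), w_optim.item())
```

Clearing the gradient before each backward pass matters in both variants, because PyTorch accumulates gradients across calls to `backward()`.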

Strictly speaking, the cell function represents what is called a recurrent cell. This cell is applied to each frame in the so-called unrolling loop. The unrolled loop as a whole constitutes the recurrent network, which takes the entire signal and produces the final result. Move this network to a separate function called model:

In [12]:
BIAS = torch.randn((), requires_grad = True)
WEIGHTF = torch.randn((), requires_grad = True)
WEIGHTS = torch.randn((), requires_grad = True)

def cell(FRAME, STATE):
    STATE = BIAS + FRAME * WEIGHTF + STATE * WEIGHTS
    return STATE

def model(SIGNAL):
    STATE = torch.zeros(())
    for frame in range(5):
        STATE = cell(SIGNAL[frame], STATE)
    RESULT = STATE
    return RESULT

optimizer = torch.optim.SGD([BIAS, WEIGHTF, WEIGHTS], lr = 0.01)
for epoch in range(1000):
    RESULT = model(SIGNAL)
    LOSS = torch.square(RESULT - TARGET)
    optimizer.zero_grad()
    LOSS.backward()
    optimizer.step()
    print(epoch, LOSS.item())

0 4.623414993286133
1 4.001701354980469
2 3.3648481369018555
3 2.696044921875
4 1.9944407939910889
5 1.2912534475326538
6 0.6686767935752869
7 0.24096107482910156
8 0.051123566925525665
9 0.00590327475219965
10 0.0004315444966778159
11 2.5740882847458124e-05
12 1.4450599792326102e-06
13 7.982094984981813e-08
14 4.393086783238687e-09
15 2.476099325576797e-10
16 1.2789769243681803e-11
17 9.094947017729282e-13
18 0.0
19 0.0
20 0.0
21 0.0
22 0.0
23 0.0
24 0.0
25 0.0
26 0.0
27 0.0
28 0.0
29 0.0
30 0.0
31 0.0
32 0.0
33 0.0
34 0.0
35 0.0
36 0.0
37 0.0
38 0.0
39 0.0
40 0.0
41 0.0
42 0.0
43 0.0
44 0.0
45 0.0
46 0.0
47 0.0
48 0.0
49 0.0
50 0.0
51 0.0
52 0.0
53 0.0
54 0.0
55 0.0
56 0.0
57 0.0
58 0.0
59 0.0
60 0.0
61 0.0
62 0.0
63 0.0
64 0.0
65 0.0
66 0.0
67 0.0
68 0.0
69 0.0
70 0.0
71 0.0
72 0.0
73 0.0
74 0.0
75 0.0
76 0.0
77 0.0
78 0.0
79 0.0
80 0.0
81 0.0
82 0.0
83 0.0
84 0.0
85 0.0
86 0.0
87 0.0
88 0.0
89 0.0
90 0.0
91 0.0
92 0.0
93 0.0
94 0.0
95 0.0
96 0.0
97 0.0
98 0.0
99 0.0
100 0.0
101 0.0

Because the length 5 is hardcoded in the loop, the current model only accepts signals of that length. Adapt it to signals of arbitrary length:

In [13]:
BIAS = torch.randn((), requires_grad = True)
WEIGHTF = torch.randn((), requires_grad = True)
WEIGHTS = torch.randn((), requires_grad = True)

def cell(FRAME, STATE):
    STATE = BIAS + FRAME * WEIGHTF + STATE * WEIGHTS
    return STATE

def model(SIGNAL):
    frames = SIGNAL.size(0)
    STATE = torch.zeros(())
    for frame in range(frames):
        STATE = cell(SIGNAL[frame], STATE)
    RESULT = STATE
    return RESULT

optimizer = torch.optim.SGD([BIAS, WEIGHTF, WEIGHTS], lr = 0.01)
for epoch in range(1000):
    RESULT = model(SIGNAL)
    LOSS = torch.square(RESULT - TARGET)
    optimizer.zero_grad()
    LOSS.backward()
    optimizer.step()
    print(epoch, LOSS.item())

0 4.728830814361572
1 4.621041774749756
2 4.514373779296875
3 4.408603668212891
4 4.303501129150391
5 4.19883394241333
6 4.094357013702393
7 3.989818811416626
8 3.884958267211914
9 3.7795040607452393
10 3.6731741428375244
11 3.565678358078003
12 3.4567160606384277
13 3.3459784984588623
14 3.2331502437591553
15 3.117908239364624
16 2.999922275543213
17 2.8788578510284424
18 2.7543752193450928
19 2.626129388809204
20 2.4937744140625
21 2.356966018676758
22 2.21536922454834
23 2.0686752796173096
24 1.9166268110275269
25 1.759064793586731
26 1.5960032939910889
27 1.4277456998825073
28 1.2550554275512695
29 1.0793919563293457
30 0.9031985402107239
31 0.7301809191703796
32 0.5654442310333252
33 0.4152339994907379
34 0.286019504070282
35 0.18287430703639984
36 0.10767509043216705
37 0.05820388346910477
38 0.028982199728488922
39 0.013421657495200634
40 0.00585935590788722
41 0.0024463795125484467
42 0.000989539548754692
43 0.0003917264402844012
44 0.00015290165902115405
45 5.913537461310625e-

For the cell to compute a plain sum, the bias should equal 0 and both weights should equal 1. Inspect the values obtained from training:

In [14]:
print(BIAS)
print(WEIGHTF)
print(WEIGHTS)

tensor(1.2742, requires_grad=True)
tensor(0.1876, requires_grad=True)
tensor(0.4498, requires_grad=True)


They are far from the expected values. This is because we were fitting three parameters to only one data point, so many different parameter combinations reproduce that single target. We therefore need more training data. Since the net is supposed to sum up any signal, it should correctly compute not only the final sum but also all the partial sums of any signal. We will now fit the parameters to all the partial sums of the training signal. First, compute the partial sums as the new target:

In [15]:
SIGNAL = torch.rand(5)
TARGET = SIGNAL.cumsum(0)

print(SIGNAL)
print(TARGET)

tensor([0.6179, 0.8028, 0.9874, 0.3945, 0.2030])
tensor([0.6179, 1.4207, 2.4081, 2.8026, 3.0056])
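
To see why the final sum alone cannot pin down three parameters while the partial sums can help, compare two parameter choices that agree on the final value but differ at every earlier frame. This is a minimal sketch in plain Python with fixed illustrative values; `unroll`, `adder` and `constant` are hypothetical names:

```python
def unroll(signal, bias, weight_f, weight_s):
    # Return every intermediate hidden state of the linear cell
    state, states = 0.0, []
    for frame in signal:
        state = bias + frame * weight_f + state * weight_s
        states.append(state)
    return states

signal = [0.5, 0.25, 1.0, 2.0, 0.125]
target = sum(signal)

# The intended solution: bias 0, both weights 1
adder = unroll(signal, bias=0.0, weight_f=1.0, weight_s=1.0)
# A degenerate solution: ignore the input and always output the target
constant = unroll(signal, bias=target, weight_f=0.0, weight_s=0.0)

print(adder[-1], constant[-1])  # final states of both choices
print(adder)                    # running partial sums
print(constant)                 # the same value at every frame
```

Both choices give zero error on the final sum, but only the first one matches the partial sums, so a loss over all frames can tell them apart.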


Now store all the intermediate hidden states and return them as the result of the model. Note that the model no longer returns a single sum but an entire new signal. It will be fitted to the target, which is also a signal. Define the loss function as the squared error averaged over all the frames in these signals. Finally, minimize the new loss:

In [16]:
BIAS = torch.randn((), requires_grad = True)
WEIGHTF = torch.randn((), requires_grad = True)
WEIGHTS = torch.randn((), requires_grad = True)

def cell(FRAME, STATE):
    STATE = BIAS + FRAME * WEIGHTF + STATE * WEIGHTS
    return STATE

def model(SIGNAL):
    frames = SIGNAL.size(0)
    RESULT = torch.empty(frames)
    STATE = torch.zeros(())
    for frame in range(frames):
        STATE = cell(SIGNAL[frame], STATE)
        RESULT[frame] = STATE
    return RESULT

optimizer = torch.optim.SGD([BIAS, WEIGHTF, WEIGHTS], lr = 0.01)
for epoch in range(1000):
    RESULT = model(SIGNAL)
    LOSS = torch.square(RESULT - TARGET).mean()
    optimizer.zero_grad()
    LOSS.backward()
    optimizer.step()
    print(epoch, LOSS.item())

print(BIAS)
print(WEIGHTF)
print(WEIGHTS)

0 2.54850697517395
1 2.104743242263794
2 1.6898599863052368
3 1.3071186542510986
4 0.9645246267318726
5 0.6727701425552368
6 0.441480815410614
7 0.2744726538658142
8 0.16658158600330353
9 0.10480640828609467
10 0.07338027656078339
11 0.05897187069058418
12 0.05288122221827507
13 0.05044140666723251
14 0.04948412626981735
15 0.04909808561205864
16 0.04892493784427643
17 0.048829615116119385
18 0.04876260086894035
19 0.048705872148275375
20 0.04865308478474617
21 0.04860195890069008
22 0.04855164512991905
23 0.04850179702043533
24 0.04845240339636803
25 0.04840337485074997
26 0.048354633152484894
27 0.04830624535679817
28 0.048258207738399506
29 0.04821041226387024
30 0.04816294461488724
31 0.048115771263837814
32 0.04806884005665779
33 0.04802219942212105
34 0.04797577112913132
35 0.04792967066168785
36 0.0478837676346302
37 0.047838106751441956
38 0.04779268428683281
39 0.04774748533964157
40 0.04770251363515854
41 0.04765773192048073
42 0.04761319234967232
43 0.04756883531808853
44 0.

The loss values are now much higher than before, because it is harder to fit 5 data points than 1. Depending on the random initialization of the bias and weights, their final values may be better or worse. Try 10000 epochs to see a more significant improvement.

Another way to increase the amount of training data is to train on a whole batch of signals. In PyTorch's recurrent layers, the signal length is by default the first dimension of the input tensor and the batch size is the second. Create a batch of 100 random signals of length 5 and modify the model to process all signals in the batch simultaneously. The loss function is now the mean over all partial sums in a signal and over all signals in the batch:

In [17]:
SIGNAL = torch.rand(5, 100)
TARGET = SIGNAL.cumsum(0)

BIAS = torch.randn((), requires_grad = True)
WEIGHTF = torch.randn((), requires_grad = True)
WEIGHTS = torch.randn((), requires_grad = True)

def cell(FRAME, STATE):
    STATE = BIAS + FRAME * WEIGHTF + STATE * WEIGHTS
    return STATE

def model(SIGNAL):
    frames, samples = SIGNAL.size(0), SIGNAL.size(1)
    RESULT = torch.empty(frames, samples)
    STATE = torch.zeros(samples)
    for frame in range(frames):
        STATE = cell(SIGNAL[frame], STATE)
        RESULT[frame] = STATE
    return RESULT

optimizer = torch.optim.SGD([BIAS, WEIGHTF, WEIGHTS], lr = 0.01)
for epoch in range(1000):
    RESULT = model(SIGNAL)
    LOSS = torch.square(RESULT - TARGET).mean()
    optimizer.zero_grad()
    LOSS.backward()
    optimizer.step()
    print(epoch, LOSS.item())

print(BIAS)
print(WEIGHTF)
print(WEIGHTS)

0 3.3442115783691406
1 3.2680420875549316
2 3.1946897506713867
3 3.123898983001709
4 3.0554428100585938
5 2.9891202449798584
6 2.924750804901123
7 2.8621716499328613
8 2.801237106323242
9 2.7418134212493896
10 2.683781147003174
11 2.6270298957824707
12 2.5714590549468994
13 2.516976833343506
14 2.4634995460510254
15 2.4109489917755127
16 2.3592541217803955
17 2.3083486557006836
18 2.2581722736358643
19 2.2086691856384277
20 2.15978741645813
21 2.1114795207977295
22 2.063701629638672
23 2.0164132118225098
24 1.9695775508880615
25 1.9231607913970947
26 1.8771321773529053
27 1.8314638137817383
28 1.786131501197815
29 1.7411125898361206
30 1.6963883638381958
31 1.6519420146942139
32 1.6077607870101929
33 1.5638338327407837
34 1.5201539993286133
35 1.4767169952392578
36 1.4335218667984009
37 1.390571117401123
38 1.3478710651397705
39 1.3054323196411133
40 1.2632695436477661
41 1.2214014530181885
42 1.1798521280288696
43 1.1386514902114868
44 1.0978338718414307
45 1.0574404001235962
46 1.017
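
In the batched model, the scalar bias and weights broadcast over all signals, so each column of the batch is processed exactly as a single signal would be. The sketch below checks this on a small batch with fixed, illustrative parameter values. `unroll` and the lowercase names are hypothetical, and no training is involved:

```python
import torch

bias, weight_f, weight_s = 0.1, 0.9, 1.1  # illustrative values, not trained ones

def unroll(signal):
    # Works for shape (frames,) and (frames, samples): scalars broadcast over samples
    state = torch.zeros(signal.shape[1:])
    states = []
    for frame in signal:
        state = bias + frame * weight_f + state * weight_s
        states.append(state)
    # Stack the per-frame states along a new first (time) dimension
    return torch.stack(states)

torch.manual_seed(0)
batch = torch.rand(5, 4)            # 5 frames, 4 signals
batched = unroll(batch)

# Process each signal (column) separately and compare with the batched result
for sample in range(batch.size(1)):
    single = unroll(batch[:, sample])
    assert torch.allclose(batched[:, sample], single)

print(batched.shape)
```

Here `torch.stack` collects the states into a new tensor, which is an alternative to writing them into a tensor preallocated with `torch.empty` as the tutorial does.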

The values of the bias and weights are now much closer to 0 and 1, even after only 1000 epochs. Experiment with hyperparameters such as the signal length, batch size, learning rate, and type of optimizer.

In preparation for the next example, rename some variables to match the terminology used for general neural networks. The value of the linear combination is usually called the activation. It can be passed through a non-linear activation function to produce what is called the activity. There is no activation function here, so the activity is equal to the activation. Since our model takes a signal and returns a new signal, it can be viewed as a single recurrent layer. In the next example, we will add a second layer. Append the index 1 to the current bias and weights to indicate that they belong to the first layer:

In [None]:
SIGNAL = torch.rand(5, 100)
TARGET = SIGNAL.cumsum(0)

BIAS1 = torch.randn((), requires_grad = True)
WEIGHT1F = torch.randn((), requires_grad = True)
WEIGHT1S = torch.randn((), requires_grad = True)

def cell(FRAME, STATE):
    ACTIVATION = BIAS1 + FRAME * WEIGHT1F + STATE * WEIGHT1S
    ACTIVITY = ACTIVATION
    return ACTIVITY

def model(SIGNAL):
    frames, samples = SIGNAL.size(0), SIGNAL.size(1)
    ACTIVITY1 = torch.empty(frames, samples)
    STATE = torch.zeros(samples)
    for frame in range(frames):
        STATE = cell(SIGNAL[frame], STATE)
        ACTIVITY1[frame] = STATE
    return ACTIVITY1

optimizer = torch.optim.SGD([BIAS1, WEIGHT1F, WEIGHT1S], lr = 0.01)
for epoch in range(1000):
    RESULT = model(SIGNAL)
    LOSS = torch.square(RESULT - TARGET).mean()
    optimizer.zero_grad()
    LOSS.backward()
    optimizer.step()
    print(epoch, LOSS.item())

print(BIAS1)
print(WEIGHT1F)
print(WEIGHT1S)
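
For comparison, a cell with a non-linear activation function would pass the activation through that function before returning it. The sketch below uses `torch.tanh` purely as an illustrative choice. The tutorial's cell has no activation function, and `cell_tanh` with its fixed parameter values is hypothetical:

```python
import torch

bias1, weight1f, weight1s = 0.0, 1.0, 1.0  # illustrative fixed values

def cell_tanh(frame, state):
    activation = bias1 + frame * weight1f + state * weight1s
    # The activation function maps the activation into the range (-1, 1)
    activity = torch.tanh(activation)
    return activity

signal = torch.tensor([0.5, 0.25, 1.0, 2.0, 0.125])
state = torch.zeros(())
for frame in signal:
    state = cell_tanh(frame, state)

print(state, signal.sum())
```

Because tanh bounds the activity, such a cell cannot output the sum of this signal, which is consistent with the summing example keeping the activity equal to the activation.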