# The economics of software 2.0

In 2017, Andrej Karpathy wrote a seminal [blog post](https://karpathy.medium.com/software-2-0-a64152b37c35) where he called neural networks "software 2.0". He meant two things: (1) coding neural nets is writing software; (2) neural nets are a new kind of software. The innovation consists of substituting imperative programming with "programming by example". Instead of writing in detail the operations a program should go through to process some input and produce some output, you just specify a goal for the program and show it many examples. Watching the examples, the program will then "write itself". 

In his post, Karpathy then goes on to speculate about the consequences of this--an invaluable read. Over the years, I've found myself making Karpathy's point to various people, but I wasn't sure it really landed with them. One problem is that the types of problems we tend to solve with neural nets (computer vision, speech synthesis, natural language understanding, etc.), we never were really close to solving with imperative programming. Many people therefore don't really see that software 1.0 and software 2.0 are substitutes--or fundamentally different ways of doing the same thing: writing software. It feels like they're just different things solving non-overlapping problems. And this is mostly true. But I think the overlaps are going to grow more than we think. 

Recently, I stumbled upon a simple problem that I thought would illustrate well not only the differences between software 1.0 and software 2.0 but also how they compete with each other in solving problems. I will also use that to illustrate the economics guiding the software 1.0 v software 2.0 choice, and what are the tradeoffs involves in that choice. 

A friend was preparing for a technical interview on [Leetcode](https://leetcode.com/) and he shared with me one of the problems he had worked on: converting integers into Roman numerals. At present, no one in their right mind would use a neural net to solve this problem (it likely takes longer to code, is slower to run, and consumes more memory and compute). But, in the future, the software 2.0 tooling might be so effective, chips so optimized, and memory and compute so cheap, that it might actually make sense to solve this problem with a neural net. The main reason is that software 2.0 substitutes cheap labor for expensive labor. 

Assuming neural nets become off-the-shelf commodities, the work involved in programming by example is... collecting examples. As Karpathy puts it, "most of the active 'software development' takes the form of curating, growing, massaging and cleaning labeled datasets." This is arguably more accessible than writing imperative code. Imperative coding essentially requires humans to think like robots. Any sloppiness in reasoning and you get a memory leakage, an infinite loop, or some other unintended consequence. It's merciless, and it takes years to rewire your brain to avoid all the common pitfalls. Curating datasets, in contrast, is much more human-friendly, and therefore the labor pool is larger. 

The second big advantage of programming by example is that the same neural net architecture can be used to solve different problems. The labor that went into building the neural net can therefore be amortized over several tasks. Each different task still requires one to collect different data ("software 2.0 development"), but as we mentioned, this may be much cheaper than writing imperative coding. Software 1.0 is much more "brittle" in that sense. The imperative code to convert of integers into Roman numerals is completely different from the imperative code to convert Roman numerals into integers. In contrast, you can repurpose the exact same neural net to solve this new problem. There currently are limits to the versatility of neural nets though. Typically, there are "problem spaces" within which a given architecture can be repurposed. For instance, a recurrent neural net can be somewhat seamlessly repurposed for any problem involving converting a sequence of symbols into another sequence of symbols (this is the case I will illustrate in this post). Similarly, without changes to its architecture, the same convolutional neural net can be repurposed for various image classification tasks--whether it is detecting cats in pictures, or evidence of cancer in X-rays. You just have to feed it the right examples (this is not totally true: while the neural net architecture remains the same, you may also need to tweak its hyperparameters). While the versatility of neural nets remains contained within separate problem spaces, the main goal of the machine learning community as a whole is to expand each of those islands and eventually merge them. And in fact, recent developments hint at such a convergence: for instance, the "transformer" (a kind of neural net architecture) recently crossed over from the natural language processing space into the image processing space--a unification of sorts.

There are downsides to programming by example though. Despite future optimizations, it might remain on average less efficient than imperative coding. More importantly, a piece of software that wrote itself via neural nets is fundamentally probabilistic: you cannot 100% guarantee that it will always produce the correct output. In theory, you can offer that guarantee with software 1.0, although people rarely go through the trouble of proving correctness of an imperative codebase. That said, those two downsides pale in comparison to the above advantages (cheap development and versatility).

Before we dive into our little example to illustrate all of the above, let me quickly reiterate the point I'm trying to make here: imperative programming and programming by example, while they currently may seem to address non-overlapping problems, might be more directly competing over problems in the future. Using neural nets to convert integers into Roman numerals (or vice versa) might seem silly, but I think it makes the point crystal clear.

## Converting integers into Roman numerals

The description of the problem is fairly straigtforward. I'll just show a few example conversions:

- 4 -> IV
- 1193 -> MCXCIII
- 548 -> DXLVIII
- 3616 -> MMMDCXVI
- 21 -> XXI

Most people are familiar with Roman numerals. Unlike Arabic numerals, the value of a Roman numerical symbol does not depend on its position within a number. The symbol "3" in "300" has a different "value" than in "30". In the Roman system, "V" means 5--regardless of its position in the number. Roman symbols may also be locally subtracted as you parse them in order to figure out the final number. For instance, "XIV" = "X + (V - I)" = 14. Arabic symbols are just added: 14 = 10 + 4. 

Generally speaking, Arabic numerals are more compact, easier to compare, and way more convenient for doing arithmetic. Finally, it scales way better to represent large numbers. The Roman system would have to just keep adding symbols, making the whole thing more and more unwieldy and unparseable (thankfully, Romans didn't seem to need to handle numbers larger than 3999).

Let's write the imperative code to convert integers into Roman numerals. Later, we'll contrast this with the "machine learning way" of doing this. For both approaches, I'll use the Python programming language.

### The imperative way

Let's first define a mapping between Roman numerals and their corresponding integers, and then reverse the mapping.

In [1]:
r2i = {
    'I': 1, 
    'V': 5, 
    'X': 10, 
    'L': 50, 
    'C': 100, 
    'D': 500, 
    'M': 1000
}

i2r = {v:k for k,v in r2i.items()} # reverse the mapping

Now onto the function itself. The main trick is to know how to repeatedly extract the most significant digit of the integer. To do that, you need to divide it by the nearest power of 10 that is still lower than that integer. The quotient of that division will be the most significant digit. Then you map it to its Roman equivalent with some simple rules. You then move on to the next most significant digit. In practice, you replace the original integer with the remainder of the above divison and compute the most significant digit of that new integer (don't forget to update the divisor to be an order of magnitude lower). Honestly, I don't think this problem is straightforward for most people, which is probably why Leetcode considers it medium in difficulty (when I checked, only 60% of the submissions on the website were correct).

In [32]:
def integer_to_roman(n):
    div = 1
    while n >= div: div *= 10
    div //= 10
    out = []
    while n:
        d = n // div # get most significant digit via floor division by a power of 10
        if d < 4:
            o = i2r[div]*d
        elif d == 4:
            o = i2r[div] + i2r[div*5]
        elif d < 9:
            o = i2r[div*5] + (d-5)*i2r[div]
        else:
            o = i2r[div] + i2r[div*10]
        out.append(o)
        n = n % div # the new integer is the remainder
        div //= 10
    return ''.join(out)

Let's make a quick test set for sanity check purposes:

In [3]:
test_set = [
    ('4', 'IV'),
    ('1193', 'MCXCIII'),
    ('548', 'DXLVIII'),
    ('3616', 'MMMDCXVI'),
    ('21', 'XXI')
]

In [4]:
for src,tgt in test_set:
    assert integer_to_roman(int(src)) == tgt
print('Success')

Success


### Programming by example

As discussed, programming by example requires collecting examples. So let's assume we paid someone to do just that. For instance, that person asked 100 different people to come up with 20 examples. Of those 2000 examples, half will be used for the neural net to "write itself". The other half will be used to check how well it's doing. If there is room for improvement, typically it's because showing the various examples once is not enough. Sometimes you have to go through the same examples a few times for the neural net to learn (we call these repetitions "epochs"). Let's further keep 100 examples completely aside. At the end, when we're confident the neural net has learned to solve the problem, we use the 100 examples held out to get an estimate of its accuracy. Remember: we mentioned that a downside of software 2.0 is that it's probabilistic. The test set allows us to estimate how reliable the software is.

Let's synthetically generate some examples to simulate the data collection effort:

In [5]:
import random
from pathlib import Path

default_data_path = Path('data')

def make_data(splits, path=None, max_num=3999):
    if path is None: path = default_data_path
    if not path.exists(): path.mkdir()
    for split, size in splits.items():
        with open(path/(split+'.txt'), 'w') as o:
            for _ in range(int(size)):
                i = random.randint(1, max_num)
                r = integer_to_roman(i)
                o.write(' '.join([str(i), r]) + '\n')

In [6]:
splits  = {
    'train': 1e3,
    'valid': 1e3,
    'test': 1e2
}

make_data(splits)

To code up the neural net, I'll use PyTorch--a popular Python framework for deep learning. The first thing to do is to write classes to process the data and serve it to the neural net in the appropriate format. The `NumberDataset` class reads the examples from a file. Each line in the file consists in example pairs:

In [7]:
!head data/train.txt

3250 MMMCCL
1298 MCCXCVIII
758 DCCLVIII
784 DCCLXXXIV
2738 MMDCCXXXVIII
765 DCCLXV
1845 MDCCCXLV
3723 MMMDCCXXIII
1104 MCIV
2470 MMCDLXX


We are going to consider both the Arabic and Roman numerals as arbitrary symbols, which together form a common vocabulary. Those symbols will therefore be treated as characters. For instance, "794" is a string of the following symbols: "7", "9", "4". Similarly, "VIII" is a string of the following symbols: "V", "I", "I", and "I". All these symbols ("4", "7", "9", "I", "V") are part of our vocabulary. However, neural nets work with numbers, not strings. So you can't just serve the net with strings of characters. This is where the `Processor` class comes in: it maps the symbols in our vocabulary to unique integers. And *that*'s what we feed the neural net. One last thing: neural nets also typically expect their inputs to come in *batches*. Since, at the end of the day, neural nets are just a series of matrix multiplications with some nonlinearities peppered here and there, processing is made more efficient by using batches of numbers. That's the role of the `collate` function, which forms nice "boxes" of integers to feed the network.

In [8]:
import torch
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from functools import partial

class NumberDataset(Dataset):
    valid_targets = ['roman','integer']

    def __init__(self, lst, processor=None, target='roman'):
        assert target in self.valid_targets, f'Target needs to be one of {self.valid_targets}'
        self.target = target
        self.processor = processor
        self.lst = lst

    def __getitem__(self, i):
        i,r = self.lst[i]
        t = (i,r) if self.target == 'roman' else (r,i)
        return list(map(self.processor.process, t))

    def __len__(self):
        return len(self.lst)
    
    @classmethod
    def from_file(cls, fn, root=None, extension='txt', **kwargs):
        if root is None: root = default_data_path   
        url = root/('.'.join([fn,extension]))
        with open(url) as f: lines = [line.split() for line in f]
        return cls(lines, **kwargs)

class Processor:
    def __init__(self, vocab):
        self.vocab = vocab
        
    def process(self, x):
        seq = ['<bos>'] + list(x) + ['<eos>']
        return [self.vocab[tok] for tok in seq]

    def deprocess(self, x):
        out = []
        for idx in x:
            tok = self.vocab.lookup_token(idx)
            if tok == '<bos>': continue
            elif tok == '<eos>': return ''.join(out)
            else: out.append(tok)
        return ''.join(out)

def collate(batch, max_len=20, pad_idx=1):
    src_lst, tgt_lst = [], []
    for src, tgt in batch:
        src, tgt = map(torch.tensor, [src, tgt])
        src_lst.append(src)
        tgt_lst.append(tgt)
    src_lst[0] = nn.ConstantPad1d((0, max_len-src_lst[0].size(0)), pad_idx)(src_lst[0])
    tgt_lst[0] = nn.ConstantPad1d((0, max_len-tgt_lst[0].size(0)), pad_idx)(tgt_lst[0])
    return list(map(partial(pad_sequence, padding_value=pad_idx, batch_first=True), [src_lst, tgt_lst]))

  from .autonotebook import tqdm as notebook_tqdm


Let's now build our vocabulary, and then our datasets. PyTorch's `DataLoader` class just takes a dataset and serves batches from it.

In [9]:
max_len = 20
vocab = build_vocab_from_iterator(
    list(map(str,range(10))) # Arabic symbols
    + ['I','V','X','L','C','D','M'], # Roman symbols
    specials=['<bos>', '<pad>', '<eos>'] # Special symbols
)
vocab_size = len(vocab)
pad_idx = vocab['<pad>']
collate_fn = partial(collate, pad_idx=pad_idx, max_len=max_len)
proc = Processor(vocab)
train_ds = NumberDataset.from_file('train', processor=proc)
valid_ds = NumberDataset.from_file('valid', processor=proc)
train_dl = DataLoader(train_ds, batch_size=10, collate_fn=collate_fn)
valid_dl = DataLoader(valid_ds, batch_size=10, collate_fn=collate_fn)

Now on to the meaty stuff. We'll use a sequence-to-sequence network directly inspired by the classic [2015 Bahdanau et al. paper](https://arxiv.org/pdf/1409.0473.pdf). The architecture has three components: (1) an encoder, (2) a decoder, and (3) an attention mechanism in between. I won't go into the details of these models, as the Internet is already replete with detailed explanations of those models. Essentially, an encoder converts each symbol in a sequence into a multidimensional vector. A vector is simply a way to describe some entity. The more dimensions this vector has, the more granular the description can be. For instance the vector ["6'1", "brown", "75kg"] is a very rough description of myself that only encompasses the three dimensions of height, hair color, and weight. In this case, the "entity" is a person. In the case of our little problem, the "entity" is a numerical symbol. In both cases, the entity can be representated by a vector of numbers. Vectors are modular: if one of the descriptions in your vector is a non-number entity, it can itself be represented as a vector, and you just expand the original vector. For instance, "brown" is not a number, but "brown" can be represented as a vector of three numbers (intensities of red, green and blue), so that the final vector representing me is ["6'1", "red intensity of brown", "green intensity of brown", "blue intensity of brown", "75kg"]. Any non-number entity can eventually be represented by a vector of numbers. Were you to expand the vector, you could cram more information into it.

Anyways, the encoder simply converts a string of symbols into an array of vectors that capture information about those symbols, just like ["6'1", "brown", "75kg"] captures information about myself. The attention module then "summarizes" the information across those input symbols into a single vector to feed into the decoder. As the name implies, the decoder does the reverse process: it takes a vector as input and produces a symbol. In our case, it will produce a sequence of Roman numerical symbols. To get the decoder started, we need to feed it a special vector whose meaning is "start of the sequence". If you look back at the code we use to build our vocabulary, that's the symbol "<bos>". But on top of that seed vector, we need to influence the decoder with *some* information from the encoded input. The decoder needs to "look" at the encoded input to guide its production of symbols. Every time it spits out a new symbol, the decoder needs to pay attention to the encoded input. Say the input is "XIV" and the decoder has already produced the symbol "1". Upon producing the next symbol, it probably needs to pay most attention to the second and third vectors in the encoded input (vectors representing "I" and "V"). That's because the decoder needs to subtract one from the other. This process of "paying attention to different parts of the input" is what's known as... an attention mechanism. That's the last component of our architecture. In practice, the decoder needs to "summarize" the information it pays attention to into a single vector. The way it does that is by doing a weighted sum of the encoded input vectors. In the case where the decoder is trying to produce "4" in the sequence "14", it will likely learn to give a low weight to "X" in the input and a high weight to "I" and "V".

In [10]:
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, dropout):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True, dropout=dropout, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.project = nn.Linear(hidden_dim*2, hidden_dim)
    
    def forward(self, x):
        bs = x.size(0)
        x = self.dropout(self.emb(x))
        h, h_last = self.rnn(x)
        h_last = h_last.permute(1,0,2).contiguous().view(bs, -1) # (bs,hidden_dim*2)
        h, h_last = map(self.project, [h, h_last])
        h_last = h_last.unsqueeze(0)
        return h, h_last # (bs,seq_len,hidden_dim), (1,bs,hidden_dim)
    
class Attention(nn.Module):
    def __init__(self, encoder_hidden_dim, decoder_hidden_dim):
        super().__init__()
        self.decoder_hidden_dim = torch.tensor(decoder_hidden_dim)
        self.w = nn.Parameter(torch.FloatTensor(decoder_hidden_dim, encoder_hidden_dim).uniform_(-0.1, 0.1))
    
    def forward(self, query, values):
        score = (query.unsqueeze(1) @ self.w @ values.permute(0,2,1))/torch.sqrt(self.decoder_hidden_dim)
        attention_weights = F.softmax(score, 1)
        context = attention_weights @ values
        return context

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, dropout):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim*2, hidden_dim, batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, y, h_in, context):
        y = self.dropout(self.emb(y.unsqueeze(1)))
        y = torch.cat([context, y], -1) # (bs,1,hidden_dim*2)
        h, h_last = self.rnn(y, h_in) # (bs,1,hidden_dim), (1,bs,hidden_dim)
        return h.squeeze(1), h_last

class Seq2SeqWithAttention(nn.Module):
    def __init__(self, vocab_size, hidden_dim=20, dropout=.5):
        super().__init__()
        self.encode = Encoder(vocab_size, hidden_dim, dropout)
        self.attend = Attention(hidden_dim, hidden_dim)
        self.decode = Decoder(vocab_size, hidden_dim, dropout)
        self.project = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, src, tgt, teacher_forcing_proba=.5):
        bs, tgt_len = tgt.shape
        h, h_last = self.encode(src)
        s = h_last.squeeze(0) # (bs,hidden_dim)
        y = tgt[:,0]
        logits = []
        for t in range(1, tgt_len):
            context = self.attend(s, h) # context: (bs,1,hidden_dim)
            s, h_last = self.decode(y, h_last, context)
            logit = self.project(s)
            logits.append(logit)
            teacher_force = random.random() < teacher_forcing_proba
            y = tgt[:,t] if teacher_force else logit.argmax(-1)
        return torch.stack(logits, 1)

Let's now write some functions to train and evaluate our model. It's useful to distinguish here between the loss function (also known as "criterion") and the metric function. While they both give us a sense of performance, the former is used by the neural net to guide its "self-writing" process (technically known as "gradient descent") and the latter is used by the human supervisor to evaluate how well the self-writing software is doing, like a foreman overseeing a worker in the factories of old. The criterion is typically more amenable to the gradient descent process but not easily interpretable by the foreman. The metric function, in turn, is more intuitive. In our case, the criterion is cross-entropy and the metric is accuracy (how many sequences does the software get right).

In [11]:
class Acc:
    def __init__(self, ignore_index=1):
        self.ignore_index = ignore_index
    
    def __call__(self, pred, tgt):
        # both pred and tgt have shape (bs,seq_len)
        mask = tgt != self.ignore_index
        pred *= mask
        tgt *= mask
        correct = torch.eq(pred, tgt).all(1).sum()
        return correct.item()

The `fit` function encapsulates both the training process and the evaluation process. It repeats both processes until the neural net seems to have converged to a decent level of performance, which the foreman (us) can observe via the metric. The number of epochs just captures how many times the software goes through the same examples until it gets them right. The `patience` parameter is used to decide when to stop training--essentially it causes the software to stop training when its stops making any progress for a few epochs. When the software is done writing itself, the `fit` function spits out the final software.

In [12]:
def train(mdl, dl, opt, loss_fn, metric):
    mdl.train()
    epoch_loss = 0
    correct = 0
    for i, (src, tgt) in enumerate(dl):
        opt.zero_grad()
        out = mdl(src, tgt)
        bs, seq_len, out_dim = out.shape
        assert out.size(1) == tgt.size(1)-1 # we skipped the first element in the output
        # collapse seq and batch dims
        out = out.view(-1, out_dim)
        tgt = tgt[:,1:].contiguous().view(-1) # skip the first element in the ground truth
        loss = loss_fn(out, tgt)
        loss.backward()
        opt.step()
        epoch_loss += loss.item()
        pred = out.argmax(-1)
        m = metric(pred.view(bs, -1), tgt.view(bs, -1))
        correct += m
        if i > 0 and i % 1e4 == 0:
            print(f'\t{i}: {epoch_loss/i:.3f}')
    n = (i+1)*bs
    return epoch_loss/n, correct/n

In [13]:
def evaluate(mdl, dl, loss_fn, metric):
    mdl.eval()
    epoch_loss = 0
    correct = 0
    with torch.no_grad():
        for i, (src, tgt) in enumerate(dl):
            out = mdl(src, tgt, teacher_forcing_proba=0) # turn off teacher forcing
            bs, seq_len, out_dim = out.shape
            out = out.view(-1, out_dim)
            tgt = tgt[:,1:].contiguous().view(-1)
            loss = loss_fn(out, tgt)
            epoch_loss += loss.item()
            pred = out.argmax(-1)
            m = metric(pred.view(bs, -1), tgt.view(bs, -1))
            correct += m
    n = (i+1)*bs
    return epoch_loss/n, correct/n

In [14]:
import json

def fit(epochs, mdl, train_dl, valid_dl, opt, criterion, metric, patience=2):
    fmt = lambda x: f'{x:.3f}'
    best_valid_loss = float('inf')
    best_mdl = None
    irritation = 0
    for epoch in range(epochs):
        print(f'Epoch: {epoch+1:02}')
        train_loss, train_metric = train(mdl, train_dl, opt, criterion, metric)
        valid_loss, valid_metric = evaluate(mdl, valid_dl, criterion, metric)
        print('\t' + json.dumps({
            'train': {
                'loss': fmt(train_loss),
                'metric': fmt(train_metric)
            },
            'valid': {
                'loss': fmt(valid_loss),
                'metric': fmt(valid_metric)
            }
        }, indent=4))
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            best_mdl = mdl
            torch.save(mdl.state_dict(), 'model.pt')
            irritation = 0
        else:
            irritation += 1
            if irritation == patience: break
    return best_mdl

Let's instantiate our model. The `opt` object is the thing that will run the gradient descent process for us. In other words, it is what manages the neural net's "self-writing" process. This is also where we pick several so-called "hyperparameters" (`hidden_dim`, `dropout`, `lr`). Without going into the details, they determine either some structural elements of the neural net's architecture (like the size of the vectors representing the symbols) or elements of the training process (like the rate at which the software updates itself based on additional examples).

In [15]:
import torch.optim as optim

i2r_mdl = Seq2SeqWithAttention(vocab_size, hidden_dim=30, dropout=.3)
opt = optim.Adam(i2r_mdl.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
metric = Acc(ignore_index=pad_idx)



Time to fit! (Only displaying the criterion and metric values at the last epoch.)

In [16]:
best_i2r_mdl = fit(100, i2r_mdl, train_dl, valid_dl, opt, criterion, metric, patience=3)

Epoch: 01
	{
    "train": {
        "loss": "0.207",
        "metric": "0.000"
    },
    "valid": {
        "loss": "0.162",
        "metric": "0.000"
    }
}
Epoch: 02
	{
    "train": {
        "loss": "0.151",
        "metric": "0.000"
    },
    "valid": {
        "loss": "0.143",
        "metric": "0.000"
    }
}
Epoch: 03
	{
    "train": {
        "loss": "0.133",
        "metric": "0.001"
    },
    "valid": {
        "loss": "0.126",
        "metric": "0.000"
    }
}
Epoch: 04
	{
    "train": {
        "loss": "0.122",
        "metric": "0.001"
    },
    "valid": {
        "loss": "0.117",
        "metric": "0.000"
    }
}
Epoch: 05
	{
    "train": {
        "loss": "0.116",
        "metric": "0.003"
    },
    "valid": {
        "loss": "0.127",
        "metric": "0.003"
    }
}
Epoch: 06
	{
    "train": {
        "loss": "0.111",
        "metric": "0.005"
    },
    "valid": {
        "loss": "0.101",
        "metric": "0.007"
    }
}
Epoch: 07
	{
    "train": {
        "los

The validation metric is good (>90%) but not great. For a better sense of the confidence we should put in this piece of probabilistic software, let's run it on our held-out test set of 100 examples unseen during training.

In [17]:
test_ds = NumberDataset.from_file('test', processor=proc)
test_dl = DataLoader(test_ds, batch_size=5, collate_fn=collate_fn)
_, test_metric = evaluate(best_i2r_mdl, test_dl, criterion, metric)
print(test_metric)

0.82


About 90%. Not the greatest. There are various reasons for that. First of all, we only used 1000 examples, which in the realm of deep learning is extremely small. If we collect additional examples, we will reach close to 100% pretty quickly. The right level of performance depends on the use case. The various cost savings of software 2.0 might be worth the accuracy hit. Of course, with this silly use case, a 10% accuracy hit does not make any sense. (And again, the whole idea of using neural nets for this use case does not make much sense, this is just for illustration purposes.)

Let's try a few examples to see the software in action. For that we build a `predict` function that wraps the software with some light code needed to serve the input and to format the output. We show the results with our little test set. In parenthesis is the right answer.

In [18]:
def predict(mdl, tests, proc, collate):
    mdl.eval()
    tests = list(zip(tests,tests))
    test_ds = NumberDataset(tests, processor=proc)
    test_dl = DataLoader(test_ds, batch_size=len(tests), collate_fn=collate)
    with torch.no_grad():
        for src,tgt in test_dl:
            out = mdl(src, tgt, teacher_forcing_proba=0)
            bs, seq_len, out_dim = out.shape
            out = out.view(-1, out_dim)
            tgt = tgt[:,1:].contiguous().view(-1)
            pred = out.argmax(-1)
            return [proc.deprocess(seq) for seq in pred.view(bs, -1)]

In [19]:
to_predict = [i for i,r in test_set]
preds = predict(best_i2r_mdl, to_predict, proc, collate_fn)
for (i,r),pred in zip(test_set,preds):
    print(f'{i} -> {pred} ({r})')

4 -> I (IV)
1193 -> MCXCIII (MCXCIII)
548 -> DXLVIII (DXLVIII)
3616 -> MMMDCXVI (MMMDCXVI)
21 -> CXX (XXI)


## Converting Roman numerals into integers

I mentioned that versatility is another major benefit of software 2.0. Let's illustrate that by using the exact same neural net architecture to do a different but related task: converting Roman numerals into integers (instead of the other way around). To drive the point home, I first write the equivalent imperative program. Notice how completely different it is from the imperative program that converts integers into Roman numerals.

In [20]:
def roman_to_integer(s):
    l = len(s)
    tot = 0
    prev_n = 0
    for i in range(l):
        current_n = r2i[s[i]]
        next_n = r2i[s[i+1]] if i+1 < l else 0
        if current_n >= next_n:
            tot += (current_n - prev_n)
            prev_n = 0
        else:
            prev_n = current_n
    return tot

In [21]:
assert roman_to_integer('MCXCIII') == 1193

This was easier for me to code, but still took me some time.

Now let's turn back to our neural net. This is where software 2.0 really shines. All I have to do to repurpose my neural net is to feed it different data. I don't need to code anything. The only change I make is to add `target='integer'` when reading the data. This ensures that the examples are served in such a way that the neural net learns to produce integers from Roman numerals.

In [22]:
train_ds = NumberDataset.from_file('train', processor=proc, target='integer')
valid_ds = NumberDataset.from_file('valid', processor=proc, target='integer')
train_dl = DataLoader(train_ds, batch_size=10, collate_fn=collate_fn)
valid_dl = DataLoader(valid_ds, batch_size=10, collate_fn=collate_fn)
r2i_mdl = Seq2SeqWithAttention(vocab_size, hidden_dim=30, dropout=.3)
opt = optim.Adam(r2i_mdl.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
metric = Acc(ignore_index=pad_idx)
best_r2i_mdl = fit(100, r2i_mdl, train_dl, valid_dl, opt, criterion, metric, patience=3)

Epoch: 01
	{
    "train": {
        "loss": "0.234",
        "metric": "0.000"
    },
    "valid": {
        "loss": "0.200",
        "metric": "0.001"
    }
}
Epoch: 02
	{
    "train": {
        "loss": "0.189",
        "metric": "0.000"
    },
    "valid": {
        "loss": "0.170",
        "metric": "0.000"
    }
}
Epoch: 03
	{
    "train": {
        "loss": "0.167",
        "metric": "0.000"
    },
    "valid": {
        "loss": "0.155",
        "metric": "0.004"
    }
}
Epoch: 04
	{
    "train": {
        "loss": "0.152",
        "metric": "0.002"
    },
    "valid": {
        "loss": "0.143",
        "metric": "0.007"
    }
}
Epoch: 05
	{
    "train": {
        "loss": "0.140",
        "metric": "0.004"
    },
    "valid": {
        "loss": "0.133",
        "metric": "0.010"
    }
}
Epoch: 06
	{
    "train": {
        "loss": "0.130",
        "metric": "0.008"
    },
    "valid": {
        "loss": "0.121",
        "metric": "0.026"
    }
}
Epoch: 07
	{
    "train": {
        "los

In [23]:
test_ds = NumberDataset.from_file('test', processor=proc, target='integer')
test_dl = DataLoader(test_ds, batch_size=5, collate_fn=collate_fn)
_, test_metric = evaluate(best_r2i_mdl, test_dl, criterion, metric)
print(test_metric)

0.88


In [24]:
to_predict = [r for i,r in test_set]
preds = predict(best_r2i_mdl, to_predict, proc, collate_fn)
for (i,r),pred in zip(test_set,preds):
    print(f'{r} -> {pred} ({i})')

IV -> 4 (4)
MCXCIII -> 1193 (1193)
DXLVIII -> 548 (548)
MMMDCXVI -> 3616 (3616)
XXI -> 21 (21)


Pretty cool, even though it's not perfect. Let's collect 4000 more examples and see how it improves both programs. 

In [25]:
splits  = {
    'train': 5e3,
    'valid': 1e3,
    'test': 1e2
}

make_data(splits)

In [26]:
train_ds = NumberDataset.from_file('train', processor=proc)
valid_ds = NumberDataset.from_file('valid', processor=proc)
train_dl = DataLoader(train_ds, batch_size=10, collate_fn=collate_fn)
valid_dl = DataLoader(valid_ds, batch_size=10, collate_fn=collate_fn)
i2r_mdl = Seq2SeqWithAttention(vocab_size, hidden_dim=30, dropout=.3)
opt = optim.Adam(i2r_mdl.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
metric = Acc(ignore_index=pad_idx)
best_i2r_mdl = fit(100, i2r_mdl, train_dl, valid_dl, opt, criterion, metric, patience=3)

Epoch: 01
	{
    "train": {
        "loss": "0.146",
        "metric": "0.000"
    },
    "valid": {
        "loss": "0.107",
        "metric": "0.003"
    }
}
Epoch: 02
	{
    "train": {
        "loss": "0.101",
        "metric": "0.006"
    },
    "valid": {
        "loss": "0.101",
        "metric": "0.006"
    }
}
Epoch: 03
	{
    "train": {
        "loss": "0.079",
        "metric": "0.026"
    },
    "valid": {
        "loss": "0.069",
        "metric": "0.084"
    }
}
Epoch: 04
	{
    "train": {
        "loss": "0.063",
        "metric": "0.086"
    },
    "valid": {
        "loss": "0.046",
        "metric": "0.311"
    }
}
Epoch: 05
	{
    "train": {
        "loss": "0.047",
        "metric": "0.205"
    },
    "valid": {
        "loss": "0.031",
        "metric": "0.520"
    }
}
Epoch: 06
	{
    "train": {
        "loss": "0.035",
        "metric": "0.366"
    },
    "valid": {
        "loss": "0.023",
        "metric": "0.671"
    }
}
Epoch: 07
	{
    "train": {
        "los

In [27]:
test_ds = NumberDataset.from_file('test', processor=proc)
test_dl = DataLoader(test_ds, batch_size=5, collate_fn=collate_fn)
_, test_metric = evaluate(best_i2r_mdl, test_dl, criterion, metric)
print(test_metric)

0.96


In [28]:
to_predict = [i for i,r in test_set]
preds = predict(best_i2r_mdl, to_predict, proc, collate_fn)
for (i,r),pred in zip(test_set,preds):
    print(f'{i} -> {pred} ({r})')

4 -> IV (IV)
1193 -> MCXCIII (MCXCIII)
548 -> DXLVIII (DXLVIII)
3616 -> MMMDCXVI (MMMDCXVI)
21 -> XXI (XXI)


In [29]:
train_ds = NumberDataset.from_file('train', processor=proc, target='integer')
valid_ds = NumberDataset.from_file('valid', processor=proc, target='integer')
train_dl = DataLoader(train_ds, batch_size=10, collate_fn=collate_fn)
valid_dl = DataLoader(valid_ds, batch_size=10, collate_fn=collate_fn)
r2i_mdl = Seq2SeqWithAttention(vocab_size, hidden_dim=30, dropout=.3)
opt = optim.Adam(r2i_mdl.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
metric = Acc(ignore_index=pad_idx)
best_r2i_mdl = fit(100, r2i_mdl, train_dl, valid_dl, opt, criterion, metric, patience=3)

Epoch: 01
	{
    "train": {
        "loss": "0.180",
        "metric": "0.002"
    },
    "valid": {
        "loss": "0.138",
        "metric": "0.003"
    }
}
Epoch: 02
	{
    "train": {
        "loss": "0.122",
        "metric": "0.036"
    },
    "valid": {
        "loss": "0.098",
        "metric": "0.107"
    }
}
Epoch: 03
	{
    "train": {
        "loss": "0.084",
        "metric": "0.154"
    },
    "valid": {
        "loss": "0.062",
        "metric": "0.310"
    }
}
Epoch: 04
	{
    "train": {
        "loss": "0.057",
        "metric": "0.340"
    },
    "valid": {
        "loss": "0.042",
        "metric": "0.554"
    }
}
Epoch: 05
	{
    "train": {
        "loss": "0.039",
        "metric": "0.553"
    },
    "valid": {
        "loss": "0.026",
        "metric": "0.795"
    }
}
Epoch: 06
	{
    "train": {
        "loss": "0.027",
        "metric": "0.724"
    },
    "valid": {
        "loss": "0.019",
        "metric": "0.805"
    }
}
Epoch: 07
	{
    "train": {
        "los

In [30]:
test_ds = NumberDataset.from_file('test', processor=proc, target='integer')
test_dl = DataLoader(test_ds, batch_size=5, collate_fn=collate_fn)
_, test_metric = evaluate(best_r2i_mdl, test_dl, criterion, metric)
print(test_metric)

1.0


In [31]:
to_predict = [r for i,r in test_set]
preds = predict(best_r2i_mdl, to_predict, proc, collate_fn)
for (i,r),pred in zip(test_set,preds):
    print(f'{r} -> {pred} ({i})')

IV -> 4 (4)
MCXCIII -> 1193 (1193)
DXLVIII -> 548 (548)
MMMDCXVI -> 3616 (3616)
XXI -> 21 (21)


## Conclusion

In this post, I've described two software paradigms: imperative programming (software 1.0) and programming by example (software 2.0). I've argued that, because they currently tend to focus on non-overlapping problems, they are not easily seen as substitutes of each other. I used the (silly) example of converting Roman numerals into integers and vice-versa to illustrate that substitutability. In the process, I've highlighted the pros and cons of each paradigm. Because software 2.0 is "self-writing", the skills to "develop" that software (curate datasets of examples) are cheaper to source. In that sense, neural nets constitute "meta-software", or software that produces software. Of course, the neural nets themselves have to be coded imperatively, but the argument is that they are destined to become commodities. And because they are more easily multi-purpose, the fixed cost of coding them is amortized over multiple problems. In the above example, this was illustrated by repurposing the same neural net to solve a task that, in the software 1.0 paradigm, requires a complete rewrite of the software.