In [3]:
#!pip install fastai --upgrade

We have seen what `Tokenizer` and `Numericalize` do to a collection of texts, and how they're used inside the data block API, which handles those transforms for us directly using the `TextBlock`. But what if we want to apply only one of those transforms, either to see intermediate results or because we have already tokenized texts? More generally, what can we do when the data block API is not flexible enough to accommodate our particular use case? For this, we need to use fastai's *mid-level API* for processing data. The data block API is built on top of that layer, so the mid-level API will allow you to do everything the data block API does and much more.

## Going Deeper into fastai's Layered API

The fastai library is built on a *layered API*. In the very top layer are *applications* that allow us to train a model in five lines of code, as we saw in Chapter 1. In the case of creating `DataLoaders` for a text classifier, for instance, we used this line:

In [4]:
from fastai.text.all import *

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')

The factory method `TextDataLoaders.from_folder` is very convenient when your data is arranged the exact same way as the IMDb dataset, but in practice, that often won't be the case. The data block API offers more flexibility. As we saw in the preceding chapter, we can get the same result with the following:

In [5]:
path = untar_data(URLs.IMDB)
dls = DataBlock(
    blocks = (TextBlock.from_folder(path), CategoryBlock),
    get_y = parent_label,
    get_items = partial(get_text_files, folders=['train', 'test']),
    splitter = GrandparentSplitter(valid_name='test')
).dataloaders(path)

But even the data block API is sometimes not flexible enough. For debugging purposes, for instance, we might need to apply just parts of the transforms that come with this data block. Or we might want to create a `DataLoaders` for an application that isn't directly supported by fastai. In this section, we will dig into the pieces that are used inside fastai to implement the data block API. Understanding these will enable you to leverage the power and flexibility of this mid-tier API.

The mid-level API does not contain only functionality for creating `DataLoaders`. It also has the *callback* system, which allows us to customize the training loop any way we like, and the *general optimizer*. Both will be covered in Chapter 16.

### Transforms

When we studied tokenization and numericalization in the preceding chapter, we started by grabbing a bunch of texts:

In [6]:
files = get_text_files(path, folders=['train', 'test'])
txts = L(o.open().read() for o in files[:2000])

We then showed how to tokenize them with a `Tokenizer`:

In [8]:
tok = Tokenizer.from_folder(path)
tok.setup(txts)
toks = txts.map(tok)
toks[0]

(#357) ['xxbos','xxmaj','for','starters',',','i','did',"n't",'even','know'...]

and how to numericalize, including automatically creating the vocab for our corpus:

In [9]:
num = Numericalize()
num.setup(toks)
nums = toks.map(num)
nums[0][:10]

tensor([  2,   8,  29,   0,  10,  19, 108,  42,  87, 142])

The classes also have a `decode` method. For instance, `Numericalize.decode` gives us back the string tokens:

In [10]:
nums_dec = num.decode(nums[0][:10])
nums_dec

(#10) ['xxbos','xxmaj','for','xxunk',',','i','did',"n't",'even','know']

`Tokenizer.decode` turns this back into a single string (it may not, however, be exactly the same as the original string; this depends on whether the tokenizer is *reversible*, which the default word tokenizer is not at the time of writing):

In [11]:
tok.decode(nums_dec)

"xxbos xxmaj for xxunk , i did n't even know"
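To make the idea of a non-reversible tokenizer concrete, here is a minimal sketch in plain Python. The functions `toy_tokenize` and `toy_detokenize` are hypothetical and far simpler than fastai's `Tokenizer`; they only illustrate why decoding may not give back the exact original string.

```python
import re

def toy_tokenize(text):
    # Lowercase, then split into word chunks and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_detokenize(tokens):
    # Join with single spaces; the original casing and spacing are lost
    return " ".join(tokens)

original = "For starters, I didn't even know"
tokens = toy_tokenize(original)
restored = toy_detokenize(tokens)

print(tokens)
print(restored == original)  # the round trip does not recover the input exactly
```

Because lowercasing and whitespace handling discard information, no `decode` could recover the input exactly; a reversible tokenizer has to keep that information in its tokens.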

`decode` is used by fastai's `show_batch` and `show_results`, as well as some other inference methods, to convert predictions and mini-batches into a human-understandable representation.

For each of `tok` and `num` in the preceding examples, we created an object, called its `setup` method (which trains the tokenizer if needed for `tok` and creates the vocab for `num`), applied it to our raw texts (by calling the object as a function), and then finally decoded the result back to an understandable representation. These steps are needed for most data preprocessing tasks, so fastai provides a class that encapsulates them. This is the `Transform` class. Both `Tokenizer` and `Numericalize` are `Transform`s.

In general, a `Transform` is an object that behaves like a function and has an optional `setup` method that will initialize an inner state (like the vocab inside `num`) and an optional `decode` method that will reverse the function (this reversal may not be perfect, as we saw with `tok`).
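The three pieces described here (call, `setup`, `decode`) can be sketched without fastai at all. The class below, `ToyNumericalize`, is a hypothetical stand-in written for illustration; it is not how `Numericalize` is implemented, and reserving index 0 for `xxunk` is an assumption of this sketch.

```python
class ToyNumericalize:
    """Hypothetical transform-like object: callable, with setup and decode."""

    def __init__(self):
        self.vocab = []  # inner state, filled in by setup
        self.o2i = {}

    def setup(self, token_lists):
        # Build the vocab from every token seen; index 0 is kept for unknowns
        seen = sorted({t for toks in token_lists for t in toks})
        self.vocab = ['xxunk'] + seen
        self.o2i = {t: i for i, t in enumerate(self.vocab)}

    def __call__(self, toks):
        # Encode: map each token to its index, falling back to 0 (xxunk)
        return [self.o2i.get(t, 0) for t in toks]

    def decode(self, ids):
        # Decode: map indices back to tokens; unknown tokens stay 'xxunk'
        return [self.vocab[i] for i in ids]

num_toy = ToyNumericalize()
num_toy.setup([['i', 'did', 'know'], ['i', 'even', 'know']])
ids = num_toy(['i', 'did', 'not', 'know'])
decoded = num_toy.decode(ids)
print(ids, decoded)
```

The token `'not'` was never seen during setup, so it decodes to `'xxunk'` rather than to itself, which is the kind of imperfect reversal mentioned above.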

A good example of `decode` is found in the `Normalize` transform that we saw in Chapter 7: to be able to plot the images, its `decode` method undoes the normalization (i.e., it multiplies by the standard deviation and adds back the mean). On the other hand, data augmentation transforms do not have a `decode` method, since we want to show the effects on images to make sure the data augmentation is working as we want.
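As a rough illustration of what "undoing the normalization" means, here is a plain-Python sketch. The names `encode_pixel` and `decode_pixel` and the values of `mean` and `std` are made up for this example; fastai's `Normalize` works on batches of image tensors rather than single numbers.

```python
import math

mean, std = 0.5, 0.25  # assumed statistics, chosen only for illustration

def encode_pixel(x):
    # Normalize: subtract the mean, then divide by the standard deviation
    return (x - mean) / std

def decode_pixel(x):
    # Undo the normalization: multiply by the standard deviation, add the mean
    return x * std + mean

pixels = [0.2, 0.5, 0.8]
normalized = [encode_pixel(p) for p in pixels]
recovered = [decode_pixel(n) for n in normalized]

# Up to floating-point error, decoding gives back the displayable values
print(all(math.isclose(p, r) for p, r in zip(pixels, recovered)))
```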

A special behavior of `Transforms` is that they always get applied over tuples. In general, our data is always a tuple `(input, target)` (sometimes with more than one input or more than one target). When applying a transform on an item like this, such as `Resize`, we don't want to resize the tuple as a whole; instead, we want to resize the input (if applicable) and the target (if applicable) separately. It's the same for batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.

We can see this behavior if we pass a tuple of texts to `tok`:

In [12]:
tok((txts[0], txts[1]))

((#357) ['xxbos','xxmaj','for','starters',',','i','did',"n't",'even','know'...],
 (#131) ['xxbos','xxmaj','despite','xxmaj','disney',"'s",'best','efforts',',','this'...])
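The tuple behavior can be imitated in a few lines of plain Python. The helper `apply_over_tuples` below is hypothetical and only mimics the idea; the real `Transform` additionally takes the type of each element into account.

```python
def apply_over_tuples(fn):
    """Hypothetical helper: wrap fn so tuples are processed element by element."""
    def wrapper(x):
        if isinstance(x, tuple):
            # Apply to each element separately and rebuild the tuple
            return tuple(wrapper(o) for o in x)
        return fn(x)
    return wrapper

simple_tok = apply_over_tuples(str.split)

single = simple_tok("for starters")
pair = simple_tok(("for starters", "despite best efforts"))
print(single)
print(pair)
```

A single string is tokenized directly, while a tuple comes back as a tuple of tokenized elements, mirroring the output of `tok` above.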

### Writing Your Own Transform

If you want to write a custom transform to apply to your data, the easiest way is to write a function. As you can see in this example, a `Transform` will be applied only to a matching type, if a type is provided (otherwise, it will always be applied). In the following code, the `:int` in the function signature means that `f` gets applied only to `ints`. That's why `tfm(2.0)` returns 2.0, but `tfm(2)` returns 3 here:

In [13]:
def f(x:int):
  return x + 1

In [16]:
tfm = Transform(f)

In [17]:
tfm(2), tfm(2.0)

(3, 2.0)

Here, `f` is converted to a `Transform` with no `setup` and no `decode` method.
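To see how a type annotation can decide whether a function is applied, here is a minimal sketch using only the standard library. `TypedTransform` is a hypothetical class for illustration; it handles a single annotated parameter, and the real `Transform` is more general than this.

```python
import inspect

class TypedTransform:
    """Hypothetical wrapper: applies fn only to inputs matching its annotation."""

    def __init__(self, fn):
        self.fn = fn
        # Read the annotation of the first parameter, if there is one
        first = list(inspect.signature(fn).parameters.values())[0]
        ann = first.annotation
        self.typ = None if ann is inspect.Parameter.empty else ann

    def __call__(self, x):
        # No annotation: always apply. Otherwise apply only on a type match.
        if self.typ is None or isinstance(x, self.typ):
            return self.fn(x)
        return x  # non-matching inputs pass through unchanged

def add_one(x: int):
    return x + 1

tfm_toy = TypedTransform(add_one)
results = (tfm_toy(2), tfm_toy(2.0))
print(results)
```

The integer is incremented, while the float does not match the `int` annotation and is returned untouched.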

Python has a special syntax for passing a function (like `f`) to another function (or something that behaves like a function, known as a *callable* in Python), called a *decorator*. A decorator is used by prepending a callable with @ and placing it before a function definition. The following is identical to the previous code:

In [18]:
@Transform
def f(x:int):
  return x+1
f(2), f(2.0)

(3, 2.0)
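The equivalence between the decorator syntax and an explicit call holds for any callable, not just `Transform`. The following plain-Python sketch uses a hypothetical decorator named `shout` to show both forms side by side.

```python
def shout(fn):
    """Hypothetical decorator: upper-cases whatever fn returns."""
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs).upper()
    return wrapper

# Explicit form: define the function, then pass it to the callable
def greet_plain(name):
    return f"hello {name}"
greet_plain = shout(greet_plain)

# Decorator form: the @ line does the same reassignment for us
@shout
def greet_decorated(name):
    return f"hello {name}"

print(greet_plain("fastai"), greet_decorated("fastai"))
```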

If you need either `setup` or `decode`, you will need to subclass `Transform` to implement the actual encoding behavior in `encodes`, then (optionally) the setup behavior in `setups` and the decoding behavior in `decodes`:

In [21]:
class NormalizeMean(Transform):
  def setups(self, items):
    self.mean = sum(items)/len(items)
  
  def encodes(self, x):
    return x-self.mean

  def decodes(self, x):
    return x+self.mean

Here, `NormalizeMean` will initialize a certain state during the setup (the mean of all elements passed); then the transformation is to subtract that mean. For decoding purposes, we implement the reverse of that transformation by adding the mean. Here is an example of `NormalizeMean` in action:

In [23]:
tfm = NormalizeMean()
tfm.setup([1,2,3,4,5])
start = 2
y = tfm(start)
z = tfm.decode(y)
tfm.mean, y, z

(3.0, -1.0, 2.0)