Notes for Chapter 10: NLP Deep Dive: RNNs
=========================================

Introduction & Background
-------------------------

A *language model* is a "model that has been trained to guess the next
word in a text" after reading the preceding words. This is called
"self-supervised learning": the labels come from the text itself, so we
just need to give the model a lot of text to work with. Self-supervised
learning is usually used to *pre-train* a model for later use in
transfer learning. The general process we will be following is:

-   Concatenate all of the documents in our dataset into one long
    string and split it into words (tokens).
-   The independent variable is the sequence from the first word to the
    second-to-last word. The dependent variable is the sequence from the
    second word to the last word.
-   The vocab will comprise the common words already in the vocab of the
    pretrained model plus application-specific words. We will initialize
    new embedding matrix rows for these new vocab words.
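
To make the offset between the independent and dependent variables concrete, here is a minimal plain-Python sketch. The sentence is made up for illustration, and splitting on spaces stands in for a real tokenizer.

```python
# Toy illustration of how language-model inputs and targets line up.
tokens = "the cat sat on the mat".split()

x = tokens[:-1]   # independent variable: first word to second-to-last word
y = tokens[1:]    # dependent variable: second word to last word

# Each input token is paired with the token that follows it.
for inp, target in zip(x, y):
    print(f"{inp!r} -> {target!r}")
```

At every position the target is simply the next token, which is why no manual labeling is needed.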

Some relevant terminology:

-   *Tokenization* is the process of converting text into a list of
    words (or characters or substrings)
-   *Numericalization* means listing all of the unique tokens that
    appear (the vocab) and converting each token to a number by looking
    up its index in the vocab
-   *Data Loader Creation*: We use the `LMDataLoader` class to create
    independent and dependent variables offset from each other by one
    token.
-   *Language Model Creation*: Recurrent neural network; details to
    follow.

We will go through each of these steps separately.

### Tokenization

There are many approaches to tokenization.

-   Word-based tokenization splits by spaces and also applies
    language-specific rules (e.g. separating "don't" into "do n't") to
    try to capture meaning.
-   Subword-based tokenization splits words into smaller parts based on
    common substrings.
-   Character-based tokenization splits a sentence into individual
    characters.
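
As a rough illustration of the difference between the word-based and character-based approaches, here is a plain-Python sketch. The regular expression is a crude stand-in for the language-specific rules a real tokenizer such as spaCy applies; it is an assumption made for this example, not how any library does it.

```python
import re

text = "I don't like it."

# Word-based: split off "n't" from the preceding word and separate punctuation.
word_tokens = re.findall(r"n't|\w+(?=n't)|\w+|[^\w\s]", text)

# Character-based: every character, including spaces, is its own token.
char_tokens = list(text)

print(word_tokens)
print(char_tokens)
```

The word-based version produces far fewer tokens, at the cost of needing a much larger vocab to cover a real corpus.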

1.  Word Tokenization in Fastai

    Fastai provides an interface to a range of different tokenizers in
    external libraries (it does not provide its own tokenizers). We're
    going to experiment with the IMDb dataset.

    ``` jupyter-python
    from fastai.text.all import *
    path = untar_data(URLs.IMDB)
    path.ls()
    ```

    ``` example
    (#7) [Path('/home/djliden91/.fastai/data/imdb/test'),Path('/home/djliden91/.fastai/data/imdb/imdb.vocab'),Path('/home/djliden91/.fastai/data/imdb/tmp_clas'),Path('/home/djliden91/.fastai/data/imdb/train'),Path('/home/djliden91/.fastai/data/imdb/unsup'),Path('/home/djliden91/.fastai/data/imdb/tmp_lm'),Path('/home/djliden91/.fastai/data/imdb/README')]
    ```

    Now we need to get the text files themselves.

    ``` jupyter-python
    files = get_text_files(path, folders=['train','test','unsup'])
    txt = files[0].open().read(); txt[:75]
    ```

    Now, we demonstrate the tokenization functions.

    ``` jupyter-python
    spacy = WordTokenizer()
    toks = first(spacy([txt]))
    print(coll_repr(toks,30))
    ```

    The default tokenizer used by `fastai` at the time of writing is
    `spaCy`. It identifies and handles specific language cases pretty
    well. For example:

    ``` jupyter-python
    first(spacy(['The U.S. dollar $1 is $1.00.']))
    ```

    We can apply some more specific options and tokenization formats
    using the `fastai` tokenizer class.

    ``` jupyter-python
    tkn = Tokenizer(spacy)
    print(coll_repr(tkn(txt), 31))
    ```

    In this case, tokens that start with `xx` are special tokens.
    `xxbos` indicates the start of a new text, for example. Other common
    ones:

    -   `xxmaj` indicates the following word starts with a capital
    -   `xxunk` stands in for a word that is unknown (not in the vocab)
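
    To see what these markers buy us, here is a toy imitation of the
    `xxbos` and `xxmaj` conventions in plain Python. The function name
    is hypothetical and this is not fastai's actual implementation; it
    only handles an initial capital letter.

    ```python
    def toy_special_tokens(words):
        """Toy imitation of the xxbos / xxmaj conventions (not fastai's code)."""
        out = ["xxbos"]                      # mark the start of a text
        for w in words:
            if w[:1].isupper():
                out += ["xxmaj", w.lower()]  # keep the capitalization as a separate token
            else:
                out.append(w)
        return out

    print(toy_special_tokens(["This", "film", "is", "Amazing"]))
    ```

    Lowercasing the word and recording the capital separately means
    "This" and "this" share a single vocab entry.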

    We can check on the default rules as follows:

    ``` jupyter-python
    defaults.text_proc_rules
    ```

    1.  Subword Tokenization

        Particularly with languages such as Japanese or Chinese, spaces
        do not provide a good guide to divisions between words. Other
        languages group many "subwords" together into a single word, but
        the subwords themselves can contain various separate meanings.
        Subword tokenization can deal with these sorts of situations.
        This is a two-step process:

        1.  Analyze a corpus of documents. Find the most commonly
            occurring groups of letters. These are the vocab.
        2.  Tokenize using the subword vocab from (1)
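
        The two steps can be sketched in plain Python. This is a
        deliberately simplified illustration: counting 2- and 3-letter
        substrings and matching greedily is an assumption made for this
        example, not the algorithm a real subword tokenizer uses, and
        the corpus and function name are hypothetical.

        ```python
        from collections import Counter

        corpus = ["lower", "lowest", "newer", "newest", "wider", "widest"]

        # Step 1: find the most common letter groups (here, 2- and 3-letter substrings).
        counts = Counter(
            word[i:i + n]
            for word in corpus
            for n in (2, 3)
            for i in range(len(word) - n + 1)
        )
        vocab = {piece for piece, _ in counts.most_common(6)}
        vocab |= {ch for word in corpus for ch in word}  # single letters as a fallback

        # Step 2: tokenize greedily, always taking the longest piece found in the vocab.
        def toy_subword_tokenize(word, vocab, max_len=3):
            pieces, i = [], 0
            while i < len(word):
                for n in range(min(max_len, len(word) - i), 0, -1):
                    if word[i:i + n] in vocab:
                        pieces.append(word[i:i + n])
                        i += n
                        break
                else:
                    pieces.append(word[i])  # unseen character: keep it as its own piece
                    i += 1
            return pieces

        print(toy_subword_tokenize("lowest", vocab))
        ```

        Whatever pieces are chosen, joining them back together always
        recovers the original word.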

        Let's try it out.

        ``` jupyter-python
        txts = L(o.open().read() for o in files[:2000])

        def subword(sz):
            sp = SubwordTokenizer(vocab_sz=sz)
            sp.setup(txts)
            return ' '.join(first(sp([txt]))[:40])

        subword(1000)
        ```

        ``` example
        ▁what ▁can ▁i ▁say , ▁this ▁film ▁is ▁amazing . ▁it ▁has ▁its ▁fla w s ▁like ▁every ▁film ▁does ▁for ▁example ▁w o b b ly ▁he ad st one s ▁in ▁a ▁gr a ve y ard ,
        ```

        The special character `▁` represents a space in the original
        text.

        The number we passed to `subword` represents the size of the
        vocab. We can use a smaller vocab:

        ``` jupyter-python
        subword(200)
        ```

        In this case, each token represents fewer characters, so it
        takes more tokens to represent the same sentence. We can also
        see what happens when we use a *larger* vocab.

        ``` jupyter-python
        subword(10000)
        ```

        Here we see that our vocab is coming closer to capturing full
        words. What considerations guide the choice of vocab size? A
        smaller vocab means a smaller embedding matrix, which requires
        less data to learn. A larger vocab requires a larger embedding
        matrix and thus more data, but it means fewer tokens per
        sentence, which translates to faster training, less memory, and
        fewer states for the model to remember.
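
        To put rough numbers on the embedding-matrix side of this
        trade-off, the sketch below counts embedding weights for the
        three vocab sizes tried above. The embedding width of 400 is an
        assumption chosen for illustration, and the helper name is
        hypothetical.

        ```python
        def embedding_weights(vocab_sz, emb_dim):
            """Number of weights in an embedding matrix: one row of emb_dim values per token."""
            return vocab_sz * emb_dim

        emb_dim = 400  # assumed embedding width, for illustration only

        for vocab_sz in (200, 1000, 10000):
            print(f"vocab {vocab_sz:>6}: {embedding_weights(vocab_sz, emb_dim):>9,} embedding weights")
        ```

        The matrix grows linearly with the vocab size, so each row is
        seen less often in a fixed amount of training text.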

    2.  Numericalization with fastai

        The next step is to map tokens to integers. We can do this as
        follows:

        ``` jupyter-python
        # Revisiting our Tokenizer from before
        toks = tkn(txt)
        print(coll_repr(toks, 31))
        ```

        ``` jupyter-python
        # Prepare a subset for numericalization
        toks200 = txts[:200].map(tkn)
        toks200[0]
        ```

        We apply the numericalization with the `Numericalize` class.

        ``` jupyter-python
        # Numericalize
        # Lists words -- first special tokens, then in frequency order
        num = Numericalize()
        num.setup(toks200)
        coll_repr(num.vocab, 20)
        ```

        ``` jupyter-python
        # print some text in numericalized form
        nums = num(toks)[:20]; nums
        ```

        ``` jupyter-python
        # Map back to original text
        ' '.join(num.vocab[o] for o in nums)
        ```

        ``` example
        xxbos what can i say , this film is amazing . it has its flaws like every film does for
        ```
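
        The same idea can be sketched without fastai. The helper names
        and the tiny token lists below are hypothetical, and the sketch
        leaves out details such as a maximum vocab size; it only shows
        special tokens first, then words in frequency order, with
        unknown words mapped to `xxunk`.

        ```python
        from collections import Counter

        def build_vocab(token_lists, specials=("xxunk", "xxpad", "xxbos"), min_freq=1):
            """Special tokens first, then the remaining tokens in descending frequency order."""
            counts = Counter(tok for toks in token_lists for tok in toks)
            words = [w for w, c in counts.most_common() if c >= min_freq and w not in specials]
            return list(specials) + words

        docs = [
            ["xxbos", "this", "film", "is", "great"],
            ["xxbos", "this", "film", "is", "bad", "film"],
        ]
        vocab = build_vocab(docs)
        o2i = {w: i for i, w in enumerate(vocab)}  # token -> index lookup

        def numericalize(tokens):
            """Map tokens to indices; anything not in the vocab becomes xxunk."""
            return [o2i.get(t, o2i["xxunk"]) for t in tokens]

        ids = numericalize(["xxbos", "this", "film", "is", "terrible"])
        print(ids)
        print(" ".join(vocab[i] for i in ids))
        ```

        Mapping back from indices gives the original tokens, except that
        the out-of-vocab word comes back as `xxunk`.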

    3.  Putting Texts into Batches for a Language Model

        We can't just resize text to the desired dimensions as we could
        with images. We want our batches to run in order, each picking
        up where the last left off. Another challenge is that
        language-model datasets typically contain a large number of
        tokens, likely more than can fit in GPU memory at once. At each
        epoch, we (1) shuffle our collection of documents; (2)
        concatenate them into a stream of tokens; (3) cut that stream
        into a batch of fixed-size mini-streams, which the model reads
        in order.
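
        Step (3) is the least obvious, so here is a plain-Python sketch
        of it. The function name is hypothetical, integers stand in for
        token ids, shuffling is omitted, and this is not how
        `LMDataLoader` is actually implemented; it only shows the
        layout.

        ```python
        def lm_batches(stream, bs, seq_len):
            """Yield (x, y) batches; row i of each batch continues row i of the previous batch."""
            n = (len(stream) - 1) // bs  # tokens per mini-stream, leaving one spare for the last target
            rows = [stream[i * n:(i + 1) * n + 1] for i in range(bs)]
            for start in range(0, n, seq_len):
                end = min(start + seq_len, n)
                x = [row[start:end] for row in rows]
                y = [row[start + 1:end + 1] for row in rows]  # same tokens, shifted by one
                yield x, y

        stream = list(range(25))  # stand-in for a numericalized token stream
        batches = list(lm_batches(stream, bs=2, seq_len=4))
        for x, y in batches:
            print(x, y)
        ```

        Reading the first row of each successive batch reproduces the
        start of the stream in order, which is what lets the model
        carry its state from one batch to the next.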

        Let's make our dataloader and take a look at one batch:

        ``` jupyter-python
        # numericalize
        nums200 = toks200.map(num)

        # pass to LMDataLoader
        dl = LMDataLoader(nums200)

        # check results by looking at first batch
        x,y = first(dl)
        print(x.shape, y.shape)

        # Look at first row of independent variable
        ' '.join(num.vocab[o] for o in x[0][:20])
        ```

        Now we check out the dependent variable. Note that it is offset
        from the independent variable by one position.

        ``` jupyter-python
        # Look at dependent variable
        # same as independent but offset by one

        ' '.join(num.vocab[o] for o in y[0][:20])
        ```

    4.  The easier way of preprocessing: DataBlock

        We can, of course, use the high-level DataBlock API to prepare
        our data for the model. Specifically, we use a `TextBlock`.

        ``` jupyter-python
        # get items function
        get_imdb = partial(get_text_files, folders=['train','test','unsup'])

        # datablock
        dls_lm = DataBlock(
            blocks=TextBlock.from_folder(path, is_lm=True),
            get_items=get_imdb, splitter=RandomSplitter(0.1)
            ).dataloaders(path, path=path, bs=128, seq_len=80)

        dls_lm.show_batch(max_n=2)
        ```

        `TextBlock` implements a few efficiency optimizations:

        -   saves the tokenized documents in a temp folder so it doesn't
            need to do it more than once
        -   runs processes in parallel to take advantage of multiple
            CPUs.

2.  Fine-Tune the Language Model

    We will use a recurrent neural network (RNN) with an architecture
    called "AWD-LSTM". When fine-tuning, the embeddings in the
    pretrained model are merged with randomly initialized embeddings for
    words *not* in the original vocab. The learner handles this
    automatically.

    ``` jupyter-python
    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3,
        metrics=[accuracy, Perplexity()]) #.to_fp16() requires GPU
    ```

    -   Cross-entropy loss is used. We basically have a classification
        problem; the different categories are the words in our vocab.
    -   The perplexity metric is the exponential of the loss.
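
    A small worked example of the relationship between the two, in plain
    Python. The helper name and the probabilities are made up for
    illustration: a model that guesses uniformly over a four-word vocab
    assigns probability 1/4 to every correct token.

    ```python
    import math

    def cross_entropy(probs_of_correct_token):
        """Mean negative log-probability assigned to the correct next token."""
        return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)

    uniform = [1 / 4] * 10          # uniform guessing over a 4-word vocab
    loss = cross_entropy(uniform)
    perplexity = math.exp(loss)     # exponential of the loss

    print(f"loss={loss:.4f}, perplexity={perplexity:.4f}")
    ```

    Under uniform guessing the perplexity equals the vocab size, so
    perplexity can be read loosely as the number of words the model is
    effectively choosing between.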

    It takes a long time to train each epoch, so we go one at a time and
    save the in-between results.

    ``` jupyter-python
    learn.fit_one_cycle(1, 2e-2)
    learn.reco
    ```

    And that killed the kernel, so we'll work on this later.


Testing Jupyter Mode Source Blocks
==================================

``` jupyter-python
x = 'foo'
y = 'bar'
x + ' ' + y
```

> What does `(shell-command-to-string "jupyter kernelspec list")`
> return?
>
> If it doesn't fail and returns the kernelspecs, then it might just be
> that by the time the ob-jupyter file is loaded (which is when we try
> to get the kernelspecs) the paths used by Emacs to search for shell
> programs aren't setup yet. If this is the case you should be able to
> call `(org-babel-jupyter-aliases-from-kernelspecs)` to get everything
> working again.
>
> The ob-jupyter file is loaded whenever that
> org-babel-do-load-languages call is evaluated so you should check to
> see that (executable-find "jupyter") returns a valid path right before
> the call.

OK! So that works, but it looks like the problem was the order in which
I started my virtual environment relative to Emacs. That makes sense
given that I am using emacs-daemon, so Emacs is initialized well before
I start my virtual environment. Now, onto business!