In [2]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [3]:
from fastbook import *
from IPython.display import display,HTML

## NLP Deep Dive: RNNs
- A language model is a model that has been trained to guess the next word in a text (having read the ones before)
- This is called self-supervised learning because we do not need to give labels to our model, just feed it lots and lots of text
- A self-supervised model is not usually the end goal; instead it is used as a pretrained model for transfer learning
- Better results can be obtained by fine-tuning the sequence-based language model before fine-tuning the classification model
    - for instance, for IMDb review sentiment analysis we can use 100,000 movie reviews to fine-tune the pretrained model that was originally trained on Wikipedia articles.
    - this results in a language model that is particularly good at predicting the next word of a movie review and keeping the style consistent.

- This is known as the Universal Language Model Fine-tuning (ULMFiT) approach
- the extra stage of fine-tuning a language model prior to transfer learning to a classification task resulted in significantly better predictions
- three stages of transfer learning in NLP
    - Wikitext Language Model -> IMDb Language Model -> IMDb Classifier

## Text Preprocessing
- Approach we take for single categorical variable
    - make a list of all possible levels of that categorical variable (called vocab)
    - replace each level with its index in the vocab
    - create an embedding matrix for this containing a row for each level (each item of the vocab)
    - use this embedding matrix as the first layer of a neural net
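
The four steps above can be sketched in plain Python. This is a minimal illustration, not fastai's implementation: the `column` values, `emb_dim`, and the random initialization are all assumptions made up for the example.

```python
import random

# Hypothetical categorical column; the values are made up for illustration.
column = ["red", "green", "blue", "green", "red"]

# 1. List every possible level once (the vocab).
vocab = sorted(set(column))

# 2. Replace each level with its index in the vocab.
level_to_idx = {level: i for i, level in enumerate(vocab)}
indices = [level_to_idx[level] for level in column]

# 3. Create one embedding row per level; emb_dim is an arbitrary choice here.
random.seed(0)
emb_dim = 4
embedding = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in vocab]

# 4. The first layer of the network is then just a row lookup.
first_layer_output = [embedding[i] for i in indices]

print(vocab)
print(indices)
```

In a real model the embedding rows would be learnable parameters rather than fixed random lists.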

- The same can be done with text
- Concatenate all documents in our dataset into a long string and split it into words (tokens)
- Our independent variable will be the sequence of words starting with the first in our long list and ending with the second to last
- the dependent variable will be the sequence of words starting with the second word and ending with the last word
- the vocab will consist of a mix of common words that are already in the vocabulary of the pretrained model and new words specific to our corpus (for the IMDb example, actor names or cinematographic terms)
- For words in the vocab of our pretrained model we will take the corresponding row in the embedding matrix. New words will be initialized with a random vector
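
As a rough sketch of the two ideas above (the one-token offset, and reusing pretrained embedding rows where they exist), assuming a made-up two-document corpus and a hypothetical `pretrained_emb` lookup with two-dimensional vectors:

```python
import random

# Hypothetical mini-corpus: two "documents" concatenated into one token stream.
docs = ["the movie was great", "the actor was brilliant"]
tokens = " ".join(docs).split()

# Independent variable: first token through second-to-last.
x = tokens[:-1]
# Dependent variable: second token through last, i.e. x shifted by one token.
y = tokens[1:]

# Hypothetical pretrained embedding rows (2-dimensional for brevity).
pretrained_emb = {"the": [0.1, 0.2], "was": [0.3, 0.4], "great": [0.5, 0.6]}

random.seed(0)
corpus_vocab = sorted(set(tokens))
# Reuse the pretrained row when the word is known; otherwise start from a random vector.
emb = {
    w: pretrained_emb[w] if w in pretrained_emb else [random.random(), random.random()]
    for w in corpus_vocab
}

for xi, yi in list(zip(x, y))[:3]:
    print(f"{xi!r} -> {yi!r}")
```

Each position in `y` is the word the model should predict after seeing the corresponding position in `x`.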

- steps necessary to create a language model
    - Tokenization (converting text into a list of words)
    - Numericalization (make a list of all unique words that appear (the vocab), and convert each word into a number by looking up its index in the vocab)
    - Language model data loader creation
        - Fastai provides a LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token.
        - also handles details such as how to shuffle the training data in a way that the dependent and independent variables maintain their structure as required
    - Language model creation
        - using a recurrent neural network

## Tokenization
- three main approaches
    - Word-based: split a sentence on spaces while applying language-specific rules to try to separate parts of meaning when there are no spaces.
    - Subword-based: split words into smaller parts based on the most commonly occurring substrings. "occasion" might be "o c ca sion".
    - Character-based: split sentence into individual characters
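
The three approaches can be compared on a toy sentence. This is a simplified sketch, not what fastai or spaCy do internally: the regex, the `subword_tokenize` helper, and the `pieces` vocabulary are hypothetical and chosen so that "occasion" splits as in the example above.

```python
import re

sentence = "The occasion was great."

# Word-based: split on spaces and separate punctuation.
# (A real tokenizer such as spaCy applies many more language-specific rules.)
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)

def subword_tokenize(word, pieces):
    """Greedy longest-match split of `word` into known pieces."""
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out

# Subword-based: a hypothetical vocabulary of common substrings.
pieces = {"o", "c", "ca", "sion"}
subword_tokens = subword_tokenize("occasion", pieces)

# Character-based: every character is a token.
char_tokens = list(sentence)

print(word_tokens)
print(subword_tokens)
print(char_tokens[:8])
```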

In [4]:
# Word tokenization with fastai
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [5]:
# grab the text files
files = get_text_files(path, folders = ["train", "test", "unsup"])

In [6]:
txt = files[0].open().read(); txt[:75]

'Jiang Xian uses the complex backstory of Ling Ling and Mao Daobing to study'

In [7]:
# Use WordTokenizer to create tokens
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#143) ['Jiang','Xian','uses','the','complex','backstory','of','Ling','Ling','and','Mao','Daobing','to','study','Mao',"'s",'"','cultural','revolution','"','(','1966','-','1976',')','at','the','village','level','.'...]


In [8]:
# fastai adds additional functionality to the tokenization process with the Tokenizer class
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))


(#158) ['xxbos','xxmaj','jiang','xxmaj','xian','uses','the','complex','backstory','of','xxmaj','ling','xxmaj','ling','and','xxmaj','mao','xxmaj','daobing','to','study','xxmaj','mao',"'s",'"','cultural','revolution','"','(','1966','-'...]


In [9]:
# those starting with "xx" are special tokens
# xxbos indicates start of new text (beginning of stream)
# this lets the model know if it needs to forget what was said before (given it is the start of a new stream)

# xxmaj indicates the next word begins with a capital (we lowercased everything before)
# xxunk indicates the word is unknown

# see default rules
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]
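
To make the special tokens concrete, here is a much-simplified sketch of what rules like `replace_maj` and `lowercase` achieve together. The `mark_special` function is hypothetical: it splits on whitespace only and ignores the other rules (repeated characters, all-caps words, HTML cleanup).

```python
def mark_special(text):
    """Simplified illustration: add xxbos, flag capitalized words with xxmaj, lowercase."""
    tokens = ["xxbos"]  # mark the start of a new text
    for word in text.split():
        if word[0].isupper():
            tokens.append("xxmaj")  # the next token started with a capital letter
        tokens.append(word.lower())
    return tokens

print(mark_special("Jiang Xian uses the complex backstory"))
```

Lowercasing shrinks the vocab, while `xxmaj` keeps the capitalization information available to the model.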

In [10]:
# Subword Tokenization
# word tokenization assumes that spaces provide a useful separation of components of meaning in a sentence
# subword tokenization is best used where that assumption fails: languages without spaces (Chinese) or that use few spaces (Hungarian)
# two steps
    # analyze a corpus of documents to find the most commonly occurring groups of letters (these become the vocab)
    # tokenize the corpus using this vocab of subword units
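
One way to picture the first step is a byte-pair-encoding style merge loop, sketched below. Note the assumption: this is a different and simpler algorithm than the unigram model that SentencePiece is configured to use in this notebook, and `learn_merges` and the toy word list are hypothetical.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent pair of symbols (BPE-style)."""
    # Start with every word as a sequence of single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the pair with the merged symbol.
        new_corpus = []
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges, corpus

merges, tokenized = learn_merges(["sing", "ring", "king", "win"], num_merges=2)
print(merges)
print(tokenized)
```

The learned merges play the role of the subword vocab, and `tokenized` shows the toy corpus split using it.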

In [11]:
# we instantiate our tokenizer, passing in the size of the vocab
# we then need to train it, i.e. have it read our docs to find the common sequences
txts = L(o.open().read() for o in files[:2000])

# training is done with setup
# setup is a fastai method that is called automatically in our data processing pipelines
# we have to call it ourselves since we are doing this manually
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return " ".join(first(sp([txt]))[:40])

subword(1000)

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=tmp/texts.out --vocab_size=1000 --model_prefix=tmp/spm --character_coverage=0.99999 --model_type=unigram --unk_id=9 --pad_id=-1 --bos_id=-1 --eos_id=-1 --minloglevel=2 --user_defined_symbols=▁xxunk,▁xxpad,▁xxbos,▁xxeos,▁xxfld,▁xxrep,▁xxwrep,▁xxup,▁xxmaj --hard_vocab_limit=false


'▁J i ang ▁ X ian ▁us es ▁the ▁comp le x ▁back st or y ▁of ▁L ing ▁L ing ▁and ▁Ma o ▁Da o b ing ▁to ▁st u d y ▁Ma o \' s ▁" c ul'

In [13]:
# if a smaller vocab is used each token will represent fewer characters
subword(200)

'▁ J i an g ▁ X i an ▁ us es ▁the ▁c o m p le x ▁b a ck st or y ▁of ▁ L ing ▁ L ing ▁and ▁ M a o ▁ D a'

In [14]:
# picking a subword vocab size represents a compromise
# a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember
# the downside is a larger embedding matrix, which requires more data to learn

# subword tokenization provides a way to easily scale between character tokenization and word tokenization
# it has grown considerably in popularity recently
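
The trade-off can be checked roughly against the two truncated outputs shown earlier. The sketch below counts characters per token (including the `▁` word-boundary marker) and the size of an embedding matrix; `emb_dim = 400` is an assumed width used only for illustration.

```python
# Truncated outputs copied from the subword(1000) and subword(200) calls.
out_1000 = "▁J i ang ▁ X ian ▁us es ▁the ▁comp le x ▁back st or y ▁of ▁L ing ▁L ing ▁and ▁Ma o ▁Da o b ing ▁to ▁st u d y ▁Ma o ' s ▁\" c ul"
out_200 = "▁ J i an g ▁ X i an ▁ us es ▁the ▁c o m p le x ▁b a ck st or y ▁of ▁ L ing ▁ L ing ▁and ▁ M a o ▁ D a"

def chars_per_token(output):
    """Average token length, counting the word-boundary marker as a character."""
    tokens = output.split(" ")
    return sum(len(t) for t in tokens) / len(tokens)

emb_dim = 400  # assumed embedding width, for illustration only
for vocab_sz, output in [(1000, out_1000), (200, out_200)]:
    n_params = vocab_sz * emb_dim  # one row of emb_dim weights per vocab item
    print(f"vocab={vocab_sz}: {chars_per_token(output):.2f} chars/token, "
          f"{n_params} embedding weights")
```

The same 40 tokens cover more text with the larger vocab, at the cost of a proportionally larger embedding matrix.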

In [15]:
# Numericalization with fastai
# mapping tokens to integers, in the same way as for a categorical variable
    # make a list of all possible levels of that categorical variable (the vocab)
    # replace each level with its index in the vocab
toks = tkn(txt)
print(coll_repr(toks, 31))

(#158) ['xxbos','xxmaj','jiang','xxmaj','xian','uses','the','complex','backstory','of','xxmaj','ling','xxmaj','ling','and','xxmaj','mao','xxmaj','daobing','to','study','xxmaj','mao',"'s",'"','cultural','revolution','"','(','1966','-'...]


In [16]:
# we need to call setup to create the vocab
# we need our tokenized corpus first
# since tokenization takes a while this example will use a small subset
toks200 = txts[:200].map(tkn)
toks200[0]

(#158) ['xxbos','xxmaj','jiang','xxmaj','xian','uses','the','complex','backstory','of'...]

In [17]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#2152) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','in','it','i'...]"

In [18]:
# the special rule tokens appear first, then every word appears once, in frequency order
# the defaults for Numericalize are min_freq=3 and max_vocab=60000
# Once Numericalize is created we can use it as a function
nums = num(toks)[:20]
nums

TensorText([   2,    8,    0,    8,    0, 1269,    9, 1270,    0,   14,    8,    0,    8,    0,   12,    8,    0,    8,    0,   15])

In [20]:
# tokens have been converted to a tensor of integers that our model can receive
# check that they map back to the original text
" ".join(num.vocab[o] for o in nums)

'xxbos xxmaj xxunk xxmaj xxunk uses the complex xxunk of xxmaj xxunk xxmaj xxunk and xxmaj xxunk xxmaj xxunk to'
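
The many `xxunk` tokens above come from the `min_freq` cutoff: with only 200 reviews, rare words such as names do not appear often enough to enter the vocab. A simplified sketch of the idea, where `build_vocab` and `numericalize` are hypothetical helpers and not fastai's `Numericalize`:

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld", "xxrep", "xxwrep", "xxup", "xxmaj"]

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Special tokens first, then words in frequency order, dropping rare ones."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [w for w, c in counts.most_common(max_vocab)
             if c >= min_freq and w not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    """Map each token to its vocab index; unknown tokens map to 0 (xxunk)."""
    idx = {w: i for i, w in enumerate(vocab)}
    return [idx.get(t, 0) for t in tokens]

# Toy corpus; min_freq is lowered to 2 so that a few words survive.
corpus = [["xxbos", "the", "movie", "the", "the"],
          ["xxbos", "the", "film", "movie", "movie"]]
vocab = build_vocab(corpus, min_freq=2)
nums = numericalize(["xxbos", "the", "film"], vocab)

print(nums)
print(" ".join(vocab[i] for i in nums))
```

Here "film" appears only once, so it falls below the cutoff and maps back to `xxunk`, just like the names in the review above.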

In [21]:
# now that we have numbers we need to put them in batches for our model

## Putting our Texts into Batches for a Language Model