<center>

# Universal Language Model Fine-Tuning (ULMFiT)
### State-of-the-Art in Text Analysis
> Ref: https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit/#introduction
> [Understanding building blocks of ULMFiT](https://medium.com/mlreview/understanding-building-blocks-of-ulmfit-818d3775325b)

## 0. Overview
> ULMFiT and its many competitors: [ULMFiT](https://arxiv.org/abs/1801.06146), [ELMo](https://allennlp.org/elmo), [GLoMo](https://arxiv.org/abs/1806.05662), [OpenAI transformer](https://blog.openai.com/language-unsupervised/), [BERT](https://arxiv.org/pdf/1810.04805.pdf), [Transformer-XL](https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html)

### Transfer Learning
![](./images/Figure_1.png)
<center>Figure 1: Traditional ML vs. Transfer Learning.</center>

### Data
A model pretrained on WikiText is transferred to the task of classifying IMDB reviews.

- [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
- [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/)

### ULMFiT Overview
![](./images/Figure_3.png)

To give a general overview, the model can be split into three steps:

- **General-Domain LM Pretraining:** In a first step, a LM is pretrained on a large general-domain corpus (in our case the WikiText-103 dataset). Now, the model is able to predict the next word in a sequence (with a certain degree of certainty). Figuratively speaking, at this stage the model learns the general features of the language, e.g. that the typical sentence structure of the English language is subject-verb-object.
- **Target Task LM Fine-Tuning:** Following the transfer learning approach, the knowledge gained in the first step should be utilized for the target task. However, the target task dataset (i.e. the Twitter US Airline Sentiment dataset) is likely from a different distribution than the source task dataset. To address this issue, the LM is consequently fine-tuned on the data of the target task. Just as after the first step, the model is at this point able to predict the next word in a sequence. Now however, it has also learned task-specific features of the language, such as the existence of handles in Twitter or the usage of slang.
- **Target Task Classifier:** Since ultimately, in our case, we do not want our model to predict the next word in a sequence but to provide a sentiment classification, in a third step the pretrained LM is expanded by two linear blocks so that the final output is a probability distribution over the sentiment labels (i.e. positive, negative and neutral).

## 1. Training the General-Domain Language Model
> Pretrained model download: http://files.fast.ai/models/wt103/

![](./images/Figure_5.png)
<center>Figure 5: Detailed Overview ULMFiT: General-Domain LM Pretraining.</center>

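ULMFiT uses a network architecture called `AWD-LSTM` (ASGD Weight-Dropped LSTM), which applies dropout at many different layers:
1. **Encoder Dropout**: randomly zeroes entire word vectors (the vector of a given word in a sentence becomes all zeros);
2. **Input Dropout**: zeroes parts of the embedding lookup output (an entire column is zeroed);
3. **Weight Dropout**: randomly zeroes entries of the LSTM weight matrices;
4. **Hidden Dropout**: applies the same strategy as in 2 to the inputs of the LSTM layers;
5. **Output Dropout**: dropout applied just before the decoder, using the same method as in 2.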
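
To make the difference between dropping whole word vectors and dropping individual weights concrete, here is a minimal pure-Python sketch. It is not the AWD-LSTM or fastai implementation: the function names `embedding_dropout` and `weight_dropout`, the toy matrix, and the rescaling of kept values by `1 / (1 - p)` are assumptions made for illustration.

```python
import random

def embedding_dropout(emb, p, rng):
    """Zero whole rows (one row = one word vector) with probability p.

    Kept rows are scaled by 1 / (1 - p) so the expected value stays the same
    (an assumption of this sketch).
    """
    scale = 1.0 / (1.0 - p)
    out = []
    for row in emb:
        if rng.random() < p:
            out.append([0.0] * len(row))          # the whole word vector is dropped
        else:
            out.append([v * scale for v in row])  # the word vector is kept and rescaled
    return out

def weight_dropout(weights, p, rng):
    """Zero individual entries of a weight matrix with probability p."""
    scale = 1.0 / (1.0 - p)
    return [[0.0 if rng.random() < p else w * scale for w in row] for row in weights]

rng = random.Random(0)
emb = [[1.0, 2.0, 3.0] for _ in range(6)]  # toy embedding matrix: 6 words, 3 dimensions
dropped = embedding_dropout(emb, p=0.5, rng=rng)
for row in dropped:
    print(row)
```

Each printed row is either all zeros or the original vector scaled by 2, whereas `weight_dropout` can zero any single entry independently.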


In [1]:
from fastai.text import *

### Preparing the Data
First let's download the dataset we are going to study. The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) was curated by Andrew Maas et al. and contains a total of 100,000 reviews on IMDB. 25,000 of them are labelled as positive or negative for training, another 25,000 are labelled for testing (in both cases they are highly polarized). The remaining 50,000 are additional unlabelled data (but we will find a use for them nonetheless).

In [2]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

[WindowsPath('C:/Users/gaoc/.fastai/data/imdb_sample/texts.csv')]

In [3]:
df = pd.read_csv(path/'texts.csv')
df.head()

| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
| 1 | positive | This is a extremely well-made film. The acting... | False |
| 2 | negative | Every once in a long while a movie will come a... | False |
| 3 | positive | Name just says it all. I watched this movie wi... | False |
| 4 | negative | This movie succeeds at being one of the most u... | False |


In [4]:
df['text'][1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

In [5]:
data_lm = TextDataBunch.from_csv(path, 'texts.csv')

Executing this line launches a process that takes a bit of time. Let's dig a bit into it. Images can be fed (almost) directly into a model because they're just a big array of pixel values that are floats between 0 and 1. A text is composed of words, and we can't apply mathematical functions to them directly. We first have to convert them to numbers. This is done in two different steps: tokenization and numericalization. A `TextDataBunch` does all of that behind the scenes for you.

Before we delve into the explanations, let's take the time to save the things that were calculated.

In [6]:
data_lm.save()

Next time we launch this notebook, we can skip the cell above that took a bit of time (and that will take a lot more when you get to the full dataset) and load those results like this:

In [8]:
data = load_data(path)

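NameError: name 'load_data' is not defined

This `NameError` most likely means that the installed fastai version does not export `load_data`. The call is expected to work in fastai v1 releases that provide it, which the later cells of this notebook also assume.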

### Tokenization

The first step of processing we make the texts go through is to split the raw sentences into words, or more exactly tokens. The easiest way to do this would be to split the string on spaces, but we can be smarter:

- we need to take care of punctuation
- some words are contractions of two different words, like isn't or don't
- we may need to clean some parts of our texts, if there's HTML code for instance

To see what the tokenizer has done behind the scenes, let's have a look at a few texts in a batch.

In [9]:
data = TextClasDataBunch.from_csv(path, 'texts.csv')
data.show_batch()

| text | target |
|---|---|
| xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj | negative |
| xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with | positive |
| xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " xxmaj at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , | negative |
| xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the xxunk - up xxunk of xxmaj new xxmaj xxunk . \n\n xxmaj the format is the same as xxmaj xxunk xxmaj xxunk ' " xxmaj la xxmaj xxunk , " | positive |
| xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics | positive |


The texts are truncated at 100 tokens for more readability. We can see that it did more than just split on space and punctuation symbols: 
- the "'s" are grouped together in one token
- the contractions are separated like this: "did", "n't"
- content has been cleaned for any HTML symbol and lower cased
- there are several special tokens (all those that begin with xx), to replace unknown tokens (see below) or to introduce different text fields (here we only have one).
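
The rules in the list above can be imitated with a few lines of plain Python. This is only a rough sketch under simplifying assumptions, not fastai's tokenizer: the function name `toy_tokenize` and the regular expression are hypothetical, and only the `xxbos`, `xxmaj` and `xxup` markers, the `n't` / `'s` splitting, and the removal of `<br />` tags are covered.

```python
import re

def toy_tokenize(text):
    """Rough imitation of the tokenization rules described above (illustrative only)."""
    text = re.sub(r"<br\s*/?>", " ", text)  # clean HTML line breaks
    # Split off "n't" and "'s", then separate words from punctuation.
    raw = re.findall(r"n't\b|'s\b|\w+?(?=n't\b)|\w+|[^\w\s]", text)
    tokens = ["xxbos"]                      # marks the beginning of a text
    for tok in raw:
        if len(tok) > 1 and tok.isupper():
            tokens += ["xxup", tok.lower()]   # word written entirely in upper case
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            tokens += ["xxmaj", tok.lower()]  # capitalized word
        else:
            tokens.append(tok.lower())
    return tokens

print(toy_tokenize("This movie isn't GOOD. It's bad!"))
```

The case information is moved into separate marker tokens, so the vocabulary only needs lower-cased entries.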

Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. We only keep the ones that appear at least twice, with a maximum vocabulary size of 60,000 (by default), and replace the ones that don't make the cut with the unknown token `xxunk`.

The correspondence from ids to tokens is stored in the `vocab` attribute of our datasets, in a list called `itos` (for int to string).

In [10]:
data.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the',
 ',']

In [11]:
data.train_ds[0][0]

Text xxbos xxmaj at the time of writing this review it would seem that over 50 % of imdb voters had given this film a rating of either a 10 or a 1 . i can only xxunk then that those giving it a 10 were either cast or crew members . 

 xxmaj they say that given enough monkeys and enough time and enough xxunk , those monkeys , just by random xxunk at the xxunk , would eventually type out the complete works of xxmaj shakespeare . xxmaj however , i seriously doubt that given the same number of monkeys and time , you could find a single one to give this movie a rating of 10 . 

 i xxunk watched the first half , xxunk xxunk that the film would , on some level , develop either the plot or the characters , or maybe make some kind of social comment or provoke barely intellectual thought . xxmaj failing that , i was quite prepared to accept action , suspense , comedy , horror or even gratuitous sex as a way of holding my attention . xxmaj ultimately , i was disappointed and consequently , much o

In [12]:
data.train_ds[0][0].data[:10]

array([  2,   4,  46,   8,  82,  13, 475,  20, 619,  16], dtype=int64)
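
The numericalization step illustrated by the outputs above can be sketched in plain Python. The helper names `build_vocab` and `numericalize` are hypothetical and the special-token list is shortened; only the thresholds (minimum frequency of 2, at most 60,000 tokens, unknown words mapped to `xxunk`) follow the description in the text.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=2, max_vocab=60000, specials=("xxunk", "xxpad")):
    """Return (itos, stoi): the id-to-token list and its inverse mapping."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials)                    # special tokens take the first ids
    for tok, freq in counts.most_common(max_vocab):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids; anything outside the vocabulary becomes the id of xxunk."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]

texts = [
    ["xxbos", "the", "movie", "was", "great"],
    ["xxbos", "the", "movie", "was", "awful"],
]
itos, stoi = build_vocab(texts)
print(itos)
print(numericalize(["xxbos", "the", "movie", "was", "boring"], stoi))
```

In this toy corpus "great" and "awful" each appear only once, so they are left out of `itos`, and the unseen word "boring" is mapped to id 0.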

### Using the data block API

We can use the data block API with NLP and have a lot more flexibility than what the default factory methods offer. In the previous example for instance, the data was randomly split between train and validation instead of reading the third column of the csv.

With the data block API though, we have to manually call the tokenize and numericalize steps. This allows more flexibility, and if you're not using the defaults from fastai, the various arguments to pass will appear in the step where they're relevant, so it'll be more readable.

In [13]:
data = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)
                .label_from_df(cols=0)
                .databunch())

### Language Model
Set the batch size `bs` and download the full dataset.

In [15]:
bs = 48

In [None]:
path = untar_data(URLs.IMDB)
path.ls()

In [None]:
(path/'train').ls()

The reviews are split into a training and a test set following an ImageNet-style folder structure. The only difference is that there is an `unsup` folder alongside `train` and `test` that contains the unlabelled data.

We're not going to train a model that classifies the reviews from scratch. Like in computer vision, we'll use a model pretrained on a bigger dataset (a cleaned subset of Wikipedia called [wikitext-103](https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset)). That model has been trained to guess what the next word is, its input being all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point.

We are going to use that 'knowledge' of the English language to build our classifier, but first, like for computer vision, we need to fine-tune the pretrained model to our particular dataset. Because the English of the reviews left by people on IMDB isn't the same as the English of Wikipedia, we'll need to adjust the parameters of our model a little. Plus, there might be some words that are extremely common in the reviews dataset but barely present in Wikipedia, and that therefore might not be part of the vocabulary the model was trained on.

This is where the unlabelled data is going to be useful to us, as we can use it to fine-tune our model. Let's create our data object with the data block API (next line takes a few minutes).

In [None]:
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test', 'unsup']) 
           #We may have other temp folders that contain text files so we only keep what's in train, test and unsup
            .split_by_rand_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs))
data_lm.save('data_lm.pkl')

We have to use a special kind of `TextDataBunch` for the language model, that ignores the labels (that's why we put 0 everywhere), will shuffle the texts at each epoch before concatenating them all together (only for training, we don't shuffle for the validation set) and will send batches that read that text in order with targets that are the next word in the sentence.
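
To illustrate what "targets that are the next word" means, here is a minimal sketch of language-model batching in plain Python. The function name `lm_batches` and its parameters are hypothetical (`bptt` stands for the number of tokens per row in a batch), shuffling is left out, and this is not how fastai implements it internally.

```python
def lm_batches(ids, bs, bptt):
    """Yield (inputs, targets); each target row is its input row shifted by one token."""
    n = (len(ids) - 1) // bs  # tokens per stream; one token is held back for the last target
    streams_x = [ids[i * n:(i + 1) * n] for i in range(bs)]
    streams_y = [ids[i * n + 1:(i + 1) * n + 1] for i in range(bs)]
    for start in range(0, n, bptt):
        x = [s[start:start + bptt] for s in streams_x]
        y = [s[start:start + bptt] for s in streams_y]
        yield x, y

ids = list(range(13))  # stand-in for all reviews concatenated into one stream of token ids
batches = list(lm_batches(ids, bs=2, bptt=3))
for x, y in batches:
    print(x, y)
```

Because consecutive batches continue the same streams in order, a recurrent model can carry its hidden state from one batch to the next.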

Since the previous cell takes a while to run, we can reload the saved result quickly with the following cell.

In [None]:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)
data_lm.show_batch()

We can then put this in a learner object very easily with a model loaded with the pretrained weights. They'll be downloaded the first time you execute the following line and stored in `~/.fastai/models/` (or elsewhere if you specified different paths in your config file).

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
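
Loading pretrained weights into a model with a new vocabulary requires matching the pretrained embedding rows to the IMDB tokens, including words the pretrained vocabulary never saw. The sketch below shows one plausible way to do this: known words reuse their pretrained row and unseen words start from the mean of all pretrained rows. The helper name `remap_embeddings` and the mean initialization are assumptions for illustration, not a description of what `language_model_learner` does internally.

```python
def remap_embeddings(old_emb, old_itos, new_itos):
    """Build an embedding matrix for new_itos, reusing pretrained rows where possible."""
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    dim = len(old_emb[0])
    # Mean of all pretrained vectors, used as the starting point for unseen words (assumption).
    mean_row = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]
    return [
        list(old_emb[old_stoi[tok]]) if tok in old_stoi else list(mean_row)
        for tok in new_itos
    ]

old_itos = ["the", "film"]            # toy pretrained vocabulary
old_emb = [[1.0, 0.0], [0.0, 1.0]]    # toy pretrained embeddings
new_itos = ["film", "imdb", "the"]    # toy target vocabulary with one unseen word
new_emb = remap_embeddings(old_emb, old_itos, new_itos)
print(new_emb)
```

The unseen word "imdb" gets a neutral starting vector that fine-tuning on the target corpus can then adjust.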

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=15)

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

In [None]:
learn.save('fit_head')

In [None]:
learn.load('fit_head')

## 2. Fine-Tuning the Language Model on the Target Task Dataset

### 2.1 Freezing
![](./images/Figure_15.png)

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

In [None]:
learn.save('fine_tuned')

In [None]:
learn.load('fine_tuned');

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
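
The `temperature` argument controls how adventurous the generated text is. Conceptually, a temperature below 1 sharpens the next-word distribution before sampling. The following plain-Python sketch shows that idea on toy scores; the function names and the logits are made up for illustration, and fastai's own `predict` may apply the temperature differently in detail.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature gives a sharper distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature, rng):
    """Sample the index of the next token from the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # toy scores for three candidate next words
flat = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.75)
print(flat)
print(sharp)
print(sample_next(logits, 0.75, random.Random(0)))
```

With a temperature of 0.75 the most likely word receives a larger share of the probability mass than at 1.0, so samples stay closer to the model's top choices.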

We have to save not only the model, but also its encoder, the part that's responsible for creating and updating the hidden state. For the next part, we don't care about the part that tries to guess the next word.

In [None]:
learn.save_encoder('fine_tuned_enc')

## 3. Transfer Learning to the Target Task Classifier
![](./images/Figure_18.png)

### Gradual Unfreezing
![](./images/Figure_21.png)

Now, we'll create a new data object that only grabs the labelled data and keeps those labels. Again, this line takes a bit of time.

In [None]:
path = untar_data(URLs.IMDB)

In [None]:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

data_clas.save('data_clas.pkl')

In [None]:
data_clas = load_data(path, 'data_clas.pkl', bs=bs)

In [None]:
data_clas.show_batch()

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

In [None]:
learn.save('first')

In [None]:
learn.load('first');

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
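
The `slice(lr/(2.6**4), lr)` pattern gives earlier layer groups smaller learning rates than later ones. The sketch below spreads learning rates geometrically between the two ends of the slice. It assumes the classifier is split into five layer groups, which is why the exponent 4 yields a factor of 2.6 between neighbouring groups; the helper name `layer_group_lrs` is hypothetical.

```python
def layer_group_lrs(lr_min, lr_max, n_groups):
    """Spread learning rates geometrically from lr_min (first group) to lr_max (last group)."""
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1.0 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]

# Five layer groups is an assumption of this sketch.
lrs = layer_group_lrs(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: lr = {lr:.6f}")
```

The last group (the classifier head) trains with the full `1e-2`, while each earlier group receives a rate 2.6 times smaller than the group after it.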

In [None]:
learn.save('second')

In [None]:
learn.load('second');

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [None]:
learn.load('third');

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
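
The sequence of calls above unfreezes the model one layer group at a time, starting from the end. A minimal sketch of which groups are trainable at each stage, assuming five layer groups and using a hypothetical helper `trainable_groups` that mirrors the index convention of `freeze_to`:

```python
def trainable_groups(n_groups, n):
    """Return one flag per layer group: groups before index n are frozen."""
    start = n if n >= 0 else n_groups + n
    return [i >= start for i in range(n_groups)]

# Stages used above: head only, then freeze_to(-2), freeze_to(-3), and unfreeze().
schedule = [
    trainable_groups(5, -1),
    trainable_groups(5, -2),
    trainable_groups(5, -3),
    trainable_groups(5, 0),
]
for stage in schedule:
    print(stage)
```

Training the later groups first lets the new classifier head settle before the pretrained lower layers are updated.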

In [None]:
learn.predict("I really loved that movie, it was awesome!")