# Usage of ULMFiT-pretrained models and the fastai_ulmfit library

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/floleuerer/fastai_ulmfit/blob/main/fastai_ulmfit_pretrained_usage.ipynb)

Install the required packages for **Colab** and import them.

In [None]:
pip install -Uq fastai==2.2.7 sentencepiece==0.1.95 fastcore==1.3.19 fastai-ulmfit

In [None]:
from fastai_ulmfit.pretrained import *
from fastai.text.all import *

## Prepare GermEval2019 Sentiment Analysis

This is a minimal example that uses only part of the **GermEval2019 Task 1** training data, so the results will be worse than with the complete dataset.

https://projects.fzai.h-da.de/iggsa/data-2019/

In [None]:
!wget -P tmp/ https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/09/germeval2019.training_subtask1_2_korrigiert.txt
!wget -P tmp/ https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt

Create dataframe from the downloaded files.

In [None]:
names = ['text','label','label_fine']

# Training rows go into the training split
df_train = pd.read_csv('tmp/germeval2019.training_subtask1_2_korrigiert.txt', sep='\t', names=names)
df_train['is_valid'] = False

# Gold-label test rows are used as the validation split
df_test = pd.read_csv('tmp/germeval2019GoldLabelsSubtask1_2.txt', sep='\t', names=names)
df_test['is_valid'] = True

df = pd.concat([df_train, df_test])
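
To make the role of the `is_valid` column concrete, here is a minimal sketch of the same steps on in-memory data. The rows and label values are invented for illustration and are not taken from the GermEval files; only the column names and the tab separator follow the cell above.

```python
import io
import pandas as pd

names = ['text', 'label', 'label_fine']

# Invented tab-separated rows standing in for the downloaded files
train_tsv = "Ein freundlicher Tweet\tOTHER\tOTHER\nEin unfreundlicher Tweet\tOFFENSE\tINSULT\n"
test_tsv = "Noch ein Tweet\tOTHER\tOTHER\n"

df_train_toy = pd.read_csv(io.StringIO(train_tsv), sep='\t', names=names)
df_train_toy['is_valid'] = False   # rows used for training

df_test_toy = pd.read_csv(io.StringIO(test_tsv), sep='\t', names=names)
df_test_toy['is_valid'] = True     # rows used for validation

df_toy = pd.concat([df_train_toy, df_test_toy])
print(df_toy)
```

The boolean `is_valid` column is what `ColSplitter()` reads later on to separate training rows from validation rows.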

Do some simple preprocessing - remove @-mentions and links from the tweets.

In [None]:
def clean_text(text):
    # Replace @-mentions with a space (raw string avoids an invalid-escape warning for \w)
    text = re.sub(r'@\w+', ' ', text)
    text = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)
    text = ' '.join(text.split())
    return text

df['text'] = df['text'].apply(clean_text)
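
As a quick check of what this preprocessing does, the sketch below applies the same three steps to an invented tweet. `clean_text_simple` is a hypothetical stand-in for the notebook's `clean_text`: its URL pattern is deliberately simplified and only handles `http(s)://` and `www.` links.

```python
import re

def clean_text_simple(text):
    # Replace @-mentions with a space
    text = re.sub(r'@\w+', ' ', text)
    # Simplified URL pattern (assumption): only http(s):// and www. links
    text = re.sub(r'(?i)\b(?:https?://|www\.)\S+', ' ', text)
    # Collapse runs of whitespace into single spaces
    return ' '.join(text.split())

sample = "@user1 Das ist   ein Test https://example.com/abc @user2 ok"
cleaned = clean_text_simple(sample)
print(cleaned)  # → Das ist ein Test ok
```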

In [None]:
df

## Usage of ULMFiT-pretrained models

The `fastai_ulmfit` library provides the following helper functions for working with the **pretrained models**:
- `tokenizer_from_pretrained` creates the tokenizer
- `language_model_from_pretrained` creates a learner to fine-tune the language model
- `text_classifier_from_lm` creates a classifier learner from the fine-tuned language model

### Create Tokenizer from pretrained model

The function `tokenizer_from_pretrained` creates a `SentencePieceTokenizer` with the parameters (e.g. `vocab_sz`) that the model and tokenizer were trained with.

**The tokenizer is used for both the language model fine-tuning and the training of the classifier.**

In [None]:
url = 'http://bit.ly/ulmfit-dewiki'
tok = tokenizer_from_pretrained(url)

### Language Model fine-tuning

Create the `DataLoaders` for the **language model fine-tuning** from the dataframe and pass in the created **tokenizer**.

In [None]:
dblocks = DataBlock(blocks=(TextBlock.from_df('text', tok=tok, is_lm=True)),
                    get_x=ColReader('text'), 
                    splitter=ColSplitter())
dls = dblocks.dataloaders(df, bs=64)

The function `language_model_from_pretrained` calls `language_model_learner` and creates an `LMLearner` from the pretrained model.

In [None]:
learn = language_model_from_pretrained(dls, url=url, drop_mult=1).to_fp16()

In [None]:
learn.lr_find()

In [None]:
lr = 3e-2

In [None]:
learn.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))

In [None]:
learn.unfreeze()
learn.fit_one_cycle(3, slice(lr/100,lr), moms=(0.8,0.7,0.8))
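
Passing `slice(lr/100, lr)` gives each layer group its own learning rate, smallest for the earliest layers and largest for the last ones. The sketch below reproduces that geometric spacing in plain Python without fastai. `even_mults_sketch` is a hypothetical helper written for this illustration, and the group count of 4 is an assumed value, not one read from the model.

```python
def even_mults_sketch(start, stop, n):
    """Return n values from start to stop, each a constant multiple of the previous one."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

lr = 3e-2
n_groups = 4  # hypothetical number of layer groups, for illustration only

group_lrs = even_mults_sketch(lr / 100, lr, n_groups)
for i, group_lr in enumerate(group_lrs):
    print(f"group {i}: lr = {group_lr:.2e}")
```

The earliest groups hold the most general pretrained features, so they are updated with the smallest steps.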

Save the fine-tuned model to `path` (the default is `learn.model_dir`) together with all required files (model, encoder, vocab and SentencePiece model).

`path` and `vocab` will be used for training the classifier.

In [None]:
path = learn.save_lm('tmp/test_lm')
vocab = learn.dls.vocab

### Train the Text Classifier

In [None]:
dblocks = DataBlock(blocks=(TextBlock.from_df('text', tok=tok, vocab=vocab), CategoryBlock),
                    get_x=ColReader('text'),
                    get_y=ColReader('label'), 
                    splitter=ColSplitter())
dls = dblocks.dataloaders(df, bs=128)

`text_classifier_from_lm` calls `text_classifier_learner` to create a learner **from the fine-tuned model** `path`.

In [None]:
learn = text_classifier_from_lm(dls, path=path, metrics=[accuracy]).to_fp16()

In [None]:
learn.lr_find()

In [None]:
learn.fine_tune(5, 1e-2)