In [None]:
# |hide
%reload_ext autoreload
%autoreload 2

# Getting Started

> **BLURR** is a library designed for fastai developers who want to train and deploy Hugging Face transformers

Named after the **fast**est **transformer** (well, at least of the Autobots), **BLURR** provides a comprehensive, extensible framework for training and deploying 🤗 [huggingface](https://huggingface.co/transformers/) transformer models with [fastai](http://docs.fast.ai/) >= 2.0.

Utilizing features like fastai's new `@typedispatch` and `@patch` decorators, along with a simple class hierarchy, **BLURR** provides fastai developers with the ability to train and deploy transformers on a variety of tasks. It includes high-, mid-, and low-level APIs that let developers use much of it out of the box or customize it as needed.
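
For readers who have not met these decorators: `@patch` attaches a function to an existing class (chosen by the type annotation on its first parameter), and `@typedispatch` picks an implementation based on argument types. The sketch below imitates both ideas with the standard library only. The `patch` helper is a simplified stand-in written for illustration, not fastcore's implementation, and `functools.singledispatch` looks at the first argument only, whereas `@typedispatch` can consider more than one. `Greeter`, `greet`, and `describe` are hypothetical names.

```python
from functools import singledispatch


def patch(func):
    """Simplified stand-in: attach `func` to the class annotated on its first parameter."""
    cls = next(iter(func.__annotations__.values()))
    setattr(cls, func.__name__, func)
    return func


class Greeter:
    def __init__(self, name):
        self.name = name


@patch
def greet(self: Greeter):
    # Defined outside the class body, but callable as a method afterwards.
    return f"hello, {self.name}"


@singledispatch
def describe(x):
    # Fallback used when no registered type matches.
    return "something else"


@describe.register
def _(x: str):
    return "text"


@describe.register
def _(x: int):
    return "a label id"


greeting = Greeter("blurr").greet()
kinds = [describe("a review"), describe(1), describe(1.5)]
```

Type-based dispatch of this kind is what lets one function name, such as a batch-display helper, behave differently per task without long `if isinstance(...)` chains.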

**Supported Text/NLP Tasks**:
- Sequence Classification 
- Token Classification 
- Question Answering 
- Summarization 
- Translation
- Language Modeling (Causal and Masked)

## Install

You can now install blurr with `pip install ohmeow-blurr`.

Or, better still since this library is under *very* active development, create an editable install:
```
git clone https://github.com/ohmeow/blurr.git
cd blurr
pip install -e ".[dev]"
```

## How to use

Please check the documentation for more thorough examples of how to use this package.

The following two packages need to be installed for blurr to work: 
1. [fastai](http://docs.fast.ai/) 
2. [Hugging Face transformers](https://huggingface.co/transformers/installation.html)

### Imports

In [None]:
# | output: false
import os, warnings

from datasets import load_dataset, concatenate_datasets
import torch
from transformers import *
from transformers.utils import logging as hf_logging
from fastai.text.all import *

from blurr.data.core import *
from blurr.training.core import *
from blurr.utils import *



In [None]:
warnings.simplefilter("ignore")
hf_logging.set_verbosity_error()

os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Get your data

In [None]:
imdb_dsd = load_dataset("imdb", split=["train", "test"])

# build HF `Dataset` objects
train_ds = imdb_dsd[0].add_column("is_valid", [False] * len(imdb_dsd[0])).shuffle().select(range(1000))
valid_ds = imdb_dsd[1].add_column("is_valid", [True] * len(imdb_dsd[1])).shuffle().select(range(200))
imdb_ds = concatenate_datasets([train_ds, valid_ds])

# build a `DataFrame` representation as well
imdb_df = pd.DataFrame(imdb_ds)
imdb_df.head()

Found cached dataset imdb (/home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


  0%|          | 0/2 [00:00<?, ?it/s]

Unnamed: 0,text,label,is_valid
0,"I grew up watching the original TV series in the sixties and one thing that I can tell you right away, there is NO comparison. This film was totally ridiculous with a flying suit that was alive. A martian that took different shapes. Special effects that looked like something that a little child would create. In contrast, in the original, characters were developed and the viewers developed a feeling for Tim and Uncle Martin. The only highlight in this film, yes, actually there was one, occurred when Ray Walston finally made an appearance at the end. He wore dark glasses and made references ...",0,False
1,"Damn, was that a lot to take in. I was pretty much mesmerised throughout. It was pretty perfect, though I would say the editing had a lot to do with that. I can't believe this guy stayed on good terms with the lot of them (Anton especially) to get all of this footage without any serious... beef. The Dandy's did come off well-together, middle-class kids who took advantage of their situation (and rightly so!). I felt bad for Jonestown and especially for Anton, which maybe wasn't what a lot of other people felt. Great piece of film-making and great choice of subject(s). I recommend this to an...",1,False
2,"I saw 'New York: I Love You' today and loved it! I was really looking forward to seeing this after watching 'Paris je t'aime' and overall I think I liked this one much better... Perhaps I need to watch 'Paris je t'aime' again I don't know... I read few of the reviews here about NY:ILY and yes, the movie is not without its faults. When you're paying tribute to a city like New York - it can get rather overwhelming and nothing seems fair enough to do the city due justice... so without elaborating on any of the film's shortcomings, I'll just write about what I liked.<br /><br />Unlike 'Paris j...",1,False
3,"This was the worst MTV Movie Awards EVER!!! I barely laughed, none of the presenters were funny, the hosts really sucked, and the parodies weren't so great either. Why can't we go back to the good olden days when the show was a riot?",0,False
4,"Recap: It's business as usual at Louche's casino in Tanger. The casino is about to close and prepares for a big transaction the next day. The owner Louche and some staff leave for the night, leaving Modesty in charge. Suddenly a troop of armed gangsters storm the casino, shooting wildly. Unknown to Modesty, they have already killed Louche, and are now after the money hidden in the vault. But no one present, and still alive, at the casino knows the code to open the vault. The vault itself is heavily booby trapped with explosives so the assailants can't blow the door as planned. Suddenly Mod...",1,False


### Get `label_names` from data for config later

In [None]:
label_names = imdb_dsd[0].features["label"].names
label_names

['neg', 'pos']
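
The dataset stores each label as an integer, so `label_names` is what turns a stored `0` or `1` back into a readable class name when batches and results are displayed. A minimal plain-Python sketch of that lookup follows; `decode_label` is a hypothetical helper, not part of blurr.

```python
label_names = ["neg", "pos"]  # as returned by the dataset's `label` feature above


def decode_label(label_id):
    """Map an integer class id to its human-readable name."""
    return label_names[label_id]


encoded = [0, 1, 1, 0, 1]  # the `label` column of the five rows shown earlier
decoded = [decode_label(label_id) for label_id in encoded]
```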

### Get your 🤗 objects

In [None]:
pretrained_model_name = "bert-base-uncased"

hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects(pretrained_model_name, label_names=label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		30522
Max # of tokens:	512
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


### Build your Data 🧱 and your DataLoaders

In [None]:
# single input (note: we pass the label_names here because the labels in the dataset are already encoded as 0 or 1)
blocks = (
    TextBlock(
        tokenize_tfm=BatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model), batch_decode_kwargs={"label_names": label_names}
    ),
    CategoryBlock,
)

dblock = DataBlock(
    blocks=blocks,
    get_x=ColReader("text"),
    get_y=ColReader("label"),
    splitter=ColSplitter(),
)

dls = dblock.dataloaders(imdb_df, bs=4)

In [None]:
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Unnamed: 0,text,target
0,"this movie was recently released on dvd in the us and i finally got the chance to see this hard - to - find gem. it even came with original theatrical previews of other italian horror classics like "" spasmo "" and "" beyond the darkness "". unfortunately, the previews were the best thing about this movie. < br / > < br / > "" zombi 3 "" in a bizarre way is actually linked to the infamous lucio fulci "" zombie "" franchise which began in 1979. similarly compared to "" zombie "", "" zombi 3 "" consists of a",neg
1,"i knew it wasn't gunna work out between me and d - wars from the moment we met. first its title was lazy. d war. like writing out dragon was too much for them. also... you really can't be that blatant with your title unless your blue monkey. blue monkey can do whatever the hell it wants. < br / > < br / > the second sign of a rocky relationship between us was the story's insane progression. here's the film, dreamy reporter guy reports on big snake tracks, flashes back to a time he and dad wander",neg


### ... and 🚂

In [None]:
# |notest
model = BaseModelWrapper(hf_model)

learn = Learner(
    dls,
    model,
    opt_func=Adam,
    loss_func=CrossEntropyLossFlat(),
    metrics=[accuracy],
    cbs=[BaseModelCallback],
    splitter=blurr_splitter_on_head,
)

learn.create_opt()
learn.freeze()

learn.fit_one_cycle(3, lr_max=3e-5)
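
`fit_one_cycle` varies the learning rate during training rather than holding it at `lr_max`. The sketch below reproduces the general shape in plain Python: a cosine warm-up to `lr_max` followed by a cosine decay. The values `div=25`, `div_final=1e5`, and `pct_start=0.25` are assumed here to match fastai's defaults, so verify them against your installed version. `cos_anneal` and `one_cycle_lr` are illustrative names, not fastai functions.

```python
import math


def cos_anneal(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2


def one_cycle_lr(pct, lr_max=3e-5, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pct`, where 0 <= pct <= 1."""
    if pct < pct_start:
        # Warm-up phase: lr_max / div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Decay phase: lr_max -> lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


# Sample the schedule at a few points of training progress.
schedule = [one_cycle_lr(p / 10) for p in range(11)]
```

Under these assumptions the rate starts at `lr_max / div`, peaks a quarter of the way through training, and ends close to zero.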

In [None]:
# |notest
learn.show_results(learner=learn, max_n=2, trunc_at=250)

Unnamed: 0,text,target,prediction
0,"when it comes to those eerie and uncanny little crime films, the sorts that revolve around characters that are bordering on scum and inhabit equally scummy surroundings, and additionally carry that wavering and bleak feel thanks to some pretty grotty",pos,neg
1,"jean renoir's homage to the paris of the late 19th century is beautiful in many ways. not only does it appear to have been photographed by toulouse - lautrec and mucha, it portrays the geographic paris ; the streets accessible only by staircases, the",pos,pos


In [None]:
# |echo:false
try:
    del learn, hf_model
except NameError:
    pass
finally:
    clean_memory()

### Using the low-level blurr API

BLURR now supports training with plain PyTorch Datasets/DataLoaders, Hugging Face Datasets, and/or fast.ai's low-level data API. Here's an example of the latter.

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


In [None]:
# tokenize the dataset
tokenize_func = partial(multiclass_tokenize_func, hf_tokenizer=hf_tokenizer)
proc_imdb_ds = imdb_ds.map(tokenize_func, batched=True)

# turn Arrow into DataFrame (`ColSplitter` only works with `DataFrame`s)
train_df = pd.DataFrame(proc_imdb_ds)
train_df.head()

# define dataset splitter
splitter = ColSplitter("is_valid")
splits = splitter(train_df)


# define how we want to build our inputs and targets
def _build_inputs(example):
    return {fwd_arg_name: example[fwd_arg_name] for fwd_arg_name in hf_tokenizer.model_input_names if fwd_arg_name in list(example.keys())}


def _build_targets(example):
    return example["label"]


# create our fastai `Datasets` object
dsets = Datasets(items=train_df, splits=splits, tfms=[[_build_inputs], _build_targets], n_inp=1)

Map:   0%|          | 0/1200 [00:00<?, ? examples/s]
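
The splitter in the cell above turns the boolean `is_valid` column into two lists of row indices, one for training and one for validation. A rough standard-library sketch of that idea follows. It is not fastai's `ColSplitter` implementation; `col_split` and the toy rows are hypothetical.

```python
# Toy records standing in for DataFrame rows; only the `is_valid` flag matters here.
rows = [
    {"text": "review a", "label": 0, "is_valid": False},
    {"text": "review b", "label": 1, "is_valid": False},
    {"text": "review c", "label": 1, "is_valid": True},
    {"text": "review d", "label": 0, "is_valid": False},
    {"text": "review e", "label": 1, "is_valid": True},
]


def col_split(items, col="is_valid"):
    """Return (train_indices, valid_indices) based on a boolean column."""
    train_idxs = [i for i, item in enumerate(items) if not item[col]]
    valid_idxs = [i for i, item in enumerate(items) if item[col]]
    return train_idxs, valid_idxs


train_idxs, valid_idxs = col_split(rows)
```

In the notebook's data, the first 1,000 rows carry `is_valid=False` and the last 200 carry `is_valid=True`, so the two index lists cover those ranges.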

In [None]:
data_collator = TextCollatorWithPadding(hf_tokenizer)
sort_func = partial(sorted_dl_func, hf_tokenizer=hf_tokenizer)
batch_decode_tfm = BatchDecodeTransform(hf_tokenizer, hf_arch, hf_config, hf_model, label_names=label_names)

dls = dsets.dataloaders(
    batch_size=4,
    create_batch=data_collator,
    after_batch=batch_decode_tfm,
    dl_type=partial(SortedDL, sort_func=sort_func),
)
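
Two ideas are at work in the cell above: a collator pads every sequence in a batch to a common length, and a length-aware `DataLoader` groups similarly sized sequences so less padding is needed. The sketch below illustrates both in plain Python. It is not the code behind `TextCollatorWithPadding`, `sorted_dl_func`, or `SortedDL`; the helper names, the token ids, `pad_id=0`, right-side padding, and the longest-first ordering are all assumptions made for illustration.

```python
def pad_batch(seqs, pad_id=0):
    """Right-pad token-id lists to the longest one and build an attention mask."""
    max_len = max(len(s) for s in seqs)
    return {
        "input_ids": [s + [pad_id] * (max_len - len(s)) for s in seqs],
        "attention_mask": [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs],
    }


def make_batches(seqs, batch_size, sort_by_length=False):
    """Chunk sequences into batches, optionally ordering them longest first."""
    ordered = sorted(seqs, key=len, reverse=True) if sort_by_length else list(seqs)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def count_padding(batches):
    """Total number of padded positions across all batches."""
    return sum(row.count(0) for b in batches for row in pad_batch(b)["attention_mask"])


# Made-up token ids with lengths 3, 8, 4, and 7.
seqs = [
    [101, 7, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
    [101, 3, 4, 102],
    [101, 1, 2, 3, 4, 5, 102],
]

padded = pad_batch(seqs[:2])
padding_unsorted = count_padding(make_batches(seqs, 2))
padding_sorted = count_padding(make_batches(seqs, 2, sort_by_length=True))
```

On this toy input, grouping by length cuts the padded positions from 8 to 2, which is the motivation for sorting before batching long reviews.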

In [None]:
dls.show_batch(dataloaders=dls, max_n=8, trunc_at=500)

Unnamed: 0,text,target
0,"I grew up watching the original TV series in the sixties and one thing that I can tell you right away, there is NO comparison. This film was totally ridiculous with a flying suit that was alive. A martian that took different shapes. Special effects that looked like something that a little child would create. In contrast, in the original, characters were developed and the viewers developed a feeling for Tim and Uncle Martin. The only highlight in this film, yes, actually there was one, occurred w",neg
1,"Poor Michael Madsen; he must be kicking himself to know folks have found out about this horrible flick. I really can't think of anything worse I have ever seen, except amateur porn. It's that bad, and all here; wooden acting, bad script, crappy moral ending, you hate it and it is in this movie.<br /><br />My question is: ""Who the Hell put $$$ into this piece of doggy doo? At least we could have seen Michael's sister Virginia nude in a scene, but I don't think even that would save this stinker...",neg
2,"I knew it wasn't gunna work out between me and D-wars from the moment we met. First its title was lazy. D war. Like writing out Dragon was too much for them. Also... you really can't be that blatant with your title unless your Blue Monkey. Blue Monkey can do whatever the hell it wants. <br /><br />The second sign of a rocky relationship between us was the story's insane progression. Here's the film, dreamy reporter guy reports on big snake tracks, flashes back to a time he and dad wandered into",neg
3,"Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit suicide twice in the past: the first time, when she was a teenager and saw her father cheating her mother with two women in her home, and then when Michael's wife died. Since then, she left Christ and the Catholic Church behind. Alison wants to live alone in her own apartment and with the help of the real state agent Mis",pos


In [None]:
# |notest
set_seed()

model = BaseModelWrapper(hf_model)

learn = Learner(
    dls,
    model,
    opt_func=Adam,
    loss_func=CrossEntropyLossFlat(),
    metrics=[accuracy],
    cbs=[BaseModelCallback],
    splitter=blurr_splitter_on_head,
)

learn.create_opt()
learn.freeze()

learn = learn.to_fp16()

In [None]:
# |notest
print(len(learn.opt.param_groups))

2
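
The `2` printed above is the number of parameter groups the splitter produced. As a rough mental model, `learn.freeze()` leaves only the last group trainable. The sketch below uses hypothetical parameter names and a stand-in `freeze_to` helper; it deliberately ignores details of fastai's real behavior, such as normalization layers staying trainable (see the `LayerNorm` rows in the `learn.summary()` output).

```python
# Hypothetical parameter names, split into a backbone group and a head group.
param_groups = [
    ["embeddings.weight", "encoder.layer.0.weight"],
    ["classifier.weight", "classifier.bias"],
]


def freeze_to(groups, n):
    """Return {param_name: trainable}, freezing every group before index `n`."""
    cut = n if n >= 0 else len(groups) + n
    return {
        name: group_idx >= cut
        for group_idx, group in enumerate(groups)
        for name in group
    }


def freeze(groups):
    """Keep only the last group trainable."""
    return freeze_to(groups, -1)


trainable = freeze(param_groups)
```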


In [None]:
# |notest
learn.summary()

BaseModelWrapper (Input shape: 4 x 1793)
Layer (type)         Output Shape         Param #    Trainable 
                     4 x 1793 x 768      
Embedding                                 98380800   False     
LayerNorm                                 1536       True      
StableDropout                                                  
Linear                                    590592     False     
Linear                                    590592     False     
Linear                                    590592     False     
StableDropout                                                  
StableDropout                                                  
Linear                                    590592     False     
LayerNorm                                 1536       True      
StableDropout                                                  
____________________________________________________________________________
                     4 x 1793 x 3072     
Linear                        

In [None]:
# |notest
learn.fit_one_cycle(3, lr_max=3e-5)

In [None]:
# |notest
learn.show_results(learner=learn, max_n=2, trunc_at=250)

Unnamed: 0,text,target,prediction
0,"this film is what happens when people see like in this particular one blair witch project and say hell people running around with cameras, acting slash documentary themed no problemo i can do it and start out with a lame idea make up a terrible scrip",neg,neg
1,The film itself is only a compilation of scenes which have no inherent meaning to someone living outside of Russia. I won't deny that some of the images and techniques were quite revolutionary at the time (filmed 1928) but the problem with the film i,neg,neg


In [None]:
# |echo:false
try:
    del learn, hf_model
except NameError:
    pass
finally:
    clean_memory()

## ⭐ Props

A word of gratitude to the following individuals, repos, and articles that inspired much of this work:

- The wonderful community that is the [fastai forum](https://forums.fast.ai/) and especially the tireless work of both Jeremy and Sylvain in building this amazing framework and place to learn deep learning.
- All the great tokenizers, transformers, docs, examples, and people over at [huggingface](https://huggingface.co/)
- [FastHugs](https://github.com/morganmcg1/fasthugs)
- [Fastai with 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)](https://towardsdatascience.com/fastai-with-transformers-bert-roberta-xlnet-xlm-distilbert-4f41ee18ecb2)
- [Fastai integration with BERT: Multi-label text classification identifying toxicity in texts](https://medium.com/@abhikjha/fastai-integration-with-bert-a0a66b1cecbe)
- [fastinference](https://muellerzr.github.io/fastinference/)
