# HuggingFace Transformers Library

*Notebook created by Lauren Klein, borrowing heavily from material created by Allison Parrish for her "Playing with Transformers" notebook*

[Transformers](https://huggingface.co/transformers/) is a Python library released by [Hugging Face](https://huggingface.co/) to make it easy to use pre-trained transformer language models. 

This notebook demonstrates some of the functions that the HuggingFace `transformers` library provides out of the box, as well as a few more that require only a small amount of configuration on your part.

## What is a "transformer", though?

"Transformer" is a name applied to neural network architectures that make use of a mechanism called ["attention"](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) (potential explain-an-explainer alert!) and can be trained in parallel rather than sequentially, as was the case with earlier neural network architectures. As we read in "Stochastic Parrots," the [introduction of this architecture](https://arxiv.org/abs/1706.03762) set off a period of tremendous growth in language model capabilities. 

This growth, as we also read, is mostly predicated on the fact that the transformer architecture makes it possible to train language models on larger and larger datasets. As of this writing, state-of-the-art transformer models are often trained on datasets many hundreds of gigabytes in size, and consequently take a tremendous amount of energy (and money, and time) to train. In most cases, it's not practical to train a transformer model from scratch on your own that has the same capabilities. Instead, researchers and artists make use of models that other organizations have trained.
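
To make "attention" slightly more concrete, here is a minimal sketch in plain Python. It assumes made-up two-dimensional vectors and reuses the same vectors as queries, keys, and values; real transformer models apply learned projections first and run on tensors, so treat this as an illustration of the idea rather than how any particular model is implemented.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up token vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights says how much one position "attends" to every other position. Note that no row depends on another row's result, which is why all of the positions can be computed in parallel.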

## HuggingFace Transformers

That's where the HuggingFace `transformers` library comes in. It's an easy interface for downloading pre-trained transformer models and making use of them with a consistent API. In other words, you can use the same code to generate text with, say, GPT-2 or GPT-Neo.

To use the HuggingFace [`transformers`](https://huggingface.co/transformers/installation.html) library, we will first need to install it with `pip`.

In [None]:
!pip3 install transformers

Once `transformers` is installed, we will import the relevant parts of the library. I'm using the `Auto` classes, which automatically load the correct code based on the model that you choose. `AutoModelForCausalLM` (that's "causal" not "casual") is the class you use for text generation tasks, where you want to generate the next word (or, more precisely, the next token) in a sequence.

In [19]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

## Set parameters

In this notebook, I'm going to use [distilgpt2](https://huggingface.co/distilgpt2), a "distilled" version of OpenAI's GPT-2 model. The primary benefit of this model is that it is small, so it downloads and generates text quickly. As you'll see below, the `transformers` library makes it very easy to swap out models, and you can find a list of available models on the HuggingFace website.

In any case, you need to load both the model and its associated tokenizer. (We'll talk about the tokenizer in more detail below.) To load, use the `.from_pretrained()` method of the appropriate `Auto` class, like so:

In [20]:
# device name for running on a GPU (defined here but not used in the cells below)
device_name = 'cuda'

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

Once you have the tokenizer and the model, you can create a Transformers *pipeline*. A pipeline groups together and abstracts away the intermediate steps of a machine learning procedure. The `transformers` library has [many types of pipeline](https://huggingface.co/transformers/main_classes/pipelines.html), but we're going to create a text generation pipeline. Here we go!

## First up: text generation!

First step, instantiate the text generation pipeline using the model and the tokenizer that we just loaded:

In [24]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

Having created this pipeline, we can use it to generate text by calling the pipeline object as though it were a function. The parameter that you pass in is the "prompt"—i.e., the text whose completion you want to predict.

In [25]:
generator("Two roads diverged in a yellow wood, and")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Two roads diverged in a yellow wood, and the car on the opposite side was stopped.\n\n\n\nAs a reminder to residents, they had to travel by taxi to get across the river, and the police responded to the scene on their'}]

Trying this again, formatted a bit more nicely:

In [29]:
# remember this? 
import textwrap

print(textwrap.fill(generator("Two roads diverged in a yellow wood, and")[0]['generated_text'], 60))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Two roads diverged in a yellow wood, and the police found a
man under a blanket in possession of a shotgun. The man,
Roberta P. Davenport, was taken to the jail.    Sgt. Gary De
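
The two runs above produced different continuations for the same prompt, which is consistent with the pipeline sampling each next token at random rather than always picking the single most likely one. Here is a toy sketch of that loop. The `toy_next_token` table and the `generate` helper are hypothetical and invented for illustration; a real model computes the probabilities with a neural network over its whole vocabulary.

```python
import random

# Hypothetical next-token probabilities, invented for this illustration.
toy_next_token = {
    "and": {"the": 0.7, "I": 0.3},
    "the": {"police": 0.5, "car": 0.5},
    "I": {"took": 1.0},
    "police": {"responded": 1.0},
    "car": {"stopped": 1.0},
}

def generate(prompt_tokens, max_new_tokens, rng):
    """Repeatedly sample one next token and append it to the sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        options = toy_next_token.get(tokens[-1])
        if options is None:  # no known continuation, so stop early
            break
        choices = list(options)
        weights = [options[c] for c in choices]
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens

rng = random.Random()  # unseeded, so repeated runs can differ
prompt = ["wood", ",", "and"]
for _ in range(2):
    print(" ".join(generate(prompt, 3, rng)))
```

Passing a seeded generator such as `random.Random(0)` instead would make the toy output repeatable.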


## Next up: masked language modeling!

Now we'll try another fun thing: masked language modeling, which boils down to a souped-up fill-in-the-blank type situation.

In [30]:
# instantiate the unmasker 
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Now we'll try it out!

In [31]:
# another print-formatting library which is helpful for displaying data objects 
from pprint import pprint

pprint(unmasker(f"I am certain that you will create very {unmasker.tokenizer.mask_token} final projects."))

[{'score': 0.16116251051425934,
  'sequence': 'I am certain that you will create very interesting final '
              'projects.',
  'token': 2679,
  'token_str': ' interesting'},
 {'score': 0.086085245013237,
  'sequence': 'I am certain that you will create very good final projects.',
  'token': 205,
  'token_str': ' good'},
 {'score': 0.07449886947870255,
  'sequence': 'I am certain that you will create very successful final '
              'projects.',
  'token': 1800,
  'token_str': ' successful'},
 {'score': 0.04554571583867073,
  'sequence': 'I am certain that you will create very impressive final '
              'projects.',
  'token': 3444,
  'token_str': ' impressive'},
 {'score': 0.03319435566663742,
  'sequence': 'I am certain that you will create very nice final projects.',
  'token': 2579,
  'token_str': ' nice'}]
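
Because the pipeline returns an ordinary Python list of dictionaries, you can work with it using plain Python. The sketch below uses a shortened, hand-copied version of the output above (scores rounded), and it assumes that each score can be read as the probability the model assigns to that candidate.

```python
# A shortened copy of the pipeline output above (scores rounded for readability).
results = [
    {"score": 0.1612, "token": 2679, "token_str": " interesting"},
    {"score": 0.0861, "token": 205, "token_str": " good"},
    {"score": 0.0745, "token": 1800, "token_str": " successful"},
]

# Pick the candidate with the highest score.
best = max(results, key=lambda candidate: candidate["score"])
print(best["token_str"].strip(), round(best["score"], 3))

# Add up the scores to see how much of the total these candidates cover.
covered = sum(candidate["score"] for candidate in results)
print(round(covered, 3))
```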


So how is this working, more precisely? And what are some of those names and numbers that we're seeing in this output? 

It's time to talk about... 

## Tokenization!

Machine learning models don't work on text directly; instead, they operate on numbers that correspond to parts of a text. Breaking a text up into enumerable parts is called *tokenization*. In this class, we've already explored several easy and common forms of tokenization, e.g., breaking a text up into characters, or breaking a text up into words. Most machine learning models now use a form of *sub-word* tokenization, in which a text is broken up into units that don't neatly correspond to either individual characters or whole words. The tokenization procedure itself is derived from statistical properties of the corpus—so tokenizers are, in a sense, "trained" in the same way that a machine learning model is. (This is why you have to load the tokenizer, the same way you load a model.)
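
As a quick refresher, here is a minimal sketch of the two simple forms of tokenization mentioned above, plus the step of turning tokens into numbers. The `build_vocab` helper is hypothetical; it numbers tokens in order of first appearance, which is just one convenient choice.

```python
text = "the cat saw the other cat"

# Word-level tokenization: split on whitespace.
word_tokens = text.split()
# Character-level tokenization: every character (spaces included) is a token.
char_tokens = list(text)

def build_vocab(tokens):
    """Assign each unique token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[t] for t in word_tokens]
print(word_vocab)
print(word_ids)
```

The list of IDs, not the original string, is the kind of thing a model actually receives.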

A tokenizer has a *vocabulary*, which is the set of all possible unique tokens that the tokenizer recognizes. You can examine the vocabulary by calling the tokenizer's `.get_vocab()` method:

In [32]:
vocab = tokenizer.get_vocab()

Use `len()` to see how many items are in the vocabulary:

In [33]:
len(vocab)

50257

The vocabulary is returned as what's called a "dictionary" that maps tokens to their IDs. Let's just take a peek into what that looks like. I'm going to randomly sample a few items from the dictionary, like so:

In [34]:
import random
# random.sample() needs a sequence, so convert the dictionary's items to a list first
random.sample(list(vocab.items()), 10)

[('Ġdictated', 34756),
 ('ĠLap', 26944),
 ('£ı', 6408),
 ('Ġmaterials', 5696),
 ('ĠHalo', 18597),
 ('ĠModified', 40499),
 ('ĠLabour', 7179),
 ('ensive', 2021),
 ('=~', 31820),
 ('Ġcouncils', 27174)]
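
Since the dictionary maps tokens to IDs, going the other way (from an ID back to its token) needs the reverse mapping. A small sketch, using three entries copied from the sample above; `vocab_sample` and `id_to_token` are names made up for this example. Flipping the dictionary is safe here because every token has its own unique ID.

```python
# Three entries copied from the sample output above.
vocab_sample = {"Ġmaterials": 5696, "ĠHalo": 18597, "ensive": 2021}

# Swap keys and values to get an ID-to-token lookup.
id_to_token = {token_id: token for token, token_id in vocab_sample.items()}

print(vocab_sample["ĠHalo"])  # token to ID
print(id_to_token[2021])      # ID to token
```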

The results look pretty weird, and there are a bunch of things to explain.

First off, let's discuss the mysterious Ġ character. Subword tokenizers generally don't start off with information about where word boundaries occur; instead, they "learn" word boundaries as part of the process of "training" the tokenizer. The Ġ character is a special character that stands in for a space, so a token like `Ġmaterials` is the word "materials" with a space in front of it.

Second, we can see that in many cases, the subword tokenizer does actually end up with tokens in its vocabulary that represent entire words. However, in other cases, we end up with what look like word parts. This is by design! Because some tokens represent word parts, the tokenizer can potentially encode any word, even words that were not present in the original corpus, by tokenizing that word as a sequence of parts.

To demonstrate, let's actually encode a string with the tokenizer using its `.encode()` method. Just pass in a string, and you'll get back a list of IDs:

In [35]:
src = "Behold! An alabaster anemone. Zzzzap!"
tokenizer.encode(src)

[3856,
 2946,
 0,
 1052,
 435,
 397,
 1603,
 281,
 368,
 505,
 13,
 1168,
 3019,
 89,
 499,
 0]

The tokenizer encodes this string of five words into sixteen tokens. You can find the token corresponding to the ID using the tokenizer's `.decode()` method:

In [36]:
tokenizer.decode(1603)

'aster'

With this, we can see how the tokenizer broke up the original string into units:


In [37]:
for token_id in tokenizer.encode(src):
    print(token_id, "→", "'" + tokenizer.decode(token_id) + "'")

3856 → 'Be'
2946 → 'hold'
0 → '!'
1052 → ' An'
435 → ' al'
397 → 'ab'
1603 → 'aster'
281 → ' an'
368 → 'em'
505 → 'one'
13 → '.'
1168 → ' Z'
3019 → 'zz'
89 → 'z'
499 → 'ap'
0 → '!'


(I included quotation marks in this output to emphasize the fact that the text of the token includes whitespace.)
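
The breakdown above shows word parts such as `' al'`, `'ab'`, and `'aster'` combining to cover a word. The sketch below imitates that behavior with a toy greedy longest-match splitter. This is not the byte-pair encoding procedure that GPT-2's tokenizer actually uses, and both `toy_subword_tokenize` and `toy_vocab` are hypothetical; the point is only to show how a vocabulary of parts can cover words it has never seen whole.

```python
def toy_subword_tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    text = text.replace(" ", "Ġ")  # mark spaces the way the vocabulary displays them
    pieces = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character so we never get stuck.
            pieces.append(text[start])
            start += 1
    return pieces

# A tiny made-up vocabulary of word parts.
toy_vocab = {"An", "Ġal", "ab", "aster", "Ġan", "em", "one"}
print(toy_subword_tokenize("An alabaster anemone", toy_vocab))
```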

You can decode an entire list of IDs using the `.decode()` method as well:

In [38]:
token_ids = tokenizer.encode(src)
tokenizer.decode(token_ids)

'Behold! An alabaster anemone. Zzzzap!'

For fun, get the tokenizer to decode a list of random token IDs:

In [39]:
tokenizer.decode(random.sample(list(vocab.values()), 12))

' MLA spokesman chasedVer marched Hass contra aideAdventure twinsphabet roundup'

Another way to tokenize a text is to call the tokenizer as though it's a function, passing in a list of strings as an argument:

In [41]:
pprint(tokenizer(["this is a test", "this is another test"], return_tensors="pt"))

{'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]]),
 'input_ids': tensor([[5661,  318,  257, 1332],
        [5661,  318, 1194, 1332]])}


The value returned here is a "dictionary" in the format that the model is expecting, if you want to run the model "by hand" instead of using a pipeline, which is what we're going to do in the next notebook. 

The `return_tensors="pt"` argument directs the tokenizer to return the results as PyTorch tensors instead of Python lists, which is also a requirement for passing the values directly to the model.
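
Both example sentences above happen to contain four tokens, so the attention mask is all ones. Here is a hedged sketch of what the mask is for: when the sequences in a batch differ in length, the shorter ones get padded, and the mask marks real tokens with 1 and padding with 0. The `pad_batch` helper is hypothetical, padding on the right is an assumption, and the pad ID 50256 is borrowed from the `pad_token_id` message printed earlier in this notebook.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token IDs to a common length and build a matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding that should be ignored.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# "this is a test" (IDs from the output above) next to a shorter two-token sequence.
batch = pad_batch([[5661, 318, 257, 1332], [5661, 318]], pad_id=50256)
print(batch["input_ids"])
print(batch["attention_mask"])
```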