# Session 10 - Using BERT-style models via ```HuggingFace```

In the lecture today, we saw how exploring the different layers and self-attention heads in BERT-style models can give us a more nuanced breakdown of how the model has performed and what it has learned.

There are three main tools which can be used for this task:

- BERTviz
    - https://github.com/jessevig/bertviz
- Ecco
    - https://github.com/jalammar/ecco
- Language Interpretability Toolkit (LIT)
    - https://github.com/PAIR-code/lit

Each of these is backed by empirical results published in peer-reviewed journals, but each does something a little different. Feel free to explore them in this class, or in your own time.

A second thing we saw was that BERT (and BERT-style) models can be *finetuned* in order to perform specific tasks. In this class, we're going to see how this can be used for the purposes of cultural data science. To do this, we're going to be using the ```transformers``` library from ```HuggingFace```, sometimes written as just ```🤗```.

## Creating ```HuggingFace``` pipelines

We're specifically going to use the ```pipeline()``` abstraction in HuggingFace. This allows us to load a finetuned model, set up everything it needs (such as its tokenizer), and use it for the specific task for which it was finetuned. You can read more [here](https://huggingface.co/docs/transformers/v4.27.2/en/task_summary#natural-language-processing).

We're going to use the ```text-classification``` pipeline in this class (and [Assignment 4](https://classroom.github.com/a/BhnScEmU)).

In [2]:
from transformers import pipeline



### Text classification

To begin with, let's use the default sentiment classification model to see how we can return a binary sentiment classification for a document. The ```sentiment-analysis``` task name used here is an alias for ```text-classification```.

In [3]:
classifier = pipeline(task="sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [4]:
preds = classifier("Hugging Face is the best thing since sliced bread!")

In [5]:
print(preds)

[{'label': 'POSITIVE', 'score': 0.9990912675857544}]
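
The log message printed when the classifier was created warns against relying on the default model. A minimal sketch of pinning the model explicitly is given here. The model name and revision are copied from that log message; the helper name `build_sentiment_classifier` is only illustrative and not part of ```transformers```. The import sits inside the function so that defining the helper does not trigger any download.

```python
# Model name and revision as reported in the log message for the default pipeline.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_sentiment_classifier():
    """Create the sentiment pipeline with an explicit model and revision.

    Illustrative helper (not part of transformers). Calling it downloads
    the model on first use, so it needs network access.
    """
    # Imported lazily so that defining this helper stays cheap.
    from transformers import pipeline

    return pipeline(
        task="sentiment-analysis",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )


# Example usage (uncomment to run; downloads the model the first time):
# classifier = build_sentiment_classifier()
# preds = classifier(["I loved it.", "I hated it."])  # a list returns one prediction per document
```

Pinning both values should make results easier to reproduce if the default model for the task changes in a later release.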


### Question answering

We can also use BERT-style models for much more complex tasks, such as *question answering*. Again, there's a ```HuggingFace``` pipeline for this!

Let's start by defining a text we want to use as our *context*:

In [6]:
text = "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles."

We then initialize our question-answering pipeline.

In [7]:
question_answerer = pipeline(task="question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

And then we define the question we want to ask of our text:

In [8]:
answer = question_answerer(
    context = text,
    question="What are the main results of this paper?",
) # the answer is the span of tokens in the context that answers your question

In [9]:
print(answer)

{'score': 0.06767084449529648, 'start': 505, 'end': 570, 'answer': 'our best model outperforms even all previously reported ensembles'}
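
The ```start``` and ```end``` values in this output are character offsets into the context, so slicing the context with them gives back the answer string. A small sketch of that relationship follows. It uses only the final sentence of the context as a shortened stand-in (so the offsets differ from 505 and 570), and `locate_answer` is a hypothetical helper written for this illustration.

```python
def locate_answer(context, answer_text):
    """Return (start, end) character offsets of answer_text in context.

    Hypothetical helper for illustration; the pipeline computes these
    offsets itself and returns them under 'start' and 'end'.
    """
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer text does not occur in the context")
    return start, start + len(answer_text)


# Shortened stand-in for the full context used above.
context = "In the former task our best model outperforms even all previously reported ensembles."
answer_text = "our best model outperforms even all previously reported ensembles"

start, end = locate_answer(context, answer_text)

# Slicing the context with the offsets recovers the answer span.
span = context[start:end]
print(start, end, span)
```

Note also that the score of roughly 0.07 in the output above is low, which suggests the model is not very confident that this span answers such a broad question.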


### Text summarization

HuggingFace also allows us to use other styles of transformer models, such as T5 and GPT, which we'll be looking at in the coming weeks. These allow us to do interesting things like *text summarization* and *text generation*.

In [10]:
summarizer = pipeline(task="summarization")

summary = summarizer(text)

No model was supplied, defaulted to t5-small and revision d769bba (https://huggingface.co/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [11]:
print(summary) # the summary is not a subset of 'text' but generated sentences

[{'summary_text': 'the Transformer replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . the Transformer can be trained significantly faster than architectures based on convolutional layers .'}]
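
The comment in the cell above notes that the summary is generated rather than copied from the text. One rough way to check this is sketched here, using the strings from the cells above and a hypothetical helper `word_overlap`: test whether the summary occurs verbatim in the source, and measure how many of its words also appear there.

```python
text = (
    "In this work, we presented the Transformer, the first sequence transduction model "
    "based entirely on attention, replacing the recurrent layers most commonly used in "
    "encoder-decoder architectures with multi-headed self-attention. For translation tasks, "
    "the Transformer can be trained significantly faster than architectures based on "
    "recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 "
    "English-to-French translation tasks, we achieve a new state of the art. In the former "
    "task our best model outperforms even all previously reported ensembles."
)

# Summary string copied from the output above.
summary_text = (
    "the Transformer replaces the recurrent layers most commonly used in encoder-decoder "
    "architectures with multi-headed self-attention . the Transformer can be trained "
    "significantly faster than architectures based on convolutional layers ."
)


def word_overlap(summary, source):
    """Fraction of whitespace-separated summary words that also occur in source."""
    source_words = set(source.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return 0.0
    shared = sum(1 for word in summary_words if word in source_words)
    return shared / len(summary_words)


# True only if the whole summary appears word for word in the source.
is_verbatim = summary_text in text
overlap = word_overlap(summary_text, text)

print(is_verbatim)
print(round(overlap, 2))
```

A high word overlap combined with no verbatim match would suggest that the model rephrases and recombines parts of the source rather than copying whole sentences.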


### Text generation 

Compare how this performs relative to your trained RNN and consider that we're only using the default parameters here:

In [12]:
prompt = "Hugging Face is a community-based open-source platform for machine learning."

In [13]:
generator = pipeline(task="text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [14]:
generated = generator(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [15]:
print(generated)

[{'generated_text': 'Hugging Face is a community-based open-source platform for machine learning. It requires no external client, but will be available for personal use. There is no limit to the number of people registered with this platform!\n\nFor technical details,'}]
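
As the output shows, ```generated_text``` contains the prompt followed by the model's continuation. A minimal sketch of separating the two is shown here, using the values from the cells above; `continuation` is a hypothetical helper name.

```python
prompt = "Hugging Face is a community-based open-source platform for machine learning."

# Output copied from the cell above.
generated = [{'generated_text': 'Hugging Face is a community-based open-source platform for machine learning. It requires no external client, but will be available for personal use. There is no limit to the number of people registered with this platform!\n\nFor technical details,'}]


def continuation(prompt, generated_text):
    """Return only the newly generated part of generated_text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full string if the prompt is not a prefix.
    return generated_text.strip()


new_text = continuation(prompt, generated[0]["generated_text"])
print(new_text)
```

Looking at the continuation on its own makes it easier to compare against the text produced by your RNN.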


### Using a different model

So far, we've only been using the default models and parameters for these tasks. But if you check out the ```HuggingFace``` model universe, you'll see that there are many finetuned models (in some cases hundreds) which can be slotted into these pipelines.

Check out the options [here](https://huggingface.co/models).

In [16]:
classifier = pipeline("text-classification", 
                      model="j-hartmann/emotion-english-distilroberta-base", 
                      return_all_scores=True)


In [17]:
classifier("I love this!")

[[{'label': 'anger', 'score': 0.004419785924255848},
  {'label': 'disgust', 'score': 0.0016119946958497167},
  {'label': 'fear', 'score': 0.00041385178337804973},
  {'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'neutral', 'score': 0.005764591973274946},
  {'label': 'sadness', 'score': 0.002092392183840275},
  {'label': 'surprise', 'score': 0.008528688922524452}]]
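
The classifier returns one list of label and score dictionaries per document. A short sketch of picking out the most likely emotion and ranking the rest follows, using the scores copied from the output above; `top_label` is a hypothetical helper name.

```python
# Scores copied from the output above: one inner list per input document.
scores = [[
    {'label': 'anger', 'score': 0.004419785924255848},
    {'label': 'disgust', 'score': 0.0016119946958497167},
    {'label': 'fear', 'score': 0.00041385178337804973},
    {'label': 'joy', 'score': 0.9771687984466553},
    {'label': 'neutral', 'score': 0.005764591973274946},
    {'label': 'sadness', 'score': 0.002092392183840275},
    {'label': 'surprise', 'score': 0.008528688922524452},
]]


def top_label(all_scores):
    """Return the highest-scoring {'label', 'score'} dict for one document."""
    return max(all_scores, key=lambda item: item["score"])


# The first (and only) document's most likely emotion.
best = top_label(scores[0])

# All emotions for that document, from most to least likely.
ranked = sorted(scores[0], key=lambda item: item["score"], reverse=True)

print(best["label"], round(best["score"], 3))
print([item["label"] for item in ranked[:3]])
```

The same helper can be applied to each inner list when several documents are classified at once.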

This final pipeline forms the basis of [Assignment 4](https://classroom.github.com/a/BhnScEmU), which you should start working on now!