# NLP Transfer Learning with 🤗 Transformers

#### Data Labs Code Class Tues., 10/5

#### Sam Bestvater | Computational Social Scientist

In this notebook, I introduce and demonstrate the Hugging Face (🤗) `Transformers` package, a Python package for transfer learning in neural NLP that is quickly becoming one of the essential tools in the field. 

This notebook draws heavily from the documentation for the package. For more info, see:
- [huggingface.co/transformers](https://huggingface.co/transformers)
- [huggingface.co/course](https://huggingface.co/course)

## 0. Setup
(what packages & dependencies to install if you want to follow along with the notebook)

In [1]:
# # Uncomment this block to install torch on linux (labs jupyterhub server)

# !pip install transformers --user
# !pip install datasets --user
# !pip install sentencepiece --user
# !pip install torch==1.9.1+cpu torchvision==0.10.1+cpu torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html --user
# !pip install torchinfo --user

In [2]:
# # Uncomment this block to install torch on MacOS (local machine)

# !pip install transformers 
# !pip install datasets
# !pip install sentencepiece
# !pip install conda install pytorch torchvision torchaudio -c pytorch
# !pip install torchinfo


## 1. What is a Transformer model, and how does it work?

Let's establish some other definitions first:
- neural networks: what if instead of using just one logistic regression, we chained a bunch of them together in multiple layers? Turns out that gives us a really flexible model for working with super high dimensional data types like text or images. [(3Blue1Brown has a great series on YouTube if you want to actually understand how neural nets work in any detail.)](https://www.youtube.com/watch?v=aircAruvnKk)
- model architecture: the configuration of units and connections that define a specific neural network (i.e. the overall shape of the model).
- model checkpoint: the parameters (weights and biases) that the model learns through training. (i.e. the trained model)
- transfer learning: taking a model checkpoint trained on one task or dataset and using it for another task or dataset (either as-is, or modified to adapt it to that task).
- language modeling: statistically modeling how words in natural language relate to each other, often for next-word or missing-word prediction (like predictive keyboards on smartphones). It turns out that language models are good to use for transfer learning, because the things they learn about the underlying interdependencies of language are widely useful for a variety of NLP tasks. 

A *Transformer* is a kind of neural network architecture designed for language modeling, first introduced by researchers at Google in 2017 [(Vaswani et al. 2017)](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). The big innovation of the transformer architecture is that it is much more computationally efficient than the other types of neural network architectures used for language modeling up to that point (mostly RNNs and LSTMs). The efficiency of the architecture meant that researchers could train much larger models on much larger quantities of data. Another research team at Google followed up the original Transformer paper by releasing a model called BERT, and open-sourcing the model checkpoints for other researchers to use [(Devlin et al. 2018)](https://arxiv.org/pdf/1810.04805.pdf&usg=ALkJrhhzxlCL6yTht2BRmH9atgvKFxHsxQ). The BERT model has 336M trainable parameters and recognizes a vocabulary of over 30,000 English terms. It was trained on two massive text corpora, the WikiText-103 corpus (which is basically all of wikipedia) and the BookCorpus. 

The publication of BERT launched a new era in NLP research, as these huge language models facilitated much more effective transfer learning than anything that had come before. Most tasks in NLP can be improved through the use of BERT or another transformer model like it.



## 2. What kinds of things can you do with Transformers?

A bunch of things!

- Sequence classification: (sentiment analysis, spam detection, grammar checking, logical entailment, etc.)
- Token classification: (POS-tagging, named entity recognition, masked token prediction, etc.)
- Text generation: (GPT stuff & things)
- Sequence transformation: (document summarization, document translation, etc.)
- Text representation: (producing contextual embeddings, representing multilingual texts, etc.)

The kind of task we want to complete will help determine what kind of model we need. The family of Transformer architectures can be further segmented into *encoder*, *decoder* and *encoder-decoder* models, and each of these subtypes is used for different types of tasks. 

*Encoder* Transformers take texts as input and convert them to numerical representations (embeddings) that can then be used for a variety of classification tasks. BERT is an example of an encoder-only transformer.

The original Transformer architecture introduced by [Vaswani et al.](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) is an *encoder-decoder* (sometimes called sequence-to-sequence) model. It was designed for machine translation tasks, where we want to take an input text and translate it into another language while maintaining the same essential meaning. This can be done by using both an *encoder* and a *decoder*. The encoder will convert the input text into embeddings, then the decoder will convert those embeddings into text in the new language. 

There are also *Decoder*-only Transformers. GPT-2/GPT-3 are examples of these. These models tend to be used for text generation.



## 3. What is 🤗?

🤗 (Hugging Face) exists in the form that it does today as a result of competition between the two most common deep learning frameworks for Python, `PyTorch` and `TensorFlow`. Researchers at Google first introduced the Transformer architecture [(Vaswani et al. 2017)](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) and also developed BERT, one of the early general-purpose Transformer language models [(Devlin et al. 2018)](https://arxiv.org/pdf/1810.04805.pdf&usg=ALkJrhhzxlCL6yTht2BRmH9atgvKFxHsxQ). Google also owns TensorFlow, so when the BERT paper was published in 2018, the research team open sourced the pre-trained models as TensorFlow objects (which are all still available on their [GitHub](https://github.com/google-research/bert), BTW). The developers at Hugging Face wanted to use BERT models in their research, but primarily worked in PyTorch, so they converted the models and released a python package called `pytorch-pretrained-bert` that was simply a set of commands for downloading the converted BERT models as PyTorch objects. This quickly became hugely popular, and the de-facto default implementation for other researchers who wanted to use BERT models with PyTorch (which was a lot of people). As more pre-trained transformers were published, Hugging Face started hosting those as well, and began developing a general set of tools to work with these models. Around this time they also changed the name of their python package to the more general `transformers`.

Today, Hugging Face maintains several python packages.

- `transformers` provides functions for downloading and implementing Transformer models for NLP tasks. All the tools now work in both PyTorch and TensorFlow, and many models are now available for both frameworks as well. 
- `huggingface_hub` integrates with the online repository of models that Hugging Face hosts ([huggingface.co/models](https://huggingface.co/models)) and allows users to upload their own fine-tuned transformers. There are currently 17,196 model checkpoints hosted on the Hugging Face Hub.
- `datasets` provides easy access to an online library of major public datasets for NLP, and also provides an efficient data format for use with `transformers` functions and models.

## 4. Cool! Let's see some examples!

### 4.1 At a high level: using the Pipeline API

The `Pipeline` API abstracts away most of the technical detail of the underlying models, allowing us to focus on using pre-trained models for specific, well-defined tasks.

In [3]:
from transformers import pipeline

import pandas as pd
import numpy as np

import torch

print('CUDA enabled') if torch.cuda.is_available() else print('CPU only')

2021-10-04 19:35:46.281701: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-04 19:35:46.281737: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


CPU only


### Sentiment Analysis
(asking an encoder model to determine if the overall tone of a text is positive or negative)

In [4]:
classifier = pipeline("sentiment-analysis")
classifier(
    ["Everyone on Data Labs is amazing and brilliant!", 
    "This code class is so boring, I wish Sam would stop talking."],
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998788833618164},
 {'label': 'NEGATIVE', 'score': 0.9997871518135071}]

### Zero-shot classification
(asking an encoder model to classify a text from a list of labels it's never seen before)

In [5]:
classifier = pipeline("zero-shot-classification")
classifier(
    ["The 2020 election featured dramatic increases in lawmaker posts and audience engagement",
    "Majority in U.S. Says Public Health Benefits of COVID-19 Restrictions Worth the Costs"],
    candidate_labels=["education", "politics", "business", "health"],
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


[{'sequence': 'The 2020 election featured dramatic increases in lawmaker posts and audience engagement',
  'labels': ['politics', 'business', 'health', 'education'],
  'scores': [0.950331449508667,
   0.026905572041869164,
   0.012079858221113682,
   0.010683099739253521]},
 {'sequence': 'Majority in U.S. Says Public Health Benefits of COVID-19 Restrictions Worth the Costs',
  'labels': ['health', 'business', 'politics', 'education'],
  'scores': [0.9772828221321106,
   0.011761322617530823,
   0.008106389082968235,
   0.00284951226785779]}]

### Text generation
(asking a decoder model to produce more text from a given input)

In [6]:
generator = pipeline("text-generation")
generator("In a hole in the ground, there lived a Hobbit.",
         num_return_sequences = 2, # how many examples you'd like the model to produce
         max_length = 50) # how long these should be

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In a hole in the ground, there lived a Hobbit. They gave him a sword and a pair of spurs. They put him into the dungeon, and he was saved.\n\nWhen he was sixteen it was the night he had just given'},
 {'generated_text': 'In a hole in the ground, there lived a Hobbit. In this case, it was a black dragon. I have no idea what it was called. Or how it looked on the ground."\n\n"The black dragon is the first character who'}]

### Summarization
(asking an encoder-decoder model to produce a shorter version of a given input)

In [7]:
summarizer = pipeline("summarization")
summarizer("""
            Although gymnast Simone Biles’ medal count fell slightly short of the 
            sports world’s lofty expectations in the Tokyo 2020 Olympic Games, 
            she dominated among U.S. Olympians in the number of times her handle, 
            @Simone_Biles, was mentioned on Twitter.

            Pew Research Center captured the Twitter handles of every athlete who 
            listed a profile on the official Team USA page and looked at tweets 
            from the broader Twitter audience that directly mentioned those handles 
            during the Games. Here are some key takeaways for how the public engaged 
            with Team USA on Twitter.
            
            All told, 598 athletes were listed on the Team USA website at the start 
            of the Games. And 438 of them (73% of the total) included a Twitter handle 
            in their athlete profile. From July 21 through Aug. 9, 2021 – the Games 
            themselves, postponed from the year before, were held July 23 to Aug. 8 – 
            more than 900,000 different Twitter accounts directly mentioned the handles 
            of U.S. Olympians in more than 2.1 million tweets. The vast majority (90%) 
            of those athlete accounts were mentioned at least once during that time.

            These mentions were especially concentrated on a few key dates. Nearly a 
            third (31%) of all athlete mentions occurred during the three days of July 27-29, 
            a period that included the women’s team and individual gymnastics finals and 
            swimmer Katie Ledecky winning the gold medal in the 1,500-meter freestyle.
            """,
           max_length = 100)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' Pew Research Center captured the Twitter handles of every athlete who listed a profile on the official Team USA page and looked at tweets from the broader Twitter audience . From July 21 through Aug. 9, 2021 – the Games were held July 23 to Aug. 8 – more than 900,000 different Twitter accounts directly mentioned the handles of U.S. Olympians in more than 2.1 million tweets .'}]

### Translation
(asking an encoder-decoder model to translate an input from a specified source language to a specified target language)

In [8]:
translator = pipeline("translation_en_to_fr")
translator("My name is Sam and I work on the Data Labs team at Pew Research Center in Washington DC.")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


[{'translation_text': "Je m'appelle Sam et je travaille au sein de l'équipe Data Labs du Pew Research Center à Washington DC."}]

### Named entity recognition (NER)
(asking an encoder model to extract entities such as persons, locations, or organizations from an input sequence)

In [9]:
ner = pipeline("ner",
              grouped_entities = True # Allows n-gram entities
              )
ner("My name is Sam and I work on the Data Labs team at Pew Research Center in Washington DC.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity_group': 'PER',
  'score': 0.99874526,
  'word': 'Sam',
  'start': 11,
  'end': 14},
 {'entity_group': 'ORG',
  'score': 0.9987684,
  'word': 'Data Labs',
  'start': 33,
  'end': 42},
 {'entity_group': 'ORG',
  'score': 0.99569225,
  'word': 'Pew Research Center',
  'start': 51,
  'end': 70},
 {'entity_group': 'LOC',
  'score': 0.99902904,
  'word': 'Washington DC',
  'start': 74,
  'end': 87}]

### Question Answering
(providing an encoder model with a context statement and asking it questions based on that context)

In [10]:
question_answerer = pipeline("question-answering")
question_answerer(
    question=["What is my name?", "Where do I work?"],
    context="My name is Sam and I work on the Data Labs team at Pew Research Center in Washington DC."
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


[{'score': 0.9965927004814148, 'start': 11, 'end': 14, 'answer': 'Sam'},
 {'score': 0.3084445297718048,
  'start': 51,
  'end': 70,
  'answer': 'Pew Research Center'}]

### Feature Extraction
(using an encoder model to encode an input text into a contextual embedding that can be used for other NLP tasks)

In [11]:
encoder = pipeline("feature-extraction")
embedding = encoder("Hobbits have hairy feet.")

embedding = np.array(embedding)

embedding.shape

No model was supplied, defaulted to distilbert-base-cased (https://huggingface.co/distilbert-base-cased)
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(1, 9, 768)

In [12]:
encoder = pipeline("feature-extraction")
embedding = encoder(["Hobbits have hairy feet.",
                     "Does BERT know what a Hobbit is?"])

embedding = np.array(embedding)

embedding.shape

No model was supplied, defaulted to distilbert-base-cased (https://huggingface.co/distilbert-base-cased)
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(2, 14, 768)

In [13]:
# remove previous models to free up memory:
import gc

del classifier
del generator
del summarizer
del translator
del ner
del question_answerer
del encoder

gc.collect()

if torch.cuda.is_available():
    torch.cuda.empty_cache()

    t = torch.cuda.get_device_properties(0).total_memory
    r = torch.cuda.memory_reserved(0)
    a = torch.cuda.memory_allocated(0)
    f = r-a  # free inside reserved

    print(t)
    print(f)

### 4.2 Less abstraction: using the Trainer API

The `Pipeline` API makes certain pre-defined tasks really really easy, but it's doing a lot under the hood. 

Moving from raw text to predictions requires not just a model, but also a tokenizer that produces inputs the model understands, and a post-processing step that can convert the raw model outputs into something interpretable. 

![Under the hood](https://huggingface.co/course/static/chapter2/full_nlp_pipeline.png)

Sometimes we need models for tasks outside of these pre-determined pipelines. For that, we can use the `Trainer` API, which gives us more control over these underlying components of the model and allows us to fine-tune pre-trained models on new, task-specific data.

Let's say we want to fine-tune a transformer for a custom classification task, such as identifying the stance of tweets about the Kavanaugh confirmation hearings (see [Bestvater & Monroe, working paper](https://bestvater.github.io/pdfs/BestvaterMonroe_SentimentIsNotStance.pdf) -- absolutely shameless self-promotion.) 

Let's load some data. Probably the most common way to do this in Python is with Pandas:

In [14]:
kav_tweets = pd.read_csv('https://github.com/bestvater/misc/raw/master/kavanaugh_tweets_stance.csv',
                         usecols = ['text', 'stance'])

kav_tweets['stance'] = np.where(kav_tweets['stance'] == 1, # make the labels more informative
                                'Supports the Kavanaugh confirmation', 
                                'Opposes the Kavanaugh confirmation')

kav_tweets = kav_tweets.sample(n = 100, random_state = 101) # subset so training doesn't take forever

kav_tweets.head()

Unnamed: 0,text,stance
795,'He's a Liar': Watters Rips RI Sen. Whitehouse...,Supports the Kavanaugh confirmation
665,HERE WE GO - WHAT IS IT WITH SENATORS FROM NEW...,Supports the Kavanaugh confirmation
1389,JUST IN: Democrats float idea of impeaching Ka...,Supports the Kavanaugh confirmation
416,She already has her mind made up...She knows A...,Opposes the Kavanaugh confirmation
147,Only 31% of Americans believe that Brett Kavan...,Opposes the Kavanaugh confirmation


Let's convert this to a hugging face dataset though--this will give us some specific functionality that we don't get with Pandas.

In [15]:
from datasets import Dataset

kav_tweets = Dataset.from_pandas(kav_tweets)

kav_tweets

Dataset({
    features: ['text', 'stance', '__index_level_0__'],
    num_rows: 100
})

Okay, now we can get started. First we need to decide on a model checkpoint. This is a sequence classification task, so we want one of the encoder models. I'm going to go with `DistilBERT` because it's relatively lightweight.

In [16]:
checkpoint = 'distilbert-base-uncased'

### Tokenizing/Preprocessing

Once we've chosen a model to use, we need to do some preprocessing to convert the raw texts we want to classify into inputs that our `DistilBERT` model will recognize. We do this using a `tokenizer` function, which is provided along with every pre-trained model in the hugging face library.

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint, # loads the tokenizer for the model we specify
                                         model_max_len = 128) # set the dimension of the input vector. Default is 512

tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Here's what happens when we pass a raw text to this tokenizer:

In [18]:
tokenizer('Hello. This is Sono from work.')

{'input_ids': [101, 7592, 1012, 2023, 2003, 2365, 2080, 2013, 2147, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now I'm going to wrap that in another function so I can easily apply it to the whole dataset at once using `map()`. This isn't strictly necessary, but for big datasets it can speed things up a lot. 

In [19]:
def tokenize_function(input_dataset):
    return tokenizer(input_dataset['text'],
                     padding = 'max_length', # will pad documents shorter than 128 tokens
                     truncation = True ) # will truncate documents longer than 128 tokens

kav_tweets = kav_tweets.map(tokenize_function, batched = True)

kav_tweets

  0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['__index_level_0__', 'attention_mask', 'input_ids', 'stance', 'text'],
    num_rows: 100
})

Now our dataset has the `input_ids` and `attention_mask` vectors that we'll need to pass to the model. Before we do that, though, let's process the `stance` column, which contains our labels. 

In [20]:
kav_tweets = kav_tweets.rename_column('stance', 'labels')
kav_tweets = kav_tweets.class_encode_column('labels') # tells the model that this is the labels column

kav_tweets

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['__index_level_0__', 'attention_mask', 'input_ids', 'text', 'labels'],
    num_rows: 100
})

### The Model

Now that we've got correctly pre-processed texts, we can load and fine-tune our `DistilBERT` model.

In [21]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, # tell from_pretrained to load distilbert
                                                           num_labels = 2) # tell it to add a binary classifier head

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Okay, that warning message is telling us that we need to TRAIN this model before it will be useful to us. We can do that using a `Trainer()`.

In [22]:
from transformers import TrainingArguments, Trainer

The `TrainingArguments()` class lets us specify all of the parameters that get passed to a `Trainer()`. Most of the defaults are sensible, but we're going to tweak a couple of things:

In [23]:
training_args = TrainingArguments(output_dir = './distilbert_model', # specify the directory where our fine-tuned model will be saved
                                  overwrite_output_dir = True, # so it doesn't make a new copy every time I run this
                                  evaluation_strategy = 'no', # check model performance every epoch
                                  logging_strategy = 'no', # we're not going to log the model's performance anywhere though
                                  per_device_train_batch_size = 16, # how many inputs to process at once during training
                                  num_train_epochs = 2 # how many times to pass through the entire training set
                                 )

In [24]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = kav_tweets,
    eval_dataset = kav_tweets # don't evaluate on the training data, bad.
)

In [25]:
trainer.train() # this takes a little while on CPU

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, __index_level_0__.
***** Running training *****
  Num examples = 100
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 14


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=14, training_loss=0.6939981324332101, metrics={'train_runtime': 58.1848, 'train_samples_per_second': 3.437, 'train_steps_per_second': 0.241, 'total_flos': 26493479731200.0, 'train_loss': 0.6939981324332101, 'epoch': 2.0})

Okay, so that's a fine-tuned DistilBERT model. Let's try getting classifications for a couple new texts:

In [26]:
new_input = tokenizer(["I don't think Brett Kavanaugh should be on the Supreme Court.",
                       "Quit stalling and confirm Kavanaugh already!"],
                     padding = 'max_length',
                     truncation = True,
                     return_tensors = 'pt')

In [27]:
outputs = model(**new_input)

outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.1617, -0.0631],
        [-0.0285, -0.0232]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

Our model returns logits, which aren't particularly interpretable on their own. Let's transform those to probabilities:

In [28]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) #convert logits to probabilities

predictions

tensor([[0.5560, 0.4440],
        [0.4987, 0.5013]], grad_fn=<SoftmaxBackward>)

Well, those aren't very confident predictions, but we also only fine-tuned the model on 100 documents. Luckily, through some cooking show-style magic, I've also trained a version of the model on the full dataset and uploaded that to the Hugging Face model hub. Let's look at that instead.

We can load models from specific repos on the hub by passing `AutoModelForSequenceClassification.from_pretrained()` the repo and the model name, like this:

In [29]:
model = AutoModelForSequenceClassification.from_pretrained('bestvater/distilbert-kav-stance', # tell from_pretrained to load from model hub
                                                           num_labels = 2) # tell it to add a binary classifier head

loading configuration file https://huggingface.co/bestvater/distilbert-kav-stance/resolve/main/config.json from cache at /home/sbestvater/.cache/huggingface/transformers/f333dc4604dbfd5a54e90fa6a608e7ce79f8e1ba4c1bcff4dc6e2c62c9ff44aa.79ed5439eb7e4356a6238a60a4acaf44e3089240708cb99f911118cb18a7b4d8
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.10.3",
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bestvater/

In [30]:
outputs = model(**new_input)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) #convert logits to probabilities

predictions

tensor([[0.9226, 0.0774],
        [0.0476, 0.9524]], grad_fn=<SoftmaxBackward>)

Nice. Now we have a custom classifier that can identify the stance of tweets about the Kavanaugh hearings.

### 4.3 Even less abstraction: Transformers as PyTorch or TensorFlow model objects
If we really want the ability to tweak anything at all, models on the Transformers hub are all just PyTorch objects (many are also availabile in TensorFlow). If we want to, we can just load the pretrained models and work with them directly in those frameworks, ignoring the huggingface APIs altogether. 

As a really quick proof of this, let's load the `summary` function from `torchinfo`. This is a simple function that prints out an architechture summary of any PyTorch model. We can apply this to our trained DistilBERT model:

In [31]:
from torchinfo import summary

summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0

I won't go into any more PyTorch detail today, but it's worth knowing that you can work with these models this way if you want to.

## 5. A word of warning on algorithmic bias

Transformers are cool and useful because they "learn" components of the complex interdependencies of natural language. But it's important to remember that they do this by being trained on massive amounts of text produced by humans, who have implicit biases. Large language models often learn the biases present in their training data. 

Here is an example, using the `fill-mask` pipeline, which is essentially asking the model to play madlibs and fill in a missing word in an input string.

In [32]:
fill_blank = pipeline("fill-mask", model = 'bert-base-uncased')

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /home/sbestvater/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json 

In [33]:
result = fill_blank("This man works as a [MASK].")
print([r["token_str"] for r in result])

['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']


In [34]:
result = fill_blank("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


Wow. That is *obnoxiously* sexist.

For this example we used the `bert-base-uncased` model, which is trained on the English Wikipedia and BookCorpus datasets. The BookCorpus in particular includes a lot of older texts, so that might be the source for some of this bias, but some other large language models are trained on more contemporary texts scraped from all over the internet, which can produce other, potentially worse issues.