# Transformers with HuggingFace

These notes follow the HuggingFace course on Transformers, available at:

https://huggingface.co/learn/nlp-course/chapter8/2

## Basics of Transformer Pipelines

In [1]:
from transformers import pipeline 

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course the past few minutes.") 

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


RuntimeError: Failed to import transformers.models.distilbert.modeling_tf_distilbert because of the following error (look up to see its traceback):
No module named 'keras.saving.hdf5_format'

Impressive that it seems to detect sarcasm/irony!

In [None]:
classifier("I've been waiting for a HuggingFace course my whole life.")

NameError: name 'classifier' is not defined

How interesting! Clearly that phrase at the end can completely invert the sentiment!

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.",
     "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### Zero-shot classification

In [None]:
zero_shot = pipeline("zero-shot-classification") 
zero_shot(
    "This is a course about the Transformers library", 
    candidate_labels=["education", "politics", "business"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445961475372314, 0.11197632551193237, 0.043427467346191406]}

### Text generation

In [None]:
generator = pipeline("text-generation") 
generator("In this course, we will teach you how to") 

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


RuntimeError: Failed to import transformers.models.gpt2.modeling_tf_gpt2 because of the following error (look up to see its traceback):
No module named 'keras.saving.hdf5_format'

In [None]:
text = "My father’s family name being"

generator(text, num_return_sequences=2, max_length=15) 

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'My father’s family name being "Kathleen’le'},
 {'generated_text': 'My father’s family name being taken away or changed by something was'}]

### Selecting a Specific Model

In [None]:
distil = pipeline("text-generation", model="distilgpt2") 
# call the distilgpt2 pipeline created above, not the default gpt2 `generator`
distil(
    "In this course, we will teach you how to", 
    max_length=20, 
    num_return_sequences=2,
)




Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create your own virtual account on your Android device.'},
 {'generated_text': 'In this course, we will teach you how to create software and hardware prototypes of various forms such as'}]

### Mask filling

In [None]:
unmasker = pipeline("fill-mask") 
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'score': 0.1961973011493683,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.040526967495679855,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

In [None]:
bert_unmasked = pipeline('fill-mask', model='bert-base-cased')
bert_unmasked("This course will teach you all about [MASK] models.", top_k=2)



Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



[{'score': 0.2596305012702942,
  'token': 1648,
  'token_str': 'role',
  'sequence': 'This course will teach you all about role models.'},
 {'score': 0.09427257627248764,
  'token': 1103,
  'token_str': 'the',
  'sequence': 'This course will teach you all about the models.'}]

### Named Entity Recognition

In [None]:
from transformers import pipeline 

ner = pipeline("ner", grouped_entities=True) 
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.





[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.97960204,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

### Question answering

In [None]:
answers = pipeline("question-answering") 
answers(
    question="Where do I work?", 
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



{'score': 0.6949760317802429, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

### Summarization

In [None]:
summarizer = pipeline("summarization") 
summarizer(
    """
America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'summary_text': ' America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers . Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues .'}]

### Translation

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en") 
translator("La declaration des droits des hommes.")




[{'translation_text': 'Declaration of human rights.'}]

## Bias and Limitations

In [None]:
unmasker = pipeline("fill-mask", model='bert-base-uncased') 
result = unmasker("This man works as a [MASK].") 
print([r["token_str"] for r in result]) 

result = unmasker("This woman works as a [MASK].") 
print([r["token_str"] for r in result])



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


# Using Transformers

In [1]:
from transformers import AutoTokenizer 
# the checkpoint is only the model's identifier (its name on the Hub); nothing is downloaded at this point
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  
# from_pretrained() downloads (and caches) the tokenizer files for that checkpoint and builds a tokenizer instance 
tokenizer = AutoTokenizer.from_pretrained(checkpoint) 



## Tensors

In [2]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!", 
]
inputs = tokenizer(
    raw_inputs, 
    padding=True, 
    truncation=True, 
    return_tensors='pt' 
)
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [3]:
from transformers import AutoModel  

model = AutoModel.from_pretrained(checkpoint) 

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
help(AutoModel)

Help on class AutoModel in module transformers.models.auto.modeling_auto:

class AutoModel(transformers.models.auto.auto_factory._BaseAutoModelClass)
 |  AutoModel(*args, **kwargs)
 |  
 |  This is a generic model class that will be instantiated as one of the base model classes of the library when created
 |  with the [`~AutoModel.from_pretrained`] class method or the [`~AutoModel.from_config`] class
 |  method.
 |  
 |  This class cannot be instantiated directly using `__init__()` (throws an error).
 |  
 |  Method resolution order:
 |      AutoModel
 |      transformers.models.auto.auto_factory._BaseAutoModelClass
 |      builtins.object
 |  
 |  Class methods defined here:
 |  
 |  from_config(**kwargs) from builtins.type
 |      Instantiates one of the base model classes of the library from a configuration.
 |      
 |      Note:
 |          Loading a model from its configuration file does **not** load the model weights. It only affects the
 |          model's configuration. Use [`

In [5]:
outputs = model(**inputs) 
print(outputs.last_hidden_state.shape) 

torch.Size([2, 16, 768])


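The three dimensions of the output are:
- 2: the batch size (number of sequences)
- 16: the sequence length (number of tokens per sequence, after padding)
- 768: the hidden size (the length of the vector that represents each token)

<i> Note: the ** operator unpacks the dictionary, so that the dict keys become parameter names and the dict values become the corresponding argument values. </i> 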
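
To make the note on `**` concrete, here is a minimal sketch in plain Python. The function `fake_model` and its parameters are hypothetical stand-ins for a model's forward call; they are not part of the Transformers API.

```python
def fake_model(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model call; it simply returns what it received."""
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# a tokenizer returns a dict-like object with keys such as these
inputs = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]]}

# the two calls below are equivalent
unpacked = fake_model(**inputs)
explicit = fake_model(input_ids=inputs["input_ids"],
                      attention_mask=inputs["attention_mask"])
print(unpacked == explicit)
```

This is why `model(**inputs)` works without naming `input_ids` or `attention_mask` explicitly.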

In [6]:
from transformers import AutoModelForSequenceClassification  

model = AutoModelForSequenceClassification.from_pretrained(checkpoint) 
outputs = model(**inputs)

In [7]:
print(outputs.logits.shape)

torch.Size([2, 2])


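The logits have shape [2, 2]: one row per input sentence (batch size 2) and one column per class label (NEGATIVE and POSITIVE). This is not a confusion matrix; each row holds the raw, unnormalized scores for one sentence. 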

In [8]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


### Processing raw outputs

The logits are raw, unnormalized scores. Passing them through a softmax layer converts them into probabilities.

In [9]:
import torch 

# dim=-1 applies softmax to the last dimension of the input tensor, which will be the two classes (the two columns of the matrix) 
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) 
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Rounded to four decimal places, this is:
[[0.0402, 0.9598],
 [0.9995, 0.0005]]
These are recognizable probability scores: each row sums to 1. 
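
As a sanity check, the same numbers can be reproduced without PyTorch. The sketch below is a plain-Python softmax applied to the logits printed earlier; the helper name `softmax` is my own, and subtracting the maximum is a common stability trick rather than something the notebook does.

```python
import math

def softmax(row):
    """Softmax over a list of floats (shifted by the max for numerical stability)."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# logits copied from the model output above
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```

Each row is normalized independently, which is what `dim=-1` asks for in the PyTorch call.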

In [10]:
help(torch.nn.functional.softmax)

Help on function softmax in module torch.nn.functional:

softmax(input: torch.Tensor, dim: Union[int, NoneType] = None, _stacklevel: int = 3, dtype: Union[int, NoneType] = None) -> torch.Tensor
    Apply a softmax function.
    
    Softmax is defined as:
    
    :math:`\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}`
    
    It is applied to all slices along dim, and will re-scale them so that the elements
    lie in the range `[0, 1]` and sum to 1.
    
    See :class:`~torch.nn.Softmax` for more details.
    
    Args:
        input (Tensor): input
        dim (int): A dimension along which softmax will be computed.
        dtype (:class:`torch.dtype`, optional): the desired data type of returned tensor.
          If specified, the input tensor is casted to :attr:`dtype` before the operation
          is performed. This is useful for preventing data type overflows. Default: None.
    
    .. note::
        This function doesn't work directly with NLLLoss,
        which 

In [11]:
# get labels corresponding to position 
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

So the model predicted: 
First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598 
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
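
Turning those probabilities into a single predicted label per sentence is a matter of taking the index of the largest value in each row and looking it up in `id2label`. A minimal sketch, with the values copied by hand from the cells above:

```python
# values copied from the cells above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]

predicted = []
for row in probabilities:
    best = max(range(len(row)), key=lambda i: row[i])  # index of the highest probability
    predicted.append((id2label[best], row[best]))
print(predicted)
```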

TODO: 
- try this with UEQ data to see if it outperforms TextBlob 

## Models

### Creating a Transformer

In [12]:
# load a configuration object 
from transformers import BertConfig, BertModel  

# Build the config 
config = BertConfig() 

# Building model from config 
model = BertModel(config)

In [13]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



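The config describes the architecture: it holds the hyperparameters (hidden size, number of layers, number of attention heads, vocabulary size and so on) but no weights. 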
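
As a rough illustration of how the config alone fixes the model's size, the sketch below counts the parameters implied by the values printed above. It assumes the standard BERT layout (token, position and token-type embeddings; per layer a self-attention block and a feed-forward block, each followed by LayerNorm; a final pooler). The helper `count_bert_parameters` is a hypothetical name, not a library function.

```python
def count_bert_parameters(vocab_size=30522, hidden_size=768, num_hidden_layers=12,
                          intermediate_size=3072, max_position_embeddings=512,
                          type_vocab_size=2):
    """Count weights and biases implied by a BERT-style config (assumed layout)."""

    def linear(n_in, n_out):
        # weight matrix plus bias vector of a dense layer
        return n_in * n_out + n_out

    layer_norm = 2 * hidden_size  # scale and bias

    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
    embeddings += layer_norm

    attention = 4 * linear(hidden_size, hidden_size) + layer_norm  # Q, K, V, output
    feed_forward = (linear(hidden_size, intermediate_size)
                    + linear(intermediate_size, hidden_size) + layer_norm)
    pooler = linear(hidden_size, hidden_size)
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler

total = count_bert_parameters()
print(f"{total:,} parameters")
```

With the default values this comes to roughly 110 million. The number of attention heads does not appear in the count because the heads only split the hidden dimension; they add no parameters of their own.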

### Loading and saving models

A model created from the default configuration is initialized with random values. 

Such a model needs to be trained before it can produce useful output. 

In [14]:
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
save_path = r"C:\Users\\bened\DataScience\ANLP\\Labs\\Lab5\\models"
model.save_pretrained(f"{save_path}\\bert-base-cased") 

This will save two files to the directory: 
- config.json contains the model hyperparameters or architecture 
- pytorch_model.bin is a <i>state dictionary</i> containing model weights. 

### Inference tasks

In [16]:
sequences = ["Hello!", "Cool.", "Nice!"]
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [17]:
model_inputs = torch.tensor(encoded_sequences)

In [18]:
# use tensors as model inputs 
output = model(model_inputs)

## Tokenizers

Tokenization here means more than splitting text into smaller strings: the tokenizer also maps each token to a numerical ID, which makes it loosely analogous to vectorization in other contexts.

### Word-based

In [19]:
tokenized_text = "Jim Henson was a puppeteer".split() 
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
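
Splitting is only the first half of the job; each token still has to be mapped to an ID. A minimal sketch of that second step, using a made-up five-entry vocabulary (real vocabularies have tens of thousands of entries, and `encode` is a hypothetical helper):

```python
# toy vocabulary, invented for illustration
vocab = {"[UNK]": 0, "Jim": 1, "Henson": 2, "was": 3, "a": 4}

def encode(text):
    """Word-based tokenization: split on whitespace, then look up each word's ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

ids = encode("Jim Henson was a puppeteer")
print(ids)  # 'puppeteer' is not in the toy vocabulary, so it maps to [UNK]
```

The unknown-token fallback is the main weakness of word-based tokenizers, and it is what subword tokenizers are designed to reduce.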


### Loading and saving

In [20]:
from transformers import BertTokenizer 

tokenizer = BertTokenizer.from_pretrained("bert-base-cased") 

In [21]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [22]:
tokenizer.save_pretrained(f"{save_path}\\bert-tokenizer")

('C:\\Users\\\\bened\\DataScience\\ANLP\\\\Labs\\\\Lab5\\\\models\\bert-tokenizer\\tokenizer_config.json',
 'C:\\Users\\\\bened\\DataScience\\ANLP\\\\Labs\\\\Lab5\\\\models\\bert-tokenizer\\special_tokens_map.json',
 'C:\\Users\\\\bened\\DataScience\\ANLP\\\\Labs\\\\Lab5\\\\models\\bert-tokenizer\\vocab.txt',
 'C:\\Users\\\\bened\\DataScience\\ANLP\\\\Labs\\\\Lab5\\\\models\\bert-tokenizer\\added_tokens.json')

### Tokenization

In [23]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") 

sequence = "Using a Transformer network is simple" 
tokens = tokenizer.tokenize(sequence) 

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


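This is a subword tokenizer: words missing from the vocabulary are split into smaller pieces, and the '##' prefix marks a piece that continues the previous token (here 'Trans' + '##former'). These pieces are not necessarily true morphemes.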

In [24]:
ids = tokenizer.convert_tokens_to_ids(tokens) 

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


### Decoding

In [25]:
decoded_string = tokenizer.decode([
    7993, 170, 11303, 1200, 2443, 1110, 3014
])
print(decoded_string)

Using a transformer network is simple


Note that the decoder reattaches subwords appropriately: the '##' prefix marks a token as a continuation of the previous one, so the two are joined without a space. 
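
The reattachment rule can be sketched in a few lines of plain Python. This is a simplified illustration of the '##' convention only; the real decoder also deals with special tokens and punctuation spacing, and `join_wordpieces` is a hypothetical helper name.

```python
def join_wordpieces(tokens):
    """Merge '##'-prefixed continuation pieces into the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # continuation: glue onto the previous piece
        else:
            words.append(token)      # start of a new word
    return " ".join(words)

tokens = ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
text = join_wordpieces(tokens)
print(text)
```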

## Multiple Sequences

In [26]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint) 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint) 

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence) 
ids = tokenizer.convert_tokens_to_ids(tokens) 
input_ids = torch.tensor(ids) 
model(input_ids)



IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

In [27]:
tokenized_inputs = tokenizer(sequence, return_tensors='pt') 
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


In [28]:
input_ids = torch.tensor([ids]) 
print("Input IDs:", input_ids) 

output = model(input_ids) 
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
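
The earlier IndexError came from passing a one-dimensional list of IDs: the model expects a batch, that is, a two-dimensional input of shape (batch size, sequence length). Wrapping the list in another list adds the missing dimension. The sketch below shows the difference on plain nested lists; `shape` is a hypothetical helper, not a library function.

```python
def shape(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

ids = [1045, 1005, 2310]   # first few IDs from the output above
print(shape(ids))          # one dimension: a bare sequence
print(shape([ids]))        # two dimensions: a batch containing one sequence
```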


### Padding

In [29]:
sequence1_ids = [[200, 200, 200]] 
sequence2_ids = [[200, 200]] 
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits) 
print(model(torch.tensor(sequence2_ids)).logits) 
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward0>)


### Attention masking

An attention mask is a tensor with the same shape as the input IDs tensor, filled with 0s and 1s: a 1 marks a token the model should attend to, and a 0 marks a token (such as padding) that it should ignore. 
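
To see why the mask matters, consider a toy stand-in for the model: averaging one number per token. Attention is more involved than a mean, so this is only an analogy, and the per-token values are made up. Padding shifts the result unless the padded position is masked out.

```python
def masked_mean(values, mask):
    """Average only the positions where the mask is 1."""
    kept = [v for v, m in zip(values, mask) if m == 1]
    return sum(kept) / len(kept)

token_values = [4.0, 2.0]         # made-up values for a two-token sequence
padded_values = [4.0, 2.0, 0.0]   # the same sequence padded to length three
mask = [1, 1, 0]                  # 0 marks the padding position

unpadded = sum(token_values) / len(token_values)
without_mask = sum(padded_values) / len(padded_values)
with_mask = masked_mean(padded_values, mask)
print(unpadded, without_mask, with_mask)
```

Only the masked version recovers the result of the unpadded sequence, which mirrors the logits in this section: the padded row matches the single-sequence logits only once an attention mask is supplied.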

In [30]:
attention_mask = [
    [1,1,1],
    [1,1,0]
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask)) 
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


## Pipelining

In [31]:
model_inputs = tokenizer(sequence) 

In [32]:
sequences= [
    sequence, 
    "So have I!"
]
model_inputs = tokenizer(sequences)

In [33]:
# pad every sequence to the length of the longest sequence in the batch 
model_inputs_max = tokenizer(sequences, padding='longest') 

# pad every sequence to the model's maximum length 
model_inputs_maxlen = tokenizer(sequences, padding="max_length") 

# pad every sequence to the specified maximum length 
model_inputs_custom = tokenizer(sequences, padding="max_length", max_length=8)

### Truncating sequences

In [34]:
# truncate sequences that are longer than the model's maximum length 
model_inputs_trunc = tokenizer(sequences, truncation=True) 

# truncate sequences that are longer than the specified maximum length 
model_inputs_trunc_custom = tokenizer(sequences, max_length=8, truncation=True)
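
What padding and truncation do to a batch can be sketched on plain lists of IDs. This is a simplification under stated assumptions: the token IDs are made up, `pad_and_truncate` is a hypothetical helper, and a real tokenizer keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(batch, max_length=None, pad_id=0):
    """Truncate each sequence to max_length (if given), then pad to a common length."""
    if max_length is not None:
        batch = [seq[:max_length] for seq in batch]
    target = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask

batch = [[101, 7592, 2088, 999, 102], [101, 7592, 102]]  # made-up token IDs
ids_longest, mask_longest = pad_and_truncate(batch)
ids_short, mask_short = pad_and_truncate(batch, max_length=4)
print(ids_longest, mask_longest)
print(ids_short, mask_short)
```

After either step the batch is rectangular, which is what allows it to be converted into a single tensor.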

### Different tensors

In [35]:
# PyTorch 
inputs_pt = tokenizer(sequences, padding=True, return_tensors='pt') 

# TensorFlow 
inputs_tf = tokenizer(sequences, padding=True, return_tensors="tf") 

# NumPy 
inputs_np = tokenizer(sequences, padding=True, return_tensors="np")

In [36]:
seq_input = tokenizer(sequence) 
print(seq_input["input_ids"]) 

tokens = tokenizer.tokenize(sequence) 
ids = tokenizer.convert_tokens_to_ids(tokens) 
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [37]:
print(tokenizer.decode(seq_input["input_ids"])) 
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


In [38]:
tokens = tokenizer(
    sequences,
    padding=True, 
    truncation=True, 
    return_tensors="pt"
)
output=model(**tokens)

### Quiz

In [39]:
result = tokenizer.tokenize("Hello!") 
print(result)

['hello', '!']


In [40]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") 
model = AutoModel.from_pretrained("gpt2") 

encoded = tokenizer("Hey!", return_tensors='pt') 
result = model(**encoded) 

# Fine-Tuning

## Data Processing

In [41]:
# transformers.AdamW is deprecated; PyTorch's own AdamW is the recommended replacement
from torch.optim import AdamW

checkpoint = "bert-base-uncased"  
tokenizer = AutoTokenizer.from_pretrained(checkpoint) 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint) 
batch = tokenizer(
    sequences, 
    padding=True, 
    truncation=True, 
    return_tensors='pt'
)
batch["labels"] = torch.tensor([1,1]) 

optimizer = AdamW(model.parameters()) 
loss = model(**batch).loss 
loss.backward() 
optimizer.step() 

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Downloading datasets

In [42]:
from datasets import load_dataset 

raw_datasets = load_dataset("glue", "mrpc") 
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [43]:
raw_train = raw_datasets["train"]
raw_train[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [44]:
raw_train.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [45]:
print(raw_train[14]) 
raw_val = raw_datasets["validation"] 
print(raw_val[86])

{'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .', 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .', 'label': 0, 'idx': 15}
{'sentence1': 'He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston .', 'sentence2': 'He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife .', 'label': 1, 'idx': 796}


### Data preprocessing

In [46]:
tokenized_sents1 = tokenizer(raw_datasets["train"]["sentence1"]) 
tokenized_sents2 = tokenizer(raw_datasets["train"]["sentence2"])

In [47]:
inputs = tokenizer("This is the first sentence.", "And this is the second.") 
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 1998, 2023, 2003, 1996, 2117, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

token_type_ids tell the model which tokens belong to the first sentence (0) and which belong to the second (1).

In [48]:
el15 = raw_train[14] 
el15_sent1 = el15["sentence1"] 
el15_sent2 = el15["sentence2"] 
s1_inputs = tokenizer(el15_sent1) 
print("Sentence 1 tokenized:", s1_inputs) 
s2_inputs = tokenizer(el15_sent2) 
print("Sentence 2 tokenized:", s2_inputs) 
el15_inputs = tokenizer(el15["sentence1"], el15["sentence2"]) 
print("Tokenized together:", el15_inputs)

Sentence 1 tokenized: {'input_ids': [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229, 5467, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Sentence 2 tokenized: {'input_ids': [101, 1996, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2056, 1996, 2873, 4062, 2018, 3478, 2000, 18235, 2094, 2417, 2644, 4597, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Tokenized together: {'input_ids': [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229, 5467, 1012, 102, 1996, 2132, 1997

The main difference when tokenizing the two sequences together is that token_type_ids keeps track of which token belongs to which sequence: the single-sentence encodings are all 0, while the paired encoding marks the second sentence with 1.
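
As a rough illustration of how these segment IDs line up with the special tokens, the sketch below rebuilds token_type_ids from the input_ids of the two-sentence example shown earlier. The helper `segment_ids` is hypothetical (it is not part of the tokenizer API), and it assumes that 102 is the [SEP] ID, as the convert_ids_to_tokens output in this notebook suggests.

```python
# Token IDs copied from the paired encoding of
# "This is the first sentence." / "And this is the second."
input_ids = [101, 2023, 2003, 1996, 2034, 6251, 1012, 102,
             1998, 2023, 2003, 1996, 2117, 1012, 102]
SEP_ID = 102  # assumption: [SEP] maps to 102 for this checkpoint


def segment_ids(ids, sep_id=SEP_ID):
    """Hypothetical helper: 0 up to and including the first [SEP], 1 afterwards."""
    first_sep = ids.index(sep_id)
    return [0 if i <= first_sep else 1 for i in range(len(ids))]


token_type_ids = segment_ids(input_ids)
print(token_type_ids)
```

The result should match the token_type_ids the tokenizer returned for the same pair: [CLS], the first sentence and its [SEP] get 0, and the second sentence with its closing [SEP] gets 1.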

In [49]:
# convert IDs back to words 
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'and',
 'this',
 'is',
 'the',
 'second',
 '.',
 '[SEP]']

In [50]:
# preprocess whole training set 
tokenized_trainset = tokenizer( 
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True
)

In [51]:
print(type(tokenized_trainset))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


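Problem: calling the tokenizer directly returns a dictionary-like BatchEncoding, not a Dataset, and it holds the whole tokenized training set in memory. To keep the data as a Dataset, we use the Dataset.map() method.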

In [52]:
# where example is a dict

def tokenize_function(example): 
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [53]:
# apply function to all datasets 
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True) 
tokenized_datasets

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
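
To make the effect of batched=True more concrete, here is a plain-Python stand-in for what the mapping step does conceptually: the function receives a dict of lists (one slice per column) and the columns it returns are added next to the existing ones. `map_batched` and `toy_tokenize` are hypothetical names for illustration only; this is not how the datasets library is implemented, and a whitespace split replaces the real tokenizer.

```python
def map_batched(columns, fn, batch_size):
    """Simplified stand-in for Dataset.map(fn, batched=True); not the library code."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # fn sees a dict of lists (a slice of every column), not a single example
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # existing columns are kept, returned columns are added
    return {**columns, **new_columns}


def toy_tokenize(batch):
    # stand-in for tokenize_function: count whitespace-separated words per pair
    return {"n_tokens": [len(a.split()) + len(b.split())
                         for a, b in zip(batch["sentence1"], batch["sentence2"])]}


toy = {
    "sentence1": ["a b c", "d e", "f"],
    "sentence2": ["g h", "i", "j k l"],
    "label": [1, 0, 1],
}
mapped = map_batched(toy, toy_tokenize, batch_size=2)
print(list(mapped))
print(mapped["n_tokens"])
```

This mirrors the output above, where input_ids, token_type_ids and attention_mask appear alongside the original sentence1, sentence2, label and idx columns.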

### Dynamic padding

In [54]:
# instantiate a data collator for dynamic padding 
from transformers import DataCollatorWithPadding 

data_collator = DataCollatorWithPadding(tokenizer=tokenizer) 

In [55]:
samples = tokenized_datasets["train"][:8] 
samples = { # take only the numerical values and ignore the index
    k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]
}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

Dynamic padding pads each batch separately. If the samples above formed a batch, we would only have to pad up to 67 (the longest sequence in that batch) instead of the longest sequence in the whole dataset. 

In [56]:
batch = data_collator(samples) 
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

We can see that the data_collator dynamically pads up to the length of the longest sequence in the batch: 67.
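
The padding step itself can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the DataCollatorWithPadding implementation: `pad_batch` is a hypothetical helper, the sequences are dummy IDs with the eight lengths printed above, and the [PAD] ID of 0 is an assumption about the bert-base-uncased vocabulary.

```python
lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # sequence lengths from the cell above
PAD_ID = 0  # assumption: [PAD] is ID 0 for this checkpoint


def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: right-pad every sequence to the longest one in the batch."""
    target = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# dummy token IDs: only the lengths matter for this sketch
sequences = [[1] * n for n in lengths]
input_ids, attention_mask = pad_batch(sequences)

print(len(input_ids), len(input_ids[0]))
pad_tokens = sum(row.count(0) for row in attention_mask)
print(pad_tokens)
```

Every row ends up with 67 entries, matching the torch.Size([8, 67]) shapes above, and the attention mask records which positions are padding.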

## Fine-tuning with Trainer

### Training

In [57]:
# define training arguments 
from transformers import TrainingArguments 

training_args = TrainingArguments("test-trainer") 

In [58]:
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, 
    num_labels=2
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [59]:
# instantiate trainer object 
from transformers import Trainer 

trainer = Trainer( 
    model, 
    training_args, 
    train_dataset=tokenized_datasets["train"], 
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator, 
    tokenizer=tokenizer
)

In [60]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377


  0%|          | 0/1377 [00:00<?, ?it/s]

Saving model checkpoint to test-trainer\checkpoint-500
Configuration saved in test-trainer\checkpoint-500\config.json


{'loss': 0.6323, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}


Model weights saved in test-trainer\checkpoint-500\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-500\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-500\special_tokens_map.json
Saving model checkpoint to test-trainer\checkpoint-1000
Configuration saved in test-trainer\checkpoint-1000\config.json


{'loss': 0.5365, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}


Model weights saved in test-trainer\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-1000\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-1000\special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 4072.8639, 'train_samples_per_second': 2.702, 'train_steps_per_second': 0.338, 'train_loss': 0.5321627792103532, 'epoch': 3.0}


TrainOutput(global_step=1377, training_loss=0.5321627792103532, metrics={'train_runtime': 4072.8639, 'train_samples_per_second': 2.702, 'train_steps_per_second': 0.338, 'train_loss': 0.5321627792103532, 'epoch': 3.0})

### Evaluation

In [61]:
predictions = trainer.predict(tokenized_datasets["validation"]) 
print(predictions.predictions.shape, predictions.label_ids.shape) 

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 408
  Batch size = 8


  0%|          | 0/51 [00:00<?, ?it/s]

(408, 2) (408,)


In [62]:
import numpy as np 
# predictions.predictions holds the logits; the predicted label is the index of the highest score in each row 
preds = np.argmax(predictions.predictions, axis=1) 
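
A small check of why taking the argmax of the raw logits is enough: softmax preserves the ordering within a row, so the index of the largest logit is also the index of the largest probability. The logits below are made up for illustration, and `softmax` and `argmax` are hypothetical pure-Python helpers rather than the NumPy calls used in the cell.

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities (shifted by the max for stability)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


logits = [[-1.2, 2.3], [0.8, -0.4], [0.1, 0.3]]  # made-up values, one row per example

from_logits = [argmax(row) for row in logits]
from_probs = [argmax(softmax(row)) for row in logits]
print(from_logits, from_probs)
```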

In [64]:
import evaluate 

# load the metrics associated with our dataset 
metric = evaluate.load("glue", "mrpc") 
metric.compute(predictions=preds, references=predictions.label_ids) 

In [None]:
# wrap in a function 
def compute_metrics(eval_preds): 
    metric = evaluate.load("glue", "mrpc") 
    logits, labels = eval_preds 
    predictions = np.argmax(logits, axis=-1) 
    return metric.compute(predictions=predictions, references=labels)
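
As far as I understand, the MRPC metric reports accuracy and F1. The sketch below computes both by hand on a handful of made-up predictions, to show what those two numbers measure. `accuracy_and_f1` is a hypothetical helper and not the evaluate implementation; it assumes label 1 ("equivalent") is the positive class.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Hypothetical helper: binary accuracy and F1 computed from scratch."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "f1": f1}


toy_preds = [1, 0, 1, 1, 0, 1]  # made-up predictions
toy_refs = [1, 0, 0, 1, 1, 1]   # made-up reference labels
scores = accuracy_and_f1(toy_preds, toy_refs)
print(scores)
```

Here four of six predictions are correct, and there are three true positives, one false positive and one false negative, so precision and recall are both 0.75.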

To train with the compute_metrics function: 

In [None]:
training_args = TrainingArguments( 
    "test-trainer", 
    evaluation_strategy="epoch"
)

trainer_eval = Trainer( 
    model, 
    training_args, 
    train_dataset=tokenized_datasets["train"], 
    eval_dataset=tokenized_datasets["validation"], 
    data_collator=data_collator, 
    tokenizer=tokenizer,
    compute_metrics=compute_metrics 
)

In [None]:
trainer_eval.train() 

## Full training

### Prepare for training

In [None]:
# remove columns that can't/shouldn't be used as features 
processed_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
# rename column to what model will expect 
processed_datasets = processed_datasets.rename_column("label", "labels") 
# set format to pt tensors 
processed_datasets.set_format("torch") 
# print column names 
processed_datasets["train"].column_names 

In [None]:
# define dataloaders 
from torch.utils.data import DataLoader 

train_dataloader = DataLoader( 
    processed_datasets["train"], 
    shuffle=True, 
    batch_size=8, 
    collate_fn=data_collator
)

eval_dataloader = DataLoader( 
    processed_datasets["validation"], 
    batch_size=8, 
    collate_fn=data_collator 
)

In [None]:
# verify preprocessing worked as expected 
for batch in train_dataloader: 
    break 
{k: v.shape for k, v in batch.items()}

In [None]:
# before starting training, pass one batch to the model to check that the outputs look as expected 
outputs = model(**batch) 
print(outputs.loss, outputs.logits.shape) 

In [None]:
# instantiate an optimizer 
from torch.optim import AdamW 

optimizer = AdamW(model.parameters(), lr=5e-5) 

In [None]:
from transformers import get_scheduler 

num_epochs = 3 
num_training_steps = num_epochs * len(train_dataloader) 
lr_scheduler = get_scheduler( 
    "linear", 
    optimizer=optimizer, 
    num_warmup_steps=0, 
    num_training_steps=num_training_steps
)

print(num_training_steps)
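
The numbers behind this schedule can be worked out by hand. The sketch below assumes the 3668 training examples and batch size of 8 used in this notebook, and writes the linear schedule as a plain function. `linear_lr` is a hypothetical helper that mirrors my understanding of the "linear" schedule (optional warmup, then linear decay to zero); it is not the library's code.

```python
import math

num_examples, batch_size, num_epochs = 3668, 8, 3
steps_per_epoch = math.ceil(num_examples / batch_size)  # what len(train_dataloader) returns
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, base_lr=5e-5, total=num_training_steps, warmup=0):
    """Hypothetical helper: linear warmup (if any) followed by linear decay to 0."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


print(steps_per_epoch, num_training_steps)
for step in (0, 500, 1000, num_training_steps):
    print(step, linear_lr(step))
```

With a base rate of 5e-5, the values at steps 500 and 1000 agree with the learning rates logged by the Trainer run earlier (about 3.18e-05 and 1.37e-05), and the total of 1377 steps matches its "Total optimization steps".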

### Training loop 

In [None]:
# define a device for training 
import torch 

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") 
model.to(device) 
device

In [None]:
from tqdm.auto import tqdm 

progress_bar = tqdm(range(num_training_steps)) 

model.train() 
for epoch in range(num_epochs): 
    for batch in train_dataloader: 
        batch = {k: v.to(device) for k, v in batch.items()} 
        outputs = model(**batch) 
        loss = outputs.loss 
        loss.backward() 

        optimizer.step() 
        lr_scheduler.step() 
        optimizer.zero_grad() 
        progress_bar.update(1) 

### Evaluation loop

In [None]:
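import evaluate 

# load the metric here so this cell does not depend on an earlier one 
metric = evaluate.load("glue", "mrpc") 
model.eval() 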
for batch in eval_dataloader: 
    batch = {k: v.to(device) for k, v in batch.items()} 
    with torch.no_grad(): 
        outputs = model(**batch) 
    
    logits = outputs.logits 
    predictions = torch.argmax(logits, dim=-1) 
    metric.add_batch(predictions=predictions, references=batch["labels"]) 

metric.compute() 

### Accelerate

In [None]:
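from accelerate import Accelerator 
from torch.optim import AdamW 
from tqdm.auto import tqdm 
from transformers import AutoModelForSequenceClassification, get_scheduler 

def training_function(): 
    accelerator = Accelerator() 

    # build the model and optimizer inside the function so each process gets its own copy 
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) 
    optimizer = AdamW(model.parameters(), lr=3e-5) 

    # use new local names for the prepared dataloaders: assigning to train_dataloader / 
    # eval_dataloader here would make them local and raise UnboundLocalError 
    train_dl, eval_dl, model, optimizer = accelerator.prepare( 
        train_dataloader, eval_dataloader, model, optimizer 
    )

    num_epochs = 3 
    num_training_steps = num_epochs * len(train_dl) 
    lr_scheduler = get_scheduler( 
        "linear", 
        optimizer=optimizer, 
        num_warmup_steps=0, 
        num_training_steps=num_training_steps
    )

    progress_bar = tqdm(range(num_training_steps)) 

    model.train() 
    for epoch in range(num_epochs): 
        for batch in train_dl: 
            # no .to(device) needed: the prepared dataloader places batches on the right device 
            outputs = model(**batch) 
            loss = outputs.loss 
            accelerator.backward(loss) 

            optimizer.step() 
            lr_scheduler.step() 
            optimizer.zero_grad() 
            progress_bar.update(1) 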

In [None]:
from accelerate import notebook_launcher 

notebook_launcher(training_function) 

# Sharing Models