# [HuggingFace NLP Course](https://huggingface.co/learn/nlp-course)

This notebook documents some of my experiments done while completing this HuggingFace course.

## [2. Using Transformers](https://huggingface.co/learn/nlp-course/chapter2)

### [Behind the pipeline](https://huggingface.co/learn/nlp-course/chapter2/2)

> "All 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy."

Note: AutoModel means "an object that returns the correct architecture based on the checkpoint."
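
To make the quote about fused losses concrete, here is a minimal pure-Python sketch (not the library's implementation) comparing a two-step "softmax, then negative log" loss with a single log-sum-exp computation on the logits. The function names and the example logits are made up for illustration; numerical stability is one common reason for the fusion.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_unfused(logits, target):
    """Two steps: softmax first, then the negative log of the target probability."""
    return -math.log(softmax(logits)[target])

def cross_entropy_fused(logits, target):
    """One step on the logits: log-sum-exp minus the target logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

logits = [2.0, -1.0, 0.5]  # made-up values for illustration
target = 0
print(cross_entropy_unfused(logits, target))
print(cross_entropy_fused(logits, target))

# With extreme logits the target probability underflows to 0.0,
# so the two-step version hits log(0) while the fused one still works.
extreme = [1000.0, 0.0]
try:
    cross_entropy_unfused(extreme, 1)
except ValueError:
    print("unfused version failed on log(0)")
print(cross_entropy_fused(extreme, 1))  # 1000.0
```

Both versions agree on ordinary inputs; only the fused form survives the extreme case, which is why it is convenient for the model to hand back logits rather than probabilities.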

In [1]:
from transformers import pipeline

print('\nfill-mask:')
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])


print('\nsentiment analysis:')
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)



fill-mask:


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

sentiment analysis:


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [25]:
from transformers import AutoTokenizer, AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

## encode input sentences to input_ids
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "the dog is sitting on the ground wagging its tail",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# note the attention mask indicates where padding was applied in each sentence
print("tokenizer inputs:")
print(inputs)
print(inputs.input_ids.shape)
print(f"key: {tokenizer.convert_ids_to_tokens(inputs.input_ids[0])}")


## get the feature vectors for each input sequence
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print("\nfeature vector outputs:")
print(outputs.last_hidden_state.shape)


## let's classify the sentiment of each sentence
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# passes the input_ids and corresponding attention_mask to the model
sentiment_outputs = model(**inputs)
print("\nsentiment outputs:")
print(sentiment_outputs.logits.shape)
print("sentence 0 raw logits:")
print(sentiment_outputs.logits[0, :])

print("\npreds:")
preds = torch.nn.functional.softmax(sentiment_outputs.logits, dim=-1)
print(preds)

# print final sentiment predictions
key = model.config.id2label 
for i in range(len(raw_inputs)):
    print(f"'{raw_inputs[i]}' => {key[0]}: {preds[i][0]:.3f}, {key[1]}: {preds[i][1]:.3f}")

tokenizer inputs:
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  1996,  3899,  2003,  3564,  2006,  1996,  2598, 11333, 12588,
          2049,  5725,   102,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
torch.Size([3, 16])
key: ['[CLS]', 'i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.', '[SEP]']

feature vector outputs:
torch.Size([3, 16, 768])

sentiment outputs:
torch.Size([3, 2])
sentence 0 raw logits:
tensor([-1.5607,  1.6123], grad_fn=<SliceBackward0>)

preds:
tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5

In [3]:
# equivalent to the above!
classifier = pipeline("sentiment-analysis", model=checkpoint, tokenizer=checkpoint)
# can be loaded equivalently with:
#classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier(raw_inputs)

[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'NEGATIVE', 'score': 0.9941940903663635}]
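
As a sanity check on the equivalence, the sketch below reproduces the pipeline's `label`/`score` entry for sentence 0 by hand, using only the standard library. The logits are copied from the raw-logits output earlier in this section; the `id2label` dictionary is an assumption about what `model.config.id2label` holds for this checkpoint.

```python
import math

# Raw logits for sentence 0, copied from the earlier output
logits = [-1.5607, 1.6123]
# Assumed to mirror model.config.id2label for this checkpoint
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(values):
    """Turn logits into probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Pick the index with the highest probability, like an argmax
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(result)  # label is POSITIVE, score is close to 0.9598
```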

### [Models](https://huggingface.co/learn/nlp-course/chapter2/3)
> How to load and instantiate models

In [5]:
from transformers import BertConfig, BertModel

# config is like a recipe for how to build the model (e.g. define feature size, num layers etc)
config = BertConfig()
print("default BertConfig:")
print(config)

# build model from config (weights are randomly initialized, not pretrained)
model = BertModel(config)
print("\nmodel architecture:")
print(model)

# save a model to disk; this writes:
# my-bert-model/
# ├── config.json
# └── model.safetensors
model.save_pretrained("/tmp/models/my-bert-model")
# now load it back from disk (from_pretrained is a classmethod, so call it on the class)
model = BertModel.from_pretrained("/tmp/models/my-bert-model", local_files_only=True)

# loading pretrained weights from the Hub: https://huggingface.co/google-bert/bert-base-uncased
model = BertModel.from_pretrained("bert-base-uncased")


default BertConfig:
BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.37.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}


model architecture:
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(


### [Tokenizers](https://huggingface.co/learn/nlp-course/chapter2/4)

Interesting mention of alternative techniques such as **"SentencePiece or Unigram, as used in several multilingual models"**.

Encoding has two steps:
1. Tokenization (split input text into tokens) e.g. by subword method
2. Convert tokens into input IDs (using the tokenizer's vocabulary), which are then packed into tensors

In [23]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print("tokens:\n", tokens)

# convert token strings to input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print("ids:\n", ids)
print(f"decoded: '{tokenizer.decode(ids)}'")


# directly reading the vocabulary (dictionary)
vocab = tokenizer.get_vocab()
print(f"\nvocab type: {type(vocab)}, len: {len(vocab)}")
for k, v in vocab.items():
    print(f"example vocab: '{k}' => {v}")
    break

# invert vocabulary to map token ids (int) -> token strings (str)
# (use token_id rather than id to avoid shadowing the builtin)
id_to_token = {token_id: token for token, token_id in vocab.items()}
my_decoded = [id_to_token[token_id] for token_id in ids]
print("my_decoded:\n", my_decoded)
#print(vocab['##former']) # returns 23763

tokens:
 ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
ids:
 [7993, 170, 13809, 23763, 2443, 1110, 3014]
decoded: 'Using a Transformer network is simple'

vocab type: <class 'dict'>, len: 28996
example vocab: 'shy' => 12076
my_decoded:
 ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
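
The two encoding steps can be mimicked with a toy example. The sketch below is a simplified, WordPiece-style greedy longest-match splitter, not the tokenizer's actual implementation: it skips normalization and punctuation handling, and `toy_vocab`, `wordpiece_tokenize`, and `tokens_to_ids` are hypothetical names. The token IDs are copied from the output above; the `[UNK]` ID is an assumption.

```python
# Toy vocabulary: tokens and IDs copied from the bert-base-cased output above
toy_vocab = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
    "[UNK]": 100,  # assumed ID for the unknown token
}

def wordpiece_tokenize(text, vocab, unk="[UNK]"):
    """Step 1: split each whitespace word into the longest known subwords."""
    tokens = []
    for word in text.split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation marker
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:  # no known subword: give up on the whole word
                pieces = [unk]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

def tokens_to_ids(tokens, vocab, unk="[UNK]"):
    """Step 2: look each token up in the vocabulary."""
    return [vocab.get(token, vocab[unk]) for token in tokens]

tokens = wordpiece_tokenize("Using a Transformer network is simple", toy_vocab)
ids = tokens_to_ids(tokens, toy_vocab)
print(tokens)  # ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
print(ids)     # [7993, 170, 13809, 23763, 2443, 1110, 3014]
```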


### [Handling Multiple Sequences](https://huggingface.co/learn/nlp-course/chapter2/5)
Covers how to tokenize batches of text: sequences must be padded to the length of the longest one in the batch, and attention masks tell the attention layers to ignore the padding tokens.
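
A minimal sketch of the padding idea, in plain Python rather than the tokenizer's own code: pad every sequence on the right to the longest length and mark real tokens with 1 and padding with 0. The helper name `pad_batch` and the token IDs are made up for illustration.

```python
def pad_batch(batch_ids, pad_token_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch_ids)
    input_ids, attention_mask = [], []
    for seq in batch_ids:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for two sequences of different lengths
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```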

The chapter also references two transformer models designed specifically for longer sequences:
* [Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)
* [LED](https://huggingface.co/docs/transformers/model_doc/led)

In [34]:
print(f"pad_token_id = {tokenizer.pad_token_id}, pad_token = '{tokenizer.pad_token}'")

# more info: https://huggingface.co/docs/transformers/pad_truncation
# when tokenizing we can specify a max length and a padding strategy
# padding="longest" | "max_length" | "do_not_pad"
# (padding is not set here: truncating every sequence to max_length=4 already gives equal lengths)
# we also request the results as PyTorch tensors instead of lists
model_inputs = tokenizer(raw_inputs, max_length=4, truncation=True, return_tensors="pt")
print(model_inputs)

pad_token_id = 0, pad_token = '[PAD]'
{'input_ids': tensor([[ 101, 1045, 1005,  102],
        [ 101, 1045, 5223,  102],
        [ 101, 1996, 3899,  102]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]])}
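
Note that truncation keeps the special tokens: each row above still starts with 101 (`[CLS]`) and ends with 102 (`[SEP]`). The sketch below imitates that behaviour in plain Python for a single sequence; it is a simplified illustration, not the tokenizer's actual truncation logic, and `truncate_with_special_tokens` is a hypothetical helper.

```python
def truncate_with_special_tokens(body_ids, max_length, cls_id=101, sep_id=102):
    """Keep as many body tokens as fit once [CLS] and [SEP] are accounted for."""
    kept = body_ids[: max_length - 2]
    return [cls_id] + kept + [sep_id]

# Token IDs of "I hate this so much!" without special tokens,
# copied from the tokenizer output earlier in this notebook
body = [1045, 5223, 2023, 2061, 2172, 999]
truncated = truncate_with_special_tokens(body, max_length=4)
print(truncated)  # [101, 1045, 5223, 102]
```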


## [3. Fine-Tuning a Pretrained Model](https://huggingface.co/learn/nlp-course/chapter3/1)

### [Processing the Data](https://huggingface.co/learn/nlp-course/chapter3/2)