Following the Hugging Face course: https://huggingface.co/course/
# 1. Transformer Models

In [1]:
from transformers import pipeline

# to start with, we will classify a single sentence
classifier = pipeline("sentiment-analysis")
print(classifier("I watched a good movie yesterday"))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9990124702453613}]


In [2]:
# classify multiple sentences
classifier(['that was a good movie', 'He is unwell'])

[{'label': 'POSITIVE', 'score': 0.9998570680618286},
 {'label': 'NEGATIVE', 'score': 0.9989678859710693}]

In [3]:
# zero shot classification
# we haven't trained the model on the labels we are using
zero_shot_classifier = pipeline("zero-shot-classification")
zero_shot_classifier("this is a very interesting course on algebra", candidate_labels=["mathematics", "physics", "chemistry", "biology"])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'this is a very interesting course on algebra',
 'labels': ['mathematics', 'biology', 'physics', 'chemistry'],
 'scores': [0.9866520166397095,
  0.004770115949213505,
  0.004415604285895824,
  0.004162236116826534]}

In [4]:
# text generation
generator = pipeline('text-generation')
generator("newtons first law states that")


No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'newtons first law states that you need to have a job or to have a college degree. The University of Michigan provides job training, job placement, and educational guidance for job seekers in an effort to provide a strong applicant pool. If you or your'}]

In [5]:
# text generation by specifying a model
generator = pipeline('text-generation', model='distilgpt2')
generator('in this course, we will teach you how to', max_length=30, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'in this course, we will teach you how to set up a new, beautiful, and simple, but more advanced approach in the curriculum.'},
 {'generated_text': 'in this course, we will teach you how to read, as well as how you can use an app that can easily convert to HTML with JavaScript,'},
 {'generated_text': 'in this course, we will teach you how to learn how to teach a common form of learning:\n\n\n\n\n\nHow to teach a'}]

In [6]:
# Fill Mask
unmasker = pipeline('fill-mask')
unmasker("sun rises in the <mask>", top_k=3)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.296585351228714,
  'token': 6360,
  'token_str': ' sky',
  'sequence': 'sun rises in the sky'},
 {'score': 0.06161285191774368,
  'token': 12351,
  'token_str': ' Arctic',
  'sequence': 'sun rises in the Arctic'},
 {'score': 0.0532655231654644,
  'token': 3778,
  'token_str': ' sun',
  'sequence': 'sun rises in the sun'}]

In [7]:
# named entity recognition
ner = pipeline('ner', grouped_entities=True)
ner("My name is Kiran and I work at Amazon")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9954788,
  'word': 'Kiran',
  'start': 11,
  'end': 16},
 {'entity_group': 'ORG',
  'score': 0.99733573,
  'word': 'Amazon',
  'start': 31,
  'end': 37}]

In [8]:
# question answering
question_answerer = pipeline('question-answering')
question_answerer(question="What is the capital of India?", context="India is a country in South Asia. Its capital is New Delhi")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9939645528793335, 'start': 49, 'end': 58, 'answer': 'New Delhi'}
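
The `start` and `end` values in the question-answering and NER outputs above appear to be character offsets into the input text, with `end` exclusive. A quick sanity check in plain Python, using only the strings and offsets printed above (no model is loaded here):

```python
# Strings and offsets copied from the pipeline outputs above.
context = "India is a country in South Asia. Its capital is New Delhi"
qa_result = {"start": 49, "end": 58, "answer": "New Delhi"}

# Slicing the context with the returned offsets recovers the answer span.
qa_span = context[qa_result["start"]:qa_result["end"]]
print(qa_span)

sentence = "My name is Kiran and I work at Amazon"
entities = [
    {"entity_group": "PER", "word": "Kiran", "start": 11, "end": 16},
    {"entity_group": "ORG", "word": "Amazon", "start": 31, "end": 37},
]

# The same slicing works for each grouped entity.
ner_spans = [sentence[e["start"]:e["end"]] for e in entities]
print(ner_spans)
```

Slicing with the offsets is handy when the surrounding text needs to be highlighted or when the exact original casing and spacing must be preserved.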

In [9]:
# summarizer = pipeline('summarization')
# summarizer('a stitch in time saves nine', min_length=100, max_length=200)

In [10]:
# !pip install sentencepiece
from transformers import pipeline
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en')
translator('Bonjour, comment allez-vous?')


[{'translation_text': 'Hello, how are you?'}]

## Encoder and Decoder models
In encoder models, the attention layers can access all the words in the input sequence. Pretraining usually involves corrupting the input sequence (for example, by masking words) and training the model to predict the original sequence. These models are best suited for tasks that require an understanding of the entire sequence, such as sentence classification.

Decoder models use only the decoder of a Transformer model. At each stage, the attention layers can access only the words positioned before the current word in the sentence. These models are called auto-regressive models.

Sequence-to-sequence models use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can access only the words positioned before a given word in the input.
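
The two attention patterns can be pictured as masks over token positions. The sketch below is plain Python, not the library's implementation, and `full_mask` and `causal_mask` are illustrative names. Row `i` lists which positions token `i` may attend to (1) or not (0). It assumes, as is common in practice, that a token in a decoder may attend to itself as well as to earlier positions.

```python
def full_mask(n):
    """Encoder-style mask: every position can attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i can attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["sun", "rises", "in", "the", "east"]

print("encoder (bidirectional):")
for row in full_mask(len(tokens)):
    print(row)

print("decoder (causal):")
for row in causal_mask(len(tokens)):
    print(row)
```

The causal mask is lower triangular, which is what prevents an auto-regressive model from looking at the words it is about to predict.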

## Loading a model

In [11]:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
from transformers import AutoTokenizer
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

In [21]:
# if task is sequence classification
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
print(f'outputs - {outputs.logits}')

# to get probabilities, we need to pass the logits through a softmax
import torch
probs = torch.softmax(outputs.logits, dim=-1)
print(f'probabilities - {probs}')

print(model.config.id2label)

torch.Size([2, 2])
outputs - tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)
probabilities - tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
{0: 'NEGATIVE', 1: 'POSITIVE'}
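
To make the logits-to-probabilities step concrete, the softmax can be reproduced by hand. This is a minimal sketch in plain Python, using the logits and the `id2label` mapping printed above; the `softmax` helper is written here for illustration and is not the `torch.softmax` implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits and label mapping copied from the output above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = [softmax(row) for row in logits]

# The predicted label is the index of the largest probability in each row.
labels = [id2label[row.index(max(row))] for row in probs]

print(probs)
print(labels)
```

Each row of probabilities sums to 1, and the label with the larger logit always wins, so the softmax changes the scale of the scores but not the prediction.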


### Models



In [23]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.23.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
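
One way to picture this principle is a greedy longest-match split in the style of WordPiece. The sketch below is a toy under stated assumptions: `wordpiece_tokenize` and `toy_vocab` are made up for illustration, the vocabulary is tiny and hand-picked, and real tokenizers add normalization, pre-tokenization and a learned vocabulary on top of this.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest subwords found in vocab.

    Pieces that do not start the word carry a '##' prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary: a frequent word is stored whole,
# a rare word is only covered by smaller pieces.
toy_vocab = {"medicine", "album", "a", "##rsen", "##ic"}

print(wordpiece_tokenize("medicine", toy_vocab))
print(wordpiece_tokenize("arsenic", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Frequent words stay intact because they are vocabulary entries themselves, while rare words are assembled from pieces that the vocabulary does contain.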

### Tokenizers


In [42]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text = "The medicine is arsenic album, ars albaam, allium sepa"
input_ids = tokenizer(text, return_tensors="pt").input_ids
print(input_ids[0])

# create the text from input ids
print(tokenizer.decode(input_ids[0]))

# print each token
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
print(tokens)


# print the vocab size
print(tokenizer.vocab_size)

tensor([  101,  1109,  5182,  1110,   170, 22972,  1596,  1312,   117,   170,
         1733,  2393,  2822,  2312,   117,  1155,  3656, 14516,  4163,   102])
[CLS] The medicine is arsenic album, ars albaam, allium sepa [SEP]
['[CLS]', 'The', 'medicine', 'is', 'a', '##rsen', '##ic', 'album', ',', 'a', '##rs', 'al', '##ba', '##am', ',', 'all', '##ium', 'se', '##pa', '[SEP]']
28996


For words that are not in its vocabulary, the tokenizer falls back to subwords that are not always meaningful. In the output above, "arsenic" is split into "a", "##rsen", "##ic" and "albaam" into "al", "##ba", "##am". Similarly, the word "huggingface" is split into "hug", "##ging", "##face", because the vocabulary the tokenizer was trained with does not contain the word "huggingface".

There is also a simpler way to get the tokens.

In [46]:
tokens = tokenizer.tokenize(text)
print(tokens)

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

['The', 'medicine', 'is', 'a', '##rsen', '##ic', 'album', ',', 'a', '##rs', 'al', '##ba', '##am', ',', 'all', '##ium', 'se', '##pa']
[1109, 5182, 1110, 170, 22972, 1596, 1312, 117, 170, 1733, 2393, 2822, 2312, 117, 1155, 3656, 14516, 4163]


Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
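
As a rough illustration of how such a mask lines up with padding, the sketch below pads two ID sequences by hand. `pad_batch` is a hypothetical helper, not a library function. The IDs are copied from the tokenizer output shown earlier in these notes, where the padding ID is 0.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad ID lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Unpadded IDs for the two example sentences used earlier.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

input_ids, attention_mask = pad_batch(sequences)
print(input_ids[1])
print(attention_mask[1])
```

Without the mask, the padding tokens would take part in attention and the shorter sentence could get different outputs depending on how much padding its batch happened to need.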

## 3.2 Processing the data


In [None]:
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    'I have been waiting for a Hugging Face course my whole life.',
    'This course is amazing'
]
# pad to the longest sequence, truncate to the model's maximum length, return PyTorch tensors
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors='pt')