# Transformers from Scratch

In [2]:
!pip install transformers

In [4]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig

## Tokenization

In [5]:
model_name = "bert-base-uncased"

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [16]:
texts = [
    "our country is getting better",
    "our country is famous of tea"
]
token_info = tokenizer(texts, padding=True, return_tensors="pt")
token_info

{'input_ids': tensor([[ 101, 2256, 2406, 2003, 2893, 2488,  102,    0],
        [ 101, 2256, 2406, 2003, 3297, 1997, 5572,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
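The first sentence is one token shorter than the second, so the tokenizer pads it with a trailing `0` and marks that position with `0` in `attention_mask`. The snippet below is a rough sketch of that padding step done by hand with plain PyTorch. It is not how the tokenizer is implemented internally; the IDs are copied from the output above, and the variable names are only illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Token IDs copied from the tokenizer output above, with the padding removed.
sequences = [
    torch.tensor([101, 2256, 2406, 2003, 2893, 2488, 102]),
    torch.tensor([101, 2256, 2406, 2003, 3297, 1997, 5572, 102]),
]
pad_token_id = 0  # the padding ID seen as the trailing 0 in input_ids

# Pad every sequence on the right up to the longest one in the batch.
input_ids = pad_sequence(sequences, batch_first=True, padding_value=pad_token_id)

# Build the mask from the true lengths: 1 for real tokens, 0 for padding.
lengths = torch.tensor([len(s) for s in sequences])
positions = torch.arange(input_ids.shape[1])
attention_mask = (positions[None, :] < lengths[:, None]).long()

print(input_ids)
print(attention_mask)
```

The mask is what lets later layers ignore the padded position instead of treating it as a real token.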

## Making Embeddings

In [14]:
config = AutoConfig.from_pretrained(model_name)
config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [15]:
emb_gen = torch.nn.Embedding(num_embeddings=config.vocab_size, embedding_dim=config.hidden_size)

In [21]:
token_info.input_ids.shape, emb_gen(token_info.input_ids).shape

(torch.Size([2, 8]), torch.Size([2, 8, 768]))

**That is what an embedding layer does: it maps every token ID to a vector of size `hidden_size`.**

Without embeddings, we would have to represent each token with a one-hot vector. Each vector would then be as wide as `vocab_size` (30522 here, instead of 768) and almost entirely zeros, which is hard to work with and inefficient.
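The comparison with one-hot encoding can be made concrete: looking up a row in an embedding table gives the same result as multiplying a one-hot vector by the table's weight matrix, only without building the wide vector. Here is a minimal sketch with toy sizes (a vocabulary of 10 and a hidden size of 4 are assumptions chosen to keep the tensors small).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes so the one-hot tensor stays small; BERT uses 30522 and 768.
vocab_size, hidden_size = 10, 4
emb = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=hidden_size)

token_ids = torch.tensor([[1, 5, 2], [7, 0, 3]])  # shape (batch=2, seq_len=3)

# Route 1: direct lookup, one row of the weight matrix per token ID.
looked_up = emb(token_ids)

# Route 2: one-hot vectors of width vocab_size, multiplied by the same weights.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
multiplied = one_hot @ emb.weight

print(one_hot.shape)      # torch.Size([2, 3, 10])
print(looked_up.shape)    # torch.Size([2, 3, 4])
print(torch.allclose(looked_up, multiplied))  # True
```

Both routes produce the same vectors, but the one-hot route has to materialise a tensor whose last dimension grows with the vocabulary.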

### Getting the Pretrained Embeddings

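Earlier, we used `nn.Embedding`, but its weights are randomly initialized and have not been trained as part of a language model. To get trained embeddings, we load the pretrained model and use its embedding layer.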

In [30]:
model = AutoModel.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [39]:
model.embeddings.word_embeddings(token_info.input_ids).shape

torch.Size([2, 8, 768])
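The output shape is the same as with the randomly initialized layer; what differs is the values stored in the table. As a small illustration, the sketch below loads a hand-written weight matrix with `torch.nn.Embedding.from_pretrained`. The matrix is a hypothetical stand-in for trained weights, not BERT's actual values.

```python
import torch

# Hypothetical stand-in for trained weights: 5 tokens, 3 dimensions each.
trained_weights = torch.tensor([
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
])

# A fresh table: values are random and trainable.
random_emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)

# A table built from existing weights; freeze=True keeps them fixed in training.
loaded_emb = torch.nn.Embedding.from_pretrained(trained_weights, freeze=True)

token_ids = torch.tensor([[1, 3]])
vectors = loaded_emb(token_ids)

print(vectors.shape)                                # torch.Size([1, 2, 3])
print(torch.equal(vectors[0, 0], trained_weights[1]))  # True
print(random_emb.weight.requires_grad)              # True
print(loaded_emb.weight.requires_grad)              # False
```

Passing `freeze=False` instead would leave the loaded weights trainable, which is the usual choice when fine-tuning.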