# Developing a Language Model using GPT and Transformers

### Transformer-based Generative Language Model
**Transformers** are deep learning models built around a mechanism known as self-attention. **Self-attention** lets the model weigh how relevant every other token in a sequence is when it computes the representation of a given token, loosely mimicking cognitive attention. Like most neural networks, these models are generally trained by **gradient descent**.


### Transformer Architecture:

- **Encoder-Decoder**: The **encoder** comprises a stack of encoding layers that process the input iteratively, generating encodings that capture which parts of the input are relevant to each other. The **decoder** comprises a stack of decoding layers that do the opposite: they take the encodings and generate an output sequence.

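- **Scaled Dot-product Attention**: The main building blocks of a transformer are scaled dot-product attention units. For each token, a unit compares that token's query vector with the key vectors of all tokens in the context, divides the dot products by $\sqrt{d_k}$, applies a softmax, and uses the resulting weights to average the value vectors: $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$. The result is a context-aware embedding for every token in the context.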

- **Multi-Head Attention**: A single set of matrices $(W_Q, W_K, W_V)$ is called an attention head, and each layer in a transformer model has multiple attention heads. Each head attends to the tokens that are relevant to a given token; having multiple heads allows the model to do this for different definitions of relevance.
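
To make the attention computation more concrete, here is a minimal sketch of a single scaled dot-product attention unit in plain Python. It is an illustration only, not how GPT-2 is implemented: the helper names `softmax` and `scaled_dot_product_attention` and the toy vectors are assumptions made for this example, the causal masking used by GPT-style models is omitted, and a real attention head would first obtain the queries, keys, and values by multiplying the token embeddings with $W_Q$, $W_K$, and $W_V$.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one context-aware vector per query.

    queries, keys: lists of vectors of length d_k
    values: list of vectors, one per key
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


# Toy example: three tokens with 2-dimensional vectors (made-up numbers)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = scaled_dot_product_attention(x, x, x)
print(len(context), len(context[0]))  # one 2-dimensional vector per token
```

A multi-head layer would run several such units side by side, each with its own projection matrices, and concatenate their outputs.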

### Let us go ahead and install `pytorch-transformers` (since renamed `transformers`) and import the relevant libraries:

In [3]:
!pip install pytorch-transformers --quiet


In [4]:
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

### We will need a tokenizer (vocabulary)

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


In [6]:
text = "What is the fastest car in the"
# Encode the prompt into a list of GPT-2 vocabulary indices
indexed_tokens = tokenizer.encode(text)

In [7]:
# Wrap the indices in a list to add a batch dimension: shape (1, sequence_length)
tokens_tensor = torch.tensor([indexed_tokens])

In [8]:
model = GPT2LMHeadModel.from_pretrained('gpt2')



In [9]:
# Switch to evaluation mode so that dropout is disabled
model.eval()


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

In [11]:
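# Disable gradient tracking, since we are only running inference
with torch.no_grad():
    outputs = model(tokens_tensor)
    # The first element holds the logits: (batch, sequence_length, vocab_size)
    predictions = outputs[0]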

In [12]:
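# Take the scores at the last position and pick the most likely next token
predicted_index = torch.argmax(predictions[0, -1, :]).item()
# Decode the prompt plus the predicted token back into text
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])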

In [14]:
print(predicted_text)

 What is the fastest car in the world
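
The cells above predict a single next token by taking the `argmax` over the scores at the last position. Repeating that step gives greedy decoding. The sketch below shows the loop in plain Python under some assumptions: `greedy_decode`, `argmax`, and `toy_logits` are hypothetical names introduced here, and `toy_logits` is a stand-in for the real model so that the example runs without downloading any weights.

```python
def argmax(values):
    """Index of the largest value (ties go to the first occurrence)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_decode(next_token_logits, token_ids, steps):
    """Append `steps` tokens, always choosing the highest-scoring one.

    next_token_logits: hypothetical callable that maps a list of token ids
    to a list of scores, one per vocabulary entry.
    """
    token_ids = list(token_ids)  # copy so the caller's list is untouched
    for _ in range(steps):
        logits = next_token_logits(token_ids)
        token_ids.append(argmax(logits))
    return token_ids


# Stand-in "model" over a 5-token vocabulary: it always favours the id that
# follows the last token, wrapping around. Purely illustrative.
def toy_logits(token_ids):
    favoured = (token_ids[-1] + 1) % 5
    return [1.0 if i == favoured else 0.0 for i in range(5)]


print(greedy_decode(toy_logits, [0, 1], steps=4))  # → [0, 1, 2, 3, 4, 0]
```

With the notebook's objects, the callable would presumably wrap the model call shown earlier, for example by returning `model(torch.tensor([ids]))[0][0, -1, :].tolist()` inside `torch.no_grad()`, and the resulting ids would be passed to `tokenizer.decode`. Greedy decoding is the simplest strategy; it always commits to the single most likely token at each step.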
