## Resources
- Tiktokenizer
- Excalidraw
- Andrej Karpathy (YouTube): https://www.youtube.com/watch?v=7xTGNNLPyMI
- [Quantization](https://huggingface.co/blog/4bit-transformers-bitsandbytes)

### My Notes
- This notebook uses Facebook's Open Pre-trained Transformer (OPT)
- The `tokenizer` object loads the vocabulary required to interact with the model and converts the sample input (the `inp` variable) into token IDs and an attention mask
- The **attention mask** is a vector that tells the model which tokens to attend to (1) and which to ignore (0, typically padding). It is all 1s in this example, meaning nothing is masked out
- We use `print(OPT_model.model)` to inspect the model's architecture. You'll notice `OPTDecoder` in the output, reflecting that OPT is a **decoder-only** model
- `return_tensors="pt"` in our `tokenizer()` call means we want the tokens returned as PyTorch tensors rather than plain Python lists (the default)
- Use `nvidia-smi` to check your VRAM usage
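
To make the attention-mask note concrete, here is a minimal sketch of how a mask lines up with padding when two sequences of different lengths share a batch. The padding is done by hand rather than by the tokenizer. The pad ID of 1 is an assumption taken from the `padding_idx=1` shown in the embedding layer further down. The first row reuses the token IDs printed in this notebook, and the second row is just a shortened copy chosen for illustration.

```python
import torch

PAD_ID = 1  # assumption: matches padding_idx=1 in the printed embedding layer

# Two token-ID sequences of different lengths (the second is illustrative only).
sequences = [
    [2, 133, 2119, 6219, 23602, 13855, 81, 5, 22414, 2335],
    [2, 133, 2119, 6219],
]

max_len = max(len(seq) for seq in sequences)

# Right-pad every sequence to the same length so they fit in one tensor.
input_ids = torch.tensor(
    [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]
)

# 1 marks a real token, 0 marks padding that the model should ignore.
attention_mask = torch.tensor(
    [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
)

print(input_ids)
print(attention_mask)
```

The first row of the mask is all 1s, as in this notebook's single-sentence example; only the padded row contains 0s.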

In [3]:
!pip install transformers accelerate bitsandbytes



In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets import ImageFolder

#import timm

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


model_name = "facebook/opt-1.3b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Define the 4-bit quantization config: this shrinks the model's memory footprint so it fits on consumer GPUs
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
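
What `load_in_4bit=True` buys us can be illustrated with a toy example. The sketch below is a simplified blockwise absmax scheme written for these notes, not the 4-bit formats that bitsandbytes actually implements. The function names and the block size of 64 are assumptions chosen for illustration.

```python
import torch


def quantize_absmax_4bit(weights: torch.Tensor, block_size: int = 64):
    """Toy blockwise absmax quantization to signed integer codes in [-7, 7]."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block: the largest absolute value in that block.
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    # Codes are held in int8 here; a real 4-bit format packs two codes per byte.
    codes = torch.round(blocks / scales * 7).to(torch.int8)
    return codes, scales


def dequantize_absmax_4bit(codes: torch.Tensor, scales: torch.Tensor, shape):
    """Map the integer codes back to approximate floating-point weights."""
    return (codes.to(torch.float32) / 7 * scales).reshape(shape)


torch.manual_seed(0)
w = torch.randn(128, 64)  # stand-in for a weight matrix

codes, scales = quantize_absmax_4bit(w)
w_hat = dequantize_absmax_4bit(codes, scales, w.shape)

max_err = (w - w_hat).abs().max().item()
print("distinct code values:", codes.unique().numel())
print("max abs reconstruction error:", max_err)
```

Each weight is stored as one of at most 15 integer codes plus a shared per-block scale, which is where the memory saving comes from. The price is a small reconstruction error.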

### Load the model

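Note that `.to(device)` is applied to the model and to the tokenizer's *output*, not to the tokenizer itself, which is a plain Python object and does not live on a device. Depending on the installed `transformers` and `bitsandbytes` versions, a 4-bit model may already be placed on the GPU by `from_pretrained`, and some versions reject an explicit `.to()` call on quantized models.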

In [5]:
torch.cuda.empty_cache()

OPT_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, low_cpu_mem_usage=True)
OPT_model.to(device) # Model is on the GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [6]:
inp = "The quick brown fox jumps over the lazy dog"
inp_tokenized = tokenizer( inp, return_tensors="pt" ).to(device) # Move the tokenized tensors to `device`; "pt" requests PyTorch tensors

print( inp_tokenized['input_ids'].size() )
print( inp_tokenized )

torch.Size([1, 10])
{'input_ids': tensor([[    2,   133,  2119,  6219, 23602, 13855,    81,     5, 22414,  2335]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}


## Analyzing the Model
- This is a decoder-only model, a common architecture for transformer-based language models. The `decoder` attribute gives us access to its inner workings.
- The `layers` attribute reveals that the decoder comprises 24 stacked layers of identical design.

In [7]:
print( OPT_model.model )

OPTModel(
  (decoder): OPTDecoder(
    (embed_tokens): Embedding(50272, 2048, padding_idx=1)
    (embed_positions): OPTLearnedPositionalEmbedding(2050, 2048)
    (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (layers): ModuleList(
      (0-23): 24 x OPTDecoderLayer(
        (self_attn): OPTSdpaAttention(
          (k_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (out_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
        )
        (activation_fn): ReLU()
        (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear4bit(in_features=2048, out_features=8192, bias=True)
        (fc2): Linear4bit(in_features=8192, out_features=2048, bias=True)
        (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwi
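
The shapes in the printout are enough for a back-of-the-envelope parameter count. This is a rough sketch that assumes the model contains nothing beyond the modules visible above (the printout is cut off) and that every LayerNorm carries a scale and a shift vector. The helper names are made up for illustration.

```python
# Shapes read off the printed architecture.
vocab_size, hidden, ffn, n_layers, n_positions = 50272, 2048, 8192, 24, 2050


def linear_params(in_features: int, out_features: int) -> int:
    """Weight matrix plus bias vector of a linear layer."""
    return in_features * out_features + out_features


def layer_norm_params(dim: int) -> int:
    """Scale and shift vectors of an affine LayerNorm."""
    return 2 * dim


# embed_tokens + embed_positions
embeddings = vocab_size * hidden + n_positions * hidden

# One decoder layer: k/v/q/out projections, fc1 + fc2, and two LayerNorms.
attention = 4 * linear_params(hidden, hidden)
feed_forward = linear_params(hidden, ffn) + linear_params(ffn, hidden)
norms = 2 * layer_norm_params(hidden)
per_layer = attention + feed_forward + norms

# All layers plus the decoder-level final_layer_norm.
total = embeddings + n_layers * per_layer + layer_norm_params(hidden)

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
print(f"16-bit weights: ~{total * 2 / 1e9:.2f} GB")
```

The total comes out at roughly 1.3 billion parameters, in line with the `opt-1.3b` name, or about 2.6 GB if every parameter were stored in 16 bits.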

In [9]:
embedded_input = OPT_model.model.decoder.embed_tokens(inp_tokenized['input_ids'])
print( "Layer:\t", OPT_model.model.decoder.embed_tokens )
print( "Size:\t", embedded_input.size() )
print( "Output:\t", embedded_input )

Layer:	 Embedding(50272, 2048, padding_idx=1)
Size:	 torch.Size([1, 10, 2048])
Output:	 tensor([[[-0.0407,  0.0519,  0.0574,  ..., -0.0263, -0.0355, -0.0260],
         [-0.0371,  0.0220, -0.0096,  ...,  0.0265, -0.0166, -0.0030],
         [-0.0455, -0.0236, -0.0121,  ...,  0.0043, -0.0166,  0.0193],
         ...,
         [ 0.0007,  0.0267,  0.0257,  ...,  0.0622,  0.0421,  0.0279],
         [-0.0126,  0.0347, -0.0352,  ..., -0.0393, -0.0396, -0.0102],
         [-0.0115,  0.0319,  0.0274,  ..., -0.0472, -0.0059,  0.0341]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<EmbeddingBackward0>)


### Explanation
The embedding layer is exposed as the decoder's `.embed_tokens` attribute. It is a module, so we call it directly on our token IDs to look up one 2048-dimensional vector per token.
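
An embedding lookup is nothing more than selecting rows of a weight matrix by token ID. The sketch below shows this on a toy layer. The sizes (10 rows, 4 dimensions) and the token IDs are arbitrary values picked for illustration, not the notebook's `Embedding(50272, 2048)`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only.
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

token_ids = torch.tensor([[2, 5, 7]])  # shape: (batch=1, seq_len=3)

# Calling the layer and indexing its weight matrix give the same result.
looked_up = toy_embedding(token_ids)
indexed = toy_embedding.weight[token_ids]

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```

The output shape is `(batch, seq_len, embedding_dim)`, which is why the real layer turns our `[1, 10]` token IDs into a `[1, 10, 2048]` tensor.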

Next, we call `embed_positions`, which produces a learned positional embedding for each position in the sequence. These are added to the token embeddings once, before the first decoder layer, rather than being fed to every layer separately.
This gives the model access to positional information, so it can take the order of the tokens into account.
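
Here is a rough sketch of how position indices can be derived from the attention mask and combined with token embeddings. It rests on two assumptions: that OPT shifts position indices by an offset of 2 (which would explain the 2050 rows of `embed_positions`), and that the `embed_pos_input` tensor used in the later cell comes from calling `embed_positions` on the attention mask, since that cell is not reproduced in these notes. The toy layers and the helper name are hypothetical.

```python
import torch
import torch.nn as nn

OFFSET = 2  # assumption: OPT shifts position indices by 2 (2050 = 2048 + 2)


def positions_from_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Turn an attention mask into position indices; padding lands on OFFSET - 1."""
    positions = torch.cumsum(attention_mask, dim=1) * attention_mask - 1
    return positions.long() + OFFSET


torch.manual_seed(0)

# Toy stand-ins for embed_tokens and embed_positions.
toy_tokens = nn.Embedding(num_embeddings=10, embedding_dim=4)
toy_positions = nn.Embedding(num_embeddings=8 + OFFSET, embedding_dim=4)

token_ids = torch.tensor([[2, 5, 7, 1]])       # last ID stands in for padding
attention_mask = torch.tensor([[1, 1, 1, 0]])

position_ids = positions_from_mask(attention_mask)

# Token and positional embeddings have the same shape and are simply summed.
combined = toy_tokens(token_ids) + toy_positions(position_ids)

print(position_ids)
print(combined.shape)
```

The sum keeps the `(batch, seq_len, hidden)` shape, so each token's vector now encodes both what the token is and where it sits.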

#### Finally...
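...for the sake of some output, we'll run the first layer's self-attention component by indexing into `layers` and calling its `self_attn` submodule. Calling `self_attn` directly bypasses the rest of the decoder layer (layer norm, residual connection, feed-forward block), so the result is for illustration only.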

In [16]:
embed_position_input = embedded_input + embed_pos_input
hidden_states, _, _ = OPT_model.model.decoder.layers[0].self_attn( embed_position_input )
print( "Layer:\t", OPT_model.model.decoder.layers[0].self_attn )
print( "Size:\t", hidden_states.size() )
print( "Output:\t", hidden_states )

Layer:	 OPTSdpaAttention(
  (k_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
  (v_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
  (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
  (out_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
)
Size:	 torch.Size([1, 10, 2048])
Output:	 tensor([[[-0.0139, -0.0067,  0.0023,  ...,  0.0073, -0.0010,  0.0130],
         [-0.0132, -0.0082,  0.0026,  ...,  0.0089,  0.0010,  0.0123],
         [-0.0134, -0.0048,  0.0041,  ...,  0.0101,  0.0029,  0.0140],
         ...,
         [-0.0124, -0.0094,  0.0054,  ...,  0.0097,  0.0022,  0.0102],
         [-0.0122, -0.0098,  0.0055,  ...,  0.0097,  0.0018,  0.0096],
         [-0.0121, -0.0107,  0.0059,  ...,  0.0097,  0.0021,  0.0096]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<ToCopyBackward0>)
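
For intuition about what the four projections do, here is a minimal single-head causal self-attention sketch. It is a simplification, not OPT's actual implementation: the real layer is multi-head and runs through a fused attention kernel. The class name and toy sizes are assumptions made for illustration.

```python
import math

import torch
import torch.nn as nn


class ToySelfAttention(nn.Module):
    """Single-head causal self-attention with q/k/v/out projections."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Similarity between every pair of positions, scaled by sqrt(dim).
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
        )
        scores = scores.masked_fill(future, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return self.out_proj(weights @ v)


torch.manual_seed(0)
attn = ToySelfAttention(dim=8)
x = torch.randn(1, 10, 8)  # (batch, seq_len, hidden), like [1, 10, 2048] above
out = attn(x)
print(out.shape)

# Changing the last token must not affect the outputs at earlier positions.
x_changed = x.clone()
x_changed[0, -1] += 1.0
out_changed = attn(x_changed)
print(torch.allclose(out[0, :-1], out_changed[0, :-1], atol=1e-6))
```

As with the real layer, the output has the same shape as the input. The final check illustrates the causal property that lets a decoder-only model predict the next token without seeing it.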


