## Chapter 3

### Language models

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [6]:
## load model and tokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                            device_map = "cuda",
                                            torch_dtype="auto",
                                            trust_remote_code=True)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.



In [21]:
### Create a pipeline

generator = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    return_full_text = False,
    max_new_tokens=100,
    do_sample=False
)

### Autoregressive models
> Models that use their earlier predictions to make later ones (e.g., the model’s first generated token is used to generate the second token) are called autoregressive models.
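
The loop behind that definition can be sketched without a real LLM. Everything in the block below is an assumption for illustration: `NEXT_TOKEN_SCORES` is a made-up lookup table standing in for a forward pass, and `generate_greedy` is a hypothetical helper, not part of `transformers`. The only point is that each output token is appended to the input before the next step.

```python
# Toy stand-in for a language model: scores for the next token given only
# the last token. A real LLM conditions on the whole sequence; this table
# is a made-up assumption used purely to show the generation loop.
NEXT_TOKEN_SCORES = {
    "The": {"capital": 0.9, "of": 0.1},
    "capital": {"of": 0.8, "is": 0.2},
    "of": {"France": 0.7, "The": 0.3},
    "France": {"is": 0.9, "of": 0.1},
    "is": {"Paris": 0.95, "France": 0.05},
    "Paris": {"<eos>": 1.0},
}


def generate_greedy(prompt_tokens, max_new_tokens=10):
    """Append the highest-scoring next token until <eos> or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]    # one toy "forward pass"
        next_token = max(scores, key=scores.get)  # greedy choice
        if next_token == "<eos>":
            break
        tokens.append(next_token)                 # the output becomes input
    return tokens


generated = generate_greedy(["The"])
print(" ".join(generated))
```

With a real model, `model.generate` runs a loop of this general shape internally, with the lookup replaced by a full forward pass over all tokens so far.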

In [22]:
prompt = "write an email apoligizing to Sarah for the tragicgardening mishap. Explain how it happen"

In [23]:
output = generator(prompt)

In [24]:
print(output[0]["generated_text"])

, express sincere regret, and offer to help her replant the flowers.

Dear Sarah,

I am deeply sorry for the unfortunate incident that occurred in your garden. I understand how much effort and love you put into nurturing your plants, and it pains me to know that I have caused such distress.

The incident happened when I was trying to help you with the watering system. Unfortunately, I accidentally knocked over the water


In [25]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206

In [27]:
prompt = "The capital of France is"

## Tokenize the input prompt

input_ids = tokenizer(prompt, return_tensors = "pt").input_ids

input_ids = input_ids.to("cuda")

## Get output from the model before LM head
model_output = model.model(input_ids)

## get output of LM head
lm_head_output = model.lm_head(model_output[0])


In [29]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [30]:
lm_head_output[0].shape

torch.Size([5, 32064])
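
The LM head output holds one row of scores per input position and one column per vocabulary entry (5 x 32064 here). Only the last row matters for choosing the next token. The block below is a minimal sketch of that step in plain Python; the five-word `vocab` and the `logits` values are made up, and `softmax` is a local helper rather than a library call.

```python
import math

# Hypothetical 5-word vocabulary and made-up scores: one row per input
# position, one column per vocabulary entry (the real tensor is 5 x 32064).
vocab = ["Paris", "London", "is", "the", "France"]
logits = [
    [0.1, 0.0, 0.3, 0.2, 0.1],
    [0.0, 0.1, 0.2, 0.9, 0.3],
    [0.2, 0.1, 0.1, 0.3, 1.5],
    [0.3, 0.2, 2.0, 0.1, 0.0],
    [4.0, 1.0, 0.2, 0.1, 0.3],  # last position: scores for the next token
]


def softmax(row):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(row)                           # subtract the max for stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


last_row = logits[-1]
probs = softmax(last_row)
next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy: argmax
next_token = vocab[next_id]
print(next_token, round(probs[next_id], 3))
```

With the tensors from the cells above, the analogous step would be something like `token_id = lm_head_output[0, -1].argmax(-1)` followed by `tokenizer.decode(token_id)`.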

### KV cache: a technique to speed up generation
> Recall that when generating the second token, we simply append the output token to the input and do another forward pass through the model. If we give the model the ability to cache the results of the previous calculation (especially some of the specific vectors in the attention mechanism), we no longer need to repeat the calculations of the previous streams. This time the only needed calculation is for the last stream. This is an optimization technique called the keys and values (kv) cache.
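
To make the saving concrete, here is a toy single-head attention in plain Python, a sketch under stated assumptions rather than how `transformers` implements its cache. The projections `project_key` and `project_value` are made-up functions standing in for learned weight matrices, the query is the token vector itself, and `KVCache` is a hypothetical class. The counter shows how many key projections each approach performs for the same outputs.

```python
import math

calls = {"key_projections": 0}  # counts how often keys are computed


def project_key(x):
    """Made-up key projection; a real layer uses a learned matrix."""
    calls["key_projections"] += 1
    return [2.0 * v for v in x]


def project_value(x):
    """Made-up value projection; a real layer uses a learned matrix."""
    return [v + 1.0 for v in x]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


def last_output_no_cache(tokens):
    """Recompute keys and values for every token on each call."""
    keys = [project_key(t) for t in tokens]
    values = [project_value(t) for t in tokens]
    return attend(tokens[-1], keys, values)


class KVCache:
    """Keep keys/values from earlier steps; project only the new token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        self.keys.append(project_key(token))
        self.values.append(project_value(token))
        return attend(token, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
cached_calls = calls["key_projections"]

calls["key_projections"] = 0
uncached_outputs = [
    last_output_no_cache(tokens[: i + 1]) for i in range(len(tokens))
]
uncached_calls = calls["key_projections"]

print(cached_calls, uncached_calls)
```

Both paths produce the same attention outputs, but the uncached path repeats the projections for every earlier token at every step, so its cost grows with the square of the sequence length in this toy.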

In [31]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=True
)


In [32]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=False
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


1.11 s ± 353 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Inside the Transformer Block

### Types of attention
> 1. Self-attention
> 2. Multi-head attention
> 3. Multi-query attention
> 4. Grouped-query attention
> 5. Flash Attention
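
Multi-head, multi-query, and grouped-query attention differ mainly in how many key/value heads the query heads share. The block below is a minimal sketch of that mapping; `kv_head_for` and `sharing_map` are hypothetical helpers, the head counts are illustrative, and the contiguous grouping is an assumption about layout rather than a statement about any particular model.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Return the index of the key/value head used by a given query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads  # query heads per K/V head
    return query_head // group_size


def sharing_map(n_query_heads, n_kv_heads):
    """List the key/value head index for every query head."""
    return [
        kv_head_for(h, n_query_heads, n_kv_heads)
        for h in range(n_query_heads)
    ]


multi_head = sharing_map(8, 8)     # every query head has its own K/V head
multi_query = sharing_map(8, 1)    # all query heads share a single K/V head
grouped_query = sharing_map(8, 2)  # two groups of four query heads each

print(multi_head)
print(multi_query)
print(grouped_query)
```

Fewer key/value heads means fewer key and value vectors to compute and store per token, which is why the shared variants pair naturally with a KV cache.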

## Summary
> In this chapter we discussed the main intuitions of Transformers and recent developments that enable the latest Transformer LLMs. We went over many new concepts, so let’s break down the key concepts that we discussed in this chapter:
- A Transformer LLM generates one token at a time.
- That output token is appended to the prompt, then this updated prompt is presented to the model again for another forward pass to generate the next token.
- The three major components of the Transformer LLM are the tokenizer, a stack of Transformer blocks, and a language modeling head.
- The tokenizer contains the token vocabulary for the model. The model has token embeddings associated with those tokens. Breaking the text into tokens and then using the embeddings of these tokens is the first step in the token generation process.
- The forward pass flows through all the stages once, one by one.
- Near the end of the process, the LM head scores the probabilities of the next possible token. Decoding strategies inform which actual token to pick as the output for this generation step (sometimes it’s the most probable next token, but not always).
- One reason the Transformer excels is its ability to process tokens in parallel. Each input token flows into its own track, or stream, of processing. The number of streams is the model’s “context size”, the maximum number of tokens the model can operate on.
- Because Transformer LLMs loop to generate the text one token at a time, it’s a good idea to cache the processing results of each step so we don’t duplicate the processing effort (these results are stored as various matrices within the layers).
- The majority of processing happens within Transformer blocks. These are made up of two components. One of them is the feedforward neural network, which is able to store information and make predictions and interpolations from data it was trained on.
- The second major component of a Transformer block is the attention layer. Attention incorporates contextual information to allow the model to better capture the nuance of language.
- Attention happens in two major steps: (1) scoring relevance and (2) combining information.
- A Transformer attention layer conducts several attention operations in parallel, each occurring inside an attention head, and their outputs are aggregated to make up the output of the attention layer.
- Attention can be accelerated by sharing the keys and values matrices between all heads (multi-query attention) or between groups of heads (grouped-query attention).
- Methods like Flash Attention speed up the attention calculation by optimizing how the operation is done on the different memory systems of a GPU.
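
The point in the summary about decoding strategies (the output is sometimes the most probable token, but not always) can be sketched in a few lines. This is an illustration under stated assumptions: the three-word `vocab` and the `scores` are made up, and `pick_greedy` and `pick_sampled` are hypothetical helpers, not `transformers` functions. The pipeline earlier in this chapter sets `do_sample=False`, which corresponds to the greedy choice.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(scores):
    """Always return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


def pick_sampled(scores, temperature, rng):
    """Draw a token index at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


vocab = ["Paris", "Lyon", "beautiful"]  # hypothetical tiny vocabulary
scores = [3.0, 1.0, 2.0]                # made-up LM head scores

rng = random.Random(0)                  # seeded so runs are repeatable
greedy_token = vocab[pick_greedy(scores)]
sampled_tokens = [vocab[pick_sampled(scores, 1.0, rng)] for _ in range(20)]

print(greedy_token)
print(sampled_tokens)
```

Greedy decoding returns the same token every time, while sampling can return lower-probability tokens; lowering the temperature pushes sampling toward the greedy result.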