### Practice: Large Language Models and Their Implications

![img](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4470ce74-e595-4750-92a5-5f21f040df6d_577x432.jpeg)


In this notebook, you're going to play with [BLOOM](https://arxiv.org/abs/2211.05100) and [OPT](https://arxiv.org/abs/2205.01068), some of the largest openly available language models.


_Based on works of: Tim Dettmers, Artem Chumachenko, Younes Belkada, Felix Marty, Yulian Gilyazev, Gosha Zolotov, Andrey Ishutin, Lena Wolf, Artemiy Vishnyakov, Svetlana Shirokovskih. Image source: https://bakztfuture.substack.com/p/gpt-3-memes ._
_Refined and updated by Radoslav Neychev_

### Part 1: prompt engineering

In this assignment, we'll use public APIs that host 100B+ models for inference. Your task is to prompt-engineer the model into solving a few tasks for you.

__Which API?__ You are free to use any publicly available API. Here are a few options:

- BLOOM API - [bigscience/bloom](https://huggingface.co/bigscience/bloom) (on the right; recommended)
- OPT API by Alpa - [opt.alpa.ai](https://opt.alpa.ai/)
- OpenAI API - [openai.com/api](https://openai.com/api/)
- AI21 Jurassic API - [ai21.com](https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1)

These APIs may require you to create a (free) account on their platform. Please note that some APIs also have paid subscriptions. __You do not need to pay them:__ this assignment was designed to be solved using free-tier subscriptions. If no APIs work for you, you can also solve these tasks with the smaller (1.3B or 6.7B) model that you will find later in this notebook, but this will make the tasks somewhat harder.

__Example:__ Tony is talking to Darth Vader ([BLOOM API](https://huggingface.co/bigscience/bloom)). Black text is written manually, blue text is generated.
<hr>

![img](https://i.imgur.com/a1QhKF7.png)
<hr>

__It is fine to roll back a few times,__ e.g. in the example above, the model first generated Vader lines twice in a row, and we rolled that back. However, if you need more than 1-2 rollbacks per session, you should probably try a different prompt.
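
A dialogue prompt like the one above can also be assembled programmatically before you paste it into an API playground. The sketch below is only an illustration: the `build_dialogue_prompt` helper, the scene description, and the example line are hypothetical and not part of the assignment.

```python
def build_dialogue_prompt(scene, turns, next_speaker):
    """Format a scene description and past turns, ending with an open line for the model."""
    lines = [scene, ""]
    for speaker, text in turns:
        lines.append(f"{speaker}: {text}")
    # End with the next speaker's name so the model continues in that voice.
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)


prompt = build_dialogue_prompt(
    scene="Tony is talking to Darth Vader.",
    turns=[("Tony", "Nice helmet. Does it come in red?")],
    next_speaker="Darth Vader",
)
print(prompt)
```

Ending the prompt with the speaker's name may reduce the chance that the model writes lines for the wrong character, which is the kind of failure that otherwise needs a rollback.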

### Part 2: parameter-efficient fine-tuning

Now, let's try to load a smaller version of [OPT](https://arxiv.org/abs/2205.01068) without an API. We'll be using OPT-6b7, with a total of 6.7B parameters. Beware: while this model is smaller than the ones behind the APIs, it's still over 60x larger than the BERT we played with last time. The code below will *just barely* fit into memory, so make sure you don't have anything else loaded. Sometimes you need to restart the runtime for the code to work.

Besides, it's a good time to restart your kernel and switch to GPU! (Runtime -> Change runtime type)
<center><img src="https://i.imgur.com/OOfDYzJ.png" width=240px></center>
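
To see why the fit is tight, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It assumes the 6.7B parameter count quoted above and a 16 GB GPU such as the T4 (the GPU size is an assumption about your runtime). Activations, the CUDA context, and any adapter state come on top of this.

```python
NUM_PARAMS = 6.7e9          # parameter count quoted in the text above
GPU_MEMORY_GB = 16          # assumption: a 16 GB card such as the T4

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory needed to store the weights only, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    size = weight_memory_gb(NUM_PARAMS, dtype)
    verdict = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{dtype:>8}: {size:5.1f} GB of weights -> {verdict} in {GPU_MEMORY_GB} GB")
```

Under these assumptions, half-precision weights leave very little headroom, which is one reason to load the model in 8-bit.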

In [1]:
import torch
if torch.cuda.get_device_capability() < (7, 5):
  raise ValueError(f"You got a GPU with capability {torch.cuda.get_device_capability()}, need at least (7, 5)")
else: print("OK")

# Note: this code requires a Turing GPU or newer. Good: T4, RTX 20xx/30xx, A100/Axx; Bad: K80, P100, V100
# Colab gives you T4. If you get older GPUs, please wait or switch to a new account (don't use both at the same time)
%pip --quiet install bitsandbytes transformers datasets accelerate zstandard jsonlines peft

OK

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes as bnb
import transformers
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-iml-max-1.3b", load_in_8bit=True, device_map='auto',
    low_cpu_mem_usage=True, torch_dtype=torch.float16, offload_state_dict=True)
# note: these flags slow down the code to save RAM; remove them if you have >32GB RAM
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-iml-max-1.3b")


In [None]:
for module in model.modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        module.state.memory_efficient_backward = True

for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.model.decoder.project_in = lambda x: x.requires_grad_(True)

In [None]:
data = load_dataset("IlyaGusev/ru_turbo_alpaca")


In [None]:
# cast model outputs to float32 to fix the top-k sampler
class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

In [None]:
batch = tokenizer("To live is to ", return_tensors='pt')
# note to self: find a less controversial example

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, min_length=30, max_length=50, do_sample=True)

print('\n\n', tokenizer.decode(output_tokens[0].numpy()))





 </s>To live is to ~~suffer~~ exist.  That's where my head at
To be is to ~~be miserable~~ exist.</s>


In [None]:
prefix = "When life gives me lemons"
print(prefix, end=' ')
batch = tokenizer(prefix, return_tensors='pt')
# note to self: find a less controversial example

past_key_values = None

for i in range(20):
  with torch.cuda.amp.autocast():
    outputs = model.forward(**batch, use_cache=True, past_key_values=past_key_values)
    # sample the next token from the temperature-scaled distribution
    probs = outputs.logits[0, -1].div(0.8).softmax(-1)
    token = torch.multinomial(probs, 1).view([])

    print(tokenizer.decode(token), end=' ', flush=True)
    past_key_values = outputs.past_key_values
    # feed the sampled token (not the argmax) back in, so the printed text matches the model's context
    batch = dict(input_ids=token.reshape(1, 1),
                 attention_mask=torch.ones(1, past_key_values[0][0].shape[-2] + 1, device='cuda'))

When life gives me lemons ,  I  make  lemon ade . 
 "  just  glad  sure  where  you  is  a  meme  to  an  or  not 

### Parameter-efficient finetuning: LoRA (1 point)

Since the model barely fits into memory, we won't be able to train it with conventional fine-tuning. Instead, you can use low-rank adapters based on the [LoRA paper](https://arxiv.org/pdf/2106.09685.pdf).

The core idea is to add low-rank adapters __in parallel with attention projection matrices,__ like this:
<center><img src="https://i.imgur.com/6bQLNiG.png" width=240px></center>
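
In formula form, a frozen projection `W` is augmented as `h = W x + B A x`, where `A` has shape `rank x d_in` and `B` has shape `d_out x rank` for a small rank. The short calculation below shows how few trainable weights this adds. The hidden size of 4096 and the rank of 8 are illustrative assumptions, not values read from the model.

```python
def full_params(d_in, d_out):
    """Weights in a dense d_in -> d_out projection (bias ignored)."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Weights in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


d_model, rank = 4096, 8     # illustrative assumptions
full = full_params(d_model, d_model)
lora = lora_params(d_model, d_model, rank)
print(f"full projection: {full:,} weights")
print(f"LoRA adapter:    {lora:,} weights ({full // lora}x fewer)")
```

Only the adapter weights need gradients and optimizer state, which is what makes training feasible when the base model already fills most of the GPU.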

In [None]:
from peft import LoraConfig

# LoRA adapters on the query and key projections of every attention layer;
# rank and scaling are left at the peft defaults
lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    init_lora_weights=False  # random init for both LoRA matrices, so the adapter is not a no-op before training
)

model.add_adapter(lora_config, adapter_name="adapter_1")

### (example) How to train your model

The example below shows how to train the LoRA adapters on a dummy dataset. You will need to run a _similar_ training task later.

__Note:__ please scroll down for the homework task

In [None]:
data = load_dataset("hate_speech_pl")
data = data.map(lambda samples: tokenizer(samples['text']), batched=True)



In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
        num_rows: 2508
    })
})

In [None]:
data['train']

Dataset({
    features: ['id', 'text_id', 'annotator_id', 'minority_id', 'negative_emotions', 'call_to_action', 'source_of_knowledge', 'irony_sarcasm', 'topic', 'text', 'rating', 'input_ids', 'attention_mask'],
    num_rows: 13887
})

In [None]:
trainer = transformers.Trainer(
    model=model, train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, gradient_accumulation_steps=4,
        warmup_steps=250, max_steps=10, learning_rate=2e-4, fp16=True,
        logging_steps=1, output_dir='outputs'),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1    | 4.1299        |
| 2    | 3.9629        |
| 3    | 4.4268        |
| 4    | 3.5927        |
| 5    | 3.8299        |
| 6    | 3.9025        |
| 7    | 4.0773        |
| 8    | 3.731         |
| 9    | 3.8721        |
| 10   | 3.5568        |


TrainOutput(global_step=10, training_loss=3.9081836938858032, metrics={'train_runtime': 42.3842, 'train_samples_per_second': 3.775, 'train_steps_per_second': 0.236, 'total_flos': 169066356375552.0, 'train_loss': 3.9081836938858032, 'epoch': 0.01})
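
The numbers in this log can be sanity-checked by hand. The sketch below recomputes the effective batch size and the fraction of the dataset seen, using only values from the cells above. The learning-rate estimate additionally assumes a linear warmup from zero and a single GPU; both are assumptions worth verifying for your setup.

```python
per_device_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 10
warmup_steps = 250
learning_rate = 2e-4
num_train_rows = 13887      # size of the train split shown above

# Sequences that contribute to one optimizer step (single GPU assumed).
effective_batch = per_device_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps
epochs = samples_seen / num_train_rows

# Assumption: linear warmup from 0, so the rate after k steps is lr * k / warmup_steps.
lr_upper_bound = learning_rate * max_steps / warmup_steps

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen} ({epochs:.2f} epochs)")
print(f"learning rate never exceeds about {lr_upper_bound:.1e}")
```

The epoch count agrees with the `epoch` field in the log. Because `warmup_steps` exceeds `max_steps`, this demo run ends while the learning rate is still warming up; for a real fine-tuning run you would likely raise `max_steps` or lower `warmup_steps`.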

In [None]:
model.eval()
model.enable_adapters()

In [None]:
batch = tokenizer("To live is to ", return_tensors='pt')
# note to self: find a less controversial example

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, min_length=30, max_length=50, do_sample=True)

print('\n\n', tokenizer.decode(output_tokens[0].numpy()))





 </s>To live is to  suffer, to die is to  forgive: Buddha
If you're feeling stressed out today — or sad, angry, or angry in general, be sure to stop your thoughts and breathe every so often.
According to


In [None]:
model.disable_adapters()


In [None]:
batch = tokenizer("To live is to ", return_tensors='pt')
# note to self: find a less controversial example

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, min_length=30, max_length=50, do_sample=True)

print('\n\n', tokenizer.decode(output_tokens[0].numpy()))



 </s>To live is to      To know  And feel  And have pleasure.     And to be alive is good.     Then all can be understood by        All of


Example code for a simple LoRA implementation:

In [None]:
# import torch
# import torch.nn as nn
# import torch.nn.functional as F

# class LoRALayer(nn.Module):
#     """Wraps a linear layer with LoRA-like adapter. Wraps an existing OPT linear layer"""
#     def __init__(self, module: nn.Linear, rank: int):
#         super().__init__()
#         self.module = module
#         self.adapter = nn.Sequential(
#             nn.Linear(module.in_features, rank, bias=False),
#             nn.Linear(rank, module.out_features, bias=False)
#         )
#         nn.init.kaiming_uniform_(self.adapter[0].weight, a=5 ** 0.5)
#         nn.init.zeros_(self.adapter[1].weight)

#         self.adapter.to(module.weight.device)

#     def forward(self, input):
#         # Apply self.module and LoRA adapter, return the sum (base module outputs + adapter outputs)
#         return self.module(input) + self.adapter(input)



# # test your implementation
# test_linear = nn.Linear(128, 128)
# test_linear.weight.data[...] = torch.eye(128)
# test_adapter = LoRALayer(test_linear, rank=8)

# assert torch.allclose(test_adapter(torch.ones(1, 1, 128)), test_linear.bias + 1), "please check your forward pass"

# test_adapter.adapter[0].weight.data[...] = torch.linspace(0.1, -0.5, 128 * 8).view(8, 128)
# test_adapter.adapter[1].weight.data[...] = torch.linspace(0.5, -0.1, 128 * 8).view(128, 8)
# test_linear.bias.data[...] = torch.linspace(1., -1., 128)

If you want to dig deeper, try to implement prompt-tuning.
You can read more about prompt tuning variants in paper [1](https://arxiv.org/abs/2104.08691) or paper [2](https://arxiv.org/abs/2101.00190). Both versions can be implemented by passing trainable prompts as `model.forward(..., past_key_values=your_prompts)`.
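
As a starting point, here is a minimal sketch of the input-level variant: a block of trainable vectors is prepended to the token embeddings while everything else stays frozen. It uses a toy `nn.Embedding` instead of the OPT model so that it runs without downloading anything. The `SoftPrompt` class and all sizes are hypothetical, and this sketch feeds the prompt through the embedding sequence rather than through `past_key_values`.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Prepends trainable prompt vectors to the output of a frozen embedding layer."""

    def __init__(self, embedding: nn.Embedding, num_prompt_tokens: int):
        super().__init__()
        self.embedding = embedding
        for param in self.embedding.parameters():
            param.requires_grad = False  # only the prompt is trained
        self.prompt = nn.Parameter(
            torch.randn(num_prompt_tokens, embedding.embedding_dim) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        token_embeds = self.embedding(input_ids)            # [batch, seq, dim]
        prompt = self.prompt.unsqueeze(0).expand(input_ids.shape[0], -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)     # [batch, prompt + seq, dim]


# Toy stand-in for the language model's embedding layer (assumption for illustration).
toy_embedding = nn.Embedding(num_embeddings=100, embedding_dim=16)
soft_prompt = SoftPrompt(toy_embedding, num_prompt_tokens=5)
embeds = soft_prompt(torch.randint(0, 100, (2, 7)))
print(embeds.shape)
```

With a Hugging Face model, a tensor like this would typically be passed as `inputs_embeds`, together with an attention mask extended by the prompt length.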




### Read more
* How post-training quantization works: https://arxiv.org/abs/2208.07339
* An overview of running large models: https://huggingface.co/docs/accelerate/package_reference/big_modeling
* A general library for different adapter types: https://adapterhub.ml/

### [extra info] How to optimize for inference

The code below converts training-optimized 8bit weights into an inference-optimized layout. It should result in significantly faster inference with the same memory footprint.
However, if you do this, you can no longer run training: there is no way to un-convert after the first optimized forward!

```python
model.config.use_cache = True
for module in model.modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        module.state.memory_efficient_backward = False
```

### [extra info] Fine-grained inference

If for some reason you're not satisfied with the `model.generate` interface, you can write your own inference code with iterative forward passes. Here's how it's done:
```python
prefix = "Somebody is"  # same as above
batch = tokenizer(prefix, return_tensors='pt')
past_key_values = None
with torch.cuda.amp.autocast():
  for i in range(50):
    outputs = model.forward(**batch, use_cache=True, past_key_values=past_key_values)
    # sample the next token from the temperature-scaled distribution
    probs = outputs.logits[0, -1].div(0.8).softmax(-1)
    token = torch.multinomial(probs, 1).view([])

    print(tokenizer.decode(token), end=' ', flush=True)
    past_key_values = outputs.past_key_values
    # feed the sampled token (not the argmax) back in, so the printed text matches the model's context
    batch = dict(input_ids=token.reshape(1, 1),
                 attention_mask=torch.ones(1, past_key_values[0][0].shape[-2] + 1, device='cuda'))
```
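
The `.div(0.8).softmax(-1)` step in the loop above is temperature sampling. The standalone sketch below shows its effect on a made-up three-token distribution using only the standard library; the logit values are invented for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]    # made-up scores for three candidate tokens
for temperature in (1.0, 0.8, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Draw one token index from the temperature-0.8 distribution.
random.seed(0)
probs = softmax_with_temperature(logits, 0.8)
token = random.choices(range(len(logits)), weights=probs, k=1)[0]
print("sampled token index:", token)
```

Temperatures below 1 sharpen the distribution toward the most likely token; as the temperature approaches 0, sampling approaches greedy `argmax` decoding.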