<a href="https://colab.research.google.com/github/alex-movila/ML-Colab-Tutorials/blob/master/Petals_Getting_started_with_LLaMA_65B_(GPU_Colab).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<img src="https://camo.githubusercontent.com/473dd9f992924d27457650251786464f72e54121ac6e9210add0f483ca849277/68747470733a2f2f692e696d6775722e636f6d2f3765523750616e2e706e67" width="40%">  
</div>

# Getting started with Petals

This notebook will guide you through the basics of Petals, a system for inference and fine-tuning of 100B+ language models without the need for high-end GPUs. With Petals, you can pool compute resources with other people over the Internet and run large language models such as [LLaMA-65B](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md), [BLOOM-176B](https://huggingface.co/bigscience/bloom) or [BLOOMZ-176B](https://huggingface.co/bigscience/bloomz) from your desktop computer or Google Colab.

💬 If you run into any issues while running this notebook, let us know in the **[#running-a-client](https://discord.gg/J29mCBNBvm)** channel of our Discord!

So, let's get started! First, let's install [the Petals package](https://github.com/bigscience-workshop/petals):

In [None]:
%pip install git+https://github.com/bigscience-workshop/petals

## Step 1. The easiest way to generate text 🚀

Let's start with the easiest task &mdash; creating a distributed model and using it for generating text. This machine will download a small part of the model weights and rely on other computers in the network to run the rest of the model.

We suggest starting with LLaMA-65B, but you can also use BLOOM and BLOOMZ. Just set `MODEL_NAME = "bigscience/bloom"` or `"bigscience/bloomz"` to load these models.

📋 **Heads up:** This Colab is provided for demonstration purposes. If you build your own app running these models, make sure you follow the [LLaMA's](https://bit.ly/llama-license) and/or [BLOOM's](https://bit.ly/bloom-license) terms of use. Note that LLaMA is available for non-commercial purposes only, and you have to file a request [here](https://bit.ly/llama-license) to use it in your own projects.

In [None]:
import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "enoch/llama-65b-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False, add_bos_token=False)

model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()

Now, let's try to generate something by calling the __`model.generate()`__ method.

The first call to this method takes a few seconds to connect to the Petals swarm. After that, you should expect a generation speed of 3-5 tokens/sec for LLaMA-65B and ~1 token/sec for BLOOM. If you don't have enough GPUs to host the entire model, this is much faster than what you get with other methods, such as offloading, which gives 10&ndash;20 sec/token for BLOOM.

In [None]:
inputs = tokenizer('A cat in French is "', return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0]))

A cat in French is "chat" and
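To put the throughput figures quoted above in perspective, here is a back-of-the-envelope estimate of how long one completion would take. The rates are the approximate numbers from the text; the 100-token length and the helper names are illustrative assumptions, and real speed depends on the state of the swarm.

```python
# Rough wall-clock estimates for generating one completion.
# Rates are the approximate figures quoted in this notebook, not measurements.

NUM_TOKENS = 100  # illustrative completion length (an assumption)


def seconds_at_rate(num_tokens: int, tokens_per_sec: float) -> float:
    """Time needed when speed is given in tokens per second."""
    return num_tokens / tokens_per_sec


def seconds_at_latency(num_tokens: int, sec_per_token: float) -> float:
    """Time needed when speed is given in seconds per token."""
    return num_tokens * sec_per_token


# Each entry maps a setup to its (best case, worst case) time in seconds.
estimates = {
    "llama_petals": (seconds_at_rate(NUM_TOKENS, 5), seconds_at_rate(NUM_TOKENS, 3)),
    "bloom_petals": (seconds_at_rate(NUM_TOKENS, 1), seconds_at_rate(NUM_TOKENS, 1)),
    "bloom_offloading": (seconds_at_latency(NUM_TOKENS, 10), seconds_at_latency(NUM_TOKENS, 20)),
}

for name, (best, worst) in estimates.items():
    print(f"{name}: {best:.0f}-{worst:.0f} s for {NUM_TOKENS} tokens")
```

Under these assumptions, offloading BLOOM would take roughly 10 to 20 times longer than running it through the swarm.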


The `model.generate()` method runs **greedy** generation by default. However, you can choose other generation methods like **top-p/top-k sampling** or **beam search** by passing the corresponding parameters (you'll see an example in a bit). You can even implement custom generation methods (we'll cover that in **Step 5**).
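To make the difference between greedy decoding and top-p sampling concrete, here is a small pure-Python sketch over a toy list of logits. It is a simplified illustration of the idea, not the code that `model.generate()` runs; the function names and the toy logits are made up for this example.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_p_candidates(logits, top_p):
    """Smallest set of most likely tokens whose total probability reaches top_p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def sample_top_p(logits, top_p, rng):
    """Top-p (nucleus) sampling: sample only among the kept candidates."""
    kept, weights = top_p_candidates(logits, top_p)
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # scores for a 5-token toy vocabulary
rng = random.Random(0)

print("greedy pick:", greedy(toy_logits))
print("top-p candidates:", top_p_candidates(toy_logits, top_p=0.9)[0])
print("sampled pick:", sample_top_p(toy_logits, top_p=0.9, rng=rng))
```

Greedy decoding always returns the same token for the same logits, while top-p sampling can return any token from the candidate set, which is what makes sampled text more varied.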

🔏 **Note:** Your data is processed by other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.

## Step 2. Chat bots and interactive generation 💬

If you'd like to talk to the model interactively, you can use the __inference session__ interface. This interface provides a simple way to print generated tokens on the fly or to build a chat bot that responds to a human's messages.

The inference session looks for a sequence of servers that will run successive inference steps and store past attention caches. This way, you don't need to rerun previous tokens through the transformer to generate each phrase. If one of the remote servers fails, Petals will automatically find a replacement and regenerate a small part of the caches.
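The saving from keeping attention caches can be illustrated with a simple count of how many token positions have to pass through the transformer. This is a rough cost proxy under simplifying assumptions (it ignores that attention itself gets more expensive as the sequence grows), and the function names are hypothetical.

```python
def positions_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every step re-runs the whole sequence generated so far."""
    return sum(prompt_len + step for step in range(new_tokens))


def positions_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The prompt is processed once; each later step feeds only the newest token."""
    return prompt_len + (new_tokens - 1)


prompt_len, new_tokens = 10, 50  # illustrative sizes (assumptions)
print("without cache:", positions_without_cache(prompt_len, new_tokens))
print("with cache:   ", positions_with_cache(prompt_len, new_tokens))
```

With these toy sizes, caching reduces the work from 1725 processed positions to 59.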

Let's see how to use it to write a simple chat bot, showing tokens as soon as they are generated:

In [None]:
fake_token = tokenizer("^")["input_ids"][0]  # Workaround to make SentencePiece .decode() keep leading spaces

with model.inference_session(max_length=512) as sess:
    while True:
        prompt = input('Human: ')
        if prompt == "":
            break
        prefix = f"Human: {prompt}\nFriendly AI:"
        prefix = tokenizer(prefix, return_tensors="pt")["input_ids"].cuda()
        print("Friendly AI:", end="", flush=True)

        while True:
            outputs = model.generate(
                prefix, max_new_tokens=1, do_sample=True, top_p=0.9, temperature=0.75, session=sess
            )
            outputs = tokenizer.decode([fake_token, outputs[0, -1].item()])[1:]
            print(outputs, end="", flush=True)
            if "\n" in outputs:
                break
            prefix = None  # Prefix is passed only for the 1st token of the bot's response

Human: Hi, how are you?
Friendly AI: I'm fine, thanks. And you?
Human: 
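The `fake_token` workaround in the cell above is easier to follow with a toy model of the decoder. The sketch below assumes the usual SentencePiece convention of marking a preceding space with "▁" and dropping the leading space of the decoded string; `toy_decode` is a hypothetical stand-in, not the real tokenizer.

```python
SPACE_MARK = "\u2581"  # "▁", the marker SentencePiece-style vocabularies use for a space


def toy_decode(pieces):
    """Toy decode(): join pieces, turn markers into spaces, drop one leading space."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text


# Pieces for the text " I'm fine", produced one at a time during streaming.
pieces = [SPACE_MARK + "I", "'m", SPACE_MARK + "fine"]

# Decoding each piece on its own loses the spaces between words.
naive = "".join(toy_decode([piece]) for piece in pieces)

# Prepending a dummy piece and cutting off its single character keeps them.
fake_piece = SPACE_MARK + "^"
patched = "".join(toy_decode([fake_piece, piece])[1:] for piece in pieces)

print(repr(naive))
print(repr(patched))
```

The dummy piece absorbs the leading-space stripping, so the space that belongs to the real token survives.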


### 📦 Making apps that use Petals

If you develop a tool for other people, you can wrap up the code using Petals into a user-friendly web app, such as [chat.petals.ml](http://chat.petals.ml). Under the hood, this app may connect to a lightweight [HTTP endpoint](https://github.com/borzunov/petals-chat) for inference that forwards all requests to the Petals swarm.

<div align="center">
<br>
<img src="https://i.imgur.com/p2nwiho.png" width="40%">  
</div>

## Step 3. How does it work? 🛠️

The `model` you are running is equivalent to the original model, but only a part of it is loaded into your machine's GPU. Let's have a look under the hood:

In [None]:
model

DistributedLlamaForCausalLM(
  (model): DistributedLlamaModel(
    (embed_tokens): Embedding(32000, 8192, padding_idx=0)
    (layers): RemoteSequential(modules=llama-65b-hf.0..llama-65b-hf.79)
    (norm): LlamaRMSNorm()
  )
  (lm_head): LMHead()
)

As you can see, word embeddings and some other layers are regular PyTorch modules hosted on your machine, but the rest of the model (e.g., transformer blocks) is encased in the __RemoteSequential__ class. This is an advanced PyTorch module that runs on a distributed swarm of other machines.

Still, you can access individual layers and their outputs, as well as run forward/backward through them:

In [None]:
first_five_layers = model.model.layers[0:5]
first_five_layers

RemoteSequential(modules=llama-65b-hf.0..llama-65b-hf.4)

In [None]:
dummy_inputs = torch.randn(1, 3, model.config.hidden_size, dtype=torch.bfloat16, requires_grad=True)
outputs = first_five_layers(dummy_inputs)
outputs

tensor([[[-0.2139,  0.4180, -2.1562,  ..., -0.2031, -0.6523,  1.9844],
         [-1.8594,  0.7070,  0.9141,  ...,  1.2109,  1.4219, -0.6680],
         [ 1.0469,  0.0845, -0.9609,  ...,  1.6406,  1.3672,  1.1875]]],
       dtype=torch.bfloat16, grad_fn=<_RemoteSequentialAutogradFunctionBackward>)

In [None]:
loss = torch.mean((outputs - torch.ones_like(outputs)) ** 2)
loss.backward()  # backpropagate through the internet
print("Grad w.r.t. inputs:", dummy_inputs.grad.flatten())

Grad w.r.t. inputs: tensor([-2.2507e-04, -1.9550e-05, -4.4250e-04,  ...,  5.8174e-05,
         5.0545e-05,  5.9605e-05], dtype=torch.bfloat16)


In general, you can mix and match distributed layers like in regular PyTorch and even insert and train your own layers (e.g., adapters) between the pre-trained ones.
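The mix-and-match pattern can be sketched without any network access by using a toy, purely local chain of blocks. `ToySequential` below is a hypothetical stand-in that only mimics slicing and calling; it is not the Petals `RemoteSequential` class, and the "blocks" are simple functions on numbers.

```python
class ToySequential:
    """Hypothetical local stand-in for a sliceable chain of frozen blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, index):
        # Slicing returns a shorter chain, a single index returns one block.
        if isinstance(index, slice):
            return ToySequential(self.blocks[index])
        return self.blocks[index]

    def __call__(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# Eight "pretrained" blocks; block k simply adds k to its input.
frozen_blocks = ToySequential([lambda x, k=k: x + k for k in range(8)])


def adapter(x):
    """Your own layer, inserted between the frozen blocks."""
    return x * 2


def forward_with_adapter(x):
    mid = len(frozen_blocks) // 2
    x = frozen_blocks[:mid](x)     # first half of the frozen blocks
    x = adapter(x)                 # local, trainable part
    return frozen_blocks[mid:](x)  # second half of the frozen blocks


print(frozen_blocks(0), forward_with_adapter(0))
```

The same slicing-and-composing structure is what the classifier in Step 4 relies on, with real transformer blocks in place of the toy functions.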

<div align="center">
<img src="https://camo.githubusercontent.com/58732a64488a9be928e25f3e60e3692b989ffe212ac86cb4902d8df20a042b03/68747470733a2f2f692e696d6775722e636f6d2f525459463379572e706e67" width="80%">
</div>

<p align="center">📜 <b><a href="https://arxiv.org/pdf/2209.01188.pdf">Read details in our paper</a></b></p>

## Step 4. Adding a trainable adapter 🏋️

The remotely hosted transformer blocks are **frozen** to keep the pretrained model the same for all users. Still, using **parameter-efficient adapters** (small trainable layers added between the pretrained blocks of the model, such as [LoRA](https://arxiv.org/abs/2106.09685)) or **trainable prompts** (trainable inputs added before the inputs to the model, such as in [P-Tuning v2](https://arxiv.org/abs/2110.07602)) is usually enough to make the model solve a variety of downstream tasks.

Below, we show an example of how to add a basic **trainable** adapter (two small linear layers) in the middle of the pretrained model's transformer blocks, followed by a classification head on top. The trainable weights and the corresponding optimizer statistics will be stored locally:

In [None]:
import torch.nn as nn
import torch.nn.functional as F


class BloomBasedClassifier(nn.Module):
  def __init__(self, model):
    super().__init__()
    self.distributed_layers = model.transformer.h
    self.adapter = nn.Sequential(nn.Linear(model.config.hidden_size, 32), nn.Linear(32, model.config.hidden_size))
    self.head = nn.Linear(model.config.hidden_size, 2)

  def forward(self, embeddings):
    mid_block = len(self.distributed_layers) // 2
    hidden_states = self.distributed_layers[:mid_block](embeddings)
    hidden_states = self.adapter(hidden_states)
    hidden_states = self.distributed_layers[mid_block:](hidden_states)
    pooled_states = torch.mean(hidden_states, dim=1)
    return self.head(pooled_states)

In [None]:
classifier = BloomBasedClassifier(model).cuda()
opt = torch.optim.Adam(classifier.parameters(), 3e-5)
inputs = torch.randn(3, 2, model.config.hidden_size, device='cuda')
labels = torch.tensor([1, 0, 1], device='cuda')

for i in range(5):
  loss = F.cross_entropy(classifier(inputs), labels)
  print(f"loss[{i}] = {loss.item():.3f}")
  opt.zero_grad()
  loss.backward()
  opt.step()

print('predicted:', classifier(inputs).argmax(-1))  # expected: 1, 0, 1

loss[0] = 11.039
loss[1] = 6.550
loss[2] = 2.489
loss[3] = 0.455
loss[4] = 0.038
predicted: tensor([1, 0, 1], device='cuda:0')
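For a sense of scale, the locally trained parameters in this example can be counted by hand. The sketch below assumes a hidden size of 8192 (read off the `Embedding(32000, 8192)` line in the Step 3 printout) and uses the nominal 65 billion parameter figure for the frozen model; the helper name is made up.

```python
HIDDEN_SIZE = 8192        # from the model printout in Step 3
BOTTLENECK = 32           # inner width of the adapter
NUM_CLASSES = 2           # outputs of the classification head
FROZEN_PARAMS = 65e9      # nominal size of LLaMA-65B (an approximation)


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


adapter_params = linear_params(HIDDEN_SIZE, BOTTLENECK) + linear_params(BOTTLENECK, HIDDEN_SIZE)
head_params = linear_params(HIDDEN_SIZE, NUM_CLASSES)
trainable_params = adapter_params + head_params

print(f"adapter: {adapter_params:,}  head: {head_params:,}  total: {trainable_params:,}")
print(f"share of the frozen model: {trainable_params / FROZEN_PARAMS:.6%}")
```

Under these assumptions, roughly half a million parameters are trained locally, a tiny fraction of the frozen model.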


## Step 5. Using custom sampling methods 🎰

The __`model.inference_session()`__ interface in Petals also allows you to write custom inference code. You can use this to implement any sampling algorithms you want, or write a custom beam search algorithm that forbids the model from using swearwords.

Below, let's see how we can reimplement the standard `model.generate()` interface by making forward passes through all the layers manually:

In [None]:
from hivemind import get_logger

logger = get_logger()

fake_token = tokenizer("^")["input_ids"][0]  # Workaround to make SentencePiece .decode() keep leading spaces

text = "What is a good chatbot? Answer:"
token_ids = tokenizer(text, return_tensors="pt")["input_ids"].cuda()
max_length = 100
with torch.inference_mode():
    with model.inference_session(max_length=max_length) as sess:
        while len(text) < max_length:
            embs = model.transformer.word_embeddings(token_ids)
            embs = model.transformer.word_embeddings_layernorm(embs)

            h = sess.step(embs)
            h_last = model.transformer.ln_f(h[:, -1])
            logits = model.lm_head(h_last)

            next_token = logits.argmax(dim=-1)
            text += tokenizer.decode([fake_token, next_token.item()])[1:]
            token_ids = next_token.reshape(1, 1)
            logger.info(text)

Jun 24 02:59:32.990 [INFO] What is a good chatbot? Answer: A
Jun 24 02:59:34.073 [INFO] What is a good chatbot? Answer: A chat
Jun 24 02:59:35.088 [INFO] What is a good chatbot? Answer: A chatbot
Jun 24 02:59:36.103 [INFO] What is a good chatbot? Answer: A chatbot that
Jun 24 02:59:37.165 [INFO] What is a good chatbot? Answer: A chatbot that is
Jun 24 02:59:38.200 [INFO] What is a good chatbot? Answer: A chatbot that is able
Jun 24 02:59:39.219 [INFO] What is a good chatbot? Answer: A chatbot that is able to
Jun 24 02:59:40.245 [INFO] What is a good chatbot? Answer: A chatbot that is able to understand
Jun 24 02:59:41.226 [INFO] What is a good chatbot? Answer: A chatbot that is able to understand the
Jun 24 02:59:42.229 [INFO] What is a good chatbot? Answer: A chatbot that is able to understand the user
Jun 24 02:59:43.208 [INFO] What is a good 
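The idea of a decoding rule that forbids certain words, mentioned at the start of this step, comes down to excluding some token ids before choosing the next token. Here is a minimal pure-Python sketch on toy logits; the function name, the banned ids, and the scores are all made up for illustration. In the loop above, the equivalent step would be applied to `logits` just before the `argmax` call.

```python
def greedy_with_banned(logits, banned_ids):
    """Pick the highest-scoring token id that is not in banned_ids."""
    best_id, best_score = None, float("-inf")
    for token_id, score in enumerate(logits):
        if token_id in banned_ids:
            continue  # never consider forbidden tokens
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id


toy_logits = [0.1, 3.0, 2.5, -1.0]  # token 1 scores highest
banned = {1}                        # hypothetical id of a forbidden word

print("unrestricted:", greedy_with_banned(toy_logits, banned_ids=set()))
print("with ban:    ", greedy_with_banned(toy_logits, banned_ids=banned))
```

With tensors, the same effect is commonly achieved by setting the banned positions of the logits to negative infinity before taking the maximum or sampling.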

## Step 6. Making fox innocent 🦊

Next, let's see how to fine-tune a model using trainable (optionally, deep) prompts.

We'll take the model saying "*A quick brown fox jumps over the lazy dog.*" and teach it to say the opposite &ndash; that actually "*A quick brown fox did not jump over the lazy dog*".

In [None]:
inputs = tokenizer("A quick brown fox", return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=7)
print("generated:", tokenizer.decode(outputs[0]))

generated: A quick brown fox jumps over the lazy dog.


In [None]:
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME, tuning_mode='deep_ptune', pre_seq_len=3)
model = model.cuda()

Jun 23 15:12:42.335 [INFO] Using DHT prefix: llama-65b-hf


In [None]:
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

the_fox_is_innocent = tokenizer("A quick brown fox did not jump over the lazy dog", return_tensors="pt")["input_ids"].cuda()
for i in range(30):
    loss = model(input_ids=the_fox_is_innocent, labels=the_fox_is_innocent).loss
    print(f"loss[{i}] = {loss.item():.3f}")

    opt.zero_grad()
    loss.backward()
    opt.step()
    print("opt.step()")

In [None]:
inputs = tokenizer("A quick brown fox", return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=7)
print("generated:", tokenizer.decode(outputs[0]))

generated: A quick brown fox did not jump over the lazy dog
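To see why prompt tuning is cheap, here is a rough count of the trainable values involved. This is an estimate under a stated assumption about the layout (one hidden-size vector per prompt position at the input, plus one per position for every block when deep prompts are used); the actual parameter shapes inside Petals may differ. The hidden size and block count are read off the Step 3 printout.

```python
HIDDEN_SIZE = 8192   # from Embedding(32000, 8192) in the Step 3 printout
NUM_BLOCKS = 80      # llama-65b-hf.0 .. llama-65b-hf.79
PRE_SEQ_LEN = 3      # the pre_seq_len value passed to from_pretrained() above

# ASSUMPTION: one trainable vector per prompt position at the model input ...
input_prompt_params = PRE_SEQ_LEN * HIDDEN_SIZE
# ... plus, for deep prompts, one more per prompt position for every block.
deep_prompt_params = PRE_SEQ_LEN * NUM_BLOCKS * HIDDEN_SIZE

total_prompt_params = input_prompt_params + deep_prompt_params
print(f"input prompts: {input_prompt_params:,}")
print(f"deep prompts:  {deep_prompt_params:,}")
print(f"total:         {total_prompt_params:,}")
```

Under this assumption, about two million values are trained while all transformer blocks stay frozen.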


## Step 7. Sharing is caring 🤗

We developed Petals to be a community-run system, so we rely on people sharing their GPUs to increase the swarm's capacity. If you have some GPUs that are not always busy, please **consider running a Petals server.** You can pause it at any time if you want to use the GPUs for something else. As a bonus, people running a server get a certain speedup when using Petals, since a larger part of the model is hosted locally.

<br>

🐋 You can run our [Docker](https://www.docker.com) image (works on Linux, macOS, and Windows with [WSL2](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl)):

```
sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main python -m petals.cli.run_server bigscience/bloom --port 31330
```

🐍 Or run these commands in an [Anaconda](https://www.anaconda.com) env (requires Linux and Python 3.7+):

```
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -U petals
python -m petals.cli.run_server bigscience/bloom
```

<br>

📚 See [FAQ](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#running-a-server) to learn how to configure the server to use multiple GPUs, address common issues, etc.

You can also host [BLOOMZ](https://huggingface.co/bigscience/bloomz), a version of BLOOM fine-tuned to follow human instructions in the zero-shot regime: just replace `bigscience/bloom` with `bigscience/bloomz` in the commands above.

🔒 Hosting a server does not allow others to run custom code on your computer. Learn more about security [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety).

💬 If you have any issues or feedback, let us know on [our Discord server](https://discord.gg/D9MwApKgWa)!

## Step 8. Using other fine-tuning and prompt-tuning methods

While you can write your own custom adapters, Petals implements several [standard](https://arxiv.org/abs/2104.08691) [methods](https://arxiv.org/abs/2101.00190) for parameter-efficient fine-tuning. We provide a couple of advanced examples in our GitHub repository:

- Training a personified chatbot: [notebook](https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-personachat.ipynb)

- Fine-tuning BLOOM for text semantic classification: [notebook](https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb)

## What's next?

Congratulations on finishing our tutorial! Now, you are familiar with how to use Petals for different tasks, how it works under the hood, and how to increase its capacity.

You can find a few other helpful resources below:

* __More about Petals.__ The [README](https://github.com/bigscience-workshop/petals#readme) file in our GitHub repository has links to more Petals-related materials, including instructions for starting your own swarm (possibly, with a model other than BLOOM).

* __Discord server.__ If you have any feedback, questions, or technical issues, please [join our Discord server](https://discord.gg/D9MwApKgWa) and let us know. If you want to build something based on Petals, we'd be happy to hear what you are up to.

* __Academic paper.__ We have released a [paper](https://arxiv.org/abs/2209.01188) that goes into details about our research and what happens in Petals under the hood.