<a href="https://colab.research.google.com/github/R3gm/InsightSolver-Colab/blob/main/Petals_and_Together_BLOOM_176B_GPU_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<img src="https://camo.githubusercontent.com/473dd9f992924d27457650251786464f72e54121ac6e9210add0f483ca849277/68747470733a2f2f692e696d6775722e636f6d2f3765523750616e2e706e67" width="40%">  
</div>


| Code Credits | Link |
| ----------- | ---- |
| 🎉 Repository | [![GitHub Repository](https://img.shields.io/github/stars/petals-infra/chat.petals.ml?style=social)](https://github.com/petals-infra/chat.petals.ml) |
| Original Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ervk6HPNS6AYVr3xVdQnY5a-TjjmLCdQ?usp=sharing) |
| 🚀 Online inference for Guanaco-65B, LLaMA-65B, BLOOM and BLOOMZ | [Chat Petals](http://chat.petals.ml) |
| 🚀 Alternative using the Together API | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sambanovasystems/BLOOMChat) |
| 🔥 Discover More Colab Notebooks | [![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github)](https://github.com/R3gm/Colab-resources/) |

Petals is a project that leverages [Hivemind](https://github.com/learning-at-home/hivemind).

Hivemind is a PyTorch library for decentralized deep learning, enabling distributed training without a master node, fault-tolerant backpropagation, and decentralized parameter averaging. It allows training large models on hundreds of computers across the Internet. You can install it with pip or from source and find examples in the documentation. Contributions are welcome.

---




# Getting started with Petals

This notebook will guide you through the basics of Petals &mdash; a system for inference and fine-tuning of 100B+ language models without the need for high-end GPUs. With Petals, you can pool compute resources with other people over the Internet and run large language models such as Guanaco-65B, LLaMA-65B, or the 176B-parameter [BLOOM](https://huggingface.co/bigscience/bloom) and [BLOOMZ](https://huggingface.co/bigscience/bloomz) (the latter two are comparable in size to GPT-3).

💬 If you run into any issues while running this notebook, let us know in the **[#running-a-client](https://discord.gg/J29mCBNBvm)** channel of our Discord!

So, let's get started! First, let's install [the Petals package](https://github.com/bigscience-workshop/petals):


In [None]:
%pip install -q petals


## Step 1. The easiest way to generate text 🚀

Let's start with the easiest task &mdash; creating a __`DistributedBloom`__ model and using it for generating text.

This machine will download a small part of the model weights (~8 GB out of 352 GB) and rely on other computers in the network to run the rest of the model. Downloading the local part of the weights usually takes ~3 minutes.
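As a rough sanity check on the sizes quoted above, the sketch below redoes the arithmetic. It assumes 16-bit weights (2 bytes per parameter) and takes the embedding shape (250880 × 14336) from the model printout in Step 3. Treating the embedding matrix as the bulk of the locally stored weights is an assumption made for illustration, not something this notebook states.

```python
# Back-of-the-envelope weight sizes, assuming 2 bytes per parameter (16-bit weights).
BYTES_PER_PARAM = 2
TOTAL_PARAMS = 176e9                       # BLOOM-176B
VOCAB_SIZE, HIDDEN_SIZE = 250_880, 14_336  # embedding shape shown in Step 3


def size_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Return the storage size in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


total_gb = size_gb(TOTAL_PARAMS)
embedding_gb = size_gb(VOCAB_SIZE * HIDDEN_SIZE)

print(f"full model:      ~{total_gb:.0f} GB")
print(f"word embeddings: ~{embedding_gb:.2f} GB")
```

Under these assumptions the full model comes out at about 352 GB, and the word embeddings alone account for roughly 7 GB of the local download.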

🧑‍🏫 __Note:__ We suggest starting with the regular BLOOM, but you can also use [BLOOMZ](https://huggingface.co/bigscience/bloomz) &mdash; a version of BLOOM fine-tuned to better follow human instructions in the zero-shot regime. To load this model, set `MODEL_NAME = "bigscience/bloomz-petals"`.

In [None]:
import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()


Now, let's try to generate something by calling the __`model.generate()`__ method.

The first call to this method takes ~5 sec to connect to the Petals swarm. Once we do that, you should expect generation speed of 1&ndash;1.5 sec/token. If you don't have enough GPUs to host the entire model, this is much faster than what you get with other methods, such as offloading (which takes at least 10&ndash;20 sec/token).
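To put these per-token figures in perspective, here is a small estimate of wall-clock time for a 100-token completion. The helper name `estimate_seconds` and the choice of 100 tokens are illustrative assumptions; the rates are the ones quoted above, and real speeds depend on the state of the swarm.

```python
def estimate_seconds(n_tokens, sec_per_token, connect_sec=0.0):
    """Rough wall-clock estimate: one-off connection time plus per-token cost."""
    return connect_sec + n_tokens * sec_per_token


N_TOKENS = 100  # hypothetical completion length

# Petals: ~5 sec to connect, then 1-1.5 sec/token (figures quoted above).
petals_range = (estimate_seconds(N_TOKENS, 1.0, 5.0), estimate_seconds(N_TOKENS, 1.5, 5.0))
# Offloading: at least 10-20 sec/token (figures quoted above).
offload_range = (estimate_seconds(N_TOKENS, 10.0), estimate_seconds(N_TOKENS, 20.0))

print(f"Petals:     {petals_range[0]:.0f}-{petals_range[1]:.0f} sec")
print(f"Offloading: {offload_range[0]:.0f}-{offload_range[1]:.0f} sec")
```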

In [None]:
inputs = tokenizer('A cat in French is "', return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0]))

The `model.generate()` method runs **greedy** generation by default. However, you can choose other generation methods like **top-p/top-k sampling** or **beam search** by passing the corresponding parameters (you'll see an example in a bit). You can even implement custom generation methods (we'll cover that in **Step 5**).
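For intuition about how greedy decoding differs from top-p (nucleus) sampling, here is a minimal pure-Python sketch. It works on a plain list of logits rather than real model outputs, and the function names are illustrative, not part of Petals or `transformers`.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_p(logits, top_p=0.9, temperature=0.75, rng=random):
    """Sample one token id from the nucleus of the temperature-scaled distribution."""
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -3.0]
print("greedy:", greedy(logits))
print("top-p sample:", sample_top_p(logits, rng=random.Random(0)))
```

Greedy decoding is deterministic, while top-p sampling trades some determinism for variety by drawing only from the most likely tokens.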

🔏 **Note:** Your data is processed by other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.

## Step 2. Chat bots and interactive generation 💬

If you'd like to talk to the model interactively, you can use the __inference session__ interface. This interface provides a simple way to print generated tokens on the fly or to build a chat bot that responds to a human's messages.

The inference session looks for a sequence of servers that will run successive inference steps and store past attention caches. This way, you don't need to rerun previous tokens through the transformer to generate each phrase. If one of the remote servers fails, Petals will automatically find a replacement and regenerate a small part of the caches.
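The idea of finding a chain of servers and replacing one that fails can be pictured with a toy model. The sketch below is not Petals' actual routing code: the server names, the block ranges, and the greedy "reach the furthest" strategy are all assumptions chosen for illustration. The only figure taken from this notebook is the 70 transformer blocks.

```python
def find_chain(servers, n_blocks):
    """Greedily pick servers whose block ranges cover [0, n_blocks).

    servers: dict mapping a server name to a half-open block range (start, end).
    Returns a list of (name, first_block, end_block) hops, or None if some block is unserved.
    """
    chain, pos = [], 0
    while pos < n_blocks:
        # Among servers hosting block `pos`, prefer the one that reaches the furthest.
        candidates = [(end, name) for name, (start, end) in servers.items() if start <= pos < end]
        if not candidates:
            return None
        end, name = max(candidates)
        chain.append((name, pos, min(end, n_blocks)))
        pos = end
    return chain


# Hypothetical swarm: each server hosts a contiguous span of the 70 blocks.
swarm = {"a": (0, 30), "b": (20, 50), "c": (45, 70), "d": (25, 60)}
print(find_chain(swarm, 70))

# If server "d" drops out, a replacement chain is built from the remaining servers.
remaining = {name: span for name, span in swarm.items() if name != "d"}
print(find_chain(remaining, 70))
```

In this toy model, only the hops that change after a failure would need their caches rebuilt, which mirrors the "regenerate a small part of the caches" behavior described above.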

Let's see how to use an inference session to write a simple chat bot that shows tokens as soon as they are generated:

In [None]:
with model.inference_session(max_length=512) as sess:
    while True:
        prompt = input('Human: ')
        if prompt == "":
            break
        prefix = f"Human: {prompt}\nFriendly AI:"
        prefix = tokenizer(prefix, return_tensors="pt")["input_ids"].cuda()
        print("Friendly AI:", end="", flush=True)

        while True:
            outputs = model.generate(
                prefix, max_new_tokens=1, do_sample=True, top_p=0.9, temperature=0.75, session=sess
            )
            outputs = tokenizer.decode(outputs[0, -1:])
            print(outputs, end="", flush=True)
            if "\n" in outputs:
                break
            prefix = None  # Prefix is passed only for the 1st token of the bot's response
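Passing `prefix = None` after the first token works because the session keeps attention caches for the tokens it has already seen. The toy count below (plain Python, no Petals calls) compares how many token positions would be fed through the model with and without such a cache. The function name and the simplified cost model are illustrative assumptions.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model while generating `new_tokens` tokens."""
    if new_tokens == 0:
        return 0
    if use_cache:
        # First step feeds the whole prompt; each later step feeds only the newest token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-runs the full sequence generated so far.
    return sum(prompt_len + k for k in range(new_tokens))


prompt_len, new_tokens = 10, 5
print("with cache:   ", positions_processed(prompt_len, new_tokens, use_cache=True))
print("without cache:", positions_processed(prompt_len, new_tokens, use_cache=False))
```

The gap grows quadratically with the length of the response, which is why reusing the session matters for long conversations.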

### 📦 Making apps that use Petals

If you develop a tool for other people, you can wrap up the code using Petals into a user-friendly web app, such as [chat.petals.ml](http://chat.petals.ml). Under the hood, this app may connect to a lightweight [HTTP endpoint](https://github.com/borzunov/petals-chat) for inference that forwards all requests to the Petals swarm.

📋 **Note:** If you build an app running BLOOM with Petals, make sure it follows BLOOM's [terms of use](https://huggingface.co/bigscience/bloom).

<div align="center">
<br>
<img src="https://i.imgur.com/p2nwiho.png" width="40%">  
</div>

## Step 3. How does it work? 🛠️

The `model` you are running is the actual BLOOM-176B, but only a part of it is loaded into your machine's GPU. Let's have a look under the hood:

In [None]:
model.transformer

DistributedBloomModel(
  (word_embeddings): Embedding(250880, 14336)
  (word_embeddings_layernorm): LayerNorm((14336,), eps=1e-05, elementwise_affine=True)
  (h): RemoteSequential(modules=bigscience/bloom-petals.0..bigscience/bloom-petals.69)
  (ln_f): LayerNorm((14336,), eps=1e-05, elementwise_affine=True)
)

As you can see, word embeddings and some other layers are regular PyTorch modules hosted on your machine, but the rest of the model (e.g., the transformer blocks) is encased in the __RemoteSequential__ class. This is an advanced PyTorch module that runs on a distributed swarm of other machines.

Still, you can access individual layers and their outputs, as well as run forward/backward through them:

In [None]:
first_five_layers = model.transformer.h[0:5]
first_five_layers

RemoteSequential(modules=bigscience/bloom-petals.0..bigscience/bloom-petals.4)

In [None]:
dummy_inputs = torch.randn(1, 3, 14336, dtype=torch.bfloat16, requires_grad=True)
outputs = first_five_layers(dummy_inputs)
outputs

tensor([[[-2.8906, -0.6484, -1.9141,  ...,  3.5938,  1.2656, -2.0156],
         [ 0.5156, -2.2031, -0.5820,  ...,  1.1250, -0.0127, -1.3594],
         [-0.5508, -0.9766, -0.7695,  ...,  1.4766, -1.3594, -0.3086]]],
       dtype=torch.bfloat16, grad_fn=<_RemoteSequentialAutogradFunctionBackward>)

In [None]:
loss = torch.mean((outputs - torch.ones_like(outputs)) ** 2)
loss.backward()  # backpropagate through the internet
print("Grad w.r.t. inputs:", dummy_inputs.grad.flatten())

Grad w.r.t. inputs: tensor([-0.0137,  0.0459,  0.0242,  ...,  0.0010, -0.0005,  0.0002],
       dtype=torch.bfloat16)


In general, you can mix and match distributed layers like in regular PyTorch and even insert and train your own layers (e.g., adapters) between the pre-trained ones.

<div align="center">
<img src="https://camo.githubusercontent.com/58732a64488a9be928e25f3e60e3692b989ffe212ac86cb4902d8df20a042b03/68747470733a2f2f692e696d6775722e636f6d2f525459463379572e706e67" width="80%">
</div>

<p align="center">📜 <b><a href="https://arxiv.org/pdf/2209.01188.pdf">Read details in our paper</a></b></p>

## Step 4. Adding a trainable adapter 🏋️

While the remotely hosted transformer blocks are **frozen** to keep the pretrained model the same for all users, using **parameter-efficient adapters** (small trainable layers added between the pretrained blocks of the model, such as [LoRA](https://arxiv.org/abs/2106.09685)) or **trainable prompts** (trainable inputs added before the inputs to the model, such as in [P-Tuning v2](https://arxiv.org/abs/2110.07602)) is usually enough to make BLOOM solve a variety of downstream tasks.

Below, we show an example of how to add a basic **trainable** linear adapter between the 6th and 7th transformer blocks (indices 5 and 6) of the pretrained model. The adapter's weights and the corresponding optimizer statistics are stored locally:

In [None]:
import torch.nn as nn
import torch.nn.functional as F


class BloomBasedClassifier(nn.Module):
  def __init__(self, model):
    super().__init__()
    self.distributed_layers = model.transformer.h
    self.adapter = nn.Sequential(nn.Linear(14336, 32), nn.Linear(32, 14336))
    self.head = nn.Sequential(nn.LayerNorm(14336), nn.Linear(14336, 2))

  def forward(self, embeddings):
    hidden_states = self.distributed_layers[0:6](embeddings)
    hidden_states = self.adapter(hidden_states)
    hidden_states = self.distributed_layers[6:10](hidden_states)
    pooled_states = torch.mean(hidden_states, dim=1)
    return self.head(pooled_states)

In [None]:
classifier = BloomBasedClassifier(model).cuda()
opt = torch.optim.Adam(classifier.parameters(), 3e-5)
inputs = torch.randn(3, 2, 14336, device='cuda')
labels = torch.tensor([1, 0, 1], device='cuda')

for i in range(5):
  loss = F.cross_entropy(classifier(inputs), labels)
  print(f"loss[{i}] = {loss.item():.3f}")
  opt.zero_grad()
  loss.backward()
  opt.step()

print('predicted:', classifier(inputs).argmax(-1))  # expected: 1, 0, 1

loss[0] = 0.706
loss[1] = 0.554
loss[2] = 0.446
loss[3] = 0.362
loss[4] = 0.294
predicted: tensor([1, 0, 1], device='cuda:0')
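To see why this kind of training fits on a single machine, it helps to count the trainable parameters in the classifier above. The sketch below does the arithmetic in plain Python using the layer shapes from `BloomBasedClassifier`; the helper names are illustrative, and the counts follow the usual weight-plus-bias layout of `nn.Linear` and `nn.LayerNorm`.

```python
HIDDEN = 14_336      # hidden size used in the classifier above
BOTTLENECK = 32      # inner width of the adapter
N_CLASSES = 2


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def layernorm_params(dim):
    """Scale and shift vectors of a layer norm with affine parameters."""
    return 2 * dim


adapter = linear_params(HIDDEN, BOTTLENECK) + linear_params(BOTTLENECK, HIDDEN)
head = layernorm_params(HIDDEN) + linear_params(HIDDEN, N_CLASSES)
trainable = adapter + head

print(f"adapter: {adapter:,} parameters")
print(f"head:    {head:,} parameters")
print(f"total:   {trainable:,} parameters "
      f"({trainable / 176e9:.6%} of a 176B-parameter model)")
```

For comparison, a single full-width `nn.Linear(14336, 14336)` layer would hold over 200 million parameters, so the bottleneck keeps the locally trained part small.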


## Step 5. Using custom sampling methods 🎰

The __`model.inference_session()`__ interface in Petals also allows you to write custom inference code. You can use it to implement any sampling algorithm you want, or to write a custom beam search that forbids the model from using swear words.

Below, let's see how we can reimplement the standard `model.generate()` interface by making forward passes through all the layers manually:

In [None]:
text = "What is AI? Answer:"
token_ids = tokenizer(text, return_tensors="pt")["input_ids"].cuda()
max_length = 1000
with torch.inference_mode():
    with model.inference_session(max_length=max_length) as sess:
        while len(text) < max_length:
            embs = model.transformer.word_embeddings(token_ids)
            embs = model.transformer.word_embeddings_layernorm(embs)

            h = sess.step(embs)
            h_last = model.transformer.ln_f(h[:, -1])
            logits = model.lm_head(h_last)

            next_token = logits.argmax(dim=-1)
            text += tokenizer.decode(next_token)
            token_ids = next_token.reshape(1, 1)
            print(text)

What is AI? Answer: AI
What is AI? Answer: AI is
What is AI? Answer: AI is a
What is AI? Answer: AI is a computer
What is AI? Answer: AI is a computer program
What is AI? Answer: AI is a computer program that
What is AI? Answer: AI is a computer program that can
What is AI? Answer: AI is a computer program that can think
What is AI? Answer: AI is a computer program that can think and
What is AI? Answer: AI is a computer program that can think and learn
What is AI? Answer: AI is a computer program that can think and learn like
What is AI? Answer: AI is a computer program that can think and learn like a
What is AI? Answer: AI is a computer program that can think and learn like a human
What is AI? Answer: AI is a computer program that can think and learn like a human.
What is AI? Answer: AI is a computer program that can think and learn like a human. It
What is AI? Answer: AI is a computer program that can think and learn like a human. It is
What is AI? Answer: AI is a computer program th
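The `argmax` step in the loop above is the natural place for custom logic. As a minimal sketch of the "forbidden words" idea, the function below masks a set of banned token ids before picking the best remaining token. It operates on a plain list of scores, and `pick_token` and `banned_ids` are hypothetical names; with real tensors, one would typically set the banned logits to `float("-inf")` before calling `argmax`.

```python
import math


def pick_token(logits, banned_ids=()):
    """Greedy choice over logits, skipping any token id listed in banned_ids."""
    masked = list(logits)
    for token_id in banned_ids:
        masked[token_id] = -math.inf  # a banned token can never win the argmax
    return max(range(len(masked)), key=lambda i: masked[i])


scores = [0.1, 3.0, 2.0]
print(pick_token(scores))                  # unconstrained greedy choice
print(pick_token(scores, banned_ids={1}))  # best token once id 1 is forbidden
```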

## Step 6. Sharing is caring 🤗

We developed Petals to be a community-run system, so we rely on people sharing their GPUs to increase the swarm's capacity. If you have GPUs that are not always busy, please **consider running a Petals server.** You can pause it at any time if you want to use the GPUs for something else. As a bonus, people running a server get a speedup when using Petals, since a larger part of the model is hosted locally.

<br>

🐋 You can run our [Docker](https://www.docker.com) image (works on Linux, macOS, and Windows with [WSL2](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl)):

```
sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main python -m petals.cli.run_server bigscience/bloom-petals --port 31330
```

🐍 Or run these commands in an [Anaconda](https://www.anaconda.com) env (requires Linux and Python 3.7+):

```
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -U petals
python -m petals.cli.run_server bigscience/bloom-petals
```

<br>

📚 See [FAQ](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#running-a-server) to learn how to configure the server to use multiple GPUs, address common issues, etc.

You can also host [BLOOMZ](https://huggingface.co/bigscience/bloomz), a version of BLOOM fine-tuned to follow human instructions in the zero-shot regime — just replace `bloom-petals` with `bloomz-petals`.

🔒 Hosting a server does not allow others to run custom code on your computer. Learn more about security [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety).

💬 If you have any issues or feedback, let us know on [our Discord server](https://discord.gg/D9MwApKgWa)!

## Step 7. Using other fine-tuning and prompt-tuning methods

While you can write your own custom adapters, Petals implements several [standard](https://arxiv.org/abs/2104.08691) [methods](https://arxiv.org/abs/2101.00190) for parameter-efficient fine-tuning. We provide a couple of advanced examples in our GitHub repository:

- Training a personified chatbot: [notebook](https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-personachat.ipynb)

- Fine-tuning BLOOM for semantic text classification: [notebook](https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb)

## What's next?

Congratulations on finishing our tutorial! Now, you are familiar with how to use Petals for different tasks, how it works under the hood, and how to increase its capacity.

You can find a few other helpful resources below:

* __More about Petals.__ The [README](https://github.com/bigscience-workshop/petals#readme) file in our GitHub repository has links to more Petals-related materials, including instructions for starting your own swarm (possibly, with a model other than BLOOM).

* __Discord server.__ If you have any feedback, questions, or technical issues, please [join our Discord server](https://discord.gg/D9MwApKgWa) and let us know. If you want to build something based on Petals, we'd be happy to hear what you are up to.

* __Academic paper.__ We have released a [paper](https://arxiv.org/abs/2209.01188) that goes into details about our research and what happens in Petals under the hood.

# Other interesting projects

- Together: A decentralized cloud for artificial intelligence. Open. Scalable.

  https://www.together.xyz