FAQ: Frequently asked questions

General

💬 If you have any questions not covered here, feel free to ask them in the #discussion channel of our Discord.

  1. Why is the platform named "Petals"?

    While Petals now supports different language models, it was originally developed for the BigScience BLOOM model. "Petals" was a metaphor for people serving parts of the model (together, they host the entire model — BLOOM).

  2. Where can I find a more detailed description of how Petals works?

    See our research paper.

  3. Can I use Petals to train a large language model (LLM)?

    Petals supports fine-tuning of pretrained LLMs with parameter-efficient adapters or trainable prompts, such as P‑Tuning v2. In this case:

    • Transformer block weights stored on Petals servers remain unchanged, since these weights are shared among all clients that use Petals.
    • You can add trainable prompts or a small number of trainable parameters (usually < 5% of LLM weights) stored locally. In most cases, this amount of trainable parameters is enough to adapt the LLM to a specific downstream task, since the task-specific information is much smaller than all the knowledge stored inside the original LLM.
    • You can fine-tune the input and output embeddings, since they are stored locally on the machine running the Petals client.

    Petals does not support training LLMs from scratch, since that would require updating the transformer block weights. Instead, you can use hivemind - our library for decentralized deep learning, which Petals uses under the hood. We have already trained multiple models this way; however, they are much smaller than BLOOM and required much less compute for training.

  4. What's the motivation for people to host model layers in the public swarm?

    People who run inference and fine-tuning themselves get a certain speedup if they host a part of the model locally. Some may also be motivated to "give back" to the community that helps them run the model (similarly to how BitTorrent users help others by sharing data they have already downloaded).

    Since this may not be enough for everyone, we are also working on introducing explicit incentives ("bloom points") for people donating their GPU time to the public swarm. Once this system is ready, we will display the top contributors on our website. People who have earned these points will be able to spend them on inference/fine-tuning with higher priority or increased security guarantees, or (maybe) exchange them for other rewards.

  5. Will Petals incentives be based on crypto, blockchain, etc.?

    No, we are working on a centralized incentive system similar to the AI Horde kudos, even though Petals is a fully decentralized system in all other aspects. We do not plan to provide a service to exchange these points for money, so you should see these incentives as "game" points designed to be spent inside our system.

    Petals is an ML-focused project designed for ML researchers and engineers; it does not have anything to do with finance. We decided to make the incentive system centralized because it is much easier to develop and maintain, so we can focus on building features useful for ML researchers.

Contributing

💬 If you have any questions not covered here, feel free to ask them in the #dev channel of our Discord.

  1. How can I help the Petals development?

    Please take a look at the issues with the good first issue and help wanted tags and check if there's something you'd like to work on. If you decide to work on something, let us know in the corresponding issue.

    We add the good first issue tag to the issues that (a) can be solved in 1-2 days of work without much expertise and (b) provide a good opportunity to study the code of Petals and hivemind (our library for decentralized deep learning, which Petals uses under the hood). If you address some of them, we'd be happy to collaborate on more impactful/fundamental tasks more closely.

    Feel free to ping the devs in Discord if you have any questions regarding these issues. Please expect a few iterations on your pull request before merging, since we strive to maintain high code quality.

  2. Do you have a style guide?

    We use black and isort for all pull requests. Before committing your code, simply run black . && isort . and you will be fine.
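
    For reference, both formatters are available on PyPI and can be run from the repository root:

    pip install black isort  # install the formatters
    black . && isort .       # auto-format the code before committing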

  3. How to test my changes?

    For small changes, you can submit a draft pull request and use our CI for testing after receiving a maintainer's pre-approval.

    For larger changes, you may want to test them on a real swarm. If your changes only affect the client, you are free to test them on the public swarm. If your changes affect the server, please do not connect it to the public swarm - if the server is broken, it may return incorrect results for other users. Instead, you can set up a small private swarm with a test model (one GPU is usually enough for this):

    export MODEL_NAME=bigscience/bloom-560m
    
    python -m petals.cli.run_server $MODEL_NAME --block_indices 0:12 \
      --identity_path tests/test.id --host_maddrs /ip4/127.0.0.1/tcp/31337 --new_swarm  &> server1.log &
    sleep 5  # wait for the first server to initialize DHT
    
    python -m petals.cli.run_server $MODEL_NAME --block_indices 12:24 \
      --initial_peers SEE_THE_OUTPUT_OF_THE_1ST_PEER &> server2.log &
    
    tail -f server1.log server2.log  # view logs for both servers

    Then you can launch pytest and/or test functionality specific to your changes manually:

    export MODEL_NAME=bigscience/bloom-560m REF_NAME=bigscience/bloom-560m
    export INITIAL_PEERS=/ip4/127.0.0.1/tcp/31337/p2p/QmS9KwZptnVdB9FFV7uGgaTq4sEKBwcYeKZDfSpyKDUd1g
    PYTHONPATH=. pytest tests --durations=0 --durations-min=1.0 -v

    After you're done, you can terminate the servers and ensure that no zombie processes are left with pkill -f petals.cli.run_server && pkill -f p2p.

    The automated tests use a more complex server configuration that can be found here.

Running a client

💬 If you have any questions not covered here, feel free to ask them in the #running-a-client channel of our Discord.

  1. What are the minimal requirements for running a client?

    • The client works on both CPU-only and GPU-enabled machines. During inference, only a small part of the computations is done locally, so a GPU is only a bit faster than a CPU. During fine-tuning, you may need a GPU for decent performance, depending on your loss and optimizer.
    • See CPU/GPU memory requirements here.
    • Petals requires Linux and Python 3.8+. If you use macOS or Windows, you can run code using Petals inside our Docker image.
  2. How does the client use the network? Do I need to configure the firewall?

    The client makes outgoing TCP connections to different TCP ports. This includes sending requests to servers (which may use any TCP port) and downloading model weights from Hugging Face Model Hub (which uses TCP ports 80/443).

    Most firewalls allow all outgoing connections by default, so usually you don't need to configure anything.

  3. How can I get more detailed logs that include IP addresses, ports, etc.?

    Use the environment variables HIVEMIND_LOGLEVEL=DEBUG and GOLOG_LOG_LEVEL=DEBUG. You can set them at the beginning of a Jupyter or Colab notebook:

    %env HIVEMIND_LOGLEVEL=DEBUG
    %env GOLOG_LOG_LEVEL=DEBUG
    

    Or in Bash, while running a script that uses the Petals client:

    export HIVEMIND_LOGLEVEL=DEBUG
    export GOLOG_LOG_LEVEL=DEBUG
    python run_client.py
  4. How can I hide Petals logs?

    Use the environment variable HIVEMIND_LOGLEVEL=ERROR. You can set it in the same way as in the answer to the previous question.
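
    For example, in Bash:

    export HIVEMIND_LOGLEVEL=ERROR
    python run_client.py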

Running a server

💬 If you have any questions not covered here, feel free to ask them in the #running-a-server channel of our Discord.

Installation

  1. What are the minimal requirements for running a server?

    • You can run the server on any machine with a GPU and 4+ GB GPU memory.
      • CPU-only servers are not recommended. They are slow and supported mostly for testing purposes.
    • You need 8 GB of RAM for each GPU you'd like to use for a server.
    • To run the server in an Anaconda env, you need Linux, CUDA 11.x or 12.0, and Python 3.8+.
    • To run on Windows, you need WSL - follow the Windows tutorial.
    • To run in Docker, you can use Linux, macOS, or Windows with WSL2.
    • For AMD GPUs, follow the AMD GPUs tutorial.
  2. Can I contribute using a CPU-only server or a mobile phone?

    No, these devices are too weak compared to desktop GPUs and can't make reasonable contributions at the moment.

  3. How can I deploy the server long-term, so that it starts automatically after reboot and I can watch its logs?

    On Linux, the easiest way to do this is to make Petals a systemd service. See how to do that here.
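
    As a minimal sketch (the user name, Python path, and model below are placeholders you should adjust for your setup), the service may be defined and started like this:

    sudo tee /etc/systemd/system/petals-server.service <<'EOF'
    [Unit]
    Description=Petals server
    After=network-online.target

    [Service]
    User=YOUR_USER
    ExecStart=/path/to/your/python -m petals.cli.run_server bigscience/bloom
    Restart=always

    [Install]
    WantedBy=multi-user.target
    EOF
    sudo systemctl daemon-reload
    sudo systemctl enable --now petals-server  # start now and after every reboot
    journalctl -u petals-server -f             # watch the logs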

Cloud providers

  1. How can I run Petals on Runpod?

    We made a Runpod template to easily create a pod with a Petals server hosting LLaMA-65B with Guanaco adapters. To use it, please create a Runpod account, open the template, specify the GPU you want to use, click "Continue", and then "Deploy".

    To change the model, click "Customize Deployment" and change repository/llama-65b-hf to your model name. If you want to add more adapters, please add them to the --adapters argument. If you'd like to use a private swarm due to data privacy/correctness concerns, please follow the "Launch your own swarm" tutorial.

Managing GPUs

  1. How to make the server use a specific GPU?

    If your machine has multiple GPUs, the Petals server uses only the first GPU by default. In an Anaconda env, run this before starting the server to make it use a different GPU:

    export CUDA_VISIBLE_DEVICES=0  # Insert the GPU index here, counting from zero

    In case of Docker, replace --gpus all with --gpus device=0 (with the correct index) in the Docker command.
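
    For example, a Docker launch pinned to the first GPU may look like this (the image name below follows the Petals README - adjust it and the model name if yours differ):

    sudo docker run --rm --gpus device=0 learningathome/petals:main \
        python -m petals.cli.run_server bigscience/bloom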

  2. How to make the server use multiple GPUs at once?

    We recommend running a separate Petals server for each GPU. You'd need to specify a different GPU index for each of them (see the answer to the previous question).
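
    For example, on a machine with two GPUs (the model name is just an example):

    CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server bigscience/bloom &> server_gpu0.log &
    CUDA_VISIBLE_DEVICES=1 python -m petals.cli.run_server bigscience/bloom &> server_gpu1.log &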

    This is a temporary solution; we are currently working on intra-host tensor parallelism for the Petals server. Once it's ready, the server will automatically use all available GPUs (unless you explicitly tell it not to) and run all computations in parallel. Our benchmarks show that it may give an almost linear speedup (e.g., requests will be processed ~1.9x faster on 2 GPUs, ~3.8x faster on 4 GPUs, and so on).

    We already have experimental code for tensor parallelism, but it is not well integrated yet and doesn't work properly with all GPU configurations. If you'd like to try it out, you can run a server with the --tensor_parallel_devices cuda:0 cuda:1 flag (specify as many cuda:N devices as the number of GPUs you'd like to use). Please ensure that the compute RPS measured in this case is indeed higher than the RPS you get without tensor parallelism - otherwise, you may end up hosting a slow server that would harm the swarm.
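
    For example, to run a server with tensor parallelism across two GPUs:

    python -m petals.cli.run_server bigscience/bloom --tensor_parallel_devices cuda:0 cuda:1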

  3. How can I leave some free GPU memory, so I can use it for other tasks without stopping the server?

    Add the --num_blocks argument with a number of blocks smaller than the automatically chosen one:

    python -m petals.cli.run_server bigscience/bloom --num_blocks 3

Served blocks

  1. How many blocks fit into my GPU?

    The server chooses the default number of blocks so that they fill all available GPU memory. If you want to leave some GPU memory for other tasks, or if you see out-of-memory errors, you can decrease the number of served blocks manually by adding the --num_blocks N parameter (N is a number).

    By default, one BLOOM/BLOOMZ block takes 3 GiB of GPU memory. This includes 2.5 GiB for weights in 8-bit and 0.5 GiB for attention caches (enough for 4 users doing concurrent inference with the max allowed sequence length). On top of that, you need 2-3 GiB per GPU for temporary tensors created during forward/backward passes. For example, a 16 GiB GPU can typically host about 4 blocks, since (16 - 3) / 3 ≈ 4.

  2. The server often switches blocks and downloads new ones, thus using a significant amount of network traffic. How can I change this behavior?

    You can make the server switch blocks less often by lowering the --balance_quality argument (the default value is 0.75). A reasonable value is --balance_quality 0.2 - in this case, the server will switch blocks only if doing so eventually leads to a 5x increase in the total swarm throughput.

    You can also disable block switching completely by setting --balance_quality 0 or pin the server to a certain range of blocks (see below).
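
    For example, to make the server switch blocks noticeably less often:

    python -m petals.cli.run_server bigscience/bloom --balance_quality 0.2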

  3. How to pin the server to a certain range of blocks?

    You may want to pin the server to a certain range of blocks if you observe that the servers currently hosting them are not very reliable.

    To do this, add an argument like --block_indices 43:46 (this makes the server host blocks 43, 44, and 45 - the last block in the range is not included). Note that the argument's value should contain : even if you'd like to host a single block (use the value n:n+1 in this case).
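
    For example, to serve blocks 43-45 of BLOOM:

    python -m petals.cli.run_server bigscience/bloom --block_indices 43:46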

    Please make sure that the range does not contain more blocks than your server can actually fit in its GPU memory. You should not make the server load more blocks than it chooses to host by default (i.e., without the --block_indices and --num_blocks arguments).

Disk cache

  1. How to limit the disk space the server can use?

    In the worst case, a server running for a long time may use a significant amount of disk space (up to 350 GB for BLOOM-176B or BLOOMZ-176B). This may happen when the server switches blocks many times, so that it needs to download a significant part of the model locally. However, the server always checks the amount of available space before loading new blocks, so you should not normally run out of disk space.

    Still, if you'd like to limit the disk space the server can use, please use the --max_disk_space argument, for example:

    python -m petals.cli.run_server bigscience/bloom --max_disk_space 10GB

    If the cache size is already larger than this value, the server will remove the oldest records before starting, so that the disk cache fits in the limit you've specified.

  2. How to clean the disk cache?

    You may want to completely drop the disk cache if it has become corrupted (so you get errors while loading blocks) or you have decided to stop hosting a server. To do this, run:

    rm -rv ~/.cache/petals

Network

  1. How does the server use the network? How do I set up the firewall?

    The server listens on one TCP port and accepts incoming connections from other peers. It also makes outgoing TCP connections to arbitrary TCP ports. This includes communicating with other servers (which may use any TCP port) and downloading model weights from the Hugging Face Model Hub (which uses TCP ports 80/443).

    If the server detects that your NAT/firewall doesn't allow incoming connections, it uses libp2p's circuit relay functionality. This means that the server opens a long-term connection to another peer, which routes requests to your server through this connection. In this case, your server will be marked as available through a Relay on the health monitor.

    Most firewalls allow all outgoing connections by default, so usually you don't need to configure anything to start. However, servers available only through relays are slower. If this is your case, you can set up port forwarding to improve performance (see the combined example after this list):

    • Ensure that your Internet provider gives you a public IP address (what's my IP?). Add the --public_ip 1.2.3.4 argument with this address.
    • Choose a specific port for the Petals server, for example, 31330. Add the --port 31330 argument with this port.
    • If you use Docker, set up port forwarding for the container. Add the -p 31330:31330 argument to the Docker command.
    • If you have a firewall or NAT, configure them to allow incoming connections to the chosen TCP port.
    • Start the server and ensure that it has Direct availability on the health monitor.
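
    Putting these steps together, a server with port forwarding set up may be started like this (the IP address and port are the example values from the steps above):

    python -m petals.cli.run_server bigscience/bloom --public_ip 1.2.3.4 --port 31330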

Throughput

  1. What does RPS mean when the server measures and reports its throughput?

    RPS is a measure of the server's throughput. When you start the server for the first time, it benchmarks its own compute and network RPS and takes the minimum:

    • The compute RPS is measured as the number of forward passes per second through one block with a batch of 16 tokens, multiplied by the batch size (= 16). This is designed to be a middle ground between the performance of inference (mostly used with 1 sequence) and fine-tuning (mostly used with large batches).

    • The network RPS is measured as the number of token activations per second the server can send or receive, given the throughput of its Internet connection.

    The throughput is mostly relevant for fine-tuning performance, since network and compute throughput are the main bottlenecks in this case. In contrast, inference requests are small and fast to process, so the throughput does not affect performance much, and the client-server latency becomes the main bottleneck (unless you generate many sequences in parallel or process a long prefix before generating new tokens).

    Since the overall swarm throughput is the minimum of throughputs among all model blocks, we use the throughput values to evenly assign blocks to servers. You can watch the current assignment along with the overall swarm state on the health monitor.

  2. My network/GPU configuration has changed. How can I re-evaluate the throughput?

    Add the --throughput eval argument. Also, if the benchmarks fail, you can use the --throughput 100 argument to specify the value manually (in RPS).
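
    For example, to force a re-benchmark on the next start:

    python -m petals.cli.run_server bigscience/bloom --throughput eval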