Run a custom model with Petals

Starting with Petals 1.2.0, you don't have to convert a model to a special Petals-compatible format: you can serve it directly from a Hugging Face Hub repository (e.g., you can host smaller versions of BLOOM and LLaMA off-the-shelf).
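
For example, here is a minimal sketch of hosting and querying a small off-the-shelf model (the checkpoint name is just an illustration; any supported architecture works, and at least one server must already be hosting the same repo):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Start at least one server for the same repo first, e.g.:
#   python -m petals.cli.run_server bigscience/bloom-560m
model_name = "bigscience/bloom-560m"  # a small BLOOM checkpoint, used here as an example

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A quick test:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```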

Still, Petals supports only a predefined set of model architectures defined in the petals.models package. If you'd like to support a new architecture, you need to copy the src/petals/models/bloom or src/petals/models/llama directory and update all files to work with your new model.

We recommend doing that in the following order:

  1. Prerequisites:

    • Ensure that the model weights are available on the Hugging Face Hub (if necessary, you can use a private repo and the use_auth_token argument in both the Petals client and server).
    • Ensure that you have a small version of the model, so you can compare the Petals outputs to the outputs of a model running locally on your GPU.
    • If you're stuck, don't hesitate to reach out to us on Discord!
  2. Edit config.py and __init__.py:

    • Make sure that the config is correctly loaded from a Hugging Face Hub repo when using AutoDistributedConfig.from_pretrained(...) (see the config-loading sketch after this list).
  3. Edit block.py:

    • Make sure that you can run a Petals server with your model's blocks.
    • Make sure the server returns correct results for forward and backward passes (the outputs are close to the ones of a locally hosted block).
    • Pay attention to the dimension order in attention caches (both keys and values), since different implementations use different dimension orders (e.g., see the dimension reordering code in llama/block.py).
    • Run the server with --throughput eval to test inference code and check that you have no shape errors.
  4. Edit model.py:

    • Create distributed model wrappers using code from the 🤗 Transformers implementation.
    • Check that you can run a Petals client and get correct results for inference, forward, and backward passes with all model types (the outputs should be close to those of a locally hosted model). For a quick end-to-end check, see the comparison sketch after this list.
    • Check that AutoDistributedModel.from_pretrained(...), AutoDistributedModelForCausalLM.from_pretrained(...), and similar functions correctly load the model from Hugging Face Hub.
  5. (optional) Share your code by making a pull request to Petals:

    • We'll review your pull request and may add it to the repo if the model is worth maintaining by our team.
    • If appropriate, we may add it to the health.petals.ml and chat.petals.dev services.
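
For step 2, a minimal sketch of the config check (the repo name is hypothetical, and this assumes AutoDistributedConfig is exported from the top-level petals package, as in recent Petals versions):

```python
from petals import AutoDistributedConfig

# Hypothetical repo name; replace it with your model's Hugging Face Hub repo.
config = AutoDistributedConfig.from_pretrained("your-org/your-model")

# The returned object should be an instance of your new distributed config class.
print(type(config))
print(config)
```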
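
For steps 3 and 4, a quick end-to-end sanity check is to compare the distributed model's outputs against the same checkpoint loaded locally with 🤗 Transformers. A minimal sketch (the repo name is hypothetical, at least one Petals server must already be hosting it, and the achievable tolerance depends on the dtype and quantization used by the servers):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "your-org/your-small-model"  # hypothetical repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("A quick test:", return_tensors="pt")["input_ids"]

# Reference model running locally vs. the same model served via Petals
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
dist_model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

with torch.no_grad():
    ref_logits = ref_model(inputs).logits
    dist_logits = dist_model(inputs).logits

# The maximum difference should be small (exact tolerance depends on server dtypes).
print(torch.abs(ref_logits - dist_logits).max())
```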