feat: serve adapter layers (#52)
aarnphm committed Jun 23, 2023
1 parent 5981e49 commit dfca956
Showing 33 changed files with 1,895 additions and 495 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
nightly-requirements.txt linguist-generated=true
* text=auto eol=lf
7 changes: 3 additions & 4 deletions README.md
@@ -346,10 +346,9 @@ async def prompt(input_text: str) -> str:

OpenLLM seamlessly integrates with Hugging Face Agents.

> **Warning** The Hugging Face Agent is still in the experimental stage. It is
> recommended to OpenLLM with
> `pip install -r nightly-requirements.generated.txt` to get the latest API
> update for Hugging Face agent.
> **Warning** The HuggingFace Agent is still in the experimental stage. It is
> recommended to install OpenLLM with `pip install -r nightly-requirements.txt`
> to get the latest API updates for the HuggingFace agent.
```python
import transformers
45 changes: 45 additions & 0 deletions changelog.d/52.feature.md
@@ -0,0 +1,45 @@
#### Serving LLMs with fine-tuned LoRA and QLoRA adapter layers

The given fine-tuned weights can then be served with the model via
`openllm start`:

```bash
openllm start opt --model-id facebook/opt-6.7b --adapter-id /path/to/adapters
```

If you just wish to try a pretrained adapter checkpoint, you can pass its
Hugging Face Hub ID to `--adapter-id`:

```bash
openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora
```

To serve multiple adapters, pass `--adapter-id` multiple times, optionally
naming each layer with the `<adapter-id>:<adapter-name>` format:

```bash
openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora --adapter-id aarnphm/opt-6.7b-lora:french_lora
```

By default, the first `adapter-id` will be the default LoRA layer, but users
can optionally change which LoRA layer to use for inference via
`/v1/adapters`:

```bash
curl -X POST http://localhost:3000/v1/adapters --json '{"adapter_name": "vn_lora"}'
```
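The same switch can also be made from Python; below is a minimal sketch using
the `requests` library, assuming a server listening on `localhost:3000` as in
the curl example above:

```python
import requests

# Switch the active LoRA layer to "vn_lora" for subsequent inference requests.
response = requests.post(
    "http://localhost:3000/v1/adapters",
    json={"adapter_name": "vn_lora"},
)
response.raise_for_status()
```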

> Note that when using multiple `adapter-name` and `adapter-id` pairs, it is
> recommended to switch back to the default adapter before sending inference
> requests, to avoid any performance degradation.

To include the adapters in the Bento, one can also provide `--adapter-id` to
`openllm build`:

```bash
openllm build opt --model-id facebook/opt-6.7b --adapter-id ...
```

### Rework

The configuration builder has been separated out, making it more flexible for
future configuration generation.
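
As a rough, hypothetical illustration of that direction (the names below are
not the actual OpenLLM internals), a separated builder keeps field
accumulation apart from the final configuration generation:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ConfigBuilder:
    """Hypothetical sketch: accumulate fields, then generate a config mapping."""

    _fields: dict[str, Any] = field(default_factory=dict)

    def with_field(self, key: str, value: Any) -> ConfigBuilder:
        self._fields[key] = value
        return self

    def build(self) -> dict[str, Any]:
        # Generation lives in one place, so new output targets can be added
        # later without touching the accumulation API.
        return dict(self._fields)


config = (
    ConfigBuilder()
    .with_field("model_id", "facebook/opt-6.7b")
    .with_field("adapter_id", "aarnphm/opt-6.7b-lora")
    .build()
)
```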


4 changes: 2 additions & 2 deletions pyproject.toml
@@ -58,15 +58,15 @@ requires-python = ">=3.8"
# NOTE: Don't modify project.optional-dependencies
# as it is managed by ./tools/update-optional-dependencies.py
[project.optional-dependencies]
agents = ["transformers[agents]", "diffusers", "soundfile"]
agents = ["transformers[agents]>=4.30", "diffusers", "soundfile"]
all = [
"openllm[chatglm]",
"openllm[starcoder]",
"openllm[falcon]",
"openllm[agents]",
"openllm[flan-t5]",
"openllm[fine-tune]",
"openllm[openai]",
"openllm[flan-t5]",
]
chatglm = ["cpm_kernels", "sentencepiece"]
falcon = ["einops", "xformers", "safetensors"]
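For reference, these optional dependency groups use pip's extras syntax; the
`agents` extra (which now requires `transformers[agents]>=4.30`) would
typically be installed as:

```bash
pip install "openllm[agents]"
```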
20 changes: 20 additions & 0 deletions src/openllm/__init__.py
@@ -26,7 +26,9 @@
from __future__ import annotations

import logging
import os
import typing as t
import warnings

from . import utils as utils
from .__about__ import __version__ as __version__
@@ -39,6 +41,24 @@

utils.configure_logging()
logging.basicConfig(level=logging.NOTSET)
else:
# configuration for bitsandbytes before import
os.environ["BITSANDBYTES_NOWELCOME"] = os.environ.get("BITSANDBYTES_NOWELCOME", "1")
# The following warnings come from bitsandbytes and are probably not that
# important for users to see when DEBUG is False
warnings.filterwarnings(
"ignore", message="MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization"
)
warnings.filterwarnings(
"ignore", message="MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization"
)
warnings.filterwarnings(
"ignore",
message=(
"The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization"
" are unavailable."
),
)


_import_structure = {
