Feature/vllm #30
Merged
Commits (20)
683d926  Add vllm as feature and a llm_test_run_script (mo374z)
69837fa  small fixes in vllm class (mo374z)
7563712  differentiate between vllm and api inference (mo374z)
af6f9f8  set up experiment over multiple tasks and prompts (mo374z)
bc9997a  change csv saving (mo374z)
7958b86  add base llm super class (mo374z)
e82db35  add changes from PR review (mo374z)
0045de7  change some VLLM params (mo374z)
0b3c7cb  fix tensor parallel size to 1 (mo374z)
a73c378  experiment with batch size (mo374z)
1f68410  experiment with larger batch sizes (mo374z)
f5fe188  add continuous batch llm (mo374z)
1330a9e  remove arg (mo374z)
c6dbb7b  remove continuous batch inference try (mo374z)
42ab6c9  add batching to vllm (mo374z)
0be3d06  add batching in script (mo374z)
c5ac101  Add release notes and increase version number (timo282)
0eb701b  remove llm_test_run.py script (mo374z)
f4f9722  Merge branch 'feature/vllm' of https://github.com/finitearth/promptol… (mo374z)
fae0113  change system prompt (mo374z)
.flake8

@@ -1,3 +1,3 @@
 [flake8]
 max-line-length = 120
-ignore = F401, W503
+ignore = E731,E231,E203,E501,F401,W503
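For reference (not part of the diff itself): the newly ignored checks are E731 (assignment of a lambda expression), E231 (missing whitespace after ','), E203 (whitespace before ':'), and E501 (line too long); F401 (module imported but unused) and W503 (line break before binary operator) were already ignored.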
.gitignore

@@ -6,4 +6,5 @@ rsync_exclude.txt
__pycache__/
temp/
dist/
outputs/
poetry.lock
vllm.py (new file, @@ -0,0 +1,135 @@)

"""Module for running language models locally using the vLLM library."""

from logging import INFO, Logger

try:
    import torch
    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams
except ImportError as e:
    import logging

    logger = logging.getLogger(__name__)
    logger.warning(f"Could not import vllm, torch or transformers in vllm.py: {e}")

from promptolution.llms.base_llm import BaseLLM

logger = Logger(__name__)
logger.setLevel(INFO)


class VLLM(BaseLLM):
    """A class for running language models using the vLLM library.

    This class sets up a vLLM inference engine with specified model parameters
    and provides a method to generate responses for given prompts.

    Attributes:
        llm (vllm.LLM): The vLLM inference engine.
        tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
        sampling_params (vllm.SamplingParams): Parameters for text generation.

    Methods:
        get_response: Generate responses for a list of prompts.
    """

    def __init__(
        self,
        model_id: str,
        batch_size: int = 64,
        max_generated_tokens: int = 256,
        temperature: float = 0.1,
        top_p: float = 0.9,
        model_storage_path: str = None,
        token: str = None,
        dtype: str = "auto",
        tensor_parallel_size: int = 1,
        gpu_memory_utilization: float = 0.95,
        max_model_len: int = 2048,
        trust_remote_code: bool = False,
    ):
        """Initialize the VLLM with a specific model.

        Args:
            model_id (str): The identifier of the model to use.
            batch_size (int, optional): The batch size for text generation. Defaults to 64.
            max_generated_tokens (int, optional): Maximum number of tokens to generate. Defaults to 256.
            temperature (float, optional): Sampling temperature. Defaults to 0.1.
            top_p (float, optional): Top-p sampling parameter. Defaults to 0.9.
            model_storage_path (str, optional): Directory to store the model. Defaults to None.
            token (str, optional): Token for accessing the model - not used in the implementation yet.
            dtype (str, optional): Data type for model weights. Defaults to "auto".
            tensor_parallel_size (int, optional): Number of GPUs for tensor parallelism. Defaults to 1.
            gpu_memory_utilization (float, optional): Fraction of GPU memory to use. Defaults to 0.95.
            max_model_len (int, optional): Maximum sequence length for the model. Defaults to 2048.
            trust_remote_code (bool, optional): Whether to trust remote code. Defaults to False.

        Note:
            This method sets up a vLLM engine with specified parameters for efficient inference.
        """
        self.batch_size = batch_size
        self.dtype = dtype
        self.tensor_parallel_size = tensor_parallel_size
        self.gpu_memory_utilization = gpu_memory_utilization
        self.max_model_len = max_model_len
        self.trust_remote_code = trust_remote_code

        # Configure sampling parameters
        self.sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_generated_tokens)

        # Initialize the vLLM engine
        self.llm = LLM(
            model=model_id,
            tokenizer=model_id,
            dtype=self.dtype,
            tensor_parallel_size=self.tensor_parallel_size,
            gpu_memory_utilization=self.gpu_memory_utilization,
            max_model_len=self.max_model_len,
            download_dir=model_storage_path,
            trust_remote_code=self.trust_remote_code,
        )

        # Initialize tokenizer separately for potential pre-processing
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def get_response(self, inputs: list[str]):
        """Generate responses for a list of prompts using the vLLM engine.

        Args:
            inputs (list[str]): A list of input prompts.

        Returns:
            list[str]: A list of generated responses corresponding to the input prompts.

        Note:
            This method uses vLLM's batched generation capabilities for efficient inference.
        """
        # Apply the model's chat template to each input, with a fixed system prompt
        prompts = [
            self.tokenizer.apply_chat_template(
                [
                    {
                        "role": "system",
                        "content": "You are a helpful assistant.",
                    },
                    {"role": "user", "content": input},
                ],
                tokenize=False,
            )
            for input in inputs
        ]

        # Generate responses for self.batch_size prompts at a time
        all_responses = []
        for i in range(0, len(prompts), self.batch_size):
            batch = prompts[i : i + self.batch_size]
            outputs = self.llm.generate(batch, self.sampling_params)
            responses = [output.outputs[0].text for output in outputs]
            all_responses.extend(responses)

        return all_responses

    def __del__(self):
        """Cleanup method to delete the LLM instance and free up GPU memory."""
        del self.llm
        torch.cuda.empty_cache()
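A minimal usage sketch of the new class. The import path and the model id below are assumptions not shown in this diff; the constructor parameters and `get_response` follow the code above, and running it would require a CUDA GPU with vLLM installed.

```python
# Minimal usage sketch -- import path and model id are assumptions,
# not taken from this PR; parameters mirror the VLLM class above.
from promptolution.llms.vllm import VLLM  # assumed module path

llm = VLLM(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",  # hypothetical model id
    batch_size=64,
    max_generated_tokens=256,
    temperature=0.1,
    top_p=0.9,
)

responses = llm.get_response(
    ["Summarize what vLLM does in one sentence.", "List three uses of batched inference."]
)
for response in responses:
    print(response)
```

Batching is handled inside get_response: each input is wrapped in the chat template with the fixed system prompt, and the templated prompts are submitted to llm.generate in chunks of batch_size, so callers can pass arbitrarily long prompt lists without managing batches themselves.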