
Integrate Open-source Models #245

Merged: 32 commits merged into master from llama_wenxuan on Sep 1, 2023

Conversation

HalberdOfPineapple (Member)

Description

  • Introduced open-source models into the CAMEL framework. Successfully used LLaMA 2 (7B, chat) and Vicuna (7B, v1.5) for the assistant role, with corresponding ModelType elements defined.
  • The basic workflow relies on an external server running LLM inference, because:
    1. such a server keeps the LLM in memory, eliminating the need to reload it from disk on every request;
    2. such servers typically implement OpenAI-compatible interfaces, which makes interacting with them straightforward. The current implementation assumes users run the server themselves and provide CAMEL with its URL, e.g. http://localhost:8000/v1;
    3. this gives users the freedom to choose their preferred server and decouples LLM inference logic from CAMEL. My current choice is FastChat, which supports OpenAI-compatible interfaces and many mainstream open-source models.
  • Necessary arguments, and how they are specified and passed:
    • Path to model tokenizers: added an extra argument model_path to the BaseModelBackend class, applicable only to open-source models. It can be either a HuggingFace model id such as lmsys/vicuna-13b-v1.5-16k or a path to a local folder containing the tokenizer (required) and model weights.
    • URL of the server running the LLM: specified via the environment variable OPENAI_API_BASE, which is used to override openai.api_base.
  • Refactored the token-counting functionality, since different open-source models follow different tokenization rules. This PR packs the related functionality into utils/token_counting.py and equips each model object with a model-specific token counter.
  • Removed BaseMessage.token_len, as this function no longer seems useful. Token counting cannot be decoupled from the specific model, and open-source models in particular need to load their own tokenizers; counting tokens in BaseMessage would reload the tokenizer on every call, which is quite costly. We can simply count tokens after models are defined, using their token counters.

The functional structure may need further optimization, but Guohao suggested I create this PR to make reviewing more convenient.
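For illustration, a rough usage sketch of the workflow described above. The identifiers (ModelFactory.create arguments, ModelType.VICUNA) follow this PR's discussion but were still evolving during review, so treat them as illustrative rather than final:

import os

from camel.typing import ModelType
from camel.models import ModelFactory

# Point CAMEL at a locally running OpenAI-compatible server (e.g. FastChat).
os.environ["OPENAI_API_BASE"] = "http://localhost:8000/v1"

# model_path may be a HuggingFace model id or a local folder holding the tokenizer.
model = ModelFactory.create(
    model_type=ModelType.VICUNA,
    model_config_dict={},  # model-specific keyword arguments (placeholder)
    model_path="lmsys/vicuna-7b-v1.5",
)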

Motivation and Context

Why is this change required? What problem does it solve?
close #225

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

Implemented Tasks

  • Subtask 1
  • Subtask 2
  • Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@dandansamax (Collaborator) left a comment:

This is an awesome change, and after merging it we can invoke any model through the OpenAI API!

I have some suggestions about the code structure.

However, I think the biggest problem is that I have to set up an entirely new environment and launch an LLM server on my computer to use open-source models, which is difficult and time-consuming. If I am a user who wants to use an open-source model in CAMEL, it is probably because I don't have an OpenAI key but want to try CAMEL for free; a complicated setup process may prevent me from trying it. We should either provide detailed instructions for setting up an LLM server or wrap that step in the code.

Anyway, this PR is very valuable for invoking other models. Thank you so much!

Review threads: camel/models/open_source_model.py, camel/typing.py, camel/utils/functions.py, camel/utils/token_counting.py, examples/test/test_ai_society_example.py
def create(
    model_type: ModelType,
    model_config_dict: Dict,
    model_path: Optional[str] = None,
Collaborator:

Is it possible not to include the model_path parameter here? I think it is only used to create the tokenizer, so could we derive it inside the factory from model_type instead of passing it in?

Member Author:

The thing is, the current model_type carries only general information about the model, not the specific model name/path (e.g. lm-sys/vicuna-7b-v1.5).
Do you mean modifying ModelType into a data class? Or should we implement another class, say ModelCard, which has a ModelType field restricted to a limited range and can carry other information such as the model path?
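For concreteness, a tiny sketch of what the ModelCard idea could look like (the class name and fields are hypothetical and not part of this PR):

from dataclasses import dataclass
from typing import Optional

from camel.typing import ModelType

# Hypothetical ModelCard sketch: pairs a (restricted) ModelType with
# open-source-specific details such as the tokenizer/model path.
@dataclass(frozen=True)
class ModelCard:
    model_type: ModelType
    model_path: Optional[str] = None  # HuggingFace id or local folder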

Member:

todo

@HalberdOfPineapple (Member Author):

> This is an awesome change, and after merging it we can invoke any model through the OpenAI API!
>
> I have some suggestions about the code structure.
>
> However, I think the biggest problem is that I have to set up an entirely new environment and launch an LLM server on my computer to use open-source models, which is difficult and time-consuming. If I am a user who wants to use an open-source model in CAMEL, it is probably because I don't have an OpenAI key but want to try CAMEL for free; a complicated setup process may prevent me from trying it. We should either provide detailed instructions for setting up an LLM server or wrap that step in the code.
>
> Anyway, this PR is very valuable for invoking other models. Thank you so much!

Thanks for these valuable suggestions! Yeah, I totally agree that having to launch the server yourself is inconvenient; I have discussed this with Guohao. I am now trying to find a way to automatically launch a FastChat server given a model path, though such an option would limit users' choice of server to ours.
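For reference, a minimal sketch of what automatically launching a FastChat server could look like, assuming a standard FastChat installation (the module names and flags below are FastChat's own; this is an illustration, not the implementation in this PR):

import subprocess
import sys

def launch_fastchat(model_path: str, port: int = 8000) -> list:
    """Spawn the three FastChat processes that expose an OpenAI-compatible API."""
    # A real implementation would wait for the controller and worker to be ready
    # before starting the next process; omitted here for brevity.
    procs = [
        subprocess.Popen([sys.executable, "-m", "fastchat.serve.controller"]),
        subprocess.Popen([sys.executable, "-m", "fastchat.serve.model_worker",
                          "--model-path", model_path]),
        subprocess.Popen([sys.executable, "-m", "fastchat.serve.openai_api_server",
                          "--host", "localhost", "--port", str(port)]),
    ]
    return procs  # the caller is responsible for terminating these processes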

@Obs01ete (Collaborator) left a comment:

Great feature you are adding! Thanks! Let's polish the code.

Review threads: camel/agents/chat_agent.py, camel/agents/critic_agent.py, camel/agents/task_agent.py, camel/models/open_source_model.py, examples/test/test_ai_society_example.py, examples/open_source_models/role_playing_with_llama2.py
class OpenSourceModel(BaseModelBackend):
    r"""OpenAI API in a unified BaseModelBackend interface."""

    def __init__(self, model_type: ModelType, model_config_dict: Dict[str,
Collaborator:

We need a test for this newly introduced class.

Review thread: camel/models/open_source_model.py
return num_tokens


class TokenCounterFactory:
Collaborator:

We need a test for TokenCounterFactory with all possible model types.

@HalberdOfPineapple (Member Author):

I have just modified the current implementation but have not completed the documentation and tests. Will do tomorrow morning, as it is quite late here (UTC+8).

@lightaime (Member) left a comment:

Awesome! Left some comments.

Review threads: camel/agents/chat_agent.py, camel/models/base_model.py
def create(
    model_type: ModelType,
    model_config_dict: Dict,
    model_path: Optional[str] = None,
Member:

todo

Review threads: camel/models/model_factory.py, camel/utils/token_counting.py, camel/models/open_source_model.py
@HalberdOfPineapple (Member Author):

I cannot agree with this idea; an enumeration containing too much information is not good.

I think I get what you mean, and I have written the code below. Could you have a look at it? This code isolates the data and behaviour of each model type into a separate dataclass.

e.g.

@dataclass(frozen=True)
class ModelConfig:
    name: str = ...
    model_path: str = ...
    is_open_source: bool = ...
    ...

class ModelType(Enum):
    GPT_3_5_TURBO = "gpt-3.5-turbo"
    GPT_3_5_TURBO_16K = "gpt-3.5-turbo-16k"
    GPT_4 = "gpt-4"
    GPT_4_32k = "gpt-4-32k"
    STUB = "stub"

    LLAMA_2 = "llama-2"
    VICUNA = "vicuna"
    VICUNA_16K = "vicuna-16k"

class ModelConfigFactory:
    @staticmethod
    def create(model_type: ModelType, kwargs: Dict[str, Any]) -> ModelConfig:
        # Closed-source (OpenAI) models: no model_path needed.
        if model_type in {
                ModelType.GPT_4,
                ModelType.GPT_3_5_TURBO,
                ...
        }:
            return ModelConfig(
                name = kwargs.get("model_name", model_type.value),
                model_path = None,
                is_open_source = False,
                ...
            )
        # Open-source models: model_path is required.
        elif model_type in {
                ModelType.LLAMA_2,
                ...,
        }:
            if "model_path" not in kwargs:
                raise ValueError(...)

            return ModelConfig(
                name = kwargs.get("model_name", model_type.value),
                model_path = kwargs["model_path"],
                is_open_source = True,
                ...
            )
        else:
            raise ValueError("Invalid model type")

Hi @Obs01ete @lightaime, I wrote the sample code above following Tianqi's suggestion and am wondering whether such an implementation of ModelType is viable.
Also, some of Guohao's comments no longer match yesterday's commit (I uploaded a commit today, after Guohao's comments, without having read them first).

@Obs01ete (Collaborator):

@MorphlingEd Why do we need a factory for configs? Can't we make model backends in the factory and that's it?

@HalberdOfPineapple (Member Author) commented Aug 16, 2023:

> @MorphlingEd Why do we need a factory for configs? Can't we make model backends in the factory and that's it?

Yeah, the redundant factory shown here is an ill-conceived design. Later I tried introducing a unified ModelBackendConfig, created together with the model backend objects in ModelFactory and containing all the information currently held by ModelType. But I did not see an obvious benefit in completely removing the properties from the ModelType enumeration: I would need to write a ModelBackendConfig for each already-defined constant (e.g. GPT_4) one by one, since they do not share property values. For example:

...
if model_type is ModelType.GPT_4:
    model_config = ModelConfig(
                ...
                token_limit = 8192,
                ...
            )
    return OpenAIModel(model_config, ...)
elif model_type is ModelType.GPT_3_5_TURBO:
    model_config = ModelConfig(
                ...
                token_limit = 4096,
                ...
            )
    return OpenAIModel(model_config, ...)
...

Though I could use a dictionary like token_limit_map to map model types to their token limits and reduce the repetition (a rough sketch follows at the end of this comment), I don't think this is simpler than a "naive" enum class.

In addition, by introducing the new OpenSourceConfig in configs.py, both of the new arguments needed by open-source models, api_base (server_url) and model_path, can be passed conveniently.
So I guess we can just keep the original enum implementation?
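For illustration, the token_limit_map alternative mentioned above would look roughly like this (the mapping itself is hypothetical; the limits shown are the commonly published context sizes for these models):

# Hypothetical alternative to storing token_limit as a property on the enum.
token_limit_map = {
    ModelType.GPT_3_5_TURBO: 4096,
    ModelType.GPT_3_5_TURBO_16K: 16384,
    ModelType.GPT_4: 8192,
    ModelType.GPT_4_32k: 32768,
}

def get_token_limit(model_type: ModelType) -> int:
    return token_limit_map[model_type]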

@HalberdOfPineapple (Member Author) commented Aug 16, 2023:

I just found there are some authentication problems because the HuggingFace repo for LLaMA 2 is gated. If we do not authenticate in the check environment with commands like the following:

export HUGGINGFACE_TOKEN="<huggingface_token>"
huggingface-cli login --token $HUGGINGFACE_TOKEN

then HuggingFace's AutoTokenizer will report errors like:
Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json. Repo model meta-llama/Llama-2-7b-chat-hf is gated. You must be authenticated to access it.
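Alternatively, the token can be passed when loading the tokenizer itself. A sketch assuming HuggingFace's AutoTokenizer API (use_auth_token is the argument name in transformers 4.x; newer releases call it token):

import os

from transformers import AutoTokenizer

# Assumes HUGGINGFACE_TOKEN is already set in the environment.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_auth_token=os.environ.get("HUGGINGFACE_TOKEN"),
)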

@HalberdOfPineapple (Member Author):

Sorry, I just closed the branch by mistake.

@Obs01ete (Collaborator) left a comment:

Great tests in test_open_source_model.py. Left some comments.

Review threads: camel/models/base_model.py, camel/configs.py, camel/models/model_factory.py, camel/models/open_source_model.py, camel/models/openai_model.py, camel/models/stub_model.py, camel/societies/role_playing.py, camel/typing.py
@Obs01ete (Collaborator) left a comment:

Good progress. Left comments.

Review threads: camel/models/stub_model.py, camel/models/openai_model.py, .github/workflows/documentation.yaml
@HalberdOfPineapple (Member Author):

@Obs01ete I have reverted the Python version back to 3.8, but I failed to implement functionality equivalent to kw_only, even with a decorator. In the end I changed OpenSourceConfig to be composed of an instance of ChatGPTConfig. This might actually be meaningful, since configurations for open-source models should differ from ChatGPTConfig, and the parameter values of ChatGPTConfig are then compacted into "parameters for the OpenAI API". I am not sure whether this makes sense to you.
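Roughly, the composition described above could look like the sketch below (a sketch only: field names follow this PR's discussion and may not match the merged code; BaseConfig and ChatGPTConfig are assumed to be defined in the same camel/configs.py module):

from dataclasses import dataclass, field

@dataclass(frozen=True)
class OpenSourceConfig(BaseConfig):
    # Arguments specific to open-source models.
    model_path: str   # HuggingFace id or local folder with the tokenizer
    server_url: str   # URL of the OpenAI-compatible inference server
    # Parameters forwarded to the OpenAI-style API, compacted into one field.
    api_params: ChatGPTConfig = field(default_factory=ChatGPTConfig)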

@Obs01ete (Collaborator) left a comment:

I see, now all comments are resolved, and the code looks nice. Approved.

@dandansamax (Collaborator):

LGTM. Only the config part is a little redundant now. We can consider refactoring it later. Anyway, good job! Feel free to merge it.

@Obs01ete (Collaborator) commented Aug 24, 2023:

> I changed the OpenSourceConfig to be composed of an instance of ChatGPTConfig

Yeah, this does not look natural... Okay, let's leave it like this and rework later. Go ahead with merging this PR.

@lightaime (Member) commented Aug 24, 2023 via email

@HalberdOfPineapple (Member Author):

> Could you wait for my final check a bit?

Yes of course :D

@HalberdOfPineapple (Member Author):

@lightaime Hi Guohao, could I ask whether there is anything else that should be improved? :)

@lightaime (Member) left a comment:

Looks great!! Thanks @MorphlingEd. Left some small comments. Please feel free to merge it.

Review threads: README.md, camel/models/open_source_model.py
@HalberdOfPineapple merged commit 7f0315f into master on Sep 1, 2023 (10 checks passed).
@HalberdOfPineapple deleted the llama_wenxuan branch on September 1, 2023 at 03:28.
@ShukriChiu:

@CodiumAI-Agent /review

@CodiumAI-Agent:

PR Analysis

  • 🎯 Main theme: Integration of open-source models into the CAMEL framework
  • 📝 PR summary: This PR introduces open-source models into the CAMEL framework, specifically LLaMA2 and Vicuna. The implementation assumes the user runs an inference server themselves and provides CAMEL with the URL to this server. The PR also refactors the token-counting functionality and removes the BaseMessage.token_len function.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • 🔒 Security concerns: No security concerns found

PR Feedback

  • 💡 General suggestions: The PR is well-structured and includes a good amount of detail in the description. The integration of open-source models into the CAMEL framework is a significant enhancement. The refactoring of the token counting functionality and the removal of the BaseMessage.token_len function seem to be well-justified. However, it would be beneficial to include more comments in the code to explain the logic and the functionality of the methods, especially for the token counting functionality.

  • 🤖 Code feedback:

    • relevant file: camel/utils/token_counting.py
      suggestion: Consider adding more comments to explain the logic and functionality of the methods, especially for the token counting functionality. This will make the code more maintainable and understandable for other developers. [medium]
      relevant line: class BaseTokenCounter(ABC):

    • relevant file: camel/utils/token_counting.py
      suggestion: It would be beneficial to handle exceptions more gracefully in the get_model_encoding function. Instead of just printing "Model not found. Using cl100k_base encoding.", consider logging this information and also adding more context to the error message. [medium]
      relevant line: def get_model_encoding(value_for_tiktoken: str):

    • relevant file: camel/utils/functions.py
      suggestion: It seems like the openai_api_key_required function could be simplified. The function currently checks if the model is a stub or an open-source model, and if the OpenAI API key is in the environment variables. Consider simplifying this function to only check if the OpenAI API key is in the environment variables, as this is the only requirement for the function to work correctly. [medium]
      relevant line: def openai_api_key_required(func: F) -> F:

    • relevant file: camel/configs.py
      suggestion: The OpenSourceConfig class could benefit from more detailed docstrings. Specifically, it would be helpful to explain what the model_path and server_url parameters are used for, and what their expected values are. [medium]
      relevant line: class OpenSourceConfig(BaseConfig):

How to use

Tag me in a comment '@CodiumAI-Agent' and add one of the following commands:
/review [-i]: Request a review of your Pull Request. For an incremental review, which only considers changes since the last review, include the '-i' option.
/describe: Modify the PR title and description based on the contents of the PR.
/improve [--extended]: Suggest improvements to the code in the PR. Extended mode employs several calls, and provides a more thorough feedback.
/ask <QUESTION>: Pose a question about the PR.
/update_changelog: Update the changelog based on the PR's contents.

To edit any configuration parameter from configuration.toml, add --config_path=new_value
For example: /review --pr_reviewer.extra_instructions="focus on the file: ..."
To list the possible configuration parameters, use the /config command.

@CodiumAI-Agent:

PR Analysis

  • 🎯 Main theme: Integration of Open-source Models into CAMEL Framework
  • 📝 PR summary: This PR introduces open-source models into the CAMEL framework. It includes the implementation of token counting for different types of open-source models and the necessary arguments for their specification. The PR also includes tests for the new functionality and refactors some existing code.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • 🔒 Security concerns: No security concerns found

PR Feedback

  • 💡 General suggestions: The PR is well-structured and includes a significant amount of work. It is good to see that tests have been added for the new functionality. However, it would be beneficial to add more comments in the code to explain the logic and the purpose of the functions, especially for complex ones. This will make the code easier to understand and maintain in the future.

  • 🤖 Code feedback:

    • relevant file: camel/utils/token_counting.py
      suggestion: Consider handling exceptions more gracefully in the OpenSourceTokenCounter class. Instead of raising a ValueError when the tokenizer fails to load, you could log an error message and return a default value or handle the exception in a way that doesn't interrupt the execution of the program. [important]
      relevant line: '+ raise ValueError(

    • relevant file: camel/utils/token_counting.py
      suggestion: The messages_to_prompt function seems to have a lot of repeated code for different model types. Consider refactoring this function to reduce code duplication. You could create a dictionary to map model types to their corresponding separators and role maps, and then use this dictionary in the function. [medium]
      relevant line: '+def messages_to_prompt(messages: List[OpenAIMessage], model: ModelType) -> str:'

    • relevant file: camel/utils/functions.py
      suggestion: The openai_api_key_required decorator could be improved by using a more descriptive error message. Instead of just saying "OpenAI API key not found", you could provide more information about how to set the API key. [medium]
      relevant line: '+ raise ValueError("OpenAI API key not found")'

    • relevant file: camel/configs.py
      suggestion: In the OpenSourceConfig class, it would be helpful to add type hints for the model_path and server_url parameters. This will make the code more readable and will allow static type checkers and IDE features to work correctly. [medium]
      relevant line: '+ model_path: str'


Labels: Model (Related to backend models)
6 participants