Arxiv | PaperList | Readme | Resource
Considering that heterogeneous pricing does not necessarily correlate with user experience, there is a great need to explore effective invocation methods for LLM services in practice. As shown in Figure 1, we aim to make use of massive LLM services to construct effective invocation strategies from different methods, meeting the targets of different scenarios. To this end, we attempt to provide a comprehensive study of the development and recent advances of effective invocation methods in LMaaS. In detail, we first formalize the task of constructing an effective invocation strategy as a multi-objective optimization problem that simultaneously considers latency, performance, and cost. Then, we propose a taxonomy that provides a unified view of effective invocation methods in LMaaS, categorizing existing methods into four groups: input abstract, semantic cache, solution design, and output enhancement. These four components can be combined and unified in a flexible framework. Finally, we highlight the challenges and potential directions, hoping our work can provide a useful roadmap for beginners interested in this area and shed light on future research.
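Concretely, the multi-objective view can be sketched as follows (the notation here is our shorthand for readability, not necessarily the paper's exact formulation):

```latex
\min_{s \in \mathcal{S}} \; \bigl(\, \mathrm{Latency}(s),\; -\mathrm{Performance}(s),\; \mathrm{Cost}(s) \,\bigr)
```

where \(\mathcal{S}\) is the space of candidate invocation strategies; a concrete strategy trades these objectives off against each other, e.g., via weighted scalarization or Pareto-optimal selection.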
The contributions of this survey can be summarized as follows:
- As shown in PaperList, we propose a taxonomy of effective invocation methods in LMaaS, which categorizes existing methods from four different aspects: input abstract, semantic cache, solution design, and output enhancement.
- As shown in Figure 2, we propose a framework that unifies the four types of components, allowing each of them to work independently or simultaneously during the life cycle of an LLM service invocation.
- To facilitate work on this task, the pricing rules of popular LMaaS products are presented in Resource, and a paper list of existing works is available.
Before invocation, the user enters a query. Here, we give an example in which a user wants to know the answer to "I want to hold a family party. Please tell me what should I do?" and provides three possible prompts, related to the tastes of the guests and the motivation for hosting the party.
Input abstract concerns the processing of the input query before invocation, removing redundant or irrelevant content to reduce cost. For example, here we filter out the "I attend C's family party..." prompt, which is irrelevant to the question, and thereby shorten the token length of the input query by 67%.
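The filtering step above can be sketched in a few lines. This is an illustrative toy, not the method of any specific paper: a simple lexical-overlap score stands in for a learned relevance model, and the query, prompts, and threshold are all made up for the example.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z']+", text.lower()))

def relevance(query: str, prompt: str) -> float:
    """Jaccard overlap between word sets -- a stand-in for a learned scorer."""
    q, p = words(query), words(prompt)
    return len(q & p) / len(q | p) if q | p else 0.0

def abstract_input(query: str, prompts: list[str], threshold: float = 0.1) -> list[str]:
    """Keep only the prompts relevant enough to the query."""
    return [p for p in prompts if relevance(query, p) >= threshold]

query = "what should i cook for a family party"
prompts = [
    "my family likes spicy food at a party",   # relevant context, kept
    "the weather next week looks rainy",       # irrelevant, filtered out
]
kept = abstract_input(query, prompts)
before = sum(len(p.split()) for p in prompts)
after = sum(len(p.split()) for p in kept)
print(kept)
print(f"token reduction: {1 - after / before:.0%}")
```

Dropping the irrelevant prompt shrinks the input that must be paid for, which is the whole point of input abstract.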
Semantic cache is another important strategy to improve service performance and reduce latency and cost before invocation; it is divided into traditional cache and neural cache according to structure. It checks whether the cache contains a semantically similar query: if so, the cached result is returned directly; otherwise, the request proceeds to the invocation phase.
During invocation, solution design aims to construct the best invocation solution, selecting and composing LLM services so as to balance latency, performance, and cost.
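One well-known solution-design pattern is the LLM cascade: try a cheap service first and escalate to a stronger, pricier one only when a scorer judges the cheap answer unreliable. The sketch below is a minimal illustration of that idea; the services, prices, and confidence scorer are invented stand-ins, not real APIs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Service:
    name: str
    cost_per_call: float
    answer: Callable[[str], tuple[str, float]]  # returns (answer, confidence)

def cascade(query: str, services: list[Service], min_conf: float = 0.7):
    """Invoke services from cheapest to priciest until one is confident."""
    spent = 0.0
    for svc in sorted(services, key=lambda s: s.cost_per_call):
        answer, conf = svc.answer(query)
        spent += svc.cost_per_call
        if conf >= min_conf:
            return answer, svc.name, spent
    return answer, svc.name, spent  # fall back to the strongest model's answer

# Mock services: the small model is unsure, so the cascade escalates.
small = Service("small-llm", 0.001, lambda q: ("maybe 4", 0.4))
large = Service("large-llm", 0.030, lambda q: ("4", 0.95))
ans, used, cost = cascade("what is 2 + 2?", [small, large])
print(ans, used, round(cost, 3))
```

When the cheap model is confident often enough, most queries never reach the expensive model, which is where the cost savings come from.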
After invocation, output enhancement focuses on improving the information returned to the user.
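As one common example of this idea (a sketch of the general technique, not any specific paper's method), several sampled outputs can be aggregated by majority vote so the returned answer is more reliable than any single sample:

```python
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most frequent answer among the sampled LLM outputs."""
    return Counter(samples).most_common(1)[0][0]

# Three sampled outputs for the same query; two agree, so that answer wins.
samples = ["Plan the menu first", "Plan the menu first", "Book a venue first"]
print(majority_vote(samples))
```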
Through the establishment of a taxonomy, we categorize existing methods into four categories: input abstract, semantic cache, solution design, and output enhancement. We then formalize the problem of constructing effective LLM service strategies and propose an LLM service invocation framework. Each component in the framework can work independently or simultaneously to form effective strategies for LLM service invocation that are low-latency, high-performance, and cost-saving.
Existing methods tend to focus on only one component of the framework, so we can use them as plugins. A case is shown in Figure 3: a simple invocation strategy constructed from three existing methods. The development prospects of this field are promising. We look forward to future research further advancing the field, providing users with low-latency, high-performance, and cost-effective LLM service solutions, and promoting the healthy development of the LMaaS ecosystem.
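The plug-in idea can be illustrated with a toy pipeline that chains the four component types in order: cache lookup, input abstract, invocation, and output enhancement. Everything here is a deliberately naive stand-in (an exact-match dict for the cache, truncation for abstract, whitespace cleanup for enhancement), and `llm_call` is a placeholder for any real service client.

```python
def invoke(query: str, store: dict, llm_call) -> str:
    """Toy end-to-end invocation strategy chaining the four components."""
    if query in store:                          # semantic cache (exact-match toy)
        return store[query]
    compact = " ".join(query.split()[:64])      # input abstract: naive truncation
    output = llm_call(compact)                  # solution design: one service here
    enhanced = output.strip()                   # output enhancement: minimal cleanup
    store[query] = enhanced                     # populate the cache for next time
    return enhanced

store: dict[str, str] = {}
first = invoke("How do I plan a family party?", store,
               lambda q: " Invite guests and plan the menu. ")
second = invoke("How do I plan a family party?", store,
                lambda q: "unused")  # never called: served from cache
print(first)
print(second)
```

Swapping any stage for a real method (e.g., a neural cache or an LLM cascade) changes the strategy without touching the rest of the pipeline, which is the framework's point.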
The platform demonstrates the implementation of constructing an effective strategy for LMaaS, as described in the paper, using Streamlit, allowing the components to be combined flexibly to test different LLM invocation scenarios. Streamlit is an open-source app framework that lets Machine Learning and Data Science teams create beautiful, performant apps quickly.
To get started with this project, clone the repository and install the required dependencies.
```shell
git clone https://github.com/W-caner/Effective-strategy-for-LMaas.git
cd platform
pip install -r requirements.txt
```
After installing the dependencies, you can run the Streamlit app locally using the following command:
```shell
streamlit run Welcome.py
```
This will start a local server, and you can view the app in your web browser at http://localhost:8501.