diff --git a/.gitignore b/.gitignore index a48cf0d..d70ebaa 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1 @@ -public +public \ No newline at end of file diff --git a/content/en/contribution/contribute/d1.pr.md b/content/en/contribution/contribute/d1.pr.md index 9c57464..768f0ae 100644 --- a/content/en/contribution/contribute/d1.pr.md +++ b/content/en/contribution/contribute/d1.pr.md @@ -81,7 +81,8 @@ Try to use options that are already listed. If you need to add new ones, please ### Subject Content The title should clearly indicate the main content of the current submission. - +For example: +`[feature](coagent)<add antflow compatibility and a coagent demo>` ## Example coming soon diff --git a/content/en/docs/chatbot/c1.quickstart.md b/content/en/docs/chatbot/c1.quickstart.md index ec3bafd..68a3118 100644 --- a/content/en/docs/chatbot/c1.quickstart.md +++ b/content/en/docs/chatbot/c1.quickstart.md @@ -2,9 +2,9 @@ title: QuickStart slug: QuickStart description: 介绍主要功能 -url: "docs/quickstart" +url: "docs/codefuse-chatbot-quickstart" aliases: -- "/docs/quickstart" +- "/docs/codefuse-chatbot-quickstart" ---

diff --git a/content/en/docs/codefuse-evalution/1_quickstart.md b/content/en/docs/codefuse-evalution/1_quickstart.md new file mode 100644 index 0000000..cbad2e6 --- /dev/null +++ b/content/en/docs/codefuse-evalution/1_quickstart.md @@ -0,0 +1,258 @@
+---
+title: QuickStart
+description: 介绍主要功能
+url: docs/codefuse-evalution-quickstart
+aliases:
+- "/docs/codefuse-evalution-quickstart"
+---
+
+## Generation Environment
+CodeFuse-13B: Python 3.8 or above, PyTorch 1.12 or above (2.0 or above recommended), Transformers 4.24.0 or above, CUDA 11.4 or above (relevant for GPU users and flash-attention users).
+
+CodeFuse-CodeLlama-34B: Python >= 3.8, PyTorch >= 2.0.0, Transformers == 4.32.0, SentencePiece, CUDA 11.
+
+## Evaluation Environment
+The evaluation of the generated codes involves compiling and running in multiple programming languages. The versions of the programming language environments and packages we use are as follows:
+
+| Dependency | Version  |
+| ---------- | -------- |
+| Python     | 3.10.9   |
+| JDK        | 18.0.2.1 |
+| Node.js    | 16.14.0  |
+| js-md5     | 0.7.3    |
+| C++        | 11       |
+| g++        | 7.5.0    |
+| Boost      | 1.75.0   |
+| OpenSSL    | 3.0.0    |
+| go         | 1.18.4   |
+| cargo      | 1.71.1   |
+
+To save you the trouble of setting up the environments for these languages, we provide a Docker image with the required environments and codefuseEval:
+```bash
+docker pull registry.cn-hangzhou.aliyuncs.com/codefuse/codefuseeval:latest
+```
+
+If you are familiar with Docker, you can build the image from `codefuseEval/docker/Dockerfile` or adjust the Dockerfile as you like:
+
+```bash
+cd codefuseEval/docker
+docker build [OPTIONS] .
+```
+
+After obtaining the image, you can start a container using the following command:
+
+```bash
+docker run -it --gpus all --mount type=bind,source=,target= [OPTIONS]
+```
+
+## Check Result Command
+We provide scripts to check the results for the provided code LLMs. Please use the following scripts to check the corresponding results and the environment.
+
+```bash
+bash codefuseEval/script/check_reference.sh codefuseEval/result/CodeFuse-CodeLlama-34B/humaneval_result_python.jsonl humaneval_python
+bash codefuseEval/script/check_reference.sh codefuseEval/result/CodeFuse-13B/humaneval_result_python.jsonl humaneval_python
+```
+
+## How to use CodeFuseEval
+1. Download the model and update the current model information in ckpt_config.json, mainly the `path` parameter for the corresponding model and version.
+2. Run the following generation command to generate results.
+```
+bash codefuseEval/script/generation.sh MODELNAME MODELVERSION EVALDATASET OUTFILE
+
+eg:
+bash codefuseEval/script/generation.sh CodeFuse-13B v1 humaneval_python result/test.jsonl
+```
+3. Run the following evaluation command to evaluate the generated results for the corresponding model and version.
+```
+bash codefuseEval/script/evaluation.sh
+eg:
+bash codefuseEval/script/evaluation.sh codefuseEval/result/test.jsonl pass@k humaneval_python
+```
+
+## Evaluation
+
+We recommend evaluating in [the provided image](#evaluation-environment). To evaluate the generated samples, save the generated codes in the following JSON list format:
+
+```
+{"task_id": "../..", "generation": "..."}
+{"task_id": "../..", "generation": "..."}
+...
+```
+
+and evaluate them using the following script under the root directory of the repository (please execute with caution; the generated codes might have unexpected behaviours, though with very low possibility.
See the warnings in [execution.py](execution.py) and uncomment the execution lines at your own risk):
+
+### Evaluation Data
+Data are stored in ``codefuseEval/data``, using JSON list format. We first integrated the humaneval-X dataset.
+
+* ``task_id``: indicates the target language and ID of the problem. Language is one of ["Python", "Java", "JavaScript", "CPP", "Go"].
+* ``prompt``: the function declaration and docstring, used for code generation.
+* ``declaration``: only the function declaration, used for code translation.
+* ``canonical_solution``: human-crafted example solutions.
+* ``test``: hidden test samples, used for evaluation.
+* ``example_test``: public test samples (appearing in the prompt), used for evaluation.
+* ``prompt_text``: prompt text.
+* ``prompt_explain``: prompt explanation.
+* ``func_title``: code function title.
+* ``prompt_text_chinese``: Chinese prompt.
+
+### Evaluation Metrics
+In addition to the unbiased pass@k metric introduced in [Codex](https://arxiv.org/abs/2107.03374), we also integrate the related open-source Hugging Face metrics together with [CodeBLEU](https://arxiv.org/abs/2009.10297).
+The main metrics currently recommended for users are:
+* ``codebleu``
+* ``pass@k``
+* ``bleu``
+* ``bleurt``
+
+For other related metrics, you can check the code of the metric or the evaluation code to see whether it meets your requirements.
+
+We also report the total and average generation time of the model over the dataset (`total_time_cost` and `Average time cost`), output during each generation run, making it convenient for users to measure the generation performance of the model in the same environment. These indicators are passive output and are produced every time generation is run.
+
+### Evaluation Command
+```
+bash codefuseEval/script/evaluation.sh
+eg:
+bash codefuseEval/script/evaluation.sh codefuseEval/result/test.jsonl pass@k humaneval_python
+```
+
+At the same time, we currently provide the following flag, which can directly use the sample answers in the test dataset as the generated answers for testing.
+
+* ``TEST_GROUDTRUTH``: default False
+
+When TEST_GROUDTRUTH is True, self-test mode is enabled: PROBLEM_FILE is read and the sample answers are used as the generated answers for testing.
+
+When TEST_GROUDTRUTH is False, evaluation mode is enabled: RESULT_FILE and PROBLEM_FILE are read and the generated answers are used for testing.
+
+## More Information
+
+### Evaluating Your Own Model and Dataset
+
+1. Register your evaluation dataset.
+* Download the evaluation dataset and store it in `codefuseEval/data` or another directory. The dataset must be in JSONL format.
+* Set the dataset information `EVAL_DATASET`, `DATASET_SUPPORT` and `DATASET_LANGUAGE` in `codefuseEval/util.py` for the dataset path, dataset task_mode and generated code language.
+2. Register your evaluation model.
+* Download the evaluation model and store it in `codefuseEval/model` or another directory.
+* Write the processor code for your evaluation model in the `codefuseEval/processor` package.
+
+We designed an infrastructure called Processor. Its main purpose is to handle the differences between different models. It needs to implement three abstract functions:
+* ``load_model_tokenizer``: Due to differences in model loading parameters and tokenizer terminators, models need to use different parameters for adaptation and loading. This function mainly helps users load and adapt different models.
+* ``process_before``: Since the prompt is adapted to different prompt styles according to the type of evaluation task and the model selected by the user, the `process_before` function is extracted mainly to help users process prompts.
+* ``process_after``: Due to the diversity of model generation results, the generated result data can be spliced into appropriate use cases for automated execution so that it fits the evaluation framework. This function mainly processes the generated results, based on the task type and dataset conditions, to adapt them to the evaluation dataset and results.
+
+You can extend the `BaseProcessor` in `codefuseEval/processor/base.py` and implement the above functions.
+
+* Set up the model information in `ckpt_config.json`. For example:
+```
+{
+  "CodeFuse-13B": { // model name
+    "v1": { // model version
+      "path": "/mnt/model/CodeFuse13B-evol-instruction-4K/", // model path
+      "processor_class": "codefuseEval.process.codefuse13b.Codefuse13BProcessor", // model processor
+      "tokenizer": { // tokenizer params used to tokenize the input string
+        "truncation": true,
+        "padding": true,
+        "max_length": 600
+      },
+      "generation_config": { // generation config params
+        "greedy": { // If a JsonObject, it is a decode mode; set the `decode_mode` param to load the params defined in that decode mode.
+          "do_sample": false,
+          "num_beams": 1,
+          "max_new_tokens": 512
+        },
+        "beams": {
+          "do_sample": false,
+          "num_beams": 5,
+          "max_new_tokens": 600,
+          "num_return_sequences": 1
+        },
+        "dosample": {
+          "do_sample": true
+        },
+        "temperature": 0.2, // If not a JsonObject, it is a default param set in generation_config; a param of the same name inside a decode mode overrides it.
+        "max_new_tokens": 600,
+        "num_return_sequences": 1,
+        "top_p": 0.9,
+        "num_beams": 1,
+        "do_sample": true
+      },
+      "batch_size": 1, // batch size for generation
+      "sample_num": 1, // the number of samples generated for a single piece of data
+      "decode_mode": "beams" // choose a decode mode defined in generation_config
+    }
+  }
+}
+```
+
+### Check dataset Command
+To check whether the reference values provided by the evaluation dataset are correct,
+we provide the following commands to check the dataset.
+
+CodeCompletion
+```bash
+bash codefuseEval/script/check_dataset.sh humaneval_python
+
+bash codefuseEval/script/check_dataset.sh humaneval_java
+
+bash codefuseEval/script/check_dataset.sh humaneval_js
+
+bash codefuseEval/script/check_dataset.sh humaneval_rust
+
+bash codefuseEval/script/check_dataset.sh humaneval_go
+
+bash codefuseEval/script/check_dataset.sh humaneval_cpp
+```
+NL2Code
+```bash
+bash codefuseEval/script/check_dataset.sh mbpp
+```
+CodeTrans
+```bash
+bash codefuseEval/script/check_dataset.sh codeTrans_python_to_java
+
+bash codefuseEval/script/check_dataset.sh codeTrans_python_to_cpp
+
+bash codefuseEval/script/check_dataset.sh codeTrans_cpp_to_java
+
+bash codefuseEval/script/check_dataset.sh codeTrans_cpp_to_python
+
+bash codefuseEval/script/check_dataset.sh codeTrans_java_to_python
+
+bash codefuseEval/script/check_dataset.sh codeTrans_java_to_cpp
+```
+CodeScience
+```bash
+bash codefuseEval/script/check_dataset.sh codeCompletion_matplotlib
+
+bash codefuseEval/script/check_dataset.sh codeCompletion_numpy
+
+bash codefuseEval/script/check_dataset.sh codeCompletion_pandas
+
+bash codefuseEval/script/check_dataset.sh codeCompletion_pytorch
+
+bash codefuseEval/script/check_dataset.sh codeCompletion_scipy
+
+bash codefuseEval/script/check_dataset.sh codeCompletion_sklearn
+
+bash codefuseEval/script/check_dataset.sh codeCompletion_tensorflow
+
+bash codefuseEval/script/check_dataset.sh codeInsertion_matplotlib
+
+bash codefuseEval/script/check_dataset.sh codeInsertion_numpy
+
+bash codefuseEval/script/check_dataset.sh codeInsertion_pandas
+
+bash codefuseEval/script/check_dataset.sh codeInsertion_pytorch
+
+bash codefuseEval/script/check_dataset.sh codeInsertion_scipy
+
+bash codefuseEval/script/check_dataset.sh codeInsertion_sklearn
+
+bash codefuseEval/script/check_dataset.sh codeInsertion_tensorflow
+```
\ No newline at end of file
diff --git a/content/en/docs/codefuse-mft-vlm/1_quickstart.md b/content/en/docs/codefuse-mft-vlm/1_quickstart.md new file mode 100644 index 0000000..a4761c7 --- /dev/null +++ b/content/en/docs/codefuse-mft-vlm/1_quickstart.md @@ -0,0 +1,66 @@
+---
+title: QuickStart
+slug: QuickStart
+description: QuickStart Document
+aliases:
+- "/docs/codefuse-mft-vlm-quickstart"
+---
+
+## Contents
+- [Install](#Install)
+- [Datasets](#Datasets)
+- [Multimodal Alignment](#Multimodal-Alignment)
+- [Visual Instruction Tuning](#Visual-Instruction-Tuning)
+- [Evaluation](#Evaluation)
+
+## Install
+Please run `sh init_env.sh`.
+
+## Datasets
+Here is the table of datasets we used to train CodeFuse-VLM-14B:
+
+| Dataset | Task Type | Number of Samples |
+| ------------- | ------------- | ------------- |
+| synthdog-en | OCR | 800,000 |
+| synthdog-zh | OCR | 800,000 |
+| cc3m (downsampled) | Image Caption | 600,000 |
+| cc3m (downsampled) | Image Caption | 600,000 |
+| SBU | Image Caption | 850,000 |
+| Visual Genome VQA (Downsampled) | Visual Question Answering (VQA) | 500,000 |
+| Visual Genome Region descriptions (Downsampled) | Reference Grounding | 500,000 |
+| Visual Genome objects (Downsampled) | Grounded Caption | 500,000 |
+| OCR VQA (Downsampled) | OCR and VQA | 500,000 |
+
+Please download these datasets from their official websites.
+
+## Multimodal Alignment
+Please run `sh scripts/pretrain.sh` or `sh scripts/pretrain_multinode.sh`.
+
+## Visual Instruction Tuning
+Please run `sh scripts/finetune.sh` or `sh scripts/finetune_multinode.sh`.
+
+## Evaluation
+Please run the Python scripts in the directory llava/eval/.
Our pre-trained CodeFuse-VLM-14B can be loaded with the following code: + +``` +import os +from llava.model.builder import load_mixed_pretrained_model + +model_path = '/pretrained/model/path' +tokenizer, model, image_processor, context_len = load_mixed_pretrained_model(model_path, None, 'qwen-vl-14b', os.path.join(model_path, 'Qwen-VL-visual'), 'cross_attn', os.path.join(model_path, 'mm_projector/mm_projector.bin')) +``` + +You can also run scripts/merge\_qwen\_vl\_weights.sh first and load the merged model by the following code: + +``` +from llava.model import LlavaQWenForCausalLM + +model = LlavaQWenForCausalLM.from_pretrained('/path/to/our/pretrained/model') +``` + +## CodeFuse-VLM Product Video +Here's the demo video of front-end code copilot backed by our VLM model + +https://private-user-images.githubusercontent.com/22836551/300398424-201f667d-6b6b-4548-b3e6-724afc4b3071.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDY1MjE5MTIsIm5iZiI6MTcwNjUyMTYxMiwicGF0aCI6Ii8yMjgzNjU1MS8zMDAzOTg0MjQtMjAxZjY2N2QtNmI2Yi00NTQ4LWIzZTYtNzI0YWZjNGIzMDcxLm1wND9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI5VDA5NDY1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI0ZmJmZWNlNDZmNWM3NzA0OThlMmY1ODY4MDkxNWY5ZWNiNzRiYjJkYmE4NjEzM2EwYWRiNWY2ODc3N2ViYjEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.BIvWGNx0XV7RoauxB0c2noEdbfZfu8-16LPHtCaCJ9k \ No newline at end of file diff --git a/content/en/docs/codefuse-modelcache/1_quickstart.md b/content/en/docs/codefuse-modelcache/1_quickstart.md new file mode 100644 index 0000000..66ea1ff --- /dev/null +++ b/content/en/docs/codefuse-modelcache/1_quickstart.md @@ -0,0 +1,48 @@ +--- +title: QuickStart +description: 介绍主要功能 +url: "/docs/codefuse-modelcache-quickstart" +aliases: +- "/docs/codefuse-modelcache-quickstart" +--- + + +ModelCache is easy to use, and you can build a cache testing demo in just one step. + +## Quick Start +### Building a Cache +The default interface for Cache is shown below: +``` +class Cache: + # it should be called when start the cache system + def __init__(self): + self.has_init = False + self.cache_enable_func = None + self.embedding_func = None + self.post_process_messages_func = None + self.config = Config() +``` + +Before creating a ModelCache, consider the following questions: + +How will you generate embedding vectors for queries? (embedding_func) This function embeds text into a dense vector for contextual similarity search. ModelCache can support various methods of embedding context: Huggingface, ONNX, and SentenceTransformers. In the default logic, the text2vec model from huggingface, which performs better in the Chinese domain, is used. Simply initialize your embedding function to: text2vec.to_embeddings +``` +data_manager = get_data_manager(CacheBase("mysql", config=mysql_config), + VectorBase("milvus", dimension=data2vec.dimension, milvus_config=milvus_config)) +cache.init( + embedding_func=data2vec.to_embeddings, + data_manager=data_manager, + similarity_evaluation=SearchDistanceEvaluation(), + query_pre_embedding_func=query_multi_splicing, + insert_pre_embedding_func=insert_multi_splicing, +) +``` + + +Where will you cache data? (data_manager cache storage) The cache storage is used to store all scalar data such as original questions, prompts, answers, and access times. 
ModelCache supports multiple cache storage options like SQLite, MySQL, and OceanBase. More NoSQL database options will be added in the future. +Where will you store and search vector embeddings? (data_manager vector storage) The vector storage component is used to store and search all embedding vectors to semantically find the most similar results. ModelCache supports vector search libraries like FAISS or vector databases like Milvus. More vector database and cloud service options will be added in the future. +Here are some examples: +``` +data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=data2vec.dimension)) +data_manager = get_data_manager(CacheBase("oceanbase"), VectorBase("milvus", dimension=data2vec.dimension)) +``` \ No newline at end of file diff --git a/content/en/docs/codefuse-modelcache/2_feature.md b/content/en/docs/codefuse-modelcache/2_feature.md new file mode 100644 index 0000000..599f44e --- /dev/null +++ b/content/en/docs/codefuse-modelcache/2_feature.md @@ -0,0 +1,168 @@ +--- +title: Feature +description: 介绍主要功能 +url: "/docs/codefuse-modelcache-feature" +aliases: +- "/docs/codefuse-modelcache-feature" +--- + + + + + +From a functional standpoint, to address Huggingface network issues and improve inference speed, local inference capabilities for embeddings have been added. Given some limitations in the SQLAlchemy framework, we have rewritten the relational database interaction module for more flexible database operations. In practice, large model products need to interface with multiple users and models; thus, support for multi-tenancy has been added to ModelCache, as well as preliminary compatibility with system commands and multi-turn conversations. + +Below is a feature comparison table for ModelCache and GPTCache modules: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Module | Function |
| --- | --- |
| Basic Interface | Data query interface |
| Basic Interface | Data writing interface |
| Embedding | Embedding model configuration |
| Embedding | Large model embedding layer |
| Embedding | BERT model long text processing |
| Large model invocation | Decoupling from large models |
| Large model invocation | Local loading of embedding model |
| Data isolation | Model data isolation |
| Data isolation | Hyperparameter isolation |
| Databases | MySQL |
| Databases | Milvus |
| Databases | OceanBase |
| Session management | Single-turn dialogue |
| Session management | System commands |
| Session management | Multi-turn dialogue |
| Data management | Data persistence |
| Data management | One-click cache clearance |
| Tenant management | Support for multi-tenancy |
| Tenant management | Milvus multi-collection capability |
| Other | Long-short dialogue distinction |
+
+## Core Features
+In ModelCache, the main ideas of GPTCache are carried forward, including a series of core modules: adapter, embedding, similarity, and data_manager. The adapter module's main function is to handle the business logic for various tasks and to connect modules like embedding, similarity, and data_manager; the embedding module is responsible for converting text into semantic vector representations, transforming user queries into vectors for recall or storage; the rank module ranks and evaluates the similarity of recalled vectors; the data_manager module manages the database. To better industrialize, we've made architectural and functional upgrades as follows:
+
+- [x] Architectural Adjustment (Lightweight Integration): Embedded in large model products in a cache mode similar to Redis, it provides semantic caching capabilities without interfering with LLM invocation and security audits, and is adaptable to all large model services.
+
+- [x] Multiple Model Loading Schemes:
+  - Support for loading local embedding models to resolve Huggingface connectivity issues.
+  - Support for loading various pre-trained model embedding layers.
+
+- [x] Data Isolation Capabilities:
+  - Environmental Isolation: Depending on the environment, different database configurations can be pulled to achieve isolation (development, staging, production).
+  - Multi-Tenant Data Isolation: Dynamically create collections according to the model to isolate data, addressing data isolation issues for multiple models/services in large model products.
+
+- [x] Support for System Commands: Using concatenation to solve system command issues within the prompt paradigm.
+
+- [x] Distinguishing Long and Short Texts: Long texts pose more challenges to similarity assessment, so the differentiation between long and short texts has been enhanced, allowing separate configuration of judgment thresholds.
+
+- [x] Performance Optimization for Milvus: Adjusting Milvus's consistency_level to "Session" level for better performance.
+
+- [x] Data Management Capabilities:
+  - One-click cache clearing ability for data management after model upgrades.
+  - Recall hit queries for subsequent data analysis and model iteration reference.
+  - Asynchronous log write-back capability for data analysis and statistics.
+  - Added model fields and data statistics fields for feature expansion.
+
+Future features that will continue to be built upon include:
+- [ ] Data isolation based on hyperparameters.
+- [ ] System prompt partitioned storage capability to improve the accuracy and efficiency of similarity matching.
+- [ ] More versatile embedding models and similarity evaluation algorithms.
\ No newline at end of file diff --git a/content/en/docs/codefuse-modelcache/3_config.md b/content/en/docs/codefuse-modelcache/3_config.md new file mode 100644 index 0000000..331c967 --- /dev/null +++ b/content/en/docs/codefuse-modelcache/3_config.md @@ -0,0 +1,21 @@
+---
+title: How to better configure your cache
+description: 介绍主要功能
+url: "/docs/codefuse-modelcache-config"
+aliases:
+- "/docs/codefuse-modelcache-config"
+---
+
+## Environment Dependencies
+- Python version: 3.8 or higher
+- To install dependencies: pip install -r requirements.txt
+
+## Service Startup
+Before starting the service, perform the following environment configuration:
+- Install the relational database MySQL and import the SQL to create the tables; SQL file: reference_doc/create_table.sql
+- Install the vector database Milvus
+- Add the database access information to the configuration files:
+  - modelcache/config/milvus_config.ini
+  - modelcache/config/mysql_config.ini
+- Download the offline model bin files (see https://huggingface.co/shibing624/text2vec-base-chinese/tree/main) and place the downloaded bin files into the model/text2vec-base-chinese folder
+- Start the backend service using the flask4modelcache.py script.
\ No newline at end of file
diff --git a/content/en/docs/codefuse-modelcache/4_release.md b/content/en/docs/codefuse-modelcache/4_release.md new file mode 100644 index 0000000..6f4c683 --- /dev/null +++ b/content/en/docs/codefuse-modelcache/4_release.md @@ -0,0 +1,20 @@
+---
+title: Release Note
+description: 介绍主要功能
+url: "/docs/codefuse-modelcache-release"
+aliases:
+- "/docs/codefuse-modelcache-release"
+---
+
+| Date | Features | Version |
+| ----- | ------ | ----- |
+| 20230430| Completed GPTCache research; open-source process running through the OpenAI interface, single-node form | None |
+| 20230509| 1. Completed technology selection and upstream/downstream interaction scheme
2. Redeveloped database module, replaced SQLAlchemy framework
3. Refactored llm_handler module, compatible with codegpt, adapted codegpt model parameters | V0.1.0|
2. Capability for local model loading and pre-loading
3. Added dynamic loading capability for local paths based on environment| V0.1.1| +| 20230522| 1. Architecture optimized, adjusted to a Redis-like structure, decoupled large model invocation
2. Switched relational database from SQLite to OceanBase
3. Switched vector database from FAISS to Milvus
4. Model data isolation capability
5. Added core modules adapter_query, adapter_insert |V0.2.0| +| 20230531| 1. Online environment launched with dynamic sensing capability
2. Embedding model evaluation and selection
3. Added staging environment and data isolation capability
4. Added exposure capability for the original query field| V0.2.1| +| 20230607| 1. Optimized relational database access performance
2. Optimized environment and model isolation capabilities| V0.2.2| +| 20230630| 1. Added large model embedding layer adaptation module in modelCache
2. Added adoption rate statistical capability |V0.2.3| +| 20230730| 1. Added cache statistics feature
2. Added data deletion function interface
3. One-click cache clearing capability launched
4. Developed multi-turn conversation ability, supporting system commands and multi-turn dialogues| V0.3.0|
2. Architecture change, decoupled embedding inference and business processing logic
3. Blacklist filtering feature |V0.3.1| \ No newline at end of file diff --git a/content/en/docs/codefuse-query/1_abstract.en.md b/content/en/docs/codefuse-query/1_abstract.en.md new file mode 100644 index 0000000..55c5f2c --- /dev/null +++ b/content/en/docs/codefuse-query/1_abstract.en.md @@ -0,0 +1,26 @@ +--- +title: Abstract +slug: Abstract +description: 介绍主要功能 +url: "docs/abstract" +aliases: +- "/docs/abstract" +--- + +# Abstract +With the increasing popularity of large-scale software development, the demand for scalable and adaptable static code analysis techniques is growing. Traditional static analysis tools such as Clang Static Analyzer (CSA) or PMD have shown good results in checking programming rules or style issues. However, these tools are often designed for specific objectives and are unable to meet the diverse and changing needs of modern software development environments. These needs may relate to Quality of Service (QoS), various programming languages, different algorithmic requirements, and various performance needs. For example, a security team might need sophisticated algorithms like context-sensitive taint analysis to review smaller codebases, while project managers might need a lighter algorithm, such as one that calculates cyclomatic complexity, to measure developer productivity on larger codebases. + +These diversified needs, coupled with the common computational resource constraints in large organizations, pose a significant challenge. Traditional tools, with their problem-specific computation methods, often fail to scale in such environments. This is why we introduced CodeQuery, a centralized data platform specifically designed for large-scale static analysis. +In implementing CodeQuery, we treat source code and analysis results as data, and the execution process as big data processing, a significant departure from traditional tool-centric approaches. We leverage common systems in large organizations, such as data warehouses, data computation facilities like MaxCompute and Hive, OSS object storage, and flexible computing resources like Kubernetes, allowing CodeQuery to integrate seamlessly into these systems. This approach makes CodeQuery highly maintainable and scalable, capable of supporting diverse needs and effectively addressing changing demands. Furthermore, CodeQuery's open architecture encourages interoperability between various internal systems, facilitating seamless interaction and data exchange. This level of integration and interaction not only increases the degree of automation within the organization but also improves efficiency and reduces the likelihood of manual errors. By breaking down information silos and fostering a more interconnected, automated environment, CodeQuery significantly enhances the overall productivity and efficiency of the software development process. +Moreover, CodeQuery's data-centric approach offers unique advantages when addressing domain-specific challenges in static source code analysis. For instance, source code is typically a highly structured and interconnected dataset, with strong informational and relational ties to other code and configuration files. By treating code as data, CodeQuery can adeptly handle these issues, making it especially suitable for use in large organizations where codebases evolve continuously but incrementally, with most code undergoing minor changes daily while remaining stable. 
CodeQuery also supports use cases like code-data based Business Intelligence (BI), generating reports and dashboards to aid in monitoring and decision-making processes. Additionally, CodeQuery plays an important role in analyzing training data for large language models (LLMs), providing deep insights to enhance the overall effectiveness of these models. + +In the current field of static analysis, CodeQuery introduces a new paradigm. It not only meets the needs of analyzing large, complex codebases but is also adaptable to the ever-changing and diversified scenarios of static analysis. CodeQuery's data-centric approach gives it a unique advantage in dealing with code analysis issues in big data environments. Designed to address static analysis problems in large-scale software development settings, it views both source code and analysis results as data, allowing it to integrate flexibly into various systems within large organizations. This approach not only enables efficient handling of large codebases but can also accommodate various complex analysis needs, thereby making static analysis work more effective and accurate. + +The characteristics and advantages of CodeQuery can be summarized as follows: + +- **Highly Scalable**: CodeQuery can handle large codebases and adapt to different analysis needs. This high level of scalability makes CodeQuery particularly valuable in large organizations. +- **Data-Centric**: By treating source code and analysis results as data, CodeQuery's data-centric approach gives it a distinct edge in addressing code analysis problems in big data environments. +- **Highly Integrated**: CodeQuery can integrate seamlessly into various systems within large organizations, including data warehouses, data computation facilities, object storage, and flexible computing resources. This high level of integration makes the use of CodeQuery in large organizations more convenient and efficient. +- **Supports Diverse Needs**: CodeQuery can process large codebases and accommodate various complex analysis needs, including QoS analysis, cross-language analysis, algorithmic needs, and performance requirements. + +CodeQuery is a powerful static code analysis platform, suitable for large-scale, complex codebase analysis scenarios. Its data-centric approach and high scalability give it a unique advantage in the modern software development environment. As static code analysis technology continues to evolve, CodeQuery is expected to play an increasingly important role in this field. \ No newline at end of file diff --git a/content/en/docs/codefuse-query/2_introduction.en.md b/content/en/docs/codefuse-query/2_introduction.en.md new file mode 100644 index 0000000..a4475e3 --- /dev/null +++ b/content/en/docs/codefuse-query/2_introduction.en.md @@ -0,0 +1,118 @@ +--- +title: Introduction +slug: Introduction +description: 介绍主要功能 +url: docs/codefuse-query-introduction +aliases: +- "/docs/codefuse-query-introduction" +--- + +# Introduction +CodeFuse-Query is a code data platform that supports structured analysis of various programming languages. The core idea is to transform all code into data using various language parsers and to store this data in a structured format within a code database. 
Data analysis is then performed according to business needs using a custom query language, as shown in the diagram below: +![image.png](/images/codefuse-query/introduction01.png) + +## 2.1 Architecture of CodeFuse-Query +Overall, the CodeFuse-Query code data platform is divided into three main parts: the code data model, the code query DSL (Domain-Specific Language), and platform productization services. The main workflow is illustrated in the following diagram: +![image.png](/images/codefuse-query/introduction02.png) + +### Code Datafication and Standardization: COREF +We have defined a model for code datafication and standardization called COREF, which requires all code to be converted to this model through various language extractors. +COREF mainly includes the following information: +**COREF** = AST (Abstract Syntax Tree) + ASG (Abstract Semantic Graph) + CFG (Control Flow Graph) + PDG (Program Dependency Graph) + Call Graph + Class Hierarchy + Documentation (Documentation/Commentary Information) +Note: As the computational complexity of each type of information varies, not all languages' COREF information includes all of the above. The basic information mainly includes AST, ASG, Call Graph, Class Hierarchy, and Documentation, while other information (CFG and PDG) is still under development and will be gradually supported. +### Code Query DSL +Based on the generated COREF code data, CodeFuse-Query uses a custom DSL language called **Gödel** for querying, thereby fulfilling code analysis requirements. +Gödel is a logic-based reasoning language, whose underlying implementation is based on the logical reasoning language Datalog. By describing "facts" and "rules," the program can continuously derive new facts. Gödel is also a declarative language, focusing more on describing "what is needed" and leaving the implementation to the computational engine. +Since code has already been converted to relational data (COREF data stored in the form of relational tables), one might wonder why not use SQL directly, or use an SDK instead of learning a new DSL language. Because Datalog's computation is monotonic and terminating. Simply put, Datalog sacrifices expressiveness to achieve higher performance, and Gödel inherits this feature. + +- Compared to SDKs, Gödel's main advantage is its ease of learning and use. As a declarative language, users do not need to focus on intermediate computations and can simply describe their needs as they would with SQL. +- Compared to SQL, Gödel's advantages are stronger descriptive capabilities and faster computation speed, for example, describing recursive algorithms and multi-table joint queries, which are difficult for SQL. +### Platformization and Productization +CodeFuse-Query includes the **Sparrow CLI** and the online service **Query Centre**. Sparrow CLI contains all components and dependencies, such as extractors, data models, compilers, etc., and users can completely generate and query code data locally using Sparrow CLI (for how to use Sparrow CLI, please see Section 3: Installation, Configuration, Running). If users have online query needs, they can use the Query Centre to experiment. +## 2.2 Languages Supported by CodeFuse-Query for Analysis +As of October 31, 2023, CodeFuse-Query supports data analysis for 11 programming languages. 
Among these, support for 5 languages (Java, JavaScript, TypeScript, XML, Go) is very mature, while support for the remaining 6 languages (Objective-C, C++, Python3, Swift, SQL, Properties) is in beta and has room for further improvement. The specific support status is shown in the table below: + +| Language | Status | Number of Nodes in the COREF Model | +| ------------- | ------ | ---------------------------------- | +| Java | Mature | 162 | +| XML | Mature | 12 | +| TS/JS | Mature | 392 | +| Go | Mature | 40 | +| OC/C++ | Beta | 53/397 | +| Python3 | Beta | 93 | +| Swift | Beta | 248 | +| SQL | Beta | 750 | +| Properties | Beta | 9 | + +Note: The maturity level of the language status above is determined based on the types of information included in COREF and the actual implementation. Except for OC/C++, all languages support complete AST information and Documentation. For example, COREF for Java also supports ASG, Call Graph, Class Hierarchy, and some CFG information. +## 2.3 Use Cases of CodeFuse-Query +### Querying Code Features +A developer wants to know which String type variables are used in Repo A, so they write a Gödel script as follows and submit it to the CodeFuse-Query system for results. +```rust +// script +use coref::java::* + +fn out(var: string) -> bool { + for(v in Variable(JavaDB::load("coref_java_src.db"))) { + if (v.getType().getName() = "String" && var = v.getName()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` +Similar needs: Queries for classes, functions, variables, return values, call graphs, class inheritance, etc. + +### Outputting Static Analysis Capabilities +A security team member sets up **a system** to cross-verify that log data and code data are consistent. To complete a certain analysis task, they plan to derive static data D1 through Gödel queries, merge with dynamic data D2, and combine analysis to reach conclusion C. After verifying the technical feasibility on CodeFuse-Query, they integrate the system using the standard API provided by CodeFuse-Query. +Similar needs: Using static analysis as a system checkpoint, improving testing efficiency, merging the analyzed data into a documentation. +### Code Rule Checker +A team lead finds that the team often introduces similar bugs, Bug A, **and decides to establish a code rule and its checker** to be applied during CodeReview. After writing an analysis query on the CodeFuse-Query platform and testing that it meets requirements, they codify the query as a code rule and roll it out to the CodeReview/CI phase. Since then, this bug has never occurred again. +Similar needs: Writing static defect scanning rules to intercept code risks. +### Analyzing Code Characteristics +A developer from the R&D department wants to know the current proportion of Spring and Spring Boot projects in the code repository to quantify the promotion of the new framework. By writing a Gödel Query to describe different project analysis features, they **queried 110,000 code repositories at once** and obtained all the code data after a few dozen minutes, happily moving on to their KPIs. +Similar needs: Application profiling, code profiling, architectural analysis. +### Getting Statistical Data +A researcher finds that traditional code complexity metrics struggle to accurately measure the complexity of the code. Inspired by international advanced experiences and a moment of insight, they design a set of complexity metrics and algorithms. 
After implementing it with Gödel and finding it already highly performant with little optimization, they quickly apply it to over 10 languages and more than 110,000 repositories. They now have an in-depth understanding of the overall complexity of the code repositories, unlike before when they had to parse the code and analyze the syntax tree themselves, **which is so much more convenient**. +Similar needs: Code statistics, code metrics, algorithm design, academic research. +### Architectural Analysis +An architect recently promoted a new message middleware based on txt files, and existing analysis platforms couldn't support analyzing dependencies in such systems. By quickly modeling the message format with Gödel, they soon obtain the dependency relationships between different components in the system. +Similar needs: System overview, architecture governance, lineage analysis. +### Model Validation +A developer designs a system that requires users to play games before claiming coupons. They describe **the model's validation logic** with Gödel, then use the CodeFuse-Query system to **ensure that both current and future system implementations** fully comply with the model. No longer worried about potential financial losses from the game! +Similar needs: System verification, network validation, permission verification. +## 2.4 Application Areas of CodeFuse-Query +Currently, CodeFuse-Query at Ant Group already supports **CodeFuse large language model data cleaning**, **code metrics evaluation**, **R&D risk control**, **privacy security analysis**, **code intelligence**, **terminal package size management**, and other scenarios with implemented applications, serving over a million monthly calls. +![image.png](/images/codefuse-query/introduction03.png) + +### High-Quality Code Data Cleaning - CodeFuse Large Code Model +The CodeFuse Large Code Model is a model by Ant Group for handling code-related issues and has been open-sourced. For the CodeFuse large language model, the quality of the training data directly affects the model's inference results. Low-quality code data can directly contaminate the language model's output, for example: the model might learn incorrect code patterns, generating erroneous code; if the data only contains code in a single programming language, the model might not adapt well to code in other languages. +To control the quality of code data entering the model and thereby improve the model's inferencing capabilities, we have drawn upon the Ant Group program analysis team's years of practical experience coupled with industry consensus to clarify the definition of high-quality code. We have also implemented automated, large-scale code data cleaning using existing program analysis technologies. +CodeFuse-Query provides the following data cleaning capabilities for the CodeFuse Large Code Model: + +- High-quality code data cleaning: Clean code data, including vulnerability scanning for 7 languages (Python, Java, JavaScript, TypeScript, Go, C, C++), filtering by language type/star number, filtering out data with 0 valid lines of code, etc. We have currently accumulated about **2TB** of cleaned code data from GitHub and internally at Ant Group. +- Code Profiling: Implements high-performance, multi-dimensional automatic tagging for large-scale code, supporting **10** languages (Java, Scala, Kotlin, JavaScript, JSX, TypeScript, TSX, Vue, Python, Go), **77** common tags, **40** Ant-specific tags, totaling **117** tags. The current auto-tagging performance can reach **40MB/s**. 
+- Other Atomic Abilities + - Advanced code feature extraction, including extraction of AST (Abstract Syntax Tree), DFG (Data Flow Graph), etc. The AST information has been used for SFT training with about 97% accuracy. + - Code snippet identification, used for extracting code from text data, convenient for formatting or adding Markdown: + - Text extraction of code: Extracting code block information from text, parsing main languages, function and class definitions, only verifying a binary problem, that is, verifying whether the text contains code blocks with about 83% accuracy. + - Identifying the programming language of a code snippet: Identifying the programming language of any code snippet, supporting 30+ languages, with about 80% accuracy. + - Code comment pair extraction: Supports extracting method-level comment-code pair information, covering **15** most popular languages on GitHub, used for Text To Code/Code To Text SFT training. +### Code Data Metrics - Guangmu +Guangmu is an internal product at Ant Group aimed at different R&D personnel and team managers, providing objective data and analysis results to assess code capabilities. +Guangmu offers individual code capability assessment reports, daily code capability metric data analysis, team code capability management, and code excellence award displays, all aimed at helping Ant Group's R&D engineers continuously improve code quality, reduce code debt, and enhance R&D efficiency in the long run. +CodeFuse-Query provides Guangmu with two types of capabilities: + +- Code Evaluation Metrics: Code complexity, code annotation rate, standard development volume, etc. +- Code Excellence Metrics: Code reuse degree. +### Change Analysis - Youku Server-Side R&D Efficiency +The Youku Quality Assurance team started exploring server-side precision testing in 2023. After six months of technical sedimentation and system building, they established a precision testing system capable of **change content identification, change impact analysis, testing capability recommendation, and test coverage assessment**. +In this process, CodeFuse-Query can provide capabilities including: + +- Analyzing the impacted objects based on code change content (file + line number): methods, entry points (HTTP entry, HSF entry), call routes (all call routes from the entry to the changed method), database operations (tables, types of operations). +- Enhancing the effectiveness and readiness of change analysis impact by combining the precise analysis capabilities of online dynamic call routes (method routes) and CodeFuse-Query static analysis call routes. + +To date, Youku has integrated all core applications through CodeFuse-Query and based on static analysis data collection, has built a complete server-side code and traffic knowledge base. \ No newline at end of file diff --git a/content/en/docs/codefuse-query/3_install_and_run.en.md b/content/en/docs/codefuse-query/3_install_and_run.en.md new file mode 100644 index 0000000..006d3bd --- /dev/null +++ b/content/en/docs/codefuse-query/3_install_and_run.en.md @@ -0,0 +1,175 @@ +--- +title: QuickStart +slug: QuickStart +description: CodeFuse介绍主要功能 +url: /docs/codefuse-query-quickstart +aliases: +- "/docs/codefuse-query-quickstart" +--- + +# Installation, Configuration, and Running + +## Hardware and Software Requirements + +- Hardware: 4C8G + +- Environment Requirements: Java 1.8 and Python 3.8 or above runtime environments. Please ensure Java and Python executables are available. 
+ +## Sparrow Installation Steps and Guidance + +- The CodeFuse-Query download package is a zip archive that contains tools, scripts, and various files specific to CodeFuse-Query. If you do not have a CodeFuse-Query license, downloading this archive indicates your agreement with the [CodeFuse-Query Terms and Conditions](../LICENSE). +- CodeFuse-Query is currently only supported on Mac and Linux systems. The download links are: (currently, only a sample is given, the official download link will be provided after open-source release) + - Mac: [CodeFuse-Query 2.0.0](https://github.com/codefuse-ai/CodeFuse-Query/releases/tag/2.0.0) + - Linux: [CodeFuse-Query 2.0.0](https://github.com/codefuse-ai/CodeFuse-Query/releases/tag/2.0.0) +- You should always use the CodeFuse-Query bundle to ensure version compatibility. + +### Tips: + +- On Mac systems, directly downloading the package may prompt a verification for the developer. + +![image.png](/images/codefuse-query/macos_cannot_open_godel.png) + +- You can modify the verification in the security settings. + +![image.png](/images/codefuse-query/security_allow_godel_run.png) + +- Click "Allow Anyway." + +- For detailed steps, please refer to the [Mac Official Documentation: How to safely open an app on your Mac](https://support.apple.com/zh-cn/HT202491) + +- Or use the `xattr -d com.apple.quarantine` command to remove the external attribute assigned to CodeFuse-Query by macOS. + +- `xattr -d com.apple.quarantine` is a command-line instruction used to delete a file's `com.apple.quarantine` extended attribute. This attribute is used by the macOS system to mark files or applications downloaded from external sources to ensure security. + +```java +xattr -d com.apple.quarantine path/to/file +``` + +## Configuring and Initializing the CodeFuse-Query Development Environment + +- Unzip using the command line or by simply clicking to unzip. + +- You need to have Java 8 and Python 3.8 or higher runtime environments. + +- After unzipping CodeFuse-Query, you can run the Sparrow process by running the executable in the following ways: + +- By executing `/sparrow-cli/sparrow`, where `` is the folder where you extracted the CodeFuse-Query package. + +- By adding `/sparrow-cli` to your PATH, so you can directly run the executable `sparrow`. + +At this point, you can execute the `sparrow` command. + +## Running + +### Execution Steps + +- Confirm the source code directory you need to query. + +- Extract code data from the source code. + +- Write a Gödel script based on the code data to obtain the desired code data. + +- For how to write Gödel scripts, refer to [GödelScript Query Language](./4_godelscript_language.md) + +### Execution Example + +#### Data Extraction +```java +/sparrow-cli/sparrow database create -s -lang -o +``` + +- ``: The output directory for the code data extracted from the codebase, referred to as `` later. + +- ``: The language of the code to be extracted, fill in "java" for analyzing Java. + +- ``: The source code directory to be scanned. + +- In the data extraction step, you obtain the database `` required for executing the script. 
+ +#### Writing Gödel Scripts + +- Assuming you have the following Gödel script to get all Java method names from a specified repository: + +- For specific Gödel script writing, refer to [GödelScript Query Language](./4_godelscript_language.md) + +```java +// script +use coref::java::* + +// Define the global Java database +fn default_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +// Iterate over all methods, get the method name, output limit +fn getFunctionName(name: string) -> bool { + let (db = default_db()) { + for (method in Method(db)) { + if (name = method.getName()) { + return true + } + } + } +} + + +fn main() { + output(getFunctionName()) +} +``` + +#### Script Execution +```java +/sparrow-cli/sparrow query run -d -gdl -o +``` + +- ``: The code data extracted from the codebase to be scanned, consistent with `` above. + +- ``: The path where the Gödel script is located, fill in the directory path, and it will execute all files ending with `.gdl` in that directory in sequence. + +- ``: The output directory path, the result of executing xxx.gdl will be stored in `/xxx.json` in JSON format. + +- You can verify if the script executed correctly by checking the data file. + +#### Example + +Suppose there is the following Java code: + +```java +public class HelloWorld { + public static void main(String[] args) { + HelloWorld tmp = new HelloWorld(); + String hello = tmp.getHello(); + String world = tmp.getWorld(); + System.out.println(hello + " " + world); + } + + public String getHello() { + return "Hello"; + } + + public String getWorld() { + return "World"; + } +} + +``` + +```java +sparrow database create -s -lang java -o ./db/ +sparrow query run -d ./db/ -gdl example.gdl -o ./ +``` + +- `` is the directory where the given Java file is stored. + +- example.gdl is the given Gödel script sample, saved in the current directory. + +- After execution, you can find the example.json file in the current directory. 
+ +The corresponding script output JSON file content is as follows: +```java +[{"name": "getHello"}, +{"name": "getWorld"}, +{"name": "main"}] + +``` \ No newline at end of file diff --git a/content/en/docs/codefuse-query/4_godelscript_language.en.md b/content/en/docs/codefuse-query/4_godelscript_language.en.md new file mode 100644 index 0000000..daf7e22 --- /dev/null +++ b/content/en/docs/codefuse-query/4_godelscript_language.en.md @@ -0,0 +1,2295 @@ +--- +title: GodelLanguage +slug: GodelLanguage +description: CodeFuse介绍主要功能 +url: /docs/codefuse-query-godellanguage +aliases: +- "/docs/codefuse-query-godellanguage" +--- + + +# GödelScript Query Language + +## Index + +- [GödelScript Basic Concepts and Syntax](#gödelscript-basic-concepts-and-syntax) + - [Introduction](#introduction) + - [Basic Program Structure](#basic-program-structure) + - [Fundamental Types and Compiler Built-in Functions](#fundamental-types-and-compiler-built-in-functions) + - [Functions](#functions) + - [Statements](#statements) + - [Schema](#schema) + - [Database](#database) + - [Trait](#trait) + - [Import](#import) + - [Query](#query) + - [Ungrounded Error: Unassigned/Unbound Error](#ungrounded-error-unassignedunbound-error) +- [Query Examples](#query-examples) + - [Java](#java) + - [Python](#python) + - [JavaScript](#javascript) + - [XML](#xml) + - [Go](#go) +- [Query Debugging and Optimization Tips](#query-debugging-and-optimization-tips) + - [Schema Arguments Causing Excessively Large Cartesian Products](#schema-arguments-causing-excessively-large-cartesian-products) + - [Multiple Layers of `for` Causing Excessively Large Cartesian Products](#multiple-layers-of-for-causing-excessively-large-cartesian-products) + - [Avoid Misusing `@inline` and Strategies for Necessary Inline Optimization](#avoid-misusing-inline-and-strategies-for-necessary-inline-optimization) +- [Using Query Scripts on a Local Machine](#using-query-scripts-on-a-local-machine) + +## Basic Concepts and Syntax of GödelScript + +### Introduction + +```rust +// script +fn hello(greeting: string) -> bool { + return greeting = "hello world!" +} + +fn main() { + output(hello()) +} +``` + +GödelScript, the Gödel query language, is a domain-specific language (DSL) for querying and data processing used by CodeQuery. GödelScript uses syntax similar to Rust, providing strict type checking, convenient type inference, and user-friendly error messages, allowing users to get started quickly. + +Main use cases for the GödelScript compiler include: + +1. Writing simple or complex queries for users, offering more convenient syntax to improve query writing efficiency. +2. Providing strict type checking and type inference, offering smarter code modification suggestions. +3. Offering strict [ungrounded](#ungrounded-error) detection to avoid triggering the common Soufflé Ungrounded Error. +4. Support for Language Server and IDE Extension. 
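As a small illustration of point 3 above, here is a hedged sketch of the kind of mistake the ungrounded check is meant to catch; the function names are illustrative, and the exact behaviour is described in the Ungrounded Error section later in this document.

```rust
// script
// Does NOT compile: the parameter `greeting` is never bound inside the body,
// so the compiler reports an ungrounded (unassigned/unbound) error instead of
// letting it surface later as a Soufflé error.
fn bad(greeting: string) -> bool {
    return true
}

// Compiles: `greeting` is bound by the `=` constraint.
fn good(greeting: string) -> bool {
    return greeting = "hello world!"
}

fn main() {
    output(good())
}
```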
+ +### Basic Program Structure + +#### Program Structure + +A GödelScript program may include: + +- [Module and symbol import statements](#import) +- [Schema type declarations](#schema) +- [Database type declarations](#database) +- [Trait declarations](#trait) +- [Method implementations](#method-implementation) +- [Function declarations and implementations](#function) +- [Query declarations](#query) + +An example containing all the above components: + +```rust +// script +// Package import/symbol import +use coref::java::* // Import all symbols +use coref::java::{JavaDB, Class} // Selective symbol import + +// Function declaration +fn default_db() -> JavaDB { + return JavaDB::load("example.db") +} + +// Schema declaration +schema File { + @primary id: int +} + +// Database declaration +database NewDB { + file: *File +} + +// Trait declaration +trait FileTrait { + fn getId(self) -> int; +} + +// Impl trait for +impl FileTrait for File { + fn getId(self) -> int { + return self.id + } +} + +// Impl +impl File { + @data_constraint + fn all() -> *File { + yield File {id: 1} + yield File {id: 2} + } +} + +// Query +query get_all_anno from + Annotation anno in Annotation(default_db()) +select + anno.id as id +``` + +#### Comments + +GödelScript uses comment syntax similar to C-like languages. + +```rust +// Single line comment + +/* +* 1. Multi-line comment +* 2. Multi-line comment +*/ +``` + +#### The `main` Function + +A GödelScript query script can include a `main` function, which has no return value. If the `main` function is not implemented and no query declarations are written, the program will not produce any output. + +For more details, please refer to [main function](#gödelscript-main-function). + +```rust +fn main() { + output(query_1()) + output(query_2()) +} +``` + +### Basic Types and Built-in Compiler Functions + +GödelScript includes basic types `int`, `string`, and `bool`. `bool` is a basic type but cannot be stored as a value. + +#### `int` Type Native Functions + +| Function | Type | Explanation | +| --- | --- | --- | +| pow | (int, int) -> int | Exponentiation. Arguments must be non-negative numbers. | +| rem | (int, int) -> int | Remainder operation. | +| bitand | (int, int) -> int | Bitwise conjunction. | +| bitor | (int, int) -> int | Bitwise disjunction. | +| bitxor | (int, int) -> int | Bitwise exclusive disjunction. | +| bitnot | (int) -> int | Bitwise negation. | +| neg | (int) -> int | Arithmetic negation. | +| to_string | (int) -> string | Conversion to a string. | +| add | (int, int) -> int | Addition (+). | +| sub | (int, int) -> int | Subtraction (-). | +| mul | (int, int) -> int | Multiplication (*). | +| div | (int, int) -> int | Division (/). | +| eq | (int, int) -> bool | Equality (=). | +| ne | (int, int) -> bool | Inequality (!=). | +| gt | (int, int) -> bool | Greater than (>). | +| ge | (int, int) -> bool | Greater than or equal to (>=). | +| lt | (int, int) -> bool | Less than (<). | +| le | (int, int) -> bool | Less than or equal to (<=). | +| to_set | (int) -> *int | Cast to a set type. | + +#### `string` Type Native Functions + +| Function | Type | Explanation | +| --- | --- | --- | +| len | (string) -> int | Gets the length of a string. | +| substr | (string, int, int) -> string | Substring extraction using initial index and length. | +| contains | (string, string) -> bool | Checks if one string is contained within the current string. | +| matches | (string, string) -> bool | Checks if a regular expression fully matches the current string. 
| +| get_regex_match_result | (string, string, int) -> string | Gets a capture result from a full regex match on the current string, determined by the second parameter (int). For example, "abcdef".get_regex_match_result("a(.*)f", 1) yields "bcde". | +| to_int | (string) -> int | Converts to an integer. | +| add | (string, string) -> string | String concatenation. | +| eq | (string, string) -> bool | Checks string equality. | +| ne | (string, string) -> bool | Checks string inequality. | +| to_set | (string) -> *string | Cast to a set type. | + +#### `bool` Type Native Functions + +While `bool` exists as a basic type, it cannot be used as data in intermediate calculations, only as a conditional result. + +| Function | Type | Explanation | +| --- | --- | --- | +| not | (bool) -> bool | Logical negation. | +| and | (bool, bool) -> bool | Logical conjunction. | +| or | (bool, bool) -> bool | Logical disjunction. | +| eq | (bool, bool) -> bool | Equality. | +| ne | (bool, bool) -> bool | Inequality. | + +#### Native Functions for Sets + +| Function | Type | Explanation | +| --- | --- | --- | +| len | (*T) -> int | Gets the count of a data set. | +| max | (*int) -> int | Finds the maximum value. | +| min | (*int) -> int | Finds the minimum value. | +| sum | (*int) -> int | Summation of the values. | +| find | (*T0) -> T1 | Finds a data entry from a set using a primary key. | + +#### Global Native Functions + +| Function | Type | Explanation | +| --- | --- | --- | +| output | ((...) -> bool) -> | Outputs query content. | + +#### Database Native Functions + +| Function | Type | Explanation | +| --- | --- | --- | +| load | (string) -> T | Loads the database. | + +#### Schema Native Functions + +| Function | Type | Explanation | +| --- | --- | --- | +| to | (self) -> T | Converts to another schema type, using duck typing. | +| is | (self) -> bool | Determines if it can be another schema type, using duck typing. If the schema has a primary key, the underlying check will only use the primary key to determine compatibility. | +| key_eq | (self, T) -> bool | Checks if the primary keys of two schema instances are equal. | +| key_neq | (self, T) -> bool | Checks if the primary keys of two schema instances are **not** equal. | + +Schema native function example: + +```rust +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +fn example() -> bool { + for(stmt in StatementParent(default_java_db())) { + if (stmt.is()) { + return true + } + } +} + +fn convert() -> *ElementParent { + for(stmt in StatementParent(default_java_db())) { + yield stmt.to() + } +} +``` + +### Functions + +#### The `main` Function of GödelScript + +The main function is the only function in GödelScript that does not declare a return type. The main function only allows the use of output, and other statements will result in a compilation error. Using output(...) multiple times can output multiple query results, which will be displayed in separate tables, with the table names corresponding to the names of the query functions called within output. + +#### Query Functions + +Query functions are recommended to have a `bool` return type and need to use `output()` to output query results. + +The query functions called within `output()` are no longer invoked in the conventional manner of passing arguments to functions. At this point, the parameter list changes to represent the table schema of the output table. Here are two examples of how query functions are applied: + +1. 
Single-table `output` + + A single-table `output` specifically refers to using `output` only once within the `main` function to produce output. + + ```rust + fn example(a: int, b: string) -> bool {...} + + fn main() { + output(example()) // At this point, the parameter list becomes the output table schema and requires no arguments + } + ``` + + The corresponding output table schema would be: + + ```json + [ + {"a": 0, "b": "xxx"}, + {"a": 1, "b": "xxx"} + ] + ``` + +2. Multi-table `output` + + A multi-table `output` refers to using `output` multiple times within the `main` function to produce output. In this case, the output data will include corresponding table names. + + ```rust + fn example0(a: int, b: string) -> bool {...} + fn example1(a: string, b: int) -> bool {...} + + fn main() { + output(example0()) + output(example1()) + } + ``` + + The corresponding output table schema would be: + + ```json + { + "example0":[ + {"a": 0, "b": "xxx"}, + {"a": 1, "b": "xxx"} + ], + "example1":[ + {"a": "xxx", "b": 0}, + {"a": "xxx", "b": 1} + ] + } + ``` + +Below is a more detailed example where we directly construct two sets of data for output. In the following code, note that: + +1. In GödelScript, boolean values can be represented with the keywords `true` and `false`. + +2. The `=` symbol in GödelScript is quite special and should not be interpreted in the same way as in conventional programming languages. GödelScript is a Datalog language. Here, the `=` symbol carries dual semantics: both __assignment__ and __equality comparison__. Details can be found in [`=` operator](#assignment-and-equality-comparison-operator). + +3. In the conditional statements of this example, both `a` and `b` use the assignment semantics of `=`, because the `int` and `string` type parameters are considered `ungrounded (unassigned/unbound)` within the function body and must be assigned before they can be used. + +4. The return value of the `=` assignment statement is `true`. + +```rust +fn example(a: int, b: string) -> bool { + // The = symbol serves both assignment and comparison purposes, depending on whether the left-hand value has been "assigned" + // Here, the = symbols for a and b are used with assignment semantics + if (a = 1 && b = "1") { + // GödelScript uses the keywords true and false to represent boolean values + return true + } + if (a = 2 && b = "2") { + return true + } +} + +fn main() { + output(example()) +} +``` + +The expected output should be: + +```json +[ + {"a": 1, "b": "1"}, + {"a": 2, "b": "2"} +] +``` + +#### Regular Functions + +Regular functions are used to encapsulate complex processes, and these functions must have a clear return type. +There are two possible return types: + +1. A single return value, followed by a declaration of the return type after the arrow. + +```rust +fn getFile(c: Class) -> File { + return c.getRelativePath() +} +``` + +2. A set of return values, the return type after the arrow needs to be prefixed with `*` to indicate it's a set. + +```rust +fn getAllFiles(db: JavaDB) -> *File { + for (f: File in File(db)) { + yield f + } +} +``` + +Generally, `return` is used for functions with a single return value, while `yield` is used for functions returning a set. +In practice, since GödelScript uses the Datalog engine underneath, all operations are based on sets; a single return value actually only means that the returned set may contain only one data item, but it could also contain multiple items. 
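+
+To make the contrast concrete, here is a minimal sketch that reuses the `coref::java` names from the examples later in this document (the helper names `className` and `allClassNames` are illustrative only):
+
+```rust
+// script
+use coref::java::*
+
+fn default_java_db() -> JavaDB {
+    return JavaDB::load("coref_java_src.db")
+}
+
+// Single return value: one class name per Class row in the underlying set
+fn className(c: Class) -> string {
+    return c.getQualifiedName()
+}
+
+// Set return value: the * prefix marks a collection, produced with yield
+fn allClassNames() -> *string {
+    for (c in Class(default_java_db())) {
+        yield c.getQualifiedName()
+    }
+}
+```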
+ +### Statements + +#### `for` Statement: Declaring Variables from a Set + +GödelScript uses the `for` keyword and syntax similar to loop statements to declare variables from a set: + +```rust +for(f: File in getAllFiles()) { + ... +} +``` + +The type after the colon for `f: File` can be omitted. +The `for` statement allows the direct definition of multiple variables, where subsequent variables can use all previously defined variables in the same statement during initialization: + +```rust +for(a in XmlAttribute(db), b in XmlAttribute(db), c in XmlElement(db)) { + ... +} + +for(a in getAllFiles(), b in a.getAllPaths()) { + ... +} +``` + +#### `let` Statement: Declaring a Single Variable + +GödelScript uses the `let` keyword to declare a single/intermediate variable: + +```rust +let(f: File = c.getRelativePath()) { + ... +} +``` + +The type after the colon for `f: File` can be omitted. +The `let` statement allows the direct definition of multiple variables, where subsequent variables can use all previously defined variables in the same statement during initialization: + +```rust +let(a = 1, b = a + 1, c = b + 1) { + ... +} +``` + +#### `if` Statement + +Conditional statements in GödelScript are similar to many procedural programming languages: + +```rust +if (f.getName().contains("util") || f.getName().contains("com")) { + ... +} +``` + +Conditions can be connected using logical operators: `!` for NOT, `||` for OR, and `&&` for AND. + +Comparative operators in conditions: `>` for greater than, `<` for less than, `>=` for greater than or equal to, `<=` for less than or equal to, `=` for equal to or assignment, `!=` for not equal to. + +Regular arithmetic operations can use the following operators: `+` for addition, `-` for subtraction/negation, `*` for multiplication, `/` for division. + +##### Assignment and Equality Comparison Operator `=` + +The `=` symbol in GödelScript carries two different semantics: assignment and equality comparison. The specific semantics need to be discussed based on the context: + +1. Assignment + + Assignment generally occurs with fundamental type variables such as `int` and `string`. These variables, when used as function parameters, are typically considered unassigned. When a function with such variables is called, the parameters passed in actually serve as filtering conditions. + + ```rust + fn example(a: int) -> bool { + // This is somewhat counterintuitive; in procedural languages, this is usually taken to mean a == 1 + // However, in Datalog dialects, each function in Datalog is essentially calculating an intermediate table (view) + // So this function is essentially generating a view with data [{"a": 1}] + return a = 1 // assign a = 1 + } + + fn test() -> bool { + // Although it seems like we are passing a parameter to make a = 2, it's not really the case + // example() itself returns the view: [{"a": 1}] + // Then it is constrained by a = 2, and as you can see, we don't get any result here + // So it returns false + return example(2) // false + } + ``` + +2. Equality Comparison + + For schema types, since each schema type has a universe behind it, schema type parameters in the parameter list are generally considered to have been assigned. For variables that have already been assigned, `=` operates as an equality comparison. 
+ + ```rust + // Declare schema + schema A {...} + + // Implement schema member functions + impl A { + // Here we define the universe for schema A + @data_constraint + pub fn __all__() -> *A {...} + } + + fn example(a: A) -> bool { + for(temp in A::__all__()) { + if (a = temp) { + return true + } + } + } + ``` + + Similarly, for internally declared `int` or `string` with initial values, `=` also operates as an equality comparison. + + ```rust + fn example() -> bool { + let (a = 1) { // assign a = 1 + if (a = 1) { // compare a = 1 + return true + } + } + } + ``` + +#### `match` Statement + +GödelScript allows writing `match` statements for `int` and `string` types. A `match` statement is similar to a `switch` statement with multiple conditional branches, and the conditions in the `match` must be literals: + +```rust +match(a) { + 1 => return 0, + 2 => return 1, + 3 => if (a + 1 < 10) { + return 10 + } +} +``` + +#### Return Statements + +GödelScript uses `return` and `yield`. `return` is for functions with a single return value, and `yield` is for returning sets. + +```rust +fn a() -> int { + return 0 +} + +fn b() -> *int { + yield 1 + yield 2 + yield 3 +} +``` + +### Schema + +Schema is a structure for complex data tables in GödelScript. + +#### Structure Declaration + +GödelScript uses the `schema` keyword to declare a table structure: + +```rust +schema File { + id: int, + name: string +} +``` + +If a field exists as a primary key in the database, you can use the `@primary` annotation to indicate that it's a primary key: + +```rust +schema File { + @primary id: int, + name: string +} +``` + +**Table structures with a primary key significantly improve query speed, so try to bind a primary key, preferably of type `int`.** + +#### Method Implementation + +GödelScript declares and implements methods related to `schema` as follows: + +```rust +impl File { + // Static method + fn f1() -> ... {...} + // Member method, the first argument must be self + fn f2(self) -> ... {...} + ... +} +``` +##### Static Methods + +Static methods do not require `self` as the first argument and are straightforward to use: `ClassName::MethodName(...)`. + +```rust +impl File { + fn getSchemaName() -> string { + return "File" + } +} + +fn out(t: string) -> bool { + if (t = File::getSchemaName()) { + return true + } +} +``` + +##### Member Methods + +The first argument for member methods must be `self`, without specifying its type. These functions are called using `InstanceName.FunctionName(...)`. + +```rust +impl File { + fn getName(self) -> string { + return self.name + } +} + +fn out(path: string) -> bool { + let (db = JavaDB::load("coref_java_src.db")) { + for (f in File::__all__(db)) { + if (path = f.getName()) { + return true + } + } + } +} +``` + +##### Data Loading Method `fn __all__(db)` + +A `schema` can contain a special **static method** for loading its dataset from the database. + +```rust +impl File { + @data_constraint + fn __all__(db: JavaDB) -> *File { + ... + } +} +``` + +This method must contain the special annotation `@data_constraint`, indicating that it is specialized for loading. Without this annotation, the method will return an **empty set**. The return type must be a set of itself. + +A `schema` that includes this method can use syntactic sugar to get its full set: + +```rust +fn out() -> bool { + for(f in File(JavaDB::load("..."))) { + ... + } + ... +} +// Equivalent to +fn out() -> bool { + for(f in File::__all__(JavaDB::load("..."))) { + ... + } + ... 
+} +``` + +##### Custom Full Set Method + +A `schema` allows using static methods with different names than `__all__` to indicate that some sets also exist within its full set. This method must also contain the special annotation `@data_constraint`. This method is generally used to manually add some data to the full set of that type. + +```rust +impl File { + @data_constraint + fn extend_example() -> *File { + yield File {id: 1234567} + } +} +``` + +#### Constructing Anonymous Instances + +GödelScript allows for the creation of anonymous instances with a specific syntax. The creation of anonymous instances is contingent on the instance existing within the full set of the `schema`, unless this usage appears within a `@data_constraint` method, in which case the result will be empty. + +```rust +schema A { + @primary id: int, + name: string +} +``` + +The corresponding syntax to create an anonymous instance is as follows: + +```rust +A {id: 1, name: "first"} +``` + +#### Schema Inheritance + +Schema inheritance in GödelScript is very straightforward, exemplified as follows: + +```rust +schema MyFile extends File {} +``` + +##### Parent Field Inheritance + +The subclass will inherit all fields from the parent class by default, so there is no need to manually rewrite them. + +```rust +schema File { + @primary id: int, + name: string +} + +schema MyFile extends File {} +``` + +##### Parent Method Inheritance + +The subclass will inherit all methods from the parent class by default, except for those marked with `@data_constraint`. There is no need to manually rewrite them. However, the `__all__` method is special and will not be inherited, so you need to rewrite the `__all__` method to determine the full set of the inherited schema. + +```rust +schema File { + @primary id: int, + name: string +} + +impl File { + @data_constraint + fn __all__() -> *File {...} + fn getId(self) -> int {...} + fn staticMethod() -> string {return "File"} +} + +schema MyFile extends File {} +``` + +##### Method Override + +If the subclass implementation contains a method with the same name as the parent class, the parent method will be **overridden** by the subclass method. + +```rust +schema File { + @primary id: int, + name: string +} + +impl File { + fn staticMethod() -> string {return "File"} +} + +schema MyFile extends File {} + +impl MyFile { + fn staticMethod() -> string {return "MyFile"} +} +``` + +In this case, `File::staticMethod` is overridden by `MyFile::staticMethod`, so when calling the subclass method, the result obtained will be `"MyFile"`. + +### Database + +#### Database Declaration + +The declaration format for databases is as follows: + +```rust +database DatabaseName { + // table_name corresponds to the real table name in the db + // GodelSchemaType corresponds to the schema in which the table data is stored after reading into godel + table_name : *GodelSchemaType +} +``` + +Before the colon is the **real table name** in the loaded database; after the colon is the **data table format**, which must be a `schema` type. +For example, if a table called `annotation` exists in the db and corresponds to the `Annotation` schema, the declaration would be: + +```rust +database JavaDB { + // Reads data from the db's annotation table and stores it in Annotation + annotation : *Annotation +} +``` + +Additionally, it is necessary to ensure that the `Annotation` structure matches the table structure. 
For example: + +```rust +schema Annotation { + @primary id: int, // The primary annotation indicates that this field is the primary key; a table can also have no primary key + content: string +} +``` + +The `annotation` table must contain `id` and `content` fields with corresponding storage types. + +#### Database Loading + +Database types have a static method `(database)::load(filename: string)` + +```rust +fn loadDatabaseExample() -> bool { + // The string passed to load is the db's filename, not the path + // The db's path will be passed as a command-line argument when executing godel + let (db: JavaDB = JavaDB::load("...")) { + ... + } +} +``` + +#### Data Table Access + +In the example above, to access the `annotation` table: + +```rust +fn getAnnotation() -> Annotation { + // The string passed to load is the db's filename, not the path + // The db's path will be passed as a command-line argument when executing godel + let (db: JavaDB = JavaDB::load("...")) { + // Directly use db.field to access the table data + for (anno: Annotation in db.annotation) { + ... + } + } +} +``` + +### Trait + +#### Trait Declaration + +The syntax for declaring a `trait` is as follows: + +```rust +trait Example { + fn getId(self) -> int; + fn getName(self) -> string; + fn getValueByName(self, name: string) -> string; +} +``` + +#### Impl Trait + +The syntax is similar to `impl`, but you must implement all the functions declared in the `trait` to pass compilation. + +```rust +impl Example for XmlElement { + fn getId(self) -> int {return self.id} + fn getName(self) -> int {return self.name} + fn getValueByName(self, name: string) -> int { + for(attr in XmlAttribute(XmlDB::load("...")) { + if (attr.getName() = name && attr.id = self.getAttribute().id) { + return attr.getValue() + } + } + } +} +``` + +### Import + +GödelScript uses the `use` keyword to import symbols from other files: + +```rust +use coref::java::* // Import all symbols +use coref::xml::Location // Import a single symbol +use coref::xml::{XmlDB, XmlElement} // Import multiple symbols +``` + +#### Module Import Rules + +The GödelScript package manager is enabled when the input parameters include `-p {package dir path}`. + +The package manager will parse the folder structure, traversing all `.gdl` files. After obtaining the relative path of the files, it will map the path to the corresponding package path. If the relative path contains `-`, or if a folder name or filename starts with a digit, the path will not be accepted by the package manager, but it will not issue an error and will simply ignore it. + +If you want to know which paths were ignored, you can use the `-v` parameter. With this parameter, the package manager will report the ignored paths as `warnings`. If there are path conflicts in the mapped paths, the package manager will report them as `errors` and exit the compilation process. 
+ +```rust +packages: + coref::cfamily -> /.../Library/coref.cfamily.gdl + coref::go -> /.../Library/coref.go.gdl + coref::java -> /.../Library/coref.java.gdl + coref::javascript -> /.../Library/coref.javascript.gdl + coref::properties -> /.../Library/coref.properties.gdl + coref::python -> /.../Library/coref.python.gdl + coref::sql -> /.../Library/coref.sql.gdl + coref::xml -> /.../Library/coref.xml.gdl +modules + +--coref -> coref + |--xml -> coref::xml + |--properties -> coref::properties + |--cfamily -> coref::cfamily + |--java -> coref::java + |--javascript -> coref::javascript + |--go -> coref::go + |--sql -> coref::sql + +--python -> coref::python +``` + +#### Path Mapping Example + +```rust +Library +|-- coref.java.gdl +|-- coref.xml.gdl ++-- coref + |-- go.gdl + +-- a + +-- b.gdl +=> +coref::java +coref::xml +coref::go +coref::a::b +``` + +In this example, there is a path conflict: + +```rust +Library +|-- coref +| |-- java.gdl +| +-- python.gdl ++-- coref.python.gdl +=> +coref::java +coref::python -- \ + > Conflict +coref::python -- / +``` + +In this example, there are invalid characters in the path: + +```rust +Library +|-- 0123.gdl +|-- my-godel-lib +| +-- js.gdl ++-- lib-file.123.gdl +=> +0123 +^ The first character is a digit +my-godel-lib::js + ^ ^ Uses the `-` character +lib-file::123 + ^ ^ First character after `.` is a digit, and the path contains `-` +``` + +#### Symbol Conflict + +In use, it's possible to encounter situations with symbol conflicts. In such cases, direct use of `File` will result in a symbol conflict, and you need to specify one of the symbols. + +```rust +use coref::java::Location +use coref::xml::Location +schema MyLoc extends Location {} + ^^^^^^^^ +Error: "Location" is ambiguous, with multiple symbols + "coref::java::Location, coref::xml::Location". +``` + +Like other languages, GödelScript allows specifying a symbol directly through its full path, provided the symbol has been imported. + +```rust +use coref::java::Location +use coref::xml::Location +schema MyLoc extends coref::xml::Location {} +``` + +Full path symbols can be used in the following situations: + +- Schema inheritance + +```rust +schema JavaLocation extends coref::java::Location {} +``` + +- Function parameters and return values + +```rust +fn return_java_file(f: coref::java::File) -> coref::java::File { + ... +} +``` + +- Database declarations + +```rust +database MyDB { + java_file: coref::java::File, + xml_file: coref::xml::File, + java_loc: coref::java::Location, + xml_loc: coref::xml::Location +} +``` + +- Query list type declarations + +```rust +query example from + coref::java::Location loc in coref::java::Location(coref::java::JavaDB::load("...")) +where + ... +select + ... +``` + +- Schema static method calls + +```rust +for(loc in coref::java::Location(coref::java::JavaDB::load("..."))) { + ... +} + +stmt.to() +stmt.is() +``` + +### Query + +Query is used for simple queries and is guaranteed to be output even without declaring a `main` function. The syntax format for query is as follows: + +```rust +query name from + variable in initial value, + variable in initial value, + variable in initial value +where condition +select value as output column name + value as output column name, + value as output column name, + value as output column name +``` + +Variable declarations in the `from` list do not need type annotations, as the compiler will automatically infer them. Additionally, the `select` list does not use `=` but the `in` keyword. 
Also, in the `select` list, an output column name must not conflict with the variables used in the computation, but the column name can be omitted. Omitted column names are given arbitrary names in the output results, so it is best not to omit them.
+
+Here is a `hello world` written in query syntax:
+
+```rust
+query hello_world from
+    info in "hello world"
+select info as greeting
+```
+
+The code above is equivalent to the following code:
+
+```rust
+fn hello_world(greeting: string) -> bool {
+    let (info = "hello world") {
+        if (greeting = info) {
+            return true
+        }
+    }
+}
+fn main() {
+    output(hello_world())
+}
+```
+
+#### Example and Composition Structure
+
+A query consists of a query name, a `from` list, a `where` filter condition, and a `select` list.
+
+```rust
+// script
+use coref::java::{Callable, Class, Interface, JavaDB}
+
+fn db() -> JavaDB {
+    return JavaDB::load("coref_java_src.db")
+}
+
+query class_method from
+    Callable m in Callable(db()),
+    Class c in Class(db())
+where
+    c.id = m.getBelongedClass().id
+select
+    c.getQualifiedName() as className,
+    m.getName() as methodName,
+    m.getSignature() as methodSignature
+```
+
+#### Equivalent Code
+
+The example above is equivalent to the following code:
+
+```rust
+// script
+use coref::java::{Callable, Class, Interface, JavaDB}
+
+fn db() -> JavaDB {
+    return JavaDB::load("coref_java_src.db")
+}
+
+fn main() {
+    output(class_method())
+}
+
+fn class_method(className: string, methodName: string, methodSignature: string) -> bool {
+    for (m in Callable(db()), c in Class(db())) {
+        if (c.id = m.getBelongedClass().id) {
+            if (className = c.getQualifiedName() &&
+                methodName = m.getName() &&
+                methodSignature = m.getSignature()) {
+                return true
+            }
+        }
+    }
+}
+```
+
+### Ungrounded Error
+
+GödelScript treats symbols that are not bound to a set as `ungrounded`. The basic rules are:
+
+- Uninitialized/unused/unbound symbols
+    - Unbound `int`, `string` arguments
+    - Unused database type arguments
+    - Function body has statements, but no return statements
+- Symbols bound within negation blocks
+    - For example, in `!(__tmp = 1)`, `__tmp` is considered unbound
+    - Calling inline functions or data constructors in negation blocks
+
+#### 1. Unused Database/Basic Type Parameters
+
+If any branch of a function body leaves a database or basic-type parameter unused, it will inevitably lead to an `ungrounded` error:
+
+```rust
+fn test(db: JavaDB, a: int, b: string) -> bool {}
+        ^^          ^       ^                  ^^
+Error: ungrounded parameter "db, a, b" in this branch.
+```
+
+The compiler will indicate in which branch a parameter goes unused. Check the corresponding execution path and add the missing parameter constraints based on the hint.
+
+If a function has basic-type parameters that are always called with literals, and `ungrounded` is incorrectly reported, you can add an `@inline` annotation to the function to avoid the incorrect constraint check.
+
+```rust
+impl XXX {
+    @inline
+    fn getValueByAttributeNameByDefaultValue(self, attributeName: string) -> string {
+        if (self.hasAttribute(attributeName)) {
+            return self.getValueByAttributeName(attributeName)
+        }
+        if (!self.hasAttribute(attributeName)) {
+            return "null"
+        }
+    }
+}
+
+fn xxx() -> xx {
+    ..
+    attr.getValueByAttributeNameByDefaultValue("pattern")
+                                               ^^^^^^^^^ Use literals, add @inline to pass the check
+}
+```
+
+#### 2. No Return Statement in Non-Empty Function Body
+
+GödelScript allows an empty function body without any statements.
However, if there are other statements in the function body, GödelScript requires at least one return statement, otherwise an `ungrounded` error will occur. + +```rust +fn test() -> int {} + ^^ No statements, passes compilation + +fn test() -> int { + let (a = 1) {} + ^^^^^^^^^^^^^^ Statements present, no return statement, ungrounded +} +``` + +#### 3. Using Inline Functions or Data Constructors in Negation Blocks + +As mentioned above, `@inline` annotation can be used to circumvent `ungrounded` errors. However, if inline functions are used in negation blocks, it will inevitably result in `ungrounded` errors. + +Similarly, data constructors are used to bind temporary intermediate variables, but this will directly result in `ungrounded` errors. +Therefore, using inline functions or data constructors in negation blocks will inevitably lead to `ungrounded` errors, and the compiler will report errors for all such cases. + +```rust +if (!check(method.to())) { + ^^^^^^^^^^^^^^^^^^^^^^^^^^ ungrounded +} +if (!check(ElementParent {id: 0})) { + ^^^^^^^^^^^^^^ ungrounded +} + +@inline +fn for_test() -> ElementParent { + ... +} +if (!check(for_test())) { + ^^^^^^^^^^ Negation block contains inline function, ungrounded +} +``` + +#### 4. Negation of Chained Calls + +GödelScript does not perform `ungrounded` checks for negation of chained calls, but this writing will cause an `ungrounded` error in Soufflé: + +```rust +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +fn get_field() -> *Field { + for (field in Field(default_java_db())) { + if (!field.getLocation().getFile().getRelativePath().contains("/test/")) { + yield field + } + } +} +``` + +Where: + +```rust +!field.getLocation().getFile().getRelativePath().contains("/test/") +``` + +It will be translated to a Soufflé code fragment like this: + +```rust +!(__tmp = field, Field_getLocation(__tmp, __tmp_1), ..., contains("/test/", __tmp_4)) + ^^^^^ ^^^^^^^ +``` + +The variables used for intermediate storage being bound in `!(...)` but due to the negation operator, this binding is considered hypothetical. However, `__tmp`, `__tmp_1` are then considered to be variables declared for the entire statement scope, leading to `ungrounded`. 
+ +This can be avoided by declaring intermediate variables to catch intermediate results in a negation operation: + +```rust +fn get_field() -> *Field { + for (field in Field(default_java_db())) { + let (path = field.getLocation().getFile().getRelativePath()) { + if (!path.contains("/test/")) { + yield field + } + } + } +} +``` + +## Query Examples + +### Java + +#### Unused Methods + +```rust +// script +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +// find unused methods +fn unused_method(unused: string) -> bool { + for(c in Callable(default_java_db()), method in Callable(default_java_db()), caller in method.getCaller()) { + if (c != caller && unused = method.getSignature()) { + return true + } + } +} + +fn main() { + output(unused_method()) +} +``` + +#### Class Inheritance Relationship + +```rust +// script +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +/** + * Find all class and the inheritances + * including parent class inheritance and ancestor class inheritance + */ +fn class_hierarchy(className : string, superClassName : string) -> bool { + for (c in Class(default_java_db()), ancestor in c.getAnAncestorClass()) { + if (className = c.getQualifiedName() && + superClassName = ancestor.getQualifiedName()) { + return true + } + } +} + +fn main() { + output(class_hierarchy()) +} +``` + +#### Querying All Methods in a Class + +```rust +// script +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +// Find all methods of the class +fn methods(className : string, methodName : string) -> bool { + for (c in Class(default_java_db()), m in c.getAllMethods()) { + if (className = c.getQualifiedName() && + methodName = m.getName()){ + return true + } + } +} + +fn main() { + output(methods()) +} +``` + +### Python + +#### Cyclomatic Complexity + +```rust +// script +use coref::python::* + +fn default_db() -> PythonDB { + return PythonDB::load("coref_python_src.db") +} + +/** + * Get cyclomatic complexity of functions + * + * @param name function name + * @param value cyclomatic complexity of function + * @param path path of file including this function + * @param sline function start line + * @param eline function end line + */ +fn getCyclomaticComplexity( + name: string, + value: int, + path: string, + sline: int, + eline: int) -> bool { + // get metric function + for (c in MetricFunction(default_db())) { + if (path = c.getLocation().getFile().getRelativePath() && + name = c.getQualifiedName() && + value = c.getCyclomaticComplexity() && + sline = c.getLocation().getStartLineNumber() && + eline = c.getLocation().getEndLineNumber()) { + return true + } + } +} + +fn main() { + output(getCyclomaticComplexity()) +} +``` + +#### Comment Rate + +```rust +// script +use coref::python::* + +schema PublicVisitedElement extends CombineElement {} + +impl PublicVisitedElement { + @data_constraint + pub fn __all__(db: PythonDB) -> *PublicVisitedElement { + for (tmp in Class(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + for (tmp in Function(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + } +} + +fn default_db() -> PythonDB { + return PythonDB::load("coref_python_src.db") +} + + +// count number of total public element +fn countTotalPublicElement() -> int { + return PublicVisitedElement(default_db()).len() +} + +// get public elements with Docstring comment +fn withDocstringCommentElement() -> 
*PublicVisitedElement { + let (db = default_db()) { + for (e in PublicVisitedElement(db), j in DocstringComment(db)) { + if (e.key_eq(j.getDocumentableElement())) { + yield e + } + } + } +} + +// count number of public elements with Docstring comment +fn countTotalPublicDocumentedElement() -> int { + return withDocstringCommentElement().len() +} + +fn withPublicDocumentedBelowElement() -> *PublicVisitedElement { + let (db = default_db()) { + for (e in PublicVisitedElement(db), j in Comment(db)) { + if (e.key_eq(j.getDocumentedClassOrFunctionElement())) { + yield e + } + } + } +} + +// count number of public element with single line comment +fn countTotalPublicDocumentedBelowElement() -> int { + return withPublicDocumentedBelowElement().len() +} + + +// calculate documented percentage +fn getDocumentedPercentage(documentedPercentage: int) -> bool { + let (i = countTotalPublicElement(), + j = countTotalPublicDocumentedElement(), + k = countTotalPublicDocumentedBelowElement()) { + if (i = 0) { + if (documentedPercentage = -1) { + return true + } + } + if (i != 0) { + if (documentedPercentage = (j + k) * 1000 / i) { + return true + } + } + } +} + +fn main() { + output(getDocumentedPercentage()) +} +``` + +#### Comments in a Method + +```rust +// script +use coref::python::* + +schema PublicVisitedElement extends CombineElement {} + +impl PublicVisitedElement { + @data_constraint + pub fn __all__(db: PythonDB) -> *PublicVisitedElement { + for (tmp in Class(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + for (tmp in Function(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + } + + pub fn getName(self) -> string { + let (tmp = Class(__all_data__).find(self)) { + return tmp.getQualifiedName() + } + let (tmp = Function(__all_data__).find(self)) { + return tmp.getQualifiedName() + } + } +} + +fn default_db() -> PythonDB { + return PythonDB::load("coref_python_src.db") +} + +fn hasComment(e: PublicVisitedElement) -> bool { + let (db = default_db()) { + for (j in DocstringComment(db)) { + if (e.key_eq(j.getDocumentableElement())) { + return true + } + } + for (j in Comment(db)) { + if (e.key_eq(j.getDocumentedClassOrFunctionElement())) { + return true + } + } + } +} + +/** + * Get comment of each public element + * + * @param type public visited element type + * @param name public visited element name + * @param filePath file path + * @param sline element start line + * @param eline element end line + * @param isCommented if is commented + */ +fn output_result( + type: string, + name: string, + filePath: string, + sline: int, + eline: int, + isCommented: int) -> bool { + for (e in PublicVisitedElement(default_db())) { + if (type = e.getType() && + name = e.getName() && + filePath = e.getLocation().getFile().getRelativePath() && + sline = e.getLocation().getStartLineNumber() && + eline = e.getLocation().getEndLineNumber()) { + if (hasComment(e)) { + if (isCommented = 1) { + return true + } + } + if (!hasComment(e)) { + if (isCommented = 0) { + return true + } + } + } + } +} + +fn main() { + output(output_result()) +} +``` + +### JavaScript + +#### AST Print + +```rust +// script +use coref::javascript::* + +/** + * print AST + * + * @param filePath file path + * @param parentId parent node ID + * @param parentKind parent node kind + * @param parentStartLine parent node start line + * @param parentEndLine parent node end line + * @param childId child node ID + * @param childKind child node kind + * @param childStartLine child node start line + * @param childEndLine child 
node end line + * @param index child node index + */ +fn out( + filePath: string, + parentId: int, + parentKind: string, + parentStartLine: int, + parentEndLine: int, + childId: int, + childKind: string, + childStartLine: int, + childEndLine: int, + index: int +) -> bool { + let (db = JavascriptDB::load("coref_javascript_src.db")) { + for (parent in Node(db), + child in Node(db), + parentSyntaxKind in SyntaxKind(), + childSyntaxKind in SyntaxKind(), + parentLocation in Location(db), + childLocation in Location(db), + file in File(db)) { + if (parent.key_eq(child.getParent()) && + parentId = parent.id && + childId = child.id && + parentSyntaxKind.id = parent.getKind() && + childSyntaxKind.id = child.getKind() && + parentKind = parentSyntaxKind.getName() && + childKind = childSyntaxKind.getName() && + index = child.getIndex() && + parentLocation = parent.getLocation() && + childLocation = parent.getLocation() && + file = parentLocation.getFile() && + filePath = file.getRelativePath() && + parentStartLine = parentLocation.getStartLineNumber() && + parentEndLine = parentLocation.getEndLineNumber() && + childStartLine = childLocation.getStartLineNumber() && + childEndLine = childLocation.getEndLineNumber()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +#### Cyclomatic complexity + +```rust +// script +use coref::javascript::* + +fn default_db() -> JavascriptDB { + return JavascriptDB::load("coref_javascript_src.db") +} + +/** + * Output the cyclomatic complexity of each function + * + * @param filePath file path + * @param functionName function name + * @param complexity cyclomatic complexity + * @param startLine function start line + * @param endLine function end line + */ +fn out(filePath: string, functionName: string, complexity: int, startLine: int, endLine: int) -> bool { + let (db = default_db()) { + for (func in FunctionLikeDeclaration(db), file in File(db)) { + if (complexity = func.getCyclomaticComplexity() && + functionName = func.getName() && + file = func.getLocation().getFile() && + filePath = file.getRelativePath() && + startLine = func.getLocation().getStartLineNumber() && + endLine = func.getLocation().getEndLineNumber()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +#### Change Effect + +```rust +// script +use coref::javascript::* + +fn default_db() -> JavascriptDB { + return JavascriptDB::load("coref_javascript_src.db") +} + +fn getACallerFunction(function: FunctionLikeDeclaration, callerFunction: FunctionLikeDeclaration) -> bool { + for (mayInvokeExpression in MayInvokeExpression(default_db())) { + if (mayInvokeExpression = function.getACallSite() && + callerFunction = mayInvokeExpression.getEnclosingFunction()) { + return true + } + } +} + +fn getAnEffectedFunction(function: FunctionLikeDeclaration, effectedFunction: FunctionLikeDeclaration) -> bool { + if (getACallerFunction(function, effectedFunction)) { + return true + } + for (callerFunction in FunctionLikeDeclaration(default_db())) { + if (getACallerFunction(function, callerFunction) && + getAnEffectedFunction(callerFunction, effectedFunction)) { + return true + } + } +} + +/** + * Query the effected functions according to the changed lines. 
+ * + * @param function the changed function id + * @param signature the changed function signature + * @param functionPath the changed function file path + * @param startLine the changed function start line + * @param endLine the changed function end line + * @param effectedFunction the effected function id + * @param effectedSignature the effected function signature + * @param effectedFunctionPath the effected function file path + * @param effectedStartLine the effected function start line + * @param effectedEndLine the effected function end line + */ +fn out( + function: FunctionLikeDeclaration, + signature: string, + functionPath: string, + startLine: int, + endLine: int, + effectedFunction: FunctionLikeDeclaration, + effectedSignature: string, + effectedFunctionPath: string, + effectedStartLine: int, + effectedEndLine: int +) -> bool { + if (getAnEffectedFunction(function, effectedFunction)) { + let (symbol = function.getSymbol(), + effectedSymbol = effectedFunction.getSymbol(), + location = function.getLocation(), + effectedLocation = effectedFunction.getLocation()) { + if (signature = symbol.getDescription() && + effectedSignature = effectedSymbol.getDescription() && + functionPath = location.getRelativePath() && + startLine = location.getStartLineNumber() && + endLine = location.getEndLineNumber() && + effectedFunctionPath = effectedLocation.getRelativePath() && + effectedStartLine = effectedLocation.getStartLineNumber() && + effectedEndLine = effectedLocation.getEndLineNumber()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +### XML + +#### Getting bean + +```rust +// script +use coref::xml::* + +schema BeanXmlElement extends XmlElement {} + +impl BeanXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *BeanXmlElement { + for (e in XmlElement(db)) { + let (path = e.getLocation().getFile().getRelativePath()) { + if (!path.contains("target") && e.getName() = "bean") { + yield BeanXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } + } +} + +schema EntryXmlElement extends XmlElement {} + +impl EntryXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *EntryXmlElement { + for (e in XmlElement(db)) { + if (e.getName() = "entry") { + yield EntryXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema PropertyXmlElement extends XmlElement {} + +impl PropertyXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *PropertyXmlElement { + for (e in XmlElement(db)) { + if (e.getName() = "property") { + yield PropertyXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +fn default_db() -> XmlDB { + return XmlDB::load("coref_xml_src.db") +} + +// get class name +fn getClassName(bean: BeanXmlElement) -> string { + for (attr in bean.getAttribute()) { + if (attr.getName() = "class") { + return attr.getValue() + } + } +} + +// get key +fn getKey(e: EntryXmlElement) -> string { + for (attr in e.getAttribute()) { + if (attr.getName() = "key") { + return attr.getValue() + } + } +} + +// output value and class info of the bean +fn output1(className: string, pName: string, kName: string) -> bool { + let (db = default_db()) { + for (bean in BeanXmlElement(db), p in PropertyXmlElement(db), e in EntryXmlElement(db)) { + if (className = getClassName(bean) && + bean.key_eq(p.getParent()) && + 
p.key_eq(e.getParent().getParent()) && + pName = p.getName() && + kName = getKey(e)) { + return true + } + } + } +} + +fn main() { + output(output1()) +} +``` + +#### POM + +```rust +// script +use coref::xml::* + +schema DependencyElement extends XmlElement {} + +impl DependencyElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *DependencyElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "dependency") { + yield DependencyElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema GroupElement extends XmlElement {} + +impl GroupElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *GroupElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "groupId") { + yield GroupElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema VersionElement extends XmlElement {} + +impl VersionElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *VersionElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "version") { + yield VersionElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema ArtifactElement extends XmlElement {} + +impl ArtifactElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *ArtifactElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "artifactId") { + yield ArtifactElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema PomFile extends XmlFile {} + +impl PomFile { + @data_constraint + pub fn __all__(db: XmlDB) -> *PomFile { + for(f in XmlFile(db)) { + if (f.getFileName() = "pom.xml") { + yield PomFile { + id: f.id, + file_name: f.file_name, + relative_path: f.relative_path + } + } + } + } +} + +// output relative path of the file, referenced jar name and version +fn out(fileName: string, m1: string, m2: string, m3: string) -> bool { + let (db = XmlDB::load("coref_xml_src.db")) { + for (f in PomFile(db), + e1 in GroupElement(db), + e2 in VersionElement(db), + e3 in ArtifactElement(db), + c1 in XmlCharacter(db), + c2 in XmlCharacter(db), + c3 in XmlCharacter(db), + p in DependencyElement(db)) { + if (f.key_eq(p.getLocation().getFile()) && + fileName = f.getRelativePath() && + p.key_eq(e1.getParent()) && + e1.key_eq(c1.getBelongedElement()) && + m1 = c1.getText() && + p.key_eq(e2.getParent()) && + e2.key_eq(c2.getBelongedElement()) && + m2 = c2.getText() && + p.key_eq(e3.getParent()) && + e3.key_eq(c3.getBelongedElement()) && + m3 = c3.getText()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +#### RPC + +```rust +// script +use coref::xml::* + +// select XmlElement containing "mobileService" +schema MobileServiceXmlElement extends XmlElement{} + +impl MobileServiceXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *MobileServiceXmlElement { + for (e in XmlElement(db)) { + if (e.getElementName() = "mobileService") { + yield MobileServiceXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } + + pub fn getServiceBeanValue(self) -> string { + for (a in self.getAttribute()) { + if (a.getName() = "serviceBean") { + return a.getValue() + } + } + } +} + +// select XmlElement containing "sofa:extension" +schema SofaExtensionXmlElement extends XmlElement{} +impl SofaExtensionXmlElement { + @data_constraint 
+ pub fn __all__(db: XmlDB) -> *SofaExtensionXmlElement { + for (e in XmlElement(db)) { + if (e.getName() = "sofa:extension") { + yield SofaExtensionXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +fn out(value: string) -> bool { + let (db = XmlDB::load("coref_xml_src.db")) { + for (m in MobileServiceXmlElement(db), s in SofaExtensionXmlElement(db), ancestor in m.getAnAncestor()) { + if (s.key_eq(ancestor) && value = m.getServiceBeanValue()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +### Go + +#### Message of All Files + +```rust +// script +use coref::go::* + +fn default_db() -> GoDB { + return GoDB::load("coref_go_src.db") +} +/** + * @param name file name + * @param funcCount function/method quantity + * @param totallines total lines of file + * @param codelines code line of file + * @param commentlines comment line of fine + * @param md5 md5 of this file + * @param sha256 sha256 of this file + */ +fn out( + name: string, + funcCount: int, + totallines: int, + codelines: int, + commentlines: int, + md5: string, + sha256: string) -> bool { + for(f in File(default_db())) { + if (name = f.getName() && + funcCount = f.getFunctionCount() && + md5 = f.getMd5Sum() && + sha256 = f.getSha256Sum() && + totallines = f.getLineInfo().getNumberOfTotalLines() && + codelines = f.getLineInfo().getNumberOfCodeLines() && + commentlines = f.getLineInfo().getNumberOfCommentLines()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` + +#### Methods and Corresponding Comments + +```rust +// script +use coref::go::* + +fn default_db() -> GoDB { + return GoDB::load("coref_go_src.db") +} + +// Define a predicate called 'out' with parameters fileName, funcName, funcComment, and signature +fn out(fileName: string, funcName: string, funcComment: string, signature: string) -> bool { + // Check if there exists a Function object 'func' + for(func in Function(default_db())) { + if ( + // Get the name of the file the function belongs to and assign it to the variable 'fileName' + fileName = func.getBelongsFile().getName() && + // Get the name of the function and assign it to the variable 'funcName' + funcName = func.getName() && + // Get the associated comment string for the function and assign it to the variable 'funcComment' + funcComment = func.getAssociatedCommentString() && + // Get the function type signature and assign it to the variable 'signature' + signature = func.getFunctionTypeSignature()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` + +#### Cyclomatic complexity + +```rust +// script +use coref::go::* + +fn default_db() -> GoDB { + return GoDB::load("coref_go_src.db") +} + +/** + * @param name: file name + * @param func: function name + * @param cmplx: function cyclomatic complexity + * @param sl,el,sc,ec: function location info + */ +fn out(name: string, func: string, cmplx: int, sl: int, el: int) -> bool { + for(f in GoFile(default_db()), function in Function(default_db())) { + if ((!f.isAutoGenereatedFile()) && + f.key_eq(function.getBelongsFile()) && + name = f.getName() && + func = function.getName() && + cmplx = function.getCyclomaticComplexity() && + sl = function.getLocation().getStartLineNumber() && + el = function.getLocation().getEndLineNumber()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` + +## Query Debugging and Optimization Techniques + +When running GödelScript scripts, it is common to encounter issues with excessively long 
run times. Here, we provide some basic methods for diagnosis and solutions. + +### Schema Parameters Causing Excessive Cartesian Products + +By default, function parameters without the `@inline` annotation are considered "qualification" conditions, not true input values. + +For example, in the following case, `get` receives a `Class` type parameter, but the actual final compilation result will resemble the code below: + +```rust +fn check(class: Class) -> bool { + if (class.getName().contains("io")) { + return true + } +} + +// Actual compilation result +fn check(class: Class) -> bool { + // Actually, it needs to fetch the entire Class set first + for(__temp_class in Class::__all__(__all_data__)) { + if (class = __temp_class) { + if (class.getName().contains("io")) { + return true + } + } + } +} +``` + +Therefore, when passing multiple schema types as parameters, there will be Cartesian products of multiple full schema sets, leading to a significant increase in space and time costs. +The solution is simple: just add an `@inline` annotation: + +```rust +@inline +fn check(class: Class) -> bool { + if (class.getName().contains("io")) { + return true + } +} + +fn example() -> bool { + for(class in Class(default_java_db())) { + if (check(class)) { + return true + } + } +} + +// The inline annotation will forcibly inline the function into the statement during the code generation stage, avoiding multiple table loads +// The actual compilation result is similar to +fn example() -> bool { + for(class in Class(default_java_db())) { + if (class.getName().contains("io")) { + return true + } + } +} +``` + +### Multiple `for` Loops Causing Excessive Cartesian Products + +In some cases, it is unavoidable to use multiple layers of `for` loops to load multiple tables for joint queries, causing severe inflation of Cartesian products. The number of Cartesian product results can be reduced by decreasing (filtering) the size of the sets in advance, as shown in the example: + +```rust +fn getByIndex(self) -> Expression { + let (db = default_java_db()) { + for(e in Expression(db), p in Parameter(db)) { + let (i = p.getIndex()) { + if (e.key_eq(self.getValueByIndex(i))) { + return e + } + } + } + } +} +``` + +In this example, e and p form a Cartesian product, causing the intermediate process to take too long. +The set i is actually obtained from a method of p, and in actual use, this set is very small, much smaller than the full set of Parameter. Therefore, the retrieval of the i set can be extracted as a separate function to produce a small set, avoiding Cartesian product computations between large sets while ensuring result equivalence: + +```rust +fn getAllParameterIndex() -> *int { + let (db = default_java_db()) { + for (p in Parameter(db)) { + yield p.getIndex() + } + } +} + +fn getByIndex(self) -> Expression { + let (db = default_java_db()) { + for(e in Expression(db), i in getAllParameterIndex()) { + if (e.key_eq(self.getValueByIndex(i))) { + return e + } + } + } +} +``` + +The Cartesian product of e and p becomes e and i. Operationally, the cost of the Cartesian product is reduced, and the `getIndex` operation is advanced, rather than taking place after the Cartesian product, significantly improving performance. + +### Do Not Overuse `@inline` / Must Use `@inline` Optimization Strategy + +The underlying mechanism of inline functions is to **expand at the call site**. 
If the function does not have a large number of schema parameters but is called in many places, inlining may lead to **code bloat and an exponential increase in redundant calculations**, which can sometimes increase runtime instead of reducing it.
+If you must use `@inline`, for example to avoid an `ungrounded` error, but find that inlining slows down execution, you can split the inlined statements into separate predicates to prevent the code bloat caused by expansion.
+
+In the following example, `getValueByAttributeNameByDefaultValue` is marked `@inline` to prevent `attributeName` from being identified as `ungrounded`. Later, a conditional statement was added to the `if` branch, causing the execution time to increase from 3 seconds to 35 seconds:
+
+```rust
+impl XmlElementBase {
+    @inline
+    fn getValueByAttributeNameByDefaultValue(self, attributeName: string) -> string {
+        if (self.hasAttribute(attributeName)) {
+            // return self.getValueByAttributeName(attributeName)
+            // Changed to the following statement:
+            let (value = self.getValueByAttributeName(attributeName)) {
+                if (value = "n/a") {
+                    return ""
+                }
+                if (value != "n/a") {
+                    return value
+                }
+            }
+        }
+        if (!self.hasAttribute(attributeName)) {
+            return "null"
+        }
+    }
+}
+```
+
+As you can see, this adds a level of assignment and a conditional statement to a function that is called nearly 20 times later in the script, so the generated code is expanded nearly 20 times, which explains the order-of-magnitude difference in performance. In this situation you can extract the changed statements into a separate function. Since the extracted function does not take complex types as parameters, no performance is lost by leaving it un-inlined; after extraction, the result is as follows:
+
+```rust
+impl XmlElementBase {
+    fn getTransValueByAttributeName(self, attributeName: string) -> string {
+        let (value = self.getValueByAttributeName(attributeName)) {
+            if (value = "n/a") {
+                return ""
+            }
+            if (value != "n/a") {
+                return value
+            }
+        }
+    }
+    @inline
+    fn getValueByAttributeNameByDefaultValue(self, attributeName: string) -> string {
+        if (self.hasAttribute(attributeName)) {
+            return self.getTransValueByAttributeName(attributeName)
+        }
+        if (!self.hasAttribute(attributeName)) {
+            return "null"
+        }
+    }
+}
+```
+
+This way, the execution time drops from 35 seconds back to 3 seconds, as expected.
+
+## Using Query Scripts Locally
+
+For instructions on using query scripts on your machine, see [Installation, Configuration, and Running](./3_install_and_run.md).
\ No newline at end of file
diff --git a/content/en/docs/codefuse-query/5_toolchain.en.md b/content/en/docs/codefuse-query/5_toolchain.en.md
new file mode 100644
index 0000000..803ee8d
--- /dev/null
+++ b/content/en/docs/codefuse-query/5_toolchain.en.md
@@ -0,0 +1,98 @@
+---
+title: Toolchain
+slug: Toolchain
+description: CodeFuse介绍主要功能
+url: /docs/codefuse-query-toolchain
+aliases:
+- "/docs/codefuse-query-toolchain"
+---
+
+
+# Developing Plugins (VSCode)
+## Installation
+### Install from VSCode marketplace (Recommended)
+[VSCode Extension](https://marketplace.visualstudio.com/items?itemName=CodeFuse-Query.codefuse-query-extension)
+### Install locally via VSIX package
+1. Download the plugin.
+2. Install it manually from the VSIX file:
+![image.png](/images/codefuse-query/toolchain01.png)
+3. 
Or use the command directly from the terminal to install: +```bash +code --install-extension [extension vsix file path] +``` + +## Environment Preparation + +- Sparrow CLI, refer to Section 3 Installation, Configuration, and Running. +## Extension Features +This extension provides the following feature modules: + +- COREF AST Viewer +- Gödel Language Server +- Gödel Language Runner +### COREF AST Viewer +The following features need to be enabled in the extension settings. Currently, it only supports the Java language. +#### Convert Java Files into Tree-Like COREF Nodes +![](/images/codefuse-query/toolchain02.gif) +#### Locate COREF Nodes and Code Positions Interactively +![](/images/codefuse-query/toolchain03.gif) +#### View Node APIs and Copy Nodes in Lib API Viewer +![](/images/codefuse-query/toolchain04.gif) +#### Lib API Viewer: Querying and Copying Usage +![](/images/codefuse-query/toolchain05.gif) +### Gödel Language Server Features +The following features need to be enabled after setting up the extension. Syntax highlighting is still available without setting related items. +#### Error Information Tips +Error information automatically updates with code changes. +![](/images/codefuse-query/toolchain06.gif) +#### Symbol Information Tips and Completion +Completion suggestions that include local variables and global symbols. Keywords provide corresponding usage examples; global symbol information offers more detailed internal information, such as member variables, member methods, and static methods. + +![](/images/codefuse-query/toolchain07.gif) + +- Keyword completion and usage example tips +- Local variable type information and symbol completion +- `.` followed by symbol information and completion +- `::` followed by symbol information and completion +- Annotation usage example tips +- Global symbol type information (internal structure, member methods, static methods) +#### Go to Definition +You can jump to definitions with a right-click or `ctrl`/`command`+`left click` to go directly to the exact symbol definition location. + +![](/images/codefuse-query/toolchain08.gif) +#### Code Snippets (Snippets) +The extension provides some code snippets to quickly write Gödel 1.0/script code. + +![](/images/codefuse-query/toolchain09.gif) +### GödelScript Runner +Use after setting the Sparrow CLI path in the extension. The database needs to be loaded before running the script. For how to generate a database, refer to Section 3.4, Running, in the data extraction part. +#### Running Scripts +![panel.gif](/images/codefuse-query/toolchain10.gif) +There are four different script running buttons provided: +1. Right-click to execute at the script you want to run. +2. Choose `Run GödelScript` on the extension `GodelScript Runner` panel. +3. Choose `Run` on the extension `GodelScript Runner Setting` panel. +4. Click the run button at the top right of the extension `GodelScript Runner Setting` panel. +#### Database Folder Loading +1. Right-click at the script you want to run and choose the folder containing the database to load. +2. Choose `Load Database Directory` on the extension `GodelScript Runner` panel. +3. Choose `Database` on the extension `GodelScript Runner Setting` panel. +4. Click the database load button at the top right of the extension `GodelScript Runner Setting` panel. +## Extension Settings +### COREF AST Viewer Settings + +- `corefASTViewer.sparrowCliRoot` + - Specify the root directory of Sparrow CLI, referring to Section 3 of the installation part. 
### Gödel Language Server Settings
When the extension starts, a prompt will pop up if either of the following two items is not set. Clicking the `configure` button will take you to the corresponding configuration page.

- `godelScript.executablePath`
  - Specifies the path of the GödelScript executable; empty by default. Replace it with the actual absolute path of the GödelScript executable when needed.
  - If Sparrow CLI is already downloaded, the GödelScript executable is `[sparrow cli root]/godel-script/usr/bin/godel`.
- `godelScript.libraryDirectoryPath`
  - Specifies the path of the GödelScript library folder; empty by default. Replace it with the absolute path of the GödelScript library folder when needed.
  - If Sparrow CLI is already downloaded, the library folder is `[sparrow cli root]/lib-1.0`.

# Smart Assistant

Coming soon. Stay tuned!
\ No newline at end of file
diff --git a/content/en/docs/codefuse-query/user_case.en.md b/content/en/docs/codefuse-query/user_case.en.md
new file mode 100644
index 0000000..ff2eef0
--- /dev/null
+++ b/content/en/docs/codefuse-query/user_case.en.md
@@ -0,0 +1,59 @@
---
title: User Case
slug: 用户案例
description: CodeFuse介绍主要功能
url: /docs/codefuse-query-usercase
aliases:
- "/docs/codefuse-query-usercase"
---

# Use Cases
## Querying Code Features
A developer wants to know which String-typed variables are used in Repo A, so he writes the following Gödel script and submits it to the CodeFuse-Query system to get the results.
```rust
// script
use coref::java::*

fn out(var: string) -> bool {
    for(v in Variable(JavaDB::load("coref_java_src.db"))) {
        if (v.getType().getName() = "String" && var = v.getName()) {
            return true
        }
    }
}

fn main() {
    output(out())
}
```
Similar needs: querying for classes, functions, variables, return values, call graphs, class inheritance, etc.
## Code Rule Checker
A team leader finds that the team keeps writing bugs similar to Bug A. **He wants to establish a code rule and a checker for Bug A** and run the check at the CodeReview stage. He writes the analysis as a query on the CodeFuse-Query platform, verifies on the platform that it meets the requirements, then solidifies the query as a code rule and rolls it out to the CodeReview/CI stage. Since then, this kind of bug has never appeared again.
Similar needs: writing static defect scanning rules for code risk interception.
## Obtaining Statistical Data
A researcher finds that traditional code complexity metrics struggle to measure code complexity accurately. Drawing on industry best practices and his own insight, he designs a new set of complexity metrics and algorithms. After implementing them in Gödel, **he finds that the performance is already very good without much optimization**, and the metrics are quickly applied to more than 10 languages and over 110,000 repositories, giving him an immediate, in-depth picture of the overall complexity of these repositories. Compared with the past, when he had to parse code, analyze syntax trees, and integrate with systems himself, **the gain in convenience is hard to overstate**.
Similar needs: code statistics, code metrics, algorithm design, academic research.
# Application Fields
Currently, CodeFuse-Query at Ant Group already supports multiple scenarios, such as **data cleaning for the CodeFuse large language model**, **code metric assessment**, **R&D risk control**, **privacy and security analysis**, **code intelligence**, and **client package size governance**, with a monthly service call volume exceeding one million.
## High-Quality Code Data Cleaning - CodeFuse Code Large Model
The CodeFuse code large model is Ant Group's open-source model for code-related tasks. For the CodeFuse large language model, the quality of the training data directly affects the inference results of the model. Low-quality code data directly pollutes the output of the language model: the model may learn incorrect code patterns and thereby generate incorrect code, and if the data only contains code in one programming language, the model may not adapt well to code in other languages.
To control the quality of code data entering the model and thereby improve its inference capability, we defined what counts as high-quality code based on years of practical experience from Ant's code analysis team combined with industry consensus, and implemented automated, large-scale code data cleaning with existing program analysis technology.
CodeFuse-Query provides the following data cleaning capabilities for the CodeFuse code large model:

- High-quality code data cleaning: cleans code data, including vulnerability scanning for seven languages (Python, Java, JavaScript, TypeScript, Go, C, and C++), filtering by language type and star count, and filtering out data with zero effective lines of code. About **2TB** of cleaned GitHub and Ant-internal code data has been accumulated so far.
- Code Portrait: high-performance, multi-dimensional automatic annotation of large-scale code, supporting **10** languages (Java, Scala, Kotlin, JavaScript, JSX, TypeScript, TSX, Vue, Python, Go) with **77** common tags and **40** Ant-specific tags, **117** tags in total. The current auto-annotation throughput reaches **40MB/s**.
- Other atomic capabilities
  - Advanced code feature extraction, including AST (Abstract Syntax Tree) and DFG (Data Flow Graph) extraction. AST information is already used for SFT training, with an accuracy of about 97%.
  - Code snippet identification, used for extracting code from text data, which is convenient for code formatting or adding Markdown formatting:
    - Code extraction from text: extracts code blocks from text and supports parsing mainstream languages as well as function and class definitions. Only the binary classification (whether the text contains a code block) has been validated, with an accuracy of about 83%.
    - Programming language identification for code snippets: identifies the programming language of any code snippet, supporting 30+ languages with an accuracy of about 80%.
    - Code-comment pair extraction: extracts method-level comment-code pairs, covering **15** of GitHub's most popular languages, used for Text-to-Code / Code-to-Text SFT training.
## Change Analysis - Youku Server-side R&D Efficiency
Since 2023, the Youku quality assurance team has been exploring precise testing for the server side. After half a year of technical accumulation and system building, they have formed a precise testing system covering **change content identification, change impact analysis, test capability recommendation, and test coverage assessment**.
In this process, the capabilities that CodeFuse-Query provides mainly include:

- Analyzing the affected objects based on the code changes (file + line number): methods, entry points (HTTP entries, HSF entries), call routes (all call routes from an entry to the changed method), and database operations (table, operation type)
- Combining the online dynamic call routes (method routes) with CodeFuse-Query's precise static call-route impact analysis to improve the effectiveness and accuracy of change impact analysis

Up to now, Youku has integrated all core applications through CodeFuse-Query and has built a comprehensive server-side code knowledge base and network traffic knowledge base based on static analysis.
diff --git a/content/en/docs/devops-model/1_traindetail.md b/content/en/docs/devops-model/1_traindetail.md
new file mode 100644
index 0000000..61c043d
--- /dev/null
+++ b/content/en/docs/devops-model/1_traindetail.md
@@ -0,0 +1,46 @@
---
title: Train Detail
slug: Train Detail
description: 介绍主要功能
url: "/docs/codefuse-devops-model-train"
aliases:
- "/docs/codefuse-devops-model-train"
---


## Training Process

A review of the literature shows that most domain models are built on conversational models and are infused with domain knowledge through Supervised Fine-Tuning (SFT). However, the QA corpus required for SFT largely comes from ChatGPT generation and may not fully cover domain knowledge.

Therefore, DevOps-Model adopts additional (continued) pre-training followed by SFT fine-tuning, as illustrated in the figure below. We believe that additional pre-training is necessary for large domain models: it injects domain knowledge into the model during the pre-training phase. If this knowledge was not covered during the general model's pre-training, the model learns new information; if it was covered, the knowledge is further reinforced. The second step is model alignment, which aims to enable the model to provide the most appropriate content in response to questions.

![](/images/devops-model/devops_train_framework.png)

## Training Data
### Data Collection
The model is positioned as a large Chinese DevOps domain model, so we collect pre-training and QA data related to Chinese DevOps.

The pre-training data mainly comes from the internet, including technical blogs, documentation, and books, amounting to over 50GB of pre-training corpus.
For the QA data, our goal is not only to align the model with general Q&A capabilities but also to teach it to answer questions better in the DevOps domain. Therefore, we collected both general single-turn and multi-turn dialogue data and generated domain-specific QA data for the DevOps field through crawling and ChatGPT. Ultimately, we carefully selected around 200K QA pairs for SFT fine-tuning, as shown in the table below.

|Data Type |Volume|
| -- | - |
|General Single-turn QA| 50K|
|General Multi-turn QA| 20K|
|DevOps Domain QA| 130K|


## Data Selection
![](/images/devops-model/devops_data_filter.png)

Since most of the pre-training data is collected from the internet, its quality can be uneven. As data is the most crucial component in large model training, we built the cleaning pipeline shown above to rigorously filter the collected data.
+ +First, experts and manual screening have summarized a set of heuristic filtering rules at the document level, primarily to filter out those documents of very poor quality. +Then, even within an article of slightly lower quality, there may still be some valuable domain knowledge, which we need to collect as much as possible. Here, we split the article into paragraphs. +Next, the split paragraphs are filtered again using the rules from step 1, yielding a batch of paragraphs that have passed rule-based filtering. +We then picked out 1000 paragraphs for labeling by experienced professional developers to obtain high-quality labeled data. +Finally, we trained a scoring model based on the labeling results to score the quality of paragraphs. The vector model for paragraphs was the pre-trained Chinese version of Sentence-Bert, and the scoring algorithm was logistic regression. To avoid errors in the scoring model, we used the Pareto distribution to decide whether to filter a paragraph based on its quality score. +After this Pipeline, we finally settled on approximately 15GB of data for the pre-training plus training of the large model. + diff --git a/content/en/docs/devops-model/2_quickstart.md b/content/en/docs/devops-model/2_quickstart.md new file mode 100644 index 0000000..0f16577 --- /dev/null +++ b/content/en/docs/devops-model/2_quickstart.md @@ -0,0 +1,71 @@ +--- +title: QuickStart +slug: QuickStart +description: 介绍主要功能 +url: "/docs/codefuse-devops-model-quickstart" +aliases: +- "/docs/codefuse-devops-model-quickstart" +--- + + + +## Dependency Installation +Please install the packages listed in the requirements.txt file from the GitHub address first. You can refer to the following code: +``` +pip install -r requirements.txt +``` + + +## Model Download +Model download information is as follows: + +🤗 Huggingface Address + +| - | Base Model |Aligned Model| +| -- | ---------- | ------- | +|7B| DevOps-Model-7B-Base| DevOps-Model-7B-Chat| +|14B| DevOps-Model-14B-Base| DevOps-Model-14B-Chat| + +🤖 ModelScope Address + +| - | Base Model |Aligned Model| +| -- | ---------- | ------- | +|7B | DevOps-Model-7B-Base |DevOps-Model-7B-Chat| +|14B| DevOps-Model-14B-Base| DevOps-Model-14B-Chat| + +Find the version of the Chat model you want to download; currently, 7B and 14B models are provided. + + +## Model Usage +Interact with the Chat model using the following code: +``` +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +tokenizer = AutoTokenizer.from_pretrained("path_to_DevOps-Model-Chat", trust_remote_code=True) + +model = AutoModelForCausalLM.from_pretrained("path_to_DevOps-Model-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() + +# 指定 generation_config +model.generation_config = GenerationConfig.from_pretrained("path_to_DevOps-Model-Chat", trust_remote_code=True) + +# First round of conversation +resp, hist = model.chat(query='你是谁', tokenizer=tokenizer, history=None) +print(resp) +# 我是 DevOps-Model,一个由蚂蚁集团平台技术事业群风险智能团队和北京大学联合研发的人工智能机器人,可以与用户进行自然语言交互,并协助解答 DevOps 全生命周期中的各种问题。如果您有任何需要协助的问题或者想要进行闲聊,都可以和我交流哦。 + +# Second round of conversation +resp2, hist2 = model.chat(query='Java 中 HashMap 和 Hashtable 有什么区别', tokenizer=tokenizer, history=hist) +print(resp2) +# HashMap 和 Hashtable 都是 Java 中常用的哈希表实现,它们的主要区别在于: +# 1. Hashtable 是线程安全的,而 HashMap 不是线程安全的,因此在多线程环境下,Hashtable 的性能更稳定。 +# 2. Hashtable 中的方法都是同步的,而 HashMap 的方法不是同步的,因此在多线程环境下,Hashtable 的性能更好。 +# 3. 
Hashtable 中的 key 和 value 都必须实现 Serializable 接口,而 HashMap 中的 key 和 value 可以是任何对象,包括基本数据类型。 +# 4. Hashtable 的初始容量是 11,而 HashMap 的初始容量是 16。 +# 总之,如果需要在多线程环境下使用哈希表,并且需要保证线程安全,那么应该使用 Hashtable;如果不需要考虑线程安全,或者需要快速地进行哈希表操作,那么应该使用 HashMap。 + +# Third round of conversation +resp3, hist3 = model.chat(query='线程安全代表什么', tokenizer=tokenizer, history=hist2) +print(resp3) +# 线程安全是指在多线程环境下,程序能够正确地处理并发访问,并且不会出现数据竞争、死锁、饥饿等异常情况。线程安全的程序可以保证在不同的线程之间共享同一个数据结构时,数据的正确性和一致性。线程安全的实现通常需要使用同步机制,如锁、原子操作等,来保证对共享数据的访问是线程安全的。在 Java 中,可以通过 synchronized 关键字、Lock 接口等机制来实现线程安全。 +``` \ No newline at end of file diff --git a/content/en/docs/devops_eval/c1.data.md b/content/en/docs/devops_eval/c1.data.md new file mode 100644 index 0000000..0175452 --- /dev/null +++ b/content/en/docs/devops_eval/c1.data.md @@ -0,0 +1,111 @@ +--- +title: Data +slug: Data +description: 介绍主要功能 +url: "docs/data" +aliases: +- "/docs/data" +--- + +## ⏬ Data +#### Download +* Method 1: Download the zip file (you can also simply open the following link with the browser): + ``` + wget https://huggingface.co/datasets/codefuse-admin/devopseval-exam/resolve/main/devopseval-exam.zip + ``` + then unzip it and you may load the data with pandas: + ``` + import os + import pandas as pd + + File_Dir="devopseval-exam" + test_df=pd.read_csv(os.path.join(File_Dir,"test","UnitTesting.csv")) + ``` +* Method 2: Directly load the dataset using [Hugging Face datasets](https://huggingface.co/datasets/codefuse-admin/devopseval-exam): + ```python + from datasets import load_dataset + dataset=load_dataset(r"DevOps-Eval/devopseval-exam",name="UnitTesting") + + print(dataset['val'][0]) + # {"id": 1, "question": "单元测试应该覆盖以下哪些方面?", "A": "正常路径", "B": "异常路径", "C": "边界值条件","D": 所有以上,"answer": "D", "explanation": ""} ``` + +#### 👀 Notes +To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 55 subcategories. Please refer to [category_mapping.json](/images/devops_eval/categroy_mapping.json) for details. The format is: + +``` +{ + "UnitTesting.csv": [ + "unit testing", + "单元测试", + {"dev": 5, "test": 32} + "TEST" + ], + ... + "file_name":[ + "English Name", + "Chinese Name", + "Sample Number", + "Supercatagory Label(PLAN,CODE,BUILD,TEST,RELEASE,DEPOLY,OPERATE,MONITOR choose 1 out of 8)" + ] +} +``` +Each subcategory consists of two splits: dev and test. The dev set per subcategory consists of five exemplars with explanations for few-shot evaluation. And the test set is for model evaluation. Labels on the test split are also released. + +Below is a dev example from 'version control': + +``` +id: 4 +question: 如何找到Git特定提交中已更改的文件列表? 
+A: 使用命令 `git diff --name-only SHA` +B: 使用命令 `git log --name-only SHA` +C: 使用命令 `git commit --name-only SHA` +D: 使用命令 `git clone --name-only SHA` +answer: A +explanation: +分析原因: +git diff --name-only SHA命令会显示与SHA参数对应的提交中已修改的文件列表。参数--name-only让命令只输出文件名,而忽略其他信息。其它选项中的命令并不能实现此功能。 +``` +#### 🔥 AIOps Sample Example +👀 👀 Taking **log parsing** and **time series anomaly detection** as examples, here is a brief showcase of the AIOps samples: + +LogParsing +``` +id: 0 +question: +Here are some running logs + 0 04:21:15,429 WARN Cannot open channel to 2 at election address /10.10.34.12:3888 + 1 19:18:56,377 WARN ******* GOODBYE /10.10.34.11:52703 ******** + 2 19:13:46,128 WARN ******* GOODBYE /10.10.34.11:52308 ******** + 3 19:16:26,268 WARN ******* GOODBYE /10.10.34.11:52502 ******** + 4 09:11:16,012 WARN Cannot open channel to 3 at election address /10.10.34.13:3888 + 5 16:37:13,837 WARN Cannot open channel to 2 at election address /10.10.34.12:3888 + 6 09:09:16,008 WARN Cannot open channel to 3 at election address /10.10.34.13:3888 + 7 15:27:03,681 WARN Cannot open channel to 3 at election address /10.10.34.13:3888 +The first three parts of the log are index, timestamp, and log level. Without considering these three parts, Here we assume that the variables in the logs are represented as '<*>', separated by spaces between tokens. What is the specific log template for the above logs? +A: Notification time out: <*> 和 Connection broken for id <*>, my id = <*>, error = +B: Send worker leaving thread 和 Connection broken for id <*>, my id = <*>, error = +C: Received connection request /<*>:<*> 和 Interrupting SendWorker +D: Cannot open channel to <*> at election address /<*>:<*> 和 ******* GOODBYE /<*>:<*> ******** +answer: D +explanation: The log includes the fixed template fragments "Cannot open channel to <> at election address /<>:<>" and "****** GOODBYE /<>:<> ********," both of which appear in option D. Meanwhile, the template fragments in the other options do not match the content in the log. Therefore, option D is the most consistent with the log template. +``` +TimeSeriesAnomalyDetection +``` +id: 0 +question: +Analyze the following time series +[50,62,74,84,92,97,99,98,94,87,77,65,265,40,28,17,8,3,0,0,4,10,20,31,43,56,68,79,89,95,99,99,96,91,82,71,59,46,34,22,12,5,1,0,2,7,15,25,37,49] +Please identify the indices of obvious outlier points. Outlier points generally refer to points that significantly deviate from the overall trend of the data. +A: 46 +B: 0 +C: 37 +D: 12 +answer: D +explanation: According to the analysis, the value 265 in the given time series at 12 o'clock is significantly larger than the surrounding data, indicating a sudden increase phenomenon. Therefore, selecting option D is correct. +``` +#### 🔧 ToolLearning Sample Example + +👀 👀The data format of ToolLearning samples is compatible with OpenAI's Function Calling. + +Please refer to [tool_learning_info.md](/docs/devops_eval/tool_learning_info.md) for details. +
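#### Building a Few-Shot Prompt
As a small, self-contained illustration of how the dev and test splits described above can be combined, the sketch below loads one subcategory with pandas and assembles a k-shot prompt. It is not part of the official tooling: it assumes the dev split sits in a `dev` directory parallel to the `test` directory shown in the download step, that the CSV columns match the fields in the examples above (`question`, `A`-`D`, `answer`), and the prompt wording itself is only illustrative.

```python
import os
import pandas as pd

FILE_DIR = "devopseval-exam"  # unzipped dataset directory from the download step

def format_example(row, include_answer: bool) -> str:
    # Render one sample as a multiple-choice question block.
    text = f"{row['question']}\n"
    for choice in ["A", "B", "C", "D"]:
        text += f"{choice}. {row[choice]}\n"
    # Dev exemplars reveal the answer; the test question stops at the answer marker.
    return text + (f"Answer: {row['answer']}\n\n" if include_answer else "Answer: ")

def build_k_shot_prompt(subject_csv: str, test_idx: int = 0, k: int = 3) -> str:
    dev_df = pd.read_csv(os.path.join(FILE_DIR, "dev", subject_csv))    # few-shot exemplars
    test_df = pd.read_csv(os.path.join(FILE_DIR, "test", subject_csv))  # questions to evaluate
    prompt = "".join(format_example(row, True) for _, row in dev_df.head(k).iterrows())
    return prompt + format_example(test_df.iloc[test_idx], False)

print(build_k_shot_prompt("UnitTesting.csv", k=3))
```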
diff --git a/content/en/docs/devops_eval/c2.evaluate.md b/content/en/docs/devops_eval/c2.evaluate.md new file mode 100644 index 0000000..37b95ac --- /dev/null +++ b/content/en/docs/devops_eval/c2.evaluate.md @@ -0,0 +1,76 @@ +--- +title: Evaluate +slug: Evaluate +description: 介绍主要功能 +url: "docs/codefuse-devops-eval-quickstart" +aliases: +- "/docs/codefuse-devops-eval-quickstart" +--- + +## 🚀 How to Evaluate +If you need to test your own huggingface-formatted model, the overall steps are as follows: +1. Write the loader function for the model. +2. Write the context_builder function for the model. +3. Register the model in the configuration file. +4. Run the testing script. +If the model does not require any special processing after loading, and the input does not need to be converted to a specific format (e.g. chatml format or other human-bot formats), you can directly proceed to step 4 to initiate the testing. + +#### 1. Write the loader function +If the model requires additional processing after loading (e.g. adjusting the tokenizer), you need to inherit the `ModelAndTokenizerLoader` class in `src.context_builder.context_builder_family.py` and override the corresponding `load_model` and `load_tokenizer` functions. You can refer to the following example: +```python +class QwenModelAndTokenizerLoader(ModelAndTokenizerLoader): + def __init__(self): + super().__init__() + pass + + @override + def load_model(self, model_path: str): + # Implementation of the method + pass + + @override + def load_tokenizer(self, model_path: str): + # Implementation of the method + pass +``` + +#### 2. Write the context_builder function for the Model +If the input needs to be converted to a specific format (e.g. chatml format or other human-bot formats), you need to inherit the ContextBuilder class in `src.context_builder.context_builder_family` and override the make_context function. This function is used to convert the input to the corresponding required format. An example is shown below: +```python +class QwenChatContextBuilder(ContextBuilder): + def __init__(self): + super().__init__() + + @override + def make_context(self, model, tokenizer, query: str, system: str = "hello!"): + # Implementation of the method + pass +``` + +#### 3. Register the model in the configuration file +Go to the `model_conf.json` file in the conf directory and register the corresponding model name and the loader and context_builder that will be used for this model. Simply write the class names defined in the first and second steps for the loader and context_builder. Here is an example: +```json +{ + "Qwen-Chat": { + "loader": "QwenModelAndTokenizerLoader", + "context_builder": "QwenChatContextBuilder" + } +} +``` + +#### 4. Execute the testing script +Run the following code to initiate the test: +```Bash +python src/run_eval.py \ +--model_path path_to_model \ +--model_name model_name_in_conf \ +--model_conf_path path_to_model_conf \ +--eval_dataset_list all \ +--eval_dataset_fp_conf_path path_to_dataset_conf \ +--eval_dataset_type test \ +--data_path path_to_downloaded_devops_eval_data \ +--k_shot 0 +``` +👀 👀 The specific evaluation process is as follows 📖 [**Evaluate Tutorial**](/docs/devops_eval/tutorial.md) + +
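As a quick illustration of the `context_builder` described in step 2, here is a minimal sketch for a plain human/bot prompt template. It follows the `ContextBuilder` interface referenced above; the prompt markers (`<human>:`, `<bot>:`) and the import path are assumptions and should be adapted to the exact format your model was trained on.

```python
from src.context_builder.context_builder_family import ContextBuilder  # module path as described above


class HumanBotContextBuilder(ContextBuilder):
    def __init__(self):
        super().__init__()

    def make_context(self, model, tokenizer, query: str, system: str = "You are a helpful assistant."):
        # Assemble the raw prompt string the model expects, then tokenize it.
        raw_text = f"{system}\n<human>: {query}\n<bot>: "
        context_tokens = tokenizer.encode(raw_text)
        # Return both forms, mirroring the (raw_text, context_tokens) convention used above.
        return raw_text, context_tokens
```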
\ No newline at end of file
diff --git a/content/en/docs/devops_eval/tool_learning_info.md b/content/en/docs/devops_eval/tool_learning_info.md
new file mode 100644
index 0000000..d3db092
--- /dev/null
+++ b/content/en/docs/devops_eval/tool_learning_info.md
@@ -0,0 +1,87 @@
### Data Sample
Our data format is fully compatible with OpenAI Function Calling. The specific format is as follows:

**Function Call data format**

| Input Key | Input Type | Input Description |
| --- | --- | --- |
| functions | List[Swagger] | the set of available tools |
| chatrounds | List[chatround] | multi-turn conversation data |

**chatrounds data format**

| Input Key | Input Type | Input Description |
| --- | --- | --- |
| role | string | role name; one of three types: user, assistant, function |
| name | string | present only when role is function; the name of that function |
| content | string | the content returned by this role |
| function_call | dict | the tool invocation |

```
{
    "functions":
    [
        {
            "name": "get_fudan_university_scoreline",
            "description": "查询复旦大学往年分数线,例如:查询2020年复旦大学的分数线",
            "parameters":
            {
                "type": "object",
                "properties":
                {
                    "year":
                    {
                        "type": "string",
                        "description": "年份,例如:2020,2019,2018"
                    }
                },
                "required":
                [
                    "year"
                ]
            }
        }
    ],
    "chatrounds":
    [
        {
            "role": "system",
            "content": "CodeFuse是一个面向研发领域的智能助手,旨在中立的、无害的帮助用户解决开发相关的问题,所有的回答均使用Markdown格式返回。\n你能利用许多工具和功能来完成给定的任务,在每一步中,你需要分析当前状态,并通过执行函数调用来确定下一步的行动方向。你可以进行多次尝试。如果你计划连续尝试不同的条件,请每次尝试一种条件。若给定了Finish函数,则以Finish调用结束,若没提供Finish函数,则以不带function_call的对话结束。"
        },
        {
            "role": "user",
            "content": "查询2020年复旦大学的分数线"
        },
        {
            "role": "assistant",
            "content": null,
            "function_call":
            {
                "name": "get_fudan_university_scoreline",
                "arguments": "{\n \"year\": \"2020\"\n}"
            }
        },
        {
            "role": "function",
            "name": "get_fudan_university_scoreline",
            "content": "{\n \"scoreline\":{\n \"文科一批\": 630, \n \"文科二批\": 610, \n \"理科一批\": 650, \n \"理科二批\": 630 \n }\n}"
        },
        {
            "role": "assistant",
            "content": "2020年复旦大学的分数线如下:\n\n- 文科一批:630分\n- 文科二批:610分\n- 理科一批:650分\n- 理科二批:630分"
        }
    ]
}
```

The Function Call sample above shows how, given this specific tool set, the model answers a user's query about a university's admission score lines.


### Evaluation Metrics
Since general-purpose models do not natively have tool-calling capability, a general model needs to be fine-tuned before running the Tool Learn-Eval evaluation so that it first learns the basic paradigm of tool use.

Below, we define several metrics for evaluating tool use:



The sum of ②③④⑤ is 1 and represents the total of failed tool calls; ⑤ (tool hallucination) is a special case of failing to identify the tool name.
\ No newline at end of file
diff --git a/content/en/docs/devops_eval/tutorial.md b/content/en/docs/devops_eval/tutorial.md
new file mode 100644
index 0000000..3ec8b47
--- /dev/null
+++ b/content/en/docs/devops_eval/tutorial.md
@@ -0,0 +1,133 @@
## Evaluate Tutorial

## 🚀 How to Evaluate
If you need to test your own huggingface-formatted model, the overall steps are as follows:
1. Write the loader function for the model.
2. Write the context_builder function for the model.
3. Register the model in the configuration file.
4. Run the testing script.
If the model does not require any special processing after loading, and the input does not need to be converted to a specific format (e.g. chatml format or other human-bot formats), you can directly proceed to step 4 to initiate the testing.

#### 1. Write the loader function
If the model requires additional processing after loading (e.g. adjusting the tokenizer), you need to inherit the `ModelAndTokenizerLoader` class in `src.context_builder.context_builder_family.py` and override the corresponding `load_model` and `load_tokenizer` functions.
You can refer to the following example: +```python +class QwenModelAndTokenizerLoader(ModelAndTokenizerLoader): + def __init__(self): + super().__init__() + pass + + def load_model(self, model_path: str): + model = super().load_model(model_path) + model.generation_config = GenerationConfig.from_pretrained(model_path) + return model + + def load_tokenizer(self, model_path: str): + tokenizer = super().load_tokenizer(model_path) + + # read generation config + with open(model_path + '/generation_config.json', 'r') as f: + generation_config = json.load(f) + tokenizer.pad_token_id = generation_config['pad_token_id'] + tokenizer.eos_token_id = generation_config['eos_token_id'] + return tokenizer +``` + +#### 2. Write the context_builder function for the Model +If the input needs to be converted to a specific format (e.g. chatml format or other human-bot formats), you need to inherit the ContextBuilder class in `src.context_builder.context_builder_family` and override the make_context function. This function is used to convert the input to the corresponding required format. An example is shown below: +```python +class QwenChatContextBuilder(ContextBuilder): + def __init__(self): + super().__init__() + + def make_context( + self, + model, + tokenizer, + query: str, + system: str = "you are a helpful assistant" + ): + ''' + model: PretrainedModel + tokenizer: PretrainedTokenzier + query: Input string + system: System prompt if needed + ''' + im_start, im_end = "<|im_start|>", "<|im_end|>" + im_start_tokens = [tokenizer.im_start_id] + im_end_tokens = [tokenizer.im_end_id] + nl_tokens = tokenizer.encode("\n") + + def _tokenize_str(role, content): + return f"{role}\n{content}", tokenizer.encode( + role, allowed_special=set() + ) + nl_tokens + tokenizer.encode(content, allowed_special=set()) + + system_text, system_tokens_part = _tokenize_str("system", system) + system_tokens = im_start_tokens + system_tokens_part + im_end_tokens + + raw_text = "" + context_tokens = [] + + context_tokens = system_tokens + context_tokens + raw_text = f"{im_start}{system_text}{im_end}" + raw_text + context_tokens += ( + nl_tokens + + im_start_tokens + + _tokenize_str("user", query)[1] + + im_end_tokens + + nl_tokens + + im_start_tokens + + tokenizer.encode("assistant") + + nl_tokens + ) + raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n" + return raw_text, context_tokens +``` + +#### 3. Register the model in the configuration file +Go to the `model_conf.json` file in the conf directory and register the corresponding model name and the loader and context_builder that will be used for this model. Simply write the class names defined in the first and second steps for the loader and context_builder. Here is an example: +```json +{ + "Qwen-Chat": { + "loader": "QwenModelAndTokenizerLoader", + "context_builder": "QwenChatContextBuilder" + } +} +``` + +#### 4. 
Execute the testing script
Run the following code to initiate the test:
```Bash
# model_path: path to the model to be tested
# model_name: the model name registered in the configuration file; the default is Default, which uses the default loader and context_builder
# model_conf_path: path to the model configuration file, usually the model_conf.json file in the conf directory
# eval_dataset_list: names of the datasets to be tested; the default is all, which tests all datasets; to test one or more specific datasets, join their names with the # symbol, for example: dataset1#dataset2
# eval_dataset_fp_conf_path: path to the dataset configuration file
# eval_dataset_type: the type of testing; only the default test type (the test split) is supported
# data_path: path to the evaluation dataset, i.e. the directory where the downloaded data was placed
# k_shot: supports 0-5, the number of few-shot exemplars prepended to each question

python src/run_eval.py \
--model_path path_to_model \
--model_name model_name_in_conf \
--model_conf_path path_to_model_conf \
--eval_dataset_list all \
--eval_dataset_fp_conf_path path_to_dataset_conf \
--eval_dataset_type test \
--data_path path_to_downloaded_devops_eval_data \
--k_shot 0
```

For example, suppose the evaluation dataset has been downloaded to `folder1`, the code is placed in `folder2`, the model is in `folder3`, the model does not require a custom loader or context_builder, and you want zero-shot scores on all datasets. You can then initiate the test with the following script:
```Bash
python folder2/src/run_eval.py \
--model_path folder3 \
--model_name Default \
--model_conf_path folder2/conf/model_conf.json \
--eval_dataset_list all \
--eval_dataset_fp_conf_path folder2/conf/devopseval_dataset_fp.json \
--eval_dataset_type test \
--data_path folder1 \
--k_shot 0
```
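
Before launching a full run, it can be useful to eyeball the prompt your context_builder produces. The snippet below is illustrative only: it reuses the `QwenChatContextBuilder` defined above (and therefore assumes a Qwen-style tokenizer exposing `im_start_id`/`im_end_id`); the model path, question, and system prompt are placeholders.

```python
from transformers import AutoTokenizer

# Placeholder path; point this at the model you are about to evaluate.
tokenizer = AutoTokenizer.from_pretrained("path_to_model", trust_remote_code=True)

builder = QwenChatContextBuilder()
raw_text, context_tokens = builder.make_context(
    model=None,  # make_context above does not actually use the model object
    tokenizer=tokenizer,
    query="单元测试应该覆盖以下哪些方面?\nA. 正常路径\nB. 异常路径\nC. 边界值条件\nD. 所有以上\n答案:",
    system="你是一个DevOps领域的专家,请回答下面的单选题。",
)

print(raw_text)             # the exact string the model will see
print(len(context_tokens))  # prompt length in tokens
```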
\ No newline at end of file diff --git a/content/en/docs/mftcoder/1_introduction.md b/content/en/docs/mftcoder/1_introduction.md new file mode 100644 index 0000000..5bd6e08 --- /dev/null +++ b/content/en/docs/mftcoder/1_introduction.md @@ -0,0 +1,44 @@ +--- +title: Introduction +slug: Introduction +description: Introduction Document +url: /docs/mftcoder-introduction +aliases: +- "/docs/mftcoder-introduction" +--- + + + +## Introduction + +**High Accuracy and efficiency Multi-task Fine-tuning framework for Code LLMs.** + +**MFTCoder** is an open-source project of CodeFuse for accurate and efficient Multi-task Fine-tuning(MFT) on Large Language Models(LLMs), especially on Code-LLMs(large language model for code tasks). +Moreover, we open source Code LLM models and code-related datasets along with the MFTCoder framework. + +In MFTCoder, we released two codebases for finetuning Large Language Models: +- **```MFTCoder-accelerate```** is a framework with accelerate and DeepSpeed/FSDP. All tech-stacks are open-source and vibrant. We highly recommend you try this framework and make your fintuning accurate and efficient. +- ```MFTCoder-atorch``` is based on the [ATorch frameworks](https://github.com/intelligent-machine-learning/dlrover), which is a fast distributed training framework of LLM. + +The aim of this project is to foster collaboration and share advancements in large language models, particularly within the domain of code development. + +### Frameworks +![img.jpg](/images/mftcoder/img.jpg) + +### Highlights +:white_check_mark: **Multi-task**: Train models on multiple tasks while maintaining a balance between them. The models can even generalize to new, previously unseen tasks. + +:white_check_mark: **Multi-model**: It integrates state-of-the-art open-source models such as gpt-neox, llama, llama-2, baichuan, Qwen, chatglm2, and more. (These finetuned models will be released in the near future.) + +:white_check_mark: **Multi-framework**: It provides support for both Accelerate (with Deepspeed and FSDP) and ATorch + +:white_check_mark: **Efficient fine-tuning**: It supports LoRA, QLoRA as well as Full-parameters training, enabling fine-tuning of large models with minimal resources. The training speed meets the demands of almost all fine-tuning scenarios. + +The main components of this project include: +- Support for both SFT (Supervised FineTuning) and MFT (Multi-task FineTuning). The current MFTCoder achieves data balance among multiple tasks, and future releases will achieve a balance between task difficulty and convergence speed during training. +- Support for QLoRA instruction fine-tuning, LoRA fine-tuning as well as Full-parameters fine-tuning. +- Support for most mainstream open-source large models, particularly those relevant to Code-LLMs, such as DeepSeek-coder, Mistral, Mixtral, Chatglm3, Code-LLaMA, Starcoder, Codegeex2, Qwen, GPT-Neox, and more. +- Support for weight merging between the LoRA adaptor and base models, simplifying the inference process. +- Release of 2 high-quality code-related instruction fine-tuning datasets: [Evol-instruction-66k](https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k) and [CodeExercise-Python-27k](https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k). +- Release of many Code LLMs, please refer to organizations: [codefuse-ai on huggingface](https://huggingface.co/codefuse-ai) or [codefuse-ai on modelscope](https://modelscope.cn/organization/codefuse-ai). 
+ diff --git a/content/en/docs/mftcoder/2_quickstart.md b/content/en/docs/mftcoder/2_quickstart.md new file mode 100644 index 0000000..34d6a0e --- /dev/null +++ b/content/en/docs/mftcoder/2_quickstart.md @@ -0,0 +1,58 @@ +--- +title: QuickStart +slug: QuickStart +description: QuickStart Document +url: /docs/mftcoder-quickstart +aliases: +- "/docs/mftcoder-quickstart" +--- + + + + +## Requirements +To begin, ensure that you have successfully installed CUDA (version >= 11.4, preferably 11.7) along with the necessary drivers. Additionally, make sure you have installed torch (version 2.0.1). + +Next, we have provided an init_env.sh script to simplify the installation of required packages. Execute the following command to run the script: +```bash +sh init_env.sh +``` +We highly recommend training with flash attention(version >= 2.1.0, preferably 2.3.6), please refer to the following link for installation instructions: https://github.com/Dao-AILab/flash-attention + + +## Training +As mentioned above, we open source two training frameworks. You could refer to their own READMEs for more details as followed. + +If you are familiar with open source ```transformers```, ```DeepSpeed``` or ```FSDP```, we highly recommend you try: + +🚀🚀 [**MFTCoder-accelerate: Accelerate + Deepspeed/FSDP Codebase for MFT(Multi-task Finetuning)**](/docs/mftcoder-accelerate) + + +If you want to explore some new framework like atorch, you could check: + +🚀 [MFTCoder-atorch: Atorch Codebase for MFT(Multi-task Finetuning)](/docs/mftcoder-atorch) + + +## Models + +We are excited to release the following two CodeLLMs trained by MFTCoder, now available on both HuggingFace and ModelScope: + + +| Model | HuggingFace Links | ModelScope Links | Base Model | Num of examples trained | Batch Size | Seq Length | +|--------------------------------------|---------------------------------------------------------------------------|---------------------------------------------------------------------------------|----------------------|------|------------|------------| +| 🔥 CodeFuse-DeepSeek-33B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B) | DeepSeek-coder-33B | 60万 | 80 | 4096 | +| 🔥 CodeFuse-Mixtral-8x7B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-Mixtral-8x7B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-Mixtral-8x7B) | Mixtral-8x7B | 60万 | 80 | 4096 | +| 🔥 CodeFuse-CodeLlama-34B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B) | CodeLlama-34b-Python | 60万 | 80 | 4096 | +| 🔥 CodeFuse-CodeLlama-34B-4bits | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits) | CodeLlama-34b-Python | | | 4096 | +| 🔥 CodeFuse-StarCoder-15B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-StarCoder-15B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-StarCoder-15B) | StarCoder-15B | 60万 | 80 | 4096 | +| 🔥 CodeFuse-QWen-14B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B) | Qwen-14b | 110万 | 256 | 4096 | +| 🔥 CodeFuse-CodeGeex2-6B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-CodeGeex2-6B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeGeex2-6B) | CodeGeex2-6B | 110万 | 256 | 4096 | + + +## Datasets +We are 
also pleased to release two code-related instruction datasets, meticulously selected from a range of datasets to facilitate multitask training. Moving forward, we are committed to releasing additional instruction datasets covering various code-related tasks. + +| Dataset | Description | +|-----------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [⭐ Evol-instruction-66k](https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k) | Based on open-evol-instruction-80k, filter out low-quality, repeated, and similar instructions to HumanEval, thus get high-quality code instruction dataset. | +| [⭐ CodeExercise-Python-27k](https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k) | python code exercise instruction dataset | diff --git a/content/en/docs/mftcoder/3_accelerate.md b/content/en/docs/mftcoder/3_accelerate.md new file mode 100644 index 0000000..cd4977e --- /dev/null +++ b/content/en/docs/mftcoder/3_accelerate.md @@ -0,0 +1,353 @@ +--- +title: "MFTCoder-accelerate: Training Framework with Accelerate and DeepSpeed/FSDP" +description: 介绍主要功能 +url: /docs/mftcoder-accelerate +aliases: +- "/docs/mftcoder-accelerate" +--- + + +[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai) + + GitHub + + +[[中文]](/docs/mftcoder-accelerate-zh) [**English**] + +## 1. Updates + +🔥 MFTCoder-accelerate supports Full-parameters/LoRA using accelerate + FSDP Framework; + +🔥 MFTCoder-accelerate supports MFT/SFT on more new mainstream open-source base models: mistral, mixtral-8x7b(Mixture of Experts), deepseek, chatglm3; + +🔥 MFTCoder-accelerate supports Self-Paced Loss for Convergence Balance; + +🔥 MFTCoder-accelerate supports Full-parameters/QLoRA/LoRA using accelerate + DeepSpeed Framework; + +🔥 MFTCoder-accelerate supports Multitask Fine-Tuning(MFT), which is able to balance diffenrent tasks in data level. + +🔥 MFTCoder-accelerate supports finetuning most of mainstream open-source base models: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen. + +## 2. Data Format +### 2.1 Training Data Format +The training data is required to be a uniformed JSONL format, in which each line of data has the following "chatML"-style JSON format. The "chat_rounds" field is required, and other fields can be added or removed based on specific needs. +The reason why we selected "chatML" style as our training and inference data format is that "chatML" style is compatible with both "conversation" and "instruction/response" scenarios. + +For the keys of roles in "chat_rounds", you could use "system/human/bot" tuple or "system/user/assistant" tuple. + +```json +{ + "id":0, + "data_name":"code-helper", + "chat_rounds":[ + { + "role": "system", + "content": "You are a expert in coding and help answer code questions" + }, + { + "role": "human", + "content": "Write a python function of quick sort" + }, + { + "role": "bot", + "content": "Below is the function of quick sort: ..." + }, + { + "role": "human", + "content": "Explain the code" + }, + { + "role": "bot", + "content": "OK, this code ..." + } + ] +} +``` + +### 2.2 Default Inference Data Format +Inference data format is the real string format consumed by tokenizers and then LLMs. It is also the string format to which the training data is converted before tokenization. 
+The default inference data format contains strings concatenated by conversation data(system, human and bot contents) in the training data format. +It is used as the data "seen"(before tokenization) by the model in training process. +It is used as input during the inference process as well. +Here is an example format of the inference string: + +``` +""" +system +System instruction +human +User 1st round input +bot +Assistant 1st round output{EOS_TOKEN} +human +User 2nd round input +bot +Assistant 2nd round output{EOS_TOKEN} +... +... +... +human +User nth round input +bot +{Assistant output to be genreated}{EOS_TOKEN} +""" +``` +When applying inference, you always make your input string end with ```bot\n``` to request the model generating answers. + + + +## 3. Model Training +Currently, the "MFTCoder-accelerate" codebase supports Full-parameters/LoRA/QLoR along with Multi-Task FineTuning(MFT). +In theory, this project can be used to train any publicly available model in the HuggingFace Format. + +Here are some excellent pre-trained models weights available on Huggingface that can be finetuned with this codebase: + +🤗 [Latest code pre-trained SOTA, CodeLlama-34b-Python](https://huggingface.co/codellama/CodeLlama-34b-Python-hf) : code-llama-34b, code-llama-34b-python, a new SOTA base model. + +🤗 [Best 10B level pre-trained Code LLM, Starcoder:](https://huggingface.co/bigcode/starcoder) wizardCoder-15B, PanGu-coder2, and other previous SOTA were trained on it. + +🤗 [Multilingual powerhouse, Qwen-7b](https://huggingface.co/Qwen/Qwen-7B): Suitable for multilingual tasks, including Chinese tasks, for instruction fine-tuning. + +**mftcoder_accelerate directory structure** +``` +mftcoder_accelerate + | + src + configs + | + data + | + model + | + *pefts* + | + tokenizer + | + utils + | + evals +``` +我们将训练中使用的各种组件抽取出来,以便后续的扩展和优化, 详见```src```目录下的实现。 + +训练入口文件是```mftcoder_accelerate/src/pefts/mft_accelerate.py``` + +参数配置存储在```mftcoder_accelerate/src/configs```目录下,方便统一管理和更改。 + +**_所以,在你开启训练之前,请进入src目录_** +``` +cd mftcoder_accelerate/src +``` + +You can find the implementations in the ```mftcoder_accelerate/src``` directory. +The entry directory for fine-tuning training is ```mftcoder_accelerate/src```, and the entry file for training is ```mftcoder_accelerate/src/pefts/mft_accelerate.py```. +Configurations are stored in the ```mftcoder_accelerate/src/configs``` directory for easy management and modification. + +**_As a result, before you start training, you should first change your dir by_** +``` +cd mftcoder_accelerate/src +``` + +### 3.1 Tokenization +During training, we concatenate multi-turn dialogues into the following format (also known as the inference data format mentioned before) and then tokenize it. + +In default format, ```human\n``` starts the user's input (i.e., prompt),```bot\n``` starts the assistant's output (i.e., response) + +```{EOS_TOKEN}``` represents the proper eos_token. +We have different eos_tokens in ```src/pefts/model_mapping.py``` which fits different base models. + +Here is a visionable example of the training data after formatting: +``` +f"human\n{input1}bot\n{target1}{EOS_TOKEN}\nhuman\n{input2}bot\ntarget2{EOS_TOKEN}\n" +``` +During the calculation of loss, we use a ```loss mask``` to ensure that the loss from the input part does not contribute to parameter updates. Only the loss from the ```target{EOS_TOKEN}``` part is used for updating parameters. +This approach takes full advantage of the benefits of model parallelism, making training more efficient. 
It also leverages the characteristic of decoder-only models with left-to-right attention. +By including all target parts from multiple turns in a single training iteration, the training process becomes more efficient. + + +### 3.2 LoRA/QLoRA + +#### Intro +You can refer to the Lora paper for details about LoRA:[LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf) + +You can refer to the Qlora paper for details about QLoRA:[QLORA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf) + +QLoRA (Quantized LoRA) is a method that combines 4-bit nf4 quantization and additional adapters to achieve a balance between reducing GPU memory consumption and approaching the performance of full-parameter fine-tuning. + +According to the QLoRA paper, this method enables fine-tuning of a 33B model on a single V100 GPU while achieving performance close to that of full-parameter fine-tuning. + +To perform LoRA/QLoRA fine-tuning, you can execute the following command: + +#### Launch via Deepspeed +DeepSpeed config in accelerate_ds_config.yaml. +```bash +accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "DeepSpeed" +``` +or +DeepSpeed config in command line arguments +```bash +sh ds_single_launch.sh +``` + +#### Launch via FSDP +FSDP config in accelerate_fsdp_config.yaml. +```bash +accelerate launch --config_file accelerate_fsdp_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "FSDP" +``` +or +FSDP config in command line arguments +```bash +sh ds_single_launch.sh +``` + +#### Traing Arguments +All arguments allowed in ***_train_config.josn are defined in ```arguments.py```. + +Frequently used arguments are provided in ```configs/***_train_config``` and explained as follows. You can modify these parameters according to your needs: + +- **load_raw_dataset**: Need to be true at present. Only JSONL format is supported. + +- **data_paths**: Input data paths in a String of list format, e.g., "[path1,path2,path3]". Each path represents a task directory and each task directory contains one or more JSONL data files. + +- **output_dir**: Training output directory to store checkpoints, Lora adapter, etc. + +- **tb_dir**: TensorBoard directory to store logs, metrics, etc. + +- **model_type**: Type of the model to train, e.g., "mixtral | llama | starcoder | chatglm2 | qwen | gpt_neox". + +- **attn_implementation**: "flash_attention_2" or "eager" or "sdpa", worked when model is supported by transformers officially + +- **peft_type**: null or "lora" or "qlora". null for full-params training + +- **lora_rank**: Rank value for Lora. + +- **lora_alpha**: Alpha value for Lora. + +- **lora_dropout**: Dropout rate for Lora. + +- **target_modules**: List of target modules in lora, we have default values if None + +- **quantization**: "4bit" for QLoRA/ null for LoRA and Full-params training. + +- **pretrained_model_path**: Local/Shared disk path or model name on HuggingFace for the pre-trained model. + +- **weighted_loss_mode**: Loss weighting method for multitask training. "case3" is recommended at present, "self-paced" is supported but need tuning of hyperparameters. + +- **padding_mode**: The way tokenized data is set. "padding" means padding for each sample to seq_length, "pack" means putting samples into seq_length as many as possible. + +- **num_train_epochs**: Number of training epochs. 
+ +- **per_device_train_batch_size**: Batch size per GPU for training. + +- **per_device_eval_batch_size**: Batch size per GPU for evaluation. + +- **gradient_accumulation_steps**: Number of gradient accumulation steps. Global batch size is calculated as num_gpus * per_device_train_batch_size * gradient_accumulation_steps. + +- **learning_rate**: Initial Learning rate. For full-parameter fine-tuning, it is recommended to use a smaller value such as 1e-5 or 5e-6. For QLoRA, a larger learning rate is generally used, such as 1e-4 or 2e-4. + +- **min_lr**: Minimum learning rate. Usually set to one-tenth of the learning rate. + +- **seq_length**: Maximum input sequence length during training. + +- **log_interval**: Log training loss every ```log_interval``` steps. + +- **checkpointing_steps**: Save a checkpoint every ```checkpointing_steps``` steps. + +- **evaluation_steps**: Evaluate on the validation set every ```evaluation_steps``` steps. + +- **early_stopping**: Enable early stopping or not. + +- **early_stopping_stall_num**: Number of evaluation points without improvement which triggers early stopping. + +- **lr_scheduler_type**: Type of learning rate scheduler. "cosine" is a good choice already. + +- **num_warmup_steps**: Number of warm-up steps to gradually increase the learning rate. + +- **seed**: Random seed for reproducibility. + +- **saving_limit**: ckpt saving limit num, must be set in Full-parameter training. + +- **role_markers**: {"system": "\system\n", "user": "\human\n", "assistant": "\bot\n} as default(null). You could set your preferred role_markers as the templates startting "system", "user" and "assistant". e.g. {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"} + + +## 4. Model Usage + +### 4.1 Merge Adaptor weights +Using LoRA or QLoRA for training, this project only saves the weights and configuration files of the adapters. 
+To merge the adapter weights with the base model: +``` +python pefts/merge_base_and_lora_to_hf.py \ + --base_model_or_path model_path \ + --adaptor_path lora_adapter_path \ + --model_type model_type \ + --merged_output_path output_path +``` + +### 4.2 Inference demo +Here is the script for inference on models trained by MFTCoder since v0.3.0, which is compatible with most HuggingFace models: +```python +from transformers import ( + AutoTokenizer, + AutoModelForCausalLM, +) +model_name_or_path = "codefuse-ai/CodeFuse-Deepseek-33B" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, padding_side="left") +tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("<|end▁of▁sentence|>") +tokenizer.pad_token_id = tokenizer.eos_token_id +model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True) + +HUMAN_ROLE_START_TAG = "human\n" +BOT_ROLE_START_TAG = "bot\n" +texts = ["write a python function of quick sort."] +texts = [f"{HUMAN_ROLE_START_TAG}{text}{BOT_ROLE_START_TAG}" for text in texts] + +inputs = tokenizer(texts, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda") +outputs = model.generate( + inputs=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_new_tokens=512, + top_p=0.95, + temperature=0.1, + do_sample=True, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id + ) +gen_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True) +print(gen_text) +``` + + +Indeed, the parameters top_p, temperature, repetition_penalty, do_sample, etc., have a significant impact on the model's generation output. +You can modify these parameters based on your specific use case. + +In code generation scenarios, if you are using the sampling mode (do_sample=True), the following parameter settings can yield good results for the Pass@1 metric: + +top_p: Set a higher value, such as 0.95, to retain highly probable generated words. This helps ensure more accurate and fluent generation results. + +temperature: Set a lower value, such as 0.1, to reduce randomness. Lower temperature values make the generation output more deterministic. + +These parameter combinations can control the diversity of the generated outputs while maintaining naturalness. Additionally, you can adjust other related parameters, such as repetition_penalty, to reduce repetition in the generated results. + +If you choose the non-sampling mode (do_sample=False), you can consider the following parameter settings: + +beam_num: Set a smaller value such as 1 or 3. ```beam_num=1``` represents greedy decoding, which selects the most probable single generated word. ```beam_num=3``` represents beam search mode, which considers multiple potential generation paths and chooses the best path among them. + +## 5. FAQ +#### Q1:What should I do when cuda OOM happens? +If OOM happened,you can reduce parameters such as per_device_train_batch_size and seq_length. Since you are dealing with large models (6B, 13B, 34B, 70B, etc.), you are already using gradient checkpointing technology by default, which significantly reduces GPU memory consumption. +However, this may slightly slow down the training speed. + +#### Q2:install packages +Please refer to init_env.sh and requirements.txt +We highly recommend you install Flash Attention 2 (flash_attn>=2.1.0, 2.3.6 used by us) first to get memory-efficient and fast training. + +#### Q3:How should I specify the GPUs for training? 
+You can specify the visible GPUs as below:
+```bash
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json
+```
+
+#### Q4:What is the recommended distributed training setup?
+For LoRA/QLoRA, we recommend DeepSpeed (ZeRO2) as the underlying framework, because it is easy and stable to use and is more compatible with different settings.
+Note that FSDP does not support quantization (integer types during training).
+
+For full-parameter fine-tuning, FSDP is usually faster and may help you with very large models by sharding parameters and gradients.
\ No newline at end of file
diff --git a/content/en/docs/mftcoder/4_atorch.md b/content/en/docs/mftcoder/4_atorch.md
new file mode 100644
index 0000000..a53f7f2
--- /dev/null
+++ b/content/en/docs/mftcoder/4_atorch.md
@@ -0,0 +1,239 @@
+---
+title: "MFTCoder Training: Atorch Framework"
+description: 介绍主要功能
+url: /docs/mftcoder-atorch
+aliases:
+- "/docs/mftcoder-atorch"
+---
+
+
+[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai)
+
+ GitHub
+
+
+[[中文]](/docs/mftcoder-atorch-zh) [**English**]
+
+## 1. Updates
+
+🔥 MFTCoder supports fine-tuning of the GPTNeoX model under the Atorch framework.
+
+🔥 MFTCoder supports full-parameter supervised fine-tuning.
+
+🔥 MFTCoder supports LoRA using the Atorch framework.
+
+## 2. Data Format
+### 2.1 Training Data Format
+The training data is in a uniform JSONL format, in which each line of data has the following JSON format. The "chat_rounds" field is required, and other fields can be added or removed based on the specific need.
+
+```json
+{
+    "id":0,
+    "data_name":"code-helper",
+    "chat_rounds":[
+        {
+            "role": "system",
+            "content": "You are an expert in coding and help answer code questions",
+            "chat_round_id": 0
+        },
+        {
+            "role": "human",
+            "content": "Write a python function of quick sort",
+            "chat_round_id": 1
+        },
+        {
+            "role": "bot",
+            "content": "Below is the function of quick sort: ...",
+            "chat_round_id": 1
+        },
+        {
+            "role": "human",
+            "content": "Explain the code",
+            "chat_round_id": 2
+        },
+        {
+            "role": "bot",
+            "content": "OK, this code ...",
+            "chat_round_id": 2
+        }
+    ]
+}
+```
+
+### 2.2 Inference Data Format
+The inference data consists of strings concatenated from the conversation data (system, human and bot contents) in the training data format.
+It is the data "seen" (before tokenization) by the model during training.
+It is also used as input during inference.
+Here is an example format of the concatenated string:
+
+```python
+"""
+<|role_start|>system<|role_end|>System instruction
+<|role_start|>human<|role_end|>Human 1st round input
+<|role_start|>bot<|role_end|>Bot 1st round output
+<|role_start|>human<|role_end|>Human 2nd round input
+<|role_start|>bot<|role_end|>Bot 2nd round output
+...
+...
+...
+<|role_start|>human<|role_end|>Human nth round input
+<|role_start|>bot<|role_end|>{Bot output to be generated}
+"""
+```
+At inference time, always make your input string end with "<|role_start|>bot<|role_end|>" to request the model to generate an answer.
+
+## 3. Model Training
+Currently, the "MFTCoder/mft_atorch" code repository supports full-parameter instruction fine-tuning and LoRA instruction fine-tuning. Only training of the GPTNeoX model is supported. In theory, the pretrained weights of any GPTNeoX model available on HuggingFace can be used for training within this project.
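+As a concrete illustration of the two formats above, the following sketch (not part of the repository) flattens a "chat_rounds" record from section 2.1 into the concatenated string of section 2.2; the eos token shown is an assumed placeholder and must match the tokenizer you actually use:
+```python
+ROLE_START, ROLE_END = "<|role_start|>", "<|role_end|>"
+EOS_TOKEN = "<|endoftext|>"  # assumed placeholder; use your tokenizer's real eos_token
+
+def build_prompt(chat_rounds, for_inference=True):
+    """Concatenate system/human/bot turns into the section 2.2 format."""
+    text = ""
+    for turn in chat_rounds:
+        text += f"{ROLE_START}{turn['role']}{ROLE_END}{turn['content']}"
+        if turn["role"] == "bot":
+            text += EOS_TOKEN  # an eos token closes every bot answer
+    if for_inference:
+        text += f"{ROLE_START}bot{ROLE_END}"  # ask the model for the next bot answer
+    return text
+
+example = [
+    {"role": "system", "content": "You are an expert in coding and help answer code questions"},
+    {"role": "human", "content": "Write a python function of quick sort"},
+]
+print(build_prompt(example))
+```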
+ +We have extracted various components used in training to facilitate future extension and optimization. Please refer to the implementation in the main directory for more details. The entry directory for fine-tuning training is ```train/```, and the entry file for training is ```train/run_train.py```. The parameter configurations are stored in the launch scripts such as ```train/run_gpt_*.sh```, making it easier to manage and modify them uniformly. + +### 3.1 Tokenization +During training, we concatenate multi-turn dialogues into the following format (also known as the inference data format mentioned earlier) and then tokenize it. In this format, <|role_start|>human<|role_end|> represents the human input (i.e., prompt), <|role_start|>bot<|role_end|> represents the bot output, and represents the eos_token. +You can modify and replace the eos_token based on different models' requirements. + +Here is an example of the concatenated format with prompts: +``` +"<|role_start|>human<|role_end|>input1target1input2target2... +``` +During the calculation of loss, we use a ```loss mask``` to ensure that the loss from the input part does not contribute to the parameter updates. Only the loss from the ```target``` part is used for updating parameters. +This approach takes full advantage of the benefits of model parallelism, making training more efficient. It also leverages the characteristic of decoder-only models with left-to-right attention. +By including all target parts from multiple turns in a single training iteration, the training process becomes more efficient. + +### 3.2 Fully Supervised Fine-Tuning (SFT) +To perform fully SFT, you can execute the following command: +```bash +sh run_gpt_mft.sh 10 1 8 5 +``` +Please note that the four parameters after the launch script have the following meanings: +- The first parameter is the per GPU batch size. +- The second parameter is the number of tensor parallelism (currently only supports 1). +- The third parameter is the number of data parallelism, which should match the number of GPUs used. +- The fourth parameter is the number of training epochs. + +For other training modes, the same four parameters need to be configured in the launch script. + +### 3.3 LoRA Supervised Fine-Tuning +To perform LoRA SFT, you can execute the following command: +```bash +sh run_gpt_mft_peft.sh 10 1 8 5 +``` + +### 3.4 Parameter Explanations +The main parameter explanations for the ```train/run_gpt_*.sh``` are as follows. You can modify these parameters according to your needs: + +- **tokenize_mode**: Need to be 'sft' at present. + +- **train_mode**: Need to be 'sft' at present. + +- **load_raw_dataset**: Need to be 'True' at present. Only JSONL format is supported. + +- **data_paths**: "[path1,path2,path3]" Input data addresses, a string enclosed in [], with different paths separated by commas (,). Each path is a directory where the last level of the directory name is considered as the task name. Each task directory contains 1 to multiple jsonl data files. + +- **output_dir**: Training output directory to store checkpoints, lora_adaptor checkpoints, etc. + +- **tensorboard_dir**: Can be temporarily ignored, as the actual tensorboard is stored in the runs directory under output_dir. + +- **model_type**: Currently only supports gpt_neox. + +- **peft_type**: Currently only supports lora. + +- **pretrained_model_path**: Local directory of the pre-trained model. 
+ +- **total_train_batch_size**: The total batch size for training across all GPUs, calculated automatically based on per gpu batch size entered in the script. + +- **per_device_valid_batch_size**: The batch size for evaluation on each GPU, calculated automatically based on per gpu batch size entered in the script. + +- **gradient_accumulation_steps**: Number of gradient accumulation steps. Global batch size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps. + +- **checkpoint_activations**: Enable if running out of GPU memory. Trades time for space by not caching activation states, resulting in two forward passes to save memory. + +- **learning_rate**: Learning rate. When fine-tuning the entire model, it is recommended to use a smaller value, such as 1e-5 or 5e-6. For lora, a larger learning rate is generally used, such as 1e-4 or 2e-4. + +- **min_lr**: Minimum learning rate, usually one-tenth of the learning_rate. + +- **seq_length**: Maximum length during training. Set according to your device, longer lengths require more memory. + +- **log_interval**: Frequency of logging training loss. + +- **checkpointing_steps**: Frequency of saving a model checkpoint. + +- **evalation_steps**: Frequency of evaluating on the validation set. + +- **early_stopping_patience**: Number of consecutive eval points without further convergence to stop training. + +- **lr_scheduler_type**: Learning rate changing strategy. + +- **num_warmup_steps**: Number of warm-up steps for the learning rate to increase to the specified value. + +- **seed**: Random seed used for reproducibility of experimental results. + +- **train_iters**: Can be temporarily set to a small value, such as 10, which does not affect the actual number of training steps, kept for future expansion to support reading datasets in other formats. + +- **valid_iters**: Can be temporarily set to a small value, such as 10, which does not affect the actual number of training steps, kept for future expansion to support reading datasets in other formats. + +- **evaluation_strategy**: Evaluation strategy during training. "steps" means to evaluate every "valid_interval" steps, "epoch" means to evaluate every epoch. Both can be enabled simultaneously. + +- **save_strategy**: Strategy for saving model weights during training. "steps" means to save every "checkpointing_steps" steps. +- **extra_save_by_epoch**: Whether to save an epoch-level checkpoint every epoch. + +- **save_total_limit**: Maximum number of model checkpoints to keep. Generally set to 2, retaining the checkpoint with the lowest valid loss and the latest checkpoint. Note that epoch-level checkpoints will always be retained and are not subject to this limit. + +- **weighted_loss_mode**: Loss weighting method for multi-task training. + +## 4. Model Usage + +### 4.1 Merge Adaptor weights +Using LoRA or QLoRA for training, this project only saves the weights and configuration files of the adapters. 
+To merge the adapter weights with the base model, see ```src/pefts/merge_base_and_lora_to_hf.py``` + +### 4.2 Inference demo +Here is the script for inference on our trained models, which is compatible with most Hugging Face models: +```python +from transformers import ( + AutoTokenizer, + AutoModelForCausalLM, +) +tokenizer = AutoTokenizer.from_pretrained(mode_name_or_path, trust_remote_code=True, use_fast=False, legacy=False) +tokenizer.padding_side = "left" +tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("") +tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("") +model = AutoModelForCausalLM.from_pretrained(mode_name_or_path, trust_remote_code=True) + +HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>" +BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>" +texts = ["write a python function of quick sort."] +texts = [f"{HUMAN_ROLE_START_TAG}{text}{BOT_ROLE_START_TAG}" for text in texts] + +inputs = tokenizer(texts, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda") +outputs = model.generate( + inputs=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_new_tokens=512, + top_p=0.95, + temperature=0.1, + do_sample=True, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id + ) +gen_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True) +print(gen_text) +``` + +Indeed, the parameters top_p, temperature, repetition_penalty, do_sample, etc., have a significant impact on the model's generation output. +You can modify these parameters based on your specific use case. + +In code generation scenarios, if you are using the sampling mode (do_sample=True), the following parameter settings can yield good results for the Pass@1 metric: + +top_p: Set a higher value, such as 0.95, to retain highly probable generated words. This helps ensure more accurate and fluent generation results. + +temperature: Set a lower value, such as 0.1, to reduce randomness. Lower temperature values make the generation output more deterministic. + +These parameter combinations can control the diversity of the generated outputs while maintaining naturalness. Additionally, you can adjust other related parameters, such as repetition_penalty, to reduce repetition in the generated results. + +If you choose the non-sampling mode (do_sample=False), you can consider the following parameter settings: + +beam_num: Set a smaller value such as 1 or 3. ```beam_num=1``` represents greedy decoding, which selects the most probable single generated word. ```beam_num=3``` represents beam search mode, which considers multiple potential generation paths and chooses the best path among them. + +## 5. FAQ +### Q1:What should I do when cuda OOM happens? +If OOM (Out of Memory) occurs, you can mitigate it by reducing parameters such as per GPU batch size (the first argument when starting the training script) and seq_length. You can also set gradient_checkpointing=true, which significantly reduces memory usage but may slow down the training speed. \ No newline at end of file diff --git a/content/en/docs/overview/b1.codefusechatbot.md b/content/en/docs/overview/b1.codefusechatbot.md new file mode 100644 index 0000000..198f850 --- /dev/null +++ b/content/en/docs/overview/b1.codefusechatbot.md @@ -0,0 +1,60 @@ +--- +title: Codefuse-ChatBot Development by Private Knowledge Augmentation +slug: codefuse-chatbot +language: en +description: 介绍主要功能 +aliases: +- "/docs/codefuse-chatbot" +--- + +

+ 中文  |  English  +

+ + +This project is an open-source AI intelligent assistant, specifically designed for the entire lifecycle of software development, covering design, coding, testing, deployment, and operations. Through knowledge retrieval, tool utilization, and sandbox execution, Codefuse-ChatBot can not only answer professional questions you encounter during the development process but also coordinate multiple independent, dispersed platforms through a conversational interface. + + +## 📜 Contents +- [🤝 Introduction](#-introduction) +- [🧭 Technical Route](#-technical-route) + +## 🤝 Introduction + +💡 The aim of this project is to construct an AI intelligent assistant for the entire lifecycle of software development, covering design, coding, testing, deployment, and operations, through Retrieval Augmented Generation (RAG), Tool Learning, and sandbox environments. It transitions gradually from the traditional development and operations mode of querying information from various sources and operating on standalone, disparate platforms to an intelligent development and operations mode based on large-model Q&A, changing people's development and operations habits. + +- **🧠 Intelligent Scheduling Core:** Constructed a well-integrated scheduling core system that supports multi-mode one-click configuration, simplifying the operational process.[Use Introduction](/docs/multi-agent) +- **💻 Comprehensive Code Repository Analysis:** Achieved in-depth understanding at the repository level and coding and generation at the project file level, enhancing development efficiency. +- **📄 Enhanced Document Analysis:** Integrated document knowledge bases with knowledge graphs, providing deeper support for document analysis through enhanced retrieval and reasoning. +- **🔧 Industry-Specific Knowledge:** Tailored a specialized knowledge base for the DevOps domain, supporting the self-service one-click construction of industry-specific knowledge bases for convenience and practicality. +- **🤖 Compatible Models for Specific Verticals:** Designed small models specifically for the DevOps field, ensuring compatibility with related DevOps platforms and promoting the integration of the technological ecosystem. + +🌍 Relying on open-source LLM and Embedding models, this project can achieve offline private deployments based on open-source models. Additionally, this project also supports the use of the OpenAI API.[Access Demo](/docs/fastchat) + +👥 The core development team has been long-term focused on research in the AIOps + NLP domain. We initiated the CodefuseGPT project, hoping that everyone could contribute high-quality development and operations documents widely, jointly perfecting this solution to achieve the goal of "Making Development Seamless for Everyone." + + +
+ Image +
+
+## 🧭 Technical Route
+ Image +
+
+- 🧠 **Multi-Agent Schedule Core:** Easily configurable to create interactive intelligent agents.
+- 🕷️ **Multi Source Web Crawl:** Offers the capability to crawl specified URLs for collecting the required information.
+- 🗂️ **Data Processor:** Effortlessly handles document loading, data cleansing, and text segmentation, integrating data from different sources.
+- 🔤 **Text Embedding & Index:** Users can easily upload files for document retrieval, optimizing the document analysis process.
+- 🗄️ **Vector Database & Graph Database:** Provides flexible and powerful data management solutions.
+- 📝 **Prompt Control & Management:** Precisely defines the contextual environment for intelligent agents.
+- 🚧 **SandBox:** Safely executes code compilation and actions.
+- 💬 **LLM:** Supports various open-source models and LLM interfaces.
+- 🛠️ **API Management:** Enables rapid integration of open-source components and operational platforms.
+
+For implementation details, see: [Technical Route Details](sources/readme_docs/roadmap.md)
diff --git a/content/en/docs/overview/b10.codefuse-evalution.md b/content/en/docs/overview/b10.codefuse-evalution.md
new file mode 100644
index 0000000..e81e980
--- /dev/null
+++ b/content/en/docs/overview/b10.codefuse-evalution.md
@@ -0,0 +1,25 @@
+---
+title: "CodeFuseEval: Multi-tasking Evaluation Benchmark for Code Large Language Model"
+description: 介绍主要功能
+url: "/docs/codefuse-evalution"
+aliases:
+- "/docs/codefuse-evalution"
+---
+
+
+# CodeFuseEval: Multi-tasking Evaluation Benchmark for Code Large Language Model
+
+

+ 简体中文| + CodeFuseEval on ModelScope| + CodeFuseEval on Hugging Face +

+ +
+
+CodeFuseEval is a code generation benchmark that combines the multi-tasking scenarios of the CodeFuse models with the HumanEval-x and MBPP benchmarks. It is designed to evaluate model performance on a variety of tasks, including code completion, code generation from natural language, test case generation, cross-language code translation, and code generation from Chinese commands, among others. More tasks are being opened up continuously, so stay tuned!
+

+ English Introduction +

diff --git a/content/en/docs/overview/b2.codefuseDevopsEval.md b/content/en/docs/overview/b2.codefuseDevopsEval.md new file mode 100644 index 0000000..71b2ba0 --- /dev/null +++ b/content/en/docs/overview/b2.codefuseDevopsEval.md @@ -0,0 +1,133 @@ +--- +title: codefuse-devops-eval +slug: codefuse-devops-eval +description: 介绍主要功能 +aliases: +- "/docs/codefuse-devops-eval" +--- + +

+ + + +DevOps-Eval is a comprehensive evaluation suite specifically designed for foundation models in the DevOps field. We hope DevOps-Eval could help developers, especially in the DevOps field, track the progress and analyze the important strengths/shortcomings of their models. + + +📚 This repo contains questions and exercises related to DevOps, including the AIOps, ToolLearning; + +💥️ There are currently **7486** multiple-choice questions spanning 8 diverse general categories, as shown [below](/images/devops_eval/data_info.png). + +🔥 There are a total of **2840** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **time series forecasting**, and **root cause analysis**. + +🔧 There are a total of **1509** samples in the ToolLearning subcategory, covering 239 tool scenes across 59 fields. + +

+ +## 🏆 Leaderboard +Below are zero-shot and five-shot accuracies from the models that we evaluate in the initial release. We note that five-shot performance is better than zero-shot for many instruction-tuned models. +### 👀 DevOps +#### Zero Shot + +| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** | +|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:-----------:| +| DevOpsPal-14B-Chat | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 69.89 | 79.17 | 78.23 | +| DevOpsPal-14B-Base | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 71.18 | 82.41 | 78.23 | +| Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 69.57 | 80.56 | 77.18 | +| Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 70.05 | 80.09 | 76.19 | +| Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 70.37 | 83.8 | 73.73 | +| Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 67.63 | 84.72 | 72.9 | +| DevOpsPal-7B-Chat | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 64.73 | 77.78 | 71.92 | +| DevOpsPal-7B-Base | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 65.54 | 78.7 | 71.69 | +| Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 65.06 | 80.09 | 71.09 | +| Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 62.64 | 79.17 | 69.75 | +| Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 59.42 | 79.63 | 66.97 | +| Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 60.39 | 78.24 | 66.27 | +| Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 61.67 | 75.93 | 66.21 | +| Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 65.86 | 75.93 | 65.99 | + + +#### Five Shot + +| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** | +|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:---------:| +| DevOpsPal-14B-Chat | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 72.95 | 81.48 | 79.69 | +| DevOpsPal-14B-Base | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 71.98 | 80.09 | 79.63 | +| Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 70.85 | 81.48 | 77.81 | +| Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 72.46 | 80.56 | 77.56 | +| Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 73.75 | 85.19 | 75.8 | +| Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 70.37 | 81.94 | 75.36 | +| Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 71.98 | 80.56 | 74.12 | +| DevOpsPal-7B-Chat | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 68.6 | 76.85 | 73.61 | +| DevOpsPal-7B-Base | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 67.15 | 79.17 | 73.35 | +| Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 65.38 | 81.02 | 71.69 | +| Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 67.31 | 79.63 | 70.8 | +| Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 60.06 | 77.31 | 69.21 | +| Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 64.9 | 79.17 | 69.05 | +| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 63.45 | 75.46 | 67.17 | + +### 🔥 AIOps + +
+ +#### Zero Shot +| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** | +|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:| +| Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 62.5 | 52.25 | +| DevOpsPal-14B—Base | 63.14 | 53.6 | 23.33 | 43.5 | 64.06 | 50.49 | +| Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 62.5 | 48.94 | +| DevOpsPal-14B—Chat | 60 | 56 | 24 | 43 | 57.81 | 48.8 | +| Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 43.75 | 41.48 | +| DevOpsPal-7B—Chat | 56.57 | 30.4 | 25.33 | 45 | 44.06 | 40.92 | +| Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 46.88 | 39.3 | +| Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 25.31 | 36.97 | +| Internlm-7B—Chat | 58.86 | 8.8 | 22.33 | 28.5 | 51.25 | 36.34 | +| Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 39.06 | 36.34 | +| Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 40.31 | 35.49 | +| Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 42.81 | 34.86 | +| DevOpsPal-7B—Base | 46.57 | 20.8 | 25 | 34 | 38.75 | 33.94 | +| Internlm-7B—Base | 48.57 | 18.8 | 23.33 | 37.5 | 33.75 | 33.1 | + +#### One Shot +| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** | +|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:| +| DevOpsPal-14B—Chat | 66.29 | 80.8 | 23.33 | 44.5 | 56.25 | 54.44 | +| DevOpsPal-14B—Base | 60 | 74 | 25.33 | 43.5 | 52.5 | 51.13 | +| Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 40.31 | 50.77 | +| Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 57.19 | 49.44 | +| Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 42.19 | 46.13 | +| Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 46.88 | 42.89 | +| Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 51.88 | 41.83 | +| DevOpsPal-7B—Base | 52.86 | 44.4 | 28 | 44.5 | 36.25 | 41.2 | +| Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 40.94 | 39.86 | +| Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 27.19 | 38.73 | +| Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 30.63 | 37.75 | +| DevOpsPal-7B—Chat | 56.57 | 27.2 | 25.33 | 41.5 | 33.44 | 37.46 | +| Internlm-7B—Chat | 62.57 | 12.8 | 22.33 | 21 | 50.31 | 36.69 | +| Internlm-7B—Base | 48 | 33.2 | 29 | 35 | 31.56 | 35.85 | + +
+ + +### 🔧 ToolLearning +
+ +| **FuncCall-Filler** | dataset_name | fccr | 1-fcffr | 1-fcfnr | 1-fcfpr | 1-fcfnir | aar | +|:-------------------:| :---: | :---: | :---: | :---: | :---: | :---: | :---: | +| Qwen-14b-chat | luban | 61 | 100 | 97.68 | 63.32 | 100 | 69.46 | +| Qwen-7b-chat | luban | 50.58 | 100 | 98.07 | 52.51 | 100 | 63.59 | +| Baichuan-7b-chat | luban | 60.23 | 100 | 97.3 | 62.93 | 99.61 | 61.12 | +| Internlm-chat-7b | luban | 47.88 | 100 | 96.14 | 51.74 | 99.61 | 61.85 | +| Qwen-14b-chat | fc_data | 98.37 | 99.73 | 99.86 | 98.78 | 100 | 81.58 | +| Qwen-7b-chat | fc_data | 99.46 | 99.86 | 100 | 99.59 | 100 | 79.25 | +| Baichuan-7b-chat | fc_data | 97.96 | 99.32 | 100 | 98.64 | 100 | 89.53 | +| Internlm-chat-7b | fc_data | 94.29 | 95.78 | 100 | 98.5 | 100 | 88.19 | +| CodeLLaMa-7b | fc_data | 98.78 | 99.73 | 100 | 99.05 | 100 | 94.7 | +| CodeLLaMa-7b-16 | fc_data | 98.1 | 99.87 | 99.73 | 98.5 | 100 | 93.14 | +| CodeFuse-7b-4k | fc_data | 98.91 | 99.87 | 99.87 | 99.18 | 100 | 89.5 | + + +
diff --git a/content/en/docs/overview/b3.codefuseDevopsModel.md b/content/en/docs/overview/b3.codefuseDevopsModel.md
new file mode 100644
index 0000000..4c2d39e
--- /dev/null
+++ b/content/en/docs/overview/b3.codefuseDevopsModel.md
@@ -0,0 +1,61 @@
+---
+title: codefuse-devops-model
+slug: codefuse-devops-model
+description: 介绍主要功能
+aliases:
+- "/docs/codefuse-devops-model"
+---
+
+
+## CodeFuse-DevOps-Model
+DevOps-Model is a large language model for the Chinese DevOps field jointly released by Ant Group and Peking University. By collecting professional data related to the DevOps domain and performing additional training and alignment, we have produced a large model that helps engineers work more efficiently across the entire development and operations lifecycle. This fills the current gap in large models for the DevOps domain; our aim is that any DevOps problem can be solved by asking DevOps-Model!
+We have now open-sourced two versions of the model, the Base model with additional training and the Chat model after alignment, in both 7B and 14B sizes, as well as the corresponding training code. We welcome everyone to collaborate and contribute!
+
+## Project Address
+GitHub Address: https://github.com/codefuse-ai/CodeFuse-DevOps-Model/tree/main
+ModelScope Address:
+
+- DevOps-Model-7B-Base: https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-7B-Base/summary
+- DevOps-Model-7B-Chat: https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-7B-Chat/summary
+- DevOps-Model-14B-Base: https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-14B-Base/summary
+- DevOps-Model-14B-Chat: https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-14B-Chat/summary
+
+## Evaluation Questions
+For model evaluation, there was initially no benchmark for the DevOps domain, so we first selected domain-related multiple-choice questions from general open-source test suites. The specific test data is as follows:
+
+| Dataset | Subject | Total Questions |
+| ---- | --------- | ----- |
+| CMMLU | Computer science | 204 |
+| CMMLU | Computer security | 171 |
+| CMMLU | Machine learning | 122 |
+| CEval | College programming | 37 |
+| CEval | Computer architecture | 21 |
+| CEval | Computer network | 19 |
+| Total | All subjects | 574 |
+
+
+## Evaluation Methods
+Since all questions are multiple choice, we take the first token produced by the model and, among the four option tokens, select the one with the highest score as the model's answer. We tested both Zero-shot and Five-shot settings.
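+A minimal sketch of this first-token scoring procedure with a Hugging Face causal LM might look as follows; the checkpoint name is just an example, and the snippet assumes the four options are the single tokens "A" through "D":
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "codefuse-ai/CodeFuse-DevOps-Model-7B-Chat"  # example checkpoint
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).eval()
+
+def choose_option(prompt: str, options=("A", "B", "C", "D")) -> str:
+    # Compare the scores of the four option tokens at the first generated position.
+    inputs = tokenizer(prompt, return_tensors="pt")
+    with torch.no_grad():
+        next_token_logits = model(**inputs).logits[0, -1]
+    option_ids = [tokenizer.convert_tokens_to_ids(o) for o in options]  # assumes each option is one token
+    scores = next_token_logits[option_ids]
+    return options[int(torch.argmax(scores))]
+
+print(choose_option("Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"))
+```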
+ + +## Evaluation Results +![](/images/devops_model/devops_eval.webp) + +The specific scores are shown in the table below: + +|Scale of Parameters |Model |Model Size |Zero-shot Score |Five-shot Score| +| - | ---- | --- | ---- | ---- | +|10+ B| DevOps-Model-14B-Base |14B |70.73 |73.00| +|10+ B|Qwen-14B-Base |14B |69.16| 71.25| +|10+ B|Baichuan2-13B-Base |13B |55.75| 61.15| +|10+ B|DevOps-Model-14B-Chat| 14B |74.04 |75.96| +|10+ B|Qwen-14B-Chat |14B |69.16| 70.03| +|10+ B|Baichuan2-13B-Chat |13B |52.79 |55.23| +|7B| DevOps-Model-7B-Base| 7B |62.72| 62.02| +|7B|Qwen-7B-Base| 7B| 55.75| 56.0| +|7B|Baichuan2-7B-Base| 7B |49.30| 55.4| +|7B|Internlm-7B-Base |7B |47.56 |52.6| +|7B|DevOps-Model-7B-Chat| 7B |62.20| 64.11| +|7B|Qwen-7B-Chat| 7B |46.00 |52.44| +|7B|Baichuan2-7B-Chat| 7B| 52.26| 54.46| +|7B|Internlm-7B-Chat |7B |52.61 |55.75| \ No newline at end of file diff --git a/content/en/docs/overview/b4.MFTCoder.md b/content/en/docs/overview/b4.MFTCoder.md new file mode 100644 index 0000000..d9f51e9 --- /dev/null +++ b/content/en/docs/overview/b4.MFTCoder.md @@ -0,0 +1,125 @@ +--- +title: "MFTCoder: High Accuracy and Efficiency Multi-task Fine-Tuning Framework" +slug: MFTCoder +description: 介绍主要功能 +aliases: +- "/docs/mftcoder" +--- + +
+ +

+ 🤗 HuggingFace + • 🤖 ModelScope + +

+ +[[中文]](/docs/mftcoder-zh) [**English**] + +
+ + + +## Contents +- [News](#News) +- [Articles](#Articles) +- [Introduction](#Introduction) +- [Requirements](#Requirements) +- [Training](#Training) +- [Models](#Models) +- [Datasets](#Datasets) +- [Star History](#Star-History) + + +## News +🔥🔥🔥 [2024/01/17] We released MFTCoder v0.3.0, mainly for MFTCoder-accelerate. It now supports new models like Mixtral(MoE), DeepSeek-coder, chatglm3. It supports FSDP as an option. It also supports Self-paced Loss as a solution for convergence balance in Multitask Fine-tuning. + +🔥🔥🔥 [2024/01/17] [CodeFuse-DeepSeek-33B](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B) has been released, achieving a pass@1 (greedy decoding) score of 78.7% on HumanEval. It lists as top-1 LLM on Bigcode Leardboard in terms of win-rate, the official result is going to be published later. + +🔥🔥🔥 [2024/01/17] [CodeFuse-Mixtral-8x7B](https://huggingface.co/codefuse-ai/CodeFuse-Mixtral-8X7B) has been released, achieving a pass@1 (greedy decoding) score of 56.1% on HumanEval. + +🔥🔥 [2023/11/07] [MFTCoder Paper](https://arxiv.org/abs/2311.02303) has been released on Arxiv, which discloses technique details of multi-task-fine-tuning. + +🔥🔥 [2023/10/20] [CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B) has been released, achieving a pass@1 (greedy decoding) score of 48.8% on HumanEval, which gains 16% absolute improvement over the base model [Qwen-14b](https://huggingface.co/Qwen/Qwen-14B) + +🔥🔥 [2023/09/27] [CodeFuse-StarCoder-15B](https://huggingface.co/codefuse-ai/CodeFuse-StarCoder-15B) has been released, achieving a pass@1 (greedy decoding) score of 54.9% on HumanEval. + +🔥🔥 [2023/09/26]We are pleased to announce the release of the [4-bit quantized version of CodeFuse-CodeLlama-34B](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits). Despite the quantization process, the model still achieves a remarkable 73.8% accuracy (greedy decoding) on the HumanEval pass@1 metric. + +🔥🔥 [2023/09/07]We released [**CodeFuse-CodeLlama-34B**](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits), which achieves the **74.4% Python Pass@1** (greedy decoding) and surpasses GPT4 (2023/03/15) and ChatGPT-3.5 on the [HumanEval Benchmarks](https://github.com/openai/human-eval). + +🔥🔥 [2023/08/26]We released MFTCoder-v0.1.0 which supports finetuning Code Llama, Llama, Llama2, StarCoder, ChatGLM2, CodeGeeX2, Qwen, and GPT-NeoX models with LoRA/QLoRA. 
+ +### HumanEval Performance +| Model | HumanEval(Pass@1) | Date | +|:----------------------------|:-----------------:|:-------:| +| **CodeFuse-DeepSeek-33B** | **78.7%** | 2024/01 | +| **CodeFuse-CodeLlama-34B** | **74.4%** | 2023/09 | +| **CodeFuse-CodeLlama-34B-4bits** | **73.8%** | 2023/09 | +| WizardCoder-Python-34B-V1.0 | 73.2% | 2023/08 | +| GPT-4(zero-shot) | 67.0% | 2023/03 | +| PanGu-Coder2 15B | 61.6% | 2023/08 | +| **CodeFuse-Mixtral-8x7B** | **56.1%** | 2024/01 | +| **CodeFuse-StarCoder-15B** | **54.9%** | 2023/08 | +| CodeLlama-34b-Python | 53.7% | 2023/08 | +| **CodeFuse-QWen-14B** | **48.8%** | 2023/10 | +| CodeLlama-34b | 48.8% | 2023/08 | +| GPT-3.5(zero-shot) | 48.1% | 2022/11 | +| OctoCoder | 46.2% | 2023/08 | +| StarCoder-15B | 33.6% | 2023/05 | +| QWen-14B | 32.3% | 2023/10 | + + +## Articles +[MFT Arxiv paper](https://arxiv.org/abs/2311.02303) + +## Introduction + +**High Accuracy and efficiency Multi-task Fine-tuning framework for Code LLMs.** + +**MFTCoder** is an open-source project of CodeFuse for accurate and efficient Multi-task Fine-tuning(MFT) on Large Language Models(LLMs), especially on Code-LLMs(large language model for code tasks). +Moreover, we open source Code LLM models and code-related datasets along with the MFTCoder framework. + +In MFTCoder, we released two codebases for finetuning Large Language Models: +- **```MFTCoder-accelerate```** is a framework with accelerate and DeepSpeed/FSDP. All tech-stacks are open-source and vibrant. We highly recommend you try this framework and make your fintuning accurate and efficient. +- ```MFTCoder-atorch``` is based on the [ATorch frameworks](https://github.com/intelligent-machine-learning/dlrover), which is a fast distributed training framework of LLM. + +The aim of this project is to foster collaboration and share advancements in large language models, particularly within the domain of code development. + +### Frameworks +![img.jpg](/images/mftcoder/img.jpg) + +### Highlights +:white_check_mark: **Multi-task**: Train models on multiple tasks while maintaining a balance between them. The models can even generalize to new, previously unseen tasks. + +:white_check_mark: **Multi-model**: It integrates state-of-the-art open-source models such as gpt-neox, llama, llama-2, baichuan, Qwen, chatglm2, and more. (These finetuned models will be released in the near future.) + +:white_check_mark: **Multi-framework**: It provides support for both Accelerate (with Deepspeed and FSDP) and ATorch + +:white_check_mark: **Efficient fine-tuning**: It supports LoRA, QLoRA as well as Full-parameters training, enabling fine-tuning of large models with minimal resources. The training speed meets the demands of almost all fine-tuning scenarios. + +The main components of this project include: +- Support for both SFT (Supervised FineTuning) and MFT (Multi-task FineTuning). The current MFTCoder achieves data balance among multiple tasks, and future releases will achieve a balance between task difficulty and convergence speed during training. +- Support for QLoRA instruction fine-tuning, LoRA fine-tuning as well as Full-parameters fine-tuning. +- Support for most mainstream open-source large models, particularly those relevant to Code-LLMs, such as DeepSeek-coder, Mistral, Mixtral, Chatglm3, Code-LLaMA, Starcoder, Codegeex2, Qwen, GPT-Neox, and more. +- Support for weight merging between the LoRA adaptor and base models, simplifying the inference process. 
+- Release of 2 high-quality code-related instruction fine-tuning datasets: [Evol-instruction-66k](https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k) and [CodeExercise-Python-27k](https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k). +- Release of many Code LLMs, please refer to organizations: [codefuse-ai on huggingface](https://huggingface.co/codefuse-ai) or [codefuse-ai on modelscope](https://modelscope.cn/organization/codefuse-ai). + + +## Contributing +Contributions are welcome! If you have any suggestions, ideas, bug reports, or new model/feature supported, please open an issue or submit a pull request. + +## Citation +If you find our work useful or helpful for your R&D works, please feel free to cite our paper as below. +``` +@article{mftcoder2023, + title={MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning}, + author={Bingchang Liu and Chaoyu Chen and Cong Liao and Zi Gong and Huan Wang and Zhichao Lei and Ming Liang and Dajun Chen and Min Shen and Hailian Zhou and Hang Yu and Jianguo Li}, + year={2023}, + journal={arXiv preprint arXiv}, + archivePrefix={arXiv}, + eprint={2311.02303} +} +``` + diff --git a/content/en/docs/overview/b5.CodeFuseModelCache.md b/content/en/docs/overview/b5.CodeFuseModelCache.md new file mode 100644 index 0000000..9e3a93e --- /dev/null +++ b/content/en/docs/overview/b5.CodeFuseModelCache.md @@ -0,0 +1,42 @@ +--- +title: CodeFuse-ModelCache +slug: CodeFuse-ModelCache +description: 介绍主要功能 +aliases: +- "/docs/codefuse-modelcache" +--- + + +

+

+

+

+ 中文 | + English +

+

+
+
+## Contents
+- [news](#news)
+- [Introduction](#Introduction)
+- [Modules](#Modules)
+- [Acknowledgements](#Acknowledgements)
+- [Contributing](#Contributing)
+
+## news
+- 🔥🔥[2023.12.10] we integrated LLM embedding frameworks such as 'llmEmb', 'ONNX', 'PaddleNLP', and 'FastText', along with the image embedding framework 'timm', to bolster embedding functionality.
+- 🔥🔥[2023.11.20] codefuse-ModelCache has integrated local storage, such as sqlite and faiss, providing users with the convenience of quickly initiating tests.
+- [2023.08.26] codefuse-ModelCache...
+
+## Introduction
+Codefuse-ModelCache is a semantic cache for large language models (LLMs). By caching pre-generated model results, it reduces response time for similar requests and improves user experience.
This project aims to optimize services by introducing a caching mechanism. It helps businesses and research institutions reduce the cost of inference deployment, improve model performance and efficiency, and provide scalable services for large models. Through open-source, we aim to share and exchange technologies related to large model semantic cache. + +## modules +![modelcache modules](/images/codefuse-modelcache/modelcache_modules_20231114.png) + +## Acknowledgements +This project has referenced the following open-source projects. We would like to express our gratitude to the projects and their developers for their contributions and research.
[GPTCache](https://github.com/zilliztech/GPTCache) + +## Contributing +ModelCache is a captivating and invaluable project, whether you are an experienced developer or a novice just starting out, your contributions to this project are warmly welcomed. Your involvement in this project, be it through raising issues, providing suggestions, writing code, or documenting and creating examples, will enhance the project's quality and make a significant contribution to the open-source community. \ No newline at end of file diff --git a/content/en/docs/overview/b6.FasterTransformer4CodeFuse.md b/content/en/docs/overview/b6.FasterTransformer4CodeFuse.md new file mode 100644 index 0000000..d26f074 --- /dev/null +++ b/content/en/docs/overview/b6.FasterTransformer4CodeFuse.md @@ -0,0 +1,10 @@ +--- +title: FasterTransformer4CodeFuse +slug: FasterTransformer4CodeFuse +description: 介绍主要功能 +aliases: +- "/docs/fastertransformer4codefuse" +--- + +## FasterTransformer4CodeFuse +FasterTransformer4CodeFuse \ No newline at end of file diff --git a/content/en/docs/overview/b7.TestAgent.md b/content/en/docs/overview/b7.TestAgent.md new file mode 100644 index 0000000..1221291 --- /dev/null +++ b/content/en/docs/overview/b7.TestAgent.md @@ -0,0 +1,63 @@ +--- +title: "Test-Agent: Your AI Test Assistant" +slug: "Test-Agent: Your AI Test Assistant" +description: 介绍主要功能 +aliases: +- "/docs/test-agent" +--- + +### Local Mac M1 Experience +![图片](https://github.com/codefuse-ai/Test-Agent/assets/103973989/8dba860f-c1bb-49d5-b9dd-a58e541562a6) + +### Moda Experience +Moda Model Access Link:[ModelScope TestGPT-7B](https://modelscope.cn/models/codefuse-ai/TestGPT-7B/summary) +![MS](https://github.com/codefuse-ai/Test-Agent/assets/103973989/0e50b258-44f9-4dc6-8e30-0a01cf62d02b) + + +## What is Test Agent? (Introduction) +**Test Agent** aims to build an "intelligent agent" in the testing domain, integrating large models with engineering technologies in the quality domain to promote the generational upgrade of quality technology. We look forward to collaborating with community members to create innovative solutions in the testing domain, establish a 24-hour online testing assistant service, and make testing as smooth as silk. +## Current Features (Features) +* **Model**: This release open-sources the TestGPT-7B model for the testing domain. The model is based on CodeLlama-7B and has been fine-tuned for related downstream tasks: + * **Multilingual Test Case Generation (Java/Python/Javascript)**: This has always been an area of great interest to both academia and industry, with new products and tools like EvoSuite, Randoop, SmartUnit, etc., constantly being incubated. However, traditional test case generation has pain points that are difficult to address. Test case generation based on large models is superior to traditional tools in terms of readability, completeness of test scenarios, and multilingual support. This update focuses on multilingual test case generation, initially including Java, Python, and Javascript, and will gradually introduce Go, C++, and other languages in future releases. + * **Test Case Assert Completion**: Analyzing the current state of test cases, we found that a certain proportion of existing test cases in the code repositories do not contain Asserts. Test cases without Asserts may pass during regression but fail to detect issues. Therefore, we expanded the scenario of automatic completion of test case Asserts. 
With this model capability, combined with the right engineering support, it's possible to perform batch automatic completion for the entire test case repository, intelligently raising the quality level of the project. +* **Engineering Framework**: Local model quick release and experience engineering framework + - ChatBot page + - Quick model launch + - Private deployment, localized GPT large model interactions with your data and environment, no risk of data leakage, 100% safe. + +**We will continue to iterate on the model and engineering capabilities:** +- Continuously adding more exciting test domain application scenarios, such as domain knowledge Q&A, test scenario analysis, etc. +- Supporting the open copilot engineering framework focused on testing scenarios, such as intelligent embedding of testing domain knowledge, a common tool API system, intelligent testing Agent, and more, so stay tuned! +- Expanding from a 7B base to 13B and 34B models gradually. Stay tuned! + +## The Most Powerful 7B Test Domain Large Model (Model) +Currently, within TestAgent, we default to using the TestGPT-7B model. Compared to existing open-source models, the TestGPT-7B model leads the industry in case execution pass rate (pass@1) and test scenario coverage (average number of test scenarios). +The core capability evaluation results of the TestGPT-7B model are as follows: + +Multilingual Test Case Generation For the three supported languages of the model: Java, Python, and Javascript, the Pass@1 evaluation results are as follows: + +| Model | Java pass@1 | Java Average number of test scenarios | Python pass@1 | Python Average number of test scenarios | Javascript pass@1 | Javascript Average number of test scenarios | +| --- | --- | --- | --- | --- | --- | --- | +| TestGPT-7B | 48.6% | 4.37 | 35.67% | 3.56 | 36% | 2.76 | +| CodeLlama-13B-Instruct | 40.54% | 1.08 | 30.57% | 1.65 | 31.7% | 3.13 | +| Qwen-14B-Chat | 10.81% | 2.78 | 15.9% | 1.32 | 9.15% | 4.22 | +| Baichuan2-13B-Chat | 13.5% | 2.24 | 12.7% | 2.12 | 6.1% | 3.31 | + + +- Test Case Assert Completion +Currently, the model supports Assert completion for Java cases, and the Pass@1 evaluation + +| Model | pass@1 | Percentage of strong validation | +| --- | --- | --- | +| Codefuse-TestGPT-7B | 71.1% | 100% | + + +## Engineering Architecture +![JG](https://github.com/codefuse-ai/Test-Agent/assets/103973989/1b61beff-df59-4ab3-843c-266413c8dbc4) + +The clarion call for large models has been sounded, and large models in the testing domain are continuously evolving. With the rich world knowledge accumulated during the pre-training process, they have demonstrated extraordinary reasoning and decision-making abilities in complex interactive environments. + +Despite significant achievements of the foundational models in the testing domain, there are still some limitations. Testing tasks in specific domains often require specialized tools or domain knowledge. For instance, foundational models can complete tasks such as single-instance test code generation and test text generation through pre-trained knowledge, but when dealing with complex integrated test case generation, domain-specific case creation, and interactions with test process pipelines, more specialized tools and domain knowledge are necessary. Therefore, integrating specialized tools with foundational models can fully harness their respective strengths. Specialized tools can address insufficiencies in model timeliness, enhance professional knowledge, and improve interpretability and robustness. 
On the other hand, foundational models possess human-like reasoning and planning abilities, capable of understanding complex data and scenarios, and interacting with the real world. + +Building upon the open model engineering deployment and ChatBot foundation in this release, we will continue to invest deeply in the open-source testing domain. Collaborating with community developers who share similar interests, we aim to create the most advanced engineering system for tools in the testing domain, an intelligent testing assistant, and open-source testing engineering! + diff --git a/content/en/docs/overview/b8.CodeFuseQuery.md b/content/en/docs/overview/b8.CodeFuseQuery.md new file mode 100644 index 0000000..875dc25 --- /dev/null +++ b/content/en/docs/overview/b8.CodeFuseQuery.md @@ -0,0 +1,25 @@ +--- +title: CodeFuse-Query +slug: CodeFuse-Query +description: 介绍主要功能 +aliases: +- "/docs/codefuse-query" +--- + +## CodeFuse-Query +With the increasing popularity of large-scale software development, the demand for scalable and adaptable static code analysis techniques is growing. Traditional static analysis tools such as Clang Static Analyzer (CSA) or PMD have shown good results in checking programming rules or style issues. However, these tools are often designed for specific objectives and are unable to meet the diverse and changing needs of modern software development environments. These needs may relate to Quality of Service (QoS), various programming languages, different algorithmic requirements, and various performance needs. For example, a security team might need sophisticated algorithms like context-sensitive taint analysis to review smaller codebases, while project managers might need a lighter algorithm, such as one that calculates cyclomatic complexity, to measure developer productivity on larger codebases. + +These diversified needs, coupled with the common computational resource constraints in large organizations, pose a significant challenge. Traditional tools, with their problem-specific computation methods, often fail to scale in such environments. This is why we introduced CodeQuery, a centralized data platform specifically designed for large-scale static analysis. +In implementing CodeQuery, we treat source code and analysis results as data, and the execution process as big data processing, a significant departure from traditional tool-centric approaches. We leverage common systems in large organizations, such as data warehouses, data computation facilities like MaxCompute and Hive, OSS object storage, and flexible computing resources like Kubernetes, allowing CodeQuery to integrate seamlessly into these systems. This approach makes CodeQuery highly maintainable and scalable, capable of supporting diverse needs and effectively addressing changing demands. Furthermore, CodeQuery's open architecture encourages interoperability between various internal systems, facilitating seamless interaction and data exchange. This level of integration and interaction not only increases the degree of automation within the organization but also improves efficiency and reduces the likelihood of manual errors. By breaking down information silos and fostering a more interconnected, automated environment, CodeQuery significantly enhances the overall productivity and efficiency of the software development process. +Moreover, CodeQuery's data-centric approach offers unique advantages when addressing domain-specific challenges in static source code analysis. 
For instance, source code is typically a highly structured and interconnected dataset, with strong informational and relational ties to other code and configuration files. By treating code as data, CodeQuery can adeptly handle these issues, making it especially suitable for use in large organizations where codebases evolve continuously but incrementally, with most code undergoing minor changes daily while remaining stable. CodeQuery also supports use cases like code-data based Business Intelligence (BI), generating reports and dashboards to aid in monitoring and decision-making processes. Additionally, CodeQuery plays an important role in analyzing training data for large language models (LLMs), providing deep insights to enhance the overall effectiveness of these models. + +In the current field of static analysis, CodeQuery introduces a new paradigm. It not only meets the needs of analyzing large, complex codebases but is also adaptable to the ever-changing and diversified scenarios of static analysis. CodeQuery's data-centric approach gives it a unique advantage in dealing with code analysis issues in big data environments. Designed to address static analysis problems in large-scale software development settings, it views both source code and analysis results as data, allowing it to integrate flexibly into various systems within large organizations. This approach not only enables efficient handling of large codebases but can also accommodate various complex analysis needs, thereby making static analysis work more effective and accurate. + +The characteristics and advantages of CodeQuery can be summarized as follows: + +- **Highly Scalable**: CodeQuery can handle large codebases and adapt to different analysis needs. This high level of scalability makes CodeQuery particularly valuable in large organizations. +- **Data-Centric**: By treating source code and analysis results as data, CodeQuery's data-centric approach gives it a distinct edge in addressing code analysis problems in big data environments. +- **Highly Integrated**: CodeQuery can integrate seamlessly into various systems within large organizations, including data warehouses, data computation facilities, object storage, and flexible computing resources. This high level of integration makes the use of CodeQuery in large organizations more convenient and efficient. +- **Supports Diverse Needs**: CodeQuery can process large codebases and accommodate various complex analysis needs, including QoS analysis, cross-language analysis, algorithmic needs, and performance requirements. + +CodeQuery is a powerful static code analysis platform, suitable for large-scale, complex codebase analysis scenarios. Its data-centric approach and high scalability give it a unique advantage in the modern software development environment. As static code analysis technology continues to evolve, CodeQuery is expected to play an increasingly important role in this field. \ No newline at end of file diff --git a/content/en/docs/overview/b9.mftvlm.md b/content/en/docs/overview/b9.mftvlm.md new file mode 100644 index 0000000..ef95829 --- /dev/null +++ b/content/en/docs/overview/b9.mftvlm.md @@ -0,0 +1,28 @@ +--- +title: CodeFuse-MFT-VLM +slug: CodeFuse-MFT-VLM +description: 介绍主要功能 +aliases: +- "/docs/codefuse-mft-vlm" +--- + +## CodeFuse-VLM +CodeFuse-VLM is a Multimodal LLM(MLLM) framework that provides users with multiple vision encoders, multimodal alignment adapters, and LLMs. 
Through CodeFuse-VLM framework, users are able to customize their own MLLM model to adapt their own tasks. +As more and more models are published on Huggingface community, there will be more open-source vision encoders and LLMs. Each of these models has their own specialties, e.g. Code-LLama is good at code-related tasks but has poor performance for Chinese tasks. Therefore, we built CodeFuse-VLM framework to support multiple vision encoders, multimodal alignment adapters, and LLMs to adapt different types of tasks. +![img.jpg](/images/mft-vlm/CodeFuse-VLM-arch.png) + +Under CodeFuse-VLM framework, we use cross attention multimodal adapter, Qwen-14B LLM, and Qwen-VL's vision encoder to train CodeFuse-VLM-14B model. On multiple benchmarks, our CodeFuse-VLM-14B shows superior performances over Qwen-VL and LLAVA-1.5. +![img.jpg](/images/mft-vlm/CodeFuse-VLM-14B-performance.png) + +Here is the table for different MLLM model's performance on benchmarks +Model | MMBench | MMBench-CN | VqaV2 | GQA | TextVQA | Vizwiz +| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | +LLAVA-1.5 | 67.7 | 63.6 | 80.0 | 63.3 | 61.3 | 53.6 +Qwen-VL | 60.6 | 56.7 | 78.2 | 57.5 | 63.8 | 38.9 +CodeFuse-VLM-14B | 75.7 | 69.8 | 79.3 | 59.4 | 63.9 | 45.3 + +Our model achieved high ranking on MMBenchmark: https://mmbench.opencompass.org.cn/leaderboard + +Here's our model's demo video + +https://private-user-images.githubusercontent.com/22836551/300386230-8e64f615-ac0e-447e-9695-c96b254d484f.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDY1MjExODksIm5iZiI6MTcwNjUyMDg4OSwicGF0aCI6Ii8yMjgzNjU1MS8zMDAzODYyMzAtOGU2NGY2MTUtYWMwZS00NDdlLTk2OTUtYzk2YjI1NGQ0ODRmLm1wND9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI5VDA5MzQ0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQ5NzNjM2U1ZWU4NDU0Yzc5NmE4ZTM1NzY2ZjU4YjRjY2ZhNjMzODk0ZDgzMDg4N2FjYjZhYTllM2E3NTAyMWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.pr-ad7rKYBgk26DTItj2q2q9I5dRWnBNHbV9M7GSVCo diff --git a/content/en/docs/testagent/1_quickstart.md b/content/en/docs/testagent/1_quickstart.md new file mode 100644 index 0000000..66f794b --- /dev/null +++ b/content/en/docs/testagent/1_quickstart.md @@ -0,0 +1,65 @@ +--- +title: "QuickStart" +slug: "QuickStart" +description: 介绍主要功能 +url: "/docs/test-agent-quickstart" +aliases: +- "/docs/test-agent-quickstart" +--- + + +## QuickStart +### Prerequisites + +#### Model Download +You can get detailed information about the model and download the model files from [modelscope](https://modelscope.cn/models/codefuse-ai/TestGPT-7B) or [huggingface](https://huggingface.co/codefuse-ai/TestGPT-7B). +Please note: +需要注意的是: +If you download the model through modelscope, refer to the download instructions: [Download Instructions]((https://www.modelscope.cn/docs/%E6%A8%A1%E5%9E%8B%E7%9A%84%E4%B8%8B%E8%BD%BD#%E4%BD%BF%E7%94%A8Git%E4%B8%8B%E8%BD%BD%E6%A8%A1%E5%9E%8B)); +If you download the model through huggingface, please make sure you have proper access to huggingface. 
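+For example, one way to fetch the weights locally is via huggingface_hub; this is illustrative only, and the target directory is an assumption chosen to match the worker command used later:
+```python
+# Download the TestGPT-7B weights into a local folder (requires access to huggingface.co).
+from huggingface_hub import snapshot_download
+
+snapshot_download(repo_id="codefuse-ai/TestGPT-7B", local_dir="models/TestGPT-7B")
+```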
#### Environment Installation
- python>=3.8
- transformers==4.33.2

```plain
git clone https://github.com/codefuse-ai/Test-Agent
cd Test-Agent
pip install -r requirements.txt
```

Before starting to run the TestGPT-7B model, please ensure that your execution environment has about 14GB of VRAM.


### Starting the Service

The project provides the ability to quickly set up a web UI for a more intuitive display of model interactions and effects. We can use a few simple commands to bring up the front-end page and call the model capabilities in real time. In the project directory, start the following services in order:

1. **Start controller**
![controller](https://github.com/codefuse-ai/Test-Agent/assets/103973989/e68ce187-c9f1-4ce8-9d59-ff9d8348d0ac)
python3 -m chat.server.controller

2. **Start model worker**
![work](https://github.com/codefuse-ai/Test-Agent/assets/103973989/073e4e79-4005-4c98-87f7-0eaa0b2b1e22)
python3 -m chat.server.model_worker --model-path models/TestGPT-7B --device mps

(models/TestGPT-7B is the actual model file path)

For the launch method, you can choose from several configuration options as needed:
- --device mps to enable GPU acceleration on Mac computers (Apple Silicon or AMD GPUs);
- --device xpu to enable acceleration on Intel XPU (Intel Data Center and Arc A-Series GPUs):
  - Install [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html)
  - Set the OneAPI environment variable: source /opt/intel/oneapi/setvars.sh
- --device npu to enable acceleration on Huawei AI processors:
  - Install [Ascend PyTorch Adapter](https://github.com/Ascend/pytorch)
  - Set the CANN environment variable: source /usr/local/Ascend/ascend-toolkit/set_env.sh
- --device cpu to run on CPU only, no GPU needed;
- --num-gpus 2 to specify the number of GPUs to run concurrently.


3. **Start the web service**
python3 -m chat.server.gradio_testgpt
![web](https://github.com/codefuse-ai/Test-Agent/assets/103973989/340dae35-573b-4046-a3e8-e87a91453601)
Once the service is ready, you can open the local web service address http://0.0.0.0:7860 and see the complete front-end page. At the bottom of the page, there are two examples: 【Single-test Generation】 and 【Assert Completion】. After clicking a button, a sample text is automatically generated in the input box; clicking the Send button then triggers the model to run. After waiting patiently for a while (running time depends on the performance of your machine), you can see the complete answer.
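To recap before the demo below, the three startup commands above can be collected into one sketch; `--device mps` and `models/TestGPT-7B` are just the example values from the steps above, so substitute your own device flag and model path:

```bash
# Sketch only: run from the Test-Agent project directory.
# Each service normally runs in its own terminal; here the first two are backgrounded.
python3 -m chat.server.controller &
python3 -m chat.server.model_worker --model-path models/TestGPT-7B --device mps &
python3 -m chat.server.gradio_testgpt   # web UI becomes available at http://0.0.0.0:7860
```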
+![demo](https://github.com/codefuse-ai/Test-Agent/assets/103973989/fd24274c-729b-4ce7-8763-a083b39300fb) + diff --git a/content/en/muagent/connector/connector_agent.md b/content/en/muagent/connector/connector_agent.md new file mode 100644 index 0000000..c0e511b --- /dev/null +++ b/content/en/muagent/connector/connector_agent.md @@ -0,0 +1,150 @@ +--- +title: Connector Agent +slug: Connector Agent +url: "muagent/connector-agent" +aliases: +- "/muagent/connector-agent" +--- + + +## Quickly Build an Agent +- First, add an OpenAI configuration, or a model with a similar interface to OpenAI (launched through fastchat) + + +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" + +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + +- Then Set LLM Configuration and Vector Model Configuration +Configure related LLM and Embedding Model +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### Agent Configuration +- Define two react agents for actual task execution + +``` +# Here, predefined prompts are used, but you can also refer to the above prompts to complete the writing +from muagent.connector.configs.prompts import REACT_CODE_PROMPT, REACT_TOOL_PROMPT + +# A tool agent based on react is defined +tool_role = Role(role_type="assistant", role_name="tool_reacter", prompt=REACT_TOOL_PROMPT) +tool_react_agent = ReactAgent( + role=tool_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + + +# A code agent based on react is defined +code_role = Role(role_type="assistant", role_name="code_reacter", prompt=REACT_CODE_PROMPT) +code_react_agent = ReactAgent( + role=code_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) +``` + +- Define a groupAgent for agent selection +``` +prompt = """#### Agent Profile +Your goal is to respond according to the information in the Context Data with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions, and tool list. +ATTENTION: respond carefully following the "Response Output Format". +#### Response Output Format +**Thoughts:** think step by step about why you selected one role +**Role:** Select the role from the agent names. 
+""" + +# A groupAgent is defined +role = Role(role_type="assistant", role_name="qaer", prompt=prompt) +base_agent = SelectorAgent( + role=role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, + group_agents=[tool_react_agent, code_react_agent] +) +``` + +### Start Actual Q&A +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) +question = "Confirm if employee_data.csv exists locally, and check its columns and data types; then draw a bar chart" + +query = Message( + user_name="test", role_type="user", role_name="user", input_query=question, + tools=tools, +) +# base_agent.pre_print(query) +output_message = base_agent.step(query) +print(output_message.input_query) +print(output_message.role_content) +``` + + +## Agent Configs +``` +# Configuration structure is in this directory +from muagent.connector.schema import Role +``` + +### Agent Config +|Config Key Name| Type| Description| +| ------------------ | ---------- | ---------- | +|role| Role |Role description| +|focus_agents |List[String] |Logic of MetaGPT, focusing on the messages generated by which agents, optional values are: role_name| +|focus_message_keys |List[String]| Additional logic, focusing on specific key information in the message, optional values are: agent's output_keys| +|chat_turn |int |Valid only for ReactAgent| +|llm_config |LLMConfig |Large language model configuration| +|embed_config |EmbedConfig |Vector model configuration| +|sandbox_server |Dict |Sandbox environment, i.e., notebook startup configuration| +|jupyter_work_path |str |Working directory of the sandbox environment| +|kb_root_path |str |Storage path for memory| +|log_verbose |str |Log printing level of agent prompt & predict| + +### Role + +| Config Key Name | Type | Description | +|------------------|------|--------------------| +| role_type | str | Role type, Enum: system, user, assistant, function, observation, summary | +| role_name | str | Role name | +| role_desc | str | Role description | +| agent_type | str | Agent type | +| role_prompt | str | Role instruction | +| prompt | str | Complete prompt structure | \ No newline at end of file diff --git a/content/en/muagent/connector/connector_chain.md b/content/en/muagent/connector/connector_chain.md new file mode 100644 index 0000000..be0ab8e --- /dev/null +++ b/content/en/muagent/connector/connector_chain.md @@ -0,0 +1,137 @@ +--- +title: Connector Chain +slug: Connector Chain +url: "muagent/connector-chain" +aliases: +- "/muagent/connector-chain" +--- + + +## Quickly Build an Agent +### First, add an OpenAI configuration, or a model with a similar interface to OpenAI (launched through fastchat) + + +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + +### Then Set LLM Configuration and Vector Model Configuration +Configure related LLM and Embedding Model +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain 
+from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### Agent Configuration +- Define two react agents for actual task execution + +``` +# Here, predefined prompts are used, but you can also refer to the above prompts to complete the writing +from muagent.connector.configs.prompts import REACT_CODE_PROMPT, REACT_TOOL_PROMPT + +# A tool agent based on react is defined +tool_role = Role(role_type="assistant", role_name="tool_reacter", prompt=REACT_TOOL_PROMPT) +tool_react_agent = ReactAgent( + role=tool_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + +# A code agent based on react is defined +code_role = Role(role_type="assistant", role_name="code_reacter", prompt=REACT_CODE_PROMPT) +code_react_agent = ReactAgent( + role=code_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) +``` + +- Define a groupAgent for agent selection +``` +prompt = """#### Agent Profile +Your goal is to respond according to the information in the Context Data with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions, and tool list. +ATTENTION: respond carefully following the "Response Output Format". +#### Response Output Format +**Thoughts:** think step by step about why you selected one role +**Role:** Select the role from the agent names. 
+""" +# A groupAgent is defined +role = Role(role_type="assistant", role_name="qaer", prompt=prompt) +base_agent = SelectorAgent( + role=role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, + group_agents=[tool_react_agent, code_react_agent] +) +``` + +### Chain Config +``` +chain_config = ChainConfig(chain_name="group_chain", agents=[base_agent.role.role_name], chat_turn=1) +base_chain = BaseChain( + chainConfig=chain_config, agents=[base_agent], + llm_config=llm_config, embed_config=embed_config, +) +``` + +### Start Actual Q&A +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) +question = "Confirm if employee_data.csv exists locally, and check its columns and data types; then draw a bar chart" +query = Message( + user_name="test", role_type="user", role_name="user", input_query=question, + tools=tools, +) + +# base_chain.pre_print(query) +output_message, output_memory = base_chain.step(query) +print(output_message.input_query) +print(output_message.role_content) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +## Chain Parameter Configuration +|Config Key Name| Type |Description| +| ------------------ | ---------- | ---------- | +|agents| List[BaseAgent] | +|llm_config |LLMConfig |Large Language Model Configuration| +|embed_config |EmbedConfig |Vector Model Configuration| +|sandbox_server |Dict |Sandbox environment or notebook startup configuration| +|jupyter_work_path |str |Working directory for the sandbox environment| +|kb_root_path |str |Storage path for memory| +|log_verbose |str |Log printing level for agent prompts & predictions| \ No newline at end of file diff --git a/content/en/muagent/connector/connector_memory.md b/content/en/muagent/connector/connector_memory.md new file mode 100644 index 0000000..d3d5ec4 --- /dev/null +++ b/content/en/muagent/connector/connector_memory.md @@ -0,0 +1,82 @@ +--- +title: Connector Memory +slug: Connector Memory +url: "muagent/connector-memory" +aliases: +- "/muagent/connector-memory" +--- + +## Memory Manager +Primarily used for managing chat history, not yet completed +- Read and write chat history in the database, including user input, llm output, doc retrieval, code retrieval, search retrieval. +- Summarize key information from the chat history into a summary context, serving as a prompt context. +- Provide a search function to retrieve information related to the question from chat history or summary context, aiding in Q&A. 
+ +## Usage Example +### Create memory manager instance +``` +import os +import openai +from coagent.base_configs.env_config import KB_ROOT_PATH +from coagent.connector.memory_manager import BaseMemoryManager, LocalMemoryManager +from coagent.llm_models.llm_config import EmbedConfig, LLMConfig +from coagent.connector.schema import Message +os.environ["API_BASE_URL"] = OPENAI_API_BASE +os.environ["OPENAI_API_KEY"] = "sk-xx" +openai.api_key = "sk-xxx" +# os.environ["OPENAI_PROXY"] = "socks5h://127.0.0.1:13659" +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" + +# LLM and Embedding Model configurations +llm_config = LLMConfig( + model_name="gpt-3.5-turbo", model_device="cpu",api_key=os.environ["OPENAI_API_KEY"], + api_base_url=os.environ["API_BASE_URL"], temperature=0.3 + ) +embed_config = EmbedConfig( + embed_engine="model", embed_model="text2vec-base-chinese", + embed_model_path="D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/embedding_models/text2vec-base-chinese" + ) +# +phase_name = "test" +memory_manager = LocalMemoryManager( + unique_name=phase_name, + do_init=True, + kb_root_path=KB_ROOT_PATH, + embed_config=embed_config, + llm_config=llm_config + ) +``` + +### Support for Message management +``` +message1 = Message( + role_name="test1", role_type="user", input_query="hello", origin_query="hello", + parsed_output_list=[{"input": "hello"}] +) +text = "hi! how can I help you?" +message2 = Message( + role_name="test2", role_type="assistant", input_query=text, origin_query=text, + role_content=text, step_content=text, parsed_output_list=[{"answer": text}] +) +text = "they say hello and hi to each other" +message3 = Message( + role_name="test3", role_type="summary", + role_content=text, step_content=text, + parsed_output_list=[{"summary": text}] + ) +``` + +### Support for memory retrieval +``` +# embedding retrieval test +text = "say hi, i want some help" +print(memory_manager.router_retrieval(text=text, datetime="2024-01-08 20:22:00", n=4, top_k=5, retrieval_type= "datetime")) +print(memory_manager.router_retrieval(text=text, datetime="2024-01-08 20:22:00", n=4, top_k=5, retrieval_type= "embedding")) +print(memory_manager.router_retrieval(text=text, datetime="2024-01-08 20:22:00", n=4, top_k=5, retrieval_type= "text")) +``` + +### Support for memory summarization +``` +# recursive_summary test +print(memory_manager.recursive_summary(local_memory_manager.recall_memory.messages, split_n=1)) +``` \ No newline at end of file diff --git a/content/en/muagent/connector/connector_phase.md b/content/en/muagent/connector/connector_phase.md new file mode 100644 index 0000000..8530d9d --- /dev/null +++ b/content/en/muagent/connector/connector_phase.md @@ -0,0 +1,132 @@ +--- +title: Connector Phase +slug: Connector Phase +url: "muagent/connector-phase" +aliases: +- "/muagent/connector-phase" +--- + +## Quickly Build an Agent Phase +- First, add OpenAI configuration, which can be models with similar interfaces to OpenAI (triggered via fastchat). +``` +import os, sys +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` +### Then Set LLM Configuration and Vector Model Configuration +- Configure related LLM and Embedding Model. 
+``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` +### Agent Configuration +- Define two react agents for actual task execution. +``` +# Predefined prompts are used here; you can also refer to the above-mentioned prompts to write your own. +from muagent.connector.configs.prompts import REACT_CODE_PROMPT, REACT_TOOL_PROMPT +# Defined a tool agent based on react +tool_role = Role(role_type="assistant", role_name="tool_reacter", prompt=REACT_TOOL_PROMPT) +tool_react_agent = ReactAgent( + role=tool_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) +# Defined a code agent based on react +code_role = Role(role_type="assistant", role_name="code_reacter", prompt=REACT_CODE_PROMPT) +code_react_agent = ReactAgent( + role=code_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) +``` +- Define a GroupAgent for agent selection. +``` +prompt = """#### Agent Profile +Your goal is to respond according to the information provided by the Context Data's with the role that will best facilitate a solution, taking into account all relevant context data (Context). +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions, and tool list. +ATTENTION: respond carefully, referenced to the "Response Output Format" standard. +#### Response Output Format +**Thoughts:** think the reason step by step about why you select one role +**Role:** Select the role from the agent names. +""" +# Defined a GroupAgent +role = Role(role_type="assistant", role_name="qaer", prompt=prompt) +base_agent = SelectorAgent( + role=role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, + group_agents=[tool_react_agent, code_react_agent] +) +``` +### Chain Configuration +``` +chain_config = ChainConfig(chain_name="group_chain", agents=[base_agent.role.role_name], chat_turn=1) +base_chain = BaseChain( + chainConfig=chain_config, agents=[base_agent], + llm_config=llm_config, embed_config=embed_config, +) +``` +### Phase Configuration +``` +base_phase = BasePhase( + phase_name="group_phase", chains=[base_chain], + embed_config=embed_config, llm_config=llm_config +) +``` +### Start Real Q&A +- Start execution. +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) +question = "Confirm whether employee_data.csv exists locally, and review its columns and data types; then plot a bar chart." 
+query = Message( + user_name="test", role_type="user", role_name="user", input_query=question, + tools=tools, +) + + +# base_phase.pre_print(query) +output_message, output_memory = base_phase.step(query) +print(output_message.input_query) +print(output_message.role_content) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + +## Phase Parameter Configuration +| Config Key Name | Type | Description | +| ------------------ | ---------- | ---------- | +| phase_name | String | Scenario name | +| chains | List[Chain] | List of chains to be executed in order | +| llm_config | LLMConfig | Large Language Model configuration | +| embed_config | EmbedConfig | Vector model configuration | +| sandbox_server | Dict | Sandbox environment, i.e., notebook startup configuration | +| jupyter_work_path | str | Working directory in the sandbox environment | +| kb_root_path | str | Storage path for memory | +| log_verbose | str | Log print level for agent prompts & predictions | diff --git a/content/en/muagent/connector/connector_prompt.md b/content/en/muagent/connector/connector_prompt.md new file mode 100644 index 0000000..6db0b58 --- /dev/null +++ b/content/en/muagent/connector/connector_prompt.md @@ -0,0 +1,222 @@ +--- +title: Connector Prompt +slug: Connector Prompt +url: "muagent/connector-prompt" +aliases: +- "/muagent/connector-prompt" +--- + +## Prompt Manager +Managing prompt creation in multi-agent linkages +- Quick Configuration: Utilizing preset processing functions, users can easily configure by simply defining the inputs and outputs of the agents, enabling fast assembly and configuration of multi-agent prompts. +- Customization Support: Allows users to customize the internal processing logic of each module within the prompt to achieve personalized implementation of the agent prompt. + +### Preset Template Structure for Prompts +- Agent Profile: This section involves the basic description of the agent, including but not limited to the type of agent, its functions, and command set. Users can set the basic attributes of the agent here to ensure its behavior aligns with expectations. +- Context: Contextual Information, provided as a reference for the agent, aiding in better decision-making. + - Tool Information: This part provides the agent with a list of available tools, from which the agent can choose appropriate ones to assist in task execution based on current scenario requirements. + - Reference Documents: This may include documents or code snippets for the agent to refer to when handling requests, to facilitate the use of relevant information. + - Session Records: In multi-round conversations, this section records previous dialogue content to ensure continuity within the context. +- Response Output Format: Here the user can set the output format of the agent to ensure that the generated responses meet specific formatting requirements, including structure, grammar, etc. + +## Standard Structure of Prompt +In the entire structure of a Prompt, we need to define three parts: +- Agent Profile +- Input Format +- Response Output Format + +``` +#### Agent Profile +Agent Description ... + +#### Input Format +**Origin Query:** the initial question or objective that the user wanted to achieve +**Context:** the current status and history of the tasks to determine if Origin Query has been achieved. + +#### Response Output Format +**Action Status:** finished or continued +If it's 'finished', the context can answer the origin query. 
+If it's 'continued', the context can't answer the origin query. +**REASON:** Justify the decision of choosing 'finished' or 'continued' by evaluating the progress step by step. +Consider all relevant information. If the tasks were aimed at an ongoing process, assess whether it has reached a satisfactory conclusion. +``` + +Here, we have integrated some of the common operations of the `Input Format`, with certain fields and operational procedures built in to form a standardized configurable operation. +In the future, we will also make parts of the Agent Profile and Response Output Format configurable to reduce the difficulty of writing Prompts. + + +### Customizing Agents +- Implement construction with custom fields according to actual needs +``` +class CodeGenDocer(BaseAgent): + def start_action_step(self, message: Message) -> Message: + '''do action before agent predict ''' + # Get code snippets and node information based on the question + action_json = CodeRetrievalSingle.run(message.code_engine_name, message.input_query, llm_config=self.llm_config, + embed_config=self.embed_config, local_graph_path=message.local_graph_path, use_nh=message.use_nh,search_type="tag") + current_vertex = action_json['vertex'] + message.customed_kargs["Code Snippet"] = action_json["code"] + message.customed_kargs['Current_Vertex'] = current_vertex + return message + +``` + + + +### pre_print Function +After building phases, chains, or agents, we can confirm agent linkages using the pre-print function of methods, allowing for debugging in advance to avoid discovering issues only after execution. +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS + + +import os, sys +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" + +llm_config = LLMConfig( + model_name="gpt-4", api_key=api_key, api_base_url=api_base_url, temperature=0.3 +) +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) + +phase_name = "baseGroupPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, +) +phase.pre_print(query) +``` + + +Here, pre-defined agents are used,,custom case can be seen [customed_example](/muagent/customed-examples) +
+ + + +## check the pre-print prompt +``` +########################## +<<<>>> +########################## + +### Agent Profile +Your goal is to response according the Context Data's information with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions and tool list. +ATTENTION: response carefully referenced "Response Output Format" in format. + +### Tool Information + +### Agent Infomation + Please ensure your selection is one of the listed roles. Available roles for selection: + "role name: tool_react +role description: Agent Profile,When interacting with users, your role is to respond in a helpful and accurate manner using the tools available. Follow the steps below to ensure efficient and effective use of the tools.,Please note that all the tools you can use are listed below. You can only choose from these tools for use. ,If there are no suitable tools, please do not invent any tools. Just let the user know that you do not have suitable tools to use.,ATTENTION: The Action Status field ensures that the tools or code mentioned in the Action can be parsed smoothly. Please make sure not to omit the Action Status field when replying.," +"role name: code_react +role description: Agent Profile,When users need help with coding, your role is to provide precise and effective guidance.,Write the code step by step, showing only the part necessary to solve the current problem. Each reply should contain only the code required for the current step.," + Please ensure select the Role from agent names, such as tool_react, code_react + +### Context Data + +#### Reference Documents + +#### Session Records + +#### Current Plan + +### Response Output Format +**Thoughts:** think the reason step by step about why you selecte one role +**Role:** Select the role from agent names. + +### Begin!!! + +################### +<<<>>> +################### + +**Thoughts:** +**Role:** + + +########################### +<<<>>> +########################### +### Agent Profile +When interacting with users, your role is to respond in a helpful and accurate manner using the tools available. Follow the steps below to ensure efficient and effective use of the tools. +Please note that all the tools you can use are listed below. You can only choose from these tools for use. +If there are no suitable tools, please do not invent any tools. Just let the user know that you do not have suitable tools to use. +ATTENTION: The Action Status field ensures that the tools or code mentioned in the Action can be parsed smoothly. Please make sure not to omit the Action Status field when replying. + +### Tool Information + +### Context Data + +#### Reference Documents + +#### Session Records + +#### Task Records + +### Response Output Format +**Thoughts:** According the previous observations, plan the approach for using the tool effectively. +... + +### Begin!!! + +################### +<<<>>> +################### +**Thoughts:** +**Action Status:** +**Action:** +**Observation:** +**Thoughts:** +**Action Status:** +**Action:** + +########################### +<<<>>> +########################### +### Agent Profile +When users need help with coding, your role is to provide precise and effective guidance. +Write the code step by step, showing only the part necessary to solve the current problem. Each reply should contain only the code required for the current step. 
+ +### Context Data + +#### Reference Documents + +#### Session Records + +### Response Output Format + +**Thoughts:** According the previous context, solve the problem step by step, only displaying the thought process necessary for the current step of solving the problem, +outline the plan for executing this step. + +**Action Status:** Set to 'stopped' or 'code_executing'. +If it's 'stopped', the action is to provide the final answer to the session records and executed steps. +If it's 'code_executing', the action is to write the code. +... + +### Begin!!! + +################### +<<<>>> +################### + +**Thoughts:** +**Action Status:** +**Action:** +**Observation:** +**Thoughts:** +**Action Status:** +**Action:** + +``` diff --git a/content/en/muagent/connector/customed_examples.md b/content/en/muagent/connector/customed_examples.md new file mode 100644 index 0000000..e4ebe1e --- /dev/null +++ b/content/en/muagent/connector/customed_examples.md @@ -0,0 +1,290 @@ +--- +title: Customed Examples +slug: Customed Examples +url: "muagent/custom-examples" +aliases: +- "/muagent/custom-examples" +--- + + + +## How to Create Your Personalized Agent Phase Scenario +Below we will use a code repository to demonstrate the automatic generation of API documentation from code, detailing how to customize the construction of an agent phase. + +### Design Your Prompt Structure + +- codeGenDocGroup_PROMPT, create group Agent Prompt +``` +# update new agent configs +codeGenDocGroup_PROMPT = """#### Agent Profile + +Your goal is to response according the Context Data's information with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. + +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions and tool list. + +#### Input Format + +#### Response Output Format + +**Code Path:** Extract the paths for the class/method/function that need to be addressed from the context + +**Role:** Select the role from agent names +""" +``` + +- classGenDoc_PROMPT, create class code to api doc Prompt +``` +classGenDoc_PROMPT = """#### Agent Profile +As an advanced code documentation generator, you are proficient in translating class definitions into comprehensive documentation with a focus on instantiation parameters. +Your specific task is to parse the given code snippet of a class, extract information regarding its instantiation parameters. + +#### Input Format + +**Current_Vertex:** Provide the code vertex of the function or method. + +**Code Snippet:** Provide the full class definition, including the constructor and any parameters it may require for instantiation. + +#### Response Output Format +**Class Base:** Specify the base class or interface from which the current class extends, if any. + +**Class Description:** Offer a brief description of the class's purpose and functionality. + +**Init Parameters:** List each parameter from construct. For each parameter, provide: + - `param`: The parameter name + - `param_description`: A concise explanation of the parameter's purpose. + - `param_type`: The data type of the parameter, if explicitly defined. + + ```json + [ + { + "param": "parameter_name", + "param_description": "A brief description of what this parameter is used for.", + "param_type": "The data type of the parameter" + }, + ... 
+ ] + ``` + + + If no parameter for construct, return + ```json + [] + ``` +""" +``` + +- funcGenDoc_PROMPT,create function code to api doc Prompt +``` +funcGenDoc_PROMPT = """#### Agent Profile +You are a high-level code documentation assistant, skilled at extracting information from function/method code into detailed and well-structured documentation. + + +#### Input Format +**Code Path:** Provide the code path of the function or method you wish to document. +This name will be used to identify and extract the relevant details from the code snippet provided. + +**Current_Vertex:** Provide the code vertex of the function or method. + +**Code Snippet:** A segment of code that contains the function or method to be documented. + +#### Response Output Format + +**Class Description:** Offer a brief description of the method(function)'s purpose and functionality. + +**Parameters:** Extract parameter for the specific function/method Code from Code Snippet. For parameter, provide: + - `param`: The parameter name + - `param_description`: A concise explanation of the parameter's purpose. + - `param_type`: The data type of the parameter, if explicitly defined. + ```json + [ + { + "param": "parameter_name", + "param_description": "A brief description of what this parameter is used for.", + "param_type": "The data type of the parameter" + }, + ... + ] + ``` + + If no parameter for function/method, return + ```json + [] + ``` + +**Return Value Description:** Describe what the function/method returns upon completion. + +**Return Type:** Indicate the type of data the function/method returns (e.g., string, integer, object, void). +""" +``` + + +### Import Packages and Basic Configuration Parameters +- First, add openai configuration or similar interfaces to models such as openai (launched via fastchat) + +``` +import os, sys +from muagent.base_configs.env_config import CB_ROOT_PATH +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.connector.phase import BasePhase +from muagent.connector.agents import BaseAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Message, Role, ChainConfig +from muagent.codechat.codebase_handler.codebase_handler import CodeBaseHandler +from loguru import logger +from muagent.tools import CodeRetrievalSingle + + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + + +### Defining a New Agent Class +For custom key-value information +``` +class CodeGenDocer(BaseAgent): + def start_action_step(self, message: Message) -> Message: + '''do action before agent predict ''' + # Retrieve code snippets and node information based on the question + action_json = CodeRetrievalSingle.run(message.code_engine_name, message.input_query, llm_config=self.llm_config, + embed_config=self.embed_config, local_graph_path=message.local_graph_path, use_nh=message.use_nh,search_type="tag") + current_vertex = action_json['vertex'] + message.customed_kargs["Code Snippet"] = action_json["code"] + message.customed_kargs['Current_Vertex'] = current_vertex + return message + +``` + + +### Preparing LLM & Embedding +``` +llm_config = LLMConfig( + model_name="gpt-4", api_key=api_key, api_base_url=api_base_url, temperature=0.3 +) +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, 
embed_model_path=embed_model_path +) +``` + + +### Codebase Loading +``` +# initialize codebase +# delete codebase +codebase_name = 'client_nebula' +code_path = "D://chromeDownloads/devopschat-bot/client_v2/client" +use_nh = True +do_interpret = False +cbh = CodeBaseHandler(codebase_name, code_path, crawl_type='dir', use_nh=use_nh, local_graph_path=CB_ROOT_PATH, + llm_config=llm_config, embed_config=embed_config) +cbh.delete_codebase(codebase_name=codebase_name) +# load codebase +cbh = CodeBaseHandler(codebase_name, code_path, crawl_type='dir', use_nh=use_nh, local_graph_path=CB_ROOT_PATH, + llm_config=llm_config, embed_config=embed_config) +cbh.import_code(do_interpret=do_interpret) +``` + + +### Then Construct a Phase Instance and Begin Execution +``` +# log-level, print prompt, and llm predict +os.environ["log_verbose"] = "1" + +funcGenDoc_role = Role(role_type="assistant", role_name="funcGenDoc_role", prompt=funcGenDoc_PROMPT) +funcGenDoc = CodeGenDocer( + role=funcGenDoc_role, + chat_turn=1, + llm_config=llm_config, embed_config=embed_config, +) + +classGenDoc_role = Role(role_type="assistant", role_name="classGenDoc_role", prompt=classGenDoc_PROMPT) +classGenDoc = CodeGenDocer( + role=classGenDoc_role, + chat_turn=1, + llm_config=llm_config, embed_config=embed_config, +) + +codeGenDocGroup_role = Role(role_type="assistant", role_name="codeGenDocGroup_role", prompt=codeGenDocGroup_PROMPT) +codeGenDocGroup = SelectorAgent( + role=codeGenDocGroup_role, + chat_turn=1, + llm_config=llm_config, embed_config=embed_config, + group_agents=[funcGenDoc, classGenDoc] +) + +chain_config = ChainConfig( + chain_name="codeGenDocGroup_chain", agents=[codeGenDocGroup.role.role_name,], + chat_turn=1) +chain = BaseChain( + chainConfig=chain_config, agents=[codeGenDocGroup], + llm_config=llm_config, embed_config=embed_config, +) + +phase = BasePhase( + phase_name="codeGenDocGroup_phase", chains=[chain], + embed_config=embed_config, llm_config=llm_config +) +``` + +### start to generate api docs from code + +``` +# Initialize based on the previous loading process +cbh = CodeBaseHandler(codebase_name, code_path, crawl_type='dir', use_nh=use_nh, local_graph_path=CB_ROOT_PATH, + llm_config=llm_config, embed_config=embed_config) +cbh.search_vertices(vertex_type="method") +# Begin transforming code into API documentation structure +for vertex_type in ["class", "method"]: + vertices = cbh.search_vertices(vertex_type=vertex_type) + logger.info(f"vertices={vertices}") + # round-1 + docs = [] + for vertex in vertices: + vertex = vertex.split("-")[0] # '-' is the delimiter for method parameters + query_content = f"Generate documentation for {vertex_type} node {vertex}" + query = Message( + role_name="human", role_type="user", input_query=query_content, + code_engine_name=codebase_name, score_threshold=1.0, top_k=3, cb_search_type="tag", use_nh=use_nh, + local_graph_path=CB_ROOT_PATH, + ) + output_message, output_memory = phase.step(query, reinit_memory=True) + # print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + docs.append(output_memory.get_spec_parserd_output()) + os.makedirs(f"{CB_ROOT_PATH}/docs", exist_ok=True) + with open(f"{CB_ROOT_PATH}/docs/raw_{vertex_type}.json", "w") as f: + json.dump(docs, f) + + +# Convert the generated document information into markdown text +from muagent.utils.code2doc_util import * +import json +with open(f"/home/user/code_base/docs/raw_method.json", "r") as f: + method_raw_data = json.load(f) + + +with 
open(f"/home/user/code_base/docs/raw_class.json", "r") as f: + class_raw_data = json.load(f) + +method_data = method_info_decode(method_raw_data) +class_data = class_info_decode(class_raw_data) +method_mds = encode2md(method_data, method_text_md) +class_mds = encode2md(class_data, class_text_md) + +docs_dict = {} +for k,v in class_mds.items(): + method_textmds = method_mds.get(k, []) + for vv in v: + # Theoretically, there should only be one + text_md = vv + for method_textmd in method_textmds: + text_md += "\n
" + method_textmd + docs_dict.setdefault(k, []).append(text_md) + + with open(f"/home/user/code_base/docs/{k}.md", "w") as f: + f.write(text_md) +``` \ No newline at end of file diff --git a/content/en/muagent/llm_models/embedding_config.md b/content/en/muagent/llm_models/embedding_config.md new file mode 100644 index 0000000..0a99caf --- /dev/null +++ b/content/en/muagent/llm_models/embedding_config.md @@ -0,0 +1,65 @@ +--- +title: Embedding Config +url: "muagent/embedding-model-config" +aliases: +- "/muagent/embedding-model-config" +--- + + +## Prepare Relevant Parameters +First, add the OpenAI configuration; this could also be a model similar to the OpenAI interface (launched via fastchat). +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +``` + +## Build LLM Config +- Constructing with a local model file +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + + +- Constructing via OpenAI +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig + +embed_config = EmbedConfig( + embed_engine="openai", api_key=api_key, api_base_url=api_base_url, +) +``` + +- Customizing and inputting langchain embeddings +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig + +class CustomizedEmbeddings(Embeddings): + def embed_documents(self, texts: List[str]) -> List[List[float]]: + embeddings = [] + # add your embedding code + return embeddings + def embed_query(self, text: str) -> List[float]: + """Compute query embeddings using a HuggingFace transformer model. + Args: + text: The text to embed. + Returns: + Embeddings for the text. + """ + # add your embedding code + return embedding + + +embeddings = CustomizedEmbeddings() +embed_config = EmbedConfig( + embed_model="default", + langchain_embeddings=embeddings +) +``` \ No newline at end of file diff --git a/content/en/muagent/llm_models/llm_config.md b/content/en/muagent/llm_models/llm_config.md new file mode 100644 index 0000000..4365298 --- /dev/null +++ b/content/en/muagent/llm_models/llm_config.md @@ -0,0 +1,57 @@ +--- +title: LLM Config +url: "muagent/llm-model-config" +aliases: +- "/muagent/llm-model-config" +--- + +## Prepare Relevant Parameters +First, add the OpenAI configuration, or you can use another model similar to the OpenAI interface (launched through fastchat). 
+``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +``` + + +## Build LLM Config +- By passing the class `openai` + +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) +``` + + +- Customizing and inputting langchain LLM +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from langchain.llms.base import BaseLLM, LLM + + +class CustomizedModel(LLM): + repetition_penalty = 1.1 + temperature = 0.2 + top_k = 40 + top_p = 0.9 + + def predict(self, prompt: str, stop: Optional[List[str]] = None) -> str: + return self._call(prompt, stop) + + def _call(self, prompt: str, + stop: Optional[List[str]] = None) -> str: + """_call""" + return "" + + +llm = CustomizedModel() +llm_config = LLMConfig( + llm=llm +) +``` \ No newline at end of file diff --git a/content/en/muagent/overview/agent-flow.md b/content/en/muagent/overview/agent-flow.md new file mode 100644 index 0000000..664fdc9 --- /dev/null +++ b/content/en/muagent/overview/agent-flow.md @@ -0,0 +1,46 @@ +--- +title: Agent Flow +slug: Agent Flow +url: "muagent/agent-flow" +aliases: +- "/muagent/agent-flow" +--- + + +## Introduction to Core Connectors +To facilitate everyone's understanding of the entire muagent link, we adopt the Flow format to introduce in detail how to build through configuration + +
Below, we first introduce the related core components
+ +### Agent +On the design level of the Agent, we provide four basic types of Agents, with Role settings for these Agents that can meet the interactions and uses of various common scenarios: + +1. BaseAgent: Provides basic question answering, tool usage, and code execution functions, and realizes input => output according to the Prompt format. +2. ReactAgent: Provides standard React functionality, accomplishing current tasks based on questions. +3. ExecutorAgent: Performs sequential execution of task lists, completing related tasks according to plans arranged by the User or the previous Agent. +4. SelectorAgent: Provides the function of selecting an Agent, choosing the appropriate Agent to respond according to the question from the User or the previous Agent. After output, the message is pushed into the memory pool, which will later be managed by the Memory Manager. + +It selects the appropriate Agent to respond based on the question from the User or the previous Agent. After output, the message is pushed into the memory pool, which is subsequently managed by the Memory Manager. + +### Chain +Basic Chain: BaseChain, connects the interactions of agents, manages the related messages and memory. + +### Phase +Basic Phase: BasePhase, connects the interactions of chains, and manages the related messages and memory. + +### Prompt Manager +The prompt creation for each agent in the Mutli-Agent link: + +- By setting simple prompt_input_keys and prompt_output_keys, the preset Prompt Context creation logic can be followed to quickly configure the agent prompt. +- It is also possible to design a new key-context in the prompt manager module, achieving personalized Agent Prompt. + +### Memory Manager +Mainly used for the management of chat history: +- Manages the reading and writing of chat history in a database, including user input, llm output, doc retrieval, code retrieval, search retrieval. +- Summarizes the key information in chat history to create a summary context, which serves as a prompt context. +- Provides a retrieval function to search for information related to the question in the chat history or summary context, assisting with question answering. \ No newline at end of file diff --git a/content/en/muagent/overview/multi-agent.md b/content/en/muagent/overview/multi-agent.md new file mode 100644 index 0000000..9bfa32b --- /dev/null +++ b/content/en/muagent/overview/multi-agent.md @@ -0,0 +1,147 @@ +--- +title: MuAgent +slug: MuAgent +url: "muagent/muagent" +aliases: +- "/muagent" +- "/muagent/multi-agent" +- "/muagent/muagent" +- "/muagent/muagent-overview" +--- + + +# Introduction +To enhance the performance of large models in terms of inference accuracy, various innovative Large Language Model (LLM) playbooks have emerged in the industry. From the earliest Chain of Thought (CoT) and Thread of Thought (ToT) to Games on Tracks (GoT), these methods have continually expanded the capability boundaries of LLMs. When handling complex problems, we can select, invoke and execute tool feedback through the ReAct process, while realizing multi-round tool use and multi-step execution. + +However, for more complex scenarios, such as the development of complex code, a single-function LLM Agent is clearly not up to the task. Therefore, the community has begun to develop combinations of multiple Agents, such as projects focused on the development field like metaGPT, GPT-Engineer, and chatDev, as well as the AutoGen project that focuses on automating the construction of Agents and Agent dialogue. 
+ +After an in-depth analysis of these frameworks, it has been found that most Agent frameworks are highly coupled, with poor usability and extensibility. They implement specific scenarios in preset settings, but expanding to new scenarios can be very challenging. + +Therefore, we hope to build an extensible, easy-to-use Multi-Agent framework to support ChatBots in retrieving knowledge base information while assisting with various general tasks such as daily office work, data analysis, development, and operations. + +This project's Multi-Agent framework incorporates excellent designs from multiple frameworks, such as the message pool from metaGPT and the agent selector from autogen. + +
+ + +# MuAgent Framework +In MuAgent, in addition to defining the Agent interaction link and AgentBase basic execution flow, we have also designed two basic components: Prompt Manager and Memory Manager, which are used for automated construction of Prompts and chat history management, respectively. We have built an extensible, easy-to-use Multi-Agent framework, including the following content: + +- **Agent Base:** Established four basic types of Agents – BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent – to support basic activities in various scenarios. +- **Communication:** Completes the transfer of information between Agents through Message and Parse Message entities, and interacts with Memory Manager to manage memory in the Memory Pool. +- **Prompt Manager:** Automates the assembly of Customized Agent Prompts through Role Handler, Doc/Tool Handler, Session Handler, Customized Handler. +- **Memory Manager:** Supports storage management of chat history, information compression, memory retrieval, and finally storage in databases, local or vector databases through the Memory Pool. +- **Component:** Auxiliary ecosystem components for building Agents, including Retrieval, Tool, Action, Sandbox, etc. +- **Customized Model:** Supports the integration of private LLM and Embedding. + +## Agent Base +At the Agent level, we provide four basic types of Agents, with Role settings for these Agents that can meet the interactions and uses of various common scenarios. All Actions are executed by Agents. + +1. BaseAgent: Provides basic question answering, tool usage, and code execution functions, and realizes input => output according to the Prompt format. +
2. ReactAgent: Provides standard React functionality, accomplishing the current task based on the question.
3. ExecutorAgent: Sequentially executes a list of tasks, completing related tasks according to plans arranged by the User or the previous Agent. The Agent receives a task list (List[Task]) and loops through the tasks until they are all complete; Feedback Agents can also be added along the way for task re-optimization.
+ +4. SelectorAgent: Provides the function of selecting an Agent, choosing the appropriate Agent to respond based on the question from the User or the previous Agent. +
+ + +## Communication +To enable better interaction between Agents, as well as to provide each Agent with enough information to complete their specific tasks, we have divided the Message information body into several parts, such as System Content, Info Content, LLM Content, and LLM Parsed Content, etc. + +System Content: Used to store and manage the timing of the current LLM output, Role information, etc. +Info Content: LLM auxiliary information, such as knowledge base query information, code library retrieval information, tool information, Agent information, etc. + +LLM Content: Directly stores and conveys information generated by the LLM. +LLM Parsed Content: Parses the LLM's output into a more manageable key-value data structure, making it easier to filter through LLM content. +Customized Content: Manages key-value data content generated by custom actions, used for subsequent assembly and construction of custom Prompt templates. +By defining the above message formats, we can accomplish the transfer and management of general messages. Specific assembly methods can be seen in the Prompt Manager module. + + +## Context Manager +### Memory Manager +Mainly used for the management of chat history: + +- Storage Management: Implements the save and load management of chat history in the database or locally, including user input, LLM output, observation output. +- Information Compression: Summarizes key information from the chat history into a summary context, such as single text summaries, summaries from different angles, key information extraction, multi-text summaries, and serves as Prompt context. +- Memory Retrieval: Provides basic retrieval functions, retrieving information related to questions from chat history or Summary Context to assist in Q&A. +- LLM Automatic Trigger: Future definitions of policies or the use of LLM to trigger the compression summary and retrieval functions. + +### Prompt Manager +Asking LLMs has become common practice, but how to coordinate the planning and usage of tools, code writing abilities among multiple large models to guide their expected outputs has become a key issue. Essentially, this involves abstracting business problems into executable Prompts, so we're not just designing Agents but rather engaging in framework design after a deep understanding of the current demands. + +In actual business scenarios where LLMs are involved (excluding the SFT process), we can designate LLM to complete specific tasks and obtain expected outputs through the design of Agent Prompt content. In the process of MuAgent, the Prompt is divided into three parts: System Prompt, Context Prompt, Customized Prompt. + +- System Prompt includes Role Name, Role Description, Task, etc. +- Context Prompt includes Doc Context, Code Context, Tool Context, Agent Context, Session Context, etc. +- Customized Prompt involves custom inputs and outputs, such as... +We can also ask the model to output structured texts, such as the JSON string of a tool, code\ncode_content, etc., to complete particular workflows. + +**Automatic Prompt Assemble** + +After defining the structure as above, we can complete the automation assembly of Prompts in the following ways, without having to make extensive adjustments to the prompt each time: + +1. Upon defining an Agent, configure Role Name, Role Description, Task, etc., to determine what the Agent needs to do. +2. 
Pre-package some reusable, general Context Prompt strategies, such as a selectable Role's SessionContext and configurable Tool, Code Retrieval, Doc Retrieval, Search Retrieval, and Agent contexts, to complete the corresponding assembly.
3. As the Agent's Prompt requires relatively personalized operations, the Prompt Manager module also supports adding new key-context designs to achieve personalized Agent Prompts.

**Automatic Prompt Design**
Able to automatically design the best prompt based on role description, task, query, etc.; to be defined...

**Multi Prompt Design**
Based on the previous definition of Prompt, we know that a Prompt consists of three parts: System Prompt, Context Prompt, and Customized Prompt. Any change in these three parts may change the final output of the LLM.

For the same type of task, the System Prompt is the same. So, without considering variations of the Customized Prompt, it is possible to assemble different contexts for the same task. For example, Prompt A obtains 10 rounds of chat history, while Prompt B uses 5 rounds of chat history, or alternatively filters and compresses the information in the chat history.

To be implemented...


## Component
### Retrieval
In all Prompt Contexts, aside from Chat History session information, information from external document libraries, code repositories, and internet search results is also relied upon. This knowledge system beyond the model parameters can significantly enhance the Agent's ability to complete complex tasks.

Thus, in MuAgent, we integrated three ways to retrieve information: Doc, Internet Search, and Code Retrieval, and defined an abstract class, IMRetrieval, which allows developers to register customized knowledge bases for their Agents.

**Doc Retrieval**

Document vector databases are currently the mainstream method for building knowledge bases, using Text Embedding models to vectorize documents and store them in vector databases. In the future, we will also support queries based on knowledge graphs and automatically extract entities and relations through large models to explore the complex relationships in data.

**Code Retrieval**

LLMs face the challenge of lagging training data for code generation, repair, and component understanding tasks, as well as not being able to perceive the context-dependent structure of code. During development, understanding, retrieving, and querying metadata from the existing codebase and its dependencies can take a considerable amount of time. Hence, we hope to provide an external code knowledge base that the Agent can retrieve from and query.


**Search Retrieval**
In addition to the readily available document and code knowledge bases, in daily practice, browsing a large amount of web content to acquire more knowledge helps us understand emerging scenarios, businesses, technologies, and more. Hence, we've integrated duckduckgosearch, an open-source search tool, to provide LLMs with content beyond their knowledge reserves.

### Tool
With OpenAI launching the Function Call feature, which generates parameters for specified tools through the LLM and executes the call, machines can better understand and respond to human needs, thus solving practical problems and repetitive work. Nowadays, the ability to learn tools is increasingly becoming a standard feature of open-source models. Therefore, MuAgent also supports Tool registration for agents.
By using the Python registration template BaseToolModel class and writing related properties and methods such as Tool_name, Tool_description, ToolInputArgs, ToolOutputArgs, and run, tools can be quickly integrated. It also supports the direct use of langchain Tool interfaces. +For example, functions like the above XXRetrieval can also be registered as a Tool, ultimately called by LLM. + +### Action +In the definition of MuAgent, Action is viewed as a specific action or action flow that LLM needs to execute, including LLM information processing, knowledge retrieval, tool invocation, and code execution, etc., constituting a comprehensive and complex dynamic process. For instance, in the React process, we obtained a Tool parameter through LLM, and then "putting the tool parameter into the Tool and executing the call" is an Action, which practically invokes the Tool. Or, we defined an Agent, who orchestrates a fixed agent's Action steps, with the input parameters of this Agent specially designated by the Action. That is to say, whether the parameters are generated by LLM or set by engineering, as long as it involves a specific execution process, it is an Action. + +## Modules +- [connector](/muagent/connector-agent) Mainly introduces the work of this block of the Agent framework +- llm_models +- retrieval +- tools +- sandbox +- utils + + diff --git a/content/en/muagent/overview/prompt-manager.md b/content/en/muagent/overview/prompt-manager.md new file mode 100644 index 0000000..5ec59d3 --- /dev/null +++ b/content/en/muagent/overview/prompt-manager.md @@ -0,0 +1,70 @@ +--- +title: Prompt Manager +slug: Prompt Manager +url: "coagent/prompt-manager" +aliases: +- "/coagent/prompt-manager" +--- + + +### 提示管理器(Prompt Manager) +管理多智能体链路中的prompt创建 +- 快速配置:采用预设的处理函数,用户仅需通过定义智能体的输入输出即可轻松配置,实现多智能体的prompt快速组装和配置。 +- 自定义支持:允许用户自定义prompt内部各模块的处理逻辑,以达到个性化的智能体prompt实现。 + +### Prompt预设模板结构 + +- Agent Profile:此部分涉及到智能体的基础描述,包括但不限于代理的类型、功能和指令集。用户可以在这里设置智能体的基本属性,确保其行为与预期相符。 +- Context:上下文信息,给智能体做参考,帮助智能体更好的进行决策。 + - Tool Information:此部分为智能体提供了一套可用工具的清单,智能体可以根据当前的场景需求从中挑选合适的工具以辅助其执行任务。 + - Reference Documents:这里可以包含代理参考使用的文档或代码片段,以便于它在处理请求时能够参照相关资料。 + - Session Records:在进行多轮对话时,此部分会记录之前的交谈内容,确保智能体能够在上下文中保持连贯性。 +- Response Output Format:用户可以在此设置智能体的输出格式,以确保生成的响应满足特定的格式要求,包括结构、语法等。 +- Response:在与智能体的对话中,如果用户希望智能体继续某个话题或内容,可以在此模块中输入续写的上文。例如,在运用REACT模式时,可以在此区域内详细阐述智能体先前的行为和观察结果,以便于智能体构建连贯的后续响应。 + +### Prompt自定义配置 + +#### Prompt模块参数 +- field_name:唯一的字段名称标识,必须提供。 +- function:指定如何处理输入数据的函数,必须提供。 +- title:定义模块的标题。若未提供,将自动生成一个标题,该标题通过把字段名称中的下划线替换为空格并将每个单词的首字母大写来构建。 +- description:提供模块的简要描述,位于模块最上方(标题下方)。默认为空,可选填。 +- is_context:标识该字段是否属于上下文模块的一部分。默认为True,意味着除非显式指定为False,否则都被视为上下文的一部分。 +- omit_if_empty:设定当模块内容为空时,是否在prompt中省略该模块,即不显示相应的模板标题和内容。默认为False,意味着即使内容为空也会显示标题。如果希望内容为空时省略模块,需显式设置为True。 + +#### Prompt配置示例 + +Prompt配置由一系列定义prompt模块的字典组成,这些模块将根据指定的参数和功能来处理输入数据并组织成一个完整的prompt。 + +在配置中,每个字典代表一个模块,其中包含相关的参数如 field_name, function_name, is_context, title, description, 和 omit_if_empty,用以控制模块的行为和呈现方式。 + +context_placeholder 字段用于标识上下文模板的位置,允许在prompt中插入动态内容。 +``` +[ + {"field_name": 'agent_profile', "function_name": 'handle_agent_profile', "is_context": False}, + {"field_name": 'context_placeholder', "function_name": '', "is_context": True}, + {"field_name": 'tool_information',"function_name": 'handle_tool_data', "is_context": True}, + {"field_name": 'reference_documents', "function_name": 'handle_doc_info'}, + {"field_name": 'session_records', "function_name": 'handle_session_records'}, + {"field_name": 'task_records', 
"function_name": 'handle_task_records'}, + {"field_name": 'output_format', "function_name": 'handle_output_format', 'title': 'Response Output Format', "is_context": False}, + {"field_name": 'response', "function_name": 'handle_response', "title"="begin!!!", "is_context": False, "omit_if_empty": False} +] +``` + +### 未来规划 + +#### Prompt配置简化 + +未来的Prompt配置简化旨在降低用户面对复杂配置的难度。通过引入更直观的配置方法,我们计划使得Prompt配置不仅对高级用户友好,还能让初学者轻松上手。简化计划可能包括: + +- 预设配置短语:将复杂的配置字典转换为简洁的短语,每个短语都预定义了一个Prompt模块。用户将能够使用简单的字符串指令来快速配置Prompt,而无需深入了解所有参数。 +- 配置校验和建议:增加配置的即时校验,如果检测到配置错误或不一致性,自动提供修改建议,帮助用户优化Prompt结构。 + +#### 动作(Action)注册的改进计划 + +在现行系统中,智能体必须在其角色提示(role prompt)内定义所有的动作(actions)。这意味着智能体需要同时处理动作的意图识别和生成动作所需的输入数据,这一过程对语言模型的理解和推理能力提出了更高要求。 + +为了优化这一流程,我们打算在后续版本中对动作的输入生成和执行进行模块化。这将使智能体的工作重点转移至判断当前情境下应执行哪些动作,而不必负责具体的操作指令。在这种新的架构下,当需要执行某个动作时,将有专门的机制负责生成相应动作的具体输入指令。 + +这种分离将显著降低单个模块的复杂性,使得整个系统更加灵活、易于扩展,同时也提升了动作执行的效率和准确性。 diff --git a/content/en/muagent/overview/quick-start.md b/content/en/muagent/overview/quick-start.md new file mode 100644 index 0000000..d484080 --- /dev/null +++ b/content/en/muagent/overview/quick-start.md @@ -0,0 +1,350 @@ +--- +title: Quick Start +slug: Quick Start +url: "muagent/quick-start" +aliases: +- "/muagent/quick-start" +--- + + +## Quick Start +For a complete example, see [examples/muagent_examples](htpps://) +### First, prepare the relevant configuration information +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + +### Then, set up LLM configuration and Embedding model configuration +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.connector.phase import BasePhase +from muagent.connector.schema import Message + + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### Finally, choose an existing scenario to execute +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +# Choose a scenario +phase_name = "baseGroupPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config +) + +# round-1 needs to be completed by code interpreter +query_content = "Confirm whether employee_data.csv exists locally and view its columns and data types; then draw a bar chart" +query = Message( + role_name="human", role_type="user", tools=[], input_query=query_content, +) +# phase.pre_print(query) # This function is used to pre-print the Prompt of the Agents' execution chain +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + + +# round-2 requires the execution of a tool +tools = toLangchainTools([TOOL_DICT[i] for i in TOOL_SETS if i in TOOL_DICT]) +query_content = "Please help me check if the server at 127.0.0.1 had any issues at 10 o'clock, help me to 
determine" +query = Message( + role_name="human", role_type="user", tools=tools, input_query=query_content, +) +# phase.pre_print(query) # This function is used to pre-print the Prompt of the Agents' execution chain +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +## Phase Customization +Refer to [How to Customize Phase](/muagent/customed-examples) + + +## Introduction and Usage of Scenes +Below are some specific scene introductions and usages. +We also welcome everyone to brainstorm and construct some interesting cases. +### baseTaskPhase +Scenarios involving task segmentation and multi-step execution of xAgents + +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "2" + +phase_name = "baseTaskPhase" +phase = BasePhase( +phase_name, embed_config=embed_config, llm_config=llm_config, +) + + +# round-1 +query_content = "Check if employee_data.csv exists locally and see what columns and data types it has; then draw a bar chart" +query = Message( + role_name="human", role_type="user", input_query=query_content, + ) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +### codeReactPhase +The code interpreter scenario based on React + +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/book_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +# then, create a data analyze phase +phase_name = "codeReactPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, + jupyter_work_path=JUPYTER_WORK_PATH, +) + +# round-1 +query_content = "Check if 'employee_data.csv' exists locally, view its columns and data types; then draw a bar chart" +query = Message( + role_name="human", role_type="user", + role_content=query_content, input_query=query_content, origin_query=query_content, + ) + +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + +### codeToolReactPhase +The tool invocation and code interpreter scenario based on the React template + + +``` +TOOL_SETS = [ + "StockName", "StockInfo", + ] +tools = toLangchainTools([TOOL_DICT[i] for i in TOOL_SETS if i in TOOL_DICT]) + +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "2" + +phase_name = "codeToolReactPhase" + +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, +) + +query_content = "Query the stock code of Kweichow Moutai and acquire the time series data of the last 10 days up to the current date (December 24th, 2023); then use code to draw a line chart and analyze it" + +query = Message(role_name="human", role_type="user", input_query=query_content, tools=tools) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +### docChatPhase +Knowledge Base Retrieval and Question-Answering Pipeline + +- example 1 +``` +# create your 
knowledge base +from muagent.service.kb_api import create_kb, upload_files2kb +from muagent.utils.server_utils import run_async +from muagent.orm import create_tables + + +# use to test, don't create some directory +create_tables() + +# create a knowledge base +kb_name = "example_test" +run_async(create_kb(knowledge_base_name=kb_name, vector_store_type="faiss", embed_config=embed_config, kb_root_path=KB_ROOT_PATH)) + +# add doc to knowledge base +file = os.path.join("D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/sources/docs/langchain_text_10.jsonl") +files = [file] +upload_files2kb(files, kb_name, embed_config, kb_root_path=KB_ROOT_PATH) + + +## start to chat with knowledge base +# log-level, print prompt, and llm predict +os.environ["log_verbose"] = "0" + +## example 1 +# set chat phase +phase_name = "docChatPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, kb_root_path=KB_ROOT_PATH, +) + +# round-1 +query_content = "What modules does langchain have?" +query = Message( + role_name="human", role_type="user", input_query=query_content, + doc_engine_name=kb_name, score_threshold=1.0, top_k=3 + ) + +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + +# round-2 +query_content = "What is the use of prompts?" +query = Message( + role_name="human", role_type="user", input_query=query_content, + doc_engine_name=kb_name, score_threshold=1.0, top_k=3 + ) + +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + +- example 2 +``` +## Customized register demo +from muagent.tools import DocRetrieval +class BaseDocRetrieval(IMRertrieval): + + def __init__(self, knowledge_base_name: str, search_top=5, score_threshold=1.0, embed_config: EmbedConfig=EmbedConfig(), kb_root_path: str=KB_ROOT_PATH): + self.knowledge_base_name = knowledge_base_name + self.search_top = search_top + self.score_threshold = score_threshold + self.embed_config = embed_config + self.kb_root_path = kb_root_path + + def run(self, query: str, search_top=None, score_threshold=None, ): + docs = DocRetrieval.run( + query=query, knowledge_base_name=self.knowledge_base_name, + search_top=search_top or self.search_top, + score_threshold=score_threshold or self.score_threshold, + embed_config=self.embed_config, + kb_root_path=self.kb_root_path + ) + return docs + + +doc_retrieval = BaseDocRetrieval(knowledge_base_name=kb_name, score_threshold=1.0, search_top=3, embed_config=embed_config) + +# set chat phase +phase_name = "docChatPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, kb_root_path=KB_ROOT_PATH, + doc_retrieval=doc_retrieval +) + +# round-1 +query_content = "What modules does langchain have?" +query = Message( + role_name="human", role_type="user", input_query=query_content, +) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + + +# round-2 +query_content = "What is the use of prompts?" 
+query = Message( + role_name="human", role_type="user", input_query=query_content, +) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + +### metagpt_code_devlop +The code construction Phase in metagpt + +``` +# log level, print prompt, and llm predict +os.environ["log_verbose"] = "2" +phase_name = "metagpt_code_development" + +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config +) + +query_content = "create a snake game" +query = Message(role_name="human", role_type="user", input_query=query_content) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +### searchChatPhase +Fixed scenario chain, search first then directly answer based on LLM + +``` +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "2" + +# This can be configured when the duckduckgo connection is not available +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5h://127.0.0.1:13659" +phase_name = "searchChatPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config +) + + +# round-1 +query_content1 = "Who is the current President of the United States?" +query = Message( + role_name="human", role_type="user", input_query=query_content1, + search_engine_name="duckduckgo", score_threshold=1.0, top_k=3 +) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + +# round-2 +query_content2 = "Who was the previous president of the United States, and do these two people have any relationship?" +query = Message( + role_name="human", role_type="user", input_query=query_content2, + search_engine_name="duckduckgo", score_threshold=1.0, top_k=3 +) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +### toolReactPhase +The tool invocation scene based on the React template + +``` +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "2" +phase_name = "toolReactPhase" + +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config +) + +# round-1 +tools = toLangchainTools([TOOL_DICT[i] for i in TOOL_SETS if i in TOOL_DICT]) +query_content = "Please help me check if there were any issues with the server at 127.0.0.1 at 10 o'clock, I need your assistance in determining this." +query = Message( + role_name="human", role_type="user", tools=tools, input_query=query_content, +) + +# phase.pre_print(query) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` \ No newline at end of file diff --git a/content/en/muagent/retrieval/custom_retrieval.md b/content/en/muagent/retrieval/custom_retrieval.md new file mode 100644 index 0000000..afa18df --- /dev/null +++ b/content/en/muagent/retrieval/custom_retrieval.md @@ -0,0 +1,105 @@ +--- +title: Custom Retrieval +url: "muagent/custom-retrieval" +aliases: +- "/muagent/custom-retrieval" +--- + +## Basic Introduction +`Doc Retrieval` is the document vector database, which is the most mainstream method for knowledge base construction nowadays. It uses Text Embedding models to vectorize documents and stores them in a vector database. 
In the future, we will also support querying based on knowledge graph and automatically extracting entities and relationships through large models to explore various complex relationships in data. + +`Code Retrieval` LLM faces challenges in tasks such as code generation, repair, and component understanding, including lagging code training data and the inability to perceive the dependency structure of code context. During development, understanding existing codebases and dependencies, retrieving related code, querying metadata, etc., can take a significant amount of time. Therefore, we hope to support LLM with code outside of its knowledge system through code structure analysis and code retrieval. + +`Search Retrieval` In addition to existing document and code knowledge bases, in daily practice, we browse a large amount of web content to acquire more knowledge, helping us understand emerging scenarios, businesses, technologies, etc., hence we integrated duckduckgo search, an open-source search tool, to provide LLM with content beyond its knowledge reserve. + +## Retrieval Structure + +``` +class IMRertrieval: + def __init__(self,): + ''' + init your personal attributes + ''' + pass + + def run(self, ): + ''' + execute interface, and can use init' attributes + ''' + pass + + +class BaseDocRetrieval(IMRertrieval): + + def __init__(self, knowledge_base_name: str, search_top=5, score_threshold=1.0, embed_config: EmbedConfig=EmbedConfig(), kb_root_path: str=KB_ROOT_PATH): + self.knowledge_base_name = knowledge_base_name + self.search_top = search_top + self.score_threshold = score_threshold + self.embed_config = embed_config + self.kb_root_path = kb_root_path + + def run(self, query: str, search_top=None, score_threshold=None, ): + docs = DocRetrieval.run( + query=query, knowledge_base_name=self.knowledge_base_name, + search_top=search_top or self.search_top, + score_threshold=score_threshold or self.score_threshold, + embed_config=self.embed_config, + kb_root_path=self.kb_root_path + ) + return docs +``` + + +## Usage Example +``` +# retrieval your customized register demo +from muagent.tools import DocRetrieval + +class BaseDocRetrieval(IMRertrieval): + + def __init__(self, knowledge_base_name: str, search_top=5, score_threshold=1.0, embed_config: EmbedConfig=EmbedConfig(), kb_root_path: str=KB_ROOT_PATH): + self.knowledge_base_name = knowledge_base_name + self.search_top = search_top + self.score_threshold = score_threshold + self.embed_config = embed_config + self.kb_root_path = kb_root_path + + def run(self, query: str, search_top=None, score_threshold=None, ): + docs = DocRetrieval.run( + query=query, knowledge_base_name=self.knowledge_base_name, + search_top=search_top or self.search_top, + score_threshold=score_threshold or self.score_threshold, + embed_config=self.embed_config, + kb_root_path=self.kb_root_path + ) + + return docs + + +doc_retrieval = BaseDocRetrieval(knowledge_base_name=kb_name, score_threshold=1.0, search_top=3, embed_config=embed_config) + +# set chat phase +phase_name = "docChatPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, kb_root_path=KB_ROOT_PATH, + doc_retrieval=doc_retrieval +) + + +# round-1 +query_content = "What modules does langchain have?" 
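+# Note: because a custom doc_retrieval is registered on the phase above, the Message here
+# omits doc_engine_name / score_threshold / top_k; those retrieval settings are taken from
+# the BaseDocRetrieval instance instead.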
+query = Message( + role_name="human", role_type="user", input_query=query_content, +) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + + +# round-2 +query_content = "What is the use of prompts?" +query = Message( + role_name="human", role_type="user", input_query=query_content, +) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` \ No newline at end of file diff --git a/content/en/muagent/tools/custom_tool.md b/content/en/muagent/tools/custom_tool.md new file mode 100644 index 0000000..3098181 --- /dev/null +++ b/content/en/muagent/tools/custom_tool.md @@ -0,0 +1,126 @@ +--- +title: Custom Tool +url: "muagent/custom-tool" +aliases: +- "/muagent/custom-tool" +--- + +## Introduction +In MuAgent, it also supports the registration of Tools by Agents. By registering the BaseToolModel class with Python and writing +- Tool_name +- Tool_description +- ToolInputArgs +- ToolOutputArgs +- run + +and other relevant properties and methods, the quick integration of tools can be achieved. It also supports the direct use of the langchain Tool interface. For example, functions like the aforementioned XXRetrieval can also be registered as a Tool, to be ultimately called by an LLM. + + +## BaseTool Structure + +``` +from langchain.agents import Tool +from pydantic import BaseModel, Field +from typing import List, Dict +import json + + +class BaseToolModel: + name = "BaseToolModel" + description = "Tool Description" + + class ToolInputArgs(BaseModel): + """ + Input for MoveFileTool. + Tips: + default control Required, e.g. key1 is not Required/key2 is Required + """ + + key1: str = Field(default=None, description="hello world!") + key2: str = Field(..., description="hello world!!") + + class ToolOutputArgs(BaseModel): + """ + Input for MoveFileTool. + Tips: + default control Required, e.g. key1 is not Required/key2 is Required + """ + + key1: str = Field(default=None, description="hello world!") + key2: str = Field(..., description="hello world!!") + + @classmethod + def run(cls, tool_input_args: ToolInputArgs) -> ToolOutputArgs: + """excute your tool!""" + pass +``` + + +## Register Example + +``` +from pydantic import BaseModel, Field +from typing import List, Dict +import requests +from loguru import logger + +from .base_tool import BaseToolModel + +class Multiplier(BaseToolModel): + """ + Tips: + default control Required, e.g. key1 is not Required/key2 is Required + """ + + name: str = "Multiplier" + description: str = """useful for when you need to multiply two numbers together. \ + The input to this tool should be a comma separated list of numbers of length two, representing the two numbers you want to multiply together. 
\ + For example, `1,2` would be the input if you wanted to multiply 1 by 2.""" + + class ToolInputArgs(BaseModel): + """Input for Multiplier.""" + + # key: str = Field(..., description="用户在高德地图官网申请web服务API类型KEY") + a: int = Field(..., description="num a") + b: int = Field(..., description="num b") + + class ToolOutputArgs(BaseModel): + """Output for Multiplier.""" + + res: int = Field(..., description="the result of two nums") + + @staticmethod + def run(a, b): + return a * b +``` + + +## Use Example +``` +from langchain.tools import StructuredTool +from muagent.tools import ( + WeatherInfo, Multiplier, toLangchainTools, + TOOL_DICT, TOOL_SETS +) + +# Function exec +tools = [ + StructuredTool( + name=Multiplier.name, + func=Multiplier.run, + description=Multiplier.description, + args_schema=Multiplier.ToolInputArgs, + ), + StructuredTool( + name=WeatherInfo.name, + func=WeatherInfo.run, + description=WeatherInfo.description, + args_schema=WeatherInfo.ToolInputArgs, + ) + ] + +tools = toLangchainTools([TOOL_DICT["Multiplier"]]) + +# tool run Test +print(tools[0].func(1,2)) +``` \ No newline at end of file diff --git a/content/zh/contribution/contribute/d1.pr.md b/content/zh/contribution/contribute/d1.pr.md index 4b749f1..8daba5f 100644 --- a/content/zh/contribution/contribute/d1.pr.md +++ b/content/zh/contribution/contribute/d1.pr.md @@ -80,6 +80,8 @@ commit message的标题:`[]() (#pr)` ### subject 内容 标题需尽量清晰表明本次提交的主要内容。 +例: +`[feature](coagent)<增加antflow兼容和增加coagent demo>` ## 示例 comming soon diff --git a/content/zh/docs/chatbot/c1.quickstart.md b/content/zh/docs/chatbot/c1.quickstart.md index 4af6b67..4263889 100644 --- a/content/zh/docs/chatbot/c1.quickstart.md +++ b/content/zh/docs/chatbot/c1.quickstart.md @@ -2,10 +2,10 @@ title: 快速开始 slug: 快速开始 description: 介绍主要功能 -url: "docs/快速开始" +url: "docs/codefuse-chatbot-quickstart-zh" aliases: - "/docs/快速开始" -- "/docs/quickstart-zh" +- "/docs/codefuse-chatbot-quickstart-zh" ---

diff --git a/content/zh/docs/chatbot/c2.start-detail.md b/content/zh/docs/chatbot/c2.start-detail.md index 674889b..43cd795 100644 --- a/content/zh/docs/chatbot/c2.start-detail.md +++ b/content/zh/docs/chatbot/c2.start-detail.md @@ -13,7 +13,7 @@ aliases:

-如需使用私有化模型部署,请自行安装 nvidia 驱动程序。 +如需使用私有化模型部署,请自行安装 nvidia 驱动程序。。 ### python 环境准备 diff --git a/content/zh/docs/codefuse-evalution/1_quickstart.md b/content/zh/docs/codefuse-evalution/1_quickstart.md new file mode 100644 index 0000000..0241859 --- /dev/null +++ b/content/zh/docs/codefuse-evalution/1_quickstart.md @@ -0,0 +1,248 @@ +--- +title: 快速使用 +description: 介绍主要功能 +url: docs/codefuse-evalution-quickstart-zh +aliases: +- "/docs/codefuse-evalution-quickstart-zh" +--- + +## 推理环境: +CodeFuse-13B: python 3.8及以上版本,pytorch 2.0及以上版本,transformers 4.24.0及以上版本,CUDA 11.4及以上; + +CodeFuse-CodeLlama-34B: python 3.8及以上版本,pytorch2.0及以上版本,transformers==4.32.0 ,Sentencepiece,CUDA 11.4及以上。 + +## 评测执行环境 + +评测生成的代码需要使用多种语言编译、运行。我们使用的各编程语言依赖及所用包的版本如下: + +| 依赖 | 版本 | +| ------- |----------| +| Python | 3.10.9 | +| JDK | 18.0.2.1 | +| Node.js | 16.14.0 | +| js-md5 | 0.7.3 | +| C++ | 11 | +| g++ | 7.5.0 | +| Boost | 1.75.0 | +| OpenSSL | 3.0.0 | +| go | 1.18.4 | +| cargo | 1.71.1 | + + +为了省去使用者配置这些语言环境的麻烦,我们构建了一个Docker镜像,并在其中配置了所需要的环境,你可以按照下面的指令拉取使用 +```bash +docker pull registry.cn-hangzhou.aliyuncs.com/codefuse/codefuseeval:latest +``` + +如果您熟悉Dockerfile,也可以从`codefuseEval/docker/Dockerfile`构建镜像,或者修改之以定制自己的配置: + +```bash +cd codefuseEval/docker +docker build [OPTIONS] . +``` + +获取镜像后,使用如下命令创建容器: + +```bash +docker run -it --gpus all --mount type=bind,source=,target= [OPTIONS] +``` + +## 检查推理结果指令 +我们提供脚本来检查所提供代码 LLM 的结果。请使用以下脚本检查相应的推理结果。 +``` +bash codefuseEval/script/check_reference.sh codefuseEval/result/CodeFuse-CodeLlama-34B/humaneval_result_python.jsonl humaneval_python +bash codefuseEval/script/check_reference.sh codefuseEval/result/CodeFuse-13B/humaneval_result_python.jsonl humaneval_python +``` + +## 如何使用CodeFuseEval +1. 下载模型并更新 ckpt config.json 中的当前模型信息。 主要更新对应型号和版本中的「path」参数。 +2. 运行以下生成命令以生成结果。 +``` +bash codefuseEval/script/generation.sh MODELNAME MODELVERSION EVALDATASET OUTFILE + +eg: +bash codefuseEval/script/generation.sh CodeFuse-13B v1 humaneval_python result/test.jsonl +``` +3. 运行以下评估命令来评估相应模型版本的生成结果。 +``` +bash codefuseEval/script/evaluation.sh +eg: +bash codefuseEval/script/evaluation.sh codefuseEval/result/test.jsonl pass@k humaneval_python +``` + +## 评测说明 + +我们推荐使用给定的[评测环境](#评测环境)进行评测。在评测前,将生成的代码以如下JSON列表形式存储: + +``` +{"task_id": "../..", "generation: "..."} +{"task_id": "../..", "generation: "..."} +... 
+``` + +### 评测数据集 +样本使用JSON列表格式存储在``codefuseEval/data``中,根据用户所需的下游任务情况,每条样本包含 + +* ``task_id``: 题目的目标语言与ID。语言为["Python", "Java", "JavaScript", "CPP", "Go"]中之一。 +* ``prompt``: 函数声明与描述,用于代码生成。 +* ``declaration``: 仅有函数声明,用于代码翻译。 +* ``canonical_solution``: 手写的示例解答。 +* ``test``: 隐藏测例,用于评测。 +* ``example_test``: 公共测试样本,用于评估生成代码。 +* ``prompt_text``: prompt文本情况。 +* ``prompt_explain``: prompt信息说明。 +* ``func_title``: 生成函数头信息。 +* ``prompt_text_chinese``: 中文prompt信息。 + +### 评测指标 +除了目前提供的[Codex](https://arxiv.org/abs/2107.03374) 中提出的无偏 pass@k 指标之外,我们还将huggingface开源的相关指标与[CodeBLEU](https://arxiv.org/abs/2009.10297)提出的相似性指标进行集成。 +目前建议用户主要使用的指标如下: +* ``codebleu``: codebleu相似性评测指标。 +* ``pass@k``: 无偏pass@k的评测指标。 +* ``bleu``: 文本相似性指标bleu +* ``bleurt``: 文本语义相似性指标bleurt +* ``total_time_cost``: 基于被评数据集、模型推理总耗时 +* ``Average time cost``: 基于被评数据集单个任务、模型推理平均耗时 + + +### 评测命令: +``` +bash codefuseEval/script/evaluation.sh +eg: +bash codefuseEval/script/evaluation.sh codefuseEval/result/test.jsonl pass@k humaneval_python +``` + +并在本仓库的根目录下使用如下指令(请谨慎执行,生成的代码可能有极低概率产生意外行为。在[execution.py](execution.py)中查看警告并取消执行代码的注释,风险自负): + +同时我们当前提供如下的标志位,可以直接将测试数据集中的示例解答作为生成答案带入进行测试。 +* ``TEST_GROUDTRUTH`` 取值为True或False + +当TEST_GROUDTRUTH为True时,开启self-test模式,将读取PROBLEM_FILE,将示例解答作为生成答案代入进行测试。 +TEST_GROUDTRUTH为False时,开启评测模式,读取RESULT_FILE和将读取PROBLEM_FILE,将生成答案代入进行测试 + +## 更多信息 + +### 使用自己的数据集评估自己的模型 +如果你想用自己的数据集评估自己的模型,可以参考以下步骤: +1. 注册自己的数据集 +* 下载评估数据集并存储在`codefuseEval/data`或其他目录中。 数据集必须是jsonl格式。 +* 针对于数据集路径、数据集任务模式task_mode和使用数据集后生成结果的代码语言情况,需要在`codefuseEval/util.py`中的`EVAL_DATASET`、`DATASET_SUPPORT`和`DATASET_LANGUAGE`变量中进行设置。 +2. 注册你的评测模型 +* 下载评估模型并存储在`codefuseEval/model`或其他目录中。 +* 在`codefuseEval/processor`包中编写评估模型处理器代码。 +#### 处理适配器 + +我们设计了一个名为Processor的基础结构,用户可以自己根据推理模型的情况创建自己需要的处理器, 主要目的是为了处理不同模型的区别情况进行处理,主要需要完成3个抽象函数: +``` +load_model_tokenizer: 由于模型加载参数的区别以及tokenizer的终止符的区别,模型需要使用不同的参数进行适配加载,当前函数主要是为了帮助用户加载适配不同的模型 +process_before:由于prompt根据用户不同的选择评测任务的类型或不同模型来适配不同的prompt样式,因此抽取出process_before函数主要用来帮助用户处理prompt +process_after:由于模型生成结果多样性,为了适配评测框架,方便生成结果数据可以拼接成合适的用例进行自动化运行,当前函数主要是根据任务类型和数据集情况,处理生成结果适配评测数据集和结果进行评测 +``` +您可以在`codefuseEval/processor/base.py`中查看`BaseProcessor`情况,创建自己模型的处理器,并实现上述函数功能 + +* 在`ckpt_config.json`中设置信息模型。 举例如下 +``` +{ + "CodeFuse-13B": { //模型名称 + "v1": { //模型版本 + "path": "/mnt/model/CodeFuse13B-evol-instruction-4K/", // 模型路径 + "processor_class": "codefuseEval.process.codefuse13b.Codefuse13BProcessor", // 模型处理器路径 + "tokenizer": { // 将prompt token化时tokenizer传入的参数 + "truncation": true, + "padding": true, + "max_length": 600 + }, + "generation_config": { //生成配置参数 + "greedy": { //如果是JsonObject,当前配置的是解码策略,可以通过设置下方「decode_mode」参数来加载生成配置参数中定义的不同的解码策略。 + "do_sample": false, + "num_beams": 1, + "max_new_tokens": 512 + }, + "beams": { + "do_sample": false, + "num_beams": 5, + "max_new_tokens": 600, + "num_return_sequences": 1 + }, + "dosample": { + "da_sample": true + }, + "temperature": 0.2, //如果不是 JsonObject,它是一个默认参数,我们将在 Generation_config 中设置默认值。 你可以通过读取解码策略中同名参数的方式覆盖当前参数的默认值。 + "max_new_tokens": 600, + "num_return_sequences": 1, + "top_p": 0.9, + "num_beams": 1, + "do_sample": true + }, + "batch_size": 1, // 单次生成的batch size大小 + "sample_num": 1, // 单条评测数据生成的样本数 + "decode_mode": "beams" // 选择在 Generation_config 中定义的解码模式 + } + } +``` + +### 检查数据集命令 +为了检查评估数据集提供的参考值是否正确,我们提供以下命令来检查数据集,针对于已经集成的数据集情况,检查数据集的命令如下所示 + +代码补全 +```bash +bash codefuseEval/script/check_dataset.sh humaneval_python + +bash codefuseEval/script/check_dataset.sh humaneval_java + +bash codefuseEval/script/check_dataset.sh humaneval_js + +bash 
codefuseEval/script/check_dataset.sh humaneval_rust + +bash codefuseEval/script/check_dataset.sh humaneval_go + +bash codefuseEval/script/check_dataset.sh humaneval_cpp +``` +自然语言生成代码 +```bash +bash codefuseEval/script/check_dataset.sh mbpp +``` +代码翻译 +``` +bash codefuseEval/script/check_dataset.sh codeTrans_python_to_java + +bash codefuseEval/script/check_dataset.sh codeTrans_python_to_cpp + +bash codefuseEval/script/check_dataset.sh codeTrans_cpp_to_java + +bash codefuseEval/script/check_dataset.sh codeTrans_cpp_to_python + +bash codefuseEval/script/check_dataset.sh codeTrans_java_to_python + +bash codefuseEval/script/check_dataset.sh codeTrans_java_to_cpp +``` +科学计算 +``` +bash codefuseEval/script/check_dataset.sh codeCompletion_matplotlib + +bash codefuseEval/script/check_dataset.sh codeCompletion_numpy + +bash codefuseEval/script/check_dataset.sh codeCompletion_pandas + +bash codefuseEval/script/check_dataset.sh codeCompletion_pytorch + +bash codefuseEval/script/check_dataset.sh codeCompletion_scipy + +bash codefuseEval/script/check_dataset.sh codeCompletion_sklearn + +bash codefuseEval/script/check_dataset.sh codeCompletion_tensorflow + +bash codefuseEval/script/check_dataset.sh codeInsertion_matplotlib + +bash codefuseEval/script/check_dataset.sh codeInsertion_numpy + +bash codefuseEval/script/check_dataset.sh codeInsertion_pandas + +bash codefuseEval/script/check_dataset.sh codeInsertion_pytorch + +bash codefuseEval/script/check_dataset.sh codeInsertion_scipy + +bash codefuseEval/script/check_dataset.sh codeInsertion_sklearn + +bash codefuseEval/script/check_dataset.sh codeInsertion_tensorflow +``` \ No newline at end of file diff --git a/content/zh/docs/codefuse-mft-vlm/1_quickstart.md b/content/zh/docs/codefuse-mft-vlm/1_quickstart.md new file mode 100644 index 0000000..5f9bbf7 --- /dev/null +++ b/content/zh/docs/codefuse-mft-vlm/1_quickstart.md @@ -0,0 +1,65 @@ +--- +title: 快速使用 +slug: 快速使用 +description: 快速使用 +aliases: +- "/docs/codefuse-mft-vlm-quickstart-zh" +--- + + +## Contents +- [Install](#Install) +- [Datasets](#Datasets) +- [Multimodal Alignment](#Multimodal-Alignment) +- [Visual Instruction Tuning](#Visual-Instruction-Tuning) +- [Evaluation](#Evaluation) + +## Install +请执行 sh init\_env.sh + +## Datasets +使用了以下数据集训练模型: + +数据集 | 任务种类 | 样本量 +| ------------- | ------------- | ------------- | +synthdog-en | OCR | 800,000 +synthdog-zh | OCR | 800,000 +cc3m(downsampled)| Image Caption | 600,000 +cc3m(downsampled)| Image Caption | 600,000 +SBU | Image Caption | 850,000 +Visual Genome VQA (Downsampled) | Visual Question Answer(VQA) | 500,000 +Visual Genome Region descriptions (Downsampled) | Reference Grouding | 500,000 +Visual Genome objects (Downsampled) | Grounded Caption | 500,000 +OCR VQA (Downsampled) | OCR and VQA | 500,000 + +请到各个数据集的官网上下载这些数据。 + +## Multimodal Alignment +请执行 sh scripts/pretrain.sh 或者 sh scripts/pretrain\_multinode.sh + + +## Visual Instruction Tuning +请执行 sh scripts/finetune.sh 或者 sh scripts/finetune\_multinode.sh + +## Evaluation +请执行 llava/eval/ 当中的python脚本. 
可以通过下面的代码来加载我们预训练的CodeFuse-VLM-14B: + +``` +import os +from llava.model.builder import load_mixed_pretrained_model + +model_path = '/pretrained/model/path' +tokenizer, model, image_processor, context_len = load_mixed_pretrained_model(model_path, None, 'qwen-vl-14b', os.path.join(model_path, 'Qwen-VL-visual'), 'cross_attn', os.path.join(model_path, 'mm_projector/mm_projector.bin')) +``` + +您也可以先运行下面的脚本来合并各个模型组件:scripts/merge\_qwen\_vl\_weights.sh,然后通过下面的代码加载合并后的模型: +``` +from llava.model import LlavaQWenForCausalLM + +model = LlavaQWenForCausalLM.from_pretrained('/path/to/our/pretrained/model') +``` + +## CodeFuse-VLM 产品视频 +这是我们模型支持的产品的视频 + +https://private-user-images.githubusercontent.com/22836551/300398424-201f667d-6b6b-4548-b3e6-724afc4b3071.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDY1MjE5MTIsIm5iZiI6MTcwNjUyMTYxMiwicGF0aCI6Ii8yMjgzNjU1MS8zMDAzOTg0MjQtMjAxZjY2N2QtNmI2Yi00NTQ4LWIzZTYtNzI0YWZjNGIzMDcxLm1wND9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI5VDA5NDY1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI0ZmJmZWNlNDZmNWM3NzA0OThlMmY1ODY4MDkxNWY5ZWNiNzRiYjJkYmE4NjEzM2EwYWRiNWY2ODc3N2ViYjEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.BIvWGNx0XV7RoauxB0c2noEdbfZfu8-16LPHtCaCJ9k \ No newline at end of file diff --git a/content/zh/docs/codefuse-modelcache/1_quickstart.md b/content/zh/docs/codefuse-modelcache/1_quickstart.md new file mode 100644 index 0000000..5961216 --- /dev/null +++ b/content/zh/docs/codefuse-modelcache/1_quickstart.md @@ -0,0 +1,50 @@ +--- +title: QuickStart +description: 介绍主要功能 +url: "/docs/codefuse-modelcache-quickstart-zh" +aliases: +- "/docs/codefuse-modelcache-quickstart-zh" +--- + + + +ModelCache易于使用,只需1步骤即可构建缓存测试Demo + +## 快速开始 +### 构建Cache +Cache的默认接口如下所示: +``` +class Cache: + # it should be called when start the cache system + def __init__(self): + self.has_init = False + self.cache_enable_func = None + self.embedding_func = None + self.post_process_messages_func = None + self.config = Config() +``` + +在创建ModelCache之前,请考虑以下问题: +- 你将如何为查询生成嵌入向量?(embedding_func) 该函数将文本嵌入到一个用于上下文相似性搜索的密集向量中。ModelCache可以支持多种嵌入上下文的方法:Huggingface、ONNX和SentenceTransformers。默认逻辑中,使用了在中文领域表现更好的huggingface中的text2vec模型。只需将你的嵌入函数初始化为:text2vec.to_embeddings + +``` +data_manager = get_data_manager(CacheBase("mysql", config=mysql_config), + VectorBase("milvus", dimension=data2vec.dimension, milvus_config=milvus_config)) + +cache.init( + embedding_func=data2vec.to_embeddings, + data_manager=data_manager, + similarity_evaluation=SearchDistanceEvaluation(), + query_pre_embedding_func=query_multi_splicing, + insert_pre_embedding_func=insert_multi_splicing, +) +``` + +- 你将在哪里缓存数据?(data_manager缓存存储) 缓存存储用于存储所有标量数据,例如原始问题、提示、答案和访问时间。ModelCache支持多种缓存存储选项,如SQLite、MySQL和OceanBase。未来还将添加更多的NoSQL数据库选项。 +- 你将在哪里存储和搜索向量嵌入?(data_manager向量存储) 向量存储组件用于存储和搜索所有嵌入向量,以便在语义上找到最相似的结果。ModelCache支持使用FAISS等向量搜索库或Milvus等向量数据库。未来还将添加更多的向量数据库和云服务选项。 + +以下是一些示例: +``` +data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=data2vec.dimension)) +data_manager = get_data_manager(CacheBase("oceanbase"), VectorBase("milvus", dimension=data2vec.dimension)) +``` diff --git a/content/zh/docs/codefuse-modelcache/2_feature.md b/content/zh/docs/codefuse-modelcache/2_feature.md new file mode 100644 index 0000000..440a448 --- /dev/null +++ 
b/content/zh/docs/codefuse-modelcache/2_feature.md @@ -0,0 +1,159 @@ +--- +title: 功能特性 +description: 介绍主要功能 +url: "/docs/codefuse-modelcache-feature-zh" +aliases: +- "/docs/codefuse-modelcache-feature-zh" +--- + + + + +功能方面,为了解决huggingface网络问题并提升推理速度,增加了embedding本地推理能力。鉴于SqlAlchemy框架存在一些限制,我们对关系数据库交互模块进行了重写,以更灵活地实现数据库操作。在实践中,大型模型产品需要与多个用户和多个模型对接,因此在ModelCache中增加了对多租户的支持,同时也初步兼容了系统指令和多轮会话。 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelCache 与 GPTCache 功能对比(按模块划分):

| 模块 | 功能 |
| --- | --- |
| 基础接口 | 数据查询接口 |
| 基础接口 | 数据写入接口 |
| Embedding | embedding模型配置 |
| Embedding | 大模型embedding层 |
| Embedding | bert模型长文本处理 |
| Large model invocation | 是否与大模型解耦 |
| Large model invocation | embedding模型本地加载 |
| 数据隔离 | 模型数据隔离 |
| 数据隔离 | 超参数隔离 |
| 数据库 | MySQL |
| 数据库 | Milvus |
| 数据库 | OceanBase |
| 会话管理 | 单轮会话 |
| 会话管理 | system指令 |
| 会话管理 | 多轮会话 |
| 数据管理 | 数据持久化 |
| 数据管理 | 一键清空缓存 |
| 租户管理 | 支持多租户(多模型) |
| 租户管理 | milvus多表能力 |
| 其他 | 长短对话区分能力 |
+ +## 核心功能 +在ModelCache中,沿用了GPTCache的主要思想,包含了一系列核心模块:adapter、embedding、similarity和data_manager。adapter模块主要功能是处理各种任务的业务逻辑,并且能够将embedding、similarity、data_manager等模块串联起来;embedding模块主要负责将文本转换为语义向量表示,它将用户的查询转换为向量形式,并用于后续的召回或存储操作;rank模块用于对召回的向量进行相似度排序和评估;data_manager模块主要用于管理数据库。同时,为了更好的在工业界落地,我们做了架构和功能上的升级,如下: + +- [x] 架构调整(轻量化集成):以类redis的缓存模式嵌入到大模型产品中,提供语义缓存能力,不会干扰LLM调用和安全审核等功能,适配所有大模型服务。 +- [x] 多种模型加载方案: + - 支持加载本地embedding模型,解决huggingface网络连通问题 + - 支持加载多种预训练模型embeding层 +- [x] 数据隔离能力 + - 环境隔离:可依据环境,拉取不同的数据库配置,实现环境隔离(开发、预发、生产) + - 多租户数据隔离:根据模型动态创建collection,进行数据隔离,用于大模型产品中多个模型/服务数据隔离问题 +- [x] 支持系统指令:采用拼接的方式,解决propmt范式中sys指令问题。 +- [x] 长短文本区分:长文本会给相似评估带来更多挑战,增加了长短文本的区分,可单独配置判断阈值。 +- [x] milvus性能优化:milvus consistency_level调整为"Session"级别,可以得到更好的性能。 +- [x] 数据管理能力: + - 一键清空缓存的能力,用于模型升级后的数据管理。 + - 召回hitquery,用于后续的数据分析和模型迭代参考。 + - 异步日志回写能力,用于数据分析和统计 + - 增加model字段和数据统计字段,用于功能拓展。 + +未来会持续建设的功能: + +- [ ] 基于超参数的数据隔离 +- [ ] system promt分区存储能力,以提高相似度匹配的准确度和效率 +- [ ] 更通用的embedding模型和相似度评估算法 \ No newline at end of file diff --git a/content/zh/docs/codefuse-modelcache/3_config.md b/content/zh/docs/codefuse-modelcache/3_config.md new file mode 100644 index 0000000..78d004a --- /dev/null +++ b/content/zh/docs/codefuse-modelcache/3_config.md @@ -0,0 +1,23 @@ +--- +title: 最佳配置 +description: 介绍主要功能 +url: "/docs/codefuse-modelcache-config-zh" +aliases: +- "/docs/codefuse-modelcache-config-zh" +--- + +## 环境依赖 +- python版本: 3.8及以上 +- 依赖包安装: + ```pip install requirements.txt ``` + +## 服务启动 +- 在启动服务前,应该进行如下环境配置: +- 安装关系数据库 mysql, 导入sql创建数据表,sql文件: reference_doc/create_table.sql +- 安装向量数据库milvus +- 在配置文件中添加数据库访问信息,配置文件为: + - modelcache/config/milvus_config.ini + - modelcache/config/mysql_config.ini +- 离线模型bin文件下载, 参考地址:https://huggingface.co/shibing624/text2vec-base-chinese/tree/main,并将下载的bin文件,放到 model/text2vec-base-chinese 文件夹中 +- 通过flask4modelcache.py脚本启动后端服务。 + diff --git a/content/zh/docs/codefuse-modelcache/4_release_note.md b/content/zh/docs/codefuse-modelcache/4_release_note.md new file mode 100644 index 0000000..02b7784 --- /dev/null +++ b/content/zh/docs/codefuse-modelcache/4_release_note.md @@ -0,0 +1,21 @@ +--- +title: 最佳配置 +description: 介绍主要功能 +url: "/docs/codefuse-modelcache-release-zh" +aliases: +- "/docs/codefuse-modelcache-release-zh" +--- + + + +| 时间 |功能 |版本号| +| ----- | ------ | ----- | +| 20230430| 完成GPTCache调研,开源流程在OpenAI接口上跑通,单节点形式 |无| +| 20230509| 1、完成技术选型及上下游交互方案
2、重新开发数据库模块,替换SQLalchemy框架<br>3、重构llm_handler模块,兼容codegpt,适配codegpt模型参数| V0.1.0|
+| 20230519| 1、根据环境动态选择codegpt服务模式<br>2、模型本地加载能力,以及预加载能力<br>3、增加本地路径依据环境动态加载能力| V0.1.1|
+| 20230522| 1、架构优化,调整为类redis结构,解耦大模型调用<br>2、关系数据库由sqlite切换至OceanBase<br>3、向量数据库由faiss切换至milvus<br>4、模型数据隔离能力<br>5、增加核心模块adapter_query、adapter_insert |V0.2.0|
+| 20230531| 1、线上环境上线,动态感知能力<br>2、embedding模型评测及选型<br>3、增加预发环境及数据隔离能力<br>4、增加原始query字段透出能力| V0.2.1|
+| 20230607| 1、优化关系数据库访问性能<br>2、优化环境和模型隔离能力| V0.2.2|
+| 20230630| 1、在modelCache中增加大模型embedding层适配模块<br>2、增加采纳率统计能力 |V0.2.3|
+| 20230730| 1、增加缓存统计功能<br>2、增加数据删除功能接口<br>3、缓存一键清空能力上线<br>4、多轮会话能力研发,支持system指令和多轮对话| v0.3.0|
+| 20230830| 1、增加异步处理能力,性能提升超20%<br>2、架构变更,解耦embedding推理和业务处理逻辑
3、黑名单过滤功能 |V0.3.1| \ No newline at end of file diff --git a/content/zh/docs/codefuse-query/1_abstract.md b/content/zh/docs/codefuse-query/1_abstract.md new file mode 100644 index 0000000..8a82567 --- /dev/null +++ b/content/zh/docs/codefuse-query/1_abstract.md @@ -0,0 +1,17 @@ +# 引言 +随着大规模软件开发的普及,对可扩展且易于适应的静态代码分析技术的需求正在加大。传统的静态分析工具,如 Clang Static Analyzer (CSA) 或 PMD,在检查编程规则或样式问题方面已经展现出了良好的效果。然而,这些工具通常是为了满足特定的目标而设计的,往往无法满足现代软件开发环境中多变和多元化的需求。这些需求可以涉及服务质量 (QoS)、各种编程语言、不同的算法需求,以及各种性能需求。例如,安全团队可能需要复杂的算法,如上下文敏感的污点分析,来审查较小的代码库,而项目经理可能需要一种相对较轻的算法,例如计算圈复杂度的算法,以在较大的代码库上测量开发人员的生产力。 + +这些多元化的需求,加上大型组织中常见的计算资源限制,构成了一项重大的挑战。由于传统工具采用的是问题特定的计算方式,往往无法在这种环境中实现扩展。因此,我们推出了 CodeQuery,这是一个专为大规模静态分析设计的集中式数据平台。 +在 CodeQuery 的实现中,我们把源代码和分析结果看作数据,把执行过程看作大数据处理,这与传统的以工具为中心的方法有着显著的不同。我们利用大型组织中的常见系统,如数据仓库、MaxCompute 和 Hive 等数据计算设施、OSS 对象存储和 Kubernetes 等灵活计算资源,让 CodeQuery 能够无缝地融入这些系统中。这种方法使 CodeQuery 高度可维护和可扩展,能够支持多元化的需求,并有效应对不断变化的需求。此外,CodeQuery 的开放架构鼓励各种内部系统之间的互操作性,实现了无缝的交互和数据交换。这种集成和交互能力不仅提高了组织内部的自动化程度,也提高了效率,降低了手动错误的可能性。通过打破信息孤岛,推动更互联、更自动化的环境,CodeQuery 显著提高了软件开发过程的整体生产力和效率。 +此外,CodeQuery 的以数据为中心的方法在处理静态源代码分析的领域特定挑战时具有独特的优势。例如,源代码通常是一个高度结构化和互联的数据集,与其他代码和配置文件有强烈的信息和连接。将代码视为数据,CodeQuery 可以巧妙地处理这些问题,这使得它特别适合在大型组织中使用,其中代码库持续但逐步地进行演变,大部分代码在每天进行微小的改动同时保持稳定。 CodeQuery 还支持如基于代码数据的商业智能 (BI) 这类用例,能生成报告和仪表板,协助监控和决策过程。此外,CodeQuery 在分析大型语言模型 (LLM) 的训练数据方面发挥了重要作用,提供了增强这些模型整体效果的深入见解。 + +在当前的静态分析领域,CodeQuery 带来了一种新的范式。它不仅满足了大规模、复杂的代码库分析需求,还能适应不断变化和多元化的静态分析场景。CodeQuery 的以数据为中心的方法,使得其在处理大数据环境中的代码分析问题时具有独特优势。CodeQuery 的设计,旨在解决大规模软件开发环境中的静态分析问题。它能够将源代码和分析结果视作数据,使得其可以灵活地融入大型组织的各种系统中。这种方法不仅可以有效地处理大规模的代码库,还可以应对各种复杂的分析需求,从而使得静态分析工作变得更加高效和准确。 + +CodeQuery 的特点和优势可以概括为以下几点: + +- **高度可扩展**:CodeQuery 可以处理大规模的代码库,且能够适应不同的分析需求。这种高度的可扩展性使得 CodeQuery 可以在大型组织中发挥重要作用。 +- **以数据为中心**:CodeQuery 将源代码和分析结果视作数据,这种以数据为中心的方法使其在处理大数据环境中的代码分析问题时具有独特优势。 +- **高度集成**:CodeQuery 能够无缝地融入大型组织的各种系统中,包括数据仓库、数据计算设施、对象存储和灵活计算资源等。这种高度的集成性使得 CodeQuery 在大型组织中的使用变得更加方便和高效。 +- **支持多元化的需求**:CodeQuery 不仅可以处理大规模的代码库,还可以应对各种复杂的分析需求,包括服务质量分析需求、跨编程语言分析需求、算法需求和性能需求等。 + +CodeQuery 是一种强大的静态代码分析平台,适合大规模、复杂的代码库分析场景。它的以数据为中心的方法和高度的可扩展性使得它在现代软件开发环境中具有独特的优势。未来,随着静态代码分析技术的不断发展,CodeQuery 有望在这个领域中扮演更加重要的角色。 diff --git a/content/zh/docs/codefuse-query/2_introduction.md b/content/zh/docs/codefuse-query/2_introduction.md new file mode 100644 index 0000000..9362795 --- /dev/null +++ b/content/zh/docs/codefuse-query/2_introduction.md @@ -0,0 +1,120 @@ +--- +title: CodeFuse-Query 介绍 +slug: CodeFuse-Query +description: 介绍主要功能 +url: docs/codefuse-query-introduction-zh +aliases: +- "/docs/codefuse-query-introduction-zh" +--- + + +# 概述 +CodeFuse-Query 是一个支持对 **各种编程语言** 进行 **结构化分析** 的 **代码数据平台**。核心思想是利用各种语言解析器将所有代码转化为数据,并将其结构化存储到代码数据库中。通过使用自定义查询语言,按照业务需求进行数据分析。如下图所示: +![image.png](/images/codefuse-query/introduction01.png) + +## 2.1 CodeFuse-Query的架构 +从整体上来说,CodeFuse-Query代码数据平台分为三大部分:代码数据模型、代码查询DSL、平台产品化服务。主要工作流程如下图所示: +### ![image.png](/images/codefuse-query/introduction02.png) + +### 代码数据化和标准化:COREF +我们定义了一种代码数据化和标准化的模型:COREF,要求所有代码都要能通过各种语言抽取器转化到该模型。 +COREF主要包含以下几种信息: +**COREF** = AST (抽象语法树) + ASG(抽象语义图) + CFG(控制流图) + PDG(程序依赖图)+ Call Graph(函数调用图) + Class Hierarchy (类继承关系)+ Documentation(文档/注释信息) +注:由于每种信息的计算难度不一,所以并不是所有语言的COREF信息均包含以上全部信息,基础信息主要有AST、ASG、Call Graph、Class Hierarchy和Documentation,其他信息( CFG 和 PDG )仍在建设中,后续会逐步支持。 +### 代码查询DSL +基于生成的COREF代码数据,CodeFuse-Query 使用一种自定义的DSL语言 **Gödel** 来进行查询,从而完成代码分析需求。 +Gödel是一种逻辑推理语言,它的底层实现是基于逻辑推理语言Datalog,通过描述“事实”和“规则”, 程序可以不断地推导出新的事实。Gödel也是一个声明式语言,相较于命令式编程,声明式编程更加着重描述“要什么”,而把如何实现交给计算引擎。 
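+下面用一个极简的 GödelScript 片段示意这种"描述要什么"的声明式风格(仅作示意:coref::java、JavaDB、Method 等符号的完整用法以下文 2.3 节的示例和 GödelScript 查询语言文档为准):
+
+```rust
+// script
+use coref::java::*
+
+// 加载 Java 代码数据库(即已抽取好的"事实")
+fn default_db() -> JavaDB {
+    return JavaDB::load("coref_java_src.db")
+}
+
+// 规则:只声明想要的结果(名字包含 get 的所有方法名),不描述如何遍历与计算
+fn getter_names(name: string) -> bool {
+    for (m in Method(default_db())) {
+        if (name = m.getName() && m.getName().contains("get")) {
+            return true
+        }
+    }
+}
+
+fn main() {
+    output(getter_names())
+}
+```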
+既然代码已经转化为关系型数据(COREF数据以关系型数据表的形式存储),相信大家会有疑问,为什么不直接用SQL,或者是直接使用SDK,而是又要专门去学习一个新的DSL语言呢?因为Datalog的计算具备单调性和终止性,简单理解就是,Datalog是在牺牲了表达能力的前提下获得了更高的性能,而Gödel继承了这个特点。 + +- 相比较SDK,Gödel的主要优点是易学易用,声明式的描述,用户不需要关注中间的运算过程,只需要像SQL一样简单描述清楚需求即可。 +- 相比较SQL,Gödel的优点主要是描述能力更强、计算速度更快,例如描述递归算法和多表联合查询,而这些对于SQL来说都是比较困难的。 +### 平台化、产品化 +CodeFuse-Query 包括**Sparrow CLI **和CodeFuse-Query**在线服务Query中心**。Sparrow CLI包含了所有组件和依赖,例如抽取器,数据模型,编译器等,用户完全可以通过使用Sparrow CLI在本地进行代码数据生成和查询(Sparrow CLI的使用方式请见 第3节 安装、配置、运行)。如果用户有在线查询的需求,可以使用Query中心进行实验。 +## 2.2 CodeFuse-Query支持的分析语言 +截至2023-10-31为止,CodeFuse-Query支持对11种编程语言进行数据分析。其中对5种编程语言( Java、JavaScript、TypeScript、XML、Go )的支持度非常成熟,对剩余6种编程语言(Object-C、C++、Python3、Swift、SQL、Properties )的支持度处于beta阶段,还有进一步提升和完善的空间,具体的支持情况见下表: + +| 语言 | 状态 | COREF模型节点数 | +| --- | --- | --- | +| Java | 成熟 | 162 | +| XML | 成熟 | 12 | +| TS/JS | 成熟 | 392 | +| Go | 成熟 | 40 | +| OC/C++ | beta | 53/397 | +| Python3 | beta | 93 | +| Swift | beta | 248 | +| SQL | beta | 750 | +| Properties | beta | 9 | + +注:以上语言状态的成熟程度判断标准是根据COREF包含的信息种类和实际落地情况来进行判定,除了OC/C++外,所有语言均支持了完整的AST信息和Documentation信息,以Java为例,COREF for Java还支持了ASG、Call Graph、Class Hierarchy、以及部分CFG信息。 +## 2.3 CodeFuse-Query的使用场景 +### 查询代码特征 +小开发同学想知道 Repo A 里面使用了哪些 String 型的变量,所以他写了一个 Godel 如下,交给 CodeFuse-Query 系统给他返回了结果。 +```rust +// script +use coref::java::* + +fn out(var: string) -> bool { + for(v in Variable(JavaDB::load("coref_java_src.db"))) { + if (v.getType().getName() = "String" && var = v.getName()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` +类似需求:查询:类,函数,变量,返回值,调用图,类继承等等。 + +### 输出静态分析能力 +小安全是 XX 团队的安全同学,他做了**一套系统**交叉验证日志数据和代码数据是否一致。为了完成某个分析任务,他计划通过写 Godel 查询出来静态数据 D1,合并动态数据 D2,联合分析得出结论 C。小安全通过在 CodeFuse-Query 上面编写 Godel Query 测试技术上可行之后,使用 CodeFuse-Query 提供的标准 API 将系统对接了起来。 +类似需求:通过静态分析进行系统的卡点,提高测试的效率,通过分析出来的数据合并成说明文档。 +### 代码规则检查器 +小 TL 同学发现团队总是写出很多类似的 Bug A,**他想针对 Bug A 制定一个代码规则和其检查器**,并在 CodeReview 阶段做个卡点。小 TL 通过在 CodeFuse-Query 平台上面编写了一段分析 Query,在平台上面测试符合要求,把这段分析 Query 固化下来作为一个代码规则,并上线到了 CodeReview/CI 阶段。从此这个 Bug 再也没发生过了。 +类似需求:编写静态缺陷扫描规则进行代码风险拦截。 +### 分析代码特性 +研发部同学小框架想知道目前代码仓库中Spring工程和Spring Boot工程比例。 好量化新框架的推广情况。小架构通过编写 Godel Query 描述不同项目分析特征,**然后一次性 Query 了 11 万个代码仓库**,过了几十分钟后就拿到了所有代码的数据,开开心心做 KPI 去了。 +类似需求:应用画像,代码画像,架构分析。 +### 获取统计数据 +小研究发现传统的代码复杂度指标很难准确地衡量代码的复杂情况,通过学习国际先进经验加上自我灵光一闪,设计了一套复杂度指标和算法。通过 Godel 实现出来以后,**发现不怎么优化就已经性能非常高了**,很快就应用到了 10 几种语言,11+万个仓库当中去了。马上就对代码仓库整体的复杂度有了深入的了解。相比较以前需要自己解析代码,分析语法树,对接系统,**不知道方便了多少。** +类似需求:代码统计,代码度量,算法设计,学术研究。 +### 架构分析 +小架构同学最近推行了一种新的基于 txt 文件的消息中间件,目前已有的分析平台都不能支持分析此类系统的上下游依赖。小架构通过 Godel**快速建模了该消息格式**,并马上获取到了目前系统中不同组件的依赖关系。 +类似需求:系统 Overview,架构治理,血缘分析。 +### 模型验证 +小促销设计的系统里面要求用户一定是先玩游戏再领券。他通过 Godel 描述了**该模型的验证逻辑**,然后通过 CodeFuse-Query 系统**保障当前以及未来系统的代码实现**,都是完全符合该模型的。从此再不担心游戏出资损~ +类似需求:系统验证,网络验证,权限验证 +## 2.4 CodeFuse-Query的应用领域 +目前,CodeFuse-Query在蚂蚁集团已经支持 **CodeFuse大语言模型数据清洗**、**代码度量评估**、**研发风险控制**、**隐私安全分析**、**代码智能**、**终端包大小治理 **等多个场景的落地应用,服务月均调用量超过百万。 +![image.png](/images/codefuse-query/introduction03.png) + +### 高质量代码数据清洗 - CodeFuse代码大模型 +CodeFuse代码大模型是蚂蚁集团对外开源的处理代码相关问题的模型,对于CodeFuse大语言模型而言,训练的数据质量直接影响模型的推理结果。低质量的代码数据会直接污染语言模型的输出,例如:模型可能会学习到错误的代码模式,从而生成错误的代码;数据中只包含某种编程语言的代码,模型可能无法很好地适应其他编程语言的代码。 +为了把控进入模型的代码数据质量,进而提升模型的推理能力。我们基于蚂蚁程序分析团队多年的实践积累结合业界共识,梳理了高质量代码的定义方式,并利用已有程序分析技术实现了自动化、大规模的代码数据清洗。 +CodeFuse-Query为CodeFuse代码大模型提供了以下数据清洗能力: + +- 高质量代码数据清洗:对代码数据进行清洗,包括对 Python,Java,JavaScript,TypeScript,Go,C,C++ 7 种语言进行漏洞扫描,对语言种类 / star 数进行筛选,过滤有效代码行数为 0 的数据等。目前已沉淀清洗后的 GitHub 和蚂蚁内部代码数据总共约 **2TB**。 +- 代码画像:实现对大规模代码进行高性能多维度的自动标注,支持 Java, Scala, Kotlin, JavaScript, JSX, TypeScript, 
TSX, Vue, Python, Go 等 **10** 种语言,**77** 种通用标签,**40** 种蚂蚁特有标签,共 **117** 种标签。目前自动标注性能能够达到 **40MB/s**。 +- 其他原子能力 + - 高级代码特征提取,包括提取 AST(抽象语法树),DFG(数据流图)数据等。目前 AST 信息已用于 SFT 训练,准确率 97% 左右。 + - 代码片段识别,用于针对文本数据中的代码进行提取,方便进行代码格式化或加上 Markdown 格式: + - 文本提取代码:从文本中提取代码块信息,支持主流语言的解析,函数及类定义,仅验证二分类问题,就是说仅验证文本是否含有代码块准确率 83% 左右。 + - 识别代码片段的编程语言种类:识别任意代码片段的编程语言种类,支持 30+ 种语言,准确率80%左右。 + - 代码注释对提取:支持提取方法级别的注释-代码对信息,覆盖 **15 种** GitHub 最流行的语言,用于 Text To Code/Code To Text 的 SFT 训练。 +### 代码数据指标 - 广目 +广目是蚂蚁内部一款面向不同职能的研发同学和团队管理者,对代码力进行评估、展示客观数据和分析结果的数据产品。 +广目提供了个人代码力评估报告、日常代码力指标数据分析、团队代码力管理、代码评优荣誉展示等功能,旨在帮助蚂蚁研发工程师不断提升代码品质、减少代码负债,更长远的提升研发效能。 +CodeFuse-Query为广目提供的能力分为两部分: + +- 代码评估指标:代码复杂度、代码注释率、标准开发量等 +- 代码评优指标:代码复用度 +### 变更分析-优酷服务端研发效能 +优酷质量保障团队从2023年开始针对服务端精准测试的探索,经过半年的技术沉淀和体系搭建,形成了具备**变更内容识别、变更影响分析、测试能力推荐、测试覆盖评估**的精准测试体系。 +在此过程中,CodeFuse-Query能提供的能力主要有: + +- 根据代码变更内容(文件+行号),分析出影响的对象:方法、入口(http入口、hsf入口)、调用链路(从入口到变更方法的所有调用链路)、数据库操作(表、操作类型) +- 结合线上动态调用链路(方法链路)、CodeFuse-Query静态分析调用链路的影响面精准分析能力,提升变更分析影响面的有效性、准备率 + +到目前为止,优酷已通过CodeFuse-Query接入所有核心应用,并基于静态分析采集数据,构建了服务端完整的代码知识库和流量知识库。 + diff --git a/content/zh/docs/codefuse-query/3_install_and_run.md b/content/zh/docs/codefuse-query/3_install_and_run.md new file mode 100644 index 0000000..962b5c4 --- /dev/null +++ b/content/zh/docs/codefuse-query/3_install_and_run.md @@ -0,0 +1,175 @@ +--- +title: 快速开始 +slug: 快速开始 +description: CodeFuse介绍主要功能 +url: /docs/codefuse-query-quickstart-zh +aliases: +- "/docs/codefuse-query-quickstart-zh" +--- + +# 安装、配置、运行 + +## 硬件和软件要求 + +- 硬件:4C8G + +- 环境要求:java 1.8 和 python3.8 以上执行环境, 请保证 java python 可执行环境 + +## Sparrow 安装步骤和指导 + +- CodeFuse-Query 下载包是一个 zip 存档,其中包含工具、脚本和各种特定于 CodeFuse-Query 的文件。如果您没有 CodeFuse-Query 许可证,那么下载此存档即表示您同意 [CodeFuse-Query 条款和条件](../LICENSE)。 +- 目前仅支持 mac,linux 系统下使用 CodeFuse-Query,下载地址为:(目前仅给出示例,开源后给出正式下载地址) + - mac: [CodeFuse-Query 2.0.0](https://github.com/codefuse-ai/CodeFuse-Query/releases/tag/2.0.0) + - linux: [CodeFuse-Query 2.0.0](https://github.com/codefuse-ai/CodeFuse-Query/releases/tag/2.0.0) +- 您应该始终使用 CodeFuse-Query 捆绑包,确保版本兼容性 + +### Tips: + +- mac系统下直接下载软件包会提示需要验证开发者 + +![image.png](/images/codefuse-query/macos_cannot_open_godel.png) + +- 可在安全性设置中进行修改验证 + +![image.png](/images/codefuse-query/security_allow_godel_run.png) + +- 点击仍然允许 + +- 详细步骤可参照:[Mac 官方文档: 如何在 Mac 上安全地打开 App](https://support.apple.com/zh-cn/HT202491) + +- 或使用`xattr -d com.apple.quarantine`命令,删除 CodeFuse-Query 被 macOS 赋予的外部属性 + +- `xattr -d com.apple.quarantine`是一个命令行指令,用于删除文件的 `com.apple.quarantine` 扩展属性。该扩展属性是 macOS 系统用来标记从外部来源下载的文件或应用程序的属性,以确保安全性。 + +```java +xattr -d com.apple.quarantine path/to/file +``` + +## 配置和初始化 CodeFuse-Query 开发环境 + +- 解压缩:命令行解压或者直接点一下解压缩即可 + +- 需要具备 java8 和 python3.8 以上执行环境 + +- CodeFuse-Query 解压后,您可以通过以下几种方式运行可执行文件来运行 sparrow 进程: + +- 通过执行 `/sparrow-cli/sparrow`,其中 `` 是提取CodeFuse-Query包的文件夹。 + +- 通过添加 `/sparrow-cli` 到您的 PATH,以便您可以直接运行可执行文件 sparrow。 + +此时,您可以执行 sparrow 命令。 + +## 运行 + +### 执行步骤 + +- 确认需要执行查询的源代码目录 + +- 抽取源代码的代码数据 + +- 基于代码数据编写 godel 脚本,获取自己想要的代码数据 + +- godel 脚本如何编写参照 [GödelScript 查询语言](./4_godelscript_language.md) + +### 执行样例 + +#### 数据抽取 +```java +/sparrow-cli/sparrow database create -s -lang -o +``` + +- `` 代码库抽取出的代码数据的输出目录,后文数据库位置:`` + +- `` 需要进行代码抽取的语言,分析 java 则填写 java + +- `` 需要扫描的源代码目录 + +- 在数据抽取步骤,获得脚本执行需要的数据库 `` + +#### 编写godel脚本 + +- 假设具备如下 godel 脚本, 获取指定仓库的所有 java 方法名 + +- godel 脚本具体编写可参照 [GödelScript 查询语言](./4_godelscript_language.md) + +```java +// script +use coref::java::* + +// 定义全局java数据库 +fn default_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +// 
遍历所有方法,获取方法名,输出限制 +fn getFunctionName(name: string) -> bool { + let (db = default_db()) { + for (method in Method(db)) { + if (name = method.getName()) { + return true + } + } + } +} + + +fn main() { + output(getFunctionName()) +} +``` + +#### 脚本执行 +```java +/sparrow-cli/sparrow query run -d -gdl -o +``` + +- `` 需要扫描的代码库抽取出的代码数据,与上文的 `` 一致 + +- `` godel 脚本所在路径,可填写所在目录,会依次执行所在目录下所有以`.gdl`结尾的文件 + +- `` 输出路径目录,xxx.gdl 的执行结果会以 json 格式存入 `/xxx.json` 中 + +- 可通过查看数据文件确认脚本执行是否正确 + +#### 例子 + +若存在以下java代码 + +```java +public class HelloWorld { + public static void main(String[] args) { + HelloWorld tmp = new HelloWorld(); + String hello = tmp.getHello(); + String world = tmp.getWorld(); + System.out.println(hello + " " + world); + } + + public String getHello() { + return "Hello"; + } + + public String getWorld() { + return "World"; + } +} + +``` + +```java +sparrow database create -s -lang java -o ./db/ +sparrow query run -d ./db/ -gdl example.gdl -o ./ +``` + +- `` 为上述给出的 java 文件存储目录 + +- example.gdl 为上述给出的 gdl 示例,存储到当前目录 + +- 执行完毕后可在当前目录下找到 example.json 文件 + +对应的脚本输出 json 文件内容如下 +```java +[{"name": "getHello"}, +{"name": "getWorld"}, +{"name": "main"}] + +``` diff --git a/content/zh/docs/codefuse-query/4_godelscript_language.md b/content/zh/docs/codefuse-query/4_godelscript_language.md new file mode 100644 index 0000000..06522ff --- /dev/null +++ b/content/zh/docs/codefuse-query/4_godelscript_language.md @@ -0,0 +1,2295 @@ +--- +title: 查询语言介绍 +slug: 查询语言介绍 +description: CodeFuse介绍主要功能 +url: /docs/codefuse-query-godellanguage-zh +aliases: +- "/docs/codefuse-query-godellanguage-zh" +--- + + +# GödelScript 查询语言 + +## 目录 + +- [GödelScript 基本概念和语法](#gödelscript-基本概念和语法) + - [简介](#简介) + - [基本程序构成](#基本程序构成) + - [基础类型和编译器内建函数](#基础类型和编译器内建函数) + - [函数](#函数) + - [语句](#语句) + - [Schema](#schema) + - [数据库](#数据库) + - [Trait](#trait) + - [Import](#import) + - [Query](#query) + - [Ungrounded Error: 未赋值/未绑定错误](#ungrounded-error-未赋值未绑定错误) +- [查询示例](#查询示例) + - [Java](#java) + - [Python](#python) + - [JavaScript](#javascript) + - [XML](#xml) + - [Go](#go) +- [查询调试和优化技巧](#查询调试和优化技巧) + - [Schema 传参导致笛卡尔积过大](#schema-传参导致笛卡尔积过大) + - [多层 for 导致笛卡尔积过大](#多层-for-导致笛卡尔积过大) + - [不要滥用`@inline`](#不要滥用inline必须用inline的优化策略) +- [在本机使用查询脚本流程](#在本机使用查询脚本流程) + +## GödelScript 基本概念和语法 + +### 简介 + +```rust +// script +fn hello(greeting: string) -> bool { + return greeting = "hello world!" +} + +fn main() { + output(hello()) +} +``` + +GödelScript 即 Gödel 查询语言。GödelScript 是 CodeQuery 用于查询和数据处理的领域专用语言 (DSL)。GödelScript 使用了类 Rust 的语法,提供了严格的类型检查、方便快捷的类型推导、智能友好的错误提示信息,使用户能够快速上手。 + +GödelScript 编译器主要应用场景为: + +1. 面向用户编写简单或复杂查询,提供更便捷的写法,提高编写查询的效率; +2. 提供严格类型检查与类型推导,给予更智能的代码修改提示; +3. 提供严格的 [ungrounded(未赋值/未绑定)](#ungrounded-error-未赋值未绑定错误) 检测,避免触发 Soufflé Ungrounded Error; +4. 
Language Server 以及 IDE Extension 支持。 + +### 基本程序构成 + +#### 程序结构 + +GödelScript 程序可能包含: + +- [模块和符号引用](#import) +- [Schema 类型声明](#schema) +- [数据库类型声明](#数据库) +- [Trait 声明](#trait) +- [Schema 方法实现](#方法实现) +- [函数声明和实现](#函数) +- [Query 声明](#query) + +包含以上所有组成内容的样例: + +```rust +// script +// 包引入/符号引入 +use coref::java::* // 引入所有符号 +use coref::java::{JavaDB, Class} // 选择性引入符号 + +// 函数声明 +fn default_db() -> JavaDB { + return JavaDB::load("example.db") +} + +// schema 声明 +schema File { + @primary id: int +} + +// database 声明 +database NewDB { + file: *File +} + +// trait 声明 +trait FileTrait { + fn getId(self) -> int; +} + +// impl trait for +impl FileTrait for File { + fn getId(self) -> int { + return self.id + } +} + +// impl +impl File { + @data_constraint + fn all() -> *File { + yield File {id: 1} + yield File {id: 2} + } +} + +// query +query get_all_anno from + Annotation anno in Annotation(default_db()) +select + anno.id as id +``` + +#### 注释 + +GödelScript 采用类 C 语言的注释方式。 + +```rust +// 单行注释 + +/* +* 1. 多行注释 +* 2. 多行注释 +*/ +``` + +#### `main` 函数 + +GödelScript 查询脚本可以包含`main`函数,该函数无返回值。在不实现`main`函数,且没有写 query 声明的情况下,程序不会输出。 + +更多详细内容请看 [main 函数](#gödelscript-main-函数)。 + +```rust +fn main() { + output(query_1()) + output(query_2()) +} +``` + +### 基础类型和编译器内建函数 + +GödelScript 包含基础类型`int` `string`,`bool`属于基础类型,但是不能作为值存储。 + +#### `int`类型 native 函数 + +| 函数 | 类型 | 解释 | +| --- | --- | --- | +| pow | (int, int) -> int | 乘方。参数只能非负数。 | +| rem | (int, int) -> int | 取余。 | +| bitand | (int, int) -> int | 按位与。 | +| bitor | (int, int) -> int | 按位或。 | +| bitxor | (int, int) -> int | 按位异或。 | +| bitnot | (int) -> int | 按位非。 | +| neg | (int) -> int | 算术取反。 | +| to_string | (int) -> string | 转换为字符串。 | +| add | (int, int) -> int | + | +| sub | (int, int) -> int | - | +| mul | (int, int) -> int | * | +| div | (int, int) -> int | / | +| eq | (int, int) -> bool | = | +| ne | (int, int) -> bool | != | +| gt | (int, int) -> bool | > | +| ge | (int, int) -> bool | >= | +| lt | (int, int) -> bool | < | +| le | (int, int) -> bool | <= | +| to_set | (int) -> *int | 转为集合类型。 | + +#### `string`类型 native 函数 + +| 函数 | 类型 | 解释 | +| --- | --- | --- | +| len | (string) -> int | 获取字符串长度。 | +| substr | (string, int, int) -> string | 通过初始index和length来截取字符串。 | +| contains | (string, string) -> bool | 判断一个字符串是否被包含在当前字符串中。 | +| matches | (string, string) -> bool | 判断正则字符串是否完全匹配当前字符串。 | +| get_regex_match_result | (string, string, int) -> string | 获取被正则字符串完全匹配当前字符串时的某一个捕获结果,该结果由第二个参数(int)确定。如 "abcdef".get_regex_match_result("a(.*)f", 1) 的结果是 "bcde"。 | +| to_int | (string) -> int | 转换为整数。 | +| add | (string, string) -> string | 字符串拼接。 | +| eq | (string, string) -> bool | 判断字符串相等。 | +| ne | (string, string) -> bool | 判断字符串不相等。 | +| to_set | (string) -> *string | 转为集合类型。 | + +#### `bool`类型 native 函数 + +`bool`虽然作为基础类型存在,但是该类型不能作为数据参与中间计算,只能作为条件结果。 + +| 函数 | 类型 | 解释 | +| --- | --- | --- | +| not | (bool) -> bool | 条件取反。 | +| and | (bool, bool) -> bool | 条件与。 | +| or | (bool, bool) -> bool | 条件或。 | +| eq | (bool, bool) -> bool | 相等。 | +| ne | (bool, bool) -> bool | 不相等。 | + +#### 作用于集合的 native 函数 + +| 函数 | 类型 | 解释 | +| --- | --- | --- | +| len | (*T) -> int | 获取数据集合的数量。 | +| max | (*int) -> int | 查找最大值。 | +| min | (*int) -> int | 查找最小值。 | +| sum | (*int) -> int | 求和。 | +| find | (*T0) -> T1 | 从一个集合中,通过主键查找数据。 | + +#### 全局 native 函数 + +| 函数 | 类型 | 解释 | +| --- | --- | --- | +| output | ((...) 
-> bool) -> | 输出 query 内容。 | + +#### database 的 native 函数 + +| 函数 | 类型 | 解释 | +| --- | --- | --- | +| load | (string) -> T | 加载 database 。 | + +#### schema 的 native 函数 + +| 函数 | 类型 | 解释 | +| --- | --- | --- | +| to | (self) -> T | 转换到其他类型的 schema,采用 duck type 检测。 | +| is | (self) -> bool | 判断是否可以是其他类型的 schema,采用 duck type 检测。如果自身 schema 有主键,则底层只会通过主键判断是否可以是其他类型。 | +| key_eq | (self, T) -> bool | 检查两个 schema 实例的主键是否相等。 | +| key_neq | (self, T) -> bool | 检查两个 schema 实例的主键是否不等。 | + +schema native 函数实例: + +```rust +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +fn example() -> bool { + for(stmt in StatementParent(default_java_db())) { + if (stmt.is()) { + return true + } + } +} + +fn convert() -> *ElementParent { + for(stmt in StatementParent(default_java_db())) { + yield stmt.to() + } +} +``` + +### 函数 + +#### GödelScript `main` 函数 + +`main`函数是 GödelScript 中唯一不声明返回值的函数。`main`函数只允许使用`output`,其他语句会导致编译错误;多次使用`output(...)`可以输出多个查询结果,查询结果会分表显示,表名即为`output`中调用的查询函数的函数名。 + +#### 查询函数 + +查询函数的返回值类型推荐为`bool`,需要输出查询结果时,需要使用`output()`函数。 + +在`output()`中调用的查询函数不再是常规思路中的用传参调用函数。参数列表在此时会变化为输出表的表结构,下面是两个查询函数的应用实例: + +1. 单表`output` + + 单表`output`特指在`main`函数中,只使用一次`output`来输出。 + + ```rust + fn example(a: int, b: string) -> bool {...} + + fn main() { + output(example()) // 此时参数列表变为输出表结构,不需要传参 + } + ``` + + 对应的输出表结构为: + + ```json + [ + {"a": 0, "b": "xxx"}, + {"a": 1, "b": "xxx"} + ] + ``` + +2. 多表`output` + + 多表`output`是指在`main`函数中,使用多次`output`来输出。在这种情况下,输出数据会附带对应的表名。 + + ```rust + fn example0(a: int, b: string) -> bool {...} + fn example1(a: string, b: int) -> bool {...} + + fn main() { + output(example0()) + output(example1()) + } + ``` + + 对应的输出表结构为: + + ```json + { + "example0":[ + {"a": 0, "b": "xxx"}, + {"a": 1, "b": "xxx"} + ], + "example1":[ + {"a": "xxx", "b": 0}, + {"a": "xxx", "b": 1} + ] + } + ``` + +下面是一个比较详细的例子,在这个例子中,我们直接构造了两组数据并输出。在下列代码中,需要注意的是: + +1. GödelScript 中,布尔值可以使用`true`和`false`关键字。 + +2. `=`符号在 GödelScript 中是比较特殊的符号,不能用常规的编程语言的思路来理解。GödelScript 是一种 Datalog 语言。在这里,`=`符号同时具备两种语义,一个是 __赋值__ 一个是 __判等__。详情可看[`=`运算符](#赋值和判等运算符)。 + +3. 在这个例子的条件语句中,`a`和`b`均使用了`=`的赋值语义,因为`int`和`string`类型参数在函数体中被认为是`ungrounded(未赋值/未绑定)`,必须要被赋值才能使用。 + +4. `=`赋值语句的返回值是`true`。 + +```rust +fn example(a: int, b: string) -> bool { + // = 符号同时具有赋值和比较的功能,取决于此时的左值是否已经“被赋值” + // 这里的 a 和 b 所用的 = 符号均是赋值语义 + if (a = 1 && b = "1") { + // GödelScript 使用关键字 true 和 false 来表示布尔值 + return true + } + if (a = 2 && b = "2") { + return true + } +} + +fn main() { + output(example()) +} +``` + +预期的输出结果应该为: + +```json +[ + {"a": 1, "b": "1"}, + {"a": 2, "b": "2"} +] +``` + +#### 普通函数 + +普通函数用于封装一些复杂过程,这些函数必须要有明确的返回类型。 +其中返回类型可以存在两种情况: + +1. 单个返回值,箭头后面声明返回值类型。 + +```rust +fn getFile(c: Class) -> File { + return c.getRelativePath() +} +``` + +2. 返回集合,箭头后面的返回值类型前需要加上`*`以表明其返回的是一个集合。 + +```rust +fn getAllFiles(db: JavaDB) -> *File { + for (f: File in File(db)) { + yield f + } +} +``` + +一般情况下要求单返回值使用`return`语句,而多返回值/返回集合使用`yield`语句。 +实际使用中,由于 GödelScript 底层使用 Datalog 引擎,故任何的运算都是基于集合的,单返回值实际上仅仅意味着返回的集合可能只包含一个数据,但是也可能包含多个数据。 + +### 语句 + +#### for 语句:从集合中声明变量 + +GödelScript 使用`for`关键字和类似循环语句的语法来从集合中声明变量: + +```rust +for(f: File in getAllFiles()) { + ... +} +``` + +其中`f: File`,冒号后面跟着的是`f`的类型,可省略。 +`for`语句中允许直接定义多个变量,后面定义的变量在初始化时可使用同一语句中在它前面定义的所有变量: + +```rust +for(a in XmlAttribute(db), b in XmlAttribute(db), c in XmlElement(db)) { + ... +} + +for(a in getAllFiles(), b in a.getAllPaths()) { + ... 
+} +``` + +#### let 语句:声明单一变量 + +GödelScript 使用 `let`关键字来声明一个单一/中间变量: + +```rust +let(f: File = c.getRelativePath()) { + ... +} +``` + +其中`f: File`,冒号后面的类型可省略。 +`let`语句中允许直接定义多个变量,后面定义的变量在初始化时可使用同一语句中在它前面定义的所有变量: + +```rust +let(a = 1, b = a + 1, c = b + 1) { + ... +} +``` + +#### if 语句 + +GödelScript 的条件语句与许多过程式程序语言类似: + +```rust +if (f.getName().contains("util") || f.getName().contains("com")) { + ... +} +``` + +条件可以使用这些逻辑运算符进行连接:`!`取反,`||`或,`&&`与。 + +条件中的比较运算符:`>`大于,`<`小于,`>=`大于等于,`<=`小于等于,`=`等于或者赋值,`!=`不等于。 + +常规算术运算可以使用如下运算符:`+`加法,`-`减法/取负,`*`乘法,`/`除法。 + +##### 赋值和判等运算符`=` + +`=`符号在 GödelScript 中具有两种不同的语义:赋值和判等,具体的语义需要分情况进行讨论: + +1. 赋值 + + 赋值一般出现在`int` `string`这类基础类型的变量参数上,这类变量作为函数的参数出现时,一般被认为是未赋值的。而具有这类变量的函数被调用时,传入的参数,实际上是作为筛选条件存在。 + + ```rust + fn example(a: int) -> bool { + // 这里比较反直觉,在过程式语言中,这里通常会被认为是判断 a == 1 + // 但是在 datalog 方言中,datalog 的每个函数实际上都是在算一个中间表 (view) + // 所以这个函数本质上是生成了一个 view,数据为 [{"a": 1}] + return a = 1 // assign a = 1 + } + + fn test() -> bool { + // 这里看似是在通过传参让 a = 2,实际上并不是 + // example() 自己会返回 view: [{"a": 1}] + // 然后通过 a = 2 来约束结果,可以看到,我们这里没有拿到任何结果 + // 所以返回了 false + return example(2) // false + } + ``` + +2. 判等 + + 对于 schema 类型来说,任何一个 schema 背后都有一个全集,所以参数列表中的 schema 类型一般被认为是已经被赋值的。对于已经赋值的变量来说,`=`就是判等操作。 + + ```rust + // 声明 schema + schema A {...} + + // 实现 schema 的成员函数 + impl A { + // 这里定义了 schema A 的全集 + @data_constraint + pub fn __all__() -> *A {...} + } + + fn example(a: A) -> bool { + for(temp in A::__all__()) { + if (a = temp) { + return true + } + } + } + ``` + + 同样,对于中间声明的有初始值的`int`或者`string`,`=`也是判等操作。 + + ```rust + fn example() -> bool { + let (a = 1) { // assign a = 1 + if (a = 1) { // compare a = 1 + return true + } + } + } + ``` + +#### match 语句 + +GödelScript 允许对`int`和`string`类型编写 match 语句,match 语句是类似 switch 的多条件分支语句,match 的条件必须为字面量: + +```rust +match(a) { + 1 => return 0, + 2 => return 1, + 3 => if (a + 1 < 10) { + return 10 + } +} +``` + +#### 返回语句 + +GödelScript 使用`return`和`yield`。`return`用于单个返回值的函数,`yield`用于集合的返回。 + +```rust +fn a() -> int { + return 0 +} + +fn b() -> *int { + yield 1 + yield 2 + yield 3 +} +``` + +### Schema + +Schema 是 GödelScript 中的复杂数据表的结构。 + +#### 结构声明 + +GödelScript 使用`schema`关键字来声明一个表结构: + +```rust +schema File { + id: int, + name: string +} +``` + +如果某个字段在数据库中是作为主键存在的,可以使用`@primary`注解来表明其为主键: + +```rust +schema File { + @primary id: int, + name: string +} +``` + +**有主键的表结构会使得查询速度得到显著提升,所以尽量绑定一个主键,主键应尽量为**`**int**`**类型。** + +#### 方法实现 + +GödelScript 使用如下方式来声明和实现`schema`的相关方法: + +```rust +impl File { + // 静态方法 + fn f1() -> ... {...} + // 成员方法,第一个参数必须为 self + fn f2(self) -> ... {...} + ... +} +``` +##### 静态方法 + +静态方法不需要`self`作为第一个参数,使用方式很简单,`类名::方法名(...)`。 + +```rust +impl File { + fn getSchemaName() -> string { + return "File" + } +} + +fn out(t: string) -> bool { + if (t = File::getSchemaName()) { + return true + } +} +``` + +##### 成员方法 + +成员方法的第一个参数必须为`self`,该参数无需写明类型。这类函数的调用方式是`实例名.函数名(...)`。 + +```rust +impl File { + fn getName(self) -> string { + return self.name + } +} + +fn out(path: string) -> bool { + let (db = JavaDB::load("coref_java_src.db")) { + for (f in File::__all__(db)) { + if (path = f.getName()) { + return true + } + } + } +} +``` + +##### 数据加载方法 `fn __all__(db)` + +`schema`可以包含一个特别的**静态方法**,用于加载它在数据库中的数据集。 + +```rust +impl File { + @data_constraint + fn __all__(db: JavaDB) -> *File { + ... + } +} +``` + +这种方法必须包含特殊注解`@data_constraint`,表明该方法专用于加载,如果不写该注解,则该方法的返回为**空集合**。该方法返回类型必须为其本身的集合。 + +包含了该方法的`schema`可以使用一个语法糖来获取其全集: + +```rust +fn out() -> bool { + for(f in File(JavaDB::load("..."))) { + ... 
+ } + ... +} +// 等价于 +fn out() -> bool { + for(f in File::__all__(JavaDB::load("..."))) { + ... + } + ... +} +``` + +##### 自定义全集方法 + +`schema`允许使用不同于`__all__`名称的**静态方法**来表明一些集合也存在于该类型的全集中。该方法也必须包含特殊注解`@data_constraint`。该方法一般用于手动添加一些数据到该类型的全集中。 + +```rust +impl File { + @data_constraint + fn extend_example() -> *File { + yield File {id: 1234567} + } +} +``` + +#### 构造匿名实例 + +GödelScript 允许用一个特定语法生成匿名实例。生成匿名实例的前提是该实例存在于该`schema`的全集中,除非该用法出现在`@data_constraint`方法中,否则结果为空。 + +```rust +schema A { + @primary id: int, + name: string +} +``` + +对应的应该使用如下语法来进行匿名实例的生成: + +```rust +A {id: 1, name: "first"} +``` + +#### Schema 继承 + +GödelScript 中,`schema`继承非常便捷,使用样例如下: + +```rust +schema MyFile extends File {} +``` + +##### 父类 Field 继承 + +子类会默认将父类的所有 field 继承下来。所以无需手动重写。 + +```rust +schema File { + @primary id: int, + name: string +} + +schema MyFile extends File {} +``` + +##### 父类 Method 继承 + +子类会默认继承父类的所有 method,除了标注`@data_constraint`的方法。所以无需手动重写。但是需要注意的是,`__all__`方法较为特殊,不会被继承,所以需要重新编写`__all__`方法确定继承后的 schema 的全集。 + +```rust +schema File { + @primary id: int, + name: string +} + +impl File { + @data_constraint + fn __all__() -> *File {...} + fn getId(self) -> int {...} + fn staticMethod() -> string {return "File"} +} + +schema MyFile extends File {} +``` + +##### Method Override + +如果子类的实现中存在与父类同名的方法,则父类的方法会被子类方法**覆盖**。 + +```rust +schema File { + @primary id: int, + name: string +} + +impl File { + fn staticMethod() -> string {return "File"} +} + +schema MyFile extends File {} + +impl MyFile { + fn staticMethod() -> string {return "MyFile"} +} +``` + +此时`File::staticMethod`被`MyFile::staticMethod`覆盖,所以调用子类的该方法时,获取的结果为`"MyFile"`。 + +### 数据库 + +#### 数据库声明 + +数据库的声明格式如下: + +```rust +database DatabaseName { + // table_name 对应的是 db 中真实的表名 + // GodelSchemaType 对应的是将表数据读入 godel 后,存储的对应的 schema + table_name : *GodelSchemaType +} +``` + +冒号前是加载的数据库中的**真实表名**,冒号后是其对应的**数据表格式**,必须为`schema`类型。 +例如 db 中存在一张表,名字为`annotation`,对应的`schema`是`Annotation`,写法为: + +```rust +database JavaDB { + // 从 db 的 annotation 表中读取数据,存入 Annotation 中 + annotation : *Annotation +} +``` + +另外需要保证`Annotation`结构必须和表结构一致,例如: + +```rust +schema Annotation { + @primary id: int, // primary注解表示该字段为主键,一个表也可以没有主键 + content: string +} +``` + +就必须要求`annotation`表中必须有`id`和`content`字段,并且存储类型必须对应。 + +#### 数据库加载 + +数据库类型拥有静态方法`(database)::load(filename: string)` + +```rust +fn loadDatabaseExample() -> bool { + // load 中传入的 string 为 db 的文件名,而不需要路径 + // 因为 db 的路径会在执行 godel 时,通过命令行参数传入 + let (db: JavaDB = JavaDB::load("...")) { + ... + } +} +``` + +#### 数据表获取 + +上文中的例子中,要拿到`annotation`表,可以这样做: + +```rust +fn getAnnotation() -> Annotation { + // load 中传入的 string 为 db 的文件名,而不需要路径 + // 因为 db 的路径会在执行 godel 时,通过命令行参数传入 + let (db: JavaDB = JavaDB::load("...")) { + // 直接使用 db.field 就可以拿到表数据了 + for (anno: Annotation in db.annotation) { + ... 
+ } + } +} +``` + +### Trait + +#### Trait 声明 + +`trait`声明语法如下: + +```rust +trait Example { + fn getId(self) -> int; + fn getName(self) -> string; + fn getValueByName(self, name: string) -> string; +} +``` + +#### Impl Trait + +写法与`impl`类似,但是必须要将`trait`中声明的所有函数都实现出来,否则无法通过编译。 + +```rust +impl Example for XmlElement { + fn getId(self) -> int {return self.id} + fn getName(self) -> int {return self.name} + fn getValueByName(self, name: string) -> int { + for(attr in XmlAttribute(XmlDB::load("...")) { + if (attr.getName() = name && attr.id = self.getAttribute().id) { + return attr.getValue() + } + } + } +} +``` + +### Import + +GödelScript 使用`use`关键字来引入其他文件的符号: + +```rust +use coref::java::* // 引用全部符号 +use coref::xml::Location // 引用单个符号 +use coref::xml::{XmlDB, XmlElement} // 引用多个符号 +``` + +#### 模块引用规则 + +GödelScript 包管理器会在传入参数中含有`-p {package dir path}`时启用。 + +包管理器会对文件夹结构进行解析,遍历其中所有的`.gdl`后缀文件。在拿到文件的相对路径后,会将路径映射到对应的包路径。如果文件的相对路径中存在`-`,或者路径中存在一个文件夹名或者文件名或者`.`后跟随的第一个字符是数字, 则该路径不会被包管理器接受,但是包管理器不会对其进行报错,只进行忽略处理。 + +如果想知道忽略了哪些路径,可以使用`-v`参数,包管理器在有该参数的情况下会将忽略的路径作为`warning`报出。如果最终映射的路径中,存在路径冲突的情况,那么包管理器会将其作为`error`报出并退出编译进程。 + +```rust +packages: + coref::cfamily -> /.../Library/coref.cfamily.gdl + coref::go -> /.../Library/coref.go.gdl + coref::java -> /.../Library/coref.java.gdl + coref::javascript -> /.../Library/coref.javascript.gdl + coref::properties -> /.../Library/coref.properties.gdl + coref::python -> /.../Library/coref.python.gdl + coref::sql -> /.../Library/coref.sql.gdl + coref::xml -> /.../Library/coref.xml.gdl +modules + +--coref -> coref + |--xml -> coref::xml + |--properties -> coref::properties + |--cfamily -> coref::cfamily + |--java -> coref::java + |--javascript -> coref::javascript + |--go -> coref::go + |--sql -> coref::sql + +--python -> coref::python +``` + +#### 路径映射样例 + +```rust +Library +|-- coref.java.gdl +|-- coref.xml.gdl ++-- coref + |-- go.gdl + +-- a + +-- b.gdl +=> +coref::java +coref::xml +coref::go +coref::a::b +``` + +该样例中,路径出现冲突 + +```rust +Library +|-- coref +| |-- java.gdl +| +-- python.gdl ++-- coref.python.gdl +=> +coref::java +coref::python -- \ + > 出现冲突 +coref::python -- / +``` + +该样例中,路径存在不合法字符 + +```rust +Library +|-- 0123.gdl +|-- my-godel-lib +| +-- js.gdl ++-- lib-file.123.gdl +=> +0123 +^ 第一个字符为数字 +my-godel-lib::js + ^ ^ 使用了 `-` 字符 +lib-file::123 + ^ ^ 使用了一个字符为数字,并且路径中包含 `-` +``` + +#### 符号冲突 + +在使用过程中,难免会遇到如下的情况。此时直接使用`File`会被告知符号冲突,需要指定其中一个符号。 + +```rust +use coref::java::Location +use coref::xml::Location +schema MyLoc extends Location {} + ^^^^^^^^ +Error: "Location" is ambiguous, with multiple symbols + "coref::java::Location, coref::xml::Location". +``` + +与其他语言类似,GödelScript允许通过完整路径的方式直接指定一个符号,但是该符号必须被引入。 + +```rust +use coref::java::Location +use coref::xml::Location +schema MyLoc extends coref::xml::Location {} +``` + +完整路径符号可以被用于以下情况: + +- schema 继承 + +```rust +schema JavaLocation extends coref::java::Location {} +``` + +- 函数参数和返回值 + +```rust +fn return_java_file(f: coref::java::File) -> coref::java::File { + ... +} +``` + +- database 声明 + +```rust +database MyDB { + java_file: coref::java::File, + xml_file: coref::xml::File, + java_loc: coref::java::Location, + xml_loc: coref::xml::Location +} +``` + +- query 列表类型声明 + +```rust +query example from + coref::java::Location loc in coref::java::Location(coref::java::JavaDB::load("...")) +where + ... +select + ... +``` + +- schema 静态方法调用 + +```rust +for(loc in coref::java::Location(coref::java::JavaDB::load("..."))) { + ... 
+} + +stmt.to() +stmt.is() +``` + +### Query + +Query 用于进行一些简单的查询,编写的 query 一定会被输出,即使没有声明`main`函数。Query 的语法格式如下: + +```rust +query 名字 from + 变量名 in 初始值, + 变量名 in 初始值, + 变量名 in 初始值 +where 条件 +select 值 as 输出的列名 + 值 as 输出的列名, + 值 as 输出的列名, + 值 as 输出的列名 +``` + +from 列表中的变量声明无需加上类型标注,编译器会进行自动推导,另外此处初始化不会使用`=`号,而是`in`关键字。此外,select 列表中,输出的列名不能和参与计算的变量名冲突,但是列名可以被省略。被省略的列名会在输出结果时采取随机名字,所以尽量不要省略。 + +下面是用 query 语法编写的`hello world`: + +```rust +query hello_world from + info in "hello world" +select info as greeting +``` + +上面的代码等价于如下代码: + +```rust +fn hello_world(greeting: string) -> bool { + let (info = "hello world") { + if (greeting = info) { + return true + } + } +} +fn main() { + output(hello_world()) +} +``` + +#### 样例和组成结构 + +Query 包含了查询名称,`from`列表,`where`筛选条件,`select`列表。 + +```rust +// script +use coref::java::{Callable, Class, Interface, JavaDB} + +fn db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +query class_method from + Callable m in Callable(db()), + Class c in Class(db()) +where + c.id = m.getBelongedClass().id +select + c.getQualifiedName() as className, + m.getName() as methodName, + m.getSignature() as methodSignature +``` + +#### 等价代码 + +上面的例子等价于如下代码: + +```rust +// script +use coref::java::{Callable, Class, Interface, JavaDB} + +fn db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +fn main() { + output(class_method()) +} + +fn class_method(className: string, methodName: string, methodSignature: string) -> bool { + for (m in Callable(db()), c in Class(db())) { + if (c.id = m.getBelongedClass().id) { + if (className = c.getQualifiedName() && + methodName = m.getName() && + methodSignature = m.getSignature()) { + return true + } + } + } +} +``` + +### Ungrounded Error: 未赋值/未绑定错误 + +GödelScript 会将未与数据绑定的符号判定为`ungrounded(未赋值/未绑定)`。基本判定规则为: + +- 未初始化的/未被使用的/未与集合绑定的符号 + - 未被绑定的`int` `string`参数 + - 未被使用的 database 类型的参数 + - 函数体有语句,但是没有任何返回语句 +- 在取非运算块中进行绑定的符号 + - 例如 `!(__tmp = 1)`,`__tmp`会被认为是未绑定的 + - 在取非运算块中调用 inline 函数或数据构造函数 + +#### 1. 未使用的 database/基础类型参数 + +函数代码块中,如果有一个语句分支没有使用参数中的`database`或者基础类型参数,则一定会导致`ungrounded`: + +```rust +fn test(db: JavaDB, a: int, b: string) -> bool {} + ^^ ^ ^ ^^ +Error: ungrounded parameter "db, a, b" in this branch. +``` + +编译器会提示在哪一条执行分支中存在 unused paramemter,根据提示检查对应的执行路径,补全对 parameter 的约束即可。 + +存在某些函数,在调用的时候,参数虽然是基础类型,但是传入的都是字面量,那这时如果错误地报出了`ungrounded`,可以给该函数添加`@inline`注解,来避免错误的约束检测。 + +```rust +impl XXX { + @inline + fn getValueByAttributeNameByDefaultValue(self, attributeName: string) -> string { + if (self.hasAttribute(attributeName)) { + return self.getValueByAttributeName(attributeName) + } + if (!self.hasAttribute(attributeName)) { + return "null" + } + } +} + +fn xxx() -> xx { + .. + attr.getValueByAttributeNameByDefaultValue("pattern") + ^^^^^^^^^ 使用了字面量, 添加@inline来通过检测 +} +``` + +#### 2. 函数体有语句的情况下无返回语句 + +GödelScript 允许一个函数体不包含任何语句,即空函数体。但是如果函数体中有其他语句,则 GödelScript 会要求必须有至少一个返回语句,否则就会出现 ungrounded error。 + +```rust +fn test() -> int {} + ^^ 没有语句,可以通过编译 + +fn test() -> int { + let (a = 1) {} + ^^^^^^^^^^^^^^ 有语句的情况下,没有返回语句,ungrounded +} +``` + +#### 3. 取非运算块中使用 inline 函数或数据构造函数 + +上文提到了可以通过`@inline`注解来规避 ungrounded error。但是如果在取非运算中使用了含有该注解的函数,则必然会导致 ungrounded error。 + +同样,数据构造函数实际的作用就是对一个临时中间变量进行绑定,但是这会直接导致 ungrounded error。 +所以综上所述,在取非运算块中使用 inline 函数或者数据构造函数,必然会导致 ungrounded error,编译器会对所有类似的情况直接报错。 + +```rust +if (!check(method.to())) { + ^^^^^^^^^^^^^^^^^^^^^^^^^^ ungrounded +} +if (!check(ElementParent {id: 0})) { + ^^^^^^^^^^^^^^ ungrounded +} + +@inline +fn for_test() -> ElementParent { + ... 
+} +if (!check(for_test())) { + ^^^^^^^^^^ 取非运算中存在 inline 函数,ungrounded +} +``` + +#### 4. 对链式调用的取非运算 + +GödelScript 未对该情况执行`ungrounded`检测,但是该写法会导致在 Soufflé 中报`ungrounded`错误: + +```rust +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +fn get_field() -> *Field { + for (field in Field(default_java_db())) { + if (!field.getLocation().getFile().getRelativePath().contains("/test/")) { + yield field + } + } +} +``` + +其中: + +```rust +!field.getLocation().getFile().getRelativePath().contains("/test/") +``` + +实际会被翻译为类似如下的 Soufflé 代码片段: + +```rust +!(__tmp = field, Field_getLocation(__tmp, __tmp_1), ..., contains("/test/", __tmp_4)) + ^^^^^ ^^^^^^^ +``` + +其中用于中间存储的变量在`!(...)`中被绑定,但是由于取非操作符,这个绑定被认为是假设的,但是`__tmp`,`__tmp_1`却被认为是被声明出来整个语句范围内可见的变量,从而导致`ungrounded`。 + +可以采取声明中间变量接住中间结果的方式来避免取非运算中的绑定操作: + +```rust +fn get_field() -> *Field { + for (field in Field(default_java_db())) { + let (path = field.getLocation().getFile().getRelativePath()) { + if (!path.contains("/test/")) { + yield field + } + } + } +} +``` + +## 查询示例 + +### Java + +#### 未使用方法 + +```rust +// script +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +// find unused methods +fn unused_method(unused: string) -> bool { + for(c in Callable(default_java_db()), method in Callable(default_java_db()), caller in method.getCaller()) { + if (c != caller && unused = method.getSignature()) { + return true + } + } +} + +fn main() { + output(unused_method()) +} +``` + +#### 类继承关系 + +```rust +// script +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +/** + * Find all class and the inheritances + * including parent class inheritance and ancestor class inheritance + */ +fn class_hierarchy(className : string, superClassName : string) -> bool { + for (c in Class(default_java_db()), ancestor in c.getAnAncestorClass()) { + if (className = c.getQualifiedName() && + superClassName = ancestor.getQualifiedName()) { + return true + } + } +} + +fn main() { + output(class_hierarchy()) +} +``` + +#### 类的所有方法信息 + +```rust +// script +use coref::java::* + +fn default_java_db() -> JavaDB { + return JavaDB::load("coref_java_src.db") +} + +// Find all methods of the class +fn methods(className : string, methodName : string) -> bool { + for (c in Class(default_java_db()), m in c.getAllMethods()) { + if (className = c.getQualifiedName() && + methodName = m.getName()){ + return true + } + } +} + +fn main() { + output(methods()) +} +``` + +### Python + +#### 获取函数圈复杂度 + +```rust +// script +use coref::python::* + +fn default_db() -> PythonDB { + return PythonDB::load("coref_python_src.db") +} + +/** + * Get cyclomatic complexity of functions + * + * @param name function name + * @param value cyclomatic complexity of function + * @param path path of file including this function + * @param sline function start line + * @param eline function end line + */ +fn getCyclomaticComplexity( + name: string, + value: int, + path: string, + sline: int, + eline: int) -> bool { + // get metric function + for (c in MetricFunction(default_db())) { + if (path = c.getLocation().getFile().getRelativePath() && + name = c.getQualifiedName() && + value = c.getCyclomaticComplexity() && + sline = c.getLocation().getStartLineNumber() && + eline = c.getLocation().getEndLineNumber()) { + return true + } + } +} + +fn main() { + output(getCyclomaticComplexity()) +} +``` + +#### 注释率统计 + +```rust +// script +use coref::python::* + +schema 
PublicVisitedElement extends CombineElement {} + +impl PublicVisitedElement { + @data_constraint + pub fn __all__(db: PythonDB) -> *PublicVisitedElement { + for (tmp in Class(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + for (tmp in Function(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + } +} + +fn default_db() -> PythonDB { + return PythonDB::load("coref_python_src.db") +} + + +// count number of total public element +fn countTotalPublicElement() -> int { + return PublicVisitedElement(default_db()).len() +} + +// get public elements with Docstring comment +fn withDocstringCommentElement() -> *PublicVisitedElement { + let (db = default_db()) { + for (e in PublicVisitedElement(db), j in DocstringComment(db)) { + if (e.key_eq(j.getDocumentableElement())) { + yield e + } + } + } +} + +// count number of public elements with Docstring comment +fn countTotalPublicDocumentedElement() -> int { + return withDocstringCommentElement().len() +} + +fn withPublicDocumentedBelowElement() -> *PublicVisitedElement { + let (db = default_db()) { + for (e in PublicVisitedElement(db), j in Comment(db)) { + if (e.key_eq(j.getDocumentedClassOrFunctionElement())) { + yield e + } + } + } +} + +// count number of public element with single line comment +fn countTotalPublicDocumentedBelowElement() -> int { + return withPublicDocumentedBelowElement().len() +} + + +// calculate documented percentage +fn getDocumentedPercentage(documentedPercentage: int) -> bool { + let (i = countTotalPublicElement(), + j = countTotalPublicDocumentedElement(), + k = countTotalPublicDocumentedBelowElement()) { + if (i = 0) { + if (documentedPercentage = -1) { + return true + } + } + if (i != 0) { + if (documentedPercentage = (j + k) * 1000 / i) { + return true + } + } + } +} + +fn main() { + output(getDocumentedPercentage()) +} +``` + +#### 函数注释情况 + +```rust +// script +use coref::python::* + +schema PublicVisitedElement extends CombineElement {} + +impl PublicVisitedElement { + @data_constraint + pub fn __all__(db: PythonDB) -> *PublicVisitedElement { + for (tmp in Class(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + for (tmp in Function(db)) { + yield PublicVisitedElement {id: tmp.element_oid} + } + } + + pub fn getName(self) -> string { + let (tmp = Class(__all_data__).find(self)) { + return tmp.getQualifiedName() + } + let (tmp = Function(__all_data__).find(self)) { + return tmp.getQualifiedName() + } + } +} + +fn default_db() -> PythonDB { + return PythonDB::load("coref_python_src.db") +} + +fn hasComment(e: PublicVisitedElement) -> bool { + let (db = default_db()) { + for (j in DocstringComment(db)) { + if (e.key_eq(j.getDocumentableElement())) { + return true + } + } + for (j in Comment(db)) { + if (e.key_eq(j.getDocumentedClassOrFunctionElement())) { + return true + } + } + } +} + +/** + * Get comment of each public element + * + * @param type public visited element type + * @param name public visited element name + * @param filePath file path + * @param sline element start line + * @param eline element end line + * @param isCommented if is commented + */ +fn output_result( + type: string, + name: string, + filePath: string, + sline: int, + eline: int, + isCommented: int) -> bool { + for (e in PublicVisitedElement(default_db())) { + if (type = e.getType() && + name = e.getName() && + filePath = e.getLocation().getFile().getRelativePath() && + sline = e.getLocation().getStartLineNumber() && + eline = e.getLocation().getEndLineNumber()) { + if (hasComment(e)) { + if (isCommented 
= 1) { + return true + } + } + if (!hasComment(e)) { + if (isCommented = 0) { + return true + } + } + } + } +} + +fn main() { + output(output_result()) +} +``` + +### JavaScript + +#### AST Print + +```rust +// script +use coref::javascript::* + +/** + * print AST + * + * @param filePath file path + * @param parentId parent node ID + * @param parentKind parent node kind + * @param parentStartLine parent node start line + * @param parentEndLine parent node end line + * @param childId child node ID + * @param childKind child node kind + * @param childStartLine child node start line + * @param childEndLine child node end line + * @param index child node index + */ +fn out( + filePath: string, + parentId: int, + parentKind: string, + parentStartLine: int, + parentEndLine: int, + childId: int, + childKind: string, + childStartLine: int, + childEndLine: int, + index: int +) -> bool { + let (db = JavascriptDB::load("coref_javascript_src.db")) { + for (parent in Node(db), + child in Node(db), + parentSyntaxKind in SyntaxKind(), + childSyntaxKind in SyntaxKind(), + parentLocation in Location(db), + childLocation in Location(db), + file in File(db)) { + if (parent.key_eq(child.getParent()) && + parentId = parent.id && + childId = child.id && + parentSyntaxKind.id = parent.getKind() && + childSyntaxKind.id = child.getKind() && + parentKind = parentSyntaxKind.getName() && + childKind = childSyntaxKind.getName() && + index = child.getIndex() && + parentLocation = parent.getLocation() && + childLocation = parent.getLocation() && + file = parentLocation.getFile() && + filePath = file.getRelativePath() && + parentStartLine = parentLocation.getStartLineNumber() && + parentEndLine = parentLocation.getEndLineNumber() && + childStartLine = childLocation.getStartLineNumber() && + childEndLine = childLocation.getEndLineNumber()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +#### 圈复杂度 + +```rust +// script +use coref::javascript::* + +fn default_db() -> JavascriptDB { + return JavascriptDB::load("coref_javascript_src.db") +} + +/** + * Output the cyclomatic complexity of each function + * + * @param filePath file path + * @param functionName function name + * @param complexity cyclomatic complexity + * @param startLine function start line + * @param endLine function end line + */ +fn out(filePath: string, functionName: string, complexity: int, startLine: int, endLine: int) -> bool { + let (db = default_db()) { + for (func in FunctionLikeDeclaration(db), file in File(db)) { + if (complexity = func.getCyclomaticComplexity() && + functionName = func.getName() && + file = func.getLocation().getFile() && + filePath = file.getRelativePath() && + startLine = func.getLocation().getStartLineNumber() && + endLine = func.getLocation().getEndLineNumber()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +#### Change Effect + +```rust +// script +use coref::javascript::* + +fn default_db() -> JavascriptDB { + return JavascriptDB::load("coref_javascript_src.db") +} + +fn getACallerFunction(function: FunctionLikeDeclaration, callerFunction: FunctionLikeDeclaration) -> bool { + for (mayInvokeExpression in MayInvokeExpression(default_db())) { + if (mayInvokeExpression = function.getACallSite() && + callerFunction = mayInvokeExpression.getEnclosingFunction()) { + return true + } + } +} + +fn getAnEffectedFunction(function: FunctionLikeDeclaration, effectedFunction: FunctionLikeDeclaration) -> bool { + if (getACallerFunction(function, effectedFunction)) { + return true + } + for 
(callerFunction in FunctionLikeDeclaration(default_db())) { + if (getACallerFunction(function, callerFunction) && + getAnEffectedFunction(callerFunction, effectedFunction)) { + return true + } + } +} + +/** + * Query the effected functions according to the changed lines. + * + * @param function the changed function id + * @param signature the changed function signature + * @param functionPath the changed function file path + * @param startLine the changed function start line + * @param endLine the changed function end line + * @param effectedFunction the effected function id + * @param effectedSignature the effected function signature + * @param effectedFunctionPath the effected function file path + * @param effectedStartLine the effected function start line + * @param effectedEndLine the effected function end line + */ +fn out( + function: FunctionLikeDeclaration, + signature: string, + functionPath: string, + startLine: int, + endLine: int, + effectedFunction: FunctionLikeDeclaration, + effectedSignature: string, + effectedFunctionPath: string, + effectedStartLine: int, + effectedEndLine: int +) -> bool { + if (getAnEffectedFunction(function, effectedFunction)) { + let (symbol = function.getSymbol(), + effectedSymbol = effectedFunction.getSymbol(), + location = function.getLocation(), + effectedLocation = effectedFunction.getLocation()) { + if (signature = symbol.getDescription() && + effectedSignature = effectedSymbol.getDescription() && + functionPath = location.getRelativePath() && + startLine = location.getStartLineNumber() && + endLine = location.getEndLineNumber() && + effectedFunctionPath = effectedLocation.getRelativePath() && + effectedStartLine = effectedLocation.getStartLineNumber() && + effectedEndLine = effectedLocation.getEndLineNumber()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +### XML + +#### 获取 bean + +```rust +// script +use coref::xml::* + +schema BeanXmlElement extends XmlElement {} + +impl BeanXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *BeanXmlElement { + for (e in XmlElement(db)) { + let (path = e.getLocation().getFile().getRelativePath()) { + if (!path.contains("target") && e.getName() = "bean") { + yield BeanXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } + } +} + +schema EntryXmlElement extends XmlElement {} + +impl EntryXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *EntryXmlElement { + for (e in XmlElement(db)) { + if (e.getName() = "entry") { + yield EntryXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema PropertyXmlElement extends XmlElement {} + +impl PropertyXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *PropertyXmlElement { + for (e in XmlElement(db)) { + if (e.getName() = "property") { + yield PropertyXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +fn default_db() -> XmlDB { + return XmlDB::load("coref_xml_src.db") +} + +// get class name +fn getClassName(bean: BeanXmlElement) -> string { + for (attr in bean.getAttribute()) { + if (attr.getName() = "class") { + return attr.getValue() + } + } +} + +// get key +fn getKey(e: EntryXmlElement) -> string { + for (attr in e.getAttribute()) { + if (attr.getName() = "key") { + return attr.getValue() + } + } +} + +// output value and class info of the bean +fn output1(className: string, 
pName: string, kName: string) -> bool { + let (db = default_db()) { + for (bean in BeanXmlElement(db), p in PropertyXmlElement(db), e in EntryXmlElement(db)) { + if (className = getClassName(bean) && + bean.key_eq(p.getParent()) && + p.key_eq(e.getParent().getParent()) && + pName = p.getName() && + kName = getKey(e)) { + return true + } + } + } +} + +fn main() { + output(output1()) +} +``` + +#### POM + +```rust +// script +use coref::xml::* + +schema DependencyElement extends XmlElement {} + +impl DependencyElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *DependencyElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "dependency") { + yield DependencyElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema GroupElement extends XmlElement {} + +impl GroupElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *GroupElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "groupId") { + yield GroupElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema VersionElement extends XmlElement {} + +impl VersionElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *VersionElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "version") { + yield VersionElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema ArtifactElement extends XmlElement {} + +impl ArtifactElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *ArtifactElement { + for(e in XmlElement(db)) { + if (e.getElementName() = "artifactId") { + yield ArtifactElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +schema PomFile extends XmlFile {} + +impl PomFile { + @data_constraint + pub fn __all__(db: XmlDB) -> *PomFile { + for(f in XmlFile(db)) { + if (f.getFileName() = "pom.xml") { + yield PomFile { + id: f.id, + file_name: f.file_name, + relative_path: f.relative_path + } + } + } + } +} + +// output relative path of the file, referenced jar name and version +fn out(fileName: string, m1: string, m2: string, m3: string) -> bool { + let (db = XmlDB::load("coref_xml_src.db")) { + for (f in PomFile(db), + e1 in GroupElement(db), + e2 in VersionElement(db), + e3 in ArtifactElement(db), + c1 in XmlCharacter(db), + c2 in XmlCharacter(db), + c3 in XmlCharacter(db), + p in DependencyElement(db)) { + if (f.key_eq(p.getLocation().getFile()) && + fileName = f.getRelativePath() && + p.key_eq(e1.getParent()) && + e1.key_eq(c1.getBelongedElement()) && + m1 = c1.getText() && + p.key_eq(e2.getParent()) && + e2.key_eq(c2.getBelongedElement()) && + m2 = c2.getText() && + p.key_eq(e3.getParent()) && + e3.key_eq(c3.getBelongedElement()) && + m3 = c3.getText()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +#### RPC + +```rust +// script +use coref::xml::* + +// select XmlElement containing "mobileService" +schema MobileServiceXmlElement extends XmlElement{} + +impl MobileServiceXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *MobileServiceXmlElement { + for (e in XmlElement(db)) { + if (e.getElementName() = "mobileService") { + yield MobileServiceXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } + + pub fn getServiceBeanValue(self) -> string { + for (a in 
self.getAttribute()) { + if (a.getName() = "serviceBean") { + return a.getValue() + } + } + } +} + +// select XmlElement containing "sofa:extension" +schema SofaExtensionXmlElement extends XmlElement{} +impl SofaExtensionXmlElement { + @data_constraint + pub fn __all__(db: XmlDB) -> *SofaExtensionXmlElement { + for (e in XmlElement(db)) { + if (e.getName() = "sofa:extension") { + yield SofaExtensionXmlElement { + id: e.id, + location_id: e.location_id, + parent_id: e.parent_id, + index_order: e.index_order + } + } + } + } +} + +fn out(value: string) -> bool { + let (db = XmlDB::load("coref_xml_src.db")) { + for (m in MobileServiceXmlElement(db), s in SofaExtensionXmlElement(db), ancestor in m.getAnAncestor()) { + if (s.key_eq(ancestor) && value = m.getServiceBeanValue()) { + return true + } + } + } +} + +fn main() { + output(out()) +} +``` + +### Go + +#### 获取所有文件的基本信息 + +```rust +// script +use coref::go::* + +fn default_db() -> GoDB { + return GoDB::load("coref_go_src.db") +} +/** + * @param name file name + * @param funcCount function/method quantity + * @param totallines total lines of file + * @param codelines code line of file + * @param commentlines comment line of fine + * @param md5 md5 of this file + * @param sha256 sha256 of this file + */ +fn out( + name: string, + funcCount: int, + totallines: int, + codelines: int, + commentlines: int, + md5: string, + sha256: string) -> bool { + for(f in File(default_db())) { + if (name = f.getName() && + funcCount = f.getFunctionCount() && + md5 = f.getMd5Sum() && + sha256 = f.getSha256Sum() && + totallines = f.getLineInfo().getNumberOfTotalLines() && + codelines = f.getLineInfo().getNumberOfCodeLines() && + commentlines = f.getLineInfo().getNumberOfCommentLines()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` + +#### 获取函数及其关联的注释 + +```rust +// script +use coref::go::* + +fn default_db() -> GoDB { + return GoDB::load("coref_go_src.db") +} + +// Define a predicate called 'out' with parameters fileName, funcName, funcComment, and signature +fn out(fileName: string, funcName: string, funcComment: string, signature: string) -> bool { + // Check if there exists a Function object 'func' + for(func in Function(default_db())) { + if ( + // Get the name of the file the function belongs to and assign it to the variable 'fileName' + fileName = func.getBelongsFile().getName() && + // Get the name of the function and assign it to the variable 'funcName' + funcName = func.getName() && + // Get the associated comment string for the function and assign it to the variable 'funcComment' + funcComment = func.getAssociatedCommentString() && + // Get the function type signature and assign it to the variable 'signature' + signature = func.getFunctionTypeSignature()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` + +#### 获取函数圈复杂度 + +```rust +// script +use coref::go::* + +fn default_db() -> GoDB { + return GoDB::load("coref_go_src.db") +} + +/** + * @param name: file name + * @param func: function name + * @param cmplx: function cyclomatic complexity + * @param sl,el,sc,ec: function location info + */ +fn out(name: string, func: string, cmplx: int, sl: int, el: int) -> bool { + for(f in GoFile(default_db()), function in Function(default_db())) { + if ((!f.isAutoGenereatedFile()) && + f.key_eq(function.getBelongsFile()) && + name = f.getName() && + func = function.getName() && + cmplx = function.getCyclomaticComplexity() && + sl = function.getLocation().getStartLineNumber() && + el = function.getLocation().getEndLineNumber()) 
{ + return true + } + } +} + +fn main() { + output(out()) +} +``` + +## 查询调试和优化技巧 + +运行 GödelScript 脚本的时候,经常会出现运行时间超长的问题,这里提供一些基本判别方法和解决方案。 + +### Schema 传参导致笛卡尔积过大 + +函数传参在没有`@inline`注解的情况下,默认是作为“限定”条件,而不是一个传入值存在。 + +例如下面的这个例子中,`get`获取到一个`Class`类型的传入参数,但是实际上最终的编译结果会类似下面的代码: + +```rust +fn check(class: Class) -> bool { + if (class.getName().contains("io")) { + return true + } +} + +// 实际的编译结果 +fn check(class: Class) -> bool { + // 实际上是要先拿 Class 全集 + for(__temp_class in Class::__all__(__all_data__)) { + if (class = __temp_class ) { + if (class.getName().contains("io")) { + return true + } + } + } +} +``` + +所以在传参中 schema 类型很多时,会出现多个 schema 全集做笛卡尔积的情况,空间和时间开销急剧增加。 +解决方案也很简单,加一个`@inline`注解就可以: + +```rust +@inline +fn check(class: Class) -> bool { + if (class.getName().contains("io")) { + return true + } +} + +fn example() -> bool { + for(class in Class(default_java_db())) { + if (check(class)) { + return true + } + } +} + +// inline 注解会强行在代码生成阶段将函数内联到语句中,避免多次加载表 +// 实际的编译结果类似于 +fn example() -> bool { + for(class in Class(default_java_db())) { + if (class.getName().contains("io")) { + return true + } + } +} +``` + +### 多层 for 导致笛卡尔积过大 + +在一些情况下不可避免的会使用非常多层数的 for 来加载多表进行联查,导致笛卡尔积严重膨胀。可以通过提前减少 (过滤) 集合大小的方式来缩减笛卡尔积结果数量,例如: + +```rust +fn getByIndex(self) -> Expression { + let (db = default_java_db()) { + for(e in Expression(db), p in Parameter(db)) { + let (i = p.getIndex()) { + if (e.key_eq(self.getValueByIndex(i))) { + return e + } + } + } + } +} +``` + +这个例子中,e, p 做笛卡尔积,导致中间过程占用时间太长。 +i 实际上是从 p 的一个方法中得到的集合,并且在实际使用中,这个集合非常小,远比 Parameter 全集小,所以可以把 i 集合的获取抽出来变成单独的函数,生成小集合,避免大集合之间笛卡尔积运算的同时,还保证了结果的等价: + +```rust +fn getAllParameterIndex() -> *int { + let (db = default_java_db()) { + for (p in Parameter(db)) { + yield p.getIndex() + } + } +} + +fn getByIndex(self) -> Expression { + let (db = default_java_db()) { + for(e in Expression(db), i in getAllParameterIndex()) { + if (e.key_eq(self.getValueByIndex(i))) { + return e + } + } + } +} +``` + +e, p 的笛卡尔积就变成了 e, i 的笛卡尔积,从运算的层面来看,笛卡尔积开销变小,`getIndex`操作也被提前了,而不是在做笛卡尔积之后进行,所以性能大幅度提升。 + +### 不要滥用`@inline`/必须用`@inline`的优化策略 + +inline 函数的底层机制是在**调用处展开**,如果该函数不存在大量的 schema 传参,并且在很多位置都被调用,inline 可能会导致**编译结果膨胀且重复计算次数指数级增加**,有时反而不利于减少运行时间。 +如果存在必须要使用 inline 的情况 (比如规避`ungrounded`),但是使用之后反而出现运行速度变慢的情况,可以采取将内嵌语句拆分为 predicate 的方式来避免展开导致的编译结果膨胀。 + +下面的例子中,`getValueByAttributeNameByDefaultValue`为了避免`attributeName`被识别为`ungrounded`所以标注`inline`,后续在 if 分支中添加了一个条件语句,但是导致了执行时间从 3 秒变成 35 秒: + +```rust +impl XmlElementBase { + @inline + fn getValueByAttributeNameByDefaultValue(self, attributeName: string) -> string { + if (self.hasAttribute(attributeName)) { + // return self.getValueByAttributeName(attributeName) + // 更改为了如下语句: + let(value = self.getValueByAttributeName(attributeName)) { + if (value = "n/a") { + return "" + } + if (value != "n/a") { + return value + } + } + } + if (!self.hasAttribute(attributeName)) { + return "null" + } + } +} +``` + +可以看到的是,增加了一层赋值和一层条件语句,在下文中,这个函数被调用了接近 20 次,导致了代码接近 20 次被重复展开,同时也造成了性能出现了一个数量级的差距。此时可以将更改的语句提取出来,由于提取出来的函数并没有使用复杂类型作为传参,所以不需要 inline 性能也没有损失,提取之后结果如下: + +```rust +impl XmlElementBase { + fn getTransValueByAttributeName(self, attributeName: string) -> string { + let (value = self.getValueByAttributeName(attributeName)) { + if (value = "n/a") { + return "" + } + if (value != "n/a") { + return value + } + } + } + @inline + fn getValueByAttributeNameByDefaultValue(self, attributeName: string) -> string { + if (self.hasAttribute(attributeName)) { + return self.getTransValueByAttributeName(attributeName) + } + if 
(!self.hasAttribute(attributeName)) { + return "null" + } + } +} +``` + +这样执行时间从 35 秒回到 3 秒,符合预期。 + +## 在本机使用查询脚本流程 + +参见[安装、配置、运行](./3_install_and_run.md) diff --git a/content/zh/docs/codefuse-query/5_toolchain.md b/content/zh/docs/codefuse-query/5_toolchain.md new file mode 100644 index 0000000..b041d4e --- /dev/null +++ b/content/zh/docs/codefuse-query/5_toolchain.md @@ -0,0 +1,97 @@ +--- +title: VSCode插件 +slug: VSCode插件 +description: CodeFuse介绍主要功能 +url: /docs/codefuse-query-toolchain-zh +aliases: +- "/docs/codefuse-query-toolchain-zh" +--- + + +# 开发插件(VSCode) +## 安装 +### 从VSCode官方插件市场安装(推荐) +[插件地址](https://marketplace.visualstudio.com/items?itemName=CodeFuse-Query.codefuse-query-extension) +### 使用VSIX安装包安装 +1. 下载插件 +2. 手动从 vsix 安装: +![image.png](/images/codefuse-query/toolchain01.png) +3. 或者使用指令直接从终端安装: +```bash +code --install-extension [扩展vsix文件路径] +``` +## 环境准备 + +- Sparrow CLI ,参照 3 安装、配置、运行 +## 扩展特性 +本扩展提供了以下功能模块: + +- COREF AST Viewer +- Gödel Language Server +- Gödel Language Runner +### COREF AST Viewer +以下功能需要在扩展设置中设置相关项后启用。目前仅支持于Java语言 +#### Java 文件转成树状的 COREF Node +![](/images/codefuse-query/toolchain02.gif) +#### Node 与代码位置的相互定位 +![](/images/codefuse-query/toolchain03.gif) +#### 在Lib API Viewer 查看 Node 的API,Node 复制 +![](/images/codefuse-query/toolchain04.gif) +#### Lib API Viewer:查询与复制使用 +![](/images/codefuse-query/toolchain05.gif) +### Gödel Language Server Features +以下功能均需要在设置扩展后启用。不设置相关项的情况下,语法高亮仍然可用。 +#### 错误信息提示 +错误信息会随着代码的更新而自动更新。 +![](/images/codefuse-query/toolchain06.gif) +#### 符号信息提示和补全 +包含local变量和全局符号信息的补全提示,关键字等信息会提供对应的使用样例,全局符号信息会提供更详细的内部信息,如包含的成员变量、成员方法、静态方法。 + +![](/images/codefuse-query/toolchain07.gif) + +- 关键字补全和使用样例提示 +- local 变量类型信息和符号补全 +- `.` 跟随的符号信息和补全 +- `::` 跟随的符号信息和补全 +- 注解使用样例提示 +- 全局符号类型信息 (内部结构,成员方法,静态方法) +#### 跳转到定义 +可以通过右键跳转定义或者`ctrl`/`command`+`left click`直接跳转到准确的符号定义位置。 + +![](/images/codefuse-query/toolchain08.gif) +#### 代码片段 (Snippets) +扩展提供了一些代码片段补齐以供快速编写 Gödel 1.0/script 代码。 + +![](/images/codefuse-query/toolchain09.gif) +### GödelScript Runner +需要在扩展中设置 sparrow cli 路径后使用。运行脚本之前需要先加载数据库。关于如何生成数据库 参考 3.4.章节 运行 中的数据抽取部分。 +#### 运行脚本 +![panel.gif](/images/codefuse-query/toolchain10.gif) +提供了四种不同的脚本运行按钮: +1. 在要运行的脚本处右键执行。 +2. 在 extension `GodelScript Runner` 面板上选择 `Run GödelScript`。 +3. 在 extension `GodelScript Runner Setting` 面板上选择 `Run`。 +4. 在 extension `GodelScript Runner Setting` 面板右上角点击运行按钮。 +#### 数据库文件夹加载 +1. 在要运行的脚本处右键选择包含数据库的文件夹进行加载。 +2. 在 extension `GodelScript Runner` 面板上选择 `Load Database Directory`。 +3. 在 extension `GodelScript Runner Setting` 面板上选择 `Database`。 +4. 在 extension `GodelScript Runner Setting` 面板右上角点击数据库加载按钮。 +## 扩展设置 +### COREF AST Viewer 设置 + +- `corefASTViewer.sparrowCliRoot` + - 指定 Sparrow CLI 的根目录,参照第3章节的安装部分 +### Gödel Language Server 设置 +扩展启动时,以下两项中存在任意一项未被设置,则会弹出提示。点击`configure`按钮会跳转至相应配置页面。 + +- `godelScript.executablePath` + - 用于指定 GödelScript 的可执行文件路径,默认为空。需要时请替换为实际的 GödelScript 可执行文件的绝对路径。 + - 如果已经下载 Sparrow CLI ,则 GödelScript 可执行文件为 `[sparrow cli root]/godel-script/usr/bin/godel`。 +- `godelScript.libraryDirectoryPath` + - 用于指定 GödelScript 的库文件夹路径,默认为空。需要时请替换为 GödelScript 库文件夹绝对路径。 + - 如果已经下载 Sparrow CLI ,则库文件夹路径为 `[sparrow cli root]/lib-1.0`。 + +# 智能助手 + +待开放,尽情期待! 
diff --git a/content/zh/docs/codefuse-query/user_case.md b/content/zh/docs/codefuse-query/user_case.md new file mode 100644 index 0000000..48722e9 --- /dev/null +++ b/content/zh/docs/codefuse-query/user_case.md @@ -0,0 +1,59 @@ +--- +title: 用户案例 +slug: 用户案例 +description: CodeFuse介绍主要功能 +url: /docs/codefuse-query-usercase-zh +aliases: +- "/docs/codefuse-query-usercase-zh" +--- + +# 使用场景 +## 查询代码特征 +小开发同学想知道 Repo A 里面使用了哪些 String 型的变量,所以他写了一个 Gödel 如下,交给 CodeFuse-Query 系统给他返回了结果。 +```rust +// script +use coref::java::* + +fn out(var: string) -> bool { + for(v in Variable(JavaDB::load("coref_java_src.db"))) { + if (v.getType().getName() = "String" && var = v.getName()) { + return true + } + } +} + +fn main() { + output(out()) +} +``` +类似需求:查询:类,函数,变量,返回值,调用图,类继承等等。 +## 代码规则检查器 +小 TL 同学发现团队总是写出很多类似的 Bug A,**他想针对 Bug A 制定一个代码规则和其检查器**,并在 CodeReview 阶段做个卡点。小 TL 通过在 CodeFuse-Query 平台上面编写了一段分析 Query,在平台上面测试符合要求,把这段分析 Query 固化下来作为一个代码规则,并上线到了 CodeReview/CI 阶段。从此这个 Bug 再也没发生过了。 +类似需求:编写静态缺陷扫描规则进行代码风险拦截。 +## 获取统计数据 +小研究发现传统的代码复杂度指标很难准确地衡量代码的复杂情况,通过学习国际先进经验加上自我灵光一闪,设计了一套复杂度指标和算法。通过 Gödel 实现出来以后,**发现不怎么优化就已经性能非常高了**,很快就应用到了 10 几种语言,11+万个仓库当中去了。马上就对代码仓库整体的复杂度有了深入的了解。相比较以前需要自己解析代码,分析语法树,对接系统,**不知道方便了多少。** +类似需求:代码统计,代码度量,算法设计,学术研究。 + +# 应用领域 +目前,CodeFuse-Query在蚂蚁集团已经支持 **CodeFuse大语言模型数据清洗**、**代码度量评估**、**研发风险控制**、**隐私安全分析**、**代码智能**、**终端包大小治理 **等多个场景的落地应用,服务月均调用量超过百万。 +## 高质量代码数据清洗 - CodeFuse代码大模型 +CodeFuse代码大模型是蚂蚁集团对外开源的处理代码相关问题的模型,对于CodeFuse大语言模型而言,训练的数据质量直接影响模型的推理结果。低质量的代码数据会直接污染语言模型的输出,例如:模型可能会学习到错误的代码模式,从而生成错误的代码;数据中只包含某种编程语言的代码,模型可能无法很好地适应其他编程语言的代码。 +为了把控进入模型的代码数据质量,进而提升模型的推理能力。我们基于蚂蚁程序分析团队多年的实践积累结合业界共识,梳理了高质量代码的定义方式,并利用已有程序分析技术实现了自动化、大规模的代码数据清洗。 +CodeFuse-Query为CodeFuse代码大模型提供了以下数据清洗能力: + +- 高质量代码数据清洗:对代码数据进行清洗,包括对 Python,Java,JavaScript,TypeScript,Go,C,C++ 7 种语言进行漏洞扫描,对语言种类 / star 数进行筛选,过滤有效代码行数为 0 的数据等。目前已沉淀清洗后的 GitHub 和蚂蚁内部代码数据总共约 **2TB**。 +- 代码画像:实现对大规模代码进行高性能多维度的自动标注,支持 Java, Scala, Kotlin, JavaScript, JSX, TypeScript, TSX, Vue, Python, Go 等 **10** 种语言,**77** 种通用标签,**40** 种蚂蚁特有标签,共 **117** 种标签。目前自动标注性能能够达到 **40MB/s**。 +- 其他原子能力 + - 高级代码特征提取,包括提取 AST(抽象语法树),DFG(数据流图)数据等。目前 AST 信息已用于 SFT 训练,准确率 97% 左右。 + - 代码片段识别,用于针对文本数据中的代码进行提取,方便进行代码格式化或加上 Markdown 格式: + - 文本提取代码:从文本中提取代码块信息,支持主流语言的解析,函数及类定义,仅验证二分类问题,就是说仅验证文本是否含有代码块准确率 83% 左右。 + - 识别代码片段的编程语言种类:识别任意代码片段的编程语言种类,支持 30+ 种语言,准确率80%左右。 + - 代码注释对提取:支持提取方法级别的注释-代码对信息,覆盖 **15 种** GitHub 最流行的语言,用于 Text To Code/Code To Text 的 SFT 训练。 +## 变更分析-优酷服务端研发效能 +优酷质量保障团队从2023年开始针对服务端精准测试的探索,经过半年的技术沉淀和体系搭建,形成了具备**变更内容识别、变更影响分析、测试能力推荐、测试覆盖评估**的精准测试体系。 +在此过程中,CodeFuse-Query能提供的能力主要有: + +- 根据代码变更内容(文件+行号),分析出影响的对象:方法、入口(http入口、hsf入口)、调用链路(从入口到变更方法的所有调用链路)、数据库操作(表、操作类型) +- 结合线上动态调用链路(方法链路)、CodeFuse-Query静态分析调用链路的影响面精准分析能力,提升变更分析影响面的有效性、准备率 + +到目前为止,优酷已通过CodeFuse-Query接入所有核心应用,并基于静态分析采集数据,构建了服务端完整的代码知识库和流量知识库。 \ No newline at end of file diff --git a/content/zh/docs/devops-model/1_traindetail.md b/content/zh/docs/devops-model/1_traindetail.md new file mode 100644 index 0000000..ec60625 --- /dev/null +++ b/content/zh/docs/devops-model/1_traindetail.md @@ -0,0 +1,43 @@ +--- +title: 训练解析 +slug: 训练解析 +description: 介绍主要功能 +url: "/docs/codefuse-devops-model-train-zh" +aliases: +- "/docs/codefuse-devops-model-train-zh" +--- + + +## 训练流程 +根据查阅文献可知,大部分领域模型都是在对话模型的基础上,通过SFT微调来进行知识注入。而SFT微调所需要QA预料基本都来自于ChatGPT生成。然而,该方案可能存在QA语料无法完全覆盖领域知识的情况。 +因此,DevOps-Model采用的是预训练加训 + SFT微调的方案,如图2.1所示。我们认为针对领域大模型,预训练的加训是必要的,因为其可以将领域内的一些知识在预训练阶段注入到大模型,如果这些知识在通用大模型预训练时没有出现过,那会让大模型学习到新的知识;如果出现过,就可以让大模型进一步加深印象。第二步则是大模型对齐,目的是让大模型可以根据问题来回答最合适的内容。 + 
+![](/images/devops-model/devops_train_framework.png) +![](/images/devops_model/devops_train_framework.png) + + +## 训练数据 +### 数据收集 +模型的定位是中文 DevOps 领域大模型,因此收集与中文DevOps相关的预训练数据和QA数据。 +- 预训练数据主要来自互联网技术博客、技术文档、技术书籍等,最终收集到了 50G+ 的预训练语料数据; +- 针对 QA 数据,我们的目的是想让模型不但对齐到通用的问答能力,而且针对 DevOps 领域也可以学会如何更好的回答问题,因此不但收集了通用领域的单轮和多轮对话数据,还针对 DevOps 领域,通过爬取和 ChatGPT 生成的方式产出了属于 DevOps 领域的问答数据。最终我们精心筛选了约 200K 的 QA 数据进行 SFT微调训练,具体数据量如下表所示。 + +|数据类型 |数据量级| +| -- | - | +|通用单轮 QA| 50K| +|通用多轮 QA| 20K| +|DevOps 领域 QA| 130K| + +### 数据筛选 +![](/images/devops-model/devops_data_filter.png) +![](/images/devops_model/devops_data_filter.png) + + + +由于预训练数据大部分是从互联网上收集的数据,质量会参差不齐,而大模型训练中数据是最重要的一环,我们建立了如上图所示的清洗 Pipeline,来针对收集到的数据进行质量的全面过滤。 +1. 首先,由专家经验和人工筛选,总结出来了一批文档级别的 Heuristic 过滤规则,这一步主要用来过滤掉那些质量非常差的文档; +2. 然后,即便是一篇质量稍差的文章中,也有可能还是含有一些有价值的领域知识,我们也需要尽可能的进行收集。此处,我们对文章进行段落拆分,将文章拆分成一个个段落; +3. 然后,我们将拆分后的段落会再次通过步骤1进行过滤,便得到了一批经过规则过滤后的段落; +4. 然后,我们摘取了其中 1000 个段落,由经验丰富的专业开发人员来进行打标,获得高质量的打标数据; +5. 最后,我们根据打标后的结果来训练了一个打分模型来针对段落进行质量的打分,段落的向量模型选用了预训练好的中文版本的 Sentence-Bert,打分算法选用了逻辑回归,为了避免打分模型的误差,会再通过帕累托分布来根据段落的质量打分进行采样来决定要不要过滤这个段落。 +经过这个 Pipeline 后,我们最终沉淀下 15G 左右的数据来进行大模型的预训练加训。 \ No newline at end of file diff --git a/content/zh/docs/devops-model/2_quickstart.md b/content/zh/docs/devops-model/2_quickstart.md new file mode 100644 index 0000000..3db7ff9 --- /dev/null +++ b/content/zh/docs/devops-model/2_quickstart.md @@ -0,0 +1,66 @@ +--- +title: 快速使用 +slug: 快速使用 +description: 介绍主要功能 +url: "/docs/codefuse-devops-model-quickstart-zh" +aliases: +- "/docs/codefuse-devops-model-quickstart-zh" +--- + + +## 依赖安装 +需要先 PIP 安装一下 Github 地址下的 requirement.txt 中的包,可以参考一下代码 +pip install -r requirements.txt + + +## 模型下载 +模型下载相关信息如下: + 🤗 Huggingface 地址 + +| - | 基座模型 |对齐模型| +| -- | ---------- | ------- | +|7B| DevOps-Model-7B-Base| DevOps-Model-7B-Chat| +|14B| DevOps-Model-14B-Base| DevOps-Model-14B-Chat| + +🤖 ModelScope 地址 +| - | 基座模型 |对齐模型| +| -- | ---------- | ------- | +|7B | DevOps-Model-7B-Base |DevOps-Model-7B-Chat| +|14B| DevOps-Model-14B-Base| DevOps-Model-14B-Chat| + +找到自己想要下载的 Chat 模型版本,当前提供了 7B 和 14B 的模型 + + +## 模型使用 +根据以下代码来和 Chat 模型进行交互 +``` +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +tokenizer = AutoTokenizer.from_pretrained("path_to_DevOps-Model-Chat", trust_remote_code=True) + +model = AutoModelForCausalLM.from_pretrained("path_to_DevOps-Model-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() + +# 指定 generation_config +model.generation_config = GenerationConfig.from_pretrained("path_to_DevOps-Model-Chat", trust_remote_code=True) + +# 第一轮对话 +resp, hist = model.chat(query='你是谁', tokenizer=tokenizer, history=None) +print(resp) +# 我是 DevOps-Model,一个由蚂蚁集团平台技术事业群风险智能团队和北京大学联合研发的人工智能机器人,可以与用户进行自然语言交互,并协助解答 DevOps 全生命周期中的各种问题。如果您有任何需要协助的问题或者想要进行闲聊,都可以和我交流哦。 + +# 第二轮对话 +resp2, hist2 = model.chat(query='Java 中 HashMap 和 Hashtable 有什么区别', tokenizer=tokenizer, history=hist) +print(resp2) +# HashMap 和 Hashtable 都是 Java 中常用的哈希表实现,它们的主要区别在于: +# 1. Hashtable 是线程安全的,而 HashMap 不是线程安全的,因此在多线程环境下,Hashtable 的性能更稳定。 +# 2. Hashtable 中的方法都是同步的,而 HashMap 的方法不是同步的,因此在多线程环境下,Hashtable 的性能更好。 +# 3. Hashtable 中的 key 和 value 都必须实现 Serializable 接口,而 HashMap 中的 key 和 value 可以是任何对象,包括基本数据类型。 +# 4. 
Hashtable 的初始容量是 11,而 HashMap 的初始容量是 16。 +# 总之,如果需要在多线程环境下使用哈希表,并且需要保证线程安全,那么应该使用 Hashtable;如果不需要考虑线程安全,或者需要快速地进行哈希表操作,那么应该使用 HashMap。 + +# 第三轮对话 +resp3, hist3 = model.chat(query='线程安全代表什么', tokenizer=tokenizer, history=hist2) +print(resp3) +# 线程安全是指在多线程环境下,程序能够正确地处理并发访问,并且不会出现数据竞争、死锁、饥饿等异常情况。线程安全的程序可以保证在不同的线程之间共享同一个数据结构时,数据的正确性和一致性。线程安全的实现通常需要使用同步机制,如锁、原子操作等,来保证对共享数据的访问是线程安全的。在 Java 中,可以通过 synchronized 关键字、Lock 接口等机制来实现线程安全。 +``` \ No newline at end of file diff --git a/content/zh/docs/devops_eval/c1.data.md b/content/zh/docs/devops_eval/c1.data.md new file mode 100644 index 0000000..f9e1a14 --- /dev/null +++ b/content/zh/docs/devops_eval/c1.data.md @@ -0,0 +1,115 @@ +--- +title: 数据 +slug: 数据 +description: 介绍主要功能 +url: "docs/数据介绍" +aliases: +- "/docs/数据介绍" +--- + +## ⏬ 数据 +#### 下载 +* 方法一:下载zip压缩文件(你也可以直接用浏览器打开下面的链接): + ``` + wget https://huggingface.co/datasets/codefuse-admin/devopseval-exam/resolve/main/devopseval-exam.zip + ``` + 然后可以使用 pandas加载数据: + + ``` + import os + import pandas as pd + + File_Dir="devopseval-exam" + test_df=pd.read_csv(os.path.join(File_Dir,"test","UnitTesting.csv")) + ``` +* 方法二:使用[Hugging Face datasets](https://huggingface.co/datasets/codefuse-admin/devopseval-exam)直接加载数据集。示例如下: + ```python + from datasets import load_dataset + dataset=load_dataset(r"DevOps-Eval/devopseval-exam",name="UnitTesting") + + print(dataset['val'][0]) + # {"id": 1, "question": "单元测试应该覆盖以下哪些方面?", "A": "正常路径", "B": "异常路径", "C": "边界值条件","D": 所有以上,"answer": "D", "explanation": ""} ``` + +* 方法三:使用modelscope下载相关所有数据。示例如下: + ```python + from modelscope.msdatasets import MsDataset + MsDataset.clone_meta(dataset_work_dir='./xxx', dataset_id='codefuse-ai/devopseval-exam')``` + +#### 👀 说明 +为了方便使用,我们已经整理出了 55 个细分类别以及它们的中英文名称。具体细节请查看 [category_mapping.json](resources/categroy_mapping.json) 。格式如下: + +``` +{ + "UnitTesting.csv": [ + "unit testing", + "单元测试", + {"dev": 5, "test": 32} + "TEST" + ], + ... + "file_name":[ + "英文名称", + "中文名称", + "样本数量", + "类别(PLAN,CODE,BUILD,TEST,RELEASE,DEPOLY,OPERATE,MONITOR八选一)" + ] +} +``` +每个细分类别由两个部分组成:dev 和 test。每个细分类别的 dev 集包含五个示范实例以及为 few-shot 评估提供的解释。而 test 集则用于模型评估,并且test数据已包含准确标签。 + +下面是 dev 数据的示例,来自"版本控制"细分类别: +``` +id: 4 +question: 如何找到Git特定提交中已更改的文件列表? +A: 使用命令 `git diff --name-only SHA` +B: 使用命令 `git log --name-only SHA` +C: 使用命令 `git commit --name-only SHA` +D: 使用命令 `git clone --name-only SHA` +answer: A +explanation: +分析原因: +git diff --name-only SHA命令会显示与SHA参数对应的提交中已修改的文件列表。参数--name-only让命令只输出文件名,而忽略其他信息。其它选项中的命令并不能实现此功能。 +``` +#### 🔥 AIOps样本示例 +👀 👀 此处以日志解析和时序异常检测为例,对AIOps样本做一些简要的展示: + +日志解析 +``` +id: 0 +question: +下面是一些运行日志 + 0 04:21:15,429 WARN Cannot open channel to 2 at election address /10.10.34.12:3888 + 1 19:18:56,377 WARN ******* GOODBYE /10.10.34.11:52703 ******** + 2 19:13:46,128 WARN ******* GOODBYE /10.10.34.11:52308 ******** + 3 19:16:26,268 WARN ******* GOODBYE /10.10.34.11:52502 ******** + 4 09:11:16,012 WARN Cannot open channel to 3 at election address /10.10.34.13:3888 + 5 16:37:13,837 WARN Cannot open channel to 2 at election address /10.10.34.12:3888 + 6 09:09:16,008 WARN Cannot open channel to 3 at election address /10.10.34.13:3888 + 7 15:27:03,681 WARN Cannot open channel to 3 at election address /10.10.34.13:3888 +日志最前面三部分别为序号、时间戳和日志Level,在不考虑这三部分内容的情况下,此处我们设定日志的变量用'<*>'代替,token与token之间用空格分隔,那么请问上述日志的日志模版具体是什么? 
+A: Notification time out: <*> 和 Connection broken for id <*>, my id = <*>, error = +B: Send worker leaving thread 和 Connection broken for id <*>, my id = <*>, error = +C: Received connection request /<*>:<*> 和 Interrupting SendWorker +D: Cannot open channel to <*> at election address /<*>:<*> 和 ******* GOODBYE /<*>:<*> ******** +answer: D +explanation: 根据日志中的内容,选项D是最符合日志模板的。日志中包含了"Cannot open channel to <*> at election address /<*>:<*>"和"******* GOODBYE /<*>:<*> ********"这两个固定的模板片段,它们都在选项D中出现了。同时,其他选项中的模板片段与日志中的内容不匹配。因此,选项D是最符合日志模板的。 +``` +时序异常检测 +``` +id: 0 +question: +分析如下时间序列 +[50,62,74,84,92,97,99,98,94,87,77,65,265,40,28,17,8,3,0,0,4,10,20,31,43,56,68,79,89,95,99,99,96,91,82,71,59,46,34,22,12,5,1,0,2,7,15,25,37,49] +请找出其中明显异常点的下标。所谓的异常点一般指的是明显与数据整体趋势不符的点。 +A: 46 +B: 0 +C: 37 +D: 12 +answer: D +explanation: 根据分析,题目中的时间序列在12点出的值265要明显大于周围数据,存在着突增现象,因此选择D是正确的。 +``` +#### 🔧 ToolLearning样本示例 +工具学习样本的数据格式与OpenAI的函数调用格式兼容。 +详情请参阅[tool_learning_info_zh.md](/docs/devops_eval/tool_learning_info_zh.md)。 +工具学习评测过程,详情请参阅见 [tool_learning_evalution.md](/docs/devops_eval/tool_learning_evalution.md)。 +
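回到上文的选择题数据:下面给出一个基于 pandas 的示意代码,演示如何读取某个细分类别的 dev/test 数据,并用 dev 集的示范实例拼出一个 few-shot 提示词。其中 dev 目录名与提示词措辞均为假设,字段名沿用上文样例中的 question、A/B/C/D、answer,实际评测请以仓库提供的评测脚本为准。

```python
import os
import pandas as pd

File_Dir = "devopseval-exam"

# 读取某个细分类别的 dev 与 test 数据(此处以 UnitTesting 为例,dev 目录名为假设)
dev_df = pd.read_csv(os.path.join(File_Dir, "dev", "UnitTesting.csv"))
test_df = pd.read_csv(os.path.join(File_Dir, "test", "UnitTesting.csv"))

def format_example(row, with_answer: bool) -> str:
    """把一行数据拼成题目 + 选项(+ 答案)的文本,字段名与上文样例一致"""
    text = f"题目:{row['question']}\n"
    for choice in ["A", "B", "C", "D"]:
        text += f"{choice}. {row[choice]}\n"
    if with_answer:
        text += f"答案:{row['answer']}\n"
    return text

def build_few_shot_prompt(test_row, k_shot: int = 5) -> str:
    """用 dev 集的前 k_shot 条示范实例做 few-shot,再接上待回答的 test 题目(提示词措辞仅为示意)"""
    prompt = "以下是 DevOps 领域的单项选择题,请直接给出正确选项。\n\n"
    for _, row in dev_df.head(k_shot).iterrows():
        prompt += format_example(row, with_answer=True) + "\n"
    prompt += format_example(test_row, with_answer=False) + "答案:"
    return prompt

print(build_few_shot_prompt(test_df.iloc[0]))
```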
diff --git a/content/zh/docs/devops_eval/c2.evaluate.md b/content/zh/docs/devops_eval/c2.evaluate.md new file mode 100644 index 0000000..22f89ea --- /dev/null +++ b/content/zh/docs/devops_eval/c2.evaluate.md @@ -0,0 +1,73 @@ +--- +title: 评测 +slug: 评测 +description: 介绍主要功能 +url: "docs/codefuse-devops-eval-quickstart-zh" +aliases: +- "/docs/codefuse-devops-eval-quickstart-zh" +--- + +## 🚀 如何进行测试 +如果需要在自己的 HuggingFace 格式的模型上进行测试的话,总的步骤分为如下几步: +1. 编写 Model 的 loader 函数 +2. 编写 Model 的 context_builder 函数 +3. 注册模型到配置文件中 +4. 执行测试脚本 +如果模型在加载进来后不需要特殊的处理,而且输入也不需要转换为特定的格式(e.g. chatml 格式或者其他的 human-bot 格式),请直接跳转到第四步直接发起测试。 + +#### 1. 编写 loader 函数 +模型加载时还需要做一些额外的处理(e.g. tokenizer 调整),需要继承 `ModelAndTokenizerLoader` 类来覆写对应的 `load_model` 和 `load_tokenizer` 函数, 如下所示: +```python +class QwenModelAndTokenizerLoader(ModelAndTokenizerLoader): + def __init__(self): + super().__init__() + pass + + @override + def load_model(self, model_path: str): + # Implementation of the method + pass + + @override + def load_tokenizer(self, model_path: str): + # Implementation of the method + pass +``` +#### 2. 编写 Model 的 context_builder 函数 +如果输入需要转换为特定的格式(e.g. chatml 格式或者其他的 human-bot 格式),则需要继承 ContextBuilder 类来覆写 make_context 函数,如下所示: +```python +class QwenChatContextBuilder(ContextBuilder): + def __init__(self): + super().__init__() + + @override + def make_context(self, model, tokenizer, query: str, system: str = "hello!"): + # Implementation of the method + pass +``` +#### 3. 注册模型到配置文件中 +去 conf 中的 `model_conf.json`,注册对应的模型名和这个模型将要使用的 loader 和 context_builder,示例如下: +```json +{ + "Qwen-Chat": { + "loader": "QwenModelAndTokenizerLoader", + "context_builder": "QwenChatContextBuilder" + } +} +``` + +#### 4. 执行测试脚本 +直接运行以下代码发起测试 +```Bash +python src/run_eval.py \ +--model_path path_to_model \ +--model_name model_name_in_conf \ +--model_conf_path path_to_model_conf \ +--eval_dataset_list all \ +--eval_dataset_fp_conf_path path_to_dataset_conf \ +--eval_dataset_type test \ +--data_path path_to_downloaded_devops_eval_data \ +--k_shot 0 +``` +👀 👀 具体评测流程见📖 [**数据集评测教程**](/docs/devops_eval/tutorial_zh.md) +
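作为对上面第 2 步的补充,下面给出 make_context 的一个示意实现:假设模型使用 chatml 风格的对话格式,把 system 与用户问题拼成最终输入后返回。类名与函数签名沿用上文示例,函数体为补充示意,真实的特殊 token 与返回值约定请以对应模型和仓库代码为准。

```python
class QwenChatContextBuilder(ContextBuilder):
    def __init__(self):
        super().__init__()

    @override
    def make_context(self, model, tokenizer, query: str, system: str = "hello!"):
        # 示意:按 chatml 风格拼接 system 与 query,返回拼接后的文本
        prompt = (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{query}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )
        return prompt
```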
diff --git a/content/zh/docs/devops_eval/tool_learning_evalution.md b/content/zh/docs/devops_eval/tool_learning_evalution.md new file mode 100644 index 0000000..6583de3 --- /dev/null +++ b/content/zh/docs/devops_eval/tool_learning_evalution.md @@ -0,0 +1,224 @@ +## tool learning 数据集评测教程 + +### chatml接入方式 +如果需要在自己的 huggingface 格式的模型上进行测试的话,总的步骤分为如下几步: +1. 编写 ~/evals/FuncCallEvalution 的 create_prompts 函数 +2. 编写 ~/models/base_model 的 相关函数 +3. 注册模型和评估函数 +4. 执行测试脚本 +如果模型在加载进来后不需要特殊的处理,而且输入也不需要转换为特定的格式(e.g. chatml 格式或者其他的 human-bot 格式),请直接跳转到第四步直接发起测试。 + +#### 1. 编写 loader 函数 +如果模型在加载进来还需要做一些额外的处理(e.g. tokenizer 调整),需要去 `src.context_builder.context_builder_family.py` 中继承 `ModelAndTokenizerLoader` 类来覆写对应的 `load_model` 和 `load_tokenizer` 函数,具体可以参照以下示例: +```python +class FuncCallEvalution(ToolEvalution): + + def create_prompts(self, func_call_datas): + ''' + datas: [ + { + "instruction": history[his_idx], + "input": "", + "output": output, + "history": [(human_content, ai_content), (), ()], + "functions": tools + } + ] + ''' + system_content = '''CodeFuse是一个面向研发领域的智能助手,旨在中立的、无害的帮助用户解决开发相关的问题,所有的回答均使用Markdown格式返回。 + 你能利用许多工具和功能来完成给定的任务,在每一步中,你需要分析当前状态,并通过执行函数调用来确定下一步的行动方向。你可以进行多次尝试。如果你计划连续尝试不同的条件,请每次尝试一种条件。若给定了Finish函数,则以Finish调用结束,若没提供Finish函数,则以不带function_call的对话结束。''' + function_format = '''You are ToolGPT, you have access to the following APIs:\n{tools}''' + + func_call_train_datas = [] + history_error_cnt = 0 + funccall_error_cnt = 0 + + for data in func_call_datas: + tools = data["functions"] + chatrounds = data["chatrounds"] + + function_content = "" + if len(tools) > 0: + function_content = function_format.format(tools=json.dumps(tools, ensure_ascii=False, sort_keys=True)) + + history = [] + for i in chatrounds: + if i["role"]=="system": + continue + + if i["role"]=="user": + history.append(("user", i["content"])) + + if i["role"] == "assistant": + if "function_call" in i: + if not isinstance(i["function_call"], dict): + funccall_error_cnt+=1 + continue + content = "#function" + json.dumps({**{"content": i["content"]}, **i["function_call"]}, ensure_ascii=False) + else: + content = i["content"] + history.append(("assistant", content)) + + + if i["role"] == "function": + content = json.dumps({**{"content": i["content"]}, **{"name": i["name"]}}, ensure_ascii=False) + history.append(("user", content)) + + + history = [i[1] for i in history] + history[0] = "\n".join([system_content,function_content, history[0]]) + + for his_idx in range(0, len(history), 2): + output = history[his_idx+1] + + if "#function" in output: + output = output.split("#function")[-1] + + try: + output = json.loads(output) + except: + output = {"content": output} + + + func_call_train_datas.append( + { + "instruction": history[his_idx], + "input": "", + "output": output, + "history": [history[:his_idx+2][i:i+2] for i in range(0, len(history[:his_idx]), 2)], + "functions": tools + }, + ) + return func_call_train_datas +``` + +#### 2. 编写 Model 的 context_builder 函数 +如果输入需要转换为特定的格式(e.g. 
chatml 格式或者其他的 human-bot 格式),则需要去 `src.context_builder.context_builder_family` 中继承 ContextBuilder 类来覆写 make_context 函数,这个函数是用来将输入转换格式为对应需要的输出的,一个示例如下: +```python +class ToolModel: + def __init__(self, model_path: str, template: str, trust_remote_code=True, tensor_parallel_size=1, gpu_memory_utilization=0.25): + self.model_path = model_path + self.trust_remote_code = trust_remote_code + self.tensor_parallel_size = tensor_parallel_size + self.gpu_memory_utilization = gpu_memory_utilization + self.load_model(self.model_path, self.trust_remote_code, self.tensor_parallel_size, self.gpu_memory_utilization) + + def generate(self, prompts: str, template: str = None, generate_configs: GenerateConfigs = None) -> list: + '''产出对应结果''' + pass + + def generate_params( + self, generate_configs: GenerateConfigs, + ): + '''generate param''' + kargs = generate_configs.dict() + return kargs + + def load_model(self, model_path, trust_remote_code=True, tensor_parallel_size=1, gpu_memory_utilization=0.25): + '''加载模型''' + self.tokenizer = AutoTokenizer.from_pretrained(self.model_path, trust_remote_code=trust_remote_code) + self.model = AutoModelForCausalLM.from_pretrained(self.model_path, device_map="auto", trust_remote_code=trust_remote_code).eval() + + # self.model = LLM(model=model_path, trust_remote_code=trust_remote_code, tensor_parallel_size=tensor_parallel_size, gpu_memory_utilization=gpu_memory_utilization) +``` + +#### 3. 注册模型和eval函数即可 +在 ~/models/__init__.py 中注册即可 +```python +from .base_model import ToolModel + +__all__ = [ + "ToolModel", +] +``` +在 ~/evasl/__init__.py 中注册即可 +```python +from .base_evalution import ToolEvalution +from .toolfill_evalution import ToolFillEvalution +from .toolparser_evalution import ToolParserEvalution +from .toolsummary_evalution import ToolSummaryEvalution +from .func_call_evalution import FuncCallEvalution + + +__all__ = [ + "ToolEvalution", "ToolFillEvalution", "ToolParserEvalution", "ToolSummaryEvalution", "FuncCallEvalution" +] +``` + + +#### 4. 执行测试脚本 +修改 ~/src/qwen_eval_main.py# datainfos和model_infos +```python +model_infos = [ + {"model_name": "", "template": "chatml", "model_path": "", + "peft_path": "", "model_class": QwenModel}] + +datainfos = [ + {"dataset_path": "~/fcdata_luban_zh_test.jsonl", "dataset_name": "fcdata_luban_zh", "tool_task": "func_call"}, + {"dataset_path": "~/test_datas/fcdata_zh_test_v1.jsonl", "dataset_name": "fcdata_zh", "tool_task": "func_call"}, +] +``` + +运行下述命令即可 +```Bash +python qwen_eval_main.py +``` + +
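
其中 model_infos / datainfos 里的空字段需要按本地环境补全,下面给出一份填写后的示意(模型名称与路径均为假设,仅供参考):

```python
# 示意配置:以下名称与路径均为假设,请按本地实际环境修改
model_infos = [
    {
        "model_name": "Qwen-7B-Chat",           # 自定义的模型名称
        "template": "chatml",                   # 使用的对话模板
        "model_path": "/path/to/Qwen-7B-Chat",  # 本地模型权重目录
        "peft_path": "",                        # 如有 LoRA/PEFT 权重则填写路径,否则留空
        "model_class": QwenModel,
    },
]

datainfos = [
    {
        "dataset_path": "/path/to/fcdata_zh_test_v1.jsonl",  # 评测数据集路径
        "dataset_name": "fcdata_zh",
        "tool_task": "func_call",
    },
]
```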
+ +### 非chatml接入 +如果需要在自己的 huggingface 格式的模型上进行测试的话,总的步骤分为如下几步: +1. 编写 ~/getAssistantAns.py 相关代码 +2. 执行测试脚本 + + +#### 1、编写 getAssistantAns 示例 +``` +class GetAssistantAns(): + # 按照自己推理需求自己修改代码 + + def __init__(self, gpu_num=1): + model = AutoModelForCausalLM.from_pretrained(model_name) + device_list = [] + for gpu_idx in range(gpu_num): + device_list.append(torch.device("cuda:0")) + + # 将模型移动到指定的GPU设备 + model.to(device) + + + def gen_answer(self, chat_dict, gpu_index): + # 这里实际根据自己推理逻辑 然后转为标准格式返回 + # 以下仅仅是样例 + import time + print(os.environ["CUDA_VISIBLE_DEVICES"]) + time.sleep(1) + rtn_dict1 = { + "role": "assistant", + "content": None, + "function_call": + { + "name": "get_fudan_university_scoreline", + "arguments": "{\n \"year\": \"2020\"\n}" + } + } + + rtn_dict2 = { + "role": "assistant", + "content": "2020年复旦大学的分数线如下:\n\n- 文科一批:630分\n- 文科二批:610分\n- 理科一批:650分\n- 理科二批:630分" + } + + return random.choice([rtn_dict1, rtn_dict2]) +``` +#### 2、执行测试脚本 +修改 ~/src/opensource_functioncall_evalution.py # test_ans_file_list +```python +test_ans_file_list = [ + "fcdata_zh_test.jsonl" + ] +``` + +运行下述命令即可 +```Bash +python opensource_functioncall_evalution.py +``` diff --git a/content/zh/docs/devops_eval/tool_learning_info_zh.md b/content/zh/docs/devops_eval/tool_learning_info_zh.md new file mode 100644 index 0000000..d3db092 --- /dev/null +++ b/content/zh/docs/devops_eval/tool_learning_info_zh.md @@ -0,0 +1,87 @@ +### 数据样例 +在数据上我们完全兼容了 OpenAI Function Calling,具体格式如下: + +**Function Call的数据格式** + +| Input Key | Input Type | Input Description | +| --- | --- | --- | +| functions | List[Swagger] | 工具集合 | +| chatrounds | List[chatround] | 多轮对话数据 | + +**chatrounds的数据格式** + +| Input Key | Input Type | Input Description | +| --- | --- | --- | +| role | string | 角色名称,包含三种类别,user、assistant、function | +| name | string | 若role为function,则存在name字段,为function的名称 | +| content | string | role的返回内容 | +| function_call | dict | 工具调用 | + +``` +{ + "functions": + [ + { + "name": "get_fudan_university_scoreline", + "description": "查询复旦大学往年分数线,例如:查询2020年复旦大学的分数线", + "parameters": + { + "type": "object", + "properties": + { + "year": + { + "type": "string", + "description": "年份,例如:2020,2019,2018" + } + }, + "required": + [ + "year" + ] + } + } + ], + "chatrounds": + [ + { + "role": "system", + "content": "CodeFuse是一个面向研发领域的智能助手,旨在中立的、无害的帮助用户解决开发相关的问题,所有的回答均使用Markdown格式返回。\n你能利用许多工具和功能来完成给定的任务,在每一步中,你需要分析当前状态,并通过执行函数调用来确定下一步的行动方向。你可以进行多次尝试。如果你计划连续尝试不同的条件,请每次尝试一种条件。若给定了Finish函数,则以Finish调用结束,若没提供Finish函数,则以不带function_call的对话结束。" + }, + { + "role": "user", + "content": "查询2020年复旦大学的分数线" + }, + { + "role": "assistant", + "content": null, + "function_call": + { + "name": "get_fudan_university_scoreline", + "arguments": "{\n \"year\": \"2020\"\n}" + } + }, + { + "role": "function", + "name": "get_fudan_university_scoreline", + "content": "{\n \"scoreline\":{\n \"文科一批\": 630, \n \"文科二批\": 610, \n \"理科一批\": 650, \n \"理科二批\": 630 \n }\n}" + }, + { + "role": "assistant", + "content": "2020年复旦大学的分数线如下:\n\n- 文科一批:630分\n- 文科二批:610分\n- 理科一批:650分\n- 理科二批:630分" + } + ] +} +``` + +上述Function Call的数据样例为给定特定工具集后,用于回答用户查询某高校录取分数线的问题。 + + +### 评测指标 +由于一般通用模型无法具备工具调用的能力,因此在进行Tool Learn-Eval评测之前需要对通用模型进行微调,先让模型学会工具使用的基本范式 + +下面,我们定义了几种评估工具使用的指标: + + + +②③④⑤的和为1,代表工具调用失败的总数,⑤工具幻觉是工具名识别失败的一种特殊情况 \ No newline at end of file diff --git a/content/zh/docs/devops_eval/tutorial_zh.md b/content/zh/docs/devops_eval/tutorial_zh.md new file mode 100644 index 0000000..ca303e3 --- /dev/null +++ b/content/zh/docs/devops_eval/tutorial_zh.md @@ -0,0 +1,135 @@ +## 
数据集评测教程 + +## 🚀 如何进行测试 +如果需要在自己的 huggingface 格式的模型上进行测试的话,总的步骤分为如下几步: +1. 编写 Model 的 loader 函数 +2. 编写 Model 的 context_builder 函数 +3. 注册模型到配置文件中 +4. 执行测试脚本 +如果模型在加载进来后不需要特殊的处理,而且输入也不需要转换为特定的格式(e.g. chatml 格式或者其他的 human-bot 格式),请直接跳转到第四步直接发起测试。 + +#### 1. 编写 loader 函数 +如果模型在加载进来还需要做一些额外的处理(e.g. tokenizer 调整),需要去 `src.context_builder.context_builder_family.py` 中继承 `ModelAndTokenizerLoader` 类来覆写对应的 `load_model` 和 `load_tokenizer` 函数,具体可以参照以下示例: +```python +class QwenModelAndTokenizerLoader(ModelAndTokenizerLoader): + def __init__(self): + super().__init__() + pass + + def load_model(self, model_path: str): + model = super().load_model(model_path) + model.generation_config = GenerationConfig.from_pretrained(model_path) + return model + + def load_tokenizer(self, model_path: str): + tokenizer = super().load_tokenizer(model_path) + + # read generation config + with open(model_path + '/generation_config.json', 'r') as f: + generation_config = json.load(f) + tokenizer.pad_token_id = generation_config['pad_token_id'] + tokenizer.eos_token_id = generation_config['eos_token_id'] + return tokenizer +``` + +#### 2. 编写 Model 的 context_builder 函数 +如果输入需要转换为特定的格式(e.g. chatml 格式或者其他的 human-bot 格式),则需要去 `src.context_builder.context_builder_family` 中继承 ContextBuilder 类来覆写 make_context 函数,这个函数是用来将输入转换格式为对应需要的输出的,一个示例如下: +```python +class QwenChatContextBuilder(ContextBuilder): + def __init__(self): + super().__init__() + + def make_context( + self, + model, + tokenizer, + query: str, + system: str = "you are a helpful assistant" + ): + ''' + model: PretrainedModel + tokenizer: PretrainedTokenzier + query: Input string + system: System prompt if needed + ''' + im_start, im_end = "<|im_start|>", "<|im_end|>" + im_start_tokens = [tokenizer.im_start_id] + im_end_tokens = [tokenizer.im_end_id] + nl_tokens = tokenizer.encode("\n") + + def _tokenize_str(role, content): + return f"{role}\n{content}", tokenizer.encode( + role, allowed_special=set() + ) + nl_tokens + tokenizer.encode(content, allowed_special=set()) + + system_text, system_tokens_part = _tokenize_str("system", system) + system_tokens = im_start_tokens + system_tokens_part + im_end_tokens + + raw_text = "" + context_tokens = [] + + context_tokens = system_tokens + context_tokens + raw_text = f"{im_start}{system_text}{im_end}" + raw_text + context_tokens += ( + nl_tokens + + im_start_tokens + + _tokenize_str("user", query)[1] + + im_end_tokens + + nl_tokens + + im_start_tokens + + tokenizer.encode("assistant") + + nl_tokens + ) + raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n" + return raw_text, context_tokens +``` + +#### 3. 注册模型到配置文件中 +去 conf 中的 `model_conf.json`,注册对应的模型名和这个模型将要使用的 loader 和 context_builder,其中 loader 和 context_builder 写第一步和第二步中自定义的类名就可以,示例如下: +```json +{ + "Qwen-Chat": { + "loader": "QwenModelAndTokenizerLoader", + "context_builder": "QwenChatContextBuilder" + } +} +``` + + +#### 4. 
执行测试脚本 +直接运行以下代码发起测试 +```Bash +# model_path: 要测试的模型路径 +# model_name: 模型配置文件对应的模型命名,默认为 Default ,代表走默认的 loader 和 context_builder +# model_conf_path: 模型配置文件的地址,一般就为 conf 路径下的 devopseval_dataset_fp.json +# eval_dataset_list: 要测试的数据集名称,默认 all,全部测试,如果需要测试单个或者多个,用 # 符号链接,示例:dataset1#dataset2 +# eval_dataset_fp_conf_path: 数据集配置地址 +# eval_dataset_type: 测试哪种类型,只支持默认 test 类型的测试集 +# data_path: 评测数据集地址,填写下载数据集后的地址就可以 +# k_shot: 支持 0-5,代表 few-shot 会给模型前缀加的示例数量 + + +python src/run_eval.py \ +--model_path path_to_model \ +--model_name model_name_in_conf \ +--model_conf_path path_to_model_conf \ +--eval_dataset_list all \ +--eval_dataset_fp_conf_path path_to_dataset_conf \ +--eval_dataset_type test \ +--data_path path_to_downloaded_devops_eval_data \ +--k_shot 0 +``` + +举个🌰:比如评测数据集下载到了 `folder1`,代码放在了 `folder2`,模型在 `folder3`,模型不需要自定义 loader 和 context_builder,需要测试所有的数据集的 zero-shot 得分,那可以按照以下脚本发起测试: +```Bash +python folder2/src/run_eval.py \ +--model_path folder3 \ +--model_name Default \ +--model_conf_path folder1/conf/model_conf.json \ +--eval_dataset_list all \ +--eval_dataset_fp_conf_path folder1/conf/devopseval_dataset_fp.json \ +--eval_dataset_type test \ +--data_path folder2 \ +--k_shot 0 +``` +
\ No newline at end of file diff --git a/content/zh/docs/mftcoder/1_introduction.md b/content/zh/docs/mftcoder/1_introduction.md new file mode 100644 index 0000000..625a21b --- /dev/null +++ b/content/zh/docs/mftcoder/1_introduction.md @@ -0,0 +1,34 @@ +--- +title: MFTCoder 介绍 +slug: MFTCoder 介绍 +description: MFTCoder 介绍 +url: /docs/mftcoder-introduction-zh +aliases: +- "/docs/mftcoder-introduction-zh" +--- + +## 项目简介 +**国际首个高精度、高效率、多任务、多模型支持、多训练算法,大模型代码能力微调框架;** + +**Codefuse-MFTCoder** 是一个开源的多任务代码大语言模型项目,包含代码大模型的模型、数据、训练等。我们希望通过开源,分享交流大语言模型在代码领域的进步。 + +### 项目框架 +![img_1.jpg](/images/mftcoder/img_1.jpg) + +### 项目优势 +:white_check_mark: **多任务**:一个模型同时支持多个任务,会保证多个任务之间的平衡,甚至可以泛化到新的没有见过的任务上去; + +:white_check_mark: **多模型**:支持最新的多个开源模型,包括gpt-neox,llama,llama-2,baichuan,Qwen,chatglm2等; + +:white_check_mark: **多框架**:既支持主流开源的Accelerate+DeepSpeed/FSDP,也支持新开源的[ATorch 框架](https://github.com/intelligent-machine-learning/dlrover); + +:white_check_mark: **高效微调**:支持LoRA和QLoRA,可以用很少的资源去微调很大的模型,且训练速度能满足几乎所有微调场景; + + +本项目主要内容如下: +- 同时支持单任务SFT(Supervised FineTuning)和MFT(Multi-task FineTuning), 当前开源支持数据均衡,未来将持续开源难易均衡, 收敛均衡等 +- 支持QLoRA低成本高效指令微调、LoRA高效指令微调、全量参数高精度微调。 +- 支持绝大部分主流的开源大模型,重点关注代码能力优秀的开源大模型,如DeepSeek-coder, Mistral, Mistral(MoE), Chatglm3, Qwen, GPT-Neox, Starcoder, Codegeex2, Code-LLaMA等。 +- 支持lora与base model进行权重合并,推理更便捷。 +- 整理并开源2个指令微调数据集:[Evol-instruction-66k](https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k)和[CodeExercise-Python-27k](https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k)。 +- 开源多个[Codefuse系列指令微调模型权重],具体参见我们的huggingface组织和modelscope组织下的模型:[codefuse-ai huggingface](https://huggingface.co/codefuse-ai) or [codefuse-ai 魔搭](https://modelscope.cn/organization/codefuse-ai)。 \ No newline at end of file diff --git a/content/zh/docs/mftcoder/2_quickstart.md b/content/zh/docs/mftcoder/2_quickstart.md new file mode 100644 index 0000000..32cc86c --- /dev/null +++ b/content/zh/docs/mftcoder/2_quickstart.md @@ -0,0 +1,53 @@ +--- +title: QuickStart +slug: QuickStart +description: QuickStart Document +url: /docs/mftcoder-quickstart-zh +aliases: +- "/docs/mftcoder-quickstart-zh" +--- + + +## 环境 +首先, 你需要将CUDA(>=11.4, 推荐11.7)及其相关驱动安装成功,并确保其工作正常, 并且安装基本的torch(>=2.0.0) +在requirements.txt下固定了几个主要的python包的版本,执行如下脚本即可: +```bash +sh init_env.sh +``` +我们强烈建议您安装flash attention(>=2.1.0, 推荐2.3.6), 安装请参考 https://github.com/Dao-AILab/flash-attention + +## 训练 +如果你熟悉大模型训练的各种主流开源资源,例如 ```transformers```, ```DeepSpeed```, ```FSDP```等, 为了用开源项目快速上手高性能微调,我们建议您尝试: + +🚀🚀 [MFTCoder-accelerate: Accelerate + DeepSpeed/FSDP Codebase for MFT(Multi-task Finetuning)](/docs/mftcoder-accelerate-zh) + + +如果你想探索一些新兴的训练框架,可以尝试: + +🚀 [MFTCoder-atorch: Atorch Codebase for MFT(Multi-task Finetuning)](/docs/mftcoder-atorch-zh) + + +## 模型 + +使用本项目的训练代码,以及上述训练数据,我们训练并在huggingface, modelscope开源了以下模型。 + +| 模型 | HuggingFace链接 | 魔搭 链接 | 基座模型 | 训练数据 | Batch Size | Seq Length | +|--------------------------------------|---------------------------------------------------------------------------|---------------------------------------------------------------------------------|----------------------|------|------------|------------| +| 🔥🔥🔥 CodeFuse-DeepSeek-33B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B) | DeepSeek-coder-33B | 60万 | 80 | 4096 | +| 🔥🔥🔥 CodeFuse-Mixtral-8x7B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-Mixtral-8x7B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-Mixtral-8x7B) | 
Mixtral-8x7B | 60万 | 80 | 4096 | +| 🔥🔥🔥 CodeFuse-CodeLlama-34B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B) | CodeLlama-34b-Python | 60万 | 80 | 4096 | +| 🔥🔥🔥 CodeFuse-CodeLlama-34B-4bits | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits) | CodeLlama-34b-Python | | | 4096 | +| 🔥🔥🔥 CodeFuse-StarCoder-15B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-StarCoder-15B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-StarCoder-15B) | StarCoder-15B | 60万 | 80 | 4096 | +| 🔥🔥🔥 CodeFuse-QWen-14B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B) | Qwen-14b | 110万 | 256 | 4096 | +| 🔥🔥🔥 CodeFuse-CodeGeex2-6B | [h-link](https://huggingface.co/codefuse-ai/CodeFuse-CodeGeex2-6B) | [m-link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeGeex2-6B) | CodeGeex2-6B | 110万 | 256 | 4096 | + + + + +## 数据集 +目前本项目主要整理了如下指令数据集,并将其整理成统一的数据格式,这两个指令微调数据集是我们多任务训练中数十个任务中的2个,未来我们会陆续开源更多的代码任务指令微调数据集: + +| 数据集 | 介绍 | +|---------------------------------------------------------------|--------------------------------------------------------------------| +| [⭐ Evol-instruction-66k](https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k) | 基于开源open-evol-instruction-80k过滤低质量,重复和human eval相似的数据后得到的高质量代码类微调数据 | +| [⭐ CodeExercise-Python-27k](https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k) | 高质量python练习题数据 \ No newline at end of file diff --git a/content/zh/docs/mftcoder/3_accelerate.md b/content/zh/docs/mftcoder/3_accelerate.md new file mode 100644 index 0000000..e65a2b4 --- /dev/null +++ b/content/zh/docs/mftcoder/3_accelerate.md @@ -0,0 +1,294 @@ +--- +title: "MFTCoder: Accelerate + DeepSpeed/FSDP 框架篇" +description: 介绍主要功能 +url: /docs/mftcoder-accelerate-zh +aliases: +- "/docs/mftcoder-accelerate-zh" +--- + + +[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai) + + GitHub + + +[**中文**] [[English]](/docs/mftcoder-accelerate) + +## 1. 更新 +🔥 MFTCoder-accelerate 新增支持accelerate + FSDP框架, 支持全量微调和LoRA; + +🔥 MFTCoder-accelerate 支持最新更多主流开源模型: mistral, mixtral-8x7b(Mixture of Experts), deepseek, chatglm3; + +🔥 MFTCoder-accelerate 新增self-paced Loss, 用于收敛均衡; + +🔥 MFTCoder-accelerate 支持使用accelerate + DeepSpeed框架下支持 全量参数/QLoRA/LoRA微调; + +🔥 MFTCoder-accelerate 在训练中支持了多任务微调MFT, 可以同时平衡多个任务的训练,训练的模型支持多任务推理; + +🔥 MFTCoder-accelerate 在训练中支持多种模型基座: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen等 + +## 2. 数据格式 +### 2.1 训练数据格式 +训练数据为jsonl格式,每一行的数据格式如下,其中chat_rounds字段是必需的,可以根据实际需求添加或删除其他字段。 +可以参考项目中的xxx.jsonl文件。 +```json +{ + "id":0, + "data_name":"code-helper", + "chat_rounds":[ + { + "role": "system", + "content": "你是一个智能代码助手,可以回复用户与代码相关的问题" + }, + { + "role": "human", + "content": "写一个快速排序" + }, + { + "role": "bot", + "content": "以下是一个快速排序算法xxxxxx" + }, + { + "role": "human", + "content": "解释一下这段代码" + }, + { + "role": "bot", + "content": "好的,这段代码xxx" + } + ] +} +``` + +### 2.2 推理数据格式 +推理数据格式为模型在训练数据格式下拼接的字符串形式,它也是推理时输入prompt拼接的方式: +``` +""" +system +这是System指令 +human +这是第1轮用户输入的问题 +bot +这是第1轮模型生成的内容{EOS_TOKEN} +human +这是第2轮用户输入的问题 +bot +这是第2轮模型生成的内容{EOS_TOKEN} +... +... +... +human +这是第n轮用户输入的问题 +bot +{模型现在要生成的内容}{EOS_TOKEN} +""" +``` + + +## 3. 
模型训练 +目前支持全量参数(Full-parameters)指令微调、QLoRA指令微调,LoRA指令微调。 +一些优秀的代码预训练模型权重,理论上,HuggingFace上开源的模型,均可使用本项目进行训练: + +🤗 [最新代码预训练SOTA,CodeLlama](https://huggingface.co/codellama/CodeLlama-34b-Python-hf) :code-llama-34b, code-llama-34b-python, 新的SOTA基座。 + +🤗 [10B级别最佳代码预训练模型Starcoder](https://huggingface.co/bigcode/starcoder) wizardCoder-15B, PanGu-coder2等前SOTA的基座模型。 + +🤗 [多语言能手Qwen-7b](https://huggingface.co/Qwen/Qwen-7B) :适用于多语言任务,也适用中文任务。进行指令微调时。 + +**mftcoder_accelerate文件结构** +``` +mftcoder_accelerate + | + src + configs + | + data + | + model + | + *pefts* + | + tokenizer + | + utils + | + evals +``` +我们将训练中使用的各种组件抽取出来,以便后续的扩展和优化, 详见```src```目录下的实现。 + +训练入口文件是```mftcoder_accelerate/src/pefts/mft_accelerate.py``` + +参数配置存储在```mftcoder_accelerate/src/configs```目录下,方便统一管理和更改。 + +**_所以,在你开启训练之前,请进入src目录_** +``` +cd mftcoder_accelerate/src +``` + + + +### 3.1 数据tokenization +训练时,我们将多轮对话拼接成如下格式(也是上文中的推理数据格式),然后进行tokenize。 +其中,默认情况下: + +```human\n```作为human/user的起始符,```bot\n```作为bot/assistant的起始符,```{EOS_TOKEN}``` 表示eos_token。 +其中eos_token可以根据不同模型修改替换。不同角色的起始符可以配置,用来实现不同的对话/问答模版。 +``` +"human\n{input1}bot\n{target1}{EOS_TOKEN}human\n{input2}bot\n{target2}{EOS_TOKEN}\n" +``` +在计算loss时,我们通过loss mask的方式,input部分的loss不参与参数更新,只有“target{EOS_TOKEN}”部分的loss参与参数更新。 +这种方式充分利用了模型并行计算的优势,训练更加高效,同时也充分利用了decoder-only模型从左到右attention的特性,一次性将多轮对话中的每个target部分都参与了训练,训练更充分高效。 + +### 3.2 LoRA/QLoRA微调 + +#### LoRA/QLoRA微调简介 +关于LoRA的详细介绍可参考论文:[LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf) + +关于QLoRA的详细介绍可参考论文:[QLORA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf) + +QLoRA通过4-bit的nf4量化,且加入更多adapter,在大幅减少显存消耗的同时,尽可能逼近全量参数微调的效果。 +QLoRA论文指出,该方法可以在一张V100上对33B的模型进行微调,并且性能逼近全量参数微调。 + +执行如下命令即可进行 Lora/QLora/全量 微调: +#### Launch via Deepspeed +DeepSpeed配置在accelerate_ds_config.yaml中。 +```bash +accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "DeepSpeed" +``` +或者 + +DeepSpeed配置在脚本中通过命令行输入。 +```bash +sh ds_single_launch.sh +``` + +#### Launch via FSDP +FSDP配置在accelerate_fsdp_config.yaml中。 +```bash +accelerate launch --config_file accelerate_fsdp_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "FSDP" +``` +或者 + +FSDP配置在脚本中通过命令行输入。 +```bash +sh fsdp_single_launch.sh +``` + +#### 训练参数 +_**训练需要的参数配置在```configs/*_train_config```中,主要参数说明如下:**_ + +- **load_raw_dataset**: 需要保持true,后续会支持其它模式数据,当前仅支持jsonl输入 +- **data_paths**: "[path1,path2,path3]" 输入数据地址,字符串,开头结尾用[],中间用```,```间隔不同path,每个path是一个目录,目录的最后一级名字作为任务名称,下面包含1到多个jsonl数据 +- **output_dir**:训练输出目录,存储checkpoint(全量训练时)、lora_adaptor(Lora或者Qlora时)等 +- **tb_dir**: 存储tensorboard等 +- **model_type**: "mixtral|mistral|deepseek|llama|starcoder|chatglm2|qwen|gpt_neox" +- **attn_implementation**: "flash_attention_2" 或者 "eager" +- **peft_type**: lora或者qlora或者null(全量微调) +- **lora_rank**: lora rank +- **lora_alpha**: lora alpha +- **lora_dropout**: lora dropout +- **target_modules**: List[str], lora目标模块,如果null,会使用默认,参考model_mapping.py +- **quantization**: 是否量化,"4bit", "8bit" 或者null, qlora推荐4bit量化 +- **pretrained_model_path**:预训练模型的本地目录,或者在huggingface上的模型名称。 +- **weighted_loss_mode**: 多任务loss加权模式, "case3"是当前推荐。 +- **padding_mode**: 数据的样本组织方式, "padding"是将每个原始样本填充到seq_length, "pack"是将尽量多的样本打包到每个seq_length的序列中。 +- **num_train_epochs**:训练的轮次。如果数据量足够大,一般建议只训1-2个epoch。 +- **per_device_train_batch_size**:每张显卡train的batch size。 +- **per_device_eval_batch_size**:每张显卡eval的batch size。 +- 
**gradient_accumulation_steps**:梯度累计步数。global batch=num_gpus * per_device_train_batch_size * gradient_accumulation_steps。 +- **learning_rate**:学习率。全量参数微调的时候,建议小一些,1e-5或5e-6。qlora中的学习率设置更大一些,一般为1e-4、2e-4。 +- **min_lr**: 最低学习率, 一般是learning_rate的十分之一 +- **seq_length**:训练时的最大长度。按照自己的设备进行设置,越长需要占用越多显存。 +- **log_interval**:每隔多少步统计一次train loss。 +- **checkpointing_steps**:每隔多少步保存一个模型。 +- **evaluation_steps**:每隔多少步在验证集上evaluate一次。 +- **early_stopping** : 是否执行early_stop +- **early_stopping_stall_num**: 多少个eval point不继续收敛,则停止训练 +- **lr_scheduler_type**:学习率变化策略。常用"cosine" +- **warmup_steps**:warm up步数。学习率经过多少步,增长到指定的数值。 +- **seed**:随机种子,用于复现实验结果。 +- **saving_limit**:整数,ckpt存储数量上限, 全量训练必须设置。默认null即不限制数量。 +- **role_markers**: null,即使用{"system": "\system\n", "user": "\human\n", "assistant": "\bot\n"}。 你可以自定义 "system", "user" and "assistant"的模板, 用于定制自己的问答或者对话模板,比如 {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"} + +## 4. 模型使用 + +### 4.1 权重合并 +如果使用LoRA或者QLoRA进行训练,本项目仅保存adapter的权重和配置文件,需要将adapter权重与base model进行合并。 +可以使用如下merge_base_and_lora_to_hf.py脚本。 +``` +python pefts/merge_base_and_lora_to_hf.py \ + --base_model_or_path model_path \ + --adaptor_path lora_adapter_path \ + --model_type model_type \ + --merged_output_path output_path +``` + +### 4.2 模型推理 +我们提供了单轮对话和多轮对话的如下脚本,该脚本可同时兼容大部分huggingface格式的模型。 +```python +from transformers import ( + AutoTokenizer, + AutoModelForCausalLM, +) +model_name_or_path = "codefuse-ai/CodeFuse-Deepseek-33B" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, padding_side="left") +tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("<|end▁of▁sentence|>") +tokenizer.pad_token_id = tokenizer.eos_token_id +model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True) + +HUMAN_ROLE_START_TAG = "human\n" +BOT_ROLE_START_TAG = "bot\n" +texts = ["write a python function of quick sort."] +texts = [f"{HUMAN_ROLE_START_TAG}{text}{BOT_ROLE_START_TAG}" for text in texts] + +inputs = tokenizer(texts, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda") +outputs = model.generate( + inputs=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_new_tokens=512, + top_p=0.95, + temperature=0.1, + do_sample=True, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id + ) +gen_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True) +print(gen_text) +``` + + +生成脚本中的top_p、temperature、repetition_penalty、do_sample等参数对模型的生成效果影响较大,可按照自己的使用场景进行调试修改。 +实践中,在代码生成场景中,如果采样模式,do_sample=True, top_p=0.95, temperature=0.1是pass@1指标的不错选择; +如果非采样模式, do_sample=False, beam_num=1或者3是不错的选择,其中beam_num=1即为greedy decoding。 + +## 5. FAQ +#### 问题1:OOM如何解决? +如果发生OOM,可以缩小per_device_train_batch_size、seq_length等参数来缓解。由于面对的模型普遍较大(6b, 13b, 34b, 70b等)我们已经默认使用gradient_checkpointing技术,可以大幅降低显存占用,但训练速度会稍慢一些。 + +#### 问题2:安装包错误 +参考init_env.sh和requirements.txt + +#### 问题3:如何指定使用某些卡训练? +通过如下方式,即可指定使用0和1号卡进行训练: +```bash +CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file pefts/accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "deepspeed" +``` + +#### 问题4:关于Flash Attention, 该如何配置训练? 
+首先,我们强烈建议您安装Flash Attention 2(FA2),(>=2.1.0, 2.3.6功能更齐全)。 + +训练参数中"attn_implementation" 设置成 "eager" 可以用naive attention,也就是未经加速的attention。 + +训练参数中"attn_implementation" 设置成 "flash_attention_2" 可以用FA2,速度快,省显存。 + +如果你可以自行安装环境并使用torch>=2.1.1,可以尝试设置参数"attn_implementation"为 "sdpa"。这样会尝试使用transformers兼容的torch.nn.functional.scaled_dot_product_attention。支持的模型还不全面。 + +#### 问题5:推荐的分布式框架是怎样的? +对于LoRA/QLoRA, 我们推荐使用DeepSpeed作为底层分布式框架,它具有易用性和兼容性好的特点,并且速度很快。 +FSDP 不支持QLoRA, 因为bitsandbytes暂不支持FSDP。 + +对于全量微调,我们推荐使用FSDP, 因为它在全量训练时可以发挥fully sharding的优势,达到更快的训练速度。 + +#### 问题6:当前支持的模型中,有什么区别 +国产大模型比如chatglm2, chatglm3, baichuan2, qwen, aquila2等,使用的是和模型共同发布的modeling_xxx.py. +其它被transformers官方支持的大模型,由于已经升级支持flash attention等,所以全面切换到官方的modeling支持训练,之前的自定义modeling会被deprecated diff --git a/content/zh/docs/mftcoder/4_atorch.md b/content/zh/docs/mftcoder/4_atorch.md new file mode 100644 index 0000000..f7e3816 --- /dev/null +++ b/content/zh/docs/mftcoder/4_atorch.md @@ -0,0 +1,225 @@ +--- +title: "MFTCoder训练: Atorch框架篇" +description: 介绍主要功能 +url: /docs/mftcoder-atorch-zh +aliases: +- "/docs/mftcoder-atorch-zh" +--- + +[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai) + + GitHub + + +[**中文**] [[English]](/docs/mftcoder-atorch) + +## 1. 更新 + +🔥 MFTCoder在Atorch框架下支持GPTNeoX模型的微调; + +🔥 MFTCoder支持全量的有监督微调; + +🔥 MFTCoder支持LoRA微调; + +## 2. 数据格式 + +### 2.1 训练数据格式 +训练数据为jsonl格式,每一行的数据格式如下,其中chat_rounds字段是必需的,可以根据实际需求添加或删除其他字段。 +可以参考项目中的xxx.jsonl文件。 +```json +{ + "id":0, + "data_name":"code-helper", + "chat_rounds":[ + { + "role": "system", + "content": "你是一个智能代码助手,可以回复用户与代码相关的问题", + "chat_round_id": 0 + }, + { + "role": "human", + "content": "写一个快速排序", + "chat_round_id": 1 + }, + { + "role": "bot", + "content": "以下是一个快速排序算法xxxxxx", + "chat_round_id": 1 + }, + { + "role": "human", + "content": "解释一下这段代码", + "chat_round_id": 2 + }, + { + "role": "bot", + "content": "好的,这段代码xxx", + "chat_round_id": 2 + } + ] +} +``` + +### 2.2 推理数据格式 +推理数据格式为模型在训练数据格式下拼接的字符串形式,它也是推理时输入prompt拼接的方式: +```python +""" +<|role_start|>system<|role_end|>这是System指令 +<|role_start|>human<|role_end|>这是第1轮用户输入的问题 +<|role_start|>bot<|role_end|>这是第1轮模型生成的内容 +<|role_start|>human<|role_end|>这是第2轮用户输入的问题 +<|role_start|>bot<|role_end|>这是第2轮模型生成的内容 +... +... +... +<|role_start|>human<|role_end|>这是第n轮用户输入的问题 +<|role_start|>bot<|role_end|>{模型现在要生成的内容} +""" +``` + + +## 3. 模型训练 +目前 "MFTCoder/mft_atorch" 代码库支持全量参数指令微调和LoRA指令微调。 +目前仅支持GPTNeoX模型的训练,理论上,HuggingFace上开源的GPTNeoX模型权重,均可使用本项目进行训练。 + +我们将训练中使用的各种组件抽取出来,以便后续的扩展和优化,详见主目录下的实现。微调训练的入口目录是```train/```, 训练入口文件是```train/run_train.py```, 参数配置存储在启动脚本```train/run_gpt_*.sh```等文件中,方便统一管理和更改。 + +### 3.1 数据格式 +训练时,我们将多轮对话拼接成如下格式,然后进行tokenize。其中<|role_start|>human<|role_end|>表示human输入提示符,<|role_start|>bot<|role_end|>表示bot输出提示符,`````````` 表示eos_token。 +``` +"<|role_start|>human<|role_end|>input1target1input2target2... 
+``` +在计算loss时,我们通过mask的方式,input部分的loss不参与参数更新,只有“target”部分的loss参与参数更新。 +这种方式充分利用了模型并行计算的优势,训练更加高效,且多轮对话中的每个target部分都参与了训练,训练更充分。 +否则,就需要把一个n轮对话,拆分成n条数据,且只计算最后一个target的loss,大大降低了训练效率。 + +### 3.2 全量SFT + +执行如下命令即可进行全量SFT: +```bash +sh run_gpt_mft.sh 10 1 8 5 +``` + +需注意,启动脚本后的四个参数,分别是: +- 第一个参数是总的per gpu batch size +- 第二个参数是tensor parallel数(暂时只支持1) +- 第三个参数是data parallel数,与所用GPU数保持一致 +- 第四个参数是训练epoch数 + +后面其他的训练方式启动脚本,也同样需要配置这四个参数 + +### 3.3 LoRA微调 + +执行如下命令即可进行Lora微调: +```bash +sh run_gpt_mft_peft.sh 10 1 8 5 +``` + +### 3.4 启动脚本中主要参数说明 +```train/run_gpt_*.sh```中的主要参数说明如下,以下参数可以根据需求进行修改,其他参数建议不做修改: +- tokenize_mode: 目前仅支持"sft"。 + +- train_mode: 目前仅支持"sft"。 + +- load_raw_dataset: 需要保持"True",后续会支持其它模式数据,当前仅支持jsonl输入 + +- data_paths: "[path1,path2,path3]" 输入数据地址,字符串,开头结尾用[],中间用```,```间隔不同path,每个path是一个目录,目录的最后一级名字作为任务名称,下面包含1到多个jsonl数据。 + +- output_dir: 训练输出目录,存储checkpoint、lora_adaptor checkpoint等。 + +- tensorboard_dir: 可以暂时忽略,实际tensorboard存储在output_dir的runs目录下。 + +- model_type: 目前仅支持 gpt_neox。 + +- peft_type: 目前仅支持 lora。 + +- pretrained_model_path: 预训练模型的本地目录。 + +- total_train_batch_size: 所有显卡train的batch size的总和,会根据启动脚本时输入的per gpu batch size自动计算。 + +- per_device_valid_batch_size: 每张显卡eval的batch size,会根据启动脚本时输入的per gpu batch size自动计算。 + +- gradient_accumulation_steps: 梯度累计步数。global batch=num_gpus * per_device_train_batch_size * gradient_accumulation_steps。 + +- checkpoint_activations: 如果显存捉襟见肘,可以开启。以时间换空间,模型不缓存激活状态,会进行两次forward计算,以节省显存。 + +- learning_rate: 学习率。全量参数微调的时候,建议小一些,1e-5或5e-6。qlora中的学习率设置更大一些,一般为1e-4、2e-4。 + +- min_lr: 最低学习率, 一般是learning_rate的十分之一。 + +- seq_length: 训练时的最大长度。按照自己的设备进行设置,越长需要占用越多显存。 + +- log_interval: 每隔多少步统计一次train loss。 + +- checkpointing_steps: 每隔多少步保存一个模型。 + +- evalation_steps: 每隔多少步在验证集上evaluate一次。 + +- early_stopping_patience: 多少个eval point不继续收敛,则停止训练。 + +- lr_scheduler_type: 学习率变化策略。 + +- num_warmup_steps: warm up步数,学习率经过多少步,增长到指定的数值。 + +- seed: 随机种子,用于复现实验结果。 + +- train_iters: 可以暂时设为比较小的数,如10,实际上不会影响训练步数,留作后面拓展读取其他形式数据集的功能。 + +- valid_iters: 可以暂时设为比较小的数,如10,实际上不会影响训练步数,留作后面拓展读取其他形式数据集的功能。 + +- evaluation_strategy: 训练期间evaluate的策略,"steps"表示每隔"valid_interval"步做一次evaluate,"epoch"表示每隔一个epoch做一次evaluate,支持同时开启。 + +- save_strategy: 训练期间保存模型权重的策略,"steps"表示每隔"checkpointing_steps"步保存一次。 + +- extra_save_by_epoch: 每过一个epoch是否要保存一个epoch级别的checkpoint。 + +- save_total_limit: 最多保留的模型checkpoint个数,一般设置为2,会保留valid loss最低,以及最新的checkpoint,注意epoch级别的checkpoint会一直保留,且不受限制。 + +- weighted_loss_mode: 多任务训练的loss加权方式。 + + +## 4. 
模型使用 + +### 4.1 权重合并 +如果使用LoRA进行训练,本项目仅保存adapter的权重和配置文件,需要将adapter权重与base model进行合并。脚本见```utils/merge_base_and_lora_to_hf.py``` + +### 4.2 模型推理 +我们提供了单轮对话和多轮对话的如下脚本,该脚本可同时兼容大部分huggingface格式的模型。 +```python +from transformers import ( + AutoTokenizer, + AutoModelForCausalLM, +) +tokenizer = AutoTokenizer.from_pretrained(mode_name_or_path, trust_remote_code=True, use_fast=False, legacy=False) +tokenizer.padding_side = "left" +tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("") +tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("") +model = AutoModelForCausalLM.from_pretrained(mode_name_or_path, trust_remote_code=True) + +HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>" +BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>" +texts = ["write a python function of quick sort."] +texts = [f"{HUMAN_ROLE_START_TAG}{text}{BOT_ROLE_START_TAG}" for text in texts] + +inputs = tokenizer(texts, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda") +outputs = model.generate( + inputs=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_new_tokens=512, + top_p=0.95, + temperature=0.1, + do_sample=True, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id + ) +gen_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True) +print(gen_text) +``` + +生成脚本中的top_p、temperature、repetition_penalty、do_sample等参数对模型的生成效果影响较大,可按照自己的使用场景进行调试修改。 +实践中,在代码生成场景中,如果采样模式,do_sample=True, top_p=0.95, temperature=0.1是pass@1指标的不错选择; +如果非采样模式, do_sample=False, beam_num=1或者3是不错的选择,其中beam_num=1即为greedy decoding。 + +## 5. FAQ +#### 问题1:OOM如何解决? +如果发生OOM,可以缩小per GPU batch size (启动训练脚本时的第一个参数)、seq_length等参数来缓解。也可以设gradient_checkpointing=true,可以大幅降低显存占用,但训练速度会变慢一些。 \ No newline at end of file diff --git a/content/zh/docs/overview/b1.codefusechatbot.md b/content/zh/docs/overview/b1.codefusechatbot.md new file mode 100644 index 0000000..9997cf0 --- /dev/null +++ b/content/zh/docs/overview/b1.codefusechatbot.md @@ -0,0 +1,66 @@ +--- +title: CodeFuse-ChatBot Development by Private Knowledge Augmentation +slug: CodeFuse-ChatBot-zh +description: 介绍主要功能 +aliases: +- "/docs/codefuse-chatbot-zh" +--- + +

+ 中文  |  English  +

+ +DevOps-ChatBot是由蚂蚁CodeFuse团队开发的开源AI智能助手,致力于简化和优化软件开发生命周期中的各个环节。该项目结合了Multi-Agent的协同调度机制,并集成了丰富的工具库、代码库、知识库和沙盒环境,使得LLM模型能够在DevOps领域内有效执行和处理复杂任务。 + + +## 📜 目录 +- [🤝 介绍](#-介绍) +- [🎥 演示视频](#-演示视频) +- [🧭 技术路线](#-技术路线) + +## 🤝 介绍 + +💡 本项目旨在通过检索增强生成(Retrieval Augmented Generation,RAG)、工具学习(Tool Learning)和沙盒环境来构建软件开发全生命周期的AI智能助手,涵盖设计、编码、测试、部署和运维等阶段。 逐渐从各处资料查询、独立分散平台操作的传统开发运维模式转变到大模型问答的智能化开发运维模式,改变人们的开发运维习惯。 + +本项目核心差异技术、功能点: +- **🧠 智能调度核心:** 构建了体系链路完善的调度核心,支持多模式一键配置,简化操作流程。 [使用说明](/docs/multi-agent-zh) +- **💻 代码整库分析:** 实现了仓库级的代码深入理解,以及项目文件级的代码编写与生成,提升了开发效率。 +- **📄 文档分析增强:** 融合了文档知识库与知识图谱,通过检索和推理增强,为文档分析提供了更深层次的支持。 +- **🔧 垂类专属知识:** 为DevOps领域定制的专属知识库,支持垂类知识库的自助一键构建,便捷实用。 +- **🤖 垂类模型兼容:** 针对DevOps领域的小型模型,保证了与DevOps相关平台的兼容性,促进了技术生态的整合。 + +🌍 依托于开源的 LLM 与 Embedding 模型,本项目可实现基于开源模型的离线私有部署。此外,本项目也支持 OpenAI API 的调用。[接入Demo](/docs/fastchat-zh) + +👥 核心研发团队长期专注于 AIOps + NLP 领域的研究。我们发起了 Codefuse-ai 项目,希望大家广泛贡献高质量的开发和运维文档,共同完善这套解决方案,以实现“让天下没有难做的开发”的目标。 + +
+ 图片 +
+ + +## 🎥 演示视频 + +为了帮助您更直观地了解 Codefuse-ChatBot 的功能和使用方法,我们录制了一系列演示视频。您可以通过观看这些视频,快速了解本项目的主要特性和操作流程。 + + +- 知识库导入和问答:[演示视频](https://www.youtube.com/watch?v=UGJdTGaVnNY&t=2s&ab_channel=HaotianZhu) +- 本地代码库导入和问答:[演示视频](https://www.youtube.com/watch?v=ex5sbwGs3Kg) + + +## 🧭 技术路线 +
+ Image +
+ +- 🧠 **Multi-Agent Schedule Core:** 多智能体调度核心,简易配置即可打造交互式智能体。 +- 🕷️ **Multi Source Web Crawl:** 多源网络爬虫,提供对指定 URL 的爬取功能,以搜集所需信息。 +- 🗂️ **Data Processor:** 数据处理器,轻松完成文档载入、数据清洗,及文本切分,整合不同来源的数据。 +- 🔤 **Text Embedding & Index:**:文本嵌入索引,用户可以轻松上传文件进行文档检索,优化文档分析过程。 +- 🗄️ **Vector Database & Graph Database:** 向量与图数据库,提供灵活强大的数据管理解决方案。 +- 📝 **Prompt Control & Management:**:Prompt 控制与管理,精确定义智能体的上下文环境。 +- 🚧 **SandBox:**:沙盒环境,安全地执行代码编译和动作。 +- 💬 **LLM:**:智能体大脑,支持多种开源模型和 LLM 接口。 +- 🛠️ **API Management::** API 管理工具,实现对开源组件和运维平台的快速集成。 + +具体实现明细见:[技术路线明细](/docs/chatbot-roadmap) + diff --git a/content/zh/docs/overview/b10.codefuse-evalution.md b/content/zh/docs/overview/b10.codefuse-evalution.md new file mode 100644 index 0000000..d28e775 --- /dev/null +++ b/content/zh/docs/overview/b10.codefuse-evalution.md @@ -0,0 +1,19 @@ +--- +title: "CodeFuseEval: 代码大语言模型的多任务评估基准" +description: 介绍主要功能 +aliases: +- "/docs/codefuse-evalution-zh" +--- + + + + +CodeFuseEval在HumanEval-x、MBPP的基准上,结合CodeFuse大模型多任务场景,开发的编程领域多任务的评测基准, 可用于评估模型在代码补全,自然语言生成代码,测试用例生成、跨语言代码翻译,中文指令生成代码等多类任务的性能。持续开放中,敬请期待! + +![img](/images/codefuse-evalution/中文介绍.png) diff --git a/content/zh/docs/overview/b2.codefuseDevopsEval.md b/content/zh/docs/overview/b2.codefuseDevopsEval.md new file mode 100644 index 0000000..1d3afc1 --- /dev/null +++ b/content/zh/docs/overview/b2.codefuseDevopsEval.md @@ -0,0 +1,132 @@ +--- +title: CodeFuse-DevOps-Eval +slug: CodeFuse-DevOps-Eval-zh +description: 介绍主要功能 +aliases: +- "/docs/codefuse-devops-eval-zh" +--- + +

+ + + +DevOps-Eval是一个专门为DevOps领域大模型设计的综合评估数据集。我们希望DevOps-Eval能够帮助开发者,尤其是DevOps领域的开发者,追踪进展并分析他们拥有的DevOps大模型的优势和不足之处。 + +📚 该仓库包含与DevOps和AIOps相关的问题和练习, 还添加了关于ToolLearning相关的样本。 + +💥 目前有 **7486** 个多项选择题,根据DevOps的通用流程将其归纳未8个模块,如[下图](/images/devops_eval/data_info.png)所示。 + +🔥 AIOps样本总计 **2840** 个,覆盖的场景包括**日志解析**、**时序异常检测**、**时序分类**、**时序预测**和**根因分析**。 + +🔧 ToolLearning样本 **1509** 个,涵盖59个领域,总计 239 种工具类别。 + +

+ +## 🏆 排行榜 +以下是我们获得的初版评测结果,包括多个开源模型的zero-shot和five-shot准确率。我们注意到,对于大多数指令模型来说,five-shot的准确率要优于zero-shot。 + +### 👀 DevOps +#### Zero Shot + +| **模型** | plan | code | build | test | release | deploy | operate | monitor | **平均分** | +|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:---------:| +| DevOpsPal-14B-Chat | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 69.89 | 79.17 | 78.23 | +| DevOpsPal-14B-Base | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 71.18 | 82.41 | 78.23 | +| Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 69.57 | 80.56 | 77.18 | +| Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 70.05 | 80.09 | 76.19 | +| Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 70.37 | 83.8 | 73.73 | +| Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 67.63 | 84.72 | 72.9 | +| DevOpsPal-7B-Chat | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 64.73 | 77.78 | 71.92 | +| DevOpsPal-7B-Base | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 65.54 | 78.7 | 71.69 | +| Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 65.06 | 80.09 | 71.09 | +| Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 62.64 | 79.17 | 69.75 | +| Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 59.42 | 79.63 | 66.97 | +| Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 60.39 | 78.24 | 66.27 | +| Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 61.67 | 75.93 | 66.21 | +| Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 65.86 | 75.93 | 65.99 | + + +#### Five Shot + +| **模型** | plan | code | build | test | release | deploy | operate | monitor | **平均分** | +|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:---------:| +| DevOpsPal-14B-Chat | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 72.95 | 81.48 | 79.69 | +| DevOpsPal-14B-Base | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 71.98 | 80.09 | 79.63 | +| Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 70.85 | 81.48 | 77.81 | +| Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 72.46 | 80.56 | 77.56 | +| Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 73.75 | 85.19 | 75.8 | +| Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 70.37 | 81.94 | 75.36 | +| Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 71.98 | 80.56 | 74.12 | +| DevOpsPal-7B-Chat | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 68.6 | 76.85 | 73.61 | +| DevOpsPal-7B-Base | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 67.15 | 79.17 | 73.35 | +| Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 65.38 | 81.02 | 71.69 | +| Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 67.31 | 79.63 | 70.8 | +| Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 60.06 | 77.31 | 69.21 | +| Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 64.9 | 79.17 | 69.05 | +| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 63.45 | 75.46 | 67.17 | + + +### 🔥 AIOps + +
+ +#### Zero Shot +| **模型** | 日志解析 | 根因分析 | 时序异常检测 | 时序分类 | 时序预测 | **平均分** | +|:-------------------:|:-----:|:----:|:------:|:----:|:-----:|:-------:| +| Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 62.5 | 52.25 | +| DevOpsPal-14B—Base | 63.14 | 53.6 | 23.33 | 43.5 | 64.06 | 50.49 | +| Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 62.5 | 48.94 | +| DevOpsPal-14B—Chat | 60 | 56 | 24 | 43 | 57.81 | 48.8 | +| Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 43.75 | 41.48 | +| DevOpsPal-7B—Chat | 56.57 | 30.4 | 25.33 | 45 | 44.06 | 40.92 | +| Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 46.88 | 39.3 | +| Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 25.31 | 36.97 | +| Internlm-7B—Chat | 58.86 | 8.8 | 22.33 | 28.5 | 51.25 | 36.34 | +| Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 39.06 | 36.34 | +| Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 40.31 | 35.49 | +| Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 42.81 | 34.86 | +| DevOpsPal-7B—Base | 46.57 | 20.8 | 25 | 34 | 38.75 | 33.94 | +| Internlm-7B—Base | 48.57 | 18.8 | 23.33 | 37.5 | 33.75 | 33.1 | + +#### One Shot +| **模型** | 日志解析 | 根因分析 | 时序异常检测 | 时序分类 | 时序预测 | **平均分** | +|:-------------------:|:-----:|:----:|:------:|:----:|:-----:|:-------:| +| DevOpsPal-14B—Chat | 66.29 | 80.8 | 23.33 | 44.5 | 56.25 | 54.44 | +| DevOpsPal-14B—Base | 60 | 74 | 25.33 | 43.5 | 52.5 | 51.13 | +| Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 40.31 | 50.77 | +| Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 57.19 | 49.44 | +| Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 42.19 | 46.13 | +| Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 46.88 | 42.89 | +| Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 51.88 | 41.83 | +| DevOpsPal-7B—Base | 52.86 | 44.4 | 28 | 44.5 | 36.25 | 41.2 | +| Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 40.94 | 39.86 | +| Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 27.19 | 38.73 | +| Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 30.63 | 37.75 | +| DevOpsPal-7B—Chat | 56.57 | 27.2 | 25.33 | 41.5 | 33.44 | 37.46 | +| Internlm-7B—Chat | 62.57 | 12.8 | 22.33 | 21 | 50.31 | 36.69 | +| Internlm-7B—Base | 48 | 33.2 | 29 | 35 | 31.56 | 35.85 | + +
+ +### 🔧 ToolLearning +
+ +| **FuncCall-Filler** | dataset_name | fccr | 1-fcffr | 1-fcfnr | 1-fcfpr | 1-fcfnir | aar | +|:-------------------:| :---: | :---: | :---: | :---: | :---: | :---: | :---: | +| Qwen-14b-chat | luban | 61 | 100 | 97.68 | 63.32 | 100 | 69.46 | +| Qwen-7b-chat | luban | 50.58 | 100 | 98.07 | 52.51 | 100 | 63.59 | +| Baichuan-7b-chat | luban | 60.23 | 100 | 97.3 | 62.93 | 99.61 | 61.12 | +| Internlm-chat-7b | luban | 47.88 | 100 | 96.14 | 51.74 | 99.61 | 61.85 | +| Qwen-14b-chat | fc_data | 98.37 | 99.73 | 99.86 | 98.78 | 100 | 81.58 | +| Qwen-7b-chat | fc_data | 99.46 | 99.86 | 100 | 99.59 | 100 | 79.25 | +| Baichuan-7b-chat | fc_data | 97.96 | 99.32 | 100 | 98.64 | 100 | 89.53 | +| Internlm-chat-7b | fc_data | 94.29 | 95.78 | 100 | 98.5 | 100 | 88.19 | +| CodeLLaMa-7b | fc_data | 98.78 | 99.73 | 100 | 99.05 | 100 | 94.7 | +| CodeLLaMa-7b-16 | fc_data | 98.1 | 99.87 | 99.73 | 98.5 | 100 | 93.14 | +| CodeFuse-7b-4k | fc_data | 98.91 | 99.87 | 99.87 | 99.18 | 100 | 89.5 | + +
\ No newline at end of file diff --git a/content/zh/docs/overview/b3.codefuseDevopsModel.md b/content/zh/docs/overview/b3.codefuseDevopsModel.md new file mode 100644 index 0000000..812d6f3 --- /dev/null +++ b/content/zh/docs/overview/b3.codefuseDevopsModel.md @@ -0,0 +1,61 @@ +--- +title: CodeFuse-DevOps-Model +slug: CodeFuse-DevOps-Model-zh +description: 介绍主要功能 +aliases: +- "/docs/codefuse-devops-model-zh" +--- + +## codeFuse-devops-model +DevOps-Model 是蚂蚁集团联合北京大学发布面向中文 DevOps 领域的大语言模型,通过收集 DevOps 领域相关的专业数据,再针对模型进行语言模型的加训和对齐训练,产出可以帮助工程师在整个开发运维生命周期提效的大模型。弥补当前大模型在 DevOps 领域的缺失,旨在做到有问题,问 DevOps-Model ! + +当前我们已经开源了 7B 和 14B 两种规格的经过加训得 Base 模型和经过对齐后的 Chat 模型,同时还开源了对应的训练代码,欢迎大家一起合作建设! + + +## 项目地址 +Github 地址:https://github.com/codefuse-ai/CodeFuse-DevOps-Model/tree/main + +ModelScope 地址: +- DevOps-Model-7B-Base:https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-7B-Base/summary +- DevOps-Model-7B-Chat:https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-7B-Chat/summary +- DevOps-Model-14B-Base:https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-14B-Base/summary +- DevOps-Model-14B-Chat:https://modelscope.cn/models/codefuse-ai/CodeFuse-DevOps-Model-14B-Chat/summary + +## 评测考题 +针对模型评测,最初并没有这样的一个 benchmark 用来 DevOps 领域进行测试,所以我们首先选用了一些通用开源测试中和 DevOps 领域相关的选择题进行测试,具体测试数据如下: +|数据集 |考试科目 |题目总数| +| ---- | --------- | ----- | +|CMMLU |Computer science 204| +|Computer |security |171| +|Machine |learning |122| +|CEval |college programming| 37| +|CEval |computer_architecture| 21| +|CEval |computer_network |19| +|总计 |总计题目数 |574| + + + +## 评测方式 +由于都是单选题,我们采用的是选取模型产出的第一个 Token 中四个选项 Token 中得分最高的作为模型对于问题的回答。同时我们还测试了 Zero-shot 和 Five-shot 的结果。 + + +## 评测结果 +![](/images/devops_model/devops_eval.webp) + +具体的得分如下表所示: +|参数量级| 模型 |模型大小 |Zero-shot 得分 |Five-shot 得分| +| - | ---- | --- | ---- | ---- | +|10+ B| DevOps-Model-14B-Base |14B |70.73 |73.00| +|10+ B|Qwen-14B-Base |14B |69.16| 71.25| +|10+ B|Baichuan2-13B-Base |13B |55.75| 61.15| +|10+ B|DevOps-Model-14B-Chat| 14B |74.04 |75.96| +|10+ B|Qwen-14B-Chat |14B |69.16| 70.03| +|10+ B|Baichuan2-13B-Chat |13B |52.79 |55.23| +|7B| DevOps-Model-7B-Base| 7B |62.72| 62.02| +|7B|Qwen-7B-Base| 7B| 55.75| 56.0| +|7B|Baichuan2-7B-Base| 7B |49.30| 55.4| +|7B|Internlm-7B-Base |7B |47.56 |52.6| +|7B|DevOps-Model-7B-Chat| 7B |62.20| 64.11| +|7B|Qwen-7B-Chat| 7B |46.00 |52.44| +|7B|Baichuan2-7B-Chat| 7B| 52.26| 54.46| +|7B|Internlm-7B-Chat |7B |52.61 |55.75| \ No newline at end of file diff --git a/content/zh/docs/overview/b4.MFTCoder.md b/content/zh/docs/overview/b4.MFTCoder.md new file mode 100644 index 0000000..ac3e7a5 --- /dev/null +++ b/content/zh/docs/overview/b4.MFTCoder.md @@ -0,0 +1,116 @@ +--- +title: "MFTCoder: 高效准确的多任务大模型微调框架" +slug: MFTCoder-zh +description: 介绍主要功能 +aliases: +- "/docs/mftcoder-zh" +--- + + +
+ +

+ 🤗 HuggingFace + • 🤖 魔搭 +

+ +[**中文**] [[English]](/docs/mftcoder) + +
+ + + +## 目录 +- [新闻](#新闻) +- [文章](#文章) +- [项目简介](#项目简介) +- [环境](#环境) +- [训练](#训练) +- [模型](#模型) +- [数据集](#数据集) + + +## 新闻 +🔥🔥🔥 [2024/01/17] **MFTCoder-v0.3.0**发布。新增对Mixtral(MoE), DeepSeek等模型的支持;新增支持FSDP(Fully Sharded Data Parallel);新增Self-paced Loss, 支持多任务收敛均衡。 感兴趣详见微信公众号CodeFuse的文章[MFTCoder 重磅升级v0.3.0发布](https://mp.weixin.qq.com/s/xI3f0iUKq9TIIKZ_kMtcQg) + +🔥🔥🔥 [2024/01/17] 开源了[CodeFuse-DeepSeek-33B](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B)模型,在HumanEval pass@1(greedy decoding)上可以达到78.7%。该模型在Big Code榜单的结果近期发布,请关注公众号获取最新信息。 + +🔥🔥🔥 [2024/01/17] 开源了[CodeFuse-Mixtral-8x7B](https://huggingface.co/codefuse-ai/CodeFuse-Mixtral-8x7B)模型,在HumanEval pass@1(greedy decoding)上可以达到56.1%。感兴趣详见微信公众号CodeFuse的文章[MFTCoder提升Mixtral-8x7B混合专家模型的代码能力实践](https://mp.weixin.qq.com/s/xI3f0iUKq9TIIKZ_kMtcQg) + +🔥🔥 [2023/11/07] [MFTCoder论文](https://arxiv.org/abs/2311.02303)在Arxiv公布,介绍了多任务微调的技术细节。 + +🔥🔥 [2023/10/20] 开源了[CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B)模型,在HumanEval pass@1(greedy decoding)上可以达到48.8%。相比较与基座模型Qwen-14b提升16%。感兴趣详见微信公众号CodeFuse[文章](https://mp.weixin.qq.com/s/PCQPkvbvfxSPzsqjOILCDw) + +🔥🔥 [2023/09/27] 开源了[CodeFuse-StarCoder-15B](https://huggingface.co/codefuse-ai/CodeFuse-StarCoder-15B)模型,在HumanEval pass@1(greedy decoding)上可以达到54.9%。 + +🔥🔥 [2023/09/26] [CodeFuse-CodeLlama-34B-4bits](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits)量化版本发布,量化后模型在HumanEval pass@1指标为73.8% (贪婪解码)。 + +🔥🔥 [2023/09/07]MFTCoder微调的模型**CodeFuse-CodeLlama-34B**在[HumanEval Benchmarks](https://github.com/openai/human-eval)的Python **Pass@1** 取得了**74.4%**(greedy decoding)的开源SOTA成绩。 + +🔥🔥 [2023/08/26]MFTCoder-v0.1.0 支持使用LoRA/QLoRA对Code Llama、Llama、Llama2、StarCoder、ChatGLM2、CodeGeeX2、Qwen和GPT-NeoX模型进行微调。 + +### HumanEval表现 +| 模型 | HumanEval(Pass@1) | 日期 | +|:---------------------------------|:-----------------:|:-------:| +| **CodeFuse-DeepSeek-33B** | **78.7%** | 2024/01 | +| **CodeFuse-CodeLlama-34B** | **74.4%** | 2023/09 | +| **CodeFuse-CodeLlama-34B-4bits** | **73.8%** | 2023/09 | +| WizardCoder-Python-34B-V1.0 | 73.2% | 2023/08 | +| GPT-4(zero-shot) | 67.0% | 2023/03 | +| PanGu-Coder2 15B | 61.6% | 2023/08 | +| **CodeFuse-Mixtral-8x7B** | **56.1%** | 2024/01 | +| **CodeFuse-StarCoder-15B** | **54.9%** | 2023/08 | +| CodeLlama-34b-Python | 53.7% | 2023/08 | +| **CodeFuse-QWen-14B** | **48.8%** | 2023/10 | +| CodeLlama-34b | 48.8% | 2023/08 | +| GPT-3.5(zero-shot) | 48.1% | 2022/11 | +| OctoCoder | 46.2% | 2023/08 | +| StarCoder-15B | 33.6% | 2023/05 | +| QWen-14B | 32.3% | 2023/10 | + + +## 文章 +🔥 [CodeFuse-MFTCoder提升CodeGeeX2-6B代码能力](https://mp.weixin.qq.com/s/kWMtHIoe3ytN8pRVi_CHZg) + +🔥 [CodeFuse-MFTCoder提升Qwen-14B代码能力](https://mp.weixin.qq.com/s/PCQPkvbvfxSPzsqjOILCDw) + + +## 项目简介 +**国际首个高精度、高效率、多任务、多模型支持、多训练算法,大模型代码能力微调框架;** + +**Codefuse-MFTCoder** 是一个开源的多任务代码大语言模型项目,包含代码大模型的模型、数据、训练等。我们希望通过开源,分享交流大语言模型在代码领域的进步。 + +### 项目框架 +![img_1.jpg](/images/mftcoder/img_1.jpg) + +### 项目优势 +:white_check_mark: **多任务**:一个模型同时支持多个任务,会保证多个任务之间的平衡,甚至可以泛化到新的没有见过的任务上去; + +:white_check_mark: **多模型**:支持最新的多个开源模型,包括gpt-neox,llama,llama-2,baichuan,Qwen,chatglm2等; + +:white_check_mark: **多框架**:既支持主流开源的Accelerate+DeepSpeed/FSDP,也支持新开源的[ATorch 框架](https://github.com/intelligent-machine-learning/dlrover); + +:white_check_mark: **高效微调**:支持LoRA和QLoRA,可以用很少的资源去微调很大的模型,且训练速度能满足几乎所有微调场景; + + +本项目主要内容如下: +- 同时支持单任务SFT(Supervised FineTuning)和MFT(Multi-task FineTuning), 当前开源支持数据均衡,未来将持续开源难易均衡, 收敛均衡等 +- 支持QLoRA低成本高效指令微调、LoRA高效指令微调、全量参数高精度微调。 +- 支持绝大部分主流的开源大模型,重点关注代码能力优秀的开源大模型,如DeepSeek-coder, 
Mistral, Mistral(MoE), Chatglm3, Qwen, GPT-Neox, Starcoder, Codegeex2, Code-LLaMA等。 +- 支持lora与base model进行权重合并,推理更便捷。 +- 整理并开源2个指令微调数据集:[Evol-instruction-66k](https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k)和[CodeExercise-Python-27k](https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k)。 +- 开源多个[Codefuse系列指令微调模型权重],具体参见我们的huggingface组织和modelscope组织下的模型:[codefuse-ai huggingface](https://huggingface.co/codefuse-ai) or [codefuse-ai 魔搭](https://modelscope.cn/organization/codefuse-ai)。 + | + +## 引用 +如果你觉得我们的工作对你有帮助,请引用我们的论文 +``` +@article{mftcoder2023, + title={MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning}, + author={Bingchang Liu and Chaoyu Chen and Cong Liao and Zi Gong and Huan Wang and Zhichao Lei and Ming Liang and Dajun Chen and Min Shen and Hailian Zhou and Hang Yu and Jianguo Li}, + year={2023}, + journal={arXiv preprint arXiv}, + archivePrefix={arXiv}, + eprint={2311.02303} +} +``` \ No newline at end of file diff --git a/content/zh/docs/overview/b5.CodeFuseModelCache.md b/content/zh/docs/overview/b5.CodeFuseModelCache.md new file mode 100644 index 0000000..e287de6 --- /dev/null +++ b/content/zh/docs/overview/b5.CodeFuseModelCache.md @@ -0,0 +1,42 @@ +--- +title: CodeFuse-ModelCache +slug: CodeFuse-ModelCache-zh +description: 介绍主要功能 +aliases: +- "/docs/codefuse-modelcache-zh" +--- + + +

+

+

+

+ 中文 | + English +

+

+
+ +## Contents +- [新闻](#新闻) +- [项目简介](#项目简介) +- [架构大图](#架构大图) +- [致谢](#致谢) +- [Contributing](#Contributing) + +## 新闻 +- 🔥🔥[2023.12.10] 增加llmEmb、onnx、paddlenlp、fasttext等LLM embedding框架,并增加timm 图片embedding框架,用于提供更丰富的embedding能力。 +- 🔥🔥[2023.11.20] codefuse-ModelCache增加本地存储能力, 适配了嵌入式数据库sqlite、faiss,方便用户快速启动测试。 +- [2023.10.31] codefuse-ModelCache... + +## 项目简介 +Codefuse-ModelCache 是一个开源的大模型语义缓存系统,通过缓存已生成的模型结果,降低类似请求的响应时间,提升用户体验。该项目从服务优化角度出发,引入缓存机制,在资源有限和对实时性要求较高的场景下,帮助企业和研究机构降低推理部署成本、提升模型性能和效率、提供规模化大模型服务。我们希望通过开源,分享交流大模型语义Cache的相关技术。 + +## 架构大图 +![modelcache modules](/images/codefuse-modelcache/modelcache_modules_20231114.png) + +## 致谢 +本项目参考了以下开源项目,在此对相关项目和研究开发人员表示感谢。
[GPTCache](https://github.com/zilliztech/GPTCache) + +## Contributing +ModelCache是一个非常有趣且有用的项目,我们相信这个项目有很大的潜力,无论你是经验丰富的开发者,还是刚刚入门的新手,都欢迎你为这个项目做出一些贡献,包括但不限于:提交问题和建议,参与代码编写,完善文档和示例。你的参与将会使这个项目变得更好,同时也会为开源社区做出贡献。 \ No newline at end of file diff --git a/content/zh/docs/overview/b6.FasterTransformer4CodeFuse.md b/content/zh/docs/overview/b6.FasterTransformer4CodeFuse.md new file mode 100644 index 0000000..e108a66 --- /dev/null +++ b/content/zh/docs/overview/b6.FasterTransformer4CodeFuse.md @@ -0,0 +1,10 @@ +--- +title: FasterTransformer4CodeFuse +slug: FasterTransformer4CodeFuse-zh +description: 介绍主要功能 +aliases: +- "/docs/fastertransformer4codefuse-zh" +--- + +## FasterTransformer4CodeFuse +FasterTransformer4CodeFuse \ No newline at end of file diff --git a/content/zh/docs/overview/b7.TestAgent.md b/content/zh/docs/overview/b7.TestAgent.md new file mode 100644 index 0000000..74a95af --- /dev/null +++ b/content/zh/docs/overview/b7.TestAgent.md @@ -0,0 +1,66 @@ +--- +title: "Test-Agent: 您的智能测试助理" +slug: Test-Agent-zh +description: 介绍主要功能 +aliases: +- "/docs/test-agent-zh" +--- + +### 本地Mac M1体验效果 +![图片](https://github.com/codefuse-ai/Test-Agent/assets/103973989/8dba860f-c1bb-49d5-b9dd-a58e541562a6) + +### 魔搭体验效果 +魔搭模型访问链接:[ModelScope TestGPT-7B](https://modelscope.cn/models/codefuse-ai/TestGPT-7B/summary) +![MS](https://github.com/codefuse-ai/Test-Agent/assets/103973989/0e50b258-44f9-4dc6-8e30-0a01cf62d02b) + + +## 什么是Test Agent?(Introduction) + +**Test Agent** 旨在构建测试领域的“智能体”,融合大模型和质量领域工程化技术,促进质量技术代系升级。我们期望和社区成员一起合作,打造创新的测试领域解决方案,构建24小时在线的测试助理服务,让测试如丝般顺滑。 +## 本期特性(Features) + +* **模型** 本期我们开源了测试领域模型TestGPT-7B。模型以CodeLlama-7B为基座,进行了相关下游任务的微调: + * **多语言测试用例生成(Java/Python/Javascript)** 一直以来都是学术界和工业界非常关注的领域,近年来不断有新产品或工具孵化出来,如EvoSuite、Randoop、SmartUnit等。然而传统的用例生成存在其难以解决的痛点问题,基于大模型的测试用例生成在测试用例可读性、测试场景完整度、多语言支持方面都优于传统用例生成工具。本次重点支持了多语言测试用例生成,在我们本次开源的版本中首先包含了Java、Python、Javascript的测试用例生成能力,下一版本中逐步开放Go、C++等语言。 + * **测试用例Assert补全** 对当前测试用例现状的分析与探查时,我们发现代码仓库中存在一定比例的存量测试用例中未包含Assert。没有Assert的测试用例虽然能够在回归过程中执行通过,却无法发现问题。因此我们拓展了测试用例Assert自动补全这一场景。通过该模型能力,结合一定的工程化配套,可以实现对全库测试用例的批量自动补全,智能提升项目质量水位。 + +* **工程框架** 本地模型快速发布和体验工程化框架 + - ChatBot页面 + - 模型快速启动 + - 私有化部署,本地化的GPT大模型与您的数据和环境进行交互,无数据泄露风险,100%安全 + +**后续我们会持续迭代模型和工程化能力:** +- 不断加入更多令人激动的测试域应用场景,如领域知识问答、测试场景分析等 +- 支撑面向测试场景的copilot 工程框架开放,如测试领域知识智能embedding、测试通用工具API体系、智能测试Agent等,敬请期待! +- 以7B为基础,逐步扩展至13B、34B模型。欢迎关注! 
+ +## 性能最强的7B测试领域大模型(Model) +目前在TestAgent中,我们默认使用了TestGPT-7B模型。与当前已有开源模型相比,**TestGPT-7B模型在用例执行通过率(pass@1)、用例场景覆盖(平均测试场景数)上都处于业界领先水平。** +TestGPT-7B模型核心能力的评测结果如下: +- 多语言测试用例生成 +针对模型支持的三种语言:Java、Python、Javascript,Pass@1评测结果如下: + +| Model | Java pass@1 | Java Average number of test scenarios | Python pass@1 | Python Average number of test scenarios | Javascript pass@1 | Javascript Average number of test scenarios | +| --- | --- | --- | --- | --- | --- | --- | +| TestGPT-7B | 48.6% | 4.37 | 35.67% | 3.56 | 36% | 2.76 | +| CodeLlama-13B-Instruct | 40.54% | 1.08 | 30.57% | 1.65 | 31.7% | 3.13 | +| Qwen-14B-Chat | 10.81% | 2.78 | 15.9% | 1.32 | 9.15% | 4.22 | +| Baichuan2-13B-Chat | 13.5% | 2.24 | 12.7% | 2.12 | 6.1% | 3.31 | + + +- 测试用例Assert补全 +目前模型支持Java用例的Assert补全,Pass@1评测结果如下: + +| Model | pass@1 | Percentage of strong validation | +| --- | --- | --- | +| Codefuse-TestGPT-7B | 71.1% | 100% | + + +## 工程架构(Engineering Architecture) +![JG](https://github.com/codefuse-ai/Test-Agent/assets/103973989/1b61beff-df59-4ab3-843c-266413c8dbc4) + +大模型的号角已经吹响,测试领域大模型也在不断进化中,通过预训练过程中积累的丰富世界知识,在复杂交互环境中展现出了非凡的推理与决策能力。 + +尽管在测试领域中基础模型取得了显著的成果,但仍然存在一些局限性,特定领域的测试任务通常需要专业化的工具或领域知识来解决。例如,基础模型可以通过预训练知识完成单次测试代码生成和测试文本生成等任务,但处理复杂的集成用例生成、特定领域用例生成和测试流程pipeline交互等问题时,需要更专业的工具和领域知识。因此将专用工具与基础模型整合在一起,可以充分发挥它们各自的优势。专用工具可以解决模型时效性不足、增强专业知识、提高可解释性和鲁棒性的问题。而基础模型则具备类人的推理规划能力,可以理解复杂的数据和场景,并与现实世界进行交互。 + +在本期开放模型工程化部署和ChatBot基础上,我们将继续在测试开源领域深耕投入。协同社区志趣相投开发者们,一起打造测试领域最领先的Tools工程体系、智能测试助理和测试开源工程! + diff --git a/content/zh/docs/overview/b8.CodeFuseQuery.md b/content/zh/docs/overview/b8.CodeFuseQuery.md new file mode 100644 index 0000000..548cda2 --- /dev/null +++ b/content/zh/docs/overview/b8.CodeFuseQuery.md @@ -0,0 +1,25 @@ +--- +title: CodeFuse-Query +slug: CodeFuse-Query-zh +description: 介绍主要功能 +aliases: +- "/docs/codefuse-query-zh" +--- + +## CodeFuse-Query +随着大规模软件开发的普及,对可扩展且易于适应的静态代码分析技术的需求正在加大。传统的静态分析工具,如 Clang Static Analyzer (CSA) 或 PMD,在检查编程规则或样式问题方面已经展现出了良好的效果。然而,这些工具通常是为了满足特定的目标而设计的,往往无法满足现代软件开发环境中多变和多元化的需求。这些需求可以涉及服务质量 (QoS)、各种编程语言、不同的算法需求,以及各种性能需求。例如,安全团队可能需要复杂的算法,如上下文敏感的污点分析,来审查较小的代码库,而项目经理可能需要一种相对较轻的算法,例如计算圈复杂度的算法,以在较大的代码库上测量开发人员的生产力。 + +这些多元化的需求,加上大型组织中常见的计算资源限制,构成了一项重大的挑战。由于传统工具采用的是问题特定的计算方式,往往无法在这种环境中实现扩展。因此,我们推出了 CodeQuery,这是一个专为大规模静态分析设计的集中式数据平台。 +在 CodeQuery 的实现中,我们把源代码和分析结果看作数据,把执行过程看作大数据处理,这与传统的以工具为中心的方法有着显著的不同。我们利用大型组织中的常见系统,如数据仓库、MaxCompute 和 Hive 等数据计算设施、OSS 对象存储和 Kubernetes 等灵活计算资源,让 CodeQuery 能够无缝地融入这些系统中。这种方法使 CodeQuery 高度可维护和可扩展,能够支持多元化的需求,并有效应对不断变化的需求。此外,CodeQuery 的开放架构鼓励各种内部系统之间的互操作性,实现了无缝的交互和数据交换。这种集成和交互能力不仅提高了组织内部的自动化程度,也提高了效率,降低了手动错误的可能性。通过打破信息孤岛,推动更互联、更自动化的环境,CodeQuery 显著提高了软件开发过程的整体生产力和效率。 +此外,CodeQuery 的以数据为中心的方法在处理静态源代码分析的领域特定挑战时具有独特的优势。例如,源代码通常是一个高度结构化和互联的数据集,与其他代码和配置文件有强烈的信息和连接。将代码视为数据,CodeQuery 可以巧妙地处理这些问题,这使得它特别适合在大型组织中使用,其中代码库持续但逐步地进行演变,大部分代码在每天进行微小的改动同时保持稳定。 CodeQuery 还支持如基于代码数据的商业智能 (BI) 这类用例,能生成报告和仪表板,协助监控和决策过程。此外,CodeQuery 在分析大型语言模型 (LLM) 的训练数据方面发挥了重要作用,提供了增强这些模型整体效果的深入见解。 + +在当前的静态分析领域,CodeQuery 带来了一种新的范式。它不仅满足了大规模、复杂的代码库分析需求,还能适应不断变化和多元化的静态分析场景。CodeQuery 的以数据为中心的方法,使得其在处理大数据环境中的代码分析问题时具有独特优势。CodeQuery 的设计,旨在解决大规模软件开发环境中的静态分析问题。它能够将源代码和分析结果视作数据,使得其可以灵活地融入大型组织的各种系统中。这种方法不仅可以有效地处理大规模的代码库,还可以应对各种复杂的分析需求,从而使得静态分析工作变得更加高效和准确。 + +CodeQuery 的特点和优势可以概括为以下几点: + +- **高度可扩展**:CodeQuery 可以处理大规模的代码库,且能够适应不同的分析需求。这种高度的可扩展性使得 CodeQuery 可以在大型组织中发挥重要作用。 +- **以数据为中心**:CodeQuery 将源代码和分析结果视作数据,这种以数据为中心的方法使其在处理大数据环境中的代码分析问题时具有独特优势。 +- **高度集成**:CodeQuery 能够无缝地融入大型组织的各种系统中,包括数据仓库、数据计算设施、对象存储和灵活计算资源等。这种高度的集成性使得 CodeQuery 在大型组织中的使用变得更加方便和高效。 +- **支持多元化的需求**:CodeQuery 
不仅可以处理大规模的代码库,还可以应对各种复杂的分析需求,包括服务质量分析需求、跨编程语言分析需求、算法需求和性能需求等。 + +CodeQuery 是一种强大的静态代码分析平台,适合大规模、复杂的代码库分析场景。它的以数据为中心的方法和高度的可扩展性使得它在现代软件开发环境中具有独特的优势。未来,随着静态代码分析技术的不断发展,CodeQuery 有望在这个领域中扮演更加重要的角色。 diff --git a/content/zh/docs/overview/b9.mftvlm.md b/content/zh/docs/overview/b9.mftvlm.md new file mode 100644 index 0000000..b59ea59 --- /dev/null +++ b/content/zh/docs/overview/b9.mftvlm.md @@ -0,0 +1,33 @@ +--- +title: CodeFuse-MFT-VLM +slug: CodeFuse-MFT-VLM +description: 介绍主要功能 +aliases: +- "/docs/codefuse-mft-vlm-zh" +--- + +## CodeFuse-VLM +CodeFuse-VLM 是一个多模态大语言模型框架,该框架为用户提供多种视觉编码器,模态对齐模块和大语言模型的选择,以适配用户对不同任务的需求。 + +随着huggingface开源社区的不断更新,会有更多的vision encoder 和 LLM 底座发布,这些vision encoder 和 LLM底座都有各自的强项,例如 code-llama 适合生成代码类任务,但是不适合生成中文类的任务;因此我们搭建了CodeFuse-VLM 框架,支持多种视觉模型和语言大模型,使得CodeFuse-VLM可以适应不同种类的任务。 + +![img.jpg](/images/mft-vlm/CodeFuse-VLM-arch.png) + +我们在CodeFuse-VLM 框架下, 使用Qwen-VL的视觉编码器, cross attention模态对齐模块, 和 Qwen-14B 模型训练了 CodeFuse-VLM-14B + +CodeFuse-VLM-14B 在多个benchmarks 上的性能超过了Qwen-VL和LLAVA-1.5 +![img.jpg](/images/mft-vlm/CodeFuse-VLM-14B-performance.png) + +各个模型得分如下表所示: +模型 | MMBench | MMBench-CN | VqaV2 | GQA | TextVQA | Vizwiz +| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | +LLAVA-1.5 | 67.7 | 63.6 | 80.0 | 63.3 | 61.3 | 53.6 +Qwen-VL | 60.6 | 56.7 | 78.2 | 57.5 | 63.8 | 38.9 +CodeFuse-VLM-14B | 75.7 | 69.8 | 79.3 | 59.4 | 63.9 | 45.3 + +我们的模型在MMBenchmark 多模态大模型榜单上取得了很高的排名: https://mmbench.opencompass.org.cn/leaderboard + +这是我们模型的展示视频 + +https://private-user-images.githubusercontent.com/22836551/300386230-8e64f615-ac0e-447e-9695-c96b254d484f.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDY1MjExODksIm5iZiI6MTcwNjUyMDg4OSwicGF0aCI6Ii8yMjgzNjU1MS8zMDAzODYyMzAtOGU2NGY2MTUtYWMwZS00NDdlLTk2OTUtYzk2YjI1NGQ0ODRmLm1wND9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI5VDA5MzQ0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQ5NzNjM2U1ZWU4NDU0Yzc5NmE4ZTM1NzY2ZjU4YjRjY2ZhNjMzODk0ZDgzMDg4N2FjYjZhYTllM2E3NTAyMWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.pr-ad7rKYBgk26DTItj2q2q9I5dRWnBNHbV9M7GSVCo + diff --git a/content/zh/docs/testagent/1_quickstart.md b/content/zh/docs/testagent/1_quickstart.md new file mode 100644 index 0000000..04efa4e --- /dev/null +++ b/content/zh/docs/testagent/1_quickstart.md @@ -0,0 +1,62 @@ +--- +title: "快速使用" +slug: "快速使用" +description: 介绍主要功能 +url: "/docs/test-agent-quickstart-zh" +aliases: +- "/docs/test-agent-quickstart-zh" +--- + +## 快速使用(QuickStart) +### 前置准备 + +#### 模型下载 + +您可在[modelscope](https://modelscope.cn/models/codefuse-ai/TestGPT-7B)或[huggingface](https://huggingface.co/codefuse-ai/TestGPT-7B)上获取到模型的详细信息并下载模型文件。 +需要注意的是: +1)如果您通过modelscope下载模型,下载方式可参考:[下载说明](https://www.modelscope.cn/docs/%E6%A8%A1%E5%9E%8B%E7%9A%84%E4%B8%8B%E8%BD%BD#%E4%BD%BF%E7%94%A8Git%E4%B8%8B%E8%BD%BD%E6%A8%A1%E5%9E%8B); +2)如果您通过huggingface下载模型,请确保您可以正常访问huggingface。 + +#### 环境安装 + +- python>=3.8 +- transformers==4.33.2 + +```plain +git clone https://github.com/codefuse-ai/Test-Agent +cd Test-Agent +pip install -r requirements.txt +``` + +在开始运行TestGPT-7B模型之前,请确保你的执行环境拥有大约14GB的显存。 +### 启动服务 + +项目提供了网页端快速搭建UI的能力能够更直观的展示模型交互和效果,我们可以使用简单的几个命令把前端页面唤醒并实时调用模型能力。在项目目录下,依次启动以下服务: + +1.**启动controller** 
+![controller](https://github.com/codefuse-ai/Test-Agent/assets/103973989/e68ce187-c9f1-4ce8-9d59-ff9d8348d0ac) +python3 -m chat.server.controller + +2.**启动模型worker** +![work](https://github.com/codefuse-ai/Test-Agent/assets/103973989/073e4e79-4005-4c98-87f7-0eaa0b2b1e22) +python3 -m chat.server.model_worker --model-path models/TestGPT-7B --device mps + +(models/TestGPT-7B 为实际模型文件路径) + +对于启动方式,可以按需选择以下几种配置选项: +- --device mps 用于在Mac电脑上开启GPU加速的选项(Apple Silicon或AMD GPUs); +- --device xpu 用于在Intel XPU上开启加速的选项(Intel Data Center and Arc A-Series GPUs); + - 需安装[Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html) + - 设置OneAPI环境变量:source /opt/intel/oneapi/setvars.sh +- --device npu 用于在华为AI处理器上开启加速的选项; + - 需安装[Ascend PyTorch Adapter](https://github.com/Ascend/pytorch) + - 设置CANN环境变量:source /usr/local/Ascend/ascend-toolkit/set_env.sh +- --device cpu 单独使用CPU运行的选项,不需要GPU; +- --num-gpus 2 指定并发gpu运行的选项。 + +3. **启动web服务** +python3 -m chat.server.gradio_testgpt +![web](https://github.com/codefuse-ai/Test-Agent/assets/103973989/340dae35-573b-4046-a3e8-e87a91453601) +待服务准备就绪后,我们可以打开本地启动的web服务地址 http://0.0.0.0:7860 ,就能看到完整的前端页面了。在页面下方包含了【单测生成】和【Assert补全】的两个例子,点击按钮后会自动生成一段样例文本到输入框中,点击Send按钮就会触发模型运行,之后耐心等待一段时间后(运行时间视本机性能而定)即可看到完整的回答了。 +![demo](https://github.com/codefuse-ai/Test-Agent/assets/103973989/fd24274c-729b-4ce7-8763-a083b39300fb) + diff --git a/content/zh/muagent/connector/connector_agent.md b/content/zh/muagent/connector/connector_agent.md new file mode 100644 index 0000000..5e93234 --- /dev/null +++ b/content/zh/muagent/connector/connector_agent.md @@ -0,0 +1,153 @@ +--- +title: Connector Agent +slug: Connector Agent ZH +url: "muagent/connector-agent-zh" +aliases: +- "/muagent/connector-agent-zh" +--- + + +## 快速构建一个Agent +### 首先增加openai配置,也可以是其它类似于openai接口的模型(通过fastchat启动) +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + +### 然后设置LLM配置和向量模型配置 + +- 配置相关 LLM 和 Embedding Model +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS + + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### Agent 配置 +- 定义两个react agent,进行实际任务执行 +``` +# 这里采用了预定义的prompt,也可以参考上述prompt完成编写 +from muagent.connector.configs.prompts import REACT_CODE_PROMPT, REACT_TOOL_PROMPT +# 定义了基于react的tool agent +tool_role = Role(role_type="assistant", role_name="tool_reacter", prompt=REACT_TOOL_PROMPT) +tool_react_agent = ReactAgent( + role=tool_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + +# 定义了基于react的code agent +code_role = Role(role_type="assistant", role_name="code_reacter", prompt=REACT_CODE_PROMPT) +code_react_agent = ReactAgent( + role=code_role, + task="", + chat_turn=3, + 
focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + +``` + +- 定义groupAgent,用于agent选择 +``` +prompt = """#### Agent Profile + +Your goal is to response according the Context Data's information with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. + +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions and tool list. + +ATTENTION: response carefully referenced "Response Output Format" in format. + +#### Response Output Format + +**Thoughts:** think the reason step by step about why you selecte one role + +**Role:** Select the role from agent names. +""" + +# 定义了一个groupAgent +role = Role(role_type="assistant", role_name="qaer", prompt=prompt) +base_agent = SelectorAgent( + role=role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, + group_agents=[tool_react_agent, code_react_agent] +) +``` + +### 开始实际问答 +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +question = "确认本地是否存在employee_data.csv,并查看它有哪些列和数据类型;然后画柱状图" +query = Message( + user_name="test", role_type="user", role_name="user", input_query=question, + tools=tools, +) +# base_agent.pre_print(query) +output_message = base_agent.step(query) +print(output_message.input_query) +print(output_message.role_content) +``` + +## Agent 参数配置 +``` +# 配置结构在这个目录 +from muagent.connector.schema import Role +``` + + +### Agent Config +|Config Key Name| Type| Description| +| ------------------ | ---------- | ---------- | +|role| Role |角色描述| +|focus_agents |List[String] |metagpt的逻辑,关注哪些agent生成的message,可选值范围为:role_name +|focus_message_keys |List[String]| 额外增加的逻辑,关注message里面具体的 key 信息可选值范围为:agent 的 output_keys| +|chat_turn |int |只针对ReactAgent有效| +|llm_config |LLMConfig |大语言模型配置| +|embed_config |EmbedConfig |向量模型配置| +|sandbox_server |Dict |沙盒环境即notebook启动配置| +|jupyter_work_path |str |沙盒环境的工作目录| +|kb_root_path |str |memory的存储路径| +|log_verbose |str |agent prompt&predict的日志打印级别| + +### Role + +| Config Key Name | Type | Description | +|------------------|------|--------------------| +| role_type | str | 角色类型, Enum: system、user、assistant、function、observation、summary | +| role_name | str | 角色名称 | +| role_desc | str | 角色描述 | +| agent_type | str | 代理类型 | +| role_prompt | str | 角色instruction | +| prompt | str | 完整prompt结构 | diff --git a/content/zh/muagent/connector/connector_chain.md b/content/zh/muagent/connector/connector_chain.md new file mode 100644 index 0000000..cca5ee0 --- /dev/null +++ b/content/zh/muagent/connector/connector_chain.md @@ -0,0 +1,145 @@ +--- +title: Connector Chain +slug: Connector Chain ZH +url: "muagent/connector-chain-zh" +aliases: +- "/muagent/connector-chain-zh" +--- + + +## 快速构建一个Agent Chain +- 首先增加openai配置,也可以是其它类似于openai接口的模型(通过fastchat启动) +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + +### 然后设置LLM配置和向量模型配置 +- 配置相关 LLM 和 Embedding Model +``` +from muagent.base_configs.env_config import 
JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS + + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### Agent 配置 +- 定义两个react agent,进行实际任务执行 +``` +# 这里采用了预定义的prompt,也可以参考上述prompt完成编写 +from muagent.connector.configs.prompts import REACT_CODE_PROMPT, REACT_TOOL_PROMPT +# 定义了基于react的tool agent +tool_role = Role(role_type="assistant", role_name="tool_reacter", prompt=REACT_TOOL_PROMPT) +tool_react_agent = ReactAgent( + role=tool_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + +# 定义了基于react的code agent +code_role = Role(role_type="assistant", role_name="code_reacter", prompt=REACT_CODE_PROMPT) +code_react_agent = ReactAgent( + role=code_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + +``` + +- 定义groupAgent,用于agent选择 +``` +prompt = """#### Agent Profile + +Your goal is to response according the Context Data's information with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. + +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions and tool list. + +ATTENTION: response carefully referenced "Response Output Format" in format. + +#### Response Output Format + +**Thoughts:** think the reason step by step about why you selecte one role + +**Role:** Select the role from agent names. 
+""" + +# 定义了一个groupAgent +role = Role(role_type="assistant", role_name="qaer", prompt=prompt) +base_agent = SelectorAgent( + role=role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, + group_agents=[tool_react_agent, code_react_agent] +) +``` +### Chain 配置 +``` +chain_config = ChainConfig(chain_name="group_chain", agents=[base_agent.role.role_name], chat_turn=1) +base_chain = BaseChain( + chainConfig=chain_config, agents=[base_agent], + llm_config=llm_config, embed_config=embed_config, +) + +``` + + +### 开始实际问答 +- 开始执行 +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +question = "确认本地是否存在employee_data.csv,并查看它有哪些列和数据类型;然后画柱状图" +query = Message( + user_name="test", role_type="user", role_name="user", input_query=question, + tools=tools, +) + +# base_chain.pre_print(query) +output_message, output_memory = base_chain.step(query) +print(output_message.input_query) +print(output_message.role_content) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +## Chain 参数配置 +|Config Key Name| Type |Description| +| ------------------ | ---------- | ---------- | +|agents| List[BaseAgent] | +|llm_config |LLMConfig |大语言模型配置| +|embed_config |EmbedConfig |向量模型配置| +|sandbox_server |Dict |沙盒环境即notebook启动配置| +|jupyter_work_path |str |沙盒环境的工作目录| +|kb_root_path |str |memory的存储路径| +|log_verbose |str |agent prompt&predict的日志打印级别| diff --git a/content/zh/muagent/connector/connector_memory.md b/content/zh/muagent/connector/connector_memory.md new file mode 100644 index 0000000..29b27fd --- /dev/null +++ b/content/zh/muagent/connector/connector_memory.md @@ -0,0 +1,116 @@ +--- +title: Connector Memory +slug: Connector Memory ZH +url: "muagent/connector-memory-zh" +aliases: +- "/muagent/connector-memory-zh" +--- + + +## Memory Manager +- 将chat history在数据库进行读写管理,包括user input、 llm output、doc retrieval、code retrieval、search retrieval +- 对 chat history 进行关键信息总结 summary context,作为 prompt context +- 提供检索功能,检索 chat history 或者 summary context 中与问题相关信息,辅助问答 + + + +## 使用示例 +完整示例见 ~/tests/connector/memory_manager_test.py +### 创建 memory manager 实例 +``` +import os +import openai + +from muagent.base_configs.env_config import KB_ROOT_PATH +from muagent.connector.memory_manager import BaseMemoryManager, LocalMemoryManager +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.connector.schema import Message + +# +OPENAI_API_BASE = "https://api.openai.com/v1" +os.environ["API_BASE_URL"] = OPENAI_API_BASE +os.environ["OPENAI_API_KEY"] = "sk-xxx" +openai.api_key = "sk-xxx" +os.environ["model_name"] = "gpt-3.5-turbo" + +# +os.environ["embed_model"] = "{{embed_model_name}}" +os.environ["embed_model_path"] = "{{embed_model_path}}" + +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" + + +# LLM 和 Embedding Model 配置 +llm_config = LLMConfig( + model_name=os.environ["model_name"], api_key=os.environ["OPENAI_API_KEY"], + api_base_url=os.environ["API_BASE_URL"], temperature=0.3 +) + +embed_config = EmbedConfig( + embed_engine="model", embed_model=os.environ["embed_model"], + embed_model_path=os.environ["embed_model_path"] +) +# +phase_name = "test" +memory_manager = LocalMemoryManager( + unique_name=phase_name, + 
do_init=True, + kb_root_path = KB_ROOT_PATH, + embed_config=embed_config, + llm_config=llm_config + ) +``` + +### 支持Message管理 + +``` +message1 = Message( + role_name="test1", role_type="user", role_content="hello", + parsed_output_list=[{"input": "hello"}], user_name="default" +) + +text = "hi! how can I help you?" +message2 = Message( + role_name="test2", role_type="assistant", role_content=text, parsed_output_list=[{"answer": text}], + user_name="shuimo" +) + +text = "they say hello and hi to each other" +message3 = Message( + role_name="test3", role_type="summary", role_content=text, + parsed_output_list=[{"summary": text}], + user_name="shanshi" + ) + +local_memory_manager.append(message=message1) +local_memory_manager.append(message=message2) +local_memory_manager.append(message=message3) +``` + +### 重新加载 +``` +local_memory_manager = LocalMemoryManager(user_name="shanshi", embed_config=embed_config, llm_config=llm_config, do_init=False) +local_memory_manager.load() +print(local_memory_manager.get_memory_pool("default").messages) +print(local_memory_manager.get_memory_pool("shanshi").messages) +print(local_memory_manager.get_memory_pool("shuimo").messages) +``` + +### 支持 memory 检索 +``` +# embedding retrieval test +text = "say hi to each other, i want some help" +# retrieval_type=datetime => retrieval from datetime and jieba +print(local_memory_manager.router_retrieval(user_name="shanshi", text=text, datetime="2024-03-12 17:48:00", n=4, top_k=5, retrieval_type= "datetime")) +# retrieval_type=eembedding => retrieval from embedding +print(local_memory_manager.router_retrieval(user_name="shanshi", text=text, top_k=5, retrieval_type= "embedding")) +# retrieval_type=text => retrieval from jieba +print(local_memory_manager.router_retrieval(user_name="shanshi", text=text, top_k=5, retrieval_type= "text")) + +``` +### 支持 memory 总结 +``` +# recursive_summary test +print(local_memory_manager.recursive_summary(local_memory_manager.get_memory_pool("shanshi").messages, split_n=1)) +``` \ No newline at end of file diff --git a/content/zh/muagent/connector/connector_phase.md b/content/zh/muagent/connector/connector_phase.md new file mode 100644 index 0000000..636e07b --- /dev/null +++ b/content/zh/muagent/connector/connector_phase.md @@ -0,0 +1,155 @@ +--- +title: Connector Phase +slug: Connector Phase ZH +url: "muagent/connector-phase-zh" +aliases: +- "/muagent/connector-phase-zh" +--- + + + +## 快速构建一个Agent Phase +- 首先增加openai配置,也可以是其它类似于openai接口的模型(通过fastchat启动) +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + +### 然后设置LLM配置和向量模型配置 + +- 配置相关 LLM 和 Embedding Model +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS + + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### Agent 配置 +- 定义两个react 
agent,进行实际任务执行 +``` +# 这里采用了预定义的prompt,也可以参考上述prompt完成编写 +from muagent.connector.configs.prompts import REACT_CODE_PROMPT, REACT_TOOL_PROMPT +# 定义了基于react的tool agent +tool_role = Role(role_type="assistant", role_name="tool_reacter", prompt=REACT_TOOL_PROMPT) +tool_react_agent = ReactAgent( + role=tool_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + +# 定义了基于react的code agent +code_role = Role(role_type="assistant", role_name="code_reacter", prompt=REACT_CODE_PROMPT) +code_react_agent = ReactAgent( + role=code_role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, +) + +``` + +- 定义groupAgent,用于agent选择 +``` +prompt = """#### Agent Profile + +Your goal is to response according the Context Data's information with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. + +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions and tool list. + +ATTENTION: response carefully referenced "Response Output Format" in format. + +#### Response Output Format + +**Thoughts:** think the reason step by step about why you selecte one role + +**Role:** Select the role from agent names. +""" + +# 定义了一个groupAgent +role = Role(role_type="assistant", role_name="qaer", prompt=prompt) +base_agent = SelectorAgent( + role=role, + task="", + chat_turn=3, + focus_agents=[], + focus_message_keys=[], + llm_config=llm_config, embed_config=embed_config, + group_agents=[tool_react_agent, code_react_agent] +) +``` +### Chain 配置 +``` +chain_config = ChainConfig(chain_name="group_chain", agents=[base_agent.role.role_name], chat_turn=1) +base_chain = BaseChain( + chainConfig=chain_config, agents=[base_agent], + llm_config=llm_config, embed_config=embed_config, +) + +``` +### Phase 配置 +``` +base_phase = BasePhase( + phase_name="group_phase", chains=[base_chain], + embed_config=embed_config, llm_config=llm_config +) +``` + + +### 开始实际问答 +- 开始执行 +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +question = "确认本地是否存在employee_data.csv,并查看它有哪些列和数据类型;然后画柱状图" +query = Message( + user_name="test", role_type="user", role_name="user", input_query=question, + tools=tools, +) + +# base_phase.pre_print(query) +output_message, output_memory = base_phase.step(query) +print(output_message.input_query) +print(output_message.role_content) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +## Phase 参数配置 +|Config Key Name |Type |Description| +| ------------------ | ---------- | ---------- | +|phase_name| String| 场景名称| +|chains| List[Chain] | chain列表,按顺序执行 | +|llm_config |LLMConfig |大语言模型配置| +|embed_config |EmbedConfig |向量模型配置| +|sandbox_server |Dict |沙盒环境即notebook启动配置| +|jupyter_work_path |str |沙盒环境的工作目录| +|kb_root_path |str |memory的存储路径| +|log_verbose |str |agent prompt&predict的日志打印级别| \ No newline at end of file diff --git a/content/zh/muagent/connector/connector_prompt.md b/content/zh/muagent/connector/connector_prompt.md new file mode 100644 index 0000000..22569e6 --- /dev/null +++ b/content/zh/muagent/connector/connector_prompt.md @@ -0,0 +1,233 @@ +--- +title: Connector Prompt +slug: 
Connector Prompt ZH +url: "muagent/connector-prompt-zh" +aliases: +- "/muagent/connector-prompt-zh" +--- + + +## 提示管理器(Prompt Manager) +管理多智能体链路中的prompt创建 +- 快速配置:采用预设的处理函数,用户仅需通过定义智能体的输入输出即可轻松配置,实现多智能体的prompt快速组装和配置。 +- 自定义支持:允许用户自定义prompt内部各模块的处理逻辑,以达到个性化的智能体prompt实现。 + +### Prompt预设模板结构 + +- Agent Profile:此部分涉及到智能体的基础描述,包括但不限于代理的类型、功能和指令集。用户可以在这里设置智能体的基本属性,确保其行为与预期相符。 +- Context:上下文信息,给智能体做参考,帮助智能体更好的进行决策。 + - Tool Information:此部分为智能体提供了一套可用工具的清单,智能体可以根据当前的场景需求从中挑选合适的工具以辅助其执行任务。 + - Reference Documents:这里可以包含代理参考使用的文档或代码片段,以便于它在处理请求时能够参照相关资料。 + - Session Records:在进行多轮对话时,此部分会记录之前的交谈内容,确保智能体能够在上下文中保持连贯性。 +- Response Output Format:用户可以在此设置智能体的输出格式,以确保生成的响应满足特定的格式要求,包括结构、语法等。 + + +## Prompt 的标准结构 +在整个Prompt的整个结构中,我们需要去定义三个部分 +- Agent Profil +- Input Format +- Response Output Format + +``` +#### Agent Profile + +Agent Description ... + +#### Input Format + +**Origin Query:** the initial question or objective that the user wanted to achieve + +**Context:** the current status and history of the tasks to determine if Origin Query has been achieved. + +#### Response Output Format +**Action Status:** finished or continued +If it's 'finished', the context can answer the origin query. +If it's 'continued', the context cant answer the origin query. + +**REASON:** Justify the decision of choosing 'finished' and 'continued' by evaluating the progress step by step. +Consider all relevant information. If the tasks were aimed at an ongoing process, assess whether it has reached a satisfactory conclusion. +``` + + +其中,我们整合了部分 `Input Format` 的通用操作,内置了一部分字段和操作流程,形成通用的配置化操作。 + +未来我们会也会进一步将 Agent Profile和Response Output Format的部分,实现可配置化操作,降低Prompt编写难度 + +### 自定义 Agent + +- 有自定义字段需求,根据实际需求完成构造 +``` +class CodeGenDocer(BaseAgent): + + def start_action_step(self, message: Message) -> Message: + '''do action before agent predict ''' + # 根据问题获取代码片段和节点信息 + action_json = CodeRetrievalSingle.run(message.code_engine_name, message.input_query, llm_config=self.llm_config, + embed_config=self.embed_config, local_graph_path=message.local_graph_path, use_nh=message.use_nh,search_type="tag") + current_vertex = action_json['vertex'] + message.customed_kargs["Code Snippet"] = action_json["code"] + message.customed_kargs['Current_Vertex'] = current_vertex + return message + +``` + +### pre_print 功能 +在我们构建phase、chain或者agent之后,可以通过函数的预打印功能,实现agents链路确认,避免在执行后才发现问题,可提前进行debug +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.connector.agents import BaseAgent, ReactAgent, ExecutorAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Role, Message, ChainConfig +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS + + +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" + +llm_config = LLMConfig( + model_name="gpt-4", api_key=api_key, api_base_url=api_base_url, temperature=0.3 +) +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) + +phase_name = "baseGroupPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, +) + +phase.pre_print(query) +``` + + +这里采用预定义好的链路,自定义case可见[customed_example](/muagent/customed-examples-zh) +
+ + + +``` +>>> 完整信息确认 muagent.connector.configs中进行确认 + +########################## +<<<>>> +########################## + +### Agent Profile +Your goal is to response according the Context Data's information with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions and tool list. +ATTENTION: response carefully referenced "Response Output Format" in format. + +### Tool Information + +### Agent Infomation + Please ensure your selection is one of the listed roles. Available roles for selection: + "role name: tool_react +role description: Agent Profile,When interacting with users, your role is to respond in a helpful and accurate manner using the tools available. Follow the steps below to ensure efficient and effective use of the tools.,Please note that all the tools you can use are listed below. You can only choose from these tools for use. ,If there are no suitable tools, please do not invent any tools. Just let the user know that you do not have suitable tools to use.,ATTENTION: The Action Status field ensures that the tools or code mentioned in the Action can be parsed smoothly. Please make sure not to omit the Action Status field when replying.," +"role name: code_react +role description: Agent Profile,When users need help with coding, your role is to provide precise and effective guidance.,Write the code step by step, showing only the part necessary to solve the current problem. Each reply should contain only the code required for the current step.," + Please ensure select the Role from agent names, such as tool_react, code_react + +### Context Data + +#### Reference Documents + +#### Session Records + +#### Current Plan + +### Response Output Format +**Thoughts:** think the reason step by step about why you selecte one role +**Role:** Select the role from agent names. + +### Begin!!! + +################### +<<<>>> +################### + +**Thoughts:** +**Role:** + + +########################### +<<<>>> +########################### +### Agent Profile +When interacting with users, your role is to respond in a helpful and accurate manner using the tools available. Follow the steps below to ensure efficient and effective use of the tools. +Please note that all the tools you can use are listed below. You can only choose from these tools for use. +If there are no suitable tools, please do not invent any tools. Just let the user know that you do not have suitable tools to use. +ATTENTION: The Action Status field ensures that the tools or code mentioned in the Action can be parsed smoothly. Please make sure not to omit the Action Status field when replying. + +### Tool Information + +### Context Data + +#### Reference Documents + +#### Session Records + +#### Task Records + +### Response Output Format +**Thoughts:** According the previous observations, plan the approach for using the tool effectively. +... + +### Begin!!! + +################### +<<<>>> +################### +**Thoughts:** +**Action Status:** +**Action:** +**Observation:** +**Thoughts:** +**Action Status:** +**Action:** + +########################### +<<<>>> +########################### +### Agent Profile +When users need help with coding, your role is to provide precise and effective guidance. +Write the code step by step, showing only the part necessary to solve the current problem. 
Each reply should contain only the code required for the current step. + +### Context Data + +#### Reference Documents + +#### Session Records + +### Response Output Format + +**Thoughts:** According the previous context, solve the problem step by step, only displaying the thought process necessary for the current step of solving the problem, +outline the plan for executing this step. + +**Action Status:** Set to 'stopped' or 'code_executing'. +If it's 'stopped', the action is to provide the final answer to the session records and executed steps. +If it's 'code_executing', the action is to write the code. +... + +### Begin!!! + +################### +<<<>>> +################### + +**Thoughts:** +**Action Status:** +**Action:** +**Observation:** +**Thoughts:** +**Action Status:** +**Action:** + +``` diff --git a/content/zh/muagent/connector/customed_examples.md b/content/zh/muagent/connector/customed_examples.md new file mode 100644 index 0000000..de451e5 --- /dev/null +++ b/content/zh/muagent/connector/customed_examples.md @@ -0,0 +1,302 @@ +--- +title: Customed Examples +slug: Customed Examples ZH +url: "muagent/custom-examples-zh" +aliases: +- "/muagent/custom-examples-zh" +--- + + +## 如何创建你个性化的 agent phase 场景 + +下面通过 代码库来实现代码转API文档的自动生成, 来详细演示如何自定义一个 agent phase 的构建 + +### 设计你的prompt结构 + +- codeGenDocGroup_PROMPT, 构建 group Agent Prompt +``` +# update new agent configs +codeGenDocGroup_PROMPT = """#### Agent Profile + +Your goal is to response according the Context Data's information with the role that will best facilitate a solution, taking into account all relevant context (Context) provided. + +When you need to select the appropriate role for handling a user's query, carefully read the provided role names, role descriptions and tool list. + +#### Input Format + +#### Response Output Format + +**Code Path:** Extract the paths for the class/method/function that need to be addressed from the context + +**Role:** Select the role from agent names +""" +``` + +- classGenDoc_PROMPT, 构建 class code to api doc Prompt +``` +classGenDoc_PROMPT = """#### Agent Profile +As an advanced code documentation generator, you are proficient in translating class definitions into comprehensive documentation with a focus on instantiation parameters. +Your specific task is to parse the given code snippet of a class, extract information regarding its instantiation parameters. + +#### Input Format + +**Current_Vertex:** Provide the code vertex of the function or method. + +**Code Snippet:** Provide the full class definition, including the constructor and any parameters it may require for instantiation. + +#### Response Output Format +**Class Base:** Specify the base class or interface from which the current class extends, if any. + +**Class Description:** Offer a brief description of the class's purpose and functionality. + +**Init Parameters:** List each parameter from construct. For each parameter, provide: + - `param`: The parameter name + - `param_description`: A concise explanation of the parameter's purpose. + - `param_type`: The data type of the parameter, if explicitly defined. + + ```json + [ + { + "param": "parameter_name", + "param_description": "A brief description of what this parameter is used for.", + "param_type": "The data type of the parameter" + }, + ... 
+ ] + ``` + + + If no parameter for construct, return + ```json + [] + ``` +""" +``` + +- funcGenDoc_PROMPT,构建 function code to api doc Prompt +``` +funcGenDoc_PROMPT = """#### Agent Profile +You are a high-level code documentation assistant, skilled at extracting information from function/method code into detailed and well-structured documentation. + + +#### Input Format +**Code Path:** Provide the code path of the function or method you wish to document. +This name will be used to identify and extract the relevant details from the code snippet provided. + +**Current_Vertex:** Provide the code vertex of the function or method. + +**Code Snippet:** A segment of code that contains the function or method to be documented. + +#### Response Output Format + +**Class Description:** Offer a brief description of the method(function)'s purpose and functionality. + +**Parameters:** Extract parameter for the specific function/method Code from Code Snippet. For parameter, provide: + - `param`: The parameter name + - `param_description`: A concise explanation of the parameter's purpose. + - `param_type`: The data type of the parameter, if explicitly defined. + ```json + [ + { + "param": "parameter_name", + "param_description": "A brief description of what this parameter is used for.", + "param_type": "The data type of the parameter" + }, + ... + ] + ``` + + If no parameter for function/method, return + ```json + [] + ``` + +**Return Value Description:** Describe what the function/method returns upon completion. + +**Return Type:** Indicate the type of data the function/method returns (e.g., string, integer, object, void). +""" +``` + +### 导包以及基础参数配置 +- 首先增加openai配置,也可以是其它类似于openai接口的模型(通过fastchat启动) +``` +import os, sys +from muagent.base_configs.env_config import CB_ROOT_PATH +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.connector.phase import BasePhase +from muagent.connector.agents import BaseAgent, SelectorAgent +from muagent.connector.chains import BaseChain +from muagent.connector.schema import Message, Role, ChainConfig +from muagent.codechat.codebase_handler.codebase_handler import CodeBaseHandler + +from loguru import logger +from muagent.tools import CodeRetrievalSingle + + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + + + +### 定义新的agent类 +用于自定义key-value信息 +``` +class CodeGenDocer(BaseAgent): + + def start_action_step(self, message: Message) -> Message: + '''do action before agent predict ''' + # 根据问题获取代码片段和节点信息 + action_json = CodeRetrievalSingle.run(message.code_engine_name, message.input_query, llm_config=self.llm_config, + embed_config=self.embed_config, local_graph_path=message.local_graph_path, use_nh=message.use_nh,search_type="tag") + current_vertex = action_json['vertex'] + message.customed_kargs["Code Snippet"] = action_json["code"] + message.customed_kargs['Current_Vertex'] = current_vertex + return message + +``` + +### 准备LLM & Embedding +``` +llm_config = LLMConfig( + model_name="gpt-4", api_key=api_key, api_base_url=api_base_url, temperature=0.3 +) +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### 代码库加载 + +``` + +# initialize codebase +# delete codebase +codebase_name = 'client_nebula' +code_path = 
"D://chromeDownloads/devopschat-bot/client_v2/client" +use_nh = True +do_interpret = False +cbh = CodeBaseHandler(codebase_name, code_path, crawl_type='dir', use_nh=use_nh, local_graph_path=CB_ROOT_PATH, + llm_config=llm_config, embed_config=embed_config) +cbh.delete_codebase(codebase_name=codebase_name) + +# load codebase +cbh = CodeBaseHandler(codebase_name, code_path, crawl_type='dir', use_nh=use_nh, local_graph_path=CB_ROOT_PATH, + llm_config=llm_config, embed_config=embed_config) +cbh.import_code(do_interpret=do_interpret) + +``` + +### 接下来就构建 phase 实例,开始执行 +``` + +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "1" + +funcGenDoc_role = Role(role_type="assistant", role_name="funcGenDoc_role", prompt=funcGenDoc_PROMPT) +funcGenDoc = CodeGenDocer( + role=funcGenDoc_role, + chat_turn=1, + llm_config=llm_config, embed_config=embed_config, +) + + +classGenDoc_role = Role(role_type="assistant", role_name="classGenDoc_role", prompt=classGenDoc_PROMPT) +classGenDoc = CodeGenDocer( + role=classGenDoc_role, + chat_turn=1, + llm_config=llm_config, embed_config=embed_config, +) + +codeGenDocGroup_role = Role(role_type="assistant", role_name="codeGenDocGroup_role", prompt=codeGenDocGroup_PROMPT) +codeGenDocGroup = SelectorAgent( + role=codeGenDocGroup_role, + chat_turn=1, + llm_config=llm_config, embed_config=embed_config, + group_agents=[funcGenDoc, classGenDoc] +) + +chain_config = ChainConfig( + chain_name="codeGenDocGroup_chain", agents=[codeGenDocGroup.role.role_name,], + chat_turn=1) + +chain = BaseChain( + chainConfig=chain_config, agents=[codeGenDocGroup], + llm_config=llm_config, embed_config=embed_config, +) + +phase = BasePhase( + phase_name="codeGenDocGroup_phase", chains=[chain], + embed_config=embed_config, llm_config=llm_config +) +``` + + +### 开始代码转api文档 +``` +# 根据前面的load过程进行初始化 +cbh = CodeBaseHandler(codebase_name, code_path, crawl_type='dir', use_nh=use_nh, local_graph_path=CB_ROOT_PATH, + llm_config=llm_config, embed_config=embed_config) + +cbh.search_vertices(vertex_type="method") + +# 开始代码转换API文档结构 +for vertex_type in ["class", "method"]: + vertexes = cbh.search_vertices(vertex_type=vertex_type) + logger.info(f"vertexes={vertexes}") + + # round-1 + docs = [] + for vertex in vertexes: + vertex = vertex.split("-")[0] # -为method的参数 + query_content = f"为{vertex_type}节点 {vertex}生成文档" + query = Message( + role_name="human", role_type="user", input_query=query_content, + code_engine_name=codebase_name, score_threshold=1.0, top_k=3, cb_search_type="tag", use_nh=use_nh, + local_graph_path=CB_ROOT_PATH, + ) + output_message, output_memory = phase.step(query, reinit_memory=True) + # print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + docs.append(output_memory.get_spec_parserd_output()) + + os.makedirs(f"{CB_ROOT_PATH}/docs", exist_ok=True) + with open(f"{CB_ROOT_PATH}/docs/raw_{vertex_type}.json", "w") as f: + json.dump(docs, f) + + +# 下面把生成的文档信息转换成markdown文本 +from muagent.utils.code2doc_util import * + +import json +with open(f"/home/user/code_base/docs/raw_method.json", "r") as f: + method_raw_data = json.load(f) + +with open(f"/home/user/code_base/docs/raw_class.json", "r") as f: + class_raw_data = json.load(f) + + +method_data = method_info_decode(method_raw_data) +class_data = class_info_decode(class_raw_data) +method_mds = encode2md(method_data, method_text_md) +class_mds = encode2md(class_data, class_text_md) + +docs_dict = {} +for k,v in class_mds.items(): + method_textmds = method_mds.get(k, []) + for vv in v: + # 理论上只有一个 + 
text_md = vv + + for method_textmd in method_textmds: + text_md += "\n
" + method_textmd + + docs_dict.setdefault(k, []).append(text_md) + + with open(f"/home/user/code_base/docs/{k}.md", "w") as f: + f.write(text_md) +``` \ No newline at end of file diff --git a/content/zh/muagent/llm_models/embedding_config.md b/content/zh/muagent/llm_models/embedding_config.md new file mode 100644 index 0000000..1d9b5cd --- /dev/null +++ b/content/zh/muagent/llm_models/embedding_config.md @@ -0,0 +1,71 @@ +--- +title: Embedding 配置 +url: "muagent/embedding-model-config-zh" +aliases: +- "/muagent/embedding-model-config-zh" +--- + + +## 准备相关参数 +首先增加openai配置,也可以是其它类似于openai接口的模型(通过fastchat启动) + +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +``` + + +## 构建LLM Config +- 通过本地模型文件构建 +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + + +- 通过openai构建 +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig + +embed_config = EmbedConfig( + embed_engine="openai", api_key=api_key, api_base_url=api_base_url, +) +``` + +- 自定义langchain embeddings传入 +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig + + +class CustomizedEmbeddings(Embeddings): + + def embed_documents(self, texts: List[str]) -> List[List[float]]: + embeddings = [] + # add your embedding code + return embeddings + + def embed_query(self, text: str) -> List[float]: + """Compute query embeddings using a HuggingFace transformer model. + + Args: + text: The text to embed. + + Returns: + Embeddings for the text. + """ + # add your embedding code + return embedding + +embeddings = CustomizedEmbeddings() +embed_config = EmbedConfig( + embed_model="default", + langchain_embeddings=embeddings +) +``` \ No newline at end of file diff --git a/content/zh/muagent/llm_models/llm_config.md b/content/zh/muagent/llm_models/llm_config.md new file mode 100644 index 0000000..3ec20b9 --- /dev/null +++ b/content/zh/muagent/llm_models/llm_config.md @@ -0,0 +1,55 @@ +--- +title: LLM 配置 +url: "muagent/llm-model-config-zh" +aliases: +- "/muagent/llm-model-config-zh" +--- + + +## 准备相关参数 +首先增加openai配置,也可以是其它类似于openai接口的模型(通过fastchat启动) + +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +``` + + +## 构建LLM Config +- 通过调用 类openai 传入 +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) +``` + +- 自定义 langchain LLM 传入 +``` +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from langchain.llms.base import BaseLLM, LLM + +class CustomizedModel(LLM): + repetition_penalty = 1.1 + temperature = 0.2 + top_k = 40 + top_p = 0.9 + + def predict(self, prompt: str, stop: Optional[List[str]] = None) -> str: + return self._call(prompt, stop) + + def _call(self, prompt: str, + stop: Optional[List[str]] = None) -> str: + """_call + """ + return "" + +llm = CustomizedModel() +llm_config = LLMConfig( + llm=llm +) +``` \ No newline at end of file diff --git a/content/zh/muagent/overview/agent-flow.md b/content/zh/muagent/overview/agent-flow.md new file mode 100644 index 0000000..46d1795 --- /dev/null +++ b/content/zh/muagent/overview/agent-flow.md @@ -0,0 +1,50 @@ +--- +title: Agent 编排 +slug: Agent 编排 +url: "muagent/agent-编排" +aliases: +- 
"/muagent/agent-编排" +- "/muagent/agent-flow-zh" +--- + + + +## 核心Connector介绍 +为了便于大家理解整个 muagent 的链路,我们采取 Flow 的形式来详细介绍如何通过配置构建 + +
+ 图片 +
+ + +
下面,我们先介绍相关的核心组件
+ +### Agent +在Agent设计层面,我们提供了四种基本的Agent类型,对这些Agent进行Role的基础设定,可满足多种通用场景的交互和使用 +1. BaseAgent:提供基础问答、工具使用、代码执行的功能,根据Prompt格式实现 输入 => 输出 + +2. ReactAgent:提供标准React的功能,根据问题实现当前任务 + +3. ExecutorAgent:对任务清单进行顺序执行,根据 User 或 上一个Agent编排的计划,完成相关任务 + +4. SelectorAgent:提供选择Agent的功能,根据User 或 上一个 Agent的问题选择合适的Agent来进行回答. + + +输出后将 message push 到 memory pool 之中,后续通过Memory Manager进行管理 + +### Chain +基础链路:BaseChain,串联agent的交互,完成相关message和memory的管理 + +### Phase +基础场景:BasePhase,串联chain的交互,完成相关message和memory的管理 + +### Prompt Manager +Mutli-Agent链路中每一个agent的prompt创建 +- 通过对promtp_input_keys和promtp_output_keys对的简单设定,可以沿用预设 Prompt Context 创建逻辑,从而实现agent prompt快速配置 +- 也可以对prompt manager模块进行新的 key-context 设计,实现个性化的 Agent Prompt + +### Memory Manager +主要用于 chat history 的管理 +- 将chat history在数据库进行读写管理,包括user input、 llm output、doc retrieval、code retrieval、search retrieval +- 对 chat history 进行关键信息总结 summary context,作为 prompt context +- 提供检索功能,检索 chat history 或者 summary context 中与问题相关信息,辅助问答 diff --git a/content/zh/muagent/overview/multi-agent.md b/content/zh/muagent/overview/multi-agent.md new file mode 100644 index 0000000..da907d0 --- /dev/null +++ b/content/zh/muagent/overview/multi-agent.md @@ -0,0 +1,136 @@ +--- +title: MuAgent 概览 +slug: MuAgent 概览 +url: "muagent/muagent-概览" +aliases: +- "/muagent/muagent-概览" +- "/muagent/multi-agent-zh" +- "/muagent/muagent-zh" +--- + + +# 简介 + +为了提高大型模型在推理准确性方面的表现,业界出现了多种创新的大型语言模型(LLM)玩法。从最早的CoT、ToT到GoT,这些方法不断拓展了LLM的能力边界。在处理复杂问题时,我们可以通过ReAct过程来选择、调用和执行工具反馈,同时实现多轮工具使用和多步骤执行。 + +但对于更复杂的场景,例如复杂代码的开发,单一功能的LLM Agent显然难以胜任。因此,社区开始发展出多Agent的组合玩法,比如专注于metaGPT、GPT-Engineer、chatDev等开发领域的项目,以及专注于自动化构建Agent和Agent对话的AutoGen项目。 + +经过对这些框架的深入分析,发现大多数的Agent框架整体耦合度较高,其易用性和可扩展性较差。在预设场景中实现特定场景,但想要进行场景扩展却困难重重。 + +因此,我们希望构建一个可扩展、易于使用的Multi-Agent框架,以支持ChatBot在获取知识库信息的同时,能够辅助完成日常办公、数据分析、开发运维等各种通用任务。 + +本项目的Mutli-Agent框架汲取兼容了多个框架的优秀设计,比如metaGPT中的消息池(message pool)、autogen中的代理选择器(agent selector)等。 + +
+ 图片 +
+ + +# MuAgent框架 +在MuAgent中,我们除了定义Agent交互链路和AgentBase基础执行流以外,还额外设计了 Prompt Manager 和 Memory Manager 两个基础组件,分别用于自动化构建Prompt和chat history管理。最终构建出一个可扩展、易于使用的Multi-Agent框架,包括以下内容 +- Agent Base:构建了四种基本的Agent类型BaseAgent、ReactAgent、ExecutorAgent、SelectorAgent,支撑各种场景的基础活动 +- Communication:通过Message和Parse Message 实体完成Agent间的信息传递,并与Memory Manager交互再Memory Pool完成记忆管理 +- Prompt Manager:通过Role Handler、Doc/Tool Handler、Session Handler、Customized Handler,来自动化组装Customized 的Agent Prompt +- Memory Manager: 用于支撑 chat history 的存储管理、信息压缩、记忆检索等管理,最后通过Memory Pool在数据库、本地、向量数据库中完成存储 +- Component:用于构建Agent的辅助生态组件,包括Retrieval、Tool、Action、Sandbox等 +- Customized Model:支持私有化的LLM和Embedding的接入 + + + +## Agent Base +在Agent层面,提供四种基本的Agent类型,对这些Agent进行Role的基础设定,可满足多种通用场景的交互和使用。所有的Action都由Agent执行。 + +1. BaseAgent:提供基础问答、工具使用、代码执行的功能,根据Prompt格式实现 输入 => 输出 + +
+ 图片 +
+ +2. ReactAgent:提供标准React的功能,根据问题实现当前任务 +
+ 图片 +
+ +3. ExecutorAgent:对任务清单进行顺序执行,根据 User 或 上一个Agent编排的计划,完成相关任务 +Agent接收到任务清单(List[Task])后,对清单中的每个 Task 循环执行(中间也可添加 Feedback Agent来进行任务重新优化),直到任务完成 +
+ 图片 +
+ +4. SelectorAgent:提供选择Agent的功能,根据User 或 上一个 Agent的问题选择合适的Agent来进行回答. +
+ 图片 +
+ + +## Communication +为了让Agent之间进行更好的交互,以及能够让每一个Agent接受到足够的信息完成它们特定任务,我们将Message信息体分成了多个部分,System Content、Info Content、LLM Content和LLM Parsed Content等 +- System Content:用于存储管理当前LLM输出的时间,Role信息等 +- Info Content:LLM辅助信息,比如像知识库查询信息、代码库检索信息、工具信息、Agent信息等 +- LLM Content:直接存储和传递LLM 产生的信息 +- LLM Parsed Content:对LLM进行解析转成更易操作的key-value数据结构,方便对LLM内容进行过滤 +- Customized Content:用于管理自定义action产生的key-value数据内容,用于后续自定义Prompt模板的组装构建 + +通过对以上消息格式的定义,我们便可以完成通用消息的传递和管理。具体组装见Prompt Manager模块 + +## Context Manager +### Memory Manager +主要用于 chat history 的管理 +- 存储管理:在数据库或本地实现对chat history进行save和load管理,包括user input、 llm output、observation ouput +- 信息压缩:对 chat history 进行关键信息压缩总结 summary context,比如说单文本概况、侧重不同角度进行文本概况、关键信息提取、多文本概况,作为 Prompt context +- 记忆检索:提供基础检索功能,检索 chat history 或者 Summary Context 中与问题相关信息,辅助问答 +- LLM自动触发:后续定义策略或通过LLM来 触发 压缩总结和检索的功能 + +### Prompt Manager +提问LLM已经成为一种常见的实践,但如何让多个大模型分工并协调好LLM间的规划、调用工具、代码编写能力,来引导它们产生期望的输出,成为了一个关键的问题,其本质就是将业务问题抽象并拆解到可执行的Prompt,那与其说我们是在设计Agents,不如说是对当前需求的深入理解后进行框架设计。 +在LLM介入到实际业务场景(不涉及SFT过程),我们能通过设计Agent Prompt的内容来指定LLM完成相应任务得到相应输出。在MuAgent这个过程中,将这个Prompt分成了三个部分,System Prompt、Context Prompt、Customized Prompt + +- System Prompt 包括 Role Name、Role Description、Task等 +- Context Prompt 包括 Doc Context、Code Context、Tool Context、Agent Context、Session Context等 +- Customized Prompt 则是 自定义的一些 Input 和 Ouput,比如说 ... +我们还可以要求模型输出结构化的文本,比如说tool的json串、*code\ncode_content*等来完成特定工作流。 + +**Automatic Prompt Assemble** +在按照上述结构定义后,我们便可以通过以下方式来完成Prompt的自动化组装,不需要每次去做大量的prompt调整工作 +1. 定义Agent时直接配置 Role Name、Role Description、Task等来决定Agent需要做的事情 +2. 预封装一些可复用的Context Prompt 通用策略,比如说可筛选 Role 的 SessionContext、可配置的Tool、Code Retrieval、Doc Retrieval、Search Retrieval、Agent来完成对应的组装 +3. 由于Agent的Prompt是相对个性化的操作,所以也支持在Prompt Manager 模块内新增新的 key-context 设计,实现个性化的 Agent Prompt。 + + +**Automatic Prompt Design** +能根据role description、task、query等来自动化设计出最优的prompt;待定义... + +**Multi Prompt Design** +根据前面Prompt的定义,我们可以了解到Prompt 由 System Prompt、Context Prompt、Customized Prompt 三个部分组成,三个部分的任一变化都有可能会引起LLM最终输出结果的变化。 +对于同种任务而言,即它们的System Prompt是相同的。那么在不考虑Customiezd Prompt 变化时,就可实现不同上下文的组装差异,比如说Prompt A获取10轮的chat history,而Pormpt B采用5轮的chat history,又或者是对chat history进行信息过滤、信息压缩等。 +待实现... 
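+
+作为对上面「Automatic Prompt Assemble」思路的补充,下面给出一个示意性的 Role 配置片段(类与字段沿用本文 connector 章节中的示例,仅说明 System Prompt 部分如何配置化给出,并非唯一写法):
+```
+from muagent.connector.schema import Role
+from muagent.connector.configs.prompts import REACT_TOOL_PROMPT
+
+# System Prompt 相关内容(角色名称、描述、指令模板)通过 Role 配置化给出;
+# Context Prompt(工具、文档、会话记录等)由 Prompt Manager 按预设策略自动拼装
+tool_role = Role(
+    role_type="assistant",                    # 角色类型
+    role_name="tool_reacter",                 # Role Name
+    role_desc="根据用户问题选择并调用合适的工具",  # Role Description(示意描述)
+    prompt=REACT_TOOL_PROMPT,                 # 预置的 React 工具调用 Prompt 模板
+)
+```
+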
+ +## Component +### Retrieval +在所有Prompt的Context中,除了Chat History的会话信息外,还需要依赖于从外界文档知识库、代码库、互联网搜索得来的相关信息,这些模型参数知识外的知识体系能够极大提升Agent完成复杂任务的能力。 +于是在MuAgent中我们集成了Doc、Internet Search、Code Retrieval三种检索信息的方式,并定义了一个抽象IMRetrieval类,可支持开发者自定义个性化的知识库,来完成Agent的知识库注册。 + +**Doc Retrieval** +文档向量数据库是当前最主流的知识库构建方法,使用Text Embedding 模型对文档进行向量化并在向量数据库中存储。未来我们也会去支持基于知识图谱查询以及通过大模型自动抽取实体和关系的方式,来挖掘数据中多种复杂关系。 + +**Code Retrieval** +LLM在代码生成、修复以及组件理解的任务上,会面临代码训练数据滞后、无法感知代码上下文依赖结构。以及在开发的过程中,对现有代码库和依赖包的理解、检索相关代码、查询元信息等会占用较长的时间。于是我们希望通过代码结构分析和代码检索生成来,以及为LLM提供知识体系外的代码。 + +**Search Retrieval** +除了现成的文档和代码知识库以及之外,在日常中实践中会去浏览大量网页内容获取更多的知识,帮助我们理解新兴的场景、业务、技术等,于是我们接入了duckduckgosearch这款开源的搜索工具,能够为LLM提供知识储备以外的内容。 + +### Tool +随着OpenAI推出了Function Call功能,通过LLM生成指定工具的参数并执行调用,使机器能更好地理解和回应人类的需求,从而解决实际问题和重复性的工作。现如今工具学习能力越来越作为开源模型的标配。那在MuAgent中也支持Agent完成Tool的注册,通过Python注册模板`BaseToolModel`类,编写Tool_name、Tool_description、ToolInputArgs、ToolOutputArgs、run等相关属性和方法即可实现工具的快速接入,同时支持langchain Tool接口的直接使用。 +例如像上述 XXRetrieval 的功能也可以注册为Tool,最终由LLM执行调用。 + +### Action +在MuAgent的定义里,Action是作为LLM具体要执行的动作或动作流,会包括LLM信息处理、知识检索、工具调用以及代码执行等一个综合性的复杂过程,是一个动态过程。比如在React过程中,我们通过LLM获取到了一个Tool参数,接下来"将工具参数放入到Tool并执行调用"这个过程就是Action,它去实践性的调用了Tool。又或者说我们定义了一个Agent,它编排在一个固定Agent的Action步骤之中,这个Agent的输入参数由Action特殊指定。也就是说无论是由LLM产生参数还是工程设定参数,只有涉及具体的执行过程,就是一个Action。 + + +## 模块分类 +- [connector](/muagent/connector-agent-zh) 主要介绍这块Agent框架的工作 +- llm_models +- retrieval +- tools +- sandbox +- utils diff --git a/content/zh/muagent/overview/quick-start.md b/content/zh/muagent/overview/quick-start.md new file mode 100644 index 0000000..4a40899 --- /dev/null +++ b/content/zh/muagent/overview/quick-start.md @@ -0,0 +1,353 @@ +--- +title: 快速开始 +slug: 快速开始 +url: "muagent/快速开始" +aliases: +- "/muagent/快速开始" +- "/muagent/quick-start-zh" +--- + + + +## Quick Start +完整示例见,[examples/muagent_examples](htpps://) +### 首先,准备相关配置信息 +``` +import os, sys + +api_key = "sk-xxx" +api_base_url= "https://api.openai.com/v1" +model_name = "gpt-3.5-turbo" +embed_model = "{{embed_model_name}}" +embed_model_path = "{{embed_model_path}}" +# +os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5://127.0.0.1:13659" +``` + +### 然后,设置LLM配置和Embedding模型配置 +``` +from muagent.base_configs.env_config import JUPYTER_WORK_PATH +from muagent.tools import toLangchainTools, TOOL_DICT, TOOL_SETS +from muagent.llm_models.llm_config import EmbedConfig, LLMConfig +from muagent.connector.phase import BasePhase +from muagent.connector.schema import Message + + +llm_config = LLMConfig( + model_name=model_name, api_key=api_key, api_base_url=api_base_url, temperature=0.3, + stop="**Observation:**" +) + +embed_config = EmbedConfig( + embed_engine="model", embed_model=embed_model, embed_model_path=embed_model_path +) +``` + +### 最后选择一个已有场景进行执行 +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +# 选择一个场景 +phase_name = "baseGroupPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config +) + +# round-1 需要通过代码解释器来完成 +query_content = "确认本地是否存在employee_data.csv,并查看它有哪些列和数据类型;然后画柱状图" +query = Message( + role_name="human", role_type="user", tools=[], input_query=query_content, +) + +# phase.pre_print(query) # 该功能用于预打印 Agents 执行链路的Prompt +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, 
content_key="parsed_output_list")) + +# round-2 需要执行工具 +tools = toLangchainTools([TOOL_DICT[i] for i in TOOL_SETS if i in TOOL_DICT]) + +query_content = "帮我确认下127.0.0.1这个服务器的在10点是否存在异常,请帮我判断一下" +query = Message( + role_name="human", role_type="user", tools=tools, input_query=query_content, +) + +# phase.pre_print(query) # 该功能用于预打印 Agents 执行链路的Prompt +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + +``` +## 场景自定义 +见[如何自定义场景](/muagent/customed-examples-zh) + +## 场景介绍和使用 + +下面是一些具体的场景介绍和使用。 + +也欢迎大家开脑洞构造一些有趣的case。 + +### baseTaskPhase +xAgents的任务拆分及多步骤执行场景 + +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/employee_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "2" + +# +phase_name = "baseTaskPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, +) +# round-1 +query_content = "确认本地是否存在employee_data.csv,并查看它有哪些列和数据类型;然后画柱状图" +query = Message( + role_name="human", role_type="user", input_query=query_content, +) + +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +### codeReactPhase +基于 React 的代码解释器场景 + +``` +# if you want to analyze a data.csv, please put the csv file into a jupyter_work_path (or your defined path) +import shutil +source_file = 'D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/jupyter_work/book_data.csv' +shutil.copy(source_file, JUPYTER_WORK_PATH) + +# then, create a data analyze phase +phase_name = "codeReactPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config +) + +# round-1 +query_content = "确认本地是否存在employee_data.csv,并查看它有哪些列和数据类型;然后画柱状图" +query = Message( + role_name="human", role_type="user", input_query=query_content, + ) + +output_message, output_memory = phase.step(query) + +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + +### codeToolReactPhase +基于 React 模板的工具调用和代码解释器场景 + +``` +TOOL_SETS = [ + "StockName", "StockInfo", + ] +tools = toLangchainTools([TOOL_DICT[i] for i in TOOL_SETS if i in TOOL_DICT]) + +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "2" + +phase_name = "codeToolReactPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config +) + +query_content = "查询贵州茅台的股票代码,并查询截止到当前日期(2023年12月24日)的最近10天的每日时序数据,然后用代码画出折线图并分析" + +query = Message(role_name="human", role_type="user", input_query=query_content, tools=tools) + +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) +``` + + +### docChatPhase +知识库检索问答链路 +- example 1 +``` +# create your knowledge base +from muagent.service.kb_api import create_kb, upload_files2kb +from muagent.utils.server_utils import run_async +from muagent.orm import create_tables + + +# use to test, don't create some directory +create_tables() +# create a knowledge base +kb_name = "example_test" +run_async(create_kb(knowledge_base_name=kb_name, vector_store_type="faiss", embed_config=embed_config, kb_root_path=KB_ROOT_PATH)) +# add doc to knowledge base +file = 
os.path.join("D://project/gitlab/llm/external/ant_code/Codefuse-chatbot/sources/docs/langchain_text_10.jsonl") +files = [file] +upload_files2kb(files, kb_name, embed_config, kb_root_path=KB_ROOT_PATH) + + + +## start to chat with knowledge base +# log-level,print prompt和llm predict +os.environ["log_verbose"] = "0" + +## exmaple 1 +# set chat phase +phase_name = "docChatPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, kb_root_path=KB_ROOT_PATH, +) + +# round-1 +query_content = "langchain有哪些模块" +query = Message( + role_name="human", role_type="user", input_query=query_content, + doc_engine_name=kb_name, score_threshold=1.0, top_k=3 + ) + +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + +# round-2 +query_content = "提示(prompts)有什么用?" +query = Message( + role_name="human", role_type="user", input_query=query_content, + doc_engine_name=kb_name, score_threshold=1.0, top_k=3 + ) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + +``` + +- exmaple 2 +``` +ustomized register demo +from muagent.tools import DocRetrieval +class BaseDocRetrieval(IMRertrieval): + + def __init__(self, knowledge_base_name: str, search_top=5, score_threshold=1.0, embed_config: EmbedConfig=EmbedConfig(), kb_root_path: str=KB_ROOT_PATH): + self.knowledge_base_name = knowledge_base_name + self.search_top = search_top + self.score_threshold = score_threshold + self.embed_config = embed_config + self.kb_root_path = kb_root_path + + def run(self, query: str, search_top=None, score_threshold=None, ): + docs = DocRetrieval.run( + query=query, knowledge_base_name=self.knowledge_base_name, + search_top=search_top or self.search_top, + score_threshold=score_threshold or self.score_threshold, + embed_config=self.embed_config, + kb_root_path=self.kb_root_path + ) + return docs + + +doc_retrieval = BaseDocRetrieval(knowledge_base_name=kb_name, score_threshold=1.0, search_top=3, embed_config=embed_config) + +# set chat phase +phase_name = "docChatPhase" +phase = BasePhase( + phase_name, embed_config=embed_config, llm_config=llm_config, kb_root_path=KB_ROOT_PATH, + doc_retrieval=doc_retrieval +) + +# round-1 +query_content = "langchain有哪些模块" +query = Message( + role_name="human", role_type="user", input_query=query_content, +) +output_message, output_memory = phase.step(query) +print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list")) + +# round-2 +query_content = "提示(prompts)有什么用?" 
+query = Message(
+    role_name="human", role_type="user", input_query=query_content,
+)
+output_message, output_memory = phase.step(query)
+print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list"))
+```
+
+### metagpt_code_devlop
+metagpt的代码构造链路
+
+```python
+# log-level,print prompt和llm predict
+os.environ["log_verbose"] = "2"
+
+phase_name = "metagpt_code_devlop"
+phase = BasePhase(
+    phase_name, embed_config=embed_config, llm_config=llm_config
+)
+
+query_content = "create a snake game"
+query = Message(role_name="human", role_type="user", input_query=query_content)
+
+output_message, output_memory = phase.step(query)
+print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list"))
+```
+
+
+### searchChatPhase
+固定场景链路,先搜索后基于LLM直接回答
+
+```python
+# log-level,print prompt和llm predict
+os.environ["log_verbose"] = "2"
+
+# 当duckduckgo连接不通的时候可以配置这个
+os.environ["DUCKDUCKGO_PROXY"] = os.environ.get("DUCKDUCKGO_PROXY") or "socks5h://127.0.0.1:13659"
+
+phase_name = "searchChatPhase"
+phase = BasePhase(
+    phase_name, embed_config=embed_config, llm_config=llm_config
+)
+
+# round-1
+query_content1 = "美国当前总统是谁?"
+query = Message(
+    role_name="human", role_type="user", input_query=query_content1,
+    search_engine_name="duckduckgo", score_threshold=1.0, top_k=3
+)
+
+output_message, output_memory = phase.step(query)
+print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list"))
+
+# round-2
+query_content2 = "美国上一任总统是谁,两个人有什么关系没?"
+query = Message(
+    role_name="human", role_type="user", input_query=query_content2,
+    search_engine_name="duckduckgo", score_threshold=1.0, top_k=3
+)
+output_message, output_memory = phase.step(query)
+print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list"))
+```
+
+
+### toolReactPhase
+基于 React 模板的工具调用场景
+
+```python
+# log-level,print prompt和llm predict
+os.environ["log_verbose"] = "2"
+
+phase_name = "toolReactPhase"
+phase = BasePhase(
+    phase_name, embed_config=embed_config, llm_config=llm_config
+)
+
+# round-1
+tools = toLangchainTools([TOOL_DICT[i] for i in TOOL_SETS if i in TOOL_DICT])
+query_content = "帮我确认下127.0.0.1这个服务器的在10点是否存在异常,请帮我判断一下"
+query = Message(
+    role_name="human", role_type="user", tools=tools, input_query=query_content,
+)
+
+# phase.pre_print(query)
+output_message, output_memory = phase.step(query)
+print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list"))
+```
\ No newline at end of file
diff --git a/content/zh/muagent/retrieval/custom_retrieval.md b/content/zh/muagent/retrieval/custom_retrieval.md
new file mode 100644
index 0000000..678adf2
--- /dev/null
+++ b/content/zh/muagent/retrieval/custom_retrieval.md
@@ -0,0 +1,103 @@
+---
+title: 自定义 Retrieval 接入
+url: "muagent/custom-retrieval-zh"
+aliases:
+- "/muagent/custom-retrieval-zh"
+---
+
+## 基本介绍
+`Doc Retrieval` 文档向量数据库是当前最主流的知识库构建方法:使用 Text Embedding 模型对文档进行向量化,并存储在向量数据库中。未来我们也会支持基于知识图谱的查询,以及通过大模型自动抽取实体和关系的方式,来挖掘数据中多种复杂关系。
+
+`Code Retrieval` LLM 在代码生成、修复以及组件理解等任务上,会面临代码训练数据滞后、无法感知代码上下文依赖结构等问题;在开发过程中,对现有代码库和依赖包的理解、检索相关代码、查询元信息等也会占用较长时间。于是我们希望通过代码结构分析和代码检索,为 LLM 提供其知识体系之外的代码。
+
+`Search Retrieval` 除了现成的文档和代码知识库之外,日常实践中我们也会浏览大量网页内容来获取更多知识,帮助理解新兴的场景、业务、技术等。于是我们接入了 duckduckgo search 这款开源搜索工具,为 LLM 提供知识储备以外的内容。
+
+## Retrieval 结构
+
+
+```python
+class IMRertrieval:
+
+    def __init__(self,):
+        '''
+        init your personal attributes
+        '''
+        pass
+
+    def run(self, ):
+        '''
+        execute interface, and can use the attributes set in __init__
+        '''
+        pass
+
+class BaseDocRetrieval(IMRertrieval):
+
+    def __init__(self, knowledge_base_name: str, search_top=5, score_threshold=1.0, embed_config: EmbedConfig=EmbedConfig(), kb_root_path: str=KB_ROOT_PATH):
+        self.knowledge_base_name = knowledge_base_name
+        self.search_top = search_top
+        self.score_threshold = score_threshold
+        self.embed_config = embed_config
+        self.kb_root_path = kb_root_path
+
+    def run(self, query: str, search_top=None, score_threshold=None, ):
+        docs = DocRetrieval.run(
+            query=query, knowledge_base_name=self.knowledge_base_name,
+            search_top=search_top or self.search_top,
+            score_threshold=score_threshold or self.score_threshold,
+            embed_config=self.embed_config,
+            kb_root_path=self.kb_root_path
+        )
+        return docs
+```
+
+
+## 使用示例
+```python
+# retrieval your customized register demo
+from muagent.tools import DocRetrieval
+
+class BaseDocRetrieval(IMRertrieval):
+
+    def __init__(self, knowledge_base_name: str, search_top=5, score_threshold=1.0, embed_config: EmbedConfig=EmbedConfig(), kb_root_path: str=KB_ROOT_PATH):
+        self.knowledge_base_name = knowledge_base_name
+        self.search_top = search_top
+        self.score_threshold = score_threshold
+        self.embed_config = embed_config
+        self.kb_root_path = kb_root_path
+
+    def run(self, query: str, search_top=None, score_threshold=None, ):
+        docs = DocRetrieval.run(
+            query=query, knowledge_base_name=self.knowledge_base_name,
+            search_top=search_top or self.search_top,
+            score_threshold=score_threshold or self.score_threshold,
+            embed_config=self.embed_config,
+            kb_root_path=self.kb_root_path
+        )
+        return docs
+
+
+doc_retrieval = BaseDocRetrieval(knowledge_base_name=kb_name, score_threshold=1.0, search_top=3, embed_config=embed_config)
+
+# set chat phase
+phase_name = "docChatPhase"
+phase = BasePhase(
+    phase_name, embed_config=embed_config, llm_config=llm_config, kb_root_path=KB_ROOT_PATH,
+    doc_retrieval=doc_retrieval
+)
+
+# round-1
+query_content = "langchain有哪些模块"
+query = Message(
+    role_name="human", role_type="user", input_query=query_content,
+)
+output_message, output_memory = phase.step(query)
+print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list"))
+
+# round-2
+query_content = "提示(prompts)有什么用?"
+query = Message(
+    role_name="human", role_type="user", input_query=query_content,
+)
+output_message, output_memory = phase.step(query)
+print(output_memory.to_str_messages(return_all=True, content_key="parsed_output_list"))
+
+```
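+
+除了 Doc Retrieval,也可以按同样的 IMRertrieval 接口实现一个简单的 Search Retrieval。下面是一个示意(非官方实现):假设本地已安装开源的 duckduckgo_search 包(`pip install duckduckgo-search`),返回格式以及与 phase 的对接方式请以实际接口为准。
+
+```python
+# 示意:基于 duckduckgo_search 实现一个 Search Retrieval(仅为草图,非 MuAgent 官方实现)
+from duckduckgo_search import DDGS
+
+
+class BaseSearchRetrieval(IMRertrieval):
+
+    def __init__(self, search_top=5, region: str = "wt-wt"):
+        self.search_top = search_top
+        self.region = region
+
+    def run(self, query: str, search_top=None):
+        # 调用 duckduckgo 搜索,并把标题、摘要、链接整理成 LLM 可用的片段
+        with DDGS() as ddgs:
+            results = list(ddgs.text(query, region=self.region, max_results=search_top or self.search_top))
+        return [
+            {"index": i, "snippet": f"{r.get('title', '')}: {r.get('body', '')}", "link": r.get("href", "")}
+            for i, r in enumerate(results)
+        ]
+
+
+search_retrieval = BaseSearchRetrieval(search_top=3)
+print(search_retrieval.run("什么是 LLM Agent"))
+```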
\ No newline at end of file
diff --git a/content/zh/muagent/tools/custom_tool.md b/content/zh/muagent/tools/custom_tool.md
new file mode 100644
index 0000000..7b9be52
--- /dev/null
+++ b/content/zh/muagent/tools/custom_tool.md
@@ -0,0 +1,125 @@
+---
+title: 自定义 Tool 接入
+url: "muagent/custom-tool-zh"
+aliases:
+- "/muagent/custom-tool-zh"
+---
+
+## 基本介绍
+在 MuAgent 中也支持为 Agent 注册 Tool:通过 Python 注册模板 BaseToolModel 类,编写
+- name
+- description
+- ToolInputArgs
+- ToolOutputArgs
+- run
+
+等相关属性和方法,即可实现工具的快速接入;同时也支持直接使用 langchain 的 Tool 接口。例如,像上述 XXRetrieval 的功能也可以注册为 Tool,最终由 LLM 执行调用。
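+
+下面给出一个把前文《自定义 Retrieval 接入》中的 DocRetrieval 包装成 Tool 的简单示意(仅为草图,按下文的 BaseToolModel 模板编写;导入路径与参数请以实际源码为准):
+
+```python
+# 示意:将 DocRetrieval 注册为 Tool(假设 BaseToolModel、DocRetrieval 均可从 muagent.tools 导入)
+from pydantic import BaseModel, Field
+from muagent.tools import BaseToolModel, DocRetrieval
+
+
+class KBQueryTool(BaseToolModel):
+    name = "KBQueryTool"
+    description = "useful for when you need to search the local knowledge base for related document chunks."
+
+    class ToolInputArgs(BaseModel):
+        query: str = Field(..., description="the question to search in the knowledge base")
+
+    class ToolOutputArgs(BaseModel):
+        docs: list = Field(default=None, description="related document chunks")
+
+    @classmethod
+    def run(cls, query: str):
+        # kb_name、embed_config、KB_ROOT_PATH 假设沿用《自定义 Retrieval 接入》示例中的定义
+        return DocRetrieval.run(
+            query=query, knowledge_base_name=kb_name,
+            search_top=3, score_threshold=1.0,
+            embed_config=embed_config, kb_root_path=KB_ROOT_PATH,
+        )
+```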
+
+## BaseTool 结构
+
+```python
+from langchain.agents import Tool
+from pydantic import BaseModel, Field
+from typing import List, Dict
+import json
+
+
+class BaseToolModel:
+    name = "BaseToolModel"
+    description = "Tool Description"
+
+    class ToolInputArgs(BaseModel):
+        """
+        Input for BaseToolModel.
+        Tips:
+        a `default` value makes a field optional, e.g. key1 is not required while key2 is required
+        """
+
+        key1: str = Field(default=None, description="hello world!")
+        key2: str = Field(..., description="hello world!!")
+
+    class ToolOutputArgs(BaseModel):
+        """
+        Output for BaseToolModel.
+        Tips:
+        a `default` value makes a field optional, e.g. key1 is not required while key2 is required
+        """
+
+        key1: str = Field(default=None, description="hello world!")
+        key2: str = Field(..., description="hello world!!")
+
+    @classmethod
+    def run(cls, tool_input_args: ToolInputArgs) -> ToolOutputArgs:
+        """execute your tool!"""
+        pass
+```
+
+
+## 注册示例
+
+```python
+from pydantic import BaseModel, Field
+from typing import List, Dict
+import requests
+from loguru import logger
+
+from .base_tool import BaseToolModel
+
+class Multiplier(BaseToolModel):
+    """
+    Tips:
+    a `default` value makes a field optional, e.g. key1 is not required while key2 is required
+    """
+
+    name: str = "Multiplier"
+    description: str = """useful for when you need to multiply two numbers together. \
+        The input to this tool should be a comma separated list of numbers of length two, representing the two numbers you want to multiply together. \
+        For example, `1,2` would be the input if you wanted to multiply 1 by 2."""
+
+    class ToolInputArgs(BaseModel):
+        """Input for Multiplier."""
+
+        # key: str = Field(..., description="用户在高德地图官网申请web服务API类型KEY")
+        a: int = Field(..., description="num a")
+        b: int = Field(..., description="num b")
+
+    class ToolOutputArgs(BaseModel):
+        """Output for Multiplier."""
+
+        res: int = Field(..., description="the result of two nums")
+
+    @staticmethod
+    def run(a, b):
+        return a * b
+```
+
+
+## 使用示例
+```python
+from langchain.tools import StructuredTool
+from muagent.tools import (
+    WeatherInfo, Multiplier, toLangchainTools,
+    TOOL_DICT, TOOL_SETS
+)
+
+# 函数执行
+tools = [
+    StructuredTool(
+        name=Multiplier.name,
+        func=Multiplier.run,
+        description=Multiplier.description,
+        args_schema=Multiplier.ToolInputArgs,
+    ),
+    StructuredTool(
+        name=WeatherInfo.name,
+        func=WeatherInfo.run,
+        description=WeatherInfo.description,
+        args_schema=WeatherInfo.ToolInputArgs,
+    )
+]
+
+tools = toLangchainTools([TOOL_DICT["Multiplier"]])
+
+# tool run 测试
+print(tools[0].func(1, 2))
+```
\ No newline at end of file
diff --git a/data/en/docs/sidebar.yml b/data/en/docs/sidebar.yml
index d0b164a..d91c288 100644
--- a/data/en/docs/sidebar.yml
+++ b/data/en/docs/sidebar.yml
@@ -4,19 +4,84 @@
 - title: 📖 CodeFuse-AI Module
   pages:
+  - title: CodeFuse-Query
+  - title: MFTCoder
+  - title: CodeFuse-MFT-VLM
+  - title: Test-Agent
+  - title: CodeFuse-ModelCache
   - title: CodeFuse-ChatBot
+  # rename: quickstart # 用于title重命名,title作为跳转地址
   - title: CodeFuse-DevOps-Eval
   - title: CodeFuse-DevOps-Model
-  - title: MFTCoder
-  - title: CodeFuse-ModelCache
-  # - title: FasterTransformer4CodeFuse
-  - title: Test-Agent
-  - title: CodeFuse-Query
+  - title: CodeFuse-Evalution
+
+- title: CodeFuse-Query
+  pages:
+  - title: codefuse-query-introduction
+    rename: Introduction
+  - title: codefuse-query-quickstart
+    rename: QuickStart
+  - title: codefuse-query-GodelLanguage
+    rename: GodelLanguage
+  - title: codefuse-query-toolchain
+    rename: Toolchain
+  - title: codefuse-query-usercase
+    rename: UserCase
+
+- title: MFTCoder
+  pages:
+  - title: mftcoder-introduction
+    rename: Introduction
+  - title: mftcoder-quickstart
+    rename: QuickStart
+  - title: mftcoder-accelerate
+    rename: Accelerate + DeepSpeed/FSDP Framework
+  - title: mftcoder-atorch
+    rename: Atorch Framework
+
+- title: CodeFuse-MFT-VLM
+  pages:
+  - title: codefuse-mft-vlm-quickstart
rename: QuickStart + +- title: 🌱 Test Agent + pages: + - title: test-agent-quickstart + rename: QuickStart + +- title: 🌱 CodeFuse-ModelCache + pages: + - title: CodeFuse-ModelCache-quickstart + rename: QuickStart + - title: CodeFuse-ModelCache-feature + rename: Feature + - title: CodeFuse-ModelCache-config + rename: Config + - title: CodeFuse-ModelCache-release + rename: Release Note - title: 🌱 CodeFuse-ChatBot pages: - - title: QuickStart + - title: CodeFuse-ChatBot-QuickStart + rename: QuickStart - title: Start-Detail - title: LLM-Configuration - title: ChatBot-RoadMap - \ No newline at end of file + +- title: 🌱 CodeFuse-DevOps-Model + pages: + - title: CodeFuse-DevOps-Model-Train + rename: TrainDetail + - title: CodeFuse-DevOps-Model-QuickStart + rename: QuickStart + +- title: 🌱 CodeFuse-DevOps-Eval + pages: + - title: Data + - title: CodeFuse-DevOps-Eval-QuickStart + rename: QuickStart + +- title: 🌱 CodeFuse-evalution + pages: + - title: CodeFuse-evalution-quickstart + rename: QuickStart diff --git a/data/en/muagent/sidebar.yml b/data/en/muagent/sidebar.yml new file mode 100644 index 0000000..4d9819c --- /dev/null +++ b/data/en/muagent/sidebar.yml @@ -0,0 +1,33 @@ +- title: ❤️ MuAgent + pages: + - title: MuAgent Overview + - title: Agent Flow + - title: Quick Start + + +- title: Modules + pages: + - title: Connector Agent + rename: Agent Builder + - title: Connector Chain + rename: Chain Builder + - title: Connector Phase + rename: Phase Builder + - title: Connector Prompt + rename: Create Prompt + - title: Connector Memory + rename: Memory Builder + - title: Custom Examples + +- title: llm_models + pages: + - title: LLM Model Config + - title: Embedding Model Config + +- title: Tools + pages: + - title: Custom tool + +- title: Retrieval + pages: + - title: Custom retrieval \ No newline at end of file diff --git a/data/zh/docs/sidebar.yml b/data/zh/docs/sidebar.yml index de69ec0..6882b52 100644 --- a/data/zh/docs/sidebar.yml +++ b/data/zh/docs/sidebar.yml @@ -4,18 +4,92 @@ - title: 📖 CodeFuse-AI 模块 pages: + - title: CodeFuse-Query-zh + rename: CodeFuse-Query + - title: MFTCoder-zh + rename: MFTCoder + - title: CodeFuse-MFT-VLM-zh + rename: CodeFuse-MFT-VLM + - title: Test-Agent-zh + rename: Test-Agent + - title: CodeFuse-ModelCache-zh + rename: CodeFuse-ModelCache - title: CodeFuse-ChatBot-zh + rename: CodeFuse-ChatBot - title: CodeFuse-DevOps-Eval-zh + rename: CodeFuse-DevOps-Eval - title: CodeFuse-DevOps-Model-zh - - title: MFTCoder-zh - - title: CodeFuse-ModelCache-zh - # - title: FasterTransformer4CodeFuse-zh - - title: Test-Agent-zh - - title: CodeFuse-Query-zh + rename: CodeFuse-DevOps-Model + - title: CodeFuse-evalution-zh + rename: CodeFuse-Evalution + +- title: CodeFuse-Query + pages: + - title: CodeFuse-Query-introduction-zh + rename: 基本介绍 + - title: CodeFuse-Query-quickstart-zh + rename: 快速开始 + - title: codefuse-query-GodelLanguage-zh + rename: 查询语言介绍 + - title: codefuse-query-toolchain-zh + rename: VSCode插件 + - title: codefuse-query-usercase-zh + rename: 用户案例 + +- title: MFTCoder + pages: + - title: mftcoder-introduction-zh + rename: 基本介绍 + - title: mftcoder-quickstart-zh + rename: 快速使用 + - title: mftcoder-accelerate-zh + rename: Accelerate + DeepSpeed/FSDP 框架篇 + - title: mftcoder-atorch-zh + rename: Atorch框架篇 + +- title: CodeFuse-MFT-VLM + pages: + - title: codefuse-mft-vlm-quickstart-zh + rename: 快速使用 + +- title: 🌱 Test Agent + pages: + - title: test-agent-quickstart-zh + rename: 快速开始 + +- title: 🌱 CodeFuse-ModelCache + pages: + - title: CodeFuse-ModelCache-quickstart-zh 
+ rename: 快速开始 + - title: CodeFuse-ModelCache-feature-zh + rename: 功能特性 + - title: CodeFuse-ModelCache-config-zh + rename: 最佳配置 + - title: CodeFuse-ModelCache-release-zh + rename: 版本记录 - title: 🌱 CodeFuse-ChatBot pages: - - title: 快速开始 + - title: CodeFuse-ChatBot-quickstart-zh + rename: 快速开始 - title: 启动明细 - title: 本地私有化&大模型接口接入 - title: ChatBot 技术路线 + +- title: 🌱 CodeFuse-DevOps-Model + pages: + - title: CodeFuse-DevOps-Model-Train-zh + rename: 训练解析 + - title: CodeFuse-DevOps-Model-QuickStart-zh + rename: 快速使用 + +- title: 🌱 CodeFuse-DevOps-Eval + pages: + - title: 数据介绍 + - title: CodeFuse-DevOps-Eval-quickstart-zh + rename: 快速开始 + +- title: 🌱 CodeFuse-evalution + pages: + - title: CodeFuse-evalution-quickstart-zh + rename: 快速开始 \ No newline at end of file diff --git a/data/zh/muagent/sidebar.yml b/data/zh/muagent/sidebar.yml new file mode 100644 index 0000000..25a49c4 --- /dev/null +++ b/data/zh/muagent/sidebar.yml @@ -0,0 +1,40 @@ + +- title: ❤️ MuAgent + pages: + - title: MuAgent 概览 + rename: MuAgent + - title: Agent 编排 + rename: Agents Flow + - title: 快速开始 + +- title: Connector + pages: + - title: Connector Agent ZH + rename: Agent构建 + - title: Connector Chain ZH + rename: Chain构建 + - title: Connector Phase ZH + rename: Phase构建 + - title: Connector Prompt ZH + rename: Prompt编写 + - title: Connector Memory ZH + rename: Memory构建 + - title: Custom Examples ZH + rename: 自定义示例 + +- title: llm_models + pages: + - title: LLM Model Config ZH + rename: LLM 配置 + - title: Embedding Model Config ZH + rename: Embedding 配置 + +- title: Tools + pages: + - title: Custom tool zh + rename: 自定义 Tool + +- title: Retrieval + pages: + - title: Custom retrieval zh + rename: 自定义 Retrieval \ No newline at end of file diff --git a/docs/coagent/agent-flow/index.html b/docs/coagent/agent-flow/index.html index ea7794a..3ea8987 100644 --- a/docs/coagent/agent-flow/index.html +++ b/docs/coagent/agent-flow/index.html @@ -50,6 +50,7 @@ +