From d8751713b18fba6b6a039a02367f062e70c85a1e Mon Sep 17 00:00:00 2001 From: hanfeng Date: Mon, 11 Sep 2023 21:03:16 +0800 Subject: [PATCH 01/23] update README.md --- README.md | 5 ++ README_ENG.md | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 149 insertions(+) create mode 100644 README_ENG.md diff --git a/README.md b/README.md index 62b9cf75..30a17f6b 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,10 @@ Bisheng banner +

+ 简体中文 | + English | +

+
diff --git a/README_ENG.md b/README_ENG.md new file mode 100644 index 00000000..f5852749 --- /dev/null +++ b/README_ENG.md @@ -0,0 +1,144 @@ +Bisheng banner + +

+ 简体中文 | + English | +

+ + +
+ + + + + +
+
+# Welcome to Bisheng
+
+## What is Bisheng
+
+Bisheng is a leading open-source platform for developing LLM applications. It empowers and accelerates the development of LLM applications and helps users enter the next generation of application development with the best possible experience.
+
+Bisheng is named after Bi Sheng, the inventor of movable type printing, which played a huge role in the dissemination of human knowledge. We hope that Bisheng can likewise provide strong support for the widespread adoption of intelligent applications, and we welcome everyone to participate.
+
+Bisheng was released under the Apache 2.0 License at the end of August 2023.
+
+
+## Key Features
+
+- Convenience: Even business users can quickly build intelligent applications centered around an LLM by filling in simple, intuitive forms based on our pre-configured application templates.
+- Flexibility: For users familiar with LLM technologies, we provide hundreds of development components that follow the latest trends in the LLM ecosystem. With visual and flexible process orchestration, any type of LLM application can be developed, not just simple prompting projects.
+- Reliability and enterprise readiness: Many similar open-source projects are suitable only for experimental scenarios and lack the enterprise-grade features required for real production use: high availability under high concurrency, continuous iteration and optimization of application operations and effects, and practical functions that fit real business scenarios. These are the differentiating capabilities of the Bisheng platform. In addition, data quality within enterprises is uneven; truly leveraging all data requires comprehensive unstructured data governance, which is the core capability our team has accumulated over the past few years. In Bisheng's demo environment, you can access these capabilities directly through the related components, free and without limits.
+
+
+## Product Applications
+
+With the Bisheng platform, we can build a wide variety of LLM applications:
+
+Analysis Report Generation:
+
+- 📃 Contract Review Report Generation
+- 🏦 Credit Investigation Report Generation
+- 📈 IPO Analysis Report Generation
+- 💼 Intelligent Investment Advisory Report Generation
+- 👀 Document Summary Generation
+
+
+Knowledge Base Q&A:
+
+- 👩‍💻 User Manual Q&A
+- 👩‍🔬 Research Report Knowledge Base Q&A
+- 🗄 Regulations and Rules Q&A
+- 💊 "Chinese Pharmacopoeia" Knowledge Q&A
+- 📊 Stock Price Database Q&A
+
+
+Dialogues:
+
+- 🎭 Role-Play as an Interviewer
+- 📍 Xiaohongshu (Red Book) Copywriting Assistant
+- 👩‍🎤 Role-Play as a Foreign Language Teacher
+- 👨‍🏫 Resume Optimization Assistant
+
+
+Element Extraction:
+
+- 📄 Key Element Extraction from Contracts
+- 🏗️ Engineering Report Element Extraction
+- 🗂️ General Metadata Extraction
+- 🎫 Key Element Extraction from Cards and Bills
+
+
+For how to build these applications, see [Application Cases](https://m7a7tqsztt.feishu.cn/wiki/ZfkmwLPfeiAhQSkK2WvcX87unxc).
+
+We believe that in real enterprise scenarios, "dialogue" is just one of many interaction forms.
+In the future, we will also add support for more application forms such as process automation and search.
+
+
+## Quick Start
+
+### Start with Bisheng
+
+- [Install Bisheng](https://m7a7tqsztt.feishu.cn/wiki/BSCcwKd4Yiot3IkOEC8cxGW7nPc)
+
+
+### Compile Bisheng from Source
+
+TODO: update later.
+
+For more content, please read the [Dev Documents](https://m7a7tqsztt.feishu.cn/wiki/ITmJwMXVliBnzpkW3nkcqPVrnse).
+
+
+## Contributing
+
+Contributions to Bisheng are welcome from everyone. See the [Guidelines for Contributing](https://github.com/dataelement/bisheng/blob/main/CONTRIBUTING.md)
+for details on submitting patches and the contribution workflow.
+Refer to the [community repository](https://github.com/dataelement/community) to learn about our governance and access more community resources.
+
+
+
+
+
+## Bisheng Documentation
+
+For more guides on installation, development, deployment, and management, please see the [Bisheng Documentation](https://m7a7tqsztt.feishu.cn/wiki/ZxW6wZyAJicX4WkG0NqcWsbynde).
+
+
+## Community
+
+- You're welcome to join our [Slack](https://www.dataelem.com/) channel to share your suggestions and issues.
+- You can also visit the [FAQ](https://m7a7tqsztt.feishu.cn/wiki/XdGCwkDJviC0Z8klbdbcF790n9b) page for frequently asked questions and their answers.
+- You can also join the [Discussion Group](https://github.com/dataelement/bisheng/discussions) to ask questions and join discussions.
+
+
+
+
+Follow Bisheng on social media:
+
+
+- Bisheng Technical Exchange WeChat Group
+
+Wechat QR Code
+
+## Join Us
+
+DataElem Inc. is the company behind the Bisheng project. We are [hiring](https://www.dataelem.com/contact/team) algorithm developers, software developers, and full-stack engineers.
+Join us as we work together to build the next generation of intelligent application development platforms.
+
+
+## Acknowledgments
+
+Bisheng depends on the following open-source projects:
+
+- Thanks to the open-source model inference framework [Triton](https://github.com/triton-inference-server).
+- Thanks to the open-source LLM application development library [langchain](https://github.com/langchain-ai/langchain).
+- Thanks to the open-source unstructured data parsing engine [unstructured](https://github.com/Unstructured-IO/unstructured).
+- Thanks to the open-source langchain visualization tool [langflow](https://github.com/logspace-ai/langflow).

From 4aa47f8103c2103a3b8b6f88661a149784adf7f8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E5=A7=9A=E5=8A=B2?=
Date: Fri, 24 Nov 2023 16:45:52 +0800
Subject: [PATCH 02/23] Update README.md

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d626620a..0f65af37 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,9 @@

简体中文 | - English | + English +

+

license docker-pull-count From 4ae06c4d61b00b69136702e8e7bd8dc57ea1710b Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Mon, 4 Dec 2023 21:51:28 +0800 Subject: [PATCH 03/23] update huatai --- .../autoplanning/set_openai_param.sh | 2 + .../experimental/contract/contract_extract.py | 65 +++++++++ .../contract/convert_image_pdf.py | 44 ++++++ .../experimental/contract/ellm_extract.py | 114 +++++++++++++++ .../experimental/contract/llm_extract.py | 138 ++++++++++++++++++ 5 files changed, 363 insertions(+) create mode 100644 src/bisheng-langchain/experimental/autoplanning/set_openai_param.sh create mode 100644 src/bisheng-langchain/experimental/contract/contract_extract.py create mode 100644 src/bisheng-langchain/experimental/contract/convert_image_pdf.py create mode 100644 src/bisheng-langchain/experimental/contract/ellm_extract.py create mode 100644 src/bisheng-langchain/experimental/contract/llm_extract.py diff --git a/src/bisheng-langchain/experimental/autoplanning/set_openai_param.sh b/src/bisheng-langchain/experimental/autoplanning/set_openai_param.sh new file mode 100644 index 00000000..c90632f8 --- /dev/null +++ b/src/bisheng-langchain/experimental/autoplanning/set_openai_param.sh @@ -0,0 +1,2 @@ +export OPENAI_API_KEY='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' +export OPENAI_PROXY='http://118.195.232.223:39995' \ No newline at end of file diff --git a/src/bisheng-langchain/experimental/contract/contract_extract.py b/src/bisheng-langchain/experimental/contract/contract_extract.py new file mode 100644 index 00000000..bf54bef5 --- /dev/null +++ b/src/bisheng-langchain/experimental/contract/contract_extract.py @@ -0,0 +1,65 @@ +import os +import json +import logging +from tqdm import tqdm +from ellm_extract import EllmExtract +from llm_extract import LlmExtract + +logging.getLogger().setLevel(logging.INFO) + + +class ContractExtract(object): + def __init__(self, + ellm_api_base_url: str = 'http://192.168.106.20:3502/v2/idp/idp_app/infer', + 
llm_model_name: str = 'Qwen-14B-Chat', + llm_model_api_url: str = 'https://bisheng.dataelem.com/api/v1/models/{}/infer', + unstructured_api_url: str = "https://bisheng.dataelem.com/api/v1/etl4llm/predict", + replace_ellm_cache: bool = False, + replace_llm_cache: bool = True, + ): + self.ellm_client = EllmExtract(api_base_url=ellm_api_base_url) + self.llm_client = LlmExtract(model_name=llm_model_name, + model_api_url=llm_model_api_url, + unstructured_api_url=unstructured_api_url) + self.replace_ellm_cache = replace_ellm_cache + self.replace_llm_cache = replace_llm_cache + + def predict_one_pdf(self, pdf_path, schema, save_folder): + pdf_name_prefix = os.path.splitext(os.path.basename(pdf_path))[0] + save_ellm_path = os.path.join(save_folder, pdf_name_prefix + '_ellm.json') + save_llm_path = os.path.join(save_folder, pdf_name_prefix + '_llm.json') + + if self.replace_ellm_cache or not os.path.exists(save_ellm_path): + ellm_kv_results = self.ellm_client.predict(pdf_path, schema) + with open(save_ellm_path, 'w') as f: + json.dump(ellm_kv_results, f, indent=2, ensure_ascii=False) + else: + with open(save_ellm_path, 'r') as f: + ellm_kv_results = json.load(f) + + if self.replace_llm_cache or not os.path.exists(save_llm_path): + llm_kv_results = self.llm_client.predict(pdf_path, schema) + with open(save_llm_path, 'w') as f: + json.dump(llm_kv_results, f, indent=2, ensure_ascii=False) + else: + with open(save_llm_path, 'r') as f: + llm_kv_results = json.load(f) + + return ellm_kv_results, llm_kv_results + + def predict_all_pdf(self, pdf_folder, schema, save_folder): + if not os.path.exists(save_folder): + os.makedirs(save_folder) + pdf_names = os.listdir(pdf_folder) + for pdf_name in tqdm(pdf_names): + pdf_path = os.path.join(pdf_folder, pdf_name) + ellm_kv_results, llm_kv_results = self.predict_one_pdf(pdf_path, schema, save_folder) + + +if __name__ == '__main__': + client = ContractExtract() + pdf_folder = '/home/gulixin/workspace/datasets/huatai/重大商务合同(汇总)_pdf' + # 
schema = '合同标题|合同编号|借款人|贷款人|借款金额' + schema = '买方|卖方|合同期限|结算条款|售后条款|金额总金额' + save_folder = '/home/gulixin/workspace/datasets/huatai/重大商务合同(汇总)_pdf_res' + client.predict_all_pdf(pdf_folder, schema, save_folder) diff --git a/src/bisheng-langchain/experimental/contract/convert_image_pdf.py b/src/bisheng-langchain/experimental/contract/convert_image_pdf.py new file mode 100644 index 00000000..a5f640dd --- /dev/null +++ b/src/bisheng-langchain/experimental/contract/convert_image_pdf.py @@ -0,0 +1,44 @@ +import os +import glob +import fitz +import time +from tqdm import tqdm +from PIL import Image + + +def pic2pdf(img_dir, pdfFile): + if os.path.exists(pdfFile): + return + doc = fitz.open() + for img in sorted(glob.glob("{}/*".format(img_dir))): + if os.path.isdir(img): + continue + imgdoc = fitz.open(img) + pdfbytes = imgdoc.convert_to_pdf() + imgpdf = fitz.open("pdf", pdfbytes) + doc.insert_pdf(imgpdf) + if os.path.exists(pdfFile): + os.remove(pdfFile) + doc.save(pdfFile) + doc.close() + + +if __name__ == "__main__": + img_dirs = ['保密条款/L1', '采购合同/原材料采购合同/W1', '采购合同/外协配套件采购合同/J18', '采购合同/外协件采购合同/JYT-1', + '采购合同/生产采购一般条款/JYT-1', '采购合同/汽车服务备件采购协议/J7', '价格协议/J3', '价格协议/J4', '价格协议/J10', + '供货合同/W26', '框架协议/产品开发协议/J5', '框架协议/物料采购框架协议/J6', '框架协议/销售协议/J44', '框架协议/主供货协议/J28', + '买卖合同/J23', '买卖合同/J30', '项目定点合同/采购目标协议书/J22', '项目定点合同/采购目标协议书/J42', '项目定点合同/采购目标协议书/J44', + '施工合同/W22', '施工合同/W23', '施工合同/W24', '销售合同/W25', '销售价格协议/W18', '最高额保证合同/W27', '最高额保证合同/W28', + '最高额抵押合同/W29', '最高额抵押合同/W30', '最高额抵押合同/W31', '价格单/L-1', '价格单/L-2', '价格单/L-3'] + + base_folder = '/Users/gulixin/Documents/数据项素/工作/项目支持/华泰合同/重大商务合同(汇总)' + save_folder = '/Users/gulixin/Documents/数据项素/工作/项目支持/华泰合同/重大商务合同(汇总)_pdf' + if not os.path.exists(save_folder): + os.makedirs(save_folder) + for img_dir in tqdm(img_dirs): + img_dir_abs_path = os.path.join(base_folder, img_dir) + save_abs_path = os.path.join(save_folder, '_'.join(img_dir.split('/')) + '.pdf') + + start_time = time.time() + pic2pdf(img_dir_abs_path, 
save_abs_path) + print('time:', time.time() - start_time) \ No newline at end of file diff --git a/src/bisheng-langchain/experimental/contract/ellm_extract.py b/src/bisheng-langchain/experimental/contract/ellm_extract.py new file mode 100644 index 00000000..bd33b1a9 --- /dev/null +++ b/src/bisheng-langchain/experimental/contract/ellm_extract.py @@ -0,0 +1,114 @@ +# import base64 +import copy +import base64 +import requests +import fitz +import numpy as np +import cv2 +from collections import defaultdict +from PIL import Image +from typing import Any, Iterator, List, Mapping, Optional, Union + + +def convert_base64(image): + image_binary = cv2.imencode('.jpg', image)[1].tobytes() + x = base64.b64encode(image_binary) + return x.decode('ascii').replace('\n', '') + + +def transpdf2png(pdf_file): + pdf_bytes = open(pdf_file, 'rb').read() + pdf = fitz.Document(stream=pdf_bytes, filetype='pdf') + dpis = [72, 144, 200] + + pdf_images = dict() + for page in pdf: + pix = None + for dpi in dpis: + pix = page.get_pixmap(dpi=dpi) + if min(pix.width, pix.height) >= 2560: break + + mode = "RGBA" if pix.alpha else "RGB" + img = Image.frombytes(mode, [pix.width, pix.height], pix.samples) + # RGB to BGR + img = np.array(img)[:, :, ::-1] + img_name = "page_{:03d}".format(page.number) + pdf_images[img_name] = img + + return pdf_images + + +class EllmExtract(object): + def __init__(self, api_base_url: Optional[str] = None): + self.ep = api_base_url + self.client = requests.Session() + self.timeout = 10000 + self.params = { + 'sort_filter_boxes': True, + 'enable_huarong_box_adjust': True, + 'support_long_image_segment': True, + # 'checkbox': ['std_checkbox'], + 'rotateupright': True + } + + self.scene_mapping = { + 'doc': { + 'det': 'general_text_det_v2', + 'recog': 'transformer-v2.8-gamma-faster', + 'ellm': 'ELLM' + } + } + + def predict_single_img(self, inp): + """ + single image + """ + scene = inp.pop('scene', 'doc') + b64_image = inp.pop('b64_image') + ellm_schema = inp.pop('keys') 
+ params = copy.deepcopy(self.params) + params.update(self.scene_mapping[scene]) + params.update({'ellm_schema': ellm_schema}) + + req_data = {'data': [b64_image], 'param': params} + + try: + r = self.client.post(url=self.ep, + json=req_data, + timeout=self.timeout) + return r.json() + except Exception as e: + return {'status_code': 400, 'status_message': str(e)} + + def predict(self, pdf_path, schema): + """ + pdf + """ + pdf_images = transpdf2png(pdf_path) + kv_results = defaultdict(list) + for pdf_name in pdf_images: + page = int(pdf_name.split('page_')[-1]) + + b64data = convert_base64(pdf_images[pdf_name]) + payload = {'b64_image': b64data, 'keys': schema} + resp = self.predict_single_img(payload) + + if 'code' in resp and resp['code'] == 200: + key_values = resp['result']['ellm_result'] + else: + raise ValueError(f"ellm kv extract failed: {resp}") + + for key, value in key_values.items(): + text_info = [{'value': text, 'page': int(page)} for text in value['text']] + kv_results[key].extend(text_info) + + return kv_results + + +if __name__ == '__main__': + ellm_client = EllmExtract(api_base_url='http://192.168.106.20:3502/v2/idp/idp_app/infer') + pdf_path = '/home/gulixin/workspace/datasets/huatai/流动资金借款合同_pdf/流动资金借款合同1.pdf' + schema = '合同标题|合同编号|借款人|贷款人|借款金额' + ellm_client.predict(pdf_path, schema) + + diff --git a/src/bisheng-langchain/experimental/contract/llm_extract.py b/src/bisheng-langchain/experimental/contract/llm_extract.py new file mode 100644 index 00000000..140e3e56 --- /dev/null +++ b/src/bisheng-langchain/experimental/contract/llm_extract.py @@ -0,0 +1,138 @@ +import os +import copy +import requests +import json +import time +import logging +import re +from collections import defaultdict +from langchain.prompts import PromptTemplate +from bisheng_langchain.document_loaders import ElemUnstructuredLoader +from bisheng_langchain.text_splitter import ElemCharacterTextSplitter + + +logging.getLogger().setLevel(logging.INFO) + + +DEFAULT_PROMPT = 
PromptTemplate( + input_variables=["context", "keywords"], + template="""现在你需要帮我完成信息抽取的任务,你需要帮我抽取出原文中相关字段信息,如果没找到对应的值,则设为空,并按照JSON的格式输出. + +原文内容: +{context} + +提取上述文本中以下字段信息:{keywords},并按照json的格式输出,如果没找到对应的值,则设为空。 +""" +) + + +class LlmExtract(object): + def __init__(self, + model_name: str = 'Qwen-14B-Chat', + model_api_url: str = 'https://bisheng.dataelem.com/api/v1/models/{}/infer', + unstructured_api_url: str = "https://bisheng.dataelem.com/api/v1/etl4llm/predict", + ): + self.model_name = model_name + self.model_api_url = model_api_url.format(model_name) + self.unstructured_api_url = unstructured_api_url + + def call_llm(self, prompt_info, max_tokens=8192): + input_template = { + 'model': 'unknown', + 'messages': [ + {'role': 'system', 'content': '你是一个关键信息提取助手。'}, + { + 'role': 'user', + 'content': prompt_info + } + ], + 'max_tokens': max_tokens, + } + payload = copy.copy(input_template) + payload['model'] = self.model_name + response = requests.post(url=self.model_api_url, json=payload).json() + assert response['status_code'] == 200, response + choices = response.get('choices', []) + assert choices, response + + json_string = choices[0]['message']['content'] + match = re.search(r"```(json)?(.*)```", json_string, re.DOTALL) + if match is None: + json_str = json_string + else: + json_str = match.group(2) + json_str = json_str.strip() + extract_res = json.loads(json_str) + return extract_res + + def parse_pdf(self, + file_path, + chunk_size=4096, + chunk_overlap=200, + separators=['\n\n', '\n', ' ', ''] + ): + file_name = os.path.basename(file_path) + loader = ElemUnstructuredLoader(file_name=file_name, + file_path=file_path, + unstructured_api_url=self.unstructured_api_url) + docs = loader.load() + pdf_content = ''.join([doc.page_content for doc in docs]) + logging.info(f'pdf content len: {len(pdf_content)}') + + text_splitter = ElemCharacterTextSplitter(chunk_size=chunk_size, + chunk_overlap=chunk_overlap, + separators=separators) + split_docs = 
text_splitter.split_documents(docs) + logging.info(f'docs num: {len(docs)}, split_docs num: {len(split_docs)}') + return split_docs, docs + + def post_extract_res(self, split_docs_extract, split_docs_content, schema): + kv_results = defaultdict(list) + for ext_res, content in zip(split_docs_extract, split_docs_content): + # 每一个split_doc的提取结果 + for key, value in ext_res.items(): + # 去掉非法key和没有内容的key + if (key not in schema) or (not value): + continue + + if key not in kv_results: + kv_results[key].append(value) + else: + # 去重 + if value in kv_results[key]: + continue + kv_results[key].append(value) + + return kv_results + + def predict(self, pdf_path, schema): + logging.info('llm extract phase1: pdf parsing') + schema = schema.split('|') + keywords = '、'.join(schema) + split_docs, docs = self.parse_pdf(pdf_path) + + logging.info('llm extract phase2: llm extract') + split_docs_extract = [] + split_docs_content = [] + for each_doc in split_docs: + pdf_content = each_doc.page_content + prompt_info = DEFAULT_PROMPT.format(context=pdf_content, keywords=keywords) + extract_res = self.call_llm(prompt_info) + split_docs_extract.append(extract_res) + split_docs_content.append(pdf_content) + + logging.info(f'split_docs_extract: {split_docs_extract}') + + logging.info('llm extract phase3: post extract result') + kv_results = self.post_extract_res(split_docs_extract, split_docs_content, schema) + logging.info(f'final kv results: {kv_results}') + + return kv_results + + +if __name__ == '__main__': + llm_client = LlmExtract(model_name='Qwen-14B-Chat') + pdf_path = '/home/gulixin/workspace/datasets/huatai/流动资金借款合同_pdf/流动资金借款合同1.pdf' + schema = '合同标题|合同编号|借款人|贷款人|借款金额' + llm_client.predict(pdf_path, schema) + From 56ada183e4fa5db5c4278976a3495dc3f8048405 Mon Sep 17 00:00:00 2001 From: yaojin Date: Tue, 5 Dec 2023 21:06:34 +0800 Subject: [PATCH 04/23] release --- docker/bisheng/config/config.yaml | 31 +++++++++++++------ .../bisheng/interface/embeddings/custom.py | 4 +-- 
.../bisheng/template/frontend_node/chains.py | 2 ++ src/backend/test/test_docx.py | 2 +- 4 files changed, 27 insertions(+), 12 deletions(-) diff --git a/docker/bisheng/config/config.yaml b/docker/bisheng/config/config.yaml index 328e70c2..eb82a4f3 100644 --- a/docker/bisheng/config/config.yaml +++ b/docker/bisheng/config/config.yaml @@ -28,8 +28,7 @@ autogen_roles: documentation: "" AutoGenCustomRole: documentation: "" - -agents: +agents: ZeroShotAgent: documentation: "https://python.langchain.com/docs/modules/agents/how_to/custom_mrkl_agent" JsonAgent: @@ -45,6 +44,10 @@ agents: SQLAgent: documentation: "" chains: + RuleBasedRouter: + documentation: "" + MultiRuleChain: + documentation: "" TransformChain: documentation: "" MultiPromptChain: @@ -57,8 +60,6 @@ chains: documentation: "" TransformChain: documentation: "" - MultiRetrievalQA: - documentation: "" SimpleSequentialChain: documentation: "" SequentialChain: @@ -87,7 +88,15 @@ chains: documentation: "https://python.langchain.com/docs/modules/chains/popular/chat_vector_db" CombineDocsChain: documentation: "" + # SummarizeDocsChain: + # documentation: "" + LoaderOutputChain: + documentation: "" documentloaders: + CustomKVLoader: + documentation: "" + UniversalKVLoader: + documentation: "" ElemUnstructuredLoaderV0: documentation: "" AirbyteJSONLoader: @@ -141,6 +150,8 @@ documentloaders: PDFWithSemanticLoader: documentation: "https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/git" embeddings: + OpenAIProxyEmbedding: + documentation: "" OpenAIEmbeddings: documentation: "https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/openai" HuggingFaceEmbeddings: @@ -365,13 +376,15 @@ output_parsers: documentation: "https://python.langchain.com/docs/modules/model_io/output_parsers/structured" ResponseSchema: documentation: "https://python.langchain.com/docs/modules/model_io/output_parsers/structured" - + RouterOutputParser: + documentation: "" + 
input_output: - Input: + VariableNode: + documentation: "" + InputNode: documentation: "" Output: documentation: "" - InputFile: + InputFileNode: documentation: "" - - diff --git a/src/backend/bisheng/interface/embeddings/custom.py b/src/backend/bisheng/interface/embeddings/custom.py index 3983bc68..367f5e3b 100644 --- a/src/backend/bisheng/interface/embeddings/custom.py +++ b/src/backend/bisheng/interface/embeddings/custom.py @@ -16,7 +16,7 @@ def embed_documents(self, texts: List[str]) -> List[List[float]]: """Embed search docs.""" texts = [text for text in texts if text] data = {'texts': texts} - resp = self.request.post(url='http://43.133.35.137:8080/chunks_embed', data=data) + resp = self.request.post(url='http://43.133.35.137:8080/chunks_embed', json=data) logger.info(f'texts={texts}') return json.loads(resp).get('data') @@ -24,7 +24,7 @@ def embed_documents(self, texts: List[str]) -> List[List[float]]: def embed_query(self, text: str) -> List[float]: """Embed query text.""" data = {'query': [text]} - resp = self.request.post(url='http://43.133.35.137:8080/query_embed', data=data) + resp = self.request.post(url='http://43.133.35.137:8080/query_embed', json=data) logger.info(f'texts={data}') return json.loads(resp).get('data')[0] diff --git a/src/backend/bisheng/template/frontend_node/chains.py b/src/backend/bisheng/template/frontend_node/chains.py index 11a5116f..3b56e821 100644 --- a/src/backend/bisheng/template/frontend_node/chains.py +++ b/src/backend/bisheng/template/frontend_node/chains.py @@ -74,6 +74,8 @@ def format_field(field: TemplateField, name: Optional[str] = None) -> None: FrontendNode.format_field(field, name) if name == 'RuleBasedRouter' and field.name == 'rule_function': field.field_type = 'function' + if name == 'RuleBasedRouter' and field.name == 'input_variables': + field.show = True if name == 'LoaderOutputChain' and field.name == 'documents': field.is_list = False diff --git a/src/backend/test/test_docx.py b/src/backend/test/test_docx.py 
index 15e23d7b..575216b5 100644 --- a/src/backend/test/test_docx.py +++ b/src/backend/test/test_docx.py @@ -59,5 +59,5 @@ def test(document): print(idx, uu) -document = Document('/Users/huangly/Downloads/1. 股东会-会议通知.docx') +document = Document('/Users/huangly/Downloads/bisheng.docx') test(document) From ff5d2b77c97301a132deefbdd7532519c67b7606 Mon Sep 17 00:00:00 2001 From: yaojin Date: Tue, 5 Dec 2023 21:06:55 +0800 Subject: [PATCH 05/23] release --- src/backend/test/test_docx.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/test/test_docx.py b/src/backend/test/test_docx.py index 575216b5..ab51b6b3 100644 --- a/src/backend/test/test_docx.py +++ b/src/backend/test/test_docx.py @@ -59,5 +59,5 @@ def test(document): print(idx, uu) -document = Document('/Users/huangly/Downloads/bisheng.docx') +document = Document('bisheng.docx') test(document) From 1254af5d932d0fb2ae1ae9bec7987fed6d86252e Mon Sep 17 00:00:00 2001 From: yaojin Date: Tue, 5 Dec 2023 21:25:06 +0800 Subject: [PATCH 06/23] require --- src/backend/pyproject.toml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/backend/pyproject.toml b/src/backend/pyproject.toml index 5d0bf195..be500d53 100644 --- a/src/backend/pyproject.toml +++ b/src/backend/pyproject.toml @@ -19,7 +19,7 @@ bisheng = "bisheng.__main__:main" [tool.poetry.dependencies] bisheng_langchain = "v0.1.8" -bisheng_pyautogen = "0.1.17" +bisheng_pyautogen = "0.1.18" minio = "^7.2.0" fastapi_jwt_auth = "^0.5.0" redis = "^5.0.0" @@ -57,7 +57,6 @@ langchain-serve = { version = ">0.0.51", optional = true } qdrant-client = "^1.3.0" websockets = "^10.3" weaviate-client = "^3.21.0" -jina = "3.15.2" cohere = "^4.11.0" python-multipart = "^0.0.6" sqlmodel = "^0.0.8" From 33bb1e3931dfa2a34886f9e77b719384becad080 Mon Sep 17 00:00:00 2001 From: yaojin Date: Tue, 5 Dec 2023 21:34:28 +0800 Subject: [PATCH 07/23] modify pyproject --- src/backend/pyproject.toml | 1 - 1 file changed, 1 deletion(-) diff --git 
a/src/backend/pyproject.toml b/src/backend/pyproject.toml index be500d53..04a39d96 100644 --- a/src/backend/pyproject.toml +++ b/src/backend/pyproject.toml @@ -53,7 +53,6 @@ psycopg2-binary = "^2.9.6" pyarrow = "^12.0.0" tiktoken = "~0.4.0" wikipedia = "^1.4.0" -langchain-serve = { version = ">0.0.51", optional = true } qdrant-client = "^1.3.0" websockets = "^10.3" weaviate-client = "^3.21.0" From 896d7860eb30f2193a743141c59b9d9e83ee0dad Mon Sep 17 00:00:00 2001 From: yaojin Date: Wed, 6 Dec 2023 11:56:25 +0800 Subject: [PATCH 08/23] up --- src/backend/bisheng/interface/initialize/loading.py | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/src/backend/bisheng/interface/initialize/loading.py b/src/backend/bisheng/interface/initialize/loading.py index 8b473c19..d0ce7863 100644 --- a/src/backend/bisheng/interface/initialize/loading.py +++ b/src/backend/bisheng/interface/initialize/loading.py @@ -128,8 +128,16 @@ def instantiate_input_output(node_type, class_object, params, id_dict): chain_obj = {} chain_obj['object'] = chains[index] if id in preset_question: - chain_obj['node_id'] = preset_question[id][0] - chain_obj['input'] = {chains[index].input_keys[0]: preset_question[id][1]} + if isinstance(preset_question[id], list): + for node_id in preset_question[id]: + chain_ = chain_obj.copy() + chain_['node_id'] = node_id[0] + chain_['input'] = {chains[index].input_keys[0]: node_id[1]} + chain_list.append(chain_) + continue + else: + chain_obj['node_id'] = preset_question[id][0] + chain_obj['input'] = {chains[index].input_keys[0]: preset_question[id][1]} else: # give a default input logger.error(f'Report has no question id={id}') From 899577c0fb362fe69156bc4f13e69cb2b7cfdd6d Mon Sep 17 00:00:00 2001 From: dolphin <78075021@qq.com> Date: Wed, 6 Dec 2023 16:24:39 +0800 Subject: [PATCH 09/23] fix: download file in sass --- .../src/pages/SkillChatPage/components/ChatMessage.tsx | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git 
a/src/frontend/src/pages/SkillChatPage/components/ChatMessage.tsx b/src/frontend/src/pages/SkillChatPage/components/ChatMessage.tsx index 491122f7..ebe50d54 100644 --- a/src/frontend/src/pages/SkillChatPage/components/ChatMessage.tsx +++ b/src/frontend/src/pages/SkillChatPage/components/ChatMessage.tsx @@ -10,6 +10,7 @@ import { alertContext } from "../../../contexts/alertContext"; import { CodeBlock } from "../../../modals/formModal/chatMessage/codeBlock"; import { ChatMessageType } from "../../../types/chat"; import { downloadFile } from "../../../util/utils"; +import { checkSassUrl } from "./FileView"; // 颜色列表 const colorList = [ @@ -133,7 +134,7 @@ export const ChatMessage = ({ chat, userName, onSource }: { chat: ChatMessageTyp // download file const handleDownloadFile = (file) => { const url = file?.file_url - url && downloadFile(url, file?.file_name) + url && downloadFile(checkSassUrl(url), file?.file_name) } const source =

From e540620ad980425942fef3435bf2e9de6d0940cd Mon Sep 17 00:00:00 2001 From: yaojin Date: Wed, 6 Dec 2023 18:12:19 +0800 Subject: [PATCH 10/23] fixbug 0.2.0 --- docker/docker-compose.yml | 8 + docker/office/bisheng/all.js | 487 ++++++++++++++++++ docker/office/bisheng/bisheng.js | 15 + docker/office/bisheng/config.json | 35 ++ docker/office/bisheng/icon.png | Bin 0 -> 17709 bytes docker/office/bisheng/index.html | 12 + src/backend/bisheng/api/v1/knowledge.py | 13 +- src/backend/bisheng/cache/utils.py | 7 +- .../bisheng/database/models/message.py | 4 +- .../bisheng/interface/chains/custom.py | 4 +- src/backend/bisheng/main.py | 1 - .../bisheng/template/frontend_node/chains.py | 9 +- src/backend/bisheng/utils/minio_client.py | 3 - .../chat_models/proxy_llm.py | 2 + .../bisheng_langchain/vectorstores/milvus.py | 0 15 files changed, 584 insertions(+), 16 deletions(-) create mode 100644 docker/office/bisheng/all.js create mode 100644 docker/office/bisheng/bisheng.js create mode 100644 docker/office/bisheng/config.json create mode 100644 docker/office/bisheng/icon.png create mode 100644 docker/office/bisheng/index.html create mode 100644 src/bisheng-langchain/bisheng_langchain/vectorstores/milvus.py diff --git a/docker/docker-compose.yml b/docker/docker-compose.yml index 937d2eb1..17218908 100644 --- a/docker/docker-compose.yml +++ b/docker/docker-compose.yml @@ -23,6 +23,13 @@ services: - ${DOCKER_VOLUME_DIRECTORY:-.}/mysql/conf/my.cnf:/etc/mysql/my.cnf - ${DOCKER_VOLUME_DIRECTORY:-.}/mysql/data:/var/lib/mysql + office: + image: onlyoffice/documentserver:7.1.1 + ports: + - "8701:80" + volumes: + - ${DOCKER_VOLUME_DIRECTORY:-.}/office/bisheng:/var/www/onlyoffice/documentserver/sdkjs-plugins/bisheng + backend: image: dataelement/bisheng-backend:latest healthcheck: @@ -42,6 +49,7 @@ services: restart: on-failure depends_on: - "mysql" + - "office" - "redis" nginx: diff --git a/docker/office/bisheng/all.js b/docker/office/bisheng/all.js new file mode 100644 index 
00000000..93560514 --- /dev/null +++ b/docker/office/bisheng/all.js @@ -0,0 +1,487 @@ +(function (window, undefined) { + let selectText = '' + + window.Asc.plugin.init = function (e) { + selectText = e + } + window.Asc.plugin.event_onClick = function () { + selectText = '' + } + + window.Asc.plugin.button = function (id) { + } + + const EventMap = { + sendToParent (method, data) { + let params = { + type: 'onExternalFrameMessage', + method, + data + } + window.top.postMessage(JSON.stringify(params), location.origin) + }, + focusInDocument (data) { + window.Asc.scope.field = { + id: data.id, + fieldFlag: data.fieldFlag, + $index: data.$index || 1 + } + window.Asc.plugin.callCommand(function () { + let field = Asc.scope.field || {} + let index = field.$index ? field.$index : 1 + let oDoc = Api.GetDocument() + let flag = `{{${field.fieldFlag}}}` + let oRange = oDoc.Search(flag) + let cur = 1 + for (let i = 0; i < oRange.length; i++) { + if (oRange[i].GetText() === flag) { + if (cur === index) { + oRange[i].Select() + break + } + cur = cur + 1 + } + } + }) + }, + focusTableInDoc (data) { + window.Asc.scope.marker = data.marker + window.Asc.plugin.callCommand(function () { + let flag = Asc.scope.marker || '' + let oDoc = Api.GetDocument() + let oRange = oDoc.GetBookmarkRange(flag) + oRange.Select() + }) + }, + addMarker (data) { + let flag = '{{' + data.fieldFlag + '}}' + window.Asc.plugin.executeMethod('PasteText', [flag]) + }, + addBookMarker (data) { + window.Asc.scope.value = data + window.Asc.plugin.callCommand(function () { + let oDoc = Api.GetDocument() + let range = oDoc.GetRangeBySelect() + let params = { + type: 'onExternalFrameMessage', + method: 'addBookMarker' + } + let marker = Asc.scope.value + let markers = [] + if (range) { + let texts = range.GetText() + let pars = range.GetAllParagraphs() || [] + let txtList = [] + for (let i = 0; i < pars.length; i++) { + let text = pars[i].GetText() + txtList.push(text) + } + let table = pars[0] ? 
pars[0].GetParentTable() : null + let count = table ? table.GetRowsCount() : 0 + for (let i = 0; i < count; i++) { + let row = table.GetRow(i) + let firstCell = row.GetCell(0) + let cellText = firstCell ? firstCell.GetContent().GetElement(0).GetText() : '' + // 序号 + let isNumbering = false + if (firstCell.GetContent().GetElement(0).GetNumbering()) { + isNumbering = true + let cellCount = row.GetCellsCount() + for (let j = 1; j < cellCount; j++) { + let cellItem = row.GetCell(j) + if (!cellItem.GetContent().GetElement(0).GetNumbering()) { + cellText = cellItem.GetContent().GetElement(0).GetText() + firstCell = cellItem + break + } + } + } + if (cellText && txtList.includes(cellText)) { + let cRange = firstCell.Search(cellText)[0] + cRange.AddBookmark(marker.key + i) + markers.push(marker.key + i) + } + } + // range.AddBookmark(Asc.scope.value.key) + params.data = Object.assign(marker, { + key: markers.join(','), + texts + }) + } else { + params.data = false + } + window.top.postMessage(JSON.stringify(params), location.origin) + }) + }, + deleteBookMarker (data) { + window.Asc.scope.value = data + window.Asc.plugin.callCommand(function () { + let oDoc = Api.GetDocument() + let markers = Asc.scope.value || [] + for (let i = 0; i < markers.length; i++) { + oDoc.DeleteBookmark(markers[i]) + } + }) + }, + // 批量删除循环应用内标签 + deleteLoopApp (list) { + window.Asc.scope.value = list + window.Asc.plugin.callCommand(function () { + let list = window.Asc.scope.value || [] + let oDoc = Api.GetDocument() + list.forEach(row => { + if (row.loopType === 0) { + oDoc.SearchAndReplace({ searchString: `{{${row.startTag}}}`, replaceString: '' }, `{{${row.startTag}}}`, '') + oDoc.SearchAndReplace({ searchString: `{{${row.endTag}}}`, replaceString: '' }, `{{${row.endTag}}}`, '') + } else if (row.loopType === 1) { + oDoc.DeleteBookmark(row.bookmark) + } + }) + }) + }, + // 更新占位符 + replaceMarker (data) { + window.Asc.scope.st = '{{' + data.newValue + '}}' + // 原来的值 + if (data.oldValue) { + 
window.Asc.scope.old = '{{' + data.oldValue + '}}' + } else { + this.addMarker(data) + return + } + window.Asc.plugin.callCommand(function () { + let oDocument = Api.GetDocument() + oDocument.SearchAndReplace({ searchString: Asc.scope.old, replaceString: Asc.scope.st }, Asc.scope.old, Asc.scope.st) + }, false) + }, + // 查找并插入占位符 + findAndInsertMarker (data) { + window.Asc.scope.st = '{{' + data.fieldName + '}}' + window.Asc.scope.searchStr = data.fieldValue + window.Asc.plugin.callCommand(function () { + let oDocument = Api.GetDocument() + oDocument.SearchAndReplace({ searchString: Asc.scope.searchStr, replaceString: Asc.scope.st }, Asc.scope.searchStr, Asc.scope.st) + }, false) + }, + insertPosition (data) { + if (!selectText) { + let postData = { + text: selectText, + ...data, + selected: false + } + this.sendToParent('addRange', postData) + return false + } + window.Asc.scope.postData = data + window.Asc.plugin.callCommand(function() { + let postData = Asc.scope.postData || {} + let oDoc = Api.GetDocument() + let oRange = oDoc.GetRangeBySelect() + let selectText = oRange.GetText() + let oAllPar = oRange.GetAllParagraphs() + let oPar = oAllPar[oAllPar.length - 1] + let parText = oPar.GetText() + if (oAllPar.length > 1) { + oRange.AddText(`{{${postData.start}}}`, 'before') + if (selectText.includes(parText)) { + let newRange = oPar.GetRange(0, parText.length - 1) + newRange.AddText(`{{${postData.end}}}`, 'after') + } else { + oRange.AddText(`{{${postData.end}}}`, 'after') + } + } else { + let isEnd = parText.substr(0 - selectText.length) === selectText + isEnd = isEnd || selectText.includes(parText) + console.log('end = ', isEnd) + let start = Math.max(parText.indexOf(selectText), 0) + let end = start + Math.min(parText.length, selectText.length) - 1 + let newRange = oPar.GetRange(start, end) + oRange.AddText(`{{${postData.start}}}`, 'before') + newRange.AddText(`{{${postData.end}}}`, 'after') + } + + postData.selected = true + postData.text = selectText + let 
params = { + type: 'onExternalFrameMessage', + method: 'addRange', + data: postData + } + window.top.postMessage(JSON.stringify(params), location.origin) + }) + }, + deletePosition (data) { + window.Asc.scope.range = data + window.Asc.plugin.callCommand(function () { + let oDocument = Api.GetDocument() + let { start, end } = Asc.scope.range + let markers = [`{{${start}}}`, `{{${end}}}`] + for (let j = 0; j < markers.length; j++) { + oDocument.SearchAndReplace({ searchString: markers[j], replaceString: '' }, markers[j], '') + } + }) + }, + deletePositionMarker (data) { + window.Asc.scope.data = data + window.Asc.plugin.callCommand(function () { + let oDocument = Api.GetDocument() + let markers = Asc.scope.data || [] + for (let j = 0; j < markers.length; j++) { + oDocument.SearchAndReplace({ searchString: markers[j], replaceString: '' }, markers[j], '') + } + }) + }, + deletePositionArray (data) { + window.Asc.scope.data = data + window.Asc.plugin.callCommand(function () { + let oDocument = Api.GetDocument() + let markers = Asc.scope.data || [] + for (let j = 0; j < markers.length; j++) { + oDocument.SearchAndReplace({ searchString: markers[j], replaceString: '' }, markers[j], '') + } + }) + }, + replaceRangePosition (data) { + window.Asc.scope.list = data + window.Asc.plugin.callCommand(function () { + let list = Asc.scope.list || [] + let oDocument = Api.GetDocument() + list.forEach(row => { + oDocument.SearchAndReplace({ searchString: row.str, replaceString: row.newStr }, row.str, row.newStr) + }) + }) + }, + delMarker (data) { + let fields = [] + data.forEach(item => { + fields.push({ + text: '{{' + item.fieldFlag + '}}', + type: 'field' + }) + }) + window.Asc.scope.st = fields + window.Asc.plugin.callCommand(function () { + let oDocument = Api.GetDocument() + let markers = Asc.scope.st.slice(0) + for (let j = 0; j < markers.length; j++) { + let marker = markers[j] + if (marker.type === 'field') { + oDocument.SearchAndReplace({ searchString: marker.text, 
replaceString: '' }) + } + } + }) + }, + delMarkerGroup (data) { + this.delMarker(data.fields) + }, + // excel + insertCellName (field) { + window.Asc.scope.field = field + window.Asc.plugin.callCommand(function () { + let fieldItem = Asc.scope.field + let sheetObj = Api.GetActiveSheet() + let sheetName = sheetObj.GetName() + let oRange = Api.GetSelection() + let oCount = oRange.GetCount() + let params = { + type: 'onExternalFrameMessage', + method: 'addCellName' + } + if (oCount !== 1) { + params.data = false + } else { + let oAddr = oRange.GetAddress(true, true, '', false) + let sheetFlag = `${sheetName}!${oAddr}` + // let name = [fieldItem.fieldName, 'DEF', fieldItem.id].join('') + // let nameObj = sheetObj.GetDefName(name) + // console.log('inser cell before = ', name, sheetFlag, nameObj) + // let result = sheetObj.AddDefName(name, sheetFlag) + // console.log('insert cell ', name, sheetFlag, result) + fieldItem.fieldFlag = sheetFlag + fieldItem.$success = oAddr !== '' + params.data = fieldItem + } + window.top.postMessage(JSON.stringify(params), location.origin) + }) + }, + getFocusedCell () { + window.Asc.plugin.callCommand(function () { + let sheetObj = Api.GetActiveSheet() + let sheetName = sheetObj.GetName() + let oRange = Api.GetSelection() + let params = { + type: 'onExternalFrameMessage', + method: 'getFocusedCell' + } + let oAddr = oRange.GetAddress(true, true, '', false) + let sheetFlag = `${sheetName}!${oAddr}` + params.data = sheetFlag + window.top.postMessage(JSON.stringify(params), location.origin) + }) + }, + loadFileFlags (data) { + window.Asc.scope.list = data + window.Asc.plugin.callCommand(function () { + let list = Asc.scope.list || [] + let oDocument = Api.GetDocument() + let oParCount = oDocument.GetElementsCount() + let dataMap = {} + for (let i = 0; i < oParCount; i++) { + let oPar = oDocument.GetElement(i) + let ctype = oPar.GetClassType() + if (ctype === 'table') { + list.forEach(row => { + let flag = `{{${row.fieldFlag}}}` + let rs = 
oPar.Search(flag) + for (let i = 0; i < rs.length; i++) { + let oRange = rs[i] + if (oRange && oRange.GetText() === flag) { + if (dataMap[row.id]) { + dataMap[row.id] = dataMap[row.id] + 1 + } else { + dataMap[row.id] = 1 + } + } + } + }) + } else if (ctype === 'paragraph') { + let oParText = oPar.GetText() + list.forEach(row => { + let count = oParText.split(`{{${row.fieldFlag}}}`).length - 1 + if (dataMap[row.id]) { + dataMap[row.id] = dataMap[row.id] + count + } else { + dataMap[row.id] = count + } + }) + } + } + let params = { + type: 'onExternalFrameMessage', + method: 'loadFieldFlagCount', + data: dataMap + } + window.top.postMessage(JSON.stringify(params), location.origin) + }) + }, + /** + * data.sheetName: 要聚焦的sheet名称 + * data.cellName: 要聚焦的cell名称,如C1, D3等 + */ + focusCell (data) { + window.Asc.scope.data = data + window.Asc.plugin.callCommand(function () { + const theData = Asc.scope.data + const theSheet = Api.GetSheet(theData.sheetName || '') + if (theSheet) { + theSheet.SetActive() + + const theCell = theSheet.GetRange(theData.cellName || '') + if (theCell) { + theCell.Select() + } + } + }) + }, + + getSelectedText (data) { + window.Asc.scope.data = data + window.Asc.plugin.callCommand(function () { + const theData = Asc.scope.data + const oDoc = Api.GetDocument() + const oRange = oDoc.GetRangeBySelect() + if (oRange) { + const oParas = oRange.GetAllParagraphs() + // 只能选择一个段落,否则认为不成功 + if (oParas.length === 1) { + + const params = { + type: 'onExternalFrameMessage', + method: 'getSelectedText', + data: { + id: theData.id, + text: oRange.GetText() + } + } + window.top.postMessage(JSON.stringify(params), location.origin) + } + } + }) + } + } + + function receiveMessage (e) { + let data = e.data ? 
JSON.parse(e.data) : {} + if (data.type === 'onExternalPluginMessage') { + switch (data.method) { + case 'focus': + EventMap.focusInDocument(data.data) + break + case 'focusTable': + EventMap.focusTableInDoc(data.data) + break + case 'insert': + EventMap.addMarker(data.data) + break + case 'addBookMarker': + EventMap.addBookMarker(data.data) + break + case 'delBookMarker': + EventMap.deleteBookMarker(data.data) + break + case 'delLoopApp': + EventMap.deleteLoopApp(data.data) + break + case 'update': + EventMap.replaceMarker(data.data) + break + case 'findAndInsertMarker': + EventMap.findAndInsertMarker(data.data) + break + case 'addRange': + EventMap.insertPosition(data.data) + break + case 'updateRange': + EventMap.replaceRangePosition(data.data) + break + case 'delRange': + EventMap.deletePosition(data.data) + break + case 'delRangeArray': + EventMap.deletePositionArray(data.data) + break + case 'delQuoteGroup': + EventMap.deletePositionMarker(data.data) + break + case 'remove': + EventMap.delMarker([ data.data ]) + break + case 'removeQuestion': + EventMap.delMarkerGroup(data.data) + break + // excel + case 'addCellName': + EventMap.insertCellName(data.data) + break + case 'loadFieldFlagCount': + EventMap.loadFileFlags(data.data) + break + // 聚焦到某个单元格 + case 'focusCell': + EventMap.focusCell(data.data) + break + // 获取当前选中的单元格 + case 'getFocusedCell': + EventMap.getFocusedCell() + break + // 获取当前选中的文字 + case 'getSelectedText': + EventMap.getSelectedText(data.data) + break + } + } + } + + window.addEventListener('message', receiveMessage, false) +})(window, undefined) diff --git a/docker/office/bisheng/bisheng.js b/docker/office/bisheng/bisheng.js new file mode 100644 index 00000000..03507467 --- /dev/null +++ b/docker/office/bisheng/bisheng.js @@ -0,0 +1,15 @@ +(function () { + window.Asc.plugin.init = function (e) {} + window.Asc.plugin.event_onClick = function () {} + window.Asc.plugin.button = function (id) {} + + function onMessage(e) { + var data = e.data ? 
JSON.parse(e.data) : {} + if (data.action === 'insetMarker') { + const flag = '{{' + data.data + '}}' + window.Asc.plugin.executeMethod('PasteText', [flag]) + } + } + + window.addEventListener('message', onMessage, false) +})() \ No newline at end of file diff --git a/docker/office/bisheng/config.json b/docker/office/bisheng/config.json new file mode 100644 index 00000000..b6f13abb --- /dev/null +++ b/docker/office/bisheng/config.json @@ -0,0 +1,35 @@ +{ + "name": "文档自动化", + "guid": "asc.{D2A0F3BE-CC8D-4956-BCD9-6CBEA6E8960E}", + "variations": [ + { + "description": "插入label-配置", + "url": "index.html", + "icons": [ + "icon.png", + "icon.png", + "icon.png", + "icon.png" + ], + "EditorsSupport": [ + "word", + "cell", + "slide" + ], + "isViewer": false, + "isVisual": false, + "isModal": true, + "isInsideMode": false, + "isSystem": false, + "initOnSelectionChanged": true, + "hideClose": true, + "initDataType": "text", + "isDisplayedInViewer": true, + "isUpdateOleOnResize": true, + "events": [ + "onClick", + "onTargetPositionChanged" + ] + } + ] +} \ No newline at end of file diff --git a/docker/office/bisheng/icon.png b/docker/office/bisheng/icon.png new file mode 100644 index 0000000000000000000000000000000000000000..df532d700ea5e9ac052053b0a763315234076538 GIT binary patch literal 17709 zcmeHv2UJsCw`~+fET~i|D$-35q$4PxfRuz#lF*yd6MB~-d@3NlhEN586hf5_f*?vS z(gH{TL8^d&AfWUY@4f#A?+tzLj&aBP-?-!bJLBYR_L*7PdveZN>zp;=;N!sr;38OA zO&M_P7yxjL@&X)89{Z@KsA#4I(^OVdzyI$CS^<=HLKpyWaCCQtsVH32H!!^RYwREG zp@p0CKl;DfQ|xXI{<01L^b7vx`2T6>w564s1qJktvhlc5zML{yMoOR2=HL6*fA!7( zy?^Id-^<gek-qtnzJ;^vFMI?AbH~Bn)^5GcIO90?PAOKL&eK>Z= ziSt9(hZOr1k}2D9Yij^tjQ{}98UO$cJpjNNqklU_dHciMu2BwMrp(KU^0EQg1FQho z0BQh7fCWH^(h~&;14ICl2O|JQ%CGak)_xiJuUVjc`rs{qj^Qj{QO$KXHue`XVv3Z{T&0}W$0^`nXZL?J>Qg6aj-NPvjOq;KxC|WtaPm0d z^jWHtzZUfia_l(OiIdc)=;#?PvFIlBGcqxABJ|?o(S0NxOfL-=w}|#Vo`<3bWWYJf z9FEgb(E;QE+kjK+5AKZr<17Aq?@w*3#Y;ENLr2<^<@~ln;<=R2Ap7}#d@fBQ2vK$q 
zk}PUYLUYonBg&Oet8=rwJLO^%UM-^Sz`G=Wr6_OK@nUk^mC`bwy?*4{5Did_r#qG~ z1VS+;HK-XfW1Hy$AF{mTlX+|!59S*{l=SiaZg}n(mF@l@raW+-NnK8Cp0v#l;9|S@25VVi`8H73T zJqp#}yIsgeZ$nbf;u2eEiFh1KZLa{vDJHNAi1C1wybO3*nZ?wAxJ#SAN(C@ki03MH4YKd&3=}Q_t{{ zQKXU7wdA#D{{sIf*x;a5VOi6N`tVq}AlmC)qNaTi)O9->xv|LSFq0_)o7HrkeF9R# zastE9)!|@RlCnqyCr#sY_-m4jnxdC%ZFzwbnv*FPg~qW?^5B4R);QJ;Iy%7 zlsTW{_z964Pi9WBlPSt^Z-V*ICzpflUf=c!T(CfP7=F}^8FoO)ap>ULXj$uOnf^sHlLyZQ8??)C~Uk1@5qii_O1{fOY zevKD3e!k4o^zCNfo9n}}XJ^uO`p$T&aDOnLzC$c@n;EVvL``4LPWwLI(E5FNoGe;d zB<`+T-R&K9uWQY)Mv!0p23~L=Eb~pKJL{u>^{>fuWmwM>5zfvXM4ilR5 z1$vAyX7@%#C7I=&a%R$pObxuY`%X@lFhakENW#ey!8?9e;lOC*BBH(%+@JC5^R9}} zsbHl_;$k9_U<(Z(jM+j}^pk0tmD7{fGBt?39#lD8bO@F*V!xFr5xTR>+7EI%p#+ieReK0jevb~HKWV#`OUn(t!&G7g7%u?WM#k91R#wqrhRXCY&RjTa1L`-XLLk61&=? zcP|?{6C+H^Tm;38UfV9co>g@v`{DR3lh>?;5W^^N4~|bb!oReEAeP)BHtaoGZsV40 z&^fC;GbdW_X{viCzi2EWn!{A29@y5Vp{${-W;s3C*4DJ+>f&KJ*%W?B_vT^_;MB2y z%bfq$gy2ET;3XDu2YHCLC|sp?SZf1Rep^=1#i~EG-F6%%Ss>{v8A~*6k3)k^9D?qJ z0UbZxqw_%6={Q_+)7!2ViP&Le*a&LdW7p~z4e!@!P&MK{rik?w$Umwl0;K-3)%Q58nLk#g_h^ zvl@2j>5aAgO}8z#zW|a-x-b9N#s1@Gsy&_$X`kE)w6mSDa!QRFv2x%6kb81p!pV+Q zkU3XEN3tLi+*1ZRqmX%$LOmx#RcZT;B!XmIdq%9(K)B==^=ewyu`d6CvQXPV`eav2 zRT51C@U;3hbS4e>A&tVtKu)S4sw;A_YPs>qT-f{omnF%8@uxPPK{?+*c~}+s?p_|9 z6`PAiqyFk7=Ww*Bv7&ZU0-s0`-HoZbMZS`oUlLMbuEiF2D2H1}w?{kk+F5fD)z#eo zvk*gb%Jd#Nvh|cTbJ*h4bhE+`N$+=mIZojN!036PPM(fH!YL2b7&ucuw?371CD%hQ z3dM9PKAVP5?8O@|QMW_}b2OeVSq3%4qHr<7zvs#BHau!JjsdL$>w+N z0z@rxWrAbg{K)#$e*l2Lblm9aS+7r-^{-U5+tb4n4*=3(VY4^CeJQtw?%NrgBMb&)hF^P=DMex`)@rp;`eaZJ}Mmwy`KhKbVAb zovUy&OXWf_C)@`qH3BJlF6I1pKFS@wWnLv%S2Vwk`L1V27Y>rj{a)dvRf&F0T3|-^ zc8ygNAOiy^SDvn1(xRpTG-gZ2KUvZbFF5?lmFxf`2rgWaj%$A*QULc#o)iKVJMdy7 z9Eu?coTo8CwdFEbvu4!PuO%tFk%Q>^Th2yM?goJIb6rVQdn8oIcgaRD#wyIoYWvJ^UNg$4wENeXKGd( zig4Mlq^_>veHHIYM83pt8<@q?y{r^Oi`ZdmsD?Ydrk+`yGn^@WyXiXib7H8k_al3> zH(WOJC0lb3{PB&V){q_}%n@@X!Gf+s4}7H^cb)C#ZNJjC)DqKsdT1?dfpr+YX(e01 zjw9xV6`8B;Udg6a`rndRFY~0H`h@7VwWTgDV;qMFtZ|7m&m1v#WVvkHhPSC)RzNmo 
z;#B(NGfudKEjqTdu1CyCYBd!ae!XBLI37j&Sn$e|svikrWVU1Z4kxAh+IK@D|HK}3 z9VY7SsBe4n5-ng*@@eY4mA{hm)6t3}=8iD)H`BrY&533Th^M$=yOUmJ4iB4(dVPHQ zMmN>dr28dgxjvYjQb&xpB`{THqhuu;M86OT2tGdAAq3%-9RlB(^gJ|WwKkbF6!!v- zug^y0GWuj2L$)s0Hs-iOpZAQk!Y)H1s-Aze?8=`(dg~AP zf)Ji4khzj~sEgKYlSJxGcIAjgFeF4V^3I@3DU4A$9^8i?HRZM3y>@)+;@jW(1^|Fb zXFZ~d?4)CE#i0m)AsL)qzMY8ZE59OhR{bho15eU%FuR|p9=|ViBSR|x7D3XDtaFeSC3K+YYb|SHx%}Q-vDwII#}{B*GqS(xDBELs&-#2(7K0Z`c(rrIYto6LRqNo04fq1%LeTx0 zjvUeask6~spqba-i%pQzuY)n_x35T}Yxg^c_0384vE zo?6Rv@V;}Y3`KL^6FdRXQL(#P-ANUwf`f8yb z1}`@_K5pQ6%7!YgxexJ@USa%tI`jy_3Eysd{1qi{md%k6e(&|5Qf?x;fmT9)EDKkY z_9M*mMB~aKLh6tkwxowOuZ%LBRVIQ{@*R6&V;_8*ZXf>xd$j7pV;=>(tGK@@LUpBV ziH>u!lP{i#+&<)&_6UuCGtK$mo9H%!{Jyd!E}~f>i4A632KH?)H_h30WiKJXkxR3; z{kBH8VxRIkD5GA`um1DyMUe$h{W_>2}3fuNhs}|MRPv!#%!zm zwx9KDf@qGtoqSe>SY{a-AtborzIt7qzuU&N=bRBVnMQHCT9AJvEMk}@tZr4k{EajA zov^yKknomD;U3=spy;bnmOONqqhx=;S-#=`aG_;O{#5`f;OBV4<^doeW4mZ0Bf4YX zEd1yA#?KUWZ>i@_Bn%Vzoym~X$k4eCSg5^jE+dN;GQSx)&#xX+6^*b4dpzf9HU43U zz|f^BS(_IGmVGewn5+2ot^tRwF^|{E=}*ej$h{XW`(wIbATtY1HdV4VgoGkXq7vfm zbk$y0!b%O-T(u&CMoYtueFI+O$F}nw4VnsN>&M`Zd%AW#z%KJwW1pZcb1gq7ZL-7J zt%-G~Zaw4lx-mI)b$ppv>7U1raj}bb`hgg!*GXBYF4lKk;hXwCwd>s56j_QGr%JmQ zQ*(%j?6_6lQn{1sFrM4xL#3ZDN(I0m+pC z>o>>pb=hm$s&;=K04|HWSlACpy`8&gDRc<2aQz_Yvb4XD>a|d6`q!WU?b!6Fzh6KU z+pXdXRO*CVi0Gw5L~*qZa}|Yj4`|JXLD5}1rr_l@M|nB<7TU}DBNEVW(v7TVk3jT~ z&@=Q>BJ91Fw-iSP$5e3C4Tlu`&&!kN{zxEO)5DtD)adT$)%PmQl3j4W^Dfj1%ZC(8 z9Wp8Lh4ycom9J&t=8C-h2jn8j~aGrk(!D2kMM4rYM1E!-%j+ZZD3vQ`asrO}R(~pKss<*CUV3)(Ng7OQx2|OJ+ ztsOajci2C+b8}Y^KSo4E)0?Pm900-~*ev;Pt*ssVERp9L-Wxp&F>KCIC21NhM6wg( zmOUee5VP?H5_z~V+`SD+f<~%Dd(Ob@Y3EONDFL&xP7QD`ALQ$FBBNFCD;w#BS8{C z8ZbIr=e85RsqlFHn4sfcj#R(-&@r=ROr<}GkLRL^x-j=gA^9pEXJucgUbiiNd3JA- z@)YfF4Eubs-CspNAV&2l26}2d3*FaC9HaENC?S5X4DRLTzp62Eh@f5nSpHUT2Ujp& zemay2evW4IZ1nQ!%amN|I~iD0UH^#SaEPF!U4|0^dwqi|cqbQ{sngY%|a zTe1Cd*2+p*1c`|HZ38gOC2_Y|*=(L=^ZbKZoraDv3(dR*KInQ~WIxc0U)k2st-8pt zkA4f5wT*-4bfJfJQogTOIsv;BzYR;FhifsN`VyMc1w+w)NrFc!uX{!^XE=_aKdz?; 
z3Z^x((5V?zXxYKTVVFtGV)vbeQ6Ht}qb^$)s;0WDrt%Cwlo1gGU4B&}0$VL)kDKF& zna99LN@{9!JaK`XG+?a}2b^F985YQUU5ySvZSQ{fk8a@#tlQvrhX01|V_KAWf}$?! zbV=5HV?}Swa1baxG$c`yV3;SH@+no705)!fgcCkk;_bKMQYr)I+@@8Nv-9^OSP}DO z$=^4E!U)imzykYpRF`w5-~fw)NTRg+j{Q~&CZuIDr`1c&*n(!)8Ypk2e9Yss<|*Af z6~rG_GYXmA-dT??@k#xUk zNGKKXV`Pp+;xbjztw;=Q6k{;r!-{6yhUxN;s!}a|9uEW9H`7k5`FR>3M8T4^(PX z-MV8b;r!BvE<`QuDz(Bh)5=2xJH|CJ+WrD!R8uF?{1x{nX!Ts{Pf9qSv(7c8{`rja zUXjreM29q_rQ%esBY$plu|C}j-{|*gKT7KKy;SEWmEXOXmqP_d5J!mA&!_nj@94cI zM@gNwhI132)Lm=UI!d)oitE|=&v8Fz?z-elSf^(#ykKrV9jT{lw2+;$Clr=YAUg9Q zU+gh7TW|g=x}2Ae*qzOn$Y&l4d_OE^qo&kZaOpe?cU7)7k%yb}mY)_dJM7CKm4>94FEJ`TYO;!Z@a5rrP0#UoYg--sqCfvt78IN%Vl~X_FoV zHZ5T1U)DE68()}c3-`Gl7H-`eK_}RRTwfm3L;T-LQyGaNhbIcPO%ei5;7UH$jd|x zUGsP~l}dOPjv4ju&1#C4JE1cFO+QqUp^qhM$Ca$1{+!ewu7D18vT9RoPnueurwjPD+4!5G=Gj*orOo=sy_f+N0tdXvQFdpBh;<$+NS7K z)Fpzumsk%+56fN{;Se- zd#F9vJmO+`0mQ^xGTWFWfr@l3)q1hIZ!=(6+Sa{rlW&(E9Uwq(Ij_PSBF73Hb;IX|%@ZO%KvN)Sj#2X<< zF}YL4Q}Kgvx8ApsN=AIjLbF-(Gq6nbB|SYYz`4H`|N7g1{9ox@SdZs(3++p-Oc3?k z`cB8E6nY&o+)sca=!6hN+oy@5mQBR#3Kx@zs0ezBnk4Z2WcMXk^DHui?0?%#uXB`B zjOVtE?Zytif2Qyv%vvpjTeA>B0*XA1D-%@vRpJy0gm(UyRUfwL%m)C27c{jCn?ZS~ z`vnPaL>8=Id59c&RszZe?Enow!($D#Bmqymip816M$V16r$z;YJohHP)hHA(pAHCfjSWHOdup-dgu6``7D*cw^ zp<;=7`pkPXNX-L>t{M1dsPbR7JNIDg#Q|L0o`;wts-~doV@0^JSCX@9#2v-E;*Vbv zee~;N>1qyjSVt67??(R`>RbNE?&giyqm)_XdJu1Jn6-wC9T7=cox{x(9soRDCm(%w zylXe_WlvjOo9j=hvvAmgOffg7l^omuR$w)h1jE*b0&1jP?l@OolUT2*pdS6CBTuQb zM3*t$gj`zVk~(5<;>d!HV%WA#MyMQ}KpB)miwgNnA})N3j{S%|`%H9g`S|h&W6jHg zt+bDYmY!5fC1_F#E!j&aA$?7hLdy|z{}X@Ib%~yDM^f7fltN4I=2P6f`M#3!*F$sc zhuoZXrTUxe;D0jJ&);h|--*vI&Ct%X82g6nAK`=>*X58O`?RHc+o~g~<43Dsf@3K+ zh2}UR^FB&VKEi@VK|}~}Id*&Fb(@;hjkDgWyv&B$Iz&#ehFl4l)E}6@>OmiQmP@Ww z?Es+od4f7VWX+K+RDf<$ERyJ(7ngCaN5q$oJhz3ZL`S#W_yH^kV$eReHHiWHd# zP4B{@^h&Te`b49LTIOL!l`%>tI&1?*_eTfz3)L;;ntwjXIPIQ)V;Hg6oWT0!B=7LW zxqi!AiRpOKXCg4+;<$oVDD89sVX774z}}l-)x!F4;MRNhDzA>d|NWnK( zAxLp9b-j=-LTEV~np_+=LaD}tp06mwIbMsK{Ofod-OE2Ytz)Y4s_GfNem-jU95CP_sd30x%Ub%2Vnxm-$AXWRnn&*NZod= 
z4Dm8<-hFA55T&%mKqsTLC4eCI*?l7bi5=ZO-mHsYtF~BfEiMF>SNJI`kP}F`$)RXMXc^v(f9b~m zcwxFy?pRf(P-QI(&xPfpe@vGRXn-S!OEMu5t2fB0oEmDbV!f*+J5i~w%zYzP;R`+Q zHAQ+K_AZbT36zS={ZOGB!DQHEnZY^n??Tx;==YYGxP{A&WxOZwZAjRV(}R$)LnX!k zq&Xl9xe07Mk*{X)m2J4XpXY^XMf#;4BX#6>v}?Y>Chbe5gkUR zm*==qY6W4LR$hD4AGeUTf3o^Gx91UaM;5GAQ{z<7H3I)ZosqL658KcVI6nG>uI7lj zLs^GNj>%Z%;Or7{=b0wyaR$*Rr~ZlK-9PciLK0dYS#JE1oM|au%L>Qzo%zFOSAbn%}YNmto>hA%V(q53_~M*n;X*9J^TXL^xkiR?+uP`9vRWYj$N!v&gOB2NsL}Zh z28czO1m5`>)?o&I5>^spN<_>8K?yfLDqupErZe%}Es&^GktD0itB*-lJLoGTZXGpw zu6CkiB{vVj(yL-9i7atl)i%6Z6Fp+GIAzr3Jy^}Ls+^bhWgq$O|IY3F$sPSC$JsBf z_r1NsAH&!lxlp_Gf+RR2R4x;Q6L<4+MD}fX7^a1ap~_|Y2jP}-8X0(K1~A;y{&Gbs zg4p5nT*I=2pl*j4n|%}>mT#nnr&8R9^WjYCJJkP0y!D;yD(l#@d+*jXC(FC+x zZhf~FlJ@|*AS_t6sODK+%0A^bw#x+*WE^N!>^6jmIFyV#bCRt33usO*xh`BcnSa;V zK+Z|uMZ$J%^n_wo;Z}tVjd7#U!li=14|oSIRoXi z{5ibfa6JZg1w{;#q0cOrCD{rUw+Rn(FnuB6+AAt(CYhEDpJ{jhBx=$cD`y&2y4^hk z*X^^^?A-)qOTy@b-gsj--d(OjaMCDSPQ#ZW%t*tj3vrp8*ih#Pi&YGIFaBKAiQ|h6 z*5`{t9S}ktl0nAl9ZK)`I|HMPk~G8=1u%ms#rz8EY$U z`JQd#5J4%Y{HM9mD$0AyKB1KP^lv50x%QR6wj}s0*v; zmdT@QJbr)m$$l>@zsc?=@<*zv4w?RMO?0&Vv=r$APl-mTnA###MXmp}VEasGjxD9k zlW{j|Ad%NLm5(WOd4UPFq*F%BQCf$O>)P~Rn=~efSY4v$fhJBqI4d+-UeKYC5++-u zgzrIyxM(thjFrtrE7N%5qhCXLLH>$?fxK-OfGrsGHFPNtHcSMTOUk295Y3Z;D!}4Y zrDpC_VpKKmV0!GtcqHS+#4DbH>8-8RxW$?Qte+`=-uc^hs!I6%DUW5dq1;k^9bVfG z7hIZ@L$1gBW`P`m@1u}sp$J-8C`B=vJ2*H9bf0CoDq*E&BdJK}pOP4mPEfQk*kdYw zL5t1GbaBcfgep9g^Of2B%baZAs=&e9hLjscShp(a{0OruSPbH$duk!#^1NOAS0y&O zL78?@!>E$hRk6I{B6+rdUVL7HA2|IT>4AoNVDVx%tNQvrD|hFD*^L}!`fWXMLf%dj zY^Kk1f!aEz?oYREL2EA96R_iqrn zMsWD8T*Jb6mu7P(dTHzOgPwx#!}B#w^$(I_;1$1BYZMpypYVO$vbh^2Y^%HVb$x7d z{FS;YaOC%D4gDecQUua2KqCeNAw&=6_QzwJY0r$yrl-~#ZJrK2dHV?Bh+2bfZbGGUG{C8K ziaN+Cy&L*TaOXth(?5_M`8U_k{^?ZPyw~0aO<4}Q^{Q20YDl=JieQ({8FXjTP9w&u zKM3kJCy#5-RVnCfhT%GLDq#IIcm{2j2DRzvK}RYQ7N^4=R00c$w5!CWXemF61oI7+ zr$Bb@1OoNzvB?`3oqb-#!<8sO=w@E(@)W9woyUmnarI=)A-&4ZQgAf!)ixX$GF-RB0|+f#ZzL+1kZp+T*`r!>@eLR;2CaVKvz`A zZ(YK#!TciZ_{LtiyUKk1vrRK{;i=8>&lCj`|K^1y%>w}YTHW{xMNgu$U%I*#(6Ji8 
zzF*Iv7P9jG+~iQz6Vu%GhmeH)rDme@*au?I3Wrd7QMZ#2?Tp^~hMMRVe1egpK5t%9 zr}cb-K^l?POf_(@Y_4oC(q@C$>`@W-I_Py>fhq8G#MZelGRM9n;oLNSvAm-@(Hg0; z7xg}!PO5;Caum))lqt)z{E~|GJ2H(?q+-rSgXBh@Okd!{B4v*+E%2CXCpiAT z$L4g4@hKC8^`%r}o2<+r@l8_?jOV9!?@X}a=JA?2Bz^1$TtoD+O`c2=XI~iEKnY_4 zK_TD8pc%q7v}K+nN(@C^b>BRrR#^G1xC7W_;ygB1-ZHJSb=(H)%Z^0WN!_wAeG*gi zTfGJ)Ksqjyr`EW4DFHIFtZbY;O{C~>mUhsKwyrr9ycwc}gfZZ9{O$%SXkoS zhSVy4G%Ve_jxdtU=`CXtOdF0-HxJbGBTo$SS1KfIKE*=_(l7>=T>312bunv(^M?@@$SxM?x*X{|9 zw7Y5TNE+A&!IgwknEjlRloL6Dj}!zX=XfESt* jYRg;D#EY#`nzf=$=?hA(hylC_{PkZi{?!1d4u<~+V@c5{ literal 0 HcmV?d00001 diff --git a/docker/office/bisheng/index.html b/docker/office/bisheng/index.html new file mode 100644 index 00000000..45c5fecb --- /dev/null +++ b/docker/office/bisheng/index.html @@ -0,0 +1,12 @@ + + + + + + + + + + + + \ No newline at end of file diff --git a/src/backend/bisheng/api/v1/knowledge.py b/src/backend/bisheng/api/v1/knowledge.py index e4bd18b1..706e2f2f 100644 --- a/src/backend/bisheng/api/v1/knowledge.py +++ b/src/backend/bisheng/api/v1/knowledge.py @@ -131,8 +131,9 @@ async def process_knowledge(*, session.add(db_file) session.commit() session.refresh(db_file) - files.append(db_file) - file_paths.append(filepath) + if not repeat: + files.append(db_file) + file_paths.append(filepath) logger.info(f'fileName={file_name} col={collection_name}') result.append(db_file.copy()) @@ -310,6 +311,7 @@ def delete_knowledge_file(*, # minio minio_client.MinioClient().delete_minio(str(knowledge_file.id)) + minio_client.MinioClient().delete_minio(str(knowledge_file.object_name)) # elastic esvectore_client = decide_vectorstores(collection_name, 'ElasticKeywordsSearch', embeddings) if esvectore_client: @@ -365,11 +367,12 @@ async def addEmbedding(collection_name, model: str, chunk_size: int, separator: # 存储 mysql db_file = session.get(KnowledgeFile, knowledge_file.id) setattr(db_file, 'status', 2) - setattr(db_file, 'object_name', knowledge_file.file_name) - session.add(db_file) - session.flush() # 原文件 
object_name_original = f'original/{db_file.id}' + setattr(db_file, 'object_name', object_name_original) + session.add(db_file) + session.flush() + minio_client.MinioClient().upload_minio(object_name_original, path) texts, metadatas = _read_chunk_text(path, knowledge_file.file_name, chunk_size, diff --git a/src/backend/bisheng/cache/utils.py b/src/backend/bisheng/cache/utils.py index bf5e1fa1..4103d576 100644 --- a/src/backend/bisheng/cache/utils.py +++ b/src/backend/bisheng/cache/utils.py @@ -13,7 +13,7 @@ import requests from appdirs import user_cache_dir from bisheng.settings import settings -from bisheng.utils.minio_client import mino_client, tmp_bucket +from bisheng.utils.minio_client import MinioClient, tmp_bucket CACHE: Dict[str, Any] = {} @@ -177,10 +177,11 @@ def save_uploaded_file(file, folder_name, file_name): # Save the file with the hash as its name if settings.get_knowledge().get('minio'): + minio_client = MinioClient() # 存储oss file_byte = file.read() - mino_client.upload_tmp(file_name, file_byte) - file_path = mino_client.get_share_link(file_name, tmp_bucket) + minio_client.upload_tmp(file_name, file_byte) + file_path = minio_client.get_share_link(file_name, tmp_bucket) else: file_type = md5_name.split('.')[-1] file_path = folder_path / f'{md5_name}.{file_type}' diff --git a/src/backend/bisheng/database/models/message.py b/src/backend/bisheng/database/models/message.py index 7d611f34..9f291e6f 100644 --- a/src/backend/bisheng/database/models/message.py +++ b/src/backend/bisheng/database/models/message.py @@ -4,7 +4,7 @@ from bisheng.database.models.base import SQLModelSerializable from pydantic import BaseModel -from sqlalchemy import JSON, Column, DateTime, Text, text +from sqlalchemy import JSON, Column, DateTime, String, Text, text from sqlmodel import Field @@ -22,7 +22,7 @@ class MessageBase(SQLModelSerializable): intermediate_steps: Optional[str] = Field(index=False, sa_column=Column(Text), description='过程日志') - files: Optional[str] = 
Field(index=False, description='上传的文件等') + files: Optional[str] = Field(sa_column=Column(String(length=4096)), description='上传的文件等') # file_access: Optional[bool] = Field(index=False, default=True, description='召回文件是否可以访问') create_time: Optional[datetime] = Field( sa_column=Column(DateTime, nullable=False, server_default=text('CURRENT_TIMESTAMP'))) diff --git a/src/backend/bisheng/interface/chains/custom.py b/src/backend/bisheng/interface/chains/custom.py index 2d4b653d..e594a25a 100644 --- a/src/backend/bisheng/interface/chains/custom.py +++ b/src/backend/bisheng/interface/chains/custom.py @@ -106,9 +106,11 @@ def initialize(cls, llm: BaseLanguageModel, chain_type: str, prompt: BasePromptTemplate = None, + document_prompt: BasePromptTemplate = None, token_max: str = -1): if chain_type == 'stuff': - return load_qa_chain(llm=llm, chain_type=chain_type, prompt=prompt, token_max=token_max) + return load_qa_chain(llm=llm, chain_type=chain_type, prompt=prompt, + token_max=token_max, document_prompt=document_prompt) else: return load_qa_chain(llm=llm, chain_type=chain_type) diff --git a/src/backend/bisheng/main.py b/src/backend/bisheng/main.py index a51d98d2..92b0d307 100644 --- a/src/backend/bisheng/main.py +++ b/src/backend/bisheng/main.py @@ -60,7 +60,6 @@ def authjwt_exception_handler(request: Request, exc: AuthJWTException): app.include_router(router_rpc) app.on_event('startup')(create_db_and_tables) app.on_event('startup')(setup_llm_caching) - return app diff --git a/src/backend/bisheng/template/frontend_node/chains.py b/src/backend/bisheng/template/frontend_node/chains.py index 3b56e821..a0b10115 100644 --- a/src/backend/bisheng/template/frontend_node/chains.py +++ b/src/backend/bisheng/template/frontend_node/chains.py @@ -301,7 +301,14 @@ class CombineDocsChainNode(FrontendNode): name='prompt', display_name='prompt', advanced=False, - info='只对Stuff类型生效') + info='只对Stuff类型生效'), + TemplateField( + field_type='BasePromptTemplate', + required=False, + show=True, + 
name='document_prompt', + advanced=False, + ) ], ) diff --git a/src/backend/bisheng/utils/minio_client.py b/src/backend/bisheng/utils/minio_client.py index ab193992..f5a26ad4 100644 --- a/src/backend/bisheng/utils/minio_client.py +++ b/src/backend/bisheng/utils/minio_client.py @@ -95,6 +95,3 @@ def mkdir(self, bucket: str): if self.minio_client: if not self.minio_client.bucket_exists(bucket): self.minio_client.make_bucket(bucket) - - -mino_client = MinioClient() diff --git a/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py b/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py index 9a80897a..a973b8aa 100644 --- a/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py +++ b/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py @@ -195,6 +195,8 @@ def _completion_with_retry(**kwargs: Any) -> Any: 'functions': kwargs.get('functions', []) } response = self.client.post(self.elemai_base_url, json=params) + if response.status_code != 200: + raise return response.json() return _completion_with_retry(**kwargs) diff --git a/src/bisheng-langchain/bisheng_langchain/vectorstores/milvus.py b/src/bisheng-langchain/bisheng_langchain/vectorstores/milvus.py new file mode 100644 index 00000000..e69de29b From 1a26c286676bd868e2fe89e35446393b57749a7f Mon Sep 17 00:00:00 2001 From: yaojin Date: Wed, 6 Dec 2023 19:24:50 +0800 Subject: [PATCH 11/23] bug fix --- src/backend/bisheng/api/v1/knowledge.py | 12 +++++++----- src/backend/bisheng/api/v1/qa.py | 7 ++++--- src/backend/bisheng/utils/docx_temp.py | 4 ++-- 3 files changed, 13 insertions(+), 10 deletions(-) diff --git a/src/backend/bisheng/api/v1/knowledge.py b/src/backend/bisheng/api/v1/knowledge.py index 706e2f2f..48018f59 100644 --- a/src/backend/bisheng/api/v1/knowledge.py +++ b/src/backend/bisheng/api/v1/knowledge.py @@ -17,8 +17,8 @@ from bisheng.interface.importing.utils import import_vectorstore from bisheng.interface.initialize.loading import instantiate_vectorstore from 
bisheng.settings import settings -from bisheng.utils import minio_client from bisheng.utils.logger import logger +from bisheng.utils.minio_client import MinioClient from bisheng_langchain.document_loaders.elem_unstrcutured_loader import ElemUnstructuredLoader from bisheng_langchain.embeddings.host_embedding import HostEmbeddings from bisheng_langchain.text_splitter import ElemCharacterTextSplitter @@ -310,8 +310,9 @@ def delete_knowledge_file(*, logger.info(f'act=delete_vector file_id={file_id} res={res}') # minio - minio_client.MinioClient().delete_minio(str(knowledge_file.id)) - minio_client.MinioClient().delete_minio(str(knowledge_file.object_name)) + minio_client = MinioClient() + minio_client.delete_minio(str(knowledge_file.id)) + minio_client.delete_minio(str(knowledge_file.object_name)) # elastic esvectore_client = decide_vectorstores(collection_name, 'ElasticKeywordsSearch', embeddings) if esvectore_client: @@ -360,6 +361,7 @@ async def addEmbedding(collection_name, model: str, chunk_size: int, separator: except Exception as e: logger.exception(e) + minio_client = MinioClient() for index, path in enumerate(file_paths): knowledge_file = knowledge_files[index] session = next(get_session()) @@ -373,13 +375,13 @@ async def addEmbedding(collection_name, model: str, chunk_size: int, separator: session.add(db_file) session.flush() - minio_client.MinioClient().upload_minio(object_name_original, path) + minio_client.upload_minio(object_name_original, path) texts, metadatas = _read_chunk_text(path, knowledge_file.file_name, chunk_size, chunk_overlap, separator) # 溯源必须依赖minio, 后期替换更通用的oss - minio_client.MinioClient().upload_minio(str(db_file.id), path) + minio_client.upload_minio(str(db_file.id), path) logger.info(f'chunk_split file_name={knowledge_file.file_name} size={len(texts)}') [metadata.update({'file_id': knowledge_file.id}) for metadata in metadatas] diff --git a/src/backend/bisheng/api/v1/qa.py b/src/backend/bisheng/api/v1/qa.py index 88b3a018..d8a37c5b 
100644 --- a/src/backend/bisheng/api/v1/qa.py +++ b/src/backend/bisheng/api/v1/qa.py @@ -4,7 +4,7 @@ from bisheng.database.base import get_session from bisheng.database.models.knowledge_file import KnowledgeFile from bisheng.database.models.recall_chunk import RecallChunk -from bisheng.utils import minio_client +from bisheng.utils.minio_client import MinioClient from fastapi import APIRouter, Depends from sqlmodel import Session, select @@ -39,11 +39,12 @@ def get_original_file(*, message_id: int, keys: str, session: Session = Depends( # keywords keywords = keys.split(';') if keys else [] result = [] + minio_client = MinioClient() for index, chunk in enumerate(chunks): file = id2file.get(chunk.file_id) chunk_res = json.loads(json.loads(chunk.meta_data).get('bbox')) - chunk_res['source_url'] = minio_client.MinioClient().get_share_link(str(chunk.file_id)) - chunk_res['original_url'] = minio_client.MinioClient().get_share_link( + chunk_res['source_url'] = minio_client.get_share_link(str(chunk.file_id)) + chunk_res['original_url'] = minio_client.get_share_link( file.object_name if file.object_name else str(file.id)) chunk_res['score'] = round(match_score(chunk.chunk, keywords), 2) if len(keywords) > 0 else 0 diff --git a/src/backend/bisheng/utils/docx_temp.py b/src/backend/bisheng/utils/docx_temp.py index 060a62d7..c9c8f043 100644 --- a/src/backend/bisheng/utils/docx_temp.py +++ b/src/backend/bisheng/utils/docx_temp.py @@ -168,7 +168,7 @@ from urllib.parse import unquote, urlparse import requests -from bisheng.utils.minio_client import mino_client +from bisheng.utils.minio_client import MinioClient from bisheng.utils.util import _is_valid_url from docx import Document @@ -302,6 +302,6 @@ def test_replace_string(template_file, kv_dict: dict, file_name: str): temp_dir = tempfile.TemporaryDirectory() temp_file = Path(temp_dir.name) / file_name output.save(temp_file) - mino_client.upload_minio(file_name, temp_file) + MinioClient().upload_minio(file_name, temp_file) return 
file_name From 33cd1bd2e48091dde10879b7f54d91ab33077d98 Mon Sep 17 00:00:00 2001 From: yaojin Date: Wed, 6 Dec 2023 21:56:37 +0800 Subject: [PATCH 12/23] bug fix --- src/backend/bisheng/chat/handlers.py | 8 ++++---- src/backend/bisheng/interface/initialize/loading.py | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/backend/bisheng/chat/handlers.py b/src/backend/bisheng/chat/handlers.py index f3702e20..931f9a35 100644 --- a/src/backend/bisheng/chat/handlers.py +++ b/src/backend/bisheng/chat/handlers.py @@ -10,7 +10,7 @@ from bisheng.database.models.report import Report from bisheng.utils.docx_temp import test_replace_string from bisheng.utils.logger import logger -from bisheng.utils.minio_client import mino_client +from bisheng.utils.minio_client import MinioClient from bisheng.utils.util import get_cache_key from bisheng_langchain.chains.autogen.auto_gen import AutoGenChain from langchain.docstore.document import Document @@ -73,12 +73,12 @@ async def process_report(self, session: ChatManager, if not template: logger.error('template not support') return - - template_muban = mino_client.get_share_link(template.object_name) + minio_client = MinioClient() + template_muban = minio_client.get_share_link(template.object_name) report_name = langchain_object.report_name report_name = report_name if report_name.endswith('.docx') else f'{report_name}.docx' test_replace_string(template_muban, result, report_name) - file = mino_client.get_share_link(report_name) + file = minio_client.get_share_link(report_name) response = ChatResponse(type='end', files=[{'file_url': file, 'file_name': report_name}], user_id=user_id) diff --git a/src/backend/bisheng/interface/initialize/loading.py b/src/backend/bisheng/interface/initialize/loading.py index d0ce7863..ca7c8cf6 100644 --- a/src/backend/bisheng/interface/initialize/loading.py +++ b/src/backend/bisheng/interface/initialize/loading.py @@ -379,8 +379,8 @@ def instantiate_prompt(node_type, class_object, 
params: Dict): # Add the handle_keys to the list format_kwargs['handle_keys'].append(input_variable) - from langchain.chains.router.llm_router import RouterOutputParser - prompt.output_parser = RouterOutputParser() + # from langchain.chains.router.llm_router import RouterOutputParser + # prompt.output_parser = RouterOutputParser() return prompt, format_kwargs From 0ad336ff6b2437b7b61d0eb8e61f16ed2c5a916e Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Thu, 7 Dec 2023 16:37:57 +0800 Subject: [PATCH 13/23] update --- .../experimental/contract/contract_extract.py | 108 +++++++++++++---- .../experimental/contract/ellm_extract.py | 14 ++- .../experimental/contract/llm_extract.py | 114 +++++++++++++----- .../experimental/contract/run_web.py | 85 +++++++++++++ 4 files changed, 267 insertions(+), 54 deletions(-) create mode 100644 src/bisheng-langchain/experimental/contract/run_web.py diff --git a/src/bisheng-langchain/experimental/contract/contract_extract.py b/src/bisheng-langchain/experimental/contract/contract_extract.py index bf54bef5..177646b5 100644 --- a/src/bisheng-langchain/experimental/contract/contract_extract.py +++ b/src/bisheng-langchain/experimental/contract/contract_extract.py @@ -4,6 +4,7 @@ from tqdm import tqdm from ellm_extract import EllmExtract from llm_extract import LlmExtract +from collections import defaultdict logging.getLogger().setLevel(logging.INFO) @@ -14,52 +15,113 @@ def __init__(self, llm_model_name: str = 'Qwen-14B-Chat', llm_model_api_url: str = 'https://bisheng.dataelem.com/api/v1/models/{}/infer', unstructured_api_url: str = "https://bisheng.dataelem.com/api/v1/etl4llm/predict", + do_ellm: bool = True, + do_llm: bool = True, replace_ellm_cache: bool = False, - replace_llm_cache: bool = True, + replace_llm_cache: bool = False, + ensemble_method: str = 'llm_first' ): self.ellm_client = EllmExtract(api_base_url=ellm_api_base_url) self.llm_client = LlmExtract(model_name=llm_model_name, model_api_url=llm_model_api_url, 
unstructured_api_url=unstructured_api_url) + self.do_ellm = do_ellm + self.do_llm = do_llm + self.ensemble_method = ensemble_method self.replace_ellm_cache = replace_ellm_cache self.replace_llm_cache = replace_llm_cache - def predict_one_pdf(self, pdf_path, schema, save_folder): + def predict_one_pdf(self, pdf_path, schema, save_folder=''): pdf_name_prefix = os.path.splitext(os.path.basename(pdf_path))[0] - save_ellm_path = os.path.join(save_folder, pdf_name_prefix + '_ellm.json') - save_llm_path = os.path.join(save_folder, pdf_name_prefix + '_llm.json') - if self.replace_ellm_cache or not os.path.exists(save_ellm_path): - ellm_kv_results = self.ellm_client.predict(pdf_path, schema) - with open(save_ellm_path, 'w') as f: - json.dump(ellm_kv_results, f, indent=2, ensure_ascii=False) + if self.do_ellm: + if save_folder: + save_ellm_path = os.path.join(save_folder, pdf_name_prefix + '_ellm.json') + if self.replace_ellm_cache or not os.path.exists(save_ellm_path): + ellm_kv_results = self.ellm_client.predict(pdf_path, schema) + with open(save_ellm_path, 'w') as f: + json.dump(ellm_kv_results, f, indent=2, ensure_ascii=False) + else: + # get results from cache + with open(save_ellm_path, 'r') as f: + ellm_kv_results = json.load(f) + else: + ellm_kv_results = self.ellm_client.predict(pdf_path, schema) else: - with open(save_ellm_path, 'r') as f: - ellm_kv_results = json.load(f) + ellm_kv_results = {} - if self.replace_llm_cache or not os.path.exists(save_llm_path): - llm_kv_results = self.llm_client.predict(pdf_path, schema) - with open(save_llm_path, 'w') as f: - json.dump(llm_kv_results, f, indent=2, ensure_ascii=False) + if self.do_llm: + if save_folder: + save_llm_path = os.path.join(save_folder, pdf_name_prefix + '_llm.json') + if self.replace_llm_cache or not os.path.exists(save_llm_path): + llm_kv_results = self.llm_client.predict(pdf_path, schema) + with open(save_llm_path, 'w') as f: + json.dump(llm_kv_results, f, indent=2, ensure_ascii=False) + else: + # get 
results from cache + with open(save_llm_path, 'r') as f: + llm_kv_results = json.load(f) + else: + llm_kv_results = self.llm_client.predict(pdf_path, schema) else: - with open(save_llm_path, 'r') as f: - llm_kv_results = json.load(f) + llm_kv_results = {} - return ellm_kv_results, llm_kv_results + final_kv_results = self.ensemble(ellm_kv_results, llm_kv_results) + if save_folder: + save_final_path = os.path.join(save_folder, pdf_name_prefix + '_ensemble.json') + with open(save_final_path, 'w') as f: + json.dump(final_kv_results, f, indent=2, ensure_ascii=False) + + return ellm_kv_results, llm_kv_results, final_kv_results + + def ensemble(self, ellm_kv_results, llm_kv_results): + """ + 1. 如果当前字段llm有结果,以llm为准,丢掉ellm的提取结果 + 2. 如果ellm还有剩余字段,归到最终结果中 + """ + final_kv_results = defaultdict(list) + if self.ensemble_method == 'llm_first': + for key in llm_kv_results: + final_kv_results[key] = llm_kv_results[key] + if key in ellm_kv_results: + ellm_kv_results.pop(key) + for key in ellm_kv_results: + final_kv_results[key] = ellm_kv_results[key] + elif self.ensemble_method == 'ellm_first': + for key in ellm_kv_results: + final_kv_results[key] = ellm_kv_results[key] + if key in llm_kv_results: + llm_kv_results.pop(key) + for key in llm_kv_results: + final_kv_results[key] = llm_kv_results[key] + + return final_kv_results def predict_all_pdf(self, pdf_folder, schema, save_folder): if not os.path.exists(save_folder): os.makedirs(save_folder) pdf_names = os.listdir(pdf_folder) + # invalid_pdf_names = ['供货合同_W26.pdf', + # '框架协议_主供货协议_J28.pdf', + # '保密条款_L1.pdf', + # '价格协议_J3.pdf'] for pdf_name in tqdm(pdf_names): + # if pdf_name in invalid_pdf_names: + # continue + logging.info(f'process pdf: {pdf_name}') pdf_path = os.path.join(pdf_folder, pdf_name) - ellm_kv_results, llm_kv_results = self.predict_one_pdf(pdf_path, schema, save_folder) + ellm_kv_results, llm_kv_results, final_kv_results = self.predict_one_pdf( + pdf_path, schema, save_folder) if __name__ == '__main__': - client = 
ContractExtract() - pdf_folder = '/home/gulixin/workspace/datasets/huatai/重大商务合同(汇总)_pdf' - # schema = '合同标题|合同编号|借款人|贷款人|借款金额' - schema = '买方|卖方|合同期限|结算条款|售后条款|金额总金额' - save_folder = '/home/gulixin/workspace/datasets/huatai/重大商务合同(汇总)_pdf_res' + # llm_model_name = 'Qwen-14B-Chat' + llm_model_name = 'Qwen-72B-Chat-Int4' + client = ContractExtract(llm_model_name=llm_model_name) + schema = '合同标题|借款合同编号|担保合同编号|借款人|贷款人|借款金额' + pdf_folder = '/home/gulixin/workspace/datasets/huatai/流动资金借款合同_pdf' + save_folder = '/home/gulixin/workspace/datasets/huatai/流动资金借款合同_pdf_qwen72B_res' + # schema = '买方|卖方|合同期限|结算条款|售后条款|合同总金额' + # pdf_folder = '/home/gulixin/workspace/datasets/huatai/重大商务合同(汇总)_pdf' + # save_folder = '/home/gulixin/workspace/datasets/huatai/重大商务合同(汇总)_pdf_qwen72B_res' client.predict_all_pdf(pdf_folder, schema, save_folder) diff --git a/src/bisheng-langchain/experimental/contract/ellm_extract.py b/src/bisheng-langchain/experimental/contract/ellm_extract.py index bd33b1a9..96898878 100644 --- a/src/bisheng-langchain/experimental/contract/ellm_extract.py +++ b/src/bisheng-langchain/experimental/contract/ellm_extract.py @@ -5,11 +5,15 @@ import fitz import numpy as np import cv2 +import logging from collections import defaultdict from PIL import Image from typing import Any, Iterator, List, Mapping, Optional, Union +logging.getLogger().setLevel(logging.INFO) + + def convert_base64(image): image_binary = cv2.imencode('.jpg', image)[1].tobytes() x = base64.b64encode(image_binary) @@ -84,6 +88,7 @@ def predict(self, pdf_path, schema): """ pdf """ + logging.info('ellm extract phase1: ellm extract') pdf_images = transpdf2png(pdf_path) kv_results = defaultdict(list) for pdf_name in pdf_images: @@ -99,9 +104,14 @@ def predict(self, pdf_path, schema): raise ValueError(f"ellm kv extract failed: {resp}") for key, value in key_values.items(): - text_info = [{'value': text, 'page': int(page)} for text in value['text']] - kv_results[key].extend(text_info) + # text_info = 
[{'value': text, 'page': int(page)} for text in value['text']] + # kv_results[key].extend(text_info) + + for text in value['text']: + if text not in kv_results[key]: + kv_results[key].append(text) + logging.info(f'ellm kv results: {kv_results}') return kv_results diff --git a/src/bisheng-langchain/experimental/contract/llm_extract.py b/src/bisheng-langchain/experimental/contract/llm_extract.py index 140e3e56..de9a1d96 100644 --- a/src/bisheng-langchain/experimental/contract/llm_extract.py +++ b/src/bisheng-langchain/experimental/contract/llm_extract.py @@ -18,13 +18,58 @@ input_variables=["context", "keywords"], template="""现在你需要帮我完成信息抽取的任务,你需要帮我抽取出原文中相关字段信息,如果没找到对应的值,则设为空,并按照JSON的格式输出. -原文内容: +原文: {context} -提取上述文本中以下字段信息:{keywords},并按照json的格式输出,如果没找到对应的值,则设为空。 +问题:提取上述文本中以下字段信息:{keywords},并按照json的格式输出,如果没找到对应的值,则设为空。 +回答: """ ) +# DEFAULT_PROMPT = PromptTemplate( +# input_variables=["context", "keywords"], +# template="""现在你需要帮我完成信息抽取的任务,你需要帮我抽取出原文中相关字段信息,如果没找到对应的值,则设为空,并按照JSON的格式输出。 + +# Examples: +# 原文:'| 买卖合同 | | 日期 2021.01.01-2022.12.31 | |\n| --- | --- | --- | --- |\n| 客户编号 55652246 | | | 目的地国家 China |\n| 联系人 chen xu | 电话 862138623097 | | 传真 |\n| 买方 浙江峻和科技股份有限公司 余姚市远东工业城CE-11 浙江省余姚市 315400 联系人 : 电话 : 传真 | | 卖方 杜邦贸易 (上海) 有限公司 DuPont Trading (Shanghai) Co., Ltd. 
中国《上海》自由贸易试验区港澳路239号一幢楼5层5 27室 Room 527, Floor 5, Building 1, No, 239, Gangao Road .China (Shanghai) Pilot F ree Trade Zone Shanghai 200131.PRC |\n| 付款条件 MET 30 DAYS EOM |\n| | 运输方式 | | 销售条款 CPT YUYAO CITY |\n| 1、买方以采购订单的形式。列明需求传送给卖方。 2、卖方负责提供合理报价、依买方采购订单内容生产交货。 3、本合同所附的"买卖条件"为本合同一个明确的组成部分。 4、若本合同任何其他规定与下列附加条件有冲突、则以附加条件为准。 |\n| 卖方银行账 汇丰银行 开户账号 SWIFT代码 | |\n| 代表实力 浙江岭和科技股份有 (组) For and on behalf (seal) Zhejiang Junke Trehaning 签署 by: 姓名 Now Of Title: 日期 Pate: | 代表表示 杜斯汀 (图) For and configure of SELLER: (seal) Durent 监测制 乌鲁木齐市 日期 late: |' +# 问题: 提取上述文本中以下字段信息:{schema},并按照json的格式输出,如果没找到对应的值,则设为空。 +# 回答:```json\n{{\n "买方": "浙江峻和科技股份有限公司",\n "卖方": "杜邦贸易 (上海) 有限公司",\n "合同期限": "2021.01.01-2022.12.31",\n "结算条款": "MET 30 DAYS EOM",\n "售后条款": "本合同所附的\'买卖条件\'为本合同一个明确的组成部分。",\n "金额总金额": ""\n}}\n``` + +# ---------------------------------- + +# 原文:{context} +# 问题: 提取上述文本中以下字段信息:{schema},并按照json的格式输出,如果没找到对应的值,则设为空。 +# 回答: +# """ +# ) + + +def parse_json(json_string: str) -> dict: + match = re.search(r"```(json)?(.*)```", json_string, re.DOTALL) + if match is None: + json_str = json_string + else: + json_str = match.group(2) + + json_str = json_str.strip() + json_str = json_str.replace('```', '') + + match = re.search(r"{.*}", json_str, re.DOTALL) + if match is None: + json_str = json_str + else: + json_str = match.group(0) + + if json_str.endswith('}\n}'): + json_str = json_str[:-2] + if json_str.startswith('{\n{'): + json_str = json_str.replace('{\n{', '{', 1) + + logging.info(f'llm response after parse: {json_str}') + extract_res = json.loads(json_str) + + return extract_res + class LlmExtract(object): def __init__(self, @@ -50,20 +95,27 @@ def call_llm(self, prompt_info, max_tokens=8192): } payload = copy.copy(input_template) payload['model'] = self.model_name - response = requests.post(url=self.model_api_url, json=payload).json() - assert response['status_code'] == 200, response - choices = response.get('choices', []) - assert choices, response + try: + raw_response = 
requests.post(url=self.model_api_url, json=payload) + response = raw_response.json() + assert response['status_code'] == 200, response + except Exception as e: + # llm request error; raw_response is unbound if the request itself failed + logging.error(f'llm predict fail: {str(e)}') + if 'raw_response' in locals(): + logging.error(f'raw_response: {raw_response.text}') + return {}, 0 + choices = response.get('choices', []) + logging.info(f'llm response: {response}') json_string = choices[0]['message']['content'] - match = re.search(r"```(json)?(.*)```", json_string, re.DOTALL) - if match is None: - json_str = json_string - else: - json_str = match.group(2) - json_str = json_str.strip() - extract_res = json.loads(json_str) - return extract_res + try: + extract_res = parse_json(json_string) + except Exception as e: + # json parse error + logging.error(f'json parse fail: {str(e)}') + extract_res = {} + + return extract_res, len(json_string) def parse_pdf(self, file_path, @@ -77,16 +129,18 @@ def parse_pdf(self, unstructured_api_url=self.unstructured_api_url) docs = loader.load() pdf_content = ''.join([doc.page_content for doc in docs]) - logging.info(f'pdf content len: {len(pdf_content)}') text_splitter = ElemCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separators=separators) split_docs = text_splitter.split_documents(docs) - logging.info(f'docs num: {len(docs)}, split_docs num: {len(split_docs)}') + logging.info(f'pdf content len: {len(pdf_content)}, docs num: {len(docs)}, split_docs num: {len(split_docs)}') return split_docs, docs def post_extract_res(self, split_docs_extract, split_docs_content, schema): + """ + combine results and remove duplicates + """ kv_results = defaultdict(list) for ext_res, content in zip(split_docs_extract, split_docs_content): # 每一个split_doc的提取结果 @@ -94,13 +148,8 @@ def post_extract_res(self, split_docs_extract, split_docs_content, schema): # 去掉非法key和没有内容的key if (key not in schema) or (not value): continue - - if key not in kv_results: - kv_results[key].append(value) - else: - # 去重 - 
if value in kv_results[key]: - continue + # delete duplicate + if value not in kv_results[key]: kv_results[key].append(value) return kv_results @@ -109,23 +158,31 @@ def predict(self, pdf_path, schema): logging.info('llm extract phase1: pdf parsing') schema = schema.split('|') keywords = '、'.join(schema) - split_docs, docs = self.parse_pdf(pdf_path) + try: + split_docs, docs = self.parse_pdf(pdf_path) + except Exception as e: + # pdf parse error + logging.error(f'pdf parse fail: {str(e)}') + return {} logging.info('llm extract phase2: llm extract') split_docs_extract = [] split_docs_content = [] + avg_generate_num = 0 for each_doc in split_docs: pdf_content = each_doc.page_content prompt_info = DEFAULT_PROMPT.format(context=pdf_content, keywords=keywords) - extract_res = self.call_llm(prompt_info) + start_time = time.time() + extract_res, generate_num = self.call_llm(prompt_info) + llm_time = time.time() - start_time + avg_generate_num += generate_num / llm_time split_docs_extract.append(extract_res) split_docs_content.append(pdf_content) - - logging.info(f'split_docs_extract: {split_docs_extract}') + avg_generate_num = avg_generate_num / len(split_docs) logging.info('llm extract phase3: post extract result') kv_results = self.post_extract_res(split_docs_extract, split_docs_content, schema) - logging.info(f'final kv results: {kv_results}') + logging.info(f'llm kv results: {kv_results}, avg generate char num: {avg_generate_num}') return kv_results @@ -135,4 +192,3 @@ def predict(self, pdf_path, schema): pdf_path = '/home/gulixin/workspace/datasets/huatai/流动资金借款合同_pdf/流动资金借款合同1.pdf' schema = '合同标题|合同编号|借款人|贷款人|借款金额' llm_client.predict(pdf_path, schema) - diff --git a/src/bisheng-langchain/experimental/contract/run_web.py b/src/bisheng-langchain/experimental/contract/run_web.py new file mode 100644 index 00000000..e2098d3e --- /dev/null +++ b/src/bisheng-langchain/experimental/contract/run_web.py @@ -0,0 +1,85 @@ +# flake8: noqa: E501 +import json +import os +import 
requests +import gradio as gr +import time +import tempfile +from contract_extract import ContractExtract + +tmpdir = './tmp/extract_files' +if not os.path.exists(tmpdir): + os.makedirs(tmpdir) + + +ellm_client = ContractExtract(do_ellm=True, do_llm=False, llm_model_name='Qwen-72B-Chat-Int4') +llm_client = ContractExtract(do_ellm=False, do_llm=True, llm_model_name='Qwen-72B-Chat-Int4') +ensemble_llm_first_client = ContractExtract(do_ellm=True, do_llm=True, + ensemble_method='llm_first', llm_model_name='Qwen-72B-Chat-Int4') +ensemble_ellm_first_client = ContractExtract(do_ellm=True, do_llm=True, + ensemble_method='ellm_first', llm_model_name='Qwen-72B-Chat-Int4') + + +def ellm_run(pdf_path, schema): + pdf_path = pdf_path.name + ellm_kv_results, llm_kv_results, final_kv_results = ellm_client.predict_one_pdf(pdf_path, schema) + final_kv_results = json.dumps(final_kv_results, ensure_ascii=False, indent=2) + return final_kv_results + + +def llm_run(pdf_path, schema): + pdf_path = pdf_path.name + ellm_kv_results, llm_kv_results, final_kv_results = llm_client.predict_one_pdf(pdf_path, schema) + final_kv_results = json.dumps(final_kv_results, ensure_ascii=False, indent=2) + return final_kv_results + + +def ensemble_llm_first_run(pdf_path, schema): + pdf_path = pdf_path.name + ellm_kv_results, llm_kv_results, final_kv_results = ensemble_llm_first_client.predict_one_pdf(pdf_path, schema) + final_kv_results = json.dumps(final_kv_results, ensure_ascii=False, indent=2) + return final_kv_results + + +def ensemble_ellm_first_run(pdf_path, schema): + pdf_path = pdf_path.name + ellm_kv_results, llm_kv_results, final_kv_results = ensemble_ellm_first_client.predict_one_pdf(pdf_path, schema) + final_kv_results = json.dumps(final_kv_results, ensure_ascii=False, indent=2) + return final_kv_results + + +with tempfile.TemporaryDirectory(dir='./tmp/extract_files') as tmpdir: + with gr.Blocks(css='#margin-top {margin-top: 15px} #center {text-align: center;} #description {text-align: 
center}') as demo: + with gr.Row(elem_id='center'): + gr.Markdown('# Bisheng IE Demo') + + with gr.Row(elem_id='description'): + gr.Markdown("""Information extraction for anything.""") + + with gr.Row(): + input_file = gr.components.File(label='FlowFile') + schema = gr.Textbox(label='抽取字段', value='买方|卖方|合同期限|结算条款|售后条款|合同总金额', interactive=True, lines=2) + + with gr.Row(): + with gr.Column(): + ellm_kv_results = gr.Textbox(label='ELLM抽取结果', value='', interactive=True, lines=1) + btn0 = gr.Button('Run ELLM') + btn0.click(fn=ellm_run, inputs=[input_file, schema], outputs=ellm_kv_results) + + with gr.Column(): + llm_kv_results = gr.Textbox(label='LLM抽取结果', value='', interactive=True, lines=1) + btn1 = gr.Button('Run LLM') + btn1.click(fn=llm_run, inputs=[input_file, schema], outputs=llm_kv_results) + + with gr.Column(): + ensemble1_kv_results = gr.Textbox(label='ensemble1抽取结果', value='', interactive=True, lines=1) + btn2 = gr.Button('Run ensemble1') + btn2.click(fn=ensemble_llm_first_run, inputs=[input_file, schema], outputs=ensemble1_kv_results) + + # with gr.Row(): + # ensemble2_kv_results = gr.Textbox(label='ensemble2抽取结果', value='', interactive=True, lines=1) + # btn3 = gr.Button('Run ensemble2') + # btn3.click(fn=ensemble_ellm_first_run, inputs=[input_file, schema], outputs=ensemble2_kv_results) + + demo.launch(server_name='192.168.106.12', server_port=9118, share=True) + From 538702b43f7b7084728455473340a6fbf3958a3a Mon Sep 17 00:00:00 2001 From: yaojin Date: Thu, 7 Dec 2023 17:00:39 +0800 Subject: [PATCH 14/23] stable 0.2.0 --- src/backend/bisheng/api/v1/knowledge.py | 4 ++-- src/backend/bisheng/cache/utils.py | 5 ++--- src/backend/bisheng/chat/manager.py | 2 +- src/backend/bisheng/initdb_config.yaml | 22 +++++++++---------- .../bisheng/interface/chains/custom.py | 8 +++++-- .../bisheng/interface/initialize/loading.py | 16 ++++++++++++-- .../chat_models/proxy_llm.py | 8 ++++--- 7 files changed, 41 insertions(+), 24 deletions(-) diff --git 
a/src/backend/bisheng/api/v1/knowledge.py b/src/backend/bisheng/api/v1/knowledge.py index 48018f59..4272ba6a 100644 --- a/src/backend/bisheng/api/v1/knowledge.py +++ b/src/backend/bisheng/api/v1/knowledge.py @@ -57,7 +57,7 @@ async def upload_file(*, file: UploadFile = File(...)): # 缓存本地 file_path = save_uploaded_file(file.file, 'bisheng', file_name) if not isinstance(file_path, str): - file_path = str(file_path) + '_' + file_name + file_path = str(file_path) return UploadFileResponse(file_path=file_path) except Exception as exc: logger.error(f'Error saving file: {exc}') @@ -116,7 +116,7 @@ async def process_knowledge(*, result = [] for path in file_path: filepath, file_name = file_download(path) - md5_ = filepath.rsplit('/', 1)[1].split('.')[0] + md5_ = filepath.rsplit('/', 1)[1].split('.')[0].split('_')[0] # 是否包含重复文件 repeat = session.exec(select(KnowledgeFile) .where(KnowledgeFile.md5 == md5_, diff --git a/src/backend/bisheng/cache/utils.py b/src/backend/bisheng/cache/utils.py index 4103d576..a472df79 100644 --- a/src/backend/bisheng/cache/utils.py +++ b/src/backend/bisheng/cache/utils.py @@ -183,8 +183,7 @@ def save_uploaded_file(file, folder_name, file_name): minio_client.upload_tmp(file_name, file_byte) file_path = minio_client.get_share_link(file_name, tmp_bucket) else: - file_type = md5_name.split('.')[-1] - file_path = folder_path / f'{md5_name}.{file_type}' + file_path = folder_path / f'{md5_name}_{file_name}' with open(file_path, 'wb') as new_file: while chunk := file.read(8192): new_file.write(chunk) @@ -248,7 +247,7 @@ def file_download(file_path: str): return file_path, filename elif not os.path.isfile(file_path): raise ValueError('File path %s is not a valid file or url' % file_path) - return file_path, '' + return file_path, file_path.split('_', 1)[1] if '_' in file_path else '' def _is_valid_url(url: str): diff --git a/src/backend/bisheng/chat/manager.py b/src/backend/bisheng/chat/manager.py index d7686ca6..2d5faa7d 100644 --- 
a/src/backend/bisheng/chat/manager.py +++ b/src/backend/bisheng/chat/manager.py @@ -290,7 +290,7 @@ def refresh_graph_data(self, graph_data: dict, node_data: List[dict]): if url_path.netloc: file_name = unquote(url_path.path.split('/')[-1]) else: - file_path, file_name = file_path.split('_', 1) + file_name = file_path.split('_', 1)[1] if '_' in file_path else '' nd['value'] = file_name tweak[nd.get('id')] = {'file_path': file_path, 'value': file_name} elif 'VariableNode' in nd.get('id'): diff --git a/src/backend/bisheng/initdb_config.yaml b/src/backend/bisheng/initdb_config.yaml index 7ce72bfa..1bd68601 100644 --- a/src/backend/bisheng/initdb_config.yaml +++ b/src/backend/bisheng/initdb_config.yaml @@ -11,18 +11,18 @@ knowledges: # 知识库相关配置 vectorstores: # Milvus 最低要求cpu 4C 8G 推荐4C 16G Milvus: # 如果需要切换其他vectordb,确保其他服务已经启动,然后配置对应参数 - connection_args: {'host': 'milvus', 'port': '19530', 'user': '', 'password': '', 'secure': False} + connection_args: {'host': '110.16.193.170', 'port': '50032', 'user': '', 'password': '', 'secure': False} # 可选配置,有些类型的场景使用ES可以提高召回效果 - ElasticKeywordsSearch: - elasticsearch_url: 'http://elasticsearch:9200' - ssl_verify: "{'basic_auth': ('elastic', 'password')}" - minio: # 如果要支持溯源功能,由于溯源会展示源文件,必须配置 oss 存储 - SCHEMA: true - CERT_CHECK: false - MINIO_ENDPOINT: "milvus:9001" - MINIO_SHAREPOIN: "milvus:9001" - MINIO_ACCESS_KEY: "minioadmin" - MINIO_SECRET_KEY: "minioadmin" + # ElasticKeywordsSearch: + # elasticsearch_url: 'http://elasticsearch:9200' + # ssl_verify: "{'basic_auth': ('elastic', 'password')}" + # minio: # 如果要支持溯源功能,由于溯源会展示源文件,必须配置 oss 存储 + # SCHEMA: false # 是否支持 https + # CERT_CHECK: false # 是否校验 http证书 + # MINIO_ENDPOINT: "milvus:9001" # 这个地址用来写请求 + # MINIO_SHAREPOIN: "milvus:9001" # 为保证外网和内网隔离。 浏览器获取连接是这个域名 + # MINIO_ACCESS_KEY: "minioadmin" + # MINIO_SECRET_KEY: "minioadmin" # # 全局配置大模型 diff --git a/src/backend/bisheng/interface/chains/custom.py b/src/backend/bisheng/interface/chains/custom.py index e594a25a..48b8c528 100644 
--- a/src/backend/bisheng/interface/chains/custom.py +++ b/src/backend/bisheng/interface/chains/custom.py @@ -109,8 +109,12 @@ def initialize(cls, document_prompt: BasePromptTemplate = None, token_max: str = -1): if chain_type == 'stuff': - return load_qa_chain(llm=llm, chain_type=chain_type, prompt=prompt, - token_max=token_max, document_prompt=document_prompt) + if document_prompt: + return load_qa_chain(llm=llm, chain_type=chain_type, prompt=prompt, + token_max=token_max, document_prompt=document_prompt) + else: + return load_qa_chain(llm=llm, chain_type=chain_type, prompt=prompt, + token_max=token_max) else: return load_qa_chain(llm=llm, chain_type=chain_type) diff --git a/src/backend/bisheng/interface/initialize/loading.py b/src/backend/bisheng/interface/initialize/loading.py index ca7c8cf6..a087adc5 100644 --- a/src/backend/bisheng/interface/initialize/loading.py +++ b/src/backend/bisheng/interface/initialize/loading.py @@ -76,7 +76,7 @@ def instantiate_based_on_type(class_object, base_type, node_type, params, param_ if base_type == 'agents': return instantiate_agent(node_type, class_object, params) elif base_type == 'prompts': - return instantiate_prompt(node_type, class_object, params) + return instantiate_prompt(node_type, class_object, params, param_id_dict) elif base_type == 'tools': tool = instantiate_tool(node_type, class_object, params) if hasattr(tool, 'name') and isinstance(tool, BaseTool): @@ -320,7 +320,7 @@ def instantiate_agent(node_type, class_object: Type[agent_module.Agent], params: return load_agent_executor(class_object, params) -def instantiate_prompt(node_type, class_object, params: Dict): +def instantiate_prompt(node_type, class_object, params: Dict, param_id_dict: Dict): if node_type == 'ZeroShotPrompt': if 'tools' not in params: @@ -339,6 +339,10 @@ def instantiate_prompt(node_type, class_object, params: Dict): else: prompt = class_object(**params) + no_human_input = set(param_id_dict.keys()) + human_input = 
set(prompt.input_variables).difference(no_human_input) + order_input = list(human_input) + list(set(prompt.input_variables) & no_human_input) + prompt.input_variables = order_input format_kwargs: Dict[str, Any] = {} for input_variable in prompt.input_variables: if input_variable in params: @@ -348,6 +352,14 @@ def instantiate_prompt(node_type, class_object, params: Dict): elif isinstance(variable, BaseOutputParser) and hasattr(variable, 'get_format_instructions'): format_kwargs[input_variable] = variable.get_format_instructions() + elif isinstance(variable, dict): + # variable node + if len(variable) == 0: + format_kwargs[input_variable] = '' + continue + elif len(variable) != 1: + raise ValueError(f'VariableNode contains multi-key {variable.keys()}') + format_kwargs[input_variable] = list(variable.values())[0] elif isinstance(variable, List) and all( isinstance(item, Document) for item in variable): # Format document to contain page_content and metadata diff --git a/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py b/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py index a973b8aa..795e82e4 100644 --- a/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py +++ b/src/bisheng-langchain/bisheng_langchain/chat_models/proxy_llm.py @@ -195,11 +195,13 @@ def _completion_with_retry(**kwargs: Any) -> Any: 'functions': kwargs.get('functions', []) } response = self.client.post(self.elemai_base_url, json=params) - if response.status_code != 200: - raise return response.json() - return _completion_with_retry(**kwargs) + rsp_dict = _completion_with_retry(**kwargs) + if 200 != rsp_dict.get('status_code'): + logger.error(f'proxy_llm_error resp={rsp_dict}') + raise Exception(rsp_dict) + return rsp_dict def _combine_llm_outputs(self, llm_outputs: List[Optional[dict]]) -> dict: overall_token_usage: dict = {} From e1bb2658f0677a6f1f82a3abc1eebe350db6fff3 Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Thu, 7 Dec 2023 17:21:48 +0800 Subject: 
[PATCH 15/23] update --- .../experimental/contract/llm_extract.py | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/src/bisheng-langchain/experimental/contract/llm_extract.py b/src/bisheng-langchain/experimental/contract/llm_extract.py index de9a1d96..1aaeb6e1 100644 --- a/src/bisheng-langchain/experimental/contract/llm_extract.py +++ b/src/bisheng-langchain/experimental/contract/llm_extract.py @@ -31,14 +31,18 @@ # template="""现在你需要帮我完成信息抽取的任务,你需要帮我抽取出原文中相关字段信息,如果没找到对应的值,则设为空,并按照JSON的格式输出。 # Examples: -# 原文:'| 买卖合同 | | 日期 2021.01.01-2022.12.31 | |\n| --- | --- | --- | --- |\n| 客户编号 55652246 | | | 目的地国家 China |\n| 联系人 chen xu | 电话 862138623097 | | 传真 |\n| 买方 浙江峻和科技股份有限公司 余姚市远东工业城CE-11 浙江省余姚市 315400 联系人 : 电话 : 传真 | | 卖方 杜邦贸易 (上海) 有限公司 DuPont Trading (Shanghai) Co., Ltd. 中国《上海》自由贸易试验区港澳路239号一幢楼5层5 27室 Room 527, Floor 5, Building 1, No, 239, Gangao Road .China (Shanghai) Pilot F ree Trade Zone Shanghai 200131.PRC |\n| 付款条件 MET 30 DAYS EOM |\n| | 运输方式 | | 销售条款 CPT YUYAO CITY |\n| 1、买方以采购订单的形式。列明需求传送给卖方。 2、卖方负责提供合理报价、依买方采购订单内容生产交货。 3、本合同所附的"买卖条件"为本合同一个明确的组成部分。 4、若本合同任何其他规定与下列附加条件有冲突、则以附加条件为准。 |\n| 卖方银行账 汇丰银行 开户账号 SWIFT代码 | |\n| 代表实力 浙江岭和科技股份有 (组) For and on behalf (seal) Zhejiang Junke Trehaning 签署 by: 姓名 Now Of Title: 日期 Pate: | 代表表示 杜斯汀 (图) For and configure of SELLER: (seal) Durent 监测制 乌鲁木齐市 日期 late: |' -# 问题: 提取上述文本中以下字段信息:{schema},并按照json的格式输出,如果没找到对应的值,则设为空。 +# 原文: +# | 买卖合同 | | 日期 2021.01.01-2022.12.31 | |\n| --- | --- | --- | --- |\n| 客户编号 55652246 | | | 目的地国家 China |\n| 联系人 chen xu | 电话 862138623097 | | 传真 |\n| 买方 浙江峻和科技股份有限公司 余姚市远东工业城CE-11 浙江省余姚市 315400 联系人 : 电话 : 传真 | | 卖方 杜邦贸易 (上海) 有限公司 DuPont Trading (Shanghai) Co., Ltd. 
中国《上海》自由贸易试验区港澳路239号一幢楼5层5 27室 Room 527, Floor 5, Building 1, No, 239, Gangao Road .China (Shanghai) Pilot F ree Trade Zone Shanghai 200131.PRC |\n| 付款条件 MET 30 DAYS EOM |\n| | 运输方式 | | 销售条款 CPT YUYAO CITY |\n| 1、买方以采购订单的形式。列明需求传送给卖方。 2、卖方负责提供合理报价、依买方采购订单内容生产交货。 3、本合同所附的"买卖条件"为本合同一个明确的组成部分。 4、若本合同任何其他规定与下列附加条件有冲突、则以附加条件为准。 |\n| 卖方银行账 汇丰银行 开户账号 SWIFT代码 | |\n| 代表实力 浙江岭和科技股份有 (组) For and on behalf (seal) Zhejiang Junke Trehaning 签署 by: 姓名 Now Of Title: 日期 Pate: | 代表表示 杜斯汀 (图) For and configure of SELLER: (seal) Durent 监测制 乌鲁木齐市 日期 late: | + +# 问题: 提取上述文本中以下字段信息:{keywords},并按照json的格式输出,如果没找到对应的值,则设为空。 # 回答:```json\n{{\n "买方": "浙江峻和科技股份有限公司",\n "卖方": "杜邦贸易 (上海) 有限公司",\n "合同期限": "2021.01.01-2022.12.31",\n "结算条款": "MET 30 DAYS EOM",\n "售后条款": "本合同所附的\'买卖条件\'为本合同一个明确的组成部分。",\n "金额总金额": ""\n}}\n``` # ---------------------------------- -# 原文:{context} -# 问题: 提取上述文本中以下字段信息:{schema},并按照json的格式输出,如果没找到对应的值,则设为空。 +# 原文: +# {context} + +# 问题: 提取上述文本中以下字段信息:{keywords},并按照json的格式输出,如果没找到对应的值,则设为空。 # 回答: # """ # ) From a01e06a3e9942d57d5989b8cf62ddf6b29cbaab6 Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Thu, 7 Dec 2023 21:13:23 +0800 Subject: [PATCH 16/23] support img --- .../experimental/contract/ellm_extract.py | 53 ++++++++++++++----- 1 file changed, 39 insertions(+), 14 deletions(-) diff --git a/src/bisheng-langchain/experimental/contract/ellm_extract.py b/src/bisheng-langchain/experimental/contract/ellm_extract.py index 96898878..2b7d5b18 100644 --- a/src/bisheng-langchain/experimental/contract/ellm_extract.py +++ b/src/bisheng-langchain/experimental/contract/ellm_extract.py @@ -6,6 +6,7 @@ import numpy as np import cv2 import logging +import filetype from collections import defaultdict from PIL import Image from typing import Any, Iterator, List, Mapping, Optional, Union @@ -43,7 +44,7 @@ def transpdf2png(pdf_file): class EllmExtract(object): - def __init__(self, api_base_url: Optional[str] = None): + def __init__(self, api_base_url: str = 
'http://192.168.106.20:3502/v2/idp/idp_app/infer'): self.ep = api_base_url self.client = requests.Session() self.timeout = 10000 @@ -84,17 +85,23 @@ def predict_single_img(self, inp): except Exception as e: return {'status_code': 400, 'status_message': str(e)} - def predict(self, pdf_path, schema): + def predict(self, file_path, schema): """ pdf """ logging.info('ellm extract phase1: ellm extract') - pdf_images = transpdf2png(pdf_path) - kv_results = defaultdict(list) - for pdf_name in pdf_images: - page = int(pdf_name.split('page_')[-1]) - - b64data = convert_base64(pdf_images[pdf_name]) + mime_type = filetype.guess(file_path).mime + if mime_type.endswith('pdf'): + file_type = 'pdf' + elif mime_type.startswith('image'): + file_type = 'img' + else: + raise ValueError(f"file type {mime_type} is not supported.") + + if file_type == 'img': + kv_results = defaultdict(list) + bytes_data = open(file_path, 'rb').read() + b64data = base64.b64encode(bytes_data).decode() payload = {'b64_image': b64data, 'keys': schema} resp = self.predict_single_img(payload) @@ -104,12 +111,30 @@ def predict(self, pdf_path, schema): if 'code' in resp and resp['code'] == 200: key_values = resp['result']['ellm_result'] else: raise ValueError(f"ellm kv extract failed: {resp}") for key, value in key_values.items(): - # text_info = [{'value': text, 'page': int(page)} for text in value['text']] - # kv_results[key].extend(text_info) - - for text in value['text']: - if text not in kv_results[key]: - kv_results[key].append(text) + kv_results[key] = value['text'] + + elif file_type == 'pdf': + pdf_images = transpdf2png(file_path) + kv_results = defaultdict(list) + for pdf_name in pdf_images: + page = int(pdf_name.split('page_')[-1]) + + b64data = convert_base64(pdf_images[pdf_name]) + payload = {'b64_image': b64data, 'keys': schema} + resp = self.predict_single_img(payload) + + if 'code' in resp and resp['code'] == 200: + key_values = resp['result']['ellm_result'] + else: + raise ValueError(f"ellm kv extract failed: {resp}") + + for key, value in key_values.items(): + # text_info = [{'value': 
text, 'page': int(page)} for text in value['text']] + # kv_results[key].extend(text_info) + + for text in value['text']: + if text not in kv_results[key]: + kv_results[key].append(text) logging.info(f'ellm kv results: {kv_results}') return kv_results From 3ab0f763c254c79dddf55f0f770797b1d3f8f31c Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Fri, 8 Dec 2023 12:46:53 +0800 Subject: [PATCH 17/23] update prompt --- src/bisheng-langchain/experimental/contract/llm_extract.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/bisheng-langchain/experimental/contract/llm_extract.py b/src/bisheng-langchain/experimental/contract/llm_extract.py index 1aaeb6e1..136cbc76 100644 --- a/src/bisheng-langchain/experimental/contract/llm_extract.py +++ b/src/bisheng-langchain/experimental/contract/llm_extract.py @@ -16,7 +16,7 @@ DEFAULT_PROMPT = PromptTemplate( input_variables=["context", "keywords"], - template="""现在你需要帮我完成信息抽取的任务,你需要帮我抽取出原文中相关字段信息,如果没找到对应的值,则设为空,并按照JSON的格式输出. + template="""现在你需要帮我完成信息抽取的任务,你需要帮我抽取出原文中相关字段信息,如果没找到对应的值,则设为空,并按照JSON的格式输出。请保证输出的JSON格式正确。 原文: {context} From a7ec608e58fdbd6d7fdb1438fa62b315a45d163f Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Mon, 11 Dec 2023 13:41:12 +0800 Subject: [PATCH 18/23] support doc ie --- .../convert_image_pdf.py | 0 .../document_extract.py} | 63 +++++++++---------- .../{contract => document_ie}/ellm_extract.py | 8 +-- .../{contract => document_ie}/llm_extract.py | 46 ++++++++++---- .../{contract => document_ie}/run_web.py | 25 ++++---- 5 files changed, 80 insertions(+), 62 deletions(-) rename src/bisheng-langchain/experimental/{contract => document_ie}/convert_image_pdf.py (100%) rename src/bisheng-langchain/experimental/{contract/contract_extract.py => document_ie/document_extract.py} (91%) rename src/bisheng-langchain/experimental/{contract => document_ie}/ellm_extract.py (96%) rename src/bisheng-langchain/experimental/{contract => document_ie}/llm_extract.py (84%) rename 
src/bisheng-langchain/experimental/{contract => document_ie}/run_web.py (79%) diff --git a/src/bisheng-langchain/experimental/contract/convert_image_pdf.py b/src/bisheng-langchain/experimental/document_ie/convert_image_pdf.py similarity index 100% rename from src/bisheng-langchain/experimental/contract/convert_image_pdf.py rename to src/bisheng-langchain/experimental/document_ie/convert_image_pdf.py diff --git a/src/bisheng-langchain/experimental/contract/contract_extract.py b/src/bisheng-langchain/experimental/document_ie/document_extract.py similarity index 91% rename from src/bisheng-langchain/experimental/contract/contract_extract.py rename to src/bisheng-langchain/experimental/document_ie/document_extract.py index 177646b5..d0d7c2b5 100644 --- a/src/bisheng-langchain/experimental/contract/contract_extract.py +++ b/src/bisheng-langchain/experimental/document_ie/document_extract.py @@ -1,15 +1,15 @@ import os import json -import logging from tqdm import tqdm from ellm_extract import EllmExtract from llm_extract import LlmExtract from collections import defaultdict +from llm_extract import init_logger -logging.getLogger().setLevel(logging.INFO) +logger = init_logger(__name__) -class ContractExtract(object): +class DocumentExtract(object): def __init__(self, ellm_api_base_url: str = 'http://192.168.106.20:3502/v2/idp/idp_app/infer', llm_model_name: str = 'Qwen-14B-Chat', @@ -31,6 +31,30 @@ def __init__(self, self.replace_ellm_cache = replace_ellm_cache self.replace_llm_cache = replace_llm_cache + def ensemble(self, ellm_kv_results, llm_kv_results): + """ + 1. 如果当前字段llm有结果,以llm为准,丢掉ellm的提取结果 + 2. 
如果ellm还有剩余字段,归到最终结果中 + """ + final_kv_results = defaultdict(list) + if self.ensemble_method == 'llm_first': + for key in llm_kv_results: + final_kv_results[key] = llm_kv_results[key] + if key in ellm_kv_results: + ellm_kv_results.pop(key) + for key in ellm_kv_results: + final_kv_results[key] = ellm_kv_results[key] + elif self.ensemble_method == 'ellm_first': + for key in ellm_kv_results: + final_kv_results[key] = ellm_kv_results[key] + if key in llm_kv_results: + llm_kv_results.pop(key) + for key in llm_kv_results: + final_kv_results[key] = llm_kv_results[key] + + logger.info(f'ensemble final kv results: {final_kv_results}') + return final_kv_results + def predict_one_pdf(self, pdf_path, schema, save_folder=''): pdf_name_prefix = os.path.splitext(os.path.basename(pdf_path))[0] @@ -74,41 +98,12 @@ def predict_one_pdf(self, pdf_path, schema, save_folder=''): return ellm_kv_results, llm_kv_results, final_kv_results - def ensemble(self, ellm_kv_results, llm_kv_results): - """ - 1. 如果当前字段llm有结果,以llm为准,丢掉ellm的提取结果 - 2. 
如果ellm还有剩余字段,归到最终结果中 - """ - final_kv_results = defaultdict(list) - if self.ensemble_method == 'llm_first': - for key in llm_kv_results: - final_kv_results[key] = llm_kv_results[key] - if key in ellm_kv_results: - ellm_kv_results.pop(key) - for key in ellm_kv_results: - final_kv_results[key] = ellm_kv_results[key] - elif self.ensemble_method == 'ellm_first': - for key in ellm_kv_results: - final_kv_results[key] = ellm_kv_results[key] - if key in llm_kv_results: - llm_kv_results.pop(key) - for key in llm_kv_results: - final_kv_results[key] = llm_kv_results[key] - - return final_kv_results - def predict_all_pdf(self, pdf_folder, schema, save_folder): if not os.path.exists(save_folder): os.makedirs(save_folder) pdf_names = os.listdir(pdf_folder) - # invalid_pdf_names = ['供货合同_W26.pdf', - # '框架协议_主供货协议_J28.pdf', - # '保密条款_L1.pdf', - # '价格协议_J3.pdf'] for pdf_name in tqdm(pdf_names): - # if pdf_name in invalid_pdf_names: - # continue - logging.info(f'process pdf: {pdf_name}') + logger.info(f'process pdf: {pdf_name}') pdf_path = os.path.join(pdf_folder, pdf_name) ellm_kv_results, llm_kv_results, final_kv_results = self.predict_one_pdf( pdf_path, schema, save_folder) @@ -117,7 +112,7 @@ def predict_all_pdf(self, pdf_folder, schema, save_folder): if __name__ == '__main__': # llm_model_name = 'Qwen-14B-Chat' llm_model_name = 'Qwen-72B-Chat-Int4' - client = ContractExtract(llm_model_name=llm_model_name) + client = DocumentExtract(llm_model_name=llm_model_name) schema = '合同标题|借款合同编号|担保合同编号|借款人|贷款人|借款金额' pdf_folder = '/home/gulixin/workspace/datasets/huatai/流动资金借款合同_pdf' save_folder = '/home/gulixin/workspace/datasets/huatai/流动资金借款合同_pdf_qwen72B_res' diff --git a/src/bisheng-langchain/experimental/contract/ellm_extract.py b/src/bisheng-langchain/experimental/document_ie/ellm_extract.py similarity index 96% rename from src/bisheng-langchain/experimental/contract/ellm_extract.py rename to src/bisheng-langchain/experimental/document_ie/ellm_extract.py index 2b7d5b18..3347bdb0 
100644 --- a/src/bisheng-langchain/experimental/contract/ellm_extract.py +++ b/src/bisheng-langchain/experimental/document_ie/ellm_extract.py @@ -5,14 +5,14 @@ import fitz import numpy as np import cv2 -import logging import filetype from collections import defaultdict from PIL import Image from typing import Any, Iterator, List, Mapping, Optional, Union +from llm_extract import init_logger -logging.getLogger().setLevel(logging.INFO) +logger = init_logger(__name__) def convert_base64(image): @@ -89,7 +89,7 @@ def predict(self, file_path, schema): """ pdf """ - logging.info('ellm extract phase1: ellm extract') + logger.info('ellm extract phase1: ellm extract') mime_type = filetype.guess(file_path).mime if mime_type.endswith('pdf'): file_type = 'pdf' @@ -136,7 +136,7 @@ def predict(self, file_path, schema): if text not in kv_results[key]: kv_results[key].append(text) - logging.info(f'ellm kv results: {kv_results}') + logger.info(f'ellm kv results: {kv_results}') return kv_results diff --git a/src/bisheng-langchain/experimental/contract/llm_extract.py b/src/bisheng-langchain/experimental/document_ie/llm_extract.py similarity index 84% rename from src/bisheng-langchain/experimental/contract/llm_extract.py rename to src/bisheng-langchain/experimental/document_ie/llm_extract.py index 136cbc76..4ce79a6a 100644 --- a/src/bisheng-langchain/experimental/contract/llm_extract.py +++ b/src/bisheng-langchain/experimental/document_ie/llm_extract.py @@ -4,6 +4,7 @@ import json import time import logging +import colorlog import re from collections import defaultdict from langchain.prompts import PromptTemplate @@ -11,7 +12,28 @@ from bisheng_langchain.text_splitter import ElemCharacterTextSplitter -logging.getLogger().setLevel(logging.INFO) +def init_logger(name): + logger = logging.getLogger(name) + logger.setLevel(logging.DEBUG) + if not logger.handlers: + stream_handler = logging.StreamHandler() + stream_handler.setLevel(logging.DEBUG) + fmt_string = 
'%(log_color)s[%(asctime)s][%(name)s][%(levelname)s]%(message)s' + # black red green yellow blue purple cyan and white + log_colors = { + 'DEBUG': 'cyan', + 'INFO': 'green', + 'WARNING': 'yellow', + 'ERROR': 'red', + 'CRITICAL': 'purple' + } + fmt = colorlog.ColoredFormatter(fmt_string, log_colors=log_colors) + stream_handler.setFormatter(fmt) + logger.addHandler(stream_handler) + return logger + + +logger = init_logger(__name__) DEFAULT_PROMPT = PromptTemplate( @@ -69,7 +91,7 @@ def parse_json(json_string: str) -> dict: if json_str.startswith('{\n{'): json_str = json_str.replace('{\n{', '{', 1) - logging.info(f'llm response after parse: {json_str}') + logger.info(f'llm response after parse: {json_str}') extract_res = json.loads(json_str) return extract_res @@ -105,18 +127,18 @@ def call_llm(self, prompt_info, max_tokens=8192): assert response['status_code'] == 200, response except Exception as e: # llm request error - logging.error(f'llm predict fail: {str(e)}') - logging.error(f'raw_response: {raw_response.text}') + logger.error(f'llm predict fail: {str(e)}') + logger.error(f'raw_response: {raw_response.text}') return {}, len(raw_response.text) choices = response.get('choices', []) - logging.info(f'llm response: {response}') + logger.info(f'llm response: {response}') json_string = choices[0]['message']['content'] try: extract_res = parse_json(json_string) except Exception as e: # json parse error - logging.error(f'json parse fail: {str(e)}') + logger.error(f'json parse fail: {str(e)}') extract_res = {} return extract_res, len(json_string) @@ -138,7 +160,7 @@ def parse_pdf(self, chunk_overlap=chunk_overlap, separators=separators) split_docs = text_splitter.split_documents(docs) - logging.info(f'pdf content len: {len(pdf_content)}, docs num: {len(docs)}, split_docs num: {len(split_docs)}') + logger.info(f'pdf content len: {len(pdf_content)}, docs num: {len(docs)}, split_docs num: {len(split_docs)}') return split_docs, docs def post_extract_res(self, 
split_docs_extract, split_docs_content, schema): @@ -159,17 +181,17 @@ def post_extract_res(self, split_docs_extract, split_docs_content, schema): return kv_results def predict(self, pdf_path, schema): - logging.info('llm extract phase1: pdf parsing') + logger.info('llm extract phase1: pdf parsing') schema = schema.split('|') keywords = '、'.join(schema) try: split_docs, docs = self.parse_pdf(pdf_path) except Exception as e: # pdf parse error - logging.error(f'pdf parse fail: {str(e)}') + logger.error(f'pdf parse fail: {str(e)}') return {} - logging.info('llm extract phase2: llm extract') + logger.info('llm extract phase2: llm extract') split_docs_extract = [] split_docs_content = [] avg_generate_num = 0 @@ -184,9 +206,9 @@ def predict(self, pdf_path, schema): split_docs_content.append(pdf_content) avg_generate_num = avg_generate_num / len(split_docs) - logging.info('llm extract phase3: post extract result') + logger.info('llm extract phase3: post extract result') kv_results = self.post_extract_res(split_docs_extract, split_docs_content, schema) - logging.info(f'llm kv results: {kv_results}, avg generate char num: {avg_generate_num}') + logger.info(f'llm kv results: {kv_results}, avg generate char num: {avg_generate_num} char/s') return kv_results diff --git a/src/bisheng-langchain/experimental/contract/run_web.py b/src/bisheng-langchain/experimental/document_ie/run_web.py similarity index 79% rename from src/bisheng-langchain/experimental/contract/run_web.py rename to src/bisheng-langchain/experimental/document_ie/run_web.py index e2098d3e..cf469614 100644 --- a/src/bisheng-langchain/experimental/contract/run_web.py +++ b/src/bisheng-langchain/experimental/document_ie/run_web.py @@ -5,18 +5,18 @@ import gradio as gr import time import tempfile -from contract_extract import ContractExtract +from document_extract import DocumentExtract tmpdir = './tmp/extract_files' if not os.path.exists(tmpdir): os.makedirs(tmpdir) -ellm_client = ContractExtract(do_ellm=True, 
do_llm=False, llm_model_name='Qwen-72B-Chat-Int4') -llm_client = ContractExtract(do_ellm=False, do_llm=True, llm_model_name='Qwen-72B-Chat-Int4') -ensemble_llm_first_client = ContractExtract(do_ellm=True, do_llm=True, +ellm_client = DocumentExtract(do_ellm=True, do_llm=False, llm_model_name='Qwen-72B-Chat-Int4') +llm_client = DocumentExtract(do_ellm=False, do_llm=True, llm_model_name='Qwen-72B-Chat-Int4') +ensemble_llm_first_client = DocumentExtract(do_ellm=True, do_llm=True, ensemble_method='llm_first', llm_model_name='Qwen-72B-Chat-Int4') -ensemble_ellm_first_client = ContractExtract(do_ellm=True, do_llm=True, +ensemble_ellm_first_client = DocumentExtract(do_ellm=True, do_llm=True, ensemble_method='ellm_first', llm_model_name='Qwen-72B-Chat-Int4') @@ -71,15 +71,16 @@ def ensemble_ellm_first_run(pdf_path, schema): btn1 = gr.Button('Run LLM') btn1.click(fn=llm_run, inputs=[intput_file, schema], outputs=llm_kv_results) + with gr.Row(): with gr.Column(): - ensemble1_kv_results = gr.Textbox(label='ensemble1抽取结果', value='', interactive=True, lines=1) - btn2 = gr.Button('Run ensemble1') - btn2.click(fn=ensemble_llm_first_run, inputs=[intput_file, schema], outputs=ensemble1_kv_results) + ensemble2_kv_results = gr.Textbox(label='ensemble_ellm_first抽取结果', value='', interactive=True, lines=1) + btn3 = gr.Button('Run ensemble_ellm_first') + btn3.click(fn=ensemble_ellm_first_run, inputs=[intput_file, schema], outputs=ensemble2_kv_results) - # with gr.Row(): - # ensemble2_kv_results = gr.Textbox(label='ensemble2抽取结果', value='', interactive=True, lines=1) - # btn3 = gr.Button('Run ensemble2') - # btn3.click(fn=ensemble_ellm_first_run, inputs=[intput_file, schema], outputs=ensemble2_kv_results) + with gr.Column(): + ensemble1_kv_results = gr.Textbox(label='ensemble_llm_first抽取结果', value='', interactive=True, lines=1) + btn2 = gr.Button('Run ensemble_llm_first') + btn2.click(fn=ensemble_llm_first_run, inputs=[intput_file, schema], outputs=ensemble1_kv_results) 
demo.launch(server_name='192.168.106.12', server_port=9118, share=True) From b7bd157343ad20f180ccda595cd19b245b399c20 Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Mon, 11 Dec 2023 15:16:52 +0800 Subject: [PATCH 19/23] update requirement --- src/bisheng-langchain/requirements.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/bisheng-langchain/requirements.txt b/src/bisheng-langchain/requirements.txt index 81a1a2b3..ffa201de 100644 --- a/src/bisheng-langchain/requirements.txt +++ b/src/bisheng-langchain/requirements.txt @@ -4,4 +4,5 @@ websocket-client elasticsearch opencv-python==4.5.5.64 Pillow==9.5.0 -bisheng-pyautogen \ No newline at end of file +bisheng-pyautogen +colorlog \ No newline at end of file From 8748345e8ad2e87b583810009291da96457e27f4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=A7=9A=E5=8A=B2?= Date: Mon, 11 Dec 2023 17:47:22 +0800 Subject: [PATCH 20/23] Update pyproject.toml --- src/backend/pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/pyproject.toml b/src/backend/pyproject.toml index 04a39d96..d76b10c9 100644 --- a/src/backend/pyproject.toml +++ b/src/backend/pyproject.toml @@ -18,7 +18,7 @@ include = ["./bisheng/*", "bisheng/**/*"] bisheng = "bisheng.__main__:main" [tool.poetry.dependencies] -bisheng_langchain = "v0.1.8" +bisheng_langchain = "0.2.0rc1" bisheng_pyautogen = "0.1.18" minio = "^7.2.0" fastapi_jwt_auth = "^0.5.0" From 12400cd2a6f9a2cd3b00dc8fe27431d4138c1c2b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=A7=9A=E5=8A=B2?= Date: Mon, 11 Dec 2023 17:48:56 +0800 Subject: [PATCH 21/23] Update Dockerfile --- src/backend/Dockerfile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/Dockerfile b/src/backend/Dockerfile index 2c02774a..1ac69f6e 100644 --- a/src/backend/Dockerfile +++ b/src/backend/Dockerfile @@ -11,7 +11,7 @@ RUN curl -sSL https://install.python-poetry.org | python3 - # # Add Poetry to PATH ENV 
PATH="${PATH}:/root/.local/bin" # # Copy the pyproject.toml and poetry.lock files -COPY poetry.lock pyproject.toml ./ +# COPY poetry.lock pyproject.toml ./ # Copy the rest of the application codes COPY ./ ./ @@ -22,4 +22,4 @@ RUN python -m pip install --upgrade pip && \ RUN poetry config virtualenvs.create false RUN poetry install --no-interaction --no-ansi --without dev -CMD ["uvicorn", "bisheng.main:app", "--workers", "2", "--host", "0.0.0.0", "--port", "7860"] \ No newline at end of file +CMD ["uvicorn", "bisheng.main:app", "--workers", "2", "--host", "0.0.0.0", "--port", "7860"] From 1c25e994252f62034232c8de948da0de0a2ac1db Mon Sep 17 00:00:00 2001 From: gulixin0922 Date: Tue, 12 Dec 2023 20:55:10 +0800 Subject: [PATCH 22/23] update --- .../experimental/document_ie/document_extract.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/bisheng-langchain/experimental/document_ie/document_extract.py b/src/bisheng-langchain/experimental/document_ie/document_extract.py index d0d7c2b5..22f69c4b 100644 --- a/src/bisheng-langchain/experimental/document_ie/document_extract.py +++ b/src/bisheng-langchain/experimental/document_ie/document_extract.py @@ -13,7 +13,7 @@ class DocumentExtract(object): def __init__(self, ellm_api_base_url: str = 'http://192.168.106.20:3502/v2/idp/idp_app/infer', llm_model_name: str = 'Qwen-14B-Chat', - llm_model_api_url: str = 'https://bisheng.dataelem.com/api/v1/models/{}/infer', + llm_model_api_url: str = 'http://192.168.106.20:7001/v2.1/models/{}/infer', unstructured_api_url: str = "https://bisheng.dataelem.com/api/v1/etl4llm/predict", do_ellm: bool = True, do_llm: bool = True, From 5ab8cc8ede8531cb17641beb992af69630746219 Mon Sep 17 00:00:00 2001 From: BDS <18110720022@fudan.edu.cn> Date: Wed, 13 Dec 2023 17:02:37 +0800 Subject: [PATCH 23/23] APICHAINaddPostPut --- .../APICHAINaddPostPut/JSon_Post_Put.py | 53 +++++++++++ .../NewBeeAPIchain_Post_Put.py | 95 +++++++++++++++++++ .../APICHAINaddPostPut/template.py | 38 
++++++++ 3 files changed, 186 insertions(+) create mode 100644 src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/JSon_Post_Put.py create mode 100644 src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/NewBeeAPIchain_Post_Put.py create mode 100644 src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/template.py diff --git a/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/JSon_Post_Put.py b/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/JSon_Post_Put.py new file mode 100644 index 00000000..39e6132f --- /dev/null +++ b/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/JSon_Post_Put.py @@ -0,0 +1,53 @@ +""" +2023/12/8,说明: +APIdocs是指引大模型进行API调用的文档。 +对于GET方法,应该包含的内容有:"API URL":要访问的URL地址;URL_example:one-shot提示,一个URL的例子;所有key的说明, +以及tool_name:这个工具是干嘛的,还有"HTTP METHOD"。 +对于POST等方法,应当包含"API URL",JSON——example,以及参数说明(是否必须,参数含义等)。以及tool_name:这个工具是干嘛的。"HTTP METHOD" +以上内容写的时候尽量清晰,这样大模型判读才不会有误,自动化成功率会很高。 +大模型会根据question以及apidocs生成合适的内容进行api请求 +""" + +# 实在智能创建任务 +tool1 = { + "HTTP METHOD": "POST", + "API URL": "https://z-commander-api.ai-indeed.com/openAPI/v1/job", + 'JSON_example': ''' +{ "jobName":"测试任务",//任务名称 +"processDetailUUID":"e8a2f88a2a470c2b5f9b9140e1d2225a",//创建/获取到的流程版本的UUID +"executeTimes":1,//执行的次数。不能超过 30 次。 +"executeType":2,//执行方式, 2-立即执行 9-定时执行 +"cronExpression":"* * * * * ?",//时间表达式,执行方式选择为定时执行时必填,内容为Cron 表达式 +"inputParam":{ //机器人任务入参,json 格式,任务执行时使用。 "param1":"value1", "param2":"value2"}, +"priority": 1, // 任务优先级 1-高,2-中,3-低 +"distributionType": 2, // 分配类型:1-自动分配,2-指定 bot 机器人;分配类型为 2(指定 bot 机器人)时,botList 属性不能为空 +"botList": [ { "botUUID": "fVbcpvj1jG0Qoak5nI1CUUBBYabCb5mX", // 机器人 botUUID + "priority": 1 // 优先级 1-高,2-中,3-低 }, +{ "botUUID": "pYRA8fvWWgRmhZcGj4GNuZNj5lWi7c9n", // 机器人 botUUID +"priority": 1 // 优先级 1-高,2-中,3-低 }]}''', + + 'api_docs': '''jobName String 是 任务的名称(最长三十个字符)processDetailUUID String 是 创建/获取到的流程版本 UUIDexecuteTimes Integer 否 +执行的次数。不能超过 30 次。executeType Integer 是 执行方式, 2-立即执行 
9-定时执行cronExpression String 否时间表达式,执行方式选择为定时执行时必填,内容为 Cron +表达式inputParam Object 否 机器人任务入参,json 格式,任务执行时使用。priority Integer 否 任务优先级 1-高,2-中,3-低,默认 2-中distributionType Integer 否分配类型:1-自动分配,2-指定 bot 机器人;分配类型为 2(指定 bot 机器人)时,botList 属性不能为空,默认 1-自动分配botList List 否bot 机器人列表,当分配类型为 2-指定 bot 机器人时,需要传递此参数''', + 'tool_name': "任务创建" +} +#任务处理 +tool2 = { + "接口说明": "该接口是为了对已经创建的任务进行操作所提供的接口 操作类型:1-立即/再次执行任务 2-停止任务 3-强制停止任务(bot 触发的手动触发类型的任务) 4-删除任务", + "URL": "https://z-commander-api.ai-indeed.com/openAPI/v1/job/{jobUUID}/{operation}", + "HTTP method": "PUT", + "Content-Type": "application/json", + 'JSON_example': '''http://commander-manager.dev.ii-ai.tech/openAPI/v1/job/ea5adaads4123/. Body 请求样例{ + "inputParam":{ //机器人任务入参,json 格式,任务执行时使用。 "param1":"value1", "param2":"value2"}}''', + 'api_docs': '''Body参数说明:inputParam Object 任务入参,仅在任务立即执行/再次执行时,如果传递该参数,那么会先更新任务入参,再执行任务''', + "响应样例": '''{ 'msg': "success", "code": 0,// 0 为成功 "data": true}''', + 'tool_name': "任务处理" +} +#宠物查询 +tool3 = {''' + HTTP METHOD = GET + tool_name = "pet query" + URL_example = "https://api.jisuapi.com/pet/query?appkey=&name=拉布拉多" + key说明 = "name关键字代表宠物名字;appkey代表密钥,一个可用的密钥是4b13addb8994d645。" + ''' + } diff --git a/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/NewBeeAPIchain_Post_Put.py b/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/NewBeeAPIchain_Post_Put.py new file mode 100644 index 00000000..a3cbd384 --- /dev/null +++ b/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/NewBeeAPIchain_Post_Put.py @@ -0,0 +1,95 @@ +""" +2023/12/8,说明: +此代码继承langchain的APIchain实现更高级的APIchain,能够实现get和post等http方法。 +工作原理:见代码注释。 +注意:必须搭配新的提示词模板template。 +<此代码包含隐私内容,包括apikey等信息,供测试> +""" + +import json +import os + +from template import API_REQUEST_PROMPT, API_RESPONSE_PROMPT # change this to the path you placed the templates +from langchain.chains import APIChain +from typing import Any, Dict, Optional +from langchain.prompts import BasePromptTemplate +from langchain.requests import 
TextRequestsWrapper +from langchain.chains.llm import LLMChain +from langchain.chat_models import ChatOpenAI +from langchain.callbacks.manager import CallbackManagerForChainRun +os.environ["OPENAI_API_KEY"] = "sk-XXX" #其中key和value均为string类型 +# 测试用llm +llm = ChatOpenAI(model="gpt-4", temperature=0, openai_api_key=os.getenv("OPENAI_API_KEY")) + + +class NewBeeAPIChain(APIChain): + def _call(self, inputs: Dict[str, Any], + run_manager: Optional[CallbackManagerForChainRun] = None, + ) -> Dict[str, str]: + Callback_Manager = run_manager or CallbackManagerForChainRun.get_noop_manager() + question = inputs[self.question_key] + + # get api_url, request_method and body to call + request_info = self.api_request_chain.predict( + question=question, + api_docs=self.api_docs + ) + api_url, request_method, body = request_info.split('|') + print(f'request info: {request_info}') + + # get the http method with same name, and call api for api response + Callback_Manager.on_text(api_url, color="green", end="\n", verbose=self.verbose) + request_func = getattr(self.requests_wrapper, request_method.lower()) + if request_method == "GET": + api_response = request_func(api_url) + else: + api_response = request_func(api_url, json.loads(body)) + Callback_Manager.on_text(api_response, color="yellow", end="\n", verbose=self.verbose) + + print("api_response:", str(api_response)) + # get the answer to the original question using the API response + answer = self.api_answer_chain.predict( + question=question, + api_docs=self.api_docs, + api_url=api_url, + api_response=api_response, + ) + return {self.output_key: answer} + + @classmethod + def from_llm_and_api_docs( + cls, + llm: ChatOpenAI, + api_docs: str, + headers: Optional[dict] = None, + api_url_prompt: BasePromptTemplate = API_REQUEST_PROMPT, + api_response_prompt: BasePromptTemplate = API_RESPONSE_PROMPT, + **kwargs: Any, + ) -> APIChain: + """Load chain from just an LLM and the api docs.""" + get_request_chain = LLMChain(llm=llm, 
prompt=api_url_prompt) + requests_wrapper = TextRequestsWrapper(headers=headers) + get_answer_chain = LLMChain(llm=llm, prompt=api_response_prompt) + return cls( + api_request_chain=get_request_chain, + api_answer_chain=get_answer_chain, + requests_wrapper=requests_wrapper, + limit_to_domains=None, + api_docs=api_docs, + **kwargs, + ) + + + +if __name__ == "__main__": + from JSon_Post_Put import * + api_docs = str(tool1)+str(tool2)+str(tool3) + headers = {"appKey":"XXX", + "appSecret":"XXX"} + # headers = {} + chain = NewBeeAPIChain.from_llm_and_api_docs(llm=llm, headers=headers, api_docs=api_docs) + # result = chain.run("创建任务DDY测试1,ID是8ef0da86ccd84ed99b78526006ec9bb3,立即执行一次。") #Post + # result = chain.run("任务ID是24826366e3881eec2f74f3755cc1fc77的任务立即执行一次!") #put + result = chain.run("任务ID是24826366e3881eec2f74f3755cc1fc77的任务立即删掉!") #Delete 也是通过Put实现的,实在智能RPA的PUT接口说明包括执行、停止、删除等操作 + # result = chain.run("查询蓝猫信息") #Get + print(result) diff --git a/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/template.py b/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/template.py new file mode 100644 index 00000000..b714ac2c --- /dev/null +++ b/src/bisheng-langchain/bisheng_langchain/APICHAINaddPostPut/template.py @@ -0,0 +1,38 @@ +from langchain.prompts.prompt import PromptTemplate + +API_URL_PROMPT_TEMPLATE = """You are given the below API Documentation: +{api_docs} +Using this documentation, generate the full API url to call for answering the user question. +You should build the API url in order to get a response that is as short as possible, while still getting the necessary information to answer the question. Pay attention to deliberately exclude any unnecessary pieces of data in the API call. +You should extract the request METHOD from doc, and generate the BODY data in JSON format according to the user question if necessary. The BODY data could be empty dict. 
+ +Question:{question} +""" + +API_REQUEST_PROMPT_TEMPLATE = API_URL_PROMPT_TEMPLATE + """Output the API url, METHOD and BODY, join them with `|`. DO NOT GIVE ANY EXPLANATION.""" + +API_REQUEST_PROMPT = PromptTemplate( + input_variables=[ + "api_docs", + "question", + ], + template=API_REQUEST_PROMPT_TEMPLATE, +) + +API_RESPONSE_PROMPT_TEMPLATE = ( + API_URL_PROMPT_TEMPLATE + + """API url: {api_url} + +Here is the response from the API: + +{api_response} + +Summarize this response to answer the original question. + +Summary:""" +) + +API_RESPONSE_PROMPT = PromptTemplate( + input_variables=["api_docs", "question", "api_url", "api_response"], + template=API_RESPONSE_PROMPT_TEMPLATE, +) \ No newline at end of file
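
---

A few notes on the patches above, with small illustrative sketches.

Patch 18 moves the `ensemble` method of `DocumentExtract` above `predict_one_pdf`; its docstring states the merge rule: if the LLM produced a value for a field, that value wins and the ELLM value is dropped, then any fields only the ELLM extractor found are folded into the final result (`ellm_first` is the mirror image). A standalone, non-mutating sketch of that rule follows — the free function form and the non-mutating behavior are choices of this sketch; the real method also logs the result and pops keys from its inputs:

```python
def ensemble(ellm_kv_results, llm_kv_results, ensemble_method='llm_first'):
    """Merge two field-extraction results.

    llm_first: for each field, prefer the LLM value; fields only the
    ELLM extractor found are kept. ellm_first is the mirror image.
    """
    if ensemble_method == 'llm_first':
        primary, secondary = llm_kv_results, ellm_kv_results
    else:
        primary, secondary = ellm_kv_results, llm_kv_results
    final_kv_results = dict(primary)  # primary source wins on conflicts
    for key, value in secondary.items():
        if key not in final_kv_results:  # keep fields only the other source found
            final_kv_results[key] = value
    return final_kv_results
```

Keeping the merge pure makes the two precedence orders easy to unit-test side by side.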
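Patch 18's `llm_extract.parse_json` strips the fenced json code block that the prompt asks the model to emit before calling `json.loads`. A minimal sketch of that fence-stripping step, assuming only that the reply wraps a single JSON object in an optional fenced block — the real helper additionally patches model quirks such as a doubled leading brace:

```python
import json
import re

def parse_json(llm_output: str) -> dict:
    """Extract the JSON object from an LLM reply that may wrap it in a
    fenced code block, then parse it."""
    # `{3} matches the three-backtick fence; the json language tag is optional
    match = re.search(r'`{3}(?:json)?\s*(.*?)`{3}', llm_output, re.DOTALL)
    json_str = match.group(1) if match else llm_output
    return json.loads(json_str.strip())
```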
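Patch 23's `NewBeeAPIChain` prompts the model to emit the request as `api_url|METHOD|body` (per `API_REQUEST_PROMPT_TEMPLATE`) and unpacks it with a split on `|` before dispatching via `getattr` on the requests wrapper. A sketch of that parsing step as a standalone helper — the function name is an invention of this sketch, and `maxsplit=2` is an added guard so a `|` inside the JSON body cannot break the three-way unpack:

```python
import json

def parse_request_info(request_info: str):
    """Split the LLM output 'api_url|METHOD|body' into its three parts,
    decoding the body as JSON (empty body -> empty dict)."""
    # maxsplit=2 keeps any '|' characters inside the JSON body intact
    api_url, request_method, body = request_info.split('|', 2)
    payload = json.loads(body) if body.strip() else {}
    return api_url.strip(), request_method.strip().upper(), payload
```

The chain then looks up the matching `TextRequestsWrapper` method by name and, for non-GET methods, passes the decoded payload as the request body.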