[ISSUE #4047] Support chatGPT source connector #4817

jevinjiang · 2024-04-07T09:57:40Z

Fixes #4047

Motivation

Support chatGPT source connector

Documentation

Does this pull request introduce a new feature? yes

Pil0tXia

Good job!

Subsequent documents can be added in https://github.com/apache/eventmesh-site.

Pil0tXia · 2024-04-07T12:12:50Z

...hatgpt/src/main/java/org/apache/eventmesh/connector/chatgpt/server/ChatGPTConnectServer.java

+        if (serverConfig.isSinkEnable()) {
+            // TODO support sink connector
+        }


Actually we have no idea what kind of scenarios the chatgpt sink connector needs to be used in.

@jevinjiang this review is waiting for your response.

@Pil0tXia @pandaapo
Sorry, in fact, I can't think of a specific application scenario about sink connector too.

The following are possible scenarios:

Expand openai's "function calling" function, which can execute the corresponding function based on the text sent from the upstream , "function calling" is configured in the yaml file, and the specific function is an http request

It can analyze certain files and image streams and send them to the callback url configured by the user according to the user's prompt.

Based on the upstream text data, talk to openai and return to the callback url provided by the user

Does such a scenario exist? Create a view that aggregates ten-day data in a database such as mongodb or mysql, and send it to the sink connector of chatgpt through the respective source connector to generate articles such as delayed reports.

这样的场景是否存在, 在mongodb或者mysql等数据库中建立聚合十日等数据的视图, 由各自的source connector 发送到 chatgpt的sink connector中, 生成研报等文章内容

...ain/java/org/apache/eventmesh/connector/chatgpt/source/connector/ChatGPTSourceConnector.java

eventmesh-connectors/eventmesh-connector-chatgpt/src/main/resources/source-config.yml

Pil0tXia · 2024-04-07T12:22:53Z

plz fix ci~

pandaapo

@jevinjiang
I don't see the need to discard the idea of "sending prompt and getting the result of cloudevents specification from ChatGPT" that you raised in the issue, it's a good idea, and can be added as a feature of the Connector, e.g. configured to either run in the current way, or in the way you mentioned. This would take full advantage of ChatGPT's capabilities, could add flexibility to that Connector, and could save some data conversion overhead. You can implement this feature in a new PR.

我觉得您在issue中提出的“发送prompt，让gpt直接返回cloudevents格式的数据”的想法，没必要直接舍弃，这是个很好的想法，完全可以作为该Connector的一个特性加进来，比如通过配置来决定或采用当前的运行方式、或采用您说的这种方式。这样可以充分利用ChatGPT的能力，可以给该Connector增加灵活性，还可以节省一些数据转换的开销。您可以在新的PR中来实现该功能。

Pil0tXia · 2024-04-07T14:02:49Z

sending prompt and getting the result of cloudevents specification from ChatGPT

I also think it's a good idea, but it's technically unworkable.

This is the required ChatGPT source connector architecture described in #4047 (comment):

If we let ChatGPT source connector to send prompts, then there will be two inputs, and there's no way to customize the content of prompts:

Moreover, the step that "creating new content from source connector to Event Source" goes against the design of the source connector. The source connector should only listen to the Event Source and receive new content.

Secondly, In the case of GPT 3.5, even if the output format is specified using prompts, it is possible for it to output something like the following, which the source connector cannot pass directly without processing:

Of course. Here is the CloudEvent data you required:

{
    ......
}

On the other hand, users can enter prompts even if they are not sent by the source connector. The current design does not cause the ChatGPT source connector to not support the prompts feature.

Any new idea? ☺️

pandaapo · 2024-04-07T14:43:06Z

sending prompt and getting the result of cloudevents specification from ChatGPT

I also think it's a good idea, but it's technically unworkable.

This is the required ChatGPT source connector architecture described in #4047 (comment):

If we let ChatGPT source connector to send prompts, then there will be two inputs, and there's no way to customize the content of prompts:

Moreover, the step that "creating new content from source connector to Event Source" goes against the design of the source connector. The source connector should only listen to the Event Source and receive new content.

Secondly, In the case of GPT 3.5, even if the output format is specified using prompts, it is possible for it to output something like the following, which the source connector cannot pass directly without processing:
Of course. Here is the CloudEvent data you required:

{
    ......
}
On the other hand, users can enter prompts even if they are not sent by the source connector. The current design does not cause the ChatGPT source connector to not support the prompts feature.

Any new idea? ☺️

@Pil0tXia Based on the content you've written, it seems like you may not have carefully reviewed the core functionality implemented in this PR. Otherwise, according to your understanding of #4047 (comment) , you would also have raised similar questions regarding this PR.

I suggest you take another review at this PR. I believe the functionality of "sending prompt and getting the result of cloudevents specification from ChatGPT" can be implemented based on this PR, and it does not contradict the architecture described in #4047 (comment) . As for the case where "the source connector cannot pass directly without processing (the result returned by ChatGPT)", "some data conversion overhead" (not all) can still be saved, and optimizing the result can be achieved through prompts.

Pil0tXia · 2024-04-07T16:46:32Z

I came up with a way that allows users to specify whether or not to use prompts:

Users can indicate in the request parameters of the Web API whether they want to use predefined prompts templates and which template to use.

The ChatGPT source connector needs to check if the returned result from ChatGPT, when prompts templates are enabled by the user, is in the correct JSON format. If not, the entire result will be wrapped in a CloudEvent. However, this checking process may not save much performance overhead compared to deserialization. By not relying on the JSON format generated by ChatGPT, more stable results can be obtained.

pandaapo · 2024-04-08T00:08:10Z

@Pil0tXia You have correctly drawn the architecture of this PR (with the specified prompts idea first proposed by the author and your prompts template idea), so I believe you understand the core functionality implemented in this PR now.

As for the design of the prompt templates and how to send specified prompt, I suggest leaving them to the developer without limiting the developer's thought, as the author creatively thought of multiple data formats can being specified before. Of course the developer can adopt your good design.
The developer can even temporarily not develop prompt templates and submit the feature of prompt templates after multiple PRs submitted for this Connector.

this checking process may not save much performance overhead compared to deserialization.

Maybe. Maybe the overhead of checking format is less than the overhead of building CloudEvent item by item. Suitable code can even do without explicitly checking format.

By not relying on the JSON format generated by ChatGPT, more stable results can be obtained.

I said earlier to make it optional, with the option to choose how it currently runs like what this PR do, and the option to ask for a specified format to be returned according to a specified prompt. The stable results here only make this Connector feature-poor in practice.

jevinjiang · 2024-04-08T11:21:30Z

I will fix the interface path to /chat and /parse in the next commit. /parse supports "sending prompt and getting the result of cloudevents specification from ChatGPT" and /chat is the current way to interact with OpenAI.

jevinjiang · 2024-04-10T07:38:53Z

@Pil0tXia @pandaapo please review again.

pandaapo · 2024-04-14T03:40:48Z

eventmesh-connectors/eventmesh-connector-chatgpt/src/main/resources/prompt

+  "subject": string, Set to ${subject}
+  "id": string, Set to  ${id}
+  "time": string, Set to  ${time}
+  "datacontenttype": string, Set to  ${datacontenttype}


I find that my understanding of this prompt file is indeed not as good as ChatGPT's. When users use the prompt mode, do they refer to or directly use the prompt file, and then replace all ${} in the file with their own content?

我发现我对该prompt文件的理解能力确实不如ChatGPT。用户使用prompt模式时，是参照该prompt文件或直接使用该文件，然后将该文件中所有${}替换成自己的内容吗？

@pandaapo Yes, this file named prompt is actually a prompt template, which replaces the content in ${} according to the user's input. The following is an example of a final prompt:

You are an AI assistant named CloudEventsConverter. avoid escape characters . Your task is to construct a JSON object in CloudEvents format. Based on the field name and field description in the 'data' field of the CloudEvents formatted JSON object, convert the input text provided by the user into the content of the 'data' field, which must comply with the specifications of the content of the 'datacontenttype' field. The role is : - If the 'datacontenttype' field content is 'application/json', then the' data 'field content should be a JSON object, - else If the 'datacontenttype' field content is not 'application/json' and is 'application/xml', then the' data 'field content should be a string in XML format and the outermost of XML format is <data> </data>, inside is the XML generated by you based on field info; - else the 'datacontenttype' field content is not 'application/json' and 'application/xml', then the' data 'field content is string of the 'text' field content; Except for the content of the data field, all other values should be set to and cannot be modified. Finally, return to me the JSON object in CloudEvents format that you constructed The following text is the field name and field description in the 'data' field of the CloudEvents-formatted JSON object, extract the following information: <BEGIN FIELD INFO> orderNo:this is order number address:this is a address phone:this is phone number <END FIELD INFO> text: User 13356288979 from Tianjin store placed an order with order number 11221122 The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```": ```json { "specversion": string, Set to "1.0" "type": string, Set to com.example.someevent "source": string, Set to /mycontext "subject": string, Set to test "id": string, Set to 9aae39c3-7641-474a-b181-f4560d6aee3b "time": string, Set to 2024-04-14T18:59:44.365946+08:00 "datacontenttype": string, Set to application/json "data": object or string }

What will be the result returned by ChatGPT for this prompt?

To generate this full prompt text after replacing the placeholders, what would a user's request to access this Connector look like?

@pandaapo
chatgpt result:

{ "specversion": "1.0", "type": "com.example.someevent", "source": "/mycontext", "subject": "test", "id": "0171d995-f97a-4533-9c70-e2fb14fa297e", "time": "2024-04-14T20:01:19.939919+08:00", "datacontenttype": "application/json", "data": { "orderNo": "11221122", "address": "Tianjin store", "phone": "13356288979" } }

request:

{ "requestType": "PARSE", "type": "com.example.someevent", "source": "/mycontext", "subject": "test", "datacontenttype": "application/json", "text": "User 13356288979 from Tianjin store placed an order with order number 11221122", "fields": "orderNo:this is order number;address:this is a address;phone:this is phone number" }

If your data content type is application/xml, then chatgpt return, in xml is the root path specified by me by default.:

{ "specversion": "1.0", "type": "com.example.someevent", "source": "/mycontext", "subject": "test", "id": "ee446a10-f7eb-4596-a81b-bdc79d45bca5", "time": "2024-04-14T20:02:36.738191+08:00", "datacontenttype": "application/xml", "data": "<data><orderNo>11221122</orderNo><address>Tianjin store</address><phone>13356288979</phone></data>" }

@pandaapo the purpose of the type is PARSE is to parse the unstructured data input by the user into a json or xml structured data set into the data field in cloudevent. text is user-defined unstructured data, chatgpt according to field info Extract information from text and convert it into structured data,Return to finally give me a complete cloud event.

PARSE 的目的是，根据用户输入的非结构化数据(text)解析为一个 json 或者 xml 的结构化数据, 然后放入cloudevent 中的data字段中，text 是用户自定义的非结构化数据，chatgpt 根据fields传入的信息从text中提取信息，转化为结构化数据,最终返回给我一个完整的cloudevent。

What happens if the content of "text" entered by the user does not match the parse rule in "fields"? (I want to get clarity on some functionality to reduce the difficulty of review)

@pandaapo
In fact, it depends on the user's description of the field information. I have not made any relevant guarantees. I tested it. If the field does not exist in the text, the description of the field will be returned. The following is a return example:

{ "specversion": "1.0", "type": "com.example.someevent", "source": "/mycontext", "subject": "test", "id": "56bf1ce4-d57f-4591-9196-fe0cb8abf9c7", "time": "2024-04-14T21:01:34.548791+08:00", "datacontenttype": "application/json", "data": { "orderNo": "this is order number", "address": "this is a address", "phone": "this is phone number", "text": "There are a cloth sofa, a wood table and two wood chairs in the living room" } }

Users can adjust the description information in fields, such as

<BEGIN FIELD INFO> orderNo:this is order number!If this information is not found, output none. address:this is a address!If this information is not found, output none. phone:this is phone number!If this information is not found, output none. <END FIELD INFO>

In this case, the return of chatgpt is

{ "specversion": "1.0", "type": "com.example.someevent", "source": "/mycontext", "subject": "test", "id": "51387c66-2a4c-4b7a-a1cc-f9f554aa039b", "time": "2024-04-14T20:57:08.864208+08:00", "datacontenttype": "application/json", "data": { "orderNo": "none", "address": "none", "phone": "none" } }

...ain/java/org/apache/eventmesh/connector/chatgpt/source/connector/ChatGPTSourceConnector.java

pandaapo · 2024-04-15T08:44:34Z

...tgpt/src/main/java/org/apache/eventmesh/connector/chatgpt/source/managers/OpenaiManager.java

+        OpenaiConfig openaiConfig = sourceConfig.getOpenaiConfig();
+        AssertUtils.isTrue(openaiConfig.getTimeout() > 0, "openaiTimeout must be >= 0");
+        boolean proxyEnable = sourceConfig.connectorConfig.isProxyEnable();
+        if (proxyEnable) {


Does proxy mean that Connector accesses ChatGPT through a proxy instead of directly?

yes... network ... Is a huge wall

pandaapo · 2024-04-15T08:57:36Z

eventmesh-connectors/eventmesh-connector-chatgpt/src/main/resources/prompt

+  "source": string, Set to  ${source}
+  "subject": string, Set to ${subject}
+  "id": string, Set to  ${id}
+  "time": string, Set to  ${time}


When this PR is merged, I hope you can add the documentation for the use of the Connector here #4817 (review) . In the documentation, It will be nice to provide some clarity on the use of this prompt file, the construction of the user's request body when using prompt mode, and the correspondence between the two.

okk~ I will complete this task as soon as possible

pandaapo · 2024-04-15T09:07:09Z

...tgpt/src/main/java/org/apache/eventmesh/connector/chatgpt/source/managers/OpenaiManager.java

+
+    private void initOpenAi(ChatGPTSourceConfig sourceConfig) {
+        OpenaiConfig openaiConfig = sourceConfig.getOpenaiConfig();
+        AssertUtils.isTrue(openaiConfig.getTimeout() > 0, "openaiTimeout must be >= 0");


A similar suggestion. To reduce the required configuration items for user convenience, use default values as much as possible when they are missing. You can check if there are any other similar places that can be handled in this way.

The current default configuration is equal to the default configuration of openai. This judgment is to prevent users from configuring the openaiTimeout to be 0

I'm a bit uncertain. If the configuration file does not have 'timeout' or it is present but the user hasn't configured a value for it, would having a default value of >0 for 'timeout' prevent reporting this error to the user?

我有些不确定。当配置文件中没有timeout或者有但是用户没有配置它的值的时候，如果timeout有>0的默认值是不是就能避免将该错误报告给用户？

You are right, we do need some default values

...tgpt/src/main/java/org/apache/eventmesh/connector/chatgpt/source/managers/OpenaiManager.java

pandaapo · 2024-04-17T10:24:15Z

...ain/java/org/apache/eventmesh/connector/chatgpt/source/connector/ChatGPTSourceConnector.java

+            }
+        });
+        this.server = vertx.createHttpServer(new HttpServerOptions().setPort(this.sourceConfig.connectorConfig.getPort())
+            .setIdleTimeout(this.sourceConfig.connectorConfig.getIdleTimeout())).requestHandler(router);


How would Vertx handle it when idleTimeout is 0?

The same problem as openaiTimeout will also occur. I added the processing logic and rechecked the places where default values need to be added.

pandaapo · 2024-04-18T09:36:43Z

...java/org/apache/eventmesh/connector/chatgpt/source/connector/ChatGPTSourceConnectorTest.java

-        HttpPost invalidPost = new HttpPost(uri);
-        invalidPost.setEntity(new StringEntity(JsonUtils.toJSONString(event)));
-        HttpResponse resp = httpClient.execute(invalidPost);
-        Assertions.assertEquals(HttpStatus.SC_BAD_REQUEST, resp.getStatusLine().getStatusCode());


What is your reason for removing this piece of your test code? Note the file conflict below.

Because

ChatGPTRequestDTO bodyObject = body.asPojo(ChatGPTRequestDTO.class);

when deserializing ChatGPTRequest, the null value will be ignored and the default value will be used.

The following check will never be triggered

private void validateRequestDTO(ChatGPTRequestDTO bodyObject) { if (bodyObject.getSubject() == null || bodyObject.getDataContentType() == null || bodyObject.getText() == null) { throw new IllegalArgumentException("Attributes 'subject', 'datacontenttype', and 'text' cannot be null"); } }

Note the file conflict below.

Try to draw your attention.

jevinjiang · 2024-04-18T13:17:36Z

好工作！

后续文档可以在https://github.com/apache/eventmesh-site中添加。

please review apache/eventmesh-site#215

pandaapo · 2024-04-20T00:32:46Z

.../java/org/apache/eventmesh/connector/chatgpt/source/config/ChatGPTSourceConnectorConfig.java

@@ -28,7 +28,7 @@ public class ChatGPTSourceConnectorConfig {

    private int port = 3756;

-    private int idleTimeout = 30;
+    private int idleTimeout = 0;


Since the default value is 0, can this assignment code be removed? Similarly with private long timeout = 0.

pandaapo · 2024-04-20T01:34:34Z

...ain/java/org/apache/eventmesh/connector/chatgpt/source/connector/ChatGPTSourceConnector.java

@@ -150,8 +153,8 @@ private void doInit() {


    private void validateRequestDTO(ChatGPTRequestDTO bodyObject) {
-        if (bodyObject.getSubject() == null || bodyObject.getDataContentType() == null || bodyObject.getText() == null) {
-            throw new IllegalArgumentException("Attributes 'subject', 'datacontenttype', and 'text' cannot be null");
+        if (bodyObject.getSubject() == null || bodyObject.getText() == null) {


Since subject has a default value, is it still necessary to check the null value of subject here? Maybe change it to blank check? If change it to blank check, can text also be changed to blank check?

Inexplicably, two Approved appear.

pandaapo · 2024-04-20T08:20:51Z

@jevinjiang Note the slight error in CI.

jevinjiang · 2024-04-20T08:30:01Z

@pandaapo thank you very much for helping me complete this PR.

[ISSUE apache#4047] Support chatGPT source connector

8b15c8b

Pil0tXia reviewed Apr 7, 2024

View reviewed changes

pandaapo reviewed Apr 7, 2024

View reviewed changes

[ISSUE apache#4047] Add OpenAI configuration and adjust DTO

317ffb4

JiangShuJu and others added 2 commits April 9, 2024 00:14

[ISSUE apache#4047] Join parse request support

608d169

[ISSUE apache#4047] impl Parse request

e2dcf44

jevinjiang and others added 5 commits April 10, 2024 15:47

[ISSUE apache#4047] fix code style

ca42ded

[ISSUE apache#4047] fix code style

3331997

[ISSUE apache#4047] fix dependencies check failed

126b29d

[ISSUE apache#4047] fix dependencies check

e79795f

[ISSUE apache#4047] fix license check

5f600c6

pandaapo reviewed Apr 14, 2024

View reviewed changes

pandaapo reviewed Apr 15, 2024

View reviewed changes

...tgpt/src/main/java/org/apache/eventmesh/connector/chatgpt/source/managers/OpenaiManager.java Show resolved Hide resolved

JiangShuJu added 2 commits April 15, 2024 23:10

[ISSUE apache#4047] fix review question

6ccc530

[ISSUE apache#4047] fix review question

010e2a9

pandaapo reviewed Apr 17, 2024

View reviewed changes

JiangShuJu added 2 commits April 17, 2024 21:29

[ISSUE apache#4047] add default value

7bfadbb

[ISSUE apache#4047] fix test

c849912

pandaapo reviewed Apr 18, 2024

View reviewed changes

[ISSUE apache#4047] default timeout value is zero , not timeout .

0bffda8

pandaapo reviewed Apr 20, 2024

View reviewed changes

JiangShuJu and others added 2 commits April 20, 2024 14:59

[ISSUE apache#4047] fix review

7343c76

Merge branch 'master' into fix_patch_4047

aab792f

pandaapo approved these changes Apr 20, 2024

View reviewed changes

pandaapo previously approved these changes Apr 20, 2024

View reviewed changes

[ISSUE apache#4047] fix license check

86bf139

pandaapo approved these changes Apr 20, 2024

View reviewed changes

jevinjiang requested a review from Pil0tXia April 20, 2024 13:12

Pil0tXia approved these changes Apr 20, 2024

View reviewed changes

pandaapo merged commit fdb96d8 into apache:master Apr 21, 2024
9 checks passed

[ISSUE #4047] Support chatGPT source connector #4817

[ISSUE #4047] Support chatGPT source connector #4817

Conversation

jevinjiang commented Apr 7, 2024 • edited Loading

Pil0tXia left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Pil0tXia commented Apr 7, 2024

pandaapo left a comment • edited Loading

Choose a reason for hiding this comment

Pil0tXia commented Apr 7, 2024 • edited Loading

pandaapo commented Apr 7, 2024 • edited Loading

Pil0tXia commented Apr 7, 2024

pandaapo commented Apr 8, 2024 • edited Loading

jevinjiang commented Apr 8, 2024

jevinjiang commented Apr 10, 2024

pandaapo Apr 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pandaapo Apr 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jevinjiang commented Apr 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pandaapo commented Apr 20, 2024

jevinjiang commented Apr 20, 2024

jevinjiang commented Apr 7, 2024 •

edited

Loading

pandaapo left a comment •

edited

Loading

Pil0tXia commented Apr 7, 2024 •

edited

Loading

pandaapo commented Apr 7, 2024 •

edited

Loading

pandaapo commented Apr 8, 2024 •

edited

Loading

pandaapo Apr 14, 2024 •

edited

Loading

pandaapo Apr 18, 2024 •

edited

Loading