Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature][Connector-V2] HTTP supports page increase #5477 #5561

Merged
merged 31 commits into from
Oct 26, 2023

Conversation

xiaofan2022
Copy link
Contributor

Purpose of this pull request

http supports page increase #5477

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@Hisoka-X Hisoka-X added this to the 2.3.4 milestone Sep 26, 2023
@Hisoka-X
Copy link
Member

Please add some test case for it
image

@xiaofan2022
Copy link
Contributor Author

source {
    Http {
      url = "http://localhost:8080/test/queryData"
      method = "GET"
      format = "json"
      params={
       page: "${page}"
      }
      content_field = "$.data.*"
      pageing={
       total_page_size=20
       page_field=page
       #total_page_field_path="$.totalPage"
       #json_verify_expression="$.hasNext"
       #json_verify_value="false"

      }
      schema = {
        fields {
          name = string
          age = string
        }
      }
    }
}

The configuration has been updated to the http.md document, and the local verification has passed
page_incr_test

@Hisoka-X Hisoka-X changed the title http supports page increase #5477 [Feature][Connector-V2] HTTP supports page increase #5477 Sep 28, 2023
LICENSE Outdated
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/sqlengine/zeta/ZetaSQLType.java from https://github.com/JSQLParser/JSqlParser
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/sqlengine/zeta/ZetaSQLFilter.java from https://github.com/JSQLParser/JSqlParser
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/sqlengine/zeta/ZetaSQLFunction.java from https://github.com/JSQLParser/JSqlParser
MIT License
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

| json_field | Config | No | - | This parameter helps you configure the schema,so this parameter must be used with schema. |
| pageing | Config | No | - | This parameter is used for paging queries |
| pageing.page_field | String | No | - | This parameter is used to specify the page field name in the request parameter |
| *pageing.max_page_size* | Int | No | 10000 | Change the parameter to control the maximum number of pages (if using paging please ensure that this value is greater than the target number of pages otherwise it may cause early *data accuracy* problems) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| *pageing.max_page_size* | Int | No | 10000 | Change the parameter to control the maximum number of pages (if using paging please ensure that this value is greater than the target number of pages otherwise it may cause early *data accuracy* problems) |
| pageing.max_page_size | Int | No | 10000 | Change the parameter to control the maximum number of pages (if using paging please ensure that this value is greater than the target number of pages otherwise it may cause early *data accuracy* problems) |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove * on pageing.max_page_size

Comment on lines 56 to 57
| pageing.json_verify_expression | String | No | - | This parameter is used verify that the condition is met through json path |
| pageing.json_verify_value | String | No | - | This parameter must be configured after json_verify_expression is configured (make sure verify value is in this parameter). |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find the code you use these two config, and I don't know what's used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two parameters are primarily used in cases where there is no total page count field in the API response to determine the number of iterations. They allow you to pinpoint a field using a JSON path and then decide whether to continue the task based on whether a specific condition is met (typically, whether the target value is found in json_verify_value).

@Hisoka-X
Copy link
Member

@xiaofan2022 Also please add e2e test case for this new feature, you can refer https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-http-e2e/src/test.

@Hisoka-X
Copy link
Member

cc @ruanwenjun @hailin0

@Hisoka-X Hisoka-X added the feature New feature label Sep 28, 2023
LICENSE Outdated
@@ -178,15 +178,15 @@
APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
boilerplate notice, with the fields enclosed by brackets "[]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this file change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Options.key("json_verify_expression")
.stringType()
.noDefaultValue()
.withDescription("json verify expression ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description not clear, I don't know what's mean if I only read json verify expression.

| json_field | Config | No | - | This parameter helps you configure the schema,so this parameter must be used with schema. |
| pageing | Config | No | - | This parameter is used for paging queries |
| pageing.page_field | String | No | - | This parameter is used to specify the page field name in the request parameter |
| *pageing.max_page_size* | Int | No | 10000 | Change the parameter to control the maximum number of pages (if using paging please ensure that this value is greater than the target number of pages otherwise it may cause early *data accuracy* problems) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove * on pageing.max_page_size

Options.key("json_verify_value")
.stringType()
.noDefaultValue()
.withDescription("json verify value ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. json_verify_expression: retrieve and validate target field values using a json_path expression
  2. json_verify_value: verify if the value of the JSON verify expression matches the expected value.
  3. max_page_size: The "pageing.max_page_size" parameter is primarily used to prevent infinite loops when there is no configuration for validation through JSONPath, ensuring that the program can complete its execution even if the termination condition is not triggered.

Options.key("page_field")
.stringType()
.defaultValue("page")
.withDescription("page field");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. I think you should copy description from doc to here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 29 to 42
json_field = {
name = "$.data[*].name"
age = "$.data[*].age"
}
pageing={
total_page_size=2
page_field=page
}
schema = {
fields {
name = string
age = int
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 29 to 43
json_field = {
name = "$.data[*].name"
age = "$.data[*].age"
}
pageing={
json_verify_expression="$.hasNext"
json_verify_value="false"
page_field=page
}
schema = {
fields {
name = string
age = int
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"age": 51
}
],
"currentPageIndex": 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"currentPageIndex": 1,
"currentPageIndex": 2,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Hisoka-X
Copy link
Member

After a second consider, I'm a little worry about the scene this PR try to solved. Normally, we can try to use batch size to check we can continue read next page or not.
Eg:

batch size = 100
if read size < 100 : read fininsh.
if read size = 100 : read next page.

Could you support this for default way to check read next page or not? I think this way will more useful and simple.

@xiaofan2022
Copy link
Contributor Author

After a second consider, I'm a little worry about the scene this PR try to solved. Normally, we can try to use batch size to check we can continue read next page or not. Eg:

batch size = 100
if read size < 100 : read fininsh.
if read size = 100 : read next page.

Could you support this for default way to check read next page or not? I think this way will more useful and simple.

if read size>100? next batch?

@Hisoka-X
Copy link
Member

if read size>100? next batch?

yes, but if batch size = 100, the read size could not larger than 100 if api not make mistabke.

@xiaofan2022
Copy link
Contributor Author

size
Does it support user-configurable batch size? If, after processing a batch, it doesn't meet the validation criteria, it proceeds to the next batch in a loop. In this case, I don't see a difference from using 'while (max_limit)'—in fact, batch processing might even result in idle processing.

@Hisoka-X
Copy link
Member

I don't see a difference from using 'while (max_limit)'—in fact, batch processing might even result in idle processing.

It would not care the page size, imagine that if the total amount of data returned by the API keeps rising, the pagesize size may need to be changed every time a task is run, but by specifying the batchsize, we can always read without changing the job file to read all data.

@xiaofan2022
Copy link
Contributor Author

I did not change the Engine-Client related code, why ci/cd error

@Hisoka-X
Copy link
Member

Never mind, this is a bug will be fixed in #5710

Hisoka-X
Hisoka-X previously approved these changes Oct 26, 2023
Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one comment

this.httpParameter
.getParams()
.put(pageInfo.getPageField(), pageInfo.getPageIndex().toString());
//
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Hisoka-X
Copy link
Member

cc @hailin0 @ruanwenjun

Hisoka-X
Hisoka-X previously approved these changes Oct 26, 2023
@@ -310,6 +314,35 @@ source {
- Test data can be found at this link [mockserver-config.json](../../../../seatunnel-e2e/seatunnel-connector-v2-e2e/connector-http-e2e/src/test/resources/mockserver-config.json)
- See this link for task configuration [http_jsonpath_to_assert.conf](../../../../seatunnel-e2e/seatunnel-connector-v2-e2e/connector-http-e2e/src/test/resources/http_jsonpath_to_assert.conf).

### pageing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### pageing
### pageing

docs/en/connector-v2/source/Http.md Outdated Show resolved Hide resolved
EricJoy2048
EricJoy2048 previously approved these changes Oct 26, 2023
Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xiaofan2022 !

@Hisoka-X Hisoka-X merged commit bb180b2 into apache:dev Oct 26, 2023
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants