Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[INLONG-6654][Sort] Supports s3 side-output for dirty data #6655

Merged
merged 3 commits into from
Nov 29, 2022

Conversation

yunqingmoswu
Copy link
Contributor

Prepare a Pull Request

(Change the title refer to the following example)

  • Title: [INLONG-6654][Sort] Supports s3 side-output for dirty data

(The following XYZ should be replaced by the actual GitHub Issue number)

Motivation

Supports s3 side-output for dirty data, through it, dirty data can be archived to s3.
This piece is designed as follows(some common process of dirty data side-output can view #6618):

  1. Add s3 side-output implements
    In this part, i will add dirty data cache and flush to s3 periodically.
  2. Add s3 side-output factory to create object of s3 side-output

Modifications

1.Add the core class S3DirtySink for dirty data side-output
2.Add S3DirtySinkFactory to create S3DirtySink
3.Add S3Options to config S3DirtySink
4.Add S3Helper to help side-output to s3

Verifying this change

(Please pick either of the following options)

  • This change is a trivial rework/code cleanup without any test coverage.

  • This change is already covered by existing tests, such as:
    (please describe tests)

  • This change added tests and can be verified as follows:

    (example:)

    • Added integration tests for end-to-end deployment with large payloads (10MB)
    • Extended integration test for recovery after broker failure

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a follow-up issue for adding the documentation

@yunqingmoswu yunqingmoswu added this to the 1.5.0 milestone Nov 28, 2022
@Yizhou-Yang
Copy link
Contributor

Yizhou-Yang commented Nov 29, 2022

I think that logdirtysink and s3dirtysink should be refactored to somewhere other than sort-base because essentially they are two sink factories. two questions: 1) will inlong manager need to change to adapt to these changes? 2) when are these sinks created? in other sinks' connectors? when manager parses the parameters? or when the task is created?

consider adding some unit tests to make these dirty sink uses cases a bit more clear?

@yunqingmoswu
Copy link
Contributor Author

I think that logdirtysink and s3dirtysink should be refactored to somewhere other than sort-base because essentially they are two sink factories. two questions: 1) will inlong manager need to change to adapt to these changes? 2) when are these sinks created? in other sinks' connectors? when manager parses the parameters? or when the task is created?

consider adding some unit tests to make these dirty sink uses cases a bit more clear?

How to use it has been explained in the 'log' pr, this piece cannot be added to ut at present, because the dirty data output is bypassed and nested in each connector.

@yunqingmoswu yunqingmoswu requested review from gong, Yizhou-Yang and EMsnap and removed request for Yizhou-Yang November 29, 2022 03:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature][Sort] Supports s3 side-output for dirty data
5 participants