Add system fields to input sources. #15276

Merged
gianm merged 6 commits into apache:master from extern-filename on Nov 2, 2023

@gianm gianm commented Oct 30, 2023

Main changes:

  1. The SystemField enum defines system fields __file_uri, __file_path,
    and __file_bucket. They are associated with each input entity.

  2. The SystemFieldInputSource interface can be added to any InputSource
    to make it system-field-capable. It sets up serialization of a list
    of configured systemFields in the JSON form of the input source, and
    provides a method getSystemFieldValue for computing the value of each
    system field. Cloud object, HDFS, HTTP, and Local now have this.
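
To illustrate the two pieces above, here is a minimal, self-contained Java sketch. It is not the code in this PR: the type names `SketchSystemField`, `SketchSystemFieldSource`, and `SketchObjectSource`, the use of `java.net.URI` as the input entity, and the way bucket and path are derived from the URI are all assumptions made for illustration. Only the three field names and the `getSystemFieldValue` method name come from the description.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the SystemField enum; only the three field
// names are taken from the PR description.
enum SketchSystemField {
  FILE_URI("__file_uri"),
  FILE_PATH("__file_path"),
  FILE_BUCKET("__file_bucket");

  private final String fieldName;

  SketchSystemField(String fieldName) {
    this.fieldName = fieldName;
  }

  String getFieldName() {
    return fieldName;
  }
}

// Hypothetical, simplified interface: an input source exposes its configured
// system fields and computes a value for each one per input entity.
interface SketchSystemFieldSource {
  Set<SketchSystemField> getConfiguredSystemFields();

  Object getSystemFieldValue(URI entityUri, SketchSystemField field);
}

// Assumed behavior for an object-store style source: the bucket is the URI
// authority and the path is the URI path without its leading slash.
final class SketchObjectSource implements SketchSystemFieldSource {
  private final Set<SketchSystemField> systemFields;

  SketchObjectSource(Set<SketchSystemField> systemFields) {
    this.systemFields = systemFields;
  }

  @Override
  public Set<SketchSystemField> getConfiguredSystemFields() {
    return systemFields;
  }

  @Override
  public Object getSystemFieldValue(URI entityUri, SketchSystemField field) {
    switch (field) {
      case FILE_URI:
        return entityUri.toString();
      case FILE_BUCKET:
        return entityUri.getAuthority();
      case FILE_PATH: {
        String path = entityUri.getPath();
        return path.startsWith("/") ? path.substring(1) : path;
      }
      default:
        return null;
    }
  }

  // Collects only the configured system fields, keyed by field name.
  Map<String, Object> systemFieldValues(URI entityUri) {
    Map<String, Object> values = new LinkedHashMap<>();
    for (SketchSystemField field : systemFields) {
      values.put(field.getFieldName(), getSystemFieldValue(entityUri, field));
    }
    return values;
  }
}
```

In this sketch, a source configured with only `FILE_URI` and `FILE_BUCKET` would attach those two values to each entity and leave `__file_path` out, which mirrors the idea that system fields are opt-in per input source.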

The SystemFieldInputSource isn't strictly necessary, since each input source could have implemented system fields internally in its own way. However, I think the interface is valuable because it helps ensure system fields are dealt with consistently, and because it provides a path to exposing system fields in SQL in a nice way. I think that ideally, they would be referenceable by name, but not participate in star expansion. AFAICT this would require a new Calcite feature. Relevant Calcite mailing list thread: https://lists.apache.org/thread/pnf3bx3jlrmv7q1q7jhwhsylrw4q5t20

Until then, system fields can be used in SQL without the planner's awareness: with EXTERN, add systemFields to the inputSource section, and add the system field names to the signature section.
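
A rough sketch of what that could look like, written as a small Java helper that assembles the query text. The class name `ExternSystemFieldsSketch`, the bucket and object name, and the `message` column are hypothetical. The S3 input source shape and the `string` signature types are assumptions based on the three-argument form of EXTERN (input source, input format, signature), so check them against the input source documentation before relying on them.

```java
// Hypothetical helper that assembles an EXTERN query string. The bucket,
// object name, and "message" column are made up for illustration.
final class ExternSystemFieldsSketch {
  // Input source JSON: system fields are requested through "systemFields".
  static final String INPUT_SOURCE =
      "{\"type\":\"s3\","
          + "\"uris\":[\"s3://example-bucket/events/data.json\"],"
          + "\"systemFields\":[\"__file_uri\",\"__file_path\"]}";

  // Input format JSON for newline-delimited JSON data.
  static final String INPUT_FORMAT = "{\"type\":\"json\"}";

  // Signature JSON: the system field names are listed next to regular columns.
  static final String SIGNATURE =
      "[{\"name\":\"message\",\"type\":\"string\"},"
          + "{\"name\":\"__file_uri\",\"type\":\"string\"},"
          + "{\"name\":\"__file_path\",\"type\":\"string\"}]";

  // Builds the SQL text. The system fields are selected by name like any
  // other column declared in the signature.
  static String buildQuery() {
    return "SELECT \"message\", \"__file_uri\", \"__file_path\"\n"
        + "FROM TABLE(EXTERN(\n"
        + "  '" + INPUT_SOURCE + "',\n"
        + "  '" + INPUT_FORMAT + "',\n"
        + "  '" + SIGNATURE + "'\n"
        + "))";
  }
}
```

Printing `ExternSystemFieldsSketch.buildQuery()` shows the full statement. The point of the sketch is that each system field appears twice: once under `systemFields` in the input source, and once as a named column in the signature.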

github-actions bot added the labels "Area - Batch Ingestion" and "Area - MSQ" (For multi stage queries - https://github.com/apache/druid/issues/12262) on Oct 30, 2023

gianm commented Oct 30, 2023

The IT standard-its / integration-index-tests-middleManager (kafka-transactional-index) / kafka-transactional-index integration test (Compile=jdk8, Run=jdk8, Indexer=middleManager, Mysql=com.mysql.jdbc.Driver) is repeatedly failing when running ./it.sh ci due to this:

Error:  Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.14.2:npm (npm-install) on project web-console: Failed to run task: 'npm ci' failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Error:  
Error:  After correcting the problems, you can resume the build with the command
Error:    mvn <args> -rf :web-console
Error: Process completed with exit code 1.

I am not sure what's going on here but I don't think this is related to this PR. Should be safe to merge without this test passing.


gianm commented Oct 31, 2023

> I am not sure what's going on here but I don't think this is related to this PR. Should be safe to merge without this test passing.

I pushed a commit that simply merges master, which'll rerun all the tests. Hopefully that clears the problem.

@abhishekrb19 abhishekrb19 left a comment

LGTM!

docs/ingestion/input-sources.md (review thread resolved)

@zachjsh zachjsh left a comment

🚀

@gianm gianm merged commit d87d92b into apache:master Nov 2, 2023
82 checks passed
@gianm gianm deleted the extern-filename branch November 2, 2023 17:31
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
* Add system fields to input sources.

Main changes:

1) The SystemField enum defines system fields "__file_uri", "__file_path",
   and "__file_bucket". They are associated with each input entity.

2) The SystemFieldInputSource interface can be added to any InputSource
   to make it system-field-capable. It sets up serialization of a list
   of configured "systemFields" in the JSON form of the input source, and
   provides a method getSystemFieldValue for computing the value of each
   system field. Cloud object, HDFS, HTTP, and Local now have this.

* Fix various LocalInputSource calls.

* Fix style stuff.

* Fixups.

* Fix tests and coverage.
vogievetsky added the label "Needs web console change" (Backend API changes that would benefit from frontend support in the web console) on Jan 23, 2024
LakshSingla added this to the 29.0.0 milestone on Jan 29, 2024
Labels
Area - Batch Ingestion, Area - Dependencies, Area - Documentation, Area - Ingestion, Area - MSQ (For multi stage queries - https://github.com/apache/druid/issues/12262), Area - Querying, Area - Segment Format and Ser/De, Needs web console change (Backend API changes that would benefit from frontend support in the web console)

5 participants