[SPARK-48131][Core] Unify MDC key mdc.taskName and task_name #46386

Closed
gengliangwang wants to merge 1 commit into apache:master from gengliangwang:unify

Conversation

gengliangwang
Member

gengliangwang commented May 4, 2024

What changes were proposed in this pull request?

Currently there are two MDC keys for task name:

  * mdc.taskName, which was introduced in #28801 (before that change, it was taskName)
  * task_name, which was introduced by the structured logging framework project

To unify the MDC keys, this PR renames mdc.taskName to task_name. This MDC key appears frequently in the logs of a running Spark application.
Before the change:

"context":{"mdc.taskName":"task 19.0 in stage 0.0 (TID 19)"}

After the change:

"context":{"task_name":"task 19.0 in stage 0.0 (TID 19)"}

Why are the changes needed?

  1. It makes the MDC key names consistent.
  2. Minor upside: it allows users to query task names with SELECT * FROM logs WHERE context.task_name = ... (see the sketch below). Otherwise, querying with context.mdc.taskName results in an analysis exception, and users have to fall back to context['mdc.taskName'].
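
To make the query upside concrete, here is a minimal spark-shell sketch; the log path, the temp view name, and reading the structured logs with an inferred schema are illustrative assumptions, not part of this PR:

```scala
// Read structured (JSON lines) Spark logs into a DataFrame; the path is a placeholder.
val logs = spark.read.json("/tmp/spark-structured.log")
logs.createOrReplaceTempView("logs")

// Unified key: plain dot notation works because "task_name" contains no dot.
spark.sql(
  "SELECT * FROM logs WHERE context.task_name = 'task 19.0 in stage 0.0 (TID 19)'"
).show(truncate = false)

// Legacy key: the dot inside "mdc.taskName" is parsed as nested field access and
// fails analysis, so the entry has to be addressed with bracket syntax instead:
// spark.sql("SELECT * FROM logs WHERE context['mdc.taskName'] = '...'").show()
```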

Does this PR introduce any user-facing change?

Not really. The MDC key is used by developers for debugging purposes.

How was this patch tested?

Manual test

Was this patch authored or co-authored using generative AI tooling?

No

@gengliangwang
Member Author

I suggest having this merged before the Spark 4.0 preview release.

Member

@dongjoon-hyun left a comment

+1, LGTM, @gengliangwang.

@dongjoon-hyun
Member

Merged to master for Apache Spark 4.0.0-preview.

sinaiamonkar-sai pushed a commit to sinaiamonkar-sai/spark that referenced this pull request May 5, 2024
### What changes were proposed in this pull request?

Currently there are two MDC keys for task name:
* `mdc.taskName`, which was introduced in apache#28801 (before that change, it was `taskName`).
* `task_name`, which was introduced by the structured logging framework project.

To unify the MDC keys, this PR renames `mdc.taskName` to `task_name`. This MDC key appears frequently in the logs of a running Spark application.
Before the change:
```
"context":{"mdc.taskName":"task 19.0 in stage 0.0 (TID 19)"}
```
After the change:
```
"context":{"task_name":"task 19.0 in stage 0.0 (TID 19)"}
```

### Why are the changes needed?

1. Make the MDC key names consistent.
2. Minor upside: this allows users to query task names with `SELECT * FROM logs WHERE context.task_name = ...`. Otherwise, querying with `context.mdc.taskName` results in an analysis exception, and users have to fall back to `context['mdc.taskName']`.

### Does this PR introduce _any_ user-facing change?

Not really. The MDC key is used by developers for debugging purposes.

### How was this patch tested?

Manual test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46386 from gengliangwang/unify.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@mridulm
Contributor

mridulm commented May 5, 2024

Wouldn't this impact/break existing log config files where users are customizing the template?

@gengliangwang
Member Author

Wouldn't this impact/break existing log config files where users are customizing the template?

Yes, and this is documented in docs/configuration.md. If necessary, I can add this to the migration guide as well.
BTW, the key was changed from taskName to mdc.taskName in #28801, so this is not the first time it has been changed; hopefully we can stay with task_name this time.

@mridulm
Contributor

mridulm commented May 7, 2024

That change was not visible to end users, as there was no release made, right?

@gengliangwang
Member Author

That change was not visible to end users, as there was no release made, right?

Yes, you are right.
Are you OK with the changes in this PR? If not, please let me know your preference, and I can check with @cloud-fan to see if we still have time to make further changes.

@mridulm
Contributor

mridulm commented May 7, 2024

The question is whether this change is strictly necessary. If it is, we can evaluate it in that context.
If it is not and is just a nice-to-have, it is better not to make breaking changes that can impact users.

Whichever direction we decide on, it would be good to conclude before the preview release :-)

@gengliangwang
Member Author

@mridulm Since the task name MDC appears frequently in the logs, I would say this is necessary for the new logging framework. After the renaming, the MDC names are consistent and simpler to use. Otherwise, we would have executor_id/worker_id/task_id/etc. alongside an odd mdc.taskName.

@mridulm
Contributor

mridulm commented May 7, 2024

So if I understood right, this is a backward-incompatible change made for naming consistency in structured logging?
If yes, IMO it is preferable to revert this and not break existing users.

+CC @dongjoon-hyun as well for thoughts.

@dongjoon-hyun
Member

I understand your point, @mridulm. Do you use the following syntax?

spark.sparkContext.setLocalProperty("mdc." + name, "value")

Initially, I thought it was okay because MDC is not used by default and this PR targets Apache Spark 4.0.0.

However, if there is any concern about the plain logging part, I agree with @mridulm, because MDC has been supported since Apache Spark 3.1.0.
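
For context on that hook, a hedged spark-shell sketch; the custom key "jobOwner", its value, and the exact pattern-layout placeholders are illustrative assumptions based on the docs for #28801, not something verified in this PR:

```scala
// Local properties prefixed with "mdc." are copied into the task thread's MDC on
// executors, so a customized log4j pattern layout can print them with %X{...}.
spark.sparkContext.setLocalProperty("mdc.jobOwner", "alice")

// Tasks launched after this point carry the extra MDC entry; the task name itself
// would show up under %X{mdc.taskName} before this PR and (by default) under
// %X{task_name} after it.
spark.range(100).foreach(_ => ())
```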

@dongjoon-hyun
Member

dongjoon-hyun commented May 7, 2024

WDYT, @gengliangwang?

If we still want to unify this, we need to do it with a configuration.

@gengliangwang
Member Author

@dongjoon-hyun Thanks, I will add a configuration for the unification.

dongjoon-hyun pushed a commit that referenced this pull request May 8, 2024
… of Task Name

### What changes were proposed in this pull request?

Introduce a new Spark config `spark.log.legacyTaskNameMdc.enabled`:
When true, the MDC key `mdc.taskName` will be set in the logs, which is consistent with the behavior of Spark 3.1 to Spark 3.5 releases. When false, the logging framework will use `task_name` as the MDC key for consistency with other new MDC keys.

### Why are the changes needed?

As discussed in #46386 (comment), we should add a configuration and migration guide about the change in the MDC key of Task Name.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46446 from gengliangwang/addConfig.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
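
For readers following along, a hedged sketch of opting back into the legacy key; the config name is quoted from the follow-up above, but setting it through the session builder (rather than spark-defaults.conf or spark-submit --conf) is only an assumption for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Keep emitting the pre-4.0 MDC key "mdc.taskName" instead of "task_name".
val spark = SparkSession.builder()
  .appName("legacy-mdc-key-demo")
  .config("spark.log.legacyTaskNameMdc.enabled", "true")
  .getOrCreate()
```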
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… of Task Name
