-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-29708] Convert CR Error field to JSON with enriched exception … #409
Conversation
9eb1b3d
to
0e9c1e1
Compare
...-operator/src/main/java/org/apache/flink/kubernetes/operator/crd/spec/AbstractFlinkSpec.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks like a very nice improvement. Added a few smaller commits around size limits and to move the spec field to a config.
Thank you!
...or/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkResourceExceptionUtils.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkResourceExceptionUtils.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkResourceExceptionUtils.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkResourceExceptionUtils.java
Outdated
Show resolved
Hide resolved
I tested it and it almost works great, apart from the minor things, Gyula mentioned. There'll be probably issues with JSON encoding: I've tested it with the following invalid CR (missing volumes):
It resulted in an string, that might not be a valid JSON:
|
Another thing that I found is the structure of the stacktrace probably too verbose:
|
0e9c1e1
to
29470cb
Compare
Thanks for the feedback and sharing your test result:
I will do some testing on this and validate the final JSON that's included in the CR field.
Yup, it does contain several metadata we might not need like "classLoaderName", "nativeMethod". An alternative that I am considering is to store the stackTrace as a more compact List generated from ExceptionUtils.getStackFrames However, the downside of this is each stackFrame is a String, so it's a bit more work to figure out which method/file the stackFrame is called from. Any thoughts on which one's more suited for our needs? |
The error JSON looks good in the logs. Things go wrong when it is saved already in the CR status, and you dump it as a yaml. It breaks the single line error JSON into a multiline YAML structure. |
We might not need the stack trace in a structured format. Maybe just dump it as a string:
Even if the root cause is not in the first cause. We would still find it I guess in the stack trace. |
+1 for having a more concise stacktrace (as a String). This will greately reduce the json size and will make it much more human readable and easier to use. If we later feel that this is insufficient we can change it |
Hi @morhidi, I have updated the stackTrace to be a string and added a ConfigOption to limit the length of the stackTrace string. Regarding the concern on deserializing the error status into a valid json field, I have tested that deserialization back into the FlinkResourceException class works. My testing was done as follows:
As a result, I can confirm deserialization of the json works. The question now is whether we are ok with the current format the error field is shown in CR yaml, which includes the escape field. I tried to search this up and the cause of this is we are storing the string with single quote '' instead of double quote "" in the yaml. Referring to https://www.baeldung.com/yaml-multi-line#quoting. I have tried to looking at openAPIV3Schema, but I don't see a straightforward way to change it to a double quote. I also suspect this could be related to how K8 api serializes the CR into yaml. |
4d5c68d
to
3110ef9
Compare
I am not completely sure about this. It might be better to let operator admins decide on this setting, at least the size limits for sure. In any case there is some inconsitency in the current implementation: |
Ok, so let's leave it as is and just move the configs under the right section. |
@darenwkt did you manage to properly parse the output of On my cluster the error/stacktrace appears to be a multiline nested yaml: Which is not a big deal if we can parse the CR back properly without loosing any content. |
Hi @morhidi, No problem, I have tested the deserialization of
The results are:
Also repeated same test for FlinkSessionJob CR.
I think having the multiline yaml helps with readability of the error field as long as it's able to deserialize correctly as what you have mentioned. |
Thanks for confirming, I will push a new revision with the config placed in SYSTEM section. |
3110ef9
to
404f052
Compare
404f052
to
1c4feca
Compare
Great work @darenwkt ! |
What is the purpose of the change
This PR aims to:
Brief change log
A new class is introduced to templatize the error field in JSON format:
Verifying this change
Added
FlinkResourceExceptionUtilsTest
to test serialization/deserialization of the exception and test the stackTraceEnabled spec.Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors
: yesDocumentation