Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flb_utils: remove false positive error 'file not found'. #7806

Merged
merged 1 commit into from Aug 10, 2023

Conversation

pwhelan
Copy link
Contributor

@pwhelan pwhelan commented Aug 8, 2023

Summary

When retrieving the machine_id for agents, and in some other rare cases, the following error can be spuriously reported:

image

This pushes out the error log so it is not reported when not necessary.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.


@pwhelan pwhelan requested a review from edsiper as a code owner August 8, 2023 16:27
@pwhelan pwhelan self-assigned this Aug 8, 2023
@pwhelan pwhelan temporarily deployed to pr August 8, 2023 16:27 — with GitHub Actions Inactive
@pwhelan pwhelan temporarily deployed to pr August 8, 2023 16:27 — with GitHub Actions Inactive
@pwhelan pwhelan temporarily deployed to pr August 8, 2023 16:27 — with GitHub Actions Inactive
niedbalski
niedbalski previously approved these changes Aug 8, 2023
Copy link
Collaborator

@niedbalski niedbalski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pwhelan pwhelan temporarily deployed to pr August 8, 2023 16:58 — with GitHub Actions Inactive
@leonardo-albertovich
Copy link
Collaborator

Hold this until tomorrow please, it seems to me that it actually introduces an error.

Copy link
Collaborator

@leonardo-albertovich leonardo-albertovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove the flb_errno in the fread error code pathbecause there are scenarios where a meaningful value for errno is not to be expected so we should probably improve that check instead of assuming that 0 means trouble. We should also add an flb_errno call in the fopen failure code path and a result check for that fileno call but I think we can safely ignore the result of fclose.

In any case, I can't really see the benefit of removing the flb_calloc flb_errno call because flb_calloc doesn't do it and adding those to the client code can actually print erroneous information and cause issues (eg. flb_utils_get_machine_id fails because RAND_bytes failed or the flb_malloc call in flb_utils_get_machine_id which already calls flb_errno so we repeat the same message).

Could you please explain the rationale behind this change so we are on the same page?

@pwhelan
Copy link
Contributor Author

pwhelan commented Aug 8, 2023

Could you please explain the rationale behind this change so we are on the same page?

The problem is that flb_utils_get_machine_id attempts to read the machine id from two different files. If one of them does not exist it will print out 'file not found' error message. This spurious error message is a false negative at best in this scenario.

This PR simply removes the code that reports the error message, which is reported by the flb_errno message, and moves it to the callers, where appropriate.

@leonardo-albertovich
Copy link
Collaborator

This will definitely affect other use cases, if all you care about is silencing that error message selectively then add a flag to that function named silence_inexistent_file_error so whoever uses it in the future is aware of it and has to opt-in to that, otherwise it's just asking for trouble.

@pwhelan
Copy link
Contributor Author

pwhelan commented Aug 9, 2023

This will definitely affect other use cases, if all you care about is silencing that error message selectively then add a flag to that function named silence_inexistent_file_error so whoever uses it in the future is aware of it and has to opt-in to that, otherwise it's just asking for trouble.

The only two places that function are called have been accounted for. We really should decide where we log error messages. If we log them immediately always it can cause spurious messages, like this specific case. This one also harkens back to the "[lib] Backend failed" error message which has caused nothing but confusion.

Adding the silence_inexistent_file_error seems to me like a bad precedent, which could lead to a flurry of code changes adding to any and all functions that could use it.

Another idea would be to restore the file not found message for flb_utils_read_file but reprioritizing it as a warning, that way people can refer to it if there is a relevant error message later in the logs. It would still cause false positives but perhaps less so.

@leonardo-albertovich
Copy link
Collaborator

leonardo-albertovich commented Aug 9, 2023

That type of behavior change has to be agreed upon so let's not make changes in a rush. Please revert that change and make it an explicit opt-in.

Edit: to address your additions, reprioritizing is an option, still, it would mean either reprioritizing flb_errno as a whole and I'm not sure about it, maybe @edsiper can chime in.

@leonardo-albertovich
Copy link
Collaborator

I'm not against flb_errno being "demoted" to debug but it's an established behavior so changing it will have an impact and we can't make those changes lightly.

@pwhelan
Copy link
Contributor Author

pwhelan commented Aug 9, 2023

Most of the times flb_errno are called are of little to no value. It reports the compiled file name where it is invoked, which is confusing for ENOENT, it has little to no context, etc...

it's an established behavior so changing it will have an impact and we can't make those changes lightly.

As much as I hate to say it, that is frightfully true. I would not be surprised to hear of more than a few spacebar heater workflows based on how flb_errno works.

There are approximately 1092 instances of flb_errno across the source code and plugins:

~/Projects/work/flb/fluent-bit/ [master] grep -r flb_errno src plugins | wc -l
    1092

About half of those are in the plugins:

~/Projects/work/flb/fluent-bit/ [pwhelan-fix-flb_utils_false_error*] grep -r flb_errno plugins | wc -l
     665

That would at least somewhat validate my approach of moving the policy of reporting the error out of flb_utils_read_file and into the callers since that seems to be the approach many other instances take.

@leonardo-albertovich
Copy link
Collaborator

I think improving the logging paradigm is a project that we haven't yet started so I'm pretty sure your participation would be appreciated.

In regards to this change, sadly the argument doesn't convince me.

@pwhelan
Copy link
Contributor Author

pwhelan commented Aug 9, 2023

More than a few plugins already report flb_errno themselves after calling flb APIs. I can include some examples:

filter_nest/nest.c:

            wildcard->key = flb_strndup(kv->val, flb_sds_len(kv->val));
            if (wildcard->key == NULL) {
                flb_errno();
                flb_free(wildcard);
                return -1;
            }

out_s3/s3_store.c:

    /* flb_sds_printf allocs if the incoming sds is not at least 64 bytes */
    hash_str = flb_sds_create_size(64);
    if (!hash_str) {
        flb_errno();
        return NULL;
    }

out_loki/loki.c:

        label_key = flb_sds_create_len(val->via.str.ptr, val->via.str.size);
        if (label_key == NULL) {
            flb_errno();
            return -1;
        }

Here is a case of an API calling an API and logging flb_errno.

src/flb_pack.c

        tmp = flb_realloc(state->tokens, new_size);
        if (!tmp) {
            flb_errno();
            return -1;
        }

And src/flb_network.c seems to always report with flb_errno in it's API.

At the moment it looks to me like the outlier, but I have not checked every single instead inside src to see which ones apply to the runtime or the API (which is made more difficult by the fact that the separation between them is subjective, at best).

It is a toss up between functions reporting it themselves or the functions that call them reporting it, but with some precedence favoring reporting it from the API itself, but no precedence for some magical just_keep_quiet flag.

I personally do not favor adding a magical flag since it seems to me like some weird special case that is likely to amount to future technical debt. It pretty much just makes sense if there are concrete plans to actually remove it in the mid-term future.

@leonardo-albertovich
Copy link
Collaborator

But you do understand that the difference here is that this is a meta operation and it's clear that you can't expect errno to be valid after the call right? because that's what I pointed out in my first message, this change is not reliable.

As for these lines you quoted, the client code is wrong in half of them and the others are just very thin wrappers on top of lower level functions (with flb_realloc just wrapping realloc and flb_strndup using flb_malloc but not printing the error which is the prevalent pattern).

So please, let's just stop arguing about something as meaningless as this, unlike we just demote flb_errno, that flag is the only one of the presented options that doesn't swap one corner case for another and a even though a refactor of the logging approach is something that we need to do it's not something that's appropriate for this PR.

Signed-off-by: Phillip Whelan <phil@calyptia.com>
@pwhelan pwhelan force-pushed the pwhelan-fix-flb_utils_false_error branch from 28d3651 to a017736 Compare August 10, 2023 17:04
@pwhelan pwhelan temporarily deployed to pr August 10, 2023 17:04 — with GitHub Actions Inactive
@pwhelan pwhelan temporarily deployed to pr August 10, 2023 17:04 — with GitHub Actions Inactive
@pwhelan pwhelan temporarily deployed to pr August 10, 2023 17:04 — with GitHub Actions Inactive
@pwhelan
Copy link
Contributor Author

pwhelan commented Aug 10, 2023

I rewrote the solution to use access(2) to check the existence of the files /var/lib/dbus/machine-id and /etc/machine-id before reading them and in that way avoid emitting the false positive error message.

It only checks them with F_OK and not R_OK on purpose to not avoid situations where permissions are denied so that that situation can get logged.

This solution should not introduce any corner cases or technical debt and localizes the changes to the single function flb_utils_get_machine_id which is only used by the out_calyptia plugin at the moment.

@pwhelan pwhelan temporarily deployed to pr August 10, 2023 17:31 — with GitHub Actions Inactive
Copy link
Collaborator

@niedbalski niedbalski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@niedbalski niedbalski merged commit ed13084 into master Aug 10, 2023
42 of 44 checks passed
@niedbalski niedbalski deleted the pwhelan-fix-flb_utils_false_error branch August 10, 2023 23:48
leonardo-albertovich pushed a commit that referenced this pull request Oct 5, 2023
Signed-off-by: Phillip Whelan <phil@calyptia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants