Multiline logs split when rotated by Kubelet (containerd) #4094

Closed
m5lapp opened this issue Mar 13, 2023 · 5 comments
Labels
stale, waiting-for-user (Similar to "moreinfo", but especially need feedback from user)

Comments

@m5lapp

m5lapp commented Mar 13, 2023

Describe the bug

We are running Fluentd in a Kubernetes environment that uses the containerd runtime, with logs rotated by the Kubelet when they reach 10MB in size and the last four rotated files plus the current log retained. When rotation happens while a large multi-line log is being written, so that the entry is split across the old and the new log file, Fluentd picks it up as many single-line logs rather than one multi-line log.

To Reproduce

  1. Run Kubelet with the containerd runtime
  2. Configure Kubelet to rotate logs with the following two parameters (see the example configuration after this list):
    1. containerLogMaxFiles: 5
    2. containerLogMaxSize: 10Mi
  3. When the log file is close to the 10MB limit, start writing a large multi-line log
    1. The first line should begin with the tag [SKY_LOG_START] as per the configuration below
  4. Ensure that the multi-line log spans both the old and the new rotated logs files
    1. The multi-line log should end with [SKY_LOG_END] as per the configuration below
  5. Observe in the output log file that the multi-line log you wrote has been captured as many single-line log entries and not one multi-line log
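
For reference, here is a minimal sketch of the Kubelet settings from step 2, assuming they are applied through a KubeletConfiguration file (only the rotation-related fields are shown; everything else about the file is omitted):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Keep the current log plus the last four rotated files
containerLogMaxFiles: 5
# Rotate a container log file once it reaches 10MiB
containerLogMaxSize: 10Mi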

Expected behavior

A multi-line log that gets split across two files when our logs are rotated should still be picked up as a single log entry by Fluentd.

Your Environment

- Fluentd version: v1.14.3
- TD Agent version: N/A
- Operating system: Red Hat Enterprise Linux 8.7
- Kernel version: 4.18.0-425.10.1.el8_7.x86_64
- Kubernetes version: 1.21.13
- containerd version: v1.6.4

Your Configuration

<system>
  log_level info
</system>

# Enable monitoring endpoint for readiness and liveness probes
<source>
    @type monitor_agent
    bind 0.0.0.0
    port 24220
</source>
<source>
    @type tail
    enable_stat_watcher false
    pos_file /var/lib/containerd/fluentd/fluentd-containers.pos
    tag containers.*
    read_from_head true
    # Read from /var/log/containers/ as filenames have details used by the
    # kubernetes_metadata plugin.
    path /var/log/containers/*.log
    refresh_interval 1s
    <parse>
        @type regexp
        keep_time_key true
        expression /^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$/
    </parse>
</source>

<filter containers.**>
    @type kubernetes_metadata
    verify_ssl false
</filter>

<filter containers.**>
    @type concat
    key log
    separator "\n"
    stream_identity_key kubernetes.pod_name
    multiline_start_regexp /\[SKY_LOG_START\]/
    multiline_end_regexp /\[SKY_LOG_END\]/
    flush_interval 5
    timeout_label @NORMAL
</filter>

<filter **>
    @type record_transformer
    <record>
        location slu
        environment sandbox
    </record>
</filter>

<match **>
    @type relabel
    @label @NORMAL
</match>

<label @NORMAL>
    <match **>
        @type file
        @id output
        delimiter SPACE
        output_tag false
        output_time false
        path /var/lib/containerd/fluentd/containers
        
        #Rotate the output file every hour
        time_slice_format %Y%m%d%H
        append true
        flush_interval 0s
        flush_at_shutdown true
        #Use local time for the output time slices
        localtime true
    </match>
</label>

## Prometheus Input Plugin Configuration

# Prometheus input plugin that exports metrics
<source>
    @type prometheus
</source>

# Prometheus input plugin that collects metrics from MonitorAgent
<source>
    @type prometheus_monitor
    <labels>
        host ${hostname}
    </labels>
</source>

# Prometheus input plugin that collects metrics for output plugin
<source>
    @type prometheus_output_monitor
    <labels>
        host ${hostname}
    </labels>
</source>
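
For context, the expression in the tail source above parses containerd's CRI log format, where each raw line carries a timestamp, the stream name, a partial/full tag, and the message, which the regexp captures as time, stream, logtag and log. Illustrative raw lines (values made up) would look like this before the concat filter joins the [SKY_LOG_START]...[SKY_LOG_END] block:

2023-03-10T10:33:01.123456789Z stdout F [SKY_LOG_START] payment request failed
2023-03-10T10:33:01.123457000Z stdout F   at com.example.Handler.process(Handler.java:42)
2023-03-10T10:33:01.123458000Z stdout F [SKY_LOG_END]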

Your Error Log

N/A

Additional context

When the Kubelet performs the log rotation, it swaps the inodes of the current log and the most recently rotated log as follows:

393234 -rw-r-----. 1 root root 9.7M Mar 10 10:33 0.log
393223 -rw-r-----. 1 root root  11M Mar 10 10:32 0.log.20230310-103201

# Rotation 1
393223 -rw-r-----. 1 root root 4.6M Mar 10 10:33 0.log
393234 -rw-r-----. 1 root root  11M Mar 10 10:33 0.log.20230310-103312

# Rotation 2
393234 -rw-r-----. 1 root root 6.7M Mar 10 10:44 0.log
393223 -rw-r-----. 1 root root  12M Mar 10 10:43 0.log.20230310-104354
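
One way to watch this swap while reproducing the issue (just a sketch; the exact path depends on the namespace, pod and container) is to poll the inode numbers during rotation:

# Poll the pod's log directory once per second; the inode numbers of 0.log
# and the newest rotated file exchange places when the Kubelet rotates the log.
watch -n 1 'ls -li /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container>/'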
@github-actions

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

@github-actions github-actions bot added the stale label Apr 13, 2023
@mmn01-sky

Are there any updates on this? Keen to get this resolved.

@daipom daipom removed the stale label Apr 13, 2023
@daipom
Contributor

daipom commented Apr 13, 2023

Sorry for the delay.

This looks like an issue with fluent-plugin-concat.

Isn't filter_concat timing out before in_tail reads the new file?
If you are getting a log message Timeout flush: xxx, that should be the cause.

It seems unlikely that the rotation would take more than 5 seconds, though, since you have set refresh_interval 1s, so in_tail should follow the new file almost immediately...

Anyway, to narrow down the cause, we should check whether this log message occurs during the rotation.

Example:

[info]: #0 following tail of /test/fluentd/input/test.log
[info]: #0 fluentd worker is now running worker=0
(datetime) test: {"message":"[START]Hello"}
[info]: #0 Timeout flush: test:default
[info]: #0 following tail of /test/fluentd/input/test2.log
(datetime) test: {"message":"World![END]"}
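
If that Timeout flush message does show up around the rotation, one thing worth trying, purely as a sketch and not a confirmed fix, is giving filter_concat more time to see the continuation lines from the newly rotated file, for example by raising its flush_interval:

<filter containers.**>
    @type concat
    key log
    separator "\n"
    stream_identity_key kubernetes.pod_name
    multiline_start_regexp /\[SKY_LOG_START\]/
    multiline_end_regexp /\[SKY_LOG_END\]/
    # Give in_tail time to pick up the rotated file before the pending
    # multi-line entry is flushed as separate records (30 is illustrative).
    flush_interval 30
    timeout_label @NORMAL
</filter>

This only widens the window, of course; it does not address the split itself, so confirming whether the Timeout flush message appears is still the more useful first step.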

@daipom daipom added the waiting-for-user label Apr 13, 2023
@github-actions

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

@github-actions github-actions bot added the stale label May 14, 2023
@github-actions

This issue was automatically closed because of stale in 7 days

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 22, 2023