Add new fingerprint file identity (#35734)
This is a new alternative to existing options like `native`, `path`
and `inode_marker`.

Unlike the existing options, this file identity does not rely on any
file system metadata and uses only the file size and its content.

Users can specify how many bytes at the beginning of each file are used for
the fingerprint and can optionally set an offset from the beginning.

This identity is supposed to be more stable and less affected by the
environment/setup of the users.

This change also contains a few performance optimisations of how we work with the filesystem and watch for file changes.
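
The idea described above can be illustrated with a minimal sketch. It assumes SHA-256 as the hash (the commit message does not name one), and `fingerprint`, `errTooSmall` and `mustWriteTemp` are hypothetical names made up for this example, not the identifiers used in the change itself.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"os"
)

// errTooSmall reports that a file has not yet grown to offset+length bytes.
var errTooSmall = errors.New("file too small to fingerprint")

// fingerprint hashes the byte range [offset, offset+length) of the file at
// path and returns the digest as a hex string. SHA-256 is an assumption of
// this sketch. No inode or device ID is consulted, only size and content.
func fingerprint(path string, offset, length int64) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return "", err
	}
	if fi.Size() < offset+length {
		return "", errTooSmall
	}

	h := sha256.New()
	if _, err := io.Copy(h, io.NewSectionReader(f, offset, length)); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// mustWriteTemp writes data to a new temporary file and returns its path.
func mustWriteTemp(data []byte) string {
	f, err := os.CreateTemp("", "fingerprint-demo-*.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		panic(err)
	}
	return f.Name()
}

func main() {
	path := mustWriteTemp(make([]byte, 2048))
	defer os.Remove(path)

	fp, err := fingerprint(path, 0, 1024)
	fmt.Println(fp, err)

	// A range that extends past the end of the file cannot be hashed yet.
	_, err = fingerprint(path, 2000, 1024)
	fmt.Println(err)
}
```

Renaming or moving the file would not change the value returned by `fingerprint`, which is the property that makes such an identity independent of file system metadata.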
rdner committed Jul 14, 2023
1 parent a755bbc commit b701377
Showing 22 changed files with 1,422 additions and 529 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.next.asciidoc
@@ -347,7 +347,7 @@ automatic splitting at root level, if root level element is an array. {pull}3415
- Improve CEL input performance. {pull}35915[35915]
- Adding filename details from zip to response for httpjson {issue}33952[33952] {pull}34044[34044]
- Add `clean_session` configuration setting for MQTT input. {pull}35806[16204]

- Add fingerprint mode for the filestream scanner and new file identity based on it {issue}34419[34419] {pull}35734[35734]

*Auditbeat*
- Migration of system/package module storage from gob encoding to flatbuffer encoding in bolt db. {pull}34817[34817]
13 changes: 13 additions & 0 deletions filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl
@@ -300,6 +300,19 @@ filebeat.inputs:
# original for harvesting but will report the symlink name as the source.
#prospector.scanner.symlinks: false

# If enabled, instead of relying on the device ID and inode values when comparing files,
# compare hashes of the given byte ranges in files. A file becomes an ingest target
# when its size grows to at least offset+length (see below). Until then it's ignored.
#prospector.scanner.fingerprint.enabled: false

# If fingerprint mode is enabled, sets the offset from the beginning of the file
# for the byte range used for computing the fingerprint value.
#prospector.scanner.fingerprint.offset: 0

# If fingerprint mode is enabled, sets the length of the byte range used for
# computing the fingerprint value. Cannot be less than 64 bytes.
#prospector.scanner.fingerprint.length: 1024

### Parsers configuration

#### JSON configuration
9 changes: 4 additions & 5 deletions filebeat/docs/inputs/input-common-file-options.asciidoc
@@ -77,10 +77,10 @@ certain criteria or time. Closing the harvester means closing the file handler.
If a file is updated after the harvester is closed, the file will be picked up
again after `scan_frequency` has elapsed. However, if the file is moved or
deleted while the harvester is closed, {beatname_uc} will not be able to pick up
-the file again, and any data that the harvester hasn't read will be lost.
-The `close_*` settings are applied synchronously when {beatname_uc} attempts
+the file again, and any data that the harvester hasn't read will be lost.
+The `close_*` settings are applied synchronously when {beatname_uc} attempts
to read from a file, meaning that if {beatname_uc} is in a blocked state
-due to blocked output, full queue or other issue, a file that would
+due to blocked output, full queue or other issue, a file that would
otherwise be closed remains open until {beatname_uc} once again attempts to read from the file.


@@ -240,7 +240,7 @@ that should be removed based on the `clean_inactive` setting. This happens
because {beatname_uc} doesn't remove the entries until it opens the registry
again to read a different file. If you are testing the `clean_inactive` setting,
make sure {beatname_uc} is configured to read from more than one file, or the
-file state will never be removed from the registry.
+file state will never be removed from the registry.

[float]
[id="{beatname_lc}-input-{type}-clean-removed"]
@@ -441,4 +441,3 @@ Set the location of the marker file the following way:
----
file_identity.inode_marker.path: /logs/.filebeat-marker
----

67 changes: 67 additions & 0 deletions filebeat/docs/inputs/input-filestream-file-options.asciidoc
@@ -146,6 +146,62 @@ stays open and constantly polls your files.

The default setting is 10s.

[float]
[id="{beatname_lc}-input-{type}-scan-fingerprint"]
===== `prospector.scanner.fingerprint`

Instead of relying on the device ID and inode values when comparing files, compare hashes of the given byte ranges of files.

Enable this option if you're experiencing data loss or data duplication due to unstable file identifiers provided by the file system.

The following are some scenarios where this can happen:

. Some file systems (e.g. in Docker) cache and re-use inodes
+
For example, if you:
+
.. Create a file (`touch x`)
.. Check the file's inode (`ls -i x`)
.. Delete the file (`rm x`)
.. Create a new file right away (`touch y`)
.. Check the inode of the new file (`ls -i y`)
+

For both files you might see the same inode value, even though they have different filenames.
+
. Non-Ext file systems can change inodes:
+
Ext file systems store the inode number in the `i_ino` field, inside a struct `inode`, which is written to disk. In this case, if the file is the same (not another file with the same name) then the inode number is guaranteed to be the same.
+
If the file system is not Ext, the inode number is generated by the inode operations defined by the file system driver. As these file systems have no native concept of an inode, they have to mimic all of the inode's internal fields to comply with VFS, so this number will probably be different after a reboot and, in theory, even after closing and reopening the file.
+
. Some file processing tools change inode values
+
Sometimes users unintentionally change inodes by using tools like `rsync` or `sed`.
+
. Some operating systems change device IDs after reboot
+
Depending on the mounting approach, the device ID (which is also used for comparing files) might change after a reboot.
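
The inode re-use experiment from the first scenario can also be run programmatically. The following is a rough sketch for Unix-like systems only (it relies on `syscall.Stat_t`); `createAndInode` and `inodeReuse` are hypothetical helper names, and whether the two inode values match depends entirely on the file system.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// createAndInode creates an empty file at path and returns its inode number.
// It relies on syscall.Stat_t and therefore only works on Unix-like systems.
func createAndInode(path string) (uint64, error) {
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return 0, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("no inode information for %s", path)
	}
	return uint64(st.Ino), nil
}

// inodeReuse creates x, records its inode, deletes it, then creates y right
// away and records its inode, all inside a throwaway temporary directory.
func inodeReuse() (uint64, uint64, error) {
	dir, err := os.MkdirTemp("", "inode-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	x := filepath.Join(dir, "x")
	first, err := createAndInode(x)
	if err != nil {
		return 0, 0, err
	}
	if err := os.Remove(x); err != nil {
		return 0, 0, err
	}
	second, err := createAndInode(filepath.Join(dir, "y"))
	if err != nil {
		return 0, 0, err
	}
	return first, second, nil
}

func main() {
	first, second, err := inodeReuse()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("x inode=%d, y inode=%d, reused=%t\n", first, second, first == second)
}
```

If `reused` is reported as true, an identity based on inode and device ID cannot tell the two files apart, while a content-based identity still can once the files hold different data.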

**Configuration**

Fingerprint mode is disabled by default.

WARNING: Enabling fingerprint mode delays ingesting new files until they grow to at least `offset`+`length` bytes in size, so they can be fingerprinted. Until then these files are ignored.

Normally, log lines contain timestamps and other unique fields, which makes them suitable for fingerprint mode,
but in every use case users should inspect their logs to determine the appropriate values for
the `offset` and `length` parameters. The default `offset` is `0` and the default `length` is `1024` (1 KB). `length` cannot be less than `64`.

[source,yaml]
----
fingerprint:
  enabled: false
  offset: 0
  length: 1024
----
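
The constraints described in this section (defaults of `0` and `1024`, a minimum `length` of `64`, and files being ignored until they reach `offset`+`length` bytes) could be expressed roughly as follows. This is only a sketch: `fingerprintConfig`, `validate` and `canIngest` are hypothetical names and do not claim to match the actual scanner code.

```go
package main

import "fmt"

// minFingerprintLength mirrors the documented lower bound for `length`.
const minFingerprintLength = 64

// fingerprintConfig is a hypothetical stand-in for the three documented
// settings under `prospector.scanner.fingerprint`.
type fingerprintConfig struct {
	Enabled bool
	Offset  int64
	Length  int64
}

// defaultFingerprintConfig returns the documented defaults.
func defaultFingerprintConfig() fingerprintConfig {
	return fingerprintConfig{Enabled: false, Offset: 0, Length: 1024}
}

// validate rejects a length below the documented minimum when the mode is on.
func (c fingerprintConfig) validate() error {
	if !c.Enabled {
		return nil
	}
	if c.Length < minFingerprintLength {
		return fmt.Errorf("fingerprint length cannot be less than %d bytes", minFingerprintLength)
	}
	return nil
}

// canIngest reports whether a file of the given size is an ingest target.
// With fingerprint mode enabled, the file must hold at least offset+length bytes.
func (c fingerprintConfig) canIngest(size int64) bool {
	return !c.Enabled || size >= c.Offset+c.Length
}

func main() {
	cfg := defaultFingerprintConfig()
	cfg.Enabled = true

	fmt.Println(cfg.validate())      // valid with the default length
	fmt.Println(cfg.canIngest(512))  // still too small, ignored for now
	fmt.Println(cfg.canIngest(4096)) // large enough to be fingerprinted

	cfg.Length = 32
	fmt.Println(cfg.validate()) // below the documented minimum
}
```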


[float]
[id="{beatname_lc}-input-{type}-ignore-older"]
===== `ignore_older`
@@ -502,6 +558,17 @@ Set the location of the marker file the following way:
file_identity.inode_marker.path: /logs/.filebeat-marker
----

*`fingerprint`*:: Identifies files based on a byte range of their content.

WARNING: In order to use this file identity option, you must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, or other settings) will lead to a global re-ingestion of all files that match the paths configuration of the input.

Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.

[source,yaml]
----
file_identity.fingerprint: ~
----
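
The re-ingestion warning above follows from how a content-based identity behaves: hashing a different byte range of the same file yields a different identity, so existing state no longer matches. A small sketch, assuming SHA-256 and a simplified key layout (`fingerprintOf` and `registryKey` are hypothetical helpers, and the real registry key format may differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprintOf hashes data[offset:offset+length]. SHA-256 is an assumption
// of this sketch. It reports false when data is shorter than offset+length.
func fingerprintOf(data []byte, offset, length int) (string, bool) {
	if offset < 0 || length < 0 || len(data) < offset+length {
		return "", false
	}
	sum := sha256.Sum256(data[offset : offset+length])
	return hex.EncodeToString(sum[:]), true
}

// registryKey builds a simplified, hypothetical state key from the input ID
// and the fingerprint.
func registryKey(inputID, fp string) string {
	return "filestream::" + inputID + "::" + fp
}

func main() {
	// Stand-in for the first bytes of a log file.
	data := make([]byte, 4096)
	for i := range data {
		data[i] = byte(i % 251)
	}

	before, _ := fingerprintOf(data, 0, 1024) // original configuration
	after, _ := fingerprintOf(data, 0, 2048)  // `length` changed later

	// The same file now maps to a different key, so it looks like a new file.
	fmt.Println(registryKey("my-input", before) == registryKey("my-input", after))
}
```

Because every file matched by the input gets a new key after such a change, all of them are treated as new and read again from the start.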

[[filestream-log-rotation-support]]
[float]
=== Log rotation
9 changes: 8 additions & 1 deletion filebeat/docs/inputs/input-filestream.asciidoc
@@ -95,7 +95,7 @@ device IDs. However, on network shares and cloud providers these
values might change during the lifetime of the file. If this happens
{beatname_uc} thinks that file is new and resends the whole content
of the file. To solve this problem you can configure `file_identity` option. Possible
-values besides the default `inode_deviceid` are `path` and `inode_marker`.
+values besides the default `inode_deviceid` are `path`, `inode_marker` and `fingerprint`.

WARNING: Changing `file_identity` methods between runs may result in
duplicated events in the output.
@@ -116,6 +116,13 @@ example oneliner generates a hidden marker file for the selected mountpoint `/lo
Please note that you should not use this option on Windows as file identifiers might be
more volatile.

Selecting `fingerprint` instructs {beatname_uc} to identify files based on their
content byte range.

WARNING: In order to use this file identity option, one must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, etc) will lead to a global re-ingestion of all files that match the paths configuration of the input.

Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.

["source","sh",subs="attributes"]
----
$ lsblk -o MOUNTPOINT,UUID | grep /logs | awk '{print $2}' >> /logs/.filebeat-marker
13 changes: 13 additions & 0 deletions filebeat/filebeat.reference.yml
@@ -707,6 +707,19 @@ filebeat.inputs:
# original for harvesting but will report the symlink name as the source.
#prospector.scanner.symlinks: false

# If enabled, instead of relying on the device ID and inode values when comparing files,
# compare hashes of the given byte ranges in files. A file becomes an ingest target
# when its size grows to at least offset+length (see below). Until then it's ignored.
#prospector.scanner.fingerprint.enabled: false

# If fingerprint mode is enabled, sets the offset from the beginning of the file
# for the byte range used for computing the fingerprint value.
#prospector.scanner.fingerprint.offset: 0

# If fingerprint mode is enabled, sets the length of the byte range used for
# computing the fingerprint value. Cannot be less than 64 bytes.
#prospector.scanner.fingerprint.length: 1024

### Parsers configuration

#### JSON configuration
1 change: 1 addition & 0 deletions filebeat/input/filestream/config.go
@@ -33,6 +33,7 @@ import (
type config struct {
	Reader readerConfig `config:",inline"`

	ID          string          `config:"id"`
	Paths       []string        `config:"paths"`
	Close       closerConfig    `config:"close"`
	FileWatcher *conf.Namespace `config:"prospector"`
4 changes: 3 additions & 1 deletion filebeat/input/filestream/copytruncate_prospector.go
@@ -329,7 +329,9 @@ func (p *copyTruncateFileProspector) onRotatedFile(
 		hg.Start(ctx, src)
 		return
 	}
-	originalSrc := p.identifier.GetSource(loginp.FSEvent{NewPath: originalPath, Info: fi})
+	descCopy := fe.Descriptor
+	descCopy.Info = fi
+	originalSrc := p.identifier.GetSource(loginp.FSEvent{NewPath: originalPath, Descriptor: descCopy})
 	p.rotatedFiles.addOriginalFile(originalPath, originalSrc)
 	p.rotatedFiles.addRotatedFile(originalPath, fe.NewPath, src)
 	hg.Start(ctx, src)
2 changes: 1 addition & 1 deletion filebeat/input/filestream/environment_test.go
@@ -374,7 +374,7 @@ func (e *inputTestingEnvironment) getRegistryState(key string) (registryEntry, e

 func getIDFromPath(filepath, inputID string, fi os.FileInfo) string {
 	identifier, _ := newINodeDeviceIdentifier(nil)
-	src := identifier.GetSource(loginp.FSEvent{Info: fi, Op: loginp.OpCreate, NewPath: filepath})
+	src := identifier.GetSource(loginp.FSEvent{Descriptor: loginp.FileDescriptor{Info: fi}, Op: loginp.OpCreate, NewPath: filepath})
 	return "filestream::" + inputID + "::" + src.Name()
 }
