Add new fingerprint file identity #35734

Merged on Jul 14, 2023 (22 commits)
Changes from 16 commits
1 change: 1 addition & 0 deletions CHANGELOG.next.asciidoc
@@ -345,6 +345,7 @@ automatic splitting at root level, if root level element is an array. {pull}3415
- Add device support for Azure AD entity analytics. {pull}35807[35807]
- Improve CEL input performance. {pull}35915[35915]
- Adding filename details from zip to response for httpjson {issue}33952[33952] {pull}34044[34044]
- Add fingerprint mode for the filestream scanner and new file identity based on it {issue}34419[34419] {pull}35734[35734]

*Auditbeat*
- Migration of system/package module storage from gob encoding to flatbuffer encoding in bolt db. {pull}34817[34817]
13 changes: 13 additions & 0 deletions filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl
@@ -300,6 +300,19 @@ filebeat.inputs:
# original for harvesting but will report the symlink name as the source.
#prospector.scanner.symlinks: false

# If enabled, instead of relying on the device ID and inode values when comparing files,
# compare hashes of the given byte ranges in files. A file becomes an ingest target
# once its size reaches at least offset+length (see below). Until then it's ignored.
#prospector.scanner.fingerprint.enabled: false

# If fingerprint mode is enabled, sets the offset from the beginning of the file
# for the byte range used for computing the fingerprint value.
#prospector.scanner.fingerprint.offset: 0

# If fingerprint mode is enabled, sets the length of the byte range used for
# computing the fingerprint value. Cannot be less than 64 bytes.
#prospector.scanner.fingerprint.length: 1024

### Parsers configuration

#### JSON configuration
9 changes: 4 additions & 5 deletions filebeat/docs/inputs/input-common-file-options.asciidoc
@@ -77,10 +77,10 @@ certain criteria or time. Closing the harvester means closing the file handler.
If a file is updated after the harvester is closed, the file will be picked up
again after `scan_frequency` has elapsed. However, if the file is moved or
deleted while the harvester is closed, {beatname_uc} will not be able to pick up
-the file again, and any data that the harvester hasn't read will be lost.
-The `close_*` settings are applied synchronously when {beatname_uc} attempts
+the file again, and any data that the harvester hasn't read will be lost.
+The `close_*` settings are applied synchronously when {beatname_uc} attempts
to read from a file, meaning that if {beatname_uc} is in a blocked state
-due to blocked output, full queue or other issue, a file that would
+due to blocked output, full queue or other issue, a file that would
otherwise be closed remains open until {beatname_uc} once again attempts to read from the file.


@@ -240,7 +240,7 @@ that should be removed based on the `clean_inactive` setting. This happens
because {beatname_uc} doesn't remove the entries until it opens the registry
again to read a different file. If you are testing the `clean_inactive` setting,
make sure {beatname_uc} is configured to read from more than one file, or the
-file state will never be removed from the registry.
+file state will never be removed from the registry.

[float]
[id="{beatname_lc}-input-{type}-clean-removed"]
@@ -441,4 +441,3 @@ Set the location of the marker file the following way:
----
file_identity.inode_marker.path: /logs/.filebeat-marker
----

67 changes: 67 additions & 0 deletions filebeat/docs/inputs/input-filestream-file-options.asciidoc
@@ -146,6 +146,62 @@ stays open and constantly polls your files.

The default setting is 10s.

[float]
[id="{beatname_lc}-input-{type}-scan-fingerprint"]
===== `prospector.scanner.fingerprint`

Instead of relying on the device ID and inode values when comparing files, compare hashes of the given byte ranges of files.

Enable this option if you're experiencing data loss or data duplication due to unstable file identifiers provided by the file system.

The following are some scenarios where this can happen:

. Some file systems (e.g. in Docker) cache and re-use inodes
+
For example, if you:
+
.. Create a file (`touch x`)
.. Check the file's inode (`ls -i x`)
.. Delete the file (`rm x`)
.. Create a new file right away (`touch y`)
.. Check the inode of the new file (`ls -i y`)
+
you might see the same inode value for both files, even though they have different filenames.
+
. Non-Ext file systems can change inodes
+
Ext file systems store the inode number in the `i_ino` field of `struct inode`, which is written to disk. In this case, if the file is the same (not another file with the same name), the inode number is guaranteed to be the same.
+
If the file system is not Ext, the inode number is generated by the inode operations defined by the file system driver. Because these file systems have no native concept of an inode, the driver has to mimic all of the inode's internal fields to comply with VFS, so the number will probably be different after a reboot and, in theory, even after closing and reopening the file.
+
. Some file processing tools change inode values
+
Sometimes users unintentionally change inodes by using tools like `rsync` or `sed`.
+
. Some operating systems change device IDs after reboot
+
Depending on the mounting approach, the device ID (which is also used for comparing files) might change after a reboot.

**Configuration**

Fingerprint mode is disabled by default.

WARNING: Enabling fingerprint mode delays ingesting new files until they grow to at least `offset`+`length` bytes in size, so they can be fingerprinted. Until then these files are ignored.

Normally, log lines contain timestamps and other unique fields, which makes them suitable for fingerprint mode.
Still, for every use case users should inspect their logs to determine appropriate values for
the `offset` and `length` parameters. The default `offset` is `0` and the default `length` is `1024` bytes (1 KB). `length` cannot be less than `64`.

[source,yaml]
----
fingerprint:
  enabled: false
  offset: 0
  length: 1024
----


[float]
[id="{beatname_lc}-input-{type}-ignore-older"]
===== `ignore_older`
@@ -502,6 +558,17 @@ Set the location of the marker file the following way:
file_identity.inode_marker.path: /logs/.filebeat-marker
----

*`fingerprint`*:: Identifies files based on a hash of a byte range of their content.

WARNING: In order to use this file identity option, you must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, or other settings) will lead to a global re-ingestion of all files that match the paths configuration of the input.

Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.

[source,yaml]
----
file_identity.fingerprint: ~
----

[[filestream-log-rotation-support]]
[float]
=== Log rotation
9 changes: 8 additions & 1 deletion filebeat/docs/inputs/input-filestream.asciidoc
@@ -95,7 +95,7 @@ device IDs. However, on network shares and cloud providers these
values might change during the lifetime of the file. If this happens
{beatname_uc} thinks that file is new and resends the whole content
of the file. To solve this problem you can configure `file_identity` option. Possible
-values besides the default `inode_deviceid` are `path` and `inode_marker`.
+values besides the default `inode_deviceid` are `path`, `inode_marker` and `fingerprint`.

WARNING: Changing `file_identity` methods between runs may result in
duplicated events in the output.
@@ -116,6 +116,13 @@ example oneliner generates a hidden marker file for the selected mountpoint `/lo
Please note that you should not use this option on Windows as file identifiers might be
more volatile.

Selecting `fingerprint` instructs {beatname_uc} to identify files based on their
content byte range.

WARNING: In order to use this file identity option, you must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, etc.) will lead to a global re-ingestion of all files that match the paths configuration of the input.

Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.

["source","sh",subs="attributes"]
----
$ lsblk -o MOUNTPOINT,UUID | grep /logs | awk '{print $2}' >> /logs/.filebeat-marker
13 changes: 13 additions & 0 deletions filebeat/filebeat.reference.yml
@@ -707,6 +707,19 @@ filebeat.inputs:
# original for harvesting but will report the symlink name as the source.
#prospector.scanner.symlinks: false

# If enabled, instead of relying on the device ID and inode values when comparing files,
# compare hashes of the given byte ranges in files. A file becomes an ingest target
# once its size reaches at least offset+length (see below). Until then it's ignored.
#prospector.scanner.fingerprint.enabled: false

# If fingerprint mode is enabled, sets the offset from the beginning of the file
# for the byte range used for computing the fingerprint value.
#prospector.scanner.fingerprint.offset: 0

# If fingerprint mode is enabled, sets the length of the byte range used for
# computing the fingerprint value. Cannot be less than 64 bytes.
#prospector.scanner.fingerprint.length: 1024

### Parsers configuration

#### JSON configuration
1 change: 1 addition & 0 deletions filebeat/input/filestream/config.go
@@ -33,6 +33,7 @@ import (
type config struct {
	Reader readerConfig `config:",inline"`

	ID string `config:"id"`

Member Author: Needed for better logging.

Contributor: It's not entirely clear what `id` represents: is it the identifier of the config object itself? The filestream input? Is it stable through restarts?

Member Author: This is the input ID in the input-level configuration. You can find it here: https://www.elastic.co/guide/en/beats/filebeat/current/configuration-filebeat-options.html#CO11-1

	Paths []string `config:"paths"`
	Close closerConfig `config:"close"`
	FileWatcher *conf.Namespace `config:"prospector"`
4 changes: 3 additions & 1 deletion filebeat/input/filestream/copytruncate_prospector.go
@@ -329,7 +329,9 @@ func (p *copyTruncateFileProspector) onRotatedFile(
hg.Start(ctx, src)
return
}
-originalSrc := p.identifier.GetSource(loginp.FSEvent{NewPath: originalPath, Info: fi})
+descCopy := fe.Descriptor
+descCopy.Info = fi
+originalSrc := p.identifier.GetSource(loginp.FSEvent{NewPath: originalPath, Descriptor: descCopy})
p.rotatedFiles.addOriginalFile(originalPath, originalSrc)
p.rotatedFiles.addRotatedFile(originalPath, fe.NewPath, src)
hg.Start(ctx, src)
2 changes: 1 addition & 1 deletion filebeat/input/filestream/environment_test.go
@@ -374,7 +374,7 @@ func (e *inputTestingEnvironment) getRegistryState(key string) (registryEntry, e

func getIDFromPath(filepath, inputID string, fi os.FileInfo) string {
	identifier, _ := newINodeDeviceIdentifier(nil)
-	src := identifier.GetSource(loginp.FSEvent{Info: fi, Op: loginp.OpCreate, NewPath: filepath})
+	src := identifier.GetSource(loginp.FSEvent{Descriptor: loginp.FileDescriptor{Info: fi}, Op: loginp.OpCreate, NewPath: filepath})
	return "filestream::" + inputID + "::" + src.Name()
}
