Add new fingerprint file identity (#35734)

This is a new alternative to existing options like `native`, `path` and `inode_marker`. Unlike the existing options, this file identity does not rely on any file system metadata and uses only the file size and its content. Users can specify how many bytes are used to fingerprint the beginning of each file and can optionally set an offset from the beginning. This identity is intended to be more stable and less affected by the users' environment or setup. This change also contains a few performance optimisations of how we work with the filesystem and watch for file changes.
elastic · Jul 14, 2023 · b701377
1 parent a755bbc
commit b701377
Showing 22 changed files with 1,422 additions and 529 deletions.
diff --git a/CHANGELOG.next.asciidoc b/CHANGELOG.next.asciidoc
@@ -347,7 +347,7 @@ automatic splitting at root level, if root level element is an array. {pull}3415
 - Improve CEL input performance. {pull}35915[35915]
 - Adding filename details from zip to response for httpjson {issue}33952[33952] {pull}34044[34044]
 - Add `clean_session` configuration setting for MQTT input.  {pull}35806[16204]
-
+- Add fingerprint mode for the filestream scanner and new file identity based on it {issue}34419[34419] {pull}35734[35734]
 
 *Auditbeat*
    - Migration of system/package module storage from gob encoding to flatbuffer encoding in bolt db. {pull}34817[34817]

diff --git a/filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl b/filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl
@@ -300,6 +300,19 @@ filebeat.inputs:
   # original for harvesting but will report the symlink name as the source.
   #prospector.scanner.symlinks: false
 
+  # If enabled, instead of relying on the device ID and inode values when comparing files,
+  # compare hashes of the given byte ranges in files. A file becomes an ingest target
+  # once its size reaches at least offset+length bytes (see below). Until then it's ignored.
+  #prospector.scanner.fingerprint.enabled: false
+
+  # If fingerprint mode is enabled, sets the offset from the beginning of the file
+  # for the byte range used for computing the fingerprint value.
+  #prospector.scanner.fingerprint.offset: 0
+
+  # If fingerprint mode is enabled, sets the length of the byte range used for
+  # computing the fingerprint value. Cannot be less than 64 bytes.
+  #prospector.scanner.fingerprint.length: 1024
+
   ### Parsers configuration
 
   #### JSON configuration

diff --git a/filebeat/docs/inputs/input-common-file-options.asciidoc b/filebeat/docs/inputs/input-common-file-options.asciidoc
@@ -77,10 +77,10 @@ certain criteria or time. Closing the harvester means closing the file handler.
 If a file is updated after the harvester is closed, the file will be picked up
 again after `scan_frequency` has elapsed. However, if the file is moved or
 deleted while the harvester is closed, {beatname_uc} will not be able to pick up
-the file again, and any data that the harvester hasn't read will be lost. 
-The `close_*` settings are applied synchronously when {beatname_uc} attempts 
+the file again, and any data that the harvester hasn't read will be lost.
+The `close_*` settings are applied synchronously when {beatname_uc} attempts
 to read from a file, meaning that if {beatname_uc} is in a blocked state
-due to blocked output, full queue or other issue, a file that would 
+due to blocked output, full queue or other issue, a file that would
 otherwise be closed remains open until {beatname_uc} once again attempts to read from the file.
 
 
@@ -240,7 +240,7 @@ that should be removed based on the `clean_inactive` setting. This happens
 because {beatname_uc} doesn't remove the entries until it opens the registry
 again to read a different file. If you are testing the `clean_inactive` setting,
 make sure {beatname_uc} is configured to read from more than one file, or the
-file state will never be removed from the registry. 
+file state will never be removed from the registry.
 
 [float]
 [id="{beatname_lc}-input-{type}-clean-removed"]
@@ -441,4 +441,3 @@ Set the location of the marker file the following way:
 ----
 file_identity.inode_marker.path: /logs/.filebeat-marker
 ----
-
diff --git a/filebeat/docs/inputs/input-filestream-file-options.asciidoc b/filebeat/docs/inputs/input-filestream-file-options.asciidoc
@@ -146,6 +146,62 @@ stays open and constantly polls your files.
 
 The default setting is 10s.
 
+[float]
+[id="{beatname_lc}-input-{type}-scan-fingerprint"]
+===== `prospector.scanner.fingerprint`
+
+Instead of relying on the device ID and inode values when comparing files, compare hashes of the given byte ranges of files.
+
+Enable this option if you're experiencing data loss or data duplication due to unstable file identifiers provided by the file system.
+
+Following are some scenarios where this can happen:
+
+. Some file systems (e.g. in Docker) cache and re-use inodes
++
+For example, if you:
++
+.. Create a file (`touch x`)
+.. Check the file's inode (`ls -i x`)
+.. Delete the file (`rm x`)
+.. Create a new file right away (`touch y`)
+.. Check the inode of the new file (`ls -i y`)
++
+
+For both files you might see the same inode value even though the filenames differ.
++
+. Non-Ext file systems can change inodes:
++
+Ext file systems store the inode number in the `i_ino` field, inside a struct `inode`, which is written to disk. In this case, if the file is the same (not another file with the same name) then the inode number is guaranteed to be the same.
++
+If the file system is other than Ext, the inode number is generated by the inode operations defined by the file system driver. As they don't have the concept of what an inode is, they have to mimic all of the inode's internal fields to comply with VFS, so this number will probably be different after a reboot, even after closing and opening the file again (theoretically).
++
+. Some file processing tools change inode values
++
+Sometimes users unintentionally change inodes by using tools like `rsync` or `sed`.
++
+. Some operating systems change device IDs after reboot
++
+Depending on the mounting approach, the device ID (which is also used for comparing files) might change after a reboot.
+
+**Configuration**
+
+Fingerprint mode is disabled by default.
+
+WARNING: Enabling fingerprint mode delays ingesting new files until they grow to at least `offset`+`length` bytes in size, so they can be fingerprinted. Until then these files are ignored.
+
+Normally, log lines contain timestamps and other unique fields that make them suitable for fingerprint mode,
+but users should inspect their logs in every use case to determine appropriate values for
+the `offset` and `length` parameters. Default `offset` is `0` and default `length` is `1024` (1 KB). `length` cannot be less than `64`.
+
+[source,yaml]
+----
+fingerprint:
+  enabled: false
+  offset: 0
+  length: 1024
+----
+
+
 [float]
 [id="{beatname_lc}-input-{type}-ignore-older"]
 ===== `ignore_older`
@@ -502,6 +558,17 @@ Set the location of the marker file the following way:
 file_identity.inode_marker.path: /logs/.filebeat-marker
 ----
 
+*`fingerprint`*:: To identify files based on their content byte range.
+
+WARNING: In order to use this file identity option, you must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, or other settings) will lead to a global re-ingestion of all files that match the paths configuration of the input.
+
+Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.
+
+[source,yaml]
+----
+file_identity.fingerprint: ~
+----
+
 [[filestream-log-rotation-support]]
 [float]
 === Log rotation

diff --git a/filebeat/docs/inputs/input-filestream.asciidoc b/filebeat/docs/inputs/input-filestream.asciidoc
@@ -95,7 +95,7 @@ device IDs. However, on network shares and cloud providers these
 values might change during the lifetime of the file. If this happens
 {beatname_uc} thinks that file is new and resends the whole content
 of the file. To solve this problem you can configure `file_identity` option. Possible
-values besides the default `inode_deviceid` are `path` and `inode_marker`.
+values besides the default `inode_deviceid` are `path`, `inode_marker` and `fingerprint`.
 
 WARNING: Changing `file_identity` methods between runs may result in
 duplicated events in the output.
@@ -116,6 +116,13 @@ example oneliner generates a hidden marker file for the selected mountpoint `/lo
 Please note that you should not use this option on Windows as file identifiers might be
 more volatile.
 
+Selecting `fingerprint` instructs {beatname_uc} to identify files based on their
+content byte range.
+
+WARNING: In order to use this file identity option, one must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, etc) will lead to a global re-ingestion of all files that match the paths configuration of the input.
+
+Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.
+
 ["source","sh",subs="attributes"]
 ----
 $ lsblk -o MOUNTPOINT,UUID | grep /logs | awk '{print $2}' >> /logs/.filebeat-marker

diff --git a/filebeat/filebeat.reference.yml b/filebeat/filebeat.reference.yml
@@ -707,6 +707,19 @@ filebeat.inputs:
   # original for harvesting but will report the symlink name as the source.
   #prospector.scanner.symlinks: false
 
+  # If enabled, instead of relying on the device ID and inode values when comparing files,
+  # compare hashes of the given byte ranges in files. A file becomes an ingest target
+  # once its size reaches at least offset+length bytes (see below). Until then it's ignored.
+  #prospector.scanner.fingerprint.enabled: false
+
+  # If fingerprint mode is enabled, sets the offset from the beginning of the file
+  # for the byte range used for computing the fingerprint value.
+  #prospector.scanner.fingerprint.offset: 0
+
+  # If fingerprint mode is enabled, sets the length of the byte range used for
+  # computing the fingerprint value. Cannot be less than 64 bytes.
+  #prospector.scanner.fingerprint.length: 1024
+
   ### Parsers configuration
 
   #### JSON configuration

diff --git a/filebeat/input/filestream/config.go b/filebeat/input/filestream/config.go
@@ -33,6 +33,7 @@ import (
 type config struct {
 	Reader readerConfig `config:",inline"`
 
+	ID             string             `config:"id"`
 	Paths          []string           `config:"paths"`
 	Close          closerConfig       `config:"close"`
 	FileWatcher    *conf.Namespace    `config:"prospector"`

diff --git a/filebeat/input/filestream/copytruncate_prospector.go b/filebeat/input/filestream/copytruncate_prospector.go
@@ -329,7 +329,9 @@ func (p *copyTruncateFileProspector) onRotatedFile(
 			hg.Start(ctx, src)
 			return
 		}
-		originalSrc := p.identifier.GetSource(loginp.FSEvent{NewPath: originalPath, Info: fi})
+		descCopy := fe.Descriptor
+		descCopy.Info = fi
+		originalSrc := p.identifier.GetSource(loginp.FSEvent{NewPath: originalPath, Descriptor: descCopy})
 		p.rotatedFiles.addOriginalFile(originalPath, originalSrc)
 		p.rotatedFiles.addRotatedFile(originalPath, fe.NewPath, src)
 		hg.Start(ctx, src)

diff --git a/filebeat/input/filestream/environment_test.go b/filebeat/input/filestream/environment_test.go
@@ -374,7 +374,7 @@ func (e *inputTestingEnvironment) getRegistryState(key string) (registryEntry, e
 
 func getIDFromPath(filepath, inputID string, fi os.FileInfo) string {
 	identifier, _ := newINodeDeviceIdentifier(nil)
-	src := identifier.GetSource(loginp.FSEvent{Info: fi, Op: loginp.OpCreate, NewPath: filepath})
+	src := identifier.GetSource(loginp.FSEvent{Descriptor: loginp.FileDescriptor{Info: fi}, Op: loginp.OpCreate, NewPath: filepath})
 	return "filestream::" + inputID + "::" + src.Name()
 }