
[FLUME-3068] Ignore hidden files in TaildirMatcher #113

Closed
wants to merge 109 commits into from

Conversation

lingjinjiang

https://issues.apache.org/jira/browse/FLUME-3068
When using the Taildir Source to monitor a directory, hidden files in the directory are tailed by the source.
Since hidden files are often temporary files, config files, or other files that are not meant to be tailed, the Taildir Source should be able to ignore them.

@adenes
Contributor

adenes commented Mar 9, 2017

Hi @lingjinjiang,
Thank you for the patch. My only concern is that this change is not backward compatible, although I'm not totally sure that backward compatibility is necessary in this case.
On the other hand, it wouldn't take significant effort to introduce a new configuration option for this (which by default would still list hidden files).
What do you (and the others) think?

@lingjinjiang
Author

Hi @adenes ,
Thank you for your advice. I have added a new configuration option named "ignoreHiddenFile", with a default value of "false". When it is set to "true", hidden files will be ignored.
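The behavior can be illustrated with a minimal sketch (a hypothetical helper, not the actual TaildirMatcher code; it relies on java.io.File.isHidden(), which on Unix treats dot-files as hidden):

```java
import java.io.File;

// Hypothetical sketch of the check: when the ignoreHiddenFile flag is true,
// files the platform marks as hidden are skipped by the matcher.
public class HiddenFileFilter {
    // Returns true if the file should be tailed under the given setting.
    public static boolean shouldTail(File file, boolean ignoreHiddenFile) {
        if (ignoreHiddenFile && file.isHidden()) {
            return false; // skip hidden files (e.g. ".flume.tmp") when the flag is on
        }
        return true;
    }
}
```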

The default value is false; when set to true, hidden files will not be tailed
@adenes
Contributor

adenes commented Mar 30, 2017

Hi @lingjinjiang, thank you for the update on the patch and sorry for the delayed reply.
Could you please also add unit tests for the feature and update the documentation as well? Thank you.

@lingjinjiang
Author

Hi @adenes , I have added the unit tests and updated the documentation. Could you review it?

@asfgit

asfgit commented Aug 17, 2018

Can one of the admins verify this patch?

adenes and others added 22 commits January 14, 2019 13:23
If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block might
not end up in COMPLETE state. In this case block recovery should happen but as the lease is
still held by Flume the NameNode will start the recovery process only after the hard limit of
1 hour expires.

This change adds an explicit recoverLease() call in case of close failure.

This closes #127

Reviewers: Hari Shreedharan

(Denes Arvay via Bessenyei Balázs Donát)
This closes #126

Reviewers: Denes Arvay, Bessenyei Balázs Donát

(Marcell Hegedus via Bessenyei Balázs Donát)
When logging level is set to DEBUG, Kafka Sink and Kafka Channel may throw a NullPointerException.

This patch ensures that `metadata` is not null to avoid the exception.
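The guard can be sketched as follows (illustrative class and method names, not the actual Kafka Sink code; Kafka passes a null RecordMetadata to the send callback on failure):

```java
// Illustrative sketch of the fix: guard the debug logging against a null
// metadata object instead of dereferencing it unconditionally.
public class MetadataLogging {
    public static String describe(Object metadata) {
        // Before the fix, formatting metadata here would throw a
        // NullPointerException when the send had failed.
        if (metadata == null) {
            return "no metadata (send failed?)";
        }
        return "acked: " + metadata;
    }
}
```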

This closes #125

Reviewers: Denes Arvay, Bessenyei Balázs Donát

(loleek via Bessenyei Balázs Donát)
…r Source

This patch addresses an edge case of the Taildir Source wherein it can miss
reading events written in the same second as the file closing.

This closes #128

Reviewers: Satoshi Iijima, Bessenyei Balázs Donát

(eskrm via Bessenyei Balázs Donát)
…d to data loss

This commit fixes the issue when in HDFSEventSink.process() a BucketWriter.append()
call threw a BucketClosedException then the newly created BucketWriter wasn't
flushed after the processing loop.

This closes #129

Reviewers: Attila Simon, Mike Percy

(Denes Arvay via Mike Percy)
This patch adds the following new metrics to the FileChannel's counters:
- eventPutErrorCount: incremented if an IOException occurs during put operation.
- eventTakeErrorCount: incremented if an IOException or CorruptEventException occurs
  during take operation.
- checkpointWriteErrorCount: incremented if an exception occurs during checkpoint write.
- unhealthy: this flag represents whether the channel has started successfully
  (i.e. the replay ran without any problem) and is capable of normal operation
- closed flag: the numeric representation (1: closed, 0: open) of the negated open flag.
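The new counters can be sketched with a hypothetical holder class (names mirror the list above; this is not the actual FileChannelCounter implementation):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the new FileChannel metrics.
public class ChannelCounters {
    private final AtomicLong eventPutErrorCount = new AtomicLong();
    private final AtomicLong eventTakeErrorCount = new AtomicLong();
    private final AtomicLong checkpointWriteErrorCount = new AtomicLong();
    private volatile boolean unhealthy = false; // set if the startup replay fails
    private volatile boolean open = false;

    public void incrementEventPutErrorCount() { eventPutErrorCount.incrementAndGet(); }
    public void incrementEventTakeErrorCount() { eventTakeErrorCount.incrementAndGet(); }
    public void incrementCheckpointWriteErrorCount() { checkpointWriteErrorCount.incrementAndGet(); }
    public void setUnhealthy(boolean v) { unhealthy = v; }
    public void setOpen(boolean v) { open = v; }

    // "closed" is the numeric negation of the open flag: 1 = closed, 0 = open.
    public int getClosed() { return open ? 0 : 1; }
    public long getEventPutErrorCount() { return eventPutErrorCount.get(); }
    public boolean isUnhealthy() { return unhealthy; }
}
```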

Closes #131.

Reviewers: Attila Simon, Mike Percy

(Denes Arvay via Mike Percy)
…Sink

This patch adds the ability to use header substitution in the Kafka Sink's
kafka.topic configuration variable.
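A sketch of the idea: placeholders like %{headerName} in kafka.topic are replaced with the event's header values. (Flume's real implementation goes through its own substitution utilities; this is an illustrative stand-in.)

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical topic resolver: expands %{header} placeholders from event headers.
public class TopicResolver {
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\{(\\w+)\\}");

    public static String resolve(String topicTemplate, Map<String, String> headers) {
        Matcher m = PLACEHOLDER.matcher(topicTemplate);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = headers.get(m.group(1));
            // Fall back to the raw placeholder if the header is absent.
            m.appendReplacement(sb,
                Matcher.quoteReplacement(value != null ? value : m.group(0)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```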

This closes #137.

Reviewers: Denes Arvay

(Takafumi Saito via Denes Arvay)
- Removed the unsupported PermSize and MaxPermSize settings from .travis.yml
- Updated DEVNOTES, README and Flume User Guide
- Removed the maven-compiler-plugin from the taildir-source subproject
- Changed the sourceJavaVersion and targetJavaVersion to 1.8 in the root pom.xml

(Lior Zeno via Denes Arvay)
Log4jAppender and LoadBalancingLog4jAppender resolve local hosts address at startup and
add it to each event's header with the key "flume.client.log4j.address".

This closes #121.

(Andras Beni via Denes Arvay)
JMSSource has created only nondurable subscriptions which could lead to event loss in case
of topic destination type.

This change enables durable subscription creation and lets user specify client id.
Also removed JMSMessageConsumerFactory which has no additional value.

This closes #120.

Reviewers: Attila Simon, Denes Arvay

(Andras Beni via Denes Arvay)
This commit changes the Travis CI build config to use Java 8.

This closes #142.

Reviewers: Denes Arvay

(Attila Simon via Denes Arvay)
Cleanup after Netty initialisation fails (call this.stop())

- Make sure this.stop() releases the resources and leaves the component in
  a LifecycleAware.STOPPED state
- Added a junit test to cover the invalid host scenario
- Added a junit test to cover the used port scenario

This closes #141.

Reviewers: Denes Arvay

(Attila Simon via Denes Arvay)
This patch fixes the issue in NetcatSource which occurs if there is a problem
while binding the channel's socket to a local address and leads to a file descriptor
(socket) leak.

Reviewers: Attila Simon, Denes Arvay

(Siddharth Ahuja via Denes Arvay)
This patch adds a netcat UDP source.

Reviewers: Lior Zeno, Chris Horrocks, Bessenyei Balázs Donát

(Tristan Stevens via Bessenyei Balázs Donát)
Update Developer Guide with notes on how to upgrade Protocol Buffer
version.

Reviewers: Ashish Paliwal, Attila Simon

(Roshan Naik via Denes Arvay)
- Make avro ip filter tests more reliable by checking whether the
  caught exception is really what the test expected
- Use lambda instead of anonymous classes to make the code shorter

This closes #143.

Reviewers: Denes Arvay

(Attila Simon via Denes Arvay)
After a bad response, connection.getInputStream() returns null.
This patch adds a check for this.

This closes #139

Reviewers: Bessenyei Balázs Donát

(filippovmn via Bessenyei Balázs Donát)
The Flume User Guide does not specify whether a value in an event header can be null.
Given an external system that generates events whose header values can be null, a user
who configures Flume with the Memory Channel will have no trouble.
But if the user later switches from the Memory Channel to the File Channel, Flume will fail with an NPE.
This is because the File Channel serializes events with protocol buffers, and header values are
defined as required in the proto file.
In this patch I have changed the value field to optional. However, protocol buffers have no
notation for null, and setting a field to null raises an NPE again, so I added a null check
before serialization to prevent this.
There is one caveat: when an optional field is not set, it is deserialized to its
default value, in this case the empty string.
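The null check can be sketched like this (a hypothetical helper, not the actual protobuf serialization code; it mirrors the caveat that an unset optional field reads back as the empty string):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: replace null header values before serialization,
// since the generated protobuf setter throws NPE on null.
public class HeaderSanitizer {
    public static Map<String, String> sanitize(Map<String, String> headers) {
        Map<String, String> safe = new HashMap<>();
        for (Map.Entry<String, String> e : headers.entrySet()) {
            // Writing "" matches what a reader sees for an unset optional field.
            safe.put(e.getKey(), e.getValue() == null ? "" : e.getValue());
        }
        return safe;
    }
}
```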

Reviewers: Miklos Csanady

(Marcell Hegedus via Denes Arvay)
…eringInterceptor

- Use RegexFilteringInterceptor.class in LoggerFactory.getLogger() call
- Fix the Javadoc of the RegexFilteringInterceptor.Builder class

This closes #148

Reviewers: Attila Simon, Marcell Hegedus

(Peter Chen via Denes Arvay)
This commit extracts the version numbers from the subprojects'
pom.xml to the root pom.xml without introducing any other change
(i.e. the dependency tree didn't change)

This closes #132

Reviewers: Ferenc Szabo, Attila Simon

(Miklos Csanady via Denes Arvay)
This closes #149

Reviewers: Denes Arvay

(Miklos Csanady via Denes Arvay)
…ollection of messages

Log4jAppender treats Collection messages as a special case, making it possible
to log a Collection of events in one Log4j log call. The appender sends these
events to the receiving Flume instance as one batch with the
rpcClient.appendBatch() method.
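The special-casing can be sketched as follows (hypothetical helper with string events standing in for Flume events, not the actual appender code):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Illustrative sketch: a Collection message expands into one batch of events;
// any other message becomes a single-element batch.
public class BatchingSketch {
    public static List<String> toBatch(Object message) {
        List<String> batch = new ArrayList<>();
        if (message instanceof Collection<?>) {
            for (Object item : (Collection<?>) message) {
                batch.add(String.valueOf(item)); // one event per element
            }
        } else {
            batch.add(String.valueOf(message)); // single event, batch of one
        }
        return batch; // the real appender would pass this to rpcClient.appendBatch()
    }
}
```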

This closes #151

Reviewers: Ferenc Szabo, Miklos Csanady

(Denes Arvay via Denes Arvay)
szaboferee and others added 29 commits January 14, 2019 13:23
…code analyzer

Adding spotbugs, pmd to the build.
moving flume-checkstyle to a new build-support module that contains
any future build tool configuration.

Fixed some trivial checkstyle issues.

Updated apache parent version and maven plugin versions.
Added maxAllowedViolations so this commit could contain only the new checks
and not any code change.

Did some cleanup in the pom files.

This closes #232

Reviewers: Peter Turcsanyi, Endre Major

(Ferenc Szabo via Ferenc Szabo)
Fixing the documentation

(Ferenc Szabo via Ferenc Szabo)
when using default "maxIOWorkers" value.

Reviewers: Denes Arvay, Ferenc Szabo

(Takafumi Saito via Ferenc Szabo)
In the newer version of the Syslog message format (RFC-5424) the hostname
is not a mandatory header anymore so the Syslog client might not send it.
On the Flume side this is useful information that can be used
in interceptors or for event routing.
To keep this information, two new properties have been added to the Syslog
sources: clientIPHeader and clientHostnameHeader.
Flume users can define custom event header names through these parameters
for storing the IP address / hostname of the Syslog client in the Flume
event as headers.
The IP address / hostname are retrieved from the underlying network sockets,
not from the Syslog message.
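A sketch of how the socket's remote address could populate the configured headers (hypothetical helper; the property names clientIPHeader and clientHostnameHeader come from the description above):

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: fill user-configured header names with the client's
// IP and hostname taken from the socket's remote address.
public class ClientAddressHeaders {
    public static Map<String, String> build(InetSocketAddress remote,
                                            String ipHeader, String hostnameHeader) {
        Map<String, String> headers = new HashMap<>();
        if (ipHeader != null) {
            headers.put(ipHeader, remote.getAddress().getHostAddress());
        }
        if (hostnameHeader != null) {
            // May trigger a reverse DNS lookup, as with the real sources.
            headers.put(hostnameHeader, remote.getAddress().getCanonicalHostName());
        }
        return headers;
    }
}
```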

This change is based on the patch submitted by Jinjiang Ling which has been
rebased onto the current trunk and the review comments have been implemented.

This closes #234

Reviewers: Ferenc Szabo, Endre Major

(Peter Turcsanyi via Ferenc Szabo)
…n values.

Adding support for getFloat() and getDouble() on context
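A minimal sketch of what the new accessors add (Flume's Context stores string-valued properties; this simplified stand-in parses them, returning a default when the key is absent):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mini-Context illustrating the new numeric accessors.
public class MiniContext {
    private final Map<String, String> params = new HashMap<>();

    public void put(String key, String value) { params.put(key, value); }

    public Double getDouble(String key, Double defaultValue) {
        String v = params.get(key);
        return v == null ? defaultValue : Double.valueOf(v);
    }

    public Float getFloat(String key, Float defaultValue) {
        String v = params.get(key);
        return v == null ? defaultValue : Float.valueOf(v);
    }
}
```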

Reviewers: Mike Percy, Ferenc Szabo

(Hans Uhlig via Ferenc Szabo)
Reason: 2.8.9 has a vulnerability issue, fixed in 2.8.11+

This closes #236

Reviewers: Ferenc Szabo

(Peter Turcsanyi via Ferenc Szabo)
This is based on the contributions for FLUME-2653 regarding a new feature
for the hdfs sink.
Added a new parameter hdfs.emptyInUseSuffix to allow the output file name
to remain unchanged. See the user guide changes for details.
This is a feature desired by the community.

I added a new junit test case for testing.
I temporarily modified the old test cases in my IDE to use the new flag, and
they passed. I did this just as an extra check, to be on the safe side.
That change is not in this PR.

This closes #237

Reviewers: Peter Turcsanyi, Ferenc Szabo

(Endre Major via Ferenc Szabo)
Changed http sink to log with slf4j instead of log4j.
Changed some test to use slf4j as well.

This closes #233

Reviewers: Peter Turcsanyi, Endre Major

(Ferenc Szabo via Ferenc Szabo)
This has been tested with unit tests. The main difference that caused the most
problems is the consumer.poll(Duration) change. This does not block even when
it fetches metadata, whereas the previous poll(long timeout) blocked
indefinitely for metadata fetching.
This resulted in many test timing issues. I tried to make minimal changes to
the tests, just enough to make them pass.

Kafka 2.0 requires a higher version for slf4j, I had to update it to 1.7.25.

Option migrateZookeeperOffsets is deprecated in this PR.
This will allow us to get rid of Kafka server libraries in Flume.

Compatibility testing.
Modified the TestUtil to be able to use external servers. This way I could test
against a variety of Kafka Server versions using the normal unit tests.
Channel tests using 2.0.1 client:
Kafka_2.11_0.9.0.0 Not compatible
Kafka_2.11_0.10.0.0 Not compatible
Kafka_2.11_0.10.1.0 passed with TestPartition timeouts
(rerunning the single test passes, so it is a test isolation issue)
Kafka_2.11_0.10.2.0 passed with TestPartition timeouts
(rerunning the single test passes, so it is a test isolation issue)
Kafka_2.11-0.11.0.3 - timeouts in TestPartitions when creating topics
Kafka_2.11-1.0.2 - passed
Kafka_2.11-1.1.1 - passed
Kafka_2.11-2.0.1 - passed

This closes #235

Reviewers: Tristan Stevens, Ferenc Szabo, Peter Turcsanyi

(Endre Major via Ferenc Szabo)
Hadoop 1/2 profiles were obsolete and had not been used for a long time,
so they have been deleted.
HBase profile was always active, so its content has been moved to
top level.
Additional clean-ups: some version declarations moved to the parent pom,
redundant version declarations and exclusions deleted.

This closes #239

Reviewers: Ferenc Szabo

(Peter Turcsanyi via Ferenc Szabo)
Adding missing counter to KafkaSink

Reviewers: Denes Arvay, Attila Simon, Ferenc Szabo

(Udai Kiran Potluri via Ferenc Szabo)
It seems when solving https://issues.apache.org/jira/browse/FLUME-2799 ,
an oversight resulted in the message offset not being added to the header.

This change corrects this.

This closes #238

Reviewers: Ferenc Szabo, Peter Turcsanyi

(Jehan Bruggeman via Ferenc Szabo)
If there are multiple files in the path(s) that need to be tailed and one
file is written at high frequency, then Taildir can read batchSize events
from that file on every round. This can lead to an endless loop in which
Taildir only reads data from the busy file, while the other files are never
processed.
Another problem is that in this case TaildirSource will be unresponsive to
stop requests too.

This commit handles this situation by introducing a new config property called
maxBatchCount. It controls the number of batches being read consecutively
from the same file. After reading maxBatchCount rounds from a file, Taildir
will switch to another file / will have a break in the processing.

This change is based on hunshenshi's patch.
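The cap can be sketched like this (a hypothetical helper using a string-batch supplier in place of the real file reader, not the actual TaildirSource code):

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch of maxBatchCount: read at most that many batches from
// one file before switching, so a busy file cannot starve the others.
public class BatchLimiter {
    // reader yields one batch per call; an empty batch means the file is drained.
    public static int readWithCap(Supplier<List<String>> reader, int maxBatchCount,
                                  List<String> out) {
        int batches = 0;
        while (batches < maxBatchCount) {
            List<String> batch = reader.get();
            if (batch.isEmpty()) {
                break; // file drained before hitting the cap
            }
            out.addAll(batch);
            batches++;
        }
        return batches; // the caller moves on to the next file afterwards
    }
}
```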

This closes #240

Reviewers: Ferenc Szabo, Endre Major

(Peter Turcsanyi via Ferenc Szabo)
TaildirSource.process() implements the correct polling logic now. It returns
Status.READY / Status.BACKOFF which controls the common backoff sleeping
mechanism implemented in PollableSourceRunner.PollingRunner (instead of
always returning Status.READY and sleeping inside the method which was
an incorrect behaviour).

This closes #241

Reviewers: Endre Major, Denes Arvay

(Peter Turcsanyi via Ferenc Szabo)
The default hdfs.callTimeout used by the HDFS sink was too low (only 10 seconds),
which can cause problems on a busy system.
The new default is 30 seconds.
I think this parameter should be deprecated and a new, more error-tolerant
solution should be used. To enable that future change, I indicated this in the
code and in the User Guide.
Tested only with the unit tests.

This closes #243

Reviewers: Ferenc Szabo, Peter Turcsanyi

(Endre Major via Ferenc Szabo)
An update to the configuration section of the user guide.

This closes #246

Reviewers: Peter Turcsanyi, Ferenc Szabo

(Endre Major via Ferenc Szabo)
This PR adds a few tables to the User Guide that describe the metrics
published by sources, sinks and channels.
I used simple unix tools to gather the data then I wrote a small utility to
convert it to csv.
Then I used an online converter (https://www.tablesgenerator.com/) to generate
the rst tables, followed by a little manual editing.
I discovered some rst formatting problems in the FlumeUserGuide.rst,
corrected them, too.
It was a rather painful process to gather the data and find a decent
representation.
So far this PR only contains the end result. I would be happy to share the
utilities, just don't know what would be the best way.

This closes #242

Reviewers: Denes Arvay, Ferenc Szabo

(Endre Major via Ferenc Szabo)
KafkaChannel was missing some metrics:
  eventTakeAttemptCount, eventPutAttemptCount

This PR is based on the patch included in the issue that was the work
of Umesh Chaudhary.
I reworked the test a bit to use Mockito, and made some other minor
modifications to the test.

This closes #244

Reviewers: Peter Turcsanyi, Ferenc Szabo

(Endre Major via Ferenc Szabo)
Adding SHA-512 checksum generation to Maven
Removed deprecated checksums
Updated documentation

This closes #247

Reviewers: Endre Major, Peter Turcsanyi

(Ferenc Szabo via Ferenc Szabo)
…fig-filter

hadoop-common should be optional as in the hdfs-sink

This closes #248

Reviewers: Endre Major, Peter Turcsanyi

(Ferenc Szabo via Ferenc Szabo)
Moving log4j dependencies to test scope.
Adding log4j as dependency to flume-ng-dist to pack it in the binary tarball.

This closes #249

Reviewers: Endre Major, Peter Turcsanyi

(Ferenc Szabo via Ferenc Szabo)
Updating the LICENSE file.
Adding a helper script to dev-support

This closes #251

Reviewers: Endre Major, Denes Arvay

(Ferenc Szabo via Ferenc Szabo)
readlink -f is not portable

Reviewers: Bessenyei Balázs Donát, Peter Turcsanyi
This script was checked in without the execute bit set.

This closes #258

Reviewers: Bessenyei Balázs Donát, Peter Turcsanyi
The default value is false; when set to true, hidden files will not be tailed