Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-2512] Introduces TextIO.watchForNewFiles() and the Match transform #3607

Closed
wants to merge 3 commits into from

Conversation

jkff
Copy link
Contributor

@jkff jkff commented Jul 21, 2017

https://issues.apache.org/jira/browse/BEAM-2512

Part of http://s.apache.org/textio-sdf, based on http://s.apache.org/beam-watch-transform.

The Match transform can be useful for users who want to write their own file-based connectors, or for advanced use cases such as: watch for new subdirectories to appear in a directory (using Match), and then start watching each subdirectory for new files and reading them (using TextIO.watchForNewFiles()).

Additionally, finally makes it configurable whether TextIO.read/readAll() allow filepatterns matching no files, and adds a FileSystems utility for that.

Normal reads disallow empty filepatterns (to preserve old behavior), readAll() allows them if the filepattern contains a wildcard (which seems a reasonable default behavior that read() should have had from the beginning, but we can't change it), and watchForNewFiles() allows them unconditionally (because files might appear later).

R: @reuvenlax

@jkff jkff force-pushed the textio-read-watch-new-files branch 2 times, most recently from a8e180c to 6e2a6f5 Compare July 21, 2017 01:05
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 70.34% when pulling 6e2a6f5 on jkff:textio-read-watch-new-files into 1d9160f on apache:master.

@jkff jkff force-pushed the textio-read-watch-new-files branch from 6e2a6f5 to 1f33b25 Compare July 21, 2017 22:26
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 70.343% when pulling 1f33b25 on jkff:textio-read-watch-new-files into 81c2e90 on apache:master.

* Pipeline p = ...;
*
* PCollection<String> lines = p.apply(TextIO.read()
* .from("/local/path/to/files/*")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine a case where a new directory is generated every day, and you then want to start watching for new files in that directory. Can we model this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so - I added the Match transform to support this. Everything looks cleaner now.

@jkff jkff force-pushed the textio-read-watch-new-files branch 3 times, most recently from 03b6b05 to ebeaacf Compare August 1, 2017 22:47
@jkff jkff changed the title [BEAM-2512] Introduces TextIO.read/readAll().watchForNewFiles() [BEAM-2512] Introduces TextIO.watchForNewFiles() and the Match transform Aug 1, 2017
@coveralls
Copy link

Coverage Status

Changes Unknown when pulling ebeaacf on jkff:textio-read-watch-new-files into ** on apache:master**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling ebeaacf on jkff:textio-read-watch-new-files into ** on apache:master**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling ebeaacf on jkff:textio-read-watch-new-files into ** on apache:master**.

}
}
};
writer.start();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we start writer after the pipeline starts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - TestPipeline.run() waits for it to complete. The writer deliberately sleeps a bit in the beginning, to have a high-but-not-100% chance of the first poll hitting an empty set of files.

@jkff jkff force-pushed the textio-read-watch-new-files branch 2 times, most recently from 5ebd39c to c249a9d Compare August 3, 2017 22:38
@jkff
Copy link
Contributor Author

jkff commented Aug 3, 2017

Thanks! Refactored this into 3 commits for easier review, and cleaned up things a bit.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling c249a9d on jkff:textio-read-watch-new-files into ** on apache:master**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling c249a9d on jkff:textio-read-watch-new-files into ** on apache:master**.

@jkff
Copy link
Contributor Author

jkff commented Aug 4, 2017

Reuven verbally LGTMd this - I'm going to go ahead and merge.

Part of http://s.apache.org/textio-sdf, based on
http://s.apache.org/beam-watch-transform.

The Match transform can be useful for users who want to write their own
file-based connectors, or for advanced use cases such as: watch for new
subdirectories to appear in a directory (using Match), and then start
watching each subdirectory for new files and reading them
(using TextIO.watchForNewFiles()).

Additionally, finally makes it configurable whether TextIO.read/readAll()
allow filepatterns matching no files.

Normal reads disallow empty filepatterns (to preserve old behavior), readAll()
allows them if the filepattern contains a wildcard (which seems a reasonable
default behavior that read() should have had from the beginning, but we can't
change it), and watchForNewFiles() allows them unconditionally (because files
might appear later).
@jkff jkff force-pushed the textio-read-watch-new-files branch from c249a9d to b54883e Compare August 4, 2017 22:33
@asfgit asfgit closed this in df36bd9 Aug 4, 2017
@jkff jkff deleted the textio-read-watch-new-files branch August 5, 2017 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants