[GOBBLIN-1001] Implement TimePartitionGlobFinder #2846
Codecov Report
@@ Coverage Diff @@
## master #2846 +/- ##
============================================
- Coverage 45.59% 45.57% -0.03%
- Complexity 8984 9031 +47
============================================
Files 1904 1908 +4
Lines 71347 71729 +382
Branches 7876 7912 +36
============================================
+ Hits 32534 32692 +158
- Misses 35806 36031 +225
+ Partials 3007 3006 -1
Continue to review full report at Codecov.
Why not use the existing glob finder with your time-partition pattern instead of inventing another finder? Did you look into other alternatives?
I don't remember exactly which finder, but I have a rough impression that something else serves the same purpose. Just curious.
@autumnust It does use, compositely use instead of inheriting, an existing GlobFinder, which is |
OK Gotcha. If |
@autumnust Yeah, By Another consideration was we have to make internal copies of open source compaction constructs( |
Discussed offline about the context and use-case, agreed on the approach to implement.
Several comments.
@@ -33,16 +33,6 @@ public DefaultFileSystemGlobFinder(FileSystem fs, Properties properties) throws
  }

  public FileSystemDataset datasetAtPath(final Path path) throws IOException {
    return new FileSystemDataset() {
There is more than one place that has this anonymous class implementation. If you want to replace it with the real implementation, shall we refactor them all?
Fixed PinotAuditCountVerifierTest. Its usage in CompactionSuiteBase will be replaced by a new ser/de mechanism.
import org.apache.hadoop.fs.Path;


public class EmptyFileSystemDataset extends SimpleFileSystemDataset {
This looks like overkill if the only thing you want is a different class. Shall we add a flag field to SimpleFileSystemDataset to identify an empty dataset instead?
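The flag-based alternative suggested here could look roughly like the sketch below. It is not the code from this PR: `DatasetSketch` is a hypothetical stand-in for Gobblin's `FileSystemDataset` interface, `java.nio.file.Path` replaces Hadoop's `Path` so the example compiles on its own, and the class name and constructor shapes are assumptions. The flag name follows the `isVirtual` naming that comes up later in this review.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for org.apache.gobblin.dataset.FileSystemDataset.
interface DatasetSketch {
  Path datasetRoot();
  String datasetURN();
}

// One class carrying a flag, instead of a separate EmptyFileSystemDataset subclass.
class FlaggedFileSystemDataset implements DatasetSketch {
  private final Path path;
  private final boolean isVirtual;

  // Datasets are assumed to be physical unless stated otherwise.
  FlaggedFileSystemDataset(Path path) {
    this(path, false);
  }

  FlaggedFileSystemDataset(Path path, boolean isVirtual) {
    this.path = path;
    this.isVirtual = isVirtual;
  }

  @Override
  public Path datasetRoot() {
    return path;
  }

  @Override
  public String datasetURN() {
    return path.toString();
  }

  /** @return true if the dataset doesn't have a physical file/folder */
  public boolean isVirtual() {
    return isVirtual;
  }
}

class FlaggedDatasetDemo {
  public static void main(String[] args) {
    FlaggedFileSystemDataset virtual =
        new FlaggedFileSystemDataset(Paths.get("/data/events/2019/12/31"), true);
    System.out.println(virtual.datasetURN() + " virtual=" + virtual.isVirtual());
  }
}
```

Callers that only care about physical data can then filter on the flag rather than on the concrete class.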
import org.apache.gobblin.dataset.FileSystemDataset;


public class SimpleFileSystemDataset implements FileSystemDataset {
Please add Javadoc explaining why we need it.
@Slf4j
public class TimePartitionGlobFinder implements DatasetsFinder<FileSystemDataset> {
Please add Javadoc to differentiate it from DefaultFileSystemGlobFinder if you decide not to override.
datasets.forEach(dataset -> yesterdayDatasetPartitions.add(createYesterdayDatasetPartition(dataset)));

// Find all dataset time partitions
List<FileSystemDataset> datasetPartitions = findDatasets(datasetPartitionPattern);
Could this code block be reduced to a single expression, like:
datasetPartitions.addAll(yesterdaypartitions.stream().filter(x -> !datasetPartitions.contains(x)).map(x -> createEmptyFileDataset(x)).collect(Collectors.toList()))
?
@autumnust I generalized the finder: it now supports look-back and creates a virtual partition dataset for any partition within the look-back window that doesn't have a physical folder.
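As a rough illustration of the look-back behaviour described in this reply, the sketch below computes the expected day partitions for a look-back window and marks those without a physical folder as virtual. The `yyyy/MM/dd` layout, a window that ends yesterday, the method and class names, and the use of plain path strings instead of dataset objects are all assumptions made for the example, not the finder's actual implementation.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class LookbackPartitionSketch {
  // Assumed daily partition layout; a real finder would read this from configuration.
  private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

  /**
   * Maps each day partition in the look-back window (oldest first, ending yesterday)
   * to true if it is virtual, i.e. it has no physical folder.
   */
  static Map<String, Boolean> partitionsInWindow(
      String datasetRoot, ZonedDateTime now, int lookbackDays, Set<String> physicalFolders) {
    Map<String, Boolean> result = new LinkedHashMap<>();
    for (int daysAgo = lookbackDays; daysAgo >= 1; daysAgo--) {
      String partition = datasetRoot + "/" + now.minusDays(daysAgo).format(DAY);
      result.put(partition, !physicalFolders.contains(partition));
    }
    return result;
  }

  public static void main(String[] args) {
    // "now" is passed in rather than read from the clock so the result is reproducible.
    ZonedDateTime now = ZonedDateTime.of(2020, 1, 3, 12, 0, 0, 0, ZoneId.of("UTC"));
    Set<String> physical = new HashSet<>(Arrays.asList("/data/events/2020/01/02"));
    partitionsInWindow("/data/events", now, 2, physical)
        .forEach((path, virtual) -> System.out.println(path + " virtual=" + virtual));
  }
}
```

Passing the current time as a parameter also makes the window logic testable without relying on the system clock.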
Minor comments
ZonedDateTime.now(ZoneId.of(properties.getProperty(TIME_ZONE, DEFAULT_TIME_ZONE))));
}

@VisibleForTesting
From what I learned before, this annotation doesn't help you test methods, and it doesn't even make the annotated method loaded in the JVM: https://stackoverflow.com/questions/24051476/guava-visiblefortesting-help-me-with-a-complete-example. It is mostly used for documentation purposes.
Can you double-check whether you need it?
Yeah, I'm aware of that. I meant to use it for documentation purposes unless we find an alternative.
List<FileSystemDataset> actualPartitions = findDatasets(datasetPartitionPattern);

String pathStr;
for (FileSystemDataset physicalPartition : actualPartitions) {
Isn't this for-loop an intersection of actualPartitions and computedPartitions, with a tweak on the element type?
Right. The loop also results in the diff between computedPartitions and actualPartitions.
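The intersection and difference readings of the loop can be made explicit with plain set operations. This is only a sketch over path strings: the loop in the PR works on FileSystemDataset objects, and the class and method names below are hypothetical, with the parameter names borrowed from this thread for readability.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class PartitionSetOpsSketch {
  /** Partitions that are both expected and physically present. */
  static Set<String> intersection(Set<String> computedPartitions, Set<String> actualPartitions) {
    Set<String> present = new TreeSet<>(computedPartitions);
    present.retainAll(actualPartitions);
    return present;
  }

  /** Partitions that are expected but have no physical folder. */
  static Set<String> difference(Set<String> computedPartitions, Set<String> actualPartitions) {
    Set<String> missing = new TreeSet<>(computedPartitions);
    missing.removeAll(actualPartitions);
    return missing;
  }

  public static void main(String[] args) {
    Set<String> computed = new TreeSet<>(Arrays.asList("2020/01/01", "2020/01/02", "2020/01/03"));
    Set<String> actual = new TreeSet<>(Arrays.asList("2020/01/02", "2019/12/31"));
    System.out.println("present: " + intersection(computed, actual));
    System.out.println("missing: " + difference(computed, actual));
  }
}
```

Both helpers copy the computed set before modifying it, so the inputs are left untouched.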
LGTM overall
@@ -46,6 +50,9 @@ public String datasetURN() {
    return path.toString();
  }

  /**
   * @return true if the dataset doesn't have a physical file/folder
   */
  public boolean getIsVirtual() {
Would isVirtual sound more intuitive?
+1
Closes apache#2846 from zxcware/comp
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
TimePartitionGlobFinder can create an empty file system dataset if a partition has empty data.
Tests
TimePartitionGlobFinder.testDayPartition covers the case where an empty day partition dataset is created when TimePartitionGlobFinder.enableEmptyPartition is true.
Commits