[GOBBLIN-1011] adjust compaction flow to work with virtual partition #2856
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2856 +/- ##
============================================
+ Coverage 45.57% 45.61% +0.03%
- Complexity 9032 9052 +20
============================================
Files 1908 1909 +1
Lines 71729 71766 +37
Branches 7912 7918 +6
============================================
+ Hits 32693 32738 +45
+ Misses 36031 36018 -13
- Partials 3005 3010 +5
Continue to review full report at Codecov.
@@ -46,14 +50,17 @@
@Slf4j
public class CompactionSuiteBase implements CompactionSuite<FileSystemDataset> {
  public static final String SERIALIZE_COMPACTION_FILE_PATH_NAME = "compaction-file-path-name";
Consider removing this SERIALIZE_COMPACTION_FILE_PATH_NAME?
+1
DateTime startTime = result.getTime();
DateTime endTime = startTime.plusHours(1);
CompactionPathParser.CompactionParserResult result = new CompactionPathParser(state).parse(dataset);
ZonedDateTime startTime = ZonedDateTime.ofInstant(Instant.ofEpochMilli(result.getTime().getMillis()), zone);
I'm not quite sure we really need to convert to ZonedDateTime first. The current Joda DateTime already supports plusMinutes, plusHours, plusDays, and plusMonths.
Because `TimeIterator.inc` accepts `ZonedDateTime`. I implemented `TimeIterator` using java.time, since Joda-Time is effectively deprecated on Java 8+:
The standard date and time classes prior to Java SE 8 are poor. By tackling this problem head-on, Joda-Time became the de facto standard date and time library for Java prior to Java SE 8. Note that from Java SE 8 onwards, users are asked to migrate to java.time (JSR-310) - a core part of the JDK which replaces this project.
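The bridging step under discussion can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes only the epoch millis from Joda's `DateTime.getMillis()` and a target `ZoneId`, which is all `ZonedDateTime.ofInstant` needs.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class JodaBridgeSketch {
    // Convert epoch millis (as returned by Joda's DateTime.getMillis())
    // into a java.time ZonedDateTime in the given zone.
    public static ZonedDateTime toZoned(long epochMillis, ZoneId zone) {
        return ZonedDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), zone);
    }

    public static void main(String[] args) {
        ZoneId utc = ZoneId.of("UTC");
        ZonedDateTime start = toZoned(0L, utc);
        // java.time offers the same arithmetic Joda does (plusHours, plusDays, ...).
        ZonedDateTime end = start.plusHours(1);
        System.out.println(start);  // 1970-01-01T00:00Z[UTC]
        System.out.println(end);    // 1970-01-01T01:00Z[UTC]
    }
}
```

Once the boundary value is a `ZonedDateTime`, a java.time-based iterator can consume it directly without further Joda dependencies.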
@@ -75,6 +78,8 @@ public String getProgress() {
  public void run() {
    try {
      this.underlyingTask.run();
    } catch (Exception e) {
      log.error(String.format("Task %s completed with exception", this.taskContext.getTaskState().getTaskId()), e);
Catching the exception here seems to be an anti-pattern if the `run` method itself does not throw. I wonder what the reason for this change is?
I found that when `underlyingTask` threw an exception (in my case an NPE), there was no message at all (the exception was swallowed and never got a chance to be logged while bubbling up), and this wrapper continued to the `finally` block, marking the task as a success.
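The failure mode described above can be sketched in isolation. This is a hypothetical stand-in for the task wrapper (the class and `status` field are invented for illustration): the catch logs the exception before it is swallowed upstream, and rethrows so callers can mark the attempt failed instead of letting the `finally` block imply success.

```java
public class TaskWrapperSketch {
    static String status = "UNKNOWN";

    // Without the catch, an exception from run() would skip any logging
    // and the finally block would still run, masking the failure.
    public static void runTask(Runnable underlyingTask, String taskId) {
        try {
            underlyingTask.run();
        } catch (Exception e) {
            // Log before the exception can be swallowed further up the stack.
            System.err.println(String.format("Task %s completed with exception: %s", taskId, e));
            throw e;  // propagate so the caller does not mark the task successful
        } finally {
            status = "DONE";
        }
    }

    public static void main(String[] args) {
        try {
            runTask(() -> { throw new NullPointerException("boom"); }, "task_1");
        } catch (NullPointerException expected) {
            System.out.println("propagated");
        }
        System.out.println(status);  // DONE: finally ran either way
    }
}
```

The rethrow is the key design choice: logging alone still leaves the wrapper reporting success, which is exactly the behavior the review thread calls out.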
@@ -473,6 +475,10 @@ public void dropTableIfExists(String dbName, String tableName) throws IOException
public void dropPartitionIfExists(String dbName, String tableName, List<Column> partitionKeys,
    List<String> partitionValues) throws IOException {
  try (AutoReturnableObject<IMetaStoreClient> client = this.clientPool.getClient()) {
    if (client.get().getPartition(dbName, tableName, partitionValues) == null) {
- Could you clean up the catch block on `NoSuchObjectException` if you decide to use an existence check instead of relying on the exception to determine there's no such object?
- I am not familiar with this getPartition API, but sending a list of partition values while returning a single partition object seems a little strange. Can you double-check the semantics? Alternatively, there's a `listPartitions` method, which is usually the API I use for checking existence.
It's nice to catch `NoSuchObjectException`; `getPartition` also throws `NoSuchObjectException`.
My understanding is that a partition has multiple partition columns, and each value in the list corresponds to a column.
Verified with a test job that we also need this catch block to detect that the partition does not exist.
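The lookup semantics discussed above can be illustrated with a toy stand-in for the metastore (everything here is invented for illustration; the real client is `IMetaStoreClient`): the key is a list of partition values, one per partition column, and a miss surfaces as an exception rather than a null return, which is why the catch block is still needed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

public class PartitionLookupSketch {
    // Toy metastore: keys are partition-value lists, one value per
    // partition column (e.g. [year, month, day]); values are locations.
    public static boolean partitionExists(Map<List<String>, String> store,
                                          List<String> partitionValues) {
        try {
            String location = store.get(partitionValues);
            if (location == null) {
                // Mimic a metastore client that throws on a miss
                // instead of returning null.
                throw new NoSuchElementException("partition not found: " + partitionValues);
            }
            return true;
        } catch (NoSuchElementException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<List<String>, String> store = new HashMap<>();
        store.put(List.of("2020", "01", "15"), "/data/tracking/event/2020/01/15");
        System.out.println(partitionExists(store, List.of("2020", "01", "15")));  // true
        System.out.println(partitionExists(store, List.of("2020", "01", "16")));  // false
    }
}
```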
public Builder withPartitionKeys(List<Column> partitionKeys) {
  this.partitionKeys = partitionKeys;
  return this;
}

public Builder withTableParameters(Map<String, String> tableParameters) {
There's a reason that TableParameters are not included in the builder: we internally keep some timestamp and record-count values in the table parameters. Including these values in the table object could cause diff-check failures and issue many more updatePartition calls to the Hive metastore. Could you double-check the potential write amplification with the relevant team?
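One way to address the concern above is to drop volatile bookkeeping parameters before they enter the builder, so an equality diff-check does not churn on them. This is a sketch only; the parameter names below are examples of the kind of volatile keys the reviewer mentions, not a confirmed list from this codebase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class TableParamsSketch {
    // Example volatile keys (timestamps, record counts) that would make
    // two otherwise-identical table objects compare unequal.
    private static final Set<String> VOLATILE_PARAMS =
        Set.of("transient_lastDdlTime", "numRows");

    // Return a copy of the parameters with volatile entries removed,
    // suitable for a stable diff-check before issuing metastore updates.
    public static Map<String, String> stableParameters(Map<String, String> raw) {
        Map<String, String> stable = new HashMap<>(raw);
        stable.keySet().removeAll(VOLATILE_PARAMS);
        return stable;
    }

    public static void main(String[] args) {
        Map<String, String> raw = Map.of("numRows", "100", "owner", "gobblin");
        System.out.println(stableParameters(raw));  // only {owner=gobblin} survives
    }
}
```

Filtering before the diff-check keeps the builder usable while avoiding the extra updatePartition traffic the comment warns about.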
protected State state;
protected CompactionJobConfigurator configurator;
private static final Gson GSON = GsonInterfaceAdapter.getGson(FileSystemDataset.class);
private static final String SERIALIZED_DATASET = "compaction.serializedDataset";
Please remove the old key name if you decide to replace it with a new one. Also, why this change?
.setConfiguration(TimePartitionGlobFinder.ENABLE_VIRTUAL_PARTITION, "true");

JobExecutionResult result = embeddedGobblin.run();
Assert.assertTrue(result.isSuccessful());
Should we verify the contents of the execution output beyond simply checking that the job is successful?
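A minimal version of that extra verification might look like this. The helper and paths are hypothetical (the real test would inspect the compaction output directory); the point is only that asserting on the produced files catches regressions that a bare success flag misses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputVerificationSketch {
    // Hypothetical check: beyond the success flag, verify the job
    // actually produced a non-empty output file.
    public static boolean outputLooksValid(Path outputFile) throws IOException {
        return Files.exists(outputFile) && Files.size(outputFile) > 0;
    }

    public static void main(String[] args) throws IOException {
        // Stand in for a compaction output file with a temp file.
        Path tmp = Files.createTempFile("compaction-output", ".avro");
        Files.writeString(tmp, "record-data");
        System.out.println(outputLooksValid(tmp));  // true
    }
}
```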
@@ -130,7 +127,6 @@ public String datasetURN() {
 * @return a map-reduce job which will compact files against {@link org.apache.gobblin.dataset.Dataset}
 */
public Job createJob (FileSystemDataset dataset) throws IOException {
  configurator = CompactionJobConfigurator.instantiateConfigurator(this.state);
It seems the configurator has to be instantiated lazily. Check the `optionalInit` method, which can inject overriding properties into the state object.
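The lazy-instantiation pattern suggested here can be sketched generically (the class below is illustrative, not the project's code): deferring construction until first use means any properties injected into the state after the owner is built are still visible to the factory.

```java
import java.util.function.Supplier;

public class LazyConfiguratorSketch<T> {
    private final Supplier<T> factory;
    private T instance;

    public LazyConfiguratorSketch(Supplier<T> factory) {
        this.factory = factory;
    }

    // Build on first use, not at construction time, so the factory sees
    // the latest state (e.g. properties injected by an init hook).
    public synchronized T get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        LazyConfiguratorSketch<String> lazy = new LazyConfiguratorSketch<>(() -> {
            calls[0]++;
            return "configurator";
        });
        System.out.println(calls[0]);    // 0: nothing built yet
        System.out.println(lazy.get());  // builds now
        System.out.println(lazy.get());  // cached, factory not re-invoked
        System.out.println(calls[0]);    // 1
    }
}
```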
@@ -75,6 +78,8 @@ public String getProgress() {
  public void run() {
    try {
      this.underlyingTask.run();
    } catch (Exception e) {
For the compaction job, we already have logs in MRTask. Maybe we should propagate the exception back to GobblinMultiTaskAttempt.
Removed.
LGTM
@@ -73,6 +74,11 @@ public CompactionCompleteFileOperationAction (State state, CompactionJobConfigurator
 * Create a record count file containing the number of records that have been processed.
 */
public void onCompactionJobComplete (FileSystemDataset dataset) throws IOException {
  if (dataset instanceof SimpleFileSystemDataset
How about extending the FileSystemDataset interface with an isVirtual() method that defaults to false to avoid this specific instanceof and casting?
+1
+1
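The default-method suggestion can be sketched as follows. The interface and class here are simplified stand-ins for `FileSystemDataset` and `SimpleFileSystemDataset`, and the URN is invented; the point is that call sites branch on `isVirtual()` instead of `instanceof` plus a cast.

```java
public class VirtualDatasetSketch {
    interface Dataset {
        String datasetURN();
        // Concrete datasets inherit the default; only virtual ones override.
        default boolean isVirtual() { return false; }
    }

    static class VirtualDataset implements Dataset {
        public String datasetURN() { return "virtual:/data/tracking/event"; }
        @Override public boolean isVirtual() { return true; }
    }

    static class ConcreteDataset implements Dataset {
        public String datasetURN() { return "/data/tracking/event"; }
    }

    public static void main(String[] args) {
        Dataset virtual = new VirtualDataset();
        Dataset concrete = new ConcreteDataset();
        // No instanceof or casting needed at the call site.
        System.out.println(virtual.isVirtual());   // true
        System.out.println(concrete.isVirtual());  // false
    }
}
```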
Closes apache#2856 from zxcware/comp2
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
- Adjust `CompactionVerifier`s and `CompactionCompleteAction`s to work with a virtual simple file system dataset properly
- `FileSystemDataset` in `CompactionSuiteBase`

Tests

- `TimeIteratorTest` covers functions in `TimeIterator`
- `AvroCompactionTaskTest.testCompactVirtualDataset` covers that existing compaction constructs can handle a virtual `SimpleFileSystemDataset` correctly
- `HiveMetaStoreUtilsTest.testGetTableAvro` covers that table parameters are loaded correctly

Commits