[GOBBLIN-1011] adjust compaction flow to work with virtual partition #2856

Closed
wants to merge 6 commits into from

@zxcware zxcware commented Dec 21, 2019

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR:
    • Update existing CompactionVerifiers and CompactionCompleteActions to work properly with virtual SimpleFileSystemDataset
    • Improve ser/de of FileSystemDataset in CompactionSuiteBase
    • Update gobblin-hive-registration to work with table parameters properly

Tests

  • My PR adds the following unit tests:
    • TimeIteratorTest covers functions in TimeIterator
    • AvroCompactionTaskTest.testCompactVirtualDataset verifies that existing compaction constructs handle virtual SimpleFileSystemDataset correctly
    • HiveMetaStoreUtilsTest.testGetTableAvro verifies that table parameters are loaded correctly

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

codecov-io commented Dec 21, 2019

Codecov Report

Merging #2856 into master will increase coverage by 0.03%.
The diff coverage is 73.43%.

@@             Coverage Diff              @@
##             master    #2856      +/-   ##
============================================
+ Coverage     45.57%   45.61%   +0.03%     
- Complexity     9032     9052      +20     
============================================
  Files          1908     1909       +1     
  Lines         71729    71766      +37     
  Branches       7912     7918       +6     
============================================
+ Hits          32693    32738      +45     
+ Misses        36031    36018      -13     
- Partials       3005     3010       +5

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...n/compaction/suite/CompactionSuiteBaseFactory.java | 100% <ø> (ø) | 2 <0> (ø) ⬇️ |
| ...che/gobblin/hive/metastore/HiveMetaStoreUtils.java | 31.69% <ø> (ø) | 12 <0> (ø) ⬇️ |
| ...lin/hive/metastore/HiveMetaStoreBasedRegister.java | 0% <0%> (ø) | 0 <0> (ø) ⬇️ |
| .../org/apache/gobblin/dataset/FileSystemDataset.java | 0% <0%> (ø) | 0 <0> (?) |
| ...ta/management/dataset/SimpleFileSystemDataset.java | 88.88% <100%> (ø) | 4 <1> (ø) ⬇️ |
| .../action/CompactionCompleteFileOperationAction.java | 63.49% <100%> (+1.19%) | 5 <2> (+1) ⬆️ |
| .../gobblin/compaction/suite/CompactionSuiteBase.java | 100% <100%> (ø) | 10 <5> (+2) ⬆️ |
| ...mpaction/action/CompactionMarkDirectoryAction.java | 46.15% <100%> (+4.48%) | 5 <2> (+1) ⬆️ |
| ...compaction/verify/CompactionThresholdVerifier.java | 75% <100%> (+2.27%) | 5 <0> (+1) ⬆️ |
| ...ction/action/CompactionHiveRegistrationAction.java | 36% <100%> (+5.56%) | 5 <2> (+1) ⬆️ |

... and 13 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0700092...2f33296. Read the comment docs.

@@ -46,14 +50,17 @@
@Slf4j
public class CompactionSuiteBase implements CompactionSuite<FileSystemDataset> {
public static final String SERIALIZE_COMPACTION_FILE_PATH_NAME = "compaction-file-path-name";
Contributor

Consider removing SERIALIZE_COMPACTION_FILE_PATH_NAME?

Contributor Author

+1

DateTime startTime = result.getTime();
DateTime endTime = startTime.plusHours(1);
CompactionPathParser.CompactionParserResult result = new CompactionPathParser(state).parse(dataset);
ZonedDateTime startTime = ZonedDateTime.ofInstant(Instant.ofEpochMilli(result.getTime().getMillis()), zone);
Contributor

I'm not quite sure we really need to convert to ZonedDateTime first. The current Joda DateTime already seems to support plusMinutes, plusHours, plusDays, and plusMonths.

Contributor Author

@zxcware zxcware Dec 23, 2019

Because TimeIterator.inc accepts ZonedDateTime. I implemented TimeIterator using java.time because Joda-Time is effectively superseded by it in Java 8+. From the Joda-Time project:

The standard date and time classes prior to Java SE 8 are poor. By tackling this problem head-on, Joda-Time became the de facto standard date and time library for Java prior to Java SE 8. Note that from Java SE 8 onwards, users are asked to migrate to java.time (JSR-310) - a core part of the JDK which replaces this project.
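
The conversion discussed in this thread can be sketched without a Joda-Time dependency by starting from the epoch milliseconds that Joda's `DateTime.getMillis()` returns. This is only an illustrative sketch: the class name, the helper name, and the UTC zone are assumptions, not code from this PR.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical helper class, not part of Gobblin.
class JodaToJavaTimeSketch {

  // Builds a ZonedDateTime from epoch milliseconds, i.e. the value
  // a Joda DateTime exposes through getMillis().
  static ZonedDateTime fromEpochMillis(long epochMillis, ZoneId zone) {
    return ZonedDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), zone);
  }

  public static void main(String[] args) {
    ZoneId zone = ZoneId.of("UTC"); // assumed zone for the example
    ZonedDateTime startTime = fromEpochMillis(0L, zone);
    // java.time offers the same plusHours/plusDays style arithmetic as Joda.
    ZonedDateTime endTime = startTime.plusHours(1);
    System.out.println(startTime + " -> " + endTime);
  }
}
```

Both libraries represent an instant as milliseconds since the epoch, so the round trip through `Instant` loses no information at millisecond precision.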

@@ -75,6 +78,8 @@ public String getProgress() {
public void run() {
try {
this.underlyingTask.run();
} catch (Exception e) {
log.error(String.format("Task %s completed with exception", this.taskContext.getTaskState().getTaskId()), e);
Contributor

Catching Exception here looks like an anti-pattern if the run method itself does not throw one. What is the reason for this change?

Contributor Author

I found that when underlyingTask threw an exception (in my case, an NPE), no message was logged at all: the exception was swallowed while bubbling up, and this wrapper continued to the finally block, marking the task as a success.
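
One way to keep the failure visible without hiding it from callers is to log and then rethrow, so the finally block still runs but the exception keeps propagating. The sketch below is a simplified stand-in, not the Gobblin wrapper itself: the class name, the `cleanedUp` flag, and the use of `System.err` instead of a logger are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper, loosely modeled on the task wrapper under review.
class LoggingTaskWrapperSketch implements Runnable {
  private final Runnable underlyingTask;
  // Stands in for whatever bookkeeping the real finally block performs.
  final AtomicBoolean cleanedUp = new AtomicBoolean(false);

  LoggingTaskWrapperSketch(Runnable underlyingTask) {
    this.underlyingTask = underlyingTask;
  }

  @Override
  public void run() {
    try {
      underlyingTask.run();
    } catch (RuntimeException e) {
      // Log the failure, then rethrow so the caller can mark the task as failed.
      System.err.println("Task completed with exception: " + e);
      throw e;
    } finally {
      // Runs on both success and failure.
      cleanedUp.set(true);
    }
  }

  public static void main(String[] args) {
    LoggingTaskWrapperSketch wrapper =
        new LoggingTaskWrapperSketch(() -> { throw new IllegalStateException("boom"); });
    try {
      wrapper.run();
    } catch (IllegalStateException e) {
      System.out.println("Caller saw the failure: " + e.getMessage());
    }
  }
}
```

Logging without rethrowing would reproduce the problem described above, because the caller would still see a normal return.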

@@ -473,6 +475,10 @@ public void dropTableIfExists(String dbName, String tableName) throws IOExceptio
public void dropPartitionIfExists(String dbName, String tableName, List<Column> partitionKeys,
List<String> partitionValues) throws IOException {
try (AutoReturnableObject<IMetaStoreClient> client = this.clientPool.getClient()) {
if (client.get().getPartition(dbName, tableName, partitionValues) == null) {
Contributor

  1. Could you clean up the catch block on NoSuchObjectException if you decide to use an existence check instead of relying on the exception to determine that there's no such object?
  2. I am not familiar with this getPartition API, but sending a list of partition values and getting back a single partition object seems a little strange. Can you double-check the semantics? Alternatively, there's a listPartitions method, which is the API I usually use for checking existence.

Contributor Author

It's still useful to catch NoSuchObjectException, since getPartition also throws it.

My understanding is that a partition has multiple partition columns, and each value in the list corresponds to one column.

Contributor Author

Verified with a test job that we still need this catch block to detect that a partition does not exist.
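
The relationship between partition keys and partition values described in this thread can be illustrated with plain collections: one value per partition column, in the same order, together identifying a single partition. The sketch below does not call the Hive metastore; the class name and the column names are hypothetical, and it ignores the character escaping that real Hive partition names apply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Hive or Gobblin code.
class PartitionSpecSketch {

  // Pairs each partition column with the value at the same index and joins
  // them in the key=value/key=value form commonly used for partition names.
  static String partitionName(List<String> keys, List<String> values) {
    if (keys.size() != values.size()) {
      throw new IllegalArgumentException("Expected one value per partition key");
    }
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < keys.size(); i++) {
      parts.add(keys.get(i) + "=" + values.get(i));
    }
    return String.join("/", parts);
  }

  public static void main(String[] args) {
    // Assumed partition columns for the example.
    List<String> keys = Arrays.asList("year", "month", "day");
    List<String> values = Arrays.asList("2019", "12", "21");
    System.out.println(partitionName(keys, values));
  }
}
```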


public Builder withPartitionKeys(List<Column> partitionKeys) {
this.partitionKeys = partitionKeys;
return this;
}

public Builder withTableParameters(Map<String, String> tableParameters) {
Contributor

There's a reason that TableParameters are not included in the builder: we internally store some timestamps and record counts in the table parameters. Including these values in the table object could cause diff-check failures and issue many more updatePartition calls to the Hive metastore. Could you double-check the potential write amplification with the relevant team?
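
The concern can be illustrated with a toy diff check: if volatile entries such as timestamps take part in the comparison, every registration looks like a change. The sketch below is not how Gobblin's diff check is implemented; the class, the method, and the parameter keys (`owner`, `lastUpdated`) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a parameter-aware diff check.
class TableParameterDiffSketch {

  // Returns true when the two parameter maps differ after dropping the keys
  // that are expected to change on every run.
  static boolean needsUpdate(Map<String, String> existing, Map<String, String> desired,
      Set<String> volatileKeys) {
    Map<String, String> left = new HashMap<>(existing);
    Map<String, String> right = new HashMap<>(desired);
    left.keySet().removeAll(volatileKeys);
    right.keySet().removeAll(volatileKeys);
    return !left.equals(right);
  }

  public static void main(String[] args) {
    Map<String, String> existing = new HashMap<>();
    existing.put("owner", "etl");
    existing.put("lastUpdated", "1576886400000");
    Map<String, String> desired = new HashMap<>(existing);
    desired.put("lastUpdated", "1576972800000");

    // Comparing everything flags an update on every run.
    System.out.println(needsUpdate(existing, desired, Collections.emptySet()));
    // Ignoring the volatile key avoids the extra metastore write.
    System.out.println(needsUpdate(existing, desired, Collections.singleton("lastUpdated")));
  }
}
```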

protected State state;
protected CompactionJobConfigurator configurator;
private static final Gson GSON = GsonInterfaceAdapter.getGson(FileSystemDataset.class);
private static final String SERIALIZED_DATASET = "compaction.serializedDataset";
Contributor

Please remove the old key name if you decide to replace it with a new one. Also, why this change?

.setConfiguration(TimePartitionGlobFinder.ENABLE_VIRTUAL_PARTITION, "true");

JobExecutionResult result = embeddedGobblin.run();
Assert.assertTrue(result.isSuccessful());
Contributor

Should we verify the contents of the execution output, beyond simply checking that the execution is successful?

@@ -130,7 +127,6 @@ public String datasetURN() {
* @return a map-reduce job which will compact files against {@link org.apache.gobblin.dataset.Dataset}
*/
public Job createJob (FileSystemDataset dataset) throws IOException {
configurator = CompactionJobConfigurator.instantiateConfigurator(this.state);
Contributor

It seems the configurator has to be instantiated lazily. Check the optionalInit method, which could inject overriding properties into the state object.

@@ -75,6 +78,8 @@ public String getProgress() {
public void run() {
try {
this.underlyingTask.run();
} catch (Exception e) {
Contributor

For compaction jobs, we already have logs in MRTask. Maybe we should propagate the exception back to GobblinMultiTaskAttempt.

Contributor Author

Removed.

Contributor

@autumnust autumnust left a comment

LGTM

@@ -73,6 +74,11 @@ public CompactionCompleteFileOperationAction (State state, CompactionJobConfigur
* Create a record count file containing the number of records that have been processed .
*/
public void onCompactionJobComplete (FileSystemDataset dataset) throws IOException {
if (dataset instanceof SimpleFileSystemDataset
Contributor

How about extending the FileSystemDataset interface with an isVirtual() method that defaults to false to avoid this specific instanceof and casting?

Contributor Author

+1
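
The suggestion above can be sketched with a Java 8 default method, which lets existing implementations stay unchanged while virtual datasets opt in. The interface and classes below are simplified stand-ins with hypothetical names, not the real Gobblin types.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the dataset interface (hypothetical name).
interface FileSystemDatasetSketch {
  String datasetURN();

  // Defaults to false, so existing implementations need no changes.
  default boolean isVirtual() {
    return false;
  }
}

// A regular dataset inherits the default.
class PhysicalDatasetSketch implements FileSystemDatasetSketch {
  private final String urn;
  PhysicalDatasetSketch(String urn) { this.urn = urn; }
  @Override public String datasetURN() { return urn; }
}

// A virtual dataset overrides the default.
class VirtualDatasetSketch implements FileSystemDatasetSketch {
  private final String urn;
  VirtualDatasetSketch(String urn) { this.urn = urn; }
  @Override public String datasetURN() { return urn; }
  @Override public boolean isVirtual() { return true; }
}

class IsVirtualDemo {
  // Callers branch on the interface method instead of instanceof plus a cast.
  static long countVirtual(List<FileSystemDatasetSketch> datasets) {
    return datasets.stream().filter(FileSystemDatasetSketch::isVirtual).count();
  }

  public static void main(String[] args) {
    List<FileSystemDatasetSketch> datasets = Arrays.asList(
        new PhysicalDatasetSketch("/data/a"), new VirtualDatasetSketch("/data/b"));
    System.out.println(countVirtual(datasets));
  }
}
```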

Contributor

@htran1 htran1 left a comment

+1

@asfgit asfgit closed this in 1de3f7e Dec 24, 2019
@zxcware zxcware deleted the comp2 branch February 7, 2020 01:11
haojiliu pushed a commit to haojiliu/incubator-gobblin that referenced this pull request Apr 9, 2020
jhsenjaliya pushed a commit to jhsenjaliya/incubator-gobblin that referenced this pull request Apr 26, 2020