
[GOBBLIN-288] Add finer-grain dynamic partition generation for Salesforce #2140

Closed
wants to merge 2 commits

Conversation


@htran1 htran1 commented Oct 13, 2017


Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
Added refinement of the daily-bucket histogram: buckets larger than a target size are split into smaller time ranges until every bucket falls under the target. This avoids timeout errors when a partition is too large.
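The refinement described above can be sketched as a self-contained Java example. Bucket, Counter, countBetween, and the parameter order are illustrative stand-ins, not the actual Gobblin classes or signatures:

```java
import java.util.ArrayList;
import java.util.List;

class BucketRefinement {
  static class Bucket {
    final long start, end, count;
    Bucket(long start, long end, long count) {
      this.start = start; this.end = end; this.count = count;
    }
  }

  // Stand-in for the probe query that counts records in [start, end).
  interface Counter {
    long countBetween(long start, long end);
  }

  // Split any bucket whose count exceeds the limit at its midpoint,
  // recursing until every emitted bucket is at or under the limit or the
  // time range can no longer be split.
  static void refine(long start, long end, long count, long limit,
                     Counter counter, List<Bucket> out) {
    if (count <= limit || end - start <= 1) {
      out.add(new Bucket(start, end, count));
      return;
    }
    long mid = start + (end - start) / 2;
    long leftCount = counter.countBetween(start, mid); // one probe per split
    refine(start, mid, leftCount, limit, counter, out);
    refine(mid, end, count - leftCount, limit, counter, out);
  }
}
```

Note that only the left half of each split is probed; the right half's count is derived by subtraction, halving the number of probe queries.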

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Updated partitioning test and ran Salesforce Contact flow.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"


htran1 commented Oct 13, 2017

@zxcware please review.

/**
* Split a histogram bucket along the midpoint if it is larger than the bucket size limit.
*/
private int getHistogramRecursively(TableCountProbingContext probingContext, Histogram histogram, StrSubstitutor sub,
@zxcware zxcware Oct 13, 2017

We can simplify the recursion as:

getHistogramRecursively(context, count, start, end, sub, values, outputHistogram) {
  if (count <= bucketSizeLimit || range cannot be split further) {  // base case
    outputHistogram.add(new HistogramGroup(...));
    return;
  }
  mid = midpoint of [start, end);
  int leftCount = queryCount(start, mid);  // one probe query per split
  getHistogramRecursively(context, leftCount, start, mid, sub, values, outputHistogram);
  getHistogramRecursively(context, count - leftCount, mid, end, sub, values, outputHistogram);
}

@htran1 (author):

Sure. The method was written to let the caller optionally provide the count, but since removing that flexibility simplifies the code, I'll drop that option.


log.info("Refining histogram with bucket size limit {}.", bucketSizeLimit);

final Iterator<HistogramGroup> it = histogram.getGroups().iterator();
@zxcware (reviewer):

We can simplify the logic by appending HistogramGroup(partition.highwatermark, 0).

@htran1 (author):

I'm going to copy the list and append the item, since I want to avoid mutating the input histogram; that still lets us avoid the special handling of the last group.
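A minimal sketch of that copy-and-append approach, with a simplified stand-in HistogramGroup (not the actual Gobblin class):

```java
import java.util.ArrayList;
import java.util.List;

class SentinelExample {
  static class HistogramGroup {
    final String key;
    final int count;
    HistogramGroup(String key, int count) { this.key = key; this.count = count; }
  }

  // Copy the input so the caller's histogram is never mutated, then append a
  // zero-count group at the high watermark; a loop that pairs adjacent group
  // boundaries then needs no special case for the last group.
  static List<HistogramGroup> withSentinel(List<HistogramGroup> groups, String highWatermark) {
    List<HistogramGroup> copy = new ArrayList<>(groups);
    copy.add(new HistogramGroup(highWatermark, 0));
    return copy;
  }
}
```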

while (elements.hasNext()) {
  element = elements.next().getAsJsonObject();
  histogram.add(new HistogramGroup(element.get("time").getAsString(), element.get("cnt").getAsInt()));
  String time = element.get("time").getAsString() + ZERO_TIME_SUFFIX;
@zxcware (reviewer):

In terms of the function itself, whether to append ZERO_TIME_SUFFIX really depends on the value of element.get("time"). My concern is that the function won't be correct if the time already carries a proper suffix.

The solution would be:

  1. Leverage Utils.toDateTimeFormat, given the input time format
2. Remove the function and move its logic into getHistogramByDayBucketing, since it's only used there.

@htran1 (author):

I'll change the name to make it clearer that this is only for parsing results from the day bucketing.
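One way to address the suffix concern, as a sketch: guard the append so a value that already carries a time-of-day component is left intact. The " 00:00:00" value for ZERO_TIME_SUFFIX is an assumption here; the real constant and the renamed helper live in the Gobblin source.

```java
class DayBucketTime {
  // Assumed value; the actual constant is defined in the Salesforce source.
  static final String ZERO_TIME_SUFFIX = " 00:00:00";

  // Append the zero-time suffix only when the day-bucket key has no
  // time-of-day component, so values that already have one are unchanged.
  static String toDayTimestamp(String time) {
    return time.contains(":") ? time : time + ZERO_TIME_SUFFIX;
  }
}
```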

String startTimeStr = Utils.dateToString(new Date(startTime), SalesforceExtractor.SALESFORCE_TIMESTAMP_FORMAT);
String endTimeStr = Utils.dateToString(new Date(endTime), SalesforceExtractor.SALESFORCE_TIMESTAMP_FORMAT);

subValues.put("start", startTimeStr);
@zxcware (reviewer):

We need to consider whether to include the start by comparing it with the global partition, which can be stored in the probingContext.

@htran1 (author):

Distinguishing inclusive/exclusive is not required for correctness here and would complicate the logic for no real gain.
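For illustration, formatting the probe range bounds and substituting them into a count query might look like the sketch below. The SOQL template, object and field names, and the ISO-8601 timestamp format are all assumptions; the real code uses Utils.dateToString with SalesforceExtractor.SALESFORCE_TIMESTAMP_FORMAT and StrSubstitutor:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class ProbeQueryExample {
  // Build a count query for one probed time range. The template and the
  // timestamp format below are illustrative assumptions only.
  static String probeQuery(long startMillis, long endMillis) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    String template = "SELECT COUNT() FROM Contact"
        + " WHERE SystemModstamp >= ${start} AND SystemModstamp < ${end}";
    return template.replace("${start}", fmt.format(new Date(startMillis)))
                   .replace("${end}", fmt.format(new Date(endMillis)));
  }
}
```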

}

// exchange the first partition point with the global low watermark
partitionPoints.set(0, Long.toString(lowWatermark));
@zxcware zxcware Oct 13, 2017

Instead of doing this, we can set the first HistogramGroup key to lowWatermark, which also filters out-of-scope records during fine probing.

@htran1 (author):

Sure, I'll move it to the point between the initial histogram generation (where the watermark is lost) and the refinement step.

private static final String MIN_TARGET_PARTITION_SIZE = "salesforce.minTargetPartitionSize";
private static final int DEFAULT_MIN_TARGET_PARTITION_SIZE = 250000;
private static final String PROBE_TARGET_RATIO = "salesforce.probeTargetRatio";
private static final double DEFAULT_PROBE_TARGET_RATIO = 0.60;
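Read back with their defaults, the two settings above would look like this sketch. The java.util.Properties pattern is illustrative (Gobblin reads configuration through its State API), but the key names and default values match the constants in the diff:

```java
import java.util.Properties;

class PartitionConfig {
  // Minimum record count to aim for per generated partition.
  static int minTargetPartitionSize(Properties props) {
    return Integer.parseInt(props.getProperty("salesforce.minTargetPartitionSize", "250000"));
  }

  // Ratio relating probed bucket size to the target partition size; the
  // exact semantics are documented in the source, per the review below.
  static double probeTargetRatio(Properties props) {
    return Double.parseDouble(props.getProperty("salesforce.probeTargetRatio", "0.60"));
  }
}
```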
@zxcware (reviewer):

This is a great idea! We should document it so that readers can understand its meaning.

@htran1 (author):

Added a comment.

@zxcware zxcware left a comment

LGTM

@abti abti left a comment

+1

@asfgit asfgit closed this in 626d312 Oct 14, 2017
zxliucmu pushed a commit to zxliucmu/incubator-gobblin that referenced this pull request Nov 16, 2017