[GOBBLIN-706] Enable dynamic mappers #2576
Conversation
Can you also write some basic unit tests for this functionality?
@@ -617,6 +617,7 @@
/** Specifies a static location in HDFS to upload jars to. Useful for sharing jars across different Gobblin runs.*/
public static final String MR_JARS_DIR = "mr.jars.dir";
public static final String MR_JOB_MAX_MAPPERS_KEY = "mr.job.max.mappers";
public static final String TARGET_MAPPER_SIZE = "target.mapper.size";
Can you add the mr prefix to the key, similar to the other configuration keys? e.g. mr.target.mapper.size
double totalEstDataSize = KafkaWorkUnitPacker.getInstance(this, state).getWorkUnitEstSizes(workUnits);
LOG.info(String.format("The total estimated data size is %.2f", totalEstDataSize));
double targetMapperSize = state.getPropAsDouble(ConfigurationKeys.TARGET_MAPPER_SIZE);
numOfMultiWorkunits = (int) (totalEstDataSize / targetMapperSize);
You probably need to add 1 to this. If totalEstDataSize < targetMapperSize you would end up with 0 mappers :)
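A minimal sketch of the fix the reviewer is suggesting (the class and method names here are hypothetical, not part of the PR): the cast to int truncates toward zero, so adding 1 after the division guarantees at least one mapper even when the total estimated size is smaller than the target.

```java
// Hypothetical helper illustrating the reviewer's point; not code from the PR.
public class MapperMath {
    /** Mapper count from a truncating integer division, plus one. */
    public static int numMappers(double totalEstDataSize, double targetMapperSize) {
        // Without the +1, totalEstDataSize < targetMapperSize yields 0 mappers.
        return (int) (totalEstDataSize / targetMapperSize) + 1;
    }
}
```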
int numOfMultiWorkunits =
    state.getPropAsInt(ConfigurationKeys.MR_JOB_MAX_MAPPERS_KEY, ConfigurationKeys.DEFAULT_MR_JOB_MAX_MAPPERS);
if(state.contains(ConfigurationKeys.TARGET_MAPPER_SIZE)) {
  double totalEstDataSize = KafkaWorkUnitPacker.getInstance(this, state).getWorkUnitEstSizes(workUnits);
It would be better to instantiate KafkaWorkUnitPacker once in line 264, then use the same object both here and in line 270.
@@ -362,6 +362,15 @@ public static KafkaWorkUnitPacker getInstance(PackerType packerType, AbstractSou
    throw new IllegalArgumentException("WorkUnit packer type " + packerType + " not found");
  }
}
public double getWorkUnitEstSizes(Map<String, List<WorkUnit>> workUnitsByTopic){
Is there a reason we can't just use setWorkUnitEstSizes? The logic in here is almost identical.
The method setWorkUnitEstSizes is a protected method, so I just created a public method. This new method just calculates the size instead of setting the property on the workUnits.
You can make the other method public. In general, it is a bad idea to have the same logic duplicated in different locations.
Also, please add javadoc to any public methods.
double targetMapperSize = state.getPropAsDouble(ConfigurationKeys.MR_TARGET_MAPPER_SIZE);
numOfMultiWorkunits = (int) (totalEstDataSize / targetMapperSize);
numOfMultiWorkunits = numOfMultiWorkunits>maxMapperNum? maxMapperNum:numOfMultiWorkunits;
numOfMultiWorkunits = numOfMultiWorkunits==0? 1: numOfMultiWorkunits;
You actually need to add 1 in general. Since integer division truncates down, you always want to add 1. (e.g. total size is 1.99, but targetMapperSize is 1, the current algorithm would use 1 mapper).
LOG.info(String.format("The total estimated data size is %.2f", totalEstDataSize));
double targetMapperSize = state.getPropAsDouble(ConfigurationKeys.MR_TARGET_MAPPER_SIZE);
numOfMultiWorkunits = (int) (totalEstDataSize / targetMapperSize);
numOfMultiWorkunits = numOfMultiWorkunits>maxMapperNum? maxMapperNum:numOfMultiWorkunits;
nit: You can use Math.min(maxMapperNum, numOfMultiWorkunits)
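Combining this nit with the earlier "+1" comment, the whole computation collapses to two lines. This is a hypothetical rewrite for illustration, not code from the PR; the parameter names mirror the variables in the diff above.

```java
// Hypothetical rewrite combining the two review suggestions; not from the PR.
public class MapperClamp {
    public static int numOfMultiWorkunits(double totalEstDataSize,
                                          double targetMapperSize,
                                          int maxMapperNum) {
        int n = (int) (totalEstDataSize / targetMapperSize) + 1; // +1: truncation would allow 0
        return Math.min(maxMapperNum, n);                        // clamp to the configured maximum
    }
}
```

Math.min replaces the hand-rolled ternary, and the +1 makes the separate "== 0 ? 1" check unnecessary.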
+1 @htran1 can you merge?
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Today the number of mappers is hard-coded and almost always reached; however, many mappers do very little work. This tends to produce small files and short, overhead-dominated jobs. The improvement here is to let the user set a job property named target.mapper.size specifying the target mapper size, so that Gobblin dynamically scales the number of mappers up and down depending on the total load. If the property is not set, the number of mappers remains the value of mr.job.max.mappers.
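The described behavior can be sketched as follows. This is an illustration of the property semantics only, under the assumption of the pre-rename key names; Gobblin's actual implementation lives in its Kafka source and work-unit packer classes, and the class and method names here are hypothetical.

```java
import java.util.Properties;

// Hypothetical illustration of the described behavior; not Gobblin code.
public class DynamicMapperCount {
    public static int resolve(Properties jobProps, double totalEstDataSize, int defaultMaxMappers) {
        int maxMappers = Integer.parseInt(
            jobProps.getProperty("mr.job.max.mappers", String.valueOf(defaultMaxMappers)));
        String target = jobProps.getProperty("target.mapper.size");
        if (target == null) {
            return maxMappers; // property absent: keep the static mapper count
        }
        int n = (int) (totalEstDataSize / Double.parseDouble(target)) + 1;
        return Math.min(maxMappers, n); // scale with load, capped by the configured maximum
    }
}
```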
Tests
Commits