New MapReduce API #743

milleruntime · 2018-11-01T12:57:06Z

I split the changes into a few commits. I first did some refactoring, then removed AccumuloMultiTableInputFormat. This InputFormat is overly complicated, and it seems either unnecessary or something that could be done better across Hadoop jobs. From there I created a new, simplified MapReduce API, which includes the OutputInfo, FileOutputInfo, and InputInfo fluent APIs for building options. Finally, I removed the Log4j logging and the setting of the log level.

This is a follow-on task to #712.
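For readers unfamiliar with the pattern, here is a minimal sketch of a staged fluent builder of the kind the description refers to. It is not the code in this PR: the class, interface, and method names below (`StagedInputInfoSketch`, `ClientStage`, `scanAuths`, and so on) are assumptions made up for illustration, and auths are simplified to a plain string.

```java
import java.util.Objects;
import java.util.Properties;

/** Hypothetical staged fluent builder; none of these names are the Accumulo API. */
final class StagedInputInfoSketch {

  /** Immutable settings produced by the builder chain. */
  static final class InputSettings {
    final Properties clientProps;
    final String table;
    final String auths;
    final boolean offlineScan;

    private InputSettings(Impl b) {
      this.clientProps = b.clientProps;
      this.table = b.table;
      this.auths = b.auths;
      this.offlineScan = b.offlineScan;
    }
  }

  // Each stage exposes only the next required call, so the compiler enforces the order.
  interface ClientStage { TableStage clientProperties(Properties props); }
  interface TableStage { AuthsStage table(String tableName); }
  interface AuthsStage { OptionsStage scanAuths(String auths); }

  /** Final stage: optional settings plus the terminal build(). */
  interface OptionsStage {
    OptionsStage offlineScan();
    InputSettings build();
  }

  /** Entry point: callers only ever see the stage interfaces. */
  static ClientStage builder() { return new Impl(); }

  private static final class Impl implements ClientStage, TableStage, AuthsStage, OptionsStage {
    private Properties clientProps;
    private String table;
    private String auths;
    private boolean offlineScan;

    @Override public TableStage clientProperties(Properties props) {
      this.clientProps = Objects.requireNonNull(props, "props");
      return this;
    }

    @Override public AuthsStage table(String tableName) {
      this.table = Objects.requireNonNull(tableName, "tableName");
      return this;
    }

    @Override public OptionsStage scanAuths(String auths) {
      this.auths = Objects.requireNonNull(auths, "auths");
      return this;
    }

    @Override public OptionsStage offlineScan() {
      this.offlineScan = true;
      return this;
    }

    @Override public InputSettings build() { return new InputSettings(this); }
  }

  public static void main(String[] args) {
    InputSettings s = builder()
        .clientProperties(new Properties())
        .table("t1")
        .scanAuths("public")
        .offlineScan()
        .build();
    System.out.println(s.table + " " + s.auths + " offline=" + s.offlineScan);
  }
}
```

The point of the stages is that required parameters cannot be skipped: `builder().table("t1")` does not compile in this sketch, because `ClientStage` only offers `clientProperties`.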

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapred/AccumuloFileOutputFormat.java

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapred/AccumuloInputFormat.java

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/AccumuloInputFormat.java

ctubbsii · 2018-11-01T20:38:21Z

test/pom.xml

+    <dependency>
+      <groupId>org.apache.accumulo</groupId>
+      <artifactId>accumulo-hadoop-mapreduce</artifactId>
+    </dependency>


We shouldn't add this here. The mapreduce module can contain its own ITs, which have a test dependency on accumulo-test for the testing framework.

ctubbsii · 2018-11-01T20:41:27Z

test/src/main/java/org/apache/accumulo/test/mapred/AccumuloRowInputFormatIT.java

 import org.apache.accumulo.core.security.ColumnVisibility;
 import org.apache.accumulo.core.util.PeekingIterator;
+import org.apache.accumulo.hadoop.mapred.AccumuloRowInputFormat;
+import org.apache.accumulo.hadoop.mapreduce.InputInfo;


This test need not be modified. It currently tests the core mapreduce code and should continue to do so. Rather than modifying them here, these mapred tests should be copied into the new module's src/test/java directory and modified to run there.

ctubbsii · 2018-11-01T20:43:52Z

...mapreduce/src/main/java/org/apache/accumulo/hadoop/mapred/AccumuloMultiTableInputFormat.java

@@ -1,99 +0,0 @@
-/*


I do not think this should be removed, along with its functionality. I seem to recall multiple table scanning to be a high-demand feature when this was first added. However, it is redundant with AccumuloInputFormat, since that is a trivial case of this one. I think the AccumuloInputFormat should be modified to support multiple tables, so we can eliminate one of these, but still keep the ability to scan multiple tables.

OK I will create a follow on ticket so we don't lose the multi table functionality. It shouldn't be too bad to add it as an option to InputInfo.

Created #749
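To illustrate the "single table is a trivial case of multi-table" argument above, here is a rough sketch in which one input configuration holds a map from table name to per-table scan settings. All names (`MultiTableInputSketch`, `TableScanConfig`, `InputConfig`) are hypothetical and are not taken from this PR or from #749; auths are simplified to a string.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one input config that covers one or many tables. */
final class MultiTableInputSketch {

  /** Per-table scan settings (simplified for illustration). */
  static final class TableScanConfig {
    final String auths;
    final boolean offlineScan;

    TableScanConfig(String auths, boolean offlineScan) {
      this.auths = auths;
      this.offlineScan = offlineScan;
    }
  }

  /** Holds per-table settings keyed by table name, in insertion order. */
  static final class InputConfig {
    private final Map<String, TableScanConfig> tables = new LinkedHashMap<>();

    /** Adds (or replaces) the settings for one table and returns this for chaining. */
    InputConfig table(String name, TableScanConfig cfg) {
      tables.put(name, cfg);
      return this;
    }

    /** Read-only view of the configured tables. */
    Map<String, TableScanConfig> tables() {
      return Collections.unmodifiableMap(tables);
    }
  }

  /** The single-table case is just a multi-table config with one entry. */
  static InputConfig singleTable(String name, TableScanConfig cfg) {
    return new InputConfig().table(name, cfg);
  }

  public static void main(String[] args) {
    InputConfig one = singleTable("t1", new TableScanConfig("public", false));
    InputConfig many = new InputConfig()
        .table("t1", new TableScanConfig("public", false))
        .table("t2", new TableScanConfig("private", true));
    System.out.println(one.tables().keySet() + " " + many.tables().keySet());
  }
}
```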

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/InputInfo.java

keith-turner · 2018-11-02T15:08:51Z

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/OutputInfo.java

+       * @param clientProps
+       *          Accumulo connection information
+       */
+      OutputOptions clientProperties(Properties clientProps);


If there are batch writer props, will they be respected? Or is this currently just used to connect?

I added this just for connection info. I could remove it since we have ClientParams.clientInfo(clientInfo)

I think clientInfo can also have scanner and/or batch writer config. If that config is ignored, we need to document it. The best thing would be to respect the config from clientInfo; for whatever can be set through clientInfo, there is no need to have MapReduce APIs.

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/OutputInfo.java

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/InputInfo.java

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoopImpl/mapreduce/InputInfoImpl.java

mikewalch · 2018-11-06T16:15:40Z

hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/OutputInfo.java

+       * @param bwConfig
+       *          the configuration for the {@link BatchWriter}
+       */
+      OutputOptions batchWriterOptions(BatchWriterConfig bwConfig);


Users can set BatchWriterConfig when they use the AccumuloClient builder to build ClientInfo. Therefore, this method could be removed.

I think this is a good idea since we could reduce what goes into the API. My only concern is that it will be confusing which options should be set through which builder.

It's up to you if you want to remove it, but if you keep it you should tell users what happens if they set the config in both places (i.e. tell them this method takes precedence over the method in the AccumuloClient builder) and make sure this actually occurs in the implementation.

With the current state of the code, it would take some effort under the hood to redo how the batch writer config is stored and serialized. I think I will make this a follow on blocker.

Created #751
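The precedence rule discussed in this thread could look roughly like the sketch below: an explicit setting on the MapReduce builder wins, otherwise the value from the client properties is respected, otherwise a default applies. The property key, the default value, and the method names are placeholders invented for this example; they are not the real Accumulo property names or the implementation tracked in #751.

```java
import java.util.OptionalLong;
import java.util.Properties;

/** Hypothetical sketch of resolving one batch writer setting from two sources. */
final class BatchWriterPrecedenceSketch {

  // Placeholder key and default, invented for this example.
  static final String MAX_MEMORY_KEY = "example.batch.writer.memory.max";
  static final long DEFAULT_MAX_MEMORY = 50L * 1024 * 1024;

  /**
   * Resolution order: explicit override, then client properties, then the default.
   *
   * @param clientProps connection properties that may also carry batch writer settings
   * @param explicit value set directly on the MapReduce builder, if any
   */
  static long resolveMaxMemory(Properties clientProps, OptionalLong explicit) {
    if (explicit.isPresent()) {
      return explicit.getAsLong();
    }
    String fromProps = clientProps.getProperty(MAX_MEMORY_KEY);
    return fromProps != null ? Long.parseLong(fromProps) : DEFAULT_MAX_MEMORY;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(MAX_MEMORY_KEY, "1024");
    System.out.println(resolveMaxMemory(props, OptionalLong.empty()));
    System.out.println(resolveMaxMemory(props, OptionalLong.of(2048L)));
  }
}
```

Whichever order is chosen, keeping the resolution in one place makes it straightforward to document and to test.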

milleruntime · 2018-11-07T00:30:09Z

I created follow on tickets so this PR doesn't become too big... @mikewalch @ctubbsii @keith-turner OK with merging this PR as the next step in the API evolution?

* Created OutputInfo, FileOutputInfo, InputInfo fluent API for building options
* Updated unit tests in hadoop-mapreduce to use new API
* Moved AccumuloOutputFormat methods to AccumuloOutputFormatImpl
* Made all previous static API methods protected
* Made AccumuloRowInputFormatIT use new API since it has no unit test
* Created NewAccumuloInputFormatIT to test new API
* Added log4j file for tests

ctubbsii

I like the direction this is headed, but I think that the entry point for the fluent API should dangle off of the Input/OutputFormats directly, rather than off a separate object, and then passed in via a setter.

milleruntime · 2018-11-07T17:37:04Z

> I think that the entry point for the fluent API should dangle off of the Input/OutputFormats directly, rather than off a separate object, and then passed in via a setter.

This is a good idea. I was thinking something similar while modifying the javadoc that it could be even better. I will merge the progress here and open up another follow on.
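As a sketch of the suggestion being agreed to here, the fluent chain can start from a static method on the format class and end with a call that writes the settings into the job, so no separate info object or setter is needed. Everything below is hypothetical: `ExampleInputFormat`, `configure()`, `store(...)`, the `FakeJob` stand-in for a Hadoop job, and the configuration keys are assumptions for illustration, not the API in this PR or in #753.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a fluent entry point hanging off the format class. */
final class FormatEntryPointSketch {

  /** Stand-in for a Hadoop job; only holds string configuration. */
  static final class FakeJob {
    final Map<String, String> conf = new HashMap<>();
  }

  /** Hypothetical format class whose static method starts the fluent chain. */
  static final class ExampleInputFormat {

    static Options configure() {
      return new Options();
    }

    static final class Options {
      private String table;
      private String auths = "";

      private Options() {}

      Options table(String tableName) {
        this.table = tableName;
        return this;
      }

      Options auths(String auths) {
        this.auths = auths;
        return this;
      }

      /** Terminal step: validates and writes the settings into the job configuration. */
      void store(FakeJob job) {
        if (table == null) {
          throw new IllegalStateException("table is required");
        }
        job.conf.put("example.input.table", table);
        job.conf.put("example.input.auths", auths);
      }
    }
  }

  public static void main(String[] args) {
    FakeJob job = new FakeJob();
    ExampleInputFormat.configure().table("t1").auths("public").store(job);
    System.out.println(job.conf.get("example.input.table"));
  }
}
```

Compared with building an info object and passing it to a setter, this keeps discovery in one place: a user who knows the format class can find every option from it.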

milleruntime · 2018-11-07T18:14:44Z

Created follow on #753

keith-turner reviewed Nov 1, 2018

ctubbsii reviewed Nov 1, 2018

milleruntime force-pushed the mr-refactor branch from d2939b4 to a44a920 Compare November 1, 2018 20:46

keith-turner reviewed Nov 2, 2018

mikewalch reviewed Nov 6, 2018

milleruntime force-pushed the mr-refactor branch from fbd8b6c to 4ba9372 Compare November 6, 2018 17:11

This was referenced Nov 6, 2018

Add Multi table functionality to new MapReduce API #749

Closed

Deduplicate batch writer config in MapReduce API #751

Closed

keith-turner approved these changes Nov 7, 2018

milleruntime added 15 commits November 7, 2018 10:56

Move MR classes out of API into hadoopImpl

33d83eb

Remove support for AccumuloMultiTableInputFormat

5287bcb

Remove Log4j

ddcab30

Remove deprecated methods and broken javadoc

8586c79

Make top level classes extend hadoop classes

9c9dc3e

Add since javadoc

99034c0

Move new ITs out of test

c5980fd

Add mini test dep

80d68b7

PR updates

4861a4b

Multiple fixes

6d17192

Add checks to fetchColumns

4fb205d

More fixes

6bce4e0

Imports and format

465abfc

Remove unused getClientProperties

2bb9f23

milleruntime force-pushed the mr-refactor branch from 0a75087 to 2bb9f23 Compare November 7, 2018 17:10

ctubbsii reviewed Nov 7, 2018

milleruntime mentioned this pull request Nov 7, 2018

Improve new MapReduce API #753

Closed

milleruntime merged commit 9dadca0 into apache:master Nov 7, 2018

milleruntime deleted the mr-refactor branch November 7, 2018 18:18

ctubbsii added the v2.0.0 label Nov 9, 2018

ctubbsii added this to Done in 2.0.0 Jun 14, 2019
