Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New MapReduce API #743

Merged
merged 15 commits into from Nov 7, 2018
Merged

New MapReduce API #743

merged 15 commits into from Nov 7, 2018

Conversation

milleruntime
Copy link
Contributor

@milleruntime milleruntime commented Nov 1, 2018

Split changes up into a few commits. I first did some refactoring, then removed AccumuloMultiTableInputFormat. This InputFormat is overly complicated and seems like its not needed or could be done better across Hadoop jobs. From there created a new simplified mapreduce API, which includes OutputInfo, FileOutputInfo and InputInfo fluent API for building options. Finally removed the Log4j logging and setting of the log level.

This is a follow on task of #712.

test/pom.xml Outdated
<dependency>
<groupId>org.apache.accumulo</groupId>
<artifactId>accumulo-hadoop-mapreduce</artifactId>
</dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't add this here. The mapreduce module can contain its own ITs, which have a test depenency on accumulo-test, for the testing framework stuffs.

import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.accumulo.core.util.PeekingIterator;
import org.apache.accumulo.hadoop.mapred.AccumuloRowInputFormat;
import org.apache.accumulo.hadoop.mapreduce.InputInfo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test need not be modified. It's currently testing the core mapreduce stuff, and should continue to do so. Rather than mod, these mapred tests should be copied and modified in the new module's src/test/java directory to run there.

@@ -1,99 +0,0 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this should be removed, along with its functionality. I seem to recall multiple table scanning to be a high-demand feature when this was first added. However, it is redundant with AccumuloInputFormat, since that is a trivial case of this one. I think the AccumuloInputFormat should be modified to support multiple tables, so we can eliminate one of these, but still keep the ability to scan multiple tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I will create a follow on ticket so we don't lose the multi table functionality. It shouldn't be too bad to add it as an option to InputInfo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #749

* @param clientProps
* Accumulo connection information
*/
OutputOptions clientProperties(Properties clientProps);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are batch writer props, will they be respected? Or this currently just used to connect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this just for connection info. I could remove it since we have ClientParams.clientInfo(clientInfo)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think clientInfo can also have scanner and/or batch writer config. If this stuff is ignored, need to document that. The best thing would be to respect config from clientinfo AND whatever can be set by clientinfo there is no need to have map reduce APIs for that.

* @param bwConfig
* the configuration for the {@link BatchWriter}
*/
OutputOptions batchWriterOptions(BatchWriterConfig bwConfig);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can set BatchWriterConfig when they use the AccumuloClient builder to build ClientInfo. Therefore, this method could be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea since we could reduce what goes into the API. My only concern is that it will be confusing of which options should be set through which builder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to you if you want to remove it, but if you keep it you should tell users what happens if they set the config in both place (i.e. tell them this method takes precedence over the method in AccumuloClient builder) and make sure this actually occurs in the implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current state of the code, it would take some effort under the hood to redo how the batch writer config is stored and serialized. I think I will make this a follow on blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #751

@milleruntime
Copy link
Contributor Author

I created follow on tickets so this PR doesn't become too big... @mikewalch @ctubbsii @keith-turner OK with merging this PR as the next step in the API evolution?

* Created OutputInfo, FileOutputInfo, InputInfo fluent API for building options
* Updated unit Tests in hadoop-mapreduce to use new API
* Moved AccumuloOutputFormat methods to AccumuloOutputFormatImpl
* Made all previous static api methods protected
* Made AccumuloRowInputFormatIT use new APi since it has no unit test
* Created NewAccumuloInputFormatIT to test new API
* Added log4j file for tests
Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction this is headed, but I think that the entry point for the fluent API should dangle off of the Input/OutputFormats directly, rather than off a separate object, and then passed in via a setter.

@milleruntime
Copy link
Contributor Author

I think that the entry point for the fluent API should dangle off of the Input/OutputFormats directly, rather than off a separate object, and then passed in via a setter.

This is a good idea. I was thinking something similar while modifying the javadoc that it could be even better. I will merge the progress here and open up another follow on.

@milleruntime
Copy link
Contributor Author

Created follow on #753

@milleruntime milleruntime merged commit 9dadca0 into apache:master Nov 7, 2018
@milleruntime milleruntime deleted the mr-refactor branch November 7, 2018 18:18
@ctubbsii ctubbsii added the v2.0.0 label Nov 9, 2018
@ctubbsii ctubbsii added this to Done in 2.0.0 Jun 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
2.0.0
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants