New MapReduce API #743
Conversation
hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapred/AccumuloFileOutputFormat.java
hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapred/AccumuloInputFormat.java
hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/AccumuloInputFormat.java
test/pom.xml
<dependency>
  <groupId>org.apache.accumulo</groupId>
  <artifactId>accumulo-hadoop-mapreduce</artifactId>
</dependency>
We shouldn't add this here. The mapreduce module can contain its own ITs, which would have a test dependency on accumulo-test for the testing framework stuff.
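For example, the ITs inside the hadoop-mapreduce module could pull in the test framework with a test-scoped dependency declared in that module's own pom.xml (a sketch; the version would be managed by the parent pom):

```xml
<!-- Sketch: test-scoped dependency inside the hadoop-mapreduce module,
     so its ITs can use the accumulo-test framework without the test
     module depending on hadoop-mapreduce. -->
<dependency>
  <groupId>org.apache.accumulo</groupId>
  <artifactId>accumulo-test</artifactId>
  <scope>test</scope>
</dependency>
```

This keeps the dependency direction one-way: hadoop-mapreduce depends on accumulo-test for testing, not the other way around.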
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.accumulo.core.util.PeekingIterator;
import org.apache.accumulo.hadoop.mapred.AccumuloRowInputFormat;
import org.apache.accumulo.hadoop.mapreduce.InputInfo;
This test need not be modified. It's currently testing the core mapreduce stuff and should continue to do so. Rather than modifying them, these mapred tests should be copied into the new module's src/test/java directory and modified to run there.
I do not think this should be removed along with its functionality. I recall multi-table scanning being a high-demand feature when this was first added. However, it is redundant with AccumuloInputFormat, since that is a trivial case of this one. I think AccumuloInputFormat should be modified to support multiple tables, so we can eliminate one of these but still keep the ability to scan multiple tables.
OK, I will create a follow-on ticket so we don't lose the multi-table functionality. It shouldn't be too bad to add it as an option to InputInfo.
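The follow-on could expose multiple tables as an option on the fluent builder. A minimal self-contained sketch of that shape (all names here are hypothetical stand-ins, not the actual InputInfo API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an InputInfo-style fluent builder that
// accepts one or more tables instead of exactly one.
class InputInfoSketch {
  private final List<String> tables = new ArrayList<>();

  // Single-table case stays trivial: it is just a one-element list.
  InputInfoSketch table(String name) {
    tables.add(name);
    return this;
  }

  // Multi-table option proposed for the follow-on ticket.
  InputInfoSketch tables(String... names) {
    tables.addAll(Arrays.asList(names));
    return this;
  }

  List<String> configuredTables() {
    return tables;
  }
}
```

Under this shape, AccumuloInputFormat with one table becomes the trivial case of the multi-table configuration, which is the redundancy argument made above.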
Created #749
hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/InputInfo.java
 * @param clientProps
 *          Accumulo connection information
 */
OutputOptions clientProperties(Properties clientProps);
If there are batch writer props, will they be respected? Or is this currently just used to connect?
I added this just for connection info. I could remove it, since we have ClientParams.clientInfo(clientInfo).
I think clientInfo can also carry scanner and/or batch writer config. If that config is ignored, it needs to be documented. The best approach would be to respect the config from clientInfo; then, for whatever can be set via clientInfo, there is no need to have separate MapReduce APIs.
hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoop/mapreduce/OutputInfo.java
hadoop-mapreduce/src/main/java/org/apache/accumulo/hadoopImpl/mapreduce/InputInfoImpl.java
 * @param bwConfig
 *          the configuration for the {@link BatchWriter}
 */
OutputOptions batchWriterOptions(BatchWriterConfig bwConfig);
Users can set BatchWriterConfig when they use the AccumuloClient builder to build ClientInfo. Therefore, this method could be removed.
I think this is a good idea, since we could reduce what goes into the API. My only concern is that it may be confusing which options should be set through which builder.
It's up to you if you want to remove it, but if you keep it you should tell users what happens when they set the config in both places (i.e., tell them this method takes precedence over the one on the AccumuloClient builder) and make sure that actually occurs in the implementation.
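The precedence rule being asked for can be made concrete. A self-contained sketch (hypothetical names, not the real OutputInfo internals) of "the explicit builder value, when present, wins over the clientInfo value":

```java
// Hypothetical resolution of a batch writer setting that can arrive from
// two places: the clientInfo/client properties and an explicit call to
// batchWriterOptions(...). This sketch implements one possible rule,
// which is the one suggested in the review: explicit builder value wins.
class BatchWriterSettingResolution {
  // explicitMaxMemory is null when batchWriterOptions(...) was never called.
  static long resolveMaxMemory(Long explicitMaxMemory, long clientInfoMaxMemory) {
    return explicitMaxMemory != null ? explicitMaxMemory : clientInfoMaxMemory;
  }
}
```

Whatever rule the implementation picks, the javadoc on batchWriterOptions would need to state it so users setting the config in both places are not surprised.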
With the current state of the code, it would take some effort under the hood to redo how the batch writer config is stored and serialized. I will make this a follow-on blocker.
Created #751
I created follow-on tickets so this PR doesn't become too big... @mikewalch @ctubbsii @keith-turner OK with merging this PR as the next step in the API evolution?
* Created OutputInfo, FileOutputInfo, InputInfo fluent API for building options
* Updated unit tests in hadoop-mapreduce to use new API
* Moved AccumuloOutputFormat methods to AccumuloOutputFormatImpl
* Made all previous static API methods protected
* Made AccumuloRowInputFormatIT use new API since it has no unit test
* Created NewAccumuloInputFormatIT to test new API
* Added log4j file for tests
I like the direction this is headed, but I think that the entry point for the fluent API should dangle off of the Input/OutputFormats directly, rather than off a separate object, and then passed in via a setter.
This is a good idea. I was thinking something similar while modifying the javadoc, that it could be even better. I will merge the progress here and open another follow-on.
Created follow-on #753
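The shape suggested above, with the fluent chain dangling off the format class itself rather than a separate info object passed to a setter, might look roughly like this. This is a self-contained sketch with stub classes; the builder method names and the JobStub type are illustrative assumptions, not the actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Stub standing in for a Hadoop Job's configuration.
class JobStub {
  final Map<String, String> conf = new HashMap<>();
}

// Hypothetical input format whose fluent builder is reached through a
// static configure() entry point and terminates by storing its options
// into the job, instead of building a separate InputInfo object and
// handing it to a setter.
class InputFormatSketch {
  static Builder configure() {
    return new Builder();
  }

  static class Builder {
    private final Map<String, String> opts = new HashMap<>();

    Builder table(String name) {
      opts.put("table", name);
      return this;
    }

    Builder clientPropertiesFile(String path) {
      opts.put("clientProps", path);
      return this;
    }

    // Terminal operation: serialize the accumulated options into the job.
    void store(JobStub job) {
      job.conf.putAll(opts);
    }
  }
}
```

Usage would then read as a single chain, e.g. `InputFormatSketch.configure().table("t1").store(job);`, with no separate info object for callers to construct and pass around.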
Split changes up into a few commits. I first did some refactoring, then removed AccumuloMultiTableInputFormat. This InputFormat is overly complicated, and it seems like it's not needed or could be done better across Hadoop jobs. From there, I created a new simplified mapreduce API, which includes the OutputInfo, FileOutputInfo and InputInfo fluent API for building options. Finally, I removed the Log4j logging and the setting of the log level.
This is a follow-on task of #712.