Adding s3a schema and s3a implem to hdfs storage module. #3940
Conversation
👍 , LGTM.
pom.xml (Outdated)
@@ -71,7 +71,7 @@
 <netty.version>4.1.6.Final</netty.version>
 <slf4j.version>1.7.12</slf4j.version>
 <!-- If compiling with different hadoop version also modify default hadoop coordinates in TaskConfig.java -->
-<hadoop.compile.version>2.3.0</hadoop.compile.version>
+<hadoop.compile.version>2.7.0</hadoop.compile.version>
Why not 2.7.3?
@@ -421,6 +421,7 @@ public long push() throws IOException
       case "hdfs":
       case "viewfs":
       case "maprfs":
+      case "s3a":
Any reason s3a is here instead of in the "s3_zip" section? I think s3a should be treated like s3 deep storage, not hdfs.
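For context, a minimal sketch of the scheme-based branch being debated, assuming illustrative class and method names and an assumed shape for the "s3_zip" loadSpec (bucket/key); the actual JobHelper code differs in detail. This PR puts "s3a" in the "hdfs" branch, and the question above is whether it belongs with the other S3 schemes instead.

import com.google.common.collect.ImmutableMap;

import java.net.URI;
import java.util.Map;

class SchemeBasedLoadSpecSketch
{
  static Map<String, Object> loadSpecForScheme(URI indexZip)
  {
    switch (indexZip.getScheme()) {
      case "hdfs":
      case "viewfs":
      case "maprfs":
      case "s3a": // added by this PR; arguably belongs in the branch below
        return ImmutableMap.<String, Object>of("type", "hdfs", "path", indexZip.toString());
      case "s3":
      case "s3n":
        return ImmutableMap.<String, Object>of(
            "type", "s3_zip",
            "bucket", indexZip.getHost(),
            "key", indexZip.getPath().substring(1)
        );
      default:
        throw new IllegalArgumentException("unhandled scheme: " + indexZip.getScheme());
    }
  }
}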
I don't think we really want uniformity. There's no good reason for people with S3 deep storage to have to load the druid-hdfs-storage extension or hadoop classes on their historical nodes.
The idea behind how things work now is that even if a hadoop indexing job uses the hadoop fs classes to push data to s3 deep storage, you'd still configure your historicals with s3 deep storage, not hdfs.
String segmentDir = "hdfs".equals(fileSystem.getScheme()) || "viewfs".equals(fileSystem.getScheme()) | ||
String segmentDir = "hdfs".equals(fileSystem.getScheme()) | ||
|| "viewfs".equals(fileSystem.getScheme()) | ||
|| "s3a".equals(fileSystem.getScheme()) |
Similar comment here.
@gianm how about the issue with:
It is still not supported by Hadoop, though.
s3a shouldn't really be here… probably just like determining the loadSpec, the directory should be based on the kind of deep storage configured and not on the scheme. I think adding a getStorageDir to DataSegmentPusher, and getting rid of DataSegmentPusherUtil, would solve that. It could use java 8 default methods to prevent any pusher other than HDFS from having to override it.
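A minimal sketch of that suggestion, assuming an illustrative SegmentId type and simplified signatures; the real DataSegmentPusher, DataSegment, and path layout differ.

class SegmentId
{
  final String dataSource;
  final String interval;
  final String version;
  final int partitionNum;

  SegmentId(String dataSource, String interval, String version, int partitionNum)
  {
    this.dataSource = dataSource;
    this.interval = interval;
    this.version = version;
    this.partitionNum = partitionNum;
  }
}

interface DataSegmentPusherSketch
{
  // Java 8 default method: the shared storage-directory layout, so a separate
  // DataSegmentPusherUtil is no longer needed.
  default String getStorageDir(SegmentId segment)
  {
    return String.join(
        "/",
        segment.dataSource,
        segment.interval,
        segment.version,
        String.valueOf(segment.partitionNum)
    );
  }
}

// Only the HDFS pusher overrides the default, e.g. to avoid characters that
// HDFS paths reject.
class HdfsDataSegmentPusherSketch implements DataSegmentPusherSketch
{
  @Override
  public String getStorageDir(SegmentId segment)
  {
    return String.join(
        "/",
        segment.dataSource,
        segment.interval.replace(":", "_"),
        segment.version.replace(":", "_"),
        String.valueOf(segment.partitionNum)
    );
  }
}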
Hmm, this method is used everywhere... anyway, will change that.
indexing-hadoop/src/main/java/io/druid/indexer/JobHelper.java, line 424 at r1: Previously, gianm (Gian Merlino) wrote…
@gianm For the long term we are trying to use only the HDFS pusher/puller for all Hadoop-related file systems, though; the good reason is to have one module that does everything. I get the idea that the S3 module was developed in the absence of the current hadoop-aws module, but I guess it is always good to move away from it and have one loadSpec for all the Hadoop file systems.
pom.xml, line 74 at r1: Previously, gianm (Gian Merlino) wrote…
Done.
Just because you use the s3a fs to push data to S3 doesn't mean S3 is a hadoop file system. It's not. It's S3.
@gianm I get your point, but what I am trying to say is that if we are using Hadoop implementations to push the files, would it make sense to use Hadoop implementations to pull/kill etc.?
Your argument is reasonable, but I think the reason we are disagreeing is that in a Hadoop-based indexing job you're not actually using the hdfs pusher. It doesn't use the pushers at all, it has its own hard coded pushing code that is based purely on the scheme. Put another way, a user could configure S3 deep storage in two ways: either with the druid-hdfs-storage extension pointed at an s3a:// (or s3n://) segment output path, or with the S3 deep storage module itself.
In non-Hadoop based indexing jobs, where the pushers actually do get used, the former gets loadSpec "hdfs" and the latter gets "s3_zip" even though the segments end up in the same place. In Hadoop-based indexing jobs the loadSpec for both will be "s3_zip" since it only looks at the scheme. It seems to me that it would be best for the user to have similar control over the loadSpec in Hadoop-based jobs, rather than forcing one choice or another. Does that sound right to you and would you be open to working on that? For my part, as long as the user has no choice in loadSpec used for S3 storages in Hadoop-based jobs, I think they should stay what they currently are, which is "s3_zip", for all S3 schemes (s3, s3n, s3a).
@gianm agree about
Ok, seems like to make this work we need to let the user select the pusher/puller. Hence if the S3 module is loaded the loadSpec will be "s3_zip".
Yeah that seems like a good approach. Although rather than based on loaded modules I would do it based on which deep storage is configured as the main one -- since I think it's possible to load two deep storage modules at once. To do that I think we need to resolve this todo in JobHelper:
Instead of the big switch statement, that should probably call a method like DataSegmentPusher.makeLoadSpec.
That would mean using "hdfs" deep storage to write to s3 would let the HdfsDataSegmentPusher make an "hdfs" load spec, and using the "s3" deep storage to write to s3 would let the S3DataSegmentPusher make a "s3_zip" load spec.
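A rough sketch of that delegation, with illustrative names rather than the exact Druid signatures:

import java.net.URI;
import java.util.Map;

interface LoadSpecMaker
{
  Map<String, Object> makeLoadSpec(URI finalIndexZipFilePath);
}

class JobHelperSketch
{
  // Instead of switching on indexZip.getScheme(), the Hadoop job asks the
  // pusher chosen by the deep storage configuration, so "hdfs" deep storage
  // yields an "hdfs" loadSpec and "s3" deep storage yields "s3_zip",
  // whatever the output scheme is.
  static Map<String, Object> loadSpecFor(LoadSpecMaker configuredPusher, URI indexZip)
  {
    return configuredPusher.makeLoadSpec(indexZip);
  }
}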
Bumping to 0.10.1 as discussed in dev sync today.
@gianm check out the new approach and let me know what you think.
@@ -146,4 +148,19 @@ public DataSegment call() throws Exception
       }
     }
   }
+
+  @Override
+  public Map<String, Object> makeLoadSpec(URI uri)
Minor nit: for each pusher the makeLoadSpec logic seems duplicated with the push method; this can be extracted to a common method.
Agree with @nishantmonu51, to avoid duplicated code please use this in push, or else have them both call a common helper method.
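A small sketch of the de-duplication being requested, using an illustrative pusher class rather than the real HdfsDataSegmentPusher (push is reduced to returning the loadSpec for brevity):

import com.google.common.collect.ImmutableMap;

import java.net.URI;
import java.util.Map;

class DedupedPusherSketch
{
  // push() reuses makeLoadSpec() instead of building the map inline, so the
  // two code paths can never drift apart.
  public Map<String, Object> push(URI finalIndexZipFilePath)
  {
    // ... zip and copy the segment to finalIndexZipFilePath, then:
    return makeLoadSpec(finalIndexZipFilePath);
  }

  public Map<String, Object> makeLoadSpec(URI finalIndexZipFilePath)
  {
    return ImmutableMap.<String, Object>of("type", "hdfs", "path", finalIndexZipFilePath.toString());
  }
}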
@b-slim sorry to let this drop for so long, I am taking another look now.
@@ -174,4 +175,17 @@ public DataSegment call() throws Exception
       }
     }
   }
+
+  @Override
+  public Map<String, Object> makeLoadSpec(URI uri)
Please either use this in uploadDataSegment, or else have them use a common helper method.
  @Override
  public Map<String, Object> makeLoadSpec(URI uri)
  {
	throw new IAE("not supported");
Please use spaces instead of tabs for indenting. Also this should technically be an UnsupportedOperationException.
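Roughly what is being asked for here, as a sketch rather than the final patch:

import java.net.URI;
import java.util.Map;

class NoopLoadSpecSketch
{
  // Space indentation, and UnsupportedOperationException instead of IAE.
  public Map<String, Object> makeLoadSpec(URI uri)
  {
    throw new UnsupportedOperationException("makeLoadSpec is not supported by this pusher");
  }
}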
@@ -146,4 +148,19 @@ public DataSegment call() throws Exception
       }
     }
   }
+
+  @Override
+  public Map<String, Object> makeLoadSpec(URI uri)
Agree with @nishantmonu51, to avoid duplicated code please use this in push, or else have them both call a common helper method.
@@ -142,6 +146,19 @@ public DataSegment push(final File indexFilesDir, final DataSegment segment) thr
     }
   }

+  @Override
+  public Map<String, Object> makeLoadSpec(URI finalIndexZipFilePath)
Please use this in push too, or else have them both call a common helper method.
@@ -149,4 +151,14 @@ public DataSegment call() throws Exception
         throw Throwables.propagate(e);
       }
     }
+
+  @Override
+  public Map<String, Object> makeLoadSpec(URI finalIndexZipFilePath)
Similar comment to other pushers about code duplication.
@@ -741,7 +741,8 @@ public void doRun()
           new Path(config.getSchema().getIOConfig().getSegmentOutputPath()),
           outputFS,
           segmentTemplate
-      )
+      ),
+      config.DATA_SEGMENT_PUSHER
It's a static, so this would be more clear as HadoopDruidIndexerConfig.DATA_SEGMENT_PUSHER.
s3a shouldn't really be here… probably just like determining the loadSpec, the directory should be based on the kind of deep storage configured and not on the scheme. I think adding a getStorageDir to DataSegmentPusher, and getting rid of DataSegmentPusherUtil, would solve that. It could use java 8 default methods to prevent any pusher other than HDFS from having to override it.
Oops, this comment was in the wrong spot. Moved to https://github.com/druid-io/druid/pull/3940/files#r107766956
@b-slim are you able to pick this back up?
@gianm thanks for the review, give me a couple of days.
Cool, thanks for the update. One other suggestion: how about adding a config to S3DataSegmentPusherConfig that controls whether
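The suggestion above is truncated in the thread; purely as a hypothetical illustration, a scheme-selection flag on S3DataSegmentPusherConfig might look like the sketch below (the field name useS3aSchema, its default, and the Jackson wiring are assumptions, not the actual Druid config).

import com.fasterxml.jackson.annotation.JsonProperty;

class S3DataSegmentPusherConfigSketch
{
  // Hypothetical flag: whether pushed segment paths/loadSpecs use the s3a
  // scheme rather than the older s3n scheme.
  @JsonProperty
  private boolean useS3aSchema = false;

  public boolean isUseS3aSchema()
  {
    return useS3aSchema;
  }
}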
@gianm opened a new PR due to the number of conflicting files; can we make sure this gets reviewed ASAP to avoid extra work fixing conflicts.
tracked by #4116
+1
This PR adds the S3a file schema to the list of file systems supported by the Hadoop indexer.
It also adds the implementation JARs of hadoop-aws to the HDFS deep storage module, which will be used to load segments indexed with the S3a schema.
FYI: tagged as 0.10, but no rush to have it in.