Delegate creation of segmentPath/LoadSpec to DataSegmentPushers and add S3a support #4116
Conversation
This replaces #3940
Force-pushed from e3e071f to 72fd6e0.
@akashdw can you take a look at this one as well?
@dclim could you please take a look?
besides minor comments, looks good to me, 👍
```java
public interface DataSegmentPusher
{
  Joiner JOINER = Joiner.on("/").skipNulls();
```
static final?
It is by definition, since it is part of an interface.
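For illustration only (not part of the PR, and with hypothetical class names), a minimal sketch of why the modifiers would be redundant here: fields declared in a Java interface are implicitly `public static final`.

```java
import com.google.common.base.Joiner;

interface PusherConstantsExample
{
  // Equivalent to: public static final Joiner JOINER = ...
  Joiner JOINER = Joiner.on("/").skipNulls();
}

class PusherConstantsUser implements PusherConstantsExample
{
  String joinSegments(String dataSource, String version)
  {
    // The constant is resolved from the interface.
    return JOINER.join(dataSource, version);
  }
}
```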
```java
  return ImmutableMap.<String, Object>of("type", "hdfs", "path", finalIndexZipFilePath.toString());
}

private static final Joiner JOINER = Joiner.on("/").skipNulls();
```
minor nit: reuse DataSegmentPusher.JOINER
done
```
@@ -115,15 +118,14 @@ public DataSegment push(File inDir, DataSegment segment) throws IOException
        storageDir,
        segment.getShardSpec().getPartitionNum()
```
`outIndexFile` could be created using `JobHelper.makeFileNamePath()`. And to do that, we can move the `makeFileNamePath()` method from `JobHelper` to `DataSegmentPusher`. This ensures no separate logic for index file path creation at multiple places. We can do the same for the tmp file and `descriptor.json`.
It is a good idea, but it would require druid-api to depend on the Hadoop APIs, which I am afraid will trigger pushback from the no-hadoop fan club. Most of the logic is encapsulated in `dataSegmentPusher.makeIndexPathName`, so I guess we are OK.
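A rough sketch of the kind of shared path-building method being discussed. The name `makeIndexPathName` comes from the comment above, but the signature and body here are assumptions for illustration, not the PR's actual code.

```java
import com.google.common.base.Joiner;
import io.druid.timeline.DataSegment;

public interface DataSegmentPusher
{
  Joiner JOINER = Joiner.on("/").skipNulls();

  // Hypothetical default method: build the storage-relative segment path in one
  // place, so the HDFS/S3 pushers and the Hadoop indexer don't each re-implement
  // the same string formatting.
  default String makeIndexPathName(DataSegment segment, String indexName)
  {
    return JOINER.join(
        segment.getDataSource(),
        String.format("%s_%s", segment.getInterval().getStart(), segment.getInterval().getEnd()),
        segment.getVersion(),
        segment.getShardSpec().getPartitionNum(),
        indexName
    );
  }
}
```

Keeping a method like this on the interface sidesteps the druid-api-on-Hadoop dependency concern raised above, since it only returns a string rather than a Hadoop `Path`.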
👍
```
@@ -115,15 +118,14 @@ public DataSegment push(File inDir, DataSegment segment) throws IOException
        storageDir,
        segment.getShardSpec().getPartitionNum()
    ));

    final Path outDescriptorFile = new Path(String.format(
        "%s/%s/%d_descriptor.json",
        fullyQualifiedStorageDirectory,
        storageDir,
        segment.getShardSpec().getPartitionNum()
    ));
```
same as above.
```
@@ -23,7 +23,6 @@
 import com.google.common.base.Predicate;
```
Line 718 of this file: should the URI be created with the scheme written in the loadSpec for the key 'S3Schema' if it exists? (and then default to s3n if it doesn't)
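A minimal sketch of the suggestion (hypothetical helper name, not the file's actual code): read the scheme from the loadSpec's 'S3Schema' key when present, and fall back to s3n otherwise.

```java
import java.net.URI;
import java.util.Map;

class LoadSpecUriExample
{
  static URI makeSegmentUri(Map<String, Object> loadSpec, String bucket, String key)
  {
    // Prefer the scheme stored in the loadSpec; default to s3n for older segments.
    final String scheme = loadSpec.containsKey("S3Schema")
        ? (String) loadSpec.get("S3Schema")
        : "s3n";
    return URI.create(String.format("%s://%s/%s", scheme, bucket, key));
  }
}
```

Either way, existing segments whose loadSpec lacks the key would keep resolving to s3n.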
Good catch, done.
```
@@ -71,7 +71,7 @@
    <netty.version>4.1.6.Final</netty.version>
    <slf4j.version>1.7.12</slf4j.version>
    <!-- If compiling with different hadoop version also modify default hadoop coordinates in TaskConfig.java -->
    <hadoop.compile.version>2.3.0</hadoop.compile.version>
```
Did you encounter the exception `Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V` in your testing? There's some discussion on it here https://issues.apache.org/jira/browse/HADOOP-12420 and the conclusion at the end is that Hadoop 2.7.3 will only work with aws-java-sdk 1.7.4 until Hadoop 2.8 is released (https://issues.apache.org/jira/browse/HADOOP-12269). I was only able to get the s3a scheme working by downgrading Druid's version of aws-java-sdk to 1.7.4, so I'm curious what your experience was.
@dclim I haven't run into this issue, maybe because I am not loading the s3a module; in fact, I am using the hdfs module to read segments.
@dclim 2.8 is released; if that works for you, I can update the dependency. Please let me know.
@b-slim nice! I don't have any concerns updating our client to 2.8
done
Timeout issue; closing and reopening.
👍 after Travis
@b-slim this is failing the UTs.
@dclim seems like there are a bunch of issues with 2.8 itself; I am thinking about downgrading to 2.7 again, what do you think?
@b-slim do you know what is causing the UT failures? Is it a configuration issue or something more fundamental? Downgrading to 2.7 seems reasonable to me if we don't have time to fix the issue, but I expect that people may then run into the aws-java-sdk incompatibility and we may need to walk them through downgrading from 1.10.x to 1.7.4.
@b-slim have you tested this on a remote Hadoop cluster? I don't think it actually works as expected. The behavior I'm seeing is that even though the segments are getting pushed to S3, the loadSpec in the descriptor.json and in the druid_segments metadata table shows type 'local' instead of 's3_zip' or 'hdfs'. This is happening because the loadSpec is now being generated by the DataSegmentPusher implementation, which is selected based on the value of druid.storage.type, and that property isn't propagated to the Hadoop job.
@dclim I haven't tested that downstream yet; let me check it out and see how we can propagate those properties to the MM and Hadoop JVM opts.
@b-slim cool, also take a look at #4180 - it might be a good idea to see if we can pass down the majority of the druid.* properties into the map and reduce JVMs to mitigate this class of issues. I was playing around with this earlier and ran into problems when I just blanket passed them all in (I think it was related to the extensions loadList, probably due to the way we handle extensions in MR jobs), but I didn't investigate any further.
@jon-wei I have run this with 2.7.3 (HDP 2.6) only.
@b-slim Can you try running with versions < 2.6.0 and see if there's a way to make that work? Otherwise that's a pretty big compatibility breakage.
@jon-wei thanks, can you please share the logs? Do you think it is related to this patch?
These are the errors I saw running against a sequenceiq 2.3.0 cluster; I tried various classloader options and got different errors:
With mapreduce.job.classloader = true:
With mapreduce.job.user.classpath.first = true:
With neither:
@b-slim If we can't get <2.6.0 hadoop working, I'm okay with dropping support for those older versions with the 0.10.1 release, but I think we should probably make a community decision; I'll start a druid-dev thread regarding that.
Started a thread on dropping support for Hadoop <2.6.0 if needed: https://groups.google.com/forum/#!topic/druid-development/vFjR0pdq83U
@jon-wei thanks, looks like it is a long shot to make it work with less than 2.6; will give it a try.
@b-slim Can you add docs for the changes in this patch?
@jon-wei ah, thanks for pointing out that patch, that sounds good to me.
@jon-wei I will send a follow-up PR this week, thanks!
Continuing from #4313 (comment) and #4313 (comment).
@himanshug setting the version to
@himanshug @b-slim does that mean we need to change something for 0.10.1 (rc2)?
@gianm most likely, yes... I'm going to test himanshug@9632b0c
Guys, I'm sorry but I still can't make it work. I'm using:
Still I have:
```
@@ -39,6 +39,9 @@
  @JsonProperty
  @Min(0)
  private int maxListingLength = 1000;
  // use s3n by default for backward compatibility
  @JsonProperty
  private boolean useS3aSchema = false;
```
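For illustration, a minimal sketch (hypothetical class, not the pusher's actual code) of how such a flag would typically be consumed: pick s3a when it is enabled and keep s3n as the backward-compatible default.

```java
class S3SchemeSelectionExample
{
  private final boolean useS3aSchema;

  S3SchemeSelectionExample(boolean useS3aSchema)
  {
    this.useS3aSchema = useS3aSchema;
  }

  // s3n remains the default so existing deployments keep their old URIs.
  String storageScheme()
  {
    return useS3aSchema ? "s3a" : "s3n";
  }

  String makeSegmentLocation(String bucket, String key)
  {
    return String.format("%s://%s/%s", storageScheme(), bucket, key);
  }
}
```

Based on the `@JsonProperty` field above, the corresponding runtime property would presumably be `druid.storage.useS3aSchema`, which matches the question asked later in this thread.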
Can we add this to the documentation?
We are using S3 as deep storage. The batch ingestion by default uses an S3N URI; is this the configuration to set so that it uses the S3A URL format? If so, where should this be set? In the deep storage config along with other properties? Is it druid.storage.useS3aSchema = true?
@palanieppan-m you have to add this to the Hadoop config file, e.g. `core-site.xml`. If you don't have any other configurations it would look something like:
```xml
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.eu-central-1.amazonaws.com</value>
  </property>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>${S3_ACCESS_KEY}</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>${S3_SECRET_KEY}</value>
  </property>
</configuration>
```
Don't forget to change the endpoint if you're using a different region.
Next create another config file named `jets3t.properties` with the following:
```
s3service.s3-endpoint = s3.eu-central-1.amazonaws.com
storage-service.request-signature-version=AWS4-HMAC-SHA256
```
Again change region if needed.
And lastly, all of this will only work with patched Hadoop libraries. I used Hadoop 2.7.3 from the HDP 2.5.3 distribution; it is patched with aws-java-sdk-*:1.10.6 and works perfectly with it. So I replaced the hadoop and aws jars and only then it worked. For more info you can take a look at this discussion: #4977
@moscowart, your proposal is a workaround for the default behavior of Druid, which uses the S3N URL scheme; to get that to work, you have to configure Hadoop to support it.
In this PR though, it looks like the behavior can be changed so that Druid always uses the S3A URL scheme instead, provided you turn on a config. I was enquiring whether my understanding is correct and what the config name is to turn on this behavior.
We are currently running into an issue where the hadoop-indexer job is making segment entries in the metadata store using the local load spec. As a result, segments are not getting loaded. The commentary here suggests we may need to pass the `druid.storage.*` and `druid.javascript.*` properties to the Hadoop job.
Alternatively, is there a way to configure the DataSegmentPusher? Right now, by default, LocalDataSegmentPusher is injected. This issue is preventing us from using Druid 0.10.1 for bulk ingestion. Forgot to mention, we want to push the segments to S3. Things worked fine with 0.9.1, but after the changes made in this pull request, bulk ingestion doesn't work.
@samarthjain this is not the best place to ask questions; you need to set druid.storage.type=hdfs or any other deep storage type as part of the properties file.