Add configurations for allowed protocols for HTTP and HDFS inputSources/firehoses #10830
Conversation
core/src/main/java/org/apache/druid/data/input/impl/HttpInputSource.java
…ource.java Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
LGTM
Do the docs need to be updated as well? Right now it just says "URIs of the input files." https://druid.apache.org/docs/latest/ingestion/native-batch.html#http-input-source
The linked section starts with … I also fixed the HDFS inputSource to allow only `hdfs`.
@@ -69,6 +70,13 @@ public HttpInputSource(
    this.config = config;
  }

  public static void throwIfInvalidProtocols(List<URI> uris)
nit:
- public static void throwIfInvalidProtocols(List<URI> uris)
+ private static void throwIfInvalidProtocols(List<URI> uris)
It should be public since this method is now used by `HttpFirehoseFactory` as well.
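For illustration, a shared static helper along these lines could be called from both the inputSource and the firehose factory. This is only a sketch: the class name, the `allowedProtocols` parameter, and the error message are hypothetical, not the exact code in this PR.

```java
import java.net.URI;
import java.util.List;
import java.util.Set;

public final class ProtocolValidator
{
  private ProtocolValidator()
  {
  }

  // Fails fast if any URI uses a scheme outside the configured allowed set,
  // so the input source and the firehose factory can share one check.
  public static void throwIfInvalidProtocols(Set<String> allowedProtocols, List<URI> uris)
  {
    for (URI uri : uris) {
      final String scheme = uri.getScheme();
      final boolean allowed = scheme != null
                              && allowedProtocols.stream().anyMatch(p -> p.equalsIgnoreCase(scheme));
      if (!allowed) {
        throw new IllegalArgumentException(
            "Only " + allowedProtocols + " protocols are allowed, but found [" + uri + "]"
        );
      }
    }
  }
}
```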
😢
if (Arrays.stream(inputPaths).anyMatch(path -> !"hdfs".equalsIgnoreCase(path.toUri().getScheme()))) {
  throw new IllegalArgumentException("Input paths must be the HDFS path");
}
It appears that `FileInputFormat#addInputPath` already has its own scheme validation, so I don't believe we need this check. See `org.apache.hadoop.fs.FileSystem#checkPath`.
`FileSystem` has a couple of implementations such as `DistributedFileSystem`, `S3AFileSystem`, etc., which support different schemes. The default file system is `LocalFileSystem` unless you set `fs.default.name`. I think we should allow only `hdfs` and let users use the right inputSource to read from other storage (e.g., `S3InputSource` to read from S3).
Are there users that rely on this behavior? Would restricting this to just hdfs mean it's not possible for some users to ingest files from these locations any more? If that's the case, should we introduce a config to allow server admins to specify which schemes are supported?
@jihoonson Technically `WebHdfsFileSystem` did work with this inputSource before, so this could break ingestion pipelines for operators relying on the HDFS inputSource with the `webhdfs` scheme. Could you please comment on what the motivation is behind restricting it to only `hdfs`?
@suneet-s @a2l007 good point. I realized that this is a documented feature that should be supported. The concern here is that users can use whatever protocol they want if it's supported by the hdfs client. Instead of restricting it to only hdfs, I added new configs for http and hdfs inputSources so that users can configure what protocols they want to allow.
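As a rough sketch of how such a configuration could be enforced on the HDFS side (the class, method name, and error message below are illustrative, not the code merged in this PR):

```java
import java.util.Collection;
import java.util.Set;

import org.apache.hadoop.fs.Path;

final class HdfsProtocolCheck
{
  private HdfsProtocolCheck()
  {
  }

  // Compares each input path's scheme against a configurable allowed set
  // instead of hard-coding "hdfs".
  static void verifyProtocols(Set<String> allowedProtocols, Collection<Path> inputPaths)
  {
    for (Path path : inputPaths) {
      final String scheme = path.toUri().getScheme();
      // A path without a scheme falls back to the default file system
      // (fs.default.name / fs.defaultFS); whether to accept that is a policy choice.
      if (scheme != null && allowedProtocols.stream().noneMatch(p -> p.equalsIgnoreCase(scheme))) {
        throw new IllegalArgumentException(
            "Only " + allowedProtocols + " protocols are allowed, but found [" + path + "]"
        );
      }
    }
  }
}
```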
docs/ingestion/native-batch.md
@@ -1064,7 +1064,7 @@ Sample specs:
      "type": "index_parallel",
      "inputSource": {
        "type": "hdfs",
-       "paths": "hdfs://foo/bar/", "hdfs://bar/foo"
+       "paths": "hdfs:/foo/bar/", "hdfs:/bar/foo"
nit: Is this change needed?
`hdfs://foo/bar` seems strange to me because it refers to the `/bar` path on the host `hdfs://foo` (https://en.wikipedia.org/wiki/Uniform_Resource_Identifier). I don't think it was intentional.
I think `hdfs:/foo/bar` may only work if the namenode is configured in Hadoop. For general cases, the URI format is `scheme://authority/path` (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#Overview). Do you think it would be clearer to change this to `"hdfs://namenodehost/foo/bar/"` instead?
That sounds even better. I will update it.
I updated the doc.
  )
  {
    super(maxCacheCapacityBytes, maxFetchCapacityBytes, prefetchTriggerBytes, fetchTimeout, maxFetchRetry);
-   this.inputPaths = HdfsInputSource.coerceInputPathsToList(inputPaths, "inputPaths");
+   this.inputPaths = HdfsInputSource.coerceInputPathsToList(inputPaths, "paths");
👍
@techdocsmith FYI
Overall, the defaults look reasonable to me. I looked through other InputSources to see if we need a similar change, and didn't find any. I haven't read through the code in fine detail - will defer to @a2l007
Style suggestions
docs/configuration/index.md
|--------|---------------|-----------|-------|
|`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols that HTTP input source and HTTP firehose can use.|["http", "https"]|

The following properties are to control what domains native batch tasks can access to using
- The following properties are to control what domains native batch tasks can access to using
+ The following properties control the domains native batch tasks can access using
docs/ingestion/native-batch.md
However, if you want to read from AWS S3 or Google Cloud Storage, consider using
the [S3 input source](#s3-input-source) or the [Google Cloud Storage input source](#google-cloud-storage-input-source) instead.
You can also ingest from other storage using the HDFS input source if the HDFS client supports that storage.
However, if you want to ingest from cloud storage, consider using the proper input sources for them.
- However, if you want to ingest from cloud storage, consider using the proper input sources for them.
+ However, if you want to ingest from cloud storage, consider using the service-specific input source for your cloud storage.
You can read not only from cloud storage but from any storage if it's supported by the HDFS client. I changed it to "the service-specific input source for your data storage".
docs/ingestion/native-batch.md
@@ -1553,6 +1557,11 @@ Note that prefetching or caching isn't that useful in the Parallel task.
|fetchTimeout|Timeout for fetching each file.|60000|
|maxFetchRetry|Maximum number of retries for fetching each file.|3|

You can also ingest from other storage using the HDFS firehose if the HDFS client supports that storage.
However, if you want to ingest from cloud storage, consider using the proper input sources for them.
- However, if you want to ingest from cloud storage, consider using the proper input sources for them.
+ However, if you want to ingest from cloud storage, consider using the service-specific input source for your cloud storage.
Same here.
@suneet-s, I did an editorial pass on the affected docs.
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
@techdocsmith thanks for the review 👍
@suneet-s I agree. I don't think other inputSources have the same issue.
LGTM. Thanks!
@a2l007 @techdocsmith @suneet-s thanks for the review 👍
Description
This PR adds new configurations for allowed protocols for the HTTP and HDFS inputSources and firehoses. These inputSources and firehoses accept only URIs that use the allowed protocols and fail otherwise.

- `druid.ingestion.hdfs.allowedProtocols`: Allowed protocols that the HDFS inputSource and HDFS firehose can use.
- `druid.ingestion.http.allowedProtocols`: Allowed protocols that the HTTP inputSource and HTTP firehose can use.

This PR changes the existing behavior, under which users could use any protocol with these inputSources and firehoses.
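As an illustration of how an operator might set these properties (the file name and the chosen values below are only an example; the HTTP default listed in the docs is ["http", "https"]):

```properties
# Hypothetical common.runtime.properties snippet
# Restrict the HTTP input source and firehose to https only
druid.ingestion.http.allowedProtocols=["https"]
# Allow both hdfs and webhdfs for the HDFS input source and firehose
druid.ingestion.hdfs.allowedProtocols=["hdfs", "webhdfs"]
```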
This PR has: