
Add configurations for allowed protocols for HTTP and HDFS inputSources/firehoses #10830

Merged: 16 commits merged into apache:master on Mar 6, 2021

Conversation

@jihoonson (Contributor) commented on Feb 2, 2021:

Description

This PR adds new configurations for allowed protocols for the HTTP and HDFS inputSources and firehoses. These inputSources and firehoses accept only URIs that use an allowed protocol and fail otherwise.

  • druid.ingestion.hdfs.allowedProtocols: Allowed protocols that HDFS inputSource and HDFS firehose can use.
  • druid.ingestion.http.allowedProtocols: Allowed protocols that HTTP inputSource and HTTP firehose can use.

This PR changes the existing behavior, under which users could use any protocol with these inputSources and firehoses.
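For illustration, a hedged sketch of how an operator might set these properties in a Druid runtime.properties file; the property names come from this PR, while the values below are only examples (list-typed Druid properties are written as JSON arrays):

```properties
# Example only: allow plain HTTP and HTTPS for the HTTP inputSource/firehose.
druid.ingestion.http.allowedProtocols=["http", "https"]
# Example only: allow only the hdfs scheme for the HDFS inputSource/firehose.
druid.ingestion.hdfs.allowedProtocols=["hdfs"]
```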


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Commit: …ource.java (Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>)
@abhishekagarwal87 (Contributor) left a comment:
LGTM

@suneet-s (Contributor) commented on Feb 2, 2021:

Do the docs need to be updated as well? Right now it just says "URIs of the input files." https://druid.apache.org/docs/latest/ingestion/native-batch.html#http-input-source

@jihoonson (Author) replied:

> Do the docs need to be updated as well? Right now it just says "URIs of the input files." https://druid.apache.org/docs/latest/ingestion/native-batch.html#http-input-source

The linked section starts with "The HTTP input source is to support reading files directly from remote sites via HTTP." I think this implies the URIs must be HTTP URIs, but I guess it doesn't hurt to make it explicit. I updated the doc.

I also fixed the HDFS inputSource to allow only hdfs paths. All other inputSources that accept URIs already have a scheme check.

@@ -69,6 +70,13 @@ public HttpInputSource(
this.config = config;
}

public static void throwIfInvalidProtocols(List<URI> uris)
Contributor:
nit:

Suggested change:
-    public static void throwIfInvalidProtocols(List<URI> uris)
+    private static void throwIfInvalidProtocols(List<URI> uris)

@jihoonson (Author):
It should be public since this method is now used by HttpFirehoseFactory as well.

Contributor:
😢

if (Arrays.stream(inputPaths).anyMatch(path -> !"hdfs".equalsIgnoreCase(path.toUri().getScheme()))) {
throw new IllegalArgumentException("Input paths must be the HDFS path");
}

Contributor:
It appears that FileInputFormat#addInputPath already has its own scheme validation, so I don't believe we need this check. See org.apache.hadoop.fs.FileSystem#checkPath.

@jihoonson (Author):
FileSystem has a couple of implementations, such as DistributedFileSystem, S3AFileSystem, etc., which support different schemes. The default file system is LocalFileSystem unless you set fs.default.name. I think we should allow only hdfs and let users use the right inputSource to read from other storage (e.g., `S3InputSource` to read from S3).
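As a side note, here is a minimal sketch (not from this PR) of how the Hadoop client resolves a FileSystem implementation from a path's scheme, which is why the inputSource could previously read from any storage the client supports:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeResolutionExample
{
  public static void main(String[] args) throws Exception
  {
    Configuration conf = new Configuration();
    // A path without a scheme resolves against fs.default.name / fs.defaultFS,
    // which is the local file system unless configured otherwise.
    FileSystem fs = new Path("/tmp/example").getFileSystem(conf);
    System.out.println(fs.getUri()); // prints file:/// with a default Configuration
    // A path with an explicit scheme, e.g. hdfs:// or s3a://, would instead pick
    // DistributedFileSystem or S3AFileSystem if the corresponding client is on the classpath.
  }
}
```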

Contributor:
Are there users that rely on this behavior? Would restricting this to just hdfs mean it's not possible for some users to ingest files from these locations any more? If that's the case, should we introduce a config to allow server admins to specify which schemes are supported?

Contributor:
@jihoonson Technically, WebHdfsFileSystem did work with this inputSource before, so this could break ingestion pipelines for operators relying on the HDFS inputSource with the webhdfs scheme. Could you comment on the motivation behind restricting it to only hdfs?

@jihoonson (Author):
@suneet-s @a2l007 good point. I realized that this is a documented feature that should be supported. The concern here is that users can use whatever protocol they want if it's supported by the hdfs client. Instead of restricting it to only hdfs, I added new configs for http and hdfs inputSources so that users can configure what protocols they want to allow.
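To illustrate the approach, here is a minimal sketch of validating URIs against a configured allow-list of protocols; the method name mirrors the one in this diff, but the standalone signature and error message are only an approximation, not the PR's exact code:

```java
import java.net.URI;
import java.util.List;
import java.util.Set;

public class ProtocolValidation
{
  // Throws if any URI uses a scheme that is not in the configured allow-list.
  public static void throwIfInvalidProtocols(Set<String> allowedProtocols, List<URI> uris)
  {
    for (URI uri : uris) {
      String scheme = uri.getScheme();
      if (scheme == null || allowedProtocols.stream().noneMatch(p -> p.equalsIgnoreCase(scheme))) {
        throw new IllegalArgumentException(
            "Only " + allowedProtocols + " protocols are allowed, but got [" + uri + "]"
        );
      }
    }
  }

  public static void main(String[] args)
  {
    Set<String> allowed = Set.of("http", "https");
    // Passes: https is in the allow-list.
    throwIfInvalidProtocols(allowed, List.of(URI.create("https://example.com/data.json")));
    // Throws IllegalArgumentException: ftp is not allowed.
    throwIfInvalidProtocols(allowed, List.of(URI.create("ftp://example.com/data.json")));
  }
}
```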

@jihoonson changed the title from "Allow only HTTP and HTTPS protocols for the HTTP inputSource" to "HTTP inputSource and firehose should support only HTTP and HTTPS; HDFS inputSource and firehose should support only HDFS" on Feb 3, 2021
@suneet-s (Contributor) commented on Feb 3, 2021:

Added Release Notes as there is a subtle change in behavior that users should be aware of when they upgrade.

@jihoonson added this to the 0.21.0 milestone on Feb 4, 2021
@jihoonson changed the title from "HTTP inputSource and firehose should support only HTTP and HTTPS; HDFS inputSource and firehose should support only HDFS" to "Add configurations for allowed protocols for HTTP and HDFS inputSources/firehoses" on Feb 5, 2021
@@ -1064,7 +1064,7 @@ Sample specs:
"type": "index_parallel",
"inputSource": {
"type": "hdfs",
"paths": "hdfs://foo/bar/", "hdfs://bar/foo"
"paths": "hdfs:/foo/bar/", "hdfs:/bar/foo"
Contributor:
nit: Is this change needed?

@jihoonson (Author):
hdfs://foo/bar seems strange to me because it refers to the /bar path on the host foo (https://en.wikipedia.org/wiki/Uniform_Resource_Identifier). I don't think it was intentional.

Contributor:
I think hdfs:/foo/bar may only work if the namenode is configured in Hadoop. In the general case, the URI format is scheme://authority/path (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#Overview).
Do you think it would be clearer to change this to "hdfs://namenodehost/foo/bar/" instead?

@jihoonson (Author):
That sounds even better. I will update it.

@jihoonson (Author):
I updated the doc.
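For reference, the updated sample spec fragment would look roughly like the following; `namenodehost` is a placeholder, as suggested above:

```json
"inputSource": {
  "type": "hdfs",
  "paths": "hdfs://namenodehost/foo/bar/"
}
```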

)
{
super(maxCacheCapacityBytes, maxFetchCapacityBytes, prefetchTriggerBytes, fetchTimeout, maxFetchRetry);
-    this.inputPaths = HdfsInputSource.coerceInputPathsToList(inputPaths, "inputPaths");
+    this.inputPaths = HdfsInputSource.coerceInputPathsToList(inputPaths, "paths");
Contributor:
👍

@suneet-s (Contributor):
@techdocsmith FYI

@suneet-s (Contributor):
Overall, the defaults look reasonable to me. I looked through other InputSources to see if we need a similar change, and didn't find any. I haven't read through the code in fine detail - will defer to @a2l007

@techdocsmith (Contributor) left a comment:
Style suggestions

docs/configuration/index.md — several review suggestions (outdated, resolved)
|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols that HTTP input source and HTTP firehose can use.|["http", "https"]|

The following properties are to control what domains native batch tasks can access to using
Contributor:
Suggested change:
- The following properties are to control what domains native batch tasks can access to using
+ The following properties control the domains native batch tasks can access using

However, if you want to read from AWS S3 or Google Cloud Storage, consider using
the [S3 input source](#s3-input-source) or the [Google Cloud Storage input source](#google-cloud-storage-input-source) instead.
You can also ingest from other storage using the HDFS input source if the HDFS client supports that storage.
However, if you want to ingest from cloud storage, consider using the proper input sources for them.
Contributor:
Suggested change:
- However, if you want to ingest from cloud storage, consider using the proper input sources for them.
+ However, if you want to ingest from cloud storage, consider using the service-specific input source for your cloud storage.

@jihoonson (Author):
You can read not only from cloud storage but from any storage supported by the HDFS client. I changed it to "the service-specific input source for your data storage".

docs/ingestion/native-batch.md — review suggestions (outdated, resolved)
@@ -1553,6 +1557,11 @@ Note that prefetching or caching isn't that useful in the Parallel task.
|fetchTimeout|Timeout for fetching each file.|60000|
|maxFetchRetry|Maximum number of retries for fetching each file.|3|

You can also ingest from other storage using the HDFS firehose if the HDFS client supports that storage.
However, if you want to ingest from cloud storage, consider using the proper input sources for them.
Contributor:
Suggested change:
- However, if you want to ingest from cloud storage, consider using the proper input sources for them.
+ However, if you want to ingest from cloud storage, consider using the service-specific input source for your cloud storage.

@jihoonson (Author):
Same here.

docs/ingestion/native-batch.md — review suggestion (outdated, resolved)
@techdocsmith (Contributor):
@suneet-s , I did an editorial pass on the affected d.md. Let me know if I've missed something.

@jihoonson (Author):
@techdocsmith thanks for the review 👍

> Overall, the defaults look reasonable to me. I looked through other InputSources to see if we need a similar change, and didn't find any.

@suneet-s I agree. I don't think other inputSources have the same issue.

@jihoonson (Author):
@a2l007 @suneet-s do you have more comments?

@a2l007 (Contributor) left a comment:
LGTM. Thanks!

@jihoonson (Author):
@a2l007 @techdocsmith @suneet-s thanks for the review 👍

@jihoonson merged commit 9946306 into apache:master on Mar 6, 2021
@jihoonson removed this from the 0.21.0 milestone on Jul 14, 2021
@clintropolis added this to the 0.22.0 milestone on Aug 12, 2021