
Add configurations for allowed protocols for HTTP and HDFS inputSources/firehoses #10830

Merged: 16 commits merged into apache:master on Mar 6, 2021

Conversation

@jihoonson (Contributor) commented on Feb 2, 2021:

Description

This PR adds new configurations for allowed protocols for the HTTP and HDFS inputSources and firehoses. These inputSources and firehoses accept only URIs that use an allowed protocol and fail otherwise.

  • druid.ingestion.hdfs.allowedProtocols: Allowed protocols that HDFS inputSource and HDFS firehose can use.
  • druid.ingestion.http.allowedProtocols: Allowed protocols that HTTP inputSource and HTTP firehose can use.

This PR changes the existing behavior, under which users could use any protocol with these inputSources and firehoses.
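For illustration, a hedged sketch of how an operator might set these properties in a Druid runtime.properties file; the property names come from this PR, while the values below are only examples (list-typed Druid properties are written as JSON arrays):

```properties
# Example only: allow plain HTTP and HTTPS for the HTTP inputSource/firehose.
druid.ingestion.http.allowedProtocols=["http", "https"]
# Example only: allow only the hdfs scheme for the HDFS inputSource/firehose.
druid.ingestion.hdfs.allowedProtocols=["hdfs"]
```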


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Commit: …ource.java (Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>)
@abhishekagarwal87 (Contributor) left a comment:
LGTM

@suneet-s (Contributor) commented on Feb 2, 2021:

Do the docs need to be updated as well? Right now it just says "URIs of the input files." https://druid.apache.org/docs/latest/ingestion/native-batch.html#http-input-source

@jihoonson (Author) replied:

> Do the docs need to be updated as well? Right now it just says "URIs of the input files." https://druid.apache.org/docs/latest/ingestion/native-batch.html#http-input-source

The linked section starts with "The HTTP input source is to support reading files directly from remote sites via HTTP." I think this implies the URIs must be HTTP URIs, but I guess it doesn't hurt to make it explicit. I updated the doc.

I also fixed the HDFS inputSource to allow only hdfs paths. All other inputSources that accept URIs already have a scheme check.

@@ -69,6 +70,13 @@ public HttpInputSource(
this.config = config;
}

public static void throwIfInvalidProtocols(List<URI> uris)
Contributor:
nit:

Suggested change:
-    public static void throwIfInvalidProtocols(List<URI> uris)
+    private static void throwIfInvalidProtocols(List<URI> uris)

@jihoonson (Author):
It should be public since this method is now used by HttpFirehoseFactory as well.

Contributor:
😢

if (Arrays.stream(inputPaths).anyMatch(path -> !"hdfs".equalsIgnoreCase(path.toUri().getScheme()))) {
throw new IllegalArgumentException("Input paths must be the HDFS path");
}

Contributor:
It appears that FileInputFormat#addInputPath already has its own scheme validation, so I don't believe we need this check. See org.apache.hadoop.fs.FileSystem#checkPath.

@jihoonson (Author):
FileSystem has a couple of implementations, such as DistributedFileSystem, S3AFileSystem, etc., which support different schemes. The default file system is LocalFileSystem unless you set fs.default.name. I think we should allow only hdfs and let users use the right inputSource to read from other storage (e.g., `S3InputSource` to read from S3).
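As a side note, here is a minimal sketch (not from this PR) of how the Hadoop client resolves a FileSystem implementation from a path's scheme, which is why the inputSource could previously read from any storage the client supports:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeResolutionExample
{
  public static void main(String[] args) throws Exception
  {
    Configuration conf = new Configuration();
    // A path without a scheme resolves against fs.default.name / fs.defaultFS,
    // which is the local file system unless configured otherwise.
    FileSystem fs = new Path("/tmp/example").getFileSystem(conf);
    System.out.println(fs.getUri()); // prints file:/// with a default Configuration
    // A path with an explicit scheme, e.g. hdfs:// or s3a://, would instead pick
    // DistributedFileSystem or S3AFileSystem if the corresponding client is on the classpath.
  }
}
```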

Contributor:
Are there users that rely on this behavior? Would restricting this to just hdfs mean it's not possible for some users to ingest files from these locations any more? If that's the case, should we introduce a config to allow server admins to specify which schemes are supported?

Contributor:
@jihoonson Technically, WebHdfsFileSystem did work with this inputSource before, so this could break ingestion pipelines for operators relying on the HDFS inputSource with the webhdfs scheme. Could you comment on the motivation behind restricting it to only hdfs?

@jihoonson (Author):
@suneet-s @a2l007 good point. I realized that this is a documented feature that should be supported. The concern here is that users can use whatever protocol they want if it's supported by the hdfs client. Instead of restricting it to only hdfs, I added new configs for http and hdfs inputSources so that users can configure what protocols they want to allow.
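To illustrate the approach, here is a minimal sketch of validating URIs against a configured allow-list of protocols; the method name mirrors the one in this diff, but the standalone signature and error message are only an approximation, not the PR's exact code:

```java
import java.net.URI;
import java.util.List;
import java.util.Set;

public class ProtocolValidation
{
  // Throws if any URI uses a scheme that is not in the configured allow-list.
  public static void throwIfInvalidProtocols(Set<String> allowedProtocols, List<URI> uris)
  {
    for (URI uri : uris) {
      String scheme = uri.getScheme();
      if (scheme == null || allowedProtocols.stream().noneMatch(p -> p.equalsIgnoreCase(scheme))) {
        throw new IllegalArgumentException(
            "Only " + allowedProtocols + " protocols are allowed, but got [" + uri + "]"
        );
      }
    }
  }

  public static void main(String[] args)
  {
    Set<String> allowed = Set.of("http", "https");
    // Passes: https is in the allow-list.
    throwIfInvalidProtocols(allowed, List.of(URI.create("https://example.com/data.json")));
    // Throws IllegalArgumentException: ftp is not allowed.
    throwIfInvalidProtocols(allowed, List.of(URI.create("ftp://example.com/data.json")));
  }
}
```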

@jihoonson changed the title from "Allow only HTTP and HTTPS protocols for the HTTP inputSource" to "HTTP inputSource and firehose should support only HTTP and HTTPS; HDFS inputSource and firehose should support only HDFS" on Feb 3, 2021
@suneet-s (Contributor) commented on Feb 3, 2021:

Added Release Notes as there is a subtle change in behavior that users should be aware of when they upgrade.

@jihoonson added this to the 0.21.0 milestone on Feb 4, 2021
@jihoonson changed the title from "HTTP inputSource and firehose should support only HTTP and HTTPS; HDFS inputSource and firehose should support only HDFS" to "Add configurations for allowed protocols for HTTP and HDFS inputSources/firehoses" on Feb 5, 2021
@@ -1064,7 +1064,7 @@ Sample specs:
"type": "index_parallel",
"inputSource": {
"type": "hdfs",
"paths": "hdfs://foo/bar/", "hdfs://bar/foo"
"paths": "hdfs:/foo/bar/", "hdfs:/bar/foo"
Contributor:
nit: Is this change needed?

@jihoonson (Author):
hdfs://foo/bar seems strange to me because it refers to the /bar path on the host foo (https://en.wikipedia.org/wiki/Uniform_Resource_Identifier). I don't think it was intentional.

Contributor:
I think hdfs:/foo/bar may only work if the namenode is configured in Hadoop. In the general case, the URI format is scheme://authority/path (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#Overview).
Do you think it would be clearer to change this to "hdfs://namenodehost/foo/bar/" instead?

@jihoonson (Author):
That sounds even better. I will update it.

@jihoonson (Author):
I updated the doc.
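For reference, the updated sample spec fragment would look roughly like the following; `namenodehost` is a placeholder, as suggested above:

```json
"inputSource": {
  "type": "hdfs",
  "paths": "hdfs://namenodehost/foo/bar/"
}
```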

)
{
super(maxCacheCapacityBytes, maxFetchCapacityBytes, prefetchTriggerBytes, fetchTimeout, maxFetchRetry);
-    this.inputPaths = HdfsInputSource.coerceInputPathsToList(inputPaths, "inputPaths");
+    this.inputPaths = HdfsInputSource.coerceInputPathsToList(inputPaths, "paths");
Contributor:
👍

@suneet-s (Contributor):
@techdocsmith FYI

@suneet-s (Contributor):
Overall, the defaults look reasonable to me. I looked through other InputSources to see if we need a similar change, and didn't find any. I haven't read through the code in fine detail - will defer to @a2l007

@techdocsmith (Contributor) left a comment:
Style suggestions

docs/configuration/index.md — several review suggestions (outdated, resolved)
|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols that HTTP input source and HTTP firehose can use.|["http", "https"]|

The following properties are to control what domains native batch tasks can access to using
Contributor:
Suggested change:
- The following properties are to control what domains native batch tasks can access to using
+ The following properties control the domains native batch tasks can access using

However, if you want to read from AWS S3 or Google Cloud Storage, consider using
the [S3 input source](#s3-input-source) or the [Google Cloud Storage input source](#google-cloud-storage-input-source) instead.
You can also ingest from other storage using the HDFS input source if the HDFS client supports that storage.
However, if you want to ingest from cloud storage, consider using the proper input sources for them.
Contributor:
Suggested change:
- However, if you want to ingest from cloud storage, consider using the proper input sources for them.
+ However, if you want to ingest from cloud storage, consider using the service-specific input source for your cloud storage.

@jihoonson (Author):
You can read not only from cloud storage but from any storage supported by the HDFS client. I changed it to "the service-specific input source for your data storage".

docs/ingestion/native-batch.md — review suggestions (outdated, resolved)
@@ -1553,6 +1557,11 @@ Note that prefetching or caching isn't that useful in the Parallel task.
|fetchTimeout|Timeout for fetching each file.|60000|
|maxFetchRetry|Maximum number of retries for fetching each file.|3|

You can also ingest from other storage using the HDFS firehose if the HDFS client supports that storage.
However, if you want to ingest from cloud storage, consider using the proper input sources for them.
Contributor:
Suggested change:
- However, if you want to ingest from cloud storage, consider using the proper input sources for them.
+ However, if you want to ingest from cloud storage, consider using the service-specific input source for your cloud storage.

@jihoonson (Author):
Same here.

docs/ingestion/native-batch.md — review suggestion (outdated, resolved)
@techdocsmith (Contributor):
@suneet-s , I did an editorial pass on the affected d.md. Let me know if I've missed something.

@jihoonson (Author):
@techdocsmith thanks for the review 👍

> Overall, the defaults look reasonable to me. I looked through other InputSources to see if we need a similar change, and didn't find any.

@suneet-s I agree. I don't think other inputSources have the same issue.

@jihoonson (Author):
@a2l007 @suneet-s do you have more comments?

@a2l007 (Contributor) left a comment:
LGTM. Thanks!

@jihoonson (Author):
@a2l007 @techdocsmith @suneet-s thanks for the review 👍

@jihoonson merged commit 9946306 into apache:master on Mar 6, 2021
@jihoonson removed this from the 0.21.0 milestone on Jul 14, 2021
@clintropolis added this to the 0.22.0 milestone on Aug 12, 2021