New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-830 For GET requests, InvokeHTTP should set the filename of the 'Response' FlowFile based on the URL #5475
Conversation
Thanks for the contribution! What if the file name as url encoded characters in it? Shouldn't those be unencoded? |
public static final PropertyDescriptor UPDATE_FILENAME = new PropertyDescriptor.Builder() | ||
.name("update-filename") | ||
.description("If true and HTTP method is GET, the FlowFile's filename will be extracted from the remote URL.") | ||
.displayName("Update Filename") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm just not familiar of the concept what required means here. I thought in this situation it practically means we are not allowing null as a value. Since here null doesn't add anything to us I decided to set it true. Did I do it wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The property name seems somewhat unclear, although the explanation is helpful. For greater clarity, what do you think about naming this property something like Set Filename from URL
?
The linked Jira issue also mentions the possibility of considering the Content-Disposition
header, when found on a response, which could include the filename. That might make sense as a separate PR. With that thought in mind, it might make more sense to use an enum value here instead of a Boolean. For example: Filename Strategy
, with values of RANDOM
and URL
, with the option for a future third option named RESPONSE_HEADER
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good idea, I'll change that!
@@ -164,6 +164,14 @@ | |||
public static final String HTTP = "http"; | |||
public static final String HTTPS = "https"; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice, seems like the kind of thing that should be shared across all our HTTP capable processors, but that is not an issue for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at HandleHTTPRequest, it doesn't use constants either.
private String getFileNameFromUrl(URL url) { | ||
String fileName; | ||
String path = StringUtils.removeEnd(url.getPath(), "/"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the URL is a class with functions for handling the URL parts correctly, you should need to use manual parsing here.
We have commons.io already as a dependency, so you should be able to
import org.apache.commons.io.FilenameUtils;
FilenameUtils.getName(url.getPath());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you wanted to say "you should not need to use manual parsing". Am I right?
About FilenameUtils. At start I wanted to use it, too, because I don't like custom solutions. Though it didn't work out for me. The problem is that in file paths trailing "/" means that it's a directory. So in a situation where path is ending with a "/", FilenameUtils.getName will return empty string. While for web URLs the situation is different, example and example/ practically mean the same. Not directly, but https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.4 mentions that.
So I ended up in a situation where I think this line is unavoidable and line 1302 can be swapped with the method you are proposing. But from the above it seems like FilenameUtils is not working as expected for web URLs, so I decided to write that line manually because it would be misleading otherwise. What do you think, does it make sense like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense, I think a comment, very similar to your reply above in the code will serve future maintainers to understand why it is doing it like it is, and avoid them just fixing it without understanding your concerns
Hey @ottobackwards! |
I was thinking about unending the URL. For most characters it wouldn't be a problem, but what happens when you encode a "/" as %2F? In this case the filename would contain a "/" which is not valid in files and would also mean that you could construct a path in the filename. Since other processors expect a filename in that property, not a path, I think this wouldn't be backwards compatible. Having encoded characters in the filename is not nice, but at least they are valid. All in all, I'd leave filename as it is. If you agree, I can add a note to the documentation, that the filename could contain url encoded characters and I can also extend the test with a case which ensures that encoded characters in the url remain encoded in the filename too. |
I agree for backwards compatibility with existing flows that use InvokeHTTP to GET a file and then save it to another destination (filesystem, remote server, C3 bucket, etc), it would be best to leave the filename value url encoded to protect those flows that are currently running but not designed to handle special characters. |
String path = StringUtils.removeEnd(url.getPath(), "/"); | ||
|
||
if (StringUtils.isEmpty(path)) { | ||
fileName = url.getHost(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using the URL host, what do you think about changing this to null
and falling back to the current behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your idea, however, this edge case was verified by @markap14 . Which one should I pick in this situation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reply, I will defer to @markap14. Since using the URL path for the filename is not a default setting, the flow designer should be aware of the potential implications. If the current implementation remains as-is, recommend updating the property description to indicate that the filename will fall back to the URL host when the URL does not contain a path.
import static org.apache.nifi.processors.standard.InvokeHTTP.POST_METHOD; | ||
import static org.apache.nifi.processors.standard.InvokeHTTP.PUT_METHOD; | ||
|
||
@RunWith(Parameterized.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NiFi default module dependencies now include JUnit 5, so rather than introducing a new unit test, it would be better to modify the existing class and update if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought about it but I found it rude to update the whole class as my first contribution so that my test would work. But of course, I will do that then!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I appreciate the goal of attempting minimal changes for an initial contribution. There are several other pull requests in the process of updating tests to use JUnit 5, but if you are able to make the changes for this particular class without radically changing the behavior, then feel free to make the adjustments!
…oded when the input path was url encoded
…sors.standard.http package
…enum values instead of boolean
Hey! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @nandorsoma. Moving InvokeHTTP
to the http
package needs to be reverted in order to maintain compatibility for upgrades. The FilenameStrategy
can go in that package, but the processor itself should not be moved.
...i-standard-processors/src/main/java/org/apache/nifi/processors/standard/http/InvokeHTTP.java
Outdated
Show resolved
Hide resolved
...dard-processors/src/main/java/org/apache/nifi/processors/standard/http/FilenameStrategy.java
Show resolved
Hide resolved
…age in order to maintain compatibility for upgrades
@kevdoran wrt url encoding, I can see how that would be a problem, you don't want to decode into a path. I guess just showing encoded names in there and leaving it to the flow ( scripted processor or something ) to do this would be OK. |
…tain URL encoded characters
I've added a little note to the documentation that the filename may contain URL encoded characters to avoid future confusion. |
@ottobackwards my understanding is that the
In this case, we are only deciding if that file name should be url encoded, e.g., I think we can make an argument for either approach. Where backward compatibility comes into play is thinking about existing flows and assumptions they may make about the filename attribute. For example:
Currently, the filename used for S3 object key will be the flowfile UUID. This PR will change that to be the filename, which if it is url encoded will still work with PutS3, but if it is decoded will break if the filename contains special characters such as |
Thank you @kevdoran for the clarification! I would +1 as well on leaving url encoded characters in filename. It's an edge case and it doesn't bring that much value to have the possibility to break existing flows. @ottobackwards I don't have an exact answer why the project wants that feature at all, because I'm too new there. My guess is that this is something that you would naturally expect when you download a content and it should have been the default behavior. And this is the key. If this is something that is not an edge case, then it's logic could belong to the processor itself. But as I said, it's just an assumption from me. @kevdoran, does it make sense? |
I don't feel strongly enough to object. Leaving it encoded is simplest, and users can still fix it up later if they want. As long as it is documented as being encoded I'm OK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through the feedback @nandorsoma, the current version looks good.
What do you think @kevdoran and @ottobackwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me @nandorsoma @exceptionfactory. Thanks for this contribution!
Thanks for the approval @kevdoran! Do you have any additional feedback @ottobackwards? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for your work on this feature @nandorsoma! +1 Merging based on current feedback.
This closes apache#5475 Signed-off-by: David Handermann <exceptionfactory@apache.org>
Thank you for submitting a contribution to Apache NiFi.
Please provide a short description of the PR here:
Description of PR
This pull request adds update filename functionality to the InvokeHTTP processor. If the property is set to true and the request method is GET, it will update the FlowFile's filename with a value fetched from the url. In ideal situation it will be the last element of the path, but I tried to prepare for edge cases, so for example for a "bare url" without path and query params it will be the hostname.
There are two things that I wasn't sure about:
Please let me know what do you think about my two questions and of course also about the whole pr.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically
main
)?Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not
squash
or use--force
when pushing to allow for clean monitoring of changes.For code changes:
mvn -Pcontrib-check clean install
at the rootnifi
folder?LICENSE
file, including the mainLICENSE
file undernifi-assembly
?NOTICE
file, including the mainNOTICE
file found undernifi-assembly
?.displayName
in addition to .name (programmatic access) for each of the new properties?For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.