Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-830 For GET requests, InvokeHTTP should set the filename of the 'Response' FlowFile based on the URL #5475

Closed

Conversation

nandorsoma
Copy link
Contributor

@nandorsoma nandorsoma commented Oct 23, 2021

Thank you for submitting a contribution to Apache NiFi.

Please provide a short description of the PR here:

Description of PR

This pull request adds update filename functionality to the InvokeHTTP processor. If the property is set to true and the request method is GET, it will update the FlowFile's filename with a value fetched from the url. In ideal situation it will be the last element of the path, but I tried to prepare for edge cases, so for example for a "bare url" without path and query params it will be the hostname.

There are two things that I wasn't sure about:

  • In this processor boolean properties can either start with uppercase or lowercase letter. I wasn't sure which one to use, so I've checked out other processors and the ones I saw used lowercase so I used that one. Although it raises the question that wouldn't it worth to make them consistent?
  • It seems to me one processor means strictly one test class. I wanted to write a parameterized test but with junit4 a separate class needs to be created. I tried to write it in HTTPInvokeTest without @parameterized, but it became really hacky so I decided to stay with the separate class approach. Is it good like that?

Please let me know what do you think about my two questions and of course also about the whole pr.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • Have you verified that the full build is successful on JDK 8?
  • Have you verified that the full build is successful on JDK 11?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

@ottobackwards
Copy link
Contributor

Thanks for the contribution!

What if the file name as url encoded characters in it? Shouldn't those be unencoded?
My thinking is that the file names should be the same as if you used GetFile to get it.

public static final PropertyDescriptor UPDATE_FILENAME = new PropertyDescriptor.Builder()
.name("update-filename")
.description("If true and HTTP method is GET, the FlowFile's filename will be extracted from the remote URL.")
.displayName("Update Filename")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be required

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm just not familiar of the concept what required means here. I thought in this situation it practically means we are not allowing null as a value. Since here null doesn't add anything to us I decided to set it true. Did I do it wrong?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The property name seems somewhat unclear, although the explanation is helpful. For greater clarity, what do you think about naming this property something like Set Filename from URL?

The linked Jira issue also mentions the possibility of considering the Content-Disposition header, when found on a response, which could include the filename. That might make sense as a separate PR. With that thought in mind, it might make more sense to use an enum value here instead of a Boolean. For example: Filename Strategy, with values of RANDOM and URL, with the option for a future third option named RESPONSE_HEADER.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea, I'll change that!

@@ -164,6 +164,14 @@
public static final String HTTP = "http";
public static final String HTTPS = "https";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, seems like the kind of thing that should be shared across all our HTTP capable processors, but that is not an issue for this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at HandleHTTPRequest, it doesn't use constants either.

private String getFileNameFromUrl(URL url) {
String fileName;
String path = StringUtils.removeEnd(url.getPath(), "/");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the URL is a class with functions for handling the URL parts correctly, you should need to use manual parsing here.

We have commons.io already as a dependency, so you should be able to

import org.apache.commons.io.FilenameUtils;

FilenameUtils.getName(url.getPath());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you wanted to say "you should not need to use manual parsing". Am I right?
About FilenameUtils. At start I wanted to use it, too, because I don't like custom solutions. Though it didn't work out for me. The problem is that in file paths trailing "/" means that it's a directory. So in a situation where path is ending with a "/", FilenameUtils.getName will return empty string. While for web URLs the situation is different, example and example/ practically mean the same. Not directly, but https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.4 mentions that.

So I ended up in a situation where I think this line is unavoidable and line 1302 can be swapped with the method you are proposing. But from the above it seems like FilenameUtils is not working as expected for web URLs, so I decided to write that line manually because it would be misleading otherwise. What do you think, does it make sense like that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense, I think a comment, very similar to your reply above in the code will serve future maintainers to understand why it is doing it like it is, and avoid them just fixing it without understanding your concerns

@nandorsoma
Copy link
Contributor Author

nandorsoma commented Oct 25, 2021

Thanks for the contribution!

What if the file name as url encoded characters in it? Shouldn't those be unencoded? My thinking is that the file names should be the same as if you used GetFile to get it.

Hey @ottobackwards!
Thank you for your review! URL encoding is a good point, I'll try to find something to handle them. I've tried to answer your other remarks, please see them above.

@kevdoran kevdoran self-requested a review October 26, 2021 14:18
@nandorsoma
Copy link
Contributor Author

Thanks for the contribution!
What if the file name as url encoded characters in it? Shouldn't those be unencoded? My thinking is that the file names should be the same as if you used GetFile to get it.

Hey @ottobackwards! Thank you for your review! URL encoding is a good point, I'll try to find something to handle them. I've tried to answer your other remarks, please see them above.

I was thinking about unending the URL. For most characters it wouldn't be a problem, but what happens when you encode a "/" as %2F? In this case the filename would contain a "/" which is not valid in files and would also mean that you could construct a path in the filename. Since other processors expect a filename in that property, not a path, I think this wouldn't be backwards compatible. Having encoded characters in the filename is not nice, but at least they are valid. All in all, I'd leave filename as it is. If you agree, I can add a note to the documentation, that the filename could contain url encoded characters and I can also extend the test with a case which ensures that encoded characters in the url remain encoded in the filename too.

@kevdoran
Copy link
Contributor

I agree for backwards compatibility with existing flows that use InvokeHTTP to GET a file and then save it to another destination (filesystem, remote server, C3 bucket, etc), it would be best to leave the filename value url encoded to protect those flows that are currently running but not designed to handle special characters.

String path = StringUtils.removeEnd(url.getPath(), "/");

if (StringUtils.isEmpty(path)) {
fileName = url.getHost();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using the URL host, what do you think about changing this to null and falling back to the current behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your idea, however, this edge case was verified by @markap14 . Which one should I pick in this situation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply, I will defer to @markap14. Since using the URL path for the filename is not a default setting, the flow designer should be aware of the potential implications. If the current implementation remains as-is, recommend updating the property description to indicate that the filename will fall back to the URL host when the URL does not contain a path.

import static org.apache.nifi.processors.standard.InvokeHTTP.POST_METHOD;
import static org.apache.nifi.processors.standard.InvokeHTTP.PUT_METHOD;

@RunWith(Parameterized.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NiFi default module dependencies now include JUnit 5, so rather than introducing a new unit test, it would be better to modify the existing class and update if necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about it but I found it rude to update the whole class as my first contribution so that my test would work. But of course, I will do that then!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I appreciate the goal of attempting minimal changes for an initial contribution. There are several other pull requests in the process of updating tests to use JUnit 5, but if you are able to make the changes for this particular class without radically changing the behavior, then feel free to make the adjustments!

@nandorsoma
Copy link
Contributor Author

Hey!
I've tried to resolve the topics that were raised in the conversation. Each commit refers to a previous comment. Please let me know if you think something needs additional improvement!

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @nandorsoma. Moving InvokeHTTP to the http package needs to be reverted in order to maintain compatibility for upgrades. The FilenameStrategy can go in that package, but the processor itself should not be moved.

@ottobackwards
Copy link
Contributor

ottobackwards commented Nov 2, 2021

@kevdoran wrt url encoding, I can see how that would be a problem, you don't want to decode into a path. I guess just showing encoded names in there and leaving it to the flow ( scripted processor or something ) to do this would be OK.
But, given that, why have this at all? invokehttp.response.url and the invokehttp.request.url, so the attributes have the information, you can do what you want downstream and handle all the edge cases etc that you like.

@nandorsoma
Copy link
Contributor Author

I've added a little note to the documentation that the filename may contain URL encoded characters to avoid future confusion.

@kevdoran
Copy link
Contributor

kevdoran commented Nov 3, 2021

@ottobackwards my understanding is that the invokehttp.request|response.url attributes contain the full url, where as the new filename attribute is just the leaf of the path, correct? ie

invokehttp.request.url = https://example.com/path/to/my-file.txt
filename = my-file.txt

In this case, we are only deciding if that file name should be url encoded, e.g., filename%20with%20spaces%20and%2Fslashes.txt or filename with spaces and/slashes.txt

I think we can make an argument for either approach.

Where backward compatibility comes into play is thinking about existing flows and assumptions they may make about the filename attribute. For example:

InvokeHTTP >> PutS3

Currently, the filename used for S3 object key will be the flowfile UUID. This PR will change that to be the filename, which if it is url encoded will still work with PutS3, but if it is decoded will break if the filename contains special characters such as /. For this reason, I vote leaving filename url encoded in InvokeHTTP, but I don't feel strongly!

@nandorsoma
Copy link
Contributor Author

Thank you @kevdoran for the clarification! I would +1 as well on leaving url encoded characters in filename. It's an edge case and it doesn't bring that much value to have the possibility to break existing flows.

@ottobackwards I don't have an exact answer why the project wants that feature at all, because I'm too new there. My guess is that this is something that you would naturally expect when you download a content and it should have been the default behavior. And this is the key. If this is something that is not an edge case, then it's logic could belong to the processor itself. But as I said, it's just an assumption from me. @kevdoran, does it make sense?

@ottobackwards
Copy link
Contributor

I don't feel strongly enough to object. Leaving it encoded is simplest, and users can still fix it up later if they want. As long as it is documented as being encoded I'm OK.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the feedback @nandorsoma, the current version looks good.

What do you think @kevdoran and @ottobackwards?

Copy link
Contributor

@kevdoran kevdoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me @nandorsoma @exceptionfactory. Thanks for this contribution!

@exceptionfactory
Copy link
Contributor

Thanks for the approval @kevdoran!

Do you have any additional feedback @ottobackwards?

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for your work on this feature @nandorsoma! +1 Merging based on current feedback.

@asfgit asfgit closed this in 6fd1f03 Nov 15, 2021
krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022
This closes apache#5475

Signed-off-by: David Handermann <exceptionfactory@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants