Skip to content

NIFI-7505 Add InvokeHTTPRecord processor#7532

Closed
pspiller wants to merge 1 commit intoapache:mainfrom
pspiller:invokehttprecord
Closed

NIFI-7505 Add InvokeHTTPRecord processor#7532
pspiller wants to merge 1 commit intoapache:mainfrom
pspiller:invokehttprecord

Conversation

@pspiller
Copy link

Summary

NIFI-7505 - Add a record-based processor that executes an HTTP request for each record in a flowfile. Many of the options are similar to the InvokeHTTP processor's, but adapted for record-based operation.

Request attributes like the URL, headers and request body can vary per record, using RecordPath queries to extract data from each input record. The response body and various metadata (headers, status code, etc) can be written to each record.

The processor can be configured to route the entire input flowfile to a failure relationship if an error is encountered, or failing records can be split off into a separate flowfile. HTTP requests can be retried a configurable number of times.

The response body can optionally be parsed using a record reader service. As well as allowed the parsed body to be included in the output record, in this mode the response body can also be used to determine whether the request is treated as having succeed or failed. This can be useful for APIs that always return a 200 status code, with error information in the response body.

This PR isn't complete yet (it still needs tests), but I've created it in case anybody has any feedback on the approach I'm taking.

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Add a record-based processor that executes an HTTP request for each
record in a flowfile. Many of the options are similar to the InvokeHTTP
processor's, but adapted for record-based operation.

Request attributes like the URL, headers and request body can vary per
record, using RecordPath queries to extract data from each input record.
The response body and various metadata (headers, status code, etc) can
be written to each record.

The processor can be configured to route the entire input flowfile to
a failure relationship if an error is encountered, or failing records
can be split off into a separate flowfile. HTTP requests can be retried
a configurable number of times.

The response body can optionally be parsed using a record reader
service. As well as allowed the parsed body to be included in the output
record, in this mode the response body can also be used to determine
whether the request is treated as having succeed or failed. This can be
useful for APIs that always return a 200 status code, with error
information in the response body.

This PR isn't complete yet (it still needs tests), but I've created it
in case anybody has any feedback on the approach I'm taking.
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for proposing this new Processor @pspiller.

This is a challenging component to implement due to the wide variety of HTTP-based services. The current InvokeHTTP Processor includes a very large number of configuration properties for this reason, but it also makes the Processor challenging to configure at times. If we go forward with this Processor, one primary goal should be to reduce the number of configurable properties and implement more logical organization through dependent properties. Due to its expansion over time, the existing InvokeHTTP Processor does not provide a helpful pattern to follow, so there are a number of areas that should be re-evaluated.

Although reviewing the code itself can be helpful, I think it would be better to take a step back and outline the initial scope of the Processor in the Jira issue. This would provide an opportunity to outline items such as configuration properties, relationships, and error handling. As it stands, there would be a large number of elements that I would recommend changing in the current PR when it comes to properties and code implementation. One high-level recommendation would be to use the WebClientServiceProvider Controller Service API to abstract socket and TLS configuration. This also abstracts HTTP access instead of using the OkHttp client directly.

Another general area of concern with the proposed approach is handling multiple records differently. Splitting records into new FlowFiles based on success and failure makes sense at one level, especially in terms of data transmission tracking. However, that also raises the question about splitting the input FlowFile and handling requests individual using the existing InvokeHTTP Processor. Other use cases, such as sending batches of records might also be considered, although that could also be accomplished using existing SplitRecord and InvokeHTTP Processors.

Thank you again for your contribution, this is a challenging task, and needs to be well-scoped in order to provide a usable and maintainable solution. It might be worth evaluating some improvements to existing components as a way to get more familiar with the project review process, and then come back to this Processor. Efforts like this can take a long time to complete if not properly scoped, so the goal here is to help set expectations before getting into the details.

@pspiller
Copy link
Author

pspiller commented Aug 1, 2023

Thanks for the feedback!

I did base InvokeHTTPRecord quite heavily on InvokeHTTP, so if that's not a good example to follow then I can understand that changes are required. I can definitely see the value in using WebClientServiceProvider - that would remove a number of config properties from InvokeHTTPRecord, as well as making the code easier to test.

Having the retry handling and record splitting all inside the one processor does feel a bit like it's going against Nifi's default philosophy (break tasks up into multiple simple processors). My first instinct when building a flow was to split up the input records into individual FlowFiles and then handle HTTP requests individually. Synchronising and reassembling everything at the end got pretty painful, though, especially handling retries.

The current InvokeHTTPRecord approach was largely copied from processors like LookupRecord (i.e. do the same operation for each record in the FlowFile). I could remove the record splitting logic from InvokeHTTPRecord and let splitting be handled later by a separate SplitRecord processor, if that'd be cleaner.

I can write up an outline of how I see the processor working on the Jira issue, if that's the best way to move forward.

@exceptionfactory
Copy link
Contributor

Thanks for the reply @pspiller!

One other thing to note about HTTP communication in relation to this Processor, often REST services will have separate endpoints for batch processing. This is much more efficient in terms of network communication and transaction handling, but of course it depends on the service in question.

Another point to consider is that while InvokeHTTP supports all of the standard HTTP verbs, a record-oriented Processor only seems to make sense in the context of sending a request body, such as with POST, PUT, or PATCH.

Given these points, I think it would be best to close the current PR and scope out the options on the Jira issue.

With the goal being a generic reusable Processor, it would also be very helpful to highlight some well-known REST services where this proposed Processor would make sense. That would help highlight whether it is a good fit for something generic, or whether it ends up being too complicated for general use.

@ChrisSamo632
Copy link
Contributor

One could consider the PutElasticsearchRecord processor as a service/application-specific instance of this - operations are executed per-Record based upon processor properties and/or fields set within Records

For example, the Index Operation (index, update, delete) can be set based upon a Record field. Similar options are available for other parameters in the Elasticsearch _bulk API call. These settings dictate the body of the request passed to Elasticsearch

A similar approach could be considered here, i.e. allow a Record Path property to set the request Verb based on the value of a field within each Record. In not, however, all that used to using InvokeHTTP, so such an approach may not be so relevant at such a generic level

@exceptionfactory
Copy link
Contributor

Thanks for the additional input @ChrisSamo632. The Elasticsearch components are a good example of a more specific use case for record-oriented HTTP communication. As you mentioned, it is more difficult to implement a more generic solution given the unbounded options available for HTTP request and response strategies.

With this background, thanks again for proposing this component @pspiller! As recommended earlier, I think continuing the discussion on the Jira issue would be the best way forward, so I will close the pull request for now, pending further discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants