Skip to content

NIFI-10710 implement processor for AWS Polly, Textract, Translate, Tr…#6589

Closed
KalmanJantner wants to merge 16 commits intoapache:mainfrom
KalmanJantner:main
Closed

NIFI-10710 implement processor for AWS Polly, Textract, Translate, Tr…#6589
KalmanJantner wants to merge 16 commits intoapache:mainfrom
KalmanJantner:main

Conversation

@KalmanJantner
Copy link
Contributor

@KalmanJantner KalmanJantner commented Oct 27, 2022

…anscribe

Summary

NIFI-10710

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • [] Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 8
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @KalmanJantner, these look like useful new components!

The class files are missing the Apache license header, so that is one general adjustment to make. I recommend running a local Maven build with the -P contrib-check option to catch formatting issues.

At a high level, this looks like an opportunity to start with version 2 of the AWS SDK, instead of using the existing version 1.

Copy link
Contributor

@markap14 markap14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting up the Pull Request @KalmanJantner ! I think these will be really helpful processors. But I think there are a lot of conventions that we need to make sure that we are following here. Processor names, descriptions, documentation, best practices, etc. And given @exceptionfactory 's feedback that we're using the 1.x SDK instead of the 2.x SDK I think we need to make sure that these get addressed.

@KalmanJantner KalmanJantner requested review from exceptionfactory and markap14 and removed request for exceptionfactory and markap14 November 4, 2022 21:46
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @KalmanJantner. I noted a number of additional suggestions, and several files still missing license headers.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to work on these new Processors @KalmanJantner. I noted a number of adjustments, mostly minor changes.

Copy link
Contributor Author

@KalmanJantner KalmanJantner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@exceptionfactory Thank you for the feedbacks. I implemented the changes based on your suggestions.

Comment on lines 61 to 63
}

if (status == JobStatus.COMPLETED) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thx.

@exceptionfactory
Copy link
Contributor

@KalmanJantner, I will take a closer look at the latest changes soon. Please avoid introducing merge commits, instead, please rebase and force-push changes to align with the main branch. Thanks!

@KalmanJantner
Copy link
Contributor Author

@KalmanJantner, I will take a closer look at the latest changes soon. Please avoid introducing merge commits, instead, please rebase and force-push changes to align with the main branch. Thanks!

Thanks @exceptionfactory, I have removed the merge commit.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the latest set of updates @KalmanJantner. I noted a number of minor stylistic adjustments, but this looks to be nearing completion.

Tagging @markap14 for additional review.

Copy link
Contributor

@pvillard31 pvillard31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing with this PR. Some remarks:

  • set mime.type to json when we know that the generated flow file is going to contain a JSON payload (example: StartAwsTextractJob, GetAwsTextractJobStatus, etc).
  • please set description on all properties and enum values where it makes sense. We should not assume the user knows the difference between Text Analysis and Document Text Detection for example (in StartAwsTextractJob).
  • please add additionalDetails with examples of expected JSON payloads + links to AWS documentation if that makes sense (example: StartAwsTextractJob)
  • the type of extract in GetAwsTextractJobStatus is a string and not an enum. If that's because we want to be able to reference a flow file attribute, then we should directly reference the flow file attribute that we created in StartAwsTextractJob. In this case: ${type-of-service}. However when trying to use EL, it fails because it's not one of the allowable values. Something to fix here.

@pvillard31
Copy link
Contributor

Note - I do see the additionalDetails page on the pull request but it's not showing in the usage page of the processor. Maybe something wrong somewhere when building the documentation?

@KalmanJantner
Copy link
Contributor Author

Playing with this PR. Some remarks:

  • set mime.type to json when we know that the generated flow file is going to contain a JSON payload (example: StartAwsTextractJob, GetAwsTextractJobStatus, etc).
  • please set description on all properties and enum values where it makes sense. We should not assume the user knows the difference between Text Analysis and Document Text Detection for example (in StartAwsTextractJob).
  • please add additionalDetails with examples of expected JSON payloads + links to AWS documentation if that makes sense (example: StartAwsTextractJob)
  • the type of extract in GetAwsTextractJobStatus is a string and not an enum. If that's because we want to be able to reference a flow file attribute, then we should directly reference the flow file attribute that we created in StartAwsTextractJob. In this case: ${type-of-service}. However when trying to use EL, it fails because it's not one of the allowable values. Something to fix here.

Thank you for the comments, I went through and fixed them.

@KalmanJantner KalmanJantner requested review from exceptionfactory and removed request for markap14 December 20, 2022 07:50
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the latest round of updates @KalmanJantner.

Following some runtime testing, I noticed a few additional issues. For status retrieval Processors, it seems like it would be helpful to add a new Task ID property that defaults to reading the value from a FlowFile attribute. As implemented right now, the FlowFile attribute is hard-coded, and not reflected in documentation. Although the documentation could be updated with the addition of ReadsAttribute annotations on the Processors, introducing a new Property with default value would avoid the implicit attribute handling while supporting the same basic behavior.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the latest round of updates @KalmanJantner. While testing some error conditions, it looks like there is an issue with handling FlowFiles in at least some of the Start Processors. Attempting to process a FlowFile without having valid AWS credentials results in the following uncaught exception.

2023-01-23 10:33:23,888 WARN [Timer-Driven Process Thread-3] o.a.n.controller.tasks.ConnectableTask Processing halted: uncaught exception in Component [StartAwsTextractJob[id=97c0dc20-0185-1000-d3db-4140509dc964]]
org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=24bdcb1d-49f6-45ed-81cf-db6fb433afba,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1674491600189-1, container=default, section=1], offset=0, length=2],offset=0,name=24bdcb1d-49f6-45ed-81cf-db6fb433afba,size=2] transfer relationship not specified. This FlowFile was not created in this session and was not transferred to any Relationship via ProcessSession.transfer()
	at org.apache.nifi.controller.repository.StandardProcessSession.validateCommitState(StandardProcessSession.java:262)
	at org.apache.nifi.controller.repository.StandardProcessSession.checkpoint(StandardProcessSession.java:277)
	at org.apache.nifi.controller.repository.StandardProcessSession.commit(StandardProcessSession.java:559)
	at org.apache.nifi.controller.repository.StandardProcessSession.commitAsync(StandardProcessSession.java:513)
	at org.apache.nifi.processors.aws.AbstractAWSProcessor.onTrigger(AbstractAWSProcessor.java:303)
	at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1356)
	at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:246)
	at org.apache.nifi.controller.scheduling.AbstractTimeBasedSchedulingAgent.lambda$doScheduleOnce$0(AbstractTimeBasedSchedulingAgent.java:59)
	at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

@exceptionfactory
Copy link
Contributor

@KalmanJantner Following additional testing, I realized the uncaught exception was the result of running with a previous build, the latest commit appears to have resolved that problem, and handles failures as expected through the failiure relationship.

@exceptionfactory exceptionfactory self-requested a review January 25, 2023 04:55
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the feedback @KalmanJantner! The latest version looks like it covers previous code comments I had noted.

Do you have any additional feedback @pvillard31?

KalmanJantner and others added 6 commits January 25, 2023 07:05
…java/org/apache/nifi/processors/aws/ml/AwsMachineLearningJobStarter.java

Co-authored-by: exceptionfactory <exceptionfactory@apache.org>
…java/org/apache/nifi/processors/aws/ml/polly/GetAwsPollyJobStatus.java

Co-authored-by: exceptionfactory <exceptionfactory@apache.org>
…java/org/apache/nifi/processors/aws/ml/transcribe/GetAwsTranscribeJobStatus.java

Co-authored-by: exceptionfactory <exceptionfactory@apache.org>
…java/org/apache/nifi/processors/aws/ml/textract/GetAwsTextractJobStatus.java

Co-authored-by: exceptionfactory <exceptionfactory@apache.org>
…java/org/apache/nifi/processors/aws/ml/polly/GetAwsPollyJobStatus.java

Co-authored-by: exceptionfactory <exceptionfactory@apache.org>
…java/org/apache/nifi/processors/aws/ml/AwsMachineLearningJobStarter.java

Co-authored-by: exceptionfactory <exceptionfactory@apache.org>
Copy link
Contributor

@pvillard31 pvillard31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through all of the feedback @KalmanJantner, and thanks for the review confirmation @pvillard31, the latest version looks good! +1 merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants