
NIFI-4946 nifi-spark-bundle : Adding support for pyfiles, file, jars options#2521

Closed

Mageswaran1989 wants to merge 4 commits into apache:master from
dhiraa:NIFI-4946_nifi-spark-bundle_Adding_support_for_pyfiles

Conversation


@Mageswaran1989 Mageswaran1989 commented Mar 8, 2018

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@Mageswaran1989 (Author)

Team,

This PR was created in line with our current requirements.
The changes have so far been tested manually and found to be working.

We would like to get these changes into the mainline, for which we need the community's help with reviewing and adding test code, since we are new to NiFi extensions.

The next plan is to add test cases:

  • Create sample PySpark example modules as part of the test resources
  • Use them in tests

Could anyone point out some existing test cases that do similar testing?


@zenfenan zenfenan left a comment


Thank you @Mageswaran1989 for the contribution. I haven't done actual testing with these changes yet; I'll do that and update the review as well. Thanks

properties.add(MAIN_PY_FILE);
properties.add(NAME);
properties.add(CODE);
// properties.add(ARGS);
Contributor

Comments to be removed?

Author

Only the pyfiles and file options are tested; the rest are yet to be tested.

The plan was to implement test modules and then test the other features, since the current manual testing requires a long routine of compiling, copying, and restarting NiFi.

}
}

log.debug(" ====> jsonResponse: " + jsonResponse);
Contributor

Cosmetic change: it would be great if this log.debug message could be changed to something of a proper standard, like "JSON Response: ", i.e. remove ====>

Author

Sure, this will be removed in the next commit.

break;
default:
log.debug(" ====> default State: " + state);
session.transfer(flowFile, REL_WAIT);
Contributor

Same as above for these log.debug messages as well.


public static final PropertyDescriptor JAR_FILES = new PropertyDescriptor.Builder()
.name("exec-spark-iactive-jarfiles")
.displayName("jars")
Contributor

displayName is what will be rendered on the UI. So let's change it to JARs or Application JARs?

Author

Sure, will do that.

.expressionLanguageSupported(false)
.build();

public static final PropertyDescriptor NAME = new PropertyDescriptor.Builder()
Contributor

Is this supposed to be the Spark app name? It looks like it is never used anywhere other than being added to the PropertyDescriptor list.

Author

As said before, it is not yet considered. We just wanted to get the hang of the code with our basic requirements.


public static final PropertyDescriptor MAIN_PY_FILE = new PropertyDescriptor.Builder()
.name("exec-spark-iactive-main-py-file")
.displayName("file")
Contributor

Same as the JARs case. Most of the PropertyDescriptors use all-lowercase characters for displayName. Please change it.


String jsonResponse = null;

if (StringUtils.isEmpty(jsonResponse)) {
Contributor

This will be true all the time, right?

Author

Once the current approach is accepted, this can be taken care of.

//Incoming flow file is not an JSON file hence consider it to be an triggering point

String batchPayload = "{ \"pyFiles\": [\"" +context.getProperty(PY_FILES).getValue()+ "\"], " +
"\"file\" : \""+context.getProperty(MAIN_PY_FILE).getValue()+"\" }";
Contributor

This is confusing to me. Why are we saying that if the incoming flowfile is not valid JSON, we go ahead with the assumption that it is going to be PySpark? I mean, the assumption here lacks clarity. Please correct me if I'm wrong.
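Separately from the assumption question, the batch payload quoted above is built by plain string concatenation of property values. As a stdlib-only sketch (class, helper, and method names are hypothetical, not part of the PR), a minimal JSON escaping helper would keep the payload valid even when file paths contain quotes or backslashes:

```java
// Sketch: escape a value before embedding it in a hand-built JSON payload.
// Names here are illustrative; the PR concatenates raw property values instead.
class JsonEscape {
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break; // escape double quotes
                case '\\': sb.append("\\\\"); break; // escape backslashes
                case '\n': sb.append("\\n");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Mirrors the shape of the payload built in the quoted diff.
    static String batchPayload(String pyFiles, String mainPyFile) {
        return "{ \"pyFiles\": [\"" + escapeJson(pyFiles) + "\"], "
             + "\"file\" : \"" + escapeJson(mainPyFile) + "\" }";
    }

    public static void main(String[] args) {
        // prints { "pyFiles": ["/tmp/deps.zip"], "file" : "/tmp/main.py" }
        System.out.println(batchPayload("/tmp/deps.zip", "/tmp/main.py"));
    }
}
```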

Author

Could you please check the description at https://issues.apache.org/jira/browse/NIFI-4946

The assumption was made so that it doesn't break the existing code flow, while at the same time letting us know the status of the submitted job.

So the naive idea was to re-route the Livy JSON response back to the Spark processor itself, so that it can get the last submitted URL from the custom (tampered) JSON response, wait for the user-specified wait time, and query Livy again for the job status in a loop until it succeeds or fails.

So when the processor is configured to submit a Spark job, it expects the incoming flowfile to be a custom JSON response with a URL field to query Livy; if not, it is considered a triggering point, nothing else.

I am open to any ideas from your end.

Thanks.
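The loop described above (wait, re-query Livy, repeat until the job succeeds or fails) can be sketched as follows. This is a stdlib-only illustration with a Supplier standing in for the actual Livy REST call; class and method names are hypothetical, not from the PR, and "success"/"dead" are the terminal Livy batch states:

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Sketch of the polling idea: query a status source in a loop, waiting between
// attempts, until the job reaches a terminal state or we give up.
class BatchPoller {
    static String pollUntilDone(Supplier<String> statusSource, long waitMillis, int maxAttempts)
            throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            String state = statusSource.get(); // stands in for a GET on the batch URL
            if ("success".equals(state) || "dead".equals(state)) {
                return state; // terminal state reached
            }
            Thread.sleep(waitMillis); // user-specified wait time between polls
        }
        return "unknown"; // gave up without reaching a terminal state
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake status source simulating a job that finishes on the third poll.
        Iterator<String> states = List.of("starting", "running", "success").iterator();
        // prints success
        System.out.println(pollUntilDone(states::next, 1L, 10));
    }
}
```

In the processor itself, the wait would presumably come from the penalty/yield mechanism or a wait-time property rather than a raw sleep.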

Author

@zenfenan could you please review the above logic and suggest a way to handle plain Scala/Python code and packaged source files for pyfiles/jars?

Contributor

The code can be submitted using the CODE property, right? That used to work. Or are you asking for a way to upload/send files or jars through Livy?

Author

As per the code flow at https://issues.apache.org/jira/browse/NIFI-4946, I am currently able to send *.zip files (Python modules) through Livy. My question was: what should we do with the flowfile when we are using the processor to submit a batch job?

Author

Sometime this week I am planning to add support for jar files, args, and the application name over the Livy options.

The catch here is that, unlike plain Spark code, batch process code takes time to finish, which is expected. So, as a hack, I was re-routing the JSON response after batch submission back to the processor itself, where I poll the incoming flowfile and check whether it is a JSON file; if so, I try to get the Livy URL and post to it again to learn the status of the batch job for as long as it runs. Once the job has finished, the success route is triggered.

That was the reason I made the assumption that if the incoming file is JSON, it is from a previous batch job submission.

In short, the flowfile:

  • Is considered a triggering point, (or)
  • Is considered plain Spark code that compiles over Livy, (or)
  • Is a Livy JSON response, which can further be used to check the status of a long-running Spark batch job

I was looking for the right NiFi way of handling this.

I feel I am being too conservative, trying to fit all the functionality into one processor:

  • A flowfile/property can be used to run Spark code
  • Pyfiles can be used to run a Spark batch job
  • Jars can be used to run a Spark batch job
  • The args option is for batch mode
  • By re-routing success back to itself, it can monitor the long-running job over the Livy REST APIs
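The three cases above could be sketched as a small classifier. This is a heuristic illustration only: the PR itself detects non-JSON content by catching a JSONException, and the "url" field name follows the description given earlier; class, enum, and method names are hypothetical:

```java
// Sketch of the three-way routing described above (heuristic, not the PR's code).
class FlowFileClassifier {
    enum Kind { TRIGGER, SPARK_CODE, LIVY_RESPONSE }

    static Kind classify(String content) {
        if (content == null || content.trim().isEmpty()) {
            return Kind.TRIGGER; // empty flowfile: just a triggering point
        }
        String t = content.trim();
        // A Livy batch response is a JSON object carrying the URL to poll.
        if (t.startsWith("{") && t.contains("\"url\"")) {
            return Kind.LIVY_RESPONSE;
        }
        return Kind.SPARK_CODE; // anything else: treat as code to run over Livy
    }

    public static void main(String[] args) {
        // prints LIVY_RESPONSE then SPARK_CODE
        System.out.println(classify("{ \"url\": \"http://livy:8998/batches/42\" }"));
        System.out.println(classify("print(spark.version)"));
    }
}
```

A real implementation would parse the JSON properly rather than rely on substring checks, but the enum makes the three routes explicit.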

Contributor

Since the current processor is called ExecuteSparkInteractive, you could move the batch functionality out into something called SubmitSparkJob or something. The outgoing flow file could contain the application ID and/or any other information that would allow you to monitor the job downstream (perhaps with InvokeHttp through Livy, e.g.) Are you still interested in working on this? It seems like very nice functionality to have in NiFi!


} catch (JSONException | InterruptedException e) {

//Incoming flow file is not an JSON file hence consider it to be an triggering point
Contributor

Cosmetic change: Multiple empty lines were left. IMHO, one empty line should be enough for better readability.


github-actions bot commented May 2, 2021

We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed, this PR will be closed. This stale marker and eventual auto-close do not indicate a judgement of the PR, just lack of reviewer bandwidth, and help us keep the PR queue more manageable. If you would like this PR re-opened, you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth, which in turn helps yours.

@github-actions github-actions bot added the Stale label May 2, 2021
@github-actions github-actions bot closed this May 18, 2021