Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ZEPPELIN-2075] Can't stop infinite while statement in pyspark Interpreter. #1985

Closed
wants to merge 8 commits into from

Conversation

astroshim
Copy link
Contributor

What is this PR for?

If following code runs with Pyspark Interpreter, there is no way to cancel except Zeppelin Server restart.

%spark.pyspark
import time

while True:
    time.sleep(1)
    print("running..")

What type of PR is it?

Bug Fix | Improvement

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-2075

How should this be tested?

Run above code with Pyspark Interpreter and try to cancel.

Screenshots (if appropriate)

  • before
    pyspark before

  • after
    pyspark after

Questions:

  • Does the licenses files need update? no
  • Is there breaking changes for older versions? no
  • Does this needs documentation? no

Copy link
Member

@Leemoonsoo Leemoonsoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for great contribution. I left few comments. Please take a look.
And would it be too difficult to have an unittest?

@Override
public void cancel(InterpreterContext context) {
SparkInterpreter sparkInterpreter = getSparkInterpreter();
sparkInterpreter.cancel(context);
try {
interrupt();
Copy link
Member

@Leemoonsoo Leemoonsoo Feb 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think
In case of Spark job is running, need to wait sparkInterpreter.cancel(context) and shouldn't call interrupt.
In case of Spark job is not running, then interrupt() need to be called.
what do you think?

try {
interrupt();
} catch (IOException e) {
e.printStackTrace();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shell we use logger to print log?



signal.signal(signal.SIGINT, handler_stop_signals)
Copy link
Member

@Leemoonsoo Leemoonsoo Feb 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it works without this line and handler_stop_signals() function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It seems that spark has already signal handler.
Thanks.

@Leemoonsoo
Copy link
Member

@astroshim can you take a look CI error ?

@karuppayya
Copy link
Contributor

@astroshim This would be a problem in case of scala snippet also(a loop with a print in other words , that which runs only within driver). Is there a possible fix such that all interpreters(scala, python) can benefit.

@astroshim astroshim closed this Feb 12, 2017
@astroshim astroshim reopened this Feb 12, 2017
Runtime.getRuntime().exec("kill -SIGINT " + pythonPid);
} else {
logger.warn("Non UNIX/Linux system, close the interpreter");
close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this leave the interpreter in a bad state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right, but ERROR status would be a problem?

logger.info("Sending SIGINT signal to PID : " + pythonPid);
Runtime.getRuntime().exec("kill -SIGINT " + pythonPid);
} else {
logger.warn("Non UNIX/Linux system, close the interpreter");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure - why is it going to be non Unix if pythonPid == -1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because creating process in windows creates a pseudo process (thread), and the pid is a negative number. ``

@astroshim
Copy link
Contributor Author

Hello. Sorry for late response.
@Leemoonsoo Let me fix CI error.
@karuppayya Let me fix scala too in another PR.

@astroshim astroshim closed this Feb 16, 2017
@astroshim astroshim reopened this Feb 16, 2017
} catch (InterruptedException e) {
e.printStackTrace();
}
pySparkInterpreter.cancel(context);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to something that checks if thread is actually finished or not?
I think thread.join() is required here. otherwise test will not fail even if job is not cancelled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing properly out. let me fix it.

@Leemoonsoo
Copy link
Member

CI test failure is not related (fixed by #2033).

LGTM and merge to master if no further discussions.

@zjffdu
Copy link
Contributor

zjffdu commented Feb 19, 2017

@astroshim It looks like the python process will be killed, does it mean the interpreter is closed and the state of all the previous paragraphs is lost ? I try the cancel function in jupyter, it seems jupyter just cancel the current cell but keep the python process alive, the previous cell's state is still in the python process. Maybe there's a better approach for this.

@zjffdu
Copy link
Contributor

zjffdu commented Feb 19, 2017

Sorry, it's my misunderstanding. It just interrupt the process rather than kill it. The behavior is consistent with jupyter. It's awesome.

@astroshim
Copy link
Contributor Author

@zjffdu yes, this PR is not kill the python process.
Thank you for reviewing!

@astroshim astroshim closed this Feb 19, 2017
@astroshim astroshim reopened this Feb 19, 2017
@FireArrow
Copy link
Contributor

Sorry for the late comment.
This solution wouldn't work with impersonation, as the zeppelin user wouldn't have permission to interrupt another users process (unless zeppelin run as root, which comes with a whole set of other problems). I suggest you read in the impersonation command, if there is one, and use that to send of the kill -SIGINT <pid>.

@astroshim
Copy link
Contributor Author

@FireArrow Note has permission in the zeppelin so if user doesn't have write permission on that note, user can't cancel job.

@FireArrow
Copy link
Contributor

FireArrow commented Feb 24, 2017

@astroshim That's not what I'm talking about. Assume the following:
Zeppelin run as user "zeppelin"
In zeppelin-env.sh: export ZEPPELIN_IMPERSONATE_CMD='sudo -H -i -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '
User zeppelin have permission to execute paswordless sudo to local users allowed to log into Zeppelin

Interpreters will with this run as the user starting the process, and the zeppelin user will not have permission to send any signal to them.
My suggestion is to execute the kill with $ZEPPELIN_IMPERSONATE_CMD to make sure the kill is executed in the same way as the interpreter was started in the first place.

@astroshim
Copy link
Contributor Author

@FireArrow Yea, right. I understood. Let me fix it.
Thanks.

@astroshim
Copy link
Contributor Author

@FireArrow I think it seems not can be possible to know if impersonation mode or not in Interpreter side(I couldn't find it) and checking cancel paragraph for impersonation is also not good idea (Interpreter sever can be launched itself not the same system with ZeppelinServer)
So I think that the canceling job can run only by user who launched interpreter process can be one option. and In this way, the other user can't stop the job in shared mode.
What do you think?

@FireArrow
Copy link
Contributor

I have basically only set up zeppelin in isolated mode with impersonation, so I'm not sure what is the "expected" behavior is outside of that. Someone else will have to pitch in here.

@astroshim
Copy link
Contributor Author

@Leemoonsoo @zjffdu @felixcheung @karuppayya Could you guys help to review about the impersonation mode?

@zjffdu
Copy link
Contributor

zjffdu commented Feb 28, 2017

Even in the impersonation mode, it is the process owner to interrupt the process. Here it is the interpreter process owner interrupt the process instead of the zeppelin server owner. But I can not verify it in impersonation mode. It seems the impersonation mode is broken. I got the following error, can anyone else verify the impersonation mode ?

ERROR [2017-02-28 09:20:18,882] ({pool-2-thread-2} Job.java[run]:194) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: org.apache.hadoop.security.authorize.AuthorizationException: User: user1 is not allowed to impersonate user1

@astroshim
Copy link
Contributor Author

@zjffdu I tested impersonation mode followed http://zeppelin.apache.org/docs/0.7.0/manual/userimpersonation.html and worked well.
BTW for now anybody can interrupt to interpreter process. and that is @FireArrow's worrying. Am i right?

@zjffdu
Copy link
Contributor

zjffdu commented Feb 28, 2017

User zeppelin have permission to execute paswordless sudo to local users allowed to log into Zeppelin
Interpreters will with this run as the user starting the process, and the zeppelin user will not have permission to send any signal to them.

I think @FireArrow 's concern is that zeppelin user can kill the process. Actually it is not the user zeppelin send the signal, it is the interpreter process owner. So I don't think there's any permission issue here. @FireArrow Please help confirm.

@astroshim
Copy link
Contributor Author

@FireArrow ping.

@FireArrow
Copy link
Contributor

Terribly sorry for late answer! Not sure why I didn't get any notifications about your answers 😕
@zjffdu I haven't looked into who actually sends the signal, but if it is as you say, with the user owning the interpreter process sending the signal, it should be fine.
My worry was that it was sent by the user running the server, and that if that user isn't root (which would be a terrible idea for other reasons) the OS wouldn't allow sending signals to another users process.
And it is tricky to set up impersonation, but as far as I know it is not broken. I only have my production environment with this set up right now. I suggest not using spark to test with. The sudo approach clashes with the spark impersonation, so if you really want to try that you have to set export ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER=false, and find a way to make sure the user starting the interpreter has a kerberos ticket (if you need that)

@astroshim
Copy link
Contributor Author

@FireArrow Thank you for confirm your idea and thank @zjffdu help to review!

@jongyoul
Copy link
Member

Will merge it

@asfgit asfgit closed this in 9f22db9 Mar 14, 2017
asfgit pushed a commit that referenced this pull request Mar 14, 2017
…rpreter.

### What is this PR for?
If following code runs with Pyspark Interpreter, there is no way to cancel except Zeppelin Server restart.
```
%spark.pyspark
import time

while True:
    time.sleep(1)
    print("running..")
```

### What type of PR is it?
Bug Fix | Improvement

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-2075

### How should this be tested?
Run above code with Pyspark Interpreter and try to cancel.

### Screenshots (if appropriate)
- before
![pyspark before](https://cloud.githubusercontent.com/assets/3348133/22696141/615c1206-ed90-11e6-9bbb-339ecdec73fc.gif)

- after
![pyspark after](https://cloud.githubusercontent.com/assets/3348133/22696168/70899172-ed90-11e6-99e1-342eb4094b2c.gif)

### Questions:
* Does the licenses files need update? no
* Is there breaking changes for older versions? no
* Does this needs documentation? no

Author: astroshim <hsshim@zepl.com>

Closes #1985 from astroshim/ZEPPELIN-2075 and squashes the following commits:

84bf09a [astroshim] fix testcase
bc12eaa [astroshim] pass pid to java
b60d89a [astroshim] Merge branch 'master' into ZEPPELIN-2075
f26eacf [astroshim] add test-case for canceling.
c0cac4e [astroshim] fix logging
678c183 [astroshim] remove signal handler
65d8cc6 [astroshim] init python pid variable
6731e56 [astroshim] add signal to cancel job

(cherry picked from commit 9f22db9)
Signed-off-by: Jongyoul Lee <jongyoul@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants