[AMBARI-22888] Cancel operation during package deployment causing repository manager to be broken (dgrinenko) #244

Merged

@hapylestat hapylestat commented Feb 1, 2018

Ambari 2.x Implementation

What changes were proposed in this pull request?

Reworks the way we kill a process and its children. Instead of using Popen and bash scriptlets, which are hard to maintain, this switches to native Python os.kill and uses the /proc filesystem to extract process information (the same way the ps and pgrep utilities do).

Additionally, this implementation makes it possible to add several required restrictions to the tree killer:

  • an exclusion list, which currently contains package managers (the main goal of this task)
  • exclusion of the caller's PID, so that it is not killed by accident
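
The /proc-based approach described above can be sketched roughly as follows. This is only an illustration in Python 3, not the code from this patch: the helper names (`read_proc_table`, `collect_descendants`), the `proc_root` parameter, and the choice to read the parent PID from `/proc/<pid>/stat` are assumptions made for the example.

```python
import os


def read_proc_table(proc_root="/proc"):
    """Return {pid: (ppid, comm)} for every readable process entry."""
    table = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Format: "pid (comm) state ppid ...". comm may itself contain
        # spaces or parentheses, so cut at the first "(" and the last ")".
        start, end = stat.find("("), stat.rfind(")")
        if start == -1 or end == -1:
            continue
        fields = stat[end + 1:].split()  # [state, ppid, ...]
        if len(fields) < 2:
            continue
        table[int(entry)] = (int(fields[1]), stat[start + 1:end])
    return table


def collect_descendants(base_pid, table):
    """Return [(pid, comm), ...] for all descendants of base_pid."""
    children_of = {}
    for pid, (ppid, _comm) in table.items():
        children_of.setdefault(ppid, []).append(pid)
    found, pending = [], [base_pid]
    while pending:
        current = pending.pop()
        for child in children_of.get(current, []):
            found.append((child, table[child][1]))
            pending.append(child)
    return found
```

Taking a snapshot of the whole table first and walking it afterwards keeps the tree traversal independent of processes appearing or disappearing during the scan; the snapshot can still be stale by the time signals are sent.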

How was this patch tested?

Tested with unit tests and on a selected cluster node.

[09:32:47] :	 [Step 2/2] [INFO] ------------------------------------------------------------------------
[09:32:47] :	 [Step 2/2] [INFO] Reactor Summary:
[09:32:47] :	 [Step 2/2] [INFO] 
[09:32:47] :	 [Step 2/2] [INFO] Ambari Views ....................................... SUCCESS [  1.936 s]
[09:32:47] :	 [Step 2/2] [INFO] utility ............................................ SUCCESS [  1.935 s]
[09:32:47] :	 [Step 2/2] [INFO] Ambari Metrics Common .............................. SUCCESS [  3.578 s]
[09:32:47] :	 [Step 2/2] [INFO] Ambari Server ...................................... SUCCESS [22:30 min]
[09:32:47] :	 [Step 2/2] [INFO] Ambari Agent ....................................... SUCCESS [ 35.743 s]
[09:32:47] :	 [Step 2/2] [INFO] ------------------------------------------------------------------------
[09:32:47] :	 [Step 2/2] [INFO] BUILD SUCCESS
[09:32:47] :	 [Step 2/2] [INFO] ------------------------------------------------------------------------
[09:32:47] :	 [Step 2/2] [INFO] Total time: 22:36 min (Wall Clock)
[09:32:47] :	 [Step 2/2] [INFO] Finished at: 2018-02-01T09:32:47Z
[09:32:47] :	 [Step 2/2] [INFO] Final Memory: 87M/2743M
[09:32:47] :	 [Step 2/2] [INFO] ------------------------------------------------------------------------

asfgit commented Feb 1, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/Ambari-Github-PullRequest-Builder/381/
Test PASSed.

pids_to_kill = sorted(all_chield_pids, reverse=True)
for pid in pids_to_kill:
  try:
    if is_pid_life(pid):

is_pid_life
maybe is_pid_alive?
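
For context, a liveness check of this kind is commonly written by sending signal 0. The sketch below is an assumption about what such a helper could look like (written for Python 3), not the implementation in this patch.

```python
import errno
import os


def is_pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only performs existence/permission checks
    except OSError as e:
        # EPERM means the process exists but belongs to another user;
        # any other error (typically ESRCH) means there is no such process.
        return e.errno == errno.EPERM
    return True
```

Note that a zombie (exited but not yet reaped) process still counts as alive under this check.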

@ghost ghost left a comment

Nice approach to solve issues with broken YUM transactions

@ncole ncole left a comment

I also like the idea here. Will this work with non-root agents or processes that are started with non-root access (like killing namenodes and other cluster processes)?

comm_path_pattern = "/proc/{0}/comm"
cmdline_path_pattern = "/proc/{0}/cmdline"

def read_childrens(pid):

def read_children(pid)

  except Exception, e:
    logger.warn("Failed to kill PID %d" % (pid))
    logger.warn("Reported error: " + repr(e))

def get_all_childrens(base_pid):

def get_all_children(base_pid)

@hapylestat hapylestat Feb 1, 2018

@ncole No, it behaves the same as the previous implementation (the Popen call didn't use ambari_sudo or the sudo command).

That is a separate problem, which would need to be solved via sudo.py...

def kill_process_with_children(parent_pid):
  exception_list = ["apt-get", "apt", "yum", "zypper", "zypp"]
  signals_to_post = [signal.SIGTERM, signal.SIGKILL]
  all_chield_pids = [item[0] for item in get_all_childrens(parent_pid) if item[1].lower() not in exception_list and item[0] != os.getpid()]

Can we ever end up with a never-dying yum here?

@hapylestat hapylestat Feb 1, 2018

It is rare for yum to hang. More likely, it is slowly downloading a package while holding the lock. Ambari is able to check the package manager lock and retry the package installation over time. However, if a serious problem causes yum to hang, manual debugging by the user is needed anyway.

That is a lesser evil than killing it in the middle of its work and breaking the whole system.
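
The filtering discussed in this thread can be illustrated in isolation. The sketch below (Python 3) reuses the exclusion list and the reverse PID ordering shown in the diff; the function name `select_pids_to_kill` and its parameters are hypothetical and chosen only for this example.

```python
import os

# Package managers that must never be interrupted mid-transaction.
PACKAGE_MANAGERS = ("apt-get", "apt", "yum", "zypper", "zypp")


def select_pids_to_kill(children, own_pid=None, exclusions=PACKAGE_MANAGERS):
    """Pick the PIDs that are safe to signal.

    children is an iterable of (pid, comm) pairs. Excluded commands and
    the caller's own PID are dropped; the rest come back highest PID first.
    """
    if own_pid is None:
        own_pid = os.getpid()
    pids = [pid for pid, comm in children
            if comm.lower() not in exclusions and pid != own_pid]
    return sorted(pids, reverse=True)
```

For example, given children `[(101, "bash"), (102, "yum"), (103, "python")]` and `own_pid=103`, only PID 101 is selected: yum is left to finish and the caller never signals itself.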

@hapylestat

Ported PR #249 (for trunk) to branch-2.6 to keep the code bases identical. It now uses sudo.kill instead of os.kill, which picks the proper implementation depending on the agent's user level.

@ncole, @dlysnichenko please re-check the code; the implementation is slightly different because it follows the changes made in trunk.
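
The general idea of choosing a kill implementation by user level can be sketched as below. This is not Ambari's sudo.py: the function `make_kill`, its `euid` and `run` parameters, and the use of the `sudo kill` command line are all assumptions made for illustration (Python 3).

```python
import os
import signal
import subprocess


def make_kill(euid=None, run=subprocess.check_call):
    """Return a kill(pid, sig) callable suited to the effective user ID."""
    if euid is None:
        euid = os.geteuid()
    if euid == 0:
        return os.kill  # root can signal any process directly

    def kill_via_sudo(pid, sig):
        # Non-root: delegate to the kill binary through sudo,
        # e.g. ["sudo", "kill", "-15", "4321"] for SIGTERM.
        run(["sudo", "kill", "-%d" % int(sig), str(pid)])

    return kill_via_sudo
```

A caller would obtain the function once, for instance `kill = make_kill()`, and then call `kill(pid, signal.SIGTERM)` without caring which branch was chosen.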

@hapylestat

retest this please

asfgit commented Feb 1, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/Ambari-Github-PullRequest-Builder/396/
Test PASSed.

@hapylestat hapylestat merged commit 9c346be into apache:branch-2.6 Feb 1, 2018
@hapylestat hapylestat deleted the in-work/branch-2.6/AMBARI-22888 branch February 1, 2018 18:13
Labels
ambari-commons Ambari Common & Resource manager python libraries