[Bug] problems of RssMapOutputCollector #715

Closed
zhaobing001 opened this issue Mar 13, 2023 · 4 comments

@zhaobing001
Contributor

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

The connection between the client and the shuffle server is not closed, so the map task container does not exit.
1. The map container does not exit while the reduce tasks are running.
2. When the cluster's resources are exhausted and some map tasks cannot be allocated, the AM waits for a one-minute timeout, kills the already completed map containers, and allocates their resources to the remaining map tasks.

The error log looks like:
2023-03-11 02:42:16,826 INFO [Ping Checker] org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:attempt_1676901654399_1531374_m_000190_0 Timed out after 60 secs

Closing the shuffle client in the close method of RssMapOutputCollector solves this problem.
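A minimal sketch of the idea (assuming the collector keeps its client in a field named `shuffleClient`; the real field name, flushing logic, and import path in RssMapOutputCollector may differ):

```java
import org.apache.uniffle.client.api.ShuffleWriteClient;

// Hypothetical simplification of RssMapOutputCollector#close():
// release the shuffle write client so its connections and internal
// threads are shut down and the map task JVM can exit.
public class RssMapOutputCollector<K, V> {

  private ShuffleWriteClient shuffleClient; // assumed field name

  public void close() {
    // ... flush buffered records and report shuffle results first ...
    if (shuffleClient != null) {
      shuffleClient.close();
      shuffleClient = null;
    }
  }
}
```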

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@jerqi
Contributor

jerqi commented Mar 13, 2023

Is it an occasional case?

@zhaobing001
Contributor Author

It is not occasional; it happens every time.

@jerqi
Contributor

jerqi commented Mar 13, 2023

Could you raise a PR to fix this issue?

jerqi pushed a commit that referenced this issue May 19, 2023
…not closed (#882)

### What changes were proposed in this pull request?
The container does not exit because the shuffle client is not closed; this PR closes the ShuffleWriteClient when the task completes.

### Why are the changes needed?

For #715 

1. The process does not exit after the map task or reduce task finishes. The reason is that ShuffleWriteClient holds a thread pool that is not shut down when the task completes, so closing ShuffleWriteClient solves the problem (see the Java sketch below the commit message).

2. How to reproduce:
Start a small cluster and submit an MR job that requests more resources than the cluster has in total.
All tasks finish their work but their processes do not exit; after the 60-second timeout (mapreduce.task.exit.timeout) the ApplicationMaster asks the NodeManager to kill the corresponding containers.

The NodeManager logs are as follows:

```
2023-03-12 13:56:45,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1676901654399_1653119_m_000070_0: [2023-03-12 13:56:44.909]Container killed by the ApplicationMaster.
[2023-03-12 13:56:44.921]Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 45556 as user tc_infra for container container_e304_1676901654399_1653119_01_000072, result=success
[2023-03-12 13:56:44.985]Container killed on request. Exit code is 143
[2023-03-12 13:56:45.403]Container exited with a non-zero exit code 143.
```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.

Co-authored-by: zhaobing <zhaobing@zhihu.com>
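
As background for item 1 in the commit message above, a small self-contained Java snippet (not Uniffle code, purely an illustration) shows why a thread pool that is never shut down keeps a JVM alive after the work is done, which is the same mechanism that keeps the task container from exiting:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPoolHangDemo {
  public static void main(String[] args) {
    // Executors.newFixedThreadPool creates non-daemon worker threads,
    // so the JVM will not exit while the pool is still alive.
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(() -> System.out.println("work done"));

    // Comment out the next line and the process hangs indefinitely,
    // much like a map task holding an unclosed ShuffleWriteClient.
    pool.shutdown();
  }
}
```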
@jerqi
Contributor

jerqi commented May 19, 2023

closed by #882

@jerqi jerqi closed this as completed May 19, 2023