Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't crash when Channel goes inactive in read triggered by write error #594

Merged
merged 3 commits into from Aug 27, 2018

Conversation

weissi
Copy link
Member

@weissi weissi commented Aug 23, 2018

Motivation:

Previously we asserted that a Channel cannot go inactive after calling
out for a read that was triggered by (draining the receive buffer after)
a write error.

This assertion was put in place to guard against readComplete events
sent on inactive channels. It did that job just fine but crashes aren't
great so we now conditionally fire the readComplete event if the
Channel stays active.

Modifications:

make the readComplete event firing conditional

Result:

fewer crashes, more happy faces

@weissi weissi requested review from Lukasa and normanmaurer and removed request for Lukasa August 23, 2018 15:29
var error: ChannelError {
switch self {
case .all:
return .alreadyClosed
return .ioOnClosedChannel
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was just the wrong error here, .alreadyClosed means close on previously closed channel. For all others, .ioOnClosedChannel is right

assert(self.lifecycleManager.isActive)
pipeline.fireChannelReadComplete()
if self.lifecycleManager.isActive {
pipeline.fireChannelReadComplete()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No race here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is thread-safe.

I think a more important question is why we fire readComplete more than once. I feel like we shouldn't be doing that: we should probably move the fireChannelReadComplete outside the while loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lukasa I have no idea either why we're doing that but left it because it was that way. But let me just change it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vlm thread-safety is fine here because we're on the Channel's event loop. So (unless we call out), the value of self.lifecycleManager.isActive can't change. So if it's still active after readFromSocket returns, it'll remain active until we return or call out again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lukasa done

Motivation:

Previously we asserted that a Channel cannot go inactive after calling
out for a read that was triggered by (draining the receive buffer after)
a write error.

This assertion was put in place to guard against `readComplete` events
sent on inactive channels. It did that job just fine but crashes aren't
great so we now conditionally fire the `readComplete` event if the
Channel stays active.

Modifications:

make the readComplete event firing conditional

Result:

fewer crashes, more happy faces
@weissi weissi force-pushed the jw-chan-inactive-write-error branch from 271b6fd to 57eb1f4 Compare August 23, 2018 17:37
@weissi weissi added the patch-version-bump-only For PRs that when merged will only cause a bump of the patch version, ie. 1.0.x -> 1.0.(x+1) label Aug 23, 2018
Copy link
Member

@normanmaurer normanmaurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@weissi
Copy link
Member Author

weissi commented Aug 24, 2018

CI seriously unhappy: https://ci.swiftnio.io/job/swift-nio-master-410-prb/519/console

11:46:21 hudson.plugins.git.GitException: Command "git clean -fdx" returned status code 1:
11:46:21 stdout: Skipping repository .build/checkouts/swift-nio-zlib-support.git--8552607645869553644
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-914925017792929693
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git--5201548940310384603
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-7278503767583922314
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-1529492468280462262
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-2261880114035040650
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git--3716276613919170701
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-9184634203830310054
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-5975790777597269952
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-9137596717834041680
[...]
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git-7681731979200588851
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git--6741020296262781494
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git--388884417884296308
11:46:21 Skipping repository .build/checkouts/swift-nio-zlib-support.git--1868812526729017302
11:46:21 Removing .build/debug
11:46:21 Removing .build/junit-sh-tests.xml
11:46:21 Removing .build/debug.yaml
11:46:21 Removing .build/build.db
11:46:21 Removing .build/dependencies-state.json
11:46:21 Removing .build/regenerate-token
11:46:21 Removing Package.resolved
11:46:21 
11:46:21 stderr: warning: failed to remove .build/repositories
11:46:21 warning: failed to remove .build/x86_64-unknown-linux
11:46:21 
11:46:21 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1996)
11:46:21 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1964)
11:46:21 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1960)
11:46:21 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1597)
11:46:21 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1609)
11:46:21 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.clean(CliGitAPIImpl.java:787)
11:46:21 	at hudson.plugins.git.GitAPI.clean(GitAPI.java:311)
11:46:21 	at hudson.plugins.git.extensions.impl.CleanCheckout.onCheckoutCompleted(CleanCheckout.java:30)
11:46:21 	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1238)
11:46:21 	at hudson.scm.SCM.checkout(SCM.java:504)
11:46:21 	at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
11:46:21 	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
11:46:21 	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
11:46:21 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
11:46:21 	at hudson.model.Run.execute(Run.java:1727)
11:46:21 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
11:46:21 	at hudson.model.ResourceController.execute(ResourceController.java:97)
11:46:21 	at hudson.model.Executor.run(Executor.java:429)
11:46:21 [Set GitHub commit status (universal)] ERROR on repos [] (sha:eee3aa3) with context:swift-nio-master-410-prb

@weissi
Copy link
Member Author

weissi commented Aug 24, 2018

@tomerd ^^^

@weissi
Copy link
Member Author

weissi commented Aug 24, 2018

@swift-nio-bot test this please

@tomerd
Copy link
Member

tomerd commented Aug 24, 2018

yea looks like ci broke, will look into it

@weissi
Copy link
Member Author

weissi commented Aug 24, 2018

thanks @tomerd

@weissi weissi added this to the 1.9.3 milestone Aug 24, 2018
@tomerd
Copy link
Member

tomerd commented Aug 24, 2018

ugh, the ebs volume got detached. working to fix

@tomerd
Copy link
Member

tomerd commented Aug 24, 2018

@swift-nio-bot test this please

@tomerd
Copy link
Member

tomerd commented Aug 24, 2018

@weissi should be okay now, monitoring too

@weissi
Copy link
Member Author

weissi commented Aug 24, 2018

awesome, thanks so much @tomerd

@weissi
Copy link
Member Author

weissi commented Aug 27, 2018

also hit the confused kernel

18:46:36 ++ lsof -a -d 0-1024 -p 5196
18:46:36 COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
18:46:36 NIOHTTP1S 5196 root    0r   CHR    1,3      0t0   88849 /dev/null
18:46:36 NIOHTTP1S 5196 root    1w   REG 202,80    10106 2101813 /tmp/.swift-nio-http1-server-sh-tests_pB8Eyp/test.out_VT1Hgb
18:46:36 NIOHTTP1S 5196 root    2w   REG 202,80    10106 2101813 /tmp/.swift-nio-http1-server-sh-tests_pB8Eyp/test.out_VT1Hgb
18:46:36 NIOHTTP1S 5196 root    3w  FIFO    0,8      0t0   88578 pipe
18:46:36 NIOHTTP1S 5196 root    4w  FIFO    0,8      0t0   88579 pipe
18:46:36 NIOHTTP1S 5196 root    5u  0000    0,9        0    5986 anon_inode
18:46:36 NIOHTTP1S 5196 root    6u  0000    0,9        0    5986 anon_inode
18:46:36 NIOHTTP1S 5196 root    7u  0000    0,9        0    5986 anon_inode
18:46:36 NIOHTTP1S 5196 root    8u  0000    0,9        0    5986 anon_inode
18:46:36 NIOHTTP1S 5196 root    9u  0000    0,9        0    5986 anon_inode
18:46:36 NIOHTTP1S 5196 root   10u  0000    0,9        0    5986 anon_inode
18:46:36 NIOHTTP1S 5196 root   11u  IPv4 202998      0t0     TCP localhost:36302 (LISTEN)
18:46:36 NIOHTTP1S 5196 root   12u  sock    0,7      0t0  202631 can't identify protocol
18:46:36 ++ do_netstat inet
18:46:36 ++ pf=inet
18:46:36 ++ netstat_options=()
18:46:36 ++ case "$(uname -s)" in
18:46:36 +++ uname -s
18:46:36 ++ netstat_options+=("-A" "$pf" -p)
18:46:36 ++ netstat -an -A inet -p
18:46:36 Active Internet connections (servers and established)
18:46:36 Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
18:46:36 tcp        0      0 127.0.0.1:36302         0.0.0.0:*               LISTEN      5196/NIOHTTP1Server
18:46:36 tcp        0      0 127.0.0.11:46673        0.0.0.0:*               LISTEN      -               
18:46:36 tcp        0      0 127.0.0.1:37096         127.0.0.1:43894         TIME_WAIT   -               
18:46:36 tcp        0      0 127.0.0.1:37092         127.0.0.1:43894         TIME_WAIT   -               
18:46:36 tcp        0      0 127.0.0.1:47122         127.0.0.1:36302         TIME_WAIT   -               
18:46:36 tcp        0      0 127.0.0.1:47124         127.0.0.1:36302         TIME_WAIT   -               
18:46:36 tcp        0      0 127.0.0.1:37094         127.0.0.1:43894         TIME_WAIT   -               
18:46:36 udp        0      0 127.0.0.11:55168        0.0.0.0:*                           -               
18:46:36 ++ local open_fds
18:46:36 +++ server_lsof 5196

which is #563

@weissi
Copy link
Member Author

weissi commented Aug 27, 2018

@swift-nio-bot test this please

@weissi weissi merged commit df764fe into apple:master Aug 27, 2018
@weissi weissi deleted the jw-chan-inactive-write-error branch August 27, 2018 10:50
weissi added a commit that referenced this pull request Aug 29, 2018
…or (#594)

Motivation:

Previously we asserted that a Channel cannot go inactive after calling
out for a read that was triggered by (draining the receive buffer after)
a write error.

This assertion was put in place to guard against `readComplete` events
sent on inactive channels. It did that job just fine but crashes aren't
great so we now conditionally fire the `readComplete` event if the
Channel stays active.

Modifications:

make the readComplete event firing conditional

Result:

fewer crashes, more happy faces
Motivation:

Explain here the context, and why you're making that change.
What is the problem you're trying to solve.

Modifications:

Describe the modifications you've done.

Result:

After your change, what will change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
patch-version-bump-only For PRs that when merged will only cause a bump of the patch version, ie. 1.0.x -> 1.0.(x+1)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants