
Fix hanging Ctrl+C on S3 downloads #673

Merged · jamesls merged 4 commits into aws:develop from s3-hang on Feb 25, 2014

Conversation

jamesls (Member) commented on Feb 24, 2014

This is a regression that was introduced when the IO queue/thread
was introduced (#619).

The current shutdown order of the threads is:

  1. Queue end sentinel for IO thread.
  2. Queue end sentinel for result thread.
  3. Queue end sentinel for worker threads.
  4. .join() threads in this order:
    [io_thread, result_thread [, worker threads]]

Though the actual thread shutdown order is non-deterministic,
it's fairly common for the threads to shut down in roughly the above
order. This means that the IO thread will generally shut down before
all the worker threads have shut down.

However, the download tasks can still be enqueueing writes to the
IO queue. If the IO thread shuts down, there's nothing consuming
writes on the other end of the queue. Given that the queue is
bounded in maxsize, .put() calls to the queue will block until space
becomes available. This will never happen if the IO thread has already
shut down.
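
A minimal sketch (not the actual aws-cli code) of how that deadlock plays out with a bounded queue whose consumer has already exited:

```python
import queue
import threading

io_queue = queue.Queue(maxsize=2)   # bounded, like the real IO queue
SHUTDOWN = object()                 # end sentinel

def io_thread():
    # The consumer exits as soon as it sees the sentinel.
    while True:
        if io_queue.get() is SHUTDOWN:
            return

t = threading.Thread(target=io_thread)
t.start()
io_queue.put(SHUTDOWN)   # sentinel queued first, so the IO thread shuts down
t.join()

# A worker still downloading keeps enqueueing writes; with nothing draining
# the queue, the third .put() below would block forever.
io_queue.put(('chunk', 1))
io_queue.put(('chunk', 2))
# io_queue.put(('chunk', 3))   # deadlock: queue is full and has no consumer
```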

The fix here is to ensure that the IO thread is always the last thing
to shut down. This means any remaining IO requests will be executed
before the IO thread shuts down. This will prevent deadlock.
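
Sketched out, the shutdown ordering the fix describes looks roughly like this (names are illustrative, not the actual aws-cli attributes):

```python
def shutdown(worker_queue, worker_threads, result_queue, result_thread,
             io_queue, io_thread, sentinel):
    # 1. Stop the producers first: worker threads, then the result thread.
    for _ in worker_threads:
        worker_queue.put(sentinel)
    for t in worker_threads:
        t.join()
    result_queue.put(sentinel)
    result_thread.join()
    # 2. Only now signal the IO thread.  Any writes still sitting in the
    #    queue are executed before it exits, so no .put() can block on a
    #    consumer that has already gone away.
    io_queue.put(sentinel)
    io_thread.join()
```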

I've added unit tests demonstrating this issue. I've also added an
integration test that actually sends a SIGINT to the process
and verifies it exits in a timely manner, so we can ensure we
don't regress on this again.
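
A rough sketch of what such an integration test can look like (bucket, key, and timings are illustrative, not the actual test): start an `aws s3 cp` download, send it a SIGINT, and assert that it exits well before a generous deadline.

```python
import os
import signal
import subprocess
import time

# Kick off a download large enough to still be in flight when we interrupt it.
process = subprocess.Popen(
    ['aws', 's3', 'cp', 's3://mybucket/largefile', 'largefile'])
time.sleep(3)                        # let the transfer get going
os.kill(process.pid, signal.SIGINT)  # same signal as Ctrl+C

deadline = time.time() + 30
while time.time() < deadline:
    if process.poll() is not None:
        break                        # process exited after SIGINT
    time.sleep(0.1)
else:
    raise AssertionError("process did not exit within 30s of SIGINT")
```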

Note: some unit/integ tests needed updating because they were
using .call() multiple times.

Fixes #650
Fixes #657

while time.time() < deadline:
    rc = process.poll()
    if rc is not None:
        break
Contributor (commenting on the diff excerpt above):
Does it make sense to put a short time.sleep here? How long does this test typically take to run? Up to 30 seconds seems like a long time to busy loop.
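
One way to address that (illustrative only, not part of this PR): sleep briefly between polls so the test isn't spinning at full CPU while waiting out the deadline.

```python
while time.time() < deadline:
    rc = process.poll()
    if rc is not None:
        break
    time.sleep(0.1)   # short pause instead of a busy loop
```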

@danielgtaylor (Contributor)

Great find and fix. LGTM 🚢-it.

@toastdriven (Contributor)

LGTM as well.

Fallout from removal of .done and .interrupt.
It's possible that between the isdir() check and
the makedirs() call, another thread has come along
and created the directory.

We also need to make sure that any uncaught exception will
cause the download to be cancelled.
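
A hedged sketch of handling that race (illustrative, not the exact aws-cli change): tolerate an "already exists" error from makedirs() instead of assuming the isdir() check is still valid.

```python
import errno
import os

def ensure_dir_exists(dirname):
    if not os.path.isdir(dirname):
        try:
            os.makedirs(dirname)
        except OSError as e:
            # Another thread created the directory after our isdir() check.
            if e.errno != errno.EEXIST:
                raise
```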
@jamesls (Member, Author) commented on Feb 25, 2014

I've added some additional fixes. @toastdriven, mind taking a look?

@toastdriven (Contributor)

Still LGTM.

@jamesls jamesls merged commit d2cb65c into aws:develop Feb 25, 2014
@jamesls jamesls deleted the s3-hang branch June 23, 2014 18:20
Successfully merging this pull request may close these issues.

sync and cp hangs with a large amount of files
Breaking (ctrl+c) a S3 sync fails to clean up
3 participants