os/bluestore: fix use after free race with aio_wait #14956

Merged

merged 5 commits into ceph:master from liewegas:wip-bdev-ioc-lock May 8, 2017

Conversation

3 participants
liewegas (Member) commented May 4, 2017

No description provided.

liewegas added some commits May 3, 2017

os/bluestore/KernelDevice: aio_wake/wait and completion callback are exclusive

Codify that these are exclusive.  This means we don't waste time on
aio_wake in the common case of ioc's with callbacks.

Signed-off-by: Sage Weil <sage@redhat.com>
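
A minimal sketch of the exclusivity this commit codifies (hypothetical names, not the merged code): a completion either invokes the registered callback or wakes a synchronous waiter, never both, so the common callback path can skip the wake entirely.

struct IOContextSketch {
  void *priv = nullptr;   // non-null => a completion callback is registered
  void aio_wake() { /* notify a thread blocked in aio_wait() */ }
};

using callback_t = void (*)(void *cb_priv, void *ioc_priv);

void on_aio_complete(IOContextSketch *ioc,
                     callback_t aio_callback, void *aio_callback_priv) {
  if (ioc->priv) {
    // callback case: nobody blocks in aio_wait(), so no wake is needed
    aio_callback(aio_callback_priv, ioc->priv);
  } else {
    // synchronous case: no callback registered, wake the waiter instead
    ioc->aio_wake();
  }
}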
os/bluestore/NVMEDevice: remove unnecessary aio_wake
I'm not sure what this is for; I can't find a corresponding waiter.

Signed-off-by: Sage Weil <sage@redhat.com>
majianpeng (Member) commented May 5, 2017

The previous comment had the wrong format, so resending it with the right format:

diff --git a/src/os/bluestore/BlockDevice.h b/src/os/bluestore/BlockDevice.h
index 157add6..bc67905 100644
--- a/src/os/bluestore/BlockDevice.h
+++ b/src/os/bluestore/BlockDevice.h
@@ -64,6 +64,7 @@ struct IOContext {
     if (num_waiting.load()) {
       std::lock_guard<std::mutex> l(lock);
       cond.notify_all();
+      num_running--;
     }
   }
 };
diff --git a/src/os/bluestore/KernelDevice.cc b/src/os/bluestore/KernelDevice.cc
index 77eaf36..c8dbade 100644
--- a/src/os/bluestore/KernelDevice.cc
+++ b/src/os/bluestore/KernelDevice.cc
@@ -365,19 +365,22 @@ void KernelDevice::_aio_thread()
                 << " aios left" << dendl;
        assert(r >= 0);
 
-       int left = --ioc->num_running;
+       //int left = --ioc->num_running;
        // NOTE: once num_running is decremented we can no longer
        // trust aio[] values; they my be freed (e.g., by BlueFS::_fsync)
-       if (left == 0) {
+       if (ioc->num_running == 1) {
          // check waiting count before doing callback (which may
          // destroy this ioc).  and avoid ref to ioc after aio_wake()
          // in case that triggers destruction.
          void *priv = ioc->priv;
-         ioc->aio_wake();
+         //ioc->aio_wake();
          if (priv) {
+           ioc->num_running--;
            aio_callback(aio_callback_priv, priv);
-         }
-       }
+         } else
+           ioc->aio_wake();
+       } else
+         ioc->num_running--;
       }
     }
@yuyuyu101

Reasonable change.

liewegas (Member) commented May 5, 2017

I like @majianpeng's solution better; will fix this up!


liewegas (Member) commented May 5, 2017

@yuyuyu101 is it the case that there is only a single thread doing io_complete for NVMEDevice? (It won't race with itself?)


yuyuyu101 (Member) commented May 5, 2017

There are multiple queues, each with a corresponding thread; each io is hashed and appended to one queue.


liewegas (Member) commented May 5, 2017

Is that true for multiple ios on the same IOContext?

Basically we need to make sure that the ioc->num_running == 1 check in io_complete is safe. If there are multiple io_complete threads on the same IOContext, they can both see, say, 2, both take the decrement branch, and drive the counter to 0 with nobody calling aio_wake.
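
A compressed sketch of that hazard (hypothetical, distilled from the patch above): with a plain check-then-decrement, two completion threads can both read 2, both skip the wake branch, and nobody wakes the waiter. A single atomic fetch_sub lets exactly one thread observe the final transition; the separate use-after-free concern is what the locking in the commits below addresses.

#include <atomic>

struct IOContextSketch {
  std::atomic<int> num_running{2};  // two aios in flight on one IOContext
};

// Racy shape: check-then-act on the counter.  Both completion threads
// can read 2, take the else branch, and neither ever sees 1, so
// aio_wake() is never called.
void io_complete_racy(IOContextSketch *ioc) {
  if (ioc->num_running == 1) {
    // last aio: run the callback or wake the waiter
  } else {
    ioc->num_running--;
  }
}

// Safe shape: one atomic decrement; exactly one thread sees the old
// value 1 and knows it completed the last aio.
void io_complete_safe(IOContextSketch *ioc) {
  if (ioc->num_running.fetch_sub(1) == 1) {
    // last aio: run the callback or wake the waiter
  }
}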


yuyuyu101 (Member) commented May 5, 2017

In one sentence: NVMEDevice doesn't guarantee this. You can refer to https://github.com/ceph/ceph/blob/master/src/os/bluestore/NVMEDevice.cc#L1115; each io is appended to a queue by the caller's thread.

But in the current BlueStore usage, I think it's safe: BlueStore currently always calls aio_* on the same IOContext (without a priv argument) from one thread.


liewegas added some commits May 5, 2017

os/bluestore/KernelDevice: remove weird aio_wake from sync read()
I'm not entirely sure why we were ever signaling the cond here; the read
is synchronous!

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore/BlockDevice: fix racy aio_{wake,wait}() on IOContext
Thread 1 (_do_read)                 Thread 2 (_aio_thread)
queues aio
ioc->aio_wait()
  locks ioc->lock
  num_waiting++
                                    finishes aios
                                    ioc->aio_wake
                                      if (num_waiting)
                                        lock ioc->lock (blocks)
  num_running == 0, continue
  ioc->unlock
                                        ioc->lock taken
do_read destroys ioc
                                        use after free, may block forever

The last bit of the race may vary; the key thing is that thread 2 is
taking a lock that thread 1 can destroy; thread 2 doesn't have it pinned
in memory.

Fix this by simplifying aio_wake/aio_wait.  Since it is mutually
exclusive with a callback completion, we can avoid calling it at all when
a callback is present, and focus on keeping it simple.

Avoid use-after-free by making sure the last decrement happens under
the lock in the aio_wake/wait case.

Signed-off-by: Sage Weil <sage@redhat.com>
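
A minimal sketch of the scheme this message describes (not the merged code; names are illustrative): the waiter blocks on the same lock under which the waker performs the final decrement and notify, so by the time aio_wait() can observe zero and return, the waker has already released the lock and no longer touches the IOContext.

#include <condition_variable>
#include <mutex>

struct IOContextSketch {
  std::mutex lock;
  std::condition_variable cond;
  int num_running = 0;   // protected by lock in this sketch

  // Called by the submitting thread before it destroys the IOContext.
  void aio_wait() {
    std::unique_lock<std::mutex> l(lock);
    while (num_running > 0)
      cond.wait(l);
    // The decrement to zero happened under 'lock', so once we get here
    // the waker is already on its way out of aio_wake(); destroying the
    // IOContext after this returns is safe.
  }

  // Called by the completion thread when an aio finishes (the
  // no-callback case; with a callback, aio_wake() is skipped entirely).
  void aio_wake() {
    std::lock_guard<std::mutex> l(lock);
    if (--num_running == 0)
      cond.notify_all();
    // 'lock' is released as we return; the waiter cannot see zero and
    // free the IOContext while we are still holding it.
  }
};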
os/bluestore: remove useless IOContext::num_reading
If we are a synchronous read, we don't need this: we don't aio_wait for
sync reads.  If we are an aio_read, we are in the aio_running count anyway,
and there is also no purpose for this counter.

I'm a bit unsure about the NVME use of this counter; I switched it to use
num_running (pretty sure we aren't mixing reads and writes on a single
IOContext) *but* it might make more sense to switch to a private counter.

Signed-off-by: Sage Weil <sage@redhat.com>
liewegas (Member) commented May 8, 2017

@yuyuyu101 @majianpeng does the new version look ok?


majianpeng (Member) commented May 8, 2017

Approve this PR.


@yuyuyu101 yuyuyu101 merged commit 867b1d8 into ceph:master May 8, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

@liewegas liewegas deleted the liewegas:wip-bdev-ioc-lock branch May 8, 2017
