New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behavior invoked by moving stacks between threads #6

Closed
vorner opened this Issue Dec 23, 2017 · 18 comments

Comments

Projects
None yet
6 participants
@vorner

vorner commented Dec 23, 2017

A coroutine's stack can be moved from one thread to another. However, a stack may always contain things that are not Send, breaking things.

This is one example:

extern crate may;

use std::cell::RefCell;
use std::rc::Rc;
use std::thread::ThreadId;

use may::coroutine;
use coroutine::yield_now;

thread_local!(static ID: RefCell<Option<Rc<ThreadId>>> = RefCell::new(None));

fn main() {
    may::config().set_io_workers(60);
    may::config().set_workers(60);
    let h = coroutine::spawn(move || {
        let v = (0..10000)
            .map(|i| {
                coroutine::spawn(|| {
                    let handle = Rc::new(std::thread::current().id());
                    ID.with(|id| {
                        *id.borrow_mut() = Some(Rc::clone(&handle));
                    });
                    for _ in 0..10000 {
                        if *handle != std::thread::current().id() {
                            println!("Access to Rc content without a mutex from a different thread, {:?} vs {:?}",
                                     *handle, std::thread::current().id());
                        }
                        yield_now();
                    }
                })
            })
            .collect::<Vec<_>>();
        for i in v {
            i.join().unwrap();
        }
    });
    h.join().unwrap();
}

If something else was accessing the Rc (like making copis of it) from the original thread (which it could, because it is accessible in the thread local storage), it would be undefined behaviour ‒ both the coroutine, that moved, and the thing in that thread (possibly other coroutine) could be accessing the counters in the Rc at the same time, or the data inside, which could be for example a RefCell.

Now, suggesting not to use thread local storage doesn't solve anything, because:

  • I might not use thread local storage myself, but I can't certainly be expected to audit all the libraries I use not to use thread local storage. Even the standard library uses thread local storage internally.
  • There are things that are not Send for other reasons. One example might be Zero-MQ sockets, which explode the whole application if ever touched from a different thread then they were created in.

I believe this problem is fundamental to any attempt to move stacks between threads in Rust. Such thing just breaks the Rust contract.

So my only suggestion is to create the coroutine in one thread (it is possible to check nothing Send crosses a closure boundary, so the closure can be safely sent to another thread) and then pin it there to that thread.

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 24, 2017

I didn't see UB by running the following code.
you can ref this blog: Don’t visit thread local storage in coroutine context

#[macro_use]
extern crate may;

use std::cell::RefCell;
use std::rc::Rc;
use std::thread::ThreadId;

use may::coroutine;
use coroutine::yield_now;

coroutine_local!(static ID: RefCell<Option<Rc<ThreadId>>> = RefCell::new(None));

fn main() {
    may::config().set_io_workers(4).set_workers(4);
    let h = coroutine::spawn(move || {
        let v = (0..10000)
            .map(|i| {
                coroutine::spawn(|| {
                    let handle = Rc::new(std::thread::current().id());
                    ID.with(|id| {
                        *id.borrow_mut() = Some(Rc::clone(&handle));
                    });
                    for _ in 0..10000 {
                        if *handle != std::thread::current().id() {
                            println!("Access to Rc content without a mutex from a different thread, {:?} vs {:?}",
                                     *handle, std::thread::current().id());
                        }
                        yield_now();
                    }
                })
            })
            .collect::<Vec<_>>();
        for i in v {
            i.join().unwrap();
        }
    });
    h.join().unwrap();
}
@zonyitoo

This comment has been minimized.

zonyitoo commented Dec 24, 2017

Unfortunately, you cannot bypass the TLS problem in Rust. Because libstd uses thread_local every where.

My coio project implemented exactly the same scheduling strategy and APIs as may, and I have found this problem long ago. As far as I can tell, there is no way to bypass.

@vorner

This comment has been minimized.

vorner commented Dec 24, 2017

What do you mean, by „seeing“ UB?

No, this code doesn't contain UB itself, only points to a place where it could happen. But if I, for example, added ID.with(|id| Rc::clone(id.borrow())); Rc::clone(handle) into the inner loop, just above the conditional, that would already be UB. I'd would modifying the reference counts inside the Rc from multiple threads without synchronisation, possibly making it wrong and freeing the insides too early or something.

Note that UB can do anything. That includes pretending to work correctly when you watch. That's the tricky part about UB ‒ you don't have to see it, it can be lurk in the program and do arbitrarily bad things only from time to time (when the customer watches).

And I know I don't want to use TLS from coroutines. The problem is, a blog post is not the appropriate solution. TLS isn't the only problem, there are others ‒ system calls that are expected to happen on specific thread, FFI libraries, etc. You never know if a library you use might be using TLS and not telling you. You are not supposed to have to watch for these things in Rust if you're not using unsafe.

If your library exposes a „safe“ interface which allows your user to invoke UB by it, the library is broken and dangerous to use. So you can either ensure the UB can never happen (by eg. not moving coroutines between threads after they are created), or mark the spawn method as unsafe, because it's the exactly correct mechanism to tell the user of the library he has to check, that there's something not covered by the compiler/library.

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 24, 2017

@zonyitoo I see this problem now. I have some questions about the issue:

  1. what kind of std API internally using thread_local?
  2. Is it safe to finish calling the API just before yield for another scheduling?
  3. do you have any sensible code that can reproduce the issue?
@zonyitoo

This comment has been minimized.

zonyitoo commented Dec 24, 2017

  1. For instance, panic! and println!
  2. No. It is not safe. Because compiler don't know that you are going to switch this execution context to another thread, so compiler will cache the pointer of TLS in register without actually reload it from memory, which is safe in non-coroutine environment. Reloading from TLS every time you want to access it will have very poor performance.
  3. Well, it is simple, just call println! many many times, and you will see the crash. Because it is a UB, so I can't tell you when it happens, but it will happen. Issue

Also, because Rust is a system programming language, libraries that contains FFI calls to some existing C/C++ libraries dominated most of the Rust's world. You may can control all your libraries to be "thread_local free", but you cannot control all those external libraries in dependencies.

At least, you have no idea when libstd add another thread_local global state in future releases. And you have no way to control it.

I have worked on stackful coroutine with work-stealing scheduler for almost 2 years, and I have already found that this is a dead end. I have tried to

  1. contact one of the LLVM's maintainer to help with generating IRs that can "tell" compiler to re-fetch from TLS, but they refused.
  2. contact alex, the maintainer of Rust's standard library. And we also have many discussion about how libstd can support this kind of stackful coroutines library, and the answer is Rust's libstd won't help.

I am not damping your enthusiasm of making coroutine libraries, but I have to tell you that the way you are going is not very promising. Feel free to contact me, I think we will have lots of common topics.

@KiChjang

This comment has been minimized.

KiChjang commented Dec 24, 2017

@Xudong-Huang We've actually been discussing your crate and its issues in the Rust China community QQ group. We'd like to invite you to join our discussion, please add this QQ group 303838735.

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 24, 2017

@KiChjang thanks! will see you there.

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 24, 2017

I use following test and can't see any problem, maybe the test is too simple.
I use the library for sometime and didn't see problem when calling libstd API.

extern crate may;

use std::sync::Arc;
use may::sync::Semphore;

fn main() {
    let sem = Arc::new(Semphore::new(10000));
    may::config().set_workers(100);
    loop {
        sem.wait();
        let s = sem.clone();
        may::coroutine::spawn(move || ->() {
            s.post();
            may::coroutine::yield_now();
            panic!("i'm dying");
        });
    }
}
@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 24, 2017

Access TLS in coroutine is not safe unless:

Before schedule another coroutine the value of the TLS is a fixed one, so that schedule coroutine at any thread has the same TLS value, which is safe.

Just like calling panic, before you schedule another coroutine you must have catch the panic, and at that time the panic count(TLS) will be always 0.

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 26, 2017

I upload may-caveat doc. I think calling libstd API is safe within may coroutine. and tests like calling println!() and panic!() massively all passed.

The basic goal of MAY is support all libstd APIs be safely called in coroutine. I think this goal is accomplished.

I will close this issue. But if you find some libstd API that is not safely called in coroutine, welcome to reopen the issue.

@vorner @zonyitoo

@vorner

This comment has been minimized.

vorner commented Dec 26, 2017

I'd like to point out that this is not about just libstd (but I'll try to reproduce it with libstd).

Anyway, if you really want to go the way to tell user „don't do this“ and „do this“, that's fine, but you should make corresponding methods unsafe, to ensure they read and follow the instructions. That's the purpose of unsafe, it tells people they need to be careful. Unlike C++ people don't expect to have to watch out for things.

@osa1

This comment has been minimized.

osa1 commented Dec 26, 2017

@Xudong-Huang is it fair to say that we don't have to avoid TLS use as long as we run may with one thread only? Can we update the may_caveat document with this info?

Another question (sorry this is kind of unrelated) is, when calling libraries from a MAY coroutine, I have to make sure the library uses MAY's IO functions rather than std IO functions, correct (e.g. should use may::net::TcpStream instead of std::net::TcpStream) ?

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 26, 2017

@osa1

is it fair to say that we don't have to avoid TLS use as long as we run may with one thread only?

I'm afraid not, MAY schedule coroutines at least on one worker thread and one io_worker thread. that is at least two threads, change the scheme would hurt IO performance.

when calling libraries from a MAY coroutine, I have to make sure the library uses MAY's IO functions rather than std IO functions, correct (e.g. should use may::net::TcpStream instead of std::net::TcpStream)?

correct, and this is the first rule that list in may_caveat

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 26, 2017

@vorner

Anyway, if you really want to go the way to tell user „don't do this“ and „do this“, that's fine, but you should make corresponding methods unsafe, to ensure they read and follow the instructions. That's the purpose of unsafe, it tells people they need to be careful. Unlike C++ people don't expect to have to watch out for things.

this would mark the core APIs as unsafe, which would

  • lost optimization by compiler
  • very annoying to use the API

and I don't have a plan to change API specs.

@vorner

This comment has been minimized.

vorner commented Dec 26, 2017

Your API already is unsafe. Pretending it's safe won't make it so, it'll only hurt users.

Very good explanation about unsafe in rust was publish just recently: https://manishearth.github.io/blog/2017/12/24/undefined-vs-unsafe-in-rust/.

To cite:

Basically, in Rust a bit of code is “safe” if it cannot exhibit undefined behavior under all circumstances of that code being used.

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 26, 2017

@vorner You are right. Thanks for the sharing

@Restioson

This comment has been minimized.

Restioson commented Dec 26, 2017

  • very annoying to use the API

It's also very annoying to encounter UB in 'safe' code...

@Xudong-Huang

This comment has been minimized.

Owner

Xudong-Huang commented Dec 26, 2017

thanks all, I'm considering change the api spec to add unsafe declaration for spawning a coroutine, since it's in an early stage, it's a good time to do that. Apply the rust unsafe rules is important for users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment