New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctrie data race #122

Closed
newhook opened this Issue Dec 24, 2015 · 9 comments

Comments

Projects
None yet
2 participants
@newhook

newhook commented Dec 24, 2015

I just got a report of data race in the ctrie impl on committed.

func (c *Ctrie) rdcssRoot(old *iNode, expected *mainNode, nv *iNode) bool {
    desc := &iNode{
        rdcss: &rdcssDescriptor{
            old:      old,
            expected: expected,
            nv:       nv,
        },
    }
    if c.casRoot(old, desc) {
        c.rdcssComplete(false)
        return desc.rdcss.committed
    }
    return false
}
func (c *Ctrie) rdcssComplete(abort bool) *iNode {
...
            if oldeMain == exp {
            // Commit the RDCSS.
            if c.casRoot(r, nv) {
                desc.committed = true
                return nv
Read by goroutine 9:
  customerio/chgbuf/ctrie.(*Ctrie).rdcssRoot()
      /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:881 +0x1e7
  customerio/chgbuf/ctrie.(*Ctrie).Snapshot()
      /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:318 +0xb7
...
Previous write by goroutine 6:
  customerio/chgbuf/ctrie.(*Ctrie).rdcssComplete()
      /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:912 +0x1a1
  customerio/chgbuf/ctrie.(*Ctrie).rdcssReadRoot()
      /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:863 +0x78
  customerio/chgbuf/ctrie.(*Ctrie).readRoot()
      /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:855 +0x33
  customerio/chgbuf/ctrie.(*Ctrie).ReadOnlySnapshot()
...
@tylertreat

This comment has been minimized.

Show comment
Hide comment
@tylertreat

tylertreat Dec 24, 2015

Contributor

Thanks, will need to take a closer look, but that probably needs to be an
atomic.

On Thu, Dec 24, 2015, 11:59 AM Matthew Newhook notifications@github.com
wrote:

I just got a report of data race in the ctrie impl on committed.

func (c *Ctrie) rdcssRoot(old *iNode, expected *mainNode, nv *iNode) bool {
desc := &iNode{
rdcss: &rdcssDescriptor{
old: old,
expected: expected,
nv: nv,
},
}
if c.casRoot(old, desc) {
c.rdcssComplete(false)
return desc.rdcss.committed
}
return false
}

func (c *Ctrie) rdcssComplete(abort bool) *iNode {
...
if oldeMain == exp {
// Commit the RDCSS.
if c.casRoot(r, nv) {
desc.committed = true
return nv

Read by goroutine 9:
customerio/chgbuf/ctrie.(_Ctrie).rdcssRoot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:881 +0x1e7
customerio/chgbuf/ctrie.(_Ctrie).Snapshot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:318 +0xb7
...
Previous write by goroutine 6:
customerio/chgbuf/ctrie.(_Ctrie).rdcssComplete()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:912 +0x1a1
customerio/chgbuf/ctrie.(_Ctrie).rdcssReadRoot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:863 +0x78
customerio/chgbuf/ctrie.(_Ctrie).readRoot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:855 +0x33
customerio/chgbuf/ctrie.(_Ctrie).ReadOnlySnapshot()
...


Reply to this email directly or view it on GitHub
#122.

Contributor

tylertreat commented Dec 24, 2015

Thanks, will need to take a closer look, but that probably needs to be an
atomic.

On Thu, Dec 24, 2015, 11:59 AM Matthew Newhook notifications@github.com
wrote:

I just got a report of data race in the ctrie impl on committed.

func (c *Ctrie) rdcssRoot(old *iNode, expected *mainNode, nv *iNode) bool {
desc := &iNode{
rdcss: &rdcssDescriptor{
old: old,
expected: expected,
nv: nv,
},
}
if c.casRoot(old, desc) {
c.rdcssComplete(false)
return desc.rdcss.committed
}
return false
}

func (c *Ctrie) rdcssComplete(abort bool) *iNode {
...
if oldeMain == exp {
// Commit the RDCSS.
if c.casRoot(r, nv) {
desc.committed = true
return nv

Read by goroutine 9:
customerio/chgbuf/ctrie.(_Ctrie).rdcssRoot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:881 +0x1e7
customerio/chgbuf/ctrie.(_Ctrie).Snapshot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:318 +0xb7
...
Previous write by goroutine 6:
customerio/chgbuf/ctrie.(_Ctrie).rdcssComplete()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:912 +0x1a1
customerio/chgbuf/ctrie.(_Ctrie).rdcssReadRoot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:863 +0x78
customerio/chgbuf/ctrie.(_Ctrie).readRoot()
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:855 +0x33
customerio/chgbuf/ctrie.(_Ctrie).ReadOnlySnapshot()
...


Reply to this email directly or view it on GitHub
#122.

@newhook

This comment has been minimized.

Show comment
Hide comment
@newhook

newhook Dec 24, 2015

-               return desc.rdcss.committed
+        return atomic.LoadInt32(&desc.rdcss.committed) == 1
        }
        return false
 }
@@ -909,7 +909,7 @@ func (c *Ctrie) rdcssComplete(abort bool) *iNode {
                if oldeMain == exp {
                        // Commit the RDCSS.
                        if c.casRoot(r, nv) {
-                               desc.committed = true
+                atomic.StoreInt32(&desc.committed, 1)
                                return nv

I made that fix locally, but I'm still getting very occasional crashes in a highly concurrent test case.

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x19146b]

goroutine 5 [running]:
testing.tRunner.func1(0xc8200a8000)
    /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/testing/testing.go:450 +0x1f5
customerio/chgbuf/ctrie.(*Ctrie).ilookup(0xc82039cd00, 0xc8203b3ee0, 0xc8203994d0, 0x0, 0x0, 0x3f9990, 0x0, 0x0, 0x8)
    /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:540 +0x7b
customerio/chgbuf/ctrie.(*Ctrie).lookup(0xc82039cd00, 0xc8203994d0, 0x0, 0x0, 0x340ca71c)
    /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:430 +0x9d
customerio/chgbuf/ctrie.(*Ctrie).Lookup(0xc82039cd00, 0xc8203b6090, 0x1, 0x8, 0x0, 0x0, 0x1)
    /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:303 +0x111
customerio/chgbuf.(*data).child(0xc82039cd40, 0xc8203b6090, 0x1, 0x8, 0x256a00, 0x2fdde8)
    /Users/matthew/src/services/src/customerio/chgbuf/data.go:17 +0x68
customerio/chgbuf.(*bucketData).child(0xc82039cd20, 0xc8203b6090, 0x1, 0x8, 0x2fe980, 0x39a00)

newhook commented Dec 24, 2015

-               return desc.rdcss.committed
+        return atomic.LoadInt32(&desc.rdcss.committed) == 1
        }
        return false
 }
@@ -909,7 +909,7 @@ func (c *Ctrie) rdcssComplete(abort bool) *iNode {
                if oldeMain == exp {
                        // Commit the RDCSS.
                        if c.casRoot(r, nv) {
-                               desc.committed = true
+                atomic.StoreInt32(&desc.committed, 1)
                                return nv

I made that fix locally, but I'm still getting very occasional crashes in a highly concurrent test case.

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x19146b]

goroutine 5 [running]:
testing.tRunner.func1(0xc8200a8000)
    /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/testing/testing.go:450 +0x1f5
customerio/chgbuf/ctrie.(*Ctrie).ilookup(0xc82039cd00, 0xc8203b3ee0, 0xc8203994d0, 0x0, 0x0, 0x3f9990, 0x0, 0x0, 0x8)
    /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:540 +0x7b
customerio/chgbuf/ctrie.(*Ctrie).lookup(0xc82039cd00, 0xc8203994d0, 0x0, 0x0, 0x340ca71c)
    /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:430 +0x9d
customerio/chgbuf/ctrie.(*Ctrie).Lookup(0xc82039cd00, 0xc8203b6090, 0x1, 0x8, 0x0, 0x0, 0x1)
    /Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:303 +0x111
customerio/chgbuf.(*data).child(0xc82039cd40, 0xc8203b6090, 0x1, 0x8, 0x256a00, 0x2fdde8)
    /Users/matthew/src/services/src/customerio/chgbuf/data.go:17 +0x68
customerio/chgbuf.(*bucketData).child(0xc82039cd20, 0xc8203b6090, 0x1, 0x8, 0x2fe980, 0x39a00)
@tylertreat

This comment has been minimized.

Show comment
Hide comment
@tylertreat

tylertreat Dec 24, 2015

Contributor

If you have a test which can reproduce semi reliably that would be great.

On Thu, Dec 24, 2015, 3:15 PM Matthew Newhook notifications@github.com
wrote:

  •           return desc.rdcss.committed
    
  •    return atomic.LoadInt32(&desc.rdcss.committed) == 1
    }
    return false
    
    }
    @@ -909,7 +909,7 @@ func (c *Ctrie) rdcssComplete(abort bool) *iNode {
    if oldeMain == exp {
    // Commit the RDCSS.
    if c.casRoot(r, nv) {
  •                           desc.committed = true
    
  •            atomic.StoreInt32(&desc.committed, 1)
                            return nv
    

I made that fix locally, but I'm still getting very occasional crashes in
a highly concurrent test case.

customerio/chgbuf/ctrie.(_Ctrie).ilookup(0xc82039cd00, 0xc8203b3ee0, 0xc8203994d0, 0x0, 0x0, 0x3f9990, 0x0, 0x0, 0x8)
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:540 +0x7b
customerio/chgbuf/ctrie.(_Ctrie).lookup(0xc82039cd00, 0xc8203994d0, 0x0, 0x0, 0x340ca71c)
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:430 +0x9d
customerio/chgbuf/ctrie.(_Ctrie).Lookup(0xc82039cd00, 0xc8203b6090, 0x1, 0x8, 0x0, 0x0, 0x1)
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:303 +0x111
customerio/chgbuf.(_data).child(0xc82039cd40, 0xc8203b6090, 0x1, 0x8, 0x256a00, 0x2fdde8)
/Users/matthew/src/services/src/customerio/chgbuf/data.go:17 +0x68
customerio/chgbuf.(*bucketData).child(0xc82039cd20, 0xc8203b6090, 0x1, 0x8, 0x2fe980, 0x39a00)


Reply to this email directly or view it on GitHub
#122 (comment)
.

Contributor

tylertreat commented Dec 24, 2015

If you have a test which can reproduce semi reliably that would be great.

On Thu, Dec 24, 2015, 3:15 PM Matthew Newhook notifications@github.com
wrote:

  •           return desc.rdcss.committed
    
  •    return atomic.LoadInt32(&desc.rdcss.committed) == 1
    }
    return false
    
    }
    @@ -909,7 +909,7 @@ func (c *Ctrie) rdcssComplete(abort bool) *iNode {
    if oldeMain == exp {
    // Commit the RDCSS.
    if c.casRoot(r, nv) {
  •                           desc.committed = true
    
  •            atomic.StoreInt32(&desc.committed, 1)
                            return nv
    

I made that fix locally, but I'm still getting very occasional crashes in
a highly concurrent test case.

customerio/chgbuf/ctrie.(_Ctrie).ilookup(0xc82039cd00, 0xc8203b3ee0, 0xc8203994d0, 0x0, 0x0, 0x3f9990, 0x0, 0x0, 0x8)
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:540 +0x7b
customerio/chgbuf/ctrie.(_Ctrie).lookup(0xc82039cd00, 0xc8203994d0, 0x0, 0x0, 0x340ca71c)
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:430 +0x9d
customerio/chgbuf/ctrie.(_Ctrie).Lookup(0xc82039cd00, 0xc8203b6090, 0x1, 0x8, 0x0, 0x0, 0x1)
/Users/matthew/src/services/src/customerio/chgbuf/ctrie/ctrie.go:303 +0x111
customerio/chgbuf.(_data).child(0xc82039cd40, 0xc8203b6090, 0x1, 0x8, 0x256a00, 0x2fdde8)
/Users/matthew/src/services/src/customerio/chgbuf/data.go:17 +0x68
customerio/chgbuf.(*bucketData).child(0xc82039cd20, 0xc8203b6090, 0x1, 0x8, 0x2fe980, 0x39a00)


Reply to this email directly or view it on GitHub
#122 (comment)
.

@newhook

This comment has been minimized.

Show comment
Hide comment
@newhook

newhook Dec 25, 2015

The test is part of something much larger. I'll try and reproduce in something small.

newhook commented Dec 25, 2015

The test is part of something much larger. I'll try and reproduce in something small.

@tylertreat

This comment has been minimized.

Show comment
Hide comment
@tylertreat

tylertreat Dec 25, 2015

Contributor

I haven't hit the nil pointer issue, but there definitely appears to be an issue with snapshotting (don't know if it's related to your issue or not). I ran the following program which gives the correct output unless the two goroutines are commented out and the wg is adjusted accordingly.

package main

import (
    "fmt"
    "strconv"
    "sync"

    "github.com/Workiva/go-datastructures/trie/ctrie"
)

func main() {
    trie := ctrie.New(nil)
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        for i := 0; i < 1000000; i++ {
            trie.Insert([]byte(strconv.Itoa(i)), i)
        }
        wg.Done()
    }()
    go func() {
        for i := 0; i < 1000000; i++ {
            trie.Lookup([]byte(strconv.Itoa(i)))
        }
        wg.Done()
    }()
    //go func() {
    //  for i := 0; i < 1000000; i++ {
    //      trie.Snapshot()
    //  }
    //  wg.Done()
    //}()
    //go func() {
    //  for i := 0; i < 1000000; i++ {
    //      trie.ReadOnlySnapshot()
    //  }
    //  wg.Done()
    //}()
    wg.Wait()

    i := 1
    for _ = range trie.Iterator(nil) {
        fmt.Println(i)
        i++
    }
    fmt.Println("size", trie.Size())
}
Contributor

tylertreat commented Dec 25, 2015

I haven't hit the nil pointer issue, but there definitely appears to be an issue with snapshotting (don't know if it's related to your issue or not). I ran the following program which gives the correct output unless the two goroutines are commented out and the wg is adjusted accordingly.

package main

import (
    "fmt"
    "strconv"
    "sync"

    "github.com/Workiva/go-datastructures/trie/ctrie"
)

func main() {
    trie := ctrie.New(nil)
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        for i := 0; i < 1000000; i++ {
            trie.Insert([]byte(strconv.Itoa(i)), i)
        }
        wg.Done()
    }()
    go func() {
        for i := 0; i < 1000000; i++ {
            trie.Lookup([]byte(strconv.Itoa(i)))
        }
        wg.Done()
    }()
    //go func() {
    //  for i := 0; i < 1000000; i++ {
    //      trie.Snapshot()
    //  }
    //  wg.Done()
    //}()
    //go func() {
    //  for i := 0; i < 1000000; i++ {
    //      trie.ReadOnlySnapshot()
    //  }
    //  wg.Done()
    //}()
    wg.Wait()

    i := 1
    for _ = range trie.Iterator(nil) {
        fmt.Println(i)
        i++
    }
    fmt.Println("size", trie.Size())
}
@tylertreat

This comment has been minimized.

Show comment
Hide comment
@tylertreat

tylertreat Dec 25, 2015

Contributor

I have a branch with a few fixes, but snapshotting is still broken: https://github.com/tylertreat/go-datastructures/tree/fixes

Simpler test:

package main

import (
    "fmt"
    "strconv"
    "sync"

    "github.com/Workiva/go-datastructures/trie/ctrie"
)

func main() {
    trie := ctrie.New(nil)
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        for i := 0; i < 7; i++ {
            trie.Insert([]byte(strconv.Itoa(i)), i)
        }
        wg.Done()
    }()
    go func() {
        for i := 0; i < 7; i++ {
            trie.Snapshot()
        }
        wg.Done()
    }()
    wg.Wait()
    fmt.Println("size", trie.Size())
}

Expected size is 7, but if you run it enough times you should see values less than 7 occasionally.

Contributor

tylertreat commented Dec 25, 2015

I have a branch with a few fixes, but snapshotting is still broken: https://github.com/tylertreat/go-datastructures/tree/fixes

Simpler test:

package main

import (
    "fmt"
    "strconv"
    "sync"

    "github.com/Workiva/go-datastructures/trie/ctrie"
)

func main() {
    trie := ctrie.New(nil)
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        for i := 0; i < 7; i++ {
            trie.Insert([]byte(strconv.Itoa(i)), i)
        }
        wg.Done()
    }()
    go func() {
        for i := 0; i < 7; i++ {
            trie.Snapshot()
        }
        wg.Done()
    }()
    wg.Wait()
    fmt.Println("size", trie.Size())
}

Expected size is 7, but if you run it enough times you should see values less than 7 occasionally.

@tylertreat

This comment has been minimized.

Show comment
Hide comment
@tylertreat

tylertreat Dec 26, 2015

Contributor

Issue appears to be with GCAS. I made an incorrect assumption from Java (where new Object() != new Object()). In Go, this is not quite the case: https://play.golang.org/p/BBde4cX1GC. As a result, GCAS always succeeds, despite the root generation changing during a snapshot.

Contributor

tylertreat commented Dec 26, 2015

Issue appears to be with GCAS. I made an incorrect assumption from Java (where new Object() != new Object()). In Go, this is not quite the case: https://play.golang.org/p/BBde4cX1GC. As a result, GCAS always succeeds, despite the root generation changing during a snapshot.

@tylertreat

This comment has been minimized.

Show comment
Hide comment
@tylertreat

tylertreat Dec 27, 2015

Contributor

I believe I have it fixed with this: https://github.com/Workiva/go-datastructures/compare/master...tylertreat:fixes?expand=1. I haven't been able to reproduce after applying those fixes.

@newhook can you confirm this branch fixes your problem?

Contributor

tylertreat commented Dec 27, 2015

I believe I have it fixed with this: https://github.com/Workiva/go-datastructures/compare/master...tylertreat:fixes?expand=1. I haven't been able to reproduce after applying those fixes.

@newhook can you confirm this branch fixes your problem?

@newhook

This comment has been minimized.

Show comment
Hide comment
@newhook

newhook Dec 27, 2015

Awesome! I'll check on Tuesday :)

newhook commented Dec 27, 2015

Awesome! I'll check on Tuesday :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment