
Possible datarace #117

Closed
holiman opened this issue Feb 11, 2019 · 5 comments
Comments

@holiman (Contributor) commented Feb 11, 2019

I was looking into shard.go and saw the pattern below, which looks racy to me. I haven't gone through all the steps to fuzz or reproduce it, but I thought I'd post it here to get your thoughts.
Possibly problematic code:

func (s *cacheShard) del(key string, hashedKey uint64) error {
	s.lock.RLock()
	itemIndex := s.hashmap[hashedKey]

	if itemIndex == 0 {
		s.lock.RUnlock()
		s.delmiss()
		return ErrEntryNotFound
	}

	wrappedEntry, err := s.entries.Get(int(itemIndex))
	if err != nil {
		s.lock.RUnlock()
		s.delmiss()
		return err
	}
	s.lock.RUnlock()

	// Race window: between the RUnlock above and the Lock below, another
	// goroutine can delete or overwrite the entry that wrappedEntry points into.
	s.lock.Lock()
	{
		delete(s.hashmap, hashedKey)
		s.onRemove(wrappedEntry, Deleted)
		resetKeyFromEntry(wrappedEntry)
	}
	s.lock.Unlock()

	s.delhit()
	return nil
}

Multiple readers may enter the first section in parallel, each getting an identical wrappedEntry for the entry to be deleted. They then enter the write-locked section sequentially, and resetKeyFromEntry writes destructively to the data. It's possible that sequentially deleting the same element is fine, but even so, another write may reuse the data in the meantime, and the next of these sequential deletions would then overwrite it.

The proper pattern would be to either wrap everything in the write lock, or to redo the first check once the write lock has been obtained:

func (s *cacheShard) del(key string, hashedKey uint64) error {
	s.lock.RLock()
	itemIndex := s.hashmap[hashedKey]

	if itemIndex == 0 {
		s.lock.RUnlock()
		s.delmiss()
		return ErrEntryNotFound
	}

	wrappedEntry, err := s.entries.Get(int(itemIndex))
	if err != nil {
		s.lock.RUnlock()
		s.delmiss()
		return err
	}
	s.lock.RUnlock()

	s.lock.Lock()
	{

		itemIndex = s.hashmap[hashedKey]

		if itemIndex == 0 {
			s.lock.Unlock()
			s.delmiss()
			return ErrEntryNotFound
		}

		wrappedEntry, err = s.entries.Get(int(itemIndex))
		if err != nil {
			s.lock.Unlock()
			s.delmiss()
			return err
		}

		delete(s.hashmap, hashedKey)
		s.onRemove(wrappedEntry, Deleted)
		resetKeyFromEntry(wrappedEntry)
	}
	s.lock.Unlock()
	s.delhit()
	return nil
}

I'm not fully versed in the codebase, so apologies if I've misunderstood something.

@siennathesane (Collaborator) commented:

I'd be happy to take a look at any fuzzing examples you have, but currently we don't have any data races that I'm aware of. In the case of delete, reads are free but writes are not, so we only lock when we delete the key, not when we check for its existence.

The way it works is that every key gets a hash and is then sent to one of the shards (just optimised byte arrays). When we go to get a key, the key is rehashed and we read the contents from the shard it resides in, at the stored index. Last I checked, the only times we need to lock are when we write to a shard, which means Set and Delete, because otherwise we would definitely have a race condition on data writes.

We've spent the cycles going through and optimising the locking strategy as best we could at each step. The goal is to lock as little and as briefly as possible, since the aim is pure performance, and to allow reads while writing is locked whenever possible. If you think there's a more efficient locking strategy, I'd be happy to review any PRs you come up with. :)
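
As a rough sketch of the locking strategy described above (hypothetical names such as miniCache, miniShard and pick; this is not bigcache's actual code, which stores entries in a byte queue rather than a map):

package sketch

import "sync"

// miniShard is a hypothetical, simplified shard: a map guarded by an RWMutex.
type miniShard struct {
	lock sync.RWMutex
	data map[uint64][]byte
}

type miniCache struct {
	shards []*miniShard
}

// pick selects the shard responsible for a hashed key.
func (c *miniCache) pick(hashedKey uint64) *miniShard {
	return c.shards[hashedKey%uint64(len(c.shards))]
}

// Get only takes the read lock, so many readers can proceed in parallel.
func (c *miniCache) Get(hashedKey uint64) ([]byte, bool) {
	s := c.pick(hashedKey)
	s.lock.RLock()
	defer s.lock.RUnlock()
	v, ok := s.data[hashedKey]
	return v, ok
}

// Set and Delete mutate the shard, so they take the write lock.
func (c *miniCache) Set(hashedKey uint64, value []byte) {
	s := c.pick(hashedKey)
	s.lock.Lock()
	defer s.lock.Unlock()
	s.data[hashedKey] = value
}

func (c *miniCache) Delete(hashedKey uint64) {
	s := c.pick(hashedKey)
	s.lock.Lock()
	defer s.lock.Unlock()
	delete(s.data, hashedKey)
}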

@holiman (Contributor, Author) commented Feb 15, 2019

I tried reproducing something; the result wasn't really what I expected (a crash instead of an error).
The test case is below; aside from that, I modified maxShardSize to skip the convertMBToBytes conversion and use the given value directly.

// This test lives in package bigcache (bigcache_test.go) and additionally
// uses "bytes", "fmt", "math/rand", "sync" and "time" from the standard library.
func TestCacheDelRandomly(t *testing.T) {
	c := Config{
		Shards:             1,
		LifeWindow:         time.Second,
		CleanWindow:        0,
		MaxEntriesInWindow: 10,
		MaxEntrySize:       10,
		Verbose:            true,
		Hasher:             newDefaultHasher(),
		HardMaxCacheSize:   100,
		Logger:             DefaultLogger(),
	}
	cache, _ := NewBigCache(c)
	var wg sync.WaitGroup
	wg.Add(3)
	var ntest = 100000
	go func() {
		for i := 0; i < ntest; i++ {
			r := uint8(rand.Int())
			key := fmt.Sprintf("thekey%d", r)
			cache.Delete(key)
		}
		wg.Done()
	}()
	go func() {
		for i := 0; i < ntest; i++ {
			r := uint8(rand.Int())
			key := fmt.Sprintf("thekey%d", r)
			val := []byte(fmt.Sprintf("%x%x%x%x%x%x%x", r, r, r, r, r, r, r))
			cache.Set(key, val)
		}
		wg.Done()
	}()
	go func() {
		for i := 0; i < ntest; i++ {
			r := uint8(rand.Int())
			key := fmt.Sprintf("thekey%d", r)
			val := []byte(fmt.Sprintf("%x%x%x%x%x%x%x", r, r, r, r, r, r, r))
			if got, err := cache.Get(key); err == nil && !bytes.Equal(got, val) {
				fmt.Printf("ERR: Got %s -> %s (exp %s)\n ", key, got, val)
			}
		}
		wg.Done()
	}()
	wg.Wait()
}
New queue with initialCapacity 100, maxCapacity 100 
2019/02/15 09:34:50 Collision detected. Both "thekey208" and "thekey145" have the same hash e39a7b36da6f8f8b
2019/02/15 09:34:50 Collision detected. Both "thekey120" and "thekey177" have the same hash ecaa2236dfae8c2c
2019/02/15 09:34:50 Collision detected. Both "thekey162" and "thekey157" have the same hash ec9cac36dfa3394e
2019/02/15 09:34:50 Collision detected. Both "thekey162" and "thekey172" have the same hash ec9cac36dfa3394e
2019/02/15 09:34:50 Collision detected. Both "thekey185" and "thekey6" have the same hash ec95a336df9d0b55
panic: runtime error: slice bounds out of range

goroutine 21 [running]:
github.com/allegro/bigcache/queue.(*BytesQueue).peek(0xc0000c2c68, 0x2e, 0x0, 0x3, 0x411d1e, 0xc00014edc0, 0xeb884e38, 0xdfb4c04c9bf101fc)
	/home/user/go/src/github.com/allegro/bigcache/queue/bytes_queue.go:199 +0x229
github.com/allegro/bigcache/queue.(*BytesQueue).Get(0xc0000c2c68, 0x2e, 0xec95a336df9d0b55, 0xc0000ea12c, 0x464aea, 0x91d9a0, 0xc000124180)
	/home/user/go/src/github.com/allegro/bigcache/queue/bytes_queue.go:164 +0x35
github.com/allegro/bigcache.(*cacheShard).set(0xc0000c2c60, 0xc0001a04f0, 0x9, 0xec95a336df9d0b55, 0xc00014ef08, 0xe, 0x20, 0xe, 0xc00014ef08)
	/home/user/go/src/github.com/allegro/bigcache/shard.go:64 +0x34c
github.com/allegro/bigcache.(*BigCache).Set(0xc0000872b0, 0xc0001a04f0, 0x9, 0xc00014ef08, 0xe, 0x20, 0xe, 0x0)
	/home/user/go/src/github.com/allegro/bigcache/bigcache.go:117 +0xab
github.com/allegro/bigcache.TestCacheDelRandomly.func2(0x186a0, 0xc0000872b0, 0xc00008c7c0)
	/home/user/go/src/github.com/allegro/bigcache/bigcache_test.go:394 +0x1ff
created by github.com/allegro/bigcache.TestCacheDelRandomly
	/home/user/go/src/github.com/allegro/bigcache/bigcache_test.go:389 +0x1cb

@holiman (Contributor, Author) commented Feb 15, 2019

A printout in the bytes_queue before that happens:

Trying to access out of bounds: blocksize 0xc7000000 q.array[50 :3338666034], len 200
panic: runtime error: slice bounds out of range

and

Trying to access out of bounds: blocksize 0x18000000 q.array[50 :402653234], len 200

Looks like blocksize is sometimes corrupted. Also, I managed to get the test to execute without crashing once, and caught the thing I'd been fishing for:

=== RUN   TestCacheDelRandomly
New queue with initialCapacity 100, maxCapacity 200
2019/02/15 09:49:47 Allocated new queue in 2.892µs; Capacity: 200
ERR: Got thekey103 -> fbfbfbfbfbfbfb (exp 67676767676767)
 --- PASS: TestCacheDelRandomly (0.27s)
PASS

@holiman (Contributor, Author) commented Feb 15, 2019

More context, printing out the data surrounding the blocksize read:

Trying to access out of bounds: blocksize 0x38613861 q.array[132 :945895653], len 200
Contents of array in the neighbourhood of 'blocksize': [...79313638613861386138613861386138613828000000497f...]

One of the entries, namely this one (number 138 / 8A)

key thekey138 val 3861386138613861386138613861

has obviously overwritten the location where it looks for blocksize:
[...793136 3861386138613861386138613861 3828000000497f...]
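
To make the failure mode concrete, here is a minimal sketch of reading a length-prefixed entry (a hypothetical peekAt function, assuming a little-endian uint32 length prefix; this is not the actual BytesQueue code). If a stale index ends up pointing into the middle of another entry's payload, those payload bytes get interpreted as the block size, which is exactly the kind of corrupted blocksize shown above.

package sketch

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// peekAt reads one length-prefixed entry from a flat byte array.
func peekAt(array []byte, index int) ([]byte, error) {
	if index+4 > len(array) {
		return nil, errors.New("index out of bounds")
	}
	// If index points into another entry's payload (because the entry it
	// originally referred to was deleted or overwritten), these four bytes
	// are arbitrary data and blockSize becomes garbage, e.g. 0x38613861.
	blockSize := int(binary.LittleEndian.Uint32(array[index : index+4]))
	if index+4+blockSize > len(array) {
		return nil, fmt.Errorf("corrupted block size 0x%x: slice bounds out of range", blockSize)
	}
	return array[index+4 : index+4+blockSize], nil
}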

@holiman (Contributor, Author) commented Feb 15, 2019

If I make the lock around del one contiguous write lock, then all errors disappear.
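
For reference, this is roughly what del looks like with one contiguous write lock (a sketch reusing the same types and helpers as in the snippets above; the stat calls stay outside the lock):

func (s *cacheShard) del(key string, hashedKey uint64) error {
	// Hold the write lock across the lookup and the destructive
	// resetKeyFromEntry, so no other goroutine can interleave.
	s.lock.Lock()

	itemIndex := s.hashmap[hashedKey]
	if itemIndex == 0 {
		s.lock.Unlock()
		s.delmiss()
		return ErrEntryNotFound
	}

	wrappedEntry, err := s.entries.Get(int(itemIndex))
	if err != nil {
		s.lock.Unlock()
		s.delmiss()
		return err
	}

	delete(s.hashmap, hashedKey)
	s.onRemove(wrappedEntry, Deleted)
	resetKeyFromEntry(wrappedEntry)
	s.lock.Unlock()

	s.delhit()
	return nil
}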

holiman added a commit to holiman/bigcache that referenced this issue Feb 15, 2019
cristaloleg pushed a commit that referenced this issue Feb 18, 2019
This PR contains a test case which demonstrates corruption (intermittently), sometimes leading to `panic` and sometimes to corrupted values.

I thought that fixing the data race suggested in #117 would solve it, but it seems I was wrong about that; there's some other underlying bug. I'll push a commit on top if I find it.
holiman closed this as completed Mar 19, 2019
u5surf pushed a commit to u5surf/bigcache that referenced this issue Jun 4, 2019
flisky pushed a commit to flisky/bigcache that referenced this issue May 7, 2020