
Possible datarace #117

Closed
holiman opened this issue Feb 11, 2019 · 5 comments
Comments

@holiman (Contributor) commented Feb 11, 2019

I was looking into shard.go and saw the pattern below, which looks racy to me. I haven't gone through all the steps to fuzz or reproduce it, but I thought I'd post it here to get your thoughts.
Possibly problematic code:

func (s *cacheShard) del(key string, hashedKey uint64) error {
	s.lock.RLock()
	itemIndex := s.hashmap[hashedKey]

	if itemIndex == 0 {
		s.lock.RUnlock()
		s.delmiss()
		return ErrEntryNotFound
	}

	wrappedEntry, err := s.entries.Get(int(itemIndex))
	if err != nil {
		s.lock.RUnlock()
		s.delmiss()
		return err
	}
	s.lock.RUnlock()

	// Race window: between the RUnlock above and the Lock below, another
	// goroutine can delete or overwrite the entry that wrappedEntry points into.
	s.lock.Lock()
	{
		delete(s.hashmap, hashedKey)
		s.onRemove(wrappedEntry, Deleted)
		resetKeyFromEntry(wrappedEntry)
	}
	s.lock.Unlock()

	s.delhit()
	return nil
}

Multiple readers may enter the first section in parallel, each getting an identical wrappedEntry for the entry to be deleted. They then enter the write-locked section sequentially, and resetKeyFromEntry writes destructively to the data. It's possible that sequentially deleting the same element is fine, but even so, another write may reuse the data in the meantime, and the next of these sequential deletions would then overwrite it.

The proper pattern would be to either wrap everything in the write lock, or to redo the first check once the write lock has been obtained:

func (s *cacheShard) del(key string, hashedKey uint64) error {
	s.lock.RLock()
	itemIndex := s.hashmap[hashedKey]

	if itemIndex == 0 {
		s.lock.RUnlock()
		s.delmiss()
		return ErrEntryNotFound
	}

	wrappedEntry, err := s.entries.Get(int(itemIndex))
	if err != nil {
		s.lock.RUnlock()
		s.delmiss()
		return err
	}
	s.lock.RUnlock()

	s.lock.Lock()
	{

		itemIndex = s.hashmap[hashedKey]

		if itemIndex == 0 {
			s.lock.Unlock()
			s.delmiss()
			return ErrEntryNotFound
		}

		wrappedEntry, err = s.entries.Get(int(itemIndex))
		if err != nil {
			s.lock.Unlock()
			s.delmiss()
			return err
		}

		delete(s.hashmap, hashedKey)
		s.onRemove(wrappedEntry, Deleted)
		resetKeyFromEntry(wrappedEntry)
	}
	s.lock.Unlock()
	s.delhit()
	return nil
}

I'm not fully versed in the codebase, so apologies if I've misunderstood something.

@siennathesane (Collaborator) commented:

I'd be happy to take a look at any fuzzing examples you have, but currently we don't have any data races that I'm aware of. In the case of delete, reads are free but writes are not, so we only lock when we delete the key, not when we check for its existence.

The way it works is that every key gets a hash and is then sent to one of the shards (just optimised byte arrays). When we go to get a key, the key is rehashed and we read the contents from the shard it resides in, at the stored index. Last I checked, the only times we need to lock are when we write to a shard, which means Set and Delete, because otherwise we would definitely have a race condition on data writes.

We've spent the cycles going through and optimising the locking strategy as best we could at each step. The goal is to lock as little and as briefly as possible, since the aim is pure performance, and to allow reads while writing is locked whenever possible. If you think there's a more efficient locking strategy, I'd be happy to review any PRs you come up with. :)
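
As a rough sketch of the locking strategy described above (hypothetical names such as miniCache, miniShard and pick; this is not bigcache's actual code, which stores entries in a byte queue rather than a map):

package sketch

import "sync"

// miniShard is a hypothetical, simplified shard: a map guarded by an RWMutex.
type miniShard struct {
	lock sync.RWMutex
	data map[uint64][]byte
}

type miniCache struct {
	shards []*miniShard
}

// pick selects the shard responsible for a hashed key.
func (c *miniCache) pick(hashedKey uint64) *miniShard {
	return c.shards[hashedKey%uint64(len(c.shards))]
}

// Get only takes the read lock, so many readers can proceed in parallel.
func (c *miniCache) Get(hashedKey uint64) ([]byte, bool) {
	s := c.pick(hashedKey)
	s.lock.RLock()
	defer s.lock.RUnlock()
	v, ok := s.data[hashedKey]
	return v, ok
}

// Set and Delete mutate the shard, so they take the write lock.
func (c *miniCache) Set(hashedKey uint64, value []byte) {
	s := c.pick(hashedKey)
	s.lock.Lock()
	defer s.lock.Unlock()
	s.data[hashedKey] = value
}

func (c *miniCache) Delete(hashedKey uint64) {
	s := c.pick(hashedKey)
	s.lock.Lock()
	defer s.lock.Unlock()
	delete(s.data, hashedKey)
}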

@holiman (Contributor, Author) commented Feb 15, 2019

I tried reproducing something; the result wasn't really what I expected (a crash instead of an error).
The test case is below; aside from that, I modified maxShardSize to skip the convertMBToBytes conversion and use the given value directly.

// This test lives in package bigcache (bigcache_test.go) and additionally
// uses "bytes", "fmt", "math/rand", "sync" and "time" from the standard library.
func TestCacheDelRandomly(t *testing.T) {
	c := Config{
		Shards:             1,
		LifeWindow:         time.Second,
		CleanWindow:        0,
		MaxEntriesInWindow: 10,
		MaxEntrySize:       10,
		Verbose:            true,
		Hasher:             newDefaultHasher(),
		HardMaxCacheSize:   100,
		Logger:             DefaultLogger(),
	}
	cache, _ := NewBigCache(c)
	var wg sync.WaitGroup
	wg.Add(3)
	var ntest = 100000
	go func() {
		for i := 0; i < ntest; i++ {
			r := uint8(rand.Int())
			key := fmt.Sprintf("thekey%d", r)
			cache.Delete(key)
		}
		wg.Done()
	}()
	go func() {
		for i := 0; i < ntest; i++ {
			r := uint8(rand.Int())
			key := fmt.Sprintf("thekey%d", r)
			val := []byte(fmt.Sprintf("%x%x%x%x%x%x%x", r, r, r, r, r, r, r))
			cache.Set(key, val)
		}
		wg.Done()
	}()
	go func() {
		for i := 0; i < ntest; i++ {
			r := uint8(rand.Int())
			key := fmt.Sprintf("thekey%d", r)
			val := []byte(fmt.Sprintf("%x%x%x%x%x%x%x", r, r, r, r, r, r, r))
			if got, err := cache.Get(key); err == nil && !bytes.Equal(got, val) {
				fmt.Printf("ERR: Got %s -> %s (exp %s)\n ", key, got, val)
			}
		}
		wg.Done()
	}()
	wg.Wait()
}
New queue with initialCapacity 100, maxCapacity 100 
2019/02/15 09:34:50 Collision detected. Both "thekey208" and "thekey145" have the same hash e39a7b36da6f8f8b
2019/02/15 09:34:50 Collision detected. Both "thekey120" and "thekey177" have the same hash ecaa2236dfae8c2c
2019/02/15 09:34:50 Collision detected. Both "thekey162" and "thekey157" have the same hash ec9cac36dfa3394e
2019/02/15 09:34:50 Collision detected. Both "thekey162" and "thekey172" have the same hash ec9cac36dfa3394e
2019/02/15 09:34:50 Collision detected. Both "thekey185" and "thekey6" have the same hash ec95a336df9d0b55
panic: runtime error: slice bounds out of range

goroutine 21 [running]:
github.com/allegro/bigcache/queue.(*BytesQueue).peek(0xc0000c2c68, 0x2e, 0x0, 0x3, 0x411d1e, 0xc00014edc0, 0xeb884e38, 0xdfb4c04c9bf101fc)
	/home/user/go/src/github.com/allegro/bigcache/queue/bytes_queue.go:199 +0x229
github.com/allegro/bigcache/queue.(*BytesQueue).Get(0xc0000c2c68, 0x2e, 0xec95a336df9d0b55, 0xc0000ea12c, 0x464aea, 0x91d9a0, 0xc000124180)
	/home/user/go/src/github.com/allegro/bigcache/queue/bytes_queue.go:164 +0x35
github.com/allegro/bigcache.(*cacheShard).set(0xc0000c2c60, 0xc0001a04f0, 0x9, 0xec95a336df9d0b55, 0xc00014ef08, 0xe, 0x20, 0xe, 0xc00014ef08)
	/home/user/go/src/github.com/allegro/bigcache/shard.go:64 +0x34c
github.com/allegro/bigcache.(*BigCache).Set(0xc0000872b0, 0xc0001a04f0, 0x9, 0xc00014ef08, 0xe, 0x20, 0xe, 0x0)
	/home/user/go/src/github.com/allegro/bigcache/bigcache.go:117 +0xab
github.com/allegro/bigcache.TestCacheDelRandomly.func2(0x186a0, 0xc0000872b0, 0xc00008c7c0)
	/home/user/go/src/github.com/allegro/bigcache/bigcache_test.go:394 +0x1ff
created by github.com/allegro/bigcache.TestCacheDelRandomly
	/home/user/go/src/github.com/allegro/bigcache/bigcache_test.go:389 +0x1cb

@holiman (Contributor, Author) commented Feb 15, 2019

A printout in the bytes_queue before that happens:

Trying to access out of bounds: blocksize 0xc7000000 q.array[50 :3338666034], len 200
panic: runtime error: slice bounds out of range

and

Trying to access out of bounds: blocksize 0x18000000 q.array[50 :402653234], len 200

Looks like blocksize is sometimes corrupted. Also, I managed to get the test to execute without crashing once, and caught the thing I'd been fishing for:

=== RUN   TestCacheDelRandomly
New queue with initialCapacity 100, maxCapacity 200
2019/02/15 09:49:47 Allocated new queue in 2.892µs; Capacity: 200
ERR: Got thekey103 -> fbfbfbfbfbfbfb (exp 67676767676767)
 --- PASS: TestCacheDelRandomly (0.27s)
PASS

@holiman (Contributor, Author) commented Feb 15, 2019

More context, printing out the data surrounding the blocksize read:

Trying to access out of bounds: blocksize 0x38613861 q.array[132 :945895653], len 200
Contents of array in the neighbourhood of 'blocksize': [...79313638613861386138613861386138613828000000497f...]

One of the entries, namely this one (number 138 / 8A)

key thekey138 val 3861386138613861386138613861

has obviously overwritten the location where it looks for blocksize:
[...793136 3861386138613861386138613861 3828000000497f...]
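
To make the failure mode concrete, here is a minimal sketch of reading a length-prefixed entry (a hypothetical peekAt function, assuming a little-endian uint32 length prefix; this is not the actual BytesQueue code). If a stale index ends up pointing into the middle of another entry's payload, those payload bytes get interpreted as the block size, which is exactly the kind of corrupted blocksize shown above.

package sketch

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// peekAt reads one length-prefixed entry from a flat byte array.
func peekAt(array []byte, index int) ([]byte, error) {
	if index+4 > len(array) {
		return nil, errors.New("index out of bounds")
	}
	// If index points into another entry's payload (because the entry it
	// originally referred to was deleted or overwritten), these four bytes
	// are arbitrary data and blockSize becomes garbage, e.g. 0x38613861.
	blockSize := int(binary.LittleEndian.Uint32(array[index : index+4]))
	if index+4+blockSize > len(array) {
		return nil, fmt.Errorf("corrupted block size 0x%x: slice bounds out of range", blockSize)
	}
	return array[index+4 : index+4+blockSize], nil
}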

@holiman (Contributor, Author) commented Feb 15, 2019

If I make the lock around del one contiguous write lock, then all errors disappear.
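
For reference, this is roughly what del looks like with one contiguous write lock (a sketch reusing the same types and helpers as in the snippets above; the stat calls stay outside the lock):

func (s *cacheShard) del(key string, hashedKey uint64) error {
	// Hold the write lock across the lookup and the destructive
	// resetKeyFromEntry, so no other goroutine can interleave.
	s.lock.Lock()

	itemIndex := s.hashmap[hashedKey]
	if itemIndex == 0 {
		s.lock.Unlock()
		s.delmiss()
		return ErrEntryNotFound
	}

	wrappedEntry, err := s.entries.Get(int(itemIndex))
	if err != nil {
		s.lock.Unlock()
		s.delmiss()
		return err
	}

	delete(s.hashmap, hashedKey)
	s.onRemove(wrappedEntry, Deleted)
	resetKeyFromEntry(wrappedEntry)
	s.lock.Unlock()

	s.delhit()
	return nil
}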

holiman added a commit to holiman/bigcache that referenced this issue Feb 15, 2019
cristaloleg pushed a commit that referenced this issue Feb 18, 2019
This PR contains a test case which demonstrates corruption (intermittently), sometimes leading to `panic` and sometimes to corrupted values.

I thought that fixing the data race suggested in #117 would solve it, but it seems I was wrong about that; there's some other underlying bug. I'll push a commit on top if I find it.
holiman closed this as completed Mar 19, 2019
u5surf pushed a commit to u5surf/bigcache that referenced this issue Jun 4, 2019
flisky pushed a commit to flisky/bigcache that referenced this issue May 7, 2020