
Mv cache revert #77

Merged
merged 5 commits 12 months ago

2 participants

Matthew Von-Maszewski Engel A. Sanchez
Matthew Von-Maszewski
Collaborator

This code reverts to Google's original cache implementation. cache.cc is used twice, once for the file cache and once for the block cache. I had previously made changes that yielded nice, incremental performance improvements to the block cache, with no real attempt to check the impact on the file cache. Recent testing shows a file cache miss has a 10x or larger impact on performance compared to a block cache miss, and the original Google code performs much better for the file cache. Hence I am reverting to Google's code.

There is one small adjustment to the original LRUCache::Insert routine. Google's code tests for "space" in the cache upon insertion of a new object, deleting the least recently used cache items until enough space is recovered. This logic fails if the least recently used item is still active: the measure of available space does not change when an active cache item is deleted, so the code loops on to delete the next least recently used item. The loop can end up deleting a large section of the file cache if the files are active in iterators. My adjustment only deletes non-active cache items. This adjustment previously shipped in my cache replacement code and is believed stable and tested.
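For reference, the eviction loop in Google's original LRUCache::Insert looks roughly like this (a paraphrase of upstream util/cache.cc, not verbatim). Unref() only returns an entry's charge to usage_ once the last reference drops, so evicting an entry that an iterator still holds does not shrink usage_, and the loop just moves on to the next victim:

    // Paraphrase of the upstream eviction loop in LRUCache::Insert().
    // If the oldest entry still has outside references, Unref() does not
    // free it or reduce usage_, so the loop keeps discarding the next
    // oldest entry and can sweep out most of the file cache.
    while (usage_ > capacity_ && lru_.next != &lru_) {
      LRUHandle* old = lru_.next;               // least recently used entry
      LRU_Remove(old);
      table_.Remove(old->key(), old->hash);
      Unref(old);                               // charge returns only at refs == 0
    }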

The branch contains a second adjustment that changes how the file cache size is measured. The Google code measures files by count. That makes memory sizing difficult since each file's index block and bloom filter block can vary in size (and I have other changes that vary file sizes by level). The code now sets the file cache's capacity to 4 MB times the max_file_count, where 4 MB is a median of samples reviewed from different levels and different datasets. Each file added to the cache is charged its actual memory usage, so the number of files that actually fit can vary with how file indexes and bloom filters vary in size, but the ability to accurately set the memory model is greatly enhanced and simplified.
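For clarity, the pieces of the new sizing model, pulled together from the table_cache.cc and table.cc diffs below:

    // From db/table_cache.cc: capacity is the file handle limit expressed
    // in bytes (4 MB median per file) rather than a file count.
    cache_(NewLRUCache2(entries * (4*1048576)))

    // Each opened table is charged its measured footprint, so the number of
    // files that fit varies with the actual index and filter sizes.
    *handle = cache_->Insert(key, tf, table->TableObjectSize(), &DeleteEntry);

    // From table/table.cc: the in-memory footprint of an open table.
    size_t Table::TableObjectSize() {
      return sizeof(Table) + sizeof(Table::Rep)
             + rep_->index_block->size() + rep_->filter_data_size;
    }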

Matthew Von-Maszewski
Collaborator

Unfortunately, I test things. Yes, I sometimes even submit pull requests before the multi-hour (multi-day), multi-configuration runs finish, if the runtime numbers are meeting expectations. In this case the Google cache fails after a few hours if app.config sets a small number for max_open_files (50): the system gets into a thrashing state of opening and closing files. Yep, I fixed this once before and have put that fix back into the Google cache.cc.

The fix is to create a second flavor of cache, initialized via NewLRUCache2(). This second flavor does NOT have an intermediate sharding layer. The shards are selected by a hash algorithm that does not give a very even distribution for the file cache, which leads to lots of cache evictions and thus lots of file opens and closes. Restoring my previous hack to make the file cache one big happy shard fixes all things. There might exist a magic capacity value at which Google's sharding would perform better for the file cache ... but finding it takes time and interest, and I am passing on both for now.
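For anyone unfamiliar with the sharded cache, the upstream dispatch is roughly the following (paraphrased; the exact constants in this tree may differ). Each key is routed to one of the shards by the top bits of its hash, and each shard gets only its fraction of the total capacity, so with only 50-75 file entries a skewed hash pushes a few shards over their tiny budget and they evict constantly:

    // Rough paraphrase of upstream ShardedLRUCache routing (not verbatim).
    static const int kNumShardBits = 4;
    static const int kNumShards = 1 << kNumShardBits;   // 16 shards

    static uint32_t Shard(uint32_t hash) {
      return hash >> (32 - kNumShardBits);   // top bits choose the shard
    }

    // Each shard holds roughly capacity / kNumShards, so a handful of
    // unevenly hashed file entries can overflow one shard while others sit
    // nearly empty.  NewLRUCache2 returns a single unsharded LRUCache instead.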

Engel A. Sanchez
Collaborator

@matthewvon I think the new code in LRUCache::Insert will stop looping as soon as it finds an "old" entry that is still active, and it won't try to find further, less "old" entries that could be removed. The one_removed flag is set to false right before the active check, and the while loop uses it to exit. The old code was careful to scan up to the point where there was an inactive entry before running that piece of logic. Can you check and see if I'm mistaken?

Matthew Von-Maszewski
Collaborator

@engelsanchez Wow. Good catch. The intended design was for the while() loop to walk the list of cache items until enough had been deleted to satisfy memory constraints. Google's original implementation assumed the head of the list was always deleted; that assumption would cause an infinite loop once I added the if() clause to delete only non-active items. The one_removed flag stopped the infinite loop.

HOWEVER, your query forced me to really look at Google's intention versus my intention. My tweak of Google's code was safe, but it would never test any object except the head of the list. I have submitted new code that actually walks the list. As before, please review for code correctness while I execute the 48 to 72 hours of testing needed to validate.
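The corrected loop, as it now reads in the util/cache.cc diff below: walk the LRU list oldest first, skip anything still referenced (deleting it would not return its charge to usage_), and stop once usage_ is back under capacity_ or the list is exhausted.

    LRUHandle * next, * cursor;
    for (cursor=lru_.next; usage_ > capacity_ && cursor != &lru_; cursor=next)
    {
        next=cursor->next;        // grab next before potentially destroying cursor
        if (cursor->refs <= 1)    // only the cache's own reference remains
        {
            LRU_Remove(cursor);
            table_.Remove(cursor->key(), cursor->hash);
            Unref(cursor);
        }
    }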

Engel A. Sanchez
Collaborator

Ok. I've finished my visual inspection of all the code changes. It seems correct as far as I can tell. Let's see what the test harvest brings!

:+1::dancer:

matthewvon merged commit 547c823
matthewvon closed this

Showing 5 unique commits by 1 author.

Apr 05, 2013
Matthew Von-Maszewski, fe5f6c5: Revert cache logic to Googles original LRU. Cost table cache by memory size, not simple file count.

Apr 07, 2013
Matthew Von-Maszewski, 2b13379: Correct the fix for only deleting non-active cache items. Had forgotten the one_removed flag in previous check-in.

Apr 08, 2013
Matthew Von-Maszewski, df50dd4: Bring back NewLRUCache2 for file cache. Googles sharding fails horribly if the file count is in 50 to 75 range.

Apr 09, 2013
Matthew Von-Maszewski, 1f897c7: Correct Insert loop so that it actually walks the heap instead of only looking at first object.

Apr 11, 2013
Matthew Von-Maszewski, edea0da: clean up most memory leaks in dev tool sst_scan. disable another grandparent sizing routine that was making sst_3 have too many small files ... and therein causing code to run out of max_open_files.
db/table_cache.cc (13 changed lines)

@@ -33,9 +33,12 @@ TableCache::TableCache(const std::string& dbname,
       options_(options),

       // convert file handle limit into a size limit
-      //  based upon historical kTargetFileSize from version_set.cc
-//      cache_(NewLRUCache2(entries * (2*1048576)))
-      cache_(NewLRUCache2(entries))
+      //  based upon sampling of metadata data sizes across
+      //  levels and various load characteristics
+      // Use NewLRUCache2 because it is NOT sharded.  Sharding
+      //  does horrible things to file cache due to hash function
+      //  not being very good and "capacity" does not split well
+      cache_(NewLRUCache2(entries * (4*1048576)))
 {
 }

@@ -69,8 +72,8 @@ Status TableCache::FindTable(uint64_t file_number, uint64_t file_size, int level
       tf->file = file;
       tf->table = table;

-//      *handle = cache_->Insert(key, tf, file_size, &DeleteEntry);
-      *handle = cache_->Insert(key, tf, 1, &DeleteEntry);
+      *handle = cache_->Insert(key, tf, table->TableObjectSize(), &DeleteEntry);
+//      *handle = cache_->Insert(key, tf, 1, &DeleteEntry);
       gPerfCounters->Inc(ePerfTableOpened);
     }
   }
db/version_set.cc (9 changed lines)

@@ -30,7 +30,7 @@ static struct
     uint64_t m_TargetFileSize;                   //!< mostly useless
     uint64_t m_MaxGrandParentOverlapBytes;       //!< needs tuning, but not essential
                                                  //!<   since moves eliminated
-    uint64_t m_ExpandedCompactionByteSizeLimit;  //!< needs tuning
+    int64_t  m_ExpandedCompactionByteSizeLimit;  //!< needs tuning
     uint64_t m_MaxBytesForLevel;                 //!< ignored if m_OverlappedFiles is true
     uint64_t m_MaxFileSizeForLevel;              //!< google really applies this
                                                  //!<   to file size of NEXT level
@@ -1482,6 +1482,10 @@ bool Compaction::IsBaseLevelForKey(const Slice& user_key) {
 }

 bool Compaction::ShouldStopBefore(const Slice& internal_key) {
+#if 0
+    // 04/11/2013 code seems to create way too many small files
+    //   in lower levels once highers start to populate
+    // this causes max_open_files to blow out too early
   if (!gLevelTraits[level()+1].m_OverlappedFiles)
   {
     // Scan to find earliest grandparent file that contains key.
@@ -1509,6 +1513,9 @@ bool Compaction::ShouldStopBefore(const Slice& internal_key) {
     // overlapped levels do NOT split their output file
     return false;
   }
+#else
+  return false;
+#endif
 }

 void Compaction::ReleaseInputs() {
include/leveldb/cache.h (4 changed lines)

@@ -28,8 +28,12 @@ class Cache;
 // Create a new cache with a fixed size capacity.  This implementation
 // of Cache uses a least-recently-used eviction policy.
 extern Cache* NewLRUCache(size_t capacity);
+
+// Riak customization - just like NewLRUCache except the underlying
+//  structure is NOT sharded.  Better for file cache.
 extern Cache* NewLRUCache2(size_t capacity);

+
 class Cache {
  public:
   Cache() { }
include/leveldb/table.h (6 changed lines)

@@ -59,9 +59,13 @@ class Table {
   // return a static copy of the table's counters.
   SstCounters GetSstCounters() const;

+  // riak routine to retrieve total memory footprint of an open table
+  //  object in memory
+  size_t TableObjectSize();
+
   // access routines for testing tools, not for public use
   Block * TEST_GetIndexBlock();
-  size_t TEST_TableObjectSize();
+  size_t TEST_TableObjectSize() {return(TableObjectSize());};
   size_t TEST_FilterDataSize();
   static Iterator* TEST_BlockReader(void* Ptr, const ReadOptions& ROptions, const Slice& SliceReturn)
     {return(BlockReader(Ptr, ROptions, SliceReturn));};
table/table.cc (4 changed lines)

@@ -364,8 +364,10 @@ uint64_t Table::ApproximateOffsetOf(const Slice& key) const {
 Block *
 Table::TEST_GetIndexBlock() {return(rep_->index_block);};

+// Riak specific routine.  Calculates total footprint of an open
+//  table in memory.
 size_t
-Table::TEST_TableObjectSize()
+Table::TableObjectSize()
 {
     return(sizeof(Table) + sizeof(Table::Rep) + rep_->index_block->size() + rep_->filter_data_size);
 };
db/leveldb_main.cc → tools/leveldb_main.cc (file renamed without changes)
tools/sst_scan.cc (37 changed lines)

@@ -101,10 +101,10 @@ main(
             //open table, step 2 find table (cache or open)
             if (status.ok())
             {
-                leveldb::Cache::Handle * handle;
+                leveldb::Cache::Handle * fhandle;

-                handle=NULL;
-                status=table_cache->TEST_FindTable(meta.number, meta.file_size, -2, &handle);
+                fhandle=NULL;
+                status=table_cache->TEST_FindTable(meta.number, meta.file_size, -2, &fhandle);

                 // count keys and size keys/filter
                 if (status.ok())
@@ -122,8 +122,8 @@ main(
                     count2=0;
                     tot_size=0;

-                    table = reinterpret_cast<leveldb::TableAndFile*>(table_cache->TEST_GetInternalCache()->Value(handle))->table;
-                    file = reinterpret_cast<leveldb::TableAndFile*>(table_cache->TEST_GetInternalCache()->Value(handle))->file;
+                    table = reinterpret_cast<leveldb::TableAndFile*>(table_cache->TEST_GetInternalCache()->Value(fhandle))->table;
+                    file = reinterpret_cast<leveldb::TableAndFile*>(table_cache->TEST_GetInternalCache()->Value(fhandle))->file;
                     it = table->TEST_GetIndexBlock()->NewIterator(options.comparator);


@@ -159,21 +159,21 @@ main(
                     for (it->SeekToFirst(), count=0; it->Valid(); it->Next())
                     {
                         leveldb::BlockContents contents;
-                        leveldb::BlockHandle handle;
+                        leveldb::BlockHandle bhandle;
                         leveldb::Slice slice;

                         ++block_count;
                         slice=it->value();
-                        handle.DecodeFrom(&slice);
+                        bhandle.DecodeFrom(&slice);

                         if (block_info)
                         {
                             printf("block %d, offset %llu, size %llu, next %llu\n",
-                                   block_count, handle.offset(), handle.size(), handle.offset()+handle.size());
+                                   block_count, bhandle.offset(), bhandle.size(), bhandle.offset()+bhandle.size());
                         }   // if

-                        tot_compress+=handle.size();
-                        status=leveldb::ReadBlock(file, read_options, handle, &contents);
+                        tot_compress+=bhandle.size();
+                        status=leveldb::ReadBlock(file, read_options, bhandle, &contents);
                         if (status.ok())
                         {
                             if (first)
@@ -186,6 +186,11 @@ main(
                                 smallest_block=contents.data.size();
                             }   // else if
                             tot_uncompress+=contents.data.size();
+
+                            if (contents.heap_allocated)
+                            {
+                                delete [] contents.data.data();
+                            }   // if
                         }   // if
                         else
                         {
@@ -222,8 +227,12 @@ main(
                                        table_name.c_str(),count, count2,it2->status().ToString().c_str());
                             }   // else
                         }   // for
+
+                        delete it2;
                     }   // for

+                    delete it;
+
                     if (!no_csv)
                     {
                         if (csv_header)
@@ -270,6 +279,9 @@ main(

                         printf("\n");
                     }   // if
+
+                    // cleanup
+                    table_cache->Evict(meta.number);
                 }   // if
                 else
                 {
@@ -278,6 +290,11 @@ main(
                     error_counter=1;
                 }   // else
             }   // if
+
+            // cleanup
+            delete table_cache;
+            delete options.filter_policy;
+
         }   // else
     }   // for

util/cache.cc (117 changed lines)

@@ -5,17 +5,12 @@
 #include <assert.h>
 #include <stdio.h>
 #include <stdlib.h>
-#include <sys/time.h>

 #include "leveldb/cache.h"
 #include "port/port.h"
 #include "util/hash.h"
 #include "util/mutexlock.h"

-#ifdef OS_SOLARIS
-#  include <atomic.h>
-#endif
-
 namespace leveldb {

 Cache::~Cache() {
@@ -35,7 +30,7 @@ struct LRUHandle {
   LRUHandle* prev;
   size_t charge;      // TODO(opt): Only allow uint32_t?
   size_t key_length;
-  volatile uint32_t refs;
+  uint32_t refs;
   uint32_t hash;      // Hash of key(); used for fast sharding and comparisons
   char key_data[1];   // Beginning of key

@@ -77,7 +72,6 @@ class HandleTable {
         Resize();
       }
     }
-    else old->next_hash=old;
     return old;
   }

@@ -87,8 +81,6 @@ class HandleTable {
     if (result != NULL) {
       *ptr = result->next_hash;
       --elems_;
-      // debug aid
-      result->next_hash=result;
     }
     return result;
   }
@@ -124,6 +116,7 @@ class HandleTable {
       LRUHandle* h = list_[i];
       while (h != NULL) {
         LRUHandle* next = h->next_hash;
+        Slice key = h->key();
         uint32_t hash = h->hash;
         LRUHandle** ptr = &new_list[hash & (new_length - 1)];
         h->next_hash = *ptr;
@@ -193,10 +186,9 @@ class LRUCache : public Cache {
   size_t capacity_;

   // mutex_ protects the following state.
-  port::RWMutex mutex_;
-
+  port::Mutex mutex_;
   size_t usage_;
-  volatile uint64_t last_id_;
+  uint64_t last_id_;

   // Dummy head of LRU list.
   // lru.prev is newest entry, lru.next is oldest entry.
@@ -217,7 +209,6 @@ LRUCache::~LRUCache() {
   for (LRUHandle* e = lru_.next; e != &lru_; ) {
     LRUHandle* next = e->next;
     assert(e->refs == 1);  // Error if caller has an unreleased handle
-    e->next_hash=e;  // mark as valid delete / deref
     Unref(e);
     e = next;
   }
@@ -227,29 +218,15 @@ void LRUCache::Unref(LRUHandle* e) {
   assert(e->refs > 0);
   e->refs--;
   if (e->refs <= 0) {
-
-    // sign that it is off the list
-    if (e->next_hash==e)
-    {
-        usage_ -= e->charge;
-        (*e->deleter)(e->key(), e->value);
-        free(e);
-    }   // if
-    else
-    {
-        // hack to keep object alive. was for atomic(ref++) bug.
-        //  left for Riak 1.2 safety
-        ++e->refs;
-    }   // else
+    usage_ -= e->charge;
+    (*e->deleter)(e->key(), e->value);
+    free(e);
   }
 }

 void LRUCache::LRU_Remove(LRUHandle* e) {
   e->next->prev = e->prev;
   e->prev->next = e->next;
-  // debug aid
-  e->next=NULL;
-  e->prev=NULL;
 }

 void LRUCache::LRU_Append(LRUHandle* e) {
@@ -261,31 +238,25 @@ void LRUCache::LRU_Append(LRUHandle* e) {
 }

 Cache::Handle* LRUCache::Lookup(const Slice& key, uint32_t hash) {
-  LRUHandle* e;
-  ReadLock l(&mutex_);
-
-  e = table_.Lookup(key, hash);
-
+  MutexLock l(&mutex_);
+  LRUHandle* e = table_.Lookup(key, hash);
   if (e != NULL) {
-#ifdef OS_SOLARIS
-    atomic_add_int(&e->refs, 1);
-#else
-    __sync_add_and_fetch(&e->refs, 1);
-#endif
-    // was ... e->refs++;
+    e->refs++;
+    LRU_Remove(e);
+    LRU_Append(e);
   }
-
   return reinterpret_cast<Cache::Handle*>(e);
 }

 void LRUCache::Release(Cache::Handle* handle) {
-  WriteLock l(&mutex_);
+  MutexLock l(&mutex_);
   Unref(reinterpret_cast<LRUHandle*>(handle));
 }

 Cache::Handle* LRUCache::Insert(
     const Slice& key, uint32_t hash, void* value, size_t charge,
     void (*deleter)(const Slice& key, void* value)) {
+  MutexLock l(&mutex_);

   LRUHandle* e = reinterpret_cast<LRUHandle*>(
       malloc(sizeof(LRUHandle)-1 + key.size()));
@@ -296,8 +267,6 @@ Cache::Handle* LRUCache::Insert(
   e->hash = hash;
   e->refs = 2;  // One from LRUCache, one for the returned handle
   memcpy(e->key_data, key.data(), key.size());
-
-  WriteLock l(&mutex_);
   LRU_Append(e);
   usage_ += charge;

@@ -307,41 +276,33 @@ Cache::Handle* LRUCache::Insert(
     Unref(old);
   }

-  bool one_removed;
-
-  one_removed=true;
-  while (usage_ > capacity_ && lru_.next != &lru_ && one_removed) {
-    LRUHandle * low_ptr, * cursor;
-
-    one_removed=false;
-
-    // sequential scan when over capacity is overall faster
-    //  than requiring write lock to rearrange LRU on every read
-    low_ptr=NULL;
-    for (cursor=lru_.next;
-         NULL==low_ptr && cursor!=&lru_;
-         cursor=cursor->next)
-    {
-        if (cursor->refs <= 1)
-            low_ptr=cursor;
-    }   // for
-    // removing item that still has active references is
-    //  quite harmful since the removal does not change
-    //  usage.  Result can be accidental flush of entire cache.
-    if (NULL!=low_ptr && &lru_!=low_ptr && low_ptr->refs <= 1)
-    {
-        one_removed=true;
-        LRU_Remove(low_ptr);
-        table_.Remove(low_ptr->key(), low_ptr->hash);
-        Unref(low_ptr);
-    }   // if
-  }   // while
+
+  // Riak - matthewv: code added to remove old only if it was not active.
+  //  Had scenarios where file cache would be largely or totally drained
+  //  because an active object does NOT reduce usage_ upon delete.  So
+  //  the previous while loop would basically delete everything.
+  LRUHandle * next, * cursor;
+
+  for (cursor=lru_.next; usage_ > capacity_ && cursor != &lru_; cursor=next)
+  {
+      // take next pointer before potentially destroying cursor
+      next=cursor->next;
+
+      // only delete cursor if it will actually destruct and
+      //   return value to usage_
+      if (cursor->refs <= 1)
+      {
+          LRU_Remove(cursor);
+          table_.Remove(cursor->key(), cursor->hash);
+          Unref(cursor);
+      }   // if
+  }   // for

   return reinterpret_cast<Cache::Handle*>(e);
 }

 void LRUCache::Erase(const Slice& key, uint32_t hash) {
-  WriteLock l(&mutex_);
+  MutexLock l(&mutex_);
   LRUHandle* e = table_.Remove(key, hash);
   if (e != NULL) {
     LRU_Remove(e);
@@ -356,7 +317,7 @@ class ShardedLRUCache : public Cache {
  private:
   LRUCache shard_[kNumShards];
   port::Mutex id_mutex_;
-  volatile uint64_t last_id_;
+  uint64_t last_id_;

   static inline uint32_t HashSlice(const Slice& s) {
     return Hash(s.data(), s.size(), 0);
@@ -396,12 +357,12 @@ class ShardedLRUCache : public Cache {
     return reinterpret_cast<LRUHandle*>(handle)->value;
   }
   virtual uint64_t NewId() {
-      return (++last_id_);
+    MutexLock l(&id_mutex_);
+    return ++(last_id_);
   }
   virtual size_t EntryOverheadSize() {return(sizeof(LRUHandle));};
 };

-
 }  // end anonymous namespace

 Cache* NewLRUCache(size_t capacity) {