Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fields are missing in metadata when restarting #49

Closed
tangliisu opened this issue Sep 10, 2021 · 6 comments
Closed

fields are missing in metadata when restarting #49

tangliisu opened this issue Sep 10, 2021 · 6 comments

Comments

@tangliisu
Copy link
Contributor

When i enable the persistent cache by following the wiki, i meet the following error message
Memory Pool Manager can not be restored, nextPoolId is not set
when trying to attach the old cache instance
cache = std::make_unique<Cache>(Cache::SharedMemAttach, config);
Some more info:

  1. when i shut down cachelib,
    auto res = cache.shutDown();
    the metadata is successfully saved since
    res == Cache::ShutDownStatus::kSuccess
  2. I check the metadata file (name is "metadata") after shutting down. The file is not empty and it contains all of keys "shm_cache, shm_chained_alloc_hash_table, shm_hash_table, shm_info" followed by the serialized iobuf.
  3. when i restart cachelib, before creating a cache instance by
    cache = std::make_unique<Cache>(Cache::SharedMemAttach, config);
    I checked the metadata file again. It is not empty (the size is 149B)
  4. The shared memory size is consistent before/after we shut down the cachelib so the shared memory is not released.

It seems nextPoolId is not serialized into metadata. Do you have any suggestion here?

@therealgymmy
Copy link
Contributor

Can you share a snippet of your code on initially creating the cache, and then shutdown, and then recover?

We have not seen the issue you described here. The logic throwing is this:

  // Check if nextPoolid is restored properly or not. If not restored,
  // throw error
  if (!object.nextPoolId_ref().is_set()) {
    throw std::logic_error(
        "Memory Pool Manager can not be restored,"
        " nextPoolId is not set");
  }

https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/memory/MemoryPoolManager.cpp#L42

But on shutdown, if we have successfully shut down, we must have serialized the "nextPoolId" and stored in thrift structure.

object.set_nextPoolId(nextPoolId_);

https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/memory/MemoryPoolManager.cpp#L168

So it's quite odd that on recovery you're hitting an error that complains this thrift field is not set. Can you add some debug code before throwing to print out the value of object.nextPoolId_ref().value()?

@tangliisu
Copy link
Contributor Author

i agree it is very odd after checking the code.
Here is the code

creating cache and recover

static std::unique_ptr<Cache> CreatePersistentCache(const size_t cache_size,
                                                    const std::string& cache_name,
                                                    const std::string& cache_metadata_directory,
                                                    const int32_t buckets_power,
                                                    const int32_t locks_power) {
  try {
    bool is_metadata_exist = boost::filesystem::exists(METADATA_PATH);
    LOG(INFO) << folly::sformat("The metadata file exists before cache created?: {}.",
                                is_metadata_exist);
    if (is_metadata_exist) {
      LOG(INFO) << folly::sformat("The size of file is: {}.",
                                  boost::filesystem::file_size(METADATA_PATH));
    }

    // Cache config
    Cache::Config config;
    config.setCacheSize(cache_size)
        .setCacheName(cache_name)
        .setAccessConfig({buckets_power, locks_power})
        .enableCachePersistence(cache_metadata_directory)
        .validate();

    LOG(INFO) << folly::sformat("Cachelib cache_size in config is: {}", cache_size);
    LOG(INFO) << folly::sformat("Cachelib allocationClassFSizeFactor in config is: {}",
                                config.allocationClassSizeFactor);
    LOG(INFO) << folly::sformat(
        "Cachelib buckets_power is: {}, locks_power is: {}", buckets_power, locks_power);

    // Create cache
    std::unique_ptr<Cache> cache;
    try {
      cache = std::make_unique<Cache>(Cache::SharedMemAttach, config);
      // Cache is now restored
    } catch (const std::exception& ex) {
      // Failed to attach the cache. Create a new one but make sure that
      // the old cache is destroyed before creating a new one.
      // This allows us to release any held resources (such as
      // open file descriptors and associated fcntl locks).
      cache.reset();
      LOG(ERROR) << "Failed to attach the old cache when restarting: " << ex.what();
      cache = std::make_unique<Cache>(Cache::SharedMemNew, config);
    }
    return cache;
  } catch (const std::exception& ex) {
    LOG(ERROR) << "Error in persistent cache creation: " << ex.what();
    throw;
  }
}

shutdown

void CachelibCacheHandler::ShutDown() {
  if (!FLAGS_enable_persistent_cachelib) {
    return;
  }
  bool is_metadata_exist = boost::filesystem::exists(METADATA_PATH);
  LOG(INFO) << folly::sformat("The metadata file exists before cache shutdown?: {}.",
                              is_metadata_exist);
  if (is_metadata_exist) {
    LOG(INFO) << folly::sformat("The size of file is: {}.",
                                boost::filesystem::file_size(METADATA_PATH));
  }
  auto status = cache_->shutDown();

  is_metadata_exist = boost::filesystem::exists(METADATA_PATH);
  LOG(INFO) << folly::sformat("The metadata file exists after cache shutdown?: {}.",
                              is_metadata_exist);
  if (is_metadata_exist) {
    LOG(INFO) << folly::sformat("The size of file is: {}.",
                                boost::filesystem::file_size(METADATA_PATH));
  }

  switch (status) {
    case Cache::ShutDownStatus::kSuccess:
      LOG(INFO) << folly::sformat(
          "Cache is successfully shut down and metadata is recorded in directory {}",
          FLAGS_cache_metadata_directory);
      break;
    case Cache::ShutDownStatus::kFailed:
      LOG(ERROR) << folly::sformat("Failed to persist the cache metadata in directory {}",
                                   FLAGS_cache_metadata_directory);
      break;
    default:
      LOG(ERROR) << folly::sformat("Cache metadata is partitally recored in directory {}",
                                   FLAGS_cache_metadata_directory);
  }
}

@sathyaphoenix
Copy link
Contributor

@tangliisu few questions about the occurrence of the issue.

  1. how repeatable is this ? is it consistent to repro ?
  2. are the binaries the same between the shutdown and recovery ? Any thrift changes or cachelib changes between the two ?

@tangliisu
Copy link
Contributor Author

@sathyaphoenix

  1. it is consistent to repro
  2. the binaries are same and no thrift changes
    The way i test the persistent cache is I just restart the test cluster. The thrift/cachelib/binaries are exactly same.

One thing that might be the cause is we use an old version of fbthrift which does not support accessing the thrift members through methods. So I just change the method _ref() in cachelib codebase to the thriftfields. On the changes related to nextPoolId, I change
if (!object.nextPoolId_ref().is_set()) to object.__isset.nextPoolId
*object.nextPoolId_ref() to object.nextPoolId
nextPoolId_(*object.nextPoolId_ref()) to nextPoolId_(object.nextPoolId)
Do you think it is might be the root cause?

@tangliisu
Copy link
Contributor Author

nvm i figured it out. it is a stupid typo. I change if (!object.nextPoolId_ref().is_set()) to if(object.__isset.nextPoolId). sorry for bothering you on this stupid typo :(.

@tangliisu
Copy link
Contributor Author

close the ticket

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants