row_cache: Fix undefined behavior on key linearization
This is relevant only for partition or clustering keys whose in-memory
representation is larger than 12.8 KB (10% of the LSA segment size).

There are several places in the code (cache, background garbage
collection) which may need to linearize keys in order to perform key
comparisons, but this is not done safely:

 1) The code does not run with the LSA region locked, so pointers may
be invalidated if linearization needs to reclaim memory. This is fixed
by running the code inside an allocating section.

 2) The LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If the
allocating section needs to reclaim, the linearization context will
contain invalidated pointers. The fix is to reorder the scopes so that
the linearization context lives within the allocating section (see the
sketch below).
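
In both cases the safe pattern is the same: enter the allocating
section first, and create the linearization context inside it, so that
a reclaim performed by the section cannot leave the context holding
dangling pointers. A minimal sketch of the intended nesting
(illustrative only; region, section, s, key_a and key_b are
placeholders, not code from this patch):

  logalloc::allocating_section section;
  auto equal = section(region, [&] {
      // The section establishes memory reserves (reclaiming if needed)
      // before running this lambda and retries it on allocation
      // failure, so buffers created by the linearization context below
      // are never left dangling.
      return with_linearized_managed_bytes([&] {
          return key_a.equal(s, key_b); // comparison may linearize keys
      });
  });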

An example of 1) can be found in
range_populating_reader::handle_end_of_stream(), which performs a
lookup:

  auto prev = std::prev(it);
  if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
     it->set_continuous(true);

but handle_end_of_stream() is not invoked under an allocating section.
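
The fix (visible in the row_cache.cc hunk below) wraps the end-of-stream
handling in the cache's read section, with the linearization context
nested inside it:

  return _cache._read_section(_cache._tracker.region(), [&] {
      return with_linearized_managed_bytes([&] {
          this->handle_end_of_stream();
          return make_ready_future<read_result>(read_result(std::nullopt, std::nullopt));
      });
  });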

An example of 2) can be found in mutation_cleaner_impl::merge_some(),
which does:

  return with_linearized_managed_bytes([&] {
  ...
    return _worker_state->alloc_section(region, [&] {
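
After the fix (see the mutation_partition.cc hunk below), the scopes
are swapped so that the linearization context is created inside the
allocating section:

  return _worker_state->alloc_section(region, [&] {
      return with_linearized_managed_bytes([&] {
          return snp.merge_partition_versions(_app_stats);
      });
  });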

Fixes scylladb#6637.
Refs scylladb#6108.

Tests:

  - unit (all)

Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
tgrabiec authored and avikivity committed Jun 15, 2020
1 parent 86a4dfc commit e81fc1f
Showing 2 changed files with 18 additions and 8 deletions.
mutation_partition.cc: 6 changes (4 additions, 2 deletions)
@@ -2605,7 +2605,7 @@ void mutation_cleaner_impl::start_worker() {
 stop_iteration mutation_cleaner_impl::merge_some(partition_snapshot& snp) noexcept {
     auto&& region = snp.region();
     return with_allocator(region.allocator(), [&] {
-        return with_linearized_managed_bytes([&] {
+        {
             // Allocating sections require the region to be reclaimable
             // which means that they cannot be nested.
             // It is, however, possible, that if the snapshot is taken
@@ -2617,13 +2617,15 @@ stop_iteration mutation_cleaner_impl::merge_some(partition_snapshot& snp) noexcept {
         }
         try {
             return _worker_state->alloc_section(region, [&] {
+                return with_linearized_managed_bytes([&] {
                     return snp.merge_partition_versions(_app_stats);
+                });
             });
         } catch (...) {
             // Merging failed, give up as there is no guarantee of forward progress.
             return stop_iteration::yes;
         }
-        });
+        }
     });
 }

row_cache.cc: 20 changes (14 additions, 6 deletions)
@@ -530,8 +530,12 @@ class range_populating_reader {
     return _reader.move_to_next_partition(timeout).then([this] (auto&& mfopt) mutable {
         {
             if (!mfopt) {
-                this->handle_end_of_stream();
-                return make_ready_future<read_result>(read_result(std::nullopt, std::nullopt));
+                return _cache._read_section(_cache._tracker.region(), [&] {
+                    return with_linearized_managed_bytes([&] {
+                        this->handle_end_of_stream();
+                        return make_ready_future<read_result>(read_result(std::nullopt, std::nullopt));
+                    });
+                });
             }
             _cache.on_partition_miss();
             const partition_start& ps = mfopt->as_partition_start();
@@ -956,13 +960,15 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
             // expensive and we need to amortize it somehow.
             do {
                 STAP_PROBE(scylla, row_cache_update_partition_start);
-                with_linearized_managed_bytes([&] {
+                {
                     if (!update) {
                         _update_section(_tracker.region(), [&] {
+                            with_linearized_managed_bytes([&] {
                             memtable_entry& mem_e = *m.partitions.begin();
                             size_entry = mem_e.size_in_allocator_without_rows(_tracker.allocator());
                             auto cache_i = _partitions.lower_bound(mem_e.key(), cmp);
                             update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc);
+                            });
                         });
                     }
                     // We use cooperative deferring instead of futures so that
Expand All @@ -974,14 +980,16 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
                     update = {};
                     real_dirty_acc.unpin_memory(size_entry);
                     _update_section(_tracker.region(), [&] {
+                        with_linearized_managed_bytes([&] {
                         auto i = m.partitions.begin();
                         memtable_entry& mem_e = *i;
                         m.partitions.erase(i);
                         mem_e.partition().evict(_tracker.memtable_cleaner());
                         current_allocator().destroy(&mem_e);
+                        });
                     });
                     ++partition_count;
-                });
+                }
                 STAP_PROBE(scylla, row_cache_update_partition_end);
             } while (!m.partitions.empty() && !need_preempt());
             with_allocator(standard_allocator(), [&] {
@@ -1128,8 +1136,8 @@ future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&
     seastar::thread::maybe_yield();

     while (true) {
-        auto done = with_linearized_managed_bytes([&] {
-            return _update_section(_tracker.region(), [&] {
+        auto done = _update_section(_tracker.region(), [&] {
+            return with_linearized_managed_bytes([&] {
                 auto cmp = cache_entry::compare(_schema);
                 auto it = _partitions.lower_bound(*_prev_snapshot_pos, cmp);
                 auto end = _partitions.lower_bound(dht::ring_position_view::for_range_end(range), cmp);