
Checkpoint manager #341

Closed

Conversation


lmy1229 commented Apr 8, 2019

Project Design: Checkpoint & Recovery

Overview

This PR implements checkpointing, logging, and recovery. Checkpointing and recovery are new functionality, and the log manager received a major rewrite because the logging on master is too simplistic to work correctly. Since catalog support is still under development, we reverted all catalog-related commits to keep this PR clean. The logging infrastructure still uses DataTables to fit the existing APIs and should be migrated to SqlTable in the future.
To summarize, the main functionalities implemented are:

  • Making a consistent checkpoint of a table
  • Logging
  • Recovery from the latest checkpoint and logs
  • Varlen support in all components
  • Asynchronous writing of checkpoints
  • Handling arbitrarily large rows that overflow a checkpoint block
  • Handling crashes during the checkpoint or recovery phase
  • A SqlTable transaction test object for testing

Scope

Parts that our project will rely on:
  • storage::sql_table: the checkpoint manager will rely on the Scan interface provided by the SqlTable to scan and persist the contents of a table.
  • storage::log_manager: log recovery will rely on the current implementation of the write-ahead log. We do not intend to modify the current logging scheme.
  • settings: the checkpoint manager will get the settings from this component and set itself up accordingly.
  • catalog: We need to persist and recover the catalog first, during the checkpoint and recovery respectively, so that we can restore the schema of other tables.
Parts that our project will modify or create:
  • ADD: storage::checkpoint_manager: a new class added to support checkpointing and recovery.
  • ADD: storage::checkpoint_io: defines the BufferedTupleWriter class, which is used to buffer tuples and write data to disk one page at a time.
  • ADD: checkpoint_test: checkpoint testing.
  • Modify: storage::log_manager: recovery capability will be added to the log_manager.
  • Modify: storage_test_util: add functions for checkpoint testing

Architectural Design

The scanning and writing process is illustrated in the diagram here

The checkpoint_manager scans the table and passes each row to the BufferedTupleWriter class for serialization. The BufferedTupleWriter buffers the tuples into pages and asynchronously writes the pages out to the file with the AsyncBlockWriter, which uses a background thread to write blocks to the checkpoint file. Its design is very similar to that of the logger threads in SiloR. The AsyncBlockWriter owns a set of pre-allocated block buffers, and a worker must first ask for a free one. If no free block buffer is available, the worker blocks until one is; this back pressure slows a worker down if it generates blocks too quickly. The worker then fills the block buffer and hands it back to the writer thread, which writes it to the file and makes the buffer available again.
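As a concrete illustration of this hand-off protocol, here is a minimal C++ sketch; the class name, members, and use of std::FILE are assumptions for the sketch, not the PR's actual AsyncBlockWriter:

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>

// Illustrative sketch: a fixed pool of block buffers circulates between a
// worker (which fills them) and a writer thread (which persists them).
class AsyncBlockWriterSketch {
 public:
  AsyncBlockWriterSketch(size_t num_buffers, size_t block_size, std::FILE *out)
      : block_size_(block_size), out_(out) {
    for (size_t i = 0; i < num_buffers; i++) free_.push(new char[block_size]);
  }

  // Assumes all buffers have been recycled back to free_ before destruction.
  ~AsyncBlockWriterSketch() {
    while (!free_.empty()) { delete[] free_.front(); free_.pop(); }
  }

  // Worker side: blocks until a free buffer is available (the back pressure).
  char *AcquireBuffer() {
    std::unique_lock<std::mutex> lock(mutex_);
    free_available_.wait(lock, [this] { return !free_.empty(); });
    char *buf = free_.front();
    free_.pop();
    return buf;
  }

  // Worker side: hand a filled buffer over to the writer thread.
  void SubmitBuffer(char *buf) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      filled_.push(buf);
    }
    filled_available_.notify_one();
  }

  // Writer-thread side: persist one filled block and recycle its buffer.
  // A real implementation would loop over this until shutdown.
  void WriteOneBlock() {
    char *buf;
    {
      std::unique_lock<std::mutex> lock(mutex_);
      filled_available_.wait(lock, [this] { return !filled_.empty(); });
      buf = filled_.front();
      filled_.pop();
    }
    std::fwrite(buf, 1, block_size_, out_);  // always a full, fixed-size block
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push(buf);
    }
    free_available_.notify_one();
  }

 private:
  size_t block_size_;
  std::FILE *out_;
  std::mutex mutex_;
  std::condition_variable free_available_, filled_available_;
  std::queue<char *> free_, filled_;
};
```

The key property is that a fixed number of buffers circulate: a worker that outruns the writer thread eventually finds the free queue empty and stalls in AcquireBuffer, which is exactly the back pressure described above.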

The layout of a checkpoint block is shown in the diagram here.

When the current block cannot fit the next tuple, the page is persisted to the file and that tuple is written to the next block.
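For illustration only, the header of such a block might look like the sketch below; the field names are assumptions (the PR's actual header class, CheckpointFilePage, appears later in the review discussion):

```cpp
#include <cstdint>

// Illustrative sketch of a fixed-size checkpoint block: a small header
// followed by packed tuples, padded out to the block size.
struct CheckpointBlockHeaderSketch {
  uint32_t checksum_;   // detects torn or corrupt blocks during recovery
  uint32_t table_oid_;  // which table the tuples in this block belong to
  uint32_t flags_;      // reserved, e.g. a future "split row" bit
  // serialized tuples follow, up to the fixed block size
};
```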

Design Rationale

Checkpoint:

Handling crashes during checkpointing:

  • Use another metadata file to mark completion: would need an extra file.
  • Rename the checkpoint file once finished: easy to handle. We adopted this option (sketched below).
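A minimal sketch of the rename-on-completion idea, assuming POSIX rename semantics; the function and file names are illustrative:

```cpp
#include <cstdio>
#include <string>

// Write the checkpoint under a temporary name, then atomically rename it.
// Recovery only looks for completed names, so a crash mid-checkpoint leaves
// at most a stale temporary file that can simply be discarded.
bool FinalizeCheckpoint(const std::string &tmp_path, const std::string &final_path) {
  // std::rename is atomic on POSIX when both paths are on the same
  // filesystem, which is what makes the scheme crash-safe.
  return std::rename(tmp_path.c_str(), final_path.c_str()) == 0;
}
```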

Saving tuples as columns vs. rows:

  • Rows: easy to implement recovery using the existing SqlTable API; slower to checkpoint, because we must convert columns into rows.
  • Columns: easy to checkpoint, by writing ProjectedColumns directly into a file. However, with varlens it is hard to choose the number of tuples that fits exactly into one page, and recovery is also hard because the SqlTable currently doesn't support bulk inserts.

Where to store varlens in checkpoints:

  • Store in a separate file: easy to checkpoint and recover; however, writing the checkpoint to two files cannot guarantee atomicity.
  • Inlined in each row: saves the most storage space, but hard to checkpoint and recover, because we cannot write the in-memory representation of a ProjectedRow directly into the file.
  • Appended after each row: easy to checkpoint and recover, with no significant disadvantages (sketched below).
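A sketch of this append-the-varlens format; the helper signature and names are assumptions, not the PR's code:

```cpp
#include <cstdint>
#include <cstring>

// Serialize one tuple as [fixed-size row bytes][varlen contents...].
// On recovery, the varlen pointers inside the row are rebuilt to point at
// the bytes that follow it, so one contiguous read restores the tuple.
char *SerializeRowWithVarlens(char *dst, const char *row, uint32_t row_size,
                              const char *const *varlens,
                              const uint32_t *varlen_sizes, uint32_t num_varlens) {
  std::memcpy(dst, row, row_size);
  dst += row_size;
  for (uint32_t i = 0; i < num_varlens; i++) {
    std::memcpy(dst, varlens[i], varlen_sizes[i]);
    dst += varlen_sizes[i];
  }
  return dst;  // next free byte in the block buffer
}
```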

Tuning AsyncBlockWriter:

  • Which block size should we use?
  • If there is no free block buffer, should the BufferedTupleWriter be blocked or just spin?

We benchmarked these choices:

| Configuration | Throughput (M rows/s) |
| ------------- | --------------------- |
| 4K, spin      | 4.77                  |
| 512K, spin    | 4.99                  |
| 4K, block     | 3.85                  |
| 512K, block   | 5.00                  |

We ultimately chose the 512K + block combination.

Testing Plan

We created a new SqlTableTestObject class to build tables and run transactions at the SqlTable level to test our code. We run the GC thread and the logging thread alongside these transactions to ensure our implementation works under realistic conditions.

  • Correctness is tested by comparing the scan result of a table before and after recovery.

Trade-offs and Potential Problems

The main trade-off in this project is between performance during checkpointing and performance during recovery. We prioritize checkpointing, so our design decisions lean toward faster checkpoint runtime.

Future Work

The major parts that are blocked by other components are:

  1. After SqlTables are used in transactions, change the logs to record operations on SqlTables.
  2. Implement catalog checkpointing and recovery. The catalog is also needed to recover schema changes, which are identified by replaying operations on the catalog tables.
  3. Once the above two parts are finished, we will be able to recover all tables with their real oids, instead of a single table whose oid is currently fixed to 0.

There is some redundancy in the testing code that can be consolidated:

  1. The RandomSqlTableTestObject is almost a subset of SqlLargeTransactionTestObject, so we could consider merging them.
  2. We used PrintAllRows() to compare tables; this functionality overlaps with StorageTestUtil::ProjectionListEqualShallow/StorageTestUtil::ProjectionListEqualDeep.

Possible performance optimizations in the future include:

  1. Morsel-based multi-threaded checkpointing.
  2. Bulk checkpointing and recovery without converting the column store into a row view.
  3. Dynamic allocation of checkpointer and logger threads.
  4. Newest-to-oldest log replay.
@songzhaozhe songzhaozhe force-pushed the advanced-database-group:checkpoint_manager branch from f4d5979 to f559b93 Apr 8, 2019
codecov bot commented Apr 9, 2019

Codecov Report

Merging #341 into master will decrease coverage by 0.04%.
The diff coverage is 90.65%.


@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
- Coverage   87.62%   87.57%   -0.05%     
==========================================
  Files         212      217       +5     
  Lines        8257     8546     +289     
==========================================
+ Hits         7235     7484     +249     
- Misses       1022     1062      +40
Impacted Files Coverage Δ
src/include/storage/projected_row.h 100% <ø> (ø) ⬆️
src/include/storage/storage_defs.h 94.28% <ø> (ø) ⬆️
src/include/storage/write_ahead_log/log_record.h 96.36% <ø> (-0.36%) ⬇️
src/include/storage/write_ahead_log/log_io.h 88.57% <ø> (ø) ⬆️
src/include/storage/write_ahead_log/log_manager.h 100% <ø> (ø) ⬆️
src/storage/checkpoint_io.cpp 100% <100%> (ø)
src/include/storage/data_table.h 100% <100%> (ø) ⬆️
src/include/catalog/schema.h 100% <100%> (ø) ⬆️
src/include/storage/sql_table.h 100% <100%> (ø) ⬆️
src/storage/sql_table.cpp 96.36% <100%> (+0.44%) ⬆️
... and 27 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be93dca...60c86af.

pervazea added 5 commits Apr 9, 2019
Review thread on src/include/storage/checkpoint_io.h (outdated):
```cpp
int32_t row_size = row_buffer->Size();
int32_t tot_size = total_varlen + row_size;
// Flush buffer first if the tuple cannot fit in buffer
if (cur_buffer_size_ + tot_size > block_size_) {
```

jrolli (Contributor) commented Apr 10, 2019:

There is a major logic problem here: a table can have up to 65,535 columns. Ignoring varlens, a tuple could therefore be 512 KB (2^16 columns at 8 bytes apiece). With varlens you cannot make any guarantee that a materialized tuple is smaller than your block.

lmy1229 (Author) commented Apr 15, 2019:

Currently, we assume the size of a row is less than 4K bytes. We talked about this with @apavlo and he said we can start with this assumption. As a future extension, we plan to split the data of a row across several pages and use a flag bit in the page header to identify split rows.

```cpp
  PersistBuffer();
}
// ASSUME that the row can always fit in the block
std::memcpy(buffer_ + cur_buffer_size_, row_buffer, row_size);
```

jrolli (Contributor) commented Apr 10, 2019:

Buffer overflow vulnerability! (See comment on line 42)

lmy1229 (Author) commented Apr 15, 2019:

See #341 (comment). For now, we need to add an assertion on the size of a row; this won't be a problem once row splitting is supported.
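For illustration, the assertion mentioned here could look like the sketch below; plain assert stands in for the project's assertion macro, and the names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Until row splitting is implemented, reject any materialized tuple
// (row plus its varlen contents) larger than one checkpoint block.
inline void AssertRowFitsInBlock(uint32_t tot_size, uint32_t block_size) {
  assert(tot_size <= block_size && "row splitting across blocks is not yet supported");
}
```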

```cpp
  // If the buffer has no contents, just return
  return;
}
PosixIoWrappers::WriteFully(out_, buffer_, block_size_);
```

jrolli (Contributor) commented Apr 10, 2019:

This is likely going to have the same performance problems as discussed in #329. Can you split this into an asynchronous write and pipelined buffers? (i.e. a buffer being prepared and a buffer being persisted)

codeworm96 (Contributor) commented Apr 10, 2019:

We have already planned asynchronous writing for checkpoints as part of our 100% goal. The synchronous write is only for our 75% goal, and it is temporary; we will have an async implementation soon.

```cpp
#include <vector>

namespace terrier::storage {

void BufferedTupleWriter::SerializeTuple(ProjectedColumns::RowView *row, ProjectedRow *row_buffer,
```

yash620 commented Apr 10, 2019:

Here you guys are trying to materialize every single tuple in the table, which basically reimplements the SqlTable::Select code. So at this point, why not just grab the SqlTable's SlotIterator and dereference it to get a tuple slot, then call SqlTable::Select on that slot? SqlTable::Select will materialize the ProjectedRow you want, and you can use that here instead of reconstructing the entire ProjectedRow from scratch after already calling Scan.

yash620 commented Apr 10, 2019:

Also, this isn't how RowView was envisioned to be used. It is for use inside the storage layer as a way of templating between ProjectedRows and ProjectedColumns. Ideally you shouldn't need to access the BlockLayout (which shouldn't be exposed outside of the DataTable) to get the RowView.

yash620 commented Apr 11, 2019:

After some discussion, it looks like other people are using ProjectedColumns for purposes like yours as well (#260 (comment)). So by using Scan you wouldn't have to reason about the visibility of tuples, since the SqlTable would handle that, making the process easier. To ensure you don't need the BlockLayout, we will try to make it possible to generate the RowView from a ProjectedColumn without needing the BlockLayout.

lmy1229 (Author) commented Apr 16, 2019:

We abandoned Scan and switched to Select in the latest version, and we no longer require the BlockLayout. The problem becomes how to get the schema of a table, because we need the Oids of all its columns. Since there is no catalog yet, is it reasonable to add an API to SqlTable that returns its schema?

jrolli (Contributor) commented Apr 16, 2019:

I would recommend against this because it would lead to more work in the end. Instead, I would recommend either cherry-picking the API/files you need from the Catalog PR or laying out the API you need from the catalog and mocking it out in your test fixture so that it always provides some hard-coded or programmatically generated schema.
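A sketch of the kind of mock jrolli suggests; the types and names here are hypothetical simplifications, not the real catalog API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A hard-coded, programmatically generated schema standing in for the
// catalog until it lands.
struct MockColumn {
  std::string name;
  uint32_t oid;       // stands in for the catalog's column oid type
  uint8_t attr_size;  // fixed-size attribute width in bytes
};

std::vector<MockColumn> MockSchema(uint32_t num_columns) {
  std::vector<MockColumn> cols;
  for (uint32_t i = 0; i < num_columns; i++) {
    cols.push_back({"col" + std::to_string(i), /*oid=*/1000 + i, /*attr_size=*/8});
  }
  return cols;
}
```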

Two resolved review threads on src/storage/checkpoint_io.cpp (outdated).
```cpp
int32_t tot_size = total_varlen + row_size;
// Flush buffer first if the tuple cannot fit in buffer
if (cur_buffer_size_ + tot_size > block_size_) {
  PersistBuffer();
```

yash620 commented Apr 10, 2019:

This is blocking. Can't you get a new buffer and add to that while this one is asynchronously persisting in the background?


```cpp
/**
 * A CheckpointManager is responsible for serializing the tuples of all tables in the database
 * and divides time into epochs by checkpointing.
 */
class CheckpointManager {
```

jrolli (Contributor) commented Apr 10, 2019:

I'm not sure I understand how you envision the API being used. The class is called CheckpointManager but it appears that all of the management (scheduling of next checkpoint, starting and stopping checkpoint-related transactions, tracking of tables to persist, etc.) is expected to occur somewhere else. Semantically, I would expect the checkpoint manager to have a single entry point that manages the lifecycle of taking a checkpoint and getter/setter functions for adjusting its parameters (tables being persisted, checkpoint frequency, etc.). Specifically, it should be expecting access to threads and the worker pool to manage these tasks internally rather than expecting another component to do this for you.

lmy1229 (Author) commented Apr 16, 2019:

You are correct. The checkpoint manager should 'manage' the process of checkpointing. However, we decided to first design the basic APIs for checkpoint and recovery, simply because there is no working catalog in the system. When there is a working catalog, a single call to StartCheckpoint() should first persist the catalog to the file. Then, for each user table, it will fetch its schema from the catalog and persist the table automatically.
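For illustration, the single-entry-point shape discussed in this thread might look like the sketch below; the names, parameters, and default interval are all assumptions:

```cpp
#include <cstdint>

// Illustrative interface only: one lifecycle entry point plus getters and
// setters for its parameters, with threading handled internally.
class CheckpointManagerSketch {
 public:
  void RunCheckpoint() {
    // 1. begin a transaction for a consistent snapshot
    // 2. persist the catalog first, then each user table per its schema
    // 3. atomically rename the checkpoint file to mark completion
  }
  void SetIntervalSeconds(uint32_t s) { interval_s_ = s; }
  uint32_t GetIntervalSeconds() const { return interval_s_; }

 private:
  uint32_t interval_s_ = 600;  // hypothetical default, configurable via settings
};
```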

```cpp
 */
void StartCheckpoint(transaction::TransactionContext *txn) {
  txn_ = txn;
  out_.Open((GetCheckpointFilePath(txn)).c_str());
```

yash620 commented Apr 10, 2019:

This might be a good place to call Checkpoint on each SqlTable, since this is where the checkpoint is starting.

lmy1229 (Author) commented Apr 16, 2019:

Yes, this is the ultimate goal. However, we do not know anything about existing tables, because there is no working catalog yet.

jrolli (Contributor) commented Apr 10, 2019:

What is your plan for ensuring the catalog is persisted in such a way that it is the first set of tables recovered during a recovery? Specifically, your logic for Recover seems to assume that you already have all of the SqlTable OIDs and associated data loaded into the checkpoint manager, but Recover is also responsible for reading and reconstructing this from the checkpoint file.

```cpp
// TODO(zhaozhe): check checksum here
catalog::table_oid_t oid = page->GetTableOid();
SqlTable *table = GetTable(oid);
BlockLayout *layout = GetLayout(oid);
```

yash620 commented Apr 10, 2019:

How are you getting this? If you are recovering and you aren't persisting this data, where is it coming from?

lmy1229 (Author) commented Apr 16, 2019:

We do not use BlockLayouts anymore, but we still need to get the schema of a table anyway. This information should come from the catalog, which we will recover first in the recovery process.

```cpp
/**
 * @return the data table used by this test
 */
storage::DataTable *GetTable() { return &table_; }
```

jrolli (Contributor) commented Apr 10, 2019:

You don't appear to use this, and you should not use this even in testing. If there is something at the DataTable level that you need to interact with, even if it is for testing, then there is an API problem with SqlTable. Recommend removing.

```cpp
/**
 * The header of a page in the checkpoint file.
 */
// TODO(Zhaozhe, Mengyang): More fields can be added to header
class CheckpointFilePage {
```

jrolli (Contributor) commented Apr 10, 2019:

If you're going to count on the data getting laid out as expected, you may need to use the PACKED directive as is used for ProjectedRow (see include/storage/projected_row.h).

songzhaozhe commented Apr 15, 2019:

Yes. We removed PACKED temporarily because of a compilation error: catalog::table_oid_t is a non-POD type. We might have to cast catalog::table_oid_t to a POD type first to allow using PACKED.
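A sketch of that workaround, assuming catalog::table_oid_t wraps a uint32_t; the field names and the GCC/Clang packed attribute are illustrative:

```cpp
#include <cstdint>

// Keep only POD fields in the packed header: store the oid as its raw
// underlying integer and convert to/from catalog::table_oid_t at the
// boundaries of the serialization code.
struct __attribute__((packed)) PackedCheckpointHeaderSketch {
  uint32_t checksum_;
  uint32_t table_oid_;  // raw value of the strong-typedef'd table oid
};

static_assert(sizeof(PackedCheckpointHeaderSketch) == 8, "no padding expected");
```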

```cpp
// NOLINTNEXTLINE
TEST_F(CheckpointTests, SimpleCheckpointNoVarlen) {
  const uint32_t num_rows = 100;
  const uint32_t num_columns = 3;
```

jrolli (Contributor) commented Apr 10, 2019:

You need a significantly larger number of tests. Specifically, you should test both 1000s of columns and 1000s of rows because your current test fits completely inside one 4 kb "page". @yash620 identified that you are probably not resetting your page offset between "blocks", but your tests miss this because you never advance to a new block/page.
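For illustration, a rough sizing calculation under assumed row footprints (all numbers here are assumptions, not measurements from the PR):

```cpp
#include <cstdint>

// Three 8-byte columns plus per-row overhead keep 100 rows well inside a
// single 4 KB block, so such a test never crosses a block boundary. To
// exercise the boundary logic, size the test to span several blocks:
constexpr uint32_t kBlockSize = 4096;
constexpr uint32_t kAssumedRowBytes = 32;                          // row + metadata estimate
constexpr uint32_t kRowsPerBlock = kBlockSize / kAssumedRowBytes;  // ~128 rows
constexpr uint32_t kNumRows = kRowsPerBlock * 10;                  // forces ~10 block flushes
```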

```cpp
for (int i = 0; i < static_cast<int>(cols_.size()); i++) {
  if (coin(*generator)) {  // not null
    uint16_t offset = pr_map_->at(col_oids_[i]);
    insert->SetNotNull(offset);
```

yash620 commented Apr 10, 2019:

You don't need SetNotNull; the AccessForceNotNull you do below already does that.

@codeworm96 codeworm96 force-pushed the advanced-database-group:checkpoint_manager branch from 218bb8e to 254a34a May 14, 2019
@codeworm96 codeworm96 force-pushed the advanced-database-group:checkpoint_manager branch from 254a34a to 2e9833f May 14, 2019
@codeworm96 codeworm96 force-pushed the advanced-database-group:checkpoint_manager branch from 2e9833f to b608eb0 May 14, 2019
@codeworm96 codeworm96 force-pushed the advanced-database-group:checkpoint_manager branch from c68e064 to 94154c1 May 14, 2019
@codeworm96 codeworm96 mentioned this pull request May 14, 2019
lmy1229 added 2 commits May 14, 2019
@songzhaozhe songzhaozhe requested a review from jrolli May 15, 2019
codeworm96 added 2 commits May 18, 2019