Skip to content
This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Lazy Schema Change #342

Closed
wants to merge 190 commits into from
Closed

Lazy Schema Change #342

wants to merge 190 commits into from

Conversation

yashNaN
Copy link

@yashNaN yashNaN commented Apr 8, 2019

For the design overview please take a look at the DESIGN.md file. Most of our modifications are localized to sql_table.cpp and sql_table.h . We have added tests in sql_table_test.cpp and catalog_test_util.h, and we have benchmarks in sql_table_benchmark.cpp

storage::TupleSlot row_slot = table.EndRowAndInsert();
table.StartInsertRow();
table.SetIntColInRow(catalog::col_oid_t(0), 100);
table.SetVarcharColInRow(catalog::col_oid_t(1), "name");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also use longer varlen values to test for varlen entries which are not inlined.

TERRIER_ASSERT(attr_size_ == VARLEN_COLUMN, "This constructor is meant for VARLEN columns.");
TERRIER_ASSERT(type_ != type::TypeId::INVALID, "Attribute type cannot be INVALID.");
SetDefault(default_value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is problematic. The most significant bit of VARLEN_COLUMN is set to 1. You should not use that directly as the size for memcpy in SetDefault, which will cause scary memory issues.

@@ -38,62 +39,134 @@ class SqlTableRW {
cols_.emplace_back(name, type, 255, nullable, oid);
}

void AddColumn(transaction::TransactionContext *txn, std::string name, type::TypeId type, bool nullable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AddColumn does not support adding varlen columns. You should also test adding varlen columns to make sure your code is handling varlens correctly.

const uint32_t num_threads = MultiThreadTestUtil::HardwareConcurrency();
common::WorkerPool thread_pool(num_threads, {});

for (uint32_t iteration = 0; iteration < num_iterations; iteration++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you do not need to repeat the test by yourself. googletest already provides the --gtest_repeat flag and the GTEST_REPEAT environment variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using the same setup that was already used for the DataTable tests. The repetitions here are useful for increasing the likelihood that CI will error out on race conditions in the concurrent tests without having to make deep modifications to the CMake build scripts.

return pr;
}

// It populates a projected row with random bytes. The schema must not contain VarChar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd better have some assertions to ensure the schema does not contain VarChar.

template <typename Random>
static void PopulateRandomRow(storage::ProjectedRow *row, const catalog::Schema &schema,
const storage::ProjectionMap &pr_map, Random *const generator) {
for (auto &it : pr_map) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This algorithm has low efficiency. Why not let the outer loop iterate over the columns and then look up the pr_map?

for (const auto &column : schema.GetColumns()) {
auto col_oid = column.GetOid();
auto *default_value = column.GetDefault();
// Only populate the default values of the columns which are new and have a default value
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for the user to change the default value of a column? Or else it is treated as a new column (with a different OID)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible for a user to change the default value of a column. But what is happening here is that if a new column is added and a default value is set for that column on creation then it all tuples within the table should receive that default value for that column. But if a default value is set on an already created column then that default value is only applied to tuples inserted from that point on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The storage engine only needs to reason about default values that are applied at the time the column is created as these are retroactive (applied to all existing data). Default value changes at any other time would only affect new inserts and have to be handled by the catalog, execution engine, and binder because we are not in a position at this level to reason about that granularity of visibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that part deserves better documentation. The user of SqlTable needs to know for default values which part is handled by SqlTable and which part they should take care of.


// The slot version is not the same as the version_num
col_id_t original_column_ids[out_buffer->NumColumns()];
ModifyProjectionHeaderForVersion(out_buffer, tables_.Find(version_num)->second, old_dt_version, original_column_ids);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tables_.Find(version_num)->second is repeated several times here, and curr_dt_version is unused. Are you doing this for some concurrent issues?


## Future Work
### Pending tasks
#### Default values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated? You have done some work about default values.

auto next_table = tables_->Find(curr_version_);
if (next_table == tables_->CEnd()) { // next_table does not exist (at end)
is_end_ = true;
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break the loop with is_end_ set to be true. But immediately after that the assertion checks !is_end_, which is confusing.

*/
catalog::table_oid_t Oid() const { return oid_; }
SlotIterator begin(layout_version_t txn_version) const {
// common::SpinLatch::ScopedSpinLatch guard(&tables_latch_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deadcode. Should be removed.

*/
DataTable::SlotIterator begin() const { return table_.data_table->begin(); }
SlotIterator end() const {
// common::SpinLatch::ScopedSpinLatch guard(&tables_latch_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deadcode. Should be removed.


/**
* @return the first tuple slot contained in the underlying DataTable
* Returns one past the last tuple slot contained in the last data table. Note that this is not an accurate number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confusing document: What is the number it is talking about? Tuple slot or something else?

TERRIER_ASSERT(col_ids.size() == col_oids.size(),
"Projection should be the same number of columns as requested col_oids.");
TERRIER_ASSERT(tables_.Find(version_num) != tables_.CEnd(), "Table version must exist before insert");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message seems incorrect. InitializerForProjectedRow does not insert any row.

@@ -33,5 +34,99 @@ struct CatalogTestUtil {
}
return catalog::Schema(columns);
}

// The same as RandomSchem but it won't generate Varchar Column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: RandomSchema?

static bool ProjectionListEqual(const catalog::Schema &schema, const RowType1 *const one, const RowType2 *const other,
const storage::ProjectionMap &map) {
if (one->NumColumns() != other->NumColumns()) return false;
for (auto &it : map) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to #342 (comment).

// We should create a buffer of old Projected Row and update in place. We can't just
// directly erase the data without creating a redo and update the chain.

// auto old_pair = InitializerForProjectedRow(redo_col_oids, old_version);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is these commented out code?

SqlTable::~SqlTable() {
while (default_value_map_.CBegin() != default_value_map_.CEnd()) {
auto pair = *(default_value_map_.CBegin());
delete[] pair.second.first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For varlens that are not inlined, simply delete pair.second.first will leak memory.

const ProjectedRowInitializer &initializer, layout_version_t version) const;

// TODO(Yashwanth): don't copy the entire header, no need for template only take in ColumnIds() and then just modify
// that when resetting header only have memc py ColumnIds()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: memc py

auto *default_value = column.GetDefault();
// Only populate the default values of the columns which are new and have a default value
if (default_value_map_.Find(col_oid) == default_value_map_.End()) {
uint8_t attr_size = column.GetAttrSize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the results of GetAttrSize() of varlen attributes have the sign bit set. You will allocate a large chunk of memory. This will also leads to out of boundary access in std::memcpy.

auto default_value = pair.first;
auto attr_size = pair.second;
// TODO(Sai): If this becomes a performance bottleneck, we can move this logic to ModifyProjectionHeaderForVersion
storage::StorageUtil::CopyWithNullCheck(default_value, out_buffer, attr_size, pr_map.at(col_oid));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyWithNullCheck only does a shallow copy and VarlenEntry has no reference counting. You may need careful reasoning about how this will interact with the GC.

// NOTE: This map only keeps track of the default values specified at column creation
// For columns which don't have default value or added later, just set to null
// Populating default values into the ProjectedRow inserted later is taken care of by the execution engine
DefaultValueMap default_value_map_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can drop columns. So some default values may be no longer needed in the future. You may want to have some way to shrink the DefaultValueMap.

@jrolli jrolli added blocked This issue or pull request is in progress, but dependent on another task being completed first. and removed in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. labels May 29, 2019
@jrolli
Copy link
Contributor

jrolli commented May 29, 2019

Blocked on #392 which will solidify the correct API for default values.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
blocked This issue or pull request is in progress, but dependent on another task being completed first. class-project This is part of the DB class (15-721) feature Adds a requested feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants