Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy Schema Change #342

Open
wants to merge 190 commits into
base: master
from

Conversation

@yash620
Copy link

yash620 commented Apr 8, 2019

For the design overview please take a look at the DESIGN.md file. Most of our modifications are localized to sql_table.cpp and sql_table.h . We have added tests in sql_table_test.cpp and catalog_test_util.h, and we have benchmarks in sql_table_benchmark.cpp

Yangjun Sheng added 30 commits Mar 14, 2019
Yangjun Sheng
Yangjun Sheng
Yangjun Sheng
table.SetIntColInRow(1, 25721);
storage::TupleSlot row2_slot = table.EndRowAndInsert();
// manually set the version of the transaction to be 1
table.version_ = storage::layout_version_t(1);

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 11, 2019

Contributor

Directly modifying this field is confusing and seems unsafe (may violate your invariant). A method for setting the version of the transaction may better express your intention.

This comment has been minimized.

Copy link
@jrolli

jrolli May 12, 2019

Contributor

The comment is incorrect. This is not manually setting the transaction's version but rather simulating the catalog update that occurs along with a schema update. I'll update the comment to reflect this.

storage::TupleSlot row_slot = table.EndRowAndInsert();
table.StartInsertRow();
table.SetIntColInRow(catalog::col_oid_t(0), 100);
table.SetVarcharColInRow(catalog::col_oid_t(1), "name");

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 11, 2019

Contributor

You should also use longer varlen values to test for varlen entries which are not inlined.

TERRIER_ASSERT(attr_size_ == VARLEN_COLUMN, "This constructor is meant for VARLEN columns.");
TERRIER_ASSERT(type_ != type::TypeId::INVALID, "Attribute type cannot be INVALID.");
SetDefault(default_value);

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 11, 2019

Contributor

This line is problematic. The most significant bit of VARLEN_COLUMN is set to 1. You should not use that directly as the size for memcpy in SetDefault, which will cause scary memory issues.

@@ -38,62 +39,134 @@ class SqlTableRW {
cols_.emplace_back(name, type, 255, nullable, oid);
}

void AddColumn(transaction::TransactionContext *txn, std::string name, type::TypeId type, bool nullable,

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 11, 2019

Contributor

AddColumn does not support adding varlen columns. You should also test adding varlen columns to make sure your code is handling varlens correctly.

const uint32_t num_threads = MultiThreadTestUtil::HardwareConcurrency();
common::WorkerPool thread_pool(num_threads, {});

for (uint32_t iteration = 0; iteration < num_iterations; iteration++) {

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 11, 2019

Contributor

Maybe you do not need to repeat the test by yourself. googletest already provides the --gtest_repeat flag and the GTEST_REPEAT environment variable.

This comment has been minimized.

Copy link
@jrolli

jrolli May 12, 2019

Contributor

We're using the same setup that was already used for the DataTable tests. The repetitions here are useful for increasing the likelihood that CI will error out on race conditions in the concurrent tests without having to make deep modifications to the CMake build scripts.

return pr;
}

// It populates a projected row with random bytes. The schema must not contain VarChar

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 11, 2019

Contributor

You'd better have some assertions to ensure the schema does not contain VarChar.

template <typename Random>
static void PopulateRandomRow(storage::ProjectedRow *row, const catalog::Schema &schema,
const storage::ProjectionMap &pr_map, Random *const generator) {
for (auto &it : pr_map) {

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 11, 2019

Contributor

This algorithm has low efficiency. Why not let the outer loop iterate over the columns and then look up the pr_map?

for (const auto &column : schema.GetColumns()) {
auto col_oid = column.GetOid();
auto *default_value = column.GetDefault();
// Only populate the default values of the columns which are new and have a default value

This comment has been minimized.

Copy link
@lmy1229

lmy1229 May 11, 2019

Is it possible for the user to change the default value of a column? Or else it is treated as a new column (with a different OID)

This comment has been minimized.

Copy link
@yash620

yash620 May 12, 2019

Author

It should be possible for a user to change the default value of a column. But what is happening here is that if a new column is added and a default value is set for that column on creation then it all tuples within the table should receive that default value for that column. But if a default value is set on an already created column then that default value is only applied to tuples inserted from that point on.

This comment has been minimized.

Copy link
@jrolli

jrolli May 12, 2019

Contributor

The storage engine only needs to reason about default values that are applied at the time the column is created as these are retroactive (applied to all existing data). Default value changes at any other time would only affect new inserts and have to be handled by the catalog, execution engine, and binder because we are not in a position at this level to reason about that granularity of visibility.

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

I think that part deserves better documentation. The user of SqlTable needs to know for default values which part is handled by SqlTable and which part they should take care of.


// The slot version is not the same as the version_num
col_id_t original_column_ids[out_buffer->NumColumns()];
ModifyProjectionHeaderForVersion(out_buffer, tables_.Find(version_num)->second, old_dt_version, original_column_ids);

This comment has been minimized.

Copy link
@lmy1229

lmy1229 May 11, 2019

tables_.Find(version_num)->second is repeated several times here, and curr_dt_version is unused. Are you doing this for some concurrent issues?

yash620 and others added 2 commits May 12, 2019

## Future Work
### Pending tasks
#### Default values

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Outdated? You have done some work about default values.

auto next_table = tables_->Find(curr_version_);
if (next_table == tables_->CEnd()) { // next_table does not exist (at end)
is_end_ = true;
break;

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Break the loop with is_end_ set to be true. But immediately after that the assertion checks !is_end_, which is confusing.

*/
catalog::table_oid_t Oid() const { return oid_; }
SlotIterator begin(layout_version_t txn_version) const {
// common::SpinLatch::ScopedSpinLatch guard(&tables_latch_);

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Deadcode. Should be removed.

*/
DataTable::SlotIterator begin() const { return table_.data_table->begin(); }
SlotIterator end() const {
// common::SpinLatch::ScopedSpinLatch guard(&tables_latch_);

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Deadcode. Should be removed.


/**
* @return the first tuple slot contained in the underlying DataTable
* Returns one past the last tuple slot contained in the last data table. Note that this is not an accurate number

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Confusing document: What is the number it is talking about? Tuple slot or something else?

TERRIER_ASSERT(col_ids.size() == col_oids.size(),
"Projection should be the same number of columns as requested col_oids.");
TERRIER_ASSERT(tables_.Find(version_num) != tables_.CEnd(), "Table version must exist before insert");

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

The message seems incorrect. InitializerForProjectedRow does not insert any row.

@@ -33,5 +34,99 @@ struct CatalogTestUtil {
}
return catalog::Schema(columns);
}

// The same as RandomSchem but it won't generate Varchar Column

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

typo: RandomSchema?

static bool ProjectionListEqual(const catalog::Schema &schema, const RowType1 *const one, const RowType2 *const other,
const storage::ProjectionMap &map) {
if (one->NumColumns() != other->NumColumns()) return false;
for (auto &it : map) {

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Similar to #342 (comment).

// We should create a buffer of old Projected Row and update in place. We can't just
// directly erase the data without creating a redo and update the chain.

// auto old_pair = InitializerForProjectedRow(redo_col_oids, old_version);

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

What is these commented out code?

SqlTable::~SqlTable() {
while (default_value_map_.CBegin() != default_value_map_.CEnd()) {
auto pair = *(default_value_map_.CBegin());
delete[] pair.second.first;

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

For varlens that are not inlined, simply delete pair.second.first will leak memory.

const ProjectedRowInitializer &initializer, layout_version_t version) const;

// TODO(Yashwanth): don't copy the entire header, no need for template only take in ColumnIds() and then just modify
// that when resetting header only have memc py ColumnIds()

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

typo: memc py

auto *default_value = column.GetDefault();
// Only populate the default values of the columns which are new and have a default value
if (default_value_map_.Find(col_oid) == default_value_map_.End()) {
uint8_t attr_size = column.GetAttrSize();

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Note that the results of GetAttrSize() of varlen attributes have the sign bit set. You will allocate a large chunk of memory. This will also leads to out of boundary access in std::memcpy.

auto default_value = pair.first;
auto attr_size = pair.second;
// TODO(Sai): If this becomes a performance bottleneck, we can move this logic to ModifyProjectionHeaderForVersion
storage::StorageUtil::CopyWithNullCheck(default_value, out_buffer, attr_size, pr_map.at(col_oid));

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

CopyWithNullCheck only does a shallow copy and VarlenEntry has no reference counting. You may need careful reasoning about how this will interact with the GC.

// NOTE: This map only keeps track of the default values specified at column creation
// For columns which don't have default value or added later, just set to null
// Populating default values into the ProjectedRow inserted later is taken care of by the execution engine
DefaultValueMap default_value_map_;

This comment has been minimized.

Copy link
@codeworm96

codeworm96 May 13, 2019

Contributor

Users can drop columns. So some default values may be no longer needed in the future. You may want to have some way to shrink the DefaultValueMap.

@jrolli jrolli added blocked and removed in-progress labels May 29, 2019
@jrolli

This comment has been minimized.

Copy link
Contributor

jrolli commented May 29, 2019

Blocked on #392 which will solidify the correct API for default values.

@jrolli jrolli mentioned this pull request Aug 13, 2019
9 of 13 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.