This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Lazy Schema Change #342

Closed
190 commits
bcbbcaa
Add a vector of DataTableVersion
Mar 14, 2019
63f80d2
Implemented multiversion select
Mar 15, 2019
ebb0ef1
Add functions for changing schemas in Sql Table, overload functions t…
Mar 15, 2019
1723b43
Add multi_version_sql_table test
Mar 15, 2019
cf91e19
Fix memory leak
Mar 15, 2019
a3f1b96
Implemented multiversion updates, insertion and deletion
Mar 16, 2019
c82b7ff
Add mv-schema insert test
Mar 16, 2019
9c7d5ef
Fix delete bug
Mar 16, 2019
36b38f3
Add mv-delete test
Mar 16, 2019
0f173ee
Merge branch 'master' into schema_change
Mar 16, 2019
c85242e
Add mv-update test.
Mar 17, 2019
66b6aa9
Fix mv Select bug. Fix mv Update bug
Mar 17, 2019
8717f51
Format code
Mar 17, 2019
2ba3914
Remove Slot Iterators in SqlTable
Mar 17, 2019
1d7d2db
Remove single version Select and InitializerForProjectedRow
Mar 17, 2019
2776858
Remove printf
Mar 17, 2019
7506fc9
Remove single version Insert and Delete
Mar 17, 2019
03402cb
Remove ASSERT and unused_variables to deal with release version
Mar 17, 2019
01f48cf
Add string test
Mar 17, 2019
21b968b
Add versions for initializers in sql tables
Mar 17, 2019
87b6bf3
Add utility function for copying from one ProjectedRow to another Pro…
Mar 17, 2019
99d95cb
Implement multiversion scan in SqlTable
Mar 17, 2019
5872f3c
Fix scan bug
Mar 17, 2019
de11304
Add Scan Test
Mar 17, 2019
8d845f6
Add SqlTable::SlotIterator fix scan logic
Mar 17, 2019
7ae48ae
Fix bug in CopyProjectionIntoProjection
Mar 17, 2019
504e450
Format
Mar 17, 2019
a4d7b57
Make ChangeSchema Atomic
Mar 17, 2019
6cdee78
Add sql_table concurrent insert test
Mar 18, 2019
ea1305a
Completely remove SqlTable with single DataTable
Mar 18, 2019
ae06652
Check read back the same content after insert
Mar 18, 2019
c1e42bb
Merge branch 'master' into schema_change
Mar 18, 2019
d36ba88
Add concurrent schema changes test
Mar 18, 2019
b4d0a4a
Pass ProjectionMap into Update to speed up
Mar 18, 2019
00a6157
Use blocklayout instead of TAS, which is not supposed to be exposed o…
Mar 18, 2019
7f701a3
Fix sql update bug
Mar 18, 2019
d807187
Format code
Mar 18, 2019
1d535f6
Fix bug in update
Mar 19, 2019
0500a92
Remove GetTAS function in DataTable. TAS shouldn't be exposed outside…
Mar 19, 2019
6e6a206
Merge branch 'master' into schema_change
Mar 19, 2019
6a58b45
Clean Up Select and Update
Mar 19, 2019
d3ecdfb
Clean Up
Mar 19, 2019
bd81c27
Rename sql_table_test
Mar 19, 2019
d54a401
Add documentation
Mar 20, 2019
fb9de00
Fix format issue
Mar 20, 2019
2ea3c0a
Modified select to use projected row header modification method
yashNaN Mar 20, 2019
7f202d5
Add simple sql table benchmark
Mar 21, 2019
7fb5aae
Add more benchmark
Mar 21, 2019
9b39fd5
Fixed warning for making constructor explicit
yashNaN Mar 21, 2019
1507239
Merge branch 'schema_change' of https://github.com/yash620/terrier in…
yashNaN Mar 21, 2019
fc2c1e2
Changed, select to memcpy projectedrow header, failing tests though
yashNaN Mar 21, 2019
812fb74
Modified projected row header mangling to use ProjectedRowInitializer
yashNaN Mar 21, 2019
7c88a17
Merge of split implementations (#1)
jrolli Mar 21, 2019
2a91470
Merge branch 'master' into schema_change
jrolli Mar 22, 2019
75b0d72
Add and build inverse column maps
jrolli Mar 22, 2019
9ea38ac
Added in todos
yashNaN Mar 23, 2019
40e58b6
merged with schema_change
yashNaN Mar 23, 2019
3bc713c
Added in header mangling for select, yet to test. Waiting on inverse_…
yashNaN Mar 23, 2019
d81f871
Sql table benchmark (#7)
Mar 23, 2019
a3229e7
Merge branch 'schema_change' into inv-col-map
jrolli Mar 23, 2019
b6f7677
Inline BlockLayoutFromSchema into UpdateSchema
jrolli Mar 24, 2019
a87cc46
Added in header mangling for scans, modified scan so that it is incre…
yashNaN Mar 24, 2019
f9df645
Merge pull request #5 from jrolli/inv-col-map
yashNaN Mar 25, 2019
f171243
Merge branch 'schema_change' into yashwn-dev
yashNaN Mar 25, 2019
28718a8
Fix typo
Mar 25, 2019
2307125
Fix the problem that benchmark throws exceed limit exception when ben…
Mar 25, 2019
e701941
Tested and fixed issues with header mangling with scans and inserts. …
yashNaN Mar 25, 2019
829ecf9
Modified sql_table_test to fit modified design of scan
yashNaN Mar 25, 2019
aa452a0
Initial shell for concurrent tests
jrolli Mar 25, 2019
b24bce6
Modified sql_table_test to fit modified design of scan
yashNaN Mar 25, 2019
0d118d2
Merge branch 'yashwn-dev' of https://github.com/yash620/terrier into …
yashNaN Mar 25, 2019
0ad097c
Fixing format
yashNaN Mar 25, 2019
ca7de04
Fixed compiler warnings
yashNaN Mar 26, 2019
a99665e
Merge branch 'schema_change' into sql-concurrent-test
jrolli Mar 26, 2019
4da8640
Removing unused function in projected_row
yashNaN Mar 26, 2019
057accb
Changed format of todos
yashNaN Mar 26, 2019
882d924
Fixed including extra header
yashNaN Mar 26, 2019
0dd5c76
Merge pull request #9 from yangjuns/sql_table_benchmark
Mar 29, 2019
c604438
Addressed pull request details of removing return type on DataTable scan
yashNaN Mar 29, 2019
176763c
Merge pull request #10 from yash620/yashwn-dev
yashNaN Mar 30, 2019
e66de92
Add single version update benchmark
Apr 2, 2019
3845842
Add MultiVersionMatchUpdate
Apr 2, 2019
e6196b1
Use Random bytes for updates in SingleVersionUpdate Benchmark
Apr 2, 2019
d103214
Update comments
Apr 2, 2019
dd497df
Add MultiversionMismatchUpdate
Apr 2, 2019
42fb1ee
Fix mismatch update benchmark because updating the same tuple will ma…
Apr 2, 2019
0c07a63
Fix update. We should use datatable level operations instead of recur…
Apr 2, 2019
7915171
Modified header mangling to not have to malloc on the heap
yashNaN Apr 4, 2019
e0579c6
Fixed formatting
yashNaN Apr 4, 2019
35a994b
More format fixes
yashNaN Apr 4, 2019
144c85b
Removed unecesarry functions from projected row and projected columns
yashNaN Apr 4, 2019
018c7c5
Merge branch 'schema_change' of https://github.com/yash620/terrier in…
yashNaN Apr 4, 2019
bd4739a
Merge pull request #16 from yash620/yashwn-dev
yashNaN Apr 5, 2019
155b974
Delete before Insert
Apr 5, 2019
3fb108d
Use sql table select to avoid duplicated logic
Apr 5, 2019
94030a6
Merge branch 'schema_change' into update_fix
Apr 5, 2019
8d7a026
Format code
Apr 5, 2019
ce5ab7e
Free up memoery when conflict
Apr 5, 2019
8b140d1
Implement concurrent select and insert tests
jrolli Apr 4, 2019
f1d09df
Merge branch 'schema_change' into sql-concurrent-test
jrolli Apr 5, 2019
c7a0eb3
Fix function CopyFromProjectionToProjection
Apr 5, 2019
bcf31eb
Address reivew comments: update values before insert
Apr 5, 2019
32b9a26
Merge pull request #14 from yangjuns/update_fix
Apr 5, 2019
803d1e3
Merge pull request #13 from yangjuns/update_benchmark
Apr 5, 2019
6cbc928
Refactor code to use a concurrent hash map
jrolli Apr 5, 2019
4d72fa1
Clean up formatting
jrolli Apr 6, 2019
d459af4
Merge branch 'schema_change' into concurrent-map
jrolli Apr 6, 2019
d988b1e
Tidy code
jrolli Apr 6, 2019
f6c3150
Add DESIGN.md for project update requirement
jrolli Apr 7, 2019
e010648
Merge pull request #19 from jrolli/concurrent-map
jrolli Apr 7, 2019
c6df5bf
Merge branch 'schema_change' into sql-concurrent-test
jrolli Apr 7, 2019
5359d51
Merge branch 'master' into merge-master
jrolli Apr 8, 2019
1d2ef4e
Fix build errors
jrolli Apr 7, 2019
2a30e06
clang-tidy
jrolli Apr 8, 2019
587cce0
Merge pull request #20 from jrolli/merge-master
yashNaN Apr 8, 2019
d450c13
C++-ify the test...
jrolli Apr 9, 2019
2d63f06
C++-ify the test...
jrolli Apr 9, 2019
aef1390
Clang tidy
jrolli Apr 9, 2019
a6af4a1
Merge branch 'schema_change' into sql-concurrent-test
jrolli Apr 9, 2019
575b96b
Fix format
jrolli Apr 9, 2019
5681652
Increase number of iterations
jrolli Apr 9, 2019
87413eb
Add Scan benchmark (#23)
Apr 9, 2019
25cdb88
Merge pull request #18 from jrolli/sql-concurrent-test
yashNaN Apr 9, 2019
36adf88
Merge branch 'master' into schema_change
jrolli Apr 17, 2019
aac0a38
Fix typo in design doc
jrolli Apr 17, 2019
2f17b80
Improve assert location
jrolli Apr 17, 2019
d7f2841
Fix incrementing end iterator risk
jrolli Apr 17, 2019
44fc635
Remove obsolete comment
jrolli Apr 17, 2019
7178975
Merge pull request #26 from jrolli/schema_change
jrolli Apr 17, 2019
bcbc157
Added default value attribute to Schema::Column
saikiriti93 Apr 18, 2019
eb4053b
Populating default value during UpdateSchema
saikiriti93 Apr 18, 2019
b56aff7
Default values tested in Select
saikiriti93 Apr 18, 2019
9978b50
Added nullptr check for default_value argument
saikiriti93 Apr 18, 2019
87531a2
Default values tested in Scan
saikiriti93 Apr 18, 2019
c0187a3
Fixed format
saikiriti93 Apr 18, 2019
dd41f99
Fix clang-tidy
jrolli Apr 18, 2019
ac59e0a
Expand testing
jrolli Apr 18, 2019
d75b01c
Merge pull request #27 from jrolli/review-edits
jrolli Apr 23, 2019
78b3aec
Merge branch 'master' into schema_change
jrolli Apr 24, 2019
ab71a1f
Fix build errors in deferred action test
jrolli Apr 24, 2019
d832ca2
Fix build errors in SQL table benchmark
jrolli Apr 24, 2019
45a4124
Fix build errors in SQL table test
jrolli Apr 24, 2019
d32991d
Update tests
jrolli Apr 25, 2019
5660b57
Update test calls
jrolli Apr 25, 2019
265772c
Fix params
jrolli Apr 25, 2019
de37aba
Fix params
jrolli Apr 25, 2019
98d13dc
Remove calls to BlockLayout
jrolli Apr 25, 2019
2571408
Remove unused field
jrolli Apr 25, 2019
76a31af
Merge pull request #29 from jrolli/schema_change
jrolli Apr 25, 2019
bb8ede5
Merge branch 'schema_change' into concurrent-tests
jrolli Apr 25, 2019
f744b35
Use row view
jrolli Apr 25, 2019
739ae0b
Use row view
jrolli Apr 25, 2019
9772ede
Merge branch 'concurrent-tests' of github.com:jrolli/terrier into con…
jrolli Apr 26, 2019
abf3db0
Fix bad iterator situation
jrolli Apr 26, 2019
d835052
Fix typo in current copy implementation
jrolli Apr 27, 2019
877cc71
Transition to header mangling
jrolli Apr 27, 2019
a4d3216
Fix logic error
jrolli Apr 27, 2019
0fa78ed
Fix format, causes other errors
jrolli Apr 27, 2019
3eccde7
Fix API issues
jrolli Apr 27, 2019
b56a489
Clean up code
jrolli Apr 27, 2019
6e92e9a
Fix typo
jrolli Apr 27, 2019
c330036
Merge pull request #32 from jrolli/hotfix-scan
yashNaN Apr 30, 2019
932cf75
Changed back to single Column constructor with a defaultValue argument
saikiriti93 May 2, 2019
ee6f639
Merge branch 'schema_change' into default_values
saikiriti93 May 3, 2019
4d6cceb
Merge branch 'master' into schema_change
jrolli May 3, 2019
50c947d
Provide API for setting and clearing default values
saikiriti93 May 3, 2019
3bdca7f
Added a failing test case for populating default values. Single Defau…
saikiriti93 May 3, 2019
22c3041
Modify test case for default value of an old table
saikiriti93 May 4, 2019
404270a
Add new column check in test case
saikiriti93 May 4, 2019
48bcfde
Passing default value test cases
saikiriti93 May 4, 2019
b6631cf
Default value insertion for Scan. Fixed the Scan test case.
saikiriti93 May 4, 2019
66330ac
Added check for version match in Scan
saikiriti93 May 4, 2019
8fcaa12
Merge pull request #33 from jrolli/concurrent-tests
jrolli May 4, 2019
efc35a4
Merge branch 'schema_change' of github.com:yash620/terrier into schem…
jrolli May 4, 2019
3a7ac68
Remove const to make serialization framework happy
jrolli May 4, 2019
614be3e
Merge pull request #34 from jrolli/schema_change
jrolli May 4, 2019
033f0cc
Handle null case in setting a default value
saikiriti93 May 4, 2019
76aebd8
Merge branch 'schema_change' into default_values
saikiriti93 May 4, 2019
8ba1635
Refactoring missing_cols calculation
saikiriti93 May 4, 2019
a9a4565
Updated test comments
saikiriti93 May 5, 2019
4fa1ee1
Changed DefaultValueMap from unordered_map to ConcurrentMap.
saikiriti93 May 5, 2019
9384ea0
Refactoring for default value filling
saikiriti93 May 5, 2019
3562388
Merge pull request #28 from yash620/default_values
saikiriti93 May 5, 2019
14d90c9
Asan hotfix (#36)
jrolli May 5, 2019
c4075a9
Add NOLINTNEXTLINT
jrolli May 5, 2019
9dfece0
Fix linting issue
jrolli May 5, 2019
5b23977
Merge branch 'master' into schema_change
jrolli May 11, 2019
9cca570
Merge pull request #39 from jrolli/schema_change
jrolli May 11, 2019
8d22128
Addressed some pr comments (#40)
yashNaN May 12, 2019
9a7384a
Fix test comments (#41)
jrolli May 12, 2019
111 changes: 111 additions & 0 deletions DESIGN.md
@@ -0,0 +1,111 @@
# Non-Blocking Alter Table Support (SqlTable)

## Overview
The overall goal of this project is to implement lazy, non-blocking schema changes; the current storage layer does not support schema changes at all. The first goal is to support the schema change operations add column, drop column, and change default value. The larger goal is to carry these out in a non-blocking fashion, for which we chose a lazy evaluation approach: a schema change does not migrate existing tuples to the new schema until information that exists only in the new schema is modified.

## Scope
Almost all of the work will be localized to the SqlTable object, as that is the access point from the execution layer to the storage layer. Our design does not affect the underlying structure of the tuples or the DataTables; after this change they can still provide their functionality without going through the SqlTable. Within the SqlTable we will change the modules below. Throughout, we refer to the *expected version* as the version of the tuple the user expects to see and the *actual version* as the version the tuple is currently stored in. These two versions differ when the schema has been updated but, due to lazy evaluation, the tuple has not yet been transformed into the latest version.

### SqlTable::Insert
- Insert will now take in a schema version number that indicates the version of the tuple being inserted.
- In this case the actual and expected versions will be the same as an insert will always put a tuple in the version that is passed in.

### SqlTable::Update
- Update will be required to take in a schema version number that indicates the expected schema version of the tuple being updated.
- In this case the actual version number will differ from the expected version in cases where the tuple being updated has not been shifted to the expected version.
- If the update modifies columns that are not in the actual version then the tuple will be shifted to the expected version before applying the update.

### SqlTable::Scan
- Scan will be required to take in a schema version number that indicates the schema version number that the current transaction sees.

### SqlTable::Delete
- Delete will be required to take in the schema version number that indicates the expected schema version of the tuple being deleted.

> **Review comment:** You also modified `InitializerForProjectedColumns`, `InitializerForProjectedRows`, etc. You may also need to document them here.

### Catalog
- The catalog will keep track of the visible schema versions for each transactions based on its timestamp. This schema version is then passed on to the SqlTable layer.

### DataTable/Projected Row
- DataTable::SelectIntoBuffer iterates across a projected_row/column’s column ids and fills in the data for each column

- For this operation we use the column_id VERSION_POINTER_COLUMN_ID as a sentinel id to represent a column that the DataTable should skip over and not fill in. We will go into detail as to why this happens within the architectural design below.


## Architectural Design
The design of this project centers on the modifications to SqlTable; the design of the other components in the storage layer remains unchanged. On a schema change, the SqlTable creates a new DataTable, and all tuples inserted from that point on are stored in the new version. To be lazy, it does not migrate already existing tuples into the latest version in this call. Supporting lazy schema change requires two capabilities: maintaining tuples in multiple schema versions and providing methods to transform them into the desired version.

### Multi-versioning
To address multi-versioning, the SqlTable maintains a map from schema_version_number to DataTable, with one DataTable per schema version. The functionality for accessing tuples already exists in DataTable, so the SqlTable only needs to manage the two capabilities described above; everything else is handled by the existing DataTable implementation. Furthermore, each block maintains metadata recording which version its tuples belong to; since each DataTable holds only a single version, a block cannot contain tuples from multiple versions. Below we describe the multiversion design for each SqlTable operation we are modifying; as before, the expected version is the version of the tuple the user expects to see and the actual version is the version the tuple is currently stored in.
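
The version-to-table mapping can be sketched as follows; `MultiVersionTable` and `FakeDataTable` are illustrative stand-ins, not the real terrier classes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

using layout_version_t = uint32_t;

// Stand-in for the real DataTable (which would hold a BlockLayout, column
// maps, and the actual tuple storage).
struct FakeDataTable {
  std::string schema_name;
};

class MultiVersionTable {
 public:
  // UpdateSchema: create a fresh DataTable for the new version; existing
  // tuples in older versions stay untouched (lazy).
  void UpdateSchema(layout_version_t version, std::string schema_name) {
    tables_.emplace(version, FakeDataTable{std::move(schema_name)});
  }
  const FakeDataTable &TableForVersion(layout_version_t version) const {
    return tables_.at(version);
  }
  size_t NumVersions() const { return tables_.size(); }

 private:
  // One DataTable per schema version, keyed by version number so that
  // numbering need not start at 0 (e.g. after restoring from a checkpoint).
  std::map<layout_version_t, FakeDataTable> tables_;
};
```

The real implementation uses a concurrent map (see the Design Rationale below) rather than a plain `std::map`.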

#### SqlTable::Insert
Insert always inserts the passed-in tuple into the DataTable of the schema version number that is passed in.

#### SqlTable::Update
- Update has three cases
1. The expected schema version matches the actual schema version
- The update will happen in place on the DataTable of the actual schema version
2. The expected schema version doesn’t match the actual schema version but the update doesn’t touch any columns not in the actual schema version
- The update will happen in place on the DataTable of the actual schema version
3. The expected schema version doesn’t match the actual schema version and the update touches columns that are not in the actual schema version. The following steps occur:
- Retrieve the tuple from the actual version DataTable
- Transform the tuple to the expected version
- Delete the tuple from the actual version DataTable
- Insert the tuple into the expected version DataTable
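
The three-way case analysis above can be sketched as follows (hypothetical names; column-id sets stand in for the real ProjectedRow machinery):

```cpp
#include <cassert>
#include <cstdint>
#include <set>

enum class UpdateAction { InPlace, MigrateThenUpdate };

UpdateAction ClassifyUpdate(uint32_t expected_version, uint32_t actual_version,
                            const std::set<uint16_t> &updated_cols,
                            const std::set<uint16_t> &actual_version_cols) {
  // Case 1: versions match -> update in place.
  if (expected_version == actual_version) return UpdateAction::InPlace;
  // Case 2: versions differ, but every updated column exists in the actual
  // version -> still update in place.
  bool all_present = true;
  for (uint16_t col : updated_cols) {
    if (actual_version_cols.count(col) == 0) {
      all_present = false;
      break;
    }
  }
  if (all_present) return UpdateAction::InPlace;
  // Case 3: the update touches a column missing from the actual version ->
  // select, transform to the expected version, delete from the old
  // DataTable, and insert into the expected version's DataTable.
  return UpdateAction::MigrateThenUpdate;
}
```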

#### SqlTable::Scan
- SqlTable will maintain its own slot iterator, which is used to iterate across all the schema versions. The iterator interface exposed to the user will not change.
- The iterator for Scan must always begin on the latest version
> **Review comment:** I doubt whether this method really solves every problem. See the trace in https://github.com/cmu-db/terrier/pull/342/files#r275173692

> **Review comment:** As discussed in https://github.com/cmu-db/terrier/pull/342/files#r275175240, it seems this is not the actual solution you used to solve the scan/update conflict; you rely on MVCC to solve it. Maybe the document also needs to change accordingly.

> **Review comment:** This requirement solves a different problem than the one posed there. MVCC solves the race across transactions during the insert/delete process, but this addresses the risk of processing the same logical tuple twice. Specifically, if we are using a Hyper-style execution model that processes a group of tuples as far as it can, and this pipeline involves an update that forces migration to the latest version, then we must process the latest version first.
>
> For example, if we add a column and then systematically set the value of this column (via update, not a default value at the time we changed the schema), then we will migrate every tuple to the current version. If we iterate over older versions first, the migrated tuples will be visible when we reach the current version, because a transaction can see its own writes. While we could handle this by tracking migrated tuples explicitly, it is cleaner to iterate over the current version first so that when we trigger migrations they are inserted behind the SlotIterator and won't be visited again.

> **Review comment:** I see. So the user must guarantee that update and scan calls do not interleave with each other, but they can call scan, then update, then scan, etc., which this method handles.

> **Review comment:** Does this mean we may only update already-scanned tuples in the same transaction? For example, if we updated an unscanned tuple and it was migrated to an already-scanned DataTable, we would never see it in the scan.

> **Review comment:** If the update occurred in a different transaction, then the scan will see the old "deleted" tuple, because the other transaction's writes are not visible to the scan (snapshot isolation).

- In cases where the user interleaves scan and update calls, if the scan iterator were to start on an older version, a tuple retrieved by the scan could be updated, which could move it into the latest schema version.
- When the iterator then reached the latest schema version, it would read all the tuples in that version's DataTable, so the migrated tuple would be read twice within a single scan.
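
A sketch of this latest-version-first iteration order, assuming an ordered map keyed by version number (names are illustrative): tuples migrated by an interleaved update land in the already-visited latest version, behind the iterator.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Returns the order in which the SqlTable-level iterator visits per-version
// tables: latest version first, then progressively older ones.
std::vector<uint32_t> ScanOrder(const std::map<uint32_t, int> &tables) {
  std::vector<uint32_t> order;
  for (auto it = tables.rbegin(); it != tables.rend(); ++it) {
    order.push_back(it->first);  // descending version numbers
  }
  return order;
}
```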

> **Review comment:** According to the code, under case "Expected schema version differs from the actual schema version", Scan() also returns the temporarily transformed tuples, similar to Select(), right? Maybe the doc can make this clearer here.

#### SqlTable::Select
- There are two cases
1. Expected schema version matches actual schema version
- The tuple is directly selected from the DataTable for that schema version
2. Expected schema version differs from the actual schema version
- The tuple is selected from the DataTable for that schema version and during this process it is transformed to the expected schema version
> **Review comment:** Will the transformation also happen in the underlying storage, or is it only a temporary result that is then thrown away? I think it is better to clarify that in the document.

> **Review comment:** The transformation here is temporary. The only time we persist the transformation is in case 3 of update, where the tuple is moved between DataTables via a delete and insert.

> **Review comment:** In the document, you said your benchmark showed that reading a tuple whose version doesn't match carries a penalty of about 5x. Consider a hot tuple stored in an older version: each read pays the 5x penalty, and we transform the tuple again and again, so the overhead could be large. You may need a policy to migrate the tuple in such a case, even when we are only reading it, to reduce the overhead.

- This transformed tuple is returned

#### SqlTable::Delete
- Delete will directly delete the tuple from the DataTable of the actual version of that tuple

#### SqlTable::UpdateSchema
- This function is the access point through which users can update the schema by passing in a new schema object
- The SqlTable will construct a new DataTable to maintain all of the tuples inserted for this version.
- To be lazy, none of the already existing tuples will be modified in this call.

### Transformation
In the current interface the user can only retrieve data from a SqlTable through a ProjectedRow or a ProjectedColumn; we refer to both as a Projection in this section for simplicity. The user passes in a Projection, which is filled by the storage layer. The Projection passed in by the user is in the expected version, but the actual version of the data could be different, so we need a way of transforming between versions. To do this we modify the header of the Projection.

The header contains metadata on column_ids and column_offsets, which the DataTable uses to populate the Projection; the column_ids can differ between schema versions. Before passing the Projection to the DataTable of the actual version, we translate the column_ids of the expected version into the column_ids of the actual version. For any column not present in the actual version, we set the column_id to a sentinel value (VERSION_POINTER_COLUMN_ID, since no column in a Projection should have that id). We pass this modified Projection to the DataTable, which populates it, skipping any column whose id is set to the sentinel value. Then we reset the Projection header to the original header and fill in default values for the columns that were not present in the actual version. This way we avoid copying data from one version to another, and the Projection is filled only once.
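
The header translation step can be sketched as follows (illustrative only; the real implementation rewrites the ProjectedRow header in place rather than building a new id vector):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// No real column in a Projection may carry this id, so it can serve as the
// "skip this column" sentinel.
constexpr uint16_t VERSION_POINTER_COLUMN_ID = UINT16_MAX;

// Rewrite expected-version column ids into the actual version's ids; columns
// missing from the actual version get the sentinel so the DataTable skips
// them when filling the Projection.
std::vector<uint16_t> TranslateHeader(
    const std::vector<uint16_t> &expected_ids,
    const std::unordered_map<uint16_t, uint16_t> &expected_to_actual) {
  std::vector<uint16_t> actual_ids;
  actual_ids.reserve(expected_ids.size());
  for (uint16_t id : expected_ids) {
    auto it = expected_to_actual.find(id);
    actual_ids.push_back(it == expected_to_actual.end()
                             ? VERSION_POINTER_COLUMN_ID
                             : it->second);
  }
  return actual_ids;
}
```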

## Design Rationale
In order to support lazy schema changes we will need to maintain tuples in multiple different schema versions and provide methods of transforming them into the desired version.

### Backend for different schema versions
The initial decision for storing schema versions within a SqlTable was whether to back it with a vector or a map. While we appreciated the simplicity and probable performance benefits of a vector, we ultimately chose a map because it does not force our versioning to start at 0. While this constraint does not seem significant at first, we realized that if the database is restored from a checkpoint we should restore the schema version number (since it may be exposed to the user or tracked by a hot backup) rather than reinitialize the versions to 0.

The second decision for our backend was whether to protect the underlying data structure with latches or to use a latch-free structure. We chose a latch-free structure because it would be difficult to reason about every point where the latch would be needed (essentially any version check) without wrapping the map in another abstraction level. Additionally, we had serious concerns about introducing 4 to 10 latch operations on the path of every single SQL action in every transaction, which would guarantee large numbers of cache invalidations. We therefore went with the ConcurrentMap implementation in the database (a wrapper of tbb::concurrent_hash_map), which supports the common case of concurrent insertions and lookups. However, this creates future difficulties for supporting compaction/deletion of obsolete schema versions, because erasures are unsafe. Unfortunately, we are not aware of any candidate hash map implementation that supports safe insertions, deletions, and lookups without using latches.

### Transforming old versions into the expected version
We recognized two possible ways to transform the data stored under old schemas into the expected schema for a transaction: (1) attribute-wise copying of the data from an old ProjectedRow to a new one and (2) rewriting the header on the new ProjectedRow (provided by the caller) to be readable by an older data table. We initially implemented (1) because it was far simpler logic. However, when we benchmarked the implementation for cross-version selects we observed a significant performance penalty (factor of 10x). We therefore have switched to (2) and have reduced the penalty to a factor of 5x. We are still working on improving this even further.

## Testing Plan
Our current testing plan is to implement two new test suites that test both the sequential correctness and the concurrent robustness of the implementation. The sequential test suite focuses on ensuring that known edge cases are handled correctly; we use sequential tests for these situations because we can tightly control the ordering of actions. The concurrent test suite performs a mini-stress test designed to make rare, but possible, race conditions likely to be detected. For this we focus on straining access to the versioning scheme by performing concurrent inserts and reads on the hash map.

In addition to formal tests, we are also benchmarking our implementation as we go to ensure we measure and understand the performance impacts of our changes. Specifically, we are focusing on performance impact across a range of simulated workloads (selects, inserts, updates, and a mix) and in two general situations: a single schema version and multiple versions. The goal here is to ensure we can quantify our impact against the current single-versioned implementation as well as quantify the performance degradation for on-the-fly schema manipulation of old data that has not migrated to the new schema version.

## Trade-offs and Potential Problems
**Trade-off:** TBB Concurrent Unordered Map for storing DataTableVersion objects. This decision gives a simple and easy to integrate solution for supporting concurrent insertions (new schemas) and lookups (all DML transactions) on the data tables. However, this will limit options when we start to implement compaction of obsolete schema versions because the data structure does not support concurrent erasures. This likely means we will have to take a heavy-weight approach for compaction such as duplicating the structure without the data to be erased and then use an unlink-delete staged method similar to how the GC already works on undo records.

**Trade-off:** Our decision to manually mangle the headers for ProjectedRow and ProjectedColumn greatly increases the code complexity (manual recreation of the headers) in order to significantly improve performance for reading data across schema versions. Specifically, we avoid an unnecessary allocation and deallocation for temporary projections by allowing old schema versions to write directly into the final projection.

## Future Work
### Pending tasks
#### Default values
> **Review comment:** Outdated? You have done some work on default values.

- Populate the default values into the ProjectRow during Select and Scan operations.
- Handle changes to the default values. Should they be considered a schema change, or should the Catalog maintain the default values so that they can be queried by the SqlTable?

### Stretch goals
- Rolling back schema change transactions.
- Implementing a Compactor to remove DataTables of older versions that don’t contain any tuples.
- Serializing transactions with unsafe/conflicting schema changes by using a central commit latch, allowing only one transaction to commit at a time. Rollback if the validation checks fail.