Skip to content

Conversation

@HiuKwok
Copy link

@HiuKwok HiuKwok commented Jul 4, 2021

patch for CASSANDRA-16786

@HiuKwok
Copy link
Author

HiuKwok commented Jul 4, 2021

@smiklosovic smiklosovic added the javadoc fixing javadoc label Mar 16, 2022
jbellis pushed a commit to jbellis/cassandra that referenced this pull request Jun 26, 2024
When a SAI index is rebuild, create new index files rather than
overwriting existing files, making SAI index components files
essentially truly immutable (once initially written).

To do this, SAI index components have a new parameter, their
"generation": for a given version, the component with the highest
generation is the most recently written one. That generation is added to
the filename, but with the special case that the initial generation (for
a version), generation 0, is special cased and not added to the
filename, and this for backward compatibility. This means in particular
that filenames are completely unchanged by this patch up until one
triggers an index rebuild that does not also correspond to a version
change.

One implication is that on a rebuild, old SAI index components are not
deleted anymore (previously, they were deleted at the end of the
rebuild, and the rebuild was re-creating new versions of the same files,
essentially overwriting the file but in 2 steps) and will remain on
disk. The intent here is that this pave the way for allowing index
"live" rebuild, that is rebuilding an index without making the index
un-queriable during the rebuild. However, this patch does *not*
implement such "live" rebuild, leaving it as a follow up. As of this
patch, the old index component files are also never deleted, though over
time, compaction will get rid of them by virtue of replacing the
sstables that have those components.

Technically, the generation added is not incremented "per component",
but rather per "group" of components, where "group" means (for a
particular sstable) either 1) all the per-sstable components or 2) all
the components of a particular index. As an example, suppose the
components for an sstable are (where 1 is the generation, and including
only a subset of real components to keep the example shorter):
```
<sstable X>-SAI+CA+1+GroupMeta.db
<sstable X>-SAI+CA+1+TokenValues.db
<sstable X>-SAI+CA+1+PrimaryKeyTrie.db
<sstable X>-SAI+CA+1+GroupComplete.db
<sstable X>-SAI+CA+myIndex+Meta.db
<sstable X>-SAI+CA+myIndex+TermsData.db
```
In this example, the per-sstable group is complete, but not the group
for `myIndex` (missing at least the `-ColumnComplete` component). Also,
the per-sstable group is at generation 1, while the `myIndex` group is
at generation 0.. If a rebuild is triggered, the created components
will be (assuming `CA` is the current version and again using a subset
of the actual components for simplicity):
```
<sstable X>-SAI+CA+2+GroupMeta.db
<sstable X>-SAI+CA+2+TokenValues.db
<sstable X>-SAI+CA+2+PrimaryKeyTrie.db
<sstable X>-SAI+CA+2+GroupComplete.db
<sstable X>-SAI+CA+1+myIndex+Meta.db
<sstable X>-SAI+CA+1+myIndex+TermsData.db
<sstable X>-SAI+CA+1+PostingLists.db
<sstable X>-SAI+CA+1+ColumnComplete.db
```
In other words, the generation for the per-sstable group and the
`myIndex` one are independent, but the generation are linked within a
group (the `<sstable X>-SAI+CA+1+PostingLists.db` file is at generation
1 like the rest of that group and not 0, despite there being no
generation 0 file before the rebuild).

The reason generation are linked "by group" is that 1) we need to at
least link the generation to the "completion marker" components to the
rest of the component of the group this is a marker for, and linking the
whole group as above feels like the cleaner solution and 2) linking the
generation of *all* the SAI components of a given sstable would prevent
us from being able to rebuild a single index without rebuilding the
per-sstable components/other indexes.

Because the generation make the name of component not necessarily
fixed for a version, this patch also make component discovery based
on the content of the TOC file. If the TOC is missing a fallback scan
disk instead, but as this can be inefficient in some environment, this
is not the primary method.

Historically, `IndexDescriptor` has been lazily "discovering" the
per-index components for a particular index the first time a method for
that index was called. This makes it harder to reason on when the TOC
is actually read, and given that the TOC is ultimately a mutable file,
that's fragile.

This patch also ensure we read the TOC only at clear points, either on
`IndexDescriptor.load` or on `IndexDescriptor.reload`. For that, the
set of existing index is simply passed to those method, letting them
"load" everything (instead of just the per-sstable parts).

Further, this commit ensures that when we signal that an sstable has
been updated after a reload, then in both `SSTableContextManager` and
`IndexContext`, we do update the `SSTableContext` and/or `SSTableIndex`,
even if one previously exists, as long as the new version is using more
recent underlying files.

However, the fact that `IndexDescriptor` can be reloaded, means that
its content is not immutable, and a lot of code was going through
the `IndexDescriptor` instance to access the underlying components,
which make it not very safe in a number of places (basically, some
piece of course could start using some files at some version, but
then the code using those could end up calling some method that
use different files (and so version) if those had changed in
`IndexDescriptor` in the meantime).

And so a good part of this commit is making sure that everything that
is created using a specific instance of `SSTableContext` continues
using the same files/version for all it's lifetime, and same for
`SSTableIndex`. Effectively, this means capturing the proper perSStable
"group" in `SSTableIndex` and proper perIndex "group" in SSTableIndex
and make sure that is what is used, without indirectly going to the
now mutable IndexDescriptor.
michaelsembwever pushed a commit to thelastpickle/cassandra that referenced this pull request Sep 25, 2024
When a SAI index is rebuild, create new index files rather than
overwriting existing files, making SAI index components files
essentially truly immutable (once initially written).

To do this, SAI index components have a new parameter, their
"generation": for a given version, the component with the highest
generation is the most recently written one. That generation is added to
the filename, but with the special case that the initial generation (for
a version), generation 0, is special cased and not added to the
filename, and this for backward compatibility. This means in particular
that filenames are completely unchanged by this patch up until one
triggers an index rebuild that does not also correspond to a version
change.

One implication is that on a rebuild, old SAI index components are not
deleted anymore (previously, they were deleted at the end of the
rebuild, and the rebuild was re-creating new versions of the same files,
essentially overwriting the file but in 2 steps) and will remain on
disk. The intent here is that this pave the way for allowing index
"live" rebuild, that is rebuilding an index without making the index
un-queriable during the rebuild. However, this patch does *not*
implement such "live" rebuild, leaving it as a follow up. As of this
patch, the old index component files are also never deleted, though over
time, compaction will get rid of them by virtue of replacing the
sstables that have those components.

Technically, the generation added is not incremented "per component",
but rather per "group" of components, where "group" means (for a
particular sstable) either 1) all the per-sstable components or 2) all
the components of a particular index. As an example, suppose the
components for an sstable are (where 1 is the generation, and including
only a subset of real components to keep the example shorter):
```
<sstable X>-SAI+CA+1+GroupMeta.db
<sstable X>-SAI+CA+1+TokenValues.db
<sstable X>-SAI+CA+1+PrimaryKeyTrie.db
<sstable X>-SAI+CA+1+GroupComplete.db
<sstable X>-SAI+CA+myIndex+Meta.db
<sstable X>-SAI+CA+myIndex+TermsData.db
```
In this example, the per-sstable group is complete, but not the group
for `myIndex` (missing at least the `-ColumnComplete` component). Also,
the per-sstable group is at generation 1, while the `myIndex` group is
at generation 0.. If a rebuild is triggered, the created components
will be (assuming `CA` is the current version and again using a subset
of the actual components for simplicity):
```
<sstable X>-SAI+CA+2+GroupMeta.db
<sstable X>-SAI+CA+2+TokenValues.db
<sstable X>-SAI+CA+2+PrimaryKeyTrie.db
<sstable X>-SAI+CA+2+GroupComplete.db
<sstable X>-SAI+CA+1+myIndex+Meta.db
<sstable X>-SAI+CA+1+myIndex+TermsData.db
<sstable X>-SAI+CA+1+PostingLists.db
<sstable X>-SAI+CA+1+ColumnComplete.db
```
In other words, the generation for the per-sstable group and the
`myIndex` one are independent, but the generation are linked within a
group (the `<sstable X>-SAI+CA+1+PostingLists.db` file is at generation
1 like the rest of that group and not 0, despite there being no
generation 0 file before the rebuild).

The reason generation are linked "by group" is that 1) we need to at
least link the generation to the "completion marker" components to the
rest of the component of the group this is a marker for, and linking the
whole group as above feels like the cleaner solution and 2) linking the
generation of *all* the SAI components of a given sstable would prevent
us from being able to rebuild a single index without rebuilding the
per-sstable components/other indexes.

Because the generation make the name of component not necessarily
fixed for a version, this patch also make component discovery based
on the content of the TOC file. If the TOC is missing a fallback scan
disk instead, but as this can be inefficient in some environment, this
is not the primary method.

Historically, `IndexDescriptor` has been lazily "discovering" the
per-index components for a particular index the first time a method for
that index was called. This makes it harder to reason on when the TOC
is actually read, and given that the TOC is ultimately a mutable file,
that's fragile.

This patch also ensure we read the TOC only at clear points, either on
`IndexDescriptor.load` or on `IndexDescriptor.reload`. For that, the
set of existing index is simply passed to those method, letting them
"load" everything (instead of just the per-sstable parts).

Further, this commit ensures that when we signal that an sstable has
been updated after a reload, then in both `SSTableContextManager` and
`IndexContext`, we do update the `SSTableContext` and/or `SSTableIndex`,
even if one previously exists, as long as the new version is using more
recent underlying files.

However, the fact that `IndexDescriptor` can be reloaded, means that
its content is not immutable, and a lot of code was going through
the `IndexDescriptor` instance to access the underlying components,
which make it not very safe in a number of places (basically, some
piece of course could start using some files at some version, but
then the code using those could end up calling some method that
use different files (and so version) if those had changed in
`IndexDescriptor` in the meantime).

And so a good part of this commit is making sure that everything that
is created using a specific instance of `SSTableContext` continues
using the same files/version for all it's lifetime, and same for
`SSTableIndex`. Effectively, this means capturing the proper perSStable
"group" in `SSTableIndex` and proper perIndex "group" in SSTableIndex
and make sure that is what is used, without indirectly going to the
now mutable IndexDescriptor.
blambov pushed a commit to blambov/cassandra that referenced this pull request Feb 20, 2025
Fix NodeRestartTest and SnapshotTest. They were failing due to merge conflicts between apache#1099 and CASSANDRA-18714.
@belliottsmith belliottsmith force-pushed the trunk branch 2 times, most recently from df3eb40 to 54e39a9 Compare July 23, 2025 11:19
michaelsembwever pushed a commit to thelastpickle/cassandra that referenced this pull request Jan 7, 2026
When a SAI index is rebuild, create new index files rather than
overwriting existing files, making SAI index components files
essentially truly immutable (once initially written).

To do this, SAI index components have a new parameter, their
"generation": for a given version, the component with the highest
generation is the most recently written one. That generation is added to
the filename, but with the special case that the initial generation (for
a version), generation 0, is special cased and not added to the
filename, and this for backward compatibility. This means in particular
that filenames are completely unchanged by this patch up until one
triggers an index rebuild that does not also correspond to a version
change.

One implication is that on a rebuild, old SAI index components are not
deleted anymore (previously, they were deleted at the end of the
rebuild, and the rebuild was re-creating new versions of the same files,
essentially overwriting the file but in 2 steps) and will remain on
disk. The intent here is that this pave the way for allowing index
"live" rebuild, that is rebuilding an index without making the index
un-queriable during the rebuild. However, this patch does *not*
implement such "live" rebuild, leaving it as a follow up. As of this
patch, the old index component files are also never deleted, though over
time, compaction will get rid of them by virtue of replacing the
sstables that have those components.

Technically, the generation added is not incremented "per component",
but rather per "group" of components, where "group" means (for a
particular sstable) either 1) all the per-sstable components or 2) all
the components of a particular index. As an example, suppose the
components for an sstable are (where 1 is the generation, and including
only a subset of real components to keep the example shorter):
```
<sstable X>-SAI+CA+1+GroupMeta.db
<sstable X>-SAI+CA+1+TokenValues.db
<sstable X>-SAI+CA+1+PrimaryKeyTrie.db
<sstable X>-SAI+CA+1+GroupComplete.db
<sstable X>-SAI+CA+myIndex+Meta.db
<sstable X>-SAI+CA+myIndex+TermsData.db
```
In this example, the per-sstable group is complete, but not the group
for `myIndex` (missing at least the `-ColumnComplete` component). Also,
the per-sstable group is at generation 1, while the `myIndex` group is
at generation 0.. If a rebuild is triggered, the created components
will be (assuming `CA` is the current version and again using a subset
of the actual components for simplicity):
```
<sstable X>-SAI+CA+2+GroupMeta.db
<sstable X>-SAI+CA+2+TokenValues.db
<sstable X>-SAI+CA+2+PrimaryKeyTrie.db
<sstable X>-SAI+CA+2+GroupComplete.db
<sstable X>-SAI+CA+1+myIndex+Meta.db
<sstable X>-SAI+CA+1+myIndex+TermsData.db
<sstable X>-SAI+CA+1+PostingLists.db
<sstable X>-SAI+CA+1+ColumnComplete.db
```
In other words, the generation for the per-sstable group and the
`myIndex` one are independent, but the generation are linked within a
group (the `<sstable X>-SAI+CA+1+PostingLists.db` file is at generation
1 like the rest of that group and not 0, despite there being no
generation 0 file before the rebuild).

The reason generation are linked "by group" is that 1) we need to at
least link the generation to the "completion marker" components to the
rest of the component of the group this is a marker for, and linking the
whole group as above feels like the cleaner solution and 2) linking the
generation of *all* the SAI components of a given sstable would prevent
us from being able to rebuild a single index without rebuilding the
per-sstable components/other indexes.

Because the generation make the name of component not necessarily
fixed for a version, this patch also make component discovery based
on the content of the TOC file. If the TOC is missing a fallback scan
disk instead, but as this can be inefficient in some environment, this
is not the primary method.

Historically, `IndexDescriptor` has been lazily "discovering" the
per-index components for a particular index the first time a method for
that index was called. This makes it harder to reason on when the TOC
is actually read, and given that the TOC is ultimately a mutable file,
that's fragile.

This patch also ensure we read the TOC only at clear points, either on
`IndexDescriptor.load` or on `IndexDescriptor.reload`. For that, the
set of existing index is simply passed to those method, letting them
"load" everything (instead of just the per-sstable parts).

Further, this commit ensures that when we signal that an sstable has
been updated after a reload, then in both `SSTableContextManager` and
`IndexContext`, we do update the `SSTableContext` and/or `SSTableIndex`,
even if one previously exists, as long as the new version is using more
recent underlying files.

However, the fact that `IndexDescriptor` can be reloaded, means that
its content is not immutable, and a lot of code was going through
the `IndexDescriptor` instance to access the underlying components,
which make it not very safe in a number of places (basically, some
piece of course could start using some files at some version, but
then the code using those could end up calling some method that
use different files (and so version) if those had changed in
`IndexDescriptor` in the meantime).

And so a good part of this commit is making sure that everything that
is created using a specific instance of `SSTableContext` continues
using the same files/version for all it's lifetime, and same for
`SSTableIndex`. Effectively, this means capturing the proper perSStable
"group" in `SSTableIndex` and proper perIndex "group" in SSTableIndex
and make sure that is what is used, without indirectly going to the
now mutable IndexDescriptor.
michaelsembwever pushed a commit to thelastpickle/cassandra that referenced this pull request Jan 7, 2026
Fix NodeRestartTest and SnapshotTest. They were failing due to merge conflicts between apache#1099 and CASSANDRA-18714.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

javadoc fixing javadoc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants