-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Metadata segment reader/writer #2211
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
} | ||
} | ||
|
||
fn apply_log_chunk( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to self: we should remove this
rust/worker/src/types/segment.rs
Outdated
@@ -24,6 +25,7 @@ impl From<SegmentType> for String { | |||
SegmentType::Record => "urn:chroma:segment/record".to_string(), | |||
SegmentType::Sqlite => "urn:chroma:segment/metadata/sqlite".to_string(), | |||
SegmentType::BlockfileMetadata => "urn:chroma:segment/metadata/blockfile".to_string(), | |||
SegmentType::Metadata => "urn:chroma:segment/metadata".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think "urn:chroma:segment/metadata/blockfile" is what we want to use? maybe im confused.
Segment scopes are Record/ Vector / Metadata. Types are concrete impls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, that is what I want. While we're here, though, what exactly is this for? Is this a storage location of some sort or?
@@ -23,37 +27,40 @@ impl ChromaError for FullTextIndexError { | |||
pub(crate) struct FullTextIndexWriter { | |||
posting_lists_blockfile_writer: BlockfileWriter, | |||
frequencies_blockfile_writer: BlockfileWriter, | |||
tokenizer: Box<dyn ChromaTokenizer>, | |||
// This is a crime. | |||
tokenizer: Arc<Mutex<Box<dyn ChromaTokenizer>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could make this thread local if it turns out tokenization sync is bad
} | ||
|
||
impl SegmentWriter for MetadataSegmentWriter { | ||
fn apply_materialized_log_chunk( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to self: This needs to be audited again after materializedLogRecordV2 is merged. Also, TBD add support for updates/deletes/upserts here
MetadataValue::Int(value) => { | ||
match &self.u32_metadata_index_writer { | ||
Some(writer) => { | ||
let _ = writer.set( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this diff but one thing that came to my mind is that we don't write atomically across the segments today i.e. it can happen that the compactor node crashes after writing for e.g. to the record segment but before writing to the metadata segment, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capturing our conversation here for posterity:
Yes, that can happen. We designed for this and that's why the flush protocol is: write new blockfiles, update sysdb to point to new blockfiles, purge logs. If we crash anywhere in there we're still correct.
## Description of changes *Summarize the changes made by this PR.* - New functionality - Metadata segment reader/writer outline. They don't yet have methods to add and query stuff. - Note: We don't have the requisite types in Rust to represent full-text or metadata queries yet -- since I'm ooo tomorrow, putting this PR up now for a review as-is. ## Test plan *How are these changes tested?* - [x] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust ## Documentation Changes *Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs repository](https://github.com/chroma-core/docs)?*
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?