Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Ensure arrow sizing is correct / 64 byte aligned #2426

Merged
merged 10 commits into from
Jul 2, 2024

Conversation

HammadB
Copy link
Collaborator

@HammadB HammadB commented Jun 28, 2024

Description of changes

Summarize the changes made by this PR.

Test plan

How are these changes tested?
Updated all tests in delta.rs to save, load and check the sizes match

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

None

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@HammadB
Copy link
Collaborator Author

HammadB commented Jun 28, 2024

Using the new arrow version, thanks to apache/arrow-rs#5554, I am able to get arrow to written buffers with padding despite being contiguous. I have done a lot of manual verification the data is laid out as we expect on disk.

By manually reading the arrow file using their lower level apis, I can actually parse out the padded size of each buffer and validate that they are indeed 64 byte aligned by using the metadata about the buffers. I have a protoype version of that validation here - https://github.com/chroma-core/chroma/pull/2426/files#diff-03bcd4f01acfa68c46fcc974d3722fa621056b4e0f908d708a2d15028b0e99b1R336 and everything looks correct.

This lets me change the code in get_size of a block to not use arrows internal sizing methods get_array_memory_size or get_buffer_memory_size. These will always report the capacity of the buffers and also if the buffers are shared, the code is just wrong and will overreport as mentioned above. Instead, I look at the len() of each buffer and then manually pad it to 64 bytes. This is a loose coupling, it requires that when we write to buffers, they are padded to 64 bytes, and that when we read from arrow, the buffers are padded to 64 bytes. Otherwise the size computation assuming you can just pad the len to the nearest 64 byte boundary is wrong.

The concern is then what if we change our code to stop allocating on 64 byte boundaries somehow, or arrow changes the alignment without us knowing.

My proposal for this is to write tests for our block construction that will fail if we aren't on 64 byte boundaries - thats easy to ensure.

This leads me to my main question - for arrow writing to files we have two options. Either I can put that validation code into the load block code path, and error if I read a file thats not aligned, OR I can just write tests that ensure that when we read a block, its correct. Which do people prefer?

for column in self.data.columns() {
let column_size = column.get_buffer_memory_size();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this breaks if the underlying buffers are shared, which is true for if we load from Arrow IPC, so we see a size that is way larger than what it actually is.

@HammadB HammadB changed the title [BUG][WIP] Ensure arrow sizing is correct / 64 byte aligned [BUG] Ensure arrow sizing is correct / 64 byte aligned Jul 1, 2024
@HammadB HammadB marked this pull request as ready for review July 1, 2024 21:18
@HammadB HammadB requested a review from sanketkedia July 1, 2024 21:18
@@ -55,6 +55,7 @@ opentelemetry = { version = "0.19.0", default-features = false, features = [
opentelemetry-otlp = "0.12.0"
shuttle = "0.7.1"
regex = "1.10.5"
flatbuffers = "24.3.25"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have to add this for my manual reading of the footer

/// Returns the size of the block in bytes
pub(crate) fn get_size(&self) -> usize {
let mut total_size = 0;
for column in self.data.columns() {
total_size += column.get_buffer_memory_size();
let array_data = column.to_data();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the critical change

let mut writer = std::io::BufWriter::new(file);
let writer = arrow::ipc::writer::FileWriter::try_new(&mut writer, &self.data.schema());
let options =
match arrow::ipc::writer::IpcWriteOptions::try_new(64, false, MetadataVersion::V5) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we enable 64 byte alignment manually (it is the default too)

}

pub fn from_bytes(bytes: &[u8], id: Uuid) -> Result<Self, Box<dyn ChromaError>> {
pub fn from_bytes_with_validation(bytes: &[u8], id: Uuid) -> Result<Self, BlockLoadError> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the with_validation methods for both from_bytes and load are meant to be in testing only

@HammadB HammadB requested a review from Ishiihara July 1, 2024 21:26
@HammadB HammadB merged commit b11d029 into main Jul 2, 2024
65 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant