Add append method for InMemoryDataset
#327
Force-pushed 7d547d3 to 8f67094.
Just adding tests from this point on.
Sorry, the failure above is unrelated to the changes in this PR; git bisect points to 7662655. I will open a separate issue.
Force-pushed 8f67094 to 792bb66.
I was wondering where we track which part of a chunk is reused in the commit, so I tried this out and got an error.
Looks like there was an indexing issue when appending multidimensional datasets; I would have thought this would be caught by […]. The issue was that, in the process of appending data to a dataset, we need to get the extant data from the dataset's […].
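To illustrate the bookkeeping involved (names, the row-based layout, and the chunk size below are illustrative assumptions, not the PR's actual code): appending into a partially filled last chunk means reading back the extant rows of that chunk and writing extant plus new data together, and getting the extant data wrong is exactly the kind of bug that only shows up for certain multidimensional shapes.

```python
CHUNK_ROWS = 4  # assumed chunk size along the append axis

def append_rows(last_chunk, n_used, new_rows):
    """Hypothetical helper: fill the free rows of the last raw chunk.

    last_chunk : list of CHUNK_ROWS row-lists (the raw chunk buffer)
    n_used     : how many rows of last_chunk hold real data
    new_rows   : list of rows to append
    Returns (updated_chunk, overflow) where overflow is whatever did
    not fit and must go into a freshly allocated chunk.
    """
    free = CHUNK_ROWS - n_used
    fits, overflow = new_rows[:free], new_rows[free:]
    updated = [row[:] for row in last_chunk]  # copy the extant data first
    for i, row in enumerate(fits):
        updated[n_used + i] = row
    return updated, overflow
```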
Force-pushed cce87eb to b331391.
I updated with the latest changes and the code snippet I had posted no longer fails, but it now stores incorrect values.
Force-pushed 1be19a6 to a368976.
Force-pushed a368976 to 24326ae.
For certain sizes of multidimensional datasets, appending could result in multiple appends targeting the same raw chunk. This happened in your example above: the erroneous data you saw came from one append subchunk overwriting the raw data needed by another subchunk. I've added a new branch so that only one append is allowed to target each individual raw chunk.
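A minimal sketch of that branch (the function name, the `NEW_CHUNK` sentinel, and the data model are assumptions for illustration, not the PR's code): the first pending append may reuse a raw chunk, and any later append that targets the same raw chunk is redirected to a new chunk so it cannot overwrite data the first one still needs.

```python
def assign_targets(pending_appends):
    """Hypothetical helper deciding where each pending append goes.

    pending_appends: list of (raw_chunk_index, data) pairs.
    Returns (target, data) pairs where target is either the reused
    raw chunk index or the sentinel "NEW_CHUNK".
    """
    claimed = set()
    out = []
    for raw_chunk, data in pending_appends:
        if raw_chunk in claimed:
            # A second append into the same raw chunk would clobber
            # data the first append still needs, so allocate fresh space.
            out.append(("NEW_CHUNK", data))
        else:
            claimed.add(raw_chunk)
            out.append((raw_chunk, data))
    return out

assign_targets([(0, "a"), (1, "b"), (0, "c")])
# → [(0, "a"), (1, "b"), ("NEW_CHUNK", "c")]
```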
Here's some more corrupted data (with the latest changes).
This also silently corrupts older versions.
> referenced regions (i.e. it's unoccupied raw space)
Yep, I started adding a check for this earlier today and just finished it. The check ensures that no chunk in a previous version points to the space in the raw dataset we are targeting for the append; if there is preexisting data, we instead write to a new chunk.
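That check amounts to an overlap test. As a sketch (the data layout here is a simplifying assumption: each version is a mapping from virtual chunk keys to `(start, stop)` slices into the raw dataset, which is not necessarily how the project stores this):

```python
def raw_space_is_free(target, versions):
    """Return True only if no chunk of any earlier version maps into
    the raw-dataset region target = (start, stop) that the append
    wants to reuse; otherwise the append must go to a new chunk.
    """
    t_start, t_stop = target
    for version in versions:
        for raw_start, raw_stop in version.values():
            # Half-open intervals [a, b) overlap iff a < d and c < b.
            if raw_start < t_stop and t_start < raw_stop:
                return False
    return True
```

With this guard, unoccupied raw space can still be reclaimed, while any region an older version depends on is left untouched.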
Closing for now.
Following up from #313, this PR takes a different approach to implementing an `append` method: appends are executed right away, rather than following the lazy execution approach proposed in #313 (Detect cases where unused chunk space can be written to). This is the same scheme that the rest of the code base currently uses. I still needed to add an `AppendData` object because `write_dataset_chunks` currently only writes data into new chunks at commit time. `AppendData` instead stores the data to append, the raw indices where the data should be written, and the corresponding virtual slice that it's a part of.
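As a hedged sketch of what such a record might hold (the field names are guesses based on the description above, not the actual class definition in the PR):

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass
class AppendData:
    """Illustrative stand-in for the AppendData object described above:
    it carries everything needed to place the appended values into raw
    chunk space at commit time.
    """
    data: Any                         # the values being appended
    raw_indices: Tuple[slice, ...]    # where in the raw dataset to write
    virtual_slice: Tuple[slice, ...]  # the virtual region this data backs

# Example: three values appended at raw offsets 7:10, backing the
# virtual region 97:100 of the dataset.
record = AppendData(
    data=[10, 11, 12],
    raw_indices=(slice(7, 10),),
    virtual_slice=(slice(97, 100),),
)
```

Keeping the raw target and the virtual slice together is what lets an eager append be reconciled with `write_dataset_chunks`, which otherwise only allocates new chunks at commit time.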