Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset/tensor info alongside meta #1066

Merged
merged 38 commits into from Jul 21, 2021
Merged

dataset/tensor info alongside meta #1066

merged 38 commits into from Jul 21, 2021

Conversation

verbose-void
Copy link
Contributor

@verbose-void verbose-void commented Jul 19, 2021

🚀 🚀 Pull Request

info vs meta

  • meta is all information hub creates/modifies. this information is more or less required for our datasets/tensors to function.
  • info is all information users (or hub optionally) creates/modifies. if we deleted info, all tensors/datasets should still operate totally fine.

Allows the user to do:

ds = ...
ds.info.update(my_key=0)  # also works for all tensors on the dataset, not just the dataset
ds.info.update({"another_key": [1, 2, "test"]})

# new script
ds = ...
print(ds.info)  # {"my_key": 0, "another_key": [1, 2, "test"]}

no need for with or ds.flush. although, if a user does make changes to the items without calling ds.update, they will not be persistent. to do this, they can still call ds.info.update().

if they make changes inside a with block, it will still flush and persist their changes.

ds = ...
t = ds.create_tensor("tensor")

t.update(key=[1, 2, 3])

t_list = t.info.key
with ds:
    t_list.extend(["hi", 1])


# new script
ds = ...
print(ds.tensor.info.key)  # {"key": [1, 2, 3, "hi", 1]}

What's new?

added a new CallbackCachable class that is based off Cachable. This is only used by the Info class (also new) and exists in the API only.

we may get rid of Cachable and CallbackCachable in the future, for now while reviewing i will not be removing cachables in this PR.

Info is a purely API class. this also means that callbacks are OK because they will be used infrequently.

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage-rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

hub/constants.py Outdated
Comment on lines 42 to 44
# info is 100% optional user-defined information
DATASET_INFO_FILENAME = "dataset_info.json"
TENSOR_INFO_FILENAME = "tensor_info.json"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 additional small files -- this means that in total we will have the following small files:

  • dataset meta/dataset info (2)
  • tensor info/tensor info (2 * num_tensors)
  • chunk_ids (1 * num_tensors)

for a dataset with 2 tensors, there will be 8 small files.
for a dataset with 4 tensors there will be 14 small files.

the more tensors we add, the more small files they will contain. this isn't a problem now, but we should keep it in mind as we proceed.

@@ -108,8 +108,10 @@ def update_headers(self, incoming_num_bytes: int, sample_shape: Tuple[int]):
self.shapes_encoder.add_shape(sample_shape, 1)
self.byte_positions_encoder.add_byte_position(num_bytes_per_sample, 1)

def __len__(self):
@property
def nbytes(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced len with nbytes for all cachables so that cachable classes can have their len computed without returning the num bytes.

@@ -100,6 +100,7 @@ def local_path(request):
return

path = _get_storage_path(request, LOCAL)
LocalProvider(path).clear()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests will fail if --keep-storage is called on the previous test run. this fixes it

@verbose-void verbose-void marked this pull request as ready for review July 19, 2021 19:38
@codecov
Copy link

codecov bot commented Jul 19, 2021

Codecov Report

Merging #1066 (f447bda) into main (65a825e) will increase coverage by 0.25%.
The diff coverage is 95.06%.

❗ Current head f447bda differs from pull request most recent head 19160c5. Consider uploading reports for the commit 19160c5 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1066      +/-   ##
==========================================
+ Coverage   89.98%   90.23%   +0.25%     
==========================================
  Files          95       98       +3     
  Lines        4244     4436     +192     
==========================================
+ Hits         3819     4003     +184     
- Misses        425      433       +8     
Impacted Files Coverage Δ
hub/util/json.py 42.85% <42.85%> (ø)
hub/core/storage/cachable.py 86.27% <87.50%> (-0.40%) ⬇️
hub/api/info.py 95.12% <95.12%> (ø)
hub/api/dataset.py 94.80% <100.00%> (+0.68%) ⬆️
hub/api/tensor.py 95.65% <100.00%> (+0.12%) ⬆️
hub/api/tests/test_info.py 100.00% <100.00%> (ø)
hub/api/tests/test_readonly.py 100.00% <100.00%> (ø)
hub/constants.py 100.00% <100.00%> (ø)
hub/core/chunk.py 95.91% <100.00%> (+0.08%) ⬆️
hub/core/chunk_engine.py 95.65% <100.00%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65a825e...19160c5. Read the comment docs.

Copy link
Contributor

@AbhinavTuli AbhinavTuli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes. Should be good to merge once addressed.

hub/core/storage/cachable.py Outdated Show resolved Hide resolved
hub/util/json.py Outdated Show resolved Hide resolved
hub/constants.py Outdated Show resolved Hide resolved
@verbose-void verbose-void merged commit 35224df into main Jul 21, 2021
@verbose-void verbose-void deleted the task/udpate-meta-api branch July 21, 2021 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants