NPZ replacement format (only) #1047

Merged · 97 commits · Jul 14, 2021

Commits
dbd83e3
initial
farizrahman4u Jul 7, 2021
557df26
format
farizrahman4u Jul 7, 2021
95ce176
typo
farizrahman4u Jul 7, 2021
950ac7c
typo
farizrahman4u Jul 7, 2021
b55fc38
bug fix
farizrahman4u Jul 7, 2021
b0f7d88
some docs and fix 1D shapes
verbose-void Jul 7, 2021
3f34045
Merge branch 'fr_optimizations' of github.com:activeloopai/Hub into f…
verbose-void Jul 7, 2021
0c83842
Merge branch 'refactor/2.0/chunk-engine' of github.com:activeloopai/H…
verbose-void Jul 7, 2021
64c3037
Merge branch 'refactor/2.0/chunk-engine' of https://www.github.com/ac…
farizrahman4u Jul 7, 2021
2281692
add assertion for easy debugging
farizrahman4u Jul 7, 2021
4039fe9
Merge branch 'fr_optimizations' of https://www.github.com/activeloopa…
farizrahman4u Jul 7, 2021
26cd377
one off
farizrahman4u Jul 7, 2021
fadc940
segfault fix
farizrahman4u Jul 7, 2021
035bacc
smol fixes
farizrahman4u Jul 7, 2021
1eddf74
add clear cache to memory test in api and fix return in `decode`
verbose-void Jul 7, 2021
3a409e1
add a better exception for pointer GC
verbose-void Jul 7, 2021
0f58f6e
merge conf fix + infer chunk byte size
farizrahman4u Jul 8, 2021
ef11252
smol fix
farizrahman4u Jul 8, 2021
2fb777a
Merge branch 'fr_optimizations' of https://www.github.com/activeloopa…
farizrahman4u Jul 8, 2021
57d2da7
all fix
farizrahman4u Jul 8, 2021
4dde08a
chuunk id optims init
farizrahman4u Jul 8, 2021
df03495
debug msgs
farizrahman4u Jul 8, 2021
71ee06c
fix refcounting bug
farizrahman4u Jul 8, 2021
44b5ade
ren shards->data
farizrahman4u Jul 8, 2021
6f086e2
faster buff load
farizrahman4u Jul 8, 2021
31aa04d
save 1 memcpy
farizrahman4u Jul 8, 2021
e98b008
indexing
farizrahman4u Jul 8, 2021
3b68c57
cache data len
farizrahman4u Jul 8, 2021
2d91772
cache _num_chunks
farizrahman4u Jul 8, 2021
dcea7cc
chunk engine updates cache size
verbose-void Jul 9, 2021
04276f0
rename `remove` -> `remove_from_dirty`
verbose-void Jul 9, 2021
bad4186
Merge branch 'optimize/uploads' into fr_optimizations
verbose-void Jul 9, 2021
36002e2
remove some `sum`s
verbose-void Jul 9, 2021
fea211d
optims for seq access
farizrahman4u Jul 9, 2021
c9bc44a
Merge branch 'fr_optimizations' of https://www.github.com/activeloopa…
farizrahman4u Jul 9, 2021
1dfb3c2
cache entry
farizrahman4u Jul 9, 2021
c8a9931
10s upload speedup
verbose-void Jul 9, 2021
16d5fb9
fix mypy
verbose-void Jul 9, 2021
96280b7
Merge branch 'optimize/uploads' of github.com:activeloopai/Hub into f…
verbose-void Jul 9, 2021
121753d
load chunk ID encoder
verbose-void Jul 9, 2021
7c2221e
mypass binsearch
farizrahman4u Jul 9, 2021
60b73d3
format
farizrahman4u Jul 9, 2021
07456b5
merge conf fix
farizrahman4u Jul 9, 2021
8cb1bce
rem debug line
farizrahman4u Jul 9, 2021
91596e4
fr_optimizations_2
farizrahman4u Jul 10, 2021
9c3c6d5
Merge pull request #1039 from activeloopai/fr_optimizations_2
farizrahman4u Jul 10, 2021
2f28314
optimize tensor iteration
farizrahman4u Jul 11, 2021
7fd0193
dsiter
farizrahman4u Jul 11, 2021
b63c565
fix test
farizrahman4u Jul 11, 2021
ae3c17d
fix test
farizrahman4u Jul 11, 2021
069a9f6
ds iter fixes
farizrahman4u Jul 11, 2021
d0306a5
tests
farizrahman4u Jul 11, 2021
ec4516b
test
farizrahman4u Jul 11, 2021
f6d71f0
format
farizrahman4u Jul 11, 2021
1a107ca
pytorch training optims
farizrahman4u Jul 11, 2021
0cefcd2
rem bad checks
farizrahman4u Jul 11, 2021
318d496
format
farizrahman4u Jul 12, 2021
b1591c4
format + smoll change in encoding format
farizrahman4u Jul 12, 2021
8f475a9
minimize searchsorted calls
farizrahman4u Jul 12, 2021
a7dd7f1
refac chunk_id.py
farizrahman4u Jul 13, 2021
746201c
more refacc
farizrahman4u Jul 13, 2021
440a0b7
encode_*->serialize_*
farizrahman4u Jul 13, 2021
ceae226
Update hub/core/chunk.py
farizrahman4u Jul 13, 2021
88ab4bb
docstring
farizrahman4u Jul 13, 2021
f3c891c
Merge branch 'fr_optimizations' of https://www.github.com/activeloopa…
farizrahman4u Jul 13, 2021
3f3af34
Merge branch 'main' into fr_optimizations
farizrahman4u Jul 13, 2021
4b29507
docstring
farizrahman4u Jul 13, 2021
5093220
Merge branch 'fr_optimizations' of https://www.github.com/activeloopa…
farizrahman4u Jul 13, 2021
fb46618
merge main
farizrahman4u Jul 13, 2021
5b38608
rm comments
farizrahman4u Jul 13, 2021
4f25b21
rm unused import
farizrahman4u Jul 13, 2021
82ce5be
revert dataset.py
farizrahman4u Jul 13, 2021
35d3a4a
revert tensor.py
farizrahman4u Jul 13, 2021
cb4ea21
revert test_api.py
farizrahman4u Jul 13, 2021
868004e
revert ChunkEngine.numpy
farizrahman4u Jul 13, 2021
c7a1321
revert read_sample_from_chunk
farizrahman4u Jul 13, 2021
6519986
rem unreachable
farizrahman4u Jul 13, 2021
0ddc61b
remove iter logic
farizrahman4u Jul 13, 2021
609ea67
revert pytorch.py
farizrahman4u Jul 13, 2021
7823060
revert dataset.py
farizrahman4u Jul 13, 2021
390151b
reverts
farizrahman4u Jul 13, 2021
77ca4f7
revert tensor.py
farizrahman4u Jul 13, 2021
e3ab3bf
reverts
farizrahman4u Jul 13, 2021
0e5e84c
fixes
farizrahman4u Jul 13, 2021
0ee485d
add chunk size tests
verbose-void Jul 13, 2021
7df6b7c
Merge branch 'fr_serialization' of github.com:activeloopai/Hub into f…
verbose-void Jul 13, 2021
d16d550
fixes
farizrahman4u Jul 14, 2021
a2fa181
Merge branch 'fr_serialization' of https://www.github.com/activeloopa…
farizrahman4u Jul 14, 2021
1b0973a
fixes
farizrahman4u Jul 14, 2021
5895438
rem assert
farizrahman4u Jul 14, 2021
f15d71c
test chunk sizes on memds only
farizrahman4u Jul 14, 2021
0d7e9f2
Update hub/core/serialize.py
farizrahman4u Jul 14, 2021
8310e36
Update hub/core/serialize.py
farizrahman4u Jul 14, 2021
3a8ccc8
Update hub/core/serialize.py
farizrahman4u Jul 14, 2021
d9a846b
Update hub/core/serialize.py
farizrahman4u Jul 14, 2021
8c3b83b
rem assertions
farizrahman4u Jul 14, 2021
7007c59
Merge branch 'fr_serialization' of https://www.github.com/activeloopa…
farizrahman4u Jul 14, 2021
121 changes: 121 additions & 0 deletions hub/api/tests/test_chunk_sizes.py
@@ -0,0 +1,121 @@
import numpy as np
from hub.constants import KB


def _update_chunk_sizes(ds, max_chunk_size: int):
"""Updates all chunk sizes for tensors that already exist in `ds`. If
more tensors are created after calling this method, those tensors will NOT have
the same chunk size.
"""

# TODO: set / update chunk sizes API (to replace this function)

min_chunk_size = max_chunk_size // 2

for tensor in ds.tensors.values():
chunk_engine = tensor.chunk_engine

chunk_engine.max_chunk_size = max_chunk_size
chunk_engine.min_chunk_size = min_chunk_size


def _assert_num_chunks(tensor, expected_num_chunks):
chunk_engine = tensor.chunk_engine
actual_num_chunks = chunk_engine.chunk_id_encoder.num_chunks
assert actual_num_chunks == expected_num_chunks


def _create_tensors(ds):
images = ds.create_tensor("images", htype="image", sample_compression=None)
labels = ds.create_tensor("labels", htype="class_label")
return images, labels


def _append_tensors(images, labels):
for i in range(100):
x = np.ones((28, 28), dtype=np.uint8) * i
y = np.uint32(i)

images.append(x)
labels.append(y)


def _extend_tensors(images, labels):
images.extend(np.ones((100, 28, 28), dtype=np.uint8))
labels.extend(np.ones(100, dtype=np.uint32))


def test_append(memory_ds):
ds = memory_ds
images, labels = _create_tensors(ds)
_update_chunk_sizes(ds, 32 * KB)

_append_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 5)

_append_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 10)

_append_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 15)

assert len(ds) == 300


def test_extend(memory_ds):
ds = memory_ds
images, labels = _create_tensors(ds)

_update_chunk_sizes(ds, 32 * KB)

_extend_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 5)

_extend_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 10)

_extend_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 15)

assert len(ds) == 300


def test_extend_and_append(memory_ds):
ds = memory_ds
images, labels = _create_tensors(ds)

_update_chunk_sizes(ds, 32 * KB)

_extend_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 5)

_append_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 10)

_extend_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 15)

_append_tensors(images, labels)

_assert_num_chunks(labels, 1)
_assert_num_chunks(images, 20)

assert len(ds) == 400
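
For context on the counts asserted above: each 28×28 uint8 image is 784 bytes and each uint32 label is 4 bytes, so image data dominates. The sketch below is plain Python arithmetic, independent of Hub; the 20-samples-per-chunk figure is an inference from the expected counts, not something read out of the chunk engine.

# Back-of-the-envelope check of the chunk counts asserted in the tests above.
KB = 1024
max_chunk_size = 32 * KB
min_chunk_size = max_chunk_size // 2      # 16 KB, mirroring _update_chunk_sizes

image_nbytes = 28 * 28                    # one uint8 image -> 784 bytes
label_nbytes = 4                          # one uint32 label -> 4 bytes

# Labels: even 400 appends total only 1,600 bytes, far below 16 KB,
# which is why every test expects the labels tensor to stay at 1 chunk.
assert 400 * label_nbytes < min_chunk_size

# Images: 20 samples fit under the 16 KB threshold, 21 do not ...
assert 20 * image_nbytes <= min_chunk_size < 21 * image_nbytes

# ... so 100 images per batch -> 5 chunks, matching _assert_num_chunks(images, 5).
print(100 * image_nbytes / 5)             # 15680.0 bytes per chunk (~15.3 KB)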
3 changes: 1 addition & 2 deletions hub/constants.py
@@ -40,10 +40,9 @@

CHUNKS_FOLDER = "chunks"

CHUNK_EXTENSION = "npz"
ENCODED_CHUNK_NAMES_FOLDER = "chunks_index"
# unsharded naming will help with backwards compatibility
ENCODED_CHUNK_NAMES_FILENAME = f"unsharded.{CHUNK_EXTENSION}"
ENCODED_CHUNK_NAMES_FILENAME = f"unsharded"

ENCODING_DTYPE = np.uint32
# calculate the number of bits to shift right when converting a 128-bit uuid into `ENCODING_DTYPE`
39 changes: 17 additions & 22 deletions hub/core/chunk.py
@@ -8,6 +8,8 @@
from hub.core.meta.encode.shape import ShapeEncoder
from hub.core.meta.encode.byte_positions import BytePositionsEncoder

from hub.core.serialize import serialize_chunk, deserialize_chunk, infer_chunk_num_bytes


class Chunk(Cachable):
def __init__(
@@ -108,31 +110,24 @@ def update_headers(self, incoming_num_bytes: int, sample_shape: Tuple[int]):

def __len__(self):
"""Calculates the number of bytes `tobytes` will be without having to call `tobytes`. Used by `LRUCache` to determine if this chunk can be cached."""

shape_nbytes = self.shapes_encoder.nbytes
range_nbytes = self.byte_positions_encoder.nbytes
error_bytes = 32 # to account for any extra delimiters/metadata that `np.savez` may create in excess
return shape_nbytes + range_nbytes + self.num_data_bytes + error_bytes
return infer_chunk_num_bytes(
hub.__version__,
self.shapes_encoder.array,
self.byte_positions_encoder.array,
len_data=len(self._data),
)

def tobytes(self) -> memoryview:
out = BytesIO()

# TODO: for fault tolerance, we should have a chunk store the ID for the next chunk
# TODO: in case the index chunk meta gets pwned (especially during a potentially failed transform job merge)

np.savez(
out,
version=hub.__encoded_version__,
shapes=self.shapes_encoder.array,
byte_positions=self.byte_positions_encoder.array,
data=np.frombuffer(self.memoryview_data, dtype=np.uint8),
return serialize_chunk(
hub.__version__,
self.shapes_encoder.array,
self.byte_positions_encoder.array,
[self._data],
)
out.seek(0)
return out.getbuffer()

@classmethod
def frombuffer(cls, buffer: bytes):
bio = BytesIO(buffer)
npz = np.load(bio)
data = memoryview(npz["data"].tobytes())
return cls(npz["shapes"], npz["byte_positions"], data=data)
if not buffer:
return cls()
version, shapes, byte_positions, data = deserialize_chunk(buffer)
return cls(shapes, byte_positions, data=data)
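
The `serialize_chunk`, `deserialize_chunk`, and `infer_chunk_num_bytes` helpers used above live in `hub/core/serialize.py`, which is not shown in this diff. As a rough illustration of the kind of headerized binary layout that can replace `np.savez` here (version string, the two encoder arrays, then raw sample bytes), a minimal standalone sketch follows; the field order, byte widths, and function names are assumptions for illustration, not the actual Hub format.

# Minimal standalone sketch of a headerized chunk layout of the kind this PR
# introduces in hub/core/serialize.py. Field order and widths are illustrative
# assumptions, not the actual Hub format.
import numpy as np

ENCODING_DTYPE = np.uint32


def serialize_chunk_sketch(version, shapes, byte_positions, data):
    """Concatenate: version header, the two encoder arrays, raw sample bytes."""
    version_bytes = version.encode("ascii")
    parts = [bytes([len(version_bytes)]), version_bytes]
    for arr in (shapes, byte_positions):
        arr = np.ascontiguousarray(arr, dtype=ENCODING_DTYPE)
        # store (nrows, ncols) so the 2-D encoder array can be rebuilt on load
        parts.append(np.asarray(arr.shape, dtype=np.uint32).tobytes())
        parts.append(arr.tobytes())
    parts.append(bytes(data))
    return b"".join(parts)


def deserialize_chunk_sketch(buffer):
    """Inverse of serialize_chunk_sketch."""
    offset = 1 + buffer[0]
    version = bytes(buffer[1:offset]).decode("ascii")
    arrays = []
    for _ in range(2):
        nrows, ncols = np.frombuffer(buffer, dtype=np.uint32, count=2, offset=offset)
        offset += 8
        count = int(nrows) * int(ncols)
        arr = np.frombuffer(buffer, dtype=ENCODING_DTYPE, count=count, offset=offset)
        arrays.append(arr.reshape(int(nrows), int(ncols)))
        offset += count * 4
    shapes, byte_positions = arrays
    return version, shapes, byte_positions, memoryview(buffer)[offset:]

Round-tripping, for example, serialize_chunk_sketch("2.0.0", np.zeros((1, 3), dtype=np.uint32), np.zeros((1, 3), dtype=np.uint32), b"abc") through deserialize_chunk_sketch recovers the version, both arrays, and a view over the trailing b"abc" bytes without copying them.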
24 changes: 13 additions & 11 deletions hub/core/meta/encode/chunk_id.py
@@ -6,6 +6,8 @@
from typing import Optional, Tuple
import numpy as np
from uuid import uuid4
from hub.core.serialize import serialize_chunkids, deserialize_chunkids


# these constants are for accessing the data layout. see the `ChunkIdEncoder` docstring.
CHUNK_ID_INDEX = 0
@@ -71,13 +73,11 @@ def __init__(self):
self._encoded_ids = None

def tobytes(self) -> memoryview:
bio = BytesIO()
np.savez(
bio,
version=hub.__encoded_version__,
ids=self._encoded_ids,
)
return bio.getbuffer()
if self._encoded_ids is None:
return serialize_chunkids(
hub.__version__, [np.array([], dtype=ENCODING_DTYPE)]
)
return serialize_chunkids(hub.__version__, [self._encoded_ids])

@staticmethod
def name_from_id(id: ENCODING_DTYPE) -> str:
@@ -102,9 +102,11 @@ def get_name_for_chunk(self, chunk_index: int) -> str:
@classmethod
def frombuffer(cls, buffer: bytes):
instance = cls()
bio = BytesIO(buffer)
npz = np.load(bio)
instance._encoded_ids = npz["ids"]
if not buffer:
return instance
version, ids = deserialize_chunkids(buffer)
if ids.nbytes:
instance._encoded_ids = ids
return instance

@property
@@ -117,7 +119,7 @@ def num_chunks(self) -> int:
def num_samples(self) -> int:
if self._encoded_ids is None:
return 0
return int(self._encoded_ids[-1, LAST_INDEX_INDEX] + 1)
return int(self._encoded_ids[-1, LAST_INDEX_INDEX]) + 1

def generate_chunk_id(self) -> ENCODING_DTYPE:
"""Generates a random 64bit chunk ID using uuid4. Also prepares this ID to have samples registered to it.
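
A small, hypothetical round trip tying the above together (assuming a Hub build with this PR merged; the calls and the empty-buffer behavior are taken from the diff rather than verified against a released version):

# Hypothetical round trip for the empty-state handling added above.
from hub.core.meta.encode.chunk_id import ChunkIdEncoder

enc = ChunkIdEncoder()
assert enc.num_samples == 0

# An encoder with no registered chunks still serializes (as an empty ids array) ...
buf = enc.tobytes()

# ... and an empty buffer deserializes into a fresh, empty encoder.
assert ChunkIdEncoder.frombuffer(b"").num_samples == 0

# Deserializing the serialized empty encoder is also empty, because
# `frombuffer` only keeps `ids` when `ids.nbytes` is non-zero.
assert ChunkIdEncoder.frombuffer(bytes(buf)).num_samples == 0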