[AL-2290] DeepLake VectorStore updates #2375

Merged: 141 commits, merged Jun 6, 2023

Changes from 132 commits

Commits (141)
2c78b30
First commit
adolkhan May 25, 2023
10cfc8c
Changes
adolkhan May 25, 2023
038be68
Fixed tests
adolkhan May 25, 2023
2b77f36
Refactored vector search and added support for TQL metadata filtering
istranic May 25, 2023
6d16bd1
fixed case when result is empty in python search
istranic May 25, 2023
9e2e3b6
deleted vector search because it is no longer used
istranic May 26, 2023
e80f9e4
Fixed bad import
istranic May 26, 2023
266629f
Fixed issue and tests
istranic May 26, 2023
d4da373
Fixed more tests
istranic May 26, 2023
96654b6
more fixes
istranic May 26, 2023
0e0f30b
Updated docstrings
istranic May 26, 2023
db492b5
Fixed .tensors and .summary() for virtual tensor datasets.
khustup May 26, 2023
ba67c5b
Fix tests.
khustup May 26, 2023
af1b316
Merge branch 'filter_improvements' into VectorStoreSmallUpdates
adolkhan May 26, 2023
63b9ba5
Fixes
istranic May 26, 2023
4fb5ca0
Temp disable shuffle after random split.
khustup May 26, 2023
6eade03
Fixed .data() method.
khustup May 26, 2023
6b80c4d
Cleaned up parsing
istranic May 26, 2023
8087e39
fixes
istranic May 26, 2023
09b9c1e
Refactor
adolkhan May 26, 2023
4000d11
Fixed tensors.
khustup May 26, 2023
58814e3
Enabled test for shuffle+random_split.
khustup May 26, 2023
45dccdb
Better codecov
istranic May 26, 2023
a46f491
Final fixes.
khustup May 26, 2023
cfb269a
Merge branch 'virtual-tensors-cleanup' of https://github.com/activelo…
istranic May 26, 2023
b4d2ae5
Fixed black.
khustup May 26, 2023
3268291
fixes
istranic May 26, 2023
c66ec6b
fixes
istranic May 26, 2023
6e234be
Fixed reporting position in query API
istranic May 26, 2023
b65009d
Raw data in remote query
ProgerDav May 28, 2023
084c107
Error message fixes
ProgerDav May 29, 2023
f30b63d
Fix types and lint
ProgerDav May 29, 2023
25364a9
Docstring fix
ProgerDav May 29, 2023
e93115b
refactors
adolkhan May 29, 2023
b8ce70b
Merge branch 'virtual-tensors-cleanup' into VectorStoreSmallUpdates
adolkhan May 29, 2023
2b5a2e4
Merge branch 'filter_improvements' into VectorStoreSmallUpdates
adolkhan May 29, 2023
64e576d
test fixes
adolkhan May 29, 2023
5427287
Fixes and updates
istranic May 29, 2023
cc113eb
prelim commit
istranic May 29, 2023
51890cd
more fixes
istranic May 29, 2023
8b68133
more fixes
istranic May 29, 2023
9f02c4a
Merge branch 'main' of https://github.com/activeloopai/deeplake into …
istranic May 29, 2023
8483fe1
fixes
istranic May 29, 2023
d09e2c3
Fixes
istranic May 29, 2023
f58a81e
More fixes
istranic May 29, 2023
88e078c
Ran black .
istranic May 29, 2023
7142a5b
Resolved merge conflicts
istranic May 29, 2023
a855f70
Merge branch 'main' of https://github.com/activeloopai/deeplake into …
istranic May 29, 2023
eea741d
Merge branch 'VectorStoreSmallUpdates' of https://github.com/activelo…
istranic May 29, 2023
35e675c
improved tests
istranic May 30, 2023
609b8c4
Small linting fix
istranic May 30, 2023
f51e0e9
Several fixes and tests improvements
istranic May 30, 2023
bf16d28
minor fixes
istranic May 30, 2023
1d20cb5
test fixes
adolkhan May 29, 2023
eadf9b5
Ran black .
istranic May 29, 2023
5e39606
fix issue with ds.connect
levongh May 24, 2023
4b13d2e
review comments
levongh May 25, 2023
a9e11b0
update
FayazRahman May 26, 2023
af4df33
Bump version post release (#2381)
FayazRahman May 28, 2023
8762eb4
Merge branch 'main' into VectorStoreSmallUpdates
adolkhan May 30, 2023
a6e49f1
Support for adding custom tensors
adolkhan May 30, 2023
9ca1fb3
Resolved merge conflicts
istranic May 30, 2023
7187638
Add fix
adolkhan May 30, 2023
a0bb241
Added search exception
adolkhan May 30, 2023
dc97c91
fixed tests
adolkhan May 30, 2023
932b5ef
Resolved merge conflict
istranic May 30, 2023
32838f0
Fixes, better formatting, better docstrings
istranic May 30, 2023
984e5bd
Improved search_basic test
istranic May 30, 2023
5594f81
Fixed tests and other issues
istranic May 30, 2023
07cb165
resolved merge conflict
istranic May 30, 2023
e7e9965
Improving typing
istranic May 30, 2023
d3097ad
Fixed typing
istranic May 30, 2023
1d31553
Fixed bad variable name in test
istranic May 30, 2023
0e422c6
Fixed low level search tests
istranic May 31, 2023
0f09711
Fixed tests and added return_view
istranic May 31, 2023
e32b402
prelim commit
istranic May 31, 2023
653ef93
Fixed tests and errors
istranic Jun 1, 2023
8da40be
Improved deletion implementation and tests
istranic Jun 1, 2023
11602ee
WIP
adolkhan Jun 1, 2023
1dceede
Merge branch 'vs_query_finalization' into VectorStoreSmallUpdates
adolkhan Jun 1, 2023
a432505
Added new ingestion logic
adolkhan Jun 1, 2023
a9b00ea
Fixed tests and removed unnecessary params from some functions
istranic Jun 1, 2023
e8e30e3
improvements
adolkhan Jun 1, 2023
7cb85c1
Linting, typing, and docstring fixes
istranic Jun 1, 2023
d211d7e
fixed test delete
istranic Jun 1, 2023
e7247b6
New api support
adolkhan Jun 1, 2023
1b6b7f9
Merge branch 'main' into VectorStoreSmallUpdates
adolkhan Jun 1, 2023
1e4b547
Merge branch 'vs_query_finalization' into VectorStoreSmallUpdates
adolkhan Jun 1, 2023
94cc677
tests fix
adolkhan Jun 1, 2023
ea07aaa
data ingestion fix
adolkhan Jun 1, 2023
a04f674
Testing related changes
adolkhan Jun 1, 2023
083f2f2
Added support for default embedding tensor
istranic Jun 1, 2023
00a9248
Resolving conflicts
istranic Jun 1, 2023
cb64dad
Cleaned up docstrings and parameters
istranic Jun 2, 2023
157cc9c
Tweaked tests
istranic Jun 2, 2023
01f8cb1
Merge pull request #2400 from activeloopai/NextGenVectorStore
istranic Jun 2, 2023
364fa60
ingestion logic tweaks
adolkhan Jun 2, 2023
777c520
mypy fixes + examples
adolkhan Jun 2, 2023
96ae788
darglint fixes
adolkhan Jun 2, 2023
bb93411
test_parse_add_arguments fixes
adolkhan Jun 2, 2023
e0ea560
Added a few more parse_add_arguments tests
adolkhan Jun 2, 2023
5ff3f04
test fix
adolkhan Jun 2, 2023
d54ad85
Minor refactors and tests fixes
istranic Jun 2, 2023
97c4019
Added reporting to the vector store module
istranic Jun 2, 2023
8238b9c
tests
adolkhan Jun 2, 2023
84ac5dd
Merge branch 'VectorStoreSmallUpdates' of https://github.com/activelo…
adolkhan Jun 2, 2023
3f34fe8
Added support for images
istranic Jun 2, 2023
e001c76
Many small fixes to reporting, testing, and style
istranic Jun 3, 2023
04f4fa3
Minor fix
istranic Jun 3, 2023
a87f15a
fix
istranic Jun 3, 2023
587f219
Merge pull request #2406 from activeloopai/vs_images_support
istranic Jun 3, 2023
e35a547
improved docstrings
istranic Jun 3, 2023
48cce49
Fixed bug with ingestion
adolkhan Jun 4, 2023
f23f968
new tests
adolkhan Jun 4, 2023
9f80bb3
Apply suggestions from code review
adolkhan Jun 5, 2023
e1c5f0a
Review changes
adolkhan Jun 5, 2023
0796f49
upd
FayazRahman Jun 5, 2023
0455b2e
Refactoring based on review
adolkhan Jun 5, 2023
16fefbb
docs fix
adolkhan Jun 5, 2023
764050b
small fix
adolkhan Jun 5, 2023
71b7da5
Merge branch 'fy_vector_store_update' into VectorStoreSmallUpdates
adolkhan Jun 5, 2023
938427f
docs
FayazRahman Jun 5, 2023
4e8cf6b
Improving tests
adolkhan Jun 5, 2023
37efd70
Merge branch 'fy_vector_store_update' into VectorStoreSmallUpdates
adolkhan Jun 5, 2023
1c06fd7
Update deeplake/core/vectorstore/deeplake_vectorstore.py
istranic Jun 5, 2023
be1f189
Update deeplake/core/vectorstore/deeplake_vectorstore.py
istranic Jun 5, 2023
c036be9
increased codecov
adolkhan Jun 5, 2023
b92c22f
Merge branch 'VectorStoreSmallUpdates' of https://github.com/activelo…
adolkhan Jun 5, 2023
72216dd
tests + linting
adolkhan Jun 5, 2023
ad6e8ab
final fixes
adolkhan Jun 5, 2023
ff4c3b5
Merge branch 'main' into VectorStoreSmallUpdates
adolkhan Jun 5, 2023
4553b1c
refactor
adolkhan Jun 5, 2023
5c9ad55
black fix
adolkhan Jun 6, 2023
900e3e8
transforms fix
adolkhan Jun 6, 2023
75b197d
fix
FayazRahman Jun 6, 2023
baa38b4
ssh
adolkhan Jun 6, 2023
e1f38a9
ssh
adolkhan Jun 6, 2023
90ce2eb
ssh
adolkhan Jun 6, 2023
af429ea
ssh
adolkhan Jun 6, 2023
fcdfb7f
remove ssh
adolkhan Jun 6, 2023
ea7e4e2
increased threshold
adolkhan Jun 6, 2023
53 changes: 36 additions & 17 deletions deeplake/api/dataset.py
@@ -28,6 +28,7 @@
process_dataset_path,
get_path_type,
)
from deeplake.util.tensor_db import parse_runtime_parameters
from deeplake.hooks import (
dataset_created,
dataset_loaded,
@@ -199,7 +200,7 @@ def init(
if creds is None:
creds = {}

db_engine = (runtime or {}).get("db_engine", {})
db_engine = parse_runtime_parameters(path, runtime)["tensor_db"]

try:
storage, cache_chain = get_storage_and_cache_chain(
@@ -349,12 +350,12 @@ def empty(
"""Creates an empty dataset

Args:
path (str, pathlib.Path): - The full path to the dataset. Can be:
- a Deep Lake cloud path of the form ``hub://username/datasetname``. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line)
path (str, pathlib.Path): - The full path to the dataset. It can be:
- a Deep Lake cloud path of the form ``hub://org_id/dataset_name``. Requires registration with Deep Lake.
- an s3 path of the form ``s3://bucketname/path/to/dataset``. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ``./path/to/dataset`` or ``~/path/to/dataset`` or ``path/to/dataset``.
- a memory path of the form ``mem://path/to/dataset`` which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
runtime (dict): Parameters for Activeloop DB Engine. Only applicable for hub:// paths.
runtime (dict): Parameters for creating a dataset in the Deep Lake Tensor Database. Only applicable for paths of the form ``hub://org_id/dataset_name`` and runtime must be ``{"tensor_db": True}``.
overwrite (bool): If set to ``True`` this overwrites the dataset if it already exists. Defaults to ``False``.
public (bool): Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to ``False``.
memory_cache_size (int): The size of the memory cache to be used in MB.
@@ -385,7 +386,7 @@ def empty(
if org_id is not None and get_path_type(path) != "local":
raise ValueError("org_id parameter can only be used with local datasets")

db_engine = (runtime or {}).get("db_engine", False)
db_engine = parse_runtime_parameters(path, runtime)["tensor_db"]

if address:
raise ValueError(
@@ -823,6 +824,7 @@ def like(
token: Optional[str] = None,
org_id: Optional[str] = None,
public: bool = False,
verbose: bool = True,
) -> Dataset:
"""Creates a new dataset by copying the ``source`` dataset's structure to a new location. No samples are copied,
only the meta/info for the dataset and its tensors.
@@ -840,6 +842,8 @@
token (str, optional): Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
org_id (str, Optional): Organization id to be used for enabling enterprise features. Only applicable for local datasets.
public (bool): Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
verbose (bool): If True, logs will be printed. Defaults to ``True``.


Returns:
Dataset: New dataset object.
@@ -862,7 +866,16 @@
token=token,
)
return dataset._like(
dest, src, runtime, tensors, overwrite, creds, token, org_id, public
dest,
src,
runtime,
tensors,
overwrite,
creds,
token,
org_id,
public,
verbose,
)

@staticmethod
@@ -876,6 +889,7 @@ def _like( # (No reporting)
token: Optional[str] = None,
org_id: Optional[str] = None,
public: bool = False,
verbose: bool = True,
unlink: Union[List[str], bool] = False,
) -> Dataset:
"""Copies the `source` dataset's structure to a new location. No samples are copied, only the meta/info for the dataset and it's tensors.
@@ -895,11 +909,24 @@
token (str, optional): Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
org_id (str, Optional): Organization id to be used for enabling enterprise features. Only applicable for local datasets.
public (bool): Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to ``False``.
verbose (bool): If True, logs will be printed. Defaults to ``True``.
unlink (Union[List[str], bool]): List of tensors to be unlinked. If ``True`` passed all tensors will be unlinked. Defaults to ``False``, no tensors are unlinked.

Returns:
Dataset: New dataset object.
"""

src = convert_pathlib_to_string_if_needed(src)
if isinstance(src, str):
source_ds = dataset.load(src, verbose=verbose)
else:
source_ds = src

if tensors:
tensors = source_ds._resolve_tensor_list(tensors) # type: ignore
else:
tensors = source_ds.tensors # type: ignore

dest = convert_pathlib_to_string_if_needed(dest)
if isinstance(dest, Dataset):
destination_ds = dest
@@ -914,20 +941,12 @@ def _like( # (No reporting)
token=token,
org_id=org_id,
public=public,
verbose=verbose,
)

feature_report_path(
dest_path, "like", {"Overwrite": overwrite, "Public": public}, token=token
)
src = convert_pathlib_to_string_if_needed(src)
if isinstance(src, str):
source_ds = dataset.load(src)
else:
source_ds = src

if tensors:
tensors = source_ds._resolve_tensor_list(tensors) # type: ignore
else:
tensors = source_ds.tensors # type: ignore

if unlink is True:
unlink = tensors # type: ignore
@@ -1110,7 +1129,7 @@ def deepcopy(
)
src_storage = get_base_storage(src_ds.storage)

db_engine = (runtime or {}).get("db_engine", False)
db_engine = parse_runtime_parameters(dest, runtime)["tensor_db"]
dest_storage, cache_chain = get_storage_and_cache_chain(
dest,
db_engine=db_engine,
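Taken together, the dataset.py changes route all runtime parsing through parse_runtime_parameters and thread a new verbose flag into like/_like. A minimal usage sketch, assuming the hub:// paths are placeholders and that parse_runtime_parameters returns a dict with a "tensor_db" key, as the call sites above imply:

import deeplake

# Create a dataset in the Managed Tensor Database; per the updated docstring,
# this requires a hub:// path and runtime={"tensor_db": True}.
ds = deeplake.empty("hub://org_id/dataset_name", runtime={"tensor_db": True})

# Copy only the structure (no samples); the new verbose flag controls the
# logs printed while the source dataset is loaded.
ds_copy = deeplake.like("hub://org_id/dataset_copy", ds, verbose=False)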
15 changes: 4 additions & 11 deletions deeplake/client/client.py
@@ -1,6 +1,6 @@
import deeplake
import requests
from typing import Any, Optional, List, Tuple
from typing import Any, Optional, Dict
from deeplake.util.exceptions import (
AgreementNotAcceptedError,
AuthorizationException,
@@ -492,7 +492,7 @@ def connect_dataset_entry(

def remote_query(
self, org_id: str, ds_name: str, query_string: str
) -> Tuple[Any, Any]:
) -> Dict[str, Any]:
"""Queries a remote dataset.

Args:
@@ -501,7 +501,7 @@ def remote_query(
query_string (str): The query string.

Returns:
Tuple[Any, Any]: The indices and scores matching the query.
Dict[str, Any]: The JSON response containing matching indices and data from virtual tensors.
"""
response = self.request(
"POST",
@@ -510,11 +510,4 @@ def remote_query(
endpoint=self.endpoint(),
).json()

indices = response["indices"]
if len(indices) == 0:
return [], []

scores = response.get("score")

indices = [int(i) for i in indices.split(",")]
return indices, scores
return response
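With this change, remote_query no longer unpacks the payload into (indices, scores); callers receive the raw JSON dict and pull out what they need. A hedged sketch of a caller, assuming only the "indices" and "data" keys that this PR reads elsewhere:

client = DeepLakeBackendClient(token=token)
response = client.remote_query("org_id", "dataset_name", "select * limit 10")
indices = response["indices"]  # rows matching the query, used directly for indexing
data = response.get("data")    # raw data from virtual tensors, if the query produced any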
34 changes: 33 additions & 1 deletion deeplake/constants.py
@@ -193,4 +193,36 @@
DEFAULT_VECTORSTORE_DEEPLAKE_PATH = "./deeplake_vector_store"
MAX_VECTORSTORE_INGESTION_RETRY_ATTEMPTS = 5
MAX_CHECKPOINTING_INTERVAL = 100000
MAX_DATASET_LENGTH_FOR_CACHING = 100000
VECTORSTORE_EXTEND_MAX_SIZE = 20000
DEFAULT_VECTORSTORE_TENSORS = [
{
"name": "text",
"htype": "text",
"create_id_tensor": False,
"create_sample_info_tensor": False,
"create_shape_tensor": False,
},
{
"name": "metadata",
"htype": "json",
"create_id_tensor": False,
"create_sample_info_tensor": False,
"create_shape_tensor": False,
},
{
"name": "embedding",
"htype": "embedding",
"dtype": np.float32,
"create_id_tensor": False,
"create_sample_info_tensor": False,
"create_shape_tensor": True,
"max_chunk_size": 64 * MB,
},
{
"name": "id",
"htype": "text",
"create_id_tensor": False,
"create_sample_info_tensor": False,
"create_shape_tensor": False,
},
]
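The keys in these specs mirror keyword arguments of Dataset.create_tensor, so the Vector Store can build its default schema straight from the constant. A minimal sketch under that assumption (the local path is illustrative):

import deeplake
from deeplake.constants import DEFAULT_VECTORSTORE_TENSORS

ds = deeplake.empty("./deeplake_vector_store", overwrite=True)
for spec in DEFAULT_VECTORSTORE_TENSORS:
    # each spec carries the tensor name, htype, and creation flags
    ds.create_tensor(**spec)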
45 changes: 27 additions & 18 deletions deeplake/core/dataset/dataset.py
@@ -29,6 +29,7 @@
suppress_iteration_warning,
check_if_iteration,
)
from deeplake.util.tensor_db import parse_runtime_parameters
from deeplake.api.info import load_info
from deeplake.client.log import logger
from deeplake.client.utils import get_user_name
@@ -2094,7 +2095,7 @@ def query(
self,
query_string: str,
runtime: Optional[Dict] = None,
return_indices_and_scores: bool = False,
return_data: bool = False,
):
"""Returns a sliced :class:`~deeplake.core.dataset.Dataset` with given query results. To use this, install deeplake with ``pip install deeplake[enterprise]``.

@@ -2104,11 +2105,11 @@

Args:
query_string (str): An SQL string adjusted with new functionalities to run on the given :class:`~deeplake.core.dataset.Dataset` object
runtime (Optional[Dict]): whether to run query on a remote engine
return_indices_and_scores (bool): by default False. Whether to return indices and scores.
runtime (Optional[Dict]): Runtime parameters for query execution. Supported keys: {"tensor_db": True or False}.
return_data (bool): Defaults to ``False``. Whether to return raw data along with the view.

Raises:
ValueError: if return_indices_and_scores is True and runtime is not {"db_engine": true}
ValueError: if ``return_data`` is True and runtime is not {"tensor_db": true}

Returns:
Dataset: A :class:`~deeplake.core.dataset.Dataset` object.
@@ -2127,28 +2128,36 @@
>>> query_ds_train = ds_train.query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")

"""
if runtime is not None and runtime.get("db_engine", False):

deeplake_reporter.feature_report(
feature_name="query",
parameters={
"query_string": query_string[0:100],
"runtime": runtime,
},
)

runtime = parse_runtime_parameters(self.path, runtime)
if runtime["tensor_db"]:
client = DeepLakeBackendClient(token=self._token)
org_id, ds_name = self.path[6:].split("/")
indices, scores = client.remote_query(org_id, ds_name, query_string)
if return_indices_and_scores:
return indices, scores
return self[indices]
response = client.remote_query(org_id, ds_name, query_string)
indices = response["indices"]
view = self[indices]

if return_data:
data = response["data"]
return view, data

if return_indices_and_scores:
return view

if return_data:
raise ValueError(
"return_indices_and_scores is not supported. Please add `runtime = {'db_engine': True}` if you want to return indices and scores"
"`return_data` is only applicable when running queries using the Managed Tensor Database. Please specify `runtime = {'tensor_db': True}`"
)

from deeplake.enterprise import query

deeplake_reporter.feature_report(
feature_name="query",
parameters={
"query_string": query_string,
},
)

return query(self, query_string)

def sample_by(
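The renamed flag changes the calling pattern: return_data only applies when the query runs in the Managed Tensor Database. A short sketch based on the branches above (ds is a loaded dataset; query strings are illustrative):

# Local / enterprise engine: returns just the view.
view = ds.query("select * where contains(categories, 'car') limit 10")

# Managed Tensor Database: can also return raw data from virtual tensors.
view, data = ds.query(
    "select * limit 10",
    runtime={"tensor_db": True},
    return_data=True,
)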
18 changes: 12 additions & 6 deletions deeplake/core/transform/test_transform.py
@@ -1376,7 +1376,7 @@ class BadSample:


@all_schedulers
@pytest.mark.parametrize("method", ["ds", "multiple"])
@pytest.mark.parametrize("method", ["ds", "multiple", "checkpointed"])
@pytest.mark.parametrize("error_at", ["transform", "chunk_engine"])
def test_ds_append_errors(
local_path, compressed_image_paths, scheduler, method, error_at
@@ -1388,7 +1388,7 @@ def upload(item, ds):
if isinstance(item["images"], str)
else item["images"]
)
if method == "ds":
if method == "ds" or method == "checkpointed":
ds.append(
{
"labels": np.zeros(10, dtype=np.uint32),
@@ -1417,14 +1417,18 @@ def upload(item, ds):
# errors out in transform dataset / tensor
bad_sample = {"images": "bad_path", "boxes": [1, 2, 3]}
err_msg = re.escape(
f"Transform failed at index 17 of the input data on the item: {bad_sample}. See traceback for more details."
f"Transform failed at index 17 of the input data on the item: {bad_sample}."
)
else:
# errors out in chunk engine
bad_sample = {"images": BadSample(), "boxes": [1, 2, 3]}
err_msg = re.escape(
f"Transform failed at index 17 of the input data. See traceback for more details."
err_msg = re.escape(f"Transform failed at index 17 of the input data.")

if method == "checkpointed":
err_msg += re.escape(
" Last checkpoint: 10 samples processed. You can slice the input to resume from this point."
)
err_msg += re.escape(" See traceback for more details.")

samples.insert(17, bad_sample)

@@ -1434,6 +1438,7 @@ def upload(item, ds):
ds,
num_workers=TRANSFORM_TEST_NUM_WORKERS,
scheduler=scheduler,
checkpoint_interval=10 if method == "checkpointed" else 0,
)

ds = create_test_ds(local_path)
@@ -1444,9 +1449,10 @@
num_workers=TRANSFORM_TEST_NUM_WORKERS,
scheduler=scheduler,
ignore_errors=True,
checkpoint_interval=10 if method == "checkpointed" else 0,
)

if method == "ds":
if method == "ds" or method == "checkpointed":
assert ds["images"][::2].numpy().shape == (10, *deeplake.read(images[0]).shape)
assert ds["images"][1::2].numpy().shape == (10, *deeplake.read(images[1]).shape)

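For reference, in the checkpointed chunk-engine case the escaped fragments assembled above match an error message of the form:

Transform failed at index 17 of the input data. Last checkpoint: 10 samples processed. You can slice the input to resume from this point. See traceback for more details.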
10 changes: 9 additions & 1 deletion deeplake/core/transform/transform.py
@@ -140,6 +140,7 @@ def eval(
cache_size: int = DEFAULT_TRANSFORM_SAMPLE_CACHE_SIZE,
checkpoint_interval: int = 0,
ignore_errors: bool = False,
verbose: bool = False,
**kwargs,
):
"""Evaluates the pipeline on ``data_in`` to produce an output dataset ``ds_out``.
@@ -165,6 +166,7 @@
checkpoint_interval (int): If > 0, the transform will be checkpointed with a commit every ``checkpoint_interval`` input samples to avoid restarting the full transform due to intermittent failures. If the transform is interrupted, the intermediate data is deleted and the dataset is reset to the last commit.
If <= 0, no checkpointing is done. Checkpoint interval should be a multiple of num_workers if num_workers > 0. Defaults to 0.
ignore_errors (bool): If ``True``, input samples that causes transform to fail will be skipped and the errors will be ignored **if possible**.
verbose (bool): If ``True``, prints additional information about the transform.
**kwargs: Additional arguments.

Raises:
@@ -234,7 +236,11 @@ def my_fn(sample_in: Any, samples_out, my_arg0, my_arg1=0):
total_samples = len(data_in)
if checkpointing_enabled:
check_checkpoint_interval(
data_in, checkpoint_interval, num_workers, overwrite
data_in,
checkpoint_interval,
num_workers,
overwrite,
verbose,
)
datas_in = [
data_in[i : i + checkpoint_interval]
@@ -300,6 +306,8 @@ def my_fn(sample_in: Any, samples_out, my_arg0, my_arg1=0):
index, sample = None, None
if isinstance(e, TransformError):
index, sample = e.index, e.sample
if checkpointing_enabled and isinstance(index, int):
index = samples_processed + index
e = e.__cause__ # type: ignore
if isinstance(e, AllSamplesSkippedError):
raise e
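A usage sketch of checkpointed evaluation with the new verbose flag; the compute function, input data, and dataset (assumed to already have a labels tensor) are illustrative:

import numpy as np
import deeplake

@deeplake.compute
def upload(item, ds):
    ds.append({"labels": np.zeros(10, dtype=np.uint32)})

upload().eval(
    list(range(100)),        # data_in
    ds,                      # ds_out: an existing dataset with a labels tensor
    num_workers=2,
    checkpoint_interval=10,  # commit every 10 input samples (multiple of num_workers)
    ignore_errors=True,      # skip failing samples where possible
    verbose=True,            # new flag: print extra info about the transform
)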