Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AL-1565] Filtering with simple query language #1352

Merged
merged 1 commit into from
Nov 30, 2021
Merged

[AL-1565] Filtering with simple query language #1352

merged 1 commit into from
Nov 30, 2021

Conversation

aliubimov
Copy link
Contributor

@aliubimov aliubimov commented Nov 23, 2021

Allows to execute filter query on the dataset using functions

ds.filter(lambda sample: sample.labels.numpy() == 2)

Or using simple query language:

ds.filter('lables == 2')

Query language have following limitations:

  • Scalar types are allowed: integers, float, boolean, strings
  • Operations on the tensors that are result to a scalar min, max, mean, shape, size
  • Tensor to scalar comparison is allowed, if and only if, tensor is size of 1, which can be mapped to the scalar.

@aliubimov aliubimov marked this pull request as draft November 23, 2021 18:32
Comment on lines 67 to 70
def _sanitize_tensor_name(key):
"""Sanitizes tensorname, that it could be binded as variable for `eval()` function call"""
key = key.replace(" ", "_")
return key
Copy link
Contributor Author

@aliubimov aliubimov Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please suggest what sanitization we have to have in place for the tensor bindings.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aliubimov need to loop in UI team for this. I'll follow up in slack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aliubimov need more context to understand what is this function sanitizing.

Copy link
Contributor Author

@aliubimov aliubimov Nov 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@activesoull @alphazero this particular function have to ensure, that tensor name could be mapped and used as Python identifier. We can also try rename tensor (in a example, where we replace space with underscore), but it is not clear what to do with special symbols.

Copy link
Contributor

@activesoull activesoull Nov 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aliubimov if we must ensure that the tensor name is a valid python identifier then it will make sense to just remove all symbols and simplify the string to ensure it contains only the ASCII part of the allowed characters (a-zA-Z0-9_). Everything else must be removed. That's my take.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aliubimov its better to not have any sanitization at all, as sanitizing can lead to conflicts.. Image 2 tensors with names "profit_$" and "profit_€". Simply removing the special chars will lead to conflict and escaping the special chars is bad UX. Tensors that are not valid identifiers can be accessed by ds["tensor name"]. This how its already works in hub - non identifier tensor names can't be accessed with . operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@farizrahman4u Good point. I still think that having space replaced by underscore is beneficial. That would be something that I would try first as a user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@farizrahman4u Removed as per request.

@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 23, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 23, 2021
@codecov
Copy link

codecov bot commented Nov 24, 2021

Codecov Report

Merging #1352 (1c55972) into main (fcc922f) will decrease coverage by 0.12%.
The diff coverage is 94.66%.

❗ Current head 1c55972 differs from pull request most recent head 2ec6c4a. Consider uploading reports for the commit 2ec6c4a to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1352      +/-   ##
==========================================
- Coverage   91.80%   91.68%   -0.13%     
==========================================
  Files         151      155       +4     
  Lines       10820    11024     +204     
==========================================
+ Hits         9933    10107     +174     
- Misses        887      917      +30     
Flag Coverage Δ
unittests 91.68% <94.66%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
hub/core/query/filter.py 79.54% <79.54%> (ø)
hub/core/compute/serial.py 93.75% <87.50%> (-6.25%) ⬇️
hub/core/query/query.py 93.70% <93.70%> (ø)
hub/core/compute/provider.py 97.29% <96.55%> (+9.79%) ⬆️
hub/core/compute/process.py 100.00% <100.00%> (ø)
hub/core/compute/ray.py 100.00% <100.00%> (ø)
hub/core/compute/thread.py 100.00% <100.00%> (ø)
hub/core/dataset/dataset.py 93.60% <100.00%> (+0.06%) ⬆️
hub/core/query/__init__.py 100.00% <100.00%> (ø)
hub/core/query/test/test_query.py 100.00% <100.00%> (ø)
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42f328c...2ec6c4a. Read the comment docs.

@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 24, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 24, 2021
@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 24, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 24, 2021
@aliubimov aliubimov marked this pull request as ready for review November 24, 2021 19:06
@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 26, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 26, 2021
@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 26, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 26, 2021
@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 27, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 27, 2021
@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 27, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 27, 2021
@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 27, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 27, 2021
Comment on lines 67 to 70
def _sanitize_tensor_name(key):
"""Sanitizes tensorname, that it could be binded as variable for `eval()` function call"""
key = key.replace(" ", "_")
return key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aliubimov its better to not have any sanitization at all, as sanitizing can lead to conflicts.. Image 2 tensors with names "profit_$" and "profit_€". Simply removing the special chars will lead to conflict and escaping the special chars is bad UX. Tensors that are not valid identifiers can be accessed by ds["tensor name"]. This how its already works in hub - non identifier tensor names can't be accessed with . operator.

@aliubimov aliubimov added the trigger-test Label trigger to run tests on PRs label Nov 29, 2021
@activeloop-bot activeloop-bot removed the trigger-test Label trigger to run tests on PRs label Nov 29, 2021
@aliubimov aliubimov changed the title Filtering with simple query language [AL-1565] Filtering with simple query language Nov 29, 2021
@aliubimov aliubimov merged commit ca7e985 into activeloopai:main Nov 30, 2021
@aliubimov aliubimov deleted the feature/AL-1565/query-lang branch November 30, 2021 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants