[AL-1565] Filtering with simple query language #1352
hub/core/query/query.py
Outdated
def _sanitize_tensor_name(key):
    """Sanitizes tensorname, that it could be binded as variable for `eval()` function call"""
    key = key.replace(" ", "_")
    return key
Please suggest what sanitization we need to have in place for the tensor bindings.
@aliubimov We need to loop in the UI team for this. I'll follow up in Slack.
@aliubimov I need more context to understand what this function is sanitizing.
@activesoull @alphazero This function has to ensure that a tensor name can be mapped to, and used as, a Python identifier. We could also try renaming the tensor (as in the example, where we replace spaces with underscores), but it is not clear what to do with special symbols.
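As a rough illustration of the requirement described above, Python's built-in `str.isidentifier()` and `keyword.iskeyword()` can tell whether a name is usable as a variable binding. This is only a sketch: `is_bindable` is a hypothetical helper name and is not part of this PR.

```python
import keyword


def is_bindable(name: str) -> bool:
    """Hypothetical check: can `name` be bound as a variable for eval()?"""
    # A usable binding must be a syntactically valid identifier
    # and must not collide with a reserved keyword such as `class`.
    return name.isidentifier() and not keyword.iskeyword(name)


# A valid name can be handed to eval() through the locals mapping.
bindings = {"labels": 2}
result = eval("labels == 2", {"__builtins__": {}}, bindings)
```

Names such as "tensor name" or "class" fail this check, which is the case the sanitization question is about.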
@aliubimov If we must ensure that the tensor name is a valid Python identifier, then it makes sense to remove all symbols and reduce the string to the ASCII subset of the allowed characters (a-zA-Z0-9_). Everything else must be removed. That's my take.
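The ASCII-only approach suggested above could be sketched roughly as follows. `strip_to_ascii_identifier` is a hypothetical name, not code from this PR, and the leading-digit handling is an added assumption, since an identifier cannot start with a digit.

```python
import re


def strip_to_ascii_identifier(name: str) -> str:
    """Hypothetical helper: keep only a-zA-Z0-9_ from a tensor name."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "", name)
    # Assumption: prefix an underscore when the result starts with a digit,
    # because Python identifiers cannot begin with one.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned
```

Note that a name made up entirely of removed characters yields an empty string, which is still not a valid identifier, so a caller would have to handle that case separately.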
@aliubimov It's better not to have any sanitization at all, as sanitizing can lead to conflicts. Imagine two tensors named "profit_$" and "profit_€". Simply removing the special characters will lead to a conflict, and escaping them is bad UX. Tensors whose names are not valid identifiers can be accessed with ds["tensor name"]. This is how it already works in hub: non-identifier tensor names can't be accessed with the . operator.
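The conflict described above can be reproduced with a small sketch. `drop_special` is a hypothetical sanitizer written only for this illustration; it is not the code under review.

```python
from collections import defaultdict


def drop_special(name: str) -> str:
    """Hypothetical sanitizer: keep only ASCII letters, digits, and underscores."""
    return "".join(
        ch for ch in name if ch.isascii() and (ch.isalnum() or ch == "_")
    )


tensor_names = ["profit_$", "profit_€"]

# Group the original names by their sanitized form.
by_sanitized = defaultdict(list)
for name in tensor_names:
    by_sanitized[drop_special(name)].append(name)

# Any sanitized name that maps back to more than one original is a conflict.
conflicts = {key: names for key, names in by_sanitized.items() if len(names) > 1}
print(conflicts)  # {'profit_': ['profit_$', 'profit_€']}
```

Both names collapse to "profit_", so a query that binds the sanitized name could not tell the two tensors apart.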
@farizrahman4u Good point. I still think that replacing spaces with underscores is beneficial. That is what I would try first as a user.
@farizrahman4u Removed as per request.
Codecov Report
@@            Coverage Diff             @@
##             main    #1352      +/-   ##
==========================================
- Coverage   91.80%   91.68%   -0.13%
==========================================
  Files         151      155       +4
  Lines       10820    11024     +204
==========================================
+ Hits         9933    10107     +174
- Misses        887      917      +30
Allows executing a filter query on the dataset using functions
Or using a simple query language:
The query language has the following limitations: