Contain operator and generic ndarray syntax support in EVA #146
…aggregate batches.
left_values.values != right_values.values))
return Batch(pd.DataFrame(lvalues != rvalues))
elif self.etype == ExpressionType.COMPARE_CONTAINS:
res = [[all(x in p for x in q)
Can we change this to work without for loops? Python loops are typically slow. We should be able to do this with numpy/pandas internal operations.
Yeah, I was also looking for an existing subarray check in numpy, but I could not find one.
What we could do is use sets instead of arrays. But as far as I know, we cannot avoid the for loop that checks cell by cell.
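A minimal sketch of the two options discussed in this thread, under stated assumptions: each cell is a flat sequence of values, and the helper names (`contains_isin`, `contains_set`, `contains_column`) are hypothetical and not part of EVA. `np.isin` vectorizes the membership test inside a single cell, and the set variant only works for hashable elements. In both cases the outer loop over cells remains.

```python
import numpy as np


def contains_isin(left_cell, right_cell):
    """Return True if every element of right_cell appears in left_cell."""
    left = np.asarray(left_cell).ravel()
    right = np.asarray(right_cell).ravel()
    # np.isin tests each element of `right` for membership in `left`
    # without an explicit Python loop over the elements of the cell.
    return bool(np.isin(right, left).all())


def contains_set(left_cell, right_cell):
    """Set-based variant; assumes the elements are hashable."""
    return set(right_cell).issubset(left_cell)


def contains_column(left_cells, right_cells, check=contains_isin):
    # Hypothetical helper: the per-cell loop is still a Python loop,
    # only the work inside each cell is delegated to numpy or set logic.
    return [check(left, right) for left, right in zip(left_cells, right_cells)]


result = contains_column([[1, 2, 3], [4, 5]], [[1, 3], [6]])
```

The is-contained direction can reuse the same helpers by swapping the two arguments.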
for left, right in zip(lvalues, rvalues)]
return Batch(pd.DataFrame(res))
elif self.etype == ExpressionType.COMPARE_IS_CONTAINED:
res = [[all(x in q for x in p)
Same as above.
@@ -30,6 +30,7 @@ class DataFrameColumn(BaseModel):
_name = Column('name', String(100))
_type = Column('type', Enum(ColumnType), default=Enum)
_is_nullable = Column('is_nullable', Boolean, default=False)
_array_type = Column('array_type', Enum(NdArrayType), nullable=True)
We also need to add the array_type parameter to the udfIO.
Added in 0853244.
@@ -234,23 +234,6 @@ def get_dataset_metadata(self, database_name: str, dataset_name: str) -> \
metadata.schema = df_columns
return metadata

def udf_io(
I think we should keep the catalog manager as a single source of access to the catalog models/services. It will help keep things modular.
Added back in 50812db.
LGTM
Contain operator and generic ndarray syntax support in EVA
Examples:
array_type can be one of the following:
[INT8, UINT8, INT16, INT32, INT64, UNICODE, BOOL, FLOAT32, FLOAT64, DECIMAL, STR, DATETIME]
Notice: array_type is enforced when writing with petastorm, so data needs to be at least convertible to the defined array_type.
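To illustrate what "convertible to the defined array_type" can mean in practice, here is a rough sketch using plain numpy. The mapping `ASSUMED_DTYPES` and the helper `to_array_type` are hypothetical and only cover a few of the listed types; how EVA and petastorm actually map and enforce them is not shown in this PR page.

```python
import numpy as np

# Hypothetical mapping from a few array_type names to numpy dtypes.
# This is an assumption for illustration, not EVA's actual mapping.
ASSUMED_DTYPES = {
    "UINT8": np.uint8,
    "INT32": np.int32,
    "FLOAT32": np.float32,
}


def to_array_type(data, array_type):
    """Convert data to the dtype assumed for array_type.

    Raises ValueError if numpy cannot convert the values.
    """
    return np.asarray(data).astype(ASSUMED_DTYPES[array_type])


frame = to_array_type([[0.0, 128.0], [255.0, 64.0]], "UINT8")

try:
    to_array_type(["not", "numeric"], "FLOAT32")
    converted = True
except ValueError:
    # Strings that do not parse as numbers are not convertible.
    converted = False
```

Note that `astype` may convert lossily (for example, a fractional float cast to an integer type is truncated), so "convertible" does not imply "preserved exactly".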
2. Add contain comparison operators @> and <@
Examples:
Check issue #144 or test/integration_tests/test_udf_executor.py for more usage examples.
TODO: