Contain operator and generic ndarray syntax support in EVA #146

Merged
merged 38 commits into from
Mar 26, 2021

xzdandy
Collaborator

@xzdandy xzdandy commented Mar 1, 2021

  1. Add generic NDARRAY syntax support, of the form NDARRAY [array_type](dimensions).
    Example:
CREATE UDF DummyObjectDetector
INPUT  (Frame_Array NDARRAY UINT8(3, 256, 256))
OUTPUT (label NDARRAY STR(10))
TYPE  Classification
IMPL  'test/util.py';

array_type can be one of the following: [INT8, UINT8, INT16, INT32, INT64, UNICODE, BOOL, FLOAT32, FLOAT64, DECIMAL, STR, DATETIME]
Note: array_type is enforced when writing with petastorm, so the data must be at least convertible to the declared array_type.
  2. Add the containment comparison operators @> and <@.
    Example:

SELECT id,DummyObjectDetector(data) FROM MyVideo
WHERE DummyObjectDetector(data).label <@ ['person', 'bicycle']
ORDER BY id;

See issue #144 or test/integration_tests/test_udf_executor.py for more usage examples.
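To illustrate the note in item 1 about data needing to be convertible to the declared array_type, here is a minimal sketch using plain NumPy. The ARRAY_TYPE_TO_DTYPE mapping and the coerce_to_array_type helper are hypothetical names made up for this example; they are not EVA or petastorm APIs, and the real mapping may differ.

```python
import numpy as np

# Hypothetical mapping from a few array_type names to NumPy dtypes.
# This is an assumption for illustration, not the mapping EVA uses.
ARRAY_TYPE_TO_DTYPE = {
    "UINT8": np.uint8,
    "INT32": np.int32,
    "FLOAT32": np.float32,
    "BOOL": np.bool_,
}


def coerce_to_array_type(data, array_type):
    """Convert data to the dtype declared by array_type.

    Raises ValueError (or TypeError) when the data is not convertible,
    which mimics an array_type check at write time.
    """
    dtype = ARRAY_TYPE_TO_DTYPE[array_type]
    return np.asarray(data).astype(dtype)


# An int64 frame can be converted to the declared UINT8 type.
frame = np.zeros((3, 256, 256), dtype=np.int64)
coerced = coerce_to_array_type(frame, "UINT8")

# String labels cannot be converted to UINT8, so this write would fail.
try:
    coerce_to_array_type(["person"], "UINT8")
    convertible = True
except ValueError:
    convertible = False
```

Under this assumption, a UDF whose output is declared as NDARRAY STR(10) would need to return string data, while numeric outputs only need to be castable to the declared numeric type.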

TODO:

  • Fix the broken unit test cases.

@xzdandy xzdandy marked this pull request as ready for review March 1, 2021 22:58
@xzdandy xzdandy requested a review from gaurav274 March 1, 2021 22:58
Member

@gaurav274 gaurav274 left a comment

Review in progress

src/expression/comparison_expression.py
test/parser/test_parser.py
left_values.values != right_values.values))
return Batch(pd.DataFrame(lvalues != rvalues))
elif self.etype == ExpressionType.COMPARE_CONTAINS:
res = [[all(x in p for x in q)
Member

Can we change this to work without for loops? Python loops are typically slow, and we should be able to do this with numpy/pandas built-in operations.

Collaborator Author

Yeah, I also looked for an existing sub-array check in numpy, but I could not find one.
What we can do is use a set instead of an array. As far as I know, though, we cannot avoid the for loop that checks cell by cell.
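The set-based idea mentioned above could look roughly like the sketch below. It assumes each cell holds a list of hashable values (such as label strings) and that @> and <@ follow the PostgreSQL-style meaning of "contains" and "is contained by". The function names are made up for this example and are not taken from the PR. The per-row loop is still there; only the inner membership test is delegated to set operations.

```python
def contains(left_cells, right_cells):
    """Row-wise `left @> right`: every element of right appears in left."""
    return [set(right).issubset(left)
            for left, right in zip(left_cells, right_cells)]


def is_contained(left_cells, right_cells):
    """Row-wise `left <@ right`: every element of left appears in right."""
    return [set(left).issubset(right)
            for left, right in zip(left_cells, right_cells)]


# One list of detected labels per frame (illustrative data only).
labels = [["person"], ["person", "car"], ["bicycle"]]
allowed = [["person", "bicycle"]] * len(labels)
required = [["person"]] * len(labels)

contained_mask = is_contained(labels, allowed)
contains_mask = contains(labels, required)
```

With this data, the second frame fails the <@ check because 'car' is not in the allowed list, and the third frame fails the @> check because it has no 'person' label.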

for left, right in zip(lvalues, rvalues)]
return Batch(pd.DataFrame(res))
elif self.etype == ExpressionType.COMPARE_IS_CONTAINED:
res = [[all(x in q for x in p)
Member

Same as above.

@@ -30,6 +30,7 @@ class DataFrameColumn(BaseModel):
_name = Column('name', String(100))
_type = Column('type', Enum(ColumnType), default=Enum)
_is_nullable = Column('is_nullable', Boolean, default=False)
_array_type = Column('array_type', Enum(NdArrayType), nullable=True)
Member

We also need to add the array_type parameter to the udfIO.

Collaborator Author

@xzdandy xzdandy Mar 11, 2021

Added in 0853244.

@@ -234,23 +234,6 @@ def get_dataset_metadata(self, database_name: str, dataset_name: str) -> \
metadata.schema = df_columns
return metadata

def udf_io(
Member

I think we should keep the catalog manager as a single source of access to the catalog models/services. It will help keep things modular.
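As a rough illustration of the "single source of access" point, the sketch below shows a catalog manager that delegates to an internal service, so callers never touch the service or models directly. UdfIOService, its add method, and the udf_io parameters are all hypothetical; the actual signatures in the PR are not shown in this thread.

```python
class UdfIOService:
    """Hypothetical service that owns UDF input/output records."""

    def __init__(self):
        self._rows = []

    def add(self, name, array_type, is_input):
        row = {"name": name, "array_type": array_type, "is_input": is_input}
        self._rows.append(row)
        return row


class CatalogManager:
    """Facade: the only entry point callers use to reach catalog services."""

    def __init__(self):
        self._udf_io_service = UdfIOService()

    def udf_io(self, name, array_type, is_input):
        # Delegate to the service; callers never import it themselves.
        return self._udf_io_service.add(name, array_type, is_input)


catalog = CatalogManager()
frame_input = catalog.udf_io("Frame_Array", "UINT8", is_input=True)
```

Keeping udf_io on the manager means that changing how UDF I/O records are stored only affects the service, not every caller.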

Collaborator Author

Added back in 50812db.

Member

@gaurav274 gaurav274 left a comment

LGTM

@gaurav274 gaurav274 merged commit e238bb3 into master Mar 26, 2021
@gaurav274 gaurav274 deleted the contain branch March 26, 2021 05:53
xzdandy pushed a commit to gaurav274/Eva that referenced this pull request Mar 19, 2022
Contain operator and generic ndarray syntax support in EVA