
feat: add support for aggregates and toxicity classification #551

Merged (39 commits) on Jan 7, 2023

Conversation

@jarulraj (Member) commented on Jan 2, 2023:

  1. Based on "Add support for COUNT, SUM, AVG, MIN, MAX" (#519)
  2. Brings back pip package testing in CI
  3. Reduces the verbosity of the YOLO model
  4. Updates links in the README to point to the stable version on Read the Docs
  5. Adds support for toxicity detection, based on "feat: Add toxic meme detection via UDF" (#516)
  6. Adds support for querying based on video timestamps (based on "feat: Support timestamps and querying for timestamps" (#520))

    agg_func_name = self.visit(child).value
elif isinstance(child, Token):
    token = child.value
    # Support for COUNT(*)

Member: I don't understand this logic. Are we hardcoding * to id in the parser? If so, even though it is hacky, it saves us from handling this corner case in the binder. We could change it to IDENTIFIER_COLUMN, which is supposed to be a unique row id in all the tables.

Member Author: Yes, this query currently works. I was also worried about whether "id" will always be present. How should I change it to IDENTIFIER_COLUMN?

Member Author: The unit tests do not create tables using IDENTIFIER_COLUMN, so the test case fails.

Member: Yeah, I verified that we don't support projecting IDENTIFIER_COLUMN, which causes the binder to fail. id won't work for images or other tables; ideally, the binder should take care of it. An if condition here should fix it.
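
A minimal sketch of the kind of if condition being suggested (the IDENTIFIER_COLUMN value and the visitor helper are assumptions for illustration, not the actual EVA parser code):

```python
# Hypothetical constant; in EVA the unique row-id column would come from the catalog.
IDENTIFIER_COLUMN = "_row_id"

def visit_token(child):
    token = child.value
    # Support for COUNT(*): rewrite the star into the identifier column so the
    # binder sees an ordinary column reference instead of a special-cased "*".
    if token == "*":
        token = IDENTIFIER_COLUMN
    return token
```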

@@ -378,7 +380,14 @@ def aggregate(self, method: str) -> None:
        Arguments:
            method: string with one of the five above options
        """
        self._frames = self._frames.agg([method])
        # Aggregate ndarray
        if isinstance(self._frames.iat[0, 0], np.ndarray):

Member: Is it aggregating each row of the array? If yes, I suspect that will break the execution logic.

Member Author: Yes, how will it break the execution logic?

The NDARRAY case is for object detection arrays, etc. The normal case is the one that existed earlier: self._frames = self._frames.agg([method])

Member Author: I have reverted the NDARRAY case, as it does not make sense.

Member: Aggregates on an ndarray column and on a primitive column will result in different row counts.

We have a different set of aggregate operators to apply row-wise aggregates, like Array_Count.
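
A toy pandas example of the row-count mismatch being described (illustrative data only; Array_Count itself is a separate EVA operator and is not shown here):

```python
import numpy as np
import pandas as pd

frames = pd.DataFrame({
    "id": [1, 2, 3],
    "bboxes": [np.zeros((2, 4)), np.zeros((1, 4)), np.zeros((3, 4))],
})

# A column-wise aggregate reduces the primitive column to a single row.
print(frames[["id"]].agg(["sum"]))  # one row: id == 6

# A row-wise array aggregate keeps one value per input row (three rows here),
# so mixing the two in a single result would misalign the row counts.
print(frames["bboxes"].apply(lambda a: a.shape[0]))  # 2, 1, 3
```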

self.assertEqual(actual_batch.frames.iat[0, 0], 10)
self.assertEqual(actual_batch.frames.iat[0, 1], 4.5)

complex_aggregate_query = """SELECT SUM(id), COUNT(label)

Member: We should add a test case with an aggregate on an ndarray column.

Member Author (Jan 3, 2023): When the query operates on an ndarray column, it does not reduce it to a single row; it keeps as many rows as there are rows in the input column.

Member Author: I have reverted the NDARRAY case, as it does not make sense.

Member: Do we raise an error if the query tries to aggregate on an array column? We should add a test case to verify it. Thanks!
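
A hedged sketch of such a test (execute_query_fetch_all and the broad error type are assumptions about the test harness, not verified EVA APIs):

```python
def test_aggregate_on_ndarray_column_raises(self):
    # 'data' is assumed to be the ndarray frame column of MyVideo.
    query = "SELECT SUM(data) FROM MyVideo;"
    with self.assertRaises(Exception):  # ideally a more specific binder/executor error
        execute_query_fetch_all(query)
```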

@jarulraj changed the title from "feat: add support for aggregates" to "feat: add support for aggregates and toxicity classification" on Jan 3, 2023

@jarulraj (Member Author) commented on Jan 3, 2023:

I just added a ToxicityClassifier UDF that runs on top of OCR labels.

@@ -55,9 +55,14 @@ def evaluate(self, *args, **kwargs):
        elif self.etype == ExpressionType.AGGREGATION_MAX:

Member: We are missing an else condition to raise an error. Right now, we silently ignore it and return the original batch.
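
A rough sketch of what the missing else branch could look like (the child-evaluation call, the elif chain shown, and the exception type are assumptions for illustration):

```python
def evaluate(self, *args, **kwargs):
    batch = self.get_child(0).evaluate(*args, **kwargs)  # assumed child access
    if self.etype == ExpressionType.AGGREGATION_MIN:
        batch.aggregate("min")
    elif self.etype == ExpressionType.AGGREGATION_MAX:
        batch.aggregate("max")
    else:
        # Instead of silently returning the un-aggregated batch, fail loudly.
        raise NotImplementedError(f"Unsupported aggregation type: {self.etype}")
    return batch
```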

eva/readers/opencv_reader.py (conversation resolved)
@@ -0,0 +1,49 @@
# coding=utf-8

Member: What is this for?

single_result = self.model.predict(text)
toxicity_score = single_result["toxicity"][0]
if toxicity_score >= self.threshold:
    outcome = outcome.append({"labels": "toxic"}, ignore_index=True)

Member: I changed it to use a list for the append operation; DataFrame.append throws a lot of warnings. You can refer to the other UDFs.
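
A sketch of the list-based approach being described (the helper signature and label strings are illustrative; the model.predict call mirrors the snippet above):

```python
import pandas as pd

def classify_toxicity(model, texts, threshold):
    # Collect labels in a plain list and build the DataFrame once, instead of
    # calling DataFrame.append per row (deprecated and noisy in recent pandas).
    labels = []
    for text in texts:
        single_result = model.predict(text)
        toxicity_score = single_result["toxicity"][0]
        labels.append("toxic" if toxicity_score >= threshold else "not toxic")
    return pd.DataFrame({"labels": labels})
```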

Member Author: Did you push your change?

if len(inp.columns) != 1:
    raise ValueError("input must only contain one column (seconds)")

seconds = pd.DataFrame(inp[inp.columns[0]])

Member: Isn't this a no-op?

Member Author: This is a timestamp UDF.

Member Author:
SELECT id, seconds, Timestamp(seconds) FROM MyVideo WHERE Timestamp(seconds) <= "00:00:01";
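
For reference, a minimal sketch of the seconds-to-timestamp conversion behind that query (function and column names are illustrative, not the exact UDF code):

```python
import time

import pandas as pd

def seconds_to_timestamp(seconds: pd.DataFrame) -> pd.DataFrame:
    # Format each value as "HH:MM:SS" so string comparisons like <= "00:00:01" work.
    formatted = seconds.iloc[:, 0].apply(
        lambda s: time.strftime("%H:%M:%S", time.gmtime(int(s)))
    )
    return pd.DataFrame({"timestamp": formatted})
```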

@@ -76,6 +76,7 @@ def test_create_multimedia_table_catalog_entry(self, mock):
ColumnDefinition(
"data", ColumnType.NDARRAY, NdArrayType.UINT8, [None, None, None]

Member: array_dimension should be a tuple.
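
That is, the dimensions would be passed as a tuple (None still marking an unconstrained dimension, mirroring the test line above):

```python
ColumnDefinition("data", ColumnType.NDARRAY, NdArrayType.UINT8, (None, None, None))
```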

@gaurav274 (Member):

> I just added a ToxicityClassifier UDF that runs on top of OCR labels.

Thanks! This will be a fun example to showcase 💯

@review-notebook-app (bot): Check out this pull request on ReviewNB for visual diffs of the Jupyter notebooks.
