
feat: add support for aggregates and toxicity classification #551

Merged (39 commits) on Jan 7, 2023

Conversation

@jarulraj (Member) commented on Jan 2, 2023:

  1. Based on "Add support for COUNT, SUM, AVG, MIN, MAX" (#519)
  2. Brings back pip package testing in CI
  3. Reduces the verbosity of the YOLO model
  4. Updates links in the README to point to the stable version on Read the Docs
  5. Adds support for toxicity detection, based on "feat: Add toxic meme detection via UDF" (#516)
  6. Adds support for querying based on video timestamps (based on "feat: Support timestamps and querying for timestamps" (#520))

    agg_func_name = self.visit(child).value
elif isinstance(child, Token):
    token = child.value
    # Support for COUNT(*)

Member: I don't understand this logic. Are we hardcoding * to id in the parser? If so, even though it is hacky, it saves us from handling this corner case in the binder. We could change it to IDENTIFIER_COLUMN, which is supposed to be a unique row id in all the tables.

Member Author: Yes, this query currently works. I was also worried about whether "id" will always be present. How should I change it to IDENTIFIER_COLUMN?

Member Author: The unit tests do not create tables using IDENTIFIER_COLUMN, so the test case fails.

Member: Yeah, I verified that we don't support projecting IDENTIFIER_COLUMN, which causes the binder to fail. id won't work for images or other tables; ideally, the binder should take care of it. An if condition here should fix it.
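
A minimal sketch of the kind of if condition being suggested (the IDENTIFIER_COLUMN value and the visitor helper are assumptions for illustration, not the actual EVA parser code):

```python
# Hypothetical constant; in EVA the unique row-id column would come from the catalog.
IDENTIFIER_COLUMN = "_row_id"

def visit_token(child):
    token = child.value
    # Support for COUNT(*): rewrite the star into the identifier column so the
    # binder sees an ordinary column reference instead of a special-cased "*".
    if token == "*":
        token = IDENTIFIER_COLUMN
    return token
```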

@@ -378,7 +380,14 @@ def aggregate(self, method: str) -> None:
        Arguments:
            method: string with one of the five above options
        """
        self._frames = self._frames.agg([method])
        # Aggregate ndarray
        if isinstance(self._frames.iat[0, 0], np.ndarray):

Member: Is it aggregating each row of the array? If yes, I suspect that will break the execution logic.

Member Author: Yes, how will it break the execution logic?

The NDARRAY case is for object detection arrays, etc. The normal case is the one that existed earlier: self._frames = self._frames.agg([method])

Member Author: I have reverted the NDARRAY case, as it does not make sense.

Member: Aggregates on an ndarray column and on a primitive column will result in different row counts.

We have a different set of aggregate operators to apply row-wise aggregates, like Array_Count.
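
A toy pandas example of the row-count mismatch being described (illustrative data only; Array_Count itself is a separate EVA operator and is not shown here):

```python
import numpy as np
import pandas as pd

frames = pd.DataFrame({
    "id": [1, 2, 3],
    "bboxes": [np.zeros((2, 4)), np.zeros((1, 4)), np.zeros((3, 4))],
})

# A column-wise aggregate reduces the primitive column to a single row.
print(frames[["id"]].agg(["sum"]))  # one row: id == 6

# A row-wise array aggregate keeps one value per input row (three rows here),
# so mixing the two in a single result would misalign the row counts.
print(frames["bboxes"].apply(lambda a: a.shape[0]))  # 2, 1, 3
```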

self.assertEqual(actual_batch.frames.iat[0, 0], 10)
self.assertEqual(actual_batch.frames.iat[0, 1], 4.5)

complex_aggregate_query = """SELECT SUM(id), COUNT(label)

Member: We should add a test case with an aggregate on an ndarray column.

Member Author (Jan 3, 2023): When the query operates on an ndarray column, it does not reduce it to a single row; it keeps as many rows as there are rows in the input column.

Member Author: I have reverted the NDARRAY case, as it does not make sense.

Member: Do we raise an error if the query tries to aggregate on an array column? We should add a test case to verify it. Thanks!
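
A hedged sketch of such a test (execute_query_fetch_all and the broad error type are assumptions about the test harness, not verified EVA APIs):

```python
def test_aggregate_on_ndarray_column_raises(self):
    # 'data' is assumed to be the ndarray frame column of MyVideo.
    query = "SELECT SUM(data) FROM MyVideo;"
    with self.assertRaises(Exception):  # ideally a more specific binder/executor error
        execute_query_fetch_all(query)
```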

@jarulraj changed the title from "feat: add support for aggregates" to "feat: add support for aggregates and toxicity classification" on Jan 3, 2023

@jarulraj (Member Author) commented on Jan 3, 2023:

I just added a ToxicityClassifier UDF that runs on top of OCR labels.

@@ -55,9 +55,14 @@ def evaluate(self, *args, **kwargs):
        elif self.etype == ExpressionType.AGGREGATION_MAX:

Member: We are missing an else condition to raise an error. Right now, we silently ignore it and return the original batch.
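
A rough sketch of what the missing else branch could look like (the child-evaluation call, the elif chain shown, and the exception type are assumptions for illustration):

```python
def evaluate(self, *args, **kwargs):
    batch = self.get_child(0).evaluate(*args, **kwargs)  # assumed child access
    if self.etype == ExpressionType.AGGREGATION_MIN:
        batch.aggregate("min")
    elif self.etype == ExpressionType.AGGREGATION_MAX:
        batch.aggregate("max")
    else:
        # Instead of silently returning the un-aggregated batch, fail loudly.
        raise NotImplementedError(f"Unsupported aggregation type: {self.etype}")
    return batch
```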

eva/readers/opencv_reader.py (conversation resolved)
@@ -0,0 +1,49 @@
# coding=utf-8

Member: What is this for?

single_result = self.model.predict(text)
toxicity_score = single_result["toxicity"][0]
if toxicity_score >= self.threshold:
    outcome = outcome.append({"labels": "toxic"}, ignore_index=True)

Member: I changed it to use a list for the append operation; DataFrame.append throws a lot of warnings. You can refer to the other UDFs.
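
A sketch of the list-based approach being described (the helper signature and label strings are illustrative; the model.predict call mirrors the snippet above):

```python
import pandas as pd

def classify_toxicity(model, texts, threshold):
    # Collect labels in a plain list and build the DataFrame once, instead of
    # calling DataFrame.append per row (deprecated and noisy in recent pandas).
    labels = []
    for text in texts:
        single_result = model.predict(text)
        toxicity_score = single_result["toxicity"][0]
        labels.append("toxic" if toxicity_score >= threshold else "not toxic")
    return pd.DataFrame({"labels": labels})
```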

Member Author: Did you push your change?

if len(inp.columns) != 1:
    raise ValueError("input must only contain one column (seconds)")

seconds = pd.DataFrame(inp[inp.columns[0]])

Member: Isn't this a no-op?

Member Author: This is a timestamp UDF.

Member Author:
SELECT id, seconds, Timestamp(seconds) FROM MyVideo WHERE Timestamp(seconds) <= "00:00:01";
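
For reference, a minimal sketch of the seconds-to-timestamp conversion behind that query (function and column names are illustrative, not the exact UDF code):

```python
import time

import pandas as pd

def seconds_to_timestamp(seconds: pd.DataFrame) -> pd.DataFrame:
    # Format each value as "HH:MM:SS" so string comparisons like <= "00:00:01" work.
    formatted = seconds.iloc[:, 0].apply(
        lambda s: time.strftime("%H:%M:%S", time.gmtime(int(s)))
    )
    return pd.DataFrame({"timestamp": formatted})
```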

@@ -76,6 +76,7 @@ def test_create_multimedia_table_catalog_entry(self, mock):
ColumnDefinition(
"data", ColumnType.NDARRAY, NdArrayType.UINT8, [None, None, None]

Member: array_dimension should be a tuple.
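
That is, the dimensions would be passed as a tuple (None still marking an unconstrained dimension, mirroring the test line above):

```python
ColumnDefinition("data", ColumnType.NDARRAY, NdArrayType.UINT8, (None, None, None))
```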

@gaurav274 (Member):

> I just added a ToxicityClassifier UDF that runs on top of OCR labels.

Thanks! This will be a fun example to showcase 💯

@review-notebook-app (bot): Check out this pull request on ReviewNB for visual diffs of the Jupyter notebooks.
