Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cache warm_up] warm_up slice with dashboard default_filters #9311

Merged

Conversation

graceguo-supercat
Copy link

@graceguo-supercat graceguo-supercat commented Mar 16, 2020

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

Superset offers a couple of ways to warm_up cache:

Superset will generate a query like form_data={slice_id: _slice_id_} so that we can warm_up query saved with a given slice_id.

Now a lot dashboard are having default filters (and filters can have scopes), warm_up single slice without dashboard context make our cache hit rate declining. Currently in airbnb we have 10% dashboard with default filters, but 20% of dashboards are landed with default_filters.

This PR is to add dashboard context into the cache warm_up call. You can pass extra dashboard_id parameter to wam_up API, indicating the slice is called with a given dashboard. It should generate a query like

form_data = {"slice_id":926, "extra_filters": [
    {
      "col": "region",
      "op": "in",
      "val": [
        "East Asia & Pacific",
        "Latin America & Caribbean"
      ]
    }
  ]}

TEST PLAN

added new unit tests

REVIEWERS

@john-bodley @serenajiang @etr2460 @dpgaspar @mistercrunch

CONTAINER_TYPES = ["COLUMN", "GRID", "TABS", "TAB", "ROW"]


def get_dashboard_extra_filters(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@graceguo-supercat I thought this logic already existed in the backend and the idea we discussed was possibly refactoring the logic to ensure that this as well as the Celery warmup logic was using the same dashboard filter logic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Celery task is here: https://github.com/apache/incubator-superset/blob/000a038af193f2770dfe23a053bd1ff5e9e72bd4/superset/tasks/cache.py#L39

It missed a logic to check default_filters's scope if the scope is not globally, but pretty close. So my plan is to add logic for warm_up endpoint, then update Celery task to re-use this function in another PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh never mind, the Celery warmup change is very simple. So i just commit Celery warmup changes here. And all unit tests can be re-used without any change!

@codecov-io
Copy link

codecov-io commented Mar 16, 2020

Codecov Report

Merging #9311 into master will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #9311   +/-   ##
=======================================
  Coverage   59.08%   59.08%           
=======================================
  Files         374      374           
  Lines       12202    12205    +3     
  Branches     2986     2989    +3     
=======================================
+ Hits         7209     7211    +2     
- Misses       4814     4815    +1     
  Partials      179      179           
Impacted Files Coverage Δ
...plore/components/controls/FixedOrMetricControl.jsx 51.78% <0.00%> (+0.84%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1916a1...9752ff4. Read the comment docs.

Copy link
Member

@etr2460 etr2460 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good so far!

after looking through this and not knowing potential values for the filter object, i'm wondering if most of the .get("value", {}) and .get("value", []) statements should be .get("value") or {} and .get("value") or []

superset/views/utils.py Show resolved Hide resolved
slice_id: int, dashboard_id: int
) -> List[Dict[str, Any]]:
session = db.session()
slc = session.query(Slice).filter_by(id=slice_id).one_or_none()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like you don't actually need slc here, instead you could look for a slice in dashboard.slices based on the slice_id passed into the function. That would save a db call

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

superset/views/utils.py Outdated Show resolved Hide resolved
superset/views/utils.py Outdated Show resolved Hide resolved
superset/views/utils.py Outdated Show resolved Hide resolved
superset/views/utils.py Outdated Show resolved Hide resolved
if (
node_type == "CHART"
and node.get("meta")
and node.get("meta").get("chartId", 0) == slice_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the default value of 0 for get here because None will never be equal to slice_id

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could concat lines 336 and 337 to be,

and node.get("meta", {}).get("chartId") == slice_id


if node_type in CONTAINER_TYPES:
children = node.get("children", [])
if children:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be unnecessary, if you're concerned that children could be None or 0, then i'd recommend making the line above children = node.get("children") or []

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida would you mind taking a look as this PR refactors some of the logic related to your Celery task work.

if (
node_type == "CHART"
and node.get("meta")
and node.get("meta").get("chartId", 0) == slice_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could concat lines 336 and 337 to be,

and node.get("meta", {}).get("chartId") == slice_id

children = node.get("children", [])
if children:
# for child_id in children_ids:
if any(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is within the last if-statement, I think you can just write return any(...) instead of if any(...): return True.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@graceguo-supercat
Copy link
Author

graceguo-supercat commented Mar 17, 2020

looking good so far!

after looking through this and not knowing potential values for the filter object, i'm wondering if most of the .get("value", {}) and .get("value", []) statements should be .get("value") or {} and .get("value") or []

@etr2460 Thanks for code review! I follow documentation here:
https://docs.python.org/2/library/stdtypes.html#dict.get
and
https://stackoverflow.com/questions/33263623/dict-getkey-default-vs-dict-getkey-or-default

I would say most of my usage is for the dict that doesn't have key. Why do you think they should be .get("value") or {} and .get("value") or []?

scopes_by_filter_field = filter_scopes.get(filter_id, {})
for col, val in columns.items():
current_field_scopes = scopes_by_filter_field.get(col, {})
scoped_container_ids = current_field_scopes.get("scope", ["ROOT_ID"])
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this place, current_field_scopes.get("scope") the returned value could be [], which means this filter currently didn't apply to any tab. If i use current_field_scopes.get("scope") or ["ROOT_ID"]
the returned value will become ["ROOT_ID"] (which means globally apply)

@etr2460
Copy link
Member

etr2460 commented Mar 17, 2020

The main difference here applies if it's possible for a key to be set to a falsey value. If you have the dict:

d = {
  "key": None,
  "foo": False,
}

then:

d.get("key", []) # equals None
d.get("foo", {}) # equals False
d.get("key") or [] # equals []
d.get("foo") or {} # equals {}

This is really important to get right if you try to iterate through an array after "getting" it like this, as iterating over False or None will throw

@graceguo-supercat
Copy link
Author

graceguo-supercat commented Mar 17, 2020

The main difference here applies if it's possible for a key to be set to a falsey value. If you have the dict:

d = {
  "key": None,
  "foo": False,
}

then:

d.get("key", []) # equals None
d.get("foo", {}) # equals False
d.get("key") or [] # equals []
d.get("foo") or {} # equals {}

This is really important to get right if you try to iterate through an array after "getting" it like this, as iterating over False or None will throw

Correct. If the key exists, I should use falsy value, instead of assign a default value. See my example above.

@etr2460
Copy link
Member

etr2460 commented Mar 17, 2020

What i'm saying is if you default it to [] with or then you don't even need to check if it's falsey. The empty array will simply bypass the for loop

@graceguo-supercat
Copy link
Author

graceguo-supercat commented Mar 17, 2020

What i'm saying is if you default it to [] with or then you don't even need to check if it's falsey. The empty array will simply bypass the for loop

My question is this comment:
after looking through this and not knowing potential values for the filter object, i'm wondering if most of the .get("value", {}) and .get("value", []) statements should be .get("value") or {} and .get("value") or []

Why should change most of dict.get(key, default_value) to dict.get(key) or default_value?

@etr2460
Copy link
Member

etr2460 commented Mar 17, 2020

Here's a specific example:

You have filter_scopes = json_metadata.get("filter_scopes", {}) in the code, and then you later assume filter_scopes is a Dict. However, because the json_metadata can be edited by the user, it could be None or False. This would break with your current code, but not with json_metadata.get("filter_scopes") or {}

In general, I think we need to be a lot more defensive with parsing the json_metadata blob here, because an invalid metadata field isn't validated on save and it would break the code here

# is chart in this dashboard?
if (
dashboard is None
or not dashboard.json_metadata
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here i check json_metadata exists and not None.

):
return []

try:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add try catch exception for parsing dashboard metadata

filter_scopes: Dict,
default_filters: Dict[str, Dict[str, List]],
slice_id: int,
) -> List[Dict[str, Any]]:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added type check for layout, filter_scopes and default_filters, make sure they are Dictionary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The typecheck here isn't quite right because you've already assumed the type is correct in the function statement. I'd recommend moving the type check into the parent function so that the types are correct here

Copy link
Member

@etr2460 etr2460 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with one other comment. you might want to get a stamp from John or Beto too since i'm not super familiar with this code

filter_scopes: Dict,
default_filters: Dict[str, Dict[str, List]],
slice_id: int,
) -> List[Dict[str, Any]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The typecheck here isn't quite right because you've already assumed the type is correct in the function statement. I'd recommend moving the type check into the parent function so that the types are correct here

if col not in immune_fields:
extra_filters.append({"col": col, "op": "in", "val": val})
layout = json.loads(dashboard.position_json or "{}")
# do not apply filters if chart is immune to them
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. I would nix the comment as the following function has not reference to immunity.

dashboard is None
or not dashboard.json_metadata
or not dashboard.slices
or slice_id not in [slc.id for slc in dashboard.slices]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably preferred as it should short circuit,

or not any([slc for slc in dashboard.slices if slc.id == slice_id])

@graceguo-supercat graceguo-supercat merged commit adebd40 into apache:master Mar 18, 2020
@graceguo-supercat graceguo-supercat deleted the gg-WarmupSliceInWithDash branch March 22, 2020 05:06
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.36.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 0.36.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants