Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Implement model specific lookups by id to improve performance #20974

Merged
merged 4 commits into from
Aug 9, 2022

Conversation

bkyryliuk
Copy link
Member

@bkyryliuk bkyryliuk commented Aug 4, 2022

SUMMARY

Currently the access checks are happening 2ce in the explore flow, once during the object lookup and later doing an explicit check. Existing implementation was spending > 500 ms in our deployment on doing object by id lookup and that contributed to ~1 - 1.5 second slowdown of the explore endpoint.

The proposed solution skips the filter on the chart / datasource lookup and let's explore to handle the access checks and return 403 if it was denied. Prior behavior was to return 404.

Out of scope

def get_permissions(
function takes > 1 second in our deployment, it is part of the explore flow while bootsrapping user data
def bootstrap_user_data(user: User, include_perms: bool = False) -> Dict[str, Any]:
I will have separate PR to optimize the permissions lookup.

Before

Screen Shot 2022-08-03 at 1 41 41 PM

After

image

Screen Shot 2022-08-04 at 8 46 54 AM

TESTING INSTRUCTIONS

[x] Deployed and tested on dropbox staging

ADDITIONAL INFORMATION

More granular breakdown

@classmethod
def find_by_id(
    cls, model_id: Union[str, int], session: Session = None
) -> Optional[Model]:
    """
    Find a model by id, if defined applies `base_filter`
    """
    with newrelic.agent.FunctionTrace("superset.dao.base.find_by_id.session", group="DAO"):
        session = session or db.session
    with newrelic.agent.FunctionTrace("superset.dao.base.find_by_id.query_contruction", group="DAO"):
        query = session.query(cls.model_cls)
        if cls.base_filter:
            data_model = SQLAInterface(cls.model_cls, session)
            query = cls.base_filter(  # pylint: disable=not-callable
                cls.id_column_name, data_model
            ).apply(query, None)
        id_filter = {cls.id_column_name: model_id}
    try:
        with newrelic.agent.FunctionTrace("superset.dao.base.find_by_id.query_execution", group="DAO"):
            return query.filter_by(**id_filter).one_or_none()
    except StatementError:
        # can happen if int is passed instead of a string or similar
        return None

Screen Shot 2022-08-03 at 5 00 28 PM

@@ -491,6 +491,8 @@ def get_viz_annotation_data(
chart = ChartDAO.find_by_id(annotation_layer["value"])
if not chart:
raise QueryObjectValidationError(_("The chart does not exist"))
if not chart.datasource:
Copy link
Member Author

@bkyryliuk bkyryliuk Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addresses mypy issue in

            viz_obj = get_viz(
                datasource_type=chart.datasource.type,
                datasource_id=chart.datasource.id,
                form_data=form_data,
                force=force,
            )

@bkyryliuk bkyryliuk added the v1.5 label Aug 4, 2022
@bkyryliuk bkyryliuk changed the title [WIP] Implement model specific lookups by id to improve performance perf: Implement model specific lookups by id to improve performance [WIP] Aug 4, 2022
@bkyryliuk bkyryliuk force-pushed the bogdan/dao_query branch 2 times, most recently from ce579da to b5da0be Compare August 4, 2022 17:57
@pull-request-size pull-request-size bot added size/L and removed size/M labels Aug 4, 2022
@bkyryliuk bkyryliuk force-pushed the bogdan/dao_query branch 9 times, most recently from a8b2d63 to 28db76e Compare August 5, 2022 16:40
@apache apache deleted a comment from codecov bot Aug 5, 2022
@bkyryliuk bkyryliuk force-pushed the bogdan/dao_query branch 2 times, most recently from 9f56905 to eeb4fa5 Compare August 5, 2022 16:58
@codecov
Copy link

codecov bot commented Aug 5, 2022

Codecov Report

Merging #20974 (ab59aa4) into master (eb5369f) will decrease coverage by 0.03%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master   #20974      +/-   ##
==========================================
- Coverage   66.38%   66.34%   -0.04%     
==========================================
  Files        1767     1767              
  Lines       67232    67317      +85     
  Branches     7138     7138              
==========================================
+ Hits        44633    44664      +31     
- Misses      20773    20827      +54     
  Partials     1826     1826              
Flag Coverage Δ
hive 53.15% <16.66%> (-0.05%) ⬇️
mysql 80.92% <66.66%> (-0.11%) ⬇️
postgres 80.98% <66.66%> (-0.12%) ⬇️
presto 53.05% <16.66%> (-0.05%) ⬇️
python 81.46% <66.66%> (-0.12%) ⬇️
sqlite 79.58% <66.66%> (-0.11%) ⬇️
unit 50.49% <66.66%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/common/query_context_processor.py 88.49% <0.00%> (-1.51%) ⬇️
superset/dao/base.py 95.40% <100.00%> (ø)
superset/explore/utils.py 100.00% <100.00%> (ø)
superset/explore/form_data/commands/utils.py 92.30% <0.00%> (-7.70%) ⬇️
superset/explore/form_data/api.py 89.24% <0.00%> (-6.46%) ⬇️
superset/views/datasource/utils.py 93.33% <0.00%> (-2.02%) ⬇️
superset/models/sql_lab.py 77.01% <0.00%> (-1.74%) ⬇️
superset/explore/permalink/api.py 91.37% <0.00%> (-1.73%) ⬇️
superset/models/helpers.py 39.40% <0.00%> (-1.38%) ⬇️
... and 8 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@bkyryliuk bkyryliuk requested a review from a team August 5, 2022 19:23
@bkyryliuk bkyryliuk changed the title perf: Implement model specific lookups by id to improve performance [WIP] perf: Implement model specific lookups by id to improve performance Aug 5, 2022
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this perf issue!

cls,
model_id: Union[str, int],
session: Session = None,
no_filter: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no_filter feels slightly ambiguous. Would skip_filter or ignore_filter be more explicit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's go with skip_filter

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit nit: can we maybe call this skip_base_filter to make it even more explicit? I was confused for a while why get_by_id would need still need a filter.

tests/unit_tests/charts/dao/dao_tests.py Outdated Show resolved Hide resolved
Comment on lines +26 to +43
from superset.connectors.sqla.models import SqlaTable
from superset.models.core import Database

engine = session.get_bind()
SqlaTable.metadata.create_all(engine) # pylint: disable=no-member

db = Database(database_name="my_database", sqlalchemy_uri="sqlite://")
sqla_table = SqlaTable(
table_name="my_sqla_table",
columns=[],
metrics=[],
database=db,
)

session.add(db)
session.add(sqla_table)
session.flush()
yield session
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note about missing cleanup as previously.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ktmud
Copy link
Member

ktmud commented Aug 8, 2022

I wonder if the slowness in your check permission query is because postgres did not automatically create indexes on the sqlatable_user table (ref).

superset_test=> \d sqlatable_user;
                             Table "public.sqlatable_user"
  Column  |  Type   | Collation | Nullable |                  Default                   
----------+---------+-----------+----------+--------------------------------------------
 id       | integer |           | not null | nextval('sqlatable_user_id_seq'::regclass)
 user_id  | integer |           |          | 
 table_id | integer |           |          | 
Indexes:
    "sqlatable_user_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "sqlatable_user_table_id_fkey" FOREIGN KEY (table_id) REFERENCES tables(id)
    "sqlatable_user_user_id_fkey" FOREIGN KEY (user_id) REFERENCES ab_user(id)

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bkyryliuk
Copy link
Member Author

bkyryliuk commented Aug 8, 2022

I wonder if the slowness in your check permission query is because postgres did not automatically create indexes on the sqlatable_user table (ref).

superset_test=> \d sqlatable_user;
                             Table "public.sqlatable_user"
  Column  |  Type   | Collation | Nullable |                  Default                   
----------+---------+-----------+----------+--------------------------------------------
 id       | integer |           | not null | nextval('sqlatable_user_id_seq'::regclass)
 user_id  | integer |           |          | 
 table_id | integer |           |          | 
Indexes:
    "sqlatable_user_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "sqlatable_user_table_id_fkey" FOREIGN KEY (table_id) REFERENCES tables(id)
    "sqlatable_user_user_id_fkey" FOREIGN KEY (user_id) REFERENCES ab_user(id)

sqlatable_user

Yeah - we have both indexes:
image
image

new relic states that a lot of time is spent in python code, not sql execution. I have 2 more places I want to optimized for better perf: https://apache-superset.slack.com/archives/G013HAE6Y0K/p1659722615103529?thread_ts=1659395755.971919&cid=G013HAE6Y0K and will have PRs coming up soon

@bkyryliuk bkyryliuk requested a review from villebro August 8, 2022 23:40
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice polishing in the last commit 👍

@bkyryliuk bkyryliuk merged commit 17b5803 into apache:master Aug 9, 2022
@bkyryliuk bkyryliuk deleted the bogdan/dao_query branch August 9, 2022 18:15
michael-s-molina pushed a commit that referenced this pull request Aug 30, 2022
…20974)

* Implement model specific lookups by id to improve performance

* Address comments e.g. better variable names and test cleanup

* commit after cleanup

* even better name and test cleanup via rollback

Co-authored-by: Bogdan Kyryliuk <bogdankyryliuk@dropbox.com>
@mistercrunch mistercrunch added 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.1.0 and removed 🚢 2.1.3 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🍒 1.5.2 🍒 1.5.3 🚢 2.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants