Speed up grid_data endpoint by 10x #24284

ashb · 2022-06-07T12:14:20Z

These changes make the endpoint go from almost 20s down to 1.5s and the
changes are twofold:

1. Keep datetimes as objects for as long as possible

   Previously we were converting start/end dates for a task group to a
   string, and then in the parent parsing it back to a datetime to find
   the min and max of all the child nodes.

   The fix for that was to leave it as a datetime (or a
   pendulum.DateTime technically) and use the existing
   `AirflowJsonEncoder` class to "correctly" encode these objects on
   output.

2. Reduce the number of DB queries from 1 per task to 1.

   The removed `get_task_summaries` function was called for each task,
   and was making a query to the database to find info for the given
   DagRuns.

   The helper function now makes just a single DB query for all
   tasks/runs and constructs a dict to efficiently look up the ti by
   run_id.
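
As a rough illustration of the two changes described above (not the code from this PR): the sketch below indexes the rows of a single query by task_id and run_id, compares start/end dates as datetime objects, and only turns them into strings in a JSON encoder at output time. `TiRow`, `group_by_task`, `group_span` and `DatetimeEncoder` are hypothetical names made up for this example; `DatetimeEncoder` is a stand-in and makes no claim about how `AirflowJsonEncoder` formats its output.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class TiRow(NamedTuple):
    """Hypothetical stand-in for one row returned by the single query."""
    task_id: str
    run_id: str
    state: Optional[str]
    start_date: Optional[datetime]
    end_date: Optional[datetime]


class DatetimeEncoder(json.JSONEncoder):
    """Illustrative encoder: datetimes become strings only at output time."""

    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()
        return super().default(o)


def group_by_task(rows):
    """Index the rows of one query by task_id, then by run_id."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row.task_id][row.run_id] = row
    return grouped


def group_span(grouped, task_ids, run_id):
    """Min start / max end over the child tasks, compared as datetimes."""
    tis = [grouped[t][run_id] for t in task_ids if run_id in grouped.get(t, {})]
    starts = [ti.start_date for ti in tis if ti.start_date is not None]
    ends = [ti.end_date for ti in tis if ti.end_date is not None]
    return {
        "run_id": run_id,
        "start_date": min(starts, default=None),
        "end_date": max(ends, default=None),
    }


def utc(hour, minute):
    """Small helper to build example timestamps."""
    return datetime(2022, 6, 7, hour, minute, tzinfo=timezone.utc)


rows = [
    TiRow("a", "run_1", "success", utc(12, 0), utc(12, 5)),
    TiRow("b", "run_1", "success", utc(12, 1), utc(12, 9)),
]
summary = group_span(group_by_task(rows), ["a", "b"], "run_1")
payload = json.dumps(summary, cls=DatetimeEncoder, separators=(",", ":"))
```

The point of the dict is that each lookup by task and run is a hash lookup rather than another round trip to the database, and no date is parsed back from a string on the way up the task group tree.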

potiuk

❤️ it

ashb · 2022-06-07T12:20:57Z

(I'm still not thrilled at it taking 1.5s mind you, but it's a darn sight better)

ashb · 2022-06-07T12:22:44Z

Tests were done with a dag with 2000 dummy tasks grouped into 100 task groups

potiuk · 2022-06-07T12:30:11Z

I guess just transferring and parsing the resulting JSON will take most of the time now. BTW, are we using deflate there :D? And there are quite a few tricks with JavaScript VM optimization for JSON which we MIGHT take a look at :)
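
For a feel of what compression could buy here, a minimal sketch follows. It builds a made-up, repetitive payload (the field names and the 2000 records are assumptions chosen to echo the test DAG size, not real grid_data output) and compares the default JSON encoding, the compact-separator encoding, and the compact encoding after zlib compression, which is DEFLATE-based. It says nothing about whether the webserver actually compresses its responses.

```python
import json
import zlib

# Hypothetical payload: many near-identical task-instance records.
payload = {
    "groups": [
        {
            "id": f"task_{i}",
            "state": "success",
            "start_date": "2022-06-07T12:00:00+00:00",
        }
        for i in range(2000)
    ]
}

# Default separators are ", " and ": "; the compact form drops the spaces.
default_bytes = json.dumps(payload).encode()
compact_bytes = json.dumps(payload, separators=(",", ":")).encode()

# zlib.compress produces a zlib stream wrapping DEFLATE-compressed data.
deflated_bytes = zlib.compress(compact_bytes)

sizes = {
    "default": len(default_bytes),
    "compact": len(compact_bytes),
    "deflated": len(deflated_bytes),
}
print(sizes)
```

Repetitive keys and states compress well, so the compressed size is typically a small fraction of the raw JSON for data shaped like this.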

ashb · 2022-06-07T12:36:36Z

There are many much, much slower points to optimize on the front end before we have to worry about time spent parsing datetimes on the client side.

github-actions · 2022-06-07T12:54:57Z

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

norm

Love it.

airflow/utils/json.py

airflow/www/views.py

bbovenzi · 2022-06-13T17:30:29Z

I'm getting this error during auto-refresh:

TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'

ashb · 2022-06-13T17:33:42Z

Stack trace for that?

I didn't test with running DAGs. Oops! (Good thing we can add tests now)

bbovenzi · 2022-06-13T17:44:36Z

> Stack trace for that?
>
> I didn't test with running DAGs. Oops! (Good thing we can add tests now)

TypeError
TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'

Traceback (most recent call last)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2464, in __call__
 
    def __call__(self, environ, start_response):
        """The WSGI server calls the Flask application object as the
        WSGI application. This calls :meth:`wsgi_app` which can be
        wrapped to applying middleware."""
        return self.wsgi_app(environ, start_response)
 
    def __repr__(self):
        return "<%s %r>" % (self.__class__.__name__, self.name)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2450, in wsgi_app
            try:
                ctx.push()
                response = self.full_dispatch_request()
            except Exception as e:
                error = e
                response = self.handle_exception(e)
            except:  # noqa: B001
                error = sys.exc_info()[1]
                raise
            return response(environ, start_response)
        finally:
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1867, in handle_exception
            # if we want to repropagate the exception, we can attempt to
            # raise it with the whole traceback in case we can do that
            # (the function was actually called from the except part)
            # otherwise, we just raise the error again
            if exc_value is e:
                reraise(exc_type, exc_value, tb)
            else:
                raise e
 
        self.log_exception((exc_type, exc_value, tb))
        server_error = InternalServerError()
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    import collections.abc as collections_abc
 
    def reraise(tp, value, tb=None):
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
        raise value
 
    implements_to_string = _identity
 
else:
    iterkeys = lambda d: d.iterkeys()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
        ctx = self.request_context(environ)
        error = None
        try:
            try:
                ctx.push()
                response = self.full_dispatch_request()
            except Exception as e:
                error = e
                response = self.handle_exception(e)
            except:  # noqa: B001
                error = sys.exc_info()[1]
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
            request_started.send(self)
            rv = self.preprocess_request()
            if rv is None:
                rv = self.dispatch_request()
        except Exception as e:
            rv = self.handle_user_exception(e)
        return self.finalize_request(rv)
 
    def finalize_request(self, rv, from_error_handler=False):
        """Given the return value from a view function this finalizes
        the request by converting it into a response and invoking the
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
            return self.handle_http_exception(e)
 
        handler = self._find_error_handler(e)
 
        if handler is None:
            reraise(exc_type, exc_value, tb)
        return handler(e)
 
    def handle_exception(self, e):
        """Handle an exception that did not have an error handler
        associated with it, or that was raised from an error handler.
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    import collections.abc as collections_abc
 
    def reraise(tp, value, tb=None):
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
        raise value
 
    implements_to_string = _identity
 
else:
    iterkeys = lambda d: d.iterkeys()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
        self.try_trigger_before_first_request_functions()
        try:
            request_started.send(self)
            rv = self.preprocess_request()
            if rv is None:
                rv = self.dispatch_request()
        except Exception as e:
            rv = self.handle_user_exception(e)
        return self.finalize_request(rv)
 
    def finalize_request(self, rv, from_error_handler=False):
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
            getattr(rule, "provide_automatic_options", False)
            and req.method == "OPTIONS"
        ):
            return self.make_default_options_response()
        # otherwise dispatch to the handler for that endpoint
        return self.view_functions[rule.endpoint](**req.view_args)
 
    def full_dispatch_request(self):
        """Dispatches the request and on top of that performs request
        pre and postprocessing as well as HTTP exception catching and
        error handling.
File "/opt/airflow/airflow/www/auth.py", line 43, in decorated
 
            dag_id = (
                request.args.get("dag_id") or request.form.get("dag_id") or (request.json or {}).get("dag_id")
            )
            if appbuilder.sm.check_authorization(permissions, dag_id):
                return func(*args, **kwargs)
            elif not g.user.is_anonymous and not g.user.perms:
                return (
                    render_template(
                        'airflow/no_roles_permissions.html',
                        hostname=socket.getfqdn()
File "/opt/airflow/airflow/www/views.py", line 3622, in grid_data
 
            dag_runs = query.order_by(DagRun.execution_date.desc()).limit(num_runs).all()
            dag_runs.reverse()
            encoded_runs = [wwwutils.encode_dag_run(dr) for dr in dag_runs]
            data = {
                'groups': dag_to_grid(dag, dag_runs, session),
                'dag_runs': encoded_runs,
            }
 
        # avoid spaces to reduce payload size
        return (
File "/opt/airflow/airflow/www/views.py", line 396, in dag_to_grid
            'children': children,
            'tooltip': task_group.tooltip,
            'instances': group_summaries,
        }
 
    return task_group_to_grid(dag.task_group, dag_runs, grouped_tis)
 
 
def task_group_to_dict(task_item_or_group):
    """
    Create a nested dict representation of this TaskGroup and its children used to construct
File "/opt/airflow/airflow/www/views.py", line 357, in task_group_to_grid
 
        # Task Group
        task_group = item
 
        children = [
            task_group_to_grid(child, dag_runs, grouped_tis) for child in task_group.topological_sort()
        ]
 
        def get_summary(dag_run, children):
            child_instances = [child['instances'] for child in children if 'instances' in child]
            child_instances = [
File "/opt/airflow/airflow/www/views.py", line 357, in <listcomp>
 
        # Task Group
        task_group = item
 
        children = [
            task_group_to_grid(child, dag_runs, grouped_tis) for child in task_group.topological_sort()
        ]
 
        def get_summary(dag_run, children):
            child_instances = [child['instances'] for child in children if 'instances' in child]
            child_instances = [
File "/opt/airflow/airflow/www/views.py", line 341, in task_group_to_grid
                if record:
                    set_overall_state(record)
                    yield record
 
            if item.is_mapped:
                instances = list(_mapped_summary(grouped_tis.get(item.task_id, [])))
            else:
                instances = list(map(_get_summary, grouped_tis.get(item.task_id, [])))
 
            return {
                'id': item.task_id,
File "/opt/airflow/airflow/www/views.py", line 333, in _mapped_summary
                            'end_date': ti_summary.end_date,
                            'mapped_states': {ti_summary.state: ti_summary.state_count},
                            'state': None,  # We change this before yielding
                        }
                        continue
                    record['start_date'] = min(record['start_date'], ti_summary.start_date)
                    record['end_date'] = max(record['end_date'], ti_summary.end_date)
                    record['mapped_states'][ti_summary.state] = ti_summary.state_count
                if record:
                    set_overall_state(record)
                    yield record
TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'
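
The last frame fails because `min()` compares its arguments with `<`, and a task instance that has not started or finished yet has `None` for its dates. One way to avoid this, sketched below, is to drop `None` values before taking the min or max. The helper names are hypothetical and this is not necessarily how the PR fixed it.

```python
from datetime import datetime


def min_ignoring_none(*values):
    """Smallest non-None value, or None when every value is None."""
    return min((v for v in values if v is not None), default=None)


def max_ignoring_none(*values):
    """Largest non-None value, or None when every value is None."""
    return max((v for v in values if v is not None), default=None)


# Reproduce the failure from the traceback: min() needs '<' on its arguments.
try:
    min(None, None)
except TypeError as exc:
    error_message = str(exc)

# The same merge step as in _mapped_summary, but tolerant of missing dates.
started = datetime(2022, 6, 13, 17, 30)
record = {"start_date": None, "end_date": None}
record["start_date"] = min_ignoring_none(record["start_date"], started)
record["end_date"] = max_ignoring_none(record["end_date"], None)
```

With this approach a record whose mapped task instances are all still running keeps `end_date` as `None` instead of raising.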

Note that this possibly has incorrect behaviour, in that the end_date of a TaskGroup is set to the max of all the children's end dates, even if some are still running. (This is the existing behaviour and is not changed or altered by this change - limiting it to just performance fixes)

ashb · 2022-06-14T14:39:26Z

PTAL @bbovenzi I've fixed the issue

bbovenzi · 2022-06-14T16:01:47Z

Looking good, but there are some linting issues. Run `yarn lint` in airflow/www. They should have been caught by pre-commit; I'll look into that.

ashb · 2022-06-15T12:02:00Z

The Helm/Kube test failures are not caused by this PR; they are failing on main too. Merging.

* Speed up grid_data endpoint by 10x

  These changes make the endpoint go from almost 20s down to 1.5s and the changes are twofold:

  1. Keep datetimes as objects for as long as possible

     Previously we were converting start/end dates for a task group to a string, and then in the parent parsing it back to a datetime to find the min and max of all the child nodes.

     The fix for that was to leave it as a datetime (or a pendulum.DateTime technically) and use the existing `AirflowJsonEncoder` class to "correctly" encode these objects on output.

  2. Reduce the number of DB queries from 1 per task to 1.

     The removed `get_task_summaries` function was called for each task, and was making a query to the database to find info for the given DagRuns.

     The helper function now makes just a single DB query for all tasks/runs and constructs a dict to efficiently look up the ti by run_id.

* Add support for mapped tasks in the grid data

* Don't fail when not all tasks have a finish date.

  Note that this possibly has incorrect behaviour, in that the end_date of a TaskGroup is set to the max of all the children's end dates, even if some are still running. (This is the existing behaviour and is not changed or altered by this change - limiting it to just performance fixes)

(cherry picked from commit 451a6f4)

ashb requested review from ryanahamilton and bbovenzi as code owners June 7, 2022 12:14

boring-cyborg bot added the area:webserver Webserver related Issues label Jun 7, 2022

ashb requested review from jedcunningham and removed request for ryanahamilton June 7, 2022 12:14

ashb added this to the Airflow 2.3.3 milestone Jun 7, 2022

potiuk approved these changes Jun 7, 2022

ashb requested a review from ephraimbuddy June 7, 2022 12:37

github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 7, 2022

norm approved these changes Jun 7, 2022

dstandish reviewed Jun 7, 2022

bbovenzi reviewed Jun 7, 2022

potiuk mentioned this pull request Jun 13, 2022

Fix flaky order of returned dag runs #24405

Merged

ashb force-pushed the quicker-grid-data-endpoint branch from ed13f48 to d471f1f Compare June 13, 2022 17:03

jedcunningham mentioned this pull request Jun 13, 2022

New grid view in Airflow 2.3.0 has very slow performance on large DAGs relative to tree view in 2.2.5 #23772

Closed

ashb added 3 commits June 14, 2022 15:36

Add support for mapped tasks in the grid data

1760e77

ashb force-pushed the quicker-grid-data-endpoint branch from d471f1f to b1bf967 Compare June 14, 2022 14:37

ashb requested review from potiuk and bbovenzi June 14, 2022 14:56

fix eslint errors

85cbbea

bbovenzi approved these changes Jun 14, 2022

ashb merged commit 451a6f4 into apache:main Jun 15, 2022

ashb deleted the quicker-grid-data-endpoint branch June 15, 2022 12:02

This was referenced Jun 25, 2022

Script to filter candidates for PR of the month based on heuristics #24654

Merged

Scheduler crashes with psycopg2.errors.DeadlockDetected exception #23361

Closed

ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Jun 30, 2022

This was referenced Jul 2, 2022

Status of testing of Apache Airflow 2.3.3rc1 #24806

Closed

Status of testing of Apache Airflow 2.3.3rc3 #24863

Closed
