
[1/n] Move subscription to web worker #9795

Closed

Conversation

salazarm
Contributor

@salazarm salazarm commented Sep 26, 2022

Summary & Motivation

This PR recreates #8734

Organizations have runs that can output thousands of log lines, upwards of 500 MB of logs. This causes extreme slowdown in the UI as the main thread struggles to keep up with the flood of logs coming from the backend, both while the run is active and especially once the run has completed and all of the logs are loaded upfront. The main source of lag in the UI is deserializing the JSON data. To reduce this time, we're planning to send only the logs/data for things that are actually visible in the UI.

https://elementl.quip.com/To6cAXcqGIbi/rfc-Improving-Job-Run-Performance

Right now the worker is only enabled if the query parameter `worker=1` is in the URL.
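A minimal sketch of that opt-in gating, under the assumption it reads the page's query string (the `shouldUseWorker` name is illustrative, not from the PR):

```typescript
// Illustrative sketch of the ?worker=1 opt-in described above.
// `shouldUseWorker` is a hypothetical name, not the PR's actual code.
export function shouldUseWorker(search: string): boolean {
  return new URLSearchParams(search).get('worker') === '1';
}

// In the browser this would be called as shouldUseWorker(window.location.search)
```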

How I Tested These Changes

Loaded a huge run with dagit-debug and saw everything loaded fine.

Ran a few runs with and without the worker, and the worker consistently beat the non-worker version:

many_events.py:

No worker:

  • 115,898.3 ms
  • 114,627.2 ms
  • 115,695.5 ms
  • 116,075.1 ms
    Average: 115.5 seconds

Worker:

  • 97,091.0 ms
  • 96,878.7 ms
  • 97,609.2 ms
  • 97,558.4 ms
    Average: 97.2 seconds

~16% improvement

Adtriba run debug file:

No Worker:

  • 332,907.6 ms
  • 328,559.4 ms
  • 311,353.5 ms
  • 324,224.7 ms

Average: 324.26 seconds

Worker:

  • 96,900.2 ms
  • 102,775.2 ms
  • 99,352.2 ms
  • 100,879.4 ms

Average: 99.98 seconds

~69% improvement


Member

@alangenfeld alangenfeld left a comment


I see 1/N in the title, but the PR summary does not make it clear what the expected scope of this is. From what I can tell, this on its own doesn't improve anything; it just adds a web worker hop to parsing the same volume of data.

```ts
worker.addEventListener('message', (event) => {
  try {
    onLogs(JSON.parse(arrayBufferToString(event.data)));
  } catch (_) {}
});
```
Member

we should make some noise somewhere on failure - this would be really gnarly to debug

Contributor Author

Yeah, I'll remove the try/catch
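A sketch of what surfacing the failure could look like, assuming the PR's UTF-16 `arrayBufferToString` helper; the `parseLogsPayload` name and listener wiring are illustrative, not the actual fix:

```typescript
// Mirrors the UTF-16 decoding helper from this PR.
function arrayBufferToString(buf: ArrayBuffer): string {
  return String.fromCharCode(...Array.from(new Uint16Array(buf)));
}

// Decode a worker payload; JSON.parse errors propagate instead of being
// silently swallowed. `parseLogsPayload` is a hypothetical name.
export function parseLogsPayload(data: ArrayBuffer): unknown {
  return JSON.parse(arrayBufferToString(data));
}

// In the browser, the listener could then make noise on failure:
// worker.addEventListener('message', (event) => {
//   try {
//     onLogs(parseLogsPayload(event.data));
//   } catch (err) {
//     console.error('Failed to decode worker message', err);
//     throw err;
//   }
// });
```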

@salazarm
Contributor Author

salazarm commented Sep 26, 2022

@alangenfeld Yeah, this is just setup for what we want to do, which is to send only the data that would be visible from the web worker to the main thread, so that the main thread does less work overall.

The linked Linear task has more information, and here's the quip describing the plan: https://elementl.quip.com/To6cAXcqGIbi/rfc-Improving-Job-Run-Performance

By the way, I did four trial runs each with and without the worker, and the worker version was consistently faster at loading a run with 200k events.

@salazarm
Contributor Author

salazarm commented Sep 27, 2022

Just tested this on Adtriba's debug file from a while ago and the improvements are huge there:
No Worker:

  • 332,907.6 ms
  • 328,559.4 ms
  • 311,353.5 ms
  • 324,224.7 ms

Worker:

  • 96,900.2 ms
  • 102,775.2 ms
  • 99,352.2 ms
  • 100,879.4 ms

I don't see a lot more room for improvement on the frontend, but my MacBook's Python server is pretty slow, so it's the bottleneck for me. On a faster machine it's possible it could overwhelm the frontend. I'll see if I can set up Dagster on my personal workstation PC.

@alangenfeld
Member

What exactly is that time measuring? May be useful to include profiler screenshots.

It's not obvious to me why things are faster with this change if the volume of data parsed is the same in both approaches, unless

the main source of lag in the UI is deserializing the JSON data

isn't right or needs some more specific detail.

@salazarm
Contributor Author

What exactly is that time measuring? May be useful to include profiler screenshots.

I can put up a commit with the test setup. It just measures from page load until the last log message from the webserver is handled (I used a requestAnimationFrame to get the time when rendering of the last message is done).
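Roughly, the measurement approach described above could be sketched like this; the names and wiring are assumptions, not the actual test commit, and the averaging mirrors how the four-trial numbers in this thread were reported:

```typescript
// Average several trial timings (in ms) into seconds, the way the
// figures quoted in this thread were produced.
export function averageSeconds(trialsMs: number[]): number {
  const totalMs = trialsMs.reduce((sum, ms) => sum + ms, 0);
  return totalMs / trialsMs.length / 1000;
}

// Hypothetical browser wiring:
// const startMs = performance.now();
// function onLastMessageHandled() {
//   // requestAnimationFrame fires after the final batch has rendered,
//   // so the last render is included in the measurement.
//   requestAnimationFrame(() => trials.push(performance.now() - startMs));
// }
```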

It's not obvious to me why things are faster with this change if the volume of data parsed is the same in both approaches, unless
the main source of lag in the UI is deserializing the JSON data
isn't right or needs some more specific detail.

Yeah, I'm going to be honest, it did catch me by surprise. I can do some more digging to figure out exactly where the time is coming from. I set up my workstation, so hopefully a beefier setup will make it easier to profile (my MacBook hangs every time I try to profile the non-worker version).

Member

@hellendag hellendag left a comment


The perf benefit looks really cool. Thanks for putting this together!

Comment on lines +1 to +13
```ts
import {gql} from '@apollo/client';

export const PYTHON_ERROR_FRAGMENT = gql`
  fragment PythonErrorFragment on PythonError {
    __typename
    message
    stack
    cause {
      message
      stack
    }
  }
`;
```
Member

This fragment is already defined in PythonErrorInfo, can you reuse that?

Contributor Author

Yeah, I didn't want the initial PR to have a ton of changes so that it's easier to review, but I'll do what I did in the original PR: move the dependencies around and change all the imports so that there's only one definition.


```ts
let apolloClient: ApolloClient<any> | undefined = undefined;

export function setup(data: any) {
```
Member

Can this type be tightened up?

```ts
});

apolloClient = new ApolloClient({
  cache: new InMemoryCache(),
```
Member

Worth using createAppCache?

Comment on lines +1 to +24
```ts
// https://developer.chrome.com/blog/how-to-convert-arraybuffer-to-and-from-string/

export function arrayBufferToString(buf: ArrayBuffer): string {
  return String.fromCharCode.apply(null, new Uint16Array(buf) as any);
}

// Chunk the data into 16,000-character pieces to avoid hitting max-argument
// errors when converting back to a string with String.fromCharCode.apply
const MAX_BUFFER_SIZE = 16000;
export function stringToArrayBuffers(str: string): ArrayBuffer[] {
  const buffers: ArrayBuffer[] = [];
  let buf = new ArrayBuffer(0);
  let view = new Uint16Array(buf);
  for (let i = 0, strLen = str.length; i < strLen; i++) {
    const index = i % MAX_BUFFER_SIZE;
    if (index === 0) {
      buf = new ArrayBuffer(Math.min(MAX_BUFFER_SIZE, str.length - i) * 2); // 2 bytes per char
      view = new Uint16Array(buf);
      buffers.push(buf);
    }
    view[index] = str.charCodeAt(i);
  }
  return buffers;
}
```
Member

The commented URL looks fairly old and has an update note -- is there a more recent native API we can use for this?

Contributor Author

This is using native APIs underneath; it's just chunking to avoid going over the max number of arguments allowed for a function (since we call String.fromCharCode.apply on the buffer, the number of arguments is the size of the array).
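On the question of a more recent native API: `TextEncoder`/`TextDecoder` could plausibly replace the chunked conversion, since they handle arbitrary lengths without hitting argument limits. Note they produce UTF-8 bytes rather than the UTF-16 code units used here, so both the worker and main-thread sides would need to switch together. A sketch, not the PR's code:

```typescript
// UTF-8 encode: no chunking needed, no max-argument limit.
export function stringToBuffer(str: string): ArrayBuffer {
  const bytes = new TextEncoder().encode(str);
  // Slice to the exact byte range in case the view doesn't own its whole buffer.
  return bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
}

export function bufferToString(buf: ArrayBuffer): string {
  return new TextDecoder().decode(buf);
}
```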

```ts
import {PipelineRunLogsSubscription} from '../../runs/types/PipelineRunLogsSubscription';

export type Message = INITIALIZE;
type INITIALIZE = {
```
Member

type InitializeMessage? It's a little confusing on the eyes for a type to be ALL_CAPS since it looks like a constant value at a glance.

Comment on lines 3 to 6
```ts
type SHUTDOWN = {
  type: 'SHUTDOWN';
  staticPathRoot?: undefined;
};
```
Member

Define INITIALIZE and SHUTDOWN types in the same location in a union so that you can have data be a single union type? Just to tidy things up a bit.
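The suggested tidy-up might look something like this; the field names beyond `type` are guesses from the snippets above, not the final code:

```typescript
// Both message shapes defined together as a discriminated union,
// so handlers can narrow on the `type` field.
type InitializeMessage = {
  type: 'INITIALIZE';
  staticPathRoot: string;
};

type ShutdownMessage = {
  type: 'SHUTDOWN';
};

export type Message = InitializeMessage | ShutdownMessage;

export function handle(msg: Message): string {
  switch (msg.type) {
    case 'INITIALIZE':
      // Narrowed to InitializeMessage here, so staticPathRoot is available.
      return `initialize: ${msg.staticPathRoot}`;
    case 'SHUTDOWN':
      return 'shutdown';
  }
}
```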

@salazarm
Contributor Author

salazarm commented Sep 28, 2022

I profiled both the worker and non-worker versions. Getting a Chrome profile is difficult because the buffer fills up pretty fast and the load takes a long time, so instead I counted the number of GanttChart re-renders, the number of calls to throttleSetNodes, and the number of times throttleSetNodes actually ran. This is with the 500 MB debug file. I ran these runs multiple times and the results were very consistent.

[screenshot: non-worker vs. worker render counts]
You can see the worker version did more batching because it had fewer runs of setNodes than there were calls to throttleSetNodes. This is because the GanttChart re-render is slow, so the worker ends up queuing multiple messages for the main thread to process. I'm not entirely sure why this behavior doesn't happen without the worker. This means the non-worker version scales linearly with the number of batches of messages, while the worker version takes only as long as the webserver takes, plus the time of one render for the last batch of messages.

In this profile I turned off the GanttChart, which is the slowest part of the page to re-render. With the GanttChart disabled, there were no batched throttleSetNodes calls in either the worker or non-worker version, because the rendering was done before the webserver sent the next message.

[screenshot: non-worker vs. worker, Gantt chart disabled]

With the worker version, the main bottleneck now is the Python webserver, so I think we can skip all the other virtualization work since we wouldn't get much benefit. I suspect there are a lot of unnecessary React re-renders in the Gantt chart, but I'm not entirely sure. It's probably not worthwhile speeding this up too much because the app still feels usable, though the last render is brutal and can last 10 seconds.

@salazarm
Contributor Author

salazarm commented Sep 28, 2022

I was able to take a full profile for both versions with the Gantt chart off.

Non-worker:
[profiler screenshots]

Worker:
[profiler screenshots]

One interesting thing is that the non-worker version has double the "system" time, though it's not much. In both cases the main thread spends a lot of time idle just waiting for the next message, but in the non-worker case there is a lot more idle time. I'm not entirely sure why; there are a lot of possibilities here.

That last render is very long

I should diff the bottom-up performance data to figure out what that extra scripting time is in the non-worker version. I was expecting both of them to do the same amount of work, since it's the same number of renders for each one.

@salazarm salazarm changed the title [1/n] Virtualize run data subscription - Move subscription to web worker [1/n] Move subscription to web worker Sep 28, 2022
@salazarm
Contributor Author

@hellendag I removed the branching based on the URL and made the worker the default since it performs better. Also, thanks for the detailed feedback! I know the PR was really rough; I was mostly trying to get something quick out there for feedback and testing.

@salazarm salazarm force-pushed the marco/brr-289-run-page-performance-w-large-jobs-1 branch from d28ab99 to 704767a on September 29, 2022 at 20:20 (UTC)
Member

@alangenfeld alangenfeld left a comment


Copying this over from Slack DMs; context was investigating the mysterious overall speed-up of the worker version:

So I was playing around with this change:

```ts
if (!logs.hasMorePastEvents) {
  requestAnimationFrame(throttledSetNodes);
}
```

just against master, and noticed in the React profiler that even though we try not to set nodes, the Apollo subscription update still causes RunWithData to re-render every time we receive a websocket payload.

[React profiler screenshot]

which upon googling appears to be a common source of performance issues

https://stackoverflow.com/questions/61876931/how-to-prevent-re-rendering-with-usesubscription
https://medium.com/@seniv/improve-performance-of-your-react-apollo-application-440692e37026
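One mitigation pattern those links describe, sketched here generically as an assumption rather than what this PR does: buffer subscription payloads in a mutable structure instead of triggering a state update (and thus a re-render) on every websocket message, and only re-render periodically.

```typescript
// Hypothetical sketch: accumulate subscription payloads in a plain array and
// invoke the render callback once per `flushEvery` payloads, rather than once
// per payload. `makeBufferedHandler` is an illustrative name.
export function makeBufferedHandler<T>(render: () => void, flushEvery: number) {
  const buffer: T[] = [];
  let sinceRender = 0;
  return {
    buffer,
    onPayload(payload: T) {
      buffer.push(payload); // no re-render here
      sinceRender++;
      if (sinceRender >= flushEvery) {
        sinceRender = 0;
        render(); // one re-render covers many payloads
      }
    },
  };
}
```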
