Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to best handle "child cannot end after parent" errors? #8

Open
soulgalore opened this issue May 3, 2019 · 25 comments
Open

How to best handle "child cannot end after parent" errors? #8

soulgalore opened this issue May 3, 2019 · 25 comments

Comments

@soulgalore
Copy link
Contributor

Hi @aslushnikov
when I test cnet.com sometimes my trace logs is broken with the child cannot end after parent error. I've pushed an example trace here: https://github.com/soulgalore/tracium/blob/failing-cnet/examples/cnet.json

I think this started to happen 1-2 Chrome releases ago. I guess it should be fixed upstream but I wonder if Tracium should just swallow the error and continue?

Best
Peter

@mbalabash
Copy link
Contributor

mbalabash commented May 3, 2019

Hello
I have same error (Fatal trace logic error - child cannot end after parent). But only when i run Tracium on CI server (Travis CI).
On my personal laptop it doesn't happen.

@patrickhulce
Copy link
Collaborator

Our thinking was that if this has happened, all our assumptions about trace processing might not hold so it's better to alert the consumer than provide potentially false numbers.

Anything extra about the type of activity you're recording you can share?

@mbalabash
Copy link
Contributor

mbalabash commented May 3, 2019

Hi @patrickhulce
Thanks for quick response.

I have next html file:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Estimo Template</title>
  </head>
  <body>
    <script src="https://unpkg.com/react@16/umd/react.development.js"></script>
    <h1>Estimo</h1>
  </body>
</html>

And next code to capture Chrome trace:

...
await page.tracing.start({ path: traceFile })
try {
  await page.goto(getUrlToHtmlFile(lib.html), { timeout: options.timeout })
} catch (err) {
  handleSessionError(err, browser)
}
await page.tracing.stop()
...

Then i'm trying to parse it with Tracium:

async function generateTasksReport(pathToTraceFile) {
  const content = JSON.parse(await readFile(pathToTraceFile))
  const tasks = tracium.computeMainThreadTasks(content, {
    flatten: true,
  })
  return tasks
}

On my local machine it works perfectly.
But when i run it on CI it throws an error:

  /home/travis/build/mbalabash/estimo/node_modules/tracium/lib/main-thread-tasks.js:155
  Rejected promise returned by test. Reason:
  Error {
    message: 'Fatal trace logic error - child cannot end after parent',
  }
  Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:155:13)
  node_modules/tracium/lib/main-thread-tasks.js:159:37
  Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:159:8)
  node_modules/tracium/lib/main-thread-tasks.js:159:37
  Function._computeRecursiveSelfTime (node_modules/tracium/lib/main-thread-tasks.js:159:8)
  Function.getMainThreadTasks (node_modules/tracium/lib/main-thread-tasks.js:244:23)
  Function.compute (node_modules/tracium/lib/main-thread-tasks.js:272:28)
  Object.computeMainThreadTasks (node_modules/tracium/index.js:13:36)

This may be due to the fact that script refers to remote url?

@soulgalore
Copy link
Contributor Author

From my side I think it's content related somehow? It happens sometimes when I test cnet.com, I test the same way all the time, but sometimes the trace fails. Since it's CNET I was thinking there's one of the ads or something that is acting up somehow?

@patrickhulce
Copy link
Collaborator

Hm interesting, thanks for the repro case @mbalabash.

@soulgalore are your failures also on a linux box or CI environment only?

@soulgalore
Copy link
Contributor Author

@patrickhulce I could reproduce today on my Mac, else we are running in Docker on Linux (Ubuntu) where we also get it. I've also got reports from users that got the same error but on other pages, but they haven't shared their URLs. We used to use our own parsing lib but wanna move over to Tracium asap. And we got the same problem with events in the wrong order.

@soulgalore
Copy link
Contributor Author

Had a go and tried to reproduce with puppeteer (I use Chromedriver normally). I didn't get the same error but the same behaviour that one out of three/four runs fails. The one failing are waiting a long time on some content (I'm running without any throttling) and the runs that are ok finish within a couple of seconds, but when it fails it takes 15-20s (maybe). And when it fails I get LHError: NO_TRACING_STARTED. So I think there's some content that somehow creates the problem?

const puppeteer = require('puppeteer');
const tracium = require('tracium');
const fs = require('fs');
const util = require('util');
const readFile = util.promisify(fs.readFile);

puppeteer.launch({ headless: false })
    .then(async (browser) => {
            const page = await browser.newPage();
            await page.setViewport({ width: 1024, height: 800 });
            await page.tracing.start({ path: 'trace.json' });
            await page.goto('http://www.cnet.com', { waitUntil: 'load', timeout: 0 });
            await page.tracing.stop();
            browser.close();
            const content = JSON.parse(await readFile('trace.json'));
            const tasks = tracium.computeMainThreadTasks(content, {
                    flatten: true,
             });
})

@patrickhulce
Copy link
Collaborator

Thanks @soulgalore! NO_TRACING_STARTED is a different problem that's particularly prevalent in m74+. We've got a fix for it in LH master but it hasn't been ported over here yet.

I'll have to dig in to those repro cases you've provided when I've got time. I haven't seen those cases locally before, but I'll try with your settings :)

@mbalabash
Copy link
Contributor

Thanks! @patrickhulce
I have small update for my case.

This may be due to the fact that script refers to remote url?

No. I got same error when used local scripts.

@soulgalore
Copy link
Contributor Author

Yeah, I understand that it's another error but I thought I saw the same kind of behaviour. I've seen this on another URL (https://www.aftonbladet.se), it's a Swedish newspaper with a lot of ads. Let me see if I can find a way to reproduce.

@mbalabash
Copy link
Contributor

Hi @patrickhulce
Do you have any updates or some ideas?

@patrickhulce
Copy link
Collaborator

So far the only case where I've seen this happen where the trace isn't totally useless/non-sensical is when the ending trace event of the child that is B/E is missing and it was nested under an X.

Available options I see:

  1. Anytime we are missing an end event for a B/E pair, we just end it at the last logical place that would've made sense (the end of the parent for example). This is a bit aggressive and could lead to some misleading conclusions, but might be good enough.
  2. We omit any toplevel tasks (and all their children) that contained this nonsensical scenario.
  3. We only support the above case if it's the last event in the trace.
  4. We continue to fatally throw for a case of invalid data.

I tend to support option 4 in Lighthouse, but I see the value of 1/2 in the tracium case where other consumers might be ready to be more forgiving (maybe it's even a setting to pass in failLoudly: true or something?)

@soulgalore
Copy link
Contributor Author

failLoudly: true

If that's ok for you all that would be great. I've been testing Tracium in different scenarios the last days and have gotten others errors too. I wonder if it could be because of that I use Chromedriver and somehow some events is dropped? I also do other scenarios, like testing a SPA.

I've got cases where one event don't get a duration/selfTime so that the https://github.com/aslushnikov/tracium/blob/master/lib/main-thread-tasks.js#L258 check fails.

Also got traces where I get end event without a start as in https://github.com/aslushnikov/tracium/blob/master/lib/main-thread-tasks.js#L111

Also using the debug log in those places would help.

@patrickhulce
Copy link
Collaborator

I wonder if it could be because of that I use Chromedriver and somehow some events is dropped? I also do other scenarios, like testing a SPA.

Interesting, yeah our trace processing up until tracium has always been cold load of page with raw protocol so it's possible these other scenarios increase the likelihood of these otherwise unlikely events from LH perspective.

@soulgalore if you could get us copies of any of these traces you're talking about I'd love to add some test cases and make sure we can handle them appropriately :)

@soulgalore
Copy link
Contributor Author

I can collect a couple traces today/tomorrow. For SPA I'm injecting a nav start event for it to work. However I think the problem is on my side since I use the Chromedriver. Checking the docs it says "...will continue tracing until Chrome closes.". Without checking the code, I guess when we get the trace, the driver has not stopped the trace and it doesn't wait on Tracing.tracingComplete, so we probably don't get all the events. I'll try to switch to use CDP directly instead, that will minimise the risk (and work) for me :)

@soulgalore
Copy link
Contributor Author

soulgalore commented May 11, 2019

I made a quick POC and it works much better using CDP directly, sorry my fault, I'll try to push it to our test servers later today and then I will know for sure.

@soulgalore
Copy link
Contributor Author

Ok been running these for a while and it works much better, so it was on my side! I've got one use case where Tracium sometimes breaks using a SPA. The use case is like this: Hit a menu item to open up a menu, wait for it to appear, start a trace, click on the new menu item and measure that request and load.

Sometimes (not always) I run into Fatal trace logic error - unexpected end event with an event that looks like this:

{ pid: 4034,
  tid: 4034,
  ts: 17974008372878,
  ph: 'E',
  cat: 'blink,devtools.timeline',
  name: 'HitTest',
  args:
   { endData:
      { x: 683,
        y: 282,
        rectilinear: true,
        move: true,
        nodeId: 95,
        nodeName: 'CANVAS class=\'flot-overlay\'' } },
  tts: 2163034 }

Maybe I can fix that on my side, not sure though. I can post the full trace log if it helps!

@patrickhulce
Copy link
Collaborator

Thanks @soulgalore that's good to hear many of the issues sorted themselves out!

wait for it to appear, start a trace, click on the new menu item and measure that request and load.

Ah I can see this scenario if we start a trace in the middle of some event and we only get the end part of it which isn't the case in normal LH operations.

I think turning this into a flag to drop these invalid cases (maybe stick the events into an array so the user can fail on their own if there were lots of them?) seems like the best way to go.

@patrickhulce
Copy link
Collaborator

Oh and a copy of that full trace would be awesome for test case prepping if you're willing :)

@soulgalore
Copy link
Contributor Author

Sharing is caring: https://gist.github.com/soulgalore/6b5d13a485334afe009eab3866379e5f

@mbalabash
Copy link
Contributor

Hello @patrickhulce @aslushnikov

I added docker support for reproducting Travis CI errors in estimo.

And it seems like strange chrome / puppeteer behavior.
Test results are unstable. Same test for same container may pass few times and fails few times with error: Fatal trace logic error - child cannot end after parent

I run puppeteer like this:

const browser = await puppeteer.launch({ headless, args: ['--no-sandbox', '--disable-setuid-sandbox'] })

All code for work with puppeteer here.

Docker config

I'm still do stuff like this (to analyze only js parse/execution time):

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Estimo Template</title>
  </head>
  <body>
    <h1>Estimo</h1>
    <script src="https://unpkg.com/react@16/umd/react.development.js"></script>
  </body>
</html>

And i got errors only in Docker / CI (on my host machine all works perfectly).
Unstable tests confusing me a little.
Do you have any ideas?

Thanks

@patrickhulce
Copy link
Collaborator

Ah thanks so much for the docker repro @mbalabash! There's been movement on getting tracium publishable from within LH itself, so once that is done I'll work on getting a version published that can better handle these cases.

aslushnikov added a commit that referenced this issue Jun 18, 2019
Trace events are not sorted by default. Proper sorting not only uses
timestamps, but also accounts for trace event duration.

Reference #8
aslushnikov added a commit that referenced this issue Jun 18, 2019
Trace events are not sorted by default. Proper sorting not only uses
timestamps, but also accounts for trace event duration.

Reference #8
aslushnikov added a commit that referenced this issue Jun 18, 2019
Trace events are not sorted by default. Proper sorting not only uses
timestamps, but also accounts for trace event duration.

Reference #8
@aslushnikov
Copy link
Owner

I pushed the 0.2.1 version that sorts out some of the errors here that relate to trace event sorting.
Unfortunately, these don't help with the original cnet trace there - I'll have to look into it separately.

@soulgalore
Copy link
Contributor Author

Thanks @aslushnikov! Right now I'm using a modified version of Tracium, where we don't throw the errors. I've been continuously running Tracium and got a couple of more errors (that happens sometimes, not all the time). I get them all on a single navigation on "normal" pages.

  • Fatal trace logic error - unexpected end event
  • Fatal trace logic error - expected start event, got
  • Fatal trace logic error - child cannot end after parent
  • Invalid task timing data

I can revert and run them on another server just to collect the trace log. But I wonder if you don't also get them in Lighthouse? We run the same trace config and since I skipping enabling trace using Chromedriver and instead using CDP I'm 99.9% sure the trace log that we collect is correct in the sense it's what we expect out of Chrome.

@patrickhulce
Copy link
Collaborator

patrickhulce commented Jun 18, 2019

I've been digging more into these and there's a major root issue we've identified that Chrome's invariants for trace event ordering have drifted slightly from what was expected when the original code was written ~2 years ago. To be fair to our Chromium colleagues, there was a document that notes order could be different in the future, but it was difficult to find these bugs without any examples of how it could be different. Now that we have them it'll just take time to sort through :)

I'm working on some LH PRs that should hopefully help a lot with this class of error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants