chromedp does not load whole page if the page contains too many javascript-generated elements #304

Closed
dynahcatq opened this issue Apr 16, 2019 · 10 comments

Comments

@dynahcatq

dynahcatq commented Apr 16, 2019

What versions are you running?

$ git rev-parse HEAD
ac47d6ba0e04cf60a7c5375fd139a743fe443fe6
$ google chrome version:
Version 73.0.3683.86 (Official Build) (64-bit)
$ go version
go version go1.11.4 darwin/amd64

What did you do?

When navigating to a page that generates a lot of elements using JavaScript, it is very likely (see further analysis) that chromedp cannot see the generated elements. Here is some example code:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var result string
	err := chromedp.Run(ctx,
		chromedp.Navigate(`https://streamelements.com/shanksy/store`),
		chromedp.WaitReady(`md-card`, chromedp.ByQuery),
		chromedp.Text(`md-card`, &result, chromedp.ByQuery),
	)
	if err != nil {
		panic(err)
	}

	fmt.Println(result)
}

The code simply navigates to https://streamelements.com/shanksy/store, waits for any md-card element to be ready, then gets its text.

What did you expect to see?

Helios Prime SwitchdescriptionMR8 Helios Prime Switch MR8 Helios Prime Switchshopping_basket 4 items leftmonetization_on20000 chat!redeem helios_prime_switchRedeem item (The first item currently for sale on the website)

What did you see instead?

panic: context deadline exceeded

Further analysis

I have been doing more tests on this, and the results are listed below:

  1. The code sometimes succeeds without timing out: sometimes it took me 1 try, sometimes 28, sometimes 183. I tested it on my Mac with this bash loop: i=0 && while ! go run fetch.go; do sleep 1 && (( ++i )) && echo $i; done
  2. Increasing the context timeout to 100s, 200s, 300s does not solve the problem.
  3. If I change the URL to another StreamElements store, say https://streamelements.com/lacari/store, the Go code always retrieves the text successfully. This seems to be because that store has fewer items than the one in my Go code (49 compared to 280).

By the way, I do not have Chromium installed, so I guess chromedp was driving my Google Chrome browser.

@mvdan
Contributor

mvdan commented Apr 17, 2019

Is the element you're looking for inside an iframe, by any chance? If so, this is a duplicate of #72, as we lack proper support for iframes currently.

@xpr0ger

xpr0ger commented Apr 17, 2019

I have the same issue. And no, there is not a single iframe on the page.

@mvdan
Contributor

mvdan commented Apr 17, 2019

You have the same issue with the same page? If not, please post a program to reproduce your problem.

@xpr0ger

xpr0ger commented Apr 17, 2019

go.mod

github.com/chromedp/cdproto v0.0.0-20190407221054-e2ccc0cc2d77
github.com/chromedp/chromedp v0.1.4-0.20190408165214-b481eeac5108
go version go1.12.4 windows/amd64
chrome version 73.0.3683.103

First of all, I ran @dynahcatq's code snippet and got the same result.

My code snippet:

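package main

import (
	"context"
	"log"
	"time"

	"github.com/chromedp/cdproto/cdp"
	"github.com/chromedp/chromedp"
)

func main() {
	for i := 0; i < 10; i++ {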
		str, err := getLink()
		if err == nil {
			log.Println(str)
			break
		}

		log.Println(err)
	}
}

func getLink() (string, error) {
	ctx, cancel := chromedp.NewContext(
		context.Background(),
		chromedp.WithLogf(log.Printf),
	)
	defer cancel()

	c, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var links []*cdp.Node
	err := chromedp.Run(c,
		chromedp.Navigate(`https://www.fragrantica.ru/search/?query=Brocard%20Vintage%20Rose%20Water%20Vintage`),
		// The button label below is Russian for "Show more results".
		chromedp.WaitVisible(`//button[text()[contains(., 'Показать больше результатов')]]`),
		chromedp.Nodes(`//div[contains(@class, 'card-section')]/*/a[1]`, &links),
	)

	if err != nil {
		return "", err
	}

	str := ""
	for _, link := range links {
		str = link.AttributeValue("href")
		break
	}

	return str, nil
}

Output

2019/04/17 21:45:14 context deadline exceeded
2019/04/17 21:45:44 context deadline exceeded
2019/04/17 21:46:15 context deadline exceeded
2019/04/17 21:46:45 context deadline exceeded
2019/04/17 21:47:15 context deadline exceeded
2019/04/17 21:47:17 https://www.fragrantica.ru/perfume/Brocard/Rose-Water-Vintage-35501.html

If I decrease the timeout to 10 seconds, I get the correct result on the second iteration.

@dynahcatq
Author

@mvdan I believe that the element (md-card tag) is not inside an iframe based on three observations:

  1. document.getElementsByTagName("iframe")[0].contentWindow.document.getElementsByTagName("md-card") gives an empty HTMLCollection.
  2. The same code I provided sometimes works, while most of the time it hits the timeout. If it were an iframe issue, the code should always hit the timeout.
  3. As I described, the original code, which often hits the timeout, navigates to https://streamelements.com/shanksy/store; when I change the URL to https://streamelements.com/lacari/store, it works fine every time, which means chromedp can always see the md-card tag when navigating to the latter URL. Since both URLs are on https://streamelements.com (they are just different stores), I believe they generate elements in the same way, which eliminates the iframe concern.

@mvdan mvdan removed the needs info label Apr 20, 2019
@mvdan
Contributor

mvdan commented Apr 20, 2019

Thanks all for the details. I'll have a proper look soon.

@muzykantov

I have the same problem too.

@mvdan
Contributor

mvdan commented Apr 22, 2019

I had a look, and this is definitely one weird page. It navigates to five frames in total, which seems to trigger some sort of race which results in chromedp never loading the nodes you're after.

I'm not a JS expert, and it doesn't help that the page is very slow to load and has lots of obfuscated JS code. If anyone can reproduce this with a single short HTML file, that would be very appreciated and speed up my debugging.

@mvdan
Contributor

mvdan commented Apr 22, 2019

Ok, after an exhaustive amount of debugging, I finally figured out what the problem is. It is related to child frames (like iframes) after all, but not in the way I initially suspected.

The issue is that the page loads a number of child frames before the top frame has finished loading its nodes. chromedp was incorrectly seeing the child frames as the new top-level frames, so the original frame was left behind and we never saw its interesting nodes.
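To illustrate the kind of mix-up described above, here is a toy model of frame tracking. It is not chromedp's actual code and not the change made in the fix; the `frame` type and both tracking functions are hypothetical, and the only assumption is that each navigated frame reports an ID plus a parent ID that is empty for the top-level frame.

```go
package main

import "fmt"

// frame is a simplified stand-in for the data reported when a frame
// navigates; parentID is empty for the top-level frame.
type frame struct {
	id       string
	parentID string
}

// trackNaive mimics the bug: every navigated frame replaces the current
// one, so a child frame that loads later displaces the top frame.
func trackNaive(events []frame) string {
	cur := ""
	for _, f := range events {
		cur = f.id
	}
	return cur
}

// trackTopOnly only adopts frames that have no parent, so child frames
// never displace the top-level frame.
func trackTopOnly(events []frame) string {
	cur := ""
	for _, f := range events {
		if f.parentID == "" {
			cur = f.id
		}
	}
	return cur
}

func main() {
	// The top frame navigates first, then two child frames load while the
	// top frame is still generating its nodes.
	events := []frame{
		{id: "top"},
		{id: "child-1", parentID: "top"},
		{id: "child-2", parentID: "top"},
	}
	fmt.Println(trackNaive(events))   // prints "child-2"
	fmt.Println(trackTopOnly(events)) // prints "top"
}
```

With the naive tracker, any later query would be directed at `child-2`, which never contains the nodes generated in the top frame. That is consistent with the timeouts reported in this thread.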

The fix is simple, all the tests pass (including a regression test for this case), so I see no reason to not include this into v0.2.0. Thanks all for the patience.

Also, to clarify - this fix is somewhat related to #72, in that it's a required step to have iframes work well, but it's not a full fix for that issue.

@mvdan mvdan closed this as completed in dd67f50 Apr 22, 2019
@dynahcatq
Author

Ok, after an exhaustive amount of debugging

Can confirm. I spent days debugging and only got as far as the race, but had no idea what was causing it. Really appreciate your effort and the fix, @mvdan.
