Use with headless Chrome #2

Closed
derekperkins opened this issue Jan 25, 2017 · 16 comments

Comments

@derekperkins

derekperkins commented Jan 25, 2017

I'm most excited to see this in conjunction with headless Chrome, which is currently in Canary. I don't see any reason it should work differently, since DevTools is one of the two ways to integrate, but I wanted to at least put it on the table in case it affects development of the library.
https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md

More Resources:
https://bugs.chromium.org/p/chromium/issues/detail?id=546953
https://news.ycombinator.com/item?id=11839303

@kenshaw
Member

kenshaw commented Jan 25, 2017

Yes, my plan is to integrate this with headless Chrome (which is also one of the reasons this was written to begin with). The main barrier at the moment is that, at least as far as I know, you still need to build Chrome manually to get headless support. That is not a trivial task, and it isn't something that even the most senior developers can do quickly or easily. My plan is to integrate/embed headless Chrome as a separate Go package, so that you don't even need the Chrome dependency! Instead, you just do go get github.com/knq/headless (or something) and you have an embedded Chrome instance exposed via a Go package for OSX/Windows/Linux. It's a big vision, and it's not going to happen for quite some time.

@derekperkins
Author

derekperkins commented Jan 25, 2017

@kenshaw What is your end goal and timeline with the project? We're currently doing about 1M renders / day with PhantomJS, and we're looking to do 100M / day by the end of the year. The rest of our code base is in Go, so this would eliminate a lot of moving parts for us, while adding the value of having real Chrome do the work. We're probably 6-12 weeks out from really moving on this, at which point we'd be happy to contribute as much as we can to the project.

My thought was to have Chrome running inside a Docker container, so it wouldn't necessarily have to be embedded as a Go package, while still being very accessible.

You might also look at talking with the http://sitespeed.io folks, as this could also remove a lot of overhead on their side.

@kenshaw
Member

kenshaw commented Jan 26, 2017

The end goal for the project is simply to provide a cleaner, faster, easier way to drive Chrome with Go: namely Desktop Chrome and Mobile Android / Safari. A secondary goal is support for other browsers (Edge, Firefox, etc) that could be made compatible with a shim, or by modifying the protocol slightly. For reference on compatibility, you can check here.

Part and parcel of all this, I have some really cool / interesting ideas for chromedp, namely embedding headless as a separate Go package, so that in production there is no additional burden of xvfb, xwindows, etc. This would cut a couple of gigabytes of redundant dependencies out of deployment images and leave just a single fat Go binary (likely under 100 megabytes based on what I've seen). For the record, connecting to a Docker image with Chrome is already possible. The protocol that chromedp uses is just a websocket connection. You can configure the runner to connect directly to Chrome running in a Docker image, and it won't manage any of the process stuff.

I can see a lot of other projects evolving on top of this (web page testing frameworks, automated page profilers, advanced search engine, etc), but simply getting a solid, fast implementation out was the first major hurdle to cross. Now that it's released in public, I'm hoping the community at large can help with tracking down major issues/problems -- in that way, we all benefit!

@ghost

ghost commented Jan 26, 2017

@kenshaw
I've been thinking of writing a web scraping app to download manga en masse from all sorts of sources, for offline access in a searchable database (mainly for personal use). 20zinnm referred me to this project recently. I've been looking at libs like goquery, but I'm willing to give this a try to help find bugs. I'm new to Go, but I'm really liking the language so far.

When I run chromedp, does it open a new Chrome process in front of me or in the background? If it does, can I specify that it should not run in the background? Finally, can I use goroutines to have chromedp visit multiple websites at the same time and extract information?

@kenshaw
Member

kenshaw commented Jan 26, 2017

Yes, you can use chromedp like that. If you just do chromedp.New() and don't pass it a chromedp/runner.Runner instance, then it will launch a new, isolated chrome instance for you, using the default options. You could use chromedp to manage multiple tabs at once, but I would not recommend that, as there are issues/problems when chrome does rendering "off screen" (basically the tab is suspended).

chromedp is really for driving higher level functions of Chrome -- if you need to use Go to simply scrape data en masse, it would likely be better to simply use the standard Go net/http.Request and net/http.Client.

@ghost

ghost commented Jan 26, 2017

I see, I appreciate the insight. Thanks.

@clanstyles

@derekperkins I'm very interested in the same thing. We aren't taking quite as many screenshots, but it's enough. We currently use Go WebKit bindings and have it running. Like you said, we're using Docker to keep a maintainable headless Chrome installation.

@derekperkins
Author

@clanstyles Are you using https://github.com/sourcegraph/go-webkit2? What has your experience been with it and what makes you want to use this lib + Chrome?

@clanstyles

clanstyles commented Jan 27, 2017

@derekperkins it works fine, but you need to build out your own features. The largest issue is the screen resolution and making sure the GTK window stays that size. Something was buggy with that. My solution works, but this has more features and looks a lot nicer.

I run the mentioned service in a docker container. It's a pretty expensive task.

Ah, also, there are some really annoying issues with the events. webkit2.LoadFinished can be called, but the page isn't always done rendering. Then on the flip side, if a page has a persistent async ajax call, it won't ever "finish". I've had to add special timeouts to prevent this type of stuff. On a large machine with 12 cores, I can only run 25 instances at a time. Even then, I've run into random WebKit binding crashes. I haven't been able to identify what solves them, and debugging bindings in Go isn't friendly. After speaking with people, it seems you actually have to place your debug code in the C++ code and just try to trace through it.

@derekperkins
Author

Thanks for the detail. I'm very interested to see how your benchmarks from that compare to headless Chrome.

@clanstyles

clanstyles commented Jan 27, 2017

I'll try to get something working shortly, I have a few other things on my plate in front of me.

The largest issue with all of these seems to be that they're not thread safe. You end up waiting for "idle time" to be able to create another action.

https://github.com/yukinying/chrome-headless-browser-docker is an example of a chrome headless browser

@kenshaw
Member

kenshaw commented Jan 29, 2017

BTW, I added an example with a short writeup on launching a headless Docker image. However, the only Docker image I could find that claimed to support Chrome didn't actually seem to work (page navigation didn't happen). I will make a point to open/maintain a Docker image for headless Chrome, however. I have no specific ETA on that at this time.

@kenshaw
Member

kenshaw commented Feb 3, 2017

kenshaw closed this as completed Feb 3, 2017

@derekperkins
Author

Can't wait to try it out!

FYI, your GitHub link is broken. It should be https://github.com/knq/chrome-headless

@kenshaw
Member

kenshaw commented Feb 3, 2017

Oh ... I had meant to put it up as docker-chrome-headless. Oh well. People will figure it out ;)

@derekperkins
Author

If you change the name, GitHub will set up a 301 redirect for you.

riptl pushed a commit to riptl/chromedp-custom that referenced this issue Apr 9, 2019
expose browser events via a hook function