Changes to readme #6

Merged
merged 5 commits into from
Jul 24, 2020
68 changes: 53 additions & 15 deletions README.md
@@ -1,15 +1,38 @@
# js-renderer

This is an online puppeteer service to render pages with javascript (js). Mainly useful for web scraping (not using splash). It is a service that executes JS on the page/URL and then returns the resulting DOM. If you run it closer to your users the response times will be much faster.
JavaScript is the bane of a web scraper's life. Scraping is all about extracting data from a web page, but JavaScript is there adding content, hiding blocks and moving the DOM around, so just reading the HTML from the server is not enough. What you ideally want is a way to run all that JavaScript on the page so you can see what's left afterwards. Then you can get down to some serious scraping.

At times while scraping you will come across websites or pages that only render in a browser that executes the loaded JavaScript. If you curl them or use something like [Scrapy](https://scrapy.org/), you just end up with HTML that is not useful.
There are tools out there to do this, but most have their own complications or restrictions that stop them from being used out on the edge. Js-renderer-fly has none of those problems and, with Fly, you can deploy close to your users too.

This project aims to solve that issue with Puppeteer. With Scrapy you can use [Splash](https://github.com/scrapy-plugins/scrapy-splash) but it is Scrapy specific and not easy to configure.
This is an online Puppeteer service that renders pages with JavaScript (JS), which is very useful for web scraping.

## Uses

This project uses Puppeteer to render the page as a full browser and Express to expose Puppeteer as an API.

## Quick Try

A YouTube views puller is included to try out. Run the following to see the views of "Despacito", apparently the most viewed YouTube video:

```
node yt-views.js
```

If you want to try any other video just pass the youtube video URL as a param to the script like below:

```
node yt-views.js "https://www.youtube.com/watch?v=XqZsoesa55w"
```

It should give you an output like below:

```
Pulling views from youtube, please wait...
Baby Shark Dance | Sing and Dance! | @Baby Shark Official | PINKFONG Songs for Children has 6,077,338,169 views
```

It is the second most popular video on youtube.
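
If you want the view count as a number rather than display text, a small helper can do the conversion. This is only a sketch: `parseViewCount` is a hypothetical name that is not part of this repo, and it assumes the text uses comma thousands separators like the sample output above.

```javascript
// Hypothetical helper (not in this repo): turns text such as
// "6,077,338,169 views" into a number.
// Assumes comma thousands separators, as in the sample output.
function parseViewCount(text) {
  // Grab the first run of digits and commas that starts with a digit.
  const match = text.match(/\d[\d,]*/);
  if (!match) {
    return null; // no digits found, e.g. "No views"
  }
  return Number(match[0].replace(/,/g, ''));
}

console.log(parseViewCount('6,077,338,169 views'));
```

Other locales may format the count differently, so treat the result as a best effort.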

## Run locally

If you have Node.js installed you can do:
@@ -26,13 +49,13 @@ npm install
docker-compose up
```

Then hit `http://localhost:8080/api/render?url=https://instagram.com`
Then hit `http://localhost:8080/api/render?url=https://www.youtube.com/watch?v=kJQP7kiw5Fk`

## How to use it

If you want to use it for scraping, use the following URL on Fly.io:

https://js-renderer-fly.fly.dev/api/render?url=https://instagram.com
https://js-renderer-fly.fly.dev/api/render?url=https://www.youtube.com/watch?v=kJQP7kiw5Fk
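
The URLs above work unencoded because the target has a single query parameter. For targets that contain `&` or `#`, it is safer to percent-encode the `url` value. Below is a minimal sketch, assuming the service reads `url` through a standard query-string parser (which decodes it again); `buildRenderUrl` is a hypothetical helper that is not part of this repo.

```javascript
// Hypothetical helper (not in this repo): builds the render endpoint URL
// for a service base URL and a target page.
function buildRenderUrl(serviceBase, targetUrl) {
  const endpoint = new URL('/api/render', serviceBase);
  // searchParams.set percent-encodes the value, so '&' and '#' in the
  // target are not mistaken for parts of the outer URL.
  endpoint.searchParams.set('url', targetUrl);
  return endpoint.toString();
}

console.log(
  buildRenderUrl(
    'https://js-renderer-fly.fly.dev',
    'https://www.youtube.com/watch?v=kJQP7kiw5Fk'
  )
);
```

The same helper works against a local instance by passing `http://localhost:8080` as the base.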

### Styles broken

@@ -44,21 +67,22 @@ Styles and images will look broken but the HTML tags will be there. Happy Web Sc

Fly.io has great [documentation](https://fly.io/docs/) to get started. You can find a quick speed run on how to get your app running closer to your users in this [guide](https://fly.io/docs/speedrun/). Follow the steps below to deploy it on fly.io:

### Prerequisites

1. [Install](https://fly.io/docs/getting-started/installing-flyctl/) the flyctl CLI command
1. Register on Fly with `flyctl auth signup`. If you already have a Fly account, log in with `flyctl auth login`
1. Clone this repo with `git clone git@github.com:geshan/js-renderer-fly.git`

### Steps

1. Clone this repo with `git clone git@github.com:geshan/js-renderer-fly.git` if you have SSH access to GitHub set up; otherwise use `git clone https://github.com/geshan/js-renderer-fly.git`
1. Then run `cd js-renderer-fly`
1. After that execute `flyctl init`
1. Then type in a name like `js-renderer-fly`
1. After that execute `flyctl init --dockerfile` and hit return for an app name to be generated (unless there's a name you really want). I tried with `js-renderer-fly`
1. Then select an org; generally it will be your firstname-lastname
1. After that, select `Dockerfile` as the builder
1. It should create a fly.toml file in the project root (I have not committed it, it is in .gitignore). Below is a screenshot of `flyctl init` output I got:
![Flyctl init output for js-renderer](imgs/01fly-init.png?raw=true)
1. Now run `flyctl deploy` to deploy the app. This will take some time, as it builds the container, pushes it and deploys it. Below is a screenshot after `flyctl deploy` ended:
![Flyctl deploy output for js-renderer](imgs/02fly-deploy.png?raw=true)
1. Then you can try `flyctl info`, which gives the details of the app including the host name. In addition, some more details will be added to your `fly.toml` file, such as the internal port of the container and the service's concurrency and timeouts.
1. It should create a `fly.toml` file in the project root (I have not committed it; it is in `.gitignore`).
1. Now run `flyctl deploy` to deploy the app. This will take some time: it builds the Docker container, pushes it to the Fly Docker container registry and deploys it, reporting the number of instances and their health.
1. Then you can try `flyctl info`, which gives the details of the app including the host name.
1. Following that, you can try `flyctl open` and your app will open in the browser. For me it opened `https://js-renderer-fly.fly.dev`
1. To try your specific URL, suffix it with `/api/render?url=<your-url>`, like `/api/render?url=https://instagram.com`. As Instagram is built with React, a regular curl-like request will not render the final DOM.
1. To try your specific URL, suffix it with `/api/render?url=<your-url>`, like `/api/render?url=https://www.youtube.com/watch?v=kJQP7kiw5Fk`, as YouTube pages will not render the final DOM with a regular curl.
1. Enjoy!
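
Once the app is deployed, a quick scripted check can confirm that the render endpoint answers. The sketch below is an illustration only: `smokeCheck` is a hypothetical function that is not part of this repo, it assumes Node 18+ for the global `fetch`, and it assumes the service returns a full HTML document containing an `<html` tag.

```javascript
// Hypothetical smoke check (not in this repo). fetchImpl defaults to the
// global fetch in Node 18+ and can be replaced with a stub for offline use.
async function smokeCheck(appBase, targetUrl, fetchImpl = fetch) {
  const endpoint = new URL('/api/render', appBase);
  endpoint.searchParams.set('url', targetUrl);

  const response = await fetchImpl(endpoint.toString());
  const html = await response.text();

  return {
    // Assumption: a successful render returns a document with an <html tag.
    ok: response.ok && html.includes('<html'),
    status: response.status,
    length: html.length, // number of characters in the returned markup
  };
}

// Example usage against your own app (performs a real network request):
// smokeCheck('https://js-renderer-fly.fly.dev',
//   'https://www.youtube.com/watch?v=kJQP7kiw5Fk').then(console.log);
```

Passing the fetch function as a parameter keeps the check easy to exercise without hitting the network.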

### Fly default resources
@@ -69,3 +93,17 @@ So I wanted to check how much resources were allocated to this app on fly by def
1. `flyctl scale vm` - showed me that micro-2x has 0.25 CPU cores with 512 MB of memory.

If you want to increase CPU/memory or run more instances in a particular region please refer to the official fly docs on [scaling](https://fly.io/docs/scaling/).

### More fly commands

You can suspend your service with `flyctl suspend`; it will pause your service until you resume it. If you try `flyctl status` after suspending, it will not show any instances running. To get the instances back, execute `flyctl resume`.

#### Fly on 3 continents

Now your service is running well in one data center. For me it was iad, which is `Ashburn, Virginia (US)`. Now let's add some more:

1. To see the regions available, run `flyctl platform regions`. I could see regions all over the world from Oregon to Sydney.
1. Let's add an instance in Sydney, Australia. To do this run `flyctl regions add syd`. Yes, it is that easy.
1. Now check `flyctl status` and you will see an instance running in Sydney.
1. Let's add one more in Europe, in Amsterdam, with `flyctl regions add ams`. Great, so we are mostly covered with the app running on 3 continents.
1. Of course you can run `flyctl status` again to see your app shining on 3 continents.
142 changes: 140 additions & 2 deletions package-lock.json


1 change: 1 addition & 0 deletions package.json
@@ -23,6 +23,7 @@
  },
  "homepage": "https://github.com/geshan/js-renderer#readme",
  "dependencies": {
    "@geshan/axrio": "^1.0.1",
    "express": "^4.17.1",
    "puppeteer-core": "^3.3.0"
  }
15 changes: 15 additions & 0 deletions yt-views.js
@@ -0,0 +1,15 @@
const axrio = require('@geshan/axrio');

(async function run() {
  console.log(`Pulling views from youtube, please wait...`);
  try {
    // Use the URL passed on the command line, or fall back to "Despacito".
    const ytVideoUrl = process.argv[2] || 'https://www.youtube.com/watch?v=kJQP7kiw5Fk';
    // Fetch the JS-rendered page through the render service.
    const $ = await axrio.getPage(`https://js-renderer-fly.fly.dev/api/render?url=${ytVideoUrl}`, 12000);
    const title = $('h1.title>yt-formatted-string').text();
    const views = $('span.view-count').text();
    console.log(`${title} has ${views}`);
  } catch (e) {
    console.log(`Error while fetching views: `, e);
  }
  process.exit(0);
})();