From 8a41fbd8ee2d794fb92b563e679a639e3aa53d88 Mon Sep 17 00:00:00 2001
From: Vratislav Bartonicek
Date: Tue, 1 Oct 2019 14:05:39 +0200
Subject: [PATCH 1/3] Scraping category - content

---
 docs/scraping/cheerio_scraper.md          | 364 ++++++++++++++-
 docs/scraping/index.md                    |  38 +-
 docs/scraping/introduction.md             | 294 +++++++++++++
 docs/scraping/legacy_phantomjs_crawler.md |   6 +-
 docs/scraping/puppeteer_scraper.md        | 511 +++++++++++++++++++++-
 docs/scraping/web_scraper.md              | 401 ++++++++++++++++-
 6 files changed, 1594 insertions(+), 20 deletions(-)
 create mode 100644 docs/scraping/introduction.md

diff --git a/docs/scraping/cheerio_scraper.md b/docs/scraping/cheerio_scraper.md
index 301d55f291..1125be6e2b 100644
--- a/docs/scraping/cheerio_scraper.md
+++ b/docs/scraping/cheerio_scraper.md
@@ -2,12 +2,366 @@ title: Cheerio Scraper
 ---

-## [](#cheerio-scraper)Cheerio Scraper
+# [](#scraping-with-cheerio-scraper)Scraping with Cheerio Scraper

-Cheerio Scraper is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages and then parsing and inspecting the HTML using the Cheerio library. It's blazing fast.
+This scraping tutorial goes into the nitty-gritty details of extracting data from `https://apify.com/store` using `apify/cheerio-scraper`. If you arrived here from the [Getting started with Apify scrapers](https://apify.com/docs/scraping/tutorial/introduction) tutorial, great! You are ready to continue where we left off. If you haven't seen Getting started yet, check it out. It will help you learn about Apify and scraping in general and set you up for this tutorial, which builds on the topics and code examples discussed there.

-Cheerio is a server-side version of the popular jQuery library that does not run in the browser, but instead constructs a DOM out of a HTML string and then provides the user with API to work with that DOM.
+## [](#getting-to-know-our-tools)Getting to know our tools

-Cheerio Scraper is ideal for scraping websites that do not rely on client-side JavaScript to serve their content. It can be as much as 20 times faster than using a full browser solution such as Puppeteer.
+In the [Getting started with Apify scrapers](https://apify.com/docs/scraping/tutorial/introduction) tutorial, we confirmed that the scraper works as expected, so now it's time to add more data to the results.

-[Visit the Cheerio Scraper tutorial to get started!](./scraping/tutorial/cheerio-scraper)
+To do that, we'll be using the [`Cheerio`](https://github.com/cheeriojs/cheerio) library. This may not sound familiar, so let me try again. Does the [`jQuery` library](https://jquery.com/) ring a bell? If it does, you're in luck, because `Cheerio` is just `jQuery` that doesn't need an actual browser to run. Everything else is the same. All the functions you already know are there and even the familiar `$` is used.
+
+> To learn more about `Cheerio`, see [the docs on their GitHub page](https://github.com/cheeriojs/cheerio).
+
+Now that's out of the way, let's open one of the actor detail pages in the Store, for example the [`apify/web-scraper`](https://apify.com/apify/web-scraper) page, and use our DevTools-Fu to scrape some data.
+
+> If you're wondering why we're using `apify/web-scraper` as an example instead of `cheerio-scraper`, it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!
+
+## [](#quick-recap)Quick recap
+
+Before we start, let's do a quick recap of the data we chose to scrape:
+
+1. **URL** - The URL that goes directly to the actor's detail page.
+2. **Unique identifier** - Such as `apify/web-scraper`.
+3. **Title** - The title visible in the actor's detail page.
+4. **Description** - The actor's description.
+5. **Last run date** - When the actor was last run.
+6. **Number of runs** - How many times the actor was run.
+
+![data to scrape](https://apifyusercontent.com/7274765d35b9a7c781e5bcc705a3dbdcf3c308ec/68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6170696679746563682f6163746f722d736372617065722f6d61737465722f646f63732f6275696c642f2e2e2f696d672f7363726170696e672d70726163746963652e6a7067 "Overview of data to be scraped.")
+
+We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](https://apify.com/docs/scraping/tutorial/introduction) tutorial, so let's get to the next one on the list: the title.
+
+### [](#title)Title
+
+![actor title](https://apifyusercontent.com/5274e02a1c45ed96a7d8c0147ac6e3d99f883ed0/68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6170696679746563682f6163746f722d736372617065722f6d61737465722f646f63732f6275696c642f2e2e2f696d672f7469746c652e6a7067 "Finding actor title in DevTools.")
+
+By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other one? Yes, there is! There is a `<header>` element that we can use to select only the heading we're interested in.
+
+> Remember that you can press CTRL+F (CMD+F) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler code all the time.
+
+To get the title we just need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one.
+
+    // Using Cheerio.
+    return {
+        title: $('header h1').text(),
+    };
+
+### [](#description)Description
+
+Getting the actor's description is a little more involved, but still pretty straightforward. We can't simply search for a `<p>` tag, because there are a lot of them on the page. We need to narrow our search down a little. Using the DevTools we find that the actor description is nested within the `<header>` element too, same as the title. Sadly, we're still left with two `<p>` tags. To finally select only the description, we choose the `<p>` tag that has a `class` that starts with `Text__Paragraph`.
+
+![actor description selector](https://apifyusercontent.com/28dee1e51c6ac3e8ec67f0eb953b4a71c775f217/68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6170696679746563682f6163746f722d736372617065722f6d61737465722f646f63732f6275696c642f2e2e2f696d672f6465736372697074696f6e2e6a7067 "Finding actor description in DevTools.")
+
+    return {
+        title: $('header h1').text(),
+        description: $('header p[class^=Text__Paragraph]').text(),
+    };
+
+### [](#last-run-date)Last run date
+
+The DevTools tell us that the `lastRunDate` can be found in the second of the two `<time>` elements on the page.
+
+The `ul.stats li:nth-of-type(3)` selector used for the run count looks for a `<ul>` element with the `stats` class, and within that element we're looking for the third `<li>` element. We grab its text, but we're only interested in the number of runs. So we parse the number out using a regular expression, but its type is still a `string`, so we finally convert the result to a `number` by wrapping it with a `Number()` call.
+
+### [](#wrapping-it-up)Wrapping it up
+
+And there we have it! All the data we needed in a single object. For the sake of completeness, let's add the properties we parsed from the URL earlier and we're good to go.
+
+    const { url } = request;
+
+    // ...
+
+    const uniqueIdentifier = url.split('/').slice(-2).join('/');
+
+    return {
+        url,
+        uniqueIdentifier,
+        title: $('header h1').text(),
+        description: $('header p[class^=Text__Paragraph]').text(),
+        lastRunDate: new Date(
+            Number(
+                $('time')
+                    .eq(1)
+                    .attr('datetime'),
+            ),
+        ),
+        runCount: Number(
+            $('ul.stats li:nth-of-type(3)')
+                .text()
+                .match(/\d+/)[0],
+        ),
+    };
+
+All we need to do now is add this to our `pageFunction`:
+
+    async function pageFunction(context) {
+        const { request, log, skipLinks, $ } = context; // $ is Cheerio
+        if (request.userData.label === 'START') {
+            log.info('Store opened!');
+            // Do some stuff later.
+        }
+        if (request.userData.label === 'DETAIL') {
+            const { url } = request;
+            log.info(`Scraping ${url}`);
+            await skipLinks();
+
+            // Do some scraping.
+            const uniqueIdentifier = url.split('/').slice(-2).join('/');
+
+            return {
+                url,
+                uniqueIdentifier,
+                title: $('header h1').text(),
+                description: $('header p[class^=Text__Paragraph]').text(),
+                lastRunDate: new Date(
+                    Number(
+                        $('time')
+                            .eq(1)
+                            .attr('datetime'),
+                    ),
+                ),
+                runCount: Number(
+                    $('ul.stats li:nth-of-type(3)')
+                        .text()
+                        .match(/\d+/)[0],
+                ),
+            };
+        }
+    }
+
+### [](#test-run-3)Test run 3
+
+As always, try hitting that **Save & Run** button and visit the Dataset preview of clean items. You should see a nice table of all the attributes correctly scraped. You nailed it!
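The value transformations used in the `pageFunction` can also be tried out on plain strings, without a page or Cheerio. The following is only a minimal sketch: `parseActorDetail`, its parameter names and the sample input are hypothetical and not part of the scraper. It assumes that the `datetime` attribute holds a millisecond timestamp as a string and that the stats text contains the run count as a plain run of digits.

```javascript
// Hypothetical helper (not part of apify/cheerio-scraper): applies the same
// transformations as the pageFunction to values already read from the page.
function parseActorDetail({ url, datetimeAttr, statsText }) {
    // The last two path segments form the identifier, e.g. "apify/web-scraper".
    const uniqueIdentifier = url.split('/').slice(-2).join('/');

    // Assumption: the attribute is a millisecond timestamp stored as a string,
    // so it is cast to a number before being passed to the Date constructor.
    const lastRunDate = new Date(Number(datetimeAttr));

    // match() returns null when the text has no digits, so guard before indexing.
    const digits = statsText.match(/\d+/);
    const runCount = digits ? Number(digits[0]) : null;

    return { url, uniqueIdentifier, lastRunDate, runCount };
}

// Made-up sample values, for illustration only.
const sample = parseActorDetail({
    url: 'https://apify.com/apify/web-scraper',
    datetimeAttr: '1569931539000',
    statsText: 'Run 1234 times',
});
console.log(sample);
```

Unlike the inline `.match(/\d+/)[0]` in the `pageFunction`, this sketch returns `null` instead of throwing when no digits are found, which may be easier to debug if the page layout changes.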
+
+## [](#pagination)Pagination
+
+Pagination is just a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the actors, just the first page of results. That's because to load the rest of the actors, one needs to click the orange **Show more** button at the very bottom of the list. This is pagination.
+
+> This is typical JavaScript pagination, sometimes called infinite scroll. Other pages may use links that take you to the next page. If you encounter those, just make a Pseudo URL for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing.
+
+If you paid close attention, you may now see a problem. How do we click a button in the page when we're working with Cheerio? We don't have a browser to do it and we only have the HTML of the page to work with. So the simple answer is that we can't click a button. Does that mean that we cannot get the data at all? Usually not, but it requires some clever DevTools-Fu.
+
+### [](#analyzing-the-page)Analyzing the page
+
+While with `apify/web-scraper` and `apify/puppeteer-scraper` we could get away with simply clicking a button, with `apify/cheerio-scraper` we need to dig a little deeper into the page's architecture. For this, we will use the Network tab of the Chrome DevTools.
+
+> It's a very powerful tool with a lot of features, so if you're not familiar with it, please see this tutorial: [https://developers.google.com/web/tools/chrome-devtools/network/](https://developers.google.com/web/tools/chrome-devtools/network/), which explains everything much better than we ever could.
+
+We want to know what happens when we click the **Show more** button, so we open the DevTools Network tab and clear it. Then we click the **Show more** button and wait for incoming requests to appear in the list.
+
+![inspect-network](https://apifyusercontent.com/2b51728bb8363c8ac71d8bab191c938fa3a5ddc9/68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6170696679746563682f6163746f722d736372617065722f6d61737465722f646f63732f6275696c642f2e2e2f696d672f696e73706563742d6e6574776f726b2e6a7067 "Inspecting network in DevTools.")
+
+Now, this is interesting. It seems that we've only received two images after clicking the button and no additional data. This means that the data about actors must already be available in the page and the Show more button only displays it. This is good news.
+
+### [](#finding-the-actors)Finding the actors
+
+Now that we know the information we seek is already in the page, we just need to find it. The first actor in the store is `apify/web-scraper` so let's try using the search tool in the Elements tab to find some reference to it. The first few hits do not provide any interesting information, but in the end, we find our goldmine. There is a `