Showing 11 changed files with 1,084 additions and 2 deletions.
@@ -1,4 +1,4 @@
-# Crawly into
+# Crawly intro
 ---

 Crawly is an application framework for crawling web sites and
# Basic concepts
---

## Spiders

Spiders are modules which define how a certain site (or a group of
sites) will be scraped, including how to perform the crawl
(i.e. follow links) and how to extract structured data from their
pages (i.e. scraping items). In other words, spiders are the place
where you define the custom behaviour for crawling and parsing pages
for a particular site.
For spiders, the scraping cycle goes through something like this:

You start by generating the initial requests to crawl the first URLs,
and specify a callback function to be called with the response
downloaded from those requests.

In the callback function, you parse the response (web page) and return
a `%Crawly.ParsedItem{}` struct. This struct should contain new
requests to follow and items to be stored.

In the callback function, you parse the page contents, typically using
Floki (but you can also use any other library you prefer), and generate
items with the parsed data, for example as shown in the sketch below.
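For instance, a rough sketch of pulling link targets out of a page with
Floki; the module name and CSS selector are placeholders, not from this
commit:

```elixir
defmodule LinkParser do
  # Placeholder helper showing typical Floki usage inside a callback.
  def extract_links(response) do
    response.body
    |> Floki.find("a.next")      # select anchors by a CSS class
    |> Floki.attribute("href")   # keep only their href attributes
    |> Enum.uniq()               # drop duplicate URLs
  end
end
```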
Spiders are executed in the context of Crawly.Worker processes, and
you can control the number of concurrent workers via the
`concurrent_requests_per_domain` setting.

All requests are processed sequentially and are pre-processed by
middlewares.

All items are processed sequentially by item pipelines.
### Behaviour functions

In order to make a working web crawler, all of the behaviour callbacks
need to be implemented.

`init()` - a part of the Crawly.Spider behaviour. This function should
return a keyword list which contains a `start_urls` entry, a list of
URLs defining the starting requests made by Crawly.

`base_url()` - defines the base URL of the given spider. This function
is used in order to filter out all requests which go outside of the
crawled website.

`parse_item(response)` - a function which defines how a given response
is translated into the `Crawly.ParsedItem` structure. At a high level,
this function defines the extraction rules for both items and requests.
A minimal skeleton implementing these callbacks is sketched below.
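A minimal sketch of a spider implementing the three callbacks; the module
name, URL, and CSS selector are placeholders, not part of Crawly:

```elixir
defmodule MySpider do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://example.com/index.html"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Extract a single item from the page (placeholder selector).
    title =
      response.body
      |> Floki.find("h1")
      |> Floki.text()

    # Return the item and no follow-up requests.
    %Crawly.ParsedItem{
      items: [%{title: title, url: response.request_url}],
      requests: []
    }
  end
end
```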
## Requests and Responses

Crawly uses Request and Response objects for crawling web sites.

Typically, Request objects are generated in the spiders and passed
across the system until they reach the Crawly.Worker process, which
executes the request and returns a Response object which travels back
to the spider that issued the request. Request objects are modified
by the selected middlewares before reaching the worker.

The request is defined as the following structure:
```elixir
@type t :: %Crawly.Request{
  url: binary(),
  headers: [header()],
  prev_response: %{},
  options: [option()]
}

@type header() :: {key(), value()}
```
Where:
1. `url` - the URL of the request
2. `headers` - the HTTP headers which are going to be used with the
   given request
3. `options` - request options (for example, whether to follow redirects)
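In practice you rarely build this struct by hand; as the example spider
later in this commit shows, a request can be created from a URL with
`Crawly.Utils.request_from_url/1`. A small sketch (the URL is a
placeholder):

```elixir
# Build a Crawly.Request from a plain URL (placeholder URL).
request = Crawly.Utils.request_from_url("https://example.com/blog.html")

# The resulting struct carries the URL plus default headers and options.
IO.inspect(request.url)
```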
Crawly uses the HTTPoison library to perform the requests, but there
are plans to extend the support with other pluggable backends, like
Selenium and others.

Responses are defined in the same way as HTTPoison responses. See more
details here: https://hexdocs.pm/httpoison/HTTPoison.Response.html#content
## Parsed Item

ParsedItem is a structure which is filled by the `parse_item/1`
callback of the Spider. The structure is defined in the following way:

```elixir
@type item() :: %{}
@type t :: %__MODULE__{
  items: [item()],
  requests: [Crawly.Request.t()]
}
```
The parsed item is processed by the Crawly.Worker process, which
sends all requests to the `Crawly.RequestsStorage` process,
responsible for pre-processing requests and storing them for future
execution. All items are sent to the `Crawly.DataStorage` process,
which is responsible for pre-processing items and storing them
on disk.

For now, only one storage backend is supported (writing to disk), but
in the future Crawly will also support backends like Amazon S3, SQL
databases and others.
## Request Middlewares

Crawly uses the concept of pipelines when it comes to processing the
elements sent through the system. In this section we cover the topic
of request middlewares - a powerful tool which allows you to modify a
request before sending it to the target website. In most cases,
spider developers will want to modify request headers, which makes
requests look more natural to the crawled websites.

At this point Crawly includes the following request middlewares:
1. `Crawly.Middlewares.DomainFilter` - this middleware disables
   scheduling for all requests leading outside of the crawled
   site. The middleware uses `base_url()` defined in the
   `Crawly.Spider` behaviour in order to do its job.
2. `Crawly.Middlewares.RobotsTxt` - this middleware ensures that
   Crawly respects the robots.txt defined by the target website.
3. `Crawly.Middlewares.UniqueRequest` - this middleware ensures that
   Crawly does not schedule the same URL (request) multiple times.
4. `Crawly.Middlewares.UserAgent` - this middleware is used to set a
   User-Agent HTTP header. It allows user agents to be rotated if the
   setting is defined as a list.

The list of request middlewares used with a given project is defined
in the project settings, as sketched below.
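A minimal sketch of how such a middleware list could be declared in the
project configuration; the exact config layout (the `:middlewares` key
under the `:crawly` application) is an assumption, not shown in this
commit:

```elixir
# config/config.exs - assumed configuration layout, for illustration only.
use Mix.Config

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    Crawly.Middlewares.UserAgent
  ]
```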
## Item pipelines

Crawly uses the concept of pipelines when it comes to processing the
elements sent through the system. In this section we cover the topic
of item pipelines - a tool which is used to pre-process items before
storing them in the storage.

At this point Crawly includes the following item pipelines:
1. `Crawly.Pipelines.Validate` - validates that a given item has all
   the required fields. All items which don't have all required fields
   are dropped.
2. `Crawly.Pipelines.DuplicatesFilter` - filters out items which are
   already stored in the system.
3. `Crawly.Pipelines.JSONEncoder` - converts items into JSON format.
4. `Crawly.Pipelines.CSVEncoder` - converts items into CSV format.
5. `Crawly.Pipelines.WriteToFile` - writes information to a given file.

The list of item pipelines used with a given project is defined in the
project settings, as sketched below.
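As with middlewares, the pipeline list would typically live in the same
project configuration; the `:pipelines` key is likewise an assumption:

```elixir
# config/config.exs - assumed configuration layout, for illustration only.
use Mix.Config

config :crawly,
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.JSONEncoder,
    Crawly.Pipelines.WriteToFile
  ]
```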
# Ethical aspects of crawling
---

It's important to be polite when doing web crawling. You should avoid
cases where your spiders harm the scraped websites. As mentioned here:
https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy#comments-listing

1. A polite crawler respects robots.txt.
2. A polite crawler never degrades a website’s performance.
3. A polite crawler identifies its creator with contact information.
4. A polite crawler is not a pain in the buttocks of system
   administrators.
# HTTP API
---

Crawly supports a basic HTTP API, which allows you to control the
engine behaviour.
## Starting a spider

The following command will start a given Crawly spider:

```
curl -v localhost:4001/spiders/<spider_name>/schedule
```

## Stopping a spider

The following command will stop a given Crawly spider:

```
curl -v localhost:4001/spiders/<spider_name>/stop
```

## Getting currently running spiders

```
curl -v localhost:4001/spiders
```

## Getting spider stats

```
curl -v localhost:4001/spiders/<spider_name>/scheduled-requests
curl -v localhost:4001/spiders/<spider_name>/scraped-items
```
# Installation guide
---

Crawly requires Elixir v1.7 or higher. In order to create a Crawly
project, execute the following steps:

1. Generate a new Elixir project: `mix new <project_name> --sup`
2. Add Crawly to your mix.exs file:
```elixir
def deps do
  [{:crawly, "~> 0.6.0"}]
end
```
3. Fetch Crawly: `mix deps.get`
# Crawly intro
---

Crawly is an application framework for crawling web sites and
extracting structured data which can be used for a wide range of
useful applications, like data mining, information processing or
historical archival.

## Walk-through of an example spider

In order to show you what Crawly brings to the table, we’ll walk you
through an example of a Crawly spider, using the simplest way to run it.

Here’s the code for a spider that scrapes blog posts from the Erlang
Solutions blog (https://www.erlang-solutions.com/blog.html),
following the pagination:
```elixir
defmodule Esl do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: ["https://www.erlang-solutions.com/blog.html"]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Getting new urls to follow
    urls =
      response.body
      |> Floki.find("a.more")
      |> Floki.attribute("href")
      |> Enum.uniq()

    # Convert URLs into requests
    requests =
      Enum.map(urls, fn url ->
        url
        |> build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    # Extract item from a page, e.g.
    # https://www.erlang-solutions.com/blog/introducing-telemetry.html
    title =
      response.body
      |> Floki.find("article.blog_post h1:first-child")
      |> Floki.text()

    author =
      response.body
      |> Floki.find("article.blog_post p.subheading")
      |> Floki.text(deep: false, sep: "")
      |> String.trim_leading()
      |> String.trim_trailing()

    time =
      response.body
      |> Floki.find("article.blog_post p.subheading time")
      |> Floki.text()

    url = response.request_url

    %Crawly.ParsedItem{
      :requests => requests,
      :items => [%{title: title, author: author, time: time, url: url}]
    }
  end

  def build_absolute_url(url, request_url) do
    URI.merge(request_url, url) |> to_string()
  end
end
```
Put this code into your project and run it using the Crawly REST API:
`curl -v localhost:4001/spiders/Esl/schedule`

When it finishes, you will get an ESL.jl file (JSON lines) stored on
your filesystem, containing the following information about blog posts:

```json
{"url":"https://www.erlang-solutions.com/blog/erlang-trace-files-in-wireshark.html","title":"Erlang trace files in Wireshark","time":"2018-06-07","author":"by Magnus Henoch"}
{"url":"https://www.erlang-solutions.com/blog/railway-oriented-development-with-erlang.html","title":"Railway oriented development with Erlang","time":"2018-06-13","author":"by Oleg Tarasenko"}
{"url":"https://www.erlang-solutions.com/blog/scaling-reliably-during-the-world-s-biggest-sports-events.html","title":"Scaling reliably during the World’s biggest sports events","time":"2018-06-21","author":"by Erlang Solutions"}
{"url":"https://www.erlang-solutions.com/blog/escalus-4-0-0-faster-and-more-extensive-xmpp-testing.html","title":"Escalus 4.0.0: faster and more extensive XMPP testing","time":"2018-05-22","author":"by Konrad Zemek"}
{"url":"https://www.erlang-solutions.com/blog/mongooseim-3-1-inbox-got-better-testing-got-easier.html","title":"MongooseIM 3.1 - Inbox got better, testing got easier","time":"2018-07-25","author":"by Piotr Nosek"}
....
```
## What just happened?

When you ran the curl command
`curl -v localhost:4001/spiders/Esl/schedule`, Crawly looked for a
spider named Esl, found its definition, and ran it through its crawler
engine.

The crawl started by making requests to the URLs defined in the
`start_urls` entry of the spider's `init`, and called the default
callback `parse_item`, passing the response object as an argument. In
the parse callback, we:
1. Look through all the pagination elements using a Floki selector and
   extract absolute URLs to follow. The URLs are converted into
   requests using the `Crawly.Utils.request_from_url()` function.
2. Extract item(s) (items are defined in separate modules, and this
   part will be covered later on).
3. Return a `Crawly.ParsedItem` structure which contains the new
   requests to follow and the items extracted from the given page. All
   following requests are processed by the same `parse_item` function.

Crawly is fully asynchronous. Once the requests are scheduled, they
are picked up by separate workers and executed in parallel. This also
means that other requests can keep going even if some request fails or
an error happens while handling it.
While this enables you to do very fast crawls (sending multiple
concurrent requests at the same time, in a fault-tolerant way), Crawly
also gives you control over the politeness of the crawl through a few
settings. You can do things like setting a download delay between
requests, limiting the number of concurrent requests per domain, or
respecting robots.txt rules, for example as sketched below.
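A small sketch of what such a politeness setting could look like in the
application config; `concurrent_requests_per_domain` is the setting named
in this documentation, while the overall config layout is an assumption:

```elixir
# config/config.exs - assumed layout; only the setting name comes from the docs.
use Mix.Config

config :crawly,
  # limit the number of parallel requests made against a single domain
  concurrent_requests_per_domain: 4
```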
This is using JSON export to generate the JSON lines file, but you can
easily extend it to change the export format (XML or CSV, for
example).
## What else?

You’ve seen how to extract and store items from a website using
Crawly, but this is just a basic example. Crawly provides a lot of
powerful features for making scraping easy and efficient, such as:

1. Flexible request spoofing (for example, user-agent rotation; cookie
   management is planned).
2. Item validation, using a pipelines approach.
3. Filtering of already seen requests and items.
4. Filtering out all requests which are targeted at other domains.
5. Robots.txt enforcement.
6. Concurrency control.
7. HTTP API for controlling crawlers.
8. Interactive console, which allows you to create and debug spiders more easily.