
Commit

Merge 2decb2c into e04dbe6
oltarasenko committed Dec 10, 2019
2 parents e04dbe6 + 2decb2c commit b54f578
Showing 11 changed files with 1,084 additions and 2 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -1,4 +1,4 @@
# Crawly into
# Crawly intro
---

Crawly is an application framework for crawling web sites and
Binary file added documentation/assets/logo.png
154 changes: 154 additions & 0 deletions documentation/basic_concepts.md
@@ -0,0 +1,154 @@
# Basic concepts
---

## Spiders

Spiders are modules which define how a certain site (or a group of
sites) will be scraped, including how to perform the crawl
(i.e. follow links) and how to extract structured data from their
pages (i.e. scraping items). In other words, Spiders are the place
where you define the custom behaviour for crawling and parsing pages
for a particular site.

For spiders, the scraping cycle goes through something like this:

You start by generating the initial requests to crawl the first URLs;
a callback function is then called with the response downloaded from
each of those requests.

In the callback function, you parse the response (web page) and return
a `%Crawly.ParsedItem{}` struct. This struct should contain new
requests to follow and items to be stored.

In the callback functions, you parse the page contents, typically using
Floki (but you can also use any other library you prefer) and generate
items with the parsed data.

Spiders are executed in the context of Crawly.Worker processes, and
you can control the number of concurrent workers via the
`concurrent_requests_per_domain` setting.

All requests are processed sequentially and are pre-processed by
middlewares.

All items are processed sequentially by item pipelines.

### Behaviour functions

In order to make a working web crawler, all the behaviour callbacks need
to be implemented.

`init()` - a part of the Crawly.Spider behaviour. This function should
return a keyword list containing a `start_urls` entry, a list of URLs which
defines the starting requests made by Crawly.

`base_url()` - defines the base URL of the given spider. This function
is used to filter out all requests leading outside of the crawled
website.

`parse_item(response)` - a function which defines how a given response
is translated into the `Crawly.ParsedItem` structure. At a high
level, this function defines the extraction rules for both items and requests.
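
For illustration, a minimal spider implementing these callbacks might look like the sketch below (the domain, selectors and item fields are hypothetical placeholders):

```elixir
defmodule MySpider do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/news.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Follow pagination links found on the page
    # (placeholder selector; assumes the hrefs are absolute URLs).
    requests =
      response.body
      |> Floki.find("a.next")
      |> Floki.attribute("href")
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    # Extract one item from the page (placeholder fields).
    item = %{
      title: response.body |> Floki.find("h1") |> Floki.text(),
      url: response.request_url
    }

    %Crawly.ParsedItem{requests: requests, items: [item]}
  end
end
```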

## Requests and Responses

Crawly uses Request and Response objects for crawling web sites.

Typically, Request objects are generated in the spiders and passed
across the system until they reach the Crawly.Worker process, which
executes the request and returns a Response object that travels back
to the spider that issued the request. Request objects are modified by
the selected middlewares before hitting the worker.

The request is defined as the following structure:
``` elixir
@type t :: %Crawly.Request{
  url: binary(),
  headers: [header()],
  prev_response: %{},
  options: [option()]
}

@type header() :: {key(), value()}
```

Where:
1. url - the URL of the request
2. headers - the HTTP headers which are going to be used with the
   given request
3. options - request options (for example, whether to follow redirects).
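
For illustration, a request for a given URL can be built with the `Crawly.Utils.request_from_url()` helper (used in the introduction example) and then adjusted; the header below is only a placeholder:

```elixir
# Build a request from a URL; middlewares may later adjust headers and options.
request = Crawly.Utils.request_from_url("https://www.erlang-solutions.com/blog.html")

# Add an HTTP header to the request (placeholder header value).
request = %Crawly.Request{request | headers: [{"Accept", "text/html"} | request.headers]}
```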

Crawly uses the HTTPoison library to perform requests, but we have
plans to extend the support with other pluggable backends, like
Selenium and others.

Responses are defined in the same way as HTTPoison responses. See more
details here: https://hexdocs.pm/httpoison/HTTPoison.Response.html#content

## Parsed Item

ParsedItem is a structure which is filled by the `parse_item/1`
callback of the Spider. The structure is defined in the following way:

```elixir
@type item() :: %{}
@type t :: %__MODULE__{
  items: [item()],
  requests: [Crawly.Request.t()]
}

```
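
For example, a `parse_item/1` implementation that found two more pages to visit and extracted one item might return something like this (a sketch with placeholder values):

```elixir
%Crawly.ParsedItem{
  requests: [
    Crawly.Utils.request_from_url("https://example.com/blog/page/2"),
    Crawly.Utils.request_from_url("https://example.com/blog/page/3")
  ],
  items: [%{title: "Example post", url: "https://example.com/blog/example-post"}]
}
```
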
The parsed item is processed by the Crawly.Worker process, which
sends all requests to the `Crawly.RequestsStorage` process,
responsible for pre-processing requests and storing them for
future execution. All items are sent to the `Crawly.DataStorage`
process, which is responsible for pre-processing items and storing them
on disk.

For now, only one storage backend is supported (writing to disk), but
in the future Crawly will also support backends such as Amazon S3, SQL databases and others.

## Request Middlewares

Crawly uses the concept of pipelines for processing the elements
sent through the system. In this section we will cover the
topic of request middlewares - a powerful tool which allows you to modify
a request before it is sent to the target website. In most cases
spider developers will want to modify request headers, which makes
requests look more natural to the crawled websites.

At this point Crawly includes the following request middlewares:
1. `Crawly.Middlewares.DomainFilter` - this middleware disables
   scheduling for all requests leading outside of the crawled
   site. The middleware uses `base_url()` defined in the
   `Crawly.Spider` behaviour in order to do its job.
2. `Crawly.Middlewares.RobotsTxt` - this middleware ensures that
   Crawly respects the robots.txt defined by the target website.
3. `Crawly.Middlewares.UniqueRequest` - this middleware ensures that
   Crawly does not schedule the same URL (request) multiple times.
4. `Crawly.Middlewares.UserAgent` - this middleware is used to set the
   User-Agent HTTP header. It allows rotating user agents if they are
   defined as a list.

The list of request middlewares used with a given project is defined
in the project settings.
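
Such a configuration might look roughly like this in `config/config.exs` (a sketch; the exact option name used by your Crawly version may differ):

```elixir
use Mix.Config

config :crawly,
  # The :middlewares key is assumed here; check the settings reference for the exact name.
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    Crawly.Middlewares.UserAgent
  ]
```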

## Item pipelines

Crawly uses the concept of pipelines for processing the elements
sent through the system. In this section we will cover the
topic of item pipelines - a tool used to pre-process
items before storing them in the storage.

At this point Crawly includes the following item pipelines:
1. `Crawly.Pipelines.Validate` - validates that a given item has all
   the required fields. All items which don't have all required fields
   are dropped.
2. `Crawly.Pipelines.DuplicatesFilter` - filters out items which are
   already stored in the system.
3. `Crawly.Pipelines.JSONEncoder` - converts items into JSON format.
4. `Crawly.Pipelines.CSVEncoder` - converts items into CSV format.
5. `Crawly.Pipelines.WriteToFile` - writes information to a given file.

The list of item pipelines used with a given project is defined in the
project settings.
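
As with the middlewares, the pipelines can be listed in the configuration (a sketch; the exact option name may differ in your Crawly version):

```elixir
use Mix.Config

config :crawly,
  # The :pipelines key is assumed here; check the settings reference for the exact name.
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.JSONEncoder
  ]
```
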
12 changes: 12 additions & 0 deletions documentation/ethical_aspects.md
@@ -0,0 +1,12 @@
# Ethical aspects of crawling
---

It's important to be polite when doing web crawling. You should
avoid cases where your spiders harm the scraped
websites. As mentioned here: https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy#comments-listing

1. A polite crawler respects robots.txt.
2. A polite crawler never degrades a website’s performance.
3. A polite crawler identifies its creator with contact information.
4. A polite crawler is not a pain in the buttocks of system
administrators.
34 changes: 34 additions & 0 deletions documentation/http_api.md
@@ -0,0 +1,34 @@
# HTTP API
---

Crawly supports a basic HTTP API, which allows you to control the
engine's behaviour.

## Starting a spider

The following command will start a given Crawly spider:

```
curl -v localhost:4001/spiders/<spider_name>/schedule
```

## Stopping a spider

The following command will stop a given Crawly spider:

```
curl -v localhost:4001/spiders/<spider_name>/stop
```

## Getting currently running spiders

```
curl -v localhost:4001/spiders
```

## Getting spider stats

```
curl -v localhost:4001/spiders/<spider_name>/scheduled-requests
curl -v localhost:4001/spiders/<spider_name>/scraped-items
```
14 changes: 14 additions & 0 deletions documentation/installation_guide.md
@@ -0,0 +1,14 @@
# Installation guide
---

Crawly requires Elixir v1.7 or higher. To create a Crawly
project, execute the following steps:

1. Generate a new Elixir project: `mix new <project_name> --sup`
2. Add Crawly to your mix.exs file
```elixir
def deps do
[{:crawly, "~> 0.6.0"}]
end
```
3. Fetch crawly: `mix deps.get`
152 changes: 152 additions & 0 deletions documentation/introduction.md
@@ -0,0 +1,152 @@
# Crawly intro
---

Crawly is an application framework for crawling web sites and
extracting structured data which can be used for a wide range of
useful applications, like data mining, information processing or
historical archival.

## Walk-through of an example spider

In order to show you what Crawly brings to the table, we’ll walk you
through an example of a Crawly spider using the simplest way to run a spider.

Here’s the code for a spider that scrapes blog posts from the Erlang
Solutions blog: https://www.erlang-solutions.com/blog.html,
following the pagination:

```elixir
defmodule Esl do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: ["https://www.erlang-solutions.com/blog.html"]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Getting new urls to follow
    urls =
      response.body
      |> Floki.find("a.more")
      |> Floki.attribute("href")
      |> Enum.uniq()

    # Convert URLs into requests
    requests =
      Enum.map(urls, fn url ->
        url
        |> build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    # Extract item from a page, e.g.
    # https://www.erlang-solutions.com/blog/introducing-telemetry.html
    title =
      response.body
      |> Floki.find("article.blog_post h1:first-child")
      |> Floki.text()

    author =
      response.body
      |> Floki.find("article.blog_post p.subheading")
      |> Floki.text(deep: false, sep: "")
      |> String.trim_leading()
      |> String.trim_trailing()

    time =
      response.body
      |> Floki.find("article.blog_post p.subheading time")
      |> Floki.text()

    url = response.request_url

    %Crawly.ParsedItem{
      requests: requests,
      items: [%{title: title, author: author, time: time, url: url}]
    }
  end

  def build_absolute_url(url, request_url) do
    URI.merge(request_url, url) |> to_string()
  end
end
```

Put this code into your project and run it using the Crawly REST API:
`curl -v localhost:4001/spiders/Esl/schedule`

When it finishes, you will get the ESL.jl file stored on your
filesystem, containing the following information about the blog posts:

```json
{"url":"https://www.erlang-solutions.com/blog/erlang-trace-files-in-wireshark.html","title":"Erlang trace files in Wireshark","time":"2018-06-07","author":"by Magnus Henoch"}
{"url":"https://www.erlang-solutions.com/blog/railway-oriented-development-with-erlang.html","title":"Railway oriented development with Erlang","time":"2018-06-13","author":"by Oleg Tarasenko"}
{"url":"https://www.erlang-solutions.com/blog/scaling-reliably-during-the-world-s-biggest-sports-events.html","title":"Scaling reliably during the World’s biggest sports events","time":"2018-06-21","author":"by Erlang Solutions"}
{"url":"https://www.erlang-solutions.com/blog/escalus-4-0-0-faster-and-more-extensive-xmpp-testing.html","title":"Escalus 4.0.0: faster and more extensive XMPP testing","time":"2018-05-22","author":"by Konrad Zemek"}
{"url":"https://www.erlang-solutions.com/blog/mongooseim-3-1-inbox-got-better-testing-got-easier.html","title":"MongooseIM 3.1 - Inbox got better, testing got easier","time":"2018-07-25","author":"by Piotr Nosek"}
....
```

## What just happened?

When you ran the curl command:
```curl -v localhost:4001/spiders/Esl/schedule```

Crawly scheduled the Esl spider: it looked up the spider definition
and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the
start_urls entry of the spider's `init/0`, and called the default
callback function `parse_item`, passing the response object as an
argument. In the parse callback, we:
1. Look through all the pagination elements using a Floki selector and
   extract absolute URLs to follow. URLs are converted into requests
   using the `Crawly.Utils.request_from_url()` function.
2. Extract item(s) (items are defined in separate modules; this part
   will be covered later on).
3. Return a Crawly.ParsedItem structure containing the new requests
   to follow and the items extracted from the given page. All
   following requests are processed by the same `parse_item` function.

Crawly is fully asynchronous. Once the requests are scheduled, they
are picked up by separate workers and are executed in parallel. This
also means that other requests can keep going even if some request
fails or an error happens while handling it.


While this enables you to do very fast crawls (sending multiple
concurrent requests at the same time, in a fault-tolerant way), Crawly
also gives you control over the politeness of the crawl through a few
settings. You can do things like setting a download delay between
requests, limiting the number of concurrent requests per domain, or
respecting robots.txt rules.
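
For example, concurrency per domain can be limited through the project configuration (a minimal sketch; the setting name comes from the basic concepts document, the rest should be checked against the settings reference):

```elixir
use Mix.Config

config :crawly,
  # Limit the number of parallel requests per domain
  # (setting name taken from the basic concepts document).
  concurrent_requests_per_domain: 2

# robots.txt handling is provided by the Crawly.Middlewares.RobotsTxt middleware,
# enabled through the request middlewares list described in the basic concepts document.
```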

> This is using JSON export to generate the JSON lines file, but you can
> easily extend it to change the export format (XML or CSV, for example).

## What else?

You’ve seen how to extract and store items from a website using
Crawly, but this is just a basic example. Crawly provides a lot of
powerful features for making scraping easy and efficient, such as:

1. Flexible request spoofing (for example, user-agent rotation; cookie
   management is planned).
2. Item validation, using a pipelines approach.
3. Filtering already seen requests and items.
4. Filtering out all requests targeted at other domains.
5. Robots.txt enforcement.
6. Concurrency control.
7. HTTP API for controlling crawlers.
8. Interactive console, which allows you to create and debug spiders more easily.
